Logistic Regression is a simple yet widely used binary classification algorithm in the field of machine learning. It is easy to implement, and it works well in many situations. As with other machine learning algorithms, some knowledge of statistics, linear algebra and calculus is needed to understand it.
Despite its name, it is not a regression algorithm, where you would want to predict a continuous outcome. Instead, logistic regression is the go-to technique for binary classification. It gives you a discrete binary result: either 0 or 1. To put it in simpler words, the result is either one thing or another. In this article, we are going to build a breast cancer prediction model using the logistic regression algorithm in Python. To understand this implementation properly, I recommend reading my previous article on Logistic Regression.
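How a continuous score becomes a discrete 0/1 answer can be sketched in a few lines. This is only an illustration: the function names sigmoid and classify and the 0.5 threshold are assumptions made for this example, not part of scikit-learn.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Turn the probability into a discrete label: 1 or 0."""
    return 1 if sigmoid(z) >= threshold else 0


# Small scores only; very large negative z would overflow math.exp.
for z in (-3.0, 0.0, 3.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

The model itself learns how to compute the score z from the features; the thresholding step is what makes the final answer binary.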
We will use the scikit-learn (sklearn) library to build the classifier. It is an amazing library that lets you implement a machine learning model in a few lines of code (often just 5 to 10). It has made implementing machine learning much easier. It contains pre-made models, and the only thing we have to do is use them. In addition, it also contains some datasets for practice. For this tutorial we will use the breast cancer dataset included in the scikit-learn library (Note: you can also use a dataset downloaded from an external source. link-https://goo.gl/U2Uwz2). The dataset contains,
|Samples per class|212 (M), 357 (B)|
where M: malignant, B: benign.
Some information about the dataset:
(assuming cancer = load_breast_cancer())
cancer.data: the feature vectors. The features include radius, texture, perimeter, etc., measured on the cell nuclei in an image of a breast mass.
cancer.target: the classification label of each sample. It is an array of 0's and 1's, where 0 indicates malignant and 1 indicates benign.
cancer.target_names: the meaning of the labels,
i.e. cancer.target_names=['malignant' 'benign']
To study more about the data you may visit here
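A quick way to check the attributes described above is to print them. The snippet below is a small sketch; the variable name class_counts is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.data.shape)          # (number of samples, number of features)
print(cancer.target_names)        # meaning of label 0 and label 1
print(cancer.feature_names[:3])   # first few feature names

# Count how many samples carry each label (index 0, then index 1).
class_counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, class_counts):
    print(name, count)
```

The counts should match the samples-per-class figures given above.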
It's time to implement logistic regression in code. Let's do it...
Step1: Import the dataset included in sklearn.datasets.
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
Step2: In this step we divide the dataset into two subsets, a train subset and a test subset. This is called a hold-out split (a simple form of validation, not k-fold cross-validation). We use the train_test_split() function. Here 75% of the data goes to the train subset and the remaining 25% to the test subset (train_size=0.75; any proportion works, such as 0.80 for 80% or 0.90 for 90%, and test_size= can be used instead to set the size of the test subset). random_state=0 seeds the random shuffling that is applied before the split. Without random_state, each run may give you a different set of train and test data points, which makes problems hard to debug. It doesn't matter whether random_state is 0, 1 or any other integer. What matters is that it is set to the same value every time if you want reproducible results over multiple runs of the code.
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(cancer.data,cancer.target, train_size=.75, random_state=0)
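The effect of random_state can be checked directly. The sketch below repeats the split with the same seed and with a different one; the helper split() and the variable names ending in _again and _other are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()


def split(seed):
    """Hold-out split with 75% train data and the given random seed."""
    return train_test_split(
        cancer.data, cancer.target, train_size=0.75, random_state=seed
    )


X_train, X_test, Y_train, Y_test = split(0)
print(X_train.shape, X_test.shape)  # 426 train rows and 143 test rows

# Same seed: the test subset is identical on every run.
_, X_test_again, _, _ = split(0)
same_seed_matches = np.array_equal(X_test, X_test_again)

# Different seed: a different set of rows ends up in the test subset.
_, X_test_other, _, _ = split(1)
different_seed_matches = np.array_equal(X_test, X_test_other)

print(same_seed_matches, different_seed_matches)
```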
Step3: In this step we import the LogisticRegression classifier from sklearn.linear_model. To use the classifier, we need to create an instance of it (logisticRegr in our case). The instance's fit() method then trains the model on the train subset.
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, Y_train)
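Depending on your scikit-learn version, fitting with the default settings on these unscaled features may print a ConvergenceWarning. One common way around this, shown here as an optional variation rather than as part of the tutorial's own code, is to standardize the features first. The names scaled_model and scaled_score are assumptions for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, train_size=0.75, random_state=0
)

# The pipeline learns the scaling on the train subset only,
# then applies the same scaling whenever it predicts.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, Y_train)

scaled_score = scaled_model.score(X_test, Y_test)
print(scaled_score)
```

Raising max_iter on LogisticRegression is another option if you would rather keep the raw features.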
Step4: In this step we test the accuracy of the model. The built-in score() method returns the accuracy of the classifier on the data passed to it.
predictions = logisticRegr.predict(X_test)
score = logisticRegr.score(X_test, Y_test)
print(score)
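To see what score() actually measures, you can compute the accuracy by hand as the fraction of test samples whose predicted label matches the true label. This is a sketch; max_iter=10000 is an assumption added only to avoid convergence warnings, and manual_accuracy is a name made up for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, train_size=0.75, random_state=0
)

# max_iter is raised here (an assumption) so the solver converges quietly.
logisticRegr = LogisticRegression(max_iter=10000)
logisticRegr.fit(X_train, Y_train)

predictions = logisticRegr.predict(X_test)
score = logisticRegr.score(X_test, Y_test)

# Accuracy by hand: share of predictions that equal the true labels.
manual_accuracy = np.mean(predictions == Y_test)

print(score, manual_accuracy, accuracy_score(Y_test, predictions))
```

All three printed numbers should be the same.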
Step5: This step is optional, as we already know the accuracy of the model. Here we print the classification report for the classifier. The classification report gives the values of several evaluation metrics (precision, recall, f1-score, accuracy, etc.) for an ML model. For more on evaluation metrics visit here.
from sklearn.metrics import classification_report
print(classification_report(Y_test, predictions))
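The precision and recall values in the report can also be derived from the confusion matrix. The sketch below does this for label 1 (benign), treating it as the positive class; max_iter=10000 and the names precision_benign and recall_benign are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, train_size=0.75, random_state=0
)

# max_iter is raised here (an assumption) so the solver converges quietly.
logisticRegr = LogisticRegression(max_iter=10000)
logisticRegr.fit(X_train, Y_train)
predictions = logisticRegr.predict(X_test)

# Rows are true labels, columns are predicted labels (order: 0, 1).
cm = confusion_matrix(Y_test, predictions)
tn, fp, fn, tp = cm.ravel()  # label 1 (benign) is the positive class

precision_benign = tp / (tp + fp)  # of all predicted benign, how many are benign
recall_benign = tp / (tp + fn)     # of all truly benign, how many were found

print(cm)
print(precision_benign, recall_benign)
```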
Step6: In this step we test our model by supplying a test sample. We use one of the rows of the test subset as the test sample. The built-in method logisticRegr.predict() is used to make the prediction.
sample = X_test[0].reshape(1, -1)  # one test sample as a 2D array; index 0 is an arbitrary choice
predict = logisticRegr.predict(sample)
preds = cancer.target_names[predict]  # map the output label to the meaning of the label
print(preds)
The classification report printed by this code is,
              precision    recall  f1-score   support

           0       0.91      0.98      0.95        53
           1       0.99      0.94      0.97        90

    accuracy                           0.96       143
   macro avg       0.95      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143
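For a single test sample, as in Step6, it can also be useful to look at how confident the model is. The sketch below uses predict_proba() next to predict(); the sample index 0 and max_iter=10000 are arbitrary assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target, train_size=0.75, random_state=0
)

# max_iter is raised here (an assumption) so the solver converges quietly.
logisticRegr = LogisticRegression(max_iter=10000)
logisticRegr.fit(X_train, Y_train)

# predict() expects a 2D array, so one sample is reshaped to (1, n_features).
sample = X_test[0].reshape(1, -1)

label = logisticRegr.predict(sample)[0]
probabilities = logisticRegr.predict_proba(sample)[0]  # [P(label 0), P(label 1)]

print(cancer.target_names[label])
print(probabilities)
```

The predicted label is the one with the higher probability.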