# Logistic Regression Algorithm from Scratch in Python (Using NumPy Only)

Logistic regression is a simple yet widely used binary classification algorithm in machine learning. It is easy to implement, and it works well in many situations. As with other machine learning algorithms, some knowledge of statistics, linear algebra, and calculus is needed to understand it.

In this article, we will see how to implement the logistic regression algorithm from scratch in Python (using NumPy only). Coding logistic regression from scratch is not very difficult, but it is a bit tricky. The full code is given below.

```python
import numpy as np
import pandas as pd

dataset = pd.read_csv('E:/tutorials/logisticreg_data.csv')  # use the location of your dataset
print(dataset)  # prints the dataset

X = dataset.iloc[:, :-1].values  # slice the dataset up to (but not including) the last column as the feature matrix X
Y = dataset.iloc[:, 2].values    # store the last column in Y as the class label. NOTE: I use 2 because my dataset has only 3 columns (0, 1, 2); be careful in your case
Y = Y.reshape(-1, 1)             # reshape Y as a column vector; conversely, reshape(1, -1) reshapes an array as a row vector

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_loss(y_pred, target):
    return -np.mean(target * np.log(y_pred) + (1 - target) * np.log(1 - y_pred))

def predict(X_test):
    preds = []
    for i in sigmoid(np.dot(X_test, W) + b):
        if i > 0.5:
            preds.append(1)
        else:
            preds.append(0)
    return preds

print(X.shape[1])  # shape[0] gives the number of rows and shape[1] gives the number of columns (i.e. features)

np.random.seed(0)
W = np.random.uniform(0, 1, size=(X.shape[1], 1))
b = 0.5

for i in range(100000):
    Z = np.dot(X, W) + b
    Y_output = sigmoid(Z)
    E = cross_entropy_loss(Y_output, Y)
    print("------------->", E)
    grad = Y_output - Y
    grad_weight = np.dot(X.T, grad) / X.shape[0]
    grad_bias = np.average(grad)
    W = W - .01 * grad_weight
    b = b - .01 * grad_bias

Y_test = predict(X_test=[1, 1])
print(Y_test)
```

Now let's break the code down and understand it.

```python
import numpy as np
import pandas as pd

dataset = pd.read_csv('E:/tutorials/logisticreg_data.csv')  # use the location of your dataset
print(dataset)  # prints the dataset

X = dataset.iloc[:, :-1].values  # slice the dataset up to (but not including) the last column as the feature matrix X
Y = dataset.iloc[:, 2].values    # store the last column in Y as the class label. NOTE: I use 2 because my dataset has only 3 columns (0, 1, 2); be careful in your case
Y = Y.reshape(-1, 1)             # reshape Y as a column vector; conversely, reshape(1, -1) reshapes an array as a row vector
```

First of all, we import the required libraries. Pandas is used only for importing the dataset.
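To see what the slicing does without the CSV file, here is the same `iloc` pattern on a tiny in-memory DataFrame (the values and column names are made up for illustration; they are not the author's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: two feature columns and one label column.
df = pd.DataFrame({'x1': [0.5, 1.0, 2.0],
                   'x2': [1.5, 0.2, 2.2],
                   'label': [0, 0, 1]})

X = df.iloc[:, :-1].values   # all columns except the last -> feature matrix
Y = df.iloc[:, 2].values     # column index 2 (the last of 3 columns) -> labels
Y = Y.reshape(-1, 1)         # column vector

print(X.shape)  # (3, 2)
print(Y.shape)  # (3, 1)
```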

```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_loss(y_pred, target):
    return -np.mean(target * np.log(y_pred) + (1 - target) * np.log(1 - y_pred))
```

During training we need two functions: the logistic (sigmoid) function and the loss function (binary cross-entropy loss). The sigmoid function takes the sum of the weighted input and the bias (i.e. z = XW + b). It takes a vector of the same dimension as the output vector Y (say m*1) as an argument and returns a vector of the same dimension. The values returned by the sigmoid lie in (0, 1), so they can be interpreted as probabilities. For more on the sigmoid function you can visit here. The cross-entropy loss function measures the divergence of the predicted probability from the true value or actual label. It takes the target label (the true label) and y_pred (the output of the sigmoid function) as arguments; both are m*1 vectors. As it is an average of all the individual losses, it returns a single float value. It is given as,

\[ \displaystyle L=-\frac{1}{m}\sum\limits_{i=1}^{m}\left\{ target*\log(y\_pred)+(1-target)*\log(1-y\_pred) \right\} \]

here, m = total number of examples in the dataset and log is the natural log.
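As a quick sanity check, the two helper functions can be exercised on a small vector (the input values here are arbitrary, chosen only to show the shapes and ranges involved):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_loss(y_pred, target):
    return -np.mean(target * np.log(y_pred) + (1 - target) * np.log(1 - y_pred))

z = np.array([[-2.0], [0.0], [2.0]])
p = sigmoid(z)                      # shape preserved, values approx. 0.119, 0.5, 0.881
target = np.array([[0.0], [0.0], [1.0]])
loss = cross_entropy_loss(p, target)
print(p.shape)      # (3, 1): same dimension in, same dimension out
print(round(loss, 4))
```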

```python
def predict(X_test):
    preds = []
    for i in sigmoid(np.dot(X_test, W) + b):
        if i > 0.5:
            preds.append(1)
        else:
            preds.append(0)
    return preds
```

This function is essentially a mapping function: it maps the probabilities computed for the test data to their respective classes. It is used during prediction. The technique behind the mapping is simple. We set a threshold and then classify each data point on the basis of that threshold: if the sigmoid value (the probability) is greater than the threshold, we classify the point as one class; otherwise, as the other. In our case the threshold is 0.5.

The point to be noted is that the sigmoid is calculated using the optimized values of the weights W and bias b, i.e. only after training.
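The same thresholding can also be written without a Python loop, using NumPy array operations. A minimal sketch (the parameter values here are hypothetical, purely to make the function callable; in the article W and b come from training):

```python
import numpy as np

def predict_vectorized(X_test, W, b, threshold=0.5):
    """Same mapping as the loop-based predict(), done with array operations."""
    probs = 1 / (1 + np.exp(-(np.dot(X_test, W) + b)))   # sigmoid
    return (probs > threshold).astype(int)               # boolean mask -> 0/1 labels

# Hypothetical parameters, for illustration only.
W = np.array([[2.0], [1.0]])
b = -1.5
print(predict_vectorized(np.array([[1.0, 1.0], [0.0, 0.0]]), W, b).ravel())
```

The comparison `probs > threshold` produces a boolean array, and `.astype(int)` converts it to 0/1 labels in one step.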

```python
np.random.seed(0)
W = np.random.uniform(0, 1, size=(X.shape[1], 1))
b = 0.5
```

These lines of code initialize the weight matrix W and the bias b with some initial values. The bias is initialized to 0.5, while the weight matrix is initialized with random values between 0 and 1. Here np.random.seed(0) ensures that the same random numbers are assigned to the weight matrix every time we run the code. You can use any number within seed().

The size of the weight matrix should be (n*1), where n = total number of features (attributes or variables). Here X.shape[1] gives the number of columns in X, which equals the total number of features.

```python
for i in range(100000):
    Y_input = np.dot(X, W) + b
    Y_output = sigmoid(Y_input)
    E = cross_entropy_loss(Y_output, Y)
    print("------------->", E)
    grad = Y_output - Y
    grad_weight = np.dot(X.T, grad) / X.shape[0]
    grad_bias = np.average(grad)
    W = W - .01 * grad_weight
    b = b - .01 * grad_bias
```

This is the training or learning part. Training means finding the optimum values of all the parameters, i.e. the values that give the least loss (this is done by minimizing the loss). In our case, the parameters are the weight matrix W and the bias b. For optimization, we use the gradient descent method. To read about gradient descent and its variants, you can visit here. However, I will also discuss all the necessary concepts of gradient descent, which will be more than sufficient.

Let's understand it...

The main goal of gradient descent is to minimize the cost (or loss) function using some basic concepts of calculus.
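As a minimal illustration of the idea, separate from the logistic model, gradient descent on the one-variable function f(w) = (w - 3)^2 repeatedly steps against the gradient and walks w toward the minimizer w = 3:

```python
# Minimize f(w) = (w - 3)^2; its gradient is f'(w) = 2 * (w - 3).
w = 0.0
eta = 0.1  # learning rate
for _ in range(100):
    grad = 2 * (w - 3)   # gradient at the current w
    w = w - eta * grad   # step in the direction of decreasing f
print(round(w, 4))  # 3.0
```

Exactly the same update rule is applied below, just with a vector of weights and the cross-entropy loss in place of f.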

\[ \displaystyle \begin{array}{l}\text{Let } X \text{ be the feature matrix (X\_train) and } Y \text{ be the corresponding labels:}\\ X=\left( \begin{array}{ccc} x_{11} & \ldots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{array} \right)_{m*n},\quad \text{target}=Y=\left( \begin{array}{c} y_{1} \\ \vdots \\ y_{m} \end{array} \right)_{m*1}\\ \text{Let } W \text{ be the weight matrix, given as}\\ W=\left( \begin{array}{c} w_{1} \\ \vdots \\ w_{n} \end{array} \right)_{n*1}\\ \text{and } b \text{ be the bias. Then the sigmoid is calculated as}\\ y\_pred\ (\text{or } y\_output)=s(z)=\frac{1}{1+e^{-z}},\quad z=X*W+b \end{array} \]

\[ \displaystyle \begin{array}{l}\text{Now calculate the loss.}\\ \text{Consider the binary cross-entropy loss function } L \text{, which is given as}\\ L=-\frac{1}{m}\sum\limits_{i=1}^{m}\left\{ target*\log(y\_pred)+(1-target)*\log(1-y\_pred) \right\}\\ \text{Now the training of the model starts.}\\ \text{Training the model means finding the optimum values of } W \text{ and } b \text{ for which the loss } L \text{ is minimum.}\\ \text{To obtain the minimum, we use the gradient descent technique.}\\ \text{The steps involved in the gradient descent algorithm are:}\end{array} \]

I have directly used the equations for calculation of gradient weight and bias. To go through the derivations visit here.

\[ \displaystyle \begin{array}{l}\text{Step 1: Calculate the gradient of } L \text{ w.r.t. the weights } w \text{ and bias } b \text{ as}\\ \frac{\partial L}{\partial w_{j}}=\frac{1}{m}\sum\limits_{i=1}^{m}(y\_output^{i}-Y^{i})\,x_{j}^{i},\quad j=1,2,3,\ldots,n.\\ \text{While implementing this in code we use}\\ \frac{\partial L}{\partial W}=\frac{1}{m}\left\{ X^{T}*(y\_output-Y) \right\}\\ \text{where}\\ \frac{\partial L}{\partial W}=\left( \begin{array}{c} \frac{\partial L}{\partial w_{1}} \\ \vdots \\ \frac{\partial L}{\partial w_{n}} \end{array} \right)_{n*1}\quad\text{and}\quad (y\_output-Y)=\left( \begin{array}{c} y\_output^{1}-Y^{1} \\ \vdots \\ y\_output^{m}-Y^{m} \end{array} \right)_{m*1}\end{array} \]

\[ \displaystyle \begin{array}{l}\text{And the gradient for the bias is calculated as}\\ \frac{\partial L}{\partial b}=\frac{1}{m}\sum\limits_{i=1}^{m}(y\_output^{i}-Y^{i})\\ \text{Step 2: Update the previous weights and bias as}\\ W=W-\eta\,\frac{\partial L}{\partial W}\\ b=b-\eta\,\frac{\partial L}{\partial b},\quad\text{where }\eta\text{ is the learning rate.}\\ \text{Iterate steps 1 and 2 until the loss is significantly reduced.}\end{array} \]
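The vectorized gradient from Step 1 can be checked numerically against the per-component sums on a toy batch (all the values below are made up for illustration; y_output would normally come from the sigmoid):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # m = 3 examples, n = 2 features
Y = np.array([[0.0], [1.0], [1.0]])
y_output = np.array([[0.2], [0.7], [0.9]])           # pretend sigmoid outputs

grad = y_output - Y
m = X.shape[0]

# Vectorized form: (1/m) * X^T (y_output - Y)
grad_W = np.dot(X.T, grad) / m

# Per-component form: dL/dw_j = (1/m) * sum_i (y_output^i - Y^i) * x_j^i
grad_w0 = np.sum(grad[:, 0] * X[:, 0]) / m
grad_w1 = np.sum(grad[:, 0] * X[:, 1]) / m

print(np.allclose(grad_W.ravel(), [grad_w0, grad_w1]))  # True
print(round(float(np.average(grad)), 4))                # bias gradient: -0.0667
```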

I have performed 100,000 iterations; you can do as many as you like. More iterations generally reduce the training loss further. You can also use a while loop and terminate it when the desired accuracy (or a sufficiently small loss) is reached.
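A sketch of that while-loop idea, stopping when the loss improvement between iterations drops below a tolerance (the function name, the `tol` criterion, and the toy data are all illustrative choices, not part of the original code):

```python
import numpy as np

def train_until_converged(X, Y, lr=0.01, tol=1e-6, max_iters=100_000):
    """Gradient descent that stops when the loss barely improves anymore."""
    rng = np.random.default_rng(0)
    W = rng.uniform(0, 1, size=(X.shape[1], 1))
    b = 0.5
    prev_loss = np.inf
    for _ in range(max_iters):
        y_out = 1 / (1 + np.exp(-(np.dot(X, W) + b)))
        loss = -np.mean(Y * np.log(y_out) + (1 - Y) * np.log(1 - y_out))
        if prev_loss - loss < tol:        # converged: loss change below tolerance
            break
        prev_loss = loss
        grad = y_out - Y
        W -= lr * np.dot(X.T, grad) / X.shape[0]
        b -= lr * np.average(grad)
    return W, b

# Toy linearly separable data (logical AND), for illustration only.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [0.0], [0.0], [1.0]])
W, b = train_until_converged(X, Y)
preds = (1 / (1 + np.exp(-(np.dot(X, W) + b))) > 0.5).astype(int).ravel()
print(preds)
```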

Hurray!!! We have successfully built a logistic regression model from scratch using NumPy only.

**Let's test it…**

Before testing the model, let's look at our training data.

It contains 2 features (x1 and x2) and 2 classes (class 0 and class 1).

Our test point is (1, 1). According to the scatter plot, the test point belongs to class 1. Now let's use the model for the same prediction and see what happens:

```python
Y_test = predict(X_test=[1, 1])
print(Y_test)
```

Output (only showing the loss and the prediction):

```
-------------> 6.899185264502869
-------------> 6.344155284348038
-------------> 5.790536874377807
-------------> 5.238669851040518
-------------> 4.689019237232466
.......
-------------> 0.006062682710021059
-------------> 0.0060626235992693915
-------------> 0.006062564489688085
-------------> 0.00606250538127717
-------------> 0.00606244627403656
-------------> 0.006062387167966199
-------------> 0.006062328063066097
-------------> 0.006062268959336187
-------------> 0.006062209856776436
-------------> 0.006062150755386808
[1]
```

Our model also classified the data point into the same class.

**Summary:**

Logistic regression is a powerful classification algorithm. It is sometimes considered the starting point for deep learning because all the concepts involved in logistic regression are also used in training neural networks. The optimization algorithm, the concept of loss functions, etc. are also used while designing artificial neural networks. So, try to get a good understanding of this algorithm.


**Comment:** Nice article, but one more thing: how do we apply regularization to this, and what are the formulas for W and b after adding the regularization term?

**Reply:** To update W including L2 regularization, use

```python
grad_weight = (np.dot(X.T, grad) + lam * W) / X.shape[0]
```

where lam is the regularization parameter, whose value is to be selected on the basis of the data. (Note that `lambda` is a reserved word in Python, so a name like `lam` must be used.)

Since the dimension of the weight matrix is much higher than that of the bias, you can neglect the regularization for the bias. However, if you want to include it for the bias, use

```python
grad_bias = np.average(grad) + lam * b / X.shape[0]
```
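Putting the reply together, a single L2-regularized gradient step might look like the sketch below. The values of `lam`, `lr`, and the toy arrays are purely illustrative, and `lam` stands in for the reserved word `lambda`:

```python
import numpy as np

lam = 0.1   # regularization strength; tune on validation data
lr = 0.01   # learning rate

# Toy batch and parameters, for illustration only.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[0.0], [1.0]])
W = np.array([[0.5], [-0.5]])
b = 0.0

y_out = 1 / (1 + np.exp(-(np.dot(X, W) + b)))   # sigmoid
grad = y_out - Y

# L2 penalty adds (lam/m) * W to the weight gradient.
grad_weight = (np.dot(X.T, grad) + lam * W) / X.shape[0]
grad_bias = np.average(grad)   # bias is usually left unregularized

W = W - lr * grad_weight
b = b - lr * grad_bias
print(W.shape, round(float(b), 6))
```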