After tackling linear regression, let’s look at another technique: logistic regression, also called logit regression.

Linear regression minimizes a squared error function to find the best-fitting line, then uses the fitted parameters to predict the target value for new data. Logistic regression instead predicts the probability of an event occurring. We no longer predict the target value directly; we predict the probability that the event occurs (y = 1) rather than not (y = 0). We also no longer assume that the target is a linear function of the inputs, as we did in linear regression.

In logistic regression we keep the same hypothesis notation, h(x), but this time its value lies between 0 and 1. If the value is greater than 0.5, we say the event is more likely to occur than not, and vice versa. To keep the value between 0 and 1, we use the **sigmoid function**, which is defined as

$$g(z) = \frac{1}{1 + e^{-z}}$$

Graph of sigmoid function:

From the graph we can see that g(z) >= 0.5 for z >= 0 and g(z) < 0.5 for z < 0. We’ll use this in a moment.
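To check these properties numerically, here is a minimal Octave/MATLAB sketch. It assumes the standard sigmoid g(z) = 1 / (1 + e^(-z)); the sample values in `z` are arbitrary choices for illustration.

```octave
% Sigmoid as an anonymous function (works element-wise on vectors).
g = @(z) 1 ./ (1 + exp(-z));

z = [-10 -1 0 1 10];   % arbitrary sample inputs, chosen for illustration
vals = g(z);

disp(vals);            % every value lies strictly between 0 and 1
disp(vals >= 0.5);     % true exactly where z >= 0
```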

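Here z is replaced with the expression from our earlier hypothesis, θ’x (written $\theta^T x$ in the formulas), so the hypothesis for logistic regression becomes

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

which is the probability that y = 1, given x, parameterized by θ. In other words, h(x) = P(y = 1 | x; θ).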

Suppose we predict y = 1 when h(x) >= 0.5. From the graph above, this means g(θ’x) >= 0.5, which occurs when θ’x >= 0. Similarly, we predict y = 0 when h(x) < 0.5, which means θ’x < 0. To make this clearer, let’s look at an example.

Suppose our hypothesis is

$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$$

and suppose the fitted θ parameters are [-3 1 1]. Then we predict y = 1 when

-3 + x1 + x2 >= 0, or equivalently

x1 + x2 >= 3

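The inequality above gives us the decision boundary, the line x1 + x2 = 3, as shown in the graph. Points on or above the line are predicted as y = 1; points below it are predicted as y = 0.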
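The decision rule from this example can be sketched in Octave/MATLAB as follows. The test points in `X` are made up for illustration, and each row starts with a 1 so that it multiplies the first parameter, -3.

```octave
theta = [-3; 1; 1];          % parameters from the example above

% Hypothetical test points (x1, x2); the leading 1 is the intercept term.
X = [1 1 1;                  % x1 + x2 = 2, below the boundary
     1 2 1;                  % x1 + x2 = 3, on the boundary
     1 3 2];                 % x1 + x2 = 5, above the boundary

z = X * theta;               % theta' * x for each row: [-1; 0; 2]
h = 1 ./ (1 + exp(-z));      % predicted probability that y = 1
pred = (z >= 0);             % same decision as h >= 0.5

disp([h pred]);
```

Checking the sign of θ’x is enough to classify a point; the sigmoid is only needed when we want the probability itself.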

The remaining question is how to determine the θ parameters for logistic regression.

**Let’s start the quest 🙂**

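For linear regression, our squared error function was

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where n is the total number of data elements.

Let’s rewrite it as

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Replace the term inside the sum with a general per-example cost, Cost(h(x), y):

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Cost}\left( h_\theta(x^{(i)}), y^{(i)} \right)$$

For logistic regression,

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$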

The intuition behind using the log is as follows.

If y = 1, our cost function is -log(h(x)). Plotted against h(x), it falls from infinity at h(x) = 0 to zero at h(x) = 1.

So if h(x) = 1, the cost is zero because the hypothesis predicted the correct value. But as h(x) approaches 0, the cost grows toward infinity. Predicting h(x) = 0 amounts to saying that an event has zero probability of occurring when it actually did occur, so we penalize the algorithm with a very large cost.

Similarly, if y = 0, our cost function is -log(1 - h(x)). Plotted against h(x), it rises from zero at h(x) = 0 to infinity at h(x) = 1.

If h(x) = 0, the cost is zero because we predicted the correct value. But as h(x) approaches 1 while y is 0, we again penalize the algorithm with a very large cost.

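So that’s how it works. To combine these two equations into one, we write

$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

If y = 1, the second term vanishes and

$$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$$

else if y = 0, the first term vanishes and

$$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$$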
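As a quick numeric check, the Octave/MATLAB sketch below evaluates the combined per-example cost, -y log(h) - (1 - y) log(1 - h), for a few predicted probabilities. The values in `h` are arbitrary illustrations, not outputs of a trained model.

```octave
% Combined per-example cost: -y*log(h) - (1 - y)*log(1 - h)
cost = @(h, y) -y .* log(h) - (1 - y) .* log(1 - h);

h = [0.99 0.5 0.01];     % hypothetical predicted probabilities of y = 1

cost_y1 = cost(h, 1);    % reduces to -log(h):     small, moderate, large
cost_y0 = cost(h, 0);    % reduces to -log(1 - h): large, moderate, small

disp(cost_y1);
disp(cost_y0);
```

A confident wrong prediction (h = 0.01 when y = 1) costs roughly 4.6, while a confident correct one (h = 0.99 when y = 1) costs roughly 0.01.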

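Logistic Regression Cost Function:

$$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$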

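Again, to minimize the cost we use the gradient descent algorithm, repeating the update below for every θ_j simultaneously until convergence:

$$\theta_j := \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

where α is the learning rate. In code, the cost and its gradient are computed as follows (m is the number of training examples, called n above):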

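```octave
% Assumes X (m x k design matrix), y (m x 1 labels in {0, 1}) and theta (k x 1) are defined.
m = length(y);               % number of training examples
h = sigmoid(X * theta);      % predicted probabilities, m x 1
J = 1./m * (-y' * log(h) - (1 - y') * log(1 - h));   % cost
grad = 1./m * X' * (h - y);  % gradient of the cost with respect to theta
```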
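For reference, the `grad` line is the vectorized form of the partial derivatives of the cost. Here is a sketch of the standard derivation, assuming J is the log cost averaged over m training examples (as in the code), writing $h^{(i)}$ for the sigmoid of θ’x on example i, and using the identity $g'(z) = g(z)(1 - g(z))$:

$$
\frac{\partial J}{\partial \theta_j}
= -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h^{(i)}} - \frac{1 - y^{(i)}}{1 - h^{(i)}} \right] h^{(i)} \left( 1 - h^{(i)} \right) x_j^{(i)}
= \frac{1}{m} \sum_{i=1}^{m} \left( h^{(i)} - y^{(i)} \right) x_j^{(i)}
$$

Stacking these partial derivatives for all j gives $\frac{1}{m} X^T (h - y)$, which is what the code computes.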

Sigmoid Function

```octave
function g = sigmoid(z)
  % Element-wise sigmoid; z can be a scalar, vector, or matrix.
  g = 1.0 ./ (1.0 + exp(-z));
end
```
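Putting the pieces together, here is a minimal sketch of batch gradient descent in Octave/MATLAB. The tiny dataset, the learning rate `alpha`, and the iteration count are all made-up values for illustration; this is one simple way to minimize the cost, not a tuned implementation.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical toy data: one feature plus a leading column of ones.
X = [1 0.5; 1 1.5; 1 2.0; 1 3.0; 1 3.5; 1 4.5];
y = [0; 0; 1; 0; 1; 1];      % labels overlap, so the classes are not separable
m = length(y);               % number of training examples

% Average log cost over all examples for a given parameter vector.
cost = @(theta) (1/m) * (-y' * log(sigmoid(X*theta)) ...
                         - (1 - y)' * log(1 - sigmoid(X*theta)));

theta = zeros(2, 1);         % start from all-zero parameters
alpha = 0.1;                 % assumed learning rate
J_start = cost(theta);       % equals log(2) when theta is all zeros

for iter = 1:2000            % assumed iteration count
  grad = (1/m) * X' * (sigmoid(X*theta) - y);
  theta = theta - alpha * grad;   % step against the gradient
end

J_end = cost(theta);
fprintf('cost: %.4f -> %.4f\n', J_start, J_end);
```

With a small enough learning rate the cost decreases on every step. In practice you would stop once the change in cost becomes negligible instead of running a fixed number of iterations.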