What is logistic regression?


How many patients will suffer from diabetes?
Let’s predict which people will develop diabetes based on their health records.
Logistic Regression:
Linear regression can be used only when y is continuous; it does not fit categorical outcomes, and this is where logistic regression comes in. Logistic regression is used when the target is categorical. The fitted curve lies between 0 and 1, so the relationship between the predictors and the predicted probability is non-linear (S-shaped). The model is mainly used for binomial (two-class) data, but it can be extended to outcomes with more than two classes. There are two such extensions:
1. Ordinal logistic regression
2. Multinomial logistic regression
In R, the glm() function is used to build a logistic regression model. Typical applications of logistic regression include spam detection, marketing, and banking.
Logistic regression works by estimating the likelihood of the outcome using the sigmoid function, which gives the probability of the target class. This probability always lies between 0 and 1.
ln(p / (1 - p)) = b0 + b1x
Solving for p gives the sigmoid curve: p = 1 / (1 + e^-(b0 + b1x))
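To make the link between the log-odds equation and the sigmoid curve concrete, here is a minimal sketch in base R. The coefficients b0 and b1 and the x values are hypothetical, chosen only for illustration.

```r
# Sigmoid (inverse logit): maps any real-valued linear predictor into (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration.
b0 <- -4
b1 <- 0.1

x <- c(10, 40, 70)           # three values of a single predictor
p <- sigmoid(b0 + b1 * x)    # predicted probabilities

# Taking the log-odds of p recovers the linear predictor b0 + b1 * x.
log_odds <- log(p / (1 - p))

print(round(p, 3))
print(round(log_odds, 3))
```

Base R also ships this function as plogis(), so sigmoid(z) and plogis(z) should agree.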
Model evaluation:
After building the model, we need to evaluate it. There are different techniques for evaluating the model, such as:
· Akaike information criterion (AIC)
· Null deviance and residual deviance
· Confusion matrix
· ROC-AUC
AIC:
AIC is an important indicator of model fit. It follows the rule: the smaller, the better. AIC penalizes an increasing number of coefficients in the model. In other words, adding more variables lowers AIC only if they improve the fit enough to outweigh the penalty, which helps to avoid overfitting.
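As a rough illustration of the penalty, the sketch below fits two models on simulated data (not the diabetes data) and computes AIC by hand as 2k - 2 * log-likelihood, where k is the number of coefficients. The variable names and the simulation settings are assumptions made for this example.

```r
set.seed(1)

# Simulated data: y depends on x1 only; x2 is pure noise.
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))

fit_small <- glm(y ~ x1,      family = binomial)
fit_big   <- glm(y ~ x1 + x2, family = binomial)

# AIC by hand: 2 * (number of coefficients) - 2 * log-likelihood.
aic_manual <- function(fit) {
  k <- length(coef(fit))
  2 * k - 2 * as.numeric(logLik(fit))
}

cat("Small model: manual", aic_manual(fit_small), "built-in", AIC(fit_small), "\n")
cat("Big model:   manual", aic_manual(fit_big),   "built-in", AIC(fit_big),   "\n")
```

The extra variable can only raise the log-likelihood, so the bigger model wins on AIC only when that gain exceeds the penalty of 2 per added coefficient.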
Null Deviance and Residual Deviance:
Null deviance is calculated from the model with no features, i.e., only an intercept. The null model predicts the class via a constant probability.
Residual deviance is calculated from the model that includes all the features. By analogy with linear regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS).
The larger the difference between null and residual deviance, the better the model.
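Both quantities are stored on a fitted glm object. The sketch below uses simulated data (an assumption, not the diabetes data) and also fits the intercept-only model explicitly to show where the null deviance comes from.

```r
set.seed(2)

# Simulated data with one informative predictor.
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 1.5 * x))

fit      <- glm(y ~ x, family = binomial)   # model with the feature
null_fit <- glm(y ~ 1, family = binomial)   # intercept-only (null) model

# The null model predicts one constant probability: the share of 1s in y.
constant_prob <- unname(fitted(null_fit)[1])

# Drop in deviance achieved by adding the feature.
deviance_drop <- fit$null.deviance - fit$deviance

cat("Null deviance:    ", fit$null.deviance, "\n")
cat("Residual deviance:", fit$deviance, "\n")
cat("Difference:       ", deviance_drop, "\n")
cat("Constant probability of the null model:", constant_prob, "\n")
```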
Confusion Matrix:
The confusion matrix is one of the most commonly used tools for evaluating classification models. It tabulates predicted classes against actual classes and is used to calculate the true positive rate (sensitivity) and the true negative rate (specificity).
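A small worked example may help here. The actual and predicted labels below are made up for illustration; the confusion matrix is built with table() from base R.

```r
# Hypothetical actual and predicted classes (1 = diabetic, 0 = not diabetic).
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0)

# Fixing the factor levels keeps both classes in the table even if one is never predicted.
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual,    levels = c(0, 1)))
print(cm)

tp <- cm["1", "1"]   # predicted 1, actually 1
tn <- cm["0", "0"]   # predicted 0, actually 0
fp <- cm["1", "0"]   # predicted 1, actually 0
fn <- cm["0", "1"]   # predicted 0, actually 1

sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate

cat("Sensitivity:", sensitivity, "\n")
cat("Specificity:", specificity, "\n")
```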
ROC-AUC:
The ROC curve shows how a classification model performs as the user-defined threshold value is varied. The model's overall performance is summarized by the area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index. The higher the area, the better the model. ROC is plotted with the true positive rate on the Y axis and the false positive rate on the X axis.
In this plot, the aim is to push the model's curve (red) toward the top-left corner and maximize the area under the curve. The higher the curve, the better the model. The yellow diagonal line represents a model that does no better than random guessing (AUC = 0.5). Along this line, sensitivity = 1 - specificity.
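To show what the curve and the area actually measure, here is a minimal base R sketch that computes the ROC points and the AUC by hand. The labels and scores are hypothetical, and roc_points() and auc_trapezoid() are helper names invented for this example; in practice a dedicated package would normally be used.

```r
# Hypothetical actual classes and predicted probabilities.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
score  <- c(0.9, 0.4, 0.35, 0.8, 0.1, 0.7, 0.6, 0.2)

# One (FPR, TPR) point per threshold, from the strictest threshold to the loosest.
roc_points <- function(actual, score) {
  thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) sum(score >= t & actual == 1) / sum(actual == 1))
  fpr <- sapply(thresholds, function(t) sum(score >= t & actual == 0) / sum(actual == 0))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Area under the ROC curve by the trapezoid rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc <- roc_points(actual, score)
auc <- auc_trapezoid(roc)
print(roc)
cat("AUC:", auc, "\n")

# Draw the curve and the random-guess diagonal when run interactively.
if (interactive()) {
  plot(roc$fpr, roc$tpr, type = "l", col = "red",
       xlab = "False Positive Rate", ylab = "True Positive Rate")
  abline(0, 1, col = "yellow")
}
```

The AUC equals the share of (positive, negative) pairs in which the positive case receives the higher score, which is why 0.5 corresponds to random guessing.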

Now let’s go through a practical demo using a diabetes data set to predict whether a patient will get diabetes or not. In this article I am mainly concentrating on the logistic regression model and the ROC-AUC curve; the rest is already explained in the previous article (What is linear regression?).
Checking for null values in the data set
This step checks the data set for NA values.

We can see that the age column has some NA values. We can also see this visually with the missmap() function in the Amelia library. Now impute the null values (imputing is the technique of filling in NA values; there are different techniques and functions for imputing).
Here I am taking the mean of all the values of age and replacing the NA values with it.
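The check and the mean imputation can be sketched as follows. The small data frame is a hypothetical stand-in for the diabetes data, and its column names (age, glucose, diabetes) are assumptions.

```r
# Hypothetical stand-in for the diabetes data; column names are assumptions.
patients <- data.frame(
  age      = c(45, NA, 52, 38, NA, 61),
  glucose  = c(148, 85, 183, 89, 137, 116),
  diabetes = c(1, 0, 1, 0, 1, 0)
)

# Count the NA values in each column.
na_counts <- colSums(is.na(patients))
print(na_counts)

# Optional visual check, if the Amelia package is installed:
# Amelia::missmap(patients)

# Mean imputation: replace each NA in age with the mean of the observed ages.
age_mean <- mean(patients$age, na.rm = TRUE)
patients$age[is.na(patients$age)] <- age_mean

print(patients)
```

Mean imputation is simple, but it shrinks the spread of the column, so it is only one of several possible choices.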
glm() is the function used for logistic regression.
Here I am building the model with the significant variables and predicting on the test data. Logistic regression gives a probability, so I am taking 0.5 as the threshold: a probability above 0.5 is classified as 1, otherwise 0.
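The original code is not reproduced here, so the following is only a sketch of the same steps on simulated data. The predictors glucose and bmi, the coefficients used to simulate the outcome, and the 70/30 split are all assumptions made for illustration.

```r
set.seed(42)

# Simulated stand-in for the diabetes data; variable names are assumptions.
n <- 300
patients <- data.frame(glucose = rnorm(n, mean = 120, sd = 30),
                       bmi     = rnorm(n, mean = 30,  sd = 6))
patients$diabetes <- rbinom(n, 1,
                            plogis(-9 + 0.05 * patients$glucose + 0.08 * patients$bmi))

# 70/30 train/test split.
train_idx <- sample(n, size = round(0.7 * n))
train <- patients[train_idx, ]
test  <- patients[-train_idx, ]

# family = binomial makes glm() fit a logistic regression.
model <- glm(diabetes ~ glucose + bmi, data = train, family = binomial)
print(summary(model))

# type = "response" returns probabilities instead of log-odds.
prob <- predict(model, newdata = test, type = "response")

# Apply the 0.5 threshold: above 0.5 is class 1, otherwise class 0.
pred <- ifelse(prob > 0.5, 1, 0)

print(table(Predicted = pred, Actual = test$diabetes))
cat("Accuracy on test data:", mean(pred == test$diabetes), "\n")
```

The 0.5 cut-off is a convention; a different threshold trades sensitivity against specificity.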

ROC-AUC Curve:
ROC stands for Receiver Operating Characteristic and AUC stands for Area Under the Curve.

A higher AUC indicates better model performance.

Get the data set and source code from the link below.
