What is logistic regression?
How many patients will suffer from diabetes?
Let's predict which patients will develop diabetes based on their health records.
Logistic Regression:
Linear regression can be used only when y is continuous; it does not fit categorical data, which is where logistic regression comes in. It is used for categorical targets. The fitted curve lies between 0 and 1, so the relationship between the predictors and the predicted probability is S-shaped rather than a straight line (the model is still linear in the log-odds). It is most often fitted to binomial (two-class) data, but it also extends to targets with more than two classes. For those, there are two logistic regression techniques:
1. Ordinal logistic regression (the classes have a natural order)
2. Multinomial logistic regression (the classes have no order)
In R, the function glm() is used for building a logistic regression model. Some applications of logistic regression are spam detection, marketing and banking.
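Logistic regression works by passing a linear combination of the predictors through the sigmoid function, which gives the probability that the target variable belongs to the positive class. This probability always lies between 0 and 1. The coefficients are estimated by maximizing the likelihood of the observed data.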
ln(p / (1 - p)) = b0 + b1x
Solving for p gives the sigmoid curve: p = 1 / (1 + e^-(b0 + b1x))
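To make the link between the log-odds equation and the sigmoid curve concrete, here is a minimal sketch in base R. The coefficients b0 and b1 are hypothetical values picked only for illustration, not estimates from the diabetes data.

```r
# Sigmoid (inverse logit): maps any real number to a probability in (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Logit (log-odds): the inverse of the sigmoid.
logit <- function(p) log(p / (1 - p))

# Hypothetical coefficients, chosen only for illustration.
b0 <- -3
b1 <- 0.5

x <- c(0, 6, 12)
p <- sigmoid(b0 + b1 * x)
print(round(p, 3))
# [1] 0.047 0.500 0.953

# Applying the logit to p recovers the linear predictor b0 + b1 * x.
print(logit(p))
```

Base R also ships these two functions as plogis() and qlogis(), so in practice there is no need to define them by hand.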
Model evaluation:
After building the model, we need to evaluate it. There are different techniques for evaluating the model, such as:
· Akaike information criterion (AIC)
· Null deviance and residual deviance
· Confusion matrix
· ROC-AUC
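AIC:
It's an important indicator of model fit. It follows the rule: the smaller, the better. AIC = 2k - 2 ln(L), where k is the number of estimated coefficients and L is the maximized likelihood, so AIC penalizes an increasing number of coefficients in the model. In other words, adding a variable lowers AIC only if it improves the fit enough to outweigh the penalty. This helps to avoid overfitting. AIC values are only meaningful for comparing models fitted to the same data.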
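The penalty can be checked by hand. The sketch below uses simulated data (an assumption, it is not the diabetes data set) and compares a model with one real predictor against a model that also includes a pure noise column.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
noise <- rnorm(n)  # simulated column that is unrelated to the outcome
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

fit_small <- glm(y ~ x, family = binomial)
fit_big <- glm(y ~ x + noise, family = binomial)

# AIC = 2k - 2 * log-likelihood, with k = number of estimated coefficients.
k <- length(coef(fit_small))
manual_aic <- 2 * k - 2 * as.numeric(logLik(fit_small))

print(c(manual = manual_aic, builtin = AIC(fit_small)))
print(AIC(fit_small, fit_big))
```

Because the bigger model has one more coefficient, its AIC can exceed the smaller model's AIC by at most 2, and it only comes out lower when the extra variable raises the log-likelihood by more than 1.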
Null Deviance and Residual Deviance:
Null deviance is calculated from the model with no features, i.e., only an intercept. The null model predicts the class via a constant probability.
Residual deviance is calculated from the model having all the features. By analogy with linear regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS).
The larger the difference between null and residual deviance, the more the features improve on the intercept-only model.
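Both deviances are stored on a fitted glm object, so the comparison can be sketched directly. The data here is simulated for illustration, and the chi-square test on the drop in deviance is a standard likelihood-ratio check that I am adding as an assumption; it is not something the steps above spell out.

```r
set.seed(1)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 1.5 * x))

fit <- glm(y ~ x, family = binomial)        # model with the feature
null_fit <- glm(y ~ 1, family = binomial)   # intercept-only model

# The null deviance stored on 'fit' equals the deviance of the intercept-only model.
print(c(null = fit$null.deviance, residual = fit$deviance))
print(deviance(null_fit))

# Drop in deviance and its degrees of freedom (number of added coefficients).
drop_in_deviance <- fit$null.deviance - fit$deviance
df_diff <- fit$df.null - fit$df.residual

# Likelihood-ratio test: a small p-value suggests the features help.
p_value <- pchisq(drop_in_deviance, df = df_diff, lower.tail = FALSE)
print(p_value)
```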
Confusion Matrix:
The confusion matrix is one of the most commonly used tools for evaluating classification models. It cross-tabulates actual against predicted classes at a chosen threshold and is used to calculate the true positive rate (sensitivity) and the true negative rate (specificity).
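As a small worked example, the sketch below builds a confusion matrix with table() and derives sensitivity and specificity from it. The actual and predicted vectors are made-up values used only to show the arithmetic.

```r
# Made-up labels for ten patients (1 = diabetic, 0 = not diabetic).
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)

# Fix the factor levels so the table always has both rows and both columns.
cm <- table(
  Actual = factor(actual, levels = c(0, 1)),
  Predicted = factor(predicted, levels = c(0, 1))
)
print(cm)

tp <- cm["1", "1"]  # diabetic, predicted diabetic
fn <- cm["1", "0"]  # diabetic, predicted not diabetic
tn <- cm["0", "0"]  # not diabetic, predicted not diabetic
fp <- cm["0", "1"]  # not diabetic, predicted diabetic

sensitivity <- tp / (tp + fn)  # true positive rate: 3 / 4 = 0.75
specificity <- tn / (tn + fp)  # true negative rate: 5 / 6
accuracy <- (tp + tn) / sum(cm)  # 8 / 10 = 0.8

print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```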
ROC-AUC:
The ROC curve shows how a classification model performs across all threshold values, rather than at a single user-defined threshold. The model's overall performance is summarized by the Area Under the Curve (AUC), also referred to as the index of accuracy (A) or concordance index. The higher the area, the better the model. ROC is plotted with the True Positive Rate on the Y axis and the False Positive Rate on the X axis.
In this plot, our aim is to push the red curve (shown below) toward the top-left corner (true positive rate of 1, false positive rate of 0) and maximize the area under the curve. The higher the curve, the better the model. The yellow diagonal line represents a model that guesses at random, with an AUC of 0.5.
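To show what the curve is made of, here is a minimal base R sketch that computes the ROC points and the AUC by hand. The helper names roc_points and auc_trapezoid are hypothetical, and the labels and probabilities are made up; packages such as pROC or ROCR are the usual choice in practice.

```r
# For every distinct threshold, compute the false and true positive rates.
roc_points <- function(actual, prob) {
  thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) sum(prob >= t & actual == 1) / sum(actual == 1))
  fpr <- sapply(thresholds, function(t) sum(prob >= t & actual == 0) / sum(actual == 0))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Area under the curve by the trapezoid rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Made-up labels and predicted probabilities for six patients.
actual <- c(1, 1, 0, 1, 0, 0)
prob <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.2)

roc <- roc_points(actual, prob)
print(roc)

auc <- auc_trapezoid(roc)
print(auc)  # 8/9: in 8 of the 9 positive/negative pairs the positive scores higher
```

Calling plot(roc$fpr, roc$tpr, type = "l") draws the curve; the random-guess diagonal can be added with abline(0, 1).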
Now let's go through a practical demo using a diabetes data set to predict whether a patient will get diabetes or not. In this article I mainly concentrate on the logistic regression model and the ROC-AUC curve; the rest is already explained in the previous article (what is linear regression).
Checking for null values in the data set
The function above is used to check the data set for null (NA) values.
We can see that the age column has some NA values. We can also see this visually with the missmap() function from the Amelia library. Now impute the null values (imputing is the technique of filling in NA values; there are different techniques and functions for imputing).
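A minimal sketch of the NA check in base R follows. The patients data frame and its column names are hypothetical stand-ins, since the real diabetes data set is not reproduced here.

```r
# Hypothetical stand-in for the diabetes data, with two missing ages.
patients <- data.frame(
  age = c(50, NA, 32, 41, NA, 29),
  glucose = c(148, 85, 183, 89, 137, 116),
  diabetes = c(1, 0, 1, 0, 1, 0)
)

# Count the NA values in each column.
na_counts <- colSums(is.na(patients))
print(na_counts)

# Quick yes/no check for the whole data frame.
print(anyNA(patients))
```

With the Amelia package installed, missmap(patients) gives the visual version of the same check.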
Here I am taking the mean of all the age values and replacing the NA values with it.
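Mean imputation can be sketched in a couple of lines of base R. The age values below are made up for illustration.

```r
# Made-up ages with two missing values.
patients <- data.frame(age = c(50, NA, 32, 41, NA, 29))

# Mean of the observed ages; na.rm = TRUE skips the NA values.
age_mean <- mean(patients$age, na.rm = TRUE)  # (50 + 32 + 41 + 29) / 4 = 38

# Replace only the missing entries with the mean.
patients$age[is.na(patients$age)] <- age_mean

print(patients$age)
# [1] 50 38 32 41 38 29
```

Mean imputation is simple, but it shrinks the variance of the column, so the median or a model-based method may be worth considering when many values are missing.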
glm() is the function used for logistic regression.
Here I am building the model with the significant variables and predicting on the test data. Logistic regression gives a probability, so I am taking 0.5 as the threshold: a probability above 0.5 is classified as 1, otherwise 0.
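The steps just described (fit with glm(), predict probabilities on test data, cut at 0.5) can be sketched as follows. This is not the article's original code: the data is simulated, and the column names glucose, age and diabetes, the coefficients used to simulate them, and the 70/30 split are all assumptions made for illustration.

```r
set.seed(123)
n <- 400

# Simulated predictors and outcome (assumed relationship, not real patient data).
glucose <- rnorm(n, mean = 120, sd = 30)
age <- round(runif(n, 21, 70))
diabetes <- rbinom(n, 1, plogis(-8 + 0.05 * glucose + 0.03 * age))
patients <- data.frame(glucose, age, diabetes)

# 70/30 train/test split.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- patients[train_idx, ]
test <- patients[-train_idx, ]

# family = binomial makes glm() fit a logistic regression.
model <- glm(diabetes ~ glucose + age, data = train, family = binomial)
print(summary(model)$coefficients)

# type = "response" returns probabilities rather than log-odds.
prob <- predict(model, newdata = test, type = "response")

# Apply the 0.5 threshold: above 0.5 is class 1, otherwise class 0.
pred <- ifelse(prob > 0.5, 1, 0)

print(table(Actual = test$diabetes, Predicted = pred))
accuracy <- mean(pred == test$diabetes)
print(accuracy)
```

The 0.5 cut-off is a convention rather than a rule; a lower threshold catches more true positives at the cost of more false positives.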
O/P:
ROC-AUC Curve:
ROC stands for Receiver Operating Characteristic and AUC stands for Area Under the Curve.
O/P:
The higher the AUC, the better the model performs.
Get the data set and source code from the link below.