What is Linear Regression?
Linear regression is a supervised learning algorithm used to predict a numerical output from a set of inputs. It works on continuous data and assumes that the variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x).
The equation of the regression line for a single independent variable is:
Yi = b0 + b1Xi + ei
where
b0 = intercept of the equation
b1 = slope of the regression line
ei = residual error (distance between the regression line and the data point)
The equation of the regression line for more than one independent variable is:
Yi = b0 + b1X1 + b2X2 + b3X3 + … + bnXn + ei
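The single-variable equation can be verified quickly in R on synthetic data. The true coefficients 3 and 2 below are arbitrary choices for illustration, not values from the post:

```r
# Toy data where y really is linear in x: y = 3 + 2x + noise
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)

fit <- lm(y ~ x)
coef(fit)       # intercept (b0) near 3, slope (b1) near 2
residuals(fit)  # the ei terms: data point minus fitted line
```

lm() recovers b0 and b1 by minimizing the sum of squared residuals.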
Assumptions on the data set:
Before building the model, we must check some basic assumptions about the data set:
i. Linear Relationship:
The relationship between the response and feature variables should be linear; otherwise the model either overfits or underfits.
The model underfits when its bias is high and the regression line does not pass through the points. It overfits when the model has high variance, which usually happens when we include unwanted independent variables, so use only significant variables.
Bias: how much, on average, the predicted values differ from the actual values.
Variance: how different the model's predictions at the same point would be if different samples were taken from the same population.
This occurs because we are not training our model on the full historical data; to overcome this problem we use the K-fold cross-validation method (explained in the example).
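K-fold cross-validation can be sketched in base R without extra packages. The data here is synthetic; the idea is the same as in the post's example:

```r
# 5-fold cross-validation for lm(): estimate out-of-sample error
set.seed(2)
d <- data.frame(x = runif(100))
d$y <- 1 + 2 * d$x + rnorm(100, sd = 0.3)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(d)))  # assign each row a fold
cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = d[folds != i, ])      # train on k-1 folds
  pred <- predict(fit, newdata = d[folds == i, ])
  sqrt(mean((d$y[folds == i] - pred)^2))         # error on the held-out fold
})
mean(cv_rmse)  # averaged estimate of prediction error
```

Each observation is held out exactly once, so the averaged RMSE is less optimistic than the training error.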
ii. No Multicollinearity:
Multicollinearity occurs when the independent variables are not independent of each other. To measure it we use the variance inflation factor (VIF); if a variable's VIF is more than 5, remove it while training the model. The VIF for each variable is calculated as 1 / (1 − R²), where R² comes from regressing that variable against all the other independent variables.
Ex: In the example below, all the variables shown in dark blue are highly correlated (VIF > 5).
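The VIF formula above can be computed by hand in base R. Here x2 is deliberately built as a near-copy of x1, so both should show VIF well above 5:

```r
# VIF for each predictor: regress it on the others, VIF = 1 / (1 - R^2)
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly duplicates x1 -> collinear
x3 <- rnorm(n)                 # independent of the others
X  <- data.frame(x1, x2, x3)

vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
vif  # x1 and x2 far above 5; x3 close to 1
```

The car package's vif() function gives the same values on a fitted model; the manual version just makes the definition explicit.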
iii. Outliers:
An outlier is an observation point that is distant from the other points. Remove outliers from the data, since they affect the performance of the model. Outliers are identified by plotting the data set.
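One common plot-based rule is the boxplot whisker rule, which base R exposes directly. This is a minimal sketch on synthetic data:

```r
# Flag points beyond the boxplot whiskers as outliers
set.seed(4)
x <- c(rnorm(50), 12)         # 50 ordinary points plus one extreme value
out <- boxplot.stats(x)$out   # values outside 1.5 * IQR of the box
x_clean <- x[!(x %in% out)]   # drop the flagged points
```

boxplot.stats() is the same rule that draws the dots beyond the whiskers in boxplot().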
After preparing your data, build the regression model using the lm() function.
Model Selection:
Select the model with the lower AIC value; AIC penalizes additional parameters that are not useful in the model, and stepwise regression automates the search for the variable set that minimizes this penalty.
Stepwise Regression:
It is a variable-selection procedure for the independent variables, in which the selection is done by an automatic process without human intervention.
There are 3 types of stepwise regression:
- Forward stepwise regression
- Backward stepwise regression
- Standard stepwise regression
Forward Stepwise Regression:
Also called step-up selection, it starts by adding the most significant variable to the model and adds one more at each iteration until it reaches the best AIC value.
Backward Stepwise Regression:
Also called step-down selection, it first adds all the variables to the model and then removes the least significant variables one by one until it reaches the best AIC value.
Standard Stepwise Regression:
A combination of the above two approaches.
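In R, base stats provides step() for exactly this. A backward-stepwise sketch on synthetic data, where x3 carries no real signal:

```r
# Backward stepwise selection with step(); AIC decides which
# variables survive. x3 is pure noise and is usually dropped.
set.seed(5)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100, sd = 0.5)

full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)  # typically y ~ x1 + x2
```

Setting direction = "forward" or "both" gives the other two variants described above.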
Now let's go to the practical demo. Here I am using a house-price data set, which consists of parameters like area, baths, city, floor type, etc., to predict the price of a house.
House Price Prediction with R:
Open RStudio and set the working directory to your data set's location, then load the data set into RStudio. read.csv() is used to load CSV files, its na.strings argument replaces null values with NA, and str() is used to check the structure of the data set.
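A minimal, reproducible sketch of that loading step; since the post's CSV isn't included here, a tiny file is written first so the calls can run (column names are illustrative, not the post's full schema):

```r
# Write a tiny CSV, then load it the way the post describes
tmp <- tempfile(fileext = ".csv")
writeLines(c("area,baths,price",
             "1200,2,300000",
             "900,,210000"), tmp)     # second row has a blank baths field

house <- read.csv(tmp, na.strings = c("", "NA"))  # blanks become NA
str(house)                # structure: column types and sample values
sum(is.na(house$baths))   # one missing value detected
```

With your own file you would pass its path (and working directory) instead of the temp file.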
Here the prices variable is our dependent variable (the one to be predicted) and all the remaining ones are independent variables. Variables having fewer than 5 classes can be changed to the factor data type, which is mainly used for categorical data. In our data set, columns 1 and 16 have more than 5 classes, so I exclude those columns from the data-type conversion.
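The conversion can be sketched like this on a toy frame; the "fewer than 5 classes" rule is the post's, while the column names and the extra not-already-numeric guard are my assumptions:

```r
d <- data.frame(city       = c("A", "B", "A", "C"),
                floor_type = c("wood", "tile", "wood", "tile"),
                price      = c(10, 20, 15, 30),
                stringsAsFactors = FALSE)

# Columns with fewer than 5 distinct values (and not numeric) become
# factors; in the post, high-cardinality columns 1 and 16 are skipped
to_factor <- sapply(d, function(col) !is.numeric(col) && length(unique(col)) < 5)
d[to_factor] <- lapply(d[to_factor], as.factor)
str(d)  # city and floor_type are now factors; price stays numeric
```

On the real data set you would index out columns 1 and 16 explicitly, e.g. applying the conversion to d[-c(1, 16)].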
In R, the CRAN repository provides different packages for statistical calculations; install the required package and import it into RStudio as follows.
Now split the data into two parts, training and testing. There are many ways to split data; here I am using the caret package. After splitting the data, build your model using the training data set.
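A 70/30 split can be sketched as below. The post uses caret::createDataPartition, which additionally stratifies on the target; this base-R version keeps the sketch dependency-free and the data is synthetic:

```r
# 70/30 train/test split on synthetic data
set.seed(6)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

idx   <- sample(nrow(d), size = 0.7 * nrow(d))  # random 70% of row indices
train <- d[idx, ]
test  <- d[-idx, ]
c(train = nrow(train), test = nrow(test))  # 70 and 30 rows
```

With caret the equivalent would be idx <- createDataPartition(d$y, p = 0.7, list = FALSE).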
The lm() function is used to build the linear regression model; prices is my target column and houseprice_train is my training data set. summary(model) gives a summary of the model, such as the R-squared value, the significance of the variables, etc.
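The fitting step looks like this; synthetic data stands in for houseprice_train, with the prices column name taken from the post and the other column names assumed:

```r
# Fit prices against all other columns, as in the post
set.seed(7)
train <- data.frame(area  = runif(80, 500, 2000),
                    baths = sample(1:3, 80, replace = TRUE))
train$prices <- 100 * train$area + 5000 * train$baths + rnorm(80, sd = 2000)

model <- lm(prices ~ ., data = train)  # "." means all remaining columns
summary(model)  # R-squared, coefficients, significance stars
```

The formula prices ~ . is the compact way to regress the target on every other variable in the data frame.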
Here you can see the last column having stars; the variables with stars are the more significant variables for prediction, and their p-values should be less than 0.05. To keep only the significant columns, we use stepwise regression: it checks the significance of the variables and gives the final formula.
As I explained earlier, there are 3 methods of stepwise regression; here I am using the backward method. Now predict on our testing data set.
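The prediction step, including dropping the known prices column first as the post does, can be sketched on synthetic data (column names mirror the post's; the helper data is mine):

```r
# Predict on held-out rows after removing the target column
set.seed(8)
d <- data.frame(area = runif(100, 500, 2000))
d$prices <- 100 * d$area + rnorm(100, sd = 2000)

train <- d[1:70, ]
test  <- d[71:100, ]
model <- lm(prices ~ area, data = train)

# Drop prices from the test frame before predicting
test_features <- test[, setdiff(names(test), "prices"), drop = FALSE]
pred <- predict(model, newdata = test_features)
head(pred)
```

Dropping the column is not strictly required by predict(), which ignores the response, but it makes explicit that the test prices are only used afterwards for scoring.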
My testing data set already has a prices column, so I removed that column before prediction. After prediction, check the error using RMSE (root mean squared error). This is my error rate.
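RMSE itself is a one-liner; the numbers below are made up purely to show the arithmetic:

```r
# RMSE: square the errors, average them, take the square root
actual    <- c(200, 150, 320)
predicted <- c(210, 140, 300)
rmse <- sqrt(mean((actual - predicted)^2))
rmse  # sqrt((100 + 100 + 400) / 3) = sqrt(200) ≈ 14.14
```

RMSE is in the same units as the target (here, price), which makes it easy to judge whether the error is acceptable.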
Get the code and data set from the link below.