What is Linear Regression?
Linear regression is a supervised learning algorithm used to predict a numerical output from a set of inputs. It works on continuous data and assumes that the variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x).
The equation of the regression line for a single independent variable is:
Yi = b0 + b1Xi + ei
where
b0 = intercept of the equation
b1 = slope of the regression line
ei = residual error (distance between the regression line and the data point)
The equation of the regression line for more than one independent variable is:
Yi = b0 + b1X1 + b2X2 + b3X3 + … + bnXn + ei
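The single-variable equation can be verified quickly in R on synthetic data. The true coefficients 3 and 2 below are arbitrary choices for illustration, not values from the post:

```r
# Toy data where y really is linear in x: y = 3 + 2x + noise
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)

fit <- lm(y ~ x)
coef(fit)       # intercept (b0) near 3, slope (b1) near 2
residuals(fit)  # the ei terms: data point minus fitted line
```

lm() recovers b0 and b1 by minimizing the sum of squared residuals.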
Assumptions on the data set:
Before building the model, we must check some basic assumptions about the data set:
i. Linear Relationship:
The relationship between the response and feature variables should be linear; otherwise the model either overfits or underfits.
The model underfits when its bias is high and the regression line does not pass through the points. It overfits when the model has high variance, which usually happens when we include unwanted independent variables, so use only significant variables.
Bias: how much, on average, the predicted values differ from the actual values.
Variance: how different the model's predictions at the same point would be if different samples were taken from the same population.
This occurs because we are not training our model on the full historical data; to overcome this problem we use the K-fold cross-validation method (explained in the example).
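K-fold cross-validation can be sketched in base R without extra packages. The data here is synthetic; the idea is the same as in the post's example:

```r
# 5-fold cross-validation for lm(): estimate out-of-sample error
set.seed(2)
d <- data.frame(x = runif(100))
d$y <- 1 + 2 * d$x + rnorm(100, sd = 0.3)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(d)))  # assign each row a fold
cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = d[folds != i, ])      # train on k-1 folds
  pred <- predict(fit, newdata = d[folds == i, ])
  sqrt(mean((d$y[folds == i] - pred)^2))         # error on the held-out fold
})
mean(cv_rmse)  # averaged estimate of prediction error
```

Each observation is held out exactly once, so the averaged RMSE is less optimistic than the training error.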
ii. No Multicollinearity:
Multicollinearity occurs when the independent variables are not independent of each other. To measure it we use the variance inflation factor (VIF); if a variable's VIF is more than 5, remove it while training the model. The VIF for each variable is calculated as 1 / (1 − R²), where R² comes from regressing that variable against all the other independent variables.
Ex: In the example below, all the variables shown in dark blue are highly correlated (VIF > 5).
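The VIF formula above can be computed by hand in base R. Here x2 is deliberately built as a near-copy of x1, so both should show VIF well above 5:

```r
# VIF for each predictor: regress it on the others, VIF = 1 / (1 - R^2)
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly duplicates x1 -> collinear
x3 <- rnorm(n)                 # independent of the others
X  <- data.frame(x1, x2, x3)

vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
vif  # x1 and x2 far above 5; x3 close to 1
```

The car package's vif() function gives the same values on a fitted model; the manual version just makes the definition explicit.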
iii. Outliers:
An outlier is an observation point that is distant from the other points. Remove outliers from the data, since they affect the performance of the model. Outliers are identified by plotting the data set.
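One common plot-based rule is the boxplot whisker rule, which base R exposes directly. This is a minimal sketch on synthetic data:

```r
# Flag points beyond the boxplot whiskers as outliers
set.seed(4)
x <- c(rnorm(50), 12)         # 50 ordinary points plus one extreme value
out <- boxplot.stats(x)$out   # values outside 1.5 * IQR of the box
x_clean <- x[!(x %in% out)]   # drop the flagged points
```

boxplot.stats() is the same rule that draws the dots beyond the whiskers in boxplot().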
After preparing your data, build the regression model using the lm() function.
Model Selection:
Select the model with the lower AIC value; AIC penalizes additional parameters that are not useful in the model, and stepwise regression automates the search for the variable set that minimizes this penalty.
Stepwise Regression:
It is a variable-selection procedure for the independent variables, in which the selection is done by an automatic process without human intervention.
There are 3 types of stepwise regression:
- Forward stepwise regression
- Backward stepwise regression
- Standard stepwise regression
Forward Stepwise Regression:
Also called step-up selection, it starts by adding the most significant variable to the model and adds one more at each iteration until it reaches the best AIC value.
Backward Stepwise Regression:
Also called step-down selection, it first adds all the variables to the model and then removes the least significant variables one by one until it reaches the best AIC value.
Standard Stepwise Regression:
A combination of the above two approaches.
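In R, base stats provides step() for exactly this. A backward-stepwise sketch on synthetic data, where x3 carries no real signal:

```r
# Backward stepwise selection with step(); AIC decides which
# variables survive. x3 is pure noise and is usually dropped.
set.seed(5)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100, sd = 0.5)

full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)  # typically y ~ x1 + x2
```

Setting direction = "forward" or "both" gives the other two variants described above.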
Now let's go to the practical demo. Here I am using a house-price data set, which consists of parameters like area, baths, city, floor type, etc., to predict the price of a house.
House Price Prediction with R:
Open RStudio and set the working directory to your data set's location, then load the data set into RStudio. read.csv() is used to load CSV files, its na.strings argument replaces null values with NA, and str() is used to check the structure of the data set.
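A minimal, reproducible sketch of that loading step; since the post's CSV isn't included here, a tiny file is written first so the calls can run (column names are illustrative, not the post's full schema):

```r
# Write a tiny CSV, then load it the way the post describes
tmp <- tempfile(fileext = ".csv")
writeLines(c("area,baths,price",
             "1200,2,300000",
             "900,,210000"), tmp)     # second row has a blank baths field

house <- read.csv(tmp, na.strings = c("", "NA"))  # blanks become NA
str(house)                # structure: column types and sample values
sum(is.na(house$baths))   # one missing value detected
```

With your own file you would pass its path (and working directory) instead of the temp file.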
Here the prices variable is our dependent variable (the one to be predicted) and all the remaining ones are independent variables. Variables having fewer than 5 classes can be changed to the factor data type, which is mainly used for categorical data. In our data set, columns 1 and 16 have more than 5 classes, so I exclude those columns from the data-type conversion.
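The conversion can be sketched like this on a toy frame; the "fewer than 5 classes" rule is the post's, while the column names and the extra not-already-numeric guard are my assumptions:

```r
d <- data.frame(city       = c("A", "B", "A", "C"),
                floor_type = c("wood", "tile", "wood", "tile"),
                price      = c(10, 20, 15, 30),
                stringsAsFactors = FALSE)

# Columns with fewer than 5 distinct values (and not numeric) become
# factors; in the post, high-cardinality columns 1 and 16 are skipped
to_factor <- sapply(d, function(col) !is.numeric(col) && length(unique(col)) < 5)
d[to_factor] <- lapply(d[to_factor], as.factor)
str(d)  # city and floor_type are now factors; price stays numeric
```

On the real data set you would index out columns 1 and 16 explicitly, e.g. applying the conversion to d[-c(1, 16)].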
In R, the CRAN repository provides different packages for statistical calculations; install the required package and import it into RStudio as follows.
Now split the data into two parts, training and testing. There are many ways to split data; here I am using the caret package. After splitting the data, build your model using the training data set.
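A 70/30 split can be sketched as below. The post uses caret::createDataPartition, which additionally stratifies on the target; this base-R version keeps the sketch dependency-free and the data is synthetic:

```r
# 70/30 train/test split on synthetic data
set.seed(6)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

idx   <- sample(nrow(d), size = 0.7 * nrow(d))  # random 70% of row indices
train <- d[idx, ]
test  <- d[-idx, ]
c(train = nrow(train), test = nrow(test))  # 70 and 30 rows
```

With caret the equivalent would be idx <- createDataPartition(d$y, p = 0.7, list = FALSE).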
The lm() function is used to build the linear regression model; prices is my target column and houseprice_train is my training data set. summary(model) gives a summary of the model, such as the R-squared value, the significance of the variables, etc.
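The fitting step looks like this; synthetic data stands in for houseprice_train, with the prices column name taken from the post and the other column names assumed:

```r
# Fit prices against all other columns, as in the post
set.seed(7)
train <- data.frame(area  = runif(80, 500, 2000),
                    baths = sample(1:3, 80, replace = TRUE))
train$prices <- 100 * train$area + 5000 * train$baths + rnorm(80, sd = 2000)

model <- lm(prices ~ ., data = train)  # "." means all remaining columns
summary(model)  # R-squared, coefficients, significance stars
```

The formula prices ~ . is the compact way to regress the target on every other variable in the data frame.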
Here you can see the last column having stars; the variables with stars are the more significant variables for prediction, and their p-values should be less than 0.05. To keep only the significant columns, we use stepwise regression: it checks the significance of the variables and gives the final formula.
As I explained earlier, there are 3 methods of stepwise regression; here I am using the backward method. Now predict on our testing data set.
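The prediction step, including dropping the known prices column first as the post does, can be sketched on synthetic data (column names mirror the post's; the helper data is mine):

```r
# Predict on held-out rows after removing the target column
set.seed(8)
d <- data.frame(area = runif(100, 500, 2000))
d$prices <- 100 * d$area + rnorm(100, sd = 2000)

train <- d[1:70, ]
test  <- d[71:100, ]
model <- lm(prices ~ area, data = train)

# Drop prices from the test frame before predicting
test_features <- test[, setdiff(names(test), "prices"), drop = FALSE]
pred <- predict(model, newdata = test_features)
head(pred)
```

Dropping the column is not strictly required by predict(), which ignores the response, but it makes explicit that the test prices are only used afterwards for scoring.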
My testing data set already has a prices column, so I removed that column before prediction. After prediction, check the error using RMSE (root mean squared error). This is my error rate.
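RMSE itself is a one-liner; the numbers below are made up purely to show the arithmetic:

```r
# RMSE: square the errors, average them, take the square root
actual    <- c(200, 150, 320)
predicted <- c(210, 140, 300)
rmse <- sqrt(mean((actual - predicted)^2))
rmse  # sqrt((100 + 100 + 400) / 3) = sqrt(200) ≈ 14.14
```

RMSE is in the same units as the target (here, price), which makes it easy to judge whether the error is acceptable.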
Get the code and data set from the link below.