We cannot apply Linear Regression to just any dataset. If we have a dataset and plan to fit a Linear Regression model on it, we must first check whether the data is in line with the assumptions of Linear Regression. Only when the data satisfies these assumptions can we build an effective Linear Regression model.
Please find below the top 5 assumptions of the Linear Regression algorithm.
Linearity: Linear regression assumes that the relationship between each feature and the target is linear. This assumption can be checked with a scatter plot of each feature against the target; the scatter plot shows whether the feature and the target are linearly related or not.
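Alongside the scatter plot, the Pearson correlation coefficient gives a quick numeric proxy for the same check. Below is a minimal sketch on synthetic data, assuming NumPy is available; the variable names are illustrative, not from any particular library convention.

```python
import numpy as np

def linearity_check(x, y):
    # Pearson correlation: values near +1 or -1 suggest a roughly
    # linear feature-target relationship; values near 0 do not.
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 200)
y_linear = 3 * x + rng.normal(0, 1, 200)   # linear relationship
y_quad = x ** 2 + rng.normal(0, 1, 200)    # non-linear (U-shaped) relationship

print(round(linearity_check(x, y_linear), 2))  # close to 1
print(round(linearity_check(x, y_quad), 2))    # near 0
```

Note the caveat this example illustrates: a near-zero correlation rules out a linear fit here, but correlation alone can miss non-monotonic patterns, which is why the scatter plot remains the primary check.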
Normality: Linear regression assumes that the residuals of the model are normally distributed (Gaussian); in practice, strongly skewed features or a skewed target often produce non-normal residuals, so their distributions are commonly checked as well. A simple way to inspect the distribution of a sample is to plot a histogram: if it forms a bell-shaped curve, the data is approximately normal.
If the distribution is not normal, a log transform can often bring it closer to normal.
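The histogram check and the log-transform fix can be sketched numerically with sample skewness (near 0 for a bell curve) and the Shapiro-Wilk test. A minimal example on synthetic right-skewed data, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Lognormal data: a classic right-skewed sample whose log is exactly normal.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Sample skewness: ~0 for normal data, large and positive for right-skewed data.
print(round(stats.skew(skewed), 2))          # strongly positive
print(round(stats.skew(np.log(skewed)), 2))  # near 0 after the log transform

# Shapiro-Wilk test: a small p-value rejects normality.
print(stats.shapiro(skewed).pvalue < 0.05)   # True: raw data is not normal
```

A log transform only helps with right-skewed, positive-valued data; other shapes may call for other transforms.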
Minimal multicollinearity: multicollinearity means the features are correlated with each other. It can be detected by computing the correlation matrix of the features (or variance inflation factors). To fix the problem, we usually drop some of the highly correlated features.
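The variance inflation factor (VIF) makes this check concrete: it is 1/(1-R²) from regressing each feature on the remaining ones, and values above roughly 5-10 flag multicollinearity. A minimal NumPy-only sketch (the `vif` helper is written here for illustration; statsmodels provides an equivalent function):

```python
import numpy as np

def vif(X):
    # VIF per column: regress each feature on the others with least squares,
    # then compute 1 / (1 - R^2). High VIF = feature is nearly redundant.
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)                   # independent of x1
x3 = x1 + rng.normal(scale=0.1, size=300)   # nearly a copy of x1

v = vif(np.column_stack([x1, x2, x3]))
print([round(val, 1) for val in v])  # x1 and x3 high, x2 near 1
```

Here dropping either x1 or x3 would resolve the problem, which matches the "remove some features" fix described above.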
No autocorrelation: autocorrelation is a measure of the relationship between a variable's current value and its past values. It comes into play when the residuals are not independent of each other, which is common with time-series data. Autocorrelation can be checked by plotting the residuals in order (or with the Durbin-Watson test).
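The Durbin-Watson statistic is a standard numeric version of this check: it is about 2 when residuals are independent, falls toward 0 under positive autocorrelation, and rises toward 4 under negative autocorrelation. A minimal sketch on synthetic residuals, assuming NumPy (the helper is hand-rolled for illustration; statsmodels ships the same statistic):

```python
import numpy as np

def durbin_watson(resid):
    # DW = sum of squared successive differences / sum of squared residuals.
    # ~2 -> no first-order autocorrelation; toward 0 -> positive autocorrelation.
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(3)
white = rng.normal(size=1000)   # independent residuals

ar1 = np.empty(1000)            # positively autocorrelated residuals (AR(1))
ar1[0] = white[0]
for t in range(1, 1000):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

print(round(durbin_watson(white), 2))  # near 2
print(round(durbin_watson(ar1), 2))    # well below 2
```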
Homoscedasticity: linear regression requires homoscedasticity, which means the variance of the residuals around the regression line is the same for all values of the feature X.
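A common visual check is the residuals-vs-fitted plot: a funnel shape signals heteroscedasticity. The sketch below captures that idea numerically by comparing residual spread in the low-X half against the high-X half; the `spread_ratio` helper and the threshold are illustrative choices, not a standard test (Breusch-Pagan is the formal one).

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(1, 10, 500))
y_homo = 2 * x + rng.normal(0, 1, 500)        # constant error variance
y_hetero = 2 * x + rng.normal(0, 1, 500) * x  # error variance grows with x

def spread_ratio(x, y):
    # Fit a line, then compare residual std in the upper vs lower half of x.
    # A ratio near 1 suggests homoscedasticity; a large ratio suggests a
    # funnel-shaped residual plot, i.e. heteroscedasticity.
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    mid = len(x) // 2
    return float(resid[mid:].std() / resid[:mid].std())

print(round(spread_ratio(x, y_homo), 2))    # near 1
print(round(spread_ratio(x, y_hetero), 2))  # well above 1
```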
Hence, before implementing a linear regression model, we should check all of these assumptions and, if required, preprocess the data so that it meets all 5 of them.
Happy coding !!