As a beginner, I could not understand why none of my machine learning models predicted well on the Ames housing dataset. I did all the hyperparameter tuning, and although I could see a little improvement, I couldn't see a great one. This made me think something was definitely not right. I checked a few kernels on Kaggle and realized that when the data is heavily skewed rather than roughly normally distributed, many ML models, linear regression in particular, struggle to predict well.
Here is a look at the dataset that is skewed:
This is a histogram that shows the sale price of houses from the Ames dataset.
This distribution is positively skewed. Notice that the black curve has a long tail stretching towards the right, while most of the houses sit at the lower prices on the left. If you find that your response variable (the one you are predicting) is skewed, it is recommended to fix the skewness so that the model can make better predictions.
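If you would rather check the skew with a number than by eye, here is a minimal sketch. It does not use the Ames data: the prices are a made-up log-normal sample, and the `skewness` helper is my own small function (in practice, pandas' `Series.skew()` or `scipy.stats.skew` do the same job). A positive value means a long right tail, and the mean being larger than the median is another sign of it.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness as the average cubed z-score (about 0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


random.seed(0)
# Synthetic stand-in for sale prices (NOT the real Ames data):
# a log-normal sample has the same kind of long right tail.
prices = [random.lognormvariate(12.0, 0.4) for _ in range(2000)]

price_skew = skewness(prices)
mean_price = statistics.fmean(prices)
median_price = statistics.median(prices)

print(f"skewness = {price_skew:.2f}")  # positive for a right tail
print(f"mean is above median: {mean_price > median_price}")
```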
Okay, so how do I fix the skewness?
A common way to fix it is to take the log transform of the same data, with the intent of reducing the skewness.
After taking the logarithm of the same data, the curve looks much closer to a normal distribution. It is not perfectly normal, but this is sufficient to fix the issues from the skewed dataset that we saw before.
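To see the effect in numbers, here is a small sketch that measures skewness before and after the log. Again the prices are a synthetic sample that I generate for illustration, not the real sale prices, and `skewness` is a hand-written helper. The log needs strictly positive values; for real data containing zeros, log(1 + x), for example `numpy.log1p`, is a common substitute.

```python
import math
import random
import statistics


def skewness(values):
    """Average cubed z-score: positive means a long right tail."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])


random.seed(1)
# Made-up, right-skewed "sale prices" (all strictly positive).
prices = [random.lognormvariate(12.0, 0.4) for _ in range(2000)]

# The log transform compresses the large values far more than the small ones,
# which pulls the long right tail in.
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)

print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```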
If you’re curious how log transformation can reduce skewness, take a look at this paper here.
Important: If you log transform the response variable, you should usually also log transform the feature variables that are skewed.
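One way this could look in code is sketched below. Everything here is an assumption for illustration: the column names are hypothetical, the data is randomly generated, and the cutoff of 0.75 is just a rule of thumb I have seen in kernels, so tune it for your own data. The sketch only handles positive skew and uses log(1 + x) so that a zero in a column does not break the transform.

```python
import math
import random


def skewness(values):
    """Average cubed z-score: positive means a long right tail."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


SKEW_THRESHOLD = 0.75  # assumed cutoff, not a fixed rule

random.seed(2)
# Hypothetical feature columns with made-up values.
features = {
    "lot_area": [random.lognormvariate(9.0, 0.8) for _ in range(1000)],  # long right tail
    "overall_quality": [random.gauss(6.0, 1.0) for _ in range(1000)],    # roughly symmetric
}

transformed = {}
log_columns = []
for name, column in features.items():
    if skewness(column) > SKEW_THRESHOLD:
        # log1p(x) = log(1 + x), which is still defined when x is 0
        transformed[name] = [math.log1p(v) for v in column]
        log_columns.append(name)
    else:
        # Leave columns that are not skewed untouched
        transformed[name] = list(column)

print("log-transformed columns:", log_columns)
```

Keep in mind that a model trained on the log of the response predicts on the log scale, so predictions need `math.exp` (or `math.expm1` if you used `log1p`) to get back to the original units.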
By now you must be wondering why skewed data messes up the predictive model. The short answer: it affects the regression intercept and the coefficients of the model, because the few extreme values in the long tail pull the fitted line towards themselves.
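A tiny illustration of that short answer, with made-up numbers rather than housing data: fit a straight line by ordinary least squares to ten well-behaved points, then push a single value far into the right tail and fit again. The `fit_line` helper is my own minimal implementation of simple linear regression, written out only to keep the example self-contained.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


xs = list(range(1, 11))

# Ten points that lie exactly on y = 1 + 2x
ys = [2 * x + 1 for x in xs]
clean_intercept, clean_slope = fit_line(xs, ys)

# Same data, but the last value is moved far into the right tail
ys_tail = ys[:-1] + [200]
tail_intercept, tail_slope = fit_line(xs, ys_tail)

print(f"without tail value: intercept={clean_intercept:.2f}, slope={clean_slope:.2f}")
print(f"with tail value:    intercept={tail_intercept:.2f}, slope={tail_slope:.2f}")
```

One extreme value out of ten is enough to change both numbers a lot, which is the kind of distortion a long tail in the response variable can cause.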