Why and How to Handle Missing Values
Why Is Handling Missing Values Important in Data Science?
Missing values are one of the most common problems in real-world datasets. They arise for various reasons, such as undefined values, data collection errors, and incorrectly implemented SQL joins when merging datasets. Most ML algorithms do not handle missing values well, so they usually have to be dealt with before training.
How to Handle Missing Values?
Types of Handling Missing Values:
CCA (Complete Case Analysis)
It is also called list-wise deletion of cases. It consists of discarding observations (rows) where the value of any variable is missing.
In other words, Complete Case Analysis analyzes only the observations (rows) that have information for every variable in the dataset, and removes the rows that have missing values.
When To Use CCA?
- When data is MCAR (Missing Completely at Random).
- When missing data is less than 5%.
Advantages of CCA
- Easy to implement as no data manipulation is required.
- Preserves the variable distribution (when the data is MCAR).
Disadvantages of CCA
- It can exclude a large fraction of the original dataset (if missing data is abundant).
- Excluded observations could be informative for the analysis (if the data is not MCAR).
- When the model is used in production, it will not know how to handle missing data.
Example of CCA
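Here is a minimal sketch of CCA with pandas. The DataFrame and its column names (`age`, `salary`, `city`) are made up for illustration, and the toy data has far more than 5% missing values, so treat it only as a demonstration of the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, 22.0],
    "salary": [50000.0, 62000.0, np.nan, 58000.0, 45000.0],
    "city": ["Pune", "Delhi", "Mumbai", None, "Pune"],
})

# Fraction of missing values per column, to compare against the 5% guideline.
missing_ratio = df.isnull().mean()

# Complete Case Analysis: keep only rows with no missing value in any column.
df_cca = df.dropna(axis=0, how="any")

print(missing_ratio)
print(f"Rows kept: {len(df_cca)} of {len(df)}")
```

Checking `missing_ratio` first is a cheap way to see whether dropping rows would throw away too much of the dataset.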
Univariate Imputation
1) Mean/Median Imputation
When the distribution of the data is approximately normal, you can use mean imputation.
When the distribution of the data is skewed, you can use median imputation.
When to use Mean/Median Imputation?
- When data is MCAR (Missing Completely at Random).
- When data has less than 5% missing values.
Disadvantages of Mean/Median Imputations
- It changes the shape of your distribution.
- Outliers: the mean itself is pulled by outliers, and the reduced spread after imputation can make more points look like outliers.
- It changes the covariance/correlation with other columns.
It is important to know that whenever you use a mean/median imputer, the variance of that column shrinks. Just make sure it does not change significantly; if it does, that is a red flag that you should use another technique.
Example of Mean/Median Imputation
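A small sketch using scikit-learn's `SimpleImputer` is shown here. The single-column array is invented for illustration and deliberately includes one outlier, so you can see how differently the mean and the median behave, and how the variance shrinks after imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature column: one missing value and one outlier (100).
X = np.array([[10.0], [12.0], [np.nan], [11.0], [100.0]])

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

X_mean = mean_imputer.fit_transform(X)
X_median = median_imputer.fit_transform(X)

# The learned fill values are stored in `statistics_`.
print("mean fill value:", mean_imputer.statistics_[0])
print("median fill value:", median_imputer.statistics_[0])

# Compare the variance before imputation (ignoring NaN) and after it.
var_before = np.nanvar(X)
var_after_mean = X_mean.var()
print("variance before:", var_before, "after mean imputation:", var_after_mean)
```

Because the imputers are fitted objects, the same fill value learned from the training data can be reused on new data with `transform`.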
2) Categorical Imputation
When the data in a categorical column is MCAR and less than 5% of its values are missing, we can fill the missing values with the most frequent category.
If more than 5% of the values are missing, we create a new category called "Missing" and use it to replace the NaN values.
Example of Categorical Imputation
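Both options can be sketched with plain pandas. The `color` column below is a made-up example, and the label `"Missing"` is just one possible name for the new category.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame({"color": ["red", "blue", np.nan, "red", np.nan, "green"]})

# Option 1: fill with the most frequent category (the mode).
most_frequent = df["color"].mode()[0]
most_frequent_filled = df["color"].fillna(most_frequent)

# Option 2: treat missingness as its own category.
new_category_filled = df["color"].fillna("Missing")

print(most_frequent_filled.tolist())
print(new_category_filled.tolist())
```

scikit-learn's `SimpleImputer` offers the same two behaviours through `strategy="most_frequent"` and `strategy="constant"` with a `fill_value`, which is handy inside a pipeline.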
Multivariate Technique
In the multivariate technique, to fill the missing data of a single column, we use the information in the other columns of the dataset.
1) KNN Imputer
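The idea behind the KNN imputer is that a missing value in a row is replaced using the values of the most similar rows (in scikit-learn, the average over the k nearest neighbors). The similarity of any two rows is measured by Euclidean distance, computed over the columns where both rows have values.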
Advantage/Disadvantage of KNN Imputer
- Often more accurate than univariate methods.
- Requires more computation (on large datasets).
- In production, you have to deploy your whole training set.
If the dataset is small, this method can be very accurate.
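As a rough illustration, here is scikit-learn's `KNNImputer` on a tiny made-up array. The choice of `n_neighbors=2` is an assumption for this toy example, not a recommended setting.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that column over the
# 2 nearest rows, using a Euclidean distance that skips missing coordinates.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The fitted imputer keeps the training rows so it can find neighbors for new data, which is why the training set has to travel with the model.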
2) Iterative Imputer / MICE
MICE stands for Multivariate Imputation by Chained Equations.
When to use MICE?
- When data is MAR (Missing at Random).
Advantage/Disadvantage of Iterative Imputer
- More accurate
- Makes the model slow
- In production, you have to deploy your whole training set.
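A minimal sketch with scikit-learn's `IterativeImputer`, which is inspired by MICE, might look like this. The array is invented for illustration, and `max_iter=10` and `random_state=0` are arbitrary choices. Note that the estimator is still marked experimental in scikit-learn, so it needs the explicit `enable_iterative_imputer` import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data in which the columns are roughly multiples of each other.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, np.nan],
    [3.0, np.nan, 9.0],
    [np.nan, 8.0, 12.0],
    [5.0, 10.0, 15.0],
])

# Each column with missing values is modelled from the other columns,
# and the predictions are refined over several rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

print(np.round(X_imputed, 2))
```

Only the missing cells are changed; the observed values are passed through untouched.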
Outro and Resources
Site: scikit-learn official documentation
YouTube: https://youtube.com/playlist?list=PLKnIA16_Rmvbr7zKYQuBfsVkjoLcJgxHH
Follow Me
If you find the ideas I share with you interesting, please don’t hesitate to connect here on Medium, Twitter, GitHub, or Kaggle.