Why, When and How to apply Feature Selection

Why to use Feature Selection
Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution. This is the reason we use Feature Selection: to keep only the important features and drop all the unimportant ones.
When to use Feature Selection

How to apply Feature Selection
1) Dropping Constant Features
In this process we will be removing the features which are constant, since they are not important for solving the problem statement.
We will be removing the constant features using the VarianceThreshold class from sklearn.
VarianceThreshold has a parameter called threshold: features with a training-set variance lower than the given value will be removed. By default it keeps all features with non-zero variance, i.e. it removes only the features that have the same value in every sample.
In the sketch below we first build a list of all constant features and then drop them, since they are not useful.
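A minimal sketch of this step, using a small hypothetical DataFrame (the column names and values are purely for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data with one constant column
data = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5],
    "feature_2": [7, 7, 7, 7, 7],   # constant -> should be dropped
    "feature_3": [0, 1, 0, 1, 0],
})

# threshold=0 keeps only features with non-zero variance
selector = VarianceThreshold(threshold=0)
selector.fit(data)

# Columns the selector did NOT keep are the constant ones
kept_columns = data.columns[selector.get_support()]
constant_columns = [col for col in data.columns if col not in kept_columns]

data = data.drop(columns=constant_columns)
print(constant_columns)   # ['feature_2']
```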
2) Pearson Correlation
In this process we remove features which are highly correlated with other features.
In statistics, one of the most common ways that we quantify a relationship between two variables is by using the Pearson correlation coefficient, which is a measure of the linear association between two variables. It has a value between -1 and 1 where:
- -1 indicates a perfectly negative linear correlation between two variables
- 0 indicates no linear correlation between two variables
- 1 indicates a perfectly positive linear correlation between two variables
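
For reference, the sample Pearson correlation coefficient between two features x and y is computed as:

$$
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$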

As a good practice, we first do a train-test split and compute the correlations on the training data only.
We then create a function called correlation; this function returns the features (columns) whose correlation with another feature is higher than the given threshold.
Example:
We pass the training data and threshold=0.7 to the correlation function. It returns all the features which have a correlation higher than 0.7 with another feature. Then we drop those features from both the train and test data, as we don't need them.
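A sketch of such a helper, assuming the data lives in pandas DataFrames called X_train and X_test (the names are illustrative):

```python
import pandas as pd

def correlation(dataset, threshold):
    """Return the set of columns whose absolute Pearson correlation
    with an earlier column exceeds the threshold."""
    correlated_columns = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                correlated_columns.add(corr_matrix.columns[i])
    return correlated_columns

# Hypothetical usage after a train-test split
# corr_features = correlation(X_train, threshold=0.7)
# X_train = X_train.drop(columns=corr_features)
# X_test = X_test.drop(columns=corr_features)
```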
3) Mutual Information
Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
The estimators rely on nonparametric methods based on entropy estimation from k-nearest neighbors distances.
Before using these steps, please split your data using train_test_split and make sure there are no null values present in the data.
In Classification Problem
Here we will take the top 5 most important features using mutual_info_classif and SelectKBest, as in the sketch below.
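A hedged illustration, using the built-in wine dataset only so the snippet is self-contained; with your own data you would fit the selector on X_train and y_train from your split:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Example dataset with 13 numeric features and a class label
X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Keep the top 5 features ranked by mutual information with the target
selector = SelectKBest(score_func=mutual_info_classif, k=5)
selector.fit(X_train, y_train)
print(X_train.columns[selector.get_support()])   # names of the selected features

X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
```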
In Regression Problem
Here we will take the top 20 percentile of features using mutual_info_regression and SelectPercentile, as in the sketch below.
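A similar sketch for regression, here using the built-in diabetes dataset purely for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile, mutual_info_regression

# Example regression dataset with 10 numeric features
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Keep the top 20 percent of features ranked by mutual information
selector = SelectPercentile(score_func=mutual_info_regression, percentile=20)
selector.fit(X_train, y_train)
print(X_train.columns[selector.get_support()])   # names of the selected features

X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
```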
4) Chi2 Statistical Analysis
Compute chi-squared stats between each non-negative feature and class.
- This score should be used to evaluate categorical variables in a classification task.
This score can be used to select the n_features features with the highest values of the chi-squared test statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.
Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification. The Chi Square statistic is commonly used for testing relationships between categorical variables.
It compares the observed distribution of the different classes of target Y among the different categories of the feature, against the expected distribution of the target classes, regardless of the feature categories.
Performing the chi2 test returns two arrays: the chi-squared scores and the p-values. The higher the score, the more important that particular feature is; the lower the p-value, the stronger the evidence that the feature is not independent of the target. You can see both arrays in the sketch below.
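A small sketch using the iris dataset (whose features are all non-negative), just to show the two arrays and how SelectKBest can pick the top features directly:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# All iris features are non-negative measurements, so chi2 can be applied
X, y = load_iris(return_X_y=True, as_frame=True)

# chi2 returns two arrays: the chi-squared statistics and the p-values
chi2_scores, p_values = chi2(X, y)
print(pd.Series(chi2_scores, index=X.columns).sort_values(ascending=False))  # higher = more important
print(pd.Series(p_values, index=X.columns).sort_values())                    # lower = more significant

# Or let SelectKBest keep the top 2 features by chi-squared score
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])
```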
Outro and Resources
YouTube Playlist: https://www.youtube.com/watch?v=uMlU2JaiOd8&list=PLZoTAELRMXVPgjwJ8VyRoqmfNs2CJwhVH
Site: scikit-learn official documentation
Follow Me
If you find the ideas I share with you interesting, please don’t hesitate to connect here on Medium, Twitter, GitHub, or Kaggle.