The Curse of Dimensionality | PCA
What is the Curse of Dimensionality?
Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution, as we will see. This problem is often referred to as the Curse of Dimensionality.
Advantages of Reducing Dimensionality
- It speeds up the training.
- Dimensionality reduction is also extremely useful for data visualization.
- In some cases, reducing the dimensionality of the data may filter out noise and unnecessary details and thus result in higher performance.
Disadvantages of Reducing Dimensionality
- Reducing Dimensionality does cause some information loss.
- It makes your pipeline more complex and harder to maintain.
So if your training is too slow, you should first try to train your system with the original data before considering dimensionality reduction.
PCA (Principal Component Analysis)
What is PCA?
Principal component analysis (PCA) is by far the most popular dimensionality reduction algorithm. It is not exactly a full machine learning algorithm on its own; rather, it is an unsupervised machine learning technique.
In the simplest terms, PCA is a feature extraction method: it creates new independent features as combinations of the old ones and keeps only the new features (principal components) that capture the most variance in the data, so the least informative directions can be dropped.

Advantages of PCA
- Lack of redundancy in the data, since the components are orthogonal.
- Principal components are uncorrelated with each other, so PCA removes correlated features.
Disadvantages of PCA
- Even the simplest invariance cannot be captured by PCA unless the training data explicitly provides this information.
- The data needs to be standardized before applying PCA, otherwise it becomes difficult to identify the optimal principal components.
Applying PCA using scikit-learn
1) Standardizing the training data
Before applying PCA you should first standardize your data. If you don't, features measured on larger scales will dominate the covariance matrix: the principal components will be "stretched" toward those features, leading to similarly "stretched" projections. (There are several good discussions online describing the geometry of PCA.)
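As a minimal sketch, assuming the training features are stored in a NumPy array called X_train (a name used here purely for illustration), standardization with scikit-learn looks like this:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data and transform it so that
# every feature has zero mean and unit variance.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)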
2) Implement PCA using scikit-learn
The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions.
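A minimal sketch of that step, assuming the standardized training data from above is stored in X_train_scaled:

from sklearn.decomposition import PCA

# Project the data onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_train_scaled)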
3) Choosing the Right Number of Dimensions
Instead of arbitrarily choosing the number of dimensions to reduce down to, it is simpler to choose the number of dimensions that adds up to a sufficiently large portion of the variance.
In the following code you can set n_components to a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve. Here, for example, I used 0.95, which means I want to preserve at least 95% of the information of the original data in the compressed data (the data obtained after applying PCA).
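A sketch of this, again assuming the standardized data lives in X_train_scaled:

from sklearn.decomposition import PCA

# Keep as many components as needed to preserve at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train_scaled)

# Number of components scikit-learn actually kept.
print(pca.n_components_)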
Some other useful methods of PCA in scikit-learn
Decompress the Reduced data
Yes, it is possible to decompress the reduced data (the data to which you applied the PCA algorithm). By applying the inverse_transform method you can get all your dimensions back. This will not give you back the original data exactly (as per our example, only 95% of the variance was preserved), but it will be close to your original data.
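For example, reusing the pca object and X_reduced from the previous sketch:

# Map the compressed data back to the original feature space.
# The result is only an approximation, since some variance was discarded.
X_recovered = pca.inverse_transform(X_reduced)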

Some Useful Attributes for PCA
Explained variance ratio
The explained variance ratio indicates the proportion of the dataset's variance that lies along each principal component (PC).
>>> pca.explained_variance_ratio_
array([0.842, 0.146])
This output tells us that 84.2% of the dataset's variance lies along the first PC, and 14.6% lies along the second PC.
Outro and Resources
Book: Hands-On Machine Learning with Scikit-Learn and TensorFlow.
Site: scikit-learn official PCA documentation.
Follow Me
If you find the ideas I share with you interesting, please don’t hesitate to connect here on Medium, Twitter, GitHub, or Kaggle.