The Curse of Dimensionality | PCA
What is the Curse of Dimensionality?
Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only do all these features make training extremely slow, but they can also make it much harder to find a good solution, as we will see. This problem is often referred to as the Curse of Dimensionality.
Advantages of Reducing Dimensionality
- It speeds up the training.
- Dimensionality reduction is also extremely useful for data visualization.
- In some cases, reducing the dimensionality of the data may filter out noise and unnecessary details and thus result in higher performance.
Disadvantages of Reducing Dimensionality
- Reducing Dimensionality does cause some information loss.
- It makes your pipeline more complex and harder to maintain.
So if your training is too slow, you should first try to train your system with the original data before considering dimensionality reduction.
PCA (Principal Component Analysis)
What is PCA?
Principal component analysis (PCA) is by far the most popular dimensionality reduction algorithm. It is not exactly a full machine learning algorithm on its own; rather, it is an unsupervised machine learning technique.
In the simplest terms, PCA is a feature extraction method: it creates new independent features as combinations of the old ones and keeps only the new features (principal components) that capture the most variance in the data, so the least informative directions can be dropped.

Advantages of PCA
- Lack of redundancy in the data, since the components are orthogonal.
- Principal components are uncorrelated with each other, so PCA removes correlated features.
Disadvantages of PCA
- Even the simplest invariance cannot be captured by PCA unless the training data explicitly provides this information.
- The data needs to be standardized before applying PCA, otherwise it becomes difficult to identify the optimal principal components.
Applying PCA using scikit-learn
1) Standardizing the training data
Before applying PCA you should first standardize your data. If you don't, features measured on larger scales will dominate the covariance matrix: the principal components will be "stretched" toward those features, leading to similarly "stretched" projections. (There are several good discussions online describing the geometry of PCA.)
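As a minimal sketch, assuming the training features are stored in a NumPy array called X_train (a name used here purely for illustration), standardization with scikit-learn looks like this:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data and transform it so that
# every feature has zero mean and unit variance.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)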
2) Implement PCA using scikit-learn
The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions.
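A minimal sketch of that step, assuming the standardized training data from above is stored in X_train_scaled:

from sklearn.decomposition import PCA

# Project the data onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_train_scaled)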
3) Choosing the Right Number of Dimensions
Instead of arbitrarily choosing the number of dimensions to reduce down to, it is simpler to choose the number of dimensions that adds up to a sufficiently large portion of the variance.
In the following code you can set n_components to a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve. Here, for example, I used 0.95, which means I want to preserve at least 95% of the information of the original data in the compressed data (the data obtained after applying PCA).
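A sketch of this, again assuming the standardized data lives in X_train_scaled:

from sklearn.decomposition import PCA

# Keep as many components as needed to preserve at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train_scaled)

# Number of components scikit-learn actually kept.
print(pca.n_components_)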
Some other useful methods of PCA in scikit-learn
Decompress the Reduced data
Yes, it is possible to decompress the reduced data (the data to which you applied the PCA algorithm). By applying the inverse_transform method you can get all your dimensions back. This will not give you back the original data exactly (as per our example, only 95% of the variance was preserved), but it will be close to your original data.
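For example, reusing the pca object and X_reduced from the previous sketch:

# Map the compressed data back to the original feature space.
# The result is only an approximation, since some variance was discarded.
X_recovered = pca.inverse_transform(X_reduced)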

Some Useful Attributes for PCA
Explained variance ratio
The explained variance ratio indicates the proportion of the dataset's variance that lies along each principal component (PC).
>>> pca.explained_variance_ratio_
array([0.842, 0.146])
This output tells us that 84.2% of the dataset's variance lies along the first PC, and 14.6% lies along the second PC.
Outro and Resources
Book: Hands-On Machine Learning with Scikit-Learn and TensorFlow.
Site: scikit-learn official PCA documentation.
Follow Me
If you find the ideas I share with you interesting, please don’t hesitate to connect here on Medium, Twitter, GitHub, or Kaggle.