1. Principal Component Analysis (PCA)
- Unsupervised estimator
- Dimensionality reduction algorithm: PCA reduces the dimensionality of a dataset by transforming a large set of features into a smaller one while keeping most of the information in the dataset. It improves performance (speed, memory) at the cost of a small loss in accuracy, and lower-dimensional data is easier and faster to explore and visualize.
- Noise filtering (a minimal sketch follows this list)
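As a minimal sketch of the noise-filtering idea (assuming scikit-learn's PCA and the digits dataset used later in these notes; the noise level and the variance threshold are illustrative choices): fit PCA on noisy data, keep only the leading components, and map back to the original space so that the low-variance, mostly-noise directions are dropped.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
rng = np.random.RandomState(42)
noisy = digits.data + 2 * rng.normal(size=digits.data.shape)  # add Gaussian noise to the pixels

pca = PCA(n_components=0.50)  # keep enough components to explain 50% of the variance
pca.fit(noisy)
filtered = pca.inverse_transform(pca.transform(noisy))  # project down, then back up
print(pca.n_components_)  # number of components actually kept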
2. Tools for PCA
2.1 Standardization: standardize the range of the continuous initial features so that each of them contributes equally to the analysis. This prevents features with larger ranges from dominating features with smaller ranges.
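A minimal sketch of standardization, assuming scikit-learn's StandardScaler (equivalently, subtract each feature's mean and divide by its standard deviation; the data values are made up):
import numpy as np
from sklearn.preprocessing import StandardScaler

# two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
X_std = StandardScaler().fit_transform(X)  # z-score each column: (x - mean) / std
print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # [1, 1]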
2.2 Covariance matrix: it shows how the features of the dataset vary from the mean with respect to each other (i.e., the relationships between them). This matters because features are sometimes so highly correlated that they contain redundant information (see the small sketch after the list below).
The sign of the covariance means:
- positive: the two variables increase or decrease together (correlated)
- negative: one increases when the other decreases (inversely correlated)
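A small numerical sketch of this sign interpretation, using NumPy's np.cov on made-up variables:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x                          # moves with x    -> positive covariance
z = -x + 10                        # moves against x -> negative covariance
C = np.cov(np.vstack([x, y, z]))   # 3 x 3 covariance matrix (each row is one variable)
print(C[0, 1] > 0, C[0, 2] < 0)    # True True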
2.3 Eigenvectors and Eigenvalues
An eigenvector does not change direction in a transformation.
For a square matrix A, an eigenvector v and its eigenvalue λ make this equation true: A·v = λ·v
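A minimal sketch verifying this with NumPy (the matrix is a made-up symmetric example, the same kind of matrix as a covariance matrix):
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
v = eigenvectors[:, 0]              # eigenvectors are the columns of the returned matrix
lam = eigenvalues[0]
print(np.allclose(A @ v, lam * v))  # True: A·v = λ·v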
Principal components are new features that are constructed as linear combinations of the initial features. These combinations are done in such a way that the new features (i.e., principal components) are uncorrelated and most of the information within the initial features is compressed into the first components. PCA tries to put maximum possible information in the first component, then maximum remaining information in the second and so on.
PCA lets you reduce dimensionality without losing much information: discard the components with low information and treat the remaining components as your new variables.
The eigenvectors of the covariance matrix are the directions of maximum variance, and the eigenvalues give the amount of variance carried along each eigenvector. To apply this to PCA, rank the eigenvectors in order of their eigenvalues, highest to lowest: this gives the principal components in order of significance.
In order to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of eigenvalues.
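A minimal sketch of these steps (covariance matrix, eigen-decomposition, variance percentages) on a made-up, already-standardized data matrix:
import numpy as np

rng = np.random.RandomState(0)
X_std = rng.randn(100, 3)                        # pretend: standardized data, 100 samples x 3 features
C = np.cov(X_std, rowvar=False)                  # 3 x 3 covariance matrix of the features
eigenvalues, eigenvectors = np.linalg.eigh(C)    # eigh is suited to symmetric matrices
order = np.argsort(eigenvalues)[::-1]            # rank by eigenvalue, highest to lowest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
explained = eigenvalues / eigenvalues.sum()      # fraction of variance per component
print(np.round(explained, 3))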
2.4 Feature vector
After computing the principal components, we choose whether to keep all of them or to discard those of lesser significance (low eigenvalues). From the remaining eigenvectors we form a matrix whose columns are those vectors, called the feature vector.
2.5 Project the data along the principal component axes
Multiply the transpose of the feature vector by the transpose of the standardized original dataset: FinalData = FeatureVectorᵀ · StandardizedDataᵀ (or equivalently its transpose, StandardizedData · FeatureVector, which keeps samples as rows).
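Continuing the same kind of hypothetical example, a minimal sketch of building the feature vector from the top k eigenvectors and projecting the standardized data onto them:
import numpy as np

rng = np.random.RandomState(0)
X_std = rng.randn(100, 3)                               # standardized data, 100 samples x 3 features
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order][:, :2]          # keep the top k = 2 eigenvectors as columns
projected = (feature_vector.T @ X_std.T).T              # FeatureVectorᵀ · Dataᵀ, transposed back to rows
print(projected.shape)                                  # (100, 2)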
3. Singular Value Decomposition (SVD)
SVD factorizes a matrix as A = UΣVᵀ, where the singular values σᵢ on the diagonal of Σ appear in descending order, so the most significant components come first (on the left).
We can use SVD for dimensionality reduction by keeping only the important components. This is useful when the dataset has more features (columns) than observations (rows), since it reduces the data to a smaller number of features. If we keep only the top k largest singular values (and the corresponding columns of U and rows of Vᵀ), we get an approximation B of the matrix A: B ≈ UₖΣₖVₖᵀ. Example:
from numpy import diag
from numpy import zeros
from scipy.linalg import svd
import numpy as np
# define a matrix
A = np.array([[1,2],[2,4],[3,6],[4,8]]).T
# Singular-value decomposition
U, s, VT = svd(A)
# create an m x n matrix of zeros for Sigma
Sigma = zeros((A.shape[0], A.shape[1]))
# populate the top-left block of Sigma with the diagonal matrix of singular values
Sigma[:A.shape[0], :A.shape[0]] = diag(s)
# select the top k = n_elements components
n_elements = 1
Sigma = Sigma[:, :n_elements]
VT = VT[:n_elements, :]
# reconstruct a rank-k approximation B of A
B = U.dot(Sigma.dot(VT))
print(B)
# transform: the data reduced to k components
T = U.dot(Sigma)
print(T)
Output of print(T), the data reduced to one component (print(B) reproduces A exactly, since A has rank 1): [[ -5.47722558] [-10.95445115]]
The explained variance tells you how much information (variance) is attributed to each of the principal components. This matters because when you reduce dimensionality (for example, projecting a 4-dimensional space onto 2 dimensions) you lose some of the variance (information). Using the explained_variance_ratio_ attribute, you can see, for example, that the first principal component carries about 98% of the variance and the second about 2%; together the two components contain 100% of the information.
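A minimal sketch of reading this attribute, using a small random two-feature dataset like the one in the examples below (the exact percentages depend on the data):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(8)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T  # 200 samples, 2 correlated features
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)             # fraction of variance per component
print(pca.explained_variance_ratio_.sum())       # 1.0 when all components are kept
The example below visualizes this kind of dataset together with its principal axes, and then plots the data expressed in the new axes: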
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.decomposition import PCA

# 200 samples of 2 correlated features
rng = np.random.RandomState(8)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])

pca = PCA(n_components=2)
pca.fit(X)

# draw each principal axis, scaled by the standard deviation it explains
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    comp = comp * np.sqrt(var)  # scale component by its variance explanation power
    plt.plot([0, comp[0]], [0, comp[1]], label=f"Component {i}", linewidth=5,
             color=f"C{i + 2}")

plt.figure(1)
plt.gca().set(aspect='equal',
              title="2-dimensional dataset with principal components",
              xlabel='first feature', ylabel='second feature')

fig = plt.figure(2)
fig.suptitle('projected')
# express the data in the new axes; transform() centers the data before projecting
X_pca = pca.transform(X)
# X_projected = pca.inverse_transform(X_pca)
# loss = ((X - X_projected) ** 2).mean()  # ~0 here, since both components are kept
# print(loss)
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.show()
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.decomposition import PCA

# same dataset, but this time only one component is kept
rng = np.random.RandomState(8)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])

pca = PCA(n_components=1)
pca.fit(X)

# draw the retained principal axis, scaled by the standard deviation it explains
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    comp = comp * np.sqrt(var)  # scale component by its variance explanation power
    plt.plot([0, comp[0]], [0, comp[1]], label=f"Component {i}", linewidth=5,
             color=f"C{i + 2}")

plt.figure(1)
plt.gca().set(aspect='equal',
              title="2-dimensional dataset with principal components",
              xlabel='first feature', ylabel='second feature')

# project onto the single component, then map back to the original 2-D space
X_pca = pca.transform(X)
fig = plt.figure(2)
fig.suptitle('projected')
X_projected = pca.inverse_transform(X_pca)
loss = ((X - X_projected) ** 2).mean()  # mean squared reconstruction error
print(loss)
plt.scatter(X_projected[:, 0], X_projected[:, 1])
plt.show()
from sklearn.datasets import load_digits
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(2)  # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)

# scatter of the digits in the plane of the first two components, colored by digit label
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('RdBu', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
plt.show()
4.4 Choosing the number of components
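A common way to choose the number of components is to look at the cumulative explained variance as a function of the number of components. A minimal sketch, assuming the digits example above:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA().fit(digits.data)  # keep all 64 components
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()
# alternatively, let PCA pick the number of components needed for, say, 95% of the variance
print(PCA(n_components=0.95).fit(digits.data).n_components_)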