PCA Component Selection
When using Principal Component Analysis (PCA), a common question is how many components to keep. The knee point of the cumulative explained variance curve marks where each additional component starts to add little extra explained variance.
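To make the idea of a knee concrete, here is a minimal sketch in plain Python. It is not kneed's implementation: it only normalizes both axes to [0, 1] and picks the point where the curve rises furthest above the diagonal, whereas kneed does more (for example, it applies the sensitivity parameter S). The function name `simple_knee` and the variance values are made up for illustration.

```python
def simple_knee(x, y):
    """Return the x value where a concave, increasing curve rises
    furthest above the diagonal after both axes are scaled to [0, 1].

    Hypothetical helper for illustration; a simplification, not kneed's algorithm.
    """
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    x_norm = [(v - x_min) / (x_max - x_min) for v in x]
    y_norm = [(v - y_min) / (y_max - y_min) for v in y]
    # Difference curve: how far each point sits above the diagonal
    diffs = [yn - xn for xn, yn in zip(x_norm, y_norm)]
    best = max(range(len(diffs)), key=diffs.__getitem__)
    return x[best]


# Illustrative (made-up) cumulative explained variance for 8 components
cumulative = [0.40, 0.62, 0.76, 0.84, 0.89, 0.93, 0.97, 1.00]
components = list(range(1, len(cumulative) + 1))

knee = simple_knee(components, cumulative)
print(f"Knee at {knee} components")
```

With these made-up values, the gains per component shrink quickly after the third component, which is where the sketch places the knee.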
Example
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
import numpy as np
from kneed import KneeLocator
# Load dataset
X, _ = load_digits(return_X_y=True)
# Fit PCA with all components
pca = PCA().fit(X)
# Cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components = range(1, len(cumulative_variance) + 1)
# Find the knee
kl = KneeLocator(
    list(n_components),
    cumulative_variance.tolist(),
    curve="concave",
    direction="increasing",
)
print(f"Optimal components: {kl.knee}")
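The knee on its own does not say how much variance the selected components retain. The following is a minimal sketch of how that could be looked up from the cumulative curve. The helper `variance_at_knee` is hypothetical, the values are made up, and it assumes the knee is reported as `None` when no knee is found.

```python
def variance_at_knee(knee, cumulative_variance):
    """Return the cumulative variance retained by the first `knee` components.

    Hypothetical helper; returns None if no knee was found (assumed to be
    signalled by `knee` being None).
    """
    if knee is None:
        return None
    # Component counts start at 1, list indices at 0
    return cumulative_variance[int(knee) - 1]


# Illustrative (made-up) cumulative explained variance for 8 components
cumulative = [0.40, 0.62, 0.76, 0.84, 0.89, 0.93, 0.97, 1.00]

retained = variance_at_knee(3, cumulative)
print(f"3 components retain {retained:.0%} of the variance")
```

In the example above, the same lookup would take `kl.knee` and `cumulative_variance` as arguments.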
Visualizing
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.plot(n_components, cumulative_variance, "bo-", markersize=3)
plt.vlines(kl.knee, 0, 1, linestyles="--", colors="r", label=f"knee = {kl.knee}")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Component Selection")
plt.legend()
plt.show()
Tips
- Use curve="concave" and direction="increasing" for cumulative variance curves
- The knee point marks where adding more components starts to yield only small gains in explained variance
- Adjust S (the sensitivity parameter, default 1.0) to control how aggressively the knee is detected