I am following the Coursera NLP specialization, and in particular the lab "Another explanation about PCA" in Course 1 Week 3.
From the lab, I recovered the following code. It creates 2 random variables, rotates them to make them dependent and correlated, and then runs PCA on them:
import numpy as np
from sklearn.decomposition import PCA
import math
std1 = 1 # The desired standard deviation of our first random variable
std2 = 0.333 # The desired standard deviation of our second random variable
x = np.random.normal(0, std1, 1000) # Get 1000 samples from x ~ N(0, std1)
y = np.random.normal(0, std2, 1000) # Get 1000 samples from y ~ N(0, std2)
# PCA works better if the data is centered
x = x - np.mean(x) # Center x
y = y - np.mean(y) # Center y
#Define a pair of dependent variables with a desired amount of covariance
n = 1 # Magnitude of covariance.
angle = np.arctan(1 / n) # Convert the covariance to an angle
print('angle: ', angle * 180 / math.pi)
# Create a rotation matrix using the given angle
rotationMatrix = np.array([[np.cos(angle), np.sin(angle)],
                           [-np.sin(angle), np.cos(angle)]])
# Create a matrix with columns x and y
xy = np.concatenate(([x], [y]), axis=0).T
# Get covariance matrix of xy
print("Covariance matrix of xy")
covmat = np.cov(xy, rowvar=False)
print(f"{np.sqrt(covmat[0,0]):.3f} = {std1}")
print(f"{np.sqrt(covmat[1,1]):.3f} = {std2}")
# Transform the data using the rotation matrix. It correlates the two variables
data = np.dot(xy, rotationMatrix)
# Get covariance matrix of data
print("Covariance matrix of data")
covmat = np.cov(data, rowvar=False)
print(f"{np.sqrt(covmat[0,0]):.3f} = {std1}")
print(f"{np.sqrt(covmat[1,1]):.3f} = {std2}")
print(f"{covmat[0,1]:.3f} = {n}")
# Apply PCA.
pcaTr = PCA(n_components=2).fit(data)
# In theory, the Eigenvector matrix must be the
# inverse of the original rotationMatrix.
print("** These two matrices should be equal **")
print("Eigenvector matrix")
print(pcaTr.components_)
print("Inverse of original rotation matrix")
print(np.linalg.inv(rotationMatrix))
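One property worth noting here (my own sanity check, not part of the lab code): a rotation matrix is orthogonal, so its inverse is just its transpose. A quick verification with the same angle as above:

```python
import numpy as np

angle = np.arctan(1 / 1)  # same angle as in the lab code (45 degrees)
R = np.array([[np.cos(angle), np.sin(angle)],
              [-np.sin(angle), np.cos(angle)]])

# An orthogonal matrix satisfies R @ R.T = I, hence inv(R) == R.T
print(np.allclose(R @ R.T, np.eye(2)))      # True
print(np.allclose(np.linalg.inv(R), R.T))   # True
```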
I get the following output:
angle: 45.0
Covariance matrix of xy
1.031 = 1
0.325 = 0.333
Covariance matrix of data
0.764 = 1
0.765 = 0.333
0.479 = 1
** These two matrices should be equal **
Eigenvector matrix
[[ 0.70632393 0.70788877]
[ 0.70788877 -0.70632393]]
Inverse of original rotation matrix
[[ 0.70710678 0.70710678]
[-0.70710678 0.70710678]]
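For context (my own check, not from the lab): since `data = xy @ R`, the covariance matrix of the rotated data can be computed analytically as Rᵀ · cov(xy) · R, which matches what `np.cov` reports. This identity shows how the rotation mixes the original variances into both coordinates. A minimal sketch, with a fixed seed for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (my addition)
x = rng.normal(0, 1, 1000)
y = rng.normal(0, 0.333, 1000)
xy = np.column_stack((x - x.mean(), y - y.mean()))

angle = np.pi / 4
R = np.array([[np.cos(angle), np.sin(angle)],
              [-np.sin(angle), np.cos(angle)]])

data = xy @ R
# cov(xy @ R) = R.T @ cov(xy) @ R, so both variances of `data`
# are mixtures of the original std1**2 and std2**2
print(np.allclose(np.cov(data, rowvar=False),
                  R.T @ np.cov(xy, rowvar=False) @ R))  # True
```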
- Why does `n = 1` define the magnitude of the covariance?
- Why can we obtain the angle between the variables as `angle = np.arctan(1 / n)`?
- Why don't I obtain this covariance between the variables when I take element (0, 1) of the covariance matrix (the second call to `np.cov`, i.e. the line `covmat = np.cov(data, rowvar=False)`)?
- Why does the rotation change the variances of the variables from the initial `std1` and `std2`?
- Why must, "in theory, the Eigenvector matrix be the inverse of the original rotationMatrix"?
- Why is this not the case here?
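For what it's worth, here is what I tried so far (my own experiment, not from the lab): when I diagonalize the *exact* theoretical covariance Rᵀ · diag(std1², std2²) · R instead of the sample estimate, the eigenvectors do coincide with the rows of the rotation matrix, but only up to sign and ordering. So I suspect sampling noise plus this sign/ordering ambiguity, but I'd like to understand it properly:

```python
import numpy as np

angle = np.pi / 4
R = np.array([[np.cos(angle), np.sin(angle)],
              [-np.sin(angle), np.cos(angle)]])

# Exact covariance of the rotated data, free of sampling noise
S = R.T @ np.diag([1.0, 0.333**2]) @ R
w, V = np.linalg.eigh(S)  # eigenvalues returned in ascending order

# Up to sign and ordering, the eigenvectors (columns of V)
# coincide with the rows of R, so |V.T @ R.T| is a permutation matrix
M = np.abs(V.T @ R.T)
print(np.allclose(M, [[0, 1], [1, 0]]))  # True
```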