Interpreting the results of randomized PCA in scikit-learn

Question

I'm using scikit-learn to do a genome-wide association study with a feature vector of about 100K SNPs. My goal is to tell the biologists which SNPs are "interesting".

RandomizedPCA really improved my models, but I'm having trouble interpreting the results. Can scikit-learn tell me which features are used in each component?

score 4 · Accepted Answer · answered Mar 08 '16 at 02:34

Yes, through the components_ attribute of the fitted estimator:

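import numpy, seaborn, pandas, sklearn.decomposition
# Three correlated features: random linear mixes of three Gaussian sources
data = numpy.random.randn(1000, 3) @ numpy.random.randn(3, 3)
seaborn.pairplot(pandas.DataFrame(data, columns=['x', 'y', 'z']));

# Note: RandomizedPCA was removed in scikit-learn 0.20; on newer versions
# use sklearn.decomposition.PCA(svd_solver='randomized') instead.
sklearn.decomposition.RandomizedPCA().fit(data).components_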

> array([[ 0.43929754,  0.81097276,  0.38644644],
       [-0.54977152,  0.58291122, -0.59830243],
       [ 0.71047094, -0.05037554, -0.70192119]])

sklearn.decomposition.RandomizedPCA(2).fit(data).components_

> array([[ 0.43929754,  0.81097276,  0.38644644],
       [-0.54977152,  0.58291122, -0.59830243]])

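We see that the two-component fit matches the first two rows of the full decomposition. Each row of components_ is one principal component and each column corresponds to one input feature (here x, y, z), so entry [i, j] is the weight of feature j in component i. The features with the largest absolute weights in a row contribute most to that component.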
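To turn the rows of components_ into a list of "interesting" features, one option is to rank the features in each component by absolute weight. The following is only a sketch under some assumptions: it uses PCA(svd_solver="randomized"), the replacement for RandomizedPCA in current scikit-learn, and the helper top_features, the feature names, and the seed are hypothetical and not part of the original answer.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)  # fixed seed, chosen only for repeatability
# Stand-in data: 1000 samples of 3 correlated features
data = rng.randn(1000, 3) @ rng.randn(3, 3)
feature_names = np.array(["x", "y", "z"])  # hypothetical; SNP IDs in practice

# Randomized solver, keeping the first two components
pca = PCA(n_components=2, svd_solver="randomized", random_state=0).fit(data)


def top_features(fitted_pca, names, k=2):
    """Hypothetical helper: per component, the k names with largest |weight|."""
    # Sort each row of components_ by descending absolute weight
    order = np.argsort(-np.abs(fitted_pca.components_), axis=1)[:, :k]
    return [list(names[row]) for row in order]


ranked = top_features(pca, feature_names)
for i, top in enumerate(ranked):
    ratio = pca.explained_variance_ratio_[i]
    print(f"component {i}: top features {top}, explained variance ratio {ratio:.3f}")
```

With 100K SNPs the same ranking applies unchanged; a larger k, or a threshold on the absolute weight, may be more useful than a fixed top two. Note that the sign of a component is arbitrary, which is why the ranking uses absolute values.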

Many thanks! And apologies for the delayed accept. – retsreg Apr 01 '16 at 01:33
