
I'm using scikit-learn to do a genome-wide association study with a feature vector of about 100K SNPs. My goal is to tell the biologists which SNPs are "interesting".

RandomizedPCA really improved my models, but I'm having trouble interpreting the results. Can scikit-learn tell me which features are used in each component?

Emre
retsreg

1 Answer


Yes, through the components_ attribute of the fitted estimator:

import numpy, seaborn, pandas, sklearn.decomposition
data = numpy.random.randn(1000, 3) @ numpy.random.randn(3,3)
seaborn.pairplot(pandas.DataFrame(data, columns=['x', 'y', 'z']));

[Figure: faceted scatter plot (pair plot) of x, y and z]

sklearn.decomposition.RandomizedPCA().fit(data).components_

> array([[ 0.43929754,  0.81097276,  0.38644644],
       [-0.54977152,  0.58291122, -0.59830243],
       [ 0.71047094, -0.05037554, -0.70192119]])

sklearn.decomposition.RandomizedPCA(2).fit(data).components_

> array([[ 0.43929754,  0.81097276,  0.38644644],
       [-0.54977152,  0.58291122, -0.59830243]])

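We see that the truncated decomposition is simply the first rows of the full decomposition. Each row of components_ is one principal component and each column corresponds to one input feature, so the entry in row i, column j is the coefficient of feature j in component i. The features whose coefficients have the largest absolute value contribute most to that component.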
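To turn those coefficients into a list of "interesting" features, one option is to rank the columns of each row by absolute value. The snippet below is only a sketch under a few assumptions: it targets a recent scikit-learn, where RandomizedPCA is no longer available and PCA(svd_solver='randomized') takes its place; top_features is a hypothetical helper written for this illustration, not part of scikit-learn; and the seed, the toy data and k=2 are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA


def top_features(components, feature_names, k=2):
    """Hypothetical helper (not a scikit-learn function).

    For each row of `components`, return the k (name, coefficient) pairs
    with the largest absolute coefficient, strongest first.
    """
    ranked = []
    for row in components:
        # Sort feature indices by decreasing |coefficient| and keep the first k.
        order = np.argsort(-np.abs(row))[:k]
        ranked.append([(feature_names[i], float(row[i])) for i in order])
    return ranked


# Toy data in the spirit of the example above; the seed is an assumption
# added here so that repeated runs give the same numbers.
rng = np.random.RandomState(0)
data = rng.randn(1000, 3) @ rng.randn(3, 3)
names = ['x', 'y', 'z']

# Assumed replacement for RandomizedPCA(2) on recent scikit-learn versions.
pca = PCA(n_components=2, svd_solver='randomized', random_state=0).fit(data)

ranking = top_features(pca.components_, names, k=2)
for i, feats in enumerate(ranking):
    print(f"component {i}: {feats}")
```

With 100K SNPs you would pass the SNP identifiers as feature_names and pick a larger k. Keep in mind that the sign of a principal component is arbitrary, which is why the ranking uses absolute values.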

Emre