
I'd like to calculate the mutual information between two datasets, but I'd prefer not to cluster them first.

I'm thinking of using SciKit-Learn's mutual_info_score metric, but its documentation suggests the inputs should be clusterings, not whole datasets. My intuition is that clustering is necessary because calculating the complete mutual information score between large datasets is computationally expensive.

The datasets I'm trying to compare are large, 400,000 rows by 180 columns. Do I have to use clustering on these datasets?

Connor

1 Answer


Technically, Mutual Information (MI) is a measure of how dependent two variables are on each other. It can be used in many settings, as long as the two variables can be defined.

Additionally, it is essential that these two variables represent the same entities, for example the height and the weight of a sample of individuals listed in the same order: $(h_1,w_1),(h_2, w_2), (h_3, w_3), ...$

So you don't have to have clusters, but you need to compare two variables, not two complex datasets. This is certainly the reason why the documentation mentions clusterings of the same dataset, since the two lists of clusters can serve as variables.
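As a sketch of what this looks like in practice (the height/weight data here is synthetic, and the 20-bin discretization is an arbitrary choice): mutual_info_score expects discrete labels, so continuous variables are typically binned first.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Hypothetical paired variables: heights (cm) and correlated weights (kg),
# listed in the same order so each index refers to the same individual.
heights = rng.normal(170, 10, size=1000)
weights = 0.9 * heights - 80 + rng.normal(0, 5, size=1000)

# mutual_info_score works on discrete labels, so bin the continuous values.
h_bins = np.digitize(heights, np.histogram_bin_edges(heights, bins=20))
w_bins = np.digitize(weights, np.histogram_bin_edges(weights, bins=20))

mi = mutual_info_score(h_bins, w_bins)  # MI in nats; > 0 for dependent variables
print(mi)
```

Note that the result depends on the binning: finer bins tend to inflate the estimate on finite samples.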

So with 180 columns, it's not clear how you can apply MI: every pair of columns? That would give you 180*179/2 = 16110 scores (and it would probably take a long time to compute them).
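To make the pairwise idea concrete, here is a minimal sketch on a small stand-in matrix (5 columns instead of 180, with two columns made artificially dependent); the binning helper and bin count are assumptions, not part of the original question:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # stand-in for the 400,000 x 180 dataset
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=1000)   # make columns 0 and 1 dependent

def binned(col, bins=20):
    # Discretize a continuous column so mutual_info_score can handle it.
    return np.digitize(col, np.histogram_bin_edges(col, bins=bins))

n_cols = X.shape[1]
scores = {}
for i in range(n_cols):
    for j in range(i + 1, n_cols):                # n*(n-1)/2 distinct pairs
        scores[(i, j)] = mutual_info_score(binned(X[:, i]), binned(X[:, j]))

best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

With 180 columns the same double loop produces the 16110 scores mentioned above, which is why this gets expensive quickly.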

Erwan