
I have a dataset with 5000 samples and 500,000 features (all categorical with a cardinality of 3).

Two problems I'm trying to solve:

  1. Loading the dataset - I can't load it into memory despite using a computing cluster, so I'm assuming I should use a parallelization library like Dask, Spark, or Vaex. Is this the best approach?
  2. Feature selection - how do I run feature selection within a parallelization library? Can this be done with Dask, Spark, or Vaex?

2 Answers


5000 samples and 500,000 features is not that big - it all depends on how much memory you have. Also remember that you can always optimize your data format: e.g. if the values are float64, do they need to be? If they are categorical, how are they encoded (a single character or a 20-character word)? So yes, if you can load the data into memory, good for you; if not, here are some suggestions:

  1. if you only have 5K samples, you should not use all of them for feature selection - hold some out so the selection doesn't leak into your evaluation.
  2. you can drop features that have very low variance - in the extreme case, if a column's variance is 0, it is certainly useless.
  3. there is a technique called feature screening, proposed by Fan et al. at Princeton (https://orfe.princeton.edu/abstracts/feature-screening-distance-correlation-learning) - in short, you lower the dimension with a univariate model first and only afterwards run multivariate feature-selection models. A sketch combining points 2 and 3 follows this list.
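
As an illustration of points 2 and 3, here is a minimal scikit-learn sketch. The file names are placeholders, and a plain chi-squared test stands in for the distance-correlation screening used in the paper:

```python
# Minimal sketch of suggestions 2 and 3. Assumes the data fits in memory
# as int8 and that "features.csv" / "labels.csv" exist (placeholder names).
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X = pd.read_csv("features.csv", dtype=np.int8)    # 5000 x 500000, values in {0, 1, 2}
y = pd.read_csv("labels.csv").squeeze("columns")  # target vector

# Suggestion 2: drop constant (zero-variance) columns; raise the threshold
# to be stricter.
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)

# Suggestion 3 (screening step): keep the top-k columns by a univariate test,
# then feed X_screened to a multivariate feature-selection model.
# chi2 is valid here because the encoded categories are non-negative.
screen = SelectKBest(chi2, k=10_000)
X_screened = screen.fit_transform(X_var, y)
print(X_screened.shape)
```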
user702846
  • Thanks for these points. I'm already encoding the categories as single characters (0, 1, 2). I will absolutely not use all samples for feature selection; I'm pretty watchful of data leakage. I will start with removing low-variance features and see how many that leaves me with. Thanks for mentioning the feature-screening strategy, going to look at the paper. – applebanana_456789 May 20 '21 at 13:04
  • make sure you use the most efficient type for your columns - if you only have (0, 1, 2), use int8. – user702846 May 20 '21 at 23:08
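
To make the int8 comment concrete, a quick back-of-the-envelope calculation of the raw array size:

```python
# Rough size of a dense 5,000 x 500,000 matrix at two dtypes.
import numpy as np

n_samples, n_features = 5_000, 500_000
bytes_int8 = n_samples * n_features * np.dtype(np.int8).itemsize
bytes_f64 = n_samples * n_features * np.dtype(np.float64).itemsize
print(f"int8:    {bytes_int8 / 1e9:.1f} GB")   # 2.5 GB
print(f"float64: {bytes_f64 / 1e9:.1f} GB")    # 20.0 GB
```

At int8 the dense matrix is about 2.5 GB, versus roughly 20 GB at float64 - often the difference between fitting in memory and not.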

For the first part, I would guess your matrix is sparse. You can convert it to a sparse format and then hold it in memory.
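
A minimal sketch of that idea, with a placeholder file name and chunk size: reading the CSV in chunks and converting each chunk to a sparse matrix means the full dense matrix never has to exist in memory.

```python
# Read the data chunk by chunk, convert each chunk to CSR, and stack.
# Only the non-zero entries are stored, so if most values are 0 this
# can shrink memory use dramatically.
import numpy as np
import pandas as pd
from scipy import sparse

chunks = []
for chunk in pd.read_csv("features.csv", dtype=np.int8, chunksize=500):
    chunks.append(sparse.csr_matrix(chunk.values))
X_sparse = sparse.vstack(chunks, format="csr")
print(X_sparse.shape, X_sparse.nnz)
```

Note that a sparse format only changes how the zeros are stored, not what they mean: any computation still sees them as 0.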

For the second part, it depends on how sparse your matrix is and how many features you want to select. One way is to take the top n most variable features, run PCA, and keep the top m PCs. n and m depend on the sparsity of your matrix: n can be anywhere from 5,000 to 50,000, and you can choose m by plotting the variance explained by each PC and finding the inflection point.
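
A rough sketch of that recipe, reusing X_sparse from the sketch above. TruncatedSVD stands in for PCA because it accepts sparse input directly (it skips mean-centering, so it is not exactly PCA); n and m are placeholder values to tune:

```python
# Keep the top-n columns by variance, then reduce to m components and
# plot the explained variance per component to find the inflection point.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

n, m = 20_000, 100  # placeholder values

# Column variance without densifying: Var[x] = E[x^2] - E[x]^2.
col_mean = np.asarray(X_sparse.mean(axis=0)).ravel()
col_mean_sq = np.asarray(X_sparse.multiply(X_sparse).mean(axis=0)).ravel()
variances = col_mean_sq - col_mean ** 2

top_n = np.argsort(variances)[-n:]  # indices of the n most variable columns
X_top = X_sparse[:, top_n]

svd = TruncatedSVD(n_components=m)
X_pcs = svd.fit_transform(X_top)

plt.plot(svd.explained_variance_ratio_)
plt.xlabel("component")
plt.ylabel("explained variance ratio")
plt.show()
```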

Amirby
  • Most of the dataset is 0, but 0 actually means something in my context. Should it still be made into a sparse matrix? And for clarity, do you mean to load it parallelized, then convert it to sparse, then load it in memory? I'll try the tips you gave for the second part, thanks – applebanana_456789 May 21 '21 at 17:35