1

I have a sparse matrix, $X$, created by TfidfVectorizer, with shape $(500000, 200000)$. I want to convert $X$ to a data frame, but I always get a memory error.

I tried

pd.DataFrame(X.toarray(), columns=tokens)

and

pd.read_csv(X.toarray().astype("float32"), columns=tokens, chunksize=...).

It seems that the error occurs as soon as I convert $X$ to a NumPy array with X.toarray().
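For context, a back-of-the-envelope estimate shows why the dense conversion fails at this shape. The sketch below only does arithmetic on the shape stated above; the helper name `dense_gib` is hypothetical and not part of any library.

```python
# Estimate the memory a dense array of the stated shape would need.
rows, cols = 500_000, 200_000

def dense_gib(n_rows, n_cols, bytes_per_value):
    """Size in GiB of a dense n_rows x n_cols array (hypothetical helper)."""
    return n_rows * n_cols * bytes_per_value / 2**30

float64_gib = dense_gib(rows, cols, 8)  # default dtype of X.toarray() for TF-IDF output
float32_gib = dense_gib(rows, cols, 4)  # after .astype("float32")

print(f"float64: {float64_gib:.0f} GiB, float32: {float32_gib:.0f} GiB")
```

Both figures are in the hundreds of GiB, so casting to float32 alone cannot bring the dense array under a 100 GB limit.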

Can someone suggest an easy solution? Is there any way I can create a sparse data frame from $X$ without a memory error?

I have been running my code on Google Colab Pro, which I think provides less than 100 GB of RAM.

  • 1
    I think the real answer here is "don't do that". Look into one of the distributed computing frameworks instead of trying to do everything on one machine. – Philip Kendall Apr 11 '21 at 18:03

3 Answers

3

You can use pandas.DataFrame.sparse.from_spmatrix. It creates a DataFrame whose columns are pd.arrays.SparseArray objects, built from a SciPy sparse matrix.

Pandas used to have an explicit SparseDataFrame class, but modern versions no longer have that concept; there is only the regular pd.DataFrame holding sparse data.
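A minimal sketch of that call, using a small random matrix as a stand-in for the real TF-IDF output. The sizes and the `tok_` column names are assumptions made for illustration only.

```python
import pandas as pd
from scipy import sparse

# Small random stand-in for the TF-IDF matrix (sizes are illustrative).
X = sparse.random(1000, 50, density=0.01, format="csr", random_state=0)
tokens = [f"tok_{i}" for i in range(X.shape[1])]  # hypothetical column names

# Build a DataFrame backed by sparse columns without ever calling X.toarray().
df = pd.DataFrame.sparse.from_spmatrix(X, columns=tokens)

print(df.dtypes.iloc[0])   # every column has a sparse dtype
print(df.sparse.density)   # fraction of stored (non-fill) values

# The .sparse accessor converts back to a SciPy matrix when a model needs one.
X_back = df.sparse.to_coo()
```

Many estimators accept the SciPy matrix directly, so going back through `df.sparse.to_coo()` (or skipping the data frame altogether) avoids densifying the data.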

zachdj
  • The sparse data frame is created but the problem is how can I work with that? When I feed it into neural networks, again it gives a memory error. –  Apr 11 '21 at 03:47
  • 2
    @Moonlight You asked how to create a sparse data frame. If your actual issue was "I want to run a huge dataset through some neural nets via Pandas", then that's what you should have asked in your question. – Philip Kendall Apr 11 '21 at 18:01
2

I have had to deal with huge data frames like the one you mention; in my case the problem was "solved" by storing the data frame as a pickle with pd.to_pickle() rather than as CSV.

Memory usage was reduced by 60%.

I also recently heard about a format named Feather.

For reference:

https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
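As a rough sketch of the pickle approach, assuming the data frame has already been built from the sparse matrix: the matrix below is a small random stand-in, and the file name `tfidf.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd
from scipy import sparse

# Small random stand-in for the real matrix (sizes are illustrative).
X = sparse.random(200, 20, density=0.05, format="csr", random_state=0)
df = pd.DataFrame.sparse.from_spmatrix(X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tfidf.pkl")  # hypothetical file name
    df.to_pickle(path)                     # write the frame in binary form
    restored = pd.read_pickle(path)        # read it back

print(restored.shape)
```

This changes how the frame is stored on disk; it does not by itself make a dense conversion fit in memory.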

Multivac
  • This one seems to need a path to store a previously created data frame. I haven’t created a data frame yet –  Apr 11 '21 at 03:49
0

Besides using a sparse array, you can also shrink the vocabulary with the max_df, min_df, or max_features parameters of TfidfVectorizer.

MANU