Could someone give an example of applying tf-idf to sparse data (lots of zeros) in sklearn? I am not quite sure where to insert the tf-idf weights or how to obtain them correctly. Without understanding this fully, I would not be able to use the tool for prediction. Thank you.

yearntolearn

1 Answer

There is an application of tf-idf on the sklearn website.

sklearn handles sparse matrices for you, so I wouldn't worry about it too much:

Fortunately, most values in X will be zeros since for a given document less than a couple thousands of distinct words will be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory. scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures.
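To see this in practice, here is a quick check (with a made-up toy corpus) confirming that `CountVectorizer` returns a `scipy.sparse` matrix rather than a dense array, so only the non-zero counts are stored:

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import issparse

# Toy corpus, purely for illustration
corpus = ["the cat sat", "the dog sat", "the cat ran"]

X = CountVectorizer().fit_transform(corpus)

print(issparse(X))      # True: a scipy.sparse matrix, not a dense ndarray
print(X.shape)          # (3 documents, 5 distinct words)
print(X.nnz)            # number of stored non-zero entries
```

All sklearn estimators that accept these feature matrices work on the sparse representation directly, so you never need to call `.toarray()` yourself (and for large corpora you should not, as that would materialise all the zeros).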

Regarding your point about inserting the weights: I guess you have already performed tf-idf on your training corpus, but you don't know how to apply it to your test corpus? If so, you could do as follows (taken from the above link):

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)

# Perform tf-idf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

docs_new = ['God is love', 'OpenGL on the GPU is fast'] # New test documents
X_test_counts = count_vect.transform(docs_new) # Count vectorise the new documents
X_test_tfidf = tfidf_transformer.transform(X_test_counts) # Transform the test counts
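To then actually predict something from these features, a common next step is to fit a classifier on the tf-idf matrix. Here is a minimal end-to-end sketch, assuming you have training labels (`y_train` and the tiny corpus below are made up for illustration; they are not part of the original snippet):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy training corpus and labels (hypothetical, for illustration only)
data = ["God is love", "love thy neighbour",
        "OpenGL renders on the GPU", "the GPU is fast"]
y_train = ["religion", "religion", "graphics", "graphics"]

# Same pipeline as above: counts, then tf-idf
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Fit a classifier on the (sparse) tf-idf features
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# Apply the fitted vectoriser/transformer to new documents, then predict
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_test_counts = count_vect.transform(docs_new)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)
print(predicted)
```

The key point is that you only ever call `fit_transform` on the training corpus; the test corpus goes through `transform` alone, so it is weighted with the vocabulary and idf values learned from training.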
Harpal