
I have been reading several questions on this topic. For TF-IDF, the difference between .fit_transform() and .transform() seems quite clear to me after reading these questions:

  1. What's the difference between fit and fit_transform in scikit-learn models?

  2. How fit_transform, transform and TfidfVectorizer works

What is not clear is my situation: an NMF() model trained with a TF-IDF array as input.

From the sklearn documentation:

fit(X[, y])

Learn a NMF model for the data X.

fit_transform(X[, y, W, H])

Learn a NMF model for the data X and returns the transformed data. This is more efficient than calling fit followed by transform.

transform(X)

Transform the data X according to the fitted NMF model.
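The relationship between the three methods can be sketched on toy data. This is a minimal, self-contained example (the random matrix is hypothetical, not from my dataset): fit() learns the components matrix H, transform() then solves for the weights W given that fixed H, and fit_transform() does both in one pass. Note that the W returned by fit_transform() can differ slightly from the W obtained by fit() followed by transform(), because transform() re-solves for W against the already-learned H.

```python
import numpy as np
from sklearn.decomposition import NMF

# NMF requires non-negative input, hence abs().
X = np.abs(np.random.RandomState(0).rand(6, 4))

nmf = NMF(n_components=2, random_state=0, max_iter=500)
W1 = nmf.fit_transform(X)   # learn H and return W in one pass

nmf2 = NMF(n_components=2, random_state=0, max_iter=500)
nmf2.fit(X)                 # learn H only
W2 = nmf2.transform(X)      # solve for W with the learned H held fixed

print(W1.shape)             # one row of topic weights per sample
```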

In that case, is the transformed data the topic weights for each input document?

Should .transform() be used with non-training data, to map new documents to their most probable topics?*

For educational purposes I created an NMF model, using as a dataset a DataFrame column containing engine failure descriptions.

My goal was to identify the different kinds of failures described:

# importing and applying the tfidf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
dtm = tfidf.fit_transform(df['clean_text'])

nmf_model = NMF(n_components=7, random_state=42)
nmf_model.fit(dtm)
# clustering the topics with symptoms
topic_results = nmf_model.transform(dtm)
* I wrote "non-training data" because, since this is an unsupervised technique, it did not seem right to call it "testing data".
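For new, unseen descriptions, the idea would be to reuse both fitted objects with .transform() only, never refitting. A minimal sketch, assuming a hypothetical toy corpus standing in for df['clean_text'] and two invented failure descriptions as the new data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Hypothetical toy corpus standing in for df['clean_text'].
train_docs = [
    "oil leak in the engine", "engine oil pressure low",
    "bearing vibration noise", "noise from worn bearing",
]
tfidf = TfidfVectorizer()
dtm = tfidf.fit_transform(train_docs)

nmf_model = NMF(n_components=2, random_state=42, max_iter=500)
nmf_model.fit(dtm)

# New, unseen descriptions: transform only, so the fitted
# vocabulary and the learned topics are reused, not relearned.
new_docs = ["oil leak again", "strange bearing noise"]
new_dtm = tfidf.transform(new_docs)        # same vocabulary as training
new_topics = nmf_model.transform(new_dtm)  # topic weights per new document
print(new_topics.argmax(axis=1))           # most likely topic index per document
```

argmax over the topic-weight rows gives the single most probable topic per document, which is how topic_results is typically turned into a label column.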
Andrea Ciufo