I have been reading several questions on this topic, and from them it seems quite clear to me why TfidfVectorizer has both .fit_transform() and .transform().
What is not clear to me is the situation where an NMF() model is trained with a TF-IDF matrix as input.
From sklearn documentation:
fit(X[, y])
Learn a NMF model for the data X.
fit_transform(X[, y, W, H])
Learn a NMF model for the data X and returns the transformed data. This is more efficient than calling fit followed by transform.
transform(X)
Transform the data X according to the fitted NMF model.
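To check my reading of that docstring, I made a small sketch with a made-up non-negative matrix. From what I can tell, fit_transform and fit followed by transform give close (though not bit-identical) W matrices, since transform re-solves for W with the learned components held fixed:

import numpy as np
from sklearn.decomposition import NMF

X = np.random.RandomState(0).rand(6, 4)  # small made-up non-negative matrix

# one call: learn the components H and return the weights W from one optimization
W1 = NMF(n_components=2, random_state=0).fit_transform(X)

# two calls: fit learns H, then transform solves for W with H held fixed
model = NMF(n_components=2, random_state=0).fit(X)
W2 = model.transform(X)

print(np.abs(W1 - W2).max())  # small in practice, but not exactly zero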
Is the transformed data in that case the topic weights for each input document?
Should .transform() be used on non-training data, to link new documents to their most probable topics? *
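To make the question concrete, here is a tiny sketch with made-up descriptions. My understanding is that the transformed output is a matrix W with one row of non-negative topic weights per document, so taking the argmax along each row gives the dominant topic:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# made-up miniature corpus, just to inspect the shapes
docs = ["oil leak near the gasket",
        "low oil pressure warning",
        "coolant leak under the radiator"]
X = TfidfVectorizer().fit_transform(docs)

W = NMF(n_components=2, random_state=0).fit_transform(X)
print(W.shape)           # (3, 2): one row of topic weights per document
print(W.argmax(axis=1))  # index of the dominant topic for each document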
For educational purposes I created an NMF model using, as a dataset, a DataFrame column with engine-failure descriptions.
My goal was to identify the different kinds of failures described:
# importing and applying the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

tfidf = TfidfVectorizer(max_df=0.95, min_df=2)
dtm = tfidf.fit_transform(df['clean_text'])

# fitting the NMF model on the document-term matrix
nmf_model = NMF(n_components=7, random_state=42)
nmf_model.fit(dtm)

# clustering: topic weights for each failure description
topic_results = nmf_model.transform(dtm)
* I wrote "non-training data" because, since this is an unsupervised technique, I thought it was not right to call it testing data.
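For reference, this is how I would then apply .transform() to new, unseen descriptions (the strings below are hypothetical), reusing the already-fitted tfidf and nmf_model objects from the snippet above without refitting them:

# hypothetical unseen failure descriptions
new_docs = ["sudden loss of oil pressure", "coolant temperature too high"]

# reuse the fitted vectorizer: .transform() only, so the vocabulary stays fixed
new_dtm = tfidf.transform(new_docs)

# project the new documents onto the topics learned during training
new_results = nmf_model.transform(new_dtm)
print(new_results.argmax(axis=1))  # most probable topic for each new description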