I have a multiclass text classification problem and I've tried different solutions and models, but I was not satisfied with the results. So I decided to use GloVe (Global Vectors for Word Representation), but somehow all the models performed even worse. So my question is: is it possible that NLP models perform worse when using word embedding models like GloVe or FastText, or did I just make a bad implementation? The code is given below:
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tqdm import tqdm

stop_words = set(stopwords.words('english'))  # assuming NLTK English stop words

# Load the pre-trained 300-dimensional GloVe vectors into a dict: token -> vector
embedding_model = {}
with open(r'../../langauge_detection/glove.840B.300d.txt', "r", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = ''.join(values[:-300])  # everything before the last 300 fields is the token
        coefs = np.asarray(values[-300:], dtype='float32')
        embedding_model[word] = coefs
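To rule out a parsing problem, a quick sanity check on the loaded dictionary could look like this (the token 'the' is just an arbitrary common word):

# Sanity check: a common token should be present with a 300-d vector
assert embedding_model['the'].shape == (300,)
print(len(embedding_model), 'tokens loaded')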
def sent2vec(s):
    # Tokenize, drop stop words and non-alphabetic tokens,
    # then sum the GloVe vectors and L2-normalize the result
    words = word_tokenize(str(s).lower())
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embedding_model[w])
        except KeyError:  # skip out-of-vocabulary tokens
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if not isinstance(v, np.ndarray):  # no token matched, so M.sum() is a scalar
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
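For illustration, calling sent2vec on a short string should return a unit-length 300-d vector (the sentence below is made up):

# Example with a made-up sentence: the result is L2-normalized, or all zeros
# if none of the tokens were found in the GloVe vocabulary
vec = sent2vec("We build software for industrial automation")
print(vec.shape)            # (300,)
print(np.linalg.norm(vec))  # ~1.0, or 0.0 for an empty/unknown sentence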
# Split raw texts and labels, then embed every document as a single 300-d vector
X_train, X_test, y_train, y_test = train_test_split(df.website_text, df.industry, test_size=0.2, random_state=42)
x_train_glove = [sent2vec(x) for x in tqdm(X_train)]
x_test_glove = [sent2vec(x) for x in tqdm(X_test)]
x_train_glove = np.array(x_train_glove)
x_test_glove = np.array(x_test_glove)
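Before training, it may be worth checking how many documents collapsed to the all-zero fallback vector, since those carry no signal:

# Diagnostic: documents whose tokens were all filtered out or out-of-vocabulary
n_empty = int((np.abs(x_train_glove).sum(axis=1) == 0).sum())
print(f'{n_empty} of {len(x_train_glove)} training documents are all-zero vectors')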
from sklearn.linear_model import SGDClassifier

# Linear model (hinge loss by default) trained on the sentence embeddings
sgd = SGDClassifier(random_state=42)
sgd.fit(x_train_glove, y_train)
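For reference, held-out performance can be checked along these lines (accuracy and per-class scores shown as example metrics):

from sklearn.metrics import accuracy_score, classification_report

# Evaluate on the held-out split
y_pred = sgd.predict(x_test_glove)
print('accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))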