I have a multiclass text classification problem and I've tried different solutions and models, but I was not satisfied with the results. So I decided to use GloVe (Global Vectors for Word Representation), but somehow all the models performed even worse. So my question is: is it possible that NLP models perform worse when using word embedding models like GloVe or FastText, or did I just make a bad implementation? The code is given below:
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tqdm import tqdm

stop_words = set(stopwords.words('english'))  # assuming NLTK English stop words

# Load the pre-trained 300-dimensional GloVe vectors into a dict: token -> vector
embedding_model = {}
with open(r'../../langauge_detection/glove.840B.300d.txt', "r", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = ''.join(values[:-300])  # everything before the last 300 fields is the token
        coefs = np.asarray(values[-300:], dtype='float32')
        embedding_model[word] = coefs
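To rule out a parsing problem, a quick sanity check on the loaded dictionary could look like this (the token 'the' is just an arbitrary common word):

# Sanity check: a common token should be present with a 300-d vector
assert embedding_model['the'].shape == (300,)
print(len(embedding_model), 'tokens loaded')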
def sent2vec(s):
    # Tokenize, drop stop words and non-alphabetic tokens,
    # then sum the GloVe vectors and L2-normalize the result
    words = word_tokenize(str(s).lower())
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(embedding_model[w])
        except KeyError:  # skip out-of-vocabulary tokens
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    if not isinstance(v, np.ndarray):  # no token matched, so M.sum() is a scalar
        return np.zeros(300)
    return v / np.sqrt((v ** 2).sum())
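For illustration, calling sent2vec on a short string should return a unit-length 300-d vector (the sentence below is made up):

# Example with a made-up sentence: the result is L2-normalized, or all zeros
# if none of the tokens were found in the GloVe vocabulary
vec = sent2vec("We build software for industrial automation")
print(vec.shape)            # (300,)
print(np.linalg.norm(vec))  # ~1.0, or 0.0 for an empty/unknown sentence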
# Split raw texts and labels, then embed every document as a single 300-d vector
X_train, X_test, y_train, y_test = train_test_split(df.website_text, df.industry, test_size=0.2, random_state=42)
x_train_glove = [sent2vec(x) for x in tqdm(X_train)]
x_test_glove = [sent2vec(x) for x in tqdm(X_test)]
x_train_glove = np.array(x_train_glove)
x_test_glove = np.array(x_test_glove)
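Before training, it may be worth checking how many documents collapsed to the all-zero fallback vector, since those carry no signal:

# Diagnostic: documents whose tokens were all filtered out or out-of-vocabulary
n_empty = int((np.abs(x_train_glove).sum(axis=1) == 0).sum())
print(f'{n_empty} of {len(x_train_glove)} training documents are all-zero vectors')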
from sklearn.linear_model import SGDClassifier

# Linear model (hinge loss by default) trained on the sentence embeddings
sgd = SGDClassifier(random_state=42)
sgd.fit(x_train_glove, y_train)
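For reference, held-out performance can be checked along these lines (accuracy and per-class scores shown as example metrics):

from sklearn.metrics import accuracy_score, classification_report

# Evaluate on the held-out split
y_pred = sgd.predict(x_test_glove)
print('accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))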