I want to train a fastText unsupervised model on my text dataset. However, there are many hyperparameters in the train_unsupervised method:

    lr                # learning rate [0.05]
    dim               # size of word vectors [100]
    ws                # size of the context window [5]
    epoch             # number of epochs [5]
    minCount          # minimal number of word occurrences [5]
    minn              # min length of char ngram [3]
    maxn              # max length of char ngram [6]
    neg               # number of negatives sampled [5]
    wordNgrams        # max length of word ngram [1]
    thread            # number of threads [number of cpus]
    lrUpdateRate      # change the rate of updates for the learning rate [100]

Some of them influence the quality of the embeddings dramatically (dim, lr, minn, and maxn especially). However, I haven't found any method for tuning these hyperparameters. How could I do that? Also, how might features of my dataset (mean sentence length, for example) influence the choice of some of these hyperparameters?
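In the absence of a built-in tuner, one common approach is a brute-force sweep over a small grid of the influential settings. A minimal sketch, assuming the official fasttext Python bindings; the file name data.txt and the candidate value grids are placeholders you would adapt:

```python
from itertools import product

# Hypothetical search space over the hyperparameters that matter most.
search_space = {
    "dim":  [50, 100, 300],
    "lr":   [0.01, 0.05, 0.1],
    "minn": [2, 3],
    "maxn": [5, 6],
}

def candidate_configs(space):
    """Enumerate every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

# Each config could then be passed straight through, e.g.:
#   import fasttext
#   model = fasttext.train_unsupervised("data.txt", **config)
# and scored with whatever evaluation metric you settle on.
```

The grid above yields 3 × 3 × 2 × 2 = 36 runs, so keep the per-parameter lists short or switch to random sampling if training is slow.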

Ir8_mind

1 Answer

In order to tune hyperparameters, you'll need an evaluation metric. One evaluation metric for embeddings is performance on analogies (e.g., man is to king as woman is to _____). There is an analogy test set created by Google. You can adjust embedding hyperparameter values and see which ones perform better on that collection of analogies.
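The analogy test works by checking whether the vector arithmetic b − a + c lands nearest to the expected word d. A minimal sketch of that scoring, written against a plain dict of word vectors so it is independent of any particular library (the analogy tuples and vectors are illustrative):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy_accuracy(vectors, analogies):
    """Fraction of analogies a:b :: c:d for which d is the nearest word
    (excluding a, b, c themselves) to the vector b - a + c."""
    correct = 0
    words = list(vectors)
    for a, b, c, d in analogies:
        target = vectors[b] - vectors[a] + vectors[c]
        candidates = [w for w in words if w not in (a, b, c)]
        best = max(candidates, key=lambda w: cosine(vectors[w], target))
        if best == d:
            correct += 1
    return correct / len(analogies)
```

In practice you would fill `vectors` from a trained model (the fasttext Python bindings expose per-word vectors via `model.get_word_vector`), run this over the Google analogy set for each hyperparameter configuration, and keep the configuration with the highest accuracy.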

Brian Spiering
  • But I trained the model on an insurance company's Jira task-description dataset to make it understand the specific language used there, so I don't think these Google tests will be representative. Should I manually select pairs of texts from a similar dataset and compare cosine similarities between their embeddings? Or is there a better way? – Ir8_mind Sep 07 '22 at 13:41
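For a domain-specific corpus like this, the pair-based evaluation the comment proposes is a reasonable substitute: hand-pick pairs a domain expert judges to be related and prefer the hyperparameters that give them higher average similarity. A minimal sketch, again over a plain dict of vectors (the pair list would come from your own Jira data):

```python
import numpy as np

def mean_pair_similarity(vectors, pairs):
    """Average cosine similarity over hand-picked pairs of words
    that a domain expert judged to be related."""
    sims = []
    for w1, w2 in pairs:
        v1, v2 = vectors[w1], vectors[w2]
        sims.append(float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))))
    return sum(sims) / len(sims)
```

A higher score is only meaningful relative to other hyperparameter configurations evaluated on the same pair list, and it helps to also include known-unrelated pairs and check that their similarity stays low, so a degenerate model that maps everything to the same vector does not win.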