I want to train a fastText unsupervised model on my text dataset. However, there are many hyperparameters in the train_unsupervised method:

    lr                # learning rate [0.05]
    dim               # size of word vectors [100]
    ws                # size of the context window [5]
    epoch             # number of epochs [5]
    minCount          # minimal number of word occurrences [5]
    minn              # min length of char ngram [3]
    maxn              # max length of char ngram [6]
    neg               # number of negatives sampled [5]
    wordNgrams        # max length of word ngram [1]
    thread            # number of threads [number of cpus]
    lrUpdateRate      # change the rate of updates for the learning rate [100]

Some of them influence the quality of the embeddings dramatically (dim, lr, minn, and maxn especially). However, I haven't found any method for tuning these hyperparameters. How could I do that? Also, how might features of my dataset (mean sentence length, for example) influence the choice of some of these hyperparameters?
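In the absence of a built-in tuner, one common approach is a brute-force sweep over a small grid of the influential settings. A minimal sketch, assuming the official fasttext Python bindings; the file name data.txt and the candidate value grids are placeholders you would adapt:

```python
from itertools import product

# Hypothetical search space over the hyperparameters that matter most.
search_space = {
    "dim":  [50, 100, 300],
    "lr":   [0.01, 0.05, 0.1],
    "minn": [2, 3],
    "maxn": [5, 6],
}

def candidate_configs(space):
    """Enumerate every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

# Each config could then be passed straight through, e.g.:
#   import fasttext
#   model = fasttext.train_unsupervised("data.txt", **config)
# and scored with whatever evaluation metric you settle on.
```

The grid above yields 3 × 3 × 2 × 2 = 36 runs, so keep the per-parameter lists short or switch to random sampling if training is slow.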

Ir8_mind

1 Answer

In order to tune hyperparameters, you'll need an evaluation metric. One evaluation metric for embeddings is performance on analogies (e.g., man is to king as woman is to _____). There is an analogy test set created by Google. You can adjust embedding hyperparameter values and see which ones perform better on that collection of analogies.
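The analogy test works by checking whether the vector arithmetic b − a + c lands nearest to the expected word d. A minimal sketch of that scoring, written against a plain dict of word vectors so it is independent of any particular library (the analogy tuples and vectors are illustrative):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy_accuracy(vectors, analogies):
    """Fraction of analogies a:b :: c:d for which d is the nearest word
    (excluding a, b, c themselves) to the vector b - a + c."""
    correct = 0
    words = list(vectors)
    for a, b, c, d in analogies:
        target = vectors[b] - vectors[a] + vectors[c]
        candidates = [w for w in words if w not in (a, b, c)]
        best = max(candidates, key=lambda w: cosine(vectors[w], target))
        if best == d:
            correct += 1
    return correct / len(analogies)
```

In practice you would fill `vectors` from a trained model (the fasttext Python bindings expose per-word vectors via `model.get_word_vector`), run this over the Google analogy set for each hyperparameter configuration, and keep the configuration with the highest accuracy.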

Brian Spiering
  • But I trained the model on an insurance company's Jira task-description dataset to make it understand the specific language used there, so I don't think these Google tests will be representative. Should I manually select pairs of texts from a similar dataset and compare cosine similarities between their embeddings? Or is there a better way? – Ir8_mind Sep 07 '22 at 13:41
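For a domain-specific corpus like this, the pair-based evaluation the comment proposes is a reasonable substitute: hand-pick pairs a domain expert judges to be related and prefer the hyperparameters that give them higher average similarity. A minimal sketch, again over a plain dict of vectors (the pair list would come from your own Jira data):

```python
import numpy as np

def mean_pair_similarity(vectors, pairs):
    """Average cosine similarity over hand-picked pairs of words
    that a domain expert judged to be related."""
    sims = []
    for w1, w2 in pairs:
        v1, v2 = vectors[w1], vectors[w2]
        sims.append(float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))))
    return sum(sims) / len(sims)
```

A higher score is only meaningful relative to other hyperparameter configurations evaluated on the same pair list, and it helps to also include known-unrelated pairs and check that their similarity stays low, so a degenerate model that maps everything to the same vector does not win.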