
I want to train fastText on my own corpus. However, I have a small question before continuing. Does each sentence need to be a separate item in the corpus, or can I keep many sentences together as one item?

For example, I have this DataFrame:

 text                                               |     summary
 ------------------------------------------------------------------
 this is sentence one this is sentence two continue | one two other
 other similar sentences some other                 | word word sent

Basically, the text column is an article, so it contains many sentences. Because of the preprocessing, I no longer have the full stops (.). So the question is: can I do something like this directly, or do I need to split the text into sentences first?

from sklearn.feature_extraction.text import TfidfVectorizer

docs = df['text']  # one article (many sentences) per row
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(docs)

What are the differences? Is this the right way to train fastText on my own corpus?

Thank you!

BlueMango
    It should give worse results if your 'sentences' are in fact very long documents and better results if you split your documents into blocks of k words where k is not too large - you might not even need to recover your lost punctuation. – Valentas Oct 15 '21 at 14:50

1 Answer


It should give worse results if your 'sentences' are in fact very long documents and better results if you split your documents into blocks of k words where k is not too large - you might not even need to recover your lost punctuation.
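The splitting the answer suggests can be done without recovering punctuation at all. A minimal sketch of chunking a long, punctuation-free document into blocks of k words (the helper name and k value are illustrative):

```python
def chunk_words(text, k=50):
    """Split a long document into blocks of at most k words."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(0, len(words), k)]

doc = "this is sentence one this is sentence two continue"
chunk_words(doc, k=4)
# → ['this is sentence one', 'this is sentence two', 'continue']
```

Each resulting block can then be tokenized and treated as one training item, so no sentence boundaries are needed.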
