
I want to train fastText on my own corpus. However, I have a small question before continuing. Does each sentence need to be a separate item in the corpus, or can I keep many sentences together as one item?

For example, I have this DataFrame:

 text                                               |     summary
 ------------------------------------------------------------------
 this is sentence one this is sentence two continue | one two other
 other similar sentences some other                 | word word sent

Basically, the text column is an article, so it contains many sentences. Because of the preprocessing, I no longer have the full stops (.). So the question is: can I do something like this directly, or do I need to split the text into sentences first?

from sklearn.feature_extraction.text import TfidfVectorizer

docs = df['text']  # one article (many sentences) per row
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(docs)

What are the differences? Is this the right way to train fastText on my own corpus?

Thank you!

BlueMango
    It should give worse results if your 'sentences' are in fact very long documents and better results if you split your documents into blocks of k words where k is not too large - you might not even need to recover your lost punctuation. – Valentas Oct 15 '21 at 14:50

1 Answer


It should give worse results if your 'sentences' are in fact very long documents and better results if you split your documents into blocks of k words where k is not too large - you might not even need to recover your lost punctuation.
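The splitting the answer suggests can be done without recovering punctuation at all. A minimal sketch of chunking a long, punctuation-free document into blocks of k words (the helper name and k value are illustrative):

```python
def chunk_words(text, k=50):
    """Split a long document into blocks of at most k words."""
    words = text.split()
    return [" ".join(words[i:i + k]) for i in range(0, len(words), k)]

doc = "this is sentence one this is sentence two continue"
chunk_words(doc, k=4)
# → ['this is sentence one', 'this is sentence two', 'continue']
```

Each resulting block can then be tokenized and treated as one training item, so no sentence boundaries are needed.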
