How do machine learning algorithms process text?

Question

I'm still new to machine learning and I've been trying to expand my knowledge of it. For my first project, I want to classify whether a tweet is suicidal or not using the gradient boosting algorithm.

I do know that ML models can't process plain text, which is why we have to represent it as numbers. These numeric values will be the input features to the machine learning model (correct me if I'm wrong).

But what I don't understand is how these numbers/vectors are being processed by the model to train it and make a prediction.

Hopefully someone can explain how plain text is converted into words and what happens internally when they are taken as input to the machine learning model.

score 0 · Answer 1 · answered Sep 08 '22 at 11:29

This is the question of text representation: how text can be converted (simplified) into numerical features in a way that preserves the meaning as much as possible and makes it usable for ML.

Nowadays there are two main types of text representation: the traditional one based on one-hot encoding, and the more recent one based on word embeddings. Note that there are many variants of each, and other types as well.

I think the most intuitive explanation is to study the traditional representation: in its most basic form, every word in the full corpus is assigned a fixed index $i$, and every document (or sentence) is represented as a vector in which position $i$ has value 1 if and only if the corresponding word $w_i$ is present in the document, and 0 otherwise. This way the learning algorithm can create conditions like "if word $w_i$ belongs to the document then predict class X", for instance. The rest is the usual learning process: the algorithm finds the statistical patterns which connect the features/words to the classes and produces a model which exploits those patterns.
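As a rough illustration of the paragraph above, here is a minimal sketch in plain Python. The toy corpus, the labels, and the helper names (`build_vocabulary`, `to_binary_vector`, `best_single_word_rule`) are all invented for this example. The "learner" is deliberately trivial: it only searches for the single word whose presence best predicts class 1. A real algorithm such as gradient boosting combines many conditions of this kind in decision trees, but the input it sees has the same shape.

```python
def build_vocabulary(documents):
    """Assign a fixed index i to every distinct word in the corpus."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def to_binary_vector(document, vocabulary):
    """Position i is 1 if and only if word w_i occurs in the document."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] = 1
    return vector


def best_single_word_rule(vectors, labels):
    """Find the rule "if word w_i is present predict 1, else predict 0"
    with the highest accuracy on the training data.

    Returns (index, accuracy). Ties keep the lowest index.
    """
    best_index, best_accuracy = None, -1.0
    for i in range(len(vectors[0])):
        correct = sum(1 for v, y in zip(vectors, labels) if v[i] == y)
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:
            best_index, best_accuracy = i, accuracy
    return best_index, best_accuracy


# Toy corpus and labels, made up for illustration only.
documents = [
    "i feel hopeless today",
    "great day at the beach",
    "everything feels hopeless",
    "lovely dinner with friends",
]
labels = [1, 0, 1, 0]

vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(d, vocabulary) for d in documents]

index, accuracy = best_single_word_rule(vectors, labels)
index_to_word = {i: w for w, i in vocabulary.items()}
print(index_to_word[index], accuracy)  # hopeless 1.0
```

On this tiny corpus the best rule is "if `hopeless` is in the document then predict class 1", which is exactly the kind of condition described above. With real data no single word separates the classes perfectly, which is why the model combines many such features.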

Word embeddings offer a more subtle but also more complex representation of the meaning of a word. Every dimension represents some specific kind of semantic information, but it cannot be interpreted directly.
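To make this a bit more concrete, here is a small sketch with hypothetical 3-dimensional embeddings. The numbers are made up for illustration; real embeddings are learned from large corpora and usually have hundreds of dimensions. Averaging the word vectors to get a document vector is only one simple option, assumed here for the example. The point is that words with related meanings end up with similar vectors, and that the resulting document vector is again just a list of numeric features for the model.

```python
import math

# Hypothetical embeddings: the values are invented, not taken from a real model.
embeddings = {
    "sad": [0.9, 0.1, 0.0],
    "unhappy": [0.8, 0.2, 0.1],
    "beach": [0.0, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (close to 1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def document_vector(document, embeddings, dimensions=3):
    """Average the embeddings of the known words in the document.

    Unknown words are skipped; a document with no known word maps to zeros.
    """
    known = [embeddings[w] for w in document.lower().split() if w in embeddings]
    if not known:
        return [0.0] * dimensions
    return [sum(values) / len(known) for values in zip(*known)]


similar = cosine_similarity(embeddings["sad"], embeddings["unhappy"])
different = cosine_similarity(embeddings["sad"], embeddings["beach"])
print(similar > different)  # True: "sad" is closer to "unhappy" than to "beach"

features = document_vector("sad beach", embeddings)
print(len(features))  # 3 numeric features, ready to be fed to a classifier
```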
