Following a discussion with Erwan: one of his previous answers has partially answered my question. However, I would like to understand the following. One needs to have a corpus, then label the news/tweets as fake/not fake, then run the model. But how does the algorithm work on the texts and pick out the relevant words or features for detecting fake news?
First, let me emphasize that the concept of "fake news" is very vague and subjective. This fact alone is a red flag for any rational data scientist: if humans don't always agree on what the correct answer is, it's going to be difficult to know whether a program gives the correct answer or not.
Now let's assume one has some news data annotated as fake or not. We can represent every news item as a vector of features (for example, as a simple one-hot encoded vector of words), and for every document provide these features together with the boolean "fake or not" label to a supervised ML algorithm.
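To make this concrete, here is a minimal sketch of the feature-extraction step using scikit-learn's `CountVectorizer`; the two-document corpus and its labels are invented purely for illustration:

```python
# A minimal sketch of turning news texts into feature vectors.
# The tiny corpus and labels here are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "Miracle cure discovered, doctors hate it",
    "Parliament votes on the new budget proposal",
]
labels = [1, 0]  # hypothetical annotation: 1 = fake, 0 = not fake

# binary=True gives a one-hot style presence/absence encoding
# rather than raw word counts.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # the vocabulary = the features
print(X.toarray())                         # one feature vector per document
```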
The supervised ML algorithm doesn't know and doesn't care about the semantic task, that is, it has no idea what the label represents. Its job is only to find the best way to predict a given label (whatever it represents) from the given features. Typically it does this by measuring statistical links between features and labels, for instance using the fact that a particular word is often associated with a particular label in the training set.
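A Naive Bayes classifier is a simple example of this: it literally estimates how often each word co-occurs with each label, and nothing more. A sketch, again with an invented toy corpus:

```python
# Sketch of the "statistical link" idea: Naive Bayes estimates
# P(word | label) from word/label co-occurrence counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "shocking miracle cure revealed",
    "you won't believe this shocking secret",
    "committee publishes annual budget report",
    "court rules on trade dispute",
]
labels = [1, 1, 0, 0]  # hypothetical: 1 = fake, 0 = not fake

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

# clf.classes_ is [0, 1], so row 0 of feature_log_prob_ is the
# "not fake" class and row 1 is the "fake" class. Each entry is
# the learned association between a word and a label, nothing more.
for word, lp_real, lp_fake in zip(
    vectorizer.get_feature_names_out(),
    clf.feature_log_prob_[0],
    clf.feature_log_prob_[1],
):
    print(f"{word:12s} logP(word|real)={lp_real:.2f}  logP(word|fake)={lp_fake:.2f}")
```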
It can do this part very well, but the rest is up to humans: Are the features really good indicators for the label? Are the labels really representative of the task? Is the amount of training data sufficient? Is the training data diverse enough? Is the ML algorithm the right choice for this task? How should the results be evaluated and interpreted? Etc.
The point is: no matter how complex it is, the algorithm doesn't care about the meaning, it just uses the relations that it finds in the data. If for some reason the word "rhinoceros" always appears with the label "fake" in the training data, then when applying the system, any news item containing the word "rhinoceros" is likely to be predicted as fake. It's easy to see that this can lead to serious errors, depending on the quality/diversity of the data used for training. It's also easy to see that, with a very "sensitive" task such as fake news detection, a good evaluation result on the same kind of data as the training data is not proof that the system can actually detect fake news in any context, if only because it's impossible to have a fully representative sample of every kind of fake news, past, present and future.
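The "rhinoceros" effect is easy to reproduce with a deliberately biased toy corpus (invented here for illustration):

```python
# Demonstrating the "rhinoceros" effect: the word appears only in
# fake items in training, so the model latches onto it regardless
# of meaning. The corpus is deliberately biased and made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "rhinoceros spotted endorsing miracle diet",  # fake
    "rhinoceros cures all known diseases",        # fake
    "city council approves new park budget",      # real
    "researchers publish peer reviewed study",    # real
]
train_labels = ["fake", "fake", "real", "real"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# A perfectly factual sentence is flagged as fake purely because
# of the spurious word-label correlation learned above.
print(model.predict(["the zoo welcomed a baby rhinoceros today"]))
# -> ['fake']
```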
So my question is: if I have a corpus on Trump, would the algorithm be able to detect fake news about Vitamin C, without any words (verbs, adjectives, nouns, ...) in common between the two datasets, except stopwords?
As you can guess from my explanation above, it's not at all certain that the algorithm would adapt easily to a new type of data. That being said, a good algorithm for this task would normally focus on the subtle cues rather than the topic words, for instance the terms which are meant to trigger an emotional response in the reader. To some extent, these indicators might work across topics. But again there's no guarantee that the system works in general, especially since the people who write and disseminate fake news can easily change their techniques once these become easy to detect.
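One straightforward way to probe your question empirically is to train on one topic and evaluate on the other, then compare against in-topic performance. A sketch, where the corpora (and their variable names) are hypothetical stand-ins for your own annotated data:

```python
# Cross-topic evaluation sketch: train on topic A, test on topic B.
# A large accuracy drop versus in-topic evaluation suggests the
# model relied on topic words rather than general cues.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def evaluate_cross_topic(train_texts, train_labels, test_texts, test_labels):
    """Fit on one topic's corpus, return accuracy on another topic's."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    return accuracy_score(test_labels, model.predict(test_texts))

# Usage with your own corpora (hypothetical variable names):
# acc = evaluate_cross_topic(trump_texts, trump_labels,
#                            vitamin_c_texts, vitamin_c_labels)
```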