Newbie questions: real-time clustering of messages

Question

I'm very much a newbie in NLP, so please accept my apologies if this is an obvious question, the wrong place to ask it or any other error I could be making.

I am considering using NLP for some subset of real-time spam detection in real-time chat. The general idea would be to observe semantic clusters forming in real-time, as they could indicate activation of a wave of spambots. This, by itself, won't be sufficient to indicate that it's spam but I suspect that it would be an interesting data point in the process.

More specifically:

my system receives text messages in real-time (e.g. thousands to tens of thousands per second);
I would like to classify them and see if semantic clusters emerge;
I will need to clean up the data regularly (e.g. remove everything that's older than one hour) to avoid retaining potentially private data;
I cannot rely on external services for privacy reasons, so whatever happens, I'll need to write code. I'm fine with that.

I figure that I need to encode my text messages into vectors, using e.g. BERT or some other existing model. So far, so good. My difficulties are:

real-time classification of a growing dataset, with an unknown number of clusters (I'll be able to experiment with the distance, though);
regular cleanup.

Are there any well-known algorithms or libraries that I should look at? I'm not afraid to code and optimize my code, if I have a good reason to believe that it's going to work.

score 2 · Answer 1 · answered Oct 04 '22 at 11:06

This looks like a topic modelling problem: unsupervised clustering based on distributional semantics.

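For the unknown number of clusters, the Hierarchical Dirichlet Process (HDP) is a good option, since it infers the number of topics from the data instead of requiring it up front.
For analyzing how the topics change over time, there are dynamic topic models that can represent this, in particular dynamic LDA (D-LDA) and the Dynamic Embedded Topic Model (DETM).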
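
To make the two constraints from the question concrete (unknown number of clusters, and expiry of old data), here is a minimal sketch. Note that it is not HDP or a dynamic topic model: it swaps in a much simpler threshold-based "leader" clustering over a sliding time window, which may be a useful baseline before trying the models above. The class name `SlidingWindowClusterer`, the `threshold` and `max_age` parameters, and the assumption that timestamps arrive in non-decreasing order are all mine, and the vectors are assumed to come from whatever encoder you pick.

```python
import math
from collections import deque


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


class SlidingWindowClusterer:
    """Hypothetical baseline: assign each vector to the most similar cluster,
    or open a new cluster if nothing is similar enough."""

    def __init__(self, threshold=0.8, max_age=3600.0):
        self.threshold = threshold   # minimum cosine similarity to join a cluster
        self.max_age = max_age       # seconds a message stays in the window
        self.clusters = {}           # cluster id -> {"sum": vector, "count": int}
        self.events = deque()        # (timestamp, cluster id, unit vector), oldest first
        self.next_id = 0

    def add(self, vector, timestamp):
        """Cluster one message vector; timestamps must be non-decreasing."""
        self.expire(timestamp)
        v = _normalize(vector)
        best_id, best_sim = None, self.threshold
        for cid, cluster in self.clusters.items():
            # The sum of unit vectors points in the centroid's direction.
            sim = _cosine(v, cluster["sum"])
            if sim >= best_sim:
                best_id, best_sim = cid, sim
        if best_id is None:
            best_id = self.next_id
            self.next_id += 1
            self.clusters[best_id] = {"sum": [0.0] * len(v), "count": 0}
        cluster = self.clusters[best_id]
        cluster["sum"] = [a + b for a, b in zip(cluster["sum"], v)]
        cluster["count"] += 1
        self.events.append((timestamp, best_id, v))
        return best_id

    def expire(self, now):
        """Drop everything older than max_age and delete emptied clusters."""
        while self.events and now - self.events[0][0] > self.max_age:
            _, cid, v = self.events.popleft()
            cluster = self.clusters[cid]
            cluster["count"] -= 1
            if cluster["count"] == 0:
                del self.clusters[cid]
            else:
                cluster["sum"] = [a - b for a, b in zip(cluster["sum"], v)]
```

A sudden jump in one cluster's `count` would then be the kind of signal the question is after. The linear scan over clusters is unlikely to keep up with tens of thousands of messages per second once there are many clusters, so at that scale you would probably want an approximate nearest-neighbour index over the centroids.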
