I'm very much a newbie in NLP, so please accept my apologies if this is an obvious question, if this is the wrong place to ask it, or if I'm making some other mistake.

I am considering using NLP as one component of spam detection in a real-time chat. The general idea would be to watch semantic clusters form as messages arrive, since they could indicate that a wave of spambots has been activated. This, by itself, won't be sufficient to conclude that a message is spam, but I suspect that it would be an interesting data point in the process.

More specifically:

  • my system receives text messages in real-time (e.g. thousands to tens of thousands per second);
  • I would like to group them and see whether semantic clusters emerge;
  • I will need to clean up the data regularly (e.g. remove everything that's older than one hour) to avoid retaining potentially private data;
  • I cannot rely on external services for privacy reasons, so whatever happens, I'll need to write code. I'm fine with that.

I figure that I need to encode my text messages into vectors, using e.g. BERT or some other existing model. So far, so good. My difficulties are:

  • real-time clustering of a growing dataset, with an unknown number of clusters (I'll be able to experiment with the distance, though);
  • regular cleanup.

Are there any well-known algorithms or libraries that I should look at? I'm not afraid to code and optimize my code, if I have a good reason to believe that it's going to work.

Yoric

1 Answer

This looks like a topic modelling problem: unsupervised clustering based on distributional semantics.

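  • For the issue of the unknown number of clusters, HDP (Hierarchical Dirichlet Process) is certainly a good option, since it infers the number of topics from the data.
  • For analyzing the changing behaviour, there are dynamic topic models which can represent this, in particular D-LDA (dynamic LDA) and DETM (Dynamic Embedded Topic Model).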
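
To make the two requirements from the question concrete (an unknown number of clusters controlled by a distance, plus regular cleanup), here is a minimal sketch of a much simpler baseline than the topic models above: threshold-based "leader" clustering over a sliding time window. It is not HDP or a dynamic topic model, and everything in it is an assumption for illustration: the class name `WindowedLeaderClusters`, the cosine threshold, and the idea that each message has already been encoded into a plain list of floats. It also assumes timestamps arrive in non-decreasing order.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class WindowedLeaderClusters:
    """Hypothetical helper: threshold-based clustering over a sliding time window."""

    def __init__(self, threshold=0.8, window_seconds=3600.0):
        self.threshold = threshold      # minimum cosine similarity to join a cluster
        self.window = window_seconds    # how long a message is retained
        self.clusters = {}              # id -> {"sum": [...], "members": deque of (ts, vec)}
        self._next_id = 0

    def expire(self, now):
        """Drop every message older than the window; delete clusters left empty."""
        cutoff = now - self.window
        for cid in list(self.clusters):
            c = self.clusters[cid]
            while c["members"] and c["members"][0][0] < cutoff:
                _, old = c["members"].popleft()
                c["sum"] = [s - o for s, o in zip(c["sum"], old)]
            if not c["members"]:
                del self.clusters[cid]

    def add(self, ts, vec):
        """Assign vec to the most similar live cluster, or open a new one."""
        self.expire(ts)
        best_id, best_sim = None, self.threshold
        for cid, c in self.clusters.items():
            # The running sum points in the same direction as the centroid.
            sim = cosine(vec, c["sum"])
            if sim >= best_sim:
                best_id, best_sim = cid, sim
        if best_id is None:
            best_id = self._next_id
            self._next_id += 1
            self.clusters[best_id] = {"sum": [0.0] * len(vec), "members": deque()}
        c = self.clusters[best_id]
        c["sum"] = [s + v for s, v in zip(c["sum"], vec)]
        c["members"].append((ts, list(vec)))
        return best_id

    def sizes(self):
        """Number of retained messages per live cluster."""
        return {cid: len(c["members"]) for cid, c in self.clusters.items()}
```

A sudden jump in one entry of `sizes()` would be the kind of signal described in the question. Note that `add` scans every live cluster, so at tens of thousands of messages per second this would probably need an approximate nearest-neighbour index in front of it; the sketch only shows the bookkeeping.
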
Erwan