I am looking for hints on how to curate a list of stopwords. Can anyone recommend a good method for extracting a stopword list from the dataset itself, for use in preprocessing and filtering?
The Data:
A huge amount of human text input of variable length (search terms and whole sentences of up to 200 characters), collected over several years. The text contains a lot of spam (machine input from bots, single words, nonsense searches, product searches, ...) and only a few percent of it seems to be useful. I realised that occasionally (only very rarely) people search my site by asking really interesting questions. These questions are good enough that I think it is worth taking a deeper look at them, to see how people search over time and which topics visitors to my website have been interested in.
My problem:
I am really struggling with the preprocessing (i.e. dropping the spam). I have already tried some stopword lists from the web (NLTK etc.), but these do not really fit the needs of this dataset.
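To make the question concrete: one naive way to derive such a list from the data itself would be document frequency, i.e. count in how many entries each token appears and treat tokens above some share of all entries as stopwords. The sketch below only illustrates that idea. The function names, the simple regex tokenizer and the 0.5 threshold are made-up assumptions, not something tuned on my data, and it may well be the wrong approach for spam like this.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def derive_stopwords(queries, max_doc_ratio=0.5):
    """Return tokens that occur in more than max_doc_ratio of all queries.

    max_doc_ratio is an assumed, untuned threshold.
    """
    doc_freq = Counter()
    total = 0
    for query in queries:
        total += 1
        # set() so that a token is counted once per query (document frequency).
        doc_freq.update(set(tokenize(query)))
    if total == 0:
        return set()
    return {tok for tok, count in doc_freq.items() if count / total > max_doc_ratio}


def drop_stopwords(query, stopwords):
    # Keep only the tokens that are not in the derived stopword set.
    return [tok for tok in tokenize(query) if tok not in stopwords]


if __name__ == "__main__":
    sample = [
        "buy cheap shoes",
        "buy cheap watches",
        "buy cheap phones now",
        "how do glaciers form over time",
    ]
    stopwords = derive_stopwords(sample, max_doc_ratio=0.5)
    print(sorted(stopwords))
    print(drop_stopwords("Buy cheap shoes", stopwords))
```

On real data the threshold would need tuning, and frequency alone probably cannot separate bot input from rare but genuine questions, which is why I am asking for better methods.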
Thanks for your ideas and discussion, folks!