DBSCAN clustering on document [updated]?

Question

I am new to topic modeling and text clustering and I am trying to learn more. I would like to use DBSCAN to cluster text data. There are many posts and sources on how to implement DBSCAN in Python, such as 1, 2, 3, but they are either too difficult for me to understand or not in Python.
I have CSV data containing a userID and the message each user wrote, as follows:

user.csv (400 rows, one message per row)

userID         messages
112   The car was broken and Kevin fixed it
.
.
.

I know some steps to apply DBSCAN such as:

Remove stop words
Compute a similarity or distance between messages (I have code that computes the cosine similarity)
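
A minimal sketch of these two steps, for illustration only. It assumes a tiny hypothetical stop-word list and plain word counts, and uses only the standard library; `STOP_WORDS`, `tokenize` and `cosine_distance` are names made up for this example, not the code referred to in the comments.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (a real one is much longer).
STOP_WORDS = {"the", "was", "and", "it", "a", "is", "of", "to"}


def tokenize(message):
    """Lowercase the message, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", message.lower())
    return [w for w in words if w not in STOP_WORDS]


def cosine_distance(tokens_a, tokens_b):
    """Return 1 - cosine similarity of the word-count vectors of two messages."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty message shares nothing with any other message.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


first = tokenize("The car was broken and Kevin fixed it")
second = tokenize("Kevin fixed the broken car")
print(first)
print(cosine_distance(first, second))
```

Computing this distance for every pair of messages gives the distance matrix that DBSCAN needs.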

I am also aware that scikit-learn has a demo at 4, but I would prefer a manual implementation so that I can see what is going on in the code.

It would be great if you could help with code that I can run on my side to learn from.

Add the code that you already have to your question to make it answerable. Right now, it is too unspecific. Also beware that results of cosine on short text tend to be very poor, and choosing DBSCAN parameters will pose a big problem. So I doubt you will succeed. — Has QUIT--Anony-Mousse, Jun 30 '19 at 06:28
@Anony-Mousse thank you for your comment. I just updated the post with my entire code that performs the similarity. The code above calculates both Levenshtein and cosine similarity. I chose Levenshtein as practice. That's correct, my CSV file has short text that the users provided. Please let me know a good alternative for similarity if you have something in mind. I would appreciate it. — Bilgin, Jun 30 '19 at 19:02
There is no use in turning a distance matrix into a pandas data frame. That just causes more overhead, and your code already will be very slow. Stick to *vectorized* numpy where possible. But there is no DBSCAN in your code yet. — Has QUIT--Anony-Mousse, Jun 30 '19 at 19:44
@Anony-Mousse Sure, I will stick with numpy and avoid pandas. I just added the DBSCAN code that I found [here](https://datascience.stackexchange.com/questions/20198/cluster-documents-and-identify-the-prominent-document-in-the-cluster), but it is very simple and uses scikit-learn. I am new to Python and I was not able to write the algorithm from scratch. Please let me know if you have any idea about how I can do it. Thanks — Bilgin, Jun 30 '19 at 20:25
Do what? And don't forget to choose a meaningful epsilon *and* minpts or you'll get useless clusters. — Has QUIT--Anony-Mousse, Jun 30 '19 at 20:45
@Anony-Mousse do you know any good posts or references that would help me understand and implement it? — Bilgin, Jun 30 '19 at 20:54
Implement DBSCAN? Just implement the pseudocode from the paper, it's fairly easy. Just do it. — Has QUIT--Anony-Mousse, Jun 30 '19 at 21:15

score 1 · Answer 1 · answered Jul 01 '19 at 07:59

Bilgin!

Anony-Mousse asks the right questions and gives good suggestions. Before you use a self-implemented DBSCAN, write the algorithm out on paper. It may not be the best algorithm for your data at all, so try the scikit-learn implementation first to see the results.
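
As a rough sketch of that first try (not a tuned solution), the snippet below assumes TF-IDF vectors with scikit-learn's built-in English stop-word list and a precomputed cosine distance matrix. The helper name `cluster_messages`, the sample messages, and the values of `eps` and `min_samples` are placeholders for illustration; on real short messages you will have to experiment with both parameters.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances


def cluster_messages(messages, eps=0.5, min_samples=2):
    """Cluster messages with DBSCAN on cosine distances between TF-IDF vectors."""
    # Lowercases, tokenizes, and removes English stop words.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(messages)
    # Dense n x n matrix of pairwise cosine distances.
    distances = cosine_distances(tfidf)
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    # One label per message; -1 marks noise (messages in no cluster).
    return model.fit_predict(distances)


# Made-up sample messages; in practice, load the message column of user.csv.
messages = [
    "The car was broken and Kevin fixed it",
    "Kevin fixed the broken car",
    "I love pizza and pasta",
    "Pizza and pasta are what I love",
    "Rain is expected tomorrow",
]
labels = cluster_messages(messages)
for label, message in zip(labels, messages):
    print(label, message)
```

If nearly every message comes out labeled -1, or everything lands in a single cluster, that is a sign that `eps`, `min_samples`, or the distance itself does not suit the data.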

Here is a Python implementation: https://github.com/chrisjmccormick/dbscan/blob/master/dbscan.py and here is the theory: https://github.com/chrisjmccormick/dbscan/blob/master/dbscan.py
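
If you want to see the moving parts yourself, here is a compact sketch of the standard DBSCAN procedure, written for a precomputed distance matrix. It is not the code from the repository linked above; the names `dbscan`, `region_query`, `NOISE` and `UNVISITED` are chosen for this example, and the 1-D toy data only exists to keep the result easy to check by hand.

```python
import numpy as np

NOISE = -1       # final label for points that belong to no cluster
UNVISITED = -2   # temporary label used while the algorithm runs


def region_query(dist, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return np.flatnonzero(dist[i] <= eps)


def dbscan(dist, eps, min_pts):
    """Label each row of an n x n distance matrix with a cluster id or NOISE."""
    n = dist.shape[0]
    labels = np.full(n, UNVISITED)
    cluster_id = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(dist, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; it may still become a border point later.
            labels[i] = NOISE
            continue
        # Point i is a core point: start a new cluster and grow it.
        labels[i] = cluster_id
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(dist, j, eps)
            if len(j_neighbors) >= min_pts:
                # j is also a core point, so its neighbors join the cluster.
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels


# Toy data: two tight groups on a line plus one isolated point.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0])
dist = np.abs(points[:, None] - points[None, :])
print(dbscan(dist, eps=0.5, min_pts=2))
```

For your messages, `dist` would be the pairwise distance matrix of the texts (for example cosine distance) instead of the toy matrix used here.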

Good luck!
