I'm currently experimenting with scikit-learn and the DBSCAN algorithm, and I'm wondering how to combine the data with the labels so I can write them to a new file. I'd also like to understand how the labels array is used to filter the examples.
Please correct me whenever I say something wrong, because I'd like to understand the whole process better.
For example, my data looks like this:
city x y
A 1 1
B 1 1
C 5 5
D 8 8
So if I understand it correctly, I first need to split off the columns I'd like to cluster. If I use the data as shown above, it will consider the city column as well (or even fail).
So my next array would look like this:
x y
1 1
1 1
5 5
8 8
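A minimal sketch of this splitting step, assuming the data is already loaded into a pandas DataFrame called cities (the name features is only illustrative):

```python
import pandas as pd

# Sample data copied from the table above.
cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "x": [1, 1, 5, 8],
    "y": [1, 1, 5, 8],
})

# Keep only the numeric coordinate columns for clustering.
# The row order and the index stay identical to `cities`,
# which is what later allows the labels to be matched back to the cities.
features = cities[["x", "y"]]
print(features)
```

Selecting columns this way leaves the original DataFrame untouched, so the city names are still available afterwards.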
Now I'll run DBSCAN on my array and it will create a cluster model. The cluster labels are then stored in the array foo.labels_.
As far as I know, I can filter the data with those labels to get the items within each cluster. Let's assume my data is in the DataFrame cities:
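To make this step concrete, here is a small sketch of the fit. The values eps=0.5 and min_samples=1 are assumptions chosen only for this toy data; they are not taken from my real setup.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Coordinates from the table above, one row per city (A, B, C, D).
X = np.array([[1, 1], [1, 1], [5, 5], [8, 8]])

# eps and min_samples are assumed values for this toy example.
# min_samples=1 lets a single point form its own cluster; with the
# default min_samples=5 all four points would be labelled -1 (noise).
foo = DBSCAN(eps=0.5, min_samples=1).fit(X)

# labels_ holds one entry per input row, in the same order as X.
print(foo.labels_.tolist())  # [0, 0, 1, 2]
```

The label -1 is how DBSCAN marks noise points, so with stricter parameters C and D would not get clusters of their own.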
cluster0 = cities[foo.labels_ == 0]
What I don't understand is how this works; I don't see what exactly happens in that expression. I know that the labels are in a one-dimensional array where the position corresponds to the row number and the value at that position is the cluster. So how do I get the correct index of my cities?
After the clustering, I'd like to export my data back into a CSV file with the following format:
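Broken down into steps, my current understanding of the filter looks like the sketch below. The labels array is a hard-coded stand-in for foo.labels_, and the name mask is only illustrative.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "x": [1, 1, 5, 8],
    "y": [1, 1, 5, 8],
})

# Stand-in for foo.labels_: one label per row of `cities`, in the same order.
labels = np.array([0, 0, 1, 2])

# Comparing the array with 0 yields one boolean per row.
mask = labels == 0
print(mask.tolist())  # [True, True, False, False]

# Indexing the DataFrame with the boolean array keeps the rows where it is True.
# The match is by position: the first boolean belongs to the first row, and so on.
cluster0 = cities[mask]
print(cluster0["city"].tolist())  # ['A', 'B']
```

If that is right, no explicit index lookup is needed, as long as the rows passed to DBSCAN are in the same order as the rows of cities.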
city x y cluster
A 1 1 0
B 1 1 0
C 5 5 1
D 8 8 2
My guess is to use the original DataFrame and add another column like this (with pandas imported as pd):
cities = cities.assign(cluster=pd.Series(foo.labels_))
But I'm absolutely unsure whether that's the correct way to achieve what I want.
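For reference, a sketch of the whole export as I imagine it. The labels array again stands in for foo.labels_, and the space separator is an assumption based on how I wrote the sample above. Passing the raw array to assign matches by position, whereas a pd.Series would be aligned on the index, which only gives the same result while cities has the default 0..n-1 index.

```python
import io

import numpy as np
import pandas as pd

cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "x": [1, 1, 5, 8],
    "y": [1, 1, 5, 8],
})

# Stand-in for foo.labels_ (one label per row, same order as `cities`).
labels = np.array([0, 0, 1, 2])

# A plain array is attached row by row, independent of the DataFrame index.
cities = cities.assign(cluster=labels)

# Write to an in-memory buffer for the demo; a file path works the same way.
buffer = io.StringIO()
cities.to_csv(buffer, sep=" ", index=False)
print(buffer.getvalue())
```

index=False keeps the row numbers out of the file, so the output has only the four columns shown in the target format.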
I'd really appreciate some opinions and explanations.