
I'm currently experimenting with scikit-learn and the DBSCAN algorithm, and I'm wondering how to combine the data with the labels so I can write them to a new file. I'd also like to understand how the labels array is used to filter the examples.

Please correct me whenever I say something wrong, because I'd like to understand the whole process better.

For example, my data looks like this:

city    x    y
A       1    1
B       1    1
C       5    5
D       8    8

So if I understand it correctly, I first need to separate out the columns I'd like to cluster. If I use the data as shown above, DBSCAN will consider the city column as well (or even fail).

So my next array would look like this:

x    y
1    1
1    1
5    5
8    8
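A minimal sketch of that split, assuming the data has already been loaded into a pandas DataFrame called cities (the variable name X is only an illustration):

```python
import pandas as pd

# Example data from the question
cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "x": [1, 1, 5, 8],
    "y": [1, 1, 5, 8],
})

# Keep only the numeric columns for clustering; the row order is unchanged
X = cities[["x", "y"]].to_numpy()
print(X.shape)
```

The city column stays in cities, so row i of X still belongs to row i of cities.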

Now I'll run DBSCAN on my array and it will create a cluster model. The cluster labels are now stored in the array foo.labels_.
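As a rough sketch of this step: the eps and min_samples values below are assumptions chosen so that every point ends up in a cluster on this tiny data set; they are not taken from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# The x/y values from the question, one row per city
X = np.array([[1, 1], [1, 1], [5, 5], [8, 8]])

# eps and min_samples are illustrative assumptions, not recommended values
foo = DBSCAN(eps=0.5, min_samples=1).fit(X)

# labels_ holds one cluster label per input row, in the same order as X
print(foo.labels_)
```

With the default min_samples, points that do not belong to any cluster get the label -1 (noise).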

As far as I know, I can filter the data with those labels to get the items within each cluster. Let's assume my data is in the DataFrame cities:

cluster0 = cities[foo.labels_ == 0]

What I don't understand is how this works; I don't get what exactly happens here. I know that the labels are in an array where each position is a number and the value at that position is the cluster. So how do I get the correct index of my cities?

So after the clustering I'd like to export my data back into a CSV file with the following format:
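To make the filtering step concrete, here is a small sketch. The labels array is a hard-coded stand-in for foo.labels_, assumed to match the desired output in the question.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "x": [1, 1, 5, 8],
    "y": [1, 1, 5, 8],
})

# Stand-in for foo.labels_ (assumed values): one label per row of cities
labels = np.array([0, 0, 1, 2])

# Comparing an array with a number yields one True/False per element
mask = labels == 0
print(mask)

# Indexing a DataFrame with a boolean array keeps the rows where it is True
cluster0 = cities[mask]
print(cluster0)
```

The comparison produces a boolean mask of the same length as the DataFrame, and the mask is applied by position. The selected rows keep their original index labels, so no separate lookup of the city index is needed.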

city    x     y     cluster
A       1     1     0
B       1     1     0
C       5     5     1
D       8     8     2

My guess is to use the original DataFrame and add another column like this:

cities = cities.assign(cluster=pd.Series(foo.labels_))

But I'm absolutely unsure whether that's the correct way to achieve what I want.
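One possible way to do the export, sketched under the assumption that the labels look like the desired output above. The labels array is again a stand-in for foo.labels_, and an in-memory buffer is used instead of a real file path so the example has no side effects.

```python
import io

import numpy as np
import pandas as pd

cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "x": [1, 1, 5, 8],
    "y": [1, 1, 5, 8],
})

# Stand-in for foo.labels_ (assumed values)
labels = np.array([0, 0, 1, 2])

# Passing the array directly assigns the labels by position, row by row
cities = cities.assign(cluster=labels)

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
cities.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing the plain array avoids one pitfall of wrapping it in a Series: a Series is aligned on index labels, so if cities does not have the default 0..n-1 index, the labels could end up on the wrong rows or become NaN. To write an actual file, pass a path such as a hypothetical "cities_clustered.csv" to to_csv instead of the buffer.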

I'd really appreciate some opinions and explanations.

swas

1 Answer


As the algorithm should not change the order of the rows, you can just add the list of cluster labels (called cluster here) as a new column:

 cities["cluster"] = cluster 

If you are really paranoid, you can add your input columns to the DataFrame a second time in the same way and compare them with the originals (the difference should be 0).
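A sketch of that sanity check, assuming X is the array that was passed to DBSCAN and labels is a stand-in for the resulting label array. The column names x_check and y_check are made up for illustration.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "x": [1, 1, 5, 8],
    "y": [1, 1, 5, 8],
})

# The array that would have been given to the clustering step
X = cities[["x", "y"]].to_numpy()
labels = np.array([0, 0, 1, 2])  # stand-in for the cluster labels (assumed)

cities["cluster"] = labels

# Add the clustering input back in the same positional way as the labels
cities["x_check"] = X[:, 0]
cities["y_check"] = X[:, 1]

# If the order is intact, the largest absolute difference is 0
max_diff = max(
    (cities["x"] - cities["x_check"]).abs().max(),
    (cities["y"] - cities["y_check"]).abs().max(),
)
print(max_diff)
```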

El Burro
  • Thank you for your answer. So it's basically two lists in the same order that get joined again. I thought there was a little bit more _intelligence_ behind it. If I understand it correctly, then my idea with `cluster0 = cities[foo.labels_ == 0]` would return a wrong result, or am I wrong? – swas Mar 28 '18 at 10:23
  • Well, not a wrong result, but only the subset where this is the case. If you do this, the order is broken and you cannot proceed easily. To my knowledge, you cannot assign a list to a DataFrame with a different length. – El Burro Mar 28 '18 at 11:03
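The length restriction mentioned in the last comment can be seen in a small sketch. The short list below is a made-up example of labels that no longer match the number of rows.

```python
import pandas as pd

cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "x": [1, 1, 5, 8],
    "y": [1, 1, 5, 8],
})

# Hypothetical label list with only two entries for four rows
too_short = [0, 0]

error_message = None
try:
    cities["cluster"] = too_short
except ValueError as err:
    # pandas refuses the assignment because the lengths differ
    error_message = str(err)

print(error_message)
```

So the labels can only be attached while they still have one entry per row of the DataFrame, which is why adding the cluster column before filtering is the simpler order of operations.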