Questions tagged [dbscan]

DBSCAN stands for density-based spatial clustering of applications with noise. It is a popular density-based cluster analysis algorithm.

It is called density-based because it finds a number of clusters starting from the estimated density distribution of the corresponding nodes. DBSCAN is one of the most common clustering algorithms and one of the most cited in the scientific literature. OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the ε parameter with a maximum search radius.

See also Wikipedia.
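As a rough illustration of the density-based idea in the tag description, here is a minimal pure-Python sketch of DBSCAN. It is not taken from any library; the names `dbscan`, `region_query`, `eps` and `min_pts`, and the toy points, are assumptions chosen for this example. A point with at least `min_pts` neighbors within `eps` (counting itself) is treated as a core point, clusters grow outward from core points, and anything unreachable is labeled noise.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

NOISE = -1  # label used here for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep growing
    return labels


# Toy data (made up for illustration): two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The sketch uses a brute-force neighbor search, so it is quadratic in the number of points. It also shows why clusters need not be convex: a cluster is a chain of overlapping eps-neighborhoods, not a single ball of radius eps.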

75 questions
11
votes
1 answer

kNN distance plot for determining eps of DBSCAN

I would like to use the kNN distance plot to figure out which eps value I should choose for the DBSCAN algorithm. Based on this page: The idea is to calculate the average of the distances of every point to its k nearest neighbors. …
Marc Lamberti
5
votes
1 answer

How to plot/visualize clusters in scikit-learn (sklearn)?

I have done some clustering and I would like to visualize the results. Here is the function I have written to plot my clusters: import sklearn from sklearn.cluster import DBSCAN from sklearn import metrics from sklearn.preprocessing import…
makansij
5
votes
2 answers

How are clusters from DBSCAN sometimes non-convex?

I've been using clustering in my bag of ML techniques for quite some time now, and I've never found a satisfying answer to this question. In DBSCAN, we define a maximum radius with which to form clusters. The algorithm will scan the space and…
makansij
5
votes
1 answer

Python clustering and labels

I'm currently experimenting with scikit and the DBSCAN algorithm, and I'm wondering how to combine the data with the labels to write them into a new file. I'd also like to understand how the labels array is used to filter the examples. Please…
swas
4
votes
1 answer

Clustering pair-wise distance dataset

I have generated a dataset of pairwise distances as follows: id_1 id_2 dist_12 id_2 id_3 dist_23 I want to cluster this data so as to identify the pattern. I have been looking at Spectral clustering and DBSCAN, but I haven't been able to come to a…
t_m
4
votes
2 answers

How do we interpret the outputs of DBSCAN clustering?

I am starting to learn DBSCAN for clustering but the interpretation part of it seems to be tricky to understand. dataset = np.vstack((quotient_times, quotient)).T scaler = StandardScaler() dataset = scaler.fit_transform(dataset) db_scan =…
Brown
4
votes
1 answer

How to use precomputed distance matrix and min_sample for DBSCAN clustering method?

I want to perform DBSCAN on my datapoints, but I don't have access to the data, I just have the pairwise distance of datapoints. Additionally, I have no idea about the number of clusters but I do want that each cluster contains at least 40 data…
Nshn
2
votes
0 answers

Estimate eps value in DBSCAN using KNN algorithm

I would like to estimate the best eps value for the DBSCAN algorithm on this dataset by following this set of rules: Set a minPts: 10 Compute the reachability distance of the 10-th nearest neighbour for each data-point. Sort the set of reachability…
2
votes
1 answer

Understanding and finding the best eps value for DBSCAN

I'm trying to run the DBSCAN algorithm on this .csv. In the first part of my program I load it and plot the data inside it to check its distribution. This is the first part of the code: import csv import sys import os from os.path import join from…
2
votes
1 answer

DBSCAN on textual and numerical columns

I have a dataset which has two columns: title price sentence1 12 sentence2 13 I have used doc2vec to convert the sentences into vectors of size 100 as below: LabeledSentence1 = gensim.models.doc2vec.TaggedDocument all_content = [] j=0 for…
Jazz
2
votes
2 answers

Nice real data sets for testing DBSCAN?

I'm looking for real datasets on which I could test my DBSCAN algorithm implementation, that is, a dataset of points in (ideally 2-dimensional) space, or a set of nodes and info about the distances between them. I have looked on SNAP and CRAWDAD…
math_lover
2
votes
1 answer

Types of artificial anomalies

I am working on some algorithms for anomaly detection. The dataset is clean of anomalies, so I want to add some artificial anomalies. I have added some anomalies. I get the maximum value of the dataset and add 20-25%, meaning these added anomalies…
E199504
2
votes
1 answer

How to use a cosine distance matrix for clustering algorithms like mean-shift, DBSCAN, and OPTICS?

I am trying to compare different clustering algorithms for my text data. I first calculated the tf-idf matrix and used it for the cosine distance matrix (cosine similarity). Then I used this distance matrix for K-means and Hierarchical clustering…
Piyush Ghasiya
2
votes
0 answers

Estimating minPts in DBSCAN for document layout clustering

I am trying to choose parameters for the DBSCAN clustering algorithm, in particular minPts. The Wikipedia article suggests a rule of thumb to derive minPts from the number of dimensions D in the data set: minPts >= D + 1. For larger datasets, with much…
dzieciou
2
votes
2 answers

Is it safe to use labels created from unsupervised model to train a supervised model using the same data?

I have a dataset where I have to detect anomalies. Now, I use a subset of the data (let's call that subset A) and apply the DBSCAN algorithm to detect anomalies on set A. Once the anomalies are detected, using the DBSCAN labels I create a label…