
I am trying to compare different clustering algorithms on my text data. I first calculated the tf-idf matrix and used it to build a cosine distance matrix (1 minus the cosine similarity). Then I used this distance matrix for K-means and hierarchical clustering (Ward linkage, plotted as a dendrogram). I now want to use the same distance matrix for mean-shift, DBSCAN, and OPTICS.

Below is the part of the code showing the distance matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

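%time tfidf_matrix = tfidf_vectorizer.fit_transform(Strategies)  # fit the vectorizer to the documents in Strategies (%time is an IPython magic)


terms = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn versions before 1.0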

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
print(dist)

I am new to both Python and clustering. I found code for K-means and hierarchical clustering and tried to understand it, but I cannot apply it to other clustering algorithms. It would be very helpful to get a simple explanation of each clustering algorithm and of how this distance matrix can be used (if possible) with each of them.

Thanks in advance!

Piyush Ghasiya

1 Answer


Several scikit-learn clustering algorithms can be fit using cosine distances:

from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import load_iris

# Define sample data
iris = load_iris()
X = iris.data

# List clustering algorithms that accept a `metric` argument
algorithms = [DBSCAN, OPTICS]  # MeanShift does not use a metric

# Fit each clustering algorithm and store the fitted estimator by class name
results = {}
for algorithm in algorithms:
    results[algorithm.__name__] = algorithm(metric='cosine').fit(X)
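
If you would rather reuse a distance matrix you have already computed (like `dist` in the question), DBSCAN and OPTICS also accept `metric='precomputed'`. The following is only a minimal sketch: the `docs` list is a made-up stand-in for the real documents, and the `eps` and `min_samples` values are illustrative guesses that would need tuning on real data. The clipping step is there because `1 - cosine_similarity(...)` can produce tiny negative values from floating-point rounding, which a precomputed distance matrix should not contain.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the asker's documents
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "cats like sitting on mats",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
    "the stock market closed lower",
]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Square (n_docs x n_docs) cosine distance matrix, as in the question
dist = 1 - cosine_similarity(tfidf_matrix)
dist = np.clip(dist, 0.0, None)  # remove tiny negative values caused by rounding

# eps and min_samples are illustrative values, not recommendations
dbscan_labels = DBSCAN(eps=0.7, min_samples=2, metric="precomputed").fit_predict(dist)
optics_labels = OPTICS(min_samples=2, metric="precomputed").fit_predict(dist)

print(dbscan_labels)  # one label per document; -1 marks noise
print(optics_labels)
```

With `metric='precomputed'` the estimator is fit on the square distance matrix rather than on the tf-idf features, so `dist` takes the place of `X`.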
Brian Spiering
  • Thanks for the fast reply but I am getting an error. NameError: name 'clustering_algorithms' is not defined. Also, what would be X? Where I would be using 'dist' which I have calculated (in my code)? Please can you elaborate a little more? – Piyush Ghasiya Mar 05 '20 at 03:23
  • I had a typo; I fixed it. `X` is the standard name for a data array in scikit-learn. You don't need `dist`, use `cosine_distances` instead. – Brian Spiering Mar 05 '20 at 06:08
  • Does that mean that I should replace X with tfidf_matrix (as visible from my code above)? When I did that I again got an error: TypeError: __init__() got an unexpected keyword argument 'metric'. – Piyush Ghasiya Mar 05 '20 at 07:21
  • Sorry for my naive questions. – Piyush Ghasiya Mar 05 '20 at 07:26
  • Got `ValueError: Expected 2D array, got 1D array instead` while working with `DBSCAN`, changing `metric=cosine_distances` to `metric='cosine'` worked. – hafiz031 May 27 '21 at 05:04
  • This does not work for MeanShift. – Sep Jan 19 '22 at 14:42
  • Thanks for pointing that out. I have updated my answer. – Brian Spiering Jan 19 '22 at 22:54
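
On the MeanShift point raised in the comments: `MeanShift` has no `metric` parameter and always works with Euclidean distance on the feature matrix itself, so it cannot take a distance matrix. One possible workaround, sketched below under assumptions, is to scale every row to unit L2 length first. For unit vectors the squared Euclidean distance equals 2 × (1 − cosine similarity), so Euclidean neighbourhoods then follow the same ordering as cosine distance. The random matrix `X` is a hypothetical stand-in for a dense version of the tf-idf matrix (for example `tfidf_matrix.toarray()`, assuming it fits in memory), and the default bandwidth estimate is used only for illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Hypothetical dense, non-negative feature matrix standing in for tf-idf features
X = rng.random((30, 5))

# Scale each row to unit L2 length so Euclidean distance tracks cosine distance
X_unit = normalize(X, norm="l2")

# MeanShift takes the feature matrix directly; the bandwidth is estimated automatically
labels = MeanShift().fit_predict(X_unit)

print(labels)
```

`TfidfVectorizer` already applies L2 normalisation by default (`norm='l2'`), so the explicit `normalize` call mainly matters if that default was changed.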