I am working on a time series clustering problem. I built two hierarchical clustering models, each with a different pre-processing technique, using this class from the dtaidistance package:

clustering.LinkageTree(dtw.distance_matrix_fast, {})

I am now studying the differences between the two models, and I am having trouble comparing the resulting classes, since class 1 in model 1 is not necessarily class 1 in model 2.

My question: how can I match the classes between the two models?

Desired output: [image]

1 Answer

One possible way is to build a cross-tabulation (contingency table) of the two label assignments:

import pandas as pd

x = pd.DataFrame()
x['clustering1'] = [1,1,1,1,1,0,0,0,0,0,0,2,2,2]
x['clustering2'] = [2,2,2,2,1,1,1,1,1,1,3,3,3,3]

pd.crosstab(x['clustering1'], x['clustering2'])

clustering2  1  2  3
clustering1
0            5  0  1
1            1  4  0
2            0  0  3

Then, for each row (a label from clustering1), select the column (a label from clustering2) with the largest value.
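A minimal sketch of that selection step, reusing the example labels from above. The names `table`, `mapping` and `clustering1_as_2` are only illustrative. Note that `idxmax(axis=1)` returns the first maximum in each row, so ties are resolved silently.

```python
import pandas as pd

x = pd.DataFrame({
    "clustering1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 2],
    "clustering2": [2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3],
})

# Rows are clustering1 labels, columns are clustering2 labels
table = pd.crosstab(x["clustering1"], x["clustering2"])

# For each row, take the column label holding the largest count
mapping = table.idxmax(axis=1).to_dict()

# Express clustering1 in terms of clustering2's labels
x["clustering1_as_2"] = x["clustering1"].map(mapping)

# Number of samples on which the two clusterings agree after relabelling
n_agree = int((x["clustering1_as_2"] == x["clustering2"]).sum())
```

After relabelling, the two label columns can be compared directly, for instance by counting the samples on which they agree (`n_agree`).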

This approach is not a silver bullet, since:

  1. You could end up with two (or more) maxima in the same row or column, which typically means that two (or more) clusters from one clustering are embedded in a single cluster of the other clustering.
  2. In general, you will not always get a good correspondence between cluster labels. For example, imagine clustering social network users: one clustering algorithm might find social groups corresponding to hobbies (e.g. sports, movies), while the other finds groups of people from the same universities.
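The first caveat can be checked mechanically before trusting a mapping. The sketch below uses made-up labels (they are not from the question) in which two clusters of `a` fall inside one cluster of `b`, and a third cluster is split evenly. The names `collisions` and `tied_rows` are hypothetical.

```python
import pandas as pd

# Made-up labels: clusters 0 and 1 of `a` both sit inside cluster 1 of `b`,
# and cluster 2 of `a` is split evenly between clusters 2 and 3 of `b`
a = pd.Series([0, 0, 0, 1, 1, 1, 2, 2, 2, 2], name="a")
b = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 3], name="b")

table = pd.crosstab(a, b)
best = table.idxmax(axis=1)  # first maximum per row

# Columns chosen by more than one row: several `a` clusters map to one `b` cluster
counts = best.value_counts()
collisions = counts[counts > 1].index.tolist()

# Rows whose maximum occurs in more than one column: the match is ambiguous
is_max = table.eq(table.max(axis=1), axis=0)
n_max = is_max.sum(axis=1)
tied_rows = n_max[n_max > 1].index.tolist()
```

A non-empty `collisions` or `tied_rows` list suggests the row-wise maximum does not give a one-to-one relabelling, so the mapping should be inspected by hand.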
Anvar Kurmukov