I have a dataset of transactional data with customer IDs, and I want to segment it into groups using cluster analysis. I'm interested in following the evolution of each cluster over time, but customers behave very differently from week to week (roughly 50% of the time a customer will change cluster the week after), so I was wondering what a statistically sound approach would be. Is it a good idea to train a clustering algorithm every week and look backwards at the weekly evolution of each segment?
Which clustering technique are you using to generate the clusters? – pseudoabdul Jan 12 '22 at 02:42
I used k-means but my question was not just limited to that. – Egodym Jan 12 '22 at 03:04
k-means will work because the number of clusters is fixed. One option is to run k-means every week over the data. You will have generated time series which you can then analyze. For example, you could plot the size of your largest cluster over time. – pseudoabdul Jan 12 '22 at 06:40
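A minimal sketch of the weekly approach suggested in this comment, using a plain-Python Lloyd's k-means so it runs without extra libraries. The `weekly_points` data is synthetic and the two features are hypothetical (think of something like weekly spend and purchase count). One caveat: k-means label IDs are arbitrary, so cluster 0 in one week need not correspond to cluster 0 in the next. That is why the sketch tracks a label-independent quantity, the size of the largest cluster.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random data points as initial centroids

    def assign(p):
        # Index of the centroid closest to p (squared Euclidean distance).
        return min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

    for _ in range(n_iter):
        labels = [assign(p) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centroids[j])
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, [assign(p) for p in points]

# Synthetic stand-in for real data: one list of 2-D feature vectors per week.
rng = random.Random(1)
weekly_points = {}
for week in range(4):
    low = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20 + 5 * week)]
    high = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(30)]
    weekly_points[week] = low + high

# Cluster each week separately and record the size of the largest cluster.
largest = {}
for week, pts in weekly_points.items():
    _, labels = kmeans(pts, k=2)
    largest[week] = max(labels.count(j) for j in range(2))

print(largest)  # one value per week, ready to plot as a time series
```

In practice a library implementation such as scikit-learn's `KMeans` would replace the hand-written function; the weekly loop stays the same.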
4 Answers
You can try:
- Dynamic mode decomposition.
- Dynamic time warping. I found a nice resource on the Towards Data Science blog.
Both have proven to be better approaches than PCA for time series clustering.
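To make the second suggestion concrete, here is a minimal sketch of the classic dynamic time warping distance between two numeric series, written in plain Python. Treating each customer as a series of weekly values (for example, weekly spend) is an assumption made for illustration. The resulting pairwise distances could then feed a distance-based clustering method.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Uses absolute difference as the local cost and the standard
    O(len(a) * len(b)) dynamic programme with no window constraint.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] matched again
                                 cost[i][j - 1],      # b[j-1] matched again
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Two series with the same shape but different timing have distance 0.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

Unlike a pointwise Euclidean distance, this lets two customers with the same pattern shifted by a week or two count as similar.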
Happy coding
Maybe what you are looking for is the Rand index?
This "is a measure of the similarity between two data clusterings". In other words, if the RI stays close to 1 when you repeat the clustering over a time window, then your segments are stable.
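A minimal sketch of the Rand index in plain Python, computed as the fraction of customer pairs on which two clusterings agree (both put the pair together, or both put it apart). It assumes the two label lists cover the same customers in the same order, for example the labels from two consecutive weeks restricted to customers present in both.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    if len(labels_a) < 2:
        raise ValueError("need at least two items to form a pair")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Same partition with the label names swapped: perfectly stable.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because it only compares pairs, the index is unaffected by cluster label IDs changing between runs. The adjusted Rand index, which corrects for chance agreement, is a common variant; scikit-learn provides both as `rand_score` and `adjusted_rand_score`.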
Cluster once.
Study the clusters and refine them to define classes.
Then classify points to these classes.
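One simple way to read the last step is nearest-centroid classification: once the clusters have been studied and turned into named classes, each new observation goes to the class whose centroid is closest. The sketch below assumes that reading; the class names, centroid values, and features are hypothetical.

```python
def nearest_class(point, class_centroids):
    """Return the name of the class whose centroid is closest to `point`."""
    return min(
        class_centroids,
        key=lambda name: sum((p - c) ** 2
                             for p, c in zip(point, class_centroids[name])),
    )

# Hypothetical classes obtained from a one-off clustering, then frozen.
class_centroids = {
    "low_spend": (1.0, 1.0),
    "high_spend": (10.0, 8.0),
}

# Classify each week's observations against the same fixed classes.
print(nearest_class((2.0, 0.5), class_centroids))  # low_spend
print(nearest_class((9.0, 9.0), class_centroids))  # high_spend
```

Since the classes are fixed, their meaning does not drift from week to week, so counting customers per class over time gives directly comparable series.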
Thanks. Any reference to dive deeper? My concern is whether clustering 2 years of monthly data once would yield different results than clustering each month separately and then looking at the results. – Egodym May 14 '20 at 22:28
Run clustering periodically (say, every month). Use the elbow method to choose the number of clusters, and be open to that number changing over time. Define and label what each cluster represents: the centroid of each cluster represents the average behavior of the customers within it.