
My data is a set of 10,000 points (each with a node location (x, y)) spread across a plane. They are also chromatically colored based on their weight.

I need to finalize a Bayesian nonparametric clustering method that groups points mainly by weight, but also by distance: that is, clusters by definition have some relevance to distance, but there are clear topological distinguishing factors between the first quarter and the last quarter of the data (I say quarter as an arbitrary amount; in reality, the exact number and topology of the clusters change through iterations).

As you can see in the picture above, I've tried to use Notability to create a crude chromatically colored image of the data with varying cluster topology types. Over each iteration of my algorithm, the clusters, as mentioned, change location (based on their weight) and shape, and some overlap; the possibility of new clusters forming (or of a decrease in the total number) is very high per iteration. This image represents one iteration of x points.

Additionally, since I am doing this analysis in Python, I was thinking of using t-SNE as a substitute for a generic clustering method, but I have only limited knowledge of its functionality. Also, since my data is based on the same weighted scale, it may be overkill.

EDIT: I changed the picture to show overlapping clusters, so it is clearer what I mean. However, keep in mind that even these visible clusters are not homogeneous in weight (they still vary, but within a small threshold). Sure, there is noise, but I really want to treat each cluster independently to see each cluster's behavior over time (as well as clusters that are newly formed, hence the nonparametric method).

  • Did you try DBSCAN? – Itamar Mushkin Jul 13 '20 at 11:09
  • What did you try? – Itamar Mushkin Jul 13 '20 at 11:10
  • @ItamarMushkin I did try DBSCAN alone, but the big issue is that DBSCAN doesn't understand classifications by weight. Weight does not directly correlate with importance, though: that is, I need to cluster high weights together and low weights together (and both are equally important), but almost always different clusters with different topologies overlap into one big cluster (so DBSCAN puts them together). Because of this, I have tried manually separating the values by weight and THEN doing DBSCAN, but this is extremely tedious over thousands of iterations. What do you recommend I do? – ChessGrandMaster Jul 13 '20 at 17:31
  • @ItamarMushkin It is because of this that I might have to resort to an unnecessarily complicated algorithm like t-SNE for relatively basic and not heavily varied data (after all, the only variation is based on weight and the points' distance to one another), which I really don't want to do. – ChessGrandMaster Jul 13 '20 at 17:34
  • @ItamarMushkin To make what I'm saying clearer, I updated the image as well – ChessGrandMaster Jul 13 '20 at 17:42
  • Can you just treat 'weight' as another feature? If you want to cluster the highs with the highs and the lows with the lows, it's not conceptually different than clustering east with east and west with west – Itamar Mushkin Jul 13 '20 at 17:59
  • @ItamarMushkin True, but the issue with that is how do I standardize which weights can be clustered together if their variability changes per iteration: meaning that if you put all weights on a 0-1 scale, the 0-0.2 range might be a Klein-bottle-like topological structure in one iteration, but after 10 more iterations the cluster weight group is 0-0.1 or 0-0.3 instead? Is there a computationally non-complex machine learning algorithm that can recognize these changes (simply because the groupings of weight directly correlate with the density/topology)? – ChessGrandMaster Jul 13 '20 at 18:25
  • You could manipulate weight to push groups far apart by adding a constant to the ones in each group that you know you want to be separate. That'll save you from your iterations. I like DBSCAN, but I love OPTICS. However, the only decent implementation that I ever use is in a Java app called ELKI; I didn't like the other implementations very much. – Josh Jul 13 '20 at 19:54
  • @Josh I would do that, but all weighted values are equally important (for different reasons), so I can't just take some out and then do the process. However, as a start, if I were to do this, I'd still need to differentiate what is part of the darkest cluster (around 0.8-1, but that range changes per iteration) and what isn't. Do you have any advice on how to do that? – ChessGrandMaster Jul 14 '20 at 21:32
  • Okay, so I'm starting to understand the question a little better. You're iterating over time and clusters in each iteration are sometimes different and sometimes the "same" from previous iterations and you're trying to tie them together so you can monitor them over time? How would you know if in iteration 2 a cluster is the same one shifted, instead of a new one that appeared slightly to the left? – Josh Jul 14 '20 at 21:53
  • @Josh I am not combining data from different iterations. Rather, each iteration produces something like the image posted (where there are x clusters and y regions of clusters overlap each other per iteration, and x and y are very dynamic/fluid). If I use density-based methods alone, they put overlapping clusters together into one cluster without regard for weight. If I make weight an additional dimension, this also does not work, because I would have to manually change the eps in the x/y directions and separately for z for each iteration, which isn't feasible (a sketch of this weight-as-extra-dimension idea, with the scaling it requires, follows these comments). – ChessGrandMaster Jul 15 '20 at 07:04
  • Therefore, what I could use is a method that first automatically differentiates groups by weight and then clusters those groups separately (some groups may/should have only one cluster [namely the higher-weight ones in the center], while the lower-weight ones should have many dispersed clusters). Even in doing this, the first part is tough because weight and density should be simultaneous factors in forming groups (density algorithms would split these weight/density-based groups into even further nuanced clusters). So, weight + density -> groups; then density alone -> clusters in each group @Josh – ChessGrandMaster Jul 15 '20 at 07:15
  • @Josh Any thoughts? – ChessGrandMaster Jul 16 '20 at 17:58
  • Sorry @ChessGrandMaster, I don't have much to offer. I don't believe I fully understand the situation well enough to advise. It seems like you know what you're after, though, which is a density-based algorithm. As long as you pay attention to how you standardize the scales of each variable, I think you're on your way. To your original point - I'd avoid PCA or t-SNE to reduce variables unless you're absolutely certain that what they produce is what you want. – Josh Jul 17 '20 at 13:38
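
For reference, below is a minimal sketch of the "weight as another feature" idea discussed in the comments, assuming scikit-learn is available; the placeholder arrays and the weight_emphasis factor are illustrative assumptions rather than anything from the question. Standardizing all three axes lets a single eps apply to every dimension, which is one way around the per-axis eps problem raised above.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    # Placeholder data: 10,000 (x, y) locations plus a weight per point.
    rng = np.random.default_rng(0)
    xy = rng.uniform(0, 100, size=(10_000, 2))
    weight = rng.uniform(0, 1, size=10_000)

    # Standardize all three features so a single eps applies to every axis,
    # then up-weight the weight axis so it dominates the distance metric.
    weight_emphasis = 3.0  # hypothetical tuning knob, not from the thread
    features = StandardScaler().fit_transform(np.column_stack([xy, weight]))
    features[:, 2] *= weight_emphasis

    labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(features)  # -1 = noise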

3 Answers


I would look at fuzzy c-means ("fuzzy C") clustering.

This type of clustering is "soft" in that it provides a likelihood of a given point belonging to a given cluster, based upon weights, etc.

Below are some links to get into the weeds a little...

Towards Data Science, Wikipedia and the Python docs.
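
As a hedged illustration (not part of the original answer), here is a minimal sketch of fuzzy c-means using the scikit-fuzzy package, which is presumably what the Python docs link above refers to; the placeholder data, the number of clusters c, and the fuzzifier m are illustrative values to tune.

    import numpy as np
    import skfuzzy as fuzz

    # Placeholder data: (x, y) locations plus weight, shaped (n_features, n_samples)
    # as scikit-fuzzy expects.
    rng = np.random.default_rng(0)
    xy = rng.uniform(0, 100, size=(10_000, 2))
    weight = rng.uniform(0, 1, size=(10_000, 1))
    data = np.hstack([xy, weight]).T

    # c (number of clusters) and m (fuzzifier) are illustrative, not prescribed.
    centers, membership, *_ = fuzz.cluster.cmeans(
        data, c=5, m=2.0, error=1e-4, maxiter=1000
    )

    # membership[k, i] is the degree to which point i belongs to cluster k;
    # take the argmax for hard labels, or keep it soft to study overlaps.
    hard_labels = membership.argmax(axis=0)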

– 7royboy (edited by Stephen Rauch)

It looks like you have already decided that you need a Bayesian nonparametric approach, so why not start with the Dirichlet process and see if the results are satisfactory. You don't mention why you need a Bayesian nonparametric approach, so I am not sure of the entire background here.
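
For a concrete starting point in Python, one option (my assumption, not named in the answer) is scikit-learn's truncated Dirichlet process Gaussian mixture, BayesianGaussianMixture; the placeholder data and the truncation level below are illustrative only.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.mixture import BayesianGaussianMixture

    # Placeholder data: (x, y) locations plus weight as a third feature.
    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.uniform(0, 100, size=(10_000, 2)),  # x, y
        rng.uniform(0, 1, size=10_000),         # weight
    ])
    X = StandardScaler().fit_transform(X)

    # n_components is only an upper bound (truncation level); the Dirichlet
    # process prior leaves unneeded components with near-zero weight, so the
    # effective number of clusters can change from iteration to iteration.
    dpgmm = BayesianGaussianMixture(
        n_components=30,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=500,
        random_state=0,
    )
    labels = dpgmm.fit_predict(X)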


One option is spectral clustering, which can find the "connectedness" in data.
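
A minimal sketch with scikit-learn's SpectralClustering, assuming weight is folded in as a third feature; note that spectral clustering does not infer the number of clusters, so the n_clusters value below is an illustrative guess, as are the affinity settings.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import SpectralClustering

    # Placeholder data, kept small because the affinity graph scales with N^2.
    rng = np.random.default_rng(0)
    X = np.column_stack([
        rng.uniform(0, 100, size=(2_000, 2)),  # x, y
        rng.uniform(0, 1, size=2_000),         # weight
    ])
    X = StandardScaler().fit_transform(X)

    labels = SpectralClustering(
        n_clusters=8,                  # illustrative; not inferred by the algorithm
        affinity="nearest_neighbors",  # sparse k-NN graph captures "connectedness"
        n_neighbors=15,
        assign_labels="kmeans",
        random_state=0,
    ).fit_predict(X)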

– Brian Spiering (edited by Stephen Rauch)