Visualizing the difference of a set of strings

Question

I have a distance metric defined on a collection of strings, on the order of tens of thousands of them. What would be an intuitive way to summarize how 'different' these strings are, or when they overlap?

My goal is to visually check that the collection has high entropy, and to be able to recognize clustered regions and the strings associated with them.

I envision a kind of clustering plot where some radius around each string captures its neighbors, but this requires a meaningful coordinate system.

score 0 · Answer 1 · answered Dec 04 '22 at 10:43

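One approach is to use a dimensionality reduction algorithm, such as t-SNE, to map the strings into a lower-dimensional space. Since you already have a distance metric, you can feed the pairwise distances to the algorithm directly instead of inventing feature vectors for the strings, and the output gives you the coordinate system you are missing. You can then draw the strings in a 2D or 3D plot, where similar strings end up close together. Keep in mind that t-SNE mainly preserves local neighborhoods: tight groups in the plot are meaningful, but the gaps between far-apart groups and the apparent sizes of the groups should not be read as exact distances. Look for clusters and dense regions to identify groups of similar strings and to judge how much they overlap.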
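
A minimal sketch of the embedding step, assuming the distances fit in a full symmetric matrix. t-SNE is too long to reproduce here, so this swaps in classical multidimensional scaling (MDS), a simpler technique that also turns pairwise distances into 2D coordinates. The function name `classical_mds` and its parameters are hypothetical, not taken from any library. Pure Python like this will not scale to tens of thousands of strings; in practice a library would be the usual route (scikit-learn's `TSNE`, for example, accepts `metric="precomputed"` together with `init="random"`).

```python
import math
import random


def classical_mds(dist, dims=2, iters=200, seed=0):
    """Embed n items in `dims` dimensions from an n x n distance matrix.

    Hypothetical helper: classical MDS via power iteration with deflation.
    Assumes `dist` is symmetric with zeros on the diagonal.
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]

    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    def mat_vec(v):
        return [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration converges to the dominant eigenvector of B.
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = mat_vec(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # nothing left to explain (e.g. all items identical)
                break
            v = [x / norm for x in w]

        w = mat_vec(v)
        lam = sum(v[i] * w[i] for i in range(n))  # eigenvalue estimate
        # Negative eigenvalues can appear for non-Euclidean metrics; clamp them.
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][d] = v[i] * scale

        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords
```

Each row of the result is an (x, y) pair that can be passed to any scatter-plot tool, with the original string attached as a label or hover text.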

Another approach is to use a clustering algorithm to group the strings by similarity. K-means is the usual first choice, but it works by averaging points, so it needs numeric coordinates (for example the output of the dimensionality reduction step); if all you have is a distance metric, a method that works directly on pairwise distances, such as k-medoids or hierarchical clustering, is a better fit. Either way you get groups of similar strings and can judge how much they overlap. You can then show the clusters on a scatter plot, with each cluster in a different color, to see how the strings are distributed.
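
A minimal sketch of clustering from distances alone. Because K-means needs coordinates, this uses k-medoids instead, which only needs the distance matrix and has the side benefit that every cluster is represented by an actual string. Levenshtein edit distance stands in for your own metric, and `levenshtein`, `assign` and `k_medoids` are hypothetical helper names, not library functions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (stand-in for the asker's metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Alternate between assigning items and re-picking medoids."""
    n = len(dist)
    # Deterministic farthest-point initialization, starting from item 0.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))

    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # empty cluster: keep the old medoid
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member closest to all other members.
            new_medoids.append(
                min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(dist, medoids), medoids


strings = ["kitten", "sitten", "mitten", "flower", "flowed", "glower"]
dist = [[levenshtein(s, t) for t in strings] for s in strings]
labels, medoids = k_medoids(dist, k=2)
for c, m in enumerate(medoids):
    group = [s for s, lab in zip(strings, labels) if lab == c]
    print(f"cluster {c} (medoid {strings[m]!r}): {group}")
```

The labels can be used directly as colors in a scatter plot, and the medoid strings give a readable name for each cluster. The number of clusters `k` has to be chosen by hand here.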
