How to evaluate the similarity of two columns containing strings?

Question

I am new to text processing and stuck on the problem of measuring the similarity of two columns. To illustrate, suppose we have two columns with string values:

Column A      |        Column B
-------------------------------
abcd          |          xyz
foo           |          bar
xyzzy         |          acct
xyz           |          world
onex          |          foo
...           |          ...
...           |          ...

Each column can contain on the order of thousands of values. Is there an approach to measure how similar the columns are?

Currently, I am creating MinHash signatures for both columns and computing the Jaccard similarity between the signatures. The problem is that the similarity scores come out too low, even for columns that have a considerable overlap of values.

Then I tried building the signatures from only a fraction of the values, namely the most frequently occurring ones, but that does not seem to help either.

Is there any other approach to work on this?

Maybe revise your code... MinHash seems completely redundant for an order of thousands of strings. If you can see a "considerate overlap" Jaccard similarity should show it too. — Valentas, Nov 12 '21 at 09:27
Do you want to take into account the string similarity between pairs of strings, or just the number of strings in common? — Erwan, Nov 12 '21 at 11:45

Peter · Answer 1 · 2021-11-19T08:04:25.953

You could use similarity metrics for strings. There are a number of "off the shelf" packages to compare string similarity, such as stringdist for R.

For instance, the stringsim function returns a similarity score between 0 and 1 for a pair of strings, and its method argument lets you choose between different metrics.

Example (in R):

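library(stringdist)

# Default method is "osa" (optimal string alignment); the score is scaled to [0, 1]
stringsim("cat", "catfish")
# [1] 0.4285714

# Also works with vectors: element i of a is compared with element i of b
df <- data.frame(a = c("cat", "dog", "tree"), b = c("catfish", "hotdog", "forest"))

# Jaccard similarity on the sets of single characters (q = 1 by default)
stringsim(df$a, df$b, method = "jaccard")
# [1] 0.4285714 0.6000000 0.5000000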
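
Note that these scores compare row i of one column with row i of the other. If what matters is how many values the two columns share regardless of row order (the distinction raised in the comments), an exact Jaccard similarity on the sets of unique values is cheap at a few thousand strings and needs no MinHash. The following is a minimal base R sketch under that assumption; the helper names jaccard_columns and overlap_coefficient are made up for illustration, and the data is the sample from the question.

```r
# Hypothetical helper: exact Jaccard similarity between the sets of
# unique, non-missing values in two columns.
jaccard_columns <- function(a, b) {
  a <- unique(a[!is.na(a)])
  b <- unique(b[!is.na(b)])
  u <- union(a, b)
  if (length(u) == 0) return(NA_real_)  # both columns empty
  length(intersect(a, b)) / length(u)
}

# Hypothetical helper: overlap coefficient, |A and B| / min(|A|, |B|).
# Unlike Jaccard, it is not pulled down when one column is much larger.
overlap_coefficient <- function(a, b) {
  a <- unique(a[!is.na(a)])
  b <- unique(b[!is.na(b)])
  if (length(a) == 0 || length(b) == 0) return(NA_real_)
  length(intersect(a, b)) / min(length(a), length(b))
}

col_a <- c("abcd", "foo", "xyzzy", "xyz", "onex")
col_b <- c("xyz", "bar", "acct", "world", "foo")

jaccard_columns(col_a, col_b)
# [1] 0.25
overlap_coefficient(col_a, col_b)
# [1] 0.4
```

Here two values ("foo" and "xyz") are shared out of eight distinct values overall, hence 0.25. Jaccard divides by the size of the union, so it can look low even when most of the smaller column is contained in the larger one; the overlap coefficient is one way to check for that.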

Also see this github-repo for fuzzy-matching etc.
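
If near-duplicates such as typos should also count as shared values, one possible column-level sketch uses amatch from the same stringdist package, which looks up each string in a table of candidates within a maximum distance. The helper name fuzzy_coverage and the threshold of one edit are illustrative assumptions, not a recommendation; a sensible threshold depends on the data.

```r
library(stringdist)

# Hypothetical helper: share of unique values in `a` that have a match in `b`
# within `max_dist` edits (OSA distance). Not symmetric in a and b.
fuzzy_coverage <- function(a, b, max_dist = 1) {
  a <- unique(a[!is.na(a)])
  b <- unique(b[!is.na(b)])
  if (length(a) == 0 || length(b) == 0) return(NA_real_)
  # amatch() returns the index of the closest entry in `b`,
  # or NA when nothing lies within maxDist
  hits <- amatch(a, b, method = "osa", maxDist = max_dist)
  mean(!is.na(hits))
}

col_a <- c("abcd", "foo", "xyzzy", "xyz", "onex")
col_b <- c("xyz", "bar", "acct", "world", "foo")

fuzzy_coverage(col_a, col_b)  # how much of column A is covered by column B
fuzzy_coverage(col_b, col_a)  # how much of column B is covered by column A
```

Because the measure is directional, reporting both directions (or their minimum) gives a fuller picture than a single number.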
