I am new to text processing and am stuck on the problem of measuring how similar two columns are. To illustrate, consider two columns of string values:
Column A | Column B
---------|---------
abcd     | xyz
foo      | bar
xyzzy    | acct
xyz      | world
onex     | foo
...      | ...
...      | ...
Each column can contain on the order of thousands of values. Is there an approach to quantify how similar the two columns are?
Currently, I create MinHash signatures for both columns and compute the Jaccard similarity between the signatures. The problem is that the similarity scores come out too low, even for columns that have a considerable overlap in values.
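For concreteness, a minimal sketch of this kind of comparison is below (illustrative only, not the exact code in use). It assumes each column is treated as a set of distinct strings. The helper names `exact_jaccard`, `minhash_signature`, and `estimated_jaccard`, the salted BLAKE2b hashing, and the choice of 128 hash functions are all assumptions made for the example. In this sketch the MinHash estimate is read off as the fraction of signature positions that agree, which is the usual way to estimate Jaccard similarity from two signatures.

```python
import hashlib


def exact_jaccard(col_a, col_b):
    """Exact Jaccard similarity of the distinct values in two columns."""
    a, b = set(col_a), set(col_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty columns are identical
    return len(a & b) / len(a | b)


def _hash(value, seed):
    """Hypothetical helper: 64-bit hash of a string, salted by an integer seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values, num_hashes=128):
    """One minimum per salted hash function, taken over the distinct values."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot build a signature for an empty column")
    return [min(_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures hold the same minimum."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# The five sample rows from the table above.
column_a = ["abcd", "foo", "xyzzy", "xyz", "onex"]
column_b = ["xyz", "bar", "acct", "world", "foo"]

print("exact Jaccard:   ", exact_jaccard(column_a, column_b))  # 0.25
print("MinHash estimate:", estimated_jaccard(minhash_signature(column_a),
                                             minhash_signature(column_b)))
```

On the five sample rows, the two columns share `foo` and `xyz` out of 8 distinct values, so the exact Jaccard similarity is 2/8 = 0.25. The MinHash estimate should land near that value, with some noise that depends on the number of hash functions.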
Then I tried building the signatures from only a fraction of the values, namely the most frequently occurring ones, but that did not help either.
Is there another approach I could try?