Here I'm trying to compute the pairwise similarity between 1000 x 10000 strings (using Levenshtein.ratio). I'm using a symmetric DataFrame so that only n(n-1)/2 comparisons are needed instead of n*n, but even this took a lot of time. Is there a better way to optimise it further?
import time
import random, string
import Levenshtein
import pandas as pd
# random alphanumeric strings of length 10
rand_ls = [''.join(random.choices(string.ascii_letters + string.digits, k=10)) for i in range(1000)]
# an n x n dataframe of zeros, where n = len(rand_ls)
# initialised with 0.0 so the frame holds floats, since Levenshtein.ratio returns values in [0, 1]
df = pd.DataFrame(0.0, index=rand_ls, columns=rand_ls)
s = time.time()
for i in range(len(df)):
    for j in range(len(df)):
        if i > j:  # only compute the lower triangle, then mirror it
            dist = Levenshtein.ratio(rand_ls[i], rand_ls[j])  # similarity ratio in [0, 1]
            df.iloc[i, j] = dist
            df.iloc[j, i] = dist  # matrix is symmetric
e = time.time()
print(e-s)
# took 130 sec for 1000*1000 comparisons
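One variation I'm considering is filling a plain NumPy array inside the loop and only wrapping it in a DataFrame at the end, since the per-cell df.iloc assignments presumably add a lot of overhead on top of Levenshtein.ratio itself. This is just a rough sketch of that idea, not something I've benchmarked properly:

import numpy as np

n = len(rand_ls)
sim = np.zeros((n, n), dtype=float)  # plain float array instead of a DataFrame

s = time.time()
for i in range(n):
    for j in range(i):  # only the lower triangle: n(n-1)/2 calls
        r = Levenshtein.ratio(rand_ls[i], rand_ls[j])
        sim[i, j] = r
        sim[j, i] = r  # mirror into the upper triangle
e = time.time()
print(e - s)

# build the DataFrame once, at the end
df2 = pd.DataFrame(sim, index=rand_ls, columns=rand_ls)

I've also seen rapidfuzz's process.cdist suggested for computing a whole similarity matrix in one call (it takes a workers argument for multi-threading), but I haven't tried it on this data.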