Here I'm trying to compute the pairwise similarity between 1000 x 10000 strings (using Levenshtein.ratio). I'm using a symmetric DataFrame so that only n(n-1)/2 comparisons are needed instead of n*n, but even this took a lot of time. Is there a better way to optimise it further?
import time
import random, string
import Levenshtein
import pandas as pd
# random alphanumeric strings of length 10
rand_ls = [''.join(random.choices(string.ascii_letters + string.digits, k=10)) for i in range(1000)]
# an n x n dataframe of zeros, where n = len(rand_ls)
# initialised with 0.0 so the frame holds floats, since Levenshtein.ratio returns values in [0, 1]
df = pd.DataFrame(0.0, index=rand_ls, columns=rand_ls)
s = time.time()
for i in range(len(df)):
    for j in range(len(df)):
        if i > j:  # only compute the lower triangle, then mirror it
            dist = Levenshtein.ratio(rand_ls[i], rand_ls[j])  # similarity ratio in [0, 1]
            df.iloc[i, j] = dist
            df.iloc[j, i] = dist  # matrix is symmetric
e = time.time()
print(e-s)
# took 130 sec for 1000*1000 comparisons
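One variation I'm considering is filling a plain NumPy array inside the loop and only wrapping it in a DataFrame at the end, since the per-cell df.iloc assignments presumably add a lot of overhead on top of Levenshtein.ratio itself. This is just a rough sketch of that idea, not something I've benchmarked properly:

import numpy as np

n = len(rand_ls)
sim = np.zeros((n, n), dtype=float)  # plain float array instead of a DataFrame

s = time.time()
for i in range(n):
    for j in range(i):  # only the lower triangle: n(n-1)/2 calls
        r = Levenshtein.ratio(rand_ls[i], rand_ls[j])
        sim[i, j] = r
        sim[j, i] = r  # mirror into the upper triangle
e = time.time()
print(e - s)

# build the DataFrame once, at the end
df2 = pd.DataFrame(sim, index=rand_ls, columns=rand_ls)

I've also seen rapidfuzz's process.cdist suggested for computing a whole similarity matrix in one call (it takes a workers argument for multi-threading), but I haven't tried it on this data.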