
I have the following dataframe:

import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)

I want to identify similar names in the name column when those names belong to the same cluster number, and assign a unique id to each such group. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate them with a unique id, say 1. The next cluster is number 2, and five entities from the name column belong to it: Dog, Big Dog, Cat, Fish and Dry Fish. Dog and Big Dog have a high similarity score, so their unique id will be, say, 2. For Cat the unique id will be, say, 3. Finally, for Fish and Dry Fish the unique id will be, say, 4. And so on.

I created the following code for the logic above:

# pip install thefuzz
from thefuzz import fuzz

df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0

i = 1
is_i_used = False
for index, row in df_test.iterrows():
    index_ = index

    # scan forward through rows of the same cluster that have no id yet
    while index_ < len(df_test) and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number'] and df_test.loc[index_, 'id'] == 0:
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_, 'id'] = i
            is_i_used = True
        index_ += 1

    # only advance the id counter if it was actually assigned to some row
    if is_i_used:
        i += 1
        is_i_used = False
                           

The code generates the expected result:

    name         cluster_number  id
0   Beach               1        1
1   South Beach         1        1
2   Big Dog             2        2
3   Cat                 2        3
4   Dog                 2        2
5   Dry Fish            2        4
6   Fish                2        4
7   Ant                 3        5
8   Bird                3        6
9   Dear                4        7

The computation runs for 210 seconds on a dataframe with 1 million rows, where on average each cluster has about 10 rows and the maximum cluster size is about 200 rows. I am trying to understand how to vectorize the code.

Also, the thefuzz module has a process function that allows processing the data all at once:

from thefuzz import process
out = process.extract("Beach", df_test['name'], limit=len(df_test))

But I don't see how it can help speed up the code.

illuminato

2 Answers


The code is so slow because there are nested loops (i.e., the runtime complexity is O(n²)).

One way to remove the inner loop is to hash previously seen answers: instead of iterating through the rows a second time and performing a conditional check on each one, perform a set membership check (i.e., runtime complexity of O(1)).
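
A minimal sketch of that idea, assuming a single pass in row order: a dict of previously seen (cluster, name) pairs resolves exact repeats in O(1), and only genuinely new names fall back to a fuzzy comparison against the representatives already assigned in that cluster. The assign_ids helper is hypothetical, and the id numbering (and, for borderline matches, the grouping) may differ from the original code's output.

from thefuzz import fuzz

def assign_ids(df):
    seen = {}    # (cluster_number, name) -> id; O(1) lookup for previously seen names
    reps = {}    # cluster_number -> list of (name, id) already assigned in that cluster
    next_id = 1
    ids = []
    for name, cluster in zip(df['name'], df['cluster_number']):
        key = (cluster, name)
        if key in seen:                       # exact repeat: no rescan of the frame
            ids.append(seen[key])
            continue
        assigned = None
        for rep_name, rep_id in reps.get(cluster, []):
            if fuzz.ratio(name, rep_name) > 50:
                assigned = rep_id
                break
        if assigned is None:                  # nothing similar seen yet in this cluster
            assigned = next_id
            next_id += 1
        seen[key] = assigned
        reps.setdefault(cluster, []).append((name, assigned))
        ids.append(assigned)
    return ids

df_test['id'] = assign_ids(df_test)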

Additionally, pandas is inherently slower for computation than other options in Python. Switching to NumPy and/or Numba will speed up the code.
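
A rough illustration of the NumPy part of that suggestion, assuming df_test is already sorted by cluster_number. The matching logic itself is elided; as one of the comments notes, Numba would additionally need a similarity metric it can compile, so this only shows replacing the repeated df_test.loc lookups with plain array access.

import numpy as np

# Pull the columns out once as plain arrays; positional indexing into these
# avoids the per-access overhead of df_test.loc inside the loops.
names = df_test['name'].to_numpy()
clusters = df_test['cluster_number'].to_numpy()
ids = np.zeros(len(df_test), dtype=np.int64)

# ... run the same matching logic over names, clusters and ids ...

df_test['id'] = ids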

Brian Spiering
  • I updated the code with a `while` loop instead of a `for` loop. Now the time complexity is `O(n*max_cluster_size)`. It is still too slow for my task. – illuminato Dec 12 '22 at 14:56
  • Do not loop (while or for). Look up the answer in a precomputed data structure, like a set. – Brian Spiering Dec 12 '22 at 15:00
  • NumPy and Numba are only viable for certain text sim metrics, and require the data be preprocessed beforehand in place or on load, which isn't a bad thing but is a consideration – Andy Dec 12 '22 at 16:19

Frame challenge:

Text similarity does not scale nicely, and vectorizing will not yield the performance increases you need here. The specific approach I reach for when I need to scale up text similarity is a pre-clustering step using cheap calculations, saving the expensive text similarity calculations for items within the same pre-cluster. The pre-clustering technique to choose depends on document type and length: for shorter natural language strings like names you could build a hashmap of {name_token: {name_ids_containing_token}}, whereas for longer natural language documents something like minhash may be more appropriate.
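
A hedged sketch of that token-hashmap idea for short strings like these names; whitespace tokenization and the candidates helper are illustrative assumptions. The expensive fuzz.ratio comparison would then only be run between a row and its candidates.

from collections import defaultdict

# Inverted index: (cluster_number, token) -> set of row positions containing that token.
index = defaultdict(set)
for pos, (name, cluster) in enumerate(zip(df_test['name'], df_test['cluster_number'])):
    for token in name.lower().split():
        index[(cluster, token)].add(pos)

def candidates(pos, name, cluster):
    # Rows in the same cluster that share at least one token with `name`;
    # everything else is never compared.
    out = set()
    for token in name.lower().split():
        out |= index[(cluster, token)]
    out.discard(pos)
    return out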

Andy