How to find common patterns in thousands of strings?

Question

I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]

Not like that.

But bigger patterns like this:

If I have to _________, then I will ___.

["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]

I WANT TO DISCOVER NEW PATTERNS LIKE THE ABOVE. I don't already know the patterns.

I want to discover patterns in a large array or database of strings, for example by going over the contents of an entire book.

Example usage of this would be finding the most common sentence structures a book uses.

The goal isn't to create the perfect algorithm or anything. I am willing to do it the brute-force way if need be, as you might when finding common substrings in sentences.

Is there a way to find patterns like this?

It seems like regular expressions would be a good fit for your problem. — noe, Jun 12 '22 at 12:19
The above example is only an example of the type of patterns I want to discover. Regex would only work if I already know the patterns. I want to discover unknown patterns. — Mohit Gangrade, Jun 12 '22 at 12:28
crosspost at https://stackoverflow.com/questions/72591638/how-to-find-common-patterns-in-thousands-of-strings — milahu, Nov 26 '22 at 08:36

score 1 · Answer 1 · answered Jun 13 '22 at 17:28

It's not easy, especially if you want any kind of pattern with a varying number of words at any distance from each other.

The closest method I know would be to compute a huge co-occurrence matrix with n-grams:

1. Extract all the possible $n$-grams with size $n\leq N$ (for instance $N=3$).
2. Filter out the least frequent ones. Depending on the size of the data, the frequency threshold should be high enough to make the number of n-grams manageable, but not too high, otherwise some patterns may be missed.
3. Given the resulting set of n-grams, count the number of co-occurrences (number of sentences containing both) for every pair of n-grams. Store this in the co-occurrence matrix.
4. Extract the most common co-occurrences from the matrix.
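The steps above can be sketched in Python roughly as follows. This is only a minimal illustration, not a tuned implementation: the function names, the whitespace tokenization, and the `max_n` / `min_count` parameters are assumptions made for the example, and the "matrix" is stored sparsely as a counter keyed by n-gram pairs.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, max_n):
    """Return the set of all word n-grams of size 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def cooccurring_ngrams(sentences, max_n=3, min_count=2):
    """Count, for each pair of frequent n-grams, the sentences containing both."""
    # Assumption: naive tokenization (lowercase, split on whitespace).
    per_sentence = [ngrams(s.lower().split(), max_n) for s in sentences]

    # Count in how many sentences each n-gram occurs, then drop the rare ones.
    freq = Counter(g for grams in per_sentence for g in grams)
    frequent = {g for g, c in freq.items() if c >= min_count}

    # Sparse co-occurrence matrix: one counter entry per sorted n-gram pair.
    pairs = Counter()
    for grams in per_sentence:
        kept = sorted(grams & frequent)
        pairs.update(combinations(kept, 2))
    return pairs


sentences = [
    "If I have to do it, then I will do it right.",
    "Even if I have to make it, I will not make it without Jack.",
    "If I have to do, I will not",
]
pairs = cooccurring_ngrams(sentences, max_n=3, min_count=3)
for (a, b), count in pairs.most_common(5):
    print(" ".join(a), "|", " ".join(b), count)
```

Note that overlapping n-grams (for instance "if i" and "i have") trivially co-occur, so in practice you would probably want to filter out pairs where one n-gram overlaps or contains the other before reading off the top pairs.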

score 1 · Answer 2 · sudoer · answered Nov 04 '22 at 12:56

The class of algorithms to search for is called "sequence alignment", usually found in bioinformatics. Example: https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm or https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm
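To make this more concrete, here is a minimal sketch of Needleman-Wunsch global alignment applied to words instead of nucleotides. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`), the function names, and the idea of turning the alignment into a template with `___` blanks are assumptions for illustration. It only aligns two sentences; scaling to thousands of sentences would need something extra, such as grouping similar sentences first.

```python
def align_words(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists (Needleman-Wunsch); return aligned pairs."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] aligned to a gap
                score[i][j - 1] + gap,      # b[j-1] aligned to a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def template(pairs, blank="___"):
    """Keep words shared by both sentences; collapse everything else to one blank."""
    out = []
    for x, y in pairs:
        token = x if x == y else blank
        if token != blank or not out or out[-1] != blank:
            out.append(token)
    return " ".join(out)


a = "if i have to go then i will run".split()
b = "if i have to eat then i will sleep".split()
print(template(align_words(a, b)))  # if i have to ___ then i will ___
```

Hirschberg's algorithm computes the same kind of alignment in linear space, which matters more for long sequences than for single sentences.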

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Ethan Nov 04 '22 at 21:24