
I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]

Not like that.

But bigger patterns like this:

If I have to _________, then I will ___.

["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]

I WANT TO DISCOVER NEW PATTERNS LIKE THE ABOVE. I don't already know the patterns.

I want to discover patterns in a large array or database of strings, for example the contents of an entire book.

An example use would be finding the most common sentence structures a book uses.

The goal isn't to create the perfect algorithm. I am willing to do it the brute-force way if need be, the way you might when finding common substrings in sentences.

Is there a way to find patterns like this?

  • It seems like regular expressions would be a good fit for your problem. – noe Jun 12 '22 at 12:19
  • The above example is only an example of the type of patterns I want to discover. Regex would only work if I already know the patterns. I want to discover unknown patterns. – Mohit Gangrade Jun 12 '22 at 12:28
  • Ahh, I see, sorry for the confusion. – noe Jun 12 '22 at 13:54
  • crosspost at https://stackoverflow.com/questions/72591638/how-to-find-common-patterns-in-thousands-of-strings – milahu Nov 26 '22 at 08:36

2 Answers


It's not easy, especially if you want any kind of pattern, with a varying number of words at any distance from each other.

The closest method I know would be to compute a large co-occurrence matrix over n-grams:

  1. Extract all the possible $n$-grams with size $n\leq N$ (for instance $N=3$).
  2. Filter out the least frequent ones. Depending on the size of the data, the frequency threshold should be high enough to keep the number of n-grams manageable, but not so high that some patterns are missed.
  3. Given the resulting set of n-grams, count the number of co-occurrences (the number of sentences containing both) for every pair of n-grams. Store this in the co-occurrence matrix.
  4. Extract the most common co-occurrences from the matrix.
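The four steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the function names, the crude regex tokenizer, and the defaults `max_n=3` and `min_count=2` are all assumptions made for the example, and the "matrix" is stored sparsely as a `Counter` keyed by pairs of n-grams.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(sentence):
    # Crude lowercase word tokenizer (an assumption; swap in a real one).
    return re.findall(r"[a-z']+", sentence.lower())


def ngrams(tokens, max_n):
    # Step 1: every contiguous n-gram with 1 <= n <= max_n, as a tuple of words.
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def cooccurring_ngrams(sentences, max_n=3, min_count=2):
    per_sentence = [ngrams(tokenize(s), max_n) for s in sentences]

    # Step 2: keep only n-grams that occur in at least min_count sentences.
    freq = Counter(g for grams in per_sentence for g in grams)
    kept = {g for g, c in freq.items() if c >= min_count}

    # Step 3: sparse co-occurrence matrix, counting sentences that contain both.
    pairs = Counter()
    for grams in per_sentence:
        present = sorted(grams & kept)  # sorted so each pair has one canonical key
        pairs.update(combinations(present, 2))
    return pairs


sentences = [
    "If I have to do it, then I will do it right.",
    "Even if I have to make it, I will not make it without Jack.",
    "If I have to do, I will not.",
]
pairs = cooccurring_ngrams(sentences)
# Step 4: inspect the most frequent pairs.
top = pairs.most_common(10)
```

Note that overlapping n-grams such as `("if", "i")` and `("if", "i", "have")` co-occur trivially, so in practice you would probably want to filter out pairs where one n-gram is contained in the other before reading off the top results.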
Erwan

The class of algorithms to search for is called "sequence alignment", most commonly used in bioinformatics. Examples: the Needleman–Wunsch algorithm (https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) or Hirschberg's algorithm (https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm).
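As a rough illustration of how alignment could apply to sentences, here is a small sketch of Needleman–Wunsch global alignment run over words instead of nucleotides. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`), the function names, and the `template` helper that turns an alignment into a fill-in-the-blank pattern are all assumptions made for this example, not something the linked articles prescribe.

```python
def align_words(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two word lists; return a list of (word_a, word_b) pairs.

    None on either side of a pair marks a gap.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def template(a, b):
    # Keep words that align exactly; collapse everything else into one blank.
    out = []
    for x, y in align_words(a, b):
        if x is not None and x == y:
            out.append(x)
        elif not out or out[-1] != "___":
            out.append("___")
    return " ".join(out)


s1 = "if i have to do it then i will do it right".split()
s2 = "if i have to go then i will go".split()
pattern = template(s1, s2)
# pattern == "if i have to ___ then i will ___"
```

This only aligns two sentences at a time. One possible way to use it on a whole book would be to align many sentence pairs and count the resulting templates, though that is quadratic in the number of sentences and is offered here only as a suggestion.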

sudoer