I need to detect whether a large string contains a specific term, where the match is semantic rather than exact.

Suppose I have a full contract page converted to a string in my Python program. I want to determine whether a specific term (a string much shorter than the page) appears in that page. The matching must be done semantically.

For example, in the following case I expect a good score for Query 1 and a bad score for Query 2. Note that a query may contain more than one sentence.

Query 1

"The supplier will fulfill the delivery of the products in one week"

Query 2

"I like Phil Collins"

Page text

"The supplier will be paid annually in July.

The delivery of products will be made in one week.

......... "

How would you do this task?

Valentin Calomme

1 Answer

This is essentially an information retrieval problem: usually there is a collection of documents, and the goal is to find the document most similar to a given query (what you call the "semantic concept").

The traditional approach is to convert the documents into vectors, typically with TF-IDF weights, although there are many options (I assume that recent approaches favor word or document embeddings). A similarity measure (for example cosine similarity) is then computed between every document and the query, and the most similar document is selected. Since in this case there is a single document, you could apply a threshold to the similarity score in order to answer yes or no (concept found or not).
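
A minimal sketch of this idea in plain Python, under a few assumptions that are mine rather than part of the approach itself: the page is split into sentences so that each sentence acts as a "document" (IDF is not meaningful with a single document), the helper names `best_score` and `contains_term` are hypothetical, and the 0.5 threshold is an arbitrary starting point that would need tuning on real contracts. Note that TF-IDF only captures word overlap; for matching that is closer to truly semantic, the same structure should work with embedding vectors in place of the TF-IDF ones.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(page):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", page)
    return [p for p in parts if tokenize(p)]


def tfidf(tokens, doc_freq, n_docs):
    """Weight each term by its count times a smoothed inverse document frequency."""
    counts = Counter(tokens)
    return {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, tf in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_score(query, page):
    """Highest cosine similarity between the query and any sentence of the page."""
    sentences = [tokenize(s) for s in split_sentences(page)]
    if not sentences:
        return 0.0
    # Document frequency: in how many sentences each term occurs.
    doc_freq = Counter()
    for tokens in sentences:
        doc_freq.update(set(tokens))
    n_docs = len(sentences)
    query_vec = tfidf(tokenize(query), doc_freq, n_docs)
    return max(cosine(query_vec, tfidf(t, doc_freq, n_docs)) for t in sentences)


def contains_term(query, page, threshold=0.5):
    """Yes/no answer; the default threshold is an arbitrary assumption."""
    return best_score(query, page) >= threshold


if __name__ == "__main__":
    page = (
        "The supplier will be paid annually in July.\n\n"
        "The delivery of products will be made in one week."
    )
    for query in (
        "The supplier will fulfill the delivery of the products in one week",
        "I like Phil Collins",
    ):
        print(f"{best_score(query, page):.3f}  {query}")
```

On the example from the question, the first query shares most of its words with the delivery sentence and scores well above the second query, which shares no words with the page at all. If a query spans several sentences, one option is to score each query sentence separately and combine the results.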

Erwan