
Trying to improve my chat app:

Using previous (pre-processed) chat interactions from my domain, I have built a tool that offers the user 5 possible utterances for a given chat context, for example:

Raw: "Hi John."

Context: hi [[USER_NAME]]
Utterances: [Hi, Hello, How are you, Hi there, Hello again]


Of course, the results are not always relevant, for example:

Raw: "Hi John. How are you? I am fine, are you in the office?"

Context: hi [[USER_NAME]] how are you i am fine are you in the office
Utterances: [Yes, No, Hi, Yes i am, How are you]

I am using Elasticsearch with the TF/IDF similarity model and an index structured like so:

{
  "_index": "engagements",
  "_type": "context",
  "_id": "48",
  "_score": 1,
  "_source": {
    "context": "hi [[USER_NAME]] how are you i am fine are you in the office",
    "utterance": "Yes I am"
  }
}
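
For reference, the lookup itself is roughly a match query against the context field, which Elasticsearch scores with TF/IDF. A minimal sketch using a recent Python elasticsearch client (the connection details are placeholders; only the index and field names come from the document above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

def suggest_utterances(context, k=5):
    # Top-k stored contexts most similar to the incoming one,
    # scored by TF/IDF similarity on the "context" field.
    resp = es.search(
        index="engagements",
        query={"match": {"context": context}},
        size=k,
    )
    return [hit["_source"]["utterance"] for hit in resp["hits"]["hits"]]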

Problem: I know for sure that for the context "hi [[USER_NAME]] how are you i am fine are you in the office" the utterance "Yes I am" is relevant; however, "Yes" and "No" are relevant too, because they appeared in a similar context.

I am trying to use this excellent video as a starting point.

Q: How can I measure precision and recall if all I know (from my raw data) is just one true utterance per context?

  • From my understanding, the difficulty you face is how to penalize when the output is small (i.e. 1 or 2 available options). You could multiply each output by its length. For instance, let's say you aim for 100 different utterances: if you have one output which equals 1 or 0 for relevant or not respectively, you could multiply it by 0.01. If you have 2 outputs you would multiply by 0.02, and so on. I hope this kinda answers your question. I am not knowledgeable in this field; maybe there are some other advanced metrics. – 20-roso Nov 14 '16 at 19:15
  • If this answers your question, you might prefer to multiply by the log of the length instead; this seems more reasonable to me (a sketch of this idea appears after these comments). – 20-roso Nov 14 '16 at 19:29
  • Thanks for the reply. I wonder how this can be done if I don't really know the recall count; for example, one context could have 100 relevant utterances while another has only one. – Shlomi Schwartz Nov 15 '16 at 09:06
  • From what I notice, your utterances are comma-delimited; however, some utterances may include a comma (you should check that). If that is true, perhaps you could change the delimiter (to a pipe, for instance) and then count the distinct utterance instances. – 20-roso Nov 15 '16 at 13:14
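
For concreteness, a minimal sketch of the length-weighting idea from the comments above (the target size of 100 and the helper name are assumptions, not a standard metric):

import math

def length_weighted_score(suggestions, true_utterance, target=100):
    # 1 if the known-true utterance was among the suggestions, else 0.
    hit = 1.0 if true_utterance in suggestions else 0.0
    # Linear variant: 1 suggestion against a target of 100 weighs 0.01.
    linear = hit * len(suggestions) / target
    # Log variant (shifted by 1 so a single suggestion is not zeroed out).
    logged = hit * math.log(len(suggestions) + 1) / math.log(target + 1)
    return linear, logged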

1 Answer


Precision and recall are "hard" metrics: they measure whether the model's prediction is exactly the same as the target label.

Oftentimes, systems like yours can use a more flexible metric such as top-5 error rate: the model is considered to have generated the correct response if the target label is one of the model's top 5 predictions.
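
A minimal sketch of how this could be computed here, given one known-true utterance per context (the function and the example data are illustrative):

def top_k_accuracy(predictions, targets, k=5):
    # Fraction of contexts whose single true utterance appears among
    # the model's top-k suggestions (case-insensitive match).
    hits = sum(
        1 for suggested, truth in zip(predictions, targets)
        if truth.lower() in (s.lower() for s in suggested[:k])
    )
    return hits / len(targets)

# The example from the question: "Yes I am" is among the 5 suggestions.
predictions = [["Yes", "No", "Hi", "Yes i am", "How are you"]]
targets = ["Yes I am"]
print(top_k_accuracy(predictions, targets))  # 1.0

Note that with exactly one relevant utterance per context, recall@5 is 1 or 0 per query and precision@5 can never exceed 1/5, so top-5 accuracy is the natural single number to report.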

Brian Spiering