
I am curious how this is done, as I am interested in doing something similar. I have some manually transcribed data that contains tags for multiple speakers. I want to compare how well out-of-the-box ASR services (Google, AWS Transcribe) are able to diarise speakers (in other words, identify and transcribe audio with multiple speakers). I want to compare their output to the ground truth data I have and come up with a comparison metric.

I could use Levenshtein distance or something like Ratcliff-Obershelp similarity as a metric, but I would like to know whether there is a more standard way of doing this.
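
For example, Ratcliff-Obershelp similarity is available in the Python standard library via `difflib`; a minimal sketch of how I could score an ASR transcript against the ground truth (the example strings are placeholders):

```python
import difflib

reference = "hello how are you doing today"   # placeholder ground-truth transcript
hypothesis = "hello how are you going today"  # placeholder ASR output

# Ratcliff-Obershelp-style similarity in [0, 1] from the standard library
similarity = difflib.SequenceMatcher(None, reference, hypothesis).ratio()
print(similarity)
```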

  • [This repository](https://github.com/syhw/wer_are_we) appears to track progress in ASR; it links to many recent papers, so I would assume they use state-of-the-art evaluation methods. – Erwan Oct 21 '20 at 23:28

1 Answer


The answer I was looking for is Word Error Rate (WER). It is the most standard way of comparing ASR transcriptions with the ground truth. It is less granular than what I had in mind: it is essentially Levenshtein distance computed at the word level rather than the character level.
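
Since WER is just word-level Levenshtein distance normalised by the number of reference words, a minimal from-scratch sketch (assuming whitespace tokenisation and no text normalisation) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, start=1):
        curr = [i]
        for j, hw in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (rw != hw)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words ≈ 0.33
```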

The `jiwer` package in Python implements WER along with a few other metrics and is easy to use.
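
A minimal usage sketch, assuming a recent `jiwer` version and placeholder transcripts:

```python
import jiwer

reference = "the quick brown fox jumped over the lazy dog"  # ground-truth transcript
hypothesis = "the quick brown fox jumps over lazy dog"      # ASR output

print(jiwer.wer(reference, hypothesis))  # word error rate
print(jiwer.cer(reference, hypothesis))  # character error rate
```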
