I was looking for a way to explore evaluation metrics for machine translation models and came across spBLEU. I can't find any implementations or examples that would help me get started. Does anyone have a lead on what I can pursue?
Thanks in advance!
spBLEU was introduced in the Flores-101 article:
[...] we propose to use BLEU over text tokenized with a single language-agnostic and publicly available fixed SentencePiece subword model. We call this evaluation method spBLEU, for brevity. It has the benefit of continuing to use a metric that the community is familiar with, while addressing the proliferation of tokenizers.
For this, we have trained a SentencePiece (SPM) tokenizer (Kudo and Richardson, 2018) with 256,000 tokens using monolingual data (Conneau et al., 2020; Wenzek et al., 2019) from all the Flores-101 languages. SPM is a system that learns subword units based on training data, and does not require tokenization. The logic is not dependent on language, as the system treats all sentences as sequences of Unicode. Given the large amount of multilingual data and the large number of languages, this essentially provides a universal tokenizer, that can operate on any language.
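In plain terms: spBLEU tokenizes both hypotheses and references with that one fixed SPM model, then computes ordinary BLEU over the resulting pieces. If you want to see what the tokenization step actually does, here is a minimal sketch using the sentencepiece Python package; the model path is a placeholder for the Flores SPM file, not its real name:

# Minimal sketch of SPM encoding (pip install sentencepiece).
# "flores_spm.model" is a placeholder path for the Flores SPM model file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="flores_spm.model")

# SPM operates directly on raw Unicode text, so no language-specific
# pre-tokenization is needed before encoding.
pieces = sp.encode("The quick brown fox jumps over the lazy dog.", out_type=str)
print(" ".join(pieces))  # subword pieces, with "▁" marking word boundaries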
The SentencePiece model and some related utilities can be found in the Flores GitHub repo.
To use it, follow their instructions (lightly adapted here):
# tokenize the hypotheses with SPM
python scripts/spm_encode.py \
    --model flores_spm_model_here \
    --output_format=piece \
    --inputs={untok_hyp_file} \
    --outputs={hyp_file}

# score with sacrebleu; {ref_file} must be SPM-tokenized the same way,
# and --tokenize none stops sacrebleu from re-tokenizing the pieces
cat {hyp_file} | sacrebleu --tokenize none {ref_file}
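If you'd rather do everything from Python, newer sacrebleu releases (2.x) bundle this same Flores-101 SPM model as a built-in tokenizer, so you can skip the manual encoding step entirely. A minimal sketch, assuming sacrebleu >= 2.0; note the tokenizer name has varied across versions ("spm" in early 2.0 releases, "flores101" later):

# Minimal sketch: spBLEU via sacrebleu's built-in Flores-101 tokenizer.
# pip install "sacrebleu>=2.0"; the SPM model is downloaded on first use.
from sacrebleu.metrics import BLEU

hyps = ["The cat sat on the mat."]            # one system output per segment
refs = [["The cat is sitting on the mat."]]   # one list per reference stream

spbleu = BLEU(tokenize="flores101")  # may be "spm" on older 2.0.x versions
score = spbleu.corpus_score(hyps, refs)
print(score)  # a BLEU score computed over the SentencePiece pieces

This should give the same number as the command-line pipeline above, since both routes tokenize with the same fixed SPM model before running BLEU.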