Meta's large-scale study on language translation, the results are all "routine"
In early July of this year, Meta AI released a new translation model called No Language Left Behind (NLLB), a name that literally means "no language left behind."
As the name suggests, NLLB supports translation between any pair of 200 languages, and Meta AI has also open-sourced it. It can translate languages many people may never have encountered, such as Luganda and Urdu.
However, this research has recently been called into question. Some believe that many of the claims Meta AI makes about NLLB are unfounded and misleading, and that the evaluation results are seriously flawed. Skeptics also say that, using Meta AI's own evaluation methodology, it would be easy to obtain even higher numbers than the ones reported.
The skeptic is Benjamin Marie, a natural language processing research scientist who specializes in translation technology. His criticism can be summarized in one point: Meta AI compared spBLEU and BLEU scores side by side.
Regarding this criticism, some researchers noted: spBLEU is a reasonable metric when the text contains no spaces (Thai, etc.), but comparing spBLEU with BLEU is definitely incorrect.
Netizen Arle Lommel replied to Benjamin Marie: This is a great point. It also taught me to be very cautious about machine learning research that has not been verified. What you found here suggests that the problem becomes complicated when people simply cite scores without controlling how they were produced.
Vedanuj Goswami, one of the paper's authors, responded: "We 100% agree with the author that you cannot compare BLEU scores computed with different tokenizers. But the author's main argument, that most of the results in our paper are not comparable, is not true.
In our paper, Tables 30 and 31 use the same tokenizer for spBLEU evaluation (the FLORES-101 SPM tokenizer), specifically for comparability; we do not use the FLORES-200 SPM tokenizer. We describe this in detail in the caption of Table 30 and in Section 8.3.1. Similarly, Tables 35, 36, 37, and 38 all use comparable metrics/tokenizers for proper comparison. We have updated the paper.
In general, current machine translation evaluation methodology is far from perfect, and different papers use different methods."
The specifics of the criticism are as follows.

The evaluation method is flawed

First, let us make a simple analogy:

Paul has 25 bananas and Bill has 30 tomatoes. Would you say Bill has 5 more bananas than Paul?

BLEU is like a banana and spBLEU is like a tomato. Replace Paul with "previous work" and Bill with "NLLB", and we can write something like this:

Previous work achieved 25 BLEU, and NLLB achieved 30 spBLEU. Would you say NLLB is 5 BLEU points better than previous work?
With the above analogy, the content introduced below may be easier to understand.
Earlier, Meta AI released a paper that comprehensively describes and evaluates NLLB. In the abstract, they claim that the model achieves a 44% BLEU improvement over the previous SOTA. In other words, NLLB produces better results than previous studies.
A 44% BLEU improvement over the previous SOTA is rare in the history of machine translation research, so this single sentence in the paper would represent real scientific progress. Some media outlets reported the claim directly and, without further verification, placed Meta AI at the top of machine translation.
If Meta AI chooses to publish such a large technical study, they should provide very reliable scientific evidence. Otherwise, Meta AI's claim to do better than others, without any evidence, will only undermine the very hard work that other research institutions have done and are doing.
To explain the problems with NLLB, Marie attempts to show how Meta AI was misled by its own results. Using simple examples taken from NLLB, along with similar examples he constructed himself, Marie demonstrates that it is easy to surpass SOTA when using NLLB's flawed evaluation method. Finally, he identifies and explains in detail the main errors in the evaluation.
Meta AI compared its model with data from more than 20 previous studies and concluded that NLLB significantly outperformed previous studies. To make such a large number of comparisons feasible, they rely on automated evaluation metrics for machine translation evaluation, primarily BLEU and spBLEU.
BLEU is extremely popular in machine translation, despite its shortcomings.
For example, we want to translate the following French text from the FLORES101 dataset into English using Google Translate. If you speak French, you will notice that this is a very poor quality translation: grammatical errors, inconsistent terminology, and it does not read naturally. In fact, since the dataset was created from English, Meta AI only evaluates machine translation when translating to English.
We can evaluate the translation by counting how many of its tokens also appear in the reference translation. A token is defined here as a sequence of characters separated by spaces. Orange highlights all the token sequences in the Google translation above that also appear in the reference translation below.
Counting all the matching tokens, the BLEU score works out to 50.8. This score by itself means nothing; it only becomes meaningful when compared with another BLEU score.
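To make the matching idea concrete, here is a minimal sketch of unigram-level token matching with clipped counts. The two English sentences are placeholders, not the FLORES example above, and full BLEU additionally uses 2- to 4-gram precisions and a brevity penalty:

```python
# Minimal sketch of the token-matching idea behind BLEU (unigram level only).
# The sentences below are placeholders, not the FLORES example from the article.
from collections import Counter

hypothesis = "the cat sit on the mat"    # machine translation output
reference = "the cat sits on the mat"    # human reference translation

hyp_tokens = hypothesis.split()          # a token = whitespace-separated string
ref_tokens = reference.split()

# Clipped counts: a hypothesis token only matches as many times as it occurs in the reference.
ref_counts = Counter(ref_tokens)
matches = sum(min(count, ref_counts[tok]) for tok, count in Counter(hyp_tokens).items())

precision = matches / len(hyp_tokens)
print(f"{matches}/{len(hyp_tokens)} tokens match -> unigram precision = {precision:.2f}")
```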
The key point to understand here is that the score is computed on tokens, a detail that most machine translation research glosses over. The BLEU score is computed with SacreBLEU, which performs its own internal tokenization (essentially just adding spaces before punctuation). This is one of the most reliable and reproducible ways to compute a BLEU score. Meta AI, however, uses spBLEU.
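For reference, this is roughly how such a score is obtained in practice. It is a sketch with placeholder sentences, so the number it prints is unrelated to the 50.8 above:

```python
# Sketch: computing BLEU with SacreBLEU's default internal tokenization ("13a"),
# which mostly just splits punctuation off words. Sentences are placeholders.
from sacrebleu.metrics import BLEU

hypotheses = ["The cat sit on the mat."]       # system output(s)
references = [["The cat sits on the mat."]]    # one reference stream

bleu = BLEU()                                  # default: tokenize="13a"
result = bleu.corpus_score(hypotheses, references)
print(result)                 # e.g. "BLEU = ... " (value depends on the sentences)
print(bleu.get_signature())   # the signature records the tokenizer, for reproducibility
```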
So what is spBLEU? It is BLEU, but computed on a different tokenization. It tokenizes the Google translation and the reference translation as follows.
spBLEU's tokenization generates tokens by breaking words into smaller pieces (the special characters attached to the tokens are not important here; try to ignore them). A direct consequence of using spBLEU's tokenization is that we end up with more tokens, for both the translation and the reference. Since there are more tokens, we can expect the Google translation to match more tokens from the reference, and the score to rise. Indeed, the spBLEU score here is 54.8.
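To see where the extra tokens come from, here is a sketch of subword tokenization with SentencePiece. The model path is hypothetical (spBLEU relies on the FLORES SPM models) and the sentence is a placeholder:

```python
# Sketch: subword (SentencePiece) tokenization produces more, smaller tokens than
# whitespace tokenization. The .model path below is hypothetical; in FLORES,
# spBLEU uses the FLORES-101 (or FLORES-200) SPM model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="flores101_spm.model")  # hypothetical path

sentence = "The negotiations were unexpectedly complicated."       # placeholder sentence
word_tokens = sentence.split()
subword_tokens = sp.encode(sentence, out_type=str)

print(len(word_tokens), word_tokens)        # number of whitespace tokens
print(len(subword_tokens), subword_tokens)  # typically more tokens, e.g. ['▁The', '▁negoti', 'ations', ...]
```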
That is 4 points higher than the BLEU score computed above with SacreBLEU's internal tokenization. So has the translation become better?
Obviously not; the translation is exactly the same. Comparing BLEU with spBLEU makes no sense at all. BLEU and spBLEU process the Google translation and the reference translation differently for evaluation; they are in fact different metrics. If they were the same metric, we would not need to give them different names. As we often read and hear in the machine translation research community, comparing translation quality with BLEU scores computed on different, or even nearly similar, tokenizations is not fair, and not even meaningful. If you want your research to be scientifically credible, you must compute your BLEU scores consistently, using exactly the same tokenization.
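In practice, the simplest safeguard is to score every system with one explicitly chosen tokenizer and report the metric signature next to the numbers. A sketch with placeholder outputs and references:

```python
# Sketch: comparing two systems fairly by scoring both with the SAME metric
# configuration (same tokenizer), then reporting the signature with the scores.
from sacrebleu.metrics import BLEU

references = [["The committee approved the proposal yesterday."]]   # placeholder reference
system_a   = ["The committee approved the proposal yesterday."]     # placeholder output A
system_b   = ["The comity aproved the proposal."]                   # placeholder output B

bleu = BLEU(tokenize="13a")   # pick ONE tokenizer and use it for every system
score_a = bleu.corpus_score(system_a, references)
score_b = bleu.corpus_score(system_b, references)

print("System A:", round(score_a.score, 1))
print("System B:", round(score_b.score, 1))
print("Signature:", bleu.get_signature())   # records tokenizer and settings, for reproducibility
```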
Meta AI claims that NLLB is far better than previous work because it consistently obtains better spBLEU scores than the previously published BLEU scores. If anything, the opposite would be remarkable, because obtaining a spBLEU score that is lower than the BLEU score for a given translation is extremely difficult. What is even harder to understand is why, if the goal is simply to get the highest possible numbers, they do not use a chrBLEU metric instead.
With chrBLEU, every character of the Google translation and the reference translation becomes a token (in other words, spaces are inserted between characters).
The chrBLEU score then comes out at 75.5, which is 20.7 points higher than spBLEU. By NLLB's standard of evaluation, this would be a significant improvement and a new high-water mark for machine translation, while the original Google translation has not changed at all.
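The effect is easy to reproduce: score the same sentence pair with word-level and character-level tokenization and watch the number inflate. A sketch with placeholder sentences, assuming a sacrebleu version that ships the "char" tokenizer option:

```python
# Sketch: the same sentence pair scored with word-level vs character-level
# tokenization. Character tokens create far more n-gram matches, so the score
# inflates even though the translation itself is unchanged.
from sacrebleu.metrics import BLEU

hypotheses = ["The cat sit on the mat."]       # placeholder system output
references = [["The cat sits on the mat."]]    # placeholder reference

word_bleu = BLEU(tokenize="13a").corpus_score(hypotheses, references)
char_bleu = BLEU(tokenize="char").corpus_score(hypotheses, references)

print("word-level BLEU:", round(word_bleu.score, 1))
print("char-level BLEU:", round(char_bleu.score, 1))   # much higher, same translation
```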
Examples of errors in the paper
Now let's look at concrete examples from the NLLB evaluation.
Meta AI claims to have outperformed previous work by comparing its numbers with previously published figures. In the paper, these comparisons with previous work are drawn from Tables 30, 31, 32, 35, 36, 37, and 38.
Let's start with Table 32. It is one of the most illustrative examples because of the different types of errors it contains.
In this table, all numbers except those in the NLLB-200 column are copied directly from the previously published IndicBART and IndicTrans papers. For readability, Meta AI marks the highest score for each language in bold, a bold entry indicating that the corresponding system is the best.
The table says "spBLEU" for all systems, which is misleading. In reality, "all" means only NLLB-200, since IndicBART and IndicTrans report BLEU, not spBLEU. The comparison nonetheless shows NLLB's spBLEU scores to be higher than the BLEU scores of previous work. But does that mean NLLB is better? Is that like saying 30 tomatoes are better than 25 bananas?
In the text explaining the results we can see:
"(c) Google Translate, (d) Microsoft Translate. NLLB-200 significantly outperforms all models in most directions. The training dataset for NLLB-200 includes 25 Indian languages, almost twice as many as those covered by (a) and (b). The performance improvements can be attributed to more multilingual transfer, as well as improved data quality from Indic-language mining and back-translation."
In other words, NLLB has more tomatoes than the previous studies had bananas, and so Meta AI concludes that NLLB has more bananas.
The spBLEU scores are higher than the BLEU scores because they are computed on smaller, different tokens. But does NLLB translate better? We simply cannot answer. To make matters worse, IndicBART and IndicTrans are not comparable with each other either, since they use two different tokenization methods.
Most of the tables listed above have similar problems, with more or fewer errors.
If you go back to the IndicBART and IndicTrans papers to check the numbers, you will find yet another issue: columns (a) and (b) in Table 32 are swapped, so the IndicBART numbers are actually the IndicTrans numbers and vice versa.
Table 30 had even bigger problems. However, Table 30 has since been updated in the paper, and Benjamin Marie thanked Vedanuj for the update: Table 30 does state that the tokenizer is the same, and Marie admitted his mistake on this point.
As with Table 32, Meta AI claims that NLLB is superior to the earlier DeltaLM and DeepNet while comparing BLEU scores obtained with different calculation methods. What is new here is that they also compare NLLB with their own previous work, M2M-100, which is likewise evaluated with spBLEU. So does this comparison make sense? No. Even though both use spBLEU, they actually use different tokenizers, which makes the comparison impossible. Meta AI makes the following statement in footnote 28:
"Our analysis shows that when performed on the FLORES-101 language When measured, there are minor differences between the SPM-200 model of FLORES-200 and the SPM-100 model of FLORES-101. The main advantage of SPM-200 is that it covers more than 200 languages."
Minor differences are still differences. In this case, those differences matter, because we are doing scientific research.
One advance of NLLB over the earlier M2M-100 work is the addition of more languages to the model and the dataset, including in the tokenization model. Technically speaking, if you add more languages with different writing systems to the tokenizer while keeping the vocabulary size constant, you mechanically end up with a vocabulary of smaller tokens. As we saw above, smaller tokens can produce higher scores. Let's verify this.
As shown below:
This tokenization generates 95 tokens, while NLLB's generates 97. The difference is small, yet if spBLEU is computed with the M2M-100 tokenization, the score is 53.8, one point lower than with the NLLB tokenization. In the machine translation research literature, a 1-point difference is usually enough to claim that one system is significantly better. As expected, the NLLB tokenization produces higher scores than the M2M-100 one.
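In principle, Marie's check can be reproduced as follows: tokenize the same hypothesis/reference pair with each SentencePiece model and score the pre-split tokens with BLEU. The model paths and sentences below are hypothetical placeholders, and tokenize="none" keeps SacreBLEU from re-tokenizing the subword tokens:

```python
# Sketch: scoring the same translation with two different SentencePiece models
# (e.g. an M2M-100-style SPM vs an NLLB/FLORES-200-style SPM). Model paths are
# hypothetical. With tokenize="none", the two scores differ only by the SPM used.
import sentencepiece as spm
from sacrebleu.metrics import BLEU

def spbleu_with(model_path, hypothesis, reference):
    """Tokenize both sides with the given SPM model, then score BLEU on the pre-split tokens."""
    sp = spm.SentencePieceProcessor(model_file=model_path)
    hyp_sub = " ".join(sp.encode(hypothesis, out_type=str))
    ref_sub = " ".join(sp.encode(reference, out_type=str))
    bleu = BLEU(tokenize="none")  # text is already tokenized; do not re-tokenize
    return len(hyp_sub.split()), bleu.corpus_score([hyp_sub], [[ref_sub]]).score

hypothesis = "The negotiations were unexpectedly complicated."   # placeholder system output
reference = "The negotiations proved unexpectedly complicated."  # placeholder reference

for model in ["m2m100_spm.model", "flores200_spm.model"]:        # hypothetical model files
    n_tokens, score = spbleu_with(model, hypothesis, reference)
    print(f"{model}: {n_tokens} tokens, spBLEU = {score:.1f}")
```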
The next table is the last table in this article: Table 31.
Likewise, we have the same problems as above:
1. M2M-100 and NLLB use two different tokenizations for scoring, so they cannot be compared.
2. MMTAfrica appears to use the M2M-100 tokenization in its paper, so it is comparable with M2M-100 but not with NLLB.
There are other problems in the paper that will not be covered one by one here. The main mistake Meta AI made with NLLB is a very common one in machine translation evaluation, though it should be acknowledged that the work itself is genuinely impressive and may well deliver higher translation quality for many languages.