Exploring the impact of vocabulary selection on language model training: A groundbreaking study
How are language models affected by different vocabularies? And how can these effects be balanced?
In a recent experiment, researchers pre-trained and fine-tuned 16 language models with different vocabularies. The experiment used NanoGPT, a small-scale architecture (based on GPT-2 SMALL), to train 12 of the models. The NanoGPT configuration was: 12 attention heads, 12 transformer layers, a word-embedding dimension of 768, and approximately 400,000 iterations (roughly 10 epochs). The remaining 4 models were trained on the GPT-2 MEDIUM architecture: 16 attention heads, 24 transformer layers, a word-embedding dimension of 1024, and 600,000 iterations. All models were pre-trained with NanoGPT on the OpenWebText dataset. For fine-tuning, the researchers used the instruction dataset provided by baize-chatbot and added an additional 20,000 and 500,000 "dictionary" entries for the two types of models, respectively.
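For reference, here is a minimal sketch of the two configurations in NanoGPT-style Python. The parameter names follow NanoGPT's conventions, and only the values stated in the text are shown; everything else is assumed to stay at NanoGPT's defaults.

```python
# Sketch of the two architecture configurations described above, using
# NanoGPT-style hyperparameter names. Only values stated in the text are
# shown; all other settings are assumed to remain at NanoGPT's defaults.

small_config = dict(
    n_layer=12,         # 12 transformer layers
    n_head=12,          # 12 attention heads
    n_embd=768,         # word-embedding dimension
    max_iters=400_000,  # roughly 10 epochs over OpenWebText
)

medium_config = dict(
    n_layer=24,
    n_head=16,
    n_embd=1024,
    max_iters=600_000,
)
```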
In the future, the researchers plan to release the code, pre-trained models, instruction-tuned models, and fine-tuned datasets.
However, due to the lack of GPU sponsors (this is a free, open-source project) and in order to reduce costs, the researchers are not continuing for now, although there is room to improve the research further. In the pre-training stage alone, the 16 models had to run on 8 GPUs for a total of 147 days (1,176 single-GPU days), at a cost of about US$8,000.
The research results can be summarized as:
According to the experimental results, englishcode-32000-consistent performs best. However, as mentioned above, when using TokenMonster vocabularies in which a single token can correspond to multiple words, there is a trade-off between SMLQA (Ground Truth) accuracy and word ratio, which increases the slope of the learning curve. The researchers firmly believe that by forcing 80% of the tokens to correspond to one word and 20% to correspond to multiple words, this trade-off can be minimized and a "best of both worlds" vocabulary achieved. They believe such a vocabulary would match the performance of a one-word vocabulary while improving the word ratio by about 50%.
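To make the proposed 80/20 split concrete, the sketch below estimates the single-word versus multi-word share of a vocabulary. `vocab_strings` is a hypothetical list of decoded token strings, not TokenMonster's actual API.

```python
# Rough illustration of the proposed 80/20 split: given the decoded
# string for each token in a vocabulary, estimate what share of tokens
# covers a single word versus several words.

def single_vs_multi_word_share(vocab_strings):
    single = sum(1 for tok in vocab_strings if len(tok.split()) <= 1)
    multi = len(vocab_strings) - single
    total = max(len(vocab_strings), 1)
    return single / total, multi / total

# The text proposes targeting roughly 0.8 single-word and 0.2 multi-word.
```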
The implication is that flaws and complexity in the tokenizer have a greater impact on the model's ability to learn facts than on its linguistic ability.
This phenomenon is an interesting feature discovered during training, and it makes sense when we consider how model training works. The researcher has no evidence to justify this reasoning, but essentially, because fluency is easier to correct during backpropagation than factuality (which is extremely subtle and context-dependent), any improvement in tokenizer efficiency has a ripple effect that translates directly into improved information fidelity, as seen in the SMLQA (Ground Truth) benchmark. Simply put: a better tokenizer gives a more truthful model, but not necessarily a more fluent one. Conversely, a model with an inefficient tokenizer can still learn to write fluently, but the additional cost of fluency reduces the model's credibility.
Before conducting these tests, the researchers believed that 32,000 was the optimal vocabulary size, and the experimental results confirmed this. The 50256-balanced model performs only 1% better than the 32000-balanced model on the SMLQA (Ground Truth) benchmark, while the model size increases by 13%. To demonstrate this point clearly, the MEDIUM-based models were tested with vocabularies of 24,000, 32,000, 50,256, and 100,256 words.
The researchers tested TokenMonster in three specific optimization modes: balanced, consistent, and strict. Different optimization modes affect the way punctuation marks and capcodes are combined with word tokens. The researchers initially predicted that the consistent mode would perform better (because it is less complex), although its word ratio (i.e., the ratio of characters to tokens) would be slightly lower.
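As a reminder of what "word ratio" means in this article, here is a minimal sketch; `tokenize` stands in for whichever tokenizer is being measured and is not a specific library call.

```python
# Word ratio as used here: the average number of characters represented
# by each token. A higher ratio means the same text is encoded with
# fewer tokens. `tokenize` is a stand-in for the tokenizer under test.

def word_ratio(text: str, tokenize) -> float:
    tokens = tokenize(text)
    return len(text) / max(len(tokens), 1)

# Example: 1,000 characters encoded as 250 tokens gives a word ratio of 4.0.
```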
The experimental results seem to confirm the above conjecture, but the researchers also observed some other phenomena. First, consistent mode seems to perform about 5% better than balanced mode on the SMLQA (Ground Truth) benchmark. However, consistent mode performs significantly worse (by 28%) on the SQuAD (Data Extraction) benchmark. That said, the SQuAD benchmark shows large uncertainty (repeated runs give different results) and is not conclusive. The researchers did not train balanced and consistent to convergence, so this may simply mean that the consistent mode is easier to learn. In fact, consistent may end up doing better on SQuAD (Data Extraction), because SQuAD is harder to learn and less prone to hallucination.
This is an interesting finding in itself, because it means there is no obvious problem with combining punctuation and words into a single token. All other tokenizers to date have assumed that punctuation should be separated from letters, but as the results here show, words and punctuation can be merged into a single token with no noticeable performance loss. This is also confirmed by 50256-consistent-oneword, which performs on par with 50256-strict-oneword-nocapcode and outperforms p50k_base, even though 50256-consistent-oneword combines simple punctuation with word tokens (which the other two vocabularies do not).
Enabling capcode with the strict optimization mode has clearly negative effects. On SMLQA, 50256-strict-oneword-nocapcode scores 21.2 and on SQuAD it scores 23.8, whereas 50256-strict-oneword scores 16.8 and 20.0, respectively. The reason is obvious: the strict optimization mode prevents capcodes from being merged with word tokens, so more tokens are needed to represent the same text, which results in an 8% reduction in word ratio. In fact, strict-nocapcode is more similar to consistent than to strict mode: on the various metrics, 50256-consistent-oneword and 50256-strict-oneword-nocapcode are nearly equal.
In most cases, the researchers concluded, the model does not find it particularly difficult to learn the meaning of tokens that combine punctuation marks and words. That said, the consistent mode has higher grammatical accuracy and fewer grammatical errors than the balanced mode. Taking everything into consideration, the researchers recommend using the consistent mode; the strict mode should only be used with capcode disabled.
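Capcode is TokenMonster's scheme for encoding capitalization with marker tokens instead of keeping separate upper- and lowercase entries for every word; the toy sketch below only illustrates that general idea and is not TokenMonster's actual encoding.

```python
# Toy illustration of the capcode idea: capitalization is expressed via an
# explicit marker so the vocabulary only needs lowercase word forms.
# This is a conceptual sketch, not TokenMonster's real format.

CAP_MARKER = "\x01"  # hypothetical "capitalize the next word" marker

def toy_capcode_encode(text: str) -> str:
    words = []
    for word in text.split():
        if word[:1].isupper():
            words.append(CAP_MARKER + word.lower())
        else:
            words.append(word)
    return " ".join(words)

# toy_capcode_encode("Who wrote the Harry Potter books?")
# -> "\x01who wrote the \x01harry \x01potter books?"
```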
As described above, compared with the balanced mode, the consistent mode has higher syntax accuracy (fewer syntax errors). This is reflected in the very slight negative correlation between word ratio and grammar, as shown in the figure below. Beyond that, the most noteworthy point is that the syntax results for the GPT-2 tokenizer and tiktoken p50k_base are poor compared with TokenMonster's 50256-strict-oneword-nocapcode: 98.1% and 97.5% versus 98.6% and 98.4%. The researchers initially thought this was just a coincidence, but multiple samplings produced the same range of results. It is unclear what the cause is.
MTLD is used to measure the lexical diversity of the generated sample text. It seems to be closely related to the n_embd parameter rather than to features such as vocabulary size, optimization mode, or the maximum number of words per token. This is particularly evident in the 6000-balanced model (n_embd of 864) and the 8000-consistent model (n_embd of 900).
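MTLD (Measure of Textual Lexical Diversity) is not defined in the article; as background, a simplified one-direction sketch of the usual computation is shown below. The full measure typically averages a forward and a backward pass.

```python
# Simplified, single-pass sketch of MTLD: walk through the words,
# tracking the type-token ratio (TTR) of the current segment; each time
# TTR falls to the threshold, count one "factor" and reset. MTLD is the
# total word count divided by the (possibly fractional) factor count.

def mtld_forward(words, ttr_threshold=0.72):
    factors = 0.0
    types, tokens = set(), 0
    for w in words:
        tokens += 1
        types.add(w.lower())
        if len(types) / tokens <= ttr_threshold:
            factors += 1.0
            types, tokens = set(), 0
    if tokens:  # partial factor for the leftover segment
        ttr = len(types) / tokens
        factors += (1.0 - ttr) / (1.0 - ttr_threshold)
    return len(words) / factors if factors else float("inf")
```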
Among the medium-sized models, p50k_base has the highest MTLD at 43.85, but also the lowest grammar score. The reason for this is unclear, but the researchers speculate that the selection of training data may be somewhat unusual.
The purpose of the SQuAD benchmark is to evaluate a model's ability to extract data from a piece of text: a paragraph is provided along with a question, and the answer must be found within that paragraph. The test results are not very meaningful; there are no obvious patterns or correlations, and they are not affected by the overall model parameters. In fact, the 8000-balanced model with 91 million parameters scored higher on SQuAD than the 50256-consistent-oneword model with 354 million parameters. The reason may be that there are not enough examples of this style, or too many question-answer pairs, in the fine-tuning dataset. Or perhaps it is simply a less-than-ideal benchmark.
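The exact prompt template used during fine-tuning is not given in the text; purely as an illustration, a SQuAD-style data-extraction example might be formatted along these lines (the template and field names are assumptions).

```python
# Hypothetical formatting of a SQuAD-style example for instruction
# fine-tuning: the model is given a passage and a question, and the
# answer must be a span that appears verbatim in the passage.

def format_extraction_example(context: str, question: str, answer: str) -> str:
    assert answer in context, "SQuAD-style answers are spans of the context"
    return (
        "Context:\n" + context + "\n\n"
        "Question: " + question + "\n"
        "Answer: " + answer
    )
```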
The SMLQA benchmark tests "Ground Truth" by asking common-sense questions with objective answers, such as "Which country's capital is Jakarta?" and "Who wrote the Harry Potter books?"
It is worth noting that in this benchmark the two reference tokenizers, the GPT-2 tokenizer and p50k_base, performed very well. The researchers initially thought they had wasted months of time and thousands of dollars, because tiktoken appeared to perform better than TokenMonster. However, it turned out that the issue was related to the number of words corresponding to each token. This is particularly evident in the "MEDIUM" models, as shown in the figure below.
The performance of the single-word vocabulary is slightly better than TokenMonster's default vocabulary, in which each token corresponds to multiple words.
Another important observation is that below 32,000, the vocabulary size directly affects the ground-truth (SMLQA) score, even when the model's n_embd parameter is adjusted to compensate for the smaller model size. This is counterintuitive: the researchers originally expected 16000-balanced with an n_embd of 864 (121.34 million parameters) and 8000-consistent with an n_embd of 900 (123.86 million parameters) to do better than 50256-consistent with an n_embd of 768 (123.59 million parameters), but that is not the case; both performed much worse (13.7 and 15.1 vs. 50256-consistent's 16.4). Note, however, that both "adjusted" models were trained for the same number of iterations, which means they effectively received significantly less pre-training (even though the training time was the same).
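The parameter counts quoted above come from trading vocabulary size against n_embd. A rough GPT-2-style estimate is sketched below; exact totals depend on implementation details (tied embeddings, biases, whether position embeddings are counted), so it will not reproduce the quoted figures to the last digit.

```python
# Rough GPT-2-style parameter count, shown only to illustrate how a
# smaller vocabulary can be offset by a larger n_embd. Not an exact
# reproduction of the figures quoted in the text.

def approx_gpt2_params(vocab_size, n_layer, n_embd, block_size=1024):
    per_block = 12 * n_embd * n_embd + 13 * n_embd  # attention + MLP + layer norms
    tok_emb = vocab_size * n_embd                   # token embeddings (tied with the output head)
    pos_emb = block_size * n_embd                   # learned position embeddings
    final_ln = 2 * n_embd
    return n_layer * per_block + tok_emb + pos_emb + final_ln

print(approx_gpt2_params(50256, 12, 768) / 1e6)  # roughly 124M
print(approx_gpt2_params(16000, 12, 864) / 1e6)  # smaller vocabulary, larger n_embd
```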
The researchers trained 12 models on the default NanoGPT architecture, which is based on GPT-2 with 12 attention heads, 12 layers, and an embedding size of 768. None of these models reached convergence; simply put, they did not reach their maximum learning capacity. The models were trained for 400,000 iterations, but it appears that around 600,000 iterations are needed to reach maximum learning capacity. The reasons for this are simple: one is budget, and the other is uncertainty about the convergence point.
Results of the small model:
Pearson correlation of small model:
Conclusions for the small models:
1. The optimal vocabulary size is reached at 32,000. In the range from 8,000 to 32,000, increasing the vocabulary improves the model's accuracy. However, when the vocabulary grows from 32,000 to 50,257, the total number of model parameters increases accordingly, but the accuracy improvement is only 1%. Beyond 32,000, the gains diminish rapidly.
2. Poor tokenizer design affects the model's accuracy, but not its grammatical correctness or linguistic diversity. In the 90-million to 125-million parameter range, tokenizers with more complex rules (such as tokens corresponding to multiple words, combinations of words and punctuation, capcode-encoded tokens, and a smaller total vocabulary) perform worse on the ground-truth benchmark. However, this more sophisticated tokenizer design had no statistically significant impact on the linguistic diversity or grammatical correctness of the generated text. Even a compact model, such as one with 90 million parameters, can effectively exploit more complex tokens. A more complex vocabulary takes longer to learn, which reduces the time available to acquire information related to basic facts. Since none of these models were trained to completion, the potential for further training to close the performance gap remains to be seen.
3. Validation loss is not a valid metric for comparing models that use different tokenizers. Validation loss has a very strong correlation (0.97 Pearson correlation) with the word ratio (the average number of characters per token) of a given tokenizer. To compare loss values between tokenizers, it may be more useful to measure loss relative to characters rather than tokens, since the loss value is proportional to the average number of characters per token (see the sketch after this list).
4. The F1 score is not suitable as a metric for evaluating language models that are trained to generate variable-length responses (where completion is signaled by an end-of-text token), because the F1 formula penalizes longer text sequences more heavily. F1 therefore favors models that produce shorter responses.
5. All of the models can be fine-tuned to produce grammatically coherent answers. Although these answers are often incorrect or hallucinated, they are relatively coherent and demonstrate an understanding of the context.
6. The lexical diversity and grammatical accuracy of the generated text increase significantly as the embedding size grows, and are slightly negatively correlated with word ratio. This means that a vocabulary with a higher word ratio makes learning grammar and lexical diversity slightly more difficult.
7. When adjusting for model parameter size, there is no statistically significant correlation between word ratio and the SMLQA (Ground Truth) or SQuAD (Information Extraction) benchmarks. This means that a tokenizer with a higher word ratio does not negatively affect model performance.
8. Compared with "balanced" vocabularies, the "consistent" ones appear to perform slightly better on the SMLQA (Ground Truth) benchmark but much worse on the SQuAD (Information Extraction) benchmark, although more data is needed to confirm this.
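Point 3 above suggests normalizing loss by characters rather than tokens when comparing across tokenizers; a minimal sketch of that conversion:

```python
import math

# Convert a token-level cross-entropy loss into a character-level
# quantity, so models trained with different tokenizers can be compared.
# `word_ratio` is the tokenizer's average number of characters per token.

def loss_per_char(loss_per_token: float, word_ratio: float) -> float:
    return loss_per_token / word_ratio

def bits_per_char(loss_per_token: float, word_ratio: float) -> float:
    # cross-entropy in nats per token -> bits per character
    return loss_per_token / (word_ratio * math.log(2))

# Example: 3.0 nats/token at a word ratio of 4.0 characters/token is
# 0.75 nats/char, or about 1.08 bits/char.
```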
For the medium models (16 attention heads and 24 transformer layers), training to convergence did significantly reduce the performance gap between simpler and more complex vocabularies. The SMLQA (Ground Truth) and SQuAD (Data Extraction) benchmark results are very close; the main difference is that 50256-consistent has the advantage of a 23.5% higher word ratio than p50k_base. For vocabularies with multiple words per token, there is a small performance cost on ground truth, but it can be addressed using the method discussed at the top of the page. Results of the medium models:
After 560,000 iterations, all models began to converge, as shown below:
In the next phase, the researchers will use englishcode-32000-consistent, a vocabulary containing 80% word tokens and 20% multi-word tokens, to train and benchmark a MEDIUM model.
After training and benchmarking the small models, the researchers found that the measured results reflected the models' learning speed rather than their learning capacity. Moreover, the GPU's compute potential was not fully exploited, because the default NanoGPT parameters were used. To address this, the researchers chose a tokenizer with 50,257 tokens and a medium-sized language model, and studied four variants. The batch size was increased from 12 to 36 and the block size reduced from 1024 to 256 to make full use of the 24 GB of GPU VRAM, and 600,000 iterations were performed instead of the 400,000 used for the smaller models. Pre-training each model took an average of just over 18 days, three times the 6 days required for the smaller models.
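In NanoGPT terms, the adjustments described above amount to overrides along these lines (only the values explicitly mentioned in the text; everything else stays at its default):

```python
# Training adjustments for the MEDIUM runs, in NanoGPT-style names.
# Only the values stated in the text are shown.

medium_train_overrides = dict(
    batch_size=36,      # raised from 12
    block_size=256,     # reduced from 1024 to fully use a 24 GB GPU's VRAM
    max_iters=600_000,  # vs. 400,000 for the small models
)
```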