Small models saturate and underperform: is Softmax the root cause?
Small language models exist to offset the high training and inference costs of large language models, but their performance can degrade and plateau after a certain stage of training (the saturation phenomenon). What causes this, and can it be overcome, or even exploited, to improve small language models?
Recent progress in language modeling comes from pre-training highly parameterized neural networks on extremely large web text corpora. In practice, training and serving such models is costly, which motivates the use of smaller alternatives. However, it has been observed that smaller models may suffer from saturation, a phenomenon characterized by declining capability and plateauing performance at an advanced stage of training.
A recent paper finds that this saturation phenomenon can be explained by a mismatch between the latent dimensionality of the smaller model and the high rank of the target contextual probability distribution. This mismatch affects the performance of the linear prediction head used in these models through what is known as the softmax bottleneck.
Paper link: https://arxiv.org/pdf/2404.07647.pdf
The paper measures the impact of the softmax bottleneck in different settings and finds that models with fewer than 1000 hidden dimensions tend to adopt degenerate latent representations late in pre-training, leading to reduced evaluation performance.
Introduction
Representation degradation is a common phenomenon affecting self-supervised learning across modalities, including text. Observations of the intermediate representations of language models reveal low angular variability (anisotropy) or outlier dimensions that emerge during training. However, these observations have mostly been made on relatively small models, with dimensions comparable to the BERT or GPT-2 families.
These models usually consist of a neural network f_θ that takes a token sequence (y_1, …, y_i) ∈ [1, V]^i and produces a relatively low-dimensional context representation in R^d, where d is the hidden dimension of the model. They then rely on a language modeling head that produces the logits of the next-token probabilities. A common choice of language modeling head is a linear layer with parameters W ∈ R^(V×d), where V is the number of possible tokens. The resulting probability distribution over the next token is therefore

p(y_{i+1} | y_1, …, y_i) = σ(W f_θ(y_1, …, y_i)),

where σ is the softmax function.
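A minimal sketch of this parameterization is given below; the dimensions and the random context vector are illustrative placeholders, not values from the paper:

```python
import torch

d, V = 768, 50304          # hidden dimension and vocabulary size (illustrative values)

h = torch.randn(d)         # stand-in for the context representation f_theta(y_1..y_i)
W = torch.randn(V, d)      # linear language modeling head, W in R^(V x d)

logits = W @ h                              # one logit per vocabulary token
p_next = torch.softmax(logits, dim=-1)      # sigma(W f_theta(...)): next-token distribution
assert torch.isclose(p_next.sum(), torch.tensor(1.0))
```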
The current trend in language modeling is to scale up the generative pre-training approach introduced with GPT-2, i.e. training neural models with billions of parameters on huge corpora of web text. However, training and deploying such highly parameterized models raises energy and hardware concerns, motivating the search for ways to reach similar performance levels with smaller models.
However, evaluation of the Pythia model suite shows that training small models on very large corpora can lead to saturation, manifested as performance degradation late in pre-training. This paper explores the saturation phenomenon through the lens of representation degradation and finds a strong correlation between the two. It further demonstrates, both theoretically and empirically, that representation degradation occurs in the language modeling heads of small models, and shows how a linear language modeling head can become a performance bottleneck for architectures with small hidden dimensions.
Language model saturation phenomenon
The paper first verifies that performance saturation can indeed be observed and quantified on Pythia checkpoints, since Pythia is the only suite that releases intermediate checkpoints across a range of model sizes. The paper measures the cross-entropy of Pythia checkpoints on 50,000 tokens randomly sampled from their pre-training dataset (i.e., The Pile).
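A rough sketch of this kind of measurement is shown below; the model name, the per-step revision convention, and the single toy text are assumptions for illustration, not the paper's exact protocol (which averages over roughly 50,000 tokens sampled from The Pile):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed names: Pythia checkpoints are published with per-step revisions such as "step100000".
model_name, revision = "EleutherAI/pythia-410m", "step100000"
tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision).eval()

def cross_entropy(text: str) -> float:
    """Mean next-token cross-entropy (in nats) of the model on a text sample."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)   # HF shifts labels internally for the causal LM loss
    return out.loss.item()

print(cross_entropy("The quick brown fox jumps over the lazy dog."))
```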
Figure 1a clearly shows that even the 410-million-parameter model encounters saturation, manifested as increased in-domain loss at advanced training stages.
In Figure 1b, the paper fits the data points of models from 410 million parameters upward following the method of Hoffmann et al. (2022), optimizing only the model-dependent constants (A and α) while reusing all other values (B = 410.7, β = 0.28, E = 1.69). Recall the relationship between parameter count N and token count T given by Hoffmann et al. (2022):

L(N, T) = E + A / N^α + B / T^β
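A hedged sketch of this kind of fit is given below; the loss observations are illustrative placeholders (roughly consistent with the reported constants), not the paper's data, and only A and α are left free, as described above:

```python
import numpy as np
from scipy.optimize import curve_fit

B, beta, E = 410.7, 0.28, 1.69          # constants reused from Hoffmann et al. (2022)

def chinchilla_loss(NT, A, alpha):
    """L(N, T) = E + A / N**alpha + B / T**beta, with only A and alpha free."""
    N, T = NT
    return E + A / N**alpha + B / T**beta

# Placeholder observations: (parameter count, token count, measured loss).
N = np.array([4.1e8, 1.0e9, 1.4e9, 2.8e9])
T = np.array([3.0e11, 3.0e11, 3.0e11, 3.0e11])
loss = np.array([2.85, 2.67, 2.61, 2.51])     # illustrative numbers only

(A_fit, alpha_fit), _ = curve_fit(chinchilla_loss, (N, T), loss, p0=(100.0, 0.25))
print(f"A = {A_fit:.2f}, alpha = {alpha_fit:.3f}")   # paper reports A = 119.09, alpha = 0.246
```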
The paper finds optimal parameters A = 119.09 and α = 0.246, and shows fitted curves for the token counts of the optimal and final checkpoints. The final checkpoints perform on average about 8% worse than the extrapolated value. The loss-minimizing (optimal) checkpoint is expected to fall short of the extrapolation because of incomplete learning rate cooldown, yet it is only about 4% below the extrapolated value.
A similar performance saturation phenomenon is also observed on the datasets used for evaluation with the LM Evaluation Harness, as shown in Table 1.
Performance saturation is Rank Saturation
Anisotropy at scale
Anisotropy is a common form of representation degradation observed in various small language models; it consists of reduced angular variability of the representation distribution in a given layer. Previous research (Ethayarajh, 2019; Godey et al., 2024) noted that almost all layers of small Transformer language models are anisotropic. A common way to measure anisotropy in a set of vector representations H is the average cosine similarity:

A(H) = (1 / (|H| (|H| − 1))) Σ_{h_i, h_j ∈ H, i ≠ j} cos(h_i, h_j)

To study this at scale, the paper computes the average cosine similarity of intermediate representations across layers for several model suites, namely GPT-2, OPT, Pythia and Gemma. A subsample of The Pile is used, under the assumption that its domain includes or matches the domains of the pre-training datasets used for these suites.
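A minimal sketch of this anisotropy measure over a matrix of hidden states (shapes are illustrative):

```python
import torch

def average_cosine_similarity(H: torch.Tensor) -> float:
    """Mean pairwise cosine similarity of the rows of H (n, d); values near 1 indicate anisotropy."""
    Hn = torch.nn.functional.normalize(H, dim=-1)      # unit-norm rows
    sims = Hn @ Hn.T                                    # (n, n) cosine similarity matrix
    n = H.shape[0]
    off_diag = sims.sum() - sims.diagonal().sum()       # drop the trivial self-similarities
    return (off_diag / (n * (n - 1))).item()

H = torch.randn(256, 768)                               # e.g. hidden states from one layer
print(average_cosine_similarity(H))                     # near 0 for isotropic random vectors
```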
Figure 2 shows that most layers of most Transformer models are anisotropic to some extent, regardless of scale. However, there seems to be a dichotomy in the last layer, where models are either almost isotropic or highly anisotropic. The paper notes that this dichotomy aligns with the saturation phenomenon in the Pythia suite, where only models with 160 million parameters or fewer are affected by last-layer anisotropy.
This article studies the training dynamics of anisotropy in the Pythia suite and compares it to the saturation phenomenon in Figure 3.
Figure 3 demonstrates a clear correlation between the onset of performance saturation and the emergence of anisotropy in the model's last-layer representations, and shows a sudden increase in anisotropy near the saturation point during training. On the in-domain corpus, the model rapidly loses performance at saturation and never seems to fully recover from this anisotropy spike.
Singular Value Saturation
Mean cosine similarity is a valuable measure of distribution uniformity, but other metrics can better capture the complexity of certain manifolds. Moreover, it only looks at the output embeddings of the language model, not at its weights. This section extends the analysis by studying the singular value distributions of language modeling heads, connecting the empirical observations to the paper's theoretical findings. Figure 4 shows the singular value distribution of the final prediction layer weight W during training:
Figure 4 reveals a specific pattern of spectral saturation, occurring at roughly the same time as performance saturation. The singular value distribution gradually flattens during training, almost reaching uniformity, and then suddenly collapses into a spiked distribution in which the largest singular values are high relative to the rest of the spectrum.
To quantify this behavior more accurately, this article uses the singular entropy metric, calculated as the Kullback-Leibler divergence between the normalized singular value distribution and the uniform distribution.
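A possible implementation of this metric, under the description above (the function name and the normalization of the spectrum are illustrative assumptions):

```python
import torch

def singular_entropy(W: torch.Tensor) -> float:
    """KL divergence between the normalized singular value distribution of W and the uniform distribution.

    Lower values mean the spectrum is closer to uniform; a sharp increase signals
    the spiked, degenerate spectrum discussed above.
    """
    s = torch.linalg.svdvals(W)           # singular values of W
    p = s / s.sum()                       # normalize into a probability distribution
    u = torch.full_like(p, 1.0 / len(p))  # uniform reference distribution
    return torch.sum(p * torch.log(p / u)).item()

W = torch.randn(50304, 768)               # e.g. an LM head weight matrix (illustrative shape)
print(singular_entropy(W))
```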
Figure 5 shows how the singular value distributions evolve differently for models with fewer than 410 million parameters compared to larger ones. The heads of small models see their singular value distributions become progressively more uniform until they suddenly degenerate, which again correlates with degraded language modeling performance. The singular value distributions of larger models tend to be more stable and show no obvious monotonic pattern throughout training.

Softmax bottleneck and the dimension of language

The inherent dimension of natural language
Intuitively, the singular value saturation observed above applies only to smaller models, which raises the question of which dimensions are involved in the optimization of the LM head. This section proposes to empirically measure the critical rank of an LM head and to estimate the dimension of the contextual probability distribution that the head's output should match.

To empirically measure the impact of the linear head's rank, the paper proposes to train rank-restricted heads on pre-trained contextual representations derived from highly parameterized language models. To control the maximum rank r, a head of the form W = AB ∈ R^(V×d) is considered, where the coefficients of A ∈ R^(V×r) and B ∈ R^(r×d) are drawn from N(0, 1) (d is the hidden dimension of the model). The rank of W is then swept over a range of values through the parameter r ∈ [1, d].
The rank-restricted heads are trained on approximately 150 million tokens while the language model is kept frozen, with the learning rate adjusted to the number of trainable parameters.
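A hedged sketch of such a rank-restricted head is shown below; the backbone is assumed frozen, and the shapes, batch, and training details are placeholders rather than the paper's setup:

```python
import torch
import torch.nn as nn

d, V, r = 2048, 50304, 128           # hidden dim, vocab size, maximum rank (illustrative values)

class RankRestrictedHead(nn.Module):
    """W = A @ B with A in R^(V x r) and B in R^(r x d), so rank(W) <= r."""
    def __init__(self, d: int, V: int, r: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(V, r))   # coefficients drawn from N(0, 1), as described
        self.B = nn.Parameter(torch.randn(r, d))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h @ self.B.T @ self.A.T             # logits of shape (..., V)

head = RankRestrictedHead(d, V, r)
h = torch.randn(32, d)                             # frozen contextual representations (placeholder)
targets = torch.randint(0, V, (32,))
loss = nn.functional.cross_entropy(head(h), targets)
loss.backward()                                    # only A and B receive gradients; f_theta stays frozen
```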
Figure 6 shows that, regardless of model size, perplexity starts to degrade noticeably when the rank of the language modeling head W falls below 1000. This implies that the head is not a major performance bottleneck for models with larger hidden dimensions, but that for models with smaller hidden dimensions it can hurt performance independently of the quality of the output representations.
Another interesting factor is the inherent dimensionality of the data itself. To avoid effects tied to a specific inductive bias, the paper trains naive 5-gram language models on several datasets of varying coverage (IMDb, Wikitext, and The Pile), using two tokenizers with different vocabulary sizes (30k tokens for Llama-2, 50k tokens for Pythia). Given C observed 5-grams, the paper considers the matrix W ∈ R^(C×V), where each row is the probability distribution over possible tokens given the preceding 4 tokens, and computes its singular value distribution, following Terashima (2003).
Figure 7 reports the W-error, i.e. the minimum error incurred when approximating W by a matrix of rank d, as predicted by the Eckart-Young-Mirsky theorem (see Lemma 5.2), normalized by the Frobenius norm of W.
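A sketch of how this normalized best-rank-d approximation error can be computed from the singular values; the 5-gram counting step is omitted, and W below is a random placeholder for the conditional-probability matrix, with small illustrative sizes:

```python
import numpy as np

def rank_d_error(W: np.ndarray, d: int) -> float:
    """Relative Frobenius error of the best rank-d approximation of W (Eckart-Young-Mirsky).

    The optimal error is the norm of the discarded tail of singular values,
    normalized by the Frobenius norm of W itself.
    """
    s = np.linalg.svd(W, compute_uv=False)          # singular values, in descending order
    return float(np.sqrt((s[d:] ** 2).sum()) / np.sqrt((s ** 2).sum()))

# Placeholder for the 5-gram conditional distribution matrix: C contexts x V tokens,
# each row a probability distribution over the next token.
C, V = 2000, 5000
W = np.random.dirichlet(np.ones(V) * 0.1, size=C)

for d in (64, 256, 1024):
    print(d, rank_d_error(W, d))
```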
Theoretical bottleneck
At the same time, the estimated rank of W is of the same order of magnitude as typical hidden dimensions and therefore cannot be neglected in comparison. The following analyzes, from a theoretical perspective, the connection between the dimension of an ideal linear language modeling head and its performance.
This section aims to identify the formal link between the inherent dimensions of context distributions and performance bottlenecks that can be attributed to the lower dimensionality of language model output representations. To this end, a language modeling head optimized on an ideal context representation is conceived, and the relationship between its spectral properties and the performance gap that arises when training a low-rank head on the same representation is explored.
For more research details, please view the original paper.