
Making training and inference of large models faster than ever! Google's 2022 year-end summary, part four

王林 (forwarded) · 2023-04-12

Although Google's Bard stumbled out of the gate, Google's AI strength still should not be underestimated.

Since the beginning of the year, "Google Research, 2022 & beyond", the year-end summary series led by Jeff Dean, has been updated steadily, and the fourth installment was recently released.

The theme of this issue is "Improving model efficiency". Let's take a look at what ideas Google engineers have come up with!

Operational efficiency is key

Over the past decade, deep learning has experienced explosive development, largely due to the combination of new algorithms and architectures, significant increases in data volume, and improvements in computing power.

Compared with ten years ago, artificial intelligence and machine learning models have become larger and more complex, with deeper network structures, more parameters, and more training data; together, these have fueled some of the most transformative results in the history of machine learning.

As these models are increasingly deployed in production and business applications, their inference efficiency and running cost have gone from an afterthought to a major limiting factor.

Google's response has been to continue investing heavily in machine learning efficiency, mainly addressing the following four problems:

1. Efficient Architecture

2. Data Efficiency

3. Training Efficiency

4. Inference Efficiency

In addition to efficiency, models also face many challenges around factuality, security, privacy, and freshness.

This article will focus on a series of new algorithms developed by Google Research to address the above challenges.

Efficient model architecture

One of the most basic questions is: Is there a better way to parameterize the model to improve efficiency?

In 2022, Google Research focused on new techniques that enhance models by retrieving context and injecting external knowledge, and on mixture-of-experts methods that make Transformers (the core of most large machine learning models) more efficient.

Context-augmented models

In pursuit of higher quality and efficiency, neural models can be augmented with external context from large databases or trainable memories.

By leveraging retrieved context, a neural network does not need to memorize vast amounts of world knowledge in its internal parameters, which leads to better parameter efficiency, interpretability, and factuality.

In the article "Decoupled Context Processing for Context Augmented Language Modeling", researchers explore a simple decoupled encoder-decoder architecture for incorporating external context into a language model.

Paper link: https://arxiv.org/abs/2210.05758

The model is able to significantly save computational effort while giving competitive results in autoregressive language modeling and open-domain question answering tasks.

Pre-trained large language models (LLMs) absorb a vast amount of information through self-supervision on large training sets, but it is unclear how these models' "world knowledge" interacts with the input context.

Through knowledge-aware fine-tuning (KAFT), researchers incorporate counterfactual and irrelevant contexts into standard supervised datasets, enhancing the controllability and robustness of LLMs.

One of the problems in exploring modular deep networks is how to design a conceptual database with corresponding computational modules. The researchers proposed a theoretical architecture that stores "remembered events" as sketches in an external LSH (locality-sensitive hashing) table, with a pointer module that processes the sketches.

Another piece of the puzzle for context-augmented models is accelerators for quickly retrieving information from large databases.

The researchers developed a TPU-based nearest-neighbor search algorithm that is aligned with the TPU's performance model and provides analytical guarantees on expected recall, achieving top performance.

Search algorithms usually involve many hyperparameters and design choices, which makes them hard to tune on new tasks. The researchers therefore proposed a new constrained optimization algorithm that tunes the hyperparameters automatically: given a desired cost or recall as input, it produces configurations that are empirically very close to the speed-recall Pareto frontier and deliver leading performance on standard benchmarks.
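
The speed-recall trade-off that such tuning navigates can be illustrated with a generic two-stage retrieval sketch (this is not Google's TPU algorithm): a cheap low-dimensional scoring pass prunes the database to a candidate set, exact scoring re-ranks the candidates, and the candidate-set size is the knob that trades recall for speed. All sizes and names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(50_000, 128))              # database embeddings (illustrative)
queries = rng.normal(size=(100, 128))
proj = rng.normal(size=(128, 16)) / np.sqrt(16)  # random projection for the cheap pass

db_lo = db @ proj                                # low-dimensional "pruning" representations
k = 10

def exact_topk(q, k):
    return np.argsort(db @ q)[::-1][:k]

def approx_topk(q, k, n_candidates):
    # Stage 1: cheap scoring in the projected space keeps only n_candidates points.
    cand = np.argsort(db_lo @ (proj.T @ q))[::-1][:n_candidates]
    # Stage 2: exact scoring, restricted to the surviving candidates.
    return cand[np.argsort(db[cand] @ q)[::-1][:k]]

for n_candidates in (100, 1_000, 10_000):        # the knob on the speed-recall curve
    recall = np.mean([
        len(set(exact_topk(q, k)) & set(approx_topk(q, k, n_candidates))) / k
        for q in queries
    ])
    print(f"candidates={n_candidates:>6}  recall@{k}={recall:.2f}")
```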

Mixture-of-experts model

Mixture-of-experts (MoE) models have proven to be an efficient means of increasing the capacity of neural network models without unduly increasing their computational cost. The basic idea of MoE is to build a network from a number of expert sub-networks, where each input is processed by an appropriate subset of the experts.

As a result, compared to standard neural networks, an MoE invokes only a small portion of the overall model for each input, which improves the efficiency of language model applications such as GLaM.

Which experts should be active for a given input is decided by a routing function, whose design is challenging: the developer wants each expert to end up neither under-utilized nor over-utilized.

In a recent work, researchers proposed Expert Choice Routing, a new routing mechanism that, instead of assigning each input token to its top-k experts, assigns each expert its top-k tokens. This automatically balances the load across experts while naturally allowing multiple experts to process a given input token.

Paper link: ​https://openreview.net/pdf?id=jdJo1HIVinI​
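
A minimal sketch of the expert-choice idea (token-expert affinities from a learned router, then each expert picking its own top-k tokens) is shown below. It illustrates the routing rule only, not the published implementation; all shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 16, 32, 4
capacity = 8                      # top-k tokens each expert keeps (illustrative)

tokens = rng.normal(size=(n_tokens, d_model))
w_router = rng.normal(size=(d_model, n_experts)) * 0.1

# Token-expert affinities, normalized over experts for each token.
logits = tokens @ w_router
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Expert choice: each expert selects the `capacity` tokens it scores highest,
# instead of each token selecting its top experts.
for e in range(n_experts):
    chosen = np.argsort(probs[:, e])[::-1][:capacity]
    # In a real MoE layer, expert e's FFN would now process tokens[chosen],
    # weighted by probs[chosen, e]; here we just report the assignment.
    print(f"expert {e} processes tokens {sorted(chosen.tolist())}")
```

Note that every expert receives exactly `capacity` tokens, which is how load balancing falls out automatically; a given token may be picked by several experts or by none.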

Efficient Transformers

Transformers are currently the most popular sequence-to-sequence models and have demonstrated very strong performance on a range of challenging tasks, from vision to natural language understanding.

A core component of these models is the attention layer, which computes the similarity between "queries" and "keys" and uses it to build a suitably weighted combination of "values". While the performance is strong, the attention mechanism is not computationally efficient: its complexity is typically quadratic in the length of the input sequence.
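
For reference, here is a minimal NumPy sketch of standard scaled dot-product attention; the n x n score matrix is what makes the cost quadratic in the sequence length n.

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n, n): the quadratic-cost term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                      # value-weighted combination

n, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(attention(Q, K, V).shape)             # (8, 16)
```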

As Transformers continue to scale up, one important research question is whether there are naturally occurring structures or patterns in the learned models that can yield efficient attention.

In this regard, Google Research studied the learned embeddings in the intermediate MLP layers and found that they are very sparse; in the T5-Large model, for example, only a small fraction of the entries are non-zero for a given input.

Paper link: https://arxiv.org/pdf/2210.06313.pdf

Researchers recently proposed Treeformer, an alternative to standard attention computation that relies on decision trees. Intuitively, the trees quickly identify a small subset of keys that are relevant to a query, and attention is computed only over this set. Empirically, Treeformer can reduce the FLOPs of the attention layer by up to 30x.
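
The sketch below conveys the idea of attending only over a small retrieved key subset. For simplicity it selects each query's top-k keys by exact dot product, whereas Treeformer uses decision trees to find that subset cheaply, so this is an illustration of the sparse-attention step rather than of the tree-based retrieval itself.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=4):
    """Attend over only k keys per query (a stand-in for the tree-retrieved subset)."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    for i, q in enumerate(Q):
        idx = np.argsort(K @ q)[::-1][:k]           # small subset of relevant keys
        scores = (K[idx] @ q) / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ V[idx]                          # attention only over the subset
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(32, 16)) for _ in range(3))
print(topk_sparse_attention(Q, K, V, k=4).shape)     # (32, 16)
```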

At the same time, the researchers introduced Sequential Attention, a differentiable feature-selection method that combines attention with a greedy algorithm. This technique has proven effective and migrates seamlessly to large-scale models.

Another way to improve Transformer efficiency is to speed up the calculation of softmax in the attention layer.

Building on the study of low-rank approximations of the softmax kernel, researchers proposed a new class of random features that provides the first "positive and bounded" random-feature approximation of the softmax kernel, with computation that is linear in the sequence length.

Paper link: https://arxiv.org/abs/2205.15317

The same work also proposed the first mechanisms covering multiple attention masks, such as causal and relative positional encoding.
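
A minimal sketch of positive random features for the softmax (exponential) kernel is given below: with omega drawn from a standard Gaussian, phi(x) = exp(omega.x - ||x||^2/2) / sqrt(m) yields strictly positive features whose inner product is an unbiased estimate of exp(q.k), which is what allows attention to be linearized. The specific estimator in the paper differs, so treat this as an illustration of the principle only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4096                          # embedding dim and number of random features
omega = rng.normal(size=(m, d))          # random projections

def positive_features(x):
    """phi(x) with strictly positive entries; E[phi(q) . phi(k)] = exp(q . k)."""
    return np.exp(omega @ x - 0.5 * np.dot(x, x)) / np.sqrt(m)

q, k = 0.3 * rng.normal(size=d), 0.3 * rng.normal(size=d)
exact = np.exp(q @ k)
approx = positive_features(q) @ positive_features(k)
print(f"exact={exact:.4f}  random-feature estimate={approx:.4f}")
```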

Training Efficiency

Effective optimization methods are the cornerstone of modern machine learning applications, and are especially important in large-scale environments.

In such settings, even first-order adaptive methods like Adam are often computationally expensive, and training stability becomes difficult to maintain.

In addition, these methods are often agnostic to the architecture of the neural network and ignore the structural information within it, resulting in low training efficiency. This motivates new techniques that optimize modern neural network models more effectively.

Google Research has developed new, architecture-aware training techniques, for example for training Transformer networks, including a new scale-invariant Transformer architecture and new clipping methods that, when combined with vanilla stochastic gradient descent (SGD), speed up training.

Paper link: ​https://arxiv.org/pdf/2202.00980.pdf​

Using this approach, the researchers were able for the first time to train BERT effectively with simple SGD, without the need for adaptivity.
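
A minimal sketch of the kind of clipped vanilla SGD step referred to above (global-norm gradient clipping, no adaptivity) is shown below, on a toy least-squares problem; the scale-invariant architecture itself is not reproduced here, and the learning rate and clipping norm are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=256)

w = np.zeros(10)
lr, clip_norm = 0.1, 1.0

for step in range(200):
    idx = rng.integers(0, len(X), size=32)             # mini-batch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
    norm = np.linalg.norm(grad)
    if norm > clip_norm:                               # global-norm clipping
        grad *= clip_norm / norm
    w -= lr * grad                                     # plain SGD update

print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```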

In addition, the researchers proposed LocoProp, a new method that obtains performance similar to a second-order optimizer while using the same compute and memory resources as a first-order optimizer.

LocoProp takes a modular view of neural networks, decomposing them into a composition of layers. Each layer is then given its own loss function, output target, and weight regularizer. With this setup, after the appropriate forward and backward passes, LocoProp performs parallel updates that minimize each layer's "local loss".

Paper link: ​https://proceedings.mlr.press/v151/amid22a.html​

These updates turn out to be similar, both theoretically and empirically, to those of higher-order optimizers; on a deep autoencoder benchmark, LocoProp achieved performance comparable to higher-order optimizers while being significantly faster.
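
The sketch below illustrates the layer-local update idea on a tiny two-layer network with squared-error local losses: a single backward pass produces a target for each layer's pre-activation, and each layer then runs a few independent gradient steps toward its own target. The actual method uses Bregman divergences matched to each layer's activation function; this simplified version is only meant to show the structure of the parallel local updates, and all hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))                      # inputs
y = rng.normal(size=(64, 4))                      # regression targets
W1, b1 = 0.1 * rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = 0.1 * rng.normal(size=(16, 4)), np.zeros(4)
eta, local_lr, local_steps = 0.5, 0.05, 5

for outer in range(50):
    # Forward pass.
    a1 = x @ W1 + b1
    h1 = np.tanh(a1)
    a2 = h1 @ W2 + b2
    # Backward pass: gradients of the global squared loss w.r.t. pre-activations.
    g2 = (a2 - y) / len(x)
    g1 = (g2 @ W2.T) * (1 - np.tanh(a1) ** 2)
    # Layer-local targets: a small step of each pre-activation against its gradient.
    t1, t2 = a1 - eta * g1, a2 - eta * g2
    # Parallel local updates: each layer fits its own target with its input frozen.
    for _ in range(local_steps):
        r1 = (x @ W1 + b1) - t1
        W1 -= local_lr * x.T @ r1 / len(x)
        b1 -= local_lr * r1.mean(axis=0)
        r2 = (h1 @ W2 + b2) - t2
        W2 -= local_lr * h1.T @ r2 / len(x)
        b2 -= local_lr * r2.mean(axis=0)

print("final loss:", 0.5 * np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2))
```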

A key assumption of optimizers like SGD is that each data point is sampled independently and identically from a distribution, but in real-world settings such as reinforcement learning this assumption is hard to satisfy, because the model (or agent) must learn from data generated based on its own predictions.

The researchers proposed a new algorithm, SGD with reverse experience replay, which finds optimal solutions in several settings, including linear and nonlinear dynamical systems, Q-learning, and reinforcement learning.

Paper link: https://arxiv.org/abs/2103.05896

Additionally, an improved version of this method, IER, has been shown experimentally to be the state-of-the-art and most stable experience replay technique on a variety of popular RL benchmarks.
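
The sketch below shows the reverse-replay pattern on a toy streaming linear-regression problem with temporally correlated inputs: transitions are collected into a buffer, and SGD updates are then applied in reverse chronological order. It is a schematic of the replay order only, not of the algorithm's full analysis setting; buffer size and learning rate are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=4)
w = np.zeros(4)
lr, buffer_size = 0.5, 32

x = rng.normal(size=4)
buffer = []
for t in range(2000):
    # Correlated (non-i.i.d.) data stream: an AR(1) process over the inputs.
    x = 0.9 * x + 0.1 * rng.normal(size=4)
    y = x @ w_true + 0.01 * rng.normal()
    buffer.append((x.copy(), y))
    if len(buffer) == buffer_size:
        # SGD with reverse experience replay: iterate the buffer newest-first.
        for xi, yi in reversed(buffer):
            w -= lr * (xi @ w - yi) * xi
        buffer.clear()

print("parameter error:", np.linalg.norm(w - w_true))
```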

Data Efficiency

Deep neural networks rely heavily on large datasets, which bring storage costs and potential security/privacy issues; training modern deep neural networks on such datasets also comes with high computational costs.

A promising way to address this problem is data subset selection, where the learner's goal is to find the most informative subset of a large number of training samples that approximates (or even improves upon) training on the full training set.

The researchers analyzed a subset-selection framework designed to work with arbitrary model families in a practical batch setting: the learner can sample one example at a time, accessing both the context and the true label, but to limit the overhead it can only update its state (i.e., further train the model weights) once a large enough batch of examples has been selected.

Based on this, the researchers developed an algorithm called IWeS that selects examples by importance sampling, where the sampling probability assigned to each example is based on the entropy of a model trained on the previously selected batches. The paper provides a theoretical analysis with bounds on generalization and sampling rate.

Paper link: ​https://arxiv.org/pdf/2301.12052.pdf​
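
A minimal sketch of entropy-based importance sampling for subset selection is shown below: a simple model trained on the batches selected so far scores the pool, and new examples are sampled with probability proportional to predictive entropy. The weighting scheme and theoretical machinery of IWeS are omitted; the classifier, batch sizes, and learning rates here are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pool, d, n_classes, batch = 5000, 20, 3, 200
X = rng.normal(size=(n_pool, d))
W_true = rng.normal(size=(d, n_classes))
y = (X @ W_true + rng.gumbel(size=(n_pool, n_classes))).argmax(axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

selected = list(rng.choice(n_pool, size=batch, replace=False))   # seed batch
W = np.zeros((d, n_classes))
for round_idx in range(3):
    # Train a simple softmax classifier on the examples selected so far.
    for _ in range(200):
        P = softmax(X[selected] @ W)
        G = X[selected].T @ (P - np.eye(n_classes)[y[selected]]) / len(selected)
        W -= 0.5 * G
    # Score the pool by predictive entropy and sample the next batch accordingly.
    P_pool = softmax(X @ W)
    entropy = -(P_pool * np.log(P_pool + 1e-12)).sum(axis=1)
    entropy[selected] = 0.0                       # don't re-pick selected examples
    probs = entropy / entropy.sum()
    new = rng.choice(n_pool, size=batch, replace=False, p=probs)
    selected.extend(new.tolist())

print("selected", len(selected), "examples out of", n_pool)
```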

Another problem with training large networks is that they can be highly sensitive to distribution shifts between the training data and the data seen at deployment, especially when the amount of training data is limited and may not cover all deployment-time scenarios.

A recent study hypothesized that "extreme simplicity bias" is the key issue behind this brittleness of neural networks; the latest work makes this hypothesis actionable, leading to two new complementary methods, DAFT and FRR, which together yield significantly more robust neural networks. In particular, these two methods use adversarial fine-tuning and inverse feature prediction to improve the robustness of the learned network.

Paper link: ​https://arxiv.org/pdf/2006.07710.pdf​

Inference Efficiency

Increasing the size of a neural network has been shown to be surprisingly effective at improving its predictive accuracy. However, exploiting these gains in the real world is challenging, because the inference cost of large models can be prohibitive, which motivates strategies to improve serving efficiency without sacrificing accuracy.

Researchers have proposed different strategies to achieve this goal, especially those based on knowledge distillation and adaptive computing.

Distillation

Distillation is a simple and effective model-compression method that greatly expands the potential applicability of large neural models, and it has proven very effective in a range of practical applications such as ads recommendation.

Most use cases of distillation involve directly applying a basic recipe to a given domain, with only a limited understanding of when and why it should work. Google's research looks at tailoring distillation to specific settings and systematically examines the factors that determine distillation success.

On the algorithmic side, by carefully modeling the noise in the labels provided by the teacher model, the researchers developed a principled approach for reweighting training examples and a robust method for sampling the subset of data to be labeled by the teacher.

Paper link: https://arxiv.org/abs/2210.06711
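
A minimal sketch of a distillation loss with per-example weights is shown below. The weighting scheme here (trusting confident teacher predictions more) is only a stand-in for the paper's principled noise model, and the temperature and weights are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weighted_distillation_loss(student_logits, teacher_logits, T=2.0):
    """Per-example-weighted KL(teacher || student) with temperature T."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    # Illustrative weights: down-weight examples where the teacher is unconfident.
    weights = p_t.max(axis=-1)
    return (weights * kl).sum() / weights.sum()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 5)) * 3.0   # fake logits for 8 examples, 5 classes
student = rng.normal(size=(8, 5))
print("weighted distillation loss:", weighted_distillation_loss(student, teacher))
```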

With "teacher-guided training", the researchers proposed a new distillation framework: instead of passively using the teacher to annotate a fixed dataset, the teacher is actively used to guide the selection of informative samples to annotate, which makes the distillation process more efficient in limited-data or long-tail settings.

Paper link: ​https://arxiv.org/abs/2208.06825​

The researchers also studied new methods for distilling from a cross-encoder (e.g., BERT) to a factorized dual-encoder, an important setting for scoring the relevance of a (query, document) pair.

Paper link: https://proceedings.mlr.press/v162/menon22a/menon22a.pdf

The paper studies the reasons for the performance gap between cross-encoders and dual-encoders, noting that this may be the result of generalization rather than a capacity limitation of dual-encoders.

A carefully constructed distillation loss function can alleviate this situation and close the performance gap between cross-encoders and dual-encoders.

Subsequently, EmbedDistill explored improving dual-encoder distillation further by matching embeddings from the teacher model. This strategy can also be used to distill from a large dual-encoder into a small one, where inheriting and freezing the teacher's document embeddings proves very effective.

Paper link: ​https://arxiv.org/abs/2301.12005​
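
The sketch below shows the two ingredients in embedding-matching distillation for a dual-encoder student (corresponding to the large-to-small dual-encoder case): a score-distillation term that matches the teacher's (query, document) scores and an embedding-matching term that pulls the student's embeddings toward the teacher's. It is a schematic of the losses only; the encoder outputs and the loss weight are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 32                                    # (query, document) pairs, embed dim

# Stand-in encoder outputs (a real setup would compute these with the two models).
t_q, t_d = rng.normal(size=(n, d)), rng.normal(size=(n, d))   # teacher embeddings
s_q, s_d = rng.normal(size=(n, d)), rng.normal(size=(n, d))   # student embeddings

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

t_q, t_d, s_q, s_d = map(normalize, (t_q, t_d, s_q, s_d))

# Score distillation: student relevance scores should match the teacher's.
t_scores = (t_q * t_d).sum(axis=-1)
s_scores = (s_q * s_d).sum(axis=-1)
score_loss = np.mean((s_scores - t_scores) ** 2)

# Embedding matching: align student and teacher query/document embeddings.
embed_loss = np.mean((s_q - t_q) ** 2) + np.mean((s_d - t_d) ** 2)

total = score_loss + 0.5 * embed_loss            # 0.5 is an illustrative weight
print(f"score_loss={score_loss:.3f}  embed_loss={embed_loss:.3f}  total={total:.3f}")
```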

On the theoretical side, the researchers provide a new perspective on distillation through the lens of supervised complexity, a measure of the student's ability to predict the teacher's labels.

Paper link: https://arxiv.org/abs/2301.12245

Using neural tangent kernel (NTK) theory, the work draws several conceptual conclusions; for example, it shows that a capacity gap between teacher and student can affect distillation, because such a teacher's labels may look to the student like purely random labels.

Paper link: ​https://arxiv.org/abs/2301.12923​

It is further shown that distillation can cause the student to underfit points that the teacher model itself finds difficult to fit; intuitively, this may help the student focus its limited capacity on the samples it can model reasonably well.

Adaptive computation

Although distillation is an effective means of reducing inference cost, it does so uniformly across all samples; intuitively, however, some "easy" samples may inherently require less computation than relatively "hard" ones.

The goal of adaptive computing is to design mechanisms that can perform such sample-dependent calculations.

Confident Adaptive Language Modeling (CALM) introduces controlled early exit functionality for Transformer-based text generators such as T5.

Paper link: ​https://arxiv.org/abs/2207.07061​

In this form of adaptive computation, the model dynamically adjusts the number of Transformer layers used at each decoding step: an early-exit gate compares a confidence measure against a decision threshold, and the threshold is calibrated to satisfy statistical performance guarantees.

This way, the model only needs to compute the full decoder layer stack for the most challenging predictions, and only a few decoder layers for simpler predictions. In practice, the model uses about one-third as many layers on average for predictions, resulting in a 2-3x speedup while maintaining the same level of generation quality.
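
A toy sketch of confidence-based early exiting is shown below: each "layer" feeds a shared output head, and the decoding step stops at the first layer whose top probability exceeds a threshold. The layers here are random stand-ins that only illustrate the control flow, and the threshold is fixed rather than calibrated as in CALM.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d, vocab = 12, 64, 100
layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
head = rng.normal(size=(d, vocab))                 # shared output head
threshold = 0.5                                    # exit once max prob exceeds this

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(h):
    """Run layers until the prediction is confident enough, then exit early."""
    for depth, W in enumerate(layers, start=1):
        h = np.tanh(h @ W)
        probs = softmax(h @ head)
        if probs.max() >= threshold or depth == n_layers:
            return probs.argmax(), depth           # token prediction, layers used

layers_used = [decode_step(rng.normal(size=d))[1] for _ in range(20)]
print("average layers used per decoding step:", np.mean(layers_used))
```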

A commonly used adaptive computation mechanism is a cascade of two or more base models. The key question is whether to simply use the current model's prediction or to defer to a downstream model; learning when to defer requires designing a suitable loss function that can use appropriate signals as supervision for the deferral decision.

Google Research systematically studied existing loss functions and demonstrated that, due to an implicit application of label smoothing, they may not be well suited to the training samples. The paper also showed that this can be mitigated by post-hoc training of the deferral rule, which does not require modifying the model internals in any way.

Paper link: https://openreview.net/pdf?id=_jg6Sf6tuF7​
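
A minimal sketch of a two-model cascade with a confidence-based deferral rule is shown below. The rule here is just a threshold on the small model's confidence, whereas the paper studies learned deferral rules and their loss functions; the predictive distributions are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes = 1000, 5

# Stand-in predictive distributions from a small and a large model.
small_probs = rng.dirichlet(alpha=np.ones(n_classes) * 0.7, size=n)
large_probs = rng.dirichlet(alpha=np.ones(n_classes) * 0.3, size=n)

threshold = 0.8                        # defer to the large model below this confidence
confidence = small_probs.max(axis=1)
defer = confidence < threshold

preds = np.where(defer, large_probs.argmax(axis=1), small_probs.argmax(axis=1))
print(f"deferred {defer.mean():.0%} of examples to the large model")
```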

For retrieval applications, standard semantic-search techniques use a fixed representation for each embedding generated by a large model; that is, the size and capacity of the representation are essentially fixed, regardless of the downstream task and its associated compute environment or constraints.

Matryoshka representation learning introduces the flexibility to adapt the representation to the deployment environment by forcing it to have a natural ordering over its coordinates: in resource-constrained environments, only the top few coordinates of the representation are used, while in richer, precision-critical settings more coordinates can be used.

Paper link: ​https://openreview.net/pdf?id=9njZa1fm35​

When combined with standard approximate nearest-neighbor search techniques such as ScaNN, MRL can deliver up to 16x lower compute at the same recall and accuracy.
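
A minimal sketch of using a Matryoshka-style embedding at different widths is shown below: retrieval in a resource-constrained setting keeps only the first d' coordinates (re-normalized), and recall against full-width retrieval is measured. The embeddings here are random placeholders; with actual MRL-trained embeddings the leading coordinates would carry most of the information.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_queries, d_full = 20_000, 50, 256
docs = rng.normal(size=(n_docs, d_full))
queries = rng.normal(size=(n_queries, d_full))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def topk(q, d, k=10):
    return set(np.argsort(d @ q)[::-1][:k])

full_docs, full_queries = normalize(docs), normalize(queries)
for d_prefix in (16, 64, 256):                     # coordinates kept at deployment time
    small_docs = normalize(docs[:, :d_prefix])     # truncate, then re-normalize
    small_queries = normalize(queries[:, :d_prefix])
    recall = np.mean([
        len(topk(sq, small_docs) & topk(fq, full_docs)) / 10
        for sq, fq in zip(small_queries, full_queries)
    ])
    print(f"prefix dim {d_prefix:>3}: recall@10 vs full = {recall:.2f}")
```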

Summary

Large machine learning models have demonstrated transformative results across multiple domains, but efficiency in training and inference is becoming a critical requirement for making these models practical in the real world.

By developing new fundamental techniques, Google Research has invested heavily in making large-scale machine learning models efficient, and this requires sustained effort. In the future, Google will continue to explore the core challenges of making machine learning models more robust and efficient.

