Make training and inference of large models faster than ever! Google's 2022 year-end summary, part four

Although Google's Bard stumbled out of the gate, Google's AI strength still should not be underestimated.

Since the beginning of the year, the Google Research year-end summary series led by Jeff Dean, "Google Research, 2022 & beyond", has been updated continuously, and the fourth installment was recently released.

The theme of this issue is "Improving model efficiency". Let's take a look at what ideas Google engineers have come up with!

Operational efficiency is key

Over the past decade, deep learning has seen explosive growth, driven largely by new algorithms and architectures, significant increases in data volume, and improvements in computing power.

Compared with ten years ago, AI and machine learning models have become larger and more complex, with deeper network structures, more parameters, and more training data. Together, these factors have fueled some of the most transformative results in the history of machine learning.

As these models are increasingly deployed in production and business applications, their inference efficiency and running costs have gone from a minor consideration to a major limiting factor.

Google's response has been to continue investing heavily in machine learning efficiency, mainly tackling the following four problems:

1. Efficient Architecture

2. Data Efficiency

3. Training Efficiency

4. Inference Efficiency

In addition to efficiency, these models also face many challenges around factuality, security, privacy, and freshness.

This article will focus on a series of new algorithms developed by Google Research to address the above challenges.

Efficient model architecture

One of the most basic questions is: Is there a better way to parameterize the model to improve efficiency?

In 2022, Google Research focused on new techniques that enhance models by retrieving context, mixture-of-experts models, making Transformers (the core of most large machine learning models) more efficient, and injecting external knowledge.

Context-augmented models

In pursuit of higher quality and efficiency, neural models can be augmented with external context from large databases or trainable memories.

By leveraging retrieved context, a neural network does not need to memorize vast amounts of world knowledge in its internal parameters, which leads to better parameter efficiency, interpretability, and factuality.
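To make the idea concrete, here is a minimal Python/NumPy sketch of the generic retrieval step behind context augmentation: the model looks up the most relevant entries in an external memory and consumes them as extra context, instead of storing that knowledge in its parameters. The toy memory and embeddings are made up for illustration; this is not the architecture from the paper discussed below.

```python
import numpy as np

def retrieve_context(query_emb, memory_embs, memory_texts, k=3):
    """Return the k entries of the external memory most similar to the query."""
    # Cosine similarity between the query embedding and every memory entry.
    sims = memory_embs @ query_emb / (
        np.linalg.norm(memory_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    top_k = np.argsort(-sims)[:k]
    return [memory_texts[i] for i in top_k]

# Toy usage: the retrieved passages would be handed to the model as extra
# context rather than memorized inside its parameters.
memory_texts = ["fact A", "fact B", "fact C", "fact D"]
memory_embs = np.random.randn(4, 8)
query_emb = np.random.randn(8)
print(retrieve_context(query_emb, memory_embs, memory_texts, k=2))
```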

In the paper "Decoupled Context Processing for Context Augmented Language Modeling", researchers explore a simple decoupled encoder-decoder architecture for incorporating external context into a language model.

Paper link: https://arxiv.org/abs/2210.05758

The model substantially reduces computation while achieving competitive results on autoregressive language modeling and open-domain question answering tasks.

Pre-trained large language models (LLMs) absorb a huge amount of information through self-supervision on large training sets, but it is unclear how these models' "world knowledge" interacts with the input context.

Through knowledge-aware fine-tuning (KAFT), researchers incorporate counterfactual and irrelevant contexts into standard supervised datasets, which enhances the controllability and robustness of LLMs.

One question in exploring modular deep networks is how to design a database of concepts with corresponding computational modules. The researchers proposed a theoretical architecture that stores "remembered events" as sketches in an external LSH (locality-sensitive hashing) table, with a pointer module that processes the sketches.

Another piece of the puzzle for context-augmented models is accelerators for quickly retrieving information from large databases.

The researchers developed a TPU-based nearest-neighbor search algorithm that aligns with the TPU's performance model and provides analytical guarantees on expected recall, achieving peak performance.

Search algorithms usually involve many hyperparameters and design choices, which makes them hard to tune on new tasks. The researchers therefore further proposed a constrained optimization algorithm for automatically tuning hyperparameters: taking the desired cost or recall as input, it produces tunings that are empirically very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks.

Mixture-of-experts model

Mixture-of-experts (MoE) models have proven to be an efficient means of increasing the capacity of neural networks without unduly increasing their computational cost. The basic idea of MoE is to build a network from many expert sub-networks, where each input is processed by a suitable subset of the experts.

As a result, compared with standard neural networks, an MoE invokes only a small portion of the overall model for each input, which improves efficiency in language model applications such as GLaM.

Deciding which experts should be active for a given input is handled by the routing function, whose design is challenging: developers want every expert to be used appropriately, neither under- nor over-utilized.

In a recent work, researchers proposed Expert Choice Routing, a new routing mechanism that, instead of assigning each input token to its top-k experts, lets each expert pick its top-k tokens. This automatically balances the load across experts while naturally allowing multiple experts to process a given input token.
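As a rough illustration, the sketch below covers only the routing step of this idea: each expert picks its own top-k tokens by affinity score, so the load is balanced by construction. The scoring function, capacity value, and toy data are assumptions for illustration; the expert networks themselves are omitted.

```python
import numpy as np

def expert_choice_routing(token_embs, expert_embs, capacity):
    """Each expert selects its top-`capacity` tokens by token-expert affinity.

    Only the routing step is sketched; the selected tokens would then be
    processed by the corresponding expert feed-forward networks.
    """
    logits = token_embs @ expert_embs.T                    # (n_tokens, n_experts)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    scores = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    assignments = {}
    for e in range(expert_embs.shape[0]):
        # Expert e takes the `capacity` tokens with the highest affinity to it,
        # so every expert processes the same number of tokens (balanced load),
        # and a token may be picked by several experts.
        assignments[e] = np.argsort(-scores[:, e])[:capacity].tolist()
    return assignments

tokens = np.random.randn(16, 32)       # toy token embeddings
experts = np.random.randn(4, 32)       # toy expert embeddings
print(expert_choice_routing(tokens, experts, capacity=4))
```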

Paper link: https://openreview.net/pdf?id=jdJo1HIVinI

Efficient Transformers

The Transformer is currently the most popular sequence-to-sequence model and has demonstrated very strong performance on a range of challenging tasks, from vision to natural language understanding.

A core component of this type of model is the attention layer, which computes the similarity between "queries" and "keys" and uses it to construct an appropriately weighted combination of "values". Although it performs well, the attention mechanism is not computationally efficient: its complexity is typically quadratic in the length of the input sequence.
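The quadratic cost is easy to see in a bare-bones sketch of standard attention: the score matrix has one entry per (query, key) pair, so its size grows with the square of the sequence length.

```python
import numpy as np

def attention(Q, K, V):
    """Standard softmax attention; the (seq_len x seq_len) score matrix is what
    makes the cost quadratic in the sequence length."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # value-weighted combination

seq_len, d = 128, 64
Q, K, V = (np.random.randn(seq_len, d) for _ in range(3))
out = attention(Q, K, V)   # the 128 x 128 score matrix is the bottleneck
```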

As Transformers continue to scale up, one question is especially valuable to study: are there naturally occurring structures or patterns in the learned models that can make attention more efficient?

In this regard, Google Research studied the learned embeddings in the intermediate MLP layers and found that they are very sparse: in the T5-Large model, for example, only a small fraction of the entries are nonzero.

Paper link: https://arxiv.org/pdf/2210.06313.pdf

Researchers recently proposed Treeformer, an alternative to standard attention computation that relies on decision trees. Intuitively, the trees quickly identify a small subset of keys relevant to a query, and attention is computed only over that subset. Empirically, Treeformer can reduce the FLOPs of the attention layer by 30x.

The researchers also introduced Sequential Attention, a differentiable feature selection method that combines attention with a greedy algorithm; the technique has provable guarantees and transfers seamlessly to large-scale models.

Another way to improve Transformer efficiency is to speed up the calculation of softmax in the attention layer.

Building on the study of low-rank approximations of the softmax kernel, researchers proposed a new class of random features that provides the first "positive and bounded" random-feature approximation of the softmax kernel, with computation that is linear in the sequence length.

Paper link: https://arxiv.org/abs/2205.15317

They also proposed the first mechanism that covers multiple attention masks, such as causal masking and relative position encoding.
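A minimal sketch of the positive-random-feature idea is shown below, assuming the basic construction in which the softmax kernel exp(q.k) is estimated by an inner product of positive feature maps; the masking mechanisms and the exact constructions from the papers above are not reproduced. Because the key-value summary is computed once and reused for every query, the overall cost becomes linear in the sequence length.

```python
import numpy as np

def positive_random_features(x, W):
    """Map x to positive features so that phi(q) . phi(k) approximates exp(q . k).

    W has shape (m, d) with rows drawn from N(0, I); larger m lowers the
    variance of the kernel estimate."""
    m = W.shape[0]
    return np.exp(x @ W.T - np.sum(x**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

def linear_attention(Q, K, V, W):
    """Attention whose cost is linear in sequence length: the (features x values)
    summary is computed once from the keys and shared by all queries."""
    Qf, Kf = positive_random_features(Q, W), positive_random_features(K, W)
    KV = Kf.T @ V                        # (m, d_v) summary of keys and values
    normalizer = Qf @ Kf.sum(axis=0)     # per-query normalization term
    return (Qf @ KV) / normalizer[:, None]

n, d, m = 256, 32, 128
W = np.random.randn(m, d)
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V, W)
```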

Training Efficiency

Effective optimization methods are the cornerstone of modern machine learning applications, and are especially important in large-scale environments.

In such settings, even first-order adaptive methods like Adam often require a large amount of computation, and training stability becomes difficult to maintain.

In addition, these methods are often agnostic to the neural network's architecture and ignore structural information within the model, which limits training efficiency. This motivates new techniques for optimizing modern neural network models more effectively.

Google Research has developed new, architecture-aware training techniques, for example for training Transformer networks, including a new scale-invariant Transformer architecture and new clipping methods that, when combined with vanilla stochastic gradient descent (SGD), speed up training.

Paper link: https://arxiv.org/pdf/2202.00980.pdf

Using this approach, the researchers were able for the first time to effectively train BERT with simple SGD alone, without adaptivity.
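For reference, a generic clipped-SGD update looks like the sketch below; the scale-invariant architecture and the specific clipping rule from the paper are not reproduced, this only shows how clipping slots into a vanilla SGD step.

```python
import numpy as np

def clipped_sgd_step(params, grads, lr=0.1, clip_norm=1.0):
    """One SGD step with global gradient-norm clipping (a generic sketch, not
    the exact clipping method from the paper)."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, clip_norm / (total_norm + 1e-12))   # shrink large gradients
    return [p - lr * scale * g for p, g in zip(params, grads)]

params = [np.random.randn(3, 3), np.random.randn(3)]
grads = [np.random.randn(3, 3), np.random.randn(3)]
params = clipped_sgd_step(params, grads)
```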

In addition, the researchers proposed LocoProp, a new method that achieves performance similar to a second-order optimizer while using the same compute and memory resources as a first-order optimizer.

LocoProp takes a modular view of the neural network, decomposing it into a composition of layers. Each layer is then given its own loss function, output target, and weight regularizer. With this setup, after the usual forward and backward passes, LocoProp performs parallel updates minimizing each layer's "local loss".

Paper link: https://proceedings.mlr.press/v151/amid22a.html

These updates turn out to resemble those of higher-order optimizers both theoretically and empirically, and on a deep autoencoder benchmark LocoProp achieves performance comparable to higher-order optimizers while being significantly faster.
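A very loose sketch of the layer-local decomposition is given below, under strong simplifying assumptions (linear layers and a plain squared local loss); the actual per-layer losses and regularizers are defined in the LocoProp paper, so treat this only as an illustration of updating each layer against its own local target, independently of the others.

```python
import numpy as np

def locoprop_style_update(layers, activations, act_grads, local_lr=0.1, local_steps=5):
    """Illustrative layer-local update: each layer gets an output target formed
    from the backward gradient and minimizes its own squared "local loss"
    toward that target, without touching the other layers (so the per-layer
    updates could run in parallel). Nonlinearities and the exact LocoProp
    losses are deliberately omitted."""
    for layer, a_in, a_out, g_out in zip(layers, activations[:-1], activations[1:], act_grads):
        target = a_out - local_lr * g_out               # layer-local output target
        for _ in range(local_steps):
            pred = a_in @ layer["W"]                    # this layer's output for a fixed input
            grad_W = a_in.T @ (pred - target) / len(a_in)
            layer["W"] -= local_lr * grad_W             # local step on this layer only
```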

A key assumption of optimizers like SGD is that each data point is sampled independently and identically from a distribution. In real-world settings such as reinforcement learning, however, this assumption is hard to satisfy because the model (or agent) must learn from data generated based on its own predictions.

The researchers proposed a new algorithm, SGD with reverse experience replay, which finds optimal solutions in several settings, including linear and nonlinear dynamical systems and Q-learning for reinforcement learning.

Paper link: https://arxiv.org/abs/2103.05896

Additionally, an improved version of this method, IER, has been experimentally shown to be the state-of-the-art and most stable experience replay technique on a variety of popular RL benchmarks.
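The core mechanic of reverse experience replay can be sketched as follows, with a hypothetical grad_fn callback standing in for the per-sample gradient computation: updates are applied over the most recent experiences in reverse time order rather than in the order they were collected.

```python
import numpy as np

def sgd_with_reverse_replay(buffer, params, grad_fn, lr=0.01, chunk=32):
    """Sketch of reverse experience replay: take the most recent chunk of
    experiences and apply SGD updates over them in reverse time order, which
    helps de-correlate updates when the data is generated by the learner itself.
    `grad_fn(params, sample)` is a hypothetical per-sample gradient callback."""
    recent = list(buffer)[-chunk:]
    for sample in reversed(recent):
        grads = grad_fn(params, sample)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Toy usage with a dummy quadratic objective 0.5 * ||p - sample||^2 per sample.
buffer = [np.random.randn(4) for _ in range(100)]
params = [np.zeros(4)]
grad_fn = lambda p, s: [p[0] - s]
params = sgd_with_reverse_replay(buffer, params, grad_fn)
```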

Data Efficiency

Deep neural networks rely heavily on large datasets, which bring storage costs and potential security/privacy issues, and training modern deep neural networks on such datasets also incurs high computational costs.

A promising way to address this is data subset selection, in which the learner aims to find the most informative subset of a large number of training samples, so as to approximate (or even improve on) training with the entire training set.

The researchers analyzed a subset selection framework designed to work with arbitrary model families in a practical batch setting: the learner can sample examples one at a time, accessing both the context and the true label, but to limit overhead it can only update its state (i.e., further train the model weights) once a large enough batch of examples has been selected.

Based on this, the researchers developed an algorithm called IWeS that selects examples by importance sampling, where the sampling probability assigned to each example is based on the entropy of a model trained on previously selected batches. The paper provides a theoretical analysis with bounds on generalization and sampling rate.

Paper link: https://arxiv.org/pdf/2301.12052.pdf
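A minimal sketch of the entropy-based importance sampling step might look like the following; the per-example probabilities here are purely illustrative, and the full IWeS algorithm and its guarantees are described in the paper.

```python
import numpy as np

def entropy_sampling_probs(model_probs):
    """Examples whose current-model predictions are most uncertain (highest
    entropy) receive the highest probability of being selected for the next
    labeled training batch."""
    entropy = -np.sum(model_probs * np.log(model_probs + 1e-12), axis=1)
    return entropy / entropy.sum()

probs = np.random.dirichlet(np.ones(10), size=100)   # toy per-example class probabilities
p_select = entropy_sampling_probs(probs)
batch = np.random.choice(len(probs), size=16, replace=False, p=p_select)
```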

Another problem with training large networks is that they can be highly sensitive to distribution shift between the training data and the data seen at deployment, especially when the amount of training data is limited and may not cover all deployment-time scenarios.

A recent study hypothesized that "extreme simplicity bias" is the key issue behind this fragility of neural networks, and the latest work makes this hypothesis actionable, leading to two new complementary methods, DAFT and FRR, which in combination yield significantly more robust neural networks. In particular, the two methods use adversarial fine-tuning and inverse feature prediction to improve the robustness of the learned network.

Paper link: https://arxiv.org/pdf/2006.07710.pdf

Inference Efficiency

Increasing the size of a neural network has been shown to be surprisingly effective at improving its predictive accuracy. However, exploiting these gains in the real world is challenging because the inference cost of large models can be prohibitive, which motivates strategies for improving serving efficiency without sacrificing accuracy.

Researchers have proposed different strategies to achieve this goal, in particular those based on knowledge distillation and adaptive computation.

Distillation

Distillation is a simple and effective model compression method that greatly expands the potential applicability of large neural models, and it has proven very effective in a number of practical applications such as ads recommendation.
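As background, the basic distillation recipe mixes the teacher's temperature-softened predictions with the ground-truth labels; a generic sketch is below. This is the standard soft-target loss, not the specific reweighting or sampling schemes discussed next.

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    z = z - z.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mixture of cross-entropy on the true labels and cross-entropy on the
    teacher's temperature-softened outputs (generic soft-target distillation)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft_ce = -np.mean(np.sum(p_teacher * np.log(p_student + 1e-12), axis=1)) * T * T
    hard_p = softmax(student_logits)[np.arange(len(labels)), labels]
    hard_ce = -np.mean(np.log(hard_p + 1e-12))
    return alpha * soft_ce + (1 - alpha) * hard_ce

student_logits, teacher_logits = np.random.randn(8, 5), np.random.randn(8, 5)
labels = np.random.randint(0, 5, size=8)
loss = distillation_loss(student_logits, teacher_logits, labels)
```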

Most uses of distillation involve directly applying a basic recipe to a given domain, with only limited understanding of when and why it should work. Google's research looks at tailoring distillation to specific settings and systematically examines the factors that determine its success.

Algorithmically, by carefully modeling the noise in the labels provided by the teacher, the researchers developed a principled approach for reweighting training examples and a robust method for sampling the subset of data to be labeled by the teacher.

Paper link: https://arxiv.org/abs/2210.06711

With "teacher-guided training", the researchers proposed a new distillation framework: rather than passively using the teacher to annotate a fixed dataset, the teacher is actively used to guide the selection of informative samples to annotate, which makes the distillation process more efficient in limited-data or long-tail settings.

Paper link: https://arxiv.org/abs/2208.06825

Researchers also studied new methods for distilling from a cross-encoder (e.g., BERT) to a factorized dual-encoder, an important setting for scoring the relevance of a (query, document) pair.

Paper link: https://proceedings.mlr.press/v162/menon22a/menon22a.pdf

The paper studies the reasons for the performance gap between cross-encoders and dual-encoders, pointing out that the gap may be the result of generalization issues rather than capacity limitations of the dual-encoder.

A carefully constructed distillation loss function can alleviate this situation and close the performance gap between cross-encoders and dual-encoders.

Subsequently, EmbedDistill studied further improving dual-encoder distillation by matching embeddings from the teacher model. This strategy can also be used to distill from a large dual-encoder to a small one, where inheriting and freezing the teacher's document embeddings proves very effective.

Paper link: https://arxiv.org/abs/2301.12005
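Reduced to its simplest form, the embedding-matching idea adds an auxiliary term that pulls the student dual-encoder's query and document embeddings toward the teacher's; the sketch below only shows that intuition, not the exact geometric losses from the paper.

```python
import numpy as np

def embedding_matching_loss(student_q, student_d, teacher_q, teacher_d):
    """Auxiliary distillation term: match the student's query/document embeddings
    to the teacher's (the paper also considers inheriting and freezing the
    teacher's document embeddings outright)."""
    return np.mean((student_q - teacher_q) ** 2) + np.mean((student_d - teacher_d) ** 2)

q_s, d_s = np.random.randn(8, 64), np.random.randn(8, 64)   # student embeddings
q_t, d_t = np.random.randn(8, 64), np.random.randn(8, 64)   # teacher embeddings
aux_loss = embedding_matching_loss(q_s, d_s, q_t, d_t)
```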

On the theoretical side, the researchers provided a new perspective on distillation through the lens of supervision complexity, a measure of how well the student can predict the teacher's labels.

Paper link: https://arxiv.org/abs/2301.12245

Using neural tangent kernel (NTK) theory, this yields some conceptual insights; for example, a capacity gap between teacher and student may hurt distillation because the teacher's labels can appear to the student like purely random labels.

Paper link: https://arxiv.org/abs/2301.12923

It is further shown that distillation can cause the student to underfit points that the teacher model itself finds hard to model. Intuitively, this may help the student focus its limited capacity on the samples it can reasonably model.

Adaptive computation

Although distillation is an effective way to reduce inference cost, it does so uniformly across all samples; intuitively, however, some "easy" samples may inherently require less computation than "hard" ones.

The goal of adaptive computation is to design mechanisms that can perform such sample-dependent computation.

Confident Adaptive Language Modeling (CALM) introduces controlled early exit functionality for Transformer-based text generators such as T5.

Paper link: https://arxiv.org/abs/2207.07061

In this form of adaptive computation, the model dynamically adjusts the number of Transformer layers used at each decoding step: an early-exit gate compares a confidence measure against a decision threshold that is calibrated to satisfy statistical performance guarantees.

This way, the model only needs to compute the full decoder layer stack for the most challenging predictions, and only a few decoder layers for simpler predictions. In practice, the model uses about one-third as many layers on average for predictions, resulting in a 2-3x speedup while maintaining the same level of generation quality.
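The decoding-time mechanism can be sketched roughly as follows, with hypothetical layer and classifier callables standing in for the real model; in CALM the exit threshold is calibrated to give statistical guarantees, which is omitted here.

```python
import numpy as np

def decode_step_with_early_exit(hidden, layers, classifier, threshold=0.9):
    """Confidence-based early exit at a single decoding step: after each layer an
    intermediate prediction is made, and once its confidence clears the
    threshold the remaining layers are skipped. Assumes at least one layer."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        probs = classifier(hidden)
        if probs.max() >= threshold:            # confident enough: exit early
            return probs.argmax(), i + 1        # predicted token, layers used
    return probs.argmax(), len(layers)          # fell through: used the full stack

layers = [lambda h: h + 0.1 for _ in range(12)]          # stand-in "layers"
classifier = lambda h: np.abs(h) / np.abs(h).sum()       # stand-in prediction head
token, layers_used = decode_step_with_early_exit(np.random.rand(32), layers, classifier)
```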

A commonly used adaptive computation mechanism is a cascade of two or more base models, where the key question is whether to simply use the current model's prediction or to defer to a downstream model. Learning when to defer requires designing a suitable loss function that uses appropriate signals as supervision for the deferral decision.

Google Research systematically studied existing loss functions and showed that, because they implicitly apply label smoothing, they may be ill-suited for certain training samples. The paper also showed that this can be mitigated by post-hoc training of the deferral rule, which does not require modifying the model internals in any way.

Paper link: https://openreview.net/pdf?id=_jg6Sf6tuF7
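A two-model cascade with a simple confidence-based deferral rule can be sketched as below; the models are hypothetical callables returning class probabilities, and the post-hoc training of the deferral rule described in the paper is not modeled.

```python
import numpy as np

def cascade_predict(x, small_model, large_model, defer_threshold=0.7):
    """Use the cheap model's prediction unless its confidence falls below the
    threshold, in which case defer to the expensive model."""
    p_small = small_model(x)
    if p_small.max() >= defer_threshold:
        return p_small.argmax(), "small"
    return large_model(x).argmax(), "large"

small = lambda x: np.array([0.6, 0.3, 0.1])   # hypothetical cheap model
large = lambda x: np.array([0.2, 0.7, 0.1])   # hypothetical expensive model
print(cascade_predict(np.zeros(5), small, large))
```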

For retrieval applications, standard semantic search techniques use a fixed representation for each embedding generated by a large model; that is, the size and capacity of the representation are mostly fixed, regardless of the downstream task and its associated computational environment or constraints.

Matryoshka representation learning (MRL) introduces the flexibility to adapt the representation to the deployment environment by forcing the coordinates of the representation to have a natural ordering: in resource-constrained settings only the first few coordinates are used, while in richer, precision-critical settings more coordinates can be used.

Paper link: https://openreview.net/pdf?id=9njZa1fm35

When combined with standard approximate nearest-neighbor search techniques such as ScaNN, MRL can provide up to 16x lower compute at the same recall and accuracy.
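At inference time, using a Matryoshka-style embedding amounts to truncating it; the sketch below shows prefix-truncated nearest-neighbor search on toy data (the training objective that makes the prefixes meaningful, and the integration with ScaNN, are described in the papers).

```python
import numpy as np

def truncate_and_search(query_emb, doc_embs, dims):
    """Keep only the first `dims` coordinates of the embeddings (the nesting makes
    this prefix a usable lower-cost representation) and rank documents by
    cosine similarity on the truncated vectors."""
    q = query_emb[:dims]
    d = doc_embs[:, :dims]
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)

docs = np.random.randn(1000, 768)
query = np.random.randn(768)
cheap_ranking = truncate_and_search(query, docs, dims=64)    # resource-constrained setting
full_ranking = truncate_and_search(query, docs, dims=768)    # precision-critical setting
```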

Summary

Large machine learning models have demonstrated transformative results across many domains, but efficiency in training and inference is becoming a critical requirement for making these models practical in the real world.

Google Research has invested heavily in making large-scale machine learning models efficient by developing new foundational techniques, and this will require sustained effort. Going forward, Google Research will continue to explore core challenges to make machine learning models more robust and efficient.
