Is tokenization really necessary? Andrej Karpathy: It's time to throw away this historical baggage

王林
2023-05-20 12:52:06

The rise of conversational AI such as ChatGPT has made a certain pattern feel routine: you enter a piece of text, code, or an image, and the chatbot returns the answer you want. Behind this simple interaction, however, the AI model performs very complex data processing and computation, and tokenization is one common step in that processing.

In natural language processing, tokenization means splitting text input into smaller units called "tokens". These tokens can be words, subwords, or characters, depending on the tokenization strategy and the requirements of the task. For example, tokenizing the sentence "I like eating apples" at the word level gives the token sequence ["I", "like", "eating", "apples"]. Some people translate tokenization as "word segmentation" (分词), but others find this translation misleading, because a token is not necessarily a "word" in the everyday sense.
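To make the difference between these granularities concrete, here is a minimal sketch in Python. The helper names `word_tokens`, `char_tokens`, and `byte_tokens` are made up for this illustration and do not come from any tokenizer library; real subword tokenizers learn their vocabulary from data.

```python
import re


def word_tokens(text: str) -> list[str]:
    """Word-level split: runs of word characters, punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text: str) -> list[str]:
    """Character-level split: one token per Unicode character."""
    return list(text)


def byte_tokens(text: str) -> list[int]:
    """Byte-level split: one token per UTF-8 byte, so the vocabulary has 256 entries."""
    return list(text.encode("utf-8"))


sentence = "I like eating apples"
print(word_tokens(sentence))       # 4 word tokens
print(len(char_tokens(sentence)))  # 20 character tokens
print(byte_tokens("苹果"))          # 2 characters become 6 byte tokens
```

The byte-level view is the one that matters for the rest of this article: it needs no learned vocabulary, but it makes sequences much longer.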

Image source: https://towardsdatascience.com/dynamic-word-tokenization-with-regex-tokenizer-801ae839d1cd

The purpose of tokenization is to convert input data into a form that a computer can process and to provide a structured representation for subsequent model training and analysis. This approach has been convenient for deep learning research, but it also causes a lot of trouble. Andrej Karpathy, who recently returned to OpenAI, pointed out several of these problems.

First, Karpathy argues that tokenization introduces complexity: with tokenization, the language model is no longer a fully end-to-end model. It needs a separate tokenization stage that has its own training and inference procedures and requires additional libraries. This also makes it harder to bring in data from other modalities.

In addition, tokenization makes the model error-prone in certain scenarios. For example, when you use the text completion API, the results can be very different if your prompt ends with a space.

Image source: https://blog.scottlogic.com/2021/08/31/a-primer-on-the-openai-api-1.html
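One common explanation for the trailing-space problem is that subword vocabularies usually attach the space to the start of the following word. The sketch below illustrates this with a toy greedy longest-match tokenizer. The vocabulary `VOCAB` and the function `greedy_tokenize` are hypothetical and far simpler than a real BPE tokenizer, so treat this as an illustration of the effect rather than a description of how any particular API tokenizes text.

```python
# Hypothetical toy vocabulary; real subword vocabularies are learned from data.
VOCAB = ["Hello", " world", "world", " ", "!"]


def greedy_tokenize(text: str, vocab: list[str]) -> list[str]:
    """Greedy longest-match tokenization that falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        candidates = (v for v in vocab if text.startswith(v, i))
        match = max(candidates, key=len, default=text[i])
        tokens.append(match)
        i += len(match)
    return tokens


full = greedy_tokenize("Hello world", VOCAB)
prompt_without_space = greedy_tokenize("Hello", VOCAB)
prompt_with_space = greedy_tokenize("Hello ", VOCAB)

print(full)                  # the space is fused into the token " world"
print(prompt_without_space)  # a prefix of the full tokenization
print(prompt_with_space)     # ends in a lone " " token, not a prefix of the full tokenization
```

In this toy setting, the prompt that ends with a space produces a token sequence that never appears as a prefix of the tokenized full sentence, which is why a model trained on such sequences may continue it in unexpected ways.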

As another example, because of tokenization, even the powerful ChatGPT cannot reliably write words in reverse (this was tested on GPT-3.5).

There may be many such examples. Karpathy believes that to solve these problems, we must first abandon tokenization.

A new paper published by Meta AI explores this question. Specifically, they proposed a multi-scale decoder architecture called "MEGABYTE" that can perform end-to-end differentiable modeling of sequences exceeding one million bytes.

Paper link: https://arxiv.org/pdf/2305.07185.pdf

Importantly, this paper demonstrates that abandoning tokenization is feasible, and Karpathy called it "promising".

The following are the details of the paper.

Paper Overview

As discussed in articles on machine learning, the reason machine learning can solve many complex problems is that it turns those problems into mathematical problems.

NLP follows the same idea. Text is "unstructured data", so it must first be converted into "structured data". Structured data can then be turned into a mathematical problem, and tokenization is the first step of that conversion.

Because both self-attention and large feed-forward networks are expensive, large transformer decoders (LLMs) typically use a context of only a few thousand tokens. This severely limits the set of tasks to which LLMs can be applied.

Based on this, researchers from Meta AI proposed a new method for modeling long byte sequences: MEGABYTE. This method divides the byte sequence into fixed-size patches, which play a role similar to tokens.

The MEGABYTE model consists of three parts:

  1. Patch embedder: simply encodes a patch by losslessly concatenating the embeddings of each of its bytes;
  2. Global module: a large autoregressive transformer that takes patch representations as input and produces patch representations as output;
  3. Local module: a small autoregressive model that predicts the bytes within a patch.

Crucially, the study found that for many tasks most bytes are relatively easy to predict (for example, completing a word given its first few characters). This means there is no need to use a large neural network for every byte; a much smaller model can be used for modeling within a patch.

The MEGABYTE architecture makes three major improvements over the Transformer for long sequence modeling:

1. Sub-quadratic self-attention. Most work on long sequence models focuses on reducing the quadratic cost of self-attention. By decomposing a long sequence into two shorter sequences and choosing the optimal patch size, MEGABYTE reduces the cost of self-attention to $O(N^{4/3})$, which keeps even long sequences tractable.

2. Per-patch feed-forward layers. In very large models such as GPT-3, more than 98% of FLOPS are used to compute position-wise feed-forward layers. MEGABYTE enables larger, more expressive models at the same cost by using large feed-forward layers per patch (instead of per position). With patch size P, the baseline transformer applies the same feed-forward layer with m parameters P times, while MEGABYTE applies a layer with mP parameters once at the same cost.

3. Parallel decoding. A transformer must perform all computation serially during generation, because the input at each time step is the output of the previous time step. By generating patch representations in parallel, MEGABYTE achieves greater parallelism during generation. For example, a MEGABYTE model with 1.5B parameters generates sequences 40% faster than a standard 350M-parameter transformer, while also improving perplexity when trained with the same compute.
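The first two points can be checked with some back-of-the-envelope arithmetic. The sketch below is only a rough cost model under stated assumptions: attention cost is counted as the number of query-key pairs, constants and model dimensions are ignored, and a feed-forward layer is assumed to cost about 2 FLOPs per parameter per application. The function names are made up for this illustration.

```python
def transformer_attention_cost(n: int) -> int:
    """Full self-attention over n positions: about n^2 query-key pairs."""
    return n * n


def megabyte_attention_cost(n: int, p: int) -> int:
    """Global attention over n/p patches plus local attention inside each patch."""
    k = n // p                 # number of patches
    global_cost = k * k        # (n/p)^2
    local_cost = k * p * p     # n/p patches, p^2 pairs each, so n*p in total
    return global_cost + local_cost


def ffn_flops(params: int, applications: int) -> int:
    """Assumed cost: roughly 2 FLOPs per parameter each time the layer is applied."""
    return 2 * params * applications


n = 1_000_000
p = round(n ** (1 / 3))        # patch size near N^(1/3) balances the two terms
print(p, transformer_attention_cost(n), megabyte_attention_cost(n, p))

m, patch = 1_000_000, 8
per_position = ffn_flops(m, patch)       # layer with m parameters applied P times
per_patch = ffn_flops(m * patch, 1)      # layer with m*P parameters applied once
print(per_position == per_patch)
```

With a patch size close to $N^{1/3}$, both the global and the local term are about $N^{4/3}$, which is where the sub-quadratic cost comes from. The feed-forward comparison shows why a per-patch layer can hold P times more parameters for the same FLOPs.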

Overall, MEGABYTE makes it possible to train larger, better-performing models with the same compute budget, to handle very long sequences, and to generate faster at deployment time.

MEGABYTE also contrasts with existing autoregressive models, which typically use some form of tokenization in which sequences of bytes are mapped to larger discrete tokens (Sennrich et al., 2015; Ramesh et al., 2021; Hsu et al., 2021). Tokenization complicates preprocessing, multimodal modeling, and transfer to new domains, while hiding useful structure from the model. This means that most SOTA models are not truly end-to-end models. The most widely used tokenization methods either rely on language-specific heuristics (Radford et al., 2019) or lose information (Ramesh et al., 2021). Replacing tokenization with an efficient and performant byte model would therefore have many advantages.

The study conducted experiments on MEGABYTE and some powerful baseline models. Experimental results show that MEGABYTE performs comparably to subword models on long-context language modeling, achieves state-of-the-art density estimation perplexity on ImageNet, and allows audio modeling from raw audio files. These experimental results demonstrate the feasibility of large-scale tokenization-free autoregressive sequence modeling.

Main components of MEGABYTE

Patch embedder

A patch embedder with patch size $P$ maps a byte sequence $x_{0..T}$ to a sequence of patch embeddings of length $K = T/P$ and dimension $P \cdot D_G$.

First, each byte is embedded with a lookup table $E^{\text{global-embed}} \in \mathbb{R}^{V \times D_G}$, where $V$ is the vocabulary size, into an embedding of size $D_G$, and positional embeddings are added:

$$h^{\text{embed}}_t = E^{\text{global-embed}}_{x_t} + E^{\text{pos}}_t, \quad t \in [0..T)$$

Then, the byte embeddings are reshaped into a sequence of $K$ patch embeddings of dimension $P \cdot D_G$. To allow autoregressive modeling, the patch sequence is padded at the start with a trainable, patch-sized padding embedding ($E^{\text{global-pad}} \in \mathbb{R}^{P \cdot D_G}$), and the last patch is removed from the input. This sequence is the input to the global module, denoted $h^{\text{global-in}} \in \mathbb{R}^{K \times (P \cdot D_G)}$:

$$h^{\text{global-in}}_k = \begin{cases} E^{\text{global-pad}} & \text{if } k = 0, \\ h^{\text{embed}}_{((k-1) \cdot P):(k \cdot P)} & \text{if } k \in [1..K) \end{cases}$$
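The patch embedder steps described in this section can be sketched in plain Python as follows. This is a toy illustration, not the authors' implementation: the tables are filled with random numbers instead of trained weights, the padding patch is all zeros instead of a trainable vector, and the names `make_table`, `patch_embed`, `E_embed`, `E_pos`, and `E_pad` are chosen for this example only.

```python
import random


def make_table(rows: int, cols: int, rng: random.Random) -> list[list[float]]:
    """Random stand-in for a trained embedding table."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def patch_embed(byte_seq, P, E_embed, E_pos, E_pad):
    """Return K patch vectors of size P * D_G, shifted right by one patch."""
    T = len(byte_seq)
    assert T % P == 0, "sequence length must be a multiple of the patch size"
    # Per-byte embedding plus positional embedding.
    h = [[e + q for e, q in zip(E_embed[x], E_pos[t])] for t, x in enumerate(byte_seq)]
    # Concatenate the P byte embeddings of each patch into one long vector.
    patches = [sum(h[k * P:(k + 1) * P], []) for k in range(T // P)]
    # Prepend the padding patch and drop the last patch (autoregressive shift).
    return [E_pad] + patches[:-1]


rng = random.Random(0)
V, D_G, P = 256, 4, 4
data = list(b"MEGABYTE")                 # T = 8 bytes, so K = 2 patches
E_embed = make_table(V, D_G, rng)
E_pos = make_table(len(data), D_G, rng)
E_pad = [0.0] * (P * D_G)                # assumption: zeros instead of a trained pad
h_global_in = patch_embed(data, P, E_embed, E_pos, E_pad)
print(len(h_global_in), len(h_global_in[0]))
```

Because of the shift, the representation at patch index k only contains bytes from patches before k, which is what lets the global module predict patch k without seeing it.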

Global module

The global module is a decoder-only transformer of dimension $P \cdot D_G$ that operates on a sequence of $K$ patches. It combines self-attention with a causal mask to capture dependencies between patches. The global module takes the $K$ patch representations $h^{\text{global-in}}_{0:K}$ as input and outputs updated representations $h^{\text{global-out}}_{0:K}$ by performing self-attention over previous patches:

$$h^{\text{global-out}}_{0:K} = \text{transformer}^{\text{global}}(h^{\text{global-in}}_{0:K})$$

The output of the final global layer, $h^{\text{global-out}}_{0:K}$, contains $K$ patch representations of dimension $P \cdot D_G$. Each of them is reshaped into a sequence of length $P$ and dimension $D_G$, where position $p$ uses dimensions $p \cdot D_G$ to $(p+1) \cdot D_G$. Each position is then projected to the local module dimension with a matrix $w^{GL} \in \mathbb{R}^{D_G \times D_L}$, where $D_L$ is the local module dimension. These are then combined with byte embeddings of size $D_L$ for the tokens of the next patch, $E^{\text{local-embed}}_{x_{(k \cdot P + p - 1)}}$.

The local byte embeddings are offset by one with a trainable local padding embedding ($E^{\text{local-pad}} \in \mathbb{R}^{D_L}$) to allow autoregressive modeling within a patch. The result is a tensor $h^{\text{local-in}} \in \mathbb{R}^{K \times P \times D_L}$:

$$h^{\text{local-in}}_{k,p} = w^{GL} h^{\text{global-out}}_{k,(p \cdot D_G):((p+1) \cdot D_G)} + E^{\text{local-embed}}_{x_{(k \cdot P + p - 1)}}$$
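The bookkeeping in this step is easy to get wrong, so here is a small sketch that only tracks indices: for each position p in patch k it lists which slice of the global output is used and which previous byte's embedding is added. The function `local_input_plan` is hypothetical. It also assumes that the local padding embedding is used at the first position of every patch, which is one reading of "within a patch"; the text above does not settle this detail.

```python
def local_input_plan(K: int, P: int, D_G: int):
    """For each (k, p), return the global-output slice and the previous byte index.

    A previous byte index of None stands for the local padding embedding.
    Assumption: padding is used at p == 0 of every patch.
    """
    plan = []
    for k in range(K):
        for p in range(P):
            global_slice = (p * D_G, (p + 1) * D_G)   # dimensions p*D_G .. (p+1)*D_G
            prev_byte = k * P + p - 1 if p > 0 else None
            plan.append((k, p, global_slice, prev_byte))
    return plan


for entry in local_input_plan(K=2, P=4, D_G=3):
    print(entry)
```

Each printed tuple corresponds to one row of the local module's input: a D_G-sized slice of the patch representation, projected to D_L, plus the embedding of the byte just before the one being predicted.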

Local module

The local module is a smaller decoder-only transformer of dimension $D_L$ that operates on a single patch $k$ containing $P$ elements, each of which is the sum of a global module output and the embedding of the previous byte in the sequence. $K$ copies of the local module are run on each patch independently, and in parallel during training, computing a representation $h^{\text{local-out}} \in \mathbb{R}^{K \times P \times D_L}$:

$$h^{\text{local-out}}_{k,0:P} = \text{transformer}^{\text{local}}(h^{\text{local-in}}_{k,0:P})$$

Finally, the probability distribution over the vocabulary can be computed at each position. The $p$-th element of the $k$-th patch corresponds to element $t$ of the complete sequence, where $t = k \cdot P + p$:

$$p(x_t \mid x_{0:t}) = \text{softmax}(E^{\text{local-embed}} h^{\text{local-out}}_{k,p})_{x_t}$$
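The mapping between a flat byte position t and its patch coordinates (k, p) is simple integer arithmetic. A minimal sketch, with helper names invented for this example:

```python
def to_patch_index(t: int, P: int) -> tuple[int, int]:
    """Flat position t -> (patch index k, offset p within the patch)."""
    return divmod(t, P)


def to_flat_index(k: int, p: int, P: int) -> int:
    """(k, p) -> flat position, t = k * P + p."""
    return k * P + p


P = 4
for t in range(8):
    k, p = to_patch_index(t, P)
    print(t, (k, p), to_flat_index(k, p, P))
```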

Efficiency analysis

Training efficiency

The researchers analyzed the cost of different architectures when scaling sequence length and model size. As shown in Figure 3, the MEGABYTE architecture uses fewer FLOPS than comparably sized transformers and linear transformers across a variety of model sizes and sequence lengths, which allows larger models to be used at the same computational cost.

[Figure 3]

Generation efficiency

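Consider a MEGABYTE model with $L_{\text{global}}$ layers in the global module, $L_{\text{local}}$ layers in the local module, and patch size $P$, and compare it with a transformer architecture with $L_{\text{local}} + L_{\text{global}}$ layers. The transformer has to pass each of the $P$ bytes through all of its layers in turn, which takes $O(P \cdot L_{\text{local}} + P \cdot L_{\text{global}})$ serial operations per patch. Generating each patch with MEGABYTE requires only $O(L_{\text{global}} + P \cdot L_{\text{local}})$ serial operations. When $L_{\text{global}} \gg L_{\text{local}}$ (the global module has many more layers than the local module), MEGABYTE can reduce the inference cost by a factor of nearly $P$.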
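A quick numerical check of this comparison, counting one serial step per layer per pass. The layer counts below are arbitrary example values, not the configurations used in the paper, and the function names are made up for this sketch.

```python
def transformer_serial_steps(l_global: int, l_local: int, p: int) -> int:
    """A plain transformer runs all layers once for each of the p bytes."""
    return p * (l_global + l_local)


def megabyte_serial_steps(l_global: int, l_local: int, p: int) -> int:
    """MEGABYTE runs the global layers once per patch, the local layers once per byte."""
    return l_global + p * l_local


l_global, l_local, p = 32, 2, 8          # example values only
baseline = transformer_serial_steps(l_global, l_local, p)
megabyte = megabyte_serial_steps(l_global, l_local, p)
print(baseline, megabyte, round(baseline / megabyte, 2))
```

As the global module grows relative to the local module, the ratio approaches the patch size P; with the small example above it is already well above 5 for P = 8.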

Experimental results

Language modeling

The researchers evaluated MEGABYTE's language modeling capabilities on five different datasets that emphasize long-range dependencies: Project Gutenberg (PG-19), Books, Stories, arXiv, and Code. As shown in Table 7, MEGABYTE consistently outperforms the baseline transformer and PerceiverAR on all datasets.

[Table 7]

The researchers also scaled up the training data on PG-19. As shown in Table 8, MEGABYTE significantly outperforms other byte models and is comparable to SOTA models trained on subwords.

[Table 8]

Image Modeling

The researchers trained a large MEGABYTE model on the ImageNet 64x64 dataset, with 2.7B parameters in the global module and 350M parameters in the local module, on 1.4T tokens. They estimate that training the model took less than half the GPU hours required to reproduce the best PerceiverAR model in Hawthorne et al., 2022. As shown in Table 8 above, MEGABYTE performs comparably to PerceiverAR's SOTA result while using only half of the latter's compute.

The researchers compared three transformer variants (vanilla, PerceiverAR, and MEGABYTE) to test how well they scale to long sequences at increasingly large image resolutions. The results are shown in Table 5. In this compute-controlled setting, MEGABYTE outperforms the baseline models at all resolutions.

[Table 5]

Table 14 summarizes the precise settings used by each baseline model, including context length and number of latents.

[Table 14]

Audio modeling

Audio combines the sequential structure of text with the continuous nature of images, which makes it an interesting application for MEGABYTE. The MEGABYTE model obtained 3.477 bpb (bits per byte), significantly lower than PerceiverAR (3.543) and the vanilla transformer (3.567). Additional ablation results are detailed in Table 10.

[Table 10]

For more technical details and experimental results, please refer to the original paper.

Statement:
This article is reproduced from 51cto.com. In case of infringement, please contact admin@php.cn to have it removed.