Is tokenization really necessary? Andrej Karpathy: It's time to throw away this historical baggage

The emergence of conversational AI such as ChatGPT has accustomed people to a simple pattern: you input a piece of text, code, or an image, and the chatbot gives you the answer you want. But behind this simple interaction, the AI model needs to perform very complex data processing and computation, and tokenization is one of the most common steps.

In the field of natural language processing, tokenization refers to dividing text input into smaller units called "tokens". These tokens can be words, subwords, or characters, depending on the tokenization strategy and the task at hand. For example, tokenizing the sentence "I like eating apples" might produce a token sequence like ["I", "like", "eating", "apples"]. In Chinese, tokenization is often translated as "word segmentation" (分词), but some consider this translation misleading, since a token is not necessarily a "word" in the everyday sense.

Source: https://towardsdatascience.com/dynamic-word-tokenization-with-regex-tokenizer-801ae839d1cd
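To make the distinction concrete, here is a small illustrative sketch. The word-level split is plain Python; the subword vocabulary and the WordPiece-style "##" pieces are invented for the example and do not come from any real tokenizer:

```python
import re

sentence = "I like eating apples"

# Word-level tokenization: split on word characters.
word_tokens = re.findall(r"\w+", sentence)
print(word_tokens)                # ['I', 'like', 'eating', 'apples']

# Subword-style tokenization: a toy vocabulary in which words are broken
# into smaller pieces (these merges are invented for illustration).
toy_vocab = {"I", "like", "eat", "##ing", "apple", "##s"}

def toy_subword(word, vocab):
    """Greedy longest-prefix-first split, as WordPiece-style tokenizers do."""
    pieces, rest = [], word
    while rest:
        for end in range(len(rest), 0, -1):
            piece = rest[:end] if not pieces else "##" + rest[:end]
            if piece in vocab:
                pieces.append(piece)
                rest = rest[end:]
                break
        else:
            return ["[UNK]"]      # no vocabulary piece matched
    return pieces

print([p for w in word_tokens for p in toy_subword(w, toy_vocab)])
# ['I', 'like', 'eat', '##ing', 'apple', '##s']
```

Subword tokenizers such as BPE or WordPiece learn merges like these from data, which is why tokens frequently fail to line up with everyday "words".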

The purpose of tokenization is to convert input data into a form the computer can process and to provide a structured representation for subsequent model training and analysis. This approach brings convenience to deep learning research, but it also brings plenty of trouble. Andrej Karpathy, who recently returned to OpenAI, has pointed out several of these problems.

First, Karpathy argues that tokenization introduces complexity: with tokenization, the language model is no longer a fully end-to-end model. Tokenization is a separate stage with its own training and inference procedures and its own libraries, and it increases the complexity of bringing in data from other modalities.


In addition, tokenization makes the model error-prone in certain scenarios, such as text completion. With the completion API, if your prompt ends with a trailing space, the results you get may be very different.

Image source: https://blog.scottlogic.com/2021/08/31/a-primer-on-the-openai-api-1.html

As another example, because of tokenization, the otherwise powerful ChatGPT cannot actually write a word in reverse (tests were run against GPT-3.5).


There may be many such examples. Karpathy believes that to solve these problems, we must first abandon tokenization.

A new paper published by Meta AI explores this question. Specifically, they proposed a multi-scale decoder architecture called "MEGABYTE" that can perform end-to-end differentiable modeling of sequences exceeding one million bytes.


Paper link: https://arxiv.org/pdf/2305.07185.pdf

Importantly, this paper demonstrates the feasibility of abandoning tokenization, and Karpathy called the work "promising".

The following are the details of the paper.

Paper Overview

As a previous article on machine learning noted, machine learning seems able to solve many complex problems because it transforms those problems into mathematical ones.


NLP follows the same idea: text is "unstructured data" that must first be converted into "structured data"; structured data can then be turned into a mathematical problem, and tokenization is the first step of that transformation.

Due to the high cost of both self-attention mechanisms and large feedforward networks, large transformer decoders (LLMs) typically use only a few thousand tokens of context. This severely limits the set of tasks to which LLMs can be applied.

Based on this, researchers from Meta AI proposed a new method for modeling long byte sequences: MEGABYTE. This method divides a byte sequence into fixed-size patches, which play a role similar to tokens.

The MEGABYTE model consists of three parts:

  1. A patch embedder, which simply encodes a patch by losslessly concatenating the embeddings of each byte;
  2. A global module, a large autoregressive transformer whose inputs and outputs are patch representations;
  3. A local module, a small autoregressive model that predicts the bytes within a patch.

Crucially, the study found that for many tasks most bytes are relatively easy to predict (for example, completing a word given its first few characters), which means it is not necessary to use a large neural network for every byte; instead, a much smaller model can handle intra-patch modeling.
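To make "fixed-size patches" concrete, here is a tiny sketch; the patch size of 4 is an arbitrary choice for the example:

```python
text = "Hello, MEGABYTE!"
data = text.encode("utf-8")        # raw bytes, no tokenizer involved
P = 4                              # patch size (arbitrary for this example)
# Pad to a multiple of P, then cut into fixed-size patches.
data += bytes(-len(data) % P)
patches = [list(data[i:i + P]) for i in range(0, len(data), P)]
print(patches)
# [[72, 101, 108, 108], [111, 44, 32, 77], [69, 71, 65, 66], [89, 84, 69, 33]]
```

Each patch of raw byte values plays the role a token would play in a conventional model, but no vocabulary or segmentation heuristic is involved.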


The MEGABYTE architecture has made three major improvements to the Transformer for long sequence modeling:

1. Sub-quadratic self-attention. Most work on long-sequence models focuses on reducing the quadratic cost of self-attention. By decomposing a long sequence into two shorter sequences and choosing an optimal patch size, MEGABYTE reduces the cost of the self-attention mechanism to O(N^(4/3)), which remains tractable even for long sequences.
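As a rough sketch of where that bound comes from (constant factors and the model dimension are omitted; this is a back-of-envelope reading of the decomposition, not a derivation copied from the paper):

```latex
% With patch size P, global attention runs over N/P patches and local
% attention runs within each of the N/P patches of length P:
\mathrm{cost}(P) \;\sim\; \Big(\frac{N}{P}\Big)^{2} \;+\; \frac{N}{P}\,P^{2}
  \;=\; \frac{N^{2}}{P^{2}} + N P
% Minimizing over P gives P \propto N^{1/3}; substituting back:
\mathrm{cost}\big(N^{1/3}\big) \;\sim\; \frac{N^{2}}{N^{2/3}} + N \cdot N^{1/3}
  \;=\; O\big(N^{4/3}\big)
```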

2. Per-patch feedforward layers. In very large models such as GPT-3, more than 98% of FLOPS are used to compute position-wise feedforward layers. MEGABYTE enables larger, more expressive models at the same cost by using large feedforward layers per patch (rather than per position). With patch size P, the baseline transformer uses the same feedforward layer with m parameters P times, while MEGABYTE can use a layer with m·P parameters once at the same cost.
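A quick worked check of that trade-off, using P = 8 purely as an example:

```latex
% Baseline: a position-wise feedforward layer with m parameters is applied
% at each of the P positions in a patch:
\mathrm{FLOPs}_{\text{baseline}} \approx P \cdot m
% MEGABYTE: one feedforward layer with m P parameters, applied once per patch:
\mathrm{FLOPs}_{\text{MEGABYTE}} \approx 1 \cdot (m P) = P \cdot m
% Equal compute, but with P = 8 the per-patch layer holds 8x the parameters.
```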

3. Parallel decoding. A transformer must perform all computations serially during generation because the input at each time step is the output of the previous one. By generating patch representations in parallel, MEGABYTE achieves greater parallelism during generation. For example, a MEGABYTE model with 1.5B parameters generates sequences 40% faster than a standard 350M-parameter transformer, while also improving perplexity when trained with the same compute.

Overall, MEGABYTE allows us to train larger, better-performing models within the same compute budget, scales to very long sequences, and improves generation speed during deployment.

MEGABYTE also contrasts with existing autoregressive models, which typically use some form of tokenization where sequences of bytes are mapped into larger discrete tokens (Sennrich et al., 2015; Ramesh et al., 2021; Hsu et al., 2021). Tokenization complicates preprocessing, multimodal modeling, and transfer to new domains, while hiding useful structure in the model. This means that most SOTA models are not truly end-to-end models. The most widely used tokenization methods require the use of language-specific heuristics (Radford et al., 2019) or loss of information (Ramesh et al., 2021). Therefore, replacing tokenization with an efficient and performant byte model will have many advantages.

The study conducted experiments on MEGABYTE and some powerful baseline models. Experimental results show that MEGABYTE performs comparably to subword models on long-context language modeling, achieves state-of-the-art density estimation perplexity on ImageNet, and allows audio modeling from raw audio files. These experimental results demonstrate the feasibility of large-scale tokenization-free autoregressive sequence modeling.

Main components of MEGABYTE


Patch embedder

A patch embedder with patch size P maps the byte sequence x_{0:T} into a sequence of K = T/P patch embeddings, each of dimension P·D_G.

First, each byte is embedded with a lookup table E^{global-embed} ∈ R^{V×D_G}, forming an embedding of size D_G, and positional embeddings are added:

h^{embed}_t = E^{global-embed}_{x_t} + E^{pos}_t,  t ∈ [0, T)

Then, the byte embeddings are reshaped into a sequence of K patch embeddings of dimension P·D_G. To allow autoregressive modeling, the patch sequence is padded with a trainable patch-sized padding embedding E^{global-pad} ∈ R^{P×D_G}, and the last patch is removed from the input. This sequence, the input to the global model, is denoted h^{global-in} ∈ R^{K×(P·D_G)}:

h^{global-in}_k = E^{global-pad} if k = 0, else h^{embed}_{((k−1)·P):(k·P)}
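A minimal PyTorch-style sketch of the patch embedder described above; the class, parameter names, and default sizes are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class PatchEmbedder(nn.Module):
    """Maps a byte sequence of length T to K = T // P patch embeddings of
    dimension P * D_G, per the description above (an illustrative sketch)."""

    def __init__(self, vocab_size=256, d_global=512, patch_size=8, max_len=8192):
        super().__init__()
        self.P, self.D_G = patch_size, d_global
        self.byte_embed = nn.Embedding(vocab_size, d_global)  # E^{global-embed}
        self.pos_embed = nn.Embedding(max_len, d_global)      # E^{pos}
        # Trainable patch-sized padding embedding E^{global-pad}.
        self.global_pad = nn.Parameter(torch.zeros(1, 1, patch_size * d_global))

    def forward(self, bytes_in):              # bytes_in: (B, T) with T % P == 0
        B, T = bytes_in.shape
        pos = torch.arange(T, device=bytes_in.device)
        h = self.byte_embed(bytes_in) + self.pos_embed(pos)   # (B, T, D_G)
        h = h.reshape(B, T // self.P, self.P * self.D_G)      # (B, K, P*D_G)
        # Shift right by one patch: prepend the pad, drop the last patch,
        # so patch k is predicted only from patches 0..k-1.
        pad = self.global_pad.expand(B, 1, -1)
        return torch.cat([pad, h[:, :-1]], dim=1)             # (B, K, P*D_G)
```

The right shift by one patch is what lets the global model predict patch k from patches 0..k−1.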

Global module

The global module is a decoder-only transformer with model dimension P·D_G that operates on the sequence of K patches. It combines self-attention with a causal mask to capture dependencies between patches. The global module takes the K patch representations h^{global-in} as input and, by performing self-attention over previous patches, outputs updated representations:

h^{global-out}_{0:K} = transformer^{global}(h^{global-in}_{0:K})

The final global module output contains K patch representations of dimension P·D_G. Each of these is reshaped into a sequence of length P and dimension D_G, where position p uses dimensions p·D_G to (p+1)·D_G. Each position is then mapped to the local module dimension with a matrix w^{GL} ∈ R^{D_G×D_L}, where D_L is the local module dimension. These are then combined with byte embeddings of size D_L for the tokens of the next patch. The local byte embeddings are offset by one with a trainable local padding embedding (E^{local-pad} ∈ R^{D_L}), allowing autoregressive modeling within a patch. The result is a tensor h^{local-in} ∈ R^{K×P×D_L}:

h^{local-in}_{k,p} = w^{GL}·h^{global-out}_{k,(p·D_G):((p+1)·D_G)} + E^{local-embed}_{x_{(k·P+p−1)}}

Local module

The local module is a smaller decoder-only transformer of dimension D_L that operates on a single patch k containing P elements, each of which is the sum of a global module output and the embedding of the previous byte in the sequence. K copies of the local module run independently on each patch, and in parallel during training, computing the representations:

h^{local-out}_{k,0:P} = transformer^{local}(h^{local-in}_{k,0:P})

Finally, the probability distribution over the vocabulary can be computed at each position. The p-th element of the k-th patch corresponds to element t of the complete sequence, where t = k·P + p:

p(x_t | x_{0:t}) = softmax(E^{local-embed}·h^{local-out}_{k,p})_{x_t}
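Putting the three components together, here is a hedged end-to-end sketch. It reuses the PatchEmbedder from the earlier snippet and uses nn.TransformerEncoder with a causal mask as a stand-in for a decoder-only stack; all dimensions and layer counts are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

def causal_decoder(d_model, n_layers, n_heads=8):
    """Stand-in for a decoder-only transformer: an encoder stack that is
    always called with a causal mask (illustrative, not the paper's code)."""
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, n_layers)

class MegabyteSketch(nn.Module):
    def __init__(self, V=256, D_G=512, D_L=128, P=8, L_global=6, L_local=2):
        super().__init__()
        self.P, self.D_G, self.D_L = P, D_G, D_L
        self.patch_embed = PatchEmbedder(V, D_G, P)            # from the sketch above
        self.global_model = causal_decoder(P * D_G, L_global)  # attends over K patches
        self.local_model = causal_decoder(D_L, L_local)        # attends within a patch
        self.proj = nn.Linear(D_G, D_L)                        # w^{GL}
        self.local_embed = nn.Embedding(V, D_L)                # E^{local-embed}
        self.local_pad = nn.Parameter(torch.zeros(1, 1, D_L))  # E^{local-pad}
        self.head = nn.Linear(D_L, V)                          # per-byte logits

    def forward(self, bytes_in):                               # bytes_in: (B, T)
        B, T = bytes_in.shape
        K, P = T // self.P, self.P
        dev = bytes_in.device
        # Global module: causal self-attention over patch representations.
        g_mask = nn.Transformer.generate_square_subsequent_mask(K).to(dev)
        g_out = self.global_model(self.patch_embed(bytes_in), mask=g_mask)
        g_out = self.proj(g_out.reshape(B, K, P, self.D_G))    # (B, K, P, D_L)
        # Local inputs: global output plus the embedding of the previous byte
        # (shifted right by one, with E^{local-pad} at position 0).
        e = self.local_embed(bytes_in)                         # (B, T, D_L)
        e = torch.cat([self.local_pad.expand(B, 1, -1), e[:, :-1]], dim=1)
        h = (g_out + e.reshape(B, K, P, self.D_L)).reshape(B * K, P, self.D_L)
        # Local module: runs on all K patches in parallel during training.
        l_mask = nn.Transformer.generate_square_subsequent_mask(P).to(dev)
        l_out = self.local_model(h, mask=l_mask)
        return self.head(l_out).reshape(B, T, -1)              # (B, T, V) logits

# Usage: byte-level logits for a batch of two 64-byte sequences.
model = MegabyteSketch()
logits = model(torch.randint(0, 256, (2, 64)))                 # (2, 64, 256)
```

Note the shapes: for T = 4096 bytes and P = 8, the global model attends over only K = 512 patches, while each local model attends over just 8 positions.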

Efficiency analysis

Training efficiency

The researchers analyzed the costs of different architectures when scaling sequence length and model size. As shown in Figure 3 below, the MEGABYTE architecture uses fewer FLOPS than comparably sized transformers and linear transformers across a variety of model sizes and sequence lengths, allowing the use of larger models at the same computational cost.


Generation efficiency

Consider a MEGABYTE model with L_global layers in the global module and L_local layers in the local module, with patch size P, compared against a transformer with L_local + L_global layers. Generating each patch with MEGABYTE requires a sequence of O(L_global + P·L_local) serial operations. When L_global ≥ L_local (i.e., the global module has more layers than the local module), MEGABYTE can reduce inference cost by nearly a factor of P.
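Plugging in illustrative numbers makes the claim concrete (these layer counts are hypothetical, chosen only for the example):

```latex
% Serial steps to generate T = 8192 bytes with patch size P = 8, using
% hypothetical layer counts L_global = 12 and L_local = 4:
\text{transformer:}\; T\,(L_{\text{global}} + L_{\text{local}})
  = 8192 \times (12 + 4) = 131072
\text{MEGABYTE:}\; \frac{T}{P}\,L_{\text{global}} + T\,L_{\text{local}}
  = 1024 \times 12 + 8192 \times 4 = 45056
% About a 2.9x reduction, approaching the full P = 8x as L_global dominates.
```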

Experimental results

Language modeling

The researchers evaluated MEGABYTE's language modeling capability on five datasets that emphasize long-range dependencies: Project Gutenberg (PG-19), Books, Stories, arXiv, and Code. As the results in Table 7 below show, MEGABYTE consistently outperforms the baseline transformer and PerceiverAR on all datasets.


The researchers also scaled up the training data on PG-19. The results, shown in Table 8 below, indicate that MEGABYTE significantly outperforms other byte-level models and is competitive with SOTA models trained on subwords.


Image modeling

The researchers trained a large MEGABYTE model on the ImageNet 64×64 dataset, with 2.7B parameters in the global module and 350M in the local module, trained on 1.4T tokens. They estimate that training the model took less than half the GPU hours required to reproduce the best PerceiverAR model from Hawthorne et al. (2022). As shown in Table 8 above, MEGABYTE matches PerceiverAR's SOTA performance while using only half of its compute.

The researchers compared three transformer variants, vanilla, PerceiverAR, and MEGABYTE, to test long-sequence scalability at increasingly large image resolutions. The results, shown in Table 5 below, indicate that under this compute-controlled setting, MEGABYTE outperforms the baseline models at all resolutions.


Table 14 below summarizes the precise settings used by each baseline model, including context length and number of latents.


Audio modeling

Audio combines the sequential structure of text with the continuous nature of images, making it an interesting application for MEGABYTE. The model in this paper achieved a bpb of 3.477, significantly lower than PerceiverAR (3.543) and the vanilla transformer (3.567). Additional ablation results are detailed in Table 10 below.


For more technical details and experimental results, please refer to the original paper.
