Full of useful information! The first text version of Karpathy's two-hour AI course: a new workflow automatically converts videos into articles
Not long ago, the AI course released by AI guru Andrej Karpathy had already racked up 150,000 views across the web.
At the time, some netizens said the value of this 2-hour course was equivalent to 4 years of college.
In the past few days, Karpathy had a new idea:
turn the 2-hour-13-minute video "Let's build the GPT Tokenizer" into a book chapter or blog post on the topic of tokenization.
The specific steps are as follows:
- Add subtitles or narration text to the video.
- Cut the video into paragraphs with matching images and text.
- Use large-language-model prompt engineering to translate the content into prose, section by section.
- Output the result as a web page with links back to the corresponding parts of the original video.
More broadly, such a workflow can be applied to any video input, automatically generating "companion guides" for various tutorials in a format that is easier to read, browse, and search.
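None of this exists yet, but as a rough illustration, the transcription step of such a workflow might look something like the sketch below, assuming the openai-whisper package and a hypothetical local copy of the lecture video named lecture.mp4 (neither is specified in Karpathy's post):

```python
# pip install openai-whisper   (ffmpeg must also be installed)
import whisper

# a small model is enough for a rough draft transcript
model = whisper.load_model("base")

# "lecture.mp4" is a hypothetical local copy of the video; not from the original post
result = model.transcribe("lecture.mp4")

# each segment carries start/end timestamps, which is what would let the
# generated article link back to the corresponding spot in the original video
for seg in result["segments"]:
    print(f'[{seg["start"]:7.1f}s - {seg["end"]:7.1f}s] {seg["text"].strip()}')
```

The later steps (segmenting into sections, rewriting each section with an LLM, rendering a web page) would then build on these timestamped segments.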
This sounds feasible, but also quite challenging.
He wrote an example under his GitHub project minbpe to illustrate what he has in mind.
Address: https://github.com/karpathy/minbpe/blob/master/lecture.md
Karpathy said he did this task manually: he watched the video and translated it into an article in markdown format.
"I only did the first ~4 minutes of the video (i.e. 3% done), and it already took about 30 minutes to write, so it would be really great if something like this could be done automatically."
Next, it’s class time!
Hello everyone, today we will discuss the problem of "tokenization" in LLMs.
Unfortunately, tokenization is a relatively complex and tricky component of the most advanced large models, but it is necessary to understand it in some detail.
Many of the flaws of LLMs get attributed to the neural network or other seemingly mysterious factors, when these flaws can actually be traced back to tokenization.
Character-level tokenization
So, what is tokenization?
In fact, in the earlier video "Let's build GPT from scratch", I already introduced tokenization, but that was only a very simple, character-level version.
If you go to the Google Colab page for that video, you'll see that we start with the training data (Shakespeare), which is just one big string in Python:
First Citizen: Before we proceed any further, hear me speak.

All: Speak, speak.

First Citizen: You are all resolved rather to die than to famish?

All: Resolved. resolved.

First Citizen: First, you know Caius Marcius is chief enemy to the people.

All: We know't, we know't.
But how do we input strings into LLM?
We can see that we first need to build a vocabulary for all possible characters in the entire training set:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

# !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# 65
Then based on the vocabulary above, create a lookup table for converting between single characters and integers. This lookup table is just a Python dictionary:
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])
print(encode("hii there"))
print(decode(encode("hii there")))
# [46, 47, 47, 1, 58, 46, 43, 56, 43]
# hii there
Once we convert a string into a sequence of integers, we see that each integer is used as an index into a 2D embedding table of trainable parameters.
Because our vocabulary size is vocab_size=65, this embedding table also has 65 rows:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

    def forward(self, idx, targets=None):
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
Here, the integer "plucks out" a row of the embedding table, and that row is the vector representing the token. The vector is then fed into the Transformer as the input for the corresponding time step.
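To make the lookup concrete, here is a minimal, self-contained sketch (with illustrative sizes, not the exact values from the lecture) showing how an integer token id selects one row of the embedding table:

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 65, 32              # 65 characters; 32-dim embeddings (illustrative)
token_embedding_table = nn.Embedding(vocab_size, n_embd)

idx = torch.tensor([[46, 47, 47]])       # (B=1, T=3), e.g. the ids for "hii" from the lookup table above
tok_emb = token_embedding_table(idx)     # each integer pulls out one row of the table
print(tok_emb.shape)                     # torch.Size([1, 3, 32]) -> (B, T, C)
```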
Tokenizing into "character chunks" with the BPE algorithm
This is all fine for the naive setting of a character-level language model.
But in practice, state-of-the-art language models use more complicated schemes to build these token vocabularies.
Specifically, these schemes work not at the character level but at the level of "character chunks". These chunk vocabularies are built using algorithms such as Byte Pair Encoding (BPE), which we will describe in detail below.
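As a taste of what follows, here is a tiny sketch of the core BPE idea (an illustration only, not the minbpe implementation): repeatedly find the most frequent adjacent pair of ids and merge it into a new id.

```python
from collections import Counter

def get_pair_counts(ids):
    # count how often each adjacent pair of ids occurs
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` in `ids` with the single id `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"                      # toy example string
ids = list(text.encode("utf-8"))          # byte-level BPE starts from the raw bytes
for step in range(3):                     # perform a few merges
    pair = get_pair_counts(ids).most_common(1)[0][0]
    ids = merge(ids, pair, 256 + step)    # new token ids start after the 256 byte values
    print(f"merged {pair} -> {256 + step}: {ids}")
```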
Let's briefly review the history of this approach. The paper that popularized the byte-level BPE algorithm for language-model tokenization is OpenAI's 2019 GPT-2 paper, Language Models are Unsupervised Multitask Learners.
Paper address: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Scroll down to Section 2.2, “Input Representation”, where they describe and motivate this algorithm. At the end of this section, you'll see them say:
The vocabulary is expanded to 50,257. We also increased the context size from 512 to 1024 tokens and used a larger batch size of 512.
Recall that in the Transformer's attention layers, every token attends to a finite list of previous tokens in the sequence.
The paper points out that the context length of the GPT-2 model increased from 512 tokens in GPT-1 to 1024 tokens.
In other words, the token is the fundamental "atom" of an LLM's input.
"Tokenization" is the process of converting a raw string in Python into a list of tokens, and vice versa.
Another popular example demonstrates how pervasive this abstraction is: if you search for "token" in the Llama 2 paper, you get 63 matches.
For example, the paper claims that they trained on 2 trillion tokens, and so on.
Paper address: https://arxiv.org/pdf/2307.09288.pdf
A brief note on the complexity of tokenization
Before we dive into the implementation details, let's briefly motivate why it is necessary to understand the tokenization process in detail.
Tokenization is at the heart of a great many weird problems in LLMs, and I suggest you do not brush it off.
Many issues that appear to be problems with the neural network architecture actually trace back to tokenization. Here are just a few examples:
- Why can't LLMs spell words? Tokenization.
- Why can't LLMs do super simple string processing tasks, like reversing a string? Tokenization.
- Why are LLMs worse at non-English languages (e.g. Japanese)? Tokenization.
- Why are LLMs bad at simple arithmetic? Tokenization.
- Why did GPT-2 run into more trouble than necessary when coding in Python? Tokenization.
- Why does my LLM abruptly halt when it sees the string "<|endoftext|>"? Tokenization.
- What is this strange warning I get about "trailing whitespace"? Tokenization.
- Why does the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.
- Why should I prefer YAML over JSON when working with LLMs? Tokenization.
- Why is an LLM not actually end-to-end language modeling? Tokenization.
We will return to these questions at the end of the video.
A visual preview of tokenization
Next, let's load up this tokenization web app.
Address: https://tiktokenizer.vercel.app/
The nice thing about this web app is that tokenization runs live in your web browser, so you can easily type a text string into the input box and see the tokenization result on the right.
At the top you can see that we are currently using the gpt2 tokenizer, and that the string pasted in this example is currently being tokenized into 300 tokens.
Here, they are clearly shown with color:
For example, the string "Tokenization" is encoded into token 30642, followed by token 1634.
The token " is" (note that this is three characters, including the preceding space, which is important!) is 318.
Pay attention to the space, because it really is present in the string and must be tokenized along with all the other characters; it is only omitted in the visualization for clarity.
You can toggle its visualization on and off at the bottom of the app. Likewise, the token " at" is 379, " the" is 262, and so on.
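If you prefer to reproduce these numbers in code rather than in the web app, the tiktoken library (an assumption here; it is not mentioned in this part of the lecture) exposes the same GPT-2 encoding. A minimal sketch:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")         # the GPT-2 BPE tokenizer used by the web app above

ids = enc.encode("Tokenization is at the")
print(ids)                                  # expected to match the web app: 30642, 1634, 318, 379, 262
print([enc.decode([i]) for i in ids])       # the individual chunks, e.g. 'Token', 'ization', ' is', ...
```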
Next, we have a simple arithmetic example.
Here we see that the tokenizer can be inconsistent in how it decomposes numbers. For example, the number 127 is a single token of three characters, while the number 677 becomes two tokens: " 6" (again, note the preceding space) and "77".
The LLM has to make sense of all this arbitrariness.
It has to learn, inside its parameters and during training, that these two tokens (" 6" followed by "77") actually combine to form the number 677.
Similarly, if the LLM wants to predict that the result of this sum is the number 804, it has to output that over two time steps:
first it has to emit the token " 8", and then the token "04".
Note that all of these splits look completely arbitrary. In the example below, we can see that 1275 becomes "12" then "75", that 6773 is actually three tokens, "6", "77", and "3", and that 8041 becomes "8" and "041".
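To see this arbitrariness for yourself, you can run the same kinds of arithmetic strings through the GPT-2 encoding (again a sketch using tiktoken; the exact splits can shift depending on surrounding whitespace):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for s in ["127 + 677 = 804", "1275 + 6773 = 8041"]:
    ids = enc.encode(s)
    # print the chunk each token id maps back to, to see how the numbers get split
    print(s, "->", [enc.decode([i]) for i in ids])
```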
(To be continued...)
(TODO: continue the text version of the content, unless we figure out how to generate it automatically from the video.)
Netizens responded: great, actually I prefer reading these posts to watching videos; it's easier to pace myself.
Some netizens also gave Karpathy advice:
"Feels tricky, but it might be possible using LangChain. I was wondering if I could use whisper transcription to produce a high-level outline with clear chapters, and then process those chapter chunks in parallel, in the context of the overall outline , focus on the specific content of the respective chapter blocks (also generate illustrations for each parallel-processed chapter). Then all generated reference marks are compiled to the end of the article through LLM."
Someone has written a pipeline for this, and it will be open source soon.