


Some time ago, the AI course launched by AI master Karpathy had already racked up 150,000 views across the web.
At the time, some netizens said that the value of this 2-hour course was equivalent to 4 years of college.
Just in the past few days, Karpathy had a new idea:
turn the 2-hour-13-minute video "Building a GPT Tokenizer from Scratch" into a book chapter or blog post, focusing on the topic of "tokenization".
The specific steps are as follows:
- Add subtitles or narration text to the video.
- Cut the video into paragraphs with matching images and text.
- Use large language model prompt engineering to translate the content paragraph by paragraph.
- Output the results as a web page with links back to parts of the original video.
More broadly, such a workflow can be applied to any video input, automatically generating "companion guides" for various tutorials in a format that is easier to read, browse, and search.
This sounds feasible, but also quite challenging.
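As a thought experiment, here is a hypothetical sketch of such a pipeline (the function names, chunk size, and prompt are assumptions for illustration, not Karpathy's implementation), using openai-whisper for the transcription step:

```python
# Hypothetical sketch of the proposed workflow (function names, chunk size and the
# prompt are illustrative assumptions, not Karpathy's implementation): transcribe
# the video with openai-whisper, group segments into chunks, then ask an LLM to
# rewrite each chunk as article prose with links back to the video timestamps.
import whisper  # pip install openai-whisper

def transcribe(video_path: str):
    model = whisper.load_model("base")
    result = model.transcribe(video_path)
    # each segment carries start/end timestamps, useful for linking back to the video
    return result["segments"]

def chunk_segments(segments, max_chars: int = 2000):
    chunks, current, size = [], [], 0
    for seg in segments:
        current.append(seg)
        size += len(seg["text"])
        if size > max_chars:
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)
    return chunks

def chunk_to_prompt(chunk):
    # one possible prompt for the "translate paragraph by paragraph" step
    start = chunk[0]["start"]
    text = " ".join(seg["text"] for seg in chunk)
    return (f"Rewrite this lecture transcript segment (starting at {start:.0f}s) "
            f"as a readable tutorial paragraph, keeping any code verbatim:\n{text}")
```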
He wrote an example under his GitHub project minbpe to illustrate what he has in mind.
Address: https://github.com/karpathy/minbpe/blob/master/lecture.md
Karpathy said this was a task he completed manually: he watched the video and translated it into an article in markdown format.
"I only got through about 4 minutes of the video (i.e. 3% done), and this already took about 30 minutes to write, so it would be great if something like this could be done automatically."
Next, it’s class time!
Text version of the "LLM Tokenization" course
Hello everyone, today we will discuss the issue of "tokenization" in LLMs.
Unfortunately, tokenization is a relatively complex and tricky component of even the most advanced large models, but it is necessary to understand it in detail.
Many flaws of LLMs that get attributed to the neural network or other seemingly mysterious factors can actually be traced back to tokenization.
Character-level tokenization
So, what is tokenization?
In fact, in the previous video "Let's build GPT from scratch", I already introduced tokenization, but that was only a very simple character-level version.
If you go to the Google Colab for that video, you'll see that we start with the training data (Shakespeare), which is just a big string in Python:
```
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.
```
But how do we input strings into LLM?
We can see that we first need to build a vocabulary for all possible characters in the entire training set:
```python
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)
# !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# 65
```
Then based on the vocabulary above, create a lookup table for converting between single characters and integers. This lookup table is just a Python dictionary:
```python
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
# encoder: take a string, output a list of integers
encode = lambda s: [stoi[c] for c in s]
# decoder: take a list of integers, output a string
decode = lambda l: ''.join([itos[i] for i in l])
print(encode("hii there"))
print(decode(encode("hii there")))
# [46, 47, 47, 1, 58, 46, 43, 56, 43]
# hii there
```
Once we convert a string into a sequence of integers, each integer is used as an index into a 2D embedding table of trainable parameters.
Because our vocabulary size is vocab_size=65 , this embedding table will also have 65 rows:
```python
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

    def forward(self, idx, targets=None):
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
```
Here, the integer "plucks out" a row of the embedding table, and this row is the vector representing the token. This vector is then fed into the Transformer as the input for the corresponding time step.
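To make the lookup concrete, here is a minimal self-contained sketch (the 32-dimensional embedding size is an illustrative assumption, not the lecture's value):

```python
# Minimal sketch: each integer id simply selects one row of the embedding table.
# The 32-dimensional embedding size is an illustrative choice, not the lecture's.
import torch
import torch.nn as nn

vocab_size, n_embd = 65, 32
token_embedding_table = nn.Embedding(vocab_size, n_embd)

idx = torch.tensor([[46, 47, 47, 1]])   # (B=1, T=4), e.g. "hii " from the encoder above
tok_emb = token_embedding_table(idx)    # (B, T, C) = (1, 4, 32)
print(tok_emb.shape)
# the vector fed to the Transformer at time step 0 is exactly row 46 of the table
print(torch.allclose(tok_emb[0, 0], token_embedding_table.weight[46]))  # True
```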
Tokenizing "character chunks" with the BPE algorithm
This is all well and good for the naive setting of a character-level language model.
But in practice, in state-of-the-art language models, people use more complex schemes to build these representational vocabularies.
Specifically, these schemes do not work at the character level, but at the level of "character chunks". The way these chunk vocabularies are built is with algorithms such as Byte Pair Encoding (BPE), which we describe in detail below.
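To give a flavor of the core idea ahead of time, here is a toy sketch of a BPE training loop (a minimal illustration, not the minbpe code): start from raw bytes and repeatedly merge the most frequent adjacent pair into a new token id.

```python
# Toy sketch of byte-pair encoding: repeatedly merge the most frequent adjacent
# pair of ids into a new id (not the minbpe implementation, just the core idea).
from collections import Counter

def get_pair_counts(ids):
    # count how often each adjacent pair of ids occurs
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # replace every (non-overlapping) occurrence of `pair` in `ids` with `new_id`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # start from raw bytes, ids 0..255
next_id = 256
for _ in range(3):                # perform 3 merges
    counts = get_pair_counts(ids)
    pair = max(counts, key=counts.get)
    ids = merge(ids, pair, next_id)
    print(f"merged {pair} -> {next_id}, sequence is now {len(ids)} ids long")
    next_id += 1
```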
Let's briefly review the history of this method. The paper that popularized the use of the byte-level BPE algorithm for language model tokenization is the GPT-2 paper, "Language Models are Unsupervised Multitask Learners", published by OpenAI in 2019.
Paper address: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Scroll down to Section 2.2, “Input Representation”, where they describe and motivate this algorithm. At the end of this section, you'll see them say:
The vocabulary was expanded to 50,257 words. We also increased the context size from 512 to 1024 tokens and used a larger batch size of 512.
Recall that in the Transformer’s attention layer, each token is associated with a limited list of previous tokens in the sequence.
This article points out that the context length of the GPT-2 model has increased from 512 tokens in GPT-1 to 1024 tokens.
In other words, token is the basic "atom" of the LLM input.
"Tokenization" is the process of converting the original string in Python into a token list, and vice versa.
There is another popular example that demonstrates the universality of this abstraction: if you search for "token" in the Llama 2 paper, you will get 63 matches.
For example, the paper claims that they trained on 2 trillion tokens, etc.
Paper address: https://arxiv.org/pdf/2307.09288.pdf
A brief note on the complexity of tokenization
Before we delve into the implementation details, let us briefly explain why it is necessary to understand the tokenization process in detail.
Tokenization is at the heart of many, many weird problems in LLMs, and I suggest you do not ignore it.
Many problems that seem to be about the neural network architecture are actually related to tokenization. Here are just a few examples:
- Why can't LLMs spell words? Tokenization.
- Why can't LLMs perform super simple string processing tasks, such as reversing a string? Tokenization.
- Why are LLMs worse at non-English languages (such as Japanese)? Tokenization.
- Why are LLMs not good at simple arithmetic? Tokenization.
- Why does GPT-2 run into more trouble than necessary when coding in Python? Tokenization.
- Why does my LLM suddenly halt when it sees the string "<|endoftext|>"? Tokenization.
- What is this strange warning I get about "trailing whitespace"? Tokenization.
- Why does the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.
- Why should I prefer using YAML over JSON with LLMs? Tokenization.
- Why is an LLM not truly end-to-end language modeling? Tokenization.
We will return to these questions at the end of the video.
Visual preview of tokenization
Next, let us load this tokenization web app.
Address: https://tiktokenizer.vercel.app/
The advantage of this web app is that tokenization runs live in the browser, letting you easily enter a text string on the input side and see the tokenization results on the right.
At the top, you can see that we are currently using the gpt2 tokenizer, and you can see that the string pasted in this example is currently being tokenized into 300 tokens.
Here, they are clearly shown with color:
For example, the string "Tokenization" is encoded into token 30642, followed by token 1634.
The token " is" (note that this is three characters, including the preceding space; this is very important!) is 318.
Pay attention to the space, because it is genuinely present in the string and must be tokenized along with all the other characters, although it is usually omitted in the visualization for clarity.
You can turn the visualization of spaces on and off at the bottom of the app. Likewise, the token " at" is 379, " the" is 262, and so on.
Next, we have a simple arithmetic example.
Here we see that the tokenizer can be inconsistent in how it decomposes numbers. For example, the number 127 is a single token of three characters, but the number 677 is split into two tokens: " 6" (again, note the preceding space) and "77".
We rely on the LLM to make sense of this arbitrariness.
It has to learn, inside its parameters and during training, that these two tokens (" 6" and "77") actually combine to represent the number 677.
Similarly, if the LLM wants to predict that the result of this sum is the number 804, it has to output it over two time steps:
first it must emit the token " 8", and then the token "04".
Note that all of these splits look completely arbitrary. In the example below, we can see that 1275 becomes "12" then "75", 6773 is actually three tokens "6", "77", and "3", and 8041 is "8" and "041".
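You can reproduce these groupings yourself; here is a small sketch using the tiktoken library (not part of the lecture):

```python
# Sketch using tiktoken's gpt2 encoding (not part of the lecture) to inspect
# how arbitrarily digits get grouped into tokens.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for s in ["127", " 677", " 804", " 1275", " 6773", " 8041"]:
    ids = enc.encode(s)
    print(repr(s), ids, [enc.decode([i]) for i in ids])
# Per the lecture: "127" is a single token, while " 677" splits into " 6" and "77",
# and " 804" splits into " 8" and "04".
```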
(To be continued...)
(TODO: continue the text version of the content, unless we can figure out how to generate it automatically from the video.)
Netizens chime in with suggestions
Netizens responded: great, I actually prefer reading these posts to watching videos; it's easier to go at my own pace.
Some netizens also gave Karpathy advice:
"Feels tricky, but it might be possible using LangChain. I was wondering if I could use whisper transcription to produce a high-level outline with clear chapters, and then process those chapter chunks in parallel, in the context of the overall outline , focus on the specific content of the respective chapter blocks (also generate illustrations for each parallel-processed chapter). Then all generated reference marks are compiled to the end of the article through LLM."
Someone has written a pipeline for this, and it will be open source soon.