Even without a job, the grind goes on.
The restless Andrej Karpathy has a new project! After announcing his departure from OpenAI, Karpathy tweeted, "I can take a break this week."
This state of having nothing to do makes even Musk envious (I am envious).
But if you really think Karpathy will take some time off, that's a bit "too young, too naive." Sharp-eyed netizens have already spotted Karpathy's new project: minbpe, which aims to provide minimal, clean, and educational code for the BPE (Byte Pair Encoding) algorithm commonly used in LLM tokenization.
In just one day, the project reached 1.2k stars on GitHub.
Source: https://twitter.com/ZainHasan6/status/1758727767204495367
Someone posted a picture saying that Karpathy "cooked a big meal for everyone."
Others cheered that Karpathy is back.
Image source: https://twitter.com/fouriergalois/status/1758775281391677477
Let's take a look at what the minbpe project actually covers.
Project Introduction
GitHub address: https://github.com/karpathy/minbpe
The BPE algorithm is "byte-level": it operates on UTF-8 encoded strings. It was popularized for large language models (LLMs) by the GPT-2 paper and the associated GPT-2 code. Today, all modern LLMs (such as GPT, Llama, and Mistral) use the BPE algorithm to train their tokenizers. Karpathy's minbpe repository provides two tokenizers, both of which can perform the three main functions of a tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, and 3) decode from tokens to text. The repository files are as follows:
- minbpe/base.py: Implements the Tokenizer class, which is the base class. It contains the train, encode, and decode stubs, save/load functionality, and a few common utility functions. This class is not meant to be used directly, but to be inherited from.
- minbpe/basic.py: Implements BasicTokenizer, the simplest implementation of the BPE algorithm, which operates directly on text.
- minbpe/regex.py: Implements RegexTokenizer, which further splits the input text by a regular expression pattern. As a preprocessing stage, it splits the input text by category (e.g. letters, numbers, punctuation) before tokenization. This ensures that no merges happen across category boundaries. This approach was introduced in the GPT-2 paper and continues to be used in GPT-4.
- minbpe/gpt4.py: Implements GPT4Tokenizer. This class is a light wrapper around RegexTokenizer that exactly reproduces the GPT-4 tokenization in tiktoken (OpenAI's open-source tokenization library). The wrapper handles some details around recovering the exact merges in the tokenizer, and handles some 1-byte token permutations. Note that parity with tiktoken is not yet fully complete: special tokens are not handled.
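To make the category split described for minbpe/regex.py concrete, here is a minimal sketch. It is not the pattern minbpe or GPT-4 uses: the pattern below is a simplified, ASCII-only stand-in written for the standard-library re module, and the names SPLIT_PATTERN and split_by_category are hypothetical.

```python
import re

# Simplified, ASCII-only stand-in for a GPT-style split pattern (an assumption).
# Each alternative matches one category, optionally preceded by a single space:
# letters, digits, other non-space symbols, or a run of whitespace.
SPLIT_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")


def split_by_category(text):
    """Split text into chunks so that each chunk stays within one category."""
    return SPLIT_PATTERN.findall(text)


chunks = split_by_category("hello123!!! world")
print(chunks)  # ['hello', '123', '!!!', ' world']

# BPE would then run on each chunk's bytes separately, so a pair such as
# ("o", "1") that straddles a chunk boundary can never be merged.
byte_chunks = [list(chunk.encode("utf-8")) for chunk in chunks]
```

Because the chunks concatenate back to the original string, the split loses no information; it only restricts which adjacent bytes are allowed to merge.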
The script train.py trains the two main tokenizers on the input text tests/taylorswift.txt and saves the vocabulary to disk for visualization. Karpathy says the script takes about 25 seconds to run on his MacBook (M1). He also notes that all of the files are very short, thoroughly commented, and include usage examples. The following example reproduces the one from the Wikipedia article on BPE:

    from minbpe import BasicTokenizer
    tokenizer = BasicTokenizer()
    text = "aaabdaaabac"
    tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
    print(tokenizer.encode(text))
    # [258, 100, 258, 97, 99]
    print(tokenizer.decode([258, 100, 258, 97, 99]))
    # aaabdaaabac
    tokenizer.save("toy")
    # writes two files: toy.model (for loading) and toy.vocab (for viewing)
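To see where the three merges in the example above come from, the following is a minimal standalone sketch of byte-level BPE. It is not minbpe's code: the function names (get_pair_counts, merge, train_bpe, encode, decode) are hypothetical, ties between equally frequent pairs are broken by first occurrence, and encoding simply replays the merges in the order they were learned.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each left-to-right occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn (vocab_size - 256) merges on top of the 256 raw byte tokens."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, in the order learned
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        pair = max(counts, key=counts.get)  # most frequent pair, first seen wins ties
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Turn text into token ids by replaying the learned merges in order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Turn token ids back into text by expanding each id to its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges = train_bpe("aaabdaaabac", 256 + 3)
print(encode("aaabdaaabac", merges))  # [258, 100, 258, 97, 99]
print(decode([258, 100, 258, 97, 99], merges))  # aaabdaaabac
```

In this sketch the three merges are "aa" (256), "aa" + "a" (257), and "aaa" + "b" (258), which is why the encoded sequence starts with token 258.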
The repository also shows how to use GPT4Tokenizer and how its output compares with tiktoken:

    text = "hello123!!!? (안녕하세요!) 😉"

    # tiktoken
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.encode(text))
    # [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]

    # ours
    from minbpe import GPT4Tokenizer
    tokenizer = GPT4Tokenizer()
    print(tokenizer.encode(text))
    # [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]
Of course, Karpathy is not content with just launching the GitHub project; he said a video will be released soon.