Karpathy, who left OpenAI and was unemployed, started a new large-scale model project, and the number of stars exceeded 1,000 in a day


WBOY

Aug 05, 2024, 10:09 PM

openai, industry, minbpe

Even without a job, you still have to keep grinding.

The restless Andrej Karpathy has a new project!

In the past few days, OpenAI has been very lively. First, AI guru Andrej Karpathy officially announced his resignation, and then the video generation model Sora shook the AI circle.

After announcing his departure from OpenAI, Karpathy tweeted, "I can take a break this week."

This state of having nothing to do made even Musk envious ("I am envious").

But if you really think Karpathy will take some time off, that's a bit "too young, too naive".

Sure enough, sharp-eyed netizens discovered Karpathy's new project: minbpe, which aims to provide minimal, clean, educational code for the BPE (Byte Pair Encoding) algorithm commonly used in LLM tokenization.

In just one day, the project's GitHub star count reached 1.2k.

Source: https://twitter.com/ZainHasan6/status/1758727767204495367

Someone posted a picture saying that Karpathy had "cooked a big meal for everyone."

Others cheered: Karpathy is back.

Image source: https://twitter.com/fouriergalois/status/1758775281391677477

Let's take a look at what the "minbpe" project contains.

Project Introduction

GitHub address: https://github.com/karpathy/minbpe

The BPE algorithm here is "byte-level": it operates on UTF-8 encoded strings. The algorithm was popularized for large language models (LLMs) by the GPT-2 paper and the associated GPT-2 code.

Nowadays, all modern LLMs (such as GPT, Llama, Mistral) use the BPE algorithm to train their tokenizers.
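To make "byte-level" concrete, here is a small sketch that turns a string into the integer byte values a byte-level tokenizer would start from. The helper name `to_byte_ids` is hypothetical and is not part of minbpe.

```python
def to_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 bytes of `text` as integers in the range 0..255."""
    return list(text.encode("utf-8"))


# ASCII characters take one byte each.
print(to_byte_ids("abc"))  # [97, 98, 99]

# Non-ASCII characters take several bytes; each Hangul syllable here takes three.
print(len(to_byte_ids("안녕")))  # 6
```

Because every input reduces to values between 0 and 255, the starting vocabulary always has 256 entries, and any text can be encoded without an "unknown" token.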

Karpathy's minbpe repository provides two tokenizers, both of which can perform the three main functions of a tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, and 3) decode from tokens to text.
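The three functions can be sketched in a few dozen lines. The class below is a hypothetical illustration written for this article, not the code in minbpe; the name `TinyBPE` and the choice to replay merges in training order during encoding are assumptions of the sketch.

```python
class TinyBPE:
    """Hypothetical minimal byte-level BPE; not the minbpe implementation."""

    def __init__(self):
        self.merges = {}  # (id, id) -> new id, in the order learned
        self.vocab = {i: bytes([i]) for i in range(256)}  # id -> raw bytes

    @staticmethod
    def _merge(ids, pair, new_id):
        """Replace each non-overlapping occurrence of `pair` with `new_id`."""
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        return out

    def train(self, text, vocab_size):
        """Learn up to vocab_size - 256 merges of the most frequent pair."""
        ids = list(text.encode("utf-8"))
        for new_id in range(256, vocab_size):
            counts = {}
            for pair in zip(ids, ids[1:]):
                counts[pair] = counts.get(pair, 0) + 1
            if not counts:
                break  # fewer than two tokens left, nothing to merge
            best = max(counts, key=counts.get)
            ids = self._merge(ids, best, new_id)
            self.merges[best] = new_id
            self.vocab[new_id] = self.vocab[best[0]] + self.vocab[best[1]]

    def encode(self, text):
        """Text -> token ids, replaying the learned merges in training order."""
        ids = list(text.encode("utf-8"))
        for pair, new_id in self.merges.items():
            ids = self._merge(ids, pair, new_id)
        return ids

    def decode(self, ids):
        """Token ids -> text, by concatenating each token's bytes."""
        data = b"".join(self.vocab[i] for i in ids)
        return data.decode("utf-8", errors="replace")


tok = TinyBPE()
tok.train("low lower lowest", 256 + 2)
ids = tok.encode("lower")
print(ids, tok.decode(ids))
```

Decoding never needs the merge table, only the id-to-bytes vocabulary, which is why a round trip through `encode` and `decode` returns the original text.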

The detailed repository files are as follows:

minbpe/base.py: Implements the Tokenizer class, which is the base class. It contains the train, encode, and decode stubs, save/load functionality, and a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.
minbpe/basic.py: Implements BasicTokenizer, the simplest implementation of the BPE algorithm, which operates directly on text.
minbpe/regex.py: Implements RegexTokenizer, which further splits the input text by a regular expression pattern. As a preprocessing stage, it splits the input text into categories (e.g. letters, numbers, punctuation) before tokenization. This ensures that no merges occur across category boundaries. This approach was introduced in the GPT-2 paper and continues to be used in GPT-4.
minbpe/gpt4.py: Implements GPT4Tokenizer. This class is a light wrapper around RegexTokenizer that exactly reproduces the GPT-4 tokenization in the tiktoken library (OpenAI's open-source tokenizer). The wrapper handles some details of recovering the exact merges in the tokenizer, and handles some 1-byte token permutations. Note that parity is not yet complete, and special tokens are not handled.
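The category split described for the RegexTokenizer can be illustrated with a much simpler pattern than the ones GPT-2 and GPT-4 actually use. The pattern and function names below are hypothetical and use only Python's standard `re` module; the sketch shows the idea that pairs are counted inside each chunk, so no merge can span a letter/digit/punctuation boundary.

```python
import re

# Simplified, hypothetical split pattern (NOT the GPT-2 or GPT-4 pattern):
# runs of letters | runs of digits | runs of punctuation or "_" | runs of whitespace
SPLIT_PATTERN = re.compile(r"[^\W\d_]+|\d+|(?:[^\w\s]|_)+|\s+")


def split_chunks(text):
    """Split text into same-category chunks; joining them restores the text."""
    return SPLIT_PATTERN.findall(text)


def pairs_within_chunks(text):
    """Count adjacent byte pairs per chunk, so no pair spans two chunks."""
    counts = {}
    for chunk in split_chunks(text):
        ids = list(chunk.encode("utf-8"))
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


print(split_chunks("hello123!!!"))  # ['hello', '123', '!!!']

# The pair ("o", "1") straddles the letter/digit boundary, so it is never counted.
print((ord("o"), ord("1")) in pairs_within_chunks("hello123"))  # False
```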

The script train.py trains the two main tokenizers on the input text tests/taylorswift.txt and saves the vocabulary to disk for visualization. Karpathy says the script takes about 25 seconds to run on his MacBook (M1).

Karpathy also stated that all of the files are very short and well-commented, and include usage examples. The example below reproduces the one from the BPE Wikipedia article.

    from minbpe import BasicTokenizer
    tokenizer = BasicTokenizer()
    text = "aaabdaaabac"
    tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
    print(tokenizer.encode(text))
    # [258, 100, 258, 97, 99]
    print(tokenizer.decode([258, 100, 258, 97, 99]))
    # aaabdaaabac
    tokenizer.save("toy")
    # writes two files: toy.model (for loading) and toy.vocab (for viewing)
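To see where the ids 256, 257, and 258 come from, the following sketch traces the three greedy merges on the same string. It is a standalone illustration rather than minbpe code; the function name `trace_merges` is hypothetical, and breaking ties in favor of the pair seen first is an assumption of this sketch.

```python
def trace_merges(text, num_merges):
    """Return (pair, new_id, ids) after each greedy merge step."""
    ids = list(text.encode("utf-8"))
    steps = []
    for new_id in range(256, 256 + num_merges):
        counts = {}
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break  # nothing left to merge
        best = max(counts, key=counts.get)  # ties: the pair seen first wins
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        steps.append((best, new_id, ids))
    return steps


for pair, new_id, ids in trace_merges("aaabdaaabac", 3):
    print(pair, "->", new_id, ids)
# (97, 97) -> 256 [256, 97, 98, 100, 256, 97, 98, 97, 99]
# (256, 97) -> 257 [257, 98, 100, 257, 98, 97, 99]
# (257, 98) -> 258 [258, 100, 258, 97, 99]
```

Here 97 is "a", 98 is "b", 99 is "c", and 100 is "d", so token 258 stands for the byte sequence "aaab".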

The repository also shows how to use GPT4Tokenizer and how its output compares with tiktoken's.

    text = "hello123!!!? (안녕하세요!) ?"
    # tiktoken
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.encode(text))
    # [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]
    # ours
    from minbpe import GPT4Tokenizer
    tokenizer = GPT4Tokenizer()
    print(tokenizer.encode(text))
    # [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]

Of course, Karpathy is not content with just launching the GitHub project; he said a video will be released soon.
