The crazy open-source plan of four post-2000s founders: turn the entire Internet into a large-model corpus, with embedding 100 million tokens costing only US$1

WBOY
2023-06-06 11:10:04

Every paper on arXiv has been converted into tokens, and the whole set comes to only 14.1 GB.

This is the feat accomplished by Alexander, the latest hot open source project.

In fact, this is only the first step.

Ultimately, they want to turn the entire Internet into tokens. In other words, they want to convert everything into the form in which large models such as ChatGPT understand the world.

Once such a dataset exists, it could become a powerful new tool for developing large models like GPT-4, and a model that knows everything from astronomy to geography would be just around the corner.

As soon as the news came out, it immediately attracted huge attention.

Netizens called it epic.

And behind this are just four people, young founders with an average age of 20. The dataset covering all arXiv papers has already been released, and they will release an embedding search platform next week.

Starting with every paper on arXiv

More than 4 million items, 600 million tokens, and 3.07 billion vector dimensions.

This open-source project, called Alexander, starts with every paper on arXiv.

The chosen method is embedding. Simply put, embedding represents objects from the real world as vectors that a computer can work with.

The classic example is representing an image as a grid of grayscale pixel values.

The key feature of this technique is that it can capture the semantic similarity that humans perceive.

For example, finding papers by keyword is difficult when ten different words mean the same thing. Embeddings can handle this, which makes them well suited to search, clustering, recommendation, and classification.
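To make the idea concrete, here is a minimal sketch of similarity between embeddings. The three-dimensional vectors and the paper titles are hand-made, hypothetical values chosen for illustration; they are not output from any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (made up by hand, not produced by a model).
# The first two titles share no keywords but are about the same topic.
toy_embeddings = {
    "neural network pruning": [0.9, 0.1, 0.0],
    "sparsifying deep models": [0.8, 0.2, 0.1],
    "galaxy rotation curves": [0.0, 0.1, 0.9],
}

same_topic = cosine_similarity(
    toy_embeddings["neural network pruning"],
    toy_embeddings["sparsifying deep models"],
)
different_topic = cosine_similarity(
    toy_embeddings["neural network pruning"],
    toy_embeddings["galaxy rotation curves"],
)
print(same_topic > different_topic)
```

A keyword match would treat the first two titles as unrelated, while the vector comparison puts them close together.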

For practicality and efficiency, the development team chose to embed only the title and abstract of each paper.

After testing various models, the team settled on the InstructorXL text embedding model. Given only a task instruction and without any fine-tuning, it can produce embeddings suited to a variety of tasks (such as classification, retrieval, clustering, and text evaluation) and domains (such as science, finance, and medicine).
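As a rough illustration of what "providing task instructions" can look like, the sketch below pairs each paper's title and abstract with an instruction string. The function name, the wording of the instruction, and the sample paper are assumptions made for this example; the article does not say which instruction the team actually used.

```python
def build_instructor_inputs(
    papers,
    instruction="Represent the Science title and abstract for retrieval:",
):
    """Pair each paper's title + abstract with a task instruction.

    The default instruction text is a hypothetical example, not the
    team's actual prompt.
    """
    pairs = []
    for paper in papers:
        text = f"{paper['title']}\n{paper['abstract']}"
        pairs.append([instruction, text])
    return pairs

# Hypothetical sample record.
sample_papers = [
    {"title": "A toy title", "abstract": "A toy abstract."},
]
pairs = build_instructor_inputs(sample_papers)
print(pairs)

# Assumption: with the InstructorEmbedding package installed, pairs in this
# [instruction, text] shape could be encoded roughly like this:
#   from InstructorEmbedding import INSTRUCTOR
#   model = INSTRUCTOR("hkunlp/instructor-xl")
#   vectors = model.encode(pairs)
```

Changing only the instruction string is what lets one model serve different tasks and domains without retraining.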

Next week they will release arXiv search. The current pipeline first runs a similarity search to find the 100 closest articles, then computes embeddings for those on the fly and runs a second, more sophisticated search.
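The two-step process described above might be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy vectors, and the use of cosine similarity in both stages are hypothetical, since the article gives no details of the team's second-stage search.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def two_stage_search(query_vec, stored_vectors, embed_on_the_fly, k=100, top_n=10):
    """Stage 1: shortlist the k closest items using precomputed vectors.
    Stage 2: embed only the shortlist on the fly and rank it again."""
    shortlist = sorted(
        stored_vectors,
        key=lambda doc_id: cosine(query_vec, stored_vectors[doc_id]),
        reverse=True,
    )[:k]
    fresh = {doc_id: embed_on_the_fly(doc_id) for doc_id in shortlist}
    reranked = sorted(shortlist, key=lambda doc_id: cosine(query_vec, fresh[doc_id]), reverse=True)
    return reranked[:top_n]

# Hypothetical toy data: precomputed vectors and "richer" on-the-fly vectors.
stored = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
detailed = {"a": [0.5, 0.5], "b": [1.0, 0.05], "c": [0.0, 1.0], "d": [0.1, 0.9]}
embedded_ids = []

def toy_embed(doc_id):
    embedded_ids.append(doc_id)  # record which items needed fresh embeddings
    return detailed[doc_id]

result = two_stage_search([1.0, 0.0], stored, toy_embed, k=2, top_n=2)
print(result, embedded_ids)
```

The point of the design is that the expensive second step touches only the shortlist (100 items in the article, 2 in this toy run) rather than the whole collection.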

The ultimate goal is a plan to embed the entire Internet.

The crazy open-source plan of a team of 20-year-olds

There are two main reasons the team launched such a crazy open-source plan.

On the one hand, embeddings are hugely valuable. Many real-world problems come down to search, clustering, recommendation, or classification, and these are exactly the things embeddings are good at. And as mentioned earlier, they can solve some otherwise tricky problems.

On the other hand, the cost is one-time and very low. In most cases there is no need to compute the embedding of the same document a second time. Currently, embedding 100 million tokens costs only US$1.
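As a back-of-the-envelope check, the quoted price of US$1 per 100 million tokens can be applied to the 600 million tokens of the arXiv set mentioned earlier. The function and variable names are made up for this sketch, and the result is only an estimate that assumes the price scales linearly.

```python
USD_PER_100M_TOKENS = 1.0  # price quoted in the article

def embedding_cost_usd(num_tokens, usd_per_100m=USD_PER_100M_TOKENS):
    """Estimated one-time embedding cost, assuming linear pricing."""
    return num_tokens / 100_000_000 * usd_per_100m

arxiv_tokens = 600_000_000  # token count quoted in the article
arxiv_cost = embedding_cost_usd(arxiv_tokens)
print(f"Estimated cost: US${arxiv_cost:.2f}")
```

Under these assumptions, embedding the titles and abstracts of the whole arXiv set would cost on the order of a few dollars.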

But they could not find any open embedding datasets, so this project came into being.

They will open up more datasets in the future, and users will choose which ones. Beyond the datasets already published on the official website, the remaining candidates are open for voting.

It is worth mentioning that the team behind it has an average age of only 20.

Their team name is just as bold: the Macrocosm Alliance.

Zoom out far enough, and humanity becomes a single organism.

According to their official introduction, they are committed to building plug-ins for ChatGPT and similar products. They are also developing their core product, a personal research assistant based on large models, to help with learning, teaching, and scientific research.

If you are interested, follow the link below to learn more.

https://alex.macrocosm.so/download

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.