The amount of text data used for training Google PaLM 2 is nearly 5 times that of the original generation
According to a May 17 report, Google launched its latest large language model, PaLM 2, at its 2023 I/O developer conference last week. Internal company documents show that the new model was trained on almost five times as much text data as its 2022 predecessor.
Google's newly released PaLM 2 can reportedly handle more advanced programming, math and creative writing tasks. According to the internal documents, PaLM 2 was trained on 3.6 trillion tokens.
A token is a short string of text: the sentences and paragraphs in the training data are segmented into these pieces, each of which is called a token. Tokens are a key ingredient in training large language models, which learn to predict which token comes next in a sequence.
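As a rough illustration of the idea (a toy word-level split, not the subword tokenizer that production models such as PaLM actually use), a few lines of Python show how text is broken into tokens and counted:

```python
# Toy example only: a naive word-level tokenizer. Real models use subword
# tokenizers (e.g. SentencePiece), so actual token counts differ.
text = "Large language models learn to predict the next token in a sequence."
tokens = text.lower().rstrip(".").split()
print(tokens)
print(len(tokens), "tokens")
```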
The previous-generation model, PaLM, which Google released in 2022, was trained on 780 billion tokens.
Although Google has been eager to demonstrate the power of its artificial intelligence technology and show how it can be embedded in search, email, word processing and spreadsheets, it has been reluctant to disclose the size of its training data or other details. Microsoft-backed OpenAI has likewise kept the details of its newly released GPT-4 large language model secret.
Both companies say the secrecy stems from fierce competition in the artificial intelligence industry: Google and OpenAI are each racing to attract users who would rather search for information with a chatbot than with a traditional search engine.
But as competition in the field of artificial intelligence heats up, the research community is demanding more transparency.
In launching PaLM 2, Google has said the new model is smaller than its predecessor, a sign that the company's technology is becoming more efficient even as it takes on more complex tasks. Parameter count is a common measure of a language model's complexity: according to the internal documents, PaLM 2 has 340 billion parameters, compared with 540 billion for the original PaLM.
Google had no immediate comment.
In a blog post about PaLM 2, Google said the new model uses a technique called "compute-optimal scaling," which makes PaLM 2 "more efficient, with better overall performance, such as faster inference, fewer serving parameters, and lower serving costs."
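As a back-of-the-envelope illustration of what that shift looks like, the figures reported above can be compared as tokens seen per parameter; the sketch below simply divides the article's numbers and says nothing about the actual compute trade-offs Google used, which are not public:

```python
# Rough illustration using the figures reported in this article; the real
# trade-offs behind Google's "compute-optimal scaling" are not disclosed.
models = {
    "PaLM (2022)":   {"tokens": 780e9,  "parameters": 540e9},
    "PaLM 2 (2023)": {"tokens": 3.6e12, "parameters": 340e9},
}

for name, m in models.items():
    ratio = m["tokens"] / m["parameters"]  # tokens seen per parameter
    print(f"{name}: {ratio:.1f} tokens per parameter")

# Output:
# PaLM (2022): 1.4 tokens per parameter
# PaLM 2 (2023): 10.6 tokens per parameter
```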
When it released PaLM 2, Google disclosed that the new model was trained on text in 100 languages and can perform a wide range of tasks. PaLM 2 already powers 25 features and products, including Google's experimental chatbot Bard, and comes in four versions of increasing size: Gecko, Otter, Bison and Unicorn.
Based on Google's public disclosures, PaLM 2 is more powerful than any existing model. Facebook announced its large language model LLaMA in February of this year; it was trained on 1.4 trillion tokens. OpenAI last disclosed its training scale when it released GPT-3, saying the model had been trained on 300 billion tokens. In March of this year, OpenAI released GPT-4 and said it performs at a "human level" on many professional tests.
According to the latest documents, the language model launched by Google two years ago was trained on 1.5 trillion tokens.
As new generative AI applications quickly become mainstream in the technology industry, the controversy surrounding the underlying technology is becoming increasingly fierce.
In February of this year, El Mahdi El Mhamdi, a senior scientist in Google's research division, resigned over the company's lack of transparency. On Tuesday, OpenAI CEO Sam Altman testified at a U.S. Senate Judiciary subcommittee hearing on privacy and technology and agreed with lawmakers that a new system is needed to govern artificial intelligence.
“For a very new technology, we need a new framework,” Altman said. “Of course, companies like ours have a lot of responsibility for the tools they put out.”