


Produced by Huxiu Technology Group
Author|Qi Jian
Editor|Chen Yifan
Header image|FlagStudio
"Will OpenAI open source large models again?"
When Zhang Hongjiang, chairman of the Zhiyuan Research Institute, asked OpenAI CEO Sam Altman, who joined the 2023 Zhiyuan Conference remotely, about open source, Altman smiled and said that OpenAI would open up more code in the future, but gave no specific timetable.
This exchange arose from one of the topics of this year's Zhiyuan Conference: open-source large models.
On June 9, the 2023 Zhiyuan Conference opened in Beijing to a packed house. At the venue, AI terms such as "computing power", "large model" and "ecosystem" cropped up constantly in attendees' conversations, as did the names of companies across the industry chain.
At this conference, the Zhiyuan Research Institute released Wu Dao 3.0 as fully open source, including the "Vision" series of visual large models, the "Sky Eagle" series of language large models, and its original large-model evaluation system, "Libra".
Open-sourcing a large model means making its code public for AI developers to study. The "Sky Eagle" base language model in Wu Dao 3.0 goes further: it is commercially usable, and anyone can use it free of charge.
In a recent interview, Microsoft President Brad Smith said that OpenAI, Microsoft's deep partner, Google, and BAAI are the three institutions at the forefront of artificial intelligence, calling BAAI the "strongest" AI research institution in China, as renowned as OpenAI and Google. That institution is the Beijing Zhiyuan Artificial Intelligence Research Institute, and many in the industry regard the AI conference it hosts as a bellwether for industry trends.
The Zhiyuan Research Institute, so highly rated by Microsoft's president, launched its "Wu Dao" (Enlightenment) large-model project as early as October 2020 and went on to release versions 1.0 and 2.0. The officially announced parameter count of Wu Dao 2.0 reached 1.7 trillion, at a time when barely a year had passed since OpenAI released the 175-billion-parameter GPT-3.
Yet this pioneer of large AI models has kept an extremely low profile during the large-model craze of the past six months.
While large models have poured out of big companies and startups alike, Zhiyuan stayed "silent" for more than three months. Apart from "SegGPT", which launched in early April and collided head-on with Meta's image-segmentation AI "SAM", it revealed almost nothing about large AI models to the public.
Many inside and outside the AI industry have asked the same question: why does the Zhiyuan Research Institute, a leader in large AI models, seem late to the large-model boom?
Will open-source models dismantle OpenAI's moat?
"Although the competition among large models is fierce right now, neither OpenAI nor Google has a moat, because 'open source' is rising in the field of large AI models."
In a leaked Google document, an internal researcher argued that open-source models may lead the future of large-model development. The document notes that open-source models iterate faster, are more customizable, and are more private, and that people will not pay for restricted models when free, unrestricted alternatives are of comparable quality. This may be one reason Zhiyuan chose to develop open-source large models.
At present, open-source, commercially usable large models are rare. The Zhiyuan Research Institute surveyed the large AI models released so far: of 39 open-source large language models released abroad, 16 are commercially usable; of 28 large language models released in China, 11 are open source, but only one of those is both open source and commercially usable.
The large language model Zhiyuan released this time is both open source and commercially usable, one of very few such models currently available. That also meant it had to be released with extra caution.
"As far as Zhiyuan is concerned, we certainly don't want the open-source model to look too ugly, so we release it with caution," an AI researcher at the Zhiyuan Conference said. An open-source model will inevitably be verified repeatedly and have its bugs picked apart by a large number of developers. To guarantee the quality of the release, Zhiyuan's R&D progress may well have been slowed down by "open source" itself.
Huang Tiejun, president of the Zhiyuan Research Institute, believes that open source and openness of large models in China's market are still far from sufficient. "We should further strengthen open source and openness. Open source and openness are themselves a form of competition: only with genuinely good benchmarks and good algorithms, proven through evaluation and comparison, can we demonstrate our technical level."
Domestic manufacturers often release large models with little transparency, and many doubt whether they have truly done independent R&D. Some models are said to simply call ChatGPT through its API; others are said to be trained on ChatGPT's answer data or on Meta's leaked LLaMA weights. Open-sourcing a model cuts off such doubts at the source.
However, open-sourcing models and improving technical transparency are not about proving one's innocence; they are about genuinely "concentrating efforts to do big things". According to Zhiyuan, the daily training cost of the Sky Eagle large language model exceeds 100,000 yuan. Amid the domestic "war of a hundred models", or even "war of a thousand models", the redundant spending caused by large amounts of unnecessary duplicated training could be astronomical.
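The scale of that duplicated spending is easy to ballpark. The sketch below uses only the per-day figure the article states (100,000+ yuan for one model); the number of competing teams and the training duration are illustrative assumptions, not reported figures.

```python
# Back-of-the-envelope estimate of duplicated training spend.
# Only daily_cost_yuan comes from the article; the other two
# numbers are hypothetical assumptions for illustration.
daily_cost_yuan = 100_000   # Zhiyuan's stated daily training cost for one model
num_models = 100            # hypothetical: the "war of a hundred models"
training_days = 180         # hypothetical: a half-year training run per team

total = daily_cost_yuan * num_models * training_days
print(f"{total:,} yuan")  # 1,800,000,000 yuan, i.e. 1.8 billion
```

Even under these modest assumptions the industry-wide bill runs into the billions of yuan, which is the sense in which the duplication is "astronomical".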
Open-source models can reduce that duplicated training. For companies that need a model, taking an open-source, commercially usable large AI model and continuing training with their own data may be the best route to AI implementation and industry application.
Another motive for open-sourcing is to accumulate users and developers early, build a healthy ecosystem, and enable future commercialization. The founder of a domestic large-model company told Huxiu: "OpenAI's GPT-1 and GPT-2 were both open-source models; that was about accumulating users and building recognition for the models. Once GPT-3's capabilities were fully demonstrated, commercialization became the priority, and the models gradually closed up. That is why open-source models generally do not allow commercial use: it is out of consideration for later commercialization."
Obviously, though, as a non-profit research institution Zhiyuan has no commercial stake in the open-source question. For Zhiyuan, open-sourcing the underlying models is meant, on the one hand, to promote research and innovation in the large-model industry and accelerate industrial adoption; on the other, perhaps, to accumulate user feedback on the open-source models and improve their engineering usability.
However, open-sourcing a model is not a "perfect" solution.
An AI technical director at a major company told Huxiu that the current market for commercializing large AI models falls into three tiers: the first tier is leading players fully capable of developing their own models; the second is enterprises that need to train proprietary models for specific scenarios; the third is small and medium-sized customers who only need general model capabilities, which API calls can satisfy.
In this context, open-source models can save the top-tier players, who have self-development capability, a great deal of time and cost. But second- and third-tier companies would need to build their own technical teams to train and tune the models. For many companies with weaker technical strength, that makes implementation more complicated, and to them open source can feel a bit like "free is the most expensive".
This "Wu Dao" is no longer that "Wu Dao"
Zhiyuan's Wu Dao 3.0 is a completely redeveloped series of large models, and that is one reason it arrived "late".
With the foundation of Wu Dao 2.0 already in place, why did Zhiyuan need to build a new model system? Partly because of a shift in technical direction, and partly because the models' underlying training data was "replaced".
"Wu Dao 2.0 was developed in 2021, so both its language models (such as GLM) and its text-to-image models (such as CogView) were built on architectures that now look relatively early. Over the past year or so, the architectures in these fields have been further validated or have evolved. For example, the decoder-only architecture used in language models has been shown to deliver better generation performance at large parameter scales when given higher-quality data. For text-to-image models, we switched to a diffusion-based approach for further innovation. So in Wu Dao 3.0 we adopted these updates for the large language model, the text-to-image generation model, and so on," said Lin Yonghua, vice president and chief engineer of the Zhiyuan Research Institute. Building on its research into past models, Wu Dao 3.0 has been rebuilt in many directions.
In addition, Wu Dao 3.0 has comprehensively optimized and upgraded the training data of the underlying models. On one hand, the training uses the updated Wu Dao Chinese dataset, covering 2021 to the present and subjected to stricter quality cleaning; on the other, a large amount of high-quality Chinese text has been added, including Chinese books and literature.
High-quality code datasets have also been added, so the base model itself has changed substantially.
Training data that is not natively Chinese has left many domestic models weak at understanding Chinese. Many large AI models at home and abroad are trained on massive open datasets from overseas, chief among them the well-known open dataset Common Crawl.
Zhiyuan analyzed 1 million Common Crawl web pages and could extract only 39,052 Chinese pages. By source, 25,842 websites yielded Chinese text, of which only 4,522 had IP addresses in mainland China, about 17%.
This not only greatly reduces the accuracy of Chinese data, it also reduces safety. "The corpus used to train a base model largely determines the compliance, safety and values of the content generated by AIGC applications and fine-tuned models," Lin Yonghua said. The Chinese ability of the Sky Eagle base model is not simple translation: by "pressing enough Chinese knowledge into the model", with 99% of its Chinese internet data coming from domestic websites, companies can safely run continued training on top of it.
At the same time, through extensive refined processing and cleaning of the data, a model of equal or better performance can be trained with far less of it; with as little as 30% to 40% of the data volume, the model can catch up with or surpass existing open-source models.
For now, this path may be the better option for Zhiyuan, because in training data it has a shortfall compared with Internet companies, which hold rich user-interaction data and large volumes of copyrighted data. Not long ago, Alibaba's DAMO Academy released the video-language dataset Youku-mPLUG, all of whose content comes from Youku, Alibaba's video platform. Lacking a deep user base, Zhiyuan can only obtain training data by negotiating authorization with copyright holders and accumulating it bit by bit through public-interest data projects.
At present, Zhiyuan's Chinese dataset can only be partially open-sourced, mainly because the copyrights of Chinese data are scattered across many institutions.
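The page-level filtering behind Zhiyuan's Common Crawl numbers can be approximated with a simple heuristic. The sketch below is an illustration, not Zhiyuan's actual pipeline: it classifies a page as Chinese when the share of CJK ideographs among its non-whitespace characters exceeds a threshold (the 0.3 cutoff is an arbitrary assumption).

```python
# Toy heuristic for filtering Chinese pages from a web-page sample,
# in the spirit of (but not identical to) Zhiyuan's Common Crawl analysis.

def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if '\u4e00' <= c <= '\u9fff')
    return cjk / len(chars)

def is_chinese_page(text: str, threshold: float = 0.3) -> bool:
    """Classify a page as Chinese if its CJK ratio meets the threshold."""
    return cjk_ratio(text) >= threshold

pages = [
    "悟道3.0发布,包含语言大模型天鹰与评测体系天秤。",       # clearly Chinese
    "OpenAI released GPT-3 with 175 billion parameters.",    # English
    "大模型 training data: mixed 中英文 corpus example.",     # mostly English
]
flags = [is_chinese_page(p) for p in pages]
print(flags)  # only the first page passes the threshold
```

A production pipeline would of course add deduplication, boilerplate stripping, and proper language identification, but even this crude ratio shows why so few of the sampled pages survive as usable Chinese data.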
At present, the training data for Zhiyuan's open-source model research has been obtained through coordination among many parties, and most of it may only be used for Zhiyuan's own models, with no right of secondary use. "It is very necessary in China to establish an industry alliance for datasets, unite copyright holders, and plan training data for artificial intelligence in a unified way, but that calls for the wisdom of top-level design," Lin Yonghua told Huxiu.
The Whampoa Military Academy of the domestic large-model industry
Wu Dao 3.0 tells a different story from Wu Dao 2.0, and the change in the R&D team is part of it. As a pioneer of large AI models, the Zhiyuan Research Institute has been something like the Whampoa Military Academy of domestic large models: in today's craze, everyone from Zhiyuan scholars to grassroots engineers has become sought-after, and Zhiyuan's original teams have incubated several large-model startups.
Before Wu Dao 3.0, each model series combined research results jointly released with multiple external laboratories; this time, Wu Dao 3.0 is a series developed entirely in-house by the Zhiyuan team. The Wu Dao 2.0 models released in 2021 included Wenyuan, Wenlan, Wenhui and Wensu, and the two core models were completed by two laboratories at Tsinghua University. Today those two teams have founded their own companies and built independent products along their respective research lines, CPM and GLM.
Among them, Tsinghua University's Knowledge Engineering Group (KEG), the main R&D team behind GLM, launched the open-source model ChatGLM-6B together with Zhipu AI and has won wide industry recognition. Shenyan Technology, formed by members of Tsinghua's Natural Language Processing and Social Humanities Computing Lab (THUNLP), the main R&D team behind CPM, has been courted by investors since its founding a year ago, with Tencent, Sequoia China, Qiji Chuangtan and other funds appearing in its two financing rounds this year.
Someone close to the Zhiyuan Research Institute told Huxiu that since large AI models took off in China, the Zhiyuan team has become a prime "hunting target" in the talent war: "The entire R&D team is being targeted by other companies or headhunters."
In the domestic large-model industry right now, the least scarce thing is money, and the most scarce thing is people. Searching for ChatGPT on the recruiting platforms Liepin, Maimai and BOSS Zhipin, positions requiring a master's or doctorate generally pay monthly salaries above 30,000 yuan, topping out at 90,000.
"On salary, the big IT companies don't have much of an edge. Large-model R&D is done at a high level everywhere, and the packages startups offer may be more competitive," Yu Jia, COO of Xihu Xinchen, told Huxiu, adding that the talent war in AI will only intensify. "Double the salary is not competitive at all in the eyes of many Zhiyuan employees, because poachers now offer five or even ten times as much. However idealistic you are, however you plan for the future, it is hard to resist the temptation of an annual salary above one million."
A person close to Zhiyuan told Huxiu that, as a non-profit research institution, Zhiyuan finds it hard to match the salary levels of major Internet companies or of startups backed by abundant capital. Through headhunters, Huxiu learned that starting salaries for natural language processing experts now exceed one million yuan. For employees with long tenures and modest wages, it is hard not to waver when offered several times their pay.
Still, judging from Zhiyuan's current public information, most leaders of its core project teams remain full-time on the institute's R&D projects. "The Wu Dao 3.0 models, including Sky Eagle, Libra and Vision, were all developed by Zhiyuan's own researchers," Lin Yonghua said, adding that Zhiyuan's R&D strength now ranks among the best in the industry.
