Implementing Tsinghua UltraChat multi-round conversations using multiple ChatGPT APIs
Since the release of ChatGPT, conversation models have only grown in popularity. While we admire the impressive performance of these models, we should also appreciate the enormous computing power and massive data support behind them.
As far as data is concerned, high-quality data is crucial, and OpenAI has put a great deal of effort into data collection and annotation. Multiple studies have shown that ChatGPT is a more reliable data annotator than humans. If the open source community can obtain large amounts of dialogue data from powerful language models such as ChatGPT, it can train dialogue models with better performance. This is demonstrated by the Alpaca family of models: Alpaca, Vicuna, and Koala. For example, Vicuna reached roughly 90% of ChatGPT's quality by instruction fine-tuning the LLaMA model on user-shared conversations collected from ShareGPT. Growing evidence shows that data is the primary driver of progress in training powerful language models.
ShareGPT is a ChatGPT data sharing website where users upload ChatGPT answers they find interesting. The data on ShareGPT is open but scattered, and researchers need to collect and organize it themselves. With a high-quality, wide-ranging dataset, the open source community could develop conversation models with far less effort.
Against this background, a recent project called UltraChat set out to systematically construct an ultra-high-quality conversation dataset. The project authors use two independent ChatGPT Turbo APIs that converse with each other to generate multi-round conversation data.
Specifically, the project aims to build an open source, large-scale, multi-round dialogue dataset on top of the Turbo APIs, to help researchers develop powerful language models with general dialogue capabilities. In addition, out of consideration for privacy protection and other factors, the project does not directly use data from the Internet as prompts. To ensure the quality of the generated data, the researchers use two independent ChatGPT Turbo APIs during generation: one model plays the role of the user and generates questions or instructions, while the other model generates the replies.
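To make the setup concrete, here is a minimal Python sketch of such a dual-API generation loop, using the OpenAI chat completions client. It is illustrative only: the model name, system prompt, and number of rounds are assumptions, not the project's actual implementation.

```python
# Minimal sketch of the dual-API generation loop; illustrative only, not the
# authors' actual code. One "user" model invents the next instruction, and a
# second "assistant" model answers it.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

USER_SYSTEM_PROMPT = (
    "You are simulating a curious human user. Given the conversation so far, "
    "write the user's next question or instruction. Stay on topic."
)

def assistant_turn(history):
    """Answer the latest user message with the assistant-role model."""
    messages = [{"role": r, "content": t} for r, t in history]
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return resp.choices[0].message.content

def user_turn(history):
    """Ask the user-role model to produce the next question."""
    transcript = "\n".join(f"{r}: {t}" for r, t in history)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # "Turbo" API; the exact model name is an assumption
        messages=[
            {"role": "system", "content": USER_SYSTEM_PROMPT},
            {"role": "user", "content": f"Conversation so far:\n{transcript}"},
        ],
    )
    return resp.choices[0].message.content

def generate_dialogue(seed_question, rounds=4):
    """Alternate the two models for a fixed number of rounds."""
    history = [("user", seed_question)]
    for _ in range(rounds):
        history.append(("assistant", assistant_turn(history)))
        history.append(("user", user_turn(history)))
    history.append(("assistant", assistant_turn(history)))
    return history
```

In this arrangement the user-role model only ever sees a transcript plus an instruction to continue the conversation, which is what lets it imitate a human questioner rather than an answerer.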
If ChatGPT is simply allowed to generate freely from a handful of seed conversations and questions, the result tends to suffer from narrow topics and repeated content, making it difficult to guarantee the diversity of the data. To this end, UltraChat systematically classified and designed the topics and task types covered by the conversation data, and carried out detailed prompt engineering for both the user model and the reply model. The data consists of three parts: questions about concepts, entities, and objects in the real world; creation and generation of new materials under user instructions; and assistance with existing materials, such as rewriting them as required.
These three parts of data cover most users' requirements for AI models. At the same time, each of the three types presents different challenges and requires a different construction method.
For example, the main challenge for the first part is how to cover common knowledge of human society as broadly as possible within a total of hundreds of thousands of conversations. To this end, the researchers filtered and constructed the data from two sources: automatically generated topics and entities drawn from Wikidata.
The challenge for the second and third parts mainly lies in simulating user instructions and keeping the user model's generations as diverse as possible across subsequent turns without deviating from the ultimate goal of the conversation (generating materials or rewriting materials as required). For this, the researchers carefully designed and experimented with the input prompts of the user model. After construction was completed, the authors also post-processed the data to mitigate the hallucination problem.
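One way to picture this prompt engineering is a user-model prompt template that carries both the overall goal of the conversation and the history so far. The template below is a hypothetical illustration of the idea, not the prompt UltraChat actually uses.

```python
# Hypothetical user-model prompt template: anchor the simulated user to an
# overall goal (e.g. "rewrite this paragraph in a formal tone") while still
# encouraging varied follow-up instructions. Not the project's actual prompt.
USER_MODEL_TEMPLATE = """You are simulating a human user of an AI assistant.
Overall goal of this conversation: {goal}

Conversation so far:
{history}

Write the user's next message. It should move the conversation toward the goal,
but vary the phrasing, level of detail, or constraints compared with earlier turns.
Reply with the user's message only."""

def build_user_prompt(goal, history):
    """Fill the template with the conversation goal and transcript."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return USER_MODEL_TEMPLATE.format(goal=goal, history=transcript)
```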
Currently, the project has released the first two parts of the data, amounting to 1.24 million conversations, which should make it the largest related dataset in the open source community. The content covers a rich variety of real-world conversations, and the final part of the data will be released in the future.
The world-question data comes from 30 representative and diverse meta-topics, as shown in the figure below:
In addition, the project collected the 10,000 most frequently used named entities from Wikidata; used the ChatGPT API to generate 5 meta-questions for each entity; and, for each meta-question, generated 10 more specific questions and 20 related but more general questions. It then sampled 200,000 specific questions, 250,000 general questions, and 50,000 meta-questions, and generated 3 to 7 dialogue rounds for each question.
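This expansion from entities to questions is essentially a tree of API calls. The sketch below mirrors the counts described above (5 meta-questions per entity, then 10 specific and 20 general questions per meta-question); the prompt wording and the helper function are assumptions made for illustration.

```python
# Sketch of the entity -> meta-question -> specific/general question expansion.
# Counts follow the article (5 / 10 / 20); prompts and helpers are illustrative.
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    """Single-turn helper around the Chat Completions API."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def expand_entity(entity):
    """Expand one named entity into meta, specific, and general questions."""
    meta = ask(f"Write 5 distinct questions about '{entity}', one per line.").splitlines()
    questions = {"meta": meta, "specific": [], "general": []}
    for q in meta:
        questions["specific"] += ask(
            f"Given the question '{q}', write 10 more specific questions, one per line."
        ).splitlines()
        questions["general"] += ask(
            f"Given the question '{q}', write 20 related but more general questions, one per line."
        ).splitlines()
    return questions
```

Each sampled question would then be used as the seed of a 3 to 7 round dialogue generated by the dual-API loop described earlier.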
Next let’s look at a specific example:
We also tried the search function on the UltraChat data platform. For example, entering "music" automatically retrieves 10,000 sets of music-related ChatGPT conversation data, each of which is a multi-round conversation.
Searching for the keyword "math" returns 3,346 groups of multi-round conversations:
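Because the released data is distributed as JSON files of multi-round conversations, a similar keyword search can be reproduced locally. The sketch below assumes one JSON object per line with a "data" field holding the list of turns; the file name and field names are assumptions, so adjust them to the actual release.

```python
import json

def search_ultrachat(path, keyword):
    """Yield conversations whose text contains the keyword (case-insensitive).

    Assumes one JSON object per line with a "data" field holding the list of
    alternating user/assistant turns; adjust field names to the actual release.
    """
    keyword = keyword.lower()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            turns = record.get("data", [])
            if any(keyword in turn.lower() for turn in turns):
                yield record

# Example: count the "math" conversations in one shard of the released data.
# hits = sum(1 for _ in search_ultrachat("ultrachat_release.jsonl", "math"))
```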
Currently, UltraChat already covers a wide range of information domains, including medicine, education, sports, environmental protection, and other topics. The authors also tried supervised instruction fine-tuning of the open source LLaMA-7B model on UltraChat, and found that after only 10,000 training steps the model already produced very impressive results (a minimal fine-tuning sketch follows the examples below). Some examples are as follows:
World Knowledge: List 10 good Chinese and American universities respectively
Imagination question: What are the possible consequences after space travel becomes possible?
Syllogism: Is a whale a fish?
Hypothetical question: Prove that Jackie Chan is better than Bruce Lee
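For readers who want to try the same recipe, the sketch below shows the general shape of supervised fine-tuning with Hugging Face transformers and datasets. The checkpoint name, data formatting, and hyperparameters are illustrative assumptions, not the authors' training configuration.

```python
# Minimal supervised fine-tuning sketch with Hugging Face transformers/datasets.
# Checkpoint, chat formatting, and hyperparameters are illustrative assumptions,
# not the UltraChat authors' actual training setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "huggyllama/llama-7b"  # any LLaMA-7B checkpoint you have access to
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

def to_text(example):
    # Flatten the alternating user/assistant turns into one training string.
    roles = ["User", "Assistant"]
    turns = [f"{roles[i % 2]}: {t}" for i, t in enumerate(example["data"])]
    return {"text": "\n".join(turns) + tokenizer.eos_token}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=2048)

raw = load_dataset("json", data_files="ultrachat_release.jsonl", split="train")
train_ds = raw.map(to_text).map(tokenize, remove_columns=raw.column_names + ["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-7b-ultrachat",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        max_steps=10_000,  # roughly the 10,000 steps mentioned in the article
        bf16=True,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

On a single GPU, a 7B model would typically also need gradient checkpointing or a parameter-efficient method such as LoRA to fit in memory; the sketch omits those details.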
Overall, UltraChat is a high-quality, wide-ranging ChatGPT conversation dataset that can be combined with other datasets to significantly improve the quality of open source conversation models. At present, UltraChat has only released the English version, but a Chinese version of the data will also be released in the future. Interested readers are welcome to explore it.