I used personal WeChat chat records and blog posts to create my own digital clone AI
Besides flying a plane, cooking the perfect rib roast, getting 6-pack abs, and making my company a ton of money, one thing I've always wanted to do is implement a chatbot.
From the Little Yellow Chicken bot of years ago, which replied through simple keyword matching, to ChatGPT, which now rivals human intelligence, chat AI has kept improving. But these bots are somewhat different from what I had in mind.
I chat with a lot of people on WeChat, some more and some less. I also speak up in group chats, write blog posts and official-account articles, leave comments on many social media platforms, and post on Weibo. These are the traces I leave in the online world. To a certain extent they make up the world's perception of me, and from that perspective they also make up me. Feed all of this data (my replies to different messages, every article and every sentence I have written, every Weibo post) into a neural network model to update its parameters, and in theory you get a digital copy of me.
In principle, this is different from telling ChatGPT "Please play the role of a person named Xiao Wang, whose experience is XXX". With ChatGPT's intelligence, such role-play is effortless and can pass for the real thing, but not one of its hundreds of billions of parameters has changed. This is "playing" rather than "reshaping": it picks up some information from the text you gave it and then uses its intelligence to deal with you.
I like to put not-very-useful metaphors in my articles and to wrap up with a summary at the end. When chatting, I like to brush things off with "okay" and to express surprise with "wtf". Sometimes I am reticent; other times I talk endlessly. These are traits I can perceive. Beyond them are more ingrained habits that I cannot even detect, and these subtle, fuzzy things I cannot describe to ChatGPT. It is like introducing yourself: however rich the introduction, it is still far from the real you, and sometimes even the opposite, because the moment we become aware of our own existence we are performing ourselves. Only when we are unaware of it and simply immersed in life are we truly ourselves.
After ChatGPT was released, I started learning the technical principles of large text models out of interest. It felt like enlisting in the Nationalist army in 1949: for an individual enthusiast, there was no longer any chance of surpassing ChatGPT in any respect or in any small vertical field. And since it is not open source, there was nothing to do with it except use it.
But the open-source pre-trained text models that have appeared in the past two months, such as the well-known LLaMA and ChatGLM-6B, made me want to clone myself again. Last week, I was ready to give it a try.
First, I need data: enough of it, and all generated by me. The simplest sources are my WeChat chat records and my blog. I have never fully cleared my WeChat history, so with records going back to 2018, WeChat takes up 80 GB of storage on my phone. It has always felt like someone had taken over a room in my house. Now, if I can put this data to use, I can make peace with those 80 GB.
I backed up my WeChat chat history a few years ago, and I dug up the tool I used back then: an open-source project on GitHub called WechatExporter. I will put the link at the end of the article. With this tool, you can back up all the WeChat chat records from an iPhone onto a Windows computer and export them as plain text. This takes patience, because the entire phone must first be backed up to the computer, and only then does the tool read the WeChat records from the backup file and export them.
I spent about 4 hours on the backup and then quickly exported all my WeChat chat records, which came out as many text files, one per conversation.
These include both group chats and one-on-one chats.
Then I started cleaning the data. I mostly lurk in most groups, so I picked out the few groups where I was more active, plus the chat records with some individuals whom I chatted with a lot and who were willing to let me use the records for this. In the end, about 50 chat text files were enough for me to use.
I wrote a Python script to traverse these text files, find every message I sent along with the message just before it, turn each pair into a conversation format, and store the result as JSON. That gave me my own WeChat chat dataset.
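The pairing step can be sketched roughly as follows. This is not the script described above, only a minimal illustration under stated assumptions: the export format of one `sender: text` message per line, the name `Me`, and the field names `prompt` and `response` are all hypothetical, and the sketch only keeps pairs where the preceding message came from someone else.

```python
import json

def build_pairs(lines, my_name):
    """Pair each of my messages with the message sent just before it by someone else."""
    pairs = []
    prev_sender, prev_text = None, None
    for line in lines:
        # Assumed (hypothetical) export format: "sender: text", one message per line.
        sender, sep, text = line.partition(": ")
        if not sep:
            continue  # skip lines that do not look like a message
        text = text.strip()
        if sender == my_name and prev_sender not in (None, my_name):
            pairs.append({"prompt": prev_text, "response": text})
        prev_sender, prev_text = sender, text
    return pairs

sample = [
    "Alice: are you coming tonight?",
    "Me: okay",
    "Me: leaving in ten minutes",
    "Alice: the restaurant is full",
    "Me: wtf",
]
pairs = build_pairs(sample, "Me")
print(json.dumps(pairs, ensure_ascii=False, indent=2))
```

Consecutive messages from me are not paired with each other here; whether to merge them into one longer reply instead is a design choice the sketch leaves open.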
Meanwhile, I asked a colleague to crawl all of my blog posts. Only after he had crawled them and sent them over did I remember that I could have exported them directly with the export function built into the blog's admin panel. The blog data was very clean, but at first I did not know how to use it: I wanted to train a chat model, and blog posts are long passages, not chats. So for the first training run I used only the WeChat chat records.
I chose ChatGLM-6B as the pre-trained model. For one thing, it already handles Chinese well enough. For another, with 6 billion parameters, my machine can run it without much effort. A further reason is that there are already several fine-tuning projects for it on GitHub (I will list them at the end of the article). Besides, it can be abbreviated as 6B, which shares a "6" with 6pen, the product I built, and that made me even more inclined to use it.
Since I ended up with about 100,000 usable WeChat chat records, I set a relatively low learning rate and increased the number of epochs. One night a few days ago, I finished writing the training script before bed, started the run, and went to sleep, hoping it would be done by the time I woke up. I ended up waking almost every hour that night.
When I got up in the morning, training had finished. Unfortunately, the loss had not dropped much, which meant the model trained over 12 hours was not very good. But as a deep learning novice, I was thankful it had run without errors at all, so instead of feeling disappointed, I started chatting with the model.
To add a sense of ceremony, I did not want to chat in a Jupyter notebook or a dark terminal. I found an open-source front-end chat page, modified it slightly, deployed the model behind an API, and had the front-end page call that API, which made it feel much more like a real chat.
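For readers wondering what "a model behind an API" might look like, here is a minimal sketch using only the Python standard library. It is an illustration, not the deployment described above: `generate_reply` is a hypothetical placeholder for the real model call, and the JSON fields `message` and `reply` are assumed names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate_reply(message):
    # Hypothetical placeholder: a real deployment would call the fine-tuned model here.
    return "echo: " + message

class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, e.g. {"message": "hello"}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        reply = generate_reply(payload.get("message", ""))
        body = json.dumps({"reply": reply}, ensure_ascii=False).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet

def make_server(port=8000):
    """Build (but do not start) an HTTP server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), ChatHandler)
```

Calling `make_server().serve_forever()` would start it, after which a chat page could POST each user message to the endpoint and display the `reply` field.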
Please don’t laugh at me. I used my own 100,000 WeChat chat records to train the model. The following is the first conversation between me and him (or it?)
I tried again, and the result was still not very good. I am not the kind of person who refuses to show anything until it is polished to perfection, so I was not shy about sharing it directly with a few friends. Their feedback was that it did seem a bit like me, and they sent me screenshots of their conversations.
In this first version, the model does resemble me in some ways. I cannot say exactly how, but it feels a little like me. If you ask it where I went to university or where my hometown is, it will not give accurate information and is bound to be wrong, because few people in my chat history ever asked me such things. In a way, this model does not understand me. It is like a clone. When I receive a WeChat message with content A and reply with B, there are reasons behind it, and some of those reasons are stored in the seven to eight billion neurons of my physical brain. In theory, if I generated enough data, perhaps hundreds of billions of records, then an AI model with enough parameters could come very close to my brain. 100,000 records may be a little few, but it is still enough to change some of the model's 6 billion parameters and bring it closer to me than the original pre-trained model.
It also has a bigger shortcoming: it only squeezes out a few words at a time, and its answers are very brief. Although this often matches my WeChat chat style, it is not what I want. I want it to say more.
At this point I suddenly thought of my blog. How could I convert the posts into questions and answers? I thought of ChatGPT. With my carefully constructed prompt, it successfully turned a passage from one of my blog articles into multiple dialogue-style question-and-answer pairs.
Sometimes ChatGPT returns content that does not follow the format, so I wrote a proofreading script that rewrites the various non-conforming responses into standard JSON while keeping the field names unchanged.
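A proofreading step of this kind might look like the sketch below. The actual script and its field names are not shown in the article, so everything here is an assumption: the standard fields are taken to be `question` and `answer`, the alias table is hypothetical, and the only failure modes handled are extra chatter around the JSON array and inconsistent key names.

```python
import json
import re

# Hypothetical mapping from key variants to the standard field names.
KEY_ALIASES = {"question": "question", "q": "question",
               "answer": "answer", "a": "answer"}

def normalize_qa(raw):
    """Return a list of {"question", "answer"} dicts, or [] if nothing is recoverable."""
    # Grab the outermost JSON array, ignoring any text before or after it.
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    cleaned = []
    for item in items:
        if not isinstance(item, dict):
            continue
        fixed = {}
        for key, value in item.items():
            name = KEY_ALIASES.get(str(key).strip().lower())
            if name and isinstance(value, str):
                fixed[name] = value.strip()
        # Keep only complete pairs.
        if "question" in fixed and "answer" in fixed:
            cleaned.append(fixed)
    return cleaned

raw = ('Sure, here you go:\n'
       '[{"Q": "Why write blogs?", "A": " To think clearly. "}, {"question": "incomplete"}]\n'
       'Hope this helps!')
print(normalize_qa(raw))
```

Responses that cannot be parsed at all come back as an empty list, so the calling script can simply retry that passage.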
Then I wrapped this in an interface, hosted it on a server in Hong Kong, and wrote a script on my computer that split my blog posts into 500-word pieces and converted them into questions and answers in batches. Limited by the speed of the ChatGPT API, it took almost another whole night to convert my more than two hundred blog posts into a dataset of almost 5,000 conversations.
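The splitting step could be done along these lines. This is a sketch under assumptions, not the script used above: it counts characters rather than words (which is the natural unit for Chinese text), and it breaks only at sentence-ending punctuation so that no sentence is cut in half. The function name and the `max_chars` parameter are illustrative.

```python
import re

def split_into_chunks(text, max_chars=500):
    """Split text into chunks of at most max_chars, breaking only after sentence ends."""
    # Split after Chinese or English sentence-ending punctuation, keeping it attached.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

post = "abc." * 10
chunks = split_into_chunks(post, max_chars=10)
print(len(chunks), chunks[0])
```

A single sentence longer than `max_chars` is kept whole as its own oversized chunk, which seemed a safer default for this sketch than cutting it mid-sentence.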
At this point I faced a choice. If I added the blog conversations to the WeChat dataset and trained on both together, the blog conversations (about 5,000 against roughly 100,000 chat records) would make up too small a share to have much impact; in other words, the result would be little different from the previous model. The other option was to train a new model on the article data alone.
I asked the algorithm guy at 6pen for help. After confirming that model weights can be merged and getting a merging script from him, I chose the latter approach.
With 5,000 question-and-answer pairs, training was fast: an hour or two was enough. That afternoon I checked the training progress while writing documents, and once training finished before I left work, I started merging the models, combining the model trained earlier on WeChat chat records with the model trained on my blog.
The weights of the two models can be configured freely, so I tried a variety of ratios. Since the loss rebounded somewhat as the model converged, I also tried checkpoints saved at different numbers of steps.
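The merging script itself is not shown in this article, so the following is only a guess at the general idea: one common way to merge two fine-tuned models that share an architecture is an element-wise weighted average of their parameters. The sketch uses plain Python lists in place of real tensors, and the parameter names and the 7-to-2 ratio are purely illustrative.

```python
def merge_weights(model_a, model_b, weight_a, weight_b):
    """Element-wise weighted average of two models with identical parameter names and sizes."""
    if model_a.keys() != model_b.keys():
        raise ValueError("models must have the same parameter names")
    total = weight_a + weight_b
    merged = {}
    for name in model_a:
        a, b = model_a[name], model_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for {name}")
        merged[name] = [(weight_a * x + weight_b * y) / total for x, y in zip(a, b)]
    return merged

# Toy "models": parameter name -> flat list of values (stand-ins for real tensors).
chat_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
blog_model = {"layer.weight": [10.0, 20.0], "layer.bias": [9.0]}

merged = merge_weights(chat_model, blog_model, 7, 2)
print(merged)
```

Because the ratio is normalized by its sum, changing it moves every parameter along the straight line between the two models, which makes it cheap to try many ratios without retraining.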
I talked with these models all night to find the one that worked best, but it turned out to be hard to pick. The models behaved differently: some were more irritable, some were fawning, some were very cold, and some were very enthusiastic. Then I realized that, to some extent, these may be different sides of me. People who work in deep learning and know its principles will surely scoff at this interpretation, but it is not without a certain romance.
In the end, I found that with the chat and article models weighted 7 to 2, and using the checkpoint saved at step 6600, the merged model was a little better most of the time. Of course, it was already two in the morning and my judgment may have been impaired, but either way I settled on him as the final model.
I talked to him a lot.
Clearly, he is very different from ChatGPT. He cannot help me write code or copy, and he is not smart enough. Because the training data contains no multi-turn dialogue, his grasp of multi-turn conversation is even worse. He also does not know me very well: apart from his own name (that is, my name), he cannot accurately answer much else about me. Yet he often says a few simple words that give me a familiar feeling. Maybe it is an illusion; who knows.
In general, all the well-known large text models that exist today are trained on massive amounts of data, and the training process tries to take in all the information humanity has produced. This information lets the model's vast number of parameters be optimized continuously (say, parameter 2043475 is increased by 4 and parameter 9047113456 is decreased by 17), yielding a smarter neural network model.
These models are getting smarter, but they resemble humanity as a whole rather than any individual. When I retrain a model on my own data, I get something completely different: a model that is closer to an individual. Neither the amount of data I have generated nor the parameter count and structure of the pre-trained model I used may be enough to support a model that resembles my brain, but it is still a very interesting thing to try.
I redeployed the web page and added a serverless layer in front of it for protection, so now anyone can try chatting with this digital version of me. The service runs on a single server with my heirloom V100, so if many people use it at once, all sorts of problems may occur. I will put the link at the bottom.
The more data you produce actively and sincerely, the more likely you are to get a digital copy that is closer to you in the future. This may raise moral and even ethical issues, but it is very likely to happen. Once I accumulate more data, or have a better pre-trained model or training method, I may try training again at any time. This will not be a for-profit or business-related project; to some extent, it is a way for me to pursue myself.
Thinking about it this way, life seems to be less lonely.
My digital clone online chat: https://ai.greatdk.com
You can also try it by clicking "Read the original" at the bottom. Because only one heirloom V100 graphics card is serving inference, I have set a request limit. Even so, it may hang. I restart the service every 10 minutes, so if you are really interested and find that it has hung, try again after a while.
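How the request limit is implemented is not described here. As one possible illustration, a sliding-window limiter like the sketch below would cap the number of requests accepted in any given time window; the class name, the limit of 2 requests, and the 60-second window are all hypothetical.

```python
import time
from collections import deque

class RequestLimiter:
    """Allow at most max_requests within any sliding window of window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self, now=None):
        # Use a monotonic clock unless the caller supplies a timestamp (handy for tests).
        now = time.monotonic() if now is None else now
        # Forget requests that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False

limiter = RequestLimiter(max_requests=2, window_seconds=60)
results = [limiter.allow(now=t) for t in (0, 1, 2, 60)]
print(results)
```

Rejected requests are not recorded, so a client that keeps retrying does not push back the moment at which it will be allowed again.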
Projects I use and refer to: