Chatbots are digesting the internet, and the internet wants to reap the rewards

王林 (forward)
2023-05-16 16:31:06

Artificial intelligence companies are exploiting the content created by countless people on the Internet without their consent or compensation. Now, a growing number of tech and media companies are demanding payment in hopes of getting a piece of the chatbot craze.

If you’ve ever blogged, posted on Reddit, or shared anything on the open web, chances are you’ve contributed to the birth of the latest generation of artificial intelligence.

Google’s Bard, OpenAI’s ChatGPT, Microsoft’s new version of Bing, and similar tools provided by other startups all integrate artificial intelligence language models. But these clever robot writers wouldn’t be possible without the vast amounts of text freely available on the Internet.

Web content has once again become a focus of competition, in a way not seen since the early days of the search-engine wars. Tech giants are trying to claim this irreplaceable, newly valuable source of information as their own territory.

Tech and media companies that once paid this little mind are realizing that the data is critical to building a new generation of language-based artificial intelligence. Reddit, one of OpenAI's valuable training resources, recently announced that it will charge AI companies for data access. OpenAI declined to comment.

Recently, Twitter also began charging for data access services, a change that affects many aspects of Twitter’s business, including the use of data by artificial intelligence companies. The News Media Alliance, which represents publishers, announced in a paper this month that companies should pay licensing fees when they use work produced by their members to train artificial intelligence.

Prashanth Chandrasekar, CEO of Stack Overflow, a Q&A site for programmers, said: “What’s really important to us is ownership of information.” The company plans to start charging large AI companies for access to user-generated content on the site. "The Stack Overflow community has spent so much effort answering questions over the past 15 years, and we really want to make sure that effort is rewarded."

Earlier AI services, such as OpenAI’s image generator Dall-E 2, have already been accused of large-scale theft of intellectual property, and the companies behind them are now facing lawsuits over those allegations. The battle over AI-generated text may be even bigger, involving not only compensation and credit but also privacy.

But Emily M. Bender, a computational linguist at the University of Washington, notes that under current law, an AI system cannot be held responsible for its actions.

The dispute stems from the way AI chatbots are built. At their core is a so-called large language model, which learns to imitate the content and manner of human speech by absorbing and processing vast amounts of existing text. This data is different from the behavioral and personal information that services such as Meta Platforms, Facebook's parent company, use to target the ads we are used to seeing online.
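To make the "absorbing and processing" step concrete, here is a toy sketch. Real large language models use neural networks with billions of parameters, but the underlying idea, predicting the next word from statistics of text that humans actually wrote, can be shown with a tiny bigram counter. The corpus, function names, and outputs here are invented purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def predict_next(model, word):
    """Return the continuation seen most often in the training text."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# A miniature "web corpus" standing in for scraped text.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # "cat": it follows "the" most often
print(predict_next(model, "sat"))  # "on"
```

The point of the sketch is that the model has no knowledge of its own; everything it can "say" comes from counting what human writers put on the page, which is exactly why the training text is so valuable.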

This data is created by human users of various services, such as the hundreds of millions of posts made by Reddit users. Only on the internet can you find a large enough library of human-written words. Without it, none of today’s chat-based AI and related technologies would exist.

Jesse Dodge, a research scientist at the non-profit Allen Institute for Artificial Intelligence, found in a 2021 paper that Wikipedia and countless copyrighted news articles from media organizations large and small are present in the most commonly used web-crawl datasets. Both Google and Facebook have used this data to train large language models, and OpenAI uses a similar database.

OpenAI no longer discloses its data sources, but according to a 2020 paper published by the company, its large language model uses posts scraped from Reddit to filter and improve the data used to train its artificial intelligence.

Reddit spokesman Tim Rathschmidt said it was unclear how much revenue charging for data access would generate, but that he believed the data Reddit holds could help improve today's state-of-the-art large language models.

Reports say publishing-industry executives have been investigating how much of their content was used to train ChatGPT and other AI tools, how they should be compensated, and what laws they can invoke to defend their rights. But Danielle Coffey, the News Media Alliance's general counsel, said that so far no agreement has been reached with any owner of a large AI chat engine (such as Google, OpenAI, or Microsoft) to pay for training data scraped from the alliance's members.

Twitter did not respond to a request for comment. Microsoft declined to comment. A Google spokesperson said: "We have a long history of helping creators and publishers monetize their content and strengthen relationships with their audiences. In line with our AI principles, we will continue to innovate in a responsible and ethical way." The spokesperson added that "it is still early days" and that Google is soliciting input on how to build artificial intelligence that benefits the open web.

Legal and Ethical Quagmire

In some cases, copying data available on the open web (also known as scraping) is legal, although the details of how, where, and when companies are allowed to do so are still being debated.

Most companies and organizations put their data online because they want it to be discovered and indexed by search engines so that people can find their content. Copying that data to train an AI that replaces the need to visit the original source is something entirely different.
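The line between being indexed and being scraped is partly mediated by robots.txt, the long-standing convention sites use to say which crawlers may fetch which pages. A minimal sketch with Python's standard-library parser follows; the crawler names and rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: the site welcomes a search crawler
# but bars a (hypothetical) AI-training crawler entirely.
rules = """
User-agent: SearchBot
Allow: /

User-agent: AITrainingBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("SearchBot", "https://example.com/post/42"))      # True
print(rp.can_fetch("AITrainingBot", "https://example.com/post/42"))  # False
```

Compliance with robots.txt is voluntary, which is part of why publishers are now pushing for licensing deals and legislation rather than relying on the convention alone.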

Computational linguist Bender said technology companies that collect information from the internet to train AI operate on the principle "we can take it, therefore it is ours." Converting text (including books, magazine articles, personal blog essays, patents, scientific papers, and Wikipedia content) into chatbot answers severs the link to the source material. It also makes it harder for users to verify what the bot tells them, a big problem for systems that often lie.

These large-scale scrapes also steal our personal information. Common Crawl is a non-profit organization that has been crawling vast amounts of content on the open web for more than a decade and making its database freely available to researchers. Common Crawl's database is also used as a starting point for companies looking to train artificial intelligence, including Google, Meta, OpenAI and others.

Sebastian Nagel, a data scientist and engineer at Common Crawl, said a blog post you wrote years ago and have since deleted may still be present in the training data used by OpenAI, which trains its AI on web content collected years ago.

Unlike removing content from the search indexes run by Google and Microsoft, removing personal information from a trained AI requires retraining the entire model, Bender said. Dodge added that because retraining a large language model is so expensive, a company is unlikely to do so even if a user can prove personal data was used in training. Owing to the enormous computing power required, such models can cost tens of millions of dollars to train.

But Dodge added that in most cases it would also be difficult to get an AI trained on a data set that includes personal information to regurgitate that information. OpenAI said it has adjusted its chat-based system to reject requests for personal information. The European Union and U.S. governments are considering new laws and regulations to govern this type of artificial intelligence.

Accountability and Profit Sharing

Some proponents of AI believe these systems should have access to all the data their engineers can get, because that's how humans learn; logically, why shouldn't a machine do the same?

Bender said that aside from the fact that artificial intelligence is not yet the same as humans, there is a problem with the above point of view, that is, according to current laws, artificial intelligence cannot be responsible for its own actions. People who plagiarize the work of others, or who try to repackage misinformation as truth, can face severe consequences, but a machine and its creators do not share the same responsibility.

Of course, that may not always be the case. Just as Getty sued image-generating AI companies for using its intellectual property as training data, businesses and other organizations may end up taking the makers of chat-based AI to court if their content is used without authorization or a licensing agreement.

Are the personal essays written by countless people, the posts on obscure forums and vanished social networks, and all the rest really what makes today's chatbots such capable writers? Perhaps the only reward their creators will ever see is knowing they contributed something to how chatbots use language.

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn for deletion.