Chatbots are digesting the internet, and the internet wants to reap the rewards

王林 (forward)
2023-05-16 16:31:06

Artificial intelligence companies are exploiting the content created by countless people on the Internet without their consent or compensation. Now, a growing number of tech and media companies are demanding payment in hopes of getting a piece of the chatbot craze.

If you’ve ever blogged, posted on Reddit, or shared anything on the open web, chances are you’ve contributed to the birth of the latest generation of artificial intelligence.

Google’s Bard, OpenAI’s ChatGPT, Microsoft’s new version of Bing, and similar tools provided by other startups all integrate artificial intelligence language models. But these clever robot writers wouldn’t be possible without the vast amounts of text freely available on the Internet.

Web content has once again become a focus of competition, in a way not seen since the early days of the search-engine wars. Tech giants are trying to claim this irreplaceable, newly valuable source of information as their own territory.

Tech and media companies that once paid this little mind are realizing that the data is critical to building a new generation of language-based artificial intelligence. Reddit, one of OpenAI's valuable training resources, recently announced that it will charge AI companies for data access. OpenAI declined to comment.

Recently, Twitter also began charging for data access services, a change that affects many aspects of Twitter’s business, including the use of data by artificial intelligence companies. The News Media Alliance, which represents publishers, announced in a paper this month that companies should pay licensing fees when they use work produced by their members to train artificial intelligence.

Prashanth Chandrasekar, CEO of Stack Overflow, a Q&A site for programmers, said: “What’s really important to us is ownership of information.” The company plans to start charging large AI companies for access to user-generated content on the site. "The Stack Overflow community has spent so much effort answering questions over the past 15 years, and we really want to make sure that effort is rewarded."

Earlier AI services, such as OpenAI’s image generator Dall-E 2, have already been accused of large-scale theft of intellectual property, and the companies behind them are now facing lawsuits over those allegations. The battle over AI-generated text may be even bigger, involving not only compensation and credit but also privacy.

But Emily M. Bender, a computational linguist at the University of Washington, notes that under current law, an AI system cannot be held responsible for its actions.

The dispute stems from the way AI chatbots are built. At their core is a so-called large language model, which learns to imitate the content and manner of human speech by absorbing and processing vast amounts of existing text. This data is different from the behavioral and personal information that services such as Meta Platforms, Facebook's parent company, use to target the ads we are used to seeing online.
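To make the "absorbing and processing" step concrete, here is a toy sketch. Real large language models use neural networks with billions of parameters, but the underlying idea, predicting the next word from statistics of text that humans actually wrote, can be shown with a tiny bigram counter. The corpus, function names, and outputs here are invented purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the corpus."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def predict_next(model, word):
    """Return the continuation seen most often in the training text."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# A miniature "web corpus" standing in for scraped text.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

model = train_bigram_model(corpus)
print(predict_next(model, "the"))  # "cat": it follows "the" most often
print(predict_next(model, "sat"))  # "on"
```

The point of the sketch is that the model has no knowledge of its own; everything it can "say" comes from counting what human writers put on the page, which is exactly why the training text is so valuable.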

This data is created by human users of various services, such as the hundreds of millions of posts made by Reddit users. Only on the internet can you find a large enough library of human-written words. Without it, none of today’s chat-based AI and related technologies would exist.

Jesse Dodge, a research scientist at the non-profit Allen Institute for Artificial Intelligence, found in a 2021 paper that Wikipedia and countless copyrighted news articles from media organizations large and small are present in the most commonly used web-crawl datasets. Both Google and Facebook have used this data to train large language models, and OpenAI uses a similar database.

OpenAI no longer discloses its data sources, but according to a 2020 paper published by the company, its large language model uses posts scraped from Reddit to filter and improve the data used to train its artificial intelligence.

Reddit spokesman Tim Rathschmidt said it was unclear how much revenue charging for data access would generate, but that he believed the data Reddit holds could help improve today's state-of-the-art large language models.

Reports say publishing-industry executives have been investigating how much of their content was used to train ChatGPT and other AI tools, how they should be compensated, and what laws they can invoke to defend their rights. But Danielle Coffey, the News Media Alliance's general counsel, said that so far no agreement has been reached with any owner of a large AI chat engine (such as Google, OpenAI, or Microsoft) to pay for training data scraped from the alliance's members.

Twitter did not respond to a request for comment. Microsoft declined to comment. A Google spokesperson said: "We have a long history of helping creators and publishers monetize their content and strengthen relationships with their audiences. In line with our AI principles, we will continue to innovate in a responsible and ethical way." The spokesperson added that "it is still early days" and that Google is soliciting input on how to build artificial intelligence that benefits the open web.

Legal and Ethical Quagmire

In some cases, copying data available on the open web (also known as scraping) is legal, although the details of how, where, and when companies are allowed to do so are still being debated.

Most companies and organizations put their data online because they want it to be discovered and indexed by search engines so that people can find their content. Copying that data to train an AI that replaces the need to visit the original source is something entirely different.
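The line between being indexed and being scraped is partly mediated by robots.txt, the long-standing convention sites use to say which crawlers may fetch which pages. A minimal sketch with Python's standard-library parser follows; the crawler names and rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: the site welcomes a search crawler
# but bars a (hypothetical) AI-training crawler entirely.
rules = """
User-agent: SearchBot
Allow: /

User-agent: AITrainingBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("SearchBot", "https://example.com/post/42"))      # True
print(rp.can_fetch("AITrainingBot", "https://example.com/post/42"))  # False
```

Compliance with robots.txt is voluntary, which is part of why publishers are now pushing for licensing deals and legislation rather than relying on the convention alone.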

Computational linguist Bender said technology companies that collect information from the internet to train AI operate on the principle "we can take it, therefore it is ours." Converting text (including books, magazine articles, personal blog essays, patents, scientific papers, and Wikipedia content) into chatbot answers severs the link to the source material. It also makes it harder for users to verify what the bot tells them, a big problem for systems that often lie.

These large-scale scrapes also steal our personal information. Common Crawl is a non-profit organization that has been crawling vast amounts of content on the open web for more than a decade and making its database freely available to researchers. Common Crawl's database is also used as a starting point for companies looking to train artificial intelligence, including Google, Meta, OpenAI and others.

Sebastian Nagel, a data scientist and engineer at Common Crawl, said a blog post you wrote years ago and have since deleted may still be present in the training data used by OpenAI, which trains its AI on web content collected years ago.

Unlike removing content from the search indexes run by Google and Microsoft, removing personal information from a trained AI requires retraining the entire model, Bender said. Dodge added that because retraining a large language model is so expensive, a company is unlikely to do so even if a user can prove personal data was used in training. Owing to the enormous computing power required, such models can cost tens of millions of dollars to train.

But Dodge added that in most cases it would also be difficult to get an AI trained on a data set that includes personal information to regurgitate that information. OpenAI said it has adjusted its chat-based system to reject requests for personal information. The European Union and U.S. governments are considering new laws and regulations to govern this type of artificial intelligence.

Accountability and Profit Sharing

Some proponents of AI believe these systems should have access to all the data their engineers can get, because that's how humans learn; logically, why shouldn't a machine do the same?

Bender said that aside from the fact that artificial intelligence is not yet the same as humans, there is a problem with the above point of view, that is, according to current laws, artificial intelligence cannot be responsible for its own actions. People who plagiarize the work of others, or who try to repackage misinformation as truth, can face severe consequences, but a machine and its creators do not share the same responsibility.

Of course, that may not always be the case. Just as Getty sued image-generating AI companies for using its intellectual property as training data, businesses and other organizations may end up taking the makers of chat-based AI to court if their content is used without authorization or a licensing agreement.

Are the personal essays written by countless people, the posts on obscure forums and vanished social networks, and all the rest really what makes today's chatbots such capable writers? Perhaps the only reward their creators will ever see is knowing they contributed something to how chatbots use language.

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn for deletion.