Research shows: Data sources remain the main bottleneck for AI

Data is the lifeblood of the machine. Without it, you can’t build anything related to AI. Many organizations are still struggling to get good, clean data to sustain their AI and machine learning initiatives, according to Appen's State of AI and Machine Learning report released this week.

According to Appen's survey of 504 business leaders and technology experts, of the four stages of the AI lifecycle (data sourcing, data preparation, model training and deployment, and human-guided model evaluation), data sourcing consumes the most resources, costs the most, takes the longest, and is the most challenging.

On average, data sourcing consumes 34% of an organization’s AI budget, while data preparation and model testing and deployment each account for 24%, and model evaluation accounts for 15%, according to Appen’s survey. The survey was conducted by The Harris Poll and included IT decision-makers, business leaders and managers, and technology practitioners from the United States, United Kingdom, Ireland and Germany.

In terms of time, data sourcing consumes approximately 26% of an organization’s time, while data preparation and model testing and deployment each account for 24%, and model evaluation accounts for 23%. Finally, 42% of technologists consider data sourcing to be the most challenging stage of the AI lifecycle, compared to model evaluation (41%), model testing and deployment (38%), and data preparation (34%).
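The comparison across stages can be checked with a few lines of Python. This is only a sketch built from the budget and "most challenging" percentages quoted above; the dictionary and function names are illustrative and do not come from Appen's report.

```python
# Survey shares by lifecycle stage, in percent, as quoted in the text.
BUDGET_SHARE = {
    "data sourcing": 34,
    "data preparation": 24,
    "model testing and deployment": 24,
    "model evaluation": 15,
}
MOST_CHALLENGING = {
    "data sourcing": 42,
    "model evaluation": 41,
    "model testing and deployment": 38,
    "data preparation": 34,
}


def top_stage(shares):
    """Return the stage with the largest share."""
    return max(shares, key=shares.get)


def unallocated(shares):
    """Percentage points not covered by the listed stages."""
    return 100 - sum(shares.values())


print(top_stage(BUDGET_SHARE), top_stage(MOST_CHALLENGING), unallocated(BUDGET_SHARE))
```

The quoted budget shares add up to 97%, so about three percentage points are not attributed to any of the four stages in the article. The "most challenging" figures are not expected to sum to 100%.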


According to technology experts, data sourcing is the biggest challenge facing artificial intelligence. But business leaders see things differently...

Despite the challenges, organizations are making it work. According to Appen, four-fifths (81%) of respondents said they are confident they have enough data to support their AI initiatives. Perhaps the key to this success: The vast majority (88%) are augmenting their data by using external AI training data providers such as Appen.

However, the accuracy of the data is questionable. Appen found that only 20% of respondents reported data accuracy of more than 80%. Only 6% (fewer than 1 in 10) said their data was 90% accurate or better. In other words, for about 80% of organizations, at least one in five data items contains an error.
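The "one in five" figure follows from simple arithmetic, sketched below. It assumes that every respondent reported an accuracy figure, so that everyone outside the 20% group sits at or below 80% accuracy; the variable names are illustrative.

```python
from fractions import Fraction

# Figure quoted in the text: 20% of respondents report data accuracy above 80%.
share_reporting_above_80 = Fraction(20, 100)
accuracy_threshold = Fraction(80, 100)

# Assumption: everyone else is at or below the 80% accuracy threshold.
share_at_or_below_threshold = 1 - share_reporting_above_80

# At 80% accuracy or lower, the error rate is at least 20%, i.e. one item in five.
min_error_rate = 1 - accuracy_threshold

print(float(share_at_or_below_threshold), min_error_rate)
```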

With that in mind, it’s perhaps not surprising that nearly half (46%) of respondents agree that data accuracy is important “but we can fix it,” according to Appen’s survey. Only 2% said data accuracy is not a big need, while 51% agreed it is a critical need.

Appen CTO Wilson Pang appears to take a different view of data quality than the 48% of respondents (the 46% and 2% above) who do not consider it a critical need.

“Data accuracy is critical to the success of AI and ML models, as quality-rich data results in better model output and consistent processing and decision-making,” Pang said in the report. “To achieve good results, data sets must be accurate, comprehensive, and scalable.”


Over 90% of Appen respondents said they use pre-labeled data

The rise of deep learning and data-centric AI has shifted the driver of AI success from good data science and machine learning modeling to good data collection, management, and labeling. This is especially true for today’s transfer learning techniques, where AI practitioners start from a large pre-trained language or computer vision model and retrain a small set of layers on their own data.
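The idea behind transfer learning can be illustrated with a toy example in plain Python. This is a minimal sketch, not how production systems are built: the "pre-trained" base here is a hypothetical fixed two-unit layer standing in for a large model, and only a small logistic-regression head is trained on the practitioner's own data. All names and values are made up for illustration.

```python
import math

# Hypothetical "pre-trained" base layer; its weights stay frozen throughout.
FROZEN_BASE = [[1.0, -1.0], [0.5, 0.5]]


def base_features(x):
    """Apply the frozen base layer (linear map followed by tanh)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_BASE]


def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only the task-specific head (logistic regression) on frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = base_features(x)
            z = sum(w * f for w, f in zip(weights, feats)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            err = pred - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
            bias -= lr * err
    return weights, bias


def predict(head, x):
    """Classify x using the frozen base plus the trained head."""
    weights, bias = head
    feats = base_features(x)
    return 1 if sum(w * f for w, f in zip(weights, feats)) + bias > 0 else 0


# The practitioner's own small dataset: label is 1 when the first value is larger.
SAMPLES = [(2, 0), (1, -1), (0, 2), (-1, 1), (3, 1), (1, 3)]
LABELS = [1, 1, 0, 0, 1, 0]

head = train_head(SAMPLES, LABELS)
print([predict(head, x) for x in SAMPLES])
```

Because the base is never updated, the quality of the small labeled dataset is what determines how well the head performs, which is the point the paragraph above makes about data.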

Better data can also help keep unwanted bias from creeping into AI models and, more generally, prevent undesirable AI outcomes. This is especially true for large language models, said Ilia Shifrin, senior director of AI at Appen.

“Companies face another challenge with the rise of large language models (LLMs) trained on multilingual web crawler data,” Shifrin said in the report. "These models often exhibit bad behavior due to the abundance of toxic language, as well as racial, gender, and religious biases in the training corpora."

Bias in Web data raises some thorny issues. There are some workarounds (changing training regimens, filtering training data and model output, and learning from human feedback and testing), but more research is needed to establish a good standard for "human-centered" LLM benchmarks and model evaluation methods, Shifrin said.
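Of the workarounds mentioned, filtering training data is the easiest to picture. The sketch below uses a simple word blocklist purely for illustration; the placeholder terms and function names are hypothetical, and this is not the method described in Appen's report. Real pipelines typically rely on trained classifiers and human review rather than keyword matching.

```python
import re

# Hypothetical placeholder terms; a real blocklist would be curated carefully.
BLOCKLIST = {"badword", "slur"}


def is_clean(text, blocklist=BLOCKLIST):
    """Return True if none of the text's lowercase word tokens is blocklisted."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return not any(token in blocklist for token in tokens)


def filter_corpus(docs, blocklist=BLOCKLIST):
    """Keep only the documents that pass the blocklist check."""
    return [doc for doc in docs if is_clean(doc, blocklist)]


corpus = ["A perfectly fine sentence.", "This one contains a BADWORD in it."]
print(filter_corpus(corpus))
```

The same check could be applied to model output before it is shown to a user, which corresponds to the "model output" half of the filtering workaround.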

According to Appen, data management remains the biggest obstacle facing AI. The survey found that 41% of those involved in the AI lifecycle believe data management is the biggest bottleneck. Lack of data ranked fourth, with 30% citing it as the biggest obstacle to AI success.

But there’s some good news: The time organizations spend managing and preparing data is trending downward. It stands at just over 47% this year, compared with 53% in last year's report, Appen said.


Data accuracy levels may not be as high as some organizations would like

“The majority of respondents use external data providers, and it can be inferred that by outsourcing data sourcing and preparation, data scientists are saving the time needed to properly manage, clean, and label the data,” the data labeling company said.

However, judging by the relatively high error rates in the data, perhaps organizations should not scale back their data sourcing and preparation processes (whether internal or external). There are many competing needs when it comes to building and maintaining AI processes; hiring qualified data professionals was another top need identified by Appen. Until significant progress is made in data management, though, organizations should keep pressing their teams on the importance of data quality.

The survey also found that 93% of organizations strongly or somewhat agree that ethical AI should be the “foundation” of AI projects. Appen CEO Mark Brayan said it was a good start, but there was more work to be done. "The problem is that many people face the challenge of trying to build great AI with poor data sets, which creates a significant obstacle to achieving their goals," Brayan said in a press release.

In-house, custom-collected data remains the largest share of the data sets organizations use for AI, accounting for 38% to 42% of data, according to Appen’s report. Synthetic data made a surprisingly strong showing, accounting for 24% to 38% of an organization's data, while pre-labeled data (typically sourced from data service providers) accounted for 23% to 31% of the data.

Synthetic data in particular has the potential to reduce the incidence of bias in sensitive AI projects, with 97% of Appen respondents saying they use synthetic data “when developing inclusive training datasets.”

Other interesting findings from the report include:

  • 77% of organizations retrain their models monthly or quarterly;
  • 55% of U.S. organizations claim they are ahead of their competitors, compared with 44% in Europe;
  • 42% of organizations report “widespread” rollout of AI, compared with 51% in the 2021 State of Artificial Intelligence report;
  • 7% of organizations report their AI budget exceeds $5 million, compared to 9% last year.


Statement: This article is reproduced from 51CTO.COM.