Data sources are still the main bottleneck of artificial intelligence
According to Appen’s “State of Artificial Intelligence and Machine Learning” report released this week, organizations are still struggling to obtain good, clean data to sustain their artificial intelligence and machine learning programs.
Appen surveyed 504 business leaders and technology experts about the four stages of the AI life cycle: data sourcing; data preparation; model training and deployment; and human-led model evaluation. Of these, data sourcing consumes the most resources, takes the longest, and is the most challenging.
According to Appen’s survey, data sourcing consumes an average of 34% of an organization’s AI budget, with data preparation and model testing and deployment each accounting for 24%, and model evaluation accounting for 15%. The survey was conducted by Harris Poll and included IT decision-makers, business leaders and managers, and technology practitioners from the United States, United Kingdom, Ireland and Germany.
In terms of time, data sourcing takes approximately 26%, data preparation 24%, and model testing and deployment and model evaluation 23% each. Finally, 42% of technologists believe that data sourcing is the most challenging stage in the AI life cycle, followed by model evaluation (41%), model testing and deployment (38%), and data preparation (34%).
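The budget, time, and difficulty figures can be set side by side in a few lines of Python. This is only a tabulation of the percentages quoted above; the dictionary layout and the names `SURVEY` and `top_stage` are illustrative choices, not anything taken from the Appen report.

```python
# Percentages as quoted in this article from Appen's survey.
# The keys and structure are illustrative, not part of the report.
SURVEY = {
    "data sourcing":                {"budget": 34, "time": 26, "most_challenging": 42},
    "data preparation":             {"budget": 24, "time": 24, "most_challenging": 34},
    "model testing and deployment": {"budget": 24, "time": 23, "most_challenging": 38},
    "model evaluation":             {"budget": 15, "time": 23, "most_challenging": 41},
}

def top_stage(metric: str) -> str:
    """Return the stage with the highest value for the given metric."""
    return max(SURVEY, key=lambda stage: SURVEY[stage][metric])

# Data sourcing leads on every metric quoted in the article.
for metric in ("budget", "time", "most_challenging"):
    print(f"{metric}: {top_stage(metric)}")
```

As quoted, the budget and time shares add up to 97% and 96% rather than 100%, so they are best read as approximate.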
Despite the challenges, organizations are working hard to make it work. According to Appen, four-fifths (81%) of respondents said they have enough data to support their AI initiatives. The key to success may be this: The vast majority (88%) of companies augment their data by using external AI training data providers such as Appen.
However, the accuracy of the data is still open to question. Appen found that only 20% of respondents reported data accuracy of more than 80%. Only 6% (roughly one in 20 people) said their data was 90% accurate or better.
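The article does not say how respondents measured accuracy. One common approach, assumed here purely for illustration, is to audit a sample of labeled records against trusted reference labels and report the share that match. The function name `label_accuracy` and the sample data are hypothetical.

```python
def label_accuracy(labels, reference):
    """Fraction of labels that agree with trusted reference labels."""
    if len(labels) != len(reference):
        raise ValueError("labels and reference must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty sample")
    matches = sum(1 for got, want in zip(labels, reference) if got == want)
    return matches / len(labels)

# Hypothetical audit sample: 10 records, 2 of which disagree with the audit.
production = ["cat", "dog", "dog", "cat", "bird", "cat", "dog", "bird", "cat", "dog"]
audited    = ["cat", "dog", "cat", "cat", "bird", "cat", "dog", "bird", "dog", "dog"]

accuracy = label_accuracy(production, audited)
print(f"label accuracy: {accuracy:.0%}")
```

A sample like this one, with 8 of 10 labels matching, sits exactly at the 80% mark that only a fifth of respondents said they exceed.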
With this in mind, nearly half (46%) of respondents believe data accuracy is important, according to Appen’s survey. Only 2% believe data accuracy is not a big need, while 51% believe it is a critical need.
Appen’s Chief Technology Officer Wilson Pang takes a different view on the importance of data quality than the 48% of his customers who do not consider it critical.
“Data accuracy is critical to the success of AI and ML models, as quality-rich data yields better model output and consistent processing and decision-making,” the report said. “To achieve good results, data sets must be accurate, comprehensive, and scalable.”
The rise of deep learning and data-centric artificial intelligence has shifted the driver of AI success from good data science and machine learning modeling to good data collection, management, and labeling. This is especially true of today's transfer learning techniques, in which AI practitioners take a large pre-trained language or computer vision model and retrain a small part of it on their own data.
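The idea of retraining only a small part of a model can be sketched in plain Python. This is a toy illustration under stated assumptions, not how any production system works: `FROZEN_WEIGHTS` stands in for a large pre-trained model and is never updated, while a small logistic "head" is fitted on a tiny, made-up data set. All names and numbers here are hypothetical.

```python
import math
import random

# Stand-in for a large pre-trained model: a fixed feature extractor whose
# weights are never updated. (Hypothetical; a real one is a deep network.)
FROZEN_WEIGHTS = [[0.9, -0.4], [-0.3, 0.8], [0.5, 0.5]]

def frozen_features(x):
    """Map a raw 2-d input to 3 features using the frozen weights."""
    return [math.tanh(w[0] * x[0] + w[1] * x[1]) for w in FROZEN_WEIGHTS]

def predict(head, bias, x):
    """Probability of the positive class from the trainable head."""
    z = bias + sum(h * f for h, f in zip(head, frozen_features(x)))
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, epochs=200, lr=0.5):
    """Fit only the small head with stochastic gradient descent on log loss."""
    head, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            error = predict(head, bias, x) - y  # gradient of log loss w.r.t. z
            feats = frozen_features(x)
            head = [h - lr * error * f for h, f in zip(head, feats)]
            bias -= lr * error
    return head, bias

# Tiny hypothetical "own data" set: the label is 1 when x0 > x1.
random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(40)]
data = [(p, 1 if p[0] > p[1] else 0) for p in points]

head, bias = train_head(data)
correct = sum((predict(head, bias, x) > 0.5) == bool(y) for x, y in data)
print(f"training accuracy: {correct}/{len(data)}")
```

Because only the four head parameters are trained, the result depends heavily on the quality of the 40 labeled points, which is the article's point about data mattering more than modeling.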
Better data can also help keep unwanted bias from seeping into AI models and prevent the bad outcomes that AI can lead to. This is especially true for large language models.
The report says: “With the rise of large language models (LLMs) trained on multilingual web-scraped data, enterprises are facing another challenge. Because the training corpora are filled with toxic language and racial, gender, and religious biases, these models often exhibit undesirable behavior.”
Bias in web data raises thorny issues. There are some workarounds (changing training regimens, filtering training data and model outputs, and learning from human feedback and testing), but more research is needed to create good standards for “human-centered LLM” benchmarks and model evaluation methods.
Appen said data management remains the biggest obstacle facing artificial intelligence. The survey found that 41% of respondents consider data management the biggest bottleneck in the AI life cycle. In fourth place is a lack of data, which 30% of respondents cited as the biggest obstacle to AI success.
But there’s some good news: the time enterprises spend managing and preparing data is falling. This year's figure was just over 47%, compared with 53% in last year's report, Appen said.
“Since the majority of respondents use external data providers, it can be inferred that by outsourcing data sourcing and preparation, data scientists are saving time required to properly manage, clean, and label their data,” the data labeling company said.
However, judging by the relatively high error rates in the data, organizations should perhaps not scale back their data sourcing and preparation processes, whether internal or external. There are many competing needs when building and maintaining AI processes; the need to hire qualified data professionals was another top need identified by Appen. Until significant progress is made in data management, though, organizations should keep pressing their teams on the importance of data quality.
The survey also found that 93% of organizations strongly or somewhat agree that AI ethics should be a "foundation" of AI projects. Appen CEO Mark Brayan said it was a good start but that there was still much work to be done. "The problem is that many people are facing the challenge of trying to build great AI with poor data sets, which creates huge obstacles to achieving their goals," Brayan said in a press release.
According to Appen’s report, custom-collected data within enterprises remains the primary data set used for AI, accounting for 38% to 42% of data. Synthetic data showed surprisingly strong performance, accounting for 24% to 38% of an organization's data, while pre-labeled data (usually from data service providers) accounted for 23% to 31% of the data.
In particular, synthetic data has the potential to reduce the incidence of bias in sensitive AI projects, with 97% of Appen’s survey participants saying they used synthetic data in “developing inclusive training datasets.”
Other interesting findings from the report include: