
Data management has become the largest bottleneck in the development of artificial intelligence


The true sign of greatness in infrastructure is that it is easily overlooked. The better it performs, the less we think about it. The importance of mobile infrastructure, for example, only comes to mind when we find ourselves struggling to connect. Likewise, when we drive down a freshly paved highway, we give little thought to the road surface as it passes silently beneath our wheels. A poorly maintained highway, on the other hand, reminds us of its existence with every pothole, rut, and bump we encounter.

Infrastructure requires our attention only when it is missing, inadequate, or damaged. And in computer vision, the infrastructure—or rather, what's missing from it—is what many people are concerned about right now.

Computing sets the standard for infrastructure

Underpinning every AI/ML project (including computer vision) are three basic development pillars - data, algorithms/models, and compute. Of these three pillars, computing is by far the one with the most powerful and solid infrastructure. With decades of dedicated enterprise investment and development, cloud computing has become the gold standard for IT infrastructure across enterprise IT environments—and computer vision is no exception.

In the infrastructure-as-a-service model, developers have enjoyed on-demand, pay-as-you-go access to an ever-expanding pipeline of computing power for nearly 20 years. In that time, it has revolutionized enterprise IT by dramatically improving agility, cost efficiency, scalability, and more. With the advent of dedicated machine learning GPUs, it’s safe to say that this part of the computer vision infrastructure stack is alive and well. If we want to see computer vision and AI realize their full potential, it would be wise to use compute as the model on which the rest of the CV infrastructure stack is based.

The lineage and limitations of model-driven development

Until recently, algorithm and model development have been the driving force behind the development of computer vision and artificial intelligence. On both the research and commercial development side, teams have worked hard for years to test, patch, and incrementally improve AI/ML models, and share their progress in open source communities like Kaggle. The fields of computer vision and artificial intelligence made great progress in the first two decades of the new millennium by focusing their efforts on algorithm development and modeling.

In recent years, however, this progress has slowed, because model-centric optimization has run into the law of diminishing returns. Model-centric approaches also have several limitations. For example, retraining a model on the same data yields little additional improvement. They also demand more manual labor for data cleaning, model validation, and training, which takes valuable time and resources away from more innovative, revenue-generating work.

Today, through communities like Hugging Face, CV teams have free and open access to a vast array of large, complex algorithms, models, and architectures, each supporting different core CV capabilities, from object recognition and facial landmark detection to pose estimation and feature matching. These assets are as close to an "off-the-shelf" solution as one could imagine, giving computer vision and AI teams a ready-made foundation that can be trained for any number of specialized tasks and use cases.

Just as basic human abilities like hand-eye coordination can be applied and trained across a variety of skills, from playing table tennis to pitching a baseball, these modern ML algorithms can be trained to perform a range of specific tasks and applications. However, while humans specialize through years of practice and sweat, machines do so through training on data.
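To make the analogy concrete, here is a minimal, self-contained sketch in Python of the "generic backbone, task-specific head" pattern the article describes. The frozen random-projection "backbone" and the toy task are illustrative stand-ins invented for this example, not any real pretrained model; the point is simply that only a small task-specific head is trained on task data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pretrained "generic" feature extractor:
# a fixed projection from raw inputs to an 8-dim feature space.
W_backbone = rng.normal(size=(16, 8)) * 0.25

def features(x):
    return np.tanh(x @ W_backbone)  # frozen; never updated

# Toy task-specific data: binary labels from a hidden rule.
X = rng.normal(size=(200, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Task-specific linear head, trained by logistic-regression
# gradient descent while the backbone stays untouched.
w, b, lr = np.zeros(8), 0.0, 0.5
F = features(X)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # sigmoid
    grad = p - y                            # d(log-loss)/d(logit)
    w -= lr * F.T @ grad / len(y)
    b -= lr * grad.mean()

acc = ((F @ w + b > 0) == (y > 0.5)).mean()
print(f"training accuracy: {acc:.2f}")
```

The same pattern, swapping in a real pretrained model and a real labeled dataset, is what "training a generic model on specialized data" amounts to in practice.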

Data-Centric Artificial Intelligence and Big Data Bottlenecks

This has prompted many leading figures in the field of artificial intelligence to call for a new era of deep learning development, one in which the main engine of progress is data. Just a few years ago, Andrew Ng and others declared that data-centricity was the direction of AI development. In that short time, the industry has flourished. In just a few years, a plethora of novel commercial applications and use cases for computer vision have emerged, spanning a wide range of industries, from robotics and AR/VR to automotive manufacturing and home security.

Recently, we conducted research on hand-on-steering-wheel detection in cars using a data-centric approach. Our experiments show that by using this approach and synthetic data we are able to identify and generate specific edge cases that are lacking in the training dataset.


Datagen generates composite images for the hand-on-steering-wheel test (Image courtesy of Datagen)

While the computer vision industry is buzzing about data, not all of the buzz is enthusiastic. Although the field has established that data is the way forward, there are many obstacles and pitfalls along the way, many of which have already hobbled CV teams. A recent survey of U.S. computer vision professionals revealed that the field is plagued by long project delays, non-standardized processes, and resource shortages, all of which stem from data. In the same survey, 99% of respondents said that at least one CV project had been canceled indefinitely due to insufficient training data.

Even the lucky 1% who have avoided project cancellation so far cannot avoid project delays. In the survey, every respondent reported significant project delays due to insufficient or inadequate training data, with 80% reporting delays of three months or more. Ultimately, the purpose of infrastructure is utility: to facilitate, to accelerate, to connect. In a world where severe delays are just part of doing business, it is clear that some vital infrastructure is missing.

Why traditional training data resists becoming infrastructure

However, unlike compute and algorithms, the third pillar of AI/ML development does not lend itself to becoming infrastructure, especially in computer vision, where data is large, disorganized, and both time- and resource-intensive to collect and manage. While there are many labeled, freely available databases of visual training data online (such as the now-famous ImageNet), they have proven insufficient on their own as sources of training data for commercial CV development.

This is because, unlike models, which generalize by design, training data is by its very nature application-specific. Data is what distinguishes one application of a given model from another, and it must therefore be unique not only to a specific task, but also to the environment or context in which that task is performed. Unlike computing power, which can be generated and accessed at the speed of light, traditional visual data must be created or collected by humans (by taking photos in the field or scouring the Internet for suitable images) and then painstakingly cleaned and labeled by humans, a process prone to error, inconsistency, and bias.

This raises the question: how can we make visual data that is both suited to a specific application and easily commoditized (i.e., fast, cheap, and versatile)? Although these two qualities may seem contradictory, a potential solution has emerged that shows great promise as a way to reconcile them.

The path to synthetic data and a full CV stack


Computer vision (CV) is one of the leading areas of modern artificial intelligence

The only way to produce visual training data that is application-specific yet saves time and resources at scale is to use synthetic data. For those unfamiliar with the concept, synthetic data is artificially generated information designed to faithfully represent some real-world equivalent. For visual synthetic data, this means realistic computer-generated 3D imagery (CGI) in the form of still images or videos.

In response to many of the issues arising in the data-centric era, a burgeoning industry has begun to form around synthetic data generation: a growing ecosystem of small and mid-sized startups offering solutions that use synthetic data to address the pain points described above.

The most promising of these solutions use AI/ML algorithms to generate photorealistic 3D images and automatically generate associated ground truth (i.e., metadata) for each data point. Synthetic data therefore eliminates the often months-long manual labeling and annotation process, while also eliminating the possibility of human error and bias.
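As a toy illustration of why ground truth comes for free with synthetic data, the Python sketch below "renders" a trivial image containing one bright rectangle and emits its label at the same moment: because the generator placed the object, it knows the exact bounding box with no manual annotation. The rendering and label format are invented for this example and bear no relation to any real generation platform.

```python
import numpy as np

def make_sample(rng, size=64):
    """Render a toy grayscale 'image' containing one bright rectangle
    and return it together with its automatically known ground truth."""
    img = np.zeros((size, size), dtype=np.float32)
    h, w = rng.integers(8, 20, size=2)
    top = int(rng.integers(0, size - h))
    left = int(rng.integers(0, size - w))
    img[top:top + h, left:left + w] = 1.0
    # Ground truth comes for free: we placed the object ourselves,
    # so no human labeling (and no labeling error) is involved.
    label = {"bbox": (left, top, int(w), int(h)), "class": "rectangle"}
    return img, label

rng = np.random.default_rng(42)
img, label = make_sample(rng)
print(label)
```

A real synthetic data pipeline does the same thing with photorealistic 3D scenes instead of rectangles, emitting per-pixel segmentation masks, keypoints, and depth maps alongside each frame.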

In our paper published at NeurIPS 2021, "Discovering Group Bias in Facial Landmark Detection Using Synthetic Data," we found that to analyze the performance of a trained model and identify its weaknesses, a subset of the data must be set aside for testing. The test set must be large enough to detect statistically significant deviations with respect to all relevant subgroups within the target population. This requirement can be difficult to meet, especially in data-intensive applications.

We propose to overcome this difficulty by generating synthetic test sets. We use the facial landmark detection task to validate our proposal by showing that all biases observed on real datasets can also be seen on well-designed synthetic datasets. This shows that synthetic test sets can effectively detect model weaknesses and overcome limitations in size or diversity of real test sets.
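The underlying idea, measuring a model's error separately on each subgroup of a synthetic test set whose subgroup attributes are known exactly by construction, can be sketched in a few lines of Python. The subgroups, error values, and flagging threshold below are all made up for illustration and are not figures from the paper.

```python
from collections import defaultdict

# Hypothetical per-sample results on a synthetic test set: each record
# carries a subgroup attribute (known exactly, since the data was
# generated) and the model's landmark error for that sample.
results = [
    {"subgroup": "A", "error": 1.1}, {"subgroup": "A", "error": 0.9},
    {"subgroup": "A", "error": 1.0}, {"subgroup": "B", "error": 2.1},
    {"subgroup": "B", "error": 1.9}, {"subgroup": "B", "error": 2.0},
]

# Group the errors by subgroup and compare each subgroup's mean
# error against the overall mean.
by_group = defaultdict(list)
for r in results:
    by_group[r["subgroup"]].append(r["error"])

means = {g: sum(v) / len(v) for g, v in by_group.items()}
overall = sum(r["error"] for r in results) / len(results)

# Flag subgroups whose mean error is well above the overall mean
# (the 1.2x threshold is an arbitrary choice for this sketch).
flagged = [g for g, m in means.items() if m > 1.2 * overall]
print(means, flagged)
```

Because synthetic generation can produce as many samples per subgroup as needed, each of these per-group means can be estimated with statistical confidence, which is exactly where small or imbalanced real test sets fall short.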

Today, startups are providing enterprise CV teams with sophisticated self-service synthetic data generation platforms that mitigate bias and allow for scaling data collection. These platforms allow enterprise CV teams to generate use case-specific training data on a metered, on-demand basis—bridging the gap between specificity and scale that makes traditional data unsuitable for infrastructureization.

New Hopes for Computer Vision’s So-Called “Data Managers”

There’s no denying that this is an exciting time for the field of computer vision. But, like any other changing field, these are challenging times. Great talent and brilliant minds rush into the field full of ideas and enthusiasm, only to find themselves held back by the lack of adequate data pipelines. The field is so mired in inefficiency that it squanders the time of data scientists, a role in which one in three organizations already struggles with a skills gap, and we cannot afford to waste valuable human resources.

Synthetic data opens the door to a true training data infrastructure, one that may one day make obtaining data as simple as turning on the tap for a glass of water or provisioning compute. That is sure to be a welcome refreshment for the data managers of the world.
