Synthetic data: the future of machine learning
Translator | Bugatti
Reviewer | Sun Shujuan
Data is the lifeblood of machine learning models. But what happens when access to this valuable resource is restricted? As many projects and companies are beginning to demonstrate, this is where synthetic data becomes a viable, and sometimes excellent, option.
Synthetic data is artificially generated information that is not obtained through direct measurement. "Fake" data is not a new or revolutionary concept in itself; it is essentially a way of producing test or training data for a model that lacks the data it needs to function properly.
In the past, a lack of data led to the convenient shortcut of using randomly generated data points. While that may be sufficient for teaching and testing purposes, random data is not what you want to train any kind of predictive model on. This is what sets synthetic data apart: it is reliable.
Synthetic data is, in essence, randomized data generated in a deliberate, controlled way. That is why the approach can be applied to more complex use cases, not just testing.
While synthetic data is generated in much the same way as random data, just with a more complex set of inputs, it serves a different purpose and therefore has unique requirements.
Synthetic generation is grounded in, and constrained by, criteria supplied as input in advance. In that sense it is not truly random: it starts from a sample of real data with a particular distribution, together with criteria that determine the possible range, distribution, and frequency of the data points. Roughly speaking, the goal is to replicate the real data closely enough to populate a data set large enough to train a machine learning model.
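To make that concrete, here is a minimal sketch in Python, assuming the real data is a small set of one-dimensional numeric measurements and that a normal distribution is a reasonable fit. The function name generate_synthetic and the sample values are purely illustrative, not from the original article.

```python
import numpy as np

def generate_synthetic(real_sample, n_points, rng=None):
    """Draw synthetic points from a normal distribution fitted to a real sample.

    The real sample supplies the criteria: its mean and spread define the
    distribution, and its observed min/max bound the allowed range.
    """
    rng = rng or np.random.default_rng()
    mu, sigma = real_sample.mean(), real_sample.std()
    synthetic = rng.normal(mu, sigma, size=n_points)
    # Keep generated points inside the range actually observed in the sample.
    return np.clip(synthetic, real_sample.min(), real_sample.max())

# A small set of real measurements (for example, sensor readings)...
real = np.array([4.9, 5.1, 5.0, 5.3, 4.8, 5.2, 5.1, 4.7])
# ...expanded into a much larger training set with the same shape.
fake = generate_synthetic(real, n_points=10_000)
print(fake.mean(), fake.std())
```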
This approach becomes particularly interesting when deep learning methods are used to refine the synthetic data. Two algorithms can be pitted against each other, each trying to outdo the other at generating and detecting synthetic data. In effect, the aim is an artificial arms race that produces hyper-realistic data.
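This adversarial setup is commonly implemented as a generative adversarial network (GAN). Below is a minimal, illustrative PyTorch sketch on one-dimensional data; the network sizes, learning rates, and target distribution are assumptions chosen only to show the generator-versus-discriminator loop, not a production recipe.

```python
import torch
import torch.nn as nn

# Real data the generator should learn to imitate: mean 4.0, std 1.5.
def real_batch(n):
    return torch.randn(n, 1) * 1.5 + 4.0

gen = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
disc = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # 1) Discriminator: label real samples 1 and generated samples 0.
    real = real_batch(64)
    fake = gen(torch.randn(64, 8)).detach()
    d_loss = bce(disc(real), torch.ones(64, 1)) + bce(disc(fake), torch.zeros(64, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Generator: produce samples the discriminator mistakes for real.
    fake = gen(torch.randn(64, 8))
    g_loss = bce(disc(fake), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

samples = gen(torch.randn(1000, 8))
print(samples.mean().item(), samples.std().item())  # should drift toward 4.0 and 1.5
```

After enough steps the generator's samples approach the mean and spread of the real data, which is exactly the arms race described above.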
If we cannot gather the valuable resources needed to advance civilization, we find a way to create them. That principle now applies equally to the data that powers machine learning and artificial intelligence.
When training an algorithm, a very large sample of data is crucial; otherwise, the patterns the algorithm identifies may be too simplistic for practical use. This is quite logical: just as human intelligence tends to take the easiest route to solving a problem, the same often happens when training machine learning and AI models.
For example, consider an object recognition algorithm trained to pick out dogs from a set of images of cats and dogs. If the amount of data is too small, the AI risks latching onto patterns that are not essential features of the object it is trying to identify. In that case the AI might still appear effective, but it will break down as soon as it encounters data that does not follow the pattern it initially learned.
So what is the solution? Show the network many slightly different animals, forcing it to learn the underlying structure of the image rather than the locations of particular pixels. But instead of drawing a million dogs by hand, it is better to build a system that draws dogs for us and use its output to train the classification algorithm, which is essentially what we do when we feed synthetic data into machine learning (a sketch of this idea follows below).
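The closest everyday analogue of a "system for drawing dogs" is programmatic augmentation of a few labelled photos. The sketch below assumes torchvision is available and uses a placeholder file name, dog.jpg; strictly speaking this is augmentation rather than fully synthetic generation, but it illustrates the same principle of turning one example into many slightly different ones.

```python
from PIL import Image
from torchvision import transforms

# Random perturbations: each pass over the same photo yields a slightly
# different training example, so the classifier cannot rely on exact pixels.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.RandomRotation(15),
])

dog = Image.open("dog.jpg")                           # one hand-labelled photo...
synthetic_dogs = [augment(dog) for _ in range(1000)]  # ...turned into a thousand variants
```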
However, this approach has an obvious flaw: data generated out of thin air does not represent the real world, so the algorithm is likely to fail when it encounters real data. The solution is to collect a subset of real data, analyze it to identify its trends and ranges, and then use those findings to generate large amounts of data that is likely to be representative of what we would see if we collected everything ourselves.
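Whether the generated data really is representative can be checked before training. One simple sketch, with distributions and sample sizes invented purely for illustration: generate from the collected subset's statistics and compare the two samples with a two-sample Kolmogorov-Smirnov test from SciPy.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_subset = rng.normal(5.0, 1.5, size=200)  # the small sample we managed to collect
synthetic = rng.normal(real_subset.mean(),    # generated from its statistics
                       real_subset.std(), size=10_000)

# Kolmogorov-Smirnov test: a large p-value means the two samples are
# statistically hard to tell apart, i.e. the synthetic data looks plausible.
stat, p_value = ks_2samp(real_subset, synthetic)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3f}")
```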
This is the value of synthetic data: we no longer have to collect data endlessly and then clean and process it before use.
The world is going through a dramatic shift, especially in the European Union: privacy and the data people generate are increasingly protected. For machine learning and AI, stronger data protection has long posed a problem, because the restricted data is often exactly what is needed to train algorithms that perform well and deliver value to end users, especially in B2C solutions.
Privacy concerns are usually addressed when individuals decide to use a solution and consent to the use of their data. The problem is that it is hard to get users to hand over personal data until you have a solution that provides enough value to make it worth their while. As a result, vendors often find themselves in a chicken-and-egg dilemma.
Synthetic data offers a way out: companies can obtain small subsets of data from early adopters and use them as the basis for generating enough data to train machine learning and AI. This approach greatly reduces the time-consuming and expensive reliance on private data while still allowing algorithms to be developed for real users.
For industries such as healthcare, banking, and law, synthetic data provides easier access to large amounts of data that were previously unavailable, removing the constraints that have often held back new and more advanced algorithms.
The problem with real data is that it is not generated for the purpose of training machine learning and AI algorithms; it is simply a by-product of events happening around us. As noted earlier, this limits not only the availability and ease of use of the collected data but also its parameters, and it leaves room for defects (outliers) that can corrupt the results. This is why synthetic data, which can be customized and controlled, is more efficient for training models.
However, although well suited to training scenarios, synthetic data will always rely on at least a small amount of real data for its own creation. So synthetic data never replaces the original data it is built on; more realistically, it significantly reduces the amount of real data required for training. Training requires far more data than testing: typically 80% of the data is used for training and the remaining 20% for testing (see the sketch below).
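For reference, the 80/20 split mentioned above is typically done with a utility such as scikit-learn's train_test_split; the feature matrix and labels in this sketch are synthetic stand-ins, not data from the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))         # stand-in feature matrix (mostly synthetic rows)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in labels

# The usual split: 80% of the data to train the model, 20% held back to test it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))         # 8000 2000
```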
Finally, if done right, synthetic data provides a faster, more efficient, and cheaper way to get the data we need than collecting it from the real world, while also easing the thorny privacy issues surrounding data.
Original title: Synthetic data: The future of machine learning, author: Christian Lawaetz Halvorsen