Not everything that is inorganic, manufactured, or synthetic is fake or inferior. This is particularly true of synthetic data in machine learning. In some cases, simulated data is not only useful but more practical than real data.
In machine learning, synthetic data is crucial for ensuring that an AI system has been trained sufficiently before it is deployed. Machine learning engineering, the process of producing a machine learning (ML) model by applying software engineering and data science principles, runs into critical difficulties without it.
SEE ALSO: HOW MACHINE LEARNING AND AI WILL IMPACT ENGINEERING
What is synthetic data?
Synthetic data, according to Gartner, is “data generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data not directly taken from the real world.” In short, it is information borne of simulation rather than direct measurement. It differs from data collected through surveys, visual capture, and other direct data-gathering methods.
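The first half of Gartner's definition, applying a sampling technique to real-world data, can be sketched in a few lines. The example below is illustrative only: it fits a normal distribution to a handful of real measurements and draws new synthetic values from it. The function name and the toy "sensor reading" data are assumptions for the sake of the sketch, not part of any vendor's API.

```python
import random
import statistics

def synthesize_from_samples(real_values, n, seed=None):
    """Fit a normal distribution to real measurements and draw
    n new synthetic values from it (a simple sampling technique)."""
    rng = random.Random(seed)
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Example: expand 5 real "sensor readings" into 1,000 synthetic ones.
real = [9.8, 10.1, 10.0, 9.9, 10.2]
synthetic = synthesize_from_samples(real, 1000, seed=42)
```

Production systems use far richer generators (GANs, simulation engines, agent-based models), but the principle is the same: the synthetic values follow the statistical shape of the real ones without copying any real record.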
It is important to emphasize, however, that synthetic data is not false information. While it may be manufactured, it is based on real-world facts and circumstances. It approximates data that would be generated, based on carefully developed models. It compensates for the scarcity of available data or the difficulty of obtaining the desired information for machine learning model training.
Several studies show that synthetic data can deliver machine learning outcomes similar to, and in some cases surpassing, those achieved with real data. One such study, from the University of Barcelona’s Faculty of Mathematics and Computer Science, explores the use of synthetic data for deep-learning-based pedestrian counting.
The study concludes that synthetic data is indeed useful in training AI systems while providing various advantages. “The obtained results suggest the incorporation of synthetic data as a well-suited surrogate for the missing real [data] along with alleviating required exhaustive labeling,” the study writes.
As far as practical applications are concerned, a number of companies are already using synthetic data in their business models. OneView, for one, offers custom and scalable synthetic data for the remote sensing industry. The company synthesizes visual data to train the AI systems used for analytics of remote sensing imagery. The company raised $3.5 million in seed funding for its business.
How is synthetic data generated and used?
Synthetic data should not be equated with random information, although randomization has a role in its generation. For a more illustrative discussion, a good point of reference is the synthetic data generation process of OneView, which specializes in creating synthetic visual data for remote sensing imagery analytics and related applications.
OneView follows a six-layer process that starts with the layout, in which the basic elements of an environment (urban, agricultural, maritime, or any other) are laid out. The next step is the placement of the objects of interest that are the targets of detection, along with distractors that teach the ML model to differentiate the target object from similar-looking ones.

The appearance-building stage follows. It is during this stage that colors, textures, random erosions, noise, and other detailed visual elements are added to simulate real images.

The fourth step applies conditions such as weather and time of day. The fifth sets sensor parameters such as the camera lens. Lastly, annotations are added to make the resulting synthetic data ready for machine learning systems.
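The six steps above can be pictured as a pipeline that assembles a scene description step by step. The sketch below is a loose illustration of that idea, not OneView's actual software: every function name, field, and parameter choice here is a hypothetical stand-in.

```python
import random

def generate_scene(seed=None):
    """Illustrative sketch of a six-step synthetic-scene pipeline.
    All names and values are assumptions, not a real vendor API."""
    rng = random.Random(seed)
    scene = {}
    # 1. Layout: pick the base environment.
    scene["layout"] = rng.choice(["urban", "agricultural", "maritime"])
    # 2. Placement: objects of interest plus look-alike distractors.
    scene["objects"] = [{"type": "vehicle", "target": True},
                        {"type": "dumpster", "target": False}]
    # 3. Appearance: colors, textures, erosion, noise.
    scene["appearance"] = {"noise": rng.uniform(0.0, 0.1)}
    # 4. Conditions: weather and time of day.
    scene["conditions"] = {"weather": rng.choice(["clear", "rain", "haze"]),
                           "hour": rng.randrange(24)}
    # 5. Sensor: camera and lens parameters.
    scene["sensor"] = {"focal_length_mm": rng.choice([35, 50, 85])}
    # 6. Annotations: labels fall out of the placement step for free.
    scene["annotations"] = [o["type"] for o in scene["objects"] if o["target"]]
    return scene
```

Note how step 6 requires no manual labeling: because the generator placed every object itself, it already knows exactly what is in the scene. Randomizing the layout, conditions, and sensor per scene is also what keeps the dataset from collapsing into repetitive patterns.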

OneView employs advanced gaming engines to generate 3D models for its datasets. These are the same engines used by popular games such as Fortnite and Grand Theft Auto. Gaming engines have advanced significantly over the years and can now produce hyper-realistic imagery that can be mistaken for actual photos. Randomization is also employed to avoid patterns or repetitive information that would not be helpful in machine learning training.
Generally, machine learning engineers are not directly involved in the preparation of synthetic data. However, they often work with data scientists to get inputs on perfecting the ML model for a project. They collaborate with data experts to make sure that the resulting AI system has learned what it needs to operate as intended.
SEE ALSO: WHAT IS DEEP LEARNING AND WHY IS IT MORE RELEVANT THAN EVER?
A necessity, not just an option
Obtaining real data can be very resource-intensive. To capture a comprehensive set of overhead views of a city, for example, several drones must be deployed, and the process repeated for different times of day, weather conditions, traffic situations, and other variables.
Doing all of this is not only extremely expensive; it is also virtually impossible to capture all the needed data in a timely manner. If it does not rain for several months, how can images of the city on a rainy day be obtained? What if only images of wildfire-ravaged, smog-covered landscapes are available for months on end?
Synthetic data provides numerous advantages that make it not only a viable option but a necessary data source. It addresses the limitations of real data gathering while providing other benefits, which can be summarized as follows:
- Fast data generation and use (with built-in annotations)
- Comprehensive representation
- Customizability
- Scalability
Machine learning cannot proceed unless training meets its target accuracy levels, something that is not attainable without the right amount and range of data. Synthetic data is not only easier to produce; it can also be generated with annotations already integrated. It is customizable and scalable, so it can be adjusted to reflect different situations and conditions. It is doubtlessly easier to simulate topographic features, cars, buildings, and other elements than to wait for the corresponding real-world scenes to present themselves for cameras to capture.
Annotation is vital to any machine learning model training, as it acts as a guide for identifying objects or data elements. Without it, a machine learning system may interpret data the wrong way, skewing the entire learning process.
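To make the "built-in annotations" advantage concrete, here is a minimal sketch of what an annotation record might look like when it is produced at placement time. The record layout is loosely modeled on common object-detection formats such as COCO-style bounding boxes; the function and field names are hypothetical.

```python
def place_object(obj_type, x, y, w, h):
    """Place an object in a synthetic scene and return its annotation.
    Because the generator chose the position itself, the bounding box
    is exact and free: no human labeler is needed.
    (Field layout loosely based on COCO-style boxes; names are illustrative.)"""
    return {
        "category": obj_type,   # label the ML model will learn
        "bbox": [x, y, w, h],   # [x, y, width, height] in pixels
    }

# Placing two objects yields two perfectly accurate labels as a side effect.
annotations = [place_object("car", 120, 64, 40, 18),
               place_object("truck", 300, 80, 60, 26)]
```

In real-data workflows, producing the same two records would require a human to draw each box by hand, which is exactly the "exhaustive labeling" burden the Barcelona study notes that synthetic data alleviates.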
Machine learning enabler
A Fujitsu whitepaper concludes that synthetic data is a fitting solution for the AI data challenge while enabling faster product development. “The reality is that the cost of quality data acquisition is high, and this is acting as a barrier preventing many from considering AI deployment. To tackle this challenge, organizations are increasingly looking towards synthetic data to address the data shortfall that is preventing AI adoption,” the paper notes.
Synthetic data is vital to the machine learning engineering process. It not only serves as an alternative to actual data; it is often the only way to supply ML systems with enough varied data to cover a wide range of situations while getting around the expense, logistics, and technical limitations of actual data gathering.