AI Synthetic Human Data is Freer of Bias Than Real Data

Apr 26, 2022 | Machine Learning / Artificial Intelligence

On June 11, 2021, Karen Hao published another interesting article in MIT Technology Review, “These creepy fake humans herald a new age in AI.” It featured images of fake human faces, which are cheaper to produce and market than real ones and are also said to be less biased.

Datagen, a company based in Israel, is working with four major U.S. tech firms and competes with Synthesis AI to meet the demand for digital humans. Click-Ins is a startup that uses synthetic data to automate vehicle inspections, including those of new models, while safeguarding private information. Mostly.ai works with financial and insurance companies, providing spreadsheets of fake client data that simulate company data which doesn’t exist yet. Synthetic data is a growing business, with as many varieties as there are data types.

Collecting real-world data is expensive and time-consuming, and the result is messy and packed with biases. Datagen’s synthetic process starts by scanning real humans to capture details ranging from irises and skin texture to the curvature of fingers and nails. It then pumps this raw data through a series of algorithms to develop 3D representations of human bodies, faces, eyes, and hands. The resulting synthetic data is clean and can be used to build highly diverse data sets, such as perfectly labeled faces of different ages, shapes, and ethnicities, for training a face-detection system that works across populations.

Nevertheless, synthetic data has its limitations – realistic is not real.         

Deep learning has always been about data. But in the last few years, the AI community has learned that good data is more important than big data. Even a small amount of the right, cleanly labeled data can do more to improve an AI system’s performance than 10 times the amount of uncurated data, or even a more advanced algorithm.

Many companies start by collecting as much data as possible and then tune their algorithms for better performance. According to Karen Hao’s article, they “should be using the same algorithms while improving on the composition of their data.” Datagen’s synthetic data generator lets teams create and test multiple new data sets to identify which one maximizes a model’s performance.
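The idea of keeping the algorithm fixed and varying only the composition of the data can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Datagen’s actual tooling: the generator `make_rows`, the two groups "A" and "B", the numeric centres, and the threshold "model" are all hypothetical and were chosen only to keep the example small.

```python
import random
from statistics import mean

def make_rows(n, frac_b, rng):
    """Hypothetical generator: n rows of (feature, label, group).

    Positive examples from group "B" sit closer to the negatives than those
    from group "A", so the share of B in the training data matters.
    """
    rows = []
    for _ in range(n):
        group = "B" if rng.random() < frac_b else "A"
        label = rng.randint(0, 1)
        centre = 0.0 if label == 0 else (5.0 if group == "A" else 1.5)
        rows.append((rng.gauss(centre, 1.0), label, group))
    return rows

def fit_threshold(rows):
    """The fixed 'algorithm': midpoint between the two class means."""
    neg = mean(x for x, y, _ in rows if y == 0)
    pos = mean(x for x, y, _ in rows if y == 1)
    return (neg + pos) / 2.0

def accuracy(threshold, rows):
    """Share of rows whose predicted label (x > threshold) matches the label."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0 for x, y, _ in rows)

rng = random.Random(0)
validation = make_rows(2000, 0.5, rng)     # held-out set covering both groups
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]   # share of group B in each training set

# Same algorithm every time; only the composition of the training data changes.
scores = {}
for frac_b in candidates:
    train = make_rows(1000, frac_b, rng)
    scores[frac_b] = accuracy(fit_threshold(train), validation)

best = max(scores, key=scores.get)
print(scores, "best composition:", best)
```

The loop is the whole point: each candidate data set is cheap to generate, so the composition can be searched much like a hyperparameter while the model stays the same.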

However, synthetic data raises challenges of privacy and fairness. For one, it can encode sensitive information about real people, according to Aaron Roth, professor of computer science at the University of Pennsylvania, though Datagen doesn’t claim that it conceals such information.

Research suggests that combining two techniques, differential privacy and generative adversarial networks, can produce the strongest privacy protection, but skeptics worry that this nuance gets lost in vendors’ marketing:

 Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
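As a concrete illustration of the definition above, here is a minimal sketch of the Laplace mechanism, one common way to release a count with differential privacy. The records, the function names, and the choice of epsilon are made up for the example; nothing here is taken from any vendor’s product.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon is enough.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Made-up records for illustration: (age, has_condition)
records = [(34, True), (51, False), (29, True), (62, True), (45, False)]
rng = random.Random(42)
noisy = private_count(records, lambda r: r[1], epsilon=0.5, rng=rng)
print(round(noisy, 2))  # the true count is 3; the released value is 3 plus random noise
```

A smaller epsilon means more noise and stronger privacy. The group-level pattern (roughly how many people have the condition) survives, while the presence of any single individual is masked.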

A generative adversarial network (GAN) is a class of machine learning framework in which two neural networks contest with each other in a game where one agent’s gain is another agent’s loss (Wikipedia).
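The “game” in that definition is usually written as a minimax objective. The formula below is the standard one from the original GAN formulation and is added here for illustration only. The symbols do not appear in the article: $G$ is the generator network, $D$ is the discriminator network, $p_{\text{data}}$ is the distribution of real data, and $p_z$ is the noise distribution the generator samples from.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to make $V$ large by telling real samples from generated ones, while the generator tries to make it small by producing samples the discriminator accepts as real.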

Others warn that fake humans may still diverge from reality. Still others say, “perfectly balanced data sets don’t automatically translate into perfectly fair AI systems.” Karen Hao points out that, according to early research, it may in some cases not even be possible to achieve AI that is both private and fair with synthetic data.

Takeaway: There is a place for AI synthetic data. Karen Hao believes it may become a necessity, though the data must be tested. Her final thought: “Synthetic data is likely to get better over time, but not by accident.”