Evaluating synthetic data and its quality for AI use cases

AI Governance & Assurance

In episode 5 of the AI Fundamentalists podcast, we discussed synthetic data: what it is, where it is useful, and where it can be harmful. OpenAI’s Sam Altman recently claimed that he is “pretty confident that soon all data will be synthetic data”.

We are skeptical that this can be accomplished to the desired effect, especially for text data. Even leadership at Gretel, a synthetic data generation company that would theoretically prosper from such a paradigm shift, has stated: “The content on the web is more and more AI-generated, and I do think that will lead to degradation over time [because] LLMs are producing regurgitated knowledge, without any new insights.” To hear our full discussion on the topic, check out the podcast.

How to assess the quality of synthetic data

This post is dedicated to a request we received to expand on a topic we briefly touched on during the podcast: How do you evaluate the quality of synthetic data?

As we discuss in the podcast, one of the major difficulties with synthetic data, and what keeps it from being widely used, is properly capturing the interrelationships between data points. For tabular data, such as demographic data, customer churn data, or fraud detection records, the nuances matter. The line between what is signal and what is noise is very fine.

When you think about something as complex as human language, a single word, depending on the context, could mean many different things. For example, the word “run” has 645 different meanings according to the Oxford English Dictionary. Depending on how it is used in a sentence, we could be talking about a run in a baseball game, a limited run of sneakers, or watching our laptops run on our desks, and the list goes on for 642 more.

Context and nuance matter

So, how do you evaluate the quality? Everyone’s favorite answer: it depends. Context matters, and your knowledge of what is being mimicked, and how well it represents the underlying distribution(s) being modeled, is key. Note: this implies you either have data you are modeling from or a deep understanding of what you are attempting to create, as in the case of system stress testing.

Statistical evaluation and comparative analysis

Another key statistical evaluation for validating the quality of synthetic data (remember, we are bringing stats back!) is distributional comparative analysis. This method involves comparing the underlying distributions of your original data and your synthetic data.

When we say “distribution”, think of the normal bell curve, or something really esoteric like the Marchenko–Pastur distribution. You will want to evaluate how well the synthetic data conforms to the original distribution using a statistical test, such as the Kolmogorov–Smirnov test (or how well it deviates, in a stress-testing use case, where you instead evaluate whether it has the desired properties).
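
As a rough illustration of this kind of check, here is a minimal sketch using SciPy’s two-sample Kolmogorov–Smirnov test. The array names, the generated stand-in data, and the 0.05 threshold are hypothetical choices for the example, not a prescribed workflow; in practice you would pass in a column from your real data and the matching column from your synthetic data.

```python
# Minimal per-column distributional check, assuming 1-D numeric arrays.
# The sample data and 0.05 threshold below are illustrative placeholders.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
real = rng.normal(loc=50, scale=10, size=5_000)       # stand-in for an original column
synthetic = rng.normal(loc=50, scale=12, size=5_000)  # stand-in for its synthetic counterpart

# Two-sample Kolmogorov-Smirnov test: the statistic is the maximum gap between
# the two empirical CDFs; a small p-value suggests the synthetic column does
# not follow the same distribution as the original.
statistic, p_value = ks_2samp(real, synthetic)
print(f"KS statistic: {statistic:.4f}, p-value: {p_value:.4g}")

if p_value < 0.05:
    print("Distributions differ more than chance alone would suggest.")
else:
    print("No evidence of a distributional mismatch at this threshold.")
```

Run one test per numeric column (or a multivariate alternative if joint structure matters), and remember that with large samples even tiny, practically irrelevant differences can produce small p-values, so the statistic itself is worth inspecting alongside the p-value.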

If the interrelationships between variables are the most important factor to validate, run association measures between features on both your real and your synthetic data. Then evaluate whether the feature relationships still hold in the synthesized data, as sketched below. A more advanced version of this can be conducted with a principal component analysis. Note: you should always check simpler descriptive statistics such as minimums, maximums, means, standard deviations, and inter-quartile ranges.
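
As an illustration of the correlation and descriptive-statistics checks, here is a hedged sketch using pandas. The column names, the generated data, and the correlation-gap summary are hypothetical, assuming your real and synthetic tables share the same numeric columns; substitute your own DataFrames.

```python
# Sketch of comparing feature interrelationships and basic descriptive
# statistics between real and synthetic tabular data. The columns and
# generated values are placeholders for your own frames.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical "real" data with a built-in relationship between income and spend.
income = rng.normal(60_000, 15_000, 2_000)
real_df = pd.DataFrame({
    "age": rng.integers(18, 80, 2_000),
    "income": income,
    "spend": income * 0.1 + rng.normal(0, 1_000, 2_000),
})

# Hypothetical synthetic data drawn independently, so the income/spend link is lost.
synth_df = pd.DataFrame({
    "age": rng.integers(18, 80, 2_000),
    "income": rng.normal(60_000, 15_000, 2_000),
    "spend": rng.normal(6_000, 1_500, 2_000),
})

# 1) Descriptive statistics: minimums, maximums, means, standard deviations, quartiles.
print(real_df.describe().round(1))
print(synth_df.describe().round(1))

# 2) Pairwise correlations: the absolute difference between the two matrices
#    highlights relationships the generator failed to reproduce.
corr_gap = (real_df.corr() - synth_df.corr()).abs()
print(corr_gap.round(2))
print(f"Largest correlation gap: {corr_gap.to_numpy().max():.2f}")
```

In this toy setup the income/spend correlation present in the real frame disappears in the synthetic one, so the gap matrix flags exactly the kind of lost interrelationship discussed above; a rank-based measure such as Spearman correlation can be swapped in when relationships are monotonic but not linear.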

Not as good as the real thing

The final consideration should be an acknowledgment that synthetic data will never be as good as the real thing. The best cubic zirconia is still not a diamond. If the highest-fidelity data is required for a high-risk application, your efforts would most likely be best served gathering more real data that is representative of the underlying use case. You can develop high-quality synthetic data, but it is extremely painstaking and tedious. Modelers should evaluate what their needs are first.

In summary, to evaluate the quality and efficacy of synthetic data you need:

  1. The original data you are modeling, and/or a very deep understanding of what you are attempting to capture
  2. Correlational analyses to evaluate the interrelationships within the original data and determine whether they are present in the synthetic data.
  3. A distributional analysis of how well the synthetic data captures the same probability distribution as the one you are attempting to model.
  4. Recognition that even the best cubic zirconia won’t be the same as the real deal - just ask any aficionado in your life. If you need full-fidelity “diamond” data for mission-critical applications that make consequential decisions about end users, you may be better served obtaining more real data points than generating fake data.

If you haven’t yet, please listen to the full recording of the AI Fundamentalists podcast, available wherever you get podcasts. If you have any questions about a topic on the podcast or a request for a future topic, please drop us a line at aifundamentalists@monitaur.ai.