Generate Synthetic Data to Build Robust Machine Learning Models in Data-Scarce Scenarios

Explore statistical approaches to transform expert knowledge into data, with practical examples

Introduction

Machine learning models need to be trained on sufficient, high-quality data whose patterns will recur in the future in order to make accurate predictions.

Generating synthetic data is a powerful technique to address various challenges, especially when real-world data is inaccurate or insufficient for model training.

In this article, I’ll explore major synthetic data generation methods that leverage statistical / probabilistic models. I’ll examine:

  • univariate approaches driven by PDF estimations and
  • multivariate approaches like Kernel Density Estimation and Bayesian Networks,

taking a real-world use case as an example.

What is Synthetic Data Generation

Synthetic data generation is a data enhancement technique in machine learning that creates new data from scratch.

Its fundamental approaches involve using statistical models or deep generative models to analyze the patterns and relationships within existing data to produce new data:

Statistical / Probabilistic Models:

  • Univariate approach: Column-by-column PDF/PMF Estimation
  • Multivariate approach: Kernel Density Estimation (KDE), Bayesian Networks

Deep Generative Models:

  • Generative Adversarial Networks (GANs)
  • Variational Autoencoders (VAEs)

In the statistical model approach, we can take a univariate approach, where each column (feature) is examined and generated independently, or a multivariate approach, where multiple correlated columns are generated and updated together.

Why Synthetic Data: Typical Use Cases

Even sophisticated machine learning algorithms perform poorly when training data is scarce or inaccurate.

But securing high-quality data, crucial for robust models, is challenging due to its unavailability or imperfections.

Among many data enhancement techniques, generating synthetic data can offer comprehensive solutions to tackle these challenges:

  • Real data is unavailable or extremely limited: New products, rare events, niche scenarios, hypothetical or future conditions lack historical data to test the model’s resilience.
  • Data privacy is paramount: Sensitive information (e.g., medical records, financial transactions, personal identifiable information) cannot be directly used for development or sharing due to regulations (GDPR, HIPAA) or ethical concerns.
  • Accelerating development: Providing immediate access to large datasets, removing dependencies on lengthy data collection or access approval processes.

Now, I’ll detail how we can leverage this in a real-world use case.

Univariate Approach: Column-by-Column PDF/PMF Estimation

Univariate approaches model the distribution of each column independently by fitting the column into a theoretical statistical distribution.

This method assumes each column is independent, that is, it has no correlation with any other column in the dataset.

So, synthetic values for a given column are sampled solely from its estimated distribution, without considering the values present in any other columns.

These statistical distributions are defined by

  • Probability Density Function (PDF) for numerical columns and
  • Probability Mass Function (PMF) for categorical columns.

Best when:

  • The dataset has dominant columns or is simple enough to disregard correlations.
  • Computational efficiency is required.
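For the categorical side mentioned above, the PMF can be estimated directly from empirical category frequencies. Below is a minimal sketch, assuming a hypothetical call_type-style column (the values are illustrative, not taken from a real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical observed categorical column (illustrative values)
observed = pd.Series([
    "Order_Inquiry", "Technical_Support", "Order_Inquiry",
    "Billing_Issue", "Order_Inquiry", "Technical_Support",
])

# Estimate the PMF from the empirical category frequencies
pmf = observed.value_counts(normalize=True)

# Sample new synthetic category values from the estimated PMF
rng = np.random.default_rng(seed=42)
synthetic = rng.choice(pmf.index.to_numpy(), size=1_000, p=pmf.to_numpy())

print(pmf)
print(pd.Series(synthetic).value_counts(normalize=True))
```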

How the Univariate Approach Works

In the univariate approach, we take three simple steps to generate a synthetic dataset:

  • Step 1. Estimate a theoretical PDF (or PMF) of the column we want to generate.
  • Step 2. Based on the estimation, generate new synthetic values for the column (I’ll cover various methods later).
  • Step 3. Combine the synthetic column with an existing dataset.
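To make these steps concrete, here is a minimal sketch in Python. It fits a normal distribution with scipy.stats to a toy numerical column purely for illustration; both the toy data and the choice of a normal distribution are assumptions:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=0)

# A toy existing dataset with one numerical column (illustrative values)
df = pd.DataFrame({"call_duration_sec": rng.normal(loc=300, scale=60, size=500)})

# Step 1: estimate a theoretical PDF of the column (here: fit a normal distribution)
mu, sigma = stats.norm.fit(df["call_duration_sec"])

# Step 2: generate new synthetic values from the estimated distribution
synthetic_values = stats.norm.rvs(loc=mu, scale=sigma, size=1_000, random_state=1)

# Step 3: combine the synthetic column with a dataset
synthetic_df = pd.DataFrame({"call_duration_sec": synthetic_values})
print(synthetic_df["call_duration_sec"].describe())
```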

Step 1 is critical for the synthetic data to accurately reflect the true underlying distribution of the real-world data.

In the next section, I’ll explore how the algorithm mimics the true statistical distribution of numerical columns, starting with an overview of PDFs.

What is Probability Density Function (PDF)

A Probability Density Function (PDF) is a statistical function that describes the relative likelihood of a continuous random variable taking on a given value; probabilities are obtained by integrating the PDF over a range of values.

The following figure visualizes a valid PDF for a continuous random variable X uniformly distributed between 0 and 5 (uniform distribution).

The area under the PDF curve indicates the probability of a random value falling within a certain range.

For instance, the probability of a random value x falling between 0.9 and 1.6 is 14%, because the area under the PDF curve over that interval is 0.14 (a width of 0.7 times a density of 0.2), as highlighted in pink.

Fig. A PDF for a continuous random variable uniformly distributed between 0 and 5

For a function f(x) to be a valid PDF, it must satisfy two conditions (which we can see from the diagram):

  1. It must be non-negative for all possible values of x: f(x) ≥ 0 for every x.

  2. Its integral over the entire domain must be equal to one: ∫ f(x) dx = 1, integrating from −∞ to +∞.
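We can check both conditions, and the 14% figure from the example above, numerically. The sketch below uses scipy.stats.uniform for the Uniform(0, 5) distribution:

```python
from scipy import stats
from scipy.integrate import quad

# Uniform distribution on [0, 5]: density is 1/5 = 0.2 inside the interval, 0 outside
X = stats.uniform(loc=0, scale=5)

# Condition 1: the PDF is non-negative for all x
print(X.pdf(2.5))  # 0.2
print(X.pdf(7.0))  # 0.0

# Condition 2: the PDF integrates to 1 over its entire domain
area, _ = quad(X.pdf, 0, 5)  # the density is zero outside [0, 5]
print(area)  # ~1.0

# Probability of x falling between 0.9 and 1.6 = area under the curve on that interval
print(X.cdf(1.6) - X.cdf(0.9))  # 0.14
```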

Different distributions have unique PDF shapes like bell curves, exponential decays, or uniform spreads.

The figure below visualizes several PDF variations. We can also see that the uniform distribution referenced in the previous example is one of them:

Fig. Various PDFs

In the context of synthetic data generation, the PDF of the synthetic data needs to match the PDF of the real-world data.

In the next section, I’ll explore how we can estimate a PDF for synthetic data.

How to Estimate PDF of Synthetic Data

Depending on data availability, we can take one of two strategies to estimate an accurate PDF:

Strategy 1. Form a strong assumption

When we don’t have any data to start with, we can form a strong assumption about the true underlying distribution, based on research or expert knowledge for instance, and create data from scratch.
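For example, suppose domain experts believe call durations are right-skewed with a median around five minutes. A minimal sketch of Strategy 1, with entirely assumed parameters, might look like this:

```python
import numpy as np
from scipy import stats

# Assumed, expert-driven parameters (illustrative only)
median_duration_sec = 300   # assumed median of ~5 minutes
shape = 0.5                 # assumed spread of the log-normal distribution

# Create call duration data from scratch under the assumed distribution
synthetic_durations = stats.lognorm.rvs(
    s=shape, scale=median_duration_sec, size=1_000, random_state=42
)
print(np.percentile(synthetic_durations, [25, 50, 75]))
```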

Strategy 2. Use univariate non-parametric methods

When we have original data (whether it is sufficient or not), we can leverage univariate non-parametric methods, where the algorithm learns the underlying distribution from the data without assuming any specific distributional form.

Real-world data is imperfect, with noise, skewness, heavy tails, and so on, and it might follow a combination of multiple PDFs.

Strategy 2 can help tackle these imperfections, especially when we don’t have confident clues about the true PDF.
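A minimal sketch of Strategy 2 uses scipy’s Gaussian kernel density estimator, assuming we have a small, skewed sample of real durations (simulated here for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Hypothetical small sample of real call durations (seconds): skewed, with a heavy tail
real_durations = np.concatenate([
    rng.lognormal(mean=5.7, sigma=0.4, size=80),   # typical calls
    rng.lognormal(mean=6.8, sigma=0.3, size=20),   # a smaller group of long calls
])

# Learn the underlying density from the data, without assuming a specific distribution
kde = stats.gaussian_kde(real_durations)

# Sample new synthetic durations from the estimated density
synthetic_durations = kde.resample(size=1_000, seed=11).flatten()

# KDE can produce implausible (e.g., negative) values, so clip to a sensible range
synthetic_durations = np.clip(synthetic_durations, a_min=1, a_max=None)
print(np.percentile(synthetic_durations, [25, 50, 75]))
```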

Let us explore each strategy using a real-world use case.

Use Case: Synthetic Dataset for Customer Service Call Durations

Let us imagine a scenario where a manager of a customer service call center plans to reallocate their workforce based on call duration and frequency.

However, the call center didn’t track call durations, so no call duration data is available.

Problem: Lack of data

Our objective: Create synthetic data that reflects true data distribution of call durations based on the manager’s insights.

Target variable: call_duration_sec (continuous) that represents the total time in seconds a customer service call lasts.

The Original Dataset:

The following features are recorded by agents and stored in the database:

  • call_type (categorical): “Order_Inquiry”, “Technical_Support”, “E-commerce_Return”, “Billing_Issue”, “General_Inquiry”, “Complaint_Resolution”.
  • customer_tier (categorical/ordinal): “Bronze”, “Silver”, “Gold”, “Platinum”.
  • previous_contact_in_7_days (binary/boolean): “True”, “False”.
  • product_category (categorical): “Electronics”, “Apparel”, “Home_Goods”, “Software”.
  • day_of_week (categorical): “Monday”, “Tuesday”, “Wednesday”, “Thursday”, “Friday”, “Saturday”, “Sunday”.
  • time_of_day_bucket (categorical): “Morning_Peak”, “Afternoon_Off-Peak”, “Evening_Rush”, “Late_Night”.
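To make the setup tangible, here is a sketch of what the original dataset might look like when loaded into pandas. The rows are randomly generated placeholders built only from the feature values listed above; the real records and their proportions are not shown in this article:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
n_rows = 1_000  # assumed number of logged calls

calls = pd.DataFrame({
    "call_type": rng.choice(
        ["Order_Inquiry", "Technical_Support", "E-commerce_Return",
         "Billing_Issue", "General_Inquiry", "Complaint_Resolution"], size=n_rows),
    "customer_tier": rng.choice(["Bronze", "Silver", "Gold", "Platinum"], size=n_rows),
    "previous_contact_in_7_days": rng.choice([True, False], size=n_rows),
    "product_category": rng.choice(
        ["Electronics", "Apparel", "Home_Goods", "Software"], size=n_rows),
    "day_of_week": rng.choice(
        ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"], size=n_rows),
    "time_of_day_bucket": rng.choice(
        ["Morning_Peak", "Afternoon_Off-Peak", "Evening_Rush", "Late_Night"],
        size=n_rows),
})

# The target column, call_duration_sec, is missing -- this is the gap
# that synthetic data generation will fill.
print(calls.head())
```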
