In healthcare analytics, generating synthetic patient data is more than a technical exercise: it helps protect patient privacy while still enabling research. This guide walks through creating synthetic patient data with the Synthetic Data Vault (SDV).
Why Generate Synthetic Patient Data?
The use of real patient data in medical research is fraught with privacy concerns and ethical dilemmas. Regulations like HIPAA (Health Insurance Portability and Accountability Act) underscore the need for confidentiality, creating a barrier to research. Here, generating synthetic patient data emerges as a solution, allowing researchers to mirror the statistical properties of real data without compromising individual privacy.
Our Sample Dataset for synthetic patient data
To keep things simple, we will generate a sample dataset with the following columns:
- Patient ID – a random number
- Age – A random number below 100
- Gender – M/F
- Diagnosis – like Cancer, Pneumonia etc
- Treatment – like chemotherapy, immunization etc
- Outcome – like Stable, Declined, Improved etc
There is a sample Python script in my GitHub repo that generates this random data. We treat this generated data as the real data and use SDV to generate synthetic data from it.
The data will look like: (please note these values are randomly generated and may not be accurate)
Patient ID | Age | Gender | Diagnosis | Treatment | Outcome |
6884 | 79 | M | Alzheimer’s Disease | Immunotherapy | Remission |
3000 | 40 | F | Breast Cancer | Chemotherapy | Stable |
5360 | 52 | M | Prostate Cancer | Radiation Therapy | Declined |
8798 | 30 | M | Hyperthyroidism | Medication | Improved |
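The repository script is not reproduced here, so the following is only a minimal sketch of how a file with these columns could be produced using the Python standard library. The value lists, the ID range, and the `make_patients` and `write_csv` helpers are assumptions chosen to resemble the table above, not the actual script.

```python
import csv
import random

# Illustrative value lists (assumptions, loosely based on the sample table)
DIAGNOSES = ["Alzheimer's Disease", "Breast Cancer", "Prostate Cancer",
             "Hyperthyroidism", "Pneumonia"]
TREATMENTS = ["Immunotherapy", "Chemotherapy", "Radiation Therapy", "Medication"]
OUTCOMES = ["Stable", "Declined", "Improved", "Remission"]
COLUMNS = ["Patient ID", "Age", "Gender", "Diagnosis", "Treatment", "Outcome"]


def make_patients(n, seed=None):
    """Return n dicts of random, medically meaningless patient records."""
    rng = random.Random(seed)
    # Sampling without replacement keeps the four-digit IDs unique
    ids = rng.sample(range(1000, 10000), n)
    return [
        {
            "Patient ID": pid,
            "Age": rng.randint(0, 99),
            "Gender": rng.choice(["M", "F"]),
            "Diagnosis": rng.choice(DIAGNOSES),
            "Treatment": rng.choice(TREATMENTS),
            "Outcome": rng.choice(OUTCOMES),
        }
        for pid in ids
    ]


def write_csv(rows, path):
    """Write the records to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_csv(make_patients(1000, seed=42), 'sample.csv')` would then write a file that can be loaded as the real data in the steps below.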
What is SDV (Synthetic Data Vault)?
SDV generates synthetic data for single-table, relational, and time-series datasets, and supports multiple models and evaluation tools. You can use synthetic data in place of real data for added protection, or use it in addition to your real data as an enhancement.
Set up a Python virtual environment and install the package:
pip install sdv
Dataset and metadata generation
We use our patient dataset as the real data. The following imports are required:
import pandas as pd
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
from sdv.evaluation.single_table import run_diagnostic
from sdv.evaluation.single_table import evaluate_quality
We load our data into a pandas DataFrame and use SingleTableMetadata to generate the metadata for the dataset. SDV detects the metadata automatically, but we make a few changes to ensure it interprets the columns correctly.
# Assume 'real_data' is your original dataset
real_data = pd.read_csv('sample.csv') # Your original dataset
# generate metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
# correct metadata to use the same values as in the real data
metadata.update_column(
    column_name='Gender',
    sdtype='categorical'
)
metadata.update_column(
    column_name='Diagnosis',
    sdtype='categorical'
)
metadata.update_column(
    column_name='Treatment',
    sdtype='categorical'
)
metadata.update_column(
    column_name='Outcome',
    sdtype='categorical'
)
# save metadata to json
metadata.save_to_json('metadata.json')
In the code above we explicitly set the sdtype of several columns. SDV supports a number of sdtypes; the categorical sdtype describes columns that contain distinct categories. For example, the Outcome column can take only the values “Stable”, “Declined”, “Improved” and “Remission”.
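One way to decide which columns suit the categorical sdtype is to count the distinct values in each column before editing the metadata. The sketch below does this in plain Python on a list of row dictionaries; the `distinct_values` helper and the inline rows are hypothetical and not part of SDV.

```python
from collections import Counter


def distinct_values(rows, column):
    """Count how often each value of `column` occurs in a list of row dicts."""
    return Counter(row[column] for row in rows)


# A few rows shaped like the sample table (illustrative values only)
rows = [
    {"Age": 79, "Gender": "M", "Outcome": "Remission"},
    {"Age": 40, "Gender": "F", "Outcome": "Stable"},
    {"Age": 52, "Gender": "M", "Outcome": "Declined"},
    {"Age": 30, "Gender": "M", "Outcome": "Improved"},
]

for column in ("Gender", "Outcome"):
    counts = distinct_values(rows, column)
    print(column, "has", len(counts), "distinct values:", sorted(counts))
```

A column with only a handful of repeated values is a good candidate for categorical. On a pandas DataFrame, `real_data['Outcome'].value_counts()` gives the same kind of summary.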
Creating the Synthesizer
A Synthesizer is an object we use to create synthetic data using machine learning. Below are the steps:
- You’ll start by creating a synthesizer based on your metadata
- Next, you’ll train the synthesizer using real data. In this phase, the synthesizer will learn patterns from the real data.
- Once your synthesizer is trained, you can use it to generate new, synthetic data.
SDV supports multiple synthesizers; here we use the Gaussian Copula synthesizer. You can check the supported single-table synthesizers here.
from sdv.single_table import GaussianCopulaSynthesizer
# Step 1: Create the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
# Step 2: Train the synthesizer
synthesizer.fit(real_data)
# Step 3: Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=500)
print(synthetic_data.head(20))
You can also customize your synthesizer, but that is out of scope for this article. Read more about customization for single table synthesizers here.
Sampling
You can generate large amounts of synthetic data. In the section above we generated only 500 rows, but you can just as well generate a million rows and save them to a CSV file.
synthetic_data = synthesizer.sample(num_rows=1_000_000)
# save the data as a CSV
synthetic_data.to_csv('synthetic_data.csv', index=False)
More about single table sampling can be found here.
Quality of the Synthetic data
Furthermore, you can evaluate and visualize the synthetic data against the real data. Using SDV, you can diagnose problems in the synthetic data, evaluate its quality, and visualize it. These checks help you find and fix problems before relying on the synthetic data.
diagnostic = run_diagnostic(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata
)
quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata
)
The diagnostic report performs basic checks to ensure the synthetic data is valid, while the quality report compares the statistical similarity of the real and synthetic data. You can also visualize the real and synthetic data side by side. More information about these topics here.
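To give a feel for what statistical similarity can mean for a categorical column, the sketch below compares category frequencies in two lists and returns 1 minus the total variation distance, so 1.0 means identical distributions and 0.0 means no overlap. This is a hand-written illustration and should not be read as SDV's exact scoring; the function name `tv_complement` and the sample values are assumptions.

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """Return 1 - total variation distance between two category distributions."""
    if not real_values or not synthetic_values:
        raise ValueError("both inputs must be non-empty")
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    # Total variation distance: half the summed absolute frequency differences
    distance = 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )
    return 1.0 - distance


real_outcomes = ["Stable", "Stable", "Improved", "Declined"]
synthetic_outcomes = ["Stable", "Improved", "Improved", "Declined"]
print(tv_complement(real_outcomes, synthetic_outcomes))  # 0.75
```

A score close to 1.0 for a column such as Outcome suggests the synthetic data reproduces the category proportions of the real data reasonably well.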
Benefits and Challenges in Generating Synthetic Data
Generating synthetic patient data with SDV presents several benefits, including privacy preservation and facilitating research where real data is scarce or sensitive. However, challenges like maintaining data accuracy and managing complex medical data structures are significant concerns. These aspects are crucial for the reliability and effectiveness of the synthetic data generated.
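As one concrete way to probe the privacy benefit, the sketch below counts synthetic rows that exactly reproduce a real row once the identifier column is ignored. It is a simple sanity check under stated assumptions, not a formal privacy guarantee; the `count_exact_copies` helper and the sample rows are hypothetical.

```python
def count_exact_copies(real_rows, synthetic_rows, ignore=("Patient ID",)):
    """Count synthetic rows whose non-ignored fields match some real row exactly."""

    def key(row):
        # Sort by column name so the comparison does not depend on column order
        return tuple(sorted((k, v) for k, v in row.items() if k not in ignore))

    real_keys = {key(row) for row in real_rows}
    return sum(1 for row in synthetic_rows if key(row) in real_keys)


real_rows = [
    {"Patient ID": 6884, "Age": 79, "Gender": "M", "Outcome": "Remission"},
    {"Patient ID": 3000, "Age": 40, "Gender": "F", "Outcome": "Stable"},
]
synthetic_rows = [
    {"Patient ID": 1111, "Age": 79, "Gender": "M", "Outcome": "Remission"},  # same fields as a real row
    {"Patient ID": 2222, "Age": 41, "Gender": "F", "Outcome": "Stable"},     # differs in Age
]
print(count_exact_copies(real_rows, synthetic_rows))  # 1
```

With only a few low-cardinality columns, some matches can occur purely by chance, so a non-zero count is a prompt for closer inspection rather than proof of a leak.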
Conclusion
Generating synthetic data with SDV offers a promising path for healthcare research, balancing data accessibility with privacy and ethics. By following the steps in this guide, researchers and data scientists can explore new possibilities in medical research and analytics. This article is a basic introduction to generating synthetic data with SDV. You can find more SDV features on its documentation website.