Synthetic Data Generation Workshop container for building realistic synthetic datasets based on existing data using Megatron-LM.
Synthetic Data Generation Workshop container
This container is for use with the Synthetic Data Generation Workshop. Once the container is pulled, it can be run using the following command:
Port 8888 is for running the Jupyter server and port 6006 is for viewing TensorBoard during model training.
This container has an end-to-end workflow involving:
- Data prep and ETL
- GPT-2 model training with Megatron-LM
- Inference
- Evaluation of Synthetic Data
Synthetic data generation is a data augmentation technique that can make models more robust by supplying additional training data.
An ideal synthetic dataset generated from real data shares the following with the real data:
- the same features (columns)
- the same data type for each feature (integer, float, string, etc.)
- the same distributions in an individual column
- the same joint distributions when considering multiple columns
- the same conditional distributions (i.e., the distribution of one column when a condition is applied to another)
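The properties listed above can be checked with simple statistics. The following is a minimal sketch, assuming both datasets are lists of dicts with categorical columns. The helper names (`same_schema`, `marginal`, `joint`, `conditional`, `total_variation`) and the toy rows are hypothetical and are not part of the workshop container.

```python
from collections import Counter

def same_schema(real, synth):
    """Same column names and same Python type per column (checked on the first row only)."""
    r, s = real[0], synth[0]
    return r.keys() == s.keys() and all(type(r[k]) is type(s[k]) for k in r)

def marginal(rows, col):
    """Empirical distribution of a single column."""
    counts = Counter(r[col] for r in rows)
    return {v: c / len(rows) for v, c in counts.items()}

def joint(rows, cols):
    """Empirical joint distribution over several columns."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return {v: c / len(rows) for v, c in counts.items()}

def conditional(rows, target, given, value):
    """Distribution of `target` restricted to rows where `given == value`."""
    subset = [r for r in rows if r[given] == value]
    return marginal(subset, target) if subset else {}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 means identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Toy data, invented for illustration only.
real = [
    {"city": "A", "plan": "basic"},
    {"city": "A", "plan": "pro"},
    {"city": "B", "plan": "basic"},
    {"city": "B", "plan": "basic"},
]
synth = [
    {"city": "A", "plan": "pro"},
    {"city": "B", "plan": "basic"},
    {"city": "A", "plan": "basic"},
    {"city": "B", "plan": "basic"},
]

print("schema matches:", same_schema(real, synth))
print("marginal distance (city):",
      total_variation(marginal(real, "city"), marginal(synth, "city")))
print("joint distance (city, plan):",
      total_variation(joint(real, ["city", "plan"]), joint(synth, ["city", "plan"])))
print("conditional distance (plan | city == 'A'):",
      total_variation(conditional(real, "plan", "city", "A"),
                      conditional(synth, "plan", "city", "A")))
```

A distance near 0 indicates that the synthetic data reproduces that distribution closely. Continuous columns would need a different comparison, such as binning the values first.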
A synthetic data generator is a model that is trained on the real data and then used to create new synthetic data with the properties described above.
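The train-then-generate pattern can be illustrated with a toy stand-in. This is only a sketch of the interface, not the GPT-2 and Megatron-LM approach used in the workshop: the hypothetical `EmpiricalGenerator` class below simply memorizes how often each row occurs and resamples from those frequencies.

```python
import random
from collections import Counter

class EmpiricalGenerator:
    """Toy generator (hypothetical): learns row frequencies and resamples them."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._columns = []
        self._values = []
        self._weights = []

    def fit(self, rows):
        """'Train' by counting how often each full row occurs in the real data."""
        self._columns = list(rows[0].keys())
        counts = Counter(tuple(r[c] for c in self._columns) for r in rows)
        self._values = list(counts.keys())
        self._weights = list(counts.values())
        return self

    def sample(self, n):
        """Draw n synthetic rows in proportion to the learned frequencies."""
        draws = self._rng.choices(self._values, weights=self._weights, k=n)
        return [dict(zip(self._columns, d)) for d in draws]

# Toy data, invented for illustration only.
real_rows = [
    {"city": "A", "plan": "basic"},
    {"city": "A", "plan": "pro"},
    {"city": "B", "plan": "basic"},
]

generator = EmpiricalGenerator(seed=42).fit(real_rows)
synthetic_rows = generator.sample(5)
print(synthetic_rows)
```

Because this toy version can only repeat rows it has already seen, it preserves the column names and data types but cannot produce novel records. A learned model is used instead so that new rows can be generated while following the same distributions.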