SDG Workshop | NVIDIA NGC

Synthetic Data Generation Workshop container for building realistic synthetic datasets based on existing data using Megatron-LM.

Synthetic Data Generation Workshop container

This container is for use with the Synthetic Data Generation Workshop. Once the container is pulled, it can be run using the following command:

docker run --gpus all -it --rm -p 8888:8888 -p 6006:6006 sdg_workshop:3.1 jupyter lab --NotebookApp.token ''

Port 8888 is for running the Jupyter server and port 6006 is for viewing TensorBoard during model training.

This container includes an end-to-end workflow covering:

Data prep and ETL
GPT-2 model training with Megatron-LM
Inference
Evaluation of synthetic data

Synthetic data generation is a data augmentation technique that can make models more robust by supplying additional training data.

An ideal synthetic dataset generated from real data shares the following with the real data:

the same features (columns)
the same data type for each feature (integer, float, string, etc.)
the same distribution within each individual column
the same joint distributions across multiple columns
the same conditional distributions (i.e., the distribution of one column after applying a condition on another)
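
The properties listed above can be checked roughly in code. The following is a minimal sketch using only the Python standard library, not the evaluation code shipped in the workshop. The function names (`pearson`, `compare_datasets`), the column names, and the sample values are all hypothetical, and comparing means and a Pearson correlation is only a crude stand-in for comparing full distributions.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def conditional_mean(data, target, cond_col, threshold):
    """Mean of `target` over rows where `cond_col` exceeds `threshold`."""
    kept = [t for t, c in zip(data[target], data[cond_col]) if c > threshold]
    return statistics.fmean(kept)


def compare_datasets(real, synth, col_a, col_b, threshold):
    """Compare two datasets given as {column name: list of values}."""
    shared = [c for c in real if c in synth]
    numeric = [c for c in shared if isinstance(real[c][0], (int, float))]
    return {
        # Same features (columns).
        "same_columns": set(real) == set(synth),
        # Same data type per feature (checked on the first value only).
        "same_types": all(type(real[c][0]) is type(synth[c][0]) for c in shared),
        # Per-column distributions, summarized here by the mean.
        "mean_gap": {
            c: abs(statistics.fmean(real[c]) - statistics.fmean(synth[c]))
            for c in numeric
        },
        # Joint distribution, summarized here by one pairwise correlation.
        "corr_gap": abs(
            pearson(real[col_a], real[col_b]) - pearson(synth[col_a], synth[col_b])
        ),
        # Conditional distribution: mean of col_b given col_a > threshold.
        "cond_gap": abs(
            conditional_mean(real, col_b, col_a, threshold)
            - conditional_mean(synth, col_b, col_a, threshold)
        ),
    }


# Hypothetical toy data for illustration only.
real = {"age": [20, 30, 40, 50, 60], "income": [20.0, 31.0, 39.0, 52.0, 58.0]}
synth = {"age": [22, 29, 41, 49, 61], "income": [21.0, 30.0, 41.0, 50.0, 60.0]}

report = compare_datasets(real, synth, "age", "income", threshold=35)
print(report)
```

Small gaps suggest the synthetic data tracks the real data on these summaries. A fuller evaluation would compare whole distributions (for example with histograms or a two-sample test) rather than single statistics.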

A synthetic data generator is a model that is trained on the real data and then used to create new synthetic data with the properties described above.
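
To make the train-then-generate pattern concrete, here is a toy sketch. It is not the Megatron-LM GPT approach used in the workshop. It swaps in a much simpler technique, a smoothed bootstrap (resample a real row, then add small Gaussian noise), and handles numeric columns only. The class name, the `fit`/`sample` interface, and the `bandwidth` parameter are assumptions made for illustration.

```python
import random
import statistics


class SmoothedBootstrapGenerator:
    """Toy generator: pick a real row at random, then jitter each value."""

    def __init__(self, bandwidth=0.1, seed=0):
        self.bandwidth = bandwidth  # noise scale as a fraction of column std
        self.rng = random.Random(seed)
        self.rows = []
        self.scales = []

    def fit(self, rows):
        """'Train' by storing the real rows and each column's std deviation."""
        self.rows = [tuple(r) for r in rows]
        columns = list(zip(*self.rows))
        self.scales = [statistics.pstdev(col) for col in columns]
        return self

    def sample(self, n):
        """Create n synthetic rows with the same number of columns."""
        out = []
        for _ in range(n):
            base = self.rng.choice(self.rows)
            out.append(tuple(
                value + self.rng.gauss(0.0, self.bandwidth * scale)
                for value, scale in zip(base, self.scales)
            ))
        return out


# Hypothetical real data: two correlated numeric columns.
real_rows = [(float(x), 2.0 * x + 1.0) for x in range(100)]

generator = SmoothedBootstrapGenerator(bandwidth=0.1, seed=0).fit(real_rows)
synthetic_rows = generator.sample(5)
print(synthetic_rows)
```

Because each synthetic row starts from a complete real row, relationships between columns are roughly preserved. The cost is that samples stay close to the training data, which is one reason a learned model is used for realistic generation.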

Publisher: NVIDIA
Latest Tag: 1
Updated: August 2, 2022 UTC
Compressed Size: 13.57 GB
Multinode Support: No
Multi-Arch Support: No