MK-SQuIT

Description

MK-SQuIT (Synthesizing Questions using Iterative Template-Filling) is a synthetic dataset generated with little human intervention. This container provides several tutorial applications: an interactive dataset explorer, a walkthrough of the generation pipeline, and a demonstration that uses NeMo to fine-tune and evaluate a model on the dataset.

Publisher

MeetKai Inc.

Latest Tag

1

Modified

October 18, 2021

Compressed Size

4.3 GB

Multinode Support

No

Multi-Arch Support

No

MK-SQuIT Demo Container

MK-SQuIT is a synthetic dataset containing English and SPARQL query pairs. The question-query pairs are assembled with very little human intervention, sidestepping the tedious and costly process of hand-labeling data. A neural machine translation model can then be trained on such a dataset, allowing lay users to access information-rich knowledge graphs without an understanding of query syntax.

The MK-SQuIT Demo Container provides an end-to-end framework for this process - from generating a Text2Sparql dataset to training a model for the task.

The container is packaged with all the data necessary for each stage. Most notably, you will find a generated example dataset and a fine-tuned baseline model. Additionally, we have provided several easy-to-use tutorial notebooks:

  • Explore-Dataset.ipynb: Interact with the dataset through TensorFlow Projector.
  • Generation-Pipeline.ipynb: A walkthrough of the generation pipeline.
  • Neural_Machine_Translation-Text2Sparql.ipynb: A tutorial for training and evaluating our baseline model with NeMo.

Installation and Getting Started

Prerequisites

Ensure that the following requirements are met:

  • 9.4 GB of free disk space
  • 12 GB of available RAM

To follow the NeMo tutorial, these requirements must also be met:

  • A GPU with at least 11 GB of memory
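
The disk and RAM requirements above can be checked before pulling the image. The following is a minimal sketch rather than part of the container: the function names are hypothetical, it treats 1 GB as 10^9 bytes, and the RAM query relies on `os.sysconf`, which assumes a Linux host. It does not check GPU memory.

```python
import os
import shutil

GB = 10 ** 9  # assumption: the listed sizes are decimal gigabytes

# Thresholds copied from the prerequisites listed above.
REQUIRED_DISK_GB = 9.4
REQUIRED_RAM_GB = 12


def meets_requirement(available_bytes: int, required_gb: float) -> bool:
    """Return True if available_bytes covers required_gb gigabytes."""
    return available_bytes >= required_gb * GB


def free_disk_bytes(path: str = ".") -> int:
    """Free space on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def total_ram_bytes() -> int:
    """Total physical memory (Linux-specific: page size times page count)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def report(path: str = ".") -> dict:
    """Summarize whether the host satisfies the disk and RAM requirements."""
    return {
        "disk_ok": meets_requirement(free_disk_bytes(path), REQUIRED_DISK_GB),
        "ram_ok": meets_requirement(total_ram_bytes(), REQUIRED_RAM_GB),
    }
```

Calling `report("/var/lib/docker")` would check the filesystem where Docker typically stores images on Linux; pass whichever path applies to your setup.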

Running the NGC Container

1. Download the container from NGC

docker pull nvcr.io/isv-nemo-partners/meetkai/mk-squit:1

2. Run the application

docker run --gpus all -it --rm --name mk_squit \
    --ipc=host \
    -p 8888:8888 \
    -p 6006:6006 \
    nvcr.io/isv-nemo-partners/meetkai/mk-squit:1

3. Access the server

Use the following link in a browser:

http://localhost:8888
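
If the page does not load, it may help to confirm that the ports published by the docker run command (8888 and 6006) are actually accepting connections. The helper below is an illustrative sketch; `port_is_open` is a hypothetical name and not something shipped in the container.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_published_ports(host: str = "localhost", ports=(8888, 6006)) -> dict:
    """Map each published port to whether it currently accepts connections."""
    return {port: port_is_open(host, port) for port in ports}
```

A result such as `{8888: False, ...}` suggests the container is not running or the port mapping was omitted.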

On startup, all notebooks are readily available. They do not need to be run in any particular order.

Note: Some manual annotation is required within the generation pipeline. However, generation-ready data is included so that this step can be bypassed.

Multi-GPU

To train and evaluate with multiple GPUs, the scripts must be run from the terminal, because PyTorch Lightning's DDP (distributed data parallel) mode cannot be used within Jupyter notebooks.

1. Run the container and access it using bash:

docker container ls  # Take note of the container ID (or use the name mk_squit)
docker exec -it {CONTAINER_ID} bash

2. Run scripts from workspace root:

# Define paths:
SRC_DATA_DIR=./out
TGT_DATA_DIR=./out
# Download and reformat the dataset for NeMo
python3 ./model/data/import_datasets.py \
    --source_data_dir $SRC_DATA_DIR \
    --target_data_dir $TGT_DATA_DIR
# Train with all GPUs
python3 ./model/text2sparql.py \
    model.train_ds.filepath="$TGT_DATA_DIR"/train.tsv \
    model.validation_ds.filepath="$TGT_DATA_DIR"/test_easy.tsv \
    model.test_ds.filepath="$TGT_DATA_DIR"/test_hard.tsv \
    model.batch_size=32 \
    model.nemo_path="$TGT_DATA_DIR"/NeMo_logs/bart.nemo \
    exp_manager.exp_dir="$TGT_DATA_DIR"/NeMo_logs \
    trainer.gpus=-1
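
Before launching a long training run, it can be worth confirming that the import step produced the three files the training command points at (train.tsv, test_easy.tsv, test_hard.tsv). The sketch below only counts non-empty tab-separated rows and makes no assumption about the column layout; the function names are hypothetical.

```python
import csv
from pathlib import Path

# File names taken from the training command above.
EXPECTED_FILES = ("train.tsv", "test_easy.tsv", "test_hard.tsv")


def count_rows(tsv_path: Path) -> int:
    """Count non-empty rows in a tab-separated file."""
    with tsv_path.open(newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters inside queries from merging rows.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return sum(1 for row in reader if row)


def summarize(data_dir: str) -> dict:
    """Map each expected file name to its row count, or None if it is missing."""
    summary = {}
    for name in EXPECTED_FILES:
        path = Path(data_dir) / name
        summary[name] = count_rows(path) if path.is_file() else None
    return summary
```

Running `summarize("./out")` from the workspace root would report any file that is missing (`None`) or empty (`0`).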

Generation Pipeline

Generating a Text2Sparql dataset is done through the following steps:

  1. Gather raw data from Wikidata (entities and properties).
  2. Preprocess the data: Clean and aggregate related fields, and add a type field that is used by the pipeline.
  3. Annotate the type fields: This is the only required manual stage, but it substantially improves the generation of rational queries.
  4. Generate type list: Consolidate the annotated data into a file used by the pipeline.
  5. Generate dataset: Generate iterative templates and build question-query pairs.
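
To give a feel for the final step, here is a toy sketch of template filling with a type check. It is not the MK-SQuIT implementation: the templates, field names, and `generate_pairs` function are hypothetical, and the data is a tiny hand-written sample. It only illustrates why type annotations matter: a property is paired with an entity only when their types agree, which filters out nonsensical questions.

```python
from itertools import product

# Hypothetical toy data; the real pipeline reads annotated files from data/.
ENTITIES = [
    {"label": "Douglas Adams", "id": "Q42", "type": "person"},
]
PROPERTIES = [
    {"label": "place of birth", "id": "P19", "domain": "person"},
    {"label": "director", "id": "P57", "domain": "movie"},
]

# Hypothetical single-hop templates for the English and SPARQL sides.
QUESTION_TEMPLATE = "What is the {prop} of {entity}?"
QUERY_TEMPLATE = "SELECT ?answer WHERE {{ wd:{entity_id} wdt:{prop_id} ?answer . }}"


def generate_pairs(entities, properties):
    """Fill both templates for every type-compatible entity/property pair."""
    pairs = []
    for entity, prop in product(entities, properties):
        if prop["domain"] != entity["type"]:
            continue  # skip combinations such as "director of <person>"
        question = QUESTION_TEMPLATE.format(
            prop=prop["label"], entity=entity["label"]
        )
        query = QUERY_TEMPLATE.format(
            entity_id=entity["id"], prop_id=prop["id"]
        )
        pairs.append((question, query))
    return pairs
```

With the sample data, only the "place of birth" property survives the type check, so a single question-query pair is produced.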

The source code is located in mk_squit/generation.

Directory Structure

The application is divided into the following structure:

  • data: Raw and annotated data used for generation.
  • mk_squit: Generation source code.
  • model: Scripts to train / evaluate the baseline model (pulled from nvidia/NeMo/examples/nlp/text2sparql).
  • out: Output data folder containing the generated dataset, fine-tuned baseline BART model, and tf-projector metadata.
  • scripts: Additional scripts, most importantly the code required for gathering and preprocessing Wikidata.
.
|-- Dockerfile
|-- Explore-Dataset.ipynb
|-- Generation-Pipeline.ipynb
|-- LICENSE
|-- 2011.02566.pdf
|-- Neural_Machine_Translation-Text2Sparql.ipynb
|-- README.md
|-- data
|   |-- base_templates.json
|   |-- chemical-5k-preprocessed.json
|   |-- chemical-5k.json
|   |-- literary_work-5k-preprocessed.json
|   |-- literary_work-5k.json
|   |-- literary_work-props-preprocessed.json
|   |-- literary_work-props.json
|   |-- movie-5k-preprocessed.json
|   |-- movie-5k.json
|   |-- movie-props-preprocessed.json
|   |-- movie-props.json
|   |-- person-5k-preprocessed.json
|   |-- person-5k.json
|   |-- person-props-preprocessed.json
|   |-- person-props.json
|   |-- pos-examples.txt
|   |-- television_series-5k-preprocessed.json
|   |-- television_series-5k.json
|   |-- television_series-props-preprocessed.json
|   |-- television_series-props.json
|   `-- type-list-autogenerated.json
|-- mk_squit
|   |-- __init__.py
|   |-- generation
|   |   |-- __init__.py
|   |   |-- full_query_generator.py
|   |   |-- predicate_bank.py
|   |   |-- template_filler.py
|   |   |-- template_generator.py
|   |   `-- type_generator.py
|   `-- utils
|       |-- __init__.py
|       |-- entity_resolver.py
|       `-- metrics.py
|-- model
|   |-- README.md
|   |-- conf
|   |   `-- text2sparql_config.yaml
|   |-- data
|   |   `-- import_datasets.py
|   |-- evaluate.sh
|   |-- evaluate_text2sparql.py
|   |-- params.sh
|   |-- score_predictions.py
|   |-- setup.sh
|   |-- text2sparql.py
|   `-- train.sh
|-- out
|   |-- NeMo_logs
|   |   `-- bart.nemo
|   |-- test_easy_queries_v3.tsv
|   |-- test_hard_queries_v3.tsv
|   |-- tf-projector
|   |   |-- meta.tsv
|   |   `-- vecs.tsv
|   |-- train_queries_v3.tsv
|   `-- workspaces
|       `-- lab-a511.jupyterlab-workspace
|-- requirements.txt
`-- scripts
    |-- gather_wikidata.py
    |-- generate_type_list.py
    |-- preprocess.py
    |-- stats
    |   `-- calculate_stats.py
    `-- tf_projector
        `-- generate_embeddings.py

License

This container includes code derived from nvidia/NeMo, which is licensed under the Apache License 2.0.

Any code sourced exclusively by MeetKai Inc. for MK-SQuIT is released under the MIT license; see the LICENSE file within the container.