MK-SQuIT is a synthetic dataset of paired English questions and SPARQL queries. Question-query pairs are assembled with very little human intervention, sidestepping the tedious and costly process of hand-labeling data. A neural machine translation model can then be trained on such a dataset, allowing non-expert users to access information-rich knowledge graphs without any understanding of query syntax.
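For illustration, a pair in such a dataset might look like the following (a hypothetical example; the exact entity bracketing and predicate IDs are determined by the generation pipeline):

what is the director of [Inception]
SELECT ?end WHERE { [ Inception ] wdt:P57 ?end }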
The MK-SQuIT Demo Container provides an end-to-end framework for this process, from generating a Text2Sparql dataset to training a model for the task.
The container is packaged with all data necessary for each stage. Most notably, you will find a generated example dataset and a fine-tuned baseline model. Additionally, we have provided several easy-to-use tutorial notebooks:
- Explore-Dataset.ipynb
- Generation-Pipeline.ipynb
- Neural_Machine_Translation-Text2Sparql.ipynb
Ensure that the following requirements are met:
To follow the NeMo tutorial, these requirements must also be met:
1. Download the container from NGC
docker pull nvcr.io/isv-nemo-partners/meetkai/mk-squit:1
2. Run the application
docker run --gpus all -it --rm --name mk_squit \
--ipc=host \
-p 8888:8888 \
-p 6006:6006 \
nvcr.io/isv-nemo-partners/meetkai/mk-squit:1
3. Access the server
Use the following link in a browser:
http://localhost:8888
On startup, all notebooks are readily available. They do not need to be run in any particular order.
Note: Some manual annotation is required within the generation pipeline. However, generation-ready data is included so that this step can be bypassed.
To train and evaluate with multiple GPUs, the scripts must be run from a terminal, because PyTorch Lightning's DDP (Distributed Data Parallel) mode cannot be used within Jupyter notebooks.
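For reference, a minimal sketch of what the training script configures under the hood, assuming the PyTorch Lightning 1.x API shipped with the container (names here are illustrative, not the script's actual code):

import pytorch_lightning as pl

# DDP launches one worker process per GPU, which is why training must be
# started from a script rather than from a notebook kernel.
trainer = pl.Trainer(gpus=-1, accelerator="ddp")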
1. Run the container and access it using bash:
docker container ls  # Take note of the container ID
docker exec -it {CONTAINER_ID} bash
2. Run scripts from workspace root:
# Define paths:
SRC_DATA_DIR=./out
TGT_DATA_DIR=./out
# Download and reformat the dataset for NeMo
python3 ./model/data/import_datasets.py \
--source_data_dir $SRC_DATA_DIR \
--target_data_dir $TGT_DATA_DIR
# Train with all GPUs
python3 ./model/text2sparql.py \
model.train_ds.filepath="$TGT_DATA_DIR"/train.tsv \
model.validation_ds.filepath="$TGT_DATA_DIR"/test_easy.tsv \
model.test_ds.filepath="$TGT_DATA_DIR"/test_hard.tsv \
model.batch_size=32 \
model.nemo_path="$TGT_DATA_DIR"/NeMo_logs/bart.nemo \
exp_manager.exp_dir="$TGT_DATA_DIR"/NeMo_logs \
trainer.gpus=-1
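After training, the fine-tuned model translates questions into SPARQL at inference time. As a rough sketch of the underlying seq2seq approach (a minimal example using a generic pretrained BART through the Hugging Face transformers API; this is not the container's bart.nemo checkpoint or NeMo's loading code):

# Minimal sketch: seq2seq translation of a question into a query string.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

question = "what is the director of [Inception]"
inputs = tokenizer(question, return_tensors="pt")

# Beam search decoding; a model fine-tuned on MK-SQuIT pairs would emit
# a SPARQL query here, while the generic checkpoint will not.
output_ids = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))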
Generating a Text2Sparql dataset is done through the following steps:
1. Gather raw entity and property data from Wikidata (scripts/gather_wikidata.py).
2. Preprocess the raw data and auto-generate a type list (scripts/preprocess.py, scripts/generate_type_list.py).
3. Generate question-query templates and fill them with sampled entities and predicates (mk_squit/generation).
A full walkthrough is provided in Generation-Pipeline.ipynb.
Source code is found at: mk_squit/generation
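Reduced to a toy sketch, template filling works roughly as follows (hypothetical template strings; the real logic lives in template_filler.py and full_query_generator.py):

# Toy sketch of template filling: a text template and a matching SPARQL
# template are filled with the same sampled entity and predicate.
text_template = "what is the {prop} of {ent}"
sparql_template = "SELECT ?end WHERE {{ [ {ent} ] wdt:{pid} ?end }}"

entity = "Inception"            # sampled from the *-5k entity files
prop, pid = "director", "P57"   # sampled from the *-props predicate files

question = text_template.format(prop=prop, ent=entity)
query = sparql_template.format(ent=entity, pid=pid)
print(question)  # what is the director of Inception
print(query)     # SELECT ?end WHERE { [ Inception ] wdt:P57 ?end }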
The application is divided into the following structure:
.
|-- Dockerfile
|-- Explore-Dataset.ipynb
|-- Generation-Pipeline.ipynb
|-- LICENSE
|-- 2011.02566.pdf
|-- Neural_Machine_Translation-Text2Sparql.ipynb
|-- README.md
|-- data
| |-- base_templates.json
| |-- chemical-5k-preprocessed.json
| |-- chemical-5k.json
| |-- literary_work-5k-preprocessed.json
| |-- literary_work-5k.json
| |-- literary_work-props-preprocessed.json
| |-- literary_work-props.json
| |-- movie-5k-preprocessed.json
| |-- movie-5k.json
| |-- movie-props-preprocessed.json
| |-- movie-props.json
| |-- person-5k-preprocessed.json
| |-- person-5k.json
| |-- person-props-preprocessed.json
| |-- person-props.json
| |-- pos-examples.txt
| |-- television_series-5k-preprocessed.json
| |-- television_series-5k.json
| |-- television_series-props-preprocessed.json
| |-- television_series-props.json
| `-- type-list-autogenerated.json
|-- mk_squit
| |-- __init__.py
| |-- generation
| | |-- __init__.py
| | |-- full_query_generator.py
| | |-- predicate_bank.py
| | |-- template_filler.py
| | |-- template_generator.py
| | `-- type_generator.py
| `-- utils
| |-- __init__.py
| |-- entity_resolver.py
| `-- metrics.py
|-- model
| |-- README.md
| |-- conf
| | `-- text2sparql_config.yaml
| |-- data
| | `-- import_datasets.py
| |-- evaluate.sh
| |-- evaluate_text2sparql.py
| |-- params.sh
| |-- score_predictions.py
| |-- setup.sh
| |-- text2sparql.py
| `-- train.sh
|-- out
| |-- NeMo_logs
| | `-- bart.nemo
| |-- test_easy_queries_v3.tsv
| |-- test_hard_queries_v3.tsv
| |-- tf-projector
| | |-- meta.tsv
| | `-- vecs.tsv
| |-- train_queries_v3.tsv
| `-- workspaces
| `-- lab-a511.jupyterlab-workspace
|-- requirements.txt
`-- scripts
|-- gather_wikidata.py
|-- generate_type_list.py
|-- preprocess.py
|-- stats
| `-- calculate_stats.py
`-- tf_projector
`-- generate_embeddings.py
This container includes code derived from nvidia/NeMo, which is licensed under the Apache License 2.0.
Any code authored exclusively by MeetKai Inc. for MK-SQuIT is covered by the MIT license included within the container.