NVIDIA

NVIDIA Time Series Prediction Platform

NVIDIA Time Series Prediction Platform is a tool designed to make it easy to compare and experiment with arbitrary combinations of forecasting models, time-series datasets, and other configurations.

4_MultiGpu_HpSearch.ipynb

In [1]:

# Copyright 2023 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

This notebook assumes that the Vertex-AI instance has more than one GPU.
Please set the num_gpus variable accordingly.

In [2]:

%load_ext autoreload
%autoreload 2

NOTE: These notebooks are designed to highlight different features of NVIDIA-TSPP. For this reason, all the examples are set up to run quickly, with only a few iterations. The parameters should be tuned to get optimal results.

In [3]:

# This notebook assumes access to 2 NVIDIA GPUs. Set num_gpus to match your instance.
num_gpus=2

Topics Covered

Multi-GPU Training
Hyper-Parameter Search

Setup

Imports

In [4]:

import hydra
import warnings
import torch
from omegaconf import OmegaConf
import os
from hydra import compose, initialize
from hydra.core.global_hydra import GlobalHydra
from hydra.core.hydra_config import HydraConfig
from hydra.utils import get_original_cwd
import conf.conf_utils
from hydra_utils import get_config
from data.data_utils import Preprocessor
from training.utils import set_seed

warnings.filterwarnings("ignore")

curr_workdir = globals()['_dh'][0]

/usr/local/lib/python3.8/dist-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Using backend: pytorch

In [5]:

# This notebook runs in the NVIDIA-TSPP container, where the TSPP code is available in the /workspace directory.
tspp_ws = "/workspace"

For this notebook, we will again use the electricity dataset and the Temporal Fusion Transformer (TFT) model for training. See the 1_TsppOverview notebook for data download and preprocessing instructions, as well as an in-depth description of training.

In [6]:

dataset = "electricity"
model="tft"

Dataset Download

In [7]:

# Download the dataset. Leave skip_download set to False to download it; set it to True if the dataset is already present.
os.chdir(curr_workdir)
skip_download = False

# DOWNLOAD DATASET
if not skip_download:
    !python {tspp_ws}/data/script_download_data.py --dataset {dataset} --output_dir {tspp_ws}/datasets

Using backend: pytorch
#### Running download script ###
Getting electricity data...
/workspace/datasets/electricity
Pulling data from https://archive.ics.uci.edu/ml/machine-learning-databases/00321/LD2011_2014.txt.zip to /workspace/datasets/electricity/LD2011_2014.txt.zip
100% [..................................................] 261335609 / 261335609done
Unzipping file: /workspace/datasets/electricity/LD2011_2014.txt.zip
Done.
Aggregating to hourly data
Done.
Download completed.

Dataset Preprocessing

In [8]:

! python {tspp_ws}/launch_preproc.py dataset={dataset}

{'_target_': 'data.data_utils.Preprocessor', 'config': {'graph': False, 'source_path': '/workspace/datasets/electricity/electricity.csv', 'dest_path': '/workspace/datasets/electricity/', 'time_ids': 'days_from_start', 'train_range': [0, 1315], 'valid_range': [1308, 1339], 'test_range': [1332, 10000], 'dataset_stride': 1, 'scale_per_id': True, 'encoder_length': 168, 'example_length': 192, 'MultiID': False, 'features': [{'name': 'categorical_id', 'feature_type': 'ID', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 371}, {'name': 'hours_from_start', 'feature_type': 'TIME', 'feature_embed_type': 'CONTINUOUS'}, {'name': 'power_usage_weight', 'feature_type': 'WEIGHT', 'feature_embed_type': 'CONTINUOUS'}, {'name': 'power_usage', 'feature_type': 'TARGET', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}}, {'name': 'hour', 'feature_type': 'KNOWN', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 25}, {'name': 'day_of_week', 'feature_type': 'KNOWN', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 8}, {'name': 'hours_from_start', 'feature_type': 'KNOWN', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}}, {'name': 'categorical_id', 'feature_type': 'STATIC', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 371}], 'train_samples': 450000, 'valid_samples': 50000, 'binarized': True, 'time_series_count': 369}}
Reading in data from CSV File: /workspace/datasets/electricity/electricity.csv
Sorting on time feature
Mapping nodes
Mapping categoricals to bounded range
Splitting datasets
Calculating scalers
Applying scalers
Applying scalers
Applying scalers
Fixing any nans in continuous features
Fixing any nans in continuous features
Fixing any nans in continuous features
Saving preprocessor state at /workspace/datasets/electricity/tspp_preprocess.bin
Saving processed data at /workspace/datasets/electricity/

We first train the model for 1 epoch with a batch size of 1024 on a single GPU.
For all the examples below, the training criterion is set to Quantile and Automatic Mixed Precision (AMP) is enabled.

In [9]:

# Create an Output Directory
os.chdir(curr_workdir)
output_workdir = os.path.join(curr_workdir, F'outputs/4_MultiGpu_{model}_{dataset}_1GPU')
os.makedirs(output_workdir, exist_ok = True)

! python {tspp_ws}/launch_training.py \
seed=1234 \
model={model} \
dataset={dataset} \
trainer/criterion=quantile \
trainer.config.amp=True \
trainer.config.num_epochs=1 \
trainer.config.batch_size=1024 \
hydra.run.dir={output_workdir} \
+trainer.config.force_rerun=True

Using backend: pytorch
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
 Training with 1 epochs 
 Epoch 0 
Epoch 0 | step 0 |avg loss 1.332 |walltime 0.895 |
Epoch 0 | step 25 |avg loss 0.640 |walltime 2.842 |
Epoch 0 | step 50 |avg loss 0.440 |walltime 4.898 |
Epoch 0 | step 75 |avg loss 0.370 |walltime 6.775 |
Epoch 0 | step 100 |avg loss 0.306 |walltime 8.656 |
Epoch 0 | step 125 |avg loss 0.274 |walltime 10.537 |
Epoch 0 | step 150 |avg loss 0.253 |walltime 12.416 |
Epoch 0 | step 175 |avg loss 0.238 |walltime 14.284 |
Epoch 0 | step 200 |avg loss 0.233 |walltime 16.170 |
Epoch 0 | step 225 |avg loss 0.226 |walltime 18.052 |
Epoch 0 | step 250 |avg loss 0.221 |walltime 19.933 |
Epoch 0 | step 275 |avg loss 0.217 |walltime 21.815 |
Epoch 0 | step 300 |avg loss 0.217 |walltime 23.698 |
Epoch 0 | step 325 |avg loss 0.212 |walltime 25.583 |
Epoch 0 | step 350 |avg loss 0.210 |walltime 27.464 |
Epoch 0 | step 375 |avg loss 0.206 |walltime 29.376 |
Epoch 0 | step 400 |avg loss 0.206 |walltime 31.256 |
Epoch 0 | step 425 |avg loss 0.206 |walltime 33.126 |
 Calculating Validation Metrics 
 Epoch 0 Validation Metrics: {'val_loss': 0.2326} 
Epoch 0 | step event |avg loss 0.204 |walltime 35.641 |
 Saving checkpoint to best_checkpoint.zip 
 Saving checkpoint to last_checkpoint.zip 
 Training Stopped 
 MAE : 45.27942846755962  RMSE : 360.505482514993  SMAPE : 9.768091436720974  ND : 0.05960232344216114

Multi-GPU Training on NVIDIA-TSPP

In the cell below, we assume that 2 GPUs are available. With 2 GPUs, we set the batch size per GPU to 512, which gives a global batch size of 1024 for data-parallel training (the same as the single-GPU run in the previous cell).
Multi-GPU training on NVIDIA-TSPP requires minimal changes to the command line:
Set hydra/launcher=torchrun and hydra.launcher.nproc_per_node={NUMBER OF GPUs}
Specify the output directory with hydra.sweep.dir instead of hydra.run.dir
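
The batch-size arithmetic and the command-line changes described above can be sketched as follows. This is a minimal illustration, not part of TSPP: the helper names per_gpu_batch_size and multi_gpu_overrides and the output path are hypothetical, while the override strings themselves are the ones used in this notebook's training commands.

```python
def per_gpu_batch_size(global_batch_size: int, num_gpus: int) -> int:
    """Split a global batch evenly across data-parallel workers (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by num_gpus")
    return global_batch_size // num_gpus


def multi_gpu_overrides(num_gpus: int, global_batch_size: int, output_dir: str) -> list:
    """Collect the arguments that differ from the single-GPU command (hypothetical helper)."""
    batch_size = per_gpu_batch_size(global_batch_size, num_gpus)
    return [
        "-m",  # multirun flag, as used in the notebook's multi-GPU command
        "hydra/launcher=torchrun",
        f"hydra.launcher.nproc_per_node={num_gpus}",
        f"trainer.config.batch_size={batch_size}",
        f"hydra.sweep.dir={output_dir}",  # replaces hydra.run.dir
    ]


# Placeholder output path, for illustration only.
overrides = multi_gpu_overrides(num_gpus=2, global_batch_size=1024, output_dir="outputs/example")
print(" ".join(overrides))
```

Keeping the global batch size fixed at 1024 is what makes the single-GPU and 2-GPU runs comparable; with 4 GPUs the same reasoning would give 256 per GPU.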

In [10]:

# Create an Output Directory
os.chdir(curr_workdir)
output_workdir = os.path.join(curr_workdir, F'outputs/4_MultiGpu_{model}_{dataset}_{str(num_gpus)}GPUs')
os.makedirs(output_workdir, exist_ok = True)

! python {tspp_ws}/launch_training.py \
-m hydra/launcher=torchrun \
hydra.launcher.nproc_per_node={num_gpus} \
seed=1234 \
model={model} \
dataset={dataset} \
trainer/criterion=quantile \
trainer.config.amp=True \
trainer.config.num_epochs=1 \
trainer.config.batch_size=512 \
hydra.sweep.dir={output_workdir} \
+trainer.config.force_rerun=True

Using backend: pytorch
[2023-01-27 21:37:11,052][HYDRA] 	#0 : seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=512 +trainer.config.force_rerun=True
[2023-01-27 21:37:11,368][torch.distributed.elastic.rendezvous.static_tcp_rendezvous][INFO] - Creating TCPStore as the c10d::Store implementation
[2023-01-27 21:37:12,136][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-01-27 21:37:12,139][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 1
[2023-01-27 21:37:12,139][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2023-01-27 21:37:12,147][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
 Training with 1 epochs 
 Epoch 0 
Epoch 0 | step 0 |avg loss 1.363 |walltime 1.308 |
Epoch 0 | step 25 |avg loss 0.643 |walltime 2.715 |
Epoch 0 | step 50 |avg loss 0.444 |walltime 4.080 |
Epoch 0 | step 75 |avg loss 0.373 |walltime 5.482 |
Epoch 0 | step 100 |avg loss 0.314 |walltime 6.856 |
Epoch 0 | step 125 |avg loss 0.278 |walltime 8.231 |
Epoch 0 | step 150 |avg loss 0.256 |walltime 9.694 |
Epoch 0 | step 175 |avg loss 0.242 |walltime 11.064 |
Epoch 0 | step 200 |avg loss 0.231 |walltime 12.441 |
Epoch 0 | step 225 |avg loss 0.225 |walltime 13.818 |
Epoch 0 | step 250 |avg loss 0.218 |walltime 15.183 |
Epoch 0 | step 275 |avg loss 0.217 |walltime 16.550 |
Epoch 0 | step 300 |avg loss 0.214 |walltime 17.925 |
Epoch 0 | step 325 |avg loss 0.209 |walltime 19.235 |
Epoch 0 | step 350 |avg loss 0.213 |walltime 20.549 |
Epoch 0 | step 375 |avg loss 0.209 |walltime 21.857 |
Epoch 0 | step 400 |avg loss 0.208 |walltime 23.163 |
Epoch 0 | step 425 |avg loss 0.204 |walltime 24.473 |
 Calculating Validation Metrics 
 Epoch 0 Validation Metrics: {'val_loss': 0.2325} 
Epoch 0 | step event |avg loss 0.204 |walltime 26.138 |
 Saving checkpoint to best_checkpoint.zip 
 Saving checkpoint to last_checkpoint.zip 
 Training Stopped 
 MAE : 43.92580624930932  RMSE : 346.78918885139916  SMAPE : 9.63901996648538  ND : 0.05782052027014336 
[2023-01-27 21:37:46,597][torch.distributed.elastic.multiprocessing.api][WARNING] - Closing process 512 via signal SIGTERM

Hyper-Parameter Search on NVIDIA-TSPP

Hyperparameter searches can be used to find close-to-optimal hyperparameter configurations for a given model or dataset. In NVIDIA-TSPP, hyperparameter searches are driven by Optuna and are enabled by setting hydra/sweeper=optuna

The cell below runs a hyperparameter search over the learning rate, 'trainer.optimizer.lr=tag(log, interval(1e-5, 1e-2))', with the objective of minimizing Mean Absolute Error (MAE): +optuna_objectives=[MAE] hydra.sweeper.direction=[minimize]. Multiple objectives can also be optimized simultaneously: +optuna_objectives=[MAE,RMSE,SMAPE] hydra.sweeper.direction=[minimize,minimize,minimize]
The number of trials is set using hydra.sweeper.n_trials={NUMBER OF TRIALS}

More information on setting up the parameter ranges can be found in the Hydra documentation.
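
To give a feel for what the search space tag(log, interval(1e-5, 1e-2)) covers, the sketch below draws learning rates uniformly on a logarithmic scale between the two bounds. It is only an illustration of the search space: the sweeper in this notebook reports TPESampler, which chooses trial values adaptively rather than by plain random sampling, and the function name sample_log_uniform is hypothetical.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    if not 0 < low < high:
        raise ValueError("bounds must satisfy 0 < low < high")
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(1234)  # fixed seed so the illustration is repeatable
samples = [sample_log_uniform(1e-5, 1e-2, rng) for _ in range(5)]
for lr in samples:
    print(f"{lr:.3e}")
```

On a log scale, each decade (1e-5 to 1e-4, 1e-4 to 1e-3, 1e-3 to 1e-2) is equally likely to be sampled, which suits a parameter such as the learning rate whose useful values span several orders of magnitude.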

In [11]:

# Create an Output Directory
os.chdir(curr_workdir)
output_workdir = os.path.join(curr_workdir, F'outputs/4_HpSearch_{model}_{dataset}_1GPU')
os.makedirs(output_workdir, exist_ok = True)


! python {tspp_ws}/launch_training.py \
-m \
'trainer.optimizer.lr=tag(log, interval(1e-5, 1e-2))' \
seed=1234 \
model={model} \
dataset={dataset} \
trainer/criterion=quantile \
trainer.config.amp=True \
trainer.config.num_epochs=1 \
trainer.config.batch_size=1024 \
hydra/sweeper=optuna \
+optuna_objectives=[MAE] \
hydra.sweeper.direction=[minimize] \
hydra.sweeper.n_trials=2 \
hydra.sweep.dir={output_workdir} \
+trainer.config.force_rerun=True

Using backend: pytorch
[I 2023-01-27 21:37:51,922] A new study created in memory with name: no-name-eeccd462-37e0-4106-b282-fa6fdc93e404
[2023-01-27 21:37:51,922][HYDRA] Study name: no-name-eeccd462-37e0-4106-b282-fa6fdc93e404
[2023-01-27 21:37:51,922][HYDRA] Storage: None
[2023-01-27 21:37:51,922][HYDRA] Sampler: TPESampler
[2023-01-27 21:37:51,922][HYDRA] Directions: ['minimize']
[2023-01-27 21:37:51,925][HYDRA] Launching 2 jobs locally
[2023-01-27 21:37:51,925][HYDRA] 	#0 : trainer.optimizer.lr=0.007617913943598092 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=1024 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
 Training with 1 epochs 
 Epoch 0 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Epoch 0 | step 0 |avg loss 1.332 |walltime 8.682 |
Epoch 0 | step 25 |avg loss 0.626 |walltime 10.820 |
Epoch 0 | step 50 |avg loss 0.512 |walltime 12.958 |
Epoch 0 | step 75 |avg loss 0.441 |walltime 15.072 |
Epoch 0 | step 100 |avg loss 0.387 |walltime 17.171 |
Epoch 0 | step 125 |avg loss 0.345 |walltime 19.281 |
Epoch 0 | step 150 |avg loss 0.317 |walltime 21.353 |
Epoch 0 | step 175 |avg loss 0.301 |walltime 23.435 |
Epoch 0 | step 200 |avg loss 0.292 |walltime 25.525 |
Epoch 0 | step 225 |avg loss 0.279 |walltime 27.611 |
Epoch 0 | step 250 |avg loss 0.273 |walltime 29.702 |
Epoch 0 | step 275 |avg loss 0.264 |walltime 31.778 |
Epoch 0 | step 300 |avg loss 0.263 |walltime 33.873 |
Epoch 0 | step 325 |avg loss 0.255 |walltime 35.948 |
Epoch 0 | step 350 |avg loss 0.248 |walltime 38.042 |
Epoch 0 | step 375 |avg loss 0.245 |walltime 40.114 |
Epoch 0 | step 400 |avg loss 0.245 |walltime 42.205 |
Epoch 0 | step 425 |avg loss 0.241 |walltime 44.290 |
 Calculating Validation Metrics 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
 Epoch 0 Validation Metrics: {'val_loss': 0.2726} 
Epoch 0 | step event |avg loss 0.240 |walltime 55.177 |
 Saving checkpoint to best_checkpoint.zip 
 Saving checkpoint to last_checkpoint.zip 
 Training Stopped 
Using backend: pytorch
 MAE : 60.19307298214023  RMSE : 517.9206028680389  SMAPE : 11.718166139740203  ND : 0.07923348695599146 
[2023-01-27 21:38:56,226][HYDRA] 	#1 : trainer.optimizer.lr=9.691240305477588e-05 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=1024 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
 Training with 1 epochs 
 Epoch 0 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Epoch 0 | step 0 |avg loss 1.332 |walltime 8.417 |
Epoch 0 | step 25 |avg loss 0.944 |walltime 10.530 |
Epoch 0 | step 50 |avg loss 0.734 |walltime 12.616 |
Epoch 0 | step 75 |avg loss 0.638 |walltime 14.704 |
Epoch 0 | step 100 |avg loss 0.532 |walltime 16.786 |
Epoch 0 | step 125 |avg loss 0.478 |walltime 18.812 |
Epoch 0 | step 150 |avg loss 0.456 |walltime 20.894 |
Epoch 0 | step 175 |avg loss 0.443 |walltime 22.976 |
Epoch 0 | step 200 |avg loss 0.430 |walltime 25.067 |
Epoch 0 | step 225 |avg loss 0.413 |walltime 27.169 |
Epoch 0 | step 250 |avg loss 0.400 |walltime 29.261 |
Epoch 0 | step 275 |avg loss 0.385 |walltime 31.539 |
Epoch 0 | step 300 |avg loss 0.374 |walltime 33.656 |
Epoch 0 | step 325 |avg loss 0.359 |walltime 35.741 |
Epoch 0 | step 350 |avg loss 0.342 |walltime 37.892 |
Epoch 0 | step 375 |avg loss 0.326 |walltime 40.032 |
Epoch 0 | step 400 |avg loss 0.317 |walltime 42.185 |
Epoch 0 | step 425 |avg loss 0.309 |walltime 44.329 |
 Calculating Validation Metrics 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
 Epoch 0 Validation Metrics: {'val_loss': 0.3225} 
Epoch 0 | step event |avg loss 0.298 |walltime 55.212 |
 Saving checkpoint to best_checkpoint.zip 
 Saving checkpoint to last_checkpoint.zip 
 Training Stopped 
Using backend: pytorch
 MAE : 75.59034680561963  RMSE : 718.691816179415  SMAPE : 14.195638117660497  ND : 0.09950126253562437 
[2023-01-27 21:40:00,477][HYDRA] Best parameters: {'trainer.optimizer.lr': 0.007617913943598092}
[2023-01-27 21:40:00,477][HYDRA] Best value: 60.19307298214023

After running the cell above, NVIDIA-TSPP prints the best parameters found across all the trials it ran. The best parameters are also saved to {Output Directory}/optimization_results.yaml, and each trial is stored in {Output Directory}/{Trial Number}
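
The selection reported in the log can be reproduced by hand from the metric lines printed at the end of each trial. The sketch below is an assumption-laden illustration rather than TSPP or Optuna code: parse_metrics and METRIC_PATTERN are hypothetical names, and the two log lines and learning rates are copied from the output of the cell above.

```python
import re

# Matches "NAME : number" pairs as printed at the end of each training run.
METRIC_PATTERN = re.compile(r"(\w+)\s*:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_metrics(line: str) -> dict:
    """Turn a line such as ' MAE : 1.0  RMSE : 2.0 ' into {'MAE': 1.0, 'RMSE': 2.0}."""
    return {name: float(value) for name, value in METRIC_PATTERN.findall(line)}


# Learning rate of each trial mapped to its final metric line (copied from the log above).
trial_logs = {
    0.007617913943598092: " MAE : 60.19307298214023  RMSE : 517.9206028680389  SMAPE : 11.718166139740203  ND : 0.07923348695599146 ",
    9.691240305477588e-05: " MAE : 75.59034680561963  RMSE : 718.691816179415  SMAPE : 14.195638117660497  ND : 0.09950126253562437 ",
}

results = {lr: parse_metrics(line) for lr, line in trial_logs.items()}

# The objective is MAE with direction "minimize", so the best trial has the smallest MAE.
best_lr = min(results, key=lambda lr: results[lr]["MAE"])
print("best lr:", best_lr, "MAE:", results[best_lr]["MAE"])
```

With only two trials and one epoch each, this ranking mainly demonstrates the mechanics; a real search would use more trials and longer training.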

Parallel HP Search
When running a hyperparameter search on a machine with more than one GPU, we can parallelize the search with the joblib Hydra plugin, which launches multiple instances of the model with different hyperparameters on multiple GPUs in parallel. To use the plugin, specify hydra/launcher=joblib together with the number of parallel jobs, hydra.launcher.n_jobs={NUMBER OF GPUs}. For example:

In [12]:

# Create an Output Directory
os.chdir(curr_workdir)
output_workdir = os.path.join(curr_workdir, F'outputs/4_HpSearch_{model}_{dataset}_Parallel')
os.makedirs(output_workdir, exist_ok = True)


! python {tspp_ws}/launch_training.py \
-m \
'trainer.optimizer.lr=tag(log, interval(1e-5, 1e-2))' \
seed=1234 \
model={model} \
dataset={dataset} \
trainer/criterion=quantile \
trainer.config.amp=True \
trainer.config.num_epochs=1 \
trainer.config.batch_size=1024 \
hydra/launcher=joblib \
hydra/sweeper=optuna \
+optuna_objectives=[MAE] \
hydra.sweeper.direction=[minimize] \
hydra.launcher.n_jobs={num_gpus} \
hydra.sweeper.n_trials=2 \
hydra.sweep.dir={output_workdir} \
+trainer.config.force_rerun=True

Using backend: pytorch
[I 2023-01-27 21:40:06,033] A new study created in memory with name: no-name-f91f5425-2dda-4ba2-94e3-8be810083dd1
[2023-01-27 21:40:06,033][HYDRA] Study name: no-name-f91f5425-2dda-4ba2-94e3-8be810083dd1
[2023-01-27 21:40:06,033][HYDRA] Storage: None
[2023-01-27 21:40:06,033][HYDRA] Sampler: TPESampler
[2023-01-27 21:40:06,033][HYDRA] Directions: ['minimize']
[2023-01-27 21:40:06,035][HYDRA] Joblib.Parallel(n_jobs=2,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 2 jobs
[2023-01-27 21:40:06,035][HYDRA] Launching jobs, sweep output dir : /home/jupyter/outputs/4_HpSearch_tft_electricity_Parallel
[2023-01-27 21:40:06,035][HYDRA] 	#0 : trainer.optimizer.lr=7.413617454888925e-05 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=1024 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
[2023-01-27 21:40:06,035][HYDRA] 	#1 : trainer.optimizer.lr=0.0002006182960201028 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=1024 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
Using backend: pytorch
Using backend: pytorch
/usr/local/lib/python3.8/dist-packages/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use [PyTorch AMP](https://pytorch.org/docs/stable/amp.html)
  warnings.warn(msg, DeprecatedFeatureWarning)
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
/usr/local/lib/python3.8/dist-packages/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use [PyTorch AMP](https://pytorch.org/docs/stable/amp.html)
  warnings.warn(msg, DeprecatedFeatureWarning)
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
 Training with 1 epochs 
 Epoch 0 
 Training with 1 epochs 
 Epoch 0 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
/workspace/training/trainer.py:210: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
  nn.utils.clip_grad_norm(self.model.parameters(), self.config.gradient_norm)
/workspace/training/trainer.py:210: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
  nn.utils.clip_grad_norm(self.model.parameters(), self.config.gradient_norm)
Epoch 0 | step 0 |avg loss 1.332 |walltime 8.653 |
Epoch 0 | step 0 |avg loss 1.332 |walltime 8.683 |
Epoch 0 | step 25 |avg loss 0.996 |walltime 10.743 |
Epoch 0 | step 25 |avg loss 0.840 |walltime 10.772 |
Epoch 0 | step 50 |avg loss 0.757 |walltime 12.819 |
Epoch 0 | step 50 |avg loss 0.628 |walltime 12.916 |
Epoch 0 | step 75 |avg loss 0.692 |walltime 14.918 |
Epoch 0 | step 75 |avg loss 0.492 |walltime 15.047 |
Epoch 0 | step 100 |avg loss 0.600 |walltime 17.069 |
Epoch 0 | step 100 |avg loss 0.451 |walltime 17.132 |
Epoch 0 | step 125 |avg loss 0.518 |walltime 19.154 |
Epoch 0 | step 125 |avg loss 0.428 |walltime 19.228 |
Epoch 0 | step 150 |avg loss 0.481 |walltime 21.197 |
Epoch 0 | step 150 |avg loss 0.406 |walltime 21.367 |
Epoch 0 | step 175 |avg loss 0.462 |walltime 23.315 |
Epoch 0 | step 175 |avg loss 0.384 |walltime 23.463 |
Epoch 0 | step 200 |avg loss 0.449 |walltime 25.412 |
Epoch 0 | step 200 |avg loss 0.361 |walltime 25.543 |
Epoch 0 | step 225 |avg loss 0.436 |walltime 27.477 |
Epoch 0 | step 225 |avg loss 0.334 |walltime 27.630 |
Epoch 0 | step 250 |avg loss 0.427 |walltime 29.574 |
Epoch 0 | step 250 |avg loss 0.312 |walltime 29.714 |
Epoch 0 | step 275 |avg loss 0.413 |walltime 31.682 |
Epoch 0 | step 275 |avg loss 0.291 |walltime 31.804 |
Epoch 0 | step 300 |avg loss 0.403 |walltime 33.774 |
Epoch 0 | step 300 |avg loss 0.278 |walltime 33.882 |
Epoch 0 | step 325 |avg loss 0.392 |walltime 35.853 |
Epoch 0 | step 325 |avg loss 0.268 |walltime 36.008 |
Epoch 0 | step 350 |avg loss 0.382 |walltime 37.960 |
Epoch 0 | step 350 |avg loss 0.258 |walltime 38.142 |
Epoch 0 | step 375 |avg loss 0.371 |walltime 40.039 |
Epoch 0 | step 375 |avg loss 0.251 |walltime 40.220 |
Epoch 0 | step 400 |avg loss 0.361 |walltime 42.131 |
Epoch 0 | step 400 |avg loss 0.249 |walltime 42.303 |
Epoch 0 | step 425 |avg loss 0.352 |walltime 44.218 |
Epoch 0 | step 425 |avg loss 0.246 |walltime 44.379 |
 Calculating Validation Metrics 
 Calculating Validation Metrics 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
 Epoch 0 Validation Metrics: {'val_loss': 0.3629} 
Epoch 0 | step event |avg loss 0.339 |walltime 55.447 |
 Saving checkpoint to best_checkpoint.zip 
 Saving checkpoint to last_checkpoint.zip 
 Training Stopped 
 Epoch 0 Validation Metrics: {'val_loss': 0.2681} 
Epoch 0 | step event |avg loss 0.242 |walltime 55.627 |
 Saving checkpoint to best_checkpoint.zip 
 Saving checkpoint to last_checkpoint.zip 
 Training Stopped 
Using backend: pytorch
Using backend: pytorch
 MAE : 87.59028952790135  RMSE : 893.3840862081481  SMAPE : 15.592594907363159  ND : 0.11529705527477666 
 MAE : 51.18473352366787  RMSE : 393.0887328938385  SMAPE : 11.43831024946904  ND : 0.06737560843914299 
[2023-01-27 21:41:15,183][HYDRA] Best parameters: {'trainer.optimizer.lr': 0.0002006182960201028}
[2023-01-27 21:41:15,184][HYDRA] Best value: 51.18473352366787
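
As a rough check of what parallelizing the search buys, the wall-clock time of the sequential and parallel searches can be compared using the HYDRA timestamps printed in the two logs (study creation to "Best parameters"). This is a back-of-the-envelope sketch: elapsed_seconds is a hypothetical helper, and because the two searches sampled different learning rates, only the elapsed time is comparable, not the resulting metrics.

```python
from datetime import datetime

# Timestamp layout used in the Hydra log lines, e.g. "2023-01-27 21:37:51,922".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return delta.total_seconds()


# Timestamps copied from the sequential and parallel search logs in this notebook.
sequential_s = elapsed_seconds("2023-01-27 21:37:51,922", "2023-01-27 21:40:00,477")
parallel_s = elapsed_seconds("2023-01-27 21:40:06,033", "2023-01-27 21:41:15,183")
speedup = sequential_s / parallel_s

print(f"sequential: {sequential_s:.1f} s, parallel: {parallel_s:.1f} s, speedup: {speedup:.2f}x")
```

With 2 trials on 2 GPUs the parallel search finishes in a little over half the time of the sequential one; fixed start-up costs per process keep the speedup slightly below 2x.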

HP Search and Data-Parallel Training
In the cell below, we run data-parallel training with a per-GPU batch size of 512 and search for the optimal learning rate, running the trials sequentially.

In [13]:

# Create an Output Directory
os.chdir(curr_workdir)
output_workdir = os.path.join(curr_workdir, F'outputs/4_HpSearch_{model}_{dataset}_{str(num_gpus)}GPUs')
os.makedirs(output_workdir, exist_ok = True)


! python {tspp_ws}/launch_training.py \
-m \
'trainer.optimizer.lr=tag(log, interval(1e-5, 1e-2))' \
hydra/launcher=torchrun \
hydra.launcher.nproc_per_node={num_gpus} \
seed=1234 \
model={model} \
dataset={dataset} \
trainer/criterion=quantile \
trainer.config.amp=True \
trainer.config.num_epochs=1 \
trainer.config.batch_size=512 \
hydra/sweeper=optuna \
+optuna_objectives=[MAE] \
hydra.sweeper.direction=[minimize] \
hydra.sweeper.n_trials=2 \
hydra.sweep.dir={output_workdir} \
+trainer.config.force_rerun=True

Using backend: pytorch
[I 2023-01-27 21:41:23,246] A new study created in memory with name: no-name-23307da9-c9ef-4b7a-8cf4-e4f84cd54643
[2023-01-27 21:41:23,246][HYDRA] Study name: no-name-23307da9-c9ef-4b7a-8cf4-e4f84cd54643
[2023-01-27 21:41:23,246][HYDRA] Storage: None
[2023-01-27 21:41:23,246][HYDRA] Sampler: TPESampler
[2023-01-27 21:41:23,246][HYDRA] Directions: ['minimize']
[2023-01-27 21:41:23,248][HYDRA] 	#0 : trainer.optimizer.lr=0.00010443943894071373 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=512 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
[2023-01-27 21:41:23,592][torch.distributed.elastic.rendezvous.static_tcp_rendezvous][INFO] - Creating TCPStore as the c10d::Store implementation
[2023-01-27 21:41:24,384][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-01-27 21:41:24,390][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 1
[2023-01-27 21:41:24,391][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2023-01-27 21:41:24,395][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
 Training with 1 epochs 
 Epoch 0 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Epoch 0 | step 0 |avg loss 1.363 |walltime 9.372 |
Epoch 0 | step 25 |avg loss 0.929 |walltime 10.848 |
Epoch 0 | step 50 |avg loss 0.727 |walltime 12.304 |
Epoch 0 | step 75 |avg loss 0.621 |walltime 13.845 |
Epoch 0 | step 100 |avg loss 0.516 |walltime 15.320 |
Epoch 0 | step 125 |avg loss 0.473 |walltime 16.788 |
Epoch 0 | step 150 |avg loss 0.453 |walltime 18.257 |
Epoch 0 | step 175 |avg loss 0.438 |walltime 19.733 |
Epoch 0 | step 200 |avg loss 0.419 |walltime 21.197 |
Epoch 0 | step 225 |avg loss 0.407 |walltime 22.669 |
Epoch 0 | step 250 |avg loss 0.390 |walltime 24.145 |
Epoch 0 | step 275 |avg loss 0.376 |walltime 25.590 |
Epoch 0 | step 300 |avg loss 0.366 |walltime 27.116 |
Epoch 0 | step 325 |avg loss 0.347 |walltime 28.652 |
Epoch 0 | step 350 |avg loss 0.337 |walltime 30.178 |
Epoch 0 | step 375 |avg loss 0.322 |walltime 31.676 |
Epoch 0 | step 400 |avg loss 0.312 |walltime 33.134 |
Epoch 0 | step 425 |avg loss 0.297 |walltime 34.570 |
 Calculating Validation Metrics 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
 Epoch 0 Validation Metrics: {'val_loss': 0.3171} 
Epoch 0 | step event |avg loss 0.290 |walltime 45.022 |
 Saving checkpoint to best_checkpoint.zip 
 Saving checkpoint to last_checkpoint.zip 
 Training Stopped 
Using backend: pytorch
 MAE : 71.48956308818265  RMSE : 634.5886544370983  SMAPE : 13.603083729132912  ND : 0.09410330929802689 
[2023-01-27 21:42:18,850][torch.distributed.elastic.multiprocessing.api][WARNING] - Closing process 2923 via signal SIGTERM
[2023-01-27 21:42:18,932][HYDRA] 	#1 : trainer.optimizer.lr=9.966102578171706e-05 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=512 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
[2023-01-27 21:42:19,273][torch.distributed.elastic.rendezvous.static_tcp_rendezvous][INFO] - Creating TCPStore as the c10d::Store implementation
[2023-01-27 21:42:20,031][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-01-27 21:42:20,035][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 1
[2023-01-27 21:42:20,036][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2023-01-27 21:42:20,041][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
 Training with 1 epochs 
 Epoch 0 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Epoch 0 | step 0 |avg loss 1.363 |walltime 9.365 |
Epoch 0 | step 25 |avg loss 0.937 |walltime 10.834 |
Epoch 0 | step 50 |avg loss 0.732 |walltime 12.335 |
Epoch 0 | step 75 |avg loss 0.632 |walltime 13.806 |
Epoch 0 | step 100 |avg loss 0.525 |walltime 15.242 |
Epoch 0 | step 125 |avg loss 0.477 |walltime 16.672 |
Epoch 0 | step 150 |avg loss 0.456 |walltime 18.116 |
Epoch 0 | step 175 |avg loss 0.441 |walltime 19.535 |
Epoch 0 | step 200 |avg loss 0.423 |walltime 20.976 |
Epoch 0 | step 225 |avg loss 0.411 |walltime 22.423 |
Epoch 0 | step 250 |avg loss 0.394 |walltime 23.866 |
Epoch 0 | step 275 |avg loss 0.381 |walltime 25.284 |
Epoch 0 | step 300 |avg loss 0.372 |walltime 26.723 |
Epoch 0 | step 325 |avg loss 0.354 |walltime 28.155 |
Epoch 0 | step 350 |avg loss 0.344 |walltime 29.611 |
Epoch 0 | step 375 |avg loss 0.328 |walltime 31.030 |
Epoch 0 | step 400 |avg loss 0.318 |walltime 32.475 |
Epoch 0 | step 425 |avg loss 0.304 |walltime 33.913 |
 Calculating Validation Metrics 
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
Using backend: pytorch
 Epoch 0 Validation Metrics: {'val_loss': 0.3242} 
Epoch 0 | step event |avg loss 0.296 |walltime 44.484 |
 Saving checkpoint to best_checkpoint.zip 
 Saving checkpoint to last_checkpoint.zip 
 Training Stopped 
Using backend: pytorch
 MAE : 72.73564587405967  RMSE : 652.7126345935541  SMAPE : 13.841927865419018  ND : 0.0957435559122871 
[2023-01-27 21:43:14,535][torch.distributed.elastic.multiprocessing.api][WARNING] - Closing process 4030 via signal SIGTERM
[2023-01-27 21:43:14,617][HYDRA] Best parameters: {'trainer.optimizer.lr': 0.00010443943894071373}
[2023-01-27 21:43:14,617][HYDRA] Best value: 71.48956308818265
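
The reported best value, 71.48956308818265, matches the MAE printed for the first of the two runs shown above, not the run launched with `trainer.optimizer.lr=9.966102578171706e-05`. The following is a minimal sketch of how that selection can be checked by hand from the log text. It assumes the sweep minimizes MAE (as `+optuna_objectives=[MAE]` suggests) and that each trial prints exactly one metric line in the format seen above. `parse_metrics` is a hypothetical helper written for this illustration and is not part of TSPP, Hydra, or Optuna.

```python
import re

# Metric lines copied verbatim from the two runs in the output above.
log_lines = [
    " MAE : 71.48956308818265  RMSE : 634.5886544370983  SMAPE : 13.603083729132912  ND : 0.09410330929802689 ",
    " MAE : 72.73564587405967  RMSE : 652.7126345935541  SMAPE : 13.841927865419018  ND : 0.0957435559122871 ",
]

# Matches "NAME : number" pairs such as "MAE : 71.48".
METRIC_RE = re.compile(r"(\w+) : ([0-9.eE+-]+)")


def parse_metrics(line):
    """Hypothetical helper: turn one metric line into a {name: float} dict."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


trials = [parse_metrics(line) for line in log_lines]

# Assumption: lower MAE is better, so the best trial is the one with the minimum MAE.
best_index = min(range(len(trials)), key=lambda i: trials[i]["MAE"])
best_value = trials[best_index]["MAE"]

print(f"best run in this excerpt: {best_index}, MAE = {best_value}")
```

The same comparison could be made on RMSE, SMAPE, or ND by changing the dictionary key, although only MAE is listed as an objective in the override string shown in the log.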