The NVIDIA Time Series Prediction Platform (NVIDIA-TSPP) is a tool designed to make it easy to compare and experiment with arbitrary combinations of forecasting models, time-series datasets, and other configurations.
This notebook assumes that the Vertex AI instance has more than one GPU.
Set the num_gpus variable accordingly.
NOTE: These notebooks are designed to highlight different features of NVIDIA-TSPP. For this reason, all the examples are set up to run quickly with only a few iterations. The parameters should be tuned to get optimal results.
Setup
Imports
For this notebook, we will again use the electricity dataset and the Temporal Fusion Transformer (TFT) model for training. See the 1_TsppOverview notebook for data download and preprocessing instructions, as well as an in-depth description of training.
Dataset Download
Using backend: pytorch
#### Running download script ###
Getting electricity data...
/workspace/datasets/electricity
Pulling data from https://archive.ics.uci.edu/ml/machine-learning-databases/00321/LD2011_2014.txt.zip to /workspace/datasets/electricity/LD2011_2014.txt.zip
100% [..................................................] 261335609 / 261335609 done
Unzipping file: /workspace/datasets/electricity/LD2011_2014.txt.zip
Done.
Aggregating to hourly data
Done.
Download completed.
Dataset Preprocessing
{'_target_': 'data.data_utils.Preprocessor', 'config': {'graph': False, 'source_path': '/workspace/datasets/electricity/electricity.csv', 'dest_path': '/workspace/datasets/electricity/', 'time_ids': 'days_from_start', 'train_range': [0, 1315], 'valid_range': [1308, 1339], 'test_range': [1332, 10000], 'dataset_stride': 1, 'scale_per_id': True, 'encoder_length': 168, 'example_length': 192, 'MultiID': False, 'features': [{'name': 'categorical_id', 'feature_type': 'ID', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 371}, {'name': 'hours_from_start', 'feature_type': 'TIME', 'feature_embed_type': 'CONTINUOUS'}, {'name': 'power_usage_weight', 'feature_type': 'WEIGHT', 'feature_embed_type': 'CONTINUOUS'}, {'name': 'power_usage', 'feature_type': 'TARGET', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}}, {'name': 'hour', 'feature_type': 'KNOWN', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 25}, {'name': 'day_of_week', 'feature_type': 'KNOWN', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 8}, {'name': 'hours_from_start', 'feature_type': 'KNOWN', 'feature_embed_type': 'CONTINUOUS', 'scaler': {'_target_': 'sklearn.preprocessing.StandardScaler'}}, {'name': 'categorical_id', 'feature_type': 'STATIC', 'feature_embed_type': 'CATEGORICAL', 'cardinality': 371}], 'train_samples': 450000, 'valid_samples': 50000, 'binarized': True, 'time_series_count': 369}}
Reading in data from CSV File: /workspace/datasets/electricity/electricity.csv
Sorting on time feature
Mapping nodes
Mapping categoricals to bounded range
Splitting datasets
Calculating scalers
Applying scalers
Applying scalers
Applying scalers
Fixing any nans in continuous features
Fixing any nans in continuous features
Fixing any nans in continuous features
Saving preprocessor state at /workspace/datasets/electricity/tspp_preprocess.bin
Saving processed data at /workspace/datasets/electricity/
We train the model for 1 epoch with a batch size of 1024 on a single GPU.
For all the examples below, the training criterion is set to quantile loss and Automatic Mixed Precision (AMP) is enabled.
Using backend: pytorch
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Training with 1 epochs
Epoch 0
Epoch 0 | step 0 |avg loss 1.332 |walltime 0.895 |
Epoch 0 | step 25 |avg loss 0.640 |walltime 2.842 |
Epoch 0 | step 50 |avg loss 0.440 |walltime 4.898 |
Epoch 0 | step 75 |avg loss 0.370 |walltime 6.775 |
Epoch 0 | step 100 |avg loss 0.306 |walltime 8.656 |
Epoch 0 | step 125 |avg loss 0.274 |walltime 10.537 |
Epoch 0 | step 150 |avg loss 0.253 |walltime 12.416 |
Epoch 0 | step 175 |avg loss 0.238 |walltime 14.284 |
Epoch 0 | step 200 |avg loss 0.233 |walltime 16.170 |
Epoch 0 | step 225 |avg loss 0.226 |walltime 18.052 |
Epoch 0 | step 250 |avg loss 0.221 |walltime 19.933 |
Epoch 0 | step 275 |avg loss 0.217 |walltime 21.815 |
Epoch 0 | step 300 |avg loss 0.217 |walltime 23.698 |
Epoch 0 | step 325 |avg loss 0.212 |walltime 25.583 |
Epoch 0 | step 350 |avg loss 0.210 |walltime 27.464 |
Epoch 0 | step 375 |avg loss 0.206 |walltime 29.376 |
Epoch 0 | step 400 |avg loss 0.206 |walltime 31.256 |
Epoch 0 | step 425 |avg loss 0.206 |walltime 33.126 |
Calculating Validation Metrics
Epoch 0 Validation Metrics: {'val_loss': 0.2326}
Epoch 0 | step event |avg loss 0.204 |walltime 35.641 |
Saving checkpoint to best_checkpoint.zip
Saving checkpoint to last_checkpoint.zip
Training Stopped
MAE : 45.27942846755962 RMSE : 360.505482514993 SMAPE : 9.768091436720974 ND : 0.05960232344216114
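The last line of the training output reports four test metrics. As a rough guide to reading them, the sketch below computes each one in plain Python using common textbook definitions. The exact formulas used inside NVIDIA-TSPP are not shown in this notebook, so the SMAPE and ND definitions here, and the function name forecast_metrics, are assumptions for illustration.

```python
import math


def forecast_metrics(targets, preds):
    """Compute MAE, RMSE, SMAPE (in percent) and ND with common definitions.

    These definitions are assumptions for illustration; they may differ in
    detail from the ones NVIDIA-TSPP uses internally.
    """
    if len(targets) != len(preds) or not targets:
        raise ValueError("targets and preds must be non-empty and of equal length")
    total_abs_target = sum(abs(t) for t in targets)
    if total_abs_target == 0:
        raise ValueError("ND is undefined when all targets are zero")

    n = len(targets)
    abs_err = [abs(p - t) for t, p in zip(targets, preds)]

    mae = sum(abs_err) / n
    rmse = math.sqrt(sum(e * e for e in abs_err) / n)
    # Symmetric MAPE: error relative to the mean magnitude of target and
    # prediction. Terms where both values are zero contribute 0.
    smape = 100.0 / n * sum(
        2.0 * e / (abs(t) + abs(p)) if (abs(t) + abs(p)) > 0 else 0.0
        for e, t, p in zip(abs_err, targets, preds)
    )
    # Normalized deviation: total absolute error over total absolute target.
    nd = sum(abs_err) / total_abs_target
    return {"MAE": mae, "RMSE": rmse, "SMAPE": smape, "ND": nd}


# Toy values, chosen only to exercise the function.
metrics = forecast_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)
```

MAE and RMSE are in the units of the target, while SMAPE and ND are scale-free, which is why the first two are so much larger in the output above.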
Multi-GPU Training on NVIDIA-TSPP
The cell below assumes that 2 GPUs are available. With 2 GPUs, we set the per-GPU batch size to 512, so the global batch size for data-parallel training is 1024, the same as in the single-GPU run above.
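The batch-size arithmetic can be sketched as a small helper. The function name per_gpu_batch_size is hypothetical; it only illustrates how a fixed global batch size is divided across GPUs so that results stay comparable with the single-GPU run.

```python
def per_gpu_batch_size(global_batch_size: int, num_gpus: int) -> int:
    """Split a global batch evenly across GPUs for data-parallel training."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by num_gpus")
    return global_batch_size // num_gpus


num_gpus = 2
batch_size = per_gpu_batch_size(1024, num_gpus)
print(batch_size)  # 512
```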
Multi-GPU training on NVIDIA-TSPP requires minimal changes to the command-line:
Set hydra/launcher=torchrun and hydra.launcher.nproc_per_node={NUMBER OF GPUs}.
Specify the output directory with hydra.sweep.dir instead of hydra.run.dir.
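A minimal sketch of how these overrides might be assembled into a command line. The training overrides are the ones echoed in the log of this run. The entry-point script name launch_training.py and the output directory are assumptions, and -m is Hydra's multirun flag, which launcher plugins are normally used with. The command is only printed here, not executed.

```python
import shlex


def multi_gpu_overrides(num_gpus, batch_size_per_gpu, output_dir):
    """Return Hydra overrides for a data-parallel run via the torchrun launcher."""
    return [
        "hydra/launcher=torchrun",
        f"hydra.launcher.nproc_per_node={num_gpus}",
        # The launcher runs in multirun mode, so the output directory is set
        # through hydra.sweep.dir rather than hydra.run.dir.
        f"hydra.sweep.dir={output_dir}",
        "seed=1234",
        "model=tft",
        "dataset=electricity",
        "trainer/criterion=quantile",
        "trainer.config.amp=True",
        "trainer.config.num_epochs=1",
        f"trainer.config.batch_size={batch_size_per_gpu}",
        "+trainer.config.force_rerun=True",
    ]


# Hypothetical output directory, for illustration only.
overrides = multi_gpu_overrides(2, 512, "/workspace/outputs/multi_gpu_example")

# Assumed entry point; adjust to the script used in your environment.
command = ["python", "launch_training.py", "-m", *overrides]
print(shlex.join(command))
```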
Using backend: pytorch
[2023-01-27 21:37:11,052][HYDRA] #0 : seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=512 +trainer.config.force_rerun=True
[2023-01-27 21:37:11,368][torch.distributed.elastic.rendezvous.static_tcp_rendezvous][INFO] - Creating TCPStore as the c10d::Store implementation
[2023-01-27 21:37:12,136][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-01-27 21:37:12,139][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 1
[2023-01-27 21:37:12,139][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2023-01-27 21:37:12,147][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Training with 1 epochs
Epoch 0
Epoch 0 | step 0 |avg loss 1.363 |walltime 1.308 |
Epoch 0 | step 25 |avg loss 0.643 |walltime 2.715 |
Epoch 0 | step 50 |avg loss 0.444 |walltime 4.080 |
Epoch 0 | step 75 |avg loss 0.373 |walltime 5.482 |
Epoch 0 | step 100 |avg loss 0.314 |walltime 6.856 |
Epoch 0 | step 125 |avg loss 0.278 |walltime 8.231 |
Epoch 0 | step 150 |avg loss 0.256 |walltime 9.694 |
Epoch 0 | step 175 |avg loss 0.242 |walltime 11.064 |
Epoch 0 | step 200 |avg loss 0.231 |walltime 12.441 |
Epoch 0 | step 225 |avg loss 0.225 |walltime 13.818 |
Epoch 0 | step 250 |avg loss 0.218 |walltime 15.183 |
Epoch 0 | step 275 |avg loss 0.217 |walltime 16.550 |
Epoch 0 | step 300 |avg loss 0.214 |walltime 17.925 |
Epoch 0 | step 325 |avg loss 0.209 |walltime 19.235 |
Epoch 0 | step 350 |avg loss 0.213 |walltime 20.549 |
Epoch 0 | step 375 |avg loss 0.209 |walltime 21.857 |
Epoch 0 | step 400 |avg loss 0.208 |walltime 23.163 |
Epoch 0 | step 425 |avg loss 0.204 |walltime 24.473 |
Calculating Validation Metrics
Epoch 0 Validation Metrics: {'val_loss': 0.2325}
Epoch 0 | step event |avg loss 0.204 |walltime 26.138 |
Saving checkpoint to best_checkpoint.zip
Saving checkpoint to last_checkpoint.zip
Training Stopped
MAE : 43.92580624930932 RMSE : 346.78918885139916 SMAPE : 9.63901996648538 ND : 0.05782052027014336
[2023-01-27 21:37:46,597][torch.distributed.elastic.multiprocessing.api][WARNING] - Closing process 512 via signal SIGTERM
Hyperparameter searches can be used to find close-to-optimal hyperparameter configurations for a given model or dataset. In NVIDIA-TSPP, hyperparameter searches are driven by Optuna and are enabled by setting hydra/sweeper=optuna.
The cell below runs a hyperparameter search over the learning rate, 'trainer.optimizer.lr=tag(log, interval(1e-5, 1e-2))', with the objective of minimizing Mean Absolute Error (MAE): +optuna_objectives=[MAE] hydra.sweeper.direction=[minimize]. We can also optimize multiple objectives simultaneously: +optuna_objectives=[MAE,RMSE,SMAPE] hydra.sweeper.direction=[minimize,minimize,minimize].
The number of trials is set with hydra.sweeper.n_trials={NUMBER OF TRIALS}.
More information on setting up parameter ranges can be found in the Hydra documentation.
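The sweep settings described above can be sketched as a helper that builds the override strings. The function name optuna_sweep_overrides and the trial count of 2 are illustrative assumptions; the override keys themselves are the ones named in the text. Note that the direction list needs one entry per objective.

```python
def optuna_sweep_overrides(objectives, n_trials, lr_low="1e-5", lr_high="1e-2"):
    """Build Hydra overrides for an Optuna learning-rate search.

    Every objective is minimized, so the direction list repeats "minimize"
    once per objective.
    """
    if not objectives:
        raise ValueError("at least one objective is required")
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    directions = ",".join("minimize" for _ in objectives)
    return [
        "hydra/sweeper=optuna",
        f"+optuna_objectives=[{','.join(objectives)}]",
        f"hydra.sweeper.direction=[{directions}]",
        f"hydra.sweeper.n_trials={n_trials}",
        # Log-scaled search interval for the learning rate.
        f"trainer.optimizer.lr=tag(log, interval({lr_low}, {lr_high}))",
    ]


single_objective = optuna_sweep_overrides(["MAE"], n_trials=2)
multi_objective = optuna_sweep_overrides(["MAE", "RMSE", "SMAPE"], n_trials=2)

for override in single_objective:
    print(override)
```

When passing the learning-rate override on a shell command line, keep it quoted as in the text above so that the parentheses and spaces reach Hydra intact.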
Using backend: pytorch
[I 2023-01-27 21:37:51,922] A new study created in memory with name: no-name-eeccd462-37e0-4106-b282-fa6fdc93e404
[2023-01-27 21:37:51,922][HYDRA] Study name: no-name-eeccd462-37e0-4106-b282-fa6fdc93e404
[2023-01-27 21:37:51,922][HYDRA] Storage: None
[2023-01-27 21:37:51,922][HYDRA] Sampler: TPESampler
[2023-01-27 21:37:51,922][HYDRA] Directions: ['minimize']
[2023-01-27 21:37:51,925][HYDRA] Launching 2 jobs locally
[2023-01-27 21:37:51,925][HYDRA] #0 : trainer.optimizer.lr=0.007617913943598092 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=1024 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Training with 1 epochs
Epoch 0
Epoch 0 | step 0 |avg loss 1.332 |walltime 8.682 |
Epoch 0 | step 25 |avg loss 0.626 |walltime 10.820 |
Epoch 0 | step 50 |avg loss 0.512 |walltime 12.958 |
Epoch 0 | step 75 |avg loss 0.441 |walltime 15.072 |
Epoch 0 | step 100 |avg loss 0.387 |walltime 17.171 |
Epoch 0 | step 125 |avg loss 0.345 |walltime 19.281 |
Epoch 0 | step 150 |avg loss 0.317 |walltime 21.353 |
Epoch 0 | step 175 |avg loss 0.301 |walltime 23.435 |
Epoch 0 | step 200 |avg loss 0.292 |walltime 25.525 |
Epoch 0 | step 225 |avg loss 0.279 |walltime 27.611 |
Epoch 0 | step 250 |avg loss 0.273 |walltime 29.702 |
Epoch 0 | step 275 |avg loss 0.264 |walltime 31.778 |
Epoch 0 | step 300 |avg loss 0.263 |walltime 33.873 |
Epoch 0 | step 325 |avg loss 0.255 |walltime 35.948 |
Epoch 0 | step 350 |avg loss 0.248 |walltime 38.042 |
Epoch 0 | step 375 |avg loss 0.245 |walltime 40.114 |
Epoch 0 | step 400 |avg loss 0.245 |walltime 42.205 |
Epoch 0 | step 425 |avg loss 0.241 |walltime 44.290 |
Calculating Validation Metrics
Epoch 0 Validation Metrics: {'val_loss': 0.2726}
Epoch 0 | step event |avg loss 0.240 |walltime 55.177 |
Saving checkpoint to best_checkpoint.zip
Saving checkpoint to last_checkpoint.zip
Training Stopped
Using backend: pytorch
MAE : 60.19307298214023 RMSE : 517.9206028680389 SMAPE : 11.718166139740203 ND : 0.07923348695599146
[2023-01-27 21:38:56,226][HYDRA] #1 : trainer.optimizer.lr=9.691240305477588e-05 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=1024 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Training with 1 epochs
Epoch 0
Epoch 0 | step 0 |avg loss 1.332 |walltime 8.417 |
Epoch 0 | step 25 |avg loss 0.944 |walltime 10.530 |
Epoch 0 | step 50 |avg loss 0.734 |walltime 12.616 |
Epoch 0 | step 75 |avg loss 0.638 |walltime 14.704 |
Epoch 0 | step 100 |avg loss 0.532 |walltime 16.786 |
Epoch 0 | step 125 |avg loss 0.478 |walltime 18.812 |
Epoch 0 | step 150 |avg loss 0.456 |walltime 20.894 |
Epoch 0 | step 175 |avg loss 0.443 |walltime 22.976 |
Epoch 0 | step 200 |avg loss 0.430 |walltime 25.067 |
Epoch 0 | step 225 |avg loss 0.413 |walltime 27.169 |
Epoch 0 | step 250 |avg loss 0.400 |walltime 29.261 |
Epoch 0 | step 275 |avg loss 0.385 |walltime 31.539 |
Epoch 0 | step 300 |avg loss 0.374 |walltime 33.656 |
Epoch 0 | step 325 |avg loss 0.359 |walltime 35.741 |
Epoch 0 | step 350 |avg loss 0.342 |walltime 37.892 |
Epoch 0 | step 375 |avg loss 0.326 |walltime 40.032 |
Epoch 0 | step 400 |avg loss 0.317 |walltime 42.185 |
Epoch 0 | step 425 |avg loss 0.309 |walltime 44.329 |
Calculating Validation Metrics
Epoch 0 Validation Metrics: {'val_loss': 0.3225}
Epoch 0 | step event |avg loss 0.298 |walltime 55.212 |
Saving checkpoint to best_checkpoint.zip
Saving checkpoint to last_checkpoint.zip
Training Stopped
Using backend: pytorch
MAE : 75.59034680561963 RMSE : 718.691816179415 SMAPE : 14.195638117660497 ND : 0.09950126253562437
[2023-01-27 21:40:00,477][HYDRA] Best parameters: {'trainer.optimizer.lr': 0.007617913943598092}
[2023-01-27 21:40:00,477][HYDRA] Best value: 60.19307298214023
After the cell above finishes, NVIDIA-TSPP prints the best parameters found across all the trials it ran.
The best parameters are also saved to {Output Directory}/optimization_results.yaml.
Individual trials are stored in {Output Directory}/{Trial Number}.
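With the direction set to minimize, the best trial is the one with the lowest objective value. A minimal sketch of that selection, using the learning rates and MAE values logged by the two trials above; the list-of-dicts layout is an illustrative assumption, not the format of optimization_results.yaml.

```python
# Learning rate and MAE for each trial, copied from the search log above.
trials = [
    {"number": 0, "lr": 0.007617913943598092, "MAE": 60.19307298214023},
    {"number": 1, "lr": 9.691240305477588e-05, "MAE": 75.59034680561963},
]

# The objective is minimized, so the best trial has the smallest MAE.
best = min(trials, key=lambda trial: trial["MAE"])
print(f"best trial: {best['number']}, lr={best['lr']}, MAE={best['MAE']}")
```

This agrees with the "Best parameters" and "Best value" lines reported by the sweeper for this run.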
Parallel HP Search
When running a hyperparameter search on a machine with more than one GPU, we can parallelize the search with the Hydra joblib plugin, which launches multiple instances of the model with different hyperparameters on multiple GPUs in parallel. To use the plugin, specify hydra/launcher=joblib together with the number of parallel jobs, hydra.launcher.n_jobs={NUMBER OF GPUs}. For example:
Using backend: pytorch
[I 2023-01-27 21:40:06,033] A new study created in memory with name: no-name-f91f5425-2dda-4ba2-94e3-8be810083dd1
[2023-01-27 21:40:06,033][HYDRA] Study name: no-name-f91f5425-2dda-4ba2-94e3-8be810083dd1
[2023-01-27 21:40:06,033][HYDRA] Storage: None
[2023-01-27 21:40:06,033][HYDRA] Sampler: TPESampler
[2023-01-27 21:40:06,033][HYDRA] Directions: ['minimize']
[2023-01-27 21:40:06,035][HYDRA] Joblib.Parallel(n_jobs=2,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 2 jobs
[2023-01-27 21:40:06,035][HYDRA] Launching jobs, sweep output dir : /home/jupyter/outputs/4_HpSearch_tft_electricity_Parallel
[2023-01-27 21:40:06,035][HYDRA] #0 : trainer.optimizer.lr=7.413617454888925e-05 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=1024 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
[2023-01-27 21:40:06,035][HYDRA] #1 : trainer.optimizer.lr=0.0002006182960201028 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=1024 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
Using backend: pytorch
Using backend: pytorch
/usr/local/lib/python3.8/dist-packages/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use [PyTorch AMP](https://pytorch.org/docs/stable/amp.html)
warnings.warn(msg, DeprecatedFeatureWarning)
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
/usr/local/lib/python3.8/dist-packages/apex/__init__.py:68: DeprecatedFeatureWarning: apex.amp is deprecated and will be removed by the end of February 2023. Use [PyTorch AMP](https://pytorch.org/docs/stable/amp.html)
warnings.warn(msg, DeprecatedFeatureWarning)
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Training with 1 epochs
Epoch 0
Training with 1 epochs
Epoch 0
/workspace/training/trainer.py:210: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
nn.utils.clip_grad_norm(self.model.parameters(), self.config.gradient_norm)
/workspace/training/trainer.py:210: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
nn.utils.clip_grad_norm(self.model.parameters(), self.config.gradient_norm)
Epoch 0 | step 0 |avg loss 1.332 |walltime 8.653 |
Epoch 0 | step 0 |avg loss 1.332 |walltime 8.683 |
Epoch 0 | step 25 |avg loss 0.996 |walltime 10.743 |
Epoch 0 | step 25 |avg loss 0.840 |walltime 10.772 |
Epoch 0 | step 50 |avg loss 0.757 |walltime 12.819 |
Epoch 0 | step 50 |avg loss 0.628 |walltime 12.916 |
Epoch 0 | step 75 |avg loss 0.692 |walltime 14.918 |
Epoch 0 | step 75 |avg loss 0.492 |walltime 15.047 |
Epoch 0 | step 100 |avg loss 0.600 |walltime 17.069 |
Epoch 0 | step 100 |avg loss 0.451 |walltime 17.132 |
Epoch 0 | step 125 |avg loss 0.518 |walltime 19.154 |
Epoch 0 | step 125 |avg loss 0.428 |walltime 19.228 |
Epoch 0 | step 150 |avg loss 0.481 |walltime 21.197 |
Epoch 0 | step 150 |avg loss 0.406 |walltime 21.367 |
Epoch 0 | step 175 |avg loss 0.462 |walltime 23.315 |
Epoch 0 | step 175 |avg loss 0.384 |walltime 23.463 |
Epoch 0 | step 200 |avg loss 0.449 |walltime 25.412 |
Epoch 0 | step 200 |avg loss 0.361 |walltime 25.543 |
Epoch 0 | step 225 |avg loss 0.436 |walltime 27.477 |
Epoch 0 | step 225 |avg loss 0.334 |walltime 27.630 |
Epoch 0 | step 250 |avg loss 0.427 |walltime 29.574 |
Epoch 0 | step 250 |avg loss 0.312 |walltime 29.714 |
Epoch 0 | step 275 |avg loss 0.413 |walltime 31.682 |
Epoch 0 | step 275 |avg loss 0.291 |walltime 31.804 |
Epoch 0 | step 300 |avg loss 0.403 |walltime 33.774 |
Epoch 0 | step 300 |avg loss 0.278 |walltime 33.882 |
Epoch 0 | step 325 |avg loss 0.392 |walltime 35.853 |
Epoch 0 | step 325 |avg loss 0.268 |walltime 36.008 |
Epoch 0 | step 350 |avg loss 0.382 |walltime 37.960 |
Epoch 0 | step 350 |avg loss 0.258 |walltime 38.142 |
Epoch 0 | step 375 |avg loss 0.371 |walltime 40.039 |
Epoch 0 | step 375 |avg loss 0.251 |walltime 40.220 |
Epoch 0 | step 400 |avg loss 0.361 |walltime 42.131 |
Epoch 0 | step 400 |avg loss 0.249 |walltime 42.303 |
Epoch 0 | step 425 |avg loss 0.352 |walltime 44.218 |
Epoch 0 | step 425 |avg loss 0.246 |walltime 44.379 |
Calculating Validation Metrics
Calculating Validation Metrics
Epoch 0 Validation Metrics: {'val_loss': 0.3629}
Epoch 0 | step event |avg loss 0.339 |walltime 55.447 |
Saving checkpoint to best_checkpoint.zip
Saving checkpoint to last_checkpoint.zip
Training Stopped
Epoch 0 Validation Metrics: {'val_loss': 0.2681}
Epoch 0 | step event |avg loss 0.242 |walltime 55.627 |
Saving checkpoint to best_checkpoint.zip
Saving checkpoint to last_checkpoint.zip
Training Stopped
Using backend: pytorch
Using backend: pytorch
MAE : 87.59028952790135 RMSE : 893.3840862081481 SMAPE : 15.592594907363159 ND : 0.11529705527477666
MAE : 51.18473352366787 RMSE : 393.0887328938385 SMAPE : 11.43831024946904 ND : 0.06737560843914299
[2023-01-27 21:41:15,183][HYDRA] Best parameters: {'trainer.optimizer.lr': 0.0002006182960201028}
[2023-01-27 21:41:15,184][HYDRA] Best value: 51.18473352366787
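To get a rough sense of what the parallel search buys, the sketch below compares wall-clock time for the sequential and parallel searches, using the "study created" and "Best parameters" timestamps from the two logs above. The helper name elapsed_seconds is hypothetical. These timings come from a single run with two trials each, and the two searches sampled different learning rates, so only the elapsed time is comparable, not the best MAE.

```python
from datetime import datetime

# Timestamp layout used in the Hydra log lines, e.g. "2023-01-27 21:37:51,922".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


# Sequential search: study creation to best-parameters report.
sequential = elapsed_seconds("2023-01-27 21:37:51,922", "2023-01-27 21:40:00,477")
# Parallel search with the joblib launcher: same two events.
parallel = elapsed_seconds("2023-01-27 21:40:06,033", "2023-01-27 21:41:15,183")

print(f"sequential: {sequential:.1f} s")
print(f"parallel:   {parallel:.1f} s")
print(f"speedup:    {sequential / parallel:.2f}x")
```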
HP Search and Data-Parallel Training
In the cell below, we run data-parallel training with a per-GPU batch size of 512 and search for the optimal learning rate sequentially: trials run one after another, and each trial trains across all GPUs.
Using backend: pytorch
[I 2023-01-27 21:41:23,246] A new study created in memory with name: no-name-23307da9-c9ef-4b7a-8cf4-e4f84cd54643
[2023-01-27 21:41:23,246][HYDRA] Study name: no-name-23307da9-c9ef-4b7a-8cf4-e4f84cd54643
[2023-01-27 21:41:23,246][HYDRA] Storage: None
[2023-01-27 21:41:23,246][HYDRA] Sampler: TPESampler
[2023-01-27 21:41:23,246][HYDRA] Directions: ['minimize']
[2023-01-27 21:41:23,248][HYDRA] #0 : trainer.optimizer.lr=0.00010443943894071373 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=512 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
[2023-01-27 21:41:23,592][torch.distributed.elastic.rendezvous.static_tcp_rendezvous][INFO] - Creating TCPStore as the c10d::Store implementation
[2023-01-27 21:41:24,384][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-01-27 21:41:24,390][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 1
[2023-01-27 21:41:24,391][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2023-01-27 21:41:24,395][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Training with 1 epochs
Epoch 0
Epoch 0 | step 0 |avg loss 1.363 |walltime 9.372 |
Epoch 0 | step 25 |avg loss 0.929 |walltime 10.848 |
Epoch 0 | step 50 |avg loss 0.727 |walltime 12.304 |
Epoch 0 | step 75 |avg loss 0.621 |walltime 13.845 |
Epoch 0 | step 100 |avg loss 0.516 |walltime 15.320 |
Epoch 0 | step 125 |avg loss 0.473 |walltime 16.788 |
Epoch 0 | step 150 |avg loss 0.453 |walltime 18.257 |
Epoch 0 | step 175 |avg loss 0.438 |walltime 19.733 |
Epoch 0 | step 200 |avg loss 0.419 |walltime 21.197 |
Epoch 0 | step 225 |avg loss 0.407 |walltime 22.669 |
Epoch 0 | step 250 |avg loss 0.390 |walltime 24.145 |
Epoch 0 | step 275 |avg loss 0.376 |walltime 25.590 |
Epoch 0 | step 300 |avg loss 0.366 |walltime 27.116 |
Epoch 0 | step 325 |avg loss 0.347 |walltime 28.652 |
Epoch 0 | step 350 |avg loss 0.337 |walltime 30.178 |
Epoch 0 | step 375 |avg loss 0.322 |walltime 31.676 |
Epoch 0 | step 400 |avg loss 0.312 |walltime 33.134 |
Epoch 0 | step 425 |avg loss 0.297 |walltime 34.570 |
Calculating Validation Metrics
Epoch 0 Validation Metrics: {'val_loss': 0.3171}
Epoch 0 | step event |avg loss 0.290 |walltime 45.022 |
Saving checkpoint to best_checkpoint.zip
Saving checkpoint to last_checkpoint.zip
Training Stopped
Using backend: pytorch
MAE : 71.48956308818265 RMSE : 634.5886544370983 SMAPE : 13.603083729132912 ND : 0.09410330929802689
[2023-01-27 21:42:18,850][torch.distributed.elastic.multiprocessing.api][WARNING] - Closing process 2923 via signal SIGTERM
[2023-01-27 21:42:18,932][HYDRA] #1 : trainer.optimizer.lr=9.966102578171706e-05 seed=1234 model=tft dataset=electricity trainer/criterion=quantile trainer.config.amp=True trainer.config.num_epochs=1 trainer.config.batch_size=512 +optuna_objectives=[MAE] +trainer.config.force_rerun=True
[2023-01-27 21:42:19,273][torch.distributed.elastic.rendezvous.static_tcp_rendezvous][INFO] - Creating TCPStore as the c10d::Store implementation
[2023-01-27 21:42:20,031][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2023-01-27 21:42:20,035][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 1
[2023-01-27 21:42:20,036][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2023-01-27 21:42:20,041][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Training with 1 epochs
Epoch 0
Epoch 0 | step 0 |avg loss 1.363 |walltime 9.365 |
Epoch 0 | step 25 |avg loss 0.937 |walltime 10.834 |
Epoch 0 | step 50 |avg loss 0.732 |walltime 12.335 |
Epoch 0 | step 75 |avg loss 0.632 |walltime 13.806 |
Epoch 0 | step 100 |avg loss 0.525 |walltime 15.242 |
Epoch 0 | step 125 |avg loss 0.477 |walltime 16.672 |
Epoch 0 | step 150 |avg loss 0.456 |walltime 18.116 |
Epoch 0 | step 175 |avg loss 0.441 |walltime 19.535 |
Epoch 0 | step 200 |avg loss 0.423 |walltime 20.976 |
Epoch 0 | step 225 |avg loss 0.411 |walltime 22.423 |
Epoch 0 | step 250 |avg loss 0.394 |walltime 23.866 |
Epoch 0 | step 275 |avg loss 0.381 |walltime 25.284 |
Epoch 0 | step 300 |avg loss 0.372 |walltime 26.723 |
Epoch 0 | step 325 |avg loss 0.354 |walltime 28.155 |
Epoch 0 | step 350 |avg loss 0.344 |walltime 29.611 |
Epoch 0 | step 375 |avg loss 0.328 |walltime 31.030 |
Epoch 0 | step 400 |avg loss 0.318 |walltime 32.475 |
Epoch 0 | step 425 |avg loss 0.304 |walltime 33.913 |
Calculating Validation Metrics
Epoch 0 Validation Metrics: {'val_loss': 0.3242}
Epoch 0 | step event |avg loss 0.296 |walltime 44.484 |
Saving checkpoint to best_checkpoint.zip
Saving checkpoint to last_checkpoint.zip
Training Stopped
Using backend: pytorch
MAE : 72.73564587405967 RMSE : 652.7126345935541 SMAPE : 13.841927865419018 ND : 0.0957435559122871
[2023-01-27 21:43:14,535][torch.distributed.elastic.multiprocessing.api][WARNING] - Closing process 4030 via signal SIGTERM
[2023-01-27 21:43:14,617][HYDRA] Best parameters: {'trainer.optimizer.lr': 0.00010443943894071373}
[2023-01-27 21:43:14,617][HYDRA] Best value: 71.48956308818265