This sample code is a ready-to-use Jupyter notebook that shows a multi-node, multi-GPU NVIDIA RAPIDS application running on a Dask cluster. The notebook uses part of the NYC Taxi & Limousine Commission yellow taxi data (2018 calendar year). Its goal is to predict the fare amount for a trip from the times and coordinates of that trip, using a Random Forest model.
This example is designed to work with the NGC-AzureML Quick Launch CLI Toolkit (azureml-ngc-tools). The toolkit leverages Dask Cloud Provider, a native cloud integration for Dask that helps manage Dask clusters on different cloud platforms. The toolkit is run beforehand to set up a Dask cluster on Azure VMs. Both the scheduler and the workers come with everything required to run this notebook (containers, libraries, and the notebook itself) preinstalled for you.
The RAPIDS suite of software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
cuDF is a Python GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data, all in a pandas-like API familiar to data scientists. The dask_cudf library includes methods to use Dask and cuDF together.
cuML is a suite of libraries that implement machine learning algorithms and mathematical primitive functions that are compatible with other RAPIDS projects, all in a scikit-learn-like API familiar to data scientists. The cuml.dask library provides access to the RAPIDS cuML package with Dask.
The example is meant to run on the scheduler of the Dask cluster set up by the NGC-AzureML Quick Launch CLI Toolkit using the config files on NGC at this location. The example starts by initializing a dask.distributed.Client pointed at the address of the scheduler. Because the notebook is already running on the scheduler, the address is simply localhost.
The config files request the creation of a Dask cluster with two “Standard_NC12s_v3” Azure VMs, each with two 16 GB V100 GPUs, so the total number of GPUs available on the Dask cluster should be four.
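The connection step and the expected cluster size can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper names are hypothetical, port 8786 is Dask's default scheduler port and is assumed here, and the worker count assumes one Dask worker per GPU.

```python
# Cluster shape taken from the description above (assumed values).
NUM_VMS = 2
GPUS_PER_VM = 2  # Standard_NC12s_v3


def scheduler_address(host="localhost", port=8786):
    """Build a Dask scheduler address (8786 is Dask's default port)."""
    return f"tcp://{host}:{port}"


def expected_gpu_workers(num_vms=NUM_VMS, gpus_per_vm=GPUS_PER_VM):
    """Number of workers to expect, assuming one worker per GPU."""
    return num_vms * gpus_per_vm


def connect(address=None):
    """Connect to the scheduler and report how many workers it sees.

    Only call this on a machine where a Dask scheduler is running.
    """
    from dask.distributed import Client  # imported lazily on purpose

    client = Client(address or scheduler_address())
    n_workers = len(client.scheduler_info()["workers"])
    return client, n_workers


print(scheduler_address(), "expected workers:", expected_gpu_workers())
```

On the scheduler node, calling connect() and comparing the reported worker count with expected_gpu_workers() is a quick sanity check that all GPUs joined the cluster.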
Then, the NYC Taxi & Limousine Commission yellow taxi data is loaded from the Microsoft Azure Open Datasets catalog. Inspecting the data shows that the yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts, along with latitude, longitude, etc. This is the information used to estimate the trip fare amount.
The data needs to be cleaned up before it can be used in a meaningful way. In this part, you may recognize familiar pandas-like functions such as (column) drop and fillna.
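The cleanup step might look like the sketch below. It runs on pandas so that it works without a GPU; cuDF exposes drop and fillna methods of the same name. The column names, the toy rows, and the fill value of -1 are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd


def clean(df, unused_columns, fill_value=-1):
    """Drop columns that are not used as features, then fill missing values."""
    df = df.drop(columns=unused_columns)
    return df.fillna(fill_value)


# Tiny stand-in for the taxi data (hypothetical columns and values).
trips = pd.DataFrame(
    {
        "passenger_count": [1.0, None, 3.0],
        "trip_distance": [2.5, 1.1, None],
        "store_and_fwd_flag": ["N", "Y", "N"],
        "fare_amount": [10.0, 6.5, 14.0],
    }
)

cleaned = clean(trips, unused_columns=["store_and_fwd_flag"])
print(cleaned)
```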
Later, new features are added by making use of "user defined functions" on the dataframe. The notebook makes use of apply_rows, which is similar to the pandas apply function. The apply_rows operation is JIT-compiled by Numba into GPU kernels.
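As an illustration of the row-wise style that apply_rows expects, here is a sketch of a distance feature computed from pickup and drop-off coordinates. The haversine feature and all column names are assumptions chosen for the example, not necessarily what the notebook computes. The function is exercised on NumPy arrays on the CPU so it can be checked without a GPU.

```python
import math

import numpy as np


def haversine_kernel(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon, distance_km):
    """Row-wise great-circle distance, written in the loop style used by apply_rows.

    Inputs are column-like sequences; the result is written into distance_km.
    """
    earth_radius_km = 6371.0
    for i, (lat1, lon1, lat2, lon2) in enumerate(
        zip(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
    ):
        phi1 = lat1 * math.pi / 180.0
        phi2 = lat2 * math.pi / 180.0
        dphi = (lat2 - lat1) * math.pi / 180.0
        dlam = (lon2 - lon1) * math.pi / 180.0
        a = (
            math.sin(dphi / 2.0) ** 2
            + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2
        )
        distance_km[i] = 2.0 * earth_radius_km * math.asin(math.sqrt(a))


# CPU check on two rows: a zero-length trip, and one degree of longitude
# along the equator.
lat1 = np.array([40.75, 0.0])
lon1 = np.array([-73.99, 0.0])
lat2 = np.array([40.75, 0.0])
lon2 = np.array([-73.99, 1.0])
out = np.zeros(2)
haversine_kernel(lat1, lon1, lat2, lon2, out)
print(out)

# With cuDF on a GPU, a function of this shape would typically be passed as:
#   df.apply_rows(haversine_kernel,
#                 incols=["pickup_lat", "pickup_lon", "dropoff_lat", "dropoff_lon"],
#                 outcols={"distance_km": np.float64},
#                 kwargs={})
# where the incols names must match the function's argument names.
```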
Then, just as with scikit-learn, the data can be easily split into training and testing sets.
The training data is then used to fit a Random Forest model, and inference is run on the held-out set.
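The split, fit, and inference steps above follow the familiar scikit-learn pattern. The sketch below uses scikit-learn itself on synthetic data as a CPU stand-in; on the cluster the notebook would use the cuML equivalents (for example, cuml.dask.ensemble.RandomForestRegressor). The features, the synthetic fare formula, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features (hypothetical values).
rng = np.random.default_rng(0)
distance_km = rng.uniform(0.5, 20.0, size=500)
hour = rng.integers(0, 24, size=500)
X = np.column_stack([distance_km, hour])
# Made-up fare: base charge plus a per-km rate plus noise.
y = 2.5 + 1.6 * distance_km + rng.normal(0.0, 0.5, size=500)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the forest on the training rows, then predict fares for held-out rows.
model = RandomForestRegressor(n_estimators=50, max_depth=8, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

rmse = float(np.sqrt(np.mean((predictions - y_test) ** 2)))
print("held-out RMSE:", rmse)
```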
The NGC-AzureML Quick Launch CLI Toolkit (azureml-ngc-tools) is the quickest way to get started with deploying this example on Microsoft Azure. The steps involved are:
1. Download the ready-to-use config files for this example from NGC here.
2. Install azureml-ngc-tools following the instructions described.
3. Modify the azure_config.json with your Azure billing credentials and run azureml-ngc-tools as described in Step 1.
4. Copy the URL to your AzureML environment produced by the toolkit to your browser of choice.
To take a look at the notebook before you deploy, follow these steps:
1. Navigate to the File Browser tab of this asset.
2. Under the actions menu (three dots), select "View Jupyter".
3. There you have it! You can read the sample code before you deploy.
The config files listed here, when used with the NGC-AzureML Quick Launch CLI toolkit (azureml-ngc-tools), use RAPIDS, which is to be used in accordance with the End User License Agreement included with it. Licenses are also available along with the model application zip file. By pulling and using the RAPIDS container and downloading models, you accept the terms and conditions of these licenses.