NVIDIA NGC base VM images are available in Google Cloud Marketplace. In the following example, we use GCP's latest A100 GPU in the A2 machine family to launch an NGC VM. The NGC VM comes with NVIDIA Docker preinstalled.
The TAO (TLT) Stream Analytics container provides the runtime dependencies for the steps below. It can be launched with:
docker run --gpus all -it \
--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8888:8888 \
-v "$(pwd)":/tlt
mkdir /tlt/models
ngc registry model download-version "nvidia/tlt_peoplenet:unpruned_v2.1"
First, we export the pretrained model weights to the etlt format, then optimize the result into a TensorRT engine.
detectnet_v2 export -k "tlt_encode" -m /tlt/models/tlt_peoplenet_vunpruned_v2.1/resnet34_peoplenet.tlt -o /tlt/models/tlt_peoplenet_vunpruned_v2.1/resnet34_peoplenet.etlt
tlt-converter -k "tlt_encode" -d 3,544,960 -e /tlt/models/tlt_peoplenet_vunpruned_v2.1/model.engine -o output_cov/Sigmoid,output_bbox/BiasAdd /tlt/models/tlt_peoplenet_vunpruned_v2.1/resnet34_peoplenet.etlt
Second, we copy the pretrained engine to a GCS directory.
gsutil cp ~/tlt/models/tlt_peoplenet_vunpruned_v2.1/model.engine gs://dongm-tlt/tlt/triton-model/peoplenet_tlt/1/model.plan
Here is a sample Triton configuration file for the RGB PeopleNet model. We also copy it to the same bucket.
gsutil cp config.pbtxt gs://dongm-tlt/tlt/triton-model/peoplenet_tlt/
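The two copies above place the engine and its configuration where Triton looks for them: a model repository holds one directory per model, with config.pbtxt at the model level and the TensorRT engine stored as model.plan inside a numbered version directory. The sketch below stages that layout locally with empty placeholder files so it can be inspected before uploading. The staging directory and the variable names are hypothetical, and the final gsutil rsync line is only a suggested alternative to the individual copies.

```shell
set -eu

# Hypothetical local staging directory; not part of the original walkthrough.
STAGE_DIR="$(mktemp -d)"
MODEL_NAME="peoplenet_tlt"
MODEL_VERSION="1"

# Triton expects <repository>/<model>/config.pbtxt and
# <repository>/<model>/<version>/model.plan for TensorRT engines.
mkdir -p "${STAGE_DIR}/${MODEL_NAME}/${MODEL_VERSION}"

# Empty placeholders stand in for the real engine and config in this sketch.
: > "${STAGE_DIR}/${MODEL_NAME}/${MODEL_VERSION}/model.plan"
: > "${STAGE_DIR}/${MODEL_NAME}/config.pbtxt"

# Print the staged files, relative to the staging root.
(cd "${STAGE_DIR}" && find . -type f | sort)

# To publish the staged tree in one step (needs gsutil and bucket access):
# gsutil -m rsync -r "${STAGE_DIR}" gs://dongm-tlt/tlt/triton-model
```

Staging locally first makes it easier to spot a misplaced file, such as a config.pbtxt inside the version directory, before Triton tries to load the repository.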
The GKE A100 cluster can be created with the commands below:
export PROJECT_ID=k80-exploration
export ZONE=us-central1-a
export REGION=us-central1
export DEPLOYMENT_NAME=dongm-cv-gke
gcloud beta container clusters create ${DEPLOYMENT_NAME} \
--addons=HorizontalPodAutoscaling,HttpLoadBalancing,Istio \
--machine-type=n1-standard-8 \
--node-locations=${ZONE} \
--zone=${ZONE} \
--subnetwork=default \
--scopes cloud-platform \
--num-nodes 1 \
--project ${PROJECT_ID}
# add a GPU node pool; adjust the number of nodes to match your workload
gcloud container node-pools create accel \
--project ${PROJECT_ID} \
--zone ${ZONE} \
--cluster ${DEPLOYMENT_NAME} \
--num-nodes 1 \
--accelerator type=nvidia-tesla-a100,count=1 \
--enable-autoscaling --min-nodes 1 --max-nodes 2 \
--machine-type a2-highgpu-1g \
--disk-size=100 \
--scopes cloud-platform \
--verbosity error
# fetch cluster credentials so that kubectl can reach the cluster from your machine
gcloud container clusters get-credentials ${DEPLOYMENT_NAME} --project ${PROJECT_ID} --zone ${ZONE}
# deploy NVIDIA device plugin for GKE to prepare GPU nodes for driver install
kubectl apply -f
# grant your account the cluster-admin role so that you can manage the cluster with kubectl
kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user "$(gcloud config get-value account)"
# enable the Stackdriver custom metrics adapter
kubectl apply -f
Last, we use the GKE Triton Marketplace application to launch a Triton deployment in GKE, pointing the model repository to gs://dongm-tlt/tlt/triton-model. Because TensorRT engines are tied to the GPU and TensorRT version they were built with, we launch the application on an A100 GKE cluster and use the Triton v2.5.0 version of the GKE application, which is compatible with TAO 3.0.
Once the application has been deployed, we use the TAO Triton application to send inference requests. Follow its getting started guide to set up the Python environment, but skip the server start step because the server is already deployed to GKE. Find the Istio ingress IP with:
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
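Once INGRESS_HOST is set, it can be worth confirming that Triton has loaded the model before sending images. The following is a minimal sketch, assuming the ingress gateway forwards plain HTTP on its default port to Triton's HTTP endpoint; that is an assumption about this deployment, not something the steps above state. The helper name triton_model_ready_url is hypothetical, while the /v2/models/<name>/ready path is part of Triton's HTTP/REST API.

```shell
# Build the Triton readiness URL for a model served behind the ingress.
# Assumes plain HTTP on the default port; adjust the scheme or port if yours differs.
triton_model_ready_url() {
  host="$1"
  model="$2"
  printf 'http://%s/v2/models/%s/ready\n' "${host}" "${model}"
}

# Example with a documentation-only address (192.0.2.0/24 is reserved for examples).
triton_model_ready_url "192.0.2.10" "peoplenet_tlt"

# Against a live deployment, -f makes curl fail on a non-2xx response:
# curl -sf "$(triton_model_ready_url "${INGRESS_HOST}" peoplenet_tlt)" && echo "model ready"
```

A failing readiness check usually points at the model repository path or the config.pbtxt contents rather than at the client.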
In the triton_dev environment, install jupyter and ipython.
Launch a Jupyter notebook from the GCP VM, and set up port forwarding to your client:
gcloud compute ssh --project k80-exploration --zone us-central1-a dongm-nvidia-ngc-base-vm -- -L 8888:localhost:8888
Then run the client Python script in the notebook tao_triton_client.ipynb, or with the command below:
python3 ${TLT_TRITON_REPO_ROOT}/tlt_triton/python/entrypoints/ ../inference_images/rgb \
-m peoplenet_tlt \
-x 1 -b 16 --mode DetectNet_v2 \
-i https -u --async \
--output_path ../inference_images_out \
--postprocessing_config ../inference_config/clustering_config_peoplenet_rgb.prototxt
Follow the notebook ir2kitti_preprocess.ipynb to convert the data labels to KITTI format, preprocess the images, and generate TFRecords for training.
We modify the TAO training configuration (training_config/training_spec.txt in the zip file) to point to the TFRecords location and kick off retraining.
detectnet_v2 train -e training_config/training_spec.txt -r experiments -n "final_model" -k "tlt_encode" --gpus 8
In the example configuration, we train for 60 epochs, and accuracy can reach 72% by the end of training.
First, we export the retrained model weights to the etlt format, then optimize the result into a TensorRT engine.
detectnet_v2 export -m /tlt/experiments/weights/final_model.tlt -o /tlt/models/ir.etlt -k "tlt_encode"
tlt-converter -k "tlt_encode" -d 3,544,960 -e /tlt/models/ir.engine -o output_cov/Sigmoid,output_bbox/BiasAdd /tlt/models/ir.etlt
Second, we copy the retrained engine to a GCS directory.
gsutil cp /tlt/models/ir.engine gs://dongm-tlt/tlt/triton-model/peoplenet_ir_tlt/1/model.plan
We also modify the config.pbtxt file to fit the IR model, as shown below:
name: "peoplenet_ir_tlt"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 3, 544, 960 ]
  }
]
output [
  {
    name: "output_bbox/BiasAdd"
    data_type: TYPE_FP32
    dims: [ 4, 34, 60 ]
  },
  {
    name: "output_cov/Sigmoid"
    data_type: TYPE_FP32
    dims: [ 1, 34, 60 ]
  }
]
dynamic_batching { }
Then copy it next to the retrained engine in the bucket:
gsutil cp config.pbtxt gs://dongm-tlt/tlt/triton-model/peoplenet_ir_tlt/
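The output dims in the IR config can be checked against the input shape passed to tlt-converter. The sketch below assumes DetectNet_v2 downsamples the input by a factor of 16 and emits one coverage channel and four bounding-box channels per class; the variable names are illustrative and not part of any TAO tool.

```shell
# Input shape passed to tlt-converter (-d 3,544,960) and the class count.
INPUT_H=544
INPUT_W=960
NUM_CLASSES=1   # the IR model detects a single class, "person"
STRIDE=16       # assumed DetectNet_v2 downsampling factor

# The output grid is the input resolution divided by the stride.
GRID_H=$((INPUT_H / STRIDE))
GRID_W=$((INPUT_W / STRIDE))

# One coverage channel per class, four box coordinates per class.
echo "output_cov/Sigmoid dims:  [ ${NUM_CLASSES}, ${GRID_H}, ${GRID_W} ]"
echo "output_bbox/BiasAdd dims: [ $((4 * NUM_CLASSES)), ${GRID_H}, ${GRID_W} ]"
```

If the model is retrained with more classes or a different input resolution, the dims in config.pbtxt need to change accordingly.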
Then run the client Python script in the notebook tao_triton_client.ipynb, or with the command below:
python3 ${TLT_TRITON_REPO_ROOT}/tlt_triton/python/entrypoints/ directory_to_test_images \
-m peoplenet_ir_tlt \
-x 1 -b 16 --mode DetectNet_v2 \
--class_list person \
-i https -u --async \
--output_path directory_to_test_images_output \
--postprocessing_config inference_config/clustering_config_peoplenet_ir.prototxt