The Allegro Trains platform is a full-system, open source ML/DL experiment manager and ML-Ops solution. It is composed of a Python SDK, a server, a Web UI, and execution agents. Allegro Trains enables data scientists and data engineers to effortlessly track, manage, compare, and collaborate on their experiments, as well as easily manage their training workloads on remote machines. Allegro Trains is designed for effortless integration, so teams can preserve their existing methods and practices. Use it daily to boost collaboration and visibility, or use it to automatically collect your experimentation logs, outputs, and data into one centralized server.
With only two lines added to your code, the Allegro Trains Python package automatically tracks:
> Supported frameworks: TensorFlow, PyTorch, Keras, XGBoost, and scikit-learn (MXNet is coming soon)
> Includes seamless integration (including version control) with Jupyter Notebook and PyCharm remote debugging
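The two-line integration can be sketched as follows. This is a minimal illustration: the project and task names are placeholders, and the snippet falls back gracefully when the `trains` package (installable with `pip install trains`) is not present, so it stays self-contained.

```python
# Minimal sketch of the two-line Trains integration.
# The project and task names below are illustrative placeholders.
try:
    from trains import Task

    # These are the only two lines added to a training script:
    task = Task.init(project_name="examples", task_name="my first experiment")
    integrated = True
except ImportError:
    # Keeps this snippet runnable on machines without the package installed.
    integrated = False

print("trains integration active:", integrated)
```

Once `Task.init()` runs, the rest of your script needs no changes; the package hooks into the supported frameworks automatically.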
Use the Trains Demo Server to try out Trains and test your code with no additional setup.
For more information on using the Trains python package, see our Quick Start Guide or the complete Trains Documentation page.
With Trains Agent and zero configuration, you can set up a dynamic AI experiment cluster.
Trains Agent was built to address DL/ML R&D DevOps needs:
Trains Agent executes experiments using the following process:
For more information on using the Trains Agent, see Using the Trains Agent or the complete Trains Documentation page.
The Trains Server is the backend service infrastructure for Trains. It allows multiple users to collaborate and manage their experiments. By default, Allegro Trains is set up to work with the Trains Demo Server, which is open to anyone and resets periodically. In order to host your own server, you will need to install Trains Server and point Trains to it.
Trains Server contains the following components:
Follow the instructions below to add and deploy Trains Server (and Trains Agents) to your Kubernetes cluster using Helm.

You will need:

- a Kubernetes cluster
- `kubectl` installed and configured (see Install and Set Up kubectl in the Kubernetes documentation)
- `helm` installed (see Installing Helm in the Helm documentation)
- one node labeled `app: trains` (for example, `kubectl label nodes <node-name> app=trains`)

Important: The Trains Server deployment uses node storage. If more than one node is labeled `app: trains` and you redeploy or update later, Trains Server may not locate all of your data.
Trains Server uses Elasticsearch, which requires some specific system settings (for more information, see Notes for production use and defaults):

Connect to the node you labeled as `app=trains`.

If your system contains an `/etc/sysconfig/docker` Docker configuration file, add the options in quotes to the available arguments in the `OPTIONS` section:

```
OPTIONS="--default-ulimit nofile=1024:65536 --default-ulimit memlock=-1:-1"
```
Otherwise, edit `/etc/docker/daemon.json` (if it exists) or create it (if it does not exist). Add or modify the `default-ulimits` section as shown below. Be sure the `default-ulimits` section contains the `nofile` and `memlock` sub-sections and the values shown.
Note: Your configuration file may contain other sections. If so, confirm that the sections are separated by commas (valid JSON format). For more information about Docker configuration files, see Daemon configuration file, in the Docker documentation.
The default ulimit values required by Trains Server are:

```json
{
  "default-ulimits": {
    "nofile": {
      "name": "nofile",
      "hard": 65536,
      "soft": 1024
    },
    "memlock": {
      "name": "memlock",
      "soft": -1,
      "hard": -1
    }
  }
}
```
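Because a malformed `daemon.json` can prevent the Docker daemon from restarting, it helps to sanity-check the file before restarting. The following stdlib-only Python sketch (a hypothetical helper, not part of Trains) verifies that a `daemon.json` string contains the required ulimit values:

```python
import json

# The values required by Trains Server, as listed above.
REQUIRED = {
    "nofile": {"soft": 1024, "hard": 65536},
    "memlock": {"soft": -1, "hard": -1},
}

def check_daemon_json(text):
    """Return a list of problems found in a Docker daemon.json string."""
    problems = []
    try:
        cfg = json.loads(text)
    except ValueError as e:
        return ["invalid JSON: %s" % e]
    ulimits = cfg.get("default-ulimits", {})
    for name, want in REQUIRED.items():
        entry = ulimits.get(name)
        if entry is None:
            problems.append("missing %s" % name)
            continue
        for key, value in want.items():
            if entry.get(key) != value:
                problems.append("%s.%s should be %s" % (name, key, value))
    return problems

sample = """
{
  "default-ulimits": {
    "nofile": {"name": "nofile", "hard": 65536, "soft": 1024},
    "memlock": {"name": "memlock", "soft": -1, "hard": -1}
  }
}
"""
print(check_daemon_json(sample))  # → []
```

An empty list means the configuration matches the required defaults; otherwise each entry names the missing or mismatched setting.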
Set the Maximum Number of Memory Map Areas
Elasticsearch requires that the `vm.max_map_count` kernel setting, the maximum number of memory map areas a process can use, be set to at least 262144.
For CentOS 7, Ubuntu 16.04 and 18.04, and Mint 18.3 and 19.x, we tested the following commands to set `vm.max_map_count`:

```
echo "vm.max_map_count=262144" > /tmp/99-trains.conf
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
sudo sysctl -w vm.max_map_count=262144
```
For information about setting this parameter on other systems, see the Elasticsearch documentation.
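To confirm the setting took effect, you can read it back from procfs. This small Python sketch (illustrative, not part of Trains) returns `None` on systems without the Linux procfs entry:

```python
# Check whether vm.max_map_count meets Elasticsearch's required minimum.
REQUIRED_MAP_COUNT = 262144

def max_map_count_ok(proc_path="/proc/sys/vm/max_map_count"):
    """Return True/False, or None when the setting cannot be read."""
    try:
        with open(proc_path) as f:
            return int(f.read().strip()) >= REQUIRED_MAP_COUNT
    except (OSError, ValueError):
        return None

print(max_map_count_ok())
```

If it prints `False`, rerun the `sysctl` commands above before starting the server.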
Restart Docker:

```
sudo service docker restart
```
Fetch the Trains Helm chart to your local directory:

```
helm fetch https://helm.ngc.nvidia.com/partners/charts/trains-chart-0.14.1+1.tgz
```
By default, the Trains Server deployment uses storage on a single node (labeled `app=trains`). To change the type of storage used (for example, NFS), see Configuring trains-server storage for NFS.
By default, one instance of Trains Agent is created in the `trains` namespace of your Kubernetes cluster. To change this setting, create a local `values.yaml` file as specified in Configuring Trains Agents in your cluster.
Install `trains-chart` on your cluster:

```
helm install trains-chart-0.14.1+1.tgz --namespace=trains --name trains-server
```

Alternatively, if you created a local `values.yaml` file, use:

```
helm install trains-chart-0.14.1+1.tgz --namespace=trains --name trains-server --values values.yaml
```

A `trains` namespace is created in your cluster, and Trains Server and Agent(s) are deployed in it.
After the Trains Server is deployed, the services expose the following node ports:

- 30008 (API server)
- 30080 (Web-App)
- 30081 (File Server)

For example, to access the Trains Web-App, point your browser to http://[trains-server-node-ip]:30080, where the Trains Server node is the node labeled `app=trains`.
Access Trains Server by creating a load balancer and a domain name with records pointing to the load balancer. Once you have a load balancer and domain name set up, follow these steps to configure access to Trains Server on your Kubernetes cluster:
Create domain records

Create three records, to be used for Trains Web-App, File Server, and API access, using the following rules:

- app.[your domain name]
- files.[your domain name]
- api.[your domain name]

For example: app.trains.mydomainname.com, files.trains.mydomainname.com, and api.trains.mydomainname.com.
Point the records you created to the load balancer

Configure the load balancer to redirect traffic coming from the records you created:

- app.[your domain name] should be redirected to the Kubernetes Trains Server node on port 30080
- files.[your domain name] should be redirected to the Kubernetes Trains Server node on port 30081
- api.[your domain name] should be redirected to the Kubernetes Trains Server node on port 30008
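The routing rules above boil down to a small lookup table. This illustrative Python sketch (the domain name is a placeholder, and the helper is not part of Trains) returns the node port a given hostname should be forwarded to, which can be handy when scripting load-balancer configuration:

```python
# Subdomain-to-node-port mapping for the Trains Server services.
NODE_PORTS = {"app": 30080, "files": 30081, "api": 30008}

def backend_for(hostname, domain="trains.mydomain.com"):
    """Return the node port a hostname should be forwarded to, or None."""
    for prefix, port in NODE_PORTS.items():
        if hostname == "%s.%s" % (prefix, domain):
            return port
    return None

print(backend_for("api.trains.mydomain.com"))  # → 30008
```

Any hostname outside the three records returns `None`, so the load balancer can reject it.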
To create Trains Agent instances as part of your deployment, create or update your local `values.yaml` file. Use this `values.yaml` file in your `helm install` command (see Deploying trains-server in Kubernetes Clusters Using Helm).

The file must contain the following values in the `agent` section:

- numberOfTrainsAgents: the number of Trains Agent pods to deploy. Each Agent pod listens for and executes experiments from the Trains Server
- nvidiaGpusPerAgent: the number of GPUs required by each Agent pod
- trainsApiHost: the URL used to access the Trains API server, as defined in your load balancer (usually https://api.[your domain name], see Accessing trains-server)
- trainsWebHost: the URL used to access the Trains Web-App, as defined in your load balancer (usually https://app.[your domain name], see Accessing trains-server)
- trainsFilesHost: the URL used to access the Trains File Server, as defined in your load balancer (usually https://files.[your domain name], see Accessing trains-server)

Additional optional values in the `agent` section include:

- defaultBaseDocker: the default Docker image used by the Agent running in the Agent pod to execute an experiment. Default is nvidia/cuda
- agentVersion: the specific Agent version to use in the deployment, for example "==0.13.3". Default is null (use the latest version)
- trainsGitUser / trainsGitPassword: Git credentials used by the Trains Agent when cloning the Git repository defined in an experiment, if defined. Default is null (not used)
- awsAccessKeyId / awsSecretAccessKey / awsDefaultRegion: AWS account info used by the Trains Python package when uploading files to an AWS S3 bucket (not required if only using the default Trains File Server). Default is null (not used)
- azureStorageAccount / azureStorageKey: Azure account info used by the Trains Python package when uploading files to the MS Azure Blob service (not required if only using the default Trains File Server). Default is null (not used)

For example, the following `values.yaml` file requests 4 Agent instances in your deployment (see chart-example-values.yaml):
```
agent:
  numberOfTrainsAgents: 4
  nvidiaGpusPerAgent: 1
  defaultBaseDocker: "nvidia/cuda"
  trainsApiHost: "https://api.trains.mydomain.com"
  trainsWebHost: "https://app.trains.mydomain.com"
  trainsFilesHost: "https://files.trains.mydomain.com"
  trainsGitUser: null
  trainsGitPassword: null
  awsAccessKeyId: null
  awsSecretAccessKey: null
  awsDefaultRegion: null
  azureStorageAccount: null
  azureStorageKey: null
```
The Trains Server deployment uses a `PersistentVolume` of type `HostPath`, which uses a fixed path on the node labeled `app: trains`.

The existing chart supports changing the volume type to `NFS` by setting the `use_nfs` value and configuring the NFS persistent volume using additional values in your local `values.yaml` file:
```
storage:
  use_nfs: true
  nfs:
    server: "[nfs-server-ip-address]"
    base_path: "/nfs/path/for/trains/data"
```
You can also configure the Trains Server for:
For detailed instructions, see Configuring Trains Server in the Trains Documentation page.
Allegro Trains documentation is available here.
For more examples and use cases, check our examples.
If you have any questions, post on our Slack channel, or tag your question on Stack Overflow with the 'trains' tag.
For feature requests or bug reports, please use GitHub issues.
Additionally, you can always reach us at trains@allegro.ai.