Allegro Trains Agent

Description

Allegro Trains delivers an optimized, seamless, and scalable solution for training on DGX machines with ML-Ops and experiment management features.

Publisher: Allegro AI
Latest Tag: latest
Modified: April 4, 2023
Compressed Size: 1.86 GB
Multinode Support: Yes

Multi-Arch Support

What is the Allegro Trains platform?

Allegro Trains is a full-system, open-source ML / DL experiment manager and ML-Ops solution. It is composed of a Python SDK, a server, a Web UI, and execution agents. Allegro Trains enables data scientists and data engineers to track, manage, compare, and collaborate on their experiments, and to manage their training workloads on remote machines. It is designed for effortless integration so that teams can preserve their existing methods and practices. Use it on a daily basis to boost collaboration and visibility, or use it to automatically collect your experimentation logs, outputs, and data on one centralized server.

This container is intended for use by the allegro-trains Helm Chart; see the chart's documentation for more information.

Trains Agent: Simple and Flexible Experiment Orchestration

With the Trains Agent and zero configuration, you can set up a dynamic AI experiment cluster.

Trains Agent was built to address the DL/ML R&D DevOps needs:

- Easily add & remove machines from the cluster
- Reuse machines without the need for any dedicated containers or images
- Combine GPU resources across any cloud and on-prem
- No need for yaml/json/template configuration of any kind
- User friendly UI
- Manageable resource allocation that can be used by researchers and engineers
- Flexible and controllable scheduler with priority support
- Automatic instance spinning in the cloud (coming soon)
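
To make the idea of a scheduler with priority support concrete, here is a minimal sketch of a priority job queue. It is an illustration of the concept only and is not the Trains scheduler or its API: the class `PriorityJobQueue`, its methods, and the experiment names are all hypothetical.

```python
import heapq
import itertools


class PriorityJobQueue:
    """Toy queue (hypothetical, not part of Trains).

    A lower priority number is served first; jobs with the same
    priority are served in the order they were enqueued.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves insertion order

    def enqueue(self, experiment_id, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), experiment_id))

    def next_job(self):
        """Return the next experiment id, or None when the queue is empty."""
        if not self._heap:
            return None
        _, _, experiment_id = heapq.heappop(self._heap)
        return experiment_id


queue = PriorityJobQueue()
queue.enqueue("baseline-run", priority=5)
queue.enqueue("urgent-fix", priority=1)
queue.enqueue("ablation-run", priority=5)

order = [queue.next_job() for _ in range(3)]
print(order)  # ['urgent-fix', 'baseline-run', 'ablation-run']
```

In this sketch a worker machine would simply call `next_job()` whenever it becomes free, which is one simple way machines could be added to or removed from a pool without extra configuration.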

Trains Agent executes experiments using the following process:

1. Create a new virtual environment (or launch the selected docker image)
2. Clone the code into the virtual-environment (or inside the docker)
3. Install python packages based on the package requirements listed for the experiment
   - Special note for PyTorch: The Trains Agent will automatically select the torch packages based on the CUDA_VERSION environment variable of the machine
4. Execute the code while monitoring the process
5. Log all stdout/stderr in the Trains UI, including the cloning and installation process, for easy debugging
6. Monitor the execution and allow you to manually abort the job using the Trains UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
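
The process above can be sketched roughly in Python. This is a simplified illustration under stated assumptions, not the Trains Agent's actual implementation: the helper names (`torch_cuda_tag`, `plan_setup_commands`, `run_and_monitor`), the `cu101`-style tag format, the directory layout, and the example repository URL are all hypothetical. Manual abort from the UI is not covered.

```python
import os
import subprocess
import sys


def torch_cuda_tag(cuda_version):
    """Map a CUDA version string such as '10.1' to a tag such as 'cu101'.

    Hypothetical helper: the real agent's selection logic may differ.
    An empty value falls back to a CPU-only tag.
    """
    if not cuda_version:
        return "cpu"
    parts = cuda_version.split(".")
    minor = parts[1] if len(parts) > 1 else "0"
    return f"cu{parts[0]}{minor}"


def plan_setup_commands(repo_url, workdir):
    """Build (but do not run) the commands for the environment, clone, and install steps."""
    env_dir = os.path.join(workdir, "venv")
    code_dir = os.path.join(workdir, "code")
    pip = os.path.join(env_dir, "bin", "pip")  # POSIX layout assumed
    return [
        [sys.executable, "-m", "venv", env_dir],
        ["git", "clone", repo_url, code_dir],
        [pip, "install", "-r", os.path.join(code_dir, "requirements.txt")],
    ]


def run_and_monitor(command, log_lines):
    """Run a command, collect merged stdout/stderr, and report the final status."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so one log holds both
        text=True,
    )
    for line in process.stdout:
        log_lines.append(line.rstrip("\n"))
    return "completed" if process.wait() == 0 else "failed"


# Select a torch tag from the machine's CUDA_VERSION environment variable.
tag = torch_cuda_tag(os.environ.get("CUDA_VERSION", ""))

# Plan the setup steps for a placeholder repository (nothing is executed here).
commands = plan_setup_commands("https://example.com/repo.git", "/tmp/experiment")

# Execute a stand-in "experiment" while capturing its output.
log = []
status = run_and_monitor([sys.executable, "-c", "print('training step 1')"], log)
print(status, log)  # completed ['training step 1']
```

A non-zero exit code, such as an uncaught exception in the experiment code, makes `run_and_monitor` return `"failed"`, which mirrors the crash-handling behavior described in the last step.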

For more information on using the Trains Agent, see Using the Trains Agent or the complete Trains Documentation page.

License Information

Trains Agent is provided under the Apache License, Version 2.0.

Documentation, Community & Support

The Trains documentation is available online.

For more examples and use cases, see the project's examples.

If you have any questions, post on our Slack Channel, or tag your questions on Stack Overflow with the 'trains' tag.

For feature requests or bug reports, please use GitHub issues.

Additionally, you can always find us at trains@allegro.ai