Allegro Trains Agent

Description: Allegro Trains delivers an optimized, seamless, and scalable solution for training on DGX machines with ML-Ops and experiment management features.
Publisher: Allegro AI
Latest Tag: latest
Modified: April 4, 2023
Compressed Size: 1.86 GB
Multinode Support: Yes
Multi-Arch Support: No

What is the Allegro Trains platform?

Allegro Trains is a full-system, open-source ML/DL experiment manager and ML-Ops solution. It is composed of a Python SDK, a server, a Web UI, and execution agents. Allegro Trains enables data scientists and data engineers to track, manage, compare, and collaborate on their experiments, and to manage their training workloads on remote machines. It is designed for easy integration, so teams can keep their existing methods and practices. Use it on a daily basis to boost collaboration and visibility, or use it to automatically collect your experimentation logs, outputs, and data on one centralized server.

This container is intended for use by the allegro-trains Helm Chart - see here for more information.

Trains Agent: Simple and Flexible Experiment Orchestration

With the Trains Agent and zero configuration, you can set up a dynamic AI experiment cluster.

Trains Agent was built to address the DL/ML R&D DevOps needs:

  • Easily add & remove machines from the cluster
  • Reuse machines without the need for any dedicated containers or images
  • Combine GPU resources across any cloud and on-prem
  • No need for yaml/json/template configuration of any kind
  • User friendly UI
  • Manageable resource allocation that can be used by researchers and engineers
  • Flexible and controllable scheduler with priority support
  • Automatic instance spinning in the cloud (coming soon)

Trains Agent executes experiments using the following process:

  • Create a new virtual environment (or launch the selected docker image)
  • Clone the code into the virtual-environment (or inside the docker)
  • Install python packages based on the package requirements listed for the experiment
    • Special note for PyTorch: The Trains Agent will automatically select the torch packages based on the CUDA_VERSION environment variable of the machine
  • Execute the code while monitoring the process
  • Log all stdout/stderr in the Trains UI, including the cloning and installation process, for easy debugging
  • Monitor the execution and allow you to manually abort the job using the Trains UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
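
The process above can be pictured with a small Python sketch. This is an illustration only, not the Trains Agent's actual implementation: the function names (cuda_tag, plan_experiment, run_and_log), the requirements.txt location, the POSIX venv layout, and the "cu101"-style tag format are all assumptions made for the example. It shows the general shape of the flow: derive a CUDA tag from CUDA_VERSION, plan the environment, clone, install, and run steps, then execute them while capturing stdout/stderr and reporting failure if any step crashes.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def cuda_tag(cuda_version: Optional[str]) -> str:
    """Map a CUDA_VERSION value such as '10.1' to a tag such as 'cu101'.

    Hypothetical helper: the real agent's selection logic may differ.
    Returns 'cpu' when no CUDA version is set.
    """
    if not cuda_version:
        return "cpu"
    # Pad with "0" so a bare major version such as "11" still yields two parts.
    major, minor = (cuda_version.split(".") + ["0"])[:2]
    return f"cu{major}{minor}"


def plan_experiment(repo_url: str, script: str, workdir: Path) -> List[List[str]]:
    """Return the commands for one experiment, in order, without running them."""
    venv = workdir / "venv"
    code = workdir / "code"
    python = venv / "bin" / "python"  # assumes a POSIX virtual-environment layout
    return [
        [sys.executable, "-m", "venv", str(venv)],        # create the environment
        ["git", "clone", repo_url, str(code)],            # clone the code
        [str(python), "-m", "pip", "install", "-r",
         str(code / "requirements.txt")],                 # install the requirements
        [str(python), str(code / script)],                # execute the experiment
    ]


def run_and_log(commands: List[List[str]], log: List[str]) -> str:
    """Run each command, append its combined stdout/stderr to log, stop on failure."""
    for cmd in commands:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            text=True,
        )
        log.append(proc.stdout)
        if proc.returncode != 0:
            return "failed"  # signal that the experiment has failed
    return "completed"
```

A caller might build the plan with plan_experiment(repo_url, "train.py", Path("/tmp/job")) and pass it to run_and_log together with an empty list; the list then holds one captured output per step, including the clone and install steps, which mirrors the debugging log described above. Manual abort from the UI is not modeled here.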

For more information on using the Trains Agent, see Using the Trains Agent or the complete Trains Documentation page.

License Information

Trains Agent is provided under the Apache License, Version 2.0.

Documentation, Community & Support

TRAINS documentation is available here

For more examples and use cases, check examples.

If you have any questions, post on our Slack Channel, or tag your questions on Stack Overflow with the 'trains' tag.

For feature requests or bug reports, please use GitHub issues.

Additionally, you can always find us at trains@allegro.ai