Building an Automated Recovery Pipeline for GPU Clusters with Slurm on Azure Part 2

Disclaimer: The slurm-cluster-health-manager project is a sample tool created specifically for the article it accompanies. It is not an official Microsoft product, and it is not supported or maintained by Microsoft.

In Part 1, we showed how to detect Slurm job failures using Epilog and initiate the first step of an…
