Optimizing Large-Scale AI Performance with Pretraining Validation on a Single Azure ND GB200 v6

by Mishty Dhekial (Software Engineer Intern) and Hugo Affaticati (Cloud Infrastructure Engineer)

Why Llama?

The Llama3 8B model was selected as the focus of this analysis due to its relevance as a modern, open-weight large language model (LLM) architecture. Llama models are widely used in both research and industry. Their…

19/08/2025, Azure High Performance Computing (HPC) Blog
