<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE FL_Course SYSTEM "https://www.flane.de/dtd/fl_course095.dtd"><?xml-stylesheet type="text/xsl" href="https://portal.flane.ch/css/xml-course.xsl"?><course productid="34497" language="en" source="https://portal.flane.ch/swisscom/en/xml-course/nvidia-mpbdlnn" lastchanged="2025-07-29T12:18:27+02:00" parent="https://portal.flane.ch/swisscom/en/xml-courses"><title>Model Parallelism: Building and Deploying Large Neural Networks</title><productcode>MPBDLNN</productcode><vendorcode>NV</vendorcode><vendorname>Nvidia</vendorname><fullproductcode>NV-MPBDLNN</fullproductcode><version>1.0</version><objective>&lt;p&gt;In this workshop, participants will learn how to:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Train neural networks across multiple servers&lt;/li&gt;&lt;li&gt;Use techniques such as activation checkpointing, gradient accumulation, and various forms of model parallelism to overcome the memory-footprint challenges of large models&lt;/li&gt;&lt;li&gt;Capture and understand training performance characteristics to optimize model architecture&lt;/li&gt;&lt;li&gt;Deploy very large multi-GPU models to production using NVIDIA Triton&amp;trade; Inference Server&lt;/li&gt;&lt;/ul&gt;</objective><essentials>&lt;p&gt;Familiarity with:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A good understanding of PyTorch&lt;/li&gt;&lt;li&gt;A good understanding of deep learning and data-parallel training concepts&lt;/li&gt;&lt;li&gt;Hands-on practice with deep learning and data-parallel training is useful, but optional&lt;/li&gt;&lt;/ul&gt;</essentials><outline>&lt;h4&gt;Introduction&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Meet the instructor.&lt;/li&gt;&lt;li&gt;Create an account at courses.nvidia.com/join.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Introduction to Training of Large Models&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;Learn about the motivation behind and key challenges of training large models.&lt;/li&gt;&lt;li&gt;Get an overview of the basic techniques and tools needed for large-scale training.&lt;/li&gt;&lt;li&gt;Get an introduction to distributed training and the Slurm job scheduler.&lt;/li&gt;&lt;li&gt;Train a GPT model using data parallelism.&lt;/li&gt;&lt;li&gt;Profile the training process and understand execution performance.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Model Parallelism: Advanced Topics&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Increase the model size using a range of memory-saving techniques.&lt;/li&gt;&lt;li&gt;Get an introduction to tensor and pipeline parallelism.&lt;/li&gt;&lt;li&gt;Go beyond natural language processing and get an introduction to DeepSpeed.&lt;/li&gt;&lt;li&gt;Auto-tune model performance.&lt;/li&gt;&lt;li&gt;Learn about mixture-of-experts models.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Inference of Large Models&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Understand the deployment challenges associated with large models.&lt;/li&gt;&lt;li&gt;Explore techniques for model reduction.&lt;/li&gt;&lt;li&gt;Learn how to use TensorRT-LLM.&lt;/li&gt;&lt;li&gt;Learn how to use Triton Inference Server.&lt;/li&gt;&lt;li&gt;Understand the process of deploying a GPT checkpoint to production.&lt;/li&gt;&lt;li&gt;See an example of prompt engineering.&lt;/li&gt;&lt;/ul&gt;&lt;h4&gt;Final Review&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;Review key learnings and answer questions.&lt;/li&gt;&lt;li&gt;Complete the assessment and earn a certificate.&lt;/li&gt;&lt;li&gt;Complete the workshop survey.&lt;/li&gt;&lt;/ul&gt;</outline><objective_plain>In this workshop, participants will learn how to:



- Train neural networks across multiple servers
- Use techniques such as activation checkpointing, gradient accumulation, and various forms of model parallelism to overcome the memory-footprint challenges of large models
- Capture and understand training performance characteristics to optimize model architecture
- Deploy very large multi-GPU models to production using NVIDIA Triton™ Inference Server</objective_plain><essentials_plain>Familiarity with:



- A good understanding of PyTorch
- A good understanding of deep learning and data-parallel training concepts
- Hands-on practice with deep learning and data-parallel training is useful, but optional</essentials_plain><outline_plain>Introduction



- Meet the instructor.
- Create an account at courses.nvidia.com/join.
Introduction to Training of Large Models


- Learn about the motivation behind and key challenges of training large models.
- Get an overview of the basic techniques and tools needed for large-scale training.
- Get an introduction to distributed training and the Slurm job scheduler.
- Train a GPT model using data parallelism.
- Profile the training process and understand execution performance.
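
The data-parallel recipe covered in this module can be illustrated with a minimal NumPy sketch. It is a conceptual sketch only: the workshop itself uses PyTorch on multi-GPU Slurm clusters, whereas here the "workers" and the all-reduce collective are simulated inside a single process, and the linear-regression model and shard sizes are illustrative assumptions.

```python
import numpy as np

# Data parallelism, conceptually: every worker holds a full copy of the
# model, computes a gradient on its own shard of the batch, and an
# all-reduce averages the gradients so all workers apply the same update.

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # global batch of 8 examples
y = rng.normal(size=(8,))
w = np.zeros(3)                      # replicated model weights

def local_grad(Xs, ys, w):
    """Gradient of mean squared error on one worker's shard."""
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

# Two simulated workers, each with half of the batch.
shards = [(X[:4], y[:4]), (X[4:], y[4:])]
grads = [local_grad(Xs, ys, w) for Xs, ys in shards]
allreduced = sum(grads) / len(grads)   # the all-reduce (average) step

# The averaged gradient matches the gradient over the full batch, which is
# why data-parallel training reproduces the single-node update.
full = local_grad(X, y, w)
print(np.allclose(allreduced, full))   # True
```

The equality holds because the shards are equally sized; with uneven shards, the all-reduce would need to weight each worker's gradient by its shard size.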
Model Parallelism: Advanced Topics



- Increase the model size using a range of memory-saving techniques.
- Get an introduction to tensor and pipeline parallelism.
- Go beyond natural language processing and get an introduction to DeepSpeed.
- Auto-tune model performance.
- Learn about mixture-of-experts models.
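
Tensor parallelism, introduced in this module, splits individual weight matrices across devices rather than replicating them. A minimal NumPy sketch of Megatron-style column- and row-parallel linear layers follows; device shards are simulated as array slices, and the layer sizes are illustrative assumptions, not the workshop's actual configuration.

```python
import numpy as np

# Tensor (intra-layer) parallelism for a linear layer Y = X @ W,
# with two simulated devices.

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))            # activations: batch x d_in
W = rng.normal(size=(6, 8))            # full weight: d_in x d_out

# Column parallelism: each device holds half of W's output columns and
# produces half of the output features; a gather concatenates them.
W_cols = np.split(W, 2, axis=1)
Y_col = np.concatenate([X @ Wi for Wi in W_cols], axis=1)

# Row parallelism: each device holds half of W's input rows (and the
# matching slice of X); an all-reduce sums the partial products.
W_rows = np.split(W, 2, axis=0)
X_parts = np.split(X, 2, axis=1)
Y_row = sum(Xi @ Wi for Xi, Wi in zip(X_parts, W_rows))

# Both sharding schemes reproduce the unsharded result exactly.
print(np.allclose(Y_col, X @ W), np.allclose(Y_row, X @ W))  # True True
```

In practice the two schemes are alternated (column-parallel then row-parallel) so that a transformer MLP block needs only one all-reduce per forward pass.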
Inference of Large Models



- Understand the deployment challenges associated with large models.
- Explore techniques for model reduction.
- Learn how to use TensorRT-LLM.
- Learn how to use Triton Inference Server.
- Understand the process of deploying a GPT checkpoint to production.
- See an example of prompt engineering.
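
As a rough illustration of the model-reduction theme in this module, here is a NumPy sketch of post-training int8 weight quantization. TensorRT-LLM applies far more sophisticated schemes; the per-tensor symmetric mapping and single scale factor used here are illustrative assumptions, chosen only to show the basic storage trade-off.

```python
import numpy as np

# Post-training quantization, conceptually: store weights as int8 plus one
# float scale, and dequantize on the fly at inference time.

rng = np.random.default_rng(2)
W = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in weight matrix

# Symmetric per-tensor quantization: map floats onto [-127, 127] with a
# single scale derived from the largest-magnitude weight.
scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_deq = W_q.astype(np.float32) * scale              # dequantized copy

print(W_q.nbytes / W.nbytes)                   # 0.25 -> 4x smaller weights
print(float(np.abs(W - W_deq).max()) < scale)  # rounding error < one step
```

The accuracy cost of that rounding error is model-dependent, which is why production pipelines validate quantized checkpoints before serving them.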
Final Review


- Review key learnings and answer questions.
- Complete the assessment and earn a certificate.
- Complete the workshop survey.</outline_plain><duration unit="d" days="1">1 day</duration><pricelist><price country="US" currency="USD">500.00</price><price country="DE" currency="EUR">500.00</price><price country="AT" currency="EUR">500.00</price><price country="SE" currency="EUR">500.00</price><price country="SI" currency="EUR">500.00</price><price country="GB" currency="GBP">420.00</price><price country="IT" currency="EUR">500.00</price><price country="CA" currency="CAD">690.00</price></pricelist><miles/></course>