Distributed Machine Learning

This lab requires an AWS Cloud9 IDE. If you do not have one set up, complete sections "a. Sign in to the Console" through "d. Work with the AWS CLI" of the Getting Started in the Cloud workshop.

In this workshop, you’ll learn how to use AWS ParallelCluster with EFA and FSx for Lustre to:

  • Create a new cluster with preconfigured Conda environments.
  • Upload training data from an AWS Cloud9 instance to an Amazon S3 bucket.
  • Set up a distributed file system using FSx for Lustre and preprocess the training data with a Slurm job.
  • Run multi-node, multi-GPU data-parallel training of a large-scale natural language understanding model using the PyTorch DistributedDataParallel API.
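
As a preview of the last item, here is a minimal sketch of one training step with DistributedDataParallel. It is not the workshop's training script. The function name ddp_train_step, the toy linear model, the random data, and the default address and port are all assumptions made for illustration. It uses the CPU-friendly "gloo" backend with a single process so it can run anywhere; multi-GPU jobs typically use the "nccl" backend with one process per GPU.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_train_step(rank: int = 0, world_size: int = 1) -> float:
    """Run one DDP training step on a toy model and return the loss."""
    # A launcher such as torchrun normally sets these variables. The
    # defaults below are assumptions that allow a single-process run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # "gloo" runs on CPU; GPU clusters typically use "nccl" instead.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        torch.manual_seed(0)
        model = DDP(nn.Linear(8, 1))  # placeholder for the real model
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        # Random stand-in data; each rank would load its own shard.
        inputs = torch.randn(16, 8)
        targets = torch.randn(16, 1)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()  # DDP averages gradients across ranks here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()
```

Calling ddp_train_step() with its defaults runs the step in a single process. In a multi-node job, each process would pass its own rank and the total world size, usually taken from environment variables set by the launcher.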