b. Create a distributed ML cluster

In this step, you create a cluster configuration that supports your Distributed Machine Learning task.

If you are not familiar with AWS ParallelCluster, EFA, and FSx, we recommend that you first complete the Amazon FSx for Lustre lab and the AWS EFA lab. In particular, you need to be able to examine the FSx file system and the EFA-enabled instance. Using NICE DCV to interact with the cluster through a remote desktop is optional; see the Remote Visualization using NICE DCV lab for more information.

Create a Cluster Configuration File

This section assumes that you are familiar with AWS ParallelCluster and the process of bootstrapping a cluster.

Reuse the SSH key pair that you created earlier.

The cluster configuration that you generate for training large-scale ML models includes the EFA and FSx constructs that you explored in the previous sections of this workshop. The main additions to the cluster configuration script are:

  • Set the compute nodes as p3dn.24xlarge instances. The p3dn.24xlarge is one of the EFA supported instance types with multiple GPUs.
  • Set the initial cluster size to 0 compute nodes and the maximum size to 2 instances. The cluster grows and shrinks between the minimum and maximum limits based on cluster utilization and the job queue backlog.
  • Set the compute capacity type to Spot with CapacityType: SPOT. Amazon EC2 Spot Instances cost less than On-Demand Instances, but they can be interrupted. Because the training workload checkpoints the model (saves it as training progresses), you can restart training after a job failure. Consider another capacity type if Spot Instance availability is limited or if you run large-scale training workloads that cannot be interrupted. Refer to this documentation to learn more about the impact of Spot Instance interruptions in ParallelCluster.
  • Set the custom actions install script URL to the S3 path of the Conda configuration script. You also need to grant ParallelCluster access to this S3 bucket. Add the following to the configuration:
CustomActions:
  OnNodeConfigured:
    Script: s3://mlbucket-${BUCKET_POSTFIX}/post-install.sh
Iam:
  S3Access:
    - BucketName: mlbucket-${BUCKET_POSTFIX}
  • The job scheduler for this example is Slurm.

For more details about the configuration options, see the AWS ParallelCluster User Guide, in particular its EFA parameters and FSx parameters sections.

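If you are using a different terminal than in the previous section, make sure that the BUCKET_POSTFIX and AWS_KEYPAIR variables are still set, so that the Amazon S3 bucket name and the key pair name in the configuration are correct.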
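The configuration below expands BUCKET_POSTFIX and AWS_KEYPAIR inside a heredoc, so an unset variable silently produces an invalid bucket or key name. The following is a minimal sketch of a check you could run first; the helper name require_vars is hypothetical and not part of the workshop or of ParallelCluster.

```shell
# Report any of the named environment variables that are unset or empty.
# Returns 0 when all of them are set, 1 otherwise.
# require_vars is a hypothetical helper name used only for this sketch.
require_vars() {
  rv_missing=0
  for rv_name in "$@"; do
    # Look up the value of the variable whose name is stored in rv_name.
    eval "rv_value=\${${rv_name}:-}"
    if [ -z "${rv_value}" ]; then
      echo "missing: ${rv_name}" >&2
      rv_missing=1
    fi
  done
  return "${rv_missing}"
}

# These are the two variables that the configuration heredoc relies on.
require_vars BUCKET_POSTFIX AWS_KEYPAIR \
  || echo "Set the variables listed above before you generate the configuration."
```

If the check reports a missing variable, export it again with the value from the earlier section before you continue.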

# Create the cluster configuration.
# Query the instance metadata service for the current Region, the MAC address
# of the network interface, and the subnet that the interface belongs to.
export AWS_REGION=$(curl --silent http://169.254.169.254/latest/meta-data/placement/region)
export IFACE=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/)
export SUBNET_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${IFACE}/subnet-id)
cat > ml-config.yaml << EOF
Region: ${AWS_REGION}
Image:
  Os: alinux2
SharedStorage:
  - MountDir: /shared
    Name: default-ebs
    StorageType: Ebs

  - Name: fsxshared
    StorageType: FsxLustre
    MountDir: /lustre
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportPath: s3://mlbucket-${BUCKET_POSTFIX}
      DeploymentType: SCRATCH_2

HeadNode:
  InstanceType: c5n.2xlarge
  Networking:
    SubnetId: ${SUBNET_ID}
  Ssh:
    KeyName: ${AWS_KEYPAIR}
  Dcv:
    Enabled: true

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
      - Name: p3dn24xlarge
        InstanceType: p3dn.24xlarge
        MinCount: 0
        MaxCount: 2
        DisableSimultaneousMultithreading: true
        Efa:
          Enabled: true
      CapacityType: SPOT
      CustomActions:
        OnNodeConfigured:
          Script: s3://mlbucket-${BUCKET_POSTFIX}/post-install.sh
      Iam:
        S3Access:
          - BucketName: mlbucket-${BUCKET_POSTFIX}
      Networking:
        SubnetIds:
          - ${SUBNET_ID}
        PlacementGroup:
          Enabled: true
EOF

If you want to check the content of your configuration file, use the following command:

cat ml-config.yaml
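Beyond reading the file by eye, you could scan it for values that came out empty because a shell variable was not set when the heredoc was expanded. The sketch below is one way to do that with grep; the function name check_config is hypothetical, and the patterns only cover the variables used in this configuration (Region, subnet, key pair, and bucket postfix).

```shell
# Scan a generated configuration file for values left empty by an unset variable.
# Prints the offending lines with their line numbers and returns 1 if any are found.
# check_config is a hypothetical helper name used only for this sketch.
check_config() {
  # Matches: a bucket name that ends right after "mlbucket-",
  # a Region/SubnetId/KeyName key with no value, or a list item with no value.
  if grep -nE 'mlbucket-(/|$)|(Region|SubnetId|KeyName):[[:space:]]*$|^[[:space:]]*-[[:space:]]*$' "$1"; then
    return 1
  fi
  return 0
}

# Run the check only if the file from the previous step exists.
if [ -f ml-config.yaml ]; then
  check_config ml-config.yaml || echo "Empty values found; re-export the variables and regenerate the file."
fi
```

No output means that none of the expected values is empty. This does not validate the configuration against ParallelCluster itself.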

Now, you are ready to create your Distributed ML cluster.

Generate a Cluster for Machine Learning

Create the cluster using the following command. This process takes about 15 minutes, depending on the resources and settings.

pcluster create-cluster --cluster-name ml-cluster -c ml-config.yaml

The cluster creation continues even if the terminal session you are on gets terminated. To check on the status of the creation, use the command: pcluster describe-cluster --cluster-name ml-cluster.
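Instead of rerunning the status command by hand, you could poll it in a loop. The sketch below assumes that the JSON printed by pcluster describe-cluster contains a clusterStatus field with values such as CREATE_IN_PROGRESS, CREATE_COMPLETE, and CREATE_FAILED; the function name wait_for_cluster and the 60-second default interval are choices made for this example.

```shell
# Poll the cluster status until creation completes or stops for another reason.
# Usage: wait_for_cluster <cluster-name> [interval-in-seconds]
# wait_for_cluster is a hypothetical helper name used only for this sketch.
wait_for_cluster() {
  wfc_name="$1"
  wfc_interval="${2:-60}"
  while :; do
    # Extract the value of the "clusterStatus" field from the JSON output.
    wfc_status=$(pcluster describe-cluster --cluster-name "${wfc_name}" \
      | sed -n 's/.*"clusterStatus": *"\([A-Z_]*\)".*/\1/p')
    echo "cluster ${wfc_name}: ${wfc_status:-unknown}"
    case "${wfc_status}" in
      CREATE_COMPLETE) return 0 ;;
      CREATE_IN_PROGRESS) sleep "${wfc_interval}" ;;
      *) return 1 ;;   # failed, rolled back, or status could not be read
    esac
  done
}

# Example call (commented out so that pasting the function does nothing by itself):
# wait_for_cluster ml-cluster 60
```

The function returns 0 once the status is CREATE_COMPLETE and a non-zero code for any status other than CREATE_IN_PROGRESS, so you can chain the connection step after it with &&.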

Connect to Your Cluster

Once the cluster is created, connect to it.

pcluster ssh --cluster-name ml-cluster -i ${AWS_KEYPAIR}.pem

Next, preprocess the training data.