a. Upload training data to S3

In this step, you upload the training data to an Amazon S3 bucket and create the environment setup script used to train a natural language understanding model.

Create an Amazon S3 Bucket and Upload Training Data and Environment Setup Script

First, create an Amazon S3 bucket and upload the training data folder. This training folder will be accessed by the cluster worker nodes through FSx.

  1. Open a terminal in your AWS Cloud9 instance.
  2. Run the following commands to create a new Amazon S3 bucket. These commands also download the WikiText-103 dataset and upload it to the bucket:
# generate a unique postfix
export BUCKET_POSTFIX=$(uuidgen --random | cut -d'-' -f1)
echo "Your bucket name will be mlbucket-${BUCKET_POSTFIX}"
aws s3 mb s3://mlbucket-${BUCKET_POSTFIX}

# downloading data:
export URL="https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip"
export FILE="wikitext-103-v1.zip"
wget $URL -O $FILE
unzip $FILE

# upload to your bucket
aws s3 cp wikitext-103 s3://mlbucket-${BUCKET_POSTFIX}/wikitext-103 --recursive

# delete local copies
rm -rf wikitext-103*
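
Because the last command deletes the local copies, it can help to confirm that the archive unpacked correctly before uploading. The following is a minimal sketch, not part of the workshop itself: `check_wikitext_dir` is a hypothetical helper, and the file names `wiki.train.tokens`, `wiki.valid.tokens`, and `wiki.test.tokens` are assumed to be what the archive unpacks into the `wikitext-103` folder.

```shell
# Hypothetical helper: succeed only if every expected dataset split
# exists in the given directory and is non-empty.
# Assumption: the archive unpacks to wiki.{train,valid,test}.tokens.
check_wikitext_dir() {
    local dir="$1" status=0 split
    for split in train valid test; do
        if [ ! -s "$dir/wiki.$split.tokens" ]; then
            echo "missing or empty: $dir/wiki.$split.tokens" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible usage, run between the unzip and upload commands:
#   check_wikitext_dir wikitext-103 || echo "download looks incomplete"
```

If the check reports a missing file, rerunning the download and unzip commands before the upload avoids copying an incomplete dataset to the bucket.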

The next step is to create a post-installation script that AWS ParallelCluster runs when provisioning the instances. The script first configures NVIDIA NCCL to work with the Elastic Fabric Adapter (EFA) software that is already installed on the instances. NCCL is the communication library that PyTorch uses for GPU-to-GPU communication. For more information, refer to the Getting started with EFA and NCCL documentation.

The script also installs Miniconda3 and configures an environment with PyTorch and Fairseq on a shared filesystem. In the coming cluster configuration steps, you set up a gp2 Amazon EBS volume that is attached to the head node and shared through NFS, so that the compute nodes can mount it on /shared.

cat > post-install.sh << EOF
#!/bin/bash
export HOME=/home/ec2-user

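# start configuration of NCCL and EFA only if CUDA and EFA present
# NOTE: the paths below are assumed default install locations; adjust them for your AMI
CUDA_DIRECTORY=/usr/local/cuda
EFA_DIRECTORY=/opt/amazon/efa
OPENMPI_DIRECTORY=/opt/amazon/openmpi
if [ -d "\$CUDA_DIRECTORY" ] && [ -d "\$EFA_DIRECTORY" ]; then

    # installing NCCL (the path matches the clone location used below)
    NCCL_DIRECTORY=/home/ec2-user/nccl
    if [ ! -d "\$NCCL_DIRECTORY" ]; then
        echo "Installing NVIDIA nccl"
        cd /home/ec2-user
        git clone https://github.com/NVIDIA/nccl.git

        cd /home/ec2-user/nccl
        make -j src.build
    fi

    # installing aws-ofi-nccl (the path matches the clone location used below)
    AWS_OFI_DIRECTORY=/home/ec2-user/aws-ofi-nccl

    if [ ! -d "\$AWS_OFI_DIRECTORY" ]; then
        echo "Installing aws-ofi-nccl"
        cd /home/ec2-user
        git clone https://github.com/aws/aws-ofi-nccl.git -b aws
    fi

    # build from inside the repository; autogen.sh generates the configure script
    cd \$AWS_OFI_DIRECTORY
    ./autogen.sh
    ./configure --with-mpi=\$OPENMPI_DIRECTORY --with-libfabric=\$EFA_DIRECTORY --with-nccl=\$NCCL_DIRECTORY/build --with-cuda=\$CUDA_DIRECTORY
    sudo make install
fi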

# configuring the conda environment
cd /shared
# matches the Miniconda install prefix (/shared/.conda) used below
CONDA_DIRECTORY=/shared/.conda/bin

if [ ! -d "\$CONDA_DIRECTORY" ]; then
  # control will enter here if the conda directory does not exist
  echo "Conda installation not found. Installing..."
  wget -O miniconda.sh \
      "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" \
      && bash miniconda.sh -b -p /shared/.conda \
      && /shared/.conda/bin/conda init bash \
      && eval "\$(/shared/.conda/bin/conda shell.bash hook)" \
      && rm -rf miniconda.sh

  conda install python=3.6 -y
fi


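# NOTE: assumed location on the shared volume; adjust if you keep Fairseq elsewhere
FAIRSEQ_DIRECTORY=/shared/fairseq

if [ ! -d "\$FAIRSEQ_DIRECTORY" ]; then
    # control will enter here if the Fairseq directory does not exist
    echo "Fairseq repository not found. Installing..."
    git clone https://github.com/pytorch/fairseq.git \$FAIRSEQ_DIRECTORY

    pip install -e \$FAIRSEQ_DIRECTORY -U

    # pip install has no -y option, so none is passed here
    pip install boto3 torch tqdm
fi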

chown -R ec2-user:ec2-user /lustre
chown -R ec2-user:ec2-user /shared

sudo -u ec2-user /shared/.conda/bin/conda init bash

EOF

# upload to your bucket
aws s3 cp post-install.sh s3://mlbucket-${BUCKET_POSTFIX}/post-install.sh

# delete local copies
rm -rf post-install.sh
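
A post-installation script with a syntax error, such as an unclosed if block, only fails once the instances are being provisioned. As a hedged sketch, the generated file can be parsed with `bash -n` before the upload and cleanup commands above are run. `check_script_syntax` is a hypothetical helper name, not something the workshop or ParallelCluster provides.

```shell
# Hypothetical helper: parse a shell script without executing it.
# Returns 2 if the file is missing or empty, otherwise the exit
# status of bash -n (0 when the script parses cleanly).
check_script_syntax() {
    local script="$1"
    if [ ! -s "$script" ]; then
        echo "error: $script is missing or empty" >&2
        return 2
    fi
    # -n makes bash read and parse the commands but not run them
    bash -n "$script"
}

# Possible usage, run before uploading post-install.sh:
#   check_script_syntax post-install.sh && echo "syntax OK"
```

This only catches parse errors. It does not verify that the paths or packages referenced by the script exist on the cluster instances.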

Before continuing, check the contents of your bucket, either with the AWS CLI command aws s3 ls s3://mlbucket-${BUCKET_POSTFIX} or in the Amazon S3 console.
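
The check can also be scripted. The sketch below assumes the listing comes from `aws s3 ls` with the `--recursive` option and does a simple substring match on each expected key. `report_missing_keys` is a hypothetical helper, and the keys in the usage comment are examples based on the uploads in this step.

```shell
# Hypothetical helper: read a bucket listing on stdin and print every
# expected key (passed as arguments) that does not appear in it.
# Returns non-zero if at least one key is missing.
report_missing_keys() {
    local listing status=0 key
    listing="$(cat)"
    for key in "$@"; do
        # fixed-string substring match against the whole listing
        if ! printf '%s\n' "$listing" | grep -qF -- "$key"; then
            echo "missing: $key"
            status=1
        fi
    done
    return "$status"
}

# Possible usage (requires AWS credentials and the bucket created earlier):
#   aws s3 ls "s3://mlbucket-${BUCKET_POSTFIX}" --recursive \
#       | report_missing_keys post-install.sh wikitext-103/
```

An empty output with a zero exit status suggests that both the script and the dataset folder reached the bucket.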

Next, define the configuration of the ML cluster by creating the AWS ParallelCluster configuration file.