Model Training

While a Jupyter notebook is good for interactive model training, you may want to package the training code as a Docker image and run it in an Amazon EKS cluster.

This chapter explains how to train a model on the Fashion-MNIST dataset using TensorFlow and Keras on Amazon EKS. The dataset contains 70,000 grayscale images in 10 categories and is meant to be a drop-in replacement for MNIST.

Docker image

We will use a pre-built Docker image, seedjeffwan/mnist_tensorflow_keras:1.13.1, for this exercise. This image uses tensorflow/tensorflow:1.13.1 as its base image. It contains the training code, downloads the training and test datasets, and stores the generated model in an S3 bucket.
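If you would like to pull the image and inspect it locally before running it on the cluster, the standard Docker CLI commands below should work (the --format template shown here is only one way to peek at what the image runs):

docker pull seedjeffwan/mnist_tensorflow_keras:1.13.1
docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' seedjeffwan/mnist_tensorflow_keras:1.13.1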

Alternatively, you can build the image from the Dockerfile with the command below. We will skip this step for now.

docker build -t <dockerhub_username>/<repo_name>:<tag_name> .
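If you do build your own image, you would typically also push it to a registry such as Docker Hub so the EKS worker nodes can pull it (using the same placeholders as above):

docker push <dockerhub_username>/<repo_name>:<tag_name>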

Create S3 bucket

Create an S3 bucket where the trained model will be saved:

export HASH=$(< /dev/urandom tr -dc a-z0-9 | head -c6)
export S3_BUCKET=$HASH-eks-ml-data
aws s3 mb s3://$S3_BUCKET --region $AWS_REGION
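To confirm that the bucket was created, you can list your buckets and filter on the generated name:

aws s3 ls | grep $S3_BUCKET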

This bucket name will be used in the pod specification later. The same bucket is also used for serving the model.

If you want to use an existing bucket in a different region, make sure to specify that exact region as the value of the AWS_REGION environment variable in mnist-training.yaml.
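For example, assuming mnist-training.yaml picks these values up as placeholders through the envsubst step shown later in this chapter, you could point at an existing bucket like this (us-west-2 is only an example region, and <existing_bucket_name> is a placeholder):

export AWS_REGION=us-west-2
export S3_BUCKET=<existing_bucket_name>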

Set up AWS credentials in the EKS cluster

AWS credentials are required to save the model to the S3 bucket. These credentials are stored in the EKS cluster as a Kubernetes secret.

Create an IAM user ‘s3user’, attach an S3 access policy, and retrieve its access keys:

aws iam create-user --user-name s3user
aws iam attach-user-policy --user-name s3user --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam create-access-key --user-name s3user > /tmp/create_output.json
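Before moving on, you can confirm that the policy is attached to the new user:

aws iam list-attached-user-policies --user-name s3user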

Next, record the new user’s credentials in environment variables, base64-encoded as required by the data fields of a Kubernetes secret:

export AWS_ACCESS_KEY_ID_VALUE=$(jq -j .AccessKey.AccessKeyId /tmp/create_output.json | base64)
export AWS_SECRET_ACCESS_KEY_VALUE=$(jq -j .AccessKey.SecretAccessKey /tmp/create_output.json | base64)
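As a quick sanity check, you can decode the values back and compare them with the keys in /tmp/create_output.json (the --decode flag is widely supported, though some platforms only accept the short -d or -D form):

echo -n $AWS_ACCESS_KEY_ID_VALUE | base64 --decode
echo -n $AWS_SECRET_ACCESS_KEY_VALUE | base64 --decode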

Apply the secret to the EKS cluster:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
type: Opaque
data:
  AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID_VALUE
  AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY_VALUE
EOF
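To verify that the secret was created without printing the decoded values, describe it:

kubectl describe secret aws-secret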

Run training using a pod

Create pod:

curl -LO https://eksworkshop.com/advanced/420_kubeflow/kubeflow.files/mnist-training.yaml
envsubst < mnist-training.yaml | kubectl create -f -
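If you prefer to review the manifest with the substituted values before creating the pod, you can render it to a file first (the /tmp path is just an example) and create the pod from that file instead:

envsubst < mnist-training.yaml > /tmp/mnist-training.rendered.yaml
kubectl create -f /tmp/mnist-training.rendered.yaml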

This starts a pod that runs the training and saves the generated model to the S3 bucket. Check the status:

kubectl get pods

You’ll see output similar to this:

NAME              READY   STATUS    RESTARTS   AGE
mnist-training    1/1     Running   0          2m45s
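To follow the training progress, you can stream the pod’s logs using the pod name from the output above:

kubectl logs -f mnist-training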

Note: If your mnist-training pod fails for some reason, copy our trained model to your bucket by running the following command. This will unblock the inference exercise in the next chapter.

aws s3 sync s3://reinvent-opn401/mnist/tf_saved_model  s3://$S3_BUCKET/mnist/tf_saved_model

The last line of the training logs shows that the exported model has been saved to the S3 bucket.
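Whether the model came from your own training run or from the fallback sync above, you can confirm that the exported files are in your bucket:

aws s3 ls s3://$S3_BUCKET/mnist/tf_saved_model/ --recursive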