DeepRacer-for-Cloud v5.2.2 now available with new real-time training metrics

[Image: Grafana panel]

DeepRacer-for-Cloud provides a great way for developers to train DeepRacer models on EC2 (or other cloud compute instances, or even local servers). However, many users have noticed that, unlike the official AWS console, it didn’t provide a friendly web UI showing the current state of training.

While there are some fantastic log analysis notebooks available, these can be a little tricky to set up and often require reloading large amounts of log data to get a refreshed view of the metrics.

DeepRacer-for-Cloud v5.2.2 is now available and adds an exciting new feature: real-time metrics visualisation using Grafana.

Update – This new functionality has now also been added to DeepRacer-on-the-Spot.

Under the hood, this adds three new containers: one each for Telegraf, InfluxDB, and Grafana.

The Robomaker simulation workers send the training metrics to Telegraf, which aggregates and stores them in the InfluxDB time-series database. Grafana provides a presentation layer for interactive dashboards.
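To make the data flow more concrete, here is a minimal sketch of what a single metric sample could look like on its way into Telegraf. It assumes Telegraf accepts InfluxDB line protocol (measurement, optional tags, then fields), which this post does not confirm, and the measurement, tag, and field names are hypothetical rather than what Robomaker actually emits.

```shell
# Build one InfluxDB line-protocol point: measurement[,tags] fields
# No timestamp is appended, so the receiver would assign one on arrival.
# All names passed in are illustrative, not DRfC's real metric names.
line_protocol_point() {
  local measurement="$1" tags="$2" fields="$3"
  if [ -n "$tags" ]; then
    printf '%s,%s %s\n' "$measurement" "$tags" "$fields"
  else
    printf '%s %s\n' "$measurement" "$fields"
  fi
}

# Example (hypothetical names):
#   line_protocol_point dr_example worker=0 reward=12.5,progress=40
```

If the Telegraf port happened to be published to the host over UDP (an assumption), bash could send such a point by redirecting the output to /dev/udp/localhost/8092. In normal use you never need to do this, because the Robomaker workers send the metrics for you.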

[Image: data flow from Telegraf to InfluxDB to Grafana]

Getting started

To use this new feature you will need v5.2.2 of DeepRacer-for-Cloud and the v5.2.2 Robomaker container image.

Updating DeepRacer-for-Cloud

If you’re installing DRfC for the first time, it should download the correct image and templates automatically. If you’re upgrading an existing install, you’ll need to follow a few steps:

If you installed DRfC the recommended way, by cloning the GitHub repo, run git pull on the master branch to fetch the latest updates.

To enable real-time metrics you need to add two additional lines to your system.env file:

DR_TELEGRAF_HOST=telegraf
DR_TELEGRAF_PORT=8092

In almost all cases you can paste these in without modifying the values, because the hostname refers to the telegraf container running inside Docker.

On a fresh install these lines are already present in system.env but commented out, so you only need to uncomment them.
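Both cases (lines missing on an upgrade, lines commented out on a fresh install) can be handled by one small helper. The sketch below is not part of DRfC; ensure_env_line is a hypothetical function name, and it assumes system.env uses plain KEY=VALUE lines with # for comments.

```shell
# Hypothetical helper: make sure KEY=VALUE is active in an env file.
# - already active: leave the existing value alone
# - commented out:  strip the leading '#'
# - missing:        append a new line
ensure_env_line() {
  local file="$1" key="$2" value="$3"
  if grep -Eq "^[[:space:]]*${key}=" "$file"; then
    return 0
  elif grep -Eq "^[[:space:]]*#[[:space:]]*${key}=" "$file"; then
    sed -E "s/^[[:space:]]*#[[:space:]]*(${key}=.*)$/\1/" "$file" > "$file.tmp" \
      && mv "$file.tmp" "$file"
  else
    # Add a newline first if the file does not already end with one.
    if [ -n "$(tail -c1 "$file")" ]; then
      printf '\n' >> "$file"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example, run from the DRfC directory:
#   ensure_env_line system.env DR_TELEGRAF_HOST telegraf
#   ensure_env_line system.env DR_TELEGRAF_PORT 8092
```

Running it twice is harmless, since an active line is never duplicated or overwritten.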

Updating the Robomaker container image

First pull the updated container image from Docker Hub. Use the cpu or gpu tag as appropriate for your system.

docker pull awsdeepracercommunity/deepracer-robomaker:5.2.2-cpu

or

docker pull awsdeepracercommunity/deepracer-robomaker:5.2.2-gpu

Then update the DR_ROBOMAKER_IMAGE line in system.env to match the new image tag you just pulled.

DR_ROBOMAKER_IMAGE=5.2.2-cpu
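A mismatch between the tag in system.env and the image you pulled is an easy mistake to make, so a quick sanity check can help. The sketch below uses a hypothetical helper, robomaker_image_ref, that rebuilds the full image reference from system.env, assuming the variable holds only the tag as in the example above.

```shell
# Hypothetical helper: print the full Robomaker image reference implied by
# the DR_ROBOMAKER_IMAGE tag in an env file (defaults to ./system.env).
# Returns non-zero if the variable is missing or empty.
robomaker_image_ref() {
  local file="${1:-system.env}" tag
  tag=$(sed -n 's/^DR_ROBOMAKER_IMAGE=//p' "$file" | tail -n 1)
  [ -n "$tag" ] || return 1
  printf 'awsdeepracercommunity/deepracer-robomaker:%s\n' "$tag"
}

# Example: check that the referenced image has actually been pulled.
#   docker image inspect "$(robomaker_image_ref system.env)" > /dev/null \
#     && echo "image present" || echo "image missing, run docker pull"
```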

Starting the metrics stack

You can then start the metrics containers using dr-start-metrics. (You might need to log in again or reload your shell to pick up the new changes in bin/activate.sh.)

This will start the three new containers. The first time the metrics stack starts, Grafana needs to run some database migrations, which can take 30-60 seconds before the web UI is available.
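Rather than guessing when the migrations have finished, you can poll Grafana until it responds. The sketch below is one way to do that; wait_for_http is a hypothetical helper, and it assumes curl is installed and that Grafana is reachable on port 3000 of the local machine. Grafana exposes a health check at /api/health.

```shell
# Hypothetical helper: poll a URL until it returns a successful HTTP status.
# Usage: wait_for_http URL [TRIES] [DELAY_SECONDS]
# Returns 0 as soon as a request succeeds, 1 if every attempt fails.
wait_for_http() {
  local url="$1" tries="${2:-30}" delay="${3:-5}" i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      return 0
    fi
    # Only pause if another attempt is still to come.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: wait up to roughly two minutes after starting the stack.
#   dr-start-metrics
#   wait_for_http http://localhost:3000/api/health 24 5 && echo "Grafana is up"
```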

Collecting metrics

As long as the two Telegraf lines have been added to system.env and you have v5.2.2 of the Robomaker container, all you have to do is start training normally and the metrics will be collected automatically.

Using the dashboards

Once the metrics stack is running you should be able to access the Grafana web UI on port 3000 (e.g. http://localhost:3000 if running locally).

Grafana initially starts with an admin user provisioned (username admin, password admin). It will prompt you to choose a new password the first time you log in, so you should do this right away.

A template dashboard is provided to show how to access basic DeepRacer training metrics. You can use this dashboard as a base to build your own more customised dashboards.

After connecting to the Grafana web UI with a browser, use the menu to browse to the Dashboards section.

[Image: Grafana dashboards screenshot]

The template dashboard called DeepRacer Training template should be visible, showing graphs of reward, progress, and completed lap times.

[Image: graph panels with data]

As this is an automatically provisioned dashboard, you cannot save changes to it. However, you can copy it by clicking the small cog icon to open the dashboard settings page and then clicking Save as to make an editable copy.

Grafana dashboards are interactive – you can hover over data points to see more details, and you can click and drag on a graph panel to zoom in.

You can also change the time range using the selector box on the top right, and select an auto-refresh period from the selector next to it.

A full user guide on how to work with dashboards is available on the Grafana website.

Currently we record metrics for training and evaluation sessions, such as reward, progress, and average and best lap times, but in the future we’ll be adding even more metrics and dashboards.
