MLPerf
MLPerf is an open, industry-standard benchmark suite for measuring machine learning performance. It is maintained by the MLCommons consortium, whose members include NVIDIA, Google, Intel, and others. MLPerf is split into two major benchmark families, training and inference, plus specialized benchmarks for specific hardware and workloads. The original papers describing MLPerf can be found here: MLPerf Training and MLPerf Inference. They are worth reading to understand the key motivations behind these benchmarks, including what makes ML/DL workloads unique with respect to benchmarking.
MLPerf evaluates systems on standardized ML workloads across multiple domains, including:
- Image classification (e.g., ResNet-50)
- Object detection (e.g., SSD, Mask R-CNN)
- Language modeling (e.g., BERT)
- Recommendation systems (e.g., DLRM)
- Speech recognition (e.g., RNN-T)
The full list of training benchmarks can be found here, and the inference benchmarks here.
One can use MLPerf results to: (i) compare hardware (e.g., GPU vs. CPU); (ii) tune software stacks (CUDA, PyTorch, TensorFlow); and (iii) validate scaling behavior and deployment efficiency.
Testing on a single VM
Training
Example: Single Stage Detector.
In this example we assume an Azure VM with SKU Standard_NC40ads_H100_v5 and image almalinux:almalinux-hpc:8_10-hpc-gen2:latest. Once the machine is provisioned, check that the GPUs and CUDA driver are detected by running: nvidia-smi.
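A minimal sketch of that sanity check, assuming the HPC image ships nvidia-smi on the PATH (the fallback message is ours, not from any tool):

```shell
# Confirm GPUs and the NVIDIA driver are detected before downloading data.
check_gpu() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # one CSV line per detected GPU
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
  else
    echo "nvidia-smi not found; NVIDIA driver missing or not on PATH"
  fi
}
check_gpu
```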
#!/bin/bash
# script to do all setup, data download, and training
# based on https://github.com/mlcommons/training/tree/master/single_stage_detector
export DATADIR="/datadrive"
export MYDATA="$DATADIR/mydata"
# allow running docker without sudo
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world  # sanity check
# move Docker's data root to the larger disk (create daemon.json if it does not exist yet)
sudo mkdir -p /etc/docker
[ -f /etc/docker/daemon.json ] || echo '{}' | sudo tee /etc/docker/daemon.json >/dev/null
sudo jq '. + {"data-root": "/mnt/docker"}' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp >/dev/null && sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json && sudo systemctl restart docker
docker info | grep 'Docker Root Dir'
cd $DATADIR
git clone https://github.com/mlcommons/training.git
# alternative for faster install (with possible side effects, e.g. installing for other Python versions)
# pip3 install --prefer-binary opencv-python-headless
# sed -i.bak -E 's/load_zoo_dataset\(\s*name="([^"]+)"\s*,/load_zoo_dataset("\1",/' fiftyone_openimages.py
pip3 install fiftyone
python -m pip show fiftyone
sudo ln -s $(which python3) /usr/local/bin/python  # expose python3 as "python" for the download scripts
cd $DATADIR/training/single_stage_detector/scripts
echo "Downloading dataset... this will take several minutes"
./download_openimages_mlperf.sh -d $MYDATA
# prepare docker
cd $DATADIR/training/single_stage_detector/
docker build -t mlperf/single_stage_detector .
docker run --rm -it --gpus=all --ipc=host -v $MYDATA:/datasets/open-images-v6-mlperf mlperf/single_stage_detector bash
# inside the container:
# apt-get update ; apt-get install vim -y
# sed -i '0,/\${SLURM_LOCALID-}/s//${SLURM_LOCALID:-0}/' run_and_time.sh
# update config_DGXA100_001x08x032.sh
# DGXNGPU=1, DGXSOCKETCORES=20, DGXNSOCKET=1, DGXHT=2
# source config_DGXA100_001x08x032.sh
# -----------------------------------------------------------
# conda create -n torch212 python=3.10
# conda init bash
# source ~/.bashrc
# conda activate torch212
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
#
# fix: coco_eval.py
# # Old line:
# import torch._six
# #Replace with
# import collections.abc as container_abcs
#
# pip3 install mlperf_logging
# pip3 install pycocotools
# ./run_and_time.sh
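The manual coco_eval.py edit above (replacing the removed torch._six import) can also be applied non-interactively. A sketch, where the file path is an assumption to adjust to wherever coco_eval.py lives inside the container:

```shell
# Apply the torch._six fix described above without opening an editor.
fix_coco_eval() {
  local f="$1"
  # keep a .bak copy; swap the old import for the collections.abc one
  sed -i.bak 's/^import torch\._six$/import collections.abc as container_abcs/' "$f"
}
```

Usage: `fix_coco_eval coco_eval.py` from the directory containing the file.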
Once the run script starts, the output should look like:
Epoch: [0] [    0/36571] eta: 1 day, 20:19:44 lr: 0.000000 loss: 2.2699 (2.2699) classification: 1.5590 (1.5590) bbox_regression: 0.7109 (0.7109) time: 4.3637 data: 2.2989 max mem: 51676
Epoch: [0] [   20/36571] eta: 6:17:02 lr: 0.000000 loss: 2.1944 (2.2521) classification: 1.4886 (1.5371) bbox_regression: 0.7036 (0.7150) time: 0.4317 data: 0.0003 max mem: 52125
Epoch: [0] [   40/36571] eta: 5:20:59 lr: 0.000000 loss: 2.1934 (2.2440) classification: 1.4949 (1.5292) bbox_regression: 0.6956 (0.7148) time: 0.4309 data: 0.0003 max mem: 52125
Epoch: [0] [   60/36571] eta: 5:03:05 lr: 0.000000 loss: 2.2322 (2.2630) classification: 1.5102 (1.5478) bbox_regression: 0.7024 (0.7151) time: 0.4384 data: 0.0004 max mem: 52125
...
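To sanity-check a run, it can be handy to pull the loss values out of this log. A small sketch, with field positions inferred from the sample lines above (this helper is ours, not part of the MLPerf scripts):

```shell
# Print the instantaneous loss from each progress line of run_and_time.sh output.
parse_loss() {
  awk '/loss:/ {
    for (i = 1; i <= NF; i++) {
      if ($i == "loss:") print $(i + 1)  # value right after the "loss:" token
    }
  }' "$@"
}
```

Usage: `parse_loss train.log` (or pipe the running output through it) to feed a quick plot or tail the trend.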
Multiple VMs
Similar to the single-VM case, but a cluster needs to be provisioned and the SLURM-related parameters need to be adjusted. See details HERE.
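The cluster shape is typically encoded in the same DGX* variables edited in the single-VM config above. An illustrative (not tuned) two-node override might look like the following; the variable names mirror the config_DGX*.sh convention, and the values are placeholders to adapt to the actual SKU:

```shell
# Illustrative multi-node overrides; values are placeholders, not tuned settings.
export DGXNNODES=2        # number of VMs in the cluster
export DGXNGPU=8          # GPUs per node
export DGXSOCKETCORES=20  # CPU cores per socket
export DGXNSOCKET=1       # sockets per node
export DGXHT=2            # hyperthreads per core
```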