MLPerf
MLPerf is an open, industry-standard benchmark suite for measuring machine learning performance. It is maintained by the MLCommons consortium, whose members include NVIDIA, Google, Intel, and others. MLPerf is split into two major benchmark families, training and inference, plus specialized benchmarks for specific hardware and workloads. The original papers describing MLPerf can be found here: MLPerf Training and MLPerf Inference. They are worth reading to understand the key motivations behind these benchmarks, including what makes ML/DL workloads unique with respect to benchmarking.
MLPerf evaluates systems on standardized ML workloads across multiple domains, including:
- Image classification (e.g., ResNet-50)
- Object detection (e.g., SSD, Mask R-CNN)
- Language modeling (e.g., BERT)
- Recommendation systems (e.g., DLRM)
- Speech recognition (e.g., RNN-T)
The full list of training benchmarks can be found here, and the inference benchmarks here.
One can use MLPerf results to: (i) compare hardware (e.g., GPU vs. CPU); (ii) tune software stacks (CUDA, PyTorch, TensorFlow); and (iii) validate scaling behavior and deployment efficiency.
Testing on a single VM
Training
Example: Single Stage Detector.
In this example we assume an Azure VM with SKU Standard_NC40ads_H100_v5 and image almalinux:almalinux-hpc:8_10-hpc-gen2:latest. Once the machine is provisioned, check that the GPUs and CUDA driver are detected by running: nvidia-smi.
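A minimal sketch of that sanity check, assuming the HPC image ships nvidia-smi on the PATH (the fallback message is ours, not from any tool):

```shell
# Confirm GPUs and the NVIDIA driver are detected before downloading data.
check_gpu() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # one CSV line per detected GPU
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
  else
    echo "nvidia-smi not found; NVIDIA driver missing or not on PATH"
  fi
}
check_gpu
```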
#!/bin/bash
# script to do all setup, data download, and training
# based on https://github.com/mlcommons/training/tree/master/single_stage_detector
export DATADIR="/datadrive"
export MYDATA="$DATADIR/mydata"
# allow running docker without sudo
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world  # sanity check
# move Docker's data root to the larger disk (create daemon.json if it does not exist yet)
sudo mkdir -p /etc/docker
[ -f /etc/docker/daemon.json ] || echo '{}' | sudo tee /etc/docker/daemon.json >/dev/null
sudo jq '. + {"data-root": "/mnt/docker"}' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.tmp >/dev/null && sudo mv /etc/docker/daemon.json.tmp /etc/docker/daemon.json && sudo systemctl restart docker
docker info | grep 'Docker Root Dir'
cd $DATADIR
git clone https://github.com/mlcommons/training.git
# alternative for faster install (with possible side effects, e.g. installing for other Python versions)
# pip3 install --prefer-binary opencv-python-headless
# sed -i.bak -E 's/load_zoo_dataset\(\s*name="([^"]+)"\s*,/load_zoo_dataset("\1",/' fiftyone_openimages.py
pip3 install fiftyone
python -m pip show fiftyone
sudo ln -s $(which python3) /usr/local/bin/python  # expose python3 as "python" for the download scripts
cd $DATADIR/training/single_stage_detector/scripts
echo "Downloading dataset... this will take several minutes"
./download_openimages_mlperf.sh -d $MYDATA
# prepare docker
cd $DATADIR/training/single_stage_detector/
docker build -t mlperf/single_stage_detector .
docker run --rm -it --gpus=all --ipc=host -v $MYDATA:/datasets/open-images-v6-mlperf mlperf/single_stage_detector bash
# inside the container:
# apt-get update ; apt-get install vim -y
# sed -i '0,/\${SLURM_LOCALID-}/s//${SLURM_LOCALID:-0}/' run_and_time.sh
# update config_DGXA100_001x08x032.sh
# DGXNGPU=1, DGXSOCKETCORES=20, DGXNSOCKET=1, DGXHT=2
# source config_DGXA100_001x08x032.sh
# -----------------------------------------------------------
# conda create -n torch212 python=3.10
# conda init bash
# source ~/.bashrc
# conda activate torch212
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
#
# fix: coco_eval.py
# # Old line:
# import torch._six
# #Replace with
# import collections.abc as container_abcs
#
# pip3 install mlperf_logging
# pip3 install pycocotools
# ./run_and_time.sh
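The manual coco_eval.py edit above (replacing the removed torch._six import) can also be applied non-interactively. A sketch, where the file path is an assumption to adjust to wherever coco_eval.py lives inside the container:

```shell
# Apply the torch._six fix described above without opening an editor.
fix_coco_eval() {
  local f="$1"
  # keep a .bak copy; swap the old import for the collections.abc one
  sed -i.bak 's/^import torch\._six$/import collections.abc as container_abcs/' "$f"
}
```

Usage: `fix_coco_eval coco_eval.py` from the directory containing the file.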
Once the run script starts, the output should look like:
Epoch: [0] [    0/36571] eta: 1 day, 20:19:44 lr: 0.000000 loss: 2.2699 (2.2699) classification: 1.5590 (1.5590) bbox_regression: 0.7109 (0.7109) time: 4.3637 data: 2.2989 max mem: 51676
Epoch: [0] [   20/36571] eta: 6:17:02 lr: 0.000000 loss: 2.1944 (2.2521) classification: 1.4886 (1.5371) bbox_regression: 0.7036 (0.7150) time: 0.4317 data: 0.0003 max mem: 52125
Epoch: [0] [   40/36571] eta: 5:20:59 lr: 0.000000 loss: 2.1934 (2.2440) classification: 1.4949 (1.5292) bbox_regression: 0.6956 (0.7148) time: 0.4309 data: 0.0003 max mem: 52125
Epoch: [0] [   60/36571] eta: 5:03:05 lr: 0.000000 loss: 2.2322 (2.2630) classification: 1.5102 (1.5478) bbox_regression: 0.7024 (0.7151) time: 0.4384 data: 0.0004 max mem: 52125
...
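To sanity-check a run, it can be handy to pull the loss values out of this log. A small sketch, with field positions inferred from the sample lines above (this helper is ours, not part of the MLPerf scripts):

```shell
# Print the instantaneous loss from each progress line of run_and_time.sh output.
parse_loss() {
  awk '/loss:/ {
    for (i = 1; i <= NF; i++) {
      if ($i == "loss:") print $(i + 1)  # value right after the "loss:" token
    }
  }' "$@"
}
```

Usage: `parse_loss train.log` (or pipe the running output through it) to feed a quick plot or tail the trend.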
Multiple VMs
Similar to the single-VM case, but a cluster needs to be provisioned and the SLURM-related parameters need to be adjusted. See details HERE.
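The cluster shape is typically encoded in the same DGX* variables edited in the single-VM config above. An illustrative (not tuned) two-node override might look like the following; the variable names mirror the config_DGX*.sh convention, and the values are placeholders to adapt to the actual SKU:

```shell
# Illustrative multi-node overrides; values are placeholders, not tuned settings.
export DGXNNODES=2        # number of VMs in the cluster
export DGXNGPU=8          # GPUs per node
export DGXSOCKETCORES=20  # CPU cores per socket
export DGXNSOCKET=1       # sockets per node
export DGXHT=2            # hyperthreads per core
```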