
# GPU Metrics

This page lists commonly used GPU metrics at the cluster, node, and Pod levels.

## Cluster Level

| Metric Name | Description |
| ----------- | ----------- |
| Number of GPUs | Total number of GPUs in the cluster |
| Average GPU Utilization | Average compute utilization of all GPUs in the cluster |
| Average GPU Memory Utilization | Average memory utilization of all GPUs in the cluster |
| GPU Power | Power consumption of all GPUs in the cluster |
| GPU Temperature | Temperature of all GPUs in the cluster |
| GPU Utilization Details | 24-hour compute utilization details of all GPUs in the cluster (includes max, avg, current) |
| GPU Memory Usage Details | 24-hour memory usage details of all GPUs in the cluster (includes min, max, avg, current) |
| GPU Memory Bandwidth Utilization | Ratio of the current memory bandwidth to the GPU's peak memory bandwidth. For example, an NVIDIA V100 GPU has a peak memory bandwidth of 900 GB/s; if the current memory bandwidth is 450 GB/s, the utilization is 50% (see the sketch below this table) |
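
As a worked illustration of the bandwidth calculation above, the following minimal Python sketch divides a measured bandwidth sample by the GPU's peak bandwidth. The 900 GB/s figure is the V100 peak from the table; the measured value is a hypothetical sample.

```python
# Minimal sketch of the memory bandwidth utilization calculation described above.
# The measured value is hypothetical, e.g. a sample read from a monitoring system.

def memory_bw_utilization(measured_gb_per_s: float, peak_gb_per_s: float) -> float:
    """Return memory bandwidth utilization as a percentage."""
    return measured_gb_per_s / peak_gb_per_s * 100

V100_PEAK_GB_PER_S = 900   # peak memory bandwidth of an NVIDIA V100
measured = 450             # hypothetical current memory bandwidth

print(f"{memory_bw_utilization(measured, V100_PEAK_GB_PER_S):.0f}%")  # -> 50%
```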

## Node Level

| Metric Name | Description |
| ----------- | ----------- |
| GPU Mode | Usage mode of the GPUs on the node: full-card mode, MIG mode, or vGPU mode |
| Number of Physical GPUs | Total number of physical GPUs on the node |
| Number of Virtual GPUs | Number of vGPU devices created on the node |
| Number of MIG Instances | Number of MIG instances created on the node |
| GPU Memory Allocation Rate | Memory allocation rate of all GPUs on the node |
| Average GPU Utilization | Average compute utilization of all GPUs on the node |
| Average GPU Memory Utilization | Average memory utilization of all GPUs on the node |
| GPU Driver Version | Driver version of the GPUs on the node |
| GPU Utilization Details | 24-hour compute utilization details of each GPU on the node (includes max, avg, current) |
| GPU Memory Usage Details | 24-hour memory usage details of each GPU on the node (includes min, max, avg, current) |
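
The node-level averages above are simple means over the per-GPU samples on a node. A minimal sketch of that aggregation, assuming per-GPU utilization and memory figures (the sample values below are hypothetical) have already been collected:

```python
# Minimal sketch of node-level aggregation over per-GPU samples (hypothetical values).

gpus = [
    {"util_pct": 85, "mem_used_gib": 30, "mem_total_gib": 40},
    {"util_pct": 40, "mem_used_gib": 10, "mem_total_gib": 40},
]

avg_gpu_util = sum(g["util_pct"] for g in gpus) / len(gpus)
avg_mem_util = sum(g["mem_used_gib"] / g["mem_total_gib"] for g in gpus) / len(gpus) * 100

print(f"Average GPU Utilization: {avg_gpu_util:.1f}%")         # 62.5%
print(f"Average GPU Memory Utilization: {avg_mem_util:.1f}%")  # 50.0%
```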

## Pod Level

| Category | Metric Name | Description |
| -------- | ----------- | ----------- |
| Application Overview GPU - Compute & Memory | Pod GPU Utilization | Compute utilization of the GPUs used by the current Pod |
| | Pod GPU Memory Utilization | Memory utilization of the GPUs used by the current Pod |
| | Pod GPU Memory Usage | Memory usage of the GPUs used by the current Pod |
| | Memory Allocation | Memory allocation of the GPUs used by the current Pod |
| | Pod GPU Memory Copy Ratio | Memory copy ratio of the GPUs used by the current Pod |
| GPU - Engine Overview | GPU Graphics Engine Activity Percentage | Percentage of time the Graphics or Compute engine is active during a monitoring cycle |
| | GPU Memory Bandwidth Utilization | Fraction of cycles during which data is sent to or received from device memory, averaged over the monitoring interval rather than an instantaneous value; a higher value means heavier use of device memory. A value of 1 (100%) means a DRAM instruction executes every cycle of the interval (in practice, a peak of about 0.8, i.e. 80%, is the maximum achievable); a value of 0.2 (20%) means 20% of the cycles in the interval are spent reading from or writing to device memory. See the sketch after this table |
| | Tensor Core Utilization | Percentage of time the Tensor Core pipeline is active during a monitoring cycle |
| | FP16 Engine Utilization | Percentage of time the FP16 pipeline is active during a monitoring cycle |
| | FP32 Engine Utilization | Percentage of time the FP32 pipeline is active during a monitoring cycle |
| | FP64 Engine Utilization | Percentage of time the FP64 pipeline is active during a monitoring cycle |
| | GPU Decode Utilization | Decode engine utilization of the GPU |
| | GPU Encode Utilization | Encode engine utilization of the GPU |
| GPU - Temperature & Power | GPU Temperature | Temperature of the GPUs used by the current Pod |
| | GPU Power | Power consumption of the GPUs used by the current Pod |
| | GPU Total Power Consumption | Total power consumption of the GPUs |
| GPU - Clock | GPU Memory Clock | Memory clock frequency |
| | GPU Application SM Clock | Application SM clock frequency |
| | GPU Application Memory Clock | Application memory clock frequency |
| | GPU Video Engine Clock | Video engine clock frequency |
| | GPU Throttle Reasons | Reasons for GPU clock throttling |
| GPU - Other Details | PCIe Transfer Rate | Data transfer rate of the GPU over the PCIe bus |
| | PCIe Receive Rate | Data receive rate of the GPU over the PCIe bus |
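
To make the memory bandwidth utilization description above concrete, the sketch below interprets a raw sample (the fraction of cycles with DRAM activity, averaged over the interval) both as a percentage and relative to the roughly 0.8 practical ceiling mentioned in the table. The sample value and the helper name are illustrative, not part of any monitoring API.

```python
# Minimal sketch of reading a memory bandwidth utilization sample (hypothetical value).

PRACTICAL_PEAK = 0.8  # ~80% is the realistic maximum per the description above

def describe_dram_active(fraction: float) -> str:
    """Describe what a DRAM-active fraction means for the monitored interval."""
    pct = fraction * 100
    pct_of_peak = fraction / PRACTICAL_PEAK * 100
    return (f"{pct:.0f}% of cycles touched device memory "
            f"({pct_of_peak:.0f}% of the practical peak)")

print(describe_dram_active(0.2))
# -> 20% of cycles touched device memory (25% of the practical peak)
```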
