The world of hardware benchmarking has been evolving for decades, with numerous tests, metrics, and benchmarks developed to measure the performance of different hardware components. With the rise of artificial intelligence (AI), however, benchmarking has taken on a new level of importance. AI is reshaping nearly every industry, from healthcare to finance, transportation to retail, automation to security. But making sense of AI performance requires a new approach to benchmarking, one tailored to the specific demands AI workloads place on hardware.
Why is benchmarking AI hardware different from benchmarking traditional computing hardware? Simply put, AI is a fundamentally different kind of workload from traditional sequential or parallel computing. AI models learn from massive quantities of data and make predictions based on that data. Training them requires high-speed data processing and storage, large amounts of memory, and efficient use of processor resources, all while keeping power consumption in check.
To meet these requirements, hardware developers have been hard at work building specialized architectures to accelerate AI workloads. These include GPUs from Nvidia, Google’s TPUs, FPGAs from Xilinx, and custom accelerators such as Intel’s Nervana chips. But how can we compare the performance of these specialized AI hardware architectures?
AI-specific benchmarking metrics
One of the most well-known benchmarks for AI workloads is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where competitors submitted models that classify images into 1,000 object categories, trained on a dataset of over 1.2 million labeled images. The challenge was first run in 2010 and was held annually through 2017.
But as AI applications have become more complex, the ILSVRC benchmark alone is no longer sufficient. Different AI applications require different metrics to measure their performance. Here are a few examples:
– TensorFlow benchmark: TensorFlow is a popular open-source software library used to build and train deep learning models. The TensorFlow benchmarks measure how quickly hardware can perform common operations used in TensorFlow models, such as convolution and matrix multiplication (a minimal timing sketch follows this list).
– ResNet benchmark: ResNet is a common architecture for computer vision models, a family of deep convolutional neural networks built around residual (skip) connections. The ResNet benchmark measures the accuracy and speed of hardware running a ResNet-based visual recognition model.
– Reinforcement Learning (RL) benchmarks: RL is a form of machine learning where agents learn by interacting with an environment and receiving rewards or penalties. RL benchmarks measure the time required for hardware to train and execute RL models.
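To make the op-level idea concrete, here is a minimal sketch of the kind of micro-benchmark the TensorFlow item describes: timing a large matrix multiplication and converting the elapsed time into achieved throughput. The matrix size, iteration count, and warm-up step are illustrative choices, not part of any official benchmark suite.

```python
# Minimal sketch of an op-level micro-benchmark: time a large matrix
# multiplication in TensorFlow and report achieved throughput.
import time
import tensorflow as tf

N = 4096                        # illustrative matrix dimension
a = tf.random.normal([N, N])
b = tf.random.normal([N, N])

# Warm-up run so one-time setup (kernel compilation, memory allocation)
# is not counted in the measurement.
tf.matmul(a, b).numpy()

iters = 10
start = time.perf_counter()
for _ in range(iters):
    c = tf.matmul(a, b)
c.numpy()                       # force execution to finish before stopping the clock
elapsed = time.perf_counter() - start

flops = 2 * N**3 * iters        # roughly 2*N^3 floating-point operations per matmul
print(f"{elapsed / iters * 1e3:.1f} ms/iter, {flops / elapsed / 1e12:.2f} TFLOP/s")
```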
Each of these benchmarks provides insight into how hardware performs on different types of AI workloads. But there are other factors to consider when evaluating AI hardware performance.
Power efficiency
One of the key challenges in developing high-performance AI hardware is keeping power consumption in check. GPUs and other accelerators can consume a significant amount of power, which can limit their use in mobile devices and other applications where power efficiency is critical.
To evaluate power efficiency, benchmarkers can use metrics like performance-per-watt, which measures how much performance a hardware component delivers for each watt of power it consumes. Other power-related metrics include peak power consumption and idle power consumption.
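As a concrete illustration, here is a small sketch of how performance-per-watt might be computed from a benchmark run. The throughput figure and power samples are hypothetical placeholders; in practice they would come from the benchmark harness and a power meter or on-board telemetry.

```python
# Sketch: compute performance-per-watt from measured throughput and power samples.
throughput_images_per_s = 1250.0                  # hypothetical measured throughput
power_samples_w = [182.0, 190.5, 187.2, 185.9]    # hypothetical power readings (watts)

avg_power_w = sum(power_samples_w) / len(power_samples_w)
perf_per_watt = throughput_images_per_s / avg_power_w

print(f"average power: {avg_power_w:.1f} W")
print(f"performance-per-watt: {perf_per_watt:.2f} images/s per watt")
```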
Memory bandwidth
AI models require large amounts of memory to store data during training and inference. The rate at which data can be moved between memory and compute units such as the processor or accelerator is known as memory bandwidth.
High memory bandwidth is critical for achieving high performance on AI workloads, and different hardware architectures have different memory subsystems that can impact performance. Benchmarking metrics like memory bandwidth, memory latency, and cache hit rates can help evaluate the memory subsystem of different hardware components.
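To make the bandwidth metric concrete, here is a rough, host-side sketch in the spirit of STREAM-style copy benchmarks. The array size and iteration count are illustrative, and the result is only a coarse estimate of what a dedicated bandwidth benchmark would report.

```python
# Rough sketch of estimating effective host memory bandwidth via a large copy.
import time
import numpy as np

n = 256 * 1024 * 1024 // 8          # ~256 MB worth of float64 elements
src = np.ones(n, dtype=np.float64)
dst = np.empty_like(src)

iters = 10
start = time.perf_counter()
for _ in range(iters):
    np.copyto(dst, src)             # reads src and writes dst each iteration
elapsed = time.perf_counter() - start

bytes_moved = 2 * src.nbytes * iters  # each copy reads and writes the full array
print(f"effective bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")
```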
Real-world applications
Finally, it’s important to remember that AI hardware is only useful if it can be applied to real-world applications. Benchmarking hardware in isolation is useful to compare different architectures, but it doesn’t necessarily reflect how the hardware will perform in real-world scenarios.
To address this, benchmarkers can create benchmarks that simulate real-world applications or workloads. For example, the MLPerf suite (now maintained by MLCommons) builds on the idea of standardized tasks like ILSVRC but covers a much broader set of workloads and reporting rules. MLPerf includes benchmarks for image classification, object detection, natural language processing, recommendation, and more, spanning both training and inference, with the goal of providing a more comprehensive evaluation of hardware performance across a range of AI workloads.
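As a simplified illustration of application-level measurement (not MLPerf itself), here is a sketch of a loop that records per-request latency and overall throughput. The `run_inference` function is a hypothetical placeholder for whatever model or pipeline is being evaluated.

```python
# Sketch of an application-level measurement loop: run a workload repeatedly,
# record per-request latencies, and report throughput plus tail latency.
import time
import statistics

def run_inference(request):
    # Placeholder for a real model call (e.g. a classification or NLP model).
    time.sleep(0.005)
    return "label"

latencies = []
num_requests = 200
start = time.perf_counter()
for i in range(num_requests):
    t0 = time.perf_counter()
    run_inference(i)
    latencies.append(time.perf_counter() - t0)
total = time.perf_counter() - start

latencies.sort()
p99 = latencies[int(0.99 * (len(latencies) - 1))]   # rough 99th-percentile latency
print(f"throughput: {num_requests / total:.1f} req/s")
print(f"median latency: {statistics.median(latencies) * 1e3:.1f} ms, p99: {p99 * 1e3:.1f} ms")
```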
Conclusion
Benchmarking AI hardware is a complex task, requiring a range of metrics and benchmarks to evaluate both performance and power efficiency. While benchmarks like the ILSVRC have been useful for evaluating hardware on certain types of workloads in the past, new benchmarks are needed to evaluate the latest AI hardware architectures.
Whether you’re a hardware designer looking to optimize your product for AI workloads, or an end-user evaluating different hardware options for your application, understanding the nuances of AI hardware benchmarking is critical. By keeping these factors in mind, you can gain a better understanding of the performance characteristics of different hardware and optimize your AI applications for the best possible performance.