Demand for data center infrastructure has never been higher as enterprises race to adopt AI and launch new services. Training large language models and delivering LLM-powered services in real time are two distinct challenges.
NVIDIA platforms performed well across every data center test in the latest round of MLPerf Inference (v4.1) industry benchmarks. The first submission of the NVIDIA Blackwell platform delivered up to four times the performance of the NVIDIA H100 Tensor Core GPU, thanks to its second-generation Transformer Engine and FP4 Tensor Cores.
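To see why lower precision boosts throughput: FP4 stores each value in 4 bits, halving memory traffic relative to FP8 and quartering it relative to FP16. The sketch below simulates generic symmetric 4-bit quantization in Python. It is an illustration of the idea only, not NVIDIA's actual FP4 (E2M1) floating-point format or the Transformer Engine implementation.

```python
import numpy as np

def quantize_4bit(w):
    """Simulate symmetric 4-bit quantization: map weights to the 15
    integer levels in [-7, 7] plus one floating-point scale per tensor.
    Illustrative only; NVIDIA's FP4 uses a floating-point (E2M1) layout."""
    scale = np.abs(w).max() / 7.0                     # per-tensor scale factor
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_4bit(w)
# Each weight now needs 4 bits instead of 16, at the cost of rounding error.
print("max reconstruction error:", np.abs(w - dequantize(q, s)).max())
```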
The NVIDIA H200 Tensor Core GPU achieved outstanding results in every data center benchmark category, including the Mixtral 8x7B Mixture of Experts (MoE) LLM benchmark, which features 46.7 billion parameters, with 12.9 billion active per token.
MoE models are gaining popularity because they can answer a wide range of questions and perform more diverse tasks in a single deployment. They are also more efficient, since they activate only a few experts per inference, which means they can deliver results much faster than dense models of a similar size.
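The parameter arithmetic above follows from routing: only a small subset of Mixtral's experts processes each token, which is why roughly 12.9 billion of the 46.7 billion parameters are active. Below is a minimal top-2 routing layer in PyTorch to illustrate the mechanism; the class, sizes, and structure are hypothetical, not Mixtral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal mixture-of-experts layer: a router picks the top-k experts
    per token, so only a fraction of the parameters run for any token."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                             # x: (tokens, d_model)
        gate_logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = torch.topk(gate_logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
layer = TopKMoELayer()
print(layer(tokens).shape)  # torch.Size([4, 512]); only 2 of 8 experts ran per token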
As LLMs continue to grow, so does the volume of inference requests. Multi-GPU computing is essential to meet the real-time latency requirements of today's LLMs while serving as many users as possible. NVIDIA NVLink and NVSwitch provide high-bandwidth communication between GPUs based on the NVIDIA Hopper architecture, enabling real-time, cost-effective inference on large models. The Blackwell platform will extend NVLink Switch capabilities with larger NVLink domains of 72 GPUs.
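To make the role of that GPU-to-GPU fabric concrete, here is a minimal tensor-parallel sketch in PyTorch: each GPU holds a shard of a feed-forward layer's weights, and an NCCL all-reduce, which runs over NVLink/NVSwitch where available, sums the partial outputs. The layer sizes and script name are hypothetical, and this is a sketch of the general technique, not NVIDIA's serving stack.

```python
# Run on a multi-GPU node with: torchrun --nproc_per_node=<num_gpus> tp_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")       # NCCL uses NVLink/NVSwitch when present
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    d_model, d_ff = 4096, 16384
    # Each GPU holds 1/world of the up-projection columns and
    # the matching 1/world of the down-projection rows.
    w1 = torch.randn(d_model, d_ff // world, device="cuda")
    w2 = torch.randn(d_ff // world, d_model, device="cuda")

    x = torch.randn(8, d_model, device="cuda")
    partial = torch.relu(x @ w1) @ w2     # partial output from this GPU's shard
    dist.all_reduce(partial)              # sum partials: the NVLink-bound traffic
    if rank == 0:
        print("combined output:", partial.shape)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The lower the latency of that all-reduce, the smaller the communication tax on every token generated, which is why interconnect bandwidth matters for real-time serving.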
In addition to NVIDIA, 10 NVIDIA partner companies (ASUSTek, Cisco, Dell Technologies, Fujitsu, Giga Computing, Hewlett Packard Enterprise, Juniper Networks, Lenovo, Quanta Cloud Technology, and Supermicro) also submitted solid MLPerf Inference results, highlighting the broad availability of the NVIDIA platform.
Relentless Software Innovation
NVIDIA platforms are continuously updated with new features and performance improvements.
In the latest inference round, NVIDIA offerings, including the NVIDIA Jetson platform and NVIDIA Triton Inference Server, saw dramatic performance gains.
The NVIDIA H200 GPU delivered up to 27% higher generative AI inference performance than in the previous round, highlighting the added value customers receive over time from their investment in the NVIDIA platform.
Triton Inference Server is an open-source, fully featured inference server that is part of the NVIDIA AI platform and available with NVIDIA AI Enterprise. It consolidates framework-specific inference servers into a single unified platform, which lowers the total cost of ownership of AI models deployed in production and can cut model deployment time from months to minutes.
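Because Triton exposes one serving path regardless of the model's original framework, client code can stay the same across backends. Below is a minimal query against a running Triton server using the tritonclient Python package; the model name my_llm and the tensor names input_ids and logits are hypothetical placeholders that must match whatever is in your own model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a running Triton server (default HTTP port is 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# "my_llm", "input_ids", and "logits" are placeholder names; they must
# match the model and tensors defined in your model repository.
input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
inp = httpclient.InferInput("input_ids", list(input_ids.shape), "INT64")
inp.set_data_from_numpy(input_ids)

result = client.infer(model_name="my_llm", inputs=[inp])
logits = result.as_numpy("logits")
print(logits.shape)
```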
In this round, Triton Inference Server delivered performance nearly equal to NVIDIA's bare-metal submissions, showing that organizations no longer have to choose between a feature-rich AI inference server and peak performance; they can have both.
Going to the Edge
At the edge, generative AI can turn sensor data such as images and video into real-time, actionable insights with strong contextual awareness. The NVIDIA Jetson platform for edge AI and robotics is uniquely capable of running any kind of model locally, including LLMs, vision transformers, and Stable Diffusion.
In this round of the MLPerf benchmarks, NVIDIA Jetson AGX Orin system-on-modules achieved more than a 6.2x throughput improvement and a 2.4x latency improvement over the previous round on the GPT-J LLM workload. Developers can use this general-purpose, 6-billion-parameter model to interface seamlessly with human language at the edge.
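GPT-J is openly available, so the workload is easy to try in unoptimized form on any CUDA device with enough memory, including a Jetson AGX Orin. The sketch below uses generic Hugging Face Transformers APIs; it is not NVIDIA's optimized MLPerf submission stack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic Transformers usage; the MLPerf submission runs an optimized stack.
model_id = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16   # FP16 roughly halves memory vs. FP32
).to("cuda")

prompt = "Edge AI lets robots"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```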
Performance Leadership All Around
This round of MLPerf Inference demonstrated the versatility and leading performance of NVIDIA platforms, from the data center to the edge, across all benchmark workloads, supercharging the most innovative AI-powered applications and services.
CoreWeave, the first cloud service provider to announce general availability of NVIDIA H200 instances, and server makers ASUS, Dell Technologies, HPE, QCT, and Supermicro are now offering H200 GPU-powered systems.