substack.com

InferenceX v2: NVIDIA Blackwell Vs AMD vs Hopper - Formerly InferenceMAX

Brief

InferenceX v2 is SemiAnalysis’ expanded open-source benchmark suite for frontier LLM inference, aimed at measuring not just peak throughput but the full tradeoff curve between interactivity, throughput, cost, and energy efficiency across modern GPU systems. The new release covers all six recent NVIDIA western SKUs and all AMD western GPU SKUs from the past three years, using close to 1,000 GPUs per full sweep. Its headline result is that NVIDIA’s rack-scale Blackwell systems—especially GB200 and GB300 NVL72—massively outperform both Hopper and AMD once production-style inference techniques are enabled, particularly for mixture-of-experts models such as DeepSeek R1. SemiAnalysis argues that the real differentiator is not just chip FLOPS, but the ability to combine disaggregated prefill, wide expert parallelism, FP4 quantization, and mature multi-node serving software such as Dynamo and TensorRT-LLM. In that setup, GB200/GB300 NVL72 can deliver up to 98-100x the realized performance of strong H100 baselines and as much as 65x better tokens-per-dollar. The architectural reason is straightforward: NVL72 keeps 72 GPUs inside a single 900 GB/s-per-GPU NVLink domain, allowing expert-parallel all-to-all traffic and weight loading to stay on a far faster fabric than the 400-800 Gbit/s scale-out interconnect used between standard nodes.

AMD’s position is more nuanced. MI355X appears genuinely competitive in single-node FP8 and in FP8 disaggregated serving when compared specifically against NVIDIA running SGLang, and SemiAnalysis notes rapid improvement in AMD’s software stack over the prior two months, including ~2x gains in some DeepSeek R1 FP4 configurations and >20% throughput-per-GPU gains from MoRI in the 20-45 tok/s/user band. But the report’s core criticism is that AMD still lacks “composability”: isolated optimizations work, yet combining FP4, disaggregated serving, and wideEP causes performance to degrade sharply. That leaves MI355X far behind B200 once NVIDIA’s production stack, especially Dynamo + TRT-LLM, enters the picture. The report also critiques AMD’s upstream support model, noting MI355X still depends on an old forked vLLM 0.10.1 ROCm image while the official 0.15.1 image hard-fails, and citing insufficient CI hardware donations to projects like vLLM and SGLang.

The article is also valuable for its system-level economics. It shows why disaggregated prefill improves utilization by separating compute-heavy prefill from memory-bound decode, and why wideEP reduces redundant model weight loading across MoE deployments. It frames Anthropic’s “fast mode” as an inference scheduling decision rather than a hardware mystery: serving the same model at 2.5x higher tok/s/user naturally drives 6-12x higher cost per token because accelerator hourly cost is fixed while batching efficiency falls. Multi-token prediction is presented as the most powerful software lever in the current stack, often slashing cost per million tokens by multiples while preserving accuracy on benchmark checks like GSM8k. For anyone trying to understand who actually wins the AI infrastructure race, the main lesson is that rack-scale topology, interconnect bandwidth, software maturity, and inference orchestration now matter as much as the raw silicon.

Why it matters

SemiAnalysis’ InferenceX v2 benchmarks nearly 1,000 frontier GPUs across NVIDIA Hopper/Blackwell/Blackwell Ultra and AMD Instinct SKUs, adding the first third-party full-Pareto tests of GB300 NVL72/B300 and multi-node MI355X FP4/FP8 disaggregated serving with wide expert parallelism.

Key details

  • NVIDIA’s rack-scale Blackwell systems dominate state-of-the-art MoE inference: GB200/GB300 NVL72 delivered up to 98-100x higher realized performance than a strong H100 disagg+wideEP baseline at ~116 tok/s/user, and 9.7x to 65x better tokens-per-dollar than Hopper depending on interactivity.
  • AMD’s MI355X is competitive in narrower cases: on FP8 disaggregated serving with SGLang+MoRI it roughly matches B200 running SGLang, and in single-node FP8 serving MI355X often beats B200 on perf/TCO; however, it falls well behind when compared against NVIDIA’s more mature Dynamo+TensorRT-LLM stack.
  • The biggest AMD weakness is software composability: FP4 + disaggregated prefill + wide expert parallelism performs worse than theory predicts, and in some 1k/1k scenarios MI355X with MTP only barely beats B200 without MTP, while B200 with Dynamo TRT-LLM + MTP remains clearly ahead.
  • NVL72’s system architecture is a major advantage for MoE inference. Within-rack NVLink provides 900 GB/s per GPU unidirectional bandwidth across 72 GPUs, versus roughly 50-100 GB/s per GPU over InfiniBand/Ethernet outside the rack; at 60 tok/s/user, each GB200 NVL GPU produced just under 3x the tokens/s of each B200 GPU.
  • Multi-token prediction (MTP) is one of the highest-leverage optimizations in the suite: enabling MTP consistently improved throughput with little measured accuracy loss, cutting one DeepSeek R1 FP4 Dynamo TRT cost figure from $0.251 to $0.057 per million total tokens, and reducing a GB300 FP4 8k/1k workload at 150 tok/s/user from about $2.35 to $0.11 per million tokens.
Cleaned source text

title: InferenceX v2: NVIDIA Blackwell Vs AMD vs Hopper - Formerly InferenceMAX

author: SemiAnalysis

content_type: newsletter

publication: substack.com

published: 2026-02-16T17:13:11+00:00

source_url: gmail://19c6773afd463fae

word_count: 12387

The Artist Known as InferenceMAX. GB300 NVL72, MI355X, B200, H100, Disaggregated Serving, Wide Expert Parallelism, Large Mixture of Experts, SGLang, vLLM, TRTLLM



Dylan Patel, Cam Quilici, Bryan Shan, Alec Ibarra, Kimbo Chen, Daniel Nishball, and Cheang Kang Wen

Feb 16


Introduction

InferenceXv2 (formerly InferenceMAX) builds on the foundation established by InferenceMAXv1, our open-source, continuously updated inference benchmark that has set a new standard for AI inference performance and economics. InferenceMAXv1 moved beyond static, point-in-time benchmarks by running continuous tests across hundreds of chips and popular open-source frameworks. Free dashboard available here.

Our benchmark has been widely reproduced, validated and/or supported by almost every major buyer of compute from Google Cloud to Microsoft Azure to Oracle, OpenAI, and many more.

InferenceXv2 builds on this foundation. It expands coverage to include large scale DeepSeek MoE disaggregated inference (disagg prefill, or simply “disagg”) with wide expert parallelism (wideEP) optimization to all 6 NVIDIA western GPU SKUs from the past 4 years as well as to every single AMD western GPU SKU released in the past 3 years – in total InferenceXv2 utilizes close to 1000 frontier GPUs for a full benchmark run across all SKUs.

With today’s release, InferenceXv2 is now the first suite to benchmark the Blackwell Ultra GB300 NVL72 and B300 across the whole Pareto frontier curve, and it is the first third party benchmark to test disagg+wideEP multi-node FP4 and FP8 MI355X performance. In future iterations of InferenceX, we will continue to focus heavily on disaggregated serving with wide expert parallelism, as that is what is deployed in production at Frontier AI Labs like OpenAI, Anthropic, xAI, Google DeepMind, and DeepSeek, as well as advanced API providers like TogetherAI, Baseten, and Fireworks. In this article, we will also break down the system engineering principles and economics in play around the latest Claude Code Fast mode feature.

Our benchmark is completely open-source under Apache 2.0 – this means that we are able to move at the same rapid speed at which the AI software ecosystem is advancing. If you like our work and would like to show us some support, please drop a star on our GitHub! We also provide a free data visualizer at https://inferencex.com for everyone in the ML community to explore the complete dataset themselves.

Over the past 6 months, we have cleaned up a lot of tech debt and can now move fast on stable infrastructure, so we will add DeepSeekv4 and other popular Chinese frontier models with day 0 support. We will also be adding TPUv7 Ironwood and Trainium3 to InferenceX later this year! If you want to contribute to our impactful mission while earning competitive compensation, consider applying here.

Source: InferenceMAX GitHub

Key Observations and Results to Highlight

We see competitive perf per TCO results for FP8 MI355X disagg+wideEP SGLang compared to FP8 B200 disagg+wideEP SGLang, but when compared to the widely used Dynamo TRTLLM B200 FP8, TRT continues to framemog. It is amazing news that AMD’s SGLang disagg prefill+wideEP for FP8 is able to match NVIDIA’s SGLang performance.

We also see that for single node aggregated serving, AMD’s SGLang delivers better perf per TCO than NVIDIA’s SGLang for FP8. It is also great to see that AMD has deprecated their second class fork of vLLM to move further upstream and closer to delivering a first class experience. Stay tuned for our “State of AMD” article where we talk about the many areas where AMD’s pace of improvement has been rapid & also the areas where the pace of improvement has been lackluster. We recommend that NVIDIA focus even more on the SGLang & vLLM ecosystem in addition to their TRTLLM engine. Jensen needs to staff more resources & engineers towards contributing to open ecosystems like SGLang & vLLM.

SemiAnalysis InferenceX is free open source software and reader-supported. To receive new posts and support our work consider becoming a free or paid subscriber.


When it comes to the latest inference techniques that are used by the most prominent frontier large-scale inference services (such as disagg prefill+wideEP+FP4), Nvidia absolutely frame mogs with the B200, B300 and ASU frat leader, rack scale GB200/GB300 NVL72 across both SGLang and TRTLLM. Nvidia GPUs also dominate when it comes to energy efficiency, with much lower all-in provisioned picoJoules of energy per token across all workloads.

Turning to AMD, we find that the biggest issue with inference on their systems and using their software is _composability_. That is, many of AMD’s inference optimization implementations work well in isolation, but when combined with other optimizations, the result is not as competitive as one would expect. Specifically, the composability of disagg prefill, wideEP and FP4 inference optimizations needs significant improvement.

While performance is competitive on AMD when enabling just a subset of the SOTA inference optimizations, AMD’s performance is currently not competitive with Nvidia’s when all three major optimizations that labs use are enabled. We strongly recommend to AMD that they focus heavily on composability of different inference optimizations. We have been told that AMD will start focusing on software composability of FP4+distributed inferencing across their whole software stack. This will happen after Chinese New Year, as most of their disagg prefill+wideEP 10x inference engineers are based in China.

Nvidia’s GB300 NVL72 doesn’t disappoint. It achieves up to 100x the performance of even a strong H100 disagg+wideEP+MTP baseline when comparing H100 FP8 against GB300 FP4, and 65x on FP8 vs FP8. On H100 vs GB200 NVL72, we see up to 55x realized performance difference at 75 tok/s/user. Rack scale Blackwell NVL72 is framemogging Hopper and makes Hopper look like it is jestermaxxing. As Jensen said at GTC 2025, he is chief revenue destroyer.

At GTC 2024, Jensen claimed that Blackwell would deliver up to 30x perf on inference compared to H100. Jensen underpromised & overdelivered on Blackwell inference performance. This should curtail the instances of analysts cracking “Jensen Math” jokes for some time.

Source: SemiAnalysis InferenceX

Acknowledgments and InferenceX™ (formerly InferenceMAX) Initiative Supporters

We would like to thank Jensen Huang and Ian Buck for supporting this open-source effort by providing access to the latest GB300 NVL72 systems along with access to servers representing all GPU SKUs that they have produced for the past four years. We would like to thank the Nvidia team for allowing us to conduct independent benchmarks across close to 1000 GPUs. Thank you to Jatin Gangani, Kedar Potdar, Sridhar Ramaswamy, Ishan Dhanani, Sahithi Chigurupati, along with many other Nvidia inference engineers for helping to validate and optimize Blackwell & Hopper configurations.

We’re also grateful to Lisa Su and Anush Elangovan for their support of InferenceMAX and for supporting our work with the dozens of AMD engineers like Chun, Andy, Bill, Ramine, Theresa, Parth, etc. who contributed to InferenceMAX & upstream vLLM/SGLang bug fixes, as well as for their responsiveness in helping debug and triage AMD exclusive bugs so as to help optimize AMD performance.

We also want to recognize the SGLang, vLLM, and TensorRT-LLM maintainers for building a world-class software stack and open sourcing it to the entire world. You can check their articles on InferenceX here:

SemiAnalysis InferenceMAX: vLLM maintainers & NVIDIA accelerate Blackwell Inference

GPT-OSS Performance Optimizations: Pushing Pareto Frontier

SGLang & NVIDIA Accelerating SemiAnalysis InferenceMAX & GB200 Together

The InferenceX initiative is also supported by many major buyers of compute and prominent members of the ML community including those from OpenAI, Microsoft, vLLM, Tri Dao, PyTorch Foundation, Oracle and more. You can find the full list here.


A Primer on Important Technical Concepts

In this section, we will give a brief primer on technical concepts that may help the reader better interpret results. Some readers may not need this and can skip directly to our analysis of results. We will take a deeper dive into some of these topics after the results analysis.

Interactivity vs Throughput Tradeoff

The fundamental tradeoff with LLM inference is throughput versus latency. _Interactivity_ (tok/s/user) describes how fast each user of a system receives tokens; it is the inverse of time per output token (TPOT). _Throughput_ (tok/s) describes how many total tokens a system can crank out across all users. One can achieve higher total throughput by batching requests, but each request will be allocated fewer FLOPs and thus complete more slowly. This is analogous to the choice of riding a metro bus vs a race car. The metro bus serves many riders and makes frequent stops, which takes time, but its cost can be amortized across many passengers. The race car can only carry one or two passengers and makes few if any additional stops, meaning a faster travel time overall, but it is much more expensive per passenger. The metro bus might make more sense for people heading to the park on a weekend, while the race car might be better for bringing a celebrity to their destination. There is no one size fits all solution.
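To make the two definitions concrete, here is a minimal sketch of how interactivity and throughput relate. It assumes every user in a batch sees the same TPOT, and the operating points are hypothetical values chosen for illustration, not InferenceX measurements.

```python
def interactivity(tpot_seconds: float) -> float:
    """Per-user speed in tok/s/user: the inverse of time per output token."""
    return 1.0 / tpot_seconds


def total_throughput(concurrent_users: int, tpot_seconds: float) -> float:
    """Aggregate tok/s, assuming every user in the batch sees the same TPOT."""
    return concurrent_users * interactivity(tpot_seconds)


# Hypothetical (concurrent users, TPOT in seconds) operating points, not measured data.
OPERATING_POINTS = [(1, 0.005), (8, 0.008), (64, 0.020), (256, 0.050)]

for users, tpot in OPERATING_POINTS:
    print(f"{users:>4} users: {interactivity(tpot):6.1f} tok/s/user, "
          f"{total_throughput(users, tpot):8.1f} tok/s total")
```

In this toy example, larger batches raise total tok/s while each user's tok/s falls, which is the bus vs race car tradeoff in numbers.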

Source: SemiAnalysis

Most benchmark results we show in this article are curves rather than single points. It is important to analyze throughput at various levels of interactivity/latency instead of just looking at maximum achieved throughput (which normally can only be achieved at a single low interactivity level). With inference, there is no one size fits all use case. The level of interactivity and throughput needed depends on the use case. For instance, real-time speech models require extremely low latency so that the end user can maintain a natural “conversation” with the LLM, whereas a basic QA chatbot may allow for higher latency. We leave it up to the reader to look at the curve and apply this principle to identify where their use case falls on the throughput-interactivity curve.
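One way to turn a set of benchmark runs into such a curve is to keep only the Pareto-optimal points: configurations for which no other configuration is at least as good on both interactivity and throughput. The sketch below is an illustration under that definition, not the InferenceX implementation, and the sample points are hypothetical.

```python
def pareto_frontier(points):
    """Return the (interactivity, throughput) points not dominated by any other point.

    A point is dominated if another point is at least as good on both axes.
    The result is sorted by ascending interactivity.
    """
    frontier = []
    best_throughput = float("-inf")
    # Sweep from highest interactivity to lowest; a point survives only if it
    # beats the best throughput seen so far at equal or higher interactivity.
    for inter, tput in sorted(points, key=lambda p: (-p[0], -p[1])):
        if tput > best_throughput:
            frontier.append((inter, tput))
            best_throughput = tput
    return frontier[::-1]


# Hypothetical (tok/s/user, tok/s/GPU) results from several configurations.
runs = [(10, 1000), (20, 900), (20, 700), (30, 950), (50, 400), (40, 300)]
print(pareto_frontier(runs))
```

Here (20, 900) drops out because (30, 950) is better on both axes, even though it would look fine in isolation.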

The Cost/Perf per TCO vs Interactivity/End-to-End Latency curve mostly follows the Throughput vs Interactivity/End-to-End Latency Curve: More tokens/hour leads to a lower cost per token as fixed $/hour costs are amortized over more tokens produced.
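The amortization argument can be written as a one-line formula. This is a minimal sketch assuming a fixed all-in hourly cost per GPU; the dollar and throughput figures below are hypothetical placeholders, not TCO numbers from this article.

```python
def cost_per_million_tokens(gpu_hourly_cost_usd: float,
                            tokens_per_sec_per_gpu: float) -> float:
    """USD per million tokens when a fixed $/GPU-hour is spread over the tokens produced."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical: the same $2.00/GPU-hour system at two points on its curve.
for tok_s in (1000.0, 250.0):
    print(f"{tok_s:7.1f} tok/s/GPU -> ${cost_per_million_tokens(2.00, tok_s):.3f} per 1M tokens")
```

Because the hourly cost is fixed, a 4x drop in tokens per second per GPU translates directly into a 4x increase in cost per token.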

Prefill and Decode Phases

Inference contains two main phases: prefill and decode. _Prefill_ occurs during the first forward pass of a request’s lifetime. It is computationally intensive since all tokens in the request are processed in parallel. This phase is responsible for “filling up” the KV cache for a sequence. After prefill, responses are generated (or _decoded_) one token at a time. Each forward pass loads the entire KV cache for a sequence from HBM, while only performing the computation for a single token, making decode memory (bandwidth) intensive.
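To see why decode is bandwidth-bound, it helps to estimate how many KV cache bytes must be read for every generated token. The sketch below assumes a generic transformer with grouped-query attention and 2-byte cache entries; the model shape and HBM bandwidth are hypothetical, and models such as DeepSeek R1 use a different attention design, so these are not DeepSeek numbers.

```python
def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V stored for one sequence (the factor 2 covers K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


def min_decode_step_seconds(kv_bytes, hbm_bytes_per_s):
    """Lower bound on one decode step if reading the KV cache from HBM were the only cost."""
    return kv_bytes / hbm_bytes_per_s


# Hypothetical model shape and HBM bandwidth, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128
HBM_BYTES_PER_S = 8e12

for context_len in (1_024, 8_192):
    kv = kv_cache_bytes(context_len, LAYERS, KV_HEADS, HEAD_DIM)
    floor_us = min_decode_step_seconds(kv, HBM_BYTES_PER_S) * 1e6
    print(f"context {context_len:>5}: {kv / 1e6:8.1f} MB of KV per sequence, "
          f">= {floor_us:.1f} us per decode step per sequence")
```

The bytes read grow linearly with context length while the compute per decode step stays at a single token, which is why memory bandwidth, not FLOPs, sets the floor.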

When prefill and decode are performed on the same engine, prefill constantly disrupts decode batches, leading to worse overall performance.

Disaggregated Prefill

Disaggregated prefill (aka PD disaggregation or simply “disagg”) is the practice of separating the prefill and decode phases across separate pools of GPUs or clusters. These separate prefill and decode pools can be tuned independently and scaled to match the needs of workloads.
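As a rough illustration of tuning the two pools independently, the sketch below sizes each pool from the token demand it must absorb. It is a back-of-the-envelope model, not how Dynamo or any scheduler actually plans capacity; the function name and all rates are hypothetical.

```python
import math


def size_pools(requests_per_s, input_len, output_len,
               prefill_tok_s_per_gpu, decode_tok_s_per_gpu):
    """Return (prefill_gpus, decode_gpus) needed to keep up with steady-state demand."""
    # Prefill must process every input token; decode must emit every output token.
    prefill_gpus = math.ceil(requests_per_s * input_len / prefill_tok_s_per_gpu)
    decode_gpus = math.ceil(requests_per_s * output_len / decode_tok_s_per_gpu)
    return prefill_gpus, decode_gpus


# Hypothetical per-GPU rates; the 8k/1k and 1k/1k shapes mirror the workloads named in this article.
for isl, osl in ((8192, 1024), (1024, 1024)):
    p, d = size_pools(10, isl, osl, prefill_tok_s_per_gpu=20_000, decode_tok_s_per_gpu=2_000)
    print(f"ISL {isl}, OSL {osl}: {p} prefill GPUs, {d} decode GPUs")
```

The point of the exercise is that the prefill-to-decode ratio shifts with the input/output length mix, which a single aggregated pool cannot adapt to.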

Tensor Parallel, Expert Parallel, Data Parallel (TP, EP, DP)

TP maximizes interactivity at small batch sizes, but it must carry out an all-reduce at every layer. EP shards experts across GPUs, exploiting MoE sparsity; the drawback is that an all-to-all collective (which is more costly than simpler collectives like all-reduce) must be carried out for MoE layers, and expert load can be imbalanced at small batches. DP replicates the entire model (or just parts of a model, like attention) on multiple groups of GPUs (ranks) and then load balances requests among ranks. It is the simplest to scale, but it repeats weight loading, which can be wasteful at scale.
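The imbalance that EP can suffer at small batches is easy to see in a toy simulation. The sketch below assumes uniform random top-k routing, which real learned routers do not follow, and contiguous sharding of experts across ranks; the expert count, top-k, and EP size are illustrative values.

```python
import random


def ep_rank_loads(num_tokens, num_experts, top_k, ep_size, seed=0):
    """Count token-to-expert assignments landing on each EP rank.

    Assumes num_experts is divisible by ep_size and that experts are sharded
    contiguously, so expert e lives on rank e // (num_experts // ep_size).
    """
    rng = random.Random(seed)
    experts_per_rank = num_experts // ep_size
    loads = [0] * ep_size
    for _ in range(num_tokens):
        # Uniform random routing is a simplifying assumption for this sketch.
        for expert in rng.sample(range(num_experts), top_k):
            loads[expert // experts_per_rank] += 1
    return loads


def imbalance(loads):
    """Busiest rank's load divided by the mean load (1.0 means perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


for batch in (4, 64, 4096):
    loads = ep_rank_loads(batch, num_experts=256, top_k=8, ep_size=8)
    print(f"batch {batch:>5}: loads={loads} imbalance={imbalance(loads):.2f}")
```

Since every rank waits on the busiest one during the all-to-all, an imbalance well above 1.0 at small batches means most GPUs sit partly idle.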

One of the main goals of InferenceX is to visualize performance improvements over time. While new chips are released on an O(yearly) cadence, software releases happen on an O(weekly) cadence. Our goal is to constantly update recipes with the latest and greatest software improvements and benchmark the configurations.

DeepSeek R1

The AMD team has significantly improved performance for all configurations of SGLang DeepSeek R1 FP4. For the same interactivity, AMD has almost doubled the amount of throughput in the span of less than 2 months. Moreover, we have pushed AMD to upstream performance enhancing changes from their forked SGLang images into the official SGLang image. From December 2025 to January 2026, AMD’s software was improved up to 2x in performance.

In order to keep moving closer to a first class experience, AMD needs to increase its support of vLLM & SGLang maintainers through compute contributions and code contributions, and by having more reviewers who work for AMD to speed up the review process of AMD PRs into upstream.

On the other hand, Nvidia’s results were more consistent, with minor improvements for B200 SGLang over a similar time period.

Many of the mature SKUs had minimal improvements. For example, H200 TRT single node has not changed in performance in the span of 4 months since October, but this is because Hopper support has been excellent since day 1, and performance has been close to the theoretical peak for this workload all along, making it hard to deliver incremental performance gains.

MI300X and MI325X have seen some improvements, mainly from the most recent SGLang release. Note that for much of the history of InferenceX, AMD was using “private” ROCm images that were not upstreamed, so runs prior to ~Jan 2026 cannot be compared directly to those that are more recent.

GB200 Dynamo TRT-LLM disagg has seen some significant improvements as well, with a 20% increase in max throughput in the span of a little over 1 month. We also see improvements in the middle interactivities, where wide EP is deployed. This is likely due to maturing wide EP kernels on GB200.

B200 SGLang has seen steady and continuous improvement for both FP4 and FP8 scenarios since our initial launch, with throughput per GPU doubling at some interactivity levels since last October.

For MI355X Disaggregated inference serving, AMD recommends using SGLang with MoRI. MoRI is AMD’s MoE dispatch/combine collective and KV Cache transfer library built from first principles by AMD’s cracked 10x China-based engineering team. Although MoRI needs much more open CI and testing, we are strong supporters of the direction that MoRI is taking. This is because instead of taking AMD’s historical approach, which was to fork NVIDIA’s NCCL into RCCL, MoRI is built from scratch by taking the lessons from RCCL/NCCL and building an entirely new package from first principles. The use of MoRI has also delivered good speedups in the span of more than a month, with throughput per GPU increasing by more than 20% in the 20-45 tok/s/user interactivity range.

GPT-OSS 120B

For MI300X and MI325X, we have seen marginal improvements across the board. Some AITER optimizations helped MI300X performance across all interactivities, and switching to the upstream vLLM ROCm image led to improvements.

In the case of the MI325X, it appears that not all performance enhancements that were present in the downstream ROCm fork image (used during the October 5th, 2025 run) have made it into the official vLLM ROCm image.

Unfortunately, the MI355X literally still uses a fork of the vLLM 0.10.1 build (`rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1`). We would love to have seen it updated by now, but the current official image (0.15.1, at the time this article was written) is not yet optimized for the MI355X and runs into hard errors. We had also run into hard crashes on MI355X with vLLM 0.14. Word on the street is that vLLM 0.16.0 will finally deliver all the changes needed for better MI355X performance.

Turning back to Nvidia’s systems, both Hopper and Blackwell saw a steady performance increase between vLLM 0.11.2 and 0.13.0. Soon, we will update recipes for Nvidia GPUs to use the latest vLLM version and we expect even greater performance gains after making the switch. We also observed a performance bump in the latest 1.2.0 version of TRT-LLM.

Disaggregated Inference Frameworks

NVIDIA uses Dynamo for its disaggregated inference setup. Dynamo is an inference framework designed for multi-node distributed inference, featuring techniques such as prefill-decode disaggregation, request routing, and KV cache offloading. It is inference-engine agnostic, allowing us to use SGLang and TRT LLM as backends in our benchmark. For AMD, we use SGLang with two different KV cache transfer frameworks: MoRI and Mooncake. MoRI is a high-performance communication interface focusing on RDMA and GPU integration, offering applications such as network collective operations and expert parallel kernels. Mooncake, which recently joined the PyTorch ecosystem, supports prefill-decode disaggregation and many fault tolerant multi-node features.

DeepSeek Disagg +WideEP Results Deep Dive

At almost all interactivity levels, disagg outperforms aggregated inference (grey lines) in terms of total token throughput per GPU. Multi-node disaggregated prefill framemogs single node aggregated serving.

Nvidia continues to push new updates for B200/GB200 FP8. The latest data compares DeepSeek FP8 on B200 TRT single node (both MTP enabled/disabled) against GB200 Dynamo+TRT disagg (both MTP enabled/disabled). This indicates consistent engineering effort to improve rack-scale inference software and wideEP kernels.

When comparing MI355X disaggregated inference vs aggregated inference, we noticed a similar pattern. Disaggregated inference only overtakes aggregated inference at low interactivity, high batch sizes. This is true across FP4, and it is likely due to poorly optimized kernels.

When composing disagg prefill+wideEP with FP4 on the MI355X, we observe subpar performance.

Although theoretical modeling shows that disagg inference on MI355Xs should perform way better than single node, disagg actually performs worse for higher interactivity levels due to a lack of kernel and collective optimization in the ROCm software stack when composing multiple SOTA inference optimizations together.