The big news this week from Nvidia, splashed in headlines across all forms of media, was the company’s announcement about its Vera Rubin GPU.
This week, Nvidia CEO Jensen Huang used his CES keynote to highlight performance metrics for the new chip. According to Huang, the Rubin GPU is capable of 50 PFLOPS of NVFP4 inference and 35 PFLOPS of NVFP4 training performance, representing 5x and 3.5x the performance of Blackwell, respectively.
But it won’t be available until the second half of 2026. So what should enterprises be doing now?
Blackwell keeps on getting better
The current, shipping Nvidia GPU architecture is Blackwell, which was announced in 2024 as the successor to Hopper. Alongside that release, Nvidia emphasized that its product engineering path also included squeezing as much performance as possible out of the prior Grace Hopper architecture.
It’s a direction that will hold true for Blackwell as well, with Vera Rubin coming later this year.
“We continue to optimize our inference and training stacks for the Blackwell architecture,” Dave Salvator, director of accelerated computing products at Nvidia, told VentureBeat.
In the same week that Vera Rubin was being touted by Nvidia’s CEO as its most powerful GPU ever, the company published new research showing improved Blackwell performance.
How Blackwell inference performance has improved by 2.8x
Nvidia has been able to increase Blackwell inference performance by up to 2.8x per GPU in just three months.
The performance gains come from a series of innovations that have been added to the Nvidia TensorRT-LLM inference engine. These optimizations apply to existing hardware, allowing current Blackwell deployments to achieve higher throughput without hardware changes.
The performance gains are measured on DeepSeek-R1, a 671-billion parameter mixture-of-experts (MoE) model that activates 37 billion parameters per token.
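To make that sparsity concrete, here is a quick back-of-the-envelope sketch in Python. The parameter counts are the ones cited above; the percentage is simple arithmetic, not an Nvidia figure.

```python
# Rough sketch of how sparse DeepSeek-R1's mixture-of-experts routing is per
# token. The parameter counts come from the article; the rest is arithmetic.

TOTAL_PARAMS = 671e9   # total parameters in the model
ACTIVE_PARAMS = 37e9   # parameters activated for each generated token

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active per token: {ACTIVE_PARAMS / 1e9:.0f}B of {TOTAL_PARAMS / 1e9:.0f}B "
      f"parameters ({active_fraction:.1%})")
# Roughly 5.5% of the weights do the work for any single token, which is why
# per-token compute is far lower than the 671B headline figure suggests.
```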
Among the technical innovations that provide the performance boost:
- Programmatic dependent launch (PDL): Expanded implementation reduces kernel launch latencies, increasing throughput.
- All-to-all communication: A new implementation of communication primitives eliminates an intermediate buffer, reducing memory overhead.
- Multi-token prediction (MTP): Generates multiple tokens per forward pass rather than one at a time, increasing throughput across various sequence lengths.
- NVFP4 format: A 4-bit floating point format with hardware acceleration in Blackwell that reduces memory bandwidth requirements while preserving model accuracy (a rough sketch of this effect follows the list).
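Token generation is typically limited by memory bandwidth, so halving the bits stored per weight roughly halves the bytes that must be streamed for every token. The sketch below, referenced in the NVFP4 item above, makes that arithmetic explicit under stated assumptions: the bandwidth figure is a placeholder, and NVFP4's small per-block scale-factor overhead is ignored.

```python
# Illustrative arithmetic only: why a 4-bit weight format eases the memory
# bandwidth bottleneck during decoding. The 37B active-parameter figure comes
# from the article; the HBM bandwidth value is a hypothetical placeholder.

ACTIVE_PARAMS = 37e9     # weights read per generated token (MoE active set)
HBM_BANDWIDTH = 8e12     # hypothetical 8 TB/s of memory bandwidth per GPU

for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit (NVFP4-like)", 4)]:
    bytes_per_token = ACTIVE_PARAMS * bits / 8
    ceiling = HBM_BANDWIDTH / bytes_per_token   # bandwidth-bound upper limit
    print(f"{label:>18}: ~{bytes_per_token / 1e9:.0f} GB of weights per token, "
          f"single-stream ceiling ~{ceiling:.0f} tokens/s")
```

The absolute numbers are not meaningful; the point is the ratio: the 4-bit ceiling is roughly double FP8's and quadruple FP16's on the same memory system.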
The optimizations reduce cost per million tokens and allow existing infrastructure to serve higher request volumes at lower latency. Cloud providers and enterprises can scale their AI services without immediate hardware upgrades.
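As a hedged illustration of how a software-only speedup flows through to serving economics, the sketch below uses a made-up GPU-hour price and baseline throughput; only the 2.8x multiplier comes from Nvidia's figures.

```python
# Hypothetical cost-per-million-tokens illustration. The 2.8x speedup is
# Nvidia's published software-only gain for Blackwell; the GPU-hour price and
# baseline throughput below are made-up placeholders for the arithmetic.

GPU_HOUR_PRICE = 5.00            # USD per GPU-hour (hypothetical)
BASELINE_TOKENS_PER_SEC = 1_000  # aggregate tokens/s per GPU before the update (hypothetical)
SPEEDUP = 2.8                    # reported software-only improvement

def cost_per_million_tokens(tokens_per_sec: float) -> float:
    """Dollars to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_sec * 3600
    return GPU_HOUR_PRICE / tokens_per_hour * 1e6

before = cost_per_million_tokens(BASELINE_TOKENS_PER_SEC)
after = cost_per_million_tokens(BASELINE_TOKENS_PER_SEC * SPEEDUP)
print(f"Before: ${before:.2f} per million tokens")
print(f"After : ${after:.2f} per million tokens ({1 - after / before:.0%} cheaper)")
```

Swapping in actual GPU pricing and measured throughput gives an estimate for a specific deployment.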
Blackwell has also made training performance gains
Blackwell is also widely used as a foundational hardware component for training the largest of large language models.
In that respect, Nvidia has also reported significant gains for Blackwell when used for AI training.
Since its initial launch, the GB200 NVL72 system has delivered up to 1.4x higher training performance on the same hardware, a 40% boost achieved in just five months without any hardware upgrades.
The training boost came from a series of updates including:
- Optimized training recipes: Nvidia engineers developed sophisticated training recipes that effectively leverage NVFP4 precision. Initial Blackwell submissions used FP8 precision, but the transition to NVFP4-optimized recipes unlocked substantial additional performance from the existing silicon.
- Algorithmic refinements: Continuous software stack enhancements and algorithmic improvements enabled the platform to extract more performance from the same hardware, demonstrating ongoing innovation beyond initial deployment.
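As a rough illustration of what that 1.4x training-throughput gain buys, the sketch below converts it into wall-clock time and GPU-hours for a hypothetical run; only the 1.4x figure comes from Nvidia.

```python
# What a 1.4x training-throughput gain means in practice. The 1.4x figure is
# from Nvidia's results; the run length and cluster size are hypothetical.

SPEEDUP = 1.4        # reported training-performance gain on the same hardware
BASELINE_DAYS = 30   # hypothetical wall-clock time for a large pre-training run
NUM_GPUS = 512       # hypothetical cluster size

new_days = BASELINE_DAYS / SPEEDUP
gpu_hours_saved = (BASELINE_DAYS - new_days) * 24 * NUM_GPUS
print(f"Run time: {BASELINE_DAYS} days -> {new_days:.1f} days "
      f"({1 - 1 / SPEEDUP:.0%} less wall-clock time)")
print(f"GPU-hours saved on a {NUM_GPUS}-GPU cluster: ~{gpu_hours_saved:,.0f}")
```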
Double down on Blackwell or wait for Vera Rubin?
Salvator noted that the high-end Blackwell Ultra is a market-leading platform purpose-built to run state-of-the-art AI models and applications.
He added that the Nvidia Rubin platform will extend the company’s market leadership and enable the next generation of MoE models to power a new class of applications, taking AI innovation even further.
Salvator explained that Vera Rubin is built to address the growing demand for compute created by the continuing growth in model size and reasoning-token generation from leading MoE models.
“Blackwell and Rubin can serve the same models, but the difference is the performance, efficiency and token cost,” he said.
According to Nvidia’s early testing results, compared to Blackwell, Rubin can train large MoE models with a quarter of the number of GPUs, generate inference tokens with 10x more throughput per watt, and serve inference at 1/10th the cost per token.
“Better token throughput performance and efficiency means newer models can be built with more reasoning capability and faster agent-to-agent interaction, creating better intelligence at lower cost,” Salvator said.
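Taken at face value, a 10x gain in tokens per watt implies a 10x drop in energy per token. The short sketch below spells out that arithmetic with hypothetical baseline numbers; only the 10x ratio comes from Nvidia's claim.

```python
# Energy-per-token arithmetic implied by a 10x tokens-per-watt claim. Only the
# 10x ratio comes from Nvidia's statement; the power draw and the baseline
# throughput below are hypothetical placeholders.

GPU_POWER_WATTS = 1_000          # hypothetical sustained power per GPU
BASELINE_TOKENS_PER_SEC = 1_000  # hypothetical Blackwell tokens/s per GPU
PERF_PER_WATT_GAIN = 10          # claimed Rubin advantage in throughput per watt

def joules_per_token(tokens_per_sec: float, watts: float) -> float:
    """Energy spent per generated token."""
    return watts / tokens_per_sec

blackwell_j = joules_per_token(BASELINE_TOKENS_PER_SEC, GPU_POWER_WATTS)
rubin_j = blackwell_j / PERF_PER_WATT_GAIN
print(f"Blackwell (assumed): ~{blackwell_j:.2f} J per token")
print(f"Rubin (implied)    : ~{rubin_j:.2f} J per token")
```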
What it all means for enterprise AI builders
For enterprises deploying AI infrastructure today, current investments in Blackwell remain sound despite Vera Rubin’s arrival later this year.
Organizations with existing Blackwell deployments can immediately capture the 2.8x inference improvement and 1.4x training boost by updating to the latest TensorRT-LLM versions — delivering real cost savings without capital expenditure. For those planning new deployments in the first half of 2026, proceeding with Blackwell makes sense. Waiting six months means delaying AI initiatives and potentially falling behind competitors already deploying today.
However, enterprises planning large-scale infrastructure buildouts for late 2026 and beyond should factor Vera Rubin into their roadmaps. The 10x improvement in throughput per watt and 1/10th cost per token represent transformational economics for AI operations at scale.
The smart approach is phased deployment: Leverage Blackwell for immediate needs while architecting systems that can incorporate Vera Rubin when available. Nvidia’s continuous optimization model means this isn’t a binary choice; enterprises can maximize value from current deployments without sacrificing long-term competitiveness.
