AI's Hardware Story Is Splitting in Two Directions

Two real hardware stories landed the same week — a datacenter memory shortage and capable open models collapsing onto ten-year-old servers.

  • AI
  • Engineering
  • Performance
  • Product
A split-view illustration of a data center rack — modern GPU stack on one side, single ten-year-old server on the other.

AI’s hardware story is being told in two directions this week, and most coverage is only tracking one. The “memory shortage” framing — frontier workloads outrunning datacenter capacity — is real. The other half of the story, harder to spot in headlines, is that capable open models keep collapsing into commodity hardware already sitting in everyone’s racks. Both moves landed on the same day. Both are real. What changes once you hold them together is what you should actually budget for in 2026 and 2027.

What landed this week

Two pieces from June 1st, on opposite sides of the same line.

  1. Nvidia open-sourced Cosmos 3, the first SOTA open-weight physical AI foundation model. Two sizes (16B Nano and 64B Super), a Mixture-of-Transformers architecture combining a reasoning tower (a vision-language model) with a generator tower (diffusion-based), trained for robotics and autonomous vehicles. Training scripts, datasets, and NIM deployment microservices included.

    The structural detail worth naming: Nvidia’s incentive here is compute adoption, not a model moat. Physical AI is a brand-new GPU-hungry workload category — every robotics team or AV company that adopts Cosmos 3 keeps buying compute underneath the model. The release expands the addressable workload for Nvidia’s hardware while preserving the model layer as a commodity. It’s the same logic Google has been running with Gemma releases, translated to physical AI.

  2. Gemma 4 26B MoE runs at reading speed on a 2016 DDR3 Xeon with no GPU. A hobbyist’s writeup walks through 25 ik_llama.cpp optimization flags, including speculative decoding with multi-token prediction, CPU-side Mixture-of-Experts routing, Flash Attention ported to CPU, and mlock to pin model weights in RAM. Together they push fluid inference onto hardware ten years old.

    The model itself is competitive with frontier-tier open weights. The hardware costs pennies on the dollar and already lives in everyone’s depreciation tables. The bottleneck those optimizations attack is memory bandwidth, not raw compute, and memory bandwidth is exactly the constraint that most “AI hardware crisis” coverage assumes is binding.

If you don’t live in this jargon every day

The one term that does the most work in this story is memory bandwidth. When the AI industry talks about a hardware shortage, the constraint usually isn’t compute (the math the chip can do per second) but how fast the chip can move model weights in and out of memory to feed that math. Frontier GPUs are expensive partly because their memory bandwidth is enormous. Ten-year-old Xeons are cheap partly because theirs isn’t.
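To make the bandwidth wall concrete, here is a back-of-envelope sketch in Python. It assumes that generating one token means streaming every active weight from memory once, so the ceiling on decode speed is bandwidth divided by bytes read per token. The function name and the example figures (4 billion active parameters, 4-bit weights) are illustrative assumptions, not numbers from the writeup; 12.8 GB/s per channel is the theoretical peak of DDR3-1600.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                active_params_billions: float,
                                bytes_per_weight: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    Assumes every active weight is read from memory once per generated token
    and ignores KV-cache traffic, so real throughput will be lower.
    """
    bytes_per_token = active_params_billions * 1e9 * bytes_per_weight
    return (bandwidth_gb_s * 1e9) / bytes_per_token


# Theoretical peak of quad-channel DDR3-1600: 4 channels x 12.8 GB/s.
DDR3_QUAD_CHANNEL_GB_S = 4 * 12.8

# Hypothetical MoE model: 4B parameters active per token, 4-bit (0.5 byte) weights.
ceiling = decode_ceiling_tokens_per_s(DDR3_QUAD_CHANNEL_GB_S, 4.0, 0.5)

# Same machine, but a dense model that reads all 26B weights per token.
dense_ceiling = decode_ceiling_tokens_per_s(DDR3_QUAD_CHANNEL_GB_S, 26.0, 0.5)

print(f"sparse ceiling: {ceiling:.1f} tokens/s")
print(f"dense ceiling:  {dense_ceiling:.1f} tokens/s")
```

Under these assumptions the sparse model's ceiling is 6.5 times the dense one, simply because it reads 6.5 times fewer bytes per token. That ratio, not the chip's arithmetic throughput, is what decides whether old memory can keep up.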

What the Xeon writeup does is route around the bandwidth wall with software tricks: draft the next few tokens cheaply and have the full model verify them in one batch (speculative decoding), keep only a sliver of the model active per token (Mixture-of-Experts routing), pin the weights so the OS doesn’t page them out (mlock). A model that “shouldn’t” run on old hardware does anyway. That’s the move. It’s why the same week can show both “AI is hardware-constrained” (true at the datacenter scale) and “AI runs on a 2016 Xeon” (also true) without contradiction.
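Speculative decoding is the least intuitive of those tricks, so here is a toy greedy version in Python. The `target` and `draft` functions are hypothetical stand-ins for the large model and the cheap drafter, not any real library's API, and the sketch verifies drafted tokens one at a time where a real runtime would score them in a single batched pass. What it illustrates is that the output is identical to what the large model would produce alone, while the number of verification rounds falls whenever the draft guesses right.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Return (tokens, rounds). Each round stands in for one batched target pass."""
    out = list(prompt)
    goal = len(prompt) + max_new
    rounds = 0
    while len(out) < goal:
        rounds += 1
        # 1. The cheap draft proposes up to k tokens.
        ctx = list(out)
        proposal = []
        for _ in range(min(k, goal - len(out))):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target checks them in order. Keep the target's token at each
        #    position and stop at the first disagreement, because later drafts
        #    were conditioned on a wrong prefix.
        for token in proposal:
            expected = target(out)
            out.append(expected)
            if expected != token:
                break
    return out, rounds


# Toy models: the draft agrees with the target except right after token 5.
def target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else target(ctx)


tokens, rounds = speculative_decode(target, draft, [1], max_new=12)
baseline = greedy_decode(target, [1], max_new=12)
print(tokens == baseline, rounds)
```

On a bandwidth-bound machine each round costs roughly one pass over the weights, so accepting several tokens per round is where the speedup comes from.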

Two real hardware stories, one week. The framing that holds both is “which workload.”

What changes once you hold both stories together

The “AI memory shortage” headline is accurate at the workload scale Nvidia is selling into. Physical AI, frontier-scale fine-tuning, very long-context inference, real-time multimodal systems: all keep getting hungrier. Datacenters are genuinely capacity-constrained for those workloads, and the build cycle for new capacity runs in years, not quarters.

What the headline doesn’t capture is that the floor under “capable text inference” keeps falling. Open models the size of a frontier release from two years ago now run on hardware that’s been depreciated to zero. The next wave of optimization (CPU-side MoE routing, speculative decoding with multi-token prediction, weight quantization that doesn’t lose much) is lowering that threshold further every quarter.
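As a rough illustration of why quantization “doesn’t lose much,” the Python sketch below rounds a handful of weights to 8-bit integers with a single shared scale and measures the round-trip error. It is a deliberately simple symmetric scheme chosen for clarity, not the block-wise formats that llama.cpp-family runtimes use, and the sample weights are made up.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: map the largest magnitude to +/-127."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 0.9]  # made-up sample values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.4f}, worst error={worst_error:.6f}")
```

Each weight now fits in one byte instead of two or four, and the rounding error is bounded by half the scale. Fewer bytes per weight means fewer bytes moved per token, which is the quantity a bandwidth-bound machine cares about.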

The result is a bifurcation, not a uniform shortage. New AI workloads need new compute. Existing AI workloads, the ones most operators are actually trying to deploy at scale today, keep getting cheaper to run on hardware they already have. Most procurement cycles aren’t budgeting for both directions at once because most coverage isn’t naming both directions at once.

(I’m reasonably sure about the top half: Nvidia’s incentive structure on Cosmos 3 is the same one any chip vendor runs, and physical AI as a workload category is straightforwardly bandwidth-hungry. I’m a little less sure how widely the Xeon-floor reframing applies. The optimization stack in that writeup is specific, and the model’s behavior on harder workloads (longer contexts, tool use, multi-turn agentic tasks) hasn’t been broadly benchmarked yet. Read the floor story as “the threshold is collapsing faster than people think,” not as “you can throw out your GPU budget.”)

In practice

Three concrete moves for anyone sizing AI infrastructure this year:

  1. Split the workload before you size the hardware. Physical AI, frontier-scale training, and long-context inference belong on one budget. Standard text inference on capable open models belongs on a different one, and that second budget is probably overbuilt if you wrote it in 2024.
  2. Measure your inference workload’s actual memory-bandwidth profile. If the workload is bandwidth-bound, CPU-side optimization may close more ground than buying a smaller GPU instance. The Xeon writeup shows the upper bound of what’s possible; your existing hardware can probably get closer to that ceiling than your procurement model assumes.
  3. Keep a separate line item for physical AI. If your roadmap doesn’t include robotics, AV, or simulation workloads, the Cosmos 3 release probably doesn’t change this year’s budget. If it does, those workloads are now real, and they aren’t going to be cheap. They also weren’t on the 2025 budget.
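For the second move in that list, a first-pass check does not need a profiler. The Python sketch below compares a workload's arithmetic intensity (FLOPs performed per byte of weights read) with the machine's balance point (peak FLOPs divided by peak bandwidth), which is the standard roofline test. The hardware figures and the rule of thumb of about 2 FLOPs per active weight per token are assumptions for illustration; substitute measured numbers from your own stack.

```python
def arithmetic_intensity(batch_size: int, bytes_per_weight: float,
                         flops_per_weight: float = 2.0) -> float:
    """FLOPs per byte of weights read during one decode step.

    Assumes weights are read once per step and reused across the batch,
    so intensity grows linearly with batch size.
    """
    return batch_size * flops_per_weight / bytes_per_weight


def is_bandwidth_bound(intensity: float, peak_flops: float,
                       peak_bytes_per_s: float) -> bool:
    """True when the workload does fewer FLOPs per byte than the machine can sustain."""
    machine_balance = peak_flops / peak_bytes_per_s
    return intensity < machine_balance


# Hypothetical older server CPU: 500 GFLOP/s peak, 50 GB/s memory bandwidth.
PEAK_FLOPS, PEAK_BW = 500e9, 50e9

single_user = arithmetic_intensity(batch_size=1, bytes_per_weight=0.5)
batched = arithmetic_intensity(batch_size=8, bytes_per_weight=0.5)

print(is_bandwidth_bound(single_user, PEAK_FLOPS, PEAK_BW))
print(is_bandwidth_bound(batched, PEAK_FLOPS, PEAK_BW))
```

In this toy case single-user decoding is limited by memory traffic, so fewer bytes per weight or faster memory helps more than extra cores. At batch size 8 the same machine tips over into being compute-bound, which is one reason the answer depends on the workload and not only on the hardware.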

What to watch next

Two questions. First, whether Google or another frontier lab matches Cosmos 3 with an open physical AI release before Q4. That’s the test of whether “open-weight frontier model as compute demand driver” is now the standard playbook for chip vendors, or just Nvidia’s. Second, whether the next round of CPU inference optimizations pushes the floor below DDR3 hardware, which would force the same kind of rethink at the low end. Both questions get answered in the next two quarters.

Most enterprise AI budgets approved in 2025 will look overbuilt by 2027, for every workload except the physical-AI ones, which weren’t on the budget at all.