Rob Colantuoni

September 11, 2023

Tags: AI and Infrastructure

GPUs Are the New Scarcity

I’ve been living in GPU infrastructure for a few months now, and I’m pretty convinced we’re watching the early innings of a real shift in how cloud computing works. LLMs and generative AI aren’t just new apps—they’re creating infrastructure demands that the old cloud model was never built for.

The supply-demand mess nobody’s solving

Demand for GPU compute has gone vertical. Everyone wants to train or fine-tune models, run inference, or both. But supply of high-end GPUs—especially NVIDIA’s H100 and what comes next—is constrained. Manufacturing. Power. Capital to build these clusters. It all bottlenecks.

So you get a supply-demand mismatch that feels different from anything I’ve seen in cloud. CPU compute? Effectively a commodity. You can get it from anyone, competitive pricing, usually available when you need it. GPU compute? Scarce. Expensive. Often unavailable exactly when you need it most.

That mismatch is opening space for a different kind of infrastructure player. The classic cloud playbook—build giant data centers, fill them with homogeneous hardware, sell standardized compute—doesn’t map cleanly to GPUs. GPU workloads are more intensive, more specialized, more sensitive to how things are wired together, and way more variable in how they hit.

What’s actually hard about building this stuff

Building GPU cloud infrastructure is a different beast than traditional cloud. Some of what I’ve been chewing on:

Power and cooling. A rack of H100s draws way more power than a rack of CPUs. The density requirements are pushing up against what existing data center designs can handle. Liquid cooling went from niche to basically necessary. That changes where you site things, how you procure power, how you design the physical plant.
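
To make that concrete, here’s the back-of-envelope math. The wattages are rough public figures (H100 SXM TDP around 700 W; roughly 10 kW for a full 8-GPU server once you count CPUs, NICs, and fans), and the servers-per-rack number is an assumption for illustration, not anyone’s spec sheet:

```python
# Back-of-envelope rack power math. Numbers are approximate public
# figures and illustrative assumptions, not vendor-exact specs.

GPU_TDP_W = 700            # approximate H100 SXM TDP
SERVER_OVERHEAD_W = 4600   # CPUs, NICs, fans, storage (rough)
GPUS_PER_SERVER = 8
SERVERS_PER_RACK = 4       # assumed density, for illustration

server_w = GPUS_PER_SERVER * GPU_TDP_W + SERVER_OVERHEAD_W
rack_kw = SERVERS_PER_RACK * server_w / 1000

print(f"per-server draw: {server_w / 1000:.1f} kW")
print(f"per-rack draw:   {rack_kw:.1f} kW")
# Legacy data center halls often budget ~5-10 kW per rack; this rack
# wants ~40 kW, which is why cooling and siting get rethought.
```

Even with just four servers per rack, you’re at several times what a typical air-cooled hall was designed to deliver per rack.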

Networking. For training, the interconnect between GPUs matters almost as much as the GPUs themselves. Multi-node training needs high bandwidth and low latency between nodes. InfiniBand and RoCE are the standards, but they add complexity and cost. Your networking topology directly affects training performance—which means infrastructure choices have first-order impact on how fast research moves.
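
A rough model shows why. Ring all-reduce (the common gradient-sync pattern) puts about 2·(n−1)/n bytes on the wire per byte of gradients, so sync time scales with model size over link bandwidth. The model size and link speeds below are illustrative, not benchmarks:

```python
# Idealized ring all-reduce timing: ignores latency, congestion, and
# compute/communication overlap. Inputs are illustrative assumptions.

def allreduce_seconds(grad_bytes: float, n_gpus: int, link_gbps: float) -> float:
    """Time for one idealized ring all-reduce over a link of link_gbps."""
    bytes_on_wire = grad_bytes * 2 * (n_gpus - 1) / n_gpus
    return bytes_on_wire / (link_gbps * 1e9 / 8)  # Gbit/s -> bytes/s

grads = 7e9 * 2  # e.g. a 7B-parameter model with fp16 gradients

for gbps in (100, 400):  # slower vs. faster interconnect, for contrast
    t = allreduce_seconds(grads, n_gpus=64, link_gbps=gbps)
    print(f"{gbps:>4} Gb/s link: ~{t:.2f} s per gradient sync")
```

If that sync happens every training step, a 4x slower link can mean seconds of stall per step—which is how the network topology ends up with first-order impact on research velocity.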

Scheduling and orchestration. GPU workloads behave differently from CPU workloads. A training job might want 64 GPUs for three days, then nada. An inference workload might want 4 GPUs steadily but with wildly variable request rates. The scheduler has to handle both patterns, minimize idle time, and still guarantee availability. Kubernetes was built for CPU workloads; GPU scheduling needs real extension work.
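
A toy first-fit scheduler makes the fragmentation problem visible. GPUs are requested in whole units and can’t be sliced across nodes the way CPU shares can, so capacity strands fast. Node sizes and job names here are made up for illustration:

```python
# Toy first-fit GPU scheduler: place each job on the first node with
# enough free GPUs, or queue it. A sketch, not a real scheduler.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    total_gpus: int
    used_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus

def schedule(jobs, nodes):
    """Returns (placements, pending). Jobs are (name, gpus_needed)."""
    placements, pending = [], []
    for job_name, gpus_needed in jobs:
        for node in nodes:
            if node.free_gpus >= gpus_needed:
                node.used_gpus += gpus_needed
                placements.append((job_name, node.name))
                break
        else:
            pending.append(job_name)  # no single node fits; job waits
    return placements, pending

nodes = [Node("node-a", 8), Node("node-b", 8)]
jobs = [("train-64", 8), ("infer-1", 4), ("infer-2", 4), ("finetune", 8)]
placed, waiting = schedule(jobs, nodes)
print(placed)   # training fills node-a; the two inference jobs fill node-b
print(waiting)  # finetune waits even though zero GPUs sit idle
```

Note the failure mode: the cluster is 100% utilized and a job still can’t run, because an 8-GPU job can’t be split across two half-full nodes. Real schedulers add gang scheduling, preemption, and topology awareness on top of this—the extension work Kubernetes needs for GPUs.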

Multi-tenancy. Sharing GPU infrastructure across tenants gets tricky. GPU memory, CUDA context switching, PCIe bandwidth—all of it can create interference. Getting multi-tenancy right is table stakes for economics, but doing it without killing performance is hard.

Cost optimization. GPUs are expensive. Idle GPU time is real money. Every minute a $30K GPU sits unused hurts. That drives you toward smarter scheduling, preemptible workloads, packing strategies—stuff that doesn’t have clear analogues in the CPU world.
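
Pricing it out makes the point. Take the rough $30K figure above and a 3-year straight-line amortization (an assumption—real accounting varies), and idle capacity turns into a dollars-per-hour number:

```python
# Amortized cost of idle GPU time. $30K is the rough figure from the
# text; the 3-year service life is an illustrative assumption.

GPU_PRICE = 30_000                  # dollars
SERVICE_LIFE_HOURS = 3 * 365 * 24   # 3-year straight-line amortization

cost_per_hour = GPU_PRICE / SERVICE_LIFE_HOURS

def idle_cost(n_gpus: int, utilization: float) -> float:
    """Amortized hardware dollars burned per hour by idle capacity."""
    return n_gpus * (1 - utilization) * cost_per_hour

print(f"amortized cost: ${cost_per_hour:.2f}/GPU-hour")
print(f"1,000 GPUs at 60% utilization: ${idle_cost(1000, 0.60):,.0f}/hour idle")
```

At fleet scale, a few points of utilization is a full engineer’s salary per week—hardware cost alone, before power and networking. That’s the economic pressure behind preemptible workloads and packing strategies.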

Where I think this is headed

A few trends I’m betting on:

Disaggregated infrastructure. The tight coupling of GPU, CPU, memory, and storage in current server designs feels limiting. I expect we’ll see more architectures where GPU pools, memory pools, and storage can be composed dynamically based on what the workload actually needs.

Inference optimization. Right now the crunch is on training. But as more models ship to production, inference will dominate by volume. Inference has different characteristics—smaller batches, tighter latency, more predictable patterns—and that’ll drive different infrastructure optimizations.
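
One concrete optimization in that direction is dynamic batching: hold incoming requests for a few milliseconds so the GPU sees a batch instead of one request at a time. The sketch below shows the core collect-a-batch loop; the queue mechanics and limits are my own simplification, not any particular serving framework’s API:

```python
# Minimal dynamic-batching collector: take up to max_batch requests or
# wait max_wait_s for stragglers, whichever comes first. A sketch of
# the technique, not a production serving loop.

import queue
import time

def collect_batch(requests: "queue.Queue", max_batch: int = 8,
                  max_wait_s: float = 0.005) -> list:
    """Block for one request, then gather more until the batch is full
    or the wait budget runs out."""
    batch = [requests.get()]                 # block for the first request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                            # wait budget exhausted
    return batch                             # run one forward pass on this

# Usage: with three requests waiting, one batch drains all three.
q = queue.Queue()
for i in range(3):
    q.put(f"req-{i}")
batch = collect_batch(q)
print(batch)
```

The knobs are the latency/throughput trade-off in miniature: a larger `max_wait_s` builds bigger batches (better GPU utilization) at the cost of added tail latency—exactly the kind of tuning that differs between training-style and inference-style infrastructure.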

Specialized hardware. NVIDIA’s dominance is real but probably not permanent. AMD, Intel, Google’s TPUs, and a bunch of AI chip startups are all in the mix. The infrastructure layer has to stay flexible enough to work across different accelerator architectures.

Edge inference. Not everything needs to run in the cloud. Latency-sensitive stuff, privacy-constrained workloads, cost optimization—they’ll push inference toward the edge. That means demand for smaller, more efficient inference hardware and the software to run it.

Why I find this interesting

For infrastructure people, this is one of the most interesting spaces I’ve run into. The problems are hard. The constraints are real. The impact is immediate. Every bit of improvement in utilization, every reduction in scheduling overhead, every optimization in the networking stack—it all lands as faster AI research and cheaper AI deployment.

The GPU cloud isn’t a side bet. It’s becoming the substrate for the next generation of computing. What we build now will shape how fast AI capabilities advance and who gets access to them. That’s worth working on.