My #1 Prediction for AI in 2025: Inference
Will replace "training" as a main theme for investing and compute news cycle
My top prediction for 2025, in a word: Inference.
It’s not just another buzzword—it’s the word for 2025, the driving force behind AI’s evolution and the investment theme of the year. I’m putting it on the record: Inference is the next big frontier.
From Training to Inference: The Evolution of AI
In 2022, the smart money zeroed in on training as the foundation of AI. Investors flocked to NVIDIA, the undisputed leader in training compute. The bet paid off handsomely—NVIDIA stock skyrocketed as its GPUs powered the training of large language models (LLMs) like GPT, Claude, and others.
But now, the spotlight shifts to Inference.
Inference is where AI gets its smarts—thinking and responding in real-time. It’s what happens when you ask ChatGPT a question and get an answer back. If training is about building the playbook, inference is about running the plays.
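To make that concrete, here is what a single inference call looks like in practice, a minimal sketch using OpenAI's Python client (the model name and prompt are placeholders, not a recommendation):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Training already happened, at enormous one-time cost, long before this call.
# Inference is this step: the trained model "runs the play" on your question.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name; any chat-capable model works
    messages=[{"role": "user", "content": "Explain inference in one sentence."}],
)
print(response.choices[0].message.content)
```

Every one of those calls burns compute, and unlike training, the bill never stops.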
The Inference Era: Why It Matters
Inference is quickly becoming the bottleneck for scaling AI. Training is a one-time cost: feed the model data and teach it patterns. Inference, which covers retrieving, reasoning, and delivering answers, is a cost paid on every single query, and it is where the real computational heavy lifting now happens. AI systems are evolving from answering simple questions to tackling complex problems, delivering insights on par with STEM graduates or PhDs.
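A back-of-envelope calculation shows why. Using the standard rough approximations of ~6·N·D FLOPs to train a model with N parameters on D tokens, and ~2·N FLOPs per generated token at inference, the cumulative inference bill overtakes the one-time training bill surprisingly fast (the figures below are illustrative, not any vendor's actual numbers):

```python
# Back-of-envelope: when does serving a model out-compute training it?
# Standard rough approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs/token.
N = 100e9  # model parameters (illustrative)
D = 2e12   # training tokens (illustrative)

training_flops = 6 * N * D   # one-time cost: ~1.2e24 FLOPs
flops_per_token = 2 * N      # recurring cost: ~2e11 FLOPs per generated token

breakeven_tokens = training_flops / flops_per_token  # simplifies to 3 * D
print(f"Inference overtakes training after ~{breakeven_tokens:.1e} served tokens")
# ~6.0e12 tokens: a popular assistant gets there quickly, and reasoning models
# that "think" for minutes multiply the tokens per query many times over.
```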
Enter OpenAI o1: The Inference Gobbler
On September 12, 2024, a major shift occurred with the launch of o1, OpenAI's first inference-dominant model. Unlike earlier iterations, o1 takes its time, minutes rather than seconds, to generate a response. The tradeoff is worth it: answer quality improves dramatically.
This model redefined what's possible in reasoning-heavy tasks like science and mathematics, signaling that the era of inference had arrived.
o3: Go Ahead and Ask a $3,500 Question
Then came o3, announced just three months later. This model is a powerhouse, capable of reasoning at a level approaching a human STEM PhD. But that capability comes at a price, literally: answering a single, complex query can cost up to $3,500 because of the intense inference compute required.
This isn’t just a technical milestone; it’s an economic shift. The cost of inference is skyrocketing, and the demand for dedicated inference hardware has never been higher.
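To see how one answer could plausibly reach that price range, here is a hedged cost sketch; the sample count, token lengths, and per-token rate are illustrative assumptions, not OpenAI's published pricing:

```python
# Hypothetical cost model for one reasoning-heavy query (all numbers illustrative).
samples_per_query = 1024     # parallel chains of thought sampled
tokens_per_sample = 55_000   # "thinking" tokens generated per chain
price_per_m_tokens = 60.0    # assumed $ per 1M output tokens

total_tokens = samples_per_query * tokens_per_sample
cost = total_tokens / 1e6 * price_per_m_tokens
print(f"~{total_tokens / 1e6:.0f}M tokens -> ~${cost:,.0f} for a single answer")
# ~56M tokens -> ~$3,379, in the ballpark of the figure quoted above.
```

The point isn't the exact numbers; it's that reasoning models multiply tokens per answer by orders of magnitude, and every one of those tokens is inference compute someone has to pay for.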
Training Days: NVIDIA’s Era of Dominance
From 2022 to 2024, NVIDIA ruled the “training era.” Its GPUs powered the massive compute needed to train AI systems, and the company thrived as the backbone of data centers worldwide. NVIDIA’s general-purpose GPUs were versatile and invaluable for the training phase of AI’s development.
But every era has its limits.
Inference Era: Cracks in NVIDIA’s Armor
NVIDIA's GPUs are excellent all-purpose tools, but they're not optimized for the demands of dedicated inference, which calls for chips that are faster, cheaper, and far more energy-efficient than NVIDIA's current offerings.
The New Wave of Inference Leaders
As AI moves into the inference era, specialized hardware companies are stepping up with chips tailored for reasoning-heavy tasks, offering greater speed, efficiency, and cost-effectiveness than general-purpose GPUs like NVIDIA’s.
Here are the frontrunners in dedicated inference compute:
• Positron AI: Claims chips that are 3x faster, 50% cheaper, and use 1/3 the power of NVIDIA's GPUs, with universal software compatibility that integrates into existing AI frameworks; see the quick arithmetic after this list. (Disclosure: I'm considering an investment in this company.)
• Tenstorrent: Building scalable processors tailored for AI inference, Tenstorrent is tackling the computational demands of real-time reasoning and large-scale deployment.
• SambaNova Systems: Known for its custom hardware and software platforms, SambaNova builds an architecture that excels at high-performance inference and enterprise-level AI applications.
• Cerebras Systems: Famous for its wafer-scale engine, Cerebras is delivering unparalleled compute density, making it ideal for inference workloads requiring vast computational power.
• Groq: Specializing in low-latency inference accelerators, Groq optimizes its architecture for real-time decision-making and other time-sensitive applications. (Disclosure: Entanglement LLC, a personal investment of mine, has created software for Groq.)
• NextSilicon: Leveraging software-defined compute accelerators, NextSilicon is redefining how inference is processed in high-performance environments. (Disclosure: I've invested in this through a private Third Point fund.)
• Etched: An emerging player, Etched is developing next-generation chips optimized for transformer-based AI models, targeting highly efficient inference.
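Taking the Positron claims above at face value, the implied efficiency multiples compound quickly. A quick sanity check of the arithmetic (the inputs are the company's claims, not independent benchmarks):

```python
# Implied efficiency multiples from the claimed specs (vendor claims, not benchmarks).
speed_multiple = 3.0     # "3x faster"
cost_fraction = 0.5      # "50% cheaper" -> half the price
power_fraction = 1 / 3   # "1/3 the power"

perf_per_dollar = speed_multiple / cost_fraction  # 6x throughput per dollar
perf_per_watt = speed_multiple / power_fraction   # 9x throughput per watt
print(f"~{perf_per_dollar:.0f}x perf/$ and ~{perf_per_watt:.0f}x perf/W vs. the GPU baseline")
```

If even a fraction of those multiples holds up in production, the economics of serving reasoning models shift decisively toward dedicated silicon.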
Conclusion: Inference Will Define 2025
Inference compute will be the defining theme of 2025.
As AI evolves, the demands for reasoning and real-time intelligence will grow exponentially. The models of the future won’t just answer questions; they’ll deliver insights and solve problems at a level that rivals or surpasses human expertise.
Will these systems exceed the reasoning abilities of human PhDs by 2025? That depends on how you measure it, but my bet is on yes.
Inference is the next big investment opportunity, and it’s happening now.
DS