The Real World Performance of Large Language Models
In the race to adopt AI, it’s easy to focus on what a Large Language Model (LLM) can do. But for your business, your users, and your bottom line, the question of how fast and reliably it can do it is just as critical.
With the rise of Agentic AI—where models navigate systems, write code, and execute complex workflows—performance has become the single most significant factor in a successful implementation.
Poor performance can frustrate users, cripple productivity, and turn a promising AI tool into a frustrating bottleneck. This guide cuts through the noise to give you a practical understanding of what performance really means, with real-world benchmarks and a look at the trade-offs you need to consider.
Understanding Performance: What are Tokens per Second (tps)?
The primary metric for LLM performance is tokens per second (tps). A "token" is a unit of text, roughly equivalent to ¾ of a word. A higher tps means a faster, more fluid stream of text generation.
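To make the numbers concrete, here is a rough back-of-envelope calculation. It is only a sketch: the ¾-words-per-token ratio above is an approximation, and real tokenisers vary by language and content.

```python
# Rough estimate: how long does a response take at a given tps?
# Assumes ~0.75 words per token, per the approximation above.

def generation_time_seconds(word_count: int, tps: float) -> float:
    """Estimate generation time for a response of `word_count` words."""
    tokens = word_count / 0.75   # words -> tokens (approximate)
    return tokens / tps          # tokens / (tokens per second)

# A 300-word answer at 100 tps (typical frontier-model speed):
print(round(generation_time_seconds(300, 100), 1))  # 4.0 seconds
```

At 100 tps a typical paragraph appears in a few seconds; at 3,000 tps the same paragraph is effectively instant.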
However, not all tps are created equal, and focusing solely on this metric can be misleading.
Latency vs. Throughput:
- Latency (Time-to-First-Token) is the time it takes from submitting a prompt to receiving the first word. Low latency is crucial for Voice Mode (like GPT-5’s native audio) or real-time customer service. Users expect instant responses.
- Throughput (Total Volume) is the number of tokens processed per second. High throughput is essential for Agents. When an Agent reads a 200-page PDF to extract data, you don't care about the first word; you care about how fast it can crunch the whole document.
These two factors are often in tension: maximising throughput (by processing many requests in parallel) can increase latency for individual users, and vice versa. The right balance depends on your use case.
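The distinction is easy to measure yourself. The sketch below simulates a streaming response (the generator is a stand-in, not a real API) and records both latency (time-to-first-token) and throughput from the same stream:

```python
import time

def stream_tokens(n_tokens: int, delay: float):
    """Stand-in for a streaming LLM response (illustrative only)."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

def measure(stream):
    """Return (time-to-first-token, throughput in tps) for a token stream."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # latency: time to first token
        count += 1
    total = time.monotonic() - start
    return ttft, count / total               # throughput over the whole stream

ttft, tps = measure(stream_tokens(50, 0.01))
print(f"TTFT: {ttft:.3f}s, throughput: {tps:.0f} tps")
```

The same measurement pattern works against any real streaming endpoint: wrap the response iterator, time the first chunk, then divide total tokens by total elapsed time.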
What Do TPS Figures Actually Mean?
LLM providers often report tps numbers, but these can refer to different things:
- API Performance: The raw speed at which a cloud service (like Azure AI, the OpenAI API, or Groq) can process requests. Specialised inference providers such as Groq and Cerebras now routinely demonstrate 3,000+ tps on models like Llama 4-8B. This is "blink and you miss it" speed, ideal for real-time translation or instant UI updates.
- Frontier Model Performance: Heavyweight models (GPT-5, Claude 4.5 Opus) typically run slower, often around 80–120 tps. This is faster than human reading speed but slower than the "instant" feeling of smaller models.
- Enterprise Server Load: Modern hardware (like NVIDIA’s Blackwell B200 chips) allows enterprise servers to handle thousands of concurrent requests without slowing down, maintaining high throughput even during peak hours.
- User Interface (UI) Performance: The speed you experience in a web interface like the consumer version of ChatGPT. This is often deliberately slower and isn't a technical limitation; it’s a design choice to make the interaction feel more natural and readable, preventing a giant wall of text from appearing instantly.
For most business applications that we build, we care about the raw API performance, as this determines the true speed of the underlying workflow.
Real-World Benchmarks: What to Expect from Your Hardware
The hardware you choose defines your capabilities. A "fast" model on old hardware is a slow model in practice. Below is the current performance landscape at the time of writing:
| Hardware Tier | Primary Technology | Typical Speed (tps) | Best Use Case |
| --- | --- | --- | --- |
| Ultra-Inference Cloud | LPUs (Groq / Cerebras) | 2,500 – 4,000+ | Real-time translation, instant UI generation, high-speed browsing agents |
| Enterprise Server | NVIDIA Blackwell (B200) | 800 – 1,200 | Massive context windows (1M+ tokens), multi-user enterprise RAG, high-accuracy reasoning |
| Workstation / AI PC | Dual RTX 5090 / Mac M5 Max | 150 – 250 | Private coding assistants, sensitive data processing, local agents |
| Consumer Mobile | Apple M5 / Snapdragon G4 | 40 – 70 | On-device personal assistants, basic email drafting, offline smart features |
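To see what these tiers mean in practice, consider the 200-page PDF example from earlier. The sketch below is a crude comparison: it assumes ~500 words per page and that the table's generation speeds apply to the whole job (both assumptions, not measurements).

```python
# Estimated time for an agent to crunch a 200-page document at each tier.
PAGES, WORDS_PER_PAGE, WORDS_PER_TOKEN = 200, 500, 0.75
tokens = PAGES * WORDS_PER_PAGE / WORDS_PER_TOKEN  # ~133k tokens

tiers = {  # representative tps figures from the table above
    "Ultra-Inference Cloud": 3000,
    "Enterprise Server": 1000,
    "Workstation / AI PC": 200,
    "Consumer Mobile": 50,
}

for name, tps in tiers.items():
    print(f"{name}: {tokens / tps / 60:.1f} minutes")
```

The spread is the point: the same job that takes under a minute on an ultra-inference cloud takes the better part of an hour on a phone.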
"Reasoning Time"
The biggest shift since 2025 is the widespread adoption of Reasoning Models (like OpenAI’s o3, Google’s Gemini 3, and Anthropic’s Claude 4.5).
These models don't just speak; they think first.
- The "Thinking" Pause: When you ask GPT-5 a complex question, it might pause for 2–10 seconds to plan its approach. This is not "lag"; it is the model generating invisible "reasoning tokens."
- The Trade-off: You are trading latency for intelligence. For a complex coding task, a 10-second wait is acceptable if the resulting code is bug-free. For a customer service chat, it is not.
- Key Takeaway: Do not judge a Reasoning Model by its speed. Judge it by its success rate.
The Optimisation Trade-Off: Quantization
To improve performance on local hardware or private clouds, models are often “quantized.”
In simple terms, quantization reduces the precision of the mathematical calculations within the model, making it smaller and faster. It’s like saving a high-resolution photo as a lower-quality JPEG to reduce the file size.
- The Impact: Quantization can dramatically increase tps, often doubling performance or more.
- The Cost: Accuracy can degrade at the margins. A 4-bit quantized Llama 4 is often indistinguishable from the full-precision version for standard tasks, but for complex reasoning or highly technical coding, full precision is still recommended to avoid subtle logic errors.
Choosing the right level of quantization is a technical decision that requires balancing your specific needs for speed versus accuracy.
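The core idea is simple enough to show in a few lines. This toy sketch rounds each weight to one of 2ⁿ evenly spaced levels; real quantization schemes (per-channel scales, group-wise formats like GPTQ or AWQ) are far more sophisticated, but the precision-for-size trade is the same:

```python
def quantize(weights, bits=4):
    """Toy n-bit quantization: snap each float weight to the nearest of
    2**bits evenly spaced levels spanning the weight range."""
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]

w = [0.12, -0.57, 0.33, 0.9, -0.21]
q4 = quantize(w, bits=4)  # 16 levels: close to the originals
q2 = quantize(w, bits=2)  # only 4 levels: visibly lossy
print(max(abs(a - b) for a, b in zip(w, q4)))
print(max(abs(a - b) for a, b in zip(w, q2)))
```

Fewer bits means fewer representable values, so each weight lands further from where it started; across billions of weights, those small errors are what occasionally surface as subtle logic mistakes.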
The Road Ahead
The gap between "fast, cheap models" and "smart, slow models" is widening.
- Small Models (Llama 4, GPT-5-mini) will run instantly, potentially even on your laptop or phone (NPU), handling 90% of daily tasks.
- Reasoning Models (GPT-5, o3) will live in the cloud, acting as "expensive consultants" that you call only when you need deep problem-solving.
The key to a high-performance AI strategy in 2026 is Routing: building a system that automatically sends easy tasks to the fast models and hard tasks to the smart ones. This gives you the speed of a sprinter with the brain of a professor.