
Is my LLM getting dumber, or is it just me?

November 2025


Users have long complained that GPT models seem to get worse over time. OpenAI has denied this. Who's right?

I designed 43 tests and ran them repeatedly over several months. The results are clear: model performance does drift over time, sometimes significantly.

The methodology

I created a battery of 43 standardized prompts covering seven categories: instruction following, arithmetic, letter counting, logical reasoning, code generation, factual recall, and multi-step tasks. Each test has an objectively correct answer. I ran these tests repeatedly across different GPT model versions.

This is not a perception study. It's measurement. Either the model gets the answer right or it doesn't.
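
To make the setup concrete, here is a minimal sketch of the kind of harness this involves. The prompts, expected answers, and model strings below are illustrative placeholders, not my actual 43-test battery, and the sketch assumes the openai Python package (v1+) with an API key in the environment.

```python
# Minimal sketch of a binary-graded test harness (illustrative, not the real battery).
# Assumes the openai Python package (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# Each test is a prompt plus one exact expected answer, so grading is pass/fail.
TESTS = [
    {"prompt": "Compute (999 - 456) * 3. Respond with the integer only.", "expected": "1629"},
    {"prompt": "How many times does the letter 'r' appear in 'strawberry'? Respond with the integer only.", "expected": "3"},
]

def run_battery(model: str) -> float:
    """Run every test against one model version and return the pass rate."""
    passed = 0
    for test in TESTS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": test["prompt"]}],
            temperature=0,  # reduce run-to-run noise
        )
        answer = (resp.choices[0].message.content or "").strip()
        # Exact match only: "The answer is 1629" counts as a failure.
        passed += answer == test["expected"]
    return passed / len(TESTS)

if __name__ == "__main__":
    for model in ["gpt-4-0613", "gpt-4-turbo", "gpt-4o"]:  # placeholder version strings
        print(model, run_battery(model))
```

The scoring is deliberately binary, exact match or fail, which is what makes drift measurable across runs.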

What I found

The results form a clear U-shaped pattern:

Model                       Accuracy   Notes
GPT-4 (June 2023)           86%        Original release
GPT-4-turbo                 46.5%      40-point drop
GPT-4o variants             63-70%     Partial recovery
GPT-5 (August 2025)         86%        Full recovery
GPT-5.1 (November 2025)     95.3%      New peak

Accuracy across GPT model versions shows a distinct U-shaped pattern

The pattern is striking. Performance dropped dramatically with GPT-4-turbo, partially recovered with GPT-4o, and only returned to original levels with GPT-5.

The U-shaped recovery pattern when viewed chronologically

Why this happens

Several mechanisms could explain the degradation: cost-driven optimizations such as quantization or distillation, safety fine-tuning that changes how models respond, or silent routing to smaller underlying models. My tests cannot distinguish between them, but the failure pattern offers a hint.

Critical insight: The regression is primarily in instruction following, not knowledge. The models often knew the correct answer but wrapped it in explanatory text when exact formatting was requested. This causes practical failures in structured output tasks.

When I asked models to "compute (999 minus 456) times 3 and respond with the integer only," GPT-4-turbo often provided correct math but returned something like "The answer is 1,629" instead of just "1629". The model knew the answer but failed the actual task.
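
A simple way to separate those two failure modes is to grade each response twice: strictly (did the reply match the requested format exactly?) and leniently (does the correct number appear anywhere in the reply?). This is an illustrative sketch, not my exact scoring code.

```python
import re

def grade(raw_output: str, expected: str) -> dict:
    """Separate 'knew the answer' from 'followed the instruction'."""
    # Strict: the reply must be exactly the requested string.
    followed_format = raw_output.strip() == expected
    # Lenient: strip thousands separators and take the last integer mentioned.
    numbers = re.findall(r"-?\d+", raw_output.replace(",", ""))
    knew_answer = bool(numbers) and numbers[-1] == expected
    return {"followed_format": followed_format, "knew_answer": knew_answer}

# "The answer is 1,629" -> knew_answer=True, followed_format=False
print(grade("The answer is 1,629", "1629"))
```

Scored this way, much of GPT-4-turbo's drop shows up in the strict column rather than the lenient one, which is exactly the instruction-following regression described above.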

The Pachinko problem

This reminds me of Japanese pachinko machines that offer early wins to hook players, then quietly tighten the odds. Cloud AI services can change in the same way while users remain unaware: you sign up for a capability, and that capability may silently shift.

The exact mechanism matters less than the implication: you cannot assume consistent performance from a model over time. What worked last month may not work this month.

Business implications

For regulated industries, this matters acutely.

The governance gap is real: organizations documented and validated processes against "GPT-4" without realizing that the label could point to different underlying models at different times.

What should you do?

  1. Pin specific model versions. Use API parameters to lock to exact, dated model snapshots rather than aliases like "gpt-4" that may point to different underlying models over time.
  2. Build regression test suites. Implement automated testing for any production AI system, and track performance over time, not just at deployment (see the sketch after this list).
  3. Require change notifications. If you're negotiating enterprise AI contracts, require vendors to notify you before model changes that could affect outputs.
  4. Document with version specificity. Record which exact model version produced which outputs for audit trails. "GPT-4" is not specific enough.
  5. Establish AI governance committees. Treat model risk like you would any other technology risk. Silent model changes should trigger review processes.
  6. Consider self-hosting. For stability-critical applications, open-source models that you control may provide more consistent behavior than cloud services that change without notice.
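
As a concrete illustration of items 1 and 2, here is a hedged sketch of a scheduled regression check against a pinned, dated snapshot. The snapshot name, baseline threshold, and harness module are assumptions for the example, not recommendations; use whatever snapshot identifiers your provider currently publishes.

```python
# Sketch of items 1 and 2: pin a dated snapshot and assert a minimum pass rate
# on a schedule (e.g. nightly CI via pytest). Assumes the run_battery() harness
# sketched earlier has been saved as harness.py; names and numbers are examples.
from harness import run_battery

PINNED_MODEL = "gpt-4o-2024-08-06"   # dated snapshot, not the moving "gpt-4o" alias
BASELINE_PASS_RATE = 0.85            # measured when the workflow was validated

def test_pinned_model_has_not_regressed():
    pass_rate = run_battery(PINNED_MODEL)
    assert pass_rate >= BASELINE_PASS_RATE, (
        f"{PINNED_MODEL} scored {pass_rate:.0%}, below baseline {BASELINE_PASS_RATE:.0%}"
    )
```

Running this on a schedule turns "the model feels dumber" into a failing test with a timestamp, which is far easier to escalate to a vendor or an audit committee.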

The bottom line

My data do not prove that OpenAI deliberately degraded their models. The changes may reflect safety improvements or architectural shifts rather than intentional cost-cutting.

But intent doesn't change the practical reality: if you built workflows assuming stable model behavior, those assumptions may be wrong. Teams using GenAI need continuous evaluation, not blind trust in a model label.

Key takeaway: Model capability is not static. The U-shaped recovery pattern suggests capabilities eventually improve, but the governance gap between what you validated and what runs in production remains unresolved. Monitor, don't assume.
