The AI App Experience Matters More Than Benchmarks Now

Different experiences with app connectors in Claude, Perplexity, and ChatGPT.

I was catching up on different articles after the release of Claude Opus 4.5 earlier this week, and this part from Simon Willison’s blog post about it stood out to me:

I’m not saying the new model isn’t an improvement on Sonnet 4.5—but I can’t say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two.

This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn’t possible before. In the past these have felt a lot more obvious, but today it’s often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.

This is something that I’ve felt every few weeks (with each new model release from the major AI labs) over the past year: if you’re really plugged into this ecosystem, it can be hard to spot meaningful differences between major models on a release-by-release basis. That’s not to say that real progress in intelligence, knowledge, or tool-calling isn’t being made: benchmarks and evaluations performed by established organizations tell a clear story. At the same time, it’s also worth keeping in mind that more companies these days may be optimizing their models for benchmarks to come out on top and, more importantly, that the vast majority of folks don’t have a suite of personal benchmarks to evaluate different models for their workflows. Simon Willison thinks that people who use AI for work should create personalized test suites, which is something I’m going to consider for prompts that I use frequently. I also feel like Ethan Mollick’s advice of picking a reasoning model and checking in every few months to reassess AI progress is probably the best strategy for most people who don’t want to tweak their AI workflows every other week.

As I was thinking about this, I also came across this post by Matt Birchler (paywall, and a highly recommended one, too):

That said, ChatGPT has been the number one app in the App Store basically since it launched a couple of years ago. And as I write this today, Google’s Gemini is number two, and xAI’s Grok is number six. And I think that these are gonna be here to stay. The fact that these are a blank canvas, you can enter basically anything into it and get useful information out of it, has proven incredibly compelling to everyday people. Let’s remember that ChatGPT is not baked into the iPhone. People are actively going to the App Store to download this app. And so, I truly think this chat interface is gonna be with us for a long time because it allows such a variety of functionality that can tailor itself to each individual user.

And this post by Sebastiaan de With (X link):

The only meaningful distinguishing quality in AI right now as models become a commodity is being better at making new / great interfaces.

The OpenAI team is very good at this: https://t.co/D4o1J5QGZh

— Sebastiaan de With (@sdw) November 25, 2025

Both of these ideas have been on my mind a lot: the modern flavor of chatbot UIs clearly resonates with people because the LLM experience has gotten good enough across the board to be useful, especially with the addition of web search and reduced hallucinations; and, since the baseline is now good enough, the app experience and how LLMs are woven into people’s daily lives and workflows will be the differentiators going forward.

As I mentioned on Connected last week, my version of this is that, personally, I “vibe” more with Claude than other LLMs. I prefer its design and interleaved thinking approach; I like that Anthropic doesn’t have image or video generation products (which I find despicable); and, of course, Claude’s ecosystem of app integrations and skills means that I can use it as a new form of non-deterministic automation that lets me work faster. By the same token, that’s why – despite its widely documented advancements – I don’t like chatting or working with Gemini: I’m not a fan of how it responds, its chat UI, and its lack of app connectors.

At the same time, I also recognize that OpenAI knows how to design polished interactions for hundreds of millions of people (their voice mode is unparalleled, and I have high hopes for ChatGPT Apps). And I also like to complement Claude’s lackluster web search features with Perplexity and ChatGPT Pro (Silvia and I share a team plan as a “fake” family subscription, and we have limited access to the Pro model), both of which can reference more sources when I’m doing deep research on any given topic.

Which brings me to my takeaway: from my perspective, despite a stronger baseline, there is still no single LLM that “does it all” these days. Beyond my nerdy experiments, I generally alternate between two modes: Claude for most work-related tasks, and either ChatGPT or Perplexity for web search. The combination of these two modalities gives me everything I need from modern LLMs, and it provides me with the mix of performance, design, and app integrations I like best.

Ultimately, choosing among the LLMs at the frontier of AI right now is a highly subjective matter that comes down to cost, workflow, app ecosystem, design, and, yes, pure “vibes”. Personally, I try to avoid fixating on benchmarks and instead prioritize these qualities in my decisions.

It’s hard to differentiate between recent releases of major LLMs if you look at benchmarks alone. However, if you consider factors beyond coding benchmarks and abstract numbers, there are plenty of practical differences between the major AI apps right now that are worth exploring and judging for yourself.
