Posts in notes

The AI App Experience Matters More Than Benchmarks Now

Different experiences with app connectors in Claude, Perplexity, and ChatGPT.

I was catching up on different articles after the release of Claude Opus 4.5 earlier this week, and this part from Simon Willison’s blog post about it stood out to me:

I’m not saying the new model isn’t an improvement on Sonnet 4.5—but I can’t say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two.

This represents a growing problem for me. My favorite moments in AI are when a new model gives me the ability to do something that simply wasn’t possible before. In the past these have felt a lot more obvious, but today it’s often very difficult to find concrete examples that differentiate the new generation of models from their predecessors.

This is something I’ve felt every few weeks over the past year, with each new model release from the major AI labs: if you’re really plugged into this ecosystem, it can be hard to spot meaningful differences between major models on a release-by-release basis. That’s not to say that real progress in intelligence, knowledge, or tool-calling isn’t being made: benchmarks and evaluations performed by established organizations tell a clear story. At the same time, it’s worth keeping in mind that many companies these days may be optimizing their models specifically to come out on top on benchmarks and, more importantly, that the vast majority of folks don’t have a suite of personal benchmarks to evaluate different models for their workflows.

Simon Willison thinks that people who use AI for work should create personalized test suites, which is something I’m going to consider for prompts I use frequently. I also feel like Ethan Mollick’s advice of picking a reasoning model and checking in every few months to reassess AI progress is probably the best strategy for most people who don’t want to tweak their AI workflows every other week.
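
To make this concrete, here’s a minimal sketch of what a personal test suite could look like, assuming the anthropic Python SDK and an API key in the environment; the folder of saved prompts and the two model IDs are purely illustrative, and the final comparison is still done by hand.

```python
# Minimal personal eval harness: replay saved prompts against two models and
# save the outputs side by side for manual comparison. Assumes the `anthropic`
# Python SDK and ANTHROPIC_API_KEY in the environment; prompt files and model
# IDs below are illustrative, not a recommendation.
import json
from pathlib import Path
import anthropic

client = anthropic.Anthropic()
MODELS = ["claude-sonnet-4-5", "claude-opus-4-5"]  # illustrative model IDs
PROMPTS_DIR = Path("my-prompts")                   # one .txt file per saved prompt

results = {}
for prompt_file in sorted(PROMPTS_DIR.glob("*.txt")):
    prompt = prompt_file.read_text()
    results[prompt_file.name] = {}
    for model in MODELS:
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        # Keep only the text of the first content block for quick comparison.
        results[prompt_file.name][model] = response.content[0].text

Path("eval-results.json").write_text(json.dumps(results, indent=2))
```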

Read more


Trying to Make Sense of the Rumored, Gemini-Powered Siri Overhaul

Quite the scoop from Mark Gurman yesterday on what Apple is planning for major Siri improvements in 2026:

Apple Inc. is planning to pay about $1 billion a year for an ultrapowerful 1.2 trillion parameter artificial intelligence model developed by Alphabet Inc.’s Google that would help run its long-promised overhaul of the Siri voice assistant, according to people with knowledge of the matter.

There is a lot to unpack here and I have a lot of questions.

Read more


On MiniMax M2 and LLMs with Interleaved Thinking Steps

MiniMax M2 with interleaved thinking steps and tools in TypingMind.

In addition to Kimi K2 (which I recently wrote about here) and GLM-4.6 (which will become an option on Cerebras in a few days, when I’ll play around with it), one of the more interesting open-source LLM releases out of China lately is MiniMax M2. This MoE model (230B parameters, 10B activated at any given time) claims to reach 90% of the performance of Sonnet 4.5…at 8% the cost. You can read more about the model here; Simon Willison blogged about it here; you can also test it with MLX on an Apple silicon Mac.
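
If you’re curious about the MLX route, a first local run might look roughly like this; it assumes the mlx-lm package and a community quantization of M2 on Hugging Face, and the repo name below is a guess (a 230B-parameter model needs a serious amount of unified memory even at 4-bit).

```python
# Rough sketch of trying MiniMax M2 locally with MLX on an Apple silicon Mac.
# Requires `pip install mlx-lm`; the model repo name is a guess at a community
# quantization, so treat this as illustrative rather than a recipe.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/MiniMax-M2-4bit")  # hypothetical repo

prompt = "Summarize the trade-offs of mixture-of-experts models in two sentences."
text = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(text)
```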

What I find especially interesting about M2 is that it’s the first open-weights model to support interleaved thinking steps in between responses and tool calls, something Anthropic pioneered with Claude Sonnet 4 back in May. Here’s Skyler Miao, head of engineering at MiniMax, in a post on X (unfortunately, most of the open-source AI community is only active there):

As we work more closely with partners, we’ve been surprised how poorly community support interleaved thinking, which is crucial for long, complex agentic tasks. Sonnet 4 introduced it 5 months ago, but adoption is still limited.

We think it’s one of the most important features for agentic models: it makes great use of test-time compute.

The model can reason after each tool call, especially when tool outputs are unexpected. That’s often the hardest part of agentic jobs: you can’t predict what the env returns. With interleaved thinking, the model could reason after get tool outputs, and try to find out a better solution.

We’re now working with partners to enable interleaved thinking in M2 — and hopefully across all capable models.

I’ve been using Claude as my main “production” LLM for the past few months and, as I’ve shared before, I consider the fact that both Sonnet and Haiku think between steps an essential aspect of their agentic nature and integration with third-party apps.

That being said, I have been testing MiniMax M2 on TypingMind in addition to Kimi K2 for the past week and it is, indeed, impressive. I plugged MiniMax M2 into TypingMind using their Anthropic-compatible endpoint; out of the box, the model worked with interleaved thinking and the several plugins I’ve built for myself in TypingMind using Claude. I haven’t used M2 for any vibe-coding tasks yet, but for other research or tool-based queries (like adding notes to Notion and tasks to Todoist), M2 effectively felt like a version of Sonnet not made by Anthropic.
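
For context on what that Anthropic-compatible setup involves outside of TypingMind, here’s a rough sketch using the official anthropic Python SDK pointed at a third-party base URL; the endpoint, API key variable, and model name for MiniMax M2 are assumptions on my part, and whether a provider honors Anthropic’s interleaved-thinking beta header is up to that provider.

```python
# Sketch of calling an Anthropic-compatible endpoint with extended thinking and
# tools, so the model can emit thinking blocks between tool calls. The base URL,
# API key variable, and model name are assumptions for MiniMax M2; the beta
# header is Anthropic's documented flag for interleaved thinking.
import os
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.minimax.io/anthropic",  # assumed compatible endpoint
    api_key=os.environ["MINIMAX_API_KEY"],        # hypothetical env variable
)

response = client.messages.create(
    model="MiniMax-M2",  # assumed model identifier
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    tools=[{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    messages=[{"role": "user", "content": "What's the weather in Rome right now?"}],
)

# With interleaved thinking, the content list can alternate between "thinking",
# "text", and "tool_use" blocks instead of thinking only once up front.
for block in response.content:
    print(block.type)
```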

Right now, MiniMax M2 isn’t hosted on any of the fast inference providers; I’ve accessed it via the official MiniMax API endpoint, whose inference speed isn’t that different from Anthropic’s cloud. The possibility of MiniMax M2 on Cerebras or Groq is extremely fascinating, and I hope it’s in the cards for the near future.


AI Experiments: Fast Inference with Groq and Third-Party Tools with Kimi K2 in TypingMind

Kimi K2, hosted on Groq, running in TypingMind with a custom plugin I made.

I’ll talk about this more in depth in Monday’s episode of AppStories (if you’re a Plus subscriber, it’ll be out on Sunday), but I wanted to post a quick note on the site to show off what I’ve been experimenting with this week. I started playing around with TypingMind, a web-based wrapper for all kinds of LLMs (from any provider you want to use), and, in the process, ended up recreating parts of my Claude setup with third-party apps…at a much, much higher speed. Here, let me show you with a video:

Kimi K2 hosted on Groq on the left.
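
The video uses TypingMind, but the raw speed comes through even with a plain API call. Here’s a rough sketch against Groq’s OpenAI-compatible endpoint that times a single response; the Kimi K2 model identifier is my best guess at how Groq lists it.

```python
# Sketch of calling Kimi K2 on Groq's OpenAI-compatible endpoint and timing the
# response. Assumes the `openai` Python SDK and a GROQ_API_KEY in the
# environment; the model identifier is a guess at Groq's listing for Kimi K2.
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Give me three ideas for automating my notes app."}],
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s ({tokens / elapsed:.0f} tok/s)")
```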

Read more


Anthropic Releases Haiku 4.5: Sonnet 4 Performance, Twice as Fast

Earlier today, Anthropic released Haiku 4.5, a new version of their “small and fast” model that matches Sonnet 4 performance from five months ago at a fraction of the cost and twice the speed. From their announcement:

What was recently at the frontier is now cheaper and faster. Five months ago, Claude Sonnet 4 was a state-of-the-art model. Today, Claude Haiku 4.5 gives you similar levels of coding performance but at one-third the cost and more than twice the speed.

And:

Claude Sonnet 4.5, released two weeks ago, remains our frontier model and the best coding model in the world. Claude Haiku 4.5 gives users a new option for when they want near-frontier performance with much greater cost-efficiency. It also opens up new ways of using our models together. For example, Sonnet 4.5 can break down a complex problem into multi-step plans, then orchestrate a team of multiple Haiku 4.5s to complete subtasks in parallel.

I’m not a programmer, so I’m not particularly interested in benchmarks for coding tasks and Claude Code integrations. However, as I explained in this Plus segment of AppStories for members, I’m very keen to play around with fast models that considerably reduce inference times and allow for quicker back and forth in conversations. As I detailed on AppStories, I’ve had a solid experience using Cerebras and Bolt for Mac to generate responses at over 1,000 tokens per second.

I have a personal test that I like to try with all modern LLMs that support MCP: how quickly they can append the word “Test” to my daily note in Notion. Based on a few experiments I ran earlier today, Haiku 4.5 seems to be the new state of the art for both following instructions and speed in this simple test.

I ran my tests with LLMs that support MCP-based connectors: Claude and Mistral. Both were given system-level instructions on how to access my daily notes: Claude had the details in its profile personalization screen; in Mistral, I created a dedicated agent with Notion instructions. So, all things being equal, here’s how long it took three different, non-thinking models to run my command:

  • Mistral: 37 seconds
  • Claude Sonnet 4.5: 47 seconds
  • Claude Haiku 4.5: 18 seconds

That is a drastic latency reduction compared to Sonnet 4.5, and it’s especially impressive when you consider that Mistral uses Flash Answers, which is fast inference powered by Cerebras. As I shared on AppStories, this seems to confirm that it’s possible to have speed and reliability for agentic tool-calling without having to use a large model.
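
My numbers come from the Claude and Mistral apps, but if you wanted to approximate a similar comparison programmatically, a sketch against the Anthropic API’s hosted MCP connector beta could look like the following; the Notion MCP URL, the token variable, the prompt, and the model IDs are assumptions, and absolute timings won’t match the in-app experience.

```python
# Sketch of timing the same tool-calling request across two Claude models using
# the Anthropic API's hosted MCP connector beta. The Notion MCP URL, the auth
# token variable, and the model IDs are assumptions; the goal is only to show
# how one request could be timed against different models.
import os
import time
import anthropic

client = anthropic.Anthropic()

def timed_run(model: str) -> float:
    start = time.perf_counter()
    client.beta.messages.create(
        model=model,
        max_tokens=1024,
        betas=["mcp-client-2025-04-04"],
        mcp_servers=[{
            "type": "url",
            "url": "https://mcp.notion.com/mcp",                     # assumed hosted Notion MCP
            "name": "notion",
            "authorization_token": os.environ["NOTION_MCP_TOKEN"],   # hypothetical env variable
        }],
        messages=[{"role": "user", "content": 'Append the word "Test" to my daily note in Notion.'}],
    )
    return time.perf_counter() - start

for model in ["claude-haiku-4-5", "claude-sonnet-4-5"]:
    print(model, f"{timed_run(model):.1f}s")
```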

I ran other tests with Haiku 4.5 and the Todoist MCP and, similarly, I was able to mark tasks as completed and reschedule them in seconds, with none of the latency I previously observed in Sonnet 4.5 and Opus 4.1. As it stands now, if you’re interested in using LLMs with apps and connectors without having to wait around too long for responses and actions, Haiku 4.5 is the model to try.


LLMs As Conduits for Data Portability Between Apps

One of the unsung benefits of modern LLMs – especially those with MCP support or proprietary app integrations – is their inherent ability to facilitate data transfer between apps and services that use different data formats.

This is something I’ve been pondering for the past few months, and the latest episode of Cortex – where Myke wished it were possible to move between task managers like you can with email clients – was the push I needed to write something up. I’ve personally tried multiple versions of this concept with different LLMs, and the end result was always the same: I didn’t have to write a single line of code to create import/export functionalities that two services I wanted to use didn’t support out of the box.

Read more


Testing Claude’s Native Integration with Reminders and Calendar on iOS and iPadOS

Reminders created by Claude for iOS after a series of web searches.

A few months ago, when Perplexity unveiled their voice assistant integrated with native iOS frameworks, I wrote that I was surprised no other major AI lab had shipped a similar feature in its iOS apps:

The most important point about this feature is the fact that, in hindsight, this is so obvious and I’m surprised that OpenAI still hasn’t shipped the same feature for their incredibly popular ChatGPT voice mode. Perplexity’s iOS voice assistant isn’t using any “secret” tricks or hidden APIs: they’re simply integrating with existing frameworks and APIs that any third-party iOS developer can already work with. They’re leveraging EventKit for reminder/calendar event retrieval and creation; they’re using MapKit to load inline snippets of Apple Maps locations; they’re using Mail’s native compose sheet and Safari View Controller to let users send pre-filled emails or browse webpages manually; they’re integrating with MusicKit to play songs from Apple Music, provided that you have the Music app installed and an active subscription. Theoretically, there is nothing stopping Perplexity from rolling additional frameworks such as ShazamKit, Image Playground, WeatherKit, the clipboard, or even photo library access into their voice assistant. Perplexity hasn’t found a “loophole” to replicate Siri functionalities; they were just the first major AI company to do so.

It’s been a few months since Perplexity rolled out their iOS assistant, and, so far, the company has chosen to keep the iOS integrations exclusive to voice mode; you can’t have text conversations with Perplexity on iPhone and iPad and ask it to look at your reminders or calendar events.

Anthropic, however, has now done so, becoming – to the best of my knowledge – the second major AI lab to plug directly into Apple’s native iOS and iPadOS frameworks, with an important twist: in the latest version of Claude, you can have text conversations and tell the model to look into your Reminders database or Calendar app without having to use voice mode.

Read more



I Have Many Questions About Apple’s Updated Foundation Models and the (Great) ‘Use Model’ Action in Shortcuts

Apple’s ‘Use Model’ action in Shortcuts.

I mentioned this on AppStories during the week of WWDC: I think Apple’s new ‘Use Model’ action in Shortcuts for iOS/iPadOS/macOS 26, which lets you prompt either the local or cloud-based Apple Foundation models, is Apple Intelligence’s best and most exciting new feature for power users this year. This blog post is a way for me to better explain why as well as publicly investigate some aspects of the updated Foundation models that I don’t fully understand yet.

Read more