Posts tagged with "artificial intelligence"

I Finally Tested the M5 iPad Pro’s Neural-Accelerated AI, and the Hype Is Real

The M5 iPad Pro.

The best kind of follow-up article isn’t one that clarifies a topic that someone got wrong (although I do love that, especially when that “someone” isn’t me); it’s one that provides more context to a story that was incomplete. My M5 iPad Pro review was an incomplete narrative. As you may recall, I was unable to test Apple’s claimed 3.5× improvement in local AI processing enabled by the new Neural Accelerators built into the M5’s GPU. It’s not that I didn’t believe Apple’s numbers. I simply couldn’t test them myself due to the early nature of the software and the timing of my embargo.

Well, I was finally able to test local AI performance with a pre-release version of MLX optimized for M5, and let me tell you: not only is the hype real, but the numbers I got from my extensive tests over the past two weeks actually exceed Apple’s claims.

Read more


Trying to Make Sense of the Rumored, Gemini-Powered Siri Overhaul

Quite the scoop from Mark Gurman yesterday on what Apple is planning for major Siri improvements in 2026:

Apple Inc. is planning to pay about $1 billion a year for an ultrapowerful 1.2 trillion parameter artificial intelligence model developed by Alphabet Inc.’s Google that would help run its long-promised overhaul of the Siri voice assistant, according to people with knowledge of the matter.

There is a lot to unpack here and I have a lot of questions.

Read more


On MiniMax M2 and LLMs with Interleaved Thinking Steps

MiniMax M2 with interleaved thinking steps and tools in TypingMind.

In addition to Kimi K2 (which I recently wrote about here) and GLM-4.6 (which will become an option on Cerebras in a few days, when I’ll play around with it), one of the more interesting open-source LLM releases out of China lately is MiniMax M2. This MoE model (230B parameters, 10B activated at any given time) claims to reach 90% of the performance of Sonnet 4.5…at 8% the cost. You can read more about the model here; Simon Willison blogged about it here; you can also test it with MLX on an Apple silicon Mac.

What I find especially interesting about M2 is that it’s the first open-weight model to support interleaved thinking steps between responses and tool calls, which is something that Anthropic pioneered with Claude Sonnet 4 back in May. Here’s Skyler Miao, head of engineering at MiniMax, in a post on X (unfortunately, most of the open-source AI community is only active there):

As we work more closely with partners, we’ve been surprised [by] how poorly [the] community support[s] interleaved thinking, which is crucial for long, complex agentic tasks. Sonnet 4 introduced it 5 months ago, but adoption is still limited.

We think it’s one of the most important features for agentic models: it makes great use of test-time compute.

The model can reason after each tool call, especially when tool outputs are unexpected. That’s often the hardest part of agentic jobs: you can’t predict what the env returns. With interleaved thinking, the model could reason after [getting] tool outputs, and try to find a better solution.

We’re now working with partners to enable interleaved thinking in M2 — and hopefully across all capable models.
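To make that mechanism concrete, here’s a minimal sketch of an interleaved-thinking request through Anthropic’s Messages API in Python. The model ID, the beta header string, and the toy weather tool are all assumptions on my part for illustration; check Anthropic’s documentation for the current values.

```python
# Sketch only: shows where "thinking" blocks can appear between tool calls.
# Assumptions: the model ID, the beta header string, and the toy tool below.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_weather",  # hypothetical tool for illustration
    "description": "Return the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-5",                            # assumed model ID
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},  # extended thinking on
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},  # assumed beta flag name
    tools=tools,
    messages=[{"role": "user", "content": "Should I bring an umbrella in Rome today?"}],
)

# With interleaved thinking, the content blocks can alternate between
# "thinking" and "tool_use" instead of all the reasoning happening up front;
# after you send back a tool_result, the next turn can open with more thinking.
for block in response.content:
    print(block.type)
```

The point is the alternation: each tool result can be followed by a fresh thinking block before the next tool call, which is exactly the “reason after getting tool outputs” behavior Miao describes.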

I’ve been using Claude as my main “production” LLM for the past few months and, as I’ve shared before, I consider the fact that both Sonnet and Haiku think between steps an essential aspect of their agentic nature and integration with third-party apps.

That being said, I have been testing MiniMax M2 on TypingMind in addition to Kimi K2 for the past week and it is, indeed, impressive. I plugged MiniMax M2 into TypingMind using their Anthropic-compatible endpoint; out of the box, the model worked with interleaved thinking and the several plugins I’ve built for myself in TypingMind using Claude. I haven’t used M2 for any vibe-coding tasks yet, but for other research or tool-based queries (like adding notes to Notion and tasks to Todoist), M2 effectively felt like a version of Sonnet not made by Anthropic.
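The same trick works outside of TypingMind, too, since the Anthropic SDK lets you override its base URL. Here’s a minimal sketch; the endpoint URL and model identifier below are placeholders I’m assuming for illustration, so check MiniMax’s documentation for the real values.

```python
# Sketch: pointing the Anthropic Python SDK at an Anthropic-compatible endpoint.
# The base_url and model name are assumptions; use the values from MiniMax's docs.
import os
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.minimax.example/anthropic",  # placeholder endpoint
    api_key=os.environ["MINIMAX_API_KEY"],
)

response = client.messages.create(
    model="MiniMax-M2",  # assumed model identifier
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize today's notes in three bullet points."}],
)

print(response.content[0].text)
```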

Right now, MiniMax M2 isn’t hosted on any of the fast inference providers; I’ve accessed it via the official MiniMax API endpoint, whose inference speed isn’t that different from Anthropic’s cloud. The possibility of MiniMax M2 on Cerebras or Groq is extremely fascinating, and I hope it’s in the cards for the near future.


AI Experiments: Fast Inference with Groq and Third-Party Tools with Kimi K2 in TypingMind

Kimi K2, hosted on Groq, running in TypingMind with a custom plugin I made.

I’ll talk about this in more depth on Monday’s episode of AppStories (if you’re a Plus subscriber, it’ll be out on Sunday), but I wanted to post a quick note on the site to show off what I’ve been experimenting with this week. I started playing around with TypingMind, a web-based wrapper for all kinds of LLMs (from any provider you want to use), and, in the process, I’ve ended up recreating parts of my Claude setup with third-party apps…at a much, much higher speed. Here, let me show you with a video:

Kimi K2 hosted on Groq on the left.
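For those curious about what’s happening under the hood: Groq exposes an OpenAI-compatible API, so hitting Kimi K2 outside of TypingMind takes a few lines of Python. This is a minimal sketch, and the model ID is an assumption on my part; check Groq’s model list for the current name.

```python
# Sketch: calling Kimi K2 on Groq through its OpenAI-compatible API.
# The model ID is an assumption; check Groq's model list for the current name.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

completion = client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct",  # assumed Groq model ID for Kimi K2
    messages=[{"role": "user", "content": "Give me three article ideas about iPad automation."}],
)

print(completion.choices[0].message.content)
```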

Read more


Claude Adds Screenshot and Voice Shortcuts to Its Mac App

Claude’s new in-context screenshot tool.

Anthropic introduced a couple of new features in its Claude Mac app today that lower the friction of working with the chatbot.

First, after giving screenshot and accessibility permissions to Claude, you can double-tap the Option key to activate the app’s chat field as an overlay at the bottom of your screen. The shortcut simultaneously triggers crosshairs for dragging out a rectangle on your Mac’s screen. Once you’ve made a selection, the app takes a screenshot, and the chat field moves to the side of the selected area with the screenshot attached. Type your query, and it’s sent to Claude along with the screenshot, switching you to the Claude app and kicking off your request automatically.

Instead of double-tapping the Option key, you can also set the keyboard shortcut to Option + Space or a custom key combination. That’s nice because not every automation system can record a double tap of a modifier key as a shortcut; Logitech’s Creative Console, for example, can’t.

Sending your query and screenshot takes you back to the Claude app for your response.

I send a lot of screenshots to Claude, especially when I’m debugging scripts. This new shortcut will greatly accelerate that process simply by switching me back to Claude for my answer. It’s a small thing, but I expect it will add up over time.

My only complaint is that the experience has been inconsistent across my Macs. On my M1 Max Mac Studio with 64GB of memory, it takes 3-5 seconds for Claude to attach the screenshot to its chat field whereas on the M4 Max MacBook Pro I’ve been testing, the process is almost instant. The MacBook Pro is a much faster Mac than my Mac Studio, but I was surprised at the difference since it occurs at the screenshot phase of the interaction. My guess is that another app or system process is interfering with Claude.

Am I talking to the Claude chatbot or lighting my Dock on fire.

The other new feature of Claude is that you can set the Caps Lock button to trigger voice input. Once you trigger voice input, an orange cloud appears at the bottom of your screen indicating that your microphone is active. The visual is a little over-the-top, but the feature is handy. Tap the Caps Lock button again to finish the recording, which is then transcribed into a Claude chat field at the bottom of your screen. Just hit return to upload your query, and you’re switched back to the Claude app for a response.

One of the greatest strengths of modern AI chatbots is their multi-modality. What Anthropic has done with these new Claude features is make two of those modes – images and audio – a little easier to use, which gets you from input to response a little faster, and I appreciate that. I highly recommend giving both features a try.


Max Weinbach on the M5’s Neural Accelerators

In addition to the M5 iPad Pro, which I reviewed earlier today, I also received an M5 MacBook Pro review unit from Apple last week. I really wanted to write a companion piece to my iPad Pro story about MLX and the M5’s Neural Accelerators; sadly, I couldn’t get the latest MLX branch to work on the MacBook Pro either.

However, Max Weinbach at Creative Strategies did, and shared some impressive results with the M5 and its GPU’s Neural Accelerators:

These dedicated neural accelerators in each core lead to that 4x speedup of compute! In compute heavy parts of LLMs, like the pre-fill stage (the processing that happens during the time to first token) this should lead to massive speed-ups in performance! The decode, generating each token, should be accelerated by the memory bandwidth improvements of the SoC.

Now, I would have loved to show this off! Unfortunately, full support for the Neural Accelerators isn’t in MLX yet. There is preliminary support, though! There will be an update later this year with full support, but that doesn’t mean we can’t test now! Unfortunately, I don’t have an M4 Mac on me (traveling at the moment) but what I was able to do was compare M5 performance before and after tensor core optimization! We’re seeing between a 3x and 4x speedup in prefill performance!

Looking at Max’s benchmarks with Qwen3 8B and a ~20,000-token prompt, there is indeed a 3.65x speedup in tokens/sec in the prefill stage – jumping from 158.2 tok/s to a remarkable 578.7 tok/s. This is why I’m very excited about the future of MLX for local inference on M5, and why I’m also looking forward to M5 Pro/M5 Max chipsets in future Mac models.
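If you want to reproduce this kind of before/after comparison once the M5-optimized MLX build ships, here’s a rough sketch of a prefill benchmark with mlx-lm. The model repo name is an assumption on my part; the number to watch is the prompt (prefill) tokens-per-second that mlx-lm reports separately from generation speed.

```python
# Rough sketch of a prefill benchmark with mlx-lm (pip install mlx-lm).
# The model repo below is an assumption; any MLX-converted model works the same way.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit")  # assumed repo name

# A long prompt so the prefill stage dominates; roughly in the spirit of Max's ~20,000-token test.
long_prompt = "MacStories " * 8000

# verbose=True prints prompt (prefill) and generation tokens-per-second separately.
generate(model, tokenizer, prompt=long_prompt, max_tokens=32, verbose=True)
```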

Permalink

M5 iPad Pro Review: An AI and Gaming Upgrade for AI and Games That Aren’t There Yet

The M5 iPad Pro.

How do you review an iPad Pro that’s visually identical to its predecessor and marginally improves upon its performance with a spec bump and some new wireless radios?

Let me try:

I’ve been testing the new M5 iPad Pro since last Thursday. If you’re a happy owner of an M4 iPad Pro that you purchased last year, stay like that; there is virtually no reason for you to sell your old model and get an M5-upgraded edition. That’s especially true if you purchased a high-end configuration of the M4 iPad Pro last year with 16 GB of RAM, since upgrading to another high-end M5 iPad Pro model will get you…16 GB of RAM again.

The story is slightly different for users coming from older iPad Pro models and those on lower-end configurations, but barely. Starting this year, the two base-storage models of the iPad Pro are jumping from 8 GB of RAM to 12 GB, which helps make iPadOS 26 multitasking smoother, but it’s not a dramatic improvement, either.

Apple pitches the M5 chip as a “leap” for local AI tasks and gaming, and to an extent, that is true. However, it is mostly true on the Mac, where – for a variety of reasons I’ll cover below – there are more ways to take advantage of what the M5 can offer.

In many ways, the M5 iPad Pro is reminiscent of the M2 iPad Pro, which I reviewed in October 2022: it’s a minor revision to an excellent iPad Pro redesign that launched the previous year, which set a new bar for what we should expect from a modern tablet and hybrid computer – the kind that only Apple makes these days.

For all these reasons, the M5 iPad Pro is not a very exciting iPad Pro to review, and I would only recommend this upgrade to heavy iPad Pro users who don’t already have the (still remarkable) M4 iPad Pro. But there are a couple of narratives worth exploring about the M5 chip on the iPad Pro, which is what I’m going to focus on for this review.

Read more


Anthropic Releases Haiku 4.5: Sonnet 4 Performance, Twice as Fast

Earlier today, Anthropic released Haiku 4.5, a new version of their “small and fast” model that matches Sonnet 4 performance from five months ago at a fraction of the cost and twice the speed. From their announcement:

What was recently at the frontier is now cheaper and faster. Five months ago, Claude Sonnet 4 was a state-of-the-art model. Today, Claude Haiku 4.5 gives you similar levels of coding performance but at one-third the cost and more than twice the speed.

And:

Claude Sonnet 4.5, released two weeks ago, remains our frontier model and the best coding model in the world. Claude Haiku 4.5 gives users a new option for when they want near-frontier performance with much greater cost-efficiency. It also opens up new ways of using our models together. For example, Sonnet 4.5 can break down a complex problem into multi-step plans, then orchestrate a team of multiple Haiku 4.5s to complete subtasks in parallel.

I’m not a programmer, so I’m not particularly interested in benchmarks for coding tasks and Claude Code integrations. However, as I explained in this Plus segment of AppStories for members, I’m very keen to play around with fast models that considerably reduce inference times to allow for quicker back and forth in conversations. As I detailed on AppStories, I’ve had a solid experience with Cerebras and Bolt for Mac to generate responses at over 1,000 tokens per second.

I have a personal test that I like to try with all modern LLMs that support MCP: how quickly they can append the word “Test” to my daily note in Notion. Based on a few experiments I ran earlier today, Haiku 4.5 seems to be the new state of the art for both following instructions and speed in this simple test.

I ran my tests with LLMs that support MCP-based connectors: Claude and Mistral. Both were given system-level instructions on how to access my daily notes: Claude had the details in its profile personalization screen; in Mistral, I created a dedicated agent with Notion instructions. So, all things being equal, here’s how long it took three different, non-thinking models to run my command:

  • Mistral: 37 seconds
  • Claude Sonnet 4.5: 47 seconds
  • Claude Haiku 4.5: 18 seconds

That is a drastic latency reduction compared to Sonnet 4.5, and it’s especially impressive when we consider that Mistral is using Flash Answers, which is fast inference powered by Cerebras. As I shared on AppStories, it seems to confirm that it’s possible to have speed and reliability for agentic tool-calling without having to use a large model.

I ran other tests with Haiku 4.5 and the Todoist MCP and, similarly, I was able to mark tasks as completed and reschedule them in seconds, with none of the latency I previously observed in Sonnet 4.5 and Opus 4.1. As it stands now, if you’re interested in using LLMs with apps and connectors without having to wait around too long for responses and actions, Haiku 4.5 is the model to try.
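My numbers above come from the Claude and Mistral apps with their MCP connectors, so they include app overhead I can’t isolate. If you want a rough, reproducible approximation, a simple wall-clock harness against the API works; the model IDs below are assumptions, and the absolute numbers won’t match my app-based tests.

```python
# Rough latency harness: same prompt, different Claude models, wall-clock time.
# Model IDs are assumptions; my actual tests ran through apps with MCP connectors,
# so absolute numbers won't match, but relative differences should be indicative.
import time
import anthropic

client = anthropic.Anthropic()
prompt = "Append the word 'Test' to my daily note."  # stand-in for the real MCP task

for model in ["claude-haiku-4-5", "claude-sonnet-4-5"]:  # assumed model IDs
    start = time.perf_counter()
    client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    print(f"{model}: {elapsed:.1f}s")
```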


LLMs As Conduits for Data Portability Between Apps

One of the unsung benefits of modern LLMs – especially those with MCP support or proprietary app integrations – is their inherent ability to facilitate data transfer between apps and services that use different data formats.

This is something I’ve been pondering for the past few months, and the latest episode of Cortex – where Myke wished it was possible to move between task managers like you can with email clients – was the push I needed to write something up. I’ve personally taken on multiple versions of this concept with different LLMs, and the end result was always the same: I didn’t have to write a single line of code to create import/export functionalities that two services I wanted to use didn’t support out of the box.

Read more