Posts tagged with "LLMs"

AI Experiments: Fast Inference with Groq and Third-Party Tools with Kimi K2 in TypingMind

Kimi K2, hosted on Groq, running in TypingMind with a custom plugin I made.

I’ll talk about this in more depth on Monday’s episode of AppStories (if you’re a Plus subscriber, it’ll be out on Sunday), but I wanted to post a quick note on the site to show off what I’ve been experimenting with this week. I started playing around with TypingMind, a web-based wrapper for all kinds of LLMs (from any provider you want to use), and, in the process, I’ve ended up recreating parts of my Claude setup with third-party apps…at a much, much higher speed. Here, let me show you with a video:

Kimi K2 hosted on Groq on the left.
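
If you want to poke at the same hosted model without TypingMind, Groq exposes an OpenAI-compatible endpoint, so a few lines of Python are enough to see the speed for yourself. This is just a minimal sketch, not my plugin; the model id below is an assumption, so check Groq’s model list for the current name.

```python
# Minimal sketch: chat with Kimi K2 hosted on Groq via its
# OpenAI-compatible endpoint. The model id is an assumption;
# verify it against Groq's published model list.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",  # placeholder key
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct",  # assumed model id
    messages=[{"role": "user", "content": "In one paragraph, what is fast inference?"}],
)
print(response.choices[0].message.content)
```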

Read more


Max Weinbach on the M5’s Neural Accelerators

In addition to the M5 iPad Pro, which I reviewed earlier today, I also received an M5 MacBook Pro review unit from Apple last week. I really wanted to write a companion piece to my iPad Pro story about MLX and the M5’s Neural Accelerators; sadly, I couldn’t get the latest MLX branch to work on the MacBook Pro either.

However, Max Weinbach at Creative Strategies did, and shared some impressive results with the M5 and its GPU’s Neural Accelerators:

These dedicated neural accelerators in each core lead to that 4x speedup of compute! In compute heavy parts of LLMs, like the pre-fill stage (the processing that happens during the time to first token) this should lead to massive speed-ups in performance! The decode, generating each token, should be accelerated by the memory bandwidth improvements of the SoC.

Now, I would have loved to show this off! Unfortunately, full support for the Neural Accelerators isn’t in MLX yet. There is preliminary support, though! There will be an update later this year with full support, but that doesn’t mean we can’t test now! Unfortunately, I don’t have an M4 Mac on me (traveling at the moment) but what I was able to do was compare M5 performance before and after tensor core optimization! We’re seeing between a 3x and 4x speedup in prefill performance!

Looking at Max’s benchmarks with Qwen3 8B and a ~20,000-token prompt, there is indeed a 3.65x speedup in tokens/sec in the prefill stage – jumping from 158.2 tok/s to a remarkable 578.7 tok/s. This is why I’m very excited about the future of MLX for local inference on M5, and why I’m also looking forward to M5 Pro/M5 Max chipsets in future Mac models.
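
If you want to reproduce this kind of measurement yourself, mlx-lm reports prompt (prefill) and generation throughput separately. Here’s a minimal sketch, assuming an MLX-converted Qwen3 8B from the mlx-community repo; exact API details vary slightly across mlx-lm versions:

```python
# Rough sketch: measure prefill vs. decode throughput with mlx-lm.
# verbose=True prints prompt tokens-per-sec (prefill) and generation
# tokens-per-sec (decode) separately. Model repo name is an assumption.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-8B-4bit")

# A long prompt stresses the compute-bound prefill stage, which is
# where the M5's GPU Neural Accelerators should show the biggest gains.
long_prompt = "Summarize the following text.\n" + ("lorem ipsum " * 4000)

generate(model, tokenizer, prompt=long_prompt, max_tokens=64, verbose=True)
```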

Permalink

M5 iPad Pro Review: An AI and Gaming Upgrade for AI and Games That Aren’t There Yet

The M5 iPad Pro.

How do you review an iPad Pro that’s visually identical to its predecessor and marginally improves upon its performance with a spec bump and some new wireless radios?

Let me try:

I’ve been testing the new M5 iPad Pro since last Thursday. If you’re a happy owner of an M4 iPad Pro that you purchased last year, stay put; there is virtually no reason for you to sell your old model and get an M5-upgraded edition. That’s especially true if you purchased a high-end configuration of the M4 iPad Pro last year with 16 GB of RAM, since upgrading to another high-end M5 iPad Pro model will get you…16 GB of RAM again.

The story is slightly different for users coming from older iPad Pro models and those on lower-end configurations, but barely. Starting this year, the two base-storage models of the iPad Pro are jumping from 8 GB of RAM to 12 GB, which helps make iPadOS 26 multitasking smoother, but it’s not a dramatic improvement, either.

Apple pitches the M5 chip as a “leap” for local AI tasks and gaming, and to an extent, that is true. However, it is mostly true on the Mac, where – for a variety of reasons I’ll cover below – there are more ways to take advantage of what the M5 can offer.

In many ways, the M5 iPad Pro is reminiscent of the M2 iPad Pro, which I reviewed in October 2022: it’s a minor revision to an excellent iPad Pro redesign that launched the previous year, which set a new bar for what we should expect from a modern tablet and hybrid computer – the kind that only Apple makes these days.

For all these reasons, the M5 iPad Pro is not a very exciting iPad Pro to review, and I would only recommend this upgrade to heavy iPad Pro users who don’t already have the (still remarkable) M4 iPad Pro. But there are a couple of narratives worth exploring about the M5 chip on the iPad Pro, which is what I’m going to focus on for this review.

Read more


Anthropic Releases Haiku 4.5: Sonnet 4 Performance, Twice as Fast

Earlier today, Anthropic released Haiku 4.5, a new version of their “small and fast” model that matches Sonnet 4 performance from five months ago at a fraction of the cost and twice the speed. From their announcement:

What was recently at the frontier is now cheaper and faster. Five months ago, Claude Sonnet 4 was a state-of-the-art model. Today, Claude Haiku 4.5 gives you similar levels of coding performance but at one-third the cost and more than twice the speed.

And:

Claude Sonnet 4.5, released two weeks ago, remains our frontier model and the best coding model in the world. Claude Haiku 4.5 gives users a new option for when they want near-frontier performance with much greater cost-efficiency. It also opens up new ways of using our models together. For example, Sonnet 4.5 can break down a complex problem into multi-step plans, then orchestrate a team of multiple Haiku 4.5s to complete subtasks in parallel.
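
That orchestration pattern is easy to picture in code. Here’s a toy sketch, with the caveat that the model alias is an assumption and that, in the real pattern, Sonnet 4.5 would generate the subtask list rather than it being hardcoded:

```python
# Toy sketch of the orchestrator pattern: fan independent subtasks
# out to parallel Haiku 4.5 calls. The model alias is an assumption;
# a real orchestrator would have Sonnet 4.5 produce `subtasks`.
from concurrent.futures import ThreadPoolExecutor
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

subtasks = [
    "Summarize the changes in module A.",
    "Summarize the changes in module B.",
    "Summarize the changes in module C.",
]

def run_subtask(task: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_subtask, subtasks))  # runs in parallel, keeps order
```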

I’m not a programmer, so I’m not particularly interested in benchmarks for coding tasks and Claude Code integrations. However, as I explained in this Plus segment of AppStories for members, I’m very keen to play around with fast models that considerably reduce inference times to allow for a quicker back-and-forth in conversations. As I detailed on AppStories, I’ve had a solid experience with Cerebras and Bolt for Mac, generating responses at over 1,000 tokens per second.

I have a personal test that I like to try with all modern LLMs that support MCP: how quickly they can append the word “Test” to my daily note in Notion. Based on a few experiments I ran earlier today, Haiku 4.5 seems to be the new state of the art for both following instructions and speed in this simple test.

I ran my tests with LLMs that support MCP-based connectors: Claude and Mistral. Both were given system-level instructions on how to access my daily notes: Claude had the details in its profile personalization screen; in Mistral, I created a dedicated agent with Notion instructions. So, all things being equal, here’s how long it took three different, non-thinking models to run my command:

  • Mistral: 37 seconds
  • Claude Sonnet 4.5: 47 seconds
  • Claude Haiku 4.5: 18 seconds

That is a drastic latency reduction compared to Sonnet 4.5, and it’s especially impressive when we consider that Mistral is using Flash Answers, which is fast inference powered by Cerebras. As I shared on AppStories, it seems to confirm that it’s possible to have speed and reliability for agentic tool-calling without having to use a large model.

I ran other tests with Haiku 4.5 and the Todoist MCP and, similarly, I was able to mark tasks as completed and reschedule them in seconds, with none of the latency I previously observed in Sonnet 4.5 and Opus 4.1. As it stands now, if you’re interested in using LLMs with apps and connectors without having to wait around too long for responses and actions, Haiku 4.5 is the model to try.
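
My numbers above came from the Claude and Mistral apps with MCP connectors in the loop, so they include tool-calling overhead. If you want a rough, scriptable version of the same comparison – raw completion latency only, no tools – here’s a minimal sketch against the Anthropic API (model aliases are assumptions; check the current docs):

```python
# Rough sketch: compare end-to-end response latency between Sonnet 4.5
# and Haiku 4.5 on the same prompt. No MCP tools involved, so absolute
# numbers will differ from app-based tests; the relative gap is the point.
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

for model in ("claude-sonnet-4-5", "claude-haiku-4-5"):  # assumed aliases
    start = time.perf_counter()
    client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": "Append the word 'Test' to this list: milk, eggs, bread."}],
    )
    print(f"{model}: {time.perf_counter() - start:.1f}s")
```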


LLMs As Conduits for Data Portability Between Apps

One of the unsung benefits of modern LLMs – especially those with MCP support or proprietary app integrations – is their inherent ability to facilitate data transfer between apps and services that use different data formats.

This is something I’ve been pondering for the past few months, and the latest episode of Cortex – where Myke wished it were possible to move between task managers like you can with email clients – was the push I needed to write something up. I’ve personally tried multiple versions of this concept with different LLMs, and the end result was always the same: I didn’t have to write a single line of code to create import/export functionality that the two services I wanted to use didn’t support out of the box.
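
The pattern is always the same, whether you do it in a chat window or script it: hand the model an export in one service’s format and ask for the other’s. A minimal sketch of the scripted version, where the export snippet and model alias are illustrative assumptions:

```python
# Sketch of the "LLM as conduit" pattern: convert one service's export
# into another's import format by describing both to the model.
# The export snippet and model alias are illustrative assumptions.
import anthropic

todoist_export = '[{"content": "Buy milk", "due": {"date": "2025-06-01"}}]'

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Convert this Todoist task export into a CSV with the columns "
                   f"Title,Due Date for importing into another task manager:\n{todoist_export}",
    }],
)
print(message.content[0].text)  # CSV ready to import elsewhere
```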

Read more


Testing Claude’s Native Integration with Reminders and Calendar on iOS and iPadOS

Reminders created by Claude for iOS after a series of web searches.

A few months ago, when Perplexity unveiled their voice assistant integrated with native iOS frameworks, I wrote that I was surprised no other major AI lab had shipped a similar feature in its iOS apps:

The most important point about this feature is the fact that, in hindsight, this is so obvious and I’m surprised that OpenAI still hasn’t shipped the same feature for their incredibly popular ChatGPT voice mode. Perplexity’s iOS voice assistant isn’t using any “secret” tricks or hidden APIs: they’re simply integrating with existing frameworks and APIs that any third-party iOS developer can already work with. They’re leveraging EventKit for reminder/calendar event retrieval and creation; they’re using MapKit to load inline snippets of Apple Maps locations; they’re using Mail’s native compose sheet and Safari View Controller to let users send pre-filled emails or browse webpages manually; they’re integrating with MusicKit to play songs from Apple Music, provided that you have the Music app installed and an active subscription. Theoretically, there is nothing stopping Perplexity from rolling additional frameworks such as ShazamKit, Image Playground, WeatherKit, the clipboard, or even photo library access into their voice assistant. Perplexity hasn’t found a “loophole” to replicate Siri functionalities; they were just the first major AI company to do so.

It’s been a few months since Perplexity rolled out their iOS assistant, and, so far, the company has chosen to keep the iOS integrations exclusive to voice mode; you can’t have text conversations with Perplexity on iPhone and iPad and ask it to look at your reminders or calendar events.

Anthropic, however, has done it and has become – to the best of my knowledge – the second major AI lab to plug directly into Apple’s native iOS and iPadOS frameworks, with an important twist: in the latest version of Claude, you can have text conversations and tell the model to look into your Reminders database or Calendar app without having to use voice mode.

Read more


Claude’s Chat History and App Integrations as a Form of Lock-In

Earlier today, Anthropic announced that, similar to ChatGPT, Claude will be able to search and reference your previous chats with it. From their support document:

You can now prompt Claude to search through your previous conversations to find and reference relevant information in new chats. This feature helps you continue discussions seamlessly and retrieve context from past interactions without re-explaining everything.

If you’re wondering what Claude can actually search:

You can prompt Claude to search conversations within these boundaries:

  • All chats outside of projects.
  • Individual project conversations (searches are limited to within each specific project).

Conversation history is a powerful feature of modern LLMs, and although Anthropic hasn’t announced personalized context based on memory yet (a feature that not everybody likes), it seems like that’s the next shoe to drop. Chat search, memory with personalized context, larger context windows, and performance are the four key aspects I preferred in ChatGPT; Anthropic just addressed one of them, and a second may be launching soon.

As I’ve shared on Mastodon, despite the power and speed of GPT-5, I find myself gravitating more and more toward Claude (and specifically Opus 4.1) because of MCP and connectors. Claude works with the apps I already use and lets me easily turn conversations into actions performed in Notion, Todoist, Spotify, or other apps with an API that can talk to Claude. This is changing my workflow in two notable ways: I’m only using ChatGPT for “regular” web search queries (mostly via the Safari extension) and less for work, because it doesn’t match Claude’s extensive MCP support with tools; and I’m prioritizing web apps with well-supported APIs that work with LLMs over local apps that don’t (Spotify vs. Apple Music, Todoist vs. Reminders, Notion vs. Notes, etc.). Chat search – and, I hope, personalized context based on memory soon – further adds to this change in the apps I use.

Let me offer an example. I like combining Claude’s web search abilities with Zapier tools that integrate with Spotify to make Claude create playlists for me based on album reviews or music roundups. A few weeks ago, I started the process of converting this Chorus article into a playlist, but I never finished the task since I was running into Zapier rate limits. This evening, I asked Claude if we had ever worked on any playlists; it found the old chats and pointed out that one of them still needed to be completed. From there, it got to work again, picked up where it left off in Chorus’ article, and finished filling the playlist with the most popular songs that best represent the albums picked by Jason Tate and team. So not only could Claude find the chat, but it also got back to work with tools based on the state of the old conversation.

Resuming a chat that was about creating a Spotify playlist (right). Sadly, Apple Music doesn’t integrate with LLMs like this.

Even more impressively, after Claude finished the playlist from the old chat, I asked it to take all the playlists created so far and append their links to my daily note in Notion; that also worked – all from my phone, in a conversation that started as a search test for old chats and later grew into an agentic workflow calling tools for web search, Spotify, and Notion.

I find these use cases very interesting, and they’re the reason I struggle to incorporate ChatGPT into my everyday workflow beyond web searches. They’re also why I hesitate to use Apple apps right now, and I’m not sure Liquid Glass will be enough to win me back over.

Permalink


Testing DeepSeek R1-0528 on the M3 Ultra Mac Studio and Installing Local GGUF Models with Ollama on macOS

DeepSeek released an updated version of their popular R1 reasoning model (version 0528) with – according to the company – increased benchmark performance, reduced hallucinations, and native support for function calling and JSON output. Early tests from Artificial Analysis report a nice bump in performance, putting it behind OpenAI’s o3 and o4-mini-high in their Intelligence Index benchmarks. The model is available in the official DeepSeek API, and open weights have been distributed on Hugging Face. I downloaded different quantized versions of the full model on my M3 Ultra Mac Studio, and here are some notes on how it went.
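
For context on the setup, Ollama can pull GGUF quantizations directly from Hugging Face using an hf.co model path, and the same works from its Python client. A minimal sketch, assuming one of the community R1-0528 GGUF uploads (the exact repo and quant tag below are placeholders; pick a quantization that fits your machine’s memory):

```python
# Sketch: pull a GGUF quant of R1-0528 straight from Hugging Face and
# chat with it locally. The repo and quant tag are assumptions; the
# hf.co/<user>/<repo>:<quant> path format is Ollama's GGUF import syntax.
import ollama

model = "hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q4_K_M"  # placeholder repo/quant

ollama.pull(model)  # same as running `ollama pull <model>` in the terminal

response = ollama.chat(
    model=model,
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(response["message"]["content"])
```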

Read more