Early Impressions of Claude Opus 4 and Using Tools with Extended Thinking

By Federico Viticci

Claude Opus 4 and extended thinking with tools.

For the past two days, I’ve been testing an early access version of Claude Opus 4, the latest model by Anthropic that was just announced today. You can read more about the model in the official blog post and find additional documentation here. What follows is a series of initial thoughts and notes based on the 48 hours I spent with Claude Opus 4, which I tested in both the Claude app and Claude Code.

For starters, Anthropic describes Opus 4 as its most capable hybrid model with improvements in coding, writing, and reasoning. I don’t use AI for creative writing, but I have dabbled with “vibe coding” for a collection of personal Obsidian plugins (created and managed with Claude Code, following these tips by Harper Reed), and I’m especially interested in Claude’s integrations with Google Workspace and MCP servers. (My favorite solution for MCP at the moment is Zapier, which I’ve been using for a long time for web automations.) So I decided to focus my tests on reasoning with integrations and some light experiments with the upgraded Claude Code in the macOS Terminal.

Notes on Early Mac Studio AI Benchmarks with Qwen3-235B-A22B and Qwen2.5-VL-72B

By Federico Viticci

I received a top-of-the-line Mac Studio (M3 Ultra, 512 GB of RAM, 8 TB of storage) on loan from Apple last week, and I thought I’d use this opportunity to revive something I’ve been mulling over for some time: more short-form blogging on MacStories in the form of brief “notes” with a dedicated Notes category on the site. Expect more of these “low-pressure”, quick posts in the future.

I’ve been sent this Mac Studio as part of my ongoing experiments with assistive AI and automation, and one of the things I plan to do over the coming weeks and months is playing around with local LLMs that tap into the power of Apple Silicon and the incredible performance headroom afforded by the M3 Ultra and this computer’s specs. I have a lot to learn when it comes to local AI (my shortcuts and experiments so far have focused on cloud models and the Shortcuts app combined with the LLM CLI), but since I had to start somewhere, I downloaded LM Studio and Ollama, installed the llm-ollama plugin, and began experimenting with open-weights models (served from Hugging Face as well as the Ollama library) both in the GGUF format and Apple’s own MLX framework.

LM Studio.

I posted some of these early tests on Bluesky. I ran the massive Qwen3-235B-A22B model (a Mixture-of-Experts model with 235 billion parameters, 22 billion of which activated at once) with both GGUF and MLX using the beta version of the LM Studio app, and these were the results:

GGUF: 16 tokens/second, ~133 GB of RAM used
MLX: 24 tok/sec, ~124 GB RAM

As you can see from these first benchmarks (both based on the 4-bit quant of Qwen3-235B-A22B), the Apple Silicon-optimized version of the model resulted in better performance both for token generation and memory usage. Regardless of the version, the Mac Studio absolutely didn’t care and I could barely hear the fans going.

I also wanted to play around with the new generation of vision models (VLMs) to test modern OCR capabilities of these models. One of the tasks that has become kind of a personal AI eval for me lately is taking a long screenshot of a shortcut from the Shortcuts app (using CleanShot’s scrolling captures) and feed it either as a full-res PNG or PDF to an LLM. As I shared before, due to image compression, the vast majority of cloud LLMs either fail to accept the image as input or compresses the image so much that graphical artifacts lead to severe hallucinations in the text analysis of the image. Only o4-mini-high – thanks to its more agentic capabilities and tool-calling – was able to produce a decent output; even then, that was only possible because o4-mini-high decided to slice the image in multiple parts and iterate through each one with discrete pytesseract calls. The task took almost seven minutes to run in ChatGPT.

This morning, I installed the 72-billion parameter version of Qwen2.5-VL, gave it a full-resolution screenshot of a 40-action shortcut, and let it run with Ollama and llm-ollama. After 3.5 minutes and around 100 GB RAM usage, I got a really good, Markdown-formatted analysis of my shortcut back from the model.

To make the experience nicer, I even built a small local-scanning utility that lets me pick an image from Shortcuts and runs it through Qwen2.5-VL (72B) using the ‘Run Shell Script’ action on macOS. It worked beautifully on my first try. Amusingly, the smaller version of Qwen2.5-VL (32B) thought my photo of ergonomic mice was a “collection of seashells”. Fair enough: there’s a reason bigger models are heavier and costlier to run.

Given my struggles with OCR and document analysis with cloud-hosted models, I’m very excited about the potential of local VLMs that bypass memory constraints thanks to the M3 Ultra and provide accurate results in just a few minutes without having to upload private images or PDFs anywhere. I’ve been writing a lot about this idea of “hybrid automation” that combines traditional Mac scripting tools, Shortcuts, and LLMs to unlock workflows that just weren’t possible before; I feel like the power of this Mac Studio is going to be an amazing accelerator for that.

Next up on my list: understanding how to run MLX models with mlx-lm, investigating long-context models with dual-chunk attention support (looking at you, Qwen 2.5), and experimenting with Gemma 3. Fun times ahead!

Post-Chat UI →

Linked By Federico Viticci

Fascinating analysis by Allen Pike on how, beyond traditional chatbot interactions, the technology behind LLMs can be used in other types of user interfaces and interactions:

While chat is powerful, for most products chatting with the underlying LLM should be more of a debug interface – a fallback mode – and not the primary UX.

So, how is AI making our software more useful, if not via chat? Let’s do a tour.

There are plenty of useful, practical examples in the story showing how natural language understanding and processing can be embedded in different features of modern apps. My favorite example is search, as Pike writes:

Another UI convention being reinvented is the search field.

It used to be that finding your flight details in your email required typing something exact, like “air canada confirmation”, and hoping that’s actually the phrasing in the email you’re thinking of.

Now, you should be able to type “what are the flight details for the offsite?” and find what you want.

Having used Shortwave and its AI-powered search for the past few months, I couldn’t agree more. The moment you get used to searching without exact queries or specific operators, there’s no going back.

Experience this once, and products with an old-school text-match search field feel broken. You should be able to just find “tax receipts from registered charities” in your email app, “the file where the login UI is defined” in your IDE, and “my upcoming vacations” in your calendar.

Interestingly, Pike mentions Command-K bars as another interface pattern that can benefit from LLM-infused interactions. I knew that sounded familiar – I covered the topic in mid-November 2022, and I still think it’s a shame that Apple hasn’t natively implemented these anywhere in their apps, especially now that commands can be fuzzier (just consider what Raycast is doing). Funnily enough, that post was published just two weeks before the public debut of ChatGPT on November 30, 2022. That feels like forever ago now.

Permalink

Sycophancy in GPT-4o →

Linked By Federico Viticci

OpenAI found itself in the middle of another controversy earlier this week, only this time it wasn’t about publishers or regulation, but about its core product – ChatGPT. Specifically, after rolling out an update to the default 4o model with improved personality, users started noticing that ChatGPT was adopting highly sycophantic behavior: it weirdly agreed with users on all kinds of prompts, even about topics that would typically warrant some justified pushback from a digital assistant. (Simon Willison and Ethan Mollick have a good roundup of the examples as well as the change in the system prompt that may have caused this.) OpenAI had to roll back the update and explain what happened on the company’s blog:

We have rolled back last week’s GPT‑4o update in ChatGPT so people are now using an earlier version with more balanced behavior. The update we removed was overly flattering or agreeable—often described as sycophantic.

We are actively testing new fixes to address the issue. We’re revising how we collect and incorporate feedback to heavily weight long-term user satisfaction and we’re introducing more personalization features, giving users greater control over how ChatGPT behaves.

And:

We also believe users should have more control over how ChatGPT behaves and, to the extent that it is safe and feasible, make adjustments if they don’t agree with the default behavior.

Today, users can give the model specific instructions to shape its behavior with features like custom instructions. We’re also building new, easier ways for users to do this. For example, users will be able to give real-time feedback to directly influence their interactions and choose from multiple default personalities.

“Easier ways” for users to adjust ChatGPT’s behavior sound to me like a user-friendly toggle or slider to adjust ChatGPT’s personality (Grok has something similar, albeit unhinged), which I think would be a reasonable addition to the product. I’ve long argued that Siri should come with an adjustable personality similar to CARROT Weather, which lets you tweak whether you want the app to be “evil” or “professional” with a slider. I increasingly feel like that sort of option would make a lot of sense for modern LLMs, too.

Permalink

What Siri Isn’t: Perplexity’s Voice Assistant and the Potential of LLMs Integrated with iOS

By Federico Viticci

Perplexity’s voice assistant for iOS.

You’ve probably heard that Perplexity – a company whose web scraping tactics I generally despise, and the only AI bot we still block at MacStories – has rolled out an iOS version of their voice assistant that integrates with several native features of the operating system. Here’s their promo video in case you missed it:

This is a very clever idea: while other major LLMs’ voice modes are limited to having a conversation with the chatbot (with the kind of quality and conversation flow that, frankly, annihilates Siri), Perplexity put a different spin on it: they used native Apple APIs and frameworks to make conversations more actionable (some may even say “agentic”) and integrated with the Apple apps you use every day. I’ve seen a lot of people calling Perplexity’s voice assistant “what Siri should be” or arguing that Apple should consider Perplexity as an acquisition target because of this, and I thought I’d share some additional comments and notes after having played with their voice mode for a while.

How Could Apple Use Open-Source AI Models?→

Linked By Federico Viticci

Yesterday, Wayne Ma, reporting for The Information, published an outstanding story detailing the internal turmoil at Apple that led to the delay of the highly anticipated Siri AI features last month. From the article:

In November 2022, OpenAI released ChatGPT to a thunderous response from the tech industry and public. Within Giannandrea’s AI team, however, senior leaders didn’t respond with a sense of urgency, according to former engineers who were on the team at the time.

The reaction was different inside Federighi’s software engineering group. Senior leaders of the Intelligent Systems team immediately began sharing papers about LLMs and openly talking about how they could be used to improve the iPhone, said multiple former Apple employees.

Excitement began to build within the software engineering group after members of the Intelligent Systems team presented demos to Federighi showcasing what could be achieved on iPhones with AI. Using OpenAI’s models, the demos showed how AI could understand content on a user’s phone screen and enable more conversational speech for navigating apps and performing other tasks.

Assuming the details in this report are correct, I truly can’t imagine how one could possibly see the debut of ChatGPT two years ago and not feel a sense of urgency. Fortunately, other teams at Apple did, and it sounds like they’re the folks who have now been put in charge of the next generation of Siri and AI.

There are plenty of other details worth reading in the full story (especially the parts about what Rockwell’s team wanted to accomplish with Siri and AI on the Vision Pro), but one tidbit in particular stood out to me: Federighi has now given the green light to rely on third-party, open-source LLMs to build the next wave of AI features.

Federighi has already shaken things up. In a departure from previous policy, he has instructed Siri’s machine-learning engineers to do whatever it takes to build the best AI features, even if it means using open-source models from other companies in its software products as opposed to Apple’s own models, according to a person familiar with the matter.

“Using” open-source models from other companies doesn’t necessarily mean shipping consumer features in iOS powered by external LLMs. I’ve seen some people interpret this paragraph as Apple preparing to release a local Siri powered by Llama 4 or DeepSeek, and I think we should pay more attention to that “build the best AI features” (emphasis mine) line.

My read of this part is that Federighi might have instructed his team to use distillation to better train Apple’s in-house models as a way to accelerate the development of the delayed Siri features and put them back on the company’s roadmap. Given Tim Cook’s public appreciation for DeepSeek and this morning’s New York Times report that the delayed features may come this fall, I wouldn’t be shocked to learn that Federighi told Siri’s ML team to distill DeepSeek R1’s reasoning knowledge into a new variant of their ∼3 billion parameter foundation model that runs on-device. Doing that wouldn’t mean that iOS 19’s Apple Intelligence would be “powered by DeepSeek”; it would just be a faster way for Apple to catch up without throwing away the foundational model they unveiled last year (which, supposedly, had a ~30% error rate).

In thinking about this possibility, I got curious and decided to check out the original paper that Apple published last year with details on how they trained the two versions of AFM (Apple Foundation Model): AFM-server and AFM-on-device. The latter would be the smaller, ~3 billion model that gets downloaded on-device with Apple Intelligence. I’ll let you guess what Apple did to improve the performance of the smaller model:

For the on-device model, we found that knowledge distillation (Hinton et al., 2015) and structural pruning are effective ways to improve model performance and training efficiency. These two methods are complementary to each other and work in different ways. More specifically, before training AFM-on-device, we initialize it from a pruned 6.4B model (trained from scratch using the same recipe as AFM-server), using pruning masks that are learned through a method similar to what is described in (Wang et al., 2020; Xia et al., 2023).

Or, more simply:

AFM-server core training is conducted from scratch, while AFM-on-device is distilled and pruned from a larger model.

If the distilled version of AFM-on-device that was tested until a few weeks ago produced a wrong output one third of the time, perhaps it would be a good idea to perform distillation again based on knowledge from other smarter and larger models? Say, using 250 Nvidia GB300 NVL72 servers?

(One last fun fact: per their paper, Apple trained AFM-server on 8192 TPUv4 chips for 6.3 trillion tokens; that setup still wouldn’t be as powerful as “only” 250 modern Nvidia servers today.)

Permalink

Is Electron Really That Bad?→

Linked By Federico Viticci

I’ve been thinking about this video by Theo Browne for the past few days, especially in the aftermath of my story about working on the iPad and realizing its best apps are actually web apps.

I think Theo did a great job contextualizing the history of Electron and how we got to this point where the majority of desktop apps are built with it. There are two sections of the video that stood out to me and I want to highlight here. First, this observation – which I strongly agree with – regarding the desktop apps we ended up having thanks to Electron and why we often consider them “buggy”:

There wouldn’t be a ChatGPT desktop app if we didn’t have something like Electron. There wouldn’t be a good Spotify player if we didn’t have something like Electron. There wouldn’t be all of these awesome things we use every day. All these apps… Notion could never have existed without Electron. VS Code and now Cursor could never have existed without Electron. Discord absolutely could never have existed without Electron.

All of these apps are able to exist and be multi-platform and ship and theoretically build greater and greater software as a result of using this technology. That has resulted in some painful side effects, like the companies growing way faster than expected because they can be adopted so easily. So they hire a bunch of engineers who don’t know what they’re doing, and the software falls apart. But if they had somehow magically found a way to do that natively, it would have happened the same exact way.

This has nothing to do with Electron causing the software to be bad and everything to do with the software being so successful that the companies hire too aggressively and then kill their own software in the process.

The second section of the video I want to call out is the part where Theo links to an old thread from the developer of BoltAI, a native SwiftUI app for Mac that went through multiple updates – and a lot of work on the developer’s part – to ensure the app wouldn’t hit 100% CPU usage when simply loading a conversation with ChatGPT. As documented in the thread from late 2023, this is a common issue for the majority of AI clients built with SwiftUI, which is often less efficient than Electron when it comes to rendering real-time chat messages. Ironic.

Theo argues:

You guys need to understand something. You are not better at rendering text than the Chromium team is. These people have spent decades making the world’s fastest method for rendering documents across platforms because the goal was to make Chrome as fast as possible regardless of what machine you’re using it on. Electron is cool because we can build on top of all of the efforts that they put in to make Electron and specifically to make Chromium as effective as it is. The results are effective.

The fact that you can swap out the native layer with SwiftUI with even just a web view, which is like Electron but worse, and the performance is this much better, is hilarious. Also notice there’s a couple more Electron apps he has open here, including Spotify, which is only using less than 3% of his CPU. Electron apps don’t have to be slow. In fact, a lot of the time, a well-written Electron app is actually going to perform better than an equivalently well-written native app because you don’t get to build rendering as effectively as Google does.

Even if you think you made up your mind about Electron years ago, I suggest watching the entire video and considering whether this crusade against more accessible, more frequently updated (and often more performant) desktop software still makes sense in 2025.

Permalink

Using Simon Willison’s LLM CLI to Process YouTube Transcripts in Shortcuts with Claude and Gemini

By Federico Viticci

Video Processor.

I’ve been experimenting with different automations and command line utilities to handle audio and video transcripts lately. In particular, I’ve been working with Simon Willison’s LLM command line utility as a way to interact with cloud-based large language models (primarily Claude and Gemini) directly from the macOS terminal.

For those unfamiliar, Willison’s LLM CLI tool is a command line utility that lets you communicate with services like ChatGPT, Gemini, and Claude using shell commands and dedicated plugins. The llm command is extremely flexible when it comes to input and output; it supports multiple modalities like audio and video attachments for certain models, and it offers custom schemas to return structured output from an API. Even for someone like me – not exactly a Terminal power user – the different llm commands and options are easy to understand and tweak.

Today, I want to share a shortcut I created on my Mac that takes long transcripts of YouTube videos and:

reformats them for clarity with proper paragraphs and punctuation, without altering the original text,
extracts key points and highlights from the transcript, and
organizes highlights by theme or idea.

I created this shortcut because I wanted a better system for linking to YouTube videos, along with interesting passages from them, on MacStories. Initially, I thought I could use an app I recently mentioned on AppStories and Connected to handle this sort of task: AI Actions by Sindre Sorhus. However, when I started experimenting with long transcripts (such as this one with 8,000 words from Theo about Electron), I immediately ran into limitations with native Shortcuts actions. Those actions were running out of memory and randomly stopping the shortcut.

I figured that invoking a shell script using macOS’ built-in ‘Run Shell Script’ action would be more reliable. Typically, Apple’s built-in system actions (especially on macOS) aren’t bound to the same memory constraints as third-party ones. My early tests indicated that I was right, which is why I decided to build the shortcut around Willison’s llm tool.

Opening iOS Is Good News for Smartwatches →

Linked By Federico Viticci

Speaking of opening up iOS to more types of applications, I enjoyed this story by Victoria Song, writing at The Verge about the new EU-mandated interoperability requirements that include, among other things, smartwatches:

This is a big reason why it’s a good thing that the European Commission recently gave Apple marching orders to open up iOS interoperability to other gadget makers. You can read our explainer on the nitty gritty of what this means, but the gist is that it’s going to be harder for Apple to gatekeep iOS features to its own products. Specific to smartwatches, Apple will have to allow third-party smartwatch makers to display and interact with iOS notifications. I’m certain Garmin fans worldwide, who have long complained about the inability to send quick replies on iOS, erupted in cheers.

And this line, which is so true in its simplicity:

Some people just want the ability to choose how they use the products they buy.

Can you imagine if your expensive Mac desktop had, say, some latency if you decided to enter text with a non-Apple keyboard? Or if the USB-C port only worked with proprietary Apple accessories? Clearly, those restrictions would be absurd on computers that cost thousands of dollars. And yet, similar restrictions have long existed on iPhones and the iOS ecosystem, and it’s time to put an end to them.

Permalink