Sycophancy in GPT-4o →

Linked By Federico Viticci

OpenAI found itself in the middle of another controversy earlier this week, only this time it wasn’t about publishers or regulation, but about its core product – ChatGPT. Specifically, after rolling out an update to the default 4o model with improved personality, users started noticing that ChatGPT was adopting highly sycophantic behavior: it weirdly agreed with users on all kinds of prompts, even about topics that would typically warrant some justified pushback from a digital assistant. (Simon Willison and Ethan Mollick have good roundups of the examples, as well as of the change in the system prompt that may have caused this.) OpenAI had to roll back the update and explain what happened on the company’s blog:

We have rolled back last week’s GPT‑4o update in ChatGPT so people are now using an earlier version with more balanced behavior. The update we removed was overly flattering or agreeable—often described as sycophantic.

We are actively testing new fixes to address the issue. We’re revising how we collect and incorporate feedback to heavily weight long-term user satisfaction and we’re introducing more personalization features, giving users greater control over how ChatGPT behaves.

And:

We also believe users should have more control over how ChatGPT behaves and, to the extent that it is safe and feasible, make adjustments if they don’t agree with the default behavior.

Today, users can give the model specific instructions to shape its behavior with features like custom instructions. We’re also building new, easier ways for users to do this. For example, users will be able to give real-time feedback to directly influence their interactions and choose from multiple default personalities.

“Easier ways” for users to adjust ChatGPT’s behavior sound to me like a user-friendly toggle or slider to adjust ChatGPT’s personality (Grok has something similar, albeit unhinged), which I think would be a reasonable addition to the product. I’ve long argued that Siri should come with an adjustable personality similar to CARROT Weather, which lets you tweak whether you want the app to be “evil” or “professional” with a slider. I increasingly feel like that sort of option would make a lot of sense for modern LLMs, too.

AppStories - Ep. 433 - 35 minutes

Tackling Trackers

Federico

John

This week, Federico and John tackle tracking apps. From database apps to media trackers, they consider what makes a good tracking app no matter what you’re tracking.

On AppStories+, Federico quizzes John about what’s on his desk, the tech he’d be happy to have a burglar steal, and more.

Subscribe here.

We deliver AppStories+ to subscribers with bonus content, ad-free, and at a high bitrate early every week.

To learn more about an AppStories+ subscription, visit our Plans page, or read the AppStories+ FAQ.

This episode is sponsored by:

WaterMinder – The Best Water Tracker App for Your Hydration Needs!

What Siri Isn’t: Perplexity’s Voice Assistant and the Potential of LLMs Integrated with iOS

By Federico Viticci

Perplexity’s voice assistant for iOS.

You’ve probably heard that Perplexity – a company whose web scraping tactics I generally despise, and the only AI bot we still block at MacStories – has rolled out an iOS version of their voice assistant that integrates with several native features of the operating system. Here’s their promo video in case you missed it:

This is a very clever idea. While other major LLMs’ voice modes are limited to having a conversation with the chatbot (with the kind of quality and conversation flow that, frankly, annihilates Siri), Perplexity put a different spin on it: they used native Apple APIs and frameworks to make conversations more actionable (some may even say “agentic”) and integrated with the Apple apps you use every day. I’ve seen a lot of people calling Perplexity’s voice assistant “what Siri should be” or arguing that Apple should consider Perplexity as an acquisition target because of this, and I thought I’d share some additional comments and notes after having played with their voice mode for a while.

Access Extra Content and Perks

Founded in 2015, Club MacStories has delivered exclusive content every week for nearly a decade.

What started with weekly and monthly email newsletters has blossomed into a family of memberships designed for every MacStories fan.

Learn more here and from our Club FAQs.

Club MacStories: Weekly and monthly newsletters via email and the web that are brimming with apps, tips, automation workflows, longform writing, early access to the MacStories Unwind podcast, periodic giveaways, and more;

Club MacStories+: Everything that Club MacStories offers, plus an active Discord community, advanced search and custom RSS features for exploring the Club’s entire back catalog, bonus columns, and dozens of app discounts;

Club Premier: All of the above and AppStories+, an extended version of our flagship podcast that’s delivered early, ad-free, and in high-bitrate audio.

The Current State of Major LLMs and Their Shortcuts Integrations

By Federico Viticci

Earlier this week, I decided to do some research about the current state of major LLM apps and their implementations of Shortcuts actions. While millions of people are interacting with chatbots on a daily basis using their respective websites and dedicated mobile apps, I thought it’d be interesting to see how these popular services are...

AppStories - Ep. 432 - 54 minutes

How We’re Using AI

Federico

John

This week, Federico and John revisit the fast-paced world of artificial intelligence to describe how they’re using a variety of tools for their everyday workflows.

On AppStories+, John shares his theory of the way we’ll look at AI models in the future.

This episode is sponsored by:

Notion – Try the powerful, easy-to-use Notion AI today.

Automation Academy: How I Turn Voice Recordings into Searchable Obsidian Notes with Shortcuts, Hazel, and LLMs

By Federico Viticci

As I mentioned last week in the MacStories Weekly newsletter and have been hinting recently on both Connected and AppStories, I’m in the process of building a “perfect memory” system in Obsidian that allows me to save, archive, and search anything I write, think about, or come across on the Internet. This project is a...

AppStories - Ep. 431 - 34 minutes

Time for Calendars

Federico

John

This week, Federico and John survey their favorite calendar apps, discussing the strengths and weaknesses of each.

On AppStories+, Federico shares Shortcuts tips for working with Google’s Gemini API and the highly structured data it returns. Plus he and John share their concern and cautious optimism for the future of Shortcuts.

My Obsidian Setup, Part 12: Rethinking YouTube Watch Later with Markdown and AI

By Federico Viticci

Earlier this week on the Connected Pro pre-show, I mentioned that I’ve decided to take on the challenge of building a “perfect memory” for myself in Obsidian. The project involves three key aspects: Saving all kinds of content into Obsidian: my articles, transcribed voice recordings, but also videos I watch online and interesting webpages I...

How Could Apple Use Open-Source AI Models? →

Linked By Federico Viticci

Yesterday, Wayne Ma, reporting for The Information, published an outstanding story detailing the internal turmoil at Apple that led to the delay of the highly anticipated Siri AI features last month. From the article:

In November 2022, OpenAI released ChatGPT to a thunderous response from the tech industry and public. Within Giannandrea’s AI team, however, senior leaders didn’t respond with a sense of urgency, according to former engineers who were on the team at the time.

The reaction was different inside Federighi’s software engineering group. Senior leaders of the Intelligent Systems team immediately began sharing papers about LLMs and openly talking about how they could be used to improve the iPhone, said multiple former Apple employees.

Excitement began to build within the software engineering group after members of the Intelligent Systems team presented demos to Federighi showcasing what could be achieved on iPhones with AI. Using OpenAI’s models, the demos showed how AI could understand content on a user’s phone screen and enable more conversational speech for navigating apps and performing other tasks.

Assuming the details in this report are correct, I truly can’t imagine how one could possibly see the debut of ChatGPT two years ago and not feel a sense of urgency. Fortunately, other teams at Apple did, and it sounds like they’re the folks who have now been put in charge of the next generation of Siri and AI.

There are plenty of other details worth reading in the full story (especially the parts about what Rockwell’s team wanted to accomplish with Siri and AI on the Vision Pro), but one tidbit in particular stood out to me: Federighi has now given the green light to rely on third-party, open-source LLMs to build the next wave of AI features.

Federighi has already shaken things up. In a departure from previous policy, he has instructed Siri’s machine-learning engineers to do whatever it takes to build the best AI features, even if it means using open-source models from other companies in its software products as opposed to Apple’s own models, according to a person familiar with the matter.

“Using” open-source models from other companies doesn’t necessarily mean shipping consumer features in iOS powered by external LLMs. I’ve seen some people interpret this paragraph as Apple preparing to release a local Siri powered by Llama 4 or DeepSeek, and I think we should pay more attention to that “build the best AI features” (emphasis mine) line.

My read of this part is that Federighi might have instructed his team to use distillation to better train Apple’s in-house models as a way to accelerate the development of the delayed Siri features and put them back on the company’s roadmap. Given Tim Cook’s public appreciation for DeepSeek and this morning’s New York Times report that the delayed features may come this fall, I wouldn’t be shocked to learn that Federighi told Siri’s ML team to distill DeepSeek R1’s reasoning knowledge into a new variant of their ~3 billion parameter foundation model that runs on-device. Doing that wouldn’t mean that iOS 19’s Apple Intelligence would be “powered by DeepSeek”; it would just be a faster way for Apple to catch up without throwing away the foundational model they unveiled last year (which, supposedly, had a ~30% error rate).

In thinking about this possibility, I got curious and decided to check out the original paper that Apple published last year with details on how they trained the two versions of AFM (Apple Foundation Model): AFM-server and AFM-on-device. The latter is the smaller, ~3 billion parameter model that gets downloaded on-device with Apple Intelligence. I’ll let you guess what Apple did to improve the performance of the smaller model:

For the on-device model, we found that knowledge distillation (Hinton et al., 2015) and structural pruning are effective ways to improve model performance and training efficiency. These two methods are complementary to each other and work in different ways. More specifically, before training AFM-on-device, we initialize it from a pruned 6.4B model (trained from scratch using the same recipe as AFM-server), using pruning masks that are learned through a method similar to what is described in (Wang et al., 2020; Xia et al., 2023).

Or, more simply:

AFM-server core training is conducted from scratch, while AFM-on-device is distilled and pruned from a larger model.

If the distilled version of AFM-on-device that was tested until a few weeks ago produced a wrong output one third of the time, perhaps it would be a good idea to perform distillation again based on knowledge from other smarter and larger models? Say, using 250 Nvidia GB300 NVL72 servers?
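If you’re not familiar with the technique, here’s a minimal sketch of the soft-target idea behind knowledge distillation as described by Hinton et al. (2015): a smaller “student” model is trained to match the softened output distribution of a larger “teacher”. To be clear, this is not Apple’s training code. The function names, the example logits, and the temperature value are all hypothetical, and a real pipeline would use an ML framework and usually combine this term with a standard loss on the ground-truth labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and the student's soft targets.

    The result is scaled by temperature squared, as suggested by
    Hinton et al., so its gradients stay comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2


# Hypothetical logits over three candidate tokens (illustrative values only).
teacher_logits = [4.0, 1.5, 0.2]   # larger model: confident, but not one-hot
student_logits = [2.0, 2.0, 1.0]   # smaller model: still undecided

print(distillation_loss(student_logits, teacher_logits))
```

The loss drops to zero when the student reproduces the teacher’s distribution exactly, which is the sense in which the student inherits the teacher’s “knowledge”: it learns from the full probability distribution, not just the top answer.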

(One last fun fact: per their paper, Apple trained AFM-server on 8192 TPUv4 chips for 6.3 trillion tokens; that setup still wouldn’t be as powerful as “only” 250 modern Nvidia servers today.)
