John Voorhees

3107 posts on MacStories since November 2015

John is MacStories’ Managing Editor, has been writing about Apple and apps since joining the team in 2015, and today, runs the site alongside Federico.

John also co-hosts four MacStories podcasts: AppStories, which covers the world of apps; MacStories Unwind, which explores the fun differences between American and Italian culture and recommends media to listeners; Ruminate, a show about the weird web and unusual snacks; and NPC: Next Portable Console, a show about the games we take with us.


Wired Confirms Perplexity Is Bypassing Efforts by Websites to Block Its Web Crawler

Last week, Federico and I asked Robb Knight to do what he could to block web crawlers deployed by artificial intelligence companies from scraping MacStories. Robb had already updated his own site’s robots.txt file months ago, so that’s the first thing he did for MacStories.

However, robots.txt only works if a company’s web crawler is set up to respect the file. As I wrote earlier this week, a better solution is to block them on your server, which Robb did on his personal site and wrote about late last week. The setup sends a 403 error if one of the bots listed in his server code requests information from his site.
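
Robb’s post has the actual server configuration, but the idea translates to a few lines in any stack that can inspect request headers. Here’s a minimal sketch as Python WSGI middleware; the bot list is illustrative, not Robb’s actual rules:

```python
# A sketch of server-side bot blocking, not Robb's actual code.
# The agent list is illustrative; expand it to taste.
BLOCKED_AGENTS = ("GPTBot", "PerplexityBot", "CCBot", "anthropic-ai")

def block_ai_bots(app):
    """Wrap a WSGI app so requests from known AI crawlers get a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot in user_agent for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"403 Forbidden"]
        return app(environ, start_response)
    return middleware
```

The catch, as you’re about to see, is that this approach only works when a crawler announces itself.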

Spoiler: Robb hit the nail on the head the first time.

After reading Robb’s post, Federico and I asked him to do the same for MacStories, which he did last Saturday. Once everything was in place, Federico began testing. OpenAI’s bot returned an error as expected, but Perplexity’s bot was still able to reach MacStories, which shouldn’t have been the case.1

Yes, I took a screenshot of Perplexity’s API documentation because I bet it changes based on what we discovered.

That began a deep dive to try to figure out what was going on. Robb’s code checked out, blocking the user agent specified in Perplexity’s own API documentation. What we discovered after more testing was that Perplexity was hitting MacStories’ server without using the user agent it said it used, effectively doing an end run around Robb’s server code.
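
Stripped of the details, Federico’s test amounts to two requests for the same page: one announcing the documented crawler name, and one not. A rough reconstruction in Python (the user agent strings are simplified, not the exact ones we used):

```python
# A rough reconstruction of the kind of test we ran; user agent
# strings are simplified for illustration.
import urllib.error
import urllib.request

def status_for(url, user_agent):
    """Fetch url with the given User-Agent and return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

# The declared crawler identity is refused at the server...
print(status_for("https://www.macstories.net/", "PerplexityBot"))  # expect 403
# ...but the same request under a generic browser identity goes through,
# which is the end run described above.
print(status_for("https://www.macstories.net/", "Mozilla/5.0"))    # expect 200
```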

Robb wrote up his findings on his website, which promptly shot to the top slot on Hacker News and caught the eye of Dhruv Mehrotra and Tim Marchman of Wired, who were in the midst of investigating how Perplexity works. As Mehrotra and Marchman describe it:

A WIRED analysis and one carried out by developer Robb Knight suggest that Perplexity is able to achieve this partly through apparently ignoring a widely accepted web standard known as the Robots Exclusion Protocol to surreptitiously scrape areas of websites that operators do not want accessed by bots, despite claiming that it won’t. WIRED observed a machine tied to Perplexity—more specifically, one on an Amazon server and almost certainly operated by Perplexity—doing this on wired.com and across other Condé Nast publications.

Until earlier this week, Perplexity published in its documentation a link to a list of the IP addresses its crawlers use—an apparent effort to be transparent. However, in some cases, as both WIRED and Knight were able to demonstrate, it appears to be accessing and scraping websites from which coders have attempted to block its crawler, called Perplexity Bot, using at least one unpublicized IP address. The company has since removed references to its public IP pool from its documentation.

That secret IP address—44.221.181.252—has hit properties at Condé Nast, the media company that owns WIRED, at least 822 times in the last three months. One senior engineer at Condé Nast, who asked not to be named because he wants to “stay out of it,” calls this a “massive undercount” because the company only retains a fraction of its network logs.

WIRED verified that the IP address in question is almost certainly linked to Perplexity by creating a new website and monitoring its server logs. Immediately after a WIRED reporter prompted the Perplexity chatbot to summarize the website’s content, the server logged that the IP address visited the site. This same IP address was first observed by Knight during a similar test.
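
That verification method is easy to reproduce on any site whose logs you control: prompt the chatbot about a page, then look for the address in your access logs. A minimal sketch, assuming a common log format where each line begins with the client IP (the log path will vary by server):

```python
# Scan an access log for hits from the IP identified by WIRED and Robb.
# The log path and line format are assumptions; adjust for your server.
SUSPECT_IP = "44.221.181.252"
LOG_PATH = "/var/log/nginx/access.log"

with open(LOG_PATH) as log:
    hits = [line.rstrip() for line in log if line.startswith(SUSPECT_IP)]

print(f"{len(hits)} request(s) from {SUSPECT_IP}")
for line in hits[-5:]:  # show the most recent few
    print(line)
```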

This sort of unethical behavior is why we took the steps we did to block the use of MacStories’ websites as training data for Perplexity and other companies.2 Incidents like this, and the lack of transparency about how AI companies train their models, have led to a lot of mistrust of the entire industry among creators who publish on the web. I’m glad we’ve been able to play a small part in revealing Perplexity’s egregious behavior, but more needs to be done to rein in these practices, including closer scrutiny by regulators around the world.

As a footnote to this, it’s worth noting that Wired also puts to rest the argument that websites should be okay with Perplexity’s behavior because it includes citations alongside its plagiarism. According to Wired’s story:

WIRED’s own records show that Perplexity sent 1,265 referrals to wired.com in May, an insignificant amount in the context of the site’s overall traffic. The article to which the most traffic was referred got 17 views.

That’s next to nothing for a site with Wired’s traffic, which Similarweb and other sites peg at over 20 million page views that same month. That’s a mere 0.006% of Wired’s May traffic. Let that sink in, and then ask yourself whether it seems like a fair trade.


  1. Meanwhile, I was digging through bins of old videogames and hardware at a Retro Gaming Festival doing ‘research’ for NPC. ↩︎
  2. Mehrotra and Marchman correctly question whether Perplexity is even an AI company because it piggybacks on other companies’ LLMs, using them in conjunction with scraped web data to provide summaries that effectively replace the source’s content. However, that doesn’t change the fact that Perplexity is surreptitiously scraping sites while simultaneously professing to respect their robots.txt files. That’s the unethical bit. ↩︎

The Latest from NPC: Next Portable Console and AppStories

Enjoy the latest episodes from MacStories’ family of podcasts:

This week, Federico takes us on his journey to build the world’s best eGPU, and we take a first look at the Anbernic RG35XXSP, a foldable retro handheld that two out of the three of us have received.


This week, Federico and John recap WWDC week with more on their early testing of the iOS and iPadOS 18 betas and an in-depth conversation about why they are disappointed with Apple’s decision to train its large language models on the Open Web.



Retro Videogame Streaming Service Antstream To Launch on the App Store Next Week

In the wake of the Digital Markets Act, Apple made a couple of worldwide changes to its App Review Guidelines, along with many EU-specific updates. One of the worldwide updates was to allow third-party game streaming services.

Today, Antstream became the first game streaming service to announce that it will launch an app on Apple’s App Store. Antstream is a retro game streaming service with a catalog of over 1,300 videogames. The service, which is available on multiple other platforms in the EU, US, and Brazil, will bring its licensed library of games to the iPhone and iPad next week on June 27th.

Antstream’s catalog covers a wide variety of retro systems, including the Atari 2600, Commodore 64, SNES, Sega Mega Drive, and the original PlayStation, as well as arcade classics. Antstream Arcade normally costs $4.99 per month or $39.99 per year but will be available for $3.99 per month or $29.99 per year for a limited time when it launches on the App Store.

I haven’t used Antstream Arcade yet, but I’m looking forward to trying it to see what’s in the catalog and check out how it performs over Wi-Fi.


Apple Developer Academies in Six Countries to Add AI Courses This Fall

Today, Apple announced that this fall, the company will offer a new curriculum for its Developer Academy students focused on machine learning and artificial intelligence.

According to Apple:

Beginning this fall, every Apple Developer Academy student will benefit from custom-built curriculum that teaches them how to build, train, and deploy machine learning models across Apple devices. Courses will include the fundamentals of AI technologies and frameworks; Core ML and its ability to deliver fast performance on Apple devices; and guidance on how to build and train AI models from the ground up. Students will learn from guided curriculum and project-based assignments that include assistance from hundreds of mentors and more than 12,000 academy alumni worldwide.

The new curriculum will be offered at 18 academies in Brazil, Indonesia, Italy, Saudi Arabia, South Korea, and the United States. With the company’s emphasis on Apple Intelligence at WWDC, it’s not surprising that the skills needed to implement those new features are being added to its educational efforts.


How We’re Trying to Protect MacStories from AI Bots and Web Crawlers – And How You Can, Too

Over the past several days, we’ve made some changes at MacStories to address the ingestion of our work by web crawlers operated by artificial intelligence companies. We’ve learned a lot, so we thought we’d share what we’ve done in case anyone else would like to do something similar.

If you read MacStories regularly, or listen to our podcasts, you already know that Federico and I think that crawling the Open Web to train large language models is unethical. Industry-wide, AI companies have scraped the content of websites like ours, using it as the raw material for their chatbots and other commercial products without the consent or compensation of publishers and other creators.

Now that the horse is out of the barn, some of those companies are respecting publishers’ robots.txt files, while others seemingly aren’t. That doesn’t make up for the tens of thousands of articles and images that have already been scraped from MacStories. Nor is robots.txt a complete solution, so it’s just one of four approaches we’re taking to protect our work.



The Origin Story of Apple Podcasts’ Transcripts

Ari Saperstein, writing for The Guardian, interviewed Ben Cave, Apple’s global head of podcasts, and Sarah Herrlinger, who manages accessibility policy for the company, about Apple Podcasts transcripts. The feature, which was introduced in March, automatically generates transcripts of podcast episodes in Apple’s catalog and has been a big accessibility win for podcast fans.

Apple’s transcription efforts began modestly:

Apple’s journey to podcast transcripts started with the expansion of a different feature: indexing. It’s a common origin story at a number of tech companies like Amazon and Yahoo – what begins as a search tool evolves into a full transcription initiative. Apple first deployed software that could identify specific words in a podcast back in 2018.

“What we did then is we offered a single line of the transcript to give users context on a result when they’re searching for something in particular,” Cave recalls. “There’s a few different things that we did in the intervening seven years, which all came together into this [transcript] feature.”

Drawing from technologies and designs used by Apple Music and Books, the feature has been lauded by the accessibility community:

“I was knocked out on how accurate it was,” says Larry Goldberg, a media and technology accessibility pioneer who created the first closed captioning system for movie theaters. The fidelity of auto-transcription is something that’s long been lacking, he adds. “It’s improved, it has gotten better … but there are times when it is so wrong.”

My experience with Podcasts’ transcripts tracks with that of the people interviewed for Saperstein’s story. Automatically generated transcription is hard. I’ve tried various services in the past, and I’ve never been happy enough with any of them to publish their output on MacStories. Apple’s solution isn’t perfect, but it’s easily the best I’ve seen, tipping into what I consider publishable territory. The feature makes it easy to search, select text, and generate time-stamped URLs for quoting snippets of an episode, which makes the app an excellent tool for researching and writing about podcasts, too.


WWDC 2024: The AppStories Interviews with ADA and Swift Student Challenge Distinguished Winners

Devin Davies, the developer of Crouton.

To wrap up our week of WWDC coverage, we just published a special episode of AppStories that was recorded in the Apple Podcasts Studio at Apple Park. Federico and I interviewed three of this year’s Apple Design Award winners:

  • Devin Davies, the creator of Crouton, which won an ADA in the Interaction category
  • Katarina Lotrič, CEO and co-founder, and Jasna Krmelj, CTO and co-founder, of Gentler Streak, which won an ADA in the Social Impact category
  • James Cuda, CEO, and Michael Shaw, CTO, of Procreate, which won an ADA for Procreate Dreams in the Innovation category

We also interviewed two of the Swift Student Challenge Distinguished Winners:

  • Dezmond Blair, a student at the Apple Developer Academy in Detroit. His app marries his passion for biking and the outdoors with technology, creating an immersive experience.
  • Adelaide Humez, a high school student from Lille, France. Her winning app, Egretta, allows users to create a journal of their dreams based on emotions.

In addition to being available as always as an audio-only podcast in your favorite podcast app, this special episode of AppStories is also available on our new MacStories YouTube channel, home of Comfort Zone, one of the two podcasts we launched last week, and other video projects.


We deliver AppStories+ to subscribers with bonus content, ad-free, and at a high bitrate early every week.

To learn more about the benefits included with an AppStories+ subscription, visit our Plans page or read the AppStories+ FAQ.


The Latest from Magic Rays of Light, Comfort Zone, and MacStories Unwind

Enjoy the latest episodes from MacStories’ family of podcasts:

This week on Magic Rays of Light, Sigmund and Devon recap the Apple TV and entertainment announcements at WWDC – including tvOS 18, visionOS 2, Immersive Video updates, and more – and score their event predictions.


We’re back! After surviving our first challenge together, the gang is back for more with new goodies, an unexpectedly heavy topic, and a new mysterious challenge we didn’t see coming.


This week, John is joined by Jonathan Reed and Sigmund Judge for the story of how John missed his first episode of AppStories in seven years, an update from Sigmund on what’s coming to tvOS and Apple TV+, plus a bunch of picks from everyone.



Opting Out of AI Model Training

Dan Moren has an excellent guide on Six Colors that explains how to exclude your website from the web crawlers used by Apple, OpenAI, and others to train large language models for their AI products. For many sites, the process simply requires a few edits to the robots.txt file on your server:

If you’re not familiar with robots.txt, it’s a text file placed at the root of a web server that can give instructions about how automated web crawlers are allowed to interact with your site. This system enables publishers to not only entirely block their sites from crawlers, but also specify just parts of the sites to allow or disallow.

The process is a little more complicated with a WordPress installation like the one MacStories uses, and Dan covers that too.

Unfortunately, as Dan explains, editing robots.txt isn’t a solution for companies that ignore the file. It’s simply a convention that doesn’t carry any legal or regulatory weight. Nor does it help with Google or Microsoft’s use of your website’s copyrighted content unless you’re also willing to remove your site from the biggest search engines.

Although I’m glad there is a way to block at least some AI web crawlers prospectively, it’s cold comfort. We and many other sites have years of articles that have already been crawled to train these models, and you can’t unring that bell. That said, MacStories’ robots.txt file has been updated to ban Apple and OpenAI’s crawlers, and we’re investigating additional server-level protections.
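
If you make similar edits, it’s worth confirming the file says what you think it says. Python’s standard library can read robots.txt the way a well-behaved crawler would; here’s a quick sketch (the agent names are examples, so check each company’s documentation for the current ones):

```python
# Check which user agents your robots.txt asks to stay away.
# Agent names are examples; verify them against each company's docs.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://www.macstories.net/robots.txt")
parser.read()

for agent in ("Applebot-Extended", "GPTBot", "PerplexityBot", "*"):
    allowed = parser.can_fetch(agent, "https://www.macstories.net/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Of course, this only confirms what you’re asking crawlers to do, not whether they’ll listen, which is the whole problem.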

If you listen to Ruminate or follow my writing on MacStories, you know that I think what these companies are doing is wrong in both the moral and the legal sense of the word. However, nothing captures it quite as well as this Mastodon post by Federico today:

If you’ve ever read the principles that guide us at MacStories, I’m sure Federico’s post came as no surprise. We care deeply about the Open Web, but ‘open’ doesn’t give tech companies free rein to appropriate our work to build their products.

Yesterday, Federico linked to Apple’s Machine Learning Research website, where it was disclosed that the company has indexed the web to train its model without the consent of publishers. I was as disappointed in Apple as Federico was. I also immediately thought of this 2010 clip of Steve Jobs near the end of his life, reflecting on what ‘the intersection of Technology and the Liberal Arts’ meant to Apple:

I’ve always loved that clip. It speaks to me as someone who loves technology and creates things for the web. In hindsight, I also think that Jobs was explaining what he hoped his legacy would be. It’s ironic that he spoke about ‘technology married with Liberal Arts,’ which superficially sounds like what Apple and others have done to create their AI models but couldn’t be further from what he meant. It’s hard to watch that clip now and not wonder if Apple has lost sight of what guided it in 2010.


You can follow all of our WWDC coverage through our WWDC 2024 hub or subscribe to the dedicated WWDC 2024 RSS feed.
