Over the past several days, we’ve made some changes at MacStories to address the ingestion of our work by web crawlers operated by artificial intelligence companies. We’ve learned a lot, so we thought we’d share what we’ve done in case anyone else would like to do something similar.
If you read MacStories regularly, or listen to our podcasts, you already know that Federico and I think that crawling the Open Web to train large language models is unethical. Industry-wide, AI companies have scraped the content of websites like ours, using it as the raw material for their chatbots and other commercial products without the consent or compensation of publishers and other creators.
Now that the horse is out of the barn, some of those companies are respecting publishers’ robots.txt files, while others seemingly aren’t. That doesn’t make up for the tens of thousands of articles and images that have already been scraped from MacStories. Nor is robots.txt a complete solution, so it’s just one of four approaches we’re taking to protect our work.
Preventing AI Crawlers Using Robots.txt
The first step, and one of the easiest to implement, is to request that the web crawlers of AI companies not crawl your site using robots.txt.
The trouble with this approach is that it’s nothing more than the Internet equivalent of an “AI Bots Keep Out” sign hung on your website. It can be ignored and only works if crawlers identify themselves, which not all seem to do. That said, it’s a good first step and the first thing we did. I highly recommend Dan Moren’s article on Six Colors that I linked to last week for more information about robots.txt and details on implementing it on your site.
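As a sketch of what this looks like, here is a minimal robots.txt that opts out of a few widely documented AI crawlers. The user-agent tokens below are the ones these companies have published (GPTBot for OpenAI, ClaudeBot for Anthropic, Google-Extended for Google’s AI training, CCBot for Common Crawl); this isn’t an exhaustive list, and new crawlers appear regularly, so check each company’s documentation for current names:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

The file goes at the root of your site (e.g., yoursite.com/robots.txt). Note that Google-Extended is a robots.txt-only token: it controls whether Google uses your content for AI training, while regular Googlebot search crawling is unaffected.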
Blocking AI Bots at Your Server
We don’t trust AI companies to respect our robots.txt file. After all, they already took our content without our consent. So, we went a step further and blocked known AI crawlers at the server level with the help of Robb Knight. Doing so requires that you know your way around a web server, but it’s more effective than simply editing your robots.txt file. If you want to learn more about configuring your site to block AI crawlers, Robb has written about the work he did for his personal site and MacStories here.
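To illustrate the idea (this is a generic sketch, not necessarily the exact configuration Robb used), an nginx server can match known AI user-agent strings and refuse those requests outright with a 403, regardless of whether the crawler honors robots.txt:

```
# Map the User-Agent header to a flag; 1 means "block this request".
# The bot names here are published crawler user agents; extend the
# list as new crawlers are identified.
map $http_user_agent $is_ai_bot {
    default          0;
    ~*GPTBot         1;
    ~*ClaudeBot      1;
    ~*CCBot          1;
    ~*PerplexityBot  1;
    ~*Bytespider     1;
}

server {
    listen 80;
    server_name example.com;  # placeholder domain

    # Refuse matched crawlers before serving any content.
    if ($is_ai_bot) {
        return 403;
    }

    # ... rest of your site configuration ...
}
```

You can verify a block like this from the command line by spoofing the user agent, e.g. `curl -I -A "GPTBot" https://example.com`, which should come back with a 403 while a normal request still returns 200. Keep in mind this only stops crawlers that identify themselves honestly; a bot that lies about its user agent sails right past it.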
Update Your Terms of Service
I also recommend having a Terms of Service for your website. The New York Times, which is currently litigating OpenAI’s LLM training practices, updated its terms of service late last summer. We’ve used those terms as a guide to carefully define in our own Terms of Service how MacStories content, whether it’s an article, image, or podcast, can be used.
Rest assured, you have a lot of latitude for personal use of MacStories content. We also don’t have an issue with commercial uses that draw on reasonable portions of our content, as long as they are properly attributed in line with the content that is used. However, we do not consent to the use of our content for AI model training.
Support Legislation Regulating AI Training
None of the above are complete solutions, which is why we support legislation regulating how AI companies train their LLMs. Last summer, media organizations from around the world signed an open letter asking lawmakers to regulate LLM training, stating:
We, the undersigned organizations, support the responsible advancement and deployment of generative AI technology, while believing that a legal framework must be developed to protect the content that powers AI applications as well as maintain public trust in the media that promotes facts and fuels our democracies.
The letter goes to the heart of something we believe, too. We’re not against artificial intelligence as a technology. Many of the tools being built are promising. However, we don’t believe that it’s right for tech companies worth billions and even trillions of dollars to be given a pass for building those tools on the backs of others’ work, especially in an economic environment where so many online media companies are struggling to survive. It’s just not right.
The solutions above aren’t perfect or foolproof, and as a result, some people have told us that we shouldn’t bother; we should just give in. In a sign of just how strapped media companies are for cash, others have cut deals with AI companies figuring that getting something is better than nothing.
But here’s the thing. The web is a special place. Every day, it brings people from around the world together to share their thoughts and express their creativity. That’s something nobody should take for granted, and it’s worth protecting. AI is cool and all, but it’s not worth destroying the web.