Copyright exceptions for data mining based on Opt-Out are law in 27 countries. Since the topic is increasingly discussed, and since Opt-Out is often misunderstood or abused, it's important to clarify the requirements and establish some baseline facts.

Overview

Most of the web has opted-out already! We checked the Terms Of Service.

weboptout.org tells you whether a website has opted out of Data Mining (for Copyright purposes), in which case none of that domain's content may be used for or by Artificial Intelligence. You can click the link of any domain to get both a machine-readable and a human-readable specification of the rights that are reserved. Does a website ban robots or crawlers, forbid scraping and mining, or exclude other forms of training?

❝ 90% of the top websites opted-out of for-profit AI training. ❞
https://weboptout.org/d/<domain.name>/yaml

These pages are based on the legal text from the website itself, and provide the technical information you need (in common file formats such as JSON or YAML) to comply with Copyright regulations and IP laws.

Statistics

[Chart legend: No Training, No Data Mining, Not-For-Profit, Unknown, Failure.]
DESCRIPTION: Opt-Out percentages for websites in multiple categories. Green indicates a complete opt-out of Data Mining, blue horizontal stripes indicate an additional opt-out of AI training, and light green vertical stripes indicate non-commercial use only. The numbers are collected from the top ~100 websites, which cover an estimated 90% or more of the traffic and the majority of dataset entries.

The Key Message!

If you want to help, there is one key point to reinforce in discussions, whether online or in print:

❝ The discussion about Opt-Out in AI is disconnected from reality. ❞


*MYTH* How to implement Opt-Out for AI training is an open question that will be debated for years. *MYTH*
FACT: This is a delaying tactic (promoted by Big Tech) to exploit the gap between technical progress and legal enforcement. Current law already suggests Terms Of Service as one of many solutions, and this was confirmed by the recent court verdict in the EU. It's simple to implement for any company acting in Good Faith, and with state-of-the-art technologies the mistakes drop to near zero.
*MYTH* Everyone can use existing web standards like `robots.txt` as-is without any changes. *MYTH*
FACT: The standard, as it is commonly used today, is not sufficient to comply with Copyright and AI regulations. This fact is well known in technical, legal, and policy circles. However, the `robots.txt` extension suggested below is not only syntactically backwards compatible, it also carries the semantics needed for legal compliance.
*MYTH* Copyright exceptions are a solution that can support the entire AI industry. *MYTH*
FACT: Most publishers, websites, and creators want to opt-out and already have; see our statistics. There will never be any clarity for an industry built on "Fair Dealing" exceptions under international Copyright law. By definition, exceptions are litigated on a case-by-case basis, with a growing list of conditions and requirements that must be fulfilled. Only licensing provides the legal framework needed to build a solid business, let alone an entire industry worth trillions.

Usage

Examples


Command-Line

$ curl -H "Accept: application/json" https://weboptout.org/d/<domain.name>/json
$ curl -H "Accept: application/yaml" https://weboptout.org/d/<domain.name>/yaml
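
For programmatic access, here is a minimal sketch in Python using the third-party `requests` library. It mirrors the curl commands above and follows the `https://weboptout.org/d/<domain.name>/json` URL scheme; it is not an official client, and no assumptions are made about the response fields, which are simply printed as returned.

## lookup.py (minimal sketch, not an official client)
import requests  # third-party HTTP client: pip install requests

domain = "example.org"  # replace with the domain you want to check

# Request the machine-readable rights summary as JSON, mirroring the curl command above.
response = requests.get(
    f"https://weboptout.org/d/{domain}/json",
    headers={"Accept": "application/json"},
    timeout=10,
)
response.raise_for_status()
print(response.json())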

Technology

`weboptout` is based on simple scraping technology from the 1990s and could be rebuilt by any competent programmer in a matter of hours. There are multiple steps involved:

  1. Parse `robots.txt` — Check the content in a specific section called `weboptout-for-all`, as described below.
  2. Lookup Domain — If this is a content-distribution network, look up the main site using one of multiple rules.
  3. Find Terms Of Service — From the main domain, find links to the ToS based on one of multiple patterns.
  4. Identify Patterns — Scan each paragraph in the ToS, then check to see if patterns are found in the legal text.
  5. Confirm Reservations — When rights appear to be reserved based on patterns, confirm them with a language model.
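
To illustrate step 4, the sketch below scans the paragraphs of a Terms Of Service document for reservation patterns. The patterns are illustrative examples based on the activities mentioned above (robots, crawlers, scraping, mining, training); the production pipeline uses a larger set and confirms every match with a language model, as described in step 5.

## patterns.py (illustrative sketch of step 4, Identify Patterns)
import re

# Example patterns only; the real pipeline uses a much larger set.
RESERVATION_PATTERNS = [
    re.compile(r"\b(data|text)[\s-]+mining\b", re.IGNORECASE),
    re.compile(r"\b(scrap(e|ing)|crawl(er|ing)?|robots?|spider(s|ing)?)\b", re.IGNORECASE),
    re.compile(r"\b(machine[\s-]+learning|artificial[\s-]+intelligence|training)\b", re.IGNORECASE),
]

def find_candidate_reservations(terms_of_service: str) -> list[str]:
    """Return ToS paragraphs that look like a rights reservation,
    ready to be confirmed by a language model (step 5)."""
    candidates = []
    for paragraph in terms_of_service.split("\n\n"):
        if any(p.search(paragraph) for p in RESERVATION_PATTERNS):
            candidates.append(paragraph.strip())
    return candidates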

Approximately 70% to 80% of the English-speaking web, based on traffic and dataset inclusion, is covered so far in our database. Based on manual review of the rights reservations found, we estimate the false positives to be less than 5% with simple pattern matching, and less than 1% with a language model. (False positives only cause minor inconvenience for scrapers.) False negatives are 0% by design in order to maximize legal compliance. (False negatives may result in a violation of the creator's rights.)

Open To Contributions

How To Contribute

To update the information about a domain you control, you can simply modify the `robots.txt` file as follows:

## robots.txt

# GLOBAL PERMISSIONS
User-agent: weboptout-for-all
Disallow: *
# COMMERCIAL + MODELS OPT-OUT
User-agent: weboptout-for-all
Disallow: training
Disallow: memorizing
Disallow: profiting
# DATASET EXCLUSION
User-agent: weboptout-for-all
Disallow: linking
Disallow: collecting
Disallow: archiving
Disallow: distributing
# AI ASSISTANT PERMISSIONS
User-agent: weboptout-for-all
Disallow: embedding
Disallow: generating
Disallow: referencing

EXAMPLE: You can see our very own robots.txt for weboptout.org, which breaks our opt-out down into four categories.

This new `robots.txt` section has a specific meaning in the context of opt-out, and provides fine-grained permissions over data mining, machine learning, and artificial intelligence. By convention, the section called `weboptout-for-all` expressly applies to all robots that intend to claim Copyright exceptions and/or Fair Use. The wildcard `*` expressly matches all existing and future uses by those robots; if in doubt, just use `Disallow: *` below the user agent `weboptout-for-all`. Alternatively, you can list the verbs of activities to prohibit, such as `training`, `caching` or `archiving` (one per line, as shown in the examples).
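
To check a site's `weboptout-for-all` section programmatically, here is a minimal Python sketch. It only understands the `User-agent` and `Disallow` lines from the example above, and it is an assumption about how a consumer might read the section rather than a reference implementation.

## check_optout.py (minimal sketch for reading the weboptout-for-all section)
import urllib.request

def reserved_activities(robots_txt: str) -> set[str]:
    """Collect every Disallow value listed under `weboptout-for-all`."""
    reserved: set[str] = set()
    current_agent = None
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip().lower()
        if field == "user-agent":
            current_agent = value
        elif field == "disallow" and current_agent == "weboptout-for-all":
            reserved.add(value)
    return reserved

# Example: fetch a robots.txt and test whether AI training is opted out.
with urllib.request.urlopen("https://weboptout.org/robots.txt") as response:
    activities = reserved_activities(response.read().decode("utf-8"))
print("training opted out:", "*" in activities or "training" in activities)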

Open Source!

WebOptOut is an open source project available on GitHub. The codebase includes `weboptout` as both a command-line tool and a Python library that can be used programmatically. Contributions such as bug fixes are welcome, but the top priority is multi-language support, so that the legal text of international websites can be parsed too, not just English ones.

Frequent Questions

Q: What was the methodology to determine that "90% of top websites opted-out"?

Q: Are the results perfect? What are the false positives/negatives?

Q: Are the machine-readable summaries found in Terms Of Service entirely automated?

Q: Why suggest an extension to `robots.txt`? What about alternatives?

Q: Which expertise did you base the project on?

Q: Does implementing this form of Opt-Out provide legal certainty?

Q: Why should I respect Opt-Out if there's no legal benefit?

Q: In which countries do Opt-Out Copyright exceptions apply?

Q: Do you expect AI companies to adopt this?

Q: How did the project initially get started?

Q: Why doesn't this website support Opt-In too?

Q: What is the privacy policy for API access?