Copyright exceptions for data mining based on Opt-Out are law in 27 countries. Since the topic is increasingly discussed, and since Opt-Out is often misunderstood or abused, it's important to clarify the requirements and establish some baseline facts.

Overview

Most of the web has opted-out already! We checked the Terms Of Service.

weboptout.org tells you whether a website has opted out of Data Mining (for Copyright purposes), in which case none of that domain's content may be used for or by Artificial Intelligence. You can click the link of any domain to get both a machine-readable and a human-readable specification of the rights that are reserved. Does a website ban robots or crawlers, forbid scraping and mining, or exclude other forms of training?

❝ 90% of the top websites opted-out of for-profit AI training. ❞
https://weboptout.org/d/<domain.name>/yaml

These pages are based on the legal text from the website itself, and provide the technical information you need (in common file formats such as JSON or YAML) to comply with Copyright regulations and IP laws.

Statistics

[Chart legend: No Training, No Data Mining, Not-For-Profit, Unknown, Failure.]
DESCRIPTION: Opt-Out percentages for websites in multiple categories. Green indicates a complete opt-out of Data Mining, blue horizontal stripes indicate an additional opt-out of AI training, and light green vertical stripes indicate non-commercial use only. The numbers are collected from the top ~100 websites, which cover an estimated 90% or more of the traffic and the majority of dataset entries.

The Key Message!

If you want to help, there is one key point to reinforce in discussions, whether online or in print:

❝ The discussion about Opt-Out in AI is disconnected from reality. ❞


*MYTH* How to implement Opt-Out for AI training is an open question that will be debated for years. *MYTH*
FACT: This is a delaying tactic (promoted by Big Tech) to exploit the gap between technical progress and legal enforcement. Current law already suggests Terms Of Service as one of many solutions, and this was confirmed by the recent court verdict in the EU. It's simple to implement for any company acting in Good Faith, and with state-of-the-art technologies the mistakes drop to near zero.
*MYTH* Everyone can use existing web standards like `robots.txt` as-is without any changes. *MYTH*
FACT: The standard, as it is commonly used today, is not sufficient to comply with Copyright and AI regulations. This fact is well known in technical, legal, and policy circles. However, the `robots.txt` extension suggested below is not only syntactically backwards compatible, it also carries the semantics needed for legal compliance.
*MYTH* Copyright exceptions are a solution that can support the entire AI industry. *MYTH*
FACT: Most publishers, websites, and creators want to opt-out and already have; see our statistics. There will never be any clarity for an industry built on "Fair Dealing" exceptions under international Copyright law. By definition, exceptions are litigated on a case-by-case basis, with a growing list of conditions and requirements that must be fulfilled. Only licensing provides the legal framework needed to build a solid business, let alone an entire industry worth trillions.

Usage

Examples


Command-Line

$ curl -H "Accept: application/json" https://weboptout.org/d/<domain.name>/json
$ curl -H "Accept: application/yaml" https://weboptout.org/d/<domain.name>/yaml
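
For programmatic access, here is a minimal sketch in Python using the third-party `requests` library. It mirrors the curl commands above and follows the `https://weboptout.org/d/<domain.name>/json` URL scheme; it is not an official client, and no assumptions are made about the response fields, which are simply printed as returned.

## lookup.py (minimal sketch, not an official client)
import requests  # third-party HTTP client: pip install requests

domain = "example.org"  # replace with the domain you want to check

# Request the machine-readable rights summary as JSON, mirroring the curl command above.
response = requests.get(
    f"https://weboptout.org/d/{domain}/json",
    headers={"Accept": "application/json"},
    timeout=10,
)
response.raise_for_status()
print(response.json())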

Technology

`weboptout` is based on simple scraping technology from the 1990s and could be rebuilt by any competent programmer in a matter of hours. There are multiple steps involved:

  1. Parse `robots.txt` — Check the content in a specific section called `weboptout-for-all`, as described below.
  2. Lookup Domain — If this is a content-distribution network, look up the main site using one of multiple rules.
  3. Find Terms Of Service — From the main domain, find links to the ToS based on one of multiple patterns.
  4. Identify Patterns — Scan each paragraph in the ToS, then check to see if patterns are found in the legal text.
  5. Confirm Reservations — When rights appear to be reserved based on patterns, confirm them with a language model.
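
To illustrate step 4, the sketch below scans the paragraphs of a Terms Of Service document for reservation patterns. The patterns are illustrative examples based on the activities mentioned above (robots, crawlers, scraping, mining, training); the production pipeline uses a larger set and confirms every match with a language model, as described in step 5.

## patterns.py (illustrative sketch of step 4, Identify Patterns)
import re

# Example patterns only; the real pipeline uses a much larger set.
RESERVATION_PATTERNS = [
    re.compile(r"\b(data|text)[\s-]+mining\b", re.IGNORECASE),
    re.compile(r"\b(scrap(e|ing)|crawl(er|ing)?|robots?|spider(s|ing)?)\b", re.IGNORECASE),
    re.compile(r"\b(machine[\s-]+learning|artificial[\s-]+intelligence|training)\b", re.IGNORECASE),
]

def find_candidate_reservations(terms_of_service: str) -> list[str]:
    """Return ToS paragraphs that look like a rights reservation,
    ready to be confirmed by a language model (step 5)."""
    candidates = []
    for paragraph in terms_of_service.split("\n\n"):
        if any(p.search(paragraph) for p in RESERVATION_PATTERNS):
            candidates.append(paragraph.strip())
    return candidates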

Approximately 70% to 80% of the English-speaking web, based on traffic and dataset inclusion, is covered so far in our database. Based on manual review of the rights reservations found, we estimate the false positives to be less than 5% with simple pattern matching, and less than 1% with a language model. (False positives only cause minor inconvenience for scrapers.) False negatives are 0% by design in order to maximize legal compliance. (False negatives may result in a violation of the creator's rights.)

Open To Contributions

How To Contribute

To update the information about a domain you control, you can simply modify the `robots.txt` file as follows:

## robots.txt

# GLOBAL PERMISSIONS
User-agent: weboptout-for-all
Disallow: *
# COMMERCIAL + MODELS OPT-OUT
User-agent: weboptout-for-all
Disallow: training
Disallow: memorizing
Disallow: profiting
# DATASET EXCLUSION
User-agent: weboptout-for-all
Disallow: linking
Disallow: collecting
Disallow: archiving
Disallow: distributing
# AI ASSISTANT PERMISSIONS
User-agent: weboptout-for-all
Disallow: embedding
Disallow: generating
Disallow: referencing

EXAMPLE: You can see our very own robots.txt for weboptout.org, which breaks our opt-out down into four categories.

This new `robots.txt` section has a specific meaning in the context of opt-out, and provides fine-grained permissions over data mining, machine learning, and artificial intelligence. By convention, the section called `weboptout-for-all` expressly applies to all robots that intend to claim Copyright exceptions and/or Fair Use. The wildcard `*` expressly matches all existing and future uses by those robots; if in doubt, just use `Disallow: *` below the user agent `weboptout-for-all`. Alternatively, you can list the verbs of activities to prohibit, such as `training`, `caching` or `archiving` (one per line, as shown in the examples).
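
To check a site's `weboptout-for-all` section programmatically, here is a minimal Python sketch. It only understands the `User-agent` and `Disallow` lines from the example above, and it is an assumption about how a consumer might read the section rather than a reference implementation.

## check_optout.py (minimal sketch for reading the weboptout-for-all section)
import urllib.request

def reserved_activities(robots_txt: str) -> set[str]:
    """Collect every Disallow value listed under `weboptout-for-all`."""
    reserved: set[str] = set()
    current_agent = None
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip().lower()
        if field == "user-agent":
            current_agent = value
        elif field == "disallow" and current_agent == "weboptout-for-all":
            reserved.add(value)
    return reserved

# Example: fetch a robots.txt and test whether AI training is opted out.
with urllib.request.urlopen("https://weboptout.org/robots.txt") as response:
    activities = reserved_activities(response.read().decode("utf-8"))
print("training opted out:", "*" in activities or "training" in activities)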

Open Source!

WebOptOut is an open source project available on GitHub. The codebase includes `weboptout` as both a command-line tool and a Python library that can be used programmatically. Contributions such as bug fixes are welcome, but the top priority is multi-language support, so that the legal text of international websites can be parsed too, not just English ones.

Frequent Questions

Q: What was the methodology to determine that "90% of top websites opted-out"?

Q: Are the results perfect? What are the false positives/negatives?

Q: Are the machine-readable summaries found in Terms Of Service entirely automated?

Q: Why suggest an extension to `robots.txt`? What about alternatives?

Q: Which expertise did you base the project on?

Q: Does implementing this form of Opt-Out provide legal certainty?

Q: Why should I respect Opt-Out if there's no legal benefit?

Q: In which countries do Opt-Out Copyright exceptions apply?

Q: Do you expect AI companies to adopt this?

Q: How did the project initially get started?

Q: Why doesn't this website support Opt-In too?

Q: What is the privacy policy for API access?