Frequent Questions
Q: What was the methodology used to determine that "90% of top websites opted out"?
A:
The study reported in the statistics section uses both the Top 100 domains by traffic and the Top 100 domains found in popular datasets. There is obviously overlap between the two lists, and we kept only the English-language sites. Running the code against each of the corresponding websites then produces metadata about its opt-out status. The most important category is the opt-out from text- and data-mining (TDM) under the relevant regulations, which alone accounts for 86% of the reservations. When opt-outs from commercial activities are included, a total of almost 91% of top websites have opted out of for-profit TDM activities.
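As a rough illustration of the aggregation step, the sketch below tallies per-domain opt-out categories into the headline percentages. The domain names, category labels, and numbers are made up for the example; they are not the tool's actual schema or the real dataset.

```python
# Hypothetical per-domain scan results; the real tool derives these from
# each site's Terms of Service and robots.txt. Labels are illustrative.
scan_results = {
    "news.example": {"tdm"},                # TDM reservation only
    "shop.example": {"tdm", "commercial"},  # TDM + commercial opt-out
    "blog.example": {"commercial"},         # commercial opt-out only
    "open.example": set(),                  # no reservation found
}

def summarize(results):
    """Return the share of domains with a TDM opt-out and with any opt-out."""
    total = len(results)
    tdm = sum(1 for cats in results.values() if "tdm" in cats)
    any_opt_out = sum(1 for cats in results.values() if cats)
    return {
        "tdm_pct": 100.0 * tdm / total,
        "any_opt_out_pct": 100.0 * any_opt_out / total,
    }

print(summarize(scan_results))  # {'tdm_pct': 50.0, 'any_opt_out_pct': 75.0}
```

The real study runs the same kind of tally over the scanned Top-100 lists, which is how the 86% (TDM alone) and ~91% (including commercial opt-outs) figures are produced.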
Q: Are the results perfect? What are the false positives/negatives?
A:
No, obviously! The results will never be perfect, as there are many layers of technology and people involved at every step: mistakes will continuously appear, but also get resolved. As for accuracy: false positives, where an opt-out is reported but doesn't exist, only cause minor inconvenience to corporations; these currently occur in fewer than 4% of cases. False negatives, where an opt-out is requested but not reported by the tool, would directly harm creators and strip their rights; these are ~0% by design to increase compliance.
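The asymmetry above can be captured in a single rule: only explicit permission clears a website, and ambiguity is treated as a reservation. A minimal sketch of that design choice follows; the evidence labels are hypothetical, not the tool's actual output.

```python
def classify_opt_out(evidence: str) -> bool:
    """Decide whether to report an opt-out for a website.

    `evidence` is one of "permission", "reservation", or "ambiguous".
    Only explicit permission yields False: ambiguity is treated as a
    reservation, so false negatives (missed opt-outs, which would harm
    creators) are eliminated by design, at the cost of a few extra
    false positives that merely inconvenience scrapers.
    """
    return evidence != "permission"

print(classify_opt_out("ambiguous"))  # True: err on the side of creators
```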
Q: Are the machine-readable summaries found in Terms Of Service entirely automated?
A:
Nothing is entirely automatic. A human programmer designed and wrote the program, deployed and tested it, then collected feedback and verified the results. There is always this kind of high-level manual supervision and iteration in software; as such, the machine-readable summaries you find here could be considered hybrid-produced. There is no legal requirement for this work to be fully automated, and in fact, acting with due diligence (given the importance of respecting the Human Rights of creators) requires human oversight to check and improve accuracy above all.
Q: Why suggest an extension to `robots.txt`? What about alternatives?
A:
Much of the existing legislation and many policy suggestions focus on `robots.txt`, partly because it's the main example of an industry-wide opt-out system. So it makes sense to run with that idea — especially as the suggested alternatives are technically deficient! The design suggested above is backward compatible and requires (a) no extra code to handle, and (b) no additional files to be downloaded. Besides the convenience, the design is also better than more formal standards (e.g. the TDM Reservation Protocol) because it provides an extensible way to reserve rights: you just list the actions to disallow (e.g. scraping, mining, training).
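The answer above doesn't fix an exact syntax, but one backward-compatible shape for such an extension might look like the fragment below. The directive names here are hypothetical illustrations, not a published standard.

```
User-agent: *
Disallow: /private/      # classic rule, still understood by every crawler
Disallow-TDM: mining     # hypothetical extension: reserve text- and data-mining
Disallow-TDM: training   # hypothetical extension: reserve use for AI training
```

Parsers following the Robots Exclusion Protocol ignore lines they don't recognize, which is what makes an in-file extension backward compatible with no extra code and no additional files to fetch.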
Q: Which expertise did you base the project on?
A:
Hi, I'm Alex. You might remember me from... working professionally in AI and the creative industries for almost 25 years — in roles spanning technology to design, including managing director. On the legal side, I was recently a technical advisor and named witness in the only lawsuit worldwide involving Copyright and Opt-Out (in Hamburg). I've done forensic studies of scraper activity and assisted analyses of content in popular datasets. I don't know of any other expert more qualified in the real-world implementation of Opt-Out as a Copyright exception, and if there is one, I'd love to talk to them!
The functionality of the early prototype of `weboptout`, built 15 months before the LAION verdict in anticipation that this was the core issue, indeed ended up being central to the judges' ruling on the topic of Opt-Out (German Copyright Act, UrhG §44b). On the policy side, the insights from this project (e.g. that the web has mostly opted out already) have been widely published and raised in policy discussions worldwide — including with a reluctant AI Office.
Q: Does implementing this form of Opt-Out provide legal certainty?
A:
Not entirely, but it helps. Relying on Copyright exceptions like Opt-Out ("Fair Dealing"-style provisions) is an inherently risky approach because they are evaluated on a case-by-case basis. If a rightholder finds evidence of infringement, then they have a valid claim, and the burden of proof is on the defendant to justify in court exactly which Copyright exception they qualify for. Simply respecting Opt-Out is not sufficient to win that argument.
Q: Why should I respect Opt-Out if there's no legal benefit?
A:
A Copyright exception is not a substitute for licensing and cannot be the entire basis of a commercial operation, as this would undermine Copyright itself. An exception may allow AI providers operating in Good Faith to make mistakes with a small percentage of the files used in training. Given the scale of datasets, a reasonable allowance for such mistakes would be significantly less than 1% of inputs. `weboptout` strives to eliminate false negatives by design for this reason.
Q: In which countries do Opt-Out Copyright exceptions apply?
A:
Opt-out regulations originated predominantly in the European Union, but are now being suggested by the government in the United Kingdom and proposed to policymakers in the United States. Under international Copyright treaties (e.g. the Berne Convention), jurisdiction lies with the country where the infringement occurs. As such, regardless of where you are operating, if you interact with a jurisdiction that mandates opt-out, you are obliged to respect it. (This includes scraping content from the EU, or releasing a model/product there.)
Q: Do you expect AI companies to adopt this?
A:
Companies don't have to use this exact solution, but they carry a legal burden to implement something at least as good and to explain why it is. Anything less would be operating in Bad Faith, which weighs heavily against a favorable verdict under any Copyright exception and/or Fair Use. Companies setting up fake "opt-out" systems that are intentionally flawed or below industry standard will struggle in court.
Q: How did the project initially get started?
A:
The initial prototype was built in a few hours, trying to answer the hypothetical question: ❝What if AI companies tried to follow the law on Copyright?❞ It turns out it is very easy to check a website's Terms Of Service (as specified by Copyright regulations) to determine whether scraping/mining is allowed, using simple technology from the 1990s and minor programming work. It became an open-source project almost immediately after that, as an example of what Good Faith compliance with regulations looks like.
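In that spirit, here is a minimal sketch of the kind of "1990s technology" involved: scanning a Terms Of Service page for rights-reservation language. The patterns below are illustrative assumptions, not the project's actual rules.

```python
import re

# Illustrative patterns for common rights-reservation phrasing in a
# Terms of Service page; a real tool's rules would be far more extensive.
RESERVATION_PATTERNS = [
    r"all rights reserved",
    r"(?:no|not|without)\s+.*\b(?:scraping|crawling|data\s*mining)\b",
    r"text\s+and\s+data\s+mining",
]

def reserves_rights(tos_text: str) -> bool:
    """Return True if the page appears to reserve scraping/mining rights."""
    text = tos_text.lower()
    return any(re.search(pattern, text) for pattern in RESERVATION_PATTERNS)

print(reserves_rights("You may not engage in text and data mining."))  # True
print(reserves_rights("Welcome to our homepage!"))                     # False
```

Fetching the page itself is plain HTTP plus an HTML-to-text pass, which is exactly why the prototype took hours rather than months.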
Q: Why doesn't this website support Opt-In too?
A:
In order to be legally sound, Opt-In must be done via licensing. However, there is no established basis for licensing via `robots.txt` or even Terms Of Service, and even if there were, a different infrastructure would be needed to process these as contracts. (Anything less would have no legal weight, and would only serve as a form of data laundering.) Instead, look for datasets of permissively licensed or public-domain content, or have a lawyer check with the company you want to enter a licensing agreement with.
Q: What is the privacy policy for API access?
A:
Only robots are allowed to access the API, and robots do not have rights to privacy. As such, the access logs for the API are retained permanently. In case of third-party lawsuits, we will provide information about access with a court order and for an operating fee. We believe in transparency, and thus will indirectly support parties that operate transparently — either defendants who intend to prove they acted in Good Faith or plaintiffs who would like to document the opposite.