Meta admits using pirated books to train AI, but won't pay for it

Lee Duna · 5 months ago

TWeaK · 5 months ago

So where does that say I’m wrong?

I said fair use covers news, education, research, criticism, or comment.

for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research

Then I said the next thing considered is whether it is commercial.

In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include— (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

I didn’t cover everything in the law, I just covered the relevant points in a way that could be easily understood and related to the subject at hand.

My point is that the copying AI does isn’t really research, but even if it were considered research it is absolutely commercial and thus should not have a fair use exemption.

@[email protected] · edit-2 5 months ago

You need to read this carefully. It’s a statute. It means exactly what it says.

purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research

Such as means that these are examples. This is not a complete list.

the factors to be considered shall include

All of these factors must be considered. It does not mean that other factors cannot be considered. These are not categories.

A commercial purpose does not rule out a finding of fair use (and vice versa). It must be considered and that is all.

I don’t think that Meta’s use can be classed as commercial. Presumably, they do hope that the research budget will pay off eventually. But what must be considered is the particular copying in question. Llama 2’s license looks to me fairly non-commercial.

Eventually, fair use derives from the constitution. Copyright is a limitation on the freedom of the press (and of speech). But it cannot completely do away with these freedoms. The examples given in the statute here could not be banned completely even if they were not mentioned.

The US Constitution itself allows Congress to create copyrights. Or more precisely, it empowers Congress to promote the Progress of Science and useful Arts by creating copyrights. That’s another limitation.

I’ve seen a number of far-right commenters admit that this money grab would harm AI development (a “useful Art”). I think mostly these commenters hold some far-right ideology à la Ayn Rand that values property over society, but some may just be selfish and believe that they would personally benefit. Either way, it’s straight up anti-constitutional.

@[email protected] · 5 months ago

Here’s the summary for the wikipedia article you mentioned in your comment:

The Copyright Clause (also known as the Intellectual Property Clause, Copyright and Patent Clause, or the Progress Clause) describes an enumerated power listed in the United States Constitution (Article I, Section 8, Clause 8). The clause, which is the basis of copyright and patent laws in the United States, states that: [the United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.

TWeaK · 5 months ago

Such as means that these are examples. This is not a complete list.

AI developers have explicitly invoked the research exemption. That is why I focused on that. I disagree that what they do is “research” for the reasons I gave previously. Bringing up the fact there are other exemptions is beside the point - they aren’t claiming any other exemption!

All of these factors must be considered. It does not mean that other factors cannot be considered. These are not categories.

Sure, but I never said that commerciality was the only thing that should be considered. My claim here is simply that it is so overwhelmingly commercial in nature that it overrides anything else and thus they should not be awarded the privilege of an exemption.

A commercial purpose does not rule out a finding of fair use (and vice versa).

A commercial purpose might not rule out of a finding of fair use. That does not mean it cannot rule out such a finding. All factors must be considered, but any one factor can outweigh the others.

I never said it was an exclusive category, I just brought it up as the most significant factor - one which is not reasonably overruled by any of the others in this circumstance. In fact, every one of those arguably fails. To give detail:

1. The copying is done in a commercial nature. They sell AI services. It’s offered very cheap right now - even for free for limited personal use - but eventually that will change as their demand for profit grows.
2. The nature of the copied work is varied and includes all kinds of work, commercial and non-commercial. The copying is pandemic.
3. The whole work has been copied into the training database. Significant portions of the work can and have been reproduced by the finished product, in spite of the finished product allegedly not containing the original work in its database. Furthermore, even if a human genuinely believes they aren’t copying something they read before, that does not mean they are innocent of copyright infringement - it is the similarity of the two works that is the determining factor.
4. AI work is already flooding the market and pushing out original creators. Children’s books are one area where this is happening extensively - not only does this make it harder for genuine authors to get a break in the market, but they’re effectively training children to think AI work is normal. It’s not hard to see us headed to a future where people think AI is “real” and original work is “fake”, simply by volume.

I will admit, not all of those arguments are very strong (particularly 4.). However 1. is the strongest and I think overrides any argument the other way for any other.

I don’t think that Meta’s use can be classed as commercial. Presumably, they do hope that the research budget will pay off eventually.

Those two statements contradict one another. Of course they want it to be commercial eventually - or, rather, they want to eventually turn a profit. Hell, AI is already being used in a commercial manner: if you want to make significant or non-personal use of AI systems currently on the market, you have to pay for it.

Eventually, fair use derives from the constitution.

Setting aside the fact that AI extends far beyond the borders of the US and its constitution, fair use and copyright are derived from copyright law, which is written by Congress. The Constitution grants Congress the right to write such laws, but no one is “invoking the Constitution” when they enforce copyright or claim fair use. The Constitution gives permission, but the law forms the definition.

AI is not simply a “useful Art”. It is a commercial venture that exploits original work without duly compensating the authors of said work. Congress has a greater duty to protect those original authors than it does a business that seeks to exploit their work. I say this as someone who has never really made much of anything original myself. I play a bit of music, but don’t compose and just do covers. I probably (lol limewire definitely) infringe on copyright - but I do so exclusively in a non-commercial manner.

Blurting out “far-right” is borderline a personal insult - one which is laughably far from the mark when addressed towards me - and points to you clutching at straws to cling to a frivolous argument.

I now feel the need to ask, why do you so passionately defend AI businesses here? Why do you support them?

Are you that infatuated with the novelty of their product that you have let go of objectivity?

I also have to emphasise again that I’m a little disgusted that you made this political. You’ve tried to build an argument that “it is a Constitutional right” to infringe copyright in order to have AI tools, and you’re implying that anyone who opposes that idea is some kind of far-right nutjob. I hadn’t even heard of Ayn Rand before you mentioned her, but have you actually read her work, or did you just watch the Atlas Shrugged movie and form your opinions from internet memes?

I’d actually probably agree with you about AI - if it was non-commercial in nature and truly for the benefit of the people. As it is, I think you are blinded by the sheen of a new toy, without realising it’s coated in lead paint.

TWeaK · 5 months ago

A commercial purpose might not rule out of a finding of fair use.

ARRRRG I spent so long reviewing this comment, over and over and over again, and still there were words wrong. I’m not editing it though, I want the comment to stay clean.

archomrade [he/him] · edit-2 5 months ago

Pretty sure @[email protected] is referring to this portion:

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole

(4) the effect of the use upon the potential market for or value of the copyrighted work.

The main argument for this being fair use is both that a single work of copyright bears little to no relationship to the end product, and that the model itself does not affect the market for - or value of - the copyrighted work (note: the market for additional works produced is not what is in question here, it is the market for the work that has been copied).

TWeaK · 5 months ago

Sorry for the double reply, but I did also expand further upon (3) and (4), and other aspects, in my latest reply to /u/[email protected] (link to your instance’s version): https://midwest.social/comment/6225045

TWeaK · 5 months ago

The main argument for this being fair use is both that a single work of copyright bears little to no relationship to the end product

It bears relationship to the end product when the end product reproduces the original work.

that the model itself does not affect the market for - or value of - the copyrighted work

Given that AI is poised to take over the position of original writers and flood the market with fake work, copying not only their words but their very style, I’d argue it does affect the value of existing work. With children’s books already being heavily written by AI, it seems quite likely that we will before too long get to the point where people expect things to be written by AI, thus devaluing true creative and original work.

archomrade [he/him] · 5 months ago

I appreciate your enthusiasm here, but the law (and precedent reading of the law) simply does not bear out a clear interpretation like you’re suggesting.

It bears relationship to the end product when the end product reproduces the original work.

This is not how copyright has been applied when speaking of other machine learning processes using logical regression that is considered fair use, as in Text and Data Mining classifications(TDM) (proposed class 7(a) and 7(b) (page 102) in Recommendation of the Register of Copyrights 2021). The model itself is simply a very large regression model, that has created metadata analysis from unstructured data sources. When determining whether an LLM fits into this fair use category, they will look at what the model is and how it is created, not to whether it can be prompted to recreate a similar work. To quote from Comments in Response of Notice of Inquiry on the matter:

Understanding the process of training foundation models is relevant to the generative AI systems’ fair use defenses because the scope of copyright protection does not extend to “statistical information” such as “word frequencies, syntactic patterns, and thematic markers.” Processing in-copyright works to extract “information about the original [work]” does not infringe because it does not “replicat[e] protected expression.”
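
As a rough illustration of what “statistical information” such as word frequencies means in the quote above, here is a minimal Python sketch. The function name and the sample sentence are hypothetical, and this is only a toy count of words, not a description of how any actual model is trained.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each lowercase word appears in the text."""
    # Split into simple word tokens; punctuation and word order are discarded.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample text, used only for illustration.
sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)
print(freqs.most_common(2))  # [('the', 3), ('mat', 2)]
```

The resulting counts say something about the text (which words occur and how often) without retaining the sentences themselves, which is the distinction the quoted passage draws.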

Granted, what is novel about this particular case (LLM’s generally) is their apparent ability to re-construct substantially similar works from the same overall process of TDM. Acknowledged, but to borrow again from the same comments as above:

Yet, in limited situations, Generative AI models do copy the training data. So unlike prior copy-reliant technologies that courts have held are fair use, it is impossible to say categorically that inputs and outputs of Generative AI will always be fair use. We note in addition that some have argued that the ability of Generative AI to produce artifacts that could pass for human expression and the potential scale of such production may have implications not seen in previous non-expressive use cases. The difficulty with such arguments is that the harm asserted does not flow from the communication of protected expression to any human audience.

Basically, they are asserting that applying copyright to this use that falls outside of its explicit scope would not prevent the same harm caused by that same technology created without the use of the copyrighted works. Any work sufficiently described in publicly available text data could be reconstructed by a sufficiently weighted regression model and the correct prompting. E.g. - if I described a desired output sufficiently enough in my input to the model, the output could be substantially similar to a protected work, regardless of its lack of representation in the training data.

I happen to agree that these AI models represent a threat to the work and livelihoods of real artists, and that the benefit as currently captured by billion-dollar companies is a substantial problem that must be addressed, but I simply do not think the application of copyright in this manner is appropriate (as it will prevent legitimate uses of the technology), nor do I think it is sufficiently preventative of future consolidation of wealth by the use of these models.

Never mind my personal objections to copyright law on the basis of my worldview - I just don’t think copyright is the correct tool to use for the desired protection.

TWeaK · 5 months ago

This is not how copyright has been applied when speaking of other machine learning processes using logical regression that is considered fair use, as in Text and Data Mining classifications(TDM) (proposed class 7(a) and 7(b) (page 102) in Recommendation of the Register of Copyrights 2021).

Your link is merely proposed recommendations. That is not legislation nor case law. Also, the sections on TDM that you reference clearly state (my emphasis):

for the purpose of scholarly research and teaching.

I think this is even more abundantly clear that the research exemption does not apply. AI “research” is in no way “scholarly”, it is commercial product development and thus does not align with fair use copyright exemptions.

It’s also not talking about building AI, but circumventing DRM in order to preserve art. They’re saying that there should be an exemption to the illegal practice of circumventing DRM in certain, limited circumstances. However, they’re still only suggesting this! So not only does this not apply to your argument, it isn’t even actually in force.

To put your other link into context, this also is not law, but comments from legal professors.

Understanding the process of training foundation models is relevant to the generative AI systems’ fair use defenses because the scope of copyright protection does not extend to “statistical information” such as “word frequencies, syntactic patterns, and thematic markers.” Processing in-copyright works to extract “information about the original [work]” does not infringe because it does not “replicat[e] protected expression.”

The flaw here is that the work isn’t processed in situ, it is copied into a training database, then processed. The processing may be fine, but the copying is illegal.

If they had a legitimate license to view the work for their purpose, and processed in situ, that might be different.

The difficulty with such arguments is that the harm asserted does not flow from the communication of protected expression to any human audience.

The argument here is that, while it sometimes infringes copyright, the harm it causes isn’t primarily from the infringing act. Not always, though that depends. If AI is used to pass off as someone else, then the AI manufacturer has built a tool that facilitates an illegal act, by copying the original work.

However, this, again, ignores the fact that the commercial enterprise has copied the data into their training database without duly compensating the rightsholder.

archomrade [he/him] · 5 months ago

Your link is merely proposed recommendations. That is not legislation nor case law.

It’s also not talking about building AI, but circumventing DRM in order to preserve art. They’re saying that there should be an exemption to the illegal practice of circumventing DRM in certain, limited circumstances. However, they’re still only suggesting this! So not only does this not apply to your argument, it isn’t even actually in force.

At the bottom of the document, the Library of Congress approves all recommendations and adopts them as legal defenses against copyright claims. This is established law, not merely recommendations. Please understand the legal processes we’re discussing here.

Regardless, I’m not arguing that this exemption class 7(a) and 7(b) actually apply to AI and LLM’s, only that they serve as precedent guidance on how they should be treated in any suit raised. Granted, OpenAI is not a research institution, so this classification would not apply on those grounds, but the way they treat the work being challenged is still relevant. LLM’s are transformative in nature. Their use and nature are distinctly similar to that of a searchable database described in Authors Guild, Inc. v. HathiTrust and Authors Guild v. Google (the legal strength is even greater here, since LLM outputs are creative, and do not provide ‘copied’ expressions as a matter of course - fringe cases notwithstanding), and as such we have no reason to expect they’d view it differently in the case of an LLM. Training data is a utilitarian precursor to an expressive tool, as repeatedly affirmed as fair use in existing precedent.

The flaw here is that the work isn’t processed in situ, it is copied into a training database, then processed. The processing may be fine, but the copying is illegal

Fair use describes exemptions to the illegality of unauthorized copies; it explicitly asserts that the copying is legal for a given use. See Authors Guild, Inc. v. HathiTrust and Authors Guild v. Google for reference. It is worth pointing out the distinction between a right to control unauthorized use and unauthorized access, and admittedly this would be the weakest point in Meta’s case. However, I share the paper author’s perspective on illicit sources:

On the other hand, as Michael Carroll argues, there are strong arguments to be made that copying from an infringing source may still be fair use. Carroll argues that ‘[t]reating an otherwise fair use as unfair because it was made from an infringing source would lead a court to deny the public access to the products of secondary uses that fair use is designed to encourage.’ He notes that significant doubt exists as to whether good faith is a consideration in fair use at all. Judge Pierre Leval has also persuasively argued that using a good faith inquiry in fair use analysis ‘produces anomalies that conflict with the goals of copyright and adds to the confusion surrounding the doctrine.’ Moreover, even if good faith is part of the broader fair use calculus, courts have found that knowing use of an infringing source is not bad faith when the user acts in the reasonable belief that their use is a fair use. There is no recognized ‘fruit of the poisonous tree’ doctrine in copyright law.

The argument being proposed in the paper (for once, you are correct that this is not established law) is that in other, different cases where TDM is used as a precursor to expressive use, the collection of data for that purpose has been found to be lawful (provided sufficient security is used to prevent infringing, non-exempt abuses). However, the issue we’re discussing is novel. The paper is proposing frameworks for how to apply existing precedent to the novel use-case being investigated. There is no case-law to refer to that addresses this specific situation. I can’t tell if you’re just trying to debate-bro me or actually discuss the merits of the case, but I’d just remind you that none of this is settled, nor am I suggesting it is. My perspective is that precedent supports training data for LLM’s as a fair use, and that strengthening copyright in the way proposed does not mitigate the harm being claimed by plaintiffs, and in fact increases harm to the greater public by gatekeeping access to automation tools and consolidating the benefits to already gigantic companies.

If AI is used to pass off as someone else, then the AI manufacturer has built a tool that facilitates an illegal act, by copying the original work.

That’s not an issue for copyright, but I agree it ought to be addressed. Once again, the harm doesn’t stem from the use of copyrighted material, it stems from the technology itself (the harm doesn’t change whether the material is authorized or not, nor does it change to whom harm is done). I really have to stress again that the issues and concerns being raised over AI cannot be sufficiently addressed through the use of copyright law.

TWeaK · 5 months ago

At the bottom of the document, the Library of Congress approves all recommendations and adopts them as legal defenses against copyright claims. This is established law, not merely recommendations.

Thank you for the clarification.

Their use and nature are distinctly similar to that of a searchable database described in Authors Guild, Inc. v. HathiTrust and Authors Guild v. Google (the legal strength is even greater here, since LLM outputs are creative, and do not provide ‘copied’ expressions as a matter of course - fringe cases notwithstanding), and as such we have no reason to expect they’d view it differently in the case of an LLM. Training data is a utilitarian precursor to an expressive tool, as repeatedly affirmed as fair use in existing precedent.

This is indeed a complicated subject, and thank you again for your insight. These are very good example cases, because Google’s searchable book database is exactly the same as the training databases LLM’s use to develop their transform nodes.

The difference between the Authors Guild cases and this one, as I see it, is that Google and HathiTrust are acting to preserve information and art for future generations - there is an inherent benefit to society front and centre with their goals. With LLM’s, the goal is to develop a commercial product. Yes, people can use it for free (right now) but ultimately they expect to sell access and profit from it. Also, no one else gets access to their training database, it is kept as some sort of trade secret.

for once, you are correct that this is not established law

Yay!

My perspective is that precedent supports training data for LLM’s as a fair use, and that strengthening copyright in the way proposed does not mitigate the harm being claimed by plaintiffs, and in fact increases harm to the greater public by gatekeeping access to automation tools and consolidating the benefits to already gigantic companies.

I wouldn’t want to restrict or gatekeep access to art for genuine fair purpose uses. I agree with the Authors Guild rulings in those circumstances, I just disagree that LLM’s are a similar enough circumstance that LLM’s deserve the same exemption with how they’re developed.

I really have to stress again that the issues and concerns being raised over AI cannot be sufficiently addressed through the use of copyright law.

I agree. Certainly, not copyright law as it exists right now, and even then there are so many aspects of the use of AI that fall well outside the scope of copyright law.

Ultimately, my gripe is that a commercial business has used copyrighted work to develop a product without paying the rightsholders. Their product is their own unique creation, but the copyrighted work their product learned from was not. The training database they’ve used is not “research” because it is not scholarly; even if it were research, it is highly commercial in nature and as such does not warrant a fair use exemption.