Meta admits using pirated books to train AI, but won't pay for it

Lee Duna · 5 months ago

@[email protected] · 5 months ago

Critical to understanding whether this applies is understanding “use” in the first place. I would argue it’s even more important, because it’s a threshold question for whether you even need to read § 107.

17 U.S. Code § 106 - Exclusive rights in copyrighted works

Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:

(1) to reproduce the copyrighted work in copies or phonorecords;
(2) to prepare derivative works based upon the copyrighted work;
(3) to distribute copies or phonorecords of the copyrighted work to the public by sale or other transfer of ownership, or by rental, lease, or lending;
(4) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and motion pictures and other audiovisual works, to perform the copyrighted work publicly;
(5) in the case of literary, musical, dramatic, and choreographic works, pantomimes, and pictorial, graphic, or sculptural works, including the individual images of a motion picture or other audiovisual work, to display the copyrighted work publicly; and
(6) in the case of sound recordings, to perform the copyrighted work publicly by means of a digital audio transmission.

Copyright protects just what it sounds like: the right to “copy” or reproduce a work in the ways enumerated above. It is not clear that use in training AI falls into any of these categories. The question mainly relates to items 1 and 2.

If you read through the court filings against OpenAI and Stability AI, much of the argument is based around trying to make a claim under case 1. If you put a model into an output loop, you can get it to reproduce small sections of training data that include passages from copyrighted works, although of course nowhere near the full corpus can be retrieved, because the model doesn’t contain anything close to a full data set: the models are much too small, and that’s also not how the transformer architecture works. But in some cases, models can preserve and output brief sections of text or distorted images that appear highly similar to at least portions of training data. Even so, it’s not clear that this is protected under copyright law, because they are small snippets that are not substitutes for the original work and don’t affect the market for it.

Case 2 would be relevant if an LLM were classified as a derivative work. But LLMs are not derivative works under the conventional definition, which covers things like translations, abridgments, or, in the case of music, different arrangements.

For these reasons, it is extremely unclear whether copyright protections are even invoked, because the nature of the use in model training does not clearly fall under any of the enumerated rights. This is not the first time this has happened, either: the DMCA of 1998 amended the Copyright Act of 1976 to add cases relating to online music distribution, as the previous copyright definitions did not clearly address online filesharing.

There are a lot of strong opinions about the ethics of training models, and many people firmly believe either that it should or that it shouldn’t be allowed. But the legal question is much hazier, because AI model training was not contemplated even in the DMCA. I’m watching these cases with interest because I don’t think the law is at all settled here. My personal view is that an act of Congress would be necessary to establish whether use of copyrighted works in training data, even for purposes of developing a commercial product, should be one of the enumerated protections of copyright. Under current law, I’m not certain that it is.

archomrade [he/him] · 5 months ago

This is an extremely good write-up, thank you for this.

TWeaK · 5 months ago

(1) to reproduce the copyrighted work in copies or phonorecords

The works are copied in their entirety and reproduced in the training database. AI businesses do not deny this is copying, but instead claim it is research and thus has a fair use exemption.

I argue it is not research, but product development - and furthermore, unlike traditional R&D, it is not some prototype that is different and separate from the commercial product. The prototype is the commercial product.

(2) to prepare derivative works based upon the copyrighted work

AI can reproduce, and has reproduced, significant portions of copyrighted work, even in spite of the fact that the finished product allegedly does not include the work in its database (it just read the training database).

Furthermore, even if a human genuinely and honestly believes they’re writing something original, that does not matter when they reproduce work that they have read before. What determines copyright infringement is the similarity of the two works.

If you read through the court filings against OpenAI and Stability AI, much of the argument is based around trying to make a claim under case 1.

The position that I take is that the arguments made against OpenAI and Stability AI in court are not complete. They’re not quite good enough. However, that doesn’t mean there isn’t a valid argument that is good enough. I just hope we don’t get a ruling in favour of AI businesses simply because the people challenging them didn’t employ the right ammunition.

With regard to Case 2, I refer back to my comment about the similarity of the work. The argument isn’t that the LLM itself is an infringement of copyright, but that the LLM, as designed by the business, infringes copyright in the same way a human would.

I definitely agree it is all extremely unclear. However, I maintain that the textual definition of the law absolutely still encompasses the feeling that people’s work is being ripped off for a commercial venture. Because it is so commercial, original authors are being harmed, as they will not see any benefit from the commercial profits.

I would also like to point you to my other comment, which I put a lot of time into and where I expanded on many other points (link to your instance’s version): https://lemmy.world/comment/6706240

archomrade [he/him] · 5 months ago

The works are copied in their entirety and reproduced in the training database. AI businesses do not deny this is copying, but instead claim it is research and thus has a fair use exemption.

The copying of the data is not, by itself, infringement. It depends on the use and purpose of the copied data, and the defense argues that training a model against the data is fair use under text and data mining (TDM) use cases.

AI can reproduce, and has reproduced, significant portions of copyrighted work, even in spite of the fact that the finished product allegedly does not include the work in its database (it just read the training database).

The model does not have a ‘database’; it is a series of transform nodes weighted against unstructured data. The transformation of the copyrighted works into a weighted regression model is what is being argued to be fair use.

Furthermore, even if a human genuinely and honestly believes they’re writing something original, that does not matter when they reproduce work that they have read before.

Yup, and it isn’t the act of that human reading a copyrighted work that is considered infringement; it is the creation of a work that is substantially similar. In the same analogy, it wouldn’t be the creation of the AI model that is the infringement, but each act of creation thereafter that is substantially similar to a copyrighted work. But this comes with a bunch of other problems for the plaintiffs, and would be a losing case without merit.

The position that I take is that the arguments made against OpenAI and Stability AI in court are not complete

The argument isn’t that the LLM itself is an infringement of copyright, but that the LLM, as designed by the business, infringes copyright in the same way a human would.

Trying really hard not to come off as rude, but there’s a good reason why this isn’t the argument being put forward in the lawsuits. If this were their argument, the LLM could be considered a commissioned agent, placing the liability on the agent commissioning the work (e.g. the human prompting the work) - not OpenAI or Stability - in much the same way a company is held responsible for the work produced by an employee.

I really do understand the anger and frustration apparent in these comments, but I would really like to encourage you to learn a bit more about the basis for these cases before spending substantial effort writing long responses.

TWeaK · 5 months ago

The copying of the data is not, by itself, infringement.

Copyright is absolute. The rightsholder has complete and total right to dictate how it is copied. Thus, any unauthorised copying is copyright infringement. However, fair use gives exemption to certain types of copying. The copyright is still being infringed, because the rightsholder’s absolute rights are being circumvented; however, the penalty is not awarded because of fair use.

This is all just pedantry, though, and has no practical significance. Saying “fair use means copyright has not been infringed” doesn’t change anything.

it is a series of transform nodes weighted against unstructured data.

That’s a database. Or perhaps rather some kind of 3D array - which could just be considered an advanced form of database. But yeah, you’re right here, you win this pedantry round lol. 1-1.

it wouldn’t be the creation of the AI model that is the infringement, but each act of creation thereafter that is substantially similar to a copyrighted work. But this comes with a bunch of other problems for the plaintiffs, and would be a losing case without merit.

Yeah I don’t want to go down the avenue of suing the AI itself for infringement. However… [1][2][3]

Trying really hard not to come off as rude

You’re not coming off as rude at all with what you’ve said, in fact I welcome and appreciate your rebuttals.

I really do understand the anger and frustration apparent in these comments, but I would really like to encourage you to learn a bit more about the basis for these cases before spending substantial effort writing long responses.

You say that as if I haven’t enjoyed fleshing out the ideas and sharing them. By the way, right now I’m sharing with you lemmy’s hidden citation feature :o)

Although, I was much happier replying to you before I just saw the downvotes you’ve apparently given me across the board. That’s a bit poor behaviour on your part, you shouldn’t downvote just because you disagree - and you can’t even say that I’m wrong as a justification when the whole thing is being heavily debated and adjudicated over whether it is right or wrong.

I thought we were engaging in a positive manner, but apparently you’ve been spitting in my face.

[1] but there’s a good reason why this isn’t the argument being put forward in the lawsuits.

[2] the LLM could be considered a commissioned agent

[3] The LLM absolutely could be considered an agent, but the way it acts is merely prompted by the user. The actual behaviour is dictated by the organisation that built it. In any case, this is only my backup argument if you even consider the initial copying to be research - which it isn’t.

archomrade [he/him] · 5 months ago

Copyright is absolute. The rightsholder has complete and total right to dictate how it is copied.

Really and truly, this is not how this works. The exemptions granted by the office of the registrar are granting an exemption to copyright claims against fair uses. It isn’t talking about whether the claim can be awarded damages; it’s talking about the claim being exempt in its entirety. You can think about copyright as an exemption to the First Amendment right to free speech, and the exemption to copyright as describing where that ‘right’ does not apply. Copyright holders do not get to control the use of their work where fair use has been determined by the registrar, which is reconsidered every 3 years.

This is all just pedantry, though, and has no practical significance. Saying “fair use means copyright has not been infringed” doesn’t change anything.

True enough, but it seems like it’s important for your understanding of how copyright works.

Or perhaps rather some kind of 3D array - which could just be considered an advanced form of database. But yeah, you’re right here, you win this pedantry round lol. 1-1.

I wasn’t being pedantic; that distinction is important for how copyright is conceptualized. The AI model is the thing being considered for infringement, so it’s important to note that the works being claimed within it do not exist as such within the model. The ‘3-d array’ does not contain copyrighted works. You can think of it as a large metadata file, describing how to construct language as analyzed through the training data. The nature and purpose of the ‘work’ is night-and-day different from the works being claimed, and ‘database’ is a clear misrepresentation (possibly even intentionally so) of what it is.

Yeah I don’t want to go down the avenue of suing the AI itself for infringement.

That was exactly what you pivoted to in your comment here; I’m not sure why you’re now saying you don’t want to go down that avenue. I’m confused about what you’re arguing at this point.

Although, I was much happier replying to you before I just saw the downvotes you’ve apparently given me across the board. That’s a bit poor behaviour on your part, you shouldn’t downvote just because you disagree - and you can’t even say that I’m wrong as a justification when the whole thing is being heavily debated and adjudicated over whether it is right or wrong.

I’ve down-voted your comments because they contain inaccuracies and could be misleading to others. You shouldn’t let my grading of your comments reflect my attitude towards you; I’m sure you’re a fine individual. Downvotes don’t mean anything on Lemmy anyway. I’m not sure ‘spitting in your face’ is a fair or accurate description, but I don’t want to invalidate your feelings, so I apologize for making you feel that way, as that wasn’t my intent.

TWeaK · 5 months ago

I’ve down-voted your comments because they contain inaccuracies and could be misleading to others. You shouldn’t let my grading of your comments reflect my attitude towards you; I’m sure you’re a fine individual. Downvotes don’t mean anything on Lemmy anyway. I’m not sure ‘spitting in your face’ is a fair or accurate description, but I don’t want to invalidate your feelings, so I apologize for making you feel that way, as that wasn’t my intent.

No worries, you’ve been very respectful. My feelings weren’t particularly hurt, I just felt the need to call it out.

Personally, I’m against downvoting things merely because they are wrong. If someone says something that’s wrong, it may well be a commonly held misconception, and downvoting it also demotes any correction that has been given, which means other people who hold the misconception are less likely to be corrected.

And that’s beside the fact that I don’t really think I’m completely wrong here :o)

That was exactly what you pivoted to in your comment here; I’m not sure why you’re now saying you don’t want to go down that avenue. I’m confused about what you’re arguing at this point.

To be a little more specific, I don’t want to go down the route of blaming AI itself for copyright infringement. That is to say, whether or not AI is bound by laws the way that humans are. I think it is only worthwhile considering whether the AI developer and/or the users are infringing copyright through their creation or use of AI. In particular, I think the legal or philosophical question of whether AI is affected by laws in the same way humans are is pointless when we’re just talking about LLMs and not a true Artificial Intelligence.

The ‘3-d array’ does not contain copyrighted works. You can think of it as a large metadata file, describing how to construct language as analyzed through the training data. The nature and purpose of the ‘work’ is night-and-day different from the works being claimed, and ‘database’ is a clear misrepresentation (possibly even intentionally so) of what it is.

Yes, absolutely, the LLM itself does not include copyrighted works. That’s not what I’m arguing. I take two issues. The first is with the database of information the LLM is trained on. This database does contain copyrighted works, and AI developers admit that it does, but they claim it is fair use research. I disagree with this claim: to use the terminology from one of your links, their “research” is not “scholarly” - it is commercial product development.

The other issue is that the LLM can reproduce copyrighted work. While I agree with you in some sense that the user of the LLM is instructing it to infringe copyright, and thus the user is responsible, in another sense I think the developer is also responsible because they have given the tool the capability to do this. This is perhaps not a strong argument, particularly when the developers have made efforts to fix these “bugs” as they come to light.

However my most important point is that the developers have infringed copyright by building a training database full of copyrighted works, which the LLM was then trained on. The LLM itself isn’t copyright infringement, but they infringed copyright to develop it.