
Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not

Technology
  • Make an AI that is trained on the books.

    Tell it to tell you the story of one of the books.

    Read the story without paying for it.

    The law says this is ok now, right?

    The law says this is ok now, right?

    No.

    The judge accepted the fact that Anthropic prevents users from obtaining the underlying copyrighted text through interaction with its LLM, and that there are safeguards in the software that prevent a user from getting an entire copyrighted work out of that LLM. The ruling discusses the Google Books arrangement, where the books are scanned in their entirety, but where a user searching in Google Books can't actually retrieve more than a few snippets from any given book.

    Anthropic gets to keep its copy of the entire book. It doesn't get to transmit the contents of that book to someone else, even through the LLM service.

    The judge also explicitly stated that if the authors can put together evidence that it is possible for a user to retrieve their entire copyrighted work out of the LLM, they'd have a different case and could sue over it at that time.

  • But if one person buys a book, trains an "AI model" to recite it, then distributes that model, we good?

    No. The court made its ruling with the explicit understanding that the software was configured not to recite more than a few snippets from any copyrighted work, and would never produce an entire copyrighted work (or even a significant portion of a copyrighted work) in its output.

    And the judge specifically reserved that question, saying if the authors could develop evidence that it was possible for a user to retrieve significant copyrighted material out of the LLM, they'd have a different case and would be able to sue under those facts.

  • You're poor? Fuck you, you have to pay to breathe.

    Millionaire? Whatever you want daddy uwu

    That's kind of how I read it too.

    But as a side effect it means you're still allowed to photograph your own books at home as a private citizen if you own them.

    Prepare to never legally own another piece of media in your life. 😄

  • Yes, and that part of the case is going to trial. This was a preliminary judgment specifically about the training itself.

    specifically about the training itself.

    It's two issues being ruled on.

    Yes, as you mention, the act of training an LLM was ruled to be fair use, assuming that the digital training data was legally obtained.

    The other part of the ruling, which I think is really, really important for everyone, not just AI/LLM companies or developers, is that it is legal to buy printed books and digitize them into a central library with indexed metadata. Anthropic has to go to trial over the pirated books it simply downloaded from the internet, but has fully won the portion of the case about the physical books it bought and digitized.

  • I am not a lawyer. I am talking about reality.

    What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning? Where is the learning happening? Who is doing the learning?

    Who is stopping the individuals at the LLM company from learning or analysing a given book?

    From my experience living in the US, this is pretty standard American-style corruption. Lots of pomp and bombast and roleplay of sorts, but the outcome is no different from any other country that is in deep need of judicial and anti-corruption reform.

    What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning?

    No, you're framing the issue incorrectly.

    The law concerns itself with copying. When humans learn, they inevitably copy things. They may memorize portions of copyrighted material, and then retrieve those memories in doing something new with them, or just by recreating it.

    If the argument is that the mere act of copying for training an LLM is illegal copying, then what would we say about the use of copyrighted text for teaching children? They will memorize portions of what they read. They will later write some of them down. And if there is a person who memorizes an entire poem (or song) and then writes it down for someone else, that's actually a copyright violation. But if they memorize that poem or song and reuse it in creating something new and different, but with links and connections to that previous copyrighted work, then that kind of copying and processing is generally allowed.

    The judge here is analyzing what exact types of copying are permitted under the law, and for that, the copyright holders' argument would sweep too broadly and prohibit all sorts of methods that humans use to learn.

  • FTA:

    Anthropic warned against “[t]he prospect of ruinous statutory damages—$150,000 times 5 million books”: that would mean $750 billion.

    So part of their argument is actually that they stole so much that it would be impossible for them/anyone to pay restitution, therefore we should just let them off the hook.

    The problem isn't that Anthropic gets to use that defense, it's that others don't. The fact that the world is in a place where people can be fined 5+ years of a western European average salary for making a copy of one (1) book that does not materially affect the copyright holder in any way is insane, and it is good to point that out no matter who does it.

  • thanks I hate it xD

    The language model isn't teaching anything; it is changing the wording of something and spitting it back out. And in some cases it is not changing the wording at all, just spitting the information back out, without paying the copyright source. It is not alive, it has no thoughts. It has no words of "its own." (As seen by the judgement that its words cannot be copyrighted.) It only has other people's words. Every word it spits out is by definition plagiarism, whether the work was copyrighted before or not.

    People wonder why works such as journalism are getting worse. Well, how could they ever get better if anything a journalist writes can be absorbed in real time, reworded and regurgitated without paying any dues to the original source? One journalist's article, displayed in 30 versions, divides the original work's worth into 30 portions. The original work is now worth 1/30th its original value. Maybe one can argue it is twice as good, so 1/15th.

    Long term it means all original creations... are devalued and therefore not nearly worth pursuing. So we will only get shittier and shittier information. Every research project... Physics, Chemistry, Psychology, all technological advancements, slowly degraded as language models get better and original sources see diminishing returns.

    just spitting the information back out, without paying the copyright source

    The court made its ruling under the factual assumption that it isn't possible for a user to retrieve copyrighted text from that LLM, and explained that if a copyright holder does develop evidence that it is possible to get significant chunks of their copyrighted text out of that LLM, then they'd be able to sue under those facts and that evidence.

    It relies heavily on the analogy to Google Books, which scans in entire copyrighted books to build the database, but where users of the service simply cannot retrieve more than a few snippets from any given book. That way, Google cannot be said to be redistributing entire books to its users without the publisher's permission.

  • You’re right, each of the 5 million books’ authors should agree to less payment for their work, to make the poor criminals feel better.

    If I steal $100 from a thousand people and spend it all on hookers and blow, do I get out of paying that back because I don’t have the funds? Should the victims agree to get $20 back instead because that’s more within my budget?

    You think that 150,000 dollars, or roughly 250 weeks of full-time pretax wages at $15 an hour, is a reasonable fine for making a copy of one book which does no material harm to the copyright holder?

  • I wonder if the archive.org cases had any bearing on the decision.

    Archive.org was distributing the books themselves to users. Anthropic argued (and the authors suing them weren't able to show otherwise) that their software prevents users from actually retrieving books out of the LLM, and that it will only produce snippets of text from copyrighted works. And producing snippets in the context of something else, like commentary or criticism, is fair use.

  • It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the Internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.

    This does raise an interesting case where libraries could end up training and distributing public domain AI models.

    I would actually be okay with libraries having those AI services. Even if they were available only for a fee it would be absurdly low and still waived for people with low or no income.

  • You think that 150,000 dollars, or roughly 250 weeks of full-time pretax wages at $15 an hour, is a reasonable fine for making a copy of one book which does no material harm to the copyright holder?

    No I don’t, but we’re not talking about a single copy of one book, and it is grovellingly insidious to imply that we are.

    We are talking about a company taking the work of an author, of thousands of authors, and using it as the backbone of a machine whose goal is to make those authors obsolete.

    When the people who own the slop-machine are making millions of dollars off the back of stolen works, they can very much afford to pay those authors. If you can’t afford to run your business without STEALING, then your business is a pile of flaming shit that deserves to fail.

  • None of the above. Every professional in the world, including me, owes our careers to looking at examples of other people's work and incorporating their work into our own work without paying a penny for it. Freely copying and imitating what we see around us has been a human norm for thousands of years - in a process known as "the spread of civilization". Relatively recently it was demonized - for purely business reasons, not moral ones - by people who got rich selling copies of other people's work and paying them a pittance known as a "royalty". That little piece of bait on the hook has convinced a lot of people to put a black hat on behavior that had been considered normal forever. If angry modern enlightened justice warriors want to treat a business concept like a moral principle and get all sweaty about it, that's fine with me, but I'm more of a traditionalist in that area.

    Nobody who is mad at this situation thinks that taking inspiration, riffing on, or referencing other people’s work is the problem when a human being does it. When a person writes, there is intention behind it.

    The issue is when a business, owned by those people you think ‘demonised’ inspiration, takes the works of authors and mulches them into something they lovingly named “The Pile”, in order to create derivative slop off the backs of creatives.

    When you, as a “professional”, ask AI to write you a novel, who is being inspired? Who is making the connections between themes? Who is carefully crafting the text to pay loving reference to another author's work? Not you. Not the algorithm that is guessing what word to shit out next based on math.

    These businesses have tricked you into thinking that what they are doing is noble.

  • One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and "any combination of deterministic components will result in a deterministic system". Randomness has to be externally injected into e.g. current LLMs to produce 'non-deterministic' output.

    There is the notable exception of newer models like ChatGPT4, which seemingly produce non-deterministic outputs (i.e. give it the same sentence and it produces different outputs even with its temperature set to 0) - but my understanding is this is due to floating point number inaccuracies which lead to different token selection, and is thus a function of our current processor architectures and not inherent in the model itself.

    You're correct that a collection of deterministic elements will produce a deterministic result.

    LLMs produce a probability distribution over next tokens and then randomly select one of them. That's where the non-determinism enters the system. Even if you set the temperature to 0, you're going to get some randomness. The GPU can round two different real numbers to the same floating point representation. When that happens, it's a hardware-level coin toss on which token gets selected.

    You can test this empirically. Set the temperature to 0 and ask it, "give me a random number". You'll rarely get the same number twice in a row, no matter how similar you try to make the starting conditions.
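
    For anyone curious about the mechanics being described here, the following is a minimal, hypothetical sketch in Python (using NumPy; not any vendor's actual implementation) of the sampling step at the end of a language model's forward pass. It shows where the injected randomness lives and why temperature 0 reduces to a deterministic argmax over the raw scores (logits), up to the floating-point effects mentioned above:

        import numpy as np

        def sample_next_token(logits: np.ndarray, temperature: float,
                              rng: np.random.Generator) -> int:
            # Pick the index of the next token from a vector of raw scores.
            if temperature == 0.0:
                # Greedy decoding: deterministic only if the logits are bit-for-bit
                # reproducible. Tiny floating-point differences (e.g. from
                # non-associative parallel reductions on a GPU) can flip which
                # entry is the maximum, which is the effect described above.
                return int(np.argmax(logits))

            # Softmax with temperature: higher temperature flattens the distribution.
            scaled = logits / temperature
            probs = np.exp(scaled - scaled.max())
            probs /= probs.sum()

            # The explicit random draw: this is where randomness is injected.
            return int(rng.choice(len(probs), p=probs))

        rng = np.random.default_rng()
        logits = np.array([2.0, 1.9, -1.0, 0.5])
        print(sample_next_token(logits, temperature=0.0, rng=rng))  # always index 0
        print(sample_next_token(logits, temperature=1.0, rng=rng))  # varies run to run

    The point of the sketch is only that the model's arithmetic and the sampling policy are separate steps; whether a deployed system is reproducible end to end depends on both.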

  • I've hand calculated forward propagation (neural networks). AI does not learn; it's statistically optimized. AI “learning” is curve fitting (a small sketch of what that looks like is included at the end of this exchange). Human learning requires understanding, which AI is not capable of.

    Human learning requires understanding, which AI is not capable of.

    How could anyone know this?

    Is there some test of understanding that humans can pass and AIs can't? And if there are humans who can't pass it, do we consider them unintelligent?

    We don't even need to set the bar that high. Is there some definition of "understanding" that humans meet and AIs don't?
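
    As context for the terms above, here is a deliberately tiny, hypothetical Python sketch of what "forward propagation" and "curve fitting" refer to: a two-parameter model whose weights are nudged to reduce prediction error on example data. It is offered only as an illustration of the computation, not as a verdict on whether that amounts to learning or understanding:

        import numpy as np

        # Toy training data: points on the line y = 2x + 1 (the "curve" being fit).
        x = np.array([0.0, 1.0, 2.0, 3.0])
        y = 2.0 * x + 1.0

        w, b = 0.0, 0.0   # model parameters, starting from arbitrary values
        lr = 0.05         # learning rate

        for step in range(2000):
            y_hat = w * x + b    # forward propagation: compute the model's predictions
            error = y_hat - y
            # Gradient descent on mean squared error: nudge parameters to reduce error.
            w -= lr * 2.0 * np.mean(error * x)
            b -= lr * 2.0 * np.mean(error)

        print(round(w, 3), round(b, 3))  # converges toward 2.0 and 1.0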

  • If this is the ruling which causes you to lose trust that any legal system (not just the US') aligns with morality, then I have to question where you've been all this time.

    I could have been more clear, but it wasn't my intention to imply that this particular case is the turning point.

  • No I don’t, but we’re not talking about a single copy of one book, and it is grovellingly insidious to imply that we are.

    We are talking about a company taking the work of an author, of thousands of authors, and using it as the backbone of a machine whose goal is to make those authors obsolete.

    When the people who own the slop-machine are making millions of dollars off the back of stolen works, they can very much afford to pay those authors. If you can’t afford to run your business without STEALING, then your business is a pile of flaming shit that deserves to fail.

    Except it isn't, because the judge dismissed that part of the suit, saying that people have a complete right to digitise and train on works they have a legitimate copy of. So those damages are for making the unauthorised copy, per book.

    And it is not STEALING, as you put it; it is making an unauthorised copy. No one loses anything from a copy being made; if I STEAL your phone, you no longer have that phone. I do find it sad how many people have bought into the capitalist IP-maximalist stance and have somehow convinced themselves that advocating for Disney and the publishing cartel being allowed to dictate how people use works they have is somehow sticking up for the little guy.

  • Nobody who is mad at this situation thinks that taking inspiration, riffing on, or referencing other people’s work is the problem when a human being does it. When a person writes, there is intention behind it.

    The issue is when a business, owned by those people you think ‘demonised’ inspiration, takes the works of authors and mulches them into something they lovingly named “The Pile”, in order to create derivative slop off the backs of creatives.

    When you, as a “professional”, ask AI to write you a novel, who is being inspired? Who is making the connections between themes? Who is carefully crafting the text to pay loving reference to another author's work? Not you. Not the algorithm that is guessing what word to shit out next based on math.

    These businesses have tricked you into thinking that what they are doing is noble.

    That's 100% rationalization. Machines have never done anything with "inspiration", and that's never been a problem until now. You probably don't insist that your food be hand-carried to you from a farm, or cooked over a fire you started by rubbing two sticks together. I think the mass reaction against AI is part of a larger pattern where people want to believe they're crusading against evil without putting out the kind of effort it takes to fight any of the genuine evils in the world.

  • Human learning requires understanding, which AI is not capable of.

    How could anyone know this?

    Is there some test of understanding that humans can pass and AIs can't? And if there are humans who can't pass it, do we consider them unintelligent?

    We don't even need to set the bar that high. Is there some definition of "understanding" that humans meet and AIs don't?

    It's literally in the phrase "statistically optimized." This is like arguing for your preferred deity. It'll never be proven, but we have evidence to draw our own conclusions. As it is now, AI doesn't learn or understand the same way humans do.

  • It's literally in the phrase "statistically optimized." This is like arguing for your preferred deity. It'll never be proven, but we have evidence to draw our own conclusions. As it is now, AI doesn't learn or understand the same way humans do.

    So you’re confident that human learning involves “understanding” which is distinct from “statistical optimization”. Is this something you feel in your soul or can you define the difference?