Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not

Technology
  • Check out my new site TheAIBay, you search for content and an LLM that was trained on reproducing it gives it to you, a small hash check is used to validate accuracy. It is now legal.

    The court's ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. Anthropic used the entire copyrighted work to train the weights of the LLMs, but the system is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right to sue over those facts.

    But the facts before the court were that Anthropic's LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.

  • Does buying the book give you license to digitise it?

    Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?

    Definitions of "Ownership" can be very different.

    Does buying the book give you license to digitise it?

    Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?

    Yes. That's what the court ruled here. If you legally obtain a printed copy of a book you are free to digitize it or archive it for yourself. And you're allowed to keep that digital copy, analyze and index it and search it, in your personal library.

    Anthropic's practice of buying physical books, removing the bindings, scanning the pages, and digitizing the content while destroying the physical book was found to be legal, so long as Anthropic didn't distribute that library outside of its own company.

  • Make an AI that is trained on the books.

    Tell it to tell you a story for one of the books.

    Read the story without paying for it.

    The law says this is ok now, right?

    The law says this is ok now, right?

    No.

    The judge accepted the fact that Anthropic prevents users from obtaining the underlying copyrighted text through interaction with its LLM, and that there are safeguards in the software that prevent a user from being able to get an entire copyrighted work out of that LLM. The ruling discusses the Google Books arrangement, where the books are scanned in their entirety, but where a user searching in Google Books can't actually retrieve more than a few snippets from any given book.

    Anthropic gets to keep the copy of the entire book. It doesn't get to transmit the contents of that book to someone else, even through the LLM service.

    The judge also explicitly stated that if the authors can put together evidence that it is possible for a user to retrieve their entire copyrighted work out of the LLM, they'd have a different case and could sue over it at that time.

  • But if one person buys a book, trains an "AI model" to recite it, then distributes that model we good?

    No. The court made its ruling with the explicit understanding that the software was configured not to recite more than a few snippets from any copyrighted work, and would never produce an entire copyrighted work (or even a significant portion of a copyrighted work) in its output.

    And the judge specifically reserved that question, saying if the authors could develop evidence that it was possible for a user to retrieve significant copyrighted material out of the LLM, they'd have a different case and would be able to sue under those facts.

  • You're poor? Fuck you you have to pay to breathe.

    Millionaire? Whatever you want daddy uwu

    That's kind of how I read it too.

    But as a side effect it means you're still allowed to photograph your own books at home as a private citizen if you own them.

    Prepare to never legally own another piece of media in your life. 😄

  • Yes, and that part of the case is going to trial. This was a preliminary judgment specifically about the training itself.

    specifically about the training itself.

    It's two issues being ruled on.

    Yes, as you mention, the act of training an LLM was ruled to be fair use, assuming that the digital training data was legally obtained.

    The other part of the ruling, which I think is really, really important for everyone, not just AI/LLM companies or developers, is that it is legal to buy printed books and digitize them into a central library with indexed metadata. Anthropic has to go to trial on the pirated books they just downloaded from the internet, but has fully won the portion of the case about the physical books they bought and digitized.

  • I am not a lawyer. I am talking about reality.

    What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning? Where is the learning happening? Who is doing the learning?

    Who is stopping the individuals at the LLM company from learning or analysing a given book?

    From my experience living in the US, this is pretty standard American-style corruption. Lots of pomp and bombast and roleplay of sorts, but the outcome is no different from any other country that is in deep need of judicial and anti-corruption reform.

    What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning?

    No, you're framing the issue incorrectly.

    The law concerns itself with copying. When humans learn, they inevitably copy things. They may memorize portions of copyrighted material, and then retrieve those memories in doing something new with them, or just by recreating it.

    If the argument is that the mere act of copying for training an LLM is illegal copying, then what would we say about the use of copyrighted text for teaching children? They will memorize portions of what they read. They will later write some of them down. And if there is a person who memorizes an entire poem (or song) and then writes it down for someone else, that's actually a copyright violation. But if they memorize that poem or song and reuse it in creating something new and different, but with links and connections to that previous copyrighted work, then that kind of copying and processing is generally allowed.

    The judge here is analyzing what exact types of copying are permitted under the law, and for that, the copyright holders' argument would sweep too broadly and prohibit all sorts of methods that humans use to learn.

  • FTA:

    Anthropic warned against “[t]he prospect of ruinous statutory damages—$150,000 times 5 million books”: that would mean $750 billion.

    So part of their argument is actually that they stole so much that it would be impossible for them/anyone to pay restitution, therefore we should just let them off the hook.

    The problem isn't that Anthropic gets to use that defense, it's that others don't. The fact that the world is in a place where people can be fined 5+ years of a western European average salary for making a copy of one (1) book that does not materially affect the copyright holder in any way is insane, and it is good to point that out no matter who does it.

  • thanks I hate it xD

  • The language model isn't teaching anything it is changing the wording of something and spitting it back out. And in some cases, not changing the wording at all, just spitting the information back out, without paying the copyright source. It is not alive, it has no thoughts. It has no "its own words." (As seen by the judgement that its words cannot be copyrighted.) It only has other people's words. Every word it spits out by definition is plagiarism, whether the work was copyrighted before or not.

    People wonder why works such as journalism are getting worse. Well, how could they ever get better if anything a journalist writes can be absorbed in real time, reworded, and regurgitated without paying any dues to the original source? One journalist's article, displayed in 30 versions, divides the original work's worth into 30 portions. The original work is now worth 1/30th its original value. Maybe one can argue it is twice as good, so 1/15th.

    Long term it means all original creations... are devalued and therefore not nearly worth pursuing. So we will only get shittier and shittier information. Every research project... Physics, Chemistry, Psychology, all technological advancements, slowly degraded as language models get better and original sources see diminishing returns.

    just spitting the information back out, without paying the copyright source

    The court made its ruling under the factual assumption that it isn't possible for a user to retrieve copyrighted text from that LLM, and explained that if a copyright holder does develop evidence that it is possible to get entire works or significant chunks of their copyrighted text out of that LLM, then they'd be able to sue under those facts and that evidence.

    It relies heavily on the analogy to Google Books, which scans in entire copyrighted books to build the database, but where users of the service simply cannot retrieve more than a few snippets from any given book. That way, Google cannot be said to be redistributing entire books to its users without the publisher's permission.

  • You’re right, each of the 5 million books’ authors should agree to less payment for their work, to make the poor criminals feel better.

    If I steal $100 from a thousand people and spend it all on hookers and blow, do I get out of paying that back because I don’t have the funds? Should the victims agree to get $20 back instead because that’s more within my budget?

    You think that $150,000, or roughly 250 weeks of full-time pretax wages at $15 an hour, is a reasonable fine for making a copy of one book which does no material harm to the copyright holder?

  • I wonder if the archive.org cases had any bearing on the decision.

    Archive.org was distributing the books themselves to users. Anthropic argued (and the authors suing them weren't able to show otherwise) that their software prevents users from actually retrieving books out of the LLM, and that it only will produce snippets of text from copyrighted works. And producing snippets in the context of something else is fair use, like commentary or criticism.

    It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the Internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.

    This does raise an interesting case where libraries could end up training and distributing public domain AI models.

    I would actually be okay with libraries having those AI services. Even if they were available only for a fee it would be absurdly low and still waived for people with low or no income.

    You think that $150,000, or roughly 250 weeks of full-time pretax wages at $15 an hour, is a reasonable fine for making a copy of one book which does no material harm to the copyright holder?

    No I don’t, but we’re not talking about a single copy of one book, and it is grovellingly insidious to imply that we are.

    We are talking about a company taking the work of an author, of thousands of authors, and using it as the backbone of a machine whose goal is to make those authors obsolete.

    When the people who own the slop-machine are making millions of dollars off the back of stolen works, they can very much afford to pay those authors. If you can’t afford to run your business without STEALING, then your business is a pile of flaming shit that deserves to fail.

  • None of the above. Every professional in the world, including me, owes our careers to looking at examples of other people's work and incorporating their work into our own work without paying a penny for it. Freely copying and imitating what we see around us has been a human norm for thousands of years - in a process known as "the spread of civilization". Relatively recently it was demonized - for purely business reasons, not moral ones - by people who got rich selling copies of other people's work and paying them a pittance known as a "royalty". That little piece of bait on the hook has convinced a lot of people to put a black hat on behavior that had been considered normal forever. If angry modern enlightened justice warriors want to treat a business concept like a moral principle and get all sweaty about it, that's fine with me, but I'm more of a traditionalist in that area.

    Nobody who is mad at this situation thinks that taking inspiration, riffing on, or referencing other people’s work is the problem when a human being does it. When a person writes, there is intention behind it.

    The issue is when a business, owned by those people you think ‘demonised’ inspiration, take the works of authors and mulch them into something they lovingly named “The Pile”, in order to create derivative slop off the backs of creatives.

    When you, as a “professional”, ask AI to write you a novel, who is being inspired? Who is making the connections between themes? Who is carefully crafting the text to pay loving reference to another author's work? Not you. Not the algorithm that is guessing what word to shit out next based on math.

    These businesses have tricked you into thinking that what they are doing is noble.

  • One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and "any combination of deterministic components will result in a deterministic system". Randomness has to be externally injected into e.g. current LLMs to produce 'non-deterministic' output.

    There is the notable exception of newer models like ChatGPT4, which seemingly produce non-deterministic outputs (i.e. give it the same sentence and it produces different outputs even with its temperature set to 0) - but my understanding is this is due to floating-point inaccuracies which lead to different token selection, and is thus a function of our current processor architectures and not inherent in the model itself.

    You're correct that a collection of deterministic elements will produce a deterministic result.

    LLMs produce a probability distribution of next tokens and then randomly select one of them. That's where the non-determinism enters the system. Even if you set the temperature to 0 you're going to get some randomness. The GPU can round two different real numbers to the same floating point representation. When that happens, it's a hardware-level coin toss on which token gets selected.

    You can test this empirically. Set the temperature to 0 and ask it, "give me a random number". You'll rarely get the same number twice in a row, no matter how similar you try to make the starting conditions.
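    The temperature mechanics discussed above can be sketched in a few lines. This is a toy illustration under stated assumptions (a hypothetical `sample_next_token` helper and a three-token vocabulary, not any real model's API): at temperature 0 decoding collapses to a deterministic argmax, while any positive temperature samples from the softmax distribution, which is where randomness enters.

    ```python
    import math
    import random

    def sample_next_token(logits, temperature=1.0):
        """Pick a next-token index from raw model scores (logits).

        temperature == 0 falls back to pure argmax (deterministic);
        any temperature > 0 samples from the softmax distribution,
        injecting randomness into the decoding loop.
        """
        if temperature == 0:
            return max(range(len(logits)), key=lambda i: logits[i])
        scaled = [l / temperature for l in logits]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        return random.choices(range(len(logits)), weights=probs, k=1)[0]

    logits = [2.0, 1.0, 0.5]
    print(sample_next_token(logits, temperature=0))    # argmax: always index 0
    print(sample_next_token(logits, temperature=1.0))  # sampled: varies run to run
    ```

    Note this sketch does not capture the floating-point effects mentioned above; those arise from parallel reduction order on real GPUs, outside what a serial toy can show.
    
    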

    I've hand-calculated forward propagation (neural networks). AI does not learn; it's statistically optimized. AI "learning" is curve fitting. Human learning requires understanding, which AI is not capable of.

    Human learning requires understanding, which AI is not capable of.

    How could anyone know this?

    Is there some test of understanding that humans can pass and AIs can't? And if there are humans who can't pass it, do we consider them unintelligent?

    We don't even need to set the bar that high. Is there some definition of "understanding" that humans meet and AIs don't?
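    The "curve fitting" characterization in this exchange can be made concrete with a minimal sketch. This is a hypothetical two-parameter model fit by gradient descent on squared error - mechanically the same loop that trains a neural network, just at a vastly smaller scale; the variable names (`w`, `b`, `lr`) are illustrative, not from any library.

    ```python
    # Fit y = w*x + b to data by gradient descent on mean squared error.
    # The loop only minimizes a numeric loss; whether that counts as
    # "learning" or "understanding" is exactly the question being debated.
    data = [(x, 2.0 * x + 1.0) for x in range(10)]  # ground truth: w=2, b=1

    w, b, lr = 0.0, 0.0, 0.01
    for _ in range(2000):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b

    print(round(w, 2), round(b, 2))  # converges toward 2.0 and 1.0
    ```

    Neural-network training replaces the two parameters with billions and the line with a deep composition of functions, but the optimization step is the same shape.
    
    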

  • If this is the ruling which causes you to lose trust that any legal system (not just the US') aligns with morality, then I have to question where you've been all this time.

    I could have been more clear, but it wasn't my intention to imply that this particular case is the turning point.

  • No I don’t, but we’re not talking about a single copy of one book, and it is grovellingly insidious to imply that we are.

    We are talking about a company taking the work of an author, of thousands of authors, and using it as the backbone of a machine whose goal is to make those authors obsolete.

    When the people who own the slop-machine are making millions of dollars off the back of stolen works, they can very much afford to pay those authors. If you can’t afford to run your business without STEALING, then your business is a pile of flaming shit that deserves to fail.

    Except it isn't, because the judge dismissed that part of the suit, saying that people have every right to digitise and train on works they have a legitimate copy of. So those damages are for making the unauthorised copy, per book.

    And it is not STEALING, as you put it; it is making an unauthorised copy. No one loses anything from a copy being made - if I STEAL your phone, you no longer have that phone. I do find it sad how many people have drunk in the capitalist IP-maximalist stance and have somehow convinced themselves that advocating for Disney and the publishing cartel being allowed to dictate how people use works they own is somehow sticking up for the little guy.

  • Nobody who is mad at this situation thinks that taking inspiration, riffing on, or referencing other people’s work is the problem when a human being does it. When a person writes, there is intention behind it.

    The issue is when a business, owned by those people you think ‘demonised’ inspiration, take the works of authors and mulch them into something they lovingly named “The Pile”, in order to create derivative slop off the backs of creatives.

    When you, as a “professional”, ask AI to write you a novel, who is being inspired? Who is making the connections between themes? Who is carefully crafting the text to pay loving reference to another author's work? Not you. Not the algorithm that is guessing what word to shit out next based on math.

    These businesses have tricked you into thinking that what they are doing is noble.

    That's 100% rationalization. Machines have never done anything with "inspiration", and that's never been a problem until now. You probably don't insist that your food be hand-carried to you from a farm, or cooked over a fire you started by rubbing two sticks together. I think the mass reaction against AI is part of a larger pattern where people want to believe they're crusading against evil without putting out the kind of effort it takes to fight any of the genuine evils in the world.
