
Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not

Technology
  • Google search results aren't deterministic, but I wouldn't say it "learns" like a person. Algorithms with pattern detection aren't the same as human learning.

    You may be correct but we don't really know how humans learn.

    There's a ton of research on it and a lot of theories but no clear answers.
    There's general agreement that the brain is a bunch of neurons; there are no convincing ideas on how consciousness arises from that mass of neurons.
    The brain also has a bunch of chemicals that affect neural processing; there are no convincing ideas on how that gets you consciousness either.

    We modeled perceptrons after neurons, and we've been working to make them more like neurons. Neurons don't have any obvious capabilities that perceptrons lack.

    That's the big problem with any claim that "AI doesn't do X like a person"; since we don't know how people do it we can neither verify nor refute that claim.

    There's more to AI than just being non-deterministic, though anything that's too deterministic definitely isn't an intelligence, natural or artificial. Video compression algorithms are very far removed from AI.
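
    For reference, a perceptron is simple enough to sketch in a few lines. Here's a minimal, self-contained example (the data and learning rate are illustrative choices, not anything from the discussion) of a single perceptron learning logical OR: a weighted sum, a hard threshold standing in for a neuron "firing", and the classic error-correction update.

    ```python
    # Minimal single perceptron with the classic learning rule.
    import numpy as np

    def train_perceptron(X, y, epochs=20, lr=0.1):
        w = np.zeros(X.shape[1])  # one weight per input, like synapse strengths
        b = 0.0
        for _ in range(epochs):
            for xi, target in zip(X, y):
                pred = 1 if xi @ w + b > 0 else 0  # hard threshold: "fire" or not
                w += lr * (target - pred) * xi     # strengthen/weaken on error
                b += lr * (target - pred)
        return w, b

    # Learn the (linearly separable) logical OR function.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 1])
    w, b = train_perceptron(X, y)
    print([1 if xi @ w + b > 0 else 0 for xi in X])  # [0, 1, 1, 1]
    ```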

  • Why do you even jailbreak your Kindle? You can still read pirated books on it if you connect it to your PC using Calibre.

    Hehe, “jailbreak” an Android OS? You mean “rooting”.

  • This post did not contain any content.

    Judge, I'm pirating them to train AI, not to consume for my own personal use.

  • Good luck breaking down people's doors for scanning their own physical books for their personal use when analog media has no DRM and can't phone home, and paper books are an analog medium.

    That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.

    The ruling explicitly says that scanning books and keeping/using those digital copies is legal.

    The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.

  • Good luck breaking down people's doors for scanning their own physical books for their personal use when analog media has no DRM and can't phone home, and paper books are an analog medium.

    That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.

    It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the Internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.

    This does raise an interesting case where libraries could end up training and distributing public domain AI models.

  • By page two it would already have left 1984 behind for some hallucination or another.

    Oh, so it would be the news?

  • The ruling explicitly says that scanning books and keeping/using those digital copies is legal.

    The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.

    I wonder if the archive.org cases had any bearing on the decision.

  • Does it "generate" a 1:1 copy?

    You can train an LLM to generate 1:1 copies.
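
    To illustrate at toy scale (a lookup table, not an actual LLM, and nothing from the case): a "model" that has fully memorized the next-token transitions of a single training text reproduces that text 1:1 under greedy decoding. An LLM with enough capacity relative to its training data can end up behaving the same way.

    ```python
    # Toy illustration of verbatim memorization: once the training text is
    # fully memorized, greedy decoding emits a 1:1 copy of it.
    text = "it was a bright cold day in april and the clocks were striking thirteen"
    tokens = text.split()

    # "Training": record the next token after each token (all tokens here are
    # unique, so this single-document "model" memorizes the text exactly).
    transitions = {}
    for i in range(len(tokens) - 1):
        transitions[tokens[i]] = tokens[i + 1]

    # "Greedy decoding": always emit the memorized continuation.
    output = [tokens[0]]
    while output[-1] in transitions:
        output.append(transitions[output[-1]])

    assert " ".join(output) == text  # a 1:1 copy of the training data
    print(" ".join(output))
    ```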

  • Why do you even jailbreak your Kindle? You can still read pirated books on it if you connect it to your PC using Calibre.

    When not in use, I have it load images from my local web server that are generated by some scripts and feature local news or the weather. The Kindle screensaver sucks.
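
    The image-generation side of a setup like that can be tiny. Here's a hedged sketch using Pillow, with a guessed Paperwhite resolution and placeholder content standing in for the commenter's actual scripts, that renders a grayscale status image a jailbroken Kindle could fetch from a local web server:

    ```python
    # Render a grayscale PNG for a Kindle to fetch as its screensaver.
    # Resolution, text, and filename are illustrative assumptions.
    from datetime import datetime
    from PIL import Image, ImageDraw

    WIDTH, HEIGHT = 758, 1024  # e.g. Kindle Paperwhite; adjust per device

    img = Image.new("L", (WIDTH, HEIGHT), color=255)  # "L" = 8-bit grayscale
    draw = ImageDraw.Draw(img)

    # Placeholder lines; a real script might pull weather or headlines here.
    lines = [
        datetime.now().strftime("%A, %B %d"),
        datetime.now().strftime("%H:%M"),
        "Weather: (fetched by your own script)",
    ]
    y = 80
    for line in lines:
        draw.text((40, y), line, fill=0)  # fill=0 draws black on the white page
        y += 60

    img.save("screensaver.png")  # serve this file from the local web server
    ```

    Serving the result can be as simple as running python -m http.server in the directory the image is written to.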

  • You may be correct but we don't really know how humans learn.

    There's a ton of research on it and a lot of theories but no clear answers.
    There's general agreement that the brain is a bunch of neurons; there are no convincing ideas on how consciousness arises from that mass of neurons.
    The brain also has a bunch of chemicals that affect neural processing; there are no convincing ideas on how that gets you consciousness either.

    We modeled perceptrons after neurons, and we've been working to make them more like neurons. Neurons don't have any obvious capabilities that perceptrons lack.

    That's the big problem with any claim that "AI doesn't do X like a person"; since we don't know how people do it we can neither verify nor refute that claim.

    There's more to AI than just being non-deterministic, though anything that's too deterministic definitely isn't an intelligence, natural or artificial. Video compression algorithms are very far removed from AI.

    One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and "any combination of deterministic components will result in a deterministic system". Randomness has to be externally injected into e.g. current LLMs to produce 'non-deterministic' output.

    There is the notable exception of newer models like ChatGPT4, which seemingly produce non-deterministic outputs (i.e. give it the same sentence and it produces different outputs even with its temperature set to 0). But my understanding is that this is due to floating-point inaccuracies, which lead to different token selection, and is thus a function of our current processor architectures rather than something inherent in the model itself.
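
    To make that concrete, here's a minimal sketch (NumPy, with made-up logits) of where the randomness enters a sampling loop: the forward pass that produced the logits is deterministic, and only the draw from the temperature-scaled distribution is random. At temperature 0, decoding collapses to a deterministic argmax, which is why any residual variation there points to numerical effects rather than the model:

    ```python
    # Deterministic logits + externally injected randomness at the sampler.
    import numpy as np

    def sample_token(logits, temperature, rng):
        if temperature == 0.0:
            return int(np.argmax(logits))  # greedy decoding: fully deterministic
        scaled = logits / temperature       # temperature reshapes the distribution
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))  # the only random step

    logits = np.array([2.0, 1.0, 0.5, 0.1])  # hypothetical next-token logits
    rng = np.random.default_rng(42)
    print(sample_token(logits, 0.0, rng))  # always token 0
    print(sample_token(logits, 1.0, rng))  # varies with the RNG state
    ```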

  • Facebook (Meta) torrented TBs from Libgen, and their internal chats leaked so we know about that, and IIRC they've been sued. Maybe you're thinking of that case?

    Billions of dollars, and they can't afford to buy ebooks?

  • This post did not contain any content.

    It took me a few days to find the time to read the actual court ruling, but here are the basics of what it ruled (and what it didn't rule on):

    • It's legal to scan physical books you already own and keep a digital library of those scanned books, even if the copyright holder didn't give permission. And even if you bought the books used, for very cheap, in bulk.
    • It's legal to keep all the book data in an internal database for use within the company, as a central library of works accessible only within the company.
    • It's legal to prepare those digital copies for potential use as training material for LLMs, including recognizing the text, performing cleanup on scanning/recognition errors, categorizing and cataloguing them to make editorial decisions on which works to include in which training sets, tokenizing them for the actual LLM technology, etc. This remains legal even for the copies that are excluded from training for whatever reason, as the entire bulk process may involve text that ends up not being used, but the process itself is fair use.
    • It's legal to use that book text to create large language models that power services that are commercially sold to the public, as long as there are safeguards that prevent the LLMs from publishing large portions of a single copyrighted work without the copyright holder's permission.
    • It's illegal to download unauthorized copies of copyrighted books from the internet, without the copyright holder's permission.

    Here's what it didn't rule on:

    • Is it legal to distribute large chunks of copyrighted text through one of these LLMs, such as when a user asks a chatbot to recite an entire copyrighted work that is in its training set? (The opinion suggests that it probably isn't legal, and relies heavily on the dividing line of how Google Books does it, by scanning and analyzing an entire copyrighted work but blocking users from retrieving more than a few snippets from those works).
    • Is it legal to give anyone outside the company access to the digitized central library assembled by the company from printed copies?
    • Is it legal to crawl publicly available digital data to build a library from text already digitized by someone else? (The answer may matter depending on whether there is an authorized method for obtaining that data, or whether the copyright holder refuses to license that copying).

    So it's a pretty important ruling, in my opinion. It's a clear green light for digitizing and archiving copyrighted works without the copyright holder's permission, as long as you own a legal copy in the first place. And it's a green light for using copyrighted works to train AI models, as long as you compiled that database of copyrighted works in a legal way.

  • Check out my new site, TheAIBay: you search for content, and an LLM that was trained on reproducing it gives it to you; a small hash check is used to validate accuracy. It is now legal.

    The court's ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted work to train the weights of the LLMs, but is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right to sue over those facts.

    But the facts before the court were that Anthropic's LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.
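
    The ruling doesn't describe how those safeguards are implemented. Purely as a hypothetical illustration (every name below is invented; this is not Anthropic's method), one simple output-side safeguard is to refuse any generation that shares a long verbatim run of words with a protected text:

    ```python
    # Hypothetical output filter: block text that reproduces a long verbatim
    # word n-gram from a protected work. Real systems are more sophisticated.
    def ngrams(words, n):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def reproduces_protected_text(generated, protected, n=8):
        """True if `generated` copies n consecutive words verbatim."""
        return bool(ngrams(generated.lower().split(), n) &
                    ngrams(protected.lower().split(), n))

    protected = ("it was a bright cold day in april and the clocks "
                 "were striking thirteen")
    ok = "the clocks struck as the day grew cold and bright"
    bad = "a bright cold day in april and the clocks were striking loudly"

    print(reproduces_protected_text(ok, protected))   # False: no verbatim run
    print(reproduces_protected_text(bad, protected))  # True: 8-word overlap
    ```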

  • Does buying the book give you license to digitise it?

    Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?

    Definitions of "Ownership" can be very different.

    Does buying the book give you license to digitise it?

    Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?

    Yes. That's what the court ruled here. If you legally obtain a printed copy of a book, you are free to digitize it or archive it for yourself. And you're allowed to keep that digital copy, analyze it, index it, and search it, in your personal library.

    Anthropic's practice of buying physical books, removing the bindings, scanning the pages, and digitizing the content while destroying the physical book was found to be legal, so long as Anthropic didn't distribute that library outside of its own company.
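
    In software terms, the scan-to-library step might look something like this sketch. It's an assumption-laden illustration, not anything described in the case: pytesseract for OCR, SQLite for the internal database, and invented file paths.

    ```python
    # Sketch of a scan-to-library pipeline: OCR page images into a
    # searchable local database. Paths and schema are invented.
    import sqlite3
    from pathlib import Path

    import pytesseract
    from PIL import Image

    db = sqlite3.connect("library.db")
    db.execute("CREATE TABLE IF NOT EXISTS pages (book TEXT, page INTEGER, text TEXT)")

    book = "some_title"
    for i, page_path in enumerate(sorted(Path("scans/some_title").glob("*.png")), 1):
        page_text = pytesseract.image_to_string(Image.open(page_path))  # OCR one page
        db.execute("INSERT INTO pages VALUES (?, ?, ?)", (book, i, page_text))

    db.commit()
    ```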

  • Make an AI that is trained on the books.

    Tell it to tell you the story from one of the books.

    Read the story without paying for it.

    The law says this is ok now, right?

    The law says this is ok now, right?

    No.

    The judge accepted the fact that Anthropic prevents users from obtaining the underlying copyrighted text through interaction with its LLM, and that there are safeguards in the software that prevent a user from getting an entire copyrighted work out of that LLM. The ruling discusses the Google Books arrangement, where books are scanned in their entirety but a user searching Google Books can't actually retrieve more than a few snippets from any given book.

    Anthropic gets to keep the copy of the entire book. It doesn't get to transmit the contents of that book to someone else, even through the LLM service.

    The judge also explicitly stated that if the authors can put together evidence that it is possible for a user to retrieve their entire copyrighted work out of the LLM, they'd have a different case and could sue over it at that time.

  • But if one person buys a book, trains an "AI model" to recite it, then distributes that model we good?

    No. The court made its ruling with the explicit understanding that the software was configured not to recite more than a few snippets from any copyrighted work, and would never produce an entire copyrighted work (or even a significant portion of a copyrighted work) in its output.

    And the judge specifically reserved that question, saying if the authors could develop evidence that it was possible for a user to retrieve significant copyrighted material out of the LLM, they'd have a different case and would be able to sue under those facts.

  • You're poor? Fuck you, you have to pay to breathe.

    Millionaire? Whatever you want daddy uwu

    That's kind of how I read it too.

    But as a side effect it means you're still allowed to photograph your own books at home as a private citizen if you own them.

    Prepare to never legally own another piece of media in your life. 😄

  • Yes, and that part of the case is going to trial. This was a preliminary judgment specifically about the training itself.

    specifically about the training itself.

    It's two issues being ruled on.

    Yes, as you mention, the act of training an LLM was ruled to be fair use, assuming that the digital training data was legally obtained.

    The other part of the ruling, which I think is really, really important for everyone, not just AI/LLM companies or developers, is that it is legal to buy printed books and digitize them into a central library with indexed metadata. Anthropic has to go to trial on the pirated books they just downloaded from the internet, but has fully won the portion of the case about the physical books they bought and digitized.

  • I am not a lawyer. I am talking about reality.

    What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning? Where is the learning happening? Who is doing the learning?

    Who is stopping the individuals at the LLM company from learning or analysing a given book?

    From my experience living in the US, this is pretty standard American-style corruption. Lots of pomp and bombast and roleplay of sorts, but the outcome is no different from any other country that is in deep need of judicial and anti-corruption reform.

    What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning?

    No, you're framing the issue incorrectly.

    The law concerns itself with copying. When humans learn, they inevitably copy things. They may memorize portions of copyrighted material and later retrieve those memories, either to do something new with them or simply to recreate the original.

    If the argument is that the mere act of copying for training an LLM is illegal copying, then what would we say about the use of copyrighted text for teaching children? They will memorize portions of what they read. They will later write some of them down. And if there is a person who memorizes an entire poem (or song) and then writes it down for someone else, that's actually a copyright violation. But if they memorize that poem or song and reuse it in creating something new and different, but with links and connections to that previous copyrighted work, then that kind of copying and processing is generally allowed.

    The judge here is analyzing what exact types of copying are permitted under the law, and for that, the copyright holders' argument would sweep too broadly and prohibit all sorts of methods that humans use to learn.

  • FTA:

    Anthropic warned against “[t]he prospect of ruinous statutory damages—$150,000 times 5 million books”: that would mean $750 billion.

    So part of their argument is actually that they stole so much that it would be impossible for them/anyone to pay restitution, therefore we should just let them off the hook.

    The problem isn't that Anthropic gets to use that defense, it's that others don't. The fact that the world is in a place where people can be fined 5+ years of a Western European average salary for making a copy of one (1) book that does not materially affect the copyright holder in any way is insane, and it is good to point that out no matter who does it.
