Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not
-
Google search results aren't deterministic, but I wouldn't say it "learns" like a person. Algorithms with pattern detection aren't the same as human learning.
You may be correct but we don't really know how humans learn.
There's a ton of research on it and a lot of theories but no clear answers.
There's general agreement that the brain is a bunch of neurons; there are no convincing ideas on how consciousness arises from that mass of neurons.
The brain also has a bunch of chemicals that affect neural processing; there are no convincing ideas on how that gets you consciousness either.
We modeled perceptrons after neurons and we've been working to make them more like neurons. Neurons don't have any obvious capabilities that perceptrons don't have.
That's the big problem with any claim that "AI doesn't do X like a person"; since we don't know how people do it we can neither verify nor refute that claim.
There's more to AI than just being non-deterministic. Anything that's too deterministic definitely isn't an intelligence, though, whether natural or artificial. Video compression algorithms are definitely very far removed from AI.
-
why do you even jailbreak your kindle? you can still read pirated books on them if you connect it to your pc using calibre
Hehe jailbreak an Android OS. You mean “rooting”.
-
This post did not contain any content.
Judge, I'm pirating them to train AI, not to consume for my own personal use.
-
Good luck breaking down people's doors for scanning their own physical books for their personal use when analog media has no DRM and can't phone home, and paper books are an analog medium.
That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.
The ruling explicitly says that scanning books and keeping/using those digital copies is legal.
The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.
-
Good luck breaking down people's doors for scanning their own physical books for their personal use when analog media has no DRM and can't phone home, and paper books are an analog medium.
That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.
It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the Internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.
This does raise an interesting case where libraries could end up training and distributing public domain AI models.
-
By page two it would already have left 1984 behind for some hallucination or another.
Oh, so it would be the news?
-
The ruling explicitly says that scanning books and keeping/using those digital copies is legal.
The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.
I wonder if the archive.org cases had any bearing on the decision.
-
Does it "generate" a 1:1 copy?
You can train an LLM to generate 1:1 copies
-
why do you even jailbreak your kindle? you can still read pirated books on them if you connect it to your pc using calibre
when not in use i have it load images from my local webserver that are generated by some scripts and feature local news or the weather. kindle screensaver sucks.
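For anyone curious, the setup described above can be sketched in a few lines. Everything specific here is an assumption, not the poster's actual setup: the generator script name (`make_image.sh`), the image path, and the port are all hypothetical; the jailbroken screensaver hack just needs a URL to fetch.

```python
import http.server
import subprocess

IMAGE_PATH = "/tmp/screensaver.png"  # hypothetical output path of the generator script

class ScreensaverHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Re-run the (hypothetical) generator so the image always shows
        # current news/weather, then serve the resulting file bytes.
        subprocess.run(["./make_image.sh"], check=False)
        try:
            with open(IMAGE_PATH, "rb") as f:
                body = f.read()
        except FileNotFoundError:
            self.send_error(404, "image not generated yet")
            return
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve (blocks forever):
# http.server.HTTPServer(("", 8080), ScreensaverHandler).serve_forever()
```

Regenerating on every GET keeps the Kindle side dumb: it only has to poll one URL on its screensaver timer.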
-
You may be correct but we don't really know how humans learn.
There's a ton of research on it and a lot of theories but no clear answers.
There's general agreement that the brain is a bunch of neurons; there are no convincing ideas on how consciousness arises from that mass of neurons.
The brain also has a bunch of chemicals that affect neural processing; there are no convincing ideas on how that gets you consciousness either.
We modeled perceptrons after neurons and we've been working to make them more like neurons. Neurons don't have any obvious capabilities that perceptrons don't have.
That's the big problem with any claim that "AI doesn't do X like a person"; since we don't know how people do it we can neither verify nor refute that claim.
There's more to AI than just being non-deterministic. Anything that's too deterministic definitely isn't an intelligence, though, whether natural or artificial. Video compression algorithms are definitely very far removed from AI.
One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and "any combination of deterministic components will result in a deterministic system". Randomness has to be externally injected into e.g. current LLMs to produce 'non-deterministic' output.
There is the notable exception of newer models like GPT-4, which seemingly produce non-deterministic outputs (i.e. give it the same sentence and it produces different outputs even with its temperature set to 0) - but my understanding is this is due to floating-point inaccuracies that lead to different token selection, and is thus a function of our current processor architectures and not inherent in the model itself.
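A toy decoding step makes the point concrete (a minimal sketch, not any particular vendor's sampler): the forward pass produces the same logits every time; non-determinism only appears when randomness is injected at the sampling stage.

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=None):
    """Pick a next-token index from raw logits.

    temperature == 0 collapses to greedy argmax, which is fully
    deterministic; any temperature > 0 softmax-samples, so randomness
    has to be supplied from outside the model (here via `rng`).
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(exps) - 1

# Greedy decoding always picks the highest-logit token:
print(sample_token([2.0, 1.0, 0.5], temperature=0))  # -> 0
```

With a seeded `rng` even the temperature > 0 path is reproducible, which is exactly the sense in which the randomness is external to the model.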
-
Facebook (Meta) torrented TBs from Libgen, and their internal chats leaked so we know about that, and IIRC they've been sued. Maybe you're thinking of that case?
Billions of dollars, and they can't afford to buy ebooks?
-
This post did not contain any content.
It took me a few days to get the time to read the actual court ruling but here's the basics of what it ruled (and what it didn't rule on):
- It's legal to scan physical books you already own and keep a digital library of those scanned books, even if the copyright holder didn't give permission. And even if you bought the books used, for very cheap, in bulk.
- It's legal to keep all the book data in an internal database for use within the company, as a central library of works accessible only within the company.
- It's legal to prepare those digital copies for potential use as training material for LLMs, including recognizing the text, performing cleanup on scanning/recognition errors, categorizing and cataloguing them to make editorial decisions on which works to include in which training sets, tokenizing them for the actual LLM technology, etc. This remains legal even for the copies that are excluded from training for whatever reason, as the entire bulk process may involve text that ends up not being used, but the process itself is fair use.
- It's legal to use that book text to create large language models that power services that are commercially sold to the public, as long as there are safeguards that prevent the LLMs from publishing large portions of a single copyrighted work without the copyright holder's permission.
- It's illegal to download unauthorized copies of copyrighted books from the internet, without the copyright holder's permission.
Here's what it didn't rule on:
- Is it legal to distribute large chunks of copyrighted text through one of these LLMs, such as when a user asks a chatbot to recite an entire copyrighted work that is in its training set? (The opinion suggests that it probably isn't legal, and relies heavily on the dividing line of how Google Books does it, by scanning and analyzing an entire copyrighted work but blocking users from retrieving more than a few snippets from those works).
- Is it legal to give anyone outside the company access to the digitized central library assembled by the company from printed copies?
- Is it legal to crawl publicly available digital data to build a library from text already digitized by someone else? (The answer may matter depending on whether there is an authorized method for obtaining that data, or whether the copyright holder refuses to license that copying).
So it's a pretty important ruling, in my opinion. It's a clear green light to the idea of digitizing and archiving copyrighted works without the copyright holder's permission, as long as you first own a legal copy in the first place. And it's a green light to using copyrighted works for training AI models, as long as you compiled that database of copyrighted works in a legal way.
-
Check out my new site TheAIBay, you search for content and an LLM that was trained on reproducing it gives it to you, a small hash check is used to validate accuracy. It is now legal.
The court's ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted work to train the weights of the LLMs, but is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right to sue over those facts.
But the facts before the court were that Anthropic's LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.
-
Does buying the book give you license to digitise it?
Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?
Definitions of "Ownership" can be very different.
Does buying the book give you license to digitise it?
Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?
Yes. That's what the court ruled here. If you legally obtain a printed copy of a book you are free to digitize it or archive it for yourself. And you're allowed to keep that digital copy, analyze and index it and search it, in your personal library.
Anthropic's practice of buying physical books, removing the bindings, scanning the pages, and digitizing the content while destroying the physical book was found to be legal, so long as Anthropic didn't distribute that library outside of its own company.
-
Make an AI that is trained on the books.
Tell it to tell you a story for one of the books.
Read the story without paying for it.
The law says this is ok now, right?
The law says this is ok now, right?
No.
The judge accepted the fact that Anthropic prevents users from obtaining the underlying copyrighted text through interaction with its LLM, and that there are safeguards in the software that prevent a user from being able to get an entire copyrighted work out of that LLM. It discusses the Google Books arrangement, where the books are scanned in their entirety, but where a user searching in Google Books can't actually retrieve more than a few snippets from any given book.
Anthropic gets to keep its copy of the entire book. It doesn't get to transmit the contents of that book to someone else, even through the LLM service.
The judge also explicitly stated that if the authors can put together evidence that it is possible for a user to retrieve their entire copyrighted work out of the LLM, they'd have a different case and could sue over it at that time.
-
But if one person buys a book, trains an "AI model" to recite it, then distributes that model we good?
No. The court made its ruling with the explicit understanding that the software was configured not to recite more than a few snippets from any copyrighted work, and would never produce an entire copyrighted work (or even a significant portion of a copyrighted work) in its output.
And the judge specifically reserved that question, saying if the authors could develop evidence that it was possible for a user to retrieve significant copyrighted material out of the LLM, they'd have a different case and would be able to sue under those facts.
-
You're poor? Fuck you, you have to pay to breathe.
Millionaire? Whatever you want daddy uwu
That's kind of how I read it too.
But as a side effect it means you're still allowed to photograph your own books at home as a private citizen if you own them.
Prepare to never legally own another piece of media in your life.
-
Yes, and that part of the case is going to trial. This was a preliminary judgment specifically about the training itself.
specifically about the training itself.
It's two issues being ruled on.
Yes, as you mention, the act of training an LLM was ruled to be fair use, assuming that the digital training data was legally obtained.
The other part of the ruling, which I think is really, really important for everyone, not just AI/LLM companies or developers, is that it is legal to buy printed books and digitize them into a central library with indexed metadata. Anthropic has to go to trial on the pirated books they just downloaded from the internet, but has fully won the portion of the case about the physical books they bought and digitized.
-
I am not a lawyer. I am talking about reality.
What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning? Where is the learning happening? Who is doing the learning?
Who is stopping the individuals at the LLM company from learning or analysing a given book?
From my experience living in the US, this is pretty standard American-style corruption. Lots of pomp and bombast and roleplay of sorts, but the outcome is no different from any other country that is in deep need of judicial and anti-corruption reform.
What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning?
No, you're framing the issue incorrectly.
The law concerns itself with copying. When humans learn, they inevitably copy things. They may memorize portions of copyrighted material, and then retrieve those memories in doing something new with them, or just by recreating it.
If the argument is that the mere act of copying for training an LLM is illegal copying, then what would we say about the use of copyrighted text for teaching children? They will memorize portions of what they read. They will later write some of them down. And if there is a person who memorizes an entire poem (or song) and then writes it down for someone else, that's actually a copyright violation. But if they memorize that poem or song and reuse it in creating something new and different, but with links and connections to that previous copyrighted work, then that kind of copying and processing is generally allowed.
The judge here is analyzing what exact types of copying are permitted under the law, and for that, the copyright holders' argument would sweep too broadly and prohibit all sorts of methods that humans use to learn.
-
FTA:
Anthropic warned against “[t]he prospect of ruinous statutory damages—$150,000 times 5 million books”: that would mean $750 billion.
So part of their argument is actually that they stole so much that it would be impossible for them/anyone to pay restitution, therefore we should just let them off the hook.
The problem isn't that Anthropic gets to use that defense, it's that others don't. The fact that the world is in a place where people can be fined 5+ years of the average western European salary for making a copy of one (1) book that does not materially affect the copyright holder in any way is insane, and it is good to point that out no matter who does it.
-