
Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not

Technology
  • why do you even jailbreak your kindle? you can still read pirated books on them if you connect it to your pc using calibre

    when not in use i have it load images from my local webserver that are generated by some scripts and feature local news or the weather. kindle screensaver sucks.

  • You may be correct but we don't really know how humans learn.

    There's a ton of research on it and a lot of theories but no clear answers.
    There's general agreement that the brain is a bunch of neurons; there are no convincing ideas on how consciousness arises from that mass of neurons.
    The brain also has a bunch of chemicals that affect neural processing; there are no convincing ideas on how that gets you consciousness either.

    We modeled perceptrons after neurons, and we've been working to make them more like neurons. Neurons don't have any obvious capabilities that perceptrons lack.

    That's the big problem with any claim that "AI doesn't do X like a person"; since we don't know how people do it we can neither verify nor refute that claim.

    There's more to AI than just being non-deterministic. Anything that's too deterministic definitely isn't an intelligence though, natural or artificial. Video compression algorithms are definitely very far removed from AI.

    One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and "any combination of deterministic components will result in a deterministic system". Randomness has to be externally injected into e.g. current LLMs to produce 'non-deterministic' output.

    There is the notable exception of newer models like GPT-4, which seemingly produce non-deterministic outputs (i.e. give them the same sentence and they produce different outputs even with temperature set to 0) - but my understanding is that this is due to floating-point inaccuracies which lead to different token selection, and is thus a function of our current processor architectures, not something inherent in the model itself.
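The role of temperature in this argument can be sketched in a few lines. This is a hypothetical, simplified sampler (real LLM decoding loops are more involved): the forward pass producing the logits is deterministic, and randomness only enters at the sampling step, while temperature 0 degenerates to a plain argmax with no randomness at all.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=None):
    """Pick a next-token index from raw logits (simplified sketch).

    temperature == 0 degenerates to argmax (greedy, fully deterministic);
    any temperature > 0 needs an external randomness source (rng).
    """
    if temperature == 0:
        # Greedy decoding: no randomness involved at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Softmax with temperature scaling; subtract the max for numerical stability.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Randomness is injected here, outside the deterministic forward pass.
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With a fixed seed even the temperature > 0 path is reproducible, which is the commenter's point: the non-determinism is injected from outside the model.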

  • Facebook (Meta) torrented TBs from Libgen, and their internal chats leaked so we know about that, and IIRC they've been sued. Maybe you're thinking of that case?

    Billions of dollars, and they can't afford to buy ebooks?

  • This post did not contain any content.

    It took me a few days to get the time to read the actual court ruling but here's the basics of what it ruled (and what it didn't rule on):

    • It's legal to scan physical books you already own and keep a digital library of those scanned books, even if the copyright holder didn't give permission. And even if you bought the books used, for very cheap, in bulk.
    • It's legal to keep all the book data in an internal database for use within the company, as a central library of works accessible only within the company.
    • It's legal to prepare those digital copies for potential use as training material for LLMs, including recognizing the text, performing cleanup on scanning/recognition errors, categorizing and cataloguing them to make editorial decisions on which works to include in which training sets, tokenizing them for the actual LLM technology, etc. This remains legal even for the copies that are excluded from training for whatever reason, as the entire bulk process may involve text that ends up not being used, but the process itself is fair use.
    • It's legal to use that book text to create large language models that power services that are commercially sold to the public, as long as there are safeguards that prevent the LLMs from publishing large portions of a single copyrighted work without the copyright holder's permission.
    • It's illegal to download unauthorized copies of copyrighted books from the internet.

    Here's what it didn't rule on:

    • Is it legal to distribute large chunks of copyrighted text through one of these LLMs, such as when a user asks a chatbot to recite an entire copyrighted work that is in its training set? (The opinion suggests that it probably isn't legal, and relies heavily on the dividing line of how Google Books does it, by scanning and analyzing an entire copyrighted work but blocking users from retrieving more than a few snippets from those works).
    • Is it legal to give anyone outside the company access to the digitized central library assembled by the company from printed copies?
    • Is it legal to crawl publicly available digital data to build a library from text already digitized by someone else? (The answer may matter depending on whether there is an authorized method for obtaining that data, or whether the copyright holder refuses to license that copying).

    So it's a pretty important ruling, in my opinion. It's a clear green light to the idea of digitizing and archiving copyrighted works without the copyright holder's permission, as long as you legally own a copy in the first place. And it's a green light to using copyrighted works for training AI models, as long as you compiled that database of copyrighted works in a legal way.

  • Check out my new site TheAIBay, you search for content and an LLM that was trained on reproducing it gives it to you, a small hash check is used to validate accuracy. It is now legal.

    The court's ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted work to train the weights of the LLMs, but is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right to sue over those facts.

    But the facts before the court were that Anthropic's LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.
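The kind of safeguard being described can be imagined as an output filter that checks generated text against an index of protected works before releasing it. This is a purely hypothetical sketch for illustration, not Anthropic's actual mechanism (which isn't public): it flags outputs that reproduce long verbatim stretches while letting short snippets through.

```python
def build_ngram_index(corpus_text, n=8):
    """Index every n-word window of a protected work (hypothetical filter)."""
    words = corpus_text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def longest_verbatim_run(output_text, index, n=8):
    """Longest streak of consecutive output n-grams found verbatim in the corpus."""
    words = output_text.lower().split()
    longest = run = 0
    for i in range(len(words) - n + 1):
        if tuple(words[i:i + n]) in index:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

def allow_output(output_text, index, n=8, max_run=2):
    """Allow short snippets (quotation, commentary) but block long verbatim copies."""
    return longest_verbatim_run(output_text, index, n) <= max_run
```

The design mirrors the Google Books line the ruling leans on: whole works go into the index, but only snippet-sized matches can come back out.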

  • Does buying the book give you license to digitise it?

    Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?

    Definitions of "Ownership" can be very different.

    Does buying the book give you license to digitise it?

    Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?

    Yes. That's what the court ruled here. If you legally obtain a printed copy of a book you are free to digitize it or archive it for yourself. And you're allowed to keep that digital copy, analyze and index it and search it, in your personal library.

    Anthropic's practice of buying physical books, removing the bindings, scanning the pages, and digitizing the content while destroying the physical book was found to be legal, so long as Anthropic didn't distribute that library outside of its own company.

  • Make an AI that is trained on the books.

    Tell it to tell you a story for one of the books.

    Read the story without paying for it.

    The law says this is ok now, right?

    The law says this is ok now, right?

    No.

    The judge accepted the fact that Anthropic prevents users from obtaining the underlying copyrighted text through interaction with its LLM, and that there are safeguards in the software that prevent a user from being able to get an entire copyrighted work out of that LLM. It discusses the Google Books arrangement, where the books are scanned in their entirety, but where a user searching in Google Books can't actually retrieve more than a few snippets from any given book.

    Anthropic gets to keep its copy of the entire book. It doesn't get to transmit the contents of that book to someone else, even through the LLM service.

    The judge also explicitly stated that if the authors can put together evidence that it is possible for a user to retrieve their entire copyrighted work out of the LLM, they'd have a different case and could sue over it at that time.

  • But if one person buys a book, trains an "AI model" to recite it, then distributes that model we good?

    No. The court made its ruling with the explicit understanding that the software was configured not to recite more than a few snippets from any copyrighted work, and would never produce an entire copyrighted work (or even a significant portion of a copyrighted work) in its output.

    And the judge specifically reserved that question, saying if the authors could develop evidence that it was possible for a user to retrieve significant copyrighted material out of the LLM, they'd have a different case and would be able to sue under those facts.

  • You're poor? Fuck you you have to pay to breathe.

    Millionaire? Whatever you want daddy uwu

    That's kind of how I read it too.

    But as a side effect it means you're still allowed to photograph your own books at home as a private citizen if you own them.

    Prepare to never legally own another piece of media in your life. 😄

  • Yes, and that part of the case is going to trial. This was a preliminary judgment specifically about the training itself.

    specifically about the training itself.

    It's two issues being ruled on.

    Yes, as you mention, the act of training an LLM was ruled to be fair use, assuming that the digital training data was legally obtained.

    The other part of the ruling, which I think is really, really important for everyone, not just AI/LLM companies or developers, is that it is legal to buy printed books and digitize them into a central library with indexed metadata. Anthropic has to go to trial on the pirated books they just downloaded from the internet, but has fully won the portion of the case about the physical books they bought and digitized.

  • I am not a lawyer. I am talking about reality.

    What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning? Where is the learning happening? Who is doing the learning?

    Who is stopping the individuals at the LLM company from learning or analysing a given book?

    From my experience living in the US, this is pretty standard American-style corruption. Lots of pomp and bombast and roleplay of sorts, but the outcome is no different from any other country that is in deep need of judicial and anti-corruption reform.

    What does an LLM application (or training processes associated with an LLM application) have to do with the concept of learning?

    No, you're framing the issue incorrectly.

    The law concerns itself with copying. When humans learn, they inevitably copy things. They may memorize portions of copyrighted material, and then retrieve those memories in doing something new with them, or just by recreating it.

    If the argument is that the mere act of copying for training an LLM is illegal copying, then what would we say about the use of copyrighted text for teaching children? They will memorize portions of what they read. They will later write some of them down. And if there is a person who memorizes an entire poem (or song) and then writes it down for someone else, that's actually a copyright violation. But if they memorize that poem or song and reuse it in creating something new and different, but with links and connections to that previous copyrighted work, then that kind of copying and processing is generally allowed.

    The judge here is analyzing what exact types of copying are permitted under the law, and for that, the copyright holders' argument would sweep too broadly and prohibit all sorts of methods that humans use to learn.

  • FTA:

    Anthropic warned against “[t]he prospect of ruinous statutory damages—$150,000 times 5 million books”: that would mean $750 billion.

    So part of their argument is actually that they stole so much that it would be impossible for them/anyone to pay restitution, therefore we should just let them off the hook.

    The problem isn't that Anthropic gets to use that defense, it's that others don't. The fact that the world is in a place where people can be fined 5+ years of an average western European salary for making a copy of one (1) book that does not materially affect the copyright holder in any way is insane, and it is good to point that out no matter who does it.

  • thanks I hate it xD

    The language model isn't teaching anything; it is changing the wording of something and spitting it back out. And in some cases, not changing the wording at all, just spitting the information back out, without paying the copyright source. It is not alive, it has no thoughts. It has no "its own words." (As seen by the judgement that its words cannot be copyrighted.) It only has other people's words. Every word it spits out is by definition plagiarism, whether the work was copyrighted before or not.

    People wonder why works, such as journalism, are getting worse. Well, how could they ever get better if anything a journalist writes can be absorbed in real time, reworded and regurgitated without paying any dues to the original source? One journalist's article, displayed in 30 versions, dividing the original work's worth up into 30 portions. The original work now being worth 1/30th its original value. Maybe one can argue it is twice as good, so 1/15th.

    Long term it means all original creations... Are devalued and therefore not nearly worth pursuing. So we will only get shittier and shittier information. Every research project... Physics, Chemistry, Psychology, all technological advancements, slowly degraded as language models get better, and returns on original sources diminish.

    just spitting the information back out, without paying the copyright source

    The court made its ruling under the factual assumption that it isn't possible for a user to retrieve copyrighted text from that LLM, and explained that if a copyright holder does develop evidence that it is possible to get entire significant chunks of their copyrighted text out of that LLM, then they'd be able to sue under those facts and that evidence.

    It relies heavily on the analogy to Google Books, which scans in entire copyrighted books to build the database, but where users of the service simply cannot retrieve more than a few snippets from any given book. That way, Google cannot be said to be redistributing entire books to its users without the publisher's permission.

  • You’re right, each of the 5 million books’ authors should agree to less payment for their work, to make the poor criminals feel better.

    If I steal $100 from a thousand people and spend it all on hookers and blow, do I get out of paying that back because I don’t have the funds? Should the victims agree to get $20 back instead because that’s more within my budget?

    You think that $150,000, or roughly 250 weeks of full-time pretax wages at $15 an hour, is a reasonable fine for making a copy of one book which does no material harm to the copyright holder?

  • I wonder if the archive.org cases had any bearing on the decision.

    Archive.org was distributing the books themselves to users. Anthropic argued (and the authors suing them weren't able to show otherwise) that their software prevents users from actually retrieving books out of the LLM, and that it only will produce snippets of text from copyrighted works. And producing snippets in the context of something else is fair use, like commentary or criticism.

    It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the Internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.

    This does raise an interesting case where libraries could end up training and distributing public domain AI models.

    I would actually be okay with libraries having those AI services. Even if they were available only for a fee it would be absurdly low and still waived for people with low or no income.

    You think that $150,000, or roughly 250 weeks of full-time pretax wages at $15 an hour, is a reasonable fine for making a copy of one book which does no material harm to the copyright holder?

    No I don’t, but we’re not talking about a single copy of one book, and it is grovellingly insidious to imply that we are.

    We are talking about a company taking the work of an author, of thousands of authors, and using it as the backbone of a machine whose goal is to make those authors obsolete.

    When the people who own the slop-machine are making millions of dollars off the back of stolen works, they can very much afford to pay those authors. If you can’t afford to run your business without STEALING, then your business is a pile of flaming shit that deserves to fail.

  • None of the above. Every professional in the world, including me, owes our careers to looking at examples of other people's work and incorporating their work into our own work without paying a penny for it. Freely copying and imitating what we see around us has been a human norm for thousands of years - in a process known as "the spread of civilization". Relatively recently it was demonized - for purely business reasons, not moral ones - by people who got rich selling copies of other people's work and paying them a pittance known as a "royalty". That little piece of bait on the hook has convinced a lot of people to put a black hat on behavior that had been considered normal forever. If angry modern enlightened justice warriors want to treat a business concept like a moral principle and get all sweaty about it, that's fine with me, but I'm more of a traditionalist in that area.

    Nobody who is mad at this situation thinks that taking inspiration, riffing on, or referencing other people’s work is the problem when a human being does it. When a person writes, there is intention behind it.

    The issue is when a business, owned by those people you think "demonised" inspiration, takes the works of authors and mulches them into something they lovingly named "The Pile", in order to create derivative slop off the backs of creatives.

    When you, as a "professional", ask AI to write you a novel, who is being inspired? Who is making the connections between themes? Who is carefully crafting the text to pay loving reference to another author's work? Not you. Not the algorithm that is guessing what word to shit out next based on math.

    These businesses have tricked you into thinking that what they are doing is noble.

  • One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and "any combination of deterministic components will result in a deterministic system". Randomness has to be externally injected into e.g. current LLMs to produce 'non-deterministic' output.

    There is the notable exception of newer models like GPT-4, which seemingly produce non-deterministic outputs (i.e. give them the same sentence and they produce different outputs even with temperature set to 0) - but my understanding is that this is due to floating-point inaccuracies which lead to different token selection, and is thus a function of our current processor architectures, not something inherent in the model itself.

    You're correct that a collection of deterministic elements will produce a deterministic result.

    LLMs produce a probability distribution over next tokens and then randomly select one of them. That's where the non-determinism enters the system. Even if you set the temperature to 0, you can still get some randomness: GPUs sum floating-point numbers in whatever order their parallel threads happen to finish, and floating-point addition isn't associative, so the computed logits can differ in their last bits from run to run. When two candidate tokens are nearly tied, that tiny difference is enough to flip which token gets selected.

    You can test this empirically. Set the temperature to 0 and ask it, "give me a random number". You'll rarely get the same number twice in a row, no matter how similar you try to make the starting conditions.
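The floating-point side of this is easy to demonstrate on a CPU, without any GPU at all. This sketch sums the same list of numbers in two different orders, standing in for the varying reduction orders of parallel GPU kernels; the two totals typically differ in the last bits even though the math is identical, and a near-tied argmax is exactly where such a difference can flip the selected token.

```python
import random

def reduce_in_order(values, order):
    """Sum the same numbers in a given order; float addition is not associative."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

random.seed(42)
values = [random.uniform(-1.0, 1.0) for _ in range(10000)]

order_a = list(range(len(values)))
order_b = order_a[::-1]  # same numbers, reversed reduction order

sum_a = reduce_in_order(values, order_a)
sum_b = reduce_in_order(values, order_b)

# The totals are mathematically equal but usually differ in the last bits.
print(sum_a, sum_b, abs(sum_a - sum_b))

# If a competing logit sits between sum_a and sum_b, greedy (temperature 0)
# decoding would pick different tokens depending on the reduction order.
```

The differences are tiny (on the order of the unit roundoff times the magnitude of the sum), which is why this only matters when the top logits are nearly tied.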