Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not

  • The language model isn't teaching anything; it is changing the wording of something and spitting it back out. And in some cases, not changing the wording at all, just spitting the information back out, without paying the copyright source. It is not alive, and it has no thoughts. It has no words of its own. (As seen by the judgement that its words cannot be copyrighted.) It only has other people's words. Every word it spits out is by definition plagiarism, whether the work was copyrighted before or not.

    People wonder why works such as journalism are getting worse. Well, how could they ever get better if anything a journalist writes can be absorbed in real time, reworded, and regurgitated without paying any dues to the original source? One journalist's article, displayed in 30 versions, divides the original work's worth into 30 portions; the original work is now worth 1/30th of its original value. Maybe one can argue the result is twice as good, so 1/15th.

    Long term it means all original creations... are devalued and therefore not nearly worth pursuing. So we will only get shittier and shittier information. Every research project... physics, chemistry, psychology, all technological advancements, slowly degraded as language models get better and original sources see diminishing returns.

    The language model isn't teaching anything; it is changing the wording of something and spitting it back out. And in some cases, not changing the wording at all, just spitting the information back out, without paying the copyright source.

    You could honestly say the same about most "teaching" that a student without a real comprehension of the subject does for another student. But ultimately, that's beside the point. Because changing the wording, structure, and presentation is all that is necessary to avoid copyright violation. You cannot copyright the information. Only a specific expression of it.

    There's no special exception for AI here. That's how copyright works for you, me, the student, and the AI. And if you're hoping that copyright is going to save you from the outcomes you're worried about, it won't.

  • This post did not contain any content.

  • Good luck breaking down people's doors for scanning their own physical books for their personal use, when paper books are an analog medium with no DRM that can't phone home.

    That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.

  • Make up a word that is not found anywhere on the internet

    Returns a word that is found on the internet as a brand of nose rings, as a YouTube username, as an already made-up word in fantasy fiction, and as an (OCR?) typo of urethra.

    That's a reasonable critique.

    The point is that it's trivial to come up with new words. Put that same prompt into a bunch of different LLMs and you'll get a bunch of different words; some of them may already exist somewhere, others won't. The rules for combining word parts are so simple that children play them as games.

    The LLM doesn't actually even recognize "words"; it recognizes tokens, which are typically parts of words. It usually avoids random combinations of those, but you can easily get it to produce them if you want.

  • "Recite the complete works of Shakespeare but replace every thirteenth thou with this"

    A court will decide such cases. Most AI models aren't trained for the purpose of whitewashing content, even if some people imply that's all they do. But if you decided to actually train a model for that explicit purpose, you would most likely not get away with it if someone dragged you in front of a court over it.

    It's similar to the defense some file-hosting websites used against charges of hosting and distributing copyrighted content (e.g. MEGA). In those cases it was very clear what their real goals were (especially in court), and at the same time it did not kill all file-sharing websites, because not all of them were built with the intention of distributing illegal material under the guise of legitimate operation.

  • Large AI companies themselves want people to be ignorant of how AI works, though. They want uncritical acceptance of the tech as they force it everywhere, creating a radical counterreaction from people. The reaction might be uncritical too; I'd prefer to say it's merely unjustified in specific cases, or overly emotional, but it doesn't come from nowhere or from sheer stupidity. We have been hearing about people treating their chatbots as sentient beings since like 2022 (remember that guy from Google?), and we are bombarded with doomer (or, from the AI companies' point of view, very desirable) projections about AI replacing most jobs and wreaking havoc on the world economy. How are ordinary people supposed to remain calm and balanced when hearing such stuff all the time?

    This, so very much. I've been saying it since 2020. People who think the big corporations (even the ones that use AI) aren't playing both sides of this issue from the very beginning just aren't paying attention.

    It's in their interest to energize those negative toward AI into an "us vs. them" mentality, so that those positive toward AI end up defending the companies by association, and the other way around as well. It's the classic divide and conquer.

    Because if people refuse to talk to each other about it in good faith, refuse to treat each other with respect, and never learn where the other side is coming from or why they hold their opinions, you can keep them fighting amongst themselves instead of banding together and demanding realistic, fair policies regarding AI. This is why bad-faith arguments and positions must be shot down, both on the side you agree with and on the one you disagree with.

  • You are obviously not educated on this.

    It did not "learn" any more than a downloaded video run through a compression algorithm does.
    Just: LoLz.

    I've hand-calculated forward propagation (neural networks). AI does not learn; it's statically optimized. AI "learning" is curve fitting. Human learning requires understanding, which AI is not capable of.

  • They seem pretty different to me.

    Video compression developers go through a lot of effort to make codecs deterministic. We don't necessarily care that a particular video stream compresses to a particular bit sequence, but we very much care that the resulting decompression gets you as close to the original as possible.

    AIs will rarely produce exact replicas of anything. They synthesize outputs from heterogeneous training data. That sounds like learning to me.

    The one area where there's some similarity is dimensionality reduction. It's technically a form of compression, since it makes your files smaller. It would also be an extremely expensive way to get extremely bad compression: it would take orders of magnitude more hardware resources, and the images would likely be unrecognizable.

    Google search results aren't deterministic, but I wouldn't say it "learns" like a person. Algorithms with pattern detection aren't the same as human learning.

  • Google search results aren't deterministic, but I wouldn't say it "learns" like a person. Algorithms with pattern detection aren't the same as human learning.

    You may be correct but we don't really know how humans learn.

    There's a ton of research on it and a lot of theories but no clear answers.
    There's general agreement that the brain is a bunch of neurons; there are no convincing ideas on how consciousness arises from that mass of neurons.
    The brain also has a bunch of chemicals that affect neural processing; there are no convincing ideas on how that gets you consciousness either.

    We modeled perceptrons after neurons, and we've been working to make them more like neurons. Neurons don't have any obvious capabilities that perceptrons lack.

    That's the big problem with any claim that "AI doesn't do X like a person"; since we don't know how people do it we can neither verify nor refute that claim.

    There's more to AI than just being non-deterministic. Anything that's too deterministic definitely isn't an intelligence though, natural or artificial. Video compression algorithms are definitely very far removed from AI.

  • why do you even jailbreak your kindle? you can still read pirated books on it if you connect it to your pc using calibre

    Hehe jailbreak an Android OS. You mean “rooting”.

  • This post did not contain any content.

    Judge, I'm pirating them to train AI, not to consume for my own personal use.

  • Good luck breaking down people's doors for scanning their own physical books for their personal use, when paper books are an analog medium with no DRM that can't phone home.

    That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.

    The ruling explicitly says that scanning books and keeping/using those digital copies is legal.

    The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.

  • Good luck breaking down people's doors for scanning their own physical books for their personal use, when paper books are an analog medium with no DRM that can't phone home.

    That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.

    It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.

    This does raise an interesting case where libraries could end up training and distributing public domain AI models.

  • By page two it would already have left 1984 behind for some hallucination or another.

    Oh, so it would be the news?

  • The ruling explicitly says that scanning books and keeping/using those digital copies is legal.

    The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.

    I wonder if the archive.org cases had any bearing on the decision.

  • Does it "generate" a 1:1 copy?

    You can train an LLM to generate 1:1 copies

  • why do you even jailbreak your kindle? you can still read pirated books on it if you connect it to your pc using calibre

    When not in use, I have it load images from my local webserver that are generated by some scripts and feature local news or the weather. The Kindle screensaver sucks.

  • You may be correct but we don't really know how humans learn.

    There's a ton of research on it and a lot of theories but no clear answers.
    There's general agreement that the brain is a bunch of neurons; there are no convincing ideas on how consciousness arises from that mass of neurons.
    The brain also has a bunch of chemicals that affect neural processing; there are no convincing ideas on how that gets you consciousness either.

    We modeled perceptrons after neurons, and we've been working to make them more like neurons. Neurons don't have any obvious capabilities that perceptrons lack.

    That's the big problem with any claim that "AI doesn't do X like a person"; since we don't know how people do it we can neither verify nor refute that claim.

    There's more to AI than just being non-deterministic. Anything that's too deterministic definitely isn't an intelligence though, natural or artificial. Video compression algorithms are definitely very far removed from AI.

    One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and "any combination of deterministic components will result in a deterministic system". Randomness has to be externally injected into e.g. current LLMs to produce 'non-deterministic' output.

    There is the notable exception of newer models like ChatGPT4, which seemingly produce non-deterministic outputs (i.e. give it the same sentence and it produces different outputs even with its temperature set to 0). But my understanding is that this is due to floating-point inaccuracies which lead to different token selection, making it a function of our current processor architectures and not inherent in the model itself.

  • Facebook (Meta) torrented TBs from Libgen, and their internal chats leaked so we know about that, and IIRC they've been sued. Maybe you're thinking of that case?

    Billions of dollars, and they can't afford to buy ebooks?

  • This post did not contain any content.

    It took me a few days to get the time to read the actual court ruling, but here are the basics of what it ruled (and what it didn't rule on):

    • It's legal to scan physical books you already own and keep a digital library of those scanned books, even if the copyright holder didn't give permission. And even if you bought the books used, for very cheap, in bulk.
    • It's legal to keep all the book data in an internal database for use within the company, as a central library of works accessible only within the company.
    • It's legal to prepare those digital copies for potential use as training material for LLMs, including recognizing the text, performing cleanup on scanning/recognition errors, categorizing and cataloguing them to make editorial decisions on which works to include in which training sets, tokenizing them for the actual LLM technology, etc. This remains legal even for the copies that are excluded from training for whatever reason, as the entire bulk process may involve text that ends up not being used, but the process itself is fair use.
    • It's legal to use that book text to create large language models that power services that are commercially sold to the public, as long as there are safeguards that prevent the LLMs from publishing large portions of a single copyrighted work without the copyright holder's permission.
    • It's illegal to download unauthorized copies of copyrighted books from the internet, without the copyright holder's permission.

    Here's what it didn't rule on:

    • Is it legal to distribute large chunks of copyrighted text through one of these LLMs, such as when a user asks a chatbot to recite an entire copyrighted work that is in its training set? (The opinion suggests that it probably isn't legal, and relies heavily on the dividing line of how Google Books does it, by scanning and analyzing an entire copyrighted work but blocking users from retrieving more than a few snippets from those works).
    • Is it legal to give anyone outside the company access to the digitized central library assembled by the company from printed copies?
    • Is it legal to crawl publicly available digital data to build a library from text already digitized by someone else? (The answer may matter depending on whether there is an authorized method for obtaining that data, or whether the copyright holder refuses to license that copying).

    So it's a pretty important ruling, in my opinion. It's a clear green light for digitizing and archiving copyrighted works without the copyright holder's permission, as long as you own a legal copy in the first place. And it's a green light for using copyrighted works to train AI models, as long as you compiled that database of copyrighted works in a legal way.

  • Check out my new site, TheAIBay: you search for content, and an LLM that was trained on reproducing it gives it to you; a small hash check is used to validate accuracy. It is now legal.

    The court's ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted works to train the weights of its LLMs, but the models are configured not to actually copy those works back out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right to sue over those facts.

    But the facts before the court were that Anthropic's LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.