Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not

Technology
  • Make an AI that is trained on the books.

    Tell it to tell you the story of one of the books.

    Read the story without paying for it.

    The law says this is ok now, right?

    As long as they don't use exactly the same words as the book, yeah, as I understand it.

  • Huh? Didn't Meta skip getting any permission, and pirate a lot of books to train their model?

    True. And I will be happy if someone sues them and the judge says the same thing.

  • If I understand correctly, they are ruling you can buy a book once and redistribute the information to as many people as you want without consequences. Aka 1 student should be able to buy a textbook and redistribute it to all other students for free. (Yet the rules only work for companies apparently, as the students would still be committing a crime)

    They may be trying to put safeguards in place so it isn't directly happening, but here is an example where the text is there, word for word:

    If I understand correctly, they are ruling you can buy a book once and redistribute the information to as many people as you want without consequences. Aka 1 student should be able to buy a textbook and redistribute it to all other students for free. (Yet the rules only work for companies apparently, as the students would still be committing a crime)

    Well, it would be interesting if this case were used as precedent in a case involving a single student who did the same thing. But you are right.

  • So I can't use any of these works because it's plagiarism but AI can?

    You can “use” them to learn from, just like “AI” can.

    What exactly do you think AI does when it “learns” from a book, for example? Do you think it will just spit out the entire book if you ask it to?

  • AI can “learn” from and “read” a book in the same way a person can and does

    This statement is the basis for your argument and it is simply not correct.

    Training LLMs and similar AI models is much closer to a sophisticated lossy compression algorithm than it is to human learning. The processes are not at all similar given our current understanding of human learning.
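
    To make the "lossy compression" point concrete, here is a toy sketch of my own (a character-level bigram model, nothing like a real LLM in scale or architecture): after "training," all it keeps is co-occurrence statistics, not the text itself, and sampling from it produces statistically similar output rather than a stored copy.

    ```python
    # Toy "lossy compression" model: training keeps only bigram statistics,
    # not the original text. (An illustration, not anyone's actual model.)
    import random
    from collections import Counter, defaultdict

    def train(text):
        counts = defaultdict(Counter)   # counts[a][b] = times b followed a
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
        return counts                   # the text itself is discarded

    def sample(counts, start, length=60):
        out = [start]
        for _ in range(length):
            followers = counts.get(out[-1])
            if not followers:
                break
            chars, weights = zip(*followers.items())
            out.append(random.choices(chars, weights=weights)[0])
        return "".join(out)

    model = train("the cat sat on the mat. the dog sat on the log. ")
    print(sample(model, "t"))  # plausible text, not a retrieval of the source
    ```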

    AI doesn’t reproduce a work that it “learns” from, so why would it be illegal?

    The current Disney lawsuit against Midjourney is illustrative - literally, it includes numerous side-by-side comparisons - of how AI models are capable of recreating iconic copyrighted work that is indistinguishable from the original.

    If a machine can replicate your writing style because it can identify certain patterns, words, sentence structure, etc., then as long as it's not pretending to create things attributed to you, there's no issue.

    An AI doesn't create works on its own. A human instructs AI to do so. Attribution is also irrelevant. If a human uses AI to recreate the exact tone, structure, and other nuances of, say, some best-selling author, they harm the marketability of the original works, which fails fair use tests (at least in the US).

    Your very first statement calling my basis for my argument incorrect is incorrect lol.

    LLMs “learn” things from the content they consume. They don’t just take the content in wholesale and keep it there to regurgitate on command.

    On your last part: unless someone uses AI to recreate the tone etc. of a best-selling author *and then markets their book/writing as being from said best-selling author*, and doesn't use trademarked characters etc., there's no issue. You can't copyright a style of writing.

  • This post did not contain any content.

    But I thought they admitted to torrenting terabytes of ebooks?

  • You can “use” them to learn from, just like “AI” can.

    What exactly do you think AI does when it “learns” from a book, for example? Do you think it will just spit out the entire book if you ask it to?

    It can't speak or use any words without them being someone else's words that it learned from, can it? Unless it's giving sources, everything is always from something it learned, because it cannot speak or use words without that source in the first place.

  • If I understand correctly, they are ruling you can buy a book once and redistribute the information to as many people as you want without consequences. Aka 1 student should be able to buy a textbook and redistribute it to all other students for free. (Yet the rules only work for companies apparently, as the students would still be committing a crime)

    Well, it would be interesting if this case were used as precedent in a case involving a single student who did the same thing. But you are right.

    This was my understanding also, and why I think the judge is bad at their job.

  • AI can “learn” from and “read” a book in the same way a person can and does

    This statement is the basis for your argument and it is simply not correct.

    Training LLMs and similar AI models is much closer to a sophisticated lossy compression algorithm than it is to human learning. The processes are not at all similar given our current understanding of human learning.

    AI doesn’t reproduce a work that it “learns” from, so why would it be illegal?

    The current Disney lawsuit against Midjourney is illustrative - literally, it includes numerous side-by-side comparisons - of how AI models are capable of recreating iconic copyrighted work that is indistinguishable from the original.

    If a machine can replicate your writing style because it can identify certain patterns, words, sentence structure, etc., then as long as it's not pretending to create things attributed to you, there's no issue.

    An AI doesn't create works on its own. A human instructs AI to do so. Attribution is also irrelevant. If a human uses AI to recreate the exact tone, structure, and other nuances of, say, some best-selling author, they harm the marketability of the original works, which fails fair use tests (at least in the US).

    Even if we accept all your market-liberal premises without question... in your own rhetorical framework, the Disney lawsuit should be ruled against Disney.

    If a human uses AI to recreate the exact tone, structure, and other nuances of, say, some best-selling author, they harm the marketability of the original works, which fails fair use tests (at least in the US).

    Says who? In a free market, why is competition from similar products and brands such a threat that it must be outlawed? Think reasonably about what you are advocating... you think authorship is so valuable or so special that one should be granted a legally enforceable monopoly on the loosest notions of authorship. This is the definition of a slippery slope, and yet it is the status quo of the society we live in.

    On it "harming marketability of the original works," frankly, that's a fiction and anyone advocating such ideas should just fucking weep about it instead of enforce overreaching laws on the rest of us. If you can't sell your art because a machine made "too good a copy" of your art, it wasn't good art in the first place and that is not the fault of the machine. Even big pharma doesn't get to outright ban generic medications (even tho they certainly tried)... it is patently fucking absurd to decry artist's lack of a state-enforced monopoly on their work. Why do you think we should extend such a radical policy towards... checks notes... tumblr artists and other commission based creators? It's not good when big companies do it for themselves through lobbying, it wouldn't be good to do it for "the little guy," either. The real artists working in industry don't want to change the law this way because they know it doesn't work in their favor. Disney's lawsuit is in the interest of Disney and big capital, not artists themselves, despite what these large conglomerates that trade in IPs and dreams might try to convince the art world writ large of.

  • If I understand correctly, they are ruling you can buy a book once and redistribute the information to as many people as you want without consequences. Aka 1 student should be able to buy a textbook and redistribute it to all other students for free. (Yet the rules only work for companies apparently, as the students would still be committing a crime)

    They may be trying to put safeguards in place so it isn't directly happening, but here is an example where the text is there, word for word:

    Not at all true. AI doesn’t just reproduce content it was trained on on demand.

  • My interpretation was that AI companies can train on material they are licensed to use, but the courts have deemed that Anthropic pirated this material as they were not licensed to use it.

    In other words, if Anthropic bought the physical or digital books, it would be fine so long as their AI couldn't spit it out verbatim, but they didn't even do that, i.e. the AI crawler pirated the book.

    Does buying the book give you license to digitise it?

    Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?

    Definitions of "Ownership" can be very different.

  • This was my understanding also, and why I think the judge is bad at their job.

    I suppose someone could develop an LLM that digests textbooks, rewords the text, and spits it back out, then distribute it for free, page for page. You can't copyright the math problems, I don't think... so if the text's wording is what gives it protection, that would have been changed.

  • I joined lemmy specifically to avoid this reddit mindset of jumping to conclusions after reading a headline

    Guess some things never change...

    Well, to be honest, lemmy is less prone to knee-jerk reactionary discussion, but on a handful of topics it is virtually guaranteed to happen no matter what, even here. For example, this entire site, besides a handful of communities, is vigorously anti-AI; and in the words of u/jsomae@lemmy.ml elsewhere in this comment chain:

    "It seems the subject of AI causes lemmites to lose all their braincells."

    I think there is definitely an interesting take on the sociology of the digital age in here somewhere but it's too early in the morning to be tapping something like that out lol

  • You're getting douchevoted because on lemmy any AI-related comment that isn't negative enough about AI is the Devil's Work.

    Some communities on this site speak about machine learning exactly the way grungy Europeans in pre-18th-century manuscripts speak about witches, Satan, and evil... as if it is some pervasive, black-magic miasma.

    As someone who is in the field of machine learning academically/professionally, it's honestly kind of shocking, and it has largely informed my opinion of society at large as an adult. No one puts any effort into learning; if they see the letters "A" and "I" in all caps next to each other, they immediately turn their brain off and start regurgitating points and responding reflexively, on Lemmy or otherwise. People talk about it so confidently while being so frustratingly unaware of their own ignorance on the matter, which, for lack of a better comparison... reminds me a lot of how, historically and in fiction, human beings have treated literal magic.

    That's my main issue with the entire swath of "pro vs anti AI" discourse... all these people treating something that, to me, is simple & daily reality as something entirely different than my own personal notion of it.

  • You can “use” them to learn from, just like “AI” can.

    What exactly do you think AI does when it “learns” from a book, for example? Do you think it will just spit out the entire book if you ask it to?

    I am educated on this. When an AI learns, it takes an input through a series of functions that are joined at the output. The functions that produce the best output are developed further. Individuals do not process information like that. With poor exploration and biasing, the output of an AI model can look identical to its input. It did not "learn" any more than a downloaded video run through a compression algorithm did.
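
    For anyone curious, here is a minimal sketch of the loop being described (my own illustration, with a single parameter standing in for millions of weights): run an input through a parameterized function, score the output, and nudge the parameter toward whatever scores better.

    ```python
    # Fit y = w * x by gradient descent: the function that produces the
    # best output has its parameter "developed further" each step.
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs
    w = 0.0    # one weight standing in for millions
    lr = 0.05  # learning rate

    for _ in range(200):
        for x, target in data:
            pred = w * x                    # input through the function
            grad = 2 * (pred - target) * x  # how the error changes with w
            w -= lr * grad                  # adjust toward better output

    print(round(w, 2))  # ~2.0: a rule induced from examples, not stored examples
    ```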

  • LLMs don’t learn, and they’re not people. Applying the same logic doesn’t make much sense.

    The judge isn't saying that they learn or that they're people. He's saying that training falls into the same legal classification as learning.

  • Your very first statement calling my basis for my argument incorrect is incorrect lol.

    LLMs “learn” things from the content they consume. They don’t just take the content in wholesale and keep it there to regurgitate on command.

    On your last part: unless someone uses AI to recreate the tone etc. of a best-selling author *and then markets their book/writing as being from said best-selling author*, and doesn't use trademarked characters etc., there's no issue. You can't copyright a style of writing.

    If what you are saying is true, why were these "AIs" incapable of rendering a full wine glass? It "knows" the concept of a full glass of water, but because of humanity's social pressures (a full wine glass being the epitome of gluttony), artwork did not depict a full wine glass. No matter how AI prompters demanded it, the model was unable to link the concepts until such images were literally created for it to regurgitate. It seems "AI" doesn't really learn, but regurgitates art in collages of taken assets, smoothed over at the seams.

  • I suppose someone could develop an LLM that digests textbooks, rewords the text, and spits it back out, then distribute it for free, page for page. You can't copyright the math problems, I don't think... so if the text's wording is what gives it protection, that would have been changed.

    If a human did that, it's still plagiarism.

  • What a bad judge.

    Why? Basically, he simply stated that you can use whatever material you want to train your model, as long as you ask permission from the author (or copyright holder) to use it (and presumably pay for it).

    "Fair use" is the exact opposite of what you're saying here. It says that you don't need to ask for any permission. The judge ruled that obtaining illegitimate copies was unlawful but use without the creators consent is perfectly fine.

  • Not at all true. AI doesn’t just reproduce content it was trained on on demand.

    It can; the only thing stopping it is if it is specifically told not to, and that restriction is successfully checked for. It is completely capable of plagiarizing otherwise.
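
    As a rough idea of what such a check could look like (a hypothetical sketch; I'm not claiming any vendor does it this way): scan the model's reply for long verbatim runs of words from a protected text and block the reply past some threshold.

    ```python
    # Hypothetical safeguard: measure the longest verbatim run of words a
    # reply shares with a source text. Names and threshold are made up.
    def longest_shared_run(reply, source):
        words = reply.split()
        src = " " + " ".join(source.split()) + " "  # pad to match whole words
        best = 0
        for i in range(len(words)):
            j = i + 1
            while j <= len(words) and " " + " ".join(words[i:j]) + " " in src:
                best = max(best, j - i)
                j += 1
        return best

    source = "it was the best of times it was the worst of times"
    reply = "he said it was the best of times it was the worst of days"
    print(longest_shared_run(reply, source))  # 11 consecutive copied words
    # A real system might refuse to answer when this exceeds some threshold.
    ```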
