Judge backs AI firm over use of copyrighted books

  • That "freely" there really does a lot of hard work.

    It means what it means, "freely" pulls its own weight. I didn't say "readily" accessible. Torrents could be viewed as "readily" accessible but it couldn't be viewed as "freely" accessible because at the very least you bear the guilt of theft. Library books are "freely" accessible, and if somehow the training involved checking out books and returning them digitally, it should be fine. If it is free to read into neurons it is free to read into neural systems. If payment for reading is expected then it isn't free.

  • To anyone reading this comment without reading through the article: this ruling doesn't mean that it's okay to pirate for building a model. Anthropic will still need to go through trial for that:

    But he rejected Anthropic's request to dismiss the case, ruling the firm would have to stand trial over its use of pirated copies to build its library of material.

    I also read through the judgement, and I think it's better for Anthropic than you describe. He distinguishes three issues:

    A) Use any written material they get their hands on to train the model (and the resulting model doesn't just reproduce the works).

    B) Buy a single copy of a print book, scan it, and retain the digital copy for a company library (for all sorts of future purposes).

    C) Pirate a book and retain that copy for a company library (for all sorts of future purposes).

    A and B were fair use by summary judgement, meaning this judge thinks it's clear-cut in Anthropic's favor. C will go to trial.

  • On page 6, the judge writes that the LLM “memorized” the content and could “recite” it.

    Neither is true in training or use of LLMs.

    Depends on the content and the method. There are tons of ways to encrypt data, and under relevant law they may still count as copies. There are certainly weaker NN models where we can extract a lot of the training data, even if it's not easy, from the model parameters (even if we can't find a prompt that gets the model to regurgitate).

  • IMO the focus should have always been on the potential for AI to produce copyright-violating output, not on the method of training.

    The plaintiffs made that argument and the judge shoots it down pretty hard: that kind of competition isn't what copyright protects against. He makes an analogy with teachers teaching children to write fiction: they are using existing fantasy to create MANY more competitors on the fiction market. Could an author use copyright to challenge that use?

    Would love to hear your thoughts on the ruling itself (it's linked by Reuters).

  • It means what it means, "freely" pulls its own weight. I didn't say "readily" accessible. Torrents could be viewed as "readily" accessible but it couldn't be viewed as "freely" accessible because at the very least you bear the guilt of theft. Library books are "freely" accessible, and if somehow the training involved checking out books and returning them digitally, it should be fine. If it is free to read into neurons it is free to read into neural systems. If payment for reading is expected then it isn't free.

    Civil cases of copyright infringement are not theft, no matter what the MPAA has trained you to believe.

  • I also read through the judgement, and I think it's better for Anthropic than you describe. He distinguishes three issues:

    A) Use any written material they get their hands on to train the model (and the resulting model doesn't just reproduce the works).

    B) Buy a single copy of a print book, scan it, and retain the digital copy for a company library (for all sorts of future purposes).

    C) Pirate a book and retain that copy for a company library (for all sorts of future purposes).

    A and B were fair use by summary judgement, meaning this judge thinks it's clear-cut in Anthropic's favor. C will go to trial.

    C could still bankrupt the company depending on how trial goes. They pirated a lot of books.

  • Because books are used to train both commercial and open source language models?

    used to train both commercial

    commercial training is, in this case, stealing people's work for commercial gain

    and open source language models

    so, uh, let us train open-source models on open-source text. There's so much of it that there's no need to steal.

    ?

    I'm not sure why you added a question mark at the end of your statement.

  • Civil cases of copyright infringement are not theft, no matter what the MPAA has trained you to believe.

    But they are copyright infringement, which costs more than theft.

  • Because of the vast amount of data needed, there will be no competitive viable open source solution if half the data is kept in a walled garden.

    This is about open weights vs closed weights.

    They haven't dewalled the garden yet. The copyright infringement part of the case will continue.

  • What, how is this a win? Three authors lost a lawsuit to an AI firm using their works.

    It would harm the A.I. industry if Anthropic loses the next part of the trial on whether they pirated books — from what I’ve read, Anthropic and Meta are suspected of getting a lot off torrent sites and the like.

    It’s possible they all did some piracy in their mad dash to find training material but Amazon and Google have bookstores and Google even has a book text search engine, Google Scholar, and probably everything else already in its data centers. So, not sure why they’d have to resort to piracy.

  • C could still bankrupt the company depending on how trial goes. They pirated a lot of books.

    As a civil matter, the publishing houses are more likely to get the full money if Anthropic stays in business (and does well). So it might be bad, but I'm really skeptical about bankruptcy (and I'm not hearing anyone seriously floating it?)

  • Anakin: “Judge backs AI firm over use of copyrighted books”
    Padme: “But they’ll be held accountable when they reproduce parts of those works or compete with the work they were trained on, right?”
    Anakin: “…”
    Padme: “Right?”

  • Because of the vast amount of data needed, there will be no competitive viable open source solution if half the data is kept in a walled garden.

    This is about open weights vs closed weights.

    I agree that we need open source and need to emancipate ourselves. The main issue I see is: The entire approach doesn't work. I'd like to give the internet as an example. It's meant to be very open, connect everyone and enable them to share information freely. It is set up to be a level playing field... Now look what that leads to. Trillion-dollar mega-corporations, privacy issues everywhere and big data silos. That's what the approach promotes. I agree with the goal. But in my opinion the approach will turn out to lead to less open source and more control by rich companies. And that's not what we want.

    Plus nobody even opens the walled gardens. Last time I looked, Reddit wanted money for data. Other big platforms aren't open either. And there's kind of a small war going on with the scrapers and crawlers and anti-measures. So it's not as if it's open as of now.

  • Pirate everything!

  • If you try to sell "the new adventures of Doctor Strange, Jonathan Strange and Magic Man." existing copyright laws are sufficient and will stop it. Really, training should be regulated by the same laws as reading. If they can get the material through legitimate means it should be fine, but pulling data that is not freely accessible should be theft, as it is already.

    I have a freely accessible document under a CC license that states it is not to be used commercially. This is commercial use. Your policy would allow that document to be used, though, since it is accessible. This kind of policy discourages me from easily sharing my works, as others profit from my efforts and my works are more likely to be attributed to a corporate beast I want nothing to do with than to me.

    I'm all for copyright reform and simpler copyright law, but these companies need to be held to standard copyright rules and not just made up modifications.
    I'm convinced a perfectly decent LLM could be built without violating copyrights.

    I'd also be ok sharing works with a not for profit open source LLM and I think others might as well.

  • I agree that we need open source and need to emancipate ourselves. The main issue I see is: The entire approach doesn't work. I'd like to give the internet as an example. It's meant to be very open, connect everyone and enable them to share information freely. It is set up to be a level playing field... Now look what that leads to. Trillion-dollar mega-corporations, privacy issues everywhere and big data silos. That's what the approach promotes. I agree with the goal. But in my opinion the approach will turn out to lead to less open source and more control by rich companies. And that's not what we want.

    Plus nobody even opens the walled gardens. Last time I looked, Reddit wanted money for data. Other big platforms aren't open either. And there's kind of a small war going on with the scrapers and crawlers and anti-measures. So it's not as if it's open as of now.

    A lot of our laws are indeed obsolete. I think the best solution would be to force copyleft licenses on anything using publicly created data.

    But I'll take the wild west we have now, with no walls, over any kind of copyright dystopia. Reddit did successfully sell its data to Google for 60 million. Right now, you can legally scrape anything you want off Reddit; it is an open garden in every sense of the word (even if they don't like it). It's a lot more legal than using pirated books, but Google still bet 60 million that copyright laws would swing broadly in their favor.

    I think it's very foolhardy to even hint at a pro copyright stance right now. There is a very real chance of AI getting monopolized and this is how they will do it.

  • A lot of our laws are indeed obsolete. I think the best solution would be to force copyleft licenses on anything using publicly created data.

    But I'll take the wild west we have now, with no walls, over any kind of copyright dystopia. Reddit did successfully sell its data to Google for 60 million. Right now, you can legally scrape anything you want off Reddit; it is an open garden in every sense of the word (even if they don't like it). It's a lot more legal than using pirated books, but Google still bet 60 million that copyright laws would swing broadly in their favor.

    I think it's very foolhardy to even hint at a pro copyright stance right now. There is a very real chance of AI getting monopolized and this is how they will do it.

    I agree a copyright dystopia wouldn't be any good. Just mind that wild west or law of the jungle is the "right of the strongest". You're advantaging big companies and disadvantaging smaller players or people with ethics or who are more open/transparent.

    And I don't think legality with web scraping is the biggest issue. Sure, I maybe could do it if it were possible. But I'm occasionally doing some weird stuff and most services have countermeasures in place. In reality I just can't scrape Reddit. Lots of bots and crawlers just don't work any more. I'm getting rate limited left and right from all big platforms. Lots of things require an account these days, and services are quick to ban me for "suspicious activity". It's barely possible to download YouTube videos these days. So, no. I can't. While Google can just pay for it and have the data.

    Also Reddit isn't really the benevolent underdog here. They're a big company as well. And they're not selling their data... They're selling their users' data. They're mainly monetizing other people's creations.

  • If you try to sell "the new adventures of Doctor Strange, Jonathan Strange and Magic Man." existing copyright laws are sufficient and will stop it. Really, training should be regulated by the same laws as reading. If they can get the material through legitimate means it should be fine, but pulling data that is not freely accessible should be theft, as it is already.

    as it is already

    Copies of copyrighted works cannot be regarded as "stolen property" for the purposes of a prosecution under the National Stolen Property Act of 1934.

    https://en.m.wikipedia.org/wiki/Dowling_v._United_States_(1985)

  • used to train both commercial

    commercial training is, in this case, stealing people's work for commercial gain

    and open source language models

    so, uh, let us train open-source models on open-source text. There's so much of it that there's no need to steal.

    ?

    I'm not sure why you added a question mark at the end of your statement.

    I was questioning whether or not you would see that as a benefit. Clearly you don't.

    Are you also against libraries letting people borrow books, since those are also lost sales for the authors, or are you just a Luddite?

  • I'm not sure why you added a question mark at the end of your statement.

    I was questioning whether or not you would see that as a benefit. Clearly you don't.

    Are you also against libraries letting people borrow books, since those are also lost sales for the authors, or are you just a Luddite?

    libraries letting people borrow books

    This is so far from analogous that it's almost a non sequitur.

    are you just a Luddite?

    No, and you don't even believe such nonsense. You're grasping, ineffectively.
