Core copyright violation claim moves ahead in The Intercept's lawsuit against OpenAI (niemanlab.org)
tolmasky 3 days ago [-]
The logical endgame of all this isn’t “stopping LLMs,” it’s Disney happening to own a critical mass of IP such that it can more or less exclusively (and legally) train and run LLMs that make movies, firing all its employees, with no smaller company ever having a chance in hell of competing against a literal century’s worth of IP powering a generative model. This turns the already egregiously generous backward-facing monopoly into a forward-facing monopoly.

None of this was ever the point of copyright. The best part about all this is that Disney initially took off by… making use of public domain works. Copyright used to last 14 years. You’d be able to create derivative works of most of the art in your life at some point. Disney is ironically the proof of how constructive a system that regularly turns works over to the public domain can be. But thanks to lobbying by Disney, now you’re never allowed to create a derivative work of the art in your life.

Copyright is only possible because we the public fund the infrastructure necessary to maintain it. “IP” isn’t self-manifesting like physical items. Me having a cup necessarily means you don’t have it. That’s not how ideas and pictures work: you can infinitely, perfectly duplicate them. Thus we set up laws and courts and police to create a complicated simulation of physical properties for IP. Your tax dollars pay for that. The original deal was that in exchange, those works would enter the public domain to give back to society. We’ve gotten so far from that that people now argue about OpenAI “stealing” from authors, when most of the time the authors don’t even own the works — their employers do! What a sad comedy, where we’ve forgotten we have a stake in this too and instead argue over which corporation should “own” the exclusive ability to cheaply and blazingly fast create future works while everyone else has to do it the hard way.

notahacker 3 days ago [-]
If I thought that nobody had a chance in hell of competing with generative models compiled by Disney from its corpus of lighthearted family movies, I'd be even less keen to give unlimited power to create derivative works out of everything in history to the companies with the greatest amount of computing power, which in this case happens to be a subsidiary of Microsoft.

All property rights depend on the public funding the infrastructure to enforce them. If I believed movies derived from applying generative AI techniques to other movies were the endgame of human creativity, I'd find your endgame, where it's the fiefdom of corporations who sold enough Windows licenses to own billions of dollars' worth of computer hardware, even more dystopian than it being vested in the corporations who originally paid for the movies...

horsawlarway 3 days ago [-]
Two thoughts

1. You are assuming that "greatest computing power" is a requirement. I think we're actually seeing a trend in the opposite direction with recent generative art models: it turns out consumer-grade hardware is "enough" in basically all cases, and renting whatever compute you might otherwise be missing is cheap. I don't buy this as the barrier.

2. Given #1, I think you are framing the conversation in a very duplicitous manner by pitching this as "either Microsoft or Disney - pick your oppressor". I'd suggest that breaking the current fuckery in copyright, and restoring something more sane (like the original 14 + 14 year timespans), would benefit individuals who want to make stories and art far more than it would benefit corporations. Disney is literally THE reason for half of the current extensions in timespan. They don't want reduced copyright - they want to curtail expression in favor of profit. This case just happens to have a convenient opponent for public sentiment.

---

Further - "All property rights depends on public funding the infrastructure to enforce them" Is false. This is only the case for intellectual property rights, where nothing need be removed from one person for the other to be "in violation".

notahacker 3 days ago [-]
I'm assuming greater computing power is a requirement because creating generative feature-length movies (which is a few orders of magnitude more complex than creating PNGs) is something only massive corporations can afford the computing power to do at the moment (and the implied bar for excellence is something we haven't reached). Certainly computing power and dev resources are more of a bottleneck to creating successful AI movies than lack of access to the Disney canon, which was the argument the OP made for why anything other than OpenAI having unlimited rights over everyone's content leads inexorably to a Disney generative AI monopoly. (Another weakness of that argument is that I'm not sure the Disney canon is sufficient training data for Disney to replace their staff with generative movies, never mind necessary for anyone else to ever make a marketable-quality movie again.)

Given #1, I think the OP is framing the conversation in a far more duplicitous manner by assuming that in a lawsuit against AI which doesn't even involve Disney, the only beneficiary of OpenAI not winning will be Disney. Disney extending copyright laws in past decades has nothing to do with a 10-year-old internet company objecting to OpenAI stripping all the copyright information off its recent articles before feeding them into its generative model.

> Further - "All property rights depends on public funding the infrastructure to enforce them" Is false. This is only the case for intellectual property rights, where nothing need be removed from one person for the other to be "in violation".

People who don't respect physical property are just as capable of removing it as people who don't respect intellectual property are capable of copying it. In both cases the thing that prevents them doing so is a legal system and taxpayer funded enforcement against people that don't play by the rules.

catlifeonmars 3 days ago [-]
> All property rights depend on public funding the infrastructure to enforce them

Still true, because people generally depend on the legal system and police departments to enforce physical property rights (both are publicly funded entities).

JambalayaJimbo 2 days ago [-]
All property rights absolutely depend on public infrastructure. The only thing keeping your house in your name is the legal system enforcing your right to it.
marcosdumay 3 days ago [-]
Either copyrights exist, and people can't copy creative works "owned" by somebody else, or copyrights don't exist and people can copy those at will.

"Copyrights exist, and people can copy others works if they have enough computing power to multiplex it with other works and demultiplex to get it back" is not a reasonable position.

I'm all for limiting it to 15 or 20 years, and requiring registration. If you want to completely end copyrights, I'd be OK with that too (but I think it's suboptimal). But "end them only for rich people" isn't acceptable.

reedciccio 1 day ago [-]
> Either copyrights exist, and people can't copy creative works "owned" by somebody else, or copyrights don't exist and people can copy those at will.

That's not how copyright works; it's not a binary thing. It's also similar but not the same in every jurisdiction. You can make partial copies, you can make full copies as personal backups, and you can make copies to transform copyrighted material (like creating art and parodies).

These cases are going to decide whether Google Books was a fluke or whether there is indeed a limit to the power of the big copyright holders (not the artists/creators: those keep on starving, except for a few lucky ones).

dragonwriter 3 days ago [-]
> Either copyrights exist, and people can't copy creative works "owned" by somebody else, or copyrights don't exist and people can copy those at will.

Like most simple binaries, this is a false dichotomy, and not only do more options exist in possibility, but neither of those matches the overt state of the law (where copyrights exist, but so do a range of caveats and exceptions, so people can copy and otherwise make use of works by others without permission under certain circumstances, but not at will, satisfying neither of the two options you present as exhaustive of all possibilities.)

tzkaln 3 days ago [-]
ClosedAI etc. are certainly stealing from open source authors and web site creators, who do own the copyright.

That said, I agree with putting more emphasis on individual creators, even if they have sold the copyright to corporations. I was appalled by the Google settlement with the Authors Guild: why does a guild decide who owns what and who gets compensation?

Both Disney and ClosedAI are in the wrong here. I'm the opposite of a Marxist, but Marx's analysis was frequently right. He used the term "alienation from one's work" in the context of factory workers. Now people are being alienated from their intellectual work, which is stolen, laundered, and then sold back to them.

marcosdumay 3 days ago [-]
I don't think you need to be a Marxist to accept that his observation that people are being alienated from their work capacity is spot on.

The "Marxsist" name is either about believing on the parts that aren't true or about the political philosophy (that honestly, can't stand by its own without the wrong facts). The ones that fit reality only make one a "realist".

ToucanLoucan 3 days ago [-]
I mean, not to be that guy, but multiple Marxist and Marxist-adjacent people I know (and am) have been out here pointing out that this was exactly and always what was going to happen since the LLM hype cycle kicked into high gear in mid-2023. And I was told in no uncertain terms, many times, on here, that I was being a doomer, a pessimist, a luddite, etc., because I and many like me saw the writing on the wall immediately: while generative AI represented a neat thing for folks to play with, it would, like every other emerging tech, quickly become the sole domain of the monied entities that already run the rest of our lives, and this would be bad for basically everyone long term.

And yeah, it looks to be shaping up as exactly that.

trinsic2 3 days ago [-]
Yep. And people support it with "No, it's not going to be like that this time" bullshit.
tzs 3 days ago [-]
> But thanks to lobbying by Disney, now you’re never allowed to create a derivative work of the art in your life

As far as I can tell the only copyright term extension that might have been influenced by Disney lobbying in the US is the Copyright Term Extension Act of 1998, which extended the term from life+50 to life+70 (or from 75 to 95 years for works of corporate authorship).

The switch from fixed terms to life+50 came with the Copyright Act of 1976, which had nothing to do with Disney. They were probably for it, but so was nearly everybody, because it laid the groundwork for the US joining the Berne Convention and made its copyright system much more compatible with that of most other countries.

As far as copyright law outside the US goes, most countries were on life+50 or longer before Disney even existed.

cxr 3 days ago [-]
> Me having a cup necessarily means you don’t have it. That’s not how ideas and pictures work. You can infinitely perfectly duplicate them.

This is a stupid argument, no matter how often it comes up.

If I hire Alice to come to my sandwich shop and make sandwiches for customers all week and then on payday I say, "Welp, no need to pay you—the sandwiches are already made!" then Alice is definitely out something, and I am categorically a piece of shit for trotting out this line of reasoning to try to justify not paying her.

If I do the same thing except I commission Alice to do a drawing for a friend's birthday, then I am no less a piece of shit if I make my own copy once she's shown it to me and try to get out of paying since I'm not using "her" copy.

(Notice that in neither case was the thing produced ever something that Alice was going to have for herself—she was never going to take home 400 sandwiches, nor was she ever interested in a portrait of my friend and his pet rabbit.)

If Alice senses that I'd be interested in the drawing but might not be totally swayed until I see it for myself, so she proactively decides to make the drawing upfront before approaching me, then it doesn't fundamentally change the balance from the previous scenario—she's out no less in that case than if I approached her first and then refused to pay after the fact. (If she was wrong and it turns out I didn't actually want it because she misjudged and will not be able to recoup her investment, fair. But that's not the same as if she didn't misjudge and I come to her with this bankrupt argument of, "You already made the drawing, and what's done is done, and since it's infinitely reproducible, why should I owe you anything?")

Copyright duration is too long. But the fundamental difference between rivalrous possession of physical artifacts and infinitely reproducible ideas really needs to stay the hell out of these debates. It's a tired, empty talking point that doesn't actually address the substance of what IP laws are really about.

kweingar 3 days ago [-]
This isn't really an argument though. It's an assertion that not honoring a commission agreement (or an employment contract) is equivalent to not paying for a license to an existing work. I tend to disagree. I could be persuaded otherwise, but I'd need to hear an argument other than "clearly these are the same thing."
cxr 3 days ago [-]
> This isn't really an argument though. It's an assertion that not honoring a commission agreement

Wrong. It's that (not honoring an agreement negotiated beforehand) and an argument against treating the product of past action as inherently zero-cost and/or zero-value; the fact that a prior agreement is an element in the offered scenarios doesn't negate or neutralize the rest of it (just like the fact that a sandwich shop is an element in one of the scenarios doesn't negate or neutralize the broader reality for non-sandwich-involving scenarios).

And that's before we mention: there _is_ such a prior agreement in the case of modern IP — you can't avoid contending with the fact that if Alice is operating in the United States, which has existing legislation granting her a "temporary monopoly" on her creative output, and she generates the output on the basis that she'll be protected by the law of the land, and you then decide that you just don't agree with the idea of IP, then Alice is getting screwed over by someone not holding up their end of the bargain.

ramblenode 3 days ago [-]
Agree with the sibling: committing fraud by intentionally not honoring a contract is not morally or logically the same as duplicating a piece of media under copyright. That is not to say that copyright violations are harmless (the scale and intent matter), but details can't be ignored.

A material difference between fraud and copyright violation as categories is the presence of lost profit. With fraud, one has lost the time value of their work; with media piracy, there is some research (funded by the EU, of all things) suggesting that it doesn't trade off with sales and may even help some sales.

trinsic2 3 days ago [-]
I'm sorry, the two are not even remotely the same. Saying it over and over again doesn't make it so.
cxr 3 days ago [-]
You wanna, like, actually digest what I wrote there? The second comment here is so unlike the first that your "Saying it over and over again" remark can only lead to the conclusion that you either didn't read it or didn't grok it. They're two different comments about two different things.

> I'm sorry

Are you? I think you mixed up the words "insincere" and "sorry".

account42 1 days ago [-]
If you hire Alice, that means there is a contract you both have agreed to and need to honor. If Alice just shows up in your kitchen making burgers, she doesn't get to tell you what to do with the burgers after you kick her out. With copyright there is no explicit contract you can choose to enter. Instead, everyone is effectively forced into a contract with every creator. A contract that is unconscionably biased to benefit the creator.

Do you think it would be reasonable for Mallory to sell burgers and then demand that if you share some of them with your friend you need to seek her permission? And of course, since the burger becomes part of your body, perhaps Mallory should have a say in what you can do with that too, and can extract some fee for you existing after eating her burgers. That's how copyright is usually (mis)used - to extract rent in perpetuity for work that was done long ago. This kind of business model just doesn't exist outside of IP. It's entirely artificial.

> But the fundamental difference between rivalrous possession of physical artifacts and infinitely reproducible ideas really needs to stay the hell out of these debates. It's a tired, empty talking point that doesn't actually address the substance of what IP laws are really about.

On the contrary, it is a very important point. We don't have burgers just sitting around to feed everyone for their entire lives. We do have all kinds of art and entertainment, as well as productivity tools, that have essentially infinite free copies. We don't really NEED to artificially encourage more creation for a lot of these, whereas if people stopped producing food everyone would be in big trouble.

tolmasky 3 days ago [-]
I'm answering in reverse order because I think a lot of this comment covers stuff we don't really disagree about. Thus I will answer the conclusion first, and then give my responses to everything else, which I find interesting but not required for what I want to convey.

> Copyright duration is too long. But the fundamental difference between rivalrous possession of physical artifacts and infinitely reproducible ideas really needs to stay the hell out of these debates. It's a tired, empty talking point that doesn't actually address the substance of what IP laws are really about.

I would argue that it is perhaps the opposite of tired: ironically, it was less relevant in the past and becomes more relevant as technology advances and mere thought experiments become practical reality. I think many of these issues weren't dealt with in the past because these edge cases existed as mere hypotheticals. Kind of like a mathematician saying that copyright doesn't make any sense because he could write a program that iterates through all books. Lawyers just roll their eyes, not because they have a counter-argument, but because they don't think that scenario exists as something they'd ever have to deal with. I think the idea of a computer that reads all the text in the world and learns from it is definitely tied to these unresolved questions about the nature of data, but would until very recently have been considered an annoying hypothetical in a serious discussion about copyright, allowing us to actually dismiss it and continue not addressing it.

We all agree that copyright is too long. And I also think this would just become a non-issue if we had a reasonable duration for copyrights. Even if you philosophically disagreed with it, it wouldn't be worth arguing over vs. just waiting it out.

> This is a stupid argument, no matter how often it comes up.

I knew bringing this up would rekindle these arguments from 20 years ago, but it was necessary for a later point, so I was hoping I was making it value-neutral enough that it wouldn't trigger this, but I guess I was wrong.

To be clear, I am not making the same argument you have seen several times before. I am making a strictly weaker argument. The only goal of this distinction is to demonstrate that these properties are "different", and that the law aims to make "intellectual property" behave like physical property. Notice, for example, that I didn't then assert that IP thus doesn't exist. I didn't even argue whether this goal of matching the behavior was good or bad. I am simply stating that it doesn't by default behave the way we seem to want it to, and people don't seem to intuitively ascribe the same morality to it either. My only intention is to make the point that this goal thus requires work, and (as I'll explain in more detail below) more work than in the physical case. So far I don't think there is anything necessarily unreasonable about this as a set of premise conditions for establishing the terms under which the public at large agrees to take on the costs of maintaining said system.

> A bunch of stuff about Alice making sandwiches and drawing pictures

Disclaimer: I don't think we're really in disagreement about the important points, and I don't think this section is relevant to the important points I return to below. However, I find it intellectually interesting to talk about, so I have a retort here, which I believe is just an unrelated digression.

This analysis of Alice making sandwiches and drawings (IMO) misses the actual meaningful differences in these scenarios, since it (IMO) focuses on the uncontroversial, but also irrelevant, breach-of-contract issues. In both these scenarios, the issue is not really the "property"; it is the refusal to comply with a previously agreed arrangement. You can see this if we add a third scenario where I pay Alice to do jumping jacks for a week, she does them, and then I refuse to pay at the end of the week. No need to pay you, you already did the jumping jacks! No one "got" anything here, other than I guess "satisfaction" or "exercise". We can make the example even more abstract by having me pay Alice to do nothing all week; she once again does a great job by sitting quietly in her room all week, and then I once again don't pay her. The sandwiches and drawings are just props in the original examples -- they're not actually necessary, since this is a contract question, not a theft question.

The actual interesting aspects around the sandwiches and drawings are 1) what happens long after this transaction, and 2) what happens with third parties. With the sandwiches, "what happens after" is straightforward. I either eat the sandwiches, resell them immediately, or they go bad. There's not much interesting there. No one needs to think hard about the "ramifications" of the sale of the sandwiches. Compare this to the drawing. What if, after I have paid you just like we agreed, I proceed to make my infinite copies? You might think that's not fair; you thought you'd have a repeat customer. I assumed I was free to do as I please with the drawing. In fact, in this instance, if I treat the drawing like physical property, where the expectation is that I can do as I please with it, it ironically creates this conundrum, because "putting the paper in the photocopier" is in the set of "do as I please". But let's go one step further: what if I make all those copies and then sell them?

I'm sure you'll now respond that the royalties or usage rights were all implied in your original story. Great! But that's my point. Those were required. You needed a supremely complex web of laws and binding contracts (and litigation if they aren't followed) as a necessary component of that transaction, due to the existence of degrees of freedom that simply don't exist for the sandwiches. You can write up a contract around the resale of a sandwich, but most sandwich shops don't, because me eating sandwiches for the rest of my life by copying the original sandwich isn't a realistic scenario (so no need to price that into the original cost of the sandwich), and me out-sandwiching you by carbon-cloning the sandwich isn't feasible; even if it were, it would still have material ingredient costs that would bound its effect on my shop. Even "figuring out the recipe" isn't that much of a worry, since you still need to, like, buy ingredients and make sandwiches, as opposed to hitting paste over and over. These scenarios are dramatically different, and that's why sandwich shops usually don't employ lawyers but design shops do. And again, we didn't even go into third parties. What if someone manages to somehow make a copy of your image just as you're handing it to the client? Now both of you are in compliance with your deal, neither of you is angry at the other, but there's this weird situation where you were never expecting to get money from me, yet I have a copy of the picture now, and it's really hard to reason about what that means in terms of "gain" and "loss" if I never do anything other than hang it up in my room. This is simply not possible with the sandwich; no one could quickly "copy the sandwich" in transit and potentially introduce an entirely new threat to your business.

Again, my only point here is that it seems very strange to insist that physical property is identical to intellectual property, and that it isn't fairly complicated to make intellectual property approximate the relationships we have with physical property. And to be clear, nothing derogatory has even been said about this goal yet. You could take everything I've written in this comment so far and use it as part of an argument for copyright. However, it is important precisely because of the explosion of complexity in possibilities that simply don't exist for the vast majority of physical items.

cxr 3 days ago [-]
You're comfortable acknowledging that the relevant principle isn't specific to sandwich shop workers or women named Alice and that what's at issue in the example provided is something more general than either of those two details. You're insisting, though, that it is as specific as breaking a prior agreement and nothing broader, even though that, too, was contrived just the same in order to flesh out the example with detail. That unwillingness is an error.
tolmasky 3 days ago [-]
I'm not merely comfortable acknowledging it, I specifically took the time to demonstrate how the outcome was property-type-independent by explicitly going over what happens when we remove the property from the example but leave everything else unchanged. If you contend that the breaking of the agreement is somehow equally superfluous, then I think it's on you to demonstrate that through a similar analysis. You seem pretty confident this is the case, but I am skeptical given that the story only really had two components: the property in question, and the agreement with regard to the property. They can't both be inconsequential details fleshing out an otherwise empty story, right?

But either way, I think the point that immediately follows that is even more important, right? The fact that the nature of the ownership, even in a "successful" transaction, is incredibly more complicated with the drawing. How we don't even properly understand how much "ownership" you have of the drawing without a contract. How that transaction potentially puts you in direct competition with Alice in the future. Etc., etc. Again, the entirety of my position is the fairly narrow statement that: 1) intellectual property is fundamentally different from physical property; 2) you thus cannot simply model intellectual property transactions by merely pretending you're dealing with physical objects (since there's fundamentally more dimensionality and ambiguity without explicitly outlining and agreeing to far more terms and details); and 3) intellectual property thus naturally requires significant infrastructure in order to create an environment that gets anywhere close to simulating the same "physical-like" properties for intellectual property. I don't think that's controversial.

tlb 3 days ago [-]
I find "this wasn't the point of copyright", referring to the motivations of 18th century legislators, unpersuasive. They were making up rules that were good for the foreseeable future, but they didn't foresee everything and certainly not everyone being connected to a global data network.

Persuasive arguments should focus on what's good for the world today.

tolmasky 3 days ago [-]
I hate to break it to you, but we continued that pattern of making up rules. The main difference is that we let lobbyists play a bigger role over time. The updated rules (life of the author plus 50 years, since extended to 70) were passed in 1976, so I don't think they had global data networks in mind either. But perhaps you believe neither side has presented a persuasive argument.

I will however say that I think my comment was not just an appeal to authority. Again, I think the fact that using public domain works was critical to Disney's early success is a fairly important data point, especially considering some of those works would not have been allowed to be used under the current lifespans (e.g. Pinocchio's copyright would have lasted until 1960, 20 years after the film premiered).

But again, the most important thing I want taken away from this is that we the "consumers" of the content should not consider ourselves bystanders, but understand that we have an active stake here as well. Your first sentence is perhaps more important than you realize: making up the rules wasn't some one-off incidental property of being first to the table. We could choose to make up the rules too, so we should act like it, as opposed to trying to "deduce" the ownership of a sentence. This is unique; we don't have this ability with physical property. We can't simply declare that everyone gets a Ferrari tomorrow and then have them magically appear in everyone's garages. But we could declare that everyone gets the rights to Superman tomorrow, and they would "magically just have them".

There's no real baseline here. We should just weigh the pros and cons. The fashion industry operates more or less copyright-free. The infrastructure to enforce copyright has real costs, not to mention all the collateral damage from the abuse of copyright takedowns this system brings along with it. And any sort of appeal to authorship is also highly suspicious given that authors rarely end up owning these rights. Every time one of these Marvel movies comes out, there's a mini outcry when people see that the guy whose comic the movie is based on is just some dude who gets nothing from the making of the movie. On the flip side, we take for granted that every public domain character was of course at one point created: Robin Hood, Zorro, Dracula, Sherlock Holmes. Are we unhappy with the diversity of adaptations we've gotten from these? Would it be that Earth-shattering if Harry Potter joined that list? As things stand right now, no one on this website will likely ever get to legally publish "their take on Harry Potter". The clock doesn't start ticking until after JK Rowling dies. It would have entered the public domain in 2011 under the original rules. In case you're curious, her net worth in 2011 was $500M, if you want to factor that into whether you think that would have been "fair" (and it's not like she stops making money at that point; it's just that other people start to be able to do stuff with the first book). I think it is worthwhile to imagine a different approach to this.

benreesman 3 days ago [-]
Sounds like it’s supposed to be hopeless to compete already: https://www.zenger.news/2023/06/10/sam-altman-says-its-hopel....
adventured 3 days ago [-]
AI is absolutely a further wealth concentrator by its very nature. It will not liberate the bottom 3/4, and it will not free up their time by allowing them to work a lot less (as so many now incorrectly predict). Eric Schmidt, for example, has some particularly incorrect claims out there right now about how AI will widely liberate people from having to work so many hours; it will prove laughable in hindsight. Those that wield high-end AI, and bear the extreme cost of operations that will go with it, will reap extraordinary wealth over the coming century. Elon Musk-style wealth. Very few will have access to the resources necessary to operate the best AI (the cost will continue to climb beyond what companies like Microsoft, Google, Amazon, OpenAI, etc. are already spending).

Sure, various AI assistants will make more aspects of your life automated. In that sense it'll buy people more time in their private lives. It won't get most people a meaningful increase in wealth, which is the ultimate liberator of time. That is, financial independence.

And you can already see the ratio of people that are highly engaged with the latest LLMs, paying for them, versus those either rarely or never using them (either not caring to use them or not understanding how to do so effectively). It's heavily bifurcated between the elites and everybody else, just as most tech advances have been so far. A decade ago, a typical lower or lower-middle-class person could have gone to the library, learned JavaScript, and over the course of years dramatically increased their earning potential (a process that takes time, to be clear); for the same reason that rarely happens by volition, they also will not utilize LLMs to advance their lives despite their wide availability. AI will end up doing trivial automation tasks for the bottom 50%. For the top ~1/4 it will produce enormous further wealth from equity holdings and business-process productivity gains (boosting wealth from business ownership, which the bottom 50% universally lacks).

lupire 3 days ago [-]
Society-scale unemployment is certainly liberating people from their work, but also from their food.
account42 1 day ago [-]
It really is a sad world where the prospect of there not being enough work to do for humans is a scary one.
theropost 4 days ago [-]
Copyright laws, in many ways, feel outdated and unnecessarily rigid. They often appear to disproportionately favor large corporations without providing equivalent value to society. For example, brands like Disney have leveraged long-running copyrights to generate billions, or even tens of billions, of dollars through enforcement over extended periods. This approach feels excessive and unsustainable.

The reliance on media saturation and marketing creates a perception that certain works are inherently more valuable than others, despite new creative works constantly being developed. While I agree that companies should have the right to profit from their investments, such as a $500 million movie, there should be reasonable limits. Once they recoup their costs, including a reasonable profit multiplier, the copyright could be considered fulfilled and should expire.

Holding onto copyrights indefinitely or for excessively long periods serves primarily to sustain a system that benefits lawyers and enforcement agencies, rather than providing meaningful value to society. For instance, enforcing a copyright from the 1940s for a multinational corporation that already generates billions makes little sense.

There should be a balanced framework. If I invest significant time and effort—say 100 hours—into creating a work, I should be entitled to earn a reasonable return, perhaps 10 times the effort I put in. However, after that point, the copyright should no longer apply. Current laws have spiraled out of control, failing to strike a balance between protecting creators and fostering innovation. Reform is long overdue.

pclmulqdq 3 days ago [-]
I am personally in favor of strong, short copyrights (and patents). 90+ year copyrights are just absurd. Most movies make almost all their money in the first 10 years anyway, and a strong 10- or 20-year copyright would keep the economics of movie and music production largely the same.
voxic11 4 hours ago [-]
I think the biggest change would be that if characters and stories lost their copyright after 10 years, you would see even more sequels and remakes, because anyone could make their own Star Wars sequel series, for example.
3pt14159 4 days ago [-]
Is there a way to figure out if OpenAI ingested my blog? If the settlements are $2500 per article, then I'll take a free used car's worth of payments if it's available.
jazzyjackson 4 days ago [-]
I suppose the cost of legal representation would cancel it out. I can just imagine a class action where anyone who posted on blogger.com between 2002 and 2012 eventually gets a check for 28 dollars.

If I were more optimistic I could imagine a UBI funded by lawsuits against AGI, some combination of lost wages and intellectual property infringement. Since we can't figure out exactly how much more impact an article on The Intercept had on shifting the weights than your Hacker News comments did, we might as well just pay everyone equally, since we're all equally screwed.

SahAssar 4 days ago [-]
If you posted on blogger.com (or any platform with enough money to hire lawyers) you probably gave them a license that is irrevocable, non-exclusive and able to be sublicensed.

There are reasons for that (they need a license to show it on the platform) but usually these agreements are overly broad because everyone except the user is covering their ass too much.

Those licenses will now be used to sell that content/data for purposes that nobody thought about when you started your account.

dwattttt 4 days ago [-]
Wouldn't the point of the class action be to dilute the cost of representation? If the damages per article are high and there are plenty of class members, I imagine the limit would be how much OpenAI is able to pay out.
Brajeshwar 4 days ago [-]
There was a Washington Post article that did something on this (but not for OpenAI). Check if your website is there at https://www.washingtonpost.com/technology/interactive/2023/a...

There should be a way to check for OpenAI. But my guess is that if Google does it, OpenAI and others must be using the same or a similar resource pool.

My website has some 56K tokens in there, and I have no clue what that means, but something is there: https://www.dropbox.com/scl/fi/2tq4mg16jup2qyk3os6ox/brajesh...
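
As a rough sanity check on a number like that, here's a minimal sketch using OpenAI's tiktoken library, assuming you've already extracted your site's text to a plain-text file (the filename is hypothetical):

    # Count tokens in a text dump, roughly the way GPT-style models tokenize it.
    import tiktoken

    def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
        enc = tiktoken.get_encoding(encoding_name)
        return len(enc.encode(text))

    with open("site_text_dump.txt") as f:  # hypothetical dump of the site's text
        print(count_tokens(f.read()))

Caveat: whatever dataset that analysis used has its own tokenizer, so this only gives a ballpark, not the exact figure.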

quarterdime 4 days ago [-]
Interesting. Two key quotes:

> It is unclear if the Intercept ruling will embolden other publications to consider DMCA litigation; few publications have followed in their footsteps so far. As time goes on, there is concern that new suits against OpenAI would be vulnerable to statute of limitations restrictions, particularly if news publishers want to cite the training data sets underlying ChatGPT. But the ruling is one signal that Loevy & Loevy is narrowing in on a specific DMCA claim that can actually stand up in court.

> Like The Intercept, Raw Story and AlterNet are asking for $2,500 in damages for each instance that OpenAI allegedly removed DMCA-protected information in its training data sets. If damages are calculated based on each individual article allegedly used to train ChatGPT, it could quickly balloon to tens of thousands of violations.

Tens of thousands of violations at $2500 each would amount to tens of millions of dollars in damages. I am not familiar with this field; does anyone have a sense of how the total cost of retraining (without these alleged DMCA violations) compares to these damages?
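
For concreteness, the arithmetic (the article counts below are made up; only the $2,500 per-violation figure comes from the article):

    # Back-of-the-envelope statutory damages at $2,500 per violation.
    PER_VIOLATION = 2_500
    for articles in (10_000, 25_000, 50_000):
        print(f"{articles:>6,} articles -> ${articles * PER_VIOLATION:,}")
    #  10,000 articles -> $25,000,000
    #  25,000 articles -> $62,500,000
    #  50,000 articles -> $125,000,000

Whether sums in that range rival the cost of a full retraining run is exactly the open question.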

Xelynega 4 days ago [-]
If you're going to retrain your model because of this ruling, wouldn't it make sense to remove all DMCA-protected content from your training data instead of just the content you were most recently sued over (especially if the ruling sets precedent)?
sandworm101 4 days ago [-]
But all content is DMCA protected. Avoiding copyrighted content means not having content at all, since all material is automatically copyrighted. One would be limited to licensed content, which is another minefield.

The apparent loophole is between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.

Xelynega 4 days ago [-]
Yes, that's how every other industry that redistributes content works.

You have to license content you want to use; you can't just use it for free because it's on the internet.

Netflix doesn't just start hosting shows and hope they don't get a copyright suit...

YetAnotherNick 2 days ago [-]
In almost all cases before gen AI, scraping was found to be legal unless the bot accepted terms of service, in which case the bot was bound by the ToS. The biggest and clearest example is [1]. People have been scraping the internet for as long as the internet has existed.

[1]: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

account42 1 days ago [-]
Before gen AI, scraping mostly wasn't about copyrightable data but about finding facts. Scraping doesn't magically make copyright infringement legal.
noitpmeder 3 days ago [-]
It's insane to me that people don't agree that you need a license to train your proprietary, for-profit model on someone else's work.
jsheard 4 days ago [-]
It would make sense from a legal standpoint, but I don't think they could do that without massively regressing their models' performance, to the point that it would jeopardize their viability as a company.
Xelynega 4 days ago [-]
I agree, just want to make sure "they can't stop doing illegal things or they wouldn't be a success" is said out loud instead of left to subtext.
IshKebab 4 days ago [-]
It's not definitely illegal yet.
yyuugg 3 days ago [-]
It's also definitely not not illegal either. Case law is very much TBD.
CuriouslyC 4 days ago [-]
[flagged]
Xelynega 4 days ago [-]
If we really want to be technical, in common law systems anything is legal as long as the highest court to challenge it decides it's legal.

I guess I should have used the phrase "common sense stealing in any other context" to be more precise?

krisoft 4 days ago [-]
> I guess I should have used the phrase "common sense stealing in any other context" to be more precise?

Clearly not common sense stealing. The Intercept was not deprived of their content. If OpenAI had sneaked into their office and server farm and taken all the hard drives and paper copies containing the content, that would be "common sense stealing".

TheOtherHobbes 4 days ago [-]
Very much common sense copyright violation though.

Copyright means you're not allowed to copy something without permission.

It's that simple. There is no "Yes but you still have your book" argument, because copyright is a claim on commercial value, not a claim on instantiation.

There's some minimal wiggle room for fair use, but clearly making an electronic copy and creating a condensed electronic version of the content - no matter how abstracted - and using it for profit is not fair use.

chii 4 days ago [-]
> Copyright means you're not allowed to copy something without permission.

but is training an AI copying? And if so, why isn't someone learning from said work considered copying in their brain?

throw646577 3 days ago [-]
> but is training an AI copying?

If the AI produces chunks of training set nearly verbatim when prompted, it looks like copying.

> And if so, why isn't someone learning from said work not considered copying in their brain?

Well, their brain, while learning, is not someone's published work product, for one thing. This should be obvious.

But their brain can violate copyright by producing work as the output of that learning, and be guilty of plagiarism, etc. If I memorise a passage of your copyrighted book when I am a child, and then write it in my book when I am an adult, I've infringed.

The fact that most jurisdictions don't consider the work of an AI to be copyrightable does not mean it cannot ever be infringing.

CuriouslyC 3 days ago [-]
The output of a model can be a copyright violation. In fact, even if the model was never trained on copyrighted content, if I provided copyrighted text and then told the model to regurgitate it verbatim, that would be a violation.

That does not make the model itself a copyright violation.

trinsic2 3 days ago [-]
Yeah, good point. What's the difference between spidering content and training a model? It's almost like accessing pages of content the way a search engine does... if the information is publicly available?
pera 3 days ago [-]
A product from a company is not a person. An LLM is not a brain.

If you transcode a CD to mp3 and build a business around selling those files without the author's permission, you'd be in serious legal trouble.

Tech products that "accidentally" reproduce materials without the owners' permission (e.g. someone uploading La La Land into YouTube) have processes to remove them by simply filling a form. Can you do that with ChatGPT?

lelanthran 3 days ago [-]
Because the law considers scale.

It's legal for you to possess a single joint. It's not legal for you to possess a warehouse of 400 tons of weed.

The line between legal and not legal is sometimes based on scale; being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it.

krisoft 3 days ago [-]
Are you describing what the law is or what you feel the law should be? Because those things are not always the same.
lelanthran 3 days ago [-]
> Are you describing what the law is or what you feel the law should be?

I am stating what is, right now.

I thought the weed example made that clear.

Let me clarify: the state of things, as they stand, is that the entire justice system, legislation and courts included, takes scale into account when looking at the line dividing "legal" from "illegal".

There is literally no defense of "If it is legal at qty x1, it is legal at any qty".

krisoft 3 days ago [-]
> I am stating what is, right now.

Excellent. Then the next question is: where (in which jurisdiction) are you describing the law? And what are your sources? Not about the weed, I don't care about that. Particularly the "being able to ingest a single book and learn from it is not the same scale as ingesting the entire published works of mankind and learning from it".

The reason why I'm asking is that you are drawing a parallel between criminal law and (I guess?) copyright infringement. The drug possession limits in many jurisdictions are explicitly written into the law. These are not some grand principle of law but the result of explicit legislative intent. The people writing the law wanted to punish drug peddlers without punishing end users. (Or they wanted to punish them less severely or differently.) Are the copyright limits you are thinking about similarly written down? Do you have case references one can read?

lelanthran 3 days ago [-]
I made it clear in both my responses that scale matters, and that there is precedent in law, in almost all countries I can think of right now, for scale mattering.

I did not make the point that there is a written law specifically for copyright violations at scale (although many jurisdictions do have exemptions at small scale written into law).

I will try to clarify once again: there is no defence in law that because something is allowed at qty X1, it must be allowed at any qty.

This is the defence originally posted that I replied to; it is not valid, because courts regularly consider the scale of an activity when determining the line between allowed and not allowed.

nkrisc 4 days ago [-]
Because AI isn’t a person.
hiatus 4 days ago [-]
Is training an AI the same as a person learning something? You haven't shown that to be the case.
chii 3 days ago [-]
No, I haven't, but judging by the name (machine learning) I think it is the case.
yyuugg 3 days ago [-]
Do you think starfish and jellyfish are fish? Judging by the name they are...
criddell 4 days ago [-]
That might be the point. If your business model is built on reselling something you’ve built on stuff you’ve taken without payment or permission, maybe the business isn’t viable.
asdff 4 days ago [-]
I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore, and it happened to have protected content in it from before the ruling”. Then you’ve essentially won all of humanity’s output, since you can already scrape the new primary information (scientific articles and other datasets designed for researchers to freely access), and whatever junk the content mills output is just going to be poor summarizations of that primary information.

Other factors help this effort of an old model plus new public-facing data being complete: other forms of media like storytelling and music have already converged onto certain prevailing patterns. For stories, we expect a certain style of plot development and complain when it’s missing or not as we expect. For music, most anything being listened to is lyrics no one is deeply reading into, put over the same old chord progressions we’ve always had. For art, too few of us actually go out of our way to get familiar with novel art, versus the vast bulk of the world’s present-day artistic effort, which goes toward product advertisement, which once again follows certain patterns people have been publishing in psychology journals for decades now.

In a sense we’ve already put out enough data and made enough of our world formulaic that I believe we’ve set up a perfect singularity in terms of what can be generated for the average person who looks at a screen today. And because of that, I think even a complete lack of new training on such content wouldn’t hurt OpenAI at all.

andyjohnson0 4 days ago [-]
> I wonder if they can say something like “we aren’t scraping your protected content, we are merely scraping this old model we don’t maintain anymore and it happened to have protected content in it from before the ruling”

I'm not a lawyer, but I know enough to be pretty confident that that wouldn't work. The law is about intent. Coming up with "one weird trick" to work around a potential court ruling is unlikely to impress a judge.

trinsic2 3 days ago [-]
I'm not quite familiar with the Google Books project, but isn't this similar? I'm pretty sure Google got away with scanning copyrighted books in 2015 [0].

[0]: https://www.reuters.com/article/technology/google-book-scann...

zozbot234 4 days ago [-]
They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good faith intent.
CaptainFever 4 days ago [-]
The latter one is possible with RAG solutions like ChatGPT Search, which do already provide sources! :)

But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed to, IMO. (Attribution: Humanity)

But who knows. Maybe it can be done for more fact-like stuff.
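
For the RAG case, here is a minimal sketch of what per-chunk attribution could look like; the store and the word-overlap "retrieval" are toys invented for illustration, not any real library's API:

    # Toy retrieval that keeps source metadata alongside each chunk,
    # so the generated answer can cite where its text came from.
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        text: str
        source: str  # e.g. a URL or citation preserved at ingestion time

    STORE = [
        Chunk("Kant argued that 'ought implies can'.", "Kant (1781)"),
        Chunk("The court allowed the DMCA claim to move forward.", "niemanlab.org"),
    ]

    def answer(query: str) -> str:
        # Naive retrieval: pick the chunk sharing the most words with the query.
        words = set(query.lower().split())
        hit = max(STORE, key=lambda c: len(words & set(c.text.lower().split())))
        return f"{hit.text} [source: {hit.source}]"

    print(answer("what did the court allow?"))
    # -> The court allowed the DMCA claim to move forward. [source: niemanlab.org]

The hard part is exactly what's noted above: attribution at the training level, rather than at retrieval time, has no single per-output source to point at.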

noitpmeder 3 days ago [-]
On this point, I'm sure there is more than enough publicly and freely usable content to "learn how language works". There is no need to hoover up private or license-unclear content if that is your goal.
CaptainFever 3 days ago [-]
I would actually love it if that were true. It would reduce a lot of legal headaches for sure. But if that were true, why were previous GPT versions not as good at understanding language? I can only conclude that it's not actually true. There's not enough digital public domain material to train an LLM to understand language competently.

Perhaps old texts in physical form, then? It'll cost a lot to digitize them, won't it? And it wouldn't really be accessible to AI hobbyists. Unless the digitization is publicly funded or something.

(A big part of this is also how insanely long copyright lasts (nearly a hundred years!), which keeps most of the Internet's material from being public domain in the first place, but I won't belabour that point here.)

Edit:

Fair enough, I can see your point. "Surely it is cheaper to digitize old texts or buy a license to Google Books than to potentially lose a court case? Either OpenAI really likes risking it to save a bit of money, or they really wanted facts not contained in old texts."

And yeah, I guess that's true. I could say "but facts aren't copyrightable" (which was supported by the judge's decision from the TFA), but then that's a different debate about whether or not people should be able to own facts. Which does have some inroads (e.g. a right against being summarized because it removes the reason to read original news articles).

TeMPOraL 4 days ago [-]
> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.

All of that and more, all at the same time.

Attribution at the inference level is bound to work more or less the same way humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.

RAG obviously can work for this, as well as other solutions involving retrieving, finding or confirming sources. That's just like when a human actually looks up the source when citing something - and has similar caveats and costs.

CaptainFever 4 days ago [-]
That sounds about right. When I ask ChatGPT about "ought implies can" for example, it cites Kant.
TeMPOraL 4 days ago [-]
Only half-serious, but: I wonder if they can dance with the publishers around this issue long enough for most of the contested text to become part of public court records, and then claim they're now training off that. <trollface>
jprete 4 days ago [-]
Being part of a public court record doesn't seem like something that would invalidate copyright.
A4ET8a8uTh0 4 days ago [-]
Re-training can be done, but (and it is not a small but) models already exist and can be used locally, suggesting that the milk has been spilled for too long at this point. Separately, neutering them effectively lowers their value relative to their non-neutered counterparts.
ashoeafoot 4 days ago [-]
What about copyright bombing? You could always smuggle DMCA-protected content into training sets hoping for a payout.
Xelynega 4 days ago [-]
The onus is on the person collecting massive amounts of data and circumventing DMCA protections to ensure they're not doing anything illegal.

"well someone snuck in some DMCA content" when sharing family photos and doesn't suddenly make it legal to share that DMCA protected content with your photos...

logicchains 4 days ago [-]
Eventually we're going to have embodied models capable of live learning and it'll be extremely apparent how absurd the ideas of the copyright extremists are. Because in their world, it'd be illegal for an intelligent robot to watch TV, read a book or browse the internet like a human can, because it could remember what it saw and potentially regurgitate it in future.
CuriouslyC 4 days ago [-]
You have to understand, the media companies don't give a shit about the logic, in fact I'm sure a lot of the people pushing the litigation probably see the absurdity of it. This is a business turf war, the stated litigation is whatever excuse they can find to try and go on the offensive against someone they see as a potential threat. The pro copyright group (big media) sees the writing on the wall, that they're about to get dunked on by big tech, and they're thrashing and screaming because $$$.
Karliss 4 days ago [-]
If humanity ever gets to the point where intelligent robots are capable of watching TV like a human can, having to adjust copyright laws seems like the least of our problems. How about having to adjust almost every law related to basic "human" rights, ownership, being able to establish a contract, being responsible for crimes, and endless other things?

But for now your washing machine cannot own other things, and you owning a washing machine isn't considered slavery.

JoshTriplett 4 days ago [-]
> copyright extremists

It's not copyright "extremism" to expect a level playing field. As long as humans have to adhere to copyright, so should AI companies. If you want to abolish copyright, by all means do, but don't give AI a special exemption.

CuriouslyC 4 days ago [-]
It's actually the opposite of what you're saying. I can 100% legally do all the things that they're suing OpenAI for. Their whole argument is that the rules should be different when a machine does it than when a human does.
JoshTriplett 4 days ago [-]
Only because it would be unconscionable to apply copyright to actual human brains, so we don't. But, for instance, you absolutely can commit copyright violation by reading something and then writing something very similar, which is one reason why reverse engineering commonly uses clean-room techniques. AI training is in no way a clean room.
nhinck3 4 days ago [-]
You literally can't
p_l 4 days ago [-]
You literally can.

Your ability to regurgitate a remembered, copyrighted article does not make your brain a derivative work, because removing that specific article from the training set is below the noise floor of impact.

However reproducing the copyrighted material based on that is a violation because the created reproduction does critically depend on that copyrighted material.

(Gross simplification) Similar to how you can watch & read a lot of Star Wars and then even ape Ralph McQuarrie's style in your own drawings - unless the result is unmistakably related to Star Wars, there's no copyright infringement. But there is if someone looks at the result and goes "that's Star Wars, isn't it?"

nhinck3 4 days ago [-]
Can you regurgitate billions of pieces of information to hundreds of thousands of other people in a way that competes with the source of that information?
CuriouslyC 3 days ago [-]
If there was only one source for a piece of news ever, you might be able to make that argument in good faith, but when there are 20 outlets with competing versions of the same story it doesn't hold.
YetAnotherNick 2 days ago [-]
It's called the internet. It can regurgitate billions of pieces of information to billions of people every day.
IAmGraydon 4 days ago [-]
Except LLMs are in no way violating copyright in the true sense of the word. They aren’t spitting out a copy of what they ingested.
JoshTriplett 4 days ago [-]
Go make a movie using the same plot as a Disney movie, that doesn't copy any of the text or images of the original, and see how far "not spitting out a copy" gets you in court.

AI's approach to copyright is very much "rules for thee but not for me".

rcxdude 4 days ago [-]
That might get you pretty far in court, actually. You'd have to be pretty close in terms of the sequence of events, character names, etc. Especially considering how many Disney movies are based on pre-existing stories, if you were, to, say, make a movie featuring talking animals that more or less followed the plot of Hamlet, you would have a decent chance of prevailing in court, given the resources to fight their army of lawyers.
bdangubic 4 days ago [-]
100% agree. But now the million-dollar question: how would you deal with AI when it comes to copyright? What rules could we possibly put in place?
JoshTriplett 4 days ago [-]
The same rules we already have: follow the license of whatever you use. If something doesn't have a license, don't use it. And if someone says "but we can't build AI that way!", too bad, go fix it for everyone first.
slyall 4 days ago [-]
You have a lot of opinions on AI for somebody who has only read stuff in the public domain
noitpmeder 3 days ago [-]
Most information about AI is in the public domain...?
slyall 3 days ago [-]
I mean "public domain" in the copyright context, not the "trade secret" context.
luqtas 4 days ago [-]
problem is when a company profits over their scrape... this isn't a non-profit run by volunteers & it's a far cry from autonomous robots learning their way by themselves

we are discussing an emergent cause that has social & ecological consequences. servers are power-hungry stuff that may or may not run on a sustainable grid (which itself has a bazinga of problems, like heavy chemicals leaking during solar panel production, hydro-electric plants destroying their surroundings etc.) & the current state of hardware production, be it sweatshops or conflict minerals. and that's before the violation of creators' copyright, which is written into the law code of almost every existing country - and no artist is making billions out of the abuse of their creation rights (often they are pretty chill about getting their stuff mentioned, remixed and whatever)

openrisk 4 days ago [-]
Leaving aside the hypothetical "live learning AGI" of the future (given that money is made or lost now), would a human regurgitating content that is not theirs - but presented as if it is - be acceptable to you?
CuriouslyC 4 days ago [-]
I don't know about you but my friends don't tell me that Joe Schmoe of Reuters published a report that said XYZ copyright XXXX. They say "XYZ happened."
openrisk 4 days ago [-]
I have a friend who recites amazingly long pieces of literature by heart all day. He says he just wrote them. He also produces a vast number of paintings in all styles, claiming he is a really talented painter.
account42 1 days ago [-]
Facts are not copyrightable in the first place.
noitpmeder 3 days ago [-]
So when everyone in the world starts going to your friend instead of paying Reuters, what happens then?
CuriouslyC 3 days ago [-]
Reuters finds a new business model? What did horse and buggy drivers do, pivot to romance themed city tours? I'm sure media companies will figure something out.
openrisk 3 days ago [-]
So who will produce the news for your friend to steal, and why? The horse-and-buggy metaphor gets tiresome when it's used as some sort of signalling of "progress-oriented minds" and creative-destruction enthusiasts versus the luddites.
CuriouslyC 3 days ago [-]
Someone who realizes that the raw information has no value in the age we're entering. Influencers have shown the way, brand and community engagement are the new differentiators.
openrisk 3 days ago [-]
This makes no sense. How can you source objective facts about the world while further devaluing and undermining those whose job it is to do it?

Influencers are parasites that have been made possible by broken, user-hostile platforms.

You are advocating for a deranged, dangerous world, where demagogues rule over large masses of idiots that can't tell the difference between AI junk and reality.

tokioyoyo 4 days ago [-]
The problem is, we can't come up with a solution where both parties are happy, because in the end consumers choose one (getting information from news agencies) or the other (getting information from ChatGPT). So both are fighting for their lives.
IAmGraydon 4 days ago [-]
Exactly. Also core to the copyright extremists’ delusional train of thought: they don’t seem to understand (or admit) that ingesting, creating a model, and then outputting based on that model is exactly what people do when they observe others’ works and are inspired to create.
account42 1 days ago [-]
And ripping a blu-ray to my hard drive and then giving the hard drive to my friend so he can output the movie is just the same as telling him my recollections of the movie from my brain. Both are claims you can make without anything to back them up.
jazzyjackson 4 days ago [-]
[flagged]
hydrolox 4 days ago [-]
I understand that regulations exist and how there can be copyright violations, but if OpenAI is significantly set back, shouldn't we be concerned that other, more lenient governments (mainly China) who are opposed to the US will use this to get ahead?
fny 4 days ago [-]
No. OpenAI is valued at over $150B. They can absolutely afford to pay people for data.

Edit: People commenting need to understand that $150B is the discounted value of future revenues. So... yes they can pay out... yes they will be worth less... and yes that's fair to the people who created the information.

I can't believe there are so many apologists on HN for what amounts to vacuuming up people's data for financial gain.

jsheard 4 days ago [-]
The OpenAI that is assumed to keep being able to harvest every form of IP without compensation is valued at $150B, an OpenAI that has to pay for data would be worth significantly less. They're currently not even expecting to turn a profit until 2029, and that's without paying for data.

https://finance.yahoo.com/news/report-reveals-openais-44-bil...

suby 4 days ago [-]
OpenAI is not profitable, and to achieve what they have achieved they had to scrape basically the entire internet. I don't have a hard time believing that OpenAI could not exist if they had to respect copyright.

https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...

noitpmeder 3 days ago [-]
That's a good thing! If a company cannot rise to prominence without violating laws, it should not exist.

There is plenty of public domain text that could have taught a LLM English.

suby 3 days ago [-]
I'm not convinced that the economic harm to content creators is greater than the productivity gains and accessibility of knowledge for users (relative to how competent it would be if trained just on public domain text). Personally, I derive immense value from ChatGPT / Claude. It's borderline life changing for me.

As time goes on, I imagine that it'll increasingly be the case that these LLM's will displace people out of their jobs / careers. I don't know whether the harm done will be greater than the benefit to society. I'm sure the answer will depend on who it is that you ask.

> That's a good thing! If a company cannot rise to prominence without violating laws, it should not exist.

Obviously given what I wrote above, I'd consider it a bad thing if LLM tech severely regressed due to copyright law. Laws are not inherently good or bad. I think you can make a good argument that this tech will be a net negative for society, but I don't think it's valid to do so just on the basis that it is breaking the law as it is today.

DrillShopper 3 days ago [-]
> I'm not convinced that the economic harm to content creators is greater than the productivity gains and accessibility of knowledge for users (relative to how competent it would be if trained just on public domain text).

Good thing whether or not something is a copyright violation doesn't depend on if you can make more money with someone else's work than they can.

suby 3 days ago [-]
I understand the anger about large tech companies using others' work without compensation, especially when both they and their users benefit financially. But this goes beyond economics. LLM tech could accelerate advances in medicine and technology. I strongly believe that we're going to see societal benefits in education and healthcare, especially mental health support, thanks to this tech.

I also think that someone making money off LLM's is a separate question from whether or not the original creator has been harmed. I think many creators are going to benefit from better tools, and we'll likely see new forms of creation become viable.

We already recognize that certain uses of intellectual property should be permitted for society's benefit. We have the fair use doctrine, compulsory patent licensing for public health, research exemptions, and public libraries. Transformative use is also permitted, and LLMs are inherently transformative. Look at the volume of data they ingest compared to the final size of a trained model, and how fundamentally different the output format is from the input data.

Human progress has always built upon existing knowledge. Consider how Darwin and Wallace independently developed the theory of evolution at roughly the same time -- not in isolation, but by building on the intellectual foundation of their era. Everything in human culture builds on what came before.

That all being said, I'm also sure that this tech is going to negatively impact people. Like I said in the other reply, whether this tech is good or bad will depend on who you ask. I just think that we should weigh these costs against the potential benefits to society as a whole, rather than simply preserving existing systems or blindly following the law as if the law were inherently just or good. Copyright law was made before this tech was even imagined, and it seems fair to now evaluate whether the current copyright regime makes sense if it turns out it would keep us in some local maximum.

YetAnotherNick 2 days ago [-]
> unless they violate laws

*unless they violate their country's laws.

Which means OpenAI or an alternative could survive in China but not in the US. The question is whether we are fine with that.

jpalawaga 4 days ago [-]
Technically, OpenAI has respected copyright, except in the (few) instances where it produces non-fair-use amounts of copyrighted material.

The DMCA does not cover scraping.

mrweasel 4 days ago [-]
That's not real money, though. You need actual cash on hand to pay for stuff, and OpenAI only has the money it's been given by investors. I suspect that many of the investors wouldn't have been so keen if they knew that OpenAI would need an additional couple of billion a year to pay for data.
__loam 3 days ago [-]
That's too bad that your business isn't viable without the largest single violation of copyright of all time.
nickpsecurity 4 days ago [-]
That doesn’t mean they have $150B to hand over. What you can cite is the $10 billion they got from Microsoft.

I’m sure they could use a chunk of that to buy competitive I.P. for both companies to use for training. They can also pay experts to create it. They could even sell that to others for use in smaller models to finance creating or buying even more I.P. for their models.

wvenable 4 days ago [-]
[flagged]
CJefferson 4 days ago [-]
We can, and do, choose to treat normal people differently from billion-dollar companies that are attempting to suck up all human output and turn it into their own personal profit.

If they were, say, a charity doing this for the good of mankind, I’d have more sympathy. Shame they never were.

tolmasky 3 days ago [-]
The way to treat them differently is not by making them share profits with another corporation. The logical endgame of all this isn’t “stopping LLMs,” it’s Disney happening to own a critical mass of IP to be able to legally train and run LLMs that make movies, firing all their employees, and no smaller company ever having a chance in hell of competing with a literal century’s worth of IP powering a generative model.

The best part about all this is that Disney initially took off by… making use of public domain works. Copyright used to last 14 years. You’d be able to create derivative works of most of the art in your life at some point. Now you’re never allowed to. And more often than not, the monopoly is granted not to the “author” but to the corporation that hired them. The correct analysis shouldn’t be OpenAI vs. the Intercept or Disney or whomever. You’re just choosing kings at that point.

IsTom 4 days ago [-]
> produced "a unique" song?

People do get sued for making songs that are too similar to previously made songs. One defence available is that they've never heard it themselves before.

If you want to treat AI like humans then if AI output is similar enough to copyrighted material it should get sued. Then you try to prove that it didn't ingest the original version somehow.

noitpmeder 3 days ago [-]
The fact that these lawsuits aren't as simple as "is my copyrighted work in your training set, yes or no" is mind-boggling.
__loam 3 days ago [-]
I feel like at some point the people in favor of this are going to realize that whether the data was ingested into a training set is completely immaterial: these companies downloaded data they had no license to use onto a company server somewhere, with the intention of using it commercially.
GeoAtreides 4 days ago [-]
Ah yes, humans and LLMs are exactly the same, learning the same way, reasoning the same way, they're practically indistinguishable. So that's why it makes sense to equate humans reading books with computer programs ingesting and processing the equivalent of billions of books in literal days or months.
Timwi 4 days ago [-]
While I agree with your sentiment in general, this thread is about the legal situation and your argument is unfortunately not a legal one.
anileated 3 days ago [-]
“A person is fundamentally different from an LLM” does not need a legal argument and is implied by the fact that LLMs do not have human rights, or even anything comparable to animal rights.

A legal argument would be needed to argue the other way. This argument would imply granting LLMs some degree of human rights, which the very industry profiting from these copyright violations will never let happen for obvious reasons.

notahacker 3 days ago [-]
The other problem with the legal argument that it's "just like a person learning" is that corporations whose human employees have learned what copyrighted characters look like and then start incorporating them into their art are considered guilty of copyright violation, and don't get to deploy the "it's not an intentional copyright violation from someone who should have known better, it's just a tool outputting what the user requested" defence...
anileated 3 days ago [-]
Exactly.

Also, it is only a matter of time until one of those employees blows the whistle (thanks to free will and agency); it doesn't scale; etc.

Frankly, the fact that such a big segment of HN crowd unthinkingly buys big tech’s double standard (LLMs are human when copyright is concerned, but not human in every other sense) makes me ashamed of the industry.

mongol 4 days ago [-]
The process of reading it into their training data is a way of copying it. It exists somewhere and they need to copy it in order to ingest it.
wvenable 4 days ago [-]
By that logic you're violating copyright by using a web browser.
Suppafly 4 days ago [-]
>By that logic you're violating copyright by using a web browser.

You would be except for the fact that publishing stuff on the web gives people an implicit license to download it for the purposes of viewing it.

Timwi 4 days ago [-]
Not sure about US or other jurisdictions, but that's not how any of this works in Germany. In Germany downloading anything from anywhere (even a movie) is never illegal and does not require a license. What's illegal is publishing/disseminating copyrighted content without authorization. BitTorrenting a movie is illegal because you're distributing it to other torrenters. Streaming a movie on your website is illegal because it's public. You can be held liable for using a photo from the web to illustrate your eBay auction, not because you downloaded it but because you republished it.

OpenAI (and Google and everyone else) is creating a publicly-accessible system that produces output that could be derived from copyrighted material.

Suppafly 1 days ago [-]
I think it works like that in Canada and some other places too, because they pay an extra tax on storage media when they buy it, which essentially authorizes a license for any copyrighted material that might be stored on that media.
Tomte 3 days ago [-]
> In Germany […]

That‘s confidently and completely wrong.

wvenable 3 days ago [-]
I'm only allowed to view it? I can't download it, convert each word into a color, and create a weird piece of art work out of it? I think I can.
Suppafly 1 days ago [-]
>convert each word into a color, and create a weird piece of art work out of it? I think I can.

I agree, but the original author might get butthurt if you distribute it. Realistically copyright law in the US is a mess when it comes to weird pieces of art.

__loam 3 days ago [-]
The nature of the copy does actually matter.
DrillShopper 3 days ago [-]
> You read books and now you have a job? Pay up.

It is disingenuous to imply the scale of someone buying books and reading them (for which the publisher and author are compensated) or borrowing them from the library and reading them (again, for which the publisher and author are compensated) is the same as the wholesale copying without permission or payment of anything not behind a pay wall on the Internet.

3 days ago [-]
mu53 4 days ago [-]
Isn't it a greater risk that creators lose their income and nobody is creating the content anymore?

Take, for instance, what has happened to news because of the internet. Not exactly the same, but similar forces at work. It turned into a race to the bottom, with everyone trying to generate content as cheaply as possible for maximum engagement while tech companies siphoned revenue. Expensive investigative pieces from educated journalists disappeared in favor of stuff that looks like spam. Pre-internet news was higher quality.

Imagine that same effect happening to all content: art, writing, academic pieces. It's a real risk that OpenAI has peaked in quality.

CuriouslyC 4 days ago [-]
Lots of people create without getting paid to do it. A lot of music and art is unprofitable. In fact, you could argue that when the mainstream media companies got completely captured by suits with no interest in the things their companies invested in, that was when creativity died and we got consigned to genre-box superhero pop hell.
BeFlatXIII 3 days ago [-]
> Isn't it a greater risk that creators lose their income and nobody is creating the content anymore?

There are already multiple lifetimes of quality content out there. It's difficult to get worked up about the potential future losses.

eastbound 4 days ago [-]
I don’t know. When I look at news from before, there never was investigative journalism. It was all opinion-swaying editorials, until alternative voices offered their counternarratives. It’s just not in newspapers, because they are too politically biased to produce the two sides of stories that we’ve always asked them for. It’s on other media.

But investigative journalism has not disappeared. If anything, it has grown.

mu53 3 days ago [-]
It's changed. Investigative journalism is done by non-profits specializing in it, which have various financial motives.

The budgets at newspapers used to be much larger and fund more investigative journalism with a clearer motive.

bogwog 4 days ago [-]
This type of argument is ignorant, cowardly, shortsighted, and regressive. Both technology and society will progress when we find a formula that is sustainable and incentivizes everyone involved to maximize their contributions without it all blowing up in our faces someday. Copyright law is far from perfect, but it protects artists who want to try and make a living from their work, and it incentivizes creativity that places without such protections usually end up just imitating.

When we find that sustainable framework for AI, China or <insert-boogeyman-here> will just end up imitating it. Idk what harms you're imagining might come from that ("get ahead" is too vague to mean anything), but I just want to point out that that isn't how you become a leader in anything. Even worse, if they are the ones who find that formula first while we take shortcuts to "get ahead", then we will be the ones doing the imitation in the end.

gaganyaan 4 days ago [-]
Copyright is a dead man walking and that's a good thing. Let's applaud the end of a temporary unnatural state of affairs.
CJefferson 4 days ago [-]
If OpenAI wants copyright to be dead, then they could just give out all their models copyright free.
Andrex 4 days ago [-]
Care to make it interesting?

What do you consider "dead" and what do you consider a reasonable timeframe for this to occur?

I have 60 or so years and $50.

bdangubic 4 days ago [-]
I am in as well, I have 50 or so years and $60 (though would gladly put $600k on this… :) )
worble 4 days ago [-]
Should we also be concerned that other governments use slave labor (among other human rights violations) and will use that to get ahead?
logicchains 4 days ago [-]
It's hysterical to compare training an ML model with slave labour. It's perfectly fine and accepted for a human to read and learn from content online without paying anything to the author when that content has been made available online for free, it's absurd to assert that it somehow becomes a human rights violation when the learning is done by a non-biological brain instead.
Kbelicius 4 days ago [-]
> It's hysterical to compare training an ML model with slave labour.

Nobody did that.

> It's perfectly fine and accepted for a human to read and learn from content online without paying anything to the author when that content has been made available online for free, it's absurd to assert that it somehow becomes a human rights violation when the learning is done by a non-biological brain instead.

It makes sense. There is always scale to consider in these things.

totallykvothe 4 days ago [-]
worble literally did make that comparison. It is possible for comparisons to be made using other rhetorical devices than just saying "I am comparing a to b".
Terr_ 4 days ago [-]
> worble literally did make that comparison

No, their mention of "slave labor" is not a comparison to how LLMs work, nor an assertion of moral equivalence.

Instead it is just one example to demonstrate that chasing economic/geopolitical competitiveness is not a carte blanche to adopt practices that might be immoral or unjust.

4 days ago [-]
4 days ago [-]
immibis 4 days ago [-]
Absolutely: if copyright is slowing down innovation, we should abolish copyright.

Not just turn a blind eye when it's the right people doing it. They don't even have a legal exemption passed by Congress - they're just straight-up breaking the law and getting away with it. Which is how America works, I suppose.

JoshTriplett 4 days ago [-]
Exactly. They rushed to violate copyright on a massive scale quickly, and now are making the argument that it shouldn't apply to them and they couldn't possibly operate in compliance with it. As long as humans don't get to ignore copyright, AI shouldn't either.
Filligree 4 days ago [-]
Humans do get to ignore copyright, when they do the same thing OpenAI has been doing.
slyall 4 days ago [-]
Exactly.

Should I be paying a proportion of my salary to all the copyright holders of the books, song, TV shows and movies I consumed during my life?

If a Hollywood writer says she "learnt a lot about writing by watching the Simpsons" will Fox have an additional claim on her earnings?

dijksterhuis 4 days ago [-]
> Should I be paying a proportion of my salary to all the copyright holders of the books, song, TV shows and movies I consumed during my life?

you already are.

a proportion of what you pay for books, music, tv shows, movies goes to rights holders already.

any subscription to spotify/apple music/netflix/hbo; any book/LP/CD/DVD/VHS; any purchased digital download … a portion of those sales is paid back to rights holders.

so… i’m not entirely sure what your comment is trying to argue for.

are you arguing that you should get paid a rebate for your salary that’s already been spent on copyright payments to rights holders?

> If a Hollywood writer says she "learnt a lot about writing by watching the Simpsons" will Fox have an additional claim on her earnings?

no. that’s not how copyright functions.

the actual episodes of the simpsons are the copyrighted work.

broadcasting/allowing purchases of those episodes incurs the copyright, as it involves COPYING the material itself.

COPYright is about the rights of the rights holder when their work is COPIED, where a “work” is the material which the copyright applies to.

merely mentioning the existence of a tv show involves zero copying of a registered work.

being inspired by another TV show to go off and write your own tv show involves zero copying of the work.

a hollywood writer rebroadcasting a simpsons episode during a TV interview would be a different matter. same with the hollywood writer just taking scenes from a simpsons episode and putting them into their film. that’s COPYing the material.

—-

when it comes to open AI, obviously this is a legal gray area until courts start ruling.

but the accusations are that OpenAi COPIED the intercept’s works by downloading them.

openAi transferred the work to openAi servers. they made a copy. and now openAi are profiting from that copy of the work that they took, without any permission or remuneration for the rights holder of the copyrighted work.

essentially, openAI did what you’re claiming is the status quo for you… but it’s not the status quo for you.

so yeah, your comment confuses me. hopefully you’re being sarcastic and it’s just gone completely over my head.

slyall 4 days ago [-]
The problem is that the anti-AI people are going after several steps in the chain (and are often vague about which ones they're talking about at any point).

As well as the "copying" of content, some are also claiming that the output of an LLM should result in royalties being paid back to the owners of the material used in training.

So if an AI produces a sitcom script, the copyright holders of the TV shows it ingested should get paid royalties. In addition to the money paid to copy files around.

Which leads to the precedent that if a writer creates a sitcom then the copyright holders of sitcoms she watched should get paid for "training" her.

jashmatthews 4 days ago [-]
When humans learn and copy too closely we call that plagiarism. If an LLM does it how should we deal with that?
chii 4 days ago [-]
> If an LLM does it how should we deal with that?

why not deal with it the same way as humans have been dealt with in the past?

If you copied an art piece using photoshop, you would've violated copyright. Photoshop (and adobe) itself never committed copyright violations.

Somehow, if you swap photoshop with openAI and chatGPT, then people claim that the actual application itself is a copyright violation.

dijksterhuis 4 days ago [-]
this isn’t the same.

> If you copied an art piece using photoshop, you would've violated copyright. Photoshop (and adobe) itself never committed copyright violations.

the COPYing is happening on your local machine with non-cloud versions of Photoshop.

you are making a copy, using a tool, and then distributing that copy.

in music royalty terms, the making a copy is the Mechanical right, while distributing the copy is the Performing right.

and you are liable in this case.

> Somehow, if you swap photoshop with openAI and chatGPT, then people claim that the actual application itself is a copyright violation

OpenAI make a copy of the original works to create training data.

when the original works are reproduced verbatim (memorisation in LLMs is a thing), then that is the copyrighted work being distributed.

mechanical and performing rights, again.

but the twist is that ChatGPT does the copying on their servers and delivers it to your device.

they are creating a new copy and distributing that copy.

which makes them liable.

you are right that “ChatGPT” is just a tool.

however, the interesting legal grey area with this is — are ChatGPT model weights an encoded copy of the copyrighted works?

that’s where the conversation about the tool itself being a copyright violation comes in.

photoshop provides no mechanism to recite The Art Of War out of the box. an LLM could be trained to do so (like, it’s a hypothetical example but hopefully you get the point).

chii 3 days ago [-]
> OpenAI make a copy of the original works to create training data.

if a user is allowed to download said copy to view on their browser, why isn't that same right given to openAI to download a copy to view for them? What openAI chooses to do with the viewed information is up to them - such as distilling summary statistics, or whatever.

> are ChatGPT model weights an encoded copy of the copyrighted works? that is indeed the most interesting legal gray area. I personally believe that it is not. The information distilled from those works do not constitute any copyrightable information, as it is not literary, but informational.

It's irrelevant that you could recover the original works from these weights - you could recover the same original works from the digits of pi!

dijksterhuis 3 days ago [-]
heads up: you may want to edit your second quote

> if a user is allowed to download said copy to view on their browser, why isn't that same right given to openAI to download a copy to view for them?

whether you can download a copy from your browser doesn’t matter. whether the work is registered as copyrighted does (and following on from that, who is distributing the work - aka allowing you to download the copy - and for what purposes).

from the article (on phone cba to grab a quote) it makes clear that the Intercept’s works were not registered as copyrighted works with whatever the name of the US copyright office was.

ergo, those works are not copyrighted and, yes, they essentially are public domain and no remuneration is required …

(they cannot remove DMCA attribution information when distributing copies of the works though, which is what the case is now about.)

but for all the other registered works that OpenAI has downloaded, creating their copy, used in training data, which the model then reproduces as a memorised copy — that is copyright infringement.

like, in case it’s not clear, i’ve been responding to what people are saying about copyright specifically. not this specific case.

> The information distilled from those works do not constitute any copyrightable information, as it is not literary, but informational.

that’s one argument.

my argument would be it is a form of compression/decompression when the model weights result in memorised (read: overfitted) training data being regurgitated verbatim.

put the specific prompt in, you get the decompressed copy out the other end.

it’s like a zip file you download with a new album of music. except, in this case, instead of double-clicking on the file you have to type in a prompt to get the decompressed audio files (or text, in the LLM case)
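
to make that concrete, a toy check for verbatim regurgitation could look like the sketch below (pure illustration: the 10-token threshold is arbitrary, not any legal standard, and real memorisation audits are far more involved):

    # toy sketch: flag output that reproduces a long verbatim run
    # from a known source text. the 10-token threshold is arbitrary,
    # chosen for illustration only -- courts decide the real line.
    from difflib import SequenceMatcher

    def longest_shared_run(output, source):
        a, b = output.split(), source.split()
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return m.size  # length, in tokens, of the longest verbatim run

    source = "it was the best of times it was the worst of times it was the age of wisdom"
    output = "as the saying goes it was the best of times it was the worst of times indeed"
    run = longest_shared_run(output, source)
    print("longest verbatim run: %d tokens" % run)
    if run >= 10:
        print("looks like memorised reproduction rather than paraphrase")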

> It's irrelevant that you could recover the original works from these weights - you could recover the same original works from the digits of pi!

actually, that’s the whole point of courts ruling on this.

the boundaries of what is considered reproduction is at question. it is up to the courts to decide on the red lines (probably blurry gray areas for a while).

if i specifically ask a model to reproduce an exact song… is that different to the model doing it accidentally?

i don’t think so. but a court might see it differently.

as someone who worked in music copyright, is a musician, sees the effects of people stealing musicians efforts all the time, i hope the little guys come out of this on top.

sadly, they usually don’t.

dijksterhuis 3 days ago [-]
i’ve been avoiding replying to your comment for a bit, and now i realised why.

edit: i am so sorry about the wall of text.

> some are also claiming that the output of a LLM should result in paying royalties back to the owning of the material used in training.

> So if an AI produces a sitcom script then the copyright holders of those tv shows it ingested should get paid royalties. In additional to the money paid to copy files around.

what you’re talking about here is the concept of “derivative works” made from other, source works.

this is subtly different to reproduction of a work.

see the last half of this comment for my thoughts on what the interesting thing courts need to work out regarding verbatim reproduction https://news.ycombinator.com/item?id=42282003

in the derivative works case, it’s slightly different.

sampling in music is the best example i’ve got for this.

if i take four popular songs, cut 10 seconds of each, and then join each of the bits together to create a new track — that is a new, derivative work.

but i have not sufficiently modified the source works. they are clearly recognisable. i am just using copyrighted material in a really obvious way. the core of my “new” work is actually just four reproductions of the work of other people.

in that case — that derivative work, under music copyright law, requires the original copyright rights holders to be paid for all usage and copying of their works.

basically, a royalty split gets agreed, or there’s a court case. and then there’s a royalty split anyway (probably some damages too).

in my case, when i make music with samples, i make sure i mangle and process those samples until the source work is no longer recognisable. i’ve legit made it part of my workflow.

it’s no longer the original copyrighted work. it’s something completely new and fully unrecognisable.

the issue with LLMs, not just ChatGpt, is that they will reproduce both verbatim output and output recognisably similar to original source works.

the original source copyrighted work is clearly recognisable, even if not an exact verbatim copy.

and that’s what you’ve probably seen folks talking about, at least it sounds like it to me.

> Which leads to the precedent that if a writer creates a sitcom then the copyright holders of sitcoms she watched should get paid for "training" her.

robin thicke “blurred lines” —

* https://en.m.wikipedia.org/wiki/Pharrell_Williams_v._Bridgep...

* https://en.m.wikipedia.org/wiki/Blurred_Lines (scroll down)

yes, there is already some very limited precedent, at least for a narrow specific case involving sheet music in the US.

the TL;DR IANAL version of the question at hand in the case was “did the defendants write the song with the intention of replicating a hook from the plaintiff’s work”.

the jury decided, yes they did.

this is different to your example in that they specifically set out to replicate that specific musical component of a song.

in your example, you’re talking about someone having “watched” a thing one time and then having to pay royalties to those people as a result.

that’s more akin to “being inspired” by, and is protected under US law i think IANAL. it came up in blurred lines, but, well, yeah. https://en.m.wikipedia.org/wiki/Idea%E2%80%93expression_dist...

again, the red line of infringement / not infringement is ultimately up to the courts to rule on.

anyway, this is very different to what openAi/chatGpt is doing.

openAi takes the works. chatgpt edits them according to user requests (feed forward through the model). then the output is distributed to the user. and that output could be considered to be a derivative work (see massive amount of text i wrote above, i’m sorry).

LLMs aren’t sitting there going “i feel like recreating a marvin gaye song”. it takes data, encodes/decodes it, then produces an output. it is a mechanical process, not a creative one. there’s no ideas here. no inspiration or expression.

an LLM is not a human being. it is a tool, which creates outputs that are often strikingly similar to source copyrighted works.

their users might be specifically asking to replicate songs though. in which case, openAi could be facilitating copyright infringement (whether through derivative works or not).

and that’s an interesting legal question by itself. are they facilitating the production of derivative works through the copying of copyrighted source works?

i would say they are. and, in some cases, the derivative works are obviously derived.

Suppafly 4 days ago [-]
>a proportion of what you pay for books, music, tv shows, movies goes to rights holders already.

When I borrow a book from a friend, how do the original authors get paid for that?

dijksterhuis 4 days ago [-]
they don’t.

borrowing a book is not creating a COPY of the book. you are not taking the pages, reproducing all of the text on those pages, and then giving that reproduction to your friend.

that is what a COPY is. borrowing the book is not a COPY. you’re just giving them the thing you already bought. it is a transfer of ownership, albeit temporarily, not a copy.

if you were copying the files from a digitally downloaded album of music and giving those new copies to your friend (music royalties were my specialty) then technically you would be in breach of copyright. you have copied the works.

but because it’s such a small scale (an individual with another individual) it’s not going to be financially worth it to take the case to court.

so copyright holders just cut their losses with one friend sharing it with another friend, and focus on other infringements instead.

which is where the whole torrenting thing comes in. if i can track 7000 people who have all downloaded the same torrented album, now i can just send a letter / court date to those 7000 people.

the costs of enforcement are reduced because of scale. 7000 people, all found the same thing, in a way that can be tracked.

and then the ultimate case: one person/company has downloaded the works and is making them available for others to download, without paying for the rights to make copies when distributing.

that’s the ultimate goldmine for copyright infringement lawsuits. and it sounds suspiciously like openAi’s business model.

Suppafly 1 days ago [-]
>borrowing a book is not creating a COPY of the book. you are not taking the pages, reproducing all of the text on those pages, and then giving that reproduction to your friend.

That's not what's happening with training AI models either though.

immibis 4 days ago [-]
Copying copyrighted works?
chii 4 days ago [-]
learning, and extracting useful information from copyrighted works.

This extracted information cannot and should not be copyrightable.

keepingscore 3 days ago [-]
Learning from copyrighted work requires a license to access that work. You can extract information from the world's best books by purchasing those books. But no author is being compensated here: they download books.torrent and then use that pirated material to profit.
azemetre 3 days ago [-]
If you’re arguing that OpenAI should be compelled to make all their technology and models free then I think we all agree, but it sounds like you’re trying to weasel your way into letting a corpo get away with breaking the law while running away with billions.
chii 3 days ago [-]
> If you’re arguing that OpenAI should be compelled to make all their technology and models free

no i'm not - i'm arguing that its weights are not copyrightable. It doesn't have to be free or not - that is a separate (and uninteresting) argument.

catlifeonmars 3 days ago [-]
That’s really expensive to do, so in practice only wealthy humans or corporations can do so. Still seems unfair.
4 days ago [-]
__loam 4 days ago [-]
Yeah it turns out humans have more rights than computer programs and tech startups.
triceratops 4 days ago [-]
So make OpenAI sleep 8 hours a day, pay income and payroll taxes with the same deductions as a natural human etc...
treyd 4 days ago [-]
ChatGPT doesn't violate copyright, it's a software application. "Open"AI does, it's a company run by humans (for now).
tpmoney 4 days ago [-]
> they're just straight-up breaking the law and getting away with it.

So far this has not been determined and there's plenty of reasonable arguments that they are not breaking copyright law.

blackqueeriroh 4 days ago [-]
> Absolutely: if copyright is slowing down innovation, we should abolish copyright.

Is this sarcasm?

immibis 4 days ago [-]
No. If something slows down innovation and suffocates the economy, why would you (an economically minded politician) keep it?
noitpmeder 3 days ago [-]
Because the world shouldn't be run primarily by economically minded politicians??

I'm sure China gets competitive advantages from its use of indentured and slave-like labor forces, and from mass reeducation programs in camps. Should the US allow these things to happen? What if a private business here started doing the same?

But remember, they're just trying to compete with China on a fair playing field, so everything is permitted right?

redwall_hp 3 days ago [-]
You might want to look at the constitutional amendment enshrining slave labor "as a punishment for a crime," and the world's largest prison population. Much of your food supply has links to prison labor.

https://apnews.com/article/prison-to-plate-inmate-labor-inve...

But don't worry, it's not considered "slave labor" because there's a nominal wage of a few pennies involved and it's not technically "forced." You just might be tortured with solitary confinement if you don't do it.

We need to point fewer fingers and clean up the problems here.

dmead 4 days ago [-]
I'm more concerned that some people in the tech world are conflating Sam Altman's interests with the national interest.
astrange 4 days ago [-]
Easy to turn one into the other, just get someone to leak the model weights.
dmead 3 days ago [-]
That doesn't really do it right? The state would have to have its own training cluster.
jMyles 4 days ago [-]
Am I jazzed about Sam Altman making billions? No.

Am I even more concerned about the state having control over the future corpus of knowledge via this doomed-in-any-case vector of "intellectual property"? Yes.

I think it will be easier to overcome the influence of billionaires when we drop the pretext that the state is a more primal force than the internet.

dmead 4 days ago [-]
100% disagree. "It'll be fine bro" is not a substitute for having a vote over policy decisions made by the government. What you're talking about has a name. It starts with F and was very popular in Italy in the early to mid 20th century.
jMyles 4 days ago [-]
Rapidity of Godwin's law notwithstanding, I'm not disputing the importance of equity in decision-making. But this matter is more complex than that: it's obvious that the internet doesn't tolerate censorship, even when it's dressed up as intellectual property. I prefer an open and democratic internet to one policed by childish legacy states, whose presence serves only (and only sometimes) to drive content into open secrecy.

It seems particularly unfair to equate any questioning of the wisdom of copyright laws (even when applied in situations where we might not care for the defendant, as with this case) with fascism.

dmead 4 days ago [-]
It's not Godwin's law when it's correct. Just because it's cool and on the Internet doesn't mean you get to throw out people's stake in how their lives are run.
jMyles 4 days ago [-]
> throw out people's stake in how their lives are run

FWIW, you're talking to a professional musician. Ostensibly, the IP complex is designed to protect me. I cannot fathom how you can regard it as the "people's stake in how their lives are run". Eliminating copyright will almost certainly give people more control over their digital lives, not less.

> It's not Godwin's law when it's correct.

Just to be clear, you are doubling down on the claim that sunsetting copyright laws is tantamount to nazism?

dmead 4 days ago [-]
Not at all. Go re read above.
devsda 4 days ago [-]
Get ahead in terms of what? Do you believe that the material in public domain or legally available content that doesn't violate copyrights is not enough to research AI/LLMs or is the concern about purely commercial interests?

China also supposedly has abusive labor practices. So, should other countries start relaxing their labor laws to avoid falling behind ?

FpUser 4 days ago [-]
Shall we install the emperor then?
4 days ago [-]
snickerbockers 4 days ago [-]
[flagged]
torginus 3 days ago [-]
How sturdy is this claim?

If we presume it's illegal to train on copyrighted works, but a website summarizing an article (like Wikipedia) is perfectly legal, then what would happen if we got LLM A to summarize the article and used that summary to train LLM B?

LLM A could be trained on public domain works.
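
Schematically, that pipeline is just model distillation. A sketch, with placeholder functions standing in for the real models (no claim that this would actually sever the copyright chain - that is exactly the open question):

    # Schematic sketch of the proposed two-model pipeline. Both
    # functions are placeholders for real model calls; whether this
    # severs the copyright chain is the open legal question.
    def llm_a_summarize(article):
        # LLM A (itself trained only on public-domain text) paraphrases
        # the article. Stand-in: naive truncation instead of a model.
        return "summary: " + " ".join(article.split()[:30])

    def build_training_set(articles):
        # LLM B never sees the originals, only A's paraphrases.
        return [llm_a_summarize(a) for a in articles]

    corpus = ["full text of some copyrighted article ...",
              "full text of another article ..."]
    training_data = build_training_set(corpus)
    # train_llm_b(training_data)  # placeholder for an actual training run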

miohtama 3 days ago [-]
If it is illegal to train on copyrighted work, it will also benefit actors that are free to ignore laws, like China's public-private companies. It means Western companies will lose the AI race.
tapoxi 3 days ago [-]
Then we don't respect their copyrights? Why is this some sort of unsolvable problem and the only solution is to allow mega corporations to sell us AI that is trained on the work of artists without their consent?
vorpalhex 3 days ago [-]
LLM B would be a very bad LLM with only limited vocabulary and turn of phrase, and would tend to have a single writing tone.

And no, having 5000 different summarizing LLMs doesn't help here.

It's sort of like taking a photograph of a photograph.

0xcde4c3db 4 days ago [-]
The claim that's being allowed to proceed is under 17 USC 1202, which is about stripping metadata like the title and author. Not exactly "core copyright violation". Am I missing something?
anamexis 4 days ago [-]
I read the headline as the copyright violation claim being core to the lawsuit.
H8crilA 4 days ago [-]
The plaintiffs focused on exactly this part - removal of metadata - probably because it's the most likely to hold in courts. One judge remarked on it pretty explicitly, saying that it's just a proxy topic for the real issue of the usage of copyrighted material in model training.

I.e., it's some legalese trick, but "everyone knows" what's really at stake.

0xcde4c3db 4 days ago [-]
Yeah; I think that's essentially where the disconnect is rooted for me. It seems to me (a non-lawyer, to be clear) that it's damn hard to make the case for model training necessarily being meat-and-potatoes "infringement" as things are defined in Title 17 Chapter 1. I see it as firmly in the grey area between "a mere change of physical medium or deterministic mathematical transformation clearly isn't a defense against infringement on its own" and "giant toke come on, man, Terry Brooks was obviously just ripping off Tolkien". There might be a tension between what constitutes "substantial similarity" through analog and digital lenses, especially as the question pertains to those who actually distribute weights.
kyledrake 4 days ago [-]
I think you're at the heart of it, and you've humorously framed the grey area here and it's very weird. Sans a ruling that, for example, computers are too deterministic to be creative, copyright laws really seem to imply that LLM training is legal. Learning and then creating something new from what you learned isn't copyright infringement, so what's the legal argument here? A ruling declaring this copyright infringement is likely going to have crazy ripple effects going way beyond LLMs, something a good judge is going to be very mindful of.

Ultimately, this is probably going to require Congress to create new laws to codify this.

dragonwriter 2 days ago [-]
> Learning and then creating something new from what you learned isn't copyright infringement, so what's the legal argument here?

The legal argument is that copying, or creating what would otherwise be derivative works, solely within a human brain is exempt because the human brain is not a medium in which a configuration of information constitutes either a copy or a new work until it is set in another medium or performed publicly, whereas the storage of an artificial computer absolutely is such a medium (both of which are well-established law). So the "learning" metaphor is not legally valid, even if it is arguably a decent metaphor for other purposes. Furthermore, learning and then creating something new is often illegal if the "something new" has sufficient proximity to the source material (that's the prohibition on unlicensed derivative works). GenAI systems often do exactly that, and (so the argument goes) are sufficiently frequently used, and known by the service and model providers to be used, intentionally to do that, that even were the training itself not a violation, the standards for contributory infringement are met in the provision of certain models and/or services.

mikae1 4 days ago [-]
According to US law, is the Internet Archive a library? I know they received a DMCA exemption.

If so, you could argue that your local library returns perfect copies of copyrighted works too. IMO it's somehow different from a business turning the results of its scraping into a profit machine.

kyledrake 4 days ago [-]
My understanding is that there is no such thing as a library license: you just say you're a library and thereby become one, and whether your claim survives is more a product of social and cultural acceptance than of actual legal structures. But someone is welcome to correct me.

The internet archive also scrapes the web for content, does not pay authors, the difference being that it spits out literal copies of the content it scraped, whereas an LLM fundamentally attempts to derive a new thing from the knowledge it obtains.

I just can't figure out how to plug this into copyright law. It feels like a new thing.

quectophoton 3 days ago [-]
Also, Google Translate, when used to translate web pages:

> does not pay authors

Check.

> it spits out literal copies of the content it scraped

Check.

> attempts to derive a new thing from the knowledge it obtains.

Check.

* Is interactive: Check.

* Can output text that sounds syntactically and grammatically correct, but a human can instantly say "that doesn't look right": Check.

* Changing one word in a sentence affects words in a completely different sentence, because that changed the context: Check.

mikae1 2 days ago [-]
Scraping with the intent to capitalize: no check.
dragonwriter 2 days ago [-]
”Core copyright violation”, here, I think is being used relative to the claims in the case.
Kon-Peki 4 days ago [-]
Violations of 17 USC 1202 can be punished pretty severely. It's not about just money, either.

If, during the trial, the judge thinks that OpenAI is going to be found in violation, he can order all of OpenAI's computer equipment impounded. If OpenAI is found to be in violation, he can then order permanent destruction of the models, and OpenAI would have to start over from scratch in a manner that doesn't violate the law.

Whether you call that "core" or not, OpenAI cannot afford to lose the parts of this lawsuit that remain.

nickpsecurity 4 days ago [-]
“ If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.”

That is exactly why I suggested companies train some models on public domain and licensed data. That risk disappears or is very minimal. They could also be used for code and synthetic data generation without legal issues on the outputs.

jsheard 4 days ago [-]
That's what Adobe and Getty Images are doing with their image generation models, both are exclusively using their own licensed stock image libraries so they (and their users) are on pretty safe ground.
nickpsecurity 4 days ago [-]
That’s good. I hope more do. This list has those doing it under the Fairly Trained banner:

https://www.fairlytrained.org/certified-models

3pt14159 4 days ago [-]
The problem is that you don't get the same quality of data if you go about it that way. I love ChatGPT and I understand that we're figuring out this new media landscape but I really hope it doesn't turn out to neuter the models. The models are really well done.
nickpsecurity 4 days ago [-]
If I steal money, I can get way more done than I do now by earning it legally. Yet you won't see me regularly dismissing legitimate jobs by posting comparisons to what my numbers would look like if I were stealing I.P.

We must start with moral and legal behavior. Within that, we look at what opportunities we have. Then, we pick the best ones. Those we can’t have are a side effect of the tradeoffs we’ve made (or tolerated) in our system.

tremon 4 days ago [-]
That is OpenAI's problem, not their victims'.
zozbot234 4 days ago [-]
> he can order all of OpenAIs computer equipment be impounded.

Arrrrr matey, this is going to be fun.

Kon-Peki 4 days ago [-]
People have been complaining about the DMCA for 2+ decades now. I guess it's great if you are on the winning side. But boy does it suck to be on the losing side.
immibis 4 days ago [-]
And normal people can't get on the winning side. I'm trying to get Github to DMCA my own repositories, since it blocked my account and therefore I decided it no longer has the right to host them. Same with Stack Exchange.

GitHub's ignored me so far, and Stack Exchange explicitly said no (then I sent them an even broader legal request under GDPR)

ralph84 4 days ago [-]
When you uploaded your code to GitHub you granted them a license to host it. You can’t use DMCA against someone who’s operating within the parameters of the license you granted them.
tremon 4 days ago [-]
Their stance is that GitHub revoked that license by blocking their account.
Dylan16807 4 days ago [-]
Is it?

And what would connect those two things together?

immibis 4 days ago [-]
GitHub's terms of service specify the license is granted as necessary to provide the service. Since the service is not provided they don't have a license.
Dylan16807 4 days ago [-]
Hosting the code is providing the service, whether you have a working account or not.

Also was this code open source? Your stack exchange contributions were open source, so they don't need any ToS-based permission in the first place. They have access under CC BY-SA.

immibis 2 days ago [-]
Some, not all. GitHub is unlikely to continue hosting the code on the basis that it's open source. If they do, I'll send them a GDPR request to detach my name from it, including in source code comments and package names.

It's not clear that Stack Exchange always followed the CC license, and if they violated it even once, the license was terminated. The checkbox you have to click now to access the data dumps might be a violation. The data dumps don't come with copies of the licenses, so that's a violation.

Dylan16807 2 days ago [-]
I don't think GDPR lets you retroactively redact code you released as open source.
immibis 2 days ago [-]
At the very least they have to stop associating it with my account. I told them they don't have to remove forks.
immibis 4 days ago [-]
It won't happen. Judges only order that punishment for the little guys.
pnut 3 days ago [-]
There would be a highly embarrassing walking back of such a ruling, when Sam Altman flexes his political network and effectively overrules it.

He spends his time amassing power and is well positioned to plow over a speed bump like that.

sieabahlpark 4 days ago [-]
[dead]
CaptainFever 4 days ago [-]
Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?

It seems to me that it shouldn't really affect model quality all that much, should it?

Also, in the amended complaint:

> not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights

Wasn't it already quite clear that as long as the articles weren't replicated, it wasn't protected? Or is that still being fought in this case?

In the decision:

> I agree with Defendants. Plaintiffs allege that ChatGPT has been trained on "a scrape of most of the internet," Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs' articles seems remote. And while Plaintiffs provide third-party statistics indicating that an earlier version of ChatGPT generated responses containing significant amounts of plagiarized content, Compl. ¶ 5, Plaintiffs have not plausibly alleged that there is a "substantial risk" that the current version of ChatGPT will generate a response plagiarizing one of Plaintiffs' articles.

freejazz 4 days ago [-]
>Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?

Have you read 1202? It's all about hiding your infringement.

bastloing 4 days ago [-]
Isn't this the same thing Google has been doing for years with their search engine? Only difference is Google keeps the data internal, whereas openai spits it out to you. But it's still scraped and stored in both cases.
jazzyjackson 4 days ago [-]
A component of fair use is to what degree the derivative work displaces the original. Google's argument has always been that they direct traffic to the original, whereas AI summaries (which Google of course is just as guilty of as openai) completely obsoletes the original publication. The argument now is that the derivative work (LLM model) is transformative, ie, different enough that it doesn't economically compete with the original. I think it's a losing argument but we'll see what the courts arrive at.
CaptainFever 4 days ago [-]
Is this specific to AI or specific to summaries in general? Do summaries, like the ones found in Wikipedia or Cliffs Notes, not have the same effect of making it such that people no longer have to view the original work as much?

Note: do you mean the model is transformative, or the summaries are transformative? I think your comment holds up either way but I think it's better to be clear which one you mean.

LinuxBender 4 days ago [-]
In my opinion (not a lawyer), Google at least references where it obtained the data and did not regurgitate it as if it were the creator of something new; an LLM's output is obfuscated plagiarism. Some claim derivative works, but I have always seen that as quite a stretch. People here expect me to cite references, yet LLMs somehow escape this level of scrutiny.
ysofunny 4 days ago [-]
the very idea of "this digital asset is exclusively mine" cannot die soon enough

let real physically tangible assets keep the exclusivity problem

let's not undo the advantages unlocked by the digital internet; let us prevent a few from locking down this grand boon of digital abundance such that the problem becomes saturation of data

let us say no to digital scarcity

CaptainFever 4 days ago [-]
This is, in fact, the core value of the hacker ethos. HackerNews.

> The belief that information-sharing is a powerful positive good, and that it is an ethical duty of hackers to share their expertise by writing open-source code and facilitating access to information and to computing resources wherever possible.

> Most hackers subscribe to the hacker ethic in sense 1, and many act on it by writing and giving away open-source software. A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.

http://www.catb.org/jargon/html/H/hacker-ethic.html

Perhaps if the Internet didn't kill copyright, AI will. (Hyperbole)

(Personally my belief is more nuanced than this; I'm fine with very limited copyright, but my belief is closer to yours than the current system we have.)

raincole 4 days ago [-]
Open AI scrapping copyrighted materials to make a proprietary model is the exact opposite of what GNU promotes.
CaptainFever 4 days ago [-]
As I mentioned in another comment:

"Scrapping" (scraping) copyrighted materials is not the wrong thing to do.

Making it proprietary is.

It is important to be clear about what is wrong, so you don't accidentally end up fighting for copyright expansion, or fighting against open models.

YetAnotherNick 2 days ago [-]
If you open the link, it will be clearer to you that GNU wants to share information even with so-called bad hackers.
onetokeoverthe 4 days ago [-]
Creators freely sharing with attribution requested is different than creations being ruthlessly harvested and repurposed without permission.

https://creativecommons.org/share-your-work/

a57721 4 days ago [-]
> freely sharing with attribution requested

If I share my texts/sounds/images for free, harvesting and regurgitating them omits the requested attribution. Even the most permissive CC license (excluding CC0 public domain) still requires attribution.

CaptainFever 4 days ago [-]
> A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.

In this view, the ideal world is one where copyright is abolished (but not moral rights). So piracy is good, and datasets are also good.

Asking creators to license their work freely is simply a compromise due to copyright unfortunately still existing. (Note that even if creators don't license their work freely, this view still permits you to pirate or mod it against their wishes.)

(My view is not this extreme, but my point is that this view was, and hopefully is, still common amongst hackers.)

I will ignore the moralizing words (eg "ruthless", "harvested" to mean "copied"). It's not productive to the conversation.

onetokeoverthe 4 days ago [-]
If not respected, some Creators will strike, lay flat, not post, go underground.

Ignoring moral rights of creators is the issue.

CaptainFever 4 days ago [-]
Moral rights involve the attribution of works where reasonable and practical. Clearly doing so during inference is not reasonable or practical (you'll have to attribute all of humanity!) but attributing individual sources is possible and is already being done in cases like ChatGPT Search.

So I don't think you actually mean moral rights, since it's not being ignored here.

But the first sentence of your comment still stands regardless of what you meant by moral rights. To that, well... we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here.

And yes, funding is a thing, which I agree needs copyright for the most part unfortunately. But does training AI on, for example, a book really reduce the need to buy the book, if it is not reproduced?

Remember, training is not just about facts, but about learning how humans talk, how languages work, how books work, etc. Learning that won't reduce the book's economical value.

And yes, summaries may reduce the value. But summaries already exist. Wikipedia, Cliff's Notes. I think the main defense is that you can't copyright facts.

onetokeoverthe 4 days ago [-]
we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here

?!?! Comparing and equating commenting to creative works. ?!?!

These comments are NOT equivalent to the 17 full time months it took me to write a nonfiction book.

Or an 8 year art project.

When I give away my work I decide to whom and how.

CaptainFever 4 days ago [-]
I have already covered these points in the latter paragraphs.

You might want to take a look at https://www.gnu.org/philosophy/shouldbefree.en.html

onetokeoverthe 4 days ago [-]
I'll decide the distribution of my work. Be it 100 million unique views or NOT at all.
CaptainFever 4 days ago [-]
If you don't have a proper argument, it's best not to distribute your comment at all.
onetokeoverthe 4 days ago [-]
If saying it's my work is not a "proper" argument, that says it all.
CaptainFever 3 days ago [-]
Indeed, owner.

Look, either actually read the link and refute the points within, or don't. But there's no use discussing anything if you're unwilling to even understand and seriously refute a single point being made here, other than repeating "mine, mine, mine".

onetokeoverthe 3 days ago [-]
Read it. Lots of nots, and no respect.

In the process, [OpenAI] trained ChatGPT not to acknowledge or respect copyright, not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights, and not to provide attribution when using the works of human journalists

CaptainFever 3 days ago [-]
AlienRobot 4 days ago [-]
I think an ethical hacker is someone who uses their expertise to help those without.

How could an ethical hacker side with OpenAI, when OpenAI is using its technological expertise to exploit creators without?

CaptainFever 4 days ago [-]
I won't necessarily argue against that moral view, but in this case it is two large corporations fighting. One has the power of tech, the other has the power of the state (copyright). So I don't think that applies in this case specifically.
Xelynega 4 days ago [-]
Aren't you ignoring that common law is built on precedent? If they win this case, that makes it a lot easier for people whose copyright is being infringed on an individual level to get justice.
CaptainFever 4 days ago [-]
You're correct, but I think many don't realize how many small model trainers and fine-tuners there are currently. For example, PonyXL, or the many models and fine-tunes on CivitAI made by hobbyists.

So basically the reasoning is this:

- NYT vs OpenAI: neither is disenfranchised
- OpenAI vs individual creators: creators are disenfranchised
- NYT vs individual model trainers: model trainers are disenfranchised
- Individual model trainers vs individual creators: neither is disenfranchised

And if only one can win, and since the view is that information should be free, it biases the argument towards the model trainers.

AlienRobot 4 days ago [-]
What "information" are you talking about? It's a text and image generator.

Your argument is that it's okay to scrape content when you are an individual. It doesn't change the fact those individuals are people with technical expertise using it to exploit people without.

If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?

You need to draw the line somewhere.

CaptainFever 4 days ago [-]
Text and images are information, though.

> If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?

Kind of? It's not okay, but not because it is usage of information without consent (this is the "information should be free" part), but because it is intentionally and unnecessarily annoying and angering people (this is the "don't use the information for evil" part, which I think is your position).

"See? Similarly, even in your view, model trainers aren't bad because they're using data. They're bad in general because they're exploiting creatives."

But why is it exploitative?

"They're putting the creatives out of a job." But this applies to automation in general.

"They're putting creatives out of a job, using data they created." This is the strongest argument for me. It does intuitively feel exploitative. However, there are several issues:

1. Not all models or datasets do that. For instance, no one is visibly getting paid to write comments on HN, or to write fanfics on the non-commercial fanfic site AO3. Since the data creators are not doing it as a job in the first place, it does not make sense to talk about them losing their job because of the very same data.

2. Not all models or datasets do that. For example, spam filters, AI classifiers. All of this can be trained from the entire Internet and not be exploitative because there is no job replacement involved here.

3. Some models already do that, and are already well and morally accepted. For example, Google Translate.

4. This may be resolved by going the other way and making more models open source (or even leaks), so more creatives can use it freely, so they can make use of the productive power.

"Because they're using creatives' information without consent." But as mentioned, it's not about the information or consent. It's about what you do with the information.

Finally, because this is a legal case, it's also important to talk about the morality of using the state to restrict people from using information freely, even if their use of the information is morally wrong.

If you believe in free culture as in free speech, then it is wrong to restrict such a use using the law, even though we might agree it is morally wrong. But this really depends if you believe in free culture as in free speech in the first place, which is a debate much larger than this.

Xelynega 4 days ago [-]
I don't understand what the "hacker ethos" could have to do with defending openai's blatant stealing of people's content for their own profit.

Openai is not sharing their data(they're keeping it private to profit off of), so how could it be anywhere near the "hacker ethos" to believe that everyone else needs to hand over their data to openai for free?

CaptainFever 4 days ago [-]
Following the "GNU-flavour hacker ethos" as described, one concludes that it is right for OpenAI to copy data without restriction, it is wrong for NYT to restrict others from using their data, and it is also wrong for OpenAI to restrict the sharing of their model weights or outputs for training.

Luckily, most people seem to ignore OpenAI's hypocritical TOS against sharing their output weights for training. I would go one step further and say that they should share the weights completely, but I understand there's practical issues with that.

Luckily, we can kind of "exfiltrate" the weights by training on their output. Or wait for someone to leak it, like NovelAI did.

ysofunny 4 days ago [-]
oh please, then riddle me this: why does my comment have -1 votes on "hacker" news

which has indeed turned into "i-am-rich-cuz-i-own-tech-stock" news

alwa 4 days ago [-]
I did not contribute a vote either way to your comment above, but I would point out that you get more of what you reward. Maybe the reward is monetary, like an author paid for spending their life writing books. Maybe it’s smaller, more reputational or social—like people who generate thoughtful commentary here, or Wikipedia’s editors, or hobbyists’ forums.

When you strip people’s names from their words, as the specific count here charges; and you strip out any reason or even way for people to reward good work when they appreciate it; and you put the disembodied words in the mouth of a monolithic, anthropomorphized statistical model tuned to mimic a conversation partner… what type of thought is it that becomes abundant in this world you propose, of “data abundance”?

In that world, the only people who still have incentive to create are the ones whose content has negative value, who make things people otherwise wouldn’t want to see: advertisers, spammers, propagandists, trolls… where’s the upside of a world saturated with that?

CaptainFever 4 days ago [-]
Yes, I have no idea either. I find it disappointing.

I think people simply like it when data is liberated from corporations, but hate it when data is liberated from them. (Though this case is a corporation too so idk. Maybe just "AI bad"?)

cess11 4 days ago [-]
I think you'll find that most people aren't comfortable with this in practice. They'd like e.g. the state to be able to keep secrets, such as personal information regarding citizens and the stuff foreign spies would like to copy.
jMyles 4 days ago [-]
Obviously we're all impacted in these perceptions by our bubbles, but it would surprise me at this particular moment in the history of US politics to find that most people favor the existence of the state at all, let alone its ability to keep secret personal information regarding citizens.
goatlover 4 days ago [-]
Most people aren't anarchists, and think the state is necessary for complex societies to function.
jMyles 4 days ago [-]
My sense is that the constituency of people who prefer deprecation of the US state is much larger than just anarchists.
warkdarrior 3 days ago [-]
Would this deprecation of the state include disbanding the police and the armed forces? I'm guessing the people who are for the deprecation of the state would answer quite differently if the question specified details of government functions.
jMyles 3 days ago [-]
...I mean, police are deeply unpopular in the American political consciousness, and have been since prior to their rebrand from "slave patrols" in the 19th century. Surely you recall that, only four years ago, millions of people took to the streets calling for a completion to the unfinished business of abolition?

Obviously the armed forces are much less despised than the police. But given that private gun ownership is at an all-time high (with women and people of color - historically marginalized groups with regard to arms equality - making up the lion's share of the recent increase), I'm not sure that people are feeling particularly vulnerable to invasion either.

Is the state really that popular in your circle? How do people express their esteem? Am I just missing it?

noitpmeder 3 days ago [-]
Sound like you exist in some very insulated bubbles.
cess11 4 days ago [-]
Really? Are Food Not Bombs and the IWW that popular where you live?
guerrilla 4 days ago [-]
Sure, as soon as people have an alternative way to survive.
ashoeafoot 4 days ago [-]
Will we see human-washing, where AI art or works get a "Made by man" final touch in some third-world mechanical turk den? Would that add another financially detracting layer to the AI winter?
Retric 4 days ago [-]
The law generally takes a dim view of such attempts to get around things like that. AI's biggest defense is claiming they are so beneficial to society that what they are doing is fine.
gmueckl 4 days ago [-]
That argument stands on the mother of all slippery slopes! Just find a way to make your product impressive or ubiquitous and all of a sudden it doesn't matter how much you break the law along the way? That's so insane I don't even know where to start.
rcxdude 4 days ago [-]
Why not, considering copyright law specifically has fair use outlined for that kind of thing? It's not some overriding consequence of law, it's that copyright is a granting of a privilege to individuals and that that privilege is not absolute.
ashoeafoot 4 days ago [-]
Worked for Purdue.
Retric 4 days ago [-]
YouTube, AirBnB, Uber, and many, many others have all done stuff that's blatantly against the law but gotten away with it due to utility.
gaganyaan 4 days ago [-]
That is not in any way the biggest defense
Retric 4 days ago [-]
It’s worked for many startups and court cases in the past. Copyright even has many explicit examples of the utility loophole; look at, say: https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....
CuriouslyC 4 days ago [-]
There's no point in having third world mechanical turk dens do finishing passes on AI output unless you're trying to make it worse.

Artists are already using AI to photobash images, and writers are using AI to outline and create rough drafts. The point of having a human in the loop is to tell the AI what is worth creating, then recognize where the AI output can be improved. If we have algorithms telling the AI what to make and content mill hacks smearing shit on the output to make it look more human, that would be the worst of both worlds.

TheDong 4 days ago [-]
I think the point of the comment isn't to have this finishing layer to make things "better", but to make things "legal".

Humans are allowed to synthesize a bunch of inputs together and produce a new, novel copyrighted work.

An algorithm, if it mixes a bunch of copyrighted things together by itself, plausibly is incapable of producing a novel copyright, and instead inherits the old copyright.

Just like Clean Room Design (https://en.wikipedia.org/wiki/Clean-room_design) can be used to re-create the same software free of the original copyright, I think the parent is arguing that a mechanical turk process could allow AI to produce the same output free of the original copyright.

righthand 4 days ago [-]
That will probably happen to some extent if not already. However I think people will just stop publishing online if malicious corps like OpenAI are just going to harvest works for their own gain. People publish for personal gain, not to enrich the public or enrich private entities.
Filligree 4 days ago [-]
However, I get my personal gain regardless of whether or not the text is also ingested into ChatGPT.

In fact, since I use ChatGPT a lot, I get more gain if it is.

righthand 3 days ago [-]
How much of your income have you spent on ChatGPT vs how much ChatGPT has increased your income?
Filligree 3 days ago [-]
ChatGPT doesn’t increase my income. It’s useful for my hobbies, and probably made those more expensive.
doctorpangloss 4 days ago [-]
Did you miss the twist at the end of the article?

> Andrew Deck is a generative AI staff writer at Nieman Lab...

ada1981 4 days ago [-]
I’m still of the opinion that we should be allowed to train on any data a human can read online.
smitelli 4 days ago [-]
…Limited to a human’s average rate of consumption, right?
warkdarrior 3 days ago [-]
Yes, just like my download speed is capped by the speed of me writing bytes on paper.
ada1981 3 days ago [-]
Is there any other information processing we limit to human speed?
bastloing 4 days ago [-]
Who would be forever grateful if openai removed all of The Intercept's content permanently and refused to crawl it in the future?
noitpmeder 3 days ago [-]
Sure, and then do that with every other piece of work they're unfairly using.
bastloing 3 days ago [-]
I'd actually leave it up to the owner. Some want their work removed, some don't care.
dr_dshiv 4 days ago [-]
Proposed: 10% tax as copyright settlement, half to pay past creators and half to pay current creative culture.
dawnerd 3 days ago [-]
Problem with that is it’s too easy to effectively have 0 taxes.
dr_dshiv 2 days ago [-]
I don’t mean 10% of their taxes, I mean charge 10% of their revenue to settle copyright.
cynicalsecurity 4 days ago [-]
Yeah, let's stop the progress because a few magazines no one cares about are unhappy.
a57721 4 days ago [-]
Maybe just don't use data from the unhappy magazines you don't care about in the first place?
efitz 4 days ago [-]
I would trust AI a lot more if it gave answers more like:

“Source A on date 1 said XYX”

“Source B …”

“Synthesizing these, it seems that the majority opinion is X but Y is also a commonly held opinion.”

Instead of what it does now, which is make extremely confident, unsourced statements.

It looks like the copyright lawsuits are rent-seeking as much as anything else; another reason I hate copyright in its current form.

akira2501 4 days ago [-]
> which is make extremely confident,

One of the results the LLM has available to itself is a confidence value. It should, at the very least, provide this along with its answer. Perhaps if it did, people would stop calling it 'AI'.

pavon 4 days ago [-]
My understanding is that this confidence value is not a measure of how likely something is correct/true, but more along the lines of how likely that sentence would be. Including it could be more misleading than helpful, for example if it is repeating commonly misunderstood information.
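
For illustration, here is a minimal sketch of surfacing those numbers, assuming the official OpenAI Python SDK's logprobs option (the model choice and prompt are made up). Note that exp(logprob) measures how expected a token is, not how true it is:

    import math
    from openai import OpenAI  # assumes an API key in OPENAI_API_KEY

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": "Is the sky green?"}],
        logprobs=True,  # ask for per-token log probabilities
    )
    for tok in resp.choices[0].logprobs.content:
        # exp(logprob) is the probability of this token given the preceding
        # text -- a fluency score, not a truthfulness score
        print(f"{tok.token!r}: {math.exp(tok.logprob):.3f}")

Which is exactly why presenting it as "confidence" could mislead: a fluent repetition of a common misconception scores high.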
ethernot 4 days ago [-]
I'm not sure that it's possible to produce anything reasonable in that space. It would need to know how far it is away from correct to provide a usable confidence value otherwise it'd just be hallucinating a number in the same way as the result.

An analogy. Take a former commuter friend of mine, Mr Skol (named after his favourite breakfast drink). Seen on a minibus I had to get to work years ago, we shared many interesting conversations. Now he was a confident expert on everything. If asked to rate his confidence in a subject it would be a good 95% at least. However he spoke absolute garbage because his brain was rotten away from drinking Skol for breakfast, and the odd crack chaser. I suspect his model was still better than GPT-4o. But an average person could determine the veracity of his arguments.

Thus confidence should be externally rated, as an entity with knowledge cannot necessarily rate itself, for it has bias. Which then brings in the question of how you do that. Well, you'd have to do the research you were going to do anyway and compare. So now you've used the AI and done the research you would have had to do if the AI didn't exist. So the AI at this point becomes a cost over benefit if you need something with any level of confidence and accuracy.

Thus the value is zero unless you need crap information, which is, at least here, never, unless I'm generating a picture of a goat driving a train or something. And I'm not sure that has any commercial value. But it's fun at least.

readyplayernull 4 days ago [-]
Do androids dream of Dunning-Kruger?
1vuio0pswjnm7 4 days ago [-]
"It looks like the copyright lawsuits are rent-seeking as much as anything else;"

If an entity charges fees for "AI", then is it "rent-seeking"?

(Assume that the entity is not the author of the training data used)

Paul-E 4 days ago [-]
This is what a number of startups, such as Yurts.ai and Vannevar Labs, are racing to build for organizations. I wouldn't be surprised if, in 5-10 years, most large corps and government agencies had this sort of LLM/RAG setup over their internal documents.
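
The retrieval core of such a setup is small; a rough sketch (the embedding vectors are assumed to come from whatever embeddings API the org already uses, and the function names are made up):

    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        # similarity between two embedding vectors
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query_vec, doc_vecs, docs, k=3):
        # rank internal documents by similarity to the query embedding,
        # then hand the top-k to the LLM as grounding context
        scores = [cosine(query_vec, v) for v in doc_vecs]
        top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
        return [docs[i] for i in top]

The hard parts are everything around this: access control, chunking, and keeping the index fresh.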
CaptainFever 4 days ago [-]
ChatGPT Search provides this, by the way, though it relies a lot on the quality of Bing search results. Consensus.app does this but for research papers, and has been very useful to me.
maronato 4 days ago [-]
More often than not in my experience, clicking these sources takes me to pages that either don’t exist, don’t have the information ChatGPT is quoting, or whose content ChatGPT completely misinterpreted.
zb3 4 days ago [-]
Forecast: OpenAI and The Intercept will settle and OpenAI users will pay for it.
jsheard 4 days ago [-]
Yep, the game plan is to keep settling out of court so that (they hope) no legal precedent is set that would effectively make their entire business model illegal. That works until they run out of money I guess, but they probably can't keep it up forever.
echoangle 4 days ago [-]
Wouldn’t the better method be to throw all your money at one suit you can make an example of, and try to win that one? You can’t effectively settle every single suit if you have no realistic chance of winning; otherwise every single publisher on the internet will come and try to get their money.
gr3ml1n 4 days ago [-]
That's a good strategy, but you have to have the right case. One where OpenAI feels confident they can win and establish favorable precedent. If the facts of the case aren't advantageous, it's probably not worth the risk.
lokar 4 days ago [-]
Too high risk. Every year you can delay you keep lining your pockets.
tokioyoyo 4 days ago [-]
Side question: why don't other companies get the same attention? Anthropic, xAI and others have deep pockets, and scraped the same data, I'm assuming? It could be a gold mine for all these news agencies to keep settling out of court to make a buck.
Shalah 4 days ago [-]
[dead]
olddog2 4 days ago [-]
[flagged]
bogwog 4 days ago [-]
and I bet he's turning the frogs gay too!
luqtas 4 days ago [-]
[flagged]
mort96 4 days ago [-]
I mean it could also be that this is just a case of an OpenAI spokesperson repeating the OpenAI party line? OpenAI's very existence depends on training LLMs on copyrighted works being considered fair use, so I would be extremely surprised if any spokesperson ever hinted that it might not be fair use.

I can see where you're coming from with wanting the government to be more proactive in clamping down on illegal practices. But it's pretty standard, from what I understand, that violations of civil law only have consequences if and when an aggrieved party goes to court.

luqtas 4 days ago [-]
it's a ~$160 billion company that only got this far because of copyright violation

Sora's presentation [0]: on multiple text input examples "... the video does not contain any text or additional objects."

are you gonna say it's a prompt for no text on signs or banners, or is it a way to get rid of subtitles & watermark logos?

[0] https://hpcaitech.github.io/Open-Sora/

mort96 4 days ago [-]
Yeah, as I said, I see where you're coming from. I'm just saying that, as far as I understand, it's not unusual in any way. In fact, it almost seems to be a part of the Silicon Valley VC playbook to blatantly break the law and either grow so huge that the law has to change to accommodate, or cash out before the law breaking has consequences.
agilob 4 days ago [-]
[flagged]
mrweasel 4 days ago [-]
If OpenAI is evading paywalls, then the Aaron Swartz case should be considered precedent. The scale is just much, much larger and it's for financial gains, but not motivated by morals.
sottol 4 days ago [-]
> and it's for financial gains, but not motivated by morals.

Which is why nothing is going to happen, can't have people starting with the latter.

james_sulivan 4 days ago [-]
Meanwhile China is using everything available to train their AI models
paxys 4 days ago [-]
You think China is using uncensored news articles from Western media to train its AI models?
dmead 4 days ago [-]
Yes. And they're being marked as bad during the alignment process.
warkdarrior 3 days ago [-]
For sure. The models are definitely safety tuned after pre-training
goatlover 4 days ago [-]
We don't want to be like China.
tokioyoyo 4 days ago [-]
Fair. But I made a comment somewhere else that, if their models become better than ours, they'll be incorporated into products. Then we're back to being dependent on China for LLM development as well, on top of manufacturing. Realistically that'll be banned because of national security laws or something, but companies tend to choose the path of "best and cheapest" no matter what.
philipwhiuk 4 days ago [-]
It's extremely lousy that you have to pre-register copyright.

That would make the USCO a de facto clearinghouse for news.

throw646577 4 days ago [-]
You don't have to pre-register copyright in any Berne Convention countries. Your copyright exists from the moment you create something.

(ETA: This paragraph below is diametrically wrong. Sorry.)

AFAIK in the USA, registered copyright is necessary if you want to bring a lawsuit and get more than statutory damages, which are capped low enough that corporations do pre-register work.

Not the case in all Berne countries; you don't need this in the UK for example, but then the payouts are typically a lot lower in the UK. Statutory copyright payouts in the USA can be enough to make a difference to an individual author/artist.

As I understand it, OpenAI could still be on the hook for up to $150K per article if it can be demonstrated it is wilful copyright violation. It's hard to see how they can argue with a straight face that it is accidental. But then OpenAI is, like several other tech unicorns, a bad faith manufacturing device.

Loughla 4 days ago [-]
You seem to know more about this than me. I have a family member who "invented" some electronics things. He hasn't done anything with the inventions (I'm pretty sure they're quackery).

But to protect his invention, he mailed himself a sealed copy of the plans. He claims the postage date stamp will hold up in court if he ever needs it.

Is that a thing? Or is it just more tinfoil business? It's hard to tell with him.

WillAdams 4 days ago [-]
It won't hold up in court, and given that the post office will mail/deliver unsealed letters (which may then be sealed after the fact), it will be viewed rather dimly.

Buy your family member a copy of:

https://www.goodreads.com/book/show/58734571-patent-it-yours...

Y_Y 4 days ago [-]
Surely the NSA will retain a copy which can be checked
Tuna-Fish 4 days ago [-]
Even if they did, it in fact cannot be checked. There is precedent that you cannot subpoena NSA for their intercepts, because exactly what has been intercepted and stored is privileged information.
hiatus 4 days ago [-]
> There is precedent that you cannot subpoena NSA for their intercepts

I know it's tangential to this thread but could you link to further reading?

ysofunny 4 days ago [-]
but only in a real democracy
Isamu 4 days ago [-]
Mailing yourself using registered mail is a very old tactic to establish a date for your documents using an official government entity, so this can be meaningful in court. However, this may not provide the protection he needs. Copyright law differs from patent law, and he should seek legal advice.
cma 4 days ago [-]
The US moved to first-to-file years ago. Whoever files first gets it, except that if he publishes it publicly there is a 1-year inventor's grace period (it would not apply to a self-mail or private mail to other people).

This is patent, not copyright.

dataflow 4 days ago [-]
Even if the date is verifiable, what would it even prove? If it's not public then I don't believe it can count as prior art to begin with.
throw646577 4 days ago [-]
Honestly I don't know whether that actually is a meaningful thing to do anymore; it may be with patents.

It certainly used to be a legal device people used.

Essentially it is low-budget notarisation. If your family member believes they have something which is timely and valuable, it might be better to seek out proper legal notarisation, though -- you'd consult a Notary Public:

https://en.wikipedia.org/wiki/Notary_public

blibble 4 days ago [-]
presumably the intention is to prove the existence of the specific plans at a specific time?

I guess the modern version would be to sha256 the plans and shove it into a bitcoin transaction

good luck explaining that to a judge
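
the hashing half at least is trivial (a sketch; the filename is made up, and anchoring the digest on-chain, e.g. in an OP_RETURN output, is left out):

    import hashlib

    # hash the plans; you'd publish/anchor only this digest, never the plans
    with open("plans.pdf", "rb") as f:  # hypothetical filename
        digest = hashlib.sha256(f.read()).hexdigest()
    print(digest)  # 64 hex chars to embed in a timestamped record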

Isamu 4 days ago [-]
Right, you can register before you bring a lawsuit. Pre-registration makes your claim stronger, as does notice of copyright.
dataflow 4 days ago [-]
That's what I thought too, but why does the article say:

> Infringement suits require that relevant works were first registered with the U.S. Copyright Office (USCO).

throw646577 4 days ago [-]
OK so it turns out I am wrong here! Cool.

I have it upside down/diametrically wrong, however you see fit. Right that structures exist, exactly wrong on how they apply.

It is registration that guarantees access to statutory damages:

https://www.justia.com/intellectual-property/copyright/infri...

Without registration you still have your natural copyright, but you would have to try to recover the profits made by the infringer.

Which does sound like more of an uphill struggle for The Intercept, because OpenAI could maybe just say "anything we earn from this is de minimis considering how much errr similar material is errrr in the training set"

Oh man it's going to take a long time for me to get my brain to accept this truth over what I'd always understood.

zerocrates 4 days ago [-]
You have to register to sue, but you have the copyright automatically at the moment the work is created.

You can go register after an infringement and still sue, but you then won't be able to get statutory damages or attorney's fees.

Statutory damages are a big deal in general but especially here where proving how much of OpenAI's revenue is due to your specific articles is probably impossible. Which is why they're suing under this DMCA provision: it's not an infringement suit so the registration requirement doesn't apply, and there's a separate statutory damages provision for it.

pera 4 days ago [-]
> It's hard to see how they can argue with a straight face that it is accidental

It's another instance of "move fast, break things" (i.e. "keep your eyes shut while breaking the law at scale")

renewiltord 4 days ago [-]
Yes, because all progress depends upon the unreasonable man.
whywhywhywhy 4 days ago [-]
It's so weird to me seeing journalists complaining about copyright and people taking something they did.

The whole of journalism is taking the acts of others and repeating them, so why does a journalist claim to have rights over someone else's actions when someone simply looks at something they did and repeats it?

If no one else ever did anything, the journalist would have nothing to report, it's inherently about replicating the work and acts of others.

echoangle 4 days ago [-]
That’s a pretty narrow view of journalism. If you look into newspapers, it’s not just a list of events but also opinion pieces, original research, reports etc. The main infringement isn’t with the basic reporting of facts but with the original part that’s done by the writer.
PittleyDunkin 4 days ago [-]
> The whole of journalism is taking the acts of others and repeating them

Hilarious (and depressing) that this is what people think journalists do.

SoftTalker 4 days ago [-]
What is a "journalist?" It sounds old-fashioned.

They are "content creators" now.

razakel 4 days ago [-]
Or you could just not do illegal and/or immoral things that are worthy of reporting.
barapa 4 days ago [-]
This is terribly unpersuasive