Morality and legality aside, there's a substantive difference between use of con...

addicted · on Dec 16, 2023

“Content” requires as much, if not more, effort and expense than pretraining GPT-4.

All you’re doing is redefining content, i.e. thoughts, ideas, movies, videos, literature, sounds, writing, etc., as “raw data”. But that isn’t raw data. A ton of effort went into creating the “content”. For example, a single Wikipedia page may have many hundreds of contributors, some of whom have done years of college-level study and original research, to produce a few thousand words of content. Others have done research using primary sources. All of them have had to use effort and ingenuity to craft that material into actual high-quality statements, which in itself was only possible in many cases due to years of training and education. Finally, they had to set up a validation process to produce useful output from this collaborative process, which included loads of arguments etc., to generate what you are calling “raw data”.

I’m not sure what makes GPT’s output any less raw than the product of all the effort that went into a single Wikipedia page. Further, Wikipedia actually goes out of its way to cite its sources. GPT is designed to go out of its way to obscure its sources.

The only thing GPT does, IOW, that apparently makes the data it uses “raw” is not cite its sources, something that would at the very least lead to professional disgrace for the people who created the “raw data” GPT uses without thought, and would even lead to lawsuits and prosecution in many cases.

So besides going out of its way to obscure the source of its data, what makes GPT’s output less raw than the output people have spent billions of man hours creating?

ta988 · on Dec 16, 2023

Except that the content already exists and there is no cost to maintain it.

If GPT imposed a non-negligible cost on the content owners by accessing their resources, it might have been different, but that's not the case.

The only thing that content owners may be able to complain about is that ChatGPT/DallE may reduce their potential income, and this would have to be proven. I have not stopped buying books or art of any kind since I started using ChatGPT/DallE. And low-quality automated content producers existed before OpenAI and were already diluting the attention given to more carefully produced content (as can be seen with videos on YouTube).

xiphias2 · on Dec 16, 2023

It seems like you have no idea how much effort it takes to write a book.

Quite often it contains the experience of a life of a person condensed to a few hundred pages.

ChatGPT gives easier access to the knowledge contained in tens of thousands of these books. As for me, I have been reading fewer and fewer books as more wisdom is accessible on the internet in better forms (now GPT).

I'm not against what OpenAI is doing as it moves humanity forward, but like you said I won't stop using ChatGPT just because ByteDance scrapes it.

dougb5 · on Dec 16, 2023

That's great to hear that there's no cost to maintaining content! I'll tell AWS they've been overcharging me :)

ta988 · on Dec 16, 2023

Not what I am saying. I am saying it is much, much smaller than the inference/model running cost. Easy exercise:

How many books do you store in 1GB? How much does it cost per year to store it and have OpenAI gather it once? How much does it cost to run a GPT4-level model that will output 1GB?

That's my point here, that's all. It is a huge cost for OpenAI to run a system that produces dynamic content. And it is not comparable to the cost of storing static content.

I didn't talk about the cost of producing the original data.

And I do not talk about training costs.
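The 1GB exercise above can be sketched as a back-of-envelope calculation. This is only a rough sketch: every field in `Assumptions` and every number in `example` is a hypothetical placeholder chosen for illustration, not a real price or measurement, and the result depends entirely on what you plug in.

```python
from dataclasses import dataclass

BYTES_PER_GB = 1_000_000_000


@dataclass
class Assumptions:
    """All fields are hypothetical inputs; none come from real price lists."""
    avg_book_bytes: int             # size of one plain-text book
    storage_usd_per_gb_year: float  # keeping 1 GB stored for a year
    egress_usd_per_gb: float        # serving 1 GB to a crawler once
    bytes_per_token: float          # average bytes of text per output token
    usd_per_million_tokens: float   # generating 1M output tokens


def books_per_gb(a: Assumptions) -> float:
    """How many books of the assumed size fit in 1 GB."""
    return BYTES_PER_GB / a.avg_book_bytes


def static_cost_usd(a: Assumptions) -> float:
    """Store 1 GB for a year and let it be downloaded once."""
    return a.storage_usd_per_gb_year + a.egress_usd_per_gb


def generation_cost_usd(a: Assumptions) -> float:
    """Have a model generate 1 GB of text at the assumed token price."""
    tokens = BYTES_PER_GB / a.bytes_per_token
    return tokens / 1_000_000 * a.usd_per_million_tokens


# Placeholder values only; replace them with figures you trust.
example = Assumptions(
    avg_book_bytes=500_000,
    storage_usd_per_gb_year=0.25,
    egress_usd_per_gb=0.10,
    bytes_per_token=4.0,
    usd_per_million_tokens=50.0,
)

if __name__ == "__main__":
    print("books per GB:", books_per_gb(example))
    print("static cost (USD):", static_cost_usd(example))
    print("generation cost (USD):", generation_cost_usd(example))
```

Note that this only covers storage plus a single download versus generation; it leaves out the cost of creating the content, ongoing serving costs, and training costs.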

dougb5 · on Dec 16, 2023

Sure, but your comment said "maintain", not "store". Even if storage were free, and even if you discount the value of the initial creation to zero, there are still nontrivial serving costs associated with many sites. What I share with people on the Web may look like a static byte sequence to the robots consuming it, but it takes a lot of work to compute those bytes (in the moment, I mean). Aggregated over the whole web, no, that is not smaller than OpenAI's expenditures.

leereeves · on Dec 16, 2023

If cost is your primary concern, shouldn't you support ByteDance's efforts to reduce inference costs by distilling the model?

(while at the same time reducing future costs for everyone by distributing the capability more widely to prevent monopolization)

ta988 · on Dec 16, 2023

At no point did I say I did not support that.

x86x87 · on Dec 16, 2023

The effort and resources required to train from raw data are nothing compared to the effort and resources that went into producing the "training" input. How much does it cost to produce all the things they scraped from the internet? So morally they are in the wrong - I don't care if it's standard practice since "the beginning of the internet" or not.

addicted · on Dec 16, 2023

It’s also not standard practice since the beginning of the internet. Referencing original input through links is almost foundational to the internet (at least the original internet).

In fact, the power of linking to data sources is what Google is almost entirely built upon.

turquoisevar · on Dec 16, 2023

Others have already pointed out that you’re just shrugging off billions of hours and money that went into the content that is used to (pre-)train a model, so I’ll leave that for what it is.

I’m just curious how you start off with:

> Morality and legality aside

Only to then follow it up immediately with an argument for why one is more moral. Just because you didn’t end it with “and that’s why I think OpenAI is more moral” doesn’t make it any less obvious, or any less ironic.

blehn · on Dec 16, 2023

Morality and legality are the only relevant questions in the discussion. The two methods are virtually the same... in fact I'd argue that ByteDance's usage is more fair and moral. It really doesn't matter that it's cheaper and more efficient.

Palmik · on Dec 16, 2023

The cost of hiring humans to write the trillions of tokens they trained on from scratch would surely be much larger than the training cost. Except they avoided that cost by using what's available on the Internet. [1]

Similarly, people are avoiding the cost of pre-training a GPT-4-class model by scraping its output.

So I think it's fair to question the moral consistency of their ToS.

[1] Please note that I am not passing judgement on this, just stating a fact in order to make an argument.

wgj · on Dec 16, 2023

> Pretraining a GPT 4-class model from raw data requires trillions of tokens and millions of dollars in compute,

And millions of documents authored by people that weren't compensated.

The difference is consolidating all of that value into a single company.

smegger001 · on Dec 16, 2023

And they largely put their work online for free where anyone can read it, not expecting any kind of direct compensation from the reader.

caconym_ · on Dec 16, 2023

> I would point out that this has been standard practice since the advent of the internet

Maybe it shouldn't have been? We've been frog-boiling toward this point for a long time, from a starting point that was generally good for content creators (your content is made more discoverable) to a point that is not so good for content creators (your content is scraped and digested, programmatically laundered and regurgitated on huge corporations' own platforms with token or no attribution, and no revenue shared).

In a parallel universe where search engines were explicitly opt-in from the beginning, I think these conversations would look very different today. What OpenAI and its peers have done would, I dare say, be uncontroversially (and correctly) regarded as theft. Just as I'm not allowed to distribute^[1] software incorporating somebody else's code in a way that violates the terms of its license (or lack thereof), I shouldn't be able to distribute software that incorporates any intellectual property that I don't have the rights to.

^[1] Broadly speaking.