Morality and legality aside, there's a substantive difference between use of content and use of a model. Pretraining a GPT-4-class model from raw data requires trillions of tokens and millions of dollars in compute, whereas distilling a model using GPT-4's output requires orders of magnitude less data. Add to that the fact that OpenAI is probably subsidizing compute at their current per-token cost, and it's clearly unsustainable.
The morality of training on internet-scale text data is another discussion, but I would point out that this has been standard practice since the advent of the internet, both for training smaller models and for fueling large tech companies such as Google. Broadly speaking, there is nothing wrong with mere consumption. What gets both morally and legally more complex is production - how much are you allowed to synthesize from the training data? And that is a fair question.
“Content” requires as much, if not more, effort and expense than pretraining GPT-4.
All you’re doing is redefining content, i.e. thoughts, ideas, movies, videos, literature, sounds, writing, etc., as “raw data”. But that isn’t raw data. A ton of effort went into creating the “content”. For example, a single Wikipedia page may have many hundreds of contributors, some of whom have done years of college-level study and original research, to produce a few thousand words of content. Others have done research using primary sources. All of them have had to use effort and ingenuity to craft those into actual high-quality statements, which in itself was only possible in many cases due to years of training and education. Finally, they had to set up a validation process to produce useful output from this collaborative process, which included loads of arguments etc., to generate what you are calling “raw data”.
I’m not sure what makes GPT’s output any less raw than all the effort that went into producing a single Wikipedia page. Further, Wikipedia actually goes out of its way to cite its sources. GPT is designed to go out of its way to obscure its sources.
The only thing GPT does, IOW, that apparently makes the data it uses “raw” is to not cite its sources, something that would at the very least lead to professional disgrace for the people who created the “raw data” GPT uses without thought, and would even lead to lawsuits and prosecution in many cases.
So besides going out of its way to obscure the source of its data, what makes GPT’s output less raw than the output people have spent billions of man hours creating?
Except that the content already exists and there is no cost to maintain it.
If GPT incurred a non-negligible cost on the content owners by accessing their resources, it might have been different, but that's not the case.
The only thing that content owners may be able to complain about is that ChatGPT/DallE may reduce their potential income, and this would have to be proven. I have not stopped buying books or art of any kind since I started using ChatGPT/DallE. And low-quality automated content producers existed before OpenAI and were already diluting the attention paid to more carefully produced content (as can be seen with videos on YouTube).
It seems like you have no idea how much effort it takes to write a book.
Quite often it contains a person's lifetime of experience condensed into a few hundred pages.
ChatGPT gives easier access to the knowledge contained in tens of thousands of these books. As for me, I have been reading fewer and fewer books as more wisdom is accessible on the internet in better forms (now GPT).
I'm not against what OpenAI is doing as it moves humanity forward, but like you said I won't stop using ChatGPT just because ByteDance scrapes it.
Not what I am saying. I am saying it is much, much smaller than the inference/model running cost.
Easy exercise:
How many books can you store in 1 GB?
How much does it cost per year to store it and have OpenAI gather it once?
How much does it cost to run a GPT-4-level model that will output 1 GB?
That's my point here that's all. It is a huge cost for OpenAI to run a system that produces dynamic content. And it is not comparable to the cost of storing static content.
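For anyone who wants to actually run the exercise above, here is a rough sketch of the arithmetic. Every number passed in is a placeholder assumption, not a real price: the average book size, the storage and egress rates, the bytes-per-token ratio and the per-token price are all made up for illustration, and a per-token price is only a stand-in for what running the model really costs. The function names are hypothetical too.

```python
# Back-of-envelope version of the 1 GB exercise.
# Every input value used in the example calls is an ASSUMPTION for illustration.

BYTES_PER_GB = 10**9


def books_per_gb(avg_book_bytes: int) -> int:
    """How many plain-text books of the assumed size fit in 1 GB."""
    return BYTES_PER_GB // avg_book_bytes


def static_cost_per_year(storage_usd_per_gb_year: float,
                         egress_usd_per_gb: float,
                         fetches: int = 1) -> float:
    """Cost of keeping 1 GB online for a year and serving it `fetches` times."""
    return storage_usd_per_gb_year + egress_usd_per_gb * fetches


def generation_cost(usd_per_1k_tokens: float, bytes_per_token: float) -> float:
    """Cost of having a model emit 1 GB of text at an assumed per-token price."""
    tokens = BYTES_PER_GB / bytes_per_token
    return tokens / 1000 * usd_per_1k_tokens


if __name__ == "__main__":
    # Placeholder inputs: swap in figures you trust.
    books = books_per_gb(avg_book_bytes=500_000)
    static = static_cost_per_year(storage_usd_per_gb_year=0.25,
                                  egress_usd_per_gb=0.10)
    dynamic = generation_cost(usd_per_1k_tokens=0.06, bytes_per_token=4.0)
    print(f"books in 1 GB: {books}")
    print(f"store + serve once per year: ${static:.2f}")
    print(f"generate 1 GB with a model: ${dynamic:.2f}")
```

This only covers storing and serving static bytes versus generating them; it says nothing about the cost of producing the original content or of computing dynamic pages on the serving side.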
I didn't talk about the cost of producing the original data.
Sure, but your comment said "maintain", not "store". Even if storage were free, and even if you discount the value of the initial creation to zero, there are still nontrivial serving costs associated with many sites.
What I share with people on the Web may look like a static byte sequence to the robots consuming it, but it takes a lot of work to compute those bytes (in the moment, I mean). Aggregated over the whole web, no, that is not smaller than OpenAI's expenditures.
The effort and resources required to train from raw data are nothing compared to the effort and resources that went into producing the "training" input. How much does it cost to produce all the things they scraped from the internet? So morally they are in the wrong - I don't care if it's standard practice since "the beginning of the internet" or not.
It’s also not standard practice since the beginning of the internet. Referencing original input through links is almost foundational to the internet (at least the original internet).
In fact, the power of linking to data sources is what Google is almost entirely built upon.
Others have already pointed out that you’re just shrugging off billions of hours and money that went into the content that is used to (pre-)train a model, so I’ll leave that for what it is.
I’m just curious how you start off with:
> Morality and legality aside
Only to then follow it up immediately with an argument for why one is more moral.
Just because you didn’t end it with “and that’s why I think OpenAI is more moral,” doesn’t mean it’s not obvious, or any less ironic.
Morality and legality are the only relevant questions in the discussion. The two methods are virtually the same... in fact I'd argue that ByteDance's usage is more fair and moral. It really doesn't matter that it's cheaper and more efficient.
The cost of hiring humans to write the trillions of tokens they trained on from scratch would surely be much larger than the training cost. Except they avoided that cost by using what's available on the Internet. [1]
Similarly, people are avoiding the cost of pre-training a GPT-4-class model by scraping its output.
So I think it's fair to question the moral consistency of their ToS.
[1] Please note that I am not passing a judgement on this, just stating a fact in order to make an argument.
> I would point out that this has been standard practice since the advent of the internet
Maybe it shouldn't have been? We've been frog-boiling toward this point for a long time, from a starting point that was generally good for content creators (your content is made more discoverable) to a point that is not so good for content creators (your content is scraped and digested, programmatically laundered and regurgitated on huge corporations' own platforms with token or no attribution, and no revenue shared).
In a parallel universe where search engines were explicitly opt-in from the beginning, I think these conversations would look very different today. What OpenAI and its peers have done would, I dare say, be uncontroversially (and correctly) regarded as theft. Just as I'm not allowed to distribute^[1] software incorporating somebody else's code in a way that violates the terms of its license (or lack thereof), I shouldn't be able to distribute software that incorporates any intellectual property that I don't have the rights to.