I’m not justifying what OpenAI did, but nobody is stopping ByteDance from doing what OpenAI did. They can also use the world’s information. Instead, since OpenAI has “cleaned” the data, they are trying to use OpenAI’s cleaned dataset. After OpenAI spent endless amounts of money on that, I’m not surprised they don’t want others to steal their “cleaned” dataset.
The massive illegal scraping of data on the internet is an "only done once" type of deal. Now that platforms have learned of the abuse OpenAI has engaged in, content platforms are gated and under access controls. You can't access NSFW content on Reddit without logging in, for reference[1]. You could before the OpenAI buzz existed. The point of the illegal scraping is the first-mover advantage. Subsequent scrapings will not be as easy. This is also the reason why we could send FBI agents to OpenAI to bust their servers and delete the training data. Afterwards, gathering this data again would be much harder, thus delaying any kind of LLM "progress" in the future. For LLM skeptics, this is a dream. Jail the executives, send in feds to light the server rooms on fire.
[1] still works on old.reddit.com
Reddit gating NSFW content with login is pretty obviously a play to increase signups and therefore engagement. Making scraping less feasible might just be a bonus, but attributing the whole thing to that is a stretch.
There are stories all over the web of content houses locking down their stuff after they found out OAI was benefitting commercially from harvesting it. This hasn't been true for at least a year. See Reddit.
I think GP is pointing out that someone who spends years building a large online gallery of their artwork, only for it to be smushed into a pool of vector mush, has the same reason to prevent OpenAI from using their work as OpenAI does to prevent competitors from using its artisanally laundered dataset.
It doesn’t matter how much money they spend, they’re going to have to contend with the fact that the value they ship is derived from others’ work. It’s just diluted to the point of becoming “data” rather than “artworks”.
The actual content is the clean stuff. If you disagree, then you accept that OpenAI could just create all the content themselves instead of scraping, which is comparatively trivial.