Hacker News

I'm not justifying what OpenAI did, but nobody is stopping ByteDance from doing what OpenAI did. They can also use the world’s information. Instead, since OpenAI has “cleaned” the data, they are trying to use OpenAI’s cleaned dataset. After OpenAI spent endless amounts of money on that, I'm not surprised they don’t want others to steal their “cleaned” dataset.


The massive illegal scraping of data on the internet is an "only done once" type of deal. Now that platforms have learned of the abuse OpenAI has engaged in, content platforms are gated and under access controls. You can't access NSFW content on Reddit without logging in, for reference[1]. You could before the OpenAI buzz existed. The point of the illegal scraping is the first-mover advantage. Subsequent scrapings will not be as easy. This is also the reason why we could send FBI agents to OpenAI to bust their servers and delete the training data. Afterwards, gathering this data again would be much harder, thus delaying any kind of LLM "progress" in the future. For LLM skeptics, this is a dream. Jail the executives, send in feds to light the server rooms on fire.

[1] still works on old.reddit.com


Reddit gating NSFW content with login is pretty obviously a play to increase signups and therefore engagement. Making scraping less feasible might just be a bonus, but attributing the whole thing to that is a stretch.


> You can't access NSFW content on Reddit without logging in

Sorry, what? You think Reddit is trying to prevent OpenAI from scraping the porn subreddits???


There is quite a bit of content that is not porn marked as nsfw on Reddit.


There are stories all over the web of content houses locking down their stuff after they found out OAI was benefitting commercially from harvesting it. This hasn't been true for at least a year. See Reddit.


I think GP is pointing out that someone who spends years building a large online gallery of their artwork, only for it to be smushed into a pool of vector mush, has the same reason to prevent OpenAI from using their work as OpenAI does to prevent competitors from using their artisanally laundered dataset.

It doesn’t matter how much money they spend, they’re going to have to contend with the fact that the value they ship is derived from others’ work. It’s just diluted to the point of becoming “data” rather than “artworks”.


>"After OpenAI spent endless amounts of money on that, I'm not surprised they don’t want others to steal their “cleaned” dataset."

And let's say I do not want them to clean up and then use my data for profit.


This makes no sense lol. The information OpenAI is using is cleaned to begin with.


Raw text from a website, including header text, footers, links, images, etc., is very dirty stuff.


The actual content is the clean stuff. If you disagree then you accept OpenAI could just create all the content themselves instead of scraping, which is comparatively trivial.



