OpenAI is about to unleash a web crawler that will devour the entire open web
If you don’t want your online material used to train AI, you’ll need to opt out.
To increase the data available for training its next generation of AI systems, OpenAI has released a new web crawler called GPTBot. The company has also filed to trademark the name “GPT-5,” hinting at an impending release, and has told website owners how to keep their content from being crawled.
According to OpenAI, the web crawler will gather data from publicly accessible websites while avoiding content that is behind paywalls, sensitive, or otherwise off-limits. However, like the crawlers run by Google, Bing, and Yandex, GPTBot is opt-out: by default, it treats any publicly available data as fair game. The owner of a website can keep the OpenAI crawler out by adding a “disallow” rule to robots.txt, a standard file on the web server.
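For site owners who want to verify the opt-out before relying on it, the rule can be checked locally. The following is a minimal sketch, assuming the crawler identifies itself with the user-agent token GPTBot and honors standard robots.txt rules; the example.com URL, the SomeOtherBot name, and the is_allowed helper are placeholders invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content: block GPTBot everywhere, allow all other crawlers.
# (Assumes the crawler's user-agent token is "GPTBot".)
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on a placeholder domain.
PAGE = "https://example.com/article"

gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", PAGE)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", PAGE)

print("GPTBot allowed:", gptbot_allowed)
print("Other crawlers allowed:", other_allowed)
```

The check only shows how a compliant crawler would interpret the file. robots.txt is a voluntary convention, so it does not technically prevent a crawler that ignores it from fetching the page.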
OpenAI further claims that scraped material will be proactively filtered to remove personally identifiable information (PII) and content that violates its policies.
However, some technology ethics experts say the opt-out approach still raises concerns about consent.
In defence of OpenAI’s move, several Hacker News commenters argued that the company needs to collect as much data as possible if the public is ever going to have access to a powerful generative AI tool. Another user warned that, without more recent information, the GPT models’ knowledge will remain frozen at their September 2021 cutoff. One more privacy-minded user objected that “OpenAI isn’t even citing in moderation.”
The release of GPTBot comes after OpenAI came under fire for improperly using scraped data to train LLMs such as those behind ChatGPT. In April, the firm revised its privacy policy in response to similar criticisms.