With the rise of artificial intelligence, the origin of the data used to train large language models such as GPT-4 has come under debate, as has the right of the companies building these models to use that data. In some cases, the question has even reached the courts.

These models are trained using large volumes of data, including content extracted from various websites. This process, known as “web scraping,” is a common practice in research, journalism, and digital archiving. However, some website owners may have reservations about how their content is used in this particular context.

As a result, both OpenAI and Google have recently published guidance for website owners who want to keep their sites’ content out of the huge datasets used to train this class of AI models.



