AI Law - International Review of Artificial Intelligence Law
G. Giappichelli Editore

26/06/2024 - The Impending Exhaustion of Online Textual Data by AI Models (USA)

Topic: Notizie/News - Digital Governance

According to an article from Live Science, AI systems could exhaust all available online textual data by 2026. AI models like GPT-4 rely on vast amounts of text from the internet to improve their capabilities, and researchers estimate that the high-quality data required for training such models will run out between 2026 and 2032. This shortage could push tech companies toward other sources, including synthetic data or private data stored on company servers. The study, published on arXiv by researchers at Epoch AI, outlines potential challenges and solutions for future AI development.

AI advancements depend heavily on large datasets to identify complex patterns. For instance, ChatGPT was reportedly trained on approximately 570 GB of text data. The study used Google's web index and IP traffic analysis to estimate the current and future availability of text data online: high-quality data might be exhausted by 2032, while lower-quality data could last until 2050. Neural scaling laws indicate that model performance improves predictably with the volume of training data, which is why data supply matters so much for future progress.
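To make the scaling-law point concrete, the relationship is commonly modeled as a power law in dataset size, loss(D) ≈ (D_c / D)^α. The sketch below is illustrative only: the constants `d_c` and `alpha` are taken in the spirit of published scaling-law fits, not from the Epoch AI study, and the function is a simplification that ignores model size and compute.

```python
def loss_from_data(tokens: float, d_c: float = 5.4e13, alpha: float = 0.095) -> float:
    """Predicted loss as a power law in dataset size (tokens).

    d_c and alpha are illustrative constants, not values from the study.
    """
    return (d_c / tokens) ** alpha

# Each doubling of the data shrinks loss by the same constant factor (2**-alpha),
# so absolute gains diminish as datasets grow -- the core reason data supply,
# and hence its projected exhaustion, constrains future improvements.
for tokens in (1e12, 2e12, 4e12):
    print(f"{tokens:.0e} tokens -> predicted loss {loss_from_data(tokens):.3f}")
```

Under this model, exhausting the stock of quality text caps `tokens`, which in turn caps how far the loss can be driven down by data alone.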

Companies may turn to private data or synthetic data to address this scarcity. The use of private data, however, raises legal and ethical concerns and could give rise to significant litigation. Energy consumption is another growing concern: AI-powered searches use significantly more electricity than traditional web searches, prompting some tech companies to explore alternative energy sources such as nuclear fusion.