A recent study suggests that artificial intelligence systems like ChatGPT may soon exhaust the vast supply of publicly available text used to train them, potentially creating a bottleneck in AI development between 2026 and 2032. Companies like OpenAI and Google are already scrambling to secure high-quality data sources for their AI models, but there are concerns about the long-term sustainability of the current trajectory. The study predicts that within the next two to eight years, the field could struggle to maintain its rapid progress for lack of new blogs, news articles, and social media comments with which to train models effectively.
While some argue that building ever-larger models is not the only way to improve AI performance, concerns have been raised about leaning too heavily on existing data sources. Training on AI-generated data can degrade performance and encode the biases and errors present in the original information. As a result, attention is increasingly turning to how human-generated data, such as that found on sites like Reddit and Wikipedia, is used for AI training.
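To make the degradation concern concrete, the toy sketch below is a hypothetical illustration, not something from the article or the study: it repeatedly fits a simple Gaussian "model" to its own samples and shows how the learned distribution tends to narrow over generations, one common intuition for why recursive training on model output can erode quality.

```python
# Toy illustration (assumption, not the article's method): a model trained
# repeatedly on its own samples tends to lose spread over generations.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 "human" data: samples from a broad distribution.
data = rng.normal(loc=0.0, scale=1.0, size=10_000)

for generation in range(10):
    # Fit a simple Gaussian model to the current training data.
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation trains only on a finite sample drawn from the
    # fitted model; the re-estimated spread drifts and typically shrinks.
    data = rng.normal(loc=mu, scale=sigma, size=200)
```

Running it shows the estimated standard deviation drifting downward across generations, a small-scale analogue of the "performance degradation" the article describes.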
Companies like OpenAI are already exploring methods to generate synthetic data to train their models, but there are reservations about relying too heavily on this approach. Some argue that human-generated content should continue to be incentivized and protected, as it remains a valuable source of training data for AI systems. Overall, the study suggests that finding a balance between high-quality human data and synthetic data will be crucial for driving the next generation of AI models.
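As a rough sketch of what balancing the two sources could look like in practice, the snippet below defines a hypothetical `sample_training_batch` helper that mixes human-written and synthetic examples at a chosen ratio. The names, placeholder data, and 70/30 split are assumptions for illustration only, not details from the article or any company's actual pipeline.

```python
# Hypothetical sketch: blend human-written and synthetic text when building
# a training batch. All names and data here are illustrative assumptions.
import random

human_texts = ["human-written example A", "human-written example B"]          # e.g. articles, forum posts
synthetic_texts = ["model-generated example C", "model-generated example D"]  # produced by an existing model

def sample_training_batch(batch_size: int, human_fraction: float = 0.7) -> list[str]:
    """Draw a batch in which roughly `human_fraction` of examples are human-written."""
    batch = []
    for _ in range(batch_size):
        pool = human_texts if random.random() < human_fraction else synthetic_texts
        batch.append(random.choice(pool))
    return batch

print(sample_training_batch(8))
```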
Article Source
https://fortune.com/2024/06/06/ai-training-bottleneck-google-meta-openai/