IBM recently announced the release of its open source Granite 13B LLM, designed for enterprise applications. Armand Ruiz, IBM's vice president of AI platform products, has now shared details of the 6.48TB dataset used to train Granite 13B.
After thorough preprocessing, this dataset was reduced to 2.07TB, a 68% reduction. According to Ruiz, this reduction was crucial to ensuring a high-quality, unbiased, ethical, and legally compliant dataset suited to business scenarios.
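The reported figures are internally consistent; a quick check of the reduction percentage:

```python
raw_tb = 6.48       # reported raw dataset size in TB
filtered_tb = 2.07  # reported size after preprocessing
reduction = 1 - filtered_tb / raw_tb
print(f"{reduction:.0%}")  # -> 68%
```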
The dataset was carefully curated from various sources including arXiv, Common Crawl, DeepMind Mathematics, Free Law, Clean GitHub, Hacker News, OpenWeb Text, Project Gutenberg, PubMed Central, SEC Filings, Stack Exchange, USPTO, Web Pages, and Wikimedia.
The preprocessing involved several key steps: text extraction; deduplication; language identification; sentence splitting; hate, abuse, and profanity (HAP) annotation; document quality annotation; URL blocklist annotation; filtering; and tokenization. Annotating and filtering documents against defined criteria ensured that the final dataset was of high quality for training models.
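To make the shape of such a pipeline concrete, here is a minimal sketch of a combined deduplication and filtering stage. This is not IBM's actual implementation; the field names, blocklist handling, and quality threshold are all hypothetical, and the article does not describe how quality scores are computed.

```python
import hashlib

def dedup_and_filter(documents, blocklist, min_quality=0.5):
    """Toy pipeline stage: exact-hash deduplication plus URL-blocklist
    and quality-score filtering. Fields and thresholds are illustrative,
    not IBM's actual criteria."""
    seen = set()
    kept = []
    for doc in documents:
        # Deduplication: drop documents whose text hashes identically.
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # URL blocklist filtering.
        if doc.get("url", "") in blocklist:
            continue
        # Document quality filtering (score assumed precomputed upstream).
        if doc.get("quality", 0.0) < min_quality:
            continue
        kept.append(doc)
    return kept

docs = [
    {"text": "Granite is an enterprise LLM.", "url": "a.com", "quality": 0.9},
    {"text": "Granite is an enterprise LLM.", "url": "b.com", "quality": 0.8},
    {"text": "spam spam spam", "url": "c.com", "quality": 0.1},
]
print(len(dedup_and_filter(docs, blocklist={"d.com"})))  # -> 1
```

Production pipelines typically add fuzzy deduplication (e.g. MinHash) on top of exact hashing, but the exact-match version above is enough to show where each annotation step fits.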
IBM has also introduced four versions of the Granite code model, ranging from 3 billion to 34 billion parameters. These models have undergone rigorous benchmark testing and have outperformed comparable models such as Code Llama and Llama 3 across multiple tasks.
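For readers who want to try the code models, a minimal sketch of loading one with the Hugging Face transformers library follows. The model ID is assumed to follow IBM's ibm-granite naming on the Hub and may differ from the exact published checkpoint names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3b-code-base"  # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short code completion from a simple prompt.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```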
Overall, the release of the Granite 13B LLM and the accompanying dataset details represent a significant advancement in the field of language models, especially for enterprise applications. Ruiz's emphasis on high-quality, diverse, and ethical datasets underscores IBM's commitment to developing cutting-edge AI technologies for real-world use cases.
Article Source
https://analyticsindiamag.com/ibm-reveals-its-entire-6-48-tb-llm-training-dataset/