Introducing INDUS: A Suite of Domain-Specific Large Language Models (LLMs) by NASA and IBM Researchers for Enhanced Scientific Research

Researchers from NASA and IBM have introduced INDUS, a suite of Large Language Models designed for scientific fields such as Earth sciences, biology, physics, and more. The suite is meant to address the limitations of existing models such as SCIBERT, BIOBERT, and SCHOLARBERT by covering a wide range of multidisciplinary domains.

The suite includes an encoder model for natural language understanding, a contrastive learning-based text embedding model for information retrieval tasks, and smaller versions of these models for settings where low latency or limited computational resources matter. In addition, the team developed three new scientific benchmark datasets: one for entity recognition related to climate change, one for NASA-specific question answering, and one for information retrieval.
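To make the retrieval use case concrete, here is a minimal sketch of how a text embedding model supports information retrieval: passages are ranked by the cosine similarity between their vectors and the query vector. The three-dimensional vectors and passage names below are made-up placeholders, not INDUS outputs; in practice the vectors would come from the embedding model itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs):
    """Return passage ids sorted from most to least similar to the query."""
    scores = {pid: cosine_similarity(query_vec, vec)
              for pid, vec in passage_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical embeddings (illustrative values only, not model outputs).
passages = {
    "sea_ice_trends": [0.9, 0.1, 0.0],
    "solar_flares": [0.0, 0.2, 0.9],
    "protein_folding": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # stands in for an embedded climate-related question

ranking = rank_passages(query, passages)
print(ranking)
```

With these toy values the climate-related passage ranks first, because its vector points in nearly the same direction as the query vector.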

The team used the Byte Pair Encoding (BPE) technique to build a specialized tokenizer, INDUSBPE, capable of handling domain-specific language. They pre-trained encoder-only models and then fine-tuned them with contrastive learning to produce better sentence embeddings. Smaller, more efficient models were trained using knowledge distillation techniques.
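As an illustration of the general idea behind Byte Pair Encoding, the sketch below learns merge rules from a tiny word list by repeatedly fusing the most frequent adjacent symbol pair. It is a toy, character-level version written for this article, not the procedure used to build INDUSBPE; the function name, the word list, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Each word starts as a tuple of single-character symbols.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically smallest pair
        # (an arbitrary choice made here so the result is deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical mini-corpus of science-flavored words.
corpus = ["aerosol", "aerosol", "aerosols", "aeronomy"]
merges, vocab = learn_bpe_merges(corpus, num_merges=3)
print(merges)
print(dict(vocab))
```

After three merges on this word list, the shared prefix "aero" has become a single symbol. That is the intuition for why a tokenizer trained on scientific text can keep frequent domain terms in fewer pieces than a general-purpose one.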

Experimental results showed that INDUS models outperformed both domain-specific encoders such as SCIBERT and general-purpose models such as RoBERTa on benchmark tasks in the targeted scientific fields. For researchers and practitioners in these fields, the suite offers a stronger starting point for Natural Language Processing work.

Credit for this work goes to the researchers involved in the project. Tanya Malhotra, a student specializing in Artificial Intelligence and Machine Learning, contributed to the review of the research.

Article Source
https://www.marktechpost.com/2024/07/04/nasa-and-ibm-researchers-introduce-indus-a-suite-of-domain-specific-large-language-models-llms-for-advanced-scientific-research/?amp