NASA has collaborated with IBM to create INDUS, a suite of large language models (LLMs) built for scientific research across several fields. The partnership is formalized under Space Act Agreements and led by NASA’s Interagency Implementation and Advanced Concepts Team (IMPACT) in collaboration with IBM.
The INDUS suite includes specialized LLMs for Earth sciences, biological and physical sciences, heliophysics, planetary sciences, and astrophysics. These models are trained on specific scientific data to enhance their accuracy and relevance in their respective domains.
INDUS comprises encoders and sentence transformers, and its vocabulary includes over 50,000 unique scientific terms so that it can handle concepts such as biomarkers and phosphorylated molecules. The encoders use this domain-specific vocabulary to convert natural language text into numerical representations that an LLM can process.
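To illustrate why a domain-specific vocabulary matters, here is a minimal sketch of how a tokenizer turns text into numerical IDs. It uses a greedy longest-match subword split, which is a stand-in chosen for brevity and not necessarily the algorithm INDUS uses. The two toy vocabularies and the function names `tokenize` and `encode` are hypothetical.

```python
def tokenize(word, vocab):
    """Greedy longest-match subword split; continuation pieces start with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No known piece covers this position: fall back to the unknown token.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab_list):
    """Return (tokens, ids) for whitespace-separated text under a toy vocabulary."""
    vocab = set(vocab_list)
    ids = {tok: i for i, tok in enumerate(vocab_list)}
    tokens = [p for w in text.lower().split() for p in tokenize(w, vocab)]
    return tokens, [ids[t] for t in tokens]


# Hypothetical vocabularies: the "domain" one adds a single scientific term.
general_vocab = ["[UNK]", "phosph", "##ory", "##lated", "molecules"]
domain_vocab = general_vocab + ["phosphorylated"]

print(encode("phosphorylated molecules", general_vocab))
# (['phosph', '##ory', '##lated', 'molecules'], [1, 2, 3, 4])
print(encode("phosphorylated molecules", domain_vocab))
# (['phosphorylated', 'molecules'], [5, 4])
```

With the general vocabulary the term is broken into three fragments, while the domain vocabulary keeps it as a single token, so the model sees one ID for one concept.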
NASA reports that the encoders were trained on a corpus of 60 billion tokens, while sentence transformers were fine-tuned on approximately 268 million pairs of text. This approach optimizes INDUS for tasks such as scientific question answering and entity recognition in Earth sciences, with smaller, faster models developed for latency-sensitive applications.
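Sentence transformers map a piece of text to a vector so that related texts score as similar, which is what makes them useful for retrieval. The sketch below shows only the ranking step, using cosine similarity. The `embed` function is a bag-of-words stand-in for a trained model, and the passages, query, and function names are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a sentence transformer: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, top_k=1):
    """Rank passages by similarity to the query and return the top_k."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]


# Hypothetical passages, invented for illustration only.
passages = [
    "solar wind speed measured near earth",
    "soil moisture retrieval from satellite radar",
    "biomarkers in spaceflight rodent samples",
]

print(retrieve("satellite soil moisture", passages))
# ['soil moisture retrieval from satellite radar']
```

A trained sentence transformer would replace `embed` with dense vectors that capture meaning rather than word overlap, but the ranking step would look much the same.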
INDUS has proved effective at retrieving scientific information from large data repositories, supporting applications such as the Open Science Data Repository (OSDR) API. It has also helped categorize publications that reference datasets from NASA’s Goddard Earth Science Data and Information Services Center (GES-DISC).
Dr. Sylvain Costes of NASA’s Biological and Physical Sciences Division highlights the impact of INDUS on data curation efficiency and on user experience within scientific research platforms. Integrating INDUS into NASA’s Science Discovery Engine (SDE) has improved search accuracy and relevance across the agency’s open science data.
The collaboration between NASA and IBM aims to advance AI support for scientific discovery, with INDUS models available to the broader scientific community on platforms such as Hugging Face. Future publications will include reference datasets for climate change, Earth science quality assurance, and information retrieval to help researchers navigate scientific knowledge effectively.
NASA emphasizes the versatility of the INDUS encoder models across scientific domains and the suitability of the retriever models for information retrieval in real-time applications. The collaboration underlines a commitment to using AI to advance scientific research, benefiting researchers and the broader community.
Article Source
https://www.techtimes.com/amp/articles/306067/20240626/nasa-ibm-develop-indus-large-language-models-advanced-science-research.htm