PubMed Miner: AI-Powered Sequence Feature Extraction from Literature

Project Repository Edit

Project Themes:

  • Automated Knowledge Extraction and Curation

Team Name: PubMed Miners

Team Lead(s):

Suggested Team Members and Roles [4-6 members]

NameAffiliationRole / Expertise
Indresh SinghJ. Craig Venter Institute, BV-BRCSoftware engineering, AI/ML
Shubham DeshmukhJ. Craig Venter Institute, BV-BRCSoftware engineering, AI/ML

Project Summary

Sequence features—including codon usage, conserved motifs, structural domains, mutation patterns, and regulatory elements—are central to understanding viral biology and evolution. They define the functional potential of a virus and serve as key indicators for virulence, transmissibility, immune escape, and drug resistance. However, the extraction and integration of sequence features face significant challenges due to limited availability of the latest experimentally identified mutations in published literature. Public databases like the Immune Epitope Database and UniProt do not contain the latest epitopes or mutations that are critical during outbreaks and pandemics, and reviewing published papers is a slow, manual process. This project aims to use AI/LLMs to expedite the identification and validation of sequence features from PubMed literature, enabling rapid characterization of viral strains and supporting outbreak response.

Goals and Objectives

  • Goal 1: Apply AI/LLM to identify sequence features of interest in the latest publications based on virus family or species
  • Goal 2: Eliminate false positive mutations based on gene name, coordinates, or protein function
  • Goal 3 (optional): Explore the applicability of the strategy to other use cases presented during the Codeathon

Approach

The project creates a fully automated, end-to-end mining pipeline that queries PubMed to collect relevant review articles, retrieves open-access full texts from PMC, and analyzes them using free-tier large language models such as Gemini and LLaMA via their respective APIs (Google AI Studio and Groq). These models process cleaned and structured article text to extract meaningful biological information—including sequence features, residue positions, and mutation patterns—presented in organized, tabular formats.

The workflow integrates several intelligent components:

  • Automated PubMed querying and full-text retrieval
  • Text preprocessing and normalization
  • Structured LLM-based extraction
  • Post-processing validation and false positive filtering
  • Cross-checking against gene names, residue coordinates, and protein functions
  • Rule-based consistency verification

The system has been tested on review papers from Chikungunya, Dengue, and Influenza A viruses, achieving approximately 75–85% accuracy aligned with expert-curated annotations. A Streamlit-based interface makes the workflow interactive and user-friendly, enabling researchers to configure queries, manage API keys, and download outputs in standard formats.

Current working code is at https://github.com/Shubhs0411/Pubmed_Miner

Data and Resources Required

Resource TypeSource / LinkDescription / Purpose
DataPubMed corpusSource publications
LLMs / AI ModelsGoogle Gemini, GPT-4, Claude 3Feature extraction
Compute / StorageLocal (small dataset)Processing

Expected Outcomes / Deliverables

  • Enhanced working prototype with improved accuracy
  • Strategy to eliminate false positive results
  • Curated datasets of extracted sequence features
  • Integration concept for BRC resources

Potential Impact and Next Steps

Infectious disease research or surveillance: PubMed Miner will enable timely identification of new epitopes and mutations during outbreaks, supporting rapid calculation of sequence feature variant types and accelerating outbreak response.

AI/ML automation and interpretability: This approach can be applied to other similar use cases including disease/phenotype associations, pathway and molecular function extraction, genomic function annotation, and gene regulation patterns.

Public health preparedness: Rapid identification of data from publications is key to responding effectively to outbreaks or pandemics.

Possible follow-up activities:

  • Include sequence feature identification as part of quarterly or yearly processes based on pathogen priority
  • Identify new use cases and customize code and prompts to extract knowledge from literature

Technical Support Needed

  • Mentor support (LLM expert to review current approach and evaluate alternative models)

Additional Comments

We request mentorship from an LLM expert to review our current approach and evaluate alternative models.