PubMed Miner: AI-Powered Sequence Feature Extraction from Literature

Project Repository Video Presentation Edit

Project Themes:

Automated Knowledge Extraction and Curation

Team Name: PubMed Miners

Team Lead(s):

Name: Indresh Singh / Shubham Deshmukh
Affiliation: J. Craig Venter Institute, BV-BRC
Email: isingh@jcvi.org, sdeshmuk@jcvi.org

Suggested Team Members and Roles [4-6 members]

Name	Affiliation	Role / Expertise
Allan Dickerman	University of Virginia, BV-BRC
Gordon Pusch	Argonne National Laboratory, BV-BRC/ANL
Hasin Rehana	University of North Dakota (Non-BRC)
Indresh Singh	J. Craig Venter Institute, BV-BRC	Software engineering, AI/ML
Sayed Asaduzzaman	University of North Dakota (Non-BRC)
Shubham Deshmukh	Virginia Tech (Non-BRC)	Software engineering, AI/ML

Project Summary

Sequence features—including codon usage, conserved motifs, structural domains, mutation patterns, and regulatory elements—are central to understanding viral biology and evolution. They define the functional potential of a virus and serve as key indicators for virulence, transmissibility, immune escape, and drug resistance. However, the extraction and integration of sequence features face significant challenges due to limited availability of the latest experimentally identified mutations in published literature. Public databases like the Immune Epitope Database and UniProt do not contain the latest epitopes or mutations that are critical during outbreaks and pandemics, and reviewing published papers is a slow, manual process. This project aims to use AI/LLMs to expedite the identification and validation of sequence features from PubMed literature, enabling rapid characterization of viral strains and supporting outbreak response.

Goals and Objectives

Goal 1: Apply AI/LLM to identify sequence features of interest in the latest publications based on virus family or species
Goal 2: Eliminate false positive mutations based on gene name, coordinates, or protein function
Goal 3 (optional): Explore the applicability of the strategy to other use cases presented during the Codeathon

Approach

The project creates a fully automated, end-to-end mining pipeline that queries PubMed to collect relevant review articles, retrieves open-access full texts from PMC, and analyzes them using free-tier large language models such as Gemini and LLaMA via their respective APIs (Google AI Studio and Groq). These models process cleaned and structured article text to extract meaningful biological information—including sequence features, residue positions, and mutation patterns—presented in organized, tabular formats.

The workflow integrates several intelligent components:

Automated PubMed querying and full-text retrieval
Text preprocessing and normalization
Structured LLM-based extraction
Post-processing validation and false positive filtering
Cross-checking against gene names, residue coordinates, and protein functions
Rule-based consistency verification

The system has been tested on review papers from Chikungunya, Dengue, and Influenza A viruses, achieving approximately 75–85% accuracy aligned with expert-curated annotations. A Streamlit-based interface makes the workflow interactive and user-friendly, enabling researchers to configure queries, manage API keys, and download outputs in standard formats.

Current working code is at https://github.com/Shubhs0411/Pubmed_Miner

Data and Resources Required

Resource Type	Source / Link	Description / Purpose
Data	PubMed corpus	Source publications
LLMs / AI Models	Google Gemini, GPT-4, Claude 3	Feature extraction
Compute / Storage	Local (small dataset)	Processing

Expected Outcomes / Deliverables

Enhanced working prototype with improved accuracy
Strategy to eliminate false positive results
Curated datasets of extracted sequence features
Integration concept for BRC resources

Potential Impact and Next Steps

Infectious disease research or surveillance: PubMed Miner will enable timely identification of new epitopes and mutations during outbreaks, supporting rapid calculation of sequence feature variant types and accelerating outbreak response.

AI/ML automation and interpretability: This approach can be applied to other similar use cases including disease/phenotype associations, pathway and molecular function extraction, genomic function annotation, and gene regulation patterns.

Public health preparedness: Rapid identification of data from publications is key to responding effectively to outbreaks or pandemics.

Possible follow-up activities:

Include sequence feature identification as part of quarterly or yearly processes based on pathogen priority
Identify new use cases and customize code and prompts to extract knowledge from literature

Technical Support Needed

Mentor support (LLM expert to review current approach and evaluate alternative models)

Additional Comments

We request mentorship from an LLM expert to review our current approach and evaluate alternative models.