HiPerRAG for Literature-based Data Extraction on Priority Pathogens

Project Themes:

Automated Knowledge Extraction and Curation

Team Lead(s):

Name: Ozan Gokdemir
Affiliation: Argonne National Laboratory, BV-BRC/ANL
Email: ogokdemir@anl.gov

Suggested Team Members and Roles [4-6 members]

Name	Affiliation	Role / Expertise
Blessy Antony	Virginia Tech (Non-BRC)	Data integration and validation
James McFeeters	CViSB (Non-BRC)	Scientific Advisor
Maliha Aziz	George Washington University (Non-BRC)	Data integration and validation
Ozan Gokdemir	Argonne National Laboratory, BV-BRC/ANL	AI/ML Engineer, Data Curation Lead
Yitian Chen	Scripps Research (Non-BRC)	Priority Pathogen Data Alignment

Project Summary

This project uses HiPerRAG, a high-performance retrieval-augmented generation system optimized for large scientific corpora, to extract and curate structured data for priority pathogens. It targets key relationship types such as protein–protein interactions (PPIs), host–pathogen interactions, and drug–protein binding data, with the aim of producing curated, machine-readable datasets for integration with BV-BRC knowledgebases.

Goals and Objectives

Goal 1: Define target data types relevant to BV-BRC (e.g., PPIs, drug–protein interactions)
Goal 2: Deploy HiPerRAG on relevant literature corpora to extract structured relationships
Goal 3: Generate curated datasets for 1–2 priority pathogens (e.g., Nipah, Lassa)
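
As an illustration of Goal 1, the target data types could be captured in a small record schema like the one sketched below. This is only a sketch under stated assumptions: the class names, field names, and confidence range are hypothetical and are not taken from HiPerRAG or from any BV-BRC schema. The entity names in the usage note are placeholders, not real findings.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class RelationType(str, Enum):
    """The three relationship types named in the Project Summary."""
    PROTEIN_PROTEIN = "protein_protein_interaction"
    HOST_PATHOGEN = "host_pathogen_interaction"
    DRUG_PROTEIN = "drug_protein_binding"


@dataclass(frozen=True)
class ExtractedRelation:
    """One machine-readable relation; all field names are hypothetical."""
    entity_a: str
    entity_b: str
    relation_type: RelationType
    pathogen: str
    source_id: str     # identifier of the source paper (assumed field)
    evidence: str      # sentence or passage supporting the relation
    confidence: float  # assumed to lie in [0, 1]

    def __post_init__(self) -> None:
        # Reject records whose confidence falls outside the assumed range.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def to_json(self) -> str:
        """Serialize to a JSON string with the enum stored as plain text."""
        record = asdict(self)
        record["relation_type"] = self.relation_type.value
        return json.dumps(record, sort_keys=True)
```

A record would then be built as `ExtractedRelation("PROTEIN_A", "PROTEIN_B", RelationType.PROTEIN_PROTEIN, "Nipah", "PMID:PLACEHOLDER", "placeholder evidence sentence", 0.9)` and written out with `to_json()`. Keeping the source identifier and evidence text on every record is one way to make later human validation straightforward.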

Approach

HiPerRAG will be configured to parse biomedical literature and extract relations using fine-tuned retrieval and extraction modules. The system’s hybrid pipeline combines dense retrieval, passage reranking, and LLM-based summarization to produce high-confidence knowledge graphs. The team will test both automated and human-in-the-loop curation pipelines.
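
The flow described above (retrieve, rerank, extract, then route to automated or human-in-the-loop curation) can be sketched roughly as follows. This is not HiPerRAG's API. Every function name, the 0.8 threshold, and the callback signatures are assumptions made for illustration. In particular, `embed` is a toy bag-of-words stand-in for a real dense encoder, and the reranker and LLM extractor are passed in as plain callables.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: list[str], k: int = 3) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float]) -> list[str]:
    """Reorder candidates with a (hypothetical) reranker score."""
    return sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)


def triage(relations: list[dict], threshold: float = 0.8):
    """Split relations into auto-accepted and human-review queues."""
    accepted = [r for r in relations if r["confidence"] >= threshold]
    review = [r for r in relations if r["confidence"] < threshold]
    return accepted, review


def run_pipeline(query: str, passages: list[str],
                 score_fn: Callable[[str, str], float],
                 extract_fn: Callable[[str], list[dict]],
                 k: int = 3, threshold: float = 0.8):
    """Retrieve, rerank, extract, then triage by confidence."""
    ordered = rerank(query, retrieve(query, passages, k), score_fn)
    relations = [rel for passage in ordered for rel in extract_fn(passage)]
    return triage(relations, threshold)
```

Here `extract_fn` stands in for the LLM-based extraction step and is expected to return dictionaries that carry a `confidence` key. Relations below the threshold land in the review queue, which is one simple way to compare the automated and human-in-the-loop curation paths.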

Data and Resources Required

Resource Type	Source / Link	Description / Purpose
Data	PubMed, BV-BRC text corpora	Literature sources for entity/relation extraction
Tools / Services	HiPerRAG (arXiv:2505.04846)	RAG-based extraction framework
LLMs / AI Models	Mistral Large, GPT-4 via Rhea	Entity normalization and summarization
Compute / Storage	Argonne HPC, BRC clusters	Parallel literature processing

Expected Outcomes / Deliverables

Curated datasets of structured biological relationships for priority pathogens, integrated into BV-BRC pipelines.

Potential Impact and Next Steps

This project demonstrates scalable AI-driven literature mining for infectious disease research. It will enable automated knowledge enrichment and accelerate understanding of pathogen biology, supporting BV-BRC’s informatics goals.

Technical Support Needed

Datasets preloaded
GPU / LLM access
Mentor support

Additional Comments