Automated Knowledge Extraction and Curation

Project Theme

Automated Knowledge Extraction and Curation

Project Summary

The scientific literature on infectious diseases grows exponentially, making manual curation increasingly challenging. This project leverages AI and natural language processing to automatically extract, structure, and curate biological knowledge from scientific literature, databases, and other sources. By automating knowledge extraction for gene functions, pathways, protein interactions, and phenotypes, we can enhance BRC data annotations, accelerate database updates, and enable novel knowledge discovery through integrated knowledge graphs.

Goals and Objectives

Extract biological entities and relationships from literature using NLP and named entity recognition
Build knowledge graphs connecting pathogens, genes, proteins, functions, and phenotypes
Validate extracted knowledge using confidence scoring and cross-referencing
Integrate with BRC curation pipelines to enhance database annotations

Approach

Methods and AI/ML Approaches:

Named Entity Recognition (NER) for biological entities (genes, proteins, pathogens, phenotypes)
Relation extraction using transformer models (BioBERT, SciBERT, PubMedBERT)
Knowledge graph construction and validation
LLMs for summarization and hypothesis generation
Active learning for curator feedback integration

Implementation Steps:

Collect literature corpus (PubMed, PMC) for priority pathogens
Fine-tune NER models on infectious disease texts
Extract entities and relationships
Build and visualize knowledge graphs
Implement confidence scoring and validation
Create curator interface for review and correction
Generate test cases for integration with BRC databases

Data and Resources Required

Resource Type	Source / Link	Description / Purpose
Data	PubMed, PubMed Central	Scientific literature corpus
Data	BV-BRC, UniProt, GO	Reference databases for validation
LLMs / AI Models	BioBERT, SciBERT, PubMedBERT	Domain-specific language models
Tools / Services	spaCy, Hugging Face Transformers	NLP frameworks
Tools / Services	Neo4j or NetworkX	Knowledge graph storage
Compute / Storage	GPU for model training	Fine-tuning language models

Expected Outcomes / Deliverables

Automated extraction pipeline from literature to knowledge graphs
Knowledge graphs for priority pathogens with entities and relationships
Evaluation metrics for extraction accuracy and coverage
Curator interface prototype for validation and correction
Integration plan for BRC database enhancement
Public repository with code, trained models, and examples
Case study demonstrating knowledge discovery

Potential Impact and Next Steps

Impact on:

Infectious disease research: Accelerates literature review and knowledge synthesis
AI/ML automation: Demonstrates practical biomedical NLP applications
Public health preparedness: Enables rapid knowledge compilation for emerging pathogens

Next Steps After Codeathon:

Expand to all BRC pathogens and related literature
Implement continuous monitoring of new publications
Add multi-language support for non-English literature
Create automated hypothesis generation from knowledge graphs
Deploy as a service for BRC users

Technical Support Needed

Access to PubMed API and bulk downloads
GPU compute for model training
BRC database schemas and curation workflows
Gold-standard annotated corpus for evaluation
Mentor support from biocuration experts

Team Information

Teams will be formed during the Codeathon. Ideal team composition:

NLP/ML Engineer: Model development and training
Biocurator: Domain expertise and validation
Knowledge Graph Engineer: Graph database and visualization
Backend Developer: Pipeline and API development
Bioinformatician: BRC integration and use case definition