Project Theme
Assign Functions to Uncharacterized Genes
Project Summary
A significant portion of genes in bacterial and viral pathogens remain uncharacterized or annotated only as “hypothetical proteins.” This project uses AI and machine learning to predict functions for these genes by integrating multiple data types including sequence, structure, genomic context, and expression patterns. By combining deep learning on protein sequences, AlphaFold structure predictions, and comparative genomics, we aim to provide functional annotations with confidence scores, helping researchers understand pathogen biology and identify novel therapeutic targets.
Goals and Objectives
- Develop multi-modal AI models for function prediction integrating sequence, structure, and context
- Generate functional predictions for hypothetical proteins in priority pathogens
- Provide confidence scores and supporting evidence for predictions
- Create visualization tools for exploring predictions and evidence
Approach
Methods and AI/ML Approaches:
- Deep learning on protein sequences (transformers, CNNs)
- Structure-based function prediction using AlphaFold models
- Genomic context analysis (operons, gene neighborhoods, synteny)
- Transfer learning from well-characterized model organisms
- Ensemble methods combining multiple prediction approaches
- Graph neural networks for protein-protein interaction prediction
Implementation Steps:
- Compile datasets of characterized proteins for training
- Extract features: sequences, structures, genomic context
- Train and validate multi-modal prediction models
- Generate predictions for hypothetical proteins in BRC
- Implement confidence scoring and evidence aggregation
- Build visualization interface for predictions
- Validate predictions with experimental data where available
Data and Resources Required
Resource Type | Source / Link | Description / Purpose |
---|---|---|
Data | BV-BRC genome annotations | Target genes for prediction |
Data | UniProt, InterPro, Pfam | Characterized protein functions |
Data | AlphaFold Database | Protein structure predictions |
Data | STRING, BioGRID | Protein interaction networks |
LLMs / AI Models | ESM-2, ProtTrans | Protein language models |
Tools / Services | PyTorch, TensorFlow | Deep learning frameworks |
Compute / Storage | GPU for model training | Large-scale predictions |
Expected Outcomes / Deliverables
- Functional predictions for hypothetical proteins in priority pathogens
- Confidence metrics and evidence summaries for each prediction
- Trained models for function prediction
- Web interface or API for batch predictions
- Comparison with existing tools (InterProScan, BLAST, HHpred)
- Public repository with code, models, and documentation
- Publication-ready validation results
Potential Impact and Next Steps
Impact on:
- Infectious disease research: Accelerates functional annotation of pathogen genomes
- AI/ML automation: Advances protein function prediction methods
- Public health preparedness: Enables rapid characterization of emerging pathogens
Next Steps After Codeathon:
- Apply to all BRC genomes genome-wide
- Integrate with BRC annotation pipelines
- Add experimental validation suggestions
- Develop active learning for continuous improvement
- Create educational materials on protein function prediction
Technical Support Needed
- BRC genome data access
- AlphaFold structure predictions
- GPU compute for training
- Curated protein function datasets
- Mentor support from structural biologists
Team Information
Teams will be formed during the Codeathon. Ideal team composition:
- ML/AI Researcher: Model development and training
- Structural Biologist: Structure-function relationships
- Comparative Genomicist: Genomic context analysis
- Bioinformatician: Data integration and validation
- Software Developer: Interface and deployment