Project Themes:
- Automated Workflow Generation and Execution
Team Name: Capybara
Team Lead(s):
- Name: Nicholas Chia
- Affiliation: ANL
- Email: chia@anl.gov
Suggested Team Members and Roles [4-6 members]
| Name | Affiliation | Role / Expertise |
|---|---|---|
| ??? | ||
Project Summary
The problem addressed by this project is the lack of structured, high-quality training data for teaching AI systems to automatically generate bioinformatics workflows from natural language descriptions. In infectious disease research, rapid and reproducible data analysis is essential for understanding pathogen genomics, host-pathogen interactions, and outbreak response—but most workflows remain locked in narrative text rather than machine-executable form. Our project aims to extract question-answer pairs from existing BRC resources, other pipeline documentation, and publications to train large language models on workflow synthesis and reasoning. The AI/ML innovation lies in developing methods to automatically identify, structure, and contextualize these Q&A pairs to enable models that can generate, explain, and adapt infectious disease research workflows.
Goals and Objectives
-
Goal 1: Define the set of target infectious disease workflows, i.e., what makes a workflow worth including into this training set?
-
Goal 2: Define structured input/outputs for training data
-
Goal 3: Build dataset of at least 100 examples during hackathon (with plans to grow after)
-
Goal 4 (post workshop): Share learning results and build a publication
Approach
- General Training Approach: Multi-turn preference optimization (RL) for learning complex workflows
- Data Generation Approaches:
-
Extraction into structured format from papers
-
Multi-hop question-answering to generate QA pairs
-
Agentic workflows (light) for generating QA pairs
-
Data and Resources Required
| Resource Type | Source / Link | Description / Purpose |
|---|---|---|
| Data | PubMed corpus | Extract workflows and questions |
| Tools / Services | BV-BRC analysis tools, APIs, BV-BRC CLI | Run workflows and obtain verified answers |
| LLMs / AI Models | GPT-4, Claude 3, Mistral Large via BV-BRC Copilot | Test difficulty of problems on existing LLMs (i.e., confirm need for special training) |
| Compute / Storage | BRC cluster nodes | Run workflows (and eventually for training) |
Expected Outcomes / Deliverables
- Workflow(s) for generating data
- Training dataset in huggingface acceptable format (not to be posted until after submission of paper)
Potential Impact and Next Steps
- Infectious disease research or surveillance: Trains automated workflow generation for infectious disease applications.
- AI/ML automation and interpretability: This project directly trains automation of natural language biology questions into computational biology workflows and answers.
- Public health preparedness or education: This project makes infectious disease analyses more accessible to public health officials by removing technical and interpretability barriers.
Technical Support Needed
- GPU / LLM access
- API keys
- Mentor support
Additional Comments