Extracting Training Data for Automated Workflow Generation

Project Repository

Project Themes:

  • Automated Workflow Generation and Execution

Team Name: Capybara

Team Lead(s):

Suggested Team Members and Roles [4-6 members]

Name | Affiliation | Role / Expertise
???

Project Summary

This project addresses the lack of structured, high-quality training data for teaching AI systems to generate bioinformatics workflows automatically from natural language descriptions. In infectious disease research, rapid and reproducible data analysis is essential for understanding pathogen genomics, host-pathogen interactions, and outbreak response, but most workflows remain locked in narrative text rather than machine-executable form. We aim to extract question-answer pairs from existing BRC resources, other pipeline documentation, and publications to train large language models on workflow synthesis and reasoning. The AI/ML innovation lies in developing methods to automatically identify, structure, and contextualize these Q&A pairs, enabling models that can generate, explain, and adapt infectious disease research workflows.

Goals and Objectives

  • Goal 1: Define the set of target infectious disease workflows, i.e., decide what makes a workflow worth including in this training set

  • Goal 2: Define structured inputs and outputs for the training data

  • Goal 3: Build a dataset of at least 100 examples during the hackathon (with plans to grow it afterward)

  • Goal 4 (post-workshop): Share results and prepare a publication
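
As one way to picture Goal 2, the sketch below shows a possible record layout for a single training example. It is only an illustration: the class names `WorkflowStep` and `WorkflowExample`, all field names, and the `validate` checks are assumptions made here, not a format the team has agreed on, and every value is a placeholder.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class WorkflowStep:
    tool: str                                       # hypothetical tool or service name
    parameters: dict = field(default_factory=dict)  # hypothetical tool settings
    produces: str = ""                              # label for this step's output


@dataclass
class WorkflowExample:
    question: str                                   # natural language input
    steps: list = field(default_factory=list)       # ordered WorkflowStep objects
    answer: str = ""                                # answer verified by running the workflow
    source: str = ""                                # provenance, e.g. a publication identifier


def validate(example):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    if not example.question.strip():
        problems.append("empty question")
    if not example.steps:
        problems.append("no workflow steps")
    if not example.answer.strip():
        problems.append("missing verified answer")
    return problems


# Placeholder record; none of these values come from a real workflow.
example = WorkflowExample(
    question="PLACEHOLDER: a natural language analysis question",
    steps=[
        WorkflowStep(
            tool="placeholder_tool",
            parameters={"input": "placeholder"},
            produces="placeholder_output",
        )
    ],
    answer="PLACEHOLDER: verified answer",
    source="PLACEHOLDER: source identifier",
)
print(json.dumps(asdict(example), indent=2))
```

Keeping the workflow as an ordered list of steps, rather than free text, would let each record be checked mechanically before it enters the dataset.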

Approach

  • General Training Approach: Multi-turn preference optimization (RL) for learning complex workflows
  • Data Generation Approaches:
    • Extraction into structured format from papers

    • Multi-hop question-answering to generate QA pairs

    • Agentic workflows (light) for generating QA pairs
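
To make the multi-hop idea concrete, here is a minimal sketch that chains two single-hop facts into one two-hop question. It assumes facts have already been extracted as (subject, relation, object) triples; that representation, the function name `chain_two_hops`, and the question template are illustrative assumptions, and the toy triples are placeholders rather than real biology.

```python
def chain_two_hops(facts):
    """Build two-hop QA pairs from (subject, relation, object) triples.

    Two facts are chained when the object of the first is the subject
    of the second, so answering requires following both hops.
    """
    by_subject = {}
    for subject, relation, obj in facts:
        by_subject.setdefault(subject, []).append((relation, obj))

    pairs = []
    for subject, first_relation, bridge in facts:
        for second_relation, obj in by_subject.get(bridge, []):
            pairs.append(
                {
                    "question": f"What is the {second_relation} of the {first_relation} of {subject}?",
                    "answer": obj,
                    # Keep the supporting hops so the pair can be audited later.
                    "hops": [
                        (subject, first_relation, bridge),
                        (bridge, second_relation, obj),
                    ],
                }
            )
    return pairs


# Toy placeholder triples, not real data.
toy_facts = [
    ("sample_A", "assembly", "contigs_A"),
    ("contigs_A", "annotation", "genes_A"),
]
for pair in chain_two_hops(toy_facts):
    print(pair["question"], "->", pair["answer"])
```

In practice an LLM would likely rephrase the templated question into natural language; storing the supporting hops alongside each pair keeps that step verifiable.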

Data and Resources Required

Resource Type | Source / Link | Description / Purpose
Data | PubMed corpus | Extract workflows and questions
Tools / Services | BV-BRC analysis tools, APIs, BV-BRC CLI | Run workflows and obtain verified answers
LLMs / AI Models | GPT-4, Claude 3, Mistral Large via BV-BRC Copilot | Test difficulty of problems on existing LLMs (i.e., confirm need for special training)
Compute / Storage | BRC cluster nodes | Run workflows (and eventually for training)
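
Testing problem difficulty on existing LLMs could start with something as simple as the exact-match check sketched below. The `ask_model` callable is a stand-in for whichever model client is used (no real API is called here), and exact match after whitespace and case normalization is an assumed metric chosen for simplicity; workflow answers may well need a more forgiving comparison.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(examples, ask_model):
    """Fraction of examples where the model's answer matches the verified answer.

    `examples` is a list of dicts with "question" and "answer" keys;
    `ask_model` is any callable mapping a question string to an answer string.
    """
    if not examples:
        return 0.0
    hits = sum(
        normalize(ask_model(ex["question"])) == normalize(ex["answer"])
        for ex in examples
    )
    return hits / len(examples)


# Stub model for illustration only: it always returns the same answer.
def stub_model(question):
    return "Placeholder  Answer"


toy_examples = [
    {"question": "placeholder question 1", "answer": "placeholder answer"},
    {"question": "placeholder question 2", "answer": "a different answer"},
]
print(exact_match_accuracy(toy_examples, stub_model))
```

A low score from strong general-purpose models on the extracted questions would support the case for specialized training.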

Expected Outcomes / Deliverables

  • Workflow(s) for generating data
  • Training dataset in a Hugging Face-compatible format (not to be posted until after the paper is submitted)
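
One common layout that the Hugging Face `datasets` library can load is JSON Lines, with one record per line. The sketch below writes and reads such a file using only the standard library. Choosing JSON Lines is an assumption made here, and the helper names, file name, and record fields are hypothetical.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line and return the output path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Placeholder records; the field names are illustrative.
records = [
    {"question": "placeholder question", "answer": "placeholder answer"},
    {"question": "another placeholder", "answer": "another answer"},
]
```

Calling `write_jsonl(records, "train.jsonl")` would produce a file that, assuming the `datasets` package is installed, should be loadable with `load_dataset("json", data_files="train.jsonl")`.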

Potential Impact and Next Steps

  • Infectious disease research or surveillance: Trains models to generate workflows automatically for infectious disease applications.
  • AI/ML automation and interpretability: Directly trains models to translate natural language biology questions into computational biology workflows and answers.
  • Public health preparedness or education: Makes infectious disease analyses more accessible to public health officials by lowering technical and interpretability barriers.

Technical Support Needed

  • GPU / LLM access
  • API keys
  • Mentor support

Additional Comments