Automated Workflow Generation and Execution

Edit

Project Theme

Automated Workflow Generation and Execution

Project Summary

Bioinformatics workflows often require expertise in multiple tools, file formats, and execution environments. This project aims to build an AI system that understands natural language descriptions of scientific analyses and automatically generates, validates, and executes bioinformatics workflows using BRC tools and external resources. By translating research questions directly into executable pipelines, we can dramatically reduce the time from hypothesis to results and make complex analyses accessible to researchers without deep computational expertise.

Goals and Objectives

  1. Natural language to workflow translation using LLMs to convert user descriptions into workflow specifications
  2. Automatic tool and parameter selection based on input data types and analysis goals
  3. Workflow validation and optimization to ensure correctness and efficiency
  4. Execute workflows on BRC infrastructure with proper data management and provenance tracking

Approach

Methods and AI/ML Approaches:

  • Large Language Models for parsing research questions and generating workflow logic
  • Workflow description languages (CWL, Nextflow, Snakemake, or Galaxy)
  • BRC API integration for tool execution
  • LLM-guided parameter optimization
  • Knowledge graphs of tool relationships and compatibility

Implementation Steps:

  1. Define common workflow patterns for infectious disease research
  2. Create prompts for LLM-based workflow generation
  3. Build a tool registry with BRC and external bioinformatics tools
  4. Implement workflow generation pipeline (LLM → workflow language)
  5. Add validation layer to check workflow correctness
  6. Execute generated workflows and capture results
  7. Test with real research questions

Data and Resources Required

Resource TypeSource / LinkDescription / Purpose
DataBV-BRC sample datasetsTesting workflow execution
Tools / ServicesBV-BRC analysis services APITool execution backend
Tools / ServicesGalaxy, Nextflow, or SnakemakeWorkflow engine
LLMs / AI ModelsGPT-4, Claude 3, or Code LlamaWorkflow generation
DataTool documentation and schemasFor tool selection and parameterization
Compute / StorageHPC access for workflow executionRun generated pipelines

Expected Outcomes / Deliverables

  • Prototype system for natural language workflow generation
  • Library of example workflows generated from research questions
  • Validation framework for workflow correctness
  • API documentation for BRC integration
  • Performance benchmarks comparing manual vs. automated workflows
  • Public GitHub repository with code, examples, and documentation
  • Demo video showing end-to-end workflow generation

Potential Impact and Next Steps

Impact on:

  • Infectious disease research: Accelerates time from question to analysis results
  • AI/ML automation: Advances AI-driven scientific workflow automation
  • Public health preparedness: Enables rapid deployment of analysis pipelines during outbreaks

Next Steps After Codeathon:

  • Expand tool registry to cover all BRC services
  • Add interactive workflow refinement (user feedback loop)
  • Implement workflow version control and sharing
  • Create educational materials for teaching computational biology
  • Integrate with BRC workspaces for seamless execution

Technical Support Needed

  • BRC API documentation and test credentials
  • Sample datasets with known analysis workflows
  • LLM API access
  • HPC or cloud compute allocation
  • Mentor support from workflow experts

Team Information

Teams will be formed during the Codeathon. Ideal team composition:

  • AI/ML Engineer: LLM integration and prompt engineering
  • Workflow Developer: Experience with CWL/Nextflow/Snakemake
  • Bioinformatician: Domain expertise in infectious disease analyses
  • Backend Developer: API integration and workflow execution
  • DevOps Engineer: Infrastructure and deployment