AI-Powered Analysis Pathfinder: From Research Questions to Data and Workflows

Project Repository Edit

Project Themes:

  • Automated Workflow Generation and Execution

Team Name: BRC Analysis Pathfinder

Team Lead(s):

  • Name: Dannon Baker
  • Affiliation: Johns Hopkins University, BRC Analytics
  • Email: dannon.baker@jhu.edu

Suggested Team Members and Roles [4-6 members]

NameAffiliationRole / Expertise
[Member 1][Institution]AI/ML Engineer - LLM Integration
[Member 2][Institution]Bioinformatician - Domain Knowledge
[Member 3][Institution]Backend Developer - API Integration
[Member 4][Institution]Frontend Developer - User Interface
[Member 5][Institution]Data Scientist - Query Optimization

Project Summary

Bench scientists know what analysis they want to perform but lack expertise to find appropriate datasets and workflows. This project creates an AI-powered system where researchers start with analysis goals (“identify AMR genes in K. pneumoniae”) and receive both relevant datasets and executable Galaxy workflows—a complete path from question to execution. The system uses large language models to extract data requirements and analysis intent from natural language queries, searches ENA for matching datasets, and recommends compatible IWC workflows with parameter guidance.

Goals and Objectives

  • Goal 1: Implement natural language query interpretation that extracts data requirements and analysis intent from research questions.
  • Goal 2: Build ENA integration that retrieves and ranks datasets based on extracted parameters.
  • Goal 3: Integrate workflow recommendation from IWC catalog with parameter guidance.
  • Goal 4 (Stretch): Enhance results with study descriptions and sample annotations for biological context.

Approach

  • Natural Language Query Processing: LLMs extract data requirements and analysis goals from queries, outputting structured parameters for ENA search and workflow matching.
  • ENA Integration: Translate parameters to ENA Portal API queries, retrieve dataset metadata, and rank results.
  • Workflow Recommendation: Match analysis goals with compatible IWC Galaxy workflows based on dataset characteristics, providing parameter guidance.

Milestones

  • Day 1: Backend implementation with pydantic-ai agents, query interpretation prompts, ENA API client, FastAPI endpoints, workflow catalog indexing.
  • Day 2: Frontend development, dataset display, workflow recommendation integration, initial testing.
  • Day 3: Optimization, end-to-end testing with domain experts, demonstration preparation.

Data and Resources Required

Resource TypeSource / LinkDescription / Purpose
Sequencing DataEuropean Nucleotide Archive (ENA) Portal APIPublic repository of sequencing experiments and metadata for infectious disease pathogens
Workflow CatalogIntergalactic Workflow Commission (IWC)Curated Galaxy workflows for genomic analysis (variant calling, RNA-seq, metagenomics)
LLMs / AI ModelsLLM API access (e.g., Claude Sonnet/Haiku, GPT-4)Query interpretation, workflow matching
Frameworkpydantic-aiType-safe LLM agent framework with structured outputs and provider abstraction
Compute / StorageBRC cluster nodes or cloud deploymentBackend FastAPI service, Next.js frontend

Expected Outcomes / Deliverables

  • Functional search API with RESTful endpoints and OpenAPI documentation
  • Web interface for describing analysis goals, viewing datasets, and browsing workflow recommendations
  • Query interpretation evaluation on 50+ test queries
  • End-to-end validation from analysis goal → dataset → workflow (target 80%+ completion)
  • Documentation: user guide, API docs, prompt engineering best practices

Potential Impact and Next Steps

This project addresses a critical barrier for bench scientists: the fragmented path from research question to analysis execution. Our system transforms this into a single conversation enabling researchers to move from question to analysis in minutes rather than days. For outbreak response, this enables rapid identification of comparable datasets and establishment of analysis workflows within seconds.

Follow-up activities: BV-BRC integration for federated search, multi-repository support (NCBI SRA, DDBJ), semantic enhancement with vector embeddings, analysis templates for common scenarios, Galaxy integration enhancements, and publication as open-source tool.

Technical Support Needed

  • API keys (LLM providers - e.g., Anthropic, OpenAI)

Additional Comments

The system uses pydantic-ai for structured outputs to prevent hallucination and ensure type safety. Key technical challenges include dual-intent query understanding, workflow-data compatibility verification, and prompt engineering for consistent interpretation. Success metrics: query-to-workflow response time <3 seconds, user satisfaction >4/5, and 80% successful completion rate.