AI-Powered Analysis Pathfinder: From Research Questions to Data and Workflows

Project Repository Video Presentation Edit

Project Themes:

Automated Workflow Generation and Execution

Team Name: BRC Analysis Pathfinder

Team Lead(s):

Name: Dannon Baker
Affiliation: Johns Hopkins University, BRC Analytics
Email: dannon@jhu.edu

Suggested Team Members and Roles [4-6 members]

Name	Affiliation	Role / Expertise
Anton Nekrutenko	Penn State, BRC Analytics	AI/ML Engineer - LLM Integration
Danielle Callan	Temple University, BRC Analytics	Bioinformatician - Domain Knowledge
Dannon Baker	Johns Hopkins, BRC Analytics	Backend Developer - API Integration
Dave Rogers	Penn State/Clever Canary, BRC Analytics	Frontend Developer - User Interface
Jonathan Livny	Genomic Center for Infectious Disease, NIAID (Non-BRC)	Data Scientist - Query Optimization
Steven Weaver	Temple University, BRC Analytics	Data Scientist - Query Optimization

Project Summary

Bench scientists know what analysis they want to perform but lack expertise to find appropriate datasets and workflows. This project creates an AI-powered system where researchers start with analysis goals (“identify AMR genes in K. pneumoniae”) and receive both relevant datasets and executable Galaxy workflows—a complete path from question to execution. The system uses large language models to extract data requirements and analysis intent from natural language queries, searches ENA for matching datasets, and recommends compatible IWC workflows with parameter guidance.

Goals and Objectives

Goal 1: Implement natural language query interpretation that extracts data requirements and analysis intent from research questions.
Goal 2: Build ENA integration that retrieves and ranks datasets based on extracted parameters.
Goal 3: Integrate workflow recommendation from IWC catalog with parameter guidance.
Goal 4 (Stretch): Enhance results with study descriptions and sample annotations for biological context.

Approach

Natural Language Query Processing: LLMs extract data requirements and analysis goals from queries, outputting structured parameters for ENA search and workflow matching.
ENA Integration: Translate parameters to ENA Portal API queries, retrieve dataset metadata, and rank results.
Workflow Recommendation: Match analysis goals with compatible IWC Galaxy workflows based on dataset characteristics, providing parameter guidance.

Milestones

Day 1: Backend implementation with pydantic-ai agents, query interpretation prompts, ENA API client, FastAPI endpoints, workflow catalog indexing.
Day 2: Frontend development, dataset display, workflow recommendation integration, initial testing.
Day 3: Optimization, end-to-end testing with domain experts, demonstration preparation.

Data and Resources Required

Resource Type	Source / Link	Description / Purpose
Sequencing Data	European Nucleotide Archive (ENA) Portal API	Public repository of sequencing experiments and metadata for infectious disease pathogens
Workflow Catalog	Intergalactic Workflow Commission (IWC)	Curated Galaxy workflows for genomic analysis (variant calling, RNA-seq, metagenomics)
LLMs / AI Models	LLM API access (e.g., Claude Sonnet/Haiku, GPT-4)	Query interpretation, workflow matching
Framework	pydantic-ai	Type-safe LLM agent framework with structured outputs and provider abstraction
Compute / Storage	BRC cluster nodes or cloud deployment	Backend FastAPI service, Next.js frontend

Expected Outcomes / Deliverables

Functional search API with RESTful endpoints and OpenAPI documentation
Web interface for describing analysis goals, viewing datasets, and browsing workflow recommendations
Query interpretation evaluation on 50+ test queries
End-to-end validation from analysis goal → dataset → workflow (target 80%+ completion)
Documentation: user guide, API docs, prompt engineering best practices

Potential Impact and Next Steps

This project addresses a critical barrier for bench scientists: the fragmented path from research question to analysis execution. Our system transforms this into a single conversation enabling researchers to move from question to analysis in minutes rather than days. For outbreak response, this enables rapid identification of comparable datasets and establishment of analysis workflows within seconds.

Follow-up activities: BV-BRC integration for federated search, multi-repository support (NCBI SRA, DDBJ), semantic enhancement with vector embeddings, analysis templates for common scenarios, Galaxy integration enhancements, and publication as open-source tool.

Technical Support Needed

API keys (LLM providers - e.g., Anthropic, OpenAI)

Additional Comments

The system uses pydantic-ai for structured outputs to prevent hallucination and ensure type safety. Key technical challenges include dual-intent query understanding, workflow-data compatibility verification, and prompt engineering for consistent interpretation. Success metrics: query-to-workflow response time <3 seconds, user satisfaction >4/5, and 80% successful completion rate.