The First NIAID BRC AI Codeathon 2025: Accelerating Infectious Disease Research Through Artificial Intelligence

This November, the NIAID Bioinformatics Resource Centers for Infectious Diseases (BRCs) hosted the first-ever NIAID-BRC AI Codeathon for Infectious Disease Research at Argonne National Laboratory. Over three days (November 12–14, 2025), more than 70 scientists, software engineers, data scientists, and students came together to explore how recent advances in artificial intelligence (AI) can transform the way we study pathogens, respond to outbreaks, and generate actionable biological insights.

The event brought together multidisciplinary participants from national labs, universities, and partner institutions with a shared goal: to explore how emerging computational methods, particularly those involving artificial intelligence, can strengthen the nation's capacity to study pathogens and respond to infectious-disease threats.

NIAID BRC AI Codeathon 2025 participants group photo at Argonne National Laboratory

Three Days of Prototyping and Innovation

The codeathon kicked off with a welcome session and a keynote address, "AI Update: The Future of Scientific Discovery," by Rick Stevens (BV-BRC PI and Associate Laboratory Director for Computing, Environment and Life Sciences Division at Argonne National Laboratory), which set the stage for three days of creativity and rapid prototyping. Participants spent most of their time in focused breakout sessions, supported by mentors and technical experts.

Twelve High-Impact AI Projects

The heart of the codeathon was its 12 project teams, each tackling a major challenge in infectious disease research. These projects—selected through a competitive proposal process—ranged from AI-driven gene annotation to outbreak analytics and automated workflow generation. Brief summaries of each project are below:

AI Co-Scientist for Protein Function Prediction

This project extended the "Co-Scientist" framework to build reasoning agents focused on protein function prediction, using BV-BRC datasets for priority pathogens. The goal was to create agents that can propose functional hypotheses for uncharacterized proteins, helping bridge the gap between raw sequence data and actionable biological insight.

AI-Powered Analysis Pathfinder: From Research Questions to Data and Workflows

This project builds an AI-powered system that takes researcher goals and returns both relevant datasets and executable Galaxy workflows, covering the full path from question to analysis. Large language models interpret natural language queries to determine the data and analyses needed, search the European Nucleotide Archive (ENA) for matching datasets, and recommend compatible Intergalactic Workflow Commission (IWC) workflows with parameter guidance tailored to the discovered data.

Assigning Functions to Uncharacterized Genes

This project uses machine learning and large language models to predict functionality for uncharacterized genes and proteins based on patterns learned from known, well-characterized ones. It combines traditional ML on k-mer features with ensemble methods and LLM-based embeddings, using BV-BRC functional roles, PGFams, and PLFams as labels and sequence data as the feature space.
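
As a rough illustration of the "k-mer features" mentioned above, the sketch below turns a protein sequence into a fixed-length vector of k-mer frequencies that a conventional classifier could consume. It is not the team's code: the function names, the choice of k = 2, the 20-letter amino acid alphabet, and the toy sequence are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Standard 20 amino acids; ambiguous residues (X, B, Z, ...) are ignored in this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence: str, k: int = 2) -> Counter:
    """Count overlapping k-mers in a protein sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(sequence: str, k: int = 2) -> list[float]:
    """Return k-mer frequencies in a fixed order, so vectors from
    different sequences line up column by column."""
    counts = kmer_counts(sequence, k)
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# Hypothetical toy sequence, not taken from any BV-BRC record.
toy_sequence = "MKVLAAGIVKV"
features = kmer_vector(toy_sequence, k=2)
print(len(features))  # one column per possible 2-mer
```

In a setup like the one described, vectors of this kind would be paired with labels such as functional roles or protein family identifiers and passed to a standard classifier; the labeling and model choice are outside the scope of this sketch.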

Expanding Rhea for Automated Workflow Generation

This project extends the Rhea MCP+RAG platform to integrate BV-BRC's pathogen data pipelines with Galaxy's execution engine and automatically generate end-to-end workflows for infectious disease analysis. The resulting system will support rapid hypothesis testing, reproducible analyses, and scalable execution of complex workflows across both ecosystems.

Extracting Training Data for Automated Workflow Generation

This project addresses the shortage of structured training data for teaching AI systems to generate infectious disease bioinformatics workflows from natural language. It builds and organizes question–answer pairs from BRC resources, pipeline documentation, and publications to train language models that can propose, explain, and adapt reproducible workflows.

Generative AI-Driven Workflow Design via MCP

This team applied generative AI, guided via the Model Context Protocol (MCP), to automate the design, refinement, and execution of bioinformatics workflows. By combining multi-agent frameworks with workflow platforms, they aimed to reduce the time and expertise required to move from a research question to an executable pipeline.

HiPerRAG for Pathogen Literature Mining

The HiPerRAG team developed a high-performance retrieval-augmented generation (RAG) pipeline for extracting and curating structured biological data from the scientific literature, specifically targeting priority pathogens. Their work sought to speed up accessing, synthesizing, and structuring the scientific content needed for downstream computational and experimental tasks.
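
For readers unfamiliar with RAG, the sketch below shows the general idea in miniature: rank passages against a question, then place the best matches into the prompt given to a language model. It is not the HiPerRAG implementation. Bag-of-words cosine similarity stands in for the learned embeddings a real system would likely use, and the function names and toy passages are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k passages with non-zero similarity to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_k] if score > 0]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the text that would be sent to a language model."""
    joined = "\n".join(f"- {p}" for p in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


# Made-up placeholder passages, not drawn from any real publication.
PASSAGES = [
    "Toy passage: spike protein mutations alter receptor binding.",
    "Toy passage: antibiotic resistance genes spread on plasmids.",
    "Toy passage: vaccination schedules for seasonal influenza.",
]

hits = retrieve("spike protein mutations", PASSAGES)
print(build_prompt("spike protein mutations", hits))
```

The generation step (sending the assembled prompt to a model) is deliberately left out, since the article does not say which model or interface the team used.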

InterWeb Outbreak Surveillance Dashboard

In this project, participants built an AI-powered outbreak surveillance system that aggregates data from web sources, including social media, news, case reports, and public-health databases. The aim was to provide near-real-time visibility into emerging infectious-disease events, supporting both research and preparedness efforts.

PubMed Miner: AI-Powered Sequence Feature Extraction from Literature

This project uses AI/LLMs to automatically mine the latest viral literature for high-value sequence features (e.g., epitopes, mutations, domains), filter out false positives using gene/coordinate/function context, and generate validated feature sets for downstream analyses like sequence feature variant typing.

RDF Knowledge Graph Construction

This project uses LLMs, guided by domain experts, to automatically generate RDF triples from PDN resources and improve their interoperability within the SIB's Linked Data ecosystem. It defines a curated schema/ontology, maps PDN resources to other SIB data sources, and develops exemplar research questions with SPARQL queries to streamline data harmonization, improve accessibility, and expand SIB's linked biodata.

StorySeq: Automated Sequence Narrative Generation

StorySeq focused on automating the generation of narrative summaries from sequence analyses. The pipeline uses BLAST searches, database queries, and LLM-based narrative synthesis to help researchers interpret genomic results—particularly in pathogen and antimicrobial-resistance gene discovery workflows—more quickly and clearly.

Viral Structural Phylogenetics

This project aims to develop a fast, accurate language model for viral structure prediction by fine-tuning ProstT5 with LoRA on recent viral protein structure datasets to better handle under-represented viral species. The model is validated by reconstructing viral taxonomy from structurally conserved core genes and integrating it into the Foldseek framework to enable large-scale structure prediction across massive viral sequence databases.

Each of these projects remains in active development. The Codeathon's purpose was not to produce polished, deployable tools, but to give teams the opportunity to evaluate feasibility, identify challenges, and establish technical foundations that can be advanced through follow-on work.

A full list of teams, members, and leads—totaling 71 participants—is available on the NIAID-BRC AI Codeathon website.

Watch the final project presentations on YouTube to see each team's progress and outcomes.

A Successful First Codeathon—And a Glimpse of the Future

The 2025 NIAID-BRC AI Codeathon demonstrated the potential of integrating AI with infectious disease informatics. Teams produced early prototypes, workflow designs, model concepts, and system architectures that will continue to evolve in the months ahead. Many teams plan to transition their projects into fully supported NIAID-BRC development tracks, where they can directly benefit the infectious disease research community.

More importantly, the event fostered a vibrant, collaborative community—bridging AI expertise with biological insight and public health relevance. From ambitious protein annotation agents to real-time outbreak monitoring tools, the innovations born during these three days point toward a future where AI accelerates every stage of scientific discovery.

The success of this inaugural event lays the foundation for future NIAID-BRC codeathons and broader AI-enabled initiatives across the infectious disease ecosystem.

Why This Work Matters

Infectious disease research increasingly depends on large, complex datasets, and the need for advanced analytical tools and techniques to analyze them has become pressing. Bioinformatics and AI can help researchers work through this information more efficiently, provided the methods are well designed and tested in realistic settings.

The Codeathon offered a focused environment for teams to identify concrete problems, try new approaches, and see where AI can make routine tasks faster or more reliable. While the projects are still evolving, they highlight practical ways to improve data access, workflow development, literature integration, and analysis support.

These initiatives strengthen the computational foundation for infectious disease research and reflect the BRCs' commitment to supporting the research community with cutting-edge resources and tools.