Assigning Functions to Uncharacterized Genes

Project Repository

Project Themes:

  • Assign Functions to Uncharacterized Genes

Team Name: Correct Horse Battery Staple

Team Lead(s):

Suggested Team Members and Roles [4-6 members]

Name | Affiliation | Role / Expertise
Marcus Nguyen | University of Chicago | Model building and data cleaning
Justin Podowski | University of Chicago | Model building and data cleaning
Alex Brace??? | University of Chicago | LLM modeling
James Davis??? | University of Chicago | Model building and data cleaning

Project Summary

In this project, we aim to use machine learning and large language models to learn function from known, characterized genes and proteins, then apply the resulting models to predict function for uncharacterized genes and proteins. The project will use traditional ML on k-mer features with ensemble methods such as XGBoost and random forest, and will also incorporate LLM embeddings from GenSLMs and ESM. If time permits, we will fine-tune GenSLMs or ESM models for function classification. BV-BRC functional roles, PGFams, and PLFams will serve as labels, and the sequence data itself will be featurized to build the models.

Goals and Objectives

  • Goal 1: Build ensemble models utilizing k-mers
  • Goal 2: Build ensemble models utilizing LLM embeddings
  • Goal 3 (time permitting): Fine-tune an LLM for the classification task

Approach

  • Featurizing sequences via k-mers and LLM embeddings for XGBoost or random forest
  • Fine-tuning GenSLMs
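
The k-mer featurization step could look roughly like the sketch below. It is a minimal illustration, not the team's implementation. It assumes protein sequences over the 20 standard amino acids, relative frequencies rather than raw counts, and k = 2; the names `kmer_vocabulary`, `kmer_features`, and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product

# Assumption: protein sequences over the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vocabulary(k, alphabet=AMINO_ACIDS):
    """Map every possible k-mer over the alphabet to a fixed column index."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def kmer_features(sequence, k, vocab):
    """Return a fixed-length vector of relative k-mer frequencies."""
    # Count overlapping k-mers with a sliding window of width k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vector = [0.0] * len(vocab)
    total = 0
    for kmer, n in counts.items():
        idx = vocab.get(kmer)
        if idx is not None:  # skip k-mers containing ambiguous residues such as X
            vector[idx] = float(n)
            total += n
    # Normalize so that sequences of different lengths are comparable.
    if total:
        vector = [v / total for v in vector]
    return vector


# Toy sequences for illustration only; real input would come from BV-BRC.
vocab = kmer_vocabulary(2)
toy_sequences = ["MKVLAAGIVGL", "MKKLLXAA"]
X = [kmer_features(s, 2, vocab) for s in toy_sequences]
```

Each row of `X` has one column per possible k-mer (400 columns for k = 2), so the matrix, paired with a label per sequence such as a PGFam, could be passed to a random forest or XGBoost classifier. The vocabulary grows as 20^k, so larger k would likely call for a sparse representation.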

Data and Resources Required

Resource Type | Source / Link | Description / Purpose
Data | BV-BRC genome feature sequences and metadata (functional role, PLFam, PGFam) | Raw features and labels
Tools / Services | BV-BRC FTP and/or CLI | Download sequence and metadata
LLMs / AI Models | GenSLMs, ESM | Embedding sequences and fine-tuning models
Compute / Storage | Argonne HPC, BRC cluster nodes | BRC cluster nodes to train ML models; HPC possibly needed to fine-tune LLMs

Expected Outcomes / Deliverables

  • Models that predict protein function

Potential Impact and Next Steps

  • Infectious disease research or surveillance: characterizing both known and unknown genes helps identify which genes are present in a population
  • AI/ML automation and interpretability: the product is an ML/AI/LLM model to predict protein function
  • Public health preparedness or education: characterizing both known and unknown genes helps identify which genes are present in a population

Follow-up after the codeathon would include tuning the ML models and a more thorough statistical analysis of the results. Expanding the models’ scope is also a possibility, as is a publication.

Technical Support Needed

  • GPU / LLM access

Additional Comments