Model
CodonFM predicts masked codons in mRNA sequences from codon-level context, enabling variant effect interpretation and codon optimization.
Use the NGC CLI to download:
| Field | Response |
|---|---|
| Intended Task/Domain: | mRNA sequence prediction |
| Model Type: | Transformer |
| Intended Users: | Molecular and cellular biologists, Genomics researchers studying coding variants, Synthetic biology and mRNA therapeutics teams, Computational biologists working on codon optimization or protein expression. |
| Output: | Text |
| Describe how the model works: | The model takes biological coding sequences as input, converts them into embeddings, and processes them through multiple transformer encoder layers featuring multi-head attention to capture context and pattern dependencies. Rotary positional encoding and feed-forward networks enhance the model’s ability to interpret the order and relationships of codons across the sequence for masked language modeling tasks. The final layers predict missing or masked codons, enabling the model to infer and reconstruct biologically relevant information from partial sequence data. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
| Technical Limitations & Mitigation: | The models were pre-trained on mRNA sequences. The model may perform poorly when the type of coding sequence supplied as input differs from the training data. Any input sequence longer than 2046 codons requires truncation or windowing. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Significance Testing (-log10(p)), Classification (AUROC), Regression (Spearman r, r²) |
| Potential Known Risks: | If the model does not work as intended, it may generate inaccurate predictions for biological sequences, leading to mischaracterization of codon relationships or masked inputs. Specifically, if the model inaccurately characterizes depth or other sequence properties, critical biological features may be distorted or missed, resulting in unreliable data interpretation or downstream analysis. Such failures could compromise clinical or research outcomes that depend on high-fidelity biological sequence understanding. |
| Licensing: | Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). |
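
The table above notes that inputs longer than 2046 codons require truncation or windowing. The following is a minimal sketch of one way to do that in plain Python before passing sequences to the model. The helper names `split_codons` and `window_codons`, the `stride` parameter, and the choice of overlapping windows are illustrative assumptions, not part of the CodonFM API; only the 2046-codon limit comes from this card.

```python
from typing import List, Optional

MAX_CODONS = 2046  # input limit stated in the model card


def split_codons(seq: str) -> List[str]:
    """Split an in-frame coding sequence into codons (hypothetical helper)."""
    seq = seq.strip().upper()
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def window_codons(
    codons: List[str],
    max_len: int = MAX_CODONS,
    stride: Optional[int] = None,
) -> List[List[str]]:
    """Cut a codon list into windows of at most max_len codons.

    stride is an assumption of this sketch: it defaults to max_len
    (non-overlapping windows); a smaller value yields overlapping windows.
    """
    if stride is None:
        stride = max_len
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    if len(codons) <= max_len:
        return [codons]
    windows = []
    start = 0
    while True:
        windows.append(codons[start:start + max_len])
        if start + max_len >= len(codons):
            break  # the last window reaches the end of the sequence
        start += stride
    return windows


if __name__ == "__main__":
    codons = split_codons("ATGGCC" * 2500)  # 5000 codons, over the limit
    windows = window_codons(codons, stride=1023)
    print(len(windows), [len(w) for w in windows])
```

Plain truncation is the special case of keeping only the first window. Overlapping windows give codons near a window boundary more surrounding context in at least one window, at the cost of extra forward passes; how to merge per-codon predictions across overlapping windows is left to the application.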