CDS-BART

CDS-BART is designed as an easy-to-use tool that makes it easier for researchers to apply deep learning to the development of mRNA vaccines and therapeutics. The model is based on BART and pre-trained on mRNA data covering nine taxonomic groups from the NCBI RefSeq database. It is a BART-based foundation model that can be fine-tuned for various mRNA downstream tasks, such as mRFP expression and mRNA stability prediction.
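
As an illustration of how such fine-tuning might be set up, the sketch below attaches a single-output regression head to the pre-trained checkpoint. This is a minimal sketch under stated assumptions, not the authors' recipe: it reuses the `mogam-ai/CDS-BART-denoising` checkpoint from this card together with the generic `BartForSequenceClassification` head, using `num_labels=1` for a continuous target such as mRFP expression.

from transformers import BartForSequenceClassification

# Illustrative setup only (not the authors' exact fine-tuning recipe):
# a freshly initialized single-output regression head on top of the
# pre-trained backbone. Expect a warning about newly initialized weights.
regressor = BartForSequenceClassification.from_pretrained(
    "mogam-ai/CDS-BART-denoising",
    num_labels=1,               # one continuous target, e.g. mRFP expression
    problem_type="regression",  # trains with MSE loss
)

# From here, train with the standard transformers Trainer or a plain PyTorch loop.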

Model Description

  • Developed by: Jadamba Erkhembayar, Sangheon Lee, Hyunjin Shin, Hyekyoung Lee, Jinhee Hong
  • Funded by: Mogam Institute for Biomedical Research
  • Model type: BART
  • Training database: NCBI RefSeq
  • Model size: 0.2B parameters (F32)
  • License: MIT License

Load tokenizer and model

Example code for loading the pre-trained denoising model and tokenizer. BartModel was pre-trained on denoising and sequence-representation tasks.

from transformers import (
    BartTokenizerFast,
    BartModel,
)

# Load tokenizer
tokenizer = BartTokenizerFast.from_pretrained("mogam-ai/CDS-BART-denoising")
# Load pre-trained model
model = BartModel.from_pretrained("mogam-ai/CDS-BART-denoising")
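
The snippet above loads the bare encoder-decoder backbone. If the checkpoint also ships language-modeling head weights, the same name can in principle be loaded into BartForConditionalGeneration to run the denoising objective directly. The following is a hedged sketch under those assumptions (LM-head weights present, and a mask token defined by the tokenizer); verify both against the repository before relying on it.

from transformers import BartForConditionalGeneration

# Assumption: the checkpoint includes LM-head weights (not guaranteed by this card).
seq2seq = BartForConditionalGeneration.from_pretrained("mogam-ai/CDS-BART-denoising")

# Corrupt a sequence with the mask token and let the model reconstruct it.
masked = "ACGCGAGCGU" + tokenizer.mask_token + "GGGGCAUAUGUA"
inputs = tokenizer(masked, return_tensors="pt")
generated = seq2seq.generate(**inputs, max_length=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))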

Example code

example_sequences = [
    "ACGCGAGCGUCAUUUCGCGGGGCAUAUGUA",
]

encoded = tokenizer(
    example_sequences,
    max_length=850,          # tokenizer's maximum length (see Notes below)
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# Forward pass through the encoder-decoder
output = model(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
)

# Per-token representations, shape (batch_size, seq_len, hidden_size)
hidden_states = output.last_hidden_state

Notes

  • The tokenizer's maximum length is 850 tokens, which covers mRNA sequences of up to around 4,000 nt.
  • Sequence embeddings can be extracted from the model, as shown in the sketch below.
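
One common way to collapse the per-token hidden states into a single embedding per sequence is attention-masked mean pooling. The following is a minimal sketch reusing `encoded` and `hidden_states` from the example above; the pooling strategy is an assumption for illustration, not something this card prescribes.

import torch

# Mean-pool token embeddings, ignoring padded positions.
mask = encoded["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
summed = (hidden_states * mask).sum(dim=1)      # (batch, hidden_size)
counts = mask.sum(dim=1).clamp(min=1)           # (batch, 1), avoid divide-by-zero
embeddings = summed / counts                    # (batch, hidden_size)

print(embeddings.shape)  # one vector per input sequence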
