Geneformer-V2-316M (TransformerEngine-Optimized) Overview

Description:

Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-specific predictions in network biology settings with limited data.

This version of the Geneformer model is optimized with NVIDIA's TransformerEngine library. It is based on the Geneformer-V2-316M model, and (within numerical precision) has identical weights and outputs.

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see the Non-NVIDIA Geneformer Model Card.

License/Terms of Use:

Geneformer is licensed under the Apache 2.0 license.

Deployment Geography:

Global

Use Case:

Network biology and therapeutic discovery, particularly in data-limited settings such as rare diseases or diseases affecting hard-to-access tissues.

Release Date:

Hugging Face 12/19/2025 via https://huggingface.co/nvidia/geneformer_V2_316M

Model Architecture:

Architecture Type: Transformer
Network Architecture: BERT

This model was developed based on: Geneformer
Number of model parameters: 3.16 x 10^8

Input:

Input Type: Numeric (each row represents a cell and contains gene names with single-cell expression counts)
Input Format: AnnData array
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: This model supports a context length of 4096.
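The input convention can be illustrated with a minimal sketch. Geneformer represents each cell as a rank-value encoded sequence: genes are ordered by expression and the sequence is truncated to the 4096-token context length noted above. The function below is a simplified illustration only; the actual tokenizer also normalizes counts against corpus-wide gene medians and maps gene names to vocabulary IDs, and the gene names here are hypothetical.

```python
import numpy as np

def rank_value_encode(counts, gene_ids, max_len=4096):
    """Simplified Geneformer-style rank-value encoding: order genes by
    descending expression, drop unexpressed genes, and truncate to the
    model's 4096-token context length."""
    order = np.argsort(counts)[::-1]        # highest expression first
    expressed = order[counts[order] > 0]    # drop zero-count genes
    return [gene_ids[i] for i in expressed[:max_len]]

# Toy cell with four (hypothetical) genes and their expression counts.
cell_counts = np.array([0.0, 5.0, 2.0, 9.0])
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
tokens = rank_value_encode(cell_counts, genes)
# tokens == ["GENE_D", "GENE_B", "GENE_C"]
```

The key design point is that absolute counts are discarded: only the expression *ranking* within the cell enters the model, which makes the encoding robust to platform-specific scaling of raw counts.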

Output:

Output Type: Dense Embedding Predictions
Output Format: Vector
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Numeric floating-point vector (fp16, bf16, or fp32); Geneformer-V2-316M outputs 512-dimensional embeddings.
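To turn the model's per-token outputs into a single 512-dimensional cell embedding, a common convention is to mean-pool the token embeddings over non-padding positions. The sketch below illustrates this pooling step with placeholder arrays; it is not the model's own API, and the shapes are assumptions for illustration.

```python
import numpy as np

def cell_embedding(token_embeddings, attention_mask):
    """Mean-pool per-token embeddings (seq_len, hidden) into one
    cell-level vector, ignoring padding positions."""
    mask = attention_mask[:, None].astype(float)   # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    return summed / mask.sum()

seq_len, hidden = 6, 512            # 512 = embedding width of Geneformer-V2-316M
toks = np.ones((seq_len, hidden))   # placeholder for model outputs
mask = np.array([1, 1, 1, 1, 0, 0]) # last two positions are padding
emb = cell_embedding(toks, mask)
# emb.shape == (512,)
```

Masked pooling matters here because cells express different numbers of genes, so batched sequences are padded to a common length and the padding must not dilute the average.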

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Transformer Engine
  • PyTorch

Supported Hardware Microarchitecture Compatibility:

  • A100
  • H100
  • H200
  • GB200

Preferred/Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

  • Geneformer-V1-10M
  • Geneformer-V2-104M
  • Geneformer-V2-316M
  • Geneformer-V2-104M_CLcancer

Training and Evaluation Datasets:

Training Datasets:

Link: Genecorpus-103M - the authors intend to release the dataset upon publication of their manuscript.

Data Modality:

  • Text (Human single-cell transcriptomes)

Text Training Data Size:

  • 1 Billion to 10 Trillion Tokens

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • N/A

Properties: The single-cell transcriptomes were assembled from a broad range of publicly available data sources. The researchers collected raw counts from sources like NCBI Gene Expression Omnibus (GEO), Human Cell Atlas, and Tumor Immune Single-cell Hub (TISCH), among others. They excluded cells with high mutational burdens, such as malignant cells and immortalized cell lines, and included only droplet-based sequencing platforms to ensure data comparability. The raw data was then converted into a uniform loom HDF5 file format.

Evaluation Datasets:

Link: CELLxGENE

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Hybrid: Automated, Human

Properties: Single-cell transcriptomes with consolidated and balanced annotations.

Link: Systematic Comparison of High-throughput Single-Cell and Single-Nucleus Transcriptomes during Cardiomyocyte Differentiation

Data Collection Method by dataset:

  • Automated

Labeling Method by dataset:

  • Human

Properties: The researchers used two different sequencing platforms to collect data from the same biological process: induced pluripotent stem cell (iPSC) differentiation into cardiomyocytes. The two platforms used were Drop-seq (single-cell) and DroNc-seq (single-nucleus). The study involved two iPSC lines and collected data over a 15-day time period.

Inference:

Acceleration Engine: Transformer Engine, PyTorch

Test Hardware:

  • A100
  • H100
  • H200
  • GB200

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and comply with applicable safety regulations and ethical standards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
