TAP-CT: 3D Task-Agnostic Pretraining of CT Foundation Models
TAP-CT is a suite of foundation models for computed tomography (CT) imaging, pretrained in a task-agnostic manner through an adaptation of DINOv2 for volumetric data. These models learn robust 3D representations from CT scans without requiring task-specific annotations.
This repository provides TAP-CT-B-3D, a Vision Transformer (ViT-Base) architecture pretrained on volumetric inputs with a spatial resolution of (12, 224, 224) and a patch size of (4, 8, 8). For inference on full-resolution CT volumes, a sliding window approach can be employed to extract features across the entire scan.
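As a quick check of these sizes, the sketch below computes the patch-token grid implied by the default crop and patch sizes; the assumption of non-overlapping patches plus a single [CLS] token follows from the output description later in this card.

```python
# Patch-grid arithmetic for the default crop (12, 224, 224) and patch size (4, 8, 8)
depth, height, width = 12, 224, 224
pd, ph, pw = 4, 8, 8

grid = (depth // pd, height // ph, width // pw)   # (3, 28, 28)
num_patch_tokens = grid[0] * grid[1] * grid[2]    # 2352 patch tokens (+ 1 [CLS] token)
print(grid, num_patch_tokens)
```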
Preprocessing
Using the dedicated image processor
Each TAP-CT model repository provides its own dedicated image processor and configuration file. To ensure proper preprocessing, it is recommended to instantiate the corresponding image processor using the AutoImageProcessor class from Hugging Face Transformers. This can be accomplished as follows:
```python
from transformers import AutoImageProcessor

preprocessor = AutoImageProcessor.from_pretrained(
    'fomofo/tap-ct-b-3d',
    trust_remote_code=True
)
```
This approach automatically loads the appropriate processor and configuration for the selected TAP-CT model.
Preprocessing without pipeline
If the dedicated image processor is not used, the volume can be prepared manually with the following steps (a minimal sketch follows the list):
- Orientation: Convert the volume to LPS (Left-Posterior-Superior) orientation. While the model is likely orientation-invariant, all evaluations were conducted using LPS orientation.
- Spatial Resizing: Resize the volume to a spatial resolution of (z, 224, 224) or (z, 512, 512), where z is the number of slices along the axial dimension.
- Axial Padding: Pad along the z-axis with a value of -1024 so that z is divisible by 4, accommodating the model's patch size of (4, 8, 8).
- Intensity Clipping: Clip voxel intensities to the range [-1008, 822] HU (Hounsfield Units).
- Normalization: Apply z-score normalization with mean = -86.8086 and std = 322.6347.
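The sketch below applies these steps with SimpleITK and PyTorch. The in-plane interpolation mode and the placement of the axial padding are assumptions not specified above, so results may differ slightly from the dedicated image processor, which remains the recommended path.

```python
import numpy as np
import SimpleITK as sitk
import torch
import torch.nn.functional as F

# 1. Orientation: reorient the scan to LPS
volume = sitk.ReadImage('/path/to/ct-scan.nii.gz')
volume = sitk.DICOMOrient(volume, 'LPS')
array = sitk.GetArrayFromImage(volume).astype(np.float32)  # (z, y, x)

# 2. Spatial resizing to (z, 224, 224); trilinear interpolation is an assumption
x = torch.from_numpy(array)[None, None]                    # (1, 1, z, y, x)
z = x.shape[2]
x = F.interpolate(x, size=(z, 224, 224), mode='trilinear', align_corners=False)

# 3. Axial padding with -1024 so the slice count is divisible by the patch depth (4)
pad_z = (-z) % 4
if pad_z:
    x = F.pad(x, (0, 0, 0, 0, 0, pad_z), value=-1024.0)

# 4. Intensity clipping to [-1008, 822] HU
x = x.clamp(-1008.0, 822.0)

# 5. Z-score normalization with the reported statistics
x = (x - (-86.8086)) / 322.6347
```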
Usage
Default Usage
```python
import torch
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained('fomofo/tap-ct-b-3d', trust_remote_code=True)

# Prepare input (batch_size, channels, depth, height, width)
x = torch.randn((16, 1, 12, 224, 224))

# Forward pass
with torch.no_grad():
    output = model(x)
```
Usage with Preprocessor, loading CT volumes & sliding window inference
Recommended environment:
- Python >= 3.11
- torch >= 2.8
- numpy >= 2.3.5
- SimpleITK >= 2.5.2
- monai >= 1.4.0
- xformers >= 0.0.32 (optional, recommended for CUDA)
```python
import numpy as np
import SimpleITK as sitk
import torch
from transformers import AutoModel, AutoImageProcessor
from monai.inferers import SlidingWindowInferer

# Load the model and its dedicated image processor
model = AutoModel.from_pretrained('fomofo/tap-ct-b-3d', trust_remote_code=True)
preprocessor = AutoImageProcessor.from_pretrained('fomofo/tap-ct-b-3d', trust_remote_code=True)

# Load image & set orientation to LPS
volume = sitk.ReadImage('/path/to/ct-scan.nii.gz')
volume = sitk.DICOMOrient(volume, 'LPS')

# Get array, expand to (B, C, D, H, W) and preprocess
array = sitk.GetArrayFromImage(volume)
array = np.expand_dims(array, axis=(0, 1))
x = preprocessor(array)['pixel_values']

# Forward pass on a single crop
with torch.no_grad():
    output = model(x)

# OR: forward pass with sliding-window inference over the full volume
def predictor_fn(x):
    # Reshape the patch tokens to resemble a 3D feature map
    out = model(x, reshape=True)
    return out.last_hidden_state

inferer = SlidingWindowInferer(
    roi_size=[12, 224, 224],
    sw_batch_size=1,
    overlap=0.75,
    mode='gaussian'
)

with torch.no_grad():
    output = inferer(x, predictor_fn)
```
The model returns a BaseModelOutputWithPooling object from the transformers library. The output.pooler_output contains the pooled [CLS] token representation, while output.last_hidden_state contains the spatial patch token embeddings. To extract features from all intermediate transformer layers, pass output_hidden_states=True to the forward method.
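For example (a minimal sketch; exact shapes depend on the input size and model configuration):

```python
with torch.no_grad():
    out = model(x, output_hidden_states=True)

cls_embedding = out.pooler_output      # pooled [CLS] representation, shape (batch, hidden_dim)
patch_tokens = out.last_hidden_state   # spatial patch token embeddings
all_layers = out.hidden_states         # tuple with one tensor per transformer layer
```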
Model Details
- Model Type: 3D CT Vision Foundation Model
- Input Shape: (batch_size, 1, depth, height, width)
- Example Input: (16, 1, 12, 224, 224), a batch of 16 CT crops with 12 slices at 224×224 resolution
- License: CC-BY-NC-4.0
Citation
If you find this work useful, please cite:
```bibtex
@article{veenboer2025tapct,
  title={TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models},
  author={Veenboer, Tim and Yiasemis, George and Marcus, Eric and Van Veldhuizen, Vivien and Snoek, Cees G. M. and Teuwen, Jonas and Groot Lipman, Kevin B. W.},
  journal={arXiv preprint arXiv:2512.00872},
  year={2025}
}
```