TAP-CT: 3D Task-Agnostic Pretraining of CT Foundation Models
TAP-CT is a suite of foundation models for computed tomography (CT) imaging, pretrained in a task-agnostic manner through an adaptation of DINOv2 for volumetric data. These models learn robust 3D representations from CT scans without requiring task-specific annotations.
This repository provides TAP-CT-B-3D, a Vision Transformer (ViT-Base) architecture pretrained on volumetric inputs with a spatial resolution of (12, 224, 224) and a patch size of (4, 8, 8). For inference on full-resolution CT volumes, a sliding window approach can be employed to extract features across the entire scan.
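As a quick check of these sizes, the sketch below computes the patch-token grid implied by the default crop and patch sizes; the assumption of non-overlapping patches plus a single [CLS] token follows from the output description later in this card.

```python
# Patch-grid arithmetic for the default crop (12, 224, 224) and patch size (4, 8, 8)
depth, height, width = 12, 224, 224
pd, ph, pw = 4, 8, 8

grid = (depth // pd, height // ph, width // pw)   # (3, 28, 28)
num_patch_tokens = grid[0] * grid[1] * grid[2]    # 2352 patch tokens (+ 1 [CLS] token)
print(grid, num_patch_tokens)
```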
Preprocessing
Using the dedicated image processor
Each TAP-CT model repository provides its own dedicated image processor and configuration file. To ensure proper preprocessing, it is recommended to instantiate the corresponding image processor using the AutoImageProcessor class from Hugging Face Transformers. This can be accomplished as follows:
```python
from transformers import AutoImageProcessor

preprocessor = AutoImageProcessor.from_pretrained(
    'fomofo/tap-ct-b-3d',
    trust_remote_code=True
)
```
This approach automatically loads the appropriate processor and configuration for the selected TAP-CT model.
Preprocessing without pipeline
If the dedicated image processor is not used, the volume can be prepared manually with the following steps (a minimal sketch follows the list):
- Orientation: Convert the volume to LPS (Left-Posterior-Superior) orientation. While the model is likely orientation-invariant, all evaluations were conducted using LPS orientation.
- Spatial Resizing: Resize the volume to a spatial resolution of (z, 224, 224) or (z, 512, 512), where z is the number of slices along the axial dimension.
- Axial Padding: Pad along the z-axis with a value of -1024 so that z is divisible by 4, accommodating the model's patch size of (4, 8, 8).
- Intensity Clipping: Clip voxel intensities to the range [-1008, 822] HU (Hounsfield Units).
- Normalization: Apply z-score normalization with mean = -86.8086 and std = 322.6347.
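The sketch below applies these steps with SimpleITK and PyTorch. The in-plane interpolation mode and the placement of the axial padding are assumptions not specified above, so results may differ slightly from the dedicated image processor, which remains the recommended path.

```python
import numpy as np
import SimpleITK as sitk
import torch
import torch.nn.functional as F

# 1. Orientation: reorient the scan to LPS
volume = sitk.ReadImage('/path/to/ct-scan.nii.gz')
volume = sitk.DICOMOrient(volume, 'LPS')
array = sitk.GetArrayFromImage(volume).astype(np.float32)  # (z, y, x)

# 2. Spatial resizing to (z, 224, 224); trilinear interpolation is an assumption
x = torch.from_numpy(array)[None, None]                    # (1, 1, z, y, x)
z = x.shape[2]
x = F.interpolate(x, size=(z, 224, 224), mode='trilinear', align_corners=False)

# 3. Axial padding with -1024 so the slice count is divisible by the patch depth (4)
pad_z = (-z) % 4
if pad_z:
    x = F.pad(x, (0, 0, 0, 0, 0, pad_z), value=-1024.0)

# 4. Intensity clipping to [-1008, 822] HU
x = x.clamp(-1008.0, 822.0)

# 5. Z-score normalization with the reported statistics
x = (x - (-86.8086)) / 322.6347
```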
Usage
Default Usage
```python
import torch
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained('fomofo/tap-ct-b-3d', trust_remote_code=True)

# Prepare input (batch_size, channels, depth, height, width)
x = torch.randn((16, 1, 12, 224, 224))

# Forward pass
with torch.no_grad():
    output = model(x)
```
Usage with Preprocessor, loading CT volumes & sliding window inference
Recommended environment:
- Python >= 3.11
- torch >= 2.8
- numpy >= 2.3.5
- SimpleITK >= 2.5.2
- monai >= 1.4.0
- xformers >= 0.0.32 (optional, recommended for CUDA)
```python
import numpy as np
import SimpleITK as sitk
import torch
from transformers import AutoModel, AutoImageProcessor
from monai.inferers import SlidingWindowInferer

# Load the model and its dedicated image processor
model = AutoModel.from_pretrained('fomofo/tap-ct-b-3d', trust_remote_code=True)
preprocessor = AutoImageProcessor.from_pretrained('fomofo/tap-ct-b-3d', trust_remote_code=True)

# Load image & set orientation to LPS
volume = sitk.ReadImage('/path/to/ct-scan.nii.gz')
volume = sitk.DICOMOrient(volume, 'LPS')

# Get array, expand to (B, C, D, H, W) and preprocess
array = sitk.GetArrayFromImage(volume)
array = np.expand_dims(array, axis=(0, 1))
x = preprocessor(array)['pixel_values']

# Forward pass on a single crop
with torch.no_grad():
    output = model(x)

# OR: forward pass with sliding-window inference over the full volume
def predictor_fn(x):
    # Reshape the patch tokens to resemble a 3D feature map
    out = model(x, reshape=True)
    return out.last_hidden_state

inferer = SlidingWindowInferer(
    roi_size=[12, 224, 224],
    sw_batch_size=1,
    overlap=0.75,
    mode='gaussian'
)

with torch.no_grad():
    output = inferer(x, predictor_fn)
```
The model returns a BaseModelOutputWithPooling object from the transformers library. The output.pooler_output contains the pooled [CLS] token representation, while output.last_hidden_state contains the spatial patch token embeddings. To extract features from all intermediate transformer layers, pass output_hidden_states=True to the forward method.
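For example (a minimal sketch; exact shapes depend on the input size and model configuration):

```python
with torch.no_grad():
    out = model(x, output_hidden_states=True)

cls_embedding = out.pooler_output      # pooled [CLS] representation, shape (batch, hidden_dim)
patch_tokens = out.last_hidden_state   # spatial patch token embeddings
all_layers = out.hidden_states         # tuple with one tensor per transformer layer
```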
Model Details
- Model Type: 3D CT Vision Foundation Model
- Input Shape: (batch_size, 1, depth, height, width)
- Example Input: (16, 1, 12, 224, 224), a batch of 16 CT crops with 12 slices at 224×224 resolution
- License: CC-BY-NC-4.0
Citation
If you find this work useful, please cite:
```bibtex
@article{veenboer2025tapct,
  title={TAP-CT: 3D Task-Agnostic Pretraining of Computed Tomography Foundation Models},
  author={Veenboer, Tim and Yiasemis, George and Marcus, Eric and Van Veldhuizen, Vivien and Snoek, Cees G. M. and Teuwen, Jonas and Groot Lipman, Kevin B. W.},
  journal={arXiv preprint arXiv:2512.00872},
  year={2025}
}
```