Swin Transformer V2: Scaling Up Capacity and Resolution
A Swin Transformer v2 image classification model. The model follows a two-stage training process: it first undergoes intermediate training on a large-scale dataset containing diverse bird species from around the world, and is then fine-tuned specifically on the arabian-peninsula dataset (all the relevant bird species found in the Arabian Peninsula, including rarities).
The species list is derived from data available at https://avibase.bsc-eoc.org/checklist.jsp?region=ARA.
Model Type: Image classification and detection backbone
Model Stats:
Dataset: arabian-peninsula (735 classes)
Papers: Swin Transformer V2: Scaling Up Capacity and Resolution: https://arxiv.org/abs/2111.09883
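Image classification: run inference on a single image and obtain per-class probabilities.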
import birder
from birder.inference.classification import infer_image
(net, model_info) = birder.load_pretrained_model("swin_transformer_v2_s_intermediate-arabian-peninsula", inference=True)
# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)
# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)
image = "path/to/image.jpeg"  # or a PIL image; PIL images must be loaded in RGB format
(out, _) = infer_image(net, image, transform)
# out is a NumPy array with shape (1, 735), representing class probabilities
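To turn the probability vector into a readable prediction, you can take the top-k entries. A minimal sketch using NumPy; the mapping from class index to species name is model-specific and not shown here:

import numpy as np
# Top-5 class indices by probability; these are integer indices into the
# model's 735-class list, not species names.
topk = np.argsort(out[0])[::-1][:5]
for idx in topk:
    print(f"class {idx}: probability {out[0][idx]:.4f}")

Image embeddings: extract a feature vector instead of (or in addition to) class probabilities.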
import birder
from birder.inference.classification import infer_image
(net, model_info) = birder.load_pretrained_model("swin_transformer_v2_s_intermediate-arabian-peninsula", inference=True)
# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)
# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)
image = "path/to/image.jpeg" # or a PIL image
(out, embedding) = infer_image(net, image, transform, return_embedding=True)
# embedding is a NumPy array with shape (1, 768)
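Embeddings can be compared directly, e.g. for retrieval or near-duplicate detection. A minimal sketch, assuming two placeholder image paths and the net and transform objects from above:

import numpy as np
from birder.inference.classification import infer_image
# Cosine similarity between the embeddings of two images; values close to 1
# indicate that the model considers the images visually similar.
(_, emb_a) = infer_image(net, "path/to/image_a.jpeg", transform, return_embedding=True)
(_, emb_b) = infer_image(net, "path/to/image_b.jpeg", transform, return_embedding=True)
a, b = emb_a[0], emb_b[0]
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cos_sim:.4f}")

Detection feature map: use the model as a multi-scale backbone, e.g. for object detection.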
from PIL import Image
import birder
(net, model_info) = birder.load_pretrained_model("swin_transformer_v2_s_intermediate-arabian-peninsula", inference=True)
# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)
# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)
image = Image.open("path/to/image.jpeg").convert("RGB")  # ensure 3-channel RGB input
features = net.detection_features(transform(image).unsqueeze(0))
# features is a dict (stage name -> torch.Tensor)
print([(k, v.size()) for k, v in features.items()])
# Output example:
# [('stage1', torch.Size([1, 96, 96, 96])),
# ('stage2', torch.Size([1, 192, 48, 48])),
# ('stage3', torch.Size([1, 384, 24, 24])),
# ('stage4', torch.Size([1, 768, 12, 12]))]
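The four stages form a standard feature pyramid: channels double while spatial resolution halves at each stage, which is the layout FPN-style detection necks consume. As a minimal illustration (not part of any detection API), global-average-pooling the coarsest stage yields a (1, 768) whole-image descriptor:

# Pool stage4 over its spatial dimensions; a detection neck would instead
# consume all four stages at their native resolutions.
pooled = features["stage4"].mean(dim=(2, 3))
print(pooled.size())  # torch.Size([1, 768])

Citation: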
@misc{liu2022swintransformerv2scaling,
title={Swin Transformer V2: Scaling Up Capacity and Resolution},
author={Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
year={2022},
eprint={2111.09883},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2111.09883},
}