# Vision Transformer (ViT) Fine-Tuned Model
This repository contains a fine-tuned version of google/vit-large-patch16-224, optimized for a custom image classification task.
## Model Overview
- Base model: `google/vit-large-patch16-224`
- Architecture: Vision Transformer (ViT)
- Patch size: 16×16
- Image resolution: 224×224
- Frameworks: PyTorch, Hugging Face Transformers
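
As a quick check, these architecture details can be read directly from the checkpoint's configuration. This is a minimal sketch that assumes network access to the Hub; the repository id `rakib730/output-models` is the one used in the usage examples below:

```python
from transformers import AutoConfig

# Load only the configuration of the fine-tuned checkpoint (no weights needed).
config = AutoConfig.from_pretrained("rakib730/output-models")

# These values should match the ViT-Large base architecture described above.
print(config.model_type)   # "vit"
print(config.patch_size)   # 16
print(config.image_size)   # 224
print(config.hidden_size)  # 1024 for ViT-Large
```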
## Performance
| Metric                  | Value             |
|-------------------------|-------------------|
| Final Validation Loss   | 0.3268            |
| Lowest Validation Loss  | 0.2548 (Epoch 18) |
Training loss and validation loss trends indicate good convergence with slight overfitting after ~30 epochs.
## Training Configuration
| Hyperparameter     | Value                                                                        |
|--------------------|------------------------------------------------------------------------------|
| Learning rate      | 2e-5                                                                         |
| Train batch size   | 20                                                                           |
| Eval batch size    | 8                                                                            |
| Optimizer          | AdamW (betas=(0.9, 0.999), eps=1e-8)                                         |
| LR scheduler       | Linear                                                                       |
| Epochs             | 40                                                                           |
| Seed               | 42                                                                           |
| Framework versions | Transformers 4.52.4, PyTorch 2.6.0+cu124, Datasets 3.6.0, Tokenizers 0.21.2  |
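
For reference, the table above roughly corresponds to the following `TrainingArguments` setup. This is a minimal sketch, not the exact training script; the `output_dir` name is an assumption, and `load_best_model_at_end` is a suggested addition (not listed in the table) that would retain the lowest-loss checkpoint from epoch 18:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output-models",        # assumed name, chosen to match the repository id
    learning_rate=2e-5,
    per_device_train_batch_size=20,
    per_device_eval_batch_size=8,
    num_train_epochs=40,
    lr_scheduler_type="linear",
    seed=42,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # not in the table: would keep the epoch-18 (best) checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
# The Trainer's default optimizer is AdamW with betas=(0.9, 0.999) and eps=1e-8,
# matching the optimizer settings listed above.
```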
## Training Results
| Epoch | Step | Validation Loss |
|-------|------|-----------------|
| 1     | 24   | 0.5601          |
| 5     | 120  | 0.3421          |
| 10    | 240  | 0.2901          |
| 14    | 336  | 0.2737          |
| 18    | 432  | 0.2548          |
| 40    | 960  | 0.3268          |
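
At a train batch size of 20, the 24 optimization steps logged per epoch suggest a training set of roughly 480 images.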
## Intended Uses
- Image classification on datasets with characteristics similar to the training dataset.
- Fine-tuning for domain-specific classification tasks (see the sketch below).
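
If your target task uses a different label set than this checkpoint, a common approach is to reload the backbone with a fresh classification head. The following is a minimal sketch; the three labels are hypothetical and only for illustration:

```python
from transformers import AutoModelForImageClassification

# Hypothetical target labels, for illustration only.
id2label = {0: "cat", 1: "dog", 2: "bird"}
label2id = {v: k for k, v in id2label.items()}

model = AutoModelForImageClassification.from_pretrained(
    "rakib730/output-models",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # replaces the old head with a new, randomly initialized one
)
# The model can now be fine-tuned on the new dataset with the usual Trainer setup.
```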
## Limitations
- Trained on a custom dataset; it may not generalize well to unrelated domains without additional fine-tuning.
- No guarantees are made regarding fairness, bias, or ethical implications; the training data has not been analyzed for these aspects.
## How to Use
You can use this model in two main ways:
### 1. Using the High-Level `pipeline` API
```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an image-classification pipeline.
pipe = pipeline("image-classification", model="rakib730/output-models")

# The pipeline accepts a local path, a PIL image, or an image URL.
result = pipe("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png")
print(result)
```
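
The call returns a list of dictionaries with `label` and `score` keys, ordered from most to least likely class.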
### 2. Using the Processor and Model Directly
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests
import torch

# Load the image processor and the fine-tuned model.
processor = AutoImageProcessor.from_pretrained("rakib730/output-models")
model = AutoModelForImageClassification.from_pretrained("rakib730/output-models")

# Download an example image and convert it to RGB.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parrots.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Preprocess the image and run a forward pass without tracking gradients.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the class with the highest logit and map it back to its label.
logits = outputs.logits
predicted_class_id = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_id])
```