In actual testing, it is less than 10% faster compared to just using fp16.
Load:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model.to("cuda")
model.eval()
model.half()
pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to("cuda")
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
print(scores)
If you use ONNX Runtime on GPU with the O4 optimization level, it will be faster than ctranslate2.
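For reference, the local ./onnxO4_bge_reranker_large directory used in the benchmark below can be produced with optimum. This is only a sketch of one way to do the export and O4 optimization; exact APIs and flags may differ slightly between optimum versions, and the equivalent CLI shown in the comment is an assumption:

# Roughly equivalent CLI (assumed): optimum-cli export onnx --model BAAI/bge-reranker-large --optimize O4 --device cuda ./onnxO4_bge_reranker_large
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

save_dir = "./onnxO4_bge_reranker_large"

# Export the PyTorch checkpoint to ONNX
ort_model = ORTModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-large", export=True)

# O4 = fp16 + aggressive graph fusions; GPU-only
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(save_dir=save_dir, optimization_config=AutoOptimizationConfig.O4())
# Note: the optimized graph may be written as model_optimized.onnx inside save_dir

# Save the tokenizer next to the optimized model so both load from the same directory
AutoTokenizer.from_pretrained("BAAI/bge-reranker-large").save_pretrained(save_dir)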
Below is a quick benchmark (on an A10 GPU).
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
import time
import torch
device_mapping = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("./onnxO4_bge_reranker_large")
model = ORTModelForSequenceClassification.from_pretrained("./onnxO4_bge_reranker_large").to(device_mapping)
pairs = [['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]*1024
t0 = time.time()
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to(device_mapping)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
t1 = time.time()
print(f"Seconds: {t1-t0}")
# Seconds: 1.3976035118103027
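For a like-for-like comparison, the same timing harness can be run against the plain transformers fp16 model. This is a sketch, not the exact script behind the 0.9 s figure below; the torch.cuda.synchronize() call is added here so the GPU work is actually included in the measured time, and results will vary with hardware:

import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-large")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-large").half().to("cuda").eval()

pairs = [['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']] * 1024

t0 = time.time()
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to("cuda")
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
torch.cuda.synchronize()  # make sure the GPU has finished before stopping the timer
t1 = time.time()
print(f"Seconds: {t1 - t0}")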
I tried to convert the model weights using both O3 and O4 (--device cuda). I ran into some issues, but in both cases the average time for a batch of 1024 was 1.39 seconds, vs 0.8 for ctranslate2 and 0.9 for fp16. It seems like fp16 is definitely a good competitor! Have you tried converting the weights to ONNX O4 and benchmarking it too?
https://colab.research.google.com/drive/1HP9GQKdzYa6H9SJnAZoxJWq920gxwd2k?usp=sharing
bge-reranker-base: ONNX O4 is 2x faster than fp16.
bge-reranker-large: same result.
