The model is split into two parts: vision_model and text_decoder. Run the vision model once and capture its outputs, encoder_hidden_states and encoder_attention_mask. Feed them as inputs to the text decoder and generate the image caption.
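A minimal sketch of that two-stage flow. The two stub functions below are hypothetical stand-ins for the exported ONNX graphs (with onnxruntime you would load them via `InferenceSession` and call `session.run`, which requires the actual model files); the token ids, vocabulary size, and tensor shapes are assumptions for illustration, not values read from this export.

```python
import numpy as np

BOS_ID = 101            # assumed begin-of-sequence token id
EOS_ID = 102            # assumed end-of-sequence token id
VOCAB, HIDDEN = 30524, 768

def run_vision_model(pixel_values):
    """Stub: the real vision_model maps pixel_values to
    encoder_hidden_states and encoder_attention_mask."""
    rng = np.random.default_rng(0)
    hidden = rng.standard_normal((1, 577, HIDDEN)).astype(np.float32)
    mask = np.ones((1, 577), dtype=np.int64)
    return hidden, mask

def run_text_decoder(input_ids, encoder_hidden_states, encoder_attention_mask):
    """Stub: the real text_decoder returns next-token logits."""
    rng = np.random.default_rng(input_ids.shape[1])
    logits = rng.standard_normal((1, input_ids.shape[1], VOCAB)).astype(np.float32)
    if input_ids.shape[1] >= 5:   # force EOS so this demo terminates
        logits[0, -1, EOS_ID] = 1e9
    return logits

def generate_caption_ids(pixel_values, max_len=20):
    # Step 1: run the vision model once and cache its outputs.
    enc_hidden, enc_mask = run_vision_model(pixel_values)
    # Step 2: greedy decode, feeding the cached encoder outputs every step.
    ids = np.array([[BOS_ID]], dtype=np.int64)
    for _ in range(max_len):
        logits = run_text_decoder(ids, enc_hidden, enc_mask)
        next_id = int(logits[0, -1].argmax())
        ids = np.concatenate([ids, [[next_id]]], axis=1)
        if next_id == EOS_ID:
            break
    return ids[0].tolist()

pixels = np.zeros((1, 3, 384, 384), dtype=np.float32)
caption_ids = generate_caption_ids(pixels)
```

The decoded ids would then be detokenized with the model's tokenizer to produce the caption text.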
Converted to ONNX from the source model: https://huggingface.co/Salesforce/blip-image-captioning-base