BGE Drops a Multimodal Embedding Model That Lets You Search Anything with Anything
If you've ever tried to build a visual search system that mixes text and images — like "find me a red shoe that looks like this photo" — you know the pain. You either hack together separate models for text and image embeddings and hope they land in the same vector space, or you overpay for a closed API.
BGE's latest multimodal embedding model just made that much simpler. It's designed to take any combination of text and image inputs and produce a unified embedding, ready for retrieval, clustering, or classification.
What It Does
This is a new model from the FlagEmbedding team (the same folks behind the popular BGE text embedding models). The big news: it handles both text and images in a single embedding space.
You can feed in:
- Text only
- Image only
- Text + image together
And it will output a single vector that represents the combined meaning. That means you can search images with text, text with images, or even image-to-image using the same model. No separate pipelines, no alignment tricks.
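To make the "same vector space" idea concrete, here's a minimal sketch of what retrieval looks like once everything is a vector. The three-dimensional vectors below are hand-made placeholders, not real model output, and the `cosine` and `rank` helpers are hypothetical names used only for illustration. The point is that ranking doesn't care whether a vector came from text, an image, or both.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for real embeddings (assumption).
# Captions and photos live side by side in one collection.
corpus = {
    "caption: red sports car": [0.9, 0.1, 0.0],
    "photo: red_car.jpg": [0.8, 0.2, 0.1],
    "caption: bowl of fruit": [0.0, 0.1, 0.9],
}

def rank(query_vec, corpus):
    """Return corpus keys sorted by descending similarity to the query."""
    return sorted(corpus, key=lambda k: cosine(query_vec, corpus[k]), reverse=True)

# Pretend this vector came from a text, image, or text+image query.
query = [1.0, 0.0, 0.0]
ranking = rank(query, corpus)
print(ranking)
```

With real embeddings you'd swap the placeholder vectors for model output and keep the ranking logic as is.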
The model is open source and available on GitHub under the FlagOpen/FlagEmbedding repository. It builds on their existing architecture, so if you've used BGE before, the API will feel familiar.
Why It's Cool
A few things stand out:
1. Any-to-any retrieval
Many multimodal retrieval setups are built around one query direction (typically text-to-image). This one lets you mix and match: text-to-image, image-to-text, image-to-image, or text+image-to-text. That flexibility is useful for RAG pipelines, where a query can arrive as text, an image, or both.
2. No separate encoders to align
You don't need to train or fine-tune a separate model to map text and image embeddings into the same space. It's baked into the model. That saves time and reduces complexity in production.
3. Lightweight and efficient
Compared to some of the huge vision-language models, this one is relatively compact. You can run it on a single GPU, or even on CPU for smaller workloads.
4. Works with existing BGE tools
If you already use BGE for text embedding, you can plug this in without rewriting your retrieval infrastructure. Same API, same vector database workflows.
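Here's a rough sketch of how points 1 and 4 play out in practice: one index holds text and image vectors together, and a metadata filter picks the retrieval direction. `TinyIndex` and `Item` are hypothetical stand-ins for whatever vector database you already run, and the two-dimensional vectors are made up. The sketch assumes embeddings are L2-normalized, so a plain dot product behaves like cosine similarity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Item:
    item_id: str
    modality: str        # "text" or "image"
    vector: List[float]  # assumed to be L2-normalized

@dataclass
class TinyIndex:
    """In-memory stand-in for a vector database (illustration only)."""
    items: List[Item] = field(default_factory=list)

    def add(self, item: Item) -> None:
        self.items.append(item)

    def search(self, query: List[float], k: int = 3,
               modality: Optional[str] = None) -> List[Tuple[float, str]]:
        """Top-k (score, id) pairs by dot product, optionally for one modality."""
        pool = [it for it in self.items
                if modality is None or it.modality == modality]
        scored = [(sum(q * v for q, v in zip(query, it.vector)), it.item_id)
                  for it in pool]
        scored.sort(reverse=True)
        return scored[:k]

index = TinyIndex()
index.add(Item("doc-1", "text", [1.0, 0.0]))
index.add(Item("img-1", "image", [0.8, 0.6]))
index.add(Item("img-2", "image", [0.0, 1.0]))

query = [0.6, 0.8]  # could come from a text, image, or combined query
all_hits = index.search(query, k=2)                       # any modality
image_hits = index.search(query, k=1, modality="image")   # x-to-image
text_hits = index.search(query, modality="text")          # x-to-text
print(all_hits, image_hits, text_hits)
```

Most vector databases support this kind of metadata filtering, so the same pattern should carry over without a second index per modality.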
How to Try It
The model is available on Hugging Face and installable via the FlagEmbedding library. Here's a quick way to get started:
pip install -U FlagEmbedding
Then in Python:
from FlagEmbedding import FlagMultimodalModel
model = FlagMultimodalModel(
    model_name_or_path="BAAI/BGE-VL-Multimodal",
    normalize_embeddings=True
)
# Text query
text_emb = model.encode_text("a red sports car")
# Image query
image_emb = model.encode_image("car.jpg")
# Combined query
combined_emb = model.encode(["a red sports car", "car.jpg"])
# Now use any of these in your vector search
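The repo has more detailed examples for batch processing and for integration with vector stores like Milvus or FAISS. Confirm the exact class name and model ID against the repo README before running the snippet above, since the multimodal BGE releases don't all share one interface.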
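If you want a feel for batch processing before digging into the repo, here's a minimal sketch of the usual pattern: split the inputs into fixed-size chunks and encode each chunk in one call. The `batched` and `encode_in_batches` helpers are hypothetical, and `fake_encode` is a stand-in that returns dummy vectors. Replace it with whatever encode call the library actually exposes.

```python
from typing import Callable, Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_in_batches(
    items: Sequence[str],
    encode_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Encode items chunk by chunk and return one vector per item, in order."""
    vectors: List[List[float]] = []
    for batch in batched(items, batch_size):
        vectors.extend(encode_fn(batch))
    return vectors

def fake_encode(batch: Sequence[str]) -> List[List[float]]:
    """Stand-in encoder (assumption): returns a dummy 2-d vector per input."""
    return [[float(len(s)), 1.0] for s in batch]

paths = [f"img_{i}.jpg" for i in range(5)]
vectors = encode_in_batches(paths, fake_encode, batch_size=2)
print(len(vectors))
```

Batch size is a memory trade-off: larger batches usually mean better throughput on a GPU, but images are heavier than text, so start small and increase it.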
Final Thoughts
This isn't a flashy "AI breakthrough" announcement. It's a practical tool that solves a real problem: how to search across modalities without building custom pipelines. For developers working on visual search, RAG with images, or any kind of multimodal recommendation system, this is worth a look.
The fact that it's open source and works with their existing ecosystem makes it easy to try out. I'd start with a simple image-to-text search and see how it feels. The alignment is surprisingly good for a model this size.
Check the repo for the full details: https://github.com/FlagOpen/FlagEmbedding