Elastic launches multimodal AI search models for text, image, video, audio
Elastic (NYSE: ESTC) announced the release of jina-embeddings-v5-omni, a family of multimodal embedding models that can process text, images, video, and audio content for search applications. The models are available in two sizes: small and nano.
The new models share the same text embedding space as the existing jina-embeddings-v5-text model, allowing users to integrate multimedia content into existing systems without rebuilding their index infrastructure. Users can perform search, classification, clustering, and deduplication across different media types.
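As a rough illustration of why a shared embedding space matters, the sketch below ranks items of different media types against a single query vector using cosine similarity. The vectors are made-up three-dimensional stand-ins, not output from any jina-embeddings model, and the `search` helper and item IDs are hypothetical names chosen for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (assumed values, for illustration only).
# Because all media types live in one space, they can share one index.
index = {
    "doc:report.txt": [0.9, 0.1, 0.0],
    "img:cat.jpg": [0.1, 0.9, 0.1],
    "audio:meow.wav": [0.2, 0.8, 0.3],
}

def search(query_vec, index, top_k=2):
    """Return the IDs of the top_k items most similar to the query."""
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]

# A hypothetical text-query embedding that happens to sit near the cat-related items.
query = [0.1, 0.95, 0.1]
results = search(query, index)
print(results)
```

In a production system the ranking would typically be handled by a vector index rather than a linear scan, but the comparison itself is the same regardless of which media type produced each vector.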
"Our goal with v5-omni is simple: make multimodal search as easy and scalable as text search already is," said Ken Exner, chief product officer at Elastic. "By building on existing models and ensuring full compatibility, we're giving teams a practical way to expand into images, audio, and video, without starting from scratch."
The models feature a modular design that allows users to toggle text, image, and audio processing features as needed. They offer adjustable embedding sizes to balance accuracy, speed, and cost requirements.
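The announcement does not say how embedding sizes are adjusted. One common approach, assumed here purely for illustration, is to keep the leading dimensions of a vector and renormalize it, trading some accuracy for smaller storage and faster comparison. The `truncate_embedding` helper and the sample values below are hypothetical.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length.

    Assumes the model was trained so that leading dimensions carry the
    most information; the article does not confirm this for v5-omni.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]

# Made-up 4-dimensional vector standing in for a full-size embedding.
full = [3.0, 4.0, 12.0, 84.0]
small = truncate_embedding(full, 2)
print(len(full), "->", len(small))
```

Halving the dimension halves the storage per vector, which is the kind of accuracy, speed, and cost trade-off the adjustable sizes are meant to expose.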
According to Elastic's performance evaluations, the v5-omni models achieved top rankings in their size class on several benchmarks. On the Massive Audio Embedding Benchmark, the models outperformed larger systems. For image processing, the v5-omni-small model ranked as the top performer in the 1-billion-parameter range on the Massive Image Embedding Benchmark and the Visual Document Retrieval Benchmark.
Both models, jina-embeddings-v5-omni-small and jina-embeddings-v5-omni-nano, are available through Elastic Inference Service and the Jina API, and can also be installed locally. Model weights are distributed under a non-commercial license, with commercial licensing available through Elastic sales.
