Multimodal retrieval for voice and video.
Text-in, text-out RAG has become a commodity. The next wave of products we are shipping retrieves across audio, video, and images, and most of the standard RAG playbook has to be rewritten. Here is the shape of the new problem.
A client asked about a podcast intelligence tool we built: "Why does the user ever get the wrong clip?" The answer is multimodal retrieval, and specifically the mistakes that make it break in ways text-only RAG never does.
You cannot chunk time the way you chunk text
A transcript chunked by 500 tokens puts the answer to "what did the guest say about fundraising?" in the middle of one chunk and the follow-up question in another. Voice is structured by turn, by topic, by emotion — not by token count. Chunk by semantic boundary (topic-shift detection, speaker-change, silence) and you will retrieve the actual unit of meaning.
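A minimal sketch of boundary-based chunking, assuming diarised transcript segments with timestamps are already available (the `Segment` shape and the 1.5-second silence threshold are illustrative, not a real API):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float   # seconds
    end: float
    text: str

def chunk_by_boundary(segments, max_silence=1.5):
    """Merge consecutive segments into chunks, starting a new chunk
    on a speaker change or a silence gap longer than max_silence.
    Chunks follow the structure of the conversation, not token counts."""
    chunks, current = [], []
    for seg in segments:
        if current:
            prev = current[-1]
            gap = seg.start - prev.end
            if seg.speaker != prev.speaker or gap > max_silence:
                chunks.append(current)
                current = []
        current.append(seg)
    if current:
        chunks.append(current)
    return chunks
```

Topic-shift detection would slot in as one more boundary condition in the same loop, e.g. a drop in embedding similarity between consecutive segments.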
Embeddings need the audio, not just the text
Text embeddings of a transcript throw away the meaning that lives in prosody, tone, pace, emphasis. For some queries — "find the moment where the host got annoyed" — the text embedding cannot possibly answer. Use a multimodal embedding (CLAP for audio, CLIP for images, a video-text model such as VideoCLIP for video) or embed both modalities and retrieve against both, ranked together.
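"Retrieve against both, ranked together" can be as simple as a weighted blend of cosine scores. A sketch, assuming each indexed chunk already carries a text embedding and an audio embedding (the index layout and the `w_text` weight are hypothetical tuning choices, not a prescribed design):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity; epsilon guards against zero-norm vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def dual_retrieve(query_text_emb, query_audio_emb, index, w_text=0.5, k=3):
    """index: list of dicts with 'id', 'text_emb', 'audio_emb'.
    Score each chunk against both modalities and rank by the blend,
    so a query about tone can win on the audio side even when the
    transcript text is unremarkable."""
    scored = []
    for item in index:
        score = (w_text * cosine(query_text_emb, item["text_emb"])
                 + (1 - w_text) * cosine(query_audio_emb, item["audio_emb"]))
        scored.append((score, item["id"]))
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]
```

In practice the two query embeddings come from the matching encoders (e.g. CLAP's text tower for the audio side), and `w_text` is tuned per query type.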
Temporal precision is the product
A user who searches for a moment wants the moment — timestamped, quotable, shareable. Retrieval that returns "the third chunk of episode 47" is useless. The output is a span: start timestamp, end timestamp, in the original media. Every component in the pipeline carries this span. Every answer is hyperlinkable to the exact second.
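One way to make "every component carries this span" concrete is a single immutable span type that travels from ingestion to the UI. A minimal sketch (the deep-link URL format is hypothetical; the point is that every result resolves to a replayable second):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """A timestamped range in the original media. Chunker, index,
    retriever, and generator all pass this through unchanged."""
    episode_id: str
    start_s: float
    end_s: float

    def url(self, base="https://player.example.com"):
        # Hypothetical player deep-link; swap in whatever your UI uses.
        return f"{base}/{self.episode_id}?t={self.start_s:.0f}"
```

Making the type frozen is deliberate: no stage of the pipeline gets to "round off" a span, so the second the user lands on is the second retrieval found.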
Grounding at answer time
A text answer can cite a paragraph. A multimodal answer has to cite a clip, a frame, a range. Your generator outputs structured references that your UI can render — "see 14:32–14:58 of episode 47" — not "the guest mentioned fundraising somewhere". Make the model's output structured, make your UI consume that structure, and the user gets the evidence along with the answer.
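A sketch of what "structured references your UI can render" might look like, assuming the generator is constrained to emit JSON alongside its prose (the field names are illustrative, not a fixed schema):

```python
import json

def render_citation(ref):
    """Turn a structured clip reference from the generator into the
    human-readable label the UI shows next to the answer text."""
    def ts(seconds):
        return f"{int(seconds) // 60}:{int(seconds) % 60:02d}"
    return f"see {ts(ref['start_s'])}\u2013{ts(ref['end_s'])} of episode {ref['episode']}"

# Example of generator output the UI consumes: prose plus machine-readable refs.
answer = json.loads(
    '{"text": "The guest raised a pre-seed round.",'
    ' "refs": [{"episode": 47, "start_s": 872, "end_s": 898}]}'
)
```

Because the refs are data rather than prose, the UI can render them as clickable clips, and a missing or malformed ref fails loudly at parse time instead of silently producing "somewhere in the episode".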
“If the answer does not link back to a span the user can replay, you have built a summariser — not a retrieval system.”
What we ran into last time
- Whisper gave us accurate transcripts but missed emotion. We added a prosody encoder and a mood classifier on top.
- Embedding long audio directly was too slow for interactive queries. We cached embeddings per turn at ingestion.
- Speaker diarisation errors cascaded — attributing the host's question to the guest made retrieval useless. We now validate diarisation with a cheap secondary pass before indexing.
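The diarisation check in the last bullet can be sketched as an agreement score between the two passes; below some threshold, the episode is sent back for re-diarisation instead of being indexed. A minimal version, assuming each pass yields `(speaker, start, end)` tuples (the midpoint-lookup heuristic is an illustrative simplification):

```python
def diarisation_agreement(primary, secondary):
    """Fraction of primary segments whose speaker label is confirmed
    by the secondary pass, checked at each segment's midpoint."""
    def speaker_at(segments, t):
        for spk, start, end in segments:
            if start <= t <= end:
                return spk
        return None

    agree = 0
    for spk, start, end in primary:
        midpoint = (start + end) / 2
        if speaker_at(secondary, midpoint) == spk:
            agree += 1
    return agree / max(len(primary), 1)
```

A real implementation would align labels first (the two passes may name speakers differently) and weight by segment duration, but the gate is the same: disagreement above a threshold blocks indexing.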