Deploying VLMs Over Arbitrary Length Video, Live or Offline
February 2025
One of the key failure modes for VLMs is processing long videos that fill the model's entire context window. Some examples of how much video current models can process at once:
- Gemini 2.5: 45 mins [1]
- OpenAI GPT-4V: ~30 mins (depending on frame sampling and compression) [2]
- LLaVA-1.6 (Large Language and Vision Assistant): ~10–15 mins [3]
- Flamingo (DeepMind): ~5 mins for dense frame sampling [4]
These figures are approximations: the actual length of video that can be processed at once depends on the extraction rate, compression, and ingestion format.
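As a rough illustration of why the limit moves with the extraction rate, the sketch below estimates how many minutes of video fit in a context window. The function name and every number in it (context size, frames per second, tokens per frame) are hypothetical values chosen for the arithmetic, not measurements of any model listed above.

```python
def max_video_minutes(
    context_tokens: int,
    frames_per_second: float,
    tokens_per_frame: int,
    reserved_tokens: int = 0,
) -> float:
    """Estimate how many minutes of video fit in a context window.

    All parameters are illustrative assumptions:
      context_tokens    - total context window of the model
      frames_per_second - frame extraction rate
      tokens_per_frame  - tokens one encoded frame costs
      reserved_tokens   - budget held back for the prompt and the answer
    """
    if frames_per_second <= 0 or tokens_per_frame <= 0:
        raise ValueError("frames_per_second and tokens_per_frame must be positive")
    usable = context_tokens - reserved_tokens
    if usable <= 0:
        return 0.0
    tokens_per_second = frames_per_second * tokens_per_frame
    return usable / tokens_per_second / 60.0


# Hypothetical model: 128,000-token window, 256 tokens per frame.
at_one_fps = max_video_minutes(128_000, 1.0, 256)
at_half_fps = max_video_minutes(128_000, 0.5, 256)
```

Under these assumed numbers, halving the extraction rate doubles the video that fits, which is why the limits above are best read as ranges rather than fixed values.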
Our solution is to selectively feed the VLM portions of the video that are relevant to the prompt, rather than the entire video at once. To do this, we search the video for the segments of interest and run inference on only those parts.
Implementation
The underlying search workflow semantically clips the incoming video and embeds each clip, producing a 1024-dimensional vector per clip. At query time, we run the prompt through a small LLM that expands it into a more descriptive query, embed that query in the same space, and retrieve the clips most similar to it. The same embedding search is also run over the transcript and OCR (on-screen text).
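The retrieval step can be sketched as a nearest-neighbor lookup over clip embeddings. This is a minimal sketch, not our production code: the `Clip` type, the `top_k` function, and the choice of cosine similarity are assumptions made for illustration, and the vectors would be 1024-dimensional outputs of an embedding model rather than the toy values a test might use.

```python
import math
from dataclasses import dataclass


@dataclass
class Clip:
    """A hypothetical record for one semantically clipped segment."""
    start: float            # clip start time in seconds
    end: float              # clip end time in seconds
    embedding: list[float]  # vector from the embedding model


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: list[float], clips: list[Clip], k: int = 3):
    """Return the k (score, clip) pairs most similar to the query, best first."""
    scored = [(cosine(query_embedding, clip.embedding), clip) for clip in clips]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The same function would apply to transcript and OCR embeddings, as long as the query is embedded in the matching space. A brute-force scan like this is fine for a handful of videos; a larger index would likely swap in an approximate nearest-neighbor structure.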
We combine video search and a VLM into a workflow we call inquire. Given a prompt, an agent breaks it into discrete search queries, and the context those searches return is fed to the VLM for inference. Importantly, the model never sees the whole video, only the portions that are relevant to the prompt.
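The shape of that workflow can be sketched as follows. Everything here is hypothetical scaffolding: `decompose`, `search`, and `answer` stand in for the agent, the video search, and the VLM call, and `merge_segments` is one plausible way to avoid sending overlapping footage twice, not a description of the actual implementation.

```python
from typing import Callable

Segment = tuple[float, float]  # (start_seconds, end_seconds)


def merge_segments(segments: list[Segment], max_gap: float = 0.0) -> list[Segment]:
    """Merge segments that overlap or sit within max_gap seconds of each other."""
    merged: list[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1] + max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def inquire(
    prompt: str,
    decompose: Callable[[str], list[str]],        # agent: prompt -> search queries
    search: Callable[[str], list[Segment]],       # video search: query -> segments
    answer: Callable[[str, list[Segment]], str],  # VLM: prompt + segments -> answer
) -> str:
    """Answer a prompt using only the segments retrieved for it."""
    hits: list[Segment] = []
    for query in decompose(prompt):
        hits.extend(search(query))
    context = merge_segments(hits)
    # The VLM receives only the merged segments, never the full video.
    return answer(prompt, context)
```

Passing the three stages in as callables keeps the sketch independent of any particular agent, index, or model.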
Because we retrieve the relevant scenes at query time, this system can also run in real time. Instead of uploading a video, you can stream it in: the same embedding process described above runs as the video arrives and inserts each clip into a live index. The inquire workflow runs as it did before, now on this live index.
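A live index of this kind can be sketched as an append-only store that is searchable between inserts. The `LiveIndex` class and its methods are hypothetical names for illustration, and the embeddings are assumed to arrive from the same model used for offline video.

```python
import math


class LiveIndex:
    """A minimal in-memory index that accepts clips while remaining searchable."""

    def __init__(self) -> None:
        # Each entry is (timestamp_seconds, unit-length embedding).
        self._entries: list[tuple[float, list[float]]] = []

    def __len__(self) -> int:
        return len(self._entries)

    @staticmethod
    def _normalize(vector: list[float]) -> list[float]:
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0.0:
            raise ValueError("cannot index or query with a zero vector")
        return [x / norm for x in vector]

    def add(self, timestamp: float, embedding: list[float]) -> None:
        """Insert a newly embedded clip as soon as it arrives from the stream."""
        self._entries.append((timestamp, self._normalize(embedding)))

    def search(self, query_embedding: list[float], k: int = 3):
        """Return up to k (timestamp, score) pairs by cosine similarity, best first."""
        query = self._normalize(query_embedding)
        scored = [
            (timestamp, sum(q * v for q, v in zip(query, vector)))
            for timestamp, vector in self._entries
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

Because vectors are normalized on insert, each search reduces to dot products over whatever has been indexed so far, so a query issued mid-stream simply sees the clips that have already arrived.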
Try it yourself
The demos below run against this pipeline live. The first two work over a fixed set of example videos; the third streams from your own webcam.
Search
Search the example videos for objects, actions, or scenes. Results are grouped by how they were found — visual embedding, transcript, or on-screen text — and clicking one jumps the video to that moment.
Search Through These Videos
Inquire
Ask a question across the same videos. The agent breaks your prompt into discrete searches, then feeds only the relevant scenes to the VLM, which answers with citations that link back into the footage.
Ask Questions About These Videos
Live webcam
This one streams your webcam straight into the pipeline. Start the stream, let it watch for a moment, then ask it what it sees. Frames are indexed on arrival, so questions are answered against the last few seconds of live video. Sessions are capped at one minute.
Ask Your Live Camera
In Practice
We have used the same streaming primitive across a range of cases, including a support agent that watches a user's screen and walks them through a Figma file in real time, and a game driven by hand-gesture controls at sub-100-millisecond latency. Check out the demo below: