Deploying VLMs over Arbitrary Length Video
January 2025
What's up with VLMs Right Now
One of the key failure modes for VLMs is processing videos long enough to fill the model's context window. Some examples of effective limits include:
- Gemini 2.5: 45 mins [1]
- OpenAI GPT-4V: ~30 mins (depending on frame sampling and compression) [2]
- LLaVA-1.6 (Large Language and Vision Assistant): ~10–15 mins [3]
- Flamingo (DeepMind): ~5 mins for dense frame sampling [4]
The values above are approximate video lengths; the actual limits depend on a number of factors, including frame extraction rate, compression, and whether the model supports sliding-window or chunked context ingestion.
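To make the context pressure concrete, here is a rough back-of-the-envelope token budget. The frames-per-second, tokens-per-frame, and context-window figures below are illustrative assumptions rather than measurements of any specific model:

```python
# Back-of-the-envelope token budget for a long video (illustrative numbers only).
SAMPLE_FPS = 1              # assumed frame extraction rate: 1 frame per second
TOKENS_PER_FRAME = 255      # assumed visual tokens per frame after the vision encoder
CONTEXT_WINDOW = 1_000_000  # assumed model context window in tokens

video_minutes = 120         # a two-hour lecture
frames = video_minutes * 60 * SAMPLE_FPS
visual_tokens = frames * TOKENS_PER_FRAME

print(f"{frames} frames -> ~{visual_tokens:,} visual tokens")
print(f"Fits in context: {visual_tokens < CONTEXT_WINDOW}")
# 7200 frames -> ~1,836,000 visual tokens: well past a 1M-token window
```

Even at a sparse 1 fps sampling rate, a two-hour video overflows a generous context window before the prompt or any transcript is accounted for.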
Beyond the raw context limitation, tasks like action recognition, event detection, and narrative understanding become unreliable when videos exceed a manageable input size. Moreover, real-world applications such as lectures, surveillance, sports broadcasting, and autonomous vehicle monitoring all involve continuous streams of hours-long footage.
Without specialized handling, VLMs cannot efficiently reason over extended visual inputs, leaving a gap between model performance and production needs.
Implementation
Our solution to this problem is an agent that selectively feeds the VLM only the portions of the video that are relevant to the prompt, rather than the entire video. To achieve this, we search through the video for segments of interest and run inference on just those parts.
Video Search
Video search is performed by embedding the video with an image encoder (we used a fork of CLIP). During search, we embed the prompt and identify the frames most similar to it. While this simple approach can generate results, it suffers from both efficiency and quality limitations.
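The sketch below shows what this baseline retrieval step can look like. It uses an off-the-shelf CLIP checkpoint from Hugging Face Transformers; the post uses a fork of CLIP, so the model name here is an assumption, not our actual encoder:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical prompt-to-frame retrieval with a stock CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_frames(frames: list[Image.Image]) -> torch.Tensor:
    """Embed a batch of frames and L2-normalize for cosine similarity."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def search(prompt: str, frame_feats: torch.Tensor, top_k: int = 8) -> list[int]:
    """Return indices of the frames most similar to the text prompt."""
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = frame_feats @ text_feat.T          # cosine similarity per frame
    return scores.squeeze(-1).topk(top_k).indices.tolist()
```

Embedding every frame this way is exactly the efficiency problem described next: a two-hour video at 1 fps means 7,200 encoder passes before a single query is answered.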
To improve efficiency, instead of embedding every frame, we only embed frames that matter. We accomplish this by running the video through a scene detection pipeline and embedding entire scenes rather than individual frames.
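One way to implement this, assuming PySceneDetect for shot boundaries (the post does not name its scene-detection pipeline) and reusing the `embed_frames` helper from the sketch above, is to embed a single representative frame per scene:

```python
import cv2
from PIL import Image
from scenedetect import ContentDetector, detect

def scene_embeddings(video_path: str):
    """Detect scenes, then embed one midpoint frame per scene instead of every frame."""
    scenes = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    cap = cv2.VideoCapture(video_path)
    results = []
    for start, end in scenes:
        mid_frame = (start.get_frames() + end.get_frames()) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, mid_frame)
        ok, frame = cap.read()
        if not ok:
            continue
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # (start_s, end_s, CLIP embedding) per scene
        results.append((start.get_seconds(), end.get_seconds(), embed_frames([image])))
    cap.release()
    return results
```

A production version might pool several frames per scene rather than a single midpoint, but even this reduces the number of encoder passes from thousands of frames to a few hundred scenes for a typical long video.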
To enhance search quality, we embed frame sets using more than just an image encoder and employ a classification model to detect specific actions that the embedding model alone may fail to capture. We also analyze the transcript of the video and any text on screen to improve quality and divide search results into three categories: Embeddings (visual moments and actions), Transcript, and OCR.
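A minimal sketch of how these three channels can be merged into one ranked list is shown below; the `SearchHit` type, score normalization, and ranking rule are illustrative assumptions rather than a description of our scoring logic:

```python
from dataclasses import dataclass

@dataclass
class SearchHit:
    start_s: float   # segment start in seconds
    end_s: float     # segment end in seconds
    score: float     # channel-specific relevance score, assumed normalized to [0, 1]
    source: str      # "embedding" | "transcript" | "ocr"

def merge_hits(embedding_hits: list[SearchHit],
               transcript_hits: list[SearchHit],
               ocr_hits: list[SearchHit],
               top_k: int = 5) -> list[SearchHit]:
    """Combine the three search channels into one ranked list of candidate segments."""
    all_hits = [*embedding_hits, *transcript_hits, *ocr_hits]
    all_hits.sort(key=lambda h: h.score, reverse=True)
    return all_hits[:top_k]
```

Keeping the `source` field on each hit also makes it easy to tell the downstream VLM why a segment was retrieved (a visual match, a spoken phrase, or on-screen text).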
Putting it Together
Now that we have the video search and VLM components, we can integrate them into a single workflow and process videos of any length.
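The end-to-end loop might look like the sketch below. It reuses `scene_embeddings` and `search` from the earlier sketches, and the `extract_clips` and `vlm_answer` helpers are hypothetical placeholders for clip extraction and whatever VLM you deploy:

```python
import torch

def answer_about_video(video_path: str, prompt: str, top_k: int = 3) -> str:
    """Search the video for relevant scenes, then run the VLM on only those clips."""
    scenes = scene_embeddings(video_path)                    # (start_s, end_s, embedding)
    frame_feats = torch.cat([emb for _, _, emb in scenes])   # one row per scene
    best = search(prompt, frame_feats, top_k=top_k)          # indices of relevant scenes

    # Only the matching scenes are decoded and sent to the VLM, so inference cost
    # scales with the number of relevant segments, not the length of the video.
    clips = [(scenes[i][0], scenes[i][1]) for i in best]     # (start_s, end_s) ranges
    return vlm_answer(prompt, extract_clips(video_path, clips))  # hypothetical helpers
```

Because the VLM only ever sees a handful of short clips, the same pipeline works whether the source video is ten minutes or ten hours long.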
This method is highly scalable and effective for processing multi-hour videos. It provides a much deeper level of analysis than transcript-based approaches. In lecture videos with abundant on-screen text, the content can still be accurately followed even without audio. For sports or other action-intensive videos, the system can track, detect, and count specific actions with high reliability. Furthermore, this approach supports flexible integration with downstream tasks such as event detection, highlight extraction, and long-context reasoning, making it suitable for real-world production environments where extended video analysis is required.