VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

1 The Hong Kong Polytechnic University   2 Show Lab, National University of Singapore

Abstract

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in the reasoning capabilities of Large Language Models, multi-modal reasoning, especially for videos, remains largely underexplored.

In this work, we introduce VideoMind, a novel video-language agent designed for temporally grounded video understanding. VideoMind incorporates two key innovations: (i) we identify the essential capabilities for video temporal reasoning and develop a role-based agentic workflow, comprising a planner that coordinates the other roles, a grounder for temporal localization, a verifier that assesses the accuracy of temporal intervals, and an answerer for question-answering; (ii) to integrate these diverse roles efficiently, we propose a novel Chain-of-LoRA strategy that enables seamless role switching via lightweight LoRA adapters while avoiding the overhead of multiple full models, thus balancing efficiency and flexibility.

Extensive experiments on 14 public benchmarks demonstrate that VideoMind achieves state-of-the-art performance on diverse video understanding tasks, including grounded video question-answering (Grounded VideoQA), video temporal grounding (VTG), and general video question-answering (VideoQA), underscoring its effectiveness in advancing video agents and long-form temporal reasoning.

What is VideoMind?

Teaser: Given a question about a long video, VideoMind localizes the relevant moments and answers based on that grounded evidence, coordinating the planner, grounder, verifier, and answerer roles on a single backbone via Chain-of-LoRA.

Model Overview

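To make the Chain-of-LoRA idea concrete, below is a minimal sketch of how role switching over a single shared backbone could be wired up with Hugging Face PEFT. The backbone checkpoint, adapter paths, adapter names, and prompts are illustrative assumptions, not the released VideoMind implementation.

# A minimal, illustrative sketch (not the released VideoMind code) of Chain-of-LoRA
# role switching with Hugging Face PEFT: one shared backbone carries one lightweight
# LoRA adapter per role, and a role is selected by activating its adapter.
# Video inputs are omitted for brevity; text prompts stand in for the multimodal context.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "path/to/backbone"  # placeholder: any causal LM checkpoint compatible with PEFT
ROLES = ("planner", "grounder", "verifier", "answerer")

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Attach the first role adapter, then register the remaining roles on the same backbone.
model = PeftModel.from_pretrained(model, "adapters/planner", adapter_name="planner")
for role in ROLES[1:]:
    model.load_adapter(f"adapters/{role}", adapter_name=role)  # placeholder adapter paths

def run_role(role: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Activate one role's LoRA adapter and decode a response from the shared backbone."""
    model.set_adapter(role)  # role switch: only the small adapter weights change
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def answer(question: str) -> str:
    """Run the Planner -> Grounder -> Verifier -> Answerer chain on one backbone."""
    plan = run_role("planner", f"Plan how to answer this question about the video: {question}")
    span = run_role("grounder", f"Localize the relevant moment. Plan: {plan}")
    check = run_role("verifier", f"Verify that the interval {span} supports: {question}")
    return run_role("answerer", f"Answer '{question}' using the interval {span} ({check}).")

Because every role shares the same backbone and differs only in a small set of LoRA weights, switching roles amounts to activating a different adapter rather than loading another full model, which is what keeps the agent lightweight.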

Role-Specific Designs

Timestamp Decoder
Planner
Verifier

Visualization


Citation

Please cite our paper if you find this project helpful.
@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}