MotionBench

Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Introduction

In recent years, vision language models (VLMs) have made significant advances in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types, with data collected from diverse sources to ensure broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly at understanding fine-grained motions. To enhance a VLM's ability to perceive fine-grained motion within the limited sequence length of its LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel, efficient Through-Encoder (TE) Fusion method. Experiments show that higher-frame-rate inputs and TE Fusion both improve motion understanding, yet substantial room for improvement remains. Our benchmark aims to guide and motivate the development of more capable video understanding models by emphasizing the importance of fine-grained motion comprehension.

Leaderboard

Accuracy scores on MotionBench.

| # | Model | Organization | Frames | LLM Params | Date | Dev Avg (%) | Test Avg (%) | MR (%) | LM (%) | CM (%) | MO (%) | AO (%) | RC (%) |
|---|-------|--------------|--------|------------|------|-------------|--------------|--------|--------|--------|--------|--------|--------|
| 1 | TE Fusion | Zhipu AI & Tsinghua | 16 | 9B | 2024-11-25 | 58 | 58 | 64 | 59 | 51 | 69 | 41 | 39 |
| 2 | Qwen2-VL-72B | Alibaba | 1 fps | 72B | 2024-11-25 | 57 | 58 | 58 | 61 | 63 | 72 | 47 | 31 |
| 3 | InternVL2-40B | Shanghai AI Lab | 16 | 34B | 2024-11-25 | 55 | 54 | 54 | 58 | 49 | 76 | 41 | 30 |
| 4 | GLM-4V-Plus | Zhipu AI | 30 | - | 2024-11-25 | 54 | 55 | 57 | 57 | 54 | 69 | 40 | 37 |
| 5 | MiniCPM-V2.6 | Tsinghua | 64 | 7B | 2024-11-25 | 52 | 53 | 56 | 49 | 45 | 72 | 39 | 33 |
| 6 | PLLaVA 34B | Bytedance & NTU | 16 | 34B | 2024-11-25 | 52 | 51 | 55 | 51 | 47 | 66 | 38 | 31 |
| 7 | Gemini 1.5 Pro | Google | 1 fps | - | 2024-11-25 | 51 | 50 | 51 | 52 | 54 | 67 | 40 | 22 |
| 8 | Oryx-34B | Tsinghua University & Tencent & NTU | 64 | 34B | 2024-11-25 | 49 | 49 | 48 | 52 | 44 | 65 | 42 | 32 |
| 9 | LLaVA-NeXT-Video-DPO (34B) | Bytedance & NTU S-Lab | 32 | 34B | 2024-11-25 | 48 | 40 | 53 | 45 | 36 | 66 | 39 | 23 |
| 10 | CogVLM2-Video | Zhipu AI | 24 | 8B | 2024-11-25 | 41 | 44 | 43 | 39 | 38 | 64 | 37 | 33 |


MotionBench

State-of-the-art video understanding models struggle with basic motion-level perception.

Compared to existing benchmarks, MotionBench focuses on assessing a model's motion-level perception capability, which is critical for understanding videos with fast, transient interactions and motions.

Example

MotionBench is a collection of manually curated multiple-choice queries paired with video clips featuring dynamic changes, drawn from various scenes such as daily life and medical instruction videos. We devise six primary tasks to evaluate motion-level perception. Unlike previous story-level and event-level benchmarks, MotionBench has a significantly higher annotation density, allowing fine-grained motions to be assessed.
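For concreteness, here is a minimal sketch of how one might iterate over MotionBench-style items; the JSONL layout and field names (`video_path`, `question`, `options`, `answer`, `task`) are illustrative assumptions, not the released schema.

```python
# Hypothetical loader for MotionBench-style multiple-choice items.
# The JSONL layout and field names below are illustrative assumptions,
# not the released schema.
import json

def load_items(path: str):
    """Yield one multiple-choice query per line of a JSONL annotation file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            yield {
                "video": item["video_path"],   # clip featuring the annotated motion
                "question": item["question"],  # motion-oriented question text
                "options": item["options"],    # answer candidates, e.g. ["A. ...", ...]
                "task": item["task"],          # one of the six task categories
                "answer": item.get("answer"),  # ground-truth letter (dev split only)
            }
```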

A comparison of existing video VLM benchmarks with MotionBench.

MotionBench collects videos from various sources, including web videos and synthetic videos, and provides a new evaluation perspective on motion-level perception.

Basic Statistics of MotionBench

Data composition of MotionBench.

Through-Encoder Fusion

We propose Through-Encoder (TE) Fusion, a novel compression architecture that enhances motion-level understanding under a constrained LLM context length. In our ablation study, TE Fusion achieves state-of-the-art results on MotionBench and outperforms other compression methods across MotionBench, MVBench, LVBench, and VideoMME, with a particular advantage in high-compression-ratio scenarios.
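As a rough illustration of the idea, the sketch below shows a generic ViT-style encoder in which groups of k adjacent frames attend to each other throughout all encoder blocks before being pooled down to one frame's worth of tokens. The class name, the pooling choice, and the tensor layout are our assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of the Through-Encoder (TE) Fusion idea on a generic
# ViT-style encoder. Class name, pooling choice, and tensor layout are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class TEFusionEncoder(nn.Module):
    """Fuse k adjacent frames *inside* the vision encoder.

    Frames in a group share self-attention through every encoder block,
    so inter-frame motion is modeled before any token compression,
    rather than compressing features only after the encoder.
    """

    def __init__(self, patch_embed: nn.Module, blocks: nn.ModuleList, k: int = 4):
        super().__init__()
        self.patch_embed = patch_embed  # per-frame patchifier: (C,H,W) -> (P, D) tokens
        self.blocks = blocks            # standard transformer blocks
        self.k = k                      # temporal compression ratio

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W), with T divisible by k
        B, T = frames.shape[:2]
        x = self.patch_embed(frames.flatten(0, 1))     # (B*T, P, D)
        P, D = x.shape[1], x.shape[2]
        # Group k adjacent frames so attention spans the whole group.
        x = x.reshape(B * T // self.k, self.k * P, D)  # (B*T/k, k*P, D)
        for blk in self.blocks:
            x = blk(x)                                 # joint spatio-temporal attention
        # Compress each group back to one frame's worth of tokens; average
        # pooling over the k temporal copies is one simple choice.
        x = x.reshape(B * T // self.k, self.k, P, D).mean(dim=1)
        return x.reshape(B, T // self.k, P, D)         # (B, T/k, P, D)
```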

Model Architecture of Video Understanding Models


Experimental Results

Evaluation results of the existing video VLMs.

Abbreviations: MR (Motion Recognition), LM (Location-related Motion), CM (Camera Motion), MO (Motion-related Objects), AO (Action Order), RC (Repetition Count). We randomly split MotionBench into "dev" and "test" sets. We will release the ground-truth answers for the "dev" set and set up an online platform for result submission on the "test" set.
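Given per-question predictions, dev-split scores like those in the leaderboard reduce to a straightforward accuracy computation. The sketch below assumes predictions and labels are option letters keyed by question ID; this mirrors the multiple-choice setup but is not the official evaluation script.

```python
# Sketch of per-category accuracy scoring over the six task categories
# (MR, LM, CM, MO, AO, RC). Assumes `preds` and `labels` map question IDs
# to option letters; an assumed layout, not the official evaluation code.
from collections import defaultdict

def score(preds: dict, labels: dict, tasks: dict):
    hits, totals = defaultdict(int), defaultdict(int)
    for qid, gold in labels.items():
        cat = tasks[qid]                      # e.g. "MR", "LM", ...
        totals[cat] += 1
        pred = preds.get(qid, "").strip().upper()
        hits[cat] += int(pred.startswith(gold.upper()))
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_category, overall
```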

Evaluation results on different benchmarks.

Model performance variation with respect to compression ratio k = 2, 4, 8, 16, given a fixed VLM input frame count of N_input = 16. The pink dotted line shows the performance of the baseline model, which processes 16 frames without temporal compression. Each compression method is re-implemented on the GLM-4V-9B backbone to ensure a fair comparison.
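To make the trade-off concrete, the decoder-side token budget implied by this setup can be computed directly. The per-frame token count P = 256 below is an assumed placeholder, not the tokenizer output of any specific model.

```python
# Back-of-the-envelope decoder token budget for the ablation: with a fixed
# N_input = 16 frames and P visual tokens per encoded frame (P = 256 is an
# assumed placeholder), a temporal compression ratio k leaves
# (N_input / k) * P visual tokens for the LLM decoder.
N_INPUT = 16   # frames fed to the vision encoder
P = 256        # tokens per encoded frame (assumed)

for k in (2, 4, 8, 16):
    fused_frames = N_INPUT // k
    print(f"k={k:2d}: {fused_frames:2d} fused frames -> {fused_frames * P} visual tokens")
```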

Evaluation results on different benchmarks.

Benchmark results for different compression methods at various compression ratios, all using the same sequence length in the VLM decoder.

Citation

@misc{hong2025motionbenchbenchmarkingimprovingfinegrained,
      title={MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models},
      author={Wenyi Hong and Yean Cheng and Zhuoyi Yang and Weihan Wang and Lefan Wang and Xiaotao Gu and Shiyu Huang and Yuxiao Dong and Jie Tang},
      year={2025},
      eprint={2501.02955},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.02955},
}