ExAct: A Video-Language Benchmark for Expert Action Analysis

NeurIPS 2025

UNC Chapel Hill


Abstract

We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our benchmark contains 3,521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing model, Gemini 2.5 Pro, achieves only 55.35% accuracy, well below the 82.02% attained by trained human experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills across physical and procedural domains.

Comparison with Existing Datasets


Our proposed ExAct benchmark uniquely combines expert-level, free-form language annotations with a multiple-choice question (MCQ) evaluation format, making it well suited for evaluating modern video-language models on expert-level understanding of skilled human activities.
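
To make the task format concrete, here is a minimal sketch of what an ExAct-style multiple-choice item could look like. The field names, file path, and option texts below are illustrative assumptions, not the released annotation schema.

```python
# Hypothetical ExAct-style multiple-choice item; all field names and texts
# are illustrative assumptions rather than the official annotation schema.
sample_item = {
    "video": "basketball/clip_0042.mp4",   # short skilled-activity clip (assumed path)
    "domain": "Sports",
    "activity": "Basketball",
    "feedback_type": "TIPS",               # Good Execution (GE) or Tips for Improvement (TIPS)
    "question": "Which expert commentary best describes the jump shot shown in the video?",
    "options": {
        "A": "The shooter releases the ball at the peak of the jump with a full follow-through.",
        "B": "The shooter's elbow flares outward, causing the ball to drift left of the rim.",
        "C": "The shooter keeps both feet planted, relying only on upper-body strength.",
        "D": "The shooter uses an exaggerated dip that delays the release.",
        "E": "The shooter leans backward excessively, losing balance on the landing.",
    },
    "answer": "B",                          # exactly one option is the true expert commentary
}
```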




Leaderboard

ExAct evaluation results (QA accuracy) across six diverse physical domains: Sports (Basketball, Soccer, Bouldering), Bike Repair, Cooking, Health (COVID-19 safety, CPR), Music (Guitar, Piano, Violin), and Dance.

The leaderboard below is sorted by overall accuracy; a minimal scoring sketch follows the table.

| Model | Frames | Overall (%) | Sports (%) | Bike Repair (%) | Cooking (%) | Health (%) | Music (%) | Dance (%) |
|---|---|---|---|---|---|---|---|---|
| Random Choice | - | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 |
| Human Non-Expert | - | 61.86 | 62.97 | 55.02 | 66.58 | 71.43 | 54.11 | 59.22 |
| Human Expert | - | 82.02 | 82.09 | 81.23 | 80.27 | 87.09 | 80.21 | 81.55 |
| PerceptionLM-8B | 32 | 24.65 | 24.22 | 28.16 | 25.75 | 22.53 | 22.95 | 26.42 |
| VideoLLaMA3-7B | 32 | 26.38 | 26.64 | 23.30 | 29.32 | 26.65 | 23.79 | 27.79 |
| InternVL2.5-78B | 32 | 33.48 | 31.93 | 36.57 | 33.70 | 37.91 | 32.00 | 34.62 |
| LLaVA-OneVision-72B | 32 | 35.44 | 33.65 | 43.04 | 33.42 | 35.44 | 30.53 | 43.51 |
| Qwen2.5-VL-72B-Instruct | 32 | 35.67 | 35.62 | 37.86 | 33.97 | 36.26 | 32.63 | 38.50 |
| LLaVA-Video-72B | 32 | 41.58 | 41.81 | 42.72 | 44.11 | 32.42 | 38.74 | 48.52 |
| Gemini 1.5 Pro | 32 | 43.91 | 42.83 | 52.10 | 51.78 | 41.21 | 41.89 | 39.86 |
| GPT-4o | 32 | 44.70 | 43.47 | 52.75 | 46.30 | 53.30 | 33.89 | 46.70 |
| GPT-4.1 | 32 | 50.89 | 51.37 | 58.90 | 54.25 | 51.10 | 40.84 | 51.48 |
| Gemini 2.5 Pro | 32 | 55.35 | 52.58 | 65.05 | 58.36 | 60.71 | 53.05 | 53.98 |
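
As referenced above, the sketch below shows how overall and per-domain QA accuracy can be computed from model predictions. The record fields ("domain", "answer", "prediction") are assumptions for illustration; this is not the official ExAct evaluation script.

```python
# Minimal sketch of leaderboard-style scoring; record fields are assumed.
from collections import defaultdict

def accuracy_by_domain(records):
    """Return overall and per-domain QA accuracy (in %) for a list of prediction records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        hit = int(r["prediction"] == r["answer"])  # exact match on the chosen option letter
        correct["Overall"] += hit
        total["Overall"] += 1
        correct[r["domain"]] += hit
        total[r["domain"]] += 1
    return {k: 100.0 * correct[k] / total[k] for k in total}

if __name__ == "__main__":
    demo = [
        {"domain": "Sports", "answer": "B", "prediction": "B"},
        {"domain": "Music", "answer": "D", "prediction": "A"},
    ]
    print(accuracy_by_domain(demo))  # {'Overall': 50.0, 'Sports': 100.0, 'Music': 0.0}
```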

Benchmark

Benchmark Statistics


Left: Our proposed ExAct benchmark contains 11 skilled activity types spanning 6 broader physical domains: Sports, Music, Dance, Health, Cooking, and Bike Repair. Top Right: Distribution of video lengths across the dataset, showing that most clips fall within the 0-10 second range. Bottom Right: Sample distribution per activity, categorized by the expert feedback type: Good Execution (GE) and Tips for Improvement (TIPS).
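
For readers who want to reproduce such summary statistics on their own annotation files, the sketch below bins clip durations and counts samples per activity and feedback type. The "duration", "activity", and "feedback_type" fields are assumptions for illustration, not the released schema.

```python
# Rough sketch of tabulating dataset statistics like those in the figure;
# the annotation fields used here are assumptions.
from collections import Counter

def summarize(items, bin_size=10):
    length_bins = Counter()    # duration histogram, e.g. "0-10s", "10-20s", ...
    per_activity = Counter()   # (activity, feedback_type) -> sample count
    for it in items:
        lo = int(it["duration"] // bin_size) * bin_size
        length_bins[f"{lo}-{lo + bin_size}s"] += 1
        per_activity[(it["activity"], it["feedback_type"])] += 1
    return length_bins, per_activity

items = [
    {"duration": 6.2, "activity": "Basketball", "feedback_type": "TIPS"},
    {"duration": 14.8, "activity": "Guitar", "feedback_type": "GE"},
]
print(summarize(items))
```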



Data Examples


Benchmark Construction


Overview of our benchmark construction pipeline. In Stage I, we pre-process raw expert commentaries using GPT-4o, correcting errors and segmenting them into concise, self-contained feedback commentaries. In Stage II, we construct multiple-choice QA pairs, each consisting of one correct expert commentary and four carefully generated distractors. The four red arrows indicate the LLM-generated distractors, while the green arrow represents the correct expert commentary. In Stage III, we filter out low-quality or biased samples using length-based heuristics and blind LLMs. Finally, in Stage IV, domain experts review all QA pairs to ensure visual grounding and linguistic accuracy.
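
As a concrete illustration of Stage III-style length filtering, the sketch below flags QA pairs whose correct option is conspicuously longer or shorter than its distractors. The ratio threshold and the example texts are assumptions for illustration, not the exact heuristic used to build ExAct.

```python
# Illustrative length-bias filter; threshold and example texts are assumed,
# not the exact Stage III heuristic used for ExAct.
def has_length_bias(correct, distractors, ratio=1.5):
    """Flag QA pairs whose correct option is much longer or shorter than the distractors."""
    avg_distractor_len = sum(len(d.split()) for d in distractors) / len(distractors)
    correct_len = len(correct.split())
    return correct_len > ratio * avg_distractor_len or correct_len < avg_distractor_len / ratio

qa = {
    "correct": "Keep your elbow tucked in and follow through toward the rim.",
    "distractors": [
        "Keep both feet planted and avoid jumping on the release.",
        "Release the ball earlier, before reaching the top of your jump.",
        "Bend your knees more to generate additional lifting power.",
        "Hold the ball lower so the defender cannot block the shot.",
    ],
}
print(has_length_bias(qa["correct"], qa["distractors"]))  # False for this balanced example
```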




Conclusion

We introduce ExAct, a new video-language benchmark designed to evaluate expert-level understanding of skilled human activities across a diverse set of physical and procedural domains. Our benchmark uses fine-grained, expert-level language annotations and a multiple-choice evaluation format to enable rigorous evaluation of expert-level understanding of physical human skills. Our experiments reveal a substantial gap between state-of-the-art VLMs and human expert performance, indicating significant room for improvement in video-language model design. We believe that ExAct will be pivotal in the development and evaluation of video-language models capable of skilled human activity understanding.



Citation


      @article{yi2025exact,
        title={ExAct: A Video-Language Benchmark for Expert Action Analysis},
        author={Yi, Han and Pan, Yulu and He, Feihong and Liu, Xinyu and Zhang, Benjamin and Oguntola, Oluwatumininu and Bertasius, Gedas},
        journal={arXiv preprint arXiv:2506.06277},
        year={2025}
      }
      