ExAct: A Video-Language Benchmark for Expert Action Analysis

UNC Chapel Hill


Abstract

We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our benchmark contains 3,521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating recent state-of-the-art video-language models (VLMs) on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing model, GPT-4o, achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains.

Comparison with Existing Datasets


Our proposed ExAct benchmark uniquely combines expert-level, free-form language annotations with a multiple-choice question (MCQ) evaluation format, making it an excellent resource for evaluating modern video-language models on expert-level understanding of skilled human activities.
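
To make the evaluation format concrete, here is a minimal sketch of how an ExAct-style MCQ sample could be represented and scored. The field names (`video_path`, `options`, `answer_index`, `domain`) and the `predict` interface are illustrative assumptions, not the benchmark's actual schema.

```python
# Minimal sketch of an ExAct-style MCQ sample and overall accuracy scoring.
# Field names and the predict() interface are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCQSample:
    video_path: str      # skilled-activity video clip
    question: str        # expert-analysis question about the clip
    options: List[str]   # five candidate answers; one is the expert commentary
    answer_index: int    # index (0-4) of the correct option
    domain: str          # e.g. "Sports", "Music", "Dance"

def overall_accuracy(samples: List[MCQSample],
                     predict: Callable[[MCQSample], int]) -> float:
    """Percentage of samples where the model selects the correct option."""
    correct = sum(predict(s) == s.answer_index for s in samples)
    return 100.0 * correct / len(samples)
```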




Leaderboard

ExAct evaluation results (QA accuracy) across six diverse physical domains: Sports (Basketball, Soccer, Bouldering), Bike Repair, Cooking, Health (COVID-19 safety, CPR), Music (Guitar, Piano, Violin), and Dance.

The leaderboard is sorted by overall accuracy.

All values are QA accuracy (%).

Model                    Frames  Overall  Sports  Bike Repair  Cooking  Health  Music  Dance
Random Choice                 -    20.00   20.00        20.00    20.00   20.00  20.00  20.00
Human Non-Expert              -    61.86   62.97        55.02    66.58   71.43  59.22  59.22
Human Expert                  -    82.02   82.09        81.23    80.27   87.09  80.21  81.55
Gemini 1.5 Pro               32    43.91   42.83        52.10    51.78   41.21  41.89  39.86
GPT-4o                       32    44.70   43.47        52.75    46.30   53.30  33.89  46.70
PerceptionLM-8B              32    24.65   24.22        28.16    25.75   22.53  22.95  26.42
VideoLLaMA3-7B               32    26.38   26.64        23.30    29.32   26.65  23.79  27.79
InternVL2.5-78B              32    33.48   31.93        36.57    33.70   37.91  32.00  34.62
LLaVA-OneVision-72B          32    35.44   33.65        43.04    33.42   35.44  30.53  43.51
Qwen2.5-VL-72B-Instruct      32    35.67   35.62        37.86    33.97   36.26  32.63  38.50
LLaVA-Video-72B              32    41.58   41.81        42.72    44.11   32.42  38.74  48.52
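
The per-domain columns above are plain per-domain accuracies. Below is a hedged sketch of how such a breakdown might be computed from per-sample results; the `(domain, is_correct)` record layout is an assumption, not the benchmark's evaluation code.

```python
# Sketch: aggregate overall and per-domain QA accuracy (in %) from
# per-sample outcomes. The (domain, is_correct) layout is assumed.

from collections import defaultdict
from typing import Dict, List, Tuple

def domain_breakdown(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for domain, is_correct in results:
        totals[domain] += 1
        hits[domain] += int(is_correct)
    breakdown = {d: 100.0 * hits[d] / totals[d] for d in totals}
    breakdown["Overall"] = 100.0 * sum(hits.values()) / sum(totals.values())
    return breakdown
```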

Benchmark

Benchmark Statistics


Left: Our proposed ExAct benchmark contains 11 skilled activity types spanning 6 broader physical domains: Sports, Music, Dance, Health, Cooking, and Bike Repair. Top Right: Distribution of video lengths across the dataset, showing that most clips fall within the 0-10 second range. Bottom Right: Sample distribution per activity, categorized by the expert feedback type: Good Execution (GE) and Tips for Improvement (TIPS).



Data Examples


Benchmark Construction


Overview of our benchmark construction pipeline. In Stage I, we pre-process raw expert commentaries using GPT-4o, correcting errors and segmenting them into concise, self-contained feedback commentaries. In Stage II, we construct multiple-choice QA pairs, each consisting of one correct expert commentary and four carefully generated distractors. The four red arrows indicate the LLM-generated distractors, while the green arrow represents the correct expert commentary. In Stage III, we filter out low-quality or biased samples using length-based heuristics and blind LLMs (text-only models that answer without seeing the video). Finally, in Stage IV, domain experts review all QA pairs to ensure visual grounding and linguistic accuracy.
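
To illustrate the Stage III idea, the sketch below shows one possible combination of a length-based heuristic and a blind-LLM check for flagging biased items; the 1.5 length ratio, the `ask_blind_llm` helper, and the three-trial rule are hypothetical choices, not the pipeline's actual implementation.

```python
# Hypothetical sketch of Stage III-style filtering (not the actual pipeline).
# An item is flagged if its correct option is conspicuously longer than the
# distractors, or if a text-only ("blind") LLM answers it correctly without
# the video, which suggests the answer leaks through language priors alone.

from statistics import mean
from typing import Callable, List

def is_length_biased(options: List[str], answer_index: int,
                     ratio: float = 1.5) -> bool:
    """Flag items whose correct option is much longer than the distractors."""
    correct_len = len(options[answer_index].split())
    distractor_len = mean(len(o.split())
                          for i, o in enumerate(options) if i != answer_index)
    return correct_len > ratio * distractor_len

def passes_blind_check(question: str, options: List[str], answer_index: int,
                       ask_blind_llm: Callable[[str, List[str]], int],
                       trials: int = 3) -> bool:
    """Keep an item only if a video-blind LLM cannot reliably guess it."""
    hits = sum(ask_blind_llm(question, options) == answer_index
               for _ in range(trials))
    return hits < trials
```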




Conclusion

We introduce ExAct, a new video-language benchmark designed to evaluate expert-level understanding of skilled human activities across a diverse set of physical and procedural domains. Our benchmark uses fine-grained, expert-level language annotations and a multiple-choice evaluation format to enable rigorous evaluation of expert-level understanding of physical human skills. Our experiments reveal a significant gap between state-of-the-art VLMs and human expert performance, indicating substantial room for improvement in video-language model design. We believe that ExAct will be pivotal in the development and evaluation of video-language models capable of skilled human activity understanding.



Citation


      
      