EXACT evaluation results (QA accuracy) across six diverse physical domains: Sports (Basketball, Soccer, Bouldering), Bike Repair, Cooking, Health (COVID-19 safety, CPR), Music (Guitar, Piano, Violin), and Dance.
The leaderboard is sorted by overall accuracy by default. To sort by another metric, click the corresponding column header.
| Model | Frames | Overall (%) | Sports (%) | Bike Repair (%) | Cooking (%) | Health (%) | Music (%) | Dance (%) |
|---|---|---|---|---|---|---|---|---|
| Random Choice | - | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 |
| Human Non-Expert | - | 61.86 | 62.97 | 55.02 | 66.58 | 71.43 | 59.22 | 59.22 |
| Human Expert | - | 82.02 | 82.09 | 81.23 | 80.27 | 87.09 | 80.21 | 81.55 |
| Gemini 1.5 Pro | 32 | 43.91 | 42.83 | 52.10 | 51.78 | 41.21 | 41.89 | 39.86 |
| GPT-4o | 32 | 44.70 | 43.47 | 52.75 | 46.30 | 53.30 | 33.89 | 46.70 |
| PerceptionLM-8B | 32 | 24.65 | 24.22 | 28.16 | 25.75 | 22.53 | 22.95 | 26.42 |
| VideoLLaMA3-7B | 32 | 26.38 | 26.64 | 23.30 | 29.32 | 26.65 | 23.79 | 27.79 |
| InternVL2.5-78B | 32 | 33.48 | 31.93 | 36.57 | 33.70 | 37.91 | 32.00 | 34.62 |
| LLaVA-OneVision-72B | 32 | 35.44 | 33.65 | 43.04 | 33.42 | 35.44 | 30.53 | 43.51 |
| Qwen2.5-VL-72B-Instruct | 32 | 35.67 | 35.62 | 37.86 | 33.97 | 36.26 | 32.63 | 38.50 |
| LLaVA-Video-72B | 32 | 41.58 | 41.81 | 42.72 | 44.11 | 32.42 | 38.74 | 48.52 |
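The default ranking above (descending overall accuracy) can be reproduced with a few lines of plain Python. This is a hypothetical illustration, not an official EXACT script; the overall scores are copied from the table, and the variable names are made up for the example.

```python
# Overall accuracy per entry, copied from the leaderboard table above.
rows = [
    ("Random Choice", 20.00),
    ("Human Non-Expert", 61.86),
    ("Human Expert", 82.02),
    ("Gemini 1.5 Pro", 43.91),
    ("GPT-4o", 44.70),
    ("PerceptionLM-8B", 24.65),
    ("VideoLLaMA3-7B", 26.38),
    ("InternVL2.5-78B", 33.48),
    ("LLaVA-OneVision-72B", 35.44),
    ("Qwen2.5-VL-72B-Instruct", 35.67),
    ("LLaVA-Video-72B", 41.58),
]

# Sort descending by overall accuracy, as the leaderboard does by default.
ranked = sorted(rows, key=lambda r: r[1], reverse=True)

for model, overall in ranked:
    print(f"{model}: {overall:.2f}")
```

Sorting by another metric amounts to changing the key (e.g. a per-domain column) in the `sorted` call.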