EXACT evaluation results (QA accuracy) across six diverse physical domains: Sports (Basketball, Soccer, Bouldering), Bike Repair, Cooking, Health (COVID-19 safety, CPR), Music (Guitar, Piano, Violin), and Dance.
The leaderboard is sorted by overall accuracy by default. To sort by another metric, click the corresponding column header.
| Model | Frames | Overall (%) | Sports (%) | Bike Repair (%) | Cooking (%) | Health (%) | Music (%) | Dance (%) |
|---|---|---|---|---|---|---|---|---|
| Random Choice | - | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 | 20.00 |
| Human Non-Expert | - | 61.86 | 62.97 | 55.02 | 66.58 | 71.43 | 59.22 | 59.22 |
| Human Expert | - | 82.02 | 82.09 | 81.23 | 80.27 | 87.09 | 80.21 | 81.55 |
| Gemini 1.5 Pro | 32 | 43.91 | 42.83 | 52.10 | 51.78 | 41.21 | 41.89 | 39.86 |
| GPT-4o | 32 | 44.70 | 43.47 | 52.75 | 46.30 | 53.30 | 33.89 | 46.70 |
| PerceptionLM-8B | 32 | 24.65 | 24.22 | 28.16 | 25.75 | 22.53 | 22.95 | 26.42 |
| VideoLLaMA3-7B | 32 | 26.38 | 26.64 | 23.30 | 29.32 | 26.65 | 23.79 | 27.79 |
| InternVL2.5-78B | 32 | 33.48 | 31.93 | 36.57 | 33.70 | 37.91 | 32.00 | 34.62 |
| LLaVA-OneVision-72B | 32 | 35.44 | 33.65 | 43.04 | 33.42 | 35.44 | 30.53 | 43.51 |
| Qwen2.5-VL-72B-Instruct | 32 | 35.67 | 35.62 | 37.86 | 33.97 | 36.26 | 32.63 | 38.50 |
| LLaVA-Video-72B | 32 | 41.58 | 41.81 | 42.72 | 44.11 | 32.42 | 38.74 | 48.52 |
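The default ranking above (descending overall accuracy) can be reproduced with a few lines of plain Python. This is a hypothetical illustration, not an official EXACT script; the overall scores are copied from the table, and the variable names are made up for the example.

```python
# Overall accuracy per entry, copied from the leaderboard table above.
rows = [
    ("Random Choice", 20.00),
    ("Human Non-Expert", 61.86),
    ("Human Expert", 82.02),
    ("Gemini 1.5 Pro", 43.91),
    ("GPT-4o", 44.70),
    ("PerceptionLM-8B", 24.65),
    ("VideoLLaMA3-7B", 26.38),
    ("InternVL2.5-78B", 33.48),
    ("LLaVA-OneVision-72B", 35.44),
    ("Qwen2.5-VL-72B-Instruct", 35.67),
    ("LLaVA-Video-72B", 41.58),
]

# Sort descending by overall accuracy, as the leaderboard does by default.
ranked = sorted(rows, key=lambda r: r[1], reverse=True)

for model, overall in ranked:
    print(f"{model}: {overall:.2f}")
```

Sorting by another metric amounts to changing the key (e.g. a per-domain column) in the `sorted` call.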