Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos are an effective medium for this learning process because they support a progression through these stages. However, existing video benchmarks do not systematically evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. We also propose a knowledge gain metric, Δknowledge, which quantifies the improvement in performance after watching a video. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods that enhance LMMs' capability to learn and adapt from videos.
We introduce Video-MMMU, a massive, multi-modal, multi-disciplinary video benchmark that evaluates how well LMMs acquire knowledge from educational videos. It has three main features:
1) Knowledge-intensive Video Collection:
Our dataset comprises 300 expert-level videos spanning 6 professional disciplines: Art, Business, Science, Medicine, Humanities, and Engineering, with 30 subjects distributed among them.
2) Knowledge Acquisition-based Question Design:
Each video includes three question-answer pairs aligned with the three knowledge acquisition stages: Perception (identifying key information related to the knowledge), Comprehension (understanding the underlying concepts), and Adaptation (applying knowledge to new scenarios).
3) Quantitative Knowledge Acquisition Assessment:
We propose a knowledge acquisition metric, denoted as Δknowledge, to measure performance gains on practice exam questions after learning from videos. This metric enables us to quantitatively evaluate how effectively large multimodal models (LMMs) can assimilate and utilize the information presented in the videos to solve real-world, novel problems.
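This section does not spell out the formula for Δknowledge. The sketch below assumes a normalized-gain form: the accuracy improvement after watching the video, divided by the headroom that was left before watching. The function name `delta_knowledge` and the example accuracies are hypothetical and only illustrate the arithmetic.

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain, in percent.

    Assumed form (not stated in this section):
        (acc_after - acc_before) / (100 - acc_before) * 100
    Both accuracies are percentages in [0, 100].
    """
    if not 0.0 <= acc_before < 100.0:
        raise ValueError("acc_before must be in [0, 100)")
    if not 0.0 <= acc_after <= 100.0:
        raise ValueError("acc_after must be in [0, 100]")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


# Hypothetical accuracies, for illustration only.
print(delta_knowledge(40.0, 55.0))  # gain relative to the 60 points of headroom
print(delta_knowledge(50.0, 40.0))  # negative when accuracy drops after the video
```

Normalizing by the remaining headroom keeps a model that already scores high before the video from being penalized for having little room left to improve.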
Key Insights:
This limitation underscores a potential challenge in current LMMs. While humans naturally acquire knowledge through video-based learning, having developed this capability through classroom learning and educational experiences throughout life, LMMs struggle to effectively learn from videos. These findings emphasize the need for further research to enhance how LMMs acquire and utilize video-based information, bringing them closer to human-level learning processes.
We evaluate various open-source and proprietary LMMs. The table below provides a detailed comparison.
Model | Overall | Perception | Comprehension | Adaptation |
---|---|---|---|---|
Human Expert | 74.44 | 84.33 | 78.67 | 60.33 |
Claude-3.5-Sonnet | 65.78 | 72.00 | 69.67 | 55.67 |
GPT-4o | 61.22 | 66.00 | 62.00 | 55.67 |
Gemini 1.5 Pro | 53.89 | 59.00 | 53.33 | 49.33 |
Aria | 50.78 | 65.67 | 46.67 | 40.00 |
Gemini 1.5 Flash | 49.78 | 57.33 | 49.00 | 43.00 |
LLaVA-Video-72B | 49.67 | 59.67 | 46.00 | 43.33 |
LLaVA-OneVision-72B | 48.33 | 59.67 | 42.33 | 43.00 |
MAmmoTH-VL-8B | 41.78 | 51.67 | 40.00 | 33.67 |
InternVL2-8B | 37.44 | 47.33 | 33.33 | 31.67 |
LLaVA-Video-7B | 36.11 | 41.67 | 33.33 | 33.33 |
VILA1.5-40B | 34.00 | 38.67 | 30.67 | 32.67 |
LLaVA-OneVision-7B | 33.89 | 40.00 | 31.00 | 30.67 |
Llama-3.2-11B | 30.00 | 35.67 | 32.33 | 22.00 |
LongVA-7B | 23.98 | 24.00 | 24.33 | 23.67 |
VILA1.5-8B | 20.89 | 20.33 | 17.33 | 25.00 |
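As a reading aid for the table, the sketch below checks whether the Overall column matches the unweighted mean of the three track scores. That relationship is an assumption, not something the page states. It is motivated by the 900 questions being split as one question per stage per video. Only the first three rows of the table are copied in, and `track_mean` is a hypothetical helper name.

```python
# (overall, perception, comprehension, adaptation), copied from the table above.
rows = {
    "Human Expert": (74.44, 84.33, 78.67, 60.33),
    "Claude-3.5-Sonnet": (65.78, 72.00, 69.67, 55.67),
    "GPT-4o": (61.22, 66.00, 62.00, 55.67),
}


def track_mean(perception: float, comprehension: float, adaptation: float) -> float:
    """Unweighted mean of the three track scores, rounded to two decimals."""
    return round((perception + comprehension + adaptation) / 3, 2)


for name, (overall, p, c, a) in rows.items():
    # Print the reported overall score next to the recomputed mean.
    print(f"{name}: reported={overall:.2f} recomputed={track_mean(p, c, a):.2f}")
```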
Question: Which of the following does NOT appear in the video frame when the video introduces Painting?
Options:
Ground truth: D. Abstraction
Question: Based on your understanding of the video, which of the following statements about Baroque painting is/are correct?
Options:
Ground truth: I. Statement 2 is correct
Question: On the basis of style, this painting belongs to which of the following periods?
Options:
Ground truth: I. Baroque
Question: According to the video, when a minimum price control is imposed on alcoholic drinks, the intention is to discourage consumption from Q1 to ____ because of the negative externalities, and the price is raised to ____ from the free market price of ____. Fill in the blanks based on the video content.
Options:
Ground truth: A. (Q*, Pmin, P1)
Question: Based on your understanding, what is the correct sequence of consequences when a minimum price is imposed on a good with negative externalities in consumption?
Options:
Ground truth: A. Free market equilibrium at P1 and Q1 → Minimum price imposed above P1 → Consumption contracts to Q* → Externality internalized.
Question: Suppose the government decided that, since gasoline is a necessity, its price should be legally capped at $1.30 per gallon. What do you anticipate would be the outcome in the gasoline market?
Options:
Ground truth: B. A price below that of $1.30 would cause a situation of excess demand and hence a shortage.
Question: Based on the video, fill in the blanks about the Functionalist view of society's culture:
"Functionalists take a ________ view on society's culture and suggest it ________."
Options:
Ground truth: A. Consensus, reflects the norms and values of the majority - a value consensus
Question: Based on your understanding, which of the following statements about different sociological theories on culture is/are false?
Statements:
Options:
Ground truth: G. Statements 1 and 4 are false
Question: Identify the theory that the following argument represents.
Options:
Ground truth: Functionalism
Question: What is the equation used when solving the first question in the video?
Options:
Ground truth: D. $y = y_0 + v_{0y} t - \frac{1}{2} g t^2$
Question: If the angle theta is changed to 30 degrees, what is the result of the first question about the total time in the air?
Options:
Ground truth: E. 3.04 seconds
Question: A rocket is shot from the top of a tower at an angle of 45° above the horizontal (Fig. 19-1). It hits the ground in 5 seconds at a horizontal distance from the foot of the tower equal to three times the height of the tower. Find the height of the tower.
Options:
Ground truth: A. h = 100 ft
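The reasoning behind this answer can be sketched with standard constant-acceleration projectile equations. The sketch assumes g = 32 ft/s², no air resistance, and an unpowered projectile after launch. None of these are stated on the page, and `tower_height` is a hypothetical helper name. With launch angle θ, the horizontal condition gives v0·t = 3h / cos θ, and substituting into the vertical equation -h = v0·sin θ·t - ½·g·t² yields h·(1 + 3·tan θ) = ½·g·t².

```python
import math

G_FT_PER_S2 = 32.0  # assumed gravitational acceleration in ft/s^2


def tower_height(flight_time: float,
                 distance_ratio: float = 3.0,
                 angle_deg: float = 45.0,
                 g: float = G_FT_PER_S2) -> float:
    """Tower height h when the projectile lands distance_ratio * h away.

    Derived from:
        x:  v0 * cos(theta) * t = distance_ratio * h
        y:  -h = v0 * sin(theta) * t - 0.5 * g * t**2
    which combine to h * (1 + distance_ratio * tan(theta)) = 0.5 * g * t**2.
    """
    tan_theta = math.tan(math.radians(angle_deg))
    return 0.5 * g * flight_time ** 2 / (1.0 + distance_ratio * tan_theta)


print(round(tower_height(5.0), 2))  # 100.0 under the stated assumptions
```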
Question: At the beginning of the video, what are the muscles in the lower left corner, upper left corner, and lower right corner, respectively?
Options:
Ground truth: J. Cardiac muscle, Skeletal muscle, Smooth muscle
Question: Based on the video, how many of the following characteristics can be used to identify the different types of muscle tissue?
Characteristics:
Options:
Ground truth: H. 7
Question: What kind of tissue does this image depict?
Options:
Ground truth: A. Cardiac muscle
Question: Identify the correct sequence of steps to construct the minimum spanning tree using Kruskal's algorithm from the graph described in the video.
Options:
Ground truth: G. Add BE, add AC, add EF, add BC, add FG.
Question: Based on the graph example in the video, if you apply Kruskal's algorithm and the weight of the first few edges changes slightly, which would be the resulting edge sequence if the edge BE now has a weight of 1 and EF a weight of 0.5?
Options:
Ground truth: F. EF, BE, AC, BC, FG
Question: Consider the following graph: Which one of the following can be a valid sequence of edges added, in that order, to a minimum spanning tree using Kruskal's algorithm?
Options:
Ground truth: D. (d-f), (a-b), (b-f), (d-c), (d-e)
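For readers unfamiliar with the procedure these three questions test, here is a minimal sketch of Kruskal's algorithm using a union-find structure. The graph in the example is hypothetical: it is neither the graph from the video nor the one from the question above, since neither is reproduced on this page. The function name `kruskal` and the vertex labels are chosen only for illustration.

```python
def kruskal(vertices, edges):
    """Return MST edges as (u, v, weight), in the order Kruskal's algorithm adds them.

    edges is a list of (weight, u, v) tuples for an undirected graph.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the set representative, halving paths as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    added = []
    for weight, u, v in sorted(edges):  # consider edges from lightest to heaviest
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so it creates no cycle
            parent[root_u] = root_v
            added.append((u, v, weight))
    return added


# Hypothetical 4-vertex graph, for illustration only.
vertices = ["p", "q", "r", "s"]
edges = [(1, "p", "q"), (2, "q", "r"), (3, "p", "r"), (4, "r", "s"), (5, "q", "s")]
print(kruskal(vertices, edges))
# [('p', 'q', 1), ('q', 'r', 2), ('r', 's', 4)]
```

Because edges are processed in order of increasing weight, changing a single weight (as in the Comprehension question above) can change the order in which edges are added without changing which edges end up in the tree.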