Video-MMMU

Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos


Kairui Hu1, Penghao Wu1, Fanyi Pu1, Wang Xiao1, Yuanhan Zhang1,
Xiang Yue2, Bo Li1, Ziwei Liu1*

1S-Lab, Nanyang Technological University
2Carnegie Mellon University









Example videos by discipline and subject:

Business: Economics, Finance, Manage
Engineering: Architecture, Energy, Electronics, Computer Science
Art: Art Theory
Humanities: Psychology, History
Science: Physics, Chemistry
Medicine: Clinical Medicine, Public Health


Abstract


Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies the improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.



Overview


We introduce Video-MMMU, a massive, multi-modal, multi-disciplinary video benchmark that evaluates how well LMMs acquire knowledge from educational videos. It has three main features:

1) Knowledge-intensive Video Collection:
Our dataset comprises 300 expert-level videos spanning 6 professional disciplines: Art, Business, Science, Medicine, Humanities, and Engineering, with 30 subjects distributed among them.

2) Knowledge Acquisition-based Question Design:
Each video includes three question-answer pairs aligned with the three knowledge acquisition stages: Perception (identifying key information related to the knowledge), Comprehension (understanding the underlying concepts), and Adaptation (applying knowledge to new scenarios).

3) Quantitative Knowledge Acquisition Assessment:
We propose a knowledge acquisition metric, denoted as Δknowledge, to measure performance gains on practice exam questions after learning from videos. This metric enables us to quantitatively evaluate how effectively large multimodal models (LMMs) can assimilate and utilize the information presented in the videos to solve real-world, novel problems.
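
This page does not spell out the formula behind Δknowledge. A minimal sketch, assuming it is a normalized gain (the accuracy improvement after watching the video, divided by the headroom that was left before watching), might look like this. The function name and the sample accuracies are hypothetical and are not reported results.

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized knowledge gain, in percent.

    Assumed definition (not stated on this page):
        (acc_after - acc_before) / (100 - acc_before) * 100
    Both accuracies are percentages in [0, 100].
    """
    if not (0.0 <= acc_before <= 100.0 and 0.0 <= acc_after <= 100.0):
        raise ValueError("accuracies must be percentages in [0, 100]")
    if acc_before == 100.0:
        raise ValueError("no headroom: accuracy before the video is already 100%")
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


# Hypothetical numbers, for illustration only:
# 15 points gained out of 60 points of headroom.
gain = delta_knowledge(acc_before=40.0, acc_after=55.0)
print(f"{gain:.1f}%")  # -> 25.0%
```

Under this assumed definition, a model that answers worse after watching the video gets a negative value, so the metric can also expose cases where the video hurts performance.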


Figure 2


Key Insights:

  • 1) Progressive Performance Decline:
    Model performance decreases as cognitive demands increase. While models perform relatively better on perception tasks, their accuracy drops notably on comprehension tasks and declines further on adaptation tasks.
  • 2) Knowledge Acquisition from Videos is Challenging:
    The knowledge acquisition metric Δknowledge reveals a significant gap between human and model performance. While humans achieve substantial improvement (Δknowledge = 33.1%) after watching the videos, even the top-performing models show smaller knowledge gains (GPT-4o: Δknowledge = 15.6%, Claude-3.5-Sonnet: Δknowledge = 11.4%).

This gap points to a limitation of current LMMs. Humans acquire knowledge from videos naturally, having developed the skill through classroom learning and other educational experiences, whereas LMMs struggle to learn effectively from videos. These findings emphasize the need for further research on how LMMs acquire and use video-based information, bringing them closer to human-level learning.


Video-MMMU Leaderboard

We evaluate various open-source and proprietary LMMs. The table below provides a detailed comparison.

Model                 Overall  Perception  Comprehension  Adaptation
Human Expert            74.44       84.33          78.67       60.33
Claude-3.5-Sonnet       65.78       72.00          69.67       55.67
GPT-4o                  61.22       66.00          62.00       55.67
Gemini 1.5 Pro          53.89       59.00          53.33       49.33
Aria                    50.78       65.67          46.67       40.00
Gemini 1.5 Flash        49.78       57.33          49.00       43.00
LLaVA-Video-72B         49.67       59.67          46.00       43.33
LLaVA-OneVision-72B     48.33       59.67          42.33       43.00
MAmmoTH-VL-8B           41.78       51.67          40.00       33.67
InternVL2-8B            37.44       47.33          33.33       31.67
LLaVA-Video-7B          36.11       41.67          33.33       33.33
VILA1.5-40B             34.00       38.67          30.67       32.67
LLaVA-OneVision-7B      33.89       40.00          31.00       30.67
Llama-3.2-11B           30.00       35.67          32.33       22.00
LongVA-7B               23.98       24.00          24.33       23.67
VILA1.5-8B              20.89       20.33          17.33       25.00
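
The Overall column appears to be the unweighted mean of the three track scores, which is what one would expect if every track contributes the same number of questions (300 videos, one question per track each). The page does not state this, so the sketch below is only a consistency check under that assumption, using a few rows copied from the table. The helper name `track_mean` is hypothetical.

```python
# Rows copied from the leaderboard:
# (overall, perception, comprehension, adaptation).
rows = {
    "Human Expert": (74.44, 84.33, 78.67, 60.33),
    "Claude-3.5-Sonnet": (65.78, 72.00, 69.67, 55.67),
    "GPT-4o": (61.22, 66.00, 62.00, 55.67),
    "LongVA-7B": (23.98, 24.00, 24.33, 23.67),
}


def track_mean(perception: float, comprehension: float, adaptation: float) -> float:
    """Unweighted mean of the three track accuracies, rounded to 2 decimals."""
    return round((perception + comprehension + adaptation) / 3, 2)


for model, (overall, *tracks) in rows.items():
    mean = track_mean(*tracks)
    print(f"{model}: reported {overall:.2f}, mean of tracks {mean:.2f}")
```

For the rows shown, the reported value matches the mean except for LongVA-7B, where the mean of the listed track scores is 24.00 against a reported 23.98.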

Demo Video



Statistics


Figure 3

Annotation Pipeline



Error Analysis


Figure 3

Error Examples in Adaptation track


Figure 3

Wrong-to-Right Examples in Adaptation track


Figure 3

Video-MMMU Examples



Video-MMMU Examples - Art

Perception Track

Question: Which of the following does NOT appear in the video frame when the video introduces Painting?

Options:

  A. Intense, warm colors
  B. Strong contrasts of light and dark
  C. Focus on movement, drama, and emotion
  D. Abstraction
  E. Allegory
  F. Enhanced sense of movement
  G. Deliberately set apart from Renaissance and Mannerism
  H. Asymmetry
  I. Renaissance
  J. Mannerism

Ground truth: D. Abstraction

Comprehension Track

Question: Based on your understanding of the video, which of the following statements about Baroque painting is/are correct?

  • Statement 1. Baroque paintings focused on calm, balanced scenes with even lighting.
  • Statement 2. Baroque artists used strong contrasts of light and dark to highlight key figures.
  • Statement 3. Baroque painting was known for symmetrical compositions and a sense of stability.
  • Statement 4. Baroque artists avoided using allegory in their works.

Options:

  A. Statement 1 is correct
  B. Statements 1 and 2 are correct
  C. Statements 1, 2, and 3 are correct
  D. Statements 1, 3, and 4 are correct
  E. Statements 2 and 4 are correct
  F. Statement 3 is correct
  G. Statements 1 and 4 are correct
  H. Statements 1, 3, and 4 are correct
  I. Statement 2 is correct
  J. All are correct

Ground truth: I. Statement 2 is correct

Adaptation Track

Question: On the basis of style, this painting belongs to which of the following periods?

Painting for Adaptation Question

Options:

  A. Rococo
  B. Gothic
  C. Renaissance
  D. Baroque
  E. Neoclassicism
  F. Byzantine
  G. Mannerism
  H. Romanticism
  I. Art Nouveau
  J. Classical

Ground truth: D. Baroque


Video-MMMU Examples - Business

Perception Track

Question: According to the video, under a minimum price control for alcoholic drinks, the intention is to discourage consumption from Q1 to ____ because of the negative externalities, and the price is raised to ____ from the free-market price of ____. Fill in the blanks based on the video content.

Options:

  A. (Q*, Pmin, P1)
  B. (Q*, P1, Pmin)
  C. (Q1, Pmin, P2)
  D. (Q2, P1, Pmin)
  E. (Q*, P2, P1)
  F. (Q1, P2, Pmin)
  G. (Q2, Pmin, P1)
  H. (Q*, Pmin, P2)
  I. (Q2, P2, P1)
  J. (Q1, P1, Pmin)

Ground truth: A. (Q*, Pmin, P1)

Comprehension Track

Question: Based on your understanding, what is the correct sequence of consequences when a minimum price is imposed on a good with negative externalities in consumption?

Options:

  A. Free market equilibrium at P1 and Q1 → Minimum price imposed above P1 → Consumption contracts to Q* → Externality internalized.
  B. Minimum price imposed above P1 → Free market equilibrium at P1 and Q1 → Consumption contracts to Q* → Externality internalized.
  C. Free market equilibrium at P1 and Q1 → Consumption contracts to Q* → Minimum price imposed above P1 → Externality internalized.
  D. Externality internalized → Free market equilibrium at P1 and Q1 → Minimum price imposed above P1 → Consumption contracts to Q*.
  E. Minimum price imposed above P1 → Consumption contracts to Q* → Externality internalized → Free market equilibrium at P1 and Q1.
  F. Consumption contracts to Q* → Minimum price imposed above P1 → Free market equilibrium at P1 and Q1 → Externality internalized.
  G. Free market equilibrium at P1 and Q1 → Minimum price imposed above P1 → Externality internalized → Consumption contracts to Q*.
  H. Minimum price imposed above P1 → Free market equilibrium at P1 and Q1 → Externality internalized → Consumption contracts to Q*.
  I. Externality internalized → Minimum price imposed above P1 → Free market equilibrium at P1 and Q1 → Consumption contracts to Q*.
  J. Consumption contracts to Q* → Externality internalized → Minimum price imposed above P1 → Free market equilibrium at P1 and Q1.

Ground truth: A. Free market equilibrium at P1 and Q1 → Minimum price imposed above P1 → Consumption contracts to Q* → Externality internalized.

Adaptation Track

Question: Suppose the government decided that, since gasoline is a necessity, its price should be legally capped at $1.30 per gallon. What do you anticipate would be the outcome in the gasoline market?

Gasoline Market Question

Options:

  A. A price below that of $1.30 would cause a situation of excess demand and hence a surplus.
  B. A price below that of $1.30 would cause a situation of excess demand and hence a shortage.
  C. Not certain.
  D. A price above that of $1.30 would cause a situation of excess demand and hence a shortage.
  E. A price above that of $1.30 would cause a situation of excess supply and hence a shortage.
  F. A price at $1.30 per gallon would result in an equilibrium where supply meets demand.
  G. A price below that of $1.30 would cause a situation of excess supply and hence a surplus.
  H. A price at $1.30 would eliminate any market shortages or surpluses.
  I. A price of $1.30 would result in both excess demand and excess supply, depending on consumer preferences.
  J. All situations are possible.

Ground truth: B. A price below that of $1.30 would cause a situation of excess demand and hence a shortage.
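
The reasoning behind this answer is that a binding price ceiling, one set below the equilibrium price, leaves quantity demanded above quantity supplied. A small numeric sketch follows. The linear demand and supply curves are hypothetical and chosen only for illustration; the actual curves are in the question image, which is not reproduced here.

```python
def quantity_demanded(price: float) -> float:
    # Hypothetical linear demand curve: Qd = 1000 - 300 * P
    return 1000.0 - 300.0 * price


def quantity_supplied(price: float) -> float:
    # Hypothetical linear supply curve: Qs = 200 + 200 * P
    return 200.0 + 200.0 * price


def shortage_at(price: float) -> float:
    """Excess demand at a given price (a positive value means a shortage)."""
    return quantity_demanded(price) - quantity_supplied(price)


# Setting Qd = Qs for the curves above gives P = 800 / 500.
equilibrium_price = (1000.0 - 200.0) / (300.0 + 200.0)
ceiling = 1.30  # the legal cap from the question

print(f"equilibrium price: {equilibrium_price:.2f}")
print(f"shortage at a ceiling of {ceiling:.2f}: {shortage_at(ceiling):.0f} units")
```

With these assumed curves the equilibrium price is above the cap, so the cap binds and demand exceeds supply, which is the shortage described in option B.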


Video-MMMU Examples - Humanities

Perception Track

Question: Based on the video, fill in the blanks about the Functionalist view of society’s culture:

"Functionalists take a ________ view on society’s culture and suggest it ________."

Options:

  A. Consensus, reflects the norms and values of the majority - a value consensus
  B. Unified, represents shared practices and beliefs of a group
  C. Conflict, reveals power struggles and competing interests within society
  D. Critical, critiques the dominant traditions that maintain inequalities
  E. Conservative, emphasizes the preservation of societal expectations
  F. Agreement, highlights the shared opinions and moral agreements
  G. Progressive, encourages the development of new cultural ideologies
  H. Stratified, reflects the rules that structure society into classes
  I. Divisive, showcases the competing customs and traditions in society
  J. Analytical, breaks down societal structures into individual roles

Ground truth: A. Consensus, reflects the norms and values of the majority - a value consensus

Comprehension Track

Question: Based on your understanding, which of the following statements about different sociological theories on culture is/are false?

Statements:

  • Statement 1. Functionalists argue that society’s culture is fragmented, and individuals create their own norms based on personal preferences and life experiences.
  • Statement 2. Marxists suggest that culture is imposed by the ruling class (bourgeoisie) to maintain control over the working class and reinforce capitalist ideologies.
  • Statement 3. Feminists believe that culture is male-dominated and reflects patriarchal values, prioritizing men’s interests over women’s.
  • Statement 4. Postmodernists suggest that there is a single dominant culture, and individuals can choose their own cultural norms.

Options:

  A. Statement 1 is false
  B. Statement 2 is false
  C. Statement 3 is false
  D. Statement 4 is false
  E. Statements 1 and 2 are false
  F. Statements 3 and 4 are false
  G. Statements 1 and 4 are false
  H. Statements 1 and 3 are false
  I. Statements 2 and 4 are false
  J. None of the statements are false

Ground truth: G. Statements 1 and 4 are false

Adaptation Track

Question: Identify the theory that the following argument represents.

Argument for Adaptation Question

Options:

  A. Functionalism
  B. Marxism
  C. New Right
  D. Postmodernism
  E. Social action approaches
  F. Interactionism
  G. Feminism
  H. Conflict Theory
  I. Modernity
  J. Positivists

Ground truth: A. Functionalism


Video-MMMU Examples - Science

Perception Track

Question: What is the equation used when solving the first question in the video?

Options:

  A. $y = v_{0y} t + \frac{1}{2} g t^2$
  B. $y = y_0 + v_{0x} t - \frac{1}{2} g t^2$
  C. $y = v_{0x} t + g t^2$
  D. $y = y_0 + v_{0y} t - \frac{1}{2} g t^2$
  E. $y = v_{0y} t - g t$
  F. $y = y_0 + v_{0x} t + \frac{1}{2} g t^2$
  G. $y = y_0 + g t$
  H. $y = y_0 + v_{0y} t + g t^2$
  I. $y = v_{0x} t + \frac{1}{2} g t^2$
  J. $y = y_0 + \frac{1}{2} g t^2$

Ground truth: D. $y = y_0 + v_{0y} t - \frac{1}{2} g t^2$

Comprehension Track

Question: If the angle theta is changed to 30 degrees, what is the result of the first question about the total time in the air?

Options:

  A. 4.00 seconds
  B. 2.82 seconds
  C. 3.50 seconds
  D. 2.50 seconds
  E. 3.04 seconds
  F. 2.00 seconds
  G. 3.15 seconds
  H. 1.85 seconds
  I. 2.25 seconds
  J. 3.85 seconds

Ground truth: E. 3.04 seconds

Adaptation Track

Question: A rocket is shot from the top of a tower at an angle of 45° above the horizontal (Fig. 19-1). It hits the ground in 5 seconds at a horizontal distance from the foot of the tower equal to three times the height of the tower. Find the height of the tower.

Rocket Adaptation Question

Options:

  A. h = 100 ft
  B. h = 80 ft
  C. h = 110 ft
  D. h = 85 ft
  E. h = 90 ft
  F. h = 95 ft
  G. h = 105 ft
  H. h = 120 ft
  I. h = 75 ft
  J. h = 115 ft

Ground truth: A. h = 100 ft
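
This answer can be checked with the same projectile kinematics that the video questions rely on. The sketch below assumes the rocket is treated as an unpowered projectile, air resistance is neglected, and g = 32 ft/s² (the question does not state a value for g; this one reproduces the listed answer). The function name and parameters are hypothetical.

```python
import math


def tower_height(t: float, g: float, distance_ratio: float = 3.0,
                 angle_deg: float = 45.0) -> float:
    """Tower height h, given flight time t and horizontal range = ratio * h.

    Horizontal: ratio * h = v0 * cos(angle) * t
    Vertical:   -h = v0 * sin(angle) * t - 0.5 * g * t**2
    Eliminating v0 gives h * (1 + ratio * tan(angle)) = 0.5 * g * t**2.
    """
    tan_a = math.tan(math.radians(angle_deg))
    return 0.5 * g * t**2 / (1.0 + distance_ratio * tan_a)


# g = 32 ft/s^2 is an assumption; t = 5 s and ratio = 3 come from the question.
h = tower_height(t=5.0, g=32.0)
print(f"h = {h:.0f} ft")  # -> h = 100 ft
```

At 45° the horizontal and vertical launch components are equal, so the relation reduces to 4h = ½ g t², and 4h = 400 ft gives the 100 ft in option A.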


Video-MMMU Examples - Medicine

Perception Track

Question: At the beginning of the video, what are the muscles in the lower left corner, upper left corner, and lower right corner, respectively?

Options:

  A. Cardiac muscle, Smooth muscle, Skeletal muscle
  B. Skeletal muscle, Cardiac muscle, Smooth muscle
  C. Skeletal muscle, Smooth muscle, Cardiac muscle
  D. Smooth muscle, Cardiac muscle, Skeletal muscle
  E. Smooth muscle, Skeletal muscle, Cardiac muscle
  F. Smooth muscle, Cardiac muscle, Cardiac muscle
  G. Skeletal muscle, Skeletal muscle, Smooth muscle
  H. Cardiac muscle, Smooth muscle, Smooth muscle
  I. Skeletal muscle, Smooth muscle, Smooth muscle
  J. Cardiac muscle, Skeletal muscle, Smooth muscle

Ground truth: J. Cardiac muscle, Skeletal muscle, Smooth muscle

Comprehension Track

Question: Based on the video, how many of the following characteristics can be used to identify the different types of muscle tissue?

Characteristics:

  • (c1). Presence of striations
  • (c2). Presence of intercalated discs
  • (c3). Voluntary control
  • (c4). Involuntary control
  • (c5). Branching appearance
  • (c6). Smooth, spindle-shaped cells
  • (c7). Long, cylindrical fibers
  • (c8). Multinucleated cells
  • (c9). Location

Options:

  A. 0
  B. 1
  C. 2
  D. 3
  E. 4
  F. 5
  G. 6
  H. 7
  I. 8
  J. 9

Ground truth: H. 7

Adaptation Track

Question: What kind of tissue does this image depict?

Muscle Tissue Question

Options:

  A. Cardiac muscle
  B. Skeletal muscle
  C. Cartilage
  D. Smooth muscle
  E. Tendon
  F. Ligament
  G. Adipose tissue
  H. Epithelium
  I. Connective tissue
  J. Nerve tissue

Ground truth: A. Cardiac muscle


Video-MMMU Examples - Engineering

Perception Track

Question: Identify the correct sequence of steps to construct the minimum spanning tree using Kruskal's algorithm from the graph described in the video.

Options:

  A. Add CF, add AC, add EF, add BC, add FG.
  B. Add BE, add EF, add CF, add BC, add FG.
  C. Add BE, add EF, add FG, add BC, add CD.
  D. Add BE, add AC, add EF, add FC, add CD.
  E. Add BE, add BC, add AC, add EF, add FG.
  F. Add BE, add AC, add DF, add BC, add CD.
  G. Add BE, add AC, add EF, add BC, add FG.
  H. Add BE, add AC, add DF, add FG, add BC.
  I. Add BE, add EF, add BC, add FG, add CD.
  J. Add BE, add AC, add FG, add EF, add BC.

Ground truth: G. Add BE, add AC, add EF, add BC, add FG.

Comprehension Track

Question: Based on the graph example in the video, if you apply Kruskal's algorithm and the weight of the first few edges changes slightly, which would be the resulting edge sequence if the edge BE now has a weight of 1 and EF a weight of 0.5?

Options:

  A. EF, AC, BE, BC, FG
  B. BE, EF, AC, FG, BC
  C. BE, AC, BC, EF, FG
  D. AC, BE, EF, BC, FG
  E. EF, BE, FG, AC, BC
  F. EF, BE, AC, BC, FG
  G. BE, EF, FG, BC, AC
  H. EF, AC, BC, BE, FG
  I. BE, AC, FG, EF, BC
  J. AC, FG, BE, EF, BC

Ground truth: F. EF, BE, AC, BC, FG

Adaptation Track

Question: Consider the following graph: Which one of the following can be a valid sequence of edges added, in that order, to a minimum spanning tree using Kruskal's algorithm?

Kruskal's Algorithm Adaptation Question

Options:

  A. (a-b), (b-f), (d-f), (d-c), (d-e)
  B. (a-b), (d-c), (d-f), (b-f), (d-e)
  C. (d-f), (d-e), (a-b), (d-c), (b-f)
  D. (d-f), (a-b), (b-f), (d-c), (d-e)
  E. (a-c), (b-f), (d-f), (d-c), (d-e)
  F. (d-f), (d-c), (a-b), (b-f), (d-e)
  G. (a-c), (d-f), (b-f), (d-e), (d-c)
  H. (d-c), (d-f), (a-b), (b-f), (d-e)
  I. (d-f), (b-f), (a-b), (d-c), (d-e)
  J. (a-c), (d-c), (d-f), (b-f), (d-e)

Ground truth: D. (d-f), (a-b), (b-f), (d-c), (d-e)
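
For readers unfamiliar with the algorithm these three questions test, here is a minimal sketch of Kruskal's algorithm using a union-find structure. The graph and its weights are hypothetical; they are not the graph from the video or from the Adaptation image, neither of which is reproduced on this page.

```python
def kruskal(vertices, edges):
    """Return the MST edges in the order Kruskal's algorithm adds them.

    edges: iterable of (weight, u, v) tuples. Ties are broken by tuple
    sort order, so another tie-break can give a different valid order.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Walk up to the set representative, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    added = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so keep it
            parent[root_u] = root_v
            added.append((u, v, weight))
    return added


# Hypothetical example graph.
vertices = ["a", "b", "c", "d"]
edges = [(1, "a", "b"), (2, "b", "c"), (3, "a", "c"), (4, "c", "d"), (5, "b", "d")]
mst = kruskal(vertices, edges)
print(mst)  # -> [('a', 'b', 1), ('b', 'c', 2), ('c', 'd', 4)]
```

Edges are considered in increasing weight order and rejected only when they would close a cycle. Edges of equal weight may be taken in either order, which is why a question like the Adaptation one asks which sequence "can be" valid rather than which one is.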




This website is adapted from Panda-70M, and all website content is licensed under Creative Commons Attribution-NonCommercial 4.0 International License.