Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images

Zhiyuan Li, Heng Wang, Dongnan Liu, Chaoyi Zhang, Ao Ma, Jieting Long, Weidong Cai,
School of Computer Science, The University of Sydney

The MuCR benchmark challenges Vision Large Language Models' causality comprehension ability.


Abstract

Large Language Models (LLMs) have showcased exceptional ability in causal reasoning from textual information. However, will these causalities remain straightforward for Vision Large Language Models (VLLMs) when only visual hints are provided? Motivated by this, we propose a novel Multimodal Causal Reasoning benchmark, namely MuCR, to challenge VLLMs to infer semantic cause-and-effect relationships when relying solely on visual cues such as action, appearance, clothing, and environment. Specifically, we introduce a prompt-driven image synthesis approach to create siamese images with embedded semantic causality and visual cues, which can effectively evaluate VLLMs' causal reasoning capabilities. Additionally, we develop tailored metrics from multiple perspectives, including image-level matching, phrase-level understanding, and sentence-level explanation, to comprehensively assess VLLMs' comprehension abilities. Our extensive experiments reveal that current state-of-the-art VLLMs are not as skilled at multimodal causal reasoning as we might have hoped. Furthermore, we perform a comprehensive analysis to understand these models' shortcomings from different perspectives and suggest directions for future research. We hope MuCR can serve as a valuable resource and foundational benchmark in multimodal causal reasoning research.

Previous Limitations

We identify three major drawbacks in previous benchmarks:

(1) Absence of visual modality: Linguistic causal reasoning benchmarks fail to assess visual comprehension ability.

(2) Lack of multi-image understanding: Current causal reasoning VQA tasks are inadequate for cross-image analysis.

(3) Absence of cause-and-effect questions: Existing multi-image understanding benchmarks lack cause-and-effect questions, rendering them insufficient for evaluating VLLMs' causal reasoning capabilities.


Cause-and-effect Image Synthesis

Our cause-and-effect image synthesis begins with generating core caption pairs, each consisting of one caption describing the cause and the other stating the effect. We then leverage the language capabilities of LLMs to entail these paired captions into contextually relevant descriptions, enhancing the consistency of sentences to facilitate the creation of cause-and-effect image pairs. Finally, we employ diffusion models to generate numerous siamese images based on these descriptions, annotating cue phrases and causality explanations for each pair.
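The pipeline above can be sketched in a few lines of Python. The LLM backend (GPT-4o via the OpenAI API) and the diffusion model (Stable Diffusion XL via Hugging Face diffusers) are illustrative stand-ins rather than the exact tools behind MuCR, and every function name below is hypothetical.

# Minimal sketch of the prompt-driven synthesis pipeline, under the
# assumptions stated above; all names here are hypothetical.
import torch
from openai import OpenAI
from diffusers import StableDiffusionXLPipeline

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
sd_pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def expand_caption_pair(cause: str, effect: str) -> tuple[str, str]:
    """Ask an LLM to expand a core cause/effect caption pair into two
    contextually consistent image descriptions (shared subject, scene, style)."""
    prompt = (
        "Rewrite the following cause and effect captions as two detailed, "
        "visually consistent image descriptions that share the same subject "
        "and setting. Return them on two lines.\n"
        f"Cause: {cause}\nEffect: {effect}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    lines = [line for line in reply.splitlines() if line.strip()]
    return lines[0], lines[1]

def synthesize_siamese_pair(cause: str, effect: str, seed: int = 0):
    """Generate one cause image and one effect image from a caption pair."""
    cause_desc, effect_desc = expand_caption_pair(cause, effect)
    # Reusing the same seed encourages a consistent look across the pair.
    g1 = torch.Generator("cuda").manual_seed(seed)
    g2 = torch.Generator("cuda").manual_seed(seed)
    cause_img = sd_pipe(cause_desc, generator=g1).images[0]
    effect_img = sd_pipe(effect_desc, generator=g2).images[0]
    return cause_img, effect_img

cause_img, effect_img = synthesize_siamese_pair(
    "A boy plays in the rain without an umbrella.",
    "The boy is sick in bed with a thermometer in his mouth.",
)

The generated pairs are then annotated with cue phrases and causality explanations as described above.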


Dataset Distribution

Our MuCR benchmark consists of 400 pairs of cause-and-effect images spanning various categories (humans, animals, plants, characters, and mixtures) and two styles (photograph and comic). The table below shows examples from each category and style in MuCR, along with an overview of their distribution.


Benchmark Metric

Image-level Metric. The image-level score consists of two parts: the cause-to-effect (C2E) score and the effect-to-cause (E2C) score. This scoring assesses whether VLLMs can identify visual cues and semantic causality between images and make the correct choice from four candidate images (see paper for more details).

Phrase-level Metric. The phrase-level metric is the Cue score, which tests VLLMs' capability to distinguish the correct cue from a list of distractor phrases according to the siamese images (see paper for more details).
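Both the image-level (C2E/E2C) and phrase-level (Cue) scores reduce to multiple-choice accuracy: did the model pick the annotated option? The sketch below uses hypothetical field names for the prediction records and only illustrates that shared computation, not the paper's exact protocol.

# Illustrative only: the exact scoring protocol is defined in the paper.
# Each item is assumed to record the option the model picked and the
# ground-truth option; the field names are hypothetical.
def multiple_choice_accuracy(items: list[dict]) -> float:
    """Fraction of items where the model selected the annotated option."""
    correct = sum(1 for it in items if it["prediction"] == it["answer"])
    return correct / len(items)

# C2E: given the cause image, pick the true effect among four candidates.
# E2C: given the effect image, pick the true cause among four candidates.
# Cue: pick the correct cue phrase from a list that includes distractors.
c2e_items = [
    {"prediction": "B", "answer": "B"},
    {"prediction": "C", "answer": "A"},
]
print(multiple_choice_accuracy(c2e_items))  # 0.5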

Sentence-level Metric. Our final metric is designed to evaluate VLLMs' ability to explain causality. This sentence-level metric is called the explanation (Exp) score (see paper for more details).
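The paper specifies how the Exp score is computed; purely as an illustrative stand-in, the snippet below compares a generated explanation with an annotated ground-truth explanation via embedding cosine similarity (sentence-transformers is an assumed dependency and not necessarily part of the official metric).

# NOT the paper's Exp metric: a stand-in illustration that scores a model's
# explanation against the annotated ground truth with cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def explanation_similarity(generated: str, reference: str) -> float:
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

score = explanation_similarity(
    "The boy got sick because he played in the rain and caught a cold.",
    "Playing in the rain without an umbrella made the boy catch a cold.",
)
print(f"similarity: {score:.2f}")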

Experimental Results

We evaluate several popular open-source models on our benchmark, including BLIP2, OpenFlamingo, InstructBLIP, MiniGPT4, and LLaVA. Additionally, we assess large-scale closed-source models such as Claude, Gemini, and GPT-4. Their performance is shown below (see paper for more details).
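As a rough illustration of how such an evaluation can be run against a closed-source VLLM, the sketch below sends a cause image and four candidate effect images to GPT-4o through the OpenAI API and asks for a single-letter choice. The actual prompts and protocol follow the paper; the file names and helper names here are placeholders.

# Illustrative only: queries one arbitrary choice of closed-source VLLM
# (GPT-4o) on a single C2E item; all paths and names are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def ask_c2e(cause_path: str, candidate_paths: list[str]) -> str:
    content = [
        {"type": "text", "text": (
            "The first image shows a cause. Which of the four candidate images "
            "(A, B, C, D) shows its most plausible effect? Answer with one letter."
        )},
        {"type": "image_url", "image_url": {"url": to_data_url(cause_path)}},
    ]
    for path in candidate_paths:
        content.append({"type": "image_url", "image_url": {"url": to_data_url(path)}})
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return reply.choices[0].message.content.strip()

answer = ask_c2e("cause.jpg", ["opt_a.jpg", "opt_b.jpg", "opt_c.jpg", "opt_d.jpg"])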


BibTeX

@article{li2024multimodal,
          title={Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images},
          author={Li, Zhiyuan and Wang, Heng and Liu, Dongnan and Zhang, Chaoyi and Ma, Ao and Long, Jieting and Cai, Weidong},
          journal={arXiv preprint arXiv:2408.08105},
          year={2024}
        }