Source code for our paper: Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts
Click the links below to view our paper and datasets.
If you find this work useful, please cite our paper and give us a shining star 🌟
@misc{liu2025benchmarkingretrievalaugmentedgenerationmultimodal,
      title={Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts},
      author={Zhenghao Liu and Xingsheng Zhu and Tianshuo Zhou and Xinyi Zhang and Xiaoyuan Yi and Yukun Yan and Yu Gu and Ge Yu and Maosong Sun},
      year={2025},
      eprint={2502.17297},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.17297},
}
The Multi-Modal Retrieval-Augmented Generation (M2RAG) benchmark evaluates how effectively multi-modal large language models (MLLMs) use knowledge from retrieved multi-modal documents across four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. The Multi-Modal Retrieval Augmented Instruction Tuning (MM-RAIT) method further adapts MLLMs to multi-modal in-context learning, enhancing their effectiveness in using knowledge from these retrieved documents.
🌵 1. Install the following packages using pip or conda in your environment
Python==3.10
Pytorch
transformers==4.44.2 (4.46.1 for finetune qwen2-vl)
clip
faiss==1.9.0
tqdm
numpy
base64
diffusers
flash-attn
xformers
llamafactory
accelerate
nltk
rouge_score
sklearn
We provide a requirements.txt file listing the versions of all packages we used. If you have any problems configuring the environment, please refer to it.
🌵 2. Prepare the pretrained models: MiniCPM-V 2.6, Qwen2-VL, and VISTA
MM-RAIT is built on the MiniCPM-V 2.6 and Qwen2-VL models. VISTA is used to retrieve multi-modal documents.
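As a rough sketch of how to fetch these checkpoints locally (the Hugging Face repo IDs below are assumptions; double-check them against the official model cards):

```python
# Sketch: download the backbone and retriever checkpoints from the Hugging Face Hub.
# The repo IDs are assumptions -- verify them against the official model cards.
from huggingface_hub import snapshot_download

for repo_id in [
    "openbmb/MiniCPM-V-2_6",      # MiniCPM-V 2.6 (assumed repo id)
    "Qwen/Qwen2-VL-7B-Instruct",  # Qwen2-VL (assumed repo id)
    "BAAI/bge-visualized",        # VISTA / Visualized-BGE retriever weights (assumed repo id)
]:
    local_dir = snapshot_download(repo_id=repo_id)
    print(f"{repo_id} -> {local_dir}")
```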
- First, use `git clone` to download this project:
git clone https://github.com/NEUIR/M2RAG
cd M2RAG
- Second, you can either download the M2RAG dataset directly or follow the instructions in `data/data_preprocess` to build it step by step. Please place the downloaded dataset in the `data` folder.
(❗️Note: for `imgs.tsv`, you need to download the data from this link and run `7z x imgs.7z.001` to extract it; see the loading sketch after the directory layout below.)
data/
└──m2rag/
├──fact_verify/
├──image_cap/
├──image_rerank/
├──mmqa/
├──imgs.lineidx.new
└──imgs.tsv
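The image corpus is stored as a TSV of base64-encoded images, with `imgs.lineidx.new` holding the byte offset of each line for random access. A minimal loading sketch, assuming each line is `<image_id>\t<base64 image>` (check `data/data_preprocess` for the authoritative format):

```python
# Sketch: random access into imgs.tsv via the byte offsets in imgs.lineidx.new.
# Assumption: each TSV line is "<image_id>\t<base64-encoded image bytes>".
import base64
import io

from PIL import Image

with open("data/m2rag/imgs.lineidx.new") as f:
    offsets = [int(line.strip()) for line in f]  # byte offset of each TSV line

def load_image(line_no: int) -> Image.Image:
    with open("data/m2rag/imgs.tsv", "rb") as f:
        f.seek(offsets[line_no])
        image_id, b64 = f.readline().decode("utf-8").rstrip("\n").split("\t")
    return Image.open(io.BytesIO(base64.b64decode(b64))).convert("RGB")

print(load_image(0).size)
```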
🌵 3. Reproduce the zero-shot results
Once the dataset and the vanilla models are ready, you can follow the instructions below to reproduce our zero-shot results.
- Step 1: Encode the queries from the test set and the multi-modal corpus for each task.
cd script
bash get_embed_test.sh
- Step 2: Retrieve the top-N most relevant multi-modal documents for each query (a retrieval sketch follows these steps).
bash retrieval_test.sh
- Step 3: Use the retrieved documents for vanilla RAG inference.
bash inference_cpmv.sh or bash inference_qwen.sh
For the Image Reranking task, please use:
bash compute_ppl_minicpmv.sh or bash compute_ppl_qwen2vl.sh
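Steps 1 and 2 boil down to dense retrieval: embed the queries and the multi-modal corpus with VISTA, then run nearest-neighbour search. A minimal sketch of the search step with FAISS (the `.npy` file names and the exact index type are illustrative assumptions, not the scripts' actual arguments):

```python
# Sketch: exact inner-product search with FAISS over precomputed embeddings.
# Assumption: the encoding step saved float32 numpy arrays of query/corpus embeddings.
import faiss
import numpy as np

query_embs = np.load("query_embeds.npy").astype("float32")    # (num_queries, dim), hypothetical file
corpus_embs = np.load("corpus_embeds.npy").astype("float32")  # (num_docs, dim), hypothetical file

faiss.normalize_L2(query_embs)   # normalized inner product == cosine similarity
faiss.normalize_L2(corpus_embs)

index = faiss.IndexFlatIP(corpus_embs.shape[1])
index.add(corpus_embs)

top_n = 5
scores, doc_ids = index.search(query_embs, top_n)  # doc_ids[i] holds the top-N document indices for query i
```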
🌵 4. Train MM-RAIT
Using the MiniCPM-V 2.6 model as an example, the following steps show how to reproduce the MM-RAIT results in the paper; the same procedure applies to Qwen2-VL. We also provide fine-tuned checkpoints, so you can skip training and proceed directly to inference.
- First step: Prepare the training data.
bash get_embed_train.sh
bash retrieval_train.sh
cd ../data/
bash finetune/construct_finetune_data.sh
- Second step: Fine-tune the MiniCPM-V model using LoRA (a conceptual LoRA sketch follows these steps).
cd ../script
bash finetune_cpmv.sh
- Final step: Use the fine-tuned model for inference.
bash inference_cpmv.sh
For the Image Reranking task, please use:
bash compute_ppl_minicpmv.sh
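For orientation only, the sketch below illustrates what LoRA fine-tuning does conceptually using Hugging Face `peft`; the actual training is driven by `finetune_cpmv.sh`, and the target modules, rank, and other hyperparameters here are assumptions rather than the script's settings.

```python
# Conceptual LoRA sketch with Hugging Face peft -- not the repo's actual training entry point.
# Target modules, rank, and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

model = AutoModel.from_pretrained("openbmb/MiniCPM-V-2_6", trust_remote_code=True)

lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections of the language backbone (assumed names)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```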
🌵 5. Evaluate model performance
Go to the `src/evaluation` folder and evaluate model performance as follows:
- For the Image Captioning and Multi-modal QA tasks, please use (a sketch of the underlying metrics follows this list):
python generation.py --reference_file path_to_reference_data --candidate_file path_to_generation_data
- For the Multi-Modal Fact Verification task, please use:
python evaluation/classification.py --true_file path_to_reference_data --pred_file path_to_generation_data
- For the Image Reranking task, please use:
python -m pytorch_fid path/to/reference_images path/to/rerank_images
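For reference, the sketch below shows the kind of metrics these scripts compute, using `rouge_score` and `scikit-learn` from the requirements; the exact metric settings, file parsing, and label set in the real scripts may differ.

```python
# Sketch of the underlying metrics: ROUGE-L for the generation tasks and accuracy/macro-F1
# for fact verification. The label strings below are illustrative only.
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score, f1_score

# Generation (Image Captioning / Multi-modal QA): score a candidate against a reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score("a dog runs along the beach", "a dog is running on the beach")
print(rouge["rougeL"].fmeasure)

# Classification (Multi-modal Fact Verification): compare predicted and gold labels.
y_true = ["Support", "Refute", "Insufficient"]
y_pred = ["Support", "Refute", "Refute"]
print(accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro"))
```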
If you have questions, suggestions, or bug reports, please email: