Evaluating Object Hallucination in Large Vision-Language Models

https://arxiv.org/abs/2305.10355
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, Ji-Rong Wen

Used the CHAIR metric (Caption Hallucination Assessment with Image Relevance) from Rohrbach2018object.
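
For reference, a minimal sketch of the two CHAIR variants: CHAIR_i is the fraction of mentioned objects that are not in the image, and CHAIR_s is the fraction of captions containing at least one such object. The function name `chair_scores` and the input format are assumptions for illustration. The sketch also assumes the objects mentioned in each caption have already been extracted and mapped to the annotation vocabulary, and it counts each object once per caption.

```python
def chair_scores(mentioned_per_caption, truth_per_image):
    """Return (CHAIR_i, CHAIR_s) for parallel lists of object sets.

    mentioned_per_caption: objects mentioned in each generated caption.
    truth_per_image: ground-truth objects of the corresponding image.
    """
    hallucinated_mentions = 0
    total_mentions = 0
    hallucinated_captions = 0

    for mentioned, truth in zip(mentioned_per_caption, truth_per_image):
        # Objects the caption mentions that the image does not contain.
        hallucinated = set(mentioned) - set(truth)
        hallucinated_mentions += len(hallucinated)
        total_mentions += len(set(mentioned))
        if hallucinated:
            hallucinated_captions += 1

    n_captions = len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / n_captions if n_captions else 0.0
    return chair_i, chair_s


# Toy example: the first caption hallucinates "frisbee", the second is clean.
mentioned = [{"dog", "frisbee"}, {"cat", "sofa"}]
truth = [{"dog", "grass"}, {"cat", "sofa", "lamp"}]
chair_i, chair_s = chair_scores(mentioned, truth)
```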

Proposed the Polling-based Object Probing Evaluation (POPE) method. The idea is to probe LVLMs with yes/no questions about objects in the image, where the nonexistent objects are sampled by one of three strategies: random, popular, and adversarial. Random sampling draws absent objects uniformly. Popular sampling takes the top-k most frequent objects in the dataset that are absent from the image. Adversarial sampling takes the top-k absent objects from a list sorted by co-occurrence with the image's ground-truth objects.
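
The three negative-sampling strategies can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `sample_negatives`, its arguments, and the tie-breaking rule (alphabetical) are hypothetical, and annotations are assumed to be plain lists of object names per image.

```python
import random
from collections import Counter


def sample_negatives(image_objects, all_annotations, k, strategy, seed=0):
    """Pick up to k objects that are NOT in the image, to ask about.

    image_objects: ground-truth objects of the probed image.
    all_annotations: list of object lists, one per dataset image.
    strategy: "random", "popular", or "adversarial".
    """
    present = set(image_objects)

    if strategy == "random":
        # Uniform draw over every absent object in the vocabulary.
        vocabulary = sorted({o for objs in all_annotations for o in objs} - present)
        return random.Random(seed).sample(vocabulary, min(k, len(vocabulary)))

    counts = Counter()
    if strategy == "popular":
        # Rank by how many images contain the object.
        for objs in all_annotations:
            counts.update(set(objs))
    elif strategy == "adversarial":
        # Rank by co-occurrence with the ground-truth objects of this image.
        for objs in all_annotations:
            objs = set(objs)
            shared = len(objs & present)
            for obj in objs - present:
                counts[obj] += shared
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Keep only absent objects with a positive score; break ties alphabetically.
    ranked = sorted(
        (o for o in counts if o not in present and counts[o] > 0),
        key=lambda o: (-counts[o], o),
    )
    return ranked[:k]


annotations = [
    ["dog", "frisbee", "person"],
    ["dog", "person", "leash"],
    ["cat", "sofa"],
    ["person", "car"],
    ["dog", "leash"],
]
popular = sample_negatives(["dog"], annotations, k=2, strategy="popular")
adversarial = sample_negatives(["dog"], annotations, k=2, strategy="adversarial")
```

Each sampled object would then be turned into a yes/no question whose correct answer is "no", paired with questions about objects that are actually present.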

This is slightly different from examining the generated caption. There’s a subtle difference between a model seeing the image and hallucinating an object directly, and a model that starts describing the image and then hallucinates an object based on the part of the caption it has already generated.