Evaluating Object Hallucination in Large Vision-Language Models

Hallucination, LVLM hallucination

The paper uses the CHAIR metric (Caption Hallucination Assessment with Image Relevance) from Rohrbach2018object to measure object hallucination in generated captions.
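As a reminder of what CHAIR measures, here is a minimal sketch of its two variants: the per-object rate (hallucinated objects over all mentioned objects) and the per-caption rate (captions with at least one hallucinated object over all captions). The function name `chair_scores` and the input format are assumptions made for illustration. Extracting object mentions from raw caption text, and mapping synonyms to categories, is assumed to have been done already.

```python
def chair_scores(mentioned_per_caption, truth_per_image):
    """Return (per-object rate, per-caption rate) for a batch of captions.

    mentioned_per_caption: list of sets, objects named in each caption
    truth_per_image: list of sets, objects actually present in each image
    Both inputs are hypothetical, pre-extracted object sets.
    """
    total_mentioned = 0
    total_hallucinated = 0
    captions_with_hallucination = 0

    for mentioned, truth in zip(mentioned_per_caption, truth_per_image):
        hallucinated = mentioned - truth  # named in the caption, absent from the image
        total_mentioned += len(mentioned)
        total_hallucinated += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    n_captions = len(mentioned_per_caption)
    per_object = total_hallucinated / total_mentioned if total_mentioned else 0.0
    per_caption = captions_with_hallucination / n_captions if n_captions else 0.0
    return per_object, per_caption


if __name__ == "__main__":
    mentioned = [{"dog", "frisbee"}, {"person", "car"}]
    truth = [{"dog"}, {"person", "car"}]
    print(chair_scores(mentioned, truth))
```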

Proposed the Polling-based Object Probing Evaluation (POPE) method. The idea is to probe LVLMs with yes-no questions about objects in the image, where nonexistent objects are sampled by one of three strategies: random, popular, and adversarial. Popular sampling takes the top-k most frequent objects in the dataset that are absent from the image; adversarial sampling takes the top-k absent objects from a list sorted by co-occurrence with the ground-truth objects.
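The three sampling strategies can be sketched as follows. This is an illustrative reconstruction from the description above, not the paper's code. The helper names (`build_stats`, `sample_negatives`, `build_poll`), the annotation format, the question wording, and the choice to score adversarial candidates by summing co-occurrence counts over all ground-truth objects are assumptions.

```python
import random
from collections import Counter
from itertools import permutations


def build_stats(annotations):
    """annotations: dict mapping image id -> set of object names (assumed format)."""
    freq = Counter()  # number of images each object appears in
    cooc = Counter()  # number of images each ordered pair appears in together
    for objects in annotations.values():
        freq.update(objects)
        cooc.update(permutations(objects, 2))
    return freq, cooc


def sample_negatives(present, freq, cooc, k, strategy, rng=None):
    """Pick up to k objects that are NOT in the image, using one strategy."""
    rng = rng or random.Random(0)
    absent = sorted(obj for obj in freq if obj not in present)
    if strategy == "random":
        return rng.sample(absent, min(k, len(absent)))
    if strategy == "popular":
        # Most frequent objects in the whole dataset.
        score = lambda obj: freq[obj]
    elif strategy == "adversarial":
        # Objects that most often co-occur with what is actually in the image.
        score = lambda obj: sum(cooc[(gt, obj)] for gt in present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(absent, key=lambda obj: (-score(obj), obj))[:k]


def build_poll(present, negatives):
    """Turn objects into (question, expected answer) pairs."""
    template = "Is there a {} in the image?"  # assumed wording
    poll = [(template.format(obj), "yes") for obj in sorted(present)]
    poll += [(template.format(obj), "no") for obj in negatives]
    return poll


if __name__ == "__main__":
    annotations = {
        1: {"person", "dog", "frisbee"},
        2: {"person", "car"},
        3: {"person", "dog"},
        4: {"table", "chair"},
        5: {"dog", "frisbee"},
    }
    freq, cooc = build_stats(annotations)
    negatives = sample_negatives({"person"}, freq, cooc, 2, "adversarial")
    for question, answer in build_poll({"person"}, negatives):
        print(question, answer)
```

Sampling as many negatives as there are ground-truth objects keeps the expected yes and no answers balanced, so a model that always answers "yes" scores no better than chance on accuracy.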

This is slightly different from examining the generated caption. There's a subtle difference between hallucinating an object upon seeing the image and hallucinating one partway through a description, conditioned on the part of the caption already generated.