Vision-language models (VLMs) are advanced AI models that combine the abilities of vision and language models to handle both images and text [6]. They are used for tasks such as chatting about images, image recognition via instructions, visual question answering, and providing visual assistance for visually impaired individuals [2]. By jointly processing the image and text modalities, VLMs have the potential to transform a wide range of applications.
Multi-object scenarios often lead vision-language models (VLMs) to object hallucination, where a model describes objects that are not present in the image. The problem is especially pronounced when a model must recognize several objects in the same image simultaneously. The Recognition-based Object Probing Evaluation (ROPE) protocol was introduced to assess how models handle these scenarios, focusing on factors such as object class distribution and visual prompts. Large VLMs tend to hallucinate more frequently when attending to multiple objects at once, underscoring the need for more balanced datasets and improved training protocols to mitigate the issue.
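To make the notion of multi-object hallucination concrete, the sketch below shows one way the hallucination rate for a single image could be computed: the object classes a model names when probed about several objects are compared against the annotated ground-truth classes. The function name and exact scoring rule are illustrative assumptions, not the ROPE reference implementation.

```python
from typing import Iterable


def multi_object_hallucination_rate(predicted_objects: Iterable[str],
                                    ground_truth_objects: Iterable[str]) -> float:
    """Fraction of predicted object classes that are absent from the image.

    `predicted_objects` are the classes the model claims to see when asked
    about several objects at once; `ground_truth_objects` are the annotated
    classes actually present. Higher values mean more hallucination.
    """
    predicted = [p.strip().lower() for p in predicted_objects]
    truth = {g.strip().lower() for g in ground_truth_objects}
    if not predicted:
        return 0.0
    hallucinated = sum(1 for p in predicted if p not in truth)
    return hallucinated / len(predicted)


# Example: the model names a "dog" that is not in the image.
rate = multi_object_hallucination_rate(
    ["cat", "dog", "chair"],          # model's answers
    ["cat", "chair", "table"],        # ground-truth objects
)
print(f"hallucination rate: {rate:.2f}")  # -> 0.33
```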
The ROPE (Recognition-based Object Probing Evaluation) protocol is a comprehensive framework for assessing how vision-language models handle multi-object scenarios. It categorizes test cases into four subsets (In-the-Wild, Homogeneous, Heterogeneous, and Adversarial) according to the distribution of object classes within each image, and it uses visual prompts to indicate which objects are being queried, enabling multi-object hallucination in large VLMs to be evaluated and quantified.
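The following sketch illustrates how a test case might be assigned to one of the four ROPE subsets based on the queried object classes. The subset names come from the protocol description above, but the exact assignment rules and flags shown here are assumptions for illustration only.

```python
from collections import Counter


def assign_rope_subset(queried_classes: list[str],
                       in_the_wild: bool = False,
                       adversarial: bool = False) -> str:
    """Assign a multi-object test case to a ROPE-style subset.

    Assumed rules, for illustration only:
      - In-the-Wild: queries follow the natural, unbalanced class
        distribution of the source dataset.
      - Homogeneous: all queried objects share a single class.
      - Heterogeneous: every queried object has a distinct class.
      - Adversarial: queries constructed/ordered to encourage hallucination.
    """
    if adversarial:
        return "Adversarial"
    if in_the_wild:
        return "In-the-Wild"
    counts = Counter(queried_classes)
    if len(counts) == 1:
        return "Homogeneous"
    if all(c == 1 for c in counts.values()):
        return "Heterogeneous"
    return "In-the-Wild"  # mixed distributions fall back to the natural split


print(assign_rope_subset(["cup", "cup", "cup"]))     # Homogeneous
print(assign_rope_subset(["cup", "dog", "laptop"]))  # Heterogeneous
```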