Factual image captioning using VLM and knowledge graphs
VLMs struggle to output factually correct image captions, they either generate partial facts or completely hallucinate. Increasing model size further worsens the ability. My experiments suggest that hallucination in VLMs is often linked to poor hierarchical knowledge because spatial understanding is a crucial factor in Visual understanding. I demonstrate how incorporating hierarchical and structured knowledge such as country → city → landmark help make vlm responses for factual image captioning a promising path to explore. I also systematically ablate different kinds of hierarchical knowledge augmentation and compare their results.