Removing Distributional Discrepancies in Captions Improves Image-Text Alignment


University of Wisconsin-Madison
Adobe Research

Abstract

In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current vision-language models. Our approach focuses on generating high-quality training datasets for the alignment task by producing mixed-type negative captions derived from positive ones. Critically, we address the distributional discrepancy between positive and negative captions so that the alignment model cannot rely on textual cues alone and must consult the associated image to predict alignment accurately. Using this enhanced training data, we fine-tune an existing leading vision-language model to boost its alignment-understanding capability. Our model significantly outperforms current top-performing methods across various datasets. We also demonstrate its applicability by using it to rank images generated by text-to-image models according to their alignment with the input text.

Text Distributional Gap

The most common approach to training image-text alignment models is to create hard-negative captions. However, we find that hard-negative captions generated by LLMs (e.g., GPT) can introduce distributional biases. For instance, when applied to COCO captions, GPT often replaces "airplane" with "boat" or "giraffe" with "elephant." As a result, positive and negative captions become distinguishable from the text alone, which can cause the alignment model to ignore the image and fail to learn true alignment information.
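Below is a minimal sketch of how such a text-only leak can be quantified: train a classifier on the captions alone and check whether it separates positives from negatives. The TF-IDF + logistic-regression probe and the function name text_only_gap are illustrative assumptions, not the exact implementation used in our pipeline.

  # Sketch: measure whether positive vs. negative captions are separable from text alone.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  def text_only_gap(positive_captions, negative_captions):
      """Cross-validated accuracy of a caption-only classifier; ~0.5 means no leak."""
      texts = list(positive_captions) + list(negative_captions)
      labels = [1] * len(positive_captions) + [0] * len(negative_captions)
      # Bag-of-words features over captions only; no image information is used.
      features = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
      probe = LogisticRegression(max_iter=1000)
      # Accuracy well above chance indicates a distributional gap that the
      # alignment model could exploit without ever looking at the image.
      return cross_val_score(probe, features, labels, cv=5).mean()

A high score on LLM-generated negatives (e.g., the systematic "airplane" to "boat" substitutions above) signals exactly the bias our filtering step is meant to remove.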

Approach

  • Use 'swap' and 'replace' strategies to create hard-negative captions (a toy sketch follows this list).
  • Employ a text-only classifier to identify and remove easy positive and negative captions (those that can be distinguished solely by text information).
  • Fine-tune a vision-language model (e.g., LLaVA) to enhance image-text alignment capabilities.
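A toy illustration of the two negative-caption strategies. Our actual pipeline prompts an LLM rather than editing tokens by hand, so the word pool, index arguments, and function names below are hypothetical and serve only to show the two negative types.

  # Toy 'swap' and 'replace' negatives for a single caption.
  import random

  REPLACEMENT_POOL = ["boat", "elephant", "dog", "car", "bench"]  # hypothetical pool

  def swap_negative(caption, i, j):
      """Swap the words at positions i and j, breaking the described relation."""
      tokens = caption.split()
      tokens[i], tokens[j] = tokens[j], tokens[i]
      return " ".join(tokens)

  def replace_negative(caption, i, pool=REPLACEMENT_POOL):
      """Replace the word at position i with a different object word."""
      tokens = caption.split()
      tokens[i] = random.choice([w for w in pool if w != tokens[i]])
      return " ".join(tokens)

  caption = "a giraffe standing next to an airplane"
  print(swap_negative(caption, 1, 6))    # "a airplane standing next to an giraffe"
  print(replace_negative(caption, 1))    # e.g., "a elephant standing next to an airplane"

Easy negatives produced this way, and the corresponding easy positives, are then removed using the text-only classifier sketched above, so the remaining pairs force the fine-tuned model to consult the image.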

BibTeX


@article{li2024llavascore,
  author      = {Yuheng Li and Haotian Liu and Mu Cai and Yijun Li and Eli Shechtman and Zhe Lin and Yong Jae Lee and Krishna Kumar Singh},
  title       = {Removing Distributional Discrepancies in Captions Improves Image-Text Alignment},
  journal     = {arXiv preprint arXiv:2410.00905},
  year        = {2024},
}

Acknowledgement

This website is adapted from GLIGEN, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.