Image captioning is the task of generating natural language descriptions for images. To evaluate the quality of these captions, researchers need a reliable metric that tracks how well a caption matches its image. However, existing metrics have limitations: they typically compare a candidate caption against a small set of human-written reference captions using n-gram overlap, so they require costly references and can penalize correct captions that are simply phrased differently from the references.
To address this issue, the authors propose CLIPSCORE, a new evaluation metric for image captioning. CLIPSCORE is designed to align with human judgment and to reflect how faithfully a candidate caption describes its image. The authors evaluate CLIPSCORE and a range of existing caption metrics against human quality judgments on several benchmarks and find that scores produced by a strong pretrained vision-language model correlate with human ratings far better than reference-based n-gram metrics, suggesting that a modern, powerful visual encoder is the key ingredient.
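The comparison against human judgment comes down to rank correlation between metric scores and human ratings. Below is a minimal sketch under the assumption that per-caption human ratings and metric scores are available as parallel lists; the variable names and values are purely illustrative, not data from the paper.

```python
# Illustrative only: correlate a metric's scores with human judgments using
# Kendall's tau, the rank correlation commonly reported for caption metrics.
from scipy.stats import kendalltau

human_ratings = [4, 2, 5, 3, 1]                  # hypothetical human quality ratings
metric_scores = [0.71, 0.40, 0.83, 0.55, 0.22]   # hypothetical metric scores for the same captions

tau, p_value = kendalltau(human_ratings, metric_scores)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```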
Based on this analysis, and inspired by [20], the authors propose CLIPSCORE. CLIPSCORE builds on CLIP, a vision-language model pretrained on a large corpus of image-text pairs: the image and the candidate caption are each embedded with CLIP's encoders, and the caption is scored by the rescaled cosine similarity between the two embeddings, with no reference captions required. The authors show that CLIPSCORE achieves stronger agreement with human judgments than existing caption evaluation metrics.
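To make the scoring concrete, the sketch below computes CLIPSCORE for a single image-caption pair as w · max(cos(E_I, E_C), 0), where E_I and E_C are the CLIP image and caption embeddings and w = 2.5 as in the paper. The use of the Hugging Face `transformers` CLIP wrapper and the ViT-B/32 checkpoint here is an illustrative assumption, not the authors' reference implementation.

```python
# A minimal sketch of CLIPSCORE for one image-caption pair (assumed setup:
# Hugging Face transformers CLIP, ViT-B/32 checkpoint, rescaling weight w = 2.5).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clipscore(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """Return w * max(cos(image_embedding, caption_embedding), 0)."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings and take their cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return w * max(cos, 0.0)


# Usage (hypothetical file and caption):
# score = clipscore(Image.open("photo.jpg"), "a dog catching a frisbee")
```

Note that the caption is scored directly against the image, which is what makes the metric reference-free.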
In summary, CLIPSCORE is a reference-free evaluation metric for image captioning that aligns with human judgment. By scoring captions directly against the image through CLIP's joint image-text embedding space rather than against reference captions, CLIPSCORE provides a more faithful evaluation of image captioning models than existing metrics.