Our proposed method, multi-change captioning transformers, addresses this limitation by identifying multiple changes in an image pair and describing them in natural language. The approach is built on a transformer architecture that incorporates a new attention mechanism to densely correlate regions across the two images and dynamically localize the regions associated with each change.
Multi-Change Captioning Transformers: Our proposed method consists of two main components: 1) a dense correlation module, which correlates different regions of the two images, and 2) a dynamic word insertion module, which grounds each generated word in its corresponding change region. The dense correlation module captures spatial and temporal dependencies between regions of the two images: a convolutional neural network (CNN) is applied to each image, and the resulting feature maps are concatenated into a single feature representation for the image pair. The dynamic word insertion module then uses a transformer architecture to generate the sentences describing the detected changes.
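The paper does not provide code, so the following is a minimal PyTorch sketch of the dense correlation idea under stated assumptions: the ResNet-18 backbone, the 512-dimensional features, and the use of cross-attention as the correlation operator are illustrative choices, not the paper's exact design.

```python
# Hedged sketch of a dense correlation module: extract CNN features from each
# image, correlate every region of one image with every region of the other,
# and concatenate into a single pair representation. Illustrative only.
import torch
import torch.nn as nn
import torchvision.models as models

class DenseCorrelation(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Keep the convolutional trunk only; drop the pooling/classifier head.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        # Cross-attention lets every region of one image attend to every
        # region of the other, densely correlating the pair.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads,
                                                batch_first=True)

    def forward(self, img_before, img_after):
        # (B, C, H, W) -> (B, H*W, C): one token per spatial region.
        f1 = self.cnn(img_before).flatten(2).transpose(1, 2)
        f2 = self.cnn(img_after).flatten(2).transpose(1, 2)
        # Regions of the "after" image query regions of the "before" image.
        corr, _ = self.cross_attn(query=f2, key=f1, value=f1)
        # Concatenate raw and correlated features into one pair representation.
        return torch.cat([f2, corr], dim=-1)  # (B, H*W, 2 * feat_dim)
```

Using attention rather than a raw correlation matrix keeps the output differentiable and fixed-size, so the pair representation can feed directly into a transformer decoder for sentence generation.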
To improve the accuracy of change detection, we propose a new attention mechanism called multi-change attention. This mechanism allows the model to focus on different regions of the images according to their relevance to the change currently being described. We also introduce an additional classification branch that predicts the type of change that occurred in each region, which improves the interpretability of the model.
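A hedged sketch of how multi-change attention and the auxiliary change-type branch could fit together is shown below. The layer sizes, the decoder-state query, and the set of change types are assumptions for illustration; `feat_dim=1024` matches the output of the correlation sketch above.

```python
# Illustrative multi-change attention: score each region's relevance to the
# decoder's current state, pool a context vector, and classify the change
# type per region as an auxiliary, interpretable output. Not the paper's
# exact formulation.
import torch
import torch.nn as nn

class MultiChangeAttention(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512, num_change_types=4):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, feat_dim)
        # Auxiliary branch: per-region change-type prediction
        # (e.g. add / remove / move / replace -- assumed label set).
        self.change_cls = nn.Linear(feat_dim, num_change_types)

    def forward(self, region_feats, decoder_state):
        # region_feats: (B, N, feat_dim) from the dense correlation module.
        # decoder_state: (B, hidden_dim), the decoder's current hidden state.
        q = self.query_proj(decoder_state).unsqueeze(1)   # (B, 1, feat_dim)
        scores = (q * region_feats).sum(-1)               # (B, N) relevance
        attn = scores.softmax(dim=-1)                     # focus per change
        context = torch.einsum('bn,bnd->bd', attn, region_feats)
        change_logits = self.change_cls(region_feats)     # (B, N, num_types)
        return context, attn, change_logits
```

Conditioning the attention query on the decoder state is what makes the focus dynamic: as the sentence progresses to a new change, the relevance scores shift to the regions involved in that change.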
Experiments: We evaluate our proposed method on several publicly available datasets and compare it against state-of-the-art methods. The results show that our method outperforms existing methods in both accuracy and efficiency. We also conduct a series of ablation studies to quantify the contribution of each component of our method.
Conclusion: In this paper, we propose a novel approach to change captioning called multi-change captioning transformers, which identifies multiple changes in an image pair. Our method improves the accuracy and efficiency of change captioning in remote sensing and street-view scenes, where changes can occur in various forms and locations. By densely correlating regions across the two images and dynamically grounding generated words in the relevant change regions, our approach provides a more comprehensive and accurate description of the changes between images.