- Image-to-Multi-Modal-Retrieval (IMMR) is a task where users search for information by uploading images, and the algorithm returns similar images and textual descriptions.
- The proposed approach treats images as queries and uses a combination of image features and textual descriptions to retrieve relevant results.
- The proposed method has two key components: a concept extraction step and a fusion module.
- Concept extraction multiplies the input text feature by an external key unit to obtain a normalized weight vector, which is then used to compute the final concept vector.
- The fusion module combines the image features with the concept vector via a weighted sum to produce the output image features; a sketch of both steps follows this list.
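The exact architecture is not spelled out here, so the following is a minimal PyTorch sketch of how the two steps might look, assuming the external key unit is a learnable matrix of concept keys paired with concept value embeddings, softmax normalization for the weight vector, and a fixed scalar weight for the fusion; names like `ConceptExtractor`, `num_concepts`, and `alpha` are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptExtractor(nn.Module):
    """Sketch of concept extraction: text feature -> normalized weights over
    an external key unit -> weighted sum of concept embeddings (assumed)."""

    def __init__(self, dim: int, num_concepts: int):
        super().__init__()
        # External key unit: one learnable key per concept (assumption).
        self.keys = nn.Parameter(torch.randn(num_concepts, dim))
        # Concept embeddings mixed by the normalized weights (assumption).
        self.values = nn.Parameter(torch.randn(num_concepts, dim))

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        # (batch, dim) @ (dim, num_concepts) -> similarity score per concept
        scores = text_feat @ self.keys.t()
        # Normalize the scores into a weight vector (softmax assumed here).
        weights = F.softmax(scores, dim=-1)
        # Final concept vector: weighted sum of the concept embeddings.
        return weights @ self.values


def fuse(image_feat: torch.Tensor, concept_vec: torch.Tensor,
         alpha: float = 0.5) -> torch.Tensor:
    """Sketch of the fusion module: weighted sum of image features and the
    concept vector (a fixed scalar weight is assumed; it could be learned)."""
    return alpha * image_feat + (1.0 - alpha) * concept_vec


# Toy usage with random features
extractor = ConceptExtractor(dim=256, num_concepts=32)
text_feat = torch.randn(4, 256)   # e.g. features of 4 textual descriptions
image_feat = torch.randn(4, 256)  # matching image features
output = fuse(image_feat, extractor(text_feat))
print(output.shape)  # torch.Size([4, 256])
```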
Analogy: Imagine you are searching for a specific recipe on a cooking website. You can search by keywords (a textual description) or by uploading a photo of the dish you want to make (an image query). The site should return recipes similar to what you searched for, along with textual descriptions of those recipes. In this analogy, the uploaded image plays the role of the query, and the returned recipes with their descriptions play the role of the multi-modal results.