- Image-to-Multi-Modal-Retrieval (IMMR) is a task where users search for information by uploading images, and the algorithm returns similar images and textual descriptions.
- The proposed approach treats images as queries and uses a combination of image features and textual descriptions to retrieve relevant results.
- The proposed method has two key components: a concept extraction step and a fusion module.
- Concept extraction multiplies the input text feature by an external key unit to obtain a normalized weight vector, which is then used to compute the final concept vector.
- The fusion module combines the image features with the concept vector via a weighted sum to produce the output image features; a sketch of both steps follows this list.
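The exact architecture is not spelled out here, so the following is a minimal PyTorch sketch of how the two steps might look, assuming the external key unit is a learnable matrix of concept keys paired with concept value embeddings, softmax normalization for the weight vector, and a fixed scalar weight for the fusion; names like `ConceptExtractor`, `num_concepts`, and `alpha` are illustrative, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptExtractor(nn.Module):
    """Sketch of concept extraction: text feature -> normalized weights over
    an external key unit -> weighted sum of concept embeddings (assumed)."""

    def __init__(self, dim: int, num_concepts: int):
        super().__init__()
        # External key unit: one learnable key per concept (assumption).
        self.keys = nn.Parameter(torch.randn(num_concepts, dim))
        # Concept embeddings mixed by the normalized weights (assumption).
        self.values = nn.Parameter(torch.randn(num_concepts, dim))

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        # (batch, dim) @ (dim, num_concepts) -> similarity score per concept
        scores = text_feat @ self.keys.t()
        # Normalize the scores into a weight vector (softmax assumed here).
        weights = F.softmax(scores, dim=-1)
        # Final concept vector: weighted sum of the concept embeddings.
        return weights @ self.values


def fuse(image_feat: torch.Tensor, concept_vec: torch.Tensor,
         alpha: float = 0.5) -> torch.Tensor:
    """Sketch of the fusion module: weighted sum of image features and the
    concept vector (a fixed scalar weight is assumed; it could be learned)."""
    return alpha * image_feat + (1.0 - alpha) * concept_vec


# Toy usage with random features
extractor = ConceptExtractor(dim=256, num_concepts=32)
text_feat = torch.randn(4, 256)   # e.g. features of 4 textual descriptions
image_feat = torch.randn(4, 256)  # matching image features
output = fuse(image_feat, extractor(text_feat))
print(output.shape)  # torch.Size([4, 256])
```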
Analogy: Imagine you are searching for a specific recipe on a cooking website. You can search by keywords (a textual description) or by uploading a photo of the dish you want to make (an image query). The site should return recipes similar to what you searched for, along with textual descriptions of those recipes. In this analogy, the uploaded image plays the role of the query, and the returned recipes with their descriptions play the role of the multi-modal results.