The article argues for a new dataset for text-based person retrieval, the task of matching a person in an image with the corresponding textual description. The authors note that most existing datasets are limited: they offer too little training data and rarely provide multiple descriptions per image. To address these issues, they propose PRW-TPS-CN, a dataset of 11,776 scene images, of which 8,656 are described, with 32,206 bounding boxes covering 933 valid IDs. It contains 47,102 sentences totaling about 2 million characters, including 3,432 Chinese phrases and 1,300 unique Chinese characters. The longest sentence has 107 characters, and the average sentence length is 42.5 characters. The authors also translate the sentences into English, where the average length is 28.79 words. Because its descriptions are informative and bilingual, PRW-TPS-CN allows methods to be evaluated more comprehensively.
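As a quick sanity check on the figures above, the sentence count multiplied by the average sentence length should land near the stated total of about 2 million characters. The short Python sketch below does that arithmetic using only the numbers quoted in this summary. The variable names are illustrative, and the sentences-per-image ratio is a rough figure that assumes every sentence belongs to one of the described images, which the summary does not confirm.

```python
# Figures quoted in the summary of PRW-TPS-CN.
num_sentences = 47_102
avg_chars_per_sentence = 42.5
num_described_images = 8_656

# Estimated total characters: number of sentences times average length.
estimated_total_chars = num_sentences * avg_chars_per_sentence

# Rough ratio only. Assumption: every sentence describes one of the
# described images; the summary does not state this explicitly.
sentences_per_described_image = num_sentences / num_described_images

print(f"Estimated total characters: {estimated_total_chars:,.0f}")
print(f"Sentences per described image: {sentences_per_described_image:.2f}")
```

The estimate comes out just over 2 million characters, in line with the stated total, and the ratio suggests several sentences per described image on average.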
The article emphasizes that training a robust network for text-based person retrieval requires a dataset with multiple descriptions per image, because several descriptions of the same image let the model learn a more comprehensive representation of it. The authors expect the proposed dataset to facilitate the development of more accurate and robust retrieval methods. They also note that their approach can be applied to other matching tasks, such as image-to-image retrieval and cross-lingual language understanding.
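To make the retrieval setting concrete, the following is a minimal sketch of how a text query can be ranked against a gallery of images and scored with Rank-1 accuracy when an image has more than one description. It is not the authors' method: the embedding vectors are hand-written toy values, the image ids are hypothetical, and cosine similarity is assumed here as the matching score purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_gallery(text_vec, gallery):
    """Return image ids sorted from most to least similar to the text vector."""
    return sorted(
        gallery,
        key=lambda img_id: cosine(text_vec, gallery[img_id]),
        reverse=True,
    )

def rank1_accuracy(queries, gallery):
    """Fraction of text queries whose top-ranked image is the correct one."""
    hits = sum(
        1 for text_vec, true_id in queries
        if rank_gallery(text_vec, gallery)[0] == true_id
    )
    return hits / len(queries)

# Toy image embeddings (hypothetical values, not taken from the dataset).
gallery = {
    "img_a": [1.0, 0.0, 0.2],
    "img_b": [0.0, 1.0, 0.1],
}

# Each description is its own query; img_a has two descriptions.
queries = [
    ([0.9, 0.1, 0.3], "img_a"),
    ([0.8, 0.0, 0.1], "img_a"),
    ([0.1, 0.9, 0.0], "img_b"),
]

rank1 = rank1_accuracy(queries, gallery)
print(f"Rank-1 accuracy: {rank1:.2f}")
```

With several descriptions per image, each description contributes a separate query, so an evaluation of this kind checks whether a model matches the image under different wordings rather than under a single caption.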
Computer Science, Computer Vision and Pattern Recognition