PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
Access rights
openAccess
A4 Article in conference proceedings
This publication is imported from Aalto University research portal.
Date
2023-07-19
Language
en
Pages
5
2261-2265
Series
SIGIR 2023 - Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
Abstract
Vision-language (VL) pre-training (VLP) has been shown to generalize VL models well across a wide range of VL downstream tasks, especially cross-modal retrieval. However, it hinges on a huge amount of image-text pairs, whose curation is tedious and costly. In contrast, weakly-supervised VLP (W-VLP) [33] explores alternatives that use object tags generated from images by a pre-trained object detector (OD). Yet these still require paired information, i.e. images and object-level annotations, as supervision to train the OD. To further reduce the amount of supervision, we propose Prompts-in-The-Loop (PiTL), which prompts knowledge from large language models (LLMs) to describe images. Concretely, given the category label of an image, e.g. refinery, the knowledge extracted by LLMs, e.g. a refinery could be seen with large storage tanks, pipework, and..., is used as the language counterpart. The knowledge supplements, e.g., the common relations among the entities most likely to appear in a scene. With PiTL we create IN14K, a new VL dataset of 9M images and 1M descriptions covering 14K categories from ImageNet21K [8]. Empirically, VL models pre-trained with PiTL-generated pairs are strongly favored over other W-VLP works on image-to-text (I2T) and text-to-image (T2I) retrieval tasks, while requiring less supervision. The results reveal the effectiveness of PiTL-generated pairs for VLP.
Description
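The pair-generation idea the abstract describes can be sketched in a few lines: turn a bare category label into a knowledge-eliciting prompt, query a language model, and use the returned description as the image's text counterpart. This is a minimal illustrative sketch, not the paper's implementation; the prompt wording, the `query_llm` stub (which returns canned text so the sketch stays self-contained), and all function names are assumptions.

```python
# Hedged sketch of PiTL-style weak supervision: an (image, text) pair is
# built from a category label alone, with no human-written caption and no
# object detector. The prompt template and the stub LLM below are
# illustrative assumptions, not the authors' exact setup.

def build_prompt(category: str) -> str:
    """Turn a category label into a knowledge-eliciting prompt."""
    return f"Describe what a scene containing a {category} typically looks like."

def query_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g. an API request).

    Returns a canned description so the sketch runs without network access.
    """
    canned = {
        "refinery": ("A refinery could be seen with large storage tanks, "
                     "pipework, and smokestacks."),
    }
    for category, description in canned.items():
        if category in prompt:
            return description
    return "A generic scene."

def make_weak_pair(image_path: str, category: str) -> tuple[str, str]:
    """Create a weakly-supervised (image, text) pre-training pair."""
    return image_path, query_llm(build_prompt(category))

# Example: the only label-level supervision used is the category name.
pair = make_weak_pair("images/refinery/0001.jpg", "refinery")
```

Applied over the ~14K category labels of a dataset such as ImageNet21K, this loop would yield LLM-generated descriptions shared by all images of a category, which is how a small number of descriptions (1M) can cover a much larger number of images (9M).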
Funding Information: This work is supported by the Academy of Finland in project 345791. We acknowledge the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC and the LUMI consortium. Publisher Copyright: © 2023 Copyright held by the owner/author(s).
Keywords
Knowledge Prompting, Pre-training, Vision-language Retrieval
Citation
Guo, Z, Wang, T J J, Pehlivan, S, Radman, A & Laaksonen, J 2023, PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting. In SIGIR 2023 - Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp. 2261-2265, International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, Republic of China, 23/07/2023. https://doi.org/10.1145/3539618.3592038