Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

dc.contributorAalto Universityen
dc.contributor.authorWang, Tzu-Jui Juliusen_US
dc.contributor.authorLaaksonen, Jormaen_US
dc.contributor.authorLanger, Tomasen_US
dc.contributor.authorArponen, Heikkien_US
dc.contributor.authorBishop, Tomen_US
dc.contributor.departmentDepartment of Computer Scienceen_US
dc.contributor.departmentComputer Science Lecturersen_US
dc.contributor.departmentIntuition Machines Inc.en_US
dc.contributor.departmentSystematic Alphaen_US
dc.contributor.departmentGlass Imaging Inc.en_US
dc.description.abstractWeakly-supervised vision-language (V-L) pre-training (W-VLP) aims at learning cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, achieve performance comparable to some VLP models trained with aligned pairs on various V-L downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by object tags of limited semantics. We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model, not requiring images paired with captions. WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities. Empirically, WFH consistently boosts prior W-VLP works, e.g. U-VisualBERT (U-VB), over a variety of V-L tasks, such as XMR and Visual Question Answering. Notably, benchmarked with recall@{1,5,10}, it consistently improves over U-VB on both image-to-text and text-to-image retrieval on two popular datasets, Flickr30K and MSCOCO. Meanwhile, it gains by at least 14.5% in cross-dataset generalization tests on these XMR tasks. Moreover, in the other V-L downstream tasks considered, our WFH models are on par with models trained with paired V-L data, revealing the utility of unpaired data. These results demonstrate the greater generalization of the proposed W-VLP model with WFH.en
dc.description.versionPeer revieweden
dc.identifier.citationWang, T-J J, Laaksonen, J, Langer, T, Arponen, H & Bishop, T 2023, Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision. in Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023. IEEE Winter Conference on Applications of Computer Vision, IEEE, pp. 1073-1083, IEEE Winter Conference on Applications of Computer Vision, Waikoloa, Hawaii, United States, 02/01/2023. https://doi.org/10.1109/WACV56688.2023.00113en
dc.identifier.otherPURE UUID: 09aebd2e-0ab8-4d5e-ae5c-65eb272759c9en_US
dc.identifier.otherPURE ITEMURL: https://research.aalto.fi/en/publications/09aebd2e-0ab8-4d5e-ae5c-65eb272759c9en_US
dc.identifier.otherPURE LINK: http://www.scopus.com/inward/record.url?scp=85149034358&partnerID=8YFLogxKen_US
dc.identifier.otherPURE FILEURL: https://research.aalto.fi/files/119390370/SCI_Wang_etal_WACV_2023.pdfen_US
dc.relation.ispartofIEEE Winter Conference on Applications of Computer Visionen
dc.relation.ispartofseriesProceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023en
dc.titleLearning by Hallucinating: Vision-Language Pre-training with Weak Supervisionen
dc.typeConference article in proceedingsfi