CLIP4IDC: CLIP for Image Difference Captioning

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Guo, Zixin [en_US]
dc.contributor.author: Wang, Tzu-Jui Julius [en_US]
dc.contributor.author: Laaksonen, Jorma [en_US]
dc.contributor.department: Department of Computer Science [en_US]
dc.contributor.department: Computer Science Lecturers [en_US]
dc.date.accessioned: 2023-08-11T07:20:48Z
dc.date.available: 2023-08-11T07:20:48Z
dc.date.issued: 2022-11 [en_US]
dc.description.abstract: Image Difference Captioning (IDC) aims at generating sentences to describe differences between two similar-looking images. Conventional approaches learn an IDC model with a pre-trained and usually frozen visual feature extractor. Accordingly, two major issues may arise: (1) a large domain gap usually exists between the pre-training datasets used for training such a visual encoder and that of the downstream IDC task, and (2) the visual feature extractor, when separately encoding two images, often does not effectively encode the visual changes between two images. Due to the excellent zero-shot performance of the recently proposed CLIP, we thus propose CLIP4IDC to transfer a CLIP model for the IDC task to address those issues. Different from directly fine-tuning CLIP to generate sentences, we introduce an adaptation training process to adapt CLIP’s visual encoder to capture and align differences in image pairs based on the textual descriptions. Experiments on three IDC benchmark datasets, CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the effectiveness of CLIP4IDC. [en]
dc.description.version: Peer reviewed [en]
dc.format.extent: 33-42
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Guo, Z, Wang, T-J J & Laaksonen, J 2022, CLIP4IDC: CLIP for Image Difference Captioning. in Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP). vol. 2, Association for Computational Linguistics, pp. 33-42, 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Virtual, Online, 20/11/2022. <https://aclanthology.org/2022.aacl-short.5> [en]
dc.identifier.isbn: 978-1-955917-64-3
dc.identifier.other: PURE UUID: 1e52a418-aeb7-4966-b87a-fbcbd54659cb [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/1e52a418-aeb7-4966-b87a-fbcbd54659cb [en_US]
dc.identifier.other: PURE LINK: https://aclanthology.org/2022.aacl-short.5 [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/118064229/SCI_Guo_etal_ACL_2022.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/122331
dc.identifier.urn: URN:NBN:fi:aalto-202308114680
dc.language.iso: en [en]
dc.relation.ispartof: 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing [en]
dc.relation.ispartofseries: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP) [en]
dc.relation.ispartofseries: Volume 2 [en]
dc.rights: openAccess [en]
dc.title: CLIP4IDC: CLIP for Image Difference Captioning [en]
dc.type: Conference article in proceedings [fi]
dc.type.version: publishedVersion
Files