Post-Attention Modulator for Dense Video Captioning

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Guo, Zixin [en_US]
dc.contributor.author: Wang, Tzu-Jui Julius [en_US]
dc.contributor.author: Laaksonen, Jorma [en_US]
dc.contributor.department: Department of Computer Science [en_US]
dc.contributor.department: Computer Science Lecturers [en_US]
dc.date.accessioned: 2023-08-30T04:20:38Z
dc.date.available: 2023-08-30T04:20:38Z
dc.date.issued: 2022 [en_US]
dc.description.abstract: Dense video captioning (VC) aims at generating a paragraph-long description for the events in video segments. Borrowing from their success in language modeling, Transformer-based models for VC have also been shown effective in modeling cross-domain video-text representations with cross-attention (Xatt). Despite Xatt's effectiveness, the queries and outputs of attention, which come from different domains, tend to be weakly related. In this paper, we argue that this weak relatedness, or domain discrepancy, can impede a model from learning meaningful cross-domain representations. Hence, we propose a simple yet effective Post-Attention Modulator (PAM) that post-processes Xatt's outputs to narrow the discrepancy. Specifically, PAM modulates and enhances the average similarity over Xatt's queries and outputs. The modulated similarities are then used as a weighting basis to interpolate PAM's outputs. In our experiments, PAM was applied to two strong VC baselines, VTransformer and MART, with two different video features on the well-known VC benchmark datasets ActivityNet Captions and YouCookII. According to the results, the proposed PAM brings consistent improvements, e.g., of up to 14.5% in CIDEr-D, as well as in the other metrics considered, BLEU and METEOR. [en]
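For readers of this record, a minimal PyTorch-style sketch of the mechanism described in the abstract may be helpful. It assumes a per-position cosine similarity between the queries and Xatt outputs, a learnable sigmoid gate, and the class and parameter names shown; none of these details are taken from the paper itself, so this is an illustration of the idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PostAttentionModulator(nn.Module):
    """Hypothetical sketch: re-weight cross-attention (Xatt) outputs by their
    similarity to the queries, then use that modulated similarity to
    interpolate the final output. Names and gating are illustrative choices,
    not the paper's exact formulation."""

    def __init__(self):
        super().__init__()
        # assumed learnable modulation of the query-output similarity
        self.scale = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, queries: torch.Tensor, xatt_out: torch.Tensor) -> torch.Tensor:
        # queries, xatt_out: (batch, seq_len, dim), position-aligned
        # 1) similarity between each query and its cross-attention output
        sim = F.cosine_similarity(queries, xatt_out, dim=-1)            # (batch, seq_len)
        # 2) modulate the similarity into an interpolation weight in (0, 1)
        gate = torch.sigmoid(self.scale * sim + self.bias).unsqueeze(-1)
        # 3) interpolate between the Xatt output and the original query
        return gate * xatt_out + (1.0 - gate) * queries


# Toy usage: post-process the output of a cross-attention layer.
pam = PostAttentionModulator()
q = torch.randn(2, 8, 512)   # text-side queries
x = torch.randn(2, 8, 512)   # cross-attention outputs over video features
out = pam(q, x)              # (2, 8, 512)
```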
dc.description.version: Peer reviewed [en]
dc.format.extent: 1536-1542
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Guo, Z, Wang, T-J J & Laaksonen, J 2022, Post-Attention Modulator for Dense Video Captioning. in Proceedings of the 26th International Conference on Pattern Recognition (ICPR). International Conference on Pattern Recognition, IEEE, pp. 1536-1542, International Conference on Pattern Recognition, Montreal, Quebec, Canada, 21/08/2022. https://doi.org/10.1109/ICPR56361.2022.9956260 [en]
dc.identifier.doi: 10.1109/ICPR56361.2022.9956260 [en_US]
dc.identifier.isbn: 978-1-6654-9062-7
dc.identifier.issn: 1051-4651
dc.identifier.other: PURE UUID: 5c6011ad-3660-49c7-a70e-e1e5219fe912 [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/5c6011ad-3660-49c7-a70e-e1e5219fe912 [en_US]
dc.identifier.other: PURE LINK: http://www.scopus.com/inward/record.url?scp=85143637293&partnerID=8YFLogxK [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/119390260/SCI_Guo_etal_ICPR_PAM_2022.pdf [en_US]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/122975
dc.identifier.urn: URN:NBN:fi:aalto-202308305315
dc.language.iso: en [en]
dc.relation.ispartof: International Conference on Pattern Recognition [en]
dc.relation.ispartofseries: Proceedings of the 26th International Conference on Pattern Recognition (ICPR) [en]
dc.relation.ispartofseries: International Conference on Pattern Recognition [en]
dc.rights: openAccess [en]
dc.title: Post-Attention Modulator for Dense Video Captioning [en]
dc.type: Conference article in proceedings [fi]
dc.type.version: acceptedVersion