Deep Visual Understanding and Beyond - Saliency, Uncertainty, and Bridges to Natural Language

dc.contributorAalto-yliopistofi
dc.contributorAalto Universityen
dc.contributor.advisorLaaksonen, Jorma, Senior University Lecturer, D.Sc. (Tech.), Aalto University, Department of Computer Science, Finland
dc.contributor.authorWang, Tzu-Jui Julius
dc.contributor.departmentTietotekniikan laitosfi
dc.contributor.departmentDepartment of Computer Scienceen
dc.contributor.labContent-Based Image and Information Retrieval Groupen
dc.contributor.schoolPerustieteiden korkeakoulufi
dc.contributor.schoolSchool of Scienceen
dc.contributor.supervisorKaski, Samuel, Prof., Aalto University, Department of Computer Science, Finland
dc.date.accessioned2024-03-28T10:00:16Z
dc.date.available2024-03-28T10:00:16Z
dc.date.defence2024-04-12
dc.date.issued2024
dc.description.abstractVisual understanding concerns the extent to which a cognitive system can reason about its visual surroundings before it reacts accordingly. While visual understanding is considered crucial, what goes beyond it is the capability of multi-modal reasoning, which also involves other modalities. That is, a cognitive system often faces a daunting task: how to capitalize on inputs, usually from one or more modalities, to adapt itself to a world of multiple modalities. More importantly, different machine learning paradigms may be exploited to learn both uni-modal and multi-modal reasoning tasks. This defines the main research question that initiated the research presented in this thesis. In response to this core research question, the work provides a number of methods empowered by different machine learning paradigms for both uni-modal and multi-modal contexts. More concretely, it is shown that one can estimate visual saliency, one of the most crucial fundamentals of visual understanding, with visual cues learned in an unsupervised fashion. The semi-supervised learning principle is found to be effective in combating class-imbalance issues in scene graph generation, which aims at discovering relationships among visual objects in an image. Moreover, to overcome the primary drawback of vision-language (VL) pre-training and other VL applications, which conventionally necessitate annotated image-text pairs, a novel weakly supervised approach is introduced. In addition, several enhancements have been made to supervised learning applications. Firstly, an improved dense image captioning model is proposed to better exploit different types of relationships between visual objects in an image. Secondly, an enhanced video captioning model is proposed to alleviate the impact of the modality gap, which is commonly found in the widely adopted Transformer models.
Lastly, an uncertainty-aware classification model is proposed to learn more robustly under noisy supervision when accounting for data and model uncertainties. These results suggest the usefulness and wide applicability of different learning paradigms. In terms of model robustness, several breakthroughs have been made and elaborated for both uni-modal and multi-modal applications. The research outcomes encompass numerous findings related to computer vision techniques and their bridges to natural language. The thesis concludes with a discussion of the limitations of each published work and potential future endeavours in both uni-modal and multi-modal research.en
dc.format.extent117 + app. 83
dc.format.mimetypeapplication/pdfen
dc.identifier.isbn978-952-64-1742-4 (electronic)
dc.identifier.isbn978-952-64-1741-7 (printed)
dc.identifier.issn1799-4942 (electronic)
dc.identifier.issn1799-4934 (printed)
dc.identifier.issn1799-4934 (ISSN-L)
dc.identifier.urihttps://aaltodoc.aalto.fi/handle/123456789/127310
dc.identifier.urnURN:ISBN:978-952-64-1742-4
dc.language.isoenen
dc.opnRahtu, Esa Mikael, Prof., Tampere University, Finland
dc.publisherAalto Universityen
dc.publisherAalto-yliopistofi
dc.relation.haspart[Publication 1]: Tzu-Jui Julius Wang, Hamed Rezazadegan Tavakoli, and Jorma Laaksonen. Fixation Prediction in Videos Using Unsupervised Hierarchical Features. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017. DOI: 10.1109/CVPRW.2017.276
dc.relation.haspart[Publication 2]: Tzu-Jui Julius Wang, Jorma Laaksonen, Yi-Ping Liao, Bo-Zong Wu, and Shih-Yun Shen. A Multi-Task Bayesian Deep Neural Net for Detecting Life-Threatening Infant Incidents From Head Images. 2019 IEEE International Conference on Image Processing (ICIP), Pages 3006-3010, 2019. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202001021386. DOI: 10.1109/ICIP.2019.8803332
dc.relation.haspart[Publication 3]: Tzu-Jui Julius Wang, Hamed Rezazadegan Tavakoli, Mats Sjoberg, and Jorma Laaksonen. Geometry-aware Relational Exemplar Attention for Dense Captioning. 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications (MULEA ’19), co-located with ACM Multimedia’19, Pages 3-11, October 2019. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202001021009. DOI: 10.1145/3347450.3357656
dc.relation.haspart[Publication 4]: Tzu-Jui Julius Wang, Selen Pehlivan, and Jorma Laaksonen. Tackling The Unannotated: Scene Graph Generation with Bias-reduced Models. 31st British Machine Vision Conference (BMVC), September 2020. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202306304355.
dc.relation.haspart[Publication 5]: Zixin Guo, Tzu-Jui Julius Wang, and Jorma Laaksonen. Post-Attention Modulator for Dense Video Captioning. 26th International Conference on Pattern Recognition (ICPR), August 2022. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202308305315. DOI: 10.1109/ICPR56361.2022.9956260
dc.relation.haspart[Publication 6]: Zixin Guo, Tzu-Jui Julius Wang, and Jorma Laaksonen. CLIP4IDC: CLIP for Image Difference Captioning. The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP), November 2022. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202308114680.
dc.relation.haspart[Publication 7]: Tzu-Jui Julius Wang, Tomas Langer, Jorma Laaksonen, Heikki Arponen, and Tom Bishop. Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision. Winter Conference on Applications of Computer Vision (WACV) 2023, January 2023. Full text in Acris/Aaltodoc: https://urn.fi/URN:NBN:fi:aalto-202308305289. DOI: 10.1109/WACV56688.2023.00113
dc.relation.ispartofseriesAalto University publication series DOCTORAL THESESen
dc.relation.ispartofseries66/2024
dc.revGlowacka, Dorota, Assoc. Prof., University of Helsinki, Finland
dc.revKämäräinen, Joni-Kristian, Prof., Tampere University, Finland
dc.subject.keywordsaliency estimationen
dc.subject.keywordvisual captioningen
dc.subject.keywordscene graphen
dc.subject.keywordvision-language representation learningen
dc.subject.otherComputer scienceen
dc.titleDeep Visual Understanding and Beyond - Saliency, Uncertainty, and Bridges to Natural Languageen
dc.typeG5 Artikkeliväitöskirjafi
dc.type.dcmitypetexten
dc.type.ontasotDoctoral dissertation (article-based)en
dc.type.ontasotVäitöskirja (artikkeli)fi
local.aalto.acrisexportstatuschecked 2024-04-15_1351
local.aalto.archiveyes
local.aalto.formfolder2024_03_27_klo_14_02
local.aalto.infraScience-IT
Files
Original bundle
Name: isbn9789526417424.pdf
Size: 49.04 MB
Format: Adobe Portable Document Format