Deep Visual Understanding and Beyond - Saliency, Uncertainty, and Bridges to Natural Language

School of Science | Doctoral thesis (article-based) | Defence date: 2024-04-12
117 + app. 83
Aalto University publication series DOCTORAL THESES, 66/2024
Visual understanding concerns the extent to which a cognitive system can reason about its visual surroundings before reacting accordingly. While visual understanding is crucial in itself, multi-modal reasoning goes further by involving other modalities as well. That is, a cognitive system often faces a daunting task: capitalizing on inputs, usually from one or more modalities, to adapt itself to a world of multiple modalities. More importantly, different machine learning paradigms may be exploited to learn both uni-modal and multi-modal reasoning tasks. This defines the main research question that initiated the research presented in this thesis. In response to this core question, the work provides a number of methods empowered by different machine learning paradigms in both uni-modal and multi-modal contexts. More concretely, it is shown that visual saliency, one of the fundamentals of visual understanding, can be estimated with visual cues learned in an unsupervised fashion. The semi-supervised learning principle is found to be effective in combating class-imbalance issues in scene graph generation, which aims at discovering relationships among the visual objects in an image. Moreover, to overcome the primary drawback of vision-language (VL) pre-training and other VL applications, which conventionally necessitate annotated image-text pairs, a novel weakly supervised approach is introduced. In addition, several enhancements are made to supervised learning applications: firstly, an improved dense image captioning model is proposed to better exploit the different types of relationships between visual objects in an image; secondly, an enhanced video captioning model is proposed to alleviate the impact of the modality gap commonly found in the widely adopted Transformer models.
Lastly, an uncertainty-aware classification model is proposed to learn more robustly under noisy supervision by accounting for both data and model uncertainties. These results suggest the usefulness and wide applicability of the different learning paradigms. In terms of model robustness, several advances are presented and elaborated for both uni-modal and multi-modal applications. The research outcomes encompass numerous findings on computer vision techniques and their bridges to natural language. The thesis concludes with a discussion of the limitations of each published work and of potential future directions in both uni-modal and multi-modal research.
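The split between data (aleatoric) and model (epistemic) uncertainty mentioned above can be illustrated with a standard Bayesian deep-learning decomposition, not necessarily the exact formulation used in the thesis: the entropy of the mean prediction over several Monte Carlo forward passes (e.g. MC dropout) separates into an expected-entropy term (data uncertainty) plus a mutual-information term (model uncertainty). A minimal NumPy sketch with made-up softmax outputs:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of a categorical distribution."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def uncertainty_decomposition(mc_probs):
    """Decompose predictive uncertainty from T stochastic forward passes.

    mc_probs: array of shape (T, num_classes); each row is a softmax
    output from one Monte Carlo sample (e.g. one dropout mask).
    Returns (total, aleatoric, epistemic) uncertainty.
    """
    p_mean = mc_probs.mean(axis=0)        # averaged predictive distribution
    total = entropy(p_mean)               # total predictive uncertainty
    aleatoric = entropy(mc_probs).mean()  # expected data uncertainty
    epistemic = total - aleatoric         # mutual information: model uncertainty
    return total, aleatoric, epistemic

# Samples that agree -> model uncertainty near zero.
agree = np.array([[0.98, 0.01, 0.01]] * 10)
# Confident but contradictory samples -> high model uncertainty.
disagree = np.array([[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]] * 5)
```

Under this decomposition, noisy labels tend to inflate the aleatoric term, while out-of-distribution or under-trained inputs inflate the epistemic term, which is what makes the distinction useful for robust learning under noisy supervision.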
Supervising professor
Kaski, Samuel, Prof., Aalto University, Department of Computer Science, Finland
Thesis advisor
Laaksonen, Jorma, Senior University Lecturer, D.Sc. (Tech.), Aalto University, Department of Computer Science, Finland
saliency estimation, visual captioning, scene graph, vision-language representation learning
Other note
  • [Publication 1]: Tzu-Jui Julius Wang, Hamed Rezazadegan Tavakoli, and Jorma Laaksonen. Fixation Prediction in Videos Using Unsupervised Hierarchical Features. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
    DOI: 10.1109/CVPRW.2017.276
  • [Publication 2]: Tzu-Jui Julius Wang, Jorma Laaksonen, Yi-Ping Liao, Bo-Zong Wu, and Shih-Yun Shen. A Multi-Task Bayesian Deep Neural Net for Detecting Life-Threatening Infant Incidents From Head Images. 2019 IEEE International Conference on Image Processing (ICIP), Pages 3006-3010, 2019.
    DOI: 10.1109/ICIP.2019.8803332
  • [Publication 3]: Tzu-Jui Julius Wang, Hamed Rezazadegan Tavakoli, Mats Sjöberg, and Jorma Laaksonen. Geometry-aware Relational Exemplar Attention for Dense Captioning. 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications (MULEA ’19), co-located with ACM Multimedia’19, Pages 3-11, October 2019.
    DOI: 10.1145/3347450.3357656
  • [Publication 4]: Tzu-Jui Julius Wang, Selen Pehlivan, and Jorma Laaksonen. Tackling The Unannotated: Scene Graph Generation with Bias-reduced Models. 31st British Machine Vision Conference (BMVC), September 2020.
  • [Publication 5]: Zixin Guo, Tzu-Jui Julius Wang, and Jorma Laaksonen. Post-Attention Modulator for Dense Video Captioning. 26th International Conference on Pattern Recognition (ICPR), August 2022.
    DOI: 10.1109/ICPR56361.2022.9956260
  • [Publication 6]: Zixin Guo, Tzu-Jui Julius Wang, and Jorma Laaksonen. CLIP4IDC: CLIP for Image Difference Captioning. The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (AACL-IJCNLP), November 2022.
  • [Publication 7]: Tzu-Jui Julius Wang, Tomas Langer, Jorma Laaksonen, Heikki Arponen, and Tom Bishop. Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision. Winter Conference on Applications of Computer Vision (WACV) 2023, January 2023.
    DOI: 10.1109/WACV56688.2023.00113