Browsing by Author "Laaksonen, Jorma, Senior University Lecturer, D.Sc. (Tech.), Aalto University, Department of Computer Science, Finland"
Item: Deep Visual Understanding and Beyond - Saliency, Uncertainty, and Bridges to Natural Language (Aalto University, 2024)
Wang, Tzu-Jui Julius; Laaksonen, Jorma, Senior University Lecturer, D.Sc. (Tech.), Aalto University, Department of Computer Science, Finland; Tietotekniikan laitos; Department of Computer Science; Content-Based Image and Information Retrieval Group; Perustieteiden korkeakoulu; School of Science; Kaski, Samuel, Prof., Aalto University, Department of Computer Science, Finland

Visual understanding concerns the extent to which a cognitive system can reason about its visual surroundings before reacting accordingly. While visual understanding is considered crucial, what goes beyond it are the capabilities of multi-modal reasoning, which also involve other modalities. That is, a cognitive system often faces a daunting process: how to capitalize on its inputs, usually from one or more modalities, to adapt itself to a world of multiple modalities. More importantly, different machine learning paradigms may be exploited to learn both uni-modal and multi-modal reasoning tasks. This defines the main research question motivating the work presented in this thesis. In response to the dissertation's core research question, the work provides a number of methods, empowered by different machine learning paradigms, for both uni-modal and multi-modal contexts. More concretely, it is shown that one can estimate visual saliency, one of the most crucial fundamentals of visual understanding, from visual cues learned in an unsupervised fashion. The semi-supervised learning principle is found to be effective in combating class-imbalance issues in scene graph generation, which aims at discovering relationships among visual objects in an image.
Moreover, to overcome the primary drawback of vision-language (VL) pre-training and other VL applications, which conventionally require annotated image-text pairs, a novel weakly supervised approach is introduced. In addition, several enhancements are made to supervised learning applications. Firstly, an improved dense image captioning model is proposed to better exploit the different types of relationships between visual objects in an image. Secondly, an enhanced video captioning model is proposed to alleviate the impact of the modality gap commonly found in widely adopted Transformer models. Lastly, an uncertainty-aware classification model is proposed to learn more robustly under noisy supervision by accounting for data and model uncertainties. These results suggest the usefulness and wide applicability of the different learning paradigms. In terms of model robustness, several advances are made and elaborated for both uni-modal and multi-modal applications. The research outcomes encompass numerous findings on computer vision techniques and their bridges to natural language. The thesis concludes with a discussion of the limitations of each published work and of potential future directions in both uni-modal and multi-modal research.