Browsing by Author "Kannala, Juho, Prof., Aalto University, Department of Computer Science, Finland"
Now showing 1 - 5 of 5
- Algorithms for Data-Efficient Training of Deep Neural Networks
School of Science | Doctoral dissertation (article-based) (2020) Verma, Vikas

Deep Neural Networks ("deep learning") have become a ubiquitous choice of algorithm for machine learning applications. These systems often achieve human-level or even super-human performance across a variety of tasks such as computer vision, natural language processing, speech recognition, reinforcement learning, generative modeling, and healthcare. This success can be attributed to their ability to learn complex representations directly from raw input data, eliminating hand-crafted feature extraction from the pipeline. However, a caveat remains: due to the extremely large number of trainable parameters in Deep Neural Networks, their generalization ability depends heavily on the availability of a large amount of labeled data. In many machine learning applications, gathering a large amount of labeled data is not feasible due to privacy, cost, time, or expertise constraints. Examples of such applications are abundant in healthcare; for example, predicting the effect of a medicine on a new patient when the medicine has previously been administered to only a few patients. This thesis addresses the problem of improving the generalization ability of Deep Neural Networks using a limited amount of labeled data. More specifically, it explores a class of methods that directly incorporates into the learning algorithm an inductive bias about how Deep Neural Networks should "behave" in between training samples (both in the input space and in the hidden space). Across the publications included in this thesis, the author demonstrates that such methods can outperform conventional baselines and achieve state-of-the-art performance in supervised, unsupervised, semi-supervised, adversarial-training, and graph-based learning settings. In addition to these algorithms, the author proposes a mutual-information-based method for learning representations for "graph-level" tasks in an unsupervised and semi-supervised manner. Finally, the author proposes a method to improve the generalization of ResNets based on the iterative inference view.
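A minimal sketch of the interpolation-between-samples idea the abstract refers to, written as a generic Mixup-style training loss in PyTorch. The helper names, the Beta(α, α) mixing coefficient, and the soft cross-entropy are illustrative assumptions, not the thesis's exact algorithms:

```python
# Sketch: train on convex combinations of inputs and labels so the network
# behaves smoothly in between training samples.
import torch
import torch.nn.functional as F

def mixup_batch(x, y, num_classes, alpha=1.0):
    """Interpolate a batch of inputs x and integer labels y (hypothetical helper)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

def mixup_loss(model, x, y, num_classes, alpha=1.0):
    x_mix, y_mix = mixup_batch(x, y, num_classes, alpha)
    log_probs = F.log_softmax(model(x_mix), dim=1)
    return -(y_mix * log_probs).sum(dim=1).mean()   # soft cross-entropy on mixed labels
```

The same interpolation can in principle be applied to hidden activations rather than raw inputs, which is the "hidden space" variant the abstract mentions.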
- Deep Learning Methods for Image Matching and Camera Relocalization

School of Science | Doctoral dissertation (article-based) (2020) Melekhov, Iaroslav

Deep learning and convolutional neural networks have revolutionized computer vision and become a dominant tool in many applications, such as image classification, semantic segmentation, object recognition, and image retrieval. Their strength lies in the ability to learn an efficient representation of images that makes subsequent learning tasks easier. This thesis presents deep learning approaches for a number of closely related fundamental computer vision problems: image matching, image-based localization, ego-motion estimation, and scene understanding. In image matching, the thesis studies two methods that use a Siamese network architecture to learn both patch-level and image-level descriptors whose similarity is measured with the Euclidean distance. Next, it introduces a coarse-to-fine CNN-based approach for dense pixel correspondence estimation that leverages the advantages of optical flow methods and extends them to the case of a wide baseline between two images. The method demonstrates good generalization performance and is applicable to image matching as well as to image alignment and relative camera pose estimation. One of the contributions of the thesis is a novel approach for recovering the absolute camera pose from ego-motion. In contrast to existing CNN-based localization algorithms, the proposed method can be applied directly to scenes that were not available at training time and does not require scene-specific training of the network, thus improving scalability. The thesis also shows that a Siamese architecture can be successfully utilized for relative camera pose estimation, achieving better performance in challenging scenarios than traditional image descriptors. Lastly, the thesis demonstrates how advances in visual geometry can help to efficiently learn depth, camera ego-motion, and optical flow for scene understanding. More specifically, it introduces a method that leverages temporally consistent geometric priors between frames of monocular video sequences and jointly estimates ego-motion and depth maps in a self-supervised manner.
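A minimal sketch of the Siamese-descriptor idea mentioned in the abstract: two inputs pass through the same weights and are compared by Euclidean distance under a contrastive loss. The encoder architecture, margin, and function names are generic assumptions, not the networks proposed in the thesis:

```python
# Sketch: shared-weight (Siamese) encoder trained so matching pairs lie close
# in Euclidean distance and non-matching pairs lie at least `margin` apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, a, b):
        # Both branches share the same weights ("Siamese").
        return self.backbone(a), self.backbone(b)

def contrastive_loss(za, zb, same, margin=1.0):
    """same: 1.0 for matching pairs, 0.0 for non-matching pairs."""
    d = F.pairwise_distance(za, zb)                     # Euclidean distance
    return (same * d.pow(2) +
            (1.0 - same) * F.relu(margin - d).pow(2)).mean()
```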
- Deep Learning Methods for Point Matching, Visual Localization and 3D Reconstruction

School of Science | Doctoral dissertation (article-based) (2024) Wang, Shuzhe

This doctoral thesis explores advanced deep learning methods for three pivotal tasks in 3D computer vision: point matching, visual localization, and 3D reconstruction. These tasks are crucial for enabling machines to perceive and understand 3D environments, which is essential for applications ranging from virtual reality to robotics and autonomous driving. In point matching, the thesis investigates learning-based, visual-descriptor-free matching pipelines that combine geometric and color cues with Graph Neural Networks. This approach significantly improves the accuracy of 2D-3D keypoint matching while reducing memory usage and improving data privacy. For visual localization, the thesis introduces two hierarchical scene coordinate network architectures that establish dense 2D-3D matches for accurate 6-DoF camera pose estimation. These architectures incorporate a conditioning mechanism and a transformer to encode global context into local patches, overcoming scene ambiguities. Additionally, a novel few-shot learning setting is proposed, which reduces the training load of scene coordinate regression and shortens training time from days to minutes. A significant contribution of the thesis is the development of an end-to-end dense unconstrained 3D reconstruction pipeline based on Vision Transformers. This pipeline regresses point coordinates directly from image pairs without relying on camera parameters, simplifying traditional 3D reconstruction methods and introducing a unified framework for monocular and binocular reconstruction tasks. The thesis also explores methods to improve local feature matching by computing the curvature of local 3D surface patches at detected points, enhancing matching accuracy with off-the-shelf learned matchers. Furthermore, it addresses the challenge of continual learning for visual localization by proposing an experience-replay-based baseline that prevents catastrophic forgetting and reduces computational and storage costs. Throughout the thesis, the importance of end-to-end learning is emphasized, where models are trained to produce the desired outputs directly from raw input data. This paradigm shift can simplify development pipelines and enhance the adaptability and scalability of 3D computer vision systems. By integrating deep learning with traditional geometric principles, this research provides a comprehensive framework for addressing key challenges in point matching, visual localization, and 3D reconstruction.
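To make the localization step concrete: once a network predicts dense 2D-3D matches (scene coordinates), a 6-DoF pose is typically recovered with PnP inside a RANSAC loop. Below is a generic sketch of that last step using OpenCV; the function name, thresholds, and inputs are illustrative assumptions, not the architectures proposed in the thesis:

```python
# Sketch: 6-DoF camera pose from predicted 2D-3D correspondences via PnP + RANSAC.
import numpy as np
import cv2

def pose_from_scene_coordinates(pts2d, pts3d, K):
    """pts2d: (N, 2) pixel coordinates, pts3d: (N, 3) predicted scene coordinates,
    K: (3, 3) camera intrinsics. Needs at least 4 correspondences."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float32), pts2d.astype(np.float32),
        K.astype(np.float32), distCoeffs=None,
        reprojectionError=8.0, iterationsCount=1000)
    if not ok:
        raise RuntimeError("PnP failed to find a pose")
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix from axis-angle vector
    return R, tvec, inliers      # world-to-camera rotation, translation, inlier ids
```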
- Deep Learning Methods for Semantic Matching, Image Retrieval and Camera Relocalization

School of Science | Doctoral dissertation (article-based) (2020) Laskar, Zakaria

Image matching is a central component of many computer vision applications. The field has progressed significantly with the advancement of deep learning models such as convolutional neural networks. This thesis makes several contributions that advance the performance of existing CNN-based approaches in closely related problem areas of image matching, namely semantic matching, image retrieval, and image-based localization. First, the problem of data and ground-truth labelling efficiency for training CNN models is studied in the context of semantic matching. A weakly supervised method is presented to address the problem of learning from small training datasets. The method first generates additional training samples from existing data and then introduces a novel loss function based on cyclic consistency to regularize training; a sketch of the cyclic-consistency idea is given below. Results show that the proposed method can learn from weakly labelled data without pixel-level correspondence information. In the next part of the thesis, we study the application of both global and local image matching to image retrieval. For particular-landmark retrieval, the thesis studies the role of contextual information in the global query image representation, which existing approaches generally discard in order to remove noisy background information. An attention model is proposed that uses bottom-up saliency to modulate contextual information in intermediate CNN representations in a top-down manner. To address the challenges posed by local variations in city-scale retrieval, the thesis proposes a geometric verification method based on CNN image matching, together with a method for improving the accuracy and efficiency of the image matching itself. Lastly, the thesis demonstrates methods that apply the key concepts from image matching and image retrieval to image-based localization. In contrast to existing approaches, the proposed method can be applied to novel scenes not seen during training and scales favourably with the size of the environment. In addition, a challenging indoor localization dataset is made publicly available to address the limitations of existing datasets.
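A minimal sketch of cyclic consistency as a weak supervision signal: correspondences from image A to B and back from B to A should compose to (approximately) the identity, so their residual can be penalized without pixel-level labels. The dense-flow representation and names below are illustrative assumptions, not the thesis's exact loss:

```python
# Sketch: cycle-consistency loss on dense correspondence fields (in pixels).
import torch
import torch.nn.functional as F

def cycle_consistency_loss(flow_ab, flow_ba):
    """flow_ab, flow_ba: (B, 2, H, W) dense correspondence fields A->B and B->A."""
    b, _, h, w = flow_ab.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow_ab)        # (2, H, W), x then y
    pos_in_b = grid + flow_ab                                      # where each pixel of A lands in B
    # Sample the reverse flow at those positions (grid_sample expects [-1, 1] coords).
    norm = torch.stack((2 * pos_in_b[:, 0] / (w - 1) - 1,
                        2 * pos_in_b[:, 1] / (h - 1) - 1), dim=-1)
    flow_ba_warped = F.grid_sample(flow_ba, norm, align_corners=True)
    # Going A -> B -> A should return to the starting pixel, i.e. the flows should cancel.
    return (flow_ab + flow_ba_warped).abs().mean()
```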
- Machine Learning Methods for Classification of Unstructured Data

School of Science | Doctoral dissertation (article-based) (2019) Sayfullina, Luiza

Natural language processing is a field that studies automatic computational processing of human languages. Although natural language is symbolic and full of rules and ontologies, state-of-the-art approaches are typically based on statistical machine learning. With the invention of word embeddings, researchers have managed to circumvent the problem of sparse feature spaces and to take into account word semantics learned from large corpora. For artificial strings, e.g. in source code, the use of embeddings is restricted by the extremely large vocabulary. This dissertation covers two applications using both embedding-based and bag-of-words approaches: one concerning industrial-scale Android malware classification and another concerning the extraction of soft skills and their impact on occupational gender segregation. The data in both applications is unstructured: Android applications consist of a set of files that are mainly unstructured or semi-structured, while the job postings used for soft-skill analysis are free text with no clearly defined structure. The first part of the dissertation is dedicated to industrial-scale Android malware classification, covering the full pipeline from feature extraction to deployment. Various groups of features are extracted from Android installation package files, resulting in a large, high-dimensional, sparse feature space. We investigated ways to reduce the feature space efficiently from millions to thousands of features and managed to improve the decision boundary. Finally, we addressed the problem of fair model assessment by separating training and test samples in time and evaluated the proposed ensemble-based methods accordingly. The second part of the dissertation is dedicated to statistical and machine learning based analysis of soft skills and their impact on occupational gender segregation. Soft skills are personality traits that facilitate human interaction. Our work is pioneering with respect to large-scale analysis of soft-skill requirements and their impact on salary. We show that not only are soft skills useful in predicting the gender ratio of the corresponding job category, but most of them also comply with gender stereotypes. Besides curating a soft-skill list from job postings, we also propose various input representations that increase the precision of soft-skill extraction using the context in which a soft skill occurs.
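As a rough illustration of the "millions to thousands of features" reduction described in this abstract, the sketch below builds a sparse bag-of-tokens representation, keeps only the most informative features, and feeds them to a linear classifier. It uses generic scikit-learn components as placeholders and is not the dissertation's actual pipeline:

```python
# Sketch: sparse bag-of-words features -> statistical feature selection -> classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    CountVectorizer(analyzer="word", binary=True),  # sparse presence/absence of tokens
    SelectKBest(chi2, k=5000),                      # keep the k most class-informative features
    LogisticRegression(max_iter=1000),              # linear decision boundary on reduced space
)

# docs: token strings extracted from installation packages (or job postings);
# labels: binary class labels. Both are placeholders here, and k=5000 assumes
# the vocabulary contains at least that many features.
# pipeline.fit(docs, labels)
```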