Learning Latent Image Representations with Prior Knowledge

School of Science | Doctoral thesis (article-based) | Defence date: 2022-12-16
62 + app. 78
Aalto University publication series DOCTORAL THESES, 194/2022
Deep learning has become a dominant tool in many computer vision applications owing to its superior performance in extracting low-dimensional latent representations from images. However, although prior knowledge is already available for many applications, most existing methods learn image representations from large-scale training data in a black-box manner, which limits interpretability and controllability. This thesis explores approaches that integrate different types of prior knowledge into deep neural networks. Instead of learning image representations from scratch, leveraging prior knowledge in the latent space can softly regularize training and yield more controllable representations. The models presented in the thesis mainly address three different problems:
(i) How to encode epipolar geometry in deep learning architectures for multi-view stereo. The key to multi-view stereo is finding matched correspondences across images. In this thesis, a learning-based method inspired by the classical plane sweep algorithm is studied. The method aims to improve correspondence matching in two ways: obtaining better correspondence candidates with a novel plane sampling strategy, and learning multiplane representations instead of using hand-crafted cost metrics.
(ii) How to capture the correlations of input data in the latent space. Multiple methods that introduce Gaussian processes in the latent space to encode view priors are explored in the thesis. Depending on the availability of the relative motion between frames, a hierarchy of three covariance functions is presented as Gaussian process priors, and correlated latent representations are obtained via latent nonparametric fusion. Experimental results show that the correlated representations lead to more temporally consistent predictions for depth estimation, and that they can also be applied to generative models to synthesize images in new views.
(iii) How to use the known factors of variation to learn disentangled representations. Equivariant representations and factorized representations are studied for novel view synthesis and interactive fashion retrieval, respectively.
In summary, this thesis presents three different types of solutions that use prior domain knowledge to learn more powerful image representations. For depth estimation, the presented methods integrate multi-view geometry into the deep neural network. For image sequences, the correlated representations obtained from inter-frame reasoning yield more consistent and stable predictions. The disentangled representations provide explicit, flexible control over specific known factors of variation.
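As a rough illustration of the idea in (ii), the sketch below places a Gaussian process prior over per-frame latent codes and replaces each code with the GP posterior mean, so that nearby frames share information. It is not the thesis's implementation: the squared-exponential covariance, the scalar frame index standing in for camera pose or relative motion, and the names `rbf_kernel` and `fuse_latents` are all assumptions made for this example.

```python
import numpy as np

def rbf_kernel(t_a, t_b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of scalar inputs.

    Assumption: a scalar frame index stands in for the pose or motion
    information that a view-aware covariance function would use.
    """
    diff = t_a[:, None] - t_b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def fuse_latents(t, z, noise_var=0.1, lengthscale=1.0):
    """GP posterior mean of latent codes z with shape (n_frames, dim).

    Each latent dimension is treated as an independent GP observed
    with Gaussian noise of variance noise_var.
    """
    K = rbf_kernel(t, t, lengthscale)
    A = K + noise_var * np.eye(len(t))
    return K @ np.linalg.solve(A, z)

# Toy data: a smooth 2-D latent trajectory corrupted by per-frame noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
clean = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

fused = fuse_latents(t, noisy, noise_var=0.09, lengthscale=0.2)

def roughness(z):
    """Sum of squared differences between consecutive frames."""
    return float(np.sum(np.diff(z, axis=0) ** 2))

print("roughness before fusion:", roughness(noisy))
print("roughness after fusion: ", roughness(fused))
```

In this toy setting the fused codes vary more smoothly from frame to frame than the independent per-frame codes, which is the kind of temporal consistency the abstract refers to. A pose-based covariance function would take the place of the frame-index kernel when relative motion is available.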
Supervising professor
Kannala, Juho, Prof., Aalto University, Department of Computer Science, Finland; Solin, Arno, Prof., Aalto University, Department of Computer Science, Finland
deep learning, machine learning, computer vision, multi view stereo, novel view synthesis, Gaussian processes
Other note
  • [Publication 1]: Yuxin Hou, Arno Solin and Juho Kannala. Unstructured Multi-view Depth Estimation Using Mask-Based Multiplane Representation. In Scandinavian Conference on Image Analysis (SCIA), Norrköping, Sweden, pp. 54-66, June 2019.
    DOI: 10.5281/zenodo.2628419
  • [Publication 2]: Yuxin Hou, Juho Kannala and Arno Solin. Multi-View Stereo by Temporal Nonparametric Fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, pp. 2651-2660, October 2019.
    DOI: 10.1109/ICCV.2019.00274
  • [Publication 3]: Yuxin Hou, Muhammad Kamran Janjua, Juho Kannala and Arno Solin. Movement-induced Priors for Deep Stereo. In International Conference on Pattern Recognition (ICPR), Virtual, pp. 3628-3635, January 2021.
    DOI: 10.1109/ICPR48806.2021.9413074
  • [Publication 4]: Yuxin Hou, Ari Heljakka and Arno Solin. Gaussian Process Priors for View-Aware Inference. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Virtual, pp. 7762-7770, May 2021.
  • [Publication 5]: Yuxin Hou, Arno Solin and Juho Kannala. Novel View Synthesis via Depth-guided Skip Connections. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Virtual, pp. 3119-3128, January 2021.
    DOI: 10.1109/WACV48630.2021.00316
  • [Publication 6]: Yuxin Hou, Eleonora Vig, Michael Donoser and Loris Bazzani. Learning Attribute-driven Disentangled Representations for Interactive Fashion Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, pp. 12147-12157, October 2021.
    DOI: 10.1109/ICCV48922.2021.01193