Improved deep depth estimation for environments with sparse visual cues

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.author: Joswig, Niclas [en_US]
dc.contributor.author: Autiosalo, Juuso [en_US]
dc.contributor.author: Ruotsalainen, Laura [en_US]
dc.contributor.department: Department of Electronics and Nanoengineering [en]
dc.contributor.department: Department of Energy and Mechanical Engineering [en]
dc.contributor.groupauthor: Mechatronics [en]
dc.contributor.organization: University of Helsinki [en_US]
dc.date.accessioned: 2023-01-18T09:22:49Z
dc.date.available: 2023-01-18T09:22:49Z
dc.date.issued: 2023-01 [en_US]
dc.description: Funding Information: This work has been supported by a donation from Konecranes, Finnish Center for Artificial Intelligence (FCAI), the University of Helsinki and Aalto University. Publisher Copyright: © 2022, The Author(s).
dc.description.abstract: Most deep learning-based depth estimation models that learn scene structure self-supervised from monocular video base their estimation on visual cues such as vanishing points. In established depth estimation benchmarks depicting, for example, street navigation or indoor offices, these cues appear consistently, which enables neural networks to predict depth maps from single images. In this work, we address the challenge of depth estimation from a real-world bird's-eye perspective in an industrial environment which, owing to its special geometry, contains minimal visual cues and hence requires incorporating the temporal domain for structure-from-motion estimation. To enable the system to recover structure from motion via pixel translation when facing context-sparse, i.e., visual-cue-sparse, scenery, we propose a novel architecture built upon the structure-from-motion learner, which uses temporal pairs of jointly unrotated and stacked images for depth prediction. To increase overall performance and to avoid blurred depth edges lying between the edges of the two input images, we integrate a geometric consistency loss into our pipeline. We assess the model's ability to learn structure from motion by introducing a novel industry dataset whose perspective, orthogonal to the floor, contains only minimal visual cues. Through evaluation against ground-truth depth, we show that our proposed method outperforms the state of the art in difficult context-sparse environments. [en]
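The abstract describes feeding the depth network a temporal pair of jointly unrotated images stacked into a single input. A minimal sketch of the stacking step is shown below; the function name and shapes are illustrative assumptions, not taken from the paper, and the joint unrotation itself is omitted:

```python
import numpy as np

def stack_temporal_pair(frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
    """Stack two temporally adjacent RGB frames along the channel axis,
    producing a 6-channel input a depth network could consume.

    The paper's pipeline additionally unrotates both frames into a common
    orientation before stacking; that step is omitted in this sketch.
    """
    if frame_t.shape != frame_t1.shape:
        raise ValueError("frames must share height, width and channel count")
    return np.concatenate([frame_t, frame_t1], axis=-1)

# Two dummy 128x128 RGB frames -> one 128x128x6 network input.
pair = stack_temporal_pair(
    np.zeros((128, 128, 3), dtype=np.float32),
    np.ones((128, 128, 3), dtype=np.float32),
)
print(pair.shape)  # (128, 128, 6)
```

Stacking along the channel axis lets a standard convolutional encoder observe pixel translation between the two frames, which is the cue the abstract argues is needed when single-image vanishing-point cues are absent.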
dc.description.version: Peer reviewed [en]
dc.format.extent: 12
dc.format.mimetype: application/pdf [en_US]
dc.identifier.citation: Joswig, N, Autiosalo, J & Ruotsalainen, L 2023, 'Improved deep depth estimation for environments with sparse visual cues', Machine Vision and Applications, vol. 34, no. 1, 18. https://doi.org/10.1007/s00138-022-01364-0 [en]
dc.identifier.doi: 10.1007/s00138-022-01364-0 [en_US]
dc.identifier.issn: 0932-8092
dc.identifier.issn: 1432-1769
dc.identifier.other: PURE UUID: 5da91566-54f9-4949-868b-3c57c81dd26b [en_US]
dc.identifier.other: PURE ITEMURL: https://research.aalto.fi/en/publications/5da91566-54f9-4949-868b-3c57c81dd26b [en_US]
dc.identifier.other: PURE FILEURL: https://research.aalto.fi/files/98117320/s00138_022_01364_0.pdf
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/118860
dc.identifier.urn: URN:NBN:fi:aalto-202301181216
dc.language.iso: en [en]
dc.publisher: Springer
dc.relation.fundinginfo: This work has been supported by a donation from Konecranes, Finnish Center for Artificial Intelligence (FCAI), the University of Helsinki and Aalto University.
dc.relation.ispartofseries: Machine Vision and Applications [en]
dc.relation.ispartofseries: Volume 34, issue 1 [en]
dc.rights: openAccess [en]
dc.subject.keyword: Computer vision [en_US]
dc.subject.keyword: Deep learning [en_US]
dc.subject.keyword: Monocular depth [en_US]
dc.subject.keyword: Visual SLAM [en_US]
dc.title: Improved deep depth estimation for environments with sparse visual cues [en]
dc.type: A1 Alkuperäisartikkeli tieteellisessä aikakauslehdessä (A1 Original article in a scientific journal) [fi]
dc.type.version: publishedVersion