Data-Efficient Vision Transformer for modeling forest biomass and tree structures from hyperspectral imagery

Perustieteiden korkeakoulu | Master's thesis

Date

2021-10-18

Department

Major/Subject

Machine Learning, Data Science and Artificial Intelligence

Mcode

SCI3044

Degree programme

Master’s Programme in Computer, Communication and Information Sciences

Language

en

Pages

72

Series

Abstract

Hyperspectral and multispectral imaging technologies for remote sensing have gained considerable prominence in the modern technology era, owing to the rich spectral information they capture in comparison with RGB or greyscale imaging technologies. The remotely sensed data have been employed in various tasks, such as monitoring Earth’s surface, environmental risk analysis, and forest monitoring and modeling. However, collecting, processing, and utilizing these data for predictive modeling remains an arduous task in modern-day machine learning due to factors such as the complex and imbalanced nature of hyperspectral imagery (HSI), the variability of its spectral and spatial features, and abundant noise in the spectral channels. In the AIROBEST project, a data processing scheme and a novel convolutional neural network were proposed for predicting several forest variables. Although the results were fairly accurate, there remained a need to improve the classification performance. This thesis addresses this challenge by applying the Vision Transformer (ViT) to modeling forest biomass and tree structures. We have adopted the distillation training technique, together with several data augmentation techniques, to help the ViT generalize well to unseen data. We have carried out several experiments to study the effect of different image and patch size combinations and augmentation techniques on the model’s performance. Furthermore, the resulting accuracies were recorded and compared with the benchmark results from AIROBEST. Altogether, the baseline Vision Transformer yields an overall accuracy of 87.38% and a mean accuracy of 73.35%, exceeding the benchmark results by margins of 3.9% and 1.53%, respectively. Lastly, the thesis discusses the drawbacks of the work and provides suggestions to potentially overcome the shortcomings and improve the current results.
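
As an illustration of the distillation training technique mentioned in the abstract, the following is a minimal, hypothetical sketch of hard-label distillation for a Vision Transformer on hyperspectral patches, written in PyTorch with the timm library. The model names, band count, class count, loss weighting, and the use of a ResNet teacher are illustrative assumptions, not the implementation used in the thesis.

import torch
import torch.nn.functional as F
import timm

NUM_BANDS = 110      # hypothetical number of hyperspectral channels
NUM_CLASSES = 10     # hypothetical number of target classes

# Student: a DeiT-style ViT with a distillation token, adapted to
# hyperspectral input via the in_chans argument.
student = timm.create_model(
    "deit_small_distilled_patch16_224",
    pretrained=False,
    in_chans=NUM_BANDS,
    num_classes=NUM_CLASSES,
)
student.train()

# Teacher: a CNN stands in here for the convolutional benchmark model.
teacher = timm.create_model(
    "resnet50", pretrained=False, in_chans=NUM_BANDS, num_classes=NUM_CLASSES
)
teacher.eval()

def distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    # Hard-label distillation: the class token learns from the ground truth,
    # while the distillation token learns from the teacher's predicted labels.
    ce_true = F.cross_entropy(cls_logits, labels)
    ce_teacher = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=1))
    return 0.5 * ce_true + 0.5 * ce_teacher

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-4)

# One illustrative training step on a batch of hyperspectral image patches.
x = torch.randn(8, NUM_BANDS, 224, 224)   # synthetic stand-in for HSI patches
y = torch.randint(0, NUM_CLASSES, (8,))

with torch.no_grad():
    teacher_logits = teacher(x)

# In training mode, timm's distilled DeiT models return both the class-token
# and distillation-token logits.
cls_out, dist_out = student(x)
loss = distillation_loss(cls_out, dist_out, y, teacher_logits)
optimizer.zero_grad()
loss.backward()
optimizer.step()

At inference time, the two heads are typically averaged into a single prediction; the image size (224) and patch size (16) shown here are just one possible image and patch size combination, not necessarily those studied in the thesis.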

Description

Supervisor

Laaksonen, Jorma

Thesis advisor

Anwer, Rao Muhammad
Naseer, Muzammal

Keywords

machine learning, remote sensing, hyperspectral image, attention, vision transformer, neural network

Other note

Citation