Extracting semantic information from images is difficult to achieve with traditional software engineering approaches. However, recent developments in deep neural networks make it possible to extract semantic annotations affordably and efficiently. This work covers building a semantic segmentation model using a multi-task learning approach. The model can classify an image by scene and extract semantic information about the objects present in the image, the parts of those objects, and pixel-level annotations of material and texture.
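The sketch below illustrates one way such a multi-task model could be structured: a shared encoder feeding an image-level scene head and several pixel-level heads. It is a minimal sketch only; the backbone choice, class counts, and head design are assumptions for illustration, not the configuration used in this work.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class MultiTaskSegmenter(nn.Module):
    """Illustrative multi-task model: one shared encoder, one head per task.
    Backbone and class counts are placeholders, not the paper's exact setup."""

    def __init__(self, n_scenes=365, n_objects=150, n_parts=86,
                 n_materials=26, n_textures=47):
        super().__init__()
        backbone = resnet50(weights=None)
        # Shared encoder: everything up to the global-pooling layer.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        feat_dim = 2048

        # Image-level head: scene classification.
        self.scene_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, n_scenes))

        # Pixel-level heads: 1x1 convolutions producing per-class logits,
        # later upsampled back to the input resolution.
        self.pixel_heads = nn.ModuleDict({
            "object":   nn.Conv2d(feat_dim, n_objects, kernel_size=1),
            "part":     nn.Conv2d(feat_dim, n_parts, kernel_size=1),
            "material": nn.Conv2d(feat_dim, n_materials, kernel_size=1),
            "texture":  nn.Conv2d(feat_dim, n_textures, kernel_size=1),
        })

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.encoder(x)
        out = {"scene": self.scene_head(feats)}
        for task, head in self.pixel_heads.items():
            logits = head(feats)
            out[task] = nn.functional.interpolate(
                logits, size=(h, w), mode="bilinear", align_corners=False)
        return out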
The model is trained on a heterogeneous dataset that unifies five different datasets. Unlike conventional training of segmentation models, this research applies a technique called super-convergence, which reduces training time by at least 50% without compromising performance.
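Super-convergence is commonly realized with a one-cycle learning-rate policy, in which the learning rate ramps up to a large maximum and then anneals within a single cycle. The fragment below is a minimal sketch of that schedule using PyTorch's OneCycleLR; the optimizer settings, epoch count, and the compute_multitask_loss helper are hypothetical placeholders, not the hyperparameters or loss used in this work.

import torch
from torch.optim.lr_scheduler import OneCycleLR

# Assumes `model` and `train_loader` are defined elsewhere.
epochs = 20
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=epochs,
                       steps_per_epoch=len(train_loader))

for epoch in range(epochs):
    for images, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        # compute_multitask_loss is a hypothetical helper summing per-task losses.
        loss = compute_multitask_loss(outputs, targets)
        loss.backward()
        optimizer.step()
        scheduler.step()  # learning rate rises, then anneals within one cycle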
Many of the points discussed in this paper are investigative, all serving the main goal: building a segmentation model, training it efficiently, and producing good semantic annotations for a given image. The work documents the sequential steps taken toward this goal, starting with a literature review and ending with the model's implementation.