Natural Language Description of Images and Videos


dc.contributor Aalto-yliopisto fi
dc.contributor Aalto University en
dc.contributor.advisor Laaksonen, Jorma
dc.contributor.author Shetty, Rakshith
dc.date.accessioned 2016-10-12T11:41:22Z
dc.date.available 2016-10-12T11:41:22Z
dc.date.issued 2016-09-26
dc.identifier.uri https://aaltodoc.aalto.fi/handle/123456789/22839
dc.description.abstract Understanding visual media, i.e. images and videos, has long been a cornerstone topic in computer vision research. Recently, a new task within this research area, that of automatically captioning images and videos, has garnered widespread interest. The task involves generating a short natural language description of an image or a video. This thesis studies the automatic visual captioning problem in its entirety. A baseline visual captioning pipeline is examined, including its two constituent blocks, namely visual feature extraction and language modeling. We then discuss the challenges involved and the methods available for evaluating a visual captioning system. Building on this baseline model, several enhancements are proposed to improve both the visual feature extraction and the language modeling. Deep convolutional neural network based image features used in the baseline model are augmented with explicit object and scene detection features. For videos, a combination of action recognition and static frame-level features is used. The long short-term memory network based language model used in the baseline is extended by introducing an additional input channel and residual connections. Finally, an efficient ensembling technique based on a caption evaluator network is presented. Results from extensive experiments evaluating each of these enhancements are reported. The image and video captioning architectures proposed in this thesis achieve state-of-the-art performance on the corresponding tasks. To support these claims, results from two video captioning challenges organized over the last year are reported, both of which were won by the models presented in this thesis. We also quantitatively analyze the automatically generated captions and identify several shortcomings of the current system. Having identified these deficiencies, we briefly look at a few interesting problems that could take automatic visual captioning research forward. en
dc.format.extent 95
dc.format.mimetype application/pdf en
dc.language.iso en en
dc.title Natural Language Description of Images and Videos en
dc.type G2 Pro gradu, diplomityö (Master's thesis) fi
dc.contributor.school Perustieteiden korkeakoulu (School of Science) fi
dc.subject.keyword image captioning en
dc.subject.keyword video description en
dc.subject.keyword deep learning en
dc.subject.keyword long short-term memory en
dc.subject.keyword language modeling en
dc.identifier.urn URN:NBN:fi:aalto-201610124939
dc.programme.major Machine Learning and Data Mining fi
dc.programme.mcode SCI3015 fi
dc.type.ontasot Master's thesis en
dc.type.ontasot Diplomityö fi
dc.contributor.supervisor Karhunen, Juha
dc.programme Master’s Programme in Machine Learning and Data Mining (Macadamia) fi
dc.ethesisid Aalto 5679
dc.location P1
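
The baseline pipeline described in the abstract above, deep convolutional neural network features feeding a long short-term memory language model, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example assuming precomputed CNN features; the class, names, and dimensions are illustrative only and do not reproduce the thesis's actual architecture.

    import torch
    import torch.nn as nn

    class CaptionDecoder(nn.Module):
        """Toy LSTM caption decoder conditioned on a precomputed image feature."""
        def __init__(self, feat_dim, vocab_size, embed_dim=256, hidden_dim=512):
            super().__init__()
            # Map the image feature vector to the initial LSTM hidden state.
            self.init_h = nn.Linear(feat_dim, hidden_dim)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, feats, tokens):
            # feats: (B, feat_dim) CNN features; tokens: (B, T) word ids
            h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)  # (1, B, hidden_dim)
            c0 = torch.zeros_like(h0)
            hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
            return self.out(hidden)  # (B, T, vocab_size) next-word logits

    # Usage: in practice feats would come from a pretrained CNN (e.g. torchvision).
    decoder = CaptionDecoder(feat_dim=2048, vocab_size=10000)
    feats = torch.randn(4, 2048)
    tokens = torch.randint(0, 10000, (4, 12))
    logits = decoder(feats, tokens)  # train with cross-entropy against shifted tokens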

