Visual Storytelling: Captioning of Image Sequences

Loading...
Thumbnail Image

URL

Journal Title

Journal ISSN

Volume Title

Perustieteiden korkeakoulu | Master's thesis

Date

2019-12-16

Department

Major/Subject

Machine Learning, Data Science and Artificial Intelligence

Mcode

SCI3044

Degree programme

Master’s Programme in Computer, Communication and Information Sciences

Language

en

Pages

77 + 3

Series

Abstract

In the space of automated captioning, the task of visual storytelling is a dimension. Given sequences of images as inputs, visual storytelling (VIST) is about automatically generating textual narratives as outputs. Automatically producing stories for an order of pictures or video frames have several potential applications in diverse domains ranging from multimedia consumption to autonomous systems. The task has evolved over recent years and is moving into adolescence. The availability of a dedicated VIST dataset for the task has mainstreamed research for visual storytelling and related sub-tasks. This thesis work systematically reports the developments of standard captioning as a parent task with accompanying facets like dense captioning and gradually delves into the domain of visual storytelling. Existing models proposed for VIST are described by examining respective characteristics and scope. All the methods for VIST adapt from the typical encoder-decoder style design, owing to its success in addressing the standard image captioning task. Several subtle differences in the underlying intentions of these methods for approaching the VIST are subsequently summarized. Additionally, alternate perspectives around the existing approaches are explored by re-modeling and modifying their learning mechanisms. Experiments with different objective functions are reported with subjective comparisons and relevant results. Eventually, the sub-field of character relationships within storytelling is studied and a novel idea called character-centric storytelling is proposed to account for prospective characters in the extent of data modalities.

Description

Supervisor

Laaksonen, Jorma

Thesis advisor

Laaksonen, Jorma

Keywords

natural language processing, computer vision, deep learning, captioning, deep reinforcement learning, sequence modeling

Other note

Citation