Dialogue self-play and crowdsourcing to collect annotated data for a chatbot

School of Science | Master's thesis

Date

2020-08-18

Major/Subject

Human-Computer Interaction and Design

Mcode

SCI3020

Degree programme

Master's Programme in ICT Innovation

Language

en

Pages

83 + 4

Abstract

The impact of chatbot technologies on our economies, services, and society as a whole becomes more evident every year, and it is deeply influencing the future of human-computer interaction. Conversational agents are not new. Nonetheless, this area of research and development remains fertile ground for innovation, especially from the field of Artificial Intelligence. Historically, ad hoc scripting languages were built to develop rule-based agents able to react to specific keywords or elements in the user input. Nowadays, systems based on deep learning are gaining popularity because they allow for more flexible and robust conversational models. One of the curses of Artificial Intelligence in recent years, especially in the sub-field of deep learning, is the constant need for large datasets to train models and algorithms. Gathering and annotating dialogues is a time- and resource-consuming process; knowledge is also not always transferable, especially when it involves task-specific data, as in the case of goal-oriented chatbots. On the other hand, pattern-based chatbots for goal-oriented tasks are relatively easy to design and implement but fail to achieve the natural feel of interaction demanded by today's users. The goal of this work is to explore the applicability of Machines talking to Machines (M2M), a new framework for collecting annotated datasets that can be used to train neural dialogue models. M2M uses a technique called dialogue self-play to build dialogue templates on top of computer-generated semantic annotations; in a subsequent phase, real users paraphrase the templates to build natural dialogues reflecting the underlying annotations. In this thesis, for the crowdsourcing phase, we involved both classical workers from online platforms and the real users of a chatbot called Siirtobot, developed for the company 20Hexagons Oy, and compared the outcomes.
The results show that it is possible to integrate this data collection approach within a rule-based chatbot. Comparing the datasets collected from professional workers and from actual users of a service shows that the latter introduce considerably higher dialogue diversity and linguistic richness, which is useful for training robust and flexible neural models. We also demonstrate how the collected data can later be used to bootstrap a neural agent by implementing and training a state-of-the-art module for Natural Language Generation.

Description

Supervisor

Kurimo, Mikko

Thesis advisor

Leinonen, Juho
Sandrini, Marco

Keywords

Chatbot, M2M, crowdsourcing, dialogue self-play
