Dialogue self-play and crowdsourcing to collect annotated data for a chatbot
School of Science | Master's thesis
Authors
Date
2020-08-18
Department
Major/Subject
Human-Computer Interaction and Design
Mcode
SCI3020
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
83 + 4
Series
Abstract
The impact of chatbot technologies on our economies, services, and society as a whole is becoming more evident year by year, and it is deeply influencing the future prospects of human-computer interaction. Conversational agents are not new. Nonetheless, this area of research and development is still fertile ground for innovation, especially from the field of Artificial Intelligence. Historically, ad-hoc scripting languages were built to develop rule-based agents able to react to specific keywords or elements contained in the user input. Nowadays, systems based on deep learning are becoming popular because they allow for more flexible and robust conversational models. One of the curses of Artificial Intelligence in recent years, especially in the sub-field of deep learning, is the constant need for large datasets to train models and algorithms. Gathering and annotating dialogues is a time- and resource-consuming process; knowledge is also not always transferable, especially when it involves task-specific data, as in the case of goal-oriented chatbots. On the other hand, pattern-based chatbots for goal-oriented tasks are relatively easy to design and implement but fail to achieve the natural feel of interaction demanded by today's users.
The goal of this work is to explore the applicability of Machines talking to Machines (M2M), a new framework for collecting annotated datasets that can be used to train neural dialogue models. M2M uses a technique called dialogue self-play to build dialogue templates on top of computer-generated semantic annotations; in a subsequent phase, real users paraphrase the templates to build natural dialogues that reflect the underlying annotations. In this thesis, for the crowdsourcing phase, we involved both conventional workers from online platforms and the real users of a chatbot called Siirtobot, developed for the company 20Hexagons Oy, and compared the outcomes of the two groups.
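As a rough illustration of the self-play idea described above, the following minimal Python sketch lets a simulated user and a simulated system exchange annotated dialogue acts and renders each act as a template line. It is not the thesis implementation: the act names, the slot names, the `TEMPLATES` table, and the `self_play` function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Turn:
    speaker: str                       # "user" or "system"
    act: str                           # dialogue act label (hypothetical names)
    slots: Dict[str, Optional[str]]    # semantic annotation attached to the turn
    template: str                      # machine-generated template utterance


# Hypothetical template strings keyed by (speaker, act).
TEMPLATES = {
    ("user", "greet"): "GREET",
    ("system", "request"): "REQUEST {slot}",
    ("user", "inform"): "INFORM {slot} = {value}",
    ("system", "confirm"): "CONFIRM {summary}",
}


def self_play(goal: Dict[str, str]) -> List[Turn]:
    """Simulate a user bot and a system bot until every goal slot is filled."""
    turns = [Turn("user", "greet", {}, TEMPLATES[("user", "greet")])]
    known: Dict[str, str] = {}
    for slot, value in goal.items():
        # The system asks for a slot it does not know yet.
        turns.append(Turn("system", "request", {slot: None},
                          TEMPLATES[("system", "request")].format(slot=slot)))
        # The simulated user answers from its goal.
        turns.append(Turn("user", "inform", {slot: value},
                          TEMPLATES[("user", "inform")].format(slot=slot, value=value)))
        known[slot] = value
    summary = ", ".join(f"{k} = {v}" for k, v in known.items())
    turns.append(Turn("system", "confirm", dict(known),
                      TEMPLATES[("system", "confirm")].format(summary=summary)))
    return turns


# Example goal with made-up slots; a real task would define its own schema.
outline = self_play({"date": "2020-09-01", "city": "Helsinki"})
for turn in outline:
    print(f"{turn.speaker:>6} | {turn.act:<8} | {turn.template}")
```

In the paraphrasing phase, each template line would be shown to a crowd worker or a real user to rewrite in natural language, while the `act` and `slots` fields stay attached to the turn as its annotation.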
The results show that it is possible to integrate this data collection approach within a rule-based chatbot. A comparison of the datasets collected from professional workers and from actual users of a service shows that the latter introduce considerably higher dialogue diversity and linguistic richness, which is useful for training robust and flexible neural models. We also show how the collected data can later be used to bootstrap a neural agent by implementing and training a state-of-the-art module for Natural Language Generation.
Description
Supervisor
Kurimo, Mikko
Thesis advisor
Leinonen, Juho
Sandrini, Marco
Keywords
Chatbot, M2M, crowdsourcing, dialogue self-play