Comparison of speech to text services

School of Science | Master's thesis
Electronic archive copy is available via Aalto Thesis Database.

Mcode

SCI3082

Language

en

Pages

4 + 60

Abstract

Speech-to-text (STT), also called Automatic Speech Recognition (ASR), is the technology that enables a computer to recognize spoken language and transcribe it to text. During my internship, I worked at FAKE Production Oy, a Finnish company that designs, develops, and produces immersive Virtual Reality and Augmented Reality applications for communication, training, collaboration, and entertainment. The goal of this thesis is to find the best speech-to-text solution for the following use case. Participants in a virtual meeting wear Augmented and/or Virtual Reality headsets, through which they can see the other participants who joined the meeting remotely. As people speak, virtual text bubbles pop up above their heads and display in real time what they are saying, to aid understanding. At the end of the meeting, the participants automatically receive a transcript of what was said. In my thesis, I compared four STT service providers (Google Cloud Speech-to-Text, IBM Watson Speech To Text, AWS Transcribe, and SpeechMatics) together with Kaldi, an open source ASR toolkit. For the evaluation of Kaldi, I used an open source model trained on a telephone conversation database. I focused on English ASR only and evaluated the services on aspects such as the performance data published by the providers, usage restrictions, and price. In addition, I tested 11 hours of audio files and compared the results in terms of readability, accuracy, and speed. Of the five candidates, SpeechMatics and Google Cloud Speech-to-Text performed best. SpeechMatics was the fastest service and ranked second in accuracy, while Google Cloud Speech-to-Text ranked first in both readability and accuracy. Based on these results, my final recommendation to FAKE Production Oy was Google Cloud Speech-to-Text.

Supervisor

Kurimo, Mikko

Thesis advisor

Gröhn, Matti
