Comparison of Speech To Text Services

School of Science | Master's thesis
Date
2019-08-19
Major/Subject
Software and Service Architectures
Mcode
SCI3042
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
4 + 60
Abstract
The technology that enables a computer to recognize spoken language and transcribe it into text is called speech-to-text (STT) or automatic speech recognition (ASR). During my internship, I worked at FAKE Production Oy, a Finnish company that designs, develops, and produces immersive virtual reality and augmented reality applications for communication, training, collaboration, and entertainment. The goal of this thesis is to find the best speech-to-text solution for the following use case: imagine a virtual meeting in which participants wear augmented and/or virtual reality headsets. Through these headsets, the participants can see the other participants who have joined the meeting remotely. As people speak, virtual text bubbles pop up above their heads, displaying in real time what they are saying to aid understanding. At the end of the meeting, the participants are automatically provided with a transcript of what was said. In my thesis, I compared four STT services (Google Cloud Speech-to-Text, IBM Watson Speech To Text, AWS Transcribe, and SpeechMatics) together with Kaldi, an open source ASR toolkit. For the evaluation, I used an open source Kaldi model that was trained on a database of telephone conversations. I focused only on English ASR and evaluated the services on aspects such as the performance data published by the providers, restrictions, and price. In addition, I tested the services on 11 hours of audio files and compared the results in terms of readability, accuracy, and speed. Of the five candidates, SpeechMatics and Google Cloud Speech-to-Text performed best. SpeechMatics was the fastest service and ranked second in accuracy, while Google Cloud Speech-to-Text performed best in both readability and accuracy. Based on these results, my final recommendation to FAKE Production Oy was Google Cloud Speech-to-Text.
Supervisor
Kurimo, Mikko
Thesis advisor
Gröhn, Matti
Keywords
speech to text, architecture, approaches, hidden Markov model, dynamic time warping, deep neural network