Comparison of Speech To Text Services

dc.contributor: Aalto University (en)
dc.contributor.advisor: Gröhn, Matti
dc.contributor.author: Klimko, Sara
dc.contributor.school: Perustieteiden korkeakoulu (fi)
dc.contributor.supervisor: Kurimo, Mikko
dc.description.abstract: The technology that enables a computer to recognize spoken language and transcribe it to text is called speech-to-text (STT) or Automatic Speech Recognition (ASR). During my internship, I worked at FAKE Production Oy, a Finnish company that designs, develops and produces immersive Virtual Reality and Augmented Reality applications for communication, training, collaboration, and entertainment. The goal of this thesis is to find the best speech-to-text solution for the following use case: imagine a virtual meeting where participants are wearing Augmented and/or Virtual Reality headsets. Through these headsets, the participants can see other participants who joined the meeting remotely. As people speak, virtual text bubbles pop up above their heads, displaying in real time what they are saying to aid understanding. At the end of the meeting, the participants automatically receive a transcript of what was said. In my thesis, I compared four STT services (Google Cloud Speech-to-Text, IBM Watson Speech To Text, AWS Transcribe and SpeechMatics) together with Kaldi, an open source ASR toolkit. For the evaluation, I used an open source Kaldi model that was trained on a telephone conversation database. I focused only on English ASR and evaluated the services on aspects such as the performance data published by the providers, usage restrictions, and price. Moreover, I tested the services on 11 hours of audio files and compared the results in terms of readability, accuracy, and speed. Of the five candidates, SpeechMatics and Google Cloud Speech-to-Text performed the best. SpeechMatics was the fastest service and ranked second in accuracy, while Google Cloud Speech-to-Text performed best in both readability and accuracy. Based on my results, my final recommendation to FAKE Production Oy was Google Cloud Speech-to-Text. (en)
dc.format.extent: 4 + 60
dc.programme: Master's Programme in ICT Innovation (fi)
dc.programme.major: Software and Service Architectures (fi)
dc.subject.keyword: speech to text (en)
dc.subject.keyword: hidden Markov model (en)
dc.subject.keyword: dynamic time warping (en)
dc.subject.keyword: deep neural network (en)
dc.title: Comparison of Speech To Text Services (en)
dc.type: G2 Pro gradu, diplomityö (fi)
dc.type.ontasot: Master's thesis (en)