Large Scale Speech Recognition with Deep Learning
School of Science | Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Authors
Date
2022-01-24
Department
Major/Subject
Data Science
Mcode
SCI3095
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
64+19
Series
Abstract
Automatic Speech Recognition (ASR) is the task of converting a speech signal into text. To enable the use of large datasets for training end-to-end speech recognition neural networks, we build a pipeline that increases the efficiency of data storage, data loading and training. We train an attention-based sequence-to-sequence model and use word error rate to evaluate the experiments. The time to reach a benchmark accuracy is another important metric used to compare the training efficiency of different systems. This work uses a dataset of around 26,000 hours that is new for speech-to-text experiments. The dataset consists of conference calls with a diverse set of speakers. Around half of the data has presentation-style audio, while the other half contains conversational language. First, the work focuses on the steps taken to make this dataset efficient for speech recognition. Then, two types of distributed training algorithms, synchronous and asynchronous training, are applied, which enables the use of multiple GPUs for stochastic gradient descent. The comparison of the different methods shows that, for the experimental setup employed in this work, synchronous training provides the best word error rate of 10.87%. This run converged in 32 hours using 4 GPUs in parallel, a 2x speed-up over the single-GPU training job in reaching the benchmark word error rate. The effective batch size plays an important role in these results. The experimental results also show that increasing the scale of the data reduces the overall training time, and hence using larger datasets is beneficial even when obtaining quick training results is an important criterion.
Description
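For readers unfamiliar with the evaluation metric used in the abstract, the short Python sketch below computes word error rate as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is a generic, illustrative implementation, not code taken from the thesis.

    def wer(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / number of reference words."""
        ref = reference.split()
        hyp = hypothesis.split()
        # Dynamic-programming (Levenshtein) edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    if __name__ == "__main__":
        # One deletion out of six reference words -> WER of about 0.167 (16.7%).
        print(wer("the cat sat on the mat", "the cat sat on mat"))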
Supervisor
Kurimo, Mikko
Thesis advisor
Rouhe, Aku
Keywords
speech recognition, deep learning, large scale, multi-GPU, attention models, distributed training