Browsing by Author "Yegnanarayana, Bayya"
Now showing 1 - 6 of 6
- Analysis and classification of phonation types in speech and singing voice
A1 Original article in a scientific journal (2020-04) Kadiri, Sudarsana Reddy; Alku, Paavo; Yegnanarayana, Bayya

Both in speech and singing, humans are capable of generating sounds of different phonation types (e.g., breathy, modal and pressed). Previous studies in the analysis and classification of phonation types have mainly used voice source features derived using glottal inverse filtering (GIF). Even though glottal source features are useful in discriminating phonation types in speech, their performance deteriorates in singing voice due to the high fundamental frequency of these sounds, which reduces the accuracy of source-filter separation in GIF. In the present study, features describing the glottal source were computed using three signal processing methods that do not require source-filter separation: zero frequency filtering (ZFF), zero time windowing (ZTW) and single frequency filtering (SFF). From each method, a group of scalar features was extracted. In addition, cepstral coefficients were derived from the spectra computed using ZTW and SFF. Experiments were conducted with the proposed features to analyse and classify three phonation types (breathy, modal and pressed) in speech and singing voice. Statistical pair-wise comparisons showed that most of the features were capable of separating the phonation types significantly in both speech and singing voices. Classification with support vector machine classifiers indicated that the proposed features and their combinations improved accuracy compared to commonly used glottal source features and mel-frequency cepstral coefficients (MFCCs).
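Single frequency filtering, used in this and several of the entries below, computes the amplitude envelope of a signal at one chosen frequency with high temporal resolution. A minimal numpy/scipy sketch of the commonly published SFF formulation (heterodyne the chosen frequency to fs/2, then apply a near-unit-radius single-pole filter); the pole radius r is an illustrative default, not a value from the paper:

```python
import numpy as np
from scipy.signal import lfilter

def sff_envelope(x, fs, f_k, r=0.995):
    """Amplitude envelope of x at frequency f_k (Hz) via single
    frequency filtering: shift f_k onto fs/2, then filter with
    H(z) = 1 / (1 + r*z^-1), whose pole lies at fs/2. The magnitude
    of the complex output is the envelope, sample by sample."""
    n = np.arange(len(x))
    w_shift = np.pi - 2.0 * np.pi * f_k / fs  # maps f_k to fs/2
    x_shifted = x * np.exp(1j * w_shift * n)
    y = lfilter([1.0], [1.0, r], x_shifted)   # single-pole recursion
    return np.abs(y)
```

Envelopes computed over a grid of frequencies stack into an SFF "spectrogram", from which scalar and cepstral features of the kind used in the paper can be derived.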
- Analysis of Aperiodicity in Artistic Noh Singing Voice using an Impulse Sequence Representation of Excitation Source

A1 Original article in a scientific journal (2019-12-01) Kadiri, Sudarsana; Yegnanarayana, Bayya

Aperiodicity in the voice source is caused by deviations of the vocal fold vibrations from normal quasi-periodicity, and by turbulence at the glottis. Aperiodicity appears to be one of the main properties responsible for conveying emotion in artistic voices. In this paper, the feasibility of representing the excitation source characteristics of artistic (Noh) singing voice by an impulse-like sequence in the time domain is examined. The impulses at the glottal closure instants contribute the major excitation of the vocal tract system. A sequence of such impulses produces harmonics of the fundamental frequency in the spectrum. Amplitude variation, or amplitude modulation (AM), of the impulses in the sequence contributes to the aperiodicity of the excitation and can result in the appearance of subharmonics in the spectrum. Variation in the impulse intervals, or frequency modulation (FM), can also contribute to the aperiodicity of the excitation. The aperiodic component of the excitation in the Noh voice is examined using the impulse-like sequence derived from the signal with single frequency filtering (SFF) analysis. The effects of aperiodicity are illustrated for synthetic AM and FM impulse sequences using spectrograms and saliency plots.
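The subharmonic effect of amplitude modulation described above is easy to reproduce synthetically. A small numpy sketch of our own (not the paper's SFF-based analysis): weakening every second impulse of a 200 Hz impulse train doubles the waveform period, so spectral lines appear at multiples of 100 Hz:

```python
import numpy as np

fs = 16000                      # sampling rate (Hz)
f0 = 200                        # impulse rate (Hz)
dur = 0.5                       # duration (s)
period = fs // f0

# Periodic impulse train: impulses at glottal-closure-like instants
x = np.zeros(int(dur * fs))
idx = np.arange(int(dur * f0)) * period
x[idx] = 1.0

# Amplitude modulation: alternate impulse amplitudes (period-2 pattern)
x_am = x.copy()
x_am[idx[1::2]] *= 0.6          # every second impulse weakened

# The AM spectrum shows subharmonics at f0/2 between the f0 harmonics
freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
spec = np.abs(np.fft.rfft(x))
spec_am = np.abs(np.fft.rfft(x_am))
```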
- Analysis of Instantaneous Frequency Components of Speech Signals for Epoch Extraction

A1 Original article in a scientific journal (2023-03) Kadiri, Sudarsana; Alku, Paavo; Yegnanarayana, Bayya

The major impulse-like excitation in the speech signal is due to the abrupt closure of the vocal folds, which takes place at the glottal closure instant (GCI), or epoch, in each cycle. GCIs are used in many areas of speech science and technology, such as prosody modification, voice source analysis, formant extraction and speech synthesis. It is difficult to observe these discontinuities (corresponding to GCIs) in the speech signal because of the superimposed time-varying response of the vocal tract system. This paper examines the phase part of different frequency components of the speech signal to extract epochs. Three analysis methods for decomposing the speech signal into frequency components are considered: the short-time Fourier transform (STFT), narrow bandpass filtering (NBPF), and single frequency filtering (SFF). The locations of the discontinuities in the speech signal are obtained from the instantaneous frequency (IF), i.e., the time derivative of the phase, of each frequency component. A method for automatic detection of epochs using the amplitude-weighted IF is proposed. The performance of the proposed epoch detection method is compared with four state-of-the-art methods on clean and telephone-quality speech: it is comparable with existing epoch detection methods for clean speech but better for telephone-quality speech.
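The core quantity here, instantaneous frequency as the time derivative of the phase, can be sketched for one narrowband component via the analytic signal. A generic illustration under our own assumptions (a Butterworth bandpass with centre f_c and bandwidth bw stands in for the STFT/NBPF/SFF decompositions the paper actually uses):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def instantaneous_frequency(x, fs, f_c, bw=100.0):
    """Instantaneous frequency (Hz) of the narrowband component of x
    around centre frequency f_c, from the phase of the analytic signal."""
    lo = (f_c - bw / 2) / (fs / 2)
    hi = (f_c + bw / 2) / (fs / 2)
    b, a = butter(4, [lo, hi], btype="band")
    x_nb = filtfilt(b, a, x)                 # narrowband component
    phase = np.unwrap(np.angle(hilbert(x_nb)))
    # IF is the time derivative of the phase, scaled to Hz; GCIs show
    # up as sharp deviations in such IF tracks
    return np.diff(phase) * fs / (2.0 * np.pi)
```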
- Comparison of glottal closure instants detection algorithms for emotional speech

A4 Article in conference proceedings (2020-05) Kadiri, Sudarsana; Alku, Paavo; Yegnanarayana, Bayya

In the production of voiced speech, epochs or glottal closure instants (GCIs) refer to the instants of significant excitation of the vocal tract. Extraction of GCIs is used as a pre-processing stage in many areas of speech technology, such as prosody modification, speech synthesis and voice source analysis. In the past decades, several GCI detection algorithms have been developed, and most of them provide excellent results for speech produced with the modal (normal) type of phonation. There are, however, no studies comparing multiple state-of-the-art GCI detection methods on emotional speech. In this paper, we compare six GCI detection algorithms on emotional speech using known evaluation metrics. We use the Berlin EMO-DB acted emotional speech database, which contains seven emotions and simultaneous electroglottography (EGG) recordings as ground truth. The results show that all six GCI detection algorithms perform best on speech of neutral emotion and that performance degrades particularly in emotions of high arousal (anger and joy). To improve GCI detection in emotional speech, the study underlines the importance of local average pitch period estimates.
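Evaluation in this line of work is typically cycle-based: each larynx cycle between consecutive EGG-derived reference GCIs should contain exactly one detection. An illustrative sketch of such scoring, with function and metric names assumed by us rather than taken from the paper:

```python
import numpy as np

def gci_scores(ref_gcis, det_gcis):
    """Cycle-based GCI scoring in the spirit of the standard
    identification / miss / false-alarm rates (names illustrative).
    ref_gcis, det_gcis: sorted sample indices of reference and
    detected GCIs."""
    det = np.asarray(det_gcis)
    hits, misses, false_alarms, errors = 0, 0, 0, []
    for i in range(len(ref_gcis) - 1):
        lo, hi = ref_gcis[i], ref_gcis[i + 1]
        in_cycle = det[(det >= lo) & (det < hi)]  # detections in this cycle
        if len(in_cycle) == 1:
            hits += 1
            errors.append(in_cycle[0] - lo)       # timing error vs. reference
        elif len(in_cycle) == 0:
            misses += 1
        else:
            false_alarms += 1
    n = max(hits + misses + false_alarms, 1)
    return {"identification_rate": hits / n,
            "miss_rate": misses / n,
            "false_alarm_rate": false_alarms / n,
            "timing_std_samples": float(np.std(errors)) if errors else float("nan")}
```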
- Detection of glottal closure instant and glottal open region from speech signals using spectral flatness measure

A1 Original article in a scientific journal (2020-01) Kadiri, Sudarsana; Prasad, Ravi Shankar; Yegnanarayana, Bayya

This paper proposes an approach using the spectral flatness measure to detect the glottal closure instant (GCI) and the glottal open region (GOR) within each glottal cycle in voiced speech. The spectral flatness measure is derived from the instantaneous spectra obtained by analysing speech with the single frequency filtering (SFF) and zero time windowing (ZTW) methods. The Hilbert envelope of the numerator of the group delay (HNGD) spectrum at each instant of time is obtained using the ZTW method. The HNGD spectrum highlights the important spectral characteristics of the vocal tract system (such as resonances) at each instant of time. The dynamic characteristics of the vocal tract system can be tracked by the spectral flatness of the HNGD spectrum, bringing out the behaviour of the vocal tract system when the subglottal region is coupled with the supraglottal region during the open phase of the glottal cycle. The SFF spectra change significantly at the location of the GCI, so GCIs can be detected from changes in the spectral flatness derived from the SFF spectra. The proposed methods for detecting the GCI and GOR are compared with several existing methods.
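Spectral flatness itself has a standard definition: the ratio of the geometric to the arithmetic mean of the power spectrum, near 1 for flat (noise-like) spectra and near 0 for peaky, resonance-dominated spectra. A minimal sketch over a generic magnitude spectrum (the paper applies it instant by instant to SFF and HNGD spectra):

```python
import numpy as np

def spectral_flatness(mag_spectrum, eps=1e-12):
    """Spectral flatness measure: geometric mean / arithmetic mean of
    the power spectrum. eps guards against log(0) in silent bins."""
    p = np.asarray(mag_spectrum, dtype=float) ** 2 + eps
    geometric = np.exp(np.mean(np.log(p)))
    arithmetic = np.mean(p)
    return geometric / arithmetic
```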
- Extraction and Utilization of Excitation Information of Speech: A Review

A2 Review article in a scientific journal (2021-12) Kadiri, Sudarsana; Alku, Paavo; Yegnanarayana, Bayya

Speech production can be regarded as a process in which a time-varying vocal tract system (the filter) is excited by a time-varying excitation. In addition to its linguistic message, the speech signal carries information about, for example, the gender and age of the speaker, as well as acoustical cues about several speaker traits, such as the emotional state and the state of health of the speaker. In order to understand how these acoustical cues are produced by the human speech production mechanism, and to utilize this information in speech technology, it is necessary to extract features describing both the excitation and the filter. While methods to estimate and parameterize the vocal tract system are well established, the excitation has been studied less. This article reviews signal processing approaches for the extraction of excitation information from speech, and highlights the importance of excitation information in the analysis and classification of phonation types and vocal emotions, in the analysis of nonverbal laughter sounds, and in studying pathological voices. Furthermore, recent developments in deep learning techniques for extracting and utilizing excitation information are discussed.
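As a concrete instance of the source-filter separation the review builds on, the classical linear prediction residual is the most familiar excitation estimate. A short sketch of LP inverse filtering (our illustration of the textbook autocorrelation method, not an algorithm specific to this article):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(frame, order=12):
    """Excitation estimate via linear prediction: model the vocal tract
    as an all-pole filter, then remove it by inverse filtering."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1:]  # autocorrelation
    # Normal equations R a = r for the predictor coefficients a_1..a_p
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    # Inverse filter A(z) = 1 - sum_k a_k z^-k applied to the raw frame;
    # the output is the LP residual, an estimate of the excitation
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
```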