Browsing by Author "Kodali, Manila"
Item: Automatic classification of the severity level of Parkinson’s disease: A comparison of speaking tasks, features, and classifiers (Academic Press, 2023-10)
Kodali, Manila; Kadiri, Sudarsana; Alku, Paavo; Department of Information and Communications Engineering; Speech Communication Technology
Automatic speech-based severity level classification of Parkinson’s disease (PD) enables objective assessment and earlier diagnosis. While many studies have been conducted on the binary classification task of distinguishing speakers with PD from healthy controls (HCs), clearly fewer studies have addressed multi-class PD severity level classification. Furthermore, in studying the three main issues of speech-based classification systems (speaking tasks, features, and classifiers), previous investigations on severity level classification have yielded inconclusive results because each study used only a few, and sometimes just one, type of speaking task, feature, or classifier. Hence, a systematic comparison is conducted in this study between different speaking tasks, features, and classifiers. Five speaking tasks (vowel task, sentence task, diadochokinetic (DDK) task, read text task, and monologue task), four features (phonation, articulation, prosody, and their fusion), and four classifier architectures (support vector machine (SVM), random forest (RF), multilayer perceptron (MLP), and AdaBoost) were compared. The classification task studied was a 3-class problem: classifying PD severity level as healthy vs. mild vs. severe. Two MDS-UPDRS scales (MDS-UPDRS-III and MDS-UPDRS-S) were used for the ground-truth severity level labels. The results showed that the use of the monologue task and of the articulation and fusion features improved classification accuracy significantly compared to the other speaking tasks and features. The best classification systems achieved an accuracy of 58% (monologue task with the articulation features) for the MDS-UPDRS-III scale and 56% (monologue task with the fusion of features) for the MDS-UPDRS-S scale.
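As a rough illustration of the classifier comparison described above, the sketch below cross-validates the four classifier families on a 3-class problem with scikit-learn. The feature matrix and labels are random placeholders; the study's phonation, articulation, and prosody features and its evaluation protocol are not reproduced here.

```python
# Minimal sketch of a 3-class severity classifier comparison (not the authors'
# exact pipeline). Placeholder features stand in for per-speaker phonation,
# articulation, and prosody descriptors; labels are the three severity classes.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 88))      # placeholder: one feature vector per speaker
y = rng.integers(0, 3, size=120)    # 0 = healthy, 1 = mild, 2 = severe

classifiers = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
    "RF": RandomForestClassifier(n_estimators=300, random_state=0),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000)),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, random_state=0),
}

for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy {acc:.2f}")
```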
Item: Automatic Classification of Vocal Intensity Category from Speech (2021-12-13)
Kodali, Manila; Kadiri, Sudarsana; Sähkötekniikan korkeakoulu; Alku, Paavo
Vocal intensity regulation is a fundamental phenomenon in speech communication. In speech science, vocal intensity refers to the acoustic energy of speech, and it is quantified by sound pressure level (SPL). Unlike, for example, loudspeaker amplifiers, which adjust sound intensity by affecting only the gain, the regulation of intensity in speech is much more complex and challenging because it is based on the physiological speech production mechanism. The speech signal carries acoustical cues about the vocal intensity category/SPL that the speaker used when the corresponding speech signal was produced. Due to the lack of proper calibration information in existing speech databases, it is not possible to estimate the true vocal intensity category/SPL used in recordings. In addition, there is only one previous study on the automatic classification of vocal intensity category. In the current study, a large speech database representing four vocal intensity categories (soft, normal, loud, and very loud) was recorded from 50 speakers, with calibration information included. Two automatic machine learning-based classification systems were developed using Support Vector Machines (SVMs) and Convolutional Neural Networks (CNNs), with Mel-Frequency Cepstral Coefficients (MFCCs) as features. The results show that the best classification accuracy (about 65%) was obtained using the SVM classifier.

Item: Automatic classification of vocal intensity category from speech (2023)
Kodali, Manila; Kadiri, Sudarsana; Laaksonen, Laura; Alku, Paavo; Department of Information and Communications Engineering; Speech Communication Technology
Regulation of vocal intensity is a fundamental phenomenon in speech communication. Vocal intensity can be quantified using sound pressure level (SPL), which can be measured easily by recording a standard calibration signal together with speech and by comparing the energy of the recorded speech signal with that of the calibration tone. Unfortunately, speech recordings are mostly conducted without the SPL calibration signal, and speech signals are saved to databases using arbitrary amplitude scales. Therefore, neither the SPL nor the intensity category (e.g. soft or loud phonation) of a saved speech signal can be determined afterwards. Even though the original level information of speech is lost when the signal is presented on an arbitrary amplitude scale, the speech signal contains other acoustic cues of vocal intensity. In the current study, we study machine learning and deep learning based methods for automatic classification of the vocal intensity category when the input speech is expressed on an arbitrary amplitude scale. A new gender-balanced database consisting of speech produced in four vocal intensity categories (soft, normal, loud, and very loud) was first recorded. Support vector machine and deep neural network (DNN) models were used to develop automatic classification systems using spectrograms, mel-spectrograms, and mel-frequency cepstral coefficients as features. The DNN classifier using the mel-spectrogram showed the best classification accuracy of about 90%. The database is made publicly available at https://bit.ly/3tLPGRx
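A minimal sketch of the feature extraction step assumed in the two intensity-classification studies above: an amplitude-normalised signal is converted to a log mel-spectrogram (for a DNN/CNN) and to MFCCs (for an SVM). The signal below is synthetic noise and the frame and filterbank settings are illustrative, not the papers' exact configurations.

```python
# Sketch: mel-spectrogram and MFCC features from a peak-normalised signal.
import numpy as np
import librosa

rng = np.random.default_rng(0)
sr = 16000
wav = rng.standard_normal(sr).astype(np.float32)   # stand-in for one recorded utterance
wav = wav / np.max(np.abs(wav))                    # arbitrary amplitude scale (peak = 1.0)

mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=512, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)                 # 2-D input for a DNN/CNN classifier
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)
mfcc_vec = mfcc.mean(axis=1)                       # utterance-level vector, e.g. for an SVM

print(log_mel.shape, mfcc_vec.shape)
```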
Item: AVID: A speech database for machine learning studies on vocal intensity (Elsevier, 2024-02)
Alku, Paavo; Kodali, Manila; Laaksonen, Laura; Kadiri, Sudarsana; Department of Information and Communications Engineering; Speech Communication Technology; Huawei Technologies
Vocal intensity, which is typically quantified with the sound pressure level (SPL), is a key feature of speech. To measure SPL from speech recordings, a standard calibration tone (with a reference SPL of 94 dB or 114 dB) needs to be recorded together with speech. However, most of the popular databases used in areas such as speech and speaker recognition have been recorded without calibration information, expressing speech on arbitrary amplitude scales. Therefore, information about the vocal intensity of the recorded speech, including SPL, is lost. In the current study, we introduce a new open and calibrated speech/electroglottography (EGG) database named the Aalto Vocal Intensity Database (AVID). AVID includes speech and EGG produced by 50 speakers (25 males, 25 females) who varied their vocal intensity in four categories (soft, normal, loud and very loud). Recordings were conducted using a constant mouth-to-microphone distance and by recording a calibration tone. The speech data was labelled sentence-wise using a total of 19 labels that support the utilisation of the data in machine learning (ML) based studies of vocal intensity using supervised learning. In order to demonstrate how the AVID data can be used to study vocal intensity, we investigated one multi-class classification task (classification of speech into soft, normal, loud and very loud intensity classes) and one regression task (prediction of the SPL of speech). In both tasks, we deliberately warped the level of the input speech by normalising the signal to have its maximum amplitude equal to 1.0, that is, we simulated a scenario that is prevalent in current speech databases. The results show that using the spectrogram feature with the support vector machine classifier gave an accuracy of 82% in the multi-class classification of the vocal intensity category. In the prediction of SPL, using the spectrogram feature with the support vector regressor gave a mean absolute error of about 2 dB and a coefficient of determination of 92%. We welcome researchers interested in classification and regression problems to utilise AVID in the study of vocal intensity, and we hope that the current results could serve as baselines for future ML studies on the topic.
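The calibration principle behind AVID can be illustrated with a short worked example: because the calibration tone has a known reference level, the SPL of speech recorded on the same channel follows from the RMS ratio of the two signals. The sketch below uses synthetic signals and a 94 dB reference; it illustrates the idea and is not the authors' measurement code.

```python
# Sketch: SPL from a calibration tone, plus the peak normalisation used to
# discard level information before classification/regression.
import numpy as np

REF_SPL_DB = 94.0                                   # nominal SPL of the calibrator tone
sr = 16000

t = np.arange(sr) / sr
cal_tone = 0.2 * np.sin(2 * np.pi * 1000 * t)       # recorded 1 kHz calibration tone
speech = 0.05 * np.random.randn(3 * sr)             # stand-in for a recorded utterance

def rms(x):
    return np.sqrt(np.mean(x ** 2))

# Speech SPL follows from the RMS ratio between speech and the reference tone.
speech_spl = REF_SPL_DB + 20 * np.log10(rms(speech) / rms(cal_tone))
print(f"estimated speech SPL: {speech_spl:.1f} dB")

# Peak normalisation (maximum absolute amplitude scaled to 1.0), simulating
# speech stored on an arbitrary amplitude scale.
speech_normalised = speech / np.max(np.abs(speech))
```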
Item: Classification of Vocal Intensity Category from Multi-sensor Recordings of Speech (2023-06-12)
Ylä-Jääski, Juho; Kodali, Manila; Reddy Kadiri, Sudarsana; Perustieteiden korkeakoulu; Alku, Paavo
Vocal intensity is a crucial characteristic of speech. Intensity is regulated to express emotions and to propagate speech over longer distances. The regulation process is complex and affects many spectral and temporal characteristics of speech. Including vocal intensity information enhances the usefulness of paralinguistic data sets; however, most speech data sets lack this information. In this study, a large speech data set of 50 speakers was created, with each speaker repeating 25 sentences in five intensity categories: whisper, soft, normal, loud, and very loud. Calibration information was included. The recordings involved seven sensors, including both air-conducting (AC) and bone-conducting (BC) sensors. The calibration data enabled an accurate computation of the sound pressure level (SPL), which can be used to quantify the vocal intensity of the speaker. Two different labeling methods were developed: subjective labels of the intensity categories follow the individual interpretation of the speaker, while objective labels are based on the computed SPL values. The intensity information of the speech samples was deliberately removed before conducting the experiments by normalizing each spoken sentence, simulating a scenario where speech is represented on an arbitrary amplitude scale. This study explores four classifiers: 1D-CNN, SVM, MLP, and 2D-CNN. The SVM model appeared to outperform the other models when utilizing subjective labels, whereas objective labels yielded better results with the MLP and 2D-CNN models. The microphone positioned outside the headset (MC2) and the voice pickup sensor for bone conduction (VPU) produced the best results. Interestingly, the classifiers were able to predict intensity categories using BC speech data alone. Furthermore, the performance of the models was enhanced by the use of objective labels, as evidenced by an accuracy difference of 12% between the best models using subjective and objective labels. Moreover, multi-sensor models yielded better results than single-sensor models. The combination of the MC2 and VPU sensors with the MLP model yielded the best performance, achieving an accuracy of 80%.

Item: Classification of vocal intensity category from speech using the wav2vec2 and whisper embeddings (International Speech Communication Association, 2023)
Kodali, Manila; Kadiri, Sudarsana; Alku, Paavo; Department of Communications and Networking; Department of Information and Communications Engineering; Speech Communication Technology
In speech communication, talkers regulate vocal intensity, resulting in speech signals of different intensity categories (e.g., soft, loud). The intensity category carries important information about the speaker's health and emotions. However, many speech databases lack calibration information, and therefore the sound pressure level cannot be measured from the recorded data. Machine learning, however, can be used for intensity category classification even when calibration information is not available. This study investigates pre-trained model embeddings (wav2vec2 and Whisper) in the classification of vocal intensity category (soft, normal, loud, and very loud) from speech signals expressed on arbitrary amplitude scales. We use a new database consisting of two speaking tasks (sentence and paragraph). A support vector machine is used as the classifier. Our results show that the pre-trained model embeddings outperformed three baseline features, providing improvements of up to 7% (absolute) in accuracy.

Item: Comparing 1-dimensional and 2-dimensional spectral feature representations in voice pathology detection using machine learning and deep learning classifiers (International Speech Communication Association, 2022-09)
Javanmardi, Farhad; Kadiri, Sudarsana; Kodali, Manila; Alku, Paavo; Department of Signal Processing and Acoustics; Speech Communication Technology
The present study investigates the use of 1-dimensional (1-D) and 2-dimensional (2-D) spectral feature representations in voice pathology detection with several classical machine learning (ML) and recent deep learning (DL) classifiers. Four popular spectral feature representations (static mel-frequency cepstral coefficients (MFCCs), dynamic MFCCs, spectrogram and mel-spectrogram) are derived in both 1-D and 2-D form from voice signals. Three widely used ML classifiers (support vector machine (SVM), random forest (RF) and AdaBoost) and three DL classifiers (deep neural network (DNN), long short-term memory (LSTM) network, and convolutional neural network (CNN)) are used with the 1-D feature representations. In addition, CNN classifiers are built using the 2-D feature representations. The widely used HUPA database is considered in the pathology detection experiments. Experimental results revealed that using the CNN classifier with the 2-D feature representations yielded better accuracy compared to using the ML and DL classifiers with the 1-D feature representations. The best performance was achieved using the 2-D CNN classifier based on dynamic MFCCs, which showed a detection accuracy of 81%.
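To make the 1-D vs. 2-D distinction in the previous item concrete, the sketch below derives static and dynamic MFCCs with librosa and forms both a time-averaged 1-D vector and a 2-D time-frequency matrix suitable for a CNN. The signal is synthetic and the settings are illustrative rather than the paper's exact configuration.

```python
# Sketch: static vs. dynamic MFCCs in 1-D (vector) and 2-D (matrix) form.
import numpy as np
import librosa

rng = np.random.default_rng(0)
sr = 16000
wav = rng.standard_normal(2 * sr).astype(np.float32)   # stand-in for a 2-second voice sample

mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)   # static MFCCs, shape (13, T)
delta = librosa.feature.delta(mfcc)                    # first time derivative
delta2 = librosa.feature.delta(mfcc, order=2)          # second time derivative
dynamic = np.vstack([mfcc, delta, delta2])             # dynamic MFCCs, shape (39, T)

feat_1d = dynamic.mean(axis=1)      # 1-D representation: one 39-dim vector per sample
feat_2d = dynamic[np.newaxis, :]    # 2-D representation (with a channel axis) for a CNN
print(feat_1d.shape, feat_2d.shape)
```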
Item: Motion pattern recognition in 4D point clouds (2020-09)
Salami, Dariush; Palipana, Sameera; Kodali, Manila; Sigg, Stephan; Ambient Intelligence; Department of Communications and Networking
We address an actively discussed problem in signal processing: recognizing patterns from spatial data in motion. In particular, we suggest a neural network architecture to recognize motion patterns from 4D point clouds. We demonstrate the feasibility of our approach with point cloud datasets of hand gestures. The architecture, PointGest, directly feeds on unprocessed timelines of point cloud data without any need for voxelization or projection. The model is resilient to noise in the input point cloud through abstraction to lower-density representations, especially for regions of high density. We evaluate the architecture on a benchmark dataset with ten gestures. PointGest achieves an accuracy of 98.8%, outperforming five state-of-the-art point cloud classification models.

Item: Severity classification of Parkinson's disease from speech using single frequency filtering-based features (International Speech Communication Association, 2023)
Kadiri, Sudarsana; Kodali, Manila; Alku, Paavo; Department of Information and Communications Engineering; Speech Communication Technology
Developing objective methods for assessing the severity of Parkinson's disease (PD) is crucial for improving diagnosis and treatment. This study proposes two sets of novel features derived from the single frequency filtering (SFF) method for the severity classification of PD: (1) SFF cepstral coefficients (SFFCC) and (2) MFCCs derived from SFF (MFCC-SFF). Prior studies have demonstrated that SFF offers greater spectrotemporal resolution than the short-time Fourier transform. The study uses the PC-GITA database, which includes speech of PD patients and healthy controls produced in three speaking tasks (vowels, sentences, text reading). Experiments using the SVM classifier revealed that the proposed features outperformed the conventional MFCCs in all three speaking tasks. The proposed SFFCC and MFCC-SFF features gave relative improvements of 5.8% and 2.3% for the vowel task, 7.0% and 1.8% for the sentence task, and 2.4% and 1.1% for the read text task, compared to the MFCC features.
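The SFF features above build on single frequency filtering, which computes an amplitude envelope at each analysis frequency by frequency-shifting the signal and passing it through a single-pole filter whose pole lies close to the unit circle. The sketch below follows one common formulation of SFF; the pole radius, frequency grid, and cepstral post-processing are assumptions and not the paper's exact recipe.

```python
# Rough sketch of SFF envelopes and cepstral-style coefficients (assumed steps).
import numpy as np
from scipy.signal import lfilter
from scipy.fftpack import dct

def sff_envelopes(x, fs, freqs, r=0.995):
    """Return the SFF amplitude envelope for each analysis frequency (Hz)."""
    n = np.arange(len(x))
    envs = []
    for f in freqs:
        w = np.pi - 2 * np.pi * f / fs        # shift f towards the pole at z = -r
        shifted = x * np.exp(1j * w * n)      # complex frequency shift
        y = lfilter([1.0], [1.0, r], shifted) # single-pole filter near the unit circle
        envs.append(np.abs(y))                # amplitude envelope at this frequency
    return np.array(envs)                     # shape (num_freqs, num_samples)

fs = 16000
x = np.random.randn(fs)                       # stand-in for a speech segment
freqs = np.linspace(100, 7900, 40)
env = sff_envelopes(x, fs, freqs)

# Cepstral-style coefficients: log-compress and decorrelate across frequency.
sffcc = dct(np.log(env.mean(axis=1) + 1e-8), norm="ortho")[:13]
print(env.shape, sffcc.shape)
```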
Item: Towards battery-less RF sensing (IEEE, 2021-05-25)
Kodali, Manila; Nguyen, Le Ngu; Sigg, Stephan; Department of Signal Processing and Acoustics; Department of Communications and Networking; Ambient Intelligence
Recent work has demonstrated the use of the radio interface as a sensing modality for gestures, activities and situational perception. The field is generally moving towards larger bandwidths, multiple antennas, and higher (mmWave) frequency domains, which allow for the recognition of minute movements. We envision another set of applications for RF sensing: battery-less autonomous sensing devices. In this work, we investigate transceiver-less passive RF sensors which are excited by fluctuations of the received power over the wireless channel. In particular, we demonstrate the use of battery-less RF sensing for on-body gesture recognition integrated into smart garments, as well as the integration of such sensing capabilities into smart surfaces.

Item: Utilizing WAV2VEC in database-independent voice disorder detection (2023)
Tirronen, Saska; Javanmardi, Farhad; Kodali, Manila; Kadiri, Sudarsana; Alku, Paavo; Speech Communication Technology; Department of Information and Communications Engineering
Automatic detection of voice disorders from acoustic speech signals can help to improve the reliability of medical diagnosis. However, the real-life environment in which speech signals are recorded for diagnosis can differ from the environment in which the detection system's training data was originally collected. This mismatch between recording conditions can decrease detection performance in practical scenarios. In this work, we propose to use a pre-trained wav2vec 2.0 model as a feature extractor to build automatic detection systems for voice disorders. The embeddings from the first layers of the context network contain information about phones, and these features are useful in voice disorder detection. We evaluate the performance of the wav2vec features in single-database and cross-database scenarios to study their generalizability to unseen speakers and recording conditions. The results indicate that the wav2vec features generalize better than popular spectral and cepstral baseline features.
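A minimal sketch of using wav2vec 2.0 as a frozen feature extractor, in the spirit of the study above. The checkpoint name, layer index, pooling, and placeholder data are assumptions; the study reports that embeddings from early context-network layers are useful for disorder detection.

```python
# Sketch: mean-pooled early-layer wav2vec 2.0 embeddings fed to an SVM.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.svm import SVC

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def wav2vec_embedding(wav, sr=16000, layer=1):
    """Mean-pooled hidden state of an early transformer layer for one utterance."""
    inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0).numpy()

# Placeholder utterances and labels (0 = healthy, 1 = disordered); real use would
# load the healthy/pathological speech databases instead.
waves = [np.random.randn(16000).astype(np.float32) for _ in range(8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

X = np.stack([wav2vec_embedding(w) for w in waves])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:2]))
```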