Data Representation and Kernel-based Machine Learning Methods for Speech Emotion Recognition

Presenter: Miss Fengna Wang [Email]

Abstract

This dissertation highlights potential solutions, from both the model perspective and the feature perspective, for recognizing latent human emotions from speech signals. Two sparse kernel machines, the relevance vector machine (RVM) and the relevance units machine (RUM), are proposed as recognition models for speech emotion recognition. In addition, sparse coding (SC) is employed for emotional feature representation.

The support vector machine (SVM), a popular machine learning approach, has been applied in several application domains, including speech emotion recognition. Although the SVM is theoretically sound, its model is restricted to kernel functions that satisfy the strong Mercer's condition, and the required number of support vectors (SVs) typically grows linearly with the size of the training data. To alleviate these limitations, this dissertation adopts the RVM and the RUM as alternative kernel approaches. The RVM, a Bayesian kernel method, can achieve comparable or even better performance than the SVM while providing a much sparser model. The RUM, a further extension of the RVM within the Bayesian framework, removes the constraint that relevance units (RUs) must be selected from the training samples; moreover, it treats the RUs and the kernel parameters as part of the model parameters. The RUM therefore retains all the advantages of the RVM, offers superior sparsity, and generalizes better to unseen data.

Finding an appropriate feature representation for audio data is central to speech emotion recognition. Most existing audio features rely on hand-crafted signal processing techniques. An alternative approach is to use features that are instead learned automatically.
This has the advantage of generalizing well to new data, particularly if the features are learned in an unsupervised manner. We propose using sparse coding (SC), a popular representative of this class, as a means to automatically learn features from audio data. Two SC-based frameworks are proposed for speech emotion recognition: pooling shift-invariant sparse coding (PSISC) and hierarchical sparse coding (HSC). Overall, our experiments offer insights into what makes the proposed frameworks work well on speech emotion recognition benchmarks.
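To make the feature-learning idea concrete, the following is a minimal sketch of unsupervised sparse coding with scikit-learn's `DictionaryLearning`, followed by a simple temporal max-pooling step in the spirit of the pooling described for PSISC. The random input data, the dictionary size, and all hyperparameters are illustrative assumptions, not the dissertation's actual setup.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Stand-in data: 200 frames of 64-dimensional audio features
# (e.g., spectrogram frames of one utterance). The actual work would
# learn the dictionary from frames of emotional speech corpora.
X = rng.standard_normal((200, 64))

# Jointly learn a dictionary of 32 atoms and sparse codes via
# L1-penalized (lasso) coding; hyperparameters are illustrative.
learner = DictionaryLearning(
    n_components=32,
    transform_algorithm="lasso_lars",
    transform_alpha=0.1,
    max_iter=20,
    random_state=0,
)
codes = learner.fit_transform(X)  # shape (200, 32), mostly zeros

# Max-pooling over the time axis collapses the frame-level codes into
# a single utterance-level feature vector for the emotion classifier.
utterance_feature = codes.max(axis=0)  # shape (32,)
```

Pooling over time is what gives the representation a degree of shift invariance: an emotional cue activates the same dictionary atoms regardless of where in the utterance it occurs.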
Short CV

Master's degree in Computer Application, Northwestern Polytechnical University, 2009