PhD Thesis: Disproving Visemes As The Basic Visual Unit Of Speech

Phonemes are the standard audio unit of speech. They are the smallest segment of sound which, if replaced with another, can change the meaning of a word. In visual speech recognition, visemes have commonly been used as the basic visual unit of speech. There is a many-to-one mapping from phonemes to visemes, with the phonemes grouped into a single viseme considered visually indistinguishable from one another.
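As a concrete illustration, such a many-to-one mapping can be expressed as a simple lookup table. The minimal sketch below uses three well-known classes (bilabial, labiodental, dental); it is illustrative only, and not one of the specific groupings examined in the thesis.

PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "th": "dental", "dh": "dental",
}

def to_visemes(phonemes):
    # Collapse a phoneme transcript into its (smaller alphabet) viseme form.
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# "bat" and "mat" begin with different phonemes but the same viseme:
assert PHONEME_TO_VISEME["b"] == PHONEME_TO_VISEME["m"]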

While visemes are widely used, a number of doubts about their suitability for visual speech recognition have never been examined. As computing technology advances, the justification for creating viseme groupings diminishes. For visemes to remain suitable for use, they must provide benefits over phonemes as the visual unit of speech. This thesis identifies the three characteristics a suitable viseme grouping must possess.

In this thesis, a visual speech recogniser is constructed to test the validity of visemes. A novel energy method, known as "wrapping snakes", is developed to extract lip shapes from standard video datasets of people speaking. Taking the resulting sequence of lip shapes as input, a Hidden Markov Model (HMM) based recogniser performs the speech recognition and outputs a phoneme transcript.
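For orientation, the sketch below shows the classical Kass-style active contour ("snake") energy that methods of this kind minimise over candidate contours. It is only the standard formulation; the thesis's "wrapping snakes" energy is a novel variant and is not reproduced here, and the function names and weights are assumptions for illustration.

import numpy as np

def snake_energy(pts, edge_map, alpha=0.1, beta=0.1):
    # Internal energy of a closed contour: elasticity (first differences)
    # plus curvature (second differences), weighted by alpha and beta.
    # np.roll gives the wrap-around needed for a closed lip contour.
    d1 = np.roll(pts, -1, axis=0) - pts
    d2 = np.roll(pts, -1, axis=0) - 2 * pts + np.roll(pts, 1, axis=0)
    internal = alpha * np.sum(d1 ** 2) + beta * np.sum(d2 ** 2)
    # External (image) energy: low values of edge_map attract the contour,
    # e.g. edge_map = -gradient_magnitude(frame).
    rows = pts[:, 0].astype(int)
    cols = pts[:, 1].astype(int)
    external = float(np.sum(edge_map[rows, cols]))
    return internal + external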

Examining the phoneme output of the recogniser shows that none of the three required characteristics is present in any existing viseme grouping, and further that it is not possible to construct a grouping that exhibits them. This conclusively proves that it is phonemes, and not visemes, that should be used as the basic visual unit of speech.

Download the full thesis: Disproving Visemes As The Basic Visual Unit Of Speech

Appendices And Supplementary Material

Appendix A - Neural Network Performance For Various Network Configurations

The neural network (see Chapter 3) performance was evaluated for a number of network configurations. The number of neurons in the first layer was tested over the range of 6 to 20; the second layer was tested over the range of 6 to 15. The third layer was fixed at a single neuron, as the network requires only a single output.
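A minimal sketch of the configuration grid implied by these ranges (the variable names are hypothetical, but the count matches the 150 configurations reported in Appendix A):

from itertools import product

# 15 first-layer sizes (6-20) x 10 second-layer sizes (6-15),
# with a fixed single output neuron.
configs = [(n1, n2, 1) for n1, n2 in product(range(6, 21), range(6, 16))]
assert len(configs) == 150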

The network performance was evaluated for all combinations of neuron numbers, using one frame for each subject in the CUAVE dataset. Performance was measured as the mean squared error (MSE) over the lip and skin regions of the output only, with the manual labels used as the reference.
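A minimal sketch of how such a region-restricted MSE could be computed, assuming per-pixel network outputs, manual labels, and a boolean mask selecting the lip and skin pixels (all names are hypothetical):

import numpy as np

def region_mse(output, labels, mask):
    # MSE restricted to the pixels selected by the boolean mask
    # (here, the lip and skin regions), with manual labels as reference.
    diff = output[mask] - labels[mask]
    return float(np.mean(diff ** 2))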

The results can be found here: Appendix A - neural network performance. The file lists the MSE for each of the 36 subjects, for all 150 network configurations.

This data is used in Chapter 3: "Lip Pixel Classification", to empirically determine the best network configuration.

Appendix B - Corrected Labels For The CUAVE Dataset

The full set of corrected labels for the CUAVE dataset can be found here: Appendix B - corrected CUAVE labels.

This data is used in Chapter 3: "Lip Pixel Classification".

Appendix C - Phoneme Trustworthiness

The full spreadsheet of phoneme trustworthiness can be found here: Appendix C - phoneme trustworthiness.

The "summary (rearranged)" worksheet lists, for each phoneme, the likelihood of each operation causing that phoneme to appear in the recogniser output. It also gives the cumulative likelihood that any of the operations listed up to that point caused the phoneme to appear.

For example, the /dh/ phoneme lists "(ins)" as 14.9% and /dh/ as 11.4%. This means there is a 14.9% chance that an insertion caused /dh/ to appear, and an 11.4% chance that /dh/ in the input caused /dh/ in the output. The cumulative column shows a 26.3% chance that either an insertion or /dh/ in the input caused /dh/ to appear in the output.
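The cumulative column is simply a running sum of the per-operation likelihoods; a minimal sketch using the figures quoted above (values hardcoded for illustration):

# Per-operation likelihoods for /dh/ appearing in the output.
dh_causes = {"(ins)": 0.149, "/dh/": 0.114}

# Cumulative likelihood over the operations listed so far.
cumulative = sum(dh_causes.values())
print(f"{cumulative:.1%}")  # -> 26.3%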

This data is used in Section 7.1: "Phoneme Trustworthiness".