
Speech Production - Literature review Example


Summary

This paper, 'Speech Production', explains that speech sounds fall into two main classes: voiced and unvoiced. Voiced sounds are distinguishable from unvoiced sounds because they are periodic and have a harmonic structure, which unvoiced sounds lack…


Subject: English
Type: Literature review
Level: Ph.D.
Pages: 59 (14750 words)

Extract of sample "Speech Production"


The sentence used in this example is "she had your dark suit in greasy wash water all year." In Figure 1.0, both the voiced and unvoiced sections of the word "she" are highlighted. One of the most important things to note is that the spectrum for the unvoiced /sh/ is flat, which is characteristic of noise. In contrast, the spectrum for the voiced /iy/ displays a harmonic structure, which appears as alternating bands.

Figure 1.0: Spectrogram and waveform of narrowband speech for the sentence "she had your dark suit in greasy wash water all year".

The vocal tract of a human being plays the most important part in the production of speech.

The process of producing speech starts when a flow of air originating from the lungs passes through the vocal cords, generating excitation (Campbell 65). The opening of the glottis is the main determinant of the excitation type and therefore of whether the speech produced is voiced or unvoiced. Turbulence in the vocal tract is the main cause of unvoiced speech, while quasi-periodic excitation causes voiced speech. The vocal tract can be approximated by a series of acoustic tubes that produce characteristic resonances, commonly referred to as formant frequencies.

The shape and length of the vocal tract differ between speakers, so formant frequencies are speaker-dependent. The vocal tract length is the distance from the lips to the glottis. The source-filter model is a model of speech production that is used frequently in speech processing (Kondoz 67; Neuburg and Bauer 90). As discussed by Jancovic and Kokuer (195), it is a simple mathematical model that is commonly used in automatic speech recognition, coding, and synthesis.

In the block diagram of this model, the excitation-source signal is shown as either a periodic impulse train with an adjustable pitch frequency or white Gaussian noise. The source-filter model is frequently employed in speech processing (Neuburg and Bauer, 1986; Kondoz, 2004), and this simple mathematical model finds use in automatic speech recognition, speech coding, and speech synthesis (e.g., Jancovic and Kokuer, 2009; Schroeder and Atal, 1985; Acero, 1998).


A block diagram of the source-filter model of speech production is shown in Fig. 2.2. Here, the excitation-source signal is either a periodic impulse train, with adjustable pulse (pitch) frequency, or white Gaussian noise. The former is used for generation of voiced sounds and the latter for generation of unvoiced sounds. The speech signal is produced by passing the excitation-source signal through a linear filter which models vocal tract resonances (i.e., formants). The intensity of the produced speech sounds is controlled through a gain applied to the excitation-source signal.
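The signal flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the cited works: the sampling rate, the pitch, the two formant frequencies and bandwidths, and all function names are assumptions chosen for the example. The linear filter is built here as a cascade of two-pole resonators, which is one common way to realise an all-pole vocal tract filter.

```python
import math
import random

# Hypothetical (formant frequency, bandwidth) pairs in Hz, for illustration only.
FORMANTS = ((500.0, 80.0), (1500.0, 120.0))


def excitation(n_samples, voiced, pitch_hz=100.0, fs=8000, seed=0):
    """Excitation source: periodic impulse train (voiced) or white Gaussian noise (unvoiced)."""
    if voiced:
        period = round(fs / pitch_hz)  # samples between consecutive pitch pulses
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def resonator_coeffs(formant_hz, bandwidth_hz, fs):
    """Feedback coefficients (a1, a2) of a two-pole resonator modelling one formant."""
    radius = math.exp(-math.pi * bandwidth_hz / fs)  # pole radius < 1 keeps the filter stable
    angle = 2.0 * math.pi * formant_hz / fs
    return -2.0 * radius * math.cos(angle), radius * radius


def all_pole_filter(x, a1, a2):
    """Apply y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample - a1 * y1 - a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


def synthesize(n_samples, voiced, gain=1.0, pitch_hz=100.0, fs=8000, formants=FORMANTS):
    """Scale the excitation by the gain, then pass it through the formant filters."""
    source = excitation(n_samples, voiced, pitch_hz, fs)
    signal = [gain * s for s in source]
    for formant_hz, bandwidth_hz in formants:
        a1, a2 = resonator_coeffs(formant_hz, bandwidth_hz, fs)
        signal = all_pole_filter(signal, a1, a2)
    return signal
```

Calling `synthesize(800, voiced=True)` produces 100 ms of a vowel-like sound at the assumed 8 kHz sampling rate, while `synthesize(800, voiced=False)` produces a noise-like sound shaped by the same resonances.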

Further discussion of the source-filter model, in the context of autoregressive modelling, is given in Section 2.4.3.

1.2 - Properties of Speech Signal

1.2.1 - Time-domain Characteristic

The separation of voiced and unvoiced speech is achieved through digital signal analysis. Whereas voiced speech has a harmonic structure, unvoiced speech has none and resembles white noise. Voiced speech is produced by a series of glottal pulses brought about by the opening and closing of the glottis.

However, the glottal opening and closing cycles differ from one another in shape and period. Consecutive glottal pulses, also referred to as pitch pulses, therefore form a quasi-periodic excitation.

1.2.2 - Frequency-domain Characteristics

The main role of the vocal tract is to shape the speech signal, which therefore has all-pole filter characteristics. During speech perception, the ear acts as a filter bank and a classifier, separating incoming signals into their frequency components.

In the frequency domain, a discrete speech signal is analysed by decomposing it into sinusoidal components. A sinusoidal signal has a single, distinct frequency. An arbitrary signal, such as speech, does not, which is why it has to be transformed into a sum of sinusoids. Human speech spans a range from around 50 Hz to 6 kHz. A deeper voice has a lower frequency. The human ear can hear sounds ranging from 16 Hz to 18 kHz (Boyd 176).

Despite this range, the human ear is most sensitive at frequencies from 1 kHz to 5 kHz. Therefore, distortion in the high-frequency bands is less noticeable to the human ear than distortion of similar amplitude in the low-frequency bands.

1.2.3 - Estimating the bit-rate required for speech signal

The vocal tract has a non-flat frequency response, which gives rise to correlation between neighbouring samples of the speech signal, commonly referred to as short-term correlation.

During voiced speech production, the periodic behaviour of the excitation leads to long-term correlation, that is, correlation between corresponding samples of neighbouring pitch pulses. To determine the frequency-domain properties of a signal segment, a short-time window of samples, roughly 20 ms to 30 ms long, is used. Under the assumption that this segment is stationary, its power spectrum is computed to represent its short-time characteristics.
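The long-term correlation can be illustrated with the normalised autocorrelation of a single short frame: for a voiced frame, the autocorrelation peaks at a lag of one pitch period. The sketch below makes several assumptions. The frame is a synthetic stand-in for voiced speech (a sum of harmonics with 1/k amplitudes), and the 8 kHz sampling rate, the 30 ms frame length, the 50 Hz to 400 Hz search range and all function names are illustrative choices rather than anything prescribed by the text.

```python
import math


def harmonic_frame(f0_hz, fs, frame_ms, n_harmonics=5):
    """Synthetic stand-in for a voiced frame: harmonics of f0 with 1/k amplitudes."""
    n = round(frame_ms * fs / 1000)  # frame length in samples
    return [
        sum(math.sin(2.0 * math.pi * k * f0_hz * t / fs) / k
            for k in range(1, n_harmonics + 1))
        for t in range(n)
    ]


def normalized_autocorrelation(frame, lag):
    """Correlation between the frame and itself shifted by `lag`, divided by the frame energy."""
    energy = sum(s * s for s in frame)
    overlap = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
    return overlap / energy


def pitch_lag(frame, fs, min_hz=50.0, max_hz=400.0):
    """Lag (in samples) with the strongest correlation inside the assumed pitch range."""
    shortest = int(fs / max_hz)
    longest = min(int(fs / min_hz), len(frame) - 1)
    return max(range(shortest, longest + 1),
               key=lambda lag: normalized_autocorrelation(frame, lag))
```

For a frame built with `harmonic_frame(100.0, 8000, 30)`, the strongest correlation falls at a lag of 80 samples, which is one pitch period of a 100 Hz fundamental at 8 kHz.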

In the spectral domain, the short-term correlation determines the envelope of the power spectrum, while the long-term correlation determines its fine structure (Rabiner and Gold 56). In the power spectrum, voiced speech is characterised by a harmonic structure: the sharp spectral peaks are separated by equal frequency intervals, which are determined by the fundamental frequency. This harmonic structure corresponds to the periodic structure of the time-domain representation of voiced speech.
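The equal spacing of the harmonic peaks can be checked numerically with a short-time power spectrum. The sketch below is written under stated assumptions: the frame is again a synthetic harmonic signal rather than real speech, a Hamming window and a direct DFT are used for simplicity, and the 1% peak threshold and all function names are illustrative.

```python
import cmath
import math


def harmonic_frame(f0_hz, fs, n_samples, n_harmonics=5):
    """Synthetic stand-in for a voiced frame: harmonics of f0 with 1/k amplitudes."""
    return [
        sum(math.sin(2.0 * math.pi * k * f0_hz * t / fs) / k
            for k in range(1, n_harmonics + 1))
        for t in range(n_samples)
    ]


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1)) for t in range(n)]


def power_spectrum(frame):
    """Power spectrum of the windowed frame for DFT bins 0 .. n/2 (direct O(n^2) DFT)."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def harmonic_peaks(spectrum, fs, n_samples, rel_threshold=0.01):
    """Frequencies (Hz) of local maxima above a fraction of the largest spectral value."""
    floor = rel_threshold * max(spectrum)
    return [
        k * fs / n_samples
        for k in range(1, len(spectrum) - 1)
        if spectrum[k] > floor
        and spectrum[k] > spectrum[k - 1]
        and spectrum[k] > spectrum[k + 1]
    ]
```

With a 240-sample frame (30 ms at 8 kHz) and a 100 Hz fundamental, `harmonic_peaks(power_spectrum(frame), 8000, 240)` places the peaks at multiples of 100 Hz, so the interval between neighbouring peaks equals the fundamental frequency.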

Two concerns arise in speech manipulation: the preservation and transmission of the speech content, and its convenient storage.
