Source Coding and Speech Processing

Source coding reduces redundancy in the speech signal and thus results in signal compression, which means that a significantly lower bit rate is achieved than needed by the original speech signal. The speech coder/decoder is the central part of the GSM speech processing function, both at the transmitter (Figure 6.2) as well as at the receiver (Figure 6.3). The functions of the GSM speech coder and decoder are usually combined in one building block called the codec (COder/DECoder).

Figure 6.2: Schematic representation of speech functions at the transmitter

The analog speech signal at the transmitter is sampled at a rate of 8000 samples/s, and the samples are quantized with a resolution of 13 bits. This corresponds to a bit rate of 104 kbit/s for the speech signal. At the input to the speech codec, a speech frame containing 160 samples of 13 bits arrives every 20 ms. The speech codec compresses this speech signal into a source-coded speech signal of 260-bit blocks at a bit rate of 13 kbit/s. Thus the GSM speech coder achieves a compression ratio of 1 to 8. The source coding procedure is briefly explained in the following; detailed discussions of speech coding procedures are given in [54].
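The figures in this paragraph can be checked with a few lines of arithmetic. The following Python sketch recomputes the frame size, the input and coded bit rates, and the compression ratio from the values stated above; the constant names are illustrative only.

```python
# Values taken from the text; the names are chosen here for illustration.
SAMPLE_RATE_HZ = 8000        # samples per second
BITS_PER_SAMPLE = 13         # quantizer resolution
FRAME_DURATION_MS = 20       # one speech frame
CODED_BITS_PER_FRAME = 260   # size of one source-coded block

# Number of samples that arrive at the codec per frame.
samples_per_frame = SAMPLE_RATE_HZ * FRAME_DURATION_MS // 1000

# Bit rate of the sampled, uncompressed speech signal.
input_bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# Bit rate after source coding: 260 bits every 20 ms.
coded_bit_rate = CODED_BITS_PER_FRAME * 1000 // FRAME_DURATION_MS

compression_ratio = input_bit_rate / coded_bit_rate

print(samples_per_frame, input_bit_rate, coded_bit_rate, compression_ratio)
```

The script reproduces the 160 samples per frame, the 104 kbit/s input rate, the 13 kbit/s coded rate, and the factor of 8 between them.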

A further ingredient of speech processing at the transmitter is the recognition of speech pauses, called Voice Activity Detection (VAD). The voice activity detector decides, based on a set of parameters delivered by the speech coder, whether the current speech frame (20 ms) contains speech or a speech pause. This decision is used to turn off the transmitter amplifier during speech pauses, under control of the Discontinuous Transmission (DTX) block.

Figure 6.3: Schematic representation of speech functions at the receiver

The discontinuous transmission mode takes advantage of the fact that, during a normal telephone conversation, both parties rarely speak at the same time, and thus each directional transmission path has to transport speech data only half the time. In DTX mode, the transmitter is activated only when the current frame actually carries speech information. This decision is based on the VAD signal of speech pause recognition. The DTX mode can reduce the power consumption and hence prolong the battery life. In addition, the reduction of transmitted energy also reduces the level of interference and thus improves the spectral efficiency of the GSM system. The missing speech frames are replaced at the receiver by a synthetic background noise signal called Comfort Noise (Figure 6.3). The parameters for the Comfort Noise Synthesizer are transmitted in a special Silence Descriptor (SID) frame.

This silence descriptor is generated at the transmitter from continuous measurements of the (acoustic) background noise level. It represents a speech frame which is transmitted at the end of a speech burst, i.e. at the beginning of a speech pause. In this way, the receiver recognizes the end of a speech burst and can activate the comfort noise synthesizer with the parameters received in the SID frame. Without this artificial background noise, the audible background noise transmitted with normal speech bursts would suddenly drop to a minimal level at each speech pause in DTX mode. This modulation of the background noise would have a very disturbing effect on the human listener and would significantly degrade the subjective speech quality. Insertion of comfort noise is a very effective countermeasure to compensate for this so-called noise-contrast effect.
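The interplay of VAD, DTX, and the SID frame described above can be sketched as a small per-frame decision function. This is a deliberately simplified illustration: the function name `dtx_schedule` and the action labels are hypothetical, and details of a real system (such as any delay before switching off or repeated SID updates during long pauses) are left out.

```python
def dtx_schedule(vad_flags):
    """Map per-frame VAD decisions to transmitter actions.

    vad_flags: iterable of booleans, True if the 20 ms frame contains speech.
    Returns one label per frame:
      "SPEECH" - transmit the coded speech frame
      "SID"    - transmit a silence descriptor (first frame of a pause)
      "OFF"    - transmitter amplifier switched off
    """
    actions = []
    previous_was_speech = False
    for has_speech in vad_flags:
        if has_speech:
            actions.append("SPEECH")
        elif previous_was_speech:
            # End of a speech burst: send comfort noise parameters once.
            actions.append("SID")
        else:
            actions.append("OFF")
        previous_was_speech = has_speech
    return actions


if __name__ == "__main__":
    print(dtx_schedule([True, True, False, False, True, False]))
```

In this sketch the SID frame marks the transition from speech to pause, which is how the receiver learns that it should start the comfort noise synthesizer.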

Speech frames can also be lost when bit errors caused by a noisy transmission channel cannot be corrected by the channel coding protection mechanism, so that the block arrives at the codec as an erroneous speech frame. Such bad speech frames are flagged by the channel decoder with the Bad Frame Indication (BFI). In this case, the respective speech frame is discarded and the lost frame is replaced by a speech frame which is predictively calculated from the preceding frame. This technique is called Error Concealment. Simple insertion of comfort noise is not allowed. If 16 consecutive speech frames are lost, the receiver is muted to acoustically signal the temporary failure of the channel.
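The receiver-side handling of bad frames can be illustrated with a short sketch. It assumes the simplest possible prediction, namely repeating the last good frame; the actual concealment procedure is more elaborate. The function name `conceal` and the use of `None` for a muted frame are choices made here for illustration.

```python
MUTE_AFTER_LOST_FRAMES = 16  # threshold stated in the text


def conceal(frames):
    """Replace bad frames and mute after too many consecutive losses.

    frames: list of (payload, bfi) tuples, where bfi is True when the
            channel decoder flagged the frame as bad.
    Returns the list of payloads handed to the speech decoder;
    None stands for a muted frame.
    """
    output = []
    last_good = None
    lost_in_a_row = 0
    for payload, bad_frame in frames:
        if not bad_frame:
            last_good = payload
            lost_in_a_row = 0
            output.append(payload)
            continue
        # Bad frame: discard the payload and substitute something else.
        lost_in_a_row += 1
        if lost_in_a_row >= MUTE_AFTER_LOST_FRAMES or last_good is None:
            output.append(None)       # mute the receiver
        else:
            output.append(last_good)  # assumed prediction: repeat last good frame
    return output
```

With one good frame followed by 16 bad ones, the first 15 bad frames are replaced by the last good frame and the 16th is muted; the next good frame ends the muting.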

The speech compression takes place in the speech coder. The GSM speech coder uses a procedure known as Regular Pulse Excitation - Long-Term Prediction - Linear Predictive Coder (RPE-LTP). This procedure belongs to the family of hybrid speech coders. This hybrid procedure transmits part of the speech signal as the amplitude of a signal envelope, a pure waveform encoding, whereas the remaining part is encoded into a set of parameters. The receiver reconstructs these signal parts through speech synthesis (vocoder technique). Examples of envelope encoding are Pulse Code Modulation (PCM) or Adaptive Delta Pulse Code Modulation (ADPCM). A pure vocoder procedure is Linear Predictive Coding (LPC). The GSM procedure RPE-LTP as well as Code Excited Linear Predictive Coding (CELP) represent mixed (hybrid) approaches [15,46,54].

Figure 6.4: Simplified block diagram of the GSM speech coder

A simplified block diagram of the RPE-LTP coder is shown in Figure 6.4. Speech data generated with a sampling rate of 8000 samples/s and 13-bit resolution arrive in blocks of 160 samples at the input of the coder. The speech signal is then decomposed into three components: a set of parameters for the adjustment of the short-term analysis filter (LPC), also called reflection coefficients; an excitation signal for the RPE part, with irrelevant portions removed and highly compressed; and finally a set of parameters for the control of the LTP long-term analysis filter. The LPC and LTP analyses each supply 36 bits of filter parameters for each sample block, and the RPE coding compresses the sample block to 188 bits of RPE parameters. This results in the generation of a frame of 260 bits (36 + 36 + 188) every 20 ms, equivalent to a 13 kbit/s GSM speech signal rate.
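The 260-bit frame can be broken down into its parts as a quick consistency check. The per-parameter bit widths below are not given in the text; they are assumed from the bit allocation of the GSM full-rate codec (eight log area ratios, and four sub-blocks each carrying LTP and RPE parameters), so treat them as an illustration.

```python
# Assumed bit widths (GSM full-rate bit allocation), not stated in the text.
LAR_BITS = [6, 6, 5, 5, 4, 4, 3, 3]      # eight short-term filter parameters
SUBBLOCKS_PER_FRAME = 4                   # 4 sub-blocks of 40 samples each
LTP_BITS_PER_SUBBLOCK = 7 + 2             # lag + gain
RPE_BITS_PER_SUBBLOCK = 2 + 6 + 13 * 3    # grid position + block amplitude + 13 pulses

lpc_bits = sum(LAR_BITS)
ltp_bits = SUBBLOCKS_PER_FRAME * LTP_BITS_PER_SUBBLOCK
rpe_bits = SUBBLOCKS_PER_FRAME * RPE_BITS_PER_SUBBLOCK
frame_bits = lpc_bits + ltp_bits + rpe_bits

# One frame every 20 ms gives the coded bit rate in bit/s.
bit_rate = frame_bits * 1000 // 20

print(lpc_bits, ltp_bits, rpe_bits, frame_bits, bit_rate)
```

Under these assumptions the LPC and LTP parts contribute 36 bits each and the RPE part 188 bits, which adds up to the 260 bits per frame and 13 kbit/s quoted above.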

The speech data preprocessing of the coder (Figure 6.4) removes the DC portion of the signal, if present, and uses a preemphasis filter to emphasize the higher frequencies of the speech spectrum. The preprocessed speech data is run through a nonrecursive lattice filter (LPC filter, Figure 6.4) to reduce the dynamic range of the signal. Since this filter has a "memory" of about 1 ms, it is also called the short-term prediction filter. The coefficients of this filter, called reflection coefficients, are calculated during LPC analysis and transmitted as part of the speech frame in a logarithmic representation known as Log Area Ratios (LARs).
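Two of the steps named above, preemphasis and the logarithmic representation of reflection coefficients, can be sketched in a few lines. The first-order filter form, the coefficient value 0.86, and the function names are assumptions for illustration; the log area ratio is computed with the standard definition log10((1 + r) / (1 - r)), whereas a real coder would typically use a quantized approximation.

```python
import math


def preemphasize(samples, beta=0.86):
    """First-order preemphasis: y[k] = x[k] - beta * x[k-1].

    Boosts high frequencies relative to low ones. The value of beta
    is an assumed, illustrative coefficient.
    """
    out = []
    previous = 0.0
    for s in samples:
        out.append(s - beta * previous)
        previous = s
    return out


def log_area_ratio(r):
    """Convert a reflection coefficient r (-1 < r < 1) to a log area ratio."""
    if not -1.0 < r < 1.0:
        raise ValueError("reflection coefficient must lie strictly between -1 and 1")
    return math.log10((1.0 + r) / (1.0 - r))


if __name__ == "__main__":
    print(preemphasize([1.0, 1.0, 1.0]))  # a constant input is strongly attenuated
    print(log_area_ratio(0.5))
```

The logarithmic mapping stretches the range near r = ±1, where small changes in a reflection coefficient have the largest effect on the filter, which is why it is a convenient representation for quantization.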

Further processing of the speech data is preceded by a recalculation of the coefficients of the long-term prediction filter (LTP analysis in Figure 6.4). The new prediction is based on the previous and current blocks of speech data. The resulting estimated block is finally subtracted from the block to be processed, and the resulting difference signal is passed on to the RPE coder.
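The idea behind long-term prediction can be illustrated with a small search routine: find the delay at which earlier signal data best matches the current block, scale that earlier segment, and subtract it. This is only a sketch under stated assumptions. The function name `ltp_analyze`, the lag range of 40 to 120 samples, and the unquantized gain are illustrative choices, not a description of the exact GSM procedure.

```python
def ltp_analyze(history, block, min_lag=40, max_lag=120):
    """Estimate a long-term predictor and return the prediction residual.

    history: past samples, most recent last (at least max_lag samples)
    block:   current block to be predicted (at most min_lag samples)
    Returns (lag, gain, residual).
    """
    n = len(block)
    if n > min_lag or len(history) < max_lag:
        raise ValueError("block too long or history too short for the lag range")

    # Search the lag with the largest cross-correlation.
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        segment = history[start:start + n]
        corr = sum(b * s for b, s in zip(block, segment))
        if corr > best_corr:
            best_lag, best_corr = lag, corr

    # Gain that scales the delayed segment onto the current block.
    start = len(history) - best_lag
    segment = history[start:start + n]
    energy = sum(s * s for s in segment)
    gain = best_corr / energy if energy > 0 else 0.0

    # Difference signal that would be passed on to the RPE stage.
    residual = [b - gain * s for b, s in zip(block, segment)]
    return best_lag, gain, residual
```

For a strictly periodic input, such as a pulse repeating every 50 samples, the search returns that period as the lag and the residual vanishes, which shows how the long-term predictor removes pitch-related redundancy.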

Figure 6.5: Simplified block diagram of the GSM speech decoder

After LPC and LTP filtering, the redundancy of the speech signal has been reduced, i.e. it already needs a lower bit rate than the sampled signal; however, the original signal can still be reconstructed from the calculated parameters. The irrelevance contained in the speech signal is reduced by the RPE coder. This irrelevance represents speech information that is not needed for the understandability of the speech signal, since it is hardly noticeable to human hearing and thus can be removed without loss of quality. On the one hand, this results in a significant compression (factor 160 × 13/188 ≈ 11); on the other hand, it has the effect that the original signal cannot be reconstructed uniquely. Figure 6.5 summarizes the reconstruction of the speech signal from RPE data, as well as the long-term and short-term synthesis from LTP and LPC filter parameters. In principle, the functions performed at the receiver side are the inverse of the functions of the encoding process.

The irrelevance reduction only minimally affects the subjectively perceived speech quality, since the main objective of the GSM codec is not just the highest possible compression but also good subjective speech quality. To measure the speech quality in an objective manner, a series of tests were performed on a large number of candidate systems and competing codecs.

The basis for comparison is the Mean Opinion Score (MOS), ranging from MOS = 1, meaning quality is very bad or unacceptable, to MOS = 5, meaning quality is very good and fully acceptable. A series of coding procedures were discussed for the GSM system; they were examined in extensive hearing tests for their respective subjective speech quality [46]. Table 6.1 gives an overview of these test results; as a reference, it also includes ADPCM and frequency-modulated analog transmission. The GSM codec with the RPE-LTP procedure generates a speech quality with an MOS value of about 4 for a wide range of different inputs.

Table 6.1: MOS results of codec hearing tests [46]

CODEC      Process                                                 Bit rate (kbit/s)  MOS
---------  ------------------------------------------------------  -----------------  ----
FM         Frequency Modulation                                    -                  1.95
SBC-ADPCM  Subband-CODEC - Adaptive Delta-PCM                      15                 2.92
SBC-APCM   Subband-CODEC - Adaptive PCM                            16                 3.14
MPE-LTP    Multi-Pulse Excited LPC-CODEC - Long Term Prediction    16                 3.27
RPE-LPC    Regular-Pulse Excited LPC-CODEC                         13                 3.54
RPE-LTP    Regular Pulse Excited LPC-CODEC - Long Term Prediction  13
ADPCM      Adaptive Delta Modulation                               32                 ≈ 4
