Analog Voice and Digital Data

Most scientists agree that what makes human beings radically different from all other mammals is that we talk with each other. There has not been enough time for evolution to make human brains that much different from those of apes, and DNA research has reinforced this belief. Human speech, however, learned rapidly after birth, appears to drive many of the differences in brain organization that allow people to carry on a conversation with their parents but not with their dogs. This came as a surprise to researchers, since conventional wisdom attributed speech to human brain organization and not the other way around. People got bigger and better brains because they needed to express bigger and better thoughts vocally. Between the ages of 5 and 11 years, the average person learns two or three new words a day, simply because it is necessary to do so.

Be that as it may, only humans appear to use the voice as the primary means of deep social interaction. And as human society spread throughout the world, the need to keep in touch vocally led to the development of the worldwide telephone system and now drives the need to add voice to the global Internet.

Voice, or human speech, is generated by air from the lungs vibrating the vocal cords in the throat and, to a lesser extent, the tongue and lips to form sounds known technically as phonemes. The vibrations occur at a range of frequencies (number of vibrations per second) that does not vary much from culture to culture. However, the exact array of sounds made varies from culture to culture, and studies have shown that if a certain sound is not learned properly by the age of 7 years or so, the ability to generate that sound correctly is lost forever. This is one reason that the later a person learns a foreign language, the more likely he or she is to always speak with a pronounced accent. The sounds are arranged into voice units called words, words are organized into phrases, phrases make sentences, and so on.

The distance at which the human voice can be heard also varies. This range depends on weather conditions, as well as on the power of the sound. Some sounds are more powerful than others, a feature of voice that will be important later on in this chapter. Voice is a pressure wave generated in the air that travels from mouth to ear. Voice pressure waves are acoustical waves, as are the sounds made by musical instruments, animal growls, bird calls, and so forth. Pressure waves are fundamentally different from the type of electrical (technically, electromagnetic) waves that operate in power systems, wireless cellular telephone networks, TV systems, and data networks. Acoustical waves rely on air to carry them. The thinner the air, such as at higher altitudes, the less power is in the voice. Water, being very dense, carries sound waves so well that ears, normally capable of determining the direction of a sound source with ease, cannot discern where sounds come from; they seem to come from all directions at once.

Voice, carried on these acoustical waves, is a form of analog information. This means that the amplitude of the signal can vary from a maximum to a minimum and take on any value at all in between. Voice sounds can be described in one of several ways. There are fricatives and labials, for instance. But when it comes to understanding voice digitization techniques, it is easiest to describe voice as a mixture of high-amplitude sounds called voiced sounds and low-amplitude sounds called unvoiced sounds. In the English language, voiced sounds include the vowel sounds such as a, e, i, o, and u. Voiced sounds are formed in the throat and are determined directly by vocal cord positioning. Unvoiced sounds are the consonants such as b, t, v, and so on. Unvoiced sounds are formed by the tongue and lips as air is passed directly through the vocal cords. Voiced sounds have about 20 times more amplitude than unvoiced sounds and carry most of the power of human speech. However, the unvoiced sounds are what ears rely on to distinguish between words having the same vowel sounds at their core, like the difference between belt, felt, and melt.
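That roughly 20-to-1 amplitude difference is exactly the sort of feature a digitization technique can measure. As a rough sketch only, the short Python fragment below shows how a coder might separate voiced from unvoiced frames using short-time energy; the sample rate, frame length, and threshold values are assumptions chosen for the example, not figures from this chapter.

```python
import numpy as np

SAMPLE_RATE = 8000                           # telephone-quality sampling (samples per second)
FRAME_MS = 20                                # analysis window length in milliseconds
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def short_time_energy(frame):
    """Average squared amplitude of one analysis frame."""
    frame = np.asarray(frame, dtype=float)
    return float(np.mean(frame ** 2))

def is_voiced(frame, energy_threshold=0.01):
    """Crude voiced/unvoiced decision based only on energy.

    Voiced sounds carry roughly 20 times the amplitude of unvoiced
    sounds (about 400 times the energy), so a single threshold
    separates most frames.  The threshold here is a placeholder; a
    real coder would adapt it to the speaker and the line.
    """
    return short_time_energy(frame) > energy_threshold

# Example: a strong 200 Hz tone standing in for a voiced sound,
# and low-level noise standing in for an unvoiced sound.
t = np.arange(FRAME_LEN) / SAMPLE_RATE
voiced_like = 0.5 * np.sin(2 * np.pi * 200 * t)
unvoiced_like = 0.025 * np.random.randn(FRAME_LEN)

print(is_voiced(voiced_like))    # True
print(is_voiced(unvoiced_like))  # False (with overwhelming probability)
```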

Figure 2-1 shows the relative amplitude of voiced and unvoiced sounds in an analog acoustical wave. The figure represents a pressure wave as a time waveform, a common enough practice. Other languages use other sounds, but all are formed with either the throat or mouth playing the major role. This limited range of possible sounds is an important feature of analog voice that comes into play in modern voice digitization techniques. However, the sound should never be confused with the letter or letters of the alphabet that represents it. The explosive k in the English word skate is quite different from the k sound in the word kite.

Figure 2-1: The English word salt as an acoustical wave representing analog voice.

Voiced sounds, as illustrated by Figure 2-1, have fairly regular waveforms, since the vibrations in the throat are very stable. On the other hand, unvoiced sounds have a more random and unpredictable waveform pattern because the position of the mouth can vary during sound generation. These observations also will become important in modern voice digitization techniques.
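One common way to put a number on that regularity is the zero-crossing rate: a slow, periodic voiced waveform crosses zero relatively seldom, while a noise-like unvoiced waveform crosses constantly. The sketch below (Python, using synthetic signals as stand-ins for real speech) illustrates the measure; the particular signals and frame size are assumptions made for illustration.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    frame = np.asarray(frame, dtype=float)
    signs = np.sign(frame)
    signs[signs == 0] = 1                  # treat exact zeros as positive
    return float(np.mean(signs[:-1] != signs[1:]))

sample_rate = 8000
t = np.arange(160) / sample_rate           # one 20 ms frame at 8000 samples/s

periodic = np.sin(2 * np.pi * 150 * t)     # regular: stands in for a voiced waveform
noisy = np.random.randn(t.size)            # random: stands in for an unvoiced waveform

print(zero_crossing_rate(periodic))        # low: about 0.04 for this frame
print(zero_crossing_rate(noisy))           # high: roughly 0.5
```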

Voiced waveforms tend to repeat in what is known as the pitch interval. The length of the pitch interval can vary, most notably between men and women (this is not a sexist comment, just a report on the findings of acoustical engineers). When men speak, the pitch interval lasts between 5 and 20 ms (thousandths of a second), and when women speak, the pitch interval typically lasts 2.5 to 10 ms. Voiced sounds last from 100 to 125 ms. Thus a single voiced sound can have anywhere from 5 to 50 pitch intervals that repeat during a single sound. Oddly, not all the pitch intervals are needed for understandable speech. People can understand speech faster than they can generate it. Some television commercials, in a quest to jam as much talk into a 20-second spot as possible, routinely remove some of the repeated pitch intervals in voiced sounds to speed up the voice without distorting it. The result sounds somehow strange to viewers, but few can say exactly what the difference is.
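The range of 5 to 50 pitch intervals follows directly from dividing the duration of a voiced sound by the length of a pitch interval; the short calculation below (Python) simply reproduces that arithmetic with the figures quoted above.

```python
# Durations quoted above, in milliseconds.
male_pitch_interval = (5.0, 20.0)        # ms per pitch interval when men speak
female_pitch_interval = (2.5, 10.0)      # ms per pitch interval when women speak
voiced_sound_duration = (100.0, 125.0)   # ms per voiced sound

# Fewest repetitions: the shortest voiced sound with the longest pitch interval.
fewest = voiced_sound_duration[0] / male_pitch_interval[1]     # 100 / 20  = 5
# Most repetitions: the longest voiced sound with the shortest pitch interval.
most = voiced_sound_duration[1] / female_pitch_interval[0]     # 125 / 2.5 = 50

print(fewest, most)   # 5.0 50.0
```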

One further aspect of human voice should be discussed before moving on to the invention of the telephone. This is the fact that silence plays a large role in human speech communication. The person listening at any given moment is almost totally silent during a conversation, save for short sounds generated to let the speaker know that he or she is still paying attention ("uh-huh") and understanding the speaker ("Go on"). These feedback sounds are most frequent when two people are conversing and almost totally absent when one person is addressing a group of listeners (thankfully).

Moreover, the speaker falls silent when drawing a breath at the end of a sentence or long phrase. Silences of shorter duration occur within phrases, between words, and even within words themselves, usually words consisting of multiple syllables (a syllable is a consonant and vowel combination treated as a unit of speech).

During a typical conversation between two people, active voice is generated by one person or the other only about 40 percent of the time. Fully half the time is spent with one person listening, and the other 10 percent comes from pauses between sentences, words, and syllables. During a telephone call, the listener also pays attention to background noise, not only obvious noise sources such as TVs or radios but also a persistent low-level hum. This persistent background noise is known as ambient sound. The presence of ambient sound even in the absence of speech allows the listener to realize that the line is still "live" and has not failed abruptly.

Thus human speech is characterized by three major features:

1. A mixture of high-amplitude voiced sounds and low-amplitude unvoiced sounds

2. A mixture of more regular and predictable voiced waveforms and more random unvoiced waveforms

3. Silence nearly 60 percent of the time during a two-way conversation

Any voice digitization technique should take all these voice features into account. And if one voice digitization technique acknowledges only one characteristic and another takes into account all three, then it can be said with confidence that the second voice digitization method is better than the first.
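Purely as an illustration of that point, the sketch below (Python) combines all three features in a single frame classifier: an energy floor to detect silence, an energy threshold for the voiced/unvoiced split, and the zero-crossing rate as a check on waveform regularity. Every threshold here is a placeholder chosen for the synthetic example, not a value from any standard or from this chapter.

```python
import numpy as np

def classify_frame(frame, silence_floor=1e-4, voiced_energy=0.01, zcr_limit=0.25):
    """Label one analysis frame as 'silence', 'voiced', or 'unvoiced'.

    All three thresholds are hypothetical; a real voice coder would
    adapt them to the speaker, the microphone, and the ambient sound
    on the line.
    """
    frame = np.asarray(frame, dtype=float)
    energy = float(np.mean(frame ** 2))              # feature 1: amplitude (as energy)
    signs = np.sign(frame)
    signs[signs == 0] = 1
    zcr = float(np.mean(signs[:-1] != signs[1:]))    # feature 2: waveform regularity

    if energy < silence_floor:                       # feature 3: silence
        return "silence"
    if energy > voiced_energy and zcr < zcr_limit:
        return "voiced"
    return "unvoiced"

# Synthetic stand-ins for the three kinds of frame.
rate, n = 8000, 160
t = np.arange(n) / rate
print(classify_frame(0.5 * np.sin(2 * np.pi * 200 * t)))   # voiced
print(classify_frame(0.02 * np.random.randn(n)))           # unvoiced
print(classify_frame(0.001 * np.random.randn(n)))          # silence (ambient level)
```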
