Sound Quality Measurement

How was it that in C.W. McCall’s ‘70s hit song Convoy, the Rubber Duck organized more than 85 trucks into a toll-bustin’ fleet despite roadblocks, a bear in the air, and the stench of hogs? The obvious answer is communications, in this case, via CB radio. The convoy was possible only because the truckers could understand each other above the racket of engine noise and police sirens.

Because of our physiology and psychology, what we think we hear is a complex mix of the actual sounds presented to us and what we expect to hear. There may be a very large difference between what we perceive as the received auditory signal and what was sent. The field of study dealing with human sound perception is called psychoacoustics.

“The hearing system performs a spectrographic analysis of any auditory stimulus. The cochlea can be regarded as a bank of filters whose outputs are ordered tonotopically, so that a frequency-to-place transformation is effected. The filters closest to the cochlear base respond maximally to the highest frequencies and those closest to its apex respond maximally to the lowest.

“The hearing system also can be said to perform a temporal oscillographic analysis of the set of neural signals that originate in the cochlea in response to an auditory stimulus. This process is important for frequencies below 500 Hz, and it contributes to frequency resolution up to about 1.5 kHz.”1

The anatomy of the cochlea supports a nearly linear frequency response from about 1 kHz to 8 kHz. However, outside this range, sensitivity falls off rapidly. In addition, human perception of complex sounds, mixed loud and soft sounds, and harmonic content is nonlinear.

Pitch, Loudness, and Selectivity

Pitch, the perception of frequency, and loudness are not independent variables. The Fletcher-Munson equal loudness curves shown in Figure 1 are plots of frequency vs. intensity having constant values of perceived loudness. They represent the variation in sensitivity of hearing presented in decibels relative to a sound pressure level (SPL) of 20 micropascals (µPa).

Each curve is labeled in phons, a measure of loudness that corresponds to the decibel value of SPL at 1 kHz. At frequencies other than 1 kHz, the curves reflect the ear’s general loss of sensitivity toward low and very high frequencies. For example, on the 40-phon curve, a 20-Hz tone must be 50 dB higher in amplitude than a 1-kHz tone to have the same perceived loudness.2
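The ear’s frequency-dependent sensitivity is often approximated in instrumentation by the standard A-weighting curve (defined in IEC 61672, not in this article), which roughly follows the inverse of the 40-phon contour. A short sketch shows how it reproduces the 50-dB figure quoted above:

```python
import math

def a_weight_db(f: float) -> float:
    """IEC 61672 A-weighting in dB: a standard approximation of the
    inverse of the 40-phon equal-loudness contour (about 0 dB at 1 kHz)."""
    f2 = f * f
    ra = (12194.0**2 * f2 * f2) / (
        (f2 + 20.6**2)
        * math.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * math.log10(ra) + 2.0

# Sensitivity loss at the extremes: roughly -50 dB at 20 Hz, near 0 dB at 1 kHz
for f in (20.0, 100.0, 1000.0, 8000.0, 16000.0):
    print(f"{f:7.0f} Hz: {a_weight_db(f):6.1f} dB")
```

The value near -50 dB at 20 Hz matches the 40-phon example: a 20-Hz tone must be about 50 dB stronger than a 1-kHz tone to sound equally loud.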

The flattening of the curves at high intensities means that the ear’s sensitivity varies less with frequency: relative to the midrange, low frequencies need far less of a boost to sound equally loud. This is the reason that music may seem dull and uninteresting when played at a low level but much more lively at higher levels. Some music systems automatically boost the bass output at low volume to compensate for this effect.

A much weaker dependency relates pitch to loudness. That is, the perceived pitch of a tone depends only to a small degree on its loudness. Generally, pitch is directly proportional to the logarithm of frequency up to about 5 kHz. In addition, the sensation of pitch is affected by the complexity of the tone. Pure single-frequency notes are perceived differently than more complex spectra centered on the same frequency. Tones that are harmonically related present special challenges.
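The logarithmic pitch-frequency relationship is what makes musical intervals work: doubling any frequency is always heard as the same interval, one octave. A minimal sketch (the 12-semitone equal-tempered division is an assumption from music practice, not from the article):

```python
import math

def semitones_above(f: float, ref: float = 440.0) -> float:
    """Perceived pitch interval in semitones between f and ref:
    pitch is proportional to log2(frequency), 12 semitones per octave."""
    return 12.0 * math.log2(f / ref)

# Doubling or halving frequency yields the same perceived interval (an octave):
print(semitones_above(880.0))   # 12.0
print(semitones_above(220.0))   # -12.0
```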

Humans anticipate the harmonic content of signals to the degree that we may find it difficult to discern missing or added harmonics. For example, the term masking describes our inability to hear a soft tone in the presence of a loud one. The effect heightens if the soft tone is a harmonic of the louder tone. The masking threshold is the level of the soft tone required for it to be heard above the louder one.

Similarly, selectivity is the ability to hear a single tone in the presence of noise. A tone is positioned at the center of a notch in the noise spectrum. For a given relationship between the tone and noise amplitudes, the notch widens until the tone can be heard distinctly.

“Auditory frequency selectivity can be described in terms of an equivalent rectangular bandwidth (ERB) as a function of center frequency. Both spectral and temporal analysis contribute to the detection of the tone….”3 The ERB is roughly proportional to center frequency, and below 500 Hz, resolution of the temporal fine structure of the signal contributes significantly to detection.
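A widely used form of the ERB-vs.-center-frequency function is the Glasberg-Moore approximation; the formula is an assumption here, since the article does not give one:

```python
def erb_hz(fc_hz: float) -> float:
    """Glasberg & Moore estimate of the equivalent rectangular bandwidth
    (in Hz) of the auditory filter centered at fc_hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

# The bandwidth grows nearly in proportion to center frequency:
for fc in (100.0, 500.0, 1000.0, 4000.0):
    print(f"{fc:6.0f} Hz center -> ERB {erb_hz(fc):6.1f} Hz")
```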

Voice-Quality Testing

The many subjective aspects of audio perception are further complicated by the distortions associated with signal encoding, transmission, decoding, and reproduction. Nevertheless, of all the metrics that could be applied, voice clarity and delay are considered the most important influences on a listener’s impression of voice quality.

The standard by which new technologies are judged is the voice quality provided by the public switched telephone network (PSTN). Although far from perfect, clarity is adequate, predictable, and reliable. System components that ensure good performance include the following:

  • Telephone handset filtering.
  • Digital sampling at 8 kS/s.
  • Digital nonuniform pulse code modulation (PCM) encoding (µ-law and A-law).
  • Guaranteed bandwidth at 56 or 64 kb/s.
  • Echo minimization.

Restricting the analog bandwidth to 4 kHz reduces noise and expense while still allowing easy speaker identification. Sampling at the Nyquist rate and providing sufficient channel bandwidth for the PCM coding preserve signal detail. Psychoacoustics comes into play in the nonuniform encoding.
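The PSTN figures listed above fit together by simple arithmetic; a sketch (attributing the 56-kb/s rate to robbed-bit signaling, which uses only 7 of the 8 bits per sample, is an assumption beyond the article):

```python
# DS0 channel arithmetic behind the PSTN figures quoted above
bandwidth_hz = 4000                  # restricted analog bandwidth
sample_rate = 2 * bandwidth_hz       # Nyquist rate: 8 kS/s

print(sample_rate * 8)               # 8 bits/sample -> 64000 b/s
print(sample_rate * 7)               # 7 usable bits/sample -> 56000 b/s
```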

Because masking prevents soft sounds from being heard in the presence of loud ones, the dynamic range of the components that make up a complex signal can be minimized. The µ-law codec compresses 13 bits of resolution into 7 bits by making use of this fact. The codec’s most significant 3 bits specify which of eight bands the digitized sample falls into. The remaining four least significant bits (LSBs) of the 7-bit code correspond to the 4 bits that followed the initial 1 in the original 13-bit data. The original LSB is never used. Table 1 shows the logarithmic nature of the resulting compression.4
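The segment-and-mantissa scheme just described can be sketched in a few lines. This is a simplified illustration only: the sign bit and the +33 bias applied by real G.711 µ-law codecs are omitted for clarity.

```python
def mulaw_compress(sample: int) -> int:
    """Simplified µ-law magnitude compression: 13-bit magnitude -> 7 bits.
    (Sign bit and the standard +33 bias are omitted for clarity.)"""
    assert 0 <= sample < (1 << 13)
    # Segment (3 MSBs of the code): position of the leading 1 among bits 12..5.
    for segment in range(7, -1, -1):
        if sample & (1 << (segment + 5)):
            break
    else:
        segment = 0
    # Mantissa (4 LSBs of the code): the 4 bits that follow the leading 1;
    # everything below them, including the original LSB, is discarded.
    mantissa = (sample >> (segment + 1)) & 0x0F
    return (segment << 4) | mantissa

print(mulaw_compress(4096))   # leading 1 at bit 12 -> segment 7, mantissa 0 -> 112
print(mulaw_compress(43))     # leading 1 at bit 5  -> segment 0, mantissa 5 -> 5
```

Note the logarithmic character: each higher segment covers twice the input range of the one below it at half the resolution, mirroring Table 1.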

Carrying telephone calls over IP networks (VoIP) may further degrade voice quality because of temporal signal loss and dropouts caused by packet or cell loss, delay variance (jitter), and low bit-rate encoding. To ensure that the level of clarity provided by PSTN is being achieved by traffic carried on data networks, satisfactory test methods are required.

Traditional metrics such as signal-to-noise ratio (SNR) and bit error rate tests (BERT) apply best to systems that reproduce the input waveform at the output. The data output from low bit rate codecs may be severely errored, but exact bit patterns are not necessarily required for good perceived quality sound. In systems using low bit rate codecs and compression, SNR and BERT do not correlate well to a listener’s subjective impression of clarity.

“The first significant technique used to measure speech clarity was to use large numbers of listeners to produce statistically valid subjective clarity scores. This technique is known as mean opinion scoring (MOS) where the mean value of large volumes of human-opinion scores is calculated…. Listening tests use standardized speech samples. Listeners hear the samples transmitted over a system or network and rate the overall quality of the sample based on opinion scales.”5

Two scales describe the MOS testing results. The first is a quality of speech scale that runs from a score of 5 for excellent to a score of 1 for bad. The other scale rates the difficulty a listener had in understanding the meaning of sentences. This scale runs from 5 for complete understanding to 1 for no meaning understood.
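Computing a MOS is simply averaging the panel’s ratings; a minimal sketch (the intermediate labels good/fair/poor follow the conventional ITU-T P.800 scale, an assumption beyond the article’s stated endpoints):

```python
# Quality-of-speech opinion scale: 5 = excellent down to 1 = bad
QUALITY = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}

def mos(scores: list[int]) -> float:
    """Mean opinion score: the average of many listeners' 1-5 ratings."""
    return sum(scores) / len(scores)

panel = [4, 5, 3, 4, 4, 3, 5, 4]
print(round(mos(panel), 2))   # 4.0
```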

It is neither practical nor inexpensive to assemble large numbers of listeners every time changes are made to the design of a codec or a compression routine. For these reasons, although the MOS quality and difficulty scales address speech clarity directly, a more repeatable and objective test method was needed.

In the early 1990s, KPN Research in The Netherlands developed a technique called perceptual speech quality measurement (PSQM), adopted as ITU-T rec. P.861. It compares the original speech sample to the same information after it has been passed through a transmission process that typically includes encoding/decoding, data compression, and packetization. However, PSQM is not a complete tool in itself.

The output signal to be tested and the original input signal must be time-synchronized before the PSQM algorithm can be used. This is necessary because data is analyzed in 32-ms segments. Also, PSQM does not deal with channel degradations such as bit errors, packet loss, and time-clipping. If these impairments are present, the score returned generally will indicate higher voice quality than would subjective listening tests. To correct this shortcoming, the algorithm was upgraded to PSQM+ in 1997.
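At the PSTN rate of 8 kS/s, a 32-ms segment is 256 samples. A minimal sketch of the segmentation step (the time-synchronization itself is assumed to have already been done):

```python
SAMPLE_RATE = 8000    # PSTN sampling rate, samples/s
SEGMENT_MS = 32       # PSQM analysis window

def split_segments(samples: list[float]) -> list[list[float]]:
    """Split a time-aligned signal into the 32-ms segments PSQM analyzes
    (256 samples at 8 kS/s); a trailing partial segment is dropped."""
    n = SAMPLE_RATE * SEGMENT_MS // 1000   # 256 samples per segment
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

signal = [0.0] * 1000
print([len(s) for s in split_segments(signal)])   # [256, 256, 256]
```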

PSQM produces a mathematical representation of acoustic signals that takes into account human physiology and auditory sensitivities. Also in 1997, an alternative technique called measuring normalizing blocks (MNB) was proposed for measuring such perceptually transformed input and output signals. The differences between the signals are computed in both the time domain and the frequency domain and the results combined to correlate with MOS scores.

In 1998, in an independent effort, British Telecommunications developed the perceptual analysis measurement system (PAMS). In contrast to PSQM and MNB, PAMS includes all the synchronization and level normalization required (Figure 2). It, too, operates on data within a time frame and considers information to be in the time-frequency domain. PAMS produces a listening quality score and a listening effort score that correspond to the original MOS tests.

PAMS accurately predicts subjective speech-clarity test results when speech clarity is affected by:

  • Waveform codecs.
  • Nonwaveform vocoders.
  • Transcodings (the conversion from one digital format to another).
  • Speech-level input to codecs.
  • Talker dependencies; for example, languages, phrases.
  • Fast delay variances.
  • Time-clipping.
  • Level-clipping.
  • Added noise.

PAMS is not intended to deal with:

  • Delay.
  • Slow delay variances.
  • Overall system gain/attenuation.
  • Analog phone filtering.
  • Background noise present in input signal.
  • Music as input signal.

A further improvement to PSQM was made in 1999, resulting in a version called PSQM99. Most recently, PSQM99 and PAMS have been combined to produce perceptual evaluation of speech quality (PESQ), which is intended to retain the merits of both techniques and is designated ITU-T rec. P.862.

“Like PSQM and PAMS before it, PESQ still is directed at narrowband telephone signals. It is applicable to systems with speech coding, including low bit-rate vocoders, variable delay, filtering, packet or cell loss, time-clipping, and channel errors. PESQ scores predict listening quality scores for absolute category rating (ACR) listening tests.”6

References

  1. Auditory Scales of Frequency Representation, www.ling.su.se/staff/hartmut/bark.htm
  2. Figure 1, Fletcher-Munson Curves, www.allchurchsound.com/index.html?ACS/edart/fmelc.html
  3. Auditory Scales of Frequency Representation, op. cit.
  4. Brokish, C., “µ-Law Compression on the TMS320C54x,” Texas Instruments Application Brief SPRA267, 1996.
  5. Anderson, J., “Methods for Measuring Perceptual Speech Quality,” Agilent Technologies White Paper, 2001, p. 9.
  6. Ibid., p. 25.

Published by EE-Evaluation Engineering
All contents © 2001 Nelson Publishing Inc.

June 2002
