50
ETSI EG 201 377-2 V1.1.1 (2004-04) ETSI Guide Speech Processing, Transmission and Quality Aspects (STQ); Specification and measurement of speech transmission quality; Part 2: Mouth-to-ear speech transmission quality including terminals

EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI Guide

Speech Processing, Transmission and Quality Aspects (STQ);Specification and measurement of

speech transmission quality;Part 2: Mouth-to-ear speech transmission

quality including terminals

Page 2: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 2

Reference DEG/STQ-00011

Keywords networks, QoS, quality, speech, terminal, testing,

transmission

ETSI

650 Route des Lucioles F-06921 Sophia Antipolis Cedex - FRANCE

Tel.: +33 4 92 94 42 00 Fax: +33 4 93 65 47 16

Siret N° 348 623 562 00017 - NAF 742 C

Association à but non lucratif enregistrée à la Sous-Préfecture de Grasse (06) N° 7803/88

Important notice

Individual copies of the present document can be downloaded from: http://www.etsi.org

The present document may be made available in more than one electronic version or in print. In any case of existing or perceived difference in contents between such versions, the reference version is the Portable Document Format (PDF).

In case of dispute, the reference shall be the printing on ETSI printers of the PDF version kept on a specific network drive within ETSI Secretariat.

Users of the present document should be aware that the document may be subject to revision or change of status. Information on the current status of this and other ETSI documents is available at

http://portal.etsi.org/tb/status/status.asp

If you find errors in the present document, send your comment to: [email protected]

Copyright Notification

No part may be reproduced except as authorized by written permission. The copyright and the foregoing restriction extend to reproduction in all media.

© European Telecommunications Standards Institute 2004.

All rights reserved.

DECTTM, PLUGTESTSTM and UMTSTM are Trade Marks of ETSI registered for the benefit of its Members. TIPHONTM and the TIPHON logo are Trade Marks currently being registered by ETSI for the benefit of its Members. 3GPPTM is a Trade Mark of ETSI registered for the benefit of its Members and of the 3GPP Organizational Partners.

Page 3: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 3

Contents

Intellectual Property Rights ................................................................................................................................5

Foreword.............................................................................................................................................................5

Introduction ........................................................................................................................................................5

1 Scope ........................................................................................................................................................6

2 References ................................................................................................................................................6

3 Definitions and abbreviations...................................................................................................................9 3.1 Definitions..........................................................................................................................................................9 3.2 Abbreviations ...................................................................................................................................................10

4 General considerations for end-to-end speech quality evaluations ........................................................11

5 Test configurations.................................................................................................................................15 5.1 Test setup for terminals ....................................................................................................................................16 5.1.1 Setup for handset terminals.........................................................................................................................17 5.1.2 Setup for headset terminals.........................................................................................................................17 5.1.3 Setup for hands-free type terminals and loudspeaking terminals................................................................17 5.1.4 Position and calibration of HATS...............................................................................................................17 5.2 Setup of the electrical interfaces.......................................................................................................................17 5.3 Test signals.......................................................................................................................................................18 5.4 Accuracy of test equipment ..............................................................................................................................18

6 Test conditions .......................................................................................................................................19 6.1 Acoustic environment.......................................................................................................................................19 6.2 Network conditions, general.............................................................................................................................19 6.2.1 Network conditions, PSTN.........................................................................................................................20 6.2.2 Network Conditions, packet based transmission ........................................................................................20 6.2.3 Network Conditions, GSM mobile .............................................................................................................21 6.2.3.1 Speech levels.........................................................................................................................................22 6.2.3.2 Echo control ..........................................................................................................................................22 6.2.3.3 Radio network and radio network features............................................................................................23

7 Measurement of "standard" parameters..................................................................................................25 7.1 Sending frequency response .............................................................................................................................26 7.2 Receiving frequency response ..........................................................................................................................26 7.3 Overall frequency response ..............................................................................................................................27 7.4 Sending (and connection) loudness rating........................................................................................................27 7.5 Receiving (and connection) loudness rating.....................................................................................................27 7.6 Overall loudness rating.....................................................................................................................................28 7.7 Sidetone masking rating ...................................................................................................................................28 7.8 Listener sidetone ..............................................................................................................................................29 7.9 Measurement and calculation of the value of the D-factor (DelSM)................................................................30 7.10 Delay ................................................................................................................................................................31 7.10.1 Delay in sending direction ..........................................................................................................................31 7.10.2 Delay in receiving direction........................................................................................................................31 7.10.3 Overall Delay..............................................................................................................................................32 7.11 Terminal coupling loss .....................................................................................................................................32 7.12 Talker echo loudness rating..............................................................................................................................32 7.13 Weighted echo path loss...................................................................................................................................34 7.14 Distortion..........................................................................................................................................................34 7.14.1 Distortion in sending...................................................................................................................................34 7.14.2 Distortion in receiving ................................................................................................................................35 7.14.3 Overall distortion ........................................................................................................................................36 7.15 Sensitivity against out-of-band signals in sending ...........................................................................................37 7.16 Spurious out-of-band signals in receiving ........................................................................................................37

8 Advanced measurement procedures, taking into account the conversational situation..........................38

Page 4: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 4

8.1 Measurement setup for objective tests..............................................................................................................40 8.2 Practical realization of test signals ...................................................................................................................40 8.3 Quality of background noise transmission .......................................................................................................40 8.3.1 Background noise transmission with far end speech ..................................................................................40 8.3.2 Background noise transmission with near end speech................................................................................41 8.4 Double talk performance ..................................................................................................................................42 8.5 Switching characteristics ..................................................................................................................................44 8.6 Level Adjustments by Companding or AGC ...................................................................................................47 8.7 Additional echo disturbances ...........................................................................................................................48 8.8 Speech Sound Quality ......................................................................................................................................48

Annex A (informative): Bibliography...................................................................................................49

History ..............................................................................................................................................................50

Page 5: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 5

Intellectual Property Rights IPRs essential or potentially essential to the present document may have been declared to ETSI. The information pertaining to these essential IPRs, if any, is publicly available for ETSI members and non-members, and can be found in ETSI SR 000 314: "Intellectual Property Rights (IPRs); Essential, or potentially Essential, IPRs notified to ETSI in respect of ETSI standards", which is available from the ETSI Secretariat. Latest updates are available on the ETSI Web server (http://webapp.etsi.org/IPR/home.asp).

Pursuant to the ETSI IPR Policy, no investigation, including IPR searches, has been carried out by ETSI. No guarantee can be given as to the existence of other IPRs not referenced in ETSI SR 000 314 (or the updates on the ETSI Web server) which are, or may be, or may become, essential to the present document.

Foreword This ETSI Guide (EG) has been produced by ETSI Technical Committee Speech Processing, Transmission and Quality Aspects (STQ).

The present document provides technical requirements for assessing the conversational speech quality performance parameters from mouth-to-ear independent of the technology used.

The present document is part 2 of a multi-part deliverable covering the specification and measurement of speech transmission quality, as identified below:

Part 1: "Introduction to objective comparison measurement methods for one-way speech quality across networks";

Part 2: "Mouth-to-ear speech transmission quality including terminals";

Part 3: "Non-intrusive objective measurement methods applicable to networks and links with classes of services".

Introduction Various standards within ETSI, ITU, TIA and other standardization organizations describe performance requirements for different types of terminals, networks and network components. In each standard emphasis is given typically only to a part of the overall connection. The speech quality perceived by the user however is influenced by any component in the overall connection. In modern complex network and en to end (mouth-to-ear) configurations there is no guarantee for a sufficient overall performance if only the individual components conform to their relevant standards. Furthermore many of the existing testing specifications still assume a linear and time invariant behaviour of the components which due to complex signal processing in most of the modern communication devices can no longer be expected. Only a few standards exist which describe test procedures and requirements for the interaction of different network components with the different types of terminals.

The present document addresses the mouth-to-ear speech quality taking into account all conversational aspects. An overview about different network/terminal configurations and their specific impact on speech quality is given. The present document describes testing procedures and setups for different configurations.

Page 6: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 6

1 Scope The present document addresses mouth-to-ear (i.e. end-to-end speech quality for 3,1 kHz telephony). It both:

a) summarizes and gives guidance about the main factors that affect speech quality in end-to-end scenarios; and

b) specifies test methods for end-to-end speech quality testing.

The test methods can be used both for the complete transmission from mouth-to-ear and also for testing individual sections of a connection.

The end-end (mouth-to-ear) test methods specified in the present document are independent of the technology used in the network and the terminals. However when practical considerations make it necessary to test at electrical interfaces within or between equipments the document explains how to handle the most common current technologies.

The present document is designed to be used by:

• terminal and terminal component (e.g. soundcard) developers who wish to evaluate the end-to-end performance of networks and their terminals (or components); or

• network designers who wish to evaluate the end-to-end performance of their networks with typical terminals.

And therefore it gives advice on how networks and representative terminals (respectively) can be selected or simulated for use in the end-to-end tests.

The test methods described allow the evaluation of all conversational situations such as single talk and double talk by means of objective procedures.

The present document takes account of:

a) all types of terminals, including handsets, headsets and dedicated hands-free arrangements such as are provided with some mobile terminals and PC based terminals;

b) both circuit switched and packet based networks, including IP and ATM.

The present document is not generally suitable for wideband telephony or other forms of wideband communication although the parametric approach and the measurement procedures for some of the parameters described in the document are applicable for wideband communication as well.

2 References The following documents contain provisions which, through reference in this text, constitute provisions of the present document.

• References are either specific (identified by date of publication and/or edition number or version number) or non-specific.

• For a specific reference, subsequent revisions do not apply.

• For a non-specific reference, the latest version applies.

Referenced documents which are not found to be publicly available in the expected location might be found at http://docbox.etsi.org/Reference.

[1] ETSI EG 201 377-1: "Speech Processing, Transmission and Quality Aspects (STQ); Specification and measurement of speech transmission quality; Part 1: Introduction to objective comparison measurement methods for one-way speech quality across networks".

[2] ETSI TR 101 110: "Digital cellular telecommunications system (Phase 2+) (GSM); Characterisation, test methods and quality assessment for handsfree Mobile Stations (MSs)".

Page 7: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 7

[3] ETSI EG 201 050: "Speech Processing, Transmission and Quality Aspects (STQ); Overall Transmission Plan Aspects for Telephony in a Private Network".

[4] ETSI TBR 008 (1998): "Integrated Services Digital Network (ISDN); Telephony 3,1 kHz teleservice; Attachment requirements for handset terminals".

[5] ETSI TR 102 251: "Speech Processing, Transmission and Quality Aspects (STQ); Anonymous Test Report from 2nd Speech Quality Test Event 2002".

[6] ITU-T Recommendation G.821: "Error performance of an international digital connection operating at a bit rate below the primary rate and forming part of an Integrated Services Digital Network".

[7] Gierlich, H.W.; Kettler, F., Diedrich, E.: "Speech Quality Evaluation of Hands-Free Telephones During Double Talk: New Evaluation Methodologies"; EUSIPCO 1998, Proceedings, Vol. II.

[8] Gierlich, H.W.: "The auditory perceived quality of hands-free telephones: Auditory judgements, instrumental measurements and their relationship", Speech Communication 20 (1996) 241-254, December 1996.

[9] IEC 61260: "Electroacoustics - Octave-band and fractional-octave-band filters".

[10] IEC 61672: "Electroacoustics - Sound level meters".

[11] ISO 3 (1973): "Preferred numbers - Series of preferred numbers".

[12] ITU-T Recommendation G.107: "The E-model, a computational model for use in transmission planning".

[13] ITU-T Recommendation G.111: "Loudness ratings (LRs) in an international connection".

[14] ITU-T Recommendation G.122: "Influence of national systems on stability and talker echo in international connections".

[15] ITU-T Recommendation G.131: "Talker echo and its control".

[16] ITU-T Recommendation G.168: "Digital network echo cancellers".

[17] ITU-T Recommendation G.712: "Transmission performance characteristics of pulse code modulation channels".

[18] ITU-T Recommendation O.131: "Quantizing distortion measuring equipment using a pseudo-random noise test signal".

[19] ITU-T Recommendation O.132: "Quantizing distortion measuring equipment using a sinusoidal test signal".

[20] ITU-T Recommendation P.340: "Transmission characteristics of hands-free telephones".

[21] ITU-T Recommendation P.380: "Electro-acoustic measurements on headsets".

[22] ITU-T Recommendation P.50: "Artificial voices".

[23] ITU-T Recommendation P.501: "Test signals for use in telephonometry".

[24] ITU-T Recommendation P.502: "Objective test methods for speech communication systems using complex test signals".

[25] ITU-T Recommendation P.51: "Artificial mouth".

[26] ITU-T Recommendation P.57: "Artificial ears".

[27] ITU-T Recommendation P.58: "Head and torso simulator for telephonometry".

[28] ITU-T Recommendation P.581: "Use of Head and Torso Simulator (HATS) for hands-free terminal testing".

Page 8: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 8

[29] ITU-T Recommendation P.64: "Determination of sensitivity/frequency characteristics of local telephone systems".

[30] ITU-T Recommendation P.79 and Corrigendum 2 (2001): "Calculation of loudness ratings for telephone sets".

[31] ITU-T Recommendation P.800: "Methods for subjective determination of transmission quality".

[32] ITU-T Recommendation P.810: "Modulated noise reference unit (MNRU)".

[33] ITU-T Recommendation P.830: "Subjective performance assessment of telephone-band and wideband digital codecs".

[34] ITU-T Recommendation P.831: "Subjective performance evaluation of network echo cancellers".

[35] ITU-T Recommendation P.832: "Subjective performance evaluation of hands-free terminals".

[36] ITU-T Recommendation P.862: "Perceptual Evaluation of Speech Quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs".

[37] ITU-T Recommendation Y.1541: "Network performance objectives for IP-based services".

[38] ITU-T COM12-42 (Federal Republic of Germany, January 1998): "Listening only test results for hands-free telephones and their dependence upon room surroundings".

[39] TIA/EIA 810-A: "Telecommunications - Telephone Terminal Equipment-Transmission Requirements for Narrowband".

[40] ITU-T Recommendation P.59: "Artificial conversational speech".

[41] ITU-T Recommendation G.711: "Pulse code modulation (PCM) of voice frequencies".

[42] ETSI TS 100 961: "Digital cellular telecommunications system (Phase 2+) (GSM); Full rate speech; Transcoding (GSM 06.10 version 7.1.0 Release 1998)".

[43] ETSI EN 300 969: "Digital cellular telecommunications system (Phase 2+) (GSM); Half rate speech; Half rate speech transcoding (GSM 06.20 version 8.0.1 Release 1999)".

[44] ETSI EN 300 726: "Digital cellular telecommunications system (Phase 2+) (GSM); Enhanced Full Rate (EFR) speech transcoding (GSM 06.60 version 8.0.1 Release 1999)".

[45] ETSI EN 300 903: "Digital cellular telecommunications system (Phase 2+) (GSM); Transmission planning aspects of the speech service in the GSM Public Land Mobile Network (PLMN) system (GSM 03.50 version 8.1.1 Release 1999)".

[46] ISO 9614 (all parts): "Acoustics - Determination of sound power levels of noise sources using sound intensity".

[47] K. Genuit: "Evaluation of Acoustic-Quality Based on a Relative Approach", Inter-Noise"96, 25th Anniversary Congress Liverpool, 30.07-02.08.1996, Conference Proceedings (Book 6 / ISBN: 1 873082 90 8), pp. 3233-3238, Liverpool, England.

Page 9: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 9

3 Definitions and abbreviations

3.1 Definitions For the purposes of the present document, the following terms and definitions apply:

Acoustic Reference Level (ARL): acoustic level at MRP which results in a -10 dBm0 output at the digital interface

artificial ear: device for the calibration of earphones incorporating an acoustic coupler and a calibrated microphone for the measurement of the sound pressure and having an overall acoustic impedance similar to that of the median adult human ear over a given frequency band

codec: combination of an analogue-to-digital encoder and a digital-to-analogue decoder operating in opposite directions of transmission in the same equipment

ear-Drum Reference Point (DRP): point located at the end of the ear canal, corresponding to the ear-drum position

Ear Reference Point (ERP): virtual point for geometric reference located at the entrance to the listener's ear, traditionally used for calculating telephonometric loudness ratings

end-to-end: endpoints of a (telephone) connection between two subscribers, either between the NTPs (e.g. for bearer services), or for speech communication between mouth and ear

Head And Torso Simulator (HATS) for telephonometry: manikin extending downward from the top of the head to the waist, designed to simulate the sound pick-up characteristics and the acoustic diffraction produced by a median human adult and to reproduce the acoustic field generated by the human mouth.

NOTE: HATS shall conform to ITU-T Recommendation P.58 [27].

HATS position: correct handset position for measuring sensitivity and frequency response characteristics

NOTE: The HATS position has been shown to be essentially identical to the LRGP (loudness rating guard-ring position) position, except for the mouth simulator direction, which has been corrected with a 19 degrees downwards rotation to more closely match real talkers. For handsets with omnidirectional microphones, measurements on the two heads may differ slightly, typically less than 1 dB. For handsets with directional or noise-cancelling microphones, the differences will be larger, and the HATS position will give the more realistic results. See ITU-T Recommendation P.64 [29] annexes D and E and [7].

Hands-Free Reference Point (HFRP): point located on the axis of the artificial mouth, at 50 cm from the outer plane of the lip ring, where the level calibration is made under free-field conditions

HATS Hands-Free Reference Point (HATS-HFRP): corresponds to a reference point "n" from ITU-T Recommendation P.58 [27]: "n" shall be one of the points numbered from 11 to 17 and defined in table 6a/P.58 (coordinates of far field front point). The HATS HFRP depends on the location(s) of the microphones of the terminal under test: the appropriate axis lip-ring/HATS HFRP shall be as close as possible to the axis lip-ring/HFT microphone under test. (see ITU-T Recommendation P.581 [28])

mouth-to-ear: endpoints of a telephone connection between two subscribers between mouth and ear

Mouth Reference Point (MRP): is located on axis and 25 mm in front of the lip plane of a mouth simulator

pinna simulator: device which has the approximate shape and dimensions of a median adult human pinna

reference volume control setting: the receive volume control setting which results in the Receive Loudness Rating (RLR) closest to the target value (centre of the RLR tolerance range)

NOTE: There may be separate settings for handset, headset and hands-free modes.

Page 10: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 10

sound pressure levels: value expressed as a ratio of the pressure of a sound to a reference pressure. The following sound level units are used in EG 201 377-2:

dBPa: the sound pressure level, in decibels, of a sound is 20 times the logarithm to the base 10 of the ratio of the pressure of this sound to the reference pressure of 1 Pascal (Pa).

NOTE: 1 Pa = 1 N/m2.

dBSPL: the sound pressure level, in decibels, of a sound is 20 times the logarithm to the base 10 of the ratio of the pressure of this sound to the reference pressure of 2 X 10-5 N/m2 (0 dBPa corresponds to 94 dBSPL).

dB(A): the A-weighted sound level is the sound pressure level e.g. in dBSPL, weighted by use of metering characteristics and A-weighting specified in IEC 61672 [10].

electric power and noise levels: the following electric power and noise level units are used in EG 201 377-2:

dBm0: the absolute power level at a digital reference point of the same signal that would be measured as the absolute power level, in dBm, if the reference point was analogue. The absolute power in dBm is defined as 10 log (power in mW / 1 mW). When the impedance is 600 ohm resistive, dBm can be referred to a voltage of 0,775 volts, which results in a reference active power of 1 mW. Note that 0 dBm0 is not the maximum digital code. For the L16-256 wideband codec adopted by TIA TR-41, 0 dBm0 is 3,14 dB below digital full scale.

3.2 Abbreviations For the purposes of the present document, the following abbreviations apply:

ADCPM Adaptive Differential Pulse Coded Modulation BER Bit Error Rate BSC Base Station Controller BTS Base Transceiver Station C/I Carrier to Interference ratio D D-Value of Terminal DCME Digital Circuit Multiplication Equipment DRP ear Drum Reference Point DTX Discontinuous Transmission EL Echo Loss ERL Echo Return Loss FER Frame Erasure Rate HATS Head And Torso Simulator HATS HFRP HATS Hands-Free Reference Point HFT Hands-Free Terminal HFRP Hands Free Reference Point LE Earphone Coupling Loss

LSTR Listener Sidetone Rating LRGP Loudness Rating Guard-ring Position MRP Mouth Reference Point MSC Mobile service Switching Centre Nc Circuit Noise referred to the 0 dBr-point NTP Network Termination Point OLR Overall Loudness Rating PCM Pulse Code Modulation PESQ Perceptional Evaluation of Speech Quality PLC Packet Loss Concealment PLMN Public Land Mobile Network PSTN Public Switched Telephone Network SLR Sending Loudness Rating SND Signal + Noise + Distortion STMR Sidetone Masking Rating RLR Receiving Loudness Rating TCLw Terminal Coupling Loss (weighted)

Page 11: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 11

TELR Talker Echo Loudness Rating TOSQA Telecommunication Objective Speech Quality Assessment TRC Transcoder Controller qdu number of quantizing distortion units WEPL Weighted Echo Path Loss

4 General considerations for end-to-end speech quality evaluations

When evaluation the overall speech transmission quality, networks and terminals may influence quite significantly the speech quality of a connection: Coding, delay and processing techniques like speech echo cancellers packetizing or DCME are mainly introduced by the network(s) but similar signal processing can be found in terminals as well. The transfer functions and loudness ratings of a connection are mainly determined by the terminals, the background noise and the background noise transmission are highly influenced by the terminal and the acoustical environment the terminal is exposed to. The conversational properties which are the most important ones in a conversation are determined by the terminal in combination with the network: double talk capability, switching characteristics, echo and delay are dominant impairments often introduced.

In order to find the determining factors a set of subjective test procedures have been developed allowing to extract the dominant quality aspects: Conversational test, talking and listening tests, double talk tests and listening only tests as described in [8], [31], [32], [33], [34] and [35] are the basis of the parameter extraction procedure.

An overview of the methodologies is given in figure 1.

Page 12: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 12

T1212320-00

Sound quality

Testenvironment

Auditorytests

Performanceparameters

Realistic Conditions"Human Factor"

Conversational tests2 subjects involved

(one at the near end ofthe telephone connection,the other at the far end)

or

Incr

easi

ng r

eali

ty o

f th

e te

stin

g

Incr

easi

ng c

ompa

rabi

lity

, mor

e an

alyt

ic

End-to-end speechtransmission quality

Difficulty in communicating

Double talktests

2 subjects involved(one at the near end of

the telephone connection,the other at the far end)

Talking and listeningtests

1 artificial head used(at the near end of thetelephone connection),

1 subject involved(at the far end), acting as

listener and talker

Annoyance caused byechoes and switching

Double talk performance

Method of backgroundnoise transmission

Speech level variationsvs. time

Comparison of individual parameters under defined

conditions

Classification of disturbances

Database for further tests

Listening-onlytests

2 artificial heads used(one at the near end of

the telephone connection,the other at the far end)1 subject as "observer"

Measurement conditions(exactly defined and

identical for each tests and reproducible)

NOTE: The assignment of "near end" and "far end" is chosen according to the E-model (ITU-T Recommendation G.107 [12]).

Figure 1a: Overview of test methods used for subjective evaluation - direct parameter access

Page 13: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 13

T1212330-00

Sound quality

Testenvironment

Auditorytests

Performanceparameters

Realistic Conditions"Human Factor"

Conversational tests2 subjects involved

(one at the near end ofthe telephone connection,the other at the far end)

or

Incr

easi

ng r

eali

ty o

f th

e te

stin

g

Incr

easi

ng c

ompa

rabi

lity

, mor

e an

alyt

ic

End-to-end speechtransmission quality

Difficulty in communicating

Double talktests

2 subjects involved(one at the near end of

the telephone connection,the other at the far end)

Talking and listeningtests

1 artificial head used(at the near end of thetelephone connection),

1 subject involved(at the far end), acting as

listener and talker

Annoyance caused byechoes and switching

Double talk performance

Method of backgroundnoise transmission

Speech level variationsvs. time

Comparison of individual parameters under defined

conditions

Classification of disturbances

Database for further tests

Listening-onlytests

2 artificial heads used(one at the near end of

the telephone connection,the other at the far end)1 subject as "observer"

Measurement conditions(exactly defined and

identical for each tests and reproducible)

NOTE: The assignment of "near end" and "far end" is chosen according to the E-model (ITU-T Recommendation G.107 [12]).

Figure 1b: Overview about test methods used for subjective evaluation-

parameter access via interviews

Page 14: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 14

T1212340-00

Sound quality

Testenvironment

Auditorytests

Performanceparameters

Realistic Conditions"Human Factor"

Conversational tests2 subjects involved

(one at the near end ofthe telephone connection,the other at the far end)

or

Incr

easi

ng r

eali

ty o

f th

e te

stin

g

Incr

easi

ng c

ompa

rabi

lity

, mor

e an

alyt

ic

End-to-end speechtransmission quality

Difficulty in communicating

Double talktests

2 subjects involved(one at the near end of

the telephone connection,the other at the far end)

Talking and listeningtests

1 artificial head used(at the near end of thetelephone connection),

1 subject involved(at the far end), acting as

listener and talker

Annoyance caused byechoes and switching

Double talk performance

Method of backgroundnoise transmission

Speech level variationsvs. time

Comparison of individual parameters under defined

conditions

Classification of disturbances

Database for further tests

Listening-onlytests

2 artificial heads used(one at the near end of

the telephone connection,the other at the far end)1 subject as "observer"

Measurement conditions(exactly defined and

identical for each tests and reproducible)

NOTE: The assignment of "near end" and "far end" is chosen according to the E-model (ITU-T Recommendation G.107 [12]).

Figure 1c: Overview about test methods used for subjective evaluation -

parameter access by including reference conditions

Page 15: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 15

The subjectively relevant parameters determining the "speech transmission quality" are as follows.

The overall quality is determined by:

• Delay and echo.

• Sound quality.

• Quality of background noise transmission at idle, in single talk and double talk conditions.

• Speech level variations during single talk and double talk.

• Disturbances caused by switching during single talk and double talk (completeness of speech transmission).

• Disturbances caused by echoes during single talk and double talk.

Consequently the evaluation methods need to be divided into single talk measurements and double talk evaluations. In addition evaluations are required during periods of silence where only background noise is present.

Since the typical test setup should include all components involved in the mouth-to-ear transmission a test arrangement should include the terminals "attached" to a realistic substitution of a user and his typical environment. Figure 2 illustrates how a test setup from end-to-end may look like typically.

Figure 2: Typical test setup for determining the speech transmission quality from end-to-end (mouth-to-ear) by subjective evaluation of the speech quality relevant parameters (example for

handset/hands-free communication)

Test setups as shown in figure 2 are used in auditory (subjective) tests to determine the quality aspects subjectively (see [31], [32], [33], [34] and [35]). From these evaluations procedures have been derived which allow the objective testing of the relevant parameters of terminals (or even end-to-end scenarios).

5 Test configurations This clause describes the test setups for terminals, networks and their various combinations. Since the document describes the general aspects of end-to-end speech quality testing, specific test setups and configuration description are made only in general. In case any specific description of terminal or network setups is needed (e.g. buffer sizes, type of codecs, packet loss simulations) these descriptions need to be found in the relevant standards of such transmission systems.

Page 16: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 16

5.1 Test setup for terminals The general access to terminals is described in figure 3. The traditional way to test handset-terminals is the LRGP-position using type 1 artificial ear and the artificial mouth according to ITU-T Recommendation P.51 [25]. positioned in LRGP (loudness rating guard ring position , ITU-T Recommendation P.64 [29]).The preferred acoustical access to terminal is the most realistic simulation of the "average" subscriber. This can be made by using HATS (Head And Torso Simulator) with appropriate ear simulation and appropriate means to fix handset, headset or hands-free terminals in a realistic by reproducible way to the HATS. HATS is described in ITU-T Recommendation P.58 [27], appropriate ears are described in ITU-T Recommendation P.57 [26] (type 3.3 and type 3.4 ear), a proper positioning of handsets in realistic conditions is found in ITU-T Recommendation P.64 [29], the test setups for various types of hands-free terminals can be found in ITU-T Recommendation P.581 [28].

The preferred way of testing a terminal is either to connect it to a network simulator with exact defined settings and access points or, in case of end-to-end scenarios, to connect the terminal to the "typical" network it is used in. The test sequences are fed in either electrically, using a reference codec or using the direct signal processing approach or acoustically using ITU-T specified devices.

ITU-Ttest sequences

( P.501, P.50, P.59)

ITU-Ttest sequences

( P.501, P.50, P.59)

Network Simulator

or actual networkElectrical

PartAcoustic

Part

TerminalNTP

ElectricalPart

AcousticPart

NTPTerminal

Figure 3: Test setup for terminals, acoustical access in end-to-end scenarios including a network or using a network simulator

Reference Codec e.g. G.711, G.726,

GSM...

Network Simulator

or actual network Electrical

Part

ITU-T test sequences

( P.501, P.50, P.59)

Electrical Part

Acoustic Part

Terminal

0 dBr Reference point

NTP

Reference access

e.g. DAI

Figure 4: Test setup for terminals, electrical access using a "reference" access or a network simulator

NOTE: Instead of using HATS as shown in figures 3 and 4 the test may be conducted using the LRGP positioning for handsets. However it should be noted that this will lead to more simplified acoustical conditions and such may reduce the validity of the measurement, especially in noisy environments and for terminals with unknown acoustical properties.

Page 17: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 17

5.1.1 Setup for handset terminals

When using a handset telephone the handset is placed in the HATS position as described in ITU-T Recommendation P.64 [29]. The artificial mouth shall conform with ITU-T Recommendation P.58 [27] when HATS is used. The artificial ear shall conform with ITU-T Recommendation P.57 [26], type 3.3 or type 3.4 ears shall be used. For (traditional) standard type handset terminals (the earpeace of which can be naturally sealed to the circular rim of a type 1 or type 3.2 coupler) LRGP positioning according to ITU-T Recommendation P.64 [29] may be used as well. The artificial mouth shall conform with ITU-T Recommendation P.51 [25]. The artificial ear shall conform with ITU-T Recommendation P.57 [26], type 1 or type 3.2 ears shall be used.

5.1.2 Setup for headset terminals

When using a headset the headset is placed on a HATS conforming to ITU-T Recommendation P.58 [27]. The artificial mouth shall conform with ITU-T Recommendation P.58 [27] when HATS is used. The artificial ear shall conform with ITU-T Recommendation P.57 [26], type 3.3 or type 3.4 ears shall be used.

The headset shall be placed in its recommended wearing position. Further information about setup and the use of HATS can be found in ITU-T Recommendation P.380 [21].

5.1.3 Setup for hands-free type terminals and loudspeaking terminals

General definition of hands-free terminals from ITU-T Recommendation P.340 [20] All types of terminals, which cannot be fit to the LRGP-position or the HATS-position -except headsets- need to be considered as hands-free type terminals. ITU-T Recommendation P.581 [28] describes the setup for hands-free terminals which are not covered by the setup description in ITU-T Recommendation P.340 [20]

5.1.4 Position and calibration of HATS

All the sending and receiving characteristics shall be tested with the HATS and it shall be indicated in the test report that HATS is used, it shall be indicated what type of ear was used at what pressure force. In case of hands-free measurements the HATSHFRP(s) shall be used for the calibration(s), the reference point chosen for the HATSHFRP shall be indicated.

The horizontal positioning of the HATS reference plane shall be guaranteed within ±2°.

The HATS shall be equipped with two Type 3.3 or 3.4 artificial ears. For hands-free measurements the HATS shall always be equipped with two artificial pinnas. The pinnas are specified in Recommendation P.57 [26] for Types 3.3 and 3.4 artificial ears. The pinna shall be positioned on HATS according to ITU-T Recommendation P.58 [27].

The exact calibration and equalization procedures as well as the combination of the two ear signals for the purpose of measurements can be found in ITU-T Recommendation P.581 [28]. This Recommendation also describes the positioning for the various types of hands-free terminals.

5.2 Setup of the electrical interfaces If electrical interfaces to terminals shall be simulated proper interface simulation is needed. The appropriate information typically can be found in the relevant terminal standards for all standardized interfaces.

For all interfaces the following points should be taken into account:

• General points:

- Any nonlinearity or time variance measured should derive from the EUT and not from the test equipment or the test interface used.

- The dynamic range of the test equipment has to exceed the dynamic range of the EUT by at least 10 dB, i.e. the noise floor of the test equipment has to be at least 10 dB below the noise floor of the EUT or at least 10 dB below the specified requirements.

- All nonlinear distortions introduced by the test equipment must not influence any measurement result (e.g. distortion measurements).

Page 18: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 18

- Any delay introduced by the test equipment has to be taken into account for delay and echo measurements.

• Analogue interfaces:

- Typically the impedance of the measuring device has to match the impedance of the interface of the EUT. Alternatively low impedance outputs are terminated by high impedance inputs.

- The dynamic range of the EUT has to be matched by the test equipment, i.e. the reference levels for the electric or the acoustical interface have to be defined and matched.

• Digital interfaces:

- The electrical specifications of the interface of the EUT (clock, frame, data, level, impedance,…) have to be matched by the test equipment.

- Care must be take to avoid jitter problems e.g. by improper configuration of master/slave.

- It is advisable to use a common system clock for the measurement of complex configurations.

- Any speech codec used must fully meet the codec specification of the speech codec used in the EUT.

5.3 Test signals Due to the extensive coding of the speech signals, standard test signals are not applicable for the tests, appropriate test signals (general description) are defined in ITU-T Recommendations P.50 [22] and P.501 [23]. More information can be found in the test procedures described below. Care should be taken, that the test signal used offers sufficient wideband signal energy in case a wideband system is tested.

For narrow band terminals the test signal used shall be bandfiltered between 200 Hz and 4 kHz with a bandpass filter providing a minimum of 24 dB/octet filter steepness, when feeding into the receiving direction.

The test signal levels are referred to the average level of the (band filtered in receiving direction) test signal, averaged over a period of 10 s. Unless specified otherwise, the averaging time for all measurements is 10 s.

5.4 Accuracy of test equipment Unless specified otherwise, the accuracy of measurements made by test equipment shall be better than:

Item Accuracy Electrical Signal Power Electrical Signal Power Sound pressure Time Frequency

±0,2 dB for levels ≥ -50 dBm ±0,4 dB for levels < -50 dBm ±0,7 dB ±5 % ±0,2 %

Unless specified otherwise, the accuracy of the signals generated by the test equipment shall be better than:

Quantity Accuracy Sound pressure level at MRP Electrical excitation levels Frequency generation

±1 dB for 200 Hz to 4 kHz ±3 dB for 100 Hz to 200 Hz and 4 kHz to 8 kHz ±0,4 dB (see note 1) ±2 % (see note 2)

NOTE 1: Across the whole frequency range. NOTE 2: When measuring sampled systems, it is advisable to avoid measuring at

sub-multiples of the sampling frequency. There is a tolerance of ±2 % on the generated frequencies, which may be used to avoid this problem, except for 4 kHz where only the -2 % tolerance may be used.

The measurements results shall be corrected for the measured deviations from the nominal level.

Page 19: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 19

The sound level measurement equipment shall conform to IEC 61672 [10] Type 1.

6 Test conditions

6.1 Acoustic environment In general two approaches need to be taken into consideration: Either the room and the background noise in the room contributes to the test parameter measured (e.g. performance measurements of acoustic echo cancellers) or the influence of the room and the background noise should be eliminated as much as possible (e.g. if loudness ratings of handsets are measured).

All measurements where no background noise or background noise simulation is needed should be conducted in quiet and "anechoic conditions". Depending on the distance of the transducers to mouth and ear a quiet office room may be sufficient e.g. for handsets where artificial mouth and artificial ear are located close to the acoustical transducers. For hands-free terminals as well as for some headsets or small handset terminals an anechoic room is needed for these measurements. The potential error introduced by the environmental conditions should be less than 10 % of desired measurement accuracy.

In all conditions where the performance of acoustic echo cancellers is tested a realistic room representing the typical use condition for the terminal needs to be used.

In all conditions where the terminal performance in the presence of background noise is evaluated a representative background noise or different background noises are needed. Proper background noise simulation also may serve for this purpose. Care needs to be taken to realistically reproduce the background noise in level, frequency, temporal and spatial distribution (see TR 101 110 [2]).

6.2 Network conditions, general The speech quality is dependent on a good transmission network. High Bit Error Rate (BER) packet loss without proper concealment and jitter contribute to degraded speech.

The speech levels are important for the user perception of the speech but also for the handling of the speech within the complete transmission system.

Low speech levels results in a low acoustical output volume. In analogue systems low speech levels may lead to insufficient signal to noise ratio, too high speech levels may lead to distorted and clipped speech signals. In digital systems speech coders and speech transcoders can not utilize the full dynamic range. Too high speech levels result in distorted speech. The level of distortion is dependent on the top and average speech level of the individual speaker, clipping can occur. Too low speech levels may lead to insufficient signal to noise ratio but to distorted speech signals as well due to the insufficient use of the dynamic range provided by the speech coders.

Unbalanced levels increase the risk for echo problems. The performance of the echo canceller is sensitive to a too large unbalance between sending and receiving.

The intention shall be to have the same line levels for the entire telephone network. Normal line levels have an average of -22 dBm0 to -16 dBm0 with a crest factor of about 16 dB. The individual differences can be large. For the average speaker this results in a peak level of -6 to 0 dBm0. The clipping level is 3,14 dBm0 for PCM A-law coded speech and 3,17 dBm0 for PCM µ-law coded speech (ITU-T Recommendation G.711 [41]) and the clipping margin is then 3 dB, for the worst "normal" case.

Echo effects are present in all types of networks. The echo is not normally noticed in the traditional PSTN because of the short round trip delay. As soon as the one way transmission time exceeds 15 ms however echo might be noticeable. Generally there are two sources of echoes: one is the hybrid echo the other is acoustic echo. The hybrid echoes are created in the PSTN network, by the hybrid used when changing from four to two wire circuits. The hybrid normally provides an Echo Return Loss (ERL) of > 15 dB, that means the echo is 15 dB lower than the unreflected signal. The acoustic echoes are created by acoustic coupling between speaker and microphone. The level of the acoustic echo is dependent on the TCL (terminal coupling loss) of the terminal.

If a satellite connection is used as part of the connection, the additional delay causes degradation in the conversational speech quality.

Page 20: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 20

The test setup chosen should be representative for the type of network to be evaluated. Either a network simulation

should be used or a real network (under controlled conditions) should be used. The network condition should be reproducible. If this can not be guaranteed a statistical approach should be chosen measuring the relevant parameters at various times in order to find average, maximum and minimum quality provided by the configuration.

6.2.1 Network conditions, PSTN

In general all parameters determining the speech quality in circuit switched networks are well controlled. PSTN type networks work isochronously. Typically loss of data is fairly unlikely. The delays in a connection are constant and within the control of the network operators. The network interfaces in such networks are well defined. In analogue networks the line characteristics (length and type of lines) has to be taken into account since it may have impact on loudness ratings, frequency responses, sidetone performance and trans-hybrid losses.

• Echo cancellation.

As soon as a delay higher then 25 ms (mouth-to-ear) is introduced in a connection echo cancellation has to be provided. In PSTN networks this is typically concentrated at international exchanges since within national or continental connections the network delay is typically below 25 ms. The echo cancellation is intended to cancel echoes produced by the hybrid of analogue terminals. The maximum echo loss to be provided is generally determined by the maximum delay which may be inserted in a connection. However most of modern network echo cancellation should conform to ITU-T Recommendation G.168 [16] which assumes 250 ms one way transmission time and provides adequate performance limits.

• Speech levels.

Different speech levels e.g. due to different national level settings, different attenuation in analogue systems can be expected and may degrade the performance of echocancellers and other signal processing the network and the terminals.

• Speech coding.

Low bitrate speech coding is typically not found in PSTN networks although ADPCM type coding is used quite frequently used in cordless terminals.

When DCME are inserted in a PSTN network all considerations found in clause 6.2.2 apply.

6.2.2 Network Conditions, packet based transmission

Packet based networks differ considerably from circuit switched networks. The input signal (speech signal) is segmented into packets typically of fixed length. Typical packet lengths used are in the range of 5 ms to 30 ms. By the packetization itself a minimum delay namely the packet length is introduced into the transmission.

• Delay, packet loss and jitter.

Furthermore the packets of different sources (e.g. terminals or subscribers) have to be sequenced for transmission. This process may add additional delay - as well as any buffering or routing in a connection. While in isochronous networks the delay of a connection is constant and predictable, the delay in IP-networks may vary between different connections depending on network load, routing, bandwidth, switching and others. In addition jitter may be introduced by the IP connection. Jitter occurs when the delay of each individual packet varies. Again, jitter is depending on network load, transmission bandwidth, switching and others. The delay and jitter is not predictable in IP-networks and therefore it is described by statistical parameters such as delay (average and distribution) and packet loss distribution.

In IP-networks the receiver has to take care of collection of the packets and their correct ordering. Jitter buffers are used in order to collect and sequence the packets received in such a way that a mostly error free transmission is achieved.

While in isochronous TDM networks the packet loss is negligible and almost no jitter occurs, packet loss and jitter are important factors in IP-connection and may highly influence the speech quality. When assessing the speech quality of IP networks these parameters (packet loss and jitter) have to be taken into account and speech quality measurements have to be conducted for the various network conditions to be expected. It is advisable to make appropriate simulations in a lab-type environment in order to get an estimate of the speech quality range to be expected under real use conditions (see ITU-T Recommendation Y.1541 [37]).

Page 21: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 21

• Echo cancellation.

As soon as a delay higher then 15 ms is introduced in a connection echo cancellation has to be provided, either in the terminal or in the network. In IP connections gateways have to be equipped with echo cancellers providing sufficient echo loss for all expected delays in a connection. The maximum echo loss to be provided is determined by the maximum delay which may be inserted in a connection. Since under worst case conditions the maximum one way transmission delay may be higher than 300 ms the minimum echo loss which has to be provided is 55 dB (see ITU-T Recommendation G.131 [15]). Consequently any IP-terminal connected to an IP network has to provide the same amount of echo loss (see TIA/EIA 810-A [39]).

• Speech levels.

Different speech levels e.g. due to computer terminals with insufficiently controlled LRs or equipment with codec which are not properly implemented may degrade the performance of echocancellers and other signal processing in an IP-network.

• Speech coding.

Speech coding especially in combination with packet loss and jitter may add further impairments to the transmitted speech signal. Speech codecs with integrated Packet Loss Concealment algorithms (PLC) add less impairments in presence of packet loss than speech codecs without PLC.

• Silence suppression.

Frequently silence suppression is used, voice activity is detected by voice activity detection, silent periods (periods where the signal level is below a defined threshold) are suppressed and not transmitted in order to save bandwidth. At the far end side silent periods typically are replaced by a low level noise signal (comfort noise). Especially in background noise situations this procedure may lead to background noise modulations in combination with comfort noise which may be quite annoying for the user. Furthermore voice activity detection may lead to clipping of the actual speech signal.

In IP-networks the classical differentiation between terminal and network no longer can be made, all sorts of impairments mentioned above may be introduced by IP-terminals as well.

6.2.3 Network Conditions, GSM mobile

The speech quality is dependent on a good transmission network. High Bit Error Rate (BER) on the A-bis and A-ter interface contributes to degraded speech. The PCM coded speech used between TRC and MSC is less sensitive to the effects of high BERs than the GSM coded speech transmitted between the BTS and the TRC.

The audible effects on speech quality due to increasing or decreasing BER, on the A-bis and A-ter interface, are unnoticeable providing that the transmission fulfils the performance objectives stated in ITU-T Recommendation G.821 [6].

Figure 5: Network configuration

Page 22: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 22

The upper limit of the speech quality is set by the codecs (e.g. TS 100 961 [42], Full rate speech transcoding, EN 300 969 [43], Half rate speech transcoding and EN 300 726 [44], Enhanced full rate speech transcoding). The speech codec quality is fixed and the focus of this section is thus on the speech degradation factors. The following areas have been identified:

• Speech Levels:

The speech levels affect speaker level, distortion and echo canceller performance.

• Echo control:

Poor network echo cancelling results in disturbing echo for calls to the PSTN and PLMN with full duplex hands-free mobiles. Echo effect can also be caused by acoustic coupling in mobiles.

• Radio Network and Radio Network features:

High bit error rate as a result of a low Carrier/Noise (C/N) and/or Carrier/Interference (C/I) ratio causes interruptions or degradation of the speech. The speech degradation because of radio environment factors can be minimized by using the radio network features Discontinuous Transmission (DTX), Power control and Frequency hopping.

The use of the radio network feature DTX affect the speech quality by removing the true background. The handover procedure creates a short speech interruption.

• Transmission Network:

High Bit Error Rate (BER) causes degradation of the speech. The speech quality is sensitive to BER on the A-bis and A-ter interface.

• Mobile stations:

The implementation of audio functions in the mobile stations affects the speech quality, i.e. speaker, microphone and speech processing. The mobile station implementation is not further discussed in this clause.

6.2.3.1 Speech levels

The speech levels are important for the user perception of the speech but also for the handling of the speech within the system.

In general the same considerations apply as for networks in general. Due to the specific size and shape of mobile terminals however specific signal processing especially echo cancellation is found in the terminals. Care should be taken to avoid unbalanced speech levels. The performance of the echo canceller is sensitive to a too large unbalance between the up- and downlink. The worst situation is when the uplink level is higher than the downlink level. High speech levels in the down link increase the risk for acoustic echo in the mobile.

The intention shall be to have the same line levels for the entire telephone network. This is dependent on the transmission, and adjustments of the line level might be used.

6.2.3.2 Echo control

The mean one-way delay of a GSM System is about 90 ms. This means that the echo returns to the mobile with a 180 ms delay (if satellite transmission are used the delay will be even longer). The echoes can be of two different types, hybrid and acoustic, see figure 6.

The echoes generated outside the own PLMN are handled by the echo cancellers in the GSM system. This is because the echo generated outside the own PLMN, can be heard in the GSM mobile.

Page 23: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 23

Figure 6: The echoes generated in PSTN and PLMN networks

Hybrid echo

The hybrid echoes are created in the PSTN network, by the hybrid used when changing from four to two wire circuits. The hybrid normally provides an Echo Return Loss (ERL) of > 15 dB, that means the echo is 15 dB lower than the unreflected signal. Thus echo cancellation needs to be provided by the PLMN in order to prevent echo signals being heard by the mobile user.

Acoustic echo

In the PLMN network the mobiles should handle the echo control. The mobiles should provide an EL of 46 dB (see EN 300 903 [45], Transmission Planning Aspects of the Speech Service in the GSM Public Land Mobile Network (PLMN) System) to eliminate the need for internal PLMN echo cancellers. This is hard to achieve with mobiles. Voice switching, echo cancellation or echo suppressors are commonly used to achieve the 46 dB EL requirement.

The characteristics of the echo originated in the MS are very different from those of the PSTN echo. The acoustics echo from the MS is non-linear (due to speech coding and decoding in the echo path) and the end path delay is very long (of the order of 200 ms). These are the reasons why one cannot remove the MS echo with the same type of echo canceller that is developed for the PSTN echo, the echo generated by the mobile needs to be cancelled by the mobile itself before the speech codec is involved.

6.2.3.3 Radio network and radio network features

Provided that the rest of the system(s) is functioning correctly, i.e. do not contribute to degradation of speech quality, poor performance of the radio network (air interface) is the major contributor to degraded speech quality of a call. The radio network environment is affected by the factors:

• Co-channel Interference (C/I);

• Noise Limitations (C/N);

• mobile speed (fading frequency);

• time dispersion;

• adjacent channel interference (C/A).

During a call these factors together contribute to:

• Bit Error Rate (BER): the average amount of bit errors in a speech frame;

• Frame Erasure Rate (FER): the percentage of erased frames.

Page 24: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 24

When the BER and the FER increases the speech decoder will get less and less information about the coded speech and thus the speech quality will degrade, see figure 7. This is why a good cellplanning is needed to avoid these kinds of interferences as much as possible. The quality of the radio environment can further be improved by using the radio network features DTX, Power control and Frequency hopping. Another degrading factor are the handovers. While moving from cell to cell they will also introduce a speech quality disturbance.

Figure 7: Speech quality vs C/I

Discontinuous Transmission (DTX)

The radio network feature Discontinuous Transmission reduces the total interference in the network, it only transmits when speech is detected. In normal conversation this will lead to a decreased transmitting time of about 50 %. The general speech quality for the network can therefore be increased. The use of DTX affects the speech quality, both positively and negatively. Positively by reducing the total interference level in the network. Negatively by introducing speech clipping and by injecting comfort noise instead of the true background. This could be disturbing for connections with high background noise levels where the background is very different from that of the comfort noise generated. It should be noted that the DTX and comfort noise generation is part of the speech codec and can therefore differ from one codec to another.

Frequency hopping

Frequency hopping can reduce the effect on the speech quality caused by multipath fading and interference. The signal strength variations, or the interference, are broken up into pieces of duration short enough for the interleaving and speech coding process to correct the errors. The average speech quality is thus increased compared to a non frequency hopping network.

Power control

Dynamic power control used together with frequency hopping and DTX results in a further increase in the protection against fading dips and interference. This is achieved by the fact that the output power is controlled with respect to received signal strength as well as with respect to speech quality (BER). The improvement in interference protection can be utilized directly as an improvement in general speech quality. It can also be utilized in replanning the radio network for high capacity, while leaving the general speech quality constant.

Quality based power control contributes to better speech quality by:

• enhanced signal strength based part that gives increased C/I gains in the radio network compared with the existing power control;

• a quality based part that raises the power in the presence of interference, which gives generally better speech quality.

Page 25: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 25

Handover

At handover there will be a short speech interruption in both downlink and uplink. There will also be signalling going on in conjunction with the handover. This signalling will steal speech frames and can by that introduce some degradation. One message (the Physical Information message) will be repetitively sent in the new cell until the mobile and the system has established a connection in this cell. Since there is a delay in the communication between the system and the mobile, the system should not repeat the message to fast so as to avoid unnecessary repetitive signalling.

7 Measurement of "standard" parameters The standard parameter and the according measurements to be included here are the following:

• Frequency Response in Sending and Receiving Direction;

• Overall Frequency Responses;

• SLR Sending Loudness Rating;

• RLR Receiving Loudness Rating;

• OLR Overall Loudness Rating;

• STMR Sidetone Masking Rating;

• LSTR Listener Sidetone Rating;

• D D-Value of Terminal;

• TCLw Terminal Coupling Loss (weighted);

• WEPL Weighted Echo Path Loss;

• TELR Talker Echo Loudness Rating;

• qdu Number of Quantizing Distortion Units;

• Nc Circuit Noise referred to the 0 dBr-point;

• Distortion in Sending and Receiving Direction;

• Out-of-Band Signals in Sending and Receiving Direction.

In general the measurement principles are the same as used in other standards e.g. TBR 008 [4] but special consideration needs to be given to the following points:

• The appropriate measurement signal (especially with respect to the codec) needs to be chosen.

It is recommended to use a speech-like test signal. Speech like test stimuli can be found in ITU-T Recommendation P.50 [22] and P.501 [23].

• The averaging times used to determine the transfer characteristics need to be adapted to the measurement signal chosen.

• Instead of level measurements using sine wave signal excitation, typically Fourier transformation is used for calculation /estimation of the output spectra.

• The measured output spectrum is always referred to the signal spectrum in order to determine transfer functions, loudness ratings etc.

• New procedures need to be established in order to determine distortion, especially to determine the parameter "speech sound quality" in combination with the acoustical interface. The basis for such procedures may be found in EG 201 377-1 [1].

Page 26: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 26

NOTE: When using the E-model [12] for predicting the speech quality in terms of R-values the following, additional parameter need to be determined:

• Ie Equipment Impairment Factor (low bit-rate Codecs);

• Nfor Noise Floor at the Receive-side;

• Ps Room Noise at the Send-side;

• Pr Room Noise at the Receive-side.

For the evaluation of Ie values for low bit-rate codecs, some objective measurement methods have been developed or are under development (see ITU-T Recommendation P.862 [36] and EG 201 377-1 [1]).

7.1 Sending frequency response a) The test signal to be used for the measurements shall be the artificial voice according to ITU-T

Recommendation P.50 [22] or a speech like test signal as described in ITU-T Recommendation P.501 [23]. The type of test signal used shall be stated in the test report. The spectrum of acoustic signal produced by the artificial mouth is calibrated under free field conditions at the MRP. The test signal level shall be -4,7 dBPa, measured at the MRP. The test signal level is averaged over the complete test signal sequence.

b) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal setup is described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered.

Measurements shall be made at one twelfth-octave intervals as given by the R.40 series of preferred numbers in ISO 3 [11] for frequencies from 100 Hz to 4 kHz inclusive. For the calculation the averaged measured level at the electrical reference point for each frequency band is referred to the averaged test signal level measured in each frequency band at the MRP.

c) The sensitivity is expressed in terms of dBV/Pa.

7.2 Receiving frequency response a) The test signal to be used for the measurements shall be the artificial voice according to ITU-T

Recommendation P.50 [22] or a speech like test signal as described in ITU-T Recommendation P.501 [23]. The type of test signal used shall be stated in the test report. The test signal level shall be -16 dBm0, measured at the digital reference point or the equivalent analogue point. The test signal level is averaged over the complete test signal sequence.

b) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP position or the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal is setup as described in clause 5. The HATS is freefield equalized as described in ITU-T Recommendation P.581 [28]. The equalized output signal of each artificial ear is power-averaged on the total time of analysis; the "right" and "left" signals are voltage-summed for each 1/3 octave band frequency band; these 1/3 octave band data are considered as the input signal to be used for calculations or measurements.

Measurements shall be made at one twelfth-octave intervals as given by the R.40 series of preferred numbers in ISO 3 [11] for frequencies from 100 Hz to 4 kHz inclusive. For the calculation the averaged measured level at each frequency band is referred to the averaged test signal level measured in each frequency band.

c) The sensitivity is expressed in terms of dBPa/V.

Page 27: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 27

7.3 Overall frequency response a) The test signal to be used for the measurements shall be the artificial voice according to ITU-T

Recommendation P.50 [22] or a speech like test signal as described in ITU-T Recommendation P.501 [23]. The type of test signal used shall be stated in the test report. The spectrum of the acoustic signal produced by the artificial mouth is calibrated under free field conditions at the MRP. The test signal level shall be -4,7 dBPa, measured at the MRP. The test signal level is averaged over the complete test signal sequence.

b) Handset terminals are setup as described in clause 5. The handsets are mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

Hands-free terminals are setup as described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered. The HATS is freefield equalized as described in ITU-T Recommendation P.581 [28]. The equalized output signal of each artificial ear is power-averaged for the total time of analysis; the "right" and "left" signals are voltage-summed for each 1/3 octave band frequency band; these 1/3 octave band data are considered as the input signal to be used for calculations or measurements.

Measurements shall be made at one twelfth-octave intervals as given by the R.40 series of preferred numbers in ISO 3 [11] for frequencies from 100 Hz to 4 kHz inclusive. For the calculation the averaged measured level at each frequency band is referred to the averaged test signal level measured in each frequency band.

c) The sensitivity is expressed in terms of dBPa/Pa.

7.4 Sending (and connection) loudness rating a) The test signal to be used for the measurements shall be the artificial voice according to ITU-T

Recommendation P.50 [22] or a speech like test signal as described in ITU-T Recommendation P.501 [23]. The type of test signal used shall be stated in the test report. The spectrum of acoustic signal produced by the artificial mouth is calibrated under free field conditions at the MRP. The test signal level shall be -4,7 dBPa, measured at the MRP. The test signal level is averaged over the complete test signal sequence.

b) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal setup is described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered.

The sending sensitivity shall be calculated from each band of the 14 frequencies given in table 1 of ITU-T Recommendation P.79 [30], bands 4 to 17. For the calculation the averaged measured level at the electrical reference point for each frequency band is referred to the averaged test signal level measured in each frequency band at the MRP.

c) The sensitivity is expressed in terms of dBV/Pa and the SLR shall be calculated according to ITU-T Recommendation P.79 [30], formula 5-1, over bands 4 to 17, using m = 0,175 and the sending weighting factors from ITU-T Recommendation P.79 [30], table 1.

7.5 Receiving (and connection) loudness rating a) The test signal to be used for the measurements shall be the artificial voice according to ITU-T

Recommendation P.50 [22] or a speech like test signal as described in ITU-T Recommendation P.501 [23]. The type of test signal used shall be stated in the test report. The test signal level shall be -16 dBm0, measured at the digital reference point or the equivalent analogue point. The test signal level is averaged over the complete test signal sequence.

Page 28: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 28

b) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or in the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal is setup as described in clause 5. The HATS is freefield equalized as described in ITU-T Recommendation P.581 [28]. The equalized output signal of each artificial ear is power-averaged on the total time of analysis; the "right" and "left" signals are voltage-summed for each 1/3 octave band frequency band; these 1/3 octave band data are considered as the input signal to be used for calculations or measurements.

The receiving sensitivity shall be calculated from each band of the 14 frequencies given in table 1 of ITU-T Recommendation P.79 [30], bands 4 to 17. For the calculation the averaged measured level at each frequency band is referred to the averaged test signal level measured in each frequency band.

c) The sensitivity is expressed in terms of dBPa/V and the RLR shall be calculated according to ITU-T Recommendation P.79 [30], formula 5-1, over bands 4 to 17, using m = 0,175 and the receiving weighting factors from table 1 of ITU-T Recommendation P.79 [30].

d) No leakage correction shall be applied when using HATS or type 3.2 artificial ear for the measurement.

7.6 Overall loudness rating a) The test signal to be used for the measurements shall be the artificial voice according to ITU-T

Recommendation P.50 [22] or a speech like test signal as described in ITU-T Recommendation P.501 [23]. The type of test signal used shall be stated in the test report. The spectrum of the acoustic signal produced by the artificial mouth is calibrated under free field conditions at the MRP. The test signal level shall be -4,7 dBPa, measured at the MRP. The test signal level is averaged over the complete test signal sequence.

b) Handset terminals are setup as described in clause 5. The handsets are mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

Hands-free terminals are setup as described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered. The HATS is freefield equalized as described in ITU-T Recommendation P.581 [28]. The equalized output signal of each artificial ear is power-averaged for the total time of analysis; the "right" and "left" signals are voltage-summed for each 1/3 octave band frequency band; these 1/3 octave band data are considered as the input signal to be used for calculations or measurements.

Measurements shall be made at one twelfth-octave intervals as given by the R.40 series of preferred numbers in ISO 3 [11] for frequencies from 100 Hz to 4 kHz inclusive. For the calculation the averaged measured level at each frequency band is referred to the averaged test signal level measured in each frequency band.

c) The sensitivity is expressed in terms of dBPa/Pa and the OLR shall be calculated according to ITU-T Recommendation P.79 [30], formula 5-1 over bands 4 to 17, using m = 0,175 and the overall weighting factors from ITU-T Recommendation P.79 [30], table D.1.

7.7 Sidetone masking rating The measurement of STMR is applicable under the following conditions:

• The terminal only respectively connected through a typical line is measured.

• The type 1 artificial ear is used.

• The measurement is only applicable to handset terminals which can be sealed to the type 1 artificial ear.

The measurement is not applicable for end-to-end measurements including terminals.

Page 29: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 29

Test procedure:

a) The test signal to be used for the measurements shall be the artificial voice according to ITU-T Recommendation P.50 [22] or a speech like test signal as described in ITU-T Recommendation P.501 [23]. The type of test signal used shall be stated in the test report. The spectrum of the acoustic signal produced by the artificial mouth is calibrated under free field conditions at the MRP. The test signal level shall be -4,7 dBPa, measured at the MRP. The test signal level is averaged over the complete test signal sequence.

b) Handset terminals are setup as described in clause 5. The handset is mounted in the LRGP position (see ITU-T Recommendation P.64 [29]) and the earpiece is sealed to the knife-edge of the artificial ear.

c) Where a user controlled volume control is provided, the measurements shall be carried out at a setting which is as close as possible to the nominal value of the RLR (RLR = 3 dB).

Measurements shall be made at one twelfth-octave intervals as given by the R.40 series of preferred numbers in ISO 3 [11] for frequencies from 100 Hz to 8 kHz inclusive. For the calculation the averaged measured level at each frequency band (ITU-T Recommendation P.79 [30], table 4, bands 1 to 20) is referred to the averaged test signal level measured in each frequency band.

d) The Sidetone path loss (LmeST), as expressed in dB, and the SideTone Masking Rate (STMR) (in dB) shall be calculated from the formula 5-1 of ITU-T Recommendation P.79 [30], using m = 0,225 and the weighting factors of in table 3 of ITU-T Recommendation P.79 [30].

NOTE 1: STMR is needed in the E-Model [12].

NOTE 2: Further investigations are required when applying type 3.x artificial ears for the measurement. For the time being values measured using type 3.x artificial ears should not be used directly in the E-model since values required here depend on the electrical sidetone measured. Measurements with 3.x artificial ears always include the acoustical sidetone which is included due to the acoustical leakage present in those artificial ears.

7.8 Listener sidetone The measurement of LSTR is applicable under the following conditions:

• The terminal only respectively connected through a typical line is measured.

• The type 1 artificial ear is used.

• The measurement is only applicable to handset terminals which can be sealed to the type 1 artificial ear.

The measurement is not applicable for end-to-end measurements including terminals.

Test procedure:

a) Sound field calibration: The diffuse sound field is calibrated in the absence of any local obstacles. The averaged field shall be uniform to within ±3 dB within a radius of 0,15 m of the MRP, when measured in one-third octave bands according to IEC 61260 [9] from 100 Hz to 8 kHz (bands 1 to 20).

NOTE 1: The pressure intensity index, as defined in ISO 9614 [46], may prove to be a suitable method for assessing the diffuse field.

NOTE 2: Where more than one loudspeaker is used to produce the desired sound field, the loudspeakers may require to be fed with non-coherent electrical signals to eliminate standing waves and other interference effects.

b) Where adaptive techniques or voice switching circuits are not used (need to be declared by the supplier) the spectrum shall be band limited (50 Hz to 10 kHz) "pink noise" (see ITU-T Recommendation P.64 [29], annex B) to within ±3 dB and the level shall be adjusted to 70 dB(A) (-24 dBPa(A)). The tolerance for this level is ±1 dB.

In other cases the level shall be adjusted to 50 dB(A) (-44 dBPa(A)). The tolerance for this level is ±1 dB.

Page 30: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 30

c) Handset terminals are setup as described in clause 5. The handset is mounted in the LRGP position (see ITU-T Recommendation P.64 [29]) and the earpiece is sealed to the knife-edge of the artificial ear.

d) Measurements are made on one-third octave bands according to IEC 61260 [9] for the 20 bands centred at 100 Hz to 8 kHz (bands 1 to 20). For each band the sound pressure in the artificial ear shall be measured by connecting a suitable measuring set to the artificial ear.

NOTE 3: There may be problems with the signal to noise ratio. If it is less than 10 dB in any band, the microphone noise level and the noise level of any out-of-band signals need to be subtracted from the measured sidetone level (power subtraction).

e) The listener sidetone path loss is expressed in dB and the LSTR shall be calculated from ITU-T Recommendation P.79 [30], formula 5-1, using m = 0,225 and the weighting factors in table 3 of ITU-T Recommendation P.79 [30].

NOTE 4: LSTR is needed for the E-Model [12].

NOTE 5: Further investigations are required when applying type 3.x artificial ears for the measurement. For the time being values measured using type 3.x artificial ears should not be used directly in the E-model since values required here depend on the electrical sidetone measured. Measurements with 3.x artificial ears always include the acoustical sidetone which is included due to the acoustical leakage present in those artificial ears.

7.9 Measurement and calculation of the value of the D-factor (DelSM)

In general the D-factor can be measured for terminals in combination with a suitable network simulation but as well from end-to-end using the far end terminal for assessing the signal transmitted in sending.

a) Sound field calibration: The diffuse sound field is calibrated in the absence of any local obstacles. The averaged field shall be uniform to within ±3 dB within a radius of 0,15 m of the MRP, when measured in one-third octave bands according to IEC 61260 [9] from 100 Hz to 8 kHz (bands 1 to 20).

NOTE 1: The pressure intensity index, as defined in ISO 9614 [46], may prove to be a suitable method for assessing the diffuse field.

NOTE 2: Where more than one loudspeaker is used to produce the desired sound field, the loudspeakers may require to be fed with non-coherent electrical signals to eliminate standing waves and other interference effects.

b) Where adaptive techniques or voice switching circuits are not used (need to be declared by the supplier) the spectrum shall be band limited (50 Hz to 10 kHz) "pink noise" (see ITU-T Recommendation P.64 [29], annex B) to within ±3 dB and the level shall be adjusted to 70 dB(A) (-24 dBPa(A)). The tolerance for this level is ±1 dB.

In other cases the level shall be adjusted to 50 dB(A) (-44 dBPa(A)). The tolerance for this level is ±1 dB.

c) Handset or headset terminals are mounted as described in clause 5. Measurements are made on one-third octave bands according to IEC 61260 [9] for the 14 bands centred at 200 Hz to 4 kHz (bands 4 to 17). For each band the diffuse sound sensitivity Ssi(diff) is measured. The sensitivity shall be expressed in terms of dBV/Pa.

The D-factor may be used for hands-free terminals as well. The requirements however have to be adapted. Hands-free terminals are setup as described in clause 5.

NOTE 3: Additional types of background noise should be used to conduct tests with background noise simulating the real use conditions of the terminal as close as possible. Therefore level and spectrum as well as distribution of sound sources may be different as compared to the requirements when simulating a diffuse sound field with pink noise.

Page 31: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 31

d) The direct sound sensitivity shall be measured using the test set-up specified in clause 5.1 and a speech like test signal as defined in ITU-T Recommendation P.50 [22] or P.501 [23]. The type of test signal used shall be stated in the test report. The direct sound sensitivity is measured in one-third octave bands according to IEC 61260 [9] for the 14 bands centred at 200 Hz to 4 kHz (bands 4 to 17). For each band the direct sound sensitivity Ssi(direct) is measured. The sensitivity shall be expressed in terms of dBV/Pa.

e) The value of the D-factor shall be calculated according to ITU-T Recommendation P.79 [30], annex E, formulas E2 and E3, over the bands from 4 to 17, using the coefficients Ki from table E1 of ITU-T

Recommendation P.79 [30].

7.10 Delay

7.10.1 Delay in sending direction

a) The test signal to be used for the measurements shall be a Composite Source Signal (CSS) as described in ITU-T Recommendation P.501 [23]. The spectrum of acoustic signal produced by the artificial mouth is calibrated under free field conditions at the MRP. The test signal level shall be -4,7 dBPa, measured at the MRP. The test signal level is averaged over the complete test signal sequence.

b) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or in the HATS position (see ITU-T Recommendation P.64 [29]). The application force used to apply the handset against the artificial ear shall be stated in the test report.

The hands-free terminal setup is described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered.

The delay is calculated using the cross correlation function between the signal at the electrical test point and the signal at the MRP. The measurement is corrected by the delay introduced by the test equipment.

c) The delay is expressed in ms, determined from the maximum of the cross correlation function.

NOTE: Delay may be time variant. Therefore constant monitoring of the actual delay may be required when evaluating the range of delay which can be observed in a given connection. The test setup must take into account either real network conditions or the tools needed to simulate typical causes for time variant delay (e.g. packet loss) during the measurement period. Other methods like running cross correlation or delay estimation procedures e.g. used in PESQ (ITU-T Recommendation P.862 [36]) may be used.

7.10.2 Delay in receiving direction

a) The test signal to be used for the measurements shall be a Composite Source Signal (CSS) as described in ITU-T Recommendation P.501 [23]. The test signal level shall be -16 dBm0, measured at the electrical test point. The test signal level is averaged over the complete test signal sequence.

b) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The application force used to apply the handset against the artificial ear shall be stated in the test report..

The hands-free terminal is setup as described in clause 5. The HATS is freefield equalized as described in ITU-T Recommendation P.581 [28]. The equalized output signal of one artificial ear is used for the delay calculation.

The delay is calculated using the cross correlation function between the signal at the electrical test point and the signal at the DRP. The measurement is corrected by the delay introduced by the test equipment.

c) The delay is expressed in ms, determined from the maximum of the cross correlation function.

NOTE: Delay may be time variant. Therefore constant monitoring of the actual delay may be required when evaluating the range of delay which can be observed in a given connection. The test setup must take into account either real network conditions or the tools needed to simulate typical causes for time variant delay (e.g. packet loss) during the measurement period. Other methods like running cross correlation or delay estimation procedures e.g. used in PESQ (ITU-T Recommendation P.862 [36]) may be used.

Page 32: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 32

7.10.3 Overall Delay

a) The test signal to be used for the measurements shall be a Composite Source Signal (CSS) as described in ITU-T Recommendation P.501 [23]. The test signal level shall be -16 dBm0, measured at the electrical test point. The test signal level is averaged over the complete test signal sequence.

b) The handset terminals are setup as described in clause 5. The handsets are mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The application force used to apply the handset against the artificial ear shall be stated in the test report.

Hands-free terminals are setup as described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered. The HATS is freefield equalized as described in ITU-T Recommendation P.581 [28]. The equalized output signal of one artificial ear is used for the delay calculation.

The delay is calculated using the cross correlation function between the signal at the MRP on the one side of the connection and the signal at the DRP at the other side of the connection. The measurement is corrected by the delay introduced by the test equipment.

c) The delay is expressed in ms, determined from the maximum of the cross correlation function.

NOTE: Delay may be time variant. Therefore constant monitoring of the actual delay may be required when evaluating the range of delay which can be observed in a given connection. The test setup must take into account either real network conditions or the tools needed to simulate typical causes for time variant delay (e.g. packet loss) during the measurement period. Other methods like running cross correlation or delay estimation procedures e.g. used in PESQ (ITU-T Recommendation P.862 [36]) may be used.

7.11 Terminal coupling loss The measurement is only applicable for terminals respectively terminals connected through a typical line.

The measurement is not applicable for end-to-end measurements including terminals at both ends.

a) For conducting the tests the typical environments where the terminal is typically used shall be used. E.g. office rooms for all office types of telephones, a car cabin for hands-free telephones in vehicles ... The terminal is setup as described in clause 5. The ambient noise level shall be less than -64 dBPa(A) for handset and headset terminals, less than -70 dBPa(A) for hands-free type terminals. The attenuation from electrical reference point input to electrical reference point output shall be measured using a speech like test signal.

b) Before the actual test a training sequence consisting of 10 s artificial voice male and 10 s artificial voice female according to ITU-T Recommendation P.50 [22] is altered. The training sequence level shall be -16 dBm0 in order not to overload the codec.

c) The test signal is a PN-sequence complying with ITU-T Recommendation P.501 [23] with a length of 4 096 points (for the 48 kHz sampling rate) and a crest factor of 6 dB. The duration of the test signal is 250 ms. The test signal level is -3 dBm0. The low-crest factor is achieved by random-alternation of the phase between -180° and 180°.

d) The TCLw is calculated according to ITU-T Recommendation G.122 [14], clause B.4 (trapezoidal rule). For the calculation the averaged measured echo level at each frequency band is referred to the averaged test signal level measured in each frequency band. The length of the test signal shall be at least one second (1,0 s). For the measurement a time window has to be applied adapted to the duration of the actual test signal (200 ms).

7.12 Talker echo loudness rating In general the talker echo loudness rating can be expressed by:

TELR = SLR + RLR + Le

where Le is the echo loss provided by the far end. Le includes the complete echo loss provided by the far end terminal and all echo cancellers active in the connection. The measurement of SLR and RLR is described in clauses 7.4 and 7.5.

Page 33: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 33

In scenarios where the near end terminal is not connected, the measurement of Le can be made by assessing the connection a the 0 dBr point and calculating Le according to ITU-T Recommendation G.111 [13]. If no echo cancellation is involved in the network and no loss is inserted, the Le value is identical to TCL W. In these scenarios the test setup is as follows:

a) For conducting the tests the typical environments where the far end terminal is typically used shall be used. E.g. office rooms for all office types of telephones, a car cabin for hands-free telephones in vehicles ... The terminal is setup as described in clause 5. The ambient noise level shall be less than -64 dBPa(A) for handset and headset terminals, less than -70 dBPa(A) for hands-free type terminals. The attenuation from electrical reference point input to electrical reference point output shall be measured using a speech like test signal.

b) Before the actual test a training sequence consisting of 10 s artificial voice male and 10 s artificial voice female according to ITU-T Recommendation P.50 [22] is altered. The training sequence level shall be -16 dBm0 in order not to overload the codec.

c) The test signal is a PN-sequence complying with ITU-T Recommendation P.501 [23] with a length of 4 096 points b (for the 48 kHz sampling rate) and a crest factor of 6 dB. The duration of the test signal is 250ms. The test signal level is -3 dBm0. The low-crest factor is achieved by random-alternation of the phase between -180° and 180°.

d) Le is calculated according to ITU-T Recommendation G.111 [13], annex A, formula A 4-7 (m = 1). For the calculation the averaged measured echo level at each frequency band is referred to the averaged test signal level measured in each frequency band. The length of the test signal shall be at least one second (1,0 s). For the measurement a time window has to be applied adapted to the duration of the actual test signal (200 ms).

NOTE 1: It should be noted that Le may be highly time variant depending on the type of terminal and the performance of the echo cancellers involved in the connections. Therefore in ITU-T Recommendation G.168 [16] various tests are described which allow the testing of the echo canceller performance parameters in various conditions. Instead of the echo path simulations described in ITU-T Recommendation G.168 [16] the real echo path (terminal) used in the connection should be tested.

NOTE 2: Time varying echo loss as well as echo loss variations e.g. in double talk conditions may be measured but are not taken into account in the E-model yet.

In end-to-end scenarios the TELR typically can not be assessed directly. If however the sidetone-path of the near end terminal and the acoustical coupling of the test setup can be separated from the echo signal the measurement is possible. In this case it is required that the delay between acoustical coupling and the echo signal is long enough to apply a time window to the echo signal. In addition the frequency responses and SLR and RLR of the near end terminal must be known. The procedure now is as follows:

a) For conducting the tests the typical environments where the far end terminal is typically used shall be used. E.g. office rooms for all office types of telephones, a car cabin for hands-free telephones in vehicles. The room where the near end terminal is setup should be ideally anechoic. The terminal is setup as described in clause 5. The ambient noise level shall be less than -64 dBPa(A) for handset and headset terminals, less than -70 dBPa(A) for hands-free type terminals in both rooms. The attenuation from mouth-to-ear shall be measured using a speech like test signal .

b) Before the actual test a training sequence consisting of 10 s artificial voice male and 10 s artificial voice female according to ITU-T Recommendation P.50 [22] is altered. The training sequence level shall be -4,7 dBPa.

c) The test signal is a PN-sequence complying with ITU-T Recommendation P.501 [23] with a repetition period > than the expected delay of the echo signal. and a crest factor of 6 dB. The duration of the test signal is adapted to the expected echo signal. The test signal level is +5 dBPa. The low-crest factor is achieved by random-alternation of the phase between -180° and 180°.

d) The impulse response of the measured signal is calculated by inverse Fourier transformation. The echo impulse response is cut off in the time domain and used for the further calculations.

e) From the impulse response the echo spectrum is generated by Fourier transformation. This spectrum is corrected by the sending and receiving frequency responses of the near end terminal. The corrected spectrum is used for Le calculation.

Page 34: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 34

f) Le is calculated according to ITU-T Recommendation G.111 [13], annex A, formula A 4-7 (m = 1). For the calculation the averaged measured echo level at each frequency band is referred to the averaged test signal level measured in each frequency band. The length of the test signal shall be at least one second (1,0 s). For the measurement a time window has to be applied adapted to the duration of the actual test signal

NOTE 3: It should be noted that for low level echo signals the method is not applicable due to insufficient signal to noise ratio.

7.13 Weighted echo path loss WEPL is the frequency weighted sum of all losses and gains in inserted in a network. The weighting is defined in ITU-T Recommendation G.122 [14]. It includes TCLW and any transhybrid loss and such cannot be measured directly. Further information can be found in EG 201 050 [3].

NOTE 1: It should be noted that WEPL may be highly time variant depending on the type of terminal and the performance of the echo cancellers involved in the connections.

NOTE 2: Time varying echo loss as well as echo loss variations e.g. in double talk conditions may be measured but are not taken into account in the E-model yet.

7.14 Distortion

7.14.1 Distortion in sending

a) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal setup is described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered.

b) The type of test and test signals depend on the type of signal processing used and introduced in the system to be tested. For systems introducing highly time variant and non linear signal processing (e.g. all types of low bit rate codecs based on CELP type codecs) the tests described below are not applicable or should be applied with some caution.

- When testing non linear time invariant transmission systems a pure tone measurement can be used. An instrument capable of measuring the harmonic distortion of signals with fundamental frequencies in the range of 315 Hz to 1 kHz shall be connected to the artificial ear. A pure-tone signal in the range of -45 dBPa to -4,7 dBPa shall be applied at the MRP at frequencies of 315 Hz, 502 Hz and 1 004 kHz. Alternatively the level is adjusted until the output of the terminal is -10 dBm0. The level of the signal at the MRP is then the ARL. The level of this signal is in the range of -35 dB to +10 dB rel. to the ARL. The ratio of the signal to total distortion power at the electrical interface shall be measured with the psophometric noise weighting (see ITU-T Recommendations G.712 [17] and O.132 [19]).

- Alternatively a band-limited noise signal corresponding to ITU-T Recommendation O.131 [18] can be used. The test signal shall be applied at the MRP. The level of this signal is in the range of -45 dBPa to -4,7 dBPa. Alternatively the level is adjusted until the output of the terminal is -10 dBm0. The level of the signal at the MRP is then the ARL. The level of this signal is in the range of -45 dB to +7 dB rel. to the ARL. The ratio of signal to total distortion power at the electrical interface shall be measured (see ITU-T Recommendations G.712 [17], annex A and O.131 [18]).

- For systems requiring speech like test stimuli a test-signal which is more "speech-like" e.g. an AM-FM modulated sinewave composed signal having a fundamental frequency similar to speech and the typical speech like harmonics can be used. Limit the spectrum to 1 kHz and measure the total distortion in the frequency range from 1 kHz to 3,4 kHz. The test signal is defined as:

( )[ ] ( )[ ]∑ ××+×××+=i

FMFMiAMAM ftftftnAts )2sin(2cos2cos)( 0 πµππµ

Page 35: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 35

with A = 0,5

fAM = 3 Hz, µAM = 0,5

fFM = 5 Hz, µFM =1

F0 i= i × 240 Hz ;i=1..4

The spectrum should be shaped using a shaping filter as described in Table3/P.501 [23] which provides a slope of 5 dB/octet

The test signal level is adjusted to -4,7 dBPa at the MRP.

The total energy of the distortion components is measured in the frequency range from 1 kHz to 4 kHz at the electrical interface.

NOTE: In order to ensure a reliable activation of the configuration measured, an activation signal may be generated before the actual measurement starts. The activation signal may consist of a sequence of 4 Composite Source Signals (CSS) according to ITU-T Recommendation P.501 [23]. Alternatively artificial voice according to ITU-T Recommendation P.50 [22] can be used. The level of the activation signal shall be -4,7 dBPa, measured at the MRP. The level of the activation signal is averaged over the complete activation sequence signal.

7.14.2 Distortion in receiving

a) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or in the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal is setup as described in clause 5. The HATS is freefield equalized as described in ITU-T Recommendation P.581 [28].

b) The type of test and test signals depend on the type of signal processing used and introduced in the system to be tested. For systems introducing highly time variant and non linear signal processing (e.g. all types of low bit rate codecs based on CELP type codecs) the tests described below are not applicable or should be applied with some caution.

- When testing non linear time invariant transmission systems a pure tone measurement can be used. An instrument capable of measuring the harmonic distortion of signals with fundamental frequencies in the range of 315 Hz to 1 kHz shall be connected to the electrical interface. A pure-tone signal in the range of -45 dBm0 to 0 dBm0 shall be applied at the electrical interface at frequencies of 315 Hz, 502 Hz and 1 004 kHz. The ratio of the signal to total distortion power at the artificial ear shall be measured with the psophometric noise weighting (see ITU-T Recommendations G.712 [17] and O.132 [19]).

- Alternatively a band-limited noise signal corresponding to ITU-T Recommendation O.131 [18] can be used. The test signal shall be applied at the MRP. The level of this signal is in the range of -45 dBPa to -4,7 dBPa. The ratio of the signal-to-total distortion power shall be measured with the psophometric noise weighting in the artificial ear (see ITU-T Recommendations G.712 [17] and O.132 [19]).

- For systems requiring speech like test stimuli a test-signal which is more "speech-like" e.g. an AM-FM modulated sinewave composed signal having a fundamental frequency similar to speech and the typical speech like harmonics, limit the spectrum to 1 kHz and measure the total distortion in the frequency range from 1 kHz to 3,4 kHz. The test signal is defined as:

( )[ ] ( )[ ]∑ ××+×××+=i

FMFMiAMAM ftftftnAts )2sin(2cos2cos)( 0 πµππµ

with A = 0,5

fAM = 3 Hz, µAM = 0,5

fFM = 5 Hz, µFM =1

Page 36: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 36

F0 i= i × 240 Hz ;i=1..4

The spectrum should be shaped using a shaping filter as described in Table3/P.501 [23] which provides a slope of 5 dB/octet.

The test signal level is adjusted to -16 dBm0 at the electrical interface.

The total energy of the distortion components is measured in the frequency range from 1 kHz to 4 kHz in the artificial ear.

NOTE: In order to ensure a reliable activation of the configuration measured, an activation signal may be generated before the actual measurement starts. The activation signal consists of a sequence of 4 Composite Source Signals (CSS) according to ITU-T Recommendation P.501 [23]. Alternatively artificial voice according to ITU-T Recommendation P.50 [22] can be used. The level of the activation level shall be -16 dBm0, measured at the electrical interface. The level of the activation signal is averaged over the complete activation sequence signal.

7.14.3 Overall distortion

a) Handset terminals are setup as described in clause 5. The handsets are mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

Hands-free terminals are setup as described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered. The HATS is freefield equalized as described in ITU-T Recommendation P.581 [28]. The equalized output signal of each artificial ear is power-averaged for the total time of analysis; either the "right" or the "left" signals are used measurements and calculations.

b) The type of test and test signals depend on the type of signal processing used and introduced in the system to be tested. For systems introducing highly time variant and non linear signal processing (e.g. all types of low bit rate codecs based on CELP type codecs) the tests described below are not applicable or should be applied with some caution.

- When testing non linear time invariant transmission systems a pure tone measurement can be used. A pure-tone signal in the range of -45 dBPa to -4,7 dBPa shall be applied at the MRP at frequencies of 315 Hz, 502 Hz and 1 004 kHz. The ratio of the signal to total distortion power at the artificial ear shall be measured with the psophometric noise weighting (see ITU-T Recommendations G.712 [17] and O.132 [19]).

- Alternatively a band-limited noise signal corresponding to ITU-T Recommendation O.131 [18] can be used. The test signal shall be applied at the MRP. The level of this signal is in the range of -55 dBm0 to -3 dBm0. The ratio of the signal-to-total distortion power shall be measured with the psophometric noise weighting in the artificial ear (see ITU-T Recommendations G.712 [17] and O.132 [19]).

- For systems requiring speech like test stimuli a test-signal which is more "speech-like" e.g. an AM-FM modulated sinewave composed signal having a fundamental frequency similar to speech and the typical speech like harmonics, limit the spectrum to 1 kHz and measure the total distortion in the frequency range from 1 kHz to 3,4 kHz. The test signal is defined as:

( )[ ] ( )[ ]∑ ××+×××+=i

FMFMiAMAM ftftftnAts )2sin(2cos2cos)( 0 πµππµ

with A = 0,5

fAM = 3 Hz, µAM = 0,5

fFM = 5 Hz, µFM =1

F0 i= i × 240 Hz ;i=1..4

The spectrum should be shaped using a shaping filter as described in Table3/P.501 [23] which provides a slope of 5 dB/octet.

Page 37: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 37

The test signal level is adjusted to -4,7 dBPa at the mouth reference point.

The total energy of the distortion components is measured in the frequency range from 1 kHz to 4 kHz in the artificial ear.

NOTE: In order to ensure a reliable activation of the configuration measured, an activation signal may be generated before the actual measurement starts. The activation signal consists of a sequence of 4 Composite Source Signals (CSS) according to ITU-T Recommendation P.501 [23]. Alternatively artificial voice according to ITU-T Recommendation P.50 [22] can be used. The level of the activation level shall be -4,7 dBPa, measured at the MRP. The level of the activation signal is averaged over the complete activation sequence signal.

7.15 Sensitivity against out-of-band signals in sending a) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or the

HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal setup is described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered.

b) The type of test and test signals depend on the type of signal processing used and introduced in the system to be tested. For systems introducing highly time variant and non linear signal processing (e.g. all types of low bit rate codecs based on CELP type codecs) the tests described below are not applicable or should be applied with some caution.

- When testing non linear time invariant transmission systems a pure tone measurement can be used. For input signals at frequencies of 4,65 kHz, 5 kHz, 6 kHz, 6,5 kHz, 7 kHz and 7,5 kHz at the level specified of -4,7 dBPa, the level of any image frequencies at the electrical interface shall be measured.

- Alternatively a white Gaussian noise band-limited to the frequency range between 4,6 kHz and 8 kHz with a level of -4,7 dBPa at the MRP can be used. The total level - measured in a frequency range from 300 Hz to 3,4 kHz is measured at the electrical reference point and shall be less than 40 dB referred to the reference level. The reference level is determined using artificial voice according to ITU-T Recommendation P.50 [22], band-limited to the frequency range between 300 Hz and 3,4 kHz with a level of -4,7 dBPa at the MRP. For this signal the in-band level averaged over the complete reference signal length is determined at the electrical reference point.

NOTE 1: In order to ensure a reliable activation of the terminal an activation signal may be generated before the actual measurement starts. The activation signal consists of a sequence of 4 Composite Source Signals (CSS) according to ITU-T Recommendation P.501 [23]. The level of the activation level shall be -4,7 dBPa, measured at the MRP. The level of the activation signal is averaged over the complete activation sequence signal.

NOTE 2: Appropriate limits can be found e.g. in TBR 008 [4].

7.16 Spurious out-of-band signals in receiving a) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or in the

HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal is setup as described in clause 5. The HATS is freefield equalized as described in ITU-T Recommendation P.581 [28].

Page 38: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 38

b) The type of test and test signals depend on the type of signal processing used and introduced in the system to be tested. For systems introducing highly time variant and non linear signal processing (e.g. all types of low bit rate codecs based on CELP type codecs) the tests described below are not applicable or should be applied with some caution.

- For input signals at the frequencies 500 Hz, 1 000 Hz, 2 000 Hz and 3 150 Hz applied at the level of -10 dBm0, the level of spurious out-of-band image signals at frequencies of up to 8 kHz shall be measured selectively in the artificial ear.

- Alternatively the test signal used is artificial voice according to ITU-T Recommendation P.50 [22], band-limited in a frequency range between 300 Hz and 3,4 kHz with a level of -12 dBm0 in receiving direction. The level of the out-of-band signal is measured in a frequency range between 4,6 kHz and 8 kHz at the DRP and shall be at least 45 dB below the level of the reference signal. The level of the reference signal is determined by measuring the acoustical level of the in-band signal at the artificial ear. The in-band signal is the artificial voice signal band-limited between 300 Hz and 3,4 kHz applied with a level of -12 dBm0.

NOTE 1: In order to ensure a reliable activation of the terminal an activation signal may be generated before the actual measurement starts. The activation signal consists of a sequence of 4 Composite Source Signals (CSS) according to ITU-T Recommendation P.501 [23]. The level of the activation level shall be -16 dBm0, measured at the electrical interface. The level of the activation signal is averaged over the complete activation sequence signal.

NOTE 2: Appropriate limits can be found e.g. in TBR 008 [4].

8 Advanced measurement procedures, taking into account the conversational situation

All measurements and parameters listed in clause seven assume that the systems under test are broadly linear and time invariant (LTI-systems). For LTI-systems, the single figure measures of clause seven can be used effectively to define speech quality performance. In cases where the signal processing in the components cannot be assumed to be linear and time invariant (except the codec), signal processing may influence the speech transmission quite substantially, especially in the conversational situation. The signal processing procedures to be expected are voice activated switching and amplification, echo cancellation (acoustic and electric), noise reduction, etc. The importance of double talk performance and background noise transmission was derived by conversational tests and investigated more in detail by using specific double talk tests and listening only tests.

The signal processing components expected in non-LTI (non-linear time variant) systems are found e.g. in small (mobile) terminals, hands-free terminals, echocancellers, packetizing equipment.

Table 1 gives an overview of the relevant subjective and objective parameters, which in addition to the parameters defined in clause seven, influence the speech transmission quality.

Page 39: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 39

Table 1: Subjectively relevant parameters and their correlating objective parameters

Subjectively relevant parameter Description Correlating objective parameter

Delay and echo loss One way transmission time caused by signal processing, packetizing and transmission

- delay - (time variant) echo loss - TCL - Switching characteristics

Quality of background noise

transmission

Transmission effect experienced in the send direction during - idle mode - only far end speech active - only near end speech active

- attenuation range - attenuation in send direction - switching characteristics - minimum activation level in send direction - frequency response - EC design of NLP or centre clippers - design of noise reduction systems - sensitivity of background noise detection

(activation level, absolute level, level fluctuations)

Double talk performance

typically in send and receive direction: - a loudness variation between

single and double talk periods - loudness variation during

double talk - echo disturbances - occurrence of speech gaps

- attenuation range - attenuation in send and receive direction

during double talk - switching characteristics - minimum activation level to switch over from

receive to send direction and from send to receive direction

- echo attenuation - spectral and time dependent echo

characteristics - design of NLP or centre clippers in

conjunction with ECs Echo disturbances under single talk

conditions

measured between receive and send direction

- echo level - echo level fluctuation vs. time - spectral echo attenuation

Speech sound quality in send and receive direction - frequency responses - distortions

Loudness in send and receive direction - loudness ratings in send and receive

Noise in send and receive direction - noise level - level fluctuations - spectral characteristics

NOTE 1: The behaviour of users during subjective tests clearly demonstrates that the individual speech levels on both sides of the connection highly influence the transmission performance of a system under test. Consequently the measurement levels must be adapted in an appropriate way to represent the possible level variations at the "receive" and "send" inputs.

NOTE 2: Room characteristics highly influence the perceived transmission quality [38]. This demands that terminals be tested in an appropriate environment.

NOTE 3: Recent subjective testing has found that echo levels during double talk may influence the speech quality more than previously expected. Echo loss requirements during double talk can be found in the appendix to ITU-T Recommendation G.131 [15] and in ITU-T Recommendation P.340 [20].

Page 40: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 40

8.1 Measurement setup for objective tests The measurement setup should be chosen as described in clause 5. In addition some changes are suggested in order to improve the accuracy of the measurements:

• A possible delay in sending or receiving of terminals and associated network components and the acoustical propagation delay between artificial mouth and the HFT microphone must be taken into account. This is especially important for analyses, where the measured signals are referred to the original test signals. These analyses require an exact synchronization of both signals. Therefore the delay has to be compensated.

• When analyzing hands-free type telephones, in addition a microphone should be positioned very close to the HFT loudspeaker in order to record the RCV signal. This recorded signal can easier be distinguished from the signal introduced by the artificial mouth under double talk measurement conditions.

8.2 Practical realization of test signals The signals to be used for the advanced measurements are specifically adapted to the measurement and can be found in ITU-T Recommendations P.501 [23], P.50 [22] and P.59 [40]. The tests described in the following are suggested to determine additional parameters as given in table 1 and are described in ITU-T Recommendation P.502 [24]. In cases where background noise simulation is required, care should be taken when choosing the type of background noise. Time and frequency characteristics should be chosen to simulate the typical environment the system is supposed to operate, e.g. car type noise for HFTs in car type environments. In case of specific acoustic preprocessing provided by the terminal, the spatial characteristics of the background noise may be important. In order to simplify the description, long term power density spectrum and long term level density during the measurement should be noted.

8.3 Quality of background noise transmission In general the simulation of background noise can be a continuous noise signal (with shaped spectrum), however a more sophisticated signal is recommended to represent realistic conditions (e.g. office voice babble). In such cases the background noise should be characterized by its long term power density spectrum and its average level applied during the measurement.

For the following tests the background noise signal is regarded as the measurement signal and not as a disturbing component. Consequently analyses should be applied to the noise signal. The transmission quality of background noise (from the near end in SND direction) can be evaluated:

• at idle mode;

• with far end speech;

• with near end speech.

In all these cases important parameters are:

• the sensitivity of background noise detection in terms of activation level;

• the absolute level of the transmitted signal;

• level fluctuations.

8.3.1 Background noise transmission with far end speech

The following signal structure can be used to evaluate the quality of background noise transmission in SND direction coincident with far end speech. Exemplary the following figure 8 represents a signal with a continuous noise signal applied at the near end (SND direction, grey colour) and a simulation of far end speech in RCV direction (white colour, bursts of CSS can be used). The measurement is carried out in SND direction. In figure 8 the level of the CSS bursts vary and the simulation of background noise is applied with a constant level.

Page 41: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 41

t

s (t)

NOTE: The dotted line indicates the repetition or elongation of the test signal to achieve the suitable length for the measurement.

Figure 8: Example of a test signal structure to evaluate the quality of background

noise transmission in SND direction (with far end speech simulation)

The test is carried out applying the Composite Source Signal in receiving direction. After the end of the last Composite Source Signal burst (representing the end of far end speech simulation) the signal level in sending direction is measured (and typically should not vary by more than 10 dB (during transition to transmission of background noise without far end speech)).

a) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal setup is described in clause 5.

b) According to the specification of the manufacturer/test lab the background noise is played back. The test should be carried out using a mostly constant background noise (e.g. constant driving condition in a car). The background noise should be applied for at least 5 seconds in order to adapt noise reduction algorithms.

c) A Composite Source Signal as shown in figure 8 is applied as test signal in receiving direction with a duration of ≥ 2 CSS periods. The test signal level is -16 dBm0 at the electrical reference point. If the dynamic behaviour should be investigated a signal with rising level as indicated in figure 8 can be used.

d) The sending signal is recorded at the electrical reference point. The test signal level versus time is calculated using a time constant of 35 ms.

e) The signal level variation in sending direction is determined immediately after the end of the last CSS burst in receiving direction until the transmitted background noise signal in sending direction reached the maximum level. The resulting level difference corresponds to the signal level variation in sending direction.

In addition to the level analysis as described above also a spectral analysis or a hearing model based signal analysis can be used . One possibility is the so called "Relative Approach" [47]. The "Relative Approach" is a single ended method which does not need a reference signal and is purely based on the analysis of the transmitted background noise signal.. The algorithm takes into account the sensitivity of the human ear on signal fluctuation in the time domain as well as on significant spectral structures. It is recognized that slow variations of a signal in time and/or frequency are typically not disturbing. The algorithm is based on foreword estimation using the signal history. The new signal values are predicted and compared to the actual acquired signal. The basis for the new procedure is the hearing adequate spectral representation of the time and frequency domain. Therefore a time frequency analysis is required. Like in ordinary hearing models the nonlinear relationship between sound pressure level and loudness perceived subjectively is taken into account by time-frequency warping using a Bark filter bank and proper integration of the individual outputs. The filter bank is realized in the time domain. The output signals of the filter bank are rectified and integrated, thus the envelope is generated. This three-dimensional output of the hearing model is the basis of the "Relative Approach".

After spectral analysis and nonlinear transformation the foreword estimation of new signal value is made and compared to the actual signal. The difference can be colour coded and represented as new type of spectrography: bright colours represent big differences between estimation and actual signal, dark colours indicate low differences.

8.3.2 Background noise transmission with near end speech

A similar signal structure can be used to determine the quality of background noise transmission coincident with near end speech. In this case the background noise and the speech signal simulation (again CSS can be used) are applied and measured on the same direction (in opposite to the upper figure 8), e.g. the SND direction.

Page 42: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 42

The test is carried out applying a Composite Source Signal in sending direction. After the end of the last Composite Source Signal burst (representing the end of near end speech simulation) the signal level in sending direction is measured (and typically should not vary more than 10 dB (during transition to transmission of background noise without near end speech)).

a) The handset terminal is setup as described in clause 5. The handset is mounted either in the LRGP or the HATS position (see ITU-T Recommendation P.64 [29]). The pressure force used to apply the handset against the artificial ear is noted in the test report.

The hands-free terminal setup is described in clause 5. In addition to the MRP calibration, the broadband signal level is adjusted to -28,7 dBPa at the HFRP and the spectrum is not altered.

b) According to the specification of the manufacturer/test lab the background noise is played back. The test should be carried out using a mostly constant background noise (e.g. constant driving condition in a car). The background noise should be applied for at least 5 s in order to adapt noise reduction algorithms.

c) The near end speech is simulated using the Composite Source Signal according to ITU-T Recommendation P.501 [23] with a duration of ≥ 2 CSS periods (see figure 8). The test signal level is -4,7 dBPa at the MRP.

d) The sending signal is recorded at the electrical reference point. The test signal level versus time is calculated using a time constant of 35 ms.

e) The signal level variation in sending direction is determined immediately after the end of the last CSS burst in receiving direction until the transmitted background noise signal in sending direction reached the maximum level. The resulting level difference corresponds to the signal level variation in sending direction.

In addition to the level analysis as described above also a spectral analysis or a hearing model based signal analysis (e.g. "Relative Approach") can be used.

8.4 Double talk performance Figure 9 demonstrates the structure of a test signal, based on periodically repeated Composite Source Signals in SND and RCV direction, which can be used to determine important parameters during periods of double talk. The signal and examples for applications are described in [24].

s (t)

t1 t2 t3 t4

NOTE: The dotted line indicates the repetition or elongation of the test signal to achieve the suitable length for the measurement.

Figure 9: Structure of test signal for double talk evaluation

Page 43: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 43

The signal represents an ongoing double talk period. Periods of the CSS are fed to the RCV input port (grey colour) and via the artificial mouth to the HFT microphone (white colour). The measurement level of the RCV and SND signals varies for 0,5 dB between two CSS periods. In addition the level distribution between the SND and RCV direction spreads over a wide range. Typical signal parameters can be chosen as follows:

active duration / pause duration

highest signal level (active part)

lowest signal level (active part)

level of the first signal period

CSS in SND direction

248,62 ms / 151,38 ms

-3 dBPa -23 dBPa -3 dBPa

CSS in RCV direction

248,62 ms / 151,38 ms

-10 dBm0 -29,5 dBm0 -29,5 dBm0

NOTE: All levels measured in dBV are referred to the 0 dBm0 reference point.

The SND and RCV signal periods overlap for 48,62 ms, which corresponds to the duration of the voiced sound of the CSS. The signals for SND and RCV direction should be uncorrelated. The complete duration amounts to 32 s.

The following parameters can be determined during double talk:

• attenuation range;

• attenuation in SND/RCV direction;

• switching characteristics, e.g. switching times;

• minimum activation level to switch over from RCV to SND direction and from SND to RCV direction;

• frequency responses in SND and RCV direction;

• loudness ratings in SND and RCV direction;

• echo attenuation;

• design of NLP or centre clippers in conjunction with EC's.

The evaluation of echo during double talk (including speech activity in SND direction from the near end speaker) requires the separation of echo signal and near end signal. A possible construction of a test signal is shown in figure 10.

Shapingfilter 1

Shapingfilter 2

∼ x

x CH1

CH2

sFM 1

sFM 2

sAM 1

sAM 2

s 2,1FM (t) = )2(cos 2,012,1 FntA FM ×∑ × π ; n = 1,2,....

s 2,1AM (t) = 2,1AMA ;)2(cos 2,1AMFtπ×

Figure 10: Two channel test signal generation for double talk evaluations based on AM-FM signals

Page 44: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 44

Typical settings are given in the following table:

fAM fFM F0 shaping filter

Channel 1 (CH 1) fAM1 = 3 Hz fFM1 = 5 Hz F01 = 270 Hz LP, 5 dB/octet

Channel 2 (CH 2) fAM2 = 3 Hz fFM2 = 5 Hz F02 = 290 Hz LP, 5 dB/octet

Average measurement levels can be chosen for the signals, i.e. -4,7 dBPa (SND) and -20 dBV (RCV). The test signal

may be embedded in speech or speech like sequences. The measured signal in SND direction is filtered in order to eliminate the near end signal component.

The parameter, which can be determined, is:

• the echo level during double talk

8.5 Switching characteristics For these subjective and the correlating objective parameters measurement signals according to ITU-T Recommendation P.501 [23] can be used.

NOTE: The switching characteristics influence the transmission quality in various ways as given by table 1 in detail (see column: correlating objective parameters).

One of the most important parameters, especially for implementations with level switching devices is the attenuation range. This parameter can be determined using a test signal structure shown in figure 11.

s (t)

t1

t

NOTE: The dotted line indicates the repetition or elongation of the test signal to achieve the suitable length for the measurement.

Figure 11: Structure of test signal for attenuation range measurement.

A periodical repetition of CSS bursts as a simulation of speech is used to activate one transmission path (grey colour). At the end of one burst, indicated by t1 on the time scale, the measurement signal is applied in the opposite path (white

colour). This signal part consists of a periodical repetition of a voiced sound.

Typical settings are as follows:

measurement signal measurement signal level

activation signal (in opposite direction)

level of the activation signal (in opposite direction)

Switching RCV -> SND

voiced sound in SND direction,

period. repetition

-3 dBPa

CSS in RCV

-21,7 dBm0

(including pauses) Switching SND -> RCV

voiced sound in RCV direction,

period repetition

-20 dBm0

CSS in SND

-4,7 dBPa

(including pauses)

Page 45: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 45

The following parameters can be measured:

• attenuation range;

• switching characteristics (for speech like signals), e.g. switching times.

The signal structure shown in figure 12 represents signal parts with increasing levels. The minimum activation level to switch on the RCV or SND direction from idle mode can be determined using these sequences. Periods of the CSS (as a simulation of speech) with increasing levels are suited for this signal.

s (t)

t

t1 t2 tN

NOTE: The dotted line indicates the repetition or elongation of the test signal to achieve the suitable length for the measurement.

Figure 12: Structure of test signal to determine the minimum activation level

Typical settings can be chosen as follows:

active duration / pause duration

level of the first period

level difference between two periods

CSS for switching in SND direction

248,62 ms / 451,38 ms

-23 dBPa (see note)

1 dB

CSS for switching in RCV direction

248,62 ms / 451,38 ms

-40 dBm0 (see note)

1 dB

NOTE: These levels should be sufficiently low, to ensure that a wide level range is measured.

If the transmitted signals are measured and referred to the original measurement signal, the minimum activation level can be determined. The activation can be analysed at the beginning of each signal burst (t1, t2, ..., tN). The parameters

which can be determined using this signal are

• the minimum activation level (for speech like signals);

• the switching times (for speech like signals, level dependent).

If the minimum activation levels to switch over from RCV to SND direction (or vice versa, i.e. from SND to RCV direction) shall be measured, the given test signals can be used with slight modifications. As shown in figure 13 an additional signal is needed in the opposite transmission direction (grey colour). The level of the measurement signal (white colour) increases again periodically. Periods of the CSS are suited for both signals in figure 13, if the switching characteristics shall be determined applying speech like signals. Again the signals should be chosen to be uncorrelated.

Page 46: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 46

t1 t3t2 tN

s (t)

t

NOTE: The dotted line indicates the repetition or elongation of the test signal to achieve the suitable length for the measurement

Figure 13: Structure of test signal to determine the minimum activation level to switch over

Suitable setting are given below:

active duration / pause duration

level of the first period

level difference between two

periods

level (active part) in opposite transmission direction

CSS to switch over to SND direction

248,62 ms / 451,38 ms

-13 dBPa

1 dB

-20 dBm0 (RCV)

CSS in to switch over to RCV direction

248,62 ms / 451,38 ms

-30 dBm0

1 dB

-3 dBPa (SND)

Again the activation can be analyzed at the beginning of the signal bursts (t1, t2, ..., tN).

In addition the same tests can be performed with a simulation of background noise applied at the opposite transmission path. Assessable parameters are

• the minimum activation level (for speech like signals) to switch over

• the switching times (for speech like signals)

The signal structure given in figure 14 can be used to determine the switching characteristics in the presence of background noise. In this case a speech like signal (CSS) and a background noise simulation are applied simultaneously on the same channel. The parameters for the CSS can be taken from the tables above.

t

s (t)

NOTE: The dotted line indicates the repetition or elongation of the test signal to achieve the suitable length for the measurement.

Figure 14: Structure of test signal to determine the minimum activation level in

the presence of background noise

The parameters to be determined are:

• the minimum activation level in the presence of background noise;

• the switching characteristics in the presence of background noise.

Page 47: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 47

8.6 Level Adjustments by Companding or AGC The following figures 15 and 16 represent test signals, generated by a periodical repetition of voiced sounds. These signals can be used to measure level adjustments for HFTs, which react comparable on the periodical repetition of a voiced sound and real speech. Additionally artificial voice can be used.

The signal given through figure 15 represents an input signal with a constantly changing level, whereas the signal levels are adjusted in certain steps for the signal in figure 16.

t

s (t)

Figure 15: Structure of test signal to determine level adjustments (constantly changing input level)

Suggested parameters for the signal in figure 15 are as follows:

signal generation

highest level lowest level level variation

SND direction voiced sound, period repeated

-3,0 dBPa infinite linear

RCV direction voiced sound, period repeated

-20 dBm0 infinite linear

The complete signal duration can be chosen to 10 s. The signal is suited to determine the range of level adjustments as a function of the input signal level.

s (t)

t

t1 t3 t4 t5 t6t2t0

Figure 16: Structure of test signal to determine level adjustments

Suggested parameters for the signal in figure 16 are as follows:

signal generation

signal level during (t1 - t0)

signal level during (t2 - t1)

signal level during (t4 - t3)

signal level during (t6 - t5)

SND direction voiced sound, periodically repeated

-3,0 dBPa

-8,0 dBPa

-13,0 dBPa

-18,0 dBPa

RCV direction voiced sound, periodically repeated

-20 dBm0

-25 dBm0

-30 dBm0

-35 dBm0

The signal duration of the single periods can be chosen to 2,5 s each. The signal is suited to determine especially the time duration for level adjustments.

Page 48: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 48

8.7 Additional echo disturbances Echo impairments may be generated at various places:

• acoustical echo due to insufficient decoupling between loudspeaker and microphone of a terminal;

• electrical echo due to echoes generated in the network typically hybrid echoes.

The annoyance caused by echoes is independent of the source of echo and therefore echo loss requirements and the measurement setups are typically independent of the type of echo.

The general measurements for TCL and TELR are described in clauses 7.11 and 7.12. Echo measurements in double talk condition are described in clause 8.4.

In addition to these "static" echo measurements the following parameters should be measured in systems with non linear and/or time variant signal processing elements like echocancellers, speech activated/speech controlled attenuation/switching, comfort noise injection:

• Parameters described in ITU-T Recommendation P.340 [20], chapter 10.3 in case acoustic echo cancellation is involved.

• Parameters described in ITU-T Recommendation G.168 [16], chapter 3.4 in case only network echo cancellers are involved.

8.8 Speech Sound Quality Speech sound quality is influenced by many parameters as they have been described before:

Frequency response, Loudness Rating, distortion, switching and others.

Since speech sound quality is mostly important during one way transmission, the one way transmission is discussed in this clause.

In the presence of non linear and time variant systems, e.g. when speech coding is used, packet loss occurs and switching is introduced, the "traditional" parameters as mentioned above may no longer give sufficient results. When evaluating only between electrical access points PESQ as described in ITU-T Recommendation P.862 [36] or the methods described in EG 201 377-1 [1] may give sufficient information about the impact of the various impairments on speech transmission quality.

When evaluating the complete connection including terminals, P.862 [36] in its present form cannot be applied. TOSQA 2001 (see EG 201 377-1 [1]) as used in the first and second ETSI VoIP speech quality test event [5] is one method, which may be used in end-to-end scenarios. This method may be replaced by a new ITU-T approved method as soon as the standardization process is finalized.

Page 49: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 49

Annex A (informative): Bibliography Diedrich and H. W. Gierlich: "Mouth-to-ear Speech Quality: Some General Considerations", ETSI EP TIPHON 10 WG5, TD 78. http://docbox.etsi.org/zArchive/TIPHON/TIPHON/ARCHIVES/1998/05-9810-Tel_Aviv/

Page 50: EG 201 377-2 - V1.1.1 - Speech Processing, Transmission ......2001/01/01  · Part 2: Mouth-to-ear speech transmission quality including terminals ETSI 2 ETSI EG 201 377-2 V1.1.1 (2004-04)

ETSI

ETSI EG 201 377-2 V1.1.1 (2004-04) 50

History

Document history

V1.1.1 January 2004 Membership Approval Procedure MV 20040326: 2004-01-27 to 2004-03-26

V1.1.1 April 2004 Publication