Quality of Experience Management / Sheng‐Wei Chen 2
What is QoE?
Quality of Experience =
User Satisfaction in Using Computer/Communication
Systems
Quality of Experience Management / Sheng‐Wei Chen 3
What is QoE Management?
Measurement
Provisioning
Measure user satisfaction
Improve system design to provide more satisfactory user experience
Quality of Experience Management / Sheng‐Wei Chen 4
Goal of QoE Management
To Provide
Satisfactory User Experience
in Computer/Communication
Systems
Quality of Experience Management / Sheng‐Wei Chen 5
Motivating Example
Network/computation resources are not infinite → conflicting goals everywhere
[Diagram: three candidate paths for voice/data across the Internet]
path  avail. bandwidth  loss rate  delay
1     10 Kbps           2%         100 ms
2     20 Kbps           1%         300 ms
3     30 Kbps           3%         500 ms
Which path is “the best”?
Audio quality vs. video quality
Audio/video quality vs. real‐timeliness
Conflicting Goals in Video Conferencing
[Screenshots: a time‐lagged stream vs. a low‐resolution stream]
Quality of Experience Management / Sheng‐Wei Chen 7
Challenges
Hard to measure and quantify users’ perception: not directly observable, massively multidimensional
Hard to reduce the system’s parameter space:
Network factors (delay, loss, jitter, …)
Transmission factors (redundancy, compression, …)
Codec factors (lots of codec‐dependent parameters)
Hard to measure and quantify the environment that may affect users’ experience
ambient noise
quality of headset
distance from viewer to display
Quality of Experience Management / Sheng‐Wei Chen 8
Our Research Focus
Video Conferencing
Online Entertainment
VoIP
Quality of Experience Management / Sheng‐Wei Chen 9
Our Work
Selected contributions:
The first QoE measurement methodology based on large‐scale user behavior observation
OneClick: a simple yet efficient framework for QoE measurement experiments
The first crowdsourcable QoE evaluation methodology
None of them are incremental work
Our Contribution #1: The first QoE measurement methodology based on large‐scale user behavior observation
Rationale (VoIP as an example)
The QoE perceived by users is more or less related to their call duration
[Diagram: QoS factors (network quality: jitter, delay; service level: source rate, TCP/UDP?, relayed?) are correlated with call duration (QoE)]
Quality of Experience Management / Sheng‐Wei Chen 11
Skype Call Duration vs. Network Quality
[Scatter plot: call duration (min) vs. jitter (Kbps), with the average and its 95% confidence band; quality worsens to the right]
There are short calls with good network quality
The average shows a negative correlation between the two variables
Our Contribution #1 (cont)
Proportional‐hazards modeling → Skype’s QoE prediction
Features:
No user studies required (more scalable)
Can be used to adjust system parameters at run time
Applies to all real‐time interactive applications
• Chen et al., "Quantifying Skype User Satisfaction," ACM SIGCOMM 2006 (cited by 63 papers since Sep 2006).
• Chen et al., "On the Sensitivity of Online Game Playing Time to Network QoS," IEEE INFOCOM 2006.
• Chen et al., "How Sensitive are Online Gamers to Network Quality?," Communications of the ACM, 2006.
• Chen et al., "Effect of Network Quality on Player Departure Behavior in Online Games," IEEE TPDS 2008.
Quality of Experience Management / Sheng‐Wei Chen 13
Our Contribution #2
OneClick: A simple yet efficient framework for QoE measurement experiments
Knocking at someone’s door:
Knock on the door
You wait, and you knock on the door again
You wait, and you knock on the door again and again, and …
Quality of Experience Management / Sheng‐Wei Chen 14
Our Contribution #2 (cont)
Simple instruction to users:
Click when you feel dissatisfied
Click multiple times when you feel even less satisfied
Estimating QoE from application quality and users’ click event process
[Diagram: application quality varies over time; user click events (feedback) cluster when quality drops, reflecting user satisfaction]
Quality of Experience Management / Sheng‐Wei Chen 15
Our Contribution #2 (cont)
Natural: we are already doing it to show loss of patience all the time
Bad‐memory proof: real‐time decisions, no need to “remember” past experience
Time‐aware: captures users’ responses at the time of the problems; useful to study recency and habituation effects
Chen et al, "OneClick: A Framework for Measuring Network Quality of Experience,” IEEE INFOCOM 2009.
Our Contribution #3: The first crowdsourcable QoE evaluation framework
Users’ inputs can be verified via the transitivity property: A > B and B > C → A > C
Detect inconsistent judgements from problematic users
Experiments can thus be outsourced to the Internet crowd:
lower monetary cost
wider participant diversity
while maintaining the evaluation results’ quality
Chen et al, "A Crowdsourceable QoE Evaluation Framework for Multimedia Content,” to appear in ACM Multimedia 2009 (full paper).
Quantifying Skype User Satisfaction
Collaborators: Chun‐Ying Huang, Polly Huang,
Chin‐Laung Lei (National Taiwan University)
Sheng‐Wei (Kuan‐Ta) Chen
Institute of Information Science, Academia Sinica
Appeared on ACM SIGCOMM 2006
2Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Motivation
Are users satisfied with our system?
User survey
Market response
→ User satisfaction metric
To make a system self‐adaptable in real time for better user experience
→ User satisfaction metric
Need of a Quality‐of‐Experience (QoE) metric!
3Kuan‐Ta Chen / Quantifying Skype User Satisfaction
QoE metrics
FTP applications: data throughput rate
Web applications: response time and page load time
VoIP applications: voice quality (fidelity, loudness, noise),
conversational delay, echo
Online games: interactivity, responsiveness, consistency,
fairness
QoE is multi‐dimensional esp. for real‐time interactive applications!
4Kuan‐Ta Chen / Quantifying Skype User Satisfaction
What path should Skype choose?
path  avail. bandwidth  loss rate  delay
1     10 Kbps           2%         100 ms
2     20 Kbps           1%         300 ms
3     30 Kbps           3%         500 ms
Internet
Which path is “the best”?
5Kuan‐Ta Chen / Quantifying Skype User Satisfaction
QoS and QoE
QoS (Quality of Service): the quality level of a “native” performance metric
Communication networks: delay, loss rate
Voice/audio codec: fidelity
DBMS: query completion time
QoE (Quality of Experience): how users “feel” about a service
Usually multi‐dimensional, and tradeoffs exist between different dimensions (download time vs. video quality, responsiveness vs. smoothness)
However, a unified (scalar) index is normally desired!
6Kuan‐Ta Chen / Quantifying Skype User Satisfaction
A typical relationship between QoS and QoE
[Figure: QoE vs. QoS (e.g., network bandwidth): at the low end it is hard to tell “very bad” from “extremely bad”; at the high end the marginal benefit is small]
7Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Mapping between QoS and QoE
Which QoS metric is most influential on users’ perceptions
(QoE)?
Source rate?
Loss?
Delay?
Jitter?
Combination of the above?
8Kuan‐Ta Chen / Quantifying Skype User Satisfaction
How to measure QoE: A quick review
Subjective evaluation procedures
Human studies, not scalable
Costly!
Objective evaluation procedures
Statistical models based on subjective evaluation results
Pros: Computation without human involvement
Cons: (Over‐)simplifications of model parameters
E.g., use a single “loss rate” to capture the packet loss process
E.g., assume every voice/video packet is equally important
Do not consider external effects such as loudness and quality of handsets
9Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Subjective Evaluation Procedures
Single Stimulus Method (SSM)
Single Stimulus Continuous Quality Evaluation (SSCQE)
Double Stimulus Continuous Quality Scale (DSCQS)
Double Stimulus Impairment Scale (DSIS)
Objective Evaluation Methods
Referenced models
speech‐layer model: PESQ (ITU‐T P.862)
Compare original and degraded signals
Unreferenced models (no original signals required)
speech‐layer model: P.VTQ (ITU‐T P.563)
Detect unnatural voices, noise, mute/interruptions in degraded signals
network‐layer model: E‐model (ITU‐T G.107)
Regression model based on delay, loss rate, and 20+ variables
Equations are over‐complex for physical interpretation, e.g.
Is = 20 · {[1 + (Xolr/8)^8]^(1/8) − Xolr/8}, where Xolr = OLR + 0.2·(64 + No − RLR)
11Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Our goals
An objective QoE assessment framework
passive measurement (thus scalable)
easy to construct models (for your own application)
easy to access input parameters
easy to compute in real time
12Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Our contributions
An index for Skype user satisfaction
derived from real‐life Skype call sessions
verified by users’ speech interactivity in calls
accessible and computable in real time
bit rate: data rate of voice packets
jitter: receiving rate jitter (level of network congestion)
RTT: round‐trip times between two parties
USI = 2.15× log(bit rate) − 1.55 × log(jitter)− 0.36× RTT
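Read as code, the index is cheap to compute at run time. A minimal sketch, assuming natural logarithms and RTT in seconds (units the slides leave implicit); these choices reproduce the USI values shown in the multi‐path slide later in the talk:

```python
import math

def usi(bit_rate_kbps: float, jitter_kbps: float, rtt_s: float) -> float:
    """User Satisfaction Index from the fitted coefficients.

    Assumed units (not stated on the slide): bit rate and jitter in Kbps,
    RTT in seconds, natural logarithms.
    """
    return (2.15 * math.log(bit_rate_kbps)
            - 1.55 * math.log(jitter_kbps)
            - 0.36 * rtt_s)

# Reproduces the multi-path slide: 3.84, 6.33, 5.43
print(round(usi(10, 2, 0.1), 2), round(usi(20, 1, 0.3), 2), round(usi(30, 3, 0.5), 2))
```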
13Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Talk outline
The Question
Measurement
Modeling
Validation
Significance
14Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Setting things up
[Diagram: a traffic monitor attached via port mirroring to an L3 switch uplink; a dedicated Skype node carries relayed traffic]
15Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Capturing Skype traffic
1. Identify Skype hosts and ports
Track hosts sending HTTP requests to “ui.skype.com”
Track their ports sending UDP within 10 seconds → (host, port)
Also track other parties that communicate with discovered (host, port) pairs
2. Record packets
Whose source or destination ∈ these (host, port)
Reduce the # of traced packets to 1‐2%
16Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Extracting Skype calls
1. Take the sessions with:
Average packet rate within (10, 100) pkt/sec
Average packet size within (30, 300) bytes
Duration longer than 10 seconds
2. Merge two sessions into one relayed session if:
The two sessions share a common relay node
Their start and finish times are within 30 seconds of each other
Their packet rate series are correlated (a filter sketch follows below)
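These session‐level heuristics map directly onto a simple filter. A minimal sketch, with Flow as a hypothetical per‐session record type (the slides do not prescribe any data structure):

```python
from dataclasses import dataclass

@dataclass
class Flow:
    """Hypothetical per-session aggregate; field names are illustrative."""
    duration_s: float   # session length in seconds
    packets: int        # total packets observed
    total_bytes: int    # total bytes observed

def looks_like_skype_call(f: Flow) -> bool:
    """Slide heuristics: 10-100 pkt/s average rate, 30-300 byte average
    packet size, and a duration above 10 seconds."""
    if f.duration_s <= 10 or f.packets == 0:
        return False
    rate = f.packets / f.duration_s
    size = f.total_bytes / f.packets
    return 10 < rate < 100 and 30 < size < 300
```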
17Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Probing RTTs
As we take traces
Send ICMP ping, application‐level ping & traceroute
Exponential intervals
18Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Trace Summary
[Diagram: campus network uplink monitored via port mirroring at an L3 switch; a dedicated Skype node and traffic monitor; direct sessions and relayed sessions traverse the Internet]
Category  Calls  Hosts  Avg. Time
Direct    253    240    29 min
Relayed   209    369    18 min
Total     462    570    24 min
19Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Talk outline
The Question
Measurement
Modeling
Validation
Significance
20Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The intuition behind our analysis
The conversation quality (i.e., QoE) perceived by call
parties is more or less related to the call duration
The network conditions of a VoIP call are independent of
importance of talk content
call parties’ schedule
call parties’ talkativeness
other incentives to talk (e.g., free of charge)
21Kuan‐Ta Chen / Quantifying Skype User Satisfaction
First, getting a better sense
[Diagram: are QoS factors (network quality: jitter, RTT; service level: source rate, TCP/UDP?, relayed?) correlated with call duration (QoE)?]
22Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Is call duration related to each factor?
For each factor:
Scatter-plot the factor against call duration
See whether they are positively, negatively, or not correlated
Hypothesis tests
Confirm whether they are indeed positively, negatively, or not correlated
23Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Call duration vs. jitter
[Scatter plot: average call duration (min) vs. jitter in Kbps (std. dev. of received bytes/sec), with the average and its 95% confidence band]
There are short calls with low jitter
The average shows a negative correlation between the two variables
24Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Effect of Jitter – Hypothesis Testing
The probability distribution of hanging up a call
Null hypothesis: all the survival curves are equivalent
Log‐rank test: P < 1e‐20
We have > 99.999% confidence in claiming that jitter is correlated with call duration
25Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Effect of Source Rate (the bandwidth Skype intended to use)
[Figure: average session time (min) vs. source rate]
26Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The better sense
[Diagram: observed correlations with call duration (QoE): positive for source rate; negative for jitter and RTT; none (non‐significant) for the others]
27Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Linear regression?
No!
Reasons:
Assumptions no longer hold: errors are not independent and not normally distributed
Variance of errors is not constant
Censorship: there are calls that had already been going on for a while, and calls that had not yet finished by the time we terminated tracing
We can’t simply discard these calls; otherwise we end up with a biased set of calls with limited call duration
28Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Cox regression modeling
The Cox regression model provides a good fit
Originally used to model the effect of treatment on patients’ survival time
The log‐hazard function is proportional to the weighted sum of factors
Hazard function (conditional failure rate): the instantaneous rate at which failures occur for observations that have survived to time t
h(t) = lim_{∆t→0} Pr[t ≤ T < t + ∆t | T ≥ t] / ∆t
Z: factors (bit rate = x, jitter = y, RTT = z, …)
β: weights of the factors
log h(t|Z) ∝ βᵀZ
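As a concrete (hedged) illustration, this kind of model can be fitted with the lifelines library; the slides do not say what software was used, and the toy call records and column names below are invented. Censored calls stay in the data with observed = 0 instead of being discarded:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Toy call records: duration (min), hang-up observed (0 = censored while
# tracing), and the log-scaled QoS factors. All values are made up.
calls = pd.DataFrame({
    "duration":     [29, 4, 12, 41, 7, 18, 33, 2, 25, 9],
    "observed":     [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "log_bit_rate": np.log([20, 12, 25, 30, 14, 22, 28, 11, 24, 16]),
    "log_jitter":   np.log([1.2, 4.0, 0.8, 0.5, 3.1, 1.0, 0.6, 4.5, 0.9, 2.4]),
    "rtt":          [0.10, 0.45, 0.08, 0.20, 0.40, 0.12, 0.09, 0.50, 0.15, 0.35],
})

# A small ridge penalty keeps the toy fit numerically stable.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(calls, duration_col="duration", event_col="observed")
cph.print_summary()   # the fitted beta in log h(t|Z) = log h0(t) + beta'Z
```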
29Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Functional Form Checks
The assumption h(t|Z) ∝ exp(βᵀZ) must hold
Explore the “true” functional forms of factors by generalized additive models
Bit rate and jitter → log scale
Human beings are known to be sensitive to the scale of a physical quantity rather than its magnitude:
• Scale of sound (decibels vs. intensity)
• Musical staff for notes (distance vs. frequency)
• Star magnitudes (magnitude vs. brightness)
30Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The Logarithm Fits Better (Bit rate)
After taking logarithm …
31Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The Logarithm Fits Better (Jitter)
After taking logarithm …
32Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Final model & interpretation
variable       coef   std. err.  signif.
log(bit rate)  −2.15  0.13       < 1e−20
log(jitter)    1.55   0.09       < 1e−20
RTT            0.36   0.18       4.3e−02
Interpretation
A: bit rate = 20 Kbps
B: bit rate = 15 Kbps, other factors same as A
The hazard ratio between A and B can be computed by exp((log(15) − log(20)) × −2.15) ≈ 1.86
The probability that B will hang up is 1.86 times the probability that A will do so at any instant.
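A two‐line sanity check of that arithmetic (plain Python, nothing Skype‐specific):

```python
import math
# Hazard ratio of B (15 Kbps) to A (20 Kbps) under beta = -2.15:
print(math.exp((math.log(15) - math.log(20)) * -2.15))  # ~1.86
```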
33Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Hang‐up rate and USI
Hang‐up rate = the fitted hazard of hanging up, h(t|Z)
User Satisfaction Index (USI) = −log(hang‐up rate)
= 2.15 × log(bit rate) − 1.55 × log(jitter) − 0.36 × RTT
34Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Actual and Predicted Time vs. USI
[Figure: average session time (min) vs. USI, actual and model‐predicted]
35Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The multi‐path scenario
path  avail. bandwidth  jitter  RTT     USI
1     10 Kbps           2 Kbps  100 ms  3.84
2     20 Kbps           1 Kbps  300 ms  6.33
3     30 Kbps           3 Kbps  500 ms  5.43
Internet
BUT, is call hang‐up rate a good indication of user satisfaction?
36Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Talk outline
The Question
Measurement
Modeling
Validation
Significance
37Kuan‐Ta Chen / Quantifying Skype User Satisfaction
User satisfaction: Validation
Call duration
Intuition: call duration ↔ satisfaction; not confirmed yet
38Kuan‐Ta Chen / Quantifying Skype User Satisfaction
User satisfaction: One step further
Speech interactivity ↔ call duration? Now we’re going to check!
Intuition: interactive and tight speech activities indicate a cheerful conversation
39Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Identifying talk bursts
The problem
Every voice packet is encrypted with 256‐bit AES
(Advanced Encryption Standard)
Possible solutions
packet rate: no silence suppression in Skype
packet size: our choice
40Kuan‐Ta Chen / Quantifying Skype User Satisfaction
What we need to achieve
Input: a time series of packet sizes
Output: estimated ON/OFF periods (ON = talk, OFF = silence)
[Figure: a packet‐size series over time with estimated ON/OFF periods marked]
41Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Speech activity detection
1. Wavelet de‐noising
Removing high‐frequency fluctuations
2. Detect peaks and dips
3. Dynamic thresholding
Deciding the beginning/end of a talk burst
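A rough sketch of these three steps with PyWavelets and SciPy; the wavelet family, decomposition level, and the midpoint thresholding rule are illustrative choices of mine, not the paper’s exact algorithm:

```python
import numpy as np
import pywt
from scipy.signal import find_peaks

def detect_talk_bursts(pkt_sizes: np.ndarray) -> np.ndarray:
    """Estimate ON/OFF (talk/silence) periods from a packet-size series."""
    # 1. Wavelet de-noising: zero the detail coefficients to remove
    #    high-frequency fluctuations (db4 at level 4 is an assumption).
    coeffs = pywt.wavedec(pkt_sizes, "db4", level=4)
    coeffs[1:] = [np.zeros_like(c) for c in coeffs[1:]]
    smooth = pywt.waverec(coeffs, "db4")[: len(pkt_sizes)]

    # 2. Detect peaks (talk) and dips (silence) in the smoothed series.
    peaks, _ = find_peaks(smooth)
    dips, _ = find_peaks(-smooth)
    peak_lvl = np.median(smooth[peaks]) if len(peaks) else smooth.max()
    dip_lvl = np.median(smooth[dips]) if len(dips) else smooth.min()

    # 3. Dynamic thresholding: ON where the series sits closer to the
    #    peak level than to the dip level.
    return smooth > (peak_lvl + dip_lvl) / 2.0
```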
42Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Speech detection algorithm: Validation
The speech detection algorithm is validated with:
synthesized sine waves (500 Hz – 2000 Hz)
real speech recordings
relay node (chosen by Skype): average RTT 350 ms, jitter 5.1 Kbps
This forces the packet size processes to be contaminated by serious network impairment (delay and loss)
play sound → capture packet size processes
43Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Validation with synthesized sine waves
3 times for each of 10 test cases
correctness (ratio of matched 0.1‐second periods): 0.73 – 0.92
[Figure: true vs. estimated ON periods]
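The correctness figure can be read as bin‐wise agreement; a small sketch, assuming both series are boolean arrays sampled every 0.1 second:

```python
import numpy as np

def correctness(true_on: np.ndarray, est_on: np.ndarray) -> float:
    """Ratio of matched 0.1-second periods (my reading of the metric)."""
    n = min(len(true_on), len(est_on))
    return float(np.mean(true_on[:n] == est_on[:n]))
```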
44Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Validation with speech recordings
3 times for each of 3 test cases
correctness (ratio of matched 0.1‐second periods): 0.71 – 0.85
[Figure: true vs. estimated ON periods]
45Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Speech interactivity analysis
Responsiveness: whether the other party responds
Response delay: how long before the other party responds
Burst length: how long a speech burst lasts
46Kuan‐Ta Chen / Quantifying Skype User Satisfaction
USI vs. Speech interactivity
All are statistically significant (at 0.01 significance level)
Speech interactivity in conversations supports the proposed USI:
higher USI → higher responsiveness
higher USI → shorter response delay
higher USI → shorter burst length
47Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Talk outline
The Question
Measurement
Modeling
Validation
Significance
48Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Implications
We should pay more attention to delay jitter (rather than focusing on network delay only), and to the encoding bit rate!
49Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Significance
QoE‐aware systems that can optimize user experience at run time
Is it worth sacrificing 20 ms of latency to reduce 10 ms of jitter (say, with a de‐jitter buffer)?
Pick the most appropriate parameters at run time:
playout scheduling (buffer time)
coding scheme (& rate)
source rate
data path (overlay routing)
transmission scheme (redundancy, erasure coding, …)
50Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Future work (1)
Measurement
larger data sets (p2p traffic is hard to collect)
diverse locations
Validation
user studies
comparison with existing models (PESQ, etc)
51Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Future work (2)
Beyond “call duration”
Who hangs up a call?
Call disconnect‐n‐connect behavior
More sophisticated modeling
Voice codec
Pricing effect
Time‐of‐day effect
Time‐dependent impact on call behavior?
52Kuan‐Ta Chen / Quantifying Skype User Satisfaction
How Aware are Gamers of Service Quality?
Real‐time interactive online games are generally considered QoS‐sensitive
Gamers are always complaining about high
“ping‐times” or network lags
Online gaming is increasingly popular despite the best‐effort
Internet
Q1: Are game players really sensitive to network quality as they claim?
Q2: If so, how do they react to poor network quality?
Appeared on IEEE INFOCOM 2006
53Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Case Study: ShenZhou Online
54Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Traffic Trace Collection
[Diagram: trace collection on a Gigabit Ethernet link]
trace  conn.   # packets (in/out/both)  bytes (in/out/both)
N1     57,945  342M / 353M / 695M       4.7TB / 27.3TB / 32.0TB
N2     54,424  325M / 336M / 661M       4.7TB / 21.7TB / 26.5TB
55Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Delay Jitter vs. Session Time (std. dev. of the round-trip times)
56Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Hypothesis Testing – Effect of Loss Rate
Null Hypothesis: All the survival curves are equivalent
Log‐rank test: P < 1e‐20
We have > 99.999% confidence in claiming that loss rates are correlated with game playing times
[Figure: CCDF of game session times under low, medium, and high loss]
57Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Regression Modeling
Linear regression is not adequate
Violating the assumptions (normal errors, equal variance, …)
The Cox regression model provides a good fit
Log‐hazard function is proportional to the weighted sum of factors
Hazard function (conditional failure rate): the instantaneous rate of quitting the game for a player (session)
h(t) = lim_{∆t→0} Pr[t ≤ T < t + ∆t | T ≥ t] / ∆t
where each session has factors Z (RTT = x, jitter = y, …)
log h(t|Z) ∝ βᵀZ (our aim is to compute β)
58Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Final Model & Interpretation
Interpretation
A: RTT = 200 ms; B: RTT = 100 ms, other factors same as A
Hazard ratio between A and B: exp((log(0.2) − log(0.1)) × 1.27) ≈ 2.4
A is 2.4 times as likely as B to leave the game at any moment
Variable     Coef  Std. Err.  Signif.
log(RTT)     1.27  0.04       < 1e−20
log(jitter)  0.68  0.03       < 1e−20
log(closs)   0.12  0.01       < 1e−20
log(sloss)   0.09  0.01       7e−13
59Kuan‐Ta Chen / Quantifying Skype User Satisfaction
How good does the model fit?
60Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Relative Influence of QoS Factors
Latency = 20%; Client packet loss = 20%
Delay jitter = 45%; Server packet loss = 15%
61Kuan‐Ta Chen / Quantifying Skype User Satisfaction
An Index for ShenZhou Online
Features:
derived from real‐life game sessions
accessible and computable in real time
implication: delay jitter is more intolerable than delay
RTT: round‐trip times
jitter: level of network congestion
closs: loss rate of client packets
sloss: loss rate of server packets
log(departure rate) ∝ 1.27× log(RTT) + 0.68× log(jitter) +0.12× log(closs) + 0.09× log(sloss)
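A sketch of using the index to compare two candidate network settings; RTT in seconds follows the earlier interpretation slide, and the jitter/loss values below are made up:

```python
import math

def log_departure_rate(rtt_s, jitter, closs, sloss):
    """Relative log departure rate from the fitted model (lower is better)."""
    return (1.27 * math.log(rtt_s) + 0.68 * math.log(jitter)
            + 0.12 * math.log(closs) + 0.09 * math.log(sloss))

a = log_departure_rate(0.2, 0.05, 0.01, 0.01)    # setting A
b = log_departure_rate(0.1, 0.08, 0.02, 0.01)    # setting B
print("hazard ratio A vs. B:", math.exp(a - b))  # > 1: A drives more quits
```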
62Kuan‐Ta Chen / Quantifying Skype User Satisfaction
App #1: Evaluation of Alternative Designs
Suppose now we have two designs (e.g., protocols)
One leads to lower delay but high jitter:
100 ms, 120 ms, 100 ms, 120 ms, 100 ms, 120 ms, 100 ms, 120 ms, …
One leads to higher delay but lower jitter:
150 ms, 150 ms, 150 ms, 150 ms, 150 ms, 150 ms, 150 ms, 150 ms, …
Which design shall we choose?
[Figure: network latency over time for the two designs]
63Kuan‐Ta Chen / Quantifying Skype User Satisfaction
App #2: Overlay Path Selection
Internet
path  delay       jitter     loss rate  score
1     100 ms (G)  50 ms (P)  5% (P)     3.84
2     150 ms (A)  20 ms (G)  1% (A)     6.33
3     200 ms (P)  30 ms (A)  1% (A)     5.43
(G = good, A = acceptable, P = poor)
64Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Other Applications
Deciding the smoothing buffer: is it worth sacrificing 20 ms of latency to reduce 10 ms of jitter?
Maintaining fairness: allocate more resources to players experiencing poor QoS
65Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Player Departure Behavior Analysis
Player departure rate decreases over time
The golden time is the first 10 minutes: the longer gamers play, the more external factors affect their decision to stay or leave
→ allocate more resources to players who have just entered
[Figure: estimated hazard function vs. game session time (hour), weekday vs. weekend]
[Figure: Nagelkerke R² vs. observation time (min) for 5%, 10%, and 20% exit; an intermediate observation time is most predictable]
IEEE INFOCOM 2009
OneClick: A Framework for Measuring Network Quality of Experience
Kuan‐Ta Chen, Cheng‐Chu Tu, Wei‐Cheng Xiao
Institute of Information Science, Academia Sinica
Appeared on IEEE INFOCOM 2009
IEEE INFOCOM 2009
QoS and QoE
QoS (Quality of Service): the quality level of a system performance metric
Communication networks: delay, loss rate
DBMS: query completion time
QoE (Quality of Experience): the quality of how users “feel” about a service
Subjective: Mean Opinion Score (MOS)
Objective: PSNR (picture), PESQ (voice), VQM (video)
IEEE INFOCOM 2009
Relationship between QoS and QoE
[Figure: QoE vs. QoS (e.g., network bandwidth): too bad to perceive at the low end, then a comfort range, then small marginal benefit]
OneClick: A Framework for Measuring Network Quality of Experience 4
Knowing the Relationship is Important!
So we know
How to adapt voice/video/game data rate (QoS) for user
satisfaction (QoE)
So we really know
How to send multimedia data over the Internet
OneClick: A Framework for Measuring Network Quality of Experience 5
Measuring QoS and QoE
QoS (A great body of work)
Measure network loss, delay, available bandwidth
Infer topology
Estimate network capacity
etc
QoE (Some work)
Objective: PSNR (picture), PESQ (voice), VQM (video)
Subjective: MOS (general)
Still not quite the human experience, which is multi-dimensional
What’s left!
IEEE INFOCOM 2009
MOS (Mean Opinion Score)
1. Slow in scoring (think/interpretation time)
2. People are limited by finite memory
3. Cannot capture users’ perceptions over time
4. MOS is coarse in scale granularity
5. Dissimilar interpretations of the scale among users
Problems
IEEE INFOCOM 2009
Our Ambition
Identify a simple and yet efficient way
to measure users’ satisfaction
OneClick: A Framework for Measuring Network Quality of Experience 8
The Idea: Click, Click, Click
Web surfing
Click on a link
You wait, and you refresh the link
You wait, and you refresh the link again, and again, and …
Knocking at someone’s door
Knock on the door
You wait, and you knock on the door again
You wait, and you knock on the door again and again, and …
OneClick: A Framework for Measuring Network Quality of Experience 9
Introducing OneClick
Simple instruction to users:
Click when you feel dissatisfied
Click multiple times when you feel even less satisfied
Clicking rate as the QoE
[Diagram: application quality varies over time; user click events (feedback) cluster when quality drops, reflecting user satisfaction]
Nice Things about OneClick
Natural: we are already doing it to show loss of patience all the time
Bad‐memory proof: real‐time decisions, no need to “remember” past experience
Time‐aware: captures users’ responses at the time of the problems; useful to study recency, memory access, and habituation effects
OneClick: A Framework for Measuring Network Quality of Experience 11
Easy to Implement
As a plug‐in to your network applications
Flash version done!
Co‐measurement of QoS and QoE
[Diagram: application quality and user click events recorded together over time]
OneClick: A Framework for Measuring Network Quality of Experience 12
Talk Progress
Overview
Methodology
Pilot Study
Validation
Case Studies
Conclusion
IEEE INFOCOM 2009
Human as a QoE Rating System
[Diagram: vary the network setting → it affects application QoS and QoE → the user’s click events reflect QoE; observe the click events]
IEEE INFOCOM 2009
QoE vs. QoS Modeling
Click events as a counting process → Poisson regression
C(t): QoE, the clicking rate at time t
N1(t), N2(t), …: QoS, the network conditions at time t
αi: regression coefficients, derived by the maximum likelihood method
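A minimal sketch of this Poisson regression with statsmodels on synthetic data; the QoS series, coefficient values, and variable names are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
loss = rng.uniform(0.0, 0.3, 300)            # N1(t): loss rate per second
bw = rng.uniform(10.0, 100.0, 300)           # N2(t): bandwidth per second
lam = np.exp(0.5 + 6.0 * loss - 0.02 * bw)   # synthetic "true" click rate
clicks = rng.poisson(lam)                    # C(t): observed clicks

X = sm.add_constant(np.column_stack([loss, bw]))
fit = sm.GLM(clicks, X, family=sm.families.Poisson()).fit()
print(fit.params)     # alpha_i, estimated by maximum likelihood
print(fit.deviance)   # residual deviance, used below as goodness of fit
```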
IEEE INFOCOM 2009
Wait a Minute…
Response delays? Users may not be able to click immediately after they are aware of the degraded quality
Is the clicking rate of a user consistent? Does a subject give similar ratings in repeated experiments?
Is the clicking rate consistent across users? Different subjects may have different preferences in click decisions.
IEEE INFOCOM 2009
Pilot Study
A 5‐minute English song
Audio quality of AIM Messenger with various network settings
IEEE INFOCOM 2009
Test Material Compilation
For each network setting:
Play the song
Record the song
K settings → K recordings
A random test material = non‐overlapping segments from K different recordings
IEEE INFOCOM 2009
Response Delays
Try Poisson regression of C(t+x) on N1(t), N2(t), …
Vary x
Show the goodness of fit for each x
OneClick: A Framework for Measuring Network Quality of Experience 20
Our Solution
Shift the click event process by time d
d is decided by model fitting:
Let d be the x with the best goodness of fit, i.e., the x whose residual deviance is the minimum
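A sketch of that selection rule, reusing the statsmodels fit above and assuming clicks and qos are aligned, evenly sampled arrays:

```python
import numpy as np
import statsmodels.api as sm

def best_shift(clicks: np.ndarray, qos: np.ndarray, max_shift: int = 10) -> int:
    """Return the response delay d (in samples) whose shifted regression
    C(t+d) ~ QoS(t) yields the minimum residual deviance."""
    best_d, best_dev = 0, np.inf
    for d in range(max_shift + 1):
        y = clicks[d:]                              # C(t+d)
        X = sm.add_constant(qos[: len(qos) - d])    # QoS(t)
        dev = sm.GLM(y, X, family=sm.families.Poisson()).fit().deviance
        if dev < best_dev:
            best_d, best_dev = d, dev
    return best_d
```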
Calibration and Normalization Added
[Diagram: OneClick measurement pipeline: per‐user response delay calibration, then regression modeling with normalization across users (User #1, User #2, …)]
OneClick: A Framework for Measuring Network Quality of Experience 24
Talk Progress
Overview
Methodology
Pilot Study
Validation
Case Studies
Conclusion
OneClick: A Framework for Measuring Network Quality of Experience 25
Exact problem we are trying to solve
Rationale:
Direct: get people to do OneClick and MOS → compare click rate with MOS
Indirect: get people to do OneClick and PESQ/VQM → compare click rate with PESQ/VQM
IEEE INFOCOM 2009
PESQ‐based Validation
PESQ: Perceptual Evaluation of Speech Quality
OneClick vs. PESQ to evaluate the audio quality of three VoIP applications
AIM
MSN Messenger
Skype
Network factors:
Loss rates (0% – 30%)
Bandwidth (10 Kbps – 100 Kbps)
[Validation]
IEEE INFOCOM 2009
VQM‐based Validation
VQM: Video Quality Measurement
OneClick vs. VQM to evaluate video quality of two video codecs
H.264
WMV9 (Windows Media Video)
Factors:
Compression bit rate (200 Kbps – 1000 Kbps)
[Validation]
OneClick: A Framework for Measuring Network Quality of Experience 30
Talk Progress
Overview
Methodology
Pilot Study
Validation
Case Studies
Conclusion
IEEE INFOCOM 2009
Case Studies
Evaluation of applications’ QoE
VoIP applications:
AIM
MSN Messenger
Skype
First‐person shooter games:
Halo
Unreal Tournament
IEEE INFOCOM 2009
Varying Bandwidth
MSN Messenger is generally the worst
Skype is the best if bw < 80 Kbps, otherwise AIM is the best
[Case Study]
Contour Lines of Click Rates
Slope of a contour line = the application’s relative sensitivity to loss vs. bandwidth shortage
AIM is relatively more sensitive to network losses
[Case Study]
Comfort Region
Comfort region: a set of network configurations that leads to satisfactory QoE
Skype is the best in bandwidth‐restricted scenarios (< 60 Kbps) when the loss rate is < 10%
[Case Study]
OneClick: A Framework for Measuring Network Quality of Experience 35
Talk Progress
Overview
Methodology
Pilot Study
Validation
Case Studies
Conclusion
IEEE INFOCOM 2009
Nice about OneClick
Natural & fast: we are already doing it to show loss of patience all the time
Bad‐memory proof: no need to “remember” past experience
Time‐aware: captures users’ responses at the time of the problems
Fine‐grained: the score can be 0.2, 3.5, or even 12.345
Normalized user interpretation: different interpretations are normalized
Easy to implement: http://mmnet.iis.sinica.edu.tw/proj/oneclick/
OneClick: A Framework for Measuring Network Quality of Experience 38
On‐Going Work
Large‐scale experiments (by crowdsourcing)
http://mmnet.iis.sinica.edu.tw/proj/oneclick/
Click rate vs. MOS
QoE‐centric multimedia networking
As an example, Tuning the Redundancy Control Algorithm of
Skype for User Satisfaction, IEEE INFOCOM 2009.
ACM Multimedia 2009
A Crowdsourceable QoE Evaluation Framework for Multimedia Content

Kuan‐Ta Chen (Academia Sinica)
Chen‐Chi Wu (National Taiwan University)
Yu‐Chun Chang (National Taiwan University)
Chin‐Laung Lei (National Taiwan University)
Appeared on ACM Multimedia 2009
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 2
What is QoE?
Quality of Experience =
Users’ satisfaction with a service
(e.g., multimedia content)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 3
Quality of Experience
Poor (underexposed)
Good (exposure OK)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 4
Challenges
How to quantify the QoE of multimedia content efficiently and reliably?
[Figure: several versions of the same content, each with unknown quality Q = ?]
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 5
Mean Opinion Score (MOS)
Idea: Single Stimulus Method (SSM) + Absolute Category Rating (ACR)
[Diagram: the user votes one of Excellent / Good / Fair / Poor / Bad, e.g., “Fair”]
MOS  Quality    Impairment
5    Excellent  Imperceptible
4    Good       Perceptible but not annoying
3    Fair       Slightly annoying
2    Poor       Annoying
1    Bad        Very annoying
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 6
Drawbacks of MOS‐based Evaluations
ACR‐based:
Concepts of the scales cannot be concretely defined
Dissimilar interpretations of the scale among users
Only an ordinal scale, not an interval scale
Difficult to verify users’ scores
Subjective experiments in the laboratory:
Monetary cost (reward, transportation)
Labor cost (supervision)
Physical space/time/hardware constraints
Solve all these drawbacks → Crowdsourcing + Paired Comparison
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 9
Talk Progress
Overview
MethodologyPaired Comparison
Crowdsourcing Support
Experiment Design
Case Study & EvaluationAcoustic QoE
Optical QoE
Conclusion
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 10
Current Approach: MOS Rating
[Diagram: the user votes one of Excellent / Good / Fair / Poor / Bad]
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 11
Our Proposal: Paired Comparison
[Diagram: the user compares A and B and votes which one is better]
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 12
Properties of Paired Comparison
Generalizable across different content types and applications
Simple comparative judgment: a dichotomous decision is easier than a 5‐category rating
Interval‐scale QoE scores can be inferred
The users’ inputs can be verified
Choice Frequency Matrix
     A  B  C   D
A    0  9  10  9
B    1  0  7   8
C    0  3  0   6
D    1  2  4   0
10 experiments, each containing C(4,2) = 6 paired comparisons
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 14
Inference of QoE Scores
Bradley‐Terry‐Luce (BTL) model
Input: choice frequency matrix
Output: an interval‐scale score for each content (based on maximum likelihood estimation)
Pij = πi / (πi + πj) = e^{u(Ti)} / (e^{u(Ti)} + e^{u(Tj)}) = 1 / (1 + e^{u(Tj) − u(Ti)})
n content items: T1, …, Tn
Pij: the probability of choosing Ti over Tj
u(Ti): the estimated QoE score of the quality level Ti
Basic idea: P12 = P23 ⟹ u(T1) − u(T2) = u(T2) − u(T3)
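A compact maximum‐likelihood sketch of the BTL fit with SciPy, using the choice frequency matrix from the earlier slide; u(T1) is anchored at 0 because only score differences are identified:

```python
import numpy as np
from scipy.optimize import minimize

# C[i, j]: how often content i was preferred over content j.
C = np.array([[0, 9, 10, 9],
              [1, 0,  7, 8],
              [0, 3,  0, 6],
              [1, 2,  4, 0]], dtype=float)

def neg_log_lik(free_u):
    u = np.concatenate([[0.0], free_u])  # anchor u(T1) = 0
    diff = u[:, None] - u[None, :]       # u(Ti) - u(Tj)
    p = 1.0 / (1.0 + np.exp(-diff))      # BTL: P(Ti chosen over Tj)
    return -(C * np.log(p)).sum()

res = minimize(neg_log_lik, np.zeros(len(C) - 1))
u = np.concatenate([[0.0], res.x])
print((u - u.min()) / (u.max() - u.min()))  # scores rescaled to [0, 1]
```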
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 15
Inferred QoE Scores
[Figure: inferred interval‐scale QoE scores, e.g., 0, 0.63, 0.91, 1]
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 16
Talk Progress
Overview
MethodologyPaired Comparison
Crowdsourcing Support
Experiment Design
Case Study & EvaluationAcoustic QoE
Optical QoE
Conclusion
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 17
Crowdsourcing
= Crowd + Outsourcing
“soliciting solutions via open calls to large‐scale communities”
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 18
Image Understanding
Reward: 0.04 USD
main theme? key objects? unique attributes?
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 19
Linguistic Annotations
Word similarity (Snow et al. 2008)
USD 0.2 for labeling 30 word pairs
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 20
More Examples
Document relevance evaluation: Alonso et al. (2008)
Document rating collection: Kittur et al. (2008)
Noun compound paraphrasing: Nakov (2008)
Person name resolution: Su et al. (2007)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 21
The Risk
Users may give erroneous feedback perfunctorily, carelessly, or dishonestly
Dishonest users have more incentives to perform tasks
Not every Internet user is trustworthy!
Need to have an ONLINE algorithm to detect problematic inputs!
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 22
Verification of Users’ Inputs (1)
Transitivity property: if A > B and B > C, then A should be > C
Transitivity Satisfaction Rate (TSR) =
(# of triples satisfying the transitivity rule) / (# of triples the transitivity rule may apply to)
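A direct sketch of the TSR computation, assuming judgments are stored as a nested boolean mapping prefer[a][b] meaning “a was judged better than b”:

```python
from itertools import permutations

def tsr(prefer: dict) -> float:
    """Transitivity Satisfaction Rate: among ordered triples (a, b, c)
    with a > b and b > c (the applicable triples), the fraction that
    also satisfy a > c. Defined as 1.0 when no triple applies."""
    applicable = satisfied = 0
    for a, b, c in permutations(prefer, 3):
        if prefer[a][b] and prefer[b][c]:
            applicable += 1
            satisfied += bool(prefer[a][c])
    return satisfied / applicable if applicable else 1.0
```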
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 23
Verification of Users’ Inputs (2)
Detect inconsistent judgments from problematic users:
TSR = 1 → perfect consistency
TSR ≥ 0.8 → generally consistent
TSR < 0.8 → judgments are inconsistent
TSR‐based reward / punishment (e.g., only pay a reward if TSR > 0.8)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 24
Experiment Design
For n algorithms (e.g., speech encoding):
1. Pick a source content as the evaluation target
2. Apply the n algorithms to generate n content items with different quality
3. Ask a user to perform C(n,2) paired comparisons
4. Compute TSR after an experiment
Reward a user ONLY if his inputs are self‐consistent (i.e., TSR is higher than a certain threshold)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 25
Concept Flow in Each Round
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 26
Audio QoE Evaluation
Which one is better?
(SPACE key released) (SPACE key pressed)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 27
Video QoE evaluation
Which one is better?
(SPACE key released) (SPACE key pressed)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 28
Talk Progress
Overview
MethodologyPaired Comparison
Crowdsourcing Support
Experiment Design
Case Study & EvaluationAcoustic QoE
Optical QoE
Conclusion
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 29
Audio QoE Evaluation
MP3 compression level:
Source clips: one fast‐paced and one slow‐paced song
MP3 CBR format with 6 bit rate levels: 32, 48, 64, 80, 96, and 128 Kbps
127 participants and 3,660 paired comparisons
Effect of packet loss rate on VoIP:
Two speech codecs: G.722.1 and G.728
Packet loss rate: 0%, 4%, and 8%
62 participants and 1,545 paired comparisons
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 30
Inferred QoE Scores
MP3 Compression Level VoIP Packet Loss Rate
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 31
Video QoE Evaluation
Video codec:
Source clips: one fast‐paced and one slow‐paced video clip
Three codecs: H.264, WMV3, and XVID
Two bit rates: 400 and 800 Kbps
121 participants and 3,345 paired comparisons
Loss concealment scheme:
Source clips: one fast‐paced and one slow‐paced video clip
Two schemes: frame copy (FC) and FC with frame skip (FCFS)
Packet loss rate: 1%, 5%, and 8%
91 participants and 2,745 paired comparisons
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 32
Inferred QoE Scores
Video Codec Concealment Scheme
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 33
Participant Source
Laboratory: recruit part‐time workers at an hourly rate of 8 USD
MTurk: post experiments on the Mechanical Turk web site; pay the participant 0.15 USD for each qualified experiment
Community: seek participants on the website of an Internet community with 1.5 million members; pay the participant an amount of virtual currency equivalent to one US cent for each qualified experiment
Participant Source Evaluation
With crowdsourcing…
lower monetary cost
wider participant diversity
maintaining the evaluation results’ quality
Crowdsourcing seems a good strategy for multimedia QoE assessment!
Conclusion
Crowdsourcing is not without limitations:
physical contact
environment control
media
With paired comparison and user input verification:
less monetary cost
wider participant diversity
shorter experiment cycle
evaluation quality maintained
Quality of Experience Management / Sheng‐Wei Chen 17
Future Plan
QoE measurement:
Psychophysical approach
Exploit social gaming to provide the incentives for large‐scale studies
Goals: cross‐application, cross‐modal, content‐dependent, context‐dependent
QoE provisioning:
QoE‐aware communication systems
Parameters auto‐configured at run time:
playout scheduling, coding scheme (& rate), overlay path routing, transmission (redundancy, coding, protocol), etc.
Quality of Experience Management / Sheng‐Wei Chen 18
Acknowledgements
Chin‐Laung Lei Chun‐Ying Huang
Polly Huang
Yu‐Chun Chang Te‐Yuan Huang
William Tu
Hung‐Hsuan Chen
Chen‐Chi Wu
Wei‐Cheng Xiao