Quality of Experience Management / Sheng‐Wei Chen 2
What is QoE?
Quality of Experience =
User Satisfaction in Using Computer/Communication
Systems
Quality of Experience Management / Sheng‐Wei Chen 3
What is QoE Management?
Measurement
Provisioning
Measure user satisfaction
Improve system design to provide more satisfactory user experience
Quality of Experience Management / Sheng‐Wei Chen 4
Goal of QoE Management
To Provide
Satisfactory User Experience
in Computer/Communication
Systems
Quality of Experience Management / Sheng‐Wei Chen 5
Motivating Example
Network/computation resources are not infinite → conflicting goals everywhere
[Diagram: three candidate paths for voice/data across the Internet]
path  avail. bandwidth  loss rate  delay
1     10 Kbps           2%         100 ms
2     20 Kbps           1%         300 ms
3     30 Kbps           3%         500 ms
Which path is “the best”?
Audio quality vs. video quality
Audio/video quality vs. real‐timeliness
Conflicting Goals in Video Conferencing
[Screenshots: a time‐lagged stream vs. a low‐resolution stream]
Quality of Experience Management / Sheng‐Wei Chen 7
Challenges
Hard to measure and quantify users’ perception: not directly observable, massively multidimensional
Hard to reduce the system’s parameter space:
Network factors (delay, loss, jitter, …)
Transmission factors (redundancy, compression, …)
Codec factors (lots of codec‐dependent parameters)
Hard to measure and quantify the environment that may affect users’ experience
ambient noise
quality of headset
distance from viewer to display
Quality of Experience Management / Sheng‐Wei Chen 8
Our Research Focus
Video Conferencing
Online Entertainment
VoIP
Quality of Experience Management / Sheng‐Wei Chen 9
Our Work
Selected contributions:
The first QoE measurement methodology based on large‐scale user behavior observation
OneClick: a simple yet efficient framework for QoE measurement experiments
The first crowdsourcable QoE evaluation methodology
None of them are incremental work
Our Contribution #1: The first QoE measurement methodology based on large‐scale user behavior observation
Rationale (VoIP as an example)
The QoE perceived by users is more or less related to their call duration
[Diagram: QoS factors (network quality: jitter, delay; service level: source rate, TCP/UDP?, relayed?) are correlated with call duration (QoE)]
Quality of Experience Management / Sheng‐Wei Chen 11
Skype Call Duration vs. Network Quality
[Scatter plot: call duration (min) vs. jitter (Kbps), with the average and its 95% confidence band; quality worsens to the right]
There are short calls with good network quality
The average shows a negative correlation between the two variables
Our Contribution #1 (cont)
Proportional‐hazards modeling → Skype’s QoE prediction
Features:
No user studies required (more scalable)
Can be used to adjust system parameters at run time
Applies to all real‐time interactive applications
• Chen et al., "Quantifying Skype User Satisfaction," ACM SIGCOMM 2006 (cited by 63 papers since Sep 2006).
• Chen et al., "On the Sensitivity of Online Game Playing Time to Network QoS," IEEE INFOCOM 2006.
• Chen et al., "How Sensitive are Online Gamers to Network Quality?," Communications of the ACM, 2006.
• Chen et al., "Effect of Network Quality on Player Departure Behavior in Online Games," IEEE TPDS 2008.
Quality of Experience Management / Sheng‐Wei Chen 13
Our Contribution #2
OneClick: A simple yet efficient framework for QoE measurement experiments
Knocking at someone’s door:
Knock on the door
You wait, and you knock on the door again
You wait, and you knock on the door again and again, and …
Quality of Experience Management / Sheng‐Wei Chen 14
Our Contribution #2 (cont)
Simple instruction to users:
Click when you feel dissatisfied
Click multiple times when you feel even less satisfied
Estimating QoE from application quality and users’ click event process
[Diagram: application quality varies over time; user click events (feedback) cluster when quality drops, reflecting user satisfaction]
Quality of Experience Management / Sheng‐Wei Chen 15
Our Contribution #2 (cont)
Natural: we are already doing it to show loss of patience all the time
Bad‐memory proof: real‐time decisions, no need to “remember” past experience
Time‐aware: captures users’ responses at the time of the problems; useful to study recency and habituation effects
Chen et al, "OneClick: A Framework for Measuring Network Quality of Experience,” IEEE INFOCOM 2009.
Our Contribution #3: The first crowdsourcable QoE evaluation framework
Users’ inputs can be verified via the transitivity property: A > B and B > C → A > C
Detect inconsistent judgements from problematic users
Experiments can thus be outsourced to the Internet crowd:
lower monetary cost
wider participant diversity
while maintaining the evaluation results’ quality
Chen et al, "A Crowdsourceable QoE Evaluation Framework for Multimedia Content,” to appear in ACM Multimedia 2009 (full paper).
Quantifying Skype User Satisfaction
Collaborators: Chun‐Ying Huang, Polly Huang,
Chin‐Laung Lei (National Taiwan University)
Sheng‐Wei (Kuan‐Ta) Chen
Institute of Information Science, Academia Sinica
Appeared on ACM SIGCOMM 2006
2Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Motivation
Are users satisfied with our system?
User survey
Market response
→ User satisfaction metric
To make a system self‐adaptable in real time for better user experience
→ User satisfaction metric
Need of a Quality‐of‐Experience (QoE) metric!
3Kuan‐Ta Chen / Quantifying Skype User Satisfaction
QoE metrics
FTP applications: data throughput rate
Web applications: response time and page load time
VoIP applications: voice quality (fidelity, loudness, noise),
conversational delay, echo
Online games: interactivity, responsiveness, consistency,
fairness
QoE is multi‐dimensional esp. for real‐time interactive applications!
4Kuan‐Ta Chen / Quantifying Skype User Satisfaction
What path should Skype choose?
path  avail. bandwidth  loss rate  delay
1     10 Kbps           2%         100 ms
2     20 Kbps           1%         300 ms
3     30 Kbps           3%         500 ms
Internet
Which path is “the best”?
5Kuan‐Ta Chen / Quantifying Skype User Satisfaction
QoS and QoE
QoS (Quality of Service): the quality level of a “native” performance metric
Communication networks: delay, loss rate
Voice/audio codec: fidelity
DBMS: query completion time
QoE (Quality of Experience): how users “feel” about a service
Usually multi‐dimensional, and tradeoffs exist between different dimensions (download time vs. video quality, responsiveness vs. smoothness)
However, a unified (scalar) index is normally desired!
6Kuan‐Ta Chen / Quantifying Skype User Satisfaction
A typical relationship between QoS and QoE
[Figure: QoE vs. QoS (e.g., network bandwidth): at the low end it is hard to tell “very bad” from “extremely bad”; at the high end the marginal benefit is small]
7Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Mapping between QoS and QoE
Which QoS metric is most influential on users’ perceptions
(QoE)?
Source rate?
Loss?
Delay?
Jitter?
Combination of the above?
8Kuan‐Ta Chen / Quantifying Skype User Satisfaction
How to measure QoE: A quick review
Subjective evaluation procedures
Human studies, not scalable
Costly!
Objective evaluation procedures
Statistical models based on subjective evaluation results
Pros: Computation without human involvement
Cons: (Over‐)simplifications of model parameters
E.g., use a single “loss rate” to capture the packet loss process
E.g., assume every voice/video packet is equally important
Do not consider external effects such as loudness and quality of handsets
9Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Subjective Evaluation Procedures
Single Stimulus Method (SSM)
Single Stimulus Continuous Quality Evaluation (SSCQE)
Double Stimulus Continuous Quality Scale (DSCQS)
Double Stimulus Impairment Scale (DSIS)
Objective Evaluation Methods
Referenced models
speech‐layer model: PESQ (ITU‐T P.862)
Compare original and degraded signals
Unreferenced models (no original signals required)
speech‐layer model: P.VTQ (ITU‐T P.563)
Detect unnatural voices, noise, mute/interruptions in degraded signals
network‐layer model: E‐model (ITU‐T G.107)
Regression model based on delay, loss rate, and 20+ variables
Equations are over‐complex for physical interpretation, e.g.
Is = 20 · {[1 + (Xolr/8)^8]^(1/8) − Xolr/8}, where Xolr = OLR + 0.2·(64 + No − RLR)
11Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Our goals
An objective QoE assessment framework
passive measurement (thus scalable)
easy to construct models (for your own application)
easy to access input parameters
easy to compute in real time
12Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Our contributions
An index for Skype user satisfaction
derived from real‐life Skype call sessions
verified by users’ speech interactivity in calls
accessible and computable in real time
bit rate: data rate of voice packets
jitter: receiving rate jitter (level of network congestion)
RTT: round‐trip times between two parties
USI = 2.15× log(bit rate) − 1.55 × log(jitter)− 0.36× RTT
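Read as code, the index is cheap to compute at run time. A minimal sketch, assuming natural logarithms and RTT in seconds (units the slides leave implicit); these choices reproduce the USI values shown in the multi‐path slide later in the talk:

```python
import math

def usi(bit_rate_kbps: float, jitter_kbps: float, rtt_s: float) -> float:
    """User Satisfaction Index from the fitted coefficients.

    Assumed units (not stated on the slide): bit rate and jitter in Kbps,
    RTT in seconds, natural logarithms.
    """
    return (2.15 * math.log(bit_rate_kbps)
            - 1.55 * math.log(jitter_kbps)
            - 0.36 * rtt_s)

# Reproduces the multi-path slide: 3.84, 6.33, 5.43
print(round(usi(10, 2, 0.1), 2), round(usi(20, 1, 0.3), 2), round(usi(30, 3, 0.5), 2))
```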
13Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Talk outline
The Question
Measurement
Modeling
Validation
Significance
14Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Setting things up
[Diagram: a traffic monitor attached via port mirroring to an L3 switch uplink; a dedicated Skype node carries relayed traffic]
15Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Capturing Skype traffic
1. Identify Skype hosts and ports
Track hosts sending HTTP requests to “ui.skype.com”
Track their ports sending UDP within 10 seconds → (host, port)
Also track other parties that communicate with discovered (host, port) pairs
2. Record packets
Whose source or destination ∈ these (host, port)
Reduce the # of traced packets to 1‐2%
16Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Extracting Skype calls
1. Take the sessions with:
Average packet rate within (10, 100) pkt/sec
Average packet size within (30, 300) bytes
Duration longer than 10 seconds
2. Merge two sessions into one relayed session if:
The two sessions share a common relay node
Their start and finish times are within 30 seconds of each other
Their packet rate series are correlated (a filter sketch follows below)
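These session‐level heuristics map directly onto a simple filter. A minimal sketch, with Flow as a hypothetical per‐session record type (the slides do not prescribe any data structure):

```python
from dataclasses import dataclass

@dataclass
class Flow:
    """Hypothetical per-session aggregate; field names are illustrative."""
    duration_s: float   # session length in seconds
    packets: int        # total packets observed
    total_bytes: int    # total bytes observed

def looks_like_skype_call(f: Flow) -> bool:
    """Slide heuristics: 10-100 pkt/s average rate, 30-300 byte average
    packet size, and a duration above 10 seconds."""
    if f.duration_s <= 10 or f.packets == 0:
        return False
    rate = f.packets / f.duration_s
    size = f.total_bytes / f.packets
    return 10 < rate < 100 and 30 < size < 300
```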
17Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Probing RTTs
As we take traces
Send ICMP ping, application‐level ping & traceroute
Exponential intervals
18Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Trace Summary
[Diagram: campus network uplink monitored via port mirroring at an L3 switch; a dedicated Skype node and traffic monitor; direct sessions and relayed sessions traverse the Internet]
Category  Calls  Hosts  Avg. Time
Direct    253    240    29 min
Relayed   209    369    18 min
Total     462    570    24 min
19Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Talk outline
The Question
Measurement
Modeling
Validation
Significance
20Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The intuition behind our analysis
The conversation quality (i.e., QoE) perceived by call
parties is more or less related to the call duration
The network conditions of a VoIP call are independent of
importance of talk content
call parties’ schedule
call parties’ talkativeness
other incentives to talk (e.g., free of charge)
21Kuan‐Ta Chen / Quantifying Skype User Satisfaction
First, getting a better sense
[Diagram: are QoS factors (network quality: jitter, RTT; service level: source rate, TCP/UDP?, relayed?) correlated with call duration (QoE)?]
22Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Is call duration related to each factor?
For each factor:
Scatter-plot the factor against call duration
See whether they are positively, negatively, or not correlated
Hypothesis tests
Confirm whether they are indeed positively, negatively, or not correlated
23Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Call duration vs. jitter
[Scatter plot: average call duration (min) vs. jitter in Kbps (std. dev. of received bytes/sec), with the average and its 95% confidence band]
There are short calls with low jitter
The average shows a negative correlation between the two variables
24Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Effect of Jitter – Hypothesis Testing
The probability distribution of hanging up a call
Null hypothesis: all the survival curves are equivalent
Log‐rank test: P < 1e‐20
We have > 99.999% confidence in claiming that jitter is correlated with call duration
25Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Effect of Source Rate (the bandwidth Skype intended to use)
[Figure: average session time (min) vs. source rate]
26Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The better sense
[Diagram: observed correlations with call duration (QoE): positive for source rate; negative for jitter and RTT; none (non‐significant) for the others]
27Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Linear regression?
No!
Reasons:
Assumptions no longer hold: errors are not independent and not normally distributed
Variance of errors is not constant
Censorship: there are calls that had already been going on for a while, and calls that had not yet finished by the time we terminated tracing
We can’t simply discard these calls; otherwise we end up with a biased set of calls with limited call duration
28Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Cox regression modeling
The Cox regression model provides a good fit
Originally used to model the effect of treatment on patients’ survival time
The log‐hazard function is proportional to the weighted sum of factors
Hazard function (conditional failure rate): the instantaneous rate at which failures occur for observations that have survived to time t
h(t) = lim_{∆t→0} Pr[t ≤ T < t + ∆t | T ≥ t] / ∆t
Z: factors (bit rate = x, jitter = y, RTT = z, …)
β: weights of the factors
log h(t|Z) ∝ βᵀZ
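As a concrete (hedged) illustration, this kind of model can be fitted with the lifelines library; the slides do not say what software was used, and the toy call records and column names below are invented. Censored calls stay in the data with observed = 0 instead of being discarded:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Toy call records: duration (min), hang-up observed (0 = censored while
# tracing), and the log-scaled QoS factors. All values are made up.
calls = pd.DataFrame({
    "duration":     [29, 4, 12, 41, 7, 18, 33, 2, 25, 9],
    "observed":     [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "log_bit_rate": np.log([20, 12, 25, 30, 14, 22, 28, 11, 24, 16]),
    "log_jitter":   np.log([1.2, 4.0, 0.8, 0.5, 3.1, 1.0, 0.6, 4.5, 0.9, 2.4]),
    "rtt":          [0.10, 0.45, 0.08, 0.20, 0.40, 0.12, 0.09, 0.50, 0.15, 0.35],
})

# A small ridge penalty keeps the toy fit numerically stable.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(calls, duration_col="duration", event_col="observed")
cph.print_summary()   # the fitted beta in log h(t|Z) = log h0(t) + beta'Z
```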
29Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Functional Form Checks
The assumption h(t|Z) ∝ exp(βᵀZ) must hold
Explore the “true” functional forms of factors by generalized additive models
Bit rate and jitter → log scale
Human beings are known to be sensitive to the scale of a physical quantity rather than its magnitude:
• Scale of sound (decibels vs. intensity)
• Musical staff for notes (distance vs. frequency)
• Star magnitudes (magnitude vs. brightness)
30Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The Logarithm Fits Better (Bit rate)
After taking logarithm …
31Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The Logarithm Fits Better (Jitter)
After taking logarithm …
32Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Final model & interpretation
variable       coef   std. err.  signif.
log(bit rate)  −2.15  0.13       < 1e−20
log(jitter)    1.55   0.09       < 1e−20
RTT            0.36   0.18       4.3e−02
Interpretation
A: bit rate = 20 Kbps
B: bit rate = 15 Kbps, other factors same as A
The hazard ratio between A and B can be computed by exp((log(15) − log(20)) × −2.15) ≈ 1.86
The probability that B will hang up is 1.86 times the probability that A will do so at any instant.
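A two‐line sanity check of that arithmetic (plain Python, nothing Skype‐specific):

```python
import math
# Hazard ratio of B (15 Kbps) to A (20 Kbps) under beta = -2.15:
print(math.exp((math.log(15) - math.log(20)) * -2.15))  # ~1.86
```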
33Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Hang‐up rate and USI
Hang‐up rate = the fitted hazard of hanging up, h(t|Z)
User Satisfaction Index (USI) = −log(hang‐up rate)
= 2.15 × log(bit rate) − 1.55 × log(jitter) − 0.36 × RTT
34Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Actual and Predicted Time vs. USI
[Figure: average session time (min) vs. USI, actual and model‐predicted]
35Kuan‐Ta Chen / Quantifying Skype User Satisfaction
The multi‐path scenario
path  avail. bandwidth  jitter  RTT     USI
1     10 Kbps           2 Kbps  100 ms  3.84
2     20 Kbps           1 Kbps  300 ms  6.33
3     30 Kbps           3 Kbps  500 ms  5.43
Internet
BUT, is call hang‐up rate a good indication of user satisfaction?
36Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Talk outline
The Question
Measurement
Modeling
Validation
Significance
37Kuan‐Ta Chen / Quantifying Skype User Satisfaction
User satisfaction: Validation
Call duration
Intuition: call duration ↔ satisfaction; not confirmed yet
38Kuan‐Ta Chen / Quantifying Skype User Satisfaction
User satisfaction: One step further
Speech interactivity ↔ call duration? Now we’re going to check!
Intuition: interactive and tight speech activities indicate a cheerful conversation
39Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Identifying talk bursts
The problem
Every voice packet is encrypted with 256‐bit AES
(Advanced Encryption Standard)
Possible solutions
packet rate: no silence suppression in Skype
packet size: our choice
40Kuan‐Ta Chen / Quantifying Skype User Satisfaction
What we need to achieve
Input: a time series of packet sizes
Output: estimated ON/OFF periods (ON = talk, OFF = silence)
[Figure: a packet‐size series over time with estimated ON/OFF periods marked]
41Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Speech activity detection
1. Wavelet de‐noising
Removing high‐frequency fluctuations
2. Detect peaks and dips
3. Dynamic thresholding
Deciding the beginning/end of a talk burst
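A rough sketch of these three steps with PyWavelets and SciPy; the wavelet family, decomposition level, and the midpoint thresholding rule are illustrative choices of mine, not the paper’s exact algorithm:

```python
import numpy as np
import pywt
from scipy.signal import find_peaks

def detect_talk_bursts(pkt_sizes: np.ndarray) -> np.ndarray:
    """Estimate ON/OFF (talk/silence) periods from a packet-size series."""
    # 1. Wavelet de-noising: zero the detail coefficients to remove
    #    high-frequency fluctuations (db4 at level 4 is an assumption).
    coeffs = pywt.wavedec(pkt_sizes, "db4", level=4)
    coeffs[1:] = [np.zeros_like(c) for c in coeffs[1:]]
    smooth = pywt.waverec(coeffs, "db4")[: len(pkt_sizes)]

    # 2. Detect peaks (talk) and dips (silence) in the smoothed series.
    peaks, _ = find_peaks(smooth)
    dips, _ = find_peaks(-smooth)
    peak_lvl = np.median(smooth[peaks]) if len(peaks) else smooth.max()
    dip_lvl = np.median(smooth[dips]) if len(dips) else smooth.min()

    # 3. Dynamic thresholding: ON where the series sits closer to the
    #    peak level than to the dip level.
    return smooth > (peak_lvl + dip_lvl) / 2.0
```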
42Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Speech detection algorithm: Validation
The speech detection algorithm is validated with:
synthesized sine waves (500 Hz – 2000 Hz)
real speech recordings
relay node (chosen by Skype): average RTT 350 ms, jitter 5.1 Kbps
This forces the packet size processes to be contaminated by serious network impairment (delay and loss)
play sound → capture packet size processes
43Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Validation with synthesized sine waves
3 times for each of 10 test cases
correctness (ratio of matched 0.1‐second periods): 0.73 – 0.92
[Figure: true vs. estimated ON periods]
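The correctness figure can be read as bin‐wise agreement; a small sketch, assuming both series are boolean arrays sampled every 0.1 second:

```python
import numpy as np

def correctness(true_on: np.ndarray, est_on: np.ndarray) -> float:
    """Ratio of matched 0.1-second periods (my reading of the metric)."""
    n = min(len(true_on), len(est_on))
    return float(np.mean(true_on[:n] == est_on[:n]))
```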
44Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Validation with speech recordings
3 times for each of 3 test cases
correctness (ratio of matched 0.1‐second periods): 0.71 – 0.85
[Figure: true vs. estimated ON periods]
45Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Speech interactivity analysis
Responsiveness: whether the other party responds
Response delay: how long before the other party responds
Burst length: how long a speech burst lasts
46Kuan‐Ta Chen / Quantifying Skype User Satisfaction
USI vs. Speech interactivity
All are statistically significant (at 0.01 significance level)
Speech interactivity in conversations supports the proposed USI:
higher USI → higher responsiveness
higher USI → shorter response delay
higher USI → shorter burst length
47Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Talk outline
The Question
Measurement
Modeling
Validation
Significance
48Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Implications
We should pay more attention to delay jitter (rather than focusing on network delay only), and to the encoding bit rate!
49Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Significance
QoE‐aware systems that can optimize user experience at run time
Is it worth sacrificing 20 ms of latency to reduce 10 ms of jitter (say, with a de‐jitter buffer)?
Pick the most appropriate parameters at run time:
playout scheduling (buffer time)
coding scheme (& rate)
source rate
data path (overlay routing)
transmission scheme (redundancy, erasure coding, …)
50Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Future work (1)
Measurement
larger data sets (p2p traffic is hard to collect)
diverse locations
Validation
user studies
comparison with existing models (PESQ, etc)
51Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Future work (2)
Beyond “call duration”
Who hangs up a call?
Call disconnect‐n‐connect behavior
More sophisticated modeling
Voice codec
Pricing effect
Time‐of‐day effect
Time‐dependent impact on call behavior?
52Kuan‐Ta Chen / Quantifying Skype User Satisfaction
How Aware are Gamers of Service Quality?
Real‐time interactive online games are generally considered QoS‐sensitive
Gamers are always complaining about high
“ping‐times” or network lags
Online gaming is increasingly popular despite the best‐effort
Internet
Q1: Are game players really sensitive to network quality as they claim?
Q2: If so, how do they react to poor network quality?
Appeared on IEEE INFOCOM 2006
53Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Case Study: ShenZhou Online
54Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Traffic Trace Collection
[Diagram: trace collection on a Gigabit Ethernet link]
trace  conn.   # packets (in/out/both)  bytes (in/out/both)
N1     57,945  342M / 353M / 695M       4.7TB / 27.3TB / 32.0TB
N2     54,424  325M / 336M / 661M       4.7TB / 21.7TB / 26.5TB
55Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Delay Jitter vs. Session Time (std. dev. of the round-trip times)
56Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Hypothesis Testing – Effect of Loss Rate
Null Hypothesis: All the survival curves are equivalent
Log‐rank test: P < 1e‐20
We have > 99.999% confidence in claiming that loss rates are correlated with game playing times
[Figure: CCDF of game session times under low, medium, and high loss]
57Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Regression Modeling
Linear regression is not adequate
Violating the assumptions (normal errors, equal variance, …)
The Cox regression model provides a good fit
Log‐hazard function is proportional to the weighted sum of factors
Hazard function (conditional failure rate): the instantaneous rate of quitting the game for a player (session)
h(t) = lim_{∆t→0} Pr[t ≤ T < t + ∆t | T ≥ t] / ∆t
where each session has factors Z (RTT = x, jitter = y, …)
log h(t|Z) ∝ βᵀZ (our aim is to compute β)
58Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Final Model & Interpretation
Interpretation
A: RTT = 200 ms; B: RTT = 100 ms, other factors same as A
Hazard ratio between A and B: exp((log(0.2) − log(0.1)) × 1.27) ≈ 2.4
A is 2.4 times as likely as B to leave the game at any moment
Variable     Coef  Std. Err.  Signif.
log(RTT)     1.27  0.04       < 1e−20
log(jitter)  0.68  0.03       < 1e−20
log(closs)   0.12  0.01       < 1e−20
log(sloss)   0.09  0.01       7e−13
59Kuan‐Ta Chen / Quantifying Skype User Satisfaction
How good does the model fit?
60Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Relative Influence of QoS Factors
Latency = 20%; Client packet loss = 20%
Delay jitter = 45%; Server packet loss = 15%
61Kuan‐Ta Chen / Quantifying Skype User Satisfaction
An Index for ShenZhou Online
Features:
derived from real‐life game sessions
accessible and computable in real time
implication: delay jitter is more intolerable than delay
RTT: round‐trip times
jitter: level of network congestion
closs: loss rate of client packets
sloss: loss rate of server packets
log(departure rate) ∝ 1.27× log(RTT) + 0.68× log(jitter) +0.12× log(closs) + 0.09× log(sloss)
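A sketch of using the index to compare two candidate network settings; RTT in seconds follows the earlier interpretation slide, and the jitter/loss values below are made up:

```python
import math

def log_departure_rate(rtt_s, jitter, closs, sloss):
    """Relative log departure rate from the fitted model (lower is better)."""
    return (1.27 * math.log(rtt_s) + 0.68 * math.log(jitter)
            + 0.12 * math.log(closs) + 0.09 * math.log(sloss))

a = log_departure_rate(0.2, 0.05, 0.01, 0.01)    # setting A
b = log_departure_rate(0.1, 0.08, 0.02, 0.01)    # setting B
print("hazard ratio A vs. B:", math.exp(a - b))  # > 1: A drives more quits
```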
62Kuan‐Ta Chen / Quantifying Skype User Satisfaction
App #1: Evaluation of Alternative Designs
Suppose now we have two designs (e.g., protocols)
One leads to lower delay but high jitter:
100 ms, 120 ms, 100 ms, 120 ms, 100 ms, 120 ms, 100 ms, 120 ms, …
One leads to higher delay but lower jitter:
150 ms, 150 ms, 150 ms, 150 ms, 150 ms, 150 ms, 150 ms, 150 ms, …
Which design shall we choose?
[Figure: network latency over time for the two designs]
63Kuan‐Ta Chen / Quantifying Skype User Satisfaction
App #2: Overlay Path Selection
Internet
path  delay       jitter     loss rate  score
1     100 ms (G)  50 ms (P)  5% (P)     3.84
2     150 ms (A)  20 ms (G)  1% (A)     6.33
3     200 ms (P)  30 ms (A)  1% (A)     5.43
(G = good, A = acceptable, P = poor)
64Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Other Applications
Deciding the smoothing buffer: is it worth sacrificing 20 ms of latency to reduce 10 ms of jitter?
Maintaining fairness: allocate more resources to players experiencing poor QoS
65Kuan‐Ta Chen / Quantifying Skype User Satisfaction
Player Departure Behavior Analysis
Player departure rate decreases over time
The golden time is the first 10 minutes: the longer gamers play, the more external factors affect their decision to stay or leave
→ allocate more resources to players who have just entered
[Figure: estimated hazard function vs. game session time (hour), weekday vs. weekend]
[Figure: Nagelkerke R² vs. observation time (min) for 5%, 10%, and 20% exit; an intermediate observation time is most predictable]
IEEE INFOCOM 2009
OneClick: A Framework for Measuring Network Quality of Experience
Kuan‐Ta Chen, Cheng‐Chu Tu, Wei‐Cheng Xiao
Institute of Information Science, Academia Sinica
Appeared on IEEE INFOCOM 2009
IEEE INFOCOM 2009
QoS and QoE
QoS (Quality of Service): the quality level of a system performance metric
Communication networks: delay, loss rate
DBMS: query completion time
QoE (Quality of Experience): the quality of how users “feel” about a service
Subjective: Mean Opinion Score (MOS)
Objective: PSNR (picture), PESQ (voice), VQM (video)
IEEE INFOCOM 2009
Relationship between QoS and QoE
[Figure: QoE vs. QoS (e.g., network bandwidth): too bad to perceive at the low end, then a comfort range, then small marginal benefit]
OneClick: A Framework for Measuring Network Quality of Experience 4
Knowing the Relationship is Important!
So we know
How to adapt voice/video/game data rate (QoS) for user
satisfaction (QoE)
So we really know
How to send multimedia data over the Internet
OneClick: A Framework for Measuring Network Quality of Experience 5
Measuring QoS and QoE
QoS (A great body of work)
Measure network loss, delay, available bandwidth
Infer topology
Estimate network capacity
etc
QoE (Some work)
Objective: PSNR (picture), PESQ (voice), VQM (video)
Subjective: MOS (general)
Still not quite the human experience, which is multi-dimensional
What’s left!
IEEE INFOCOM 2009
MOS (Mean Opinion Score)
1. Slow in scoring (think/interpretation time)
2. People are limited by finite memory
3. Cannot capture users’ perceptions over time
4. MOS is coarse in scale granularity
5. Dissimilar interpretations of the scale among users
Problems
IEEE INFOCOM 2009
Our Ambition
Identify a simple and yet efficient way
to measure users’ satisfaction
OneClick: A Framework for Measuring Network Quality of Experience 8
The Idea: Click, Click, Click
Web surfing
Click on a link
You wait, and you refresh the link
You wait, and you refresh the link again, and again, and …
Knocking at someone’s door
Knock on the door
You wait, and you knock on the door again
You wait, and you knock on the door again and again, and …
OneClick: A Framework for Measuring Network Quality of Experience 9
Introducing OneClick
Simple instruction to users:
Click when you feel dissatisfied
Click multiple times when you feel even less satisfied
Clicking rate as the QoE
[Diagram: application quality varies over time; user click events (feedback) cluster when quality drops, reflecting user satisfaction]
Nice Things about OneClick
Natural: we are already doing it to show loss of patience all the time
Bad‐memory proof: real‐time decisions, no need to “remember” past experience
Time‐aware: captures users’ responses at the time of the problems; useful to study recency, memory access, and habituation effects
OneClick: A Framework for Measuring Network Quality of Experience 11
Easy to Implement
As a plug‐in to your network applications
Flash version done!
Co‐measurement of QoS and QoE
[Diagram: application quality and user click events recorded together over time]
OneClick: A Framework for Measuring Network Quality of Experience 12
Talk Progress
Overview
Methodology
Pilot Study
Validation
Case Studies
Conclusion
IEEE INFOCOM 2009
Human as a QoE Rating System
[Diagram: vary the network setting → it affects application QoS and QoE → the user’s click events reflect QoE; observe the click events]
IEEE INFOCOM 2009
QoE vs. QoS Modeling
Click events as a counting process → Poisson regression
C(t): QoE, the clicking rate at time t
N1(t), N2(t), …: QoS, the network conditions at time t
αi: regression coefficients, derived by the maximum likelihood method
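A minimal sketch of this Poisson regression with statsmodels on synthetic data; the QoS series, coefficient values, and variable names are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
loss = rng.uniform(0.0, 0.3, 300)            # N1(t): loss rate per second
bw = rng.uniform(10.0, 100.0, 300)           # N2(t): bandwidth per second
lam = np.exp(0.5 + 6.0 * loss - 0.02 * bw)   # synthetic "true" click rate
clicks = rng.poisson(lam)                    # C(t): observed clicks

X = sm.add_constant(np.column_stack([loss, bw]))
fit = sm.GLM(clicks, X, family=sm.families.Poisson()).fit()
print(fit.params)     # alpha_i, estimated by maximum likelihood
print(fit.deviance)   # residual deviance, used below as goodness of fit
```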
IEEE INFOCOM 2009
Wait a Minute…
Response delays? Users may not be able to click immediately after they are aware of the degraded quality
Is the clicking rate of a user consistent? Does a subject give similar ratings in repeated experiments?
Is the clicking rate consistent across users? Different subjects may have different preferences in click decisions.
IEEE INFOCOM 2009
Pilot Study
A 5‐minute English song
Audio quality of AIM Messenger with various network settings
IEEE INFOCOM 2009
Test Material Compilation
For each network setting:
Play the song
Record the song
K settings → K recordings
A random test material = non‐overlapping segments from K different recordings
IEEE INFOCOM 2009
Response Delays
Try Poisson regression of C(t+x) on N1(t), N2(t), …
Vary x
Show the goodness of fit for each x
OneClick: A Framework for Measuring Network Quality of Experience 20
Our Solution
Shift the click event process by time d
d is decided by model fitting:
Let d be the x with the best goodness of fit, i.e., the x whose residual deviance is the minimum
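A sketch of that selection rule, reusing the statsmodels fit above and assuming clicks and qos are aligned, evenly sampled arrays:

```python
import numpy as np
import statsmodels.api as sm

def best_shift(clicks: np.ndarray, qos: np.ndarray, max_shift: int = 10) -> int:
    """Return the response delay d (in samples) whose shifted regression
    C(t+d) ~ QoS(t) yields the minimum residual deviance."""
    best_d, best_dev = 0, np.inf
    for d in range(max_shift + 1):
        y = clicks[d:]                              # C(t+d)
        X = sm.add_constant(qos[: len(qos) - d])    # QoS(t)
        dev = sm.GLM(y, X, family=sm.families.Poisson()).fit().deviance
        if dev < best_dev:
            best_d, best_dev = d, dev
    return best_d
```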
Calibration and Normalization Added
[Diagram: OneClick measurement pipeline: per‐user response delay calibration, then regression modeling with normalization across users (User #1, User #2, …)]
OneClick: A Framework for Measuring Network Quality of Experience 24
Talk Progress
Overview
Methodology
Pilot Study
Validation
Case Studies
Conclusion
OneClick: A Framework for Measuring Network Quality of Experience 25
Exact problem we are trying to solve
Rationale:
Direct: get people to do OneClick and MOS → compare click rate with MOS
Indirect: get people to do OneClick and PESQ/VQM → compare click rate with PESQ/VQM
IEEE INFOCOM 2009
PESQ‐based Validation
PESQ: Perceptual Evaluation of Speech Quality
OneClick vs. PESQ to evaluate the audio quality of three VoIP applications
AIM
MSN Messenger
Skype
Network factors:
Loss rates (0% – 30%)
Bandwidth (10 Kbps – 100 Kbps)
[Validation]
IEEE INFOCOM 2009
VQM‐based Validation
VQM: Video Quality Measurement
OneClick vs. VQM to evaluate video quality of two video codecs
H.264
WMV9 (Windows Media Video)
Factors:
Compression bit rate (200 Kbps – 1000 Kbps)
[Validation]
OneClick: A Framework for Measuring Network Quality of Experience 30
Talk Progress
Overview
Methodology
Pilot Study
Validation
Case Studies
Conclusion
IEEE INFOCOM 2009
Case Studies
Evaluation of applications’ QoE
VoIP applications:
AIM
MSN Messenger
Skype
First‐person shooter games:
Halo
Unreal Tournament
IEEE INFOCOM 2009
Varying Bandwidth
MSN Messenger is generally the worst
Skype is the best if bw < 80 Kbps, otherwise AIM is the best
[Case Study]
Contour Lines of Click Rates
Slope of a contour line = the application’s relative sensitivity to loss vs. bandwidth shortage
AIM is relatively more sensitive to network losses
[Case Study]
Comfort Region
Comfort region: a set of network configurations that leads to satisfactory QoE
Skype is the best in bandwidth‐restricted scenarios (< 60 Kbps) when the loss rate is < 10%
[Case Study]
OneClick: A Framework for Measuring Network Quality of Experience 35
Talk Progress
Overview
Methodology
Pilot Study
Validation
Case Studies
Conclusion
IEEE INFOCOM 2009
Nice about OneClick
Natural & fast: we are already doing it to show loss of patience all the time
Bad‐memory proof: no need to “remember” past experience
Time‐aware: captures users’ responses at the time of the problems
Fine‐grained: the score can be 0.2, 3.5, or even 12.345
Normalized user interpretation: different interpretations are normalized
Easy to implement: http://mmnet.iis.sinica.edu.tw/proj/oneclick/
OneClick: A Framework for Measuring Network Quality of Experience 38
On‐Going Work
Large‐scale experiments (by crowdsourcing)
http://mmnet.iis.sinica.edu.tw/proj/oneclick/
Click rate vs. MOS
QoE‐centric multimedia networking
As an example, Tuning the Redundancy Control Algorithm of
Skype for User Satisfaction, IEEE INFOCOM 2009.
ACM Multimedia 2009
A Crowdsourceable QoE Evaluation Framework for Multimedia Content

Kuan‐Ta Chen (Academia Sinica)
Chen‐Chi Wu (National Taiwan University)
Yu‐Chun Chang (National Taiwan University)
Chin‐Laung Lei (National Taiwan University)
Appeared on ACM Multimedia 2009
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 2
What is QoE?
Quality of Experience =
Users’ satisfaction with a service
(e.g., multimedia content)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 3
Quality of Experience
Poor (underexposed)
Good (exposure OK)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 4
Challenges
How to quantify the QoE of multimedia content efficiently and reliably?
[Figure: several versions of the same content, each with unknown quality Q = ?]
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 5
Mean Opinion Score (MOS)
Idea: Single Stimulus Method (SSM) + Absolute Category Rating (ACR)
[Diagram: the user votes one of Excellent / Good / Fair / Poor / Bad, e.g., “Fair”]
MOS  Quality    Impairment
5    Excellent  Imperceptible
4    Good       Perceptible but not annoying
3    Fair       Slightly annoying
2    Poor       Annoying
1    Bad        Very annoying
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 6
Drawbacks of MOS‐based Evaluations
ACR‐based:
Concepts of the scales cannot be concretely defined
Dissimilar interpretations of the scale among users
Only an ordinal scale, not an interval scale
Difficult to verify users’ scores
Subjective experiments in the laboratory:
Monetary cost (reward, transportation)
Labor cost (supervision)
Physical space/time/hardware constraints
Solve all these drawbacks → Crowdsourcing + Paired Comparison
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 9
Talk Progress
Overview
MethodologyPaired Comparison
Crowdsourcing Support
Experiment Design
Case Study & EvaluationAcoustic QoE
Optical QoE
Conclusion
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 10
Current Approach: MOS Rating
[Diagram: the user votes one of Excellent / Good / Fair / Poor / Bad]
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 11
Our Proposal: Paired Comparison
[Diagram: the user compares A and B and votes which one is better]
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 12
Properties of Paired Comparison
Generalizable across different content types and applications
Simple comparative judgment: a dichotomous decision is easier than a 5‐category rating
Interval‐scale QoE scores can be inferred
The users’ inputs can be verified
Choice Frequency Matrix
     A  B  C   D
A    0  9  10  9
B    1  0  7   8
C    0  3  0   6
D    1  2  4   0
10 experiments, each containing C(4,2) = 6 paired comparisons
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 14
Inference of QoE Scores
Bradley‐Terry‐Luce (BTL) model
Input: choice frequency matrix
Output: an interval‐scale score for each content (based on maximum likelihood estimation)
Pij = πi / (πi + πj) = e^{u(Ti)} / (e^{u(Ti)} + e^{u(Tj)}) = 1 / (1 + e^{u(Tj) − u(Ti)})
n content items: T1, …, Tn
Pij: the probability of choosing Ti over Tj
u(Ti): the estimated QoE score of the quality level Ti
Basic idea: P12 = P23 ⟹ u(T1) − u(T2) = u(T2) − u(T3)
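A compact maximum‐likelihood sketch of the BTL fit with SciPy, using the choice frequency matrix from the earlier slide; u(T1) is anchored at 0 because only score differences are identified:

```python
import numpy as np
from scipy.optimize import minimize

# C[i, j]: how often content i was preferred over content j.
C = np.array([[0, 9, 10, 9],
              [1, 0,  7, 8],
              [0, 3,  0, 6],
              [1, 2,  4, 0]], dtype=float)

def neg_log_lik(free_u):
    u = np.concatenate([[0.0], free_u])  # anchor u(T1) = 0
    diff = u[:, None] - u[None, :]       # u(Ti) - u(Tj)
    p = 1.0 / (1.0 + np.exp(-diff))      # BTL: P(Ti chosen over Tj)
    return -(C * np.log(p)).sum()

res = minimize(neg_log_lik, np.zeros(len(C) - 1))
u = np.concatenate([[0.0], res.x])
print((u - u.min()) / (u.max() - u.min()))  # scores rescaled to [0, 1]
```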
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 15
Inferred QoE Scores
[Figure: inferred interval‐scale QoE scores, e.g., 0, 0.63, 0.91, 1]
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 16
Talk Progress
Overview
MethodologyPaired Comparison
Crowdsourcing Support
Experiment Design
Case Study & EvaluationAcoustic QoE
Optical QoE
Conclusion
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 17
Crowdsourcing
= Crowd + Outsourcing
“soliciting solutions via open calls to large‐scale communities”
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 18
Image Understanding
Reward: 0.04 USD
main theme? key objects? unique attributes?
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 19
Linguistic Annotations
Word similarity (Snow et al. 2008)
USD 0.2 for labeling 30 word pairs
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 20
More Examples
Document relevance evaluation: Alonso et al. (2008)
Document rating collection: Kittur et al. (2008)
Noun compound paraphrasing: Nakov (2008)
Person name resolution: Su et al. (2007)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 21
The Risk
Users may give erroneous feedback perfunctorily, carelessly, or dishonestly
Dishonest users have more incentives to perform tasks
Not every Internet user is trustworthy!
Need to have an ONLINE algorithm to detect problematic inputs!
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 22
Verification of Users’ Inputs (1)
Transitivity property: if A > B and B > C, then A should be > C
Transitivity Satisfaction Rate (TSR) =
(# of triples satisfying the transitivity rule) / (# of triples the transitivity rule may apply to)
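A direct sketch of the TSR computation, assuming judgments are stored as a nested boolean mapping prefer[a][b] meaning “a was judged better than b”:

```python
from itertools import permutations

def tsr(prefer: dict) -> float:
    """Transitivity Satisfaction Rate: among ordered triples (a, b, c)
    with a > b and b > c (the applicable triples), the fraction that
    also satisfy a > c. Defined as 1.0 when no triple applies."""
    applicable = satisfied = 0
    for a, b, c in permutations(prefer, 3):
        if prefer[a][b] and prefer[b][c]:
            applicable += 1
            satisfied += bool(prefer[a][c])
    return satisfied / applicable if applicable else 1.0
```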
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 23
Verification of Users’ Inputs (2)
Detect inconsistent judgments from problematic users:
TSR = 1 → perfect consistency
TSR ≥ 0.8 → generally consistent
TSR < 0.8 → judgments are inconsistent
TSR‐based reward / punishment (e.g., only pay a reward if TSR > 0.8)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 24
Experiment Design
For n algorithms (e.g., speech encoding):
1. Pick a source content as the evaluation target
2. Apply the n algorithms to generate n content items with different quality
3. Ask a user to perform C(n,2) paired comparisons
4. Compute TSR after an experiment
Reward a user ONLY if his inputs are self‐consistent (i.e., TSR is higher than a certain threshold)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 25
Concept Flow in Each Round
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 26
Audio QoE Evaluation
Which one is better?
(SPACE key released) (SPACE key pressed)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 27
Video QoE evaluation
Which one is better?
(SPACE key released) (SPACE key pressed)
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 28
Talk Progress
Overview
MethodologyPaired Comparison
Crowdsourcing Support
Experiment Design
Case Study & EvaluationAcoustic QoE
Optical QoE
Conclusion
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 29
Audio QoE Evaluation
MP3 compression level:
Source clips: one fast‐paced and one slow‐paced song
MP3 CBR format with 6 bit rate levels: 32, 48, 64, 80, 96, and 128 Kbps
127 participants and 3,660 paired comparisons
Effect of packet loss rate on VoIP:
Two speech codecs: G.722.1 and G.728
Packet loss rate: 0%, 4%, and 8%
62 participants and 1,545 paired comparisons
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 30
Inferred QoE Scores
MP3 Compression Level VoIP Packet Loss Rate
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 31
Video QoE Evaluation
Video codec:
Source clips: one fast‐paced and one slow‐paced video clip
Three codecs: H.264, WMV3, and XVID
Two bit rates: 400 and 800 Kbps
121 participants and 3,345 paired comparisons
Loss concealment scheme:
Source clips: one fast‐paced and one slow‐paced video clip
Two schemes: frame copy (FC) and FC with frame skip (FCFS)
Packet loss rate: 1%, 5%, and 8%
91 participants and 2,745 paired comparisons
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 32
Inferred QoE Scores
Video Codec Concealment Scheme
A Crowdsourceable QoE Evaluation Framework for Multimedia Content / Kuan‐Ta Chen 33
Participant Source
Laboratory: recruit part‐time workers at an hourly rate of 8 USD
MTurk: post experiments on the Mechanical Turk web site; pay the participant 0.15 USD for each qualified experiment
Community: seek participants on the website of an Internet community with 1.5 million members; pay the participant an amount of virtual currency equivalent to one US cent for each qualified experiment
Participant Source Evaluation
With crowdsourcing…
lower monetary cost
wider participant diversity
maintaining the evaluation results’ quality
Crowdsourcing seems a good strategy for multimedia QoE assessment!
Conclusion
Crowdsourcing is not without limitations:
physical contact
environment control
media
With paired comparison and user input verification:
less monetary cost
wider participant diversity
shorter experiment cycle
evaluation quality maintained
Quality of Experience Management / Sheng‐Wei Chen 17
Future Plan
QoE measurement:
Psychophysical approach
Exploit social gaming to provide the incentives for large‐scale studies
Goals: cross‐application, cross‐modal, content‐dependent, context‐dependent
QoE provisioning:
QoE‐aware communication systems
Parameters auto‐configured at run time:
playout scheduling, coding scheme (& rate), overlay path routing, transmission (redundancy, coding, protocol), etc.
Quality of Experience Management / Sheng‐Wei Chen 18
Acknowledgements
Chin‐Laung Lei Chun‐Ying Huang
Polly Huang
Yu‐Chun Chang Te‐Yuan Huang
William Tu
Hung‐Hsuan Chen
Chen‐Chi Wu
Wei‐Cheng Xiao