User Experiments in Recommender Systems

Tutorial on Conducting User Experiments in Recommender Systems

The slide handouts of my Recsys2012 tutorial


Page 1: Tutorial on Conducting User Experiments in Recommender Systems

User Experiments in Recommender Systems

Page 2: Tutorial on Conducting User Experiments in Recommender Systems
Page 3: Tutorial on Conducting User Experiments in Recommender Systems

Introduction

Welcome everyone!

Page 4: Tutorial on Conducting User Experiments in Recommender Systems

INFORMATION AND COMPUTER SCIENCES

Introduction

Bart Knijnenburg
- Current: UC Irvine, Informatics - PhD candidate
- TU Eindhoven, Human Technology Interaction - Researcher & Teacher; Master Student
- Carnegie Mellon University, Human-Computer Interaction - Master Student

Page 5: Tutorial on Conducting User Experiments in Recommender Systems

Introduction

Bart Knijnenburg
- First user-centric evaluation framework (UMUAI, 2012)
- Founder of UCERSTI, a workshop on user-centric evaluation of recommender systems and their interfaces
- Statistics expert: conference + journal reviewer, research methods teacher, SEM advisor

Page 6: Tutorial on Conducting User Experiments in Recommender Systems


Introduction

Page 7: Tutorial on Conducting User Experiments in Recommender Systems


Introduction

“What is a user experiment?”

“A user experiment is a scientific method to investigate how and why system aspects influence the users’ experience and behavior.”

Page 8: Tutorial on Conducting User Experiments in Recommender Systems

Introduction

My goal: teach how to scientifically evaluate recommender systems using a user-centric approach. How? User experiments!

My approach:- I will provide a broad theoretical framework

- I will cover every step in conducting a user experiment

- I will teach the “statistics of the 21st century”

Page 9: Tutorial on Conducting User Experiments in Recommender Systems

Introduction: Welcome everyone!

Evaluation framework: A theoretical foundation for user-centric evaluation

Hypotheses: What do I want to find out?

Participants: Population and sampling

Testing A vs. B: Experimental manipulations

Measurement: Measuring subjective valuations

Analysis: Statistical evaluation of the results

Page 10: Tutorial on Conducting User Experiments in Recommender Systems
Page 11: Tutorial on Conducting User Experiments in Recommender Systems
Page 12: Tutorial on Conducting User Experiments in Recommender Systems

Evaluation framework: A theoretical foundation for user-centric evaluation

Page 13: Tutorial on Conducting User Experiments in Recommender Systems

Framework

Offline evaluations may not give the same outcome as online evaluations

Cosley et al., 2002; McNee et al., 2002

Solution: Test with real users

Page 14: Tutorial on Conducting User Experiments in Recommender Systems

Framework

[Diagram] System: algorithm -> Interaction: rating

Page 15: Tutorial on Conducting User Experiments in Recommender Systems


Framework

Higher accuracy does not always mean higher satisfaction

McNee et al., 2006

Solution: Consider other behaviors

Page 16: Tutorial on Conducting User Experiments in Recommender Systems

Framework

[Diagram] System: algorithm -> Interaction: rating, consumption, retention

Page 17: Tutorial on Conducting User Experiments in Recommender Systems

Framework

The algorithm accounts for only 5% of the relevance of a recommender system

Francisco Martin - RecSys 2009 keynote

Solution: test those other aspects

Page 18: Tutorial on Conducting User Experiments in Recommender Systems

Framework

[Diagram] System: algorithm, interaction, presentation -> Interaction: rating, consumption, retention

Page 19: Tutorial on Conducting User Experiments in Recommender Systems


Framework

“Testing a recommender against a random videoclip system, the number of clicked clips and total viewing time went down!”

Page 20: Tutorial on Conducting User Experiments in Recommender Systems

Framework

[Path model] personalized recommendations (OSA) -> perceived recommendation quality (SSA) -> perceived system effectiveness (EXP) and choice satisfaction (EXP), linked to the behavioral measures: number of clips clicked, number of clips watched from beginning to end, and total viewing time

Knijnenburg et al.: “Receiving Recommendations and Providing Feedback”, EC-Web 2010

Page 21: Tutorial on Conducting User Experiments in Recommender Systems

Framework

Behavior is hard to interpret: the relationship between behavior and satisfaction is not always trivial.

User experience is a better predictor of long-term retention: with behavior only, you will need to run for a long time.

Questionnaire data is more robust: fewer participants needed.

Page 22: Tutorial on Conducting User Experiments in Recommender Systems

Framework

Measure subjective valuations with questionnaires: perception and experience.

Triangulate these data with behavior: ground subjective valuations in observable actions, and explain observable actions with subjective valuations.

Measure every step in your theory: create a chain of mediating variables.

Page 23: Tutorial on Conducting User Experiments in Recommender Systems

Framework

[Framework diagram] System (algorithm, interaction, presentation) -> Perception (usability, quality, appeal) -> Experience (system, process, outcome) -> Interaction (rating, consumption, retention)

Page 24: Tutorial on Conducting User Experiments in Recommender Systems


Framework

Personal and situational characteristics may have an important impact

Adomavicius et al., 2005; Knijnenburg et al., 2012

Solution: measure those as well

Page 25: Tutorial on Conducting User Experiments in Recommender Systems

Framework

[Framework diagram] System (algorithm, interaction, presentation) -> Perception (usability, quality, appeal) -> Experience (system, process, outcome) -> Interaction (rating, consumption, retention), plus Personal Characteristics (gender, privacy, expertise) and Situational Characteristics (routine, system trust, choice goal)

Page 26: Tutorial on Conducting User Experiments in Recommender Systems

Framework

Objective System Aspects (OSA)

These are manipulations:
- visual / interaction design
- recommender algorithm
- presentation of recommendations
- additional features

Page 27: Tutorial on Conducting User Experiments in Recommender Systems

Framework

User Experience (EXP)

Different aspects may influence different things:
- interface -> system evaluations
- preference elicitation method -> choice process
- algorithm -> quality of the final choice

Page 28: Tutorial on Conducting User Experiments in Recommender Systems

Framework

Subjective System Aspects (SSA)

Link OSA to EXP (mediation):
- Increase the robustness of the effects of OSAs on EXP
- Explain how and why OSAs affect EXP

Page 29: Tutorial on Conducting User Experiments in Recommender Systems

Framework

Personal and Situational Characteristics (PC and SC): the effect of the specific user and task, beyond the influence of the system. Here used for evaluation, not for augmenting algorithms.

Page 30: Tutorial on Conducting User Experiments in Recommender Systems

Framework

Interaction (INT)

Observable behavior: browsing, viewing, log-ins

Final step of evaluation: grounds EXP in “real” data

Page 31: Tutorial on Conducting User Experiments in Recommender Systems

Framework

Page 32: Tutorial on Conducting User Experiments in Recommender Systems

Framework

Has suggestions for measurement scales
e.g. How can we measure something like “satisfaction”?

Provides a good starting point for causal relations
e.g. How and why do certain system aspects influence the user experience?

Useful for integrating existing work
e.g. How do recommendation list length, diversity and presentation each influence the user experience?

Page 33: Tutorial on Conducting User Experiments in Recommender Systems


Framework

“Statisticians, like artists, have the bad habit of falling in love with their models.”

George Box

Page 34: Tutorial on Conducting User Experiments in Recommender Systems

Evaluation framework: Help in measurement and causal relations

- test your recommender system with real users
- use our framework as a starting point for user-centric research
- measure behaviors other than ratings
- test aspects other than the algorithm
- take situational and personal characteristics into account
- measure subjective valuations with questionnaires

Page 35: Tutorial on Conducting User Experiments in Recommender Systems
Page 36: Tutorial on Conducting User Experiments in Recommender Systems
Page 37: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses: What do I want to find out?

Page 38: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

“Can you test if my recommender system is good?”

Page 39: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

What does “good” mean?
- Recommendation accuracy?
- Recommendation quality?
- System usability?
- System satisfaction?

We need to define measures

Page 40: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

“Can you test if the user interface of my recommender system scores high on this usability scale?”

Page 41: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

What does “high” mean? Is 3.6 out of 5 on a 5-point scale “high”? What are 1 and 5? What is the difference between 3.6 and 3.7?

We need to compare the UI against something

Page 42: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

“Can you test if the UI of my recommender system scores high on this usability scale compared to this other system?”

Page 43: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

My new travel recommender vs. Travelocity

Page 44: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

Say we find that it scores higher on usability... why does it?
- different date-picker method
- different layout
- different number of options available

Apply the concept of ceteris paribus to get rid of confounding variables: keep everything the same, except for the thing you want to test (the manipulation). Any difference can then be attributed to the manipulation.

Page 45: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

My new travel recommender vs. the previous version (too many options)

Page 46: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

To learn something from the study, we need a theory behind the effect. For industry, this may suggest further improvements to the system; for research, this makes the work generalizable.

Measure mediating variables: measure understandability (and a number of other concepts) as well, and find out how they mediate the effect on usability.

Page 47: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

An example: we compared three recommender systems with three different algorithms - ceteris paribus!


Page 48: Tutorial on Conducting User Experiments in Recommender Systems


Hypotheses

The mediating variables show the entire story

Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012

Page 49: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

[Path model] Matrix Factorization with explicit feedback (MF-E, versus generally most popular, GMP) and Matrix Factorization with implicit feedback (MF-I, versus GMP) as OSAs -> perceived recommendation variety (SSA) -> perceived recommendation quality (SSA) -> perceived system effectiveness (EXP)

Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012

Page 50: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses

“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.”

John Tukey

Page 51: Tutorial on Conducting User Experiments in Recommender Systems

Hypotheses: What do I want to find out?

- define measures
- compare system aspects against each other
- get rid of confounding variables
- apply the concept of ceteris paribus
- look for a theory behind the found effects
- measure mediating variables to explain the effects

Page 52: Tutorial on Conducting User Experiments in Recommender Systems
Page 53: Tutorial on Conducting User Experiments in Recommender Systems
Page 54: Tutorial on Conducting User Experiments in Recommender Systems

Participants: Population and sampling

Page 55: Tutorial on Conducting User Experiments in Recommender Systems

Participants

“We are testing our recommender system on our colleagues/students.”

-or-

“We posted the study link on Facebook/Twitter.”

Page 56: Tutorial on Conducting User Experiments in Recommender Systems

Participants

Are your connections, colleagues, or students typical users of your system?
- They may have more knowledge of the field of study
- They may feel more excited about the system
- They may know what the experiment is about
- They probably want to please you

You should sample from your target population: an unbiased sample of users of your system.

Page 57: Tutorial on Conducting User Experiments in Recommender Systems

Participants

“We only use data from frequent users.”

Page 58: Tutorial on Conducting User Experiments in Recommender Systems

Participants

What are the consequences of limiting your scope? You run the risk of catering to that subset of users only, and you cannot make generalizable claims about users.

For scientific experiments, the target population may be unrestricted

Especially when your study is more about human nature than about a specific system

Page 59: Tutorial on Conducting User Experiments in Recommender Systems

Participants

“We tested our system with 10 users.”

Page 60: Tutorial on Conducting User Experiments in Recommender Systems

Participants

Is this a decent sample size? Can you attain statistically significant results? Does it provide a wide enough inductive base?

Make sure your sample is large enough: 40 is typically the bare minimum.

Anticipated effect size -> needed sample size:
- small: 385
- medium: 54
- large: 25
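The table's values can be approximated with a standard power calculation. A minimal sketch, assuming a two-sided, two-group mean comparison at α = .05 and 80% power with a normal approximation (exact figures depend on the test family and software, so the function name and numbers here are illustrative):

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-group mean comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # z-value corresponding to the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)

# Cohen's conventions: small d = 0.2, medium d = 0.5, large d = 0.8
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(label, n_per_group(d))
```

The large-effect result (25) matches the slide's table; the small and medium rows differ because the slide's figures presumably come from a different test family.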

Page 61: Tutorial on Conducting User Experiments in Recommender Systems

Participants: Population and sampling

- sample from your target population
- make sure your sample is large enough
- the target population may be unrestricted

Page 62: Tutorial on Conducting User Experiments in Recommender Systems
Page 63: Tutorial on Conducting User Experiments in Recommender Systems
Page 64: Tutorial on Conducting User Experiments in Recommender Systems

Testing A vs. B: Experimental manipulations

Page 65: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

“Are our users more satisfied if our news recommender shows only recent items?”

Page 66: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

Proposed system or treatment: filter out any items > 1 month old

What should be my baseline?
- Filter out items < 1 month old?
- Unfiltered recommendations?
- Filter out items > 3 months old?

You should test against a reasonable alternative. “Absence of evidence is not evidence of absence.”

Page 67: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

“The first 40 participants will get the baseline, the next 40 will get the treatment.”

Page 68: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

These two groups cannot be expected to be similar! Some news item may affect one group but not the other.

Randomize the assignment of conditions to participants. Randomization neutralizes (but doesn’t eliminate) participant variation.
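A minimal sketch of balanced random assignment; the function name, the participant count, and the fixed seed are illustrative, not from the tutorial:

```python
import random

def assign_conditions(participant_ids, conditions, seed=2012):
    """Randomly assign each participant to a condition, keeping group sizes balanced."""
    rng = random.Random(seed)  # fixed seed only so the assignment is reproducible
    ids = list(participant_ids)
    # Repeat the condition list so every condition fills (nearly) the same number
    # of slots, then shuffle to break any link between arrival order and condition.
    slots = (conditions * (len(ids) // len(conditions) + 1))[:len(ids)]
    rng.shuffle(slots)
    return dict(zip(ids, slots))

assignment = assign_conditions(range(80), ["baseline", "treatment"])
```

Shuffling a balanced slot list (rather than flipping a coin per participant) guarantees equal group sizes while still randomizing who ends up where.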

Page 69: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

Between-subjects design: randomly assign half the participants to A, half to B
- Realistic interaction
- Manipulation hidden from user
- Many participants needed

(100 participants: 50 per condition)

Page 70: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

Within-subjects design: give participants A first, then B
- Removes subject variability
- Participant may see the manipulation
- Spill-over effects

(50 participants)

Page 71: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

Within-subjects design: show participants A and B simultaneously
- Removes subject variability
- Participants can compare conditions
- Not a realistic interaction

(50 participants)

Page 72: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

Should I do within-subjects or between-subjects?

Use between-subjects designs for user experience: closer to a real-world usage situation, with no unwanted spill-over effects.

Use within-subjects designs for psychological research: effects are typically smaller, and it is nice to control between-subjects variability.

Page 73: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

You can test multiple manipulations in a factorial design. The more conditions, the more participants you will need!

            | Low diversity | High diversity
5 items     | 5+low         | 5+high
10 items    | 10+low        | 10+high
20 items    | 20+low        | 20+high
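Enumerating the cells of such a design is straightforward; a sketch of the 3x2 layout above (the variable names are illustrative):

```python
from itertools import product

list_lengths = [5, 10, 20]
diversity_levels = ["low", "high"]

# Every combination of factor levels is one cell of the factorial design.
conditions = [
    {"list_length": n, "diversity": d}
    for n, d in product(list_lengths, diversity_levels)
]
print(len(conditions))  # 6 cells; each added factor multiplies the participant need
```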

Page 74: Tutorial on Conducting User Experiments in Recommender Systems

Manipulations

Let’s test an algorithm against random recommendations. What should we tell the participant?

Beware of the placebo effect! Remember: ceteris paribus! Other option: manipulate the message (factorial design).

Page 75: Tutorial on Conducting User Experiments in Recommender Systems


Manipulations

“We were demonstrating our new recommender to a client. They were amazed by how well it predicted their preferences!”

“Later we found out that we forgot to activate the algorithm: the system was giving completely random recommendations.”

(anonymized)

Page 76: Tutorial on Conducting User Experiments in Recommender Systems

Testing A vs. B: Experimental manipulations

- test against a reasonable alternative
- randomize assignment of conditions
- use between-subjects for user experience
- you can test more than two conditions
- use within-subjects for psychological research
- you can test multiple manipulations in a factorial design

Page 77: Tutorial on Conducting User Experiments in Recommender Systems
Page 78: Tutorial on Conducting User Experiments in Recommender Systems
Page 79: Tutorial on Conducting User Experiments in Recommender Systems

Measurement: Measuring subjective valuations

Page 80: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

“To measure satisfaction, we asked users whether they liked the system (on a 5-point rating scale).”

Page 81: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

Does the question mean the same to everyone?
- John likes the system because it is convenient
- Mary likes the system because it is easy to use
- Dave likes it because the recommendations are good

A single question is not enough to establish content validity. We need a multi-item measurement scale.

Page 82: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

Perceived system effectiveness:
- Using the system is annoying

- The system is useful

- Using the system makes me happy

- Overall, I am satisfied with the system

- I would recommend the system to others

- I would quickly abandon using this system

Page 83: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

Use both positively and negatively phrased items:
- They make the questionnaire less “leading”
- They help filter out bad participants
- They explore the “flip-side” of the scale

The word “not” is easily overlooked.
Bad: “The recommendations were not very novel”
Good: “The recommendations felt outdated”
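Before scoring, negatively phrased items must be reverse-coded so that a higher score always means a more positive evaluation. A minimal sketch (the item names are made up for illustration):

```python
def reverse_code(responses, negative_items, scale_min=1, scale_max=5):
    """Flip the scores of negatively phrased items on a Likert scale."""
    return {
        item: (scale_min + scale_max - score) if item in negative_items else score
        for item, score in responses.items()
    }

answers = {"useful": 4, "annoying": 2, "satisfied": 5, "would_abandon": 1}
recoded = reverse_code(answers, negative_items={"annoying", "would_abandon"})
print(recoded)  # annoying: 2 -> 4, would_abandon: 1 -> 5
```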

Page 84: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

Choose simple over specialized words. Participants may have no idea they are using a “recommender system”.

Avoid double-barreled questions.
Bad: “The recommendations were relevant and fun”

Page 85: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

“We asked users ten 5-point scale questions and summed the answers.”

Page 86: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

Is the scale really measuring a single thing?
- 5 items measure satisfaction, the other 5 convenience
- The items are not related enough to make a reliable scale

Are two scales really measuring different things?
- They may be so closely related that they actually measure the same thing

We need to establish convergent and discriminant validity. This makes sure the scales are unidimensional.

Page 87: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

Solution: factor analysis
- Define latent factors, specify how items “load” on them
- Factor analysis will determine how well the items “fit”
- It will give you suggestions for improvement

Benefits of factor analysis:
- Establishes convergent and discriminant validity
- Outcome is a normally distributed measurement scale
- The scale captures the “shared essence” of the items
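As a quick first check before a full factor analysis, a scale's internal consistency can be estimated with Cronbach's alpha. This is a simpler reliability statistic than the confirmatory factor analysis the tutorial recommends, and the data below are invented:

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: one list of responses per item, same participant order in each."""
    k = len(item_scores)
    totals = [sum(row) for row in zip(*item_scores)]  # each participant's sum score
    item_var = sum(variance(scores) for scores in item_scores)
    return k / (k - 1) * (1 - item_var / variance(totals))

# Three hypothetical 5-point items answered by five participants:
items = [[4, 5, 3, 4, 2], [4, 4, 3, 5, 2], [5, 5, 2, 4, 1]]
alpha = cronbach_alpha(items)
```

Values above roughly .7 are conventionally read as acceptable reliability; unlike factor analysis, alpha alone cannot establish discriminant validity.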

Page 88: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

[CFA diagram] Items loading on factors: movie expertise (exp1-exp3), perceived recommendation variety (var1-var6), perceived recommendation quality (qual1-qual4), choice satisfaction (sat1-sat7), choice difficulty (diff1-diff5)

Page 89: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

Problems flagged by the factor analysis: several items with low communality, an item with a high residual correlation with qual2, an item with a high residual correlation with var1, and an item that loads on quality, variety, and satisfaction at once.

Page 90: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

[CFA diagram after removing the problematic items] Factors: movie expertise (exp1-exp3), perceived recommendation variety (var1-var4, var6), perceived recommendation quality (qual1-qual4), choice satisfaction (sat1, sat3-sat6), choice difficulty (diff1, diff2, diff4). Per-factor statistics from the figure: AVE 0.622 (sqrt(AVE) = 0.789, largest corr. 0.491); AVE 0.756 (sqrt(AVE) = 0.870, largest corr. 0.709); AVE 0.435 (!) (sqrt(AVE) = 0.659, largest corr. -0.438); AVE 0.793 (sqrt(AVE) = 0.891, highest corr. 0.234); AVE 0.655 (sqrt(AVE) = 0.809, highest corr. 0.709)
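The rules of thumb behind these AVE figures can be written down directly: convergent validity conventionally asks for AVE >= .5, and the Fornell-Larcker criterion for discriminant validity asks that sqrt(AVE) exceed the factor's largest correlation with any other factor. A sketch using two of the values above (the function name is illustrative):

```python
import math

def validity_checks(ave, largest_corr):
    """Return (convergent_ok, discriminant_ok) for one factor."""
    convergent = ave >= 0.5                            # common AVE rule of thumb
    discriminant = math.sqrt(ave) > abs(largest_corr)  # Fornell-Larcker criterion
    return convergent, discriminant

print(validity_checks(0.622, 0.491))   # passes both checks
print(validity_checks(0.435, -0.438))  # AVE below .5 is flagged, hence the "(!)"
```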

Page 91: Tutorial on Conducting User Experiments in Recommender Systems

Measurement

“Great! Can I learn how to do this myself?”

Check the video tutorials at www.statmodel.com

Page 92: Tutorial on Conducting User Experiments in Recommender Systems

Measurement: Measuring subjective valuations

- establish content validity with multi-item scales
- follow the general principles for good questionnaire items
- establish convergent and discriminant validity
- use factor analysis

Page 93: Tutorial on Conducting User Experiments in Recommender Systems
Page 94: Tutorial on Conducting User Experiments in Recommender Systems
Page 95: Tutorial on Conducting User Experiments in Recommender Systems

Analysis: Statistical evaluation of the results

Page 96: Tutorial on Conducting User Experiments in Recommender Systems

Analysis

Manipulation -> perception: Do these two algorithms lead to a different level of perceived quality? Method: t-test.

[Bar chart: perceived quality for conditions A and B]

Page 97: Tutorial on Conducting User Experiments in Recommender Systems

Analysis

Perception -> experience: Does perceived quality influence system effectiveness? Method: linear regression.

[Scatter plot: system effectiveness vs. recommendation quality]

Page 98: Tutorial on Conducting User Experiments in Recommender Systems

Analysis

Two manipulations -> perception: What is the combined effect of list diversity and list length on perceived recommendation quality? Method: factorial ANOVA.

[Line chart: perceived quality for 5, 10, and 20 items, low vs. high diversification]

Willemsen et al.: “Not just more of the same”, submitted to TiiS

Page 99: Tutorial on Conducting User Experiments in Recommender Systems


Analysis

Only one method: structural equation modeling - the statistical method of the 21st century.

Combines factor analysis and path models:
- Turn items into factors
- Test causal relations

[SEM path diagram] Top-20 vs Top-5 and Lin-20 vs Top-5 recommendations -> perceived recommendation variety -> perceived recommendation quality -> choice satisfaction and choice difficulty, with movie expertise as a covariate. Path coefficients (SE): .455 (.211) p < .05; .181 (.075) p < .05; .503 (.090) p < .001; 1.151 (.161) p < .001; .336 (.089) p < .001; -.417 (.125) p < .005; .205 (.083) p < .05; .879 (.265) p < .001; .612 (.220) p < .01; -.804 (.230) p < .001; .894 (.287) p < .005

Page 100: Tutorial on Conducting User Experiments in Recommender Systems


Analysis

Very simple reporting:
- Report overall fit + the effect of each causal relation
- A path that explains the effects

Page 101: Tutorial on Conducting User Experiments in Recommender Systems

Analysis

Example from Bollen et al.: “Choice Overload”

What is the effect of the number of recommendations? What about the composition of the recommendation list?

Tested with 3 conditions:
- Top 5: recs 1 2 3 4 5
- Top 20: recs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
- Lin 20: recs 1 2 3 4 5 99 199 299 399 499 599 699 799 899 999 1099 1199 1299 1399 1499

Page 102: Tutorial on Conducting User Experiments in Recommender Systems


Analysis

Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010

Page 103: Tutorial on Conducting User Experiments in Recommender Systems

[Path diagram: Top-20 vs Top-5 and Lin-20 vs Top-5 conditions, perceived recommendation variety, perceived recommendation quality, movie expertise, choice satisfaction, and choice difficulty, connected by positive and negative paths.]

Analysis

Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010

Diagram annotations: simple regression; “full mediation”; “inconsistent mediation”; trade-off; additional effect; measured by var1–var6 (not shown here).
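The mediation labels on this slide (total, direct, and indirect effects) can be made concrete with a small simulation. This is a generic product-of-coefficients sketch in plain Python, not the authors' MPlus analysis; the variable names and true effect sizes are made up for illustration:

```python
import random

random.seed(42)

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def slope2(x1, x2, y):
    """OLS slopes of y on two predictors, via closed-form normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((a - m2) ** 2 for a in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (b - my) for a, b in zip(x1, y))
    s2y = sum((a - m2) * (b - my) for a, b in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Simulated "full mediation": condition -> perceived quality -> satisfaction,
# with no direct condition -> satisfaction path (true a = 1.0, b = 0.8, c' = 0).
n = 300
cond = [i % 2 for i in range(n)]                             # manipulation (0/1)
quality = [1.0 * c + random.gauss(0, 0.5) for c in cond]     # mediator
satisf = [0.8 * q + 0.0 * c + random.gauss(0, 0.5)
          for q, c in zip(quality, cond)]

a = slope(cond, quality)                     # condition -> mediator
c_prime, b = slope2(cond, quality, satisf)   # direct effect and mediator effect
total = slope(cond, satisf)                  # total effect of the manipulation

print(f"a={a:.2f}  b={b:.2f}  c'={c_prime:.2f}  "
      f"indirect=a*b={a * b:.2f}  total={total:.2f}")
```

With full mediation the indirect effect a·b accounts for (almost) the whole total effect and the direct effect c′ is near zero; with inconsistent mediation, direct and indirect effects have opposite signs and partially cancel.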

Page 104: Tutorial on Conducting User Experiments in Recommender Systems

[Path diagram repeated from the previous slides, annotated with the coefficient estimates (standard errors) and p-values listed earlier.]

Analysis

Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010

Page 105: Tutorial on Conducting User Experiments in Recommender Systems


Analysis

“Great! Can I learn how to do this myself?”

Check the video tutorials at www.statmodel.com

Page 106: Tutorial on Conducting User Experiments in Recommender Systems

Analysis

Homoscedasticity is a prerequisite for linear stats
- Not true for count data, time, etc.

Outcomes need to be unbounded and continuous
- Not true for binary answers, counts, etc.

[Scatter plot: interaction time (min) against level of commitment.]

Page 107: Tutorial on Conducting User Experiments in Recommender Systems

Analysis

Use the correct methods for non-normal data:
- Binary data: logit/probit regression
- 5- or 7-point scales: ordered logit/probit regression
- Count data: Poisson regression

Don’t use “distribution-free” stats
They were invented when calculations were done by hand

Do use structural equation modeling
MPlus can do all these things “automatically”
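To see what a logit regression does under the hood, here is a minimal pure-Python sketch that fits a logistic model by gradient ascent on the log-likelihood. In practice you would let MPlus, R, or a statistics library do this; the data and true coefficients below are made up:

```python
import math
import random

random.seed(7)

def fit_logit(xs, ys, lr=0.5, steps=2000):
    """Fit P(y=1) = 1/(1+exp(-(b0 + b1*x))) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient w.r.t. intercept
            g1 += (y - p) * x    # gradient w.r.t. slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Synthetic binary outcome (e.g. "did the user follow a recommendation?"),
# generated with true intercept -1 and slope 0.8.
xs = [random.uniform(-3, 3) for _ in range(400)]
ys = [1 if random.random() < 1.0 / (1.0 + math.exp(1 - 0.8 * x)) else 0
      for x in xs]

b0, b1 = fit_logit(xs, ys)
print(f"b0={b0:.2f}  b1={b1:.2f}")   # estimates should land near the true values
```

The same likelihood idea underlies ordered logit/probit (thresholds instead of one intercept) and Poisson regression (a log link instead of the logistic).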

Page 108: Tutorial on Conducting User Experiments in Recommender Systems

Analysis

Standard regression requires uncorrelated errors

Not true for repeated measures, e.g. “rate these 5 items”
There will be a user bias (and maybe an item bias)

Golden rule: data points should be independent

[Scatter plot: number of followed recommendations against level of commitment.]

Page 109: Tutorial on Conducting User Experiments in Recommender Systems

Analysis

Two ways to account for repeated measures:
- Define a random intercept for each user
- Impose an error covariance structure

Again, in MPlus this is easy

[Scatter plot: number of followed recommendations against level of commitment.]
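A quick way to see why user-level dependence matters is to compare a pooled regression with a within-user (demeaned) one on simulated data. This pure-Python sketch uses made-up variable names and effect sizes; it illustrates the user-bias problem the slide describes, not the exact MPlus model:

```python
import random
from collections import defaultdict

random.seed(3)

def slope(xs, ys):
    """Bivariate OLS slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Simulate 50 users rating 5 items each.  Each user has a personal bias that
# also shifts their commitment scores, so observations are NOT independent.
TRUE_SLOPE = 0.5
commitment, outcome, user_of = [], [], []
for u in range(50):
    bias = random.gauss(0, 2)               # user-specific intercept
    for _ in range(5):
        x = bias + random.uniform(0, 5)     # commitment, correlated with bias
        y = bias + TRUE_SLOPE * x + random.gauss(0, 0.5)
        commitment.append(x); outcome.append(y); user_of.append(u)

pooled = slope(commitment, outcome)         # ignores user bias -> inflated

# Within-user estimate: subtract each user's own means first.
per_x, per_y = defaultdict(list), defaultdict(list)
for u, x, y in zip(user_of, commitment, outcome):
    per_x[u].append(x); per_y[u].append(y)
xw, yw = [], []
for u in per_x:
    mx = sum(per_x[u]) / len(per_x[u])
    my = sum(per_y[u]) / len(per_y[u])
    xw += [x - mx for x in per_x[u]]
    yw += [y - my for y in per_y[u]]
within = slope(xw, yw)

print(f"pooled={pooled:.2f}  within-user={within:.2f}  true={TRUE_SLOPE}")
```

The pooled slope absorbs the user bias and overshoots, while the demeaned estimate recovers the true effect; a per-user random intercept accomplishes the same correction in a model-based way.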

Page 110: Tutorial on Conducting User Experiments in Recommender Systems

Analysis

A manipulation can only cause things, never be caused

For all other variables, base causal direction on:
- Common sense
- Psych literature
- Evaluation frameworks

Example: privacy study

[Path diagram: disclosure help (R² = .302), perceived privacy threat (R² = .565), trust in the company (R² = .662), and satisfaction with the system (R² = .674), connected by positive and negative paths.]

Knijnenburg & Kobsa: “Making Decisions about Privacy”

Page 111: Tutorial on Conducting User Experiments in Recommender Systems


Analysis

“All models are wrong, but some are useful.”

George Box

Page 112: Tutorial on Conducting User Experiments in Recommender Systems

Analysis
Statistical evaluation of the results:
- use correct methods for non-normal data
- use correct methods for repeated measures
- use structural equation models
- use manipulations and theory to make inferences about causality

Page 113: Tutorial on Conducting User Experiments in Recommender Systems
Page 114: Tutorial on Conducting User Experiments in Recommender Systems
Page 115: Tutorial on Conducting User Experiments in Recommender Systems

Introduction
User experiments: user-centric evaluation of recommender systems

Evaluation framework
Use our framework as a starting point for user experiments

Hypotheses
Construct a measurable theory behind the expected effect

Participants
Select a large enough sample from your target population

Testing A vs. B
Assign users to different versions of a system aspect, ceteris paribus

Measurement
Use factor analysis and follow the principles for good questionnaires

Analysis
Use structural equation models to test causal models

Page 116: Tutorial on Conducting User Experiments in Recommender Systems
Page 117: Tutorial on Conducting User Experiments in Recommender Systems

“It is the mark of a truly intelligent person to be moved by statistics.”

George Bernard Shaw

Page 118: Tutorial on Conducting User Experiments in Recommender Systems

Resources

User-centric evaluation
Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining the User Experience of Recommender Systems. UMUAI 2012.
Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys 2011.

Choice overload
Willemsen, M.C., Graus, M.P., Knijnenburg, B.P., Bollen, D.: Not Just More of the Same: Preventing Choice Overload in Recommender Systems by Offering Small Diversified Sets. Submitted to TiiS.
Bollen, D.G.F.M., Knijnenburg, B.P., Willemsen, M.C., Graus, M.P.: Understanding Choice Overload in Recommender Systems. RecSys 2010.

Page 119: Tutorial on Conducting User Experiments in Recommender Systems

Resources

Preference elicitation methods
Knijnenburg, B.P., Reijmer, N.J.M., Willemsen, M.C.: Each to His Own: How Different Users Call for Different Interaction Methods in Recommender Systems. RecSys 2011.
Knijnenburg, B.P., Willemsen, M.C.: The Effect of Preference Elicitation Methods on the User Experience of a Recommender System. CHI 2010.
Knijnenburg, B.P., Willemsen, M.C.: Understanding the Effect of Adaptive Preference Elicitation Methods on User Satisfaction of a Recommender System. RecSys 2009.

Social recommenders
Knijnenburg, B.P., Bostandjiev, S., O'Donovan, J., Kobsa, A.: Inspectability and Control in Social Recommender Systems. RecSys 2012.

Page 120: Tutorial on Conducting User Experiments in Recommender Systems

Resources

User feedback and privacy
Knijnenburg, B.P., Kobsa, A.: Making Decisions about Privacy: Information Disclosure in Context-Aware Recommender Systems. Submitted to TiiS.
Knijnenburg, B.P., Willemsen, M.C., Hirtbach, S.: Getting Recommendations and Providing Feedback: The User-Experience of a Recommender System. EC-Web 2010.

Statistics books
Agresti, A.: An Introduction to Categorical Data Analysis. 2nd ed. 2007.
Fitzmaurice, G.M., Laird, N.M., Ware, J.H.: Applied Longitudinal Analysis. 2004.
Kline, R.B.: Principles and Practice of Structural Equation Modeling. 3rd ed. 2010.
Online tutorial videos at www.statmodel.com

Page 121: Tutorial on Conducting User Experiments in Recommender Systems

Resources

Questions? Suggestions? Collaboration proposals? Contact me!

Contact info
E: [email protected]
W: www.usabart.nl
T: @usabart