Understanding and Predicting User Satisfaction with Intelligent Assistants

Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C. Crook, Imed Zitouni, Tasos Anastasakos

Eindhoven University of Technology, Pennsylvania State University, University of Massachusetts Amherst, Microsoft

Why do we care?

[Figure: Percentage of traffic over time, Desktop vs. Mobile, from 2011-09 to 2015-09. Source: http://gs.statcounter.com]

Understanding User Satisfaction with Intelligent Assistants

User’s dialogue with Cortana: Task is “Finding a hotel in Chicago”

Q1: how is the weather in Chicago
Q2: how is it this weekend
Q3: find me hotels
Q4: which one of these is the cheapest
Q5: which one of these has at least 4 stars
Q6: find me directions from the Chicago airport to number one

User’s dialogue with Cortana: Task is “Finding a pharmacy”

Q1: find me a pharmacy nearby
Q2: which of these is highly rated
Q3: show more information about number 2
Q4: how long will it take me to get there
Q5: Thanks

Research Questions

• RQ1: What are characteristic types of scenarios of use?

Controlling Device

• Call a person

• Send a text message

• Check on-device calendar

• Open an application

• Turn on/off wi-fi

• Play music

[Figure: search result page elements: Knowledge Pane, Image Answer, Location Answer, Organic Results]

Search Dialogue

User: “Do I need to have a jacket tomorrow?”
Cortana: “You could probably go without one. The forecast shows …”

Search Dialogue

User: “show restaurants near me”
Cortana: “Here are ten restaurants near you”
User: “show the best restaurants near me”
Cortana: “Here are ten restaurants near you that have good reviews”
User: “show directions to the second one”
Cortana: “Getting you direction to the Mayuri Indian Cuisine”

Research Questions

• RQ1: What are characteristic types of scenarios of use?

• RQ2: How can we measure different aspects of user satisfaction?

• RQ3: What are key factors determining user satisfaction for the different scenarios?

• RQ4: How to characterize abandonment in the web search scenario?

• RQ5: How does query-level satisfaction relate to overall user satisfaction for the search dialogue scenario?


User Study Participants

• 60 participants
• Age: 25.53 +/- 5.42 years
• Language (English / Other): 55% / 45%
• Gender (Male / Female): 75% / 25%
• Education (Computer Science / Electrical Engineering / Mathematics / Other): 82% / 8% / 2% / 8%


User Study Design

• Video instructions (same for all participants)
• Tasks are realistic, mined from Cortana logs:
o Control type of tasks
o Queries where users don’t click
o Search dialogue tasks, mostly localization type of queries

Find out what is the hair color of your favorite celebrity

You are planning a vacation. Pick a place. Check if the weather is good enough for the period you are planning the vacation. Find a hotel that suits you. Find the driving directions to this place.

Questionnaire: Controlling Device

• Were you able to complete the task?
o Yes/No
• How satisfied are you with your experience in this task?
o 5-point Likert scale
• How well did Cortana recognize what you said?
o 5-point Likert scale
• Did you put in a lot of effort to complete the task?
o 5-point Likert scale

5 tasks, 20 minutes

Questionnaire: Good Abandonment

• Where did you find the answer?
o Answer Box, Image, SERP, Visited Website
• Which query led you to finding the answer?
o First, Second, Third, >= Fourth

5 tasks, 20 minutes

Questionnaire: Search Dialogue

• How satisfied are you with your experience in this task?
o If the task has sub-tasks, participants indicate their graded satisfaction, e.g.
  a. How satisfied are you with your experience in finding a hotel?
  b. How satisfied are you with your experience in finding directions?

8 tasks: 1 simple, 4 with 2 subtasks, 3 with 3 subtasks
30 minutes

Search Dialog Dataset

• 540 tasks that incorporated 2,040 queries, of which 1,969 were unique
• The average query length is 7.07
• The simple task generated 130 queries in total
• Tasks with 2 context switches generated 685 queries
• Tasks with 3 context switches generated 1,355 queries

Factors Determining Satisfaction

RQ3: What are key factors determining user satisfaction for the different scenarios?

Results Over Scenarios

[Figure: mean of satisfaction level and efforts for each scenario: Across Scenarios, Device Control, Web Search, Structured Dialog]

Results ‘Good Abandonment’

RQ4: How to characterize abandonment in the web search scenario?

[Figure: mean of satisfaction level by the query that led to the answer (First Query, Second Query, Third Query, >= Fourth Query) and by where the answer was found (Answer Box, Image, SERP, Visited Website)]

Search Dialogue Satisfaction

RQ5: How does query-level satisfaction relate to overall user satisfaction for the structured search dialogue scenario?

[Figure: the “show restaurants near me” dialogue with Cortana, with a “SAT?” label on each turn and an “Overall SAT?” label on the whole dialogue]

Satisfaction Over Different Tasks

[Figure: number of answers per satisfaction level (1 to 5) for the Weather Task, Mission Task (2 sub-tasks), and Mission Task (3 sub-tasks)]

Combination of scenarios

User’s dialogue with Cortana related to the ‘stomach ache’ problem (slide labels: General Search, Search Dialog)

Q1: what do you have medicine for the stomach ache
Q2: stomach ache medicine over the counter
Q3: show me the nearest pharmacy
Q4: more information on the second one
Q5: do they have a stool softener
Q6: does Fred Meyer have stool softeners

Conclusions (1)

• RQ1: What are characteristic types of scenarios of use?
• We proposed three main types of scenarios
• We designed a series of user studies tailored to the three scenarios
• Effort is a key component of user satisfaction across the different intelligent assistant scenarios

Conclusions (2)

• RQ4: How to characterize abandonment in the web search scenario?
• We concluded that to measure good abandonment we need to investigate the other forms of interaction signals that are not based on clicks or reformulation
• We looked at user satisfaction as ‘a user journey towards an information goal where each step is important,’ and showed the importance of session context

Predicting User Satisfaction with Intelligent Assistants (Good Abandonment Case)

Evaluating User Satisfaction

• We need metrics to evaluate user satisfaction
• Good abandonment [Human et al., 2009]:
o Mobile: 36% of abandoned queries were likely good
o Desktop: 14.3%
• Traditional methods use implicit signals: clicks and dwell time
o These don’t work

Our Main Research Problem

In the absence of clicks, what is the relationship between a user's gestures and satisfaction, and can we use gestures to detect satisfaction and good abandonment?

Research Questions

• RQ1: What SERP elements are the sources of good abandonment in mobile search?

• RQ2: Do a user's gestures provide signals that can be used to detect satisfaction and good abandonment in mobile search?

• RQ3: Which user gestures provide the strongest signals for satisfaction and good abandonment?

[Slide labels: User Study, Crowdsourcing]

Crowdsourcing Procedure

Random sample of abandoned queries from the search logs of a personal digital assistant during one week in June 2015 (no query suggestion)

Example:
Query: Peniston
Previous Query: third eroics

Crowdsourcing Data

• Total number of queries: 3,895
• Judgment agreement (3 judgments per query): 73%
• After filtering: 1,565 SAT and 1,924 DSAT

RQ1: Reasons of Good Abandonment

[Figure: Mean of Satisfaction]

Query and Session Features

Session Features
• Session duration
• Number of queries in session

Query Features
• Index of query within session
• Time to next query
• Query length (number of words)
• Is this query a reformulation
• Was this query reformulated

Click Features
• Click count
• Number of SAT clicks (> 30 sec)
• Number of back-clicks (< 30 sec)
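The feature list above can be sketched in code. This is a minimal illustration only: the `Query` record, its field names, and the definition of session duration as the gap between the first and last query timestamps are assumptions made for the example, not the logging schema used in the study. Reformulation labels are assumed to be supplied upstream.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SAT_DWELL_SEC = 30.0  # threshold from the slide: SAT click > 30 sec, back-click < 30 sec


@dataclass
class Query:
    """Hypothetical log record for one query (field names are assumptions)."""
    text: str
    timestamp: float                      # seconds since the start of the session
    is_reformulation: bool = False        # assumed to be labeled upstream
    click_dwell_times: List[float] = field(default_factory=list)


def query_features(session: List[Query], i: int) -> dict:
    """Compute session, query, and click features for the i-th query."""
    q = session[i]
    nxt: Optional[Query] = session[i + 1] if i + 1 < len(session) else None
    return {
        # Session features (duration approximated by first-to-last query gap)
        "session_duration": session[-1].timestamp - session[0].timestamp,
        "queries_in_session": len(session),
        # Query features
        "query_index": i,
        "time_to_next_query": (nxt.timestamp - q.timestamp) if nxt else None,
        "query_length": len(q.text.split()),
        "is_reformulation": q.is_reformulation,
        "was_reformulated": nxt is not None and nxt.is_reformulation,
        # Click features
        "click_count": len(q.click_dwell_times),
        "sat_clicks": sum(d > SAT_DWELL_SEC for d in q.click_dwell_times),
        "back_clicks": sum(d < SAT_DWELL_SEC for d in q.click_dwell_times),
    }
```

For an abandoned query, `click_dwell_times` is empty, so all three click features are zero and only the session and query features carry information.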

Baseline 1: Click & Dwell

• Click > 30 sec
• No reformulation

B1: Click, Dwell with no Reformulation

Baseline 2: Optimistic

• No click
• No reformulation

B2: Optimistic
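The two rule-based baselines can be read as simple predicates. The sketch below is one interpretation of the slide labels, not a confirmed specification: it assumes B1 predicts SAT when at least one click has a dwell time above 30 seconds and the query was not reformulated, and that B2 predicts SAT for a query with no click and no reformulation. The function and parameter names are hypothetical.

```python
from typing import Sequence


def baseline_click_dwell(click_dwell_times: Sequence[float],
                         was_reformulated: bool,
                         sat_dwell_sec: float = 30.0) -> bool:
    """B1 (assumed reading): SAT if some click dwell exceeds the
    threshold and the query was not followed by a reformulation."""
    has_sat_click = any(d > sat_dwell_sec for d in click_dwell_times)
    return has_sat_click and not was_reformulated


def baseline_optimistic(click_dwell_times: Sequence[float],
                        was_reformulated: bool) -> bool:
    """B2 (assumed reading): an abandoned query (no click) is SAT
    unless it was followed by a reformulation."""
    return len(click_dwell_times) == 0 and not was_reformulated
```

Under this reading, B1 can never label an abandoned query as SAT, which is why a click-free signal is needed for good abandonment.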

Baseline 3: Query-Session Model

B3: Query-Session Model: Training Random Forest (features: Session Features, Query Features, Click Features)

Gesture Features (1)

• Viewport features, swipe-related:
o up swipes and down swipes
o changes in swipe direction
o swiped distance in pixels and average swiped distance
o swipe distance divided by time spent on the SERP
• Time to Focus
o Time to focus on Answer
o Time to focus on Organic Search Results
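The swipe-related features listed above could be computed roughly as follows. This is a sketch under stated assumptions: each swipe is represented by a hypothetical `Swipe` record holding a signed vertical displacement in pixels, and a positive value is taken to mean an up swipe. The actual event format and sign convention are not given on the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Swipe:
    """Hypothetical swipe event; positive delta_y is assumed to be an up swipe."""
    delta_y: float  # signed vertical displacement in pixels


def swipe_features(swipes: List[Swipe], time_on_serp_sec: float) -> dict:
    """Compute the swipe-related viewport features for one SERP visit."""
    up = sum(1 for s in swipes if s.delta_y > 0)
    down = sum(1 for s in swipes if s.delta_y < 0)
    # Sequence of directions, ignoring zero-length swipes
    directions = [1 if s.delta_y > 0 else -1 for s in swipes if s.delta_y != 0]
    direction_changes = sum(1 for a, b in zip(directions, directions[1:]) if a != b)
    total_distance = sum(abs(s.delta_y) for s in swipes)
    avg_distance = total_distance / len(swipes) if swipes else 0.0
    distance_per_sec = total_distance / time_on_serp_sec if time_on_serp_sec > 0 else 0.0
    return {
        "up_swipes": up,
        "down_swipes": down,
        "direction_changes": direction_changes,
        "swiped_distance_px": total_distance,
        "avg_swiped_distance_px": avg_distance,
        "swiped_distance_per_sec": distance_per_sec,
    }
```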

GF (2): Attributed Reading Time

[Figure: an element occupies 33% of the viewport height for 3 seconds, 66% for 6 seconds, and 20% for 2 seconds]

Attributed reading time: 1s + 4s + 0.4s = 5.4s
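The arithmetic on this slide (1s + 4s + 0.4s = 5.4s) is consistent with weighting each viewing interval by the share of the viewport the element occupies. The sketch below assumes that reading; the function name and the `(duration, fraction)` input format are hypothetical, and the slide's 33% and 66% are treated as one third and two thirds.

```python
from typing import Iterable, Tuple


def attributed_reading_time(intervals: Iterable[Tuple[float, float]]) -> float:
    """Sum of duration * viewport share over all viewing intervals.

    Each interval is (duration_sec, fraction_of_viewport), with the
    fraction expressed in the range 0.0 to 1.0.
    """
    total = 0.0
    for duration_sec, fraction in intervals:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("viewport fraction must be between 0 and 1")
        total += duration_sec * fraction
    return total


# Slide example: 3 s at one third, 6 s at two thirds, 2 s at 20% of the viewport
example = attributed_reading_time([(3.0, 1 / 3), (6.0, 2 / 3), (2.0, 0.20)])
```

With these inputs the result is approximately 5.4 seconds, matching the slide.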

GF (3): Attributed Reading Time Per Pixel

Attributed reading time: 5.4s
Pixel area: 400 pixels x 300 pixels
Attributed reading time per pixel: 5.4s / (400 pix x 300 pix) = 0.045 ms/pix^2
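As a check on the slide's numbers, the normalization can be written as a small helper that divides the attributed reading time by the element's pixel area. The function name and parameters are hypothetical; the only assumption is that the time is converted from seconds to milliseconds before dividing.

```python
def reading_time_per_pixel_ms(attributed_time_sec: float,
                              width_px: int,
                              height_px: int) -> float:
    """Attributed reading time divided by element area, in ms per square pixel."""
    area = width_px * height_px
    if area <= 0:
        raise ValueError("element area must be positive")
    return attributed_time_sec * 1000.0 / area


# Slide example: 5.4 s over a 400 x 300 pixel element
per_pixel = reading_time_per_pixel_ms(5.4, 400, 300)
```

Here 5.4 s is 5,400 ms and the area is 120,000 square pixels, so the value is approximately 0.045 ms per square pixel, as on the slide. Normalizing by area makes reading time comparable across elements of different sizes.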

Models: Detecting Good Abandonment

M1: Gesture Model: Training Random Forest based on gesture features

M2: Gesture Model + Query and Session Features: Training Random Forest based on gesture, query and session features

RQ2: Are gestures useful? (1)

On only abandoned user study data: 148 SAT queries and 313 DSAT queries


On crowdsourced data: 1,565 SAT queries and 1,924 DSAT queries


On all user study data: 179 SAT queries and 384 DSAT queries

Gesture features are useful to detect user satisfaction in general!

Conclusions

• RQ1: What SERP elements are the sources of good abandonment in mobile search?
Answer, Images and Snippet are potential sources of good abandonment

• RQ2: Do a user's gestures provide signals that can be used to detect satisfaction and good abandonment in mobile search?
Yes: user gestures provide useful signals to detect good abandonment

• RQ3: Which user gestures provide the strongest signals for satisfaction and good abandonment?
Time spent interacting with Answers is positively correlated with satisfaction. Swipe actions and time spent on the SERP are negatively correlated

Questions?
