
Evaluating Recommender Algorithms for Learning using Crowdsourcing

Mojisola Erdt, Christoph Rensing
Prof. Dr.-Ing. Ralf Steinmetz, KOM - Multimedia Communications Lab

ICALT 2014, Athens

© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide. 28-Dec-14

Source: http://www.digitalvisitor.com/cultural-differences-in-online-behaviour-and-customer-reviews/

KOM – Multimedia Communications Lab 2

Motivation

Learning on-the-job
§ To solve a particular problem
§ To learn about a new topic
§ Mostly web resources

Social Tagging Applications
§ Help to manage resources
§ Offer recommendations

TEL Recommender Systems
§ Recommend relevant, novel, and diverse resources for a specific learning goal or activity

KOM – Multimedia Communications Lab 3

Evaluation Methods for TEL Recommender Systems

Evaluation Approach | Advantages | Disadvantages
Offline Experiments (historical or synthetic datasets) | Fast; Less effort; Repeatable | New, unknown resources cannot be evaluated; Dependent on dataset
User Experiments | User's perspective | A lot of effort and time; Few users (ca. 40)
Real-life testing | Real-life setting | Needs a substantial amount of users
Crowdsourcing | Fast; Less effort; Repeatable; User's perspective; Sufficient users | Unknown users; "Artificial task"; Spamming

KOM – Multimedia Communications Lab 4

Crowdsourcing Platforms

Microworkers
§ 500,000 crowdworkers worldwide
§ Flexible forwarding to other hosting platforms
§ Since 2009

CrowdFlower
§ 5 million crowdworkers in 208 countries
§ Gives access to other crowdsourcing platforms, e.g. Amazon MTurk
§ Since 2007

https://microworkers.com, http://www.crowdflower.com

KOM – Multimedia Communications Lab 5

Overview

§ Motivation
§ Crowdsourcing Evaluation Concept
   § Preparation Step
   § Execution Step
§ Crowdsourcing Evaluation Results
§ Conclusion & Future Work

KOM – Multimedia Communications Lab 6

Crowdsourcing Evaluation Concept: Preparation Step

[Process diagram of the Preparation Step stages: Set Goal, Formulate Hypotheses, Select Topic, Create Activity Hierarchy, Create Seed Dataset, Prepare Algorithms, Generate Recommendations, Filter Duplicates, Create Questionnaire (Create Questions, Add Control Questions)]

DeLFI 2013. M. Migenda, M. Erdt, M. Gutjahr, and C. Rensing

KOM – Multimedia Communications Lab 7

Preparation Step: Set Goal

AScore is based on Activity Hierarchies
§ Extends FolkRank by considering activities, activity hierarchies, and the current activity of the learner

EC-TEL 2012. Anjorin et al.

[Figure: example activity hierarchy on the topic of Climate Change, comprising the activities "Understanding Climate Change", "Understanding the Carbon Footprint", "Calculating the Carbon Footprint", "Investigate the impact of Climate Change", "Analyze potential catastrophes due to Climate Change", "Investigate causes of Climate Change", "Give an overview on the history of Global Warming", and "Determine future prognoses on Climate Change"]
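To make the ranking idea concrete: FolkRank-style algorithms spread weight over the folksonomy graph with an adapted PageRank while a preference vector keeps emphasis on the query nodes; AScore applies this to a graph that also contains activity nodes and biases the preference towards the learner's current activity. The snippet below is only a minimal sketch of that weight-spreading step under these assumptions (toy matrix, illustrative damping factor), not the authors' implementation; full FolkRank additionally compares the result against a run without the preference.

```python
import numpy as np

def folkrank_style_weights(A, preference, d=0.7, tol=1e-8, max_iter=1000):
    """One FolkRank-style run: adapted PageRank with a preference vector.

    A          -- column-stochastic adjacency matrix of the (extended) folksonomy
                  graph; for AScore this graph also contains activity nodes.
    preference -- non-negative vector summing to 1 that boosts the query nodes,
                  e.g. the learner's current activity.
    d          -- damping factor balancing graph structure against preference.
    """
    w = np.full(A.shape[0], 1.0 / A.shape[0])   # uniform start vector
    for _ in range(max_iter):
        w_next = d * (A @ w) + (1 - d) * preference
        if np.abs(w_next - w).sum() < tol:      # L1 convergence check
            return w_next
        w = w_next
    return w

# Toy graph with 4 nodes; node 3 plays the role of the current activity.
A = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])
p = np.array([0.0, 0.0, 0.0, 1.0])
print(folkrank_style_weights(A, p))
```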

KOM – Multimedia Communications Lab 8

Preparation Step: Set Goal and Formulate Hypotheses

Set Evaluation Goals:
§ Investigate if AScore recommends more relevant, novel and diverse learning resources to a specified topic than FolkRank.
§ Investigate if AScore recommends more relevant, novel and diverse learning resources to sub-activities (A_Sub) than to activities higher up in the hierarchy (A_Super).

Formulate Hypotheses:
1. Hypothesis: Relevance
   § AScore vs. FolkRank
   § A_Sub vs. A_Super
2. Hypothesis: Novelty
   § AScore vs. FolkRank
   § A_Sub vs. A_Super
3. Hypothesis: Diversity
   § AScore vs. FolkRank
   § A_Sub vs. A_Super

KOM – Multimedia Communications Lab 9

Preparation Step: Select Topic and Generate Recommendations

Generate a basis graph structure for recommendations
§ 5 experts researched the topic of climate change for one hour
§ Used CROKODIL to create an extended folksonomy (users, tags, resources, activities)
§ Ca. 70 resources were tagged and attached to 8 activities

[Figure: the Climate Change activity hierarchy from the previous slide, annotated with the activities used in Experiment Spring and in Experiment Autumn]
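For readers who want to picture the seed dataset: an extended folksonomy can be represented as a set of tag assignments that additionally reference an activity, plus the parent relations of the activity hierarchy. The sketch below is purely illustrative; field names, example entries and parent relations are hypothetical and do not reflect the CROKODIL data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagAssignment:
    """One entry of the extended folksonomy: a user tags a resource within an activity."""
    user: str
    tag: str
    resource: str
    activity: str

# Hypothetical excerpt of a seed dataset (in the experiment: ca. 70 resources, 8 activities).
seed = [
    TagAssignment("expert1", "co2", "http://example.org/footprint-intro",
                  "Calculating the Carbon Footprint"),
    TagAssignment("expert2", "ipcc", "http://example.org/ipcc-summary",
                  "Investigate causes of Climate Change"),
]

# Illustrative parent relations forming an activity hierarchy (structure made up here).
activity_parent = {
    "Calculating the Carbon Footprint": "Understanding the Carbon Footprint",
    "Understanding the Carbon Footprint": "Understanding Climate Change",
    "Investigate causes of Climate Change": "Understanding Climate Change",
}

# Example: list all resources attached directly to one activity.
print([a.resource for a in seed if a.activity == "Calculating the Carbon Footprint"])
```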

KOM – Multimedia Communications Lab 10

Preparation Step: Create Questionnaire

Conduct personal research on the topic
§ Level of knowledge on this topic
§ Request to find 5 online resources relevant to this topic

10 Questions per Recommendation
§ 3 questions for each hypothesis (relevance, novelty, diversity)
§ 1 control question to detect spammers, e.g. "Give 4 keywords to summarize the recommended resource"

General Questions
§ Age, gender, level of education and nationality

Treatment conditions (Experiment Spring and Experiment Autumn):

          | Sub-activity | Super-activity
AScore    | A_Sub        | A_Super
FolkRank  | F_Sub        | F_Super
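As a small illustration of this 2x2 design (recommender algorithm x activity level), incoming participants could be assigned to one of the four treatment conditions at random. The snippet below is a hypothetical sketch, not the assignment procedure used in the study.

```python
import random

# The 2x2 design: recommender algorithm x activity level.
CONDITIONS = ["A_Sub", "A_Super", "F_Sub", "F_Super"]

def assign_condition(rng=random):
    """Randomly place an incoming participant into one of the four conditions."""
    return rng.choice(CONDITIONS)

# Example: distribute 160 hypothetical participants and inspect the group sizes.
counts = {c: 0 for c in CONDITIONS}
for _ in range(160):
    counts[assign_condition()] += 1
print(counts)
```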

KOM – Multimedia Communications Lab 11

Crowdsourcing Evaluation Concept: Execution Step

[Process diagram of the Execution Step: Questionnaire, Crowdsourcing Platform, Release next iteration burst, Results, Filter Spammers, Make Payments]

https://www.soscisurvey.de
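The "Filter Spammers" step can be pictured as a simple rule over the control-question answers, for example rejecting submissions whose four summary keywords are missing, duplicated, or unrelated to the recommended resource. The heuristic below is a hypothetical sketch, not the filter actually used in the study.

```python
def looks_like_spam(keywords, resource_text, min_hits=1):
    """Control-question heuristic: four distinct keywords are required and at
    least `min_hits` of them should occur in the recommended resource's text."""
    cleaned = [k.strip().lower() for k in keywords if k.strip()]
    if len(set(cleaned)) < 4:
        return True                      # missing or copy-pasted keywords
    hits = sum(1 for k in cleaned if k in resource_text.lower())
    return hits < min_hits

# Example submissions for a climate-change resource.
text = "Report on CO2 emissions and global warming trends ..."
print(looks_like_spam(["co2", "emissions", "warming", "report"], text))   # False
print(looks_like_spam(["asdf", "asdf", "asdf", "asdf"], text))            # True
```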

KOM – Multimedia Communications Lab 12

Execution Step: Participants and Treatment Conditions

Participants per treatment condition:

Experiment Spring
          | Sub-activity | Super-activity
AScore    | A_Sub: 45    | A_Super: 39
FolkRank  | F_Sub: 39    | F_Super: 36

Experiment Autumn
          | Sub-activity | Super-activity
AScore    | A_Sub: 80    | A_Super: 73
FolkRank  | F_Sub: 76    | F_Super: 85

Participant sources:
§ Experiment Spring (159 participants): CrowdFlower (32), Microworkers (35), Volunteers (92); Spammers filtered out (243)
§ Experiment Autumn (314 participants): Crowdworkers (314); Spammers filtered out (549)

KOM – Multimedia Communications Lab 13

Overview

§ Motivation
§ Crowdsourcing Evaluation Concept
§ Crowdsourcing Evaluation Results
   § AScore and FolkRank (Experiment Spring, Experiment Autumn)
   § A_Sub and A_Super (Experiment Spring, Experiment Autumn)
§ Conclusion & Future Work

KOM – Multimedia Communications Lab 14

Crowdsourcing Evaluation Results: Experiment Spring

Significance Tests
Hypothesis | 1: Relevance       | 2: Novelty         | 3: Diversity
p-value    | 0.000003578 < 0.05 | 0.000001531 < 0.05 | 0.0001618 < 0.05
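The slides report only p-values, not the statistical test itself. As an illustration of how such a per-hypothesis check can be run, the sketch below assumes Likert-style ratings per participant in the two compared conditions and uses a non-parametric Mann-Whitney U test; the choice of test and the sample values are assumptions for this example, not necessarily what was used in the paper.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-participant ratings (e.g. averaged Likert items) for one
# hypothesis in the two compared conditions.
ratings_condition_a = [5, 6, 4, 7, 5, 6, 6, 5]
ratings_condition_b = [3, 4, 4, 5, 3, 4, 2, 4]

# One-sided test: are ratings in condition A higher than in condition B?
stat, p_value = mannwhitneyu(ratings_condition_a, ratings_condition_b,
                             alternative="greater")
print(f"U = {stat}, p = {p_value:.4g}, significant at 0.05: {p_value < 0.05}")
```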

KOM – Multimedia Communications Lab 15

Crowdsourcing Evaluation Results: Experiment Autumn

Significance Tests
Hypothesis | 1: Relevance       | 2: Novelty          | 3: Diversity
p-value    | 0.000001362 < 0.05 | 0.0000007654 < 0.05 | 0.00000000015 < 0.05

KOM – Multimedia Communications Lab 16

Execution Step: Evaluation Results

Evaluation Goals:
§ Investigate if AScore recommends more relevant, novel and diverse learning resources to a specified topic than FolkRank.
§ Investigate if AScore recommends more relevant, novel and diverse learning resources to sub-activities (A_Sub) than to activities higher up in the hierarchy (A_Super).

Hypotheses:
1. Hypothesis: Relevance
   § AScore vs. FolkRank
   § A_Sub vs. A_Super
2. Hypothesis: Novelty
   § AScore vs. FolkRank
   § A_Sub vs. A_Super
3. Hypothesis: Diversity
   § AScore vs. FolkRank
   § A_Sub vs. A_Super

✔ ✔

KOM – Multimedia Communications Lab 17

Crowdsourcing Evaluation Results: Experiment Spring

Significance Tests
Hypothesis | 1: Relevance     | 2: Novelty     | 3: Diversity
p-value    | 0.0005654 < 0.05 | 0.01666 < 0.05 | 0.02176 < 0.05

KOM – Multimedia Communications Lab 18

Crowdsourcing Evaluation Results: Experiment Autumn

Significance Tests
Hypothesis | 1: Relevance     | 2: Novelty         | 3: Diversity
p-value    | 0.0005306 < 0.05 | 0.000001531 < 0.05 | 0.0000001608 < 0.05

KOM – Multimedia Communications Lab 19

Crowdsourcing Evaluation Results: Experiment Spring

[Bar chart: aggregated mean values for Hypotheses 1, 2 and 3 for F_Sub and F_Super on a scale of 0 to 7; reported values: 3.95, 4.05, 3.97, 3.91, 3.96, 3.83]

Significance Tests
Hypothesis | 1: Relevance  | 2: Novelty    | 3: Diversity
p-value    | 0.3023 > 0.05 | 0.5216 > 0.05 | 0.2031 > 0.05
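Because each hypothesis is covered by three questionnaire items, the aggregated means shown in the chart can be obtained by averaging those items over all participants of a condition. The sketch below uses made-up ratings purely to illustrate that aggregation.

```python
import numpy as np

# Rows = participants, columns = the 3 questionnaire items of one hypothesis.
ratings_f_sub   = np.array([[4, 5, 4], [3, 4, 4], [5, 4, 4]])
ratings_f_super = np.array([[4, 4, 5], [4, 3, 4], [5, 4, 3]])

# Aggregated mean per condition: average over items and participants.
print(f"F_Sub mean   = {ratings_f_sub.mean():.2f}")
print(f"F_Super mean = {ratings_f_super.mean():.2f}")
```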

KOM – Multimedia Communications Lab 20

Crowdsourcing Evaluation Results: Experiment Autumn

[Bar chart: aggregated mean values for Hypotheses 1, 2 and 3 for F_Sub and F_Super on a scale of 0 to 7; reported values: 4.04, 3.9, 4.11, 4.09, 4.07, 4.01]

Significance Tests
Hypothesis | 1: Relevance   | 2: Novelty    | 3: Diversity
p-value    | 0.01481 < 0.05 | 0.7064 > 0.05 | 0.2881 > 0.05

KOM – Multimedia Communications Lab 21

Execution Step: Evaluation Results

Evaluation Goals:
§ Investigate if AScore recommends more relevant, novel and diverse learning resources to a specified topic than FolkRank.
§ Investigate if AScore recommends more relevant, novel and diverse learning resources to sub-activities (A_Sub) than to activities higher up in the hierarchy (A_Super).

Hypotheses:
1. Hypothesis: Relevance
   § AScore vs. FolkRank
   § A_Sub vs. A_Super
2. Hypothesis: Novelty
   § AScore vs. FolkRank
   § A_Sub vs. A_Super
3. Hypothesis: Diversity
   § AScore vs. FolkRank
   § A_Sub vs. A_Super

KOM – Multimedia Communications Lab 22

Conclusion and Future Work

Crowdsourcing can be successfully applied to evaluate TEL recommender algorithms
§ Integrate more user-centric evaluations already during the design and development of TEL recommender algorithms
§ Select the best-fitting evaluation approach

Future Work
§ Can crowdsourcing be used to evaluate other aspects of a recommender system, e.g. explanations, presentation, …?
§ Can more complex TEL evaluation tasks be evaluated with crowdsourcing?

KOM – Multimedia Communications Lab 23

Questions & Contact