Mining and Understanding Activities and Resources on the Web

Mining and Understanding (Learning) Activities and Resources on the Web

Stefan Dietze, Besnik Fetahu, Ujwal Gadiraju

L3S Research Center, Hannover, Germany

14/07/16, Stefan Dietze, Besnik Fetahu, Ujwal Gadiraju

Research areas

Web science, Information Retrieval, Semantic Web, Social Web Analytics, Knowledge Discovery, Human Computation

Interdisciplinary application areas: digital humanities, TEL/education, Web archiving, mobility

Some projects

L3S Research Center


See also: http://www.l3s.de



“Intelligent Access to Information” / L3S


Team & current projects

LA4S LearnWeb


GlycoRec

Ran Yu

Ujwal Gadiraju

Besnik Fetahu

Stefan Dietze



AFEL – Analytics for Everyday (Online) Learning

Figure courtesy of Mathieu d‘Aquin



AFEL – Analytics for Everyday Learning

Work packages:

• WP1: Data Capture
• WP2: Data Enrichment
• WP3: Visual Analytics
• WP4: Cognitive Modelling
• WP5: Use Cases and Evaluation

Workflow: Collect & Enrich Data; Detect and Model User & Learning Activities; Analyse Learning Behaviour; Apply and Evaluate

Figure courtesy of Mathieu d‘Aquin



Entities/notions, e.g.:

• Learning
• Learning Resource
• Learning Activity
• Learning Performance
• Knowledge
• Competence
• ...

[Figure: AFEL workflow, highlighting "Learning Activities" and WP2 (Data Enrichment)]




Understanding informal/micro learning on the Web (e.g. Social Web) – Challenges:

• Absence of competence indicators, assessments, etc.
• Measuring/detecting progress and competence, i.e. how to distinguish good from bad performance?
• Understanding learning activities requires understanding the learning resources and the entities involved
• Heterogeneity and scale of the data/activities/documents to consider (i.e. the Web)
• ...



Overview

Mining & understanding (learning) resources on the Web:

• “Extracting entity-centric knowledge/learning resources from Web Documents” (Stefan)
• “Automated Wikipedia Entity Enrichment with News Sources” (Besnik)

Mining & understanding (learning) activities on the Web:

• Predicting/measuring “competence”: “Behavioral Methods for Improving the Effectiveness of Microtask Crowdsourcing” (Ujwal)




Understanding knowledge resources on the Web

Detecting (salient) entities in Web resources/documents

• NLP-based named entity recognition and disambiguation (Babelfy, DBpedia Spotlight, etc.)
• Usually uses background knowledge graphs (e.g. DBpedia/Wikipedia, Linked Data)

[Figure: example mentions (Apple, Steve Jobs, Digital Revolution) and candidate interpretations (IT Company, Bank, Band, Person, Jobs Biopic/Movie)]
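The disambiguation step can be illustrated with a deliberately small sketch: each candidate entity for a mention carries a set of context words, and the candidate whose words overlap most with the surrounding text wins. The candidate table, its entries and the function name `disambiguate` are hypothetical toy assumptions; tools such as Babelfy or DBpedia Spotlight rely on far richer models and on a full background knowledge graph.

```python
# Toy candidate table (hypothetical): mention -> candidate entity -> context words.
CANDIDATES = {
    "Apple": {
        "Apple (IT Company)": {"iphone", "steve", "jobs", "computer", "company"},
        "Apple (Bank)": {"bank", "savings", "loan", "branch"},
        "Apple (Band)": {"band", "album", "music", "tour"},
    }
}


def disambiguate(mention, context, candidates=CANDIDATES):
    """Pick the candidate whose context words overlap most with the text."""
    context_tokens = set(context.lower().split())
    options = candidates.get(mention, {})
    if not options:
        return None  # unknown mention: nothing to link to
    # Ties (including zero overlap) fall back to the first listed candidate.
    return max(options, key=lambda name: len(options[name] & context_tokens))


company = disambiguate("Apple", "Steve Jobs founded the company")
band = disambiguate("Apple", "the band released a new album")
unknown = disambiguate("Pear", "a mention without candidates")
```

The `None` result for a mention without candidates corresponds to the issue with less popular entities: a mention missing from the background knowledge cannot be linked at all.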


Web documents vs structured entity-centric knowledge graphs

The Web (unstructured Web documents): approx. 46.000.000.000.000 (46 trillion) Web pages indexed by Google

vs

Linked Data & Knowledge Graphs: structured entity-centric data, approx. 1000 datasets & 100 billion statements (DBpedia, etc.)

Linking entities (NED/NER) from documents:

• Computationally complex
• Error-prone
• Issues with less popular entities (example: regional news sites)
• Knowledge graphs are less dynamic than Web documents


• Markup: entity-centric data embedded in the Web (30% of all Web documents in 2015)
• Using W3C standards and related formats (RDFa, Microdata, Microformats)
• Schema.org: initiative from Google, Yahoo, Bing and Yandex to push a common vocabulary
• Same order of magnitude as the Web itself with respect to scale and dynamics (as opposed to knowledge graphs such as DBpedia)
• Rich source of knowledge and data going beyond existing knowledge bases (e.g. Wikipedia)

Entity-centric data on the Web: Web markup (schema.org)


Example entity descriptions from embedded markup (schema.org):

node1 name French Grammar advanced
node1 publisher The Open University
node1 publisher Nature
node1 datePublished 1956
node1 datePublished 1953

node2 publisher Pearson Education
node2 publisher Elsevier
node2 published 03-01-2014
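The example statements show why markup cannot be used as-is: the same node carries several values for one predicate. A minimal sketch, assuming the statements are available as (subject, predicate, object) tuples, groups the values and reports every pair with more than one distinct value. The function names are hypothetical, and a multi-valued predicate is only a candidate conflict (a book may legitimately have several publishers).

```python
from collections import defaultdict

# Statements from the slide, written as (subject, predicate, object) tuples.
STATEMENTS = [
    ("node1", "name", "French Grammar advanced"),
    ("node1", "publisher", "The Open University"),
    ("node1", "publisher", "Nature"),
    ("node1", "datePublished", "1956"),
    ("node1", "datePublished", "1953"),
    ("node2", "publisher", "Pearson Education"),
    ("node2", "publisher", "Elsevier"),
    ("node2", "published", "03-01-2014"),
]


def group_statements(statements):
    """Collect distinct object values per (subject, predicate), in first-seen order."""
    grouped = defaultdict(list)
    for subject, predicate, obj in statements:
        if obj not in grouped[(subject, predicate)]:
            grouped[(subject, predicate)].append(obj)
    return dict(grouped)


def find_candidate_conflicts(statements):
    """Keep only the (subject, predicate) pairs that carry more than one value."""
    grouped = group_statements(statements)
    return {key: values for key, values in grouped.items() if len(values) > 1}


conflicts = find_candidate_conflicts(STATEMENTS)
for (subject, predicate), values in conflicts.items():
    print(subject, predicate, values)
```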


[Figure: number of entities and number of statements per PLD (ranked), count on a log scale]

Example: entity markup of learning resources on the Web

• “Learning Resource Metadata Initiative (LRMI)”: schema.org vocabulary for the annotation of learning resources (informal, formal, etc.)
• Approx. 5000 PLDs in “Common Crawl”
• LRMI adoption on the Web (WDC) [LILE16]:
  2014: 30.599.024 quads, 4.182.541 resources
  2013: 10.636.873 quads, 1.461.093 resources
• Power law distribution across providers
• 4805 providers/PLDs

[LILE16] Taibi, D., Dietze, S., Towards embedded markup of learning resources on the Web: a quantitative analysis of LRMI terms usage, in Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2 2016, Montreal, Canada, April 11, 2016


Entity-centric markup on the Web: challenges

Characteristics and examples:

• Coreferences: 18.000 results for <„Iphone 6“, type, s:Product> (8,6 quads on average) in Common Crawl
• Redundancy: <s, schema:name, „Iphone 6“> occurring 1000 times in Common Crawl
• Lack of links: largely unlinked entity descriptions
• Errors (typos & schema violations, see Meusel et al. [ESWC2015]):
  - Wrong namespaces, such as http://schma.org
  - Undefined types & predicates: 9,7 %, less common than in LOD
  - Confusion of datatype and object properties, e.g. <s1, s:publisher, „Springer“>: 24,35 % object property issues vs 8 % in LOD
  - Data property range violations, e.g. literals vs numbers: 12,6 % vs 4,6 % in LOD

Why not use markup as a knowledge graph of the entities involved in (learning) resources (similar to DBpedia/Wikipedia)?


Data fusion for consolidating entity-centric Web markup

Improving understanding of resources: consolidating entity-centric Web data for a given document/resource/entity?

Markup as a distributed knowledge graph/base, e.g. to augment existing knowledge bases (e.g. DBpedia/Wikipedia)?

Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Entity summarisation on structured web markup. In The Semantic Web: ESWC 2016 Satellite Events. Springer, 2016.

Yu, R., Gadiraju, U., Zhu, X., Fetahu, B., Dietze, S., Fact Selection for data fusion on structured web markup. ICDE2017, IEEE International Conference on Data Engineering, in progress.

Query: iPhone 6, type:(Product)

Web (crawl), i.e. billions of entities/facts:

<e1, s:name, „Iphone 6“>
<e2, s:brand, „Apple Inc.“>
<e3, s:brand, „Apple“>
<e4, s:weight, 127>
<e5, s:releaseDate, „1.12.1972“>

Entity Description:

brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB


A supervised ML approach to select entity facts from the Web


Fact/entity retrieval: BM25 entity retrieval model on markup index (Common Crawl)

Fact selection: supervised ML classifier (SVM), using 3 feature categories (relevance, authority, clustering)

Experiments on Common Crawl: products, movies, books (approx. 3 billion facts)
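To make the retrieval step concrete, the following is a minimal sketch of plain BM25 scoring over flattened entity descriptions. It is not the authors' implementation: the whitespace tokenizer, the parameter defaults `k1=1.2` and `b=0.75`, the idf variant and the three toy descriptions are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems normalize more)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with plain BM25."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical entity descriptions, each flattened into one text field.
ENTITY_DESCRIPTIONS = [
    "iphone 6 brand apple inc. weight 129",
    "iphone 6 case brand acme",
    "galaxy s6 brand samsung",
]
scores = bm25_scores("iphone 6", ENTITY_DESCRIPTIONS)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Retrieval of this kind only produces candidates; descriptions that merely mention the query terms still score well, which is why a separate fact selection step follows.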

1. Retrieval

Query: iPhone 6, type:(Product)

Source: Web page markup from the Web (crawl), approx. 125.000 facts for „iPhone6“

Candidate Facts:

node1 brand _node-x
node1 brand Apple Inc.
node1 weight 129
node2 weight 172
node2 manufacturer Foxconn
node3 releasedate 01.12.1972
node3 manufacturer Foxconn

2. Fact selection (trained SVM classifier)

Entity Description:

brand Apple Inc.
weight 129
date 30.09.2015
manufacturer Foxconn
Storage 16 GB

New Queries:

Foxconn, type:(Organization)
Cupertino, type:(City)
Apple Inc., type:(Organization)



Evaluation & results


Performance

Outperforms baselines (BM25F, CBFS)

Strong variance across types/queries

Average precision ranges from 75 % to 98 %


Evaluation & results: markup vs DBpedia/Wikipedia

Can markup augment existing Knowledge Graphs?

Comparison of obtained facts with existing knowledge bases (DBpedia/Wikipedia)

• “new”: fact not existing in DBpedia (e.g. a book's releaseDate in Wiki/DBpedia)
• “new-p”: property not existing in DBpedia (e.g. a book's release countries)
• “existing”: fact already in DBpedia

On average approx. 60% new facts
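The three categories can be expressed as a small decision rule. The sketch below is an illustration under strong assumptions: the knowledge base is a set of exact (subject, predicate, value) triples plus a set of known properties, and markup predicates are already mapped to knowledge-base properties. The names `categorize_fact`, `KB_FACTS`, `KB_PROPERTIES` and the book data are hypothetical.

```python
# Hypothetical knowledge-base snapshot.
KB_PROPERTIES = {"author", "releaseDate"}
KB_FACTS = {("Book_X", "author", "Jane Doe")}


def categorize_fact(fact, kb_facts=KB_FACTS, kb_properties=KB_PROPERTIES):
    """Label a selected markup fact as 'existing', 'new' or 'new-p'."""
    subject, predicate, value = fact
    if predicate not in kb_properties:
        return "new-p"      # the property itself is unknown to the knowledge base
    if (subject, predicate, value) in kb_facts:
        return "existing"   # exact fact already present
    return "new"            # known property, but this fact is missing


selected_facts = [
    ("Book_X", "author", "Jane Doe"),
    ("Book_X", "releaseDate", "1999"),
    ("Book_X", "releaseCountry", "France"),
]
labels = [categorize_fact(f) for f in selected_facts]
share_new = labels.count("new") / len(labels)
```

Exact matching is the weakest part of this sketch; comparing real markup against DBpedia would need value normalization (dates, units, name variants) before a fact can be called new.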


Conclusions

• Data fusion on markup as a means to extract rich descriptions of entities in Web documents
• Understanding the semantics of activities and resources (particularly learning resources)
• Markup: rich source of entity-centric data (30% of the Web, i.e. 16 trillion Web pages)
• Potential training data for NED/NER approaches
• Potential for augmenting existing knowledge graphs/bases (DBpedia/Wikipedia et al.)




Next





Predicting/measuring “competence”: “Behavioral Methods for Improving the Effectiveness of Microtask Crowdsourcing” (Ujwal)


Outline

Wikipedia Entity

Enrichment

Besnik Fetahu, Katja Markert, Avishek Anand: Automated News Suggestions for Populating Wikipedia Entity Pages. CIKM 2015: 323-332


http://dblp.uni-trier.de/pers/hd/m/Markert:Katja

http://dblp.uni-trier.de/pers/hd/a/Anand:Avishek

http://dblp.uni-trier.de/db/conf/cikm/cikm2015.html#FetahuMA15

Introduction

Odisha cyclone vs Hurricane Katrina:

• Human fatalities: approx. 10k vs 1.8k
• Estimated damages: $4.5 billion vs $108 billion
• ‘Odisha cyclone’ has no coverage in the entity page of its location, ‘Odisha’
• ‘Hurricane Katrina’ finds broad coverage in the entity page of its location, ‘New Orleans’

[Figure: entity pages New Orleans, Odisha, Hurricane Katrina, Odisha Cyclone]


Introduction

• Entity pages consist of facts and statements supported by external references
• News as authoritative sources with emerging facts and events
• Delay between the reporting of an event in news and its inclusion in entity pages [1]
• Incomplete section structure for long-tail entities
• Several implications for real-world applications that make use of Wikipedia, e.g. KB maintenance, entity disambiguation, etc.

[1] Besnik Fetahu, Abhijat Anand, Avishek Anand: How much is Wikipedia lagging behind news? WebSci 2015



Motivation: News Density in Wikipedia

• Citation templates (‘news’, ‘books’, ‘web’, ‘journal’, etc.)
• ~60% ‘web’ vs. ~20% ‘news’ citations
• On average there are ~6.5 news citations per entity
• On average a news article is assigned to ~1.3 entities
• The most cited news article is cited by 81 entities

Besnik Fetahu, Abhijat Anand, Avishek Anand: How much is Wikipedia lagging behind news?. WebSci 2015



Problem Definition

Given:

• news article n (pub. date: t_k): news title, headline, paragraphs, named entities
• entity page e (rev. date: t_{k-1}): section template, categories, entities (anchors), ...

Questions:

• suggest news n to entity e?
• specify the section in e for n


Automated news suggestion to entity pages

Example news article:

“Some half a million people were evacuated from the southeastern Indian coast as Cyclone Phailin, a tropical storm from the Bay of Bengal, bore down on India. The states of Orissa and Andhra Pradesh, both of which have large coastal populations, were on high alert ahead of the storm’s expected arrival.”

Feature extraction: entities (from the news article), sections (from the Wikipedia entity page)

Task#1: article-entity placement (e.g. Odisha, Bay of Bengal, Phailin)

One classifier per entity type

Task#2: article-section placement (e.g. [state]: geography, [city]: climate, ...)


Article-Entity Placement (Task#1)

News Suggestion Attributes (Task#1): Entity Salience

[Figure: news article mentioning Nikola Tesla, Elon Musk, Larry Page and John B. Kennedy; Tesla is a central concept in the given news article]

Entity Salience: Relative Entity Frequency

• reward entities appearing throughout the text
• reward entities appearing in the top paragraphs
• weigh an entity w.r.t. its co-occurring entities
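The three salience criteria can be combined in a few lines. This is a hedged sketch and not the measure used in the published work: the geometric position weight `decay`, the input format (one list of entity mentions per paragraph) and the function name are assumptions made for illustration.

```python
def entity_salience(paragraphs, decay=0.8):
    """Relative, position-weighted entity frequency.

    paragraphs: list of lists of entity mentions, in reading order.
    Mentions in earlier paragraphs get a higher weight (decay ** position),
    and the final scores are normalized over all co-occurring entities.
    """
    weighted = {}
    for position, mentions in enumerate(paragraphs):
        weight = decay ** position
        for entity in mentions:
            weighted[entity] = weighted.get(entity, 0.0) + weight
    total = sum(weighted.values())
    if total == 0:
        return {}
    return {entity: value / total for entity, value in weighted.items()}


# Hypothetical mention lists for a three-paragraph article.
article = [
    ["Nikola Tesla", "Elon Musk"],
    ["Nikola Tesla"],
    ["Larry Page", "Nikola Tesla"],
]
salience = entity_salience(article)
most_salient = max(salience, key=salience.get)
```

Because the scores are normalized, an entity mentioned once in a short article with few other entities can be as salient as a frequently mentioned entity in a crowded one.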


News Suggestion Attributes (Task#1): Relative Entity Authority

[Figure: example entities Elias Taban and Hillary Clinton]

Relative Entity Authority

• entities with ‘low authority’ have a lower entry barrier for a news article
• a news article in which an entity co-occurs with ‘high authority’ entities conveys the importance of the news
• entity authority as an a priori probability or any centrality-based measure


News Suggestion Attributes (Task#1): Novelty & Redundancy

Novelty and Redundancy Measure

• novelty is measured w.r.t. previously added news articles in an entity page
• major events have wide coverage in news media
• place the news article into the correct section
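The slide does not spell out the novelty measure, so the following is only one plausible instantiation: novelty as one minus the highest token-level Jaccard similarity to any news article already referenced in the entity page. The function names and the whitespace tokenization are assumptions.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def novelty(candidate, previous_articles):
    """1.0 means entirely new content, 0.0 means a duplicate of an earlier article."""
    if not previous_articles:
        return 1.0  # nothing was added before, so everything is novel
    tokens = candidate.lower().split()
    most_similar = max(jaccard(tokens, p.lower().split()) for p in previous_articles)
    return 1.0 - most_similar


already_added = ["cyclone hits odisha coast"]
duplicate_score = novelty("cyclone hits odisha coast", already_added)
related_score = novelty("cyclone hits coast", already_added)
unrelated_score = novelty("election results announced", already_added)
```

Major events are reported by many outlets, so a low score here flags a redundant suggestion even when the article itself is relevant to the entity.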


Article-Section Placement (Task#2)

Task#2: Section-template Generation

[Figure: example entities Germanwings, Adria, Lufthansa]

• Section templates per entity type
• Pre-determined number of main sections
• Canonicalize sections
• Generate ‘complete’ section templates based on similar entities
• Cluster based on the X-means [3] algorithm

[3] D. Pelleg, A. W. Moore, et al. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML, pages 727–734, 2000.


Task#2: Overall news-section fit

• What is the best section to append a given news article?
• Measure the overall similarity between n and the pre-computed sections in the section templates
• Similarity aspects between news articles and sections:
  - Topic similarity (LDA models over the sections and news documents)
  - Syntactic similarity
  - Lexical similarity
  - Entity-based similarity (overlap of named entities)
  - Frequency
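Two of the listed similarity aspects are easy to sketch: lexical similarity as cosine similarity over term frequencies, and entity-based similarity as the overlap of named entities. The code below combines them with a hypothetical mixing weight `alpha` and picks the best-fitting section; topic (LDA), syntactic and frequency features are left out, and the section data is invented for illustration.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity over raw term frequencies (lexical similarity)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def entity_overlap(entities_a, entities_b):
    """Share of named entities that the two sides have in common."""
    a, b = set(entities_a), set(entities_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_section(news_text, news_entities, sections, alpha=0.5):
    """sections: dict mapping section name -> (section text, section entities)."""
    def fit(name):
        text, entities = sections[name]
        return alpha * cosine(news_text, text) + (1 - alpha) * entity_overlap(news_entities, entities)
    return max(sections, key=fit)


# Hypothetical section template for an entity of type [state].
SECTIONS = {
    "climate": ("cyclone storm rainfall monsoon", ["Bay of Bengal"]),
    "economy": ("industry exports gdp growth", ["Reserve Bank of India"]),
}
chosen = best_section(
    "cyclone phailin storm evacuation",
    ["Bay of Bengal", "Odisha"],
    SECTIONS,
)
```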


Evaluation Strategy

What constitutes the ground truth for such a task?

Challenges

• ‘Invasive’: add news articles and wait for a time period until they are either accepted or deleted by the Wikipedia editors
• Long tail vs. trunk entities: long-tail entities might not be of particular interest to editors, hence many ‘false positives’ will go unnoticed
• Crowdsourcing: challenging to find knowledgeable workers for long-tail entities

Approach

• Use already referenced news articles from entity pages
• Avoid the uncertainty of judgements and expertise of crowd workers
• Non-invasive approach for entity pages
• Reusable test bed for similar approaches


Experimental Setup

Datasets

[Figure: distribution of news articles, entities, and sections across the years]

Evaluation Plan

• train on years [t_0, t_i], test on (t_i, t_k]
• P/R/F1 metrics

Baselines

Task#1: AEP

• B1: AEP based on Dunietz and Gillick
• B2: AEP if entity appears in the news title

Task#2: ASP

• S1: AES based on max similarity to one of the sections
• S2: AES to the most frequent section
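The evaluation plan, training on earlier years and testing on later ones with P/R/F1, can be sketched as follows. The instance format (year, item), the split year and the toy labels are assumptions made for the example.

```python
def temporal_split(instances, split_year):
    """Train on instances up to and including split_year, test on later ones."""
    train = [inst for inst in instances if inst[0] <= split_year]
    test = [inst for inst in instances if inst[0] > split_year]
    return train, test


def precision_recall_f1(predicted, relevant):
    """Set-based precision, recall and F1."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical (year, news article id) instances.
instances = [(year, "article-%d" % year) for year in range(2009, 2015)]
train, test = temporal_split(instances, split_year=2011)

# Hypothetical suggestions vs. articles actually referenced in the entity page.
p, r, f1 = precision_recall_f1(predicted={"a", "b", "c", "d"}, relevant={"a", "b", "e"})
```

Splitting by time rather than at random keeps the test realistic: the model never sees news published after the entity page revision it is asked to enrich.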


Task#1: Article-Entity Placement

[Figures: Performance, Robustness, Feature Analysis, Number of Instances]

Task#2: Article-Section Placement


Conclusions

• Two-stage news suggestion approach for Wikipedia entity pages
• Model and define what makes a good news suggestion
• Model functions for salience, relative authority, novelty and section placement, defined as attributes of a ‘good news suggestion’
• Entity profile expansion
• Extensive evaluation over 350k news articles, 73k entity pages and for the different Wikipedia states between 2009 and 2014
• A publicly available and reusable test bed for similar tasks


Next









Crowdsourcing - A Brief Introduction

Portmanteau of "crowd" and "outsourcing", first coined by Jeff Howe in a June 2006 Wired magazine article.

Accumulating small contributions from each crowd worker to solve a bigger problem.



Crowdsourcing - The Means to Many Ends


The Paid Crowdsourcing Paradigm

❏ Small monetary rewards in exchange for completing short tasks online
❏ Entertainment-driven workers primarily seek diversion by taking up interesting, possibly challenging tasks
❏ Money-driven workers are mainly attracted by monetary incentives
❏ A crowdsourcing platform acts as a marketplace for such tasks
❏ About five million tasks are completed per year at 1-5 cents each
❏ Some jobs can contain more than 300K tasks


Microtask Crowdsourcing Platforms as Online Social Environments

Crowd Workers as Learners

Crowd worker as a learner in an atypical learning environment:

❏ No information regarding the background, knowledge, or skills of a worker.
❏ Due to the short nature of crowdsourced microtasks, workers face an ‘on-the-fly’ learning situation.
❏ Comparable to experiential learning and microlearning.
❏ In many cases, workers have no time to apply their gained experience.
❏ Often for single use, high % of new requesters.

Training Workers for Improving Performance in Crowdsourcing Microtasks. Ujwal Gadiraju, Besnik Fetahu, Ricardo Kawase. ECTEL 2015; Toledo, Spain.



Challenges

○ Diverse pool of workers

○ Wide range of behavior

○ Various motivations

Ross, J., Irani, L., Silberman, M., Zaldivar, A. and Tomlinson, B. Who are the crowdworkers?: shifting demographics in mechanical turk. In CHI'10 Extended Abstracts on Human factors in computing systems. ACM.

Kazai, Gabriella, Jaap Kamps, and Natasa Milic-Frayling. The face of quality in crowdsourcing relevance labels: demographics, personality and labeling accuracy. Proceedings of CIKM’12. ACM.

Quality Control in Crowdsourcing


Malicious Workers

➢ Typically adopted solution to prevent/flag malicious activity: Gold-Standard Questions
➢ Flourishing crowdsourcing markets, advances in malicious activity

“workers with ulterior motives, who either simply sabotage a task, or provide poor responses in an attempt to quickly attain task completion for monetary gains”

Need to understand worker behavior and types of malicious activity.



Malicious Workers - Behavioral Patterns in a Survey

❏ Ineligible Workers (IW)
   Instruction: Please attempt this microtask ONLY IF you have successfully completed 5 microtasks previously.
   Response: ‘this is my first task’

❏ Fast Deceivers (FD)
   E.g. copy-pasting the same text in response to multiple questions, entering gibberish, etc.
   Response: ‘What’s your task?’, ‘adasd’, ‘fgfgf gsd ljlkj’

❏ Rule Breakers (RB)
   Instruction: Identify 5 keywords that represent this task (separated by commas).
   Response: ‘survey, tasks, history’, ‘previous task yellow’

❏ Smart Deceivers (SD)
   Instruction: Identify 5 keywords that represent this task (separated by commas).
   Response: ‘one, two, three, four, five’

❏ Gold Standard Preys (GSP)
   These workers abide by the instructions and provide valid responses, but stumble at the gold-standard questions!

Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. Ujwal Gadiraju, Ricardo Kawase, Stefan Dietze, Gianluca Demartini. CHI 2015; Seoul, Korea.



Workers Behavioral Patterns - Experimental Results



Automatic Classification of Worker Type

Image Transcription & Information Finding Tasks

Low-level features through keystroke & mouse-tracking

❏ timeBeforeInput

❏ timeBeforeClick

❏ tabSwitchFreq

❏ windowToggleFreq

❏ openNewTabFreq

❏ totalMouseMovements

❏ scrollUpFreq

❏ scrollDownFreq

❏ . . .
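A few of the listed features can be derived from a time-stamped event log. The sketch below is illustrative only: the log format (seconds since task load, event type), the event type names and the choice to report frequencies as raw counts are assumptions; only the feature names are taken from the list above.

```python
def extract_features(events):
    """Derive low-level behavioral features from (timestamp, event_type) pairs.

    Timestamps are seconds since the task page was loaded; the list is
    expected to be sorted by time.
    """
    def first_time(kind):
        # Time of the first event of this kind, or None if it never happened.
        return next((t for t, k in events if k == kind), None)

    def count(kind):
        return sum(1 for _, k in events if k == kind)

    return {
        "timeBeforeInput": first_time("keypress"),
        "timeBeforeClick": first_time("click"),
        "tabSwitchFreq": count("tab_switch"),
        "totalMouseMovements": count("mousemove"),
        "scrollUpFreq": count("scroll_up"),
        "scrollDownFreq": count("scroll_down"),
    }


# Hypothetical trace of one worker on one microtask.
trace = [
    (0.5, "mousemove"),
    (1.2, "scroll_down"),
    (2.0, "click"),
    (3.5, "keypress"),
    (4.0, "tab_switch"),
    (6.0, "keypress"),
    (7.0, "scroll_up"),
    (7.5, "mousemove"),
]
features = extract_features(trace)
```

Feature vectors of this kind are what a supervised classifier would consume to predict a worker type.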

[Figure: behavioral traces of a Competent Worker vs a Fast Deceiver]

Crowd Anatomy: Behavioral Traces for Crowd Worker Modeling and Pre-selection. Ujwal Gadiraju, Gianluca Demartini, Ricardo Kawase, and Stefan Dietze. (Under review at AAAI HCOMP 2016; Austin, Texas, USA.)

Capturing Behavioral Traces ⇒ Behavioral Patterns

Worker Behavioral Patterns

❏ Multitaskers

❏ Divers & Feelers

❏ Wanderers

❏ Copy-Pasters & Typers

❏ . . .

Worker Types

❏ Competent Workers

❏ Diligent Workers

❏ Ineligible Workers

❏ Fast Deceivers

❏ Smart Deceivers

❏ Rule Breakers

❏ Incompetent Workers

❏ Sloppy Workers

Capturing Behavioral Traces ⇒ Behavioral Patterns ⇒ Automatic Worker Type Classification

Behavioral Traces for Crowd Worker Modeling and Pre-selection

Evaluation of Automatic Worker Type Classification

Supervised Machine Learning Model

❏ Automatic classification at scale
❏ Random forest classifier
❏ Classifiers evaluated using 10-fold cross validation
❏ Information Finding & Content Creation Tasks

[Figure: evaluation for Information Finding Tasks]

Benefit of Automatic Worker Type Classification

[Figure: results for Information Finding Tasks (finding middle names) and Content Creation Tasks (image transcription)]

Pre-selection of desired worker types

Task Turnover Time

“the amount of time required to acquire the full set of judgments from crowd workers, thereby completing and finalizing a task considering pre-defined criteria (such as qualification tests or pre-selection)”

[Figure: task turnover time for Information Finding Tasks (finding middle names) and Content Creation Tasks (image transcription)]

Cognitive Theories & Entailing Data

Paradox of Choice in the Crowd

❏ Many available platforms and tasks
❏ Overload of choices for workers
❏ Detrimental effects on decision making (psychology & social theory works)
❏ Workers settle for less suitable tasks
❏ More capable workers are deprived of an opportunity to work on suitable tasks
❏ Overall effectiveness of the crowdsourcing paradigm decreases

Typically Adopted Solution: Crowd Worker Pre-selection

The Dunning-Kruger Effect

❏ Cognitive bias: incompetent individuals exhibit inflated self-assessments and illusory superiority.
❏ Incompetence in a particular domain reduces the metacognitive ability of individuals to realize it.
❏ Incompetent individuals miscalibrate by erroneously assessing themselves, while competent individuals miscalibrate by erroneously assessing others.

Self-Assessments for Pre-selection of Crowd Workers

❏ Crowd workers often lack awareness about their true level of competence
❏ Novel worker pre-selection method based on self-assessments & performance
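The slide gives no details of the pre-selection method, so the following is only a hypothetical illustration of the idea of combining self-assessment with measured performance: a worker is kept if the accuracy on a qualification test is high enough and the self-assessed score is close to it. The thresholds `min_accuracy` and `max_gap`, the field names and the worker data are all assumptions.

```python
def preselect(workers, min_accuracy=0.7, max_gap=0.1):
    """Keep workers who perform well AND judge their own performance accurately.

    Each worker is a dict with 'id', 'accuracy' (measured, 0..1) and
    'self_assessed' (the worker's own estimate of that accuracy, 0..1).
    """
    selected = []
    for worker in workers:
        gap = worker["self_assessed"] - worker["accuracy"]  # > 0 means overestimation
        if worker["accuracy"] >= min_accuracy and abs(gap) <= max_gap:
            selected.append(worker["id"])
    return selected


# Hypothetical qualification-test results.
workers = [
    {"id": "w1", "accuracy": 0.9, "self_assessed": 0.85},  # competent and well calibrated
    {"id": "w2", "accuracy": 0.4, "self_assessed": 0.9},   # inflated self-assessment
    {"id": "w3", "accuracy": 0.8, "self_assessed": 1.0},   # competent but overconfident
]
selected = preselect(workers)
```

Worker `w2` shows the pattern described by the Dunning-Kruger effect: low measured performance paired with a high self-assessment.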

[Figure: evaluation in a Sentiment Analysis Task, worker performance data]

Using Worker Self-Assessments for Competence-based Pre-Selection. Ujwal Gadiraju, Besnik Fetahu, Ricardo Kawase, Patrick Siehndel and Stefan Dietze. (Under review at ACM CSCW 2017; Portland, Oregon, USA.)

Summary










Thank you!


• http://www.l3s.de

• http://stefandietze.net

• http://l3s.de/~fetahu

• http://www.l3s.de/~gadiraju/
