Myths and Challenges in Knowledge Extraction and Analysis from Human-generated Content Marco Brambilla [email protected] @marcobrambi


Page 1: Myths and challenges in knowledge extraction and analysis from human-generated content

Myths and Challenges in Knowledge Extraction and Analysis from Human-generated Content

Marco Brambilla

[email protected] @marcobrambi

Page 2: Myths and challenges in knowledge extraction and analysis from human-generated content

Knowledge, Behaviour and Feature Extraction with Big Data Science

Page 3: Myths and challenges in knowledge extraction and analysis from human-generated content

Your Data, My Problem

Page 4: Myths and challenges in knowledge extraction and analysis from human-generated content

Problem 1. The Complexity of Knowledge

Page 5: Myths and challenges in knowledge extraction and analysis from human-generated content

There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.

Shakespeare (Hamlet Act 1, scene 5)

Page 6: Myths and challenges in knowledge extraction and analysis from human-generated content

The Answer to the Great Question... Of Life, the Universe and Everything

[Diagram: Data → Information → Knowledge → Wisdom, plotted against the axes "Context independence" and "Understanding". The transitions are labelled "Understanding relations", "Understanding patterns" and "Understanding principles".]

Page 7: Myths and challenges in knowledge extraction and analysis from human-generated content

Formalizing evolving knowledge is hard

Only high frequency emerges

The long tail challenge

Page 8: Myths and challenges in knowledge extraction and analysis from human-generated content

The Evolving Knowledge

[Diagram: evolving knowledge. Labels: known, social, factoid; regions a, b, c, ¬c, d; states: potentially emerging, potentially decaying, actual and solid.]

Page 9: Myths and challenges in knowledge extraction and analysis from human-generated content

Information and Knowledge Extraction

Page 10: Myths and challenges in knowledge extraction and analysis from human-generated content

Heaven and Earth

Are they so different?

Page 11: Myths and challenges in knowledge extraction and analysis from human-generated content

The Digital “Heaven”

Vs.

The “physical” Earth

Page 12: Myths and challenges in knowledge extraction and analysis from human-generated content

Heaven and Earth

How to peer into the world through an effective window?

INGREDIENTS
• Social media, IoT, … – the data
• Domain experts – the context

Page 13: Myths and challenges in knowledge extraction and analysis from human-generated content

[photo: http://hoglundassociates.com/Images/Cloud_Gate.jpg]

The digital reflection of our life is sharpening

Page 14: Myths and challenges in knowledge extraction and analysis from human-generated content

The data evolution

Data source        Around for     Frequency      Delay
Census data        100s of years  years          months
Newspaper          100s of years  days           1 day
Weather sensors    10s of years   hours/minutes  hours/minutes
TV news            10s of years   hours          minutes
Traffic sensors    years          15 minutes     minutes
Call Data Records  years          15 minutes     hours
Social media       years          seconds        seconds
IoT                recently       milliseconds   milliseconds

Source: Emanuele Della Valle

Page 15: Myths and challenges in knowledge extraction and analysis from human-generated content

Data piles up without easing decision making

I have to decide: A or B?

Why not C? What if D?

Source: Emanuele Della Valle

Page 16: Myths and challenges in knowledge extraction and analysis from human-generated content

But we would like to …

fuse all those data sources

make sense of the fused information

Definitely E!

Source: Emanuele Della Valle

Page 17: Myths and challenges in knowledge extraction and analysis from human-generated content

The MacroScope

Joël de Rosnay, The Macroscope, 1979

Page 18: Myths and challenges in knowledge extraction and analysis from human-generated content

Problem 2. Cognitive Bias

(of the observer)

Page 19: Myths and challenges in knowledge extraction and analysis from human-generated content

the streetlamp effect

The bias of the observer

Page 20: Myths and challenges in knowledge extraction and analysis from human-generated content

Strategy and Inaccuracy

Page 21: Myths and challenges in knowledge extraction and analysis from human-generated content

Use Case: City

Page 22: Myths and challenges in knowledge extraction and analysis from human-generated content

Model of social media and reality sensing

Page 23: Myths and challenges in knowledge extraction and analysis from human-generated content

Model of social media and reality sensing

Page 24: Myths and challenges in knowledge extraction and analysis from human-generated content

Model of social media and reality sensing

Page 25: Myths and challenges in knowledge extraction and analysis from human-generated content

Problem 3. Data Quality

Page 26: Myths and challenges in knowledge extraction and analysis from human-generated content

Data Quality Issue

Gartner Report: In 2017, 33% of the largest global companies will experience an information crisis due to their inability to adequately value, govern and trust their enterprise information.

If you torture the data long enough, it will confess to anything

– Darrell Huff

Page 27: Myths and challenges in knowledge extraction and analysis from human-generated content

The Vicious Cycle of Bad Data

Bad Data → Incorrect Analysis → Invalid Insights → Wrong Decisions → Poor Outcome

Page 28: Myths and challenges in knowledge extraction and analysis from human-generated content

Conventional Definition of Data Quality

• Accuracy: the data was recorded correctly.
• Completeness: all relevant data was recorded.
• Uniqueness: entities are recorded once.
• Timeliness: the data is kept up to date (and time consistency is granted).
• Consistency: the data agrees with itself.
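Most of these dimensions can be turned into simple automated checks. The sketch below is purely illustrative: the record layout (`id`, `name`, `lat`, `lon`, `updated`), the sample rows and the 365-day freshness threshold are assumptions, not taken from the original material. Accuracy is left out on purpose, since it usually needs an external ground truth to compare against.

```python
from collections import Counter
from datetime import date

# Hypothetical venue records; field names and values are illustrative only.
records = [
    {"id": 1, "name": "Cafe Roma", "lat": 45.46, "lon": 9.19, "updated": date(2017, 3, 1)},
    {"id": 2, "name": None, "lat": 45.47, "lon": 9.18, "updated": date(2017, 2, 1)},
    {"id": 2, "name": "Bar Sport", "lat": 45.48, "lon": 9.20, "updated": date(2014, 1, 1)},
    {"id": 3, "name": "Duomo", "lat": 145.0, "lon": 9.19, "updated": date(2017, 3, 5)},
]

def quality_report(rows, today, max_age_days=365):
    """Count violations of four of the conventional quality dimensions."""
    id_counts = Counter(r["id"] for r in rows)
    return {
        # Completeness: a required field is missing.
        "incomplete": sum(1 for r in rows if r["name"] is None),
        # Uniqueness: the same identifier appears more than once.
        "duplicated_ids": sum(1 for c in id_counts.values() if c > 1),
        # Timeliness: the record has not been updated recently.
        "stale": sum(1 for r in rows if (today - r["updated"]).days > max_age_days),
        # Consistency: coordinates outside the valid latitude/longitude range.
        "out_of_range": sum(
            1 for r in rows
            if not (-90 <= r["lat"] <= 90 and -180 <= r["lon"] <= 180)
        ),
    }

report = quality_report(records, today=date(2017, 4, 1))
print(report)
```

Each counter maps to one dimension, so a non-zero value points directly at the kind of cleaning that is needed.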

Page 29: Myths and challenges in knowledge extraction and analysis from human-generated content

Why is Data “Dirty” ?

• Dummy Values
• Absence of Data
• Multipurpose Fields
• Cryptic Data
• Contradicting Data
• Shared Field Usage
• Inappropriate Use of Fields
• Violation of Business Rules
• Reused Primary Keys
• Non-Unique Identifiers
• Data Integration Problems

Page 30: Myths and challenges in knowledge extraction and analysis from human-generated content

Data Wrangling a.k.a.
• Data Preprocessing
• Data Preparation
• Data Cleansing
• Data Scrubbing
• Data Munging
• Data Transformation
• Data Fold, Spindle, Mutilate…
• (good old) ETL

Page 31: Myths and challenges in knowledge extraction and analysis from human-generated content

Foursquare
• Check-ins explicitly performed in venues all around the world
• Data set: geo-localized Foursquare venues, collected through a query every 50 m with radius >50 m over:
  • Milan area: 20 km x 17.5 km
• Some numbers
  • Total number of venues: 90K (dirty)
  • Total number of valid venues: 43K
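A quick sketch of what such a collection implies. With one query every 50 m over a 20 km x 17.5 km area and a radius larger than the spacing, neighbouring queries overlap and return the same venues repeatedly, so duplicates have to be removed. The function names and the sample records are hypothetical, and deduplication by venue id is only one plausible cleaning step: the criteria that reduced 90K venues to 43K valid ones are not stated here.

```python
def query_centres(width_m, height_m, step_m):
    """Yield (x, y) offsets in metres of query centres on a regular grid."""
    nx = int(width_m // step_m)
    ny = int(height_m // step_m)
    for i in range(nx):
        for j in range(ny):
            yield (i * step_m, j * step_m)

def deduplicate(venues):
    """Keep one record per venue id; overlapping queries return repeats."""
    unique = {}
    for venue in venues:
        unique.setdefault(venue["id"], venue)
    return list(unique.values())

centres = list(query_centres(20_000, 17_500, 50))
print(len(centres))  # 400 * 350 = 140000 queries

# Results of two neighbouring queries with an overlapping (made-up) venue.
raw = [{"id": "a"}, {"id": "b"}, {"id": "b"}, {"id": "c"}]
print(len(deduplicate(raw)))  # 3
```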

Page 32: Myths and challenges in knowledge extraction and analysis from human-generated content

Isn’t data science sexy?

Page 33: Myths and challenges in knowledge extraction and analysis from human-generated content

[Chart: College & University; y-axis from 0 to 1400; annotations mark five "weekend" intervals and three "No access" intervals]

Page 34: Myths and challenges in knowledge extraction and analysis from human-generated content

[Chart: Event; y-axis from 0 to 70; annotations mark five "weekend" intervals and the "events"]

Page 35: Myths and challenges in knowledge extraction and analysis from human-generated content

The Skeptic Approach

Page 36: Myths and challenges in knowledge extraction and analysis from human-generated content

The Pragmatic Approach

Page 37: Myths and challenges in knowledge extraction and analysis from human-generated content

The (pseudo) Practitioner Approach

Page 38: Myths and challenges in knowledge extraction and analysis from human-generated content

Problem 4. Content Bias

(of the source)

Page 39: Myths and challenges in knowledge extraction and analysis from human-generated content

Data vs. Question

• Are they aligned?
• The usual problem of representativeness of the sample…
  • At a different scale
  • With much less control
• Example: the different pictures of the city

Page 40: Myths and challenges in knowledge extraction and analysis from human-generated content

Foursquare check-ins

Copyright © Milano-Hub project @ Politecnico di Milano

Page 41: Myths and challenges in knowledge extraction and analysis from human-generated content

Flickr

Copyright © Milano-Hub project @ Politecnico di Milano

Page 42: Myths and challenges in knowledge extraction and analysis from human-generated content

Instagram

Copyright © Milano-Hub project @ Politecnico di Milano

Page 43: Myths and challenges in knowledge extraction and analysis from human-generated content

Instagram

Copyright © Milano-Hub project @ Politecnico di Milano

Page 44: Myths and challenges in knowledge extraction and analysis from human-generated content

Cities into cities, by language
http://urbanscope.polimi.it

Page 45: Myths and challenges in knowledge extraction and analysis from human-generated content

Bias of the Source
• Technology
• Audience / Users / Adopters
• Behaviour

Page 46: Myths and challenges in knowledge extraction and analysis from human-generated content

Problem 5. Granularity

(time, space, …)

Page 47: Myths and challenges in knowledge extraction and analysis from human-generated content

Example. Space Granularity: the Grid
• Regular squared grid
• Irregular grid with official business-driven meaning
• Irregular grid with data-driven definition

12/4

Page 48: Myths and challenges in knowledge extraction and analysis from human-generated content

Cities into cities
http://urbanscope.polimi.it

Page 49: Myths and challenges in knowledge extraction and analysis from human-generated content

But other dimensions matter too
• Time
• Categories
• Economic value
• …

Page 50: Myths and challenges in knowledge extraction and analysis from human-generated content

Problem 6. Availability & Access

Page 51: Myths and challenges in knowledge extraction and analysis from human-generated content

Google Places

Only in the UI (scraping)

Via API

Page 52: Myths and challenges in knowledge extraction and analysis from human-generated content

Problem 7. Consistency

Page 53: Myths and challenges in knowledge extraction and analysis from human-generated content

Bringing Things Together

Space-text similarity between Google and Foursquare
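One way to picture a space-text similarity is a score that blends geographic closeness with name similarity. The sketch below is an assumption-laden illustration, not the method behind this slide: the 100 m distance cut-off, the equal weighting, the use of `difflib.SequenceMatcher` for names and the sample venues are all hypothetical choices.

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in metres."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def name_similarity(a, b):
    """Ratio in [0, 1] of matching characters between two normalized names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def space_text_score(v1, v2, max_dist_m=100.0, text_weight=0.5):
    """Blend spatial closeness and name similarity into one score in [0, 1]."""
    dist = haversine_m(v1["lat"], v1["lon"], v2["lat"], v2["lon"])
    space = max(0.0, 1.0 - dist / max_dist_m)  # 1 at 0 m, 0 beyond the cut-off
    text = name_similarity(v1["name"], v2["name"])
    return text_weight * text + (1 - text_weight) * space

# Made-up records standing in for a Google place and two Foursquare venues.
google = {"name": "Caffè Duomo", "lat": 45.4641, "lon": 9.1900}
same_place = {"name": "caffe duomo", "lat": 45.4642, "lon": 9.1901}
other_place = {"name": "Pizzeria Napoli", "lat": 45.5000, "lon": 9.2500}

print(space_text_score(google, same_place) > space_text_score(google, other_place))
```

Pairs scoring above a chosen threshold can then be treated as candidate matches for the same real-world venue.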

Page 54: Myths and challenges in knowledge extraction and analysis from human-generated content

Problem 8. Size

Page 55: Myths and challenges in knowledge extraction and analysis from human-generated content

Data is big!

1 GigaByte of Data

one billion (10^9) or, strictly, 2^30 bytes

Page 56: Myths and challenges in knowledge extraction and analysis from human-generated content

1 ZettaByte of Data

one sextillion (10^21) or, strictly, 2^70 bytes

Page 57: Myths and challenges in knowledge extraction and analysis from human-generated content

The Fashion Week in Milano #MFW

Page 58: Myths and challenges in knowledge extraction and analysis from human-generated content

Mobile Phone Data
• Mobile phone calls & messages: 5 to 10 million per day in a city like Milan
• Trackable user events (incl. data traffic): 1,000 per user per day

Page 59: Myths and challenges in knowledge extraction and analysis from human-generated content

IoT Sensors
• People counters: 1 event per second (or less)
  • 86K+ events per day per sensor
• Industrial machine sensors: 100 measurements per second
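The daily volumes follow directly from the sampling rates. A short worked computation, taking one event per second as the reference rate for a people counter:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day

people_counter_per_day = 1 * SECONDS_PER_DAY    # 1 event per second
machine_sensor_per_day = 100 * SECONDS_PER_DAY  # 100 measurements per second

print(people_counter_per_day)  # 86400
print(machine_sensor_per_day)  # 8640000
```

A single industrial sensor therefore produces over 8.6 million measurements per day, two orders of magnitude more than a people counter.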

Page 60: Myths and challenges in knowledge extraction and analysis from human-generated content

Human computation and crowdsourcing

Page 61: Myths and challenges in knowledge extraction and analysis from human-generated content

… and now …

Examples and Cases

Page 62: Myths and challenges in knowledge extraction and analysis from human-generated content

Use Case #1: Fashion

The Milano Fashion Week

Page 63: Myths and challenges in knowledge extraction and analysis from human-generated content

Response of Social Media #MFW
• MILANO FASHION WEEK #MFW
• We have 2 signals:
  • The first comes from social media (here, Instagram only)
  • The second is derived from the official calendar of events

Page 64: Myths and challenges in knowledge extraction and analysis from human-generated content

Research Questions

Are live events still relevant?

Can online visibility be described simply by how famous the brand is?

Do space and time still matter?

Can we predict how people behave in time/space within events?

Page 65: Myths and challenges in knowledge extraction and analysis from human-generated content

Discover more about the #MFW case
• https://marco-brambilla.com/2017/04/04/social-media-behaviour-during-live-events-the-milano-fashion-week-mfw-case-www2017/ (INCLUDING SLIDES)

Page 66: Myths and challenges in knowledge extraction and analysis from human-generated content

Use Case #2: Design

The Milano Design Week & FuoriSalone

Page 67: Myths and challenges in knowledge extraction and analysis from human-generated content

Data sources of the analysis

• Fuorisalone Official database
  • events/locations/itineraries
• Fuorisalone Official App
  • GPS positions [1] of the App users
  • Events inserted in the agenda on the App
  • Private social posts (Facebook) of App users [2]
• SocialMedia Listener
  • Keyword-based public social posts (Twitter/Instagram)
  • Semantic analysis

[1] when the App was running
[2] to use some App features the users had to perform a social login

Page 68: Myths and challenges in knowledge extraction and analysis from human-generated content

• Data elements are georeferenced and aggregated by citypixel (100 x 100 m squares)
• Merging multiple data sources makes it possible to infer information:
  • Which events attract more visitors?
  • Which areas have the largest presence of visitors?
  • Do people talk on the social networks about the events they are interested in?
  • Do people use social networks while visiting the events?
  • ...

Fusing the data
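A minimal sketch of how georeferenced elements from different sources might be binned into 100 x 100 m cells and counted per source. Everything here is an assumption for illustration: the grid origin, the flat-earth approximation (reasonable only at city scale), the source labels and the sample points are not taken from the project.

```python
import math
from collections import Counter

CELL_M = 100.0             # side of a square cell, in metres
M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude

def citypixel(lat, lon, origin_lat, origin_lon, cell_m=CELL_M):
    """Map a coordinate to the (row, col) index of its grid cell."""
    dy = (lat - origin_lat) * M_PER_DEG_LAT
    dx = (lon - origin_lon) * M_PER_DEG_LAT * math.cos(math.radians(origin_lat))
    return (math.floor(dy / cell_m), math.floor(dx / cell_m))

def aggregate(points, origin_lat, origin_lon):
    """Count data elements per (cell, source) pair."""
    counts = Counter()
    for source, lat, lon in points:
        cell = citypixel(lat, lon, origin_lat, origin_lon)
        counts[(cell, source)] += 1
    return counts

# Hypothetical grid origin (south-west corner) and made-up data elements.
ORIGIN = (45.40, 9.10)
points = [
    ("app_gps", 45.4005, 9.1005),
    ("instagram", 45.4005, 9.1006),
    ("instagram", 45.4100, 9.1200),
]

counts = aggregate(points, *ORIGIN)
for (cell, source), n in sorted(counts.items()):
    print(cell, source, n)
```

Once every source is expressed on the same grid, questions such as "which areas have the largest presence of visitors" reduce to comparing counts per cell.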

Page 69: Myths and challenges in knowledge extraction and analysis from human-generated content

Use Case #3: Como Smart City

Page 70: Myths and challenges in knowledge extraction and analysis from human-generated content

Approach

City-scale: mobile telephone and (coarse-grained geo-located) social media data

Street/square: people counting & profiling IoT sensors

Point of Interest: people counting sensor, WiFi log analysis, beacons and (fine-grained geo-located) social media

Descriptive, predictive, privacy-preserving and, when needed, real-time analysis of a variety of (fused) data sources

Page 71: Myths and challenges in knowledge extraction and analysis from human-generated content

Integration

Personalized information/offers, city loyalty cards, digital coupons, and polling

Proximity detection via NFC or BLE/Beacons

Page 72: Myths and challenges in knowledge extraction and analysis from human-generated content

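Measuring

People counting and profiling via mobile data

[Dashboard: 24,512 people present ("More people than usual"); breakdowns by tourists/citizens, female/male and private/business, showing the percentages 41%, 71%, 63%, 59%, 29% and 37% (their pairing with the labels is not recoverable from the extraction); age axis from 10 to 70]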

Page 73: Myths and challenges in knowledge extraction and analysis from human-generated content

Measuring

People counting via 3D camera

Page 74: Myths and challenges in knowledge extraction and analysis from human-generated content

Dashboards

Why people are there

CrowdInsights

Page 75: Myths and challenges in knowledge extraction and analysis from human-generated content

Dashboards

Why people are there

CrowdInsights

Page 76: Myths and challenges in knowledge extraction and analysis from human-generated content

Phone data

7 Areas
1. Walled city (Città murata)
2. Lake, Viale Geno shore
3. Lake
4. Lake, Villa Olmo shore
5. Industrial zone
6. Brunate
7. Business and university

Page 77: Myths and challenges in knowledge extraction and analysis from human-generated content

Social

Page 78: Myths and challenges in knowledge extraction and analysis from human-generated content
Page 79: Myths and challenges in knowledge extraction and analysis from human-generated content

http://www.socialometers.com/balocchi/

Page 80: Myths and challenges in knowledge extraction and analysis from human-generated content

Use Case #4:

Knowledge Updater

Page 81: Myths and challenges in knowledge extraction and analysis from human-generated content

Overview

Page 82: Myths and challenges in knowledge extraction and analysis from human-generated content

Famous Emerging

Page 83: Myths and challenges in knowledge extraction and analysis from human-generated content

Knowledge Enrichment Setting

[Diagram: knowledge enrichment setting. Instances: high-frequency (HF) entities 1 to 5 and low-frequency (LF) entities 1 to 4, linked through <<instanceof>> to a type hierarchy (Type1, Type11, Type111, Type2). Legend: seed entity, seed type, type of interest, expert inputs. Enrichment problems, marked "??" in the diagram: relations between HF and LF entities, relations between LF entities, typing of LF entities, extraction of new LF entities, and finding attribute values (Property1, Property2).]

Page 84: Myths and challenges in knowledge extraction and analysis from human-generated content

Emerging Knowledge Harvesting

Page 85: Myths and challenges in knowledge extraction and analysis from human-generated content

Discover more
https://marco-brambilla.com/2017/04/06/extracting-emerging-knowledge-from-social-media-www2017/

(SLIDES INCLUDED)

Page 86: Myths and challenges in knowledge extraction and analysis from human-generated content

Concluding: plenty of issues

And also plenty of application scenarios in which to benchmark ideas!

Page 87: Myths and challenges in knowledge extraction and analysis from human-generated content

THANKS! QUESTIONS?

Myths and Challenges in Extraction of Emerging Knowledge from Human-generated Content

Marco Brambilla @marcobrambi [email protected]
datascience.deib.polimi.it
http://home.deib.polimi.it/marcobrambi