Myths and challenges in knowledge extraction and analysis from human-generated content

Marco Brambilla

marco.brambilla@polimi.it
@marcobrambi

Knowledge, Behaviour and Feature Extraction with Big Data Science

Your Data, My Problem

Problem 1. The Complexity of Knowledge

There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.

Shakespeare (Hamlet Act 1, scene 5)

The Answer to the Great Question... Of Life, the Universe and Everything

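[Figure: Data, Information, Knowledge, Wisdom, plotted on the axes "Context independence" and "Understanding"]
Data → Information: understanding relations
Information → Knowledge: understanding patterns
Knowledge → Wisdom: understanding principles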

Formalizing evolving knowledge is hard

Only high frequency emerges

The long tail challenge

The Evolving Knowledge

[Figure: regions a, b, c, ¬c, d; labels: known, social, factoid, potentially emerging, potentially decaying, actual and solid]

Information and Knowledge Extraction

Heaven and Earth

Are they so different?

The Digital “Heaven”

Vs.

The “physical” Earth

Heaven and Earth

How to peer into the world through an effective window?

INGREDIENTS
Social media, IoT, … – the data
Domain experts – the context

[photo: http://hoglundassociates.com/Images/Cloud_Gate.jpg]

The digital reflection of our life is sharpening


Data source          Around for       Frequency        Delay
Census data          100s of years    years            months
Newspaper            100s of years    days             1 day
Weather sensors      10s of years     hours/minutes    hours/minutes
TV news              10s of years     hours            minutes
Traffic sensors      years            15 minutes       minutes
Call Data Records    years            15 minutes       hours
Social media         years            seconds          seconds
IoT                  recently         milliseconds     milliseconds

Source: Emanuele Della Valle

The data evolution

Data piles up without easing decision making

I have to decide: A or B?
Why not C? What if D?

Source: Emanuele Della Valle

But we would like to …
• fuse all those data sources
• make sense of the fused information

Definitely E!

Source: Emanuele Della Valle

The MacroScope

Joël de Rosnay, The Macroscope, 1979

Problem 2. Cognitive Bias

(of the observer)

the streetlamp effect

The bias of the observer

Strategy and Inaccuracy

Use Case: City

Model of social media and reality sensing


Problem 3. Data Quality

Data Quality Issue

Gartner report: in 2017, 33% of the largest global companies will experience an information crisis due to their inability to adequately value, govern and trust their enterprise information.

If you torture the data long enough, it will confess to anything

– Darrell Huff

The Vicious Cycle of Bad Data

Bad Data → Incorrect Analysis → Invalid Insights → Wrong Decisions → Poor Outcome

Conventional Definition of Data Quality

• Accuracy: the data was recorded correctly.
• Completeness: all relevant data was recorded.
• Uniqueness: entities are recorded once.
• Timeliness: the data is kept up to date (and time consistency is granted).
• Consistency: the data agrees with itself.
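The quality dimensions above can be turned into simple, automatable checks. The sketch below is only an illustration: the venue records, the field names (`id`, `name`, `lat`, `lon`, `updated`) and the thresholds are all invented for the example and do not come from the talk.

```python
from collections import Counter
from datetime import date

# Hypothetical venue records; field names and values are illustrative only.
records = [
    {"id": 1, "name": "Bar Duomo", "lat": 45.464, "lon": 9.190, "updated": date(2017, 3, 1)},
    {"id": 2, "name": "", "lat": 45.470, "lon": 9.200, "updated": date(2017, 3, 2)},
    {"id": 2, "name": "Cafe Brera", "lat": 45.472, "lon": 9.187, "updated": date(2017, 3, 3)},
    {"id": 3, "name": "Test", "lat": 0.0, "lon": 0.0, "updated": date(2015, 1, 1)},
]

def completeness(rows, field):
    """Completeness: share of rows with a non-empty value for `field`."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def duplicate_ids(rows):
    """Uniqueness: identifiers that appear more than once."""
    counts = Counter(r["id"] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def out_of_area(rows, lat_range, lon_range):
    """Accuracy proxy: ids of rows whose coordinates fall outside the expected box."""
    return [r["id"] for r in rows
            if not (lat_range[0] <= r["lat"] <= lat_range[1]
                    and lon_range[0] <= r["lon"] <= lon_range[1])]

def stale(rows, cutoff):
    """Timeliness: ids of rows last updated before `cutoff`."""
    return [r["id"] for r in rows if r["updated"] < cutoff]

print(completeness(records, "name"))
print(duplicate_ids(records))
print(out_of_area(records, (45.0, 46.0), (9.0, 9.5)))
print(stale(records, date(2017, 1, 1)))
```

Each function reports violations rather than fixing them; deciding what to do with a flagged record is a domain decision.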

Why is Data “Dirty”?

• Dummy Values
• Absence of Data
• Multipurpose Fields
• Cryptic Data
• Contradicting Data
• Shared Field Usage
• Inappropriate Use of Fields
• Violation of Business Rules
• Reused Primary Keys
• Non-Unique Identifiers
• Data Integration Problems

Data Wrangling a.k.a.
• Data Preprocessing
• Data Preparation
• Data Cleansing
• Data Scrubbing
• Data Munging
• Data Transformation
• Data Fold, Spindle, Mutilate…
• (good old) ETL

Foursquare
• Check-ins explicitly performed in venues all around the world
• Data set: geo-localized Foursquare venues, collected through a query every 50 m with radius > 50 m over:
  • Milan area: 20 km x 17.5 km
• Some numbers
  • Total number of venues: 90K (dirty)
  • Total number of valid venues: 43K
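To get a feel for the collection effort, here is a minimal sketch of a regular query grid over a 20 km x 17.5 km area with a 50 m step. The function names and the venue dictionaries are hypothetical; the actual collection code is not shown in the slides. Since the query radius is larger than the grid step, neighbouring queries overlap and can return the same venue more than once, so the sketch also deduplicates by identifier.

```python
import math

def query_grid(width_m, height_m, step_m):
    """Yield (x, y) offsets in metres of query centres on a regular grid."""
    nx = int(width_m // step_m) + 1
    ny = int(height_m // step_m) + 1
    for i in range(nx):
        for j in range(ny):
            yield (i * step_m, j * step_m)

def min_cover_radius(step_m):
    """Smallest query radius that leaves no gap between grid points (half the cell diagonal)."""
    return step_m / math.sqrt(2)

def dedupe(venues):
    """Keep the first occurrence of each venue id (hypothetical 'id' field)."""
    seen = {}
    for v in venues:
        seen.setdefault(v["id"], v)
    return list(seen.values())

points = list(query_grid(20_000, 17_500, 50))
print(len(points))                      # 401 * 351 = 140751 query centres
print(round(min_cover_radius(50), 1))   # about 35.4 m

# Two overlapping queries returning the same venue (made-up data).
raw = [{"id": "a", "name": "Bar"}, {"id": "b", "name": "Museo"}, {"id": "a", "name": "Bar"}]
print(len(dedupe(raw)))
```

Deduplication by id covers only one kind of dirt; which of the other problems listed earlier account for the gap between 90K and 43K venues is not stated in the slides.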

Isn’t data science sexy?

[Chart: “College & University”; y-axis 0 to 1400; annotated periods: weekend, no access]
[Chart: “Event”; y-axis 0 to 70; annotated periods: weekend, events]

The skeptic approach

The Pragmatic Approach

The (pseudo) Practitioner Approach

Problem 4. Content Bias

(of the source)

Data vs. Question

• Are they aligned?
• The usual problem of representativeness of the sample…
  • At a different scale
  • With much less control

• Example: the different pictures of the city

Foursquare check-ins
Copyright © Milano-Hub project @ Politecnico di Milano

Flickr


Instagram


Instagram



Cities into cities, by language
http://urbanscope.polimi.it

Bias of the Source
• Technology
• Audience / Users / Adopters
• Behaviour

Problem 5. Granularity

(time, space, …)

Example. Space Granularity: the Grid
• Regular squared grid
• Irregular grid with official business-driven meaning
• Irregular grid with data-driven definition
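For the first option, a regular squared grid, assigning a geo-located point to a cell is a small computation. The sketch below assumes an equirectangular approximation around a chosen origin, which is reasonable at city scale; the function name, the origin and the cell size are illustrative choices, not values from the talk.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, metres

def cell_of(lat, lon, origin_lat, origin_lon, cell_m):
    """Return (row, col) of the square cell of side `cell_m` containing (lat, lon).

    Uses a local equirectangular approximation around the origin,
    so it is only meant for areas of a few tens of kilometres.
    """
    dy = math.radians(lat - origin_lat) * EARTH_RADIUS_M
    dx = (math.radians(lon - origin_lon) * EARTH_RADIUS_M
          * math.cos(math.radians(origin_lat)))
    return (math.floor(dy / cell_m), math.floor(dx / cell_m))

# Hypothetical origin at the south-west corner of the study area.
print(cell_of(45.400, 9.000, 45.4, 9.0, 100))  # the origin itself
print(cell_of(45.401, 9.000, 45.4, 9.0, 100))  # about 111 m further north
```

The two irregular grids cannot be reduced to arithmetic like this: they need explicit polygons (for example administrative districts) or a clustering step over the data.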


Cities into cities
http://urbanscope.polimi.it

But other dimensions matter too
• Time
• Categories
• Economic value
• …

Problem 6. Availability & Access

Google Places

Only in the UI (scraping)
Via API

Problem 7. Consistency

Bringing Things Together
Space-text similarity between Google and Foursquare
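One plausible reading of "space-text similarity" is a score that combines geographic distance with name similarity when deciding whether a Google venue and a Foursquare venue are the same place. The sketch below is an assumption-laden illustration, not the method used in the project: the scoring formula, the 100 m cut-off and the record fields are all hypothetical.

```python
import math
from difflib import SequenceMatcher

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def name_similarity(a, b):
    """Text similarity in [0, 1] on lower-cased, stripped names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(g, f, max_dist_m=100.0):
    """Hypothetical combined score: 0 beyond max_dist_m, else proximity times name similarity."""
    d = haversine_m(g["lat"], g["lon"], f["lat"], f["lon"])
    if d > max_dist_m:
        return 0.0
    return (1.0 - d / max_dist_m) * name_similarity(g["name"], f["name"])

# Made-up records for illustration.
google = {"name": "Caffè Duomo", "lat": 45.4641, "lon": 9.1919}
fsq_same = {"name": "caffè duomo ", "lat": 45.4641, "lon": 9.1919}
fsq_far = {"name": "Caffè Duomo", "lat": 45.5000, "lon": 9.1919}

print(match_score(google, fsq_same))
print(match_score(google, fsq_far))
```

A multiplicative score is only one option; a weighted sum, or a hard distance threshold followed by text ranking, would be equally defensible starting points.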

Problem 8. Size

Data is big!

1 GigaByte of Data

one billion (10^9) or, strictly, 2^30 bytes

1 ZettaByte of Data

one sextillion (10^21) or, strictly, 2^70 bytes
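The gap between the decimal and the binary reading of these prefixes grows with the size of the unit, as a quick calculation shows. The variable names below are just for the example.

```python
# Decimal (power of ten) vs binary (power of two) readings of the same prefix.
giga_decimal = 10 ** 9
giga_binary = 2 ** 30       # 1,073,741,824 bytes
zetta_decimal = 10 ** 21
zetta_binary = 2 ** 70      # 1,180,591,620,717,411,303,424 bytes

print(giga_binary, round(giga_binary / giga_decimal, 4))     # about 7% larger
print(zetta_binary, round(zetta_binary / zetta_decimal, 4))  # about 18% larger
```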

The Fashion Week in Milano #MFW

• Mobile phone calls & messages: 5 to 10 million per day in a city like Milan
• Trackable user events (incl. data traffic): 1,000 per user per day

Mobile Phone Data

IoT Sensors
• People counters: 1 event per second (or less)
  • 86K+ events per day per sensor
• Industrial machine sensors: 100 measurements per second
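The daily volumes follow directly from the sampling rates. A small helper makes the arithmetic explicit; the function name is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def events_per_day(rate_per_second, sensors=1):
    """Daily event count for `sensors` devices emitting at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY * sensors

print(events_per_day(1))    # people counter at 1 event per second: 86400
print(events_per_day(100))  # industrial sensor at 100 measurements per second: 8640000
```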

Human computation and crowdsourcing

… and now …
Examples and Cases

Use Case #1: Fashion

The Milano Fashion Week

Response of Social Media #MFW
• MILANO FASHION WEEK #MFW
• We have 2 signals:
  • The first comes from social media (here we consider only Instagram)
  • The second is derived from the official calendar of events
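A first, very simple way to relate the two signals is to measure what share of the posts falls inside the time windows of the scheduled events. The sketch below uses invented timestamps and a hypothetical function name; it is not the analysis carried out in the #MFW study.

```python
from datetime import datetime, timedelta

def share_during_events(post_times, event_windows, slack=timedelta(0)):
    """Fraction of posts whose timestamp falls within any (start, end) event window.

    `slack` widens every window on both sides, e.g. to catch posts
    published shortly before or after a show.
    """
    if not post_times:
        return 0.0

    def in_any_window(t):
        return any(start - slack <= t <= end + slack for start, end in event_windows)

    return sum(1 for t in post_times if in_any_window(t)) / len(post_times)

# Made-up calendar entry and post timestamps, for illustration only.
events = [(datetime(2016, 2, 24, 10, 0), datetime(2016, 2, 24, 11, 0))]
posts = [
    datetime(2016, 2, 24, 10, 15),
    datetime(2016, 2, 24, 10, 50),
    datetime(2016, 2, 24, 11, 20),
    datetime(2016, 2, 24, 15, 0),
]

print(share_during_events(posts, events))                              # 0.5
print(share_during_events(posts, events, slack=timedelta(minutes=30)))  # 0.75
```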

Research Questions

• Are live events still relevant?
• Can online visibility be described simply by how famous the brand is?
• Do space and time still matter?
• Can we predict how people behave in time/space within events?

Discover more about the #MFW case
• https://marco-brambilla.com/2017/04/04/social-media-behaviour-during-live-events-the-milano-fashion-week-mfw-case-www2017/ (including slides)

Use Case #2: Design

The Milano Design Week & FuoriSalone

• Fuorisalone Official database
  • events/locations/itineraries
• Fuorisalone Official App
  • GPS positions (1) of the App users
  • Events inserted in the agenda on the App
  • Private social posts (Facebook) of App users (2)
• Social Media Listener
  • Keyword-based public social posts (Twitter/Instagram)
  • Semantic analysis

(1) when the App was running
(2) to use some App features the users had to perform a social login

Data sources of the analysis

• Data elements are georeferenced and aggregated by citypixel (100 m x 100 m squares)
• Merging multiple data sources makes it possible to infer information:
  • Which events attract more visitors?
  • Which areas have the largest presence of visitors?
  • Do people talk on social networks about the events they are interested in?
  • Do people use social networks while visiting the events?
  • ...

Fusing the data
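Aggregating every source onto the same 100 m citypixel grid is what makes the sources comparable. A minimal sketch, assuming the points have already been projected to metric coordinates relative to a common origin; the source names and coordinates are invented.

```python
from collections import Counter, defaultdict

CELL_M = 100  # citypixel side in metres

def citypixel(x_m, y_m, cell_m=CELL_M):
    """Index of the square cell containing a point given in projected metres."""
    return (int(x_m // cell_m), int(y_m // cell_m))

def fuse(sources):
    """Map each citypixel to a per-source count of the points that fall in it.

    `sources` is a dict: source name -> iterable of (x_m, y_m) points.
    """
    fused = defaultdict(Counter)
    for name, points in sources.items():
        for x, y in points:
            fused[citypixel(x, y)][name] += 1
    return fused

# Made-up points from two hypothetical sources.
sources = {
    "app_gps": [(120, 40), (150, 90), (310, 20)],
    "instagram": [(130, 70), (980, 450)],
}
fused = fuse(sources)
for cell in sorted(fused):
    print(cell, dict(fused[cell]))
```

With counts per cell and per source in one structure, questions such as "which areas have many visitors but few posts" become simple comparisons between columns.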

Use Case #3: Como smart city

Approach

City-scale: mobile telephone and (coarse-grained geo-located) social media data
Street/square: people counting & profiling IoT sensors
Point of Interest: people counting sensor, WiFi log analysis, beacons and (fine-grained geo-located) social media
Descriptive, predictive, privacy-preserving and, when needed, real-time analysis of a variety of (fused) data sources
Integration: personalized information/offers, city loyalty cards, digital coupons, and polling
Proximity detection via NFC or BLE/Beacons

Measuring
People counting and profiling via mobile data
[Dashboard: 24,512 people present (more people than usual); breakdowns: tourists vs. citizens, female vs. male, private vs. business, age (axis 10 to 70); percentages shown: 41%, 71%, 63%, 59%, 29%, 37%]

Measuring
People counting via 3D camera

Dashboards
Why people are there
CrowdInsights

7 Areas
1. Città murata (walled city)
2. Lake, Viale Geno shore
3. Lake
4. Lake, Villa Olmo shore
5. Industrial zone
6. Brunate
7. Business and university

Phone data

Social

http://www.socialometers.com/balocchi/

Use Case #4: Knowledge Updater

Overview

Famous Emerging

Knowledge Enrichment Setting

[Figure: Instances and Types. High Frequency Entities: HF Entity1 to HF Entity5. Low Frequency Entities: LF Entity1 to LF Entity4, plus unknown ones marked ??. Types: Type1, Type11, Type111, Type2, linked to instances by <<instanceof>>; some types unknown (??). Legend: Seed Entity, Seed Type, Type of interest]

Expert inputs

Enrichment problems

• Relations HF - LF entities
• Relations LF - LF entities
• Typing of LF entities
• Extraction of new LF entities
• Finding attribute values (Property1, Property2: ??)

Emerging Knowledge Harvesting

Discover more
https://marco-brambilla.com/2017/04/06/extracting-emerging-knowledge-from-social-media-www2017/ (slides included)

Concluding… Plenty of issues

And also plenty of application scenarios in which to benchmark ideas!

THANKS! QUESTIONS?

Myths and Challenges in Extraction of Emerging Knowledge from Human-generated Content

Marco Brambilla
@marcobrambi
marco.brambilla@polimi.it
http://datascience.deib.polimi.it
http://home.deib.polimi.it/marcobrambi