Myths and Challenges in Knowledge Extraction and Analysis
from Human-generated Content
Marco Brambilla
[email protected] @marcobrambi
Knowledge, Behaviour and Feature Extraction with Big Data Science
Your Data, My Problem
Problem 1. The Complexity of Knowledge
There are more things in heaven and earth, Horatio, / Than are dreamt of in your philosophy.
Shakespeare (Hamlet Act 1, scene 5)
The Answer to the Great Question... Of Life, the Universe and Everything
Data
Information
Knowledge
Wisdom
Context independence
Understanding
Understanding relations
Understanding patterns
Understanding principles
Formalizing evolving knowledge is hard
Only high frequency emerges
The long tail challenge
The Evolving Knowledge
[Figure: the evolving knowledge. Regions labelled a, b, c, ¬c and d; annotations: known, social, factoid, potentially emerging, potentially decaying, actual and solid]
Information and Knowledge Extraction
Heaven and Earth
Are they so different?
The Digital “Heaven”
Vs.
The “physical” Earth
Heaven and Earth
How to peer into the world through an effective window?
INGREDIENTS
Social media, IoT, … – the data
Domain experts – the context
[photo: http://hoglundassociates.com/Images/Cloud_Gate.jpg]
The digital reflection of our life is sharpening
Data source | Around for | Frequency | Delay
Census data | 100s of years | years | months
Newspaper | 100s of years | days | 1 day
Weather sensors | 10s of years | hours/minutes | hours/minutes
TV news | 10s of years | hours | minutes
Traffic sensors | years | 15 minutes | minutes
Call Data Records | years | 15 minutes | hours
Social media | years | seconds | seconds
IoT | recently | milliseconds | milliseconds
Source: Emanuele Della Valle
The data evolution
Data piles up without easing decision making
I have to decide: A or B?
Why not C? What if D?
But we would like to …
fuse all those data sources
make sense of the fused information
Definitely E!
The MacroScope
Joël de Rosnay, The Macroscope, 1979
Problem 2. Cognitive Bias
(of the observer)
the streetlamp effect
The bias of the observer
Strategy and Inaccuracy
Use Case: City
Model of social media and reality sensing
Problem 3. Data Quality
Data Quality Issue
Gartner Report: In 2017, 33% of the largest global companies will experience an information crisis due to their inability to adequately value, govern and trust their enterprise information.
If you torture the data long enough, it will confess to anything
– Darrell Huff
The Vicious Cycle of Bad Data
Bad Data
Incorrect Analysis
Invalid Insights
Wrong Decisions
Poor Outcome
Conventional Definition of Data Quality
• Accuracy: the data was recorded correctly.
• Completeness: all relevant data was recorded.
• Uniqueness: entities are recorded once.
• Timeliness: the data is kept up to date (and time consistency is guaranteed).
• Consistency: the data agrees with itself.
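The five dimensions above can each be turned into a simple, measurable check. The following is a minimal sketch, assuming records are plain dictionaries. The field names (`venue_id`, `name`, `lat`, `lon`, `updated`), the sample rows and the one-year freshness threshold are hypothetical and not taken from the talk; accuracy in particular is approximated by a plausibility check, since true accuracy needs ground truth.

```python
from datetime import date

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"venue_id": 1, "name": "Duomo", "lat": 45.4641, "lon": 9.1919, "updated": date(2017, 3, 1)},
    {"venue_id": 2, "name": None, "lat": 45.4700, "lon": 9.1800, "updated": date(2017, 3, 2)},
    {"venue_id": 2, "name": "Castello", "lat": 45.4700, "lon": 9.1800, "updated": date(2017, 3, 2)},
    {"venue_id": 3, "name": "Navigli", "lat": 145.0, "lon": 9.1700, "updated": date(2012, 1, 1)},
]

REQUIRED = ("venue_id", "name", "lat", "lon", "updated")


def completeness(rows):
    """Share of rows in which every required field is present and not None."""
    return sum(all(r.get(f) is not None for f in REQUIRED) for r in rows) / len(rows)


def uniqueness(rows, key="venue_id"):
    """Distinct key values divided by row count (1.0 means no duplicates)."""
    return len({r[key] for r in rows}) / len(rows)


def accuracy(rows):
    """Share of rows with plausible coordinates (a range check, not ground truth)."""
    return sum(-90 <= r["lat"] <= 90 and -180 <= r["lon"] <= 180 for r in rows) / len(rows)


def timeliness(rows, today, max_age_days=365):
    """Share of rows updated within max_age_days of today."""
    return sum((today - r["updated"]).days <= max_age_days for r in rows) / len(rows)


def consistency(rows, key="venue_id"):
    """Share of keys whose rows all agree on the coordinates."""
    coords = {}
    for r in rows:
        coords.setdefault(r[key], set()).add((r["lat"], r["lon"]))
    return sum(len(c) == 1 for c in coords.values()) / len(coords)


report = {
    "accuracy": accuracy(records),
    "completeness": completeness(records),
    "uniqueness": uniqueness(records),
    "timeliness": timeliness(records, today=date(2017, 6, 1)),
    "consistency": consistency(records),
}
print(report)
```

Each score is a fraction between 0 and 1, so the same report can be tracked over time or compared across sources.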
Why is Data “Dirty” ?
• Dummy Values
• Absence of Data
• Multipurpose Fields
• Cryptic Data
• Contradicting Data
• Shared Field Usage
• Inappropriate Use of Fields
• Violation of Business Rules
• Reused Primary Keys
• Non-Unique Identifiers
• Data Integration Problems
Data Wrangling a.k.a.
• Data Preprocessing
• Data Preparation
• Data Cleansing
• Data Scrubbing
• Data Munging
• Data Transformation
• Data Fold, Spindle, Mutilate…
• (good old) ETL
Foursquare
• Check-ins explicitly performed in venues all around the world
• Data set: geo-localized Foursquare venues, collected through a query every 50 m with radius > 50 m over:
  • Milan area: 20 km x 17.5 km
• Some numbers
  • Total number of venues: 90K (dirty)
  • Total number of valid venues: 43K
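The collection figures above imply a sizeable crawl. The sketch below estimates it under stated assumptions: query centres lie on a regular square lattice with one query per lattice point, and the area is exactly 20 km by 17.5 km. The talk does not describe the crawler's actual layout, so the function name and lattice choice are illustrative.

```python
import math

AREA_WIDTH_M = 20_000    # Milan area: 20 km
AREA_HEIGHT_M = 17_500   # by 17.5 km
STEP_M = 50              # one query every 50 m


def grid_points(width_m, height_m, step_m):
    """Yield (x, y) offsets in metres for a regular lattice of query centres."""
    for ix in range(width_m // step_m):
        for iy in range(height_m // step_m):
            yield ix * step_m, iy * step_m


n_queries = sum(1 for _ in grid_points(AREA_WIDTH_M, AREA_HEIGHT_M, STEP_M))

# Smallest radius that leaves no uncovered gap on a square lattice:
# half the diagonal of one lattice cell.
min_radius_m = STEP_M * math.sqrt(2) / 2

# Share of collected venues that were valid (43K out of 90K).
valid_share = 43_000 / 90_000

print(n_queries, round(min_radius_m, 1), round(valid_share, 3))
```

Because the query radius (> 50 m) is larger than this minimum covering radius, neighbouring queries overlap, so the same venue can be returned by more than one query.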
Isn’t data science sexy?
[Chart: College & University; vertical axis from 0 to 1400; annotated periods: weekend, no access]
[Chart: Event; vertical axis from 0 to 70; annotated periods: weekend, events]
The skeptic approach
The Pragmatic Approach
The (pseudo) Practitioner Approach
Problem 4. Content Bias
(of the source)
Data vs. Question
• Are they aligned?
• The usual problem of representativeness of the sample…
  • At a different scale
  • With much less control
• Example: the different pictures of the city
Foursquare
Checkins
Copyright © Milano-Hub project @ Politecnico di Milano
Flickr
Cities into cities, by language
http://urbanscope.polimi.it
Bias of the Source
• Technology
• Audience / Users / Adopters
• Behaviour
Problem 5. Granularity
(time, space, …)
Example. Space Granularity: the Grid
• Regular squared grid
• Irregular grid with official business-driven meaning
• Irregular grid with data-driven definition
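For the first option, the regular squared grid, the effect of granularity can be shown by assigning the same points to cells of different sizes. This is a minimal sketch, assuming a local equirectangular approximation (adequate at city scale); the origin coordinates and the two sample points are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius

# Hypothetical south-west corner of the study area (roughly Milan); an assumption.
ORIGIN_LAT, ORIGIN_LON = 45.40, 9.05


def to_cell(lat, lon, cell_m):
    """Return the (column, row) of the square cell of side cell_m containing the point."""
    dy = math.radians(lat - ORIGIN_LAT) * EARTH_RADIUS_M
    dx = math.radians(lon - ORIGIN_LON) * EARTH_RADIUS_M * math.cos(math.radians(ORIGIN_LAT))
    return int(dx // cell_m), int(dy // cell_m)


# Two hypothetical points a couple of hundred metres apart.
a = (45.4642, 9.1900)
b = (45.4655, 9.1877)

same_cell_100m = to_cell(*a, 100) == to_cell(*b, 100)
same_cell_1km = to_cell(*a, 1000) == to_cell(*b, 1000)
print(same_cell_100m, same_cell_1km)
```

The two points fall into different cells at 100 m resolution but share a cell at 1 km, so any per-cell statistic depends on the chosen cell size.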
12/4
Cities into cities
http://urbanscope.polimi.it
But other dimensions matter too
• Time
• Categories
• Economic value
• …
Problem 6. Availability & Access
Google Places
Only in the UI (scraping)
Via API
Problem 7. Consistency
Bringing Things Together
Space-text similarity between Google and Foursquare
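One possible reading of a space-text similarity is a score that mixes geographic closeness with name similarity. The sketch below is an illustration under assumptions, not the method used in the talk: it uses the haversine distance, the `difflib.SequenceMatcher` ratio from the Python standard library, equal weights and a 200 m distance cut-off. The venue records are hypothetical.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def space_text_similarity(a, b, max_dist_m=200.0, w_space=0.5):
    """Score in [0, 1] mixing spatial closeness and name similarity.

    max_dist_m and w_space are illustrative choices, not values from the talk.
    """
    dist = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
    space = max(0.0, 1.0 - dist / max_dist_m)
    text = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return w_space * space + (1.0 - w_space) * text


# Hypothetical venue records from the two sources.
google = {"name": "Caffè Centrale", "lat": 45.46420, "lon": 9.19000}
fsq_same = {"name": "Caffe Centrale Milano", "lat": 45.46425, "lon": 9.19005}
fsq_other = {"name": "Libreria Rizzoli", "lat": 45.46600, "lon": 9.19000}

score_same = space_text_similarity(google, fsq_same)
score_other = space_text_similarity(google, fsq_other)
print(score_same > score_other)
```

A threshold on the score would then decide which pairs are treated as the same venue; where to set it is a tuning decision that depends on the data.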
Problem 8. Size
Data is big!
1 GigaByte of Data
(10^9) or, strictly, 2^30 bytes
1 ZettaByte of Data
one sextillion (10^21) or, strictly, 2^70 bytes
The Fashion Week in Milano #MFW
Mobile Phone Data
• Mobile phone calls & messages: 5 to 10 million per day in a city like Milan
• Trackable user events (incl. data traffic): 1,000 per user per day
IoT Sensors
• People counters: 1 event per second (or less)
  • 86K+ events per day per sensor
• Industrial machine sensors: 100 measurements per second
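The daily volumes follow directly from the per-second rates. A short worked sketch, assuming constant rates over 24 hours; the deployment size `N_COUNTERS` is a hypothetical figure used only to show how the numbers scale.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(rate_per_second, n_sources=1):
    """Daily event count for n_sources emitting at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY * n_sources


people_counter = events_per_day(1)      # 1 event per second
machine_sensor = events_per_day(100)    # 100 measurements per second

# Hypothetical deployment size, chosen only to illustrate scaling.
N_COUNTERS = 50
city_counters = events_per_day(1, N_COUNTERS)

print(people_counter, machine_sensor, city_counters)
```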
Human computation and crowdsourcing
… and now …
Examples and Cases
Use Case #1: Fashion
The Milano Fashion Week
Response of Social Media #MFW
• MILANO FASHION WEEK #MFW
• We have 2 signals:
  • The first comes from social media (in this case, only Instagram)
  • The second is derived from the official calendar of events
Research Questions
“Are live events still relevant?”
Can online visibility be described simply by how famous the brand is?
Do space and time still matter?
Can we predict how people behave in time/space within events?
Discover more about the #MFW case
• https://marco-brambilla.com/2017/04/04/social-media-behaviour-during-live-events-the-milano-fashion-week-mfw-case-www2017/ (INCLUDING SLIDES)
Use Case #2: Design
The Milano Design Week & FuoriSalone
Data sources of the analysis
• Fuorisalone official database
  • Events / locations / itineraries
• Fuorisalone official app
  • GPS positions of the app users [1]
  • Events inserted in the agenda on the app
  • Private social posts (Facebook) of app users [2]
• Social media listener
  • Keyword-based public social posts (Twitter/Instagram)
  • Semantic analysis
[1] When the app was running.
[2] To use some app features, users had to perform a social login.
Fusing the data
• Data elements are georeferenced and aggregated by citypixel (100 x 100 m squares)
• Merging multiple data sources makes it possible to infer information:
  • Which events attract more visitors?
  • Which areas have the largest presence of visitors?
  • Do people talk on social networks about the events they are interested in?
  • Do people use social networks while visiting the events?
  • ...
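Aggregating by citypixel gives every source a common key, which is what makes the fusion possible. The following is a minimal sketch, assuming each data element already carries projected coordinates in metres; the source names and the sample points are hypothetical and stand in for the real app, social and event data.

```python
from collections import Counter, defaultdict

PIXEL_M = 100  # citypixel side: 100 x 100 m squares


def citypixel(x_m, y_m):
    """Map projected coordinates in metres to a (column, row) citypixel id."""
    return int(x_m // PIXEL_M), int(y_m // PIXEL_M)


def aggregate(points):
    """Count data elements per citypixel."""
    return Counter(citypixel(x, y) for x, y in points)


# Hypothetical georeferenced elements per source, as (x, y) in metres.
sources = {
    "app_gps": [(120, 80), (150, 40), (310, 220)],
    "social_posts": [(130, 90), (305, 250), (320, 210)],
    "events": [(140, 60)],
}

# Fuse: one entry per citypixel, holding one count per source.
fused = defaultdict(dict)
for name, points in sources.items():
    for pixel, count in aggregate(points).items():
        fused[pixel][name] = count

for pixel in sorted(fused):
    print(pixel, fused[pixel])
```

With the counts side by side, questions such as "which areas have the largest presence of visitors" become comparisons between citypixels, and the gap between visits and posts in the same citypixel hints at whether people post while visiting.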
Use Case #3: Como Smart City
Approach
City scale: mobile telephone and (coarse-grain geo-located) social media data
Street/square: people counting & profiling IoT sensors
Point of interest: people counting sensor, WiFi log analysis, beacons and (fine-grain geo-located) social media
Descriptive, predictive, privacy-preserving and, when needed, real-time analysis of a variety of (fused) data sources
Integration
Personalized information/offers, city loyalty cards, digital coupons, and polling
Proximity detection via NFC or BLE/Beacons
Measuring
People counting and profiling via mobile data
[Dashboard: 24,512 people present (more people than usual); breakdowns by tourists/citizens, female/male, private/business and age (10 to 70); percentages shown: 41%, 71%, 63%, 59%, 29%, 37%]
Measuring
People counting via 3D camera
Dashboards
Why people are there
CrowdInsights
7 Areas
1. Città murata
2. Lago sponda Viale Geno
3. Lago
4. Lago sponda di Villa Olmo
5. Zona industriale
6. Brunate
7. Business e università
Phone data
Social
http://www.socialometers.com/balocchi/
Use Case #4: Knowledge Updater
Overview
Famous Emerging
…
Knowledge Enrichment Setting
[Diagram: instances and types linked by <<instanceof>> relations. Instances: high-frequency entities (HF Entity1 to HF Entity5) and low-frequency entities (LF Entity1 to LF Entity4), with unknown elements marked ??. Types: Type1, Type11, Type111, Type2. Legend: seed entity, seed type, type of interest, expert inputs]
Enrichment problems
• Finding attribute values (Property1, Property2)
• Relations HF - LF entities
• Relations LF - LF entities
• Typing of LF entities
• Extraction of new LF entities
Emerging Knowledge Harvesting
Discover more
https://marco-brambilla.com/2017/04/06/extracting-emerging-knowledge-from-social-media-www2017/
(SLIDES INCLUDED)
Concluding... plenty of issues
And also plenty of application scenarios in which to benchmark ideas!
THANKS! QUESTIONS?
Myths and Challenges in Extraction of Emerging Knowledge
from Human-generated Content
Marco Brambilla @marcobrambi [email protected]
http://datascience.deib.polimi.it
http://home.deib.polimi.it/marcobrambi