Knowledge-base Enabled Information Filtering on Social Web
Pavan Kapanipathi
Kno.e.sis Center, Wright State University
Advisor: Amit Sheth
1
Kno.e.sis
2
Social Web in 60 secs
500M users generate 500M tweets per day
4
Disaster Management Organizations utilize Social Web
35% of 20M tweets during Hurricane Sandy shared information and news about the disaster
5
Healthcare Issues
6
Personalized Filtering on Social Web
Following Dynamically Evolving Topics as interests
8
Personalization on Social Web
• Following Dynamically Evolving Topics
  • Indian Elections
  • US Elections
  • Healthcare Debate
9
Dynamic Topics
Continuously Evolving on Twitter
Entity – Event relevance changes
Many entities are involved
12
Dynamic Topics
Manually crawl using keywords
“indianelection” “jan25” “sandy”
“swineflu” “ebola”
13
Dynamic Topics
Manually updating keywords to get topic-relevant tweets is not feasible
“indianelection” “modi” “bjp”
“congress”
“jan25” “egypt” “tunisia”
“arabspring”
“sandy” “newyork” “redcross” “fema”
“swineflu” “ebola”
14
Problem
How can we automatically update the filters to track a dynamically evolving topic on Twitter?
15
Hashtags as Filters
• Identify a topic on Twitter
• Tweets with hashtags are more informative
• Users have a lot of freedom to create them
• Some get popular, most die
16
Exploring Hashtags as Evolving Filters for Dynamic Topics
Colorado Shooting
Occupy Wall Street
                   Colorado Shooting (CS)   Occupy Wall Street (OWS)
Tweets                            122,062                  6,077,378
Tags                              192,512                 15,963,209
Distinct                           12,350                    191,602
100% Retrieval                      7,763                     21,314
19
HASHTAG FILTERS
20
Colorado Shooting Occupy Wall Street
Event-related hashtags co-occur with each other
Hashtag Filters Co-occurrence Graph
22
Summarizing Hashtag Analysis
Starting with one of the event-relevant hashtags, we can reach other relevant hashtags by co-occurrence
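A minimal sketch of this idea, assuming each tweet is already reduced to its list of hashtags. The function names, the `min_count` parameter, and the toy tweets are illustrative and not from the talk; it builds a co-occurrence graph and walks it breadth-first from a seed hashtag.

```python
from collections import Counter, deque
from itertools import combinations


def cooccurrence_graph(tweets):
    """Count how often each pair of hashtags appears in the same tweet."""
    graph = {}
    for hashtags in tweets:
        for a, b in combinations(sorted(set(hashtags)), 2):
            graph.setdefault(a, Counter())[b] += 1
            graph.setdefault(b, Counter())[a] += 1
    return graph


def reachable_hashtags(graph, seed, min_count=1):
    """Breadth-first walk from the seed over edges seen at least min_count times."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        tag = queue.popleft()
        for other, count in graph.get(tag, {}).items():
            if count >= min_count and other not in seen:
                seen.add(other)
                queue.append(other)
    return seen - {seed}


# Toy data (hypothetical tweets, one hashtag list per tweet).
tweets = [
    ["#sandy", "#newyork"],
    ["#sandy", "#fema"],
    ["#fema", "#redcross"],
    ["#ebola", "#swineflu"],
]
graph = cooccurrence_graph(tweets)
related = reachable_hashtags(graph, "#sandy")
```

Raising `min_count` prunes weak links; with the toy data above, `#ebola` is never reached from `#sandy` because the two never share a tweet.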
23
Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Too many co-occurring hashtags
24
Hashtag Filters distributions
25
Not surprising: it’s a power-law distribution
Hashtag distributions
26
The top 1% of hashtags retrieves around 85% of the tweets
Hashtag distributions
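One way to measure this kind of coverage is sketched below, assuming tweets are given as hashtag lists. The function name and the toy data are hypothetical; it reports the share of tweets containing at least one of the most frequent hashtags.

```python
import math
from collections import Counter


def coverage_of_top_hashtags(tweets, fraction):
    """Share of tweets that contain at least one of the top `fraction` of hashtags."""
    if not tweets:
        return 0.0
    # Count each hashtag once per tweet.
    counts = Counter(tag for hashtags in tweets for tag in set(hashtags))
    k = max(1, math.ceil(len(counts) * fraction))
    top = {tag for tag, _ in counts.most_common(k)}
    covered = sum(1 for hashtags in tweets if top & set(hashtags))
    return covered / len(tweets)


# Toy data: four distinct hashtags, so the top 25% is a single hashtag.
tweets = [["#ows"], ["#ows", "#nyc"], ["#ows"], ["#ows", "#occupy"], ["#p2"]]
share = coverage_of_top_hashtags(tweets, 0.25)
```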
27
Clustering Coefficient of Hashtag Co-occurrence Network (1%)
The top hashtags co-occur with each other the most
28
Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Co-occurring: Threshold δ
Preferably a prominent hashtag
29
Does Hashtag Co-occurrence Work?
o No. Co-occurrence alone does not work
  o Many noisy or unrelated hashtags co-occur
o Determine the “dynamic” relevance of the top co-occurring hashtag with the dynamic topic
30
Determining Relevancy of Co-occurring Hashtags
#indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2
Entity Extraction and Scoring
δ
Normalized Frequency Scoring
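A minimal sketch of normalized frequency scoring over the latest K tweets of a hashtag. It assumes entity extraction has already been done (one entity list per tweet, oldest first) and that counts are normalized by the most frequent entity; the slide does not state which normalization is used, so that choice is an assumption.

```python
from collections import Counter


def score_entities(tweet_entities, k=200):
    """Score entities in the latest k tweets by frequency, scaled to [0, 1]."""
    latest = tweet_entities[-k:]
    counts = Counter(entity for entities in latest for entity in entities)
    if not counts:
        return {}
    top = max(counts.values())  # assumed normalizer: the highest count
    return {entity: count / top for entity, count in counts.items()}


# Toy data (hypothetical entity annotations).
tweet_entities = [
    ["Narendra Modi", "BJP"],
    ["Narendra Modi"],
    ["Narendra Modi", "India"],
    ["BJP", "Narendra Modi"],
]
scores = score_entities(tweet_entities, k=200)
```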
31
Determining Relevancy of Co-occurring Hashtags (Vector
Space Model) #indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2
Entity Extraction and Scoring
Indian General Election,_2014
Dynamically Updated Background Knowledge
δ
32
Event Relevant Background Knowledge
o Wikipedia Event Pages
33
o Entities mentioned on the Event page of Wikipedia are relevant to the Event
Event Relevant Background Knowledge
35
o Wikipedia’s Hyperlink structure is very rich
  o Page-Page (Wikipedia) links
Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India) UPA (India)
BJP
Indian National Congress
Event Relevant Background Knowledge – Graph Structure
36
Determining Relevancy of Co-occurring Hashtags (Vector
Space Model) #indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2
Entity Extraction and Scoring
Indian General Election,_2014
Extract, Periodically Update Hyperlink structure
One hop from Event Page
δ
37
o Hyperlink structure is dynamically updated
Indian General Election, 2014
Narendra Modi
Rahul Gandhi
NDA (India) UPA (India)
BJP
Indian National Congress
10 May 2010
29 March 2013
29 March 2013 29 March 2013
29 March 2013
20 May 2013
20 May 2013
Event Relevant Background Knowledge
40
Determining Relevancy of Co-occurring Hashtags (Vector
Space Model) #indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2
Entity Extraction and Scoring
Indian General Election,_2014
Extract, Periodically Update Hyperlink structure
Entity scoring based on relevance to the Event
One hop from Event Page
δ
41
o Edge Based Measure
o Link Overlap Measure: Jaccard similarity
o Out(c) are the links in Wikipedia page “c”
o Final Score: r(c,E) = ed(c,E) + oco(c,E)
Hyperlink Entity Scoring
India General Election, 2014
Narendra Modi
India General Election, 2014
India General Election, 2009
1
Mutually Important
ed (c,E) = 1
ed (c,E) = 2
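The scoring r(c,E) = ed(c,E) + oco(c,E) can be sketched as follows. This reads the slide as ed = 2 when the two pages link to each other and ed = 1 when only one links to the other; that reading, the function names, and the toy link sets are assumptions, not details confirmed by the talk.

```python
def edge_measure(c, event, out_links):
    """Assumed reading: 2 for mutual links, 1 for a one-way link, 0 for none."""
    forward = c in out_links.get(event, set())
    backward = event in out_links.get(c, set())
    return int(forward) + int(backward)


def link_overlap(c, event, out_links):
    """Jaccard similarity between Out(c) and Out(event)."""
    a = out_links.get(c, set())
    b = out_links.get(event, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relevance(c, event, out_links):
    """r(c, E) = ed(c, E) + oco(c, E)."""
    return edge_measure(c, event, out_links) + link_overlap(c, event, out_links)


# Hypothetical out-links; real values would come from the Wikipedia pages.
E = "Indian General Election, 2014"
out_links = {
    E: {"Narendra Modi", "BJP", "Indian General Election, 2009"},
    "Narendra Modi": {E, "BJP"},
    "Indian General Election, 2009": {"BJP"},
}
modi_score = relevance("Narendra Modi", E, out_links)
election_2009_score = relevance("Indian General Election, 2009", E, out_links)
```

In this toy graph the mutually linked page scores higher than the page that is only linked from the event page.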
42
Determining Relevancy of Co-occurring Hashtags (Vector
Space Model) #indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2
Entity Extraction and Scoring
Indian General Election,_2014
Extract, Periodically Update Hyperlink structure
Entity scoring based on relevance to the Event
One hop from Event Page
Indian General Elec: 1.0 India: 0.9 Elections: 0.7 UPA: 0.6 BJP: 0.3 NDA: 0.3 Narendra Modi: 0.3
δ
43
Determining Relevancy of Co-occurring Hashtags (Vector
Space Model) #indianelection2015
#modikisarkar
Co-occurring: Threshold
Latest K (200,500)
Narendra Modi: 0.9 BJP: 0.7 NDA: 0.6 India: 0.4 Elections: 0.2 Rahul Gandhi: 0.2 Congress: 0.2
Entity Extraction and Scoring
Indian General Election,_2014
Extract, Periodically Update Hyperlink structure
Entity scoring based on relevance to the Event
One hop from Event Page
Indian General Elec: 1.0 India: 0.9 Elections: 0.7 UPA: 0.6 BJP: 0.3 NDA: 0.3 Narendra Modi: 0.3
Similarity Check
Relevance Score: 0.6
δ
44
o Set Based
  o Jaccard Similarity
  o Considers the entities without the scores
o Vector Based
  o Symmetric
    o Cosine Similarity
  o Asymmetric
    o Subsumption Similarity
Similarity Check
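A sketch of the set-based and vector-based checks on the two entity vectors shown earlier. Jaccard and cosine are the standard definitions. The slides do not define the subsumption measure, so `asymmetric_similarity` below is only a hypothetical stand-in: a cosine computed after dropping event entities that the hashtag vector does not mention, so those entities are ignored rather than penalized.

```python
import math


def jaccard(h, e):
    """Set-based: compares entity sets and ignores the scores."""
    a, b = set(h), set(e)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(h, e):
    """Symmetric vector-based similarity."""
    dot = sum(w * e.get(k, 0.0) for k, w in h.items())
    norm = math.sqrt(sum(w * w for w in h.values())) * math.sqrt(
        sum(w * w for w in e.values())
    )
    return dot / norm if norm else 0.0


def asymmetric_similarity(h, e):
    """Hypothetical stand-in: cosine with e restricted to the entities of h."""
    restricted = {k: e[k] for k in h if k in e}
    return cosine(h, restricted)


hashtag_vec = {"Narendra Modi": 0.9, "BJP": 0.7, "NDA": 0.6, "India": 0.4,
               "Elections": 0.2, "Rahul Gandhi": 0.2, "Congress": 0.2}
event_vec = {"Indian General Elec": 1.0, "India": 0.9, "Elections": 0.7,
             "UPA": 0.6, "BJP": 0.3, "NDA": 0.3, "Narendra Modi": 0.3}

set_score = jaccard(hashtag_vec, event_vec)
sym_score = cosine(hashtag_vec, event_vec)
asym_score = asymmetric_similarity(hashtag_vec, event_vec)
```

Because the restricted event vector has a smaller norm and the same dot product, the asymmetric score is never lower than the cosine score.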
45
Intuition behind Asymmetric
[Figure: India General Election 2014 compared with Narendra Modi. Symmetric similarity: Penalized. Asymmetric similarity: Ignored.]
46
o 2 events
  o US Presidential Elections (#election2012)
  o Hurricane Sandy (#sandy)
o Top 25 co-occurring hashtags
Evaluation – Dataset
48
o Ranking Problem
  o Rank the Top 25 hashtags based on the relevancy of tweets to the event
o Experiment with all the similarity metrics
  o Manually annotated the tweets of these hashtags as relevant/irrelevant (Gold Standard)
o Ranking Evaluation Metrics
  o Mean Average Precision
  o NDCG
Evaluation – Strategy
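For reference, the two ranking metrics can be computed as sketched below. These are common textbook formulations (average precision over the relevant items found in the ranked list, and NDCG with a log2 discount); the talk does not say which variants were used, so treat the details as assumptions.

```python
import math


def average_precision(labels):
    """labels: 0/1 relevance judgments in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of average precision over several ranked lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(gains, k):
    """DCG of the top k divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal else 0.0


# Hypothetical judgments for two ranked hashtag lists.
map_score = mean_average_precision([[1, 0, 1], [0, 1]])
ndcg_score = ndcg_at_k([1, 1, 0], 3)
```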
49
Evaluation
Evaluated tweets containing the top relevant hashtags detected for dynamic topics
• NDCG - 92% at top-5 Mean Average Precision
51
A little pause for Questions?
52
Personalized Filtering
53
User Interest Identification/User
Modeling
Filtering Module
Twitter Streaming API
Tweets
Network
Filtered Tweets
Dynamic Topics as Interests
Interest: Indian Elections
Personalized Filtering
55
User Interest Identification/User
Modeling
Filtering Module
Twitter Streaming API
Tweets
Network
Filtered Tweets
A Significant Module
o User Interest Identification on Twitter
  o Content-based (Only Tweets)
    o Term-based (semantic, web, #semanticweb)
    o Entity-based (semantic web <same as> #semanticweb)
    o Interest Graphs derived from knowledge-base (Hierarchical Interest Graphs)
  o Collaborative (Users’ Friends)
  o Hybrid
User Modeling
56
A simple solution to most problems I am trying to solve
Hierarchical Interest Graphs
58
What is in your mind? (Next concept/term)
Fruit
Other Fruit Names
61
Cognitive Science
o Human memory has been argued to be structured as a hierarchy of concepts (Semantic Network)
o Spreading activation theory has been utilized to simulate search on a semantic network
o This theory has not been well explored for user interest modeling
62
Hierarchical Interest Graphs
o Extending user profiles from Twitter to comprise a hierarchy of concepts
o The hierarchy of concepts is derived from the Wikipedia Category Structure
o Each concept in the hierarchy is scored based on the user’s extent of interest
63
64
Semantic Search
Linked Data Metadata
0.8 0.2 0.6 Scores for Interests
65
User Interests
Internet
Semantic Search
Linked Data Metadata
Technology
World Wide Web
Semantic Web
Structured Information
0.8 0.2 0.6 Scores for Interests
66
User Interests
0.7
0.5
0.4
0.3
68
Tweets
Approach
69
Wikipedia Category Graph
Contains Cycles
More abstract: World Wide Web or Semantic Web?
71
Wikipedia Hierarchy
Hierarchical Levels (1 to 6)
No Cycles
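One possible way to turn a cyclic category graph into a levelled hierarchy is sketched here. The slides only state that the result has hierarchical levels and no cycles; the specific method (shortest distance from a root, then keeping only edges that point to a strictly deeper level), the function names, and the toy graph are assumptions for illustration.

```python
from collections import deque


def hierarchy_levels(subcategories, root):
    """Assign each category its shortest distance from the root (root is level 1)."""
    levels = {root: 1}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        for sub in subcategories.get(category, ()):
            if sub not in levels:
                levels[sub] = levels[category] + 1
                queue.append(sub)
    return levels


def remove_cycles(subcategories, levels):
    """Keep only edges from a category to a strictly deeper level."""
    return {
        category: [s for s in subs if s in levels and levels[s] > levels[category]]
        for category, subs in subcategories.items()
        if category in levels
    }


# Toy category graph with a cycle between two categories (hypothetical links).
subcategories = {
    "Technology": ["Internet"],
    "Internet": ["World Wide Web"],
    "World Wide Web": ["Semantic Web"],
    "Semantic Web": ["World Wide Web"],
}
levels = hierarchy_levels(subcategories, "Technology")
hierarchy = remove_cycles(subcategories, levels)
```

Every kept edge increases the level, so no path in the resulting hierarchy can return to a category it already visited.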
72
Tweets
Approach
73
User Profile Generation
o Extracting Wikipedia entities
  http://en.wikipedia.org/wiki/Semantic_search
  http://en.wikipedia.org/wiki/Ontology
o Interest Scoring
  o Frequency based
Internet
Semantic Search
Linked Data Metadata
Technology
World Wide Web
Semantic Web
User Interests
Structured Information
0.8 0.2 0.6 Scores for Interests
74
75
Tweets
Approach
76
Cricket
M S Dhoni, Virat Kohli, Sachin Tendulkar
Sports
Indian Cricket
Indian Cricketers
0.8 0.2 0.6
0.5
0.4
0.25
0.1
Activation Function Determines the extent of spreading
Example
o Simple Activation Function
\[ A_j = \sum_{i=0}^{n} A_i \times W_{ij} \times D \]
i is an activated child (subcategory) of j.
j is the category to be activated.
W_{ij} is the edge weight between j and i.
D is the decay factor.
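A minimal sketch of the simple activation function on the cricket example, assuming an acyclic hierarchy, unit edge weights, and a decay factor of 0.5. Those parameter values and the function name are illustrative choices; the values shown in the slide's figure may come from different settings.

```python
def spread_activation(children, initial, decay=0.5, weights=None):
    """Compute A_j = sum_i A_i * W_ij * D bottom-up over an acyclic hierarchy."""
    weights = weights or {}
    activation = {}

    def activate(node):
        if node in activation:
            return activation[node]
        # Entities start with their interest score; categories start at zero.
        value = initial.get(node, 0.0)
        for child in children.get(node, ()):
            w = weights.get((child, node), 1.0)  # W_ij, assumed 1.0 if unspecified
            value += activate(child) * w * decay
        activation[node] = value
        return value

    for node in children:
        activate(node)
    return activation


# Category -> children (subcategories or entities), following the slide's example.
children = {
    "Sports": ["Cricket"],
    "Cricket": ["Indian Cricket"],
    "Indian Cricket": ["Indian Cricketers"],
    "Indian Cricketers": ["M S Dhoni", "Virat Kohli", "Sachin Tendulkar"],
}
initial = {"M S Dhoni": 0.8, "Virat Kohli": 0.2, "Sachin Tendulkar": 0.6}
activation = spread_activation(children, initial, decay=0.5)
```

With these settings each step up the hierarchy halves the activation, which is the role of the decay factor: broader categories receive a weaker signal than specific ones.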
77
Activation Function
Challenges – Wikipedia Category Graph
o Uneven distribution of nodes in the hierarchy
o Many-to-many category-subcategory relationships
78
Addressing Uneven Node Distribution
[Figure: Number of Nodes (0 to 300,000) plotted against Hierarchical Level (1 to 16)]
81
83 83
Preferential Path Constraint – Many to Many Links
Boosting Common Ancestors
o Nodes that intersect domains/subcategories activated by diverse entities
86 86
Boosting Common Ancestors
87
Cricket
M S Dhoni, Virat Kohli, Sachin Tendulkar
Sports
Indian Cricket
Indian Cricketers 3
3
5
5
Michael Clarke
Shane Watson
Australian Cricket
Australian Cricketers
2
2
87
88 88
Boosting Common Ancestors
Activation Functions
o Bell
\[ A_j = \sum_{i=0}^{n} A_i \times F_j \]
o Bell Log
\[ A_j = \sum_{i=0}^{n} A_i \times FL_j \]
o Priority Intersect
\[ A_j = \sum_{i=0}^{n} A_i \times FL_j \times P_{ji} \times B_j \]
89
Evaluation
User Study
• 37 Users
• 30K Tweets
Evaluated the top-10 categories of interests derived from the hierarchy
• 76% Mean Average Precision
• 98% Mean Reciprocal Recall
• 70% are not mentioned in tweets
90
o Working on a tweet recommendation system that utilizes the Hierarchical Interest Graph
o Preliminary results are “interesting”
91
Tweet Recommendation using Hierarchical Interest Graph
Conclusion
o Focus on “Information” overload instead of “Data” overload.
  o Personalized Information Filtering
o Knowledge-base enabled solutions for challenges in Tweets filtering
  o Wikipedia hyperlink structure and category graph leveraged for Twitter data filtering
o More Research on User Specific Attribute Extraction (Personalization) from Twitter Data
  o Activity Estimation
  o Location Prediction
93
More at Kno.e.sis
kHealth Knowledge-enabled Healthcare
Applied to ADHF, Asthma, GI, and Dementia
94
Through physical monitoring and analysis, our cellphones could act as an early warning system to detect serious health conditions, and provide actionable information
canary in a coal mine
Empowering Individuals (who are not Larry Smarr!) for their own health
kHealth: knowledge-enabled healthcare
95
Social Health Signals
96
Motivational Scenario
Manually going through news articles, diabetes forums, blogs, etc.
- Time consuming
- Relevant? Interesting? Informative? Useful?
97
How about all the relevant and important health
information aggregated at one platform?
A diabetic patient is interested in keeping himself up to date with
new information about diabetes
98
Search and Explore
o "X Controls Cancer", where X = diet, treatment, exercise (Pattern-based Approach leveraging domain semantics)
o Faceted search (by health topics)
Top Health News
o Informative news about selected disease
Learn about disease
o Source: Wikipedia
Tabs: Home, Search & Explore, Top Health News, Tweet Traffic, Learn about Disease
Thanks
Contact: [email protected]
Twitter: @pavankaps
Webpage: http://knoesis.org/researchers/pavan
99