Lecture @ International Hellenic University
Thessaloniki, 8 May 2014
Social Media Crawling and Mining: Overview of Hands-on Workshop
Symeon (Akis) Papadopoulos, Manos Schinas, Katerina Iliakopoulou, Yiannis Kompatsiaris
Information Technologies Institute (ITI)
Centre for Research & Technologies Hellas (CERTH)
IHU SocialSensor Seminar – May 2014 CERTH-ITI
Streams Manager
Supports search by:
1. Keywords
2. Users
3. Locations
Supports storage to:
1. MongoDB
2. Solr
Supports retrieval from:
1. Twitter
2. Facebook
3. Google+, etc.
Configuration files: input.conf.xml, streams.conf.xml
Streams Manager
How to run:
java -jar StreamsManager.jar stream.conf.xml input.conf.xml
Items, MediaItems and StreamUsers
Item class
Basic fields:
String id
String title
String[] tags
long publicationTime
String uid
String reference
String referenceUserId
String[] mentions
MediaItem class
Basic fields:
String id
String title
String[] tags
long publicationTime
String uid
String reference
Items, MediaItems and StreamUsers
StreamUser class
Basic fields:
String id
String username
String url
int items
long followers
long friends
Getters / Setters for each field
MongoDB – Import Data
mongoimport -h localhost -d Snow14 -c Items --file ../../Items
mongoimport -h localhost -d Snow14 -c MediaItems --file ../../MediaItems
MongoDB – Direct Queries
1. Find an Item by its id
db.Items.find({"id" : "Twitter#438612090748416"})
2. Find all Items posted before a certain date
db.Items.find({"publicationTime" : {$lt: 1393408367000}})
3. Find a Media Item by its reference
db.MediaItems.find({"reference" : "Twitter#438612090748416"})
4. Find all Users with at least 1000 followers
db.StreamUsers.find({"followers" : {$gte: 1000}})
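The 13-digit publicationTime value in query 2 looks like a Unix timestamp in milliseconds; that reading is an assumption, not something the slides state. Under that assumption, a minimal Java sketch for converting between a UTC date and the number to place in the $lt filter could look like this (the class and method names are illustrative only):

```java
import java.time.Instant;

class PublicationTimeExample {
    // Convert an ISO-8601 UTC timestamp to epoch milliseconds
    // (assumed to be the unit of the publicationTime field).
    static long toEpochMillis(String isoUtc) {
        return Instant.parse(isoUtc).toEpochMilli();
    }

    // Convert epoch milliseconds back to an ISO-8601 UTC string.
    static String toIsoUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        // The value used in query 2.
        System.out.println(toIsoUtc(1393408367000L));               // prints 2014-02-26T09:52:47Z
        System.out.println(toEpochMillis("2014-02-26T09:52:47Z"));  // prints 1393408367000
    }
}
```

Read this way, query 2 returns the Items published before 26 February 2014, 09:52:47 UTC.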
MongoDB – Query using DAO classes
1. Create an instance of ItemDAO to retrieve items
ItemDAO itemDAO = new ItemDAOImpl("localhost", "Snow14", "Items");
2. Create an instance of MediaItemDAO to retrieve mediaItems
MediaItemDAO mediaItemDAO = new MediaItemDAOImpl("localhost", "Snow14", "MediaItems");
3. Create an instance of StreamUserDAO to retrieve users
StreamUserDAO userDAO = new StreamUserDAOImpl("localhost", "Snow14", "StreamUsers");
MongoDB – Query using DAO classes
1. Find an Item by its id
itemDAO.getItem("Twitter#438612090748416");
2. Find a Media Item by its reference
List<String> items = new ArrayList<String>();
items.add("Twitter#438612090748416");
mediaItemDAO.getMediaItemsForItems(items, "image", 20);
3. Find the 1000 latest Items
itemDAO.getLatestItems(1000);
MongoDB – Generic queries & Iteration
Use the BasicDBObject class to represent JSON objects,
e.g. {"id" : "Twitter#1234567"} ->
BasicDBObject query = new BasicDBObject("id", "Twitter#1234567");
List<Item> items = itemDAO.getItems(query);
To iterate:
ItemIterator it = itemDAO.getIterator(query);
Use the methods hasNext() and next() to iterate over the collection of Items.
Solr – Query using SocialSensor wrappers
1. Create an instance of SolrItemHandler to index and retrieve items
SolrItemHandler itemHandler = SolrItemHandler.getInstance("http://localhost:8080/solr/Items");
2. Create an instance of SolrMediaItemHandler to index and retrieve mediaItems
SolrMediaItemHandler mediaItemHandler = SolrMediaItemHandler.getInstance("http://localhost:8080/solr/MediaItems");
Solr – Use of UI and SocialSensor wrappers
Assignment #1
Index all the items from MongoDB to Solr
Fill in the method eu.socialsensor.ihu_workshop.indexItems
Assignment #2
Run the following queries to get relevant Items
Q1: terror attack
Q2: Crimea
Q3: Bitcoin
Basic Social Media Analytics
Assignment #1
1. Find the N most frequent hashtags in a list of Items
   1. Process all items in the list one by one.
   2. Create a map of all detected hashtags and their number of occurrences.
   3. Select the N hashtags with the highest counts.
2. Find the N most frequent terms in a list of Items using tokenization
3. Find the N most retweeted tweets in the dataset
   1. Process all items in the collection one by one.
   2. Create a map from each item (item id) to its number of retweets.
   3. Select the N items with the highest counts.
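The count-then-select steps for hashtags can be sketched in plain Java as follows. This is a minimal sketch rather than the workshop's reference solution: each Item is represented only by its String[] tags array, the sample data is made up, and lower-casing the tags and breaking ties alphabetically are choices made here for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashtagCounter {
    // Build the map of hashtag -> number of occurrences.
    // Tags are lower-cased so that "Crimea" and "crimea" count together (a choice made here).
    static Map<String, Integer> countTags(List<String[]> tagsPerItem) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] tags : tagsPerItem) {
            if (tags == null) continue; // item without hashtags
            for (String tag : tags) {
                counts.merge(tag.toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return up to n keys with the highest counts, highest first; ties are broken alphabetically.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    // Hypothetical sample data standing in for the tags of five Items.
    static List<String[]> sample() {
        List<String[]> items = new ArrayList<>();
        items.add(new String[] {"Crimea", "Ukraine"});
        items.add(new String[] {"crimea"});
        items.add(new String[] {"Bitcoin"});
        items.add(null);
        items.add(new String[] {"Ukraine", "Crimea"});
        return items;
    }

    public static void main(String[] args) {
        System.out.println(topN(countTags(sample()), 2)); // prints [crimea, ukraine]
    }
}
```

The same topN helper applies to the retweet task once the map holds item ids and retweet counts instead of hashtags and occurrences.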
Basic Social Media Analytics
Assignment #1
4. Find the top N users based on:
   a) Number of posted items
   b) Aggregated number of retweets
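Both rankings reduce to grouping items by author and aggregating. The sketch below assumes a hypothetical Post stand-in that carries only the author id (uid) and a retweet count; the retweet count field is an assumption, since it is not among the basic Item fields listed earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TopUsers {
    // Hypothetical stand-in for an Item: author id plus an assumed retweet count.
    static class Post {
        final String uid;
        final long retweets;
        Post(String uid, long retweets) { this.uid = uid; this.retweets = retweets; }
    }

    // a) Number of posted items per user.
    static Map<String, Long> postsPerUser(List<Post> posts) {
        return posts.stream()
                .collect(Collectors.groupingBy(p -> p.uid, Collectors.counting()));
    }

    // b) Sum of retweets over all items of each user.
    static Map<String, Long> retweetsPerUser(List<Post> posts) {
        return posts.stream()
                .collect(Collectors.groupingBy(p -> p.uid, Collectors.summingLong(p -> p.retweets)));
    }

    // Up to n users with the highest score, highest first; ties are broken by user id.
    static List<String> top(Map<String, Long> scores, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Long.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    // Made-up sample data.
    static List<Post> sample() {
        return Arrays.asList(
                new Post("alice", 5), new Post("bob", 20), new Post("alice", 1),
                new Post("carol", 0), new Post("bob", 3), new Post("alice", 2));
    }

    public static void main(String[] args) {
        System.out.println(top(postsPerUser(sample()), 2));    // prints [alice, bob]
        System.out.println(top(retweetsPerUser(sample()), 2)); // prints [bob, alice]
    }
}
```

The two criteria can rank users differently: in the sample, alice posts most often but bob collects more retweets.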
Basic Social Media Analytics
Assignment #1
5. Create an activity timeline for the tweets in the dataset and for the set of original tweets
6. Create the timeline of the tweets that contain a hashtag (or keyword) of your choice
7. Try to visualize the timelines you have created in the previous steps.
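One simple way to build such a timeline is to bucket each publicationTime into fixed-length slots and count the items per slot. The sketch below assumes non-negative timestamps in milliseconds and uses a hypothetical text rendering in place of a real chart; the class and method names are illustrative.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ActivityTimeline {
    // Map each slot start time to the number of items published in that slot.
    // Assumes non-negative timestamps; slots without items are simply absent.
    static SortedMap<Long, Integer> build(long[] publicationTimes, long slotMillis) {
        SortedMap<Long, Integer> timeline = new TreeMap<>();
        for (long t : publicationTimes) {
            long slotStart = (t / slotMillis) * slotMillis; // round down to the slot boundary
            timeline.merge(slotStart, 1, Integer::sum);
        }
        return timeline;
    }

    // Crude text visualization: one line per slot, one '#' per item.
    static String render(SortedMap<Long, Integer> timeline) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Long, Integer> slot : timeline.entrySet()) {
            out.append(slot.getKey()).append(' ');
            for (int i = 0; i < slot.getValue(); i++) {
                out.append('#');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        long[] times = {0L, 1000L, 59999L, 60000L, 185000L}; // made-up timestamps
        System.out.print(render(build(times, 60000L)));      // one-minute slots
    }
}
```

For the timeline of original tweets or of a chosen hashtag, the same method can be applied after filtering the list of timestamps accordingly.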
Detection of Trending Topics and Events
What is a trending topic?
Keywords, n-grams, named entities or phrases that are shared heavily in social media over a certain period of time.
Detection of Trending Topics and Events
Assignment #2: Feature-pivot topic detection using hashtags
1. Baseline method: Split the data into timeslots of the same length. Calculate the most frequent hashtags of each timeslot.
2. Calculate the most trending hashtags by comparing the current frequency of a hashtag with its frequency in the previous timeslots.
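One possible way to compare the current frequency with earlier timeslots is a ratio of the current count to the mean count over the previous slots. The scoring formula below (with +1 smoothing in the denominator) is an illustrative choice, not the method prescribed by the workshop, and the sample counts are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrendingHashtags {
    // slots: per-timeslot hashtag counts, oldest first; the last element is the current slot.
    // score = current count / (mean count over previous slots + 1)
    static Map<String, Double> trendScores(List<Map<String, Integer>> slots) {
        Map<String, Double> scores = new HashMap<>();
        if (slots.isEmpty()) {
            return scores;
        }
        int previous = slots.size() - 1;
        Map<String, Integer> current = slots.get(previous);
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            double sum = 0;
            for (int i = 0; i < previous; i++) {
                sum += slots.get(i).getOrDefault(e.getKey(), 0);
            }
            double mean = previous == 0 ? 0 : sum / previous;
            scores.put(e.getKey(), e.getValue() / (mean + 1.0));
        }
        return scores;
    }

    // Helper for building a two-hashtag slot in the sample data.
    static Map<String, Integer> slot(String tag1, int count1, String tag2, int count2) {
        Map<String, Integer> m = new HashMap<>();
        m.put(tag1, count1);
        m.put(tag2, count2);
        return m;
    }

    // Made-up data: "news" is steadily frequent, "crimea" jumps in the current slot.
    static List<Map<String, Integer>> sampleSlots() {
        List<Map<String, Integer>> slots = new ArrayList<>();
        slots.add(slot("news", 10, "crimea", 1));
        slots.add(slot("news", 10, "crimea", 1));
        slots.add(slot("news", 10, "crimea", 9));
        return slots;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = trendScores(sampleSlots());
        System.out.println("crimea: " + scores.get("crimea")); // prints crimea: 4.5
        System.out.println("news: " + scores.get("news"));
    }
}
```

In the sample, "news" is the most frequent hashtag in every slot, so the baseline would rank it first, yet "crimea" gets the higher trend score because its frequency rose sharply relative to its own history.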
Detection of Trending Topics and Events
Assignment #2: Document-pivot event detection by clustering tweets
Cluster "similar" tweets to create groups of tweets that represent candidate events.
The similarity between two tweets could be a combination of similarity measures across different dimensions, e.g. textual similarity, time and space proximity, etc.
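To make the idea of a combined similarity concrete, the sketch below mixes a Jaccard similarity over word sets with an exponentially decaying time proximity, and feeds it into a simple single-pass threshold clustering. The specific measures, weights, threshold and clustering scheme are all illustrative assumptions; this is not the clustering implementation provided by SocialSensor, and space proximity is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TweetSimilarity {
    // Lower-case the text and split on non-word characters (ASCII word characters only).
    static Set<String> tokens(String text) {
        Set<String> set = new HashSet<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                set.add(t);
            }
        }
        return set;
    }

    // Jaccard similarity: |intersection| / |union|, defined as 0 for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    // 1.0 for identical timestamps, decaying towards 0 as the gap grows.
    static double timeProximity(long t1, long t2, long scaleMillis) {
        return Math.exp(-Math.abs(t1 - t2) / (double) scaleMillis);
    }

    // Weighted combination of textual similarity and time proximity.
    static double similarity(String text1, long t1, String text2, long t2,
                             double textWeight, long scaleMillis) {
        return textWeight * jaccard(tokens(text1), tokens(text2))
                + (1 - textWeight) * timeProximity(t1, t2, scaleMillis);
    }

    // Single-pass clustering: each tweet joins the first cluster whose first member
    // is similar enough, otherwise it starts a new cluster. Clusters hold tweet indices.
    static List<List<Integer>> cluster(String[] texts, long[] times, double threshold,
                                       double textWeight, long scaleMillis) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            List<Integer> home = null;
            for (List<Integer> c : clusters) {
                int seed = c.get(0);
                if (similarity(texts[i], times[i], texts[seed], times[seed],
                        textWeight, scaleMillis) >= threshold) {
                    home = c;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                clusters.add(home);
            }
            home.add(i);
        }
        return clusters;
    }

    public static void main(String[] args) {
        String[] texts = {"Explosion in Kiev square",
                          "explosion in kiev square reported",
                          "Bitcoin price drops"};
        long[] times = {0L, 60000L, 120000L};
        // Text-only similarity (textWeight = 1.0), threshold 0.5.
        System.out.println(cluster(texts, times, 0.5, 1.0, 3600000L)); // prints [[0, 1], [2]]
    }
}
```

Each resulting cluster is a candidate event; comparing against only the first member of a cluster keeps the sketch short, whereas a centroid or best-match comparison would be a more robust choice.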
Detection of Trending Topics and Events
Assignment #2: Frequency pivot event detection by clustering tweets
1. Run the document-pivot clustering provided by SocialSensor to create a set of candidate events.
2. For each produced topic, find a list of representative hashtags.
3. Try to calculate a measure of "trendiness" for each event.