81
Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar SOCIAL MEDIA MINING AND MULTIMEDIA ANALYSIS RESEARCH AND APPLICATIONS Yiannis Kompatsiaris Informatics and Telematics Institute Centre for Research and Technology - Hellas h"p://mklab.i-.gr

Social media mining and multimedia analysis research and applications

Embed Size (px)

DESCRIPTION

In this talk, research and applications in social media mining and multimedia analysis are going to be presented. Social media sharing websites host billions of images and videos, which have been annotated and shared among friends, or published in groups that cover a specific topic of interest. The fact that users annotate and comment on the content in the form of tags, ratings, preferences and so on, and that these activities are performed on a daily basis, gives such social media data source an extremely dynamic nature that reflects topics of interests, events and the evolution of community opinion and focus. The talk will present research challenges and activities and will focus on multi-modal graph-based community detection methods for social media mining, concept and event detection. Clusttour, a mobile and web application integrating research results with appropriate interface design will be demonstrated as a relevant use case. The talk will also include approaches for object/region classifiers learned using the self-training paradigm with loosely annotated training samples automatically selected from social media.

Citation preview

Page 1: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

SOCIAL MEDIA MINING AND MULTIMEDIA ANALYSIS

RESEARCH AND APPLICATIONS

Yiannis Kompatsiaris Informatics and Telematics Institute

Centre for Research and Technology - Hellas

h"p://mklab.i-.gr    

Page 2: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Contents

•  Introduction • Emergent Semantics from Social Media

•  Opportunities and Challenges •  Applications

• Research Approaches •  Community detection in Social Media •  Social media “teacher” of the machine •  Concept detection

• SocialSensor Applications • Conclusions - Issues

Page 3: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

•  Users upload, tag, share, connect and search

•  Over 800 million unique users visit YouTube each month

•  Over 3 billion hours of video are watched each month on YouTube

•  72 hours of video are uploaded to YouTube every minute

•  Emphasis is on uploading, visualization of results and interfaces

•  User engagement •  Single media item analysis •  Usage of the Collective

nature of Social Networks

Social networks and media

Page 4: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Web 2.0 Content •  Multi-modality: e.g. image + tags, image + video

•  Rich (Social) Context: spatio-temporal, social connections, relations and social graph

•  Huge volume: Massively produced and shared

•  Dynamic: Fast updates, real-time, streaming feeds

•  Multi-source: may be generated by different applications, user communities, e.g. delicious, StumbleUpon and reddit are all social bookmarking sites

•  Also connected to other sources (e.g. LOD, web)

•  Inconsistent quality: noise, spam, ambiguity

Page 5: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Caption

Time

User Profile

Favs

Comms

Tags

Page 6: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Social Web as a graph

Page 7: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

10#

social#web#as#a#graph#

nodes&=&twi+er&users&edges&=&retweets&on&#jan25&hashtag&

announcement&of&Mubarak’s&resigna<on&

h1p://gephi.org/2011/the7egyp9an7revolu9on7on7twi1er/#

Page 8: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

9"

blogosphere"as"a"graph"

nodes&=&blogs&edges&=&hyperlinks&

technical&4&gadgets&

society&4&poli5cs&

h-p://datamining.typepad.com/gallery/blog8map8gallery.html"

Page 9: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Two main directions

•  1. Improve access to social media §  Tag refinement, suggestion, propagation, concept

detection §  Result apply to single media items

•  2. Extract implicit information, capture emergent semantics §  Exploit explicit and implicit relations

§  Not explicitly identifiable by users §  Data mining, Collective Intelligence

Scalable approaches taking into account the content and social context of social networks

Page 10: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Tags everywhere Sharing, describe content and search

Page 11: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Very low precision

Page 12: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Very low recall

Page 13: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Can we improve things?

By combining information from many photos - tags, it seems that we can

Stable patterns in tagging systems over time

Page 14: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

“Single” media item analysis •  Use features of large number of similar content

§  E.g. visual and textual features and similarity §  Tag refinement, suggestion, propagation, concept

detection

Page 15: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Social Networks and Collective Intelligence

•  Social Networks is a data source with an extremely dynamic nature that reflects events and the evolution of community focus (user’s interests)

•  Web 2.0 data consists of individually rare but collectively frequent events and topics

•  Potential for much more if we mine the data and their relations and exploit them in the right context

•  Search and Discovery of meaningful topics, entities, points of interest, social connections and events

•  Rather than search for isolated or directly connected social media items

Page 16: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Social Networks and Collective Intelligence

•  “If a group has a means of aggregating different opinions, the group collective solution may well be smarter than even the smartest person’s solution”

•  Conditions •  Diversity (large-scale) •  Independence •  Aggregation •  Motivation for best guess

•  Gamification

Page 17: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Social Networks and Collective Intelligence •  “Social networks have emergent

properties. Emergent properties are new attributes of a whole that arise from the interaction and interconnection of the parts”

•  Emotions, Health, Sexual relationships do not depend just on our connections (e.g. number of them) but on our position - structure in the social graph

•  Central – Hub

•  Outlier

•  Transitivity (connections between friends)

Page 18: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Extraction of implicit information

trace Flickr users from a chronologically ordered set of geographically referenced photos

Who are the Italians and who are the Americans? MIT SENSEABLE CITY LAB, “The World's eyes”

Page 19: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

What else we can do?

Tags that are “representative” for a geographical area

•  1. Clustering of photos

§  K-means, based on their location [Kennedy07]

•  2. Rank each cluster’s tags •  3. Get tags above a certain

threshold Representative tags for San Francisco [Kennedy07]

Contribute to our understanding of

the world

Page 20: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Sensors and automatically user generated content

Uses the GPS in cellular phones to gather traffic information, process it, and distribute it back to the phones in real time

•  online, real-time data processing

•  privacy-preservation •  data efficiency, i.e. not

requiring excessive cellular network Mobile Century Project: http://

traffic.berkeley.edu/mobilecentury.html

Page 21: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Applications Xin Jin, Andrew Gallagher, Liangliang Cao, Jiebo Luo, and Jiawei Han. The wisdom of social multimedia: using flickr for prediction and forecast, International conference on Multimedia (MM '10). ACM.

Federal Emergency Management Agency plans to engage the public more in disaster response by sharing data and leveraging reports from mobile phones and social media

Gogobot: Travel Discovery Goes Social And Visual ”The service raised $4 million in funding (Google CEO Eric Schmidt is one of the investors)…This is a $100 billion a year industry in the U.S. It’s something like $350 billion worldwide.”

21

Page 22: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Social Media as real-time Sensors

“…if you're more than 100 km away from the epicenter [of an earthquake] you can read about the quake on twitter before it hits you…”

Page 23: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Applications •  Science

•  Sociology, machine learning, computer vision (annotation) •  Tourism – Leisure – Culture

•  Off-the-beaten path POI extraction •  Marketing

•  Brand monitoring, personalised ads •  E-Gov and e-participation

•  Direct citizens feedback (fixmystreet app) •  News

•  Topics, trends event detection •  Others

•  Environment, emergency response, energy saving, etc

Page 24: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Research Fields and Issues •  Statistical analysis, machine learning, data mining,

pattern recognition, social network analysis •  Clustering

•  Image, text, video feature extraction and analysis •  Representation, modeling, data reduction

•  Graph theory •  Fusion techniques •  Stream processing and real-time architectures •  Performance, scalability •  Multi-disciplinarity (sociologists) •  Security, privacy

Page 25: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Social Media Community Detection

Page 26: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar 26

Examples of Social Media networks Folksonomy (Delicious) MetaGraph (Digg)

Mika, P. (2005) Ontologies Are Us: A Unified Model of Social Networks and Semantics. Proceedings of the 4th International Semantic Web Conference (ISWC 2005), Springer Berlin / Heidelberg, pp. 522-536

Lin, Y., Sun, J., Castro, P., Konuru, R., Sundaram, H., and Kelliher, A. (2009) MetaFac: community discovery via relational hypergraph factorization. Proceedings of KDD '09, ACM, pp. 527-536

Page 27: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar 27

What is a community in a network? Group of vertices that are more densely connected to each

other than to the rest of the network. Multiple definitions to quantify

communities: Fortunato S. (2010) Community detection in graphs. Physics Reports486:

75-174 S. Papadopoulos, Y. Kompatsiaris, A. Vakali, P. Spyridonos. “Community Detection in Social Media”. In Data Mining and Knowledge Discovery, DOI: 10.1007/s10618-011-0224-z

inter-community edge

intra-community edge

Page 28: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Subgraphs

•  k"clique)

•  N"clique)

•  k"core)

31)

k=3)(triangle)) k=4) k=5)

N=2)(star))

0"core)

1"core)

2"core)

4"core)

3"core)all vertices have degree at least k

Each node is connected to all k-1 nodes

N is the length of the path allowed to all other members

Page 29: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Approach illustration (1/2)

• 1st step: (µ, ε) – core detection

•  2nd step: Local expansion

Two-step process:

•  3rd step: Characterization of remaining vertices as hubs or outliers

Page 30: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

•  Structural  similarity  +  Local  expansion    (highly  efficient  and  scalable  approach)  

•  Not  necessary  to  know  the  number    of  clusters  

•  Noise  resilient  (not  all  nodes  need  to  be  part  of  a  

community)  

•  Generic  approach  adaptable  to      many  applica-ons  (depending  on  node  –  edge  

representa-on)  

+

S.  Papadopoulos,  Y.  Kompatsiaris,  A.  Vakali.  “A  Graph-­‐based  Clustering  Scheme  for  Iden-fying  Related  Tags  in  Folksonomies”.  In  Proceedings  of  DaWaK'10,  Springer-­‐Verlag,  65-­‐76    

Approach illustration (2/2)

Page 31: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

LYCOS iQ Tag Network

Computers: A densely interconnected community

History: A star-shaped community

Page 32: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar 32

Hybrid photo Clustering

Page 33: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar event  

landmark  

Page 34: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar 34

Photo clustering results

Most clusters correspond to landmarks or events

baptism

conference

castels

LANDMARKS

EVENTS

Page 35: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar 35

Sample results: [Visual] vs. [Tag] vs. [Visual + Tag]

VISUAL

TAG

HYBRID

Page 36: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar 36

Numerical results: Geospatial cluster coherence

more precise than the ones produced by

k-means clustering. In terms of the GCC mea-

sure, the SCAN-produced clusters are clearly su-

perior to the k-means ones, which indicates

better geographical focus and thus better corre-

spondence to landmarks and events (which are

usually highly localized). The difference in GCC

is especially pronounced for visual clusters. The

actual GCC performance of k-means clustering

is worse than presented in Table 1 because in

the computation of !d we also included one-

member clusters (that obviously have !d ! 0),

which constitute a large portion of the k-

means results and thus reduce the average !d.In terms of SCQ (subjective evaluation), once

more the SCAN clusters appear to be of higher

precision than the competing k-means ones.

Clusters produced by k-means present consider-

ably higher recall. However, the low "-statistic

values that characterize k-means clusters indi-

cate that users don’t agree on whether the

images included in the clusters are related to

each other. In contrast, there is a high inter-

annotator agreement with respect to SCAN clus-

ters. That means that these clusters are easy to

interpret by users. Another interesting observa-

tion pertains to the cluster quality derived by

each variant of tag-based graph similarity,

namely plain co-occurrence and LSI. There is a

clear cluster quality advantage in favor of LSI,

which comes at a significant computational

cost during the graph construction step.We also conducted a similar study where we

compared the quality of clusters derived by

clustering the VIS, TAG-C, and HYB image

similarity graphs. We found that the best

information-retrieval performance is achieved

by use of the hybrid similarity graph. More spe-

cifically, the F-measure of the HYB image clus-

ters was 28.5 percent higher than the one of

VIS clusters and 19.8 percent higher than the

one of TAG-C clusters. The interannotator

agreement for these results was substantial, be-

cause in all cases the obtained "-statistic values

were above 0.6. These results are consistent

with our previous study.9

We also inspected the distribution of the dif-

ferent clusters (visual, tag, and hybrid) in the

feature space of Quack, Leibe, and Van Gool.

Figure 2 (next page) presents a comparison

among the three clusterings, revealing conspic-

uous differences. One could postulate that vi-

sual clusters largely correspond to landmarks,

while tag clusters mainly correspond to events.

It’s the hybrid clusters that span the whole fea-

ture space and thus correspond to both land-

marks and events. Thus, we use hybrid

clusters for the rest of the evaluation study.To further quantify the purity of the result-

ing clusters in terms of correspondence to land-

marks and events, we asked two users to look at

the images of 60 hybrid clusters and provide a

characterization of landmark or event at

the image level—that is, to decide whether

each image (seen in the context of the rest of

the images belonging to the same cluster)

depicted a landmark or an event. The annota-

tion set of 60 clusters consisted of 1,127 images.

For each cluster, we computed the percentage of

images that were annotated as landmarks and

[3B2-9] mmu2011010052.3d 18/2/011 16:43 Page 57

Table 1. Cluster quality comparison between SCAN and k-means approaches. The performance is

evaluated separately on visual and tag-based features and for multiple values of k. We could not include

k-means with K ! 3M in the tag cluster comparison because the large number of K led to an estimated

execution time of over a week.

Clustering method

Geospatial cluster

coherence

(m stands for meters) Subjective cluster quality

Cluster type (number of clusters) md (m) sd (m) P R F "

Visual SCANVIS (560) 357.1 1185.7 1.000 0.110 0.199 1.000

KMVIS,1M (560) 2470.0 1734.4 0.806 0.324 0.462 0.226

KMVIS,2M (1,120) 2249.7 1893.7 0.899 0.294 0.443 0.544

KMVIS,3M (1,680) 2183.1 2027.4 0.929 0.271 0.420 0.719

Tag SCANTAG-C (1,774) 767.4 1712.0 0.898 0.253 0.394 0.642

SCANTAG-LSI (4,027) 456.3 1151.1 0.950 0.182 0.306 0.820

KMTAG,1M (4,027) 766.8 1762.7 0.848 0.307 0.451 0.564

KMTAG,2M (8,054) 563.2 1528.7 0.903 0.258 0.401 0.707

Jan

uary!

March

2011

57

For 29 landmark clusters, the automatically generated cluster center fell on average within 80 meters of the actual landmark position

S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, A. Vakali. “Cluster-based Landmark and Event Detection on Tagged Photo Collections”. In IEEE Multimedia Magazine 18(1), pp. 52-63, 2011

Page 37: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

clus&our.gr  applica/on  

tags:  sagrada  familia,  cathedral,  barcelona  

taken:  12  May  2009  lat:  41.4036,  lon:  2.1743  

PHOTOS  &  METADATA  SPATIAL  CLUSTERING  +  TEMPORAL  ANALYSIS  

COMMUNITY  DETECTION  

CLASSIFICATION  TO  LANDMARKS/EVENTS  

VISUAL  

TAG  HYBRID  

[2  years,  50  users  /  1

20  photos]  

#users  /  #photos  

dura-on  [1  day,  2  users  /  10  ph

otos]  

S.   Papadopoulos,   C.   Zigkolis,   Y.   Kompatsiaris,   A.   Vakali.   “Cluster-­‐based   Landmark   and   Event   Detec-on   on   Tagged   Photo  Collec-ons”.  In  IEEE  Mul-media  Magazine  18(1),  pp.  52-­‐63,  2011  

Page 38: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

TIME  SLICES  

PHOTO  CLUSTER  SUMMARY  

AREA  TAGS  PHOTO  CLUSTERS  RANKED  BY  POPULARITY  

DIVERSE  SET  OF  AREA  PHOTOS    

ORIGINAL  PHOTO  METADATA  

Page 39: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Page 40: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Page 41: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Page 42: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Page 43: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Page 44: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Available  on  AppStore  http://clusttour.gr/itunes  

Page 45: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Social Media “teacher” of the machine

Page 46: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Manually labelled data

Train classifier

Apply on unlabelled data

Enhance training set with unlabelled data

based on the classifier’s decision

High Quality Annotations Expensive to generate

Crowdsourcing – social media

Visual + Textual

Self training + Social Media

Page 47: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Challenges Region instead of image annotation

E.g. tags are global annotations, while local ones are needed

Imperfect segmentation •  Adaptive size region selection

•  Visual and textual similarity ambiguity •  Fusion of scores

Page 48: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Adapted self training for region selection − Initial models are trained using labelled regions − The models are applied on regions extracted by

loosely tagged images (obtained at almost no cost) ü Dismiss regions that are relatively too small to be useful

− Tags add an extra layer of confidence in the selection process ü Semantic relatedness between concepts and tags is

calculated using either WordNet or a modified version of Google Similarity Distance

− Select regions based on visual and textual information and use them to enhance the positive training set

Proposed approach

Page 49: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Use the selected

samples to enhance

the positive

training set

Combine Visual and

Textual information

Dismiss non-informative regions

Page 50: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Experimental Setup

•  SAIAPR TC-12 dataset (imageCLEF) 20k manually labelled images split into 3 subsets

•  train 14k images (used for testing the proposed approach directly) – 70%

•  validation 2k images (used as the initial training set) – 10%

•  test 4k images (used for evaluation) – 20%

•  MIRFlickr-1m

•  1 million loosely tagged images (used for selecting regions to enhance the initial classifiers)

E. Chatzilari, S. Nikolopoulos, Y. Kompatsiaris, J. Kittler. "Multi-Modal Region Selection Approach for Training Object Detectors", ICMR 2012, Hong Kong - China, June 2012

Page 51: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Validation Visual Visual*Textual

Without ARD 5.7

4.9 6

With ARD 5.1 7

The proposed approach for adaptive region dismissal greatly increases the performance of the resulting classifiers.

The configuration incorporating both visual and textual information exhibits the highest performance in 44 out of the 63 examined concepts, compared to 4 for the typical self training configuration and 15 for the configuration based on the initial classifiers.

Performance Comparison of Retrained models

Page 52: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Current work

•  Application to global (whole image) annotation

•  Introduction of visual ambiguity for improved selection of training samples

•  Learning of concepts which are visually similar and co-occur in images

•  E.g. “sea” – “sky”

•  Do not select such training samples

Page 53: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Semi-supervised machine learning for concept detection

Page 54: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Concept  Detec/on  

•  Use  of  similarity  graph  structure  for  machine  learning  

•  Exploit  mul--­‐modal  informa-on  through  different  fusion  techniques  

 

#54  

Page 55: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Spectral  Graph  Clustering  

#55  

Example:  Values  of  second  eigenvector  of  normalized  Laplacian  matrix  

Page 56: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Fusion  (1)  

#56  

Page 57: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Fusion  (2)  

#57  

Page 58: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

MIR-­‐Flickr  Experimental  Results  

25000  images  +  labels,  38  concepts  

#58  

Page 59: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Proposed  Approach  Vs.  Hare  &  Lewis,  2010  

#59  

Page 60: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Proposed  Approach  Vs.  Guillaumin  et  al.,  2010  

#60  

Page 61: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

#61  

Other  relevant  approaches  •  S.  Nikolopoulos,  E.  Giannakidou,  I.  Kompatsiaris,  I.  Patras,  and  A.  

Vakali,  “Combining  mul/-­‐modal  features  for  social  media  analysis'',  in  book  Social  Media  Modeling  and  Compu-ng,  Springer  2011  •  pLSA-­‐based  aspect  models  to  define  a  latent  seman-c  space  where  

heterogeneous  types  of  informa-on  can  be  effec-vely  combined  •  Georgios  Petkos,  Symeon  Papadopoulos,  Yiannis  Kompatsiaris,  

“Social  Event  Detec/on  using  Mul/modal  Clustering  and  Integra/ng  Supervisory  Signals”,  ICMR  2012.  

•   E.  Spyromitros-­‐Xioufis,  S.  Papadopoulos,  I.  Kompatsiaris,  G.  Tsoumakas,  I.  Vlahavas.  "An  Empirical  Study  on  the  Combina/on  of  SURF  Features  with  VLAD  Vectors  for  Image  Search”  WIAMIS  2012,  Dublin,  Ireland,  May  2012  

Page 62: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

#62

VLAD+SIFT  vs.  VLAD+SURF        Accuracy  vs.  dimensionality  VLAD+SURF  improves  VLAD+SIFT  and  FV+SIFT  across  all  dimensions  in  both  

Holidays  and  Oxford  datasets  

Results  in  rows  star-ng  with  *  are  taken  from  Jégou  et  al.,  2011,    hence  the  missing  values  for  some  entries.  SIFT  corresponds    to  PCA  reduced  SIFT  which  yielded  be"er  results  than  standard  SIFT  in  Jegou  et  al.,  2011  

Page 63: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

SocialSensor Applications and Use Cases

h"p://www.socialsensor.eu    

Page 64: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

#64

   

“It has changed the way we do news”(MSN)

“Social media is the key place for emerging stories – internationally, nationally, locally” (BBC)

“Social media is transforming the way we do journalism” (New York Times)

Source: picture alliance / dpa

Page 65: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

#65

Page 66: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

#66

                   

                                             Source:  Ge"y  Images  

“It’s really hard to find the nuggets of useful stuff in an ocean of content” (BBC)

“Things that aren’t relevant crowd out the content you are looking for” (MSN)

“The filters aren’t configurable enough” (CNN)

Page 67: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Verifica/on  was  simpler  in  the  past...  

Source: Frank Grätz

#67

Page 68: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

#68

Infotainment  Events  with  large  numbers  

of  visitors  Thessaloniki  Interna-onal  

Film  Fes-val    80,000  viewers  /  100,000  visitors  in  10  days  

150  films,  350  screenings  

Discovery  and  presenta-on  of  relevant  aggregated  social  media  (e.g.  film  ra-ngs  from  tweets)  

Page 69: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Conclusions and Issues •  Social media data mining provides interesting

results in many applications •  Not all data always available (e.g. User queries, fb)

•  Infrastructure •  Policy issues

•  Scalability and Real-time approaches •  Fusion of various modalities

•  Content, social, temporal, location •  Linking other sources (web, Linked Open Data) •  Applications and commercialization

•  Proven functionality for the organization •  User engagement

Page 70: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Colleagues •  Dr. Symeon Papadopoulos

•  Community detection •  Graph-based concept detection •  Visual Features

•  Dr. Georgios Petkos •  Multimodal event detection

•  Dr. Spiros Nikolopoulos •  pLSA fusion

•  Elisavet Chatzilari (PhD Student) •  Social media for learning

•  Lefteris Spyromitros (PhD Student) •  Visual Features

•  Juxhin Bakalli and Manos Schinas •  Applications development (Clusttour and ThessFest)

•  Prof. Athina Vakali (Informatics Dept, AUTh) •  Collaboration in Community Detection / Clusttour

Page 71: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Thank  you!  h"p://mklab.i-.gr  

 

Page 72: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Scalability Challenges

•  Network, crawling, data collection •  Streaming data

•  Users •  High numbers of users

•  Processing (e.g. NLP, clustering, etc) •  Links

•  Web sites •  Retweets, mentions, etc •  Multimedia content (e.g. images, YouTube videos)

Page 73: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Some Statistics

Page 74: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Datasift Architecture

Page 75: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Datasift processing •  Process the whole firehose: +250 MTweets/day •  40+ services run in the system •  handling the firehose •  low latency natural language processing and entity extraction

on tweets •  low latency in-line augmentation of tweets •  low latency handling very large individual filters •  keeping a history of the firehose by persisting the 1TB of

data it sends each day •  allowing analytics to be run on the history of the firehose •  real-time billing •  streaming filter results to 1000s of clients •  http://highscalability.com/blog/2011/11/29/datasift-

architecture-realtime-datamining-at-120000-tweets-p.html

Page 76: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Datasift statistics •  Current Peak Delivery of 120,000 Tweets Per Second

(260Mbit bandwidth) •  Performs 250+ million sentiment analysis with sub 100ms

latency •  1TB of augmented (includes gender, sentiment, etc) data

transits the platform daily •  Data Filtering Nodes Can process up to 10,000 unique

streams (with peaks of 8000+ tweets running through them per second)

•  Can do data-lookup's on 10,000,000+ username lists in real-time

•  Links Augmentation Performs 27 million link resolves + lookups plus 15+ million full web page aggregations per day.

•  http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html

Page 77: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Frameworks •  MapReduce (Hadoop)

•  Computation distribution •  Batch processing of huge datasets •  Parallel processing on large clusters of compute nodes

•  Cassandra, Tokyo Cabinet •  Key value stores •  Horizontal scaling for many users •  Huge Data indexing •  Fault tolerance •  Not sophisticated query possibilities

•  MongoDB •  JSON native support •  Large-Scale data storage

•  Memcached •  Efficient caching •  Clustering

Page 78: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Scalability Processing Approaches

•  Sampling •  Dimensionality reduction

•  E.g. VLAD

•  Local computations •  Iterative scanning/processing

•  stream based

•  Multi-level – Hierarchical •  Distributed

Page 79: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Image Representation Approaches

Bag-Of-Words (BOW) The most widely used Memory usage and search time are usually prohibitive for > 10M images

Vector of Locally Aggregated Descriptors VLAD More accurate than BOW for the same representation size Cheaper to compute Dimensionality can be further reduced with PCA without noticeable impact in accuracy.

Page 80: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Experimental Results (holidays dataset)

1/8/12 Eleftherios Spyromitros-Xioufis

method k descriptor dimension MAP

BOW 1K SIFT-pca 1K 40.1 BOW 20K SIFT-pca 20K 43.7 BOW 200K SIFT-pca 200K 54.0 VLAD 64 SIFT-pca 4096 55.6 VLAD 64 SIFT 8192 55.2 VLAD 128 SIFT 16384 56.7 VLAD 64 SURF 4096 63.2

VLAD 128 SURF 8192 65.6

Fisher 64 SIFT-pca 4096 59.7 Fisher 256 SIFT-pca 16384 62.5

Page 81: Social media mining and multimedia analysis research and applications

Guildford, 31 July, 2012 University of Surrey, CVSSP Seminar

Experimental Results (holidays dataset)

1/8/12 Eleftherios Spyromitros-Xioufis

method k descriptor D D’ (pca) MAP

BOW 20K SIFT-pca 20K 512 44.9 128 45.2 64 44.4

VLAD 64 SIFT-pca 4096 512 59.8 128 55.7 64 55.3

VLAD 64 SURF 4096 512 63.4

128 58.6

64 55.6

Fisher 64 SIFT-pca 4096 512 61.0 128 56.5 64 52.0