1
Bettina Berendt
Humboldt University Berlin, Germany – www.berendt.de
Social-media blog tagging:
Metadata or “just more content”?
OR: How do normal people describe what they’ve written – and how do readers understand that?
2
Acknowledgements
This presentation is based on the paper
Tags are not metadata, but “just more content” – to some people
at the International Conference on Weblogs and Social Media,
Boulder, CO, USA, March 2007
http://www.icwsm.org/papers/paper12.html
I thank my co-author
Christoph Hanser
(now at Resco, Hamburg, Germany)
3
Agenda
Blogs and other social media
What blog tags (should) say: “executive summary” of this talk
Tag functions
Empirical study
1. Confirmatory part: Quality of automated content classification
2. Exploratory part: Complementarity and individual differences between classification methods
Some remarks on method
5
Blogs
A weblog (blog) is
a website
containing journal-style entries
presented in reverse chronological order,
often written by a single user
6
Blogs and other social media (“Web 2.0”)
Blogs (e.g., Livejournal; Huffington Post). Sharing / linking by: hyperlinks, comments, blogroll, trackback links
Social network sites (e.g., MySpace)
Instant message exchanges (Twitter)
Wikis (e.g., Wikipedia)
“Annotation platforms” (e.g., del.icio.us)
7
Blogs and other social media, and their activity focus
Blogs (e.g., Livejournal; Huffington Post). Sharing / linking by: hyperlinks, comments, blogroll, trackback links
Publication; Expression
Social network sites (e.g., MySpace)
(Self-)profiling, meeting people
Communication
Instant message exchanges (Twitter)
Wikis (e.g., Wikipedia)
Creating content
“Annotation platforms” (e.g., del.icio.us)
Organizing content
8
Blogs and other social media, and some of their origins in older media
Blogs (e.g., Livejournal; Huffington Post). Sharing / linking by: hyperlinks, comments, blogroll, trackback links
Diaries; (often political) journalism; PR; press releases
Social network sites (e.g., MySpace)
Dating sites
Chatrooms
Instant message exchanges (Twitter)
Wikis (e.g., Wikipedia)
Computer-supported cooperative work
“Annotation platforms” (e.g., del.icio.us)
Bookmarks; www.dmoz.org
Usenet
9
Blogs: Publication bordering on communication
Blogs (e.g., Livejournal; Huffington Post). Sharing / linking by: hyperlinks, comments, blogroll, trackback links
Publication; Expression
Social network sites (e.g., MySpace)
(Self-)profiling, meeting people
Communication
Instant message exchanges (Twitter)
Wikis (e.g., Wikipedia)
Creating content
“Annotation platforms” (e.g., del.icio.us)
Organizing content
10
Blogs and other social media: Where tagging (= adding keywords) is most prominent
Blogs (e.g., Livejournal; Huffington Post). Sharing / linking by: hyperlinks, comments, blogroll, trackback links
Diaries; (often political) journalism; PR; press releases
“Annotation platforms” (e.g., del.icio.us)
Bookmarks
Reader tags
Author tags
Usenet
12
Tags in blogs (I): What does a tag tell us in terms of what a blog is about?
On March 20, 2003, Baghdad got its first taste of "Shock & Awe". A fiery mix of cruise missiles and high-IQ bombs lit up the night sky and unleashed the most lethal air bombing campaign since Vietnam. [...] Four years on, Shock & Awe is the daily reality of a war that has killed 3,200 US soldiers and well over 60,000 Iraqi civilians. [...]
Tags: Iraq, war
We expect tags to mirror content.
13
Tags in blogs (II): However ...
Tags: GPS systems
... tags often complement or add to content.
... this is perceived differently by different readers.
Reader 1: geography & computers
Reader 2: computers & politics
Level of access, allowing a precision of position determination of less than 20 meter selective availability would be turned off, and so now all users enjoy nearly the same however, on may 1 , 2000 , then us president bill clinton announced that this The system also provides . for the user to add intelligence, as perceived by the blind user, to the central server hosting the spatial database The codes are well suited to decoding a message embedded in noise signals which may be orders of magnitude larger than the signal itself […]
15
1 million new tags every month
“… bloggers are not settling on common, decentralized meanings for tags; rather, they are often independently choosing distinct tags to refer to the same concepts”
(Brooks & Montanez, 2006)
16
Tag functions in del.icio.us (Golder & Huberman, 2006)
(also) typical author tags
typical reader tags
Identifying what (or who) the tagged item is about
Refining categories
Identifying qualities or characteristics
Identifying what the tagged item is
Identifying who owns the tagged content / item
Self reference
Task organizing
17
Tag functions mirror standard metadata elements (here: Dublin Core)
(also) typical author tags
typical reader tags
dc:subject (“the topic of the resource”) – Identifying what (or who) the tagged item is about
dc:description – Refining categories
dc:description – Identifying qualities or characteristics
dc:type, dc:format – Identifying what the tagged item is
dc:creator (dc:rights ?) – Identifying who owns the tagged content / item
dc:contributor – Self reference
(dc:relation ?) – Task organizing
18
Do we expect tags / metadata to add to content?
(also) typical author tags
typical reader tags
YES
NO dc:subject (“the topic of the resource”) – Identifying what (or who) the tagged item is about
dc:description – Refining categories
dc:description – Identifying qualities or characteristics
dc:type, dc:format – Identifying what …
dc:creator (dc:rights ?) – Identifying who …
dc:contributor – Self reference
(dc:relation ?) – Task organizing
[...] Eleven years ago, as a condition for ending the Persian Gulf War, the Iraqi regime was required to destroy its weapons of mass destruction, to cease all development of such weapons, […] The Iraqi regime has violated all of those obligations. It possesses and produces chemical and biological weapons. [...]
Identifying who owns the tagged item: http://www.whitehouse.gov/news/releases/2002/...
19
How do author tags relate to content?
21
Empirical study – Part I: Overview
Blog corpus
Content classification with several text mining methods
“Gold standard” derived from content classification by human annotators
Controlled vocabulary / taxonomy
22
Data
Taken from the Weblogging Ecosystems 2006 corpus
random sample of 100 blog posts
written on 4th July 2006 (the first day of the large corpus)
written in English
tagged by their authors
length 50-500 words
no blogs with (only) meaningless tags
Tags were nearly all “topic tags”, broad range of topics
23
Classification taxonomy: WordNet Domains (Magnini & Cavaglia, 2000) – excerpt
FACTOTUM      | DOCTRINES          | FREE_TIME        | APPLIED_SCIENCE       | PURE_SCIENCE     | SOCIAL_SCIENCE
/ number      | / archaeology      | / play           | / agriculture         | / astronomy      | / administration
/ color       | / astrology        | / / betting      | / alimentation        | / / topography   | / anthropology
/ time_period | / history          | / / card         | / gastronomy          | / biology        | / ethnology
/ person      | / / heraldry       | / / chess        | / architecture        | / / biochemistry | / / folklore
/ quality     | / linguistics      | / sport          | / / town_planning     | / / ecology      | / artisanship
/ metrology   | / / grammar        | / / badminton    | / / building_industry | / / plants       | / body_care
              | / literature       | / / baseball     | / / furniture         | / / zoology      | / commerce
              | / / philology      | / / basketball   | / computer_science    | / / / entomology | / economy
              | / philosophy       | / / cricket      | / engineering         | / / anatomy      | / / banking
              | / psychology       | / / football     | / / mechanics         | / / physiology   | / / book_keeping
              | / / psychoanalysis | / / golf         | / / astronautics      | / / genetics     | / / enterprise
              | / art              | / / rugby        | / / electrotechnics   | / chemistry      | / / exchange
              | / / dance          | / / soccer       | / / hydraulics        | / earth          | / / insurance
              | / / drawing        | / / table_tennis | / medicine            | / / geology      | / / money
              | / / / painting     | / / tennis       | / / dentistry         | / / meteorology  | / / tax
              | / / / philately    | / / volleyball   | / / pharmacy          | / / oceanography | / / finance
              | / / music          | / / cycling      | / / psychiatry        | / / paleontology | / fashion
24
Human annotations (“reader tags“)
5 graduate students
received corpus + WND hierarchy
Labelled each post with ≥ 0 domains (recommended: 0-3)
Assigned: 1 domain in 340/500 cases; 2 domains in 160/500; 0 domains in 23/500
Aggregation: consensus = domains with at least 2 votes
IAA = average pairwise Jaccard similarity
Jaccard similarity: $J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{1}{2}\left(1 + \frac{|A \cap B|}{|A \cup B|}\right) F_1$
IAA = 0.39
Similarity to consensus {0.31, 0.47}
Pairwise similarity {0.26, 0.4}
The corresponding F1 values can be found at the paper’s Web site: http://www.wiwi.hu-berlin.de/~berendt/Papers/ICWSM07/
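The aggregation and agreement measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names are hypothetical, the treatment of two empty label sets as identical is an assumption, and so is averaging over posts before averaging over annotator pairs. The toy labels are loosely modelled on the iPod example later in the talk.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two label sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty label sets count as full agreement
    return len(a & b) / len(a | b)


def f1(a, b):
    """Set-based F1 = 2 |A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # same assumption as above
    return 2 * len(a & b) / (len(a) + len(b))


def consensus(labels_per_annotator, min_votes=2):
    """Domains that received at least `min_votes` votes for one post."""
    votes = Counter(d for labels in labels_per_annotator for d in set(labels))
    return {d for d, n in votes.items() if n >= min_votes}


def iaa(annotations):
    """Average pairwise Jaccard similarity.

    `annotations` is a list over annotators; each entry is a list over
    posts of label sets. Averaging per pair over posts first is an assumption.
    """
    pair_scores = []
    for x, y in combinations(annotations, 2):
        per_post = [jaccard(a, b) for a, b in zip(x, y)]
        pair_scores.append(sum(per_post) / len(per_post))
    return sum(pair_scores) / len(pair_scores)


# Toy data: five annotators, one post.
post_labels = [{"music", "computer_science"}] * 3 + [{"play"}, {"free_time"}]
print(consensus(post_labels))
print(iaa([[labels] for labels in post_labels]))
```

The identity on the slide can be checked numerically: for any two label sets, `jaccard(a, b)` equals `0.5 * (1 + jaccard(a, b)) * f1(a, b)`.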
25
Automated classification
Cleaning, POS tagging, lemmatizing
All 10 combinations of
feature sets for bag-of-words:
  author tag(s) (tag)
  nouns from the title (titleN)
  nouns from the blog post body (bodyN)
  the top 5 TF.IDF keyphrases (noun n-grams) from the body (TF.IDF)
  the top 5 IDF keyphrases (noun n-grams) from the body (IDF)
word sense disambiguation strategies:
  the top sense of the word (T)
  all senses of the word (A)
From term to sense to domain
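The step "from term to sense to domain" and the 5 x 2 grid of methods can be illustrated as follows. This is only a sketch under stated assumptions: the two dictionaries are an invented toy lexicon (not real WordNet or WordNet Domains data), the function names are hypothetical, and ranking domains by simple vote counts with a top-3 cut-off is an assumption about a step the slides do not spell out.

```python
from collections import Counter
from itertools import product

# Hypothetical toy lexicon: word -> senses (most frequent sense first).
SENSES = {
    "song": ["song#1"],
    "computer": ["computer#1", "computer#2"],
    "apple": ["apple#1", "apple#2"],
}
# Hypothetical sense -> domain mapping (stand-in for WordNet Domains).
DOMAIN_OF = {
    "song#1": "music",
    "computer#1": "computer_science",
    "computer#2": "person",
    "apple#1": "gastronomy",
    "apple#2": "plants",
}

FEATURE_SETS = ["tag", "titleN", "bodyN", "TF.IDF", "IDF"]
WSD_STRATEGIES = ["T", "A"]  # T = top sense only, A = all senses

# The 10 method combinations named on the slide.
METHODS = list(product(FEATURE_SETS, WSD_STRATEGIES))


def domains_for(terms, strategy):
    """Count domain votes for a bag of terms under one WSD strategy."""
    counts = Counter()
    for term in terms:
        senses = SENSES.get(term.lower(), [])
        if strategy == "T":
            senses = senses[:1]  # keep only the first-listed sense
        for sense in senses:
            domain = DOMAIN_OF.get(sense)
            if domain is not None:
                counts[domain] += 1
    return counts


def classify(terms, strategy, top_k=3):
    """Return up to `top_k` domains with the most votes (assumed cut-off)."""
    return [d for d, _ in domains_for(terms, strategy).most_common(top_k)]


print(len(METHODS))
print(classify(["song", "computer"], "T"))
print(classify(["song", "computer"], "A"))
```

Strategy A trades precision for recall: every sense of an ambiguous term contributes a domain vote, so rarer readings also enter the result.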
26
WordNet: words → senses
27
WordNet Domains as the classification hierarchy: senses → domains
28
Study I: Results
30
Part II: Do automated methods err in the same way? What about automated-method – annotator suitability?
Blog corpus
Content classification with several text mining methods
“Gold standard” derived from content classification by human annotators
Controlled vocabulary / taxonomy
31
Similarity between methods
Automated methods
Human annotators
32
Combining automated methods
Tags complement content.
Definition: In a corpus of posts consisting of body elements (text, title, ...) and author tags, the tags are not metadata but content if (1) the tags have a low similarity with the body (such that body features cannot be used to predict the tags, or vice versa), and (2) the combination of body and tags predicts the human consensus classification of content better than either body or tags alone.
But: HOW are they more content?
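The two conditions of the definition above can be turned into a small check. This is a sketch only: the slides do not say how similarity or prediction quality were computed at this step, so Jaccard similarity between predicted domain sets, set-based F1 against the consensus, set union as the "combination", and the threshold of 0.3 are all assumptions, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def f1(pred, gold):
    """Set-based F1 of predicted domains against the consensus domains."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    return 2 * len(pred & gold) / (len(pred) + len(gold))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def tags_are_content(tag_pred, body_pred, gold, max_similarity=0.3):
    """Check conditions (1) and (2) over a corpus.

    Each argument is a list over posts of domain sets. `max_similarity`
    is an arbitrary illustrative threshold for "low similarity".
    """
    similarity = mean(jaccard(t, b) for t, b in zip(tag_pred, body_pred))
    f_tag = mean(f1(t, g) for t, g in zip(tag_pred, gold))
    f_body = mean(f1(b, g) for b, g in zip(body_pred, gold))
    f_both = mean(f1(set(t) | set(b), g)
                  for t, b, g in zip(tag_pred, body_pred, gold))
    low_similarity = similarity < max_similarity      # condition (1)
    combination_wins = f_both > max(f_tag, f_body)    # condition (2)
    return low_similarity and combination_wins


# One-post toy corpus, loosely modelled on the iPod example in this talk.
print(tags_are_content(
    tag_pred=[{"music"}],
    body_pred=[{"computer_science", "literature"}],
    gold=[{"music", "computer_science"}],
))
```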
33
Tags provide additional information
Yes! Ive been looking for a way to easily transfer songs on my iPOD to my computer. I want to back all of them up to DVD, but Apple makes it very difficult to pull them off. Thats for copyright purposes, Im sure, but there are legitimate uses for it as well. iPOD Agent, a $15 shareware program, allows you to do that and much more. Synchs contacts/notes/etc with Outlook, gets horoscopes/movie times/weather/RSS feeds. Good stuff. iPOD Soft Go get it! Adam
Tags: General, Music
Tag-based methods: Music
Body-based methods: Computer Science, Literature
Human annotators: 3 * (Music, Computer Science), Play, Free time
34
A closer look at agreement
35
Tags don’t give all the information – go on reading!
Todays New York Times includes this report about Sleeper Cell, a 10-part Showtime series about a faithful Muslim named Darwyn (yes, we get it) who infiltrates a terrorist group. […] discuss the shows idealism: "You learn there are peace-loving souls in every religion," said Mr. Fehr, who once served in the Israeli military. "We have to respect and strengthen the peace-believers, and hopefully find a way turn the terrorists." In that sense, the production, for all its violence […] is perhaps most ambitious for the idealism that courses through it. […]
Tags: Radio & TV, Islam
Tag-based methods: TV, Religion
Body-based methods: Politics (and others)
A1: Religion, TV
A3: Politics
36
Interpretation requires knowledge of blogspace authoring conventions
I first wrote about Ludwika Ogorzelec s Space Crystallization Cycle after seeing her show here in NYC last February. Her prolific installation of site specific cellophane lattice has graced a broad range of settings since the series began a couple years ago. The latest... farmland. Farming With Mary is a Queensland Australia project that brought environmental artists from all over the globe to the farming community. Ludwika installed three pieces, each comprised of about 5km of cellophane, on a farm in Tuchikoi in the Mary Valley Region. She also installed one piece in Noosa Woods. Pictures after the jump.
Tags: Art
Tag-based methods: Art
Body-based methods: (other domains)
A1: Art
A3: Photography
38
Searching for the basic level: Coarsening to hierarchy level 2 (which contains most human-assigned tags: average no. of occurrences of the four levels = 0.83, 8.0, 1.96, and 0.14)
A typical uncertainty of lay annotators: Religion (level 2) OR Theology (level 3)?
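Coarsening to level 2 amounts to replacing every deeper domain by its level-2 ancestor. The sketch below assumes the top-level groups (such as DOCTRINES) are level 1. The parent links are a small hand-written excerpt: the art / drawing / painting chain follows the taxonomy excerpt shown earlier, while placing religion under doctrines is an assumption, since the slides give only the levels of Religion and Theology. The function names are hypothetical.

```python
# Hypothetical excerpt of parent links (child -> parent, None = top level).
PARENT = {
    "doctrines": None,
    "art": "doctrines",
    "drawing": "art",
    "painting": "drawing",
    "religion": "doctrines",   # assumed placement
    "theology": "religion",
}


def path_to_root(domain):
    """Return the path from the level-1 domain down to `domain`."""
    path = []
    while domain is not None:
        path.append(domain)
        domain = PARENT[domain]
    return path[::-1]


def coarsen(domain, level=2):
    """Map a domain to its ancestor at `level`; shallower domains are kept."""
    path = path_to_root(domain)
    return path[min(level, len(path)) - 1]


for d in ["theology", "religion", "painting", "doctrines"]:
    print(d, "->", coarsen(d))
```

After coarsening, an annotator who chose Theology and one who chose Religion agree, which removes the level-2 versus level-3 uncertainty from the agreement scores.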
39
News is easier to classify by content: classification comparison, blog corpus vs. Reuters RCV1
settings: all nouns, all synsets, TF.IDF weighting
40
Summary
Blog post body and tag information are complementary.
For some readers, tags or body are the best / only indicators of content.
A gold standard may be a meaningless concept!
Respect this in the design of search engines and labelling recommenders!
Q: What do users (readers and …) do with tags?
41
Outlook
Improve on the mapping to the WordNet domain hierarchy
e.g., relatedness of domains (Berendt & Navigli, 2006)
Use different classification system (but: which one??)
Investigate differences that arise with named entities / proper nouns
Combine methods: extraction from body, tags, lexical resources, similarity mappings / collaborative filtering, machine learning, interactive (tagyu.com, AutoTag, TagAssist, …)
Larger samples, (quasi-)experimental designs, psycholinguistic methods
Investigate impact of language and other factors
Investigate impact of tagging-system features (Marlow et al., 2006) …
Study developments / convergence of tags
Hayes et al., 2007: “A-tags / A-blogs”
Tags: Thank you for your attention!
42
Sources
See reference list of the paper, except these two papers from ICWSM 2007:
TagAssist
Sanjay Sood, Sara Owsley, Kristian Hammond and Larry Birnbaum
TagAssist: Automatic Tag Suggestion for Blog Posts
http://www.icwsm.org/papers/paper10.html
Hayes
Conor Hayes and Paolo Avesani
Using Tags and Clustering to Identify Topic-Relevant Blogs
http://www.icwsm.org/papers/paper23.html