1

The evolution of a story in a network – a Web mining perspective

Bettina Berendt

www.cs.kuleuven.be/~berendt

2

About me: My public (and mine-able) profile

: Information Systems

: Computer Science / Cognitive Science

: Artificial Intelligence

: Business Science

: Economics

: Computer Science

3

Agenda

Web mining: Text, Link structure, Usage

Story evolution: texts change (T)

Story evolution: authors change (U)

Story evolution: communities of authors change (L)

Story evolution: reading behaviour changes (U)

4

Web mining

5

Information retrieval and data mining

What's in this list?

How is it ordered?

6

Information retrieval and data mining

7

Data mining & Web mining

Knowledge discovery (aka Data mining):

“the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” 1

Web mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources. Web mining areas:

Web content mining: texts, pictures, sounds, ...

Web structure mining: simple, bipartite, tripartite, ... graphs

Web usage mining: navigation, queries, content access & creation

1 Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press.

8

Story evolution: texts change

– joint work with Ilija Subašić, 2008 –

* All references are given on slide no. 47

9

Dynamic Web content

10

A story begins

http://www.telegraph.co.uk/news/main.jhtml?xml=/news/2007/05/22/nmaddy122.xml

11

The story unfolds

12

The story unfolds – new actors enter the stage (and old ones change their roles)

13

Basic idea: A story is about relational statements; story stages are expressed by co-occurrences

Robert Murat – suspect

Kate McCann (the mother) – suspect

Gabriel Ruget's talk

14

Data collection and preprocessing

Articles from Google News 05/2007 – 11/2007 for search term “madeleine mccann”

(there was a Google problem in the December archive)

Only English-language articles

For each month, the first 100 hits

Of these, all that were freely available: 477 documents

Preprocessing:

HTML cleaning

tokenization

stopword removal
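
A minimal sketch of the three preprocessing steps, assuming regular-expression HTML stripping and a tiny hand-picked stopword list. Neither the tools nor the stopword list actually used are named on the slide, so `STOPWORDS` and all function names here are illustrative.

```python
import re
from html import unescape

# Illustrative stopword list (an assumption; the slide does not name one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "was", "to", "for", "on", "as"}


def clean_html(html: str) -> str:
    """Drop script/style blocks and all tags, then decode HTML entities."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return unescape(text)


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def preprocess(html: str) -> list[str]:
    """HTML cleaning, then tokenization, then stopword removal."""
    return remove_stopwords(tokenize(clean_html(html)))
```

For real news pages, a proper HTML parser and a standard stopword list would likely be more robust than this sketch.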

15

Story elements

content-bearing words

the 150 top-TF words without stopwords
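
A minimal sketch of selecting content-bearing words, assuming "TF" means the raw term frequency summed over the whole corpus and that documents are already tokenized with stopwords removed. The function name is illustrative.

```python
from collections import Counter


def top_tf_words(docs: list[list[str]], k: int = 150) -> list[str]:
    """Return the k most frequent tokens over all documents (corpus-wide TF)."""
    counts = Counter(token for doc in docs for token in doc)
    return [word for word, _ in counts.most_common(k)]
```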

16

Story stages: co-occurrence in a window

“mother” and “suspect” co-occur
• in a window of size ≥ 6 (all words)
• in a window of size ≥ 2 (non-stopwords only)
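
A minimal sketch of the window test, assuming a window of size k means k consecutive tokens, so two words co-occur if their positions differ by less than k. The function name and the example sentence are illustrative, not taken from the corpus.

```python
def cooccur(tokens: list[str], w1: str, w2: str, window: int) -> bool:
    """True if w1 and w2 both fall inside some run of `window` consecutive tokens."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) < window for i in pos1 for j in pos2)


# Hypothetical sentence: four stopwords separate the two content words.
all_words = "mother is now also a suspect".split()
content_words = ["mother", "suspect"]  # the same sentence after stopword removal

in_all = cooccur(all_words, "mother", "suspect", window=6)
in_content = cooccur(content_words, "mother", "suspect", window=2)
```

In this example the pair needs a window of at least 6 over all words, but only 2 over non-stopwords, which mirrors the two cases on the slide.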

17

Salient story elements

1. Split whole corpus T by week (week 17 = 30 Apr + until week 44 = 12 Nov +)

2. For each week: compute the weights for the corpus t of this week

3. Weight = support of the co-occurrence of 2 content-bearing words w1, w2 in t = (# articles from t containing both w1, w2 in a window) / (# all articles in t)

4. Thresholds:

Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)

Time-relevance TR of co-occurrence(w1, w2) = (support of co-occurrence(w1, w2) in t) / (support of co-occurrence(w1, w2) in T) ≥ θ2 (e.g., 2) *

5. Rank by TR; for each week, identify the top 2

6. Story elements = peak words = all elements of these top 2 pairs (# = 38)
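
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: each article is represented as the set of word pairs that co-occur in a window somewhere in it, and the "number of occurrences" for θ1 is read as the number of articles in t containing the pair. All names are hypothetical.

```python
from collections import Counter


def pair_counts(articles):
    """For each word pair, count the articles that contain it."""
    counts = Counter()
    for pairs in articles:
        counts.update(set(pairs))
    return counts


def salient_pairs(weeks, theta1=5, theta2=2.0, top=2):
    """weeks: dict mapping week -> list of articles (each a set of word pairs).

    Returns, per week, up to `top` (TR, pair) tuples ranked by time relevance.
    """
    all_articles = [a for arts in weeks.values() for a in arts]
    total_counts = pair_counts(all_articles)
    result = {}
    for week, arts in weeks.items():
        scored = []
        for pair, n in pair_counts(arts).items():
            if n < theta1:  # threshold on occurrences in t (assumed: article count)
                continue
            support_t = n / len(arts)                          # support in week t
            support_T = total_counts[pair] / len(all_articles)  # support in corpus T
            tr = support_t / support_T                          # time relevance
            if tr >= theta2:
                scored.append((tr, pair))
        scored.sort(key=lambda item: (-item[0], item[1]))
        result[week] = scored[:top]
    return result


def peak_words(salient):
    """Story elements: all words that appear in any week's top pairs."""
    return {w for ranked in salient.values() for _, pair in ranked for w in pair}
```

A TR of 2 then means the pair is twice as frequent (per article) in that week as in the corpus overall.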

18

Salient story stages, and story evolution

7. Story stage = co-occurrences of peak words in t

For each week t: aggregate over t-2, t-1, t (moving average)

8. Story evolution = how story stages evolve over the t in T
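
A minimal sketch of the three-week moving average, assuming weekly co-occurrence weights are stored as dictionaries, that a pair missing in a week counts as 0, and that the first weeks are averaged over the weeks available so far. These choices and all names are assumptions for illustration.

```python
def moving_average(weekly, window=3):
    """weekly: dict week -> dict pair -> weight. Averages each pair over t-2, t-1, t."""
    weeks = sorted(weekly)
    smoothed = {}
    for i, week in enumerate(weeks):
        span = weeks[max(0, i - window + 1): i + 1]      # up to `window` weeks ending at t
        pairs = set().union(*(weekly[w] for w in span))  # all pairs seen in the span
        smoothed[week] = {
            p: sum(weekly[w].get(p, 0.0) for w in span) / len(span) for p in pairs
        }
    return smoothed
```

Smoothing of this kind makes a pair fade out gradually over the following weeks instead of disappearing as soon as it stops being reported.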

19

Story stages: Example result

<week 17>

<show sliders>

20

Story evolution: result

<morphAll.py>

21

... the story is lost if we go back to single entities

Robert Murat – suspect

Kate McCann (the mother) – suspect

22

Future work

“beyond words” (e.g., semantics)

Web communities

Michael Barber's talk

Gabriel Ruget's qualia?!

23

Story evolution: authors change (and stories with them)

24

Multi-authored texts

http://en.wikipedia.org/wiki/Madeleine_McCann

25

Who authored?

26

Visualizing conflict – example “edit wars”

Viégas, Wattenberg, & Dave, 2004

27

The bone of contention ...

28

Story evolution: communities of authors develop parallel stories

29

Basic data for Web structure mining: hyperlinks and textual references

30

Example: Political blogs in the US

Adamic & Glance, 2005 (visualization modified)

Panels: all links; thresholded (link occurrence ≥ 25)

blue: liberal; red: conservative

Gabriel Ruget's “publics”

31

Example: Blogs sourcing mainstream media

Hyperlinks from blogs to mainstream news media: Germany, USA

[Berendt, Schlegel, & Koch, in Kommunikation, Partizipation und Wirkungen im Social Web, 2008]

32

The German and the US blogospheres

Data reported in [Berendt, Schlegel, & Koch, 2008]

33

Example: The politics of sourcing – what do blog posts on global warming refer to?

Walejko & Ksiazek, in press

34

Story evolution: communities of authors change

35

Who authored? (revisited)

36

Tracing anonymous edits

37

Why?

38

Story evolution: reading behaviour changes

39

The story unfolds – query analysis may reveal more than text analysis

40

Reading may “predate” writing

41

Request frequency for a specific diagnosis in the investigated eHealth portal, depending on time and request language

Which diagnosis is that?

[Yihune, 2003; see also Heino & Toivonen, 2003]

46

My story has reached its end

is our discussion's beginning!

47

References

Adamic, L., & Glance, N. (2005). The political blogosphere and the 2004 U.S. Election: Divided they blog. In Proc. of the 3rd Int. Worksh. on Link Discovery at ACM SIGKDD (pp. 36–44).

Berendt, B., Schlegel, M., & Koch, R. (2008). Die deutschsprachige Blogosphäre: Reifegrad, Politisierung, Themen und Bezug zu Nachrichtenmedien [The German-speaking blogosphere: Maturity, political focus, and relation to news media]. To appear in A. Zerfaß, M. Welker, & J. Schmidt (Eds.), Kommunikation, Partizipation und Wirkungen im Social Web (Band 2: Strategien und Anwendungen: Perspektiven für Wirtschaft, Politik, Publizistik) [Communication, Participation and Effects in the Social Web (Vol. 2: Strategies and Applications: Perspectives for the Economy, Politics, and Journalism)] (pp. 72-96). Köln, Germany: Herbert von Halem Verlag.

Berendt, B. & Subašić, I. (in press). Identifying, measuring and visualizing the evolution of a story: A Web mining approach. To appear in Proc. COLLNET 2008 (Fourth International Conference on Webometrics, Informetrics and Scientometrics & Ninth COLLNET Meeting). Berlin, July/August 2008.

Griffith, V. (2007). WikiScanner: List anonymous wikipedia edits from interesting organizations. http://wikiscanner.virgil.gr

Heino, J. & Toivonen, H. (2003). Automated Detection of Epidemics from the Usage Logs of a Physicians' Reference Database. In Proc. PKDD 2003. http://www.springerlink.com/content/g8h9f8y2fd3xq7ft/

Viégas, F.B., Wattenberg, M., & Dave, K. (2004). Studying Cooperation and Conflict between Authors with history flow Visualizations. In Proc. CHI 2004 (pp. 575-582).

Walejko, G. & Ksiazek, T. (in press). The Politics of Sourcing: A Study of Journalistic Practices in the Blogosphere. To appear in Proc. of the Second International Conference on Weblogs and Social Media (ICWSM 2008). Seattle, March/April 2008. http://www.icwsm.org/2008

Yihune, G. (2003). Evaluation eines medizinischen Informationssystems im World Wide Web. Nutzungsanalyse am Beispiel www.dermis.net. Dissertation. Medizinische Fakultät der Ruprecht-Karls-Universität Heidelberg.

48

Backup Slides

49

(Some) further work in text processing

50

Improving on words and weights

51

Stemming

Want to reduce all morphological variants of a word to a single index term

e.g. a document containing words like fish and fisher may not be retrieved by a query containing fishing (no fishing explicitly contained in the document)

Stemming - reduce words to their root form

e.g. fish – becomes a new index term

Porter stemming algorithm (1980)

relies on a preconstructed suffix list with associated rules

e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE

– BINARIZATION => BINARIZE
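
A minimal sketch of the single rule quoted above, not the full Porter algorithm. It assumes a simplified vowel test (a, e, i, o, u only, whereas Porter treats "y" specially) and an illustrative function name.

```python
import re

# "At least one vowel followed by a consonant" somewhere in the remaining stem.
_VOWEL_CONSONANT = re.compile(r"[aeiou][^aeiou]")


def ization_rule(word: str) -> str:
    """Rewrite the suffix -ization to -ize if the remaining stem qualifies."""
    w = word.lower()
    suffix = "ization"
    if w.endswith(suffix):
        stem = w[: -len(suffix)]
        if _VOWEL_CONSONANT.search(stem):
            return stem + "ize"
    return w
```

The full algorithm applies many such suffix rules in several ordered steps; this sketch only shows how one rule and its condition fit together.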

52

Inverse document frequency (IDF)

A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents

n_j – number of documents which contain the term j

n - total number of documents in the set

Inverse document frequency
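
One widely used definition, given here as an assumption because several variants exist (different logarithm bases, smoothing terms), uses the symbols n and n_j introduced above:

```latex
\mathrm{idf}_j = \log \frac{n}{n_j}
```

A term contained in every document (n_j = n) gets idf_j = 0, while a term contained in only a few documents gets a large value.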

53

Full Weighting (TF-IDF)

The TF-IDF weight of a term j in document d_i is
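
A minimal sketch of one common TF-IDF variant, assumed here rather than taken from the slide: w_ij = tf_ij · log(n / n_j), with tf_ij the raw count of term j in document d_i. The function name is illustrative.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document, with weight = tf * log(n / n_j)."""
    n = len(docs)
    # n_j: number of documents that contain term j
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return weights
```

With this variant, a term that occurs in every document gets weight 0 regardless of how often it appears.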

54

Beyond words

55

N-grams and Named-Entity Recognition

Madeleine, Madeleine McCann, Maddie, Maddy, ... → MADELEINE_MCCANN
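
A minimal sketch of mapping name variants to one entity label, assuming a hand-made alias list restricted to the four variants shown above. A real named-entity recognizer would find such variants automatically; the names `ALIASES` and `normalize_entities` are illustrative.

```python
import re

# Only the variants listed on the slide; further aliases are not known here.
ALIASES = ["Madeleine McCann", "Madeleine", "Maddie", "Maddy"]

# Longest alias first, so "Madeleine McCann" is matched as one unit.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def normalize_entities(text: str) -> str:
    """Replace every known alias with the canonical entity label."""
    return _PATTERN.sub("MADELEINE_MCCANN", text)
```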

56

Semantics (e.g., word-sense disambiguation)

57

The need for word sense disambiguation

“She sat by the bank and looked sentimentally at the last fish.”

“She sat by the bank and looked sentimentally at the last coins.”

58

WordNet semantic relations

59

Web mining for analyzing multiple perspectives:

[Fortuna, Galleguillos, & Cristianini, in press]

What characterizes different news sources?

Nearest neighbour / best reciprocal hit for document matching; Kernel Canonical Correlation Analysis and vector operations for finding topics and characteristic keywords
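
A minimal sketch of the "best reciprocal hit" idea for matching documents from two news sources: a pair is kept only if each document is the other's nearest neighbour. It assumes bag-of-words vectors and cosine similarity; this is an illustration of the general idea, not the cited authors' implementation, and all names are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_reciprocal_hits(source_a, source_b):
    """Return index pairs (i, j) where a_i and b_j are each other's nearest neighbour."""
    va = [Counter(doc) for doc in source_a]
    vb = [Counter(doc) for doc in source_b]
    if not va or not vb:
        return []
    best_ab = [max(range(len(vb)), key=lambda j: cosine(a, vb[j])) for a in va]
    best_ba = [max(range(len(va)), key=lambda i: cosine(va[i], b)) for b in vb]
    return [(i, j) for i, j in enumerate(best_ab) if best_ba[j] == i]
```

Requiring the match in both directions filters out documents that have no real counterpart in the other source.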

60

Syntactic analysis

From simple part-of-speech tagging to full-scale NLP parsing