FeedMe - a semantic RSS aggregator

Nikola Ljubešić, Damir Boras, Mislav Cimperšak, Marija Tkalec

Faculty of Humanities and Social Sciences, University of Zagreb

8 June 2010

Overview

1. The basic idea

2. Our system

3. Statistical analysis of collected data

4. Usage examples

1. The basic idea

Aggregating news

• collecting news from different information sources and publishing them as a single source

• manual and automated

• automated - the problem of repeated information creates a need for analysis and organization

Existing aggregators

• Google News

• EMM NewsExplorer

• MondoPress

RSS

• RSS (Really Simple Syndication) - family of web feed formats used to publish frequently updated works

• XML file - readable by humans and machines

• RSS is structured, while (X)HTML still is not - data harvesting is easier through RSS
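To illustrate why a structured feed is easy to harvest, here is a minimal sketch that reads the items of an RSS 2.0 document with Python's standard library. The sample feed, its portal name and its URLs are made up for illustration, and `parse_items` is a hypothetical helper, not part of FeedMe.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the portal name and URLs are invented.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example portal</title>
    <link>http://example.com/</link>
    <description>Sample feed</description>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
      <pubDate>Tue, 08 Jun 2010 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
      <pubDate>Tue, 08 Jun 2010 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


entries = parse_items(SAMPLE_RSS)
for entry in entries:
    print(entry["title"], entry["link"])
```

Every field sits in a named element, so no page-specific scraping rules are needed.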

Google Reader

• on-line RSS aggregator

• problems:

  • loss of information

  • repeating information

  • unwanted information

Our idea

• collect RSS server-side - no loss of entries

• cluster RSS entries by their content - composite entries, no duplicates

• enable users to filter information - “affirmate” or “negate” specific feeds
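The slides do not say how entries are clustered. As a rough sketch of the idea only, the following groups entries whose titles share enough words, using Jaccard similarity and a greedy single pass. The similarity measure, the threshold of 0.5, the entry fields and the sample data are all assumptions made for this example.

```python
def tokens(text):
    """Lower-cased word set of a title; a stand-in for real content analysis."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_entries(entries, threshold=0.5):
    """Greedy single-pass clustering: attach each entry to the first cluster
    whose first member is similar enough, otherwise start a new cluster."""
    clusters = []
    for entry in entries:
        for cluster in clusters:
            if jaccard(tokens(entry["title"]), tokens(cluster[0]["title"])) >= threshold:
                cluster.append(entry)
                break
        else:
            clusters.append([entry])
    return clusters


# Hypothetical entries from three invented feeds.
entries = [
    {"feed": "portal-a", "title": "Government announces new budget plan"},
    {"feed": "portal-b", "title": "Government announces budget plan"},
    {"feed": "portal-c", "title": "Local team wins the cup final"},
]
clusters = cluster_entries(entries)
print([len(c) for c in clusters])
```

Each resulting cluster corresponds to one composite entry, so a story reported by several feeds appears once.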

Filtering

• publish only feed entries containing n or more original feed entries

• “affirmate” feeds - publish only feed entries containing at least one original entry of all the “affirmative” feeds

• “negate” feeds - do not publish feed entries containing an original entry from any negated feed
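The three filtering rules can be sketched as a single predicate over a clustered entry. This is an illustration under assumptions: a cluster is modelled as a list of dicts with a `feed` key, the function and parameter names are hypothetical, and the “affirmate” rule is read as "every affirmed feed must be represented in the cluster".

```python
def passes_filter(cluster, min_entries=1, affirmed=(), negated=()):
    """Decide whether a clustered entry is published in a user's group.

    cluster     -- list of original entries, each a dict with a "feed" key
    min_entries -- minimum number of original entries in the cluster
    affirmed    -- feeds that must all be represented (assumed reading)
    negated     -- feeds of which none may be represented
    """
    feeds = {entry["feed"] for entry in cluster}
    if len(cluster) < min_entries:
        return False
    if not set(affirmed) <= feeds:
        return False
    if feeds & set(negated):
        return False
    return True


# Hypothetical clusters built from invented feed names.
big = [{"feed": "portal-a"}, {"feed": "portal-b"}, {"feed": "portal-c"}]
single = [{"feed": "portal-a"}]
pair = [{"feed": "portal-b"}, {"feed": "portal-d"}]

print(passes_filter(big, min_entries=2))                      # True
print(passes_filter(single, min_entries=2))                   # False
print(passes_filter(pair, affirmed={"portal-a", "portal-b"}))  # False
print(passes_filter(pair, negated={"portal-d"}))               # False
```

The three conditions are independent, so a group can combine a minimum size with affirmed and negated feeds.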

2. Our system

FeedMe

• back-end - collecting RSS entries every half hour and organizing them into clusters

• front-end - web application for

  • creating groups of feeds (filtering - minimum elements, affirmating, negating)

  • browsing the compiled groups

  • publishing groups as new RSS feeds
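Publishing a group as a new RSS feed could look roughly like the sketch below, which serializes clusters into an RSS 2.0 document. The function name, the entry fields, the choice of the first entry as the item title and all URLs are assumptions for illustration. The slides do not describe FeedMe's actual output format.

```python
import xml.etree.ElementTree as ET


def build_rss(group_title, group_link, clusters):
    """Serialize clusters as an RSS 2.0 feed, one <item> per cluster."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = group_title
    ET.SubElement(channel, "link").text = group_link
    ET.SubElement(channel, "description").text = "Clustered entries for " + group_title
    for cluster in clusters:
        first = cluster[0]
        item = ET.SubElement(channel, "item")
        # Assumption: the first entry of a cluster supplies the title and link.
        ET.SubElement(item, "title").text = first["title"]
        ET.SubElement(item, "link").text = first["link"]
        # List the links of all original entries in the description.
        ET.SubElement(item, "description").text = "\n".join(e["link"] for e in cluster)
    return ET.tostring(rss, encoding="unicode")


# Hypothetical group with one two-entry cluster.
clusters = [[
    {"title": "Budget plan announced", "link": "http://example.com/a/1"},
    {"title": "New budget plan", "link": "http://example.org/b/7"},
]]
feed_xml = build_rss("Politics", "http://example.com/groups/politics", clusters)
print(feed_xml)
```

Because the output is itself RSS, any ordinary feed reader can subscribe to a group.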

3. Statistical analysis of collected data

The collected data

• 388 RSS feeds

• 38 different portals

• collected since 2010-05-10

• more than 100,000 entries

• approx. 30,000 clusters

Distribution of documents by cluster size

[Figure: chart; x-axis: cluster size, 1 to 16; y-axis: 0 to 0.80]

Portals publishing on “large” events (>2)

[Figure: bar chart, axis 0 to 80. Portal labels as listed: Net.hr, Monitor.hr, Tportal.hr, Index.hr, Dnevnik.hr, Nacional.hr, Jutarnji.hr, HRT.hr, 24sata.hr, Vecernji.hr, SlobodnaDalmacija.hr, RTL.hr. Bar values as listed: 16, 19, 24, 27, 30, 45, 49, 54, 64, 66, 68, 77.]

Portals publishing new stories first

[Figure: bar chart, axis 0 to 200. Portal labels as listed: Index.hr, Net.hr, Monitor.hr, Dnevnik.hr, Nacional.hr, Tportal.hr, Jutarnji.hr, Vecernji.hr, HRT.hr, SlobodnaDalmacija.hr, 24sata.hr, RTL.hr. Bar values as listed: 31, 50, 51, 59, 62, 121, 122, 131, 143, 151, 161, 195.]

Portals publishing new stories first (normalized by portal size)

[Figure: bar chart, axis 0 to 0.39. Portal labels as listed: Tportal.hr, Jutarnji.hr, Net.hr, HRT.hr, Vecernji.hr, Nacional.hr, Dnevnik.hr, Monitor.hr, RTL.hr, Index.hr, 24sata.hr, SlobodnaDalmacija.hr. Bar values as listed: 0.31, 0.31, 0.31, 0.32, 0.32, 0.32, 0.32, 0.34, 0.35, 0.38, 0.38, 0.39.]
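A statistic of this kind could be computed along the following lines: for each cluster, find the portal with the earliest entry, count those "firsts" per portal, then normalize. The slides do not define "portal size", so this sketch assumes it is the number of clusters a portal appears in. The field names, portal names and timestamps are invented.

```python
from collections import Counter
from datetime import datetime


def first_publisher_stats(clusters):
    """Return (raw first-publication counts, counts normalized per portal)."""
    first = Counter()    # clusters in which the portal published first
    present = Counter()  # clusters in which the portal appears at all
    for cluster in clusters:
        earliest = min(cluster, key=lambda e: e["published"])
        first[earliest["portal"]] += 1
        for portal in {e["portal"] for e in cluster}:
            present[portal] += 1
    # Assumption: "portal size" is the number of clusters the portal is in.
    normalized = {p: first[p] / present[p] for p in present}
    return first, normalized


def entry(portal, hour, minute):
    """Build a hypothetical entry published on 2010-06-08 at hour:minute."""
    return {"portal": portal, "published": datetime(2010, 6, 8, hour, minute)}


clusters = [
    [entry("portal-a", 9, 0), entry("portal-b", 9, 30)],
    [entry("portal-b", 10, 0), entry("portal-a", 10, 15), entry("portal-c", 11, 0)],
    [entry("portal-a", 12, 0)],
]
first, normalized = first_publisher_stats(clusters)
print(dict(first))
print(normalized)
```

Normalizing matters because a portal that publishes more overall will also be first more often in absolute terms.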

Plagiarism?

[Figure: bar chart, axis 0 to 0.30. Portal labels as listed: Tportal.hr, Dnevnik.hr, Nacional.hr, Net.hr, Jutarnji.hr, Index.hr, Monitor.hr, SlobodnaDalmacija.hr, HRT.hr. Bar values as listed: 0.01, 0.01, 0.01, 0.01, 0.02, 0.03, 0.06, 0.09, 0.24.]

4. Usage examples

Filtering by minimum number of elements

Filtering by affirmating feeds

Filtering by negating feeds

Future steps

• user-defined RSS sources

• full-text news portals

• different sources - social networks

• topic tracking

• named entity identification

• sentiment analysis and mining

Thank you! Questions?
