
Time.Mk


Page 1: Time.Mk


A New Approach to News

Prof. Igor Trajkovski, Ph.D.

Page 2: Time.Mk

TIME.mk Proprietary

Motivation

– Traditionally, news readers first pick a publication and then look for headlines that interest them.

Page 3: Time.Mk


News Pipeline

Crawling, Extraction

Clustering

Story Classification

Scoring

Page 4: Time.Mk


Crawling and Extraction

Page 5: Time.Mk


Crawl

• Most of the Macedonian news sites don’t have RSS feeds.

• One-level crawl from a set of hubs:

– (Macedonia) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1

– (Economy) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2

– (Sport) http://www.forum.com.mk/index.php?option=com_content&task=blogsection&id=9&Itemid=100

– (Culture) http://novamakedonija.com.mk/DesktopDefault.aspx?tabindex=1&tabid=2&CategID=26

– (None) http://www.kirilica.com.mk/

• Many hubs per source.

– Some sources have fixed hub addresses (A1, Makfax, etc.); for others we need to determine the hub addresses at runtime (Dnevnik, Utrinski, etc.)

• Hubs annotated with section name (topic):

– Macedonia, Balkan, World, Economy, Culture, Fun, Sport, Chronic, None

• At the moment, hubs of the sources are provided manually.
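A one-level crawl of this kind can be sketched as follows. This is a minimal illustration, not the TIME.mk crawler: the helper names, the sample HTML, and the `vest.aspx?VestID=...` article paths are assumptions made up for the example, and the hub page is passed in as a string so that the sketch runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on one hub page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_from_hub(hub_url, hub_html, section):
    """Return (absolute_url, section) pairs for the links on a hub page.

    Links are only collected, never followed, so the crawl stays one level deep.
    Each link inherits the section name the hub was annotated with.
    """
    parser = LinkCollector()
    parser.feed(hub_html)
    seen, result = set(), []
    for href in parser.hrefs:
        url = urljoin(hub_url, href)  # resolve relative links against the hub
        if url != hub_url and url not in seen:
            seen.add(url)
            result.append((url, section))
    return result


# Hypothetical hub content; real hub pages are fetched over HTTP.
sample_html = (
    '<a href="vest.aspx?VestID=1">One</a>'
    '<a href="/vesti/vest.aspx?VestID=2">Two</a>'
    '<a href="vest.aspx?VestID=1">One again</a>'
)
hub = "http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1"
candidates = links_from_hub(hub, sample_html, "Macedonia")
```

Each candidate URL would then go through article extraction; pages that yield no title or body can simply be discarded.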

Page 6: Time.Mk


Article Extraction

Heuristics:

• Title matches link text and/or HTML title, and is above the body

• Body is a big run of unformatted Cyrillic text, below title

• Image is extracted from the hub page and has an attached link with exactly the same address as the article

Segment into title / body / image

The same procedure is used to extract articles from all sources!
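The title and body heuristics can be sketched like this, assuming the page has already been split into text blocks in document order. The function names and the sample blocks are hypothetical, and the matching rules are deliberately simplistic.

```python
import re

# Basic Cyrillic Unicode block, which covers the Macedonian alphabet.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def normalize(text):
    """Lower-case and collapse whitespace so that comparisons are forgiving."""
    return " ".join(text.lower().split())


def cyrillic_count(text):
    return len(CYRILLIC.findall(text))


def find_title(blocks, link_text, html_title):
    """Index of the first block that equals the link text or occurs in the HTML title."""
    link, page = normalize(link_text), normalize(html_title)
    for i, block in enumerate(blocks):
        b = normalize(block)
        if b and (b == link or b in page):
            return i
    return None


def pick_body(blocks, title_index):
    """The block below the title with the most Cyrillic characters, or None."""
    below = blocks[title_index + 1:]
    if not below:
        return None
    best = max(below, key=cyrillic_count)
    return best if cyrillic_count(best) > 0 else None


# Made-up page blocks; only the headline is taken from the slides.
blocks = [
    "Почетна | Вести",
    "Кривична пријава против Андреј Петров",
    "Ова е пример за текст на вест. Тој е подолг од сите други блокови.",
    "Copyright 2008",
]
title_index = find_title(
    blocks,
    link_text="Кривична пријава против Андреј Петров",
    html_title="A1 - Кривична пријава против Андреј Петров",
)
body = pick_body(blocks, title_index)
```

Because nothing here depends on a site's markup, one routine of this shape can serve every source.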

Page 7: Time.Mk


Clustering

Page 8: Time.Mk


Clustering

Partition news articles into disjoint clusters, such that:

– News articles within a cluster are very similar

– News articles in different clusters are very different

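The slides do not name the clustering algorithm. As one possible illustration, here is a greedy single-pass scheme over sparse word-weight vectors using cosine similarity. The algorithm choice, the `threshold` value, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_clusters(vectors, threshold=0.5):
    """Assign each article to the most similar existing cluster, or start a new one.

    Every article ends up in exactly one cluster, so the result is a partition.
    The centroid is kept as a running sum; cosine similarity ignores its scale.
    """
    clusters = []  # each entry: {"centroid": dict, "members": [article indices]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [i]})
        else:
            best["members"].append(i)
            for term, weight in vec.items():
                best["centroid"][term] = best["centroid"].get(term, 0.0) + weight
    return [cluster["members"] for cluster in clusters]


# Toy vectors: the first two articles share their terms, the third does not.
toy_vectors = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 0.9}, {"c": 1.0}]
partition = greedy_clusters(toy_vectors)
```

A higher threshold gives smaller, tighter clusters; a lower one merges more loosely related stories.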

Page 9: Time.Mk


Word weights

Weight is function of word frequency within a document and across all documents

TF(w) = frequency of word w in a news article

• Intuition: a word appearing more frequently in a text is more likely to be related to its “meaning”

IDF(w) = log(N / n_w) + 1

• where N = number of news articles and n_w = number of news articles containing w

• Intuition: words appearing in many news articles are generally not very informative (e.g., “и”, “е”, “на”, “не”, “со”, “во”, “нели”, “затоа”, etc.)

TFIDF: weight of a word in a news article is product of these quantities:

• TFIDF(w) = TF(w) × IDF(w)

A1, 17:15h, MK

Кривична пријава против Андреј Петров (Criminal charges against Andrej Petrov)

петров(28.699) пивара(25.589) андреј(23.382) комиси(23.15) мвр(20.603) кривич(16.449) дистри(16.036) сдсм(15.714) незако(14.921) жалела(14.482)…
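The weighting above can be computed in a few lines. This sketch assumes raw counts for TF and the natural logarithm for IDF (the slides do not state the base), and it takes already tokenized articles as input; the function name and toy data are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per article."""
    n_docs = len(docs)
    doc_freq = Counter()  # n_w: number of articles containing word w
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # TF(w): frequency of w in this article
        weights.append(
            {w: tf[w] * (math.log(n_docs / doc_freq[w]) + 1) for w in tf}
        )
    return weights


# Toy corpus: "b" occurs in both articles, so its IDF is log(1) + 1 = 1.
toy_docs = [["a", "b", "a"], ["b", "c"]]
toy_weights = tfidf(toy_docs)
```

The `+ 1` keeps words that occur in every article at a small positive weight instead of zero.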

Page 10: Time.Mk


Story Classification

Page 11: Time.Mk


Story Classification

Based on hub classification tags

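[Figure: two example clusters. Articles tagged Macedonia, Region, and NONE (via their hubs) form one cluster; articles tagged Culture, Fun, and Macedonia form the other. The diagram marks Region → Macedonia in the first and Macedonia → Culture in the second.]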
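The slides do not spell out how hub tags are combined into a cluster tag. One plausible rule, shown purely as an assumption, is a majority vote over the tags of the member articles that ignores untagged ones:

```python
from collections import Counter


def classify_cluster(article_tags):
    """Most common hub tag among a cluster's articles, ignoring None/NONE."""
    counts = Counter(tag for tag in article_tags if tag.lower() != "none")
    if not counts:
        return "None"
    return counts.most_common(1)[0][0]


# Tags of five articles in one hypothetical cluster.
cluster_tag = classify_cluster(["Macedonia", "Macedonia", "Region", "Macedonia", "NONE"])
```

Under this rule an article found on a Region hub is shown under Macedonia when most of its cluster comes from Macedonia hubs.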

Page 12: Time.Mk


Scoring

Page 13: Time.Mk


Cluster Scoring Logic

Cluster Score = quality-of-sources * freshness-of-news

Quality of a source: How useful is this source?

- Non-duplicate fraction

- Participation in large stories

- First publisher of a top story
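The product form of the cluster score can be illustrated as below. Only the multiplication comes from the slides; summing per-source quality, the exponential freshness decay, the 12-hour half-life, and the quality numbers are assumptions chosen for the example.

```python
def freshness(age_hours, half_life_hours=12.0):
    """Assumed decay: the freshness factor halves every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)


def cluster_score(articles, source_quality, now_hours):
    """articles: list of (source_name, published_at_hours) pairs.

    Quality is summed over the contributing sources (an assumption), and
    freshness is taken from the newest article in the cluster.
    """
    quality = sum(source_quality.get(source, 0.0) for source, _ in articles)
    newest = max(published for _, published in articles)
    return quality * freshness(now_hours - newest)


# Made-up quality values and timestamps (hours since an arbitrary origin).
quality_table = {"A1": 0.8, "Makfax": 0.6}
score = cluster_score([("A1", 0.0), ("Makfax", 12.0)], quality_table, now_hours=24.0)
```

With these inputs the newest article is 12 hours old, so the summed quality of 1.4 is halved.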

Page 14: Time.Mk


Article Scoring Logic

Article Score

– Used for ranking within a cluster

– Function of:

• Age

• Quality of source

• Title overlap with cluster centroid

• Article size

• …
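Of the listed factors, title overlap with the cluster centroid is the least self-explanatory. One simple way to measure it, given here as an assumption, is the fraction of title words that appear among the centroid's highest-weighted terms. The term weights below are the ones quoted on the word-weights slide, standing in for a centroid, and the stemmed title tokens are hypothetical.

```python
def title_overlap(title_tokens, centroid, top_k=10):
    """Fraction of distinct title tokens found among the centroid's top_k terms."""
    top_terms = set(sorted(centroid, key=centroid.get, reverse=True)[:top_k])
    title = set(title_tokens)
    return len(title & top_terms) / len(title) if title else 0.0


centroid = {
    "петров": 28.699, "пивара": 25.589, "андреј": 23.382, "комиси": 23.15,
    "мвр": 20.603, "кривич": 16.449, "дистри": 16.036, "сдсм": 15.714,
    "незако": 14.921, "жалела": 14.482,
}
overlap = title_overlap(["кривич", "против", "андреј", "петров"], centroid)
```

Three of the four title tokens occur in the centroid, so the overlap is 0.75. How this value is combined with age, source quality, and article size is not specified in the slides.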

Page 15: Time.Mk


Stats and Future Work

Page 16: Time.Mk

News sources activity

[Figure: bar chart with one bar per news source, y-axis from 0 to 200. Sources: ДНЕВНИК, УТРИНСКИ ВЕСНИК, НОВА МАКЕДОНИЈА, ВРЕМЕ, ВЕЧЕР, А1, КАНАЛ5, МАКФАКС, НЕТПРЕС, ВЕСТ, КИРИЛИЦА, СИТЕЛ, АЛСАТ-М, ФОРУМ, АЛФА ТВ, КУРИР, ИДИВИДИ, МТВ, КАЈГАНА, БРОКЕР, МАКДЕНЕС, BBC, SETIMES, ON.NET, ЗАЗАБАВА, ТЕЛМА]

* in period of 2 working days

Page 17: Time.Mk

Visitors

[Figure: chart of #visitors per month, Jul to Dec, y-axis from 0 to 7000. Data labels: 100, 180, 700, 1500, 4500, 6500. Annotated events:]

• Launch of TIME.mk, 1 July 2008

• Discussions on MK forums

• 365.com.mk started to present TIME.mk news

• ON.net started to present TIME.mk news

• Article about TIME.mk in Нова Македонија

Page 18: Time.Mk


Regional expansion - Slovenia

Next stop: Serbia

Page 19: Time.Mk


Next to come …

• Search of the archive

• RSS feeds

• Click metrics & personalization

– cluster ranking adjustable to user preferences

• News alerts

– emails with links to news that contains the provided keywords

• Weekly and Monthly news threads

• New topics: Technology, Health, etc.

• Inclusion of other news sources (currently only 26)

• Automatic Hub discovery

• Improvements in the clustering algorithms (more sophisticated NLP)

– СДСМ = СДС, премиер = груевски, нафта = бензин, АМС = агенција за млади и спорт, струја = електрична енергија, etc.

– Go beyond duplicate detection by measuring new fact introduction
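The equivalences listed above could be applied as a normalization step before word weights are computed, so that two articles using different names for the same thing still look similar. The sketch below handles only single-token aliases; the table is a hypothetical hand-made sample built from the examples in the list, and multi-word phrases such as "електрична енергија" would need phrase matching that is not shown.

```python
# Hypothetical alias table: maps a variant token to its canonical form.
ALIASES = {
    "сдс": "сдсм",
    "груевски": "премиер",
    "бензин": "нафта",
}


def normalize_tokens(tokens):
    """Replace every known alias by its canonical token, leaving others as-is."""
    return [ALIASES.get(token, token) for token in tokens]


normalized = normalize_tokens(["сдс", "нафта", "бензин"])
```

After normalization "нафта" and "бензин" contribute to the same dimension of the word-weight vector.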

Page 20: Time.Mk


Acknowledgments

- Pajo & Biba for registering TIME.mk in MARNET

- Karolina for offering DNS services and HTML/CSS tricks

- Igor (Zuljo) for implementing the new design

- Nikola and Daniel for implementing text extraction for TIMES.si

- the many users who suggested improvements by sending tons of emails reporting bugs on TIME.mk pages

Page 21: Time.Mk


Thank You!

Q&A