A New Approach to News
Prof. Igor Trajkovski, Ph.D.
TIME.mk Proprietary
Motivation
– Traditionally, news readers first pick a publication and then look for headlines that interest them.
News Pipeline
Crawling, Extraction → Clustering → Story Classification → Scoring
Crawling and Extraction
Crawl
• Most Macedonian news sites don’t have RSS feeds.
• One-level crawl from a set of hubs:
  - (Macedonia) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1
  - (Economy) http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2
  - (Sport) http://www.forum.com.mk/index.php?option=com_content&task=blogsection&id=9&Itemid=100
  - (Culture) http://novamakedonija.com.mk/DesktopDefault.aspx?tabindex=1&tabid=2&CategID=26
  - (None) http://www.kirilica.com.mk/
• Many hubs per source.
  - Some sources have fixed hub addresses (A1, Makfax, etc.); for others the hub addresses must be determined at runtime (Dnevnik, Utrinski, etc.).
• Hubs are annotated with a section name (topic):
  - Macedonia, Balkan, World, Economy, Culture, Fun, Sport, Chronicle, None
• At the moment, the hubs of each source are provided manually.
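A minimal sketch of the one-level crawl step, assuming the hub page has already been downloaded as an HTML string. The names `HUBS`, `LinkCollector` and `extract_links` are hypothetical and not taken from the TIME.mk code; only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical hub list: (section tag, hub URL), provided manually.
HUBS = [
    ("Macedonia", "http://www.a1.com.mk/vesti/kategorija.aspx?KatID=1"),
    ("Economy", "http://www.a1.com.mk/vesti/kategorija.aspx?KatID=2"),
]


class LinkCollector(HTMLParser):
    """Collects (absolute URL, link text) pairs from one hub page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the hub address.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(section, hub_url, hub_html):
    """One-level crawl step: list article candidates linked from a hub."""
    parser = LinkCollector(hub_url)
    parser.feed(hub_html)
    return [{"section": section, "url": url, "link_text": text}
            for url, text in parser.links]
```

Each candidate keeps the section tag of the hub it came from and the link text, both of which later pipeline stages can use.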
Article Extraction
Heuristics:
• Title matches link text and/or HTML title, and is above the body
• Body is a big run of unformatted Cyrillic text, below the title
• Image is extracted from the hub page and has an attached link with exactly the same address as the article
Segment into title / body / image
The same procedure is used to extract all articles from all sources.
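The title and body heuristics can be sketched roughly as follows. This is an illustration under assumptions that are not in the slides: the page is already split into plain-text blocks in document order, "a big run" is taken to mean the block with the most Cyrillic characters, and a title "matches" only on exact equality. The function names are hypothetical.

```python
def cyrillic_count(text):
    """Number of characters in the basic Cyrillic Unicode block."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04FF")


def segment_article(blocks, link_text, html_title):
    """Pick (title, body) from text blocks given in document order."""
    if not blocks:
        return None, None
    # Body: the block with the largest amount of Cyrillic text.
    body_idx = max(range(len(blocks)),
                   key=lambda i: cyrillic_count(blocks[i]))
    if cyrillic_count(blocks[body_idx]) == 0:
        return None, None
    candidates = {link_text.strip(), html_title.strip()} - {""}
    title = None
    # The title must appear above the body, so only earlier blocks qualify.
    for block in blocks[:body_idx]:
        if block.strip() in candidates:
            title = block.strip()
            break
    return title, blocks[body_idx]
```

Because the rules depend only on link text, page title and the amount of Cyrillic text, the same function can be applied to every source without per-site templates.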
Clustering
Partition news articles into disjoint clusters, such that:
  - News within a cluster are very similar
  - News in different clusters are very different
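The slides do not name the clustering algorithm. As one possible illustration of partitioning articles by similarity, the sketch below uses a greedy single-pass scheme with cosine similarity over sparse word-weight vectors. The threshold value and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {word: weight}."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(vectors, threshold=0.5):
    """Assign each article to the first cluster whose seed is similar enough.

    Returns a list of clusters, each a list of article indices. Every index
    appears in exactly one cluster, so the result is a partition.
    """
    clusters = []  # each entry: (seed vector, [member indices])
    for i, vec in enumerate(vectors):
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append((vec, [i]))
    return [members for _, members in clusters]
```

A single pass is order-dependent; it is shown only because it is short, not because the source confirms it.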
Word weights
The weight is a function of word frequency within a document and across all documents
TF(w) = frequency of word w in a news article
• Intuition: a word appearing more frequently in a text is more likely to be related to its “meaning”
IDF(w) = log(N / n_w) + 1
• where N = number of news articles and n_w = number of news articles containing w
• Intuition: words appearing in many news articles are generally not very informative (e.g., “и” (and), “е” (is), “на” (on/of), “не” (not), “со” (with), “во” (in), “нели” (isn't it), “затоа” (therefore), etc.)
TFIDF: the weight of a word in a news article is the product of these quantities:
• TFIDF(w) = TF(w) × IDF(w)
Example: A1, 17:15h, MK: Кривична пријава против Андреј Петров (“Criminal charges against Andrej Petrov”)
Word weights: петров(28.699) пивара(25.589) андреј(23.382) комиси(23.15) мвр(20.603) кривич(16.449) дистри(16.036) сдсм(15.714) незако(14.921) жалела(14.482)…
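The weighting defined above can be sketched in a few lines. The sketch assumes the articles are already tokenized (and stemmed, if desired) and that the logarithm is the natural one, since the slides do not state the base. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {word: weight} dict per tokenized document.

    TF(w) is the raw count of w in the document and
    IDF(w) = log(N / n_w) + 1, following the definitions above.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({w: tf[w] * (math.log(n_docs / doc_freq[w]) + 1.0)
                        for w in tf})
    return weights
```

A word that occurs in every article gets IDF = 1, so its weight reduces to its raw count, while rarer words are boosted by the logarithmic term.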
Story Classification
Based on hub classification tags
Cluster news (hub tags) → cluster topic:
• Macedonia, Macedonia, Region, Macedonia, NONE → Macedonia
• Culture, Fun, Macedonia, Culture, Macedonia → Culture
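The slides do not spell out how hub tags are combined into a cluster topic. One simple rule, shown here purely as an assumption, is a majority vote over the hub tags of the cluster's articles, ignoring NONE and breaking ties in favour of the tag seen first. The function name is hypothetical.

```python
from collections import Counter


def classify_cluster(hub_tags):
    """Label a cluster from the hub tags of its articles.

    Assumed rule: majority vote, ignoring the NONE tag. Ties go to the tag
    that appears first in `hub_tags`.
    """
    votes = Counter(tag for tag in hub_tags if tag != "NONE")
    if not votes:
        return "NONE"
    return votes.most_common(1)[0][0]
```

For example, a cluster whose articles come from hubs tagged Macedonia, Macedonia, Region, Macedonia and NONE would be labelled Macedonia under this rule.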
Scoring
Cluster Scoring Logic
Cluster Score = quality-of-sources * freshness-of-news
Quality of a source: How useful is this source?
- Non-duplicate fraction
- Participation in large stories
- First publisher of a top story
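A minimal sketch of the cluster score as a product of source quality and freshness. Only the product form comes from the slide. The exponential decay, the 12-hour half-life, summing quality over distinct sources and using the newest article's age are all assumptions, and the function names are hypothetical.

```python
def freshness(age_hours, half_life_hours=12.0):
    """Assumed decay: 1.0 for brand-new news, halving every half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def cluster_score(articles, source_quality):
    """Score = quality-of-sources * freshness-of-news.

    `articles` is a list of (source name, age in hours) pairs. Quality is
    summed over distinct sources; freshness uses the newest article. Both
    aggregation choices are assumptions.
    """
    if not articles:
        return 0.0
    sources = {source for source, _ in articles}
    quality = sum(source_quality.get(s, 0.0) for s in sources)
    newest_age = min(age for _, age in articles)
    return quality * freshness(newest_age)
```

Under these assumptions a story covered by several good sources outranks a single-source story of the same age, and any story sinks as it gets older.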
Article Scoring Logic
Article Score
– Used for ranking within a cluster
– Function of:
• Age
• Quality of source
• Title overlap with cluster centroid
• Article size
• …
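One way the listed signals could be combined is sketched below. The slide only says the score is a function of these inputs; the product form, the decay half-life, the size cap and the overlap definition are assumptions, and the function names are hypothetical.

```python
def title_overlap(title_words, centroid):
    """Share of the centroid's total weight covered by words in the title."""
    total = sum(centroid.values())
    if total == 0:
        return 0.0
    return sum(centroid.get(w, 0.0) for w in set(title_words)) / total


def article_score(age_hours, quality, title_words, centroid, n_words,
                  half_life_hours=12.0, full_size=300):
    """Combine the listed signals; the product form is an assumption."""
    freshness = 0.5 ** (age_hours / half_life_hours)
    size = min(n_words / full_size, 1.0)  # saturates for long articles
    return freshness * quality * title_overlap(title_words, centroid) * size
```

Here `centroid` is a `{word: weight}` dict for the cluster, for instance the average of its members' word-weight vectors.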
Stats and Future Work
News sources activity
[Bar chart: activity per news source over a period of 2 working days; y-axis from 0 to 200. Sources shown: ДНЕВНИК, УТРИНСКИ ВЕСНИК, НОВА МАКЕДОНИЈА, ВРЕМЕ, ВЕЧЕР, А1, КАНАЛ5, МАКФАКС, НЕТПРЕС, ВЕСТ, КИРИЛИЦА, СИТЕЛ, АЛСАТ-М, ФОРУМ, АЛФА ТВ, КУРИР, ИДИВИДИ, МТВ, КАЈГАНА, БРОКЕР, МАКДЕНЕС, BBC, SETIMES, ON.NET, ЗАЗАБАВА, ТЕЛМА]
Visitors
[Chart: #visitors per month, Jul to Dec; y-axis from 0 to 7000. Data labels: 4500, 6500, 1500, 700, 100, 180]
Events marked on the chart:
• Article about TIME.mk in Нова Македонија
• ON.net started to present TIME.mk news
• 365.com.mk started to present TIME.mk news
• Discussions on MK forums
• Launch of TIME.mk, 1 July 2008
Regional expansion - Slovenia
Next stop: Serbia
Next to come …
• Search of the archive
• RSS feeds
• Click metrics & personalization
  - cluster ranking adjustable to user preferences
• News alerts
  - emails with links to news that contain user-provided keywords
• Weekly and monthly news threads
• New topics: Technology, Health, etc.
• Inclusion of other news sources (currently only 26)
• Automatic hub discovery
• Improvements in the clustering algorithms (more sophisticated NLP)
  - СДСМ = СДС (SDSM = SDS), премиер = груевски (prime minister = Gruevski), нафта = бензин (oil = petrol), АМС = агенција за млади и спорт (Agency for Youth and Sport), струја = електрична енергија (electricity = electric power), etc.
  - Go beyond duplicate detection by measuring new fact introduction
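The equivalences listed under the clustering improvements could be applied as a normalization step before word weights are computed. The sketch below is a hypothetical illustration: the `CANONICAL` table, the choice of which form is canonical and the function name are assumptions, and inflected forms would still need stemming.

```python
import re

# Hypothetical synonym table built from the examples above:
# each variant maps to one canonical form.
CANONICAL = {
    "сдс": "сдсм",
    "груевски": "премиер",
    "бензин": "нафта",
    "агенција за млади и спорт": "амс",
    "електрична енергија": "струја",
}


def normalize(text):
    """Lowercase the text and replace known variants, longest first."""
    text = text.lower()
    for variant in sorted(CANONICAL, key=len, reverse=True):
        # Word boundaries keep "сдс" from matching inside "сдсм".
        pattern = r"\b" + re.escape(variant) + r"\b"
        text = re.sub(pattern, CANONICAL[variant], text)
    return text
```

After this step two articles that use different names for the same entity share the same tokens, so their similarity is no longer underestimated.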
Acknowledgments
- Pajo & Biba for registering TIME.mk in MARNET
- Karolina for offering DNS services and HTML/CSS tricks
- Igor (Zuljo) for implementing the new design
- Nikola and Daniel for implementing text extraction for TIMES.si
- many many users for suggesting improvements by sending tons of emails with bugs on TIME.mk pages
Thank You!
Q&A