
Untangling the Semantic Web: Microdata use in Russian video content delivery sites


… 1 ….

Untangling the Semantic Web: Microdata use in Russian video content delivery sites

...or yet another report on semantic markup with publicly available dataset as an extra bonus

Andrey Kutuzov and Maxim Ionov

AIST conference, Yekaterinburg, April 11, 2014

… 2 ….

Outline

1. Why important?
2. Source of data.
3. Types of semantic markup.
4. What do they mark up?
5. Do they stick to standards?
6. Movies and DBPedia.
7. Dataset usage and availability.

… 3 ….

Why important?

● The Web is (hopefully) moving towards a Web of Data

● About 25% of web pages in the Russian Internet already carry semantic markup

● Video hosting services and movie databases are among the quick adopters: their usage rate is 50%

… 4 ….

User behavior: queries

[Figure: user queries split into video-related queries and other queries, shown as percentages (0% to 100%).]

… 5 ….

User behavior: most popular sites

[Figure: top 50 sites and top 10 sites, each split into video-related sites and other sites, shown as percentages (0% to 100%).]

… 7 ….

Source of data

● The most popular Russian movie-related sites, as crawled by the Mail.ru search engine in January 2014

● Semantic markup extracted from these pages is used internally to construct better search snippets

… 8 ….

Source of data

● 18 sites: actorpedia.net, amazingcinema.ru, baskino.com, bigcinema.tv, gos-kino.ru, ivi.ru, kinopoisk.ru, kiniska.com, kinoestet.ru, kinomatrix.com, kinoprosmotr.net, kinostok.tv, megogo.net, multiki-online.net, mult-online.ru, ovideo.ru, ruseriali.com, zerx.ru

● About 1.5M web pages with semantic markup

● The number of sites and pages grows with each crawling session

… 10 ….

Types of semantic markup

● Features of Microdata and RDFa/OpenGraph deployment on Russian movie-related sites generally conform to the observations for the global Web in [Bizer et al. 2013].

● No microformats (hCard, hCalendar, etc.)

● Dominance of Microdata with the Schema.org vocabulary

… 11 ….

Example from ivi.ru

● HTML with semantic markup:

<div class="content-main" itemscope itemtype="http://schema.org/Movie">
  <meta itemprop="name" content="Live broadcast"/>
  <div itemprop="description">
    <p>Soviet social drama «Live broadcast» tells the story of old friends...</p>
  </div>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    <span class="itemprop" itemprop="name">Oleg Safaraliev</span>
  </div>

● Statements about things extracted (n3 format):

[] a <http://schema.org/Movie> ;
    <http://schema.org/Movie/actors> [
        a <http://schema.org/Person> ;
        # names: "Andrey Gradov", "Evgeniya Simonova"
        <http://schema.org/Person/name> "Андрей Градов", "Евгения Симонова" ;
        <http://schema.org/Person/url> <http://www.ivi.ru/watch/11139//person/Andrey-Gradov-1353>,
            <http://www.ivi.ru/watch/11139//person/Evgeniya-Simonova-1635>
    ] ;
    # description: "The Soviet social drama «Live broadcast» tells the story of old friends, one of whom set the other up, thereby ending their relationship."
    <http://schema.org/Movie/description> "Советская социальная драма «Прямая трансляция» рассказывает историю давних друзей, один из которых подставил товарища, прекратив тем самым отношения. " ;
    <http://schema.org/Movie/director> [
        a <http://schema.org/Person> ;
        # name: "Oleg Safaraliev"
        <http://schema.org/Person/name> "Олег Сафаралиев"
    ] ;
    <http://schema.org/Movie/duration> "PT1H18M0S" ;
    # genres: "Dramas", "Soviet cinema"
    <http://schema.org/Movie/genre> "Драмы", "Советское кино" ;
    <http://schema.org/Movie/image> <http://thumbs.ivi.ru/f36.vcp.digitalaccess.ru/contents/9/b/1bbfcf963ce6731cbc2a5d72c12877.jpg/172x264/> ;
    # name: "Live broadcast"
    <http://schema.org/Movie/name> "Прямая трансляция" ;
    <http://schema.org/Movie/url> <http://www.ivi.ru/watch/11139/#play> .
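The step from marked-up HTML to extracted statements can be illustrated with a minimal sketch. It is not the extractor used for the dataset: the class name `MicrodataParser`, the nested-dict output shape and the abridged sample page are assumptions made for illustration, and the parser covers only `itemscope`, `itemtype`, `itemprop` and `content` (it ignores `href`/`src` values, `itemref` and other parts of the Microdata specification).

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collect Microdata items as nested dicts (simplified, not spec-complete)."""

    # Elements without a closing tag; they never get a stack frame.
    VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}

    def __init__(self):
        super().__init__()
        self.items = []   # top-level items found in the page
        self.stack = []   # one frame per currently open element

    def _current_item(self):
        # The innermost open element that carries itemscope owns new properties.
        for frame in reversed(self.stack):
            if frame["item"] is not None:
                return frame["item"]
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        owner = self._current_item()
        frame = {"tag": tag, "item": None, "prop": None, "text": None, "owner": owner}
        if "itemscope" in attrs:
            item = {"type": attrs.get("itemtype"), "properties": {}}
            frame["item"] = item
            if prop and owner is not None:
                owner["properties"].setdefault(prop, []).append(item)  # nested entity
            else:
                self.items.append(item)
        elif prop and owner is not None:
            if "content" in attrs:   # e.g. <meta itemprop="name" content="..."/>
                owner["properties"].setdefault(prop, []).append(attrs["content"])
            else:                    # literal value is the element's text
                frame["prop"], frame["text"] = prop, []
        if tag not in self.VOID_TAGS:
            self.stack.append(frame)

    def handle_data(self, data):
        for frame in self.stack:
            if frame["text"] is not None:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if not any(frame["tag"] == tag for frame in self.stack):
            return  # stray or void closing tag
        while self.stack:
            frame = self.stack.pop()
            if frame["prop"] is not None:
                value = "".join(frame["text"]).strip()
                frame["owner"]["properties"].setdefault(frame["prop"], []).append(value)
            if frame["tag"] == tag:
                break


# Abridged version of the ivi.ru example (assumption: closing </div> added).
SAMPLE = """
<div class="content-main" itemscope itemtype="http://schema.org/Movie">
  <meta itemprop="name" content="Live broadcast"/>
  <div itemprop="description"><p>Soviet social drama about old friends...</p></div>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    <span class="itemprop" itemprop="name">Oleg Safaraliev</span>
  </div>
</div>
"""

parser = MicrodataParser()
parser.feed(SAMPLE)
movie = parser.items[0]
print(movie["type"], movie["properties"]["name"])
```

Here the director ends up as a nested item with its own type, while the name and description stay plain literals.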

… 12 ….

Statistics on the dataset

● 1.13 million entities of type <http://schema.org/Movie>

● 130 000 unique movies

● Most data comes from kinopoisk.ru and kinoestet.ru

● No interlinking between sites or with external databases

● Thus not quite Linked Data

… 14 ….

Distribution of Schema.org predicates

… 15 ….

Typical movie object

A typical movie object has 12 properties:

genre, name, actors, description, producer, duration, alternativeHeadline, musicBy, aggregateRating, image, contentRating, director.

Sometimes the following 4 are added:

productionCompany, dateCreated, datePublished, inLanguage.

Other predicates are rare in the wild.
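One way to arrive at such a "typical" property set is to count how many objects carry each predicate. The sketch below assumes that each movie object is represented as a dict keyed by predicate name; the function names, the toy data and the 50% threshold are hypothetical and are not taken from the study.

```python
from collections import Counter


def predicate_frequencies(objects):
    """Return {predicate: share of objects that carry it at least once}."""
    counts = Counter()
    for props in objects:
        counts.update(set(props))  # count each predicate once per object
    total = len(objects)
    return {p: n / total for p, n in counts.items()} if total else {}


def typical_predicates(objects, threshold=0.5):
    """Predicates present in at least `threshold` of all objects (assumed cut-off)."""
    freqs = predicate_frequencies(objects)
    return sorted(p for p, share in freqs.items() if share >= threshold)


# Toy data, for illustration only.
toy_objects = [
    {"name": ["A"], "genre": ["Drama"], "director": ["X"]},
    {"name": ["B"], "genre": ["Comedy"]},
    {"name": ["C"], "inLanguage": ["ru"]},
]

print(typical_predicates(toy_objects))
```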

… 16 ….

Genre diversity

… 18 ….

Typical problems

Objects are described by literals instead of proper entities of the corresponding class:

[] <http://schema.org/Movie/actors> "Андрей Градов, Евгения Симонова" .

...instead of:

# names: "Andrey Gradov", "Evgeniya Simonova"
[] <http://schema.org/Movie/actor> [
        a <http://schema.org/Person> ;
        <http://schema.org/Person/name> "Андрей Градов" ;
        <http://schema.org/Person/url> <http://www.ivi.ru/watch/11139//person/Andrey-Gradov-1353>
    ] ;
    <http://schema.org/Movie/actor> [
        a <http://schema.org/Person> ;
        <http://schema.org/Person/name> "Евгения Симонова" ;
        <http://schema.org/Person/url> <http://www.ivi.ru/watch/11139//person/Evgeniya-Simonova-1635>
    ] .

The 'director' predicate links to an entity in only 10% of all entities, the 'actors' predicate in 1%, and the 'producer' predicate almost never.
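The difference between a literal and a proper entity can be made concrete with a small sketch. It assumes that movies are dicts mapping a predicate name to a list of values, where a string is a literal and a dict is an entity. Both helper names are hypothetical. The share is computed here over individual values, which may differ from how the percentages above were counted.

```python
PERSON = "http://schema.org/Person"


def entity_share(movies, predicate):
    """Share of values under `predicate` that are entities rather than literals."""
    values = [v for movie in movies for v in movie.get(predicate, [])]
    if not values:
        return 0.0
    return sum(isinstance(v, dict) for v in values) / len(values)


def literal_to_entities(literal, entity_type=PERSON):
    """Split a comma-separated literal into one entity per name."""
    names = [name.strip() for name in literal.split(",") if name.strip()]
    return [{"type": entity_type, "properties": {"name": [n]}} for n in names]


# Toy data, for illustration only.
toy_movies = [
    {"director": ["Oleg Safaraliev"]},                                       # literal
    {"director": [{"type": PERSON, "properties": {"name": ["Oleg Safaraliev"]}}]},  # entity
]

print(entity_share(toy_movies, "director"))
print(literal_to_entities("Andrey Gradov, Evgeniya Simonova"))
```

Splitting on commas recovers separate names, but not the per-person URLs, which exist only when the site marks up each person as an entity.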

… 19 ….

Typical problems

Use of the deprecated <http://schema.org/Movie/actors> instead of the more fine-grained <http://schema.org/Movie/actor>.

Interestingly, when the latter (up-to-date) predicate is used, actors are more often described as separate entities (cf. previous slide).

Up-to-date predicates and conformance to the Semantic Web spirit come together :-)
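A consumer of such data can fold the deprecated predicate into its replacement before analysis. The sketch below assumes triples are plain `(subject, predicate, object)` tuples; the mapping holds only the one pair named on this slide, and the function name is hypothetical.

```python
# Deprecated predicate -> current predicate (only the pair discussed above).
SUPERSEDED = {
    "http://schema.org/Movie/actors": "http://schema.org/Movie/actor",
}


def normalize_predicates(triples):
    """Rewrite deprecated predicates, leaving all other triples untouched."""
    return [(s, SUPERSEDED.get(p, p), o) for s, p, o in triples]


# Toy data, for illustration only.
toy_triples = [
    ("_:m1", "http://schema.org/Movie/actors", "Andrey Gradov"),
    ("_:m1", "http://schema.org/Movie/name", "Live broadcast"),
]

print(normalize_predicates(toy_triples))
```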

… 20 ….

Typical problems

Lack of necessary predicates: Kinopoisk.ru uses <http://schema.org/Movie/actor> to describe dubbing actors and voice-over artists in dubbed movies. This creates considerable confusion, and the problem can be solved only by introducing a new predicate to Schema.org.

<h1 class="moviename-big" itemprop="name">Матрица</h1> <!-- "The Matrix" -->
<span itemprop="alternativeHeadline">The Matrix</span>
<tr>
  <td class="type">режиссер</td> <!-- "director" -->
  <td itemprop="director"><a href="/name/23329/">Энди Вачовски</a>, <a href="/name/23330/">Лана Вачовски</a></td> <!-- Andy Wachowski, Lana Wachowski -->
</tr>
...
<h4>Роли дублировали:</h4> <!-- "Dubbed by:" -->
<ul>
  <li itemprop="actors"><a href="/name/1616407/">Всеволод Кузнецов</a></li> <!-- Vsevolod Kuznetsov -->
  <li itemprop="actors"><a href="/name/287413/">Владимир Вихров</a></li> <!-- Vladimir Vikhrov -->
  <li itemprop="actors"><a href="/name/1654400/">Елена Соловьева</a></li> <!-- Elena Solovyova -->
</ul>

… 22 ….

Movies and DBPedia

A subset of 12 thousand movies was linked to DBPedia by matching the movie title and director name to entities of type <http://dbpedia.org/ontology/Film>. We succeeded in matching about 80% of the movies in the subset.

In the future we plan to interlink all entities in our dataset and to use the LinkedMDB database in addition to DBPedia.
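Matching on title and director name can be sketched as an exact lookup on normalized keys. This is an illustration under stated assumptions, not the procedure used in the study: the normalization rules, the function names, the record layout and the `example.org` URIs are hypothetical placeholders, and the candidate list stands in for whatever was retrieved from DBPedia.

```python
import re


def normalize(text):
    """Case-fold, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.casefold()).split())


def build_index(candidates):
    """Map (title, director) keys to candidate URIs."""
    index = {}
    for uri, title, director in candidates:
        index.setdefault((normalize(title), normalize(director)), []).append(uri)
    return index


def link(movie, index):
    """Return a URI only when a key matches exactly one candidate, else None."""
    for title in movie["titles"]:
        for director in movie["directors"]:
            hits = index.get((normalize(title), normalize(director)), [])
            if len(hits) == 1:
                return hits[0]
    return None


# Placeholder candidates (hypothetical URIs, not real DBPedia resources).
toy_candidates = [
    ("http://example.org/film/the-matrix", "The Matrix", "Lana Wachowski"),
    ("http://example.org/film/other", "Another Film", "Some Director"),
]
toy_index = build_index(toy_candidates)

# A movie record with both its name and its alternativeHeadline as titles.
toy_movie = {"titles": ["Матрица", "The Matrix"], "directors": ["lana  wachowski"]}
print(link(toy_movie, toy_index))
```

Trying every title variant matters here because the Russian name rarely matches an English-language resource, while the alternativeHeadline often does.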

… 24 ….

Network of directors and actors

186 000 actors, 20 000 directors
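A network like this can be derived from the movie objects by connecting each director to the actors of the same movie. The slide does not say how the graph was built, so the sketch below is only one plausible construction; the record layout, the function name and the toy names are hypothetical.

```python
from collections import Counter
from itertools import product


def collaboration_edges(movies):
    """Count, for each (director, actor) pair, the movies they share."""
    edges = Counter()
    for movie in movies:
        directors = set(movie.get("directors", []))
        actors = set(movie.get("actors", []))
        for director, actor in product(directors, actors):
            edges[(director, actor)] += 1
    return edges


# Toy data, for illustration only.
toy_movies = [
    {"directors": ["Director A"], "actors": ["Actor X", "Actor Y"]},
    {"directors": ["Director A"], "actors": ["Actor X"]},
    {"directors": ["Director B"], "actors": ["Actor Y"]},
]

edges = collaboration_edges(toy_movies)
for (director, actor), weight in sorted(edges.items()):
    print(director, actor, weight)
```

The edge weights can then be passed to any graph library for layout or centrality measures.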

… 25 ….

Dataset availability

You can do all these pretty things with the presented dataset.

It is available for download at http://ling.go.mail.ru/semanticweb/ (gzipped Turtle triples, 536 MB).

It will be updated regularly after new crawling sessions.

License: Creative Commons Attribution-ShareAlike

… 26 ….

Conclusions

1. Semantic markup is widely used on Russian movie-related sites, and its use is growing.

2. It generally follows global trends on the WWW.

3. Microdata with the Schema.org vocabulary is the most promising markup standard from a search engine's point of view.

4. RDF graphs built from a large number of semantically enriched pages can be a good source of data for network analysis, etc.

… 27 ….

Thanks for your attention!

Andrey Kutuzov and Max Ionov
