© Searchmetrics. All rights reserved. Do not distribute without permission.

Enriching content with Knowledge Base by Search Keywords and Wikidata

Fang Xu [email protected] @allxufang

Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata


Page 1: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Enriching content with Knowledge Base by Search Keywords and Wikidata

Fang Xu [email protected] @allxufang

Page 2: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Data Science @ Searchmetrics

Data-driven search and content optimization marketing

• Learning from keywords

• Content optimization

• Data visualization

Page 3: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Looooots of Data

• 120 Million Domains

• 600 Million Keywords

• 120 Billion Links

• 25,000 Billion Social Signals

• 25 PB raw data

Page 4: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Content Production in Real-time

Authors submit content

✓ Rate the content’s effectiveness

✓ Feedback to optimize and enrich it

Page 5: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Beyond keywords

• Keyword
    • Typos
    • Ambiguous
    • Sparse

• Entity
    • Augmented with metadata
    • Relations among entities

Page 6: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Entity

Q64 (the Wikidata identifier for Berlin)

Page 7: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Page 8: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Knowledge Base (KB)

http://brendangriffen.com/blog/gow-programming-languages

Page 9: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

KB Timeline

[Figure: timeline of knowledge bases. Surviving labels: 2001, 2005, 2008, 2012, 2012, 2014, and "Knowledge vaults".]

Page 10: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Why Wikidata

• Free collaborative KB

• Continuous evolution

• Open multilingual data

• Mapping to other KBs

Page 11: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Link content to KB

• Entity Linking -- free text to entities
    • Blog posts
    • Tweets
    • Keywords
    • User-generated content

• Entities from a knowledge base
    • Wikipedia
    • Wikidata
    • Domain-specific KBs

Page 12: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Image from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM 2008

Entity Linking

Page 13: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Main Problems

• Identify important keywords to link in the text

• Link to the right entity

Page 14: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Dictionary of keywords to KB entities

Search keyword mentions in text
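
A dictionary lookup of this kind can be sketched as a longest-match scan over word n-grams. The dictionary entries, the `find_mentions` helper, and the tokenization rule below are illustrative assumptions, not the system described in the talk.

```python
import re

# Hypothetical keyword-to-entity dictionary; real entries would come from search keyword data.
KEYWORD_TO_ENTITY = {
    "berlin": "Berlin",
    "berlin wall": "Berlin_Wall",
    "germany": "Germany",
}
MAX_NGRAM = max(len(k.split()) for k in KEYWORD_TO_ENTITY)


def find_mentions(text):
    """Return (token index, keyword, entity) tuples, preferring the longest match."""
    tokens = re.findall(r"\w+", text.lower())
    mentions = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first so "berlin wall" wins over "berlin".
        for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in KEYWORD_TO_ENTITY:
                mentions.append((i, candidate, KEYWORD_TO_ENTITY[candidate]))
                i += n
                break
        else:
            i += 1  # no keyword starts at this token
    return mentions


mentions = find_mentions("The Berlin Wall divided Berlin, the capital of Germany.")
```

Preferring the longest n-gram is one simple way to avoid linking "Berlin" inside "Berlin Wall"; other overlap policies are equally plausible.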

Page 15: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Keyword to wiki URIs in top SERP
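
One way to read this step: for each keyword, look at its top search results and keep any Wikipedia URLs that appear. The sketch below assumes the result URLs are already available; the `SERP` data, the `wiki_uris` helper, and the `top_n` cutoff are hypothetical.

```python
from urllib.parse import urlparse, unquote

# Hypothetical SERP data: keyword -> ranked result URLs (made up for illustration).
SERP = {
    "tree data structure": [
        "https://www.example.com/trees",
        "https://en.wikipedia.org/wiki/Tree_(data_structure)",
    ],
    "weather today": ["https://www.example.com/weather"],
}


def wiki_uris(urls, top_n=10):
    """Return Wikipedia article titles found among the top_n result URLs."""
    titles = []
    for url in urls[:top_n]:
        parsed = urlparse(url)
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            titles.append(unquote(parsed.path[len("/wiki/"):]))
    return titles


# Keep only keywords that have at least one Wikipedia result.
keyword_to_wiki = {}
for keyword, urls in SERP.items():
    titles = wiki_uris(urls)
    if titles:
        keyword_to_wiki[keyword] = titles
```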

Page 16: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Not all keywords are useful

Keyword Cleaning:

• Navigational or factual words

• Non-frequent words

• Non-Latin letters
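
The three cleaning rules above could be combined into a single predicate. This is a minimal sketch: the word list, the volume threshold, and the ASCII-only character test are assumptions chosen for illustration, and the Cyrillic entry with its volume is invented.

```python
import re

# Hypothetical word list and threshold (illustrative only).
NAVIGATIONAL_OR_FACTUAL = {"login", "www", "facts", "definition"}
MIN_VOLUME = 20
# ASCII-only approximation of "Latin letters"; accented letters would need a wider class.
LATIN_ONLY = re.compile(r"^[a-z0-9 \-']+$")


def is_clean(keyword, volume):
    """Keep a keyword only if it passes all three cleaning rules."""
    kw = keyword.lower()
    if any(token in NAVIGATIONAL_OR_FACTUAL for token in kw.split()):
        return False  # navigational or factual words
    if volume < MIN_VOLUME:
        return False  # non-frequent keywords
    return bool(LATIN_ONLY.match(kw))  # reject non-Latin letters


candidates = {"germany": 268583, "germany facts": 4291, "ger many": 16, "германия": 5000}
cleaned = [kw for kw, volume in candidates.items() if is_clean(kw, volume)]
```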

Page 17: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Not all keywords are useful

Keyword Filtering:

• Starting or ending tokens

• Stopwords

• Part-of-speech tags

• Wikipedia popularity:
    • popular wiki URIs for one keyword

• Search popularity:
    • popular keywords for one wiki URI
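
One plausible reading of the "starting or ending tokens" and "stopwords" filters is to reject keywords that begin or end with a stopword. The stopword list and the `passes_token_filter` helper below are assumptions; part-of-speech filtering is left out because it needs a tagger.

```python
# Illustrative stopword list; a real system would likely use a fuller, language-specific one.
STOPWORDS = {"the", "of", "in", "a", "an", "and", "to", "for"}


def passes_token_filter(keyword):
    """Reject empty keywords and keywords that start or end with a stopword."""
    tokens = keyword.lower().split()
    if not tokens:
        return False
    return tokens[0] not in STOPWORDS and tokens[-1] not in STOPWORDS


keywords = ["capital of germany", "the berlin wall", "history of", "of"]
kept = [kw for kw in keywords if passes_token_filter(kw)]
```

Stopwords in the middle of a keyword ("capital of germany") are kept on purpose, since they are often part of a meaningful phrase.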

Page 18: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Search Popularity Filtering

Keyword                 Search Popularity (Volume)
germany                 268583
germany facts           4291
germany article         24
german encyclopedia     23
germany encyclopedia    19
germany t               18
ger many                16
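
A simple way to apply such a filter is to keep only keywords that account for a minimum share of the total search volume for one wiki URI. The volumes below are the ones shown on the slide; the 1% cutoff and the `popular_keywords` helper are hypothetical.

```python
# Search volumes from the slide; all keywords are assumed to map to the same wiki URI.
volumes = {
    "germany": 268583,
    "germany facts": 4291,
    "germany article": 24,
    "german encyclopedia": 23,
    "germany encyclopedia": 19,
    "germany t": 18,
    "ger many": 16,
}

MIN_SHARE = 0.01  # hypothetical cutoff: at least 1% of the URI's total search volume


def popular_keywords(volumes, min_share=MIN_SHARE):
    """Return keywords whose share of the total volume reaches min_share."""
    total = sum(volumes.values())
    return [kw for kw, volume in volumes.items() if volume / total >= min_share]


popular = popular_keywords(volumes)
```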

Page 19: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Parse Wikidata dump & extract entities as JSON

Entity data

{
  entity: "Berlin",
  Freebase Id: "/m/0156q",
  OpenStreetMap Relation identifier: 62422,
  alias: ["Berlin, Germany"],
  capital of: [ "Germany", "Kingdom of Prussia", "Weimar Republic", "Brandenburg-Prussia", "Free State of Prussia", ... ],
  contains administrative territorial entity: [ "Mitte", "Friedrichshain-Kreuzberg", "Pankow", "Charlottenburg-Wilmersdorf", "Spandau", "Steglitz-Zehlendorf", "Tempelhof-Schöneberg", "Neukölln", "Treptow-Köpenick", ... ],
  coordinate location: [ { altitude: null, latitude: 52.516666666667, longitude: 13.383333333333, precision: 0.016666666666667 } ],
  country: "Germany",
  ... ...
}
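
As a rough illustration of the parsing step: the Wikidata JSON dump is a single JSON array with one entity per line, so it can be streamed line by line. The sketch below uses a tiny hand-written sample instead of a real dump, and the `iter_entities` and `summarize` helpers are hypothetical. Unlike the slide's output, it keeps raw property IDs (for example P646, the Freebase ID) rather than resolving them to labels.

```python
import io
import json


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata JSON dump."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue  # skip the array brackets and blank lines
        yield json.loads(line)


def summarize(entity, lang="en"):
    """Flatten an entity to its label, aliases, and raw claim values keyed by property ID."""
    claims = {}
    for prop, statements in entity.get("claims", {}).items():
        values = []
        for statement in statements:
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") == "value":
                values.append(snak["datavalue"]["value"])
        claims[prop] = values
    return {
        "id": entity["id"],
        "entity": entity.get("labels", {}).get(lang, {}).get("value"),
        "alias": [a["value"] for a in entity.get("aliases", {}).get(lang, [])],
        "claims": claims,
    }


# Tiny hand-written sample in the dump's layout; real dumps are usually read from a compressed file.
sample = io.StringIO(
    '[\n'
    '{"id": "Q64", "labels": {"en": {"language": "en", "value": "Berlin"}}, '
    '"aliases": {"en": [{"language": "en", "value": "Berlin, Germany"}]}, '
    '"claims": {"P646": [{"mainsnak": {"snaktype": "value", "property": "P646", '
    '"datavalue": {"value": "/m/0156q", "type": "string"}}}]}},\n'
    ']\n'
)
entities = [summarize(e) for e in iter_entities(sample)]
```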

Page 20: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Link to the right Wikipedia entity

Word Sense Disambiguation

Page 21: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Link to Most Common Entities

$$\mathrm{Commonness}(w, e) = \frac{|L_{w,e}|}{\sum_{i} |L_{w,e_i}|}$$

where $|L_{w,e}|$ is the number of links with surface text $w$ to entity $e$.

Wikipedia commonness for the surface text "tree" (Milne and Witten 2008b):

Entity                   Commonness
Tree                     92.82%
Tree (graph theory)      2.94%
Tree (data structure)    2.57%
Tree (set theory)        0.15%
Phylogenetic tree        0.07%
Christmas tree           0.07%
Binary tree              0.04%
Family tree              0.04%
…                        ...
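
The commonness score is the share of links with a given surface text that point to a given entity. A minimal sketch, assuming the link annotations are available as (surface text, entity) pairs; the counts below are invented and do not reproduce the percentages on the slide.

```python
from collections import Counter, defaultdict

# Hypothetical (surface text, target entity) link annotations; counts are made up.
links = (
    [("tree", "Tree")] * 90
    + [("tree", "Tree_(graph_theory)")] * 6
    + [("tree", "Tree_(data_structure)")] * 4
)


def commonness(links):
    """Return {surface: {entity: share of links with that surface pointing to the entity}}."""
    counts = defaultdict(Counter)
    for surface, entity in links:
        counts[surface][entity] += 1
    return {
        surface: {entity: n / sum(c.values()) for entity, n in c.items()}
        for surface, c in counts.items()
    }


scores = commonness(links)
# Linking to the most common entity means taking the arg max per surface text.
best = max(scores["tree"], key=scores["tree"].get)
```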

Page 22: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

https://en.wikipedia.org/wiki/Tree_data_structure

https://en.wikipedia.org/wiki/Tree

Disambiguation

Page 23: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Disambiguation using context

Page 24: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Entity Disambiguation

• Build a Word2Vec model for Wikipedia entities

• Calculate Word2Vec similarity to contextual entities

similarity(Tree_data_structure, context) vs. similarity(Tree, context)
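
The comparison can be sketched with cosine similarity between a candidate's vector and the vectors of entities found in the context. The toy 3-dimensional vectors below stand in for trained Word2Vec embeddings, and averaging over the context entities is an assumption about how the scores are combined.

```python
import math

# Toy vectors standing in for trained Word2Vec embeddings (values invented).
VECTORS = {
    "Tree": [0.9, 0.1, 0.0],
    "Tree_data_structure": [0.1, 0.9, 0.2],
    "Algorithm": [0.0, 0.8, 0.3],
    "Linked_list": [0.1, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def disambiguate(candidates, context_entities):
    """Pick the candidate with the highest mean cosine similarity to the context entities."""
    def score(candidate):
        sims = [cosine(VECTORS[candidate], VECTORS[c]) for c in context_entities]
        return sum(sims) / len(sims)
    return max(candidates, key=score)


choice = disambiguate(["Tree", "Tree_data_structure"], ["Algorithm", "Linked_list"])
```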

Page 25: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Relatedness between Entities

Page 26: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Entity Relatedness

Image from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links

Page 27: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Relatedness Score

• Jaccard similarity

$$\mathrm{Jaccard}(e_1, e_2) = \frac{|\text{Intersection of links to entity } e_1 \text{ and } e_2|}{|\text{Union of links to entity } e_1 \text{ and } e_2|}$$

• Word2Vec similarity of entity to context
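
The Jaccard part of the score is straightforward to compute once each entity has a set of links. In this sketch the Berlin set reuses a few inlinks shown elsewhere in the deck, while the Germany set is invented for the example.

```python
def jaccard(links_a, links_b):
    """Jaccard similarity of two link sets: |intersection| / |union|."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# The Berlin inlinks follow the deck's example; the Germany set is invented.
inlinks_berlin = {"Germany", "Prussia", "Berlin_Wall", "Albert_Einstein"}
inlinks_germany = {"Prussia", "Berlin_Wall", "European_Union", "Albert_Einstein", "France"}

score = jaccard(inlinks_berlin, inlinks_germany)
```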

Page 28: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Wikipedia Data Parsing

Page 29: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Wikipedia Dump

'''Berlin''' is the [[Capital city|capital]] of [[Germany]] and one of its 16 [[states of Germany|states]]. With a population of approximately 3.5 million people,<ref name="Population" /> Berlin is the second [[Largest cities of the European Union by population within city limits|most populous city proper]] and the seventh [[Largest urban areas of the European Union|most populous urban area]] in the [[European Union]].
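
Internal links in wikitext have the form `[[Target]]` or `[[Target|anchor text]]`, so (surface text, target) pairs can be pulled out with a regular expression. This is a simplified sketch: the pattern ignores nested templates and file links, and it does not normalize title capitalization.

```python
import re

# Matches [[Target]] and [[Target|anchor text]]; a simplification of full wikitext syntax.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(wikitext):
    """Return (surface text, target title) pairs for each internal link."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        pairs.append((surface, target.replace(" ", "_")))
    return pairs


text = (
    "'''Berlin''' is the [[Capital city|capital]] of [[Germany]] "
    "and one of its 16 [[states of Germany|states]]."
)
pairs = extract_links(text)
```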

Page 30: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Wikipedia Article as Json

Page 31: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Word2Vec Training

• Collection of plain article text

... ...can4linux ||open_source|| ||controller_area_network|| ||linux_kernel|| ||device_driver|| development started 1990s philips 82c200 controller stand chip 1995 version created bus linux laboratory automation project linux lab project ||freie_universität_berlin|| nxp sja1000 successor supported controller philips 82c200 intel 82527 development powerful ||microcontroller||s integrated controllers capable ... ...
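
A token stream like the one above can be approximated by replacing each wikitext link with a `||target||` token, lowercasing, and dropping stopwords. The stopword list and the `to_training_tokens` helper are assumptions; the exact preprocessing used for the talk is not stated.

```python
import re

WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]+)?\]\]")
STOPWORDS = {"is", "the", "of", "and", "a", "an", "in"}  # illustrative only


def to_training_tokens(wikitext):
    """Replace each link with a ||target|| token, then lowercase and drop stopwords."""
    marked = WIKILINK.sub(
        lambda m: " ||" + m.group(1).strip().lower().replace(" ", "_") + "|| ",
        wikitext,
    )
    # Entity tokens are matched first so their pipes are not split off as punctuation.
    tokens = re.findall(r"\|\|[^|\s]+\|\||\w+", marked.lower())
    return [t for t in tokens if t not in STOPWORDS]


tokens = to_training_tokens("'''Berlin''' is the [[Capital city|capital]] of [[Germany]].")
```

Lists of tokens like this, one per article or sentence, are the kind of input that gensim's `Word2Vec` accepts, so entity tokens and ordinary words end up in the same vector space.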

Page 32: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Linking vectors

• Pairs of URI, annotations

outlink vector [Capital_City, Germany, States_of_Germany, European_Union, Spree, Havel, Berlin-Brandenburg_Metropolitan_Region, ... ...]

inlink vector [Germany, Prussia, Berlin_Wall, Albert_Einstein, Kosmos_(Berlin), Berlin_International_Film_Festival, ... ...]
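
Outlinks come directly from parsing each article, and inlinks can be derived by reversing those edges. A small sketch, with the Berlin outlinks taken from the slide and the other articles invented:

```python
from collections import defaultdict

# Outlinks per article; the Berlin entries follow the slide, the others are invented.
outlinks = {
    "Berlin": ["Capital_City", "Germany", "European_Union"],
    "Germany": ["Berlin", "European_Union"],
    "Berlin_Wall": ["Berlin", "Germany"],
}


def invert(outlinks):
    """Build inlink lists by reversing every (source -> target) edge."""
    inlinks = defaultdict(list)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].append(source)
    return dict(inlinks)


inlinks = invert(outlinks)
```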

Page 33: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Wikipedia Popularity

• Aggregation of annotations

Surface text     Wiki entity      Popularity
United States    United_States    174338
World War II     World_War_II     106483
India            India            95966
France           France           94666
American         United_States    85976
Iran             Iran             83249
Australia        Australia        76655
Germany          Germany          76384
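
Aggregating annotations into a popularity table amounts to counting how often each (surface text, wiki entity) pair occurs as a link. A minimal sketch with a handful of invented annotations:

```python
from collections import Counter

# Hypothetical annotations as (surface text, wiki entity) pairs, one per link occurrence.
annotations = [
    ("United States", "United_States"),
    ("American", "United_States"),
    ("United States", "United_States"),
    ("Germany", "Germany"),
]

popularity = Counter(annotations)
ranked = popularity.most_common()  # highest counts first
```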

Page 34: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Overall System

[Diagram: system architecture. Surviving component labels: Keyword Database, Keyword Processing, Keyword to KB entities, Wiki Parser, Wiki Links, W2V Model, Wikipedia Popularity, User Content, Parser, Keyword Matching, Disambiguation, Relatedness calculation, Result, Entity Linking API.]

Page 35: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Resources

• https://github.com/piskvorky/gensim

• https://github.com/jodaiber/Annotated-WikiExtractor

• https://dumps.wikimedia.org/

• https://dumps.wikimedia.org/wikidatawiki/entities/

Page 36: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Thank you

Page 37: Fang Xu- Enriching content with Knowledge Base by Search Keywords and Wikidata

Questions?

[email protected]

We are hiring