41
TableNet: An Approach for Determining Fine-grained Relations for Wikipedia Tables Besnik Fetahu, Avishek Anand, Maria Koutraki @FetahuBesnik https://github.com/bfetahu/wiki_tables The complete code to be published soon!!

TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: An Approach for Determining Fine-grained

Relations for Wikipedia TablesBesnik Fetahu, Avishek Anand, Maria Koutraki

@FetahuBesnikhttps://github.com/bfetahu/wiki_tables

The complete code to be published soon!!

Page 2: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Why tables?

Page 3: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Rich Factual Information in Tables

3

100 meters — Running Race

Page 4: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Rich Factual Information in Tables

3

100 meters — Running RaceSeason's bests

Page 5: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Rich Factual Information in Tables

4

What is the time difference for the best time in Women’s 100 Meter Race in 1974 and 2018?

• No single source can answer such a complex question.

• Factual information in tables is scattered in isolated tables across different articles.

Page 6: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Rich Factual Information in Tables

5

What is the time difference for the best time in Women’s 100 Meter Race in 1974 and 2018?

Page 7: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Rich Factual Information in Tables

5

What is the time difference for the best time in Women’s 100 Meter Race in 1974 and 2018?

Season's

Page 8: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Rich Factual Information in Tables

5

What is the time difference for the best time in Women’s 100 Meter Race in 1974 and 2018?

Season's

Page 9: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Rich Factual Information in Tables

5

What is the time difference for the best time in Women’s 100 Meter Race in 1974 and 2018?

Season's

Answer: 0.64s

Page 10: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Tables in Wikipedia

Page 11: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

• Tables are one of the richest sources of factual information in Wikipedia and the Web:

• ~530k Wikipedia article contain tables

• ~3M extracted tables

• Results in > 32M rows

• Tables have the potential to cover hundreds of millions of facts and can be used to assess fact consistency and validity if tables can be interlinked.

Tables in Wikipedia

7Extractor code: https://github.com/bfetahu/wiki_tables

Page 12: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Tables in Wikipedia

8Extractor code: https://github.com/bfetahu/wiki_tables

Page 13: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Tables in Wikipedia

8Extractor code: https://github.com/bfetahu/wiki_tables

Page 14: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Tables in Wikipedia

8Extractor code: https://github.com/bfetahu/wiki_tables

Page 15: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Tables in Wikipedia

9Extractor code: https://github.com/bfetahu/wiki_tables

Page 16: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Challenges and Potential of Tables

Page 17: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

• Extraction and Canonicalization problems:

• Lack of explicit schemas (what do the columns mean?!)

• Non-standard authoring practices

• Optimized for human readability and display

• Isolated factual information in Tables:

• Tables do not contain any explicit relations to other related tables

• Tables often subsume or are equivalent to other tables

• Joining the different tables can provide a richer picture of the factual information present in tables

• Alignment challenges:

• Table columns are ambiguous out of their context in which they appear (e.g. “Name” for actors, scientists, animals, race type etc.)

• Subject (or key) columns or a combination of columns is necessary for any two tables to be considered for alignment

• Large number of tables as candidates for alignment

Challenges

11

Page 18: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet Approach

Page 19: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Objectives

13

Automatically extract tables from Wikipedia and efficiently align tables with high accuracy and coverage with fine-grained relation types.

Page 20: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Objectives

13

Rank Time Athlete Country Date12

4

10.4910.64

10.70

Florence G.-JoynerCarmelita JeterMarion Jones

Shelly-Ann F.-Pryce

United StatesUnited StatesUnited States

Jamaica

16.07.198820.09.200912.09.199829.06.2012

3 10.65

t2: All-time top 25 women

Rank Time Athlete Country Date

1

2

4

9.58

9.69

9.72

Usain Bolt

Tyson GayYohan BlakeAsafa Powell

Jamaica

United StatesJamaicaJamaica

16.08.2009

20.09.200923.08.201223.08.2012

t3: All-time top 25 menArea Men Women

Time Athlete Nation Time Athlete NationAfrica 9.85 Olusoji Fasuba Nigeria 10.78 Murielle Ahoure Ivory CoastAsia 9.91 Femi Ogunode Qatar 10.79 Li Xuemei China

Europe 9.86 Francis Obikwelu Portugal 10.73 Christine Arron France

South America 10.00 Robson da Silva Brazil 11.01 An Cláudia Lemos Brazil

t1: Continental records

t4: Top 10 Junior (under-20) menagerestriction

topmenrecords

genderrestriction

topwomen

records

gender

restriction

genderrestriction age

restriction

Date:DateCountry:LocationAthlete:Person{F}

t2schema:

Date:DateCountry:LocationAthlete:Person{M,age<20}

t4schema:

Date:DateCountry:LocationAthlete:Person{M}

t3schema:

Area/Nation:LocationAthlete:Person{M,F}

t1schema:

TableRelations:

(t1,t4):rel_1=genderRestriction(t1,t4)rel_2=ageRestriction(t4,t1)

(t1,t3):rel_1=topMenRecords(t3,t1)rel_2=genderRestriction(t1,t2)

(t3,t4):rel_1=ageRestriction(t4,t3)

(t1,t2):rel_1=genderRestriction(t1,t2)rel_2=topWomanRecords(t2,t1) Rank Time Athlete Country Date

12

9.9710.00

Trayvon BromellTrentavis Friday

Darrel BrownJeff Demps

United StatesUnited States

Trinidad and TobagoJamaica

13.06.201405.07.2014

24.08.200328.06.20083 10.01

Yoshiihide Kiryu Japan 29.0.4.2013

Automatically extract tables from Wikipedia and efficiently align tables with high accuracy and coverage with fine-grained relation types.

Page 21: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet Overview

14

Table Extraction and Schema Description Generation

Candidate Pair Generation

Table Alignment

TableNet

Page 22: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet Overview

14

Table Extraction and Schema Description Generation

Candidate Pair Generation

Table Alignment

TableNet

Page 23: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Candidate Pair Generation

Page 24: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

• More than 530k Wikipedia articles contain tables

• Consider all pairs as relevant ?!→ 530K! (factorial)

• Efficient algorithm are needed to filter out irrelevant article pairs.

• We propose an efficient approach to reduce the amount of irrelevant pairs and at the same time maintain a high coverage of relevant article pairs, whose tables can be aligned.

TableNet: Candidate Generation

16

Page 25: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Candidate Generation

17

Article Abstract Features

Page 26: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Candidate Generation

17

Article Abstract Features

• doc2Vec similarity between abstracts

• Avg. word2Vec abstract vector similarity

• tf-idf similarity between abstracts

Page 27: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Candidate Generation

18

Categories and KBs Features

Page 28: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Candidate Generation

18

Categories and KBs Features

Wikipedia categories might lack in quality → Computing embeddings of categories using graph embeddings approaches:

• Similarity in embedding space between categories

• Direct and parent categories overlap

• DBpedia type overlap ….

Page 29: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Candidate Generation

19

Table Features

• Column title similarity

• Column title distance

Page 30: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Table Alignment

Page 31: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Table Alignment

21

cia ci

b cin c j

a c jb c j

n

columnrep.

φiaLSTM

cellsφi

b φin φ j

aφd φ jb φ j

n

h ia h i

b h in hd h j

a h jb h j

n

delim

iter

col-by-col attention

output layer

Column Description

Instance Values

Column Type

For a table pair predict their relation type

r(ti, tj) ⟶ {subPartOf, equivalent, none}

Page 32: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Table Alignment

22

Table Column Representations

• Column Description • Represent the column description tokens based on their world embeddings (Glove) • Disadvantage: Column descriptions can be ambiguous (e.g. Title column for Books

or Movies)

• Instance Values • Avg. embedding of the cell values based on graph embeddings (node2Vec trained

on Wikipedia anchor graph)

• Column Type • Represent LCA category through graph embeddings

Page 33: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Evaluation Setup

Page 34: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

• Random sample of 50 Wikipedia (source) articles, respectively their tables

• Ground-truth considerations:

• Coverage: ensure that for each of the tables source articles, we have all relevant tables for alignment

• Efficiency: iteratively manually construct filters to remove articles whose tables cannot yield any relation for the tables of interest

• Labelling: crowdsource the remaining pairs for labelling (3 annotators per table)

• Labelling Quality: comprehensive worker training through detailed instructions and examples before joining the task.

• Ground-truth stats for the 17k crowdsourced table pairs:

• 52% pairs with noalignment

• 24% pairs with equivalent

• 23% pairs with subPartOf.

Ground-truth Data

24

Page 35: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Candidate Generation

Page 36: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

TableNet: Candidate Generation Results

26

876 876823

780 762710

604517

420

307

66950 15450 8652 50472987

1854927

515309

206Rµ = 0.81∆ = 0.98

0.0

0.2

0.4

0.6

0.8

1.0

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9τ

score

a

a

PRµ

Use the computed features for pre-filtering, then apply a RF (tweaked to increase recall) for classifying candidates as relevant/irrelevant.

Page 37: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Table Alignment

Page 38: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Table Alignment Results

28

TableNet based on a BiLSTM with column-by-column attention can determine fine-grained relation types with an accuracy of 83%.

Page 39: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Conclusions

Page 40: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

• Contributions:

• TableNet - a knowledge graph of aligned tables

• Fine grained relation types between tables: equivalent, subPartOf

• Improvement over existing works, with fine grained table relations

• Exhaustive ground truth from 50 Wikipedia articles resulting 17K table pairs

• Resources for TableNet

• Data & Code: https://github.com/bfetahu/wiki_tables

• Note: The candidate feature generation code and the table alignment code will be published before the TheWebCon 2019.

Page 41: TableNet: An Approach for Determining Fine-grained …...24.08.2003 3 10.01 28.06.2008 Yoshiihide Kiryu Japan 29.0.4.2013 Automatically extract tables from Wikipedia and efficiently

Thank you! Questions?