Page 1: Regular-expression Generator

Regular-expression Generator

Tae Woo Kim

Page 2: Regular-expression Generator

Problem

• Extracting the facts from digitized documents

Page 3: Regular-expression Generator

Motivation

• About 500 pages out of 830 pages

• 85 facts per page: first name, surname, birth year, and death year

500 pages × 85 facts per page = about 42,500 facts!

Pages 4-16: Regular-expression Generator

How it Works

[Thirteen slides with this title; their content is not preserved in this transcript.]

Page 17: Regular-expression Generator

Behind the Scenes

241213 . _ Mary_Eliza _ Warner , _ b . _ 1826 , _ dau

Page 18: Regular-expression Generator

Behind the Scenes

dau . _ of _ Samuel_Selden _ Warner _ and _

241213 . _ Mary_Eliza _ Warner , _ b . _ 1826 , _ dau

Page 19: Regular-expression Generator

Behind the Scenes

_ and _ Azubah _ Tully ; _ m

. _ 1850 , _ Joel_M. _ Gloyd _ ( who

243311 . _ Abigail_Huntington _ Lathrop _ (

widow ) , _

Doonton , _

dau . _ of _ Mary _ Ely _ and _

dau . _ of _ Samuel_Selden _ Warner _ and _

241213 . _ Mary_Eliza _ Warner , _ b . _ 1826 , _ dau

delimiter delimiter delimiter

Field Field
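The delimiter/field labelling above suggests one way a pattern could be assembled: keep delimiters as literal text and generalize each field to a letter class with a length range. The following is only a minimal sketch of that idea, not the presenter's implementation. The function names `field_pattern` and `generate_regex`, the tuple layout of `parts`, and the rule of using the tight minimum and maximum word length are all assumptions. The ranges shown on the later slides do not always equal the tight bounds of the examples shown there, so the actual rule may differ.

```python
import re


def field_pattern(examples):
    """Letters-only class whose length range covers every example word (assumed rule)."""
    lengths = [len(word) for word in examples]
    return "[A-Za-z]{%d,%d}" % (min(lengths), max(lengths))


def generate_regex(parts):
    """Join literal delimiters and generalized fields into one regex string.

    parts is a list of ("delimiter", text) or ("field", [example words]) tuples.
    """
    pieces = []
    for kind, value in parts:
        if kind == "delimiter":
            # Escape each chunk of the delimiter and let \s stand for each space.
            pieces.append(r"\s".join(re.escape(chunk) for chunk in value.split(" ")))
        else:
            pieces.append(field_pattern(value))
    return "".join(pieces)


# Hypothetical input built from two of the phrases on the slides.
parts = [
    ("delimiter", "and "),
    ("field", ["Azubah", "Gerard"]),
    ("delimiter", " "),
    ("field", ["Tully", "Lathrop"]),
    ("delimiter", "; m. "),
]
generated = generate_regex(parts)
```

With these two examples the first field becomes `[A-Za-z]{6,6}`, which is tighter than anything a real system would want; more annotated examples would widen the ranges.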

Page 20: Regular-expression Generator

Behind the Scenes

dau\.\sof\s[A-Za-z]{2,9}(\s[A-Za-z]{3,9}){0,2}\s[A-Za-z]{1,9}\sand\s

dau . _ of _ Samuel_Selden _ Warner _ and _

dau . _ of _ Mary _ Ely _ and _

dau . _ of _ Nathan_Tilestone _ Jennings _ and _

dau . _ of _ Caleb_Halstead _ Andruss _ and _
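As a quick check, the pattern on this slide can be run against its four example phrases. This sketch assumes that each `_` in the slide notation stands for a single space and that the spaces around tokens are only visual separators; the rejoined strings and the helper name `matches_all` are illustrative, not part of the original system.

```python
import re

# Pattern copied from the slide.
PARENT_PATTERN = re.compile(
    r"dau\.\sof\s[A-Za-z]{2,9}(\s[A-Za-z]{3,9}){0,2}\s[A-Za-z]{1,9}\sand\s"
)

# Slide examples with tokens rejoined (assumption: "_" means one space).
EXAMPLES = [
    "dau. of Samuel Selden Warner and ",
    "dau. of Mary Ely and ",
    "dau. of Nathan Tilestone Jennings and ",
    "dau. of Caleb Halstead Andruss and ",
]


def matches_all(pattern, examples):
    """Return True when the pattern matches every example in full."""
    return all(pattern.fullmatch(text) is not None for text in examples)


all_matched = matches_all(PARENT_PATTERN, EXAMPLES)
```

The optional group `(\s[A-Za-z]{3,9}){0,2}` is what lets one pattern cover both the two-word name "Mary Ely" and the three-word names.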

Page 21: Regular-expression Generator

Behind the Scenes

and\s[A-Za-z]{3,9}\s[A-Za-z]{3,10};\sm\.\s

_ and _ Azubah _ Tully ; _ m .

_ and _ Gerard _ Lathrop ; _ m .

_ and _ Gerard _ Lathrop ; _ m .

_ and _ Gerard _ Lathrop ; _ m .

Page 22: Regular-expression Generator

Behind the Scenes

[0-9]{1}\.\s[A-Za-z]{2,7}(\s[A-Za-z]{1,12}){1},\sb\.\s[0-9]{4},\sd\.\s[0-9]{4}\.

1 . _ Mary_Ely , _ b . _ 1836 , _ d . _ 1859 .

2 . _ William_Gerard , _ b . _ 1858 , _ d . _ 1861 .

1 . _ Maria_Jennings , _ b . _ 1838 , _ d . _ 1840 .

3 . _ Donald_McKenzie , _ b . _ 1840 , _ d . _ 1843 .

1 . _ Charles_Halstead , _ b . _ 1857 , _ d . _ 1861 .
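To turn a match into facts, the same pattern can be given capture groups. The slide's pattern has no groups for the name or years; the named groups `given`, `birth`, and `death` below are additions for illustration, and the function name `extract_children` and the rejoined sample text (reading `_` as one space) are assumptions as well.

```python
import re

# Slide pattern with named groups added around the name and the two years.
CHILD_PATTERN = re.compile(
    r"[0-9]{1}\.\s(?P<given>[A-Za-z]{2,7}(\s[A-Za-z]{1,12}){1}),"
    r"\sb\.\s(?P<birth>[0-9]{4}),\sd\.\s(?P<death>[0-9]{4})\."
)


def extract_children(text):
    """Return one dict per match with the given name, birth year, and death year."""
    return [
        {
            "given": match.group("given"),
            "birth": match.group("birth"),
            "death": match.group("death"),
        }
        for match in CHILD_PATTERN.finditer(text)
    ]


# Two of the slide's examples, rejoined into running text.
sample = "1. Mary Ely, b. 1836, d. 1859. 3. Donald McKenzie, b. 1840, d. 1843."
records = extract_children(sample)
```

Here `records` holds two entries, the first being `{"given": "Mary Ely", "birth": "1836", "death": "1859"}`.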

Page 23: Regular-expression Generator

[0-9]{1}\.\s[A-Za-z]{3,10}(\s[A-Za-z]{3,10}){1},\sb\.\s[0-9]{4}\.

Behind the Scenes

2 . _ William_Gerard , _ b . _ 1840 .

4 . _ Emma_Goble , _ b . _ 1862 .

2 . _ Gerard_Lathrop , _ b . _ 1838 .

4 . _ Anna_Margaretta , _ b . _ 1843 .

5 . _ Anna_Catherine , _ b . _ 1845 .

3 . _ Theodore_Andruss , _ b . _ 1860 .
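Once several patterns exist, each can be run over a page to see which ones fire and how often. The sketch below applies the two child-entry patterns from the slides to a short hypothetical page. The dictionary keys, the function name `count_matches`, and the `PAGE` text (slide examples rejoined with `_` read as one space) are assumptions for illustration.

```python
import re

# Patterns copied from the slides; the key names are made up for this sketch.
PATTERNS = {
    "born_and_died": r"[0-9]{1}\.\s[A-Za-z]{2,7}(\s[A-Za-z]{1,12}){1},"
                     r"\sb\.\s[0-9]{4},\sd\.\s[0-9]{4}\.",
    "born_only": r"[0-9]{1}\.\s[A-Za-z]{3,10}(\s[A-Za-z]{3,10}){1},"
                 r"\sb\.\s[0-9]{4}\.",
}

# Hypothetical page text assembled from three slide examples.
PAGE = (
    "1. Mary Ely, b. 1836, d. 1859. "
    "2. William Gerard, b. 1840. "
    "4. Emma Goble, b. 1862. "
)


def count_matches(patterns, text):
    """Map each pattern name to the number of non-overlapping matches in text."""
    return {
        name: sum(1 for _ in re.finditer(pattern, text))
        for name, pattern in patterns.items()
    }


counts = count_matches(PATTERNS, PAGE)
used = [name for name, count in counts.items() if count > 0]
```

On this page `counts` is `{"born_and_died": 1, "born_only": 2}`, so both patterns count as used.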

Page 24: Regular-expression Generator

Results

• Finds 19 patterns

  o 4 patterns used

  o 25/85 facts found by the system

Page 25: Regular-expression Generator

Results

• Next page (before annotation)

  o 5/19 previous patterns used

  o 33/87 facts automatically annotated

Page 26: Regular-expression Generator

Results

• Next page (after annotation)

  o Finds 16 new patterns

    • 3 patterns used

    • 12 new facts found

  o 45/86 facts automatically annotated

• Page 1: 29%

• Page 2: 52%

• Page 3: 69%
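The percentages for the first two pages follow from the fact counts reported on the Results slides. A small sketch of the arithmetic, with the function name `coverage_percent` chosen for illustration; the transcript gives no counts behind the 69% figure for the third page, so it is not reproduced here.

```python
def coverage_percent(found, total):
    """Share of facts annotated automatically, rounded to a whole percent."""
    return round(100 * found / total)


page1 = coverage_percent(25, 85)  # first page: 25 of 85 facts
page2 = coverage_percent(45, 86)  # next page after annotation: 45 of 86 facts
```

These give 29 and 52, matching the slide. The 33/87 figure before annotation on the second page works out to about 38%.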

Page 27: Regular-expression Generator

Conclusion

• Automatically finds and uses patterns

• Decreases amount of user effort