Text Extraction using Regular Expressions Shih-Pei Chen Project Manager, China Biographical Database, Harvard University The Digitization in the Humanities Workshop @ Rice University April 5-7, 2013


Page 1: Text Extraction using Regular Expressions

Text Extraction using Regular Expressions

Shih-Pei Chen
Project Manager, China Biographical Database, Harvard University

The Digitization in the Humanities Workshop @ Rice University
April 5-7, 2013

Page 2: Text Extraction using Regular Expressions

Downloads for today

• Slides and sample texts (a package)
  – On OWL-Space

• Text editor(s)
  – Mac users: please install TextWrangler
  – PC users: EmEditor, UltraEdit, or both

• CBDB Regex Machine
  – http://isites.harvard.edu/icb/icb.do?keyword=k16229&pageid=icb.page515758 -- download the CBDBRegexMachine_July2012.zip on this page

Page 3: Text Extraction using Regular Expressions

The China Biographical Database – Modeling Life Histories – from anecdote to data

Biography Prosopography

Social Network Analysis Geospatial Analysis

Page 4: Text Extraction using Regular Expressions

Big Data

• What are you going to do with the great amount of text on the Web?
  – Is there information you want to search for?
  – Is there something you want to analyze?

• CBDB experience: we use regular expressions to extract biographical data from the full texts of thousands of historical records

Page 5: Text Extraction using Regular Expressions

What regular expressions can do for you

• Beyond keyword search
  – Search for written variations
  – Search for patterns
  – Search and replace => tagging

• You don’t have to learn programming in order to use regular expressions
  – Just use a text editor that supports regex

Page 6: Text Extraction using Regular Expressions

Today

• Part 1: Learn regular expressions
  – Hands-on exercises matching regexes against some texts in a text editor

• Part 2: A real play
  – Using regexes + Search and Replace in a text editor

• Part 3: CBDB Regex Machine
  – Using a graphical user interface to design regexes and test them against a text
  – Tagging the matches in XML tags

Page 7: Text Extraction using Regular Expressions

UNDERSTANDING REGULAR EXPRESSIONS

Page 8: Text Extraction using Regular Expressions

Regular expressions

• A powerful way of describing patterns in strings

• You describe the pattern; the machine matches it against the text (a string of letters, digits, and symbols)

Page 9: Text Extraction using Regular Expressions

Automata

• Imagine a belt sending characters in line:

[Figure: a belt carrying the characters a, b, c, d, e one at a time; the question asked is "abcde?"]

• The string in line (the input): abcde
  – It can match this pattern: abcde
  – It can also match this pattern: bc (but only the substring “bc” in the input will be matched: a partial match)

Page 10: Text Extraction using Regular Expressions

Comparing the input against the regex character by character

Input: a b c d e

Regex: a b c d e  Match!

Page 11: Text Extraction using Regular Expressions

Comparing the input against the regex character by character

Input: a b c d e

Regex: b c  Match!

Behind the scenes: the robot picks up a in the input and finds that a does not match b, the first character in the regex. The robot then throws a out and picks up the next character in the input, which is b. This time the robot finds that the two b’s match, so it moves on to compare the next pair, c and c.

Page 12: Text Extraction using Regular Expressions

Comparing the input against the regex character by character

Input: a b c d e

Regex: b d  No match!
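The three belt examples above can be tried outside a text editor as well. The following is a minimal sketch using Python's built-in re module (an assumption of this note; the workshop itself uses text editors). The helper name find is hypothetical.

```python
import re

def find(pattern, text):
    """Return (start position, matched text) for the first match, or None."""
    m = re.search(pattern, text)
    return None if m is None else (m.start(), m.group(0))

text = "abcde"

print(find("abcde", text))  # the whole input matches, starting at position 0
print(find("bc", text))     # partial match: only "bc", starting at position 1
print(find("bd", text))     # None: b and d are never next to each other
```

re.search scans the input from left to right and stops at the first position where the whole pattern fits, which is the same behavior the robot on the slides illustrates.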

Page 13: Text Extraction using Regular Expressions

Switch to a good text editor

• Text editors that support regex
  – Windows: EmEditor or UltraEdit (neither is free)
  – Mac: TextWrangler (free)

Page 14: Text Extraction using Regular Expressions

Regular expressions – the syntax

What can you describe using regular expressions?

Page 15: Text Extraction using Regular Expressions

Characters

• Literal characters
  – abcde , bc , bd (string match)

• Non-printable characters
  – \t (tab), \r (carriage return), \n (line feed)
  – Line breaks: \r (classic Mac OS), \n (Unix), or \r\n (Windows)

• Special characters (reserved characters / metacharacters)
  – [ ] \ ^ $ . | ? * + ( )

Examples come from: Regular-expressions.info

Page 16: Text Extraction using Regular Expressions

Exercise #1

• Download and install one of the above text editors.
• Download the “regex text.txt” file. Open it in your text editor.
• Call up the “Search” or “Find” function in your editor, and try the regexes in Exercise#1 to see which regexes can be matched.

Page 17: Text Extraction using Regular Expressions

Character Classes – what can appear at a certain position?

• gr[ae]y can match gray or grey
  – Characters in [ ] form a class (a bag of characters)
  – gr[ae]y will match neither graay nor graey!

• Common character classes
  – [a-z] , [A-Z] , [a-zA-Z] , [0-9]

• Exercise#2

Input: g r a y (or g r e y)

Regex: g r [ae] y
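As a quick check of the character-class rule, here is a small sketch in Python's re module (not part of the original exercise files). The word list is made up for illustration.

```python
import re

pattern = re.compile(r"gr[ae]y")

# [ae] stands for exactly one character taken from the bag {a, e}.
words = ["gray", "grey", "graay", "graey"]
matched = [w for w in words if pattern.fullmatch(w)]

print(matched)  # only the two spellings with a single a or e
```

fullmatch is used here so that each word has to fit the pattern from its first character to its last.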

Page 18: Text Extraction using Regular Expressions

Shorthand Character Classes

• \d (digit): shorthand for [0-9]
• \D (non-digit character)

• \w (word character): [A-Za-z0-9_]
• \W (non-word character)

• \s (whitespace character): [ \t\r\n] (space, tab, carriage return, line feed)
• \S (non-whitespace character)

Page 19: Text Extraction using Regular Expressions

Negated Character Classes

• Any character except these
  – [^aeiou] : not one of a, e, i, o, u
  – [^\d] : not a digit
  – [^\s] : not white space
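The shorthand and negated classes can be compared side by side in a short Python sketch. Two assumptions are worth flagging: the sample string is invented, and the re.ASCII flag is passed so that \d, \w and \s behave like the ASCII definitions on the slides (by default Python also accepts non-ASCII digits and letters).

```python
import re

sample = "Room 42, floor 3\tEast_Wing"  # invented sample text

digits = re.findall(r"\d", sample, flags=re.ASCII)   # single digits
words = re.findall(r"\w+", sample, flags=re.ASCII)   # runs of word characters
spaces = re.findall(r"\s", sample, flags=re.ASCII)   # spaces and the tab

# Negated classes: every character that is NOT in the bag.
not_vowels = re.findall(r"[^aeiou]", "regex")
not_digits = re.findall(r"[^\d]", "a1b2")

print(digits, words, spaces, not_vowels, not_digits)
```

Note that the underscore counts as a word character, so East_Wing comes back as one word.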

Page 20: Text Extraction using Regular Expressions

Dot .

• . can match (almost) any single character
  – Except the newline character => . is shorthand for [^\n] (Unix), [^\r] (classic Mac OS), [^\r\n] (Windows)

• Exercise#3

Page 21: Text Extraction using Regular Expressions

Optional and Repeat operators

• 3 operators for expressing repetition
  – ? : zero or one time (optional)
  – + : repeat one or more times
  – * : repeat zero or more times

• Repeat a certain number of times:
  – \d{1,4} : one to four digits
  – \d{1,} : one digit or more (equivalent to \d+ )

• Exercise#4
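The repeat operators can be explored with a few one-liners. This is a sketch in Python's re module; the test words and the number string are invented for illustration and are not taken from the exercise file.

```python
import re

# ? : the u is optional (zero or one time)
optional = [w for w in ["color", "colour", "colouur"]
            if re.fullmatch(r"colou?r", w)]

# * : zero or more b's between a and c
zero_or_more = [w for w in ["ac", "abc", "abbbc", "adc"]
                if re.fullmatch(r"ab*c", w)]

numbers = "7 42 2013 123456"

# + : one or more digits, as many as possible
one_or_more = re.findall(r"\d+", numbers)

# {1,4} : at most four digits per match, so a longer run is cut into pieces
one_to_four = re.findall(r"\d{1,4}", numbers)

print(optional, zero_or_more, one_or_more, one_to_four)
```

The last result shows a common surprise: \d{1,4} does not reject a six-digit number, it simply matches it in two pieces.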

Page 22: Text Extraction using Regular Expressions

Alternation (list of words)

• Useful when you have a list of words, and you want to find the occurrences of each
  – cat|dog|mouse|fish : find any one of the four
  – regex|regular expression : find either regex or regular expression

• Exercise#5

Examples come from: Regular-expressions.info

Page 23: Text Extraction using Regular Expressions

Capturing writing variations

• Suppose you want to find all the occurrences mentioning regular expressions, but it can be written as “regular expression(s)” or “regex(es)”.

• Use this pattern to find them all: reg(ular expressions?|ex(es)?)

Examples come from: Regular-expressions.info
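Both the word-list pattern and the variation pattern can be tested in a few lines. This is a sketch in Python's re module with invented sample sentences. finditer is used for the second pattern because Python's findall returns the contents of the groups rather than the whole match when a pattern contains parentheses.

```python
import re

animals = re.findall(r"cat|dog|mouse|fish",
                     "The cat chased the mouse past a dogfish.")

variations = re.compile(r"reg(ular expressions?|ex(es)?)")
sentence = ("A regex is a regular expression; "
            "regexes and regular expressions are the same thing.")

# group(0) is the whole matched string, whatever the groups captured.
found = [m.group(0) for m in variations.finditer(sentence)]

print(animals)
print(found)
```

The first result also shows that alternation has no notion of word boundaries: dogfish yields both dog and fish.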

Page 24: Text Extraction using Regular Expressions

What can regular expressions do for you

• Provide better full-text search
  – Find a word without worrying about its variations
  – Find specific info written in regular forms:
    • dates, phone numbers, email addresses, HTML/XML tags, quotes, all-capital abbreviations…
  – Find two words near each other

• Perform formatting tasks on a text
• Automate tagging

Page 25: Text Extraction using Regular Expressions

Find information written in regular forms

• Exercise #6: finding dates written as mm/dd/yy
  – \d\d.\d\d.\d\d
  – \d\d[- /.]\d\d[- /.]\d\d
  – [0-1]\d[- /.][0-3]\d[- /.]\d\d
  – (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d

• Exercise #7: finding texts within double quotes
  – ".*"
  – "[^"\r\n]*"
  – "[^"]*"

Examples come from: Regular-expressions.info
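The point of both exercises is that each successive pattern is stricter than the one before. A sketch in Python's re module makes the difference visible; the sample strings are invented and are not the workshop's exercise text.

```python
import re

def matches(pattern, text):
    """Return every whole match of pattern in text, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

dates = "Filed 04/07/13, heard 13/45/99, void 19/39/00, code 12345678."

loose = matches(r"\d\d.\d\d.\d\d", dates)
separators = matches(r"\d\d[- /.]\d\d[- /.]\d\d", dates)
ranges = matches(r"[0-1]\d[- /.][0-3]\d[- /.]\d\d", dates)
strict = matches(r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d", dates)

quoted = 'He said "yes" and then "no".'

greedy = matches(r'".*"', quoted)          # runs to the LAST quote on the line
careful = matches(r'"[^"\r\n]*"', quoted)  # stops at the next quote

print(loose, separators, ranges, strict, sep="\n")
print(greedy, careful, sep="\n")
```

The loosest date pattern even accepts the eight-digit code, because the dot stands for any character; only the last pattern rejects month 19 and day 39. Note that the strict pattern still accepts day 00, so it is tighter, not perfect.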

Page 26: Text Extraction using Regular Expressions

Grouping and back references

• Exercise #8: finding HTML/XML tags
  – <([a-z]+)\b[^>]*>.*?</\1>
  – <date format="mmddyy">04/07/13</date>

• () : capturing group
• \1 : back reference to the 1st captured group
  – If there is more than one pair of (), use \2, \3, etc.
  – The whole matched string is referenced as \0

Examples come from: Regular-expressions.info
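A back reference forces the closing tag to repeat whatever the first group captured. The sketch below tries the slide's pattern in Python's re module; the mismatched-tag string is an invented counterexample.

```python
import re

TAG = re.compile(r"<([a-z]+)\b[^>]*>.*?</\1>")

good = TAG.search('<date format="mmddyy">04/07/13</date>')
bad = TAG.search("<b>bold</i>")  # opening and closing names differ

print(good.group(0))  # the whole matched element
print(good.group(1))  # the tag name captured by the first pair of ()
print(bad)            # None, because </i> is not </b>
```

In Python the whole match is read with group(0); the \0 notation on the slide is what some text editors use in their Replace field.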

Page 27: Text Extraction using Regular Expressions

Formatting task

• Trimming unnecessary white space
  – Replace [ \t]{2,} with a single space
  – Delete leading whitespace within a line: replace ^[ \t]+ with nothing (empty string)
  – Trim trailing whitespace of a line: replace [ \t]+$ with nothing (empty string)

• Transform a text into a list of words
  – Append a line break after each word
  – Replace uppercase letters with lowercase
  – Replace punctuation symbols with nothing
  – Count the frequency of each word in MS Excel

Examples come from: Regular-expressions.info
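Both formatting tasks can be scripted once the search-and-replace steps are clear. The following is a rough sketch in Python, with two departures from the slide that should be read as assumptions: the counting is done with collections.Counter instead of MS Excel, and the function names tidy and word_counts are hypothetical.

```python
import re
from collections import Counter

def tidy(text):
    """Apply the three whitespace replacements, line by line."""
    text = re.sub(r"[ \t]{2,}", " ", text)                  # runs -> one space
    text = re.sub(r"^[ \t]+", "", text, flags=re.MULTILINE)  # leading
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)  # trailing
    return text

def word_counts(text):
    """Lowercase, drop punctuation, split into words, and count them."""
    lowered = text.lower()
    no_punctuation = re.sub(r"[^\w\s]", "", lowered)
    return Counter(no_punctuation.split())

print(repr(tidy("  a   b  \n\tc  d\t")))
print(word_counts("The cat saw the dog. The dog ran!"))
```

re.MULTILINE makes ^ and $ match at the start and end of every line, which is how most text editors treat them by default.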

Page 28: Text Extraction using Regular Expressions

Automate tagging

• Idea: find dates via some regex, and then surround each of the matches with tags: <date>some date</date>

• Replace our date pattern: (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d
  with: <date>\0</date>

• Try it in the date exercise
• Once you can tag useful info in a text, it will be easy to pull it out.
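The tag-then-extract idea can be sketched in a few lines of Python. One difference from the editor syntax should be noted: in Python's re.sub the whole match is written \g<0> in the replacement, not \0. The sample sentence and the function name tag_dates are invented for illustration.

```python
import re

DATE = r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d"

def tag_dates(text):
    """Wrap every match of the date pattern in <date>...</date>."""
    return re.sub(DATE, r"<date>\g<0></date>", text)

tagged = tag_dates("Tried on 04/07/13 and again on 12-31-99.")
print(tagged)

# Once the text is tagged, pulling the dates out is a simple search.
extracted = re.findall(r"<date>(.*?)</date>", tagged)
print(extracted)
```

The second step is the payoff: the tags turn a pattern-matching problem into a plain lookup.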

Page 29: Text Extraction using Regular Expressions

Resources for regular expressions

• Regular-expressions.info
  – http://www.regular-expressions.info/

• Profhacker article: “Finding the Women of Heimskringla with Regular Expressions”
  – http://chronicle.com/blogs/profhacker/finding-the-women-of-heimskringla-with-regular-expressions/38631

• <oo>→<dh> Digital humanities article:
  – http://dh.obdurodon.org/regex.html

Page 30: Text Extraction using Regular Expressions

PLAY WITH TEXTS

Page 31: Text Extraction using Regular Expressions

Our texts today

• Get familiar with it
• Use regex to do some search
• Search in files
• Then use the techniques to prepare the text for Regex Machine

Page 32: Text Extraction using Regular Expressions

Texts for today: Old Bailey Proceedings

• You can find samples in today’s package under “Old Bailey Proceedings”

• Or, you can download them on your own:
  – select all and copy
  – paste it into a text editor
  – save it as UTF-8 without BOM (byte order mark)

Page 33: Text Extraction using Regular Expressions

Old Bailey’s Proceeding: the HTML presentation

Page 34: Text Extraction using Regular Expressions

Text form

Page 35: Text Extraction using Regular Expressions

Try some search

• Search for: t\d{8}-\d{3}
• Replace it with: <refNo>\0</refNo>

Page 36: Text Extraction using Regular Expressions

Exercise: Preparing your text in a specific format (to feed to some software)

Page 37: Text Extraction using Regular Expressions

How to convert? Observe!

• Goal: to make each case a single line

• Patterns?
• Every case begins with a line of “Reference Number” and ends before the next “Reference Number”

• We have to remove all the line breaks

• Tricky things: does the text contain XML reserved characters (&, <, >, …)?

Page 38: Text Extraction using Regular Expressions

Conversion Steps: Search and Replace + regexes

• Replace the XML reserved characters (replace & first, so that the & in the other entities is not escaped again):
  – & => &amp;   % => &#37;
  – < => &lt;   > => &gt;

• Get rid of “285.”: ^\d{3}\. => nothing (empty string)
• Replace all the line breaks (\r, \n, \r\n) with nothing
• Reassign the line breaks by “Reference number:”
  – Reference number: => \rReference number:

• Optional: Get rid of “See original”
• The order above is crucial
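The conversion steps can also be chained in a short script, which makes it easier to see why the order matters. This is only a sketch under stated assumptions: the sample record is invented (it merely imitates the shape of an Old Bailey text), the heading is spelled "Reference Number:" as in the later slides, \n is used for the reinserted line break instead of \r, and the function name one_case_per_line is hypothetical.

```python
import re

def one_case_per_line(raw):
    """Turn a multi-line transcript into one line per case."""
    # 1. Escape XML reserved characters; & must come first.
    text = raw.replace("&", "&amp;")
    text = text.replace("%", "&#37;")
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    # 2. Drop the running case number ("285.") at the start of a line.
    #    This must happen while the line breaks still exist.
    text = re.sub(r"^\d{3}\.", "", text, flags=re.MULTILINE)
    # 3. Remove every line break.
    text = re.sub(r"\r\n|\r|\n", "", text)
    # 4. Put a line break back in front of each case heading.
    text = text.replace("Reference Number:", "\nReference Number:")
    return text.strip().split("\n")

# Invented sample input, for illustration only.
raw = (
    "Reference Number: t18500107-285\n"
    "285. JOHN SMITH, stealing 1 coat & 2 hats.\n"
    "NOT GUILTY.\n"
    "Reference Number: t18500107-286\n"
    "286. MARY JONES, stealing goods worth <5 shillings.\n"
    "GUILTY. Aged 20.\n"
)

for line in one_case_per_line(raw):
    print(line)
```

If step 3 ran before step 2, the ^ anchor would have only one line start left to match, and most of the case numbers would stay in the text.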

Page 39: Text Extraction using Regular Expressions

What does the Regex Machine do?

• A graphical user interface (GUI) that enables people who do not have programming skills to
  – graphically design patterns
  – match them against a corpus of texts
  – see results immediately via a user-friendly color-coding scheme (quick feedback)
  – export to XML => automates (part of) the tagging procedure

Credit: Elif Yamagil

Page 40: Text Extraction using Regular Expressions

Downloading CBDB RegexMachine

• Regex Machine (on CBDB website)
  – http://isites.harvard.edu/icb/icb.do?keyword=k16229&pageid=icb.page515758 -- download the CBDBRegexMachine_July2012.zip on this page

• Prerequisites:
  – Make sure your machine has the Java Runtime Environment (JRE) installed. If not, you can download it here: http://www.java.com/en/download/

Page 41: Text Extraction using Regular Expressions

Run the Regex Machine

• Double click the CBDBRegexMachine.jar
• In the “Select Your User Director” window, select the folder where you put your text files.
  – Tip: don’t double click the folder! A single click is all you need.

Page 42: Text Extraction using Regular Expressions

GUI

List of active regex

List of “terms”

Your Text

Info Box

Page 43: Text Extraction using Regular Expressions

Open the text we just prepared

• File → Open. Select your text file.

Page 44: Text Extraction using Regular Expressions

Create Active Regexes

• First regex: capture the reference number
  – Example: t18500107-285
  – Pattern: t\d{8}-\d{3}
  – It’s always good to test it first in a text editor

• Create it in Regex Machine
  – Think first: is it one unit? Does it contain different parts?

Page 45: Text Extraction using Regular Expressions

1. Click

2. Click

3. Fill in your regex and give it a name

Page 46: Text Extraction using Regular Expressions

4. Give the whole regex a name. Then choose a color!

5. Click on the Regex. Matches are highlighted!

Page 47: Text Extraction using Regular Expressions

Export to XML

6. File → Export

7. Set records per file to 1000

8. Then an XML file should be generated in the same folder as the text file!

Page 48: Text Extraction using Regular Expressions

XML header added.

Each line is surrounded by the tag <bio> with line number.

The number is now tagged with the Handle you specified!

Page 49: Text Extraction using Regular Expressions

Try another regex

• Second regex: capture the “Reference Number:” and the number
  – Example: Reference Number: t18500107-285
  – Pattern: Reference Number: t\d{8}-\d{3}

• Create it in Regex Machine
  – Think first: Do you want it to be tagged as a whole? Should the match contain different parts?

Page 50: Text Extraction using Regular Expressions

Using multiple groups in an Active Regex

• Add another Active Regex. Create two groups:
• Group #1: Reference Number:
• Group #2: t\d{8}-\d{3}

Page 51: Text Extraction using Regular Expressions

Group #1

Group #2

Capture this group!

Page 52: Text Extraction using Regular Expressions

Click on the new one to highlight the matched strings.

Then click Move Up. Export to XML.

Page 53: Text Extraction using Regular Expressions

The whole string is tagged, and the number part is “captured” as an attribute!

Page 54: Text Extraction using Regular Expressions

What else to capture?

• Name of defendant(s)

• Verdict: guilty or not guilty, age, punishment

• Any patterns observed?

Page 55: Text Extraction using Regular Expressions

• Pattern for verdicts
  – If NOT GUILTY, normally nothing more.
  – If GUILTY, normally followed by Aged \d{2} and then the punishment.
  – There can be more than one verdict in each record (if there is more than one defendant)

Page 56: Text Extraction using Regular Expressions

NOT GUILTY

1: Give the whole regex a name. It will become the XML tag name surrounding the entire matched string

2: Give the pattern as the exact text “NOT GUILTY”

Handle: give it a name

Capturing group: The name here will be used as the attribute name of the XML tag. The captured value will become the value of the attribute.

Page 57: Text Extraction using Regular Expressions

GUILTY

• GUILTY.*Aged ?\d{1,3}[^—]*—.*
  – Group #1: guilty or not => GUILTY
  – Group #2: age => \d{1,3}
  – Group #3: punishment => .*
  – Something in between the desired groups
    • Between group 1 & 2: .*Aged ?
    • Between group 2 & 3: [^—]*—

• Need to create 5 groups!
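The five-group breakdown can be written as one pattern with five pairs of parentheses, of which only the first, third and fifth hold wanted data. The sketch below uses Python's re module rather than the Regex Machine; the sample verdict lines are invented to imitate the format described on the slides, and the function name parse_guilty is hypothetical.

```python
import re

# Five groups: verdict, filler, age, filler, punishment.
VERDICT = re.compile(r"(GUILTY)(.*Aged ?)(\d{1,3})([^—]*—)(.*)")

def parse_guilty(line):
    """Return verdict, age and punishment, or None if the line does not fit."""
    m = VERDICT.search(line)
    if m is None:
        return None
    return {
        "verdict": m.group(1),
        "age": int(m.group(3)),
        "punishment": m.group(5).strip(),
    }

print(parse_guilty("GUILTY. Aged 25.—Confined Six Months."))
print(parse_guilty("NOT GUILTY."))  # no "Aged ...—" part, so no match
```

The filler groups are needed because the Regex Machine builds a pattern out of consecutive groups; in a script they could just as well be left unparenthesized.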

Page 58: Text Extraction using Regular Expressions
Page 59: Text Extraction using Regular Expressions
Page 60: Text Extraction using Regular Expressions
Page 61: Text Extraction using Regular Expressions

Export to XML

You can then use a browser to open it (more readable). You can further use an XML editor to correct mistakes (validation).

Page 62: Text Extraction using Regular Expressions

Open the XML in Excel

*Please note that not every XML file can be interpreted well in Excel. The two handle different data structures: Excel is for tabular data, while XML is for trees, which are much more flexible. *Also, the Mac version of MS Excel doesn’t read XML!

Page 63: Text Extraction using Regular Expressions

One last thing

• How about the names of the defendants?

• What is the pattern?
  – The names are right after the reference number.
  – They are all in capital letters.
  – There can be more than one name. In that case, a mixture of spaces, commas, and “and” is used to connect the names.

Page 64: Text Extraction using Regular Expressions

• Test this pattern in a text editor:
  – Reference Number: ?[a-z]\d{8}-\d{3}\s+([A-Z' ]+),?(?: ?([A-Z' ]+),)*(?: ?and([A-Z' ]+))?
  – What does it capture?

• Break into groups:
  – refNo: [a-z]\d{8}-\d{3}
  – First defendant: ([A-Z' ]+)
  – Second (or more) defendant: (?: ?([A-Z' ]+),)*
  – Last defendant: (?: ?and([A-Z' ]+))?
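One way to answer "What does it capture?" is to run the pattern on a few lines and inspect the groups. The sketch below uses Python's re module; the three sample lines and all names in them are invented, and the function name defendants is hypothetical.

```python
import re

NAMES = re.compile(
    r"Reference Number: ?[a-z]\d{8}-\d{3}\s+([A-Z' ]+),?"
    r"(?: ?([A-Z' ]+),)*(?: ?and([A-Z' ]+))?"
)

def defendants(line):
    """Return the captured names, stripped of surrounding spaces."""
    m = NAMES.search(line)
    if m is None:
        return []
    return [g.strip() for g in m.groups() if g and g.strip()]

one = "Reference Number: t18500107-286 JOHN SMITH was indicted"
three = ("Reference Number: t18500107-285 "
         "JOHN SMITH, MARY JONES, and ANN LEE were indicted")
four = ("Reference Number: t18500107-287 "
        "TOM ROE, SAM POE, DAN LOW, and JIM KAY were indicted")

print(defendants(one))
print(defendants(three))
print(defendants(four))  # one of the middle names is missing
```

With four defendants a name is lost: a capturing group inside a repeated (...)* keeps only its last repetition, at least in Python's engine. The captured names also carry stray spaces, because the space is part of the character class, which is why the sketch strips them.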

Page 65: Text Extraction using Regular Expressions
Page 66: Text Extraction using Regular Expressions
Page 67: Text Extraction using Regular Expressions
Page 68: Text Extraction using Regular Expressions

Good!

Some problem

Page 69: Text Extraction using Regular Expressions
Page 70: Text Extraction using Regular Expressions

A real extraction project on local gazetteers – by Adam Mitchell

Raw descriptions written in the gazetteers (extracted)

Columns: Source, Date, Disaster type, Location

Disaster types: Earthquakes and fires; Epidemics and Insect Plagues; Snow, Ice, and Tempests; Floods and Droughts; Famines, Hyperinflation, and Relief Efforts.

Page 71: Text Extraction using Regular Expressions


Collect data at the local levels and then aggregate

Page 72: Text Extraction using Regular Expressions

Reflections on using the Regex Machine

• Carefully design your regexes and groups
• Think ahead about what you want in the XML
• Tuning regexes can take dozens of hours
• It’s difficult to find regexes that capture everything -- there are always omissions, exceptions, etc.
• Keep in mind the cost of tuning “perfect” regexes.

Page 73: Text Extraction using Regular Expressions

Put regular expressions in a bigger context

• Using regex to search for / capture data of interest
  – only works when the piece of information is written in regular patterns

• What if there are no regular patterns? How can we teach machines to identify important information in a corpus of texts?
  – If it’s location names or person names => Named Entity Recognition (NER)
  – If it’s concepts => topic modeling, …
  – Text mining, machine learning, … and more

Page 74: Text Extraction using Regular Expressions

Conclusion

• Hope to let you understand what regex is
• Hope to give you some hands-on experience in using regexes against some texts
• Hope to give you some sense of what machines can do with texts
• => Your imagination: you can begin to think about what texts are available and what you can do with them.

Page 75: Text Extraction using Regular Expressions

ENJOY PLAYING!