Text Extraction using Regular Expressions

Shih-Pei Chen
Project Manager, China Biographical Database, Harvard University

The Digitization in the Humanities Workshop @ Rice University
April 5-7, 2013

Downloads for today

• Slides and sample texts (a package)
– On OWL-Space

• Text editor(s)
– Mac users: please install TextWrangler
– PC users: EmEditor, UltraEdit, or both

• CBDB Regex Machine
– http://isites.harvard.edu/icb/icb.do?keyword=k16229&pageid=icb.page515758 (download CBDBRegexMachine_July2012.zip on this page)


The China Biographical Database – Modeling Life Histories – from anecdote to data

Biography, Prosopography

Social Network Analysis, Geospatial Analysis

Big Data

• What are you going to do with the vast amount of text on the Web?
– Is there information you want to search for?
– Is there something you want to analyze?

• CBDB experience: we use regular expressions to extract biographical data from thousands of historical records (in their full texts)

What regular expressions can do for you

• Beyond keyword search
– Search for written variations
– Search for patterns
– Search and replace => tagging

• You don’t have to learn programming in order to use regular expressions
– Just use a text editor which supports regex

Today

• Part 1: Learn regular expressions
– Hands on exercises of matching regexes against some texts in a text editor.

• Part 2: A real play
– Using regexes + Search and Replace in a text editor

• Part 3: CBDB Regex Machine
– Using a graphical user interface to design regexes and test them against a text. Tagging the matches in XML tags.

UNDERSTANDING REGULAR EXPRESSIONS

Regular expressions

• A powerful way of describing patterns in strings

• You describe the pattern, the machine matches it against the text (a string of letters, digits, and symbols)

Automata

• Imagine a belt sending characters in line:

• The string in line (the input): abcde
– It can match this pattern: abcde
– It can also match this pattern: bc (but only the substring “bc” in the input will be matched – partial match)

a b c d e

abcde?

Comparing the input against the regex character by character

Input: a b c d e

Regex: a b c d e    Match!

Regex: b c    Match!

Behind the scenes: The robot picks up a in the input and finds that a does not match b, the first character in the regex. The robot then throws a out and picks up the next character in the input, which is b. This time the robot finds that the two b’s match each other.

Regex: b d    No match!
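The same belt-and-robot behaviour can be tried outside a text editor. The following is a minimal sketch in Python (an assumption; the slides themselves only use text editors), where `re.search` plays the robot and scans the input from left to right.

```python
import re

text = "abcde"  # the input on the belt

# re.search reports the first place where the whole pattern fits,
# so a pattern may match only part of the input (a partial match).
full = re.search(r"abcde", text)
partial = re.search(r"bc", text)
missing = re.search(r"bd", text)   # b is never directly followed by d

print(full.group(0))      # abcde
print(partial.group(0))   # bc
print(partial.span())     # (1, 3): the match starts after skipping "a"
print(missing)            # None
```

The span `(1, 3)` corresponds to the robot throwing out a before it finds a b that matches.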

Switch to a good text editor

• Text editors which support regex
– Windows: EmEditor or UltraEdit (neither is free)
– Mac: TextWrangler (free)

Regular expressions – the syntax

What can you describe using regular expressions?

Characters

• Literal characters
– abcde , bc , bd (string match)

• Non-Printable Characters
– \t (tab), \r (carriage return), \n (line feed)
– Line breaks: \r (Mac), \n (Unix), or \r\n (Windows)

• Special characters (reserved characters / metacharacters)
– [ ] \ ^ $ . | ? * + ( )

Examples come from: Regular-expressions.info
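Because the special characters have their own meaning, they need a backslash when you want them as literal text. A small Python sketch, using a made-up sample string:

```python
import re

price = "Total: $3.50 (approx.)"   # made-up sample text

# Unescaped, "." matches any character and "$" marks the end of the text,
# so a backslash is needed to treat them as literal characters.
literal = re.search(r"\$3\.50", price)
print(literal.group(0))            # $3.50

# re.escape adds the backslashes for you.
escaped = re.escape("(approx.)")
print(escaped)                     # \(approx\.\)
print(re.search(escaped, price).group(0))   # (approx.)
```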

Exercise #1

• Download and install one of the above text editors.
• Download the “regex text.txt” file. Open it in your text editor.
• Call up the “Search” or “Find” function in your editor, and try the regexes in Exercise#1 to see which regexes can be matched.

Character Classes – what can appear at a certain position?

• gr[ae]y can match gray or grey
– Characters in [ ] form a class (bag of characters)
– gr[ae]y will match neither graay nor graey!

• Common character classes
– [a-z] , [A-Z] , [a-zA-Z] , [0-9]

• Exercise#2

Input: g r a e y

Regex: g r [ae] y
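A character class stands for exactly one position in the input. A short Python sketch (the word list is chosen for illustration) shows which spellings gr[ae]y accepts:

```python
import re

words = ["gray", "grey", "graay", "graey"]
pattern = re.compile(r"gr[ae]y")

# [ae] consumes exactly one character, so the five-letter spellings fail.
matches = [w for w in words if pattern.search(w)]
print(matches)   # ['gray', 'grey']
```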

Shorthand Character Classes

• \d (digit): shorthand for [0-9]
• \D (non-digit character)

• \w (word character): [A-Za-z0-9_]
• \W (non-word character)

• \s (whitespace character): [ \t\r\n] (space, tab, carriage return, line feed)
• \S (non-whitespace character)

Negated Character Classes

• Any character except these
– [^aeiou] : not one of a, e, i, o, u
– [^\d] : not digit
– [^\s] : not white space
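To see the shorthand and negated classes at work, here is a minimal Python sketch on a made-up string. Note that in Python 3 the shorthands \d and \w also match non-ASCII digits and letters unless the `re.ASCII` flag is given; the sample below is plain ASCII, so this makes no difference here.

```python
import re

sample = "Room 42, Floor 3\tEast_Wing"   # made-up sample text

digits = re.findall(r"\d+", sample)            # runs of digits
words = re.findall(r"\w+", sample)             # runs of letters, digits, underscore
chunks = re.findall(r"\S+", sample)            # runs of non-whitespace
not_vowels = re.findall(r"[^aeiou]", "regex")  # anything except a, e, i, o, u

print(digits)      # ['42', '3']
print(words)       # ['Room', '42', 'Floor', '3', 'East_Wing']
print(chunks)      # ['Room', '42,', 'Floor', '3', 'East_Wing']
print(not_vowels)  # ['r', 'g', 'x']
```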

Dot .

• . can match any single character (almost)
– Except the newline character => . is shorthand for [^\n] (Unix), [^\r] (Mac), [^\r\n] (Windows)

• Exercise#3
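A quick way to check the "almost" is to put a line break in the input. This sketch uses Python, where the dot excludes only \n by default and the `re.DOTALL` flag lifts that restriction; text editors may treat \r differently, as noted above.

```python
import re

text = "cat\ncot"   # two short lines

in_line = re.findall(r"c.t", text)
across = re.search(r"cat.cot", text)                      # dot refuses the line break
across_dotall = re.search(r"cat.cot", text, flags=re.DOTALL)

print(in_line)                    # ['cat', 'cot']
print(across)                     # None
print(across_dotall is not None)  # True
```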

Optional and Repeat operators

• 3 operators for expressing repetition
– ? : zero or one time (optional)
– + : one or more times
– * : zero or more times

• Repeat a certain number of times:
– \d{1,4} : one to four digits
– \d{1,} : one digit or more (equivalent to \d+ )

• Exercise#4
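The repeat operators can be compared side by side in a few lines of Python. The sample words are invented for this sketch.

```python
import re

optional = re.findall(r"colou?r", "color colour colouur")   # u appears 0 or 1 time
one_plus = re.findall(r"go+gle", "ggle gogle google")       # o appears 1 or more times
zero_plus = re.findall(r"go*gle", "ggle gogle google")      # o appears 0 or more times
counted = re.findall(r"\d{1,4}", "7 42 2013 123456")        # 1 to 4 digits at a time

print(optional)   # ['color', 'colour']
print(one_plus)   # ['gogle', 'google']
print(zero_plus)  # ['ggle', 'gogle', 'google']
print(counted)    # ['7', '42', '2013', '1234', '56']
```

The last result shows that a bounded repeat takes as many digits as it is allowed, so a six-digit number is split into two matches.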

Alternation (list of words)

• Useful when you have a list of words, and you want to find the occurrence of each
– cat|dog|mouse|fish : find any one of the four
– regex|regular expression : find either regex or regular expression

• Exercise#5
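As a small illustration in Python (the sentence is made up), alternation returns the listed words in the order they occur in the text:

```python
import re

text = "The cat chased the mouse while the dog watched the fish."
animals = re.findall(r"cat|dog|mouse|fish", text)

print(animals)   # ['cat', 'mouse', 'dog', 'fish']
```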


Capturing writing variations

• Suppose you want to find all the occurrences mentioning regular expressions, but it can be written as “regular expression(s)” or “regex(es)”.

• Use this pattern to find them all: reg(ular expressions?|ex(es)?)
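The pattern above can be checked against all four spellings with a short Python sketch. The sentence is invented; `finditer` is used because `findall` would return the contents of the groups rather than the whole match.

```python
import re

pattern = re.compile(r"reg(ular expressions?|ex(es)?)")
text = "A regex is a regular expression; regexes and regular expressions abound."

# group(0) is the whole matched string, regardless of the inner groups.
found = [m.group(0) for m in pattern.finditer(text)]
print(found)   # ['regex', 'regular expression', 'regexes', 'regular expressions']
```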


What can regular expressions do for you

• Provide better full-text search
– Find a word without worrying about its variations
– Find specific info written in regular forms: dates, phone numbers, email addresses, HTML/XML tags, quotes, all-capital abbreviations…
– Find two words near each other

• Perform formatting tasks on a text
• Automate tagging

Find information written in regular forms

• Exercise #6: finding dates in the form mm/dd/yy
– \d\d.\d\d.\d\d
– \d\d[- /.]\d\d[- /.]\d\d
– [0-1]\d[- /.][0-3]\d[- /.]\d\d
– (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d

• Exercise #7: finding texts within double quotes
– ".*"
– "[^"\r\n]*"
– "[^"]*"
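The difference between the looser and stricter patterns in both exercises shows up quickly on a few test strings. The following Python sketch uses invented candidates; it is one way to compare the patterns, not part of the original exercises.

```python
import re

# Exercise #6: the strictest date pattern rejects impossible months and days.
date = re.compile(r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d")
candidates = ["04/07/13", "12-31-99", "13/01/13", "04/32/13"]
valid = [c for c in candidates if date.fullmatch(c)]
print(valid)    # ['04/07/13', '12-31-99']

# Exercise #7: .* is greedy and runs to the LAST quote on the line,
# while [^"]* stops at the next quote.
line = 'He said "yes" and then "no".'
greedy = re.findall(r'".*"', line)
bounded = re.findall(r'"[^"]*"', line)
print(greedy)   # ['"yes" and then "no"']
print(bounded)  # ['"yes"', '"no"']
```

Even the strictest date pattern is only an approximation: it still accepts a day of 00, for example.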


Grouping and back references

• Exercise #8: finding HTML/XML tags
– <([a-z]+)\b[^>]*>.*?</\1>
– <date format="mmddyy">04/07/13</date>

• () : capturing group
• \1 : back reference to the 1st captured group
– If there is more than one pair of (), use \2, \3, etc.
– The whole matched string is referenced as \0
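A back reference forces the closing tag to repeat whatever the group captured in the opening tag. A minimal Python sketch, with a deliberately mismatched second sample:

```python
import re

tag = re.compile(r"<([a-z]+)\b[^>]*>.*?</\1>")

m = tag.search('<date format="mmddyy">04/07/13</date>')
print(m.group(0))   # <date format="mmddyy">04/07/13</date>
print(m.group(1))   # date  (the text that \1 must repeat)

# The closing tag does not repeat the opening name, so nothing matches.
mismatch = tag.search("<b>bold</i>")
print(mismatch)     # None
```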


Formatting task

• Trimming unnecessary white spaces
– Replace [ \t]{2,} with a single space
– Delete leading whitespace within a line: replace ^[ \t]+ with nothing (empty string)
– Trim trailing whitespace of a line: replace [ \t]+$ with nothing (empty string)

• Transform a text to a list of words
– Append a line break after each word
– Replace uppercase letters -> lowercase
– Replace punctuation symbols with nothing
– Count the frequency of each word in MS Excel
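The same clean-up can be scripted. The sketch below is in Python with a made-up sample; it applies the three whitespace replacements in order and then counts words with `collections.Counter`, which stands in here for the MS Excel step.

```python
import re
from collections import Counter

raw = "  The  cat \t saw the   Cat.  \n\tThe dog   ran.  "   # made-up sample

# Collapse runs of spaces/tabs, then trim the start and end of every line.
# re.MULTILINE makes ^ and $ apply to each line, as in a text editor.
step1 = re.sub(r"[ \t]{2,}", " ", raw)
step2 = re.sub(r"^[ \t]+", "", step1, flags=re.MULTILINE)
clean = re.sub(r"[ \t]+$", "", step2, flags=re.MULTILINE)
print(repr(clean))           # 'The cat saw the Cat.\nThe dog ran.'

# Lowercase, drop punctuation, split into words, and count them.
words = re.sub(r"[^\w\s]", "", clean.lower()).split()
freq = Counter(words)
print(freq.most_common(2))   # [('the', 3), ('cat', 2)]
```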


Automate tagging

• Idea: Find dates via some regex, and then surround each of the matches with tags: <date>some date</date>

• Replace our date pattern: (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d
with: <date>\0</date>

• Try it in the date exercise

• Once you can tag useful info in a text, it will be easy to pull it out.
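For comparison, here is the tagging step as a Python sketch on an invented sentence. One difference from the text editors is worth noting: in Python's replacement strings the whole match is written `\g<0>`, not `\0`.

```python
import re

date = re.compile(r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d")
text = "Arrived 04/05/13, left 04/07/13."   # made-up sample

# \g<0> inserts the whole matched string between the tags.
tagged = date.sub(r"<date>\g<0></date>", text)
print(tagged)   # Arrived <date>04/05/13</date>, left <date>04/07/13</date>.

# Once tagged, the dates are easy to pull out again.
dates = re.findall(r"<date>(.*?)</date>", tagged)
print(dates)    # ['04/05/13', '04/07/13']
```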

Resources for regular expressions

• Regular-expressions.info
– http://www.regular-expressions.info/

• Profhacker article: “Finding the Women of Heimskringla with Regular Expressions”
– http://chronicle.com/blogs/profhacker/finding-the-women-of-heimskringla-with-regular-expressions/38631

• <oo>→<dh> Digital humanities article:
– http://dh.obdurodon.org/regex.html

PLAY WITH TEXTS

Our texts today

• Get familiar with it
• Use regex to do some search
• Search in files
• Then use the techniques to prepare the text for Regex Machine

Texts for today: Old Bailey Proceedings

• You can find samples in today’s package under “Old Bailey Proceedings”

• Or, you can download them on your own:
– select all and copy
– paste it to a text editor
– save it as UTF-8 without BOM (byte order mark)

http://www.oldbaileyonline.org/browse.jsp?dir=sessionsPapers&decade=185

Old Bailey’s Proceeding: the HTML presentation

Text form

Try some search

• Search for: t\d{8}-\d{3}
• Replace it with: <refNo>\0</refNo>

Exercise: Preparing your text in a specific format (to feed to some software)

How to convert? Observe!

• Goal: to make each case a single line

• Patterns?
• Every case begins with a line of “Reference Number” and ends before the next “Reference Number”

• We need to remove all the line breaks

• Tricky things: does the text contain XML reserved characters &, <, >,…

Conversion Steps:Search and Replace + regexes

• Replace the XML reserved characters:
– & => &amp;    % => %
– < => &lt;    > => &gt;

• Get rid of “285.”: ^\d{3}\. => nothing (empty string)
• Replace all the line breaks (\r, \n, \r\n) with nothing
• Reassign the line breaks by “Reference number:”
– Reference number: => \rReference number:

• Optional: Get rid of “See original”
• The order above is crucial
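The conversion steps above can be sketched as a small Python function. The sample text, the function name `one_case_per_line`, and the use of \n rather than \r for the reassigned line breaks are all assumptions made for this illustration; the real Old Bailey text will differ in detail.

```python
import re

# Made-up sample in the rough shape of two cases, with Windows line breaks.
raw = (
    "Reference Number: t18500107-285\r\n"
    "285. JOHN SMITH & JANE DOE, stealing 1 watch.\r\n"
    "NOT GUILTY.\r\n"
    "Reference Number: t18500107-286\r\n"
    "286. MARY BROWN, stealing <2> coats.\r\n"
    "GUILTY. Aged 23.\r\n"
)

def one_case_per_line(text: str) -> str:
    # 1. Escape XML reserved characters; "&" goes first so that the
    #    "&" inside "&lt;" and "&gt;" is not escaped a second time.
    text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    # 2. Drop case numbers such as "285." while they still start a line.
    text = re.sub(r"^\d{3}\.", "", text, flags=re.MULTILINE)
    # 3. Remove every line break.
    text = re.sub(r"[\r\n]+", "", text)
    # 4. Start a new line before each "Reference Number:".
    text = text.replace("Reference Number:", "\nReference Number:")
    return text.lstrip("\n")

lines = one_case_per_line(raw).split("\n")
print(len(lines))   # 2
print(lines[0])
print(lines[1])
```

The order matters for the same reason as in the text editor: step 2 relies on the line breaks that step 3 removes, so swapping them would leave the case numbers in place.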

What does the Regex Machine do?

• A graphical user interface (GUI) that enables people who do not have programming skills to
– graphically design patterns
– match them against a corpus of texts
– see results immediately via a user-friendly color-coding scheme (quick feedback)
– export to XML => automates (part of) the tagging procedure

Credit: Elif Yamagil

Downloading CBDB RegexMachine

• Regex Machine (on CBDB website)
– http://isites.harvard.edu/icb/icb.do?keyword=k16229&pageid=icb.page515758 (download CBDBRegexMachine_July2012.zip on this page)

• Prerequisites:
– Make sure your machine has the Java Runtime Environment (JRE) installed. If not, you can download it here: http://www.java.com/en/download/

Run the Regex Machine

• Double-click CBDBRegexMachine.jar
• In the “Select Your User Directory” window, select the folder where you put your text files.
– Tip: don’t double-click the folder! A single click is all you need.

GUI

List of active regex

List of “terms”

Your Text

Info Box

Open the text we just prepared

• File > Open. Select your text file.

Create Active Regexes

• First regex: capture the reference number
– Example: t18500107-285
– Pattern: t\d{8}-\d{3}
– It’s always good to test it first in a text editor

• Create it in Regex Machine
– Think first: is it one unit? Does it contain different parts?

1. Click
2. Click
3. Fill in your regex and give it a name
4. Give the whole regex a name. Then choose a color!
5. Click on the Regex. Matches are highlighted!

Export to XML

6. File > Export
7. Set records per file to 1000
8. Then an XML file should be generated in the same folder as the text file!

XML header added.

Each line is surrounded by the tag <bio> with line number.

The number is now tagged with the Handle you specified!

Try another regex

• Second regex: capture the “Reference Number:” and the number
– Example: Reference Number: t18500107-285
– Pattern: Reference Number: t\d{8}-\d{3}

• Create it in Regex Machine
– Think first: Do you want it to be tagged as a whole? Should the match contain different parts?

Using multiple groups in an Active Regex

• Add another Active Regex. Create two groups:
• Group #1: Reference Number:
• Group #2: t\d{8}-\d{3}

Group #1

Group #2: Capture this group!

Click on the new one to highlight the matched strings. Then click Move Up. Export to XML.

The whole string is tagged, and the number part is “captured” as an attribute!

What else to capture?

• Name of defendant(s)

• Verdict: guilty or not guilty, age, punishment

• Any patterns observed?

• Pattern for verdicts
– If NOT GUILTY, normally nothing more.
– If GUILTY, normally has Aged \d{2} followed by the punishment.
– There can be more than one verdict in each record (if more than one defendant)

NOT GUILTY

1: Give the whole regex a name. It will become the XML tag name surrounding the entire matched string

2: Give the pattern as the exact text “NOT GUILTY”

Handle: give it a name

Capturing group: The name here will be used as the attribute name of the XML tag. The captured value will become the value of the attribute.

GUILTY

• GUILTY.*Aged ?\d{1,3}[^—]*—.*
– Group #1: guilty or not => GUILTY
– Group #2: age => \d{1,3}
– Group #3: punishment => .*
– Something in between the desired groups
  • Between group 1 & 2: .*Aged ?
  • Between group 2 & 3: [^—]*—

• Need to create 5 groups!
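The five pieces of the GUILTY pattern can be tried out in Python before building them in the Regex Machine. In this sketch the three wanted parts are named groups and the two in-between parts are non-capturing groups; the group names and the sample record are assumptions made for illustration, not the Regex Machine's own format.

```python
import re

verdict = re.compile(
    r"(?P<verdict>GUILTY)"      # piece 1: guilty or not
    r"(?:.*Aged ?)"             # piece 2: filler up to the age
    r"(?P<age>\d{1,3})"         # piece 3: age
    r"(?:[^—]*—)"               # piece 4: filler up to the dash
    r"(?P<punishment>.*)"       # piece 5: punishment
)

record = "JOHN SMITH. GUILTY. Aged 23.—Confined Six Months."   # made-up record
m = verdict.search(record)
print(m.group("verdict"))      # GUILTY
print(m.group("age"))          # 23
print(m.group("punishment"))   # Confined Six Months.

# A bare NOT GUILTY verdict has no "Aged ..." part, so this pattern skips it.
no_match = verdict.search("JANE DOE. NOT GUILTY.")
print(no_match)                # None
```

Because the word GUILTY also occurs inside NOT GUILTY, a record with several defendants on one line could still be matched starting from a NOT GUILTY verdict, so results are worth checking by eye.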

Export to XML

You can then use a browser to open it (more readable). You can further use an XML editor to correct mistakes (validation).

Open the XML in Excel

*Please note that not every XML file can be interpreted well in Excel, because the two handle different data structures: Excel is for tabular data, while XML is for trees, which are much more flexible.
*Also, the Mac version of MS Excel doesn’t read XML!

One last thing

• How about the names of the defendants?

• What is the pattern?
– The names are right after the reference number.
– They are all capital.
– There can be more than one name. In that case, a mixture of spaces, commas, and “and” is used to connect the names.

• Test this pattern in a text editor:
– Reference Number: ?[a-z]\d{8}-\d{3}\s+([A-Z' ]+),?(?: ?([A-Z' ]+),)*(?: ?and([A-Z' ]+))?
– What does it capture?

• Break into groups:
– refNo: [a-z]\d{8}-\d{3}
– First defendant: ([A-Z' ]+)
– Second (or more) defendant: (?: ?([A-Z' ]+),)*
– Last defendant: (?: ?and([A-Z' ]+))?
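To answer "What does it capture?", the pattern can be run on a couple of invented records in Python. The helper name `defendants` and the sample names are hypothetical. The second record illustrates one limitation of a repeated group: it keeps only its last repetition.

```python
import re

names = re.compile(
    r"Reference Number: ?[a-z]\d{8}-\d{3}\s+([A-Z' ]+),?"
    r"(?: ?([A-Z' ]+),)*(?: ?and([A-Z' ]+))?"
)

def defendants(line):
    """Return the captured names, with surrounding spaces removed."""
    m = names.search(line)
    if m is None:
        return []
    return [g.strip() for g in m.groups() if g]

three = defendants(
    "Reference Number: t18500107-285 JOHN SMITH, MARY BROWN, and JANE DOE"
)
print(three)   # ['JOHN SMITH', 'MARY BROWN', 'JANE DOE']

# With four names, the repeated middle group is overwritten on each
# repetition, so BOB RAY is matched but not kept.
four = defendants(
    "Reference Number: t18500107-286 ANN LEE, BOB RAY, CY DOW, and DAN FOX"
)
print(four)    # ['ANN LEE', 'CY DOW', 'DAN FOX']
```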

Good!

Some problem

A real extraction project on local gazetteers – by Adam Mitchell

Raw descriptions written in the gazetteers (extracted)

Source, Date, Disaster type, Location

Disaster types: Earthquakes and fires; Epidemics and Insect Plagues; Snow, Ice, and Tempests; Floods and Droughts; Famines, Hyperinflation, and Relief Efforts.

Collect data at the local levels and then aggregate

Reflections on using the Regex Machine

• Design your regexes and groups carefully
• Think ahead about what you want in the XML
• Tuning regexes can take dozens of hours
• It’s difficult to find regexes that capture everything -- there are always omissions, exceptions, etc.
• Keep in mind the cost of tuning “perfect” regexes.

Put regular expressions in a bigger context

• Using regex to search / capture data of interest – only when the piece of information is written in regular patterns

• What if there are no regular patterns? How can we teach machines to identify important information in a corpus of texts?
– If it’s location names, person names => Named Entity Recognition (NER)
– If it’s concepts => topic modeling, …
– Text mining, machine learning, … and more

Conclusion

• Hope to let you understand what regex is
• Hope to give you some hands-on experience in using regexes against some texts
• Hope to give you some sense of what machines can do with texts
• => Your imagination: you can begin to think about what texts are available and what you can do with them.

ENJOY PLAYING!
