46
More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution Lice See http://software-carpentry.org/license.html for more informati Regular Expressions

More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Embed Size (px)

Citation preview

Page 1: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

More Tools

Copyright © Software Carpentry 2010

This work is licensed under the Creative Commons Attribution License

See http://software-carpentry.org/license.html for more information.

Regular Expressions

Page 2: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Thousands of papers and theses written in

LaTeX

Page 3: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Thousands of papers and theses written in

LaTeXGranger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \

cite{quirrell89}),has opened up new lines of research.

However,studies at Unseen University \

cite{stibbons2002,

stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮

⋮ ⋮ ⋮

Page 4: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Thousands of papers and theses written in

LaTeXGranger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \

cite{quirrell89}),has opened up new lines of research.

However,studies at Unseen University \

cite{stibbons2002,

stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮

⋮ ⋮ ⋮

All share a common bibliography

Page 5: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Thousands of papers and theses written in

LaTeXGranger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \

cite{quirrell89}),has opened up new lines of research.

However,studies at Unseen University \

cite{stibbons2002,

stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮

⋮ ⋮ ⋮

All share a common bibliography

Want to see how often citations appear

together

Page 6: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Thousands of papers and theses written in

LaTeXGranger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \

cite{quirrell89}),has opened up new lines of research.

However,studies at Unseen University \

cite{stibbons2002,

stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮

⋮ ⋮ ⋮

All share a common bibliography

Want to see how often citations appear

together

First step: extract citation sets from

documents

Page 7: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \

cite{quirrell89}),has opened up new lines of research.

However,studies at Unseen University \

cite{stibbons2002,

stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮

⋮ ⋮ ⋮

Page 8: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \

cite{quirrell89}),has opened up new lines of research.

However,studies at Unseen University \

cite{stibbons2002,

stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮

⋮ ⋮ ⋮

Multiple labels separated by commas

Page 9: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \

cite{quirrell89}),has opened up new lines of research.

However,studies at Unseen University \

cite{stibbons2002,

stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮

⋮ ⋮ ⋮

Multiple labels separated by commas

May be white space

Page 10: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \

cite{quirrell89}),has opened up new lines of research.

However,studies at Unseen University \

cite{stibbons2002,

stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮

⋮ ⋮ ⋮

Multiple labels separated by commas

May be white space (including line breaks)

Page 11: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \

cite{quirrell89}),has opened up new lines of research.

However,studies at Unseen University \

cite{stibbons2002,

stibbons2008} highlight several dangers. ⋮ ⋮ ⋮ ⋮

⋮ ⋮ ⋮

Multiple labels separated by commas

May be white space (including line breaks)

And multiple citations per line

Page 12: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.search('cite{(.+)}', 'a \\cite{X} b').groups()

('X',)

Idea #1: capture everything in cite{…} in a

group

Page 13: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.search('cite{(.+)}', 'a \\cite{X} b').groups()

('X',)

Idea #1: capture everything in cite{…} in a

group

print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X} b \\cite{Y',)

What about multiple citations?

Page 14: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.search('cite{(.+)}', 'a \\cite{X} b').groups()

('X',)

Idea #1: capture everything in cite{…} in a

group

print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X} b \\cite{Y',)

What about multiple citations?

Page 15: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.search('cite{(.+)}', 'a \\cite{X} b').groups()

('X',)

Idea #1: capture everything in cite{…} in a

group

print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X} b \\cite{Y',)

What about multiple citations?

Matching is greedy

Page 16: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Idea #2: match everything inside '{}' except

'}'

Page 17: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Idea #2: match everything inside '{}' except

'}'

Use '[^}]' to negate the set containing only

'}'

Page 18: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

Idea #2: match everything inside '{}' except

'}'

Use '[^}]' to negate the set containing only

'}'

Page 19: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

Idea #2: match everything inside '{}' except

'}'

Use '[^}]' to negate the set containing only

'}'

Page 20: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

What about multiple citations?

Idea #2: match everything inside '{}' except

'}'

Use '[^}]' to negate the set containing only

'}'

Page 21: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X',)

What about multiple citations?

Idea #2: match everything inside '{}' except

'}'

Use '[^}]' to negate the set containing only

'}'

Page 22: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X',)

What about multiple citations?

Need to extract all matches, not just the first

Idea #2: match everything inside '{}' except

'}'

Use '[^}]' to negate the set containing only

'}'

Page 23: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Idea #3: use re.findall instead of re.search

Page 24: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her

knowledge

of her language's libraries."

Page 25: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c')

['X', 'Y']

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her

knowledge

of her language's libraries."

Page 26: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c')

['X', 'Y']

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her

knowledge

of her language's libraries."

Page 27: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c')

['X', 'Y']

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her

knowledge

of her language's libraries."

What about spaces?

Page 28: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c')

['X', 'Y']

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her

knowledge

of her language's libraries."

print re.search('cite{([^}]+)}', 'a \\cite{ X} b \\cite{Y } c').groups()

[' X', 'Y ']

What about spaces?

Page 29: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Could tidy this up after matching using

string.strip()

Page 30: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

Could tidy this up after matching using

string.strip()

Let's modify the pattern instead

Page 31: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Could tidy this up after matching using

string.strip()

Let's modify the pattern instead

Page 32: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Could tidy this up after matching using

string.strip()

Let's modify the pattern instead

Page 33: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Could tidy this up after matching using

string.strip()

Let's modify the pattern instead

Still capturing the space after 'Y'

Page 34: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Could tidy this up after matching using

string.strip()

Let's modify the pattern instead

Still capturing the space after 'Y'

Match the word-to-nonword transition as well

Page 35: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Could tidy this up after matching using

string.strip()

Let's modify the pattern instead

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b \\cite{Y } c')

[' X', 'Y']

Still capturing the space after 'Y'

Match the word-to-nonword transition as well

Page 36: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Could tidy this up after matching using

string.strip()

Let's modify the pattern instead

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b \\cite{Y } c')

[' X', 'Y']

Still capturing the space after 'Y'

Match the word-to-nonword transition as well

Page 37: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

What about multiple labels in a single

citation?

Page 38: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ')

['X,Y']

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ')

['X, Y, Z']

What about multiple labels in a single

citation?

Page 39: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ')

['X,Y']

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ')

['X, Y, Z']

What about multiple labels in a single

citation?

Actually can be done, but it's very complex

Page 40: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ')

['X,Y']

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ')

['X, Y, Z']

What about multiple labels in a single

citation?

Actually can be done, but it's very complex

Use re.split() to break matches on '\\s*,\\s*'

Page 41: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

# Start with a working skeleton.def get_citations(text): '''Return the set of all citation tags found in a block of text.''' return set()

if __name__ == '__main__': test = '''\Granger's work on graphs \cite{dd-gr2007,gr2009},particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \cite{quirrell89}),has opened up new lines of research. However,studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers.'''

print get_citations(test)

set([])

Page 42: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

import re

CITE = 'cite{\\s*\\b([^}]+)\\b\\s*}'SPLIT = '\\s*,\\s*'

def get_citations(text): '''Return the set of all citation tags found in a block of text.'''

result = set() match = re.findall(CITE, text) if match: for citation in match: cites = re.split(SPLIT, citation) for c in cites: result.add(c)

return result

Page 43: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

import re

CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}')SPLIT = re.compile('\\s*,\\s*')

def get_citations(text): '''Return the set of all citation tags found in a block of text.'''

result = set() match = CITE.findall(text) if match: for citations in match: label_list = SPLIT.split(citations) for label in label_list: result.add(label)

return result

Page 44: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

import re

CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}')SPLIT = re.compile('\\s*,\\s*')

def get_citations(text): '''Return the set of all citation tags found in a block of text.'''

result = set() match = CITE.findall(text) if match: for citations in match: label_list = SPLIT.split(citations) for label in label_list: result.add(label)

return result

Page 45: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

Regular Expressions More Tools

# Now test it all out.

if __name__ == '__main__': test = '''\Granger's work on graphs \cite{dd-gr2007,gr2009},particularly ones obeying Snape's Inequality\cite{ snape87 } (but see \cite{quirrell89}),has opened up new lines of research. However,studies at Unseen University \cite{stibbons2002, stibbons2008} highlight several dangers.'''

print get_citations(test)

set(['gr2009', 'stibbons2002', 'dd-gr2007', 'stibbons2008', 'snape87', 'quirrell89'])

Page 46: More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See

June 2010

created by

Greg Wilson

Copyright © Software Carpentry 2010

This work is licensed under the Creative Commons Attribution License

See http://software-carpentry.org/license.html for more information.