Regular Expressions in Writer.pdf

  • 7/30/2019 Regular Expressions in Writer.pdf

    Documentation/How Tos/Regular Expressions in Writer

    From Apache OpenOffice Wiki

    Contents

    1 Introduction
    2 Where regular expressions may be used in OOo
    3 A simple example
    4 The least you need to know about regular expressions
    5 How regular expressions are applied in OpenOffice.org
    6 Literal characters
    7 Special characters
    8 Single character match . ?
    9 Repeating match + * {m,n}
    10 Positional match ^ $ \< \>
    11 Alternative matches | [...]
    12 POSIX bracket expressions [:alpha:] [:digit:] etc..
    13 Grouping (...) and backreferences \x $x
    14 Tabs, newlines, paragraphs \t \n $
    15 Hexadecimal codes \xXXXX
    16 The 'Replace with' box \t \n & $1 $2
    17 Troubleshooting OOo regular expressions
    18 Tips and Tricks

    Introduction

    In simple terms, regular expressions are a clever way to find & replace text (similar to 'wildcards'). Regular expressions can be both powerful and complex, and it is easy for inexperienced users to make mistakes. We describe the use of OpenOffice.org regular expressions aiming to be clear enough for the novice, while detailing the aspects that can cause confusion to more experienced users.

    A typical use for regular expressions is in finding text in a Writer document; for instance to locate all occurrences of man or woman in your document, you could search using a regular expression which would find both words.

    Regular expressions are very common in some areas of computing, and are often known as regex or regexp. Not all regex are the same - so reading the relevant manual is sensible.

    cumentation/How Tos/Regular Expressions in Writer... http://wiki.openoffice.org/wiki/Documentation/How_To...

    e 13 22-05-2013 09:47

    Where regular expressions may be used in OOo

    In Writer:

    Edit - Find & Replace dialog

    Edit - Changes - Accept/reject command (Filter tab)

    In Calc:

    Edit - Find & Replace dialog

    Data - Filter - Standard filter & Advanced filter

    Certain functions, such as SUMIF, LOOKUP

    In Base:

    Find Record command

    The dialogs that appear when you use the above commands generally have an option to use regular expressions (which is off by default). You should check the status of the regular expression option each time you bring up the dialog, as it defaults to 'off'.

    A simple example

    If you have little or no experience of regular expressions, you may find it easiest to study them in Writer rather than, say, Calc.

    In Writer, bring up the Find and Replace dialog from the Edit menu.

    On the dialog, choose More Options and tick the Regular Expressions box.

    In the Search box enter r.d - the dot here means 'any single character'.

    Clicking the Find All button will now find all the places where an r is followed by another character followed by a d, for instance 'red' or 'hotrod' or 'bride' or 'your dog' (this last example is r followed by a space followed by d - the space is a character).

    If you type xxx into the Replace with box, and click the Replace All button, these become 'xxx', 'hotxxx', 'bxxxe', 'youxxxog'.

    That may not be very useful, but it shows the principle. We'll continue to use the Find and Replace dialog to explain in more detail.
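    The same find-and-replace can be tried outside Writer. The sketch below uses Python's re module purely as a stand-in for the Find & Replace dialog; it is not the OOo regex engine, and the list of sample strings is our own.

```python
import re

# Sample strings taken from the walkthrough above.
samples = ["red", "hotrod", "bride", "your dog"]

# r.d : an 'r', then any single character, then a 'd'.
replaced = [re.sub(r"r.d", "xxx", s) for s in samples]

print(replaced)
```

    Each string is treated on its own here, much as Writer treats each paragraph on its own.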

    The least you need to know about regular expressions

    If you don't want to find out exactly how regular expressions work, but just want to get a job done, you might find these common examples useful. Enter them in the 'Search for' box, and make sure that regular expressions are selected.

    color|colour finds color and colour

    sep.rate finds sep then any character then rate - eg separate, seperate, and indeed sepXrate

    sep[ae]rate finds separate and seperate - [ae] means either an a or an e

    changed? finds change and changed - the d is optional because it is followed by a question mark

    s\> finds the s at the end of a word
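    As a rough check of the first four examples, here is a minimal sketch using Python's re module as a stand-in (an assumption: it is a different engine from OOo's, and it has no \> syntax, so the last example is left out). The helper name finds and the word lists are hypothetical.

```python
import re

def finds(pattern, words):
    """Return the words in which the pattern is found anywhere."""
    return [w for w in words if re.search(pattern, w)]

# color|colour : either spelling
alt = finds(r"color|colour", ["color", "colour", "collar"])

# sep.rate : 'sep', any one character, 'rate'
any_char = finds(r"sep.rate", ["separate", "seperate", "sepXrate", "seprate"])

# sep[ae]rate : 'sep', then an 'a' or an 'e', then 'rate'
either = finds(r"sep[ae]rate", ["separate", "seperate", "sepXrate"])

# changed? : the final 'd' is optional
optional = finds(r"changed?", ["change", "changed", "chang"])
```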

    How regular expressions are applied in OpenOffice.org

    OpenOffice.org regular expressions appear to divide the text to be searched into portions and examine each portion separately.

    In Writer, text appears to be divided into paragraphs. For example x.*z will not match x at the end of a paragraph with z beginning the next paragraph (x.*z means x then any or no characters then z). Paragraphs seem to be treated separately (although we discuss some special cases at the end of this HowTo).

    In addition Writer considers each table cell and each text frame separately. Text frames are examined after all the other text / table cells on all pages have been examined.

    In the Find & Replace dialog, regular expressions may be used in the Search for box. In general they may not be used in the Replace with box. The exceptions are discussed later.
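    One way to picture the paragraph-by-paragraph behaviour is the sketch below. It is only a model under our own assumptions: the document is represented as a hypothetical list of paragraph strings, and Python's re module stands in for the OOo engine.

```python
import re

# Hypothetical model: a Writer document as a list of paragraph strings.
paragraphs = ["the box ends in x", "z starts this one", "x marks the zone"]

def find_in_paragraphs(pattern, paras):
    """Search each paragraph separately and collect (index, matched text)."""
    hits = []
    for index, para in enumerate(paras):
        for m in re.finditer(pattern, para):
            hits.append((index, m.group(0)))
    return hits

# x at the end of paragraph 0 and z at the start of paragraph 1 do not combine.
hits = find_in_paragraphs(r"x.*z", paragraphs)
print(hits)
```

    Only the third paragraph, which contains both an x and a later z, produces a match.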

    Literal characters

    If your regular expression contains characters other than the so-called 'special characters' . ^ $ * + ? \ [ ( { | then those characters are matched literally.

    For example: red matches red, redraw and Freddie.

    OpenOffice.org allows you to choose whether you care if a character is 'UPPER CASE' or 'lower case'. If you tick the box to 'match case' on the Find and Replace dialog, then red will not match Red or FRED; if you un-tick that box then the case is ignored and both will be matched.
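    The effect of the 'match case' box can be imitated as follows. This is a sketch with Python's re module, where the re.IGNORECASE flag plays the part of an un-ticked 'match case' box; that correspondence is our assumption, not something OOo documents.

```python
import re

words = ["red", "redraw", "Freddie", "Red", "FRED"]

# 'Match case' ticked: the literal characters must match exactly.
match_case = [w for w in words if re.search(r"red", w)]

# 'Match case' un-ticked: upper and lower case are treated alike.
ignore_case = [w for w in words if re.search(r"red", w, re.IGNORECASE)]
```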

    Special characters

    The special characters are . ^ $ * + ? \ [ ( { |

    They have special meanings in a regular expression, as we're about to describe.

    If you wish to match one of these characters literally, place a backslash '\' before it.

    For example: to match $100 use \$100 - the \$ is taken to mean $ .
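    A short sketch of the same escaping rule, using Python's re module as a stand-in for the OOo engine. The sample sentence is hypothetical; re.escape is a Python convenience with no counterpart in the Find & Replace dialog.

```python
import re

price = "It costs $100 today"

# Unescaped, $ means 'end of text', so nothing can follow it.
unescaped = re.search(r"$100", price)

# Escaped with a backslash, $ is an ordinary dollar sign.
literal = re.search(r"\$100", price)

# re.escape adds the backslashes for you.
escaped_pattern = re.escape("$100")
```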

    Single character match . ?

    The dot '.' special character stands for any single character (except newline).

    For example: r.d matches 'red' and 'hotrod' and 'bride' and 'your dog'

    The question mark '?' special character means 'match zero or one of the preceding character' - or 'match the preceding character if it is found'.

    For example: rea?d matches 'red' and 'read' - 'a?' means 'match a single a if there is one'.

    Special characters can be used in combination with each other. A dot followed by a question mark means 'match zero or one of any single character'.

    For example: star.?ing matches 'staring', 'starring', 'starting', and 'starling', but not 'startling'
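    These two examples can be checked with a small sketch. It assumes Python's re module behaves like the OOo engine for . and ? (it does for these simple cases, but it is a different engine); the helper whole is hypothetical and tests whether the entire word matches.

```python
import re

def whole(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

# rea?d : the 'a' may be present once or not at all
optional_a = whole(r"rea?d", ["red", "read", "reaad"])

# star.?ing : zero or one of any character between 'star' and 'ing'
star_ing = whole(r"star.?ing",
                 ["staring", "starring", "starting", "starling", "startling"])
```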

    Repeating match + * {m,n}

    The plus '+' special character means 'match one or more of the preceding character'.

    For example: re+d matches 'red' and 'reed' and 'reeeeed' - e+ means match one or more e's.

    The star '*' special character means 'match zero or more of the preceding character'.

    For example: rea*d matches 'red' and 'read' and 'reaaaaaaad' - 'a*' means match zero or more a's.

    A common use for '*' is after the dot character - ie '.*' which means 'any or no characters'.

    For example: rea.*d matches 'read' and 'reaXd' and 'reaYYYYd' but not 'red' or 'reXd'
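    A minimal sketch of + and *, again using Python's re module as an assumed stand-in for the OOo engine. The helper whole is hypothetical and checks that the whole word matches.

```python
import re

def whole(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

# re+d : one or more e's
plus = whole(r"re+d", ["rd", "red", "reed", "reeeeed"])

# rea*d : zero or more a's
star = whole(r"rea*d", ["red", "read", "reaaaaaaad"])

# rea.*d : 'rea', then any or no characters, then 'd'
dot_star = whole(r"rea.*d", ["read", "reaXd", "reaYYYYd", "red", "reXd"])
```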

    Use the star '*' with caution; it will grab everything it can:

    For example: 'r.*d' matches 'red' but in Writer if your paragraph is actually 'The referee showed him the red card again' the match found is 'referee showed him the red card' - that is, the first 'r' and the last possible 'd'. Regular expressions are greedy by nature.

    You may specify how many times you wish the match to be repeated, with curly brackets { }. For example a{1,4}rgh! will match argh!, aargh!, aaargh! and aaaargh! - in other words between 1 and 4 a's then rgh!.

    Also note that a{3}rgh! will match precisely 3 a's, ie aaargh!, and a{2,}rgh! (with a comma) will match at least 2 a's, for example aargh! and aaaaaaaargh!.
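    Greedy matching and the curly-bracket counts can be seen in the sketch below. Python's re module is assumed here as a stand-in for the OOo engine; the helper whole and the list of cries are hypothetical.

```python
import re

sentence = "The referee showed him the red card again"

# Greedy: from the first 'r' to the last possible 'd'.
greedy = re.search(r"r.*d", sentence).group(0)

cries = ["rgh!", "argh!", "aargh!", "aaargh!", "aaaargh!", "aaaaargh!"]

def whole(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

one_to_four = whole(r"a{1,4}rgh!", cries)   # between 1 and 4 a's
exactly_three = whole(r"a{3}rgh!", cries)   # precisely 3 a's
two_or_more = whole(r"a{2,}rgh!", cries)    # at least 2 a's
```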

    Positional match ^ $ \< \>

    The circumflex '^' special character means 'match at the beginning of the text'.

    The dollar '$' special character means 'match at the end of the text'.

    Remember that OpenOffice.org regular expressions divide up the text to be searched - each paragraph in Writer is examined separately.

    For example: ^red matches 'red' at the start of a paragraph (red night shepherd's delight).

    For example: red$ matches 'red' at the end of a paragraph (he felt himself go red)

    For example: ^red$ matches inside a table cell that contains just 'red'

    In addition a hard line break (entered by Shift-Enter) is considered the beginning / end of text, and will allow a ^ or $ match.

    The backslash '\' special character gives special meaning to the character pairs '\<' and '\>', namely 'match at the beginning of a word', and 'match at the end of a word'

    For example: red\> matches red at the end of a word (although neither of them cared much.)

    The test used to define the beginning/end of a word seems to be that the previous/next character is a space, underscore (_), tab, newline, paragraph mark or any non-alphanumeric character.

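    For example: red\> matches in 'I said, "No-one dared" ' - the quotation mark after 'dared' is a non-alphanumeric character, so it counts as the end of a word.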
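    The positional matches can be approximated as in the sketch below, under two stated assumptions. First, each string in the hypothetical list paragraphs stands for one Writer paragraph. Second, Python's re module has no \< or \> syntax, so \b (a word boundary at either end) is used instead; it is not identical, since Python counts the underscore as part of a word.

```python
import re

# Hypothetical model: one string per paragraph.
paragraphs = ["red night shepherd's delight", "he felt himself go red", "red"]

starts = [p for p in paragraphs if re.search(r"^red", p)]   # ^red
ends = [p for p in paragraphs if re.search(r"red$", p)]     # red$
only = [p for p in paragraphs if re.search(r"^red$", p)]    # ^red$

# Words ending in 'red': \b approximates \> here.
word_end = re.findall(r"\w*red\b", "neither of them cared much about the redraw")
```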

    Alternative matches | [...]

    The pipe character '|' is a special character which allows the expression either side of the '|' to match.

    For example: red|blue matches 'red' and 'blue'

    Unfortunately, certain expressions when used after a pipe are not evaluated. This is so far known to affect ^ and backreferences, and is the subject of issue 46165 (http://qa.openoffice.org/issues/show_bug.cgi?id=46165)

    For example: ^red|blue matches paragraphs beginning with 'red' and any occurrence of 'blue', but blue|^red incorrectly matches only any occurrence of 'blue', failing to match paragraphs beginning with 'red'

    The open square brackets character [ is a special character. Characters enclosed in square brackets are treated as alternatives - any one of them may match. You can also include ranges of characters, such as a-z or 0-9, rather than typing in abcdefghijklmnopqrstuvwxyz or 0123456789

    For example: r[eo]d matches 'red' and 'rod' but not 'rid'

    For example: [m-p]ut matches 'mut' and 'nut' and 'out' and 'put'

    For example: [hm-p]ut matches 'hut' and 'mut' and 'nut' and 'out' and 'put'
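    The alternatives above can be tried in a short sketch. Python's re module is assumed as a stand-in for the OOo engine; the helper whole and the word lists are hypothetical.

```python
import re

def whole(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

# red|blue : either side of the pipe may match
colours = re.findall(r"red|blue", "red, blue and green")

# r[eo]d : an 'e' or an 'o' in the middle
set_match = whole(r"r[eo]d", ["red", "rod", "rid"])

# [m-p]ut : any one character in the range m to p
range_match = whole(r"[m-p]ut", ["hut", "mut", "nut", "out", "put"])

# [hm-p]ut : an 'h', or any one character in the range m to p
mixed_match = whole(r"[hm-p]ut", ["hut", "mut", "nut", "out", "put"])
```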

    Special characters within alternative match square brackets do not have the same special meanings. The only characters which do have special meanings are ], -, ^ and \, and the meanings are:

    ] - a closing square bracket ends the alternative match set [abcdef]

    - - a hyphen indicates a range of characters, as we've seen, eg [0-9]

    ^ - if the caret is the first character in the square brackets, it negates the search. For example [^a-dxyz] matches any character except abcdxyz.

    \ - the backslash is used to allow ], -, ^ and \ to be used literally in square brackets, and to allow hexadecimal codes. For example, \] stands for a literal closing square bracket, so [[\]a] will match an opening square bracket [, a closing square bracket ] or an a. \\ stands for a literal backslash. \x0009 stands for a tab character.

    Just to re-emphasise: these are the meanings of these characters inside square brackets, and any other characters are treated literally. For example [\t ] will match a 't' or a space - not a tab or a space. Use [\x0009 ] to match a tab or a space.
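    The sketch below tries the same ideas with Python's re module. Treat it as a comparison rather than a description of OOo: Python writes the tab code as \x09 (two hexadecimal digits) instead of \x0009, and inside square brackets Python does read \t as a tab, unlike the OOo behaviour described above. The sample strings are hypothetical.

```python
import re

# [^a-dxyz] : any character except a, b, c, d, x, y, z
not_listed = re.findall(r"[^a-dxyz]", "abcdefxyz")

# Backslash lets ], -, ^ and \ be used literally inside square brackets.
literal_chars = re.findall(r"[\]\-\^\\]", r"a]b-c^d\e")

# Tab or space, written with a hexadecimal code (Python form: \x09).
tab_or_space = re.findall(r"[\x09 ]", "a\tb c")

# In Python, unlike the OOo behaviour described above, [\t ] also matches a tab.
python_tab = re.findall(r"[\t ]", "a\tb c")
```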

    POSIX bracket expressions [:alpha:] [:digit:] etc..

    There is much confusion in the OpenOffice.org community about these. The Help itself is also far from clear.

    There are a number of 'POSIX bracket expressions' (sometimes called 'POSIX character classes') available in OpenOffice.org regular expressions, of the form [:classname:] which allow a match with any of the characters in that class. For instance [:digit:] stands for any of the digits 0123456789.

    These (by definition) may only appear inside the square brackets of an alternative match - so a valid syntax would be [abc[:digit:]], which should match a, b, c, or any digit 0-9. A correct syntax to match just any one digit would be [[:digit:]].

    Unfortunately this does not work as it should! The correct syntax does not work at all, but currently an incorrect syntax ([:digit:]) will actually match a digit, as long as it is outside the square brackets of an alternative match. (Obviously this is unsatisfactory, and is the subject of issue 64368 (http://qa.openoffice.org/issues/show_bug.cgi?id=64368) ).

    The POSIX bracket expressions available are listed below. Note that the exact definition of each depends on locale - for example in a different language other characters may be considered 'alphabetic letters' in [:alpha:]. The meanings given here apply generally to English-speaking locales (and do not take into account any Unicode issues).

    [:digit:]

    stands for any of the digits 0123456789. This is equivalent to 0-9.

    [:space:]

    should stand for any whitespace character, including tab; however as currently implemented it stands simply for a space character. Note that the Help is currently misleading here. (This is the subject of issue 41706 (http://qa.openoffice.org/issues/show_bug.cgi?id=41706) ).

    [:print:]

    should stand for any printable character; however as currently implemented it does not match the single quote nor the double quote characters (and some others). It matches space, but does not match tab (this latter is expected/defined behaviour). (This is the subject of issue 83290 (http://qa.openoffice.org/issues/show_bug.cgi?id=83290) ).

    [:cntrl:]

    stands for a control character. As far as a user is concerned, OpenOffice.org documents have very few control characters; tab and hard_line_break are both matched, but paragraph_mark is not.

    [:alpha:]

    stands for a letter (including a letter with an accent). For example in the phrase (often used in English, and here given with accents as in the original language) 'déjà vu' all 6 letters will match.

    [:alnum:]

    stands for a character that satisfies either [:alpha:] or [:digit:]

    [:lower:]

    stands for a lowercase letter (including a letter with an accent). The case matching does not work unless the Match case box is ticked; if this box is not ticked this expression is equivalent to [:alpha:].

    [:upper:]

    stands for an uppercase letter (including a letter with an accent). The case matching does not work unless the Match case box is ticked; if this box is not ticked this expression is equivalent to [:alpha:].

    There seems to be little consistency in any implementation of POSIX bracket expressions (OOo or elsewhere). One approach is simply to use straightforward character classes - so instead of [[:digit:]] you use [0-9] for example.
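    The 'use straightforward character classes' approach can be sketched as a small rewriting helper. Everything here is hypothetical: the table POSIX_ASCII and the function expand_posix are our own names, the ranges are ASCII-only (so accented letters are not covered, unlike the [:alpha:] description above), and Python's re module, which does not support POSIX bracket expressions, is only used to do the rewriting and to test the result.

```python
import re

# Hypothetical ASCII-only stand-ins for a few POSIX class names.
POSIX_ASCII = {
    "digit": "0-9",
    "alpha": "A-Za-z",
    "alnum": "A-Za-z0-9",
    "lower": "a-z",
    "upper": "A-Z",
}

def expand_posix(pattern):
    """Rewrite each [:name:] in a pattern as an explicit ASCII range.

    Unknown class names raise KeyError rather than being guessed at.
    """
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_ASCII[m.group(1)],
                  pattern)

digit_only = expand_posix("[[:digit:]]")
mixed = expand_posix("[abc[:digit:]]")

# The rewritten pattern is an ordinary character class.
found = re.findall(mixed, "a1z9")
```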

    Grouping (...) and backreferences \x $x

    Round brackets ( ) may be used to group terms.

    For example: red(den)? will find 'red' and 'redden'; here (den)? means 'one or zero of den'.

    For example: (blue|black)bird will find both 'bluebird' and 'blackbird'.

    Each group enclosed in round brackets is also defined as a reference, and can be referred to later in the same expression using a 'backreference'. In the 'Search for' box, backreferences are written '\1', '\2', etc.; in the 'Replace with' box they are written '$1', '$2', etc.

    '\1' or '$1' stands for 'whatever matched in the first round brackets'; '\2' or '$2' stands for 'whatever matched in the second round brackets'; and so on.

    For example: (blue|black) \1bird in the 'Search for' box will find both 'blue bluebird' and 'black blackbird', because '\1' stands for either blue or black, whichever we found. Therefore 'black bluebird' does not match.
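    A minimal sketch of a backreference in the search expression, assuming Python's re module as a stand-in for the OOo engine (both write the search-side backreference as \1). The list of phrases is hypothetical.

```python
import re

pattern = r"(blue|black) \1bird"

phrases = ["blue bluebird", "black blackbird", "black bluebird"]

# \1 must repeat whatever the first round brackets matched.
matching = [p for p in phrases if re.search(pattern, p)]
```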

    Backreferences in the 'Replace with' box only work from OOo2.4 onwards. The use of $1 rather than \1 is consistent with perl syntax, and more particularly with the ICU regex engine, which may at some time replace the existing OOo regex engine, thus resolving many issues.

    For example: (gr..n)(blu.) in the 'Search for' box will find 'greenblue'; if the 'Replace with' box has $2$1 the replacement will be 'bluegreen'.

    When regular expressions are selected, to replace text with the literal character '$' you must now use '\$'; similarly for '\' use '\\'.

    For example: (1..) in the 'Search for' box and \$$1 in the 'Replace with' box replaces '100' with '$100', and '150' with '$150'.

    $0 in the 'Replace with' box replaces with the entire text found.
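    For comparison, here is how the same replacements look in Python's re module. This is a sketch of a different engine, not of OOo: Python writes replacement backreferences as \1, \2 and the whole match as \g<0>, where OOo uses $1, $2 and $0, and a literal $ needs no backslash in a Python replacement.

```python
import re

# OOo: search (gr..n)(blu.), replace with $2$1
swapped = re.sub(r"(gr..n)(blu.)", r"\2\1", "greenblue")

# OOo: search (1..), replace with \$$1
priced = [re.sub(r"(1..)", r"$\1", s) for s in ["100", "150"]]

# OOo: $0 stands for the entire text found; Python spells it \g<0>
wrapped = re.sub(r"1..", r"(\g<0>)", "100")
```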

    Tabs, newlines, paragraphs \t \n $

    The character pair '\t' has special meaning - it stands for a tab character.

    For example: \tred will match a tab character followed by the word 'red'.

    In Writer a newline may be entered by pressing Shift-Enter. A newline character is thereby inserted into the text, and the following text starts on a new line. This is not the same as a new paragraph; click View - Non printing characters to see the difference.

    The OOo regular expression behaviour when matching paragraph marks and newline characters is 'unusual'. This is partly because regular expressions in other software usually deal with ordinary plain text, whereas OOo regular expressions divide the text at paragraph marks. For whatever reason, this is what you can do:

    \n will match a newline (Shift-Enter) if it is entered in the Search box. In this context it is simply treated like a character, and can be replaced by say a space, or nothing. The regular expression red\n will match red followed by a newline character - and if replaced simply by say blue the newline will also be replaced. The regular expression red$ will match 'red' when it is followed by a newline. In this case, replacing with 'blue' will only replace 'red' - and will leave the newline intact.

    red\ngreen will match 'red' followed by a newline followed by 'green'; replacing with say 'brown' will remove the newline. However neither red.green nor red.*green will match here - the dot . does not match newline.

    $ on its own will match a paragraph mark - and can be replaced by say a 'space', or indeed nothing, in order to merge two paragraphs together. Note that red$ will match 'red' at the end of a paragraph, and if you replace it with say a space, you simply get a space where 'red' was - and the paragraphs are unaffected - the paragraph mark is not replaced. It may help to regard $ on its own as a special syntax, unique to OOo.

    ^$ will match an empty paragraph, which can be replaced by say nothing, in order to remove the empty paragraph. Note that ^red$ matches a paragraph with only 'red' in it - replacing this with nothing leaves an empty paragraph - the paragraph marks at either end are not replaced. It may help to regard ^$ on its own as a special syntax, unique to OOo. Unfortunately, because OOo has taken over this syntax, it seems you cannot use ^$ to find empty cells in a table (nor empty Calc cells).

    If you wish to replace every newline with a paragraph mark, firstly you will search for \n with Find All to select the newlines. Then in the Replace box you enter \n, which in the Replace box stands for a paragraph mark; then choose Replace. This is somewhat bizarre, but at least now you know. Note that \r is interpreted as a literal 'r', not a carriage return.
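    The newline cases (not the paragraph-mark cases, which are special to OOo) can be imitated in plain text. The sketch below is an approximation under our own assumptions: a Python string holds one paragraph, '\n' inside it stands for a Shift-Enter newline, and the re.MULTILINE flag is used so that $ matches just before that newline.

```python
import re

# One paragraph containing a hard line break between 'red' and 'green'.
text = "red\ngreen"

# red\n : the newline is part of the match, so it is replaced too.
with_newline = re.sub(r"red\n", "blue", text)

# red$ : only 'red' is replaced; the newline is left intact.
before_newline = re.sub(r"red$", "blue", text, flags=re.MULTILINE)

# red\ngreen : replacing the whole match removes the newline.
joined = re.sub(r"red\ngreen", "brown", text)

# The dot does not match a newline.
dot = re.search(r"red.green", text)
dot_star = re.search(r"red.*green", text)
```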

    To replace paragraph marks - as used to give lines a certain length in some html documents, for instance - with "normal" automatically wrapped lines and paragraphs, the following 3 steps should help. Don't forget to choose More Options and tick the Regular Expressions box for this procedure.

    1. So as not to lose "normal" paragraph marks at the end of "normal" paragraphs, replace two consecutive paragraph marks using a sequence of characters not occurring anywhere else in the text, like "*****" to replace an empty paragraph - this makes it easy to find and reinstate later. You do this by putting ^$ in the Find box and "*****" in the Replace box. (If you're only dealing with a limited chunk of text, don't forget to check "current selection only" under "more options" in the Find and Replace box.)

    2. Search for the remaining line-end paragraph marks by putting $ in the Find box. To replace the mark with a "space" just type a space in the Replace dialogue.

    3. Now that the text is ready for normal line-wrapping, put back the "normal" paragraph marks by typing "*****" in the Find box and \n in the Replace box. (Remember to check "current selection only" where appropriate!)

    Before you try this, create a test document to practise on.

    This is a good sequence to make into a macro. You can find macro suggestions on this OOo forum page: "replacing hard paragraphs" (http://www.oooforum.org/forum/viewtopic.phtml?t=3641) .

    (This procedure also helps deal indirectly with line-break problems.)
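    The logic of the three steps can be rehearsed on plain text before trying it in Writer. The sketch below is a model only, not an OOo macro: the document is a hypothetical list of paragraph strings, MARK and unwrap are our own names, and the final strip of stray spaces is an extra tidy-up that the three steps themselves do not perform.

```python
import re

MARK = "*****"  # must not occur anywhere else in the text

def unwrap(paragraphs):
    """Merge hard-wrapped lines, keeping empty paragraphs as real breaks."""
    # Step 1: replace each empty paragraph (^$) with the marker.
    marked = [MARK if re.search(r"^$", p) else p for p in paragraphs]
    # Step 2: replace the remaining paragraph marks with a space.
    joined = " ".join(marked)
    # Step 3: turn each marker back into a paragraph break.
    return [part.strip() for part in joined.split(MARK)]

wrapped = ["line one", "line two", "", "second para", "continues"]
result = unwrap(wrapped)
print(result)
```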

    Hexadecimal codes \xXXXX

    The character sequence '\x then a 4 digit hexadecimal number' stands for the character with that code.

    For example: \x002A stands for the star character '*'.

    Hexadecimal codes can be seen on the 'Insert-Special Character' dialog.
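    The link between a hexadecimal code and its character can be checked with a short sketch. Note the difference in notation, which is a property of Python and not of OOo: in Python's re module a 4 digit code is written \u002A, because Python's \x takes exactly two hexadecimal digits.

```python
import re

# Code point 002A (hexadecimal) is the star character.
star = chr(0x002A)

# Python spelling of the OOo pattern \x002A.
found = re.findall(r"\u002A", "2*3*4")
```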

    The 'Replace with' box \t \n & $1 $2

    Users are sometimes confused with what can be done using the 'Replace with' box in a Find & Replace dialog.

    In general, regular expressions do not work in the 'Replace with' box. The characters you type replace the found text literally.

    The four constructs that do work are:

    \t inserts a tab, replacing the text found.

    \n inserts a paragraph mark, replacing the text found. This may be unexpected, because \n in the 'Search for' box means 'newline'! In some operating systems it is possible to use unicode input to directly type a newline character (U+000A) in the 'Replace with' box, providing a workaround, but this is not universal.

    $1, $2, etc are backreferences, which (from OOo2.4) insert text groups found. See under Grouping and backreferences. $0 inserts the entire text found.

    & also inserts the entire text found.

    For example if you searched for bird|berry, you would find either 'bird' or 'berry'; now to replace with black& would give you either 'blackbird' or 'blackberry'.
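    The 'entire text found' idea looks like this in Python's re module, used here only as a comparison with a different engine: Python has no & in replacements and writes the whole match as \g<0>. The sample sentence is hypothetical.

```python
import re

# OOo: search bird|berry, replace with black&
prefixed = re.sub(r"bird|berry", r"black\g<0>", "bird and berry")

# \t in the replacement inserts a tab, as in the 'Replace with' box.
tabbed = re.sub(r" ", r"\t", "a b")
```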

    Troubleshooting OOo regular expressions

    If you are new to regular expressions, please realise that they can be tricky - if you are not getting the results you expect, you might need to check that you understand them well enough. Try to keep regular expressions as simple and unambitious as possible.

    Here are some further points of interest with OOo regular expressions:

    If you find an unexpected behaviour, please check in the relevant section in this HowTo - many of the behaviour issues have been documented here.

    Regular expressions are 'greedy' - that is they will match as much text as they can. Consider using curly and square brackets; for example [^ ]{1,5}\> matches 1 to 5 non-space characters at the end of a word.

    Please be careful when using the Replace All button. There are a few rare occasions when this will give unexpected results. For example to remove the first character of every paragraph you might 'Search for' ^. and 'Replace with' nothing; clicking 'Replace All' now will wipe out *all* your text, instead of just the first character of each paragraph. Issue 82473 (http://qa.openoffice.org/issues/show_bug.cgi?id=82473) discusses this. The workaround is to 'Find All', then 'Replace'; perhaps the safest way is not to use the 'Replace All' button at all with regular expressions.
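    The bounded expression mentioned above can be tried as follows. This is a sketch with Python's re module, which has no \> syntax, so \b (word boundary) is used as an approximation; the sample phrase is hypothetical.

```python
import re

# [^ ]{1,5} : between 1 and 5 non-space characters,
# \b        : which must end at a word boundary (approximating \>).
tails = re.findall(r"[^ ]{1,5}\b", "an international word")
print(tails)
```

    Limiting the repeat count keeps the match from grabbing more text than intended.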

    Tips and Tricks

    Here are some examples that may be useful:

    \

    \

    finds decimal numbers

    \

    finds octal (base 8) numbers

    \

    finds hexadecimal (base 16) numbers

    [a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-z]{2,6}

    finds most email addresses (there is no perfect regular expression - this is a practical solution)
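    The email expression happens to be valid in Python's re module as well, so it can be tried there; that is a sketch with a different engine, not a test of OOo. The addresses in the sample text are made up for illustration.

```python
import re

EMAIL = r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-z]{2,6}"

# Hypothetical sample text with two made-up addresses.
text = "Contact jane.doe@example.org or bob_99@mail.example.co.uk today."

addresses = re.findall(EMAIL, text)
print(addresses)
```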

    See Also

    The ICU regular expression package (http://www.icu-project.org/userguide/regexp.html) , a candidate to replace the existing OOo regular expression engine (see: Regexp).

    Example regular expressions (http://www.OOoNinja.com/2007/12/example-regular-expressions-for-writer.html) (OpenOffice.org Ninja)

    Backreferences in substitutions (http://www.OOoNinja.com/2007/12/backreferences-in-replacements-new.html) (OpenOffice.org Ninja)

    Guide to regular expressions in OpenOffice.org (http://www.oooninja.com/2007/12/powerful-text-matching-with-regular.html) (OpenOffice.org Ninja)

    Searching and replacing paragraph returns (carriage returns), tabs, and other special characters (http://openoffice.blogs.com/openoffice/2009/11/searching-and-replacing-paragraph-returns-carriage-returns-tabs-and-other-special-characters-in-open.html) (Solveig Haugland's blog)

    Retrieved from "http://wiki.openoffice.org/w/index.php?title=Documentation/How_Tos/Regular_Expressions_in_Writer&oldid=153756"

    Categories: Documentation/Reference Documentation/How Tos/Writer

    This page was last modified on 23 December 2009, at 22:14.
