IBM Software Group
© 2003 IBM Corporation
Unicode Text and Regular Expression
Andy Heninger
9/9/2004
Overview
Regular Expressions have long been used for:
Searching text data
Parsing, extracting fields
Text manipulation, find & replace
Regular Expressions and Unicode Text data are a good Match.
Regular Expression Languages have evolved new features to work more conveniently and powerfully with Unicode data.
Talk Focus is on these Unicode related features.
What Are Regular Expressions
Think of Wildcards
Select or match text
Available in editors, languages, tools, databases
Not the topic today
Literal text    Matches itself
*               Match 0 or more times
+               Match one or more times
[a-z]           Character Range. Match any one
(whatever)      Grouping
Character Ranges
[a-z] Match any one character falling in the specified range
Relies on the existence of some ordering of characters, to determine what falls between a and z. Typically charset order.
Only works for English
No accented characters
No letters from other alphabets (Greek, Arabic, etc.)
Still widely used.
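The limitation is easy to reproduce. Below is a minimal sketch using Python's standard `re` module; Python and the sample words are illustrative assumptions, not something the slides use.

```python
import re

# [a-z] covers only the 26 unaccented lowercase ASCII letters.
ascii_lower = re.compile(r"[a-z]+")

# Illustrative sample words (assumed for this sketch).
samples = ["nina", "niña", "αβγ"]

# fullmatch succeeds only if every character falls inside the range.
results = {word: ascii_lower.fullmatch(word) is not None for word in samples}
print(results)
```

Only the unaccented word passes; the accented and Greek words fall outside the range.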
POSIX Character Classes
Remove dependency on charset ordering
Convenient, more likely to be correct than [a-z]
[:alnum:] [:cntrl:] [:lower:]
[:space:] [:alpha:] [:digit:]
[:xdigit:] [:print:] [:upper:]
[:blank:] [:graph:] [:punct:]
Implementers must provide definitions for different charsets
POSIX -> Unicode
Unicode has a very rich character property system
Unicode TR 18 defines POSIX classes in terms of properties
[:alpha:] Alphabetic = TRUE
[:digit:] General Category = Decimal Number
[:space:] White Space = TRUE
[:upper:] Uppercase = TRUE
Direct access to Unicode properties in character set expressions is a key feature for Unicode regular expressions.
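As a rough illustration of property-based classes, the sketch below uses Python's standard `re` module. Python's `re` has neither POSIX bracket classes nor `\p{...}` syntax, but for text patterns its `\d` is documented as matching General Category Nd, and its `\s` matches Unicode whitespace. Python's definitions should be treated as close to, not identical with, the TR 18 ones; the sample characters are assumptions.

```python
import re
import unicodedata

# ARABIC-INDIC DIGIT ONE, TWO, THREE: General Category Nd (Decimal Number).
arabic_indic = "\u0661\u0662\u0663"
digit_categories = [unicodedata.category(c) for c in arabic_indic]

# \d matches any Nd character, not just ASCII 0-9.
digits_match = re.fullmatch(r"\d+", arabic_indic) is not None

# EM SPACE (U+2003) and IDEOGRAPHIC SPACE (U+3000) are matched by \s.
spaces_match = re.fullmatch(r"\s+", "\u2003\u3000") is not None

print(digit_categories, digits_match, spaces_match)
```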
A Quick Look at the Unicode General Category
Central to Regular Expressions with Unicode Text
Categorizes every character as one of:
Letter
Number
Separator
Punctuation
Marks
Symbols
Others
Subcategories within each. Examples:
Letter: uppercase, lowercase, other, …
Symbol: math, currency, modifier, …
Mark: spacing, non-spacing, enclosing
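The two-letter General Category codes can be inspected directly. This is a small sketch using Python's standard `unicodedata` module (Python is an assumption of this example); the first letter of each code is the major category and the second the subcategory.

```python
import unicodedata

# Illustrative characters: letters, a digit, symbols, a space,
# punctuation, and U+0303 COMBINING TILDE.
sample = "Aa5+$ ,\u0303"

categories = {ch: unicodedata.category(ch) for ch in sample}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {cat}")
```

Here `Lu` and `Ll` are uppercase and lowercase letters, `Nd` a decimal number, `Sm` and `Sc` math and currency symbols, `Zs` a space separator, `Po` other punctuation, and `Mn` a non-spacing mark.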
Unicode Property Based Character Classes
TR 18 recommended properties for Basic Unicode Support include:
General Category
Script
Alphabetic
Uppercase
Lowercase
White Space
Examples:
[:Script=Greek:]      POSIX syntax
[\p{Script=Greek}]    Perl syntax
[\p{Alphabetic}]
Set Operations
[^\p{Letter}] Negation
[\p{Letter}\p{Number}] Union
[\p{Letter}&\p{script=Cyrillic}] Intersection
[\p{Letter}-\p{Latin}] Difference
Important for a character set the size of Unicode.
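The four operations can be mimicked with ordinary set arithmetic. The sketch below uses Python sets and the standard `unicodedata` module, since Python's `re` has no property-set syntax. The helper `chars_where`, the scan limit, and the name-based stand-in for the Cyrillic script are all assumptions of this example: `unicodedata` does not expose the Script property.

```python
import unicodedata

def chars_where(predicate, limit=0x0530):
    """Hypothetical helper: characters below `limit` that satisfy `predicate`.
    The small limit (through Cyrillic Supplement) keeps the sketch fast."""
    return {chr(cp) for cp in range(limit) if predicate(chr(cp))}

letters = chars_where(lambda c: unicodedata.category(c).startswith("L"))
numbers = chars_where(lambda c: unicodedata.category(c).startswith("N"))
# Approximation of \p{script=Cyrillic}, based on character names.
cyrillic = chars_where(lambda c: unicodedata.name(c, "").startswith("CYRILLIC"))

union = letters | numbers          # [\p{Letter}\p{Number}]
intersection = letters & cyrillic  # letters that are Cyrillic
difference = letters - cyrillic    # letters that are not Cyrillic

print("я" in intersection, "7" in union, "a" in difference)
print("!" in letters)              # negation: "!" is outside the Letter set
```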
Script and Block Properties
[\p{script=Thai}]
[\p{block=Thai}]
Unicode Script Property
Categorizes each character by script: Latin, Cyrillic, Arabic, etc.
Shared characters classified as “Common”. Numbers, punctuation, etc.
Not the same as Language.
Unicode Block Property
Categorizes by block: a contiguous range of characters.
Basic Latin, Latin-1 Supplement, Latin Extended A, Latin Extended B
Greek, Hebrew, and more.
Has Limitations
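Because a block is just a contiguous range, it can be written as a plain character range, which also exposes some of its limitations. The sketch below uses Python's standard `re` and `unicodedata` modules (an assumption of this example) with the Thai block, U+0E00 to U+0E7F.

```python
import re
import unicodedata

# The Thai block as a code point range.
thai_block = re.compile(r"[\u0E00-\u0E7F]+")

word = "\u0E44\u0E17\u0E22"  # the Thai word "ไทย"
word_in_block = thai_block.fullmatch(word) is not None

# The block also contains U+0E3F, a currency symbol rather than a letter...
baht_in_block = thai_block.fullmatch("\u0E3F") is not None
baht_category = unicodedata.category("\u0E3F")

# ...and U+0E00, which is not an assigned character at all.
unassigned_in_block = thai_block.fullmatch("\u0E00") is not None
unassigned_category = unicodedata.category("\u0E00")

print(word_in_block, baht_in_block, baht_category)
print(unassigned_in_block, unassigned_category)
```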
Code Points, Code Units, UTF 8/16/32
Matching happens on Code Points (0 – 10ffff)
UTF-8 bytes or UTF-16 Surrogate Halves not visible
Match results independent of encoding form.
Glitches:
Implementations without surrogate support
Perl’s \x
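A short sketch of matching on code points, using Python 3 (an assumption of this example), whose strings are sequences of code points. The supplementary character chosen, U+1D11E MUSICAL SYMBOL G CLEF, is only an illustration.

```python
import re

clef = "\U0001D11E"  # one supplementary code point

code_points = len(clef)                            # counted as one character
utf8_bytes = len(clef.encode("utf-8"))             # bytes in UTF-8
utf16_units = len(clef.encode("utf-16-le")) // 2   # 16-bit units (a surrogate pair)

# A single "." consumes the whole code point; the regex never sees
# UTF-8 bytes or surrogate halves.
dot_matches_whole = re.fullmatch(r".", clef) is not None

print(code_points, utf8_bytes, utf16_units, dot_matches_whole)
```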
Normalization
[Figure: the pattern elements \p{Alphabetic}, n, \p{Non Spacing Mark}, … shown against the text "niña" in its two forms: precomposed (n i ñ a) and decomposed (n i n ˜ a)]
Normalization
Approaches to the problem:
Data may be pre-normalized, nothing extra needed.
Use Normalization option, if available.
Application Normalizes the data first
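The last approach, where the application normalizes first, might look like the following sketch. It assumes Python's standard `re` and `unicodedata` modules, and NFC as the chosen normalization form; neither is prescribed by the slides.

```python
import re
import unicodedata

composed = "ni\u00F1a"      # ñ as a single code point
decomposed = "nin\u0303a"   # n followed by U+0303 COMBINING TILDE

pattern = re.compile("ni\u00F1a")  # pattern written in composed form

# Without normalization the decomposed text does not match.
raw_match = pattern.fullmatch(decomposed) is not None

# Normalizing the input to NFC first makes both forms match.
nfc_match = pattern.fullmatch(unicodedata.normalize("NFC", decomposed)) is not None

print(raw_match, nfc_match)
```

The pattern itself must be in the same normalization form as the text for this to work.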
Line Endings
Unicode has more:
\u000A           Line Feed
\u000C           Form Feed
\u000D           Carriage Return
\u0085           Next Line (NEL)
\u2028           Line Separator
\u2029           Paragraph Separator
\u000D \u000A    CR/LF sequence
Matches normally stop at line ends, but overridable.
Line endings always match as a single character, including the CR/LF sequence
There is no single escape sequence, analogous to \n, that matches any line ending.
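The difference between Unicode-aware and classic line handling shows up in a single language. In this sketch (Python is an assumption of the example), `str.splitlines()` recognizes the Unicode line endings and treats CR/LF as one break, while the `re` module's `^` and `$` in multiline mode recognize only `\n`.

```python
import re

# Illustrative text using LF, CR/LF, NEL, LS and PS as line endings.
text = "one\ntwo\r\nthree\u0085four\u2028five\u2029six"

# splitlines() breaks on all of them; CR/LF counts as a single break.
lines = text.splitlines()

# re.MULTILINE anchors see only "\n", so just the first line is
# recognized as a complete line of word characters.
anchored = re.findall(r"^\w+$", text, re.MULTILINE)

print(lines)
print(anchored)
```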
Caseless Matching
Simple: a one-to-one character relation between the pattern and the text being matched.
Full: one-to-many. German sharp s (ß) uppercases to ‘SS’.
Expensive in complexity of implementation, speed.
Existing implementations provide simple form only.
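The gap between simple and full case matching can be seen in a few lines. This sketch assumes Python, whose `re.IGNORECASE` compares character by character, and whose `str.casefold()` applies full case folding outside the regex engine.

```python
import re

# Full case mapping is one-to-many: one character becomes two.
upper = "ß".upper()

# Simple caseless matching pairs characters one to one, so a 7-character
# pattern cannot match 6 characters of text.
simple_match = re.fullmatch("STRASSE", "straße", re.IGNORECASE) is not None

# Full case folding applied to both sides before comparing.
full_match = "STRASSE".casefold() == "straße".casefold()

print(upper, simple_match, full_match)
```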
Grapheme Clusters
Definition: what a user would consider a character, or what would display as a single character.
Multi-codepoint clusters:
Base char + combining marks
Example: decomposed form of Ň
Hangul (Korean) syllables
Unicode-enabled regular expressions should provide:
A way to match a grapheme cluster
A test for whether the match position is on a cluster boundary
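A deliberately simplified sketch of the "base char + combining marks" case follows, in Python with the standard `unicodedata` module. The function `simple_clusters` is hypothetical and is not a full grapheme cluster implementation: it attaches any Mark-category character to the preceding character and ignores Hangul syllables and other cluster rules.

```python
import unicodedata

def simple_clusters(text):
    """Group each character with the combining marks that follow it.
    Simplification: only General Category M* is treated as extending."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # mark: extend the current cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

decomposed = "N\u030Ca"  # N + U+030C COMBINING CARON (decomposed Ň), then "a"
result = simple_clusters(decomposed)
print(len(decomposed), len(result))  # three code points, two clusters
```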
Word Boundaries, \b
Classic RE feature:
Boundaries between “word” and “non-word” characters
“Word” characters include all Alphabetic.
Non-spacing marks never separated from base, otherwise ignored.
UAX 29 boundaries:
Better, but different, results.
Classic RE: Hello There. G’day 123.456
Unicode Word Boundaries: Hello There. G’day 123.456
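The classic behavior on the sample sentence can be reproduced with Python's standard `re` module (Python is an assumption of this sketch). Its `\b` is the classic word/non-word boundary, so the apostrophe and the decimal point both split words. Under the UAX 29 rules, as I read them, G’day and 123.456 would each stay whole; the standard library has no UAX 29 segmentation, so only the classic side is shown.

```python
import re

text = "Hello There. G\u2019day 123.456"

# Classic \b: boundaries wherever a word character meets a non-word character.
classic_words = re.findall(r"\b\w+\b", text)

print(classic_words)
```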
Unicode TR 18
Unicode Technical Standard #18, Regular Expressions
Guidelines for how to adapt RE implementations to Unicode
Three levels of support: Basic, Extended, Tailored
Basic Support requires:
Access to common Unicode character properties
Set (character class) Operations – Union, Intersection, Subtraction
Simple Unicode Loose (caseless) matching
Unicode Line separator characters
Supplementary Character support
Hex notation for Unicode code points
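As one concrete form of hex notation and supplementary character support, the sketch below uses the `\uXXXX` and `\UXXXXXXXX` escapes accepted by Python's `re` module in text patterns. Python and the two sample characters are assumptions of this example; other engines use different notations.

```python
import re

# \u takes 4 hex digits; \U takes 8 and can name supplementary code points.
pattern = re.compile(r"\u0395\U0001D11E")

# U+0395 GREEK CAPITAL LETTER EPSILON followed by U+1D11E MUSICAL SYMBOL G CLEF.
text = "\u0395\U0001D11E"
matched = pattern.fullmatch(text) is not None

print(matched, [f"U+{ord(c):04X}" for c in text])
```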
Unicode TR 18
Extended Unicode Support:
More properties, characters by name (GREEK CAPITAL LETTER EPSILON)
Canonical Equivalents (normalization)
Unicode style word boundaries
Full case insensitive matching
Matching default grapheme clusters and boundaries
Tailored Support:
Language- or locale-specific behavior for a number of matching constructs.
No implementations available yet.
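The "characters by name" item from the Extended level can be illustrated outside a regex engine. This sketch uses Python's standard `unicodedata` module and the `\N{...}` string escape (Python is an assumption of the example) to resolve the name mentioned above.

```python
import unicodedata

# Look the character up by its Unicode name.
epsilon = unicodedata.lookup("GREEK CAPITAL LETTER EPSILON")

# The same character written with a named escape in a string literal.
same = "\N{GREEK CAPITAL LETTER EPSILON}"

# And the reverse direction: from character to name.
name = unicodedata.name("\u0395")

print(f"U+{ord(epsilon):04X}", epsilon == same, name)
```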
Implementations
Implementations providing significant Unicode support:
Perl
Major innovations to regular expressions
Early adopter of Unicode
Perl features and syntax widely adopted.
Java JDK 1.4
Microsoft .NET
IBM ICU4C
Conclusion
Regular expressions provide a great way to analyze and manipulate Unicode data.
Mainstream implementations are readily available.
Questions