26th Internationalization and Unicode Conference San José, CA, September 2004
ICU Overview: The Open-Source Unicode Library, v3.0
Markus Scherer, ICU Manager, IBM Globalization Center of Competency
Agenda
Background
What is ICU?
Architecture Overview
ICU Features and recent additions
References
Q and A
Why Globalization?
Unicode
All world languages
Efficient and effective processing
Lossless data exchange
Enables single-binary global software
But… all languages ⇒ large, complex standard
– 1,400 pages + Annexes + additional standards
– 90,000+ characters
– Major update every 3 years
– 70 character properties, many multi-valued
– Affects many processes: display, line-break, regex, …
Locales
Features vary widely across languages & countries
– Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …
– Performance is key: easy to do the right thing; hard to do it fast
What is ICU?
Globalization / Unicode / Locales
Mature, widely used set of C/C++ and Java libraries
– Basis for Java 1.1 internationalization – but goes far beyond
Very portable – identical results on all platforms / programming languages
– C/C++: 30+ platforms/compilers
– Java: IBM & Sun JDK
Full threading model; customizable; modular
Open source – but not viral
ICU 3.0: 78 languages; 118 countries; 870 codepages
Who uses ICU?
Products Within IBM
– PSD Print Architecture, DB2, COBOL, Host Access Client, InfoPrint Manager, Informix GLS version 4.0, iSeries, Lotus Notes, Lotus Extended Search, Lotus Workplace, MQ Integrator Endeavour, NUMA-Q, OTI, Pervasive Computing WECMS, SS&S Websphere Banking Solutions, Tivoli Presentation Services, WBI Adapter/ Connect/Modeler and Monitor/ Solution Technology Development/WBI-Financial TePI, Websphere Application Server/ Studio Workload Simulator/Transcoding Publisher, XML Parser
Other Companies and Organizations
– Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, caris, CERN, Cognos, Debian, Gentoo, HP, Inktomi, JD Edwards, Jikes, Macromedia, Mathworks, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping, LLC.
ICU Features
Unicode text handling
Charset conversions (870+)
Collation & Searching
Locales (170+)
Resource Bundles
Calendar & Time zones
Complex-text layout engine
Unicode Regular Expressions
Breaks: word, line, …
Formatting
– Date & time
– Messages
– Numbers & currencies
Transforms
– Normalization
– Casing
– Transliterations
Architecture Overview 1
Locale Based Services
– Locale is an identifier, not a container
– Keywords for variants: de@collation=phonebook
Resource inheritance: shared resources
[Figure: locale inheritance tree. root → en (US, IE), de (DE, CH), zh (Hant, Hans; TW, CN). Levels: Language, Script, Country.]
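The inheritance idea can be sketched as a lookup chain: a resource missing from the most specific locale is looked up in its parent, and finally in root. The following is a minimal illustration, assuming underscore-separated locale IDs and simple truncation; the class and method names are hypothetical, keyword handling (such as `@collation=phonebook`) is omitted, and this is not ICU's actual lookup code.

```java
import java.util.ArrayList;
import java.util.List;

final class LocaleFallback {
    // Returns the lookup order for a locale ID such as "zh_Hant_TW":
    // the ID itself, then each parent obtained by dropping the last
    // underscore-separated subtag, and finally "root".
    static List<String> chain(String localeId) {
        List<String> result = new ArrayList<>();
        String current = localeId;
        while (!current.isEmpty()) {
            result.add(current);
            int cut = current.lastIndexOf('_');
            current = (cut < 0) ? "" : current.substring(0, cut);
        }
        result.add("root");
        return result;
    }

    public static void main(String[] args) {
        System.out.println(chain("zh_Hant_TW")); // [zh_Hant_TW, zh_Hant, zh, root]
        System.out.println(chain("de_CH"));      // [de_CH, de, root]
    }
}
```

Because the locale is only an identifier, the same chain can be reused for any service (collation, formats, resource bundles) that stores its data per locale.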
Architecture Overview 2
Open and Close Service Model
– Better performance by avoiding setup costs per operation
– Warning: use properly for maximum performance
ICU Threading Model
– Multiple versions in use simultaneously
– Large resources shared in read-only cache
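A small sketch of the open-once, use-many pattern. It uses the JDK's `java.text.Collator` as a stand-in (ICU4J offers a class of the same name in `com.ibm.icu.text`); the helper name `sortAll` is an assumption made for illustration. The point is only that the service object is created once and then reused for every comparison, rather than being set up per operation.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ReuseService {
    // "Open" the collation service once, then reuse it for all comparisons
    // performed by the sort. Creating a collator per comparison would pay
    // the setup cost again and again.
    static List<String> sortAll(List<String> names, Locale locale) {
        Collator collator = Collator.getInstance(locale); // setup happens here
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(collator);                            // many compare() calls
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("banana", "Apple", "cherry");
        System.out.println(sortAll(names, Locale.ENGLISH)); // [Apple, banana, cherry]
    }
}
```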
Architecture Overview 3
Data Driven Services
– Customize at build-time or run-time
– Interchange with other platforms;
• same results on each
– Rule-based
• Collation, Word-breaks, Transforms
– Pattern-based
• Formats, UnicodeSet
– Table-based
• Character Conversion
Architecture Overview – ICU4C
Simple Error Handling
– C++ subset for portability
– Support for multi-threaded environment
Version Management
– Multiple versions at the same time
– Data and library versioning
String Buffer Management
– Preflighting and overflow protection
Misc: Load/Unload ICU
Recent Additions:
– Runtime-settable memory allocation and mutex functions
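Preflighting means calling a function first without a usable output buffer to learn the required length, then calling it again with a buffer of that size. The sketch below imitates the pattern in Java with a hypothetical helper, `toUpper`; it is not an ICU API. (ICU4C string functions follow the same idea by returning the required length and setting an overflow error code when the buffer is too small.)

```java
import java.util.Locale;

final class Preflight {
    // Hypothetical helper: writes the uppercase form of src into dest only
    // if dest is large enough, and always returns the length required.
    // A too-small (or null) buffer is never written to.
    static int toUpper(char[] dest, String src) {
        String upper = src.toUpperCase(Locale.ROOT);
        if (dest != null && dest.length >= upper.length()) {
            upper.getChars(0, upper.length(), dest, 0);
        }
        return upper.length();
    }

    public static void main(String[] args) {
        String src = "stra\u00DFe";            // "straße", 6 UTF-16 units
        int needed = toUpper(null, src);       // preflight call: length only
        char[] buffer = new char[needed];      // allocate exactly what is needed
        toUpper(buffer, src);                  // second call fills the buffer
        System.out.println(needed + " " + new String(buffer)); // 7 STRASSE
    }
}
```

The example also shows why the output size cannot be assumed to equal the input size: uppercasing "ß" yields "SS".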
Architecture Overview – ICU4J
Supplement for Java
Core globalization (no char. conversion, no GUI components)
– We do supply complex text support for Sun
Modularized: products may add just needed functionality
ICU4J vs. JDK
CLDR 1.1 (Common Locale Data Repository)
Up-to-date globalization: standards-compliant; latest Unicode
– Supplementary characters (GB 18030, JIS X 0213, HKSCS)
– Full properties – JDK has only a fraction
– Local calendars (Thailand, Japan,…); ISO dates
– Currencies, String Search, Int’l Domain Names
– Transforms: Case, Scripts, Normalization
Much faster turn-around on bug-fixes, enhancements
Unicode Text Handling
C
– UChar*: null-terminated or with length
C++
– UnicodeString: full featured string class
Java
– Uses normal JDK String, adds utilities
All handle supplementary characters
– Required for GB 18030/JIS X 0213/HKSCS repertoires
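A short illustration of what handling supplementary characters means in UTF-16 text: one character above U+FFFF occupies two code units, so code should count and iterate by code point. This sketch uses only JDK `String` methods; the class name is hypothetical.

```java
final class Supplementary {
    // Number of Unicode code points, as opposed to UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A' followed by U+20000, a supplementary CJK ideograph that is
        // stored as the surrogate pair D840 DC00.
        String s = "A\uD840\uDC00";
        System.out.println(s.length());      // 3 code units
        System.out.println(codePoints(s));   // 2 code points
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```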
Unicode Text Handling 2
All Unicode 4.0 properties
– Direct API
• Values, names, enumerations
– UnicodeSet
• Fast, compact set operations
• Pattern-based (both Perl & POSIX syntax for properties)
– \p{greek} vs. [:greek:]
• All properties:
– [\p{lowercase}-[a-z]]
– [\p{greek} & \p{uppercase}]
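The two set patterns above (difference and intersection of property sets) can be approximated with JDK regular-expression character classes. This is a sketch in `java.util.regex` syntax, which differs from UnicodeSet syntax: the JDK spells intersection as `&&` and expresses difference as intersection with a negated class. The class and field names are hypothetical.

```java
import java.util.regex.Pattern;

final class PropertySets {
    // Lowercase characters except ASCII a-z; compare [\p{lowercase}-[a-z]].
    static final Pattern LOWER_NON_ASCII =
            Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    // Characters that are both Greek script and uppercase;
    // compare [\p{greek} & \p{uppercase}].
    static final Pattern GREEK_UPPER =
            Pattern.compile("[\\p{IsGreek}&&[\\p{IsUppercase}]]");

    // True if the single character in ch belongs to the set.
    static boolean contains(Pattern set, String ch) {
        return set.matcher(ch).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(LOWER_NON_ASCII, "\u00E9")); // é: true
        System.out.println(contains(LOWER_NON_ASCII, "e"));      // false
        System.out.println(contains(GREEK_UPPER, "\u03A9"));     // Ω: true
        System.out.println(contains(GREEK_UPPER, "\u03C9"));     // ω: false
    }
}
```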
Data: Recent Additions
Conforms to CLDR 1.1
– 50% more data than CLDR 1.0: adding many translated terms for languages, scripts, countries, currencies, and time zones.
– improved collation for Eastern Europe, Chinese pinyin
Reduced multiplatform install image size
Improved XLIFF-ICU conversion tools
Locale canonicalization spec defined and implemented (C+J)
– Provides interoperability with POSIX and .NET locale IDs, more RFC 3066 support
Character Set Conversion
Precise alias information:
– When you ask for “SJIS”, you can request the precise definition by platform:
• windows, ibm, solaris,…
Buffer management
– automatically handles characters that cross buffers
Customizations allowed for:
– illegal sequences
– undefined characters
Unicode Text Compression – SCSU, BOCU
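Customizable handling of illegal sequences can be illustrated with the JDK's `java.nio.charset` decoders, used here as a stand-in for an ICU converter with a custom error callback. The sketch decodes the same bytes twice: once substituting a replacement character, once reporting the error. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class IllegalSequences {
    // Lenient policy: illegal byte sequences become U+FFFD.
    static String decodeLenient(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e); // not expected with REPLACE
        }
    }

    // Strict policy: any illegal sequence is reported as an error.
    static boolean isWellFormed(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = { 0x41, (byte) 0xFF, 0x42 }; // 0xFF never occurs in UTF-8
        System.out.println(decodeLenient(bad).equals("A\uFFFDB")); // true
        System.out.println(isWellFormed(bad));                     // false
    }
}
```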
Collation and Searching
Fast international comparison and string search; fully UCA compliant
– Compressed sort keys, optimized string comparison, sublinear string search
– incremental sortkeys for radix-sort
Precise binary sortkey stability over time
Fully data driven
API / rule customizations
– strength, normalization, upper vs. lowercase first, ignore punctuation, …
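As an illustration of the strength setting, the sketch below compares two strings that differ only in case. It uses the JDK's `java.text.Collator` as a stand-in for the ICU collator (ICU4J has a same-named class in `com.ibm.icu.text` with further attributes); the helper name `withStrength` is an assumption.

```java
import java.text.Collator;
import java.util.Locale;

final class CollationStrength {
    // Returns an English collator configured with the given strength.
    // PRIMARY looks at base letters only; TERTIARY also distinguishes case.
    static Collator withStrength(int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator;
    }

    public static void main(String[] args) {
        Collator primary = withStrength(Collator.PRIMARY);
        Collator tertiary = withStrength(Collator.TERTIARY);
        System.out.println(primary.compare("abc", "ABC") == 0);  // true
        System.out.println(tertiary.compare("abc", "ABC") == 0); // false
    }
}
```

A primary-strength collator is a common choice for loose matching, for example in search, while tertiary strength gives a deterministic order for display.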
Collation and Searching: Recent Additions
Numeric sorting: sequences of digits can be sorted numerically instead of alphabetically
– e.g., filenames would sort "ab-2" < "ab-10"
– without material performance cost
– with reduced sortkey length.
Significantly improved sorting orders for many other languages
Data in separate tree, for easier modularization and maintenance
getFunctionalEquivalent API allows for better caching and UI support.
Calendar & Time Zones
International Calendars – Arabic, Buddhist, Hebrew, Japanese
– Required for correct presentation of dates in some countries
Olson timezone support, with localizations
Recent Additions:
– RFC 822 time zone format support in DateFormat (C+J) for compatibility.
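Two of the points above, shown with JDK classes as stand-ins for the ICU calendar and DateFormat services: the same day expressed in the Thai Buddhist calendar, and a time zone offset printed in RFC 822 style (the `Z` pattern letter). Class and method names are hypothetical; the Olson zone ID is only an example.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class CalendarsAndZones {
    // Year of the given ISO date in the Thai Buddhist calendar.
    static int buddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR);
    }

    // RFC 822 style offset (for example -0700) of an Olson zone,
    // evaluated at 12:00 UTC on the given date.
    static String rfc822Offset(String zoneId, LocalDate date) {
        SimpleDateFormat format = new SimpleDateFormat("Z", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone(zoneId));
        return format.format(Date.from(date.atTime(12, 0).toInstant(ZoneOffset.UTC)));
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2004, 9, 6);
        System.out.println(buddhistYear(day));                        // 2547
        System.out.println(rfc822Offset("America/Los_Angeles", day)); // -0700
    }
}
```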
Formatting
Date & time: 8 formats per locale
Messages
– Completely localizable, Plural support
Numbers & currencies
– Scientific Notation, Spelled-out (checks, etc.)
– Full Orthogonal Currency support
• INR in Hindi:
• INR in English: Rs. 1,234.57
• INR in German: Rs. 1.234,57
Recent Additions
– POSIX migration library
– Allows parsing multiple currencies with one formatter
– Short and stand-alone month/day names
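A brief sketch of locale-sensitive number formatting and a localizable message with a plural-style choice. It uses the JDK's `java.text` classes, whose ICU4J counterparts in `com.ibm.icu.text` share the same names; the message pattern and method names are made up for illustration.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.util.Locale;

final class Formatting {
    // The same value rendered with each locale's separators.
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // A message whose wording depends on the count. The whole pattern is
    // a single string, so a translator can replace it as one unit.
    static String fileCount(int n) {
        MessageFormat message = new MessageFormat(
                "{0,choice,0#no files|1#one file|1<{0,number,integer} files}",
                Locale.ENGLISH);
        return message.format(new Object[] { n });
    }

    public static void main(String[] args) {
        System.out.println(number(1234.57, Locale.US));      // 1,234.57
        System.out.println(number(1234.57, Locale.GERMANY)); // 1.234,57
        System.out.println(fileCount(0));                    // no files
        System.out.println(fileCount(1));                    // one file
        System.out.println(fileCount(5));                    // 5 files
    }
}
```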
Transforms
Unicode Normalization
– Highly optimized for performance
– performance utilities: concatenation, detection, comparison
Casing (upper, lower, title, folding)
General Transforms
– Script transliterations
– Half-width/Full-width, Hex, etc.
– Chain transforms together, filter source characters
– Rule-based, customizable at runtime.
IDNA: International Domain Names
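Normalization and IDNA can be tried out with JDK classes, used here as stand-ins for the ICU services (`java.text.Normalizer` and `java.net.IDN`). The sketch composes and decomposes an accented letter, compares two canonically equivalent strings, and converts a domain name to its ASCII form. The class name, helper names, and the example domain are assumptions.

```java
import java.net.IDN;
import java.text.Normalizer;

final class Transforms {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Two strings are canonically equivalent if their normalized forms match.
    static boolean canonicallyEqual(String a, String b) {
        return nfd(a).equals(nfd(b));
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301";  // a + combining acute accent
        String composed = "\u00E1";     // precomposed á
        System.out.println(nfc(decomposed).equals(composed));        // true
        System.out.println(canonicallyEqual(decomposed, composed));  // true
        System.out.println(IDN.toASCII("b\u00FCcher.example"));      // xn--bcher-kva.example
    }
}
```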
Segmentation: word, line & sentence
Fast state-table implementation
Customizable
– Rule-based – customizable at runtime
– Special customizations, e.g. Thai
Recent Additions:
– Greatly improved performance when going backwards (common case when doing line break)
– Java
• The rules syntax has been extended. Rules can now return information about the types of characters they encountered.
• Common compiled (binary) rule format with ICU4C
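A minimal word-segmentation sketch, including backwards iteration. It uses the JDK's `java.text.BreakIterator` as a stand-in (ICU4J provides a same-named class in `com.ibm.icu.text`); the helper names are hypothetical, and filtering on the first character is a simplification for skipping spaces and punctuation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class Words {
    // Splits text at word boundaries and keeps only the segments that
    // start with a letter or digit.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                out.add(text.substring(start, end));
            }
        }
        return out;
    }

    // Walks the same boundaries from the end of the text to the start.
    static List<Integer> boundariesBackwards(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int pos = it.last(); pos != BreakIterator.DONE; pos = it.previous()) {
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH)); // [Hello, world]
        System.out.println(boundariesBackwards("Hello, world!", Locale.ENGLISH));
    }
}
```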
Unicode Regular Expressions
Full Regex Implementation
– ICU4C only (C/C++): Java 1.4 has its own package (though not as powerful)
All Unicode 4.0 Properties
– supported through UnicodeSet
Good performance
– competitive with non-Unicode regex
Recent Additions
– Now features a C API, instead of just C++.
Complex-text layout engine
Glyph processing, positioning & adjustment
– ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.
Support for:
– Drawing
– Caret Display
– Hit Testing
– Selection Highlighting
– Caret Movement
– Layout Metrics
– Line Break
ICU 3.0: Canonical Equivalence: a + ´ or á
References
ICU main site:
– http://oss.software.ibm.com/icu/
– Links to
• Download ICU
• User Guide, Technical FAQ, Support, Bug Reports
Unicode Consortium
– http://www.unicode.org
• Unicode glossary, Unicode character database
Questions and Answers