26th Internationalization and Unicode Conference
San José, CA, September 2004

ICU Overview
The Open-Source Unicode Library, v3.0

Markus Scherer
ICU Manager
IBM Globalization Center of Competency

Agenda

Background

What is ICU?

Architecture Overview

ICU Features and recent additions

References

Q and A

Why Globalization?

Unicode

All world languages

Efficient and effective processing

Lossless data exchange

Enables single-binary global software

But… all languages ⇒ large, complex standard

– 1,400 pages + Annexes + additional standards

– 90,000+ characters

– Major update every 3 years

– 70 character properties, many multi-valued

– Affects many processes: display, line-break, regex, …

Locales

Features vary widely across languages & countries

– Sorting, line breaks, date/time/number/currency formatting, codepage conversion, …

– Performance is key: easy to do the right thing; hard to do it fast

What is ICU?

Globalization / Unicode / Locales

Mature, widely used set of C/C++ and Java libraries

– Basis for Java 1.1 internationalization – but goes far beyond

Very portable – identical results on all platforms / programming languages

– C/C++: 30+ platforms/compilers

– Java: IBM & Sun JDK

Full threading model; customizable; modular

Open source – but not viral

ICU 3.0: 78 languages; 118 countries; 870 codepages

Who uses ICU?

Products Within IBM

– PSD Print Architecture, DB2, COBOL, Host Access Client, InfoPrint Manager, Informix GLS version 4.0, iSeries, Lotus Notes, Lotus Extended Search, Lotus Workplace, MQ Integrator Endeavour, NUMA-Q, OTI, Pervasive Computing WECMS, SS&S Websphere Banking Solutions, Tivoli Presentation Services, WBI Adapter/ Connect/Modeler and Monitor/ Solution Technology Development/WBI-Financial TePI, Websphere Application Server/ Studio Workload Simulator/Transcoding Publisher, XML Parser

Other Companies and Organizations

– Adobe, Apple (Mac OS X), Avaya, BEA, BroadJump, Business Objects, caris, CERN, Cognos, Debian, Gentoo, HP, Inktomi, JD Edwards, Jikes, Macromedia, Mathworks, Mozilla, NCR, OpenOffice, Parrot, PayPal, Python, QNX, Rogue Wave, SAP, Siebel, SIL, Software AG, Sun Microsystems (Solaris, Java), SuSE, Sybase, Virage, webMethods, Wine, Leica Geosystems GIS & Mapping, LLC.

ICU Features

Unicode text handling

Charset conversions (870+)

Collation & Searching

Locales (170+)

Resource Bundles

Calendar & Time zones

Complex-text layout engine

Unicode Regular Expressions

Breaks: word, line, …

Formatting

– Date & time

– Messages

– Numbers & currencies

Transforms

– Normalization

– Casing

– Transliterations

Architecture Overview 1

Locale Based Services

– Locale is an identifier, not a container

– Keywords for variants: de@collation=phonebook

Resource inheritance: shared resources

[Diagram: resource inheritance tree with three levels below root: Language (en, de, zh), Script (Hant, Hans), Country (US, IE, DE, CH, TW, CN)]
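The inheritance idea can be sketched as a lookup chain from the most specific locale up to root. This is a minimal illustration in plain Java, not ICU's resource loader: the class and method names are hypothetical, and it assumes that keywords such as @collation=phonebook select a variant and are not part of the inheritance path.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the locale IDs consulted for a lookup, most specific first. */
final class LocaleFallback {
    static List<String> chain(String localeId) {
        // Assumption: keywords ("@collation=phonebook") do not take part in inheritance.
        int at = localeId.indexOf('@');
        String id = at >= 0 ? localeId.substring(0, at) : localeId;

        List<String> chain = new ArrayList<>();
        while (!id.isEmpty()) {
            chain.add(id);
            // Drop the last subtag: country first, then script, then language.
            int cut = id.lastIndexOf('_');
            id = cut >= 0 ? id.substring(0, cut) : "";
        }
        chain.add("root"); // shared resources live at the top of the tree
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(chain("zh_Hant_TW"));             // [zh_Hant_TW, zh_Hant, zh, root]
        System.out.println(chain("de@collation=phonebook")); // [de, root]
    }
}
```

A resource missing at one level is looked up at the next, so data shared by all locales only needs to be stored once.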

Architecture Overview 2

Open and Close Service Model

– Better performance by avoiding setup costs per operation

– Warning: use properly for maximum performance

ICU Threading Model

– Multiple versions in use simultaneously

– Large resources shared in read-only cache
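A sketch of the open-once, reuse-many pattern. It uses the JDK's java.text.Collator as a stand-in for an ICU service (an assumption made so the example runs without extra libraries); the class name ReuseService is hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ReuseService {
    static List<String> sortAll(List<String> words, Locale locale) {
        // "Open" the service once: creating a collator pays the setup cost.
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        // Reuse the same instance for every comparison in the sort,
        // rather than calling getInstance() inside the comparison.
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortAll(Arrays.asList("cherry", "Apple", "banana"), Locale.ENGLISH));
    }
}
```

The pattern to avoid is creating the service object inside a loop or comparator, which repeats the setup cost on every operation.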

Architecture Overview 3

Data Driven Services

– Customize at build-time or run-time

– Interchange with other platforms;

• same results on each

– Rule-based

• Collation, Word-breaks, Transforms

– Pattern-based

• Formats, UnicodeSet

– Table-based

• Character Conversion

Architecture Overview – ICU4C

Simple Error Handling

– C++ subset for portability

– Support for multi-threaded environment

Version Management

– Multiple versions at the same time

– Data and library versioning

String Buffer Management

– Preflighting and overflow protection

Misc: Load/Unload ICU

Recent Additions:

– Runtime-settable memory allocation and mutex functions

Architecture Overview – ICU4J

Supplement for Java

Core globalization (no char. conversion, no GUI components)

– We do supply complex text support for Sun

Modularized: products may add just needed functionality

ICU4J vs. JDK

CLDR 1.1 (Common Locale Data Repository)

Up-to-date globalization: standards-compliant; latest Unicode

– Supplementary characters (GB 18030, JIS X 0213, HKSCS)

– Full properties – JDK has only a fraction

– Local calendars (Thailand, Japan,…); ISO dates

– Currencies, String Search, Int’l Domain Names

– Transforms: Case, Scripts, Normalization

Much faster turn-around on bug-fixes, enhancements

Unicode Text Handling

C

– UChar*: null-terminated or with length

C++

– UnicodeString: full featured string class

Java

– Uses normal JDK String, adds utilities

All handle supplementary characters

– Required for GB 18030/JIS X 0213/HKSCS repertoires
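To illustrate what handling supplementary characters means in practice, here is a small sketch using only the JDK String API (Java 8 or later). A supplementary character occupies two UTF-16 code units, so code that counts or iterates by code unit miscounts it. The class name is hypothetical.

```java
final class SupplementaryDemo {
    /** Number of 16-bit UTF-16 code units. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points (characters in the Unicode sense). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A' followed by U+20000, a supplementary CJK ideograph (surrogate pair D840 DC00).
        String s = "A\uD840\uDC00";
        System.out.println(codeUnits(s));  // 3
        System.out.println(codePoints(s)); // 2
        // Iterate by code point instead of by char.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```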

Unicode Text Handling 2

All Unicode 4.0 properties

– Direct API

• Values, names, enumerations

– UnicodeSet

• Fast, compact set operations

• Pattern-based (both Perl & POSIX syntax for properties)

– \p{greek} vs. [:greek:]

• All properties:

– [\p{lowercase}-[a-z]]

– [\p{greek} & \p{uppercase}]
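The two set expressions above can be approximated with JDK regular-expression character classes. This is a rough analogue, not UnicodeSet itself: java.util.regex writes intersection as && and expresses set difference as intersection with a negated class, and the property names (IsLowercase, IsGreek, IsUppercase) are the JDK's spellings. The class name is hypothetical.

```java
import java.util.regex.Pattern;

final class PropertySets {
    // JDK analogue of [\p{lowercase}-[a-z]]: lowercase letters except ASCII a-z.
    static final Pattern LOWER_NOT_ASCII = Pattern.compile("[\\p{IsLowercase}&&[^a-z]]");

    // JDK analogue of [\p{greek} & \p{uppercase}]: Greek script and uppercase.
    static final Pattern GREEK_UPPER = Pattern.compile("[\\p{IsGreek}&&\\p{IsUppercase}]");

    /** True if the single code point is a member of the set described by the pattern. */
    static boolean contains(Pattern set, int codePoint) {
        return set.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains(LOWER_NOT_ASCII, 0x00E9)); // é: true
        System.out.println(contains(LOWER_NOT_ASCII, 'a'));    // false
        System.out.println(contains(GREEK_UPPER, 0x03A9));     // Ω: true
        System.out.println(contains(GREEK_UPPER, 0x03C9));     // ω: false
    }
}
```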

Data: Recent Additions

Conforms to CLDR 1.1

– 50% more data than CLDR 1.0: adding many translated terms for languages, scripts, countries, currencies, and time zones.

– improved collation for Eastern Europe, Chinese pinyin

Reduced multiplatform install image size

Improved XLIFF-ICU conversion tools

Locale canonicalization spec defined and implemented (C+J)

– Provides interoperability with POSIX and .NET locale IDs, more RFC 3066 support

Character Set Conversion

Precise alias information:

– When you ask for “SJIS”, you can request the precise definition by platform:

• windows, ibm, solaris,…

Buffer management

– automatically handles characters that cross buffers

Customizations allowed for:

– illegal sequences

– undefined characters

Unicode Text Compression – SCSU, BOCU
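The idea of customizing how a converter treats illegal sequences can be sketched with the JDK's java.nio.charset decoder, used here as a stand-in for an ICU converter with error callbacks (an assumption; the class name DecodePolicy is hypothetical). The same input is decoded under three policies: replace, ignore, and report.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class DecodePolicy {
    /** Decodes UTF-8 bytes; returns empty if the chosen policy rejects the input. */
    static Optional<String> decode(byte[] bytes, CodingErrorAction onError) {
        try {
            return Optional.of(StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(onError)       // illegal byte sequences
                    .onUnmappableCharacter(onError)  // characters with no mapping
                    .decode(ByteBuffer.wrap(bytes))
                    .toString());
        } catch (CharacterCodingException e) {
            // Only reached with CodingErrorAction.REPORT.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        byte[] input = {0x41, (byte) 0xFF, 0x42}; // 'A', an illegal UTF-8 byte, 'B'
        System.out.println(decode(input, CodingErrorAction.REPLACE)); // 'A', U+FFFD, 'B'
        System.out.println(decode(input, CodingErrorAction.IGNORE));  // Optional[AB]
        System.out.println(decode(input, CodingErrorAction.REPORT));  // Optional.empty
    }
}
```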

Collation and Searching

Fast international comparison and string search; fully UCA compliant

– Compressed sort keys, optimized string comparison, sublinear string search

– incremental sortkeys for radix-sort

Precise binary sortkey stability over time

Fully data driven

API / rule customizations

– strength, normalization, upper vs. lowercase first, ignore punctuation, …
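The strength setting can be illustrated with the JDK's java.text.Collator, which exposes the same primary/secondary/tertiary levels (a stand-in for the ICU collator, chosen so the sketch needs no extra libraries). The class name is hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

final class StrengthDemo {
    /** Returns an English collator configured with the given comparison strength. */
    static Collator withStrength(int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator;
    }

    public static void main(String[] args) {
        Collator primary = withStrength(Collator.PRIMARY);   // base letters only
        Collator tertiary = withStrength(Collator.TERTIARY); // also accents and case

        System.out.println(primary.compare("abc", "ABC") == 0);  // true: case ignored
        System.out.println(tertiary.compare("abc", "ABC") == 0); // false: case matters
        System.out.println(primary.compare("resume", "r\u00E9sum\u00E9") == 0); // true: accents ignored
    }
}
```

A search feature would typically use a low strength so that case and accent variants match, while a sorted display list would use the full strength for a stable, predictable order.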

Collation and Searching: Recent Additions

Numeric sorting: sequences of digits can be sorted numerically instead of alphabetically

– e.g., filenames would sort "ab-2" < "ab-10"

– without material performance cost

– with reduced sortkey length.

Significantly improved sorting orders for many other languages

Data in separate tree, for easier modularization and maintenance

getFunctionalEquivalent API allows for better caching and UI support.
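The ordering that numeric sorting produces can be sketched with a hand-written comparator. This only illustrates the resulting order, not how ICU implements it: the class NumericOrder is hypothetical, it handles ASCII digits only, and it compares non-digit text by code unit rather than linguistically.

```java
import java.util.Comparator;

/** Hypothetical comparator: runs of ASCII digits compare by numeric value. */
final class NumericOrder implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isDigit(ca) && isDigit(cb)) {
                int endA = runEnd(a, i), endB = runEnd(b, j);
                int cmp = compareRuns(a.substring(i, endA), b.substring(j, endB));
                if (cmp != 0) return cmp;
                i = endA;
                j = endB;
            } else {
                if (ca != cb) return Character.compare(ca, cb);
                i++;
                j++;
            }
        }
        // The string with text left over sorts after its prefix.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static int runEnd(String s, int start) {
        int end = start;
        while (end < s.length() && isDigit(s.charAt(end))) end++;
        return end;
    }

    /** Compares two digit runs by value without parsing, so arbitrarily long runs work. */
    private static int compareRuns(String x, String y) {
        x = stripLeadingZeros(x);
        y = stripLeadingZeros(y);
        if (x.length() != y.length()) return Integer.compare(x.length(), y.length());
        return x.compareTo(y);
    }

    private static String stripLeadingZeros(String s) {
        int k = 0;
        while (k < s.length() - 1 && s.charAt(k) == '0') k++;
        return s.substring(k);
    }

    public static void main(String[] args) {
        System.out.println("ab-2".compareTo("ab-10") < 0);                  // false: plain code-unit order
        System.out.println(new NumericOrder().compare("ab-2", "ab-10") < 0); // true: numeric order
    }
}
```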

Calendar & Time Zones

International Calendars – Arabic, Buddhist, Hebrew, Japanese

– Required for correct presentation of dates in some countries

Olson timezone support, with localizations

Recent Additions:

– RFC 822 time zone format support in DateFormat (C+J) for compatibility.
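To show why non-Gregorian calendars matter for presentation, the sketch below converts the conference date into two other calendar systems. It uses the JDK's java.time.chrono classes (Java 8 or later) as a stand-in for the ICU calendar classes; the class name is hypothetical.

```java
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

final class CalendarDemo {
    /** Year in the Thai Buddhist calendar for the given ISO date. */
    static int buddhistYear(LocalDate date) {
        return ThaiBuddhistDate.from(date).get(ChronoField.YEAR_OF_ERA);
    }

    /** Year within the current Japanese imperial era for the given ISO date. */
    static int japaneseYearOfEra(LocalDate date) {
        return JapaneseDate.from(date).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2004, 9, 1);
        System.out.println(buddhistYear(date));      // 2547
        System.out.println(japaneseYearOfEra(date)); // 16 (Heisei 16)
    }
}
```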

Formatting

Date & time: 8 formats per locale

Messages

– Completely localizable, Plural support

Numbers & currencies

– Scientific Notation, Spelled-out (checks, etc.)

– Full Orthogonal Currency support

• INR In Hindi:

• INR In English: Rs. 1,234.57

• INR In German: Rs. 1.234,57

Recent Additions

– POSIX migration library

– Allows parsing multiple currencies with one formatter

– Short and stand-alone month/day names
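A short sketch of locale-sensitive number formatting and a message with plural handling. It uses the JDK's java.text classes as a stand-in for the ICU formatters (an assumption; ICU's own plural support differs), and the class and method names are hypothetical. The number example reproduces the English and German grouping and decimal conventions shown in the currency bullets above, without the currency symbol.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.util.Locale;

final class FormatDemo {
    /** Formats a number with the grouping and decimal separators of the locale. */
    static String number(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    /** Picks a message variant by count; the pattern would normally come from a resource bundle. */
    static String fileCount(int n) {
        MessageFormat format = new MessageFormat(
                "{0,choice,0#no files|1#one file|1<{0,number,integer} files}",
                Locale.ENGLISH);
        return format.format(new Object[] {n});
    }

    public static void main(String[] args) {
        System.out.println(number(1234.57, Locale.US));      // 1,234.57
        System.out.println(number(1234.57, Locale.GERMANY)); // 1.234,57
        System.out.println(fileCount(0)); // no files
        System.out.println(fileCount(1)); // one file
        System.out.println(fileCount(5)); // 5 files
    }
}
```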

Transforms

Unicode Normalization

– Highly optimized for performance

– performance utilities: concatenation, detection, comparison

Casing (upper, lower, title, folding)

General Transforms

– Script transliterations

– Half-width/Full-width, Hex, etc.

– Chain transforms together, filter source characters

– Rule-based, customizable at runtime.

IDNA: Internationalized Domain Names
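Normalization can be illustrated with the JDK's java.text.Normalizer, used here as a stand-in for the ICU normalizer (the class name NormalizeDemo is hypothetical). The sketch composes "a" plus a combining acute accent into the single character á, and compares two strings for canonical equivalence.

```java
import java.text.Normalizer;

final class NormalizeDemo {
    /** Canonical composition (NFC). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** True if the two strings are canonically equivalent. */
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        String decomposed = "a\u0301"; // 'a' + COMBINING ACUTE ACCENT
        String composed = "\u00E1";    // á as one code point

        System.out.println(decomposed.equals(composed));           // false: different code units
        System.out.println(nfc(decomposed).equals(composed));      // true
        System.out.println(canonicallyEqual(decomposed, composed)); // true
        // Detection: check whether text is already normalized before converting it.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
    }
}
```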

Segmentation: word, line & sentence

Fast state-table implementation

Customizable

– Rule-based – customizable at runtime

– Special customizations, e.g. Thai

Recent Additions:

– Greatly improved performance when going backwards (common case when doing line break)

– Java

• The rules syntax has been extended. Rules can now return information about the types of characters they encountered.

• Common compiled (binary) rule format with ICU4C
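Word segmentation can be sketched with the JDK's java.text.BreakIterator, which offers the same kind of boundary iteration (a stand-in for the ICU class of the same name). The helper below is hypothetical: it walks the boundaries and keeps only segments that start with a letter or digit, dropping spaces and punctuation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordSplit {
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);

        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            // Segments between boundaries include spaces and punctuation; skip those.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH)); // [Hello, world]
    }
}
```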

Unicode Regular Expressions

Full Regex Implementation

– ICU4C only: Java 1.4 has its own package (though not as powerful)

All Unicode 4.0 Properties

– supported through UnicodeSet

Good performance

– competitive with non-Unicode regex

Recent Additions

– Now features a C API, instead of just C++.

Complex-text layout engine

Glyph processing, positioning & adjustment

– ligature substitution, contextual forms, kerning, accent placement, Bidi scripts, etc.

Support for:

– Drawing

– Caret Display

– Hit Testing

– Selection Highlighting

– Caret Movement

– Layout Metrics

– Line Break

ICU 3.0: Canonical Equivalence: a + ´ or á

References

ICU main site:

– http://oss.software.ibm.com/icu/

– Links to

• Download ICU

• User Guide, Technical FAQ, Support, Bug Reports

Unicode Consortium

– http://www.unicode.org

• Unicode glossary, Unicode character database

Questions and Answers