
CLDR: The Common Locale Data Repository
Locales for the World

Lisa Moore, George Rhoten, Mark Davis, Steven Loomis


LRC – XI The Localisation Factory, Dublin, Ireland, October 2006

Agenda

Why CLDR?
CLDR data
Tools and vetting
Today and the future


Locales – does anything stay the same?

"Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt."


Locales – the many differences

Locales specify user preferences
Linguistic and cultural differences
  • Languages, scripts, writing systems, ordering, directionality, formatting, numbers, sizes
Even in the same locale, interoperability issues across platforms
Global economics has increased the need for greater globalization support in computer systems
Everyone expects more!


Add the Universal Character Encoding

Unicode: Unique character codes for all languages


The Need for Common Locale Data

Computing environments often contain a variety of operating systems and software.
Historically, research into locale-sensitive data has been done by individuals and/or companies.
Because of political changes, it is easy for locale data to become out of date.
It is difficult to get complete agreement on correctness.


Common Locale Data Project

Began as Common XML Locale Repository (CXLR) developed by OpenI18N in 2003
CLDR project began in 2004
Hosted by Unicode Consortium
  • http://www.unicode.org/cldr/
Goals:
  • Common, necessary software locale data for all world languages
  • Collect and maintain locale data
  • XML format for effective interchange
  • Freely available


CLDR in use (partial list)

Libraries and Environments
  • ICU – International Components for Unicode
  • JDK – Java Development Kit
Operating Systems
  • Solaris
  • AIX
  • MacOS X
Applications
  • OpenOffice.org
  • Acrobat
  • ModernBill


What is a Locale?

A locale is an identifier referring to linguistic and cultural preferences
  • en_US, en_GB, ja_JP
These preferences can change over time due to cultural and political reasons
  • Introduction of new currencies, like the Euro
  • Standard sorting of Spanish changes
Many of these preferences have varying degrees of standardization
  • 12 and 24 hour format in the United States
This is a very broad topic
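
As a rough illustration of identifiers such as en_US or ja_JP, the sketch below splits an identifier into a language part and an optional territory part. It is a minimal sketch, not the LDML identifier grammar: the names `Locale` and `parse_locale` are made up for this example, and scripts, variants and keywords are deliberately not handled.

```python
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    """Hypothetical container for the two subtags handled by this sketch."""
    language: str
    territory: Optional[str]


def parse_locale(identifier: str) -> Locale:
    """Split 'en_US' (or 'en-US') into language and optional territory.

    Assumption: only the language_TERRITORY shape is supported here.
    """
    parts = identifier.replace("-", "_").split("_")
    language = parts[0].lower()
    if not language.isalpha():
        raise ValueError(f"not a language code: {identifier!r}")
    territory = parts[1].upper() if len(parts) > 1 and parts[1] else None
    return Locale(language, territory)


for sample in ("en_US", "en_GB", "ja_JP", "ja"):
    print(sample, "->", parse_locale(sample))
```

Keeping language and territory separate makes it easy to see which preferences are shared by en_US and en_GB and which differ.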


Types of Locale Data

Dates/time/calendar formats
Number/currency formats
Measurement system
Collation specification
  • Sorting
  • Searching
  • Matching
Translated names for language, territory, script, timezones, currencies,…
Script and characters used by a language


Locale Data Markup Language

Locale data described using XML
CLDR data uses LDML
Structure of CLDR controlled by Locale Data Markup Language (LDML) specification
  • http://unicode.org/reports/tr35


LDML Data Categories

<ldml>
  <identity>
  <localeDisplayNames>
  <layout>
  <characters>
  <delimiters>
  <measurement>
  <dates>
  <numbers>
  <posix>
  <collations>


Names

<localeDisplayNames>

Provides translated display names for languages, territories, scripts, variants and keywords used in CLDR.
Most of this information is at the language level, since it typically does not vary by territory, only language.
An example: ICU Locale Explorer
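
Because display names usually live at the language level, a lookup for a territory-specific locale has to fall back to its parent. The sketch below shows one way to do that, assuming a simple chain from the full identifier to the language and then to root. The dictionary layout, the key "territory/AD" and the root value are hypothetical stand-ins for illustration, not the actual CLDR file structure.

```python
def fallback_chain(locale_id):
    """Return the lookup order for a locale, e.g. ga_IE, then ga, then root."""
    chain = []
    current = locale_id
    while current:
        chain.append(current)
        # Drop the last subtag; an identifier without '_' becomes ''.
        current = current.rpartition("_")[0]
    chain.append("root")
    return chain


def lookup(data, locale_id, key):
    """Return the first value found along the fallback chain, or None."""
    for candidate in fallback_chain(locale_id):
        value = data.get(candidate, {}).get(key)
        if value is not None:
            return value
    return None


# Hypothetical in-memory data; only the Irish name comes from ga.xml.
SAMPLE = {
    "ga": {"territory/AD": "Andóra"},
    "root": {"territory/AD": "AD"},  # assumed placeholder value
}

print(fallback_chain("ga_IE"))
print(lookup(SAMPLE, "ga_IE", "territory/AD"))
```

Nothing is stored under ga_IE in this sample, so the lookup is answered by the ga entry.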


Names Examples

From ga.xml (Irish):

<localeDisplayNames>
  <languages>
    <language type="aa">Afar</language>
    <language type="ab">Abcáisis</language>…
  <scripts>
    <script type="Arab">Araibis</script>…
  <territories>
    <territory type="AD">Andóra</territory>
    <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>…
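
A display name can be read out of this kind of data with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a small, well-formed sample built from the entries above. The closing tags and the enclosing <ldml> element are filled in only so the sample parses, and the helper name `display_name` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Well-formed sample assembled from the ga.xml excerpt for illustration.
LDML_SAMPLE = """<ldml>
  <localeDisplayNames>
    <languages>
      <language type="aa">Afar</language>
      <language type="ab">Abcáisis</language>
    </languages>
    <scripts>
      <script type="Arab">Araibis</script>
    </scripts>
    <territories>
      <territory type="AD">Andóra</territory>
      <territory type="AE">Aontas na nÉimíríochtaí Arabacha</territory>
    </territories>
  </localeDisplayNames>
</ldml>"""


def display_name(root, group, item, code):
    """Find e.g. territories/territory[@type='AE'] and return its text."""
    node = root.find(f"./localeDisplayNames/{group}/{item}[@type='{code}']")
    if node is None or node.text is None:
        return None
    return node.text.strip()


root = ET.fromstring(LDML_SAMPLE)
print(display_name(root, "languages", "language", "ab"))
print(display_name(root, "territories", "territory", "AE"))
```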


Characters

<characters>

Allows for creation of exemplar character sets. An exemplar set specifies the set of characters that must be present in order to properly render the language.
Auxiliary exemplar set defines additional characters that may appear in foreign words or phrases.
Lower case only


Date Formats

<dates>

Defines representation of calendars using various calendaring systems (Gregorian, Buddhist, Islamic, Japanese, etc.)
Defines formatting for dates, times, eras and time zones
  • wide, abbreviated, or narrow
  • Date and time formats use patterns of letters to define proper formatting
Week information
Relative day/time translations (for example, yesterday, tomorrow, etc.)
An example: ICU Locale Explorer


Characters / Dates Examples

From ga.xml (Irish):

<characters>
  <exemplarCharacters>[a á b-e é f-i í j-o ó p-u ú v-z]</exemplarCharacters>
  <exemplarCharacters type="auxiliary">[ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ ṫ]</exemplarCharacters>
</characters>…

<dayContext type="format">
  <dayWidth type="abbreviated">
    <day type="sun">Domh</day>
    <day type="mon">Luan</day>…
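
To show how an exemplar set might be used, the sketch below expands the Irish main set into individual characters and reports which letters of a text fall outside it. It assumes only the simple form seen above (single characters and a-z style ranges separated by spaces). The full set notation is richer, and the function names are made up for this example.

```python
import unicodedata


def expand_exemplars(pattern):
    """Expand '[a á b-e ...]' into a set of single characters."""
    body = unicodedata.normalize("NFC", pattern).strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed set")
    chars = set()
    for token in body[1:-1].split():
        if len(token) == 3 and token[1] == "-":
            # A range such as b-e covers every code point from b to e.
            chars.update(chr(cp) for cp in range(ord(token[0]), ord(token[2]) + 1))
        else:
            chars.add(token)
    return chars


def uncovered(text, exemplars):
    """Letters in text (lower-cased) that the exemplar set does not contain."""
    folded = unicodedata.normalize("NFC", text).lower()
    return sorted({ch for ch in folded if ch.isalpha() and ch not in exemplars})


MAIN = "[a á b-e é f-i í j-o ó p-u ú v-z]"
main_set = expand_exemplars(MAIN)
print(len(main_set))
print(uncovered("Éire", main_set))
```

The text is lower-cased before the check because the exemplar data is lower case only.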


Time Zone Names

<timeZoneNames>

Based on Olson time zone database
Localized display names for standard, daylight, and generic representations of time zones.
Short and long display names.


Numbers

<numbers>

Specifies proper localized formatting of numeric quantities
  • Decimal
  • Scientific
  • Currency
  • Percentages
Includes localized decimal, thousands separators, currency symbols, etc.
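
The effect of localized separators can be sketched in a few lines. The function below is a simplification that assumes groups of three digits and takes the separators as arguments. CLDR itself expresses this through number patterns and symbol tables, which this sketch does not read, and the function name is made up for this example.

```python
def format_decimal(value, decimal_sep, group_sep, fraction_digits=2):
    """Format a number with caller-supplied decimal and grouping separators.

    Assumption: digits are grouped in threes, which is not true of every locale.
    """
    sign = "-" if value < 0 else ""
    text = f"{abs(value):.{fraction_digits}f}"
    whole, _, frac = text.partition(".")
    groups = []
    while len(whole) > 3:
        groups.insert(0, whole[-3:])
        whole = whole[:-3]
    groups.insert(0, whole)
    result = group_sep.join(groups)
    if fraction_digits:
        result += decimal_sep + frac
    return sign + result


print(format_decimal(1234.57, ",", "."))  # 1.234,57
print(format_decimal(1234.57, ".", ","))  # 1,234.57
```

The first call reproduces the 1.234,57 style used in the mixed-locale example earlier in the talk.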


Time Zones / Currencies

From ga.xml (Irish) and root.xml:

<timeZoneNames>
  <zone type="Europe/Dublin">
    <long>
      <standard>Meán-Am Greenwich</standard>
      <daylight>Am Samhraidh na hÉireann</daylight>
    </long>…

<numbers>
  <currencies>
    <currency type="EUR">
      <displayName>Euro</displayName>
      <symbol>€</symbol>…


Delimiters

<delimiters>

Specifies primary and secondary pairs of delimiter characters to be used for bracketing quotations in text


Delimiters Example

From fr.xml (French):

<delimiters>
  <quotationStart>«</quotationStart>
  <quotationEnd>»</quotationEnd>
  <alternateQuotationStart>“</alternateQuotationStart>
  <alternateQuotationEnd>”</alternateQuotationEnd>
</delimiters>
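
An application could apply these delimiters roughly as follows. This is a minimal sketch: the dictionary simply mirrors the four French elements above, the `quote` helper is made up for this example, and typographic spacing inside the quotation marks is not handled.

```python
# The four delimiter values from the fr.xml excerpt, keyed by element name.
FRENCH_DELIMITERS = {
    "quotationStart": "«",
    "quotationEnd": "»",
    "alternateQuotationStart": "“",
    "alternateQuotationEnd": "”",
}


def quote(text, delimiters, nested=False):
    """Wrap text in the primary pair, or the alternate pair when nested."""
    if nested:
        start = delimiters["alternateQuotationStart"]
        end = delimiters["alternateQuotationEnd"]
    else:
        start = delimiters["quotationStart"]
        end = delimiters["quotationEnd"]
    return start + text + end


inner = quote("oui", FRENCH_DELIMITERS, nested=True)
print(quote("Il a dit " + inner, FRENCH_DELIMITERS))
```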


Collation

<collations>

Information in collation directory, not main
XML version of Java/ICU collation syntax
Unicode collation algorithm is the base
  • http://unicode.org/reports/tr10
Allows tailoring of the UCA on a per-locale basis.


Collation Example

From collations/root.xml:

<collations validSubLocales="ga ga_IE id id_ID ms ms_BN ms_MY nl nl_BE nl_NL pt pt_BR pt_PT">
  <collation type="standard">
    <rules>
      ...
      <s>ā</s>
      <t>Ā</t>
      <s>á</s>
      <t>Á</t>
      <s>ǎ</s>
      <t>Ǎ</t>
      <s>à</s>
      <t>À</t>…
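
In these rules, <s> marks a secondary (accent) difference and <t> a tertiary (case) difference from the preceding entry. The sketch below imitates that layering with a three-part sort key: base letters first, then accents, then case. It is a rough approximation for illustration only, not the Unicode Collation Algorithm or an ICU collator, and the name `sort_key` is made up for this example.

```python
import unicodedata


def sort_key(word):
    """Build a (primary, secondary, tertiary) key for a crude multi-level sort."""
    decomposed = unicodedata.normalize("NFD", word)
    base = [ch for ch in decomposed if not unicodedata.combining(ch)]
    marks = [ch for ch in decomposed if unicodedata.combining(ch)]
    primary = "".join(ch.lower() for ch in base)      # letters, ignoring case
    secondary = tuple(ord(ch) for ch in marks)        # accents break ties next
    tertiary = tuple(ch.isupper() for ch in base)     # lower case before upper
    return (primary, secondary, tertiary)


WORDS = ["b", "Á", "a", "á", "A"]
print(sorted(WORDS, key=sort_key))
```

With this key, case only matters between strings that already agree on letters and accents, which mirrors the a, A, á, Á ordering implied by the rule excerpt.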


CLDR Tools

Export
  • ICU resource bundle generation
  • POSIX locale generator
  • OpenOffice.org format export
Survey tool
  • http://www.unicode.org/cgi-bin/cldr-survey


Vetting Process for Data

Collect from different platforms, experts, submissions: new or revised
  • References to external sources strongly encouraged
  • Must be before freeze date for release
  • Use Survey Tool to Collect Data


Causes of Conflicting Data

Typographical errors
  • Canda instead of Canada
Regional differences
  • German spelling is different between countries
Parts of speech
  • “март 2004” versus “3 марта” when the Russian word for March is used in a date
Context of usage
  • Normal German sorting versus German phonebook sorting
Standards versus common use
  • “Republic of Laos” versus “Laos”
Individual preferences
  • 24 hour time format versus 12 hour time format


Latest Release: CLDR 1.4

Released: July 17, 2006
360 locales:
  • 121 languages
  • 142 territories
25% more data
17,000 new or modified data items
Over 100 different contributors


Challenges

Complex Formats
Experts knowledgeable both in technology and a specific language
  • Collation
  • Exemplar characters
  • Etc…
Require close interaction of CLDR experts with language experts


Getting Involved

Simplest – anyone!
  • Use CLDR
  • Bug report / feature request
More Involved
  • Vetting, Assessment, Tools, Policies, Decisions, …
  • Any Unicode member eligible to name representatives including country liaison members


Example Country Process (Finland)

Finnish Ministry of Education made CLDR data a major goal, 2004-06
  • Research Institute for the Languages of Finland (“RILF” aka “Kotus”) designated agency
  • Two official languages (Finnish and Swedish) & four regional / minority languages (three Sámi & Romani as spoken in Finland) to be covered
  • Over 30 different parties represented: commercial, non-commercial, individuals
  • Results expected to lead to new/revised national standards


For More Information

Unicode
  • http://www.unicode.org/
CLDR
  • http://www.unicode.org/cldr/
LDML specification
  • http://unicode.org/reports/tr35

[email protected]@us.ibm.com