Internationalization in Ruby 2.4
http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/
40th Internationalization and Unicode Conference
Santa Clara, California, U.S.A., November 3, 2016
Martin J. DÜRST
Aoyama Gakuin University
© 2016 Martin J. Dürst, Aoyama Gakuin University
Abstract
Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses the progress of adding internationalization functionality to Ruby for the version 2.4 release expected towards the end of 2016. One focus of the talk will be the currently ongoing implementation of locale-aware case conversion.
Since Ruby 1.9, Ruby has had a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often and most conveniently used with UTF-8.
Support for internationalization facilities beyond character encoding has been available via various external libraries. As a result, applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, up to version 2.3, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. Our implementation extends this to the whole Unicode range for version 2.4, and efficiently reuses data already available for case-insensitive matching in regular expressions.
We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods, e.g. for case conversion, with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way. Both the design and the implementation of the new functionality for Ruby 2.4 will be described.
This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.
For Best Viewing
These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux). Use F11 to switch to projection mode and back. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.
Introduction
Introductions
Audience:
Programming experience?
Programming with Ruby/Rails?
Internationalization/Globalization experience?
Unicode knowledge?
Speaker:
From Switzerland, living in Japan
Long-term Unicode/W3C/IUC involvement
Ruby committer since 2007, mainly contributing
Encoding conversion (String#encode, Ruby 1.9)
Unicode normalization (String#unicode_normalize, Ruby 2.2)
Non-ASCII case conversion (String#upcase, ..., Ruby 2.4)
Unicode version updates (Unicode 9.0 for Ruby 2.4)
Overview
Introduction
Ruby Basics
New in Ruby 2.4: Non-ASCII Case Conversion
Implementation Details
Lessons Learned and Future Work
Ruby Basics
Ruby
Created by Yukihiro Matsumoto (Matz; since 1993)
Easy for beginners, deep for experts
Object-oriented throughout, but not obtrusive
Extremely flexible
Particularly strong for (internal) DSLs and metaprogramming
Used for Ruby on Rails Web Framework
Ruby Implementations
MRI (Matz's Ruby Implementation), aka C-Ruby, available on many platforms (download for Windows)
JRuby: Ruby on the JVM
RubyMotion: Ruby for iOS, Android, and MacOS
Opal: Ruby to JavaScript compiler
Rubinius: Ruby (mostly) in Ruby
A lot more ...
This tutorial is about MRI/C-Ruby, the reference implementation
Basic Ruby
3.times { puts 'Hello Ruby!' }
Hello Ruby!
Hello Ruby!
Hello Ruby!
Everything is an object
Methods can take blocks ({ ... } or do ... end)
Unobtrusive syntax (no need for semicolons, ...)
Conventions Used in This Talk
Code is mostly green, monospace: puts 'Hello Ruby!'
Variable parts are orange: puts "some string"
Encoding is indicated with a subscript: 'Юに코δ'UTF-8, 'ユニコード'SJIS
Results are indicated with "⇒": 1 + 1 ⇒ 2
Frequent Example: Юに코δ
Ю: Cyrillic uppercase YU
に: Hiragana NI
코: Hangul KO
δ: Greek delta
Up and Running
Install Ruby
Open a UTF-8 based console
Easy on Mac and Linux
On Windows: Cygwin Terminal, PuTTY, ..., or command prompt with chcp 65001
Start irb (Interactive Ruby)
Type in Ruby commands
String Basics
Strings are sequences of characters (codepoints):
"Юに코δ".length ⇒ 4
We can get a byte count with:
"Юに코δ".bytesize ⇒ 10
They are instances of class String:
"Юに코δ".class ⇒ String
Characters are strings of length 1:
"Юに코δ"[0] ⇒ "Ю"; "Юに코δ"[0].length ⇒ 1
Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints. Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.
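The string basics above can be tried directly in irb. This is a minimal sketch assuming a UTF-8 source file or console; the constant name SAMPLE is only for illustration.

```ruby
# Assumes UTF-8 source encoding (the default since Ruby 2.0).
SAMPLE = "Юに코δ"

puts SAMPLE.length        # number of characters (codepoints)
puts SAMPLE.bytesize      # number of bytes in UTF-8
puts SAMPLE.class         # String
puts SAMPLE[0]            # first character, itself a String
puts SAMPLE[0].class      # String
puts SAMPLE[0].length     # 1

# Per-character byte lengths: Cyrillic and Greek take 2 bytes,
# Hiragana and Hangul take 3 bytes in UTF-8.
p SAMPLE.each_char.map(&:bytesize)
```

Indexing returns a String again, so every String method is also available on single characters.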
Encoding Basics
Each String has an encoding
Strings with different encodings can't be mixed
'Юに코δ'UTF-8 + 'Юに코δ'UTF-16 ⇒ Encoding::CompatibilityError
Trying to combine strings with different encodings, as here with concatenation (+), leads to an exception. There are some exceptions (sic!) to this rule that we will look at later. The reasoning for the error here is that transcoding should not happen without the programmer being aware of it.
'Dürst'ISO-8859-1 == 'Dürst'ISO-8859-2 ⇒ false
Trying to compare two character-by-character identical strings in different encodings will produce false, even if these strings are, as in the above example, also byte-for-byte identical. Again, the reason for the result is that encoding mismatches should be detected early. In addition, a simple byte-for-byte comparison could produce false positives.
except if their content is ASCII-only (bytes)
'abc'ISO-8859-1 == 'abc'Shift_JIS ⇒ true
Just use Unicode, just use UTF-8
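The encoding rules above can be checked with a short script. This is a sketch only; the helper mixable? and the constant names are made up for illustration, while encode and force_encoding are standard String methods.

```ruby
UTF8_TEXT  = 'Юに코δ'
UTF16_TEXT = UTF8_TEXT.encode('UTF-16LE')

# Hypothetical helper: true if a + b works, false on an encoding clash.
def mixable?(a, b)
  a + b
  true
rescue Encoding::CompatibilityError
  false
end

LATIN1_NAME = 'Dürst'.encode('ISO-8859-1')
# Same bytes, but labelled as ISO-8859-2 (ü is 0xFC in both encodings).
LATIN2_NAME = LATIN1_NAME.dup.force_encoding('ISO-8859-2')

p mixable?(UTF8_TEXT, UTF16_TEXT)             # concatenation is refused
p LATIN1_NAME.bytes == LATIN2_NAME.bytes      # byte-for-byte identical
p LATIN1_NAME == LATIN2_NAME                  # but not equal as strings
p 'abc'.encode('ISO-8859-1') == 'abc'.encode('Shift_JIS')  # ASCII-only content compares equal
```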
Ruby Likes UTF-8
Default for source encoding (since Ruby 2.0)
(no need for a # encoding: UTF-8 pragma)
Encoding of strings with \u escapes is always UTF-8
"abc\u03B4" ⇒ 'abcδ'UTF-8
Use -U option if not in a UTF-8 context: ruby -U myscript.rb
Processing of UTF-8 is optimized where possible
Used out of the box by Ruby on Rails
Transcoding available on input/output
The only (internal) encoding in Ruby 3.0 or 4.0 (speculation!)
Ruby Versions
Ruby ≤1.8: RIP (Strings as byte sequences)
Ruby 1.9 and later (Strings as character sequences)
Ruby 2.0: UTF-8 default source encoding
Ruby 2.2: Unicode normalization added (2014)
Ruby 2.3: Newest published version
Ruby 2.4: Release planned for Christmas 2016, non-ASCII case conversion
Ruby Versions and Unicode Versions
Year (y)   Ruby version (VRuby)          Unicode version (VUnicode)
           published around Christmas    published in Summer
2014       2.2                           7.0.0
2015       2.3                           8.0.0
2016       2.4                           9.0.0
A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen for new Ruby versions.
RbConfig::CONFIG["UNICODE_VERSION"] ⇒ '9.0.0'
VUnicode = y - 2007
VRuby = 1.5 + VUnicode · 0.1
VUnicode = VRuby · 10 - 15
Don't extrapolate too far!
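The three tongue-in-cheek formulas above can be checked against the 2014 to 2016 rows of the table. This is a sketch with made-up method names; it uses integer arithmetic (15 + VUnicode tenths) to avoid floating-point rounding.

```ruby
# VUnicode = y - 2007
def unicode_major_for(year)
  year - 2007
end

# VRuby = 1.5 + VUnicode * 0.1, computed in tenths to stay exact
def ruby_version_for(unicode_major)
  tenths = 15 + unicode_major
  "#{tenths / 10}.#{tenths % 10}"
end

[2014, 2015, 2016].each do |year|
  unicode = unicode_major_for(year)
  puts "#{year}: Ruby #{ruby_version_for(unicode)}, Unicode #{unicode}.0.0"
end
```

As the slide says, the pattern only holds for these years.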
New in Ruby 2.4:
Non-ASCII Case Conversion
Case Conversion Functions in Ruby
'Unicode Everywhere'.upcase ⇒ 'UNICODE EVERYWHERE'
'Unicode Everywhere'.downcase ⇒ 'unicode everywhere'
'Unicode Everywhere'.capitalize ⇒ 'Unicode everywhere'
'Unicode Everywhere'.swapcase ⇒ 'uNICODE eVERYWHERE'
Case Conversion in Ruby 2.3
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
Case Conversions NOT in Ruby 2.3
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
Case Conversion up to and including Ruby 2.3 is ASCII-only!
But in Ruby 2.4!
Case Conversion Around the World
Many more Latin letters than just A-Z
Other scripts:
Cyrillic, Greek
Coptic, Armenian [, Georgian]
Cherokee, Deseret, Osage
Old Hungarian, Warang Citi, Glagolitic, Adlam
More minority scripts may introduce case distinction from surrounding majority scripts
Case Distinction History
Originally: Style difference, depending on medium
Upper case for stone inscriptions (SPQR)
Lower case for wax tablets, ...?
Functional distinction since ~15th century
Modern Case Usage
(details vary by language)
ALL UPPER CASE
EMPHASIS
Acronyms, abbreviations (DRY, SQL)
First letter upper case
Start of sentence
Words in titles
Proper nouns/adjectives (Kyoto, Japanese)
Nouns
Honorifics
Lower case: everything else
German:
der Gefangene floh - the prisoner fled, but
der gefangene Floh - the captive flea
Isn't ASCII-only Case Conversion Enough?
Already in other languages (Python, Perl, Java, ...)
Already in Ruby (Regexp: //i)
Algorithms and data are available from the Unicode Consortium
It's a good idea in general
But: Backwards Compatibility?
Idea: Option for new functionality
'Résumé'.upcase ⇒ 'RéSUMé'
'Résumé'.upcase :unicode ⇒ 'RÉSUMÉ'
Matz felt option was not necessary
Lots of data is ASCII-only
For non-ASCII data, you hopefully used a gem (which you can now eliminate)
Check early
grep your code base for upcase and friends
Test early (preview 2 of Ruby 2.4)
Backwards Compatibility Problems
Explicit ASCII-only case conversion
E.g. DNS servers (but you used Encoding::ASCII_8BIT there anyway?!)
Exact matches after conversion
1. Allowed non-ASCII in userids (e.g. Соколов)
2. downcased with Ruby 2.3 to help users (Соколов in DB)
3. Used exact match
4. In Ruby 2.4, соколов will not match Соколов anymore
Localization: See Turkic, Lithuanian special cases
Backwards Compatibility: :ascii Option
Use if you find a case where you really don't want to convert non-ASCII characters
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
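The userid scenario from the compatibility slide can be replayed as follows on Ruby 2.4 or later. This is a sketch: the method names and the sample userid handling are invented for illustration; only the :ascii option itself comes from the slides.

```ruby
# Hypothetical lookup key as Ruby 2.3 would have produced it:
# only ASCII letters are downcased, non-ASCII letters are left alone.
def legacy_key(userid)
  userid.downcase(:ascii)
end

# Lookup key with the Ruby 2.4 default (full Unicode downcasing).
def unicode_key(userid)
  userid.downcase
end

puts legacy_key('Соколов')   # unchanged, matches what a Ruby 2.3 app stored
puts unicode_key('Соколов')  # all lowercase, no longer an exact match
puts 'Résumé'.upcase(:ascii) # accented letters stay lowercase
```

Keeping :ascii on the lookup path would be one way to stay compatible with keys stored by older code, until the stored data itself is migrated.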
Implementation Choices
Use a library?
Pure Ruby: UnicodeUtils, ActiveSupport::Multibyte, TwitterCLDR
C extensions: ICU as a gem: icu, ffi-icu
Integrate ICU?
Write new code?
Implementation Choices
Use a library?
Different interface if used directly
Not efficient if in pure Ruby
Data duplication
Integrate ICU?
ICU and Ruby both have their own low-level idea of strings
Write new code?
That's what we ended up doing
Where to Get the Data From?
Data and other specifications available from the Unicode Consortium:
UnicodeData.txt
CaseFolding.txt
SpecialCasing.txt
Special Cases: Not 1-to-1
Number of characters not preserved
'ß'.upcase ⇒ 'SS' (German sz/sharp s)
'ﬃ'.upcase ⇒ "FFI" (ﬃ ligature)
Not necessarily reversible
'ß'.upcase.downcase ⇒ 'ss' (not 'ß')
'σ'.upcase ⇒ 'Σ' (Greek sigma)
'ς'.upcase ⇒ 'Σ' (Greek final sigma)
'ς'.upcase.downcase ⇒ 'σ' (not 'ς')
Implemented!
'Σ'.downcase should be context-dependent
Not yet implemented!
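The length-changing and non-reversible mappings listed above can be observed on Ruby 2.4 or later. This is a small sketch; the ligature is written as the escape \uFB03 so that it survives fonts that lack it, and round_trip is a made-up helper name.

```ruby
FFI_LIGATURE = "\uFB03" # LATIN SMALL LIGATURE FFI

# Hypothetical helper: upcase, then downcase again.
def round_trip(string)
  string.upcase.downcase
end

puts 'ß'.upcase                 # two characters
puts FFI_LIGATURE.upcase        # three characters
puts FFI_LIGATURE.upcase.length
puts round_trip('ß')            # does not come back as 'ß'
puts 'σ'.upcase == 'ς'.upcase   # both sigmas share one uppercase form
puts round_trip('ς')            # comes back as the non-final sigma
```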
Special Case: Simple Case Mapping
Defined by Unicode
Excludes mappings that change string length
Feels outdated
Not implemented!
Special Case: Turkic
Usual:
'i'.upcase ⇒ 'I'
'I'.downcase ⇒ 'i'
Turkish, Azerbaijani, and related languages when written in Latin script:
'i'.upcase ⇒ 'İ' (uppercase I with dot)
'İ'.downcase ⇒ 'i'
'ı'.upcase ⇒ 'I' (i without dot)
'I'.downcase ⇒ 'ı'
Implemented!
'Türkiye'.upcase :turkic ⇒ 'TÜRKİYE'
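The Turkic mappings can be tried with the :turkic option on Ruby 2.4 or later. This is a sketch; the two wrapper methods are only there to make the calls easy to compare.

```ruby
# Thin wrappers around the built-in methods with the :turkic option.
def turkic_upcase(string)
  string.upcase(:turkic)
end

def turkic_downcase(string)
  string.downcase(:turkic)
end

puts 'i'.upcase               # default: dotless capital I
puts turkic_upcase('i')       # capital I with dot above (U+0130)
puts turkic_downcase('I')     # dotless small i (U+0131)
puts turkic_upcase('Türkiye')
```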
Special Case: Lithuanian
Usual:
'Í'.downcase ⇒ 'í' (accent replaces dot)
Lithuanian:
'Í'.downcase :lithuanian ⇒ 'i̇́' (accent above visible dot; may not show because of technology limits)
Not yet implemented!
Special Case: Case Folding
Case mapping:
Change from one form to another
upcase/downcase/capitalize/swapcase
Case folding:
Eliminate case-related differences
For comparison, sorting
In general same as downcase
But: ß → ss, ﬃ → ffi, ς → σ
Upcase for Cherokee
Implemented! with :fold option on downcase
'ß'.downcase :fold ⇒ 'ss'
'ﬃ'.downcase :fold ⇒ 'ffi'
'ς'.downcase :fold ⇒ 'σ'
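Case folding is mainly useful for comparison. The sketch below builds a case-insensitive comparison on top of downcase(:fold) on Ruby 2.4 or later; same_ignoring_case? is a made-up helper, not a built-in method.

```ruby
# Hypothetical helper: compare two strings after case folding.
def same_ignoring_case?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

puts 'ß'.downcase             # plain downcase keeps the sharp s
puts 'ß'.downcase(:fold)      # folding expands it
puts "\uFB03".downcase(:fold) # ligature is expanded to three letters
puts 'ς'.downcase(:fold)      # final sigma folds to regular sigma
puts same_ignoring_case?('Straße', 'STRASSE')
```

Note that :fold is only accepted by downcase, as stated on the slide.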
Special Case: Titlecase
Some characters have three case forms:
Upper case: Ǆ (Croatian/Serbian)
Lower case: ǆ
Title case: ǅ
Important for capitalize
'ǆungla'.capitalize ⇒ 'ǅungla' (not 'Ǆungla')
Implemented!
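The three case forms are single characters (U+01C4, U+01C5, U+01C6), which is easy to miss when a font or a text extraction step shows them as two letters. This sketch, for Ruby 2.4 or later, uses escapes to make the codepoints explicit; the constant names are invented.

```ruby
DZ_UPPER = "\u01C4" # LATIN CAPITAL LETTER DZ WITH CARON
DZ_TITLE = "\u01C5" # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
DZ_LOWER = "\u01C6" # LATIN SMALL LETTER DZ WITH CARON

word = DZ_LOWER + 'ungla'

puts word.length      # 6 characters: the digraph counts as one
puts word.capitalize  # starts with the titlecase form
puts word.upcase      # starts with the uppercase form
p word.capitalize[0] == DZ_TITLE
```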
More Special Cases
Contextual processing, e.g. for i with combining dots (part of Unicode algorithm definition)
German uppercase ß (not part of Unicode algorithm definition)
others, ...
Not implemented (yet?)
Implementation
12 Methods to Implement
String (functional)   String (destructive)   Symbol
upcase                upcase!                upcase
downcase              downcase!              downcase
capitalize            capitalize!            capitalize
swapcase              swapcase!              swapcase
Not dealt with: String#casecmp
Why: Includes sorting
Internally, a Single Function
Flags to indicate operation needed (in file include/ruby/oniguruma.h):
#define ONIGENC_CASE_UPCASE (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE (1<<15) /* titlecase mapping */
Usage to indicate operation type:
upcase: ONIGENC_CASE_UPCASE (upcasing needed)
downcase: ONIGENC_CASE_DOWNCASE (downcasing needed)
capitalize: ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE (changed to ONIGENC_CASE_DOWNCASE after first character)
swapcase: ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE (both upcasing and downcasing needed)
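The flag combinations can be modelled in a few lines of Ruby to see how one function can tell the four operations apart. This is an illustration only, not the C code: the bit positions are copied from the #define lines, while the constant and method names are invented.

```ruby
# Bit positions as in include/ruby/oniguruma.h.
CASE_UPCASE    = 1 << 13
CASE_DOWNCASE  = 1 << 14
CASE_TITLECASE = 1 << 15

OPERATION_FLAGS = {
  upcase:     CASE_UPCASE,
  downcase:   CASE_DOWNCASE,
  capitalize: CASE_TITLECASE | CASE_UPCASE,
  swapcase:   CASE_UPCASE | CASE_DOWNCASE
}.freeze

# swapcase is recognized by having both direction bits set.
def swapcase?(flags)
  both = CASE_UPCASE | CASE_DOWNCASE
  (flags & both) == both
end

# Hypothetical model of "changed to DOWNCASE after first character".
def flags_after_first_char(flags)
  (flags & CASE_TITLECASE).zero? ? flags : CASE_DOWNCASE
end

OPERATION_FLAGS.each do |name, flags|
  puts format('%-10s %016b', name, flags)
end
```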
Option Handling
Flags also used for options:
:fold (for case folding; only on downcase)
:turkic
:lithuanian (not yet implemented)
:ascii
Corresponding flags:
#define ONIGENC_CASE_FOLD (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY (1<<22) /* limited to ASCII */
String Expansion
Handles string expansion (e.g. "ﬃ".upcase ⇒ "FFI")
Common to all casing operations
Linked list of buffers (b1→b2→b3→...)
Repeatedly calls encoding-specific primitive to fill as much as possible of next buffer
For buffer bx, allocates bytes_to_still_be_converted · x + 20 bytes
Example: We need a 3rd buffer, and need to convert 5 more bytes, so we allocate length(b3) = 5 · 3 + 20 = 35 bytes
Until no new buffer is needed
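The buffer sizing rule is simple arithmetic and can be written down directly. This is a sketch in Ruby rather than the C implementation; the method name is invented, and only the formula (remaining bytes times buffer number, plus 20) comes from the slide.

```ruby
# Size of the x-th buffer, given how many source bytes are still unconverted.
def next_buffer_size(bytes_still_to_convert, buffer_number)
  bytes_still_to_convert * buffer_number + 20
end

# The example from the slide: third buffer, 5 source bytes left.
puts next_buffer_size(5, 3)

# Later buffers get a larger multiplier for the same remaining input,
# so repeated expansion is less likely to need yet another buffer.
(1..4).each { |x| puts "b#{x}: #{next_buffer_size(5, x)} bytes" }
```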
Handling Encodings: The Ruby Way
Each encoding is implemented by a series of primitives
Work like methods (polymorphism), but implemented in C
Total of 13 primitives per encoding
Example primitives:
Length of character at current byte position
Advance byte position by one character
Codepoint of character at current byte position
Insert codepoint x at current byte position
[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)
Implementation Choice: UTF-8 only or Primitives
Matz would have been fine with
Full Unicode case conversion for UTF-8
ASCII-only for all other encodings
Actually used primitives to obtain
A more complete implementation
Experience about pros/cons of using primitives
Implementation Choice: New or Reused Primitive
3 primitives are used for case folding with regular expressions (//i)
mbc_case_fold
apply_all_case_fold
get_case_fold_codes_by_str
Found no good way to reuse any of these
New primitive
But found a lot of reusable data
The case_map Primitive
Input/output parameters:
OnigCaseFoldType flags
Start of source
Input parameters:
End of source
Start of destination
End of destination
Encoding (to call other primitives)
Output parameters:
Byte count of conversion result (negative for errors)
Most complex 'primitive', although not by much
Implementations of case_map Primitive
Examples:
"Résumé"UTF-8.upcase calls onigenc_unicode_case_map in enc/unicode.c (most complex case), as defined with OnigEncodingDefine in enc/utf_8.c
"Résumé"UTF-16LE.upcase calls onigenc_unicode_case_map in enc/unicode.c, as defined with OnigEncodingDefine in enc/utf_16le.c
"Résumé"ISO-8859-1.upcase calls case_map in enc/iso_8859_1.c (simple case, good starting point for primitive for new encoding), as defined with OnigEncodingDefine in the same file
The Primitive of Primitives: onigenc_unicode_case_map
Works for UTF-8, UTF-16[BE|LE], UTF-32[BE|LE]
140 lines long 'monster function'
Same structure as simpler primitives:
Big while loop, one source character at a time
Carefully updating ONIGENC_CASE_MODIFIED flag
Deal with special cases 'by hand'
Reuse existing data where possible
~30 if/else if/else
Lots of |/& with flag bits
2 gotos
gperf-created hash lookups:
onigenc_unicode_fold_lookup
onigenc_unicode_unfold1_lookup
More case_map Primitives
Students (sophomores/juniors/seniors) at Aoyama Gakuin University
ISO-8859-2: Yushiro Ishii (石井 優史朗)
ISO-8859-3: Kanon Shindo (新藤 海音)
ISO-8859-4: Kotaro Yoshida (吉田 孝太郎)
ISO-8859-5: Masaru Onodera (小野寺 俊)
ISO-8859-7: Kosuke Kurihara (栗原 光祐)
ISO-8859-9: Kazuki Iijima (飯島 一貴)
ISO-8859-10: Toya Hosokawa (細川 登陽)
ISO-8859-13: Takuya Miyamoto (宮本 拓弥)
ISO-8859-14: Yutaro Tada (多田 悠太朗)
ISO-8859-15: Maho Harada (原田 真帆)
ISO-8859-16: Satoshi Kayama (香山 智志)
Windows-1250, -1257: Sho Koike (小池 翔)
Windows-1251: Shunsuke Sato (佐藤 駿介)
Windows-1252: Serina Tai (田井 芹奈)
Windows-1253: Takumi Koyama (小山 拓美)
So What about Shift_JIS and Friends?
For East Asian encodings (Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW, ...)
data could be shared between //i and case mapping
but case folding for //i only works for ASCII
None of the main Japanese committers thought this was needed anymore
Talk to me if you need it
Reusing Case Folding Data
Onig[uruma|gmo] has data for case folding
Folding is very close to downcase
There is also unfolding (why?), which is close to upcase
That's almost all we need
Folding Data: Before and After
in enc/unicode/9.0.0/casefold.h
/* before */
{0x0041, {1, {0x0061}}}, /* A → a */
{0x00df, {2, {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1, {0x01c6}}}, /* Ǆ → ǆ */
{0x01c5, {1, {0x01c6}}}, /* ǅ → ǆ */
{0xab73, {1, {0x13a3}}}, /* ꭳ → Ꭳ (Cherokee) */
/* after */
{0x0041, {1|F|D, {0x0061}}}, /* A → a */
{0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1|F|D|ST|I(8), {0x01c6}}}, /* Ǆ → ǆ */
{0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}}, /* ǅ → ǆ */
{0xab73, {1|F|U, {0x13a3}}}, /* ꭳ → Ꭳ (Cherokee) */
Folding Data: Flags
(squeezed into an int where only 2 bits were used)
see enc/unicode.c
/* data is available here */
/* (flags are the same as for options) */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/* data is in special additional array */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/* index into special array (size: around 420 words only) */
#define I(n) OnigSpecialIndexEncode(n)
Small Implementation Detail(or my attempt at using the Takahashi method)
upcase
seems useful
downcase
seems useful
capitalize
seems useful
swapcase
Who would use swapcase?
Nobody?
Well, I did, when testing swapcase!
Why swapcase?
Python has it ?! (Matz)
To revert accidental Caps Lock output ?! (on Unicode list)
implementing swapcase
must be easy
UPPER ⇒ upper
lower ⇒ LOWER
But what about titlecase?
ǲ, ǅ, ǈ, ǋ
ᾼ, ᾈ, ᾉ, ᾊ, ᾋ, ᾌ, ᾍ, ᾎ, ᾏ
ῌ, ᾘ, ᾙ, ᾚ, ᾛ, ᾜ, ᾝ, ᾞ, ᾟ
ῼ, ᾨ, ᾩ, ᾪ, ᾫ, ᾬ, ᾭ, ᾮ, ᾯ
Choice 1: leave as is
"ǅunGLA".swapcase ⇒ "ǅUNgla"
preferred by Unicode Consortium (never ever need any new standardization)
preserves reversibility (X.swapcase.swapcase == X)
Choice 2: upcase
"ǅunGLA".swapcase ⇒ "ǄUNgla"
Choice 3: downcase
"ǅunGLA".swapcase ⇒ "ǆUNgla"
Choice 4: swap
"ǅunGLA".swapcase ⇒ "dŽUNgla"
proposed by Nobuyoshi Nakada
Implemented: swap ⇒ "dŽUNgla"
useless?, but 'correct'
additional effort for implementation
additional effort for testing
Commit Date
April 1st, 2016 (April Fools' Day)
Japan Time 20:58:33, same date in most timezones
please draw your own conclusions
Testing
Test-Driven Development
Write small example test
Verify that it doesn't work
Implement
Enjoy that it works
Rinse and repeat
Files:
test/ruby/enc/test_case_options.rb
test/ruby/enc/test_case_mapping.rb
Data-Driven Testing
Test
every character (except for ranges in UnicodeData.txt)
of every encoding
for all option combinations
for (almost) all methods
Data provided by Unicode
Identical to data used for implementation ?!
Files:
test/ruby/enc/test_case_comprehensive.rb
413 tests, 2'212'391 assertions, 0 failures, 0 errors, 0 skips
Continuous Integration
Commit early, commit often
Advice (and scolding) from hardcore Ruby hackers
Keep code reasonably clean, and motivation high
More commits → higher chance to attend Ruby Kaigi for free
But: Don't want to affect Ruby build or execution
Solution:
Make use of new functionality dependent on special option
Used :lithuanian (because last to be actually implemented)
Test with option protection
Remove option protection
Future:
Ideas, Problems, Questions
In No Particular Order
Character properties
Locale-aware formatting
What to do with encodings?
Character Properties
Unicode provides a wide range of character properties
Most available in Regexp
Does this string contain a Hiragana character?
'Юに코δ' =~ /\p{Hiragana}/
What script is 'Ю'?
sorry, impossible!
Currently looking at this with a student, hopefully for Ruby ~2.5
Use less memory
Faster
More properties
More ways to use
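Until there is a direct way to ask for a character's script, one workaround is to probe a list of script properties through Regexp. This is a sketch under that assumption: guess_script and the short list of candidate scripts are invented for illustration, and a character outside the list simply yields nil.

```ruby
# Candidate scripts to probe; extend as needed.
CANDIDATE_SCRIPTS = %w[Latin Cyrillic Greek Hiragana Katakana Hangul Han].freeze

# Hypothetical helper: first script whose \p{...} property matches the character.
def guess_script(char)
  CANDIDATE_SCRIPTS.find { |name| char =~ Regexp.new("\\p{#{name}}") }
end

p 'Юに코δ' =~ /\p{Hiragana}/   # index of the first Hiragana character
'Юに코δ'.each_char { |c| puts "#{c}: #{guess_script(c)}" }
```

This costs one regular expression match per candidate script, which is part of why a built-in property lookup would be attractive.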
Locale-Aware Formatting
What I want:
loc = Locale.new 'de-CH' (German as used in Switzerland)
1.2345678E5.to_s ⇒ "123456.78"
1.2345678E5.to_s(loc) ⇒ "123'456,78"
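Ruby has no Locale class, so the wish above cannot be run as written. The sketch below hard-codes the one format shown on the slide (apostrophe as group separator, comma as decimal separator) in a made-up helper; it handles non-negative numbers only and is not a substitute for real locale data.

```ruby
# Hypothetical helper, not part of Ruby: formats a non-negative number
# with two decimals in the style shown on the slide.
def format_like_slide(number)
  integer, fraction = format('%.2f', number).split('.')
  # Group digits in threes from the right.
  grouped = integer.reverse.scan(/\d{1,3}/).join("'").reverse
  "#{grouped},#{fraction}"
end

puts 1.2345678E5.to_s
puts format_like_slide(1.2345678E5)
puts format_like_slide(1234567.5)
```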
Well, Just use a Library
Internationalization support in libraries:
Pure Ruby: UnicodeUtils, ActiveSupport::Multibyte, TwitterCLDR
C extensions: ICU as a gem: icu, ffi-icu
Example: Unicode Normalization
UnicodeUtils
UnicodeUtils.nfkc string
ActiveSupport::Multibyte
ActiveSupport::Multibyte::Chars.new(string).normalize :kc
TwitterCLDR
TwitterCldr::Normalization::NFKC.normalize string
Native (since Ruby 2.2)
string.unicode_normalize :nfkc
Libraries avoid monkey patching
not Ruby-like (using a library does not feel like Ruby)
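The native variant needs no gem. A short sketch of String#unicode_normalize and String#unicode_normalized? (both available since Ruby 2.2), using escapes so the exact codepoints are visible; the constant names are invented.

```ruby
LIGATURE   = "\uFB03"   # ffi ligature, a compatibility character
DECOMPOSED = "e\u0301"  # e followed by combining acute accent

puts LIGATURE.unicode_normalize(:nfkc)         # three plain letters
puts LIGATURE.unicode_normalize(:nfc) == LIGATURE   # NFC keeps the ligature
puts DECOMPOSED.unicode_normalize(:nfc).length # composed into one character
puts DECOMPOSED.unicode_normalized?(:nfc)      # not yet in NFC
```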
Locales and Case Mappings
Possible solution:
loc = Locale.new 'tr'
'Türkiye'.upcase loc ⇒ 'TÜRKİYE'
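Since Locale does not exist yet, the idea can be approximated today by mapping a language tag to one of the existing case mapping options. This is a sketch only: the mapping table and upcase_for are hypothetical, and the table lists just two Turkic languages as examples.

```ruby
# Hypothetical table from primary language subtag to case mapping option.
CASE_OPTION_FOR_LANGUAGE = { 'tr' => :turkic, 'az' => :turkic }.freeze

# Hypothetical helper: upcase according to a language tag such as 'tr' or 'de-CH'.
def upcase_for(string, language_tag)
  primary = language_tag.split('-').first.downcase
  option = CASE_OPTION_FOR_LANGUAGE[primary]
  option ? string.upcase(option) : string.upcase
end

puts upcase_for('Türkiye', 'tr')     # dotted capital I
puts upcase_for('Türkiye', 'de-CH')  # default mapping, dotless capital I
```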
Encodings: Less is More?
We discovered flaky support for current encodings (//i case folding: all encodings not at end of test/ruby/enc/test_regex_casefold.rb)
The world is moving to Unicode
Matz wants to move to UTF-8, slowly but steadily
Do we let other encodings die slowly?
Or get rid of them in a single step (Ruby 3.0?)
Acknowledgments
Kimihito Matsui (松井 仁人) and many other students for help with research and implementations
Yui Naruse (成瀬 ゆい), Nobuyoshi Nakada (中田 伸悦) and many other Ruby committers for help and support
Matz (まつもと ゆきひろ) for Ruby, a programmer's best friend
Amaya, Opera 12.17, and coderay for slide production and display
The IME Pad for easy character input
Conclusions
Full Unicode case mapping (mostly) implemented
Options for backward compatibility, special conventions, case folding
Space efficient implementation by reusing Regexp data
Available in Ruby trunk now, please test!
More internationalization work needed
Tell me what you want most
References
More information about case conversion implementation internals:
http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/
(video at http://rubykaigi.org/2016/presentations/duerst.html)
Q & A
Send questions and comments to Martin Dürst
or open a bug report or feature request for Ruby
The latest version of this presentation is available at:
http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/