Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Goals
• Theplenary’sgoalistoinformthecommunity– Interna'onaliza'onisoEenunderstoodbyarela'velysmallnumberofexperts,butaffectsalargenumberofprotocols
• IABdraEcontainssomerecommenda'onsregardingchoiceofencodings– draE‐iab‐idn‐encoding‐01.txts'llinprogress
• Moreworkisneededandshouldcon'nue
IETF76TechnicalPlenary 2
Introduc'on
• Namescanhavenon‐ASCIIcharactersandareembeddedinvariousways:
– Hostname/IDN: café.com(Interna'onalizedDomainName)
– Email: 例え@テスト.com
– URL(actuallyIRI): hbp:// إختبار.مثال • (Interna'onalizedResourceIden'fier)
– UNCpath: \\例えテスト\public\file.doc(UniversalNamingConven'on=filepathscommoninWindows‐basedenvironments)
• Userswanttobrowsetheweb,etc.intheirownlanguage– Imaginetypinginanameinascript&languageyoudon’tknow
IETF76TechnicalPlenary 4
Situa'onToday/Soon
• ChinausesIDNsforallgovt.sitesandhasIDNTLDs(.中国,.公司and.网络)– Butarenotinthepublicroottoday
• 35.2%ofTaiwandomainsareIDNs
• 13.7%ofKoreandomainsareIDNs
• VocaldemandfromtheArabic‐scriptworld
• ICANNisexpectedtostartissuingIDNcountry‐codeTopLevelDomainssoon
IETF76TechnicalPlenary 5
SomeUnicodeTerminology
• Unicode:Asetofintegercodepointsintherange1–1,114,111(1–0x10FFFF)whereeachcodepointrepresents(withsomeexcep'ons)ahuman‐meaningfulvisual“character”
• UTF‐32:EachUnicodeintegercodepointstoredusingasingle32‐bitinteger(soendiannessmabers)
• UTF‐16:EachUnicodeintegercodepointencodedusingoneortwo16‐bitintegers(soendiannessmabers)
• UTF‐8:EachUnicodeintegercodepointencodedusingonetofour8‐bitintegers(sonoendiannessproblems)
7IETF76TechnicalPlenary
RFC2277:IETFPolicyonCharacterSetsandLanguages
IETF76TechnicalPlenary 8
Protocols MUST be able to use the UTF-8 charset
January1998
UTF‐8
• Codepoints0x00–0x7FsameasASCII– Codepoints0x00–0x7Fencodedusingoctetvalues0x00–0x7F
– Soallcurrent7‐bitASCIIfilesarealsovalidUTF‐8• withthesamemeaning
– Exis'ngfilesalreadyassigningothermeaningstooctetvalues0x80‐0xFF(e.g.ISO8859‐1)arenotautoma'callycompa'ble
• Highercodepointsusemul'‐octetsequences– Mul'‐octetsequencesuseoctetvalues0x80–0xF4
IETF76TechnicalPlenary 9
UTF‐8Mul'‐OctetSequences
IETF76TechnicalPlenary 10
0XXXXXXX
SingleoctetASCIIcharacter(Codepoints1‐127)
110XXXXX 10XXXXXX
1110XXXX
11110XXX
Firstoctetof2,3,4‐octetsequences
Con'nua'onoctetsofmul'‐octetsequences
UTF‐8Mul'‐OctetSequences
IETF76TechnicalPlenary 11
0XXXXXXX00000–0007F
110XXXXX
1110XXXX
11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
10XXXXXX 10XXXXXX
10XXXXXX00080–007FF
00800–0FFFF
10000–
UTF‐8Proper'es
• Nomid‐stringzerooctets• Statelesscharacterboundarydetec'on
– Robusttoinser'ons,dele'ons,errors,etc.• Strongheuris'cdetec'on
– E.g.AnyloneoctetwithtopbitsetsignalstextasnotvalidUTF‐8
• Byte‐wise,sortssameorderasrawUnicode
IETF76TechnicalPlenary 12
Compactness:Howmanyoctetsdoesittaketorepresentastring?
• Everyonecrea'ngtheirown‘op'mal’solu'on(op'malinsomespecificcontext)comesatahighpriceintermsofinteroperability
• Rela'vecompactnessfordifferentencodingsisnotnearlyasimportantontoday'ssystemsasinthepast– Textis'nycomparedtotoday’sotherdata:
• Images,Audio,Video
• Eveninterna'onaltextoEencontainsASCIImarkup– E.g.HTMLtagsinotherwiseinterna'onaltextfile
IETF76TechnicalPlenary 13
CaseStudy:Localiza'onStringsinApple’sMail.app• /Applica'ons/Mail.app/Contents/Resources/Japanese.lproj/Localizable.strings
• UTF‐16: 117,624bytes• UTF‐8: 68,693bytes
IETF76TechnicalPlenary 14
"UNDO_MARK_READ"="開封済みにする";
Punycode
• UsedforInterna'onalizedDomainNames• AmethodofencodingastringofUnicodeintegercodepointsusingonlythefollowingoctetvalues: 0x2D 0x30–0x39 0x61–0x7A
• i.e.octetvaluesthat,if(mis)interpretedasUSASCII,correspondtothefollowingUSASCIIcharacters: Hyphen Digits0–9 Lebersa–z
IETF76TechnicalPlenary 15
AQues'onofInterpreta'on:ASCIIornot?
IETF76TechnicalPlenary 17
XX XXXXXXXX XXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXX XXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XX XXXXXXXXXXXXX XXXXXXXXXXX XXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX X XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXX XXXXXXXXX XX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXX XXXXXXXXXXX XXXXX XXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXX XXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXX XXXXXXXXXXXXX XXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXX XXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXX XXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXX XXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXX XXXXXXX XXXXXXXX
AQues'onofInterpreta'on:ASCIIornot?
IETF76TechnicalPlenary 18
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XX XXXXXXXXXX XXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX X XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXX XXXXXXXXX XX XXXXXXXXXX XXXXXXXXXX XXXXXXXXX XXXXXXXXXXX XXXXX XXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX
IDNsinOtherIden'fiers
Therecanbemanywaystogettothesamefile.Forexample...
UsingHTTP:1.hbp://dthαler/Public/test.htm2.hbp://xn‐‐dthler‐rxe/Public/test.htm3.hbp://dth%CE%B1ler/Public/test.htm4.hbp://dthαler/Public/test.htm
UsingCIFS/SMB(filesystemprotocolsusingUNC):1.file://dthαler/Public/test.htm2.file://xn‐‐dthler‐rxe/Public/test.htm3.file://dth%CE%B1ler/Public/test.htm4.file://dthαler/Public/test.htm5.\\dthαler\Public\test.htm6.\\xn‐‐dthler‐rxe\Public\test.htm
21IETF76TechnicalPlenary
PlenaryAnnouncementtoie|‐announce@ie|.org
IETF76TechnicalPlenary 22
smooth and interoperable functioning
 of the Internet depends on text strings
 being interpreted in the same way
 by all systems connected to it.
HowManyLayersofEncoding?
• Howdoweencode:Adomainname… inanemailaddress… ina“mailto”URL… inawebpage?
• Doweuse:– Punycode(“xn‐‐…”)encodingforthedomainname?– EmailQuoted‐Printable(“=XX”)encoding?– URLpercent(“%XX”)escaping?– HTMLampersand(“&#xxxx;”)codes?
• Alloftheabove?IETF76TechnicalPlenary 23
IDNsinEmail
• Twotestemails• From:user@chεshίrε.stuartcheshire.org• Ineachemail,addressappearsintwoplaces:
– in“From”line– andinbodytext
• FirstemailencodedIDNusingPunycode:– xn‐‐chshr‐38d3be
• SecondemailencodedIDNusingdirectUTF‐8– chεshίrε
IETF76TechnicalPlenary 24
PunycodeEmail:xn‐‐chshr‐38d3be
IETF76TechnicalPlenary 27
Client From (Punycode) Body (Punycode) Gmail / IE xn--chshr-38d3be xn--chshr-38d3be
Gmail / Firefox 3 xn--chshr-38d3be xn--chshr-38d3be
Apple Mail xn--chshr-38d3be xn--chshr-38d3be
Penelope xn--chshr-38d3be xn--chshr-38d3be
Mulberry 4.08 xn--chshr-38d3be xn--chshr-38d3be
Thunderbird 2.0.0.16 xn--chshr-38d3be xn--chshr-38d3be
Eudora 6 on Mac OS X xn--chshr-38d3be xn--chshr-38d3be
Lotus Notes 7.03 & 8.01 xn--chshr-38d3be xn--chshr-38d3be
Outlook 2007 chεshίrε xn--chshr-38d3be
Outlook E-Mail (WM6) xn--chshr-38d3be xn--chshr-38d3be
Outlook Web Access / IE xn--chshr-38d3be xn--chshr-38d3be
UTF‐8Email:chεshίrε
IETF76TechnicalPlenary 28
Client From (UTF-8) Body (UTF-8) Gmail / IE xn--chshr-38d3be chεshίrε
Gmail / Firefox 3 xn--chshr-38d3be chεshίrε
Apple Mail chεshίrε chεshίrε
Penelope chεshίrε chεshίrε
Mulberry 4.08 chεshίrε chεshίrε
Thunderbird 2.0.0.16 chεshίrε chεshίrε
Eudora 6 on Mac OS X chεshίrε chεshίrε
Lotus Notes 7.03 & 8.01 chεshίrε chεshίrε
Outlook 2007 ch??sh??r?? chεshίrε
Outlook E-Mail (WM6) ch??sh??r?? chεshίrε
Outlook Web Access / IE ch??sh??r?? chεshίrε
MoreTerminologyinthisPresenta'on• Mapping:conver'ngonestringtoanother“equivalent”string– “CONTOSO.com”⇒“contoso.com”
• Matching:checkingtwostringsforequivalence– “CONTOSO.com”~“contoso.COM”– “möhringen.de”≠“moehringen.de”
• Sor:ng:determiningwhichstringcomesfirst– “contoso.com”<“MicrosoE.com”
• Encoding:samestringcanbeencodedindifferentways– includingissuesofcombiningcharacters:évse+´
29IETF76TechnicalPlenary
IDNIden'fierSpace
• IDNA‐validstring: noinvalidcharacters,legallength,etc.
• U‐label: aUnicodeIDNA‐validstring
• A‐label: “xn‐‐”followedby Punycode‐encodedIDNA‐validstring
IETF76TechnicalPlenary 30
MoreVarietyBringsMoreAmbiguity
• ComputerSystems: 2 (binary)
• TelephoneNumbers: 10 (0‐9)• ASCIIDomainNames: 37 (A‐Z,0‐9,‐)
• Interna'onalDomainNames: Tensofthousands
IETF76TechnicalPlenary 31
ConfusableStrings(1/4)
• Twostringsthatareeasilyconfusedbyahuman ETHIOPIA.com↔ΕΤΗΙΟΡΙΑ.corn
33
Greekalphabet! Plain“ASCll”confusion
Moreconfusion
IETF76TechnicalPlenary
• Lower‐casingwillrevealtheETHIOPIAissue• ετηιοριαisfairlydis'nc've– currenttrendistodeprecateupper‐case
andothermapping‐requiredformsinIRIsetc.– IDNA2008treatsthesecharactersasDISALLOWED
ConfusableStrings(2/4)
34
jessica.py↔јеѕѕіса.ру
Cyrillicalphabet
.ru=Russia
.py=Paraguay
IETF76TechnicalPlenary
• Anotherexample:
• “јеѕѕіса”actuallyusesCyrilliccharactersfromtwoseparatelanguages– Aregistrymayrestrictregistra'onstoonlycharactersintheirlanguage
• Otherexamplesexistwithoutmixinglanguages• epoxy.py↔ероху.ру
ConfusableStrings(3/4)
• Peopleseewhattheyexpecttosee– Russianrestaurant:“ресторан”– Non‐Russiansmightread“pectopah”
• Givensufficientlycrea'veuseoffontsforcedbystylesheetsetc.,confusioncanbeeasy
IETF76TechnicalPlenary 35
ItDependsonWhatYouKnow–andExpect
IETF76TechnicalPlenary 36
Isthesecondcharacter“A”?IfyouwerenotfamiliarwithLa'nscript,ordidn’tknowwhattoexpect,wouldyoubesure?Coulditbeastarofsomesort?AreyousurethatthefirstcharacterisanASCIIdot(theDNScares–alot)?
Arethesetwostringsiden'cal?Areyousure?Wouldyoubesureifyoudidn’tknowLa'nscriptortheorganiza'oninvolved?
ConfusableStrings(4/4)
• Otherkindsof“equivalence”—equalityinsomecontexts
• 中国 and中國Twocodepoints,sameconcept
• and السعوديةالسعوديةTwocodepointsforsameleber(moreorless)
IETF76TechnicalPlenary 37
AretheFollowingEquivalent?
IETF76TechnicalPlenary 38
Arabic‐Indic ۰۱۲۳٤٥٦۷۸۹EasternArabic‐Indic ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ChineseSuzhou 〡〢〣〤〥〦〧〨〩
European(ASCII) 0123456789
Devanagari(Hindi) ०१२३४५६७८९Tibetan ༠༡༢༣༤༥༦༧༨༩Tamil ௦௧௨௩௪௫௬௭௮௯
“Suspect”Names
• Poten'alforphishingabacks– butcouldbeinnocentoraccidental
• Nameswithscriptsnotusedbytheuser’slocale
• Nameswithmixedscripts(e.g.Cyrillic+La'n)
• UImightwanttowarntheuserwhendisplayinganyofthesefromanuntrustedsource– SomebrowsersdisplayA‐labels(“xn‐‐…”)inaddressbar,butthatconfuseshumans
39IETF76TechnicalPlenary
UniversalConfusableString
• Fewusersystemshaveallpossiblecharactersanddisplayfontsfortheminstalled.
• Ifcharactercannotberepresentedlocally,asix‐characterstringmightappearas– ??????or– □□□□□□
• Thisshouldbeawarning(butrememberwhatusersdowhentheyseeawarningtheydon’tunderstand)
IETF76TechnicalPlenary 40
WhyMapping?
• Insteadofintelligentmatchingalgorithm:– Mapeachstringtoadefinedcanonicalform– Simpletestifcanonicalformsarebitwiseiden'cal– Doesnotpermit“closeenough”orotherfuzzymatching
• Conversionofonevisualformtoanotherthatismorelocallyunderstandable– E.g.Tradi'onalChinese(中國)toSimplified(中国)
IETF76TechnicalPlenary 42
Mapping
• Mappinginherentlylosesinforma'on– Caseconversion,half/fullwidth,NFC/NFKC/etc
• Upper/lowercasingdiffersbylanguage– La'nalphabet: I⇔i– Turkish: I⇔ı İ⇔i
• tolower(‘I’)=???• toupper(‘i’)=???• tolower(toupper(‘ı’))≠‘ı’• Turksaren’ttoohappyaboutthis…
43IETF76TechnicalPlenary
Mapping
• Summary:– Neverrollyourownmapping
– Correctmappingforuserdependsonlanguagecontext,whichweoEendon’tknow
44IETF76TechnicalPlenary
(Over)SimplifiedArchitecture
Host
Applica'on
DNSResolverLibrary
Internet
IDNASHIM
Problem#1:DNSisn’ttheonlynameresolu'onprotocol
Problem#2:ThepublicInternetnamespaceisn’ttheonlynamespace
Differentprotocolsusedifferentencodingstoday
Differentnamespacesusedifferentencodingstoday
Realis'cNetworkArchitecture
HostApplica'on
NameResolu'onLibrary
DNS mDNS NetBIOSoverTCPLLMNR hostsfile NIS …
Enterprisenetwork
LocalLAN
Internet
…
Implementa'on‐specificencoding
(e.g.,UTF‐8orUTF‐16)
OtherNameResolu'onProtocols
• Manydefinedtousethesamesyntax– Hostsfile,DNS,mDNS,NetBIOS‐over‐TCP,etc.
• Nameresolu'onlibrarydecideswhatprotocolstotryinwhatorder– Appscannottellfromthenamewhatprotocolswillbeusedforresolu'on
– Differentlibrariesmayusedifferentorderandhencefinddifferentnametargets
• Differentprotocolsspecifyuseofdifferentencodings– Appscannottellwhatencodingswillbeneededforresolu'on
What’saLegalName?
• In1985,RFC952definedtheformatofthehostsfile:– “Internethost/net/gateway/domainname”containsASCIIlebers,digits,hyphens(LDH)
• In1989,RFC1035definedDNS:– “Preferrednamesyntax”:LDH– Butdoes“preferred”meanMAY/SHOULD/MUST?
49IETF76TechnicalPlenary
LegalDNSNames
• In1997,RFC2181clarified:– Anybinarystringwhatevercanbeusedasthelabelofanyresourcerecord
– Anybinarystringcanserveasthevalueofanyrecordthatincludesadomainname
– Applica0onscanhaverestric'onsimposedonwhatpar'cularvaluesareacceptableintheirenvironment
• Sameyear:– IETFpolicyoncharactersetsandlanguages…
IETFPolicyonCharacterSetsandLanguages(RFC2277)
• Itsays:– ProtocolsMUSTbeabletousetheUTF‐8charset– ProtocolsMAYspecify,inaddi'on,howtouseothercharsetsorothercharacterencodingschemes
– UsingadefaultotherthanUTF‐8isacceptable• SilentondifferentformswithinUTF‐8(e.g.case,encodingofcombiningcharacters,sortorder)– TwoUnicodestringsoEencannotbecomparedtoyieldresultsusersexpectwithoutaddi'onalprocessing
• PerRFC2181,DNScomplies
UseofDifferentEncodingsinDNS
• Star'ngthatyear(1997),somesystemsbeganusingUTF‐8inDNSinprivatenamespaces– Privatenamespaceheremeansnamesarenotresolvablefromoutsidethespecificnetwork
• Aboutfiveyearslater,IDNAdevelopment(includingPunycode)beganforuseinthepublicDNSnamespace
LengthIssues
• DNSnameshave– 63octetsperlabel– 255octetspername(notcoun'ngzeroattheend)
• Mostapplica'onAPIsuseNULL‐terminatedstrings• Non‐ASCIIcharactersuseavariablenumberofoctetsinUTF‐8,UTF‐16andPunycode– 256UTF‐16octets≠256UTF‐8octets≠256A‐labeloctets
• Somenamescanberepresented(withinlength)inPunycodeA‐labelsbutnotinUTF‐8
• Somenamescanberepresented(withinlength)inUTF‐8butnotinPunycodeA‐labels
53IETF76TechnicalPlenary
Let’sRecapWhereWeAre…
• Mul'pleencodingsofsameUnicodecharacters:– U‐labels: مثال.إختبار– A‐labels: xn‐‐mgbh0f.xn‐‐kgbechtv
• Differentencodingsused:– Bydifferentprotocols– OndifferentnetworkswithDNS
• PunycodeA‐labelsusedonInternet,UTF‐8inintranets– Bydifferentapplica'ons
• Results:– Failure—orworse—launchingoneappfromanother– Compe'torswitchingincen'ves,andpooruserexperiencewhenoneappworksandcompe'tordoesn’t
54IETF76TechnicalPlenary
Example“IDN‐aware”app:Browserpicksencodingbasedonintranetvs.Internet
xn‐‐mgbh0
NameResolu'onAPI
intranet Internet
DNS
LLMNR/mDNS
example.com.إختبار(UTF‐8)
إختبار
إختبار(UTF‐16)
إختبار
xn‐‐mgbh0 .example.com
Problem1:Host/appBusesA‐labelsinintranet
Problem2:Localnameresolu'onusesUTF‐8
Intranetname? Internetname?
55
DNSxn‐‐mgbh0 .example.com
example.com.إختبار(UTF‐8)
IETF76TechnicalPlenary
A
B
IDNAwareApp
NameResolu'on
hbp://مثال.إختبار
NameResolu'on
xn‐‐mgbh0 .xn‐‐kgbechtv
Non‐IDNAwareApp
P U
(UTF‐16)مثال.إختبار
hbp://مثال.إختبار
InconsistentExperienceAcrossApplica'ons
(UTF‐8)مثال.إختبار
xn‐‐mgbh0 .xn‐‐kgbechtv Phishingabackspossible
56IETF76TechnicalPlenary
OtherIDN‐AwareAppsLackconsistency,causingnon‐determinis'cexperience
NameResolu'on
hbp://مثال.إختبار
NameResolu'on
xn‐‐mgbh0 .xn‐‐kgbechtv
P U
xn‐‐mgbh0 .xn‐‐kgbechtv
hbp://مثال.إختبار
 إختبار.مثال (UTF‐16) إختبار.مثال (UTF‐16)
UnreachableUnreachable
مثال.إختبار(UTF‐8)
xn‐‐mgbh0 .xn‐‐kgbechtv
مثال.إختبار(UTF‐8)xn‐‐mgbh0 .xn‐‐kgbechtv
57IETF76TechnicalPlenary
BasicPrinciple
• ConversiontoA‐labels,UTF‐8,orwhateverotherencoding,canbedoneonlybyanen'tythatknowswhichprotocolandnamespacewillbeused
IETF76TechnicalPlenary 58
HardIssues1of2
• Clienthastoguessorlearnwhatencodinga{HTTP,DNS,SMTP,…}serverexpectsforaniden'fier
• Namesappearinsidemanyothertypesofiden'fiers,e.g.emailaddress,URLs,UNCpaths,networkaccessiden'fiers(NAIs)– Eachiden'fiertypehasitsownencodingconven'ons– Today,appsthatextracthostnamesneedtoconvertencodings
59IETF76TechnicalPlenary
HardIssues2of2
• Useofasingleencodingistheeasypart– Sufficientonlyiftheonlyintentistodisplay– Comparison,matching,lookup,sor'ng,etc.,allrequiremorework.
– JustasRFC952definedanASCIIsubsetfor“hostname”iden'fiers,weneedtodefineUnicodesubsetsforothertypesofiden'fiers.
• Op'malsubsetforoneprotocolmaynotbeop'malforanother.
• Interpreta'onanddisplayofsomestringsmaydifferbyopera'ngsystems–usuallyabug,butsome'mesnoagreementastowhichvaria'onisthebug.
IETF76TechnicalPlenary 60
Conclusions
• Smoothandinteroperablefunc'oningoftheInternetdependsontextstringsbeinginterpretedinthesamewaybyallsystemsconnectedtoit
• TheIETFhasrecognizedthissinceRFC20specifiedASCIIforuseininterchangein1969—thesugges'onsinthispresenta'onextendandupdatethatunderstandingaswellastheunderstandinginRFCs2277and5198
IETF76TechnicalPlenary 61
Conclusions
• Toavoidconfusionandambiguity,itisnotenoughmerelytosupportUTF‐8asoneofthetextencodingop'ons– TextinprotocolsonthewireshouldbeinUTF‐8andonlyinUTF‐8
• Foruser‐visibletextinprotocols:– Ifyoudon’tuseUTF‐8,whynot?
• Forprotocoliden'fiersnotseenbyusers:– IfyoudoallowthefullUnicodecharacterrange,why?
IETF76TechnicalPlenary 62