62
Interna’onaliza’on in Names and Other Iden’fiers IAB 1 IETF 76 Technical Plenary

Internaonalizaon in Names and Other Idenfiers

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Interna'onaliza'oninNamesandOtherIden'fiers

IAB

1IETF76TechnicalPlenary

Goals

•  Theplenary’sgoalistoinformthecommunity–  Interna'onaliza'onisoEenunderstoodbyarela'velysmallnumberofexperts,butaffectsalargenumberofprotocols

•  IABdraEcontainssomerecommenda'onsregardingchoiceofencodings– draE‐iab‐idn‐encoding‐01.txts'llinprogress

•  Moreworkisneededandshouldcon'nue

IETF76TechnicalPlenary 2

WhyisInterna'onaliza'onImportantandTimely?

3IETF76TechnicalPlenary

Introduc'on

•  Namescanhavenon‐ASCIIcharactersandareembeddedinvariousways:

–  Hostname/IDN: café.com(Interna'onalizedDomainName)

–  Email: 例え@テスト.com

–  URL(actuallyIRI): hbp:// إختبار.مثال •  (Interna'onalizedResourceIden'fier)

–  UNCpath: \\例えテスト\public\file.doc(UniversalNamingConven'on=filepathscommoninWindows‐basedenvironments)

•  Userswanttobrowsetheweb,etc.intheirownlanguage–  Imaginetypinginanameinascript&languageyoudon’tknow

IETF76TechnicalPlenary 4

Situa'onToday/Soon

•  ChinausesIDNsforallgovt.sitesandhasIDNTLDs(.中国,.公司and.网络)– Butarenotinthepublicroottoday

•  35.2%ofTaiwandomainsareIDNs

•  13.7%ofKoreandomainsareIDNs

•  VocaldemandfromtheArabic‐scriptworld

•  ICANNisexpectedtostartissuingIDNcountry‐codeTopLevelDomainssoon

IETF76TechnicalPlenary 5

Introduc'onandTerminology

6IETF76TechnicalPlenary

SomeUnicodeTerminology

•  Unicode:Asetofintegercodepointsintherange1–1,114,111(1–0x10FFFF)whereeachcodepointrepresents(withsomeexcep'ons)ahuman‐meaningfulvisual“character”

•  UTF‐32:EachUnicodeintegercodepointstoredusingasingle32‐bitinteger(soendiannessmabers)

•  UTF‐16:EachUnicodeintegercodepointencodedusingoneortwo16‐bitintegers(soendiannessmabers)

•  UTF‐8:EachUnicodeintegercodepointencodedusingonetofour8‐bitintegers(sonoendiannessproblems)

7IETF76TechnicalPlenary

RFC2277:IETFPolicyonCharacterSetsandLanguages

IETF76TechnicalPlenary 8

Protocols MUST be able to use the UTF-8 charset

January1998

UTF‐8

•  Codepoints0x00–0x7FsameasASCII–  Codepoints0x00–0x7Fencodedusingoctetvalues0x00–0x7F

–  Soallcurrent7‐bitASCIIfilesarealsovalidUTF‐8•  withthesamemeaning

–  Exis'ngfilesalreadyassigningothermeaningstooctetvalues0x80‐0xFF(e.g.ISO8859‐1)arenotautoma'callycompa'ble

•  Highercodepointsusemul'‐octetsequences– Mul'‐octetsequencesuseoctetvalues0x80–0xF4

IETF76TechnicalPlenary 9

UTF‐8Mul'‐OctetSequences

IETF76TechnicalPlenary 10

0XXXXXXX

SingleoctetASCIIcharacter(Codepoints1‐127)

110XXXXX 10XXXXXX

1110XXXX

11110XXX

Firstoctetof2,3,4‐octetsequences

Con'nua'onoctetsofmul'‐octetsequences

UTF‐8Mul'‐OctetSequences

IETF76TechnicalPlenary 11

0XXXXXXX00000–0007F

110XXXXX

1110XXXX

11110XXX 10XXXXXX 10XXXXXX 10XXXXXX

10XXXXXX 10XXXXXX

10XXXXXX00080–007FF

00800–0FFFF

10000–

UTF‐8Proper'es

•  Nomid‐stringzerooctets•  Statelesscharacterboundarydetec'on

– Robusttoinser'ons,dele'ons,errors,etc.•  Strongheuris'cdetec'on

– E.g.AnyloneoctetwithtopbitsetsignalstextasnotvalidUTF‐8

•  Byte‐wise,sortssameorderasrawUnicode

IETF76TechnicalPlenary 12

Compactness:Howmanyoctetsdoesittaketorepresentastring?

•  Everyonecrea'ngtheirown‘op'mal’solu'on(op'malinsomespecificcontext)comesatahighpriceintermsofinteroperability

•  Rela'vecompactnessfordifferentencodingsisnotnearlyasimportantontoday'ssystemsasinthepast–  Textis'nycomparedtotoday’sotherdata:

•  Images,Audio,Video

•  Eveninterna'onaltextoEencontainsASCIImarkup–  E.g.HTMLtagsinotherwiseinterna'onaltextfile

IETF76TechnicalPlenary 13

CaseStudy:Localiza'onStringsinApple’sMail.app•  /Applica'ons/Mail.app/Contents/Resources/Japanese.lproj/Localizable.strings

•  UTF‐16: 117,624bytes•  UTF‐8: 68,693bytes

IETF76TechnicalPlenary 14

"UNDO_MARK_READ"="開封済みにする";

Punycode

•  UsedforInterna'onalizedDomainNames•  AmethodofencodingastringofUnicodeintegercodepointsusingonlythefollowingoctetvalues: 0x2D 0x30–0x39 0x61–0x7A

•  i.e.octetvaluesthat,if(mis)interpretedasUSASCII,correspondtothefollowingUSASCIIcharacters: Hyphen Digits0–9 Lebersa–z

IETF76TechnicalPlenary 15

AQues'onofInterpreta'on:ASCIIornot?

IETF76TechnicalPlenary 16

AQues'onofInterpreta'on:ASCIIornot?

IETF76TechnicalPlenary 17

XX XXXXXXXX XXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXX XXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XX XXXXXXXXXXXXX XXXXXXXXXXX XXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX X XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXX XXXXXXXXX XX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXX XXXXXXXXXXX XXXXX XXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXX XXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXX XXXXXXXXXXXXX XXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXX XXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXX XXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXX XXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXX XXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXX XXXXXXXXXXX XXXXXXXX XXXXXXX XXXXXXXX

AQues'onofInterpreta'on:ASCIIornot?

IETF76TechnicalPlenary 18

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XX XXXXXXXXXX XXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX X XXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXX XXXXXXXXX XX XXXXXXXXXX XXXXXXXXXX XXXXXXXXX XXXXXXXXXXX XXXXX XXXXXXXX XXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX XXXXXXXXX XXXXXXXXXX XXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXX

Today’sTextChaos

IETF76TechnicalPlenary 19

Today’sTextChaos

IETF76TechnicalPlenary 20

IDNsinOtherIden'fiers

Therecanbemanywaystogettothesamefile.Forexample...

UsingHTTP:1.hbp://dthαler/Public/test.htm2.hbp://xn‐‐dthler‐rxe/Public/test.htm3.hbp://dth%CE%B1ler/Public/test.htm4.hbp://dthαler/Public/test.htm

UsingCIFS/SMB(filesystemprotocolsusingUNC):1.file://dthαler/Public/test.htm2.file://xn‐‐dthler‐rxe/Public/test.htm3.file://dth%CE%B1ler/Public/test.htm4.file://dthαler/Public/test.htm5.\\dthαler\Public\test.htm6.\\xn‐‐dthler‐rxe\Public\test.htm

21IETF76TechnicalPlenary

PlenaryAnnouncementtoie|‐announce@ie|.org

IETF76TechnicalPlenary 22

smooth and interoperable functioning
 of the Internet depends on text strings
 being interpreted in the same way
 by all systems connected to it.

HowManyLayersofEncoding?

•  Howdoweencode:Adomainname… inanemailaddress… ina“mailto”URL… inawebpage?

•  Doweuse:–  Punycode(“xn‐‐…”)encodingforthedomainname?–  EmailQuoted‐Printable(“=XX”)encoding?– URLpercent(“%XX”)escaping?– HTMLampersand(“&#xxxx;”)codes?

•  Alloftheabove?IETF76TechnicalPlenary 23

IDNsinEmail

•  Twotestemails•  From:user@chεshίrε.stuartcheshire.org•  Ineachemail,addressappearsintwoplaces:

–  in“From”line–  andinbodytext

•  FirstemailencodedIDNusingPunycode:–  xn‐‐chshr‐38d3be

•  SecondemailencodedIDNusingdirectUTF‐8–  chεshίrε

IETF76TechnicalPlenary 24

PunycodeEmail

IETF76TechnicalPlenary 25

Header

Body

UTF‐8Email

IETF76TechnicalPlenary 26

PunycodeEmail:xn‐‐chshr‐38d3be

IETF76TechnicalPlenary 27

Client From (Punycode) Body (Punycode) Gmail / IE xn--chshr-38d3be xn--chshr-38d3be

Gmail / Firefox 3 xn--chshr-38d3be xn--chshr-38d3be

Apple Mail xn--chshr-38d3be xn--chshr-38d3be

Penelope xn--chshr-38d3be xn--chshr-38d3be

Mulberry 4.08 xn--chshr-38d3be xn--chshr-38d3be

Thunderbird 2.0.0.16 xn--chshr-38d3be xn--chshr-38d3be

Eudora 6 on Mac OS X xn--chshr-38d3be xn--chshr-38d3be

Lotus Notes 7.03 & 8.01 xn--chshr-38d3be xn--chshr-38d3be

Outlook 2007 chεshίrε xn--chshr-38d3be

Outlook E-Mail (WM6) xn--chshr-38d3be xn--chshr-38d3be

Outlook Web Access / IE xn--chshr-38d3be xn--chshr-38d3be

UTF‐8Email:chεshίrε

IETF76TechnicalPlenary 28

Client From (UTF-8) Body (UTF-8) Gmail / IE xn--chshr-38d3be chεshίrε

Gmail / Firefox 3 xn--chshr-38d3be chεshίrε

Apple Mail chεshίrε chεshίrε

Penelope chεshίrε chεshίrε

Mulberry 4.08 chεshίrε chεshίrε

Thunderbird 2.0.0.16 chεshίrε chεshίrε

Eudora 6 on Mac OS X chεshίrε chεshίrε

Lotus Notes 7.03 & 8.01 chεshίrε chεshίrε

Outlook 2007 ch??sh??r?? chεshίrε

Outlook E-Mail (WM6) ch??sh??r?? chεshίrε

Outlook Web Access / IE ch??sh??r?? chεshίrε

MoreTerminologyinthisPresenta'on•  Mapping:conver'ngonestringtoanother“equivalent”string–  “CONTOSO.com”⇒“contoso.com”

•  Matching:checkingtwostringsforequivalence–  “CONTOSO.com”~“contoso.COM”–  “möhringen.de”≠“moehringen.de”

•  Sor:ng:determiningwhichstringcomesfirst–  “contoso.com”<“MicrosoE.com”

•  Encoding:samestringcanbeencodedindifferentways–  includingissuesofcombiningcharacters:évse+´

29IETF76TechnicalPlenary

IDNIden'fierSpace

•  IDNA‐validstring: noinvalidcharacters,legallength,etc.

•  U‐label: aUnicodeIDNA‐validstring

•  A‐label: “xn‐‐”followedby Punycode‐encodedIDNA‐validstring

IETF76TechnicalPlenary 30

MoreVarietyBringsMoreAmbiguity

•  ComputerSystems: 2 (binary)

•  TelephoneNumbers: 10 (0‐9)•  ASCIIDomainNames: 37 (A‐Z,0‐9,‐)

•  Interna'onalDomainNames: Tensofthousands

IETF76TechnicalPlenary 31

Matching

32IETF76TechnicalPlenary

ConfusableStrings(1/4)

•  Twostringsthatareeasilyconfusedbyahuman ETHIOPIA.com↔ΕΤΗΙΟΡΙΑ.corn

33

Greekalphabet! Plain“ASCll”confusion

Moreconfusion

IETF76TechnicalPlenary

•  Lower‐casingwillrevealtheETHIOPIAissue•  ετηιοριαisfairlydis'nc've–  currenttrendistodeprecateupper‐case

andothermapping‐requiredformsinIRIsetc.–  IDNA2008treatsthesecharactersasDISALLOWED

ConfusableStrings(2/4)

34

jessica.py↔јеѕѕіса.ру

Cyrillicalphabet

.ru=Russia

.py=Paraguay

IETF76TechnicalPlenary

•  Anotherexample:

•  “јеѕѕіса”actuallyusesCyrilliccharactersfromtwoseparatelanguages– Aregistrymayrestrictregistra'onstoonlycharactersintheirlanguage

•  Otherexamplesexistwithoutmixinglanguages•  epoxy.py↔ероху.ру

ConfusableStrings(3/4)

•  Peopleseewhattheyexpecttosee– Russianrestaurant:“ресторан”– Non‐Russiansmightread“pectopah”

•  Givensufficientlycrea'veuseoffontsforcedbystylesheetsetc.,confusioncanbeeasy

IETF76TechnicalPlenary 35

ItDependsonWhatYouKnow–andExpect

IETF76TechnicalPlenary 36

Isthesecondcharacter“A”?IfyouwerenotfamiliarwithLa'nscript,ordidn’tknowwhattoexpect,wouldyoubesure?Coulditbeastarofsomesort?AreyousurethatthefirstcharacterisanASCIIdot(theDNScares–alot)?

Arethesetwostringsiden'cal?Areyousure?Wouldyoubesureifyoudidn’tknowLa'nscriptortheorganiza'oninvolved?

ConfusableStrings(4/4)

•  Otherkindsof“equivalence”—equalityinsomecontexts

• 中国 and中國Twocodepoints,sameconcept

•  and السعوديةالسعوديةTwocodepointsforsameleber(moreorless)

IETF76TechnicalPlenary 37

AretheFollowingEquivalent?

IETF76TechnicalPlenary 38

Arabic‐Indic ۰۱۲۳٤٥٦۷۸۹EasternArabic‐Indic ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ChineseSuzhou 〡〢〣〤〥〦〧〨〩

European(ASCII) 0123456789

Devanagari(Hindi) ०१२३४५६७८९Tibetan ༠༡༢༣༤༥༦༧༨༩Tamil ௦௧௨௩௪௫௬௭௮௯

“Suspect”Names

•  Poten'alforphishingabacks–  butcouldbeinnocentoraccidental

•  Nameswithscriptsnotusedbytheuser’slocale

•  Nameswithmixedscripts(e.g.Cyrillic+La'n)

•  UImightwanttowarntheuserwhendisplayinganyofthesefromanuntrustedsource–  SomebrowsersdisplayA‐labels(“xn‐‐…”)inaddressbar,butthatconfuseshumans

39IETF76TechnicalPlenary

UniversalConfusableString

•  Fewusersystemshaveallpossiblecharactersanddisplayfontsfortheminstalled.

•  Ifcharactercannotberepresentedlocally,asix‐characterstringmightappearas–  ??????or– □□□□□□

•  Thisshouldbeawarning(butrememberwhatusersdowhentheyseeawarningtheydon’tunderstand)

IETF76TechnicalPlenary 40

Mapping

41IETF76TechnicalPlenary

WhyMapping?

•  Insteadofintelligentmatchingalgorithm:– Mapeachstringtoadefinedcanonicalform– Simpletestifcanonicalformsarebitwiseiden'cal– Doesnotpermit“closeenough”orotherfuzzymatching

•  Conversionofonevisualformtoanotherthatismorelocallyunderstandable– E.g.Tradi'onalChinese(中國)toSimplified(中国)

IETF76TechnicalPlenary 42

Mapping

•  Mappinginherentlylosesinforma'on–  Caseconversion,half/fullwidth,NFC/NFKC/etc

•  Upper/lowercasingdiffersbylanguage–  La'nalphabet: I⇔i–  Turkish: I⇔ı İ⇔i

•  tolower(‘I’)=???•  toupper(‘i’)=???•  tolower(toupper(‘ı’))≠‘ı’•  Turksaren’ttoohappyaboutthis…

43IETF76TechnicalPlenary

Mapping

•  Summary:– Neverrollyourownmapping

– Correctmappingforuserdependsonlanguagecontext,whichweoEendon’tknow

44IETF76TechnicalPlenary

EncodingsdraE‐iab‐idn‐encoding‐01.txt

45IETF76TechnicalPlenary

(Over)SimplifiedArchitecture

Host

Applica'on

DNSResolverLibrary

Internet

IDNASHIM

Problem#1:DNSisn’ttheonlynameresolu'onprotocol

Problem#2:ThepublicInternetnamespaceisn’ttheonlynamespace

Differentprotocolsusedifferentencodingstoday

Differentnamespacesusedifferentencodingstoday

Realis'cNetworkArchitecture

HostApplica'on

NameResolu'onLibrary

DNS mDNS NetBIOSoverTCPLLMNR hostsfile NIS …

Enterprisenetwork

LocalLAN

Internet

Implementa'on‐specificencoding

(e.g.,UTF‐8orUTF‐16)

OtherNameResolu'onProtocols

•  Manydefinedtousethesamesyntax–  Hostsfile,DNS,mDNS,NetBIOS‐over‐TCP,etc.

•  Nameresolu'onlibrarydecideswhatprotocolstotryinwhatorder–  Appscannottellfromthenamewhatprotocolswillbeusedforresolu'on

–  Differentlibrariesmayusedifferentorderandhencefinddifferentnametargets

•  Differentprotocolsspecifyuseofdifferentencodings–  Appscannottellwhatencodingswillbeneededforresolu'on

What’saLegalName?

•  In1985,RFC952definedtheformatofthehostsfile:– “Internethost/net/gateway/domainname”containsASCIIlebers,digits,hyphens(LDH)

•  In1989,RFC1035definedDNS:– “Preferrednamesyntax”:LDH– Butdoes“preferred”meanMAY/SHOULD/MUST?

49IETF76TechnicalPlenary

LegalDNSNames

•  In1997,RFC2181clarified:– Anybinarystringwhatevercanbeusedasthelabelofanyresourcerecord

– Anybinarystringcanserveasthevalueofanyrecordthatincludesadomainname

– Applica0onscanhaverestric'onsimposedonwhatpar'cularvaluesareacceptableintheirenvironment

•  Sameyear:–  IETFpolicyoncharactersetsandlanguages…

IETFPolicyonCharacterSetsandLanguages(RFC2277)

•  Itsays:–  ProtocolsMUSTbeabletousetheUTF‐8charset–  ProtocolsMAYspecify,inaddi'on,howtouseothercharsetsorothercharacterencodingschemes

–  UsingadefaultotherthanUTF‐8isacceptable•  SilentondifferentformswithinUTF‐8(e.g.case,encodingofcombiningcharacters,sortorder)–  TwoUnicodestringsoEencannotbecomparedtoyieldresultsusersexpectwithoutaddi'onalprocessing

•  PerRFC2181,DNScomplies

UseofDifferentEncodingsinDNS

•  Star'ngthatyear(1997),somesystemsbeganusingUTF‐8inDNSinprivatenamespaces– Privatenamespaceheremeansnamesarenotresolvablefromoutsidethespecificnetwork

•  Aboutfiveyearslater,IDNAdevelopment(includingPunycode)beganforuseinthepublicDNSnamespace

LengthIssues

•  DNSnameshave–  63octetsperlabel–  255octetspername(notcoun'ngzeroattheend)

•  Mostapplica'onAPIsuseNULL‐terminatedstrings•  Non‐ASCIIcharactersuseavariablenumberofoctetsinUTF‐8,UTF‐16andPunycode–  256UTF‐16octets≠256UTF‐8octets≠256A‐labeloctets

•  Somenamescanberepresented(withinlength)inPunycodeA‐labelsbutnotinUTF‐8

•  Somenamescanberepresented(withinlength)inUTF‐8butnotinPunycodeA‐labels

53IETF76TechnicalPlenary

Let’sRecapWhereWeAre…

•  Mul'pleencodingsofsameUnicodecharacters:–  U‐labels: مثال.إختبار–  A‐labels: xn‐‐mgbh0f.xn‐‐kgbechtv

•  Differentencodingsused:–  Bydifferentprotocols–  OndifferentnetworkswithDNS

•  PunycodeA‐labelsusedonInternet,UTF‐8inintranets–  Bydifferentapplica'ons

•  Results:–  Failure—orworse—launchingoneappfromanother–  Compe'torswitchingincen'ves,andpooruserexperiencewhenoneappworksandcompe'tordoesn’t

54IETF76TechnicalPlenary

Example“IDN‐aware”app:Browserpicksencodingbasedonintranetvs.Internet

xn‐‐mgbh0 

NameResolu'onAPI

intranet Internet

DNS

LLMNR/mDNS

example.com.إختبار(UTF‐8)

إختبار

إختبار(UTF‐16)

إختبار

xn‐‐mgbh0 .example.com

Problem1:Host/appBusesA‐labelsinintranet

Problem2:Localnameresolu'onusesUTF‐8

Intranetname? Internetname?

55

DNSxn‐‐mgbh0 .example.com

example.com.إختبار(UTF‐8)

IETF76TechnicalPlenary

A

B

IDNAwareApp

NameResolu'on

hbp://مثال.إختبار

NameResolu'on

xn‐‐mgbh0 .xn‐‐kgbechtv

Non‐IDNAwareApp

P U

(UTF‐16)مثال.إختبار

hbp://مثال.إختبار

InconsistentExperienceAcrossApplica'ons

(UTF‐8)مثال.إختبار

xn‐‐mgbh0 .xn‐‐kgbechtv Phishingabackspossible

56IETF76TechnicalPlenary

OtherIDN‐AwareAppsLackconsistency,causingnon‐determinis'cexperience

NameResolu'on

hbp://مثال.إختبار

NameResolu'on

xn‐‐mgbh0 .xn‐‐kgbechtv

P U

xn‐‐mgbh0 .xn‐‐kgbechtv

hbp://مثال.إختبار

 إختبار.مثال (UTF‐16) إختبار.مثال (UTF‐16)

UnreachableUnreachable

مثال.إختبار(UTF‐8)

xn‐‐mgbh0 .xn‐‐kgbechtv

مثال.إختبار(UTF‐8)xn‐‐mgbh0 .xn‐‐kgbechtv

57IETF76TechnicalPlenary

BasicPrinciple

•  ConversiontoA‐labels,UTF‐8,orwhateverotherencoding,canbedoneonlybyanen'tythatknowswhichprotocolandnamespacewillbeused

IETF76TechnicalPlenary 58

HardIssues1of2

•  Clienthastoguessorlearnwhatencodinga{HTTP,DNS,SMTP,…}serverexpectsforaniden'fier

•  Namesappearinsidemanyothertypesofiden'fiers,e.g.emailaddress,URLs,UNCpaths,networkaccessiden'fiers(NAIs)–  Eachiden'fiertypehasitsownencodingconven'ons–  Today,appsthatextracthostnamesneedtoconvertencodings

59IETF76TechnicalPlenary

HardIssues2of2

•  Useofasingleencodingistheeasypart–  Sufficientonlyiftheonlyintentistodisplay–  Comparison,matching,lookup,sor'ng,etc.,allrequiremorework.

–  JustasRFC952definedanASCIIsubsetfor“hostname”iden'fiers,weneedtodefineUnicodesubsetsforothertypesofiden'fiers.

•  Op'malsubsetforoneprotocolmaynotbeop'malforanother.

•  Interpreta'onanddisplayofsomestringsmaydifferbyopera'ngsystems–usuallyabug,butsome'mesnoagreementastowhichvaria'onisthebug.

IETF76TechnicalPlenary 60

Conclusions

•  Smoothandinteroperablefunc'oningoftheInternetdependsontextstringsbeinginterpretedinthesamewaybyallsystemsconnectedtoit

•  TheIETFhasrecognizedthissinceRFC20specifiedASCIIforuseininterchangein1969—thesugges'onsinthispresenta'onextendandupdatethatunderstandingaswellastheunderstandinginRFCs2277and5198

IETF76TechnicalPlenary 61

Conclusions

•  Toavoidconfusionandambiguity,itisnotenoughmerelytosupportUTF‐8asoneofthetextencodingop'ons–  TextinprotocolsonthewireshouldbeinUTF‐8andonlyinUTF‐8

•  Foruser‐visibletextinprotocols:–  Ifyoudon’tuseUTF‐8,whynot?

•  Forprotocoliden'fiersnotseenbyusers:–  IfyoudoallowthefullUnicodecharacterrange,why?

IETF76TechnicalPlenary 62