Intro-Internationalization

Introduction to Internationalization

Gwyneth Marshall

Microsoft's mission: To enable people and businesses throughout the world to realize their full potential.

People around the world have different customs

Waarom kan hulle nie net doodgewoon Afrikaans praat nie? Pourquoi, tout simplement, ne parlent-ils pas français? Miért nem beszélnek egyszerűen magyarul? Hvorfor kan de ikke bare snakke norsk? Dlaczego oni po prostu nie mówią po polsku? ¿Por qué no pueden simplemente hablar en Español? Neden Türkçe konuşamıyorlar? Tại sao họ không thể chỉ nói tiếng Việt? Pam dydyn nhw ddim yn siarad Cymraeg? Af hverju geta þeir ekki bara talað íslensku? Zergatik ezin dute Euzkeraz bakarrik hitzegin?

(Each sentence asks "Why can't they just speak my language?" in a different language.)

and use different languages and scripts to express themselves.

How information is presented is also relative to the user. Where am I? What time is it? How is time represented? What currency do I use? How are numbers and currency written? What calendar type do I use? How are dates represented? What measurement system do I use? Even how a list is presented can change based on the cultural conventions of the user.

The underlying data doesn't necessarily change (a float is a float), but the visual representation at runtime does.
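As a minimal sketch of "a float is a float", the snippet below formats one stored value under different separator conventions. The `CONVENTIONS` table and `format_number` helper are hypothetical names made up for illustration; a real application would take these values from the platform's locale data rather than hard-coding them.

```python
# Hypothetical convention table, for illustration only. Real software should
# read separators from the platform's locale data instead of hard-coding them.
CONVENTIONS = {
    "en-US": {"decimal": ".", "group": ","},
    "de-DE": {"decimal": ",", "group": "."},
}


def format_number(value: float, locale_tag: str, digits: int = 2) -> str:
    """Render the same float with the separators of the given convention."""
    conv = CONVENTIONS[locale_tag]
    text = f"{value:,.{digits}f}"          # e.g. "1,234,567.89"
    whole, _, fraction = text.partition(".")
    whole = whole.replace(",", conv["group"])
    return whole + (conv["decimal"] + fraction if fraction else "")


amount = 1234567.891  # the stored value never changes
for tag in CONVENTIONS:
    print(tag, format_number(amount, tag))
```

Only the presentation layer consults the convention; the value that is stored, compared, or transmitted stays the same.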

Internationalization

- Encodings and Standards
- Fonts and Language Support
- Locales, Cultures, and Markets
- Multilingual User Interface

In this section we will explore how these cultural differences are expressed in obvious and not-so-obvious ways that directly impact software development, such as Encodings and Standards; Fonts and Language Support; Locales, Cultures, and Markets; and Multilingual User Interface.

Encodings and Standards

Globalization involves developing a platform that easily supports the input, output, and display of any language. Fundamental to this platform is support for Unicode, a character encoding that supports most of the world's written languages and is the default character set of XML and HTML.

[Slide: computers use binary; people use words, sentences, and paragraphs. Example: 65 = A]

The fundamental problem is that computers speak binary, but humans use words, sentences, and paragraphs. There has to be a mechanism to allow computers to represent human words. The solution is to assign a number to each individual character or symbol. For example, 65 represents the capital letter A in the ASCII code page.
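The mapping between characters and numbers can be seen directly in a short sketch; the variable names are illustrative only.

```python
# A character is stored as a number; the number is stored as bits.
letter = "A"
code = ord(letter)                # 65
bits = format(code, "08b")        # "01000001"
print(letter, code, bits)

# Going the other way: the number 65 maps back to the letter A.
assert chr(code) == letter

# Encoding a whole word to ASCII yields one number (one byte) per character.
encoded = "Hello".encode("ascii")
print(list(encoded))              # [72, 101, 108, 108, 111]
```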

An encoding, or code page, is the process of putting a sequence of characters into a special format for transmission or storage purposes.

[Slide: the Windows-1252 code page (Western Europe) and the Windows-1256 code page (Arabic), each consisting of ASCII plus an extended range]

Historically, code pages were represented as either single byte or multi-byte.

In single byte code pages, each character is represented by a single byte. This works great for languages that do not need more than 256 characters.

However, different code pages may use a given number to represent different characters.
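To illustrate, the sketch below decodes the single byte 0xC8 under several Windows code pages using Python's built-in codecs. The choice of byte and the `decode_everywhere` helper are illustrative assumptions, not part of the original deck's tooling.

```python
# One stored byte, interpreted under different single-byte code pages.
raw = bytes([0xC8])

CODE_PAGES = ["cp1252", "cp1250", "cp1256", "cp1255", "cp1253", "cp1251", "cp874"]


def decode_everywhere(data: bytes) -> dict:
    """Decode the same bytes under each code page in CODE_PAGES."""
    return {cp: data.decode(cp) for cp in CODE_PAGES}


results = decode_everywhere(raw)
for cp, ch in results.items():
    # Print the Unicode code point that this code page assigns to byte 0xC8.
    print(f"{cp}: U+{ord(ch):04X}")
```

The byte never changes; only the assumed code page does, and with it the character the reader sees.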

[Slide: the Windows-932 code page (Japanese): 8A + 4C = one double-byte character]

But some languages have thousands of characters. A solution commonly used on PCs is to encode most characters (primarily ideographs) with 2-byte values, thus making room for far more than 256 characters. The key phrase in the previous sentence is "most characters": characters such as those in the ASCII set and the Japanese phonetic syllabary known as katakana have single-byte representations. The result is a code page that mixes single-byte and double-byte characters.

To increase the repertoire, additional characters are composed of a lead byte, such as 8A in this example, followed by a trail byte. The combination of 8A and 4C results in a single character.
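The lead-byte and trail-byte mechanism can be sketched with Python's `cp932` codec, which implements the Windows-932 code page. The byte values 8A and 4C come from the slide; the other sample bytes are chosen here for illustration.

```python
# Lead byte 0x8A followed by trail byte 0x4C decodes to ONE character.
pair = bytes([0x8A, 0x4C])
kanji = pair.decode("cp932")
print(len(pair), "bytes ->", len(kanji), "character")

# ASCII and half-width katakana keep single-byte representations.
ascii_a = bytes([0x41]).decode("cp932")      # "A"
katakana = bytes([0xB1]).decode("cp932")     # half-width katakana A

# A string in this code page can therefore mix 1-byte and 2-byte characters.
mixed = bytes([0x41, 0xB1]) + pair
text = mixed.decode("cp932")
print(len(mixed), "bytes ->", len(text), "characters")
```

Byte count and character count diverge as soon as a double-byte character appears, which is why code that indexes such strings byte by byte can split a character in half.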

[Slide: the same code page value maps to a different character in each code page, while each character is represented by a unique code point in Unicode]

Code page        Code page value   Unicode code point
Central Europe   00C8              U+010C
Arabic           00C8              U+0628
Hebrew           00C8              U+05B8
Greek            00C8              U+0398
Cyrillic         00C8              U+0418
Thai             00C8              U+0E28

If each code page uses the same number to represent different characters, how do users exchange information created with different code pages? To solve this problem a group of companies, including Microsoft, developed the Unicode standard. The standard provides an unambiguous method to identify any defined character.

Now instead of a code point potentially representing different characters, each character is uniquely defined.

In applications that support Unicode, users can type multilingual text and not worry that the text will change based on the code page the application is using. Further, new code pages are no longer being created, so many emerging markets are only represented with the Unicode standard.

http://.cn

So, a Chinese URL is translated by the system into an IP address that takes the user to a site.
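One step in that translation can be sketched as follows, assuming the system follows the IDNA convention of converting a non-ASCII host name to an ASCII form before the DNS lookup. The host name `例子.cn` is a hypothetical placeholder, and no network access is performed.

```python
# Hypothetical host name used only for illustration.
hostname = "例子.cn"

# Python's built-in "idna" codec converts each non-ASCII label to an
# ASCII-compatible form that begins with "xn--".
ascii_form = hostname.encode("idna")
print(ascii_form)

# The conversion is reversible, so the user can still be shown the original.
assert ascii_form.decode("idna") == hostname
```

The ASCII form is what would then be resolved to an IP address; the user only ever types and sees the Unicode form.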

Try it yourself with http://.cn!

What can I do?

- Use Unicode
- Ensure you are using the same version of Unicode consistently, particularly between client applications and the internet
- If you have to convert encodings, ensure data isn't lost in the round trip
- Test with problematic characters
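The round-trip advice above can be checked mechanically. This is a minimal sketch: `survives_round_trip` is a hypothetical helper, and the sample strings are merely examples of characters that tend to expose problems.

```python
def survives_round_trip(text: str, encoding: str) -> bool:
    """Return True if text can be encoded and decoded back without loss."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The target encoding has no representation for some character.
        return False


# A few "problematic" samples: accents, non-Latin scripts, Turkic i,
# and a character outside the Basic Multilingual Plane.
samples = ["naïve", "Ελληνικά", "日本語", "ı İ", "\U00010302"]

for encoding in ("utf-8", "utf-16", "cp1252", "cp932"):
    failures = [s for s in samples if not survives_round_trip(s, encoding)]
    print(encoding, "loses:", failures)
```

The Unicode encodings preserve every sample, while each legacy code page loses whatever falls outside its repertoire.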

Fonts and Language Support

Without fonts, the UI text cannot be properly displayed to the end-user.

But what is a font? As you learned, computers interpret a character as a single code point. The actual shape of a character, or glyph, may vary, but it is still recognizable as the character "a" to a human. A font is a collection of these glyphs that render code points in a human-readable way. Characters are mapped to different code pages and glyphs are mapped to different fonts.

And different languages use different scripts, which use different fonts.

[Slide: text typed on Computer 1 uses a font that does not exist on Computer 2. Computer 2 is searched for a font with similar characteristics to the original font that also supports the script. This is called font substitution.]

One of the biggest challenges in designing world-ready software is to ensure that all the characters in the UI, in content, or that the end-user may input are correctly displayed. When editing a multilingual document, the user should not be expected to select a different font for each one of the scripts he or she wants to view.

Fortunately, various coding environments can help address this problem with two mechanisms: font substitution, and font fallback and linking.

Consider the scenario where a user enters some text on Computer 1 and then displays the text on a different computer that does not have a font that supports Arabic on that machine. Most fonts only support a couple of scripts at the most (however, more than one language may use a script). If a glyph doesn't exist in a font, a default glyph (usually a box) is displayed.

The rendering engine will search for a similar font that supports those characters. This is called font substitution.

[Slide: Font A does not have glyphs for that script, so a suitable font for that script or glyph is obtained. This is font fallback or font linking, and it is transparent to the user.]

A similar technology provided by many environments is font fallback and font linking. The system can detect if the currently selected font doesn't support a particular script and can automatically switch, or fall back, to a predefined font that has appropriate glyphs for the desired script. The selected font is internally replaced with a predefined font. All these operations are transparent to the user, but applications can override this choice at any time by simply making sure the initial font contains the desired glyphs. If fallback does not work, then many environments will try font linking, which performs a similar function but at the level of an individual character.
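The per-character idea behind fallback and linking can be sketched as a lookup through a prioritized font list. Everything here is hypothetical: the font names, their coverage ranges, and the `pick_font` helper are invented for illustration and do not describe how any particular rendering engine is implemented.

```python
# Hypothetical fonts and the code point ranges each one covers.
FONTS = {
    "LatinSans": [(0x0020, 0x024F)],
    "ArabicSans": [(0x0600, 0x06FF)],
    "ThaiSans": [(0x0E00, 0x0E7F)],
}
FALLBACK_CHAIN = ["LatinSans", "ArabicSans", "ThaiSans"]


def supports(font: str, ch: str) -> bool:
    """True if the font has a glyph for this character (per the table above)."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in FONTS[font])


def pick_font(ch: str, preferred: str):
    """Try the preferred font first, then each fallback font in order."""
    candidates = [preferred] + [f for f in FALLBACK_CHAIN if f != preferred]
    for font in candidates:
        if supports(font, ch):
            return font
    return None  # no font covers it: the default glyph (a box) would be shown


def assign_fonts(text: str, preferred: str):
    """Pair every character with the font that would render it."""
    return [(ch, pick_font(ch, preferred)) for ch in text]


print(assign_fonts("Aب", "LatinSans"))
```

The user selected only one font, yet each character ends up rendered by a font that actually contains its glyph.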

There are different mechanisms used in various APIs, controls, and frameworks. For example, in Web content, a prioritized font list in HTML/CSS is used first, and then the browser (such as Internet Explorer) may have its own font logic. In Win32, the RichEdit control has its own font mechanisms with APIs for the client to tailor the font. Other rendering platforms support different scripts, so plan your multilingual requirements before implementation using these frameworks.

But my text isn't rendering

- Encoding issue: ANSI to ANSI
- MBCS data mapped to SBCS character set
- Wrong code page set

If your text is still not rendering correctly, it might be one of these issues.

The first is that you are trying to display text encoded in one code page while assuming some other code page.
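This mismatch is easy to reproduce. In the sketch below, the sample word and the choice of code pages are illustrative assumptions: text saved under the Cyrillic code page is read back as if it were Western European.

```python
original = "Привет"                       # Russian for "hello"

# The text is stored using the Cyrillic code page...
stored = original.encode("cp1251")

# ...but displayed assuming the Western European code page.
wrong = stored.decode("cp1252")
print(wrong)                              # unreadable accented Latin letters

# Decoding with the code page that was actually used recovers the text.
right = stored.decode("cp1251")
print(right)
```

No bytes were damaged; the data is intact and only the assumption about its code page is wrong.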

Convert the string and the display mechanism, which could be either the control or a webpage, to use Unicode.

But my text isn't rendering

- Typically an encoding issue: Unicode to ANSI
- Characters are not in the target code page
- Characters have been replaced with ?
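The question-mark symptom listed above can be reproduced by forcing Unicode text into a code page that lacks the characters. This is a sketch; the Japanese sample string and the `missing_characters` helper are illustrative.

```python
text = "こんにちは"  # Japanese sample text

# The Turkish code page (cp1254) has no Japanese characters. With
# errors="replace", each unrepresentable character silently becomes "?".
lossy = text.encode("cp1254", errors="replace")
print(lossy)  # b'?????'


def missing_characters(value: str, encoding: str) -> list:
    """List the characters that the target code page cannot represent."""
    missing = []
    for ch in value:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


print(missing_characters("Merhaba こんにちは", "cp1254"))
```

Once the replacement has happened the original characters cannot be recovered, which is why the conversion should be avoided or at least checked beforehand.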

A second typical problem is that you have a Unicode string, but the control is not Unicode and you are trying to represent characters that aren't in that code page. For example, your string has Japanese characters but the control is set to use a Turkish code page.

What can I do?

- Do not hard code font face names
- Do not assume a given font is installed
- Do not assume the selected font supports the desired script
- Do not assume one (point) size fits all scripts
- Do not assume font decoration is appropriate for all scripts (e.g., bold and italic often make non-Latin scripts hard to read)
- Do not assume that all scripts use vertical space within a line the same way that English does
- Do not place text formatting values into in-line style; use CSS
- Do not use GDI stock object raster fonts
- GDI+, WPF, and Silverlight do not support all of the scripts that the GDI and DirectWrite text stacks support, so plan your multilingual requirements before implementing with these frameworks

For detailed information on what to do, consult your World-Ready team, as requirements vary by division.

Other linguistic, cultural, and stylistic elements of written language can impact how software needs to process text.

Languages written with the scripts of Chinese origin may be written left-to-right and horizontally (as English is) or right-to-left and vertically. Or even both ways in the same document!

Further difficulties may be encountered with the so-called complex scripts. Complex scripts require special processing to display and edit because the characters are not laid out in a simple linear progression from left to right, as most European characters are.

Languages written with the Hebrew and Arabic scripts are called bidirectional because most text is written right-to-left but numbers are written left-to-right.

This can impact the text direction, cursor movement, and contextual shaping.

[Slide: logical order versus display order]

As the letters are typed by the user into the document, you will notice that the previous character changes shape. This is because Arabic letters take different forms depending on position: one form at the beginning of a word, another in the middle, and another at the end, in addition to an isolated form when the letter stands alone.
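Both properties can be inspected with Python's standard `unicodedata` module. This is a sketch: the sample characters are chosen here for illustration, and the presentation-form code points serve only to show that one logical letter has several shapes; normal text stores the single letter and lets the rendering engine choose the shape.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin letter, Hebrew alef, Arabic beh, European digit.
sample = "A\u05D0\u06281"
print(bidi_classes(sample))   # ['L', 'R', 'AL', 'EN']

# The Arabic letter beh (U+0628) has four positional shapes. Unicode lists
# them as compatibility characters that all normalize to the same letter.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}
for position, form in BEH_FORMS.items():
    same_letter = unicodedata.normalize("NFKC", form) == "\u0628"
    print(position, unicodedata.name(form), same_letter)
```

The directional classes are what a layout engine uses to turn logical (typing) order into display order.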

Bidirectional text also impacts the orientation of the UI, and the meaning of some images may need to change.

[Slide: "There are no spaces between Thai words" shown as "TherearenospacesbetweenThaiwords"]

Other languages, such as Thai, need special text processing due to the customs used when writing text. For example, Thai, Khmer, Lao, and others do not use spaces to separate words. Special rules, called word breaking, are needed to determine a word.
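To make the idea of word breaking concrete, here is a toy longest-match segmenter applied to the slide's English illustration. The dictionary and the `break_words` helper are hypothetical; production word breakers rely on much larger dictionaries and language-specific rules supplied by the platform.

```python
# Toy dictionary covering only the slide's example sentence.
DICTIONARY = {"there", "are", "no", "spaces", "between", "thai", "words"}


def break_words(text: str, dictionary: set) -> list:
    """Greedy longest-match segmentation of text that contains no spaces."""
    longest = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            # Unknown character: emit it on its own and move on.
            words.append(text[i])
            i += 1
    return words


run_together = "therearenospacesbetweenthaiwords"
print(run_together.split())                 # one "word": splitting on spaces fails
print(break_words(run_together, DICTIONARY))
```

Splitting on whitespace returns the whole sentence as a single token, which is exactly the assumption that breaks for Thai, Khmer, and Lao.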

In the slide's example, each color represents a separate word. This means that when you are parsing text, you cannot assume a space indicates a new word, nor can you assume where a line break can occur.

In some languages, such as Thai, the base letters represent consonants, with the vowels indicated by special marks placed near the consonant.

In this example, you will notice how, as the user adds modifiers to the base character, the placement of the accents changes. First the user types the base character, then modifier one, then modifier two; when the user types modifier three, the accent is placed above accent two.

[Slide: correct and incorrect caret positions]

The implication for applications is that you shouldn't assume that every character boundary is a valid caret position.

We can see similar behavior in Hindi. A caret should not be placed within an Indic cluster.
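A rough sketch of the idea follows, using only the Python standard library. It treats every combining mark as part of the preceding base character. This is a deliberate simplification and an assumption of this example: it does not implement the full Unicode text segmentation rules (for instance, it does not join Indic conjuncts formed with a virama), so real applications should rely on platform APIs or controls.

```python
import unicodedata


def caret_positions(text: str) -> list:
    """Return code point offsets where a caret may sit (simplified rule).

    A position is allowed before any character that is not a combining
    mark (Unicode general categories Mn, Mc, Me), plus the end of the text.
    """
    if not text:
        return [0]
    positions = [0]
    for index, ch in enumerate(text):
        if index > 0 and not unicodedata.category(ch).startswith("M"):
            positions.append(index)
    positions.append(len(text))
    return positions


# Thai consonant + vowel mark + tone mark, followed by a Latin letter.
thai = "\u0E01\u0E34\u0E49a"
print(caret_positions(thai))    # [0, 3, 4]

# Devanagari consonant + dependent vowel sign.
hindi = "\u0915\u093F"
print(caret_positions(hindi))   # [0, 2]
```

Three stored code points can form one user-perceived character, so the caret has far fewer legal stops than the string has indexes.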

Nor should a caret be placed within a character which spans multiple code units, such as U+10302. Make use of platform APIs (for example, ScriptBreak) that report valid caret positions, or allow a control that has this built in to do it for you.
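The code unit count for U+10302 can be checked directly. In this sketch, `utf16_caret_offsets` is a hypothetical helper that reports safe offsets measured in UTF-16 code units, the unit used by many Windows text APIs.

```python
ch = "\U00010302"   # a character outside the Basic Multilingual Plane

# In UTF-16 the character occupies two 16-bit code units (a surrogate pair).
utf16 = ch.encode("utf-16-be")
units = [utf16[i:i + 2].hex() for i in range(0, len(utf16), 2)]
print(units)        # ['d800', 'df02']

# In UTF-8 the same character occupies four bytes.
print(len(ch.encode("utf-8")))


def utf16_caret_offsets(text: str) -> list:
    """Offsets, in UTF-16 code units, that do not split a surrogate pair."""
    offsets = [0]
    position = 0
    for character in text:
        position += 2 if ord(character) > 0xFFFF else 1
        offsets.append(position)
    return offsets


print(utf16_caret_offsets("a\U00010302b"))   # [0, 1, 3, 4]
```

Offset 2 is missing from the result because it falls between the two halves of the surrogate pair.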

All human beings are born free and equal in dignity and rights.

Article 1 of the Universal Declaration of Human Rights

Just as you can't assume all characters have the same width, you can't assume the text height follows the same conventions as the Latin alphabet. Consider the same text in English and in Tibetan. Even though both are set to the same point size, you can easily see that their heights are not the same.

What can I do?

- Draw complete strings
- Use text-measuring APIs
- A character boundary is not necessarily a valid caret position
- Use existing APIs to do the bulk of the work for you

Ligatures, vertical kerning, diacritics, tone marks, contextual shaping, character reordering: all of these are transparent if you use the appropriate APIs or controls.

Locales, Cultures, and Markets

Previously we mentioned that data presentation may vary around the world. The underlying data doesn't necessarily change (a float is a float), but the visual representation at runtime does.

Locales, Cultures, and Markets are some of the commonly used terms at Microsoft to capture the concept of the pairing of a language and a geographic area.

Some examples are en-AU for English as spoken in Australia, az-Latn-AZ for Azerbaijani written with the Latin script as spoken in Azerbaijan, and finally, you may see examples such as es-419 for Spanish in Latin America.
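The structure of these tags can be sketched with a small parser. `parse_tag` is a hypothetical helper that handles only the language, script, and region subtags seen in the examples above; it ignores variants, extensions, and private-use subtags, which a complete implementation would need to handle.

```python
def parse_tag(tag: str) -> dict:
    """Split a language tag into language, optional script, optional region."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None}
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and result["script"] is None and result["region"] is None:
            result["script"] = part.title()     # scripts are written like "Latn"
        elif is_region and result["region"] is None:
            result["region"] = part.upper()     # regions like "AZ" or "419"
        else:
            break  # anything else is out of scope for this sketch
    return result


for example in ("en-AU", "az-Latn-AZ", "es-419"):
    print(example, parse_tag(example))
```

The region may be a two-letter country code or, as in es-419, a three-digit code for a larger geographic area.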

The end goal is to match the customer's expectation of how locale-sensitive data should be presented.

What are some common ways to determine locale?

The two most common methods to influence the data presentation are the language settings of the OS or the browser language settings (depending on the browser, once you set the OS the browser will automatically use the same settings).

For some OSes and some locales, you can further customize the settings for a given locale. For example, you can select a non-default sort order for Hungarian.

Here we see the Language Options in Internet Explorer. With IE10, this takes us to the Language Preferences in IE. But other browsers have their own settings. Here we see Firefox's setting. Notice that all the preferences allow the user to set their own language/locale ordering.

When code is written to use these settings, the data that is locale or culture sensitive will reflect the user's preferences.
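On the web, the browser's ordered language list typically reaches the server in the HTTP `Accept-Language` request header. That header is not named in the slides, so treat it as an assumption of this sketch, along with the hypothetical helpers `parse_accept_language` and `choose_language`.

```python
def parse_accept_language(header: str) -> list:
    """Return the language tags from the header, most preferred first."""
    preferences = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0                       # default weight when q is absent
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        preferences.append((tag.strip(), quality, position))
    # Highest weight first; ties keep the order in which the user listed them.
    preferences.sort(key=lambda p: (-p[1], p[2]))
    return [tag for tag, quality, _ in preferences if quality > 0]


def choose_language(header: str, supported: list, default: str = "en") -> str:
    """Pick the first user preference that the application supports."""
    available = {name.lower(): name for name in supported}
    for tag in parse_accept_language(header):
        candidate = tag.lower()
        while candidate:
            if candidate in available:
                return available[candidate]
            # Fall back from "hu-HU" to "hu" by dropping the last subtag.
            candidate = candidate.rpartition("-")[0]
    return default


print(parse_accept_language("da, en-GB;q=0.8, en;q=0.7"))
print(choose_language("hu-HU, en;q=0.5", ["en", "hu"]))
```

Honoring the user's own ordering, rather than guessing from location, is what lets the locale-sensitive data reflect their preferences.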

However, there are other locales out there, and as you design, write, and develop features you need to think about which culture is the correct one to use. What is appropriate in one situation is not appropriate in another. And most of all, the UI language does not necessarily match the user's locale.

Input: indicates desired input language and method (keyboard, soft input panel, IME, on-screen keyboard, handwriting, or speech-to-text)

What are some of these other locales? Locale can influence how users input text, by allowing them to select their language and input method (such as a keyboard, an IME, handwriting, or speech-to-text).

Most applications can safely choose not to handle input locales and let the operating system handle that complex operation by using standard edit controls. Web pages generally have nothing to worry about and can rely on their HTML rendering engine for that matter.

Some applications need to care. They care if they are using some other control for user document text; for example applications with their own text engines because they want to save the Language tagging information along with the text. They want to do that so they can use the correct spelling and other proofing tools for that chunk of text. So if your application does spell checking or has its own text engine, you care about input locale.

Custom locales (add-on applications; not currently available for Apple products)

Part of an extensible model for globalization support that allows customers to create a personalized user experience

Users can create their own custom locales based on an existing locale. For example, the United States Navy uses this tool to create their own locale, en-US-x-navy. It is based on the English (United States) locale but uses a 24-hour clock instead of the 12-hour clock.

One of the most common elements in software that is affected by locale is text.

[Slide: the strings apple, æble, and zebra]

Alphabetical order and conventions for sorting items vary from culture to culture. For example, sort order can be case-sensitive or case-insensitive. It can be phonetically based or influenced by character groupings... In East Asian languages, sorts are ordered by the stroke and radical of ideographs. Sorts can also vary depending on the fundamental order the language and culture uses for the alphabet.

As an example, the three strings apple, æble, and zebra are sorted differently depending on whether you sort by the order they appear in the Unicode standard, or use the English linguistic sorting rules (æble, apple, zebra), or apply Danish sorting rules, where the character æ comes after z in the alphabet (apple, zebra, æble).
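The three orderings can be imitated with hand-written sort keys, assuming the three strings are apple, æble, and zebra. The key functions below are simplified stand-ins invented for this sketch; real applications should use the platform's collation APIs rather than rules like these.

```python
WORDS = ["apple", "æble", "zebra"]

# 1. Code point order: what a plain, non-linguistic sort produces.
by_code_point = sorted(WORDS)


def english_key(word: str) -> str:
    """Simplified English rule: treat the letter ae-ligature like 'ae'."""
    return word.lower().replace("æ", "ae")


# Simplified Danish alphabet: three extra letters follow z.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"


def danish_key(word: str) -> list:
    """Rank each letter by its position in the Danish alphabet."""
    return [
        DANISH_ALPHABET.index(ch) if ch in DANISH_ALPHABET else len(DANISH_ALPHABET)
        for ch in word.lower()
    ]


by_english = sorted(WORDS, key=english_key)
by_danish = sorted(WORDS, key=danish_key)

print("code point:", by_code_point)
print("English:   ", by_english)
print("Danish:    ", by_danish)
```

The same list yields a different first item depending on the rules applied, so the sort must be chosen per culture rather than fixed in code.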

A world-ready application must be able to compare and sort data on a per-culture basis in order to support culture-specific and language-specific sorting conventions.

[Slide: the Turkish alphabet, highlighting the dotted and dotless forms of the letter I]

Another important scenario is casing. For most internal processing of strings, the case of the character doesn't matter. But anytime the string is displayed to the user, or where the string would interact with settings on the user's machine, such as setup or the file store, then casing becomes very important.

One of the most infamous examples of this problem is the letter i found in the Turkic languages. The Turkic alphabet, which is a variant of the Latin alphabet, includes two distinct versions of the letter I, one dotted and the other dotless.

The dotless lowercase ı capitalizes to a dotless uppercase I, and the dotted lowercase i to a dotted uppercase İ.
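Python's built-in `upper()` and `lower()` use culture-independent rules, so the Turkic behavior has to be written out to compare the two. `turkic_upper` and `turkic_lower` are hypothetical helpers covering only the two i/I pairs described above.

```python
DOTTED_UPPER = "\u0130"    # uppercase I with dot above
DOTLESS_LOWER = "\u0131"   # lowercase dotless i


def turkic_upper(text: str) -> str:
    """Uppercase using the Turkic pairing: i -> dotted I, dotless i -> I."""
    return text.replace("i", DOTTED_UPPER).replace(DOTLESS_LOWER, "I").upper()


def turkic_lower(text: str) -> str:
    """Lowercase using the Turkic pairing: dotted I -> i, I -> dotless i."""
    return text.replace(DOTTED_UPPER, "i").replace("I", DOTLESS_LOWER).lower()


# Culture-independent casing pairs i with I.
print("title".upper(), "FILE".lower())

# Turkic casing pairs them differently.
print(turkic_upper("title"), turkic_lower("FILE"))

# A case-insensitive match that holds under one rule set fails under the other.
print("FILE".lower() == "file", turkic_lower("FILE") == "file")
```

A comparison such as `name.lower() == "file"` therefore gives different answers depending on which casing rules are in effect.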

Why does this matter? One of the most common scenarios where casing becomes a problem is in searching.

Her şahsın öğrenim hakkı vardır. Öğrenim hiç olmazsa ilk ve temel safhalarında parasızdır. İlköğretim mecburidir. Teknik ve mesleki öğretimden herkes istifade edebilmelidir. Yüksek öğretim, liyakatlerine göre herkese tam eşitlikle açık olmalıdır. Öğretim insan şahsiyetinin tam gelişmesini ve insan haklarıyla ana hürriyetlerine saygının kuvvetlenmesini hedef almalıdır. Öğretim bütün milletler, ırk ve din grupları arasında anlayış, hoşgörü ve dostluğu teşvik etmeli ve Birleşmiş Milletlerin barışın idamesi yolundaki çalışmalarını geliştirmelidir. Ana baba, çocuklarına verilecek eğitim türünü seçmek hakkını öncelikle haizdirler.

Article 26 of the Universal Declaration of Human Rights

With English assumptions of casing, let's search for the letter i. The following results are returned. However, when we apply Turkic casing rules, an additional instance is found.

[Slide: search results under English and Turkish casing rules]

However, not every string needs to be treated in a linguistically sensitive way. For example, in the Class Designer in Visual Studio you may need to list all the classes in a project or solution to allow users to select one. The user can insert the four C# classes that only differ by the last character, as they are different items for the compilers, but may be considered linguistically equivalent. In the UI, the list should then be sorted according to the end-user culture.

In contrast, because Visual Basic is case insensitive, Methodi and MethodI are not both allowed, so checking whether a method name is already in the list has to be done with the StringComparer which is appropriate for the programming language, but the display has to be done with a comparison that reflects the current culture.

What can I do?

- Respect the user's preference for input and output
- Consider how you interact with other products
- Understand when not to use locale APIs
- Use existing APIs to do the bulk of the work for you
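"Understand when not to use locale APIs" can be sketched for the method-name check described above. `name_taken` is a hypothetical helper that uses Python's culture-independent `casefold()` as a stand-in for a language-appropriate comparer. `turkic_lower` is likewise hypothetical and shows what would go wrong if the user's culture were applied to identifiers.

```python
def name_taken(existing: list, candidate: str) -> bool:
    """Culture-independent, case-insensitive check for identifier clashes."""
    folded = {name.casefold() for name in existing}
    return candidate.casefold() in folded


def turkic_lower(text: str) -> str:
    """Lowercase with Turkic i/I pairing (illustration of a culture rule)."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


methods = ["Methodi", "Save", "Load"]

# The language's rule: these two names clash, regardless of user culture.
print(name_taken(methods, "MethodI"))                           # True

# A Turkic culture comparison would wrongly treat them as different names.
print(turkic_lower("MethodI") == turkic_lower("Methodi"))       # False
```

The identifier check stays fixed for every user, while only the order in which the names are displayed should follow the current culture.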

Cultural appropriateness

Consider the power of words: We the People..., Foreigner, Comrade, Om, Ehe, Marriage, Jerusalem, Love, God.

- Words can evoke strong emotion
- They can offend or anger
- They can be labels for deep and ingrained linguistic, historical and cultural meaning
- Meanings may vary for different people

Words, like images, can evoke strong emotions. What do you feel when you see these words: we the people, cancer, abortion, Om, marriage, love? Words can offend or anger, or delight. Is one response correct? Just like images, text evokes different emotions for different people. And all the responses are valid.

Metaphors

Postboxes from Oman, Japan, the US, Spain, and Sweden.

What to take away

- Encodings take human-readable symbols and allow the computer to process them as 0s and 1s.
- Unicode makes it easier to exchange information written in different scripts.
- Fonts render encoded text as human-readable symbols on the computer display.
- The system can help with many font issues by finding the best font for the character.
- Some languages and scripts require special processing.
- Locales, cultures, and markets have the goal of matching the customer's expectation of how locale-sensitive data should be presented. Carefully consider which locale is appropriate for the desired design.
- Locale-sensitive data can include more than just date formats.
- Sometimes it isn't necessary to use a locale when processing text. Consider the ultimate purpose of the design.
- Other cultures may interpret your words and images differently.

Questions?

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

2/4/2015 12:27 PM