
Comprehensive Exam - IT, by Amit Chandra



Binary-coded decimal

In computing and electronic systems, binary-coded decimal (BCD) is an encoding for decimal numbers in which each digit is represented by its own binary sequence. Its main virtue is that it allows easy conversion to decimal digits for printing or display and faster decimal calculations. Its drawbacks are the increased complexity of circuits needed to implement mathematical operations and a relatively inefficient encoding: it occupies more space than a pure binary representation.

To BCD-encode a decimal number using the common encoding, each decimal digit is stored in a four-bit nibble.

Decimal: 0    1    2    3    4    5    6    7    8    9
BCD:     0000 0001 0010 0011 0100 0101 0110 0111 1000 1001
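To illustrate, a minimal Python sketch (the function name to_bcd is my own) that packs each decimal digit of a number into a four-bit nibble:

def to_bcd(n: int) -> str:
    # Encode each decimal digit of a non-negative integer as a 4-bit nibble.
    return " ".join(format(int(digit), "04b") for digit in str(n))

print(to_bcd(1942))   # 0001 1001 0100 0010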

Extended Binary Coded Decimal Interchange Code

Extended Binary Coded Decimal Interchange Code (EBCDIC) is an 8-bit character encoding (code page) used on IBM mainframe operating systems such as z/OS, OS/390, VM and VSE, as well as IBM midrange computer operating systems such as OS/400 and i5/OS (see also Binary Coded Decimal). It is also employed on various non-IBM platforms such as Fujitsu-Siemens' BS2000/OSD, HP MPE/iX, and Unisys MCP. It descended from punched cards and the corresponding six-bit binary-coded decimal code that most of IBM's computer peripherals of the late 1950s and early 1960s used.

ASCII

American Standard Code for Information Interchange (ASCII), pronounced /ˈæski/,[1] is a character encoding based on the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that work with text. Most modern character encodings, which support many more characters than the original did, have a historical basis in ASCII.

Historically, ASCII developed from telegraphic codes, and its first commercial use was as a seven-bit teleprinter code promoted by Bell data services. Work on ASCII formally began on October 6, 1960, with the first meeting of the ASA X3.2 subcommittee. The first edition of the standard was published in 1963,[2][3] a major revision in 1967,[4] and the most recent update in 1986.[5] Compared to earlier telegraph codes, the proposed Bell code and ASCII were both ordered for more convenient sorting (i.e., alphabetization) of lists, and added features for devices other than teleprinters. Some ASCII features, including the "ESCape sequence",[6] were due to Robert Bemer.

ASCII includes definitions for 128 characters: 33 are non-printing, mostly obsolete control characters that affect how text is processed; 94 are printable characters; and the space is considered an invisible graphic.[7] The ASCII character encoding[8] (or a compatible extension) is used on nearly all common computers, especially personal computers and workstations.

The Operation of Combinational Logic Systems

We have looked extensively at the combinations of logic gates, and how we can make circuits with a single gate as a unit. What use is this, other than an academic exercise? Logic gates are used extensively in calculators and computers. Logic gates can be used to add binary numbers. Computers are adding machines; they do subtraction by a process of complementary addition, while they multiply by repeated addition.


The circuits they use are based on the half-adder, which implements the rules for binary addition:

0 + 0 = 0

0 + 1 = 1

1 + 0 = 1

1 + 1 = 0 carry 1

(1 + 1 + 1 = 1 carry 1)

  The circuit has two outputs, a sum and a carry. The sum is the output of an exclusive OR gate (we can’t have 1 + 1 = 1), while the carry output is that of an AND gate. The Boolean algebra is:

sum = A ⊕ B                             carry = A.B

This gives the half-adder arrangement shown below: an exclusive OR gate produces the sum, and an AND gate produces the carry. [Figure: half-adder circuit diagram]
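In software the same behaviour can be sketched directly from the Boolean expressions; a minimal Python illustration (the function names are my own, not from the original text):

def half_adder(a: int, b: int) -> tuple[int, int]:
    # Sum is the exclusive OR of the inputs; carry is the AND.
    return a ^ b, a & b

def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    # Two half-adders plus an OR gate combine three input bits.
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 | c2

print(half_adder(1, 1))     # (0, 1): 1 + 1 = 0 carry 1
print(full_adder(1, 1, 1))  # (1, 1): 1 + 1 + 1 = 1 carry 1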



Duality Principle

• Duality principle – every Boolean identity remains valid when the operators and identity elements are interchanged:

+ ↔ .     1 ↔ 0

• Example: given the expression a+(b.c) = (a+b).(a+c),

the dual expression is a.(b+c) = (a.b)+(a.c)


• The duality principle gives a free theorem – "buy one, get one free". You only need to prove one theorem, and you get its dual for free.

• If (x+y+z)' = x'.y'.z' is proved, then its dual (x.y.z)' = x'+y'+z' holds as well.

• If x+1 = 1 is proved, then its dual x.0 = 0 holds as well.
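A brute-force check in Python (my own illustration) confirms an identity and its dual by testing every assignment of 0 and 1 to the variables:

from itertools import product

def NOT(v: int) -> int:
    # Complement of a single bit.
    return 1 - v

# De Morgan's theorem and its dual, checked over every 0/1 assignment.
for x, y, z in product((0, 1), repeat=3):
    assert NOT(x | y | z) == NOT(x) & NOT(y) & NOT(z)   # (x+y+z)' = x'.y'.z'
    assert NOT(x & y & z) == NOT(x) | NOT(y) | NOT(z)   # dual: (x.y.z)' = x'+y'+z'
print("both identities hold for all assignments")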

Unicode

In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a repertoire of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).[1]

The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming language, the Microsoft .NET Framework and modern operating systems.

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters), the now-obsolete UCS-2 (which uses 2 bytes for all characters, but does not include every character in the Unicode standard), and UTF-16 (which extends UCS-2, using 4 bytes to encode characters missing from UCS-2).
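As a quick illustration (the sample characters are arbitrary), Python's standard codecs report these byte counts directly:

for ch in ("A", "é", "€", "\U00010011"):
    encoded_8 = ch.encode("utf-8")
    encoded_16 = ch.encode("utf-16-le")   # -le omits the byte order mark
    print(f"U+{ord(ch):04X}: {len(encoded_8)} bytes in UTF-8, {len(encoded_16)} bytes in UTF-16")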

Unicode Transformation Format and Universal Character Set

Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings and the Universal Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed code values. The numbers in the names of the encodings indicate the number of bits in one code value (for UTF encodings) or the number of bytes per code value (for UCS encodings). UTF-8 and UTF-16 are probably the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.

UTF encodings include:

UTF-1  — a retired predecessor of UTF-8, maximizes compatibility with ISO 2022, no longer part of The Unicode Standard

UTF-7  — a relatively unpopular 7-bit encoding, often considered obsolete (not part of The Unicode Standard but rather an RFC)

UTF-8 — an 8-bit, variable-width encoding, which maximizes compatibility with ASCII

UTF-EBCDIC — an 8-bit, variable-width encoding, which maximizes compatibility with EBCDIC (not part of The Unicode Standard)

UTF-16 — a 16-bit, variable-width encoding

UTF-32 — a 32-bit, fixed-width encoding

UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for interchange of Unicode text. It is also used by most recent Linux distributions as a direct replacement for legacy encodings in general text handling.

The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginning of text files, where it may be used for byte-order (endianness) detection. Some software developers have adopted it for other encodings, including UTF-8, which does not need an indication of byte order; in this case it simply marks the file as containing Unicode text. The BOM, code point U+FEFF, has the important property of being unambiguous under byte reordering, regardless of the Unicode encoding used: U+FFFE (the result of byte-swapping U+FEFF) is not a legal character, and U+FEFF in places other than the beginning of text conveys the zero-width no-break space (a character with no appearance and no effect other than preventing the formation of ligatures). Also, the bytes FE and FF never appear in UTF-8. The same character converted to UTF-8 becomes the byte sequence EF BB BF.
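For reference, Python's codecs module exposes these BOM byte sequences directly:

import codecs

print(codecs.BOM_UTF8.hex())      # efbbbf: U+FEFF encoded in UTF-8
print(codecs.BOM_UTF16_LE.hex())  # fffe:   little-endian byte order
print(codecs.BOM_UTF16_BE.hex())  # feff:   big-endian byte order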

In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code value actually manifests as an octet sequence). In the other cases, each code point may be represented by a variable number of code values. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the gcc compilers to generate software uses it as the standard "wide character" encoding. Recent versions of the Python programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, effectively spreading the use of this encoding in high-level software.

Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set supported by the ASCII-based Domain Name System. The encoding is used as part of IDNA, which is a system enabling the use of Internationalized Domain Names in all scripts that are supported by Unicode. Earlier and now historical proposals include UTF-5 and UTF-6.

GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the official character set of the People's Republic of China (PRC). BOCU-1 and SCSU are Unicode compression schemes. The April Fools' Day RFC of 2005 specified two parody UTF encodings, UTF-9 and UTF-18.

Mapping codepoints to Unicode encoding forms

Peter Constable, 2001-06-13

Note:

This is an Appendix to “Understanding Unicode™”.

See also A review of characters with compatibility decompositions.

In Section 4 of “Understanding Unicode™”, we examined each of the three character encoding forms defined within Unicode. This appendix describes in detail the mappings from Unicode codepoints to the code unit sequences used in each encoding form.

In this description, the mapping will be expressed in alternate forms, one of which is a mapping of bits between the binary representation of a Unicode scalar value and the binary representation of a code unit. Even though a coded character set encodes characters in terms of numerical values that have no specific computer representation or data type associated with them, for purposes of describing this mapping, we are considering codepoints in the Unicode codespace to have a width of 21 bits. This is the number of bits required for binary representation of the entire numerical range of Unicode scalar values, 0x0 to 0x10FFFF.

1 UTF-32

The UTF-32 encoding form was formally incorporated into Unicode as part of TUS 3.1. The definitions for UTF-32 are specified in TUS 3.1 and in UAX#19 (Davis 2001). The mapping for UTF-32 is, essentially, the identity mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself. Thus if U represents the Unicode scalar value for a character and C represents the value of the 32-bit code unit then:

U = C

The mapping can also be expressed in terms of the relationships between bits in the binary representations of the Unicode scalar values and the 32-bit code units, as shown in Table 1.

Codepoint range                    Unicode scalar value (binary)   Code units (binary)

U+0000..U+D7FF, U+E000..U+10FFFF   xxxxxxxxxxxxxxxxxxxxx           00000000000xxxxxxxxxxxxxxxxxxxxx

Table 1 UTF-32 USV to code unit mapping
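The identity mapping is easy to confirm with Python's standard codecs (a small illustration of my own, not part of the original appendix):

import struct

ch = "\U00010011"
(code_unit,) = struct.unpack("<I", ch.encode("utf-32-le"))
assert code_unit == ord(ch)   # the 32-bit code unit equals the scalar value
print(hex(code_unit))         # 0x10011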

2 UTF-16

The UTF-16 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-16 are specified in TUS 3.0. Code points in the Basic Multilingual Plane are encoded as a single 16-bit code unit with the same value as the codepoint; code points in the range U+10000..U+10FFFF are encoded as a pair of 16-bit code units, a high surrogate CH (in the range 0xD800..0xDBFF) followed by a low surrogate CL (in the range 0xDC00..0xDFFF). Given such a pair, the Unicode scalar value U is calculated as:

U = (CH - 0xD800) * 0x400 + (CL - 0xDC00) + 0x10000

Likewise, determining the high and low surrogate values for a given Unicode scalar value is fairly straightforward. Assuming the variables CH, CL and U as above, and that U is in the range U+10000..U+10FFFF,

CH = (U - 0x10000) \ 0x400 + 0xD800

CL = (U - 0x10000) mod 0x400 + 0xDC00

where “\” represents integer division (returns only integer portion, rounded down), and “mod” represents the modulo operator.
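Transcribed into Python as a sketch (the function names are my own; // plays the role of "\"):

def to_surrogates(u: int) -> tuple[int, int]:
    # Split a scalar value in U+10000..U+10FFFF into a surrogate pair.
    ch = (u - 0x10000) // 0x400 + 0xD800   # high surrogate
    cl = (u - 0x10000) % 0x400 + 0xDC00    # low surrogate
    return ch, cl

def from_surrogates(ch: int, cl: int) -> int:
    # Recombine a surrogate pair into a scalar value.
    return (ch - 0xD800) * 0x400 + (cl - 0xDC00) + 0x10000

print([hex(v) for v in to_surrogates(0x10011)])   # ['0xd800', '0xdc11']
print(hex(from_surrogates(0xD800, 0xDC11)))       # 0x10011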

Expressing the mapping in terms of a mapping of bits between the binary representations of scalar values and code units, the UTF-16 mapping is as shown in Table 2:

Codepoint range                  Unicode scalar value (binary)   Code units (binary)

U+0000..U+D7FF, U+E000..U+FFFF   00000xxxxxxxxxxxxxxxx           xxxxxxxxxxxxxxxx

U+10000..U+10FFFF                uuuuuxxxxxxyyyyyyyyyy           110110wwwwxxxxxx 110111yyyyyyyyyy   (where uuuuu = wwww + 1)

Table 2 UTF-16 USV to code unit mapping

3 UTF-8

The UTF-8 encoding form was formally incorporated into Unicode as part of TUS 2.0. The current definitions for UTF-8 are specified in TUS 3.1. As with the other encoding forms, calculating a Unicode scalar value from the 8-bit code units in a UTF-8 sequence is a matter of simple arithmetic. In this case, however, the calculation depends upon the number of bytes in the sequence. Similarly, the calculation of code units from a scalar value must be expressed differently for different ranges of scalar values.

Let us consider first the relationship between bits in the binary representation of codepoints and code units. This is shown for UTF-8 in Table 3:

Codepoint range                  Scalar value (binary)    Byte 1     Byte 2     Byte 3     Byte 4

U+0000..U+007F                   00000000000000xxxxxxx    0xxxxxxx

U+0080..U+07FF                   0000000000yyyyyxxxxxx    110yyyyy   10xxxxxx

U+0800..U+D7FF, U+E000..U+FFFF   00000zzzzyyyyyyxxxxxx    1110zzzz   10yyyyyy   10xxxxxx

U+10000..U+10FFFF                uuuzzzzzzyyyyyyxxxxxx    11110uuu   10zzzzzz   10yyyyyy   10xxxxxx

Table 3 UTF-8 USV to code unit mapping

Note

There is a slight difference between Unicode and ISO/IEC 10646 in how they define UTF-8 since Unicode limits it to the roughly one million characters possible in Unicode’s codespace, while for the ISO/IEC standard, it can access the entire 31-bit codespace. For all practical purposes, this difference is irrelevant since the ISO/IEC codespace is effectively limited to match that of Unicode, but you may encounter differing descriptions on occasion.

As mentioned in Section 4.2 of “Understanding Unicode™”, UTF-8 byte sequences have certain interesting properties. These can be seen from the table above. Firstly, note the high-order bits in non-initial bytes as opposed to sequence-initial bytes. By looking at the first two bits, you can immediately determine whether a code unit is an initial byte in a sequence or is a following byte. Secondly, by looking at the number of non-zero high-order bits of the first byte in the sequence, you can immediately tell how long the sequence is: if no high-order bits are set to one, then the sequence contains exactly one byte. Otherwise, the number of non-zero high-order bits is equal to the total number of bytes in the sequence.
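These two properties translate directly into code; a minimal Python sketch (the helper name is my own, and bytes above 0xF7, which never occur in legal UTF-8, are not handled):

def utf8_sequence_length(first_byte: int) -> int:
    # Read the sequence length off the high-order bits of the first byte.
    if first_byte < 0x80:    # 0xxxxxxx: a one-byte (ASCII) sequence
        return 1
    if first_byte < 0xC0:    # 10xxxxxx: a continuation byte, never sequence-initial
        raise ValueError("continuation byte")
    if first_byte < 0xE0:    # 110xxxxx: two bytes
        return 2
    if first_byte < 0xF0:    # 1110xxxx: three bytes
        return 3
    return 4                 # 11110xxx: four bytes

print([utf8_sequence_length(b) for b in (0x41, 0xC3, 0xE2, 0xF0)])   # [1, 2, 3, 4]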

Page 7: Comprehasive Exam -IT , By amit Chandra

Table 3 also reveals the other interesting characteristic of UTF-8 that was described in Section 4.2 of “Understanding Unicode™”. Note that characters in the range U+0000..U+007F are represented using a single byte. The characters in this range match ASCII codepoint for codepoint. Thus, any data encoded in ASCII is automatically also encoded in UTF-8.

Having seen how the bits compare, let us consider how code units can be calculated from scalar values, and vice versa. If U represents the value of a Unicode scalar value and C1, C2, C3 and C4 represent bytes in a UTF-8 byte sequence (in order), then the value of a Unicode scalar value U can be calculated as follows:

If a sequence has one byte, then

U = C1

Else if a sequence has two bytes, then

U = (C1 – 192) * 64 + C2 – 128

Else if a sequence has three bytes, then

U = (C1 – 224) * 4,096 + (C2 – 128) * 64 + C3 – 128

Else

U = (C1 – 240) * 262,144 + (C2 – 128) * 4,096 + (C3 – 128) * 64 + C4 – 128

End if
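The same arithmetic, transcribed directly into Python as a sketch (the function name is my own; no validation of continuation bytes is performed):

def utf8_decode_one(seq: bytes) -> int:
    # Compute the scalar value of a single UTF-8 byte sequence (no validation).
    c1 = seq[0]
    if len(seq) == 1:
        return c1
    if len(seq) == 2:
        return (c1 - 192) * 64 + seq[1] - 128
    if len(seq) == 3:
        return (c1 - 224) * 4096 + (seq[1] - 128) * 64 + seq[2] - 128
    return ((c1 - 240) * 262144 + (seq[1] - 128) * 4096
            + (seq[2] - 128) * 64 + seq[3] - 128)

print(hex(utf8_decode_one(b"\xf0\x90\x80\x91")))   # 0x10011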

Going the other way, given a Unicode scalar value U, the UTF-8 byte sequence can be calculated as follows:

If U <= U+007F, then

C1 = U

Else if U+0080 <= U <= U+07FF, then

C1 = U \ 64 + 192

C2 = U mod 64 + 128

Else if U+0800 <= U <= U+D7FF, or if U+E000 <= U <= U+FFFF, then

C1 = U \ 4,096 + 224

C2 = (U mod 4,096) \ 64 + 128

C3 = U mod 64 + 128

Else

C1 = U \ 262,144 + 240

C2 = (U mod 262,144) \ 4,096 + 128

C3 = (U mod 4,096) \ 64 + 128

C4 = U mod 64 + 128


End if
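Transcribed into Python as a sketch (the function name is my own; // is the integer division written as "\" in the procedure above):

def utf8_encode_one(u: int) -> bytes:
    # Build the shortest UTF-8 byte sequence for a scalar value.
    if u <= 0x7F:
        return bytes([u])
    if u <= 0x7FF:
        return bytes([u // 64 + 192, u % 64 + 128])
    if u <= 0xFFFF:
        return bytes([u // 4096 + 224, (u % 4096) // 64 + 128, u % 64 + 128])
    return bytes([u // 262144 + 240, (u % 262144) // 4096 + 128,
                  (u % 4096) // 64 + 128, u % 64 + 128])

print(utf8_encode_one(0x10011).hex(" "))   # f0 90 80 91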

where "\" represents integer division (returns only the integer portion, rounded down), and "mod" represents the modulo operator.

If you examine the mapping in Table 3 carefully, you may notice that by ignoring the range constraints in the left-hand column, certain codepoints can potentially be represented in more than one way. For example, substituting U+0041 LATIN CAPITAL LETTER A into the table gives the following possibilities:

Codepoint (binary)     Pattern                  Byte 1     Byte 2     Byte 3     Byte 4

000000000000001000001  00000000000000xxxxxxx    01000001

000000000000001000001  0000000000yyyyyxxxxxx    11000001   10000001

000000000000001000001  00000zzzzyyyyyyxxxxxx    11100000   10000001   10000001

000000000000001000001  uuuzzzzzzyyyyyyxxxxxx    11110000   10000000   10000001   10000001

Table 4 “UTF-8” non-shortest sequences for U+0041

Obviously, having these alternate encoded representations for the same character is not desirable. Accordingly, the UTF-8 specification stipulates that the shortest possible representation must be used. In TUS 3.1, this was made explicit by specifying exactly which UTF-8 byte sequences are and are not legal. Thus, in the example above, each of the sequences other than the first is an illegal code unit sequence.
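Modern UTF-8 decoders enforce this rule; Python's built-in codec, for instance, rejects the overlong two-byte sequence for U+0041:

# 0xC1 0x81 would be a two-byte "overlong" encoding of U+0041 and is illegal.
try:
    b"\xc1\x81".decode("utf-8")
except UnicodeDecodeError as err:
    print(err)                      # invalid start byte
print(b"\x41".decode("utf-8"))      # 'A': the only legal UTF-8 encoding of U+0041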

Similarly, a supplementary-plane character can be encoded directly into a four-byte UTF-8 sequence, but someone might (possibly from misunderstanding) choose to map the codepoint into a UTF-16 surrogate pair, and then apply the UTF-8 mapping to each of the surrogate code units to get a pair of three-byte sequences. To illustrate, consider the following:

Supplementary-plane codepoint    U+10011

Normal UTF-8 byte sequence       0xF0 0x90 0x80 0x91

UTF-16 surrogate pair            0xD800 0xDC11

"UTF-8" mapping of surrogates    0xED 0xA0 0x80 0xED 0xB0 0x91

Table 5 UTF-8-via-surrogates representation of supplementary-plane character
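Both forms can be reproduced in Python: the normal sequence via a plain encode, and the surrogates-based form via the "surrogatepass" error handler (used here purely for illustration; the result is not legal UTF-8):

# The normal four-byte sequence for U+10011:
print("\U00010011".encode("utf-8").hex(" "))            # f0 90 80 91
# Encoding the surrogate code points themselves ("UTF-8" mapping of surrogates):
pair = "\ud800\udc11"
print(pair.encode("utf-8", "surrogatepass").hex(" "))   # ed a0 80 ed b0 91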

Again, the Unicode Standard expects the shortest representation to be used for UTF-8. For certain reasons, non-shortest representations of supplementary-plane characters are referred to as irregular code unit sequences rather than illegal code unit sequences. The distinction here is subtle: software that conforms to the Unicode Standard is allowed to interpret these irregular sequences as the corresponding supplementary-plane characters, but is not allowed to generate them. In certain situations, though, software will want to reject such irregular UTF-8 sequences (for instance, where they might otherwise be used to circumvent security checks), and in these cases the Standard allows conformant software to ignore or reject these sequences, or remove them from a data stream.

The main motivation for making the distinction, and for considering these 6-byte sequences irregular rather than illegal, is this: suppose a process is re-encoding a data stream from UTF-16 to UTF-8, and suppose that the source data stream had been interrupted so that it ended with the beginning of a surrogate pair. It may be that this segment of the data will later be reunited with the remainder of the data, which has also been re-encoded in UTF-8. So we are assuming that there are two segments of data out there: one ending with an unpaired high surrogate, and one beginning with an unpaired low surrogate.

Now, as each segment of the data is being trans-coded from UTF-16 to UTF-8, the question arises as to what should be done with the unpaired surrogate code units. If they are ignored, then the result after the data is reassembled will be that a character has been lost. A more graceful way to deal with the data would be for the trans-coding process to translate the unpaired surrogate into a corresponding 3-byte UTF-8 sequence, and then leave it up to a later receiving process to decide what to do with it. Then, if the receiving process gets the data segments assembled again, that character will still be part of the information content of the data. The only problem is that now it is in a 6-byte pseudo-UTF-8 sequence. Defining these as irregular rather than illegal is intended to allow that character to be retained over the course of this overall process in a form that conformant software is allowed to interpret, even if it would not be allowed to generate it that way.