
Character Encoding

Character Repertoire refers to a set of distinct characters. The characters include letters, numerals, and other symbols.

Character Code refers to a mapping, often presented in tabular form, in which each character in a repertoire is assigned a code value. The code value is a unique non-negative integer. This is necessary because computers ultimately deal only with numbers.

Character Encoding is essentially a way to represent each character internally as a specific sequence of bits. It is an algorithm that defines how each code value (or sequence of code values) is converted to a specific sequence of bits.

This conversion makes storage possible in a system that uses fixed bit widths. During the conversion, platform-dependent issues such as byte order are taken into account.

A particular character encoding scheme needs to be specified when

a. you need to store textual data in a file on your computer's hard disk

b. you need to transfer textual data over the network

For most UNIX systems, the platform-default character encoding is ISO Latin 1. For Windows, it is Windows Latin 1 (Windows-1252), also called WinLatin1.

The encoding scheme used to encode data is specified in suitable headers. The encoding used must be understood by the recipient software tool.

Numerous character encoding schemes exist. Examples include UTF-8, UTF-16, and ISO Latin 1.

Character Set sometimes merely denotes a character repertoire, but the term is overloaded: it is also used to refer to a character code, and quite often a particular character encoding is implied as well.

ASCII

It is an acronym for American Standard Code for Information Interchange. It was the most commonly used character set in the early days of computing. It defines a character repertoire, a character code, and a character encoding. It contains a total of 128 characters: 32 non-printable control characters, such as linefeed and Escape, and 96 other characters, such as upper- and lowercase English letters, digits, and common punctuation symbols, most of which are printable.

Most of the character codes presently in use contain ASCII as a subset. It is the safest character set to use in data transfer, since there is no risk of character mapping loss. For example, at present, the safest means of storing filenames on a typical network is to use only ASCII characters.

An 8-bit byte is used to represent an ASCII character. The 128 characters are encoded in 7-bit binary form, i.e. the lower 7 bits of the byte are set to the appropriate code value. The most significant bit is left unused; an application may, for example, use it as a parity bit for error checking. For this reason, ASCII is regarded as a single-byte 7-bit character set.
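
As a brief illustrative sketch, the C# fragment below uses the standard System.Text.Encoding.ASCII class to show that each ASCII character becomes a single byte whose value equals its code value.

// Sketch: ASCII-encoding a short string in .NET; each character becomes one byte
byte[] bytes = System.Text.Encoding.ASCII.GetBytes("A!");   // yields { 65, 33 }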

Extending the ASCII set

ASCII was soon observed to be inadequate. For example, several European languages use accented or special characters: French has ç and é, Danish has Ø, Spanish has ñ, and German has ü.

Several informal ways were used to extend ASCII. Using a single byte, with all 8 bits utilised, it is possible to encode a maximum of 256 characters. This figure is adequate to encode most of the characters of Western European languages.

Numerous character encodings emerged, each supporting a different character repertoire.

Example

ISO Latin 1, also known as ISO 8859-1, defines a character repertoire and a character code for it. It contains ASCII as a subset: the code values 0-127 represent the same characters as in ASCII. The code values 128-159 are unused and reserved for control codes; they do not represent printable characters. For instance, they can be used for device control such as changing colors or moving the cursor.

This character repertoire also contains several accented characters, a collection of letters from West European languages, and some special characters. They occupy the code values 160–255.

It also specifies an encoding scheme that simply represents these code values in 8-bit binary form.

In effect, ISO 8859-1 is the default encoding in many environments.

Example

There are a number of character codes that are extensions of ASCII. ISO 8859 is actually a family of character codes. The ISO 8859 character codes extend ASCII in distinct ways with special characters used in different languages.

ISO 8859-1, already discussed, is just one member of this family. ISO Latin 2, also known as ISO 8859-2, contains the ASCII characters and a collection of letters from Central/Eastern European languages.

The member character codes of the ISO 8859 family share the same structure: code values 0-127 are identical to ASCII, 128-159 are reserved for control purposes, and each member uses the code values 160-255 in its own distinct way.

A single byte is sufficient to represent 256 code values. ISO 8859 character codes are typically encoded in simple 8-bit binary form.

Limitations of using different encodings

a. Character sets often clash with one another. Two different character sets often use the same code value for two different characters. Conversely, the same character may be assigned different code values in two different character sets. Such data may become corrupted when passed between different encodings.

b. Limited support for multi-lingual users. 256 character codes are inadequate to represent all the characters such users require.

c. Many Far Eastern languages, such as Chinese, Japanese, and Korean, require more than 256 character codes. The writing systems used in this region typically require 3,000-15,000 characters.

Example

The ‰ (per mille) symbol has the code value 228 in the Macintosh character code; the same code value is assigned to the letter ä in ISO 8859-1 and to the letter ð in HP's Roman-8.

Example

The EBCDIC character set, defined by IBM, contains all the ASCII characters, but their code values are quite different. In ASCII, for instance, the letters A-Z are assigned consecutive code values, but in EBCDIC they are not.

The character ! has the code value 90 in EBCDIC, whereas it has the code value 33 in ASCII.

Example

When some text is stored in a file, transferred or processed, it is essential to specify the character encoding to be used.

Consider a program that wants to write the character "é" to a text file. This character has the code value 233 in ISO Latin 1, whereas in the DOS character set it has the code value 130. So, if the ISO Latin 1 character set is used to write the text file, a byte whose value is 233 must be written.

Similarly, it is not possible to interpret a text file correctly without knowing beforehand in which character set it was originally saved. For instance, in ISO Latin 1 the value 233 maps to the character "é", whereas in the DOS character set it maps to the Greek letter theta (Θ).
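
A minimal C# sketch of this ambiguity is shown below; it assumes that the ISO 8859-1 and DOS code page 437 encodings are available on the system (on .NET Core, the latter requires registering the code-pages encoding provider).

// Sketch: the same byte value 233 maps to different characters under different encodings
byte[] data = { 233 };
string asLatin1 = System.Text.Encoding.GetEncoding("ISO-8859-1").GetString(data);  // "é"
string asDos    = System.Text.Encoding.GetEncoding(437).GetString(data);           // "Θ" (theta) in DOS code page 437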

Example

When character streams are transferred over a network, it is very important to explicitly specify a valid character encoding along with the data being transported. If you don't, the platform-default encoding will be used. In that case, your application may behave differently on a Windows machine than on a Unix machine, where WinLatin1 is not the default. When the encoding schemes differ, the output may get garbled.
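
The sketch below contrasts relying on the platform default with naming the encoding explicitly. Note that in .NET, Encoding.Default is the ANSI code page on the classic .NET Framework but UTF-8 on .NET Core, so the implicit result varies by platform and runtime.

// Sketch: implicit (platform-dependent) vs. explicit (portable) choice of encoding
byte[] implicitBytes = System.Text.Encoding.Default.GetBytes("é");  // byte values depend on the machine's default encoding
byte[] explicitBytes = System.Text.Encoding.UTF8.GetBytes("é");     // always the same UTF-8 byte sequence (0xC3 0xA9)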

Example

Several years ago, the default encoding specified by HTTP standards was ISO 8859-1. However, the current standards explicitly state that there is no officially recognised default.

The character encoding must be specified for an HTML document. We typically use the charset parameter of the Content-Type HTTP header to specify the encoding scheme being used. For instance, the web server will send the following header if an HTML document uses ISO 8859-1: Content-Type: text/html; charset=ISO-8859-1

Example

A program can fully translate Macintosh data to the ISO Latin 1 character code format provided the data contains only characters that are present in ISO Latin 1. Note that a translation program uses a conversion table to convert the character code of each character in the data.

The Greek letter pi is present in the Macintosh character repertoire, but it is not present in ISO Latin 1. Therefore, if your data contains the pi character, the program won't be able to convert it into ISO Latin 1.
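
A rough sketch of such a conversion in .NET is shown below; it assumes the Macintosh ("macintosh") and ISO 8859-1 encodings are available on the system.

// Sketch: converting Macintosh-encoded bytes to ISO 8859-1 using a built-in conversion
var mac    = System.Text.Encoding.GetEncoding("macintosh");
var latin1 = System.Text.Encoding.GetEncoding("ISO-8859-1");
byte[] macBytes    = mac.GetBytes("é");                                    // "é" exists in both repertoires
byte[] latin1Bytes = System.Text.Encoding.Convert(mac, latin1, macBytes);  // yields { 233 }
// A character such as π, present in the Macintosh repertoire but absent from ISO 8859-1,
// cannot be converted losslessly; by default it would be replaced by a fallback character.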

Unicode

Unicode is a standard created by a consortium of companies including Microsoft, HP, Digital, IBM, and Apple. It defines a character repertoire, a character code, and encodings for it. It requires that Unicode implementations treat characters uniformly across different applications and platforms.

Its character repertoire encompasses over 100,000 characters and contains nearly all the characters of the other character repertoires in use, so Unicode can be considered a superset of all other character sets. It is designed to represent all of the characters found in all human languages, which allows us to process text in practically any language and script used in the world. It also provides a complete set of mathematical and technical symbols.

Benefits

a. It ensures that data can be transported through different systems without corruption

b. A Unicode-enabled application can be used across multiple platforms without re-engineering

It is supported in various operating systems, browsers, standards, and other products. Windows internally uses the Unicode character set. The C# character set conforms to Unicode, and in C# any Unicode character can be specified using a Unicode escape sequence. Other technologies that use Unicode include XML, Java, CORBA, and Perl.
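
For example, the following small C# sketch uses escape sequences to specify Unicode characters by their code values.

// Sketch: specifying Unicode characters in C# with \u escape sequences
char eAcute = '\u00E9';   // é, code value 0xE9 (233)
string pi   = "\u03C0";   // π, Greek small letter pi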

Each character is assigned a unique code value. Originally, Unicode was designed as a 16-bit code: its support was initially limited to code values in the range 0 to 0xFFFF, i.e. 65,536 positions. Later, the code range was extended to support special applications such as mathematical and musical typesetting, historic alphabets, additional ideographs, etc. The code values now lie in the range 0 to 0x10FFFF (about 1.1 million positions). Note that not all numbers in this range represent coded characters. Code values 0-255 represent the same characters as in ISO Latin 1.

U+nnnn is the notation commonly used to refer to Unicode characters: a character is referred to by its code value in hexadecimal notation, where nnnn stands for four hexadecimal digits. Other ways of referring to a character also exist.

Example

U+002A means the * character, whose code value is 2A hexadecimal (42 decimal). Note that no particular encoding is being referred to.
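
A quick check of this code value in C# (a minimal sketch):

// Sketch: printing the code value of '*' in the U+nnnn notation
int codeValue = '*';                               // 42 decimal, 0x2A hexadecimal
System.Console.WriteLine("U+{0:X4}", codeValue);   // prints U+002A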

Different Unicode encoding schemes

Several encodings exist for Unicode such as UCS-2, UCS-4, UTF-8, UTF-16, UTF-32. UTF is an acronym for Unicode Transformation Format.

The UCS-2 encoding represents each code value as a sequence of two bytes: the value is written as 256x + y, where x and y are the two bytes. It was formerly the native Unicode encoding, i.e. before the Unicode standard extended the code range beyond 16 bits. The Unicode consortium now advises against using UCS-2, since its support is limited to 16 bits.
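
A small arithmetic sketch of that split, in C#:

// Sketch: UCS-2 splits a 16-bit code value into two bytes, value = 256*x + y
int codeValue = 0x00E9;              // é
byte x = (byte)(codeValue / 256);    // 0x00 (high-order byte)
byte y = (byte)(codeValue % 256);    // 0xE9 (low-order byte)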

The UTF-32 encoding simply encodes each code value in 32-bit binary form. It is a very simple encoding scheme and is typically used where string processing is required. However, it is an inefficient scheme: for example, to encode data that contains only ISO Latin 1 characters, it still requires 4 bytes for each character.
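
A brief sketch using the standard System.Text.Encoding.UTF32 class (which is little-endian by default):

// Sketch: UTF-32 always produces 4 bytes per character
byte[] utf32 = System.Text.Encoding.UTF32.GetBytes("A");   // { 0x41, 0x00, 0x00, 0x00 } in little-endian order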

UTF-16 represents code values less than 0x10000 (decimal 65,536) as a single 16-bit binary value. The exception is values between 0xD800 and 0xDFFF: Unicode does not assign characters to any of the code values in this range; they are reserved for another use, discussed below.

A single 16-bit storage unit cannot accommodate a value greater than 65,535. Code values between 0x10000 (decimal 65,536) and 0x10FFFF are therefore first represented as a surrogate pair of values (the literal meaning of surrogate is proxy). These values lie in the range 0xD800 to 0xDFFF; note that 0xDFFF is less than 65,536. Each value in the pair is then converted to 16-bit binary form.
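
The following C# sketch shows how such a surrogate pair can be computed from a code value, assuming the value lies in the range 0x10000 to 0x10FFFF.

// Sketch: computing the UTF-16 surrogate pair for a code value above 0xFFFF
int codePoint = 0x10000;                       // assumed to be in the range 0x10000..0x10FFFF
int offset    = codePoint - 0x10000;           // a 20-bit value
char high = (char)(0xD800 + (offset >> 10));   // high (lead) surrogate
char low  = (char)(0xDC00 + (offset & 0x3FF)); // low (trail) surrogate
// For 0x10000 this yields the pair 0xD800 0xDC00, as in the example further below.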

UTF-16 encoding is commonly used in Win32 applications and Windows NT networks. However, UTF-16 is not recommended for encoding web pages, because the capabilities of the client browser are not known; UTF-8 is better suited for encoding multilingual web pages.

Depending on the computer architecture, the bytes of an encoded character are stored in either of the following ways:

a. Big-endian byte order: This method stores the most significant byte first, e.g. SPARC machine architecture.

b. Little-endian byte order: This method stores the least significant byte first, e.g. Intel architecture.

The code value of Zero-Width No-Break Space character is U+FEFF. Its byte-swapped equivalent U+FFFE is not a valid Unicode character.

A Byte Order Mark (BOM) can be placed at the beginning of encoded Unicode data; UTF-16 encoded data, in particular, can be prefixed with a BOM. The BOM allows automatic detection of the byte order, i.e. it helps to distinguish the big-endian and little-endian variants, and the decoder uses this information. For UTF-16 big-endian, the BOM is the byte sequence FE FF; for UTF-16 little-endian, it is the byte sequence FF FE.

UTF-16 is an efficient method of representing Unicode characters and is suitable for storage or transmission over a network. In practice a single 16-bit word is generally enough, since most applications use only a small subset of Unicode whose code values are less than 65,536.

Example

Consider the code value 65,536 (0x10000). It is greater than 65,535, so a 16-bit storage unit is inadequate to hold it. UTF-16 represents the code value 65,536 as a surrogate pair: it becomes the pair 0xD800 0xDC00. Note that both of these values are less than 65,536.

A Pentium machine uses the little-endian method, which stores the least significant byte first. The pair 0xD800 0xDC00 is therefore represented as the byte sequence 00 D8 00 DC. When a BOM is prefixed to the encoded data, the byte sequence becomes FF FE 00 D8 00 DC.
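
A short sketch of the same result using the standard System.Text.UnicodeEncoding class in .NET:

// Sketch: UTF-16 little-endian encoding of code value 0x10000, with a BOM
var utf16le = new System.Text.UnicodeEncoding(bigEndian: false, byteOrderMark: true);
byte[] bom  = utf16le.GetPreamble();                              // FF FE
byte[] data = utf16le.GetBytes(char.ConvertFromUtf32(0x10000));   // 00 D8 00 DC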

Surrogate Characters

The C# char type cannot encompass all values in the Unicode character set. char is an alias for the System.Char struct and represents an unsigned 16-bit integer, so it can only accommodate values from 0 to 65,535. Not every Unicode character can be squeezed into two bytes: Unicode has more than 65,536 characters, and it is a misconception that Unicode is a 16-bit character set.

In the .NET Framework, characters are internally represented using UTF-16. A Unicode character with a code value less than 0x10000 (decimal 65,536) is encoded as two consecutive bytes (one 16-bit char). A Unicode character with a code value between 0x10000 and 0x10FFFF is represented by two 16-bit surrogate characters, i.e. each such code value takes 4 bytes; two surrogate characters together make up one such Unicode character. The values of these surrogate characters lie in the range 0xD800 to 0xDFFF.

IsSurrogate() is a public static method defined in System.Char that reports whether the specified character is a surrogate character.

Example

A string is actually a collection of individual characters whose datatype is System.Char.

Consider the string "X\U00010000". (Note the capital \U escape: in C#, \U followed by 8 hexadecimal digits is used for code values above 0xFFFF.) This is a string of 3 char values, but it actually represents 2 real Unicode characters. The first is 'X', whose code value is 0x0058. The second, lying in the range 0x10000 - 0x10FFFF, is internally represented using the two surrogate characters 0xD800 and 0xDC00.

string str = "X\U00010000";

// Check whether the character at a given index in the string str is a surrogate character
Char.IsSurrogate(str, 0);   //returns false, 0x0058 is not a surrogate character
Char.IsSurrogate(str, 1);   //returns true, 0xD800 is a surrogate character
Char.IsSurrogate(str, 2);   //returns true, 0xDC00 is a surrogate character

This implies that a string of n char values can actually represent anywhere between n/2 and n Unicode characters. However, this won't be a concern for applications that don't involve mathematical and scientific notations.
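
One way to count the real Unicode characters in such a string is to skip the trailing half of each surrogate pair; the sketch below assumes the string contains only well-formed surrogate pairs.

// Sketch: counting Unicode characters (code points) rather than char values
string s = "X\U00010000";
int codePoints = 0;
for (int i = 0; i < s.Length; i++)
{
    if (!System.Char.IsLowSurrogate(s[i]))   // don't count the trailing (low) half of a pair
        codePoints++;
}
// codePoints is 2, although s.Length is 3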
