Article

This article continues from IO Streams in .NET - Part 1.

Streams

I/O stands for input/output: the transfer of data between a program and any kind of I/O device. Numerous types of I/O devices can be connected to a computer, and a language would be complex if it had to handle each type of device as an individual case. Instead, I/O devices are represented as abstractions, which is one of the major advances in the history of programming.

In basic terms, a stream is a channel through which data is transferred between a program and a sequential data source (or destination). The data source can be a file, a network socket, a web connection, the console, an in-memory buffer, or even another stream.

The .NET Framework offers stream-based I/O. A stream is an I/O abstraction that produces or consumes data. In a C# program, it is essentially an object that we can read data from or write data to. In other words, a stream object is capable of reading data from, or writing data to, a data source.

Data is streamed whenever it is read from or written to a data source. Reading data into a C# program involves opening a stream to the data source and reading the data serially. For instance, we open a stream between our program and a remote resource and read the data as the server sends it. Writing data from a C# program likewise involves opening a stream to a data destination and writing the data serially.
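As a minimal sketch of serial reading, the following program consumes a stream in small chunks until it reports the end of the data. The class and method names (SerialReadDemo, CountBytes) are hypothetical, and a MemoryStream stands in for a file or a remote resource.

```csharp
using System;
using System.IO;

static class SerialReadDemo
{
    // Hypothetical helper: reads the stream in order, in small chunks,
    // and returns the total number of bytes seen.
    public static int CountBytes(Stream source)
    {
        byte[] buffer = new byte[4];
        int total = 0;
        int read;
        // Read returns 0 only when the end of the stream is reached.
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            total += read;
        return total;
    }

    static void Main()
    {
        // Assumption: an in-memory stream stands in for a real data source.
        using (Stream source = new MemoryStream(new byte[10]))
        {
            Console.WriteLine(CountBytes(source)); // 10
        }
    }
}
```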

The .NET framework's I/O system links the stream to the physical I/O device.

Advantages of using stream abstraction:

a. All data sources are handled in essentially the same way. The same techniques can be applied to widely differing data sources. For example, the same methods can be used to write data to a disk file, to the console, or over a network.

b. It encapsulates all the operations that we can carry out on a data source.

c. It hides the details of working with OS, networks, console, and keyboard. It also shields us from the details of underlying backing stores such as files.
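The first advantage can be sketched as follows: one method, written against the abstract Stream type, writes the same bytes to memory and to a disk file. The names UniformWriteDemo and WriteGreeting are hypothetical, and the temporary file is only an illustration.

```csharp
using System;
using System.IO;

static class UniformWriteDemo
{
    // Hypothetical helper: works with any Stream (file, memory, network, ...).
    public static void WriteGreeting(Stream target)
    {
        byte[] payload = { 72, 105 }; // code values of 'H' and 'i'
        target.Write(payload, 0, payload.Length);
    }

    static void Main()
    {
        using (var memory = new MemoryStream())
        {
            WriteGreeting(memory);
            Console.WriteLine(memory.Length); // 2
        }

        // Assumption: a temporary file stands in for any disk file.
        string path = Path.GetTempFileName();
        using (var file = new FileStream(path, FileMode.Create))
        {
            WriteGreeting(file);
        }
        Console.WriteLine(new FileInfo(path).Length); // 2
        File.Delete(path);
    }
}
```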

Byte Streams

The content is treated as a sequence of bytes. Byte streams read and write data byte by byte from or to a data source. C# offers several classes for this purpose, for example the FileStream and MemoryStream classes. A byte stream is not easily readable by humans because the data is represented in byte form.

Byte streams provide only limited support for processing textual data. They cannot properly handle all characters of the Unicode character set. The character stream wrapper classes, which use byte stream objects in the background, are more suitable for this purpose.

A byte stream easily supports the 8-bit ISO Latin 1 character set. First, the code value of an ISO Latin 1 character can be stored in a single byte. Second, for such characters it is easy to convert between the 2-byte char type and the 1-byte byte type: simply ignore the high-order byte of the char variable.
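The conversion described above can be sketched with a plain cast. The names Latin1Demo and ToLatin1Byte are hypothetical. The last call shows why this only works for ISO Latin 1: for a character outside that range, the discarded high-order byte carried information.

```csharp
using System;

static class Latin1Demo
{
    // Hypothetical helper: keeps only the low-order byte of the 2-byte char.
    // Lossless only for code values 0..255 (ISO Latin 1).
    public static byte ToLatin1Byte(char c)
    {
        return unchecked((byte)c);
    }

    static void Main()
    {
        Console.WriteLine(ToLatin1Byte('A'));      // 65
        Console.WriteLine(ToLatin1Byte('\u00E9')); // 233 (e with acute accent)
        Console.WriteLine(ToLatin1Byte('\u03B8')); // 184: the high byte 0x03 is lost
    }
}
```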

Character Streams

We humans prefer to work with character data rather than reading and writing binary data. The content is treated as a sequence of characters rather than eight-bit bytes. But remember that data is always eventually stored in a sequence of bytes.

C# offers character stream wrapper classes, for example the StreamReader and StreamWriter classes. The programmer wraps a byte stream inside a character stream, i.e. a byte stream object is used in the background. .NET automatically converts the underlying byte stream into a character stream and vice versa. These conversions require a character encoding, which states what characters are represented by each byte or sequence of bytes. If you do not specify one, StreamReader and StreamWriter use UTF-8.
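A minimal sketch of this wrapping, assuming an in-memory byte stream as the backing store: the same four bytes decode to different text depending on the encoding handed to the StreamReader. The names CharStreamDemo and ReadAllText are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class CharStreamDemo
{
    // Hypothetical helper: wraps a byte stream in a character stream
    // and decodes everything with the given encoding.
    public static string ReadAllText(byte[] data, Encoding encoding)
    {
        using (var bytes = new MemoryStream(data))
        using (var reader = new StreamReader(bytes, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // "Hi" followed by theta (U+03B8) encoded as UTF-8: 48 69 CE B8
        byte[] utf8 = { 0x48, 0x69, 0xCE, 0xB8 };
        Console.WriteLine(ReadAllText(utf8, Encoding.UTF8).Length);  // 3 characters
        Console.WriteLine(ReadAllText(utf8, Encoding.ASCII).Length); // 4: CE and B8 are not valid ASCII
    }
}
```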

Example

Consider the Unicode string "Hello" that needs to be written to a file using the UTF-16 character encoding on Intel architecture, which uses little-endian byte order.

The code values of the individual characters in the "Hello" string are:

72 101 108 108 111 (decimal notation)

48 65 6C 6C 6F (hexadecimal notation)

In UTF-16, code values less than 0x10000 are represented as two consecutive bytes. The little-endian method stores the least significant byte first, so the code value 0x0048 is stored as 48 00. For little-endian, the byte order mark (BOM) is represented as the byte sequence FF FE.

Thus, the byte sequence that will be written into the file is FF FE 48 00 65 00 6C 00 6C 00 6F 00
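The byte sequence for this example can be checked with a short program. This is only a sketch: the names HelloUtf16Demo and ToHexWithBom are hypothetical, and it builds the bytes in memory rather than writing a file.

```csharp
using System;
using System.Text;

static class HelloUtf16Demo
{
    // Hypothetical helper: returns the BOM followed by the UTF-16
    // little-endian bytes of the text, formatted as hexadecimal.
    public static string ToHexWithBom(string text)
    {
        Encoding utf16 = Encoding.Unicode;   // UTF-16, little-endian
        byte[] bom = utf16.GetPreamble();    // the byte order mark
        byte[] body = utf16.GetBytes(text);  // the encoded characters, no BOM
        byte[] all = new byte[bom.Length + body.Length];
        bom.CopyTo(all, 0);
        body.CopyTo(all, bom.Length);
        return BitConverter.ToString(all).Replace('-', ' ');
    }

    static void Main()
    {
        Console.WriteLine(ToHexWithBom("Hello"));
        // FF FE 48 00 65 00 6C 00 6C 00 6F 00
    }
}
```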

Byte and Character Streams

A StreamWriter object turns a stream of Unicode characters into a stream of bytes according to the UTF-16 character encoding. It uses the Encoding class, discussed below, to convert characters to bytes. A FileStream object writes those bytes to the file. A C# program combines the two objects as follows: it creates the FileStream object (passing the file name to its constructor) and then passes that object, together with the desired encoding, to the constructor of the StreamWriter object. In other words, the StreamWriter object wraps the FileStream object. If no encoding is passed, StreamWriter uses UTF-8 rather than UTF-16.
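The combination described above might look like the following sketch. The names WrapDemo and WriteUtf16 and the file name hello-utf16.txt are hypothetical. Encoding.Unicode selects UTF-16 in little-endian byte order.

```csharp
using System;
using System.IO;
using System.Text;

static class WrapDemo
{
    // Hypothetical helper: writes the text as UTF-16 through a StreamWriter
    // that wraps a FileStream, then returns the bytes found in the file.
    public static byte[] WriteUtf16(string path, string text)
    {
        using (var file = new FileStream(path, FileMode.Create))
        using (var writer = new StreamWriter(file, Encoding.Unicode))
        {
            writer.Write(text);
        }
        return File.ReadAllBytes(path);
    }

    static void Main()
    {
        // Assumption: a file in the temp directory is acceptable for the demo.
        string path = Path.Combine(Path.GetTempPath(), "hello-utf16.txt");
        byte[] bytes = WriteUtf16(path, "Hello");
        Console.WriteLine(BitConverter.ToString(bytes));
        // FF-FE-48-00-65-00-6C-00-6C-00-6F-00
        File.Delete(path);
    }
}
```

The writer is disposed before the file is read back, which flushes any buffered characters to the FileStream.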

Encoding class

It is an abstract class in the .NET Framework class library and resides in the System.Text namespace. It provides a variety of methods and properties. The commonly used methods are GetBytes(), GetChars(), and GetString(). GetBytes() encodes strings and arrays of Unicode characters into arrays of bytes; GetChars() and GetString() decode arrays of bytes back into arrays of characters and strings, respectively.

It defines the following static read-only properties. They are typically used by an application to obtain a particular implementation of the Encoding class.

a. ASCII returns an encoding for the ASCII format.

b. BigEndianUnicode returns an encoding for the Unicode format in big-endian byte order.

c. Unicode returns an encoding for the Unicode format in little-endian byte order.

d. UTF8 returns an encoding for the UTF-8 format.

e. UTF7 returns an encoding for the UTF-7 format. (This property is marked obsolete in .NET 5 and later.)
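As a small illustration of these properties, the sketch below asks four of the encodings how many bytes they need for the same three-character string. The names EncodingPropertiesDemo and ByteCount are hypothetical. UTF7 is left out because it is obsolete in newer .NET versions.

```csharp
using System;
using System.Text;

static class EncodingPropertiesDemo
{
    // Hypothetical helper: number of bytes the encoding needs for the text.
    public static int ByteCount(Encoding encoding, string text)
    {
        return encoding.GetByteCount(text);
    }

    static void Main()
    {
        string s = "Hi\u03B8"; // 'H', 'i', theta
        Console.WriteLine(ByteCount(Encoding.ASCII, s));            // 3 (theta becomes one '?')
        Console.WriteLine(ByteCount(Encoding.BigEndianUnicode, s)); // 6
        Console.WriteLine(ByteCount(Encoding.Unicode, s));          // 6
        Console.WriteLine(ByteCount(Encoding.UTF8, s));             // 4 (theta needs two bytes)
    }
}
```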

The System.Text namespace also includes several implementations of the Encoding class:

a. ASCIIEncoding class

b. UnicodeEncoding class

c. UTF8Encoding class

d. UTF7Encoding class

The ASCIIEncoding class encodes each Unicode character as a 7-bit ASCII value. It supports only code values between U+0000 and U+007F, i.e. between 0 and 127. By default, any value greater than U+007F is converted to the ? character. It is used when you want to work with legacy encodings and systems. Most of the time, it is not suitable for internationalized applications.

The UnicodeEncoding class encodes each Unicode character using the UTF-16 encoding scheme. The UnicodeEncoding() constructor uses little-endian byte order and provides a BOM. The BOM is exposed through GetPreamble() and is written by classes such as StreamWriter at the start of a stream; GetBytes() itself does not add it to its output. The UnicodeEncoding(bool, bool) constructor lets you specify the byte order: pass true as the first argument for big-endian byte ordering. The second argument specifies whether a BOM is provided: pass true to include it.
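The effect of the two constructor arguments can be sketched as follows. The names UnicodeEncodingCtorDemo, PreambleHex, and BytesHex are hypothetical.

```csharp
using System;
using System.Text;

static class UnicodeEncodingCtorDemo
{
    // Hypothetical helpers: hexadecimal view of the BOM and of the encoded text.
    public static string PreambleHex(UnicodeEncoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static string BytesHex(UnicodeEncoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        var big = new UnicodeEncoding(true, true);      // big-endian, with BOM
        var little = new UnicodeEncoding(false, false); // little-endian, no BOM

        Console.WriteLine(PreambleHex(big));            // FE-FF
        Console.WriteLine(BytesHex(big, "H"));          // 00-48
        Console.WriteLine(PreambleHex(little).Length);  // 0 (empty preamble)
        Console.WriteLine(BytesHex(little, "H"));       // 48-00
    }
}
```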

On Intel processors, it is more efficient to store Unicode characters in little-endian byte order. The StreamWriter instance can write the BOM at the beginning of the UTF-16 encoded data. This allows the StreamReader to infer whether the rest of the data is in little-endian or big-endian byte order.
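A minimal sketch of this inference, assuming an in-memory stream instead of a file: the text is written as big-endian UTF-16 with a BOM, and the reader is deliberately started with a different encoding so that only the BOM can lead it to the right one. The names BomDetectionDemo and WriteThenDetect are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetectionDemo
{
    // Hypothetical helper: writes big-endian UTF-16 with a BOM, then reads it
    // back with a reader that is asked to detect the encoding from the BOM.
    public static string WriteThenDetect(string text)
    {
        var bigEndianWithBom = new UnicodeEncoding(true, true);
        byte[] data;
        using (var buffer = new MemoryStream())
        {
            using (var writer = new StreamWriter(buffer, bigEndianWithBom))
            {
                writer.Write(text);
            }
            data = buffer.ToArray(); // still allowed after the stream is closed
        }

        // The third argument enables detection from byte order marks.
        using (var reader = new StreamReader(new MemoryStream(data), Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        Console.WriteLine(WriteThenDetect("Hi\u03B8") == "Hi\u03B8"); // True
    }
}
```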

The UTF8Encoding class encodes Unicode characters using the UTF-8 encoding. The UTF7Encoding class encodes Unicode characters using the UTF-7 encoding.

Example

The Unicode escape sequence for the Unicode character θ (theta, a Greek letter) is \u03B8. The decimal equivalent of hexadecimal 03B8 is 952; note that it is outside the standard ASCII code range. When this character is encoded using UTF-16, its byte representation in little-endian byte order will be 0xB8 0x03.

using System;
using System.Text;

string s = "Hi\u03B8";

//Create a UnicodeEncoding object that supports little-endian byte order
UnicodeEncoding uniEncoding = new UnicodeEncoding();

//Use GetByteCount() to determine the number of bytes required 
//to encode the given string s using UTF-16 encoding
int count = uniEncoding.GetByteCount(s);

//Allocate an appropriately sized buffer and encode the string into it
byte[] targetBytes = new byte[count];
uniEncoding.GetBytes(s, 0, s.Length, targetBytes, 0);

//The targetBytes array contains a UTF-16 encoded representation of 
//the string s. Each Unicode character in the string corresponds to 
//two elements of the array, least significant byte first. Therefore, 
//the targetBytes array contains the values {72, 0, 105, 0, 184, 3}
foreach (Byte b in targetBytes)
    Console.Write("{0} ", b);
Console.WriteLine();

//Decode the bytes back to string. Notice that the theta (θ)
//character is still there.
String targetString = uniEncoding.GetString(targetBytes);
Console.WriteLine(targetString);

//Encode the string as 7-bit ASCII
ASCIIEncoding ascEncoding = new ASCIIEncoding();
count = ascEncoding.GetByteCount(s);
targetBytes = new byte[count];
ascEncoding.GetBytes(s, 0, s.Length, targetBytes, 0);

//Any value greater than 127 is converted to the ? character.
//The ASCII character code for the ? character is 63.
//The targetBytes array contains the values {72, 105, 63}
foreach (Byte b in targetBytes)
    Console.Write("{0} ", b);
Console.WriteLine();

//Convert the bytes back to the string. Notice that the theta (θ) character 
//is not present; a ? character appears in its place.
targetString = ascEncoding.GetString(targetBytes);
Console.WriteLine(targetString);

To continue reading, click this link: IO Streams in .NET - Part 3