Plain Text File
Authored by: Mark F. Russo
A distinction is commonly made between "plain text" files and "binary" files. In a very real sense, all files are binary because they all store bytes. In practice, the distinction refers to how the data in the file is interpreted. The bytes in a plain text file are intended to be interpreted as individual characters using a standard encoding that has its roots in the ASCII character set.
The acronym ASCII stands for the American Standard Code for Information Interchange. The original ASCII encoding standard is a mapping from the decimal numbers 0 through 127 to a set of characters based on the English alphabet. Most characters in the standard are referred to as "printing," meaning they are associated with a graphical symbol, but some are "non-printing," such as the line-feed and carriage-return characters.
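The mapping described above can be explored with a few lines of Python. This is only an illustrative sketch: the sample string is arbitrary, and `str.isprintable()` is used here as a rough stand-in for the printing versus non-printing distinction.

```python
# Map a few characters to their ASCII decimal codes and back.
text = "Hi!\n"
codes = [ord(ch) for ch in text]   # character -> decimal code
print(codes)                       # [72, 105, 33, 10]

# isprintable() is False for control characters such as line feed (10).
printing = [ch.isprintable() for ch in text]
print(printing)                    # [True, True, True, False]

# Convert the decimal codes back into the original characters.
restored = "".join(chr(c) for c in codes)
print(restored == text)            # True
```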
The first table below provides the standard ASCII encoding as a mapping between characters and the decimal numbers from 0 to 127. Most items in the mapping are standard alphanumeric characters and symbols. Items enclosed in angle brackets are non-printing. The second table lists the meaning of these non-printing characters, which have been used for a variety of purposes such as text formatting and control of data transmission. Most modern character encodings are extensions of the original ASCII standard.
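A table of this kind can also be generated programmatically. The sketch below is one possible layout, not a reproduction of the original table: the function names `ascii_label` and `ascii_table` are hypothetical, and the abbreviations are the conventional names for codes 0 through 31 and 127.

```python
# Conventional abbreviations for the control characters 0-31.
CONTROL_NAMES = (
    "NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI "
    "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US"
).split()


def ascii_label(code: int) -> str:
    """Return the character for a code, or <NAME> if it is non-printing."""
    if not 0 <= code <= 127:
        raise ValueError("ASCII codes run from 0 to 127")
    if code < 32:
        return f"<{CONTROL_NAMES[code]}>"
    if code == 127:
        return "<DEL>"
    return chr(code)


def ascii_table(columns: int = 8) -> str:
    """Lay out all 128 codes in rows of `columns` cells."""
    rows = []
    for start in range(0, 128, columns):
        stop = min(start + columns, 128)
        cells = [f"{c:3d} {ascii_label(c):<5}" for c in range(start, stop)]
        rows.append(" ".join(cells))
    return "\n".join(rows)


print(ascii_table())
```

Code 32, the space character, is printed as itself here, so its cell looks empty.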
Files containing data encoded using the ASCII standard are referred to as plain text files. All other files are considered binary files. For example, files that use another encoding, or that store numerical data using a standard such as the IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754), fall in the binary file category. The IEEE 754 standard uses only four bytes of storage to encode a single-precision number that would typically be displayed using eight or more characters.
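The storage difference can be seen by encoding the same number both ways. This is a minimal sketch using Python's standard `struct` module. The sample value and the little-endian byte order are arbitrary choices made for illustration.

```python
import struct

value = 3.141593

# Binary form: IEEE 754 single precision, little-endian, always 4 bytes.
packed = struct.pack("<f", value)

# Text form: one ASCII byte per displayed character.
text = repr(value).encode("ascii")

print(len(packed))  # 4
print(len(text))    # 8

# Reading the four bytes back recovers the value to single precision.
restored = struct.unpack("<f", packed)[0]
print(round(restored, 6) == value)  # True
```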
Many tools exist for reading data from a file with the assumption that the file's bytes represent an ASCII encoding of the original contents. For example, a standard text editor application (such as Microsoft Notepad) reads and displays data with this assumption. If the data in a file is composed entirely of ASCII-encoded characters, there will be nothing unusual about the editor display. Normal English characters will be displayed and formatted as intended. On the other hand, if a standard text editor reads data from a binary file, the results are unpredictable. A screen full of odd characters is the typical result. Sophisticated rich text editors, such as Microsoft Word, store data in files with encodings that are significantly different from the ASCII standard. This is not surprising, as these binary file formats store substantially more information than the characters in a file. Text styles, layout, internal scripts and numerous other data types are stored while trying to keep file size to a minimum.
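One simple way to tell whether a file fits the plain text definition used here is to test whether every byte decodes as ASCII. The following is a sketch under that assumption. The helper name `is_ascii_text` is hypothetical, and the sample bytes are built in memory rather than read from a real file.

```python
def is_ascii_text(data: bytes) -> bool:
    """Return True if every byte falls in the ASCII range 0-127."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


plain = "Hello, world\r\n".encode("ascii")
binary = bytes([0x89, 0x50, 0x4E, 0x47])  # 0x89 lies outside the ASCII range

print(is_ascii_text(plain))   # True
print(is_ascii_text(binary))  # False

# A tolerant decoder substitutes a replacement character for each
# undecodable byte, which is one source of the "odd characters" on screen.
shown = binary.decode("ascii", errors="replace")
print(shown == "\ufffdPNG")   # True
```

For a real file, the bytes could be obtained with `open(path, "rb").read()`, where `path` is whatever file is being inspected.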
Plain text editors manipulate the stream of bytes in a file as if they were ASCII-encoded characters. Many have the option to display non-printing characters using special symbols. This is very useful when trying to determine the format of a data file. One can open the file using a plain text editor and look for special symbols or characters used to display bytes that cannot be decoded. Microsoft Word has a limited ability to do this with the Show/Hide toolbar button. In Word, when the Show/Hide toolbar button is activated, tab characters are displayed as arrows and spaces as dots.
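The same kind of display is easy to approximate in code when no suitable editor is at hand. In the sketch below, the function name `reveal` and the symbols chosen for tab, space, carriage return and line feed are assumptions made for illustration, not the symbols used by any particular editor.

```python
def reveal(data: bytes) -> str:
    """Render bytes as text, making non-printing bytes visible."""
    marks = {
        0x09: "→",    # tab
        0x20: "·",    # space
        0x0D: "␍",    # carriage return
        0x0A: "¶\n",  # line feed, followed by a real line break
    }
    out = []
    for b in data:
        if b in marks:
            out.append(marks[b])
        elif 0x21 <= b <= 0x7E:
            out.append(chr(b))         # visible ASCII character
        else:
            out.append(f"<{b:02X}>")   # anything else, shown as hex
    return "".join(out)


print(reveal(b"a\tb c\r\n\x00"))
```

Seeing `→` between values suggests a tab-delimited file, while `␍¶` at the end of each line indicates carriage-return plus line-feed line endings.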
The Microsoft Excel spreadsheet application is often used to open and examine data files. This makes sense as Excel was designed to manipulate data. A common problem can occur if the data is saved back to a file. Excel may insert extraneous tab characters into the file to pad the number of data items in each row to equal the number of items in the row with the greatest number. These extraneous characters may prevent the file from being properly read by an application that is not expecting the characters.
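If padding of this kind is suspected, the trailing tabs can be removed before the file is passed to another application. The following is a minimal sketch. The function name `strip_trailing_tabs` is hypothetical, and it assumes that tabs at the end of a line never represent intentionally empty fields, which will not be true of every file.

```python
def strip_trailing_tabs(text: str) -> str:
    """Remove tab characters at the end of each line, keeping line endings."""
    cleaned = []
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")   # line content without its ending
        ending = line[len(body):]    # the original "\n" or "\r\n", if any
        cleaned.append(body.rstrip("\t") + ending)
    return "".join(cleaned)


padded = "1\t2\t3\n4\t\t\n5\t6\t\r\n"
print(repr(strip_trailing_tabs(padded)))
# '1\t2\t3\n4\n5\t6\r\n'
```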
|Symbol|Decimal|Meaning|
|<SOH>|1|Start of Header|
|<STX>|2|Start of Text|
|<ETX>|3|End of Text|
|<EOT>|4|End of Transmission|
|<DLE>|16|Data Link Escape|
|<DC1>|17|Device Control 1|
|<DC2>|18|Device Control 2|
|<DC3>|19|Device Control 3|
|<DC4>|20|Device Control 4|
|<ETB>|23|End of Transmission Block|
|<EM>|25|End of Medium|