Structured Text File

From LabAutopedia

Jump to: navigation, search
Invited-icon.jpgA LabAutopedia invited article

Authored by: Mark F. Russo

The data stored in a plain text file must be organized in some convenient manner. Otherwise it would be impossible to discern individual data items stored in the file. Extracting data from a file is called parsing the file, and is necessary if a file’s contents are to be read.

The simplest form of plain text data file structure is the comma-separated values format, or CSV. A CSV file contains data items in tabular format, arranged as a rectangular grid of data items. Data items in a single row are separated from one another with a comma character. The end of a row is indicated with carriage-return and/or newline characters.

Parsing data from a CSV file usually starts by reading all characters from a single line in a file. Lines in a file are terminated with carriage-return and/or newline characters similar to CSV file rows. All characters in the line are split into subsections at the comma character. The characters in each subsection are often placed into an element of a newly created array. If the data are numeric, the final step is to convert each data item from a string of characters to a computer’s internal number representation so that the numeric data can be used as arguments in mathematical operations. This process is repeated into the end of the file is encountered.

Other separator characters can be used to delineate data item boundaries in a structured data file. For example, the tab character is often used as the separator instead of the comma. This is a safer approach when data items may be non-numeric. If the comma character is used as the data item separator, and a data item itself contains a comma character, the data item will be split in two at the internal comma and unpredictable behavior will result. In reality, any character can be used as the separator, providing the format is documented.

Another common method for formatting plain text data files is through the use of fixed-length fields. In this case the number of characters in each column of a file is determined before-hand. When data are written to the file, if an individual data item has fewer characters than its column, the data item is padded, usually with spaces, before it is written to the file. The result is a clean looking format, with all data items in a column left-aligning nicely one above the other in each column. The downside of fixed-width fields is that the extra padding characters can cause the file to increase in size with no information added.

Both of these file structures can be read by Microsoft Excel using its built-in file parsing tools. Data in delimited or fixed width fields of a text file are readily parsed, converted to numeric data where appropriate, and inserted into the wells of a spreadsheet. Excel is a common tool used to read and interrogate the data in files that follow one of these formats.

Unexpected problems can occur when a plain text data file in one of the described formats is opened in Excel, and then saved again. If the number of data items differs from row to row, Excel inserts extra delimiter characters to pad the number of items in each row to match the maximum number in any row. It does this in blocks of 16 rows. Extra delimiter characters may trigger errors when the modified file is read by another program.