Organizing Laboratory Data

From LabAutopedia

A LabAutopedia invited article

Authored by: Mark F. Russo

The practice of laboratory automation tends to emphasize the generation of data. Diverse technologies are applied in such a way as to make it possible to do experiments more quickly, for longer periods of time, with less variability, or with fewer resources. Data, the end product of executing an automated method, will be stored and provided to scientists for further analysis.



Design the data organization

It is easy to forget that part of designing or streamlining the execution of an automated system involves considering more than just the hardware. Careful consideration of the manner in which data is stored and organized is also an important part of the process. Avoiding or forgetting this step amounts to missing an important opportunity to streamline the larger process. In fact, failing to properly organize experimental data collected from an automated system can lead to a failed project if the data is not helpful to the scientists involved.

Maintain knowledge of the latest technologies

Technology changes all the time. It is the responsibility of a laboratory automation engineer to remain knowledgeable about the options, including those for storing and organizing experimental data. It is easy to select a data storage technology simply because it is familiar. Even seasoned laboratory automation engineers tend to go with a solution that they are comfortable with and have used numerous times in the past. No single technology will solve all problems, and new technologies become available all the time. Staying current with the state of the art in data storage and organization is an important responsibility of the laboratory automation engineer, and reconsidering the fit of a given toolkit is almost always worth the time invested.

Capture parameters that answer questions

One of the most important tenets to keep in mind when selecting tools for experimental data storage is to design your data store in a way that allows scientists to ask and answer their questions. As obvious as this seems, it is worth revisiting often so that it is not forgotten while you are focused on the many other details that must be juggled in a complex project. If it is not easy to ask and answer the scientific questions of interest, the automated system is not useful.

It is not hard to remember to capture the data that links a given set of experimental results with a sample of interest, but what about the other parameters that characterize that experiment? For example, is it important to track the reagents used to perform the experiment? What about reagent manufacturer and lot numbers? Consider the instruments being used. When was the last time an instrument was calibrated, and what was the outcome of that calibration? Can the operator, or the operator's shift, have an impact on the outcome? The number of variable parameters that can affect the outcome of an experiment can be surprising. Knowledge revealed later can improve the data's value only if sufficient key attributes were recorded at the time the experiment was performed. Deciding which parameters to capture is much easier if you consider the kinds of questions that may be asked and ensure that a given approach to data storage and organization makes the answers possible.
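One way to make these decisions concrete is to write down the record you intend to store for each result. The following is a minimal sketch in Python; every field name and value is an illustrative assumption rather than a prescribed schema, and a real project would choose fields based on the questions its scientists expect to ask.

```python
from dataclasses import dataclass, asdict
from datetime import date, datetime


@dataclass(frozen=True)
class ExperimentRecord:
    """One result plus the context needed to interpret it later.

    All field names are hypothetical examples of the parameters
    discussed above (reagent, instrument, calibration, operator).
    """
    sample_id: str
    result_value: float
    result_units: str
    reagent_name: str
    reagent_manufacturer: str
    reagent_lot: str
    instrument_id: str
    last_calibration: date
    calibration_passed: bool
    operator: str
    shift: str
    recorded_at: datetime


def results_for_lot(records, lot):
    """Answer the question: which results used a given reagent lot?"""
    return [r for r in records if r.reagent_lot == lot]


# Made-up example values, for illustration only.
record = ExperimentRecord(
    sample_id="S-0001",
    result_value=0.412,
    result_units="AU",
    reagent_name="substrate buffer",
    reagent_manufacturer="Example Reagents Inc.",
    reagent_lot="LOT-42",
    instrument_id="reader-01",
    last_calibration=date(2024, 1, 15),
    calibration_passed=True,
    operator="operator-a",
    shift="night",
    recorded_at=datetime(2024, 2, 1, 23, 30),
)

# asdict() gives a plain dictionary that is easy to write to a file or database.
row = asdict(record)
```

A question such as "which results were produced with lot LOT-42?" can only be answered because the lot number was stored with each result at collection time.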

Write Locally. Copy Remotely.

Modern networks are very reliable, but they are not perfect. If data collection from an automated laboratory system depends on a network connection that must always be available, sooner or later data collection will fail due to a network interruption. A rule of thumb to observe whenever possible is that long-running automated systems should not write data directly to storage devices connected over a network. A more reliable approach is to write data to a directly attached storage device, and then copy the data to a network-attached storage device when possible. Not only does the practice of "write local, copy remote" allow you to avoid network connectivity problems, it also provides a temporary local cache of the data, should downstream processes cause a loss of data.
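The "write local, copy remote" practice can be sketched as two separate steps. The example below is a minimal illustration in Python, assuming results are written as .csv files and that the network share is reachable as an ordinary directory path; the function names write_local and copy_pending are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def write_local(local_dir: Path, name: str, text: str) -> Path:
    """Write a result file to directly attached storage."""
    local_dir.mkdir(parents=True, exist_ok=True)
    path = local_dir / name
    path.write_text(text, encoding="utf-8")
    return path


def copy_pending(local_dir: Path, remote_dir: Path) -> List[str]:
    """Copy local .csv files that are not yet present remotely.

    Returns the names of the files copied. If the remote location is
    unavailable (any OSError), the loop stops quietly; the files remain
    in the local directory and are retried on the next call.
    """
    copied = []
    for src in sorted(local_dir.glob("*.csv")):
        dst = remote_dir / src.name
        if dst.exists():
            continue  # already copied on an earlier pass
        try:
            # Copy under a temporary name, then rename, so a partially
            # copied file is never mistaken for a complete one.
            tmp = dst.with_name(dst.name + ".part")
            shutil.copy2(src, tmp)
            tmp.replace(dst)
            copied.append(src.name)
        except OSError:
            break  # remote unavailable; try again later
    return copied
```

In this sketch the acquisition code calls only write_local, so a network outage cannot interrupt data collection. A scheduled job or background loop would call copy_pending periodically.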

Seek out and use ontologies

In spite of the obvious value of precise communication in the practice of science, vague and ambiguous terminology is in wide use between and even within various scientific fields. Problems that result from imprecise terminology tend to be more subtle than errors caused by a mismatch between English and metric units of measurement. A universally adopted set of terms for the concepts and relationships within a given scientific field could have dramatic benefits by providing for electronic information storage that enables widespread data mining and data comparison. Such a collection of terms with defined meanings is called an ontology.

The terminology used in an ontology must be drawn from a controlled vocabulary. This applies to both the concepts and the relationships that are expressed in the ontology. This common use of terms makes it possible to do much more with the information. Following an ontology has numerous benefits, such as supporting automated reasoning using techniques from artificial intelligence, and making the information available to modern information organization efforts such as the semantic web.

Numerous ontologies have been developed for widely diverse domains of discourse. Following the terminology and meaning of a widely accepted ontology can have benefits down the road. A shared ontology will make it possible to compare one's own data with data available in the broader scientific community.
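A full ontology also defines relationships between concepts, but even the controlled-vocabulary part can be enforced at the point where data is recorded. The following Python sketch assumes a tiny, made-up vocabulary (CONTROLLED_TERMS) and a hypothetical helper, normalize_term; in practice the terms would come from a published ontology rather than being written by hand.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted variants.
# These entries are illustrative only and do not come from a real ontology.
CONTROLLED_TERMS = {
    "temperature": {"temp", "temp."},
    "concentration": {"conc", "conc."},
}


def normalize_term(raw):
    """Map a free-text label to its preferred term, or reject it."""
    key = raw.strip().lower()
    for preferred, variants in CONTROLLED_TERMS.items():
        if key == preferred or key in variants:
            return preferred
    raise ValueError(f"'{raw}' is not in the controlled vocabulary")


column_name = normalize_term(" Conc. ")
```

Rejecting unknown terms when data is recorded, rather than silently storing them, keeps stored labels comparable across experiments.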


The most widely used unit of data storage for laboratory data is the data file. Files can be created, deleted, copied, manipulated, and organized using a standard computer file system and the file attributes that the file system can assign to a file. There are numerous ways to organize data in a file; the method chosen should match as closely as possible the natural organization of the data and the way the data will be used after collection.

Choose file formats wisely

Data that is naturally organized as a single table is conveniently stored in plain text format as a structured text file, such as a CSV (comma-separated values) file. Using a plain text CSV format provides maximum flexibility. For example, all spreadsheet applications and many other data handling tools provide for the importation of data that is stored in a CSV format.
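As an illustration of single-table data in CSV form, the sketch below uses Python's standard csv module to write a small table and read it back. The column names and values are made up, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Made-up rows; values are kept as text because CSV stores only text.
rows = [
    {"sample_id": "S-0001", "well": "A1", "absorbance": "0.412"},
    {"sample_id": "S-0002", "well": "A2", "absorbance": "0.398"},
]

# Write the table, header row first.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["sample_id", "well", "absorbance"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back; each row becomes a dictionary keyed by the header names.
parsed = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```

Because CSV carries no type information, numeric columns come back as strings and must be converted by the reader, for example with float(row["absorbance"]).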

Data that is inherently hierarchical in nature is best stored using a format that is based on Extensible Markup Language (XML). All major programming languages provide toolkits that parse XML data files, and tools are widely available to transform data from one XML format to another. Furthermore, numerous XML-based format standards have been designed for scientific data of all kinds. Using an existing standard increases the chances that your hierarchical data can be imported into an application with little effort.
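To show what hierarchical data looks like in XML, the following sketch uses Python's standard xml.etree.ElementTree module to build and then parse a plate containing wells, each with a reading. The element and attribute names (plate, well, reading, barcode, units) are invented for illustration and do not follow any particular published standard.

```python
import xml.etree.ElementTree as ET

# Build a small hierarchy: plate -> well -> reading.
plate = ET.Element("plate", barcode="P-0001")
for well_name, value in [("A1", 0.412), ("A2", 0.398)]:
    well = ET.SubElement(plate, "well", name=well_name)
    reading = ET.SubElement(well, "reading", units="AU")
    reading.text = str(value)

# Serialize to text, as it would be stored in a file.
xml_text = ET.tostring(plate, encoding="unicode")

# Parse the text back and walk the hierarchy.
root = ET.fromstring(xml_text)
readings = {
    w.get("name"): float(w.findtext("reading"))
    for w in root.findall("well")
}
```

Where a suitable XML-based standard already exists, adopting its element names instead of inventing your own is what makes the file importable elsewhere.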

Some scientific laboratory applications provide data as reports, often in a format such as the PDF (Portable Document Format). These output formats are useful for human interpretation, but they are not of much use for machine loading and manipulation. Extracting data from a format such as PDF is a difficult task.

No matter what the format of the data in a file, it is imperative that the data can be extracted (i.e., that it can be parsed). Data is parsable when data elements are unambiguously delineated in a file, such as with special characters (the comma or tab character), or when text fields of fixed character widths have been defined in the file. Hierarchical data must be clearly indicated in a file using embedded structures, such as a table within a table. Extraneous comments and symbols can make data extraction difficult or impossible when they contain the same characters that are used to delineate data elements or to construct hierarchical structures.
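The problem of a comment containing the delimiter can be demonstrated in a few lines. In this Python sketch, with made-up values, a free-text comment contains a comma: joining fields naively produces a line that no longer parses into three fields, while the standard csv module quotes the comment so that it can be recovered intact.

```python
import csv
import io

# Three fields; the comment itself contains the delimiter.
record = ["S-0001", "0.412", "cloudy, re-read after mixing"]

# Naive approach: the comma inside the comment splits it into two fields.
naive_line = ",".join(record)
naive_fields = naive_line.split(",")

# csv module: the comment is wrapped in quotes when written, and the
# reader treats the quoted text as a single field.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(record)
quoted_line = buffer.getvalue()
quoted_fields = next(csv.reader(io.StringIO(quoted_line, newline="")))
```

Here naive_fields has four elements instead of three, whereas quoted_fields is identical to the original record.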

Use open/standard formats whenever possible

The lore of laboratory automation includes stories about laboratories that have found it necessary to maintain computers running ancient, unsupported operating systems. The reason for this unfortunate situation is that the lab chose a proprietary data format standard that is readable only by software that is no longer supported, and will not run on computers with modern operating systems. Even storing data in a binary office application format can lead to problems in the future.

Avoid this situation by choosing open data format standards. Because an open standard format is well known and published, it greatly reduces the risk that your data will be lost due to an inability to read a file. Using an open format also enables the sharing and comparison of your data with others. It opens up the possibility of loading data into numerous applications that read the format, thereby making it much more useful.

Relational Databases

The most widely used method of storing moderate to very large amounts of data generated by a laboratory is the relational database. A relational database consists of files or other non-volatile memory that holds stored data. Relational databases store more than individual data items. They also store relations among the data. It is this feature that gives the relational database its name, and provides much of its utility.

Software used to create, manage and manipulate the contents of a relational database is called a Relational Database Management System (RDBMS). All modern RDBMSs implement some form of the language called SQL, which is used to build and manipulate relational databases.
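As a small illustration of storing relations along with data, the sketch below uses SQLite through Python's standard sqlite3 module. The table and column names are assumptions made for the example; the same SQL ideas apply, with dialect differences, to other RDBMSs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces these only on request

# Two tables; result.sample_id records the relation between them.
conn.executescript("""
CREATE TABLE sample (
    sample_id   TEXT PRIMARY KEY,
    description TEXT NOT NULL
);
CREATE TABLE result (
    result_id INTEGER PRIMARY KEY,
    sample_id TEXT NOT NULL REFERENCES sample(sample_id),
    assay     TEXT NOT NULL,
    value     REAL NOT NULL
);
""")

conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [("S-0001", "control"), ("S-0002", "treated")],
)
conn.executemany(
    "INSERT INTO result (sample_id, assay, value) VALUES (?, ?, ?)",
    [
        ("S-0001", "absorbance", 0.412),
        ("S-0002", "absorbance", 0.398),
        ("S-0002", "absorbance", 0.401),
    ],
)

# Use the stored relation to summarize results per sample description.
rows = conn.execute("""
    SELECT s.description, COUNT(*), AVG(r.value)
    FROM result AS r
    JOIN sample AS s ON s.sample_id = r.sample_id
    GROUP BY s.description
    ORDER BY s.description
""").fetchall()
```

With foreign keys enabled, the database rejects a result that refers to a sample that does not exist, so the relation itself is protected, not just the individual values.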

Optimize data models with a purpose in mind

As is the case with any data organization technology, a particular instance of a relational database has a design that specifies how data will be arranged and related. As always, design a relational database with the scientist's needs in mind. Always ask whether the design supports the scientist's ability to get answers to important scientific questions.

Relational database designs may involve trade-offs between size and execution speed. A database that is designed with the best design principles in mind may well be the slowest to execute. Compromises, such as repeating the identical data item in multiple locations in the database, may be necessary to increase performance. The utility of a final design is related to how well it supports your scientific goals.
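The compromise of repeating a data item can be sketched with two alternative layouts for the same results, again using SQLite through Python's sqlite3 module. All table names and values are hypothetical, and the example makes no claim about actual speed; it only shows that the repeated layout avoids a join when reading but multiplies the rows that must change when the repeated item is corrected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Layout 1: the manufacturer is stored once, in the reagent table.
CREATE TABLE reagent (
    lot          TEXT PRIMARY KEY,
    manufacturer TEXT NOT NULL
);
CREATE TABLE result_normalized (
    result_id INTEGER PRIMARY KEY,
    lot       TEXT NOT NULL REFERENCES reagent(lot),
    value     REAL NOT NULL
);
-- Layout 2: the manufacturer is repeated on every result row.
CREATE TABLE result_denormalized (
    result_id    INTEGER PRIMARY KEY,
    lot          TEXT NOT NULL,
    manufacturer TEXT NOT NULL,
    value        REAL NOT NULL
);
""")

values = [0.41, 0.40, 0.39]
conn.execute("INSERT INTO reagent VALUES ('LOT-42', 'Acme')")
conn.executemany(
    "INSERT INTO result_normalized (lot, value) VALUES ('LOT-42', ?)",
    [(v,) for v in values],
)
conn.executemany(
    "INSERT INTO result_denormalized (lot, manufacturer, value) "
    "VALUES ('LOT-42', 'Acme', ?)",
    [(v,) for v in values],
)

# Reading: layout 1 needs a join, layout 2 reads a single table.
joined = conn.execute("""
    SELECT r.value, g.manufacturer
    FROM result_normalized AS r
    JOIN reagent AS g ON g.lot = r.lot
    ORDER BY r.result_id
""").fetchall()
flat = conn.execute(
    "SELECT value, manufacturer FROM result_denormalized ORDER BY result_id"
).fetchall()

# Correcting the manufacturer name: one row versus one row per result.
rows_changed_normalized = conn.execute(
    "UPDATE reagent SET manufacturer = 'Acme Labs' WHERE lot = 'LOT-42'"
).rowcount
rows_changed_denormalized = conn.execute(
    "UPDATE result_denormalized SET manufacturer = 'Acme Labs' "
    "WHERE lot = 'LOT-42'"
).rowcount
```

Both queries return the same rows, so the choice between the layouts comes down to which operations matter most for the scientific questions being asked.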

