Data Storage Technologies
Authored by: Mark F. Russo
Numerous options exist for storing experimental data. Each has its pros and cons. The best technology for a particular purpose will depend upon individual needs.
Primary storage is where data is stored immediately after it is collected. This is usually the main memory (RAM) of the computer that controls the instrument. Primary computer memory is volatile, meaning it requires power to retain stored data. When power is cut, the data is lost.
A computer's main memory is fast: data can be written and accessed very quickly. This makes it ideal as the initial storage location for experimental data. The trade-off for increased memory speed is increased price. Soon after it is stored in primary computer memory, experimental data must be written to a larger, non-volatile secondary storage device to ensure that it is not lost.
The computer hard drive is by far the most common form of secondary computer storage, and the most common place to store experimental data immediately after it is collected. Laboratory instruments write data to files, which can then be processed further after an experiment is complete. This is possible due to the non-volatile nature of a hard drive. A computer can be shut off, cutting power to the hard drive, and turned on again at a later time. Unless specifically modified by a program running on the computer, the data stored on a hard drive will remain unchanged.
Hard drive capacity and access speed have grown consistently over time at a rate that follows an exponential curve (See Figure 1). Because the vertical axis of the figure is logarithmic, the straight-line fit of the data demonstrates that the growth is exponential. Knowledge of this model of growth is valuable when predicting the amount of data that will be locally storable, and therefore the nature of automated experiments that will be possible in the future.
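The reasoning about a straight line on a logarithmic axis can be made concrete with a short sketch. The Python below fits a line to the logarithm of capacity and extrapolates it. The function names and the capacity figures are hypothetical and chosen only for illustration; they are not the data behind Figure 1.

```python
import math

def fit_exponential(years, capacities):
    """Least-squares fit of log10(capacity) = intercept + slope * year."""
    n = len(years)
    logs = [math.log10(c) for c in capacities]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, year):
    """Extrapolate the fitted line back onto the original (linear) scale."""
    return 10 ** (intercept + slope * year)

def doubling_time(slope):
    """Years needed for capacity to double under the fitted model."""
    return math.log10(2) / slope

# Hypothetical capacities in GB (illustrative values only, not measured data).
years = [2000, 2002, 2004, 2006, 2008]
capacities_gb = [10, 20, 40, 80, 160]

intercept, slope = fit_exponential(years, capacities_gb)
print("doubling time (years):", doubling_time(slope))
print("predicted capacity in 2010 (GB):", predict(intercept, slope, 2010))
```

A good straight-line fit in log space is what justifies extrapolating; if the points curve away from the line, the exponential model no longer holds and the prediction should not be trusted.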
Most hard drives consist of a rotating magnetic platter and a head that can read data from and write data to individual platter locations. It follows that the data rates depend upon numerous physical attributes, such as the rate at which the head moves and the rate at which the platter spins. Being constructed of electromechanical (moving) parts, it is no surprise that hard drives occasionally fail. Specialized tools and skills are required to recover data from the platter of a failed hard drive. It cannot be said too frequently that regularly backing up critical data is always the best strategy for minimizing loss due to a hard drive failure.
Solid state drives offer an alternative to the conventional rotating platter design. Because they contain no electromechanical parts, they are more reliable than rotating platter designs. They also offer faster data access and less noise, because they avoid the complications that arise from the presence of moving parts. These attributes in particular can be important for the success of certain scientific experimental procedures. On the other hand, solid state drives tend to be more expensive and have lower overall capacity than conventional hard disk drives. Solid state drives are designed with the same interface as standard hard disk drives, allowing for direct replacement.
The third class of data storage is called tertiary storage because it is separated from the primary collection point by one additional layer beyond that of the secondary storage hard drive. This additional layer of separation is almost always a network of some sort.
The convenient feature of tertiary data storage devices is that they are usually set up to look and act as if they were secondary storage devices connected directly to the computer. For example, once certain tertiary storage devices are attached, their contents can be accessed as if they were files on a local hard drive. The ease with which these devices can be used makes them very popular.
The additional layer of separation between a tertiary storage device and a computer means that there is more involved in moving an individual data item from the primary collection point until it reaches a point of non-volatile storage. With more involved in data transfer, there are more ways for the transfer process to fail and for data to be lost.
Due to the increased risk of data loss, the general rule of thumb for initial storage of data collected by an automated instrument is to "store locally, copy remotely." That is, store data as fast as possible to a local non-volatile storage device, and then copy it to other locations as necessary. This may seem overly cautious. After all, one rarely experiences a failure when transferring data over a network connection. Two additional considerations should not be forgotten. First, the chances of failure increase in direct relation to the amount of data being transferred. Automation, in which data collection is migrated from a manual process to an automated one, typically comes with a large increase in the amount of data that is collected, and therefore an increase in risk. Second, because the data in this situation has not yet been stored in a non-volatile manner, loss of data means permanent loss; there will be no opportunity to recover. The practice of "store locally, copy remotely" minimizes the chance of data loss.
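The "store locally, copy remotely" rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the function names are hypothetical, and the remote location is assumed to be a network share that is already mounted so that it appears as an ordinary directory path.

```python
import hashlib
import os
import shutil
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def store_locally(data, local_path):
    """Step 1: write the data to local non-volatile storage first."""
    local_path = Path(local_path)
    local_path.parent.mkdir(parents=True, exist_ok=True)
    with open(local_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to the device

def copy_remotely(local_path, remote_dir):
    """Step 2: copy to the remote location and verify the copy.

    Returns True if the remote copy matches the local file. On any
    failure it returns False; the local file is never touched, so the
    copy can simply be retried later.
    """
    local_path = Path(local_path)
    remote_dir = Path(remote_dir)  # assumed: a mounted network share
    try:
        remote_dir.mkdir(parents=True, exist_ok=True)
        remote_path = remote_dir / local_path.name
        shutil.copy2(local_path, remote_path)
        return sha256_of(remote_path) == sha256_of(local_path)
    except OSError:
        return False
```

The design point is the ordering: the instrument data is safe on the local drive before any network transfer is attempted, so a failed or corrupted transfer costs only a retry.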
Direct Attached Storage vs Network Attached Storage and Storage Area Networks
A data storage device that is connected directly to a computer is referred to as Direct Attached Storage or DAS. Secondary storage, such as a computer's hard drive, is a form of DAS. As mentioned, the addition of a network layer between a computer and a storage device makes it tertiary storage. There are two kinds of tertiary storage: Network Attached Storage and Storage Area Networks. These two styles of tertiary storage are differentiated by where the network is inserted into the process (See Figure 2).
Network Attached Storage
In the case of a Network Attached Storage device, the network is inserted between the application software and the remote file system. An application accesses software on a local operating system that transmits file system commands over the network to be executed by the remote file system. File data is also transferred back and forth over the network between the application and the remote file system. The special software is wedged into the local operating system so as to make remote file system access seamless with local file system access. The wedge software is referred to as a Network File System.
Storage Area Networks
In contrast with Network Attached Storage, a Storage Area Network (SAN) inserts the network between the local file system and the remote storage device. In this case the local file system accesses data on the SAN storage device. Specialized hardware protocols are used for communication with the device over a network. This contrasts with the specialized file system network protocols that are used to communicate with Network Attached Storage.
A Storage Area Network tends to offer a simpler path to growing and shrinking available storage by adding SAN devices to a network or removing them from it. Network Attached Storage allows one to make use of existing secondary storage devices that may be underutilized.
When on-demand online access to data is not necessary, offline storage offers a good cost-effective option for non-volatile data storage. Offline storage usually refers to removable storage media such as data tape formats, optical formats like CD-ROM and DVD, and even hot-swappable solid state disk drives. The common feature among these options is that once data storage or data access is complete, the storage device can be disconnected from the computer or network and placed into storage for later access.
Offline storage is ideal when large amounts of data must be stored for legal or regulatory purposes, but ongoing access to the data is not necessary. It's there if you need it, at a small cost of time to reattach the storage device and find the data of interest. Another important function of offline storage is to hold data associated with regular backups of computer systems. Good system maintenance practices include regularly making complete copies of a hard drive on computer systems in production to provide a way to recover from those unexpected disasters. When your automated system's waste tubing springs a leak over your main controller, a complete image of the controller's hard drive will be invaluable.
Magnetic storage media is generally the less expensive option among offline storage media options. A single bit of data is stored at a specific location as a modification in the magnetic properties of a magnetizable material. Magnetic media include everything from magnetic tape to various forms of magnetic disks, such as the floppy disk or the Zip disk. While this form of storage used to be the primary option for offline storage, it is now used only for mass storage of very large amounts of data, typically to tape. Access time is very slow as the tape must be wound to the point of data storage and read off the tape.
Optical storage media is a more common choice among the options for offline storage. A single bit of data is stored at a specific location as a modification in the ability of the optical material to reflect laser light. Writing data to an optical storage device is usually referred to as "burning" the data, which actively changes the optical properties of the optical surface. Common options among optical media include the CD-ROM and the DVD, both in a disc format, usually made of polycarbonate.
One way to effectively expand the capacity of offline storage devices, and to enable access to offline data without the need for human intervention, is to store offline storage media in a robotic device or jukebox of some sort. A request for data that is stored on a given tape or disc results in a mechanical device fetching the storage media and loading it into a reader. In the case of magnetic tapes the mechanical device is referred to as a tape library, and often includes a robotic arm that fetches and loads tapes (See Figure 3). In the case of optical storage discs the mechanical device that fetches and loads discs is often called a jukebox. This is due to the way the whole process resembles the way old-fashioned 45 RPM music discs were stored and swapped in coin-operated music jukeboxes popular in the 1950s, when Mark first played in a band.
USB Flash Drives
A third form of temporary offline storage is a device referred to by several names, including a USB flash drive, thumb drive or jump drive. The USB flash drive is a small non-volatile solid state storage device with an integrated USB plug. When the drive is plugged in to a USB port on a computer and the operating system recognizes that fact, the flash drive appears as a new hard drive that is treated like any other direct attached storage device; files can be directly created, copied, or deleted. Flash memory on the device is used to store the data, and data is accessed using a standard that is recognized by all modern operating systems.
Relatively speaking, the capacity of these pocket-sized storage devices is low compared to other solid-state disk drives. This limitation is offset by their small size and convenience. The flash drive has replaced the obsolete floppy disk as the means for ad hoc transfer of data between a laboratory computer and a desktop computer, a process often referred to as "Sneakernet".
Hierarchical storage management is a data storage approach that combines the best features of tertiary storage with those of offline storage. The term hierarchical describes the technology well because tertiary and offline storage are combined in a hierarchical manner. All data is stored in an offline storage device such as a robotic tape library or an optical jukebox. Frequently accessed data is copied to a tertiary store such as an array of disk drives. The system automatically tracks data access frequency and copies the most accessed data from the offline store to the tertiary store to shorten access times. Hierarchical storage management is most effective when one has an extremely large amount of data that must be stored and made available online, but only a relatively small amount of the data is repeatedly accessed during any one short period of time.
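The promotion logic described above can be illustrated with a toy Python model. This is only a sketch under simplifying assumptions: the class and method names are hypothetical, both tiers are plain in-memory dictionaries standing in for a tape library and a disk array, and real hierarchical storage management products use far more elaborate policies.

```python
from collections import Counter

class HierarchicalStore:
    """Toy two-tier store: a complete offline tier plus a small online tier."""

    def __init__(self, offline, online_capacity):
        self.offline = dict(offline)      # stand-in for tape library or jukebox
        self.online = {}                  # stand-in for the disk array
        self.online_capacity = online_capacity
        self.access_counts = Counter()    # how often each item has been read

    def read(self, key):
        """Return (value, tier), where tier names the tier that served the read."""
        value = self.offline[key]         # raises KeyError for unknown items
        self.access_counts[key] += 1
        if key in self.online:
            return self.online[key], "online"
        return value, "offline"           # slow path: fetched from offline media

    def rebalance(self):
        """Copy the most frequently accessed items into the online tier."""
        hottest = self.access_counts.most_common(self.online_capacity)
        self.online = {key: self.offline[key] for key, _ in hottest}

# Example usage with hypothetical data set names.
store = HierarchicalStore({"run_a": 1, "run_b": 2, "run_c": 3}, online_capacity=1)
for _ in range(3):
    store.read("run_a")
store.read("run_b")
store.rebalance()
print(store.read("run_a"))  # served from the online tier after rebalancing
print(store.read("run_c"))  # still served from the offline tier
```

Note that the offline tier always holds everything; the online tier only ever contains copies, so dropping an item from it during rebalancing loses no data.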