SLAS

Relational Database

From LabAutopedia

Jump to: navigation, search
Invited-icon.jpgA LabAutopedia invited article


Authored by: Mark F. Russo

The most widely used method of storing moderate to very large amounts of data generated by a laboratory is the relational database. A relational database consists of files or other non-volatile memory that holds the stored data. Relational databases store more than individual data items. They also store relations among the data. It is this feature that gives the relational database its name, and provides much of its utility.

Software used to create, manage and manipulate the contents of a relational database is called a Relational Database Management System (RDBMS). Popular commercial Relational Database Management Systems include those created by Oracle, Microsoft and IBM. Open source options include PostreSQL, MySQL and SQLite.

Contents


The Relational Model

The idea for storing data with relations was first conceived of and reported by Edgar Codd in 1970 [1]. Dr. Codd offered this new model as a way to protect users from having to understand a priori the structure of an underlying database. He proposed an abstract method for storing data and discovering relationships between data without an understanding of the underlying implementation details.

In a relational database, data items are stored in fixed size groups called tuples. The data items in a tuple are associated with one another in a predefined manner called a relation. Every relation defines set of attributes, and every data item in a tuple that follows a particular relation is assigned to one of its attributes. The meaning of a relation is understood by the creators and shared with the users of the database. The relation is the core concept underlying Dr. Codd's proposal.

Tuples of data that follow a common relation are stored together in a relational database implementation using a data structure called a table. Each row in a table holds one tuple of data. Each column in a table corresponds to one of the relation's attributes, and the data item in each tuple assigned to that attribute are placed in the corresponding column. Within a relational database implementation, attributes are also called fields. The table data structure represents the complete relation within the database as it holds all members of that relation in its rows. The mapping of relation database theory with implementation is fundamental to the proper use and understanding of relational database software tools.

Example of a Relation

The ideals underlying the relational database model are best clarified using an example. Consider the problem of managing information about a set of chemical solvents (temperatures are in °C). This information is given in Table 1. The first row of the table identifies the attributes (fields) of the relation called Solvents that is represented by the table. All subsequent rows contain data for one tuple in the Solvents relation. All rows in the table comprise the complete relation.

Storing the data in this tabular format does more that simply retain the data values themselves. The meaning of each data item and its relation to the other data in the table are preserved as well.

Table1: Solvents Table
ID Name Molecular_Formula Molecular_Weight Melting_Point Boiling_Point
1 acetic acid C2H4O2 60.0524 16.6 117.9
2 acetone C3H6O 58.0798 -94.3 56.2
3 acetonitrile C2H3N 41.0524 -48 81.6
4 benzene C6H6 78.1134 5.5 80.1
5 carbon tetrachloride CCl4 153.823 -22.9 76.7
6 dichloromethane CH2Cl2 84.9328 -96.7 39.8
7 dimethyl formamide C3H7NO 73.0944 -61 153
8 dimethyl sulfoxide C2H6OS 78.1288 18.45 189
9 ethanol C2H6O 46.0688 -114.1 78.3
10 ethyl acetate C4H8O2 88.106 -83.6 77.1
11 formamide CH3NO 45.0408 2.5 210
12 formic acid C4H10O 46.0256 8.3 100.7
13 hexane C6H14 86.1766 -95 69
14 methanol CH4O 32.042 -98 64.6
15 methyl ethyl ketone C4H8O 72.1066 -86.3 79.6
16 n-butanol C4H10O 74.1224 -89.5 117.6
17 n-propanol C3H8O 60.0956 -126 97.2
18 tetrahydrofuran C4H8O 72.1066 -108.4 66
19 toluene C7H8 92.14 -93 110.6
20 water H2O 18.0152 0 100


Structured Query Language

The standard language used to retrieve and manipulate data in a relational database is called Structured Query Language, or SQL (pronounced by speaking the three letters in the acronym: S, Q, L). SQL is a declarative language, meaning the programmer writes an SQL statement that declares the intended outcome of the action. The RDBMS itself plans how to carry out the action, executes that plan, and then returns results, if any. The four basic SQL statements are called SELECT, INSERT, UPDATE and DELETE.

The SELECT statement retrieves data from a relational database. Informally, the syntax of a SELECT statement follows the following patern. Phrases in italics are to be substituted with appropriate arguments or additional statements.

    SELECT fields to select
    FROM tables to select from
    WHERE conditions that must be satisfied by all results
    additional constraints to satisfy and clauses to apply

Let's assume was want to find all entries in the Solvents table above with a boiling point greater than or equal to 100°C. Assume in our relational database we named our table of data Solvents, and all fields in the Solvents table matched the headers in the table above. The following SELECT query would return the names of the desired solvents.

    SELECT Name, Boiling_Point
    FROM Solvents
    WHERE Boiling_Point >= 100

The result from executing the query would be the following.


Name Boiling_Point
acetic acid 117.9
dimethyl formamide 153
dimethyl sulfoxide 189
formamide 210
formic acid 100.7
n-butanol 117.6
toluene 110.6
water 100


Multiple conditions can be added to a SELECT statement. If we wanted to expand the query to also limit the results to solvents with melting points less than or equal to 0°C, the following query would do the trick.

    SELECT Name, Boiling_Point, Melting_Point
    FROM Solvents
    WHERE Boiling_Point >= 100 AND Melting_Point <= 0


Name Boiling_Point Melting_Point
dimethyl formamide 153 -61
n-butanol 117.6 -89.5
toluene 110.6 -93
water 100 0


Additional functions can be added to a SELECT query. For example, it is common to order results based on the value of one or more of the fields in the query result using the ORDER BY clause.

    SELECT Name, Boiling_Point, Melting_Point
    FROM Solvents
    WHERE Boiling_Point >= 100 AND Melting_Point <= 0
    ORDER BY Boiling_Point


Name Boiling_Point Melting_Point
water 100 0
toluene 110.6 -93
n-butanol 117.6 -89.5
dimethyl formamide 153 -61


Another common SELECT query option is to group results and report aggregated statistics. The following query uses the COUNT function to return the number of solvents that match the melting point and boiling point conditions.

    SELECT COUNT(*)
    FROM Solvents
    WHERE Boiling_Point >= 100 AND Melting_Point <= 0

This query produces the following output.


Count(*)
4


INSERT, UPDATE and DELETE statements allow the user to insert new records into a table, update existing records in a table, and delete records from a table, respectively. Refer to a SQL reference for a specific RDBMS implementation for exact syntax and available options.

Derived Relations and Table Joins

The relationship between data in a single table is referred to as a base relation. These relations are like the axioms from which other facts can be inferred. With additional tables expressing other data relations, it may be possible to extract more complex relations that span multiple tables.

The first column in Table 1 holds a unique ID for each record. We can set this as the primary key of the table in the database. A primary key is made up of one or more table columns that together uniquely identify a record in the table. Primary keys are indexed by the RDBMS, which allows very fast access to a record given the value of its primary key. It is common to use an automatically generated sequence of numbers as the primary key for a table. Records in additional tables can now use the ID field to refer to a unique record in the Solvents table.

Let's assume that we would like to expand our sample database to also record the bottles of each solvent that we have in storage. A new Bottles table might look like Table 2. The table has three columns. The first is a unique number identifying each bottle in storage -- the primary key for the Bottles table. The second is the ID that identifies the solvent contained in the bottle. The Solvent_ID uniquely identifies the record in the Solvents table. The third column in Bottles is the volume of each bottle in liters. The Solvent_ID field in Bottles refers to the primary key of the Solvents table. This kind of reference is called a foreign key because it refers to a key in a different table.


Table 2: Bottles Table
ID Solvent_ID Volume
1 9 1
2 9 1
3 9 2
4 14 2
5 14 1
6 19 0.5
7 19 0.5
8 4 1


The format of the Solvents and Bottles table together create a very efficient storage strategy. Rather than repeat solvent information for every record in the Bottles table, solvent information is stored once in Solvents, and then referenced by the Bottles table using the solvent's unique primary key.

To create a listing of all the bottle volumes in storage with the name of its solvent it is necessary to run an SQL query that joins the two tables. Following is an example of an SQL query that displays bottles and solvent names by joining the two tables. Note how the specific table containing a fields is indicated by preceding the field name with the table name followed by a dot (.).

    SELECT Solvents.Name, Bottles.Volume
    FROM Solvents, Bottles
    WHERE Solvents.ID = Bottles.Solvent_ID

In words, this SQL query asks the database to "show the solvent name and bottle volume for all pairs of records in the database in which the ID field of the Solvents table equals the Solvent_ID field of the Bottles table." If the Bottles table and data are entered into the same database as the Solvents table, executing this query would produce the following result.


Name Volume
ethanol 1
ethanol 1
ethanol 2
methanol 2
methanol 1
toluene 0.5
toluene 0.5
benzene 1


This particular type of table join is called an INNER JOIN because both tables must have corresponding records that take part in the join equality condition stated in the SQL query WHERE clause. Other types of table joins called OUTER JOINS do not require matching records in joined tables in order to return a result. In that case NULL field values are returned for rows that do not match the join condition.

Solvent_ID, the foreign key reference in Bottles to a record in Solvents, sets up a higher level relationship called a derived relation. The data relation is not explicit, as it is in a single table. The relation between solvent and bottle data must be derived using an SQL query that joins the tables containing the data. The derived relation between the Solvents and Bottles table is a one-to-many relation, because for each one record in Solvents there can be many records in Bottles. Together, the set of table definitions and their relationships with one another make up a data model.

Our example database can be expanded to track the location of all solvent bottles by adding a new table called Cabinets (Table 3) as well as a new field in the Bottles table called Cabinet_ID (Table 4).


Table 3: Cabinets Table
ID Name Location
1 Cab04 Lab001
2 Cab03 Lab002
3 Cab12 Lab003


Table 4: Updated Bottles Table
ID Solvent_ID Cabinet_ID Volume
1 9 1 1
2 9 1 1
3 9 2 2
4 14 1 2
5 14 2 1
6 19 3 0.5
7 19 3 0.5
8 4 3 1


The Cabinets table includes an ID field that is also the table's primary key, a Name field for the cabinet, and a Location field indicating the laboratory that contains the cabinet. The Cabinet_ID field added to the Bottles table holds the ID that corresponds to the cabinet in which the bottle is stored.

Note that the Cabinets table is related to the Bottles table in a one-to-many relationship, similar to the Solvents table: one cabinet can contain many bottles. Also note that the relationship between Cabinets and Solvents is more complex. A cabinet can contain many solvents, and a solvent can be stored in many cabinets. This type of relation is referred to as many-to-many, for obvious reasons.

With this new structure it is possible to join database tables to ask more sophisticated questions about the relationships between the data. For example, the following SQL query finds the location and size of all bottles of ethanol. The query joins all three tables on matching solvent and cabinet IDs, and limits results to records with a solvent name that matches the string "ethanol".

    SELECT Solvents.Name, Bottles.Volume, Cabinets.Name, Cabinets.Location
    FROM Solvents, Bottles, Cabinets
    WHERE Bottles.Solvent_ID=Solvents.ID AND Bottles.Cabinet_ID=Cabinets.ID
    AND Solvents.Name = "ethanol"

Executing the query produces the following output.


Solvents.Name Bottles.Volume Cabinets.Name Cabinets.Location
ethanol 1 Cab04 Lab001
ethanol 1 Cab04 Lab001
ethanol 2 Cab03 Lab002


Technology Impact

Relational databases are widely recognized as the preferred choice for storing and manipulating scientific data of almost any kind. Relational database management systems have been deployed widely and used with great benefit. Freely available open source RDBMS options and powerful graphical interface tools have lowered the barrier of entry to levels near zero. The diversity of the approach is evident when one considers that relational databases are used for everything from managing data in large corporate data stores through data acquired by embedded instrument controllers. Relational databases are powerful highly optimized software tools, but sufficiently simple that most have no trouble learning how to use leverage their power to manage data.

References

  1. Codd, E.F. "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM, 13 (6): pp 377-387, 1970.
   



Click [+] for other articles on  The Market Place for Lab Automation & Screening  Informatics, ELN and LIMS Sample Management, Software and Storage