Last laboratory: queries on a database

mysql logoWelcome back guys!

Please, start reviewing what databases are so that we all have a common background when talking about tables and fields. Then we prepared a very short introduction to the SQL language:

These tasks are quite common, but not so useful to understand how databases works so we’ll just have a look to them.

Today you’ll play with the Linux machine with a good level of independence. Tomorrow we’ll use the CIV computer room where there is the possibility to use a projector for a more “guided” and commented experience. Consider this as tomorrow we will practice more with queries… while only today you’ll have the possibility to play with Perl if you wish to do so 🙂

4NGS is a MySQL-powered website

Connect to 4ngs and have a look (username and password as previously supplied). What you see is a script (in PHP, but it’s very very similar to Perl) that access to a MySQL database. Same for this blog (php is more popular for websites because of its tight integration with HTML).

The structure of two main tables of 4NGS

In particular there is a database called “nannov1” with two main tables:

  • contigs (contains information about each contig)
  • arcs (contains connections between contigs)

Your first task is to type the command to understand the structure of the two tables (list of fields). I prepared a sort of “virtual shell” where you can type SQL commands. In this post you’ll find the command.

This is your first experience with a SQL command. You should get two tables with the list of fields, and their type.

First query: all the records

Lets try some query now. First try with “SELECT * FROM contigs“. Then try “SELECT id, name, len, cov FROM contigs“. As you can see this is the basic structure of a query: “SELECT which_fields FROM which_table”. It returns all the records (so: how many contigs has the alga?). As you probably guessed the “*” is a wildcard character and means all-fields.

Second query: which records (and which fields)

A query is a fast way to select specific data, and to do so we should expand our previous queries with the “WHERE conditions…” statement. Try with “SELECT id, name, ROUND(cov), len FROM contigs WHERE cov<30 AND len>2000“. 

Note that we added a function (round) to the coverage field so that we could get rid of the decimals. Then we added a simple filter to select only contigs with coverage lower than 30X and length greater than 2kbp.

Type your own query to further reduce the list removing contigs with coverage lower than 15X.

To count how many record satisfy your query you can simply type: SELECT count(*) FROM contigs WHERE cov<30.

Another interesting function is “sum” like in:

SELECT count(id), sum(len), round(avg(cov)) FROM contigs

Extending more a query

We can expand further the query with this structure:

SELECT fields FROM table WHERE conditions ORDER BY fields LIMIT howmany

See these examples to understand what this means:

  • SELECT name, cov, len FROM contigs WHERE cov<50 ORDER BY len
  • SELECT name, cov, len FROM contigs WHERE cov<50 ORDER BY len DESC
  • SELECT name, cov, len FROM contigs WHERE cov<50 ORDER BY len DESC LIMIT 15

Tomorrow we will play again with MySQL itself, performing more complex queries and understanding a little bit how databases are designed. Let’s take a rest for now.

A simple link between Perl and databases

For those of you remembering some Perl… let’s see a simple way to populate a database. We can write a list of SQL statements into a file and then asking MySQL to execute them. So if we have a Perl script that prints these instructions into a file, we are done!

Example:

perl AddRecords.pl > newrecords.sql
mysql < newrecords.sql

The newrecord.sql file could be like:

INSERT INTO contigs (name, cov, len) VALUES ('contig00001b', 12.3, 1000);
INSERT INTO contigs (name, cov, len) VALUES ('contig00002b', 19.3, 9390);
INSERT INTO contigs (name, cov, len) VALUES ('contig00003b', 22.3, 1230);

Suppose that you want to add some field to existing records:

UPDATE contigs SET scaffold='S012', sid=1 WHERE name='contig0001';
UPDATE contigs SET scaffold='S012', sid=2 WHERE name='contig0201';
UPDATE contigs SET scaffold='S012', sid=3 WHERE name='contig0342';

Note that when updating a table if the WHERE condition apply to more than one record you’ll update all of them. For example the statement:

UPDATE contigs SET name='noname' WHERE len>100;

would reset the name to noname to all contigs of the database. It is thus a very good practice to test the “where” statement with a SELECT query.

So try writing a simple script that populate a database. You can make up your own or choose one of the following.

1) PCR list to SQL
Parse the “/home/geno/tools/list.pcr” and take only the lines starting with a #. Then print SQL commands like:

INSERT INTO pcrs (user, code, from, to) VALUES ('geno-30', 'PCR147', 'contig03424', 'contig1320');

2) Primers to SQL

Parse the “/home/geno/tools/primers.csv” file and store the following fields (pcrcode,  primerfor, primerrev) into a table called “primers”.

3) Multi fasta to SQL
Parse the  “/home/geno/tools/contigs.fna” file and produce a script that for each sequence writes a line similar to:

INSERT INTO contigs (name, len) VALUES ('contig00001', '502256');

Consider that if you look at the header of each sequence, it contains all you need to do this. BUT you can do a general-purpouse script that actually parse the whole file to set the length of the sequence.

Save the script in your home directory calling it sql.pl.

A simple template that could be useful (but see also previous scripts for MultiFASTA parsing etc):

#!/usr/bin/perl
$filetoparse = '/path/to/file';
open(I, $filetoparse) || die "FATAL ERROR: Unable to open $filetoparse.\n\n";
while (<I>) {
chomp;
# do the parsing...
print "INSERT INTO ...;\n";
}

Tagged ,

Leave a Reply

Your email address will not be published. Required fields are marked *