Lab02/2: first bioinformatics programs

This post is a part of the second laboratory session. Please refer to the main post.

Linux commands like ls and grep are just programs, but they are special because any UNIX like terminal will ship them (for free :).

Now we are going to use a Perl script and an alignment program to perform two tasks:

simulate a whole genome shotgun
align the reads produced from the shotgun

To use a Perl script the general syntax is:

perl /path/to/script.pl

The path can be either absolute (as in the example) or relative. If the script is in the same directory we are currently working in, then it’s going to be as simple as:

perl script.pl

Simulate a shotgun

This part of the practical will let you meet a Perl program! This is already made, so you’ll understand how to run it, invoking the perl interpreter. The goal is to simulate a whole genome shotgun, that is taking a reference sequence and creating a lot of short reads starting at random positions.

Preparing the genome

First we’ll prepare the reference. Last time you downloaded a genome (an E. coli strain). Open it with the “less -S” command. Supposing you are in your home folder, the command is:

less -S lab01/E85.fa

Can you see the genome? OK! Otherwise change the file name as needed 🙂

The -S switch asks less to disable word wrap (check the difference with and without). As you can see there are multiple sequences, and each one takes two lines, one of header (starting with “>”) and one with the sequence itself.

Your task here is to create a new file, called coli_scaffolds.fa and placed in the same directory as E85.fa (lab01), containing the first 20 sequences. This is going to be our “reference“, so create it with care, and test it!

We are confident that, paying attention, you should do this without problems!

Performing the shotgun

Now move to your lab02 directory. Try typing this command:

perl ../../bin/shotgun.pl

The Perl script, called shotgun.pl, and stored in /home/geno/bin, will print some text to tell us how to use it. Note that with an absolute path in mind we can obtain the same result:

perl /home/geno/bin/shotgun.pl

Apparently you need a multifasta file (the genome to be “shotgunned”) and some desired parameters to generate the output: the reads size and the expected coverage. Three arguments to be passed to the script.

Suppose you have a file called genome.fasta in your home you could type (always from the lab02 directory):

perl ../../bin/shotgun.pl ../genome.fasta 50 2

to produce a simulated 2X shotgun with reads of 50 bp. Remember that you have an E. coli genome in the lab01 directory.

Now simulate the shotgun with a command like (otherwise select the correct file path!):

perl ../../bin/shotgun.pl put_the_genome_here 60 5 &gt; coli_reads.fa

Did you understand the command? We are executing the shotgun.pl script to simulate a shotgun of the E. coli genome, saving it (redirecting the output to) a new file called coli_reads.fa.

Of course you’ll write the correct path to the coli_scaffolds.fa you created before.

Check the output of the shotgun:

head coli_reads.fa

Do a tail as well!

How many reads did you get and how long is the genome? Check it with grep and wc like in the last session. Is

Aligning the reads

Today we will use a popular alignment tool called BWA. We installed it on the system so that you can run it just typing “bwa” from the shell (try!).

Indexing the genome

As most alignment tools before aligning the reads you have to prepare the database. This step has to be done only once, after the database is created you can align as many files as needed against it.

The syntax to do this is:

bwa index reference_file.fa

You have to run this command on the coli_scaffolds.fa file (remember it’s in the lab01 directory): so go to the directory containing our “reference” and run the command on the correct file. Try listing the files you have, you should notice some new files… they have been created by this command!

Aligning the reads

After the database is created we can align the short reads (generated with the shotgun) against the reference.

The syntax is:

bwa mem reference.fa reads.fa

This program, unlike the previous one, will print an important output to the screen: the SAM alignment. To save it we have, as usual, to redirect the output:

bwa mem reference.fa reads.fa &gt; alignment.sam

Now run your own command using coli_scaffolds.fa file (remember it’s in the lab01 directory) as reference and coli_reads.fa as sequences. Save the output as aligned_reads.sam in the lab02 directory.

Check the output

Notice how the files we are creating are quite big even though we are using few sequences and a low coverage. Try a

ls -lh

to see their size! The “-lh” is a short form for “-l -h”. In almost any command, one letter options can be combined together for easier typing.

Even these relatively small files will cause trouble to standard text editors, so use less instead!

Extract information from the sam file

Our aligned_reads.sam is full of information and we can find out many interesting facts from it. For instance, our goal will be finding which reads were aligned in more than one place.

The first step is to remove the header of the sam file. Since the lines in the header start with “@” we could do something like:

grep -v "@" file.sam

Since our sam files does not include the quality (it was produce from a fasta), this is fine in this case. In general “@” can occur in the quality fields, so it is better to use:

grep -v "^@" file.sam

The “^” in “^@” has a special meaning for grep, it means “the beginning of the line”. So grep “^@” will only find the lines that start with “@” and not those that have “@” in the middle. In order to test our command we can do this:

grep -v "^@" file.sam | less

in this way we avoid printing (almost) all the sam file to the screen. Remember to change “file.sam” to the correct file!

Now we have only the alignments, but we want to get just the names of the reads. In order to do that we use a new command, cut:

cut -f 1 file.sam

This will print the first field of each line of a tab-separated file, so it works on our sam. Test it with less and our previous command as follows:

grep -v "^@" file.sam | cut -f 1 | less

You should see all the reads names there!

Now that we have the reads we want to find which ones are repeated, since this means that they were aligned more than once. In order to do this we need two new commands, sort and uniq. As the name suggests, sort just sorts its input. This is important because uniq only works correctly with sorted input. We are going to use “uniq –repeated”, that only prints repeated adjacent lines (this is why its input need to be sorted). So our command becomes:

grep -v "^@" file.sam | cut -f 1 | sort | uniq --repeated &gt; repeated_read_names

Note that while grep, cut and uniq are very fast, sort will take a while (and print nothing, so it appears to be stuck). Just be patient.

Now repeated_read_names should contain all the names of the reads that have been aligned more than one times. Now lets see where the first one was aligned. Take the first read name and use grep to find it in the sam file. Remember to quote the read name, since they contain the special “|” character!

grep "scaffold3.1|size264376-175943" file.sam

Perl & Genomics

Bioinformatics for Genomics Course – University of Padua