Lab05: Some real programming!
In the exercises until now we worked with input values that were written in the beginning of the programs. This is not how real programs work. Since we now know how to read files and pass parameters, we can start writing useful programs, that work in the same way as the shell commands we saw: taking their input from files.
Since we are going to work with files, codepad is not enough anymore. In the lab you will work with gedit and the shell and put all the programs and files in a directory called lab05. In order to work from home you need Linux, so get it fast!
Reinvent the wheel! (for your own good)
We are going to write a simple replacement for the standard wc program, which, as you may remember, is the word count program (and not the water closet program).
Here is a simple program that reads a file and prints the number of lines in it. Save it in lab05 as wc.pl. Make it executable with:
chmod +x wc.pl
In order to test it you need to create a file called file_name with some text in it. Check that it prints the correct number of lines!
In this part you will need the perl function split, that works like this:
@parole = split / /, 'ciao come va';
Its first argument / / is a pattern. In this case is a simple one that matches one space (since there is one space between the slashes), but it could be almost anything. We saw some patterns working with grep in the shell. The function split splits the second argument, the string ‘ciao come va’, where the pattern matches: in this case where there are space. So @parole is an array containing all the words in ‘ciao come va’ as strings. You can find the documentation for split here. Usually googling for “perl split” will get you all the information you need (if you can read it).
You have to do the following changes:
- use the first parameter passed in @ARGV as the name of the file to be opened instead of ‘file_name’, like the real wc does;
- print also the number of characters (you will need the function length to get the number of character in a string);
- use the function split in order to split each line in an array of words and print also the number of words.
Print the number of characters first, then the number of words and finally the number of lines, separated by spaces as in the real wc program.
Submit it as usual, using hw5.1 as code.
From SAM to FASTQ… and beyond infinite
Now for something completely different and a bit more related to genomics, we are going to write a program that reads a SAM files and outputs a FASTQ file. In order to do this we need to extract the read name, the sequence and the quality from the SAM file and output them in FASTQ format.
In order to test the program you need a SAM file. Here you can download a very short SAM file, save it on disk and use it for testing. You can also use the one we created during the second lab, keeping in mind that it has no quality (the quality field is always “*”), so it should print “*” as the quality in the FASTQ you will produce. You should have it in the lab02 directory. Remember to use the correct path to it since you are now working in lab05 (remember, you use “..” to go back).
If you do not remember everything about the SAM format, take a look at your SAM file with the less command and/or read the official documentation.
Instead of starting from scratch each time you write a script, it is easier to take an old one and change it. Make a copy of the wc.pl script we just wrote and call it sam2fastq.pl (and use chmod on it). You will remove the things you do not need (the code that counts things) and keep the rest (the parts that open and read the file). If the output is very long try to use the head and/or less commands like this:
sam2fastq.pl path_to_sam_file | less
Remember to use the q key to exit from less!
The things you have to do:
- The first thing you need to do is to skip the header of the SAM file. Since all the lines of the header start with @, we only need to check for it. We can do it with the function substr by taking the first character of the line and checking it in an if. Remember to use eq or ne and not == or !=, since we need to check them as strings!
- If a line is not part of the header then it is an alignment. All lines ends with the end of line character “\n”, and we need to remove it. Perl has the function chomp that does that, just use it as chomp $line;
- Now we have to split the fields the alignment line. We do this with the split function. Since the fields are separated by tabs and not spaces, we need a different pattern for split. Use /\t/ to split on tabs.
- Split returns an array with all the fields, now you just need to use the query name field as the read name, the sequence field as the sequence and the quality field as the quality and print them in FASTQ format.
Check that your output makes sense and respects the FASTQ format!
SAM improvements (optional)
As you know, the same read can align in more than one position in the reference. This means that a read can appear multiple times in the SAM file. If this happens we are going to print the same read multiple times in our program. Avoid this and print each read just once. You can assume that multiple alignments of the same read are in consecutive lines. Hint: check the first field.
If the flag field in the SAM file is 0x16, the sequence field contains the reverse complementary of the original query sequence. Check the flag and reverse-complement the string when appropriate.
Submit it as usual, using hw5.2 as code.