First practical

Welcome to the first practical. In this session you’ll start learning how to use Linux and its command line.
Please keep in mind that all the tools we present day by day are needed to complete the final assignment, so try to understand every step rather than rushing to the end.

Feel free to ask us if you run into trouble.

How to do the practicals

Every practical will ask you to produce some files, to be stored in a particular directory inside your home folder. It’s important that you follow the instructions carefully and place each file in the appropriate directory: we’ll automatically check the files produced at the end!

01. Introduce yourself to the PC

First you’ll create a directory for this first practical inside your home folder. Call it “lab01” (remember that in Linux the names are case sensitive).

To create the directory, open the Terminal (Applicazioni –> Accessori –> Terminale; its icon will appear after a short while, then click it).

In the terminal, make sure that you are in your home folder (with the pwd command), then use mkdir to create the lab01 directory.
Now use ls to check that you created it.

Now enter the lab01 directory using the cd command, and check with pwd that you are really there. Remember to use the TAB key to complete file names: it saves lives!
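
Put together, the steps of this task might look like the sketch below. It is only one possible way to do it: it starts with cd ~ so that it works from any directory, and it uses mkdir -p (an addition to the plain mkdir mentioned above) so that the command does not fail if lab01 already exists.

```shell
cd ~            # go to your home folder
pwd             # print the current directory: it should be your home folder
mkdir -p lab01  # create the directory; -p avoids an error if it already exists
ls              # lab01 should now appear in the listing
cd lab01        # enter the new directory
pwd             # the path should now end with /lab01
```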

Now, using the following command, you’ll create a file with your name written in it. For the moment don’t worry too much about the syntax: we’ll unravel it gradually.

echo "Name Surname" > lab01.txt
In short, the echo command prints the arguments we pass to it on the screen, and with “>” we redirect that output into a file.
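
To see the difference between printing on the screen and redirecting, you could try the small experiment below. It is only a sketch: the scratch directory created with mktemp and the file name demo.txt are assumptions made for the example, chosen so that your lab01.txt is left untouched.

```shell
cd "$(mktemp -d)"               # scratch directory (an assumption, to keep lab01 untouched)
echo "Name Surname"             # without ">", the text is printed on the screen
echo "Name Surname" > demo.txt  # with ">", nothing is printed: the text goes into demo.txt
cat demo.txt                    # shows the content of the file
echo "Another Name" > demo.txt  # ">" overwrites: the previous content is replaced
cat demo.txt                    # now shows only the second name
```

Note that “>” silently replaces the content of an existing file, so check the file name before pressing Enter.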

02. Create a multifasta file

Go to the NCBI website and look for gene sequences belonging to E. coli. Now open the text editor (Applicazioni->Accessori->Editor di testo…) and paste them into a file, one after the other: a fasta file holding several sequences is called a multifasta file. Save it in the lab01 directory that you created in step 01 and call it sequences.fa (as always, both where you save it and the name you give it DO matter).
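
If you are unsure what a multifasta file should look like, the sketch below builds a tiny one. Everything in it is made up for illustration: the file name example.fa, the headers and the sequences are placeholders, not real E. coli genes, and the scratch directory is an assumption so that your sequences.fa is not affected.

```shell
cd "$(mktemp -d)"        # scratch directory (an assumption, to keep lab01 untouched)
# Write the lines between the two EOF markers into example.fa
cat > example.fa << 'EOF'
>seq1 made-up header of the first sequence
ACGTACGTACGTACGTACGT
ACGTACGT
>seq2 made-up header of the second sequence
TTGGCCAATTGGCCAA
EOF
cat example.fa           # each record is a ">" header line followed by sequence lines
```

Each sequence starts with a header line beginning with “>”, and the sequence itself may span several lines.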

Now go to the Terminal and ensure you are in the lab01 directory.

Try inspecting the content of the file you created with the following command:

cat sequences.fa

When dealing with large files it’s handy to have a “preview” of just the first lines (head) or the last lines (tail); by default each command prints 10 lines. Try:

head sequences.fa

tail sequences.fa

Now, suppose you want the header of the first sequence (i.e. the first line of the file):

head -n 1 sequences.fa

To save it into a separate file, type:

head -n 1 sequences.fa > first-header.txt

Now use cat to verify that the file you created contains what you expected.
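
The effect of head, tail and the -n switch is easy to see on a file whose lines are numbered. The sketch below uses seq to build such a toy file; the scratch directory and the names numbers.txt and first-line.txt are assumptions made for the example.

```shell
cd "$(mktemp -d)"                       # scratch directory (an assumption)
seq 1 20 > numbers.txt                  # toy file: the numbers 1 to 20, one per line
head numbers.txt                        # first 10 lines (the default)
tail numbers.txt                        # last 10 lines (the default)
head -n 3 numbers.txt                   # only the first 3 lines
tail -n 3 numbers.txt                   # only the last 3 lines
head -n 1 numbers.txt > first-line.txt  # save the first line in a separate file
cat first-line.txt                      # prints 1
```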

03. Downloading files with the terminal

Now suppose you want to download a file from the Internet, using the CLI. One of the possible commands to do it is wget. Its syntax is wget URL.

Still working in the lab01 directory, type this command:

wget "ftp://ftp.sanger.ac.uk/pub/pathogens/Escherichia/coli/E85.fa"

The E85.fa file contains an assembly of E. coli, kindly hosted by the Sanger Institute.

Now try printing its content with “cat”.

04. Less is more

As you saw in task 03, printing a long file with cat is not practical: the text just rushes past on the screen. The “less” command lets you view a file interactively, scrolling through it with keyboard commands such as:

space bar: go down one page
arrows: scroll up/down/left/right
G: skip to the end
g: skip to the top
q: exit from the less program

So try typing “less filename”, where filename is the E. coli genome you already downloaded.

05. Search in files

We know that every sequence in a fasta file begins with a header line starting with the “>” character. Since “>” is also a special character for the shell, used to redirect the output, we must remember to put it in quotes.

The following command will print the lines containing a “>” in them, in the specified file. Try the command both with the file you created, and with the E. coli assembly.

grep ">" fastafile
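
Here is a small sketch of the same search on a toy file, so that you know what output to expect. The scratch directory, the file name toy.fa and its two sequences are assumptions made up for the example.

```shell
cd "$(mktemp -d)"                             # scratch directory (an assumption)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > toy.fa  # toy file with two made-up sequences
grep ">" toy.fa                               # prints the two header lines: >seq1 and >seq2
# Do NOT run the unquoted form "grep > toy.fa": the shell would treat ">" as a
# redirection and empty toy.fa before grep even starts.
```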

06. Counting lines

How many lines does a file have? By default the wc command counts the lines, words and bytes in a file; with the -l switch it prints only the number of lines, and with -m the number of characters.

wc -l E85.fa

will print the number of lines in the file. How many characters make up the file? Use wc -m for this.

Suppose that we want to know how many sequences are present in a fasta file. How would you do it?

We can combine the grep and wc commands: the former selects the lines with a “>”, the latter counts them:

grep ">" E85.fa | wc -l
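
The same pipeline can be tried on a toy file where the answer is known in advance. The sketch below assumes a made-up file toy.fa in a scratch directory; it also shows grep -c, a standard grep switch not used elsewhere in this practical, which counts the matching lines directly.

```shell
cd "$(mktemp -d)"                               # scratch directory (an assumption)
printf '>seq1\nACGT\nAC\n>seq2\nGGG\n' > toy.fa # toy multifasta: 5 lines, 2 sequences
wc -l toy.fa                                    # reports 5 lines, followed by the file name
grep ">" toy.fa | wc -l                         # 2: grep selects the headers, wc counts them
grep -c ">" toy.fa                              # 2: same count, done by grep alone
```

The “|” (pipe) sends the output of the command on its left to the command on its right, instead of printing it on the screen.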

If you add the -v switch to grep, it will print only the lines not matching the pattern (not containing the word you were looking for).

Now suppose that you want to know how many bases are present in a multifasta file.
How would you do that?
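
Try to work it out yourself first. One possible approach, sketched below on a made-up toy file, combines the commands seen so far with tr -d, a standard command not introduced in this practical that deletes the given characters. The scratch directory and the file toy.fa are assumptions for the example, and other solutions are possible.

```shell
cd "$(mktemp -d)"                               # scratch directory (an assumption)
printf '>seq1\nACGT\nAC\n>seq2\nGGG\n' > toy.fa # toy multifasta with 9 bases in total
# 1) keep only the sequence lines, 2) delete the newlines, 3) count the characters
grep -v ">" toy.fa | tr -d '\n' | wc -m         # 9
```

Without the tr step, wc -m would also count the newline at the end of each sequence line and report 12 instead of 9.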

Continue…

Perl & Genomics

Bioinformatics for Genomics Course – University of Padua