Welcome to the first practical. During this practical you’ll start mastering Linux and its command line.
Please keep in mind that all the tools we present day by day are needed to complete the final assignment, so try to think about each step rather than rushing to the end.
Feel free to ask us if you run into trouble.
How to do the practicals
All the practicals will ask you to produce some files to be stored in your home folder, in a particular directory. It’s important that you follow the instructions carefully so that you place all the files in the appropriate directory: we’ll automatically check the files produced at the end!
01. Introduce yourself to the PC
First you’ll create a directory for this first practical inside your home folder. Call it “lab01” (remember that in Linux the names are case sensitive).
To create the directory, open the Terminal (Applicazioni –> Accessori –> Terminale): its icon will appear after a short while, then click it.
In the terminal ensure that you are already in the home folder (with the pwd command), then use mkdir to create the lab01 directory.
Now use ls to ensure you created it.
Now enter the lab01 directory using the cd command, and check with pwd that you are really there. Remember to use the TAB key to complete file names, it saves lives!
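The steps of this task can be sketched as the short session below. It is only a sketch: the -p switch of mkdir is our addition (it avoids an error if lab01 already exists) and the comments describe what you should expect to see.

```shell
cd ~              # move to your home folder
pwd               # print the current directory: it should be your home
mkdir -p lab01    # create lab01 (-p: no error if it already exists)
ls                # lab01 should now appear in the listing
cd lab01          # enter the new directory
pwd               # the path should now end with /lab01
```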
Now, using the following command, you’ll create a file with your name written in it. For the moment don’t worry too much about the syntax: we’ll unravel it gradually.
echo "Name Surname" > lab01.txt
In short, the echo command prints on the screen the arguments we pass to it, and with “>” we redirect that output into a file.
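To see the difference between printing and redirecting, you can try the small sketch below. The file name demo.txt is just a hypothetical name chosen so that your lab01.txt is left untouched.

```shell
echo "Name Surname"              # the text is printed on the screen
echo "Name Surname" > demo.txt   # nothing on the screen: the text goes into demo.txt
cat demo.txt                     # prints the content of the file: Name Surname
```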
02. Create a multifasta file
Go to the NCBI website and look for a few gene sequences belonging to E. coli. Now open the text editor (Applicazioni->Accessori->Editor di testo…) and paste them into a single file: this will be a multifasta file. Save it in the lab01 directory that you created in step 01 and call it sequences.fa (as always, both the location and the name of the file DO matter).
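For reference, a multifasta file is simply several FASTA records one after the other: each record has a header line starting with “>”, followed by one or more lines of sequence. The sketch below builds a tiny file of that shape. The file name example.fa, the headers and the two short sequences are invented for illustration and are not real E. coli genes.

```shell
# Write two made-up FASTA records into example.fa (placeholder data, not real genes)
printf '>seq1 hypothetical example record\nATGCGTACGTTAGC\nGGCTAAGC\n>seq2 another hypothetical record\nTTGACCGTAA\n' > example.fa

# Show the file: a header line, its sequence lines, then the next record
cat example.fa
```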
Now go to the Terminal and ensure you are in the lab01 directory.
Try inspecting the content of the file you created with the following command:
cat sequences.fa
When dealing with large files it’s handy to have a “preview” of just the first or the last lines (10 by default): try
head sequences.fa
or
tail sequences.fa
Now, suppose you want the header of the first sequence (i.e. the first line of the file):
head -n 1 sequences.fa
To save it into a separate file, type:
head -n 1 sequences.fa > first-header.txt
Now, with cat, verify that the file you created is as you wished.
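If you want to rehearse the whole step on something small first, here is a minimal sketch. The files toy.fa and toy-header.txt and their content are hypothetical, made up only for this example.

```shell
# Build a small made-up fasta file with two records
printf '>seq1 toy record\nACGT\n>seq2 toy record\nGGCC\n' > toy.fa

# Keep only the first line (the first header) and save it in another file
head -n 1 toy.fa > toy-header.txt

# Check the result: this prints ">seq1 toy record"
cat toy-header.txt
```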
03. Downloading files with the terminal
Now suppose you want to download a file from the Internet, using the CLI. One of the possible commands to do it is wget. Its syntax is wget URL.
Still working in the lab01 directory, type this command:
wget "ftp://ftp.sanger.ac.uk/pub/pathogens/Escherichia/coli/E85.fa"
The E85.fa file contains an assembly of E. coli, kindly hosted at the Sanger Center.
Now try printing its content with “cat”.
04. Less is more
As you saw in task 03, printing long files with cat is impractical. The “less” command lets you view files interactively: you scroll through them with keyboard commands like:
- space bar: go down one page
- arrows: scroll up/down/left/right
- G: skip to the end
- g: skip to the top
- q: exit from the less program
So try typing “less filename”, where filename is the E. coli genome you have already downloaded.
05. Search in files
We know that every sequence in a fasta file begins with the “>” character. Since “>” is also a special character for the shell, used to redirect the output, we must remember to put it in quotes.
The following command will print the lines of the specified file that contain a “>”. Try the command both with the file you created and with the E. coli assembly.
grep ">" fastafile
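A minimal sketch of why the quotes matter, run on a made-up file (toy.fa and its content are hypothetical). The dangerous unquoted form is shown only inside a comment, so nothing gets overwritten.

```shell
# Build a small made-up fasta file with two records
printf '>seq1 toy record\nACGT\n>seq2 toy record\nGGCC\n' > toy.fa

# With quotes, ">" is the pattern to search for: only the two header lines are printed
grep ">" toy.fa

# Without quotes (grep > toy.fa) the shell would read ">" as a redirection
# and empty toy.fa before grep even starts. Do not run it on files you care about!
```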
06. Counting lines
How many lines does a file have? By default the wc command counts the lines, words and bytes in a file; with the -l switch it prints only the number of lines, and with -m only the number of characters.
wc -l E85.fa
will print the number of lines in the file. How many characters does the file contain? Use wc -m for this.
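To check that you read the output of wc correctly, you can try it on a file small enough to count by hand. The file toy.txt below is hypothetical. Note that wc -m also counts the newline character at the end of each line.

```shell
# A made-up file with two lines: "ACGT" and "GG"
printf 'ACGT\nGG\n' > toy.txt

wc -l toy.txt   # reports 2 lines
wc -m toy.txt   # reports 8 characters: 4 + 2 letters plus 2 newlines
```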
Suppose that we want to know how many sequences are present in a fasta file. How would you do it?
We can combine the grep and wc commands: the former selects the lines with a “>”, the latter counts them:
grep ">" E85.fa | wc -l
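The “|” symbol (a pipe) sends the output of the first command to the second one as its input. Here is a small sketch on a made-up file with three records, so that you can verify the count by eye (toy.fa and its content are hypothetical).

```shell
# Build a small made-up fasta file with three records
printf '>seq1 toy\nACGT\n>seq2 toy\nGGCC\n>seq3 toy\nTTAA\n' > toy.fa

# grep selects the three header lines, wc -l counts them: the result is 3
grep ">" toy.fa | wc -l
```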
If you add the -v switch to grep, it will print only the lines not matching the pattern (not containing the word you were looking for).
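As a quick sketch of the -v switch, again on a made-up file (toy.fa and its content are hypothetical):

```shell
# Build a small made-up fasta file with two records
printf '>seq1 toy\nACGT\n>seq2 toy\nGGCC\n' > toy.fa

# -v inverts the match: the headers are dropped, only ACGT and GGCC are printed
grep -v ">" toy.fa
```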
So, suppose that you want to know how many bases are present in a multifasta file.
How do you do that?