Load a multifasta file

Today’s task is quite complex, instead of parsing a file that keeps all information we need in a single line, as we did for BLAST. Instead a multifasta has a variable number of lines per item (sequence). This example is hard, that’s why I publish it now so that you can study it!

Let’s consider the following file (called minimultifasta.fna):

>Seq1
ACACACGCTAGCTAGCTGATCGATCGTACGTACGTAGCTGACT
ACGTAGCTGATCGACGAGCCATTCATCTTTTTTTTTTTACGTC
ACGTAC
>Seq2
ACGCGGTCGATCGCGATCGTACGACTTTTTTTTACCCCCCCCC
ACNNCCACGTGTACGACTACCGCGCGCATTATATACGACTGAA
CACACGGGGGGCACACAC
>Seq3
CAGATACAGATAAAAAAAACACCATTTTACGATC
>Seq4
ACTACCAGATAAAAAAAACACCATTTTACGATCCACGTCAACG
CAGATAAAAAAAACACCATTACNNCCACGTGTACGACTACTTA
CGATCCAGATAAAAAAAACAACNNCCACGTGTACGACTACCAG
CCATTTTACGATCACNNCCACGTGTACGACTACACNNCCACGT
GTACGACTAC

We want to parse it. With Perl we read a file line by line, right? So IF a line starts by “>” we keep it as a $header, waiting to retrieve the sequece. Then if the line is a sequence line, we keep it in a $seq. We don’t know how many lines, so we concatenate them UNTIL we see a new header.
Please, try “reading” the file line by line, as Perl…

#!/usr/bin/perl
 
# Load a multifasta file to memory into an hash
# having name->sequence structure:
# $sequence{'seq1'}='ACGACTGTAC';
 
# Let's print the aim of the script. Note the
 
print STDERR "
  Load a multifasta file and prints the average
  sequence length.
 
  Usage
  multifastaavg.pl fasta-filename
 
";
 
$multifastafile = $ARGV[0];
 
open(MULTIFASTA, "$multifastafile") || die " FATAL ERROR:\n Unable to load '$multifastafile'.\n";
 
# We create two temporary variables, $header and $seq to keep informations about
# each sequence. Then we populate the hash with $sequence{$header}=$seq
 
while (<MULTIFASTA>) {
	chomp($_);
 
	# If the line stars for ">" is a header, and we save the information in $header
	if ($_=~/^>(.*)/) {
 
		# if we have something in $seq
		if ($seq) {
			$sequence{$header}=$seq;
		}
 
 
		$header = $1;
		$counter++;
		# reset sequence!!!
		$seq    = '';
	} else {
		$seq.=$_;
	}
 
}
# Store last sequence
$sequence{$header}=$seq;
# Now we loaded all the sequences into the $sequence hash
# and we can USE it!
 
foreach $name (keys %sequence) {
	$len = length($sequence{$name});
	$count++;
	$sum+=$len;
}
$avg = $sum/$count;
print "You have $count sequence,
average length is $avg
";
Tagged

Leave a Reply

Your email address will not be published. Required fields are marked *