Today’s task is quite complex, instead of parsing a file that keeps all information we need in a single line, as we did for BLAST. Instead a multifasta has a variable number of lines per item (sequence). This example is hard, that’s why I publish it now so that you can study it!
Let’s consider the following file (called minimultifasta.fna):
>Seq1 ACACACGCTAGCTAGCTGATCGATCGTACGTACGTAGCTGACT ACGTAGCTGATCGACGAGCCATTCATCTTTTTTTTTTTACGTC ACGTAC >Seq2 ACGCGGTCGATCGCGATCGTACGACTTTTTTTTACCCCCCCCC ACNNCCACGTGTACGACTACCGCGCGCATTATATACGACTGAA CACACGGGGGGCACACAC >Seq3 CAGATACAGATAAAAAAAACACCATTTTACGATC >Seq4 ACTACCAGATAAAAAAAACACCATTTTACGATCCACGTCAACG CAGATAAAAAAAACACCATTACNNCCACGTGTACGACTACTTA CGATCCAGATAAAAAAAACAACNNCCACGTGTACGACTACCAG CCATTTTACGATCACNNCCACGTGTACGACTACACNNCCACGT GTACGACTAC
We want to parse it. With Perl we read a file line by line, right? So IF a line starts by “>” we keep it as a $header, waiting to retrieve the sequece. Then if the line is a sequence line, we keep it in a $seq. We don’t know how many lines, so we concatenate them UNTIL we see a new header.
Please, try “reading” the file line by line, as Perl…
#!/usr/bin/perl # Load a multifasta file to memory into an hash # having name->sequence structure: # $sequence{'seq1'}='ACGACTGTAC'; # Let's print the aim of the script. Note the print STDERR " Load a multifasta file and prints the average sequence length. Usage multifastaavg.pl fasta-filename "; $multifastafile = $ARGV[0]; open(MULTIFASTA, "$multifastafile") || die " FATAL ERROR:\n Unable to load '$multifastafile'.\n"; # We create two temporary variables, $header and $seq to keep informations about # each sequence. Then we populate the hash with $sequence{$header}=$seq while (<MULTIFASTA>) { chomp($_); # If the line stars for ">" is a header, and we save the information in $header if ($_=~/^>(.*)/) { # if we have something in $seq if ($seq) { $sequence{$header}=$seq; } $header = $1; $counter++; # reset sequence!!! $seq = ''; } else { $seq.=$_; } } # Store last sequence $sequence{$header}=$seq; # Now we loaded all the sequences into the $sequence hash # and we can USE it! foreach $name (keys %sequence) { $len = length($sequence{$name}); $count++; $sum+=$len; } $avg = $sum/$count; print "You have $count sequence, average length is $avg "; |