
# annotate_MSA Module

The annotate_MSA script automatically annotates the sequence headers of a FASTA file with taxonomic information. This can currently be done in one of two ways:

1. For PFAM alignments, annotations can be extracted from the file pfamseq.txt. This file can be downloaded from the PFAM ftp site. To access this, go to the following link: http://pfam.xfam.org/help#tabview=tab12, click on database_files, and download pfamseq.txt.gz (notice that this is a large file).
2. For Blast alignments, annotations can be added using the NCBI Entrez utilities provided by BioPython. In this case, an additional command line argument (--giList, see below) should specify a list of gi numbers. These numbers are then used to query NCBI for taxonomy information (note that this approach requires a network connection).

To quickly extract gi numbers from a list of headers (variable name ‘hd’) with typical Blast alignment formatting, the following line of Python code is useful:

>>> gis = [h.split('_')[1] for h in hd]


Alternatively, the script alnParseGI.py will accomplish this. For both the PFAM and NCBI utilities, sequence annotation can be slow (on the order of hours, particularly for NCBI Entrez with larger alignments). However, the annotation only needs to be run once per alignment.
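The one-line gi extraction above assumes every header contains an underscore-separated gi field. The following is a minimal sketch of a more defensive variant, not the code used by alnParseGI.py: it assumes headers of the form gi_<number>_... and records "0" for any header that does not match, in line with the "0" placeholder described for the --giList argument. The function name `extract_gis` and the sample headers are hypothetical.

```python
def extract_gis(headers):
    """Return one gi string per header, using "0" when no gi number is found."""
    gis = []
    for h in headers:
        fields = h.split('_')
        # Assumed header layout: gi_<number>_...; anything else gets the placeholder "0".
        if len(fields) > 1 and fields[0] == 'gi' and fields[1].isdigit():
            gis.append(fields[1])
        else:
            gis.append('0')
    return gis


# Hypothetical headers, for illustration only.
hd = ['gi_15599546_ref_NP_253040.1_', 'unlabeled_sequence', 'gi_489181_gb_AAA1.1_']
gis = extract_gis(hd)

# The result keeps the same order as the headers.
print('\n'.join(gis))
```

Because the result has exactly one entry per header, it stays aligned with the sequence order in the alignment even when some sequences have no gi number.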

Arguments:

Input_MSA.fasta (an input sequence alignment)

Keyword Arguments:

 -o, --output    Specify an output file, Output_MSA.an
 -a, --annot     Annotation method. Options are ‘pfam’ or ‘ncbi’. Default: ‘pfam’
 -g, --giList    This argument is necessary for the ‘ncbi’ method. Specifies a file containing a list of gi numbers corresponding to the sequence order in the alignment; a gi number of “0” indicates that a gi number wasn’t assigned for a particular sequence.
 -p, --pfam_seq  Location of the pfamseq.txt file. Defaults to path2pfamseq (specified at the top of scaTools.py)

Examples:
>>> ./annotate_MSA.py Inputs/PF00186_full.txt -o Outputs/PF00186_full.an -a 'pfam'
>>> ./annotate_MSA.py Inputs/DHFR_PEPM3.fasta -o Outputs/DHFR_PEPM3.an -a 'ncbi' -g Inputs/DHFR_PEPM3.gis
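
To make the relationship between the arguments concrete, here is a minimal sketch of how a command-line interface like the one documented above could be declared with Python's standard argparse module. It is not the script's actual source: the function names `build_parser` and `parse`, the `dest` names, and the `None` defaults for --output and --pfam_seq are assumptions made for illustration. The one rule it encodes comes from the text above: the ‘ncbi’ method needs a gi list.

```python
import argparse


def build_parser():
    """Declare the documented arguments (illustrative sketch only)."""
    parser = argparse.ArgumentParser(
        description="Annotate alignment headers with taxonomic information.")
    parser.add_argument("alignment", help="input sequence alignment (fasta)")
    parser.add_argument("-o", "--output", dest="output", default=None,
                        help="output file for the annotated alignment")
    parser.add_argument("-a", "--annot", dest="annot", default="pfam",
                        choices=["pfam", "ncbi"], help="annotation method")
    parser.add_argument("-g", "--giList", dest="giList", default=None,
                        help="file of gi numbers (needed for 'ncbi')")
    parser.add_argument("-p", "--pfam_seq", dest="pfamseq", default=None,
                        help="location of the pfamseq.txt file")
    return parser


def parse(argv):
    """Parse argv and enforce that 'ncbi' is accompanied by a gi list."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.annot == "ncbi" and args.giList is None:
        parser.error("the 'ncbi' method requires -g/--giList")
    return args


# Mirrors the second example command above.
args = parse(["Inputs/DHFR_PEPM3.fasta", "-o", "Outputs/DHFR_PEPM3.an",
              "-a", "ncbi", "-g", "Inputs/DHFR_PEPM3.gis"])
print(args.annot, args.giList)
```

Checking the method and gi list together at parse time reports a missing --giList immediately, before any slow annotation step begins.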

By: Rama Ranganathan, Kim Reynolds 9.22.2014

Copyright (C) 2015 Olivier Rivoire, Rama Ranganathan, Kimberly Reynolds. This program is free software distributed under the BSD 3-clause license; please see the file LICENSE for details.