annotate_MSA Module

The annotate_MSA script provides utilities to automatically annotate the sequence headers of a FASTA file with taxonomic information. Currently this can be done in one of two ways:

For PFAM alignments, annotations can be extracted from the file pfamseq.txt. This file can be downloaded from the PFAM FTP site: go to http://pfam.xfam.org/help#tabview=tab12, click on database_files, and download pfamseq.txt.gz (note that this is a large file).

For Blast alignments, annotations can be added using the NCBI Entrez utilities provided by BioPython. In this case, an additional command line argument (`--giList`, see below) should specify a list of gi numbers. These numbers are then used to query NCBI for taxonomy information (note that this approach requires a network connection).

To quickly extract gi numbers from a list of headers (variable name ‘hd’) with typical Blast alignment formatting, the following line of python code is useful:

>>> gis = [h.split('_')[1] for h in hd]

Alternatively, the script alnParseGI.py will accomplish this. For both the PFAM and NCBI utilities, sequence annotation can be slow (on the order of hours, particularly for NCBI Entrez with larger alignments). However, the annotation only needs to be run once per alignment.
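The one-liner above fails on any header that lacks a second underscore-separated field. As a minimal sketch, assuming headers follow the `gi_<number>_...` pattern implied by that one-liner and that the gi list file holds one number per line (the file layout is an assumption, not confirmed here), the following substitutes "0" for headers without a parsable gi number. The helper names `extract_gi` and `write_gi_list` and the sample headers are hypothetical.

```python
def extract_gi(header):
    """Return the gi number from a header like 'gi_12345_ref_...', or '0'."""
    fields = header.split('_')
    # Expect the literal tag 'gi' (optionally preceded by the FASTA '>')
    # followed by a purely numeric identifier.
    if len(fields) > 1 and fields[0].lstrip('>') == 'gi' and fields[1].isdigit():
        return fields[1]
    return '0'


def write_gi_list(headers, path):
    """Write one gi number per line, preserving the alignment order."""
    gis = [extract_gi(h) for h in headers]
    with open(path, 'w') as handle:
        handle.write('\n'.join(gis) + '\n')
    return gis


# Made-up headers for illustration only.
hd = ['gi_12345_ref_XP_000001.1', 'unlabeled_sequence', '>gi_67890_pdb_0XYZ_A']
gis = [extract_gi(h) for h in hd]
print(gis)
```

Keeping a "0" placeholder, rather than dropping the entry, preserves the one-to-one correspondence between the gi list and the sequence order in the alignment.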

Arguments:

Input_MSA.fasta (an input sequence alignment)

Keyword Arguments:

`-o, --output`	Specify an output file, Output_MSA.an
`-a, --annot`	Annotation method. Options are ‘pfam’ or ‘ncbi’. Default: ‘pfam’
`-g, --giList`	This argument is necessary for the ‘ncbi’ method. Specifies a file containing a list of gi numbers corresponding to the sequence order in the alignment; a gi number of “0” indicates that a gi number wasn’t assigned for a particular sequence.
`-p, --pfam_seq`	Location of the pfamseq.txt file. Defaults to path2pfamseq (specified at the top of scaTools.py)

Examples:

>>> ./annotate_MSA.py Inputs/PF00186_full.txt -o Outputs/PF00186_full.an -a 'pfam'
>>> ./annotate_MSA.py Inputs/DHFR_PEPM3.fasta -o Outputs/DHFR_PEPM3.an -a 'ncbi' -g Inputs/DHFR_PEPM3.gis
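When the script is driven from another Python program, the command line can be assembled programmatically. The sketch below is one possible wrapper, not part of the script itself: the function `build_annotate_command` and its parameter names are hypothetical, and it uses only the flags documented above. It checks up front that the 'ncbi' method is given a gi list.

```python
import subprocess


def build_annotate_command(msa, output, method='pfam', gi_list=None,
                           pfam_seq=None, script='./annotate_MSA.py'):
    """Assemble the argument list for annotate_MSA.py from the documented options."""
    if method not in ('pfam', 'ncbi'):
        raise ValueError("method must be 'pfam' or 'ncbi'")
    if method == 'ncbi' and gi_list is None:
        raise ValueError("the 'ncbi' method requires a gi list file")
    cmd = [script, msa, '-o', output, '-a', method]
    if gi_list is not None:
        cmd += ['-g', gi_list]
    if pfam_seq is not None:
        cmd += ['-p', pfam_seq]
    return cmd


cmd = build_annotate_command('Inputs/DHFR_PEPM3.fasta', 'Outputs/DHFR_PEPM3.an',
                             method='ncbi', gi_list='Inputs/DHFR_PEPM3.gis')
print(' '.join(cmd))

# To actually run the annotation (requires the script to be present and,
# for the 'ncbi' method, a network connection):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.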

By:	Rama Ranganathan, Kim Reynolds
On:	9.22.2014
