The annotate_MSA script provides utilities to automatically annotate the sequence headers of a FASTA file with taxonomic information. Currently this can be done in one of two ways:
- For PFAM alignments, annotations can be extracted from the file pfamseq.txt. This file can be downloaded from the PFAM FTP site: go to http://pfam.xfam.org/help#tabview=tab12, click on database_files, and download pfamseq.txt.gz (note that this is a large file).
- For Blast alignments, annotations can be added using the NCBI Entrez utilities provided by BioPython. In this case, an additional command line argument (--giList, see below) should specify a list of gi numbers. These numbers are then used to query NCBI for taxonomy information (note that this approach requires a network connection).
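The NCBI route described above can be sketched roughly as follows. This is a minimal illustration, not the code annotate_MSA itself runs. It assumes that the gi numbers are still resolvable through `Entrez.efetch` on the `protein` database and that the XML records carry `GBSeq_organism` and `GBSeq_taxonomy` fields. The function names `batches` and `fetch_taxonomy`, the `email` parameter, and the batch size of 200 are hypothetical choices made for this example.

```python
from typing import Dict, Iterator, List, Sequence


def batches(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` holding at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_taxonomy(gis: Sequence[str], email: str,
                   batch_size: int = 200) -> List[Dict[str, str]]:
    """Query NCBI for the organism and lineage of each id (hypothetical helper).

    Requires BioPython and a network connection. Records are returned in the
    order NCBI sends them; ids that NCBI cannot resolve may be missing, so the
    result is not guaranteed to line up one-to-one with `gis`.
    """
    # Imported lazily so that `batches` can be used without BioPython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    results = []
    for batch in batches(gis, batch_size):
        handle = Entrez.efetch(db="protein", id=",".join(batch), retmode="xml")
        try:
            records = Entrez.read(handle)
        finally:
            handle.close()
        for rec in records:
            results.append({
                # Field names are an assumption about the returned XML.
                "organism": rec.get("GBSeq_organism", ""),
                "taxonomy": rec.get("GBSeq_taxonomy", ""),
            })
    return results
```

Sending ids in batches rather than one at a time keeps the number of network round trips down, which matters given how slow annotation can be for larger alignments.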
To quickly extract gi numbers from a list of headers (variable name 'hd') with typical Blast alignment formatting, the following line of Python code is useful:
>>> gis = [h.split('_')[1] for h in hd]
Alternatively, the script alnParseGI.py will accomplish this. For both the PFAM and NCBI utilities, sequence annotation can be slow (on the order of hours, particularly for NCBI Entrez with larger alignments). However, annotation only needs to be run once per alignment.
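For a whole alignment file, the gi extraction step can be sketched as below. This is only an illustration and does not reproduce alnParseGI.py. It assumes headers of the form `gi_<number>_...`, so that the gi number is the second `_`-separated field, as in the one-liner above. The helper names and the one-gi-per-line output format are assumptions made for this example.

```python
from typing import Iterable, List


def read_fasta_headers(lines: Iterable[str]) -> List[str]:
    """Return the header lines of FASTA text, without the leading '>'."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def extract_gis(headers: Iterable[str]) -> List[str]:
    """Take the second '_'-separated field of each header as its gi number."""
    gis = []
    for h in headers:
        fields = h.split("_")
        if len(fields) < 2:
            # Fail loudly instead of silently writing a misaligned gi list.
            raise ValueError(f"header has no '_'-separated gi field: {h!r}")
        gis.append(fields[1])
    return gis


def write_gi_list(gis: Iterable[str], path: str) -> None:
    """Write one gi number per line (the file layout is an assumption)."""
    with open(path, "w") as fh:
        for gi in gis:
            fh.write(gi + "\n")
```

A typical use would be to open the alignment, pass the file object to `read_fasta_headers`, and hand the result of `extract_gis` to `write_gi_list`. Raising on a malformed header keeps the gi list in the same order and length as the alignment.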
Arguments: | Input_MSA.fasta (an input sequence alignment) |
---|---|
Keyword Arguments: | |

Examples:
>>> ./annotate_MSA.py Inputs/PF00186_full.txt -o Outputs/PF00186_full.an -a 'pfam'
>>> ./annotate_MSA.py Inputs/DHFR_PEPM3.fasta -o Outputs/DHFR_PEPM3.an -a 'ncbi' -g Inputs/DHFR_PEPM3.gis
By: | Rama Ranganathan, Kim Reynolds |
---|---|
On: | 9.22.2014 |
Copyright (C) 2015 Olivier Rivoire, Rama Ranganathan, Kimberly Reynolds

This program is free software distributed under the BSD 3-clause license, please see the file LICENSE for details.