pySCA - A SCA toolbox in Python
By: |
---|
On: | August 2014 |
---|---|
version: | 6.1 beta |
Copyright (C) 2015 Olivier Rivoire, Rama Ranganathan, Kimberly Reynolds This program is free software distributed under the BSD 3-clause license, please see the file LICENSE for details.
A class for annotating sequences
Attributes : |
|
---|
Phylogenetic annotation of a Pfam alignment (in fasta format) using information from pfamseq.txt (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/database_files/). The output is a fasta file containing phylogenetic annotations in the header (to be parsed with ‘|’ as a delimiter).
Note: the headers for the original alignment take the form >AAA/x-y. If two entries have same AAA but correspond to different sequences only one of the two sequences will be represented (twice) in the output - this should however not practically be an issue.
Arguments : |
|
---|---|
Keyword Arguments: | |
|
Identify the sequence in the alignment that most closely corresponds to the species of the reference sequence, and return its index.
>>> strseqnum = MSASearch(hd, alg0, pdbseq, 'Homo sapiens')
Multiple bar diagram (plots contributions to each bar from different elements in x as different colors). This can be useful if you’d like to inspect how sector positions are distributed among independent components/eigenmodes. The argument x is a tuple, specifying the number of elements in each bar to be each color. The example below makes a graph with four bars, where the first bar has 99 white elements, 1 blue element and 1 red element.
Example : | >>> x = [[99, 1, 1], [6, 13, 2], [0, 0, 13], [1, 7, 5]]
>>> sca.MultiBar(x)
|
---|
A class for a pair of positions. :Attributes:
- pos = a pair of amino acid positions (ex: [1,3], supplied as argument p)
- DI = the direct information between the two positions (argument x)
- dist = the physical distance between the positions (argument d)
A class for sectons.
Check the structural connectivity based on the principle that if \(M_{ij}\) is the adjacency matrix of a graph, \(M^n_{ij}\) is the number of paths of length \(n\) between i and j, which must be > 0 for \(n\) = number of nodes when i and j are in the same connected component.
returns the distance between the position pair
A class for units (sectors, sequence families, etc.)
Translate an alignment of size M x L where the amino acids are represented by numbers between 0 and N_aa (obtained using lett2num) to a sparse binary array of size M x (N_aa x L).
Example : | >>> Abin = alg2bin(alg, N_aa=20)
|
---|
Basic ICA algorithm, based on work by Bell & Sejnowski (infomax). The input data should preferentially be sphered, i.e., x.T.dot(x) = 1
Note: r and Niter should be adjusted to achieve convergence, which should be assessed by visualizing ‘change’ with plot(range(iter) ,change)
>>> [w, change] = basicICA(x, r, Niter)
Given the eigenvalues of the sca matrix (Lsca), and the eigenvalues for the set of randomized matrices (Lrand), return the number of significant eigenmodes.
This function chooses a default reference sequence if none is given by taking the sequence which has the mean pairwise sequence identity closest to that of the entire alignment.
Example : | >>> i_ref = chooseRefSeq(msa_num)
|
---|
Replaces any character that is not a valid amino acid by a gap.
Arguments : | amino acid sequence alignment |
---|---|
Keyword Arguments: | |
|
Output tab-delimited text that can be read in by cytoscape. The goal is to enable graph representations of the SCA couplings, where residues are nodes, and couplings are edges. Within cytoscape, the graph can be color-coded or weighted by Csca, Di, sector definition or Vp.
Example : | >>> cytoscapeOut(ats, cutoff, Csca, Di, sectors, Vp, outfilename)
|
---|
Direct information from the matrix of couplings \(J_{ij}\) (called by directInfo). Ref: Marcos et al, PNAS 2011, 108: E1293-E1301
Example : | >>> DI = dirInfoFromJ(i, j, Jmat, frq, Naa=20, epsilon=1e-4)
|
---|
Calculate direct information as in the Direct Coupling Analysis (DCA) method proposed by M. Weigt et collaborators (Ref: Marcos et al, PNAS 2011, 108: E1293-E1301).
Example : | >>> DI = directInfo(freq1, freq2, lbda=.5, freq0=np.ones(20)/21, Naa=20)
|
---|
Return the eigenvectors and eigenvalues, ordered by decreasing values of the eigenvalues, for a real symmetric matrix M. The sign of the eigenvectors is fixed so that the mean of its components is non-negative.
Example : | >>> eigenVectors, eigenValues = eigenVect(M)
|
---|
Function that automates finding the top \(k_{pos}\) independent components, projection, and plotting. Useful to get a representation of the sectors/subfamilies mapping for a given kpos. Arguments:
Example : |
---|
>>> Vpica = figMapping(Csca, tX, kpos, sectors, subfam)
Make and display an image of the sectors (within a python notebook). By default quit PyMol after running it, unless the option ‘quit=0’ is given. The default name and location of the output can also be changed.
Example : | >>> figStruct(pdbid, sectors, ats, chainid='A', outfile = 'Outputs/sectors.pml', quit=1, pymol = path2pymol)
|
---|
2d scatter plot specified by ‘units’, which must be a list of elements in the class Unit. See figColors for the color code. Admissible color codes are in [0 1] (light/dark gray can also be obtained by using -1/+1). For instance: 0->red, 1/3->green, 2/3-> blue.
marker = plot marker symbol
dotsize = specify marker/dotsize
these elements are not represented, if set to [w1,w2] : elements with coordinates w1,w2 are represented in white in the background.
Example : | >>> figUnits(v1, v2, units, marker='o', gradcol=0, dotsize=9, notinunits=1)
|
---|
A 2d scatter plot with color indicating weight.
Example:
>>> figWeights(U1, U2, weight)
Truncate the positions of an input alignment to reduce gaps, taking into account sequence weights.
Example : | >>> alg_tr, selpos = filterPos(alg, seqw, max_fracgaps=.2)
|
---|
Take in an alignment (alg0, assumed to be filtered to remove highly gapped positions), a reference sequence, the maximum fraction of gaps allowed per sequence (max_fracgaps), the minimum and maximum sequence identities to the reference sequence (min_seqid and max_seqid), and return (1) alg, the alignment filtered to remove sequences with more than max_fracgaps (i.e. partial seqs), (2) seqw, a vector of weights for each sequence, (3) seqkeep, the indices of the original alignment (alg0) retained in alg:
>>> alg, seqw, seqkeep = filterSeq(alg0, iref, max_fracgaps=.2, min_seqid=.2, max_seqid=.8)
Compute amino acid frequencies for a given alignment.
Example : | >>> freq1, freq2, freq0 = freq(alg, seqw, lbda=lbda)
|
---|
Produces a list of positions contributing to each independent component (IC) above a defined statistical cutoff (p_cut, the cutoff on the CDF of the t-distribution fit to the histogram of each IC). Any position above the cutoff on more than one IC are assigned to one IC based on which group of positions to which it shows a higher degree of coevolution. Additionally returns the numeric value of the cutoff for each IC, and the pdf fit, which can be used for plotting/evaluation. icList, icsize, sortedpos, cutoff, pd = icList(Vsca,Lsca,Lrand)
Translate an alignment from a representation where the 20 natural amino acids are represented by letters to a representation where they are represented by the numbers 1,...,20, with any symbol not corresponding to an amino acid represented by 0.
Example : | >>> msa_num = lett2num(msa_lett, code='ACDEFGHIKLMNPQRSTVWY')
|
---|
If specified, truncate the alignment to the structure (assumes MSAsearch has already been run to identify the reference sequence (iref)) and produce a mapping (ats) between alignment positions and the positions in the reference sequence (refpos).
- Arguments:
- sequences
- reference positions
- reference sequence
- iref, the index of the sequence in the alignment with the highest identity to the reference
Keyword Arguments: | |
---|---|
truncate – truncate the alignment to positions that align to the reference sequence. |
|
Example : | >>> sequences_trun, ats_new = sca.makeATS(sequences_full, ats_pdb, seq_pdb, i_ref)
|
Calculates the number of positions in the largest connected component for groups of positions i with \(V_p[i,k] > eps\) and \(V_p[i,k] > V_p[i,kk]\), for \(kk != k\) and eps in eps_list. Useful for looking evaluating the physical connectivity of different sectors or sub-sectors.
>>> eps_range, num_co, num_tot = numConnected(Vp, k, distmat, eps_list = np.arange(.5,0,-.01), dcontact=8)
Extract sequence, position labels and matrix of distances from a PDB file.
Example : | >>> sequence, labels, dist = pdbSeq(pdbid, chain='A', path2pdb=path2structures)
|
---|
Compute single-site measures of conservation, and the sca position weights, \(\frac {\partial {D_i^a}}{\partial {f_i^a}}\)
Example : | >>> Wia, Dia, Di = posWeights(alg, seqw=1,freq0)
|
---|
Projection of an alignment (alg) based on a projector (Proj). The input alignment should already be converted to numeric representation using lett2num.
Example : | >>> tX = projAlg(msa_num, Proj)
|
---|
Compute the projection of an alignment (msa_ann) on the kpos ICA components of the sequence space of another (msa_num, seqw).This is useful to compare the sequence space of one alignment to another.
Example : | >>> Uica_ann, Uica = projUpica(msa_ann, msa_num_ seqw, kica=6)
|
---|
Compute the projection of an alignment (msa_ann) on the kpos ICA components of the SCA matrix of another (msa_num, seqw). This is useful to compare the sequence space (as projected by the positional correlations) of one alignment to another.
Example : | >>> Upica_ann, Upica = projUpica(msa_ann, msa_num_ seqw, kpos)
|
---|
Generate a random alignment with Mseq sequences based on the frequencies frq[i,a] of amino acids with a = 0,1,...,Naa (0 for gaps).
Example : | >>> msa_rand = randAlg(frq, Mseq)
|
---|
Random selection of Mtot sequences, drawn with weights and without replacement. The seed for the random number generator is fixed to ensure reproducibility.
- Arguments:
- seqw = the sequence weights
- Mtot = the total number of sequences for selection
- Keyword Arguments:
- keepSeq = an (optional) list of sequnces to keep. This can be useful if you would like to
retain the reference sequence for example.
Example:
>>> selection = randSel(seqw, Mtot, [iref])
Randomize the alignment while preserving the frequencies of amino acids at each position and compute the resulting spectrum of the SCA matrix.
Example : | >>> Vrand, Lrand, Crand = randomize(msa_num, 10, seqw, Naa=20, kmax=6)
|
---|
Read in a multiple sequence alignment in fasta format, and return the headers and sequences.
>>> headers, sequences = readAlg(filename)
ICA rotation (using basicICA) with default parameters and normalization of outputs.
Example : | >>> Vica, W = rotICA(V, kmax=6, learnrate=.0001, iterations=10000)
|
---|
Computes the SCA matrix.
seqw = A vector of sequence weights (default: uniform weights)
SCA correlation tensor to a positional correlation matrix. Use ‘spec’ for spectral norm and ‘frob’ for Frobenius norm. The frobenius norm is the default.
lbda = lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)
freq0 = background expectation for amino acid frequencies
Example : | >>> Csca, tX, projMat = scaMat(alg, seqw, norm='frob', lbda=0.03)
|
---|
Compute three different projections of the sequences based on eigenvectors of the sequence similarity matrix.
>>> Useq, Uica = sca.seqProj(msa_num, seqw, kseq = 30, kica = 15)
Take an MxL alignment (converted to numeric representation using lett2num) and compute a MxM matrix of sequence similarities.
Example : | >>> simMat = seqSim(alg)
|
---|
Compute sequence weights for an alignment (format: list of sequences) where the weight of a sequence is the inverse of the number of sequences in its neighborhood, defined as the sequences with sequence similarity below max_seqid. The sum of the weights defines an effective number of sequences.
Example : | >>> seqw = seqWeights(alg)
|
---|
Single bar diagram, called by MultiBar.
Example : | >>> singleBar(x, loc, cols, width=.5)
|
---|
Compute the size of the largest component of a graph given its adjacency matrix. Called by numConnected (Done by actually listing all the components)
Example : | >>> s = sizeLargestCompo(adjMat)
|
---|
Singular value decomposition for sparse matrices (top k components). The singular values are ordered by decreasing values, the sign of the singular vectors is fixed, and the convention is that X = u.dot(s).dot(v.T):
Example : | >>> u, s ,v = svdss(X, k=6)
|
---|
Set to 0 the elements of a matrix M up to a distance dmax from the diagonal.
Example : | >>> Mtr = truncDiag(M, dmax)
|
---|
Generate a random list of at most Nmax elements with weights (numpy array) but without replacement. Called by randSel.
Example : | >>> selection = weighted_rand_list(weights, Nmax, [iref])
|
---|
Generate a random index with probability given by input weights. Called by weighted_rand_list.
Example : | >>> index = weighted_rand_sel(weights)
|
---|
Write basic a pymol script for displaying sectors and exporting an image.
Example : | >>> writePymol(pdb, sectors, ics, ats, outfilename, chain='A',inpath=path2structures, quit=1)
|
---|