scaTools Module¶

pySCA - A SCA toolbox in Python

By:

Olivier Rivoire (olivier.rivoire@ujf-grenoble.fr)
Kimberly Reynolds (kimberly.reynolds@utsouthwestern.edu)
Rama Ranganathan (rama.ranganathan@utsouthwestern.edu)

On:	August 2014
version:	6.1 beta

class scaTools.Annot(descr, species, taxo, seq='')¶

A class for annotating sequences

Attributes :	descr = description (often the sequence header) species = species string taxo = taxonomy string seq = sequence

scaTools.AnnotPfam(pfam_in, pfam_out, pfam_seq='~/Documents/Packages/pfamseq.txt')¶

Phylogenetic annotation of a Pfam alignment (in fasta format) using information from pfamseq.txt (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/database_files/). The output is a fasta file containing phylogenetic annotations in the header (to be parsed with ‘|’ as a delimiter).

Note: the headers for the original alignment take the form >AAA/x-y. If two entries have same AAA but correspond to different sequences only one of the two sequences will be represented (twice) in the output - this should however not practically be an issue.

Keyword Arguments:
Arguments :	input PFAM sequence alignment output file name for the annotated PFAM alignment
	pfam_seq = path to the file pfamseq.txt

scaTools.MSAsearch(hd, algn, seq, species=None, path2_algprog='/usr/local/bin/')¶

Identify the sequence in the alignment that most closely corresponds to the species of the reference sequence, and return its index.

Arguments:

sequence alignment headers
alignment sequences
selected reference sequence (often from a PDB file)

Keyword Arguments:

species = species of the reference sequence (Used to speed up alignment searching when possible)
path2_algprog = path to an alignment program

Example:

>>> strseqnum = MSASearch(hd, alg0, pdbseq, 'Homo sapiens')

scaTools.MultiBar(x, colors='wbrgymc', width=0.5)¶

Multiple bar diagram (plots contributions to each bar from different elements in x as different colors). This can be useful if you’d like to inspect how sector positions are distributed among independent components/eigenmodes. The argument x is a tuple, specifying the number of elements in each bar to be each color. The example below makes a graph with four bars, where the first bar has 99 white elements, 1 blue element and 1 red element.

Example :	>>> x = [[99, 1, 1], [6, 13, 2], [0, 0, 13], [1, 7, 5]] >>> sca.MultiBar(x)

class scaTools.Pair(p, x, d)¶

A class for a pair of positions. :Attributes:

pos = a pair of amino acid positions (ex: [1,3], supplied as argument p)

DI = the direct information between the two positions (argument x)

dist = the physical distance between the positions (argument d)

class scaTools.Secton(positions)¶

A class for sectons.

Attributes:

pos = a list of positions
num = number of positions in the secton

connected(distmat, threshold)¶: Check the structural connectivity based on the principle that if \(M_{ij}\) is the adjacency matrix of a graph, \(M^n_{ij}\) is the number of paths of length \(n\) between i and j, which must be > 0 for \(n\) = number of nodes when i and j are in the same connected component.

dist(distmat)¶: returns the distance between the position pair

class scaTools.Unit¶

A class for units (sectors, sequence families, etc.)

Attributes:

name = string describing the unit (ex: ‘firmicutes’)
items = set of member items (ex: indices for all firmicutes sequences in an alignment)
col = color code associated to the unit (for plotting)
vect = an additional vector describing the member items (ex: a list of sequence weights)

scaTools.alg2bin(alg, N_aa=20)¶

Translate an alignment of size M x L where the amino acids are represented by numbers between 0 and N_aa (obtained using lett2num) to a sparse binary array of size M x (N_aa x L).

Example :	>>> Abin = alg2bin(alg, N_aa=20)

scaTools.basicICA(x, r, Niter)¶

Basic ICA algorithm, based on work by Bell & Sejnowski (infomax). The input data should preferentially be sphered, i.e., x.T.dot(x) = 1

Arguments:

x = LxM input matrix where L = # features and M = # samples
r = learning rate / relaxation parameter (e.g. r=.0001)
Niter = number of iterations (e.g. 1000)

Returns:

w = unmixing matrix
change = record of incremental changes during the iterations.

Note: r and Niter should be adjusted to achieve convergence, which should be assessed by visualizing ‘change’ with plot(range(iter) ,change)

Example:

>>> [w, change] = basicICA(x, r, Niter)

scaTools.chooseKpos(Lsca, Lrand)¶: Given the eigenvalues of the sca matrix (Lsca), and the eigenvalues for the set of randomized matrices (Lrand), return the number of significant eigenmodes.

scaTools.chooseRefSeq(alg)¶

This function chooses a default reference sequence if none is given by taking the sequence which has the mean pairwise sequence identity closest to that of the entire alignment.

Example :	>>> i_ref = chooseRefSeq(msa_num)

scaTools.clean_al(alg, code='ACDEFGHIKLMNPQRSTVWY', gap='-')¶

Replaces any character that is not a valid amino acid by a gap.

Keyword Arguments:
Arguments :	amino acid sequence alignment
	code = list of valid amino acid characters (case sensitive) gap = gap character for replacement

scaTools.cytoscapeOut(ats, cutoff, Csca, Di, sectors, Vp, outfilename)¶

Output tab-delimited text that can be read in by cytoscape. The goal is to enable graph representations of the SCA couplings, where residues are nodes, and couplings are edges. Within cytoscape, the graph can be color-coded or weighted by Csca, Di, sector definition or Vp.

Example :	>>> cytoscapeOut(ats, cutoff, Csca, Di, sectors, Vp, outfilename)

scaTools.dirInfoFromJ(i, j, Jmat, frq, Naa=20, epsilon=0.0001)¶

Direct information from the matrix of couplings \(J_{ij}\) (called by directInfo). Ref: Marcos et al, PNAS 2011, 108: E1293-E1301

Arguments:

i = position 1
j = position 2
Jmat = coupling matrix
frq = frequency
Naa = number of amino acids

Example :	>>> DI = dirInfoFromJ(i, j, Jmat, frq, Naa=20, epsilon=1e-4)

directInfo(freq1, freq2, lbda=0.5, freq0=array([ 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,

0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,

0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905]), Naa=20)

Calculate direct information as in the Direct Coupling Analysis (DCA) method proposed by M. Weigt et collaborators (Ref: Marcos et al, PNAS 2011, 108: E1293-E1301).

Example :	>>> DI = directInfo(freq1, freq2, lbda=.5, freq0=np.ones(20)/21, Naa=20)

scaTools.eigenVect(M)¶

Return the eigenvectors and eigenvalues, ordered by decreasing values of the eigenvalues, for a real symmetric matrix M. The sign of the eigenvectors is fixed so that the mean of its components is non-negative.

Example :	>>> eigenVectors, eigenValues = eigenVect(M)

scaTools.figColors()¶

Color code for figUnits.

Example:

>>> figColors()

scaTools.figMapping(Csca, tX, kpos, sectors, subfam)¶

Function that automates finding the top \(k_{pos}\) independent components, projection, and plotting. Useful to get a representation of the sectors/subfamilies mapping for a given kpos. Arguments:

Csca = the sca matrix

tX = the projected alignment

kpos = the number of independent components to consider

sectors = list of Unit elements for each sector

subfam = list of Unit elements for each sequence family

Returns:

Vpica = the independent components of Csca

Example :

>>> Vpica = figMapping(Csca, tX, kpos, sectors, subfam) 

scaTools.figStruct(pdbid, sectors, ats, chainid='A', outfile='Outputs/sectors.pml', quit=1, pymol='/Applications/MacPyMOL.app/Contents/MacOS/MacPyMOL')¶

Make and display an image of the sectors (within a python notebook). By default quit PyMol after running it, unless the option ‘quit=0’ is given. The default name and location of the output can also be changed.

Example :	>>> figStruct(pdbid, sectors, ats, chainid='A', outfile = 'Outputs/sectors.pml', quit=1, pymol = path2pymol)

scaTools.figUnits(v1, v2, units, marker='o', dotsize=9, notinunits=1)¶

2d scatter plot specified by ‘units’, which must be a list of elements in the class Unit. See figColors for the color code. Admissible color codes are in [0 1] (light/dark gray can also be obtained by using -1/+1). For instance: 0->red, 1/3->green, 2/3-> blue.

Arguments:

v1 = xvals
v2 = yvals
units = list of elements in units

Keyword Arguments

marker = plot marker symbol
dotsize = specify marker/dotsize
notinunits = if set to 1 : the elements not in a unit are represented in white, if set to 0

these elements are not represented, if set to [w1,w2] : elements with coordinates w1,w2 are represented in white in the background.

Example :	>>> figUnits(v1, v2, units, marker='o', gradcol=0, dotsize=9, notinunits=1)

scaTools.figWeights(U1, U2, weight)¶

A 2d scatter plot with color indicating weight.

Example:

>>> figWeights(U1, U2, weight) 

scaTools.filterPos(alg, seqw=[1], max_fracgaps=0.2)¶

Truncate the positions of an input alignment to reduce gaps, taking into account sequence weights.

Arguments:

alg = An MxL list of sequences

Keyword Arguments:

seqw = vector of sequence weights (default is uniform weights)
max_fracgaps = maximum fraction of gaps allowed at a position

Returns:

alg_tr = the truncated alignment
selpos = the index of retained positions (indices start at 0 for the first position)

Example :	>>> alg_tr, selpos = filterPos(alg, seqw, max_fracgaps=.2)

scaTools.filterSeq(alg0, sref=0.5, max_fracgaps=0.2, min_seqid=0.2, max_seqid=0.8)¶

Take in an alignment (alg0, assumed to be filtered to remove highly gapped positions), a reference sequence, the maximum fraction of gaps allowed per sequence (max_fracgaps), the minimum and maximum sequence identities to the reference sequence (min_seqid and max_seqid), and return (1) alg, the alignment filtered to remove sequences with more than max_fracgaps (i.e. partial seqs), (2) seqw, a vector of weights for each sequence, (3) seqkeep, the indices of the original alignment (alg0) retained in alg:

Note: if sref is set to 0.5, filterSeq calls chooseRefSeq to automatically select a: reference sequence.

Example:

>>> alg, seqw, seqkeep = filterSeq(alg0, iref, max_fracgaps=.2, min_seqid=.2, max_seqid=.8) 

freq(alg, seqw=1, Naa=20, lbda=0, freq0=array([ 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,

0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,

0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905]))

Compute amino acid frequencies for a given alignment.

Arguments:

alg = a MxL sequence alignment (converted using lett2num)

Keyword Arguments:

seqw = a vector of sequence weights (1xM)
Naa = the number of amino acids
lbda = lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)
freq0 = expected average frequency of amino acids at all positions

Returns:

freq1 = the frequencies of amino acids at each position taken independently (Naa*L)
freq2 = the joint frequencies of amino acids at pairs of positions (freq2, Naa*L * Naa*L)
freq0 = the average frequency of amino acids at all positions (Naa)

Example :	>>> freq1, freq2, freq0 = freq(alg, seqw, lbda=lbda)

scaTools.icList(Vpica, kpos, Csca, p_cut=0.95)¶: Produces a list of positions contributing to each independent component (IC) above a defined statistical cutoff (p_cut, the cutoff on the CDF of the t-distribution fit to the histogram of each IC). Any position above the cutoff on more than one IC are assigned to one IC based on which group of positions to which it shows a higher degree of coevolution. Additionally returns the numeric value of the cutoff for each IC, and the pdf fit, which can be used for plotting/evaluation. icList, icsize, sortedpos, cutoff, pd = icList(Vsca,Lsca,Lrand)

scaTools.lett2num(msa_lett, code='ACDEFGHIKLMNPQRSTVWY')¶

Translate an alignment from a representation where the 20 natural amino acids are represented by letters to a representation where they are represented by the numbers 1,...,20, with any symbol not corresponding to an amino acid represented by 0.

Example :	>>> msa_num = lett2num(msa_lett, code='ACDEFGHIKLMNPQRSTVWY')

scaTools.makeATS(sequences, refpos, refseq, iref=0, truncate=False)¶

If specified, truncate the alignment to the structure (assumes MSAsearch has already been run to identify the reference sequence (iref)) and produce a mapping (ats) between alignment positions and the positions in the reference sequence (refpos).

Arguments:

sequences

reference positions

reference sequence

iref, the index of the sequence in the alignment with the highest identity to the reference

Keyword Arguments:
	truncate – truncate the alignment to positions that align to the reference sequence.
Example :	>>> sequences_trun, ats_new = sca.makeATS(sequences_full, ats_pdb, seq_pdb, i_ref)

numConnected(Vp, k, distmat, eps_list=array([ 0.5 , 0.49, 0.48, 0.47, 0.46, 0.45, 0.44, 0.43, 0.42,

0.41, 0.4 , 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33,

0.32, 0.31, 0.3 , 0.29, 0.28, 0.27, 0.26, 0.25, 0.24,

0.23, 0.22, 0.21, 0.2 , 0.19, 0.18, 0.17, 0.16, 0.15,

0.14, 0.13, 0.12, 0.11, 0.1 , 0.09, 0.08, 0.07, 0.06,

0.05, 0.04, 0.03, 0.02, 0.01]), dcontact=5)

Calculates the number of positions in the largest connected component for groups of positions i with \(V_p[i,k] > eps\) and \(V_p[i,k] > V_p[i,kk]\), for \(kk != k\) and eps in eps_list. Useful for looking evaluating the physical connectivity of different sectors or sub-sectors.

Arguments:

Vp = A set of eigenvectors or independent components
k = The eigenvector or independent component to consider
distmat = Distance matrix (computed by pdbSeq)

Keyword Arguments:

eps_list = the range of values of eps for which the group is non-empty
dcontact = the distance cutoff for defining contacts

Example:

>>> eps_range, num_co, num_tot = numConnected(Vp, k, distmat, eps_list = np.arange(.5,0,-.01), dcontact=8)

scaTools.pdbSeq(pdbid, chain='A', path2pdb='Inputs/', calcDist=1)¶

Extract sequence, position labels and matrix of distances from a PDB file.

Arguments:

pdbid = PDB identifier (four letters/numbers)
chain = PDB chain identifier
path2pdb = location of the PDB file
calcDist = calculate a distance matrix between all pairs of positions, default is 1

Example :	>>> sequence, labels, dist = pdbSeq(pdbid, chain='A', path2pdb=path2structures)

posWeights(alg, seqw=1, lbda=0, freq0=array([ 0.073, 0.025, 0.05 , 0.061, 0.042, 0.072, 0.023, 0.053,

0.064, 0.089, 0.023, 0.043, 0.052, 0.04 , 0.052, 0.073,

0.056, 0.063, 0.013, 0.033]))

Compute single-site measures of conservation, and the sca position weights, \(\frac {\partial {D_i^a}}{\partial {f_i^a}}\)

Arguments:

alg = MSA, dimensions MxL, converted to numerical representation with lett2num
seqw = a vector of M sequence weights (default is uniform weighting)
lbda = pseudo-counting frequencies, default is no pseudocounts
freq0 = background amino acid frequencies \(q_i^a\)

Returns:

Wia = positional weights from the derivation of a relative entropy, \(\frac {\partial {D_i^a}}{\partial {f_i^a}}\) (Lx20)
Dia = the relative entropy per position and amino acid (Lx20)
Di = the relative entropy per position (L)

Example :	>>> Wia, Dia, Di = posWeights(alg, seqw=1,freq0)

scaTools.projAlg(alg, Proj)¶

Projection of an alignment (alg) based on a projector (Proj). The input alignment should already be converted to numeric representation using lett2num.

Example :	>>> tX = projAlg(msa_num, Proj)

scaTools.projUica(msa_ann, msa_num, seqw, kica=6)¶

Compute the projection of an alignment (msa_ann) on the kpos ICA components of the sequence space of another (msa_num, seqw).This is useful to compare the sequence space of one alignment to another.

Example :	>>> Uica_ann, Uica = projUpica(msa_ann, msa_num_ seqw, kica=6)

scaTools.projUpica(msa_ann, msa_num, seqw, kpos)¶

Compute the projection of an alignment (msa_ann) on the kpos ICA components of the SCA matrix of another (msa_num, seqw). This is useful to compare the sequence space (as projected by the positional correlations) of one alignment to another.

Example :	>>> Upica_ann, Upica = projUpica(msa_ann, msa_num_ seqw, kpos)

scaTools.randAlg(frq, Mseq)¶

Generate a random alignment with Mseq sequences based on the frequencies frq[i,a] of amino acids with a = 0,1,...,Naa (0 for gaps).

Example :	>>> msa_rand = randAlg(frq, Mseq)

scaTools.randSel(seqw, Mtot, keepSeq=[])¶

Random selection of Mtot sequences, drawn with weights and without replacement. The seed for the random number generator is fixed to ensure reproducibility.

Arguments:

seqw = the sequence weights

Mtot = the total number of sequences for selection

Keyword Arguments:

keepSeq = an (optional) list of sequnces to keep. This can be useful if you would like to

retain the reference sequence for example.

Example:

>>> selection = randSel(seqw, Mtot, [iref]) 

scaTools.randomize(msa_num, Ntrials, seqw=1, norm='frob', lbda=0, Naa=20, kmax=6)¶

Randomize the alignment while preserving the frequencies of amino acids at each position and compute the resulting spectrum of the SCA matrix.

Arguments:

msa_num = a MxL sequence alignment (converted to numerical representation using lett2num)
Ntrials = number of trials
seqw = vector of sequence weights (default is to assume equal weighting)
norm = either ‘frob’ (frobenius) or ‘spec’ (spectral)
lbda = lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)
Naa = number of amino acids
kmax = number of eigenvectors to keep for each randomized trial

Returns:

Vrand = eigenvectors for the \(\tilde {C_{ij}^{ab}}\) matrix of the randomized alignment (dimensions: Ntrials*Npos*kmax)
Lrand = eigenvalues for the \(\tilde {C_{ij}^{ab}}\) matrix of the randomized alignment (dimensions: Ntrials*Npos)

Example :	>>> Vrand, Lrand, Crand = randomize(msa_num, 10, seqw, Naa=20, kmax=6)

scaTools.readAlg(filename)¶

Read in a multiple sequence alignment in fasta format, and return the headers and sequences.

>>> headers, sequences = readAlg(filename) 

scaTools.rotICA(V, kmax=6, learnrate=0.0001, iterations=10000)¶

ICA rotation (using basicICA) with default parameters and normalization of outputs.

Example :	>>> Vica, W = rotICA(V, kmax=6, learnrate=.0001, iterations=10000)

scaMat(alg, seqw=1, norm='frob', lbda=0, freq0=array([ 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,

0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,

0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905]))

Computes the SCA matrix.

Arguments:

alg = A MxL multiple sequence alignment, converted to numeric representation with lett2num

Keyword Arguments:

seqw = A vector of sequence weights (default: uniform weights)
norm = The type of matrix norm used for dimension reduction of the

SCA correlation tensor to a positional correlation matrix. Use ‘spec’ for spectral norm and ‘frob’ for Frobenius norm. The frobenius norm is the default.
lbda = lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)
freq0 = background expectation for amino acid frequencies

Returns:

Cp = the LxL SCA positional correlation matrix
tX = the projected MxL alignment
projMat = the projector

Example :	>>> Csca, tX, projMat = scaMat(alg, seqw, norm='frob', lbda=0.03)

scaTools.seqProj(msa_num, seqw, kseq=15, kica=6)¶

Compute three different projections of the sequences based on eigenvectors of the sequence similarity matrix.

Arguments:

msa_num = sequence alignment (previously converted to numerical representation using lett2num)
seqw = a vector of sequence weights

Keyword Arguments:

kseq = number of eigenvectors to compute
kica = number of independent components to compute

Returns:

Useq[0]/Uica[0] = use no weight
Useq[1]/Uica[1] = use sequence weights
Useq[2]/Uica[2] = use sequence weights and positional weights

Example:

>>> Useq, Uica = sca.seqProj(msa_num, seqw, kseq = 30, kica = 15)

scaTools.seqSim(alg)¶

Take an MxL alignment (converted to numeric representation using lett2num) and compute a MxM matrix of sequence similarities.

Example :	>>> simMat = seqSim(alg)

scaTools.seqWeights(alg, max_seqid=0.8, gaps=1)¶

Compute sequence weights for an alignment (format: list of sequences) where the weight of a sequence is the inverse of the number of sequences in its neighborhood, defined as the sequences with sequence similarity below max_seqid. The sum of the weights defines an effective number of sequences.

Arguments:

alg = list of sequences

Keyword Arguments:

max_seqid
gaps = If gaps == 1 (default), considering gaps as a 21st amino acid, if gaps == 0, not considering them.

Example :	>>> seqw = seqWeights(alg)

scaTools.singleBar(x, loc, cols, width=0.5)¶

Single bar diagram, called by MultiBar.

Example :	>>> singleBar(x, loc, cols, width=.5)

scaTools.sizeLargestCompo(adjMat)¶

Compute the size of the largest component of a graph given its adjacency matrix. Called by numConnected (Done by actually listing all the components)

Example :	>>> s = sizeLargestCompo(adjMat)

scaTools.svdss(X, k=6)¶

Singular value decomposition for sparse matrices (top k components). The singular values are ordered by decreasing values, the sign of the singular vectors is fixed, and the convention is that X = u.dot(s).dot(v.T):

Example :	>>> u, s ,v = svdss(X, k=6)

scaTools.truncDiag(M, dmax)¶

Set to 0 the elements of a matrix M up to a distance dmax from the diagonal.

Example :	>>> Mtr = truncDiag(M, dmax)

scaTools.weighted_rand_list(weights, Nmax, keepList)¶

Generate a random list of at most Nmax elements with weights (numpy array) but without replacement. Called by randSel.

Example :	>>> selection = weighted_rand_list(weights, Nmax, [iref])

scaTools.weighted_rand_sel(weights)¶

Generate a random index with probability given by input weights. Called by weighted_rand_list.

Example :	>>> index = weighted_rand_sel(weights)

scaTools.writePymol(pdb, sectors, ics, ats, outfilename, chain='A', inpath='Inputs/', quit=1)¶

Write basic a pymol script for displaying sectors and exporting an image.

Example :	>>> writePymol(pdb, sectors, ics, ats, outfilename, chain='A',inpath=path2structures, quit=1)

Previous topic

This Page

scaTools Module¶

Navigation

Previous topic

This Page

Quick search

scaTools Module¶

Navigation