#### Previous topic

scaSectorID Module

# scaTools Module¶

pySCA - A SCA toolbox in Python

On: August 2014 6.1 beta

class scaTools.Annot(descr, species, taxo, seq='')

A class for annotating sequences

Attributes : descr = description (often the sequence header) species = species string taxo = taxonomy string seq = sequence
scaTools.AnnotPfam(pfam_in, pfam_out, pfam_seq='~/Documents/Packages/pfamseq.txt')

Phylogenetic annotation of a Pfam alignment (in fasta format) using information from pfamseq.txt (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/database_files/). The output is a fasta file containing phylogenetic annotations in the header (to be parsed with ‘|’ as a delimiter).

Note: the headers for the original alignment take the form >AAA/x-y. If two entries have same AAA but correspond to different sequences only one of the two sequences will be represented (twice) in the output - this should however not practically be an issue.

Arguments : Keyword Arguments: input PFAM sequence alignment output file name for the annotated PFAM alignment pfam_seq = path to the file pfamseq.txt
scaTools.MSAsearch(hd, algn, seq, species=None, path2_algprog='/usr/local/bin/')

Identify the sequence in the alignment that most closely corresponds to the species of the reference sequence, and return its index.

Arguments:
• alignment sequences
• selected reference sequence (often from a PDB file)
Keyword Arguments:
• species = species of the reference sequence (Used to speed up alignment searching when possible)
• path2_algprog = path to an alignment program
Example:
>>> strseqnum = MSASearch(hd, alg0, pdbseq, 'Homo sapiens')

scaTools.MultiBar(x, colors='wbrgymc', width=0.5)

Multiple bar diagram (plots contributions to each bar from different elements in x as different colors). This can be useful if you’d like to inspect how sector positions are distributed among independent components/eigenmodes. The argument x is a tuple, specifying the number of elements in each bar to be each color. The example below makes a graph with four bars, where the first bar has 99 white elements, 1 blue element and 1 red element.

Example : >>> x = [[99, 1, 1], [6, 13, 2], [0, 0, 13], [1, 7, 5]] >>> sca.MultiBar(x) 
class scaTools.Pair(p, x, d)

A class for a pair of positions. :Attributes:

• pos = a pair of amino acid positions (ex: [1,3], supplied as argument p)
• DI = the direct information between the two positions (argument x)
• dist = the physical distance between the positions (argument d)
class scaTools.Secton(positions)

A class for sectons.

Attributes:
• pos = a list of positions
• num = number of positions in the secton
connected(distmat, threshold)

Check the structural connectivity based on the principle that if $$M_{ij}$$ is the adjacency matrix of a graph, $$M^n_{ij}$$ is the number of paths of length $$n$$ between i and j, which must be > 0 for $$n$$ = number of nodes when i and j are in the same connected component.

dist(distmat)

returns the distance between the position pair

class scaTools.Unit

A class for units (sectors, sequence families, etc.)

Attributes:
• name = string describing the unit (ex: ‘firmicutes’)
• items = set of member items (ex: indices for all firmicutes sequences in an alignment)
• col = color code associated to the unit (for plotting)
• vect = an additional vector describing the member items (ex: a list of sequence weights)
scaTools.alg2bin(alg, N_aa=20)

Translate an alignment of size M x L where the amino acids are represented by numbers between 0 and N_aa (obtained using lett2num) to a sparse binary array of size M x (N_aa x L).

Example : >>> Abin = alg2bin(alg, N_aa=20) 
scaTools.basicICA(x, r, Niter)

Basic ICA algorithm, based on work by Bell & Sejnowski (infomax). The input data should preferentially be sphered, i.e., x.T.dot(x) = 1

Arguments:
• x = LxM input matrix where L = # features and M = # samples
• r = learning rate / relaxation parameter (e.g. r=.0001)
• Niter = number of iterations (e.g. 1000)
Returns:
• w = unmixing matrix
• change = record of incremental changes during the iterations.

Note: r and Niter should be adjusted to achieve convergence, which should be assessed by visualizing ‘change’ with plot(range(iter) ,change)

Example:
>>> [w, change] = basicICA(x, r, Niter)

scaTools.chooseKpos(Lsca, Lrand)

Given the eigenvalues of the sca matrix (Lsca), and the eigenvalues for the set of randomized matrices (Lrand), return the number of significant eigenmodes.

scaTools.chooseRefSeq(alg)

This function chooses a default reference sequence if none is given by taking the sequence which has the mean pairwise sequence identity closest to that of the entire alignment.

Example : >>> i_ref = chooseRefSeq(msa_num) 
scaTools.clean_al(alg, code='ACDEFGHIKLMNPQRSTVWY', gap='-')

Replaces any character that is not a valid amino acid by a gap.

Arguments : Keyword Arguments: amino acid sequence alignment code = list of valid amino acid characters (case sensitive) gap = gap character for replacement
scaTools.cytoscapeOut(ats, cutoff, Csca, Di, sectors, Vp, outfilename)

Output tab-delimited text that can be read in by cytoscape. The goal is to enable graph representations of the SCA couplings, where residues are nodes, and couplings are edges. Within cytoscape, the graph can be color-coded or weighted by Csca, Di, sector definition or Vp.

Example : >>> cytoscapeOut(ats, cutoff, Csca, Di, sectors, Vp, outfilename) 
scaTools.dirInfoFromJ(i, j, Jmat, frq, Naa=20, epsilon=0.0001)

Direct information from the matrix of couplings $$J_{ij}$$ (called by directInfo). Ref: Marcos et al, PNAS 2011, 108: E1293-E1301

Arguments:
• i = position 1
• j = position 2
• Jmat = coupling matrix
• frq = frequency
• Naa = number of amino acids
Example : >>> DI = dirInfoFromJ(i, j, Jmat, frq, Naa=20, epsilon=1e-4) 
directInfo(freq1, freq2, lbda=0.5, freq0=array([ 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,
0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,
0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,
0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905]), Naa=20)

Calculate direct information as in the Direct Coupling Analysis (DCA) method proposed by M. Weigt et collaborators (Ref: Marcos et al, PNAS 2011, 108: E1293-E1301).

Example : >>> DI = directInfo(freq1, freq2, lbda=.5, freq0=np.ones(20)/21, Naa=20) 
scaTools.eigenVect(M)

Return the eigenvectors and eigenvalues, ordered by decreasing values of the eigenvalues, for a real symmetric matrix M. The sign of the eigenvectors is fixed so that the mean of its components is non-negative.

Example : >>> eigenVectors, eigenValues = eigenVect(M) 
scaTools.figColors()

Color code for figUnits.

>>> figColors()

scaTools.figMapping(Csca, tX, kpos, sectors, subfam)

Function that automates finding the top $$k_{pos}$$ independent components, projection, and plotting. Useful to get a representation of the sectors/subfamilies mapping for a given kpos. Arguments:

• Csca = the sca matrix
• tX = the projected alignment
• kpos = the number of independent components to consider
• sectors = list of Unit elements for each sector
• subfam = list of Unit elements for each sequence family
Returns:
• Vpica = the independent components of Csca
>>> Vpica = figMapping(Csca, tX, kpos, sectors, subfam)

scaTools.figStruct(pdbid, sectors, ats, chainid='A', outfile='Outputs/sectors.pml', quit=1, pymol='/Applications/MacPyMOL.app/Contents/MacOS/MacPyMOL')

Make and display an image of the sectors (within a python notebook). By default quit PyMol after running it, unless the option ‘quit=0’ is given. The default name and location of the output can also be changed.

Example : >>> figStruct(pdbid, sectors, ats, chainid='A', outfile = 'Outputs/sectors.pml', quit=1, pymol = path2pymol) 
scaTools.figUnits(v1, v2, units, marker='o', dotsize=9, notinunits=1)

2d scatter plot specified by ‘units’, which must be a list of elements in the class Unit. See figColors for the color code. Admissible color codes are in [0 1] (light/dark gray can also be obtained by using -1/+1). For instance: 0->red, 1/3->green, 2/3-> blue.

Arguments:
• v1 = xvals
• v2 = yvals
• units = list of elements in units
Keyword Arguments
• marker = plot marker symbol

• dotsize = specify marker/dotsize

• notinunits = if set to 1 : the elements not in a unit are represented in white, if set to 0

these elements are not represented, if set to [w1,w2] : elements with coordinates w1,w2 are represented in white in the background.

Example : >>> figUnits(v1, v2, units, marker='o', gradcol=0, dotsize=9, notinunits=1) 
scaTools.figWeights(U1, U2, weight)

A 2d scatter plot with color indicating weight.

>>> figWeights(U1, U2, weight)

scaTools.filterPos(alg, seqw=[1], max_fracgaps=0.2)

Truncate the positions of an input alignment to reduce gaps, taking into account sequence weights.

Arguments:
• alg = An MxL list of sequences
Keyword Arguments:
• seqw = vector of sequence weights (default is uniform weights)
• max_fracgaps = maximum fraction of gaps allowed at a position
Returns:
• alg_tr = the truncated alignment
• selpos = the index of retained positions (indices start at 0 for the first position)
Example : >>> alg_tr, selpos = filterPos(alg, seqw, max_fracgaps=.2) 
scaTools.filterSeq(alg0, sref=0.5, max_fracgaps=0.2, min_seqid=0.2, max_seqid=0.8)

Take in an alignment (alg0, assumed to be filtered to remove highly gapped positions), a reference sequence, the maximum fraction of gaps allowed per sequence (max_fracgaps), the minimum and maximum sequence identities to the reference sequence (min_seqid and max_seqid), and return (1) alg, the alignment filtered to remove sequences with more than max_fracgaps (i.e. partial seqs), (2) seqw, a vector of weights for each sequence, (3) seqkeep, the indices of the original alignment (alg0) retained in alg:

Note: if sref is set to 0.5, filterSeq calls chooseRefSeq to automatically select a
reference sequence.
Example:
>>> alg, seqw, seqkeep = filterSeq(alg0, iref, max_fracgaps=.2, min_seqid=.2, max_seqid=.8)

freq(alg, seqw=1, Naa=20, lbda=0, freq0=array([ 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,
0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,
0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,
0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905]))

Compute amino acid frequencies for a given alignment.

Arguments:
• alg = a MxL sequence alignment (converted using lett2num)
Keyword Arguments:
• seqw = a vector of sequence weights (1xM)
• Naa = the number of amino acids
• lbda = lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)
• freq0 = expected average frequency of amino acids at all positions
Returns:
• freq1 = the frequencies of amino acids at each position taken independently (Naa*L)
• freq2 = the joint frequencies of amino acids at pairs of positions (freq2, Naa*L * Naa*L)
• freq0 = the average frequency of amino acids at all positions (Naa)
Example : >>> freq1, freq2, freq0 = freq(alg, seqw, lbda=lbda) 
scaTools.icList(Vpica, kpos, Csca, p_cut=0.95)

Produces a list of positions contributing to each independent component (IC) above a defined statistical cutoff (p_cut, the cutoff on the CDF of the t-distribution fit to the histogram of each IC). Any position above the cutoff on more than one IC are assigned to one IC based on which group of positions to which it shows a higher degree of coevolution. Additionally returns the numeric value of the cutoff for each IC, and the pdf fit, which can be used for plotting/evaluation. icList, icsize, sortedpos, cutoff, pd = icList(Vsca,Lsca,Lrand)

scaTools.lett2num(msa_lett, code='ACDEFGHIKLMNPQRSTVWY')

Translate an alignment from a representation where the 20 natural amino acids are represented by letters to a representation where they are represented by the numbers 1,...,20, with any symbol not corresponding to an amino acid represented by 0.

Example : >>> msa_num = lett2num(msa_lett, code='ACDEFGHIKLMNPQRSTVWY') 
scaTools.makeATS(sequences, refpos, refseq, iref=0, truncate=False)

If specified, truncate the alignment to the structure (assumes MSAsearch has already been run to identify the reference sequence (iref)) and produce a mapping (ats) between alignment positions and the positions in the reference sequence (refpos).

Arguments:
• sequences
• reference positions
• reference sequence
• iref, the index of the sequence in the alignment with the highest identity to the reference
Keyword Arguments:

truncate – truncate the alignment to positions that align to the reference sequence.

Example :
>>> sequences_trun, ats_new = sca.makeATS(sequences_full, ats_pdb, seq_pdb, i_ref)

numConnected(Vp, k, distmat, eps_list=array([ 0.5 , 0.49, 0.48, 0.47, 0.46, 0.45, 0.44, 0.43, 0.42,
0.41, 0.4 , 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33,
0.32, 0.31, 0.3 , 0.29, 0.28, 0.27, 0.26, 0.25, 0.24,
0.23, 0.22, 0.21, 0.2 , 0.19, 0.18, 0.17, 0.16, 0.15,
0.14, 0.13, 0.12, 0.11, 0.1 , 0.09, 0.08, 0.07, 0.06,
0.05, 0.04, 0.03, 0.02, 0.01]), dcontact=5)

Calculates the number of positions in the largest connected component for groups of positions i with $$V_p[i,k] > eps$$ and $$V_p[i,k] > V_p[i,kk]$$, for $$kk != k$$ and eps in eps_list. Useful for looking evaluating the physical connectivity of different sectors or sub-sectors.

Arguments:
• Vp = A set of eigenvectors or independent components
• k = The eigenvector or independent component to consider
• distmat = Distance matrix (computed by pdbSeq)
Keyword Arguments:
• eps_list = the range of values of eps for which the group is non-empty
• dcontact = the distance cutoff for defining contacts
Example:
>>> eps_range, num_co, num_tot = numConnected(Vp, k, distmat, eps_list = np.arange(.5,0,-.01), dcontact=8)

scaTools.pdbSeq(pdbid, chain='A', path2pdb='Inputs/', calcDist=1)

Extract sequence, position labels and matrix of distances from a PDB file.

Arguments:
• pdbid = PDB identifier (four letters/numbers)
• chain = PDB chain identifier
• path2pdb = location of the PDB file
• calcDist = calculate a distance matrix between all pairs of positions, default is 1
Example : >>> sequence, labels, dist = pdbSeq(pdbid, chain='A', path2pdb=path2structures) 
posWeights(alg, seqw=1, lbda=0, freq0=array([ 0.073, 0.025, 0.05 , 0.061, 0.042, 0.072, 0.023, 0.053,
0.064, 0.089, 0.023, 0.043, 0.052, 0.04 , 0.052, 0.073,
0.056, 0.063, 0.013, 0.033]))

Compute single-site measures of conservation, and the sca position weights, $$\frac {\partial {D_i^a}}{\partial {f_i^a}}$$

Arguments:
• alg = MSA, dimensions MxL, converted to numerical representation with lett2num
• seqw = a vector of M sequence weights (default is uniform weighting)
• lbda = pseudo-counting frequencies, default is no pseudocounts
• freq0 = background amino acid frequencies $$q_i^a$$
Returns:
• Wia = positional weights from the derivation of a relative entropy, $$\frac {\partial {D_i^a}}{\partial {f_i^a}}$$ (Lx20)
• Dia = the relative entropy per position and amino acid (Lx20)
• Di = the relative entropy per position (L)
Example : >>> Wia, Dia, Di = posWeights(alg, seqw=1,freq0) 
scaTools.projAlg(alg, Proj)

Projection of an alignment (alg) based on a projector (Proj). The input alignment should already be converted to numeric representation using lett2num.

Example : >>> tX = projAlg(msa_num, Proj) 
scaTools.projUica(msa_ann, msa_num, seqw, kica=6)

Compute the projection of an alignment (msa_ann) on the kpos ICA components of the sequence space of another (msa_num, seqw).This is useful to compare the sequence space of one alignment to another.

Example : >>> Uica_ann, Uica = projUpica(msa_ann, msa_num_ seqw, kica=6) 
scaTools.projUpica(msa_ann, msa_num, seqw, kpos)

Compute the projection of an alignment (msa_ann) on the kpos ICA components of the SCA matrix of another (msa_num, seqw). This is useful to compare the sequence space (as projected by the positional correlations) of one alignment to another.

Example : >>> Upica_ann, Upica = projUpica(msa_ann, msa_num_ seqw, kpos) 
scaTools.randAlg(frq, Mseq)

Generate a random alignment with Mseq sequences based on the frequencies frq[i,a] of amino acids with a = 0,1,...,Naa (0 for gaps).

Example : >>> msa_rand = randAlg(frq, Mseq) 
scaTools.randSel(seqw, Mtot, keepSeq=[])

Random selection of Mtot sequences, drawn with weights and without replacement. The seed for the random number generator is fixed to ensure reproducibility.

Arguments:
• seqw = the sequence weights
• Mtot = the total number of sequences for selection
Keyword Arguments:
• keepSeq = an (optional) list of sequnces to keep. This can be useful if you would like to

retain the reference sequence for example.

>>> selection = randSel(seqw, Mtot, [iref])

scaTools.randomize(msa_num, Ntrials, seqw=1, norm='frob', lbda=0, Naa=20, kmax=6)

Randomize the alignment while preserving the frequencies of amino acids at each position and compute the resulting spectrum of the SCA matrix.

Arguments:
• msa_num = a MxL sequence alignment (converted to numerical representation using lett2num)
• Ntrials = number of trials
• seqw = vector of sequence weights (default is to assume equal weighting)
• norm = either ‘frob’ (frobenius) or ‘spec’ (spectral)
• lbda = lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)
• Naa = number of amino acids
• kmax = number of eigenvectors to keep for each randomized trial
Returns:
• Vrand = eigenvectors for the $$\tilde {C_{ij}^{ab}}$$ matrix of the randomized alignment (dimensions: Ntrials*Npos*kmax)
• Lrand = eigenvalues for the $$\tilde {C_{ij}^{ab}}$$ matrix of the randomized alignment (dimensions: Ntrials*Npos)
Example : >>> Vrand, Lrand, Crand = randomize(msa_num, 10, seqw, Naa=20, kmax=6) 

Read in a multiple sequence alignment in fasta format, and return the headers and sequences.

>>> headers, sequences = readAlg(filename)

scaTools.rotICA(V, kmax=6, learnrate=0.0001, iterations=10000)

ICA rotation (using basicICA) with default parameters and normalization of outputs.

Example : >>> Vica, W = rotICA(V, kmax=6, learnrate=.0001, iterations=10000) 
scaMat(alg, seqw=1, norm='frob', lbda=0, freq0=array([ 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,
0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,
0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905,
0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905]))

Computes the SCA matrix.

Arguments:
• alg = A MxL multiple sequence alignment, converted to numeric representation with lett2num
Keyword Arguments:
• seqw = A vector of sequence weights (default: uniform weights)

• norm = The type of matrix norm used for dimension reduction of the

SCA correlation tensor to a positional correlation matrix. Use ‘spec’ for spectral norm and ‘frob’ for Frobenius norm. The frobenius norm is the default.

• lbda = lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)

• freq0 = background expectation for amino acid frequencies

Returns:
• Cp = the LxL SCA positional correlation matrix
• tX = the projected MxL alignment
• projMat = the projector
Example : >>> Csca, tX, projMat = scaMat(alg, seqw, norm='frob', lbda=0.03) 
scaTools.seqProj(msa_num, seqw, kseq=15, kica=6)

Compute three different projections of the sequences based on eigenvectors of the sequence similarity matrix.

Arguments:
• msa_num = sequence alignment (previously converted to numerical representation using lett2num)
• seqw = a vector of sequence weights
Keyword Arguments:
• kseq = number of eigenvectors to compute
• kica = number of independent components to compute
Returns:
• Useq[0]/Uica[0] = use no weight
• Useq[1]/Uica[1] = use sequence weights
• Useq[2]/Uica[2] = use sequence weights and positional weights
Example:
>>> Useq, Uica = sca.seqProj(msa_num, seqw, kseq = 30, kica = 15)

scaTools.seqSim(alg)

Take an MxL alignment (converted to numeric representation using lett2num) and compute a MxM matrix of sequence similarities.

Example : >>> simMat = seqSim(alg) 
scaTools.seqWeights(alg, max_seqid=0.8, gaps=1)

Compute sequence weights for an alignment (format: list of sequences) where the weight of a sequence is the inverse of the number of sequences in its neighborhood, defined as the sequences with sequence similarity below max_seqid. The sum of the weights defines an effective number of sequences.

Arguments:
• alg = list of sequences
Keyword Arguments:
• max_seqid
• gaps = If gaps == 1 (default), considering gaps as a 21st amino acid, if gaps == 0, not considering them.
Example : >>> seqw = seqWeights(alg) 
scaTools.singleBar(x, loc, cols, width=0.5)

Single bar diagram, called by MultiBar.

Example : >>> singleBar(x, loc, cols, width=.5) 

Compute the size of the largest component of a graph given its adjacency matrix. Called by numConnected (Done by actually listing all the components)

Example : >>> s = sizeLargestCompo(adjMat) 
scaTools.svdss(X, k=6)

Singular value decomposition for sparse matrices (top k components). The singular values are ordered by decreasing values, the sign of the singular vectors is fixed, and the convention is that X = u.dot(s).dot(v.T):

Example : >>> u, s ,v = svdss(X, k=6) 
scaTools.truncDiag(M, dmax)

Set to 0 the elements of a matrix M up to a distance dmax from the diagonal.

Example : >>> Mtr = truncDiag(M, dmax) 
scaTools.weighted_rand_list(weights, Nmax, keepList)

Generate a random list of at most Nmax elements with weights (numpy array) but without replacement. Called by randSel.

Example : >>> selection = weighted_rand_list(weights, Nmax, [iref]) 
scaTools.weighted_rand_sel(weights)

Generate a random index with probability given by input weights. Called by weighted_rand_list.

Example : >>> index = weighted_rand_sel(weights) 
scaTools.writePymol(pdb, sectors, ics, ats, outfilename, chain='A', inpath='Inputs/', quit=1)

Write basic a pymol script for displaying sectors and exporting an image.

Example : >>> writePymol(pdb, sectors, ics, ats, outfilename, chain='A',inpath=path2structures, quit=1)