soprano.analyse.phylogen.phylogenclust#
Phylogenetic clustering class definitions
Classes
- class soprano.analyse.phylogen.phylogenclust.PhylogenCluster(coll, genes=None, norm_range=(0.0, 1.0), norm_dist=1.0)[source]#
Bases: object
An object that, given an AtomsCollection and a series of “genes” and weights, will build clusters out of the structures in the collection based on their reciprocal positions as points in a multi-dimensional space defined by those “genes”.
Initialize the PhylogenCluster object.
Args:
    coll (AtomsCollection): an AtomsCollection containing the structures that should be classified. This will be copied and frozen for the entirety of the life of this instance; in order to operate on a modified collection, a new PhylogenCluster should be created.
    genes (list[tuple], str, file): list of the genes that should be loaded immediately; each gene comes in the form of a tuple (name (str), weight (float), params (dict)). A path or open file can also be passed for a .gene file, from which the values will be loaded.
    norm_range (list[float?]): ranges to constrain the values of single genes in between. Default is (0, 1). A value of “None” in either place can be used to indicate no normalization on one or both sides.
    norm_dist (float?): value to normalize distance genes to. These are the genes that only make sense on pairs of structures. Their minimum value is always 0. This number would become their maximum value, or can be set to None to avoid normalization.
- create_mapping(method='total-principal')[source]#
Return an array of 2-dimensional points representing a reduced dimensionality mapping of the given genes using the algorithm of choice. All algorithms are described in [W. Siedlecki et al., Patt. Recog. vol. 21, num. 5, pp. 411-429 (1988)].
Args:
    method (str): can be one of the following algorithms:
        - total-principal (default)
        - clafic
        - fukunaga-koontz
        - optimal-discriminant
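As a rough illustration of what a reduced-dimensionality mapping produces, the sketch below projects a made-up matrix of genome vectors onto its two leading principal components using plain NumPy. This is ordinary PCA on the total covariance, in the spirit of the total-principal option but not taken from Soprano's implementation; the array `gvecs` and the helper `pca_map_2d` are assumptions introduced for the example.

```python
import numpy as np

def pca_map_2d(gvecs):
    """Project row vectors onto their two leading principal components."""
    X = np.asarray(gvecs, dtype=float)
    Xc = X - X.mean(axis=0)              # centre each gene column
    cov = np.cov(Xc, rowvar=False)       # covariance between gene columns
    evals, evecs = np.linalg.eigh(cov)   # eigenvalues come out in ascending order
    top2 = evecs[:, ::-1][:, :2]         # directions of largest and second largest variance
    return Xc @ top2                     # one 2D point per structure

# Hypothetical genome vectors: 6 structures, 3 gene components each
rng = np.random.default_rng(0)
gvecs = rng.random((6, 3))
points = pca_map_2d(gvecs)
print(points.shape)
```

Each row of `points` can then be plotted as one structure on a 2D scatter plot.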
- get_cluster_stats(clusters, raw=False)[source]#
Compute average values and standard deviation for each gene within a given clustering.
Args:
    clusters (tuple): the clustering in tuple form, as returned by one of the get_clusters methods.
    raw (bool): if True, return average and standard deviation of raw instead of normalised gene values. Default is False.
Returns:
    avgs (np.ndarray): 2D array of average values of each gene for each cluster.
    stds (np.ndarray): 2D array of standard deviations of each gene for each cluster.
    genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the arrays
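The per-cluster statistics described here amount to grouping rows of a gene array by cluster index and reducing each group. A minimal NumPy sketch of that idea follows; `cluster_stats`, `gvecs` and `cluster_ids` are hypothetical names, and the data is invented.

```python
import numpy as np

def cluster_stats(gvecs, cluster_ids):
    """Mean and standard deviation of each gene column, per cluster."""
    gvecs = np.asarray(gvecs, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    labels = np.unique(cluster_ids)   # sorted cluster labels
    avgs = np.array([gvecs[cluster_ids == c].mean(axis=0) for c in labels])
    stds = np.array([gvecs[cluster_ids == c].std(axis=0) for c in labels])
    return avgs, stds

# Four structures, two gene components, two clusters (labels start at 1)
gvecs = np.array([[0.0, 1.0], [0.2, 1.0], [1.0, 0.0], [0.8, 0.2]])
cluster_ids = [1, 1, 2, 2]
avgs, stds = cluster_stats(gvecs, cluster_ids)
print(avgs)
print(stds)
```

Both returned arrays have one row per cluster and one column per gene component. The same reduction applies whether the input holds raw or normalised values.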
- get_clusters(method, params={})[source]#
Wrapper method to get clusters by any available method. Depending on the value passed as ‘method’ it calls either get_hier_clusters, get_kmeans_clusters, or get_sklearn_clusters. Check their respective docstrings for more detailed info.
Args:
    method (str): name of the clustering method to use. Can be ‘hier’, ‘kmeans’, or one of the methods in sklearn.cluster.
    params (dict): parameters to be passed to the class when initialising it. Change depending on the desired method. Check the documentation for the specific class.
Returns:
    clusters (tuple(list[int],list[slices])): list of cluster index for each structure (counting from 1) and list of slices defining the clusters as formed by the requested algorithm.
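To make the returned tuple easier to picture, the sketch below builds a comparable pair from a list of 1-based cluster indices: the indices themselves, plus one group of structure positions per cluster. The exact type Soprano uses for the "slices" is not stated here, so plain lists of positions are an assumption, as is the helper name `cluster_members`.

```python
def cluster_members(cluster_ids):
    """Group structure positions by cluster label (labels start at 1)."""
    groups = {}
    for position, label in enumerate(cluster_ids):
        groups.setdefault(label, []).append(position)
    # One list of positions per cluster, ordered by cluster label
    return [groups[label] for label in sorted(groups)]

cluster_ids = [1, 2, 1, 3, 2]          # cluster index of each structure
members = cluster_members(cluster_ids)
clusters = (cluster_ids, members)      # same two-part shape as described above
print(clusters)
```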
- get_distmat()[source]#
Get the distance matrix between structures in the collection, based on the genes currently in use.
Returns:
    distmat (np.ndarray): a (collection.length, collection.length) array, containing the overall distance (the norm of all individual gene distances) between all pairs of structures.
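One way to read "the norm of all individual gene distances" is as a Euclidean norm over every gene component, combining point-like genes (vectors) with pair-only genes (matrices). The sketch below takes that reading as an assumption; `overall_distmat`, `gvecs` and `gmats` are hypothetical names and the numbers are invented.

```python
import numpy as np

def overall_distmat(gvecs, gmats):
    """Combine vector genes and distance-matrix genes into one distance matrix.

    gvecs: (n, m) array, one point per structure.
    gmats: (n, n, k) array, k pairwise distance matrices.
    """
    diff = gvecs[:, None, :] - gvecs[None, :, :]   # (n, n, m) pairwise differences
    squared = (diff ** 2).sum(axis=-1) + (gmats ** 2).sum(axis=-1)
    return np.sqrt(squared)

gvecs = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
gmats = np.zeros((3, 3, 1))
gmats[0, 1, 0] = gmats[1, 0, 0] = 4.0   # a pair-only gene separating structures 0 and 1
distmat = overall_distmat(gvecs, gmats)
print(distmat)
```

The result is symmetric with a zero diagonal, as expected of a distance matrix.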
- get_elbow_plot(method='kmeans', param_name='n', param_range=range(1, 11))[source]#
Returns data for an elbow plot by scanning the outcome of a given clustering method within a range of values for a chosen parameter. Used to determine optimal parameter values.
Args:
    method (str): name of the clustering method to use. Can be ‘hier’, ‘kmeans’, or one of the methods in sklearn.cluster. Default is kmeans.
    param_name (str): parameter to be scanned over. Change depending on the desired method. Check the documentation for the specific class. Default is n, number of clusters for k-means method.
    param_range (list): values of param_name to scan over. Default is the integers from 1 to 10.
Returns:
    wss (np.ndarray): values of the “Within cluster Sum of Squares” (WSS) to be used on the elbow plot y axis.
    param_range (list): range used for parameter scan, to be used on the x axis (same as passed by the user).
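The quantity scanned in an elbow plot can be sketched independently of Soprano. The example below runs SciPy's k-means for several cluster counts on invented 2D points and records the within-cluster sum of squares for each; the helper names and the data are assumptions, and the real method may compute WSS differently in detail.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def wss_for_labels(points, labels):
    """Within-cluster sum of squares, measured around each cluster's mean."""
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def elbow_scan(points, n_range):
    """Return WSS for each number of clusters in n_range."""
    wss = []
    for n in n_range:
        centroids, _ = kmeans(points, n)     # fit n centroids
        labels, _ = vq(points, centroids)    # assign each point to its nearest centroid
        wss.append(wss_for_labels(points, labels))
    return np.array(wss), list(n_range)

# Two well separated blobs, so the elbow is expected at n = 2
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                    rng.normal(3.0, 0.1, (20, 2))])
wss, n_values = elbow_scan(points, range(1, 5))
print(n_values, wss)
```

Plotting `wss` against `n_values` gives the elbow curve; the value after which the curve flattens is the usual candidate for the parameter.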
- get_genome_matrices()[source]#
Return the genome matrices in raw form (not normalized). The matrices refer to genes that can only define a distance between pairs of structures. The element at i,j represents the distance between structures i and j. Each matrix is symmetric and has a null diagonal.
Returns:
    genome_matrix (np.ndarray): a (collection.length, collection.length, gene.length) array, containing the distances for each gene and pair of structures in row and column
    genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the array
- get_genome_matrices_norm()[source]#
Return the genome matrices in normalized and weighted form. The matrices refer to genes that can only define a distance between pairs of structures. The element at i,j represents the distance between structures i and j. Each matrix is symmetric and has a null diagonal.
Returns:
    genome_matrix (np.ndarray): a (collection.length, collection.length, gene.length) array, containing the distances for each gene and pair of structures in row and column
    genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the array
- get_genome_vectors()[source]#
Return the genome vectors in raw form (not normalized). The vectors refer to genes that define a specific point for each structure.
Returns:
    genome_vectors (np.ndarray): a (collection.length, gene.length) array, containing the whole extent of the gene values for each structure in the collection on each row
    genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the array
- get_genome_vectors_norm()[source]#
Return the genome vectors in normalized and weighted form. The vectors refer to genes that define a specific point for each structure.
Returns:
    genome_vectors (np.ndarray): a (collection.length, gene.length) array, containing the whole extent of the gene values for each structure in the collection on each row
    genome_legend (list[tuple]): a list of tuples containing (name, length) of the gene fragments in the array
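The difference between the raw and the normalized, weighted vectors can be pictured with a simple min-max rescaling followed by multiplication by the gene weight. The sketch below assumes that order of operations and skips rescaling whenever either bound is None, which is a simplification of the one-sided behaviour described for norm_range; `normalize_gene` is a hypothetical helper, not part of Soprano.

```python
import numpy as np

def normalize_gene(values, weight=1.0, norm_range=(0.0, 1.0)):
    """Min-max rescale one gene's values into norm_range, then apply its weight."""
    values = np.asarray(values, dtype=float)
    low, high = norm_range
    vmin, vmax = values.min(), values.max()
    if low is None or high is None or vmax == vmin:
        # Simplification: no rescaling if a bound is missing or the gene is constant
        return weight * values
    scaled = (values - vmin) / (vmax - vmin)     # now in [0, 1]
    return weight * (low + scaled * (high - low))

raw = [2.0, 4.0, 6.0]
print(normalize_gene(raw))                        # rescaled to (0, 1)
print(normalize_gene(raw, weight=2.0))            # rescaled, then doubled
print(normalize_gene(raw, norm_range=(None, None)))  # left as raw values
```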
- get_hier_clusters(t, method='single')[source]#
Get multiple clusters (in the form of a list of collections) based on the hierarchical clustering methods and the currently set genes.
Calls scipy.cluster.hierarchy.fcluster
Args:
    t (float): minimum distance of separation required to consider two clusters separate. This controls the number of clusters: a smaller value will produce more fine grained clustering. At the limit, a value smaller than the distance between the two closest structures will return a cluster for each structure. Remember that the ‘distances’ in this case refer to distances between the ‘gene’ values attributed to each structure. In other words they are a function of the chosen genes, normalization conditions and weights employed. In addition, the way they are calculated depends on the choice of method.
    method (str): clustering method to employ. Valid entries are ‘single’, ‘complete’, ‘weighted’ and ‘average’. Refer to Scipy documentation for further details.
Returns:
    clusters (tuple(list[int],list[slices])): list of cluster index for each structure (counting from 1) and list of slices defining the clusters as formed by hierarchical algorithm.
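The effect of the threshold t can be seen directly with SciPy on an invented distance matrix. The sketch assumes the linkage is built from a condensed form of the distance matrix and that fcluster is called with the distance criterion; neither detail is confirmed by the text above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix for 4 structures: two tight pairs
distmat = np.array([
    [0.0, 0.1, 1.0, 1.1],
    [0.1, 0.0, 0.9, 1.0],
    [1.0, 0.9, 0.0, 0.2],
    [1.1, 1.0, 0.2, 0.0],
])

# linkage expects the condensed (upper triangle) form of the matrix
Z = linkage(squareform(distmat), method="single")

coarse = fcluster(Z, t=0.5, criterion="distance")   # pairs stay merged
fine = fcluster(Z, t=0.05, criterion="distance")    # below the smallest distance
print(coarse, fine)
```

With t between the within-pair and between-pair distances the two pairs come out as two clusters, while a t smaller than the closest pair yields one cluster per structure. Cluster labels returned by fcluster start at 1.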
- get_hier_tree(method='single')[source]#
Get a tree data structure describing the clustering order, based on the hierarchical clustering methods and the currently set genes.
Calls scipy.cluster.hierarchy.to_tree
Args:
    method (str): clustering method to employ. Valid entries are ‘single’, ‘complete’, ‘weighted’ and ‘average’. Refer to Scipy documentation for further details.
Returns:
    root_node (ClusterNode): the root node of the tree. Access child members with .left and .right, while .id holds the number of the corresponding cluster. Refer to Scipy documentation for further details.
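A short sketch of how such a tree can be walked, using SciPy directly on invented one-dimensional points. The recursive helper `leaves` is hypothetical; `.left`, `.right`, `.id`, `is_leaf()` and `get_count()` are part of SciPy's ClusterNode.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# Four hypothetical structures described by a single gene value each
points = np.array([[0.0], [0.1], [1.0], [1.2]])
Z = linkage(points, method="single")
root = to_tree(Z)

def leaves(node):
    """Collect the ids of the original observations below a node."""
    if node.is_leaf():
        return [node.id]
    return leaves(node.left) + leaves(node.right)

print(root.get_count())          # number of observations under the root
print(sorted(leaves(root)))      # ids of all leaves
```

Leaf ids correspond to positions of the original observations, while internal nodes carry ids assigned during the merge sequence.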
- get_kmeans_clusters(n)[source]#
Get a given number of clusters (in the form of a list of collections) based on the k-means clustering methods and the currently set genes. Warning: this method only works if there are no genes that work only with pairs of structures - as specific points, and not just distances between them, are required for this algorithm.
Calls scipy.cluster.vq.kmeans
Args:
    n (int): the desired number of clusters.
Returns:
    clusters (tuple(list[int],list[slices])): list of cluster index for each structure (counting from 1) and list of slices defining the clusters as formed by k-means algorithm.
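Because k-means needs actual points rather than pairwise distances, it operates on vectors. The sketch below shows the underlying SciPy calls on invented 2D points and shifts the labels so that they count from 1, as the return value above describes; how Soprano post-processes the labels is an assumption.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Hypothetical genome vectors: two groups of three structures
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

centroids, distortion = kmeans(points, 2)   # fit two centroids
labels, _ = vq(points, centroids)           # nearest centroid for each point (0-based)
cluster_ids = labels + 1                    # shift to 1-based cluster indices
print(cluster_ids)
```

Which group receives label 1 depends on the random initialisation, so only the grouping itself is meaningful.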
- get_linkage(method='single')[source]#
Get the linkage matrix between structures in the collection, based on the genes currently in use. Only used in hierarchical clustering.
Calls scipy.cluster.hierarchy.linkage.
Args:
    method (str): clustering method to employ. Valid entries are ‘single’, ‘complete’, ‘weighted’ and ‘average’. Refer to Scipy documentation for further details.
Returns:
    Z (np.ndarray): linkage matrix for the structures in the collection. Refer to Scipy documentation for details about the method
- get_sklearn_clusters(method, params={})[source]#
Get clusters applying any of the methods provided by the library scikit-learn (requires a separate installation). Warning: this method only works if there are no genes that work only with pairs of structures - as use of pairwise clustering methods is not implemented yet.
Uses the sklearn.cluster.<method> class
Args:
    method (str): name of the clustering class from sklearn.cluster to use. For reference check the documentation at
    params (dict): parameters to be passed to the class when initialising it. Change depending on the desired method. Check the documentation for the specific class.
Returns:
    clusters (tuple(list[int],list[slices])): list of cluster index for each structure (counting from 1) and list of slices defining the clusters as formed by the requested algorithm.
- save_collection(filename)[source]#
Save as pickle the collection bound to this PhylogenCluster. The calculated genes are also stored in it as arrays for future use.
- set_genes(genes, load_arrays=False)[source]#
Calculate, store and set a list of genes as used for clustering.
Args:
    genes (list[soprano.analyse.phylogen.Gene], file, str): a list of Genes to calculate and store. A path or open file can also be passed for a .gene file, from which the values will be loaded.
    load_arrays (bool): try loading the genes as arrays from the collection before generating them. Warning: if there are arrays named like genes but with different contents this can lead to unpredictable results.