soprano.analyse.phylogen.metrics

Utility functions to compare clusterings and evaluate their similarity.

Functions

confmat(clust1, clust2)

Calculate the confusion matrix for two clusterings of the same collection.

fowles_mallows_index(clust1, clust2)

Calculate the Fowlkes-Mallows index, a measure of similarity between two clusterings.

norm_confmat(clust1, clust2)

Calculate the normalised confusion matrix for two clusterings of the same collection.

soprano.analyse.phylogen.metrics.confmat(clust1, clust2)

Calculate the confusion matrix for two clusterings of the same collection. The confusion matrix is defined so that element i,j contains the number of elements that are in common between cluster i of the first clustering and cluster j of the second:

M[i,j] = len(set(clust1[i]).intersection(clust2[j]))

Args:
    clust1 (list): a series of clusters in the form of a list of slices, like the second value returned by one of the clustering methods in PhylogenCluster.
    clust2 (list): same as above.

Returns:
    confmat (np.ndarray): confusion matrix for the two clusterings.
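
The definition above can be sketched in a few lines of NumPy. This is an illustration only, not Soprano's implementation: the helper name `confusion_matrix_sketch` is hypothetical, and each cluster is assumed to be a plain list of integer indices into the collection, which may differ from the objects PhylogenCluster actually returns.

```python
import numpy as np


def confusion_matrix_sketch(clust1, clust2):
    """Count the members shared by every pair of clusters (illustrative only)."""
    cm = np.zeros((len(clust1), len(clust2)), dtype=int)
    for i, c1 in enumerate(clust1):
        for j, c2 in enumerate(clust2):
            # Element i,j: size of the overlap between cluster i and cluster j
            cm[i, j] = len(set(c1).intersection(c2))
    return cm


# Two hypothetical clusterings of the same six items (indices 0..5)
clust_a = [[0, 1, 2], [3, 4, 5]]
clust_b = [[0, 1], [2, 3], [4, 5]]

cm = confusion_matrix_sketch(clust_a, clust_b)
print(cm)
# [[2 1 0]
#  [0 1 2]]
```

Because both clusterings partition the same collection, the entries of the matrix sum to the number of items, here 6.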
soprano.analyse.phylogen.metrics.fowles_mallows_index(clust1, clust2)

Calculate the Fowlkes-Mallows index, a measure of similarity between two clusterings defined as:

F = (WI*WII)**0.5

with

WI = N11/sum_k(n_k*(n_k-1)/2)

WII = N11/sum_k(n'_k*(n'_k-1)/2)

with N11 being the number of pairs of points that are in the same cluster in both clusterings, and n_k (n'_k) the number of elements in cluster k of the first (second) clustering.

Ref: Fowlkes, E. B.; Mallows, C. L. (1 September 1983). “A Method for Comparing Two Hierarchical Clusterings”. Journal of the American Statistical Association. 78 (383): 553. doi:10.2307/2288117

Args:
    clust1 (list): a series of clusters in the form of a list of slices, like the second value returned by one of the clustering methods in PhylogenCluster.
    clust2 (list): same as above.

Returns:
    fm_ind (float): the Fowlkes-Mallows index.
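
As a worked example of the formula, the sketch below evaluates the index directly from the definitions of N11, WI and WII. It is not Soprano's implementation: the helper names `fowlkes_mallows_sketch` and `n_pairs` are hypothetical, and clusters are assumed to be plain lists of integer indices. N11 is obtained by counting, for each pair of clusters, the pairs of points contained in their overlap.

```python
def n_pairs(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2


def fowlkes_mallows_sketch(clust1, clust2):
    """Evaluate F = (WI*WII)**0.5 from two clusterings (illustrative only)."""
    # N11: pairs of points that share a cluster in both clusterings
    n11 = sum(
        n_pairs(len(set(c1).intersection(c2)))
        for c1 in clust1
        for c2 in clust2
    )
    # Pairs that share a cluster within each clustering taken separately
    pairs1 = sum(n_pairs(len(c)) for c in clust1)
    pairs2 = sum(n_pairs(len(c)) for c in clust2)
    # Note: if either clustering has only single-element clusters, the
    # corresponding denominator is zero; this sketch does not handle that case.
    w1 = n11 / pairs1
    w2 = n11 / pairs2
    return (w1 * w2) ** 0.5


# Two hypothetical clusterings of the same six items (indices 0..5)
clust_a = [[0, 1, 2], [3, 4, 5]]
clust_b = [[0, 1], [2, 3], [4, 5]]

fm = fowlkes_mallows_sketch(clust_a, clust_b)
fm_same = fowlkes_mallows_sketch(clust_a, clust_a)
print(fm, fm_same)
```

For these inputs N11 = 2, the two denominators are 6 and 3, so WI = 1/3, WII = 2/3 and F = (2/9)**0.5, roughly 0.471. Comparing a clustering with itself gives F = 1.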
soprano.analyse.phylogen.metrics.norm_confmat(clust1, clust2)

Calculate the normalised confusion matrix for two clusterings of the same collection. The confusion matrix is defined as in the docstring of confmat. For the normalisation, each element i,j is divided by the geometric mean of the sizes of cluster i of the first clustering and cluster j of the second:

NM[i,j] = M[i,j]/(len(clust1[i])*len(clust2[j]))**0.5

Args:
    clust1 (list): a series of clusters in the form of a list of slices, like the second value returned by one of the clustering methods in PhylogenCluster.
    clust2 (list): same as above.

Returns:
    nconfmat (np.ndarray): normalised confusion matrix for the two clusterings.
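
The normalisation can be illustrated with a short NumPy sketch. As before, this is an assumption-laden illustration rather than Soprano's code: `norm_confusion_matrix_sketch` is a hypothetical name, and clusters are assumed to be plain lists of integer indices.

```python
import numpy as np


def norm_confusion_matrix_sketch(clust1, clust2):
    """Overlap counts divided by the geometric mean of the cluster sizes."""
    # Plain confusion matrix: shared members for each pair of clusters
    cm = np.array(
        [[len(set(c1).intersection(c2)) for c2 in clust2] for c1 in clust1],
        dtype=float,
    )
    sizes1 = np.array([len(c) for c in clust1], dtype=float)
    sizes2 = np.array([len(c) for c in clust2], dtype=float)
    # np.outer builds the matrix of size products len(c1)*len(c2)
    return cm / np.sqrt(np.outer(sizes1, sizes2))


# Two hypothetical clusterings of the same six items (indices 0..5)
clust_a = [[0, 1, 2], [3, 4, 5]]
clust_b = [[0, 1], [2, 3], [4, 5]]

ncm = norm_confusion_matrix_sketch(clust_a, clust_b)
ncm_same = norm_confusion_matrix_sketch(clust_a, clust_a)
print(ncm)
print(ncm_same)
```

With this normalisation every element lies between 0 and 1, and comparing a clustering with itself yields the identity matrix, which makes the result easier to read than raw counts when cluster sizes differ.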