K-Sample

Nonpar MANOVA via Independence Testing

class hyppo.ksample.KSample(indep_test, compute_distance='euclidean', bias=False, **kwargs)[source]

Class for calculating the k-sample test statistic and p-value.

A k-sample test tests equality in distribution among groups. Groups can be of different sizes, but generally have the same dimensionality. There are not many non-parametric k-sample tests, but this version cleverly leverages the power of some of the implemented independence tests to test this equality of distribution.

Parameters:
indep_test : {"CCA", "Dcorr", "HHG", "RV", "Hsic", "MGC"}

A string corresponding to the desired independence test from mgc.independence. This is not case sensitive.

compute_distance : callable(), optional (default: "euclidean")

A function that computes the distance among the samples within each data matrix. Valid strings for metric are, as defined in sklearn.metrics.pairwise_distances,

  • From scikit-learn: [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’] See the documentation for scipy.spatial.distance for details on these metrics.
  • From scipy.spatial.distance: [‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’] See the documentation for scipy.spatial.distance for details on these metrics.

Set to None or precomputed if x and y are already distance matrices. To call a custom function, either create the distance matrix before-hand or create a function of the form metric(x, **kwargs) where x is the data matrix for which pairwise distances are calculated and kwargs are extra arguements to send to your custom function.

bias : bool (default: False)

Whether or not to use the biased or unbiased test statistics. Only applies to Dcorr and Hsic.

Notes

The formulation for this implementation is as follows [1]:

The k-sample testing problem can be thought of as a generalization of the two sample testing problem. Define \(\{ u_i \stackrel{iid}{\sim} F_U,\ i = 1, ..., n \}\) and \(\{ v_j \stackrel{iid}{\sim} F_V,\ j = 1, ..., m \}\) as two groups of samples deriving from different distributions with the same dimensionality. Then, problem that we are testing is thus,

\[\begin{split}H_0: F_U &= F_V \\ H_A: F_U &\neq F_V\end{split}\]

The closely related independence testing problem can be generalized similarly: Given a set of paired data \(\{\left(x_i, y_i \right) \stackrel{iid}{\sim} F_{XY}, \ i = 1, ..., N\}\), the problem that we are testing is,

\[\begin{split}H_0: F_{XY} &= F_X F_Y \\ H_A: F_{XY} &\neq F_X F_Y\end{split}\]

By manipulating the inputs of the k-sample test, we can create concatenated versions of the inputs and another label matrix which are necessarily paired. Then, any nonparametric test can be performed on this data.

Letting \(n = \sum_{i=1}^k n_i\), define new data matrices \(\mathbf{x}\) and \(\mathbf{y}\) such that,

\[\begin{split}\begin{align*} \mathbf{x} &= \begin{bmatrix} \mathbf{u}_1 \\ \vdots \\ \mathbf{u}_k \end{bmatrix} \in \mathbb{R}^{n \times p} \\ \mathbf{y} &= \begin{bmatrix} \mathbf{1}_{n_1 \times 1} & \mathbf{0}_{n_1 \times 1} & \ldots & \mathbf{0}_{n_1 \times 1} \\ \mathbf{0}_{n_2 \times 1} & \mathbf{1}_{n_2 \times 1} & \ldots & \mathbf{0}_{n_2 \times 1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}_{n_k \times 1} & \mathbf{0}_{n_k \times 1} & \ldots & \mathbf{1}_{n_k \times 1} \\ \end{bmatrix} \in \mathbb{R}^{n \times k} \end{align*}\end{split}\]

Additionally, in the two-sample case,

\[\begin{split}\begin{align*} \mathbf{x} &= \begin{bmatrix} \mathbf{u}_1 \\ \mathbf{u}_2 \end{bmatrix} \in \mathbb{R}^{n \times p} \\ \mathbf{y} &= \begin{bmatrix} \mathbf{0}_{n_1 \times 1} \\ \mathbf{1}_{n_2 \times 1} \end{bmatrix} \in \mathbb{R}^n \end{align*}\end{split}\]

Given \(\mathbf{u}\) and \(\mathbf{v}\) as defined above, to perform a \(w\)-way test where \(w < k\),

\[\begin{split}\mathbf{y} = \begin{bmatrix} \mathbf{1}_{n_1 \times 1} & \mathbf{0}_{n_1 \times 1} & \ldots & \mathbf{1}_{n_1 \times 1} \\ \mathbf{1}_{n_2 \times 1} & \mathbf{1}_{n_2 \times 1} & \ldots & \mathbf{0}_{n_2 \times 1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}_{n_k \times 1} & \mathbf{1}_{n_k \times 1} & \ldots & \mathbf{1}_{n_k \times 1} \\ \end{bmatrix} \in \mathbb{R}^{n \times k}.\end{split}\]

where each row of \(\mathbf{y}\) contains \(w\) \(\mathbf{1}_{n_i}\) elements. This leads to label matrix distances proportional to how many labels (ways) samples differ by, a hierarchy of distances between samples thought to be true if the null hypothesis is rejected.

Performing a multilevel test involves constructing \(x\) and \(y\) using either of the methods above and then performing a block permutation [2]. Essentially, the permutation is striated, where permutation is limited to be within a block of samples or between blocks of samples, but not both. This is done because the data is not freely exchangeable, so it is necessary to block the permutation to preserve the joint distribution [2].

The p-value returned is calculated using a permutation test using a permutation test. The fast version of the test (for \(k\)-sample Dcorr and Hsic) uses a chi squared approximation.

References

[1]Panda, S., Shen, C., Perry, R., Zorn, J., Lutz, A., Priebe, C. E., & Vogelstein, J. T. (2019). Nonparametric MANOVA via Independence Testing. arXiv e-prints, arXiv-1910.
[2](1, 2) Winkler, A. M., Webster, M. A., Vidaurre, D., Nichols, T. E., & Smith, S. M. (2015). Multi-level block permutation. Neuroimage, 123, 253-268.
test(self, *args, reps=1000, workers=1, auto=True)[source]

Calculates the k-sample test statistic and p-value.

Parameters:
*args : ndarrays

Variable length input data matrices. All inputs must have the same number of samples. That is, the shapes must be (n, p) and (m, p) where n and m are the number of samples and p are the number of dimensions. Alternatively, inputs can be distance matrices, where the shapes must all be (n, n).

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

auto : bool (default: True)

Automatically uses fast approximation when sample size and size of array is greater than 20. If True, and sample size is greater than 20, a fast chi2 approximation will be run. Parameters reps and workers are irrelevant in this case. Only applies to Dcorr and Hsic.

Returns:
stat : float

The computed k-Sample statistic.

pvalue : float

The computed k-Sample p-value.

Examples

>>> import numpy as np
>>> from hyppo.ksample import KSample
>>> x = np.arange(7)
>>> y = x
>>> z = np.arange(10)
>>> stat, pvalue = KSample("Dcorr").test(x, y)
>>> '%.3f, %.1f' % (stat, pvalue)
'-0.136, 1.0'

Energy

class hyppo.ksample.Energy(compute_distance='euclidean', bias=False, **kwargs)[source]

Class for calculating the Energy test statistic and p-value.

Energy is a powerful multivariate 2-sample test. It leverages distance matrix capabilities (similar to tests like distance correlation or Dcorr). In fact, Energy statistic is equivalent to our 2-sample formulation nonparametric MANOVa via independence testing, i.e. hyppo.ksample.Ksample, and to Dcorr, Hilbert Schmidt Independence Criterion (Hsic), and Maximum Mean Discrepancy [3] [4] (see "See Also" section for links).

Parameters:
bias : bool (default: False)

Whether or not to use the biased or unbiased test statistics.

Notes

Traditionally, the formulation for the 2-sample Energy statistic is as follows [5]:

Define \(\{ u_i \stackrel{iid}{\sim} F_U,\ i = 1, ..., n \}\) and \(\{ v_j \stackrel{iid}{\sim} F_V,\ j = 1, ..., m \}\) as two groups of samples deriving from different distributions with the same dimensionality. If \(d(\cdot, \cdot)\) is a distance metric (i.e. euclidean) then,

\[Energy_{n, m}(\mathbf{u}, \mathbf{v}) = \frac{1}{n^2 m^2} \left( 2nm \sum_{i = 1}^n \sum_{j = 1}^m d(u_i, v_j) - m^2 \sum_{i,j=1}^n d(u_i, u_j) - n^2 \sum_{i, j=1}^m d(v_i, v_j) \right)\]

The implementation in the hyppo.ksample.KSample class (using Dcorr) is in fact equivalent to this implementation (for p-values) and statistics are equivalent up to a scaling factor [3].

The p-value returned is calculated using a permutation test using a permutation test. The fast version of the test (for \(k\)-sample Dcorr and Hsic) uses a chi squared approximation.

References

[3](1, 2) Panda, S., Shen, C., Perry, R., Zorn, J., Lutz, A., Priebe, C. E., & Vogelstein, J. T. (2019). Nonparametric MANOVA via Independence Testing. arXiv e-prints, arXiv-1910.
[4]Shen, C., & Vogelstein, J. T. (2018). The exact equivalence of distance and kernel methods for hypothesis testing. arXiv preprint arXiv:1806.05514.
[5]Székely, G. J., & Rizzo, M. L. (2004). Testing for equal distributions in high dimension. InterStat, 5(16.10), 1249-1272.
test(self, x, y, reps=1000, workers=1, auto=True)[source]

Calculates the Energy test statistic and p-value.

Parameters:
x, y : ndarray

Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).

reps : int, optional (default: 1000)

The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.

workers : int, optional (default: 1)

The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.

auto : bool (default: True)

Automatically uses fast approximation when sample size and size of array is greater than 20. If True, and sample size is greater than 20, a fast chi2 approximation will be run. Parameters reps and workers are irrelevant in this case.

Returns:
stat : float

The computed k-Sample statistic.

pvalue : float

The computed k-Sample p-value.

Examples

>>> import numpy as np
>>> from hyppo.ksample import Energy
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = Energy().test(x, y)
>>> '%.3f, %.1f' % (stat, pvalue)
'0.000, 1.0'