Statistics

The minimum and maximum element in a tensor can be retrieved with the functions Min and Max.

The function Quantile returns the element at which the cumulative distribution function of the tensor's elements reaches the given probability P. Median returns the middle element of the sorted elements, which is equivalent to calling Quantile with P = 0.5.

The mean and variance can be computed using the functions Mean and Variance. An optional parameter Offset can be used to return the unbiased sample variance instead. The returned value is unbiased if Offset = 1 and biased if Offset = 0.

The standard deviation (the square root of the variance) is returned by Standard_Deviation. Like Variance, it has an optional parameter Offset. The returned value is always biased because of the square root, even when Offset = 1.
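The effect of Offset can be illustrated with a minimal, self-contained sketch. It uses plain arrays instead of tensors, and it assumes that Offset is subtracted from the number of elements in the divisor, which is the usual convention; the procedure Variance_Sketch, the type Samples, and the sample values are hypothetical and not part of the library:

```ada
with Ada.Text_IO;
with Ada.Numerics.Elementary_Functions;

procedure Variance_Sketch is
   use Ada.Numerics.Elementary_Functions;

   --  Hypothetical stand-in for a 1-D tensor
   type Samples is array (Positive range <>) of Float;

   function Mean (Data : Samples) return Float is
      Sum : Float := 0.0;
   begin
      for Value of Data loop
         Sum := Sum + Value;
      end loop;
      return Sum / Float (Data'Length);
   end Mean;

   --  Assumption: the divisor is N - Offset, so Offset = 0 gives the
   --  biased variance and Offset = 1 the unbiased sample variance
   function Variance (Data : Samples; Offset : Natural := 0) return Float is
      Average : constant Float := Mean (Data);
      Sum     : Float := 0.0;
   begin
      for Value of Data loop
         Sum := Sum + (Value - Average) ** 2;
      end loop;
      return Sum / Float (Data'Length - Offset);
   end Variance;

   Data : constant Samples := (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0);
begin
   Ada.Text_IO.Put_Line ("Biased variance:    " & Float'Image (Variance (Data)));
   Ada.Text_IO.Put_Line ("Unbiased variance:  " & Float'Image (Variance (Data, Offset => 1)));
   Ada.Text_IO.Put_Line ("Standard deviation: " & Float'Image (Sqrt (Variance (Data, Offset => 1))));
end Variance_Sketch;
```

For these eight samples the mean is 5.0 and the sum of squared deviations is 32.0, so the biased variance is 32 / 8 = 4.0 and the unbiased variance is 32 / 7, roughly 4.57. The difference shrinks as the number of samples grows.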

Distributions

The generic package Generic_Random provides functions that return tensors whose elements are drawn from specific statistical distributions.

The package can be instantiated using a type derived from type Tensor, for example:

use Orka.Numerics.Singles.Tensors;
use Orka.Numerics.Singles.Tensors.CPU;

package Random is new Generic_Random (CPU_Tensor);

The type CPU_Tensor in package SIMD_CPU uses the xoshiro128++ pseudo-random number generator and needs to be seeded once with a Duration value before using any of the functions in the generic package:

Reset_Random (Orka.OS.Monotonic_Clock);

Discrete distributions

  • Binomial with parameters N and P. Returns a tensor where each element is the number of successful runs (each value is in 0 .. N) with each run having a probability of success P. Parameter P must be in 0.0 .. 1.0.

    For example, if we want to perform some experiment 10 times with a success probability of 0.1 (10 %) each, and want to know the probability that all 10 experiments fail, create a large tensor with a binomial distribution:

    Trials : constant := 20_000;
    
    Tensor : constant CPU_Tensor := Random.Binomial ((1 => Trials), N => 10, P => 0.1);
    Result : constant Element    := CPU_Tensor'(1.0 and (Tensor = 0.0)).Sum / Element (Trials);
    

    This gives a Result of roughly 0.35 or 35 %.

  • Geometric with parameter P. Returns a tensor with a geometric distribution, where each element models the number of failures before the first success. Parameter P must be in 0.0 .. 1.0.

  • Poisson with parameter Lambda.
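The Binomial example above can be checked analytically. The probability that all 10 independent experiments fail follows from the binomial probability mass function evaluated at zero successes:

```latex
P(X = 0) = \binom{10}{0} \, 0.1^{0} \, (1 - 0.1)^{10} = 0.9^{10} \approx 0.349
```

This agrees with the simulated Result of roughly 0.35. The simulated value fluctuates slightly between runs because it is estimated from a finite number of trials.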

Keep parameter N in function Binomial small for large tensors

The runtime cost of Binomial may depend on N, so N should be kept small when generating very large tensors.

Continuous distributions

  • Uniform. Values are uniformly distributed in the range 0 .. 1.

  • Normal. Values are drawn from the standard normal distribution with mean 0.0 and variance 1.0. To create a tensor with mean 3.0 and standard deviation 2.0 (variance 4.0), use:

    3.0 + Normal (Shape) * 2.0
    
  • Exponential with parameter Lambda.

  • Pareto with parameters Xm and Alpha.

  • Laplace with parameters Mean and B.

  • Rayleigh with parameter Sigma.

  • Weibull with parameters K and Lambda.

  • Gamma with parameters K and Theta.

  • Beta with parameters Alpha and Beta.

  • Chi_Squared with parameter K.

  • Student_T with parameter V.
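The scaling shown for Normal above relies on a standard property of the normal distribution. For a desired mean μ and standard deviation σ (symbols introduced here only for illustration):

```latex
X = \mu + \sigma Z, \qquad Z \sim \mathcal{N}(0, 1)
\quad\Longrightarrow\quad
\operatorname{E}[X] = \mu, \qquad \operatorname{Var}[X] = \sigma^{2}
```

With μ = 3.0 and σ = 2.0 this gives a mean of 3.0 and a variance of 4.0. Multiplying by the variance instead of the standard deviation is a common mistake.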

Exercise 1: Create Count evenly spaced 2-D points with Std_Dev Gaussian noise

First create a tensor with evenly spaced points and reshape it into a matrix with one row (needed for the & operator). Then generate values from a standard normal distribution and multiply them by the variable Std_Dev to get the desired standard deviation. Add this noise to the points. This is done separately for X and Y. The two tensors can then be concatenated and transposed to create a tensor with a 2-D point on each row.

Reset_Random (Orka.OS.Monotonic_Clock);

declare
   Shape : constant Tensor_Shape := (1, Count);
   Stop  : constant Element      := Element (Count);

   Indices : constant CPU_Tensor := Linear_Space (1.0, Stop, Count => Count).Reshape (Shape);
   --  X and Y are elaborated separately, so each gets its own independent noise
   X, Y    : constant CPU_Tensor := Indices + Random.Normal (Shape) * Std_Dev;
begin
   return CPU_Tensor'(X & Y).Transpose;
end;

Hypothesis testing

Given a tensor containing a number of samples, the Student's t-distribution and the one-sample t-test can be used to determine if the samples deviate from some desired mean.

Assuming Data is a 1-D tensor of samples and True_Mean the desired mean, a test statistic for the one-sample t-test for the null hypothesis that the sample mean is equal to the desired mean is computed with:

T : constant Element := Random.Test_Statistic_T_Test (Data, True_Mean);

A t-value near zero is consistent with the null hypothesis, while a large positive or negative value is evidence against it.
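For reference, the textbook form of the one-sample t statistic is given below. That Test_Statistic_T_Test computes exactly this expression is an assumption, and the symbols are introduced here only for illustration:

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
```

Here the x_i are the elements of Data, n is the number of elements, and μ₀ corresponds to True_Mean. If the null hypothesis holds and the samples are approximately normally distributed, t follows a Student's t-distribution with n - 1 degrees of freedom.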

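The test statistic can be used to compute the probability, assuming the null hypothesis is true, of observing a t-value at least as large as abs T. This is the probability of making a type I error (rejecting the null hypothesis when it is actually true) if the null hypothesis is rejected for such values. It can be estimated using the Student's t-distribution: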

Trials : constant := 100_000;

Tensor  : constant CPU_Tensor := Random.Student_T ((1 => Trials), V => Data.Elements - 1);
P_Value : constant Element    := Sum (1.0 and (Tensor >= abs T)) / Element (Trials);

Compare this probability with half of some significance level α (for example, 0.05); the level is halved because only the right-tail probability is computed. If the probability is greater than α/2, there is a good chance of incorrectly rejecting the null hypothesis, so it should not be rejected. If the probability is less than α/2, a type I error is unlikely and the null hypothesis can be rejected.
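The computation of P_Value and the decision rule can be summarized as follows, where M stands for Trials, the t_i are the elements of Tensor, and H₀ is the null hypothesis (symbols introduced here only for illustration):

```latex
\hat{p} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{1}\left[ t_i \ge |T| \right]
\approx P\left( t_{n-1} \ge |T| \right),
\qquad
\text{reject } H_0 \text{ if } \hat{p} < \frac{\alpha}{2}
```

Because the Student's t-distribution is symmetric around zero, this is equivalent to comparing the two-sided probability 2p̂ with α. The estimate becomes more accurate as the number of trials grows.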

A significance level of 0.1 corresponds to a confidence level of 90 %, and 0.05 corresponds to 95 %.

Confidence interval

To obtain an interval for a given significance level, use the function Threshold_T_Test. Add the result to the true mean and subtract it from the true mean to get the bounds of the interval in which the null hypothesis (sample mean = true mean) is accepted:

Relative_Threshold : constant Element := Random.Threshold_T_Test (Data, Level => 0.05);

Upper_Bound : constant Element := True_Mean + Relative_Threshold;
Lower_Bound : constant Element := True_Mean - Relative_Threshold;

If the sample mean lies outside the interval, the null hypothesis is rejected. This rejection is either correct or a type I error, but a type I error occurs only with a probability equal to the given significance level.

A lower significance level (and thus higher confidence) will give a wider interval.
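In textbook form, the threshold for a two-sided one-sample t-test is the critical value of the Student's t-distribution scaled by the standard error of the mean. That Threshold_T_Test computes its result this way is an assumption, and the symbols are introduced here only for illustration:

```latex
\text{threshold} = t_{1 - \alpha/2,\; n - 1} \cdot \frac{s}{\sqrt{n}},
\qquad
\text{interval} = \left[ \mu_0 - \text{threshold},\; \mu_0 + \text{threshold} \right]
```

Here α is the significance level, n the number of samples, s the sample standard deviation, and μ₀ the true mean. The critical value grows as α decreases, which is why a lower significance level gives a wider interval, while more samples shrink the interval through the factor 1/√n.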