Statistics
The minimum and maximum element in a tensor can be retrieved with the functions Min and Max.
The function Quantile returns the element at which the cumulative distribution function of the tensor reaches the given probability P.
Median returns the middle element (when sorted), which is equal to calling Quantile with P = 0.5.
The mean and variance can be computed using the functions Mean and Variance.
An optional parameter Offset can be used to return the unbiased sample variance instead. The returned value is unbiased if Offset = 1 and biased if Offset = 0.
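Written out, the two cases correspond to the usual variance estimators. This is a sketch that assumes Offset plays the role of the standard degrees-of-freedom correction; the symbols n, x_i, and d are notation introduced here for illustration and are not names from the library:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\qquad
\operatorname{Var}_{d}(x) = \frac{1}{n - d} \sum_{i=1}^{n} (x_i - \bar{x})^{2},
\qquad
d = \texttt{Offset} \in \{0, 1\}
```

With d = 1 this is Bessel's correction, and with d = 0 it is the population variance of the elements.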
The standard deviation (the square root of the variance) is returned by Standard_Deviation. Like Variance, it has an optional parameter Offset.
The returned value is always biased because of the square root, even when Offset = 1.
Distributions
The generic package Generic_Random provides functions that return a tensor whose elements follow a specific statistical distribution.
The package can be instantiated using a type derived from type Tensor, for example:
use Orka.Numerics.Singles.Tensors;
use Orka.Numerics.Singles.Tensors.CPU;
package Random is new Generic_Random (CPU_Tensor);
The type CPU_Tensor in package SIMD_CPU uses the xoshiro128++ pseudo-random number generator and needs to be seeded once with a Duration value before using any of the functions in the generic package:
Reset_Random (Orka.OS.Monotonic_Clock);
Discrete distributions
- Binomial with parameters N and P. Returns a tensor where each element is the number of successful runs (each value is in 0 .. N) with each run having a probability of success P. Parameter P must be in 0.0 .. 1.0.
  For example, if we want to perform some experiment 10 times with a success probability of 0.1 (10 %) each, and want to know the probability that all 10 experiments fail, create a large tensor with a binomial distribution:
      Trials : constant := 20_000;
      Tensor : constant CPU_Tensor := Random.Binomial ([Trials], N => 10, P => 0.1);
      Result : constant Element := CPU_Tensor'(1.0 and (Tensor = 0.0)).Sum / Element (Trials);
  This gives a Result of roughly 0.35 or 35 %.
- Geometric with parameter P. Create a tensor with a geometric distribution, modeling the number of failures. Parameter P must be in 0.0 .. 1.0.
- Poisson with parameter Lambda.
Keep parameter N in function Binomial small for large tensors
The runtime cost of the implementation of Binomial might depend on N, thus this number should not be too large for very large tensors.
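As a sanity check on the Binomial example above, the exact probability that all 10 independent experiments fail follows from the binomial probability mass function. Here X denotes the number of successes; this is standard probability theory, not output from the library:

```latex
P(X = 0) = \binom{10}{0} (0.1)^{0} (1 - 0.1)^{10} = 0.9^{10} \approx 0.349
```

The simulated Result should approach this value as Trials grows.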
Continuous distributions
- Uniform. Values are uniformly distributed in the range 0 .. 1.
- Normal. Values are from the standard normal distribution with mean 0.0 and variance 1.0. To create a tensor with the distribution N(3.0, 2.0) (mean is 3.0 and standard deviation is 2.0), use:
      3.0 + Normal (Shape) * 2.0
- Exponential with parameter Lambda.
- Pareto with parameters Xm and Alpha.
- Laplace with parameters Mean and B.
- Rayleigh with parameter Sigma.
- Weibull with parameters K and Lambda.
- Gamma with parameters K and Theta.
- Beta with parameters Alpha and Beta.
- Chi_Squared with parameter K.
- Student_T with parameter V.
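The scaling shown for Normal works because shifting and scaling a standard normal variable changes its mean and standard deviation in a predictable way. In the sketch below, Z is a standard normal variable, and the symbols μ and σ are notation introduced here for illustration:

```latex
Z \sim \mathcal{N}(0, 1), \quad X = \mu + \sigma Z
\;\Longrightarrow\;
\operatorname{E}[X] = \mu, \quad \operatorname{Var}(X) = \sigma^{2}
```

For the example in the list, μ = 3.0 and σ = 2.0, so the variance of the resulting elements is 4.0.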
Exercise 1: Create Count evenly spaced 2-D points with Std_Dev Gaussian noise
First create a tensor with evenly spaced points and reshape it to be a matrix with one row (needed for the & operator). Then generate values from a standard normal distribution and multiply them by the variable Std_Dev to get the desired standard deviation. Add the noise to the points. This is done separately for X and Y. The two tensors can be concatenated and transposed to create a tensor with a 2-D point on each row.
Reset_Random (Orka.OS.Monotonic_Clock);
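declare
   Shape : constant Tensor_Shape := (1, Count);
   Stop  : constant Element := Element (Count);

   --  Evenly spaced values 1.0 .. Stop as a matrix with one row
   Indices : constant CPU_Tensor := Linear_Space (1.0, Stop, Count => Count).Reshape (Shape);

   --  The initializer is evaluated once per declared object,
   --  so X and Y each receive their own noise
   X, Y : constant CPU_Tensor := Indices + Random.Normal (Shape) * Std_Dev;
begin
   return CPU_Tensor'(X & Y).Transpose;
end;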
Hypothesis testing
Given a tensor containing a number of samples, the Student's t-distribution and the one-sample t-test can be used to determine if the samples deviate from some desired mean.
Assuming Data is a 1-D tensor of samples and True_Mean the desired mean, a test statistic for the one-sample t-test for the null hypothesis that the sample mean is equal to the desired mean is computed with:
T : constant Element := Random.Test_Statistic_T_Test (Data, True_Mean);
A t-value near zero is consistent with the null hypothesis, while a value far from zero, whether positive or negative, is evidence against it.
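For reference, the textbook one-sample t statistic is shown below. Whether Test_Statistic_T_Test computes exactly this expression is an assumption. The symbols are notation introduced here for illustration: n samples x_i, their sample mean, the unbiased sample standard deviation s, and the desired mean μ₀:

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
\qquad
s^{2} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^{2}
```

Under the null hypothesis, and for normally distributed samples, t follows a Student's t-distribution with n - 1 degrees of freedom, which matches the value Data.Elements - 1 passed as V to Student_T in this section.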
The test statistic can be used to estimate the probability of having a type I error (rejecting the null hypothesis when it is actually true) by sampling from the Student's t-distribution:
Trials : constant := 100_000;
Tensor : constant CPU_Tensor := Random.Student_T ([Trials], V => Data.Elements - 1);
P_Value : constant Element := Sum (1.0 and (Tensor >= abs T)) / Element (Trials);
If the probability is greater than some significance level α (for example, 0.05) divided by 2 (because we only compute the right tail probability) then there is a good chance of incorrectly rejecting the null hypothesis. Therefore, the null hypothesis should not be rejected. If the probability is less than α/2, then it is unlikely to have a type I error and therefore you can safely reject the null hypothesis.
A significance level of 0.1 corresponds to a confidence level of 90 % and 0.05 corresponds to 95 %.
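The Monte Carlo estimate stored in P_Value, together with the decision rule described above, can be summarized as follows. Here M stands for Trials, T_i for the sampled elements, and n for the number of samples in Data; this notation is introduced here for illustration:

```latex
\hat{p} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{1}\left[ T_i \ge |t| \right]
\approx P\left( T_{n-1} \ge |t| \right),
\qquad
\text{reject } H_0 \text{ if } \hat{p} < \frac{\alpha}{2}
```

The estimate becomes more precise as M grows, since it is an average over M independent draws.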
Confidence interval
To obtain an interval for a given significance level, use the function Threshold_T_Test.
Add the result to, and subtract it from, the true mean to get the interval for which the null hypothesis (sample mean = true mean) is accepted:
Relative_Threshold : constant Element := Random.Threshold_T_Test (Data, Level => 0.05);
Upper_Bound : constant Element := True_Mean + Relative_Threshold;
Lower_Bound : constant Element := True_Mean - Relative_Threshold;
For a sample mean outside the interval, the null hypothesis is rejected. The rejection is either correct or a type I error, but a type I error occurs only with a probability equal to the given significance level.
A lower significance level (and thus higher confidence) will give a wider interval.
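A conventional way to define such a threshold uses the critical value of the t-distribution. The expression below is the textbook form and only an assumption about what Threshold_T_Test returns. The symbols n, s, μ₀, and the upper critical value t_{α/2, n-1} are notation introduced here for illustration:

```latex
\text{threshold} = t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}},
\qquad
\text{interval} = \left[ \mu_0 - \text{threshold},\; \mu_0 + \text{threshold} \right]
```

Because the critical value grows as α shrinks, this form is consistent with a lower significance level giving a wider interval.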