# java-statistical-inference

This package is a Java implementation of an opinionated statistical inference engine with a fluent API. It makes statistical inference easier to conduct, even with little or no knowledge of the underlying statistical principles.

# Features

• Confidence Interval for numerical variable and proportions (one group or two groups)
• Hypothesis Testing for Single Numerical Variable
• Hypothesis Testing for Single Categorical Variable (Proportion)
• Hypothesis Testing for Two Group Numerical Variable
• Hypothesis Testing for Two Group Categorical Variable (Proportion)
• ANOVA: Independence Test between a Numerical Variable and a Categorical Variable
• Chi-Square Test: Independence Test between a Categorical Variable and another Categorical Variable
• ANOVA for Regression: Independence Test between a Numerical Variable and another Numerical Variable
• Automatic change of sampling distribution based on sample size:
  • Normal distribution for large sample on categorical variable (one or two groups)
  • Bootstrap simulation for small sample on categorical variable (one or two groups)
  • Normal distribution for large sample on numerical variable (one or two groups)
  • Student-T distribution for small sample on numerical variable (one or two groups)
• Central Limit Theorem Conditions Check
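
The automatic choice of sampling distribution listed above can be pictured with a small stand-alone sketch. It does not use the library: the class name, the enum, and the thresholds (30 observations for a mean, at least 10 expected successes and failures for a proportion) are common textbook conventions assumed here for illustration, and the library's actual cut-offs may differ.

```java
// Hypothetical sketch: names and thresholds are assumptions, not the library's API.
class SamplingDistributionSketch {
    enum Family { NORMAL, STUDENT_T, SIMULATION }

    // Numerical variable: large samples use the normal distribution, small ones Student's t.
    static Family forMean(int sampleSize) {
        return sampleSize >= 30 ? Family.NORMAL : Family.STUDENT_T;
    }

    // Categorical variable: the normal approximation needs enough
    // expected successes and failures; otherwise fall back to simulation.
    static Family forProportion(int sampleSize, double proportion) {
        double successes = sampleSize * proportion;
        double failures = sampleSize * (1.0 - proportion);
        return (successes >= 10 && failures >= 10) ? Family.NORMAL : Family.SIMULATION;
    }

    public static void main(String[] args) {
        System.out.println("mean, n=9: " + forMean(9));
        System.out.println("mean, n=100: " + forMean(100));
        System.out.println("proportion, n=11, p=0.5: " + forProportion(11, 0.5));
        System.out.println("proportion, n=200, p=0.3: " + forProportion(200, 0.3));
    }
}
```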

# Install

Add the following dependency into your POM file:
```<dependency>
<groupId>com.github.chen0040</groupId>
<artifactId>java-statistical-inference</artifactId>
<version>1.0.4</version>
</dependency>```

# Usage

### Single Numerical Variable

The code below shows how to declare a single numerical variable kie (knowledge inference engine):
```Variable variable = new Variable("Amount");
NumericalSampleKie kie = variable.numericalSample();```
The code below shows how to load observed data about the variable "Amount" into the kie:
`kie.addObservations(new double[] { 0.2, 0.4, 0.6, 0.12, 0.9, 0.13, -0.12, -0.55, 0.5});`
Alternatively, the observed data can be loaded from a data frame (refer to the data frame documentation for more examples of how to create one):
```DataFrame dataFrame = DataQuery.csv().from(new FileInputStream("amount.csv"))
        .selectColumn(0).asNumeric().asInput("Amount").build();```
The code below shows the various statistics that can be obtained from the kie about the variable "Amount":
```Mean mean = kie.mean();
double confidenceLevel = 0.95;
ConfidenceInterval confidenceInterval = mean.confidenceInterval(confidenceLevel);

System.out.println("sample.mean: " + kie.getSampleMean());
System.out.println("sample.sd: " + kie.getSampleSd());
System.out.println("sample.size: " + kie.getSampleSize());
System.out.println("sample.median: " + kie.getSampleMedian());
System.out.println("sample.max: " + kie.getSampleMax());
System.out.println("sample.min: " + kie.getSampleMin());
System.out.println("sample.1st.quartile: " + kie.getSampleFirstQuartile());
System.out.println("sample.3rd.quartile: " + kie.getSampleThirdQuartile());

System.out.println("sampling distribution: " + kie.getSamplingDistribution());

System.out.println("confidence interval for Amount: " + confidenceInterval);```
The kie also provides a user-friendly summary of the confidence interval:
`System.out.println(kie.mean().confidenceInterval(0.95).getSummary());`
The code below shows how to test the null hypothesis that "The population mean of Amount is 0.5" at a significance level of 0.05:
```double expected_mean = 0.5;
TestingOnValue test = kie.test4MeanEqualTo(expected_mean);

System.out.println("sampling distribution: " + test.getDistributionFamily());
System.out.println("test statistic: " + test.getTestStatistic());
System.out.println("p-value (one-tail): " + test.getPValueOneTail());
System.out.println("p-value (two-tails): " + test.getPValueTwoTails());```
The kie also provides a user-friendly summary of the null hypothesis test:
```TestingOnValue test = kie.test4MeanEqualTo(0.5);
System.out.println(test.getSummary());```
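
To make the reported test statistic less of a black box, here is a minimal stand-alone sketch of the one-sample t statistic for the observations used above, under the null hypothesis that the mean is 0.5. It does not call the library; `OneSampleTSketch` and its methods are hypothetical names, and with only 9 observations a Student-T distribution with 8 degrees of freedom is assumed to be the relevant sampling distribution.

```java
// Hypothetical sketch: plain Java, not the library's implementation.
class OneSampleTSketch {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation with the n - 1 denominator.
    static double sampleSd(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) squares += (x - m) * (x - m);
        return Math.sqrt(squares / (xs.length - 1));
    }

    // t = (sample mean - hypothesized mean) / standard error
    static double tStatistic(double[] xs, double nullMean) {
        double standardError = sampleSd(xs) / Math.sqrt(xs.length);
        return (mean(xs) - nullMean) / standardError;
    }

    public static void main(String[] args) {
        double[] amount = { 0.2, 0.4, 0.6, 0.12, 0.9, 0.13, -0.12, -0.55, 0.5 };
        System.out.println("sample mean: " + mean(amount));
        System.out.println("sample sd: " + sampleSd(amount));
        System.out.println("t statistic (H0: mean = 0.5): " + tStatistic(amount, 0.5));
        System.out.println("degrees of freedom: " + (amount.length - 1));
    }
}
```

The p-value is then the tail area of the t distribution beyond this statistic, which is the part the kie looks up on your behalf.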

### Single Categorical Variable

The code below shows how to declare a single categorical variable kie (knowledge inference engine):
```Variable variable = new Variable("Type");
CategoricalSampleKie kie = variable.categoricalSample();```
The code below shows how to load observed data about the variable "Type" into the kie:
`kie.addObservations(new String[] { "Asset", "Liability", "Equity", "Revenue", "Expense", "Liability", "Equity", "Revenue", "Asset", "Liability", "Equity" });`
Alternatively, the observed data can be loaded from a data frame:
```InputStream inputStream = new FileInputStream("iris.data");
DataFrame dataFrame = DataQuery.csv(",").from(inputStream)
        .selectColumn(4).asCategory().asInput("Type").build();```
The code below shows the various statistics that can be obtained from the kie about the variable "Type":
```Proportion proportion = kie.proportion("Liability");
double confidenceLevel = 0.95;
ConfidenceInterval confidenceInterval = proportion.confidenceInterval(confidenceLevel);

System.out.println("sample.mean: " + kie.getSampleMean("Liability"));
System.out.println("sample.proportion: " + kie.getSampleProportion("Liability"));
System.out.println("sample.sd: " + kie.getSampleSd("Liability"));
System.out.println("sample.size: " + kie.getSampleSize());

System.out.println("sampling distribution: " + kie.getSamplingDistribution());

System.out.println("confidence interval for Type == Liability: " + confidenceInterval);```
The kie also provides a user-friendly summary of the confidence interval:
`System.out.println(kie.proportion("Liability").confidenceInterval(0.95).getSummary());`
The code below shows how to test the null hypothesis that "The population proportion of Type==Liability is 0.5" at a significance level of 0.05:
```double expected_proportion = 0.5;
TestingOnValue test = kie.test4ProportionEqualTo(expected_proportion);

System.out.println("sampling distribution: " + test.getDistributionFamily());
System.out.println("test statistic: " + test.getTestStatistic());
System.out.println("p-value (one-tail): " + test.getPValueOneTail());
System.out.println("p-value (two-tails): " + test.getPValueTwoTails());```
The kie also provides a user-friendly summary of the null hypothesis test:
```TestingOnValue test = kie.test4ProportionEqualTo(0.5);
System.out.println(test.getSummary());```
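
With only 11 observations, the sample above is too small for a normal approximation, which is the situation where the feature list mentions simulation. The sketch below is one plausible way to picture that: it simulates samples under the null hypothesis (proportion 0.5) and counts how often they contain as few "Liability" values as the observed data. Whether the library resamples in exactly this way is not stated here; the class, method names, trial count, and seed are all assumptions made for illustration.

```java
import java.util.Random;

// Hypothetical sketch: a null-hypothesis simulation, not the library's implementation.
class ProportionSimulationSketch {
    static int count(String[] observations, String level) {
        int hits = 0;
        for (String o : observations) if (o.equals(level)) hits++;
        return hits;
    }

    static double sampleProportion(String[] observations, String level) {
        return (double) count(observations, level) / observations.length;
    }

    // Fraction of simulated samples (drawn with success probability p0)
    // that have at most as many successes as were actually observed.
    static double simulatedLowerTailPValue(int n, int observedSuccesses, double p0,
                                           int trials, long seed) {
        Random rng = new Random(seed);
        int asExtreme = 0;
        for (int t = 0; t < trials; t++) {
            int successes = 0;
            for (int i = 0; i < n; i++) {
                if (rng.nextDouble() < p0) successes++;
            }
            if (successes <= observedSuccesses) asExtreme++;
        }
        return (double) asExtreme / trials;
    }

    public static void main(String[] args) {
        String[] type = { "Asset", "Liability", "Equity", "Revenue", "Expense", "Liability",
                          "Equity", "Revenue", "Asset", "Liability", "Equity" };
        int observed = count(type, "Liability");
        System.out.println("sample proportion: " + sampleProportion(type, "Liability"));
        System.out.println("simulated one-tail p-value: "
                + simulatedLowerTailPValue(type.length, observed, 0.5, 10000, 42L));
    }
}
```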

### Paired Sample for a Numerical Variable

The sample code below shows how to run statistical inference on a sample of paired observations (e.g., before and after) for a numerical variable:
```Variable variable1 = new Variable("Begin");
Variable variable2 = new Variable("End");

InputStream inputStream = new FileInputStream("calcium-paired.dat");
DataFrame dataFrame = DataQuery.csv().from(inputStream)
.selectColumn(1).asNumeric().asInput("Begin")
.selectColumn(2).asNumeric().asInput("End")
.build();

PairedSampleKie kie = variable2.pair(variable1).numericalSample();

Mean mean = kie.difference();

ConfidenceInterval confidenceInterval = mean.confidenceInterval(0.95);
TestingOnValue test = kie.testDifferenceEqualTo(0.5);

System.out.println("sample.difference-mean: " + kie.getSampleDifferenceMean());
System.out.println("sample.difference-sd: " + kie.getSampleDifferenceSd());
System.out.println("sample.size: " + kie.getSampleSize());
System.out.println("sample.median: " + kie.getSampleMedian());
System.out.println("sample.max: " + kie.getSampleMax());
System.out.println("sample.min: " + kie.getSampleMin());
System.out.println("sample.1st.quartile: " + kie.getSampleFirstQuartile());
System.out.println("sample.3rd.quartile: " + kie.getSampleThirdQuartile());

System.out.println("sampling distribution (difference): " + kie.getSamplingDistribution());

System.out.println("95% confidence interval: " + confidenceInterval);

System.out.println("========================================================");

System.out.println(confidenceInterval.getSummary());
System.out.println(test.getSummary());```
In the code above, "calcium-paired.dat" contains the results of a randomized comparative experiment investigating the effect of calcium on blood pressure in African-American men. A treatment group of 10 men received a calcium supplement for 12 weeks. All subjects had their blood pressure tested before and after the 12-week period.
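
A paired analysis reduces to a one-sample analysis of the per-subject differences. The stand-alone sketch below illustrates that idea with made-up numbers (not the contents of "calcium-paired.dat"). The class name, the data, and the direction of the difference (begin minus end) are assumptions for illustration only.

```java
// Hypothetical sketch with invented data; not the library's implementation.
class PairedDifferenceSketch {
    // Per-subject difference: begin[i] - end[i] (direction chosen for illustration).
    static double[] differences(double[] begin, double[] end) {
        double[] d = new double[begin.length];
        for (int i = 0; i < begin.length; i++) d[i] = begin[i] - end[i];
        return d;
    }

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double sampleSd(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) squares += (x - m) * (x - m);
        return Math.sqrt(squares / (xs.length - 1));
    }

    // t statistic for H0: mean difference equals nullDifference.
    static double tStatistic(double[] diffs, double nullDifference) {
        return (mean(diffs) - nullDifference) / (sampleSd(diffs) / Math.sqrt(diffs.length));
    }

    public static void main(String[] args) {
        double[] begin = { 10, 12, 9, 11, 13 };
        double[] end = { 8, 11, 9, 8, 10 };
        double[] d = differences(begin, end);
        System.out.println("mean difference: " + mean(d));
        System.out.println("sd of differences: " + sampleSd(d));
        System.out.println("t statistic (H0: difference = 0): " + tStatistic(d, 0.0));
    }
}
```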

### Compare Two Groups for a Numerical Variable

The sample below shows statistical inference on samples from two different groups (e.g., from two different experimental setups) for a numerical variable:
```Variable variable = new Variable("Decrease");
TwoGroupNumericalSampleKie kie = variable.twoGroupNumericalSample(new Variable("Treatment"), "Calcium", "Placebo");

InputStream inputStream = new FileInputStream("calcium.dat");
DataFrame dataFrame = DataQuery.csv().from(inputStream)
.skipRows(33)
.selectColumn(0).asCategory().asInput("Treatment")
.selectColumn(3).asNumeric().asInput("Decrease")
.build();

MeanDifference difference = kie.difference();
ConfidenceInterval confidenceInterval = difference.confidenceInterval(0.95);

TestingOnValueDifference test = kie.test4GroupDifference();

System.out.println("sample1.mean: " + kie.getGroup1SampleMean());
System.out.println("sample1.sd: " + kie.getGroup1SampleSd());
System.out.println("sample1.size: " + kie.getGroup1SampleSize());
System.out.println("sample1.median: " + kie.getGroup1SampleMedian());
System.out.println("sample1.max: " + kie.getGroup1SampleMax());
System.out.println("sample1.min: " + kie.getGroup1SampleMin());
System.out.println("sample1.1st.quartile: " + kie.getGroup1SampleFirstQuartile());
System.out.println("sample1.3rd.quartile: " + kie.getGroup1SampleThirdQuartile());

System.out.println("sample2.mean: " + kie.getGroup2SampleMean());
System.out.println("sample2.sd: " + kie.getGroup2SampleSd());
System.out.println("sample2.size: " + kie.getGroup2SampleSize());
System.out.println("sample2.median: " + kie.getGroup2SampleMedian());
System.out.println("sample2.max: " + kie.getGroup2SampleMax());
System.out.println("sample2.min: " + kie.getGroup2SampleMin());
System.out.println("sample2.1st.quartile: " + kie.getGroup2SampleFirstQuartile());
System.out.println("sample2.3rd.quartile: " + kie.getGroup2SampleThirdQuartile());

System.out.println("sampling distribution: " + kie.getSamplingDistribution());

System.out.println("95% confidence interval: " + confidenceInterval);

System.out.println("========================================================");

System.out.println(confidenceInterval.getSummary());
System.out.println(test.getSummary());```
In the code above, "calcium.dat" contains the results of a randomized comparative experiment investigating the effect of calcium on blood pressure in African-American men. A treatment group of 10 men received a calcium supplement for 12 weeks, and a control group of 11 men received a placebo during the same period. All subjects had their blood pressure tested before and after the 12-week period.
`kie.test4GroupDifference()` can be used to test whether the numerical variable is independent of a categorical variable that has two levels (i.e., the "group" variable).
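
For intuition, the difference between two independent group means is usually judged against a standard error that combines both groups' variances. The stand-alone sketch below computes that unpooled (Welch-style) statistic on invented data. It is an illustration under that assumption and does not claim to match how the library combines the two groups; all names are hypothetical.

```java
// Hypothetical sketch with invented data; the unpooled standard error is an assumption.
class TwoGroupMeanSketch {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample variance with the n - 1 denominator.
    static double sampleVariance(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) squares += (x - m) * (x - m);
        return squares / (xs.length - 1);
    }

    // SE of (mean1 - mean2) = sqrt(s1^2 / n1 + s2^2 / n2)
    static double standardError(double[] g1, double[] g2) {
        return Math.sqrt(sampleVariance(g1) / g1.length + sampleVariance(g2) / g2.length);
    }

    // t statistic for H0: the two population means are equal.
    static double tStatistic(double[] g1, double[] g2) {
        return (mean(g1) - mean(g2)) / standardError(g1, g2);
    }

    public static void main(String[] args) {
        double[] treatment = { 5, 7, 9 };
        double[] control = { 1, 2, 3, 6 };
        System.out.println("difference of means: " + (mean(treatment) - mean(control)));
        System.out.println("standard error: " + standardError(treatment, control));
        System.out.println("t statistic: " + tStatistic(treatment, control));
    }
}
```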

### Compare Two Groups for a Categorical Variable

The sample below shows statistical inference on samples from two different groups (e.g., from two different experimental setups) for a categorical variable:
```Variable variable_use = new Variable("UseContraceptive");
Variable variable_urban = new Variable("IsUrban");

InputStream inputStream = new FileInputStream("contraception.csv");
DataFrame dataFrame = DataQuery.csv(",")
.from(inputStream)
.selectColumn(3).asCategory().asInput("UseContraceptive")
.selectColumn(6).asCategory().asInput("IsUrban")
.build();

TwoGroupCategoricalSampleKie kie = variable_use.twoGroupCategoricalSampleKie(variable_urban, "Y", "N");

ProportionDifference difference = kie.proportionDifference("Y");
ConfidenceInterval confidenceInterval = difference.confidenceInterval(0.95);

TestingOnProportionDifference test = kie.test4GroupDifference("Y");

System.out.println("sample1.mean: " + kie.getGroup1SampleMean("Y"));
System.out.println("sample1.proportion: " + kie.getGroup1SampleProportion("Y"));
System.out.println("sample1.sd: " + kie.getGroup1SampleSd("Y"));
System.out.println("sample1.size: " + kie.getGroup1SampleSize());

System.out.println("sample2.mean: " + kie.getGroup2SampleMean("Y"));
System.out.println("sample2.proportion: " + kie.getGroup2SampleProportion("Y"));
System.out.println("sample2.sd: " + kie.getGroup2SampleSd("Y"));
System.out.println("sample2.size: " + kie.getGroup2SampleSize());

System.out.println("sampling distribution: " + kie.getSamplingDistribution("Y"));

System.out.println("95% confidence interval: " + confidenceInterval);

System.out.println("========================================================");

System.out.println(confidenceInterval.getSummary());
System.out.println(test.getSummary());```
In the code above, "contraception.csv" records whether each person is from an urban area and whether he/she uses contraception.
`kie.test4GroupDifference("Y")` can be used to test whether the categorical variable is independent of another categorical variable that has two levels (i.e., the "group" variable).
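
When both groups are large, the usual textbook test for a difference in proportions pools the two samples to estimate the standard error under the null hypothesis. The stand-alone sketch below shows that calculation on invented counts. The class and method names are hypothetical, and it is not confirmed here that the library uses the pooled form.

```java
// Hypothetical sketch with invented counts; not the library's implementation.
class TwoProportionSketch {
    // Pooled proportion under H0: both groups share one success probability.
    static double pooledProportion(int successes1, int n1, int successes2, int n2) {
        return (double) (successes1 + successes2) / (n1 + n2);
    }

    // z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1/n1 + 1/n2))
    static double zStatistic(int successes1, int n1, int successes2, int n2) {
        double p1 = (double) successes1 / n1;
        double p2 = (double) successes2 / n2;
        double pooled = pooledProportion(successes1, n1, successes2, n2);
        double standardError = Math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2));
        return (p1 - p2) / standardError;
    }

    public static void main(String[] args) {
        // 60 of 100 in group 1 answer "Y", 45 of 100 in group 2.
        System.out.println("pooled proportion: " + pooledProportion(60, 100, 45, 100));
        System.out.println("z statistic: " + zStatistic(60, 100, 45, 100));
    }
}
```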

### ANOVA: Independence Test for a Numerical variable and a Categorical Variable

The sample code below shows how to test for independence between a categorical variable (the explanatory variable) and a numerical variable (the response variable):
```Variable variable1 = new Variable("Age");
Variable variable2 = new Variable("LiveChannel");

CategoricalToNumericalSampleKie kie = variable1.multipleGroupNumericalSample(variable2);

InputStream inputStream = FileUtils.getResource("contraception.csv");
DataFrame dataFrame = DataQuery.csv(",")
.from(inputStream)
.skipRows(1)
.selectColumn(5).asNumeric().asInput("Age")
.selectColumn(4).asCategory().asInput("LiveChannel")
.build();

Anova test = kie.test4Independence();

System.out.println(test.getSummary());```
In the code above, "contraception.csv" records which live channel each person watches (categorical) and his/her age (numerical).
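
The F statistic behind a one-way ANOVA compares variability between group means with variability inside the groups. The stand-alone sketch below computes it for three small invented groups. `OneWayAnovaSketch` is a hypothetical name and the code is an illustration, not the library's implementation.

```java
// Hypothetical sketch with invented data; not the library's implementation.
class OneWayAnovaSketch {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // F = (between-group mean square) / (within-group mean square)
    static double fStatistic(double[][] groups) {
        int total = 0;
        double sum = 0.0;
        for (double[] g : groups) {
            for (double v : g) { sum += v; total++; }
        }
        double grandMean = sum / total;

        double between = 0.0;
        double within = 0.0;
        for (double[] g : groups) {
            double groupMean = mean(g);
            between += g.length * (groupMean - grandMean) * (groupMean - grandMean);
            for (double v : g) within += (v - groupMean) * (v - groupMean);
        }
        int dfBetween = groups.length - 1;
        int dfWithin = total - groups.length;
        return (between / dfBetween) / (within / dfWithin);
    }

    public static void main(String[] args) {
        double[][] ageByChannel = { { 1, 2, 3 }, { 2, 3, 4 }, { 5, 6, 7 } };
        System.out.println("F statistic: " + fStatistic(ageByChannel));
    }
}
```

A large F value means the group means differ by more than the within-group noise would explain, which is evidence against independence.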

### Chi-Square: Independence Test for two Categorical Variables

The sample code below shows how to test for independence between two categorical variables:
```Variable variable1 = new Variable("UseContraceptive");
Variable variable2 = new Variable("LiveChannel");

CategoricalToCategoricalSampleKie kie = variable1.multipleGroupCategoricalSample(variable2);

InputStream inputStream = new FileInputStream("contraception.csv");
DataFrame dataFrame = DataQuery.csv(",")
.from(inputStream)
.skipRows(1)
.selectColumn(3).transform(text -> text.equals("Y") ? "Use" : "DontUse").asInput("UseContraceptive")
.selectColumn(4).asCategory().asInput("LiveChannel")
.build();

ChiSquareTest test = kie.test4Independence();

ContingencyTable contingencyTable = kie.getOrCreateContingencyTable();

System.out.println(contingencyTable.getSummary());

System.out.println(test.getSummary());```
In the code above, "contraception.csv" records which live channel each person watches (a categorical variable) and whether he/she uses contraception (another categorical variable).
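
The chi-square statistic compares each observed cell of the contingency table with the count expected if the two variables were independent. Here is a stand-alone sketch on an invented 2x2 table. The class and method names are hypothetical and the code does not use the library's `ContingencyTable`.

```java
// Hypothetical sketch with an invented table; not the library's implementation.
class ChiSquareSketch {
    // Sum over cells of (observed - expected)^2 / expected,
    // where expected = rowTotal * columnTotal / grandTotal.
    static double statistic(int[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowTotals = new double[rows];
        double[] colTotals = new double[cols];
        double grandTotal = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                rowTotals[r] += observed[r][c];
                colTotals[c] += observed[r][c];
                grandTotal += observed[r][c];
            }
        }
        double chiSquare = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double expected = rowTotals[r] * colTotals[c] / grandTotal;
                double diff = observed[r][c] - expected;
                chiSquare += diff * diff / expected;
            }
        }
        return chiSquare;
    }

    static int degreesOfFreedom(int[][] observed) {
        return (observed.length - 1) * (observed[0].length - 1);
    }

    public static void main(String[] args) {
        int[][] table = { { 10, 20 }, { 30, 40 } };
        System.out.println("chi-square: " + statistic(table));
        System.out.println("degrees of freedom: " + degreesOfFreedom(table));
    }
}
```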

### Linear dependency between two numerical variables X and Y

The sample code below shows how to analyze the linear dependency between two numerical variables X and Y:
```final Random random = new Random(System.currentTimeMillis());

// regression: y is strongly correlated to x by y = 25 + 5 * x
Sampler.DataSampleBuilder builder = new Sampler().forColumn("x").generate((name, index) -> (double)index)
.forColumn("y").generate((name, index) -> 25 + (index + random.nextDouble()) * 5 + random.nextDouble())
.end();

DataFrame dataFrame = DataQuery.blank()
.newInput("x")
.newOutput("y")
.end().build();
dataFrame = builder.sample(dataFrame, 100);

Variable x = new Variable("x");
XYSampleKie kie = x.regression(new Variable("y"));

SampleLinearRegression model = kie.model();

System.out.println("correlation between x and y: " + model.getCorrelation());
System.out.println("y-intercept: " + model.getIntercept());
System.out.println("slope: " + model.getSlope());
System.out.println("R^2: " + model.getR2()); // explained variability
System.out.println("SD(X): " + model.getSX());
System.out.println("SD(Y): " + model.getSY());
System.out.println("Mean(X): " + model.getXBar());
System.out.println("Mean(Y): " + model.getYBar());

Anova4Regression anova = kie.test4Independence();

System.out.println(anova.getSummary());```
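
The slope, intercept, and R^2 printed above follow from the ordinary least-squares formulas. The stand-alone sketch below computes them directly on five invented points. `SimpleRegressionSketch` and its methods are hypothetical names, and the code only illustrates the formulas, not the library's model class.

```java
// Hypothetical sketch with invented data; not the library's implementation.
class SimpleRegressionSketch {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Sum of products of deviations from the means.
    static double crossDeviation(double[] a, double[] b) {
        double ma = mean(a);
        double mb = mean(b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - ma) * (b[i] - mb);
        return sum;
    }

    static double slope(double[] x, double[] y) {
        return crossDeviation(x, y) / crossDeviation(x, x);
    }

    static double intercept(double[] x, double[] y) {
        return mean(y) - slope(x, y) * mean(x);
    }

    // For simple linear regression, R^2 is the squared correlation.
    static double rSquared(double[] x, double[] y) {
        double sxy = crossDeviation(x, y);
        return (sxy * sxy) / (crossDeviation(x, x) * crossDeviation(y, y));
    }

    public static void main(String[] args) {
        double[] x = { 1, 2, 3, 4, 5 };
        double[] y = { 2, 4, 5, 4, 5 };
        System.out.println("slope: " + slope(x, y));
        System.out.println("y-intercept: " + intercept(x, y));
        System.out.println("R^2: " + rSquared(x, y));
    }
}
```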