Sunday, August 11, 2013

Three approaches to measuring how different two samples are

Approach 1: Measure how far a random sample's average value is from the average of another sample. 

Given the following two samples:

Distribution(sample 1) is ~N(mu1, sigma1)
Distribution(sample 2) is ~N(mu2, sigma2)

In the above, sample 1 may be a random sample drawn from sample 2. Now suppose we want to know how far sample 1's average value mu1 is from sample 2's average. The difference can be measured by the following equation.

<difference>=2 * MIN( 1- NORMSDIST(<z-score>), NORMSDIST(<z-score>))

where <z-score> = (mu1 - mu2) / sigma2. NORMSDIST(x) measures the area under the standard normal distribution curve from negative infinity to x. MIN(x, y) returns the minimum of x and y.
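The formula above can be sketched in Python using the standard library; normsdist below reproduces Excel's NORMSDIST via the error function (the function names are mine, not from the original):

```python
from math import erf, sqrt

def normsdist(x):
    # Standard normal CDF (Excel's NORMSDIST): area under the curve
    # from negative infinity to x.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def difference(mu1, mu2, sigma2):
    # Two-tailed probability: 2 * MIN(1 - NORMSDIST(z), NORMSDIST(z))
    # with z = (mu1 - mu2) / sigma2, as in the formula above.
    z = (mu1 - mu2) / sigma2
    return 2 * min(1 - normsdist(z), normsdist(z))
```

For example, with mu1 = mu2 the z-score is 0 and the difference is 1.0 (the sample average is indistinguishable from the population average), while a z-score of 1.96 gives roughly 0.05.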

Approach 2: Chi-Square Test

The chi-square test provides another method for addressing "how different is different?". The chi-square test is appropriate when there are multiple dimensions being compared to each other. In other words, two samples are compared across different dimensions. The chi-square test does not create confidence intervals, because confidence intervals do not make as much sense across multiple dimensions.

Suppose we divide an overall sample into two samples: sample1 and sample2. Sample1 has N1 items, and sample2 has N2 items. Using a single dimension (or feature), each sample can be divided into sub-groups C1, C2, C3:

            C1            C2            C3            Total
sample1     C1(sample1)   C2(sample1)   C3(sample1)   N1
sample2     C1(sample2)   C2(sample2)   C3(sample2)   N2
Total       C1            C2            C3            N1+N2

From the above, we have C1+C2+C3 = N1+N2

Now we calculate the expected value of C1(sample1), ... C3(sample2) as follows:

E[C1(sample1)]=(C1 * N1) / (N1+N2)
E[C2(sample1)]=(C2 * N1) / (N1+N2)
E[C3(sample1)]=(C3 * N1) / (N1+N2)
E[C1(sample2)]=(C1 * N2) / (N1+N2)
E[C2(sample2)]=(C2 * N2) / (N1+N2)
E[C3(sample2)]=(C3 * N2) / (N1+N2)

Next we calculate the deviation of each actual value (e.g. C1(sample1)) from its expected value (e.g. E[C1(sample1)]), as follows:

D[C1(sample1)]=C1(sample1) - E[C1(sample1)]
...
D[C3(sample2)]=C3(sample2) - E[C3(sample2)]

Next we calculate the chi-square value for each cell (shown for C1(sample1); the other cells follow the same pattern):

ChiSquare[C1(sample1)]=D[C1(sample1)]^2 / E[C1(sample1)]

With the chi-square values calculated, we can calculate the chi-square for each sample by summing over its cells, as follows:

ChiSquare(sample1)=ChiSquare[C1(sample1)]+ChiSquare[C2(sample1)]+ChiSquare[C3(sample1)]
ChiSquare(sample2)=ChiSquare[C1(sample2)]+ChiSquare[C2(sample2)]+ChiSquare[C3(sample2)]
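The bookkeeping above (expected values, deviations, per-cell chi-squares, and per-sample sums) can be sketched in Python. The counts below are made up for illustration:

```python
# Hypothetical counts for each sub-group of each sample.
s1 = {"C1": 30, "C2": 50, "C3": 20}   # N1 = 100
s2 = {"C1": 20, "C2": 40, "C3": 40}   # N2 = 100

N1, N2 = sum(s1.values()), sum(s2.values())
total = N1 + N2

chi_sq = {"sample1": 0.0, "sample2": 0.0}
for g in ("C1", "C2", "C3"):
    Cg = s1[g] + s2[g]            # column total, e.g. C1 = C1(sample1)+C1(sample2)
    e1 = Cg * N1 / total          # E[Cg(sample1)]
    e2 = Cg * N2 / total          # E[Cg(sample2)]
    chi_sq["sample1"] += (s1[g] - e1) ** 2 / e1   # D^2 / E per cell
    chi_sq["sample2"] += (s2[g] - e2) ** 2 / e2

chi_square_total = chi_sq["sample1"] + chi_sq["sample2"]
```

With these particular counts both per-sample sums come out near 4.89, for a total near 9.78.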
Note that the above calculation can be extended to an arbitrary number of samples, where each sample may be a sub-category (or sub-set) of an overall sample. To generalize: given an overall sample measured along two different dimensions (or categories), with the first dimension taking three values (C1, C2, C3) and the second dimension taking two values (N1, N2), the chi-square for each cell (e.g. the cell (C1, N1)) can be calculated as

ChiSquare(C1, N1)=D(C1, N1)^2 / E(C1, N1)

D(C1, N1)=ActualCount(C1, N1) - E(C1, N1)
E(C1, N1)=(ActualCount(C1)*ActualCount(N1)) / TotalCount

ActualCount(C1)=ActualCount(C1, N1)+ActualCount(C1, N2)
ActualCount(N1)=ActualCount(C1, N1)+ActualCount(C2, N1)+ActualCount(C3, N1)
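The generalized cell formula can be sketched as a single function over an r x c table of observed counts (a plain list-of-lists; the function name is mine):

```python
def chi_square_statistic(table):
    # Total chi-square over all cells of an r x c table:
    # sum of D(row, col)^2 / E(row, col), where
    # E(row, col) = row_total * col_total / grand_total.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

Summing over all cells is equivalent to summing the per-sample chi-squares from the two-sample case, since each sample is one row of the table.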

The higher the value of ChiSquare(C1, N1), the further the actual count of the combination (C1, N1) deviates from its expected count. In other words, the higher ChiSquare(C1, N1) is, the more unexpected the combination (C1, N1) is. Or we can say: the higher ChiSquare(C1, N1), the less likely it is that ActualCount(C1, N1) occurs by chance under the overall sample distribution (i.e. its distribution differs more significantly from the overall sample).

The chi-square values follow the distribution known as the chi-square distribution. The degrees of freedom of the chi-square distribution, k, is the product of (category count minus one) over the dimensions. For example, dimension 1 (C1, C2, C3) has 3 categories and dimension 2 (N1, N2) has 2 categories, so the k value is:

k=(3-1) * (2-1)=2
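The degrees-of-freedom rule can be written as a tiny helper (the function name is mine):

```python
def degrees_of_freedom(dim_sizes):
    # k = product of (category count - 1) over all dimensions,
    # e.g. [3, 2] -> (3-1) * (2-1) = 2.
    k = 1
    for n in dim_sizes:
        k *= n - 1
    return k
```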

Approach 3: Student-t Test
This is described in another post.
