## Sampling and Inference
A sample is a small section selected from a population or a large data set. The process of drawing a sample from the population is known as sampling. It is used in various applications, such as mathematics and digital communication. The sample must be selected at random so that each member of the population has an equal chance of being selected. Thus, the fundamental assumption on which sampling is based is called random sampling.

Here, we will discuss the following:

- Sampling Distribution
- Testing of Hypothesis
- Level of Significance
- Comparison of two samples
- Student's t-distribution
- Central Limit Theorem
- Chi-Square Method
- F Distribution

## Sampling Distribution
We generally compute the mean of a sample of size n taken from the population. The means of different samples differ. If these means are grouped according to their frequencies, the resulting distribution is termed the sampling distribution of the mean.

## Simple Sampling
It is a special case of random sampling. Here, each event has the same probability of success, and the events are independent of each other: the occurrence of one event does not depend on the occurrence of any other or on the outcomes of previous trials. The statistical parameters of the population are quantities such as its mean and standard deviation.

## Why is sampling important?
Sampling aims at collecting the maximum information with the minimum time, cost, and effort. It aims to obtain the best possible estimates of the population values. According to the logic of induction on which sampling is based, we pass from a particular sample to the general population. Such a generalization from a sample to the population is termed statistical inference.

## Attributes of Sampling
The probability of success = p

The probability of failure = q = (1 - p)

Suppose we draw a sample of n independent trials, so that the occurrence of one outcome does not depend on the occurrence of any other. The number of successes then follows the binomial distribution given by the expansion of (p + q)^n, with:

Mean = np

SD (Standard Deviation) = √(npq)

The expected number of successes in a sample of size n is np, and the standard error is √(npq).

Let's consider the proportion of success.
The mean proportion of success = np/n = p

Standard error of the proportion of success = √(n x p/n x q/n) = √(pq/n)

Precision of the proportion of success = √(n/pq)

Precision is the reciprocal of the standard error. It varies as √n because p and q are constants.

## Standard Error
The standard error of a statistic is the standard deviation of its sampling distribution. It is the yardstick against which the difference between the expected value and the observed value is judged. If n is greater than 30, the sample is known as a large sample. Otherwise, the sample is small. The standard error of the sampling distribution of the mean is called the standard error of the mean, and the standard error of the sampling distribution of the SD is called the standard error of the SD.

## Testing of Hypothesis
Testing a hypothesis means deciding, from sample data, whether the hypothesis should be accepted or rejected. At first, the hypothesis is assumed to be correct, and the probability of obtaining the observed result under that assumption is calculated. The hypothesis is rejected when the calculated probability is less than the preassigned value.

## Errors
An error arises when a hypothesis is rejected although it should have been accepted, or accepted although it should have been rejected. There are two types of errors: a Type I error is the rejection of a true hypothesis, and a Type II error is the acceptance of a false hypothesis.
The aim of testing a hypothesis is to minimize the Type II error while limiting the Type I error to the preassigned level of 5% or 1%. The best way to reduce both types of error (Type I and II) is to increase the sample size.
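The effect of the sample size on the Type II error can be sketched numerically. The following is a minimal sketch, assuming a two-sided test of a proportion with the normal approximation; the hypothesized proportion 0.5, the true proportion 0.55, and the function names are illustrative assumptions, not values from the text.

```python
import math

def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal variate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def type_ii_error(p0: float, p_true: float, n: int, z_crit: float = 1.96) -> float:
    """Probability of accepting the hypothesis p = p0 when the true
    proportion is p_true (normal approximation, two-sided test)."""
    se0 = math.sqrt(p0 * (1 - p0) / n)          # standard error under the hypothesis
    se1 = math.sqrt(p_true * (1 - p_true) / n)  # standard error under the true proportion
    lower = p0 - z_crit * se0                   # acceptance region for the sample proportion
    upper = p0 + z_crit * se0
    return normal_cdf((upper - p_true) / se1) - normal_cdf((lower - p_true) / se1)

# Hypothetical setting: hypothesis p = 0.5, true proportion 0.55.
for n in (100, 400, 1600):
    print(n, round(type_ii_error(0.5, 0.55, n), 3))
```

With the critical value fixed at 1.96, the Type I error stays at about 5% for every n, while the printed Type II error shrinks as n grows.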
## Level of Significance
The level of probability below which a hypothesis is rejected is known as the level of significance.

We know that for the binomial distribution of n trials, the mean and standard deviation are given by:

Mean = np

SD (Standard Deviation) = √(npq)

Where,

- p is the probability of success
- q is the probability of failure
- n is the number of trials

We have the following conditions to test the level of significance:

- |z| < 1.96, the difference is not significant
- |z| > 1.96, the difference is significant at the 5% level of significance
- |z| > 2.58, the difference is significant at the 1% level of significance

Where, Z is the standard normal variate:

Z = (x - np)/√(npq)

Or Z = (x - mean)/SD

x is the observed number of successes. The difference is calculated from the expected and observed numbers of successes.

## Numerical Examples
Let's discuss some numerical examples based on testing the level of significance.
## Example 1:
A coin was tossed 200 times, and heads turned up 108 times. Test the hypothesis that the coin is unbiased at the 5% level of significance.

Let's assume that the coin is unbiased.

The probability of getting a head, p = 1/2

The probability of getting a tail, q = 1/2

Here, getting a head is counted as a success, and n = 200.

The expected number of successes = np = 1/2 x 200 = 100

The observed number of successes = 108

Difference = Observed number of successes - expected number of successes = 108 - 100 = 8

Standard Deviation = √(npq) = √(200 x 1/2 x 1/2) = √50 = 7.07

Z = (x - np)/√(npq) = 8/7.07 = 1.13

Z < 1.96

The value of z is less than 1.96. Hence, the hypothesis of the unbiased coin is accepted at the 5% level of significance.
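The coin calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `binomial_z` and `significance` are assumptions introduced here, and the thresholds 1.96 and 2.58 are the ones stated in the conditions above.

```python
import math

def binomial_z(successes: int, n: int, p: float) -> float:
    """Standard normal variate Z = (x - np) / sqrt(npq)."""
    q = 1.0 - p
    expected = n * p              # expected number of successes
    sd = math.sqrt(n * p * q)     # standard deviation of the binomial
    return (successes - expected) / sd

def significance(z: float) -> str:
    """Classify |z| against the 5% (1.96) and 1% (2.58) thresholds."""
    if abs(z) > 2.58:
        return "significant at the 1% level"
    if abs(z) > 1.96:
        return "significant at the 5% level"
    return "not significant"

# Coin example: 108 heads in 200 tosses, hypothesis p = 1/2.
z_coin = binomial_z(108, 200, 0.5)
print(round(z_coin, 2), significance(z_coin))
```

Because `significance` compares the absolute value of z, the same helper also handles cases where fewer successes than expected are observed.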
## Example 2:
A die is thrown 12000 times, and a throw of 2 or 3 is obtained 5400 times. Assuming random throwing, does the data indicate a biased or an unbiased die?

Let's assume that the die is unbiased.

The probability of getting 2 or 3, p = 1/3

The probability of not getting 2 or 3, q = (1 - 1/3) = 2/3

The expected number of successes = np = 1/3 x 12000 = 4000

The observed number of successes = 5400

Difference = Observed - expected number of successes = 5400 - 4000 = 1400

Standard Deviation = √(npq) = √(12000 x 1/3 x 2/3) = √2666.67 = 51.64

Z = (x - np)/√(npq) = 1400/51.64 = 27.11

Z > 2.58

The value of z is greater than 2.58. Hence, the hypothesis of the unbiased die is rejected at the 1% level of significance, and the data indicate a biased die.

## Comparison of two Samples
We have discussed the hypothesis and the test of significance for a single sample of size n taken from a population with proportion of success p. Here, we will discuss two large samples of sizes n1 and n2 taken from two populations, with sample proportions p1 and p2. We combine the two samples to find the estimated value of the common proportion, which is given by:

P = (n1p1 + n2p2)/(n1 + n2)
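With P as the combined proportion of the two samples and Q = 1 - P, the standard errors E1 and E2 of the two samples and the standard error E of the difference are given by:

E1² = PQ/n1

E2² = PQ/n2

E² = E1² + E2² = PQ(1/n1 + 1/n2)

E = √(PQ(1/n1 + 1/n2))

Z = (p1 - p2)/E

Z is the standard normal variate.

- If z > 3, the difference is real
- If z < 2, the difference may be due to fluctuations of sampling
- If the value of z lies between 2 and 3, the difference between p1 and p2 is significant at the 5% level of significance

## Numerical Examples
Let's discuss three numerical examples based on the comparison of samples.

## Example 1:
In city P, 25% of a sample of 1000 girls have a certain physical defect. In another city Q, 20% of a sample of 800 girls have the same defect. Find whether the difference between the two proportions is significant.

n1 = 1000, n2 = 800

p1 = 25% = 0.25, p2 = 20% = 0.20

The estimated value of the proportion is given by:

P = (n1p1 + n2p2)/(n1 + n2) = (250 + 160)/1800 = 410/1800 = 0.228

Q = 1 - P = 0.772

E² = PQ(1/n1 + 1/n2) = 0.228 x 0.772 x (1/1000 + 1/800) = 0.000396

E = 0.0199 = 0.02 (approx.)

p1 - p2 = 0.25 - 0.20 = 0.05

Z = (p1 - p2)/E = 0.05/0.02 = 2.5

The value of Z lies between 2 and 3. Hence, the difference is significant at the 5% level of significance.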
## Example 2:
In city A, 11% of a sample of 500 boys have a certain physical defect. In another city B, 10% of a sample of 600 boys have the same defect. Find whether the difference between the two proportions is significant.

n1 = 500, n2 = 600

p1 = 11% = 0.11, p2 = 10% = 0.10

The estimated value of the proportion is given by:

P = (n1p1 + n2p2)/(n1 + n2) = (55 + 60)/1100 = 115/1100 = 0.105

Q = 1 - P = 0.895

E² = PQ(1/n1 + 1/n2) = 0.105 x 0.895 x (1/500 + 1/600) = 0.000345

E = 0.019 (approx.)

p1 - p2 = 0.11 - 0.10 = 0.01

Z = (p1 - p2)/E = 0.01/0.019 = 0.53

The value of Z is less than 1, and therefore well below 2. Hence, the difference between the proportions is not significant.
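The comparison of two sample proportions can be sketched as a small function. This is a minimal sketch assuming the pooled estimate P = (n1p1 + n2p2)/(n1 + n2) described in this section; the function name `pooled_two_proportion_z` is an illustrative assumption. Because it works with unrounded intermediate values, its results can differ slightly from hand calculations that round P and E.

```python
import math

def pooled_two_proportion_z(n1: int, p1: float, n2: int, p2: float) -> float:
    """Z = (p1 - p2) / E with E = sqrt(P * Q * (1/n1 + 1/n2))."""
    P = (n1 * p1 + n2 * p2) / (n1 + n2)   # combined (pooled) proportion
    Q = 1.0 - P
    E = math.sqrt(P * Q * (1.0 / n1 + 1.0 / n2))  # standard error of the difference
    return (p1 - p2) / E

# The two city examples above (girls: 25% of 1000 vs 20% of 800;
# boys: 11% of 500 vs 10% of 600).
z_girls = pooled_two_proportion_z(1000, 0.25, 800, 0.20)
z_boys = pooled_two_proportion_z(500, 0.11, 600, 0.10)
print(round(z_girls, 2), round(z_boys, 2))
```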
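## Example 3:
In two large populations, 35% and 30% of the people, respectively, are of small height. Is the difference likely to be hidden in samples of 1000 and 700 taken from these two populations?

p1 = 35% = 0.35, p2 = 30% = 0.30

q1 = 1 - p1 = 0.65, q2 = 1 - p2 = 0.70

p1 - p2 = 0.35 - 0.30 = 0.05

E² = p1q1/n1 + p2q2/n2 = (0.35 x 0.65)/1000 + (0.30 x 0.70)/700 = 0.0002275 + 0.0003 = 0.0005275

E = 0.02297

Z = (p1 - p2)/E = 0.05/0.02297 = 2.18

The value of Z is greater than 2. Hence, it is unlikely that the real difference will be hidden.

## Central Limit Theorem
The Central Limit Theorem describes the distribution of sample means for large sample sizes. For a sample of size n, the limiting distribution is given by:

Z = (x - u)/(σ/√n)

Where,

- u is the mean of the population
- σ is the standard deviation of the population
- x is the mean of the sample

As n becomes large, Z approaches the standard normal variate. The population is assumed to have a finite mean and standard deviation. Samples with a size greater than 25 are termed large samples. The central limit theorem is universal, which is why the normal distribution plays a central role in statistical theory.

## Examples
Here, we will discuss two examples based on the central limit theorem.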
## Example 1:
Solution:

x = 3.8, u = 3.6, σ = 1.51, n = 900

Z = (x - u)/(σ/√n) = (3.8 - 3.6)/(1.51/√900) = 0.2/(1.51/30) = 0.2 x 30/1.51 = 3.97

Z > 1.96

The value of Z is greater than 1.96. Hence, the difference between the sample mean and the population mean is significant at the 5% level of significance.
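The calculation above follows the pattern Z = (x - u)/(σ/√n), which can be sketched as a one-line helper. The function name `mean_z` is an illustrative assumption; the inputs are the values given in the solution.

```python
import math

def mean_z(sample_mean: float, pop_mean: float, sigma: float, n: int) -> float:
    """Z = (x - u) / (sigma / sqrt(n)) for a large sample of size n."""
    standard_error = sigma / math.sqrt(n)   # standard error of the mean
    return (sample_mean - pop_mean) / standard_error

# Values from the solution above: x = 3.8, u = 3.6, sigma = 1.51, n = 900.
z = mean_z(3.8, 3.6, 1.51, 900)
print(round(z, 2), abs(z) > 1.96)
```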
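## Example 2:
x = 2.24, u = 2.2, σ = 0.51, n = 400

Z = (x - u)/(σ/√n) = (2.24 - 2.2)/(0.51/√400) = 0.04/(0.51/20) = 0.04 x 20/0.51 = 1.57

Z < 1.96

The value of Z is less than 1.96. Hence, the difference between the sample mean and the population mean is not significant.

## Test of significance of mean for two large samples
Here, we will discuss two large samples of sizes n1 and n2, with means x1 and x2, taken from populations with means u1 and u2 and standard deviations σ1 and σ2.

The standard error of the difference of the means of these two samples is given by:

E = √(σ1²/n1 + σ2²/n2)

Under the hypothesis that the population means are equal (u1 = u2):

Z = (x1 - x2)/√(σ1²/n1 + σ2²/n2)

If both samples are drawn from the same population with standard deviation σ:

Z = (x1 - x2)/(σ√(1/n1 + 1/n2))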
The conditions to test the significance are as follows:

- If z > 1.96, the difference is significant at the 5% level of significance.
- If z > 3, either the sampling is not simple or the samples are not drawn from the same population.

## Examples
## Example 1:
x1 - x2 = 0.5, σ = 1.5

n1 = 500, n2 = 1000

Z = (x1 - x2)/(σ√(1/n1 + 1/n2)) = 0.5/(1.5 x √(1/500 + 1/1000)) = 0.5/(1.5 x √0.003) = 0.5/0.082 = 6.1 (approx.)

The value of Z is much greater than 3. Hence, the samples cannot be regarded as drawn from the same population.
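The test for two large samples with a common standard deviation can be sketched as follows. This is a minimal sketch assuming the form Z = (x1 - x2)/(σ√(1/n1 + 1/n2)) used in the calculation above; the function name `two_sample_mean_z` is an illustrative assumption, and only the difference of the means is passed in, since that is all the calculation needs.

```python
import math

def two_sample_mean_z(mean_diff: float, sigma: float, n1: int, n2: int) -> float:
    """Z = (x1 - x2) / (sigma * sqrt(1/n1 + 1/n2)) for two large samples
    assumed to share the population standard deviation sigma."""
    standard_error = sigma * math.sqrt(1.0 / n1 + 1.0 / n2)
    return mean_diff / standard_error

# Values from the example above: difference 0.5, sigma 1.5, n1 = 500, n2 = 1000.
z = two_sample_mean_z(0.5, 1.5, 500, 1000)
print(round(z, 2), z > 3)
```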
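## Example 2:
x1 - x2 = 0.4, σ = 3.5

n1 = 800, n2 = 1600

Z = (x1 - x2)/(σ√(1/n1 + 1/n2)) = 0.4/(3.5 x √(1/800 + 1/1600)) = 0.4/(3.5 x √0.001875) = 0.4/0.152 = 2.63

The value of Z is greater than 1.96 and less than 3. Thus, the difference is significant at the 5% level of significance.

## Student's T-distribution
The statistic t in the Student's t-distribution is given by:

t = (x - u)/(s/√n)

or

t = (x - u)/σ x √(n - 1)

Where,

- x is the mean of the sample
- u is the mean of the population
- s is the standard deviation of the sample computed with the divisor n - 1
- σ is the standard deviation of the sample computed with the divisor n

The Student's t-distribution can be expressed as:

Y = y0 (1 + t²/v)^(-(v + 1)/2)

Where, y0 is a constant such that the area under the t-curve is unity, and v = n - 1 is the number of degrees of freedom.

[Figure: the t-curve]

The value of t at the prescribed level of significance is taken from the t-table.

## Examples
Let's discuss two examples based on the Student's t-distribution.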
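## Example 1:
Mean of the differences = (5 + 4 + 3 - 2 + 0 + 1 - 1 + 8 + 7 + 2 + 1 + 4)/12 = 32/12

Mean = d' = 2.67

σ² = Σd²/n - d'² = 190/12 - (2.67)² = 15.83 - 7.11 = 8.72

σ = 2.95

u = 0

t = (d' - u)/σ x √(n - 1) = (2.67 - 0)/2.95 x √11 = 0.905 x 3.317 = 3.00

v = n - 1 = 12 - 1 = 11

t0.05 = 2.201 for v = 11

|t| > t0.05

The value of t is greater than t0.05. Hence, our assumption is rejected. The table of t values (table C) can be accessed directly from the link https://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf.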
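## Example 2:
Mean = A + Σd/n, where A is the assumed mean and d is the deviation of each value from A

Mean = 28 + 10/9 = 29.1

Sample standard deviation, σ = 2.47

t = (29.1 - 27.5)/2.47 x √(9 - 1) = 0.647 x √8 = 1.83

v = n - 1 = 9 - 1 = 8

t0.05 = 2.306 for v = 8

|t| < t0.05

The value of t is less than t0.05. Hence, the difference is not significant, and our assumption is accepted.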
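The t calculation above can be sketched from the summary values alone. This is a minimal sketch assuming the form t = (x - u)/σ x √(n - 1), with σ computed using the divisor n; the function name `t_statistic` and the small lookup table are illustrative assumptions. The critical values are the standard two-tailed 5% entries of a t-table for 8, 10, and 11 degrees of freedom.

```python
import math

# Two-tailed 5% critical values of t for a few degrees of freedom
# (standard t-table entries; extend as needed).
T_CRIT_5PCT = {8: 2.306, 10: 2.228, 11: 2.201}

def t_statistic(sample_mean: float, pop_mean: float, sigma: float, n: int) -> float:
    """t = (x - u) / sigma * sqrt(n - 1), sigma computed with divisor n."""
    return (sample_mean - pop_mean) / sigma * math.sqrt(n - 1)

# Values from the calculation above: mean 29.1, u = 27.5, sigma = 2.47, n = 9.
t = t_statistic(29.1, 27.5, 2.47, 9)
v = 9 - 1                               # degrees of freedom
is_significant = abs(t) > T_CRIT_5PCT[v]
print(round(t, 2), v, is_significant)
```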
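## Example 3:
v = n - 1 = 11 - 1 = 10

t0.05 = 2.228 for v = 10

|t| < t0.05

The value of t is less than t0.05. Hence, our assumption is accepted.

## Chi-Square Method
When a coin is tossed 60 times, the theoretical expectations are 30 heads and 30 tails. In practice, this exact result is rarely obtained: the outcomes of an experiment are generally not the same as the theoretical outcomes. The magnitude of the difference between observation and theory is measured by chi-square (χ²).

When χ² = 0, the observed and expected frequencies agree completely. As χ² increases, the discrepancy between them increases.

Let O1, O2, ..., On be the observed frequencies and E1, E2, ..., En be the expected frequencies. Then:

χ² = Σ (Oi - Ei)²/Ei

The equation of the chi-square curve can be represented as:

Y = y0 (χ²)^(v/2 - 1) e^(-χ²/2)

Where, y0 is a constant, and v = n - 1 is the number of degrees of freedom.

## Goodness of Fit
The chi-square method is used to check the goodness of fit, that is, how well the theoretical frequencies agree with the observed frequencies. The value of χ² is calculated from the data, and the probability P of exceeding that value at the prescribed level of significance is taken from the chi-square table.

- If the probability is less than 0.05, the observed value of χ² is significant at the 5% level of significance
- If the probability is less than 0.01, the observed value of χ² is significant at the 1% level of significance
- If the probability is greater than 0.05, the fit is good and the value is not significant

## Examples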
The theory predicts that the above samples should be in the proportion 8: 4: 4: 2. Examine the relation between the theory and experiment.
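The expected frequencies based on the proportion are:

8/18 x 540, 4/18 x 540, 4/18 x 540, 2/18 x 540 = 240, 120, 120, 60

χ² = Σ (O - E)²/E, where O and E are the observed and expected frequencies.

For v = 3, the calculated value of χ² is compared with the value of χ² taken from the table at the prescribed level of significance.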
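The goodness-of-fit calculation can be sketched as follows. The expected frequencies come from the 8:4:4:2 proportion and the total of 540 given above, but the observed counts in this sketch are hypothetical placeholders chosen only to make it runnable; they are not the data of the example. The function names are also illustrative assumptions, and 7.815 is the standard 5% chi-square table value for 3 degrees of freedom.

```python
def expected_from_ratio(ratio, total):
    """Split `total` into expected frequencies according to `ratio`."""
    ratio_sum = sum(ratio)
    return [total * r / ratio_sum for r in ratio]

def chi_square(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = expected_from_ratio([8, 4, 4, 2], 540)   # 240, 120, 120, 60

# HYPOTHETICAL observed counts (sum to 540); not taken from the article.
observed = [250, 115, 118, 57]

chi2 = chi_square(observed, expected)
CHI2_CRIT_5PCT_DF3 = 7.815   # 5% table value for v = 3 degrees of freedom
print(expected, round(chi2, 3), chi2 < CHI2_CRIT_5PCT_DF3)
```

For these placeholder counts the statistic falls below the table value, which would indicate a good fit at the 5% level.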
The theory predicts that the above samples should be in the proportion 9: 3: 3: 1. Examine the relation between the theory and experiment.
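The expected frequencies based on the proportion are:

9/16 x 530, 3/16 x 530, 3/16 x 530, 1/16 x 530 = 298, 99, 99, 33 (rounded)

χ² = Σ (O - E)²/E, where O and E are the observed and expected frequencies.

For v = 3, the calculated value of χ² is compared with the value of χ² taken from the table at the prescribed level of significance.

## F Distribution
Here, we consider two independent random samples taken from populations with equal standard deviations. Let the two independent random samples be x1, x2, ..., xn1 and y1, y2, ..., yn2, with means x̄ and ȳ.

The F-distribution is defined by the relation:

F = s1²/s2²

Where,

s1² = Σ (x - x̄)²/(n1 - 1)

s2² = Σ (y - ȳ)²/(n2 - 1)

The larger of the two variances is placed in the numerator. The degrees of freedom are v1 = n1 - 1 and v2 = n2 - 1.

## Examples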
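## Example 1:
The two sample variances, computed as Σ (x - x̄)²/(n1 - 1) and Σ (y - ȳ)²/(n2 - 1), are 15.55 and 10.

Placing the larger variance in the numerator:

F = 15.55/10 = 1.555

The value of F is less than F0.05. Hence, the difference is not significant.

## Example 2:
The two sample variances, computed as Σ (x - x̄)²/(n1 - 1) and Σ (y - ȳ)²/(n2 - 1), are 29.28 and 13.33.

Placing the larger variance in the numerator:

F = 29.28/13.33 = 2.20

The value of F is less than F0.05. Hence, the differences are not significant. We can consider that the samples are drawn from the same population.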
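The variance ratio can be sketched in Python as follows. This is a minimal sketch: the function names `sample_variance` and `f_ratio` are illustrative assumptions, the variances passed to `f_ratio` are the ones quoted in the two examples, and the short list passed to `sample_variance` is a hypothetical sample used only to show the divisor n - 1.

```python
def sample_variance(values):
    """s^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def f_ratio(var_a: float, var_b: float) -> float:
    """F statistic with the larger variance placed in the numerator."""
    return max(var_a, var_b) / min(var_a, var_b)

# Variances quoted in the two examples above.
print(round(f_ratio(15.55, 10.0), 3))
print(round(f_ratio(13.33, 29.28), 3))   # argument order does not matter

# Hypothetical sample, only to illustrate sample_variance.
print(sample_variance([1, 2, 3, 4, 5]))
```

The resulting F is then compared with the F-table value for v1 = n1 - 1 and v2 = n2 - 1 degrees of freedom, where v1 belongs to the sample with the larger variance.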