15 Statistical Hypothesis Tests in Python

There are hundreds of statistical tests for testing hypotheses. However, only a handful of them are required for machine learning projects. In this tutorial, we will look at some of the most important hypothesis tests for anyone who wants to work in fields related to statistical modelling. We will implement these tests in the Python programming language. Every hypothesis test mentioned below contains the following information related to the test:
Note that these assumptions are very important. If assumptions such as the expected distribution of the data sample or the required sample size are violated, the results of the test will not be accurate, and any interpretation based on them will be unreliable. Hence, it is important to check these assumptions before applying the tests.

Data samples often need to be large enough to reveal how they are distributed and to be representative of the domain. In some circumstances, it is possible to adjust the data so that it conforms to the assumptions. To give just two examples, this may be done by removing outliers from a nearly normal distribution to make it more normal, or by adjusting the degrees of freedom in a test when the variances of the data samples differ.

Finally, there may be several tests available for a given question, such as normality. Statistics does not give exact answers to questions; it gives probabilistic ones. As a result, looking at the same question in different ways can lead to different answers, so several tests may be needed to address some of the questions we have about the data.

Normality Tests

In this section, we will see the tests used to check whether a given data sample has a Gaussian distribution. Many statistical modeling techniques assume that the data follows a Gaussian distribution, so these tests are very important.

Shapiro-Wilk Test

This test checks whether the given data sample has a Gaussian (normal) distribution.

Assumptions
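As a rough illustration, the Shapiro-Wilk test can be run with `scipy.stats.shapiro`. This is a minimal sketch, not the original code: the sample is synthetic and the 0.05 threshold is an assumed significance level, so the numbers will differ from the output reported for this test.

```python
import numpy as np
from scipy.stats import shapiro

# Hypothetical sample: 50 draws from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=50)

statistic, p_value = shapiro(sample)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: the data looks Gaussian")
else:
    print("Reject H0: the data does not look Gaussian")
```

A p-value above the threshold means H0 is not rejected, which is why a large p-value points towards a Gaussian sample.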
Interpretation

H0: the given sample follows a Gaussian distribution.
H1: the given sample does not follow a Gaussian distribution.

Code Output:

The statistic value is: 0.9621855020523071, and the p-value is 0.8104783892631531
The data follows a Gaussian distribution

D'Agostino's K^2 Test

This test checks whether the given data sample is Gaussian or not.

Assumptions
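A minimal sketch of D'Agostino's K^2 test, assuming SciPy's `scipy.stats.normaltest` is used. The sample is synthetic and the 0.05 threshold is an assumption, so this does not reproduce the output reported for this test.

```python
import numpy as np
from scipy.stats import normaltest

# Hypothetical sample: 100 draws from a normal distribution.
# The test combines skewness and kurtosis, so it needs a reasonably large sample.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=100)

statistic, p_value = normaltest(sample)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: the data looks Gaussian")
else:
    print("Reject H0: the data does not look Gaussian")
```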
Interpretation

H0: the given sample follows a Gaussian distribution.
H1: the given sample does not follow a Gaussian distribution.

Code Output:

The statistic value is: 1.0653637027947445, and the p-value is 0.5870285334466323
The data follows a Gaussian distribution

Anderson-Darling Test

This test checks whether the given data sample is Gaussian or not.

Assumptions
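The Anderson-Darling test reports critical values instead of a single p-value. The sketch below assumes `scipy.stats.anderson` with its default critical-value output and uses a synthetic sample; recent SciPy releases may emit a warning about that default. The statistic is compared with each critical value: a statistic below the critical value means H0 is not rejected at that significance level.

```python
import numpy as np
from scipy.stats import anderson

# Hypothetical sample: 50 draws from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(size=50)

result = anderson(sample, dist="norm")
print(f"The statistic value is: {result.statistic}")

# significance_level is given in percent (for example 5.0 means 5%).
for level, critical in zip(result.significance_level, result.critical_values):
    if result.statistic < critical:
        print(f"Fail to reject H0 at {level}% (critical value {critical})")
    else:
        print(f"Reject H0 at {level}% (critical value {critical})")
```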
Interpretation

H0: the given sample follows a Gaussian distribution.
H1: the given sample does not follow a Gaussian distribution.

Code Output:

The statistic value is: 0.20692157645671116
The critical value at 15.0% is 0.501
The data follows a Gaussian distribution at 15.0%
The critical value at 10.0% is 0.57
The data follows a Gaussian distribution at 10.0%
The critical value at 5.0% is 0.684
The data follows a Gaussian distribution at 5.0%
The critical value at 2.5% is 0.798
The data follows a Gaussian distribution at 2.5%
The critical value at 1.0% is 0.95
The data follows a Gaussian distribution at 1.0%

Correlation Tests

Now we will see the tests that compare two samples and tell whether they are related or not.

Pearson's Correlation Coefficient

This test checks whether the given two data samples have a linear relationship or not.

Assumptions
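A minimal sketch of Pearson's test, assuming `scipy.stats.pearsonr`. The two samples are synthetic (a noisy linear relationship) and the 0.05 threshold is an assumed significance level.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical samples: y depends linearly on x, plus some noise.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.5, size=30)

statistic, p_value = pearsonr(x, y)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence of a linear relationship")
else:
    print("Reject H0: the samples appear linearly related")
```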
Interpretation

H0: the given two samples are not dependent, i.e., they are independent.
H1: there is some sort of dependency between the given samples.

Code Output:

The statistic value is: 0.6135196215696078, and the p-value is 0.05922727627191346
The data samples are independent of each other

Spearman's Rank Correlation

This is a step ahead of the Pearson test. It tests if the given samples have a monotonic relationship. The relationship can be linear or non-linear.

Assumptions
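To illustrate the monotonic but non-linear case, the sketch below assumes `scipy.stats.spearmanr` and builds a synthetic pair of samples in which `y` is an exponential function of `x`. Because the test works on ranks, a strictly increasing relationship gives a coefficient at or very near 1 even though the relationship is not linear.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical samples: y is a strictly increasing, non-linear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = np.exp(x)

statistic, p_value = spearmanr(x, y)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence of a monotonic relationship")
else:
    print("Reject H0: the samples appear monotonically related")
```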
Interpretation

H0: the given two samples are not dependent, i.e., they are independent.
H1: there is some sort of dependency between the given samples.

Code Output:

The statistic value is: 0.6969696969696969, and the p-value is 0.02509667588225183
Both data samples are dependent on each other

Kendall's Rank Correlation

This is also a step beyond the Pearson test. It checks whether the given samples have a monotonic relationship.

Assumptions
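A minimal sketch of Kendall's test, assuming `scipy.stats.kendalltau`. The two rankings are hypothetical (two judges ranking the same ten items), chosen only to show the call; the 0.05 threshold is an assumption.

```python
from scipy.stats import kendalltau

# Hypothetical ordinal data: ranks given by two judges to the same ten items.
judge_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
judge_b = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

statistic, p_value = kendalltau(judge_a, judge_b)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence of a monotonic relationship")
else:
    print("Reject H0: the rankings appear related")
```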
Interpretation

H0: the given two samples are not dependent.
H1: there is some sort of dependency between the given samples.

Code Output:

The statistic value is: 0.5111111111111111, and the p-value is 0.04662257495590829
Both data samples are dependent on each other

Chi-Squared Test

Pearson's test can only be used with numerical values. Spearman's and Kendall's rank correlation tests can be used for ordinal data, that is, categorical data that has a certain order. For nominal data (categorical data with no order), however, these tests cannot be used. To test the dependency or relationship between nominal variables, we use the Chi-Squared test.

Assumptions
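The Chi-Squared test works on a contingency table of counts. The sketch below assumes `scipy.stats.chi2_contingency` and a hypothetical 2x5 table; the counts are made up for illustration and are not the data behind the output reported for this test.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: 2 groups (rows) by 5 categories (columns).
table = np.array([
    [25, 30, 28, 32, 31],
    [30, 24, 37, 28, 32],
])

statistic, p_value, dof, expected = chi2_contingency(table)
print("Expected Frequencies are")
print(expected)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: the variables look independent")
else:
    print("Reject H0: the variables appear dependent")
```

The degrees of freedom are (rows - 1) * (columns - 1), which is 4 for a 2x5 table.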
Interpretation

H0: the given two samples are not dependent.
H1: there is some sort of dependency between the given samples.

Code Output:

Expected Frequencies are
[[27.03703704 26.54545455 31.95286195 29.49494949 30.96969697]
 [27.96296296 27.45454545 33.04713805 30.50505051 32.03030303]]
The statistic value is: 1.8882030380034551, and the p-value is 0.7563117707680647
The data samples are independent of each other

Stationarity Tests

Time series is a very important topic. Many models applied to time series require the data to be stationary. Therefore, before applying such a model, we need to check whether the time series is stationary or not. Now we will see tests to check the stationarity of the data.

Augmented Dickey-Fuller Unit Root Test

This test checks whether the given time series has a unit root, that is, whether the autoregressive model of the series has a root of modulus one. If the time series has a unit root, then it is not stationary.

Assumptions
Interpretation

H0: the time series has a unit root (the series is not stationary).
H1: the time series does not have a unit root (the series is stationary).

Code Output:

The order of the autoregressive model is 1
The statistic value is: -10.232070586545865, and the p-value is 4.998574442108246e-18
The given time series is stationary

Kwiatkowski-Phillips-Schmidt-Shin

This test checks whether the given time series is trend-stationary or not. If the series is trend-stationary, it is stationary around a deterministic trend.

Assumptions
Interpretation

H0: the given time series has a stationary trend.
H1: the given time series does not have a stationary trend.

Code Output:

The order of the autoregressive model is 0
The statistic value is: 0.09930151338766009, and the p-value is 0.1
The given time series is stationary

Parametric Statistical Hypothesis Tests

Now we will see the parametric tests. In these tests, we check whether a certain parameter of one or more samples is equal to or different from a given value or from each other.

Student's t-test

In this test, the parameter is the mean of the given samples. We check whether the means of two independent samples are significantly different from each other.

Assumptions
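A minimal sketch of the independent two-sample t-test, assuming `scipy.stats.ttest_ind`. Both samples are synthetic and the 0.05 threshold is an assumed significance level.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples: two independent groups drawn from the same distribution.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=10.0, scale=2.0, size=30)

# equal_var=True is the classic Student's t-test;
# equal_var=False would give Welch's t-test for unequal variances.
statistic, p_value = ttest_ind(group_a, group_b, equal_var=True)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence that the means differ")
else:
    print("Reject H0: the means appear different")
```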
Interpretation

H0: the mean values of the given samples are equal.
H1: the mean values of the given samples are not equal.

Code Output:

The statistic value is: 0.6713796580759667, and the p-value is 0.5105037120903526
The given samples have equal mean values

Paired Student's t-test

In this test, too, the parameter is the mean. However, this test is used when the two samples are paired. Two samples are said to be paired if both sets of values are observed on the same subjects, for example before and after a certain treatment.

Assumptions
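A minimal sketch of the paired t-test, assuming `scipy.stats.ttest_rel`. The before and after measurements are synthetic and belong to the same hypothetical subjects, so both arrays must have the same length and order.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired data: the same 20 subjects measured before and after a treatment.
rng = np.random.default_rng(0)
before = rng.normal(loc=50.0, scale=5.0, size=20)
after = before + rng.normal(loc=0.0, scale=2.0, size=20)

statistic, p_value = ttest_rel(before, after)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence that the paired means differ")
else:
    print("Reject H0: the paired means appear different")
```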
Interpretation

H0: the mean values of the paired samples are equal.
H1: the mean values of the paired samples are not equal.

Code Output:

The statistic value is: 0.9502747511161275, and the p-value is 0.36679175997294733
The paired samples have equal mean values

Analysis of Variance Test (ANOVA)

In this test, we use variance to determine whether the means of two or more independent samples are different from each other or the same.

Assumptions
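A minimal sketch of one-way ANOVA, assuming `scipy.stats.f_oneway`. The three groups are synthetic and the 0.05 threshold is an assumed significance level.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical samples: three independent groups drawn from the same distribution.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=25)
group_b = rng.normal(loc=10.0, scale=2.0, size=25)
group_c = rng.normal(loc=10.0, scale=2.0, size=25)

statistic, p_value = f_oneway(group_a, group_b, group_c)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence that any mean differs")
else:
    print("Reject H0: at least one mean appears different")
```

A rejection only says that some mean differs; it does not say which one.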
Interpretation

H0: the mean values of the given samples are equal.
H1: at least one of the sample means differs from the others.

Code Output:

The statistic value is: 0.3557581063875854, and the p-value is 0.7038772383760818
The samples have equal mean values

Nonparametric Statistical Hypothesis Tests

Mann-Whitney U Test

This test checks whether the distributions of two independent samples are equal or not.

Assumptions
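A minimal sketch of the Mann-Whitney U test, assuming `scipy.stats.mannwhitneyu` with a two-sided alternative. The samples are synthetic and deliberately non-Gaussian, since the test does not require normality.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical samples: two independent groups from the same skewed distribution.
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=1.0, size=15)
group_b = rng.exponential(scale=1.0, size=12)

statistic, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence that the distributions differ")
else:
    print("Reject H0: the distributions appear different")
```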
Interpretation

H0: the distributions underlying the independent samples are equal.
H1: the distributions underlying the independent samples are not equal.

Code Output:

The statistic value is: 60.0, and the p-value is 0.47267559351158717
The samples have the same distributions

Wilcoxon Signed-Rank Test

This test checks whether the distributions of two paired samples are equal or not.

Assumptions
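A minimal sketch of the Wilcoxon signed-rank test, assuming `scipy.stats.wilcoxon`. The paired measurements are synthetic; the test is computed on the differences between the two arrays, so they must be aligned subject by subject.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired data: the same 15 subjects measured twice.
rng = np.random.default_rng(0)
before = rng.exponential(scale=2.0, size=15)
after = before + rng.normal(loc=0.0, scale=0.5, size=15)

statistic, p_value = wilcoxon(before, after)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence that the paired distributions differ")
else:
    print("Reject H0: the paired distributions appear different")
```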
Interpretation

H0: the distributions underlying the paired samples are equal.
H1: the distributions underlying the paired samples are not equal.

Code Output:

The statistic value is: 15.0, and the p-value is 0.232421875
The samples have the same distributions

Kruskal-Wallis H Test

This test checks whether the distributions of two or more independent samples are equal or not.

Assumptions
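A minimal sketch of the Kruskal-Wallis H test, assuming `scipy.stats.kruskal`. The three groups are synthetic and independent; the 0.05 threshold is an assumed significance level.

```python
import numpy as np
from scipy.stats import kruskal

# Hypothetical samples: three independent groups from the same skewed distribution.
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=1.0, size=20)
group_b = rng.exponential(scale=1.0, size=20)
group_c = rng.exponential(scale=1.0, size=20)

statistic, p_value = kruskal(group_a, group_b, group_c)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence that the distributions differ")
else:
    print("Reject H0: at least one distribution appears different")
```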
Interpretation

H0: the distributions underlying the independent samples are equal.
H1: the distributions underlying the independent samples are not equal.

Code Output:

The statistic value is: 0.5714285714285694, and the p-value is 0.4496917979688917
The samples have the same distributions

Friedman Test

This test checks whether the distributions of two or more paired samples are equal or not.

Assumptions
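A minimal sketch of the Friedman test, assuming `scipy.stats.friedmanchisquare`, which needs at least three paired samples. The data is synthetic: the same hypothetical subjects are measured under three conditions.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical repeated measures: 12 subjects, each measured under three conditions.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=20.0, scale=3.0, size=12)
condition_a = baseline + rng.normal(scale=1.0, size=12)
condition_b = baseline + rng.normal(scale=1.0, size=12)
condition_c = baseline + rng.normal(scale=1.0, size=12)

statistic, p_value = friedmanchisquare(condition_a, condition_b, condition_c)
print(f"The statistic value is: {statistic}, and the p-value is {p_value}")

alpha = 0.05  # assumed significance level
if p_value > alpha:
    print("Fail to reject H0: no evidence that the paired distributions differ")
else:
    print("Reject H0: at least one paired distribution appears different")
```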
Interpretation

H0: the distributions underlying the paired samples are equal.
H1: the distributions underlying the paired samples are not equal.

Code Output:

The statistic value is: 2.4000000000000057, and the p-value is 0.3011942119122012
The samples have the same distributions

Summary

In this tutorial, you learned about the primary hypothesis tests that you may apply in a machine learning project. In particular, you discovered: