SciPy Stats

The scipy.stats sub-package contains a large number of statistical functions and probability distributions. The list of statistics functions can be obtained with info(stats). A list of the available random variables can also be obtained from the docstring of the stats sub-package.

Sr. | Function | Description
1. | rv_continuous | A base class used to construct specific distribution classes and instances for continuous random variables.
2. | rv_discrete | A base class used to construct specific distribution classes and instances for discrete random variables.
3. | rv_histogram | Inherits from the rv_continuous class; generates a distribution given by a histogram.

Normal Continuous Random Variable

Two general distribution classes have been implemented to encapsulate continuous random variables and discrete random variables. Here we discuss the continuous random variables:
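The original listing is not shown here, so the following is a minimal sketch that reproduces the output given next; the input values are inferred from that output and should be treated as an assumption.

import numpy as np
from scipy.stats import norm

# Evaluate the CDF of the standard normal distribution at several points.
# The input array is inferred from the output shown below (an assumption).
print(norm.cdf(np.array([3, -1, 0, 1, 2, 4, -2, 5])))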

Output:

[0.9986501  0.15865525 0.5        0.84134475 0.97724987 0.99996833
 0.02275013 0.99999971]

In the above program, we first import norm from scipy.stats and then pass the data as a NumPy array to the cdf() function.

To get the median of the distribution, we can use the Percent Point Function (PPF), which is the inverse of the CDF.
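For example, a minimal sketch; for the standard normal distribution the median is 0:

from scipy.stats import norm

# The PPF (inverse CDF) evaluated at 0.5 gives the median of the distribution.
print(norm.ppf(0.5))   # 0.0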

We can also generate a sequence of random numbers from the distribution; the size argument specifies how many values to draw.
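A sketch that produces four random draws like those shown below (because the data are random, the exact values will differ on every run):

from scipy.stats import norm

# Draw four samples from the standard normal distribution.
print(norm.rvs(size=4))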

Output:

[-0.42700905  1.0110461   0.05316053 -0.45002771]

The output varies every time the program runs. To generate the same random numbers on each run, we can set a seed, for example with numpy.random.seed() or the random_state argument of rvs().
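For instance (a sketch; the seed value 0 is an arbitrary choice):

import numpy as np
from scipy.stats import norm

np.random.seed(0)          # fix the global NumPy random seed
print(norm.rvs(size=4))    # the same four numbers on every run
# Alternatively: norm.rvs(size=4, random_state=0)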

Descriptive Statistics

Descriptive statistics describe the observed values of a variable. There are various statistics, such as the minimum, maximum, and variance, that take a NumPy array as input and return the corresponding result. Some essential functions provided by the scipy.stats package are described in the following table.

Sr. | Function | Description
1. | describe() | Computes several descriptive statistics of the input array.
2. | gmean() | Computes the geometric mean along the specified axis.
3. | hmean() | Calculates the harmonic mean along the specified axis.
4. | kurtosis() | Computes the kurtosis.
5. | mode() | Returns the modal (most common) value.
6. | skew() | Measures the skewness of the data.
7. | zscore() | Calculates the z-score of each value in the sample, relative to the sample mean and standard deviation.

Let us consider the following program:
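The original listing is not shown here; the following is a plausible reconstruction based on the output below. The sample size of 100 and the choice of statistics printed are assumptions, and the numbers will differ on every run because the data are random.

import numpy as np
from scipy import stats

x = stats.norm.rvs(size=100)             # 100 standard-normal samples
print(x.mean())                          # sample mean
print(np.median(x))                      # sample median (assumed second value)
print(np.array([x.min(), x.max()]))      # minimum and maximum
print(stats.describe(x))                 # nobs, minmax, mean, variance, skewness, kurtosis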

Output:

0.006283818005153084
-0.03008382588766136
[-2.1865825   2.47537921]

DescribeResult(nobs=100, minmax=(-2.1865824992721987, 2.475379209985273), mean=0.006283818005153084, variance=1.0933102537156147, skewness=0.027561719919920322, kurtosis=-0.6958272633471831)

T-Test

The t-test compares two averages (means) and tells us whether they differ from each other. It also tells us whether the difference between the groups is statistically significant.

T-score

The t-score is the ratio of the difference between two groups to the variation within the groups. A small t-score indicates that the groups are relatively similar, while a larger t-score indicates a greater difference between the groups.
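One common form of this ratio for two independent samples (Welch's t-statistic), where $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the sample sizes, is:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$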

Comparing two samples

Two samples are given, which can come either from the same distribution or from different distributions, and we want to test whether these samples have the same statistical properties.
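A sketch that produces output of the form shown below: two columns of samples are drawn and each column is tested against the population mean with a one-sample t-test. The parameters loc=5, scale=10 and the shape (50, 2) are assumptions, and because the data are random the exact numbers will differ.

from scipy import stats

# Two columns of 50 samples each, drawn from a normal distribution.
rvs = stats.norm.rvs(loc=5, scale=10, size=(50, 2))

# One-sample t-test of each column against the population mean 5.0.
print(stats.ttest_1samp(rvs, 5.0))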

Output:

Ttest_1sampResult(statistic=array([0.42271098, 1.1463823 ]), pvalue=array([0.67435547, 0.25720448]))

In the above output, the p-value is the probability that the results from the sample data occurred by chance. P-values range from 0 to 1 (0% to 100%); the smaller the p-value, the less likely the observed difference is due to chance alone.

SciPy Linear Regression

Linear regression is used to find the relationship between two variables. SciPy provides the linregress() function to perform linear regression. The syntax is given below:
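result = scipy.stats.linregress(x, y)

(The call above follows the scipy.stats API; the returned result object exposes the slope, intercept, rvalue, pvalue, and stderr attributes.)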

Parameters:

x, y: These two parameters should be arrays of the same length.

There are two types of linear regression.

  • Simple regression
  • Multivariable regression

Simple Regression

Simple linear regression is a method of predicting a response using a single feature. It assumes that the two variables are linearly related, so that one variable can be accurately predicted from the other. For example, temperature in degrees Fahrenheit can be predicted exactly from temperature in degrees Celsius.
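A minimal sketch of this example using linregress(); the Celsius values are arbitrary sample data, while the exact Celsius-to-Fahrenheit relationship means the fitted slope is 1.8 and the intercept is 32:

import numpy as np
from scipy import stats

celsius = np.array([0, 10, 20, 30, 40])    # arbitrary sample temperatures
fahrenheit = celsius * 1.8 + 32            # exact linear relationship

result = stats.linregress(celsius, fahrenheit)
print(result.slope, result.intercept)      # 1.8 32.0
print(result.rvalue)                       # 1.0, i.e. a perfect linear fit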

Multivariable Regression

Multiple linear regression models the relationship between one continuous dependent variable and two or more independent variables.

For example, a variable such as price may depend on several other variables at once.
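Since scipy.stats.linregress() handles only a single feature, the following is a minimal sketch of the multivariable case using numpy.linalg.lstsq instead; the feature meanings and data are illustrative assumptions.

import numpy as np

# Illustrative data (assumption): price depends on two features, e.g. size and age.
X = np.array([[50.0, 10.0],
              [80.0,  5.0],
              [120.0, 2.0],
              [60.0, 20.0]])
y = np.array([150.0, 260.0, 400.0, 155.0])   # made-up prices

# Append a column of ones so the least-squares fit also estimates an intercept.
A = np.column_stack([X, np.ones(len(X))])
coefficients, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(coefficients)   # [coefficient_size, coefficient_age, intercept]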





