Top 45+ Most Asked Statistics Interview Questions and Answers

1) What is Statistics?

Statistics is the discipline concerned with the collection, organization, analysis, interpretation, and presentation of data. It is applied to scientific, industrial, and social problems, usually starting from a statistical population or a statistical model of the data. A population can be any well-defined group of people or objects, such as "all people living in a country".

Statistics is the study of every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.

2) What are the different types of Statistics?

There are mainly two types of Statistics:

Descriptive statistics
Inferential statistics

Descriptive Statistics

Descriptive statistics is a type of statistics where data is summarized through the given observations. The summary is computed from a sample of the population using measures such as the mean or standard deviation. Descriptive statistics provides a way to organize, represent, and describe a collection of data using tables, graphs, and summary measures. For example, it can describe the share of people in a city who use a specific service, such as the internet or a television channel.

The descriptive statistics can be categorized into the following four different categories:

Measure of frequency
Measure of position
Measure of dispersion
Measure of central tendency

Inferential Statistics

Inferential statistics is a type of statistics used to draw conclusions that go beyond a descriptive summary. It deals with data that is subject to random variation, such as observational error and sampling variation. Once we have collected, analyzed, and summarized the data, we use inferential statistics to work out what the data implies.

In this method, we use the information collected from a sample to make decisions, predictions, or inferences about a population. It also allows us to make statements that go beyond the available data or information.

3) What is the key difference between data and statistics?

In general, people often use the terms "data" and "statistics" interchangeably, but there is a key difference between them. Data can be specified as the individual pieces of factual information recorded and used for analysis. In other terms, data is raw information from which statistics are created. On the other hand, statistics are the results of data analysis, its interpretation, and presentation.

In other words, statistics are the output of computations that help us understand what the data means. Statistics are generally presented in the form of a table, chart, or graph. Research frequently requires both statistics and data. Statistics are often reported and used by government agencies, for example, unemployment statistics and educational literacy statistics. These types of statistics are called "statistical data".

4) What are the main things you should know before studying data analysis?

The following are the four main things you should know before studying data analysis:

Descriptive statistics
Inferential statistics
Distributions (normal distribution / sampling distribution)
Hypothesis testing

5) What are the four different types of data statistics?

Data statistics can be divided into mainly two categories:

Qualitative data
Quantitative data

These can be further subdivided into four types of data: nominal data and ordinal data come under qualitative data, and interval data and ratio data come under quantitative data.

Qualitative data: Qualitative data is a set of information that cannot be measured in the form of numbers. It is also called categorical data. It normally contains words, narratives, etc., that we label with names. It mainly focuses on the qualities of things in data, and after the qualitative data analysis, the outcome comes in featuring keywords, extracting data, and ideas elaboration.

For example, a person's hair color such as black, brown, red, blonde, etc. The qualitative data can be divided into two subcategories: nominal and ordinal.

Nominal Data: Nominal data label variables with no quantitative value and no order. Changing the order of the values does not change their meaning, so nominal data can be observed but not measured.
Ordinal Data: Ordinal data is similar to nominal data except that its categories have a natural order, such as 1st, 2nd, etc.

6) What is the Central Limit Theorem? Why is it used?

The Central Limit Theorem (CLT) is one of the most important results in statistics. It states that when samples of a sufficiently large size are drawn from a population, the distribution of the sample mean is approximately normal, regardless of the shape of the population's distribution (provided the population has a finite variance). As a rule of thumb, a sample size of 30 or more is generally considered sufficient for the CLT to hold.

The CLT is mainly used in calculating confidence intervals and in hypothesis testing. For example, if you want to calculate the average height of the people in the world, you have to take some samples from the general population, which serve as the data set. It is nearly impossible to get the height of every person in the world, so you calculate the mean of your sample data instead.

By repeating the sampling several times, you get a set of sample means and their frequencies, which you can plot on a graph. The result is a bell-shaped, approximately normal curve centred on the population mean, even if the original data set is not bell-shaped.
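The repeated-sampling idea can be sketched in Python. This is a minimal simulation, not a proof: the exponential population (mean 1.0), the seed, and the constants SAMPLE_SIZE and NUM_SAMPLES are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30      # the rule-of-thumb threshold for the CLT
NUM_SAMPLES = 5000    # how many times the sampling is repeated

# Assumed population: exponential with mean 1.0, which is strongly right-skewed.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"std dev of sample means: {spread_of_means:.3f}")
```

The sample means cluster around the population mean of 1.0, and their spread is close to 1/sqrt(30), even though the population itself is far from bell-shaped.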

7) What do you understand by observational and experimental data in Statistics?

Observational data is a type of data obtained from observational studies. In observational data, we observe the variables without intervening to see if there is any correlation between them. On the other hand, experimental data is a type of data that is collected from experimental studies. Here, we deliberately change one variable while holding the others constant to see whether the change causes a difference in the outcome.

8) How can you assess the statistical significance of an insight?

We can use hypothesis testing to determine the statistical significance of an insight. Here, we state the null and alternative hypotheses and then calculate the p-value, which is computed under the assumption that the null hypothesis is true. We then compare the p-value with the alpha value, which is the significance level chosen before the test. If the p-value is less than the alpha value, the null hypothesis is rejected; otherwise, we fail to reject it. A rejected null hypothesis indicates that the result obtained is statistically significant.

9) What is the difference between data analysis and machine learning?

Following is a list of key differences between data analysis and machine learning:

Data Analysis	Machine Learning
Data analysis is a process where we inspect, clean, transform, and model data to find useful information, informing conclusions, and support decision-making, which can enhance the decision-making process.	Machine learning is mainly used to automate the entire data analysis workflow to provide deeper, faster, and more comprehensive insights.
Data analysis requires a deep knowledge of coding and basic knowledge of statistics.	On the other hand, machine learning requires a basic knowledge of coding and deep knowledge of statistics and business.
We mainly focus on generating valuable insights from the available data in data analysis. Companies use the data analysis process to make better decisions regarding several matters such as marketing, production, etc.	We mainly focus on studying algorithms that improve the overall user experience in machine learning. It is a subset of artificial intelligence that leverages algorithms to analyze huge amounts of data.
Data analysis may require human intervention to inspect, clean, transform, and model data to find useful and trustworthy information.	In machine learning, we use algorithms that learn from data automatically and apply the learning without human intervention.
The average salary of a data analysis professional in India is less than the salary of a machine learning professional.	The average salary of a machine learning professional in India is more than the salary of a data analysis professional.
A data analysis professional has to deal with data, so they should have deep knowledge of coding and basic knowledge of statistics.	A machine learning professional must know about Deep Learning, Natural Language Processing (NLP), Computer Vision, Data Analytics Skills, Statistical Analysis, SQL, and knowledge of R and Python programming language.

10) What is the difference between inferential statistics and descriptive statistics?

Descriptive statistics provide exact and accurate information about the sample that was observed. Inferential statistics, on the other hand, use that sample to draw conclusions about the population, so their results always carry some uncertainty.

11) What is Normality in Statistics?

In Statistics, Normality means that data follow a normal (Gaussian) distribution. The word is also used in psychology for behaviour consistent with a person's usual way of behaving. In that sense, it is an accepted standard of thinking and behaving similarly to the majority, and it is generally seen as good in this context. According to the situation, it can also be specified as expected and appropriate behaviour.

In the case of psychological statistics, it can also be just being average. It specifies how you adjust to the surroundings, manage or control emotions, work satisfactorily, and build satisfactory, fulfilling, or at least acceptable relationships.

12) What are the criteria for Normality?

For any specified behaviour or trait, the criterion for Normality is being average or close to the average. This means that scores falling within one standard deviation above or below the mean are considered normal, which covers the central 68.3% of a normally distributed population.

13) What is the assumption of Normality?

In technical terms, the assumption of Normality states that the sampling distribution of the mean is normal, that is, the distribution of means across samples is normal. This is true across independent samples as well.

14) What is the main usage of long-tailed distributions? Where are they mainly used?

Long-tailed distributions are distributions whose tail drops off gradually toward the end of the curve. They come up most often in classification and regression problems. The Pareto principle and the distribution of product sales are good examples of long-tailed distributions.

15) What do you understand by Hypothesis Testing?

In Statistics, Hypothesis Testing is mainly used to see if a certain experiment generates meaningful results. It helps assess the statistical significance of an insight by finding the odds of the results occurring by chance. In Hypothesis Testing, the first step is to state the null hypothesis. After that, the p-value is calculated under the assumption that the null hypothesis is true. The alpha value specifies the significance level and is chosen before the test.

If the p-value is less than the alpha value, the null hypothesis is rejected. If the p-value is greater than the alpha value, we fail to reject the null hypothesis, which is not the same as proving it true. If the null hypothesis is rejected, it indicates that the results obtained are statistically significant.
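The procedure described above (state the null hypothesis, compute the p-value, compare it with alpha) can be sketched with a one-sample z-test. This is only an illustration under stated assumptions: the measurements, the hypothesised mean mu0, and the known population standard deviation sigma are all made-up values, and z_test_p_value is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, fmean

def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    # Probability of a statistic at least this extreme if H0 is true.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical measurements; mu0 and sigma are assumed values.
sample = [52.1, 49.8, 53.4, 51.2, 50.9, 54.0, 52.7, 51.5, 53.1, 52.3]
alpha = 0.05
p_value = z_test_p_value(sample, mu0=50.0, sigma=2.0)

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p-value = {p_value:.4f}: {decision}")
```

When sigma is not known, a t-test is the usual choice; the z-test is used here only because it needs nothing beyond the standard library.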

16) How can you handle the missing data in Statistics?

There are several ways to handle the missing data in Statistics:

By predicting the missing values.
By assigning the individual or unique values.
By deleting the rows which have the missing data.
By mean imputation or median imputation.
By using the random forests, which support the missing values.

17) What do you understand by mean imputation for missing data? Why is it considered bad?

Mean imputation is a method where null values in a dataset are replaced directly with the mean of the corresponding feature. It is a rarely used practice nowadays. Mean imputation is considered bad practice because it ignores the correlations between features. It also artificially reduces the variance of the data and increases bias, which may cause a dip in the model's accuracy and produce confidence intervals that are too narrow.
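The variance-shrinking effect is easy to see on a small example. The feature values below are hypothetical, with missing entries marked as None; this is a sketch, not a recommended imputation pipeline.

```python
import statistics

# Hypothetical feature with missing entries marked as None.
values = [12.0, None, 15.0, 9.0, None, 20.0, 14.0]

observed = [v for v in values if v is not None]
mean_value = statistics.fmean(observed)

# Replace every missing entry with the mean of the observed entries.
imputed = [mean_value if v is None else v for v in values]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)
print(f"variance before: {var_observed:.2f}, after: {var_imputed:.2f}")
```

The imputed points sit exactly on the mean, so they add nothing to the sum of squared deviations while increasing the count, which pulls the variance down.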

18) What do you understand by six Sigma in Statistics?

In Statistics, six Sigma is a quality control method that aims for a nearly error-free or defect-free output. In this method, the standard deviation is known as Sigma or σ. The larger the standard deviation, the less likely the process is to perform accurately and the more likely it is to cause a defect. A six sigma process works better than 1σ, 2σ, 3σ, 4σ, and 5σ processes and is reliable enough to provide nearly defect-free work. If the outcome of a process is 99.99966% error-free, it is considered six Sigma.

19) What is an exploratory data analysis in Statistics?

In Statistics, an exploratory data analysis is the process of performing investigations on data to understand the data better. In this process, the initial investigations are done to determine patterns, spot abnormalities, test hypotheses, and check if the assumptions are correct.

20) What do you understand by selection bias?

In Statistics, the selection bias is a phenomenon that involves the selection of individual or grouped data in a way that is not considered to be random. Randomization plays a vital role in performing analysis and understanding the model functionality better. If we don't achieve the correct randomization, the resulting sample will not accurately represent the population.

21) What is an outlier in Statistics? How can you determine an outlier in a data set?

In Statistics, outliers are data points that differ greatly from the other observations in the dataset. Depending on the learning algorithm, an outlier can sharply decrease a model's accuracy and efficiency.

We can determine an outlier by using two methods:

Standard deviation/z-score
Interquartile range (IQR)
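Both methods can be sketched in a few lines of Python. The thresholds (3 standard deviations, 1.5 times the IQR) are common conventions rather than fixed rules, the data set is made up, and the function names are hypothetical.

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > threshold]

def iqr_outliers(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return [x for x in data if x < q1 - k * iqr or x > q3 + k * iqr]

# Made-up data with one obviously extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(zscore_outliers(data))
print(iqr_outliers(data))
```

The z-score method is sensitive to the outlier itself, because an extreme value inflates the standard deviation; the IQR method is more robust in small samples.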

22) What do you understand by an inlier in Statistics?

An inlier is a data point that lies within the general range of the rest of the data set. An inlier is harder to find than an outlier, and identifying it accurately often requires external data.

Like outliers, inliers reduce model accuracy. An inlier is usually an error, so we have to remove it to maintain the model's accuracy.

23) What do you understand by KPI in Statistics?

KPI is an acronym that stands for Key Performance Indicator. A KPI is a quantifiable measure to understand if we can achieve the goal or not. KPI is a reliable metric that is generally used to measure the performance level of an organization or individual for the objectives. An example of KPI in an organization is the expense ratio.

24) What are the different types of selection bias in Statistics?

There are several types of selection bias in Statistics:

Attrition selection bias
Observer selection bias
Protopathic selection bias
Time intervals selection bias
Sampling selection bias

25) What is the law of large numbers in Statistics?

In Statistics, the law of large numbers states that as we increase the number of trials in an experiment, the average of the results comes closer and closer to the expected value. For example, if you roll a six-sided die three times and take the average, the result can be far from the expected value. On the other hand, if you roll a die a large number of times, the average result will be close to the expected value, which is 3.5 in this case. This is a good example of the law of large numbers in Statistics.
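The dice example can be simulated directly. This is a minimal sketch; the seed and the roll counts are arbitrary choices, and average_roll is a hypothetical helper name.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def average_roll(num_rolls):
    """Average of `num_rolls` rolls of a fair six-sided die."""
    return statistics.fmean(random.randint(1, 6) for _ in range(num_rolls))

for n in (3, 100, 100_000):
    print(f"{n:>7} rolls: average = {average_roll(n):.3f}")
```

With only three rolls the average can land anywhere between 1 and 6, while with 100,000 rolls it settles very close to 3.5.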

26) What is root cause analysis in Statistics? Can you give an example to explain it?

As the name suggests, root cause analysis is a method used in Statistics to solve problems by first identifying the root cause of the problem.

For example, if you see that a higher crime rate in a city is associated with higher sales of black-coloured shirts, it means that they have a positive correlation. However, it does not mean that one causes the other. Causation is tested using A/B testing or hypothesis testing.

27) What are some important properties of a normal distribution in Statistics?

A normal distribution describes data that is symmetric about the mean, where values far from the mean occur less frequently. It appears as a bell-shaped curve in graphical form, which is symmetrical about the mean. In Statistics, a normal distribution is also known as a Gaussian distribution.

A normal distribution consists of the following properties:

Symmetrical: The curve is symmetric about the mean, so its left half mirrors its right half.
Unimodal: As the name specifies, the distribution has only one mode.
Mean: The mean measures the central tendency and locates the centre of the curve.
Central tendency: The mean, median, and mode all lie at the centre, which means they are all equal, and the curve is perfectly symmetrical about that midpoint.

28) In which cases is the median a better measure than the mean?

When the data contains many outliers that can skew it positively or negatively, we prefer the median because it is much less affected by extreme values than the mean.

29) What is the 'p-value' in Statistics? How would you describe it?

In Statistics, a p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. It is calculated during hypothesis testing. If the p-value is 0.05, results this extreme would occur by chance about 5% of the time when the null hypothesis is true. If the p-value is less than alpha, we reject the null hypothesis.

30) How can you calculate the p-value using MS Excel in Statistics?

In Excel, the p-value is called the probability value. It is used to understand the statistical significance of a finding. The main use of the p-value is to test the validity of the Null Hypothesis. If the Null Hypothesis looks implausible given the p-value, we have reason to believe that the alternative hypothesis might be true. The p-value helps us judge whether the observed results could be caused by chance or whether we are testing two unrelated things. So, the p-value is considered an investigator and not a judge.

It is a number between 0 and 1, but it is generally denoted in percentages. If the p-value is 0.05, it will be denoted as 5%. A smaller p-value leads to the rejection of the Null Hypothesis.

Following is the formula to calculate the p-value using MS Excel in Statistics:

p-value = tdist(x,deg_freedom,tails)

The p-value is expressed in decimals in Excel. Follow the steps given below to calculate the p-value in Excel:

First, find the Data tab.
After that, click on the Data Analysis icon in the Analysis group.
Select Descriptive Statistics and then click OK.
Select the relevant column.
Input the confidence level and other variables.

31) What do you understand by DOE in Statistics?

DOE is an acronym that stands for the Design of Experiments in Statistics. In this process, we design a task that describes and explains how the output changes in response to changes in the independent input variables.

32) What do you understand by Covariance?

Covariance is a measure that specifies how much two random variables vary together. It indicates how two variables move in sync with each other and gives the direction of the relationship between them. There are two types of Covariance: positive and negative. Positive Covariance specifies that both variables tend to be high or low simultaneously. Negative Covariance, on the other hand, specifies that when one variable is high, the other tends to be low.
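A small worked example shows both signs. The sketch below computes the sample covariance by hand (dividing by n - 1); the variable names and values are invented for illustration.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

# Invented data for illustration only.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]   # rises with hours: positive covariance
hours_gaming = [9, 7, 6, 3, 2]      # falls with hours: negative covariance

print(sample_covariance(hours_studied, exam_score))
print(sample_covariance(hours_studied, hours_gaming))
```

The magnitude of the covariance depends on the units of both variables, so only its sign is directly interpretable; correlation rescales it to the range -1 to 1.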

33) What is the Pareto principle used in Statistics?

The Pareto principle used in Statistics is also called the 80/20 principle or 80/20 rule. This principle specifies that 80 per cent of the results are obtained from 20 per cent of the causes in an experiment.

For example, 80 per cent of the wheat on a farm may come from 20 per cent of the wheat plants.

34) What type of data does not have a log-normal or Gaussian distribution?

Exponentially distributed data do not follow a log-normal distribution or a Gaussian distribution. Categorical data of any type will not follow these distributions either.

For example, duration of a phone call, time until the next earthquake, etc.

35) What is IQR in Statistics? How can you calculate the IQR?

IQR is an acronym that stands for interquartile range. It is a measurement of the "middle fifty" in a data set. The IQR describes the middle 50% of values when ordered from lowest to highest.

Follow the steps given below to find the interquartile range (IQR) in Statistics:

First, order the data and find the median (middle value) of the lower half and of the upper half.
These values are quartile 1 (Q1) and quartile 3 (Q3).
The IQR is the difference between Q3 and Q1.

IQR = Q3 - Q1

Q3 is the third quartile (75th percentile), and Q1 is the first quartile (25th percentile).
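The steps above can be sketched as follows. This follows the "median of each half" method described here, leaving out the middle value when the number of observations is odd; other quartile conventions exist and can give slightly different results. The function name and data are illustrative.

```python
import statistics

def interquartile_range(data):
    """Return (Q1, Q3, IQR) using the medians of the lower and upper halves."""
    if len(data) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]      # values below the overall median
    upper = ordered[-half:]     # values above the overall median
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1

q1, q3, iqr = interquartile_range([1, 3, 5, 7, 9, 11, 13, 15, 17])
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```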

36) What do you understand by the five-number summary in Statistics?

In Statistics, the five-number summary is used to measure five entities covering the entire data range. It is mainly used in descriptive analysis or during the preliminary investigation of a large data set.

The five-number summary contains the following five values:

Low extreme (Min)
The first quartile (Q1)
Median
Upper quartile (Q3)
High extreme (Max)

Note: These values are selected to summarise the data set as each value describes a specific part of a data set. Here, the median specifies the centre of a data set, the upper and lower quartiles span the middle half of a data set, and the highest and lowest observations provide additional information about the actual dispersion of the data.
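A five-number summary can be computed with Python's standard library. This sketch uses statistics.quantiles with method="inclusive", which is one of several quartile conventions, so Q1 and Q3 may differ slightly from values obtained by other methods. The data set is made up.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a data set."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

# Made-up observations for illustration.
observations = [7, 2, 9, 4, 12, 4, 5, 10, 8]
print(five_number_summary(observations))
```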

37) What is the advantage of using the box plot?

The box plot shows the 5-number summary pictorially. It is mainly used to compare the distributions of several groups side by side, which is harder to do with a group of histograms.

38) What is the difference between the 1st quartile, the 2nd quartile, and the 3rd quartile?

In Statistics, quartiles are used to describe data distribution by dividing the data into four equal portions. The three boundaries between these portions are called quartiles.

There are three types of quartile:

The lower quartile (Q1) specifies the 25th percentile of the data.
The middle quartile (Q2): It is also called the median and specifies the 50th percentile of the data.
The upper quartile (Q3) specifies the 75th percentile of the data.

39) What do you understand by skewness?

Skewness can be described as a distortion or asymmetry that deviates from a data set's symmetrical bell curve or normal distribution. You can assume it as a degree of asymmetry observed in a probability distribution.

Skewness can be of two types, i.e. right (positive) skewness and left (negative) skewness. Skewness is measured relative to the mean. If skewness is negative, the data has a longer tail on the left of the mean than on the right. If skewness is positive, the longer tail is on the right. A normal distribution (bell curve) shows zero skewness.

40) What is the difference between a left-skewed distribution and right-skewed distribution?

The key difference between the left-skewed distribution and the right-skewed distribution is that the left tail is longer than the right tail in the left-skewed distribution. Here, typically, mean < median < mode. On the other hand, the right tail is longer than the left tail in the right-skewed distribution. Here, typically, mode < median < mean.
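The sign of the skewness and the position of the mean relative to the median can be checked on a small example. The sketch below uses the moment-based definition of skewness (mean cubed deviation divided by the cubed standard deviation); the data and the function name are illustrative.

```python
import statistics

def sample_skewness(data):
    """Moment-based skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return statistics.fmean(((x - mean) / sd) ** 3 for x in data)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]   # long tail on the right
left_skewed = [-x for x in right_skewed]          # mirror image

print(sample_skewness(right_skewed), sample_skewness(left_skewed))
print(statistics.fmean(right_skewed), statistics.median(right_skewed))
```

For the right-skewed list, the single large value drags the mean above the median; mirroring the data flips both the sign of the skewness and that ordering.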

41) What are the different types of data sampling in Statistics?

There are mainly four types of data sampling in Statistics:

Simple random: Members are selected from the population purely at random.
Cluster: The population is divided into clusters, and entire clusters are selected at random.
Stratified: The population is divided into distinct groups (strata), and members are sampled from each group.
Systematic: This data sampling type picks every nth member in the data.
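The four sampling types can be sketched with Python's random module. The population of 100 numbered members, the two strata, the cluster size, and the sample sizes are all assumptions made for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable
population = list(range(1, 101))   # hypothetical member IDs 1..100

# Simple random: every member has the same chance of selection.
simple = random.sample(population, 10)

# Systematic: every n-th member after a random start.
step = 10
start = random.randrange(step)
systematic = population[start::step]

# Stratified: sample separately from each group (stratum).
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

# Cluster: split into clusters, then keep whole randomly chosen clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [m for c in random.sample(clusters, 2) for m in c]

print(simple, systematic, stratified, cluster_sample, sep="\n")
```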

42) What is Bessel's correction? Why is it used in Statistics?

In Statistics, Bessel's correction is the use of n - 1 instead of n in the denominator when estimating the variance or standard deviation of a population from a sample. It makes the estimate less biased and is mainly used to provide more accurate results.
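The effect of the correction is visible on a small sample. In the sketch below the sample values are made up; the standard library's statistics.pvariance divides by n and statistics.variance divides by n - 1.

```python
import statistics

sample = [4, 8, 6, 5, 3, 7]   # made-up sample

n = len(sample)
mean = statistics.fmean(sample)
sum_sq = sum((x - mean) ** 2 for x in sample)

biased_variance = sum_sq / n            # divides by n
corrected_variance = sum_sq / (n - 1)   # Bessel's correction: divide by n - 1

print(biased_variance, statistics.pvariance(sample))
print(corrected_variance, statistics.variance(sample))
```

Dividing by n - 1 always gives a slightly larger estimate, which compensates for the fact that deviations are measured from the sample mean rather than the unknown population mean.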

43) What is the difference between type I vs type II errors?

A type I error occurs when the null hypothesis is rejected even though it is true. It is also known as a false positive. On the other hand, a type II error occurs when the null hypothesis is not rejected even though it is false. It is also known as a false negative.

44) What is the relationship between the significance level and the confidence level in Statistics?

In Statistics, the significance level is the probability of rejecting the null hypothesis when it is actually true. On the other hand, the confidence level is the proportion of confidence intervals, over repeated sampling, that would contain the true population value.

We can specify the relationship between the significance level and the confidence level by the following formula:

Significance level = 1 - Confidence level

45) What do you understand by the Binomial Distribution formula?

Following is the formula for the Binomial Distribution:

b(x; n, P) = nCx * P^x * (1 - P)^(n - x)

Parameter explanation:

b = It specifies the binomial probability.
x = It specifies the total number of "successes" (for example, passes or heads).
P = It specifies the probability of success on an individual trial.
n = It specifies the number of trials.
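The formula above translates directly into Python using math.comb for the nCx term. The function name binomial_probability is a hypothetical choice, and the coin-toss example is only an illustration.

```python
from math import comb

def binomial_probability(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Probability of exactly 2 heads in 4 fair coin tosses.
print(binomial_probability(2, 4, 0.5))

# The probabilities over all possible values of x add up to 1.
print(sum(binomial_probability(x, 4, 0.5) for x in range(5)))
```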

46) What are the examples of symmetric distribution in Statistics?

Symmetric distribution specifies that the data on the left side of the median is a mirror image of the data on the right side of the median.

Following are the three most widely used examples of symmetric distribution:

Normal distribution
Uniform distribution
Binomial distribution (when the probability of success is 0.5)

47) What is the empirical rule in Statistics?

In Statistics, the empirical rule is also known as the 68-95-99.7 rule. It specifies that almost every piece of data (about 99.7%) in a normal distribution lies within three standard deviations of the mean.
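The percentages behind the 68-95-99.7 name can be checked with the standard library's NormalDist class. This is a small sketch; share_within is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

Because the result depends only on the number of standard deviations, the same shares apply to any normal distribution, whatever its mean and standard deviation.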

According to the empirical rule,