## Basic Statistics Concepts for Data Science

Applying statistics to data means performing mathematical operations that extract valuable information: percentages, probabilities, patterns, measures of profit and loss, accuracy, and more. This is, directly or indirectly, the foundation of data science. To become a good data scientist, one must know the core statistics topics that apply to data. This article explores the following concepts:

- Central Tendency
- Probability
- Regression
- Variance
- Standard Deviation
- Correlation
- Dimension Reduction
## Central Tendency

The first and foremost concept is central tendency, which we have known since childhood, but revising it does no harm. There are three measures of central tendency: mean, median, and mode.

## Mean

The mean is the average of the data, i.e., the sum of the data values divided by the total number of values present in the dataset.
The formula for calculating the mean:

Mean = (sum of the values in the dataset) / (number of values in the dataset)

## Median

The median is the middle value of the data when the values are arranged in order.
For a dataset with an even number of values (n):

Median = ((n/2)th value + (n/2 + 1)th value) / 2

For a dataset with an odd number of values, the median is the ((n + 1)/2)th value.

## Mode

The mode is the most commonly occurring value in the data.
For grouped data:

Mode = L + ((f1 - f0) / (2f1 - f0 - f2)) × h

where L is the lower boundary of the modal class, f1 is the frequency of the modal class, f0 and f2 are the frequencies of the classes before and after it, and h is the class width.

For ungrouped data, we arrange the data in some order, such as ascending or descending, and the most frequently occurring value is simply the mode.

## Probability

The next concept is probability, which is useful in everyday activities such as predicting stock prices or estimating the chance of hitting a six based on previous matches. This important concept has been familiar since our school days, as it is part of the syllabus. Probability is defined as the chance of an event occurring, and this likelihood is expressed as a value ranging from 0 to 1. Probability is widely used in machine learning, deep learning, and neural networks, for example in predicting genetic and non-genetic hereditary diseases such as cancer, heart disease, and TB.
The general formula for finding probability is:

P = (count of favourable outcomes) / (total number of outcomes)

Probability is commonly divided into three types:
- Theoretical Probability
- Experimental Probability
- Axiomatic Probability
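The general probability formula above can be sketched in a few lines of Python; the die-roll example and variable names here are made up for illustration:

```python
from fractions import Fraction

# Probability of rolling a 4 with a fair six-sided die (toy example).
outcomes = [1, 2, 3, 4, 5, 6]
favourable = [o for o in outcomes if o == 4]

# P = count of favourable outcomes / total number of outcomes
p_four = Fraction(len(favourable), len(outcomes))
print(p_four)  # 1/6
```

Using `Fraction` keeps the result exact (1/6) instead of a rounded decimal.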
## Theoretical Probability

Theoretical probability means the probability of an event is reasoned from the possible chances of it happening, rather than from experiments. For instance, if a die is rolled, the probability of getting a 2 is 1/6, because there is one favourable outcome out of six equally likely outcomes.

## Experimental Probability

Experimental probability is based on observations from an experiment. It is calculated by dividing the number of times an event occurs by the total number of trials. For instance, if you toss a coin 20 times and get heads 7 times, the experimental probability of getting heads is 7/20.

## Axiomatic Probability

Axiomatic probability defines probability through a set of axioms (rules) that every probability assignment must satisfy, such as Kolmogorov's axioms. A related notion is conditional probability: the likelihood of an event or outcome given that another event or outcome has already occurred.

## Regression

Regression is a popular and trending concept in data science. It is used to find relationships between dependent and independent variables, and those relationships are then used to predict future values. Some examples where we use regression are predicting stocks, finance, and prices, and deciding where to invest. Commonly used types of regression are:

- Linear Regression.
- Logistic Regression.
- Polynomial Regression.
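As a preview of how the simplest of these works in practice, here is a minimal least-squares sketch on made-up data (the data points and variable names are purely illustrative):

```python
# Hypothetical toy data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope: m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)

# Intercept: c = ȳ - m·x̄
c = mean_y - m * mean_x
```

On this toy data the fitted slope comes out very close to 2, matching how the values were generated.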
## Linear Regression

Linear regression is the simplest form of regression, in which the predicted (dependent) variable is linearly related to the independent variable.
y = mx + c + e

where m is the slope of the line, c is the intercept, and e represents the error in the model.

## Logistic Regression

Logistic regression is a regression analysis technique used when the dependent variable is discrete, for instance, when the target variable takes only two values, such as 0 or 1, or true or false. In such cases, the relationship between the target variable and the independent variables is represented by a sigmoid curve.
f(x) = 1 / (1 + e^(-x))

## Polynomial Regression

Polynomial regression is a type of linear regression that models the relationship between the independent variable x and the dependent variable y as an nth-degree polynomial.

## Standard Deviation

Let us understand standard deviation in a single sentence: it measures how spread out a group of data values is around the mean, i.e., how far the data points deviate from the average value (mean). In data science, standard deviation is very helpful for analyzing how values are spread across a dataset. A low standard deviation indicates that the values are close to the mean, while a high standard deviation indicates that they are spread far from it. Depending on the scenario, it can help identify and assess risk, detect anomalies, understand trends, evaluate performance, and measure the accuracy of predictions.
The formula for calculating the (population) standard deviation:

σ = sqrt( Σ (xi - x̄)² / N )

where xi represents each data value, x̄ is the average (mean) of the dataset, and N is the number of values.

## Variance

Variance measures how much the data points differ from the mean. To find the variance, take the difference between each data point and the mean, then square those differences and average them.
Why is variance important? In machine learning, variance helps in judging how a model behaves on training and testing data, for example whether it is overfitting or underfitting, and it generally helps in assessing the quality of the data.
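Variance and standard deviation (along with the central-tendency measures from earlier) can be computed with Python's built-in statistics module; a minimal sketch on a made-up dataset:

```python
import statistics

# Made-up dataset for illustration.
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

mean = statistics.mean(data)           # average of the values
median = statistics.median(data)       # middle value of the sorted data
variance = statistics.pvariance(data)  # population variance (divides by N)
stdev = statistics.pstdev(data)        # population standard deviation
```

Note that `pvariance`/`pstdev` divide by N (population formulas, matching this article), while `variance`/`stdev` divide by N - 1 (sample formulas).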
The formula for calculating the (population) variance:

σ² = Σ (xi - x̄)² / N

where:
xi represents each data value, x̄ is the mean of the dataset, and N is the number of values.

## Sampling

Sampling is one of the important concepts in data science, since data science is all about data. We often need to choose specific data elements from a dataset, and for that we require sampling methods. Sampling is defined as selecting a subset of elements from a large dataset and performing operations on that subset, instead of on the whole dataset, to explore trends, make predictions, or identify patterns in the overall dataset. Common sampling methods include:
- Random Sampling
- Cluster Sampling
- Convenience Sampling
- Stratified Sampling
- Systematic Sampling
- Quota Sampling
## Random Sampling

Random sampling chooses data elements from the dataset at random; every element has an equal chance of being selected.
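A quick sketch of random sampling with Python's random module (the dataset and sample size here are made up):

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# Made-up dataset of 100 records.
dataset = list(range(100))

# Draw 10 elements without replacement; each element is equally likely.
sample = random.sample(dataset, 10)
```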
## Cluster Sampling

This sampling method is useful when there is no time to select individual elements. The dataset is divided into clusters, and one or more clusters are chosen at random.
## Convenience Sampling

This method simply chooses whatever data is readily available or accessible.
## Stratified Sampling

First, the dataset is divided into subgroups (strata) based on the required factors; then random data samples are chosen from each subgroup.
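A minimal sketch of stratified sampling, assuming hypothetical records tagged with a group label (the labels "A"/"B", group sizes, and 10% fraction are all made up):

```python
import random
from collections import defaultdict

random.seed(0)  # reproducibility

# Hypothetical records: (stratum label, value).
records = [("A", i) for i in range(50)] + [("B", i) for i in range(20)]

# Step 1: divide the dataset into subgroups (strata) by the label.
strata = defaultdict(list)
for label, value in records:
    strata[label].append((label, value))

# Step 2: randomly draw the same fraction (here 10%) from each subgroup.
sample = []
for label, members in strata.items():
    k = max(1, len(members) // 10)
    sample.extend(random.sample(members, k))
```

Because each stratum contributes in proportion to its size, the sample preserves the A/B composition of the original data.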
## Systematic Sampling

This method chooses data samples at fixed intervals, for example, taking every 10th element from the starting point.
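Systematic sampling is easy to express with Python slicing; the interval of 10 matches the example above, and the data values are made up:

```python
# Made-up dataset: values 1..100.
dataset = list(range(1, 101))

# Take every 10th element, starting from the first position.
step = 10
sample = dataset[::step]

print(sample)  # [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
```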
## Quota Sampling

Quota sampling is a technique used to guarantee a representative sample by choosing a predefined number of individuals from various groups or subgroups.

## Correlation

Correlation, or dependency, is a statistical concept that measures the relationship between two variables: how strongly they are related and how much linear dependency exists between them. A commonly used measure is the Pearson correlation coefficient (r), which ranges from -1 to 1:

- r = 1 indicates a perfect positive correlation: if one variable increases, the other also increases.
- r = -1 indicates a perfect negative correlation: if one variable increases, the other decreases.
- r = 0 indicates no linear relationship between the variables.
The formula for finding the correlation coefficient:

r = Σ (Xi - X̄)(Yi - Ȳ) / sqrt( Σ (Xi - X̄)² × Σ (Yi - Ȳ)² )

where Xi and Yi are individual data points for variables X and Y, and X̄ and Ȳ are the means of variables X and Y, respectively.
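The Pearson formula can be computed directly in Python; the paired values below are made up and chosen to be perfectly linear, so r should come out as exactly 1:

```python
import math

# Made-up paired observations; ys is exactly 2 * xs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# r = Σ(Xi - X̄)(Yi - Ȳ) / sqrt(Σ(Xi - X̄)² · Σ(Yi - Ȳ)²)
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                * sum((y - mean_y) ** 2 for y in ys))
r = num / den

print(r)  # 1.0
```

Flipping the sign of every y value would instead give r = -1, a perfect negative correlation.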
## Dimension Reduction

Dimension reduction is the process of reducing the number of features in the data, shrinking its size while retaining the important information needed for training on the dataset. It is widely used in machine learning and deep learning, since processing a large number of features lowers performance: it requires more time and space and makes problem-solving more challenging.

## Conclusion

In conclusion, to start your journey in data science, begin with the fundamental statistical concepts required to perform operations on data. I hope this article helps you learn all of these fundamental topics.