## Top 25 Data Science Interview Questions

A list of frequently asked data science interview questions.

## 1) What do you understand by the term Data Science?

- Data science is a multidisciplinary field that combines **statistics, data analysis, machine learning, mathematics, computer science**, and related methods to understand data and solve complex problems.
- Data science is the in-depth study of massive amounts of data to find useful information in raw, structured, and unstructured data.
- Data science is similar to data mining and big data techniques, which also deal with huge amounts of data and extract insights from it.
- It uses various tools, programming, scientific methods, and algorithms to solve data-related problems.
## 2) What are the differences between Data Science, Machine Learning, and Artificial intelligence?

Data science, machine learning, and artificial intelligence are three related and frequently confused concepts in computer science.

[Figure: relation between AI, ML, and Data Science]

Following are some main points to differentiate between these three terms:
## 3) Discuss Linear Regression?

- Linear regression is a popular supervised machine learning algorithm used to understand the relationship between numerical input and output variables.
- It applies **regression analysis**, a predictive modeling technique that finds a relationship between the dependent and independent variables.
- It models a linear relationship between the **independent** and **dependent variables**, hence the name linear regression.
- Linear regression is used to predict continuous numerical variables such as sales per day, temperature, etc.
- It can be divided into two categories:
  - **Simple Linear Regression**
  - **Multiple Linear Regression**
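As a rough illustration of simple linear regression, the sketch below fits a straight line by ordinary least squares in plain Python. The function name `fit_simple_linear_regression` and the advertising and sales numbers are hypothetical, chosen only so that the fit is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising spend (independent) vs. sales per day (dependent).
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

intercept, slope = fit_simple_linear_regression(ad_spend, sales)
predicted_sales = intercept + slope * 6  # prediction for an unseen input
```

Multiple linear regression follows the same idea with more than one independent variable.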
Simple linear regression models a linear relationship between one independent variable $x$ and the dependent variable $y$, which can be written as

$$y = a_0 + a_1 x + \varepsilon$$

where $a_0$ is the intercept, $a_1$ is the slope, and $\varepsilon$ is the error term.

## 4) Differentiate between Supervised and Unsupervised Learning?

Supervised and unsupervised learning are types of machine learning.
Supervised learning is based on the concept of supervision. We train the machine learning model on labeled sample data, and the model predicts the output on the basis of that training data.

Unsupervised learning involves no supervision. We provide data that is not labeled, classified, or categorized, and the machine learns patterns from it on its own. Below are some main differences between supervised and unsupervised learning:
## 5) What do you understand by bias, variance trade-off?

When we work with a supervised machine learning algorithm, the model learns from the training data and tries to best estimate the mapping function between the output variable (Y) and the input variable (X). The estimate of the target function can produce a prediction error, which can be divided mainly into bias error and variance error.

**Bias Error:** Bias is a prediction error introduced by oversimplifying the machine learning algorithm. It is the difference between the predicted output and the actual output. There are two types of bias:

- **High Bias:** If the predicted values differ greatly from the actual values, the model has high bias. Due to high bias, an algorithm may miss the relevant relationships between the input features and the target output, which is called **underfitting**.
- **Low Bias:** If the predicted values differ only slightly from the actual values, the model has **low bias**.

**Variance Error:** If the machine learning model performs well on the training dataset but poorly on the test dataset, the model has high variance. Variance can also be defined as **an error caused by the model's sensitivity to small fluctuations in the training dataset**. High variance causes **overfitting**, which means the algorithm models the noise in the data along with the underlying pattern.

In a machine learning model, we always try to have low bias and low variance, but the two pull against each other:

- If we try to increase the bias, the variance decreases.
- If we try to increase the variance, the bias decreases.

Hence, trying to find an optimal balance of bias and variance is called the bias-variance trade-off. The four combinations behave as follows:

- If there is low bias and low variance, the predicted output is mostly close to the desired output.
- If there is low bias and high variance, the model is not consistent.
- If there is high bias and low variance, the model is consistent, but the predicted results are far away from the actual output.
- If there is high bias and high variance, the model is inconsistent and its predictions are far from the actual values. It is the worst case of bias and variance.
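To make underfitting and overfitting concrete, here is a minimal sketch that compares two deliberately extreme models on synthetic data. Everything in it is an assumption made for illustration: the true function `sin(2*pi*x)`, the noise level, and the two models (a constant predictor as the high-bias case, a 1-nearest-neighbour predictor as the high-variance case).

```python
import math
import random


def make_data(n, rng):
    """Noisy samples of the (assumed) true function sin(2*pi*x) on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """High-bias model: always predicts the training mean (underfits)."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_nearest_neighbour(xs, ys):
    """High-variance model: copies the label of the closest training point (overfits)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(0)
train_x, train_y = make_data(30, rng)
test_x, test_y = make_data(200, rng)

for name, fit in [("constant", fit_constant), ("1-nearest-neighbour", fit_nearest_neighbour)]:
    model = fit(train_x, train_y)
    print(f"{name:20s} train MSE={mse(model, train_x, train_y):.3f} "
          f"test MSE={mse(model, test_x, test_y):.3f}")
```

The nearest-neighbour model memorizes the training set, so its training error is zero while its test error is not. The constant model has a similar, non-zero error on both sets because it is too simple to follow the curve.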
## 6) Define Naive Bayes?

Naive Bayes is a popular classification algorithm used for predictive modeling. It is a supervised machine learning algorithm based on Bayes' theorem. A Naive Bayes model is easy to build, even when working with a large dataset. The name is composed of two words, Naive and Bayes, where Naive means that the features are assumed to be unrelated to each other. In simple words, the Naive Bayes classifier assumes that each feature is independent of the others.

## 7) What is the SVM algorithm?

SVM stands for Support Vector Machine. It works with labeled data, as it is a supervised learning algorithm. The goal of the support vector machine algorithm is to construct a hyperplane in an N-dimensional space that separates the classes. If there are only two distinct classes, it is called a binary SVM classifier. The data points of a class that are nearest to the other class are called support vectors. There are two types of SVM classifier:

- **Linear SVM classifier:** A classifier that separates a set of objects into their respective groups by drawing a single line, i.e., a hyperplane, is called a linear SVM classifier.
- **Non-Linear SVM classifier:** A non-linear SVM classifier applies to objects that cannot be separated into two groups by a single line.

On the basis of the error function, we can divide SVM models into four categories:

- **Classification SVM Type 1**
- **Classification SVM Type 2**
- **Regression SVM Type 1**
- **Regression SVM Type 2**
## 8) What do you understand by Normal distribution?

- If the given data is distributed around a central value in a bell-shaped curve without any left or right bias, it is called a **Normal distribution**. It is also called a **Bell Curve** because of its bell-shaped curve.
- The normal distribution is symmetric about its mean: half of the data lies to the left of the mean, and half lies to the right.
- In probability theory, the normal distribution is also called a **Gaussian distribution**.
- It is a probability distribution function used to see the distribution of data over a given range.
- The normal distribution has two important parameters: **mean (µ)** and **standard deviation (σ)**.
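The role of the two parameters can be sketched with `statistics.NormalDist` from the Python standard library. The mean of 100 and standard deviation of 15 are hypothetical values picked for illustration.

```python
from statistics import NormalDist

# Hypothetical example: values with mean 100 and standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

share_below_mean = dist.cdf(100)                 # half of the data lies left of the mean
share_within_1sd = dist.cdf(115) - dist.cdf(85)  # share within one standard deviation
share_within_2sd = dist.cdf(130) - dist.cdf(70)  # share within two standard deviations

print(f"below mean:  {share_below_mean:.3f}")
print(f"within 1 sd: {share_within_1sd:.3f}")
print(f"within 2 sd: {share_within_2sd:.3f}")
```

For any normal distribution, roughly 68% of the values fall within one standard deviation of the mean and roughly 95% within two.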
## 9) Explain Reinforcement learning.

- Reinforcement learning is a type of machine learning where an agent interacts with an environment and learns from its actions and their outcomes. For each good action the agent gets a positive reward, and for each bad action it gets a negative reward.
- The goal of an agent in reinforcement learning is to maximize the total positive reward.
- In reinforcement learning, algorithms are not explicitly programmed for a task but learn from experience without human intervention.
- Reinforcement learning algorithms differ from supervised learning algorithms in that no training dataset is provided to the algorithm. Hence the algorithm learns automatically from experience.
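The reward-driven loop described above can be sketched with tabular Q-learning, one common reinforcement learning algorithm that the text does not name and that is chosen here only for illustration. The five-state corridor environment, the reward values, and the constants `ALPHA` and `GAMMA` are all assumptions.

```python
import random

N_STATES = 5                   # states 0..4 in a corridor
PIT, GOAL = 0, N_STATES - 1    # both ends are terminal
ACTIONS = (-1, 1)              # move left or move right
ALPHA, GAMMA = 0.5, 0.9        # learning rate and discount factor (assumed values)


def step(state, action):
    """Environment: return (next_state, reward, done)."""
    next_state = state + action
    if next_state == GOAL:
        return next_state, 1.0, True    # good outcome: positive reward
    if next_state == PIT:
        return next_state, -1.0, True   # bad outcome: negative reward
    return next_state, 0.0, False


def train(episodes=2000, seed=0):
    """Learn action values from experience only; no training dataset is given."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 2, False
        while not done:
            action = rng.choice(ACTIONS)  # explore by acting randomly
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q):
    """Best known action for every non-terminal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(1, GOAL)}


q_values = train()
policy = greedy_policy(q_values)
```

After training, the greedy policy moves right (toward the goal) from every non-terminal state, even though the agent was never told which action is correct.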
## 10) What do you mean by p-value?

- The p-value is a probability used to determine statistical significance in a hypothesis test: the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true.
- Hypothesis tests are used to check the validity of the null hypothesis (claim).
- P-values can be calculated using p-value tables or statistical software.
- The p-value lies between 0 and 1. With the commonly used significance level of 0.05, there are two main cases:
  - (p-value < 0.05): A small p-value indicates strong evidence against the null hypothesis, so we reject the null hypothesis.
  - (p-value > 0.05): A large p-value indicates weak evidence against the null hypothesis, so we fail to reject the null hypothesis. This does not prove that the null hypothesis is true.
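As a small worked example, the sketch below computes an exact one-sided p-value for a coin-flip experiment. The scenario (60 heads in 100 flips, null hypothesis "the coin is fair") and the function name `binomial_p_value` are hypothetical.

```python
from math import comb


def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: P(X >= heads) if the null hypothesis (P(heads) = p_null) is true."""
    return sum(
        comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


ALPHA = 0.05  # significance threshold used in the text

p_value = binomial_p_value(heads=60, flips=100)
reject_null = p_value < ALPHA
print(f"p-value = {p_value:.4f}, reject null hypothesis: {reject_null}")
```

Here the p-value is about 0.028, which is below 0.05, so this one-sided test rejects the hypothesis that the coin is fair.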
## 11) Differentiate between Regression and Classification algorithms?

Classification and regression are both supervised learning algorithms in machine learning, and both use training datasets to make predictions. The main difference between the two is that the output variable in regression algorithms is numerical or continuous, whereas in classification algorithms it is categorical or discrete.

Regression algorithms are used to predict continuous quantities, while classification algorithms are used to assign inputs to categories.
## 12) Which is the best suitable language among Python and R for text analytics?

Both R and Python are suitable languages for text analytics, but Python is usually preferred, because:

- Python has the Pandas library, which provides easy-to-use data structures and data analysis tools.
- Python generally executes quickly for most types of text analytics.
## 13) What do you understand by L1 and L2 regularization methods?

Regularization is a technique to reduce the complexity of a model. It helps to solve the overfitting problem when a dataset has a large number of features. Regularization controls model complexity by adding a penalty term to the objective function. There are two main regularization methods:

**L1 Regularization:**

- The L1 regularization method is also known as Lasso regularization. It adds a penalty term to the error function, where the penalty term is the sum of the absolute values of the weights.
- It performs feature selection by assigning zero weight to unimportant features and non-zero weight to important features.
- It is given below:

$$\text{Cost} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |w_j|$$

- Here $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the sum of the squared differences between the actual values $y_i$ and the predicted values $\hat{y}_i$.
- $\lambda \sum_{j=1}^{p} |w_j|$ is the **regularization term**, and λ is the penalty parameter, which determines how much to penalize the weights $w_j$.

**L2 Regularization:**

- The L2 regularization method is also known as Ridge regularization. It works like L1 regularization, except that the penalty term is the sum of the squared values of the weights.
- It performs well if all the input features affect the output and all weights are of approximately equal size.
- It is given as:

$$\text{Cost} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} w_j^2$$

- Here $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the sum of the squared differences between the actual values and the predicted values.
- $\lambda \sum_{j=1}^{p} w_j^2$ is the regularization term, and λ is the penalty parameter, which determines how much to penalize the weights.
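The two penalty terms described above (sum of absolute weights for L1, sum of squared weights for L2) can be sketched in a few lines of Python. The function names and all numeric values are hypothetical and serve only to show how the penalty is added to the squared error.

```python
def sum_squared_error(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


def l1_penalty(weights, lam):
    """Lasso penalty: lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty: lambda times the sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)


# Hypothetical values for illustration.
actual = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.5]
weights = [0.5, -2.0, 0.0]
lam = 0.1  # penalty parameter lambda

lasso_cost = sum_squared_error(actual, predicted) + l1_penalty(weights, lam)
ridge_cost = sum_squared_error(actual, predicted) + l2_penalty(weights, lam)
```

Because the L2 penalty squares each weight, the large weight of -2.0 contributes far more to the ridge cost than to the lasso cost.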
## 14) What is the 80/20 rule? Explain its importance in model validation?

In machine learning, we usually split the dataset into two parts:

- **Training set:** Part of the dataset used to train the model.
- **Test set:** Part of the dataset used to test the performance of the model.
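A minimal sketch of such a split with an 80/20 ratio, using only the standard library. The helper name `split_train_test` is hypothetical, and the integers stand in for real data records.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle so the split is not order-dependent
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 data records
train_set, test_set = split_train_test(rows)
print(len(train_set), len(test_set))
```

Every record ends up in exactly one of the two sets, so the model is never tested on data it was trained on.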
A commonly used ratio for splitting the dataset is 80-20%: 80% is assigned to the training dataset and 20% to the test dataset. Other ratios such as 90-10%, 70-30%, or 60-40% are possible, but are generally less preferred. Importance of the 80/20 rule in model validation:
The process of evaluating a trained model on the test dataset is called model validation.

## 15) What do you understand by confusion matrix?

- The confusion matrix is a concept specific to statistical classification problems.
- It is a table used to describe or measure the performance of a binary classification model in machine learning.
- The confusion matrix itself is easy to understand, but the terminology used in it can be confusing. It is also known as an **Error matrix**.
- It is used in statistics, data mining, machine learning, and various artificial intelligence applications.
- It is a table with two dimensions, "actual" and "predicted", with an identical set of classes in both dimensions.
- The confusion matrix has the following four cases:
  - **True Positive (TP):** The prediction is positive and the actual class is positive.
  - **False Positive (FP):** The prediction is positive but the actual class is negative.
  - **True Negative (TN):** The prediction is negative and the actual class is negative.
  - **False Negative (FN):** The prediction is negative but the actual class is positive.
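The four cases can be counted directly from paired lists of actual and predicted labels. The sketch below assumes binary labels with 1 as the positive class; the function name `confusion_counts` and the label lists are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
# Accuracy: share of all predictions that were correct.
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
```

For these eight predictions the counts are TP = 3, FP = 1, TN = 3, and FN = 1, giving an accuracy of 0.75.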
The classification accuracy can be obtained by the formula below:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

## 16) What is the ROC curve?

ROC curve stands for Receiver Operating Characteristic curve.

## 17) Explain the Decision Tree algorithm, and how is it different from the random forest algorithm?

- The decision tree algorithm belongs to supervised learning and solves both classification and regression problems in machine learning.
- A decision tree solves problems using a tree structure that has leaves, decision nodes, and links between nodes. Each node represents an attribute or feature, each branch represents a decision, and each leaf represents an outcome.
- A decision tree often mimics human thinking, so it is easier to understand than other classification algorithms.
## 18) Explain the term "Data warehouse".

A data warehouse is a system used for analysis and reporting of data collected from operational systems and other data sources. It plays an important role in Business Intelligence. Data is extracted from various sources, transformed (cleaned and integrated) according to the needs of the decision support system, and stored in the data warehouse. Once analyzed, the data in the data warehouse does not change, and it is used directly by end users or for data visualization.

- A data warehouse makes data more readable, so strategic questions can be answered easily using graphs, trends, plots, etc.
- A data warehouse makes data analysis and operations faster and more accurate.
## 19) What do you understand by clustering?

Clustering is a way of dividing data points into groups such that the data points within a group are more similar to each other than to the data points in other groups. These groups are called clusters, so the similarity within a cluster is high and the similarity between clusters is low. Clustering techniques are used in various fields. Clustering is a type of unsupervised learning problem in machine learning. It can be divided into two types:

- **Hard Clustering**
- **Soft Clustering**
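In hard clustering every data point belongs to exactly one cluster, whereas soft clustering assigns each point a degree of membership in every cluster. A minimal sketch of hard clustering follows, using k-means on one-dimensional data. The function name `k_means_1d`, the points, and the starting centroids are hypothetical choices for illustration.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Hard-cluster 1-D points; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (exactly one cluster).
        labels = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # centroids stopped moving, so the assignments are stable
        centroids = new_centroids
    return labels, centroids


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels, centroids = k_means_1d(points, centroids=[0.0, 6.0])
```

The first three points end up in cluster 0 and the last three in cluster 1, with centroids close to 1.0 and 5.0.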
## 20) How to determine the number of clusters in k-means clustering algorithm?

In the k-means clustering algorithm, the number of clusters is determined by the value of k.

## 21) Differentiate between K-means clustering and hierarchical clustering?

K-means clustering and hierarchical clustering are both machine learning algorithms. Below are some main differences between the two:
## 22) What do you understand by Ensemble Learning?

In machine learning, ensemble learning is the process of combining several diverse base models to produce one better predictive model. By combining all the predictions, ensemble learning improves the stability of the model. The idea is that various weak learners come together to make a strong learner. Ensemble methods help reduce the variance and bias errors that cause the difference between actual and predicted values. Ensemble learning can also be used for selecting optimal features, data fusion, error correction, incremental learning, etc. Below are two popular ensemble learning techniques:

**Bagging:** Bootstrap aggregation, called bagging, is a powerful ensemble method. Bagging applies the bootstrap technique to a high-variance machine learning algorithm, such as decision trees. It draws several sampled datasets from the original dataset, trains a model on each of them, and combines their predictions to reduce the model variance.

**Boosting:** Boosting is a sequential ensemble method. It exploits the dependencies between the models, and mainly reduces the bias and variance in machine learning algorithms. It is an iterative technique that adjusts the weights of the instances in a dataset based on the previous classification: if an instance was classified incorrectly, its weight is increased. In short, it converts weak learners into a strong learner. Boosting sometimes shows better accuracy than bagging, but it may also overfit the training data. A common type of boosting is **AdaBoost**.
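The re-weighting step of boosting can be sketched with one round of the standard AdaBoost update. The text does not spell this formula out, so treat it as an illustration; the function name `adaboost_reweight` and the five-instance example are hypothetical.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One AdaBoost round: return (learner_weight, new_instance_weights)."""
    # Weighted error of the current weak learner.
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # Influence of this learner in the final vote (assumes 0 < error < 1).
    alpha = 0.5 * math.log((1 - error) / error)
    # Raise the weights of misclassified instances, lower the rest, then normalise.
    raw = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(raw)
    return alpha, [w / total for w in raw]


weights = [0.2] * 5                                  # start with equal weights
misclassified = [False, False, False, False, True]   # hypothetical weak-learner result
alpha, new_weights = adaboost_reweight(weights, misclassified)
```

After this round the single misclassified instance carries a weight of 0.5, while each correctly classified instance drops to 0.125, so the next weak learner concentrates on the hard example.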
## 23) Explain Box Cox transformation?

A Box-Cox transformation is a statistical technique that transforms a non-normal dependent variable into an approximately normal shape. We usually need normally distributed data for various statistical analysis tools, such as control charts.

## 24) What is the aim of A/B testing?

A/B testing is a way of comparing two versions of a webpage to determine which version performs better. It is a form of statistical hypothesis testing that evaluates changes to a webpage in order to improve the outcome of a strategy.

## 25) How is Data Science different from Data Analytics?

Several terms are used alongside, and sometimes interchangeably with, data science. Data analytics is one of them. Data science and data analytics both deal with data, but they differ in how they deal with it. The following differences clear up the confusion between the two:

Data science is a broad term that deals with structured, unstructured, and raw data. It includes everything related to data, such as data analysis, data preparation, data cleansing, etc. Data science is not focused on answering particular queries. Instead, it focuses on exploring a massive amount of data, sometimes in an unstructured way.

Data analytics is the process of analyzing raw data to draw conclusions and meaningful insights from it. To draw these insights, data analytics applies algorithms and mechanical processes. It is mainly concerned with inference, the process of deriving conclusions from observations. Data analytics focuses on answering particular queries and performs better when it is focused.