# Top 500 Data Science Questions and Answers

## What is a confusion matrix?

A confusion matrix is a table that summarizes a classification model's predicted values against the actual values.

It has four parts: true positives, true negatives, false positives, and false negatives.

It can be used to calculate metrics such as accuracy, precision, and recall.
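The four parts and the derived metrics can be sketched in a few lines of Python, using hypothetical binary labels (1 = positive, 0 = negative):

```python
# Hypothetical actual and predicted labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# The four cells of the confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Metrics derived from the matrix.
accuracy = (tp + tn) / len(y_true)   # 0.75
precision = tp / (tp + fp)           # 0.75
recall = tp / (tp + fn)              # 0.75
```

In practice a library such as scikit-learn's `confusion_matrix` computes the same table directly.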

## What is hypothesis testing?

Hypothesis testing is used when performing an experiment to analyze the factors that are assumed to influence its outcome. A hypothesis is an assumption, and hypothesis testing is used to determine whether the stated hypothesis is true or not.

The initial assumption is called the null hypothesis, and its opposite is called the alternative hypothesis.

## What is a p-value in statistics?

In hypothesis testing, the p-value helps to arrive at a conclusion. When the p-value is smaller than the chosen significance level, the null hypothesis is rejected and the alternative is accepted. When the p-value is large, we fail to reject the null hypothesis.
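A minimal sketch of estimating a p-value by simulation (a hypothetical example, not the only way to compute one): the null hypothesis is that a coin is fair, and we observed 16 heads in 20 flips.

```python
import random

random.seed(0)
observed_heads = 16
n_flips = 20
n_sims = 100_000

# Two-sided p-value: the probability that a fair coin produces a
# result at least as extreme as the one we observed.
extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - n_flips / 2) >= abs(observed_heads - n_flips / 2):
        extreme += 1

p_value = extreme / n_sims   # about 0.012 for this data
```

Since this p-value falls below a typical significance level of 0.05, we would reject the null hypothesis that the coin is fair.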

## What is difference between Type-I error and Type-II error in hypothesis testing?

A Type-I error occurs when we reject a null hypothesis that is actually true. It represents a false positive.

A Type-II error occurs when we accept a null hypothesis that should have been rejected. It represents a false negative.

## What are the different types of missing value treatment?

• Deletion of values
• Guess the value
• Average Substitution
• Regression based substitution
• Multiple Imputation
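Two of the simplest treatments above, deletion and average substitution, can be sketched with pandas on a hypothetical column containing missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
df = pd.DataFrame({"age": [25, 30, np.nan, 40, np.nan, 35]})

# Deletion: drop any row that contains a missing value.
dropped = df.dropna()

# Average substitution: replace missing values with the column mean.
imputed = df.fillna(df["age"].mean())   # mean of the observed values is 32.5
```

Regression-based substitution and multiple imputation follow the same idea but predict each missing value from the other columns instead of using a single constant.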

## What is gradient descent?

When building a statistical model, the objective is to reduce the value of the cost function associated with the model. Gradient descent is an iterative optimization technique used to find the minimum of the cost function.
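A minimal gradient descent sketch, using the hypothetical cost function f(x) = (x - 3)^2 whose minimum is at x = 3:

```python
def gradient(x):
    # Derivative of the cost f(x) = (x - 3)^2.
    return 2 * (x - 3)

x = 0.0                # initial guess
learning_rate = 0.1
for _ in range(100):
    # Step in the direction opposite the gradient to reduce the cost.
    x -= learning_rate * gradient(x)
```

After 100 iterations `x` has converged very close to the true minimum at 3. The learning rate controls the step size: too large and the iterates diverge, too small and convergence is slow.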

## What is difference between supervised and unsupervised learning algorithms?

Supervised learning is the class of algorithms in which the model is trained on explicitly labelled outcomes. Ex. Regression, Classification

In unsupervised learning, no output labels are given and the algorithm is made to learn the outcomes implicitly. Ex. Association, Clustering

## What is the need for regularization in model building?

Regularization is used to penalize model complexity during training. It predominantly helps in solving the overfitting problem.
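A sketch of how an L2 (ridge) penalty changes the cost function, using a hypothetical linear model and data. The `alpha` parameter controls the penalty strength:

```python
def cost(weights, xs, ys, alpha):
    # Squared error of the linear model y = w0 + w1 * x ...
    predictions = [weights[0] + weights[1] * x for x in xs]
    squared_error = sum((p - y) ** 2 for p, y in zip(predictions, ys))
    # ... plus a penalty on the squared weight magnitudes, which
    # discourages large weights and hence overfitting (bias not penalized).
    penalty = alpha * sum(w ** 2 for w in weights[1:])
    return squared_error + penalty

xs, ys = [0, 1, 2], [1, 3, 5]                          # fits y = 2x + 1 exactly
unregularized = cost([1.0, 2.0], xs, ys, alpha=0.0)    # error 0, no penalty
regularized = cost([1.0, 2.0], xs, ys, alpha=1.0)      # same fit now pays 2^2 = 4
```

Minimizing the penalized cost trades a little training-set fit for smaller weights, which typically generalizes better.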

## Difference between bias and variance tradeoff?

High bias is an underlying error caused by wrong assumptions that makes the model underfit. High variance means the model has taken the noise in the data too seriously, which results in overfitting.

Typically we would like to have a model with low bias and low variance.

## How to solve overfitting?

• Introduce Regularization
• Perform Cross Validation
• Reduce the number of features
• Increase the number of training examples
• Ensembling
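Of the remedies above, cross-validation can be sketched as a k-fold index split (pure Python, hypothetical sample count): each fold serves once as the validation set, so every sample is validated exactly once.

```python
def k_fold_indices(n_samples, k):
    # Split sample indices into k (train, validation) pairs.
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        folds.append((train, val))
    return folds

folds = k_fold_indices(n_samples=6, k=3)
# folds[0] is ([2, 3, 4, 5], [0, 1]): samples 0 and 1 are held out first.
```

Averaging the model's score over all k validation folds gives a more honest estimate of generalization than a single train/test split, which is how cross-validation exposes overfitting.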