# Top Data Science Interview Questions

Data Science is in big demand these days, and its professionals are becoming rockstars. Companies can leverage massive amounts of data to improve the way they serve customers.

Data Science, also described as data-driven decision-making, is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge from data in various forms and to make decisions based on that knowledge.

Here is a list of the most popular Data Science interview questions you can expect to face in an interview:

## What is the difference between supervised and unsupervised machine learning?

**Supervised Machine learning**- Supervised machine learning requires labeled training data: each example pairs input features with the known output the model should learn to predict.

**Unsupervised Machine learning**- Unsupervised machine learning doesn’t require labeled data; it looks for structure, such as clusters, in the inputs alone.
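
As a rough illustration of the distinction, the sketch below uses the same one-dimensional points twice: once with labels to learn a decision threshold (supervised), and once without labels to group the points into two clusters (unsupervised). The data, the helper names `fit_threshold` and `two_means`, and the choice of a midpoint rule and a simple 2-means loop are all assumptions made for this example.

```python
from statistics import mean

def fit_threshold(points, labels):
    """Supervised: use the labels to place a threshold between the two class means."""
    mean0 = mean(p for p, y in zip(points, labels) if y == 0)
    mean1 = mean(p for p, y in zip(points, labels) if y == 1)
    return (mean0 + mean1) / 2

def two_means(points, iterations=10):
    """Unsupervised: split the points into two clusters without any labels."""
    c0, c1 = min(points), max(points)  # start the two centers at the extremes
    for _ in range(iterations):
        group0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        group1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = mean(group0), mean(group1)  # move each center to its group's mean
    return sorted(group0), sorted(group1)

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = [0, 0, 0, 1, 1, 1]  # only the supervised learner sees these

threshold = fit_threshold(points, labels)
clusters = two_means(points)
```

The supervised function cannot run without `labels`, while the clustering function never looks at them.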

## How is logistic regression done?

- Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).
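
To make this concrete, here is a minimal sketch of logistic regression with one feature, fitted by plain gradient descent on the log loss. The data set (hours studied versus pass/fail), the learning rate, and the number of epochs are assumptions picked for illustration; a real project would normally use a library implementation.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit weight w and intercept b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

# Hypothetical data: hours studied (feature) and pass/fail (label).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)

def predict(x):
    """Estimated probability of the positive class for feature value x."""
    return sigmoid(w * x + b)
```

Calling `predict(1.0)` and `predict(4.0)` returns estimated probabilities, which can be turned into class labels by comparing them with a cutoff such as 0.5.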

## What is the bias-variance trade-off?

**Bias**- Bias is the error introduced in your model due to oversimplification of a machine learning algorithm. It can lead to underfitting. During training, a high-bias model makes simplifying assumptions so that the target function is easier to learn.
- Low bias machine learning algorithms: Decision Trees, k-NN and SVM
- High bias machine learning algorithms: Linear Regression, Logistic Regression

**Variance**- Variance is the error introduced in your model by an overly complex machine learning algorithm: the model also learns noise from the training data set and performs badly on the test data set. It can lead to high sensitivity and overfitting.
- Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens up to a particular point. As you continue to make your model more complex, you end up overfitting it, and the model starts to suffer from high variance.

**Bias Variance Trade-Off**- The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
- The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.
- The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter, which increases the bias but decreases the variance. Here C is taken as the number of margin violations allowed in the training data; in libraries where C is instead a penalty on violations, such as scikit-learn, the direction is reversed and it is a smaller C that increases the bias.


There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease the bias.
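
The k-nearest neighbors case described above can be sketched with a tiny k-NN regressor. The data points and the helper name `knn_predict` are assumptions for this example; the point is only to show the two extremes of k.

```python
from statistics import mean

def knn_predict(train, query, k):
    """Average the targets of the k training points closest to `query`."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return mean(y for _, y in nearest)

# Hypothetical noisy samples (x, y) of a roughly linear trend.
train = [(1, 1.2), (2, 1.9), (3, 3.4), (4, 3.8), (5, 5.3), (6, 5.9)]

# k = 1: every training point is reproduced exactly, noise included
# (low bias, high variance).
fit_k1 = [knn_predict(train, x, k=1) for x, _ in train]

# k = len(train): every query gets the same global mean, whatever its x
# (high bias, low variance).
fit_kn = [knn_predict(train, x, k=len(train)) for x, _ in train]
```

With k = 1 the model memorizes the training set, and with k equal to the number of training points it ignores the input entirely; useful values of k sit somewhere in between.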

## How do you build a random forest model?

A random forest is built from a number of decision trees. If you split the data into different subsets and build a decision tree on each subset, the random forest brings all those trees together and combines their predictions.

#### Steps to build a random forest model

1. Randomly select ‘k’ features from a total of ‘m’ features, where k << m
2. Among the ‘k’ features, calculate the node D using the best split point
3. Split the node into daughter nodes using the best split
4. Repeat steps two and three until leaf nodes are finalized
5. Build the forest by repeating steps one to four ‘n’ times to create ‘n’ trees
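
The steps above can be sketched in plain Python as follows. This is a simplified illustration, not a production implementation, and it makes several assumptions that the steps do not spell out: splits are scored with Gini impurity, each tree is trained on a bootstrap sample (rows drawn with replacement), the k candidate features are drawn afresh at every node, and the forest predicts by majority vote. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Step 2: among the candidate features, find the split with the lowest impurity."""
    best = None  # (weighted impurity, feature index, threshold)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def build_tree(rows, labels, k, rng):
    """Steps 1 to 4: grow one tree. Leaves are labels, inner nodes are tuples."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    feature_ids = rng.sample(range(len(rows[0])), k)  # step 1: pick k of m features
    split = best_split(rows, labels, feature_ids)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # no usable split: majority leaf
    _, f, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    # Step 3: split into daughter nodes; step 4: recurse until leaves are reached.
    return (f, threshold,
            build_tree([r for r, _ in left], [y for _, y in left], k, rng),
            build_tree([r for r, _ in right], [y for _, y in right], k, rng))

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree

def build_forest(rows, labels, n_trees, k, seed=0):
    """Step 5: repeat tree building n_trees times, each on a bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], k, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Toy data with m = 4 features; small values belong to class 0, large to class 1.
rows = [[1, 2, 1, 2], [2, 1, 2, 1], [1, 1, 2, 2], [2, 2, 1, 1],
        [8, 9, 8, 9], [9, 8, 9, 8], [8, 8, 9, 9], [9, 9, 8, 8]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = build_forest(rows, labels, n_trees=15, k=2)
prediction_low = predict_forest(forest, [1, 1, 1, 1])
prediction_high = predict_forest(forest, [9, 9, 9, 9])
```

With only four features, k = 2 does not really satisfy k << m; the toy data is kept small so the whole flow fits on one screen.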

## What is selection bias?

Selection bias occurs when the sample obtained is not representative of the population intended to be analyzed.
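
A small numeric sketch can show the effect. The population of scores and both sampling rules below are made up for the example; the only point is that a sample drawn from one part of the population gives a misleading estimate.

```python
from statistics import mean

# Hypothetical population: satisfaction scores from 1 to 100, one per customer.
population = list(range(1, 101))

# Representative sample: every tenth customer across the whole range.
representative = population[4::10]

# Biased sample: only customers scoring 60 or more answered the survey.
biased = [score for score in population if score >= 60]

population_mean = mean(population)
representative_mean = mean(representative)
biased_mean = mean(biased)
```

Here the representative sample's mean (50) lands close to the population mean (50.5), while the biased sample's mean (80) does not.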

## For the given points, how will you calculate the Euclidean distance in Python?

The given points are:

    plot1 = [1, 3]
    plot2 = [2, 5]

The Euclidean distance can be calculated as follows:

    from math import sqrt

    euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
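
For points with more than two coordinates, the same formula generalizes to a sum over all coordinate pairs. The sketch below uses a helper name, `euclidean_distance_nd`, chosen for this example, and compares its result with `math.dist`, which is available in the standard library from Python 3.8.

```python
import math

def euclidean_distance_nd(p, q):
    """Square root of the summed squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

plot1 = [1, 3]
plot2 = [2, 5]

distance = euclidean_distance_nd(plot1, plot2)  # sqrt(1 + 4) = sqrt(5)
builtin_distance = math.dist(plot1, plot2)      # standard-library equivalent
```

In an interview, mentioning `math.dist` (or a NumPy equivalent) after writing the formula by hand shows you know both the math and the tooling.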


## What are the functions of the different kernels in SVM?

**There are four types of kernels in SVM:**

- Linear kernel
- Polynomial kernel
- Radial basis kernel
- Sigmoid kernel
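
Each kernel is a function that scores the similarity of two input vectors. The list above only names the kernels, so the sketch below uses their commonly quoted textbook forms; the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions borrowed from common library conventions.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return dot(x, y)

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """K(x, y) = (gamma * x . y + coef0) ** degree"""
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """K(x, y) = tanh(gamma * x . y + coef0)"""
    return math.tanh(gamma * dot(x, y) + coef0)

x, y = [1.0, 2.0], [3.0, 4.0]
linear_value = linear_kernel(x, y)
poly_value = polynomial_kernel(x, y, degree=2)
rbf_value = rbf_kernel(x, y, gamma=0.1)
sigmoid_value = sigmoid_kernel(x, y, gamma=0.1)
```

The linear kernel suits data that is already linearly separable, while the other three let the SVM draw non-linear decision boundaries.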