You’ve found and applied for your dream job.
They called you back with an invitation for an interview.
Hell yeah! - you think. Finally, a chance to work on a data science project you care about.
But—
After a few minutes of excitement, the reality hits hard: Now, you have to absolutely nail your job interview.
How do you prepare to answer a gazillion technical questions?
Don’t worry! It’s not as hard as it seems.
This article will walk you through the list of the most popular data science interview questions and answers to help you ace your interview and land the job of your dreams.
Here’s what we’ll cover:
Ready?
Let's start with a few simple questions exploring the basic concepts of data science.
Deep learning is a subfield of machine learning and artificial intelligence that uses deep neural networks (algorithms roughly inspired by the human brain) to teach computers to imitate how humans think and learn.
Unlike traditional machine learning, deep learning models store millions, or even billions, of parameters, making them harder to interpret but very powerful in understanding data. A deep learning model is like a “hologram” of information, storing interconnected weights that represent fragments of what it learned.
Learning can be supervised, semi-supervised or unsupervised, and the adjective "deep" refers to the use of multiple layers in the network.
The three most used frameworks in cutting-edge research are:
Other less commonly used deep learning frameworks include:
To impress your interviewer, make sure that you stay up to date with emerging deep learning frameworks even if they aren't widely used at the moment.
Artificial Neural Networks (ANN) are computational networks that loosely model the neurons in the human brain. Their goal is to recognize the relationships between various data points in a given dataset and to learn (often autonomously) to perform specific tasks by analyzing relevant examples.
Neural networks are organized into several layers, including an input layer, an output layer, and hidden layers. Within these layers are “weights” that carry information about the training set, such as the features found within images or audio waveforms.
During training, the network uses gradient descent to minimize a loss function (hint: Make sure you’re very comfortable with these definitions!). At each step, backpropagation computes the gradient of the loss with respect to every weight, and the weights are nudged in the direction that reduces the loss.
The simplest neural network model is called the perceptron, and it’s a supervised learning algorithm invented in 1957. It consists only of two layers: an input layer and an output layer. Today’s most advanced transformer networks can contain hundreds of layers and billions of parameters and can be trained on billions of training data items.
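To make the perceptron concrete, here is a minimal sketch in plain Python. The function names, the learning rate, the epoch count, and the logical AND toy dataset are all illustrative assumptions, not something prescribed by the original 1957 design.

```python
def train_perceptron(samples, labels, epochs=10, lr=1.0):
    """Train a perceptron with the classic error-driven update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if activation > 0 else 0
            error = target - prediction  # -1, 0, or +1
            # Move the weights only when the prediction was wrong.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the weighted sum of the inputs is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Toy dataset (assumption): the logical AND of two binary inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

Because AND is linearly separable, the two-layer perceptron is enough here. It would fail on XOR, which is one reason hidden layers exist.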
Data cleaning is a key part of data preparation.
It aims to identify incomplete, incorrect, or irrelevant parts of the training data and then replace or delete them. For example, outliers, mislabeled data, or biased samples can reduce the accuracy of a resulting model.
Data cleaning ensures better quality training data, and therefore, a more accurate model.
To maintain a deployed model, one has to:
Monitor the model proactively
The first step in ensuring your models’ accuracy is to monitor them after deployment. Whenever you change a model, or replace it with a new version in the case of deep learning, check it against a common test set to catch deterioration or catastrophic forgetting.
Often your corpus of training data will grow, as your models need to tackle more varied examples. Ensure that your test sets also grow to represent the increased diversity of your data in production!
Measure model's accuracy
Testing for accuracy doesn’t end at the training phase. The model’s average precision, recall, or accuracy per class may vary over time, either because the model is re-trained or, more often, because your data begins to change as the world progresses. Ensure you have a way of continuously sampling production data, labeling it as ground truth, and testing the latest version of the model against it for the most up-to-date accuracy.
A model predicting the stock market trained in 2019 would do terribly in 2021, for example.
Uptime is everything
If your model is running on multiple GPUs, you’ll have to make sure you develop a strong inference engine to keep it running at an unpredictable scale. If DevOps is not your strong suit, make sure you look for third-party tools to keep models running, or that your employer has talent in place to keep things running.
If you’re working on computer vision, for example, you can use V7’s inference engine to keep models running on up to 100 GPU servers.
The Decision Tree is a supervised machine learning algorithm used for classification and regression problems.
We use a Decision Tree to train a model that predicts the value of a target variable. The algorithm represents the problem as a tree in which each internal node tests an attribute and each leaf node corresponds to a class label.
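The tree representation can be sketched with nested dictionaries. The tree below is hand-built and entirely hypothetical (the attribute names, values, and labels are made up for illustration). A real decision tree would be learned from training data.

```python
# Hypothetical hand-built tree: internal nodes test an attribute,
# leaf nodes hold a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {
                True: {"label": "stay in"},
                False: {"label": "play"},
            },
        },
        "rainy": {"label": "stay in"},
    },
}


def classify(node, example):
    """Walk from the root to a leaf, following the branch for each attribute value."""
    while "label" not in node:
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node["label"]


decision = classify(tree, {"outlook": "sunny", "windy": False})
```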
We can distinguish two types of Decision Tree algorithms (based on the type of our target variable):
A validation set is used for model tuning during training. It’s the set of data that the model “checks against” to see how well it’s doing during the training process. It should look similar to the training set but contain different examples, so that it shows how well the model handles data it was not trained on.
A test set is a further set of data that the fully trained model tests itself against when it completes training. It’s all data the model has never seen before and is used to check for generalization.
Think of it this way: The training set is your studying material, the validation set is the set of question/answer pairs you check against while studying to make sure you’re remembering everything, and the test set is your final exam.
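Here is a minimal sketch of splitting one dataset into the three sets. The 70/15/15 proportions, the fixed seed, and the `split_dataset` name are assumptions chosen for illustration. The right proportions depend on how much data you have.

```python
import random


def split_dataset(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
```

The important property is that the three sets never overlap: the test set stays unseen until the very end.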
In data science, tensors are a type of data structure used in linear algebra to describe a multilinear relationship between sets of algebraic objects within a vector space.
Tensors generalize scalars, vectors, and matrices to higher dimensions.
Start with a scalar (a single point). Add a dimension and you get a vector (a line with direction). Add another dimension and you get a matrix (a grid of values). Stack matrices together and you get a 3D tensor.
A single-dimensional tensor can be represented as a vector, while a two-dimensional tensor is represented as a matrix.
Colour images are technically 3D tensors, containing a grid of R, G, and B values (and sometimes a fourth alpha channel).
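The progression from scalar to 3D tensor can be sketched with nested Python lists. The `shape` helper is a hypothetical utility written for this example, and it assumes non-empty, rectangular nesting. In practice you would use a tensor library instead.

```python
def shape(tensor):
    """Return the size of each dimension of a nested-list tensor."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]  # assumes every level is non-empty and rectangular
    return tuple(dims)


scalar = 7                            # 0 dimensions
vector = [1, 2, 3]                    # 1 dimension
matrix = [[1, 2, 3], [4, 5, 6]]       # 2 dimensions
image = [                             # 3 dimensions: height x width x RGB
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

shapes = [shape(t) for t in (scalar, vector, matrix, image)]
```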
Now, let's have a look at some of the technical questions you might get when interviewing for a data scientist position.
Here’s the comparison between supervised and unsupervised machine learning.
Logistic regression (also called the logit model) is a statistical method used to analyze a dataset in which one or more independent variables determine the outcome. It measures the relationship between the dependent variable and the independent variable(s) using the logistic (sigmoid) function to model probability.
It’s used to model a binary outcome: a variable that can have only two possible values, 0 and 1.
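As a rough sketch, this is how the sigmoid turns a weighted sum of features into a probability. The weights and bias below are made-up numbers, not fitted values. A real model would learn them from data.

```python
import math


def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability that the outcome is 1, given the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical parameters: one feature, weight 1.5, bias -3.0.
on_the_boundary = predict_probability([2.0], [1.5], -3.0)   # z = 0
likely_positive = predict_probability([4.0], [1.5], -3.0)   # z = 3
predicted_class = 1 if likely_positive >= 0.5 else 0
```

A probability of 0.5 is the usual cut-off between the two classes, though the threshold can be tuned.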
Bias is the phenomenon that occurs when the results produced by an algorithm are systematically prejudiced, due to faulty data points or to overly simplistic assumptions in the model. High bias typically leads to underfitting.
We distinguish between low bias and high bias in traditional machine learning algorithms.
Low bias machine learning algorithms include: Decision Trees, k-NN, SVM
High bias machine learning algorithms include: Linear Regression and Logistic Regression
In deep learning, bias can be troublesome as it can be hard to spot. For example, training a stock market predictor mostly on technology stocks will bias the model towards thinking most companies behave like tech companies. To avoid this, it is paramount that you spend more time on your training data than you do on modeling, and create bias-busting test sets to prevent damaging results.
Variance is a type of error in your model occurring due to the model’s sensitivity to the changes in the independent variables (features). The model picks up even the most minor details about the relationship between features and target.
It also learns the noise in the training data set and, as a result, performs poorly on the test data set. It can lead to high sensitivity and overfitting.
To achieve good prediction performance, you need to have low bias and low variance.
In simple words, the bias-variance tradeoff is the balance between the Bias error and the Variance error.
Here’s what it means—
If your model is too simple (it has only a few parameters), it’s characterized by high bias and low variance, leading to underfitting. On the other hand, if your model is too complex, it has high variance and low bias, and you will be dealing with overfitting.
Essentially, the bias-variance tradeoff is about finding the right balance without overfitting or underfitting the data.
Overfitting is a modeling error where the machine learning model learns “too much” from the training data, paying attention to the points of data that are noisy or irrelevant. Overfitting negatively impacts the models’ ability to generalize.
Underfitting is a scenario where a statistical model or the machine learning model cannot accurately capture the relationships between the input and output variables. Underfitting occurs because the model is too simple—informed by not enough training time, too few features, or too much regularization.
Exploding gradients relate to the accumulation of significant error gradients, resulting in very large updates to the neural network model weights during training. This, in turn, leads to an unstable network.
The values of the weights can also become so large as to overflow and result in something called NaN values.
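A toy illustration of the effect: if each layer multiplies the gradient by a factor larger than 1, the gradient grows exponentially with depth. The single shared factor of 1.5 is a simplifying assumption. Real networks multiply by per-layer Jacobians instead.

```python
import math


def backpropagated_gradient(factor, num_layers):
    """Gradient magnitude after flowing back through num_layers identical layers."""
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= factor
    return gradient


small = backpropagated_gradient(1.5, 10)      # still a manageable number
huge = backpropagated_gradient(1.5, 2000)     # overflows the float range to inf
not_a_number = huge - huge                    # inf - inf produces NaN

overflowed = math.isinf(huge)
became_nan = math.isnan(not_a_number)
```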
Dimensionality reduction is the process of reducing the number of input variables in a dataset. As the number of features increases, your model becomes more complex, making a predictive modeling task more challenging.
Some of the advantages of dimensionality reduction include:
Backpropagation stands for “backward propagation of errors.” It refers to the algorithm used for training feedforward neural networks by repeatedly adjusting the network’s weights to minimize the difference between the actual output vector of the net and the desired output vector.
Backpropagation aims to minimize the cost function by adjusting the network’s weights and biases. The gradients of the cost function with respect to these parameters determine how much each one is adjusted.
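To see the chain rule at work, here is a sketch of backpropagation on the smallest possible network: one sigmoid neuron with one weight and one bias. The squared-error loss, the learning rate, the starting parameters, and the single training example are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(x, target, weight=0.5, bias=0.0, lr=0.5, steps=200):
    """Repeat forward pass, backward pass, and parameter update."""
    losses = []
    for _ in range(steps):
        # Forward pass: input -> weighted sum -> activation -> loss.
        z = weight * x + bias
        output = sigmoid(z)
        losses.append(0.5 * (output - target) ** 2)

        # Backward pass: apply the chain rule from the loss back to each parameter.
        dloss_doutput = output - target
        doutput_dz = output * (1.0 - output)   # derivative of the sigmoid
        dloss_dz = dloss_doutput * doutput_dz
        dloss_dweight = dloss_dz * x
        dloss_dbias = dloss_dz

        # Gradient descent update.
        weight -= lr * dloss_dweight
        bias -= lr * dloss_dbias
    return weight, bias, losses


weight, bias, losses = train_neuron(x=1.0, target=1.0)
```

In a multi-layer network the same idea applies layer by layer, with each layer passing its gradient back to the one before it.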
Feedforward Propagation occurs when the input data is fed in the forward direction through the network. Each hidden layer receives the input data, processes it (using an Activation Function), and passes it onto the next layer.
In the feedforward propagation, the Activation Function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer.
An autoencoder is an unsupervised learning technique used to learn efficient data encodings.
The goal of an autoencoder is to learn a lower-dimensional representation (encoding) for higher-dimensional data by training the network to capture the most important parts of the input (an image, for example).
Autoencoders consist of 3 parts: encoder, bottleneck, and decoder. They find application in dimensionality reduction, image denoising, and even generation of image and time series data.
Pooling in convolutional neural networks is a form of non-linear down-sampling. The pooling layer is one of the building blocks of a CNN, and it is used to reduce the dimensions of the feature maps.
You are effectively squeezing patches of the images into compressed representations. There are three common forms of pooling: Max, Average, and Sum.
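The three forms of pooling differ only in how each patch is reduced to a single number. Here is a small sketch with non-overlapping 2x2 windows. The `pool2d` helper and the 4x4 feature map are made up for this example.

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply non-overlapping size x size pooling using the given reduction."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            patch = [feature_map[r + i][c + j]
                     for i in range(size) for j in range(size)]
            row.append(reduce_fn(patch))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]

max_pooled = pool2d(feature_map, 2, max)       # [[7, 8], [9, 6]]
avg_pooled = pool2d(feature_map, 2, average)   # [[4.0, 5.0], [4.5, 3.0]]
sum_pooled = pool2d(feature_map, 2, sum)       # [[16, 20], [18, 12]]
```

In every case the 4x4 feature map shrinks to 2x2, which is the down-sampling effect pooling is used for.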
During convolutions, you are pooling information from your original image. The size of your convolutional kernel can be tweaked for your data’s needs.
The example below added 1 pixel of padding.
Another parameter of your convolutions is the stride or the number of pixel-steps the kernel takes as it performs convolutions.
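Kernel size, padding, and stride together determine the size of the output feature map. The standard relationship is floor((input + 2 * padding - kernel) / stride) + 1 along each dimension. The sketch below just evaluates it, and the 5-pixel input is an arbitrary example.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output length along one dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


no_padding = conv_output_size(5, kernel_size=3)                       # shrinks to 3
one_pixel_padding = conv_output_size(5, kernel_size=3, padding=1)     # stays at 5
stride_two = conv_output_size(5, kernel_size=3, padding=1, stride=2)  # halves to 3
```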
In its simplest form, a CNN architecture consists of convolutional layers, pooling layers, and fully connected (FC) layers. In addition to this, a CNN’s architecture also includes two other crucial components: a dropout layer and an activation function.
RNN stands for Recurrent Neural Network, and it’s a type of artificial neural network that uses sequential or time series data. Like other algorithms, RNNs use training data to learn, however—
What makes them unique is their internal state (memory) used to process sequences of inputs. In simple terms, RNNs take information from prior inputs to influence the current input and output.
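The "memory" can be sketched with a single recurrent unit whose hidden state is fed back in at every step. The scalar weights and the tanh activation below are assumptions chosen to keep the example tiny. Real RNNs use weight matrices and vectors.

```python
import math


def rnn_forward(inputs, w_input=0.5, w_hidden=0.8, bias=0.0):
    """Run one recurrent unit over a sequence and return every hidden state."""
    hidden = 0.0
    states = []
    for x in inputs:
        # The new state depends on the current input AND the previous state.
        hidden = math.tanh(w_input * x + w_hidden * hidden + bias)
        states.append(hidden)
    return states


with_signal = rnn_forward([1.0, 0.0, 0.0])
without_signal = rnn_forward([0.0, 0.0, 0.0])
```

The last two inputs are identical in both sequences, yet the final states differ, because the first input is still echoing through the hidden state.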
They are used for speech recognition, language translation, music composition, text summarization, video tagging, and more.
A p-value (probability value) describes the likelihood of obtaining a result at least as extreme as the one observed, purely by random chance, when the null hypothesis is assumed to be true. The p-value is used in statistical significance tests.
In simple terms, the p-value helps us answer the following question (based on our null hypothesis): Could the observed effect plausibly be explained by chance alone?
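A worked example helps here. Suppose the null hypothesis is "this coin is fair" and you observe 9 heads in 10 flips. The one-sided p-value is the probability of seeing 9 or more heads under that hypothesis. The coin scenario and the function name are illustrative assumptions.

```python
from math import comb


def binomial_p_value(heads, flips, p_heads=0.5):
    """One-sided p-value: P(at least `heads` heads in `flips` flips) under the null."""
    return sum(
        comb(flips, k) * p_heads ** k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


p_value = binomial_p_value(heads=9, flips=10)   # 11 / 1024, roughly 0.0107
reject_null = 0.05 > p_value                    # 0.05 is a conventional threshold
```

With a p-value of about 0.011, a result this lopsided would be rare for a fair coin, so at the conventional 0.05 level you would reject the null hypothesis.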
Support Vector Machines (SVMs) are a set of supervised learning models used for solving regression and classification problems and for outlier detection.
Support Vectors are the data points of each class that lie closest to the points of the other class, usually illustrated in a 2D (sometimes 3D) plot.
A Hyperplane is a decision boundary (a line in 2D, a plane in 3D) that linearly separates and classifies a set of data.
The distance between the hyperplane and the nearest data point from either set is known as the margin. An SVM’s goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.
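The margin can be computed directly from the point-to-hyperplane distance |w·x + b| / ||w||. The four points and the hyperplane x + y - 3.5 = 0 below are hypothetical. An actual SVM would search for the weights and bias that maximize this margin.

```python
import math


def distance_to_hyperplane(point, weights, bias):
    """Perpendicular distance from a point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(sum(w * x for w, x in zip(weights, point)) + bias) / norm


def margin(points, weights, bias):
    """Distance from the hyperplane to the nearest data point."""
    return min(distance_to_hyperplane(p, weights, bias) for p in points)


# Hypothetical training points: two per class, on either side of the boundary.
points = [(0, 0), (1, 0), (3, 3), (4, 3)]
weights, bias = (1.0, 1.0), -3.5

current_margin = margin(points, weights, bias)
```

Here the nearest points are (1, 0) and (3, 3), one from each side, and those are exactly the points that would act as support vectors.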
Support vectors are data points that lie closest to the decision surface or the hyperplane. They influence the position and orientation of the decision surface or the hyperplane and are most difficult to classify.
By using the support vectors, you can maximize the margin of the classifier, and by removing them, you will change the position of the decision surface/hyperplane.
A Kernel function takes data as input and transforms it into the form required for processing, typically by implicitly mapping it into a higher-dimensional space where the classes are easier to separate. Kernel functions provide shortcuts to avoid complex calculations.
Some of the most popular Kernel functions in SVM include:
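As an illustration, here are three widely used kernels (linear, polynomial, and RBF) written as plain functions. This particular selection, and the default values for `degree`, `coef0`, and `gamma`, are assumptions made for the sketch.

```python
import math


def linear_kernel(a, b):
    """Plain dot product: no transformation at all."""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """Dot product in an implicit space of polynomial feature combinations."""
    return (linear_kernel(a, b) + coef0) ** degree


def rbf_kernel(a, b, gamma=0.5):
    """Similarity that decays with squared distance (radial basis function)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


a, b = (1.0, 2.0), (3.0, 4.0)
linear_value = linear_kernel(a, b)          # 1*3 + 2*4 = 11
polynomial_value = polynomial_kernel(a, b)  # (11 + 1) ** 2 = 144
identical_points = rbf_kernel(a, a)         # distance 0 gives similarity 1
```

Each function returns a similarity score for a pair of points without ever building the transformed features explicitly, which is the "shortcut" that kernels provide.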
There are three common types of biases you might encounter during sampling:
Within each there can be many more specific types:
Ensemble learning refers to the method of combining a diverse set of learners together to improve the stability and predictive power of the model. Two types of Ensemble learning methods are:
Bagging: The bagging method helps you to implement similar learners on small sample populations. It helps you to make closer predictions.
Boosting
Random Forest is a machine learning method used for regression and classification tasks. It consists of a large number of individual decision trees that operate as an ensemble and form a powerful model. The adjective "random" relates to the fact that the model uses two key concepts:
In the case of classification, the output of the random forest is the class selected by most trees, and in the case of regression, it takes the average of outputs of individual trees.
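The aggregation step is simple enough to sketch on its own. The per-tree outputs below are made-up values standing in for the predictions of already-trained trees.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Classification: return the class predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Regression: return the average of the individual tree outputs."""
    return mean(tree_outputs)


# Hypothetical predictions from five trees for one input.
predicted_class = forest_classify(["cat", "dog", "cat", "cat", "dog"])
predicted_value = forest_regress([2.0, 3.0, 4.0, 3.0, 3.0])
```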
Random Forest is also used for dimensionality reduction and treating missing values and outlier values.
The confusion matrix (also known as an error matrix) is a table (2x2 for binary classification) that describes a classification model's (or classifier's) performance on a set of test data where the true values are known. The matrix compares the actual target values with values predicted by the machine learning model.
For example, here's a confusion matrix for binary classification, with two possible predicted classes and four outcomes:
True positive (TP): Correct positive prediction
False positive (FP): Incorrect positive prediction
True negative (TN): Correct negative prediction
False negative (FN): Incorrect negative prediction
Various measures that can be derived from a confusion matrix include:
Accuracy: (TP+TN)/total
Precision: TP/predicted yes
Error rate: (FP+FN)/total
Specificity: TN/actual no
Sensitivity: TP/actual yes
False positive rate: FP/actual no
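These measures are straightforward to compute from the four counts. The counts used below (40 TP, 10 FP, 45 TN, 5 FN) are invented for the example.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive the common measures from the four confusion matrix counts."""
    total = tp + fp + tn + fn
    actual_yes = tp + fn
    actual_no = tn + fp
    predicted_yes = tp + fp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / predicted_yes,
        "error_rate": (fp + fn) / total,
        "specificity": tn / actual_no,
        "sensitivity": tp / actual_yes,
        "false_positive_rate": fp / actual_no,
    }


metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that the error rate is always 1 minus the accuracy, and the false positive rate is always 1 minus the specificity.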
A few of the biggest disadvantages of the linear model include:
The linear model is limited to linear relationships: It only considers linear relationships between dependent and independent variables—assuming there is a straight-line relationship between input and output variables. However, this assumption can often be incorrect as sometimes the relationship between values is curved (e.g. age and income graph).
The linear model is sensitive to outliers: The sensitivity to poor quality data causes the linear model to underperform. With a large number of outliers, the model will be skewed away from the actual underlying relationship.
The linear model assumes that the data is independent: In cases of clustering in space and time, this assumption is incorrect.
A recommender system is a subclass of information filtering techniques. It's a machine learning system aiming at predicting users’ interests and recommending to them product items they are likely to be interested in.
The data used in building recommender systems is derived from users’ ratings and preferences. These systems operate using either a single input (a song, a video, etc.) or multiple inputs within the platform.
Companies like Netflix, Spotify, and YouTube are among the best-known heavy users of recommender systems.
Sampling refers to the selection of individual elements or a subset/group that you will collect data from in your research.
A few of the most common sampling methods include:
A type I error (false-positive) occurs when a researcher rejects a null hypothesis that is actually true in the population; a type II error (false-negative) occurs when the researcher fails to reject a null hypothesis that is actually false in the population.
There are four assumptions associated with a linear regression model:
Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of the residuals is the same for any value of X.
Independence: Observations are independent of each other, so there is no correlation between consecutive residuals in time series data.
Normality: For any fixed value of X, Y is normally distributed, which means the model’s residuals are normally distributed.
A selection bias is an error occurring when the conducted research doesn’t have a random selection of participants/elements—this lack of randomization in the process of the sample collection results in a distortion of statistical analysis.
Sometimes, the selection bias is also referred to as the selection effect. When the selection bias isn’t taken into account, the study results might be inaccurate.
As the famous saying goes: "If you fail to plan, you plan to fail"
Here, at V7, we know that attending job interviews can be a stressful experience for some of the candidates. But—
Knowing what to expect and preparing yourself in advance is a surefire way to nail your next interview.
Make sure to do your research and take time to explore more in-depth questions on technical knowledge that you might encounter.
Finally, one last piece of advice from the V7 team:
Stick to the job’s responsibilities in your answers. If it’s a deep learning job, don’t spend time talking about traditional machine learning. If you’ll be working with structured data, don’t spend time talking about 100-layer CNNs!