40 Questions to Test a Data Scientist on Machine Learning


Author: Ankit Gupta
Compiled by: Flin
Source: analyticsvidhya


Machine learning is one of the most sought-after skills today. We have organized several skill tests so that data scientists can assess their key skills. These tests cover machine learning, deep learning, time series problems, and probability. This article provides the solutions for the machine learning skill test. If you missed any of the skill tests above, you can still view the questions and answers through the links below.

More than 1350 people registered for the machine learning skill test. The test is designed to check whether you have mastered the conceptual knowledge of machine learning. If you missed the live test, you can still read this article and learn how to answer these questions correctly.

This is the ranking of all participants.

These questions, along with hundreds of others, are part of our Ace Data Science Interviews course (https://courses.analyticsvidhya.com/courses/ace-data-science-interviews). It is a comprehensive guide with plenty of resources. If you have just started your journey in data science, take a look at our most popular course, "Introduction to Data Science" (https://courses.analyticsvidhya.com/courses/introduction-to-data-science-2).

Overall scores

Below is the score distribution, which will help you evaluate your performance.

You can view the final scores here (https://datahack.analyticsvidhya.com/contest/skillpower-machine-learning/#LeaderBoard). More than 210 people took the skill test, and the highest score was 36. Here are some statistics on the scores.

Average score: 19.36

Median score: 21

Mode score: 27

Useful resources

Problems and Solutions

Question context

Feature F1 represents university students' grades and can take the specific values A, B, C, D, E, and F.

1) Which of the following statements is true in this case?

A) Feature F1 is an example of a nominal variable.
B) Feature F1 is an example of an ordinal variable.
C) It does not fall into either of the above categories.
D) Both of the above

Solution: (b)

An ordinal variable is a variable whose categories have some order. For example, grade A should be considered higher than grade B.

2) Which of the following is an example of a deterministic algorithm?

A) PCA

B) K-Means

C) None of the above

Solution: (a)

A deterministic algorithm is one whose output does not change across different runs. If run again, PCA gives the same result, but k-means does not, because its result depends on the random centroid initialization.

3) [True or False] The Pearson correlation between two variables can be zero while their values are still related to each other.

A) True

B) False

Solution: (a)

Consider Y = X². Note that they are not merely related; one variable is a function of the other, yet the Pearson correlation between them is zero.
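This is easy to verify numerically. A minimal sketch (the sample values are illustrative):

```python
import numpy as np

# x is symmetric around zero and y is a deterministic function of x
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

# E[x] = 0 and E[x * y] = E[x^3] = 0, so the covariance (and hence
# the Pearson correlation) is exactly zero
r = np.corrcoef(x, y)[0, 1]
print(r)  # 0.0
```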

4) Which of the following statements is true for gradient descent (GD) and stochastic gradient descent (SGD)?

  1. In GD and SGD, you update a set of parameters iteratively to minimize the error function.
  2. In SGD, you must traverse all the samples in the training set to update the parameters once in each iteration.
  3. In GD, you can use the whole data or a subset of training data to update parameters in each iteration.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1, 2 and 3

Solution: (a)

In each iteration of SGD, a single randomly chosen sample (or a small random batch) is used to update the parameters, whereas each iteration of GD uses all training observations.
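The difference can be sketched on a toy linear regression problem. All names and data below are illustrative, not from the original quiz:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=100)

def gd_step(w, X, y, lr=0.1):
    # batch gradient descent: one update uses ALL training observations
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def sgd_epoch(w, X, y, lr=0.01):
    # stochastic gradient descent: one update per randomly chosen sample
    for i in rng.permutation(len(y)):
        err = X[i] @ w - y[i]
        w = w - lr * 2 * err * X[i]
    return w

w_gd = np.zeros(2)
for _ in range(200):
    w_gd = gd_step(w_gd, X, y)

w_sgd = sgd_epoch(np.zeros(2), X, y)
```

Both estimates approach the true weights; GD takes larger, smoother steps per pass over the data, while SGD updates after every sample.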

5) Which of the following hyperparameters, when increased, may cause a random forest to overfit the data?

  1. Number of trees
  2. Tree depth
  3. Learning rate

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 2 and 3

F) 1, 2 and 3

Solution: (b)

Usually, increasing the depth of the trees leads to overfitting. The learning rate is not a hyperparameter of random forests. Increasing the number of trees leads, if anything, to underfitting rather than overfitting.

6) Imagine you work at Analytics Vidhya and want to develop a machine learning algorithm that predicts the number of views of an article.

Your analysis is based on features such as the author's name, the number of articles the same author has written on Analytics Vidhya in the past, and so on. Which of the following evaluation metrics would you choose in this case?

  1. Mean squared error
  2. Accuracy
  3. F1 score

A) Only 1

B) Only 2

C) Only 3

D) 1 and 3

E) 2 and 3

F) 1 and 2

Solution: (a)

The number of views of an article is a continuous target variable, so this is a regression problem. Therefore mean squared error is the appropriate evaluation metric.

7) Three images (1, 2, 3) are given below. Which of the following options is correct for these images?




A) 1 is tanh, 2 is ReLU, and 3 is the sigmoid activation function.

B) 1 is sigmoid, 2 is ReLU, and 3 is the tanh activation function.

C) 1 is ReLU, 2 is tanh, and 3 is the sigmoid activation function.

D) 1 is tanh, 2 is sigmoid, and 3 is the ReLU activation function.

Solution: (d)

The range of the sigmoid function is [0, 1].

The range of the tanh function is [-1, 1].

The range of the ReLU function is [0, ∞).

Therefore, option D is the correct answer.

8) The following are the 8 actual values of the target variable in the training file: five observations of one class and three of the other.

What is the entropy of the target variable?

A) -(5/8 log(5/8) + 3/8 log(3/8))

B) 5/8 log(5/8) + 3/8 log(3/8)

C) 3/8 log(5/8) + 5/8 log(3/8)

D) 5/8 log(3/8) – 3/8 log(5/8)

Solution: (a)

The formula for entropy is

Entropy = -Σᵢ pᵢ log₂(pᵢ)

With 5 observations in one class and 3 in the other, this gives -(5/8 log(5/8) + 3/8 log(3/8)), so the answer is (a).
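As a quick check, the entropy can be computed directly (the labels below are reconstructed from the 5/3 class split in the answer):

```python
import math

# 5 observations of one class and 3 of the other, as in the question
labels = [1, 1, 1, 1, 1, 0, 0, 0]

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the class proportions
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

print(entropy(labels))  # ≈ 0.954 bits
```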

9) Suppose you are working with categorical features and you have not seen the distribution of the categorical variables in the test data. You want to apply one-hot encoding (OHE) to the categorical features. What challenges may you face if you apply OHE to the categorical variables of the training dataset?

A) Not all categories of the categorical variable are present in the test dataset.

B) The frequency distribution of categories differs between the training set and the test set.

C) The training set and test set always have the same distribution.

D) Both A and B

E) None of these

Solution: (d)

Both A and B are correct. OHE cannot encode categories that appear in the test set but not in the training set, so this can be a major challenge. If the frequency distribution differs between training and test, the challenge in option B also exists, and you need to be more careful when applying OHE.
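A minimal sketch of the challenge in option A, with hypothetical categories:

```python
# categories observed in the training data (hypothetical)
train_categories = ["A", "B", "C"]
columns = {c: i for i, c in enumerate(sorted(train_categories))}

def one_hot(value, columns):
    # unseen categories silently map to an all-zero vector
    vec = [0] * len(columns)
    if value in columns:
        vec[columns[value]] = 1
    return vec

print(one_hot("B", columns))  # [0, 1, 0]
print(one_hot("D", columns))  # [0, 0, 0] -- "D" never appeared in training
```

In practice, libraries expose switches for this situation (for example, scikit-learn's `OneHotEncoder` has a `handle_unknown` option), but the information loss for unseen categories remains.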

10) The skip-gram model is one of the best models for word embeddings in the word2vec algorithm. Which of the following diagrams shows the skip-gram model?



C) A and B

D) None of this is true

Solution: (b)

The word2vec algorithm uses two models (Model1 and Model2, shown in the answer options). Model1 represents the CBOW model and Model2 the skip-gram model.

11) Suppose you are using an activation function X in the hidden layers of a neural network. At a particular neuron, for some given input, you get the output "-0.0001". Which of the following activation functions could X be?

A) ReLU

B) tanh

C) Sigmoid

D) None of these

Solution: (b)

The function is tanh, because the output range of tanh is (-1, 1), which includes negative values; ReLU and sigmoid cannot output a negative value.

12) [True or False] The log loss evaluation metric can have negative values.

A) True
B) False

Solution: (b)

Log loss cannot be negative: it is the negative log of a predicted probability, and probabilities lie in [0, 1], so each contribution is ≥ 0.

13) Which of the following statements is true about Type I and Type II errors?

  1. Type I is called a false positive and Type II a false negative.
  2. Type I is called a false negative and Type II a false positive.
  3. A Type I error occurs when we reject a null hypothesis that is actually true.

A) Only 1

B) Only 2

C) Only 3

D) 1 and 2

E) 1 and 3

F) 2 and 3

Solution: (E)

In statistical hypothesis testing, a Type I error is the incorrect rejection of a true null hypothesis (a "false positive"), while a Type II error is the failure to reject a false null hypothesis (a "false negative").

14) Which of the following are important steps in preprocessing text for NLP-based projects?

  1. Stemming
  2. Stop-word removal
  3. Object standardization

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) 1, 2 and 3

Solution: (d)

Stemming is a basic rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from words.

Stop words are words that carry no information about the context of the data, such as is/am/are.

Object standardization is also a good way to preprocess text.
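A toy sketch of stemming and stop-word removal (the stop-word list and suffix rules here are deliberately simplified; a real project would use a proper stemmer such as Porter's):

```python
# tiny illustrative stop-word list and suffix rules, NOT a real stemmer
STOP_WORDS = {"is", "am", "are", "the", "a", "an"}
SUFFIXES = ("ing", "ly", "es", "s")

def preprocess(text):
    # drop stop words, then strip the first matching suffix from each token
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The models are learning quickly"))
# ['model', 'learn', 'quick']
```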

15) Suppose you want to project high-dimensional data onto a lower dimension. The two best-known dimensionality-reduction algorithms here are PCA and t-SNE. Say you apply both algorithms to data "X" and obtain the datasets "X_projected_PCA" and "X_projected_tSNE".

Which of the following statements is correct for "X_projected_PCA" and "X_projected_tSNE"?

A) X_projected_PCA will be interpretable in the nearest-neighbor space.

B) X_projected_tSNE will be interpretable in the nearest-neighbor space.

C) Both will be interpretable in the nearest-neighbor space.

D) Neither will be interpretable in the nearest-neighbor space.

Solution: (b)

The t-SNE algorithm considers nearest neighbors when reducing the dimensionality of the data, so after using t-SNE the reduced dimensions remain interpretable in the nearest-neighbor space. This does not hold for PCA.

Questions 16-17

Here are three scatter plots of two features.

16) In the plots above, which of the following shows an example of multi-collinear features?

A) Features in image 1

B) Features in image 2

C) Features in image 3

D) Features in images 1 and 2

E) Features in images 2 and 3

F) Features in images 3 and 1

Solution: (d)

In image 1, the features have a high positive correlation, while in image 2 they have a high negative correlation. So in both images the feature pairs are examples of multi-collinear features.

17) In the previous question, suppose you have identified the multi-collinear features. Which of the following actions would you perform next?

  1. Remove both collinear variables.
  2. Remove one of the two collinear variables.
  3. Removing correlated variables may cause loss of information. To keep them, we can use penalized regression models such as ridge or lasso regression.

A) Only 1

B) Only 2

C) Only 3

D) 1 or 3

E) 2 or 3

Solution: (E)

You cannot remove both features, because you would then lose all the information they carry. You should remove only one of them, or use a regularization technique such as L1 (lasso) or L2 (ridge).

18) Adding an unimportant feature to a linear regression model may cause _____.

  1. an increase in R-squared
  2. a decrease in R-squared

A) Only 1 is correct

B) Only 2 is correct

C) 1 or 2

D) None of this is true

Solution: (a)

After a feature is added to the feature space, R-squared never decreases (and typically increases), whether or not the feature is important.
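This can be checked numerically: adding a pure-noise feature never lowers R-squared. The data and seed below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
y = 3 * x1 + rng.normal(size=n)

def r_squared(X, y):
    # ordinary least squares with an intercept column
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_before = r_squared(x1.reshape(-1, 1), y)
noise_feature = rng.normal(size=n)          # feature unrelated to y
r2_after = r_squared(np.column_stack([x1, noise_feature]), y)
```

`r2_after` is never below `r2_before`; the noise feature buys a tiny, spurious improvement, which is why adjusted R-squared is often preferred for comparing models.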

19) Suppose you are given three variables X, Y, and Z. The Pearson correlation coefficients of (X, Y), (Y, Z), and (X, Z) are C1, C2, and C3, respectively.

Now you add 2 to all values of X (the new values are X + 2), subtract 2 from all values of Y (the new values are Y - 2), and leave Z unchanged. The new coefficients of (X, Y), (Y, Z), and (X, Z) are D1, D2, and D3, respectively. How do the values of D1, D2, and D3 relate to C1, C2, and C3?

A) D1 = C1, D2 < C2, D3 > C3

B) D1 = C1, D2 > C2, D3 > C3

C) D1 = C1, D2 > C2, D3 < C3

D) D1 = C1, D2 < C2, D3 < C3

E) D1 = C1, D2 = C2, D3 = C3

F) Unable to determine

Solution: (E)

Adding or subtracting a constant from a feature does not change its correlation with other features.
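A quick numerical check of this invariance (random data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)
z = rng.normal(size=50)

c1, c2, c3 = (np.corrcoef(x, y)[0, 1],
              np.corrcoef(y, z)[0, 1],
              np.corrcoef(x, z)[0, 1])

# shift X up by 2 and Y down by 2; Z unchanged
d1, d2, d3 = (np.corrcoef(x + 2, y - 2)[0, 1],
              np.corrcoef(y - 2, z)[0, 1],
              np.corrcoef(x + 2, z)[0, 1])
```

All three coefficients are unchanged, because correlation is computed on mean-centered values and any constant shift cancels out.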

20) Imagine you are solving a classification problem with highly imbalanced classes: the majority class is observed 99% of the time in the training data.

Your model achieves 99% accuracy on the test data. Which of the following is true in this case?

  1. Accuracy is not a good metric for class-imbalance problems.
  2. Accuracy is a good metric for class-imbalance problems.
  3. Precision and recall are good metrics for class-imbalance problems.
  4. Precision and recall are not good metrics for class-imbalance problems.

A) 1 and 3

B) 1 and 4

C) 2 and 3

D) 2 and 4

Solution: (a)

With a 99% majority class, a model that always predicts the majority class already reaches 99% accuracy, so accuracy is uninformative here; precision and recall on the minority class are far more telling.

21) In ensemble learning, you aggregate the predictions of weak learners, so an ensemble of these models gives better predictions than a single model.

Which of the following statements is true for the weak learners used in an ensemble model?

  1. They usually do not overfit.
  2. They have high bias, so they cannot solve complex learning problems.
  3. They usually overfit.

A) 1 and 2

B) 1 and 3

C) 2 and 3

D) Only 1

E) Only 2

F) None of the above

Solution: (a)

Each weak learner captures a specific part of the problem. Weak learners therefore usually do not overfit, which means they have low variance and high bias.

22) Which of the following options is/are correct for k-fold cross-validation?

  1. Increasing K makes cross-validation take longer.
  2. Compared with a lower K, a higher K gives more confidence in the cross-validation results.
  3. If K = N, it is called leave-one-out cross-validation, where N is the number of observations.

A) 1 and 2

B) 2 and 3

C) 1 and 3

D) 1, 2 and 3

Solution: (d)

The larger the value of K, the smaller the bias toward overestimating the true expected error (because the training folds are closer to the full dataset) and the longer the running time (as you approach the limiting case of leave-one-out cross-validation). When choosing K, the variance between the K fold accuracies should also be considered.

Questions 23-24

Cross-validation is an important step in hyperparameter tuning in machine learning. Suppose you are tuning the hyperparameter "max_depth" of a GBM (a tree-based model) by selecting it from 10 different depth values (all greater than 2) using 5-fold cross-validation.
For the algorithm (with max_depth = 2), training on 4 folds takes 10 seconds and prediction on the remaining 1 fold takes 2 seconds.
Note: hardware dependencies are ignored in the calculation.

23) Which of the following options is correct for the overall execution time of 5-fold cross-validation over 10 different "max_depth" values?

A) Less than 100 seconds

B) 100 – 300 seconds

C) 300 – 600 seconds

D) Greater than or equal to 600 seconds

E) None of the above

F) Unable to estimate

Solution: (d)

Each fold of the 5-fold cross-validation at depth "2" takes 10 seconds to train and 2 seconds to test.

Therefore, one full 5-fold run takes 12 × 5 = 60 seconds. Since we search over 10 depth values, the search takes 60 × 10 = 600 seconds.

However, depths greater than 2 take longer to train and test than depth "2", so the overall time will exceed 600 seconds.
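The arithmetic in the solution, spelled out:

```python
train_time = 10       # seconds to train on 4 folds at max_depth = 2
predict_time = 2      # seconds to predict on the held-out fold
folds = 5
depth_values = 10

one_cv_run = (train_time + predict_time) * folds   # one full 5-fold run: 60 s
total = one_cv_run * depth_values                  # lower bound over 10 depths
print(total)  # 600 -- deeper trees only add to this, hence "> 600 s"
```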

24) In the previous question, suppose you train the same algorithm but tune two hyperparameters, "max_depth" and "learning rate".

You want to choose the right value for max_depth (from the 10 given depth values) and for the learning rate (from 5 given learning rates). In this case, which of the following represents the total time?

A) 1000-1500 seconds

B) 1500-3000 seconds

C) Greater than or equal to 3000 seconds

D) None of this is true

Solution: (d)

Same reasoning as in question 23.

25) The table of training error (TE) and validation error (VE) for machine learning algorithm M1 with different hyperparameter (H) values is given below. You need to select a hyperparameter value based on TE and VE.

H   TE    VE
1   105   90
2   200   85
3   250   96
4   105   85
5   300   100

Which value of H would you choose based on the table above?

A) 1

B) 2

C) 3

D) 4

E) 5

Solution: (d)

Based on the table, H = 4 gives the lowest validation error (85) together with the lowest training error (105), so option D is best.

26) What would you do in PCA to get the same projections as SVD?

A) Transform the data to have zero mean

B) Transform the data to have zero median

C) It is not possible

D) None of these

Solution: (a)

When the data has zero mean, the PCA projections are the same as those of SVD; otherwise, the data must be centered before taking the SVD.
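A sketch comparing the two routes on centered data (random data; the comparison is up to the sign of each principal direction, which both methods leave arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) + 5.0     # data with a non-zero mean
Xc = X - X.mean(axis=0)                 # center each feature to zero mean

# route 1: eigendecomposition of the covariance matrix (classic PCA)
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
pcs_eig = eigvecs[:, np.argsort(eigvals)[::-1]]

# route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_svd = Vt.T

# the principal directions agree (up to sign) only because Xc is centered
same = np.allclose(np.abs(pcs_eig), np.abs(pcs_svd), atol=1e-6)
```

Running the SVD on the uncentered `X` instead of `Xc` would break this agreement, which is exactly the point of option A.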

Questions 27-28

Suppose there is a black-box algorithm that takes training data with many observations (T1, T2, T3, …, TN) and a new observation (Q1). The black box outputs the nearest neighbor of Q1 (say Ti) and its corresponding class label Ci.

You can think of this black-box algorithm as being the same as 1-NN (1-nearest neighbor).

27) A k-NN classification algorithm can be constructed based only on this black-box algorithm.

Note: n (the number of training observations) is very large compared with k.

A) True

B) False

Solution: (a)

In the first step, you pass the observation (Q1) to the black-box algorithm, which returns the nearest-neighbor observation and its class label.

In the second step, you remove the returned nearest neighbor from the training data and feed in the observation (Q1) again. The black box again returns the nearest neighbor and its class label.

You repeat this process k times.

28) Instead of a 1-NN black box, suppose we use a j-NN (j > 1) algorithm as the black box. Which of the following is correct for finding k-NN using j-NN?

  1. j must be a proper factor of k
  2. j > k
  3. It is not possible

A) 1

B) 2

C) 3
Solution: (a)

Same reasoning as in question 27.

29) Suppose you are given seven scatter plots numbered 1 to 7 (left to right), and you want to compare the Pearson correlation coefficients between the variables in each plot.

Which of the following is the correct order?

  1. 1 < 2 < 3 < 4
  2. 1 > 2 > 3 > 4
  3. 7 < 6 < 5 < 4
  4. 7 > 6 > 5 > 4

A) 1 and 3

B) 2 and 3

C) 1 and 4

D) 2 and 4

Solution: (b)

From image 1 to image 4, the correlation decreases. From image 4 to image 7, the magnitude of the correlation increases again, but its value is negative (for example 0, -0.3, -0.7, -0.99).

30) You can use different metrics (e.g., accuracy, log loss, F-score) to evaluate the performance of a binary classification problem. Suppose you are using log loss as the evaluation metric. Which of the following is true about interpreting log loss?

  1. If the classifier is confident about a misclassification, log loss penalizes it heavily.
  2. For a given observation, if the classifier assigns a very small probability to the correct class, the corresponding contribution to the log loss is very large.
  3. The lower the log loss, the better the model.

A) 1 and 3

B) 2 and 3

C) 1 and 2

D) 1, 2 and 3

Solution: (d)

All three statements about log loss are true.
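A small sketch of points 1 and 2: the log-loss contribution of one observation grows sharply as the probability assigned to the correct class shrinks (the probabilities below are illustrative):

```python
import math

def log_loss_single(y_true, p):
    # log-loss contribution of one observation; p = predicted P(class 1)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# the true class is 1 in all three cases
good_prediction   = log_loss_single(1, 0.9)   # ≈ 0.11
mild_mistake      = log_loss_single(1, 0.4)   # ≈ 0.92
confident_mistake = log_loss_single(1, 0.01)  # ≈ 4.61
```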

Questions 31-32

The following are five samples given in the data set.

Note: the visual distance between points in the image represents the actual distance.

31) Which of the following is the leave-one-out cross-validation accuracy of 3-NN (3-nearest neighbors)?

A) 0%

B) 40%

C) 80%

D) 100%
Solution: (c)

In leave-one-out cross-validation, we select (n - 1) observations for training and 1 observation for validation. Treat each point in turn as the validation point and find its 3 nearest neighbors.

Repeating this process for all points, all the positive points in the figure above are classified correctly, but the negative point is misclassified, so you get 80% accuracy.

32) Which of the following values of K has the lowest leave-one-out cross-validation accuracy?

A) 1-NN

B) 2-NN

C) 3-NN

D) All have the same leave-one-out error

Solution: (a)

With 1-NN, every point's nearest neighbor belongs to the other class, so each point is misclassified and the accuracy is 0%.

33) Suppose you are given the following data and want to apply a logistic regression model to classify it into the two given classes.

You are using logistic regression with L1 regularization, where C is the regularization parameter and W1 and W2 are the coefficients of X1 and X2.

When you increase the value of C from zero to a very large value, which of the following options is correct?

A) First W2 becomes zero, then W1 becomes zero

B) First W1 becomes zero, then W2 becomes zero

C) Both become zero at the same time

D) Even if the C value is large, they cannot be zero

Solution: (b)

Looking at the image, we see that classification can be done effectively using only X2, so W1 becomes zero first. As the regularization is increased further, W2 also moves closer to zero.

34) Suppose we have a dataset that can be trained to 100% accuracy with a decision tree of depth 6. Now consider the following statements and choose an option based on them.

Note: all other hyperparameters are the same, and other factors are unaffected.

1. A tree of depth 4 will have high bias and low variance

2. A tree of depth 4 will have low bias and low variance

A) Only 1

B) Only 2

C) 1 and 2

D) None of the above

Solution: (a)

Fitting this data with a decision tree of depth 4 is likely to underfit, and underfitting means higher bias and lower variance.

35) Which of the following options can help obtain the global minimum in the k-means algorithm?

1. Try running the algorithm with different centroid initializations

2. Adjust the number of iterations

3. Find the optimal number of clusters

A) 2 and 3

B) 1 and 3

C) 1 and 2

D) All of the above

Solution: (d)

All of these options can be tuned to help find the global minimum.
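Point 1 can be sketched with a tiny k-means rerun over several centroid initializations, keeping the run with the lowest inertia (a simplified Lloyd's algorithm on synthetic blobs, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(7)
# two well-separated blobs of 50 points each
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

def kmeans(X, k, n_iter=20, seed=0):
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # recompute centers; keep the old center if a cluster goes empty
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    inertia = ((X - centers[labels]) ** 2).sum()
    return centers, inertia

# rerun with different centroid initializations, keep the best result
results = [kmeans(X, k=2, seed=s) for s in range(10)]
best_centers, best_inertia = min(results, key=lambda res: res[1])
```

This restart strategy is what scikit-learn's `KMeans` does via its `n_init` parameter.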

36) Suppose you are working on a binary classification project. You trained a model on the training dataset and obtained the following confusion matrix on the validation dataset.

Based on the confusion matrix above, which of the following options give the correct values?

1. The accuracy is ~0.91

2. The misclassification rate is ~0.91

3. The false positive rate is ~0.95

4. The true positive rate is ~0.95

A) 1 and 3

B) 2 and 4

C) 1 and 4

D) 2 and 3

Solution: (c)

The accuracy (fraction correctly classified) is (50 + 100) / 165 ≈ 0.91.

The true positive rate is the fraction of actual positives that you predict as positive, so it is 100 / 105 ≈ 0.95. It is also known as "sensitivity" or "recall".
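Assuming the counts implied by the solution's arithmetic (TN = 50, FP = 10, FN = 5, TP = 100; the original confusion matrix image is missing), the metrics work out as:

```python
# counts reconstructed from the solution's arithmetic
TN, FP, FN, TP = 50, 10, 5, 100
total = TN + FP + FN + TP                    # 165 validation observations

accuracy = (TP + TN) / total                 # ≈ 0.91 -> statement 1 correct
misclassification_rate = (FP + FN) / total   # ≈ 0.09 -> statement 2 wrong
true_positive_rate = TP / (TP + FN)          # ≈ 0.95 -> statement 4 correct
false_positive_rate = FP / (FP + TN)         # ≈ 0.17 -> statement 3 wrong
```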

37) For which of the following decision tree hyperparameters is a higher value always better?

1. Minimum number of samples required to split a node

2. Tree depth

3. Minimum number of samples required at a leaf node

A) 1 and 2

B) 2 and 3

C) 1 and 3

D) 1, 2 and 3

E) Can’t judge

Solution: (E)

For none of the three options does increasing the value necessarily improve performance. For example, a very large tree depth may overfit the data and generalize poorly, while a very small one may underfit. Therefore we cannot say with certainty that "the higher the better".

Questions 38-39

Imagine you have a 28 × 28 image and run a 3 × 3 convolution on it. The input depth is 3 and the output depth is 8.

Note: the stride is 1, and you are using same padding.

38) What is the size of the output feature map with the given parameters?

A) Width 28, height 28 and depth 8

B) Width 13, height 13 and depth 8

C) Width 28, height 13 and depth 8

D) Width 13, height 28 and depth 8

Solution: (a)

With same padding, the output spatial size equals the input size, so the output is width 28, height 28, depth 8.

The general formula for the output size is

Output size = (N + 2P - F) / S + 1

where N is the input size, F is the filter size, P is the padding, and S is the stride. With N = 28, F = 3, P = 1, and S = 1, this gives 28.

39) What is the size of the output feature map when using the following parameters?

A) Width 28, height 28 and depth 8

B) Width 13, height 13 and depth 8

C) Width 28, height 13 and depth 8

D) Width 13, height 28 and depth 8

Solution: (b)
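Both answers follow from the standard output-size formula; a small helper makes the arithmetic explicit. Q39's stride-2, zero-padding parameters are an assumption consistent with answer (b), since the original parameter image is missing:

```python
def conv_output_size(n, f, stride, padding):
    # floor((N + 2P - F) / S) + 1
    return (n + 2 * padding - f) // stride + 1

# Q38: 28x28 input, 3x3 filter, stride 1, "same" padding (P = 1 for a 3x3 filter)
print(conv_output_size(28, 3, stride=1, padding=1))   # 28 -> 28 x 28 x 8

# Q39 (assumed parameters): stride 2, no padding
print(conv_output_size(28, 3, stride=2, padding=0))   # 13 -> 13 x 13 x 8
```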


40) Suppose we are plotting visualizations for different values of C (the penalty parameter) in the SVM algorithm. For some reason we forgot to label the plots with their C values. In that case, which of the following options best describes the C values of the images below for a radial basis function (RBF) kernel?

(Images 1, 2, 3 run from left to right, so the C value is C1 for image 1, C2 for image 2, and C3 for image 3.)

A) C1 = C2 = C3

B) C1 > C2 > C3

C) C1 < C2 < C3

D) None of this is true

Solution: (c)

C is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly. For larger values of C, the optimizer chooses a smaller-margin hyperplane.

Read more here: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/

Original link: https://www.analyticsvidhya.com/blog/2017/04/40-questions-test-data-scientist-machine-learning-solution-skillpower-machine-learning-datafest-2017/
