20 questions to detect fake data scientists


I found an interesting blog post recently, titled "20 Questions to Detect Fake Data Scientists." I could answer some of them, but not all, so I decided to answer all of them. Feel free to leave comments if you find any mistakes!

1. Explain what regularization is and why it is useful.

Regularization is a way of mitigating overfitting. For parametric models, it means adding an extra term to the objective function that penalizes large coefficients. L1 and L2 are the two commonly used techniques: they penalize the loss function by the absolute value and the square of the coefficients, respectively. For non-parametric models, for example decision tree based methods, limiting the tree depth regularizes the model. For deep learning models, techniques like dropout are used to avoid overfitting.
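
As a rough sketch (the toy dataset and alpha values here are arbitrary), scikit-learn makes the L1/L2 comparison easy:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Toy dataset with far more features than informative signals,
# which makes plain least squares prone to overfitting.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: penalizes the squared coefficients
lasso = Lasso(alpha=1.0).fit(X, y)  # L1: penalizes the absolute coefficients

# L1 drives many coefficients exactly to zero; L2 only shrinks them.
print("non-zero coefficients -",
      "OLS:", (ols.coef_ != 0).sum(),
      "Ridge:", (ridge.coef_ != 0).sum(),
      "Lasso:", (lasso.coef_ != 0).sum())
```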

2. Which data scientists do you admire most? Which startups?

Data scientists: David Robinson, William Chen. Startups: Quora, Wish. (Interested to see reader recommendations!)

3. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?

You take a look at \(R^2\) (r-squared). \(R^2\) is the proportion of the overall variability of Y that is explained by the model (SSM/SST). Generally, the higher \(R^2\) is, the better the model explains the data. However, when there is an excess of variables, \(R^2\) becomes high even though the model overfits. To address this problem, adjusted \(R^2\), which corrects \(R^2\) for the degrees of freedom, is also used. Beyond in-sample fit, holding out data or cross-validating gives a better check that the model actually generalizes.
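
For reference, with \(n\) observations and \(p\) predictors, the usual adjusted \(R^2\) is

\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},
\]

so adding a predictor only raises \(\bar{R}^2\) if it improves the fit more than chance alone would.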

4. Explain what precision and recall are. How do they relate to the ROC curve?

Precision is the fraction of your positive predictions that are actually positive: true positives over all positive predictions. It tells you how accurate the model is when it predicts positive. Recall is the fraction of actual positives the model catches: true positives over all positive labels. It tells you how much of the relevant data in the dataset the model covers.

The Receiver Operating Characteristic (ROC) curve is generated by plotting the false positive rate (1 − specificity) on the x-axis and the true positive rate (sensitivity/recall) on the y-axis at different thresholds for a binary classifier. It shows the tradeoff between the two.
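
A minimal scikit-learn sketch (the labels and scores here are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, roc_curve, roc_auc_score

# Hypothetical ground-truth labels and classifier scores.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # threshold at 0.5

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)

# The ROC curve sweeps the threshold and records (FPR, TPR) pairs.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
```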

5. How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?

Use cross validation: compare the new model's average generalization error with the previous model's on the same folds, ideally with a significance test on the per-fold scores.
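
A minimal sketch with scikit-learn (the dataset and the two models stand in for "before" and "after"):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Evaluate the baseline and the "improved" model on the same folds
# (the default 5-fold splitter is deterministic here).
baseline = LogisticRegression(max_iter=5000)
improved = RandomForestClassifier(random_state=0)

base_scores = cross_val_score(baseline, X, y, cv=5)
new_scores = cross_val_score(improved, X, y, cv=5)

print("baseline:", base_scores.mean(), "improved:", new_scores.mean())
# A paired t-test on the per-fold scores adds a rough significance check.
```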

6. What is root cause analysis?

Root cause analysis is the practice of repeatedly tracing a problem back through its causes until you reach a root cause whose prevention is effective enough to keep the problem from occurring again.

7. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.

- Pricing optimization: varying product prices to meet a company objective, such as maximizing profit. Example: Uber ride fares.
- Price elasticity: the change in quantity demanded in response to a change in price (see the formula below). When the public bus fare goes up, what percentage of riders stop taking the bus?
- Inventory management: this one is quite obvious. It is the act of managing inventories (new, shipped, and returned products), where they are located, and so on. Amazon warehouses do this.
- Competitive intelligence: insights into anything outside your company, e.g. competitors, market trends, technology, legislation.
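
For reference, price elasticity of demand is commonly written as

\[
E_d = \frac{\%\,\Delta Q}{\%\,\Delta P} = \frac{\Delta Q / Q}{\Delta P / P},
\]

so an elasticity of −2 means a 1% price increase reduces quantity demanded by roughly 2%.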

8. What is statistical power?

Statistical power is the probability of correctly rejecting the null hypothesis when the alternative is true: it is the test's sensitivity (see 4), the true positive rate. When the sample size is not yet determined, power is set to around 0.8–0.9 and used to calculate the required sample size. On the other hand, if the sample size is already determined, calculating the power tells you how likely the test is to detect a real effect.
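
A minimal sketch with statsmodels (the effect size and alpha here are arbitrary assumptions):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Given a target power of 0.8, solve for the required sample size per group.
n_per_group = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print("required n per group:", round(n_per_group))  # ~64 for a medium effect

# Conversely, with a fixed sample size, solve for the achieved power.
achieved = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
print("power with n=30:", round(achieved, 2))
```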

9. Explain what resampling methods are and why they are useful. Also explain their limitations.

Resampling is the act of repeatedly drawing samples from observed data. This is done to 1) estimate the uncertainty of a sample statistic (as in jackknifing and bootstrapping) or 2) validate a model (as in cross validation). The downsides are that resampling cannot detect bias in how the original sample was collected, and that some methods, like cross validation and Monte Carlo methods, can be computationally expensive.
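
A minimal bootstrap sketch with NumPy (the "observed" data is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)  # pretend this is our observed data

# Bootstrap: resample with replacement many times and recompute the statistic.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# The spread of the bootstrap distribution estimates the standard error,
# and its percentiles give a confidence interval for the mean.
print("estimated SE:", boot_means.std(ddof=1))
print("95% CI:", np.percentile(boot_means, [2.5, 97.5]))
```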

10. Is it better to have too many false positives, or too many false negatives? Explain.

It is better to have too many false negatives (type II errors) when false positives carry the higher cost, which is true in many cases. For example, sentencing an innocent person to death (a false positive) is much worse than judging an actual murderer innocent (a false negative).

11. What is selection bias, why is it important and how can you avoid it?

Selection bias occurs when you sample more from one population group than another. It is important because it leads to non-random sampling, which generates a model that does not generalize. The easiest way to avoid selection bias is to clearly define what your population is and randomly sample from that population.

12. Give an example of how you would use experimental design to answer a question about user behavior.

This is essentially asking about A/B testing. In the simplest case, consider a mobile app where you want to determine whether a red or a blue sign-up button leads to more registrations. You randomly split a subset of users in two, show one half the red button and the other half the blue button, and compare the rates of successful registration.
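
A minimal sketch of the follow-up analysis, assuming hypothetical sign-up counts, using statsmodels' two-proportion z-test:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: sign-ups out of users shown each button color.
signups = [310, 355]   # red, blue
shown = [5000, 5000]

stat, p_value = proportions_ztest(count=signups, nobs=shown)
print("z =", round(stat, 2), "p =", round(p_value, 4))
# A small p-value suggests the difference in registration rate is not just noise.
```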

13. What is the difference between “long” and “wide” format data?

Long-format data treats each variable as a separate record, ending up as a vertically long list of key-value combinations with each row representing one measurement. Wide-format data combines keys to reduce the number of rows, ending up as a wider table with one row per subject. Long format is generally preferred when there are many value variables and the data needs to be fed into some kind of algorithm.
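
A quick pandas illustration (the data frame here is made up):

```python
import pandas as pd

# Wide format: one row per subject, one column per measurement.
wide = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "height": [165, 180],
    "weight": [55, 80],
})

# melt() converts wide -> long: one row per (subject, variable) pair.
long = wide.melt(id_vars="name", var_name="variable", value_name="value")
print(long)

# pivot() converts long -> wide again.
wide_again = long.pivot(index="name", columns="variable", values="value")
```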

14. What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject?

First, look at the data visualization. Is the pie chart intentionally distorted? Is the bar chart's axis truncated to exaggerate the differences? Second, look at what is NOT disclosed in the statistics. If the sample size is not revealed, it may have been too small to produce a reliable, statistically significant result. If the sampling method is not revealed, it may be biased.

15. Explain Edward Tufte’s concept of “chart junk.”

Chart junk is unnecessary (and even harmful) decoration in data visualization: visual elements, such as heavy gridlines or 3D effects, that convey no information.

16. How would you screen for outliers and what should you do if you find one?

There are many ways to detect outliers, but the easiest way to get started is to plot the data points in a scatter plot and simply look for them. You can also define outliers as points 2–3 standard deviations away from the mean, or more than 1.5 times the interquartile range (the difference between the 1st and 3rd quartiles) beyond either quartile. Outliers come in two types: those caused by human or machine error (someone mistyped, or the machine somehow spat out an impossible number) and those that are genuine and not generated by any error. The former should be omitted, but the latter should be kept, as they bring insight into the data set.
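
A minimal sketch of the IQR-based screen (Tukey's rule), with made-up data:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 10, 13, 12])  # 95 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print("outliers:", outliers)
```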

17. How would you use either the extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?

Monte Carlo simulation is useful for estimating probabilities, but with rare events the estimate becomes meaningless unless the simulation size is enormous. Hence, importance sampling is used: sample from a different distribution under which the rare event happens more often, then reweight the results by the ratio of the densities.
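
A toy sketch, assuming we want \(P(X > 5)\) for a standard normal (about \(2.9 \times 10^{-7}\), far too rare for naive Monte Carlo):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000
threshold = 5.0

# Naive Monte Carlo: almost no samples ever land in the tail.
naive = (rng.standard_normal(n) > threshold).mean()

# Importance sampling: draw from a proposal shifted into the tail,
# then reweight each sample by the density ratio p(x)/q(x).
x = rng.normal(loc=threshold, scale=1.0, size=n)
weights = norm.pdf(x) / norm.pdf(x, loc=threshold, scale=1.0)
is_estimate = ((x > threshold) * weights).mean()

print("naive:", naive, "importance sampling:", is_estimate, "truth:", norm.sf(threshold))
```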

18. What is a recommendation engine? How does it work?

A recommendation engine is an algorithm that finds content a user is likely to prefer based on their past behavior. There are two major approaches: finding products similar to the ones the user values highly (content-based filtering), or finding similar users and recommending the products they have purchased (collaborative filtering).
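
A toy sketch of user-based collaborative filtering with cosine similarity (the rating matrix is made up):

```python
import numpy as np

# Hypothetical user x item rating matrix (0 = not rated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Find the user most similar to the target user...
target = 0
others = [u for u in range(len(ratings)) if u != target]
sims = [cosine_sim(ratings[target], ratings[u]) for u in others]
neighbor = others[int(np.argmax(sims))]

# ...and recommend items the neighbor rated highly that the target hasn't rated.
unseen = np.where(ratings[target] == 0)[0]
recommend = unseen[np.argsort(-ratings[neighbor][unseen])]
print("recommend items:", recommend)
```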

19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?

See 10. Same question.

20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimensions in a chart (or in a video)?

I mainly use Jupyter since I code a lot in Python. It's a great way to visualize your thought process while dealing with a data set. Tableau is good for getting a sense of what the data looks like before diving into deeper exploration. If the analysis doesn't involve sophisticated algorithms on big data, Tableau is the way to go, as it also provides great ways to share visualizations with the business team. R is very similar: it is slow compared to Python and other languages, but it makes it very easy to get simple solutions quickly. I don't know much about SAS. As for five dimensions, you can map them to x, y, color, size, and marker shape (or animation frames in a video), as sketched below.
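
A matplotlib sketch of the five-dimension idea (all data here is synthetic): position gives two dimensions, color a third, marker size a fourth, and marker shape a fifth.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100
d1, d2 = rng.random(n), rng.random(n)   # dimensions 1-2: x/y position
d3 = rng.random(n)                      # dimension 3: color
d4 = 300 * rng.random(n) + 20           # dimension 4: marker size
d5 = rng.integers(0, 2, n)              # dimension 5: marker shape (binary here)

fig, ax = plt.subplots()
for shape, marker in [(0, "o"), (1, "^")]:
    mask = d5 == shape
    sc = ax.scatter(d1[mask], d2[mask], c=d3[mask], s=d4[mask],
                    marker=marker, cmap="viridis", vmin=0, vmax=1)
fig.colorbar(sc, ax=ax, label="dimension 3")
plt.show()
```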

That’s it! Again, feel free to leave comments if you find any mistakes.
