2024 Hell-Mode US DS/MLE Job Search, Part 4: How to Prepare for Machine Learning & Statistics Interviews

Bert Lee // 李慕家
7 min read · Jun 8, 2024


This post covers the kinds of ML and Statistics questions you can expect when interviewing for Data roles in the US, along with how I prepared for them.

Outline:

Part 4: How to Prepare for the ML Knowledge, Statistics, ML Design, and ML Coding Interviews

* 1. What ML and Statistics questions do DA, DS, and MLE candidates face in interviews?
* 2. How to prepare for the Statistics interview?
* 3. How to prepare for the ML Knowledge interview?
* 4. How to prepare for the ML Design interview?
* 5. How to prepare for the ML Coding interview?

1. What ML and Statistics Questions Do DA, DS, and MLE Candidates Face in Interviews?

The difficulty of the ML questions generally increases in order from DA to DS to MLE. On the Statistics side, DA interviews lean more toward A/B-testing-related statistics, MLE interviews lean more toward modeling-related statistics, and DS sits between the two.

As for the types of questions: DA interviews mostly cover basic statistical knowledge and calculations. DS and MLE interviews add ML knowledge on top of basic statistics, such as comparing models and their properties, writing out or even deriving key formulas, ML case studies, and ML coding, including model implementation from scratch and PyTorch modeling. For MLE/AS (Applied Scientist) roles, you may additionally face system design, research-ability questions, and more ML domain-specific questions (NLP, LLM, CV, RecSys, etc.) than a typical DS would.

2. How to Prepare for the Statistics Interview?

For typical DA and DS interviews, the Statistics content won't go beyond your first undergraduate statistics course; being solid on Statistics 101 is mostly enough. Meta's DSA loop, for example, states its scope very clearly: basics of descriptive statistics, common distributions (binomial, normal), Law of Large Numbers, Central Limit Theorem, and Bayes' Theorem. They are equally clear about what is not tested: (1) Advanced mathematics: no calculus or complex statistical models. (2) Complex distributions: excludes exponential, Weibull, Beta distributions.
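Since Bayes' Theorem is explicitly in scope, it's worth being able to run the classic base-rate question cold. A minimal sketch, with made-up numbers, of the usual disease-test calculation:

# Classic Bayes' Theorem warm-up (numbers are made up): a disease affects
# 1% of people; the test has 95% sensitivity and a 5% false-positive rate.
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_d = 0.01         # prior: P(disease)
p_pos_d = 0.95     # sensitivity: P(positive | disease)
p_pos_nd = 0.05    # false-positive rate: P(positive | no disease)

p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)  # law of total probability
posterior = p_pos_d * p_d / p_pos
print(f"P(disease | positive) = {posterior:.3f}")  # ~0.161, despite 95% sensitivity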

Based on interview reports for CVS and Walmart, the main topics are:

Fundamental Concepts:

  • Descriptive Statistics: Mean, variance, confidence intervals.
  • Probability Distributions: Focus on binomial and normal distributions.
  • Hypothesis Testing: Statistical power, p-values, A/B testing.

Analytical Techniques:

  • Regression Analysis: Linear and logistic models, interpreting coefficients and R-squared values.
  • Experiment Design: Factors affecting experiment size, design strategies.

Additional Topics:

  • Error Types: Differentiation between Type I and Type II errors.
  • Statistical Applications: Handling missing values, causal inference without experiments, practical problems involving probability calculations.

A few simple examples:

  • How do you explain a p-value to non-technical stakeholders, and why is the 0.05 threshold commonly used?
  • What is a Normal Distribution?
  • What are Type I and Type II errors?
  • What is statistical power, and why is it important in hypothesis testing?
  • Tell me about Bayesian Theory.
  • How do you design and analyze an A/B test?
  • How do you decide the duration to run an experiment?

The concepts tested are usually not difficult, but for DA and DS roles many companies put real weight on communication: expect to be asked how you would explain these statistical concepts and results to business stakeholders in layman's terms.
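To make the A/B-testing questions concrete, here is a minimal sketch, with made-up numbers, of how you might test a difference in conversion rates and size an experiment; proportions_ztest, proportion_effectsize, and NormalIndPower are standard statsmodels utilities:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Hypothetical A/B results: conversions and sample sizes, control vs. treatment
conversions = np.array([120, 150])
samples = np.array([2400, 2500])

# Two-proportion z-test: is the difference in conversion rate significant?
z_stat, p_value = proportions_ztest(conversions, samples)
print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")

# Sample size per variant to detect a lift from 5% to 6%
# at alpha = 0.05 with 80% power
effect = proportion_effectsize(0.06, 0.05)
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"Needed per group: {n_per_group:.0f}")

# Duration follows from sample size: divide n_per_group by the expected
# daily traffic per variant to estimate how long to run the experiment.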

3. How to Prepare for the ML Knowledge/Fundamentals Interview?

In my earlier interviews, ML questions usually started from whatever models appeared on my resume. In the US, however, it's more common for interviewers to ignore the resume entirely and fire away from what amounts to a standard bank of canned ML questions. They won't expect you to know every model in depth; for dimensionality reduction, say, with PCA and t-SNE, you may not need to explain both in full detail, but you should be able to answer for at least one.
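As a quick refresher on that PCA vs. t-SNE example, here is a minimal sklearn sketch on synthetic data. The practical contrast worth articulating: PCA is a linear projection that can be fit once and applied to new points via transform, while t-SNE is a nonlinear embedding that has to be re-fit for each dataset:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features (synthetic)

# PCA: linear projection onto the directions of maximal variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# t-SNE: nonlinear embedding that preserves local neighborhoods;
# there is no transform() for unseen points, so it is re-fit per dataset
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_pca.shape, X_tsne.shape)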

For these canned ML questions, the best-organized collection I've seen online is the series below:

Someone online has also shared answers to part of them:

Using that series as my base, I went through the fundamentals topic by topic with ChatGPT, discussing and reviewing each concept and branching into follow-up questions, while collecting related videos, articles, and questions I ran into. ChatGPT made reviewing dramatically more efficient. Below are the frequently tested ML interview topics I compiled:

Of course, before grinding these questions you need a reasonable ML foundation. I've been working with ML and NLP since 2019, and at Yale I had three courses in a single semester that covered Attention, so I was already somewhat familiar with these canned questions; going through them was mainly review and gap-filling. If your ML fundamentals are weaker, it's better to first take an ML course and build the foundation systematically.

4. How to Prepare for the ML Design Interview?

Compared with people targeting MLE/AS roles, my preparation here was less thorough, so I'll simply share the flow I usually follow when answering:

  1. Clarify the Problem: First confirm what problem we're solving, what the goal is, and what resources are available.
  2. Discuss Data / Explore Features: Walk through the available features with the interviewer in a structured way: which categories of features to consider and which are likely to help. The features you propose are where you show you truly understand the problem and have product sense. This is also a good place to discuss handling different kinds of data, class imbalance, missing values, and so on.
  3. Model Selection & Feature Engineering: List a few candidate models, build a quick baseline with a simpler one, then move to more complex models: describe their structure and why they might perform better. Discuss how to engineer features for each model, what the prediction target is, how to post-process the final output, how to prevent overfitting, and what techniques might help the model learn better.
  4. Evaluation: How to do offline and online evaluation, which metrics to watch, how to interpret the results, how to debug the problems you hit, and finally how to decide whether to ship the model (see the metrics sketch after this list).
  5. Deployment: My experience here is thinner and I wasn't asked much about it in interviews; the main considerations are how to ship the model and how to handle monitoring and iteration.
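For the offline-evaluation step, here is a minimal sklearn sketch of the usual classification metrics, on made-up labels and scores; which metric matters most depends on the product question:

import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, precision_score, recall_score

# Hypothetical offline evaluation: true labels vs. model scores
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.1, 0.7])
y_pred = (y_score >= 0.5).astype(int)  # threshold at 0.5

print("AUC:      ", roc_auc_score(y_true, y_score))   # ranking quality
print("Log loss: ", log_loss(y_true, y_score))        # sensitive to calibration
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))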

Here is a case I actually worked through:

Machine Learning Design Case: Predicting Mobile App Install Rates

Interviewer: Let’s discuss a scenario where we want to predict the install rate of an app from mobile ads. Which features would you consider for building this model?

Candidate: I would consider various categories of features. Firstly, user-specific features like what apps the user has previously installed. Performance history, especially install rates on similar ads, would also be crucial. Content and contextual details could provide additional insights.

Interviewer: How would you integrate a user’s installed apps into the model, especially considering the potential high dimensionality if there are thousands of apps?

Candidate: A basic approach is one-hot encoding, but that might lead to excessively high dimensions. An alternative could be to use app embeddings, which represent each app in a lower-dimensional space. While this increases the number of parameters, using indexing can reduce the input dimensionality into the model.

Interviewer: Exactly, that’s a good approach. Have you implemented something similar in your previous projects?

Candidate: Yes, I have used attribute embeddings that are randomly initialized and then refined through feature learning, effectively capturing the essence of each app.

Interviewer: Now, if we have the history of the last 10 apps installed by a user, how would you use this information in your model?

Candidate: The relevance of these past apps might vary; recent or similar apps could be more influential. I’d use an attention mechanism, as seen in the Deep Interest Network (DIN), to weigh the importance of each past app relative to the target app. This helps the model focus on more relevant features.

Interviewer: Good thinking. Just a couple of quick questions to wrap up. For a binary classification like this, what activation function would you use in the last layer?

Candidate: For binary classification, I’d use a sigmoid activation function since it maps the output between 0 and 1, making it suitable for binary outcomes.

Interviewer: What about the optimizer? Which one would you start with?

Candidate: I’d start with Stochastic Gradient Descent (SGD) and then experiment with Adam. Adam adjusts learning rates based on the first and second moments of the gradients, which can be more effective in practice.

Interviewer: Can you differentiate between SGD with momentum and Adam?

Candidate: SGD with momentum keeps a running average of past gradients (a first moment) to smooth the update direction, but its learning rate stays fixed. Adam additionally tracks the second moment of the gradients (their recent squared magnitudes) and uses it to adapt the learning rate per parameter, which makes it more responsive to changes in the gradients' scale.

Interviewer: That’s quite detailed, great job. These approaches sound robust for handling the complexities of predicting app install rates from mobile ads.
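To tie the case together, here is a hypothetical PyTorch sketch of the ideas discussed above: app-ID embeddings looked up by index, DIN-style attention that weighs the last 10 installed apps against the target app, and a sigmoid output for the install probability. The names, sizes, and dot-product scoring are illustrative assumptions, not the exact interview solution:

import torch
import torch.nn as nn
import torch.nn.functional as F

class InstallRateModel(nn.Module):
    """Hypothetical sketch: app embeddings + attention over install history."""
    def __init__(self, num_apps=10000, emb_dim=32):
        super().__init__()
        self.app_emb = nn.Embedding(num_apps, emb_dim)  # app index -> dense vector
        self.out = nn.Linear(2 * emb_dim, 1)

    def forward(self, target_app, history):  # history: (batch, 10) app indices
        t = self.app_emb(target_app)                        # (batch, emb_dim)
        h = self.app_emb(history)                           # (batch, 10, emb_dim)
        # Attention: score each past app by similarity to the target app
        scores = torch.bmm(h, t.unsqueeze(-1)).squeeze(-1)  # (batch, 10)
        weights = F.softmax(scores, dim=1)
        interest = (weights.unsqueeze(-1) * h).sum(dim=1)   # weighted history
        logit = self.out(torch.cat([t, interest], dim=1))
        return torch.sigmoid(logit).squeeze(-1)             # install probability

model = InstallRateModel()
target = torch.randint(0, 10000, (4,))   # 4 target app IDs
hist = torch.randint(0, 10000, (4, 10))  # last 10 installs per user
print(model(target, hist).shape)         # torch.Size([4])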

Finally, a few suggestions from the 1point3acres forum on how to answer ML case design questions:

There are also some resources I've seen recommended but haven't used myself:

https://research.facebook.com/blog/2018/05/the-facebook-field-guide-to-machine-learning-video-series/

5. How to Prepare for the ML Coding Interview?


For MLE/AS roles, and even some of the more hardcore DS roles, you may be asked to implement a model from scratch: no sklearn or PyTorch, just numpy, to implement a model or algorithm. In practice, though, the ones that come up are usually limited to a familiar few:

Unsupervised Models:

  • KMeans:
import numpy as np

def initialize_centroids(X, k):
    """Randomly pick k points from the dataset X as initial centroids."""
    indices = np.random.permutation(X.shape[0])
    centroids = X[indices[:k]]
    return centroids

def closest_centroid(X, centroids):
    """For each point in X, find the index of the closest centroid."""
    distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)

def update_centroids(X, labels, k):
    """Recalculate each centroid as the mean of its assigned points."""
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return new_centroids

def kmeans(X, k, max_iters=100):
    """The main k-means loop: assign points, update centroids, repeat."""
    centroids = initialize_centroids(X, k)
    for i in range(max_iters):
        labels = closest_centroid(X, centroids)
        new_centroids = update_centroids(X, labels, k)
        # Check for convergence (centroids no longer change)
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage
# Generate some data
np.random.seed(42)
X = np.random.rand(100, 2)

# Perform k-means clustering
k = 3
centroids, labels = kmeans(X, k)

print("Centroids:", centroids)

Supervised Models:

  • Logistic Regression
import numpy as np

# Sigmoid function to map predicted values to probabilities
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Binary cross-entropy loss
def compute_loss(y, y_hat):
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Gradient descent to update the parameters
def gradient_descent(X, y, params, learning_rate, iterations):
    m = len(y)
    loss_history = np.zeros((iterations,))

    for i in range(iterations):
        # Calculate predictions
        y_hat = sigmoid(np.dot(X, params))
        # Update parameters using the gradient of the loss
        params -= learning_rate * np.dot(X.T, y_hat - y) / m
        # Save loss
        loss_history[i] = compute_loss(y, y_hat)

    return params, loss_history

# Predict by thresholding the sigmoid output at 0.5
def predict(X, params):
    return np.round(sigmoid(np.dot(X, params)))

# Generate synthetic data
X = np.random.rand(100, 2)        # 100 samples and 2 features
y = np.random.randint(0, 2, 100)  # Binary targets

# Add intercept term to the feature matrix
X = np.hstack((np.ones((X.shape[0], 1)), X))

# Initialize parameters to zero
params = np.zeros(X.shape[1])

# Set learning rate and number of iterations
learning_rate = 0.01
iterations = 1000

# Perform gradient descent
params, loss_history = gradient_descent(X, y, params, learning_rate, iterations)

# Predict
predictions = predict(X, params)

# Calculate accuracy
accuracy = np.mean(predictions == y)

print(f"Accuracy: {accuracy}")
  • (Multiple) Linear Regression
import numpy as np

def multiple_linear_regression(X, y):
    # Add a column of ones for the intercept term (b_0)
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])

    # Normal Equation: theta = (X^T X)^+ X^T y
    # (pinv is used instead of inv so the example still works when the
    # features are collinear, as they are in the toy data below)
    theta_best = np.linalg.pinv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

    return theta_best  # First element is the intercept, the rest are coefficients

# Example usage:
X = np.array([
    [1, 2],  # Two features for each data point
    [2, 3],
    [3, 4],
    [4, 5],
    [5, 6],
])
y = np.array([5, 7, 9, 11, 13])  # Target values

# Train the model to find the intercept and coefficients
theta_best = multiple_linear_regression(X, y)

print(f"Intercept and coefficients: {theta_best}")

# Predict function using the derived coefficients
def predict(X, theta_best):
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # Add the intercept term
    return X_b.dot(theta_best)

# Predicting values
X_new = np.array([
    [6, 7],
    [7, 8],
])  # New data points
predictions = predict(X_new, theta_best)

print(f"Predictions: {predictions}")

Sorting:

Sometimes you'll be asked to implement a sorting algorithm. Here is Insertion Sort as an example:

def insertion_sort(arr):
    # Traverse from index 1 through the end of the list
    for i in range(1, len(arr)):

        key = arr[i]

        # Move elements of arr[0..i-1] that are greater than key
        # one position ahead of their current position
        j = i - 1
        while j >= 0 and key < arr[j]:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key

    return arr

# Example usage
my_list = [64, 34, 25, 12, 22, 11, 90]
sorted_list = insertion_sort(my_list)
print("Sorted list:", sorted_list)

I've also seen people asked to implement Attention or a CNN, though I never ran into those myself.
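As a reference for the Attention case, here is a minimal numpy sketch of scaled dot-product attention, which is usually the core of an "implement Attention from scratch" question (the shapes and data are illustrative):

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    weights = softmax(scores, axis=-1)
    return weights @ V

# Example: one "sentence" of 4 tokens with dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)

For more of these implement-a-model-from-scratch exercises, you can refer to: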

Besides implementing models from scratch, I've also hit PyTorch fill-in-the-blank questions in ML coding rounds: you may be asked to implement an entire class in PyTorch and then debug the model pipeline to get training working. I hadn't prepared for this at all and failed badly; I had grown too dependent on ChatGPT day to day. These questions genuinely test hands-on PyTorch/TensorFlow experience, and skimming a cheatsheet is probably not enough.

https://www.datacamp.com/cheat-sheet/deep-learning-with-py-torch
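For this kind of round, being able to write a bare-bones PyTorch training loop from memory is the baseline. A minimal sketch on toy data (the model, data, and hyperparameters are made up purely for illustration):

import torch
import torch.nn as nn

# Toy binary-classification data (purely illustrative)
torch.manual_seed(0)
X = torch.randn(256, 10)
y = (X.sum(dim=1) > 0).float()

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()  # numerically stable sigmoid + binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(50):
    optimizer.zero_grad()   # reset accumulated gradients
    logits = model(X).squeeze(-1)
    loss = loss_fn(logits, y)
    loss.backward()         # backpropagate
    optimizer.step()        # update parameters

with torch.no_grad():
    acc = ((model(X).squeeze(-1) > 0).float() == y).float().mean()
print(f"Final loss: {loss.item():.4f}, accuracy: {acc.item():.2f}")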

6. Closing Thoughts

Looking for DS jobs in the US this round, I think ML gave me an edge. Among DS candidates with roughly 3 to 5 years of experience, my A/B testing and experiment design experience is on the weaker side while my modeling experience is comparatively strong, and I believe several of my opportunities reached final rounds largely because of that solid ML experience. I also noticed the market may hold more MLE openings than DA/DS ones. So if you have substantial ML experience, preparing the ML-related interview questions well will open better opportunities for you. As for Statistics, it is a crucial foundation for DA, DS, MLE, and AS alike, and none of them can afford to neglect it.

Finally, here are two MLE-focused job-search write-ups:

7. Series Wrap-Up

All four Medium posts in this 2024 DS/MLE job-search series are now published. If you've enjoyed them, please support and follow! If time and opportunity allow in the future, I may set aside some time to review resumes and offer advice, but for now my focus is on settling into life in New York. I hope to connect with and learn from the great people and fellow students nearby!


Bert Lee // 李慕家

Seek & Find | MSDS @Yale | Former Data Scientist @Disney+ & @DBS Bank | NTU Alumni | LinkedIn: https://www.linkedin.com/in/bertmclee/