2024 Hell-Mode US DS/MLE Job Search, Part 4: How to Prepare for Machine Learning & Statistics Interviews

Bert Lee // 李慕家
7 min read · Jun 8, 2024


This post covers the kinds of ML and Statistics questions you can expect when interviewing for Data roles in the US, along with how I prepared for them.

Outline:

Part 4: How to Prepare for the ML Knowledge, Statistics, ML Design, and ML Coding Interviews

* 1. What ML and Statistics questions do DA, DS, and MLE candidates face in interviews?
* 2. How to prepare for the Statistics interview?
* 3. How to prepare for the ML Knowledge interview?
* 4. How to prepare for the ML Design interview?
* 5. How to prepare for the ML Coding interview?

1. What ML and Statistics Questions Do DA, DS, and MLE Candidates Face in Interviews?

The difficulty of the ML questions generally increases in order from DA to DS to MLE. On the Statistics side, DA interviews lean more toward A/B-testing-related statistics, MLE interviews lean more toward modeling-related statistics, and DS sits between the two.

As for the types of questions: DA interviews mostly cover basic statistical knowledge and calculations. DS and MLE interviews add ML knowledge on top of basic statistics, such as comparing models and their properties, writing out or even deriving key formulas, ML case studies, and ML coding, including model implementation from scratch and PyTorch modeling. For MLE/AS (Applied Scientist) roles, you may additionally face system design, research-ability questions, and more ML domain-specific questions (NLP, LLM, CV, RecSys, etc.) than a typical DS would.

2. How to Prepare for the Statistics Interview?

For typical DA and DS interviews, the Statistics content won't go beyond your first undergraduate statistics course; being solid on Statistics 101 is mostly enough. Meta's DSA loop, for example, states its scope very clearly: basics of descriptive statistics, common distributions (binomial, normal), Law of Large Numbers, Central Limit Theorem, and Bayes' Theorem. They are equally clear about what is not tested: (1) Advanced mathematics: no calculus or complex statistical models. (2) Complex distributions: excludes exponential, Weibull, Beta distributions.
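Since Bayes' Theorem is explicitly in scope, it's worth being able to run the classic base-rate question cold. A minimal sketch, with made-up numbers, of the usual disease-test calculation:

# Classic Bayes' Theorem warm-up (numbers are made up): a disease affects
# 1% of people; the test has 95% sensitivity and a 5% false-positive rate.
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_d = 0.01         # prior: P(disease)
p_pos_d = 0.95     # sensitivity: P(positive | disease)
p_pos_nd = 0.05    # false-positive rate: P(positive | no disease)

p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)  # law of total probability
posterior = p_pos_d * p_d / p_pos
print(f"P(disease | positive) = {posterior:.3f}")  # ~0.161, despite 95% sensitivity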

Based on interview reports for CVS and Walmart, the main topics are:

Fundamental Concepts:

  • Descriptive Statistics: Mean, variance, confidence intervals.
  • Probability Distributions: Focus on binomial and normal distributions.
  • Hypothesis Testing: Statistical power, p-values, A/B testing.

Analytical Techniques:

  • Regression Analysis: Linear and logistic models, interpreting coefficients and R-squared values.
  • Experiment Design: Factors affecting experiment size, design strategies.

Additional Topics:

  • Error Types: Differentiation between Type I and Type II errors.
  • Statistical Applications: Handling missing values, causal inference without experiments, practical problems involving probability calculations.

A few simple examples:

  • How do you explain a p-value to non-technical stakeholders, and why is the 0.05 threshold commonly used?
  • What is a Normal Distribution?
  • What are Type I and Type II errors?
  • What is statistical power, and why is it important in hypothesis testing?
  • Tell me about Bayesian Theory.
  • How do you design and analyze an A/B test?
  • How do you decide the duration to run an experiment?

The concepts tested are usually not difficult, but for DA and DS roles many companies put real weight on communication: expect to be asked how you would explain these statistical concepts and results to business stakeholders in layman's terms.
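To make the A/B-testing questions concrete, here is a minimal sketch, with made-up numbers, of how you might test a difference in conversion rates and size an experiment; proportions_ztest, proportion_effectsize, and NormalIndPower are standard statsmodels utilities:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Hypothetical A/B results: conversions and sample sizes, control vs. treatment
conversions = np.array([120, 150])
samples = np.array([2400, 2500])

# Two-proportion z-test: is the difference in conversion rate significant?
z_stat, p_value = proportions_ztest(conversions, samples)
print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")

# Sample size per variant to detect a lift from 5% to 6%
# at alpha = 0.05 with 80% power
effect = proportion_effectsize(0.06, 0.05)
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"Needed per group: {n_per_group:.0f}")

# Duration follows from sample size: divide n_per_group by the expected
# daily traffic per variant to estimate how long to run the experiment.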

3. How to Prepare for the ML Knowledge/Fundamentals Interview?

In my earlier interviews, ML questions usually started from whatever models appeared on my resume. In the US, however, it's more common for interviewers to ignore the resume entirely and fire away from what amounts to a standard bank of canned ML questions. They won't expect you to know every model in depth; for dimensionality reduction, say, with PCA and t-SNE, you may not need to explain both in full detail, but you should be able to answer for at least one.
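As a quick refresher on that PCA vs. t-SNE example, here is a minimal sklearn sketch on synthetic data. The practical contrast worth articulating: PCA is a linear projection that can be fit once and applied to new points via transform, while t-SNE is a nonlinear embedding that has to be re-fit for each dataset:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features (synthetic)

# PCA: linear projection onto the directions of maximal variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# t-SNE: nonlinear embedding that preserves local neighborhoods;
# there is no transform() for unseen points, so it is re-fit per dataset
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_pca.shape, X_tsne.shape)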

For these canned ML questions, the best-organized collection I've seen online is the series below:

Someone online has also shared answers to part of them:

Using that series as my base, I went through the fundamentals topic by topic with ChatGPT, discussing and reviewing each concept and branching into follow-up questions, while collecting related videos, articles, and questions I ran into. ChatGPT made reviewing dramatically more efficient. Below are the frequently tested ML interview topics I compiled:

Of course, before grinding these questions you need a reasonable ML foundation. I've been working with ML and NLP since 2019, and at Yale I had three courses in a single semester that covered Attention, so I was already somewhat familiar with these canned questions; going through them was mainly review and gap-filling. If your ML fundamentals are weaker, it's better to first take an ML course and build the foundation systematically.

4. How to Prepare for the ML Design Interview?

Compared with people targeting MLE/AS roles, my preparation here was less thorough, so I'll simply share the flow I usually follow when answering:

  1. Clarify the Problem: First confirm what problem we're solving, what the goal is, and what resources are available.
  2. Discuss Data / Explore Features: Walk through the available features with the interviewer in a structured way: which categories of features to consider and which are likely to help. The features you propose are where you show you truly understand the problem and have product sense. This is also a good place to discuss handling different kinds of data, class imbalance, missing values, and so on.
  3. Model Selection & Feature Engineering: List a few candidate models, build a quick baseline with a simpler one, then move to more complex models: describe their structure and why they might perform better. Discuss how to engineer features for each model, what the prediction target is, how to post-process the final output, how to prevent overfitting, and what techniques might help the model learn better.
  4. Evaluation: How to do offline and online evaluation, which metrics to watch, how to interpret the results, how to debug the problems you hit, and finally how to decide whether to ship the model (see the metrics sketch after this list).
  5. Deployment: My experience here is thinner and I wasn't asked much about it in interviews; the main considerations are how to ship the model and how to handle monitoring and iteration.
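For the offline-evaluation step, here is a minimal sklearn sketch of the usual classification metrics, on made-up labels and scores; which metric matters most depends on the product question:

import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, precision_score, recall_score

# Hypothetical offline evaluation: true labels vs. model scores
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.1, 0.7])
y_pred = (y_score >= 0.5).astype(int)  # threshold at 0.5

print("AUC:      ", roc_auc_score(y_true, y_score))   # ranking quality
print("Log loss: ", log_loss(y_true, y_score))        # sensitive to calibration
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))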

Here is a case I actually worked through:

Machine Learning Design Case: Predicting Mobile App Install Rates

Interviewer: Let’s discuss a scenario where we want to predict the install rate of an app from mobile ads. Which features would you consider for building this model?

Candidate: I would consider various categories of features. Firstly, user-specific features like what apps the user has previously installed. Performance history, especially install rates on similar ads, would also be crucial. Content and contextual details could provide additional insights.

Interviewer: How would you integrate a user’s installed apps into the model, especially considering the potential high dimensionality if there are thousands of apps?

Candidate: A basic approach is one-hot encoding, but that might lead to excessively high dimensions. An alternative could be to use app embeddings, which represent each app in a lower-dimensional space. While this increases the number of parameters, using indexing can reduce the input dimensionality into the model.

Interviewer: Exactly, that’s a good approach. Have you implemented something similar in your previous projects?

Candidate: Yes, I have used attribute embeddings that are randomly initialized and then refined through feature learning, effectively capturing the essence of each app.

Interviewer: Now, if we have the history of the last 10 apps installed by a user, how would you use this information in your model?

Candidate: The relevance of these past apps might vary; recent or similar apps could be more influential. I’d use an attention mechanism, as seen in the Deep Interest Network (DIN), to weigh the importance of each past app relative to the target app. This helps the model focus on more relevant features.

Interviewer: Good thinking. Just a couple of quick questions to wrap up. For a binary classification like this, what activation function would you use in the last layer?

Candidate: For binary classification, I’d use a sigmoid activation function since it maps the output between 0 and 1, making it suitable for binary outcomes.

Interviewer: What about the optimizer? Which one would you start with?

Candidate: I’d start with Stochastic Gradient Descent (SGD) and then experiment with Adam. Adam adjusts learning rates based on the first and second moments of the gradients, which can be more effective in practice.

Interviewer: Can you differentiate between SGD with momentum and Adam?

Candidate: SGD with momentum keeps a running average of past gradients (a first moment) to smooth the update direction, but its learning rate stays fixed. Adam additionally tracks the second moment of the gradients (their recent squared magnitudes) and uses it to adapt the learning rate per parameter, which makes it more responsive to changes in the gradients' scale.

Interviewer: That’s quite detailed, great job. These approaches sound robust for handling the complexities of predicting app install rates from mobile ads.
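To tie the case together, here is a hypothetical PyTorch sketch of the ideas discussed above: app-ID embeddings looked up by index, DIN-style attention that weighs the last 10 installed apps against the target app, and a sigmoid output for the install probability. The names, sizes, and dot-product scoring are illustrative assumptions, not the exact interview solution:

import torch
import torch.nn as nn
import torch.nn.functional as F

class InstallRateModel(nn.Module):
    """Hypothetical sketch: app embeddings + attention over install history."""
    def __init__(self, num_apps=10000, emb_dim=32):
        super().__init__()
        self.app_emb = nn.Embedding(num_apps, emb_dim)  # app index -> dense vector
        self.out = nn.Linear(2 * emb_dim, 1)

    def forward(self, target_app, history):  # history: (batch, 10) app indices
        t = self.app_emb(target_app)                        # (batch, emb_dim)
        h = self.app_emb(history)                           # (batch, 10, emb_dim)
        # Attention: score each past app by similarity to the target app
        scores = torch.bmm(h, t.unsqueeze(-1)).squeeze(-1)  # (batch, 10)
        weights = F.softmax(scores, dim=1)
        interest = (weights.unsqueeze(-1) * h).sum(dim=1)   # weighted history
        logit = self.out(torch.cat([t, interest], dim=1))
        return torch.sigmoid(logit).squeeze(-1)             # install probability

model = InstallRateModel()
target = torch.randint(0, 10000, (4,))   # 4 target app IDs
hist = torch.randint(0, 10000, (4, 10))  # last 10 installs per user
print(model(target, hist).shape)         # torch.Size([4])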

Finally, a few suggestions from the 1point3acres forum on how to answer ML case design questions:

There are also some resources I've seen recommended but haven't used myself:

https://research.facebook.com/blog/2018/05/the-facebook-field-guide-to-machine-learning-video-series/

5. How to Prepare for the ML Coding Interview?


For MLE/AS roles, and even some of the more hardcore DS roles, you may be asked to implement a model from scratch: no sklearn or PyTorch, just numpy, to implement a model or algorithm. In practice, though, the ones that come up are usually limited to a familiar few:

Unsupervised Models:

  • KMeans:
import numpy as np

def initialize_centroids(X, k):
    """Randomly pick k points from the dataset X as initial centroids."""
    indices = np.random.permutation(X.shape[0])
    centroids = X[indices[:k]]
    return centroids

def closest_centroid(X, centroids):
    """For each point in X, find the index of the closest centroid."""
    distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)

def update_centroids(X, labels, k):
    """Recalculate each centroid as the mean of its assigned points."""
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return new_centroids

def kmeans(X, k, max_iters=100):
    """The main k-means loop: assign points, update centroids, repeat."""
    centroids = initialize_centroids(X, k)
    for i in range(max_iters):
        labels = closest_centroid(X, centroids)
        new_centroids = update_centroids(X, labels, k)
        # Check for convergence (centroids no longer change)
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage
# Generate some data
np.random.seed(42)
X = np.random.rand(100, 2)

# Perform k-means clustering
k = 3
centroids, labels = kmeans(X, k)

print("Centroids:", centroids)

Supervised Models:

  • Logistic Regression
import numpy as np

# Sigmoid function to map predicted values to probabilities
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Binary cross-entropy loss
def compute_loss(y, y_hat):
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Gradient descent to update the parameters
def gradient_descent(X, y, params, learning_rate, iterations):
    m = len(y)
    loss_history = np.zeros((iterations,))

    for i in range(iterations):
        # Calculate predictions
        y_hat = sigmoid(np.dot(X, params))
        # Update parameters using the gradient of the loss
        params -= learning_rate * np.dot(X.T, y_hat - y) / m
        # Save loss
        loss_history[i] = compute_loss(y, y_hat)

    return params, loss_history

# Predict by thresholding the sigmoid output at 0.5
def predict(X, params):
    return np.round(sigmoid(np.dot(X, params)))

# Generate synthetic data
X = np.random.rand(100, 2)        # 100 samples and 2 features
y = np.random.randint(0, 2, 100)  # Binary targets

# Add intercept term to the feature matrix
X = np.hstack((np.ones((X.shape[0], 1)), X))

# Initialize parameters to zero
params = np.zeros(X.shape[1])

# Set learning rate and number of iterations
learning_rate = 0.01
iterations = 1000

# Perform gradient descent
params, loss_history = gradient_descent(X, y, params, learning_rate, iterations)

# Predict
predictions = predict(X, params)

# Calculate accuracy
accuracy = np.mean(predictions == y)

print(f"Accuracy: {accuracy}")
  • (Multiple) Linear Regression
import numpy as np

def multiple_linear_regression(X, y):
    # Add a column of ones for the intercept term (b_0)
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])

    # Normal Equation: theta = (X^T X)^+ X^T y
    # (pinv is used instead of inv so the example still works when the
    # features are collinear, as they are in the toy data below)
    theta_best = np.linalg.pinv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

    return theta_best  # First element is the intercept, the rest are coefficients

# Example usage:
X = np.array([
    [1, 2],  # Two features for each data point
    [2, 3],
    [3, 4],
    [4, 5],
    [5, 6],
])
y = np.array([5, 7, 9, 11, 13])  # Target values

# Train the model to find the intercept and coefficients
theta_best = multiple_linear_regression(X, y)

print(f"Intercept and coefficients: {theta_best}")

# Predict function using the derived coefficients
def predict(X, theta_best):
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # Add the intercept term
    return X_b.dot(theta_best)

# Predicting values
X_new = np.array([
    [6, 7],
    [7, 8],
])  # New data points
predictions = predict(X_new, theta_best)

print(f"Predictions: {predictions}")

Sorting:

Sometimes you'll be asked to implement a sorting algorithm. Here is Insertion Sort as an example:

def insertion_sort(arr):
    # Traverse from index 1 through the end of the list
    for i in range(1, len(arr)):

        key = arr[i]

        # Move elements of arr[0..i-1] that are greater than key
        # one position ahead of their current position
        j = i - 1
        while j >= 0 and key < arr[j]:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key

    return arr

# Example usage
my_list = [64, 34, 25, 12, 22, 11, 90]
sorted_list = insertion_sort(my_list)
print("Sorted list:", sorted_list)

I've also seen people asked to implement Attention or a CNN, though I never ran into those myself.
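As a reference for the Attention case, here is a minimal numpy sketch of scaled dot-product attention, which is usually the core of an "implement Attention from scratch" question (the shapes and data are illustrative):

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    weights = softmax(scores, axis=-1)
    return weights @ V

# Example: one "sentence" of 4 tokens with dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)

For more of these implement-a-model-from-scratch exercises, you can refer to: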

Besides implementing models from scratch, I've also hit PyTorch fill-in-the-blank questions in ML coding rounds: you may be asked to implement an entire class in PyTorch and then debug the model pipeline to get training working. I hadn't prepared for this at all and failed badly; I had grown too dependent on ChatGPT day to day. These questions genuinely test hands-on PyTorch/TensorFlow experience, and skimming a cheatsheet is probably not enough.

https://www.datacamp.com/cheat-sheet/deep-learning-with-py-torch
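For this kind of round, being able to write a bare-bones PyTorch training loop from memory is the baseline. A minimal sketch on toy data (the model, data, and hyperparameters are made up purely for illustration):

import torch
import torch.nn as nn

# Toy binary-classification data (purely illustrative)
torch.manual_seed(0)
X = torch.randn(256, 10)
y = (X.sum(dim=1) > 0).float()

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss()  # numerically stable sigmoid + binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(50):
    optimizer.zero_grad()   # reset accumulated gradients
    logits = model(X).squeeze(-1)
    loss = loss_fn(logits, y)
    loss.backward()         # backpropagate
    optimizer.step()        # update parameters

with torch.no_grad():
    acc = ((model(X).squeeze(-1) > 0).float() == y).float().mean()
print(f"Final loss: {loss.item():.4f}, accuracy: {acc.item():.2f}")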

6. Closing Thoughts

Looking for DS jobs in the US this round, I think ML gave me an edge. Among DS candidates with roughly 3 to 5 years of experience, my A/B testing and experiment design experience is on the weaker side while my modeling experience is comparatively strong, and I believe several of my opportunities reached final rounds largely because of that solid ML experience. I also noticed the market may hold more MLE openings than DA/DS ones. So if you have substantial ML experience, preparing the ML-related interview questions well will open better opportunities for you. As for Statistics, it is a crucial foundation for DA, DS, MLE, and AS alike, and none of them can afford to neglect it.

Finally, here are two MLE-focused job-search write-ups:

7. Series Wrap-Up

All four Medium posts in this 2024 DS/MLE job-search series are now published. If you've enjoyed them, please support and follow! If time and opportunity allow in the future, I may set aside some time to review resumes and offer advice, but for now my focus is on settling into life in New York. I hope to connect with and learn from the great people and fellow students nearby!


Bert Lee // 李慕家

Seek & Find | MSDS @Yale | Former Data Scientist @Disney+ & @DBS Bank | NTU Alumni | LinkedIn: https://www.linkedin.com/in/bertmclee/