RM Full Stack & AI Engineer
AI & ML · interview questions

Machine Learning Interview Questions

Common Machine Learning interview questions spanning beginner to advanced, covering core concepts, algorithms, model evaluation, regularization, deep learning, and practical ML engineering.

1. What is the difference between supervised, unsupervised, and reinforcement learning?

beginner

Supervised learning trains on labeled data to predict outputs; unsupervised learning finds patterns in unlabeled data (e.g., clustering, dimensionality reduction); reinforcement learning trains an agent to make sequential decisions by maximizing cumulative reward through environment interaction.

2. What is overfitting and how do you prevent it?

beginner

Overfitting occurs when a model learns the noise and incidental detail in its training data, so it performs well on the training set but poorly on unseen data. Prevention techniques include regularization (L1/L2), dropout, early stopping, pruning (for tree models), and collecting more training data, with cross-validation used to detect the problem.

3. What is the bias-variance tradeoff?

beginner

Bias is error from overly simplistic assumptions; variance is error from sensitivity to small fluctuations in training data. High bias causes underfitting; high variance causes overfitting. The goal is to find a model complexity that minimizes total error by balancing the two.
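
The tradeoff can be written as a decomposition of expected squared error. This is a sketch under the usual textbook assumptions, none of which are stated in the answer above: the target is y = f(x) + ε with zero-mean noise of variance σ², and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Increasing model complexity usually lowers the first term and raises the second; the third term cannot be reduced by any model.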

4. What is cross-validation and why is it used?

beginner

Cross-validation (most commonly k-fold) splits the data into k subsets (folds), trains on k-1 folds and validates on the remaining fold, and repeats this k times so that each fold serves as the validation set once. Averaging the k scores gives a more reliable estimate of generalization performance than a single train/test split.
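
A minimal sketch of how the k-fold index split might look in plain Python. The helper name `k_fold_indices` is hypothetical, and the folds are contiguous; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Each sample lands in exactly one validation fold, so every data point is used for both training and validation across the k rounds.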

5. Explain the difference between precision and recall.

beginner

Precision is the fraction of predicted positives that are truly positive (TP / (TP + FP)); recall is the fraction of actual positives correctly identified (TP / (TP + FN)). Precision matters when false positives are costly; recall matters when false negatives are costly.
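
As an illustration, both metrics can be computed from raw labels as below. The function name and the toy labels are made up for the example, and the zero-denominator convention (return 0.0) is one common choice, not the only one.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 4 actual positives, 3 predicted positives, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # precision = 2/3, recall = 0.5
```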

6. What is the F1 score and when would you use it over accuracy?

beginner

The F1 score is the harmonic mean of precision and recall: 2*(precision*recall)/(precision+recall). It is preferred over accuracy when classes are imbalanced, because accuracy can be misleadingly high if the model simply predicts the majority class.
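
The imbalance problem is easy to reproduce. The sketch below uses invented data (95 negatives, 5 positives) and a model that always predicts the majority class; `f1_score` here is a local helper, not a library call.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, with 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
f1 = f1_score(tp, fp, fn)
print(accuracy, f1)  # accuracy is 0.95 while F1 is 0.0
```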

7. What is gradient descent and how does it work?

beginner

Gradient descent is an optimization algorithm that iteratively adjusts model parameters in the direction of the negative gradient of the loss function to minimize it. Variants include batch, stochastic (SGD), and mini-batch gradient descent, which differ in how many samples are used per update.
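
A minimal sketch of the update rule on a one-parameter problem. The loss (w - 3)^2, the learning rate, and the step count are arbitrary choices for illustration; a real model would compute the gradient from a batch of data.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


def loss_grad(w):
    # Gradient of the toy loss (w - 3)^2, which is minimized at w = 3.
    return 2.0 * (w - 3.0)


w_min = gradient_descent(loss_grad, w0=0.0)
print(w_min)  # converges toward 3.0
```

With this learning rate each step shrinks the distance to the minimum by a factor of 0.8; a rate above 1.0 on this loss would make the iterates diverge.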

8. What is regularization? Explain L1 vs L2.

intermediate

Regularization adds a penalty term to the loss function to discourage model complexity and reduce overfitting. L1 (Lasso) adds the sum of absolute weights, encouraging sparsity by driving some weights to zero; L2 (Ridge) adds the sum of squared weights, shrinking all weights but rarely to exactly zero.
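
One way to see why L1 produces exact zeros and L2 does not is to compare their shrinkage (proximal) steps. This is a sketch with a hypothetical strength `lam` and made-up weights; it is not how any particular library implements Lasso or Ridge.

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)


def l1_shrink(w, lam):
    """Proximal step for lam * |w|: move toward zero by lam, clipping at zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def l2_shrink(w, lam):
    """Proximal step for lam * w**2: scale the weight down, never exactly to zero."""
    return w / (1.0 + 2.0 * lam)


weights = [2.0, -0.05, 0.3, 0.0]
lam = 0.1
l1_steps = [l1_shrink(w, lam) for w in weights]
l2_steps = [l2_shrink(w, lam) for w in weights]
print(l1_steps)  # the small weight -0.05 becomes exactly 0.0
print(l2_steps)  # every nonzero weight shrinks but stays nonzero
```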

9. How does a decision tree work, and what are its advantages and disadvantages?

intermediate

A decision tree recursively splits data based on the feature and threshold that maximizes information gain or minimizes impurity (Gini/entropy) at each node. Advantages: interpretable, handles mixed data types, requires little preprocessing. Disadvantages: prone to overfitting and high variance, since small changes in the data can produce a very different tree.
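
The split search at a single node can be sketched as follows for one numeric feature and Gini impurity. The helper names and toy data are illustrative; a full tree would repeat this over all features and recurse on each side.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(xs, ys):
    """Return (threshold, weighted_gini) for the best split of the form x <= threshold."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(xs, ys)
print(threshold, impurity)  # 6.5 separates the classes with impurity 0.0
```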

10. What is the difference between bagging and boosting?

intermediate

Bagging (e.g., Random Forest) trains multiple models in parallel on random bootstrap samples and aggregates predictions to reduce variance. Boosting (e.g., XGBoost, AdaBoost) trains models sequentially, each correcting the errors of the previous one, primarily reducing bias but risking overfitting.

11. Explain how a Support Vector Machine (SVM) works.

intermediate

SVM finds the hyperplane that maximizes the margin between two classes, with support vectors being the data points closest to the boundary. The kernel trick (e.g., RBF, polynomial) implicitly maps data to higher-dimensional spaces to handle non-linear separability.
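
The kernel trick can be checked numerically. The sketch below assumes a degree-2 polynomial kernel k(x, z) = (x·z)^2 on 2-D inputs (a specific choice made for this example) and shows that it equals an ordinary dot product in a 3-D feature space, without the kernel ever building that space.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed directly in the input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit feature map whose dot product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, z)
explicit_value = dot(feature_map(x), feature_map(z))
print(kernel_value, explicit_value)  # both are 121 up to floating-point rounding
```

For kernels such as RBF the corresponding feature space is infinite-dimensional, which is why evaluating the kernel directly is the only practical option.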

12. What is the ROC curve, and what does AUC measure?

intermediate

The ROC curve plots the true positive rate vs. false positive rate at various classification thresholds. AUC (Area Under the Curve) summarizes the curve into a single scalar; an AUC of 1.0 indicates a perfect classifier, while 0.5 indicates no discriminative ability (random chance).
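
AUC also has a ranking interpretation: it is the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. The sketch below computes it that way on toy scores; it is quadratic in the number of samples and meant only for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```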

13. What is the curse of dimensionality?

intermediate

As the number of features increases, the volume of the feature space grows exponentially, making data increasingly sparse and distance metrics less meaningful. This degrades the performance of many ML algorithms and necessitates dimensionality reduction techniques like PCA or feature selection.
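
One concrete symptom is that almost all of the volume of a high-dimensional cube sits near its boundary. The sketch below measures the fraction of a unit hypercube within a margin of 0.05 of any face; the margin is an arbitrary choice for the example.

```python
def shell_fraction(dim, margin=0.05):
    """Fraction of the unit hypercube's volume within `margin` of its boundary."""
    inner_side = 1.0 - 2.0 * margin  # side length of the interior cube
    return 1.0 - inner_side ** dim


for dim in (1, 2, 10, 100):
    print(dim, round(shell_fraction(dim), 4))
```

The fraction is about 10% in one dimension, about 65% in ten, and effectively 100% in a hundred, so uniformly scattered points are nearly all "edge cases" in high dimensions.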

14. What is Principal Component Analysis (PCA) and when would you use it?

intermediate

PCA is a linear dimensionality reduction technique that projects data onto orthogonal axes (principal components) ordered by explained variance. It is used to reduce feature dimensionality, remove multicollinearity, speed up training, and visualize high-dimensional data, but the resulting components lose direct interpretability.
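
A rough sketch of finding the first principal component with power iteration on the covariance matrix, in plain Python. The function name, the toy points, and the fixed iteration count are assumptions for illustration; libraries typically use an SVD instead, and the starting vector here must not be orthogonal to the top component.

```python
import math


def first_principal_component(points, iterations=50):
    """Return (unit direction, explained variance ratio) of the top component."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    v = [1.0 / (j + 1) for j in range(dim)]
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dim))
                     for i in range(dim))
    total_variance = sum(cov[i][i] for i in range(dim))
    return v, eigenvalue / total_variance


# Toy 2-D points lying close to the line y = x.
points = [[2.0, 1.9], [0.5, 0.6], [-1.0, -1.1], [-1.5, -1.4]]
component, ratio = first_principal_component(points)
print(component, ratio)
```

Because the points are nearly collinear, one component captures almost all of the variance, which is the situation where dropping the remaining components costs little.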

15. How does backpropagation work in neural networks?

intermediate

Backpropagation computes the gradient of the loss with respect to each weight by applying the chain rule of calculus, propagating error signals from the output layer backward through the network. These gradients are then used by an optimizer (e.g., SGD, Adam) to update weights and minimize the loss.
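
The chain rule can be traced by hand on a tiny network. The sketch below assumes a made-up architecture (one tanh hidden unit, a linear output, squared-error loss) and checks the backpropagated gradients against finite differences.

```python
import math


def forward(w1, w2, x, y):
    h = math.tanh(w1 * x)          # hidden activation
    y_hat = w2 * h                 # linear output
    loss = 0.5 * (y_hat - y) ** 2  # squared-error loss
    return h, y_hat, loss


def backward(w1, w2, x, y):
    """Apply the chain rule from the loss back to each weight."""
    h, y_hat, _ = forward(w1, w2, x, y)
    d_y_hat = y_hat - y                # dL/dy_hat
    d_w2 = d_y_hat * h                 # dL/dw2
    d_h = d_y_hat * w2                 # dL/dh
    d_w1 = d_h * (1.0 - h ** 2) * x    # tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2


def numerical_grad(w1, w2, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    g1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
    return g1, g2


analytic = backward(0.5, -1.5, 2.0, 1.0)
numeric = numerical_grad(0.5, -1.5, 2.0, 1.0)
print(analytic, numeric)  # the two pairs should agree closely
```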

16. What is the vanishing gradient problem and how is it addressed?

advanced

In deep networks, gradients can become exponentially small as they propagate backward through many layers, making early layers learn very slowly or not at all. Solutions include using ReLU activations instead of sigmoid/tanh, batch normalization, residual/skip connections (ResNets), and careful weight initialization (e.g., He or Xavier).
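
The exponential shrinkage is visible even in a simplified setting. The sketch below multiplies activation derivatives across 20 layers and deliberately ignores the weight matrices, which is an assumption made to isolate the effect of the activation function.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never larger than 0.25


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(activation_grad, pre_activations):
    """Product of activation derivatives along a path through the layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= activation_grad(z)
    return factor


depth = 20
# Best case for sigmoid: every pre-activation is 0, where the derivative peaks at 0.25.
sigmoid_factor = chained_gradient(sigmoid_grad, [0.0] * depth)
# ReLU with positive pre-activations passes the gradient through unchanged.
relu_factor = chained_gradient(relu_grad, [1.0] * depth)
print(sigmoid_factor, relu_factor)
```

Even in the sigmoid's best case the factor is 0.25 to the power of 20, on the order of 1e-12, while the ReLU path keeps a factor of 1.0.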

17. Explain the attention mechanism and why it was transformative for NLP.

advanced

Attention allows a model to weight the relevance of all input tokens when producing each output token, rather than compressing the entire sequence into a fixed vector. This resolved the bottleneck of RNN encoder-decoder architectures for long sequences and became the foundation of Transformer models (BERT, GPT), which dominate modern NLP.
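
A minimal sketch of one common form, scaled dot-product attention as used in Transformers, written in plain Python. The query, key, and value vectors below are invented; real models learn projections that produce them and run many attention heads in parallel.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted average of the values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # relevance of every input token to this query
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, -4.0]]  # points toward the first key and away from the second
outputs, weights = attention(queries, keys, values)
print(weights[0], outputs[0])
```

Every input token contributes to the output in proportion to its weight, so nothing has to be squeezed through a single fixed-size vector.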

18. What is the difference between generative and discriminative models? Give examples.

advanced

Discriminative models (e.g., logistic regression, SVM, neural classifiers) model the conditional distribution P(y|x), or the decision boundary, directly. Generative models (e.g., Naive Bayes, GANs, VAEs, GMMs) learn the joint distribution P(x,y) or P(x) and can generate new data samples; they typically make stronger assumptions about how the data is distributed, but can be used for tasks beyond classification, such as sampling and density estimation.

19. What are common techniques to handle class imbalance?

advanced

Techniques include resampling strategies (oversampling the minority class, e.g., with SMOTE, or undersampling the majority class), adjusting class weights in the loss function, using appropriate metrics (F1, AUC-PR instead of accuracy), threshold tuning, and ensemble methods such as BalancedBaggingClassifier from the imbalanced-learn library. The best approach depends on dataset size and the cost of each error type.
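
As one example of the class-weight approach, the sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes * class_count). The function name and the 90/10 label split are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10
class_weights = balanced_class_weights(labels)
print(class_weights)  # the minority class gets a weight of 5.0, the majority about 0.56
```

Multiplying each sample's loss by its class weight makes an error on a minority example cost as much in total as errors on the majority class, without changing the data itself.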

20. What is the difference between a parametric and a non-parametric model?

advanced

Parametric models (e.g., linear regression, logistic regression, neural networks with a fixed architecture) have a fixed number of parameters set before training, independent of dataset size, and assume a particular functional form. Non-parametric models (e.g., k-NN, kernel SVM, decision trees) grow in complexity with the data and make fewer assumptions about its form, but typically require more data and computation.
