Common Machine Learning interview questions spanning beginner to advanced, covering core concepts, algorithms, model evaluation, regularization, deep learning, and practical ML engineering.
Supervised learning trains on labeled data to predict outputs; unsupervised learning finds patterns in unlabeled data (e.g., clustering, dimensionality reduction); reinforcement learning trains an agent to make sequential decisions by maximizing cumulative reward through environment interaction.
Overfitting occurs when a model learns noise and incidental detail in the training data, so it performs well on the training set but poorly on unseen data. Prevention techniques include regularization (L1/L2), dropout, early stopping, pruning, and collecting more training data; cross-validation does not prevent overfitting by itself, but it helps detect it and guides model selection.
Bias is error from overly simplistic assumptions; variance is error from sensitivity to small fluctuations in training data. High bias causes underfitting; high variance causes overfitting. The goal is to find a model complexity that minimizes total error by balancing the two.
Cross-validation (most commonly k-fold) splits data into k subsets, trains on k-1 folds and validates on the remaining fold, repeating k times. It provides a more reliable estimate of model generalization performance and reduces the impact of a single train/test split.
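A minimal sketch of the k-fold index split described above, using only the standard library. The function name `k_fold_indices` is an illustrative assumption, and the folds are contiguous with no shuffling; in practice the data is usually shuffled (or stratified) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    # Each sample lands in the validation fold exactly once.
    print(len(train_idx), val_idx)
```

A model would be trained on each `train_idx` and scored on the matching `val_idx`; the k scores are then averaged.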
Precision is the fraction of predicted positives that are truly positive (TP / (TP + FP)); recall is the fraction of actual positives correctly identified (TP / (TP + FN)). Precision matters when false positives are costly; recall matters when false negatives are costly.
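The two formulas can be checked on a small example. This is a sketch with made-up labels; the helper names and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred, positive=1):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative labels: 4 actual positives, 4 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```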
The F1 score is the harmonic mean of precision and recall: 2*(precision*recall)/(precision+recall). It is preferred over accuracy when classes are imbalanced, because accuracy can be misleadingly high if the model simply predicts the majority class.
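To illustrate why accuracy misleads on imbalanced data, the sketch below scores a majority-class predictor on a made-up dataset with 5 positives and 95 negatives. The function names and the choice to return F1 = 0.0 when there are no true positives are assumptions.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_from_labels(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # assumed convention when precision/recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1] * 5 + [0] * 95
majority = [0] * 100                                  # always predicts the majority class
some_recall = [1] * 3 + [0] * 2 + [1] * 2 + [0] * 93  # finds 3 of 5 positives, 2 false alarms

print(accuracy(y_true, majority), f1_from_labels(y_true, majority))
print(accuracy(y_true, some_recall), f1_from_labels(y_true, some_recall))
```

The two predictors have almost the same accuracy, but only the second one has a non-zero F1.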
Gradient descent is an optimization algorithm that iteratively adjusts model parameters in the direction of the negative gradient of the loss function to minimize it. Variants include batch, stochastic (SGD), and mini-batch gradient descent, which differ in how many samples are used per update.
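A minimal sketch of the three variants on a one-parameter model y = w * x with mean squared error. The only difference between them here is `batch_size`; the data, learning rate, and epoch count are illustrative assumptions, and samples are visited in a fixed order (real SGD usually shuffles each epoch).

```python
def gradient_descent(xs, ys, batch_size, lr=0.01, epochs=200):
    """Fit y = w * x by minimizing mean squared error over each batch."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            xb = xs[start:start + batch_size]
            yb = ys[start:start + batch_size]
            # d/dw of mean((w*x - y)^2) over the batch
            grad = sum(2 * (w * x - y) * x for x, y in zip(xb, yb)) / len(xb)
            w -= lr * grad  # step against the gradient
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2

w_batch = gradient_descent(xs, ys, batch_size=len(xs))  # batch: all samples per update
w_mini = gradient_descent(xs, ys, batch_size=2)         # mini-batch
w_sgd = gradient_descent(xs, ys, batch_size=1)          # stochastic: one sample per update
print(w_batch, w_mini, w_sgd)
```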
Regularization adds a penalty term to the loss function to discourage model complexity and reduce overfitting. L1 (Lasso) adds the sum of absolute weights, encouraging sparsity by driving some weights to zero; L2 (Ridge) adds the sum of squared weights, shrinking all weights but rarely to exactly zero.
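The sparsity difference can be seen in a simplified one-dimensional setting, assumed here for illustration: minimizing 0.5 * (w - a)^2 plus the penalty has a closed-form solution for each penalty. With lam * |w| the solution is soft-thresholding, which sets small weights to exactly zero; with lam * w^2 the solution only rescales the weight. The function names and values are hypothetical.

```python
def l1_shrink(a, lam):
    """argmin_w 0.5*(w - a)**2 + lam*abs(w): soft-thresholding."""
    if abs(a) <= lam:
        return 0.0  # small weights are driven exactly to zero
    return a - lam if a > 0 else a + lam


def l2_shrink(a, lam):
    """argmin_w 0.5*(w - a)**2 + lam*w**2: proportional shrinkage."""
    return a / (1 + 2 * lam)


unregularized = [3.0, -0.5, 0.2, -2.0]
lam = 0.5
l1_weights = [l1_shrink(a, lam) for a in unregularized]
l2_weights = [l2_shrink(a, lam) for a in unregularized]
print(l1_weights)  # some entries are exactly 0.0
print(l2_weights)  # all entries shrunk, none exactly 0.0
```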
A decision tree recursively splits the data on the feature and threshold that maximize information gain or, equivalently, minimize impurity (Gini or entropy) at each node. Advantages: interpretable, handles mixed data types, requires little preprocessing. Disadvantages: prone to overfitting, high variance, and instability under small changes in the data.
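A sketch of the split search at a single node for one numeric feature, using Gini impurity. Candidate thresholds are taken as midpoints between consecutive distinct values, which is one common choice and an assumption here; the function names and data are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    n = len(values)
    best_threshold, best_score = None, gini(labels)  # baseline: no split
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(gini(labels))
print(best_split(values, labels))
```

A full tree would repeat this search over every feature and recurse into the two child nodes.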
Bagging (e.g., Random Forest) trains multiple models in parallel on random bootstrap samples and aggregates predictions to reduce variance. Boosting (e.g., XGBoost, AdaBoost) trains models sequentially, each correcting the errors of the previous one, primarily reducing bias but risking overfitting.
SVM finds the hyperplane that maximizes the margin between two classes, with support vectors being the data points closest to the boundary. The kernel trick (e.g., RBF, polynomial) implicitly maps data to higher-dimensional spaces to handle non-linear separability.
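The kernel trick can be checked numerically for a degree-2 polynomial kernel on 2-D inputs: the kernel value (x·z)^2 equals an ordinary dot product after the explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), yet the kernel never builds phi. The RBF kernel is included for comparison; `gamma` and the sample points are illustrative assumptions.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2."""
    return dot(x, z) ** 2


def poly2_feature_map(x):
    """Explicit feature map for the degree-2 kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


x, z = (1.0, 2.0), (3.0, 4.0)
implicit = poly2_kernel(x, z)
explicit = dot(poly2_feature_map(x), poly2_feature_map(z))
print(implicit, explicit)
print(rbf_kernel(x, z), rbf_kernel(x, x))
```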
The ROC curve plots the true positive rate vs. false positive rate at various classification thresholds. AUC (Area Under the Curve) summarizes the curve into a single scalar; an AUC of 1.0 indicates a perfect classifier, while 0.5 indicates no discriminative ability (random chance).
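AUC can also be read as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The sketch below computes it that way by brute force over all positive/negative pairs, which is fine for small inputs; the function name and sample scores are illustrative.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
auc = roc_auc(y_true, [0.1, 0.4, 0.35, 0.8])
auc_perfect = roc_auc(y_true, [0.1, 0.2, 0.8, 0.9])
auc_constant = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])
print(auc, auc_perfect, auc_constant)
```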
The curse of dimensionality: as the number of features increases, the volume of the feature space grows exponentially, so a fixed number of samples becomes increasingly sparse and distance metrics become less meaningful. This degrades the performance of many ML algorithms, especially distance-based ones, and motivates dimensionality reduction techniques such as PCA or feature selection.
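One way to make the sparsity concrete, under the simplifying assumption of data spread uniformly over a unit hypercube: the fraction of the cube's volume lying within a thin shell of the boundary is 1 - (1 - 2*margin)^d, which approaches 1 as the dimension d grows. The function name and margin are illustrative.

```python
def boundary_shell_fraction(dim, margin=0.05):
    """Fraction of a unit hypercube's volume within `margin` of its boundary."""
    inner_side = 1.0 - 2.0 * margin  # side length of the interior cube
    return 1.0 - inner_side ** dim


for dim in (1, 2, 10, 50):
    print(dim, round(boundary_shell_fraction(dim), 4))
```

In high dimensions almost every uniformly drawn point sits near the boundary, far from most other points.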
PCA is a linear dimensionality reduction technique that projects data onto orthogonal axes (principal components) ordered by explained variance. It is used to reduce feature dimensionality, remove multicollinearity, speed up training, and visualize high-dimensional data, but the resulting components lose direct interpretability.
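A dependency-free sketch of finding the first principal component: center the data, build the sample covariance matrix, and run power iteration to approximate its top eigenvector. Power iteration is swapped in here for brevity; libraries typically use an SVD or a full eigendecomposition. The function name and data are illustrative, and the sketch assumes the data is not constant.

```python
import math


def first_principal_component(rows, iters=200):
    """Return (unit direction, explained variance) of the top component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (divides by n - 1).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # assumed starting vector, not orthogonal to the answer here
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v is the quadratic form v^T C v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance


# Points lying exactly on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, variance = first_principal_component(rows)
print(direction, variance)
```

Projecting each centered row onto `direction` gives its one-dimensional representation.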
Backpropagation computes the gradient of the loss with respect to each weight by applying the chain rule of calculus, propagating error signals from the output layer backward through the network. These gradients are then used by an optimizer (e.g., SGD, Adam) to update weights and minimize the loss.
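A sketch of the chain rule on a deliberately tiny network, assumed here for illustration: one input, one sigmoid hidden unit, one linear output, and squared-error loss. The analytic gradients are compared against central finite differences as a sanity check; all names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x, y):
    h = sigmoid(w1 * x)            # hidden activation
    y_hat = w2 * h                 # linear output
    loss = 0.5 * (y_hat - y) ** 2  # squared-error loss
    return h, y_hat, loss


def backward(w1, w2, x, y):
    """Propagate the error from the output back to each weight."""
    h, y_hat, _ = forward(w1, w2, x, y)
    d_yhat = y_hat - y           # dL/dy_hat
    d_w2 = d_yhat * h            # dL/dw2
    d_h = d_yhat * w2            # dL/dh
    d_z = d_h * h * (1.0 - h)    # sigmoid'(z) = h * (1 - h)
    d_w1 = d_z * x               # dL/dw1
    return d_w1, d_w2


def numerical_grads(w1, w2, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    g1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
    return g1, g2


analytic = backward(0.5, -1.5, 2.0, 1.0)
numeric = numerical_grads(0.5, -1.5, 2.0, 1.0)
print(analytic, numeric)
```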
In deep networks, gradients can become exponentially small as they propagate backward through many layers, so early layers learn very slowly or not at all; this is the vanishing gradient problem. Solutions include using ReLU activations instead of sigmoid/tanh, batch normalization, residual/skip connections (ResNets), and careful weight initialization (e.g., He or Xavier).
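A rough illustration of the exponential shrinkage, ignoring the weight matrices (a simplifying assumption): backpropagation multiplies one activation derivative per layer, and the sigmoid derivative is at most 0.25, so even the best case decays as 0.25^depth. A ReLU unit that is active has derivative 1. Function names and the depth are illustrative.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # peaks at 0.25 when z = 0


def relu_prime(z):
    return 1.0 if z > 0 else 0.0


def backprop_scale(activation_prime, pre_activations):
    """Product of activation derivatives along one path through the layers."""
    scale = 1.0
    for z in pre_activations:
        scale *= activation_prime(z)
    return scale


depth = 20
best_case_sigmoid = backprop_scale(sigmoid_prime, [0.0] * depth)
active_relu = backprop_scale(relu_prime, [1.0] * depth)
print(best_case_sigmoid, active_relu)
```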
Attention allows a model to weight the relevance of all input tokens when producing each output token, rather than compressing the entire sequence into a fixed vector. This resolved the bottleneck of RNN encoder-decoder architectures for long sequences and became the foundation of Transformer models (BERT, GPT), which dominate modern NLP.
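A minimal sketch of one specific form, scaled dot-product attention as used in Transformers: each query is compared with every key, the scores are passed through a softmax, and the output is the weighted average of the values. It is single-head, unmasked, and uses plain lists; the function names and toy vectors are assumptions.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Return (outputs, weights) for scaled dot-product attention."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # relevance of every input position
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0]]  # aligned with the first key
outputs, weights = attention(queries, keys, values)
print(outputs, weights)
```

The query aligned with the first key puts most of its weight on the first value, without any fixed-size summary of the whole sequence.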
Discriminative models (e.g., logistic regression, SVM, neural classifiers) model the conditional distribution P(y|x), or the decision boundary itself, directly. Generative models (e.g., Naive Bayes, GANs, VAEs, GMMs) model the joint distribution P(x,y) or the data distribution P(x) and can therefore generate new samples; they typically rely on stronger modeling assumptions but can be applied to more tasks than classification alone.
Techniques for handling class imbalance include resampling (oversampling the minority class, e.g., with SMOTE, or undersampling the majority class), adjusting class weights in the loss function, using appropriate metrics (F1 or AUC-PR instead of accuracy), threshold tuning, and ensemble methods such as BalancedBaggingClassifier from imbalanced-learn. The best approach depends on dataset size and the cost of each error type.
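For the class-weight option, one common heuristic is inverse-frequency weighting: weight = n_samples / (n_classes * class_count), which is the formula behind scikit-learn's `class_weight="balanced"`. The sketch below computes it with the standard library; the function name and label counts are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)

# With these weights every class contributes the same total weight to the loss.
totals = {cls: weights[cls] * labels.count(cls) for cls in weights}
print(totals)
```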
Parametric models (e.g., linear regression, logistic regression, neural networks with a fixed architecture) have a fixed number of parameters set before training; classical ones such as linear regression also assume a specific functional form for the data. Non-parametric models (e.g., k-NN, kernel SVM, decision trees) grow in complexity with the amount of training data and make fewer distributional assumptions, but they typically need more data and computation.
© RM Full Stack & AI Engineer