Question 1

What is the difference between supervised, unsupervised, and reinforcement learning?

Accepted Answer

Supervised learning trains on labeled data to predict outputs; unsupervised learning finds patterns in unlabeled data (e.g., clustering, dimensionality reduction); reinforcement learning trains an agent to make sequential decisions by maximizing cumulative reward through environment interaction.

Question 2

What is overfitting and how do you prevent it?

Accepted Answer

Overfitting occurs when a model learns noise and incidental detail in the training data, so it performs well on the training set but poorly on unseen data. Prevention techniques include regularization (L1/L2), dropout, early stopping, pruning (for decision trees), and collecting more training data. Cross-validation helps detect overfitting and tune these controls rather than preventing it by itself.
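
As an illustration of one of these techniques, here is a minimal sketch of early stopping. The validation losses and the `patience` parameter are made-up values for the example, not taken from any real training run.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical validation losses: they fall, then rise as the model overfits.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(val_losses))  # 3
```

Training stops once the validation loss has failed to improve for `patience` consecutive epochs, and the weights from the best epoch are the ones kept.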

Question 3

What is the bias-variance tradeoff?

Accepted Answer

Bias is error from overly simplistic assumptions; variance is error from sensitivity to small fluctuations in training data. High bias causes underfitting; high variance causes overfitting. The goal is to find a model complexity that minimizes total error by balancing the two.
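
For squared-error loss the tradeoff can be written as an explicit decomposition. The notation below is an assumption added for illustration: the data follow y = f(x) + ε with zero-mean noise of variance σ², and \hat{f} is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Increasing model complexity usually shrinks the first term and grows the second; the third term cannot be reduced by any model.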

Question 4

What is cross-validation and why is it used?

Accepted Answer

Cross-validation (most commonly k-fold) splits data into k subsets, trains on k-1 folds and validates on the remaining fold, repeating k times. It provides a more reliable estimate of model generalization performance and reduces the impact of a single train/test split.
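
The splitting step can be sketched in a few lines of plain Python. This is a minimal sketch: the function name is hypothetical and it does not shuffle, which you would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, val in folds:
    print("train:", train, "validate:", val)
```

Every sample lands in the validation set exactly once, and the k validation scores are averaged to give the final estimate.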

Question 5

Explain the difference between precision and recall.

Accepted Answer

Precision is the fraction of predicted positives that are truly positive (TP / (TP + FP)); recall is the fraction of actual positives correctly identified (TP / (TP + FN)). Precision matters when false positives are costly; recall matters when false negatives are costly.
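
A small worked example of both formulas, using made-up labels where 1 is the positive class:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # TP=2, FP=1, FN=2, so precision = 2/3 and recall = 0.5
```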

Question 6

What is the F1 score and when would you use it over accuracy?

Accepted Answer

The F1 score is the harmonic mean of precision and recall: 2*(precision*recall)/(precision+recall). It is preferred over accuracy when classes are imbalanced, because accuracy can be misleadingly high if the model simply predicts the majority class.
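
The imbalance problem is easy to reproduce. The sketch below uses a made-up dataset with 5 positives out of 100 and compares a majority-class predictor with a hypothetical model that finds some of the positives.

```python
def f1_and_accuracy(y_true, y_pred):
    """Return (F1, accuracy) for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return f1, accuracy

y_true = [1] * 5 + [0] * 95

# Always predicting the majority class: high accuracy, zero F1.
majority_f1, majority_acc = f1_and_accuracy(y_true, [0] * 100)

# A model that catches 3 of 5 positives at the cost of 2 false positives.
model_pred = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93
model_f1, model_acc = f1_and_accuracy(y_true, model_pred)

print(majority_f1, majority_acc)  # F1 is 0.0, accuracy is 0.95
print(model_f1, model_acc)        # F1 is about 0.6, accuracy is 0.96
```

Accuracy barely separates the two predictors, while F1 shows that the first one never finds a positive.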

Question 7

What is gradient descent and how does it work?

Accepted Answer

Gradient descent is an optimization algorithm that iteratively adjusts model parameters in the direction of the negative gradient of the loss function to minimize it. Variants include batch, stochastic (SGD), and mini-batch gradient descent, which differ in how many samples are used per update.
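
The update rule can be sketched on a one-parameter toy problem. The loss f(w) = (w - 3)^2, the learning rate, and the step count are assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Hypothetical loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)  # very close to 3.0, the minimizer
```

Batch, stochastic, and mini-batch variants share this loop and differ only in how `grad` is estimated: from all samples, one sample, or a small subset.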

Question 8

What is regularization? Explain L1 vs L2.

Accepted Answer

Regularization adds a penalty term to the loss function to discourage model complexity and reduce overfitting. L1 (Lasso) adds the sum of absolute weights, encouraging sparsity by driving some weights to zero; L2 (Ridge) adds the sum of squared weights, shrinking all weights but rarely to exactly zero.
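
The sparsity difference shows up even with a single weight. The sketch below assumes a simplified setting: each weight `a` is the unregularized optimum of a quadratic loss 0.5*(w - a)^2, and `lam` is a made-up penalty strength. Under that assumption both penalized problems have closed-form solutions.

```python
def l1_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w): soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small weights are set exactly to zero

def l2_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return a / (1.0 + lam)

weights = [3.0, 0.4, -0.2, -2.0]
lam = 0.5
l1 = [l1_shrink(w, lam) for w in weights]
l2 = [l2_shrink(w, lam) for w in weights]
print(l1)  # [2.5, 0.0, 0.0, -1.5]
print(l2)  # every weight is scaled down, none becomes exactly zero
```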

Question 9

How does a decision tree work, and what are its advantages and disadvantages?

Accepted Answer

A decision tree recursively splits data on the feature and threshold that maximizes information gain, that is, the split that most reduces impurity (Gini or entropy) at each node. Advantages: interpretable, handles mixed data types, requires little preprocessing. Disadvantages: prone to overfitting and high variance, so small changes in the data can produce a very different tree.
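
A single split search on one numeric feature can be sketched as follows. The function names and toy data are hypothetical; a real tree repeats this search over every feature and recurses on each side.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(xs, ys):
    """Return (threshold, weighted_gini) for the best split of one feature."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
print(gini(ys))                # 0.5 for a 50/50 node
print(best_threshold(xs, ys))  # (4.5, 0.0): both children are pure
```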

Question 10

What is the difference between bagging and boosting?

Accepted Answer

Bagging (e.g., Random Forest) trains multiple models in parallel on random bootstrap samples and aggregates their predictions to reduce variance. Boosting (e.g., XGBoost, AdaBoost) trains models sequentially, each one correcting the errors of the previous ones, which mainly reduces bias but can overfit if run for too many rounds or on noisy labels.
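
The bagging half can be sketched with the standard library alone. Everything here is illustrative: `train_stump` is a hypothetical weak learner that assumes class 1 has the larger feature values, and the data are made up. Boosting is not shown.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical weak learner: threshold halfway between the class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:  # degenerate sample: predict the only class seen
        only = 0 if xs0 else 1
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > threshold)

def bagged_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagged_predict(models, 1.2), bagged_predict(models, 4.2))  # 0 1
```

Each stump sees a different resample, so their thresholds differ; the vote averages out that variation.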

Question 11

Explain how a Support Vector Machine (SVM) works.

Accepted Answer

SVM finds the hyperplane that maximizes the margin between two classes, with support vectors being the data points closest to the boundary. The kernel trick (e.g., RBF, polynomial) implicitly maps data to higher-dimensional spaces to handle non-linear separability.
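
The kernel trick can be checked numerically for a simple case. This sketch uses a degree-2 polynomial kernel without a bias term on 2-D inputs, chosen only because its explicit feature map is small enough to write out.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel (no bias term): (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def explicit_map(x):
    """The 3-D feature map whose dot product the kernel computes implicitly."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, y)
explicit = sum(a * b for a, b in zip(explicit_map(x), explicit_map(y)))
print(implicit, explicit)  # both equal 1.0 up to floating-point rounding
```

The kernel gives the higher-dimensional dot product without ever building the mapped vectors, which is what makes kernels like RBF (whose feature space is infinite-dimensional) usable.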

Question 12

What is the ROC-AUC curve and what does it measure?

Accepted Answer

The ROC curve plots the true positive rate vs. false positive rate at various classification thresholds. AUC (Area Under the Curve) summarizes the curve into a single scalar; an AUC of 1.0 indicates a perfect classifier, while 0.5 indicates no discriminative ability (random chance).
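
AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as half. A minimal sketch using that definition, with made-up scores:

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75: three of the four pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, so it is for illustration only.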

Question 13

What is the curse of dimensionality?

Accepted Answer

As the number of features increases, the volume of the feature space grows exponentially, making data increasingly sparse and distance metrics less meaningful. This degrades the performance of many ML algorithms and necessitates dimensionality reduction techniques like PCA or feature selection.
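
The loss of contrast between distances can be observed directly. The sketch below is an illustration under assumptions: points are drawn uniformly from the unit hypercube, and the point count and seed are arbitrary.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the smallest to the largest distance from one query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

As the dimension grows the ratio moves toward 1, meaning the nearest neighbor is hardly closer than the farthest point.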

Question 14

What is Principal Component Analysis (PCA) and when would you use it?

Accepted Answer

PCA is a linear dimensionality reduction technique that projects data onto orthogonal axes (principal components) ordered by explained variance. It is used to reduce feature dimensionality, remove multicollinearity, speed up training, and visualize high-dimensional data, but the resulting components lose direct interpretability.
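
For 2-D data the first principal component can be found without any library. This is a minimal sketch, assuming made-up points lying close to the line y = 2x; it uses power iteration on the covariance matrix rather than the full eigendecomposition or SVD a real implementation would use.

```python
import math

def first_principal_component(points, iterations=100):
    """Leading eigenvector of the 2x2 covariance matrix via power iteration."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize.
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
    # Variance along the component divided by the total variance.
    explained = (cxx * vx * vx + 2 * cxy * vx * vy + cyy * vy * vy) / (cxx + cyy)
    return (vx, vy), explained

points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
component, explained = first_principal_component(points)
print(component, explained)
```

The component points along the direction of the line, and almost all of the variance is explained by it, so the second dimension could be dropped with little loss.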

Question 15

How does backpropagation work in neural networks?

Accepted Answer

Backpropagation computes the gradient of the loss with respect to each weight by applying the chain rule of calculus, propagating error signals from the output layer backward through the network. These gradients are then used by an optimizer (e.g., SGD, Adam) to update weights and minimize the loss.
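
The chain rule steps can be written out by hand for a tiny network and checked against finite differences. The network here (one tanh hidden unit, one linear output, squared-error loss) and all numeric values are assumptions chosen to keep the example small.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # hidden activation
    y = w2 * h             # network output
    return h, y

def backprop(w1, w2, x, target):
    """Gradients of loss = (y - target)^2 with respect to w1 and w2."""
    h, y = forward(w1, w2, x)
    dloss_dy = 2.0 * (y - target)
    grad_w2 = dloss_dy * h                   # dy/dw2 = h
    dloss_dh = dloss_dy * w2                 # dy/dh = w2
    grad_w1 = dloss_dh * (1.0 - h * h) * x   # dh/dw1 = (1 - tanh^2) * x
    return grad_w1, grad_w2

def numerical_grad(f, w, eps=1e-6):
    """Central finite difference, used only to check the analytic gradients."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
loss = lambda a, b: (forward(a, b, x)[1] - target) ** 2
g1, g2 = backprop(w1, w2, x, target)
n1 = numerical_grad(lambda a: loss(a, w2), w1)
n2 = numerical_grad(lambda b: loss(w1, b), w2)
print(g1, n1)
print(g2, n2)
```

The error signal `dloss_dy` is computed once at the output and reused for both weights, which is the efficiency backpropagation provides over differentiating each weight separately.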

Question 16

What is the vanishing gradient problem and how is it addressed?

Accepted Answer

In deep networks, gradients can become exponentially small as they propagate backward through many layers, making early layers learn very slowly or not at all. Solutions include using ReLU activations instead of sigmoid/tanh, batch normalization, residual/skip connections (ResNets), and careful weight initialization (e.g., He or Xavier).
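
The exponential shrinkage follows from multiplying local derivatives. The sketch below is deliberately simplified: every layer has the same pre-activation value and all weights are fixed at 1, so only the activation derivative matters.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # peaks at 0.25 when z == 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chained_gradient(local_grad, z, depth):
    """Product of `depth` identical local derivatives (weights fixed at 1)."""
    return local_grad(z) ** depth

for depth in (1, 5, 10, 20):
    print(depth,
          chained_gradient(sigmoid_grad, 0.0, depth),
          chained_gradient(relu_grad, 1.0, depth))
```

Even in the sigmoid's best case the gradient is multiplied by 0.25 per layer, so ten layers scale it by less than one millionth, while an active ReLU passes it through unchanged.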

Question 17

Explain the attention mechanism and why it was transformative for NLP.

Accepted Answer

Attention allows a model to weight the relevance of all input tokens when producing each output token, rather than compressing the entire sequence into a fixed vector. This resolved the bottleneck of RNN encoder-decoder architectures for long sequences and became the foundation of Transformer models (BERT, GPT), which dominate modern NLP.
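
A minimal sketch of the weighting step for a single query. It assumes the scaled dot-product form used in Transformers, which the answer above does not spell out, and the keys, values, and query are made-up 2-D vectors.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Return (weights, context) for one query over all input positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    # The context vector is the weighted average of the value vectors.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)]
    return weights, context

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]
weights, context = attention(query, keys, values)
print(weights)
print(context)
```

The query aligns with the first and third keys, so those positions receive most of the weight and the second is nearly ignored. Every position stays directly accessible, with no fixed-size vector in between.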

Question 18

What is the difference between generative and discriminative models? Give examples.

Accepted Answer

Discriminative models (e.g., logistic regression, SVM, neural classifiers) learn the conditional distribution P(y|x), or a decision boundary between classes, directly. Generative models (e.g., Naive Bayes, GANs, VAEs, GMMs) learn the joint distribution P(x,y) or the data distribution P(x) and can therefore generate new samples; they typically require more modeling assumptions but support more uses, such as sampling and handling missing features.

Question 19

What are common techniques to handle class imbalance?

Accepted Answer

Techniques include resampling strategies (oversampling minorities with SMOTE, undersampling majorities), adjusting class weights in the loss function, using appropriate metrics (F1, AUC-PR instead of accuracy), threshold tuning, and ensemble methods like BalancedBaggingClassifier. The best approach depends on dataset size and the cost of each error type.
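
Class weighting is the simplest of these to show. The sketch below uses the common inverse-frequency heuristic, n_samples / (n_classes * class_count), on a made-up 90/10 label split; the function name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)  # class 0 gets about 0.556, class 1 gets 5.0
```

Multiplying each sample's loss by its class weight makes an error on the minority class cost nine times as much here, so both classes contribute equally to the total loss.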

Question 20

What is the difference between a parametric and a non-parametric model?

Accepted Answer

Parametric models (e.g., linear regression, logistic regression, neural networks with a fixed architecture) have a fixed number of parameters set before training and assume a particular functional form for the relationship being learned. Non-parametric models (e.g., k-NN, kernel SVM, decision trees) grow in complexity with the amount of data and make fewer assumptions about the underlying function or distribution, which typically costs more data and computation.

Machine Learning Interview Questions

1. What is the difference between supervised, unsupervised, and reinforcement learning?

2. What is overfitting and how do you prevent it?

3. What is the bias-variance tradeoff?

4. What is cross-validation and why is it used?

5. Explain the difference between precision and recall.

6. What is the F1 score and when would you use it over accuracy?

7. What is gradient descent and how does it work?

8. What is regularization? Explain L1 vs L2.

9. How does a decision tree work, and what are its advantages and disadvantages?

10. What is the difference between bagging and boosting?

11. Explain how a Support Vector Machine (SVM) works.

12. What is the ROC-AUC curve and what does it measure?

13. What is the curse of dimensionality?

14. What is Principal Component Analysis (PCA) and when would you use it?

15. How does backpropagation work in neural networks?

16. What is the vanishing gradient problem and how is it addressed?

17. Explain the attention mechanism and why it was transformative for NLP.

18. What is the difference between generative and discriminative models? Give examples.

19. What are common techniques to handle class imbalance?

20. What is the difference between a parametric and a non-parametric model?