Question 1

What is the difference between a primary key and a foreign key in a relational database?

Accepted Answer

A primary key uniquely identifies each row in a table and cannot be NULL. A foreign key is a column in one table that references the primary key of another table, establishing a relationship between the two tables.

Question 2

What are the main differences between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN?

Accepted Answer

INNER JOIN returns only matching rows from both tables. LEFT JOIN returns all rows from the left table plus matching rows from the right (NULLs where no match). RIGHT JOIN is the opposite. FULL OUTER JOIN returns all rows from both tables, with NULLs where there is no match on either side.

Question 3

What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?

Accepted Answer

COUNT(*) counts all rows including NULLs. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique non-NULL values in that column.

Question 4

How do you handle missing or null values in a dataset?

Accepted Answer

Common strategies include removing rows with missing values if they are few and random, imputing with mean/median/mode for numerical data, using forward/backward fill for time series, or flagging missingness as a separate binary feature. The best approach depends on the volume of missing data and its likely cause.

Question 5

What is the difference between mean, median, and mode, and when would you use each?

Accepted Answer

Mean is the arithmetic average, sensitive to outliers. Median is the middle value, more robust to skewed distributions or outliers. Mode is the most frequent value, useful for categorical data. For skewed distributions like income, median is typically preferred over mean.

Question 6

What is a GROUP BY clause and how does it work with aggregate functions?

Accepted Answer

GROUP BY partitions rows into groups based on one or more columns, and aggregate functions like SUM, AVG, COUNT, MIN, and MAX are then applied to each group independently. Every column in the SELECT list must either be in the GROUP BY clause or wrapped in an aggregate function.

Question 7

Explain the difference between WHERE and HAVING in SQL.

Accepted Answer

WHERE filters rows before any grouping or aggregation occurs and cannot reference aggregate functions. HAVING filters groups after GROUP BY and aggregation, and is used to apply conditions on aggregate results such as HAVING COUNT(*) > 10.

Question 8

What are window functions in SQL? Give a practical example.

Accepted Answer

Window functions perform calculations across a set of rows related to the current row without collapsing them into a single output row, unlike GROUP BY. For example, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) assigns a rank to each employee within their department by salary, allowing you to find the top earner per department while keeping all rows.

Question 9

What is the difference between variance and standard deviation?

Accepted Answer

Variance is the average of squared deviations from the mean, expressed in squared units. Standard deviation is the square root of variance, bringing it back to the original units of the data and making it easier to interpret in context.

Question 10

What is a correlation coefficient and what are its limitations?

Accepted Answer

A correlation coefficient (e.g., Pearson's r) measures the linear relationship between two variables, ranging from -1 to +1. Its key limitations are that it only captures linear relationships (not nonlinear ones), it is sensitive to outliers, and correlation does not imply causation.

Question 11

What is the difference between a star schema and a snowflake schema in data warehousing?

Accepted Answer

A star schema has a central fact table linked directly to denormalized dimension tables, making queries simpler and faster. A snowflake schema normalizes dimension tables into multiple related tables, reducing data redundancy but increasing join complexity and query execution time.

Question 12

How would you detect and handle outliers in a dataset?

Accepted Answer

Outliers can be detected using box plots, Z-scores (flag values beyond ±3 standard deviations), or the IQR method (values below Q1−1.5×IQR or above Q3+1.5×IQR). Handling options include removing them if they are data errors, capping/winsorizing them, using robust statistical methods, or keeping them if they are legitimate extreme values worth analyzing separately.

Question 13

What is the difference between a dimension table and a fact table?

Accepted Answer

A fact table stores quantitative, measurable business events (e.g., sales transactions) and typically contains foreign keys and numeric metrics like revenue or quantity. Dimension tables store descriptive context about those events (e.g., product, customer, date), and are used for filtering, grouping, and labeling in reports.

Question 14

What is A/B testing and how would you determine if the results are statistically significant?

Accepted Answer

A/B testing splits users into a control group (A) and a treatment group (B) to measure the effect of a single change. Statistical significance is determined using a hypothesis test (commonly a two-sample t-test or chi-square test), calculating a p-value and comparing it to a pre-set significance level (typically 0.05), while also ensuring adequate sample size via power analysis before the test.

Question 15

Explain the concept of p-value and why a p-value below 0.05 does not necessarily mean a result is practically significant.

Accepted Answer

A p-value is the probability of observing results at least as extreme as the ones obtained, assuming the null hypothesis is true — a low p-value suggests the result is unlikely due to chance. However, with very large sample sizes, even tiny, practically meaningless differences can produce p-values well below 0.05. Practical significance should be assessed using effect size measures (like Cohen's d) alongside p-values.

Question 16

What is data normalization versus standardization, and when would you use each?

Accepted Answer

Normalization (min-max scaling) rescales data to a fixed range like [0, 1] and is useful when the algorithm requires bounded inputs or when the distribution is not Gaussian. Standardization (Z-score scaling) transforms data to have mean 0 and standard deviation 1, and is preferred when the data is approximately normally distributed or when the algorithm assumes zero mean, such as PCA or linear regression.

Question 17

How would you approach analyzing a sudden drop in a key business metric, such as a 20% decrease in daily active users?

Accepted Answer

First, confirm the drop is real by checking data pipeline integrity, tracking code, and whether it affects all segments or only specific ones (platform, geography, user type). Then segment the metric to isolate where the drop originates, correlate with recent product changes, marketing shifts, or external events, and form hypotheses to test. Finally, quantify impact and communicate findings clearly with supporting evidence.

Question 18

What is the difference between OLTP and OLAP systems, and how does it affect how you query data?

Accepted Answer

OLTP (Online Transaction Processing) systems are optimized for high-frequency, row-level reads and writes with normalized schemas, powering operational applications. OLAP (Online Analytical Processing) systems are optimized for complex analytical queries across large volumes of historical data using denormalized or columnar storage. Querying OLAP systems typically involves heavy use of aggregations, joins across fact/dimension tables, and window functions rather than simple row lookups.

Question 19

What is Simpson's Paradox and why is it important for data analysts to understand?

Accepted Answer

Simpson's Paradox occurs when a trend that appears in several groups of data reverses or disappears when the groups are combined — often because of a confounding variable that is unevenly distributed across groups. It is critical for analysts because drawing conclusions from aggregated data without examining subgroups can lead to incorrect and misleading business decisions.

Data Analyst Interview Questions

1. What is the difference between a primary key and a foreign key in a relational database?

2. What are the main differences between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN?

3. What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?

4. How do you handle missing or null values in a dataset?

5. What is the difference between mean, median, and mode, and when would you use each?

6. What is a GROUP BY clause and how does it work with aggregate functions?

7. Explain the difference between WHERE and HAVING in SQL.

8. What are window functions in SQL? Give a practical example.

9. What is the difference between variance and standard deviation?

10. What is a correlation coefficient and what are its limitations?

11. What is the difference between a star schema and a snowflake schema in data warehousing?

12. How would you detect and handle outliers in a dataset?

13. What is the difference between a dimension table and a fact table?

14. What is A/B testing and how would you determine if the results are statistically significant?

15. Explain the concept of p-value and why a p-value below 0.05 does not necessarily mean a result is practically significant.

16. What is data normalization versus standardization, and when would you use each?

17. How would you approach analyzing a sudden drop in a key business metric, such as a 20% decrease in daily active users?

18. What is the difference between OLTP and OLAP systems, and how does it affect how you query data?

19. What is Simpson's Paradox and why is it important for data analysts to understand?