RMRM Full Stack & AI Engineer · All questions · Roadmaps
Data · interview questions

Data Analyst Interview Questions

Common Data Analyst interview questions spanning SQL, statistics, data cleaning, visualization, and analytical thinking — from beginner to advanced levels.

1. What is the difference between a primary key and a foreign key in a relational database?

beginner

A primary key uniquely identifies each row in a table and cannot be NULL. A foreign key is a column in one table that references the primary key of another table, establishing a relationship between the two tables.

2. What are the main differences between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN?

beginner

INNER JOIN returns only matching rows from both tables. LEFT JOIN returns all rows from the left table plus matching rows from the right (NULLs where no match). RIGHT JOIN is the opposite. FULL OUTER JOIN returns all rows from both tables, with NULLs where there is no match on either side.

3. What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?

beginner

COUNT(*) counts all rows including NULLs. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique non-NULL values in that column.

4. How do you handle missing or null values in a dataset?

beginner

Common strategies include removing rows with missing values if they are few and random, imputing with mean/median/mode for numerical data, using forward/backward fill for time series, or flagging missingness as a separate binary feature. The best approach depends on the volume of missing data and its likely cause.

5. What is the difference between mean, median, and mode, and when would you use each?

beginner

Mean is the arithmetic average, sensitive to outliers. Median is the middle value, more robust to skewed distributions or outliers. Mode is the most frequent value, useful for categorical data. For skewed distributions like income, median is typically preferred over mean.

6. What is a GROUP BY clause and how does it work with aggregate functions?

beginner

GROUP BY partitions rows into groups based on one or more columns, and aggregate functions like SUM, AVG, COUNT, MIN, and MAX are then applied to each group independently. Every column in the SELECT list must either be in the GROUP BY clause or wrapped in an aggregate function.

7. Explain the difference between WHERE and HAVING in SQL.

intermediate

WHERE filters rows before any grouping or aggregation occurs and cannot reference aggregate functions. HAVING filters groups after GROUP BY and aggregation, and is used to apply conditions on aggregate results such as HAVING COUNT(*) > 10.

8. What are window functions in SQL? Give a practical example.

intermediate

Window functions perform calculations across a set of rows related to the current row without collapsing them into a single output row, unlike GROUP BY. For example, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) assigns a rank to each employee within their department by salary, allowing you to find the top earner per department while keeping all rows.

9. What is the difference between variance and standard deviation?

intermediate

Variance is the average of squared deviations from the mean, expressed in squared units. Standard deviation is the square root of variance, bringing it back to the original units of the data and making it easier to interpret in context.

10. What is a correlation coefficient and what are its limitations?

intermediate

A correlation coefficient (e.g., Pearson's r) measures the linear relationship between two variables, ranging from -1 to +1. Its key limitations are that it only captures linear relationships (not nonlinear ones), it is sensitive to outliers, and correlation does not imply causation.

11. What is the difference between a star schema and a snowflake schema in data warehousing?

intermediate

A star schema has a central fact table linked directly to denormalized dimension tables, making queries simpler and faster. A snowflake schema normalizes dimension tables into multiple related tables, reducing data redundancy but increasing join complexity and query execution time.

12. How would you detect and handle outliers in a dataset?

intermediate

Outliers can be detected using box plots, Z-scores (flag values beyond ±3 standard deviations), or the IQR method (values below Q1−1.5×IQR or above Q3+1.5×IQR). Handling options include removing them if they are data errors, capping/winsorizing them, using robust statistical methods, or keeping them if they are legitimate extreme values worth analyzing separately.

13. What is the difference between a dimension table and a fact table?

intermediate

A fact table stores quantitative, measurable business events (e.g., sales transactions) and typically contains foreign keys and numeric metrics like revenue or quantity. Dimension tables store descriptive context about those events (e.g., product, customer, date), and are used for filtering, grouping, and labeling in reports.

14. What is A/B testing and how would you determine if the results are statistically significant?

advanced

A/B testing splits users into a control group (A) and a treatment group (B) to measure the effect of a single change. Statistical significance is determined using a hypothesis test (commonly a two-sample t-test or chi-square test), calculating a p-value and comparing it to a pre-set significance level (typically 0.05), while also ensuring adequate sample size via power analysis before the test.

15. Explain the concept of p-value and why a p-value below 0.05 does not necessarily mean a result is practically significant.

advanced

A p-value is the probability of observing results at least as extreme as the ones obtained, assuming the null hypothesis is true — a low p-value suggests the result is unlikely due to chance. However, with very large sample sizes, even tiny, practically meaningless differences can produce p-values well below 0.05. Practical significance should be assessed using effect size measures (like Cohen's d) alongside p-values.

16. What is data normalization versus standardization, and when would you use each?

advanced

Normalization (min-max scaling) rescales data to a fixed range like [0, 1] and is useful when the algorithm requires bounded inputs or when the distribution is not Gaussian. Standardization (Z-score scaling) transforms data to have mean 0 and standard deviation 1, and is preferred when the data is approximately normally distributed or when the algorithm assumes zero mean, such as PCA or linear regression.

17. How would you approach analyzing a sudden drop in a key business metric, such as a 20% decrease in daily active users?

advanced

First, confirm the drop is real by checking data pipeline integrity, tracking code, and whether it affects all segments or only specific ones (platform, geography, user type). Then segment the metric to isolate where the drop originates, correlate with recent product changes, marketing shifts, or external events, and form hypotheses to test. Finally, quantify impact and communicate findings clearly with supporting evidence.

18. What is the difference between OLTP and OLAP systems, and how does it affect how you query data?

advanced

OLTP (Online Transaction Processing) systems are optimized for high-frequency, row-level reads and writes with normalized schemas, powering operational applications. OLAP (Online Analytical Processing) systems are optimized for complex analytical queries across large volumes of historical data using denormalized or columnar storage. Querying OLAP systems typically involves heavy use of aggregations, joins across fact/dimension tables, and window functions rather than simple row lookups.

19. What is Simpson's Paradox and why is it important for data analysts to understand?

advanced

Simpson's Paradox occurs when a trend that appears in several groups of data reverses or disappears when the groups are combined — often because of a confounding variable that is unevenly distributed across groups. It is critical for analysts because drawing conclusions from aggregated data without examining subgroups can lead to incorrect and misleading business decisions.

Practice these out loud with an AI interviewer that grills you and grades your answers.
Open the app — free to start

© RM Full Stack & AI Engineer · All interview questions · Roadmaps · Open the app