Common Data Analyst interview questions spanning SQL, statistics, data cleaning, visualization, and analytical thinking — from beginner to advanced levels.
A primary key uniquely identifies each row in a table and cannot be NULL. A foreign key is a column in one table that references the primary key of another table, establishing a relationship between the two tables.
INNER JOIN returns only matching rows from both tables. LEFT JOIN returns all rows from the left table plus matching rows from the right (NULLs where no match). RIGHT JOIN is the opposite. FULL OUTER JOIN returns all rows from both tables, with NULLs where there is no match on either side.
COUNT(*) counts all rows including NULLs. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique non-NULL values in that column.
Common strategies include removing rows with missing values if they are few and random, imputing with mean/median/mode for numerical data, using forward/backward fill for time series, or flagging missingness as a separate binary feature. The best approach depends on the volume of missing data and its likely cause.
Mean is the arithmetic average, sensitive to outliers. Median is the middle value, more robust to skewed distributions or outliers. Mode is the most frequent value, useful for categorical data. For skewed distributions like income, median is typically preferred over mean.
GROUP BY partitions rows into groups based on one or more columns, and aggregate functions like SUM, AVG, COUNT, MIN, and MAX are then applied to each group independently. Every column in the SELECT list must either be in the GROUP BY clause or wrapped in an aggregate function.
WHERE filters rows before any grouping or aggregation occurs and cannot reference aggregate functions. HAVING filters groups after GROUP BY and aggregation, and is used to apply conditions on aggregate results such as HAVING COUNT(*) > 10.
Window functions perform calculations across a set of rows related to the current row without collapsing them into a single output row, unlike GROUP BY. For example, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) assigns a rank to each employee within their department by salary, allowing you to find the top earner per department while keeping all rows.
Variance is the average of squared deviations from the mean, expressed in squared units. Standard deviation is the square root of variance, bringing it back to the original units of the data and making it easier to interpret in context.
A correlation coefficient (e.g., Pearson's r) measures the linear relationship between two variables, ranging from -1 to +1. Its key limitations are that it only captures linear relationships (not nonlinear ones), it is sensitive to outliers, and correlation does not imply causation.
A star schema has a central fact table linked directly to denormalized dimension tables, making queries simpler and faster. A snowflake schema normalizes dimension tables into multiple related tables, reducing data redundancy but increasing join complexity and query execution time.
Outliers can be detected using box plots, Z-scores (flag values beyond ±3 standard deviations), or the IQR method (values below Q1−1.5×IQR or above Q3+1.5×IQR). Handling options include removing them if they are data errors, capping/winsorizing them, using robust statistical methods, or keeping them if they are legitimate extreme values worth analyzing separately.
A fact table stores quantitative, measurable business events (e.g., sales transactions) and typically contains foreign keys and numeric metrics like revenue or quantity. Dimension tables store descriptive context about those events (e.g., product, customer, date), and are used for filtering, grouping, and labeling in reports.
A/B testing splits users into a control group (A) and a treatment group (B) to measure the effect of a single change. Statistical significance is determined using a hypothesis test (commonly a two-sample t-test or chi-square test), calculating a p-value and comparing it to a pre-set significance level (typically 0.05), while also ensuring adequate sample size via power analysis before the test.
A p-value is the probability of observing results at least as extreme as the ones obtained, assuming the null hypothesis is true — a low p-value suggests the result is unlikely due to chance. However, with very large sample sizes, even tiny, practically meaningless differences can produce p-values well below 0.05. Practical significance should be assessed using effect size measures (like Cohen's d) alongside p-values.
Normalization (min-max scaling) rescales data to a fixed range like [0, 1] and is useful when the algorithm requires bounded inputs or when the distribution is not Gaussian. Standardization (Z-score scaling) transforms data to have mean 0 and standard deviation 1, and is preferred when the data is approximately normally distributed or when the algorithm assumes zero mean, such as PCA or linear regression.
First, confirm the drop is real by checking data pipeline integrity, tracking code, and whether it affects all segments or only specific ones (platform, geography, user type). Then segment the metric to isolate where the drop originates, correlate with recent product changes, marketing shifts, or external events, and form hypotheses to test. Finally, quantify impact and communicate findings clearly with supporting evidence.
OLTP (Online Transaction Processing) systems are optimized for high-frequency, row-level reads and writes with normalized schemas, powering operational applications. OLAP (Online Analytical Processing) systems are optimized for complex analytical queries across large volumes of historical data using denormalized or columnar storage. Querying OLAP systems typically involves heavy use of aggregations, joins across fact/dimension tables, and window functions rather than simple row lookups.
Simpson's Paradox occurs when a trend that appears in several groups of data reverses or disappears when the groups are combined — often because of a confounding variable that is unevenly distributed across groups. It is critical for analysts because drawing conclusions from aggregated data without examining subgroups can lead to incorrect and misleading business decisions.
© RM Full Stack & AI Engineer · All interview questions · Roadmaps · Open the app