In 2025, business analysts are expected to do more than just gather requirements and write reports—they’re increasingly being asked to extract, interpret, and manipulate data directly. That’s why SQL interview questions for business analyst roles have become a critical part of the hiring process. Whether you’re supporting product, marketing, or finance teams, fluency in SQL is now a baseline requirement for business analysts working in data-driven organizations.
This guide breaks down the most relevant business analyst SQL interview questions, from beginner queries to more advanced, scenario-based problems. We’ll also share tips on how to practice effectively, structure your answers, and stand out during technical assessments. If you’re preparing for interviews in tech, finance, or e-commerce, these are the SQL questions for business analyst candidates that you need to master.
In most interviews, business analysts are expected to not only write SQL queries but also explain the rationale behind them. This section covers SQL interview questions and answers for business analyst roles that test both logic and clarity. Interviewers are looking for candidates who can interpret business problems and translate them into accurate queries—often under constraints like time pressure or incomplete data. These questions also serve as a test of communication: can you walk through your solution step-by-step, and justify trade-offs? Let’s go through real examples and snippets that often appear in interviews.
This is a great entry-level test of your ability to use aggregation with grouping logic. Expect interviewers to ask questions where you need to GROUP BY a field (like region or product) and then apply COUNT(DISTINCT column) to measure uniqueness.
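For example, a minimal sketch of that pattern, assuming a hypothetical orders table with region and customer_id columns:

-- Unique customers per region
SELECT region,
       COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
GROUP BY region;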
This question tests your ability to conditionally group and filter records using HAVING clauses and multiple WHERE conditions. The goal is to segment customers across time ranges and count unique IDs. It’s a direct example of real-world retention analysis.
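A hedged sketch of that segmentation, assuming a hypothetical transactions table with customer_id and purchase_date columns and treating Q1 and Q2 of 2024 as the two time ranges:

-- Customers with at least one purchase in both Q1 2024 and Q2 2024
SELECT COUNT(*) AS retained_customers
FROM (
    SELECT customer_id
    FROM transactions
    WHERE purchase_date >= DATE '2024-01-01'
      AND purchase_date <  DATE '2024-07-01'
    GROUP BY customer_id
    -- HAVING keeps only customers seen in both quarters
    HAVING COUNT(CASE WHEN purchase_date <  DATE '2024-04-01' THEN 1 END) > 0
       AND COUNT(CASE WHEN purchase_date >= DATE '2024-04-01' THEN 1 END) > 0
) retained;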
Write a query to get the number of users who bought additional products after their first purchase.
This is a classic way to identify repeat customers. The logic involves detecting a user’s first transaction, then counting subsequent purchases—perfect for understanding upsell effectiveness or lifecycle stages.
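One way to sketch it, assuming a hypothetical purchases table with user_id and created_at columns:

-- Users with at least one purchase after their first purchase timestamp
WITH first_purchase AS (
    SELECT user_id, MIN(created_at) AS first_purchase_at
    FROM purchases
    GROUP BY user_id
)
SELECT COUNT(DISTINCT p.user_id) AS repeat_buyers
FROM purchases p
JOIN first_purchase f
  ON p.user_id = f.user_id
 AND p.created_at > f.first_purchase_at;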
This advanced case adds layered filtering, grouping, and date logic. It demonstrates your ability to derive insights from event-based datasets—crucial for campaign analysis and telecom metrics.
Business analysts are often expected to build queries that calculate performance metrics—conversion rates, engagement ratios, or retention percentages. These metrics power dashboards and help guide business decisions, so SQL fluency in conditional counting, grouping, and handling edge cases is essential. Expect these types of SQL interview questions for business analyst candidates in metric-focused roles or on KPI-ownership teams.
Find the percentage of users that had at least one seven-day streak of visiting the same URL.
This question is a great proxy for conversion-style logic because it tests your ability to define a success condition (a 7-day streak), segment users accordingly, and calculate a ratio of converted users to total users. A strong answer will include window functions or date diff logic, followed by a final aggregation step to produce the conversion-like percentage.
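A Postgres-flavored sketch of the gaps-and-islands approach, assuming a hypothetical visits table with user_id, url, and a visit_date date column:

WITH daily AS (
    -- one row per user, URL, and day
    SELECT DISTINCT user_id, url, visit_date
    FROM visits
),
grouped AS (
    -- consecutive days for the same user and URL share the same grp value
    SELECT user_id, url, visit_date,
           visit_date - CAST(ROW_NUMBER() OVER (
               PARTITION BY user_id, url ORDER BY visit_date) AS int) AS grp
    FROM daily
),
streaks AS (
    SELECT user_id, COUNT(*) AS streak_len
    FROM grouped
    GROUP BY user_id, url, grp
)
SELECT ROUND(COUNT(DISTINCT CASE WHEN streak_len >= 7 THEN user_id END)::numeric
             / COUNT(DISTINCT user_id), 2) AS pct_users_with_7day_streak
FROM streaks;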
Find the percentage of customers that order drinks with their meal.
This is a direct analog to a conversion rate problem. The goal is to count the number of customers who performed a target behavior (ordered drinks), then divide that by the total relevant population (all meal orders). This requires accurate joins, filtering, and rounding—ideal practice for KPI-style questions.
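A sketch under the assumption of a single orders table with customer_id and an item_type column that distinguishes meals from drinks:

SELECT ROUND(100.0 * SUM(has_drink) / COUNT(*), 2) AS pct_customers_with_drinks
FROM (
    SELECT customer_id,
           MAX(CASE WHEN item_type = 'drink' THEN 1 ELSE 0 END) AS has_drink,
           MAX(CASE WHEN item_type = 'meal'  THEN 1 ELSE 0 END) AS has_meal
    FROM orders
    GROUP BY customer_id
) per_customer
WHERE has_meal = 1;   -- the denominator is customers who ordered a meal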
Different levels of seniority come with different SQL expectations. For entry-level business analyst interview questions, the focus is usually on syntax accuracy and basic joins. As you move into senior business systems analyst interview questions, expect scenarios involving performance optimization, schema understanding, or layered queries with CTEs. This section breaks down SQL questions by experience level—so you can prepare in a way that reflects your current skill set and the role you’re targeting.
If you’re preparing for entry level business analyst interview questions, expect a strong focus on SQL fundamentals. Recruiters want to see that you can manipulate data reliably, understand basic join logic, and use aggregation functions like COUNT, GROUP BY, and AVG to support reporting tasks. These questions are often scenario-based and tied to common business workflows—like identifying active customers or filtering sales by region. Mastering these basics shows you’re ready to support cross-functional teams with clean, accurate data.
A ranked approach with DENSE_RANK() or ROW_NUMBER() partitioned by department ensures ties are handled. Filtering for rank = 2 yields the correct value without relying on LIMIT 1,1, which breaks under duplicates. Discussing a composite index on (department, salary) and edge cases like null salaries demonstrates thoroughness.
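A minimal sketch of the ranked approach, assuming a hypothetical employees table with department and salary columns:

SELECT DISTINCT department, salary AS second_highest_salary
FROM (
    SELECT department, salary,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employees
    WHERE salary IS NOT NULL          -- edge case: ignore null salaries
) ranked
WHERE rnk = 2;                        -- DENSE_RANK keeps ties from skipping the second value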
You need to join transactions to customer demographics, filter for users who placed ≥ 1 order, and compute SUM(amount) / COUNT(DISTINCT order_id) grouped by gender. Mentioning null-gender exclusion, outlier checks (e.g., refunds), and presentation formatting shows business polish. A window function alternative can reduce sub-query complexity.
Combine tables with UNION ALL, then GROUP BY ingredient and SUM(mass) to total quantities. Discuss unit standardization (grams vs. ounces) before summing and why UNION ALL preserves duplicates needed for a correct tally. Explaining how this mirrors merging heterogeneous data sources strengthens the business-analysis perspective.
Start by joining the transactions and products tables (or using a pre-computed revenue table) to obtain net sales by month. A common approach is a CTE that aggregates monthly revenue, then a LAG() window function to reference the prior month. The month-over-month metric is (this_month – last_month) / last_month, rounded with ROUND(…, 2). A good solution notes how to handle January’s null prior value and suggests indexes on transaction_date for efficiency.
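A sketch of that structure, assuming a hypothetical transactions table with transaction_date and an amount column (the numeric cast guards against integer division):

WITH monthly AS (
    SELECT DATE_TRUNC('month', transaction_date) AS month,
           SUM(amount) AS revenue
    FROM transactions
    GROUP BY 1
)
SELECT month,
       ROUND((revenue - LAG(revenue) OVER (ORDER BY month))::numeric
             / NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 2) AS mom_growth
FROM monthly
ORDER BY month;   -- the first month has no prior value, so mom_growth is NULL there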
You can calculate the sample means, variances, and counts for category 9 versus the rest in a single grouped sub-query. Using those aggregates, the t-statistic follows Welch’s formula, and degrees of freedom use the Satterthwaite approximation. Candidates should explain why unequal-variance t-tests are appropriate for price data and mention filtering out returns or outliers. Explicitly stating assumptions (independence, approximate normality) shows statistical rigor.
Solutions typically partition logs by user_id, url, compute consecutive-day streak lengths with DATEDIFF and a difference-of-row-number trick, then flag streaks ≥ 7. The final ratio divides distinct users with any qualifying streak by total distinct users. Edge cases—missing days, multiple streaks, time-zone issues—should be discussed. Rounding to 0.XX reflects the requirement to output a proportion, not a percentage string.
Draft a query that counts SMS confirmation replies on 28 Feb 2020, grouped by carrier and country.
The key is joining sms_sends (filtered to confirmations sent on the target date) with confirmers on phone number and ensuring each user’s latest confirmation message is matched. A window function (ROW_NUMBER() OVER (PARTITION BY phone ORDER BY sent_time DESC)) isolates the latest SMS. Aggregating by carrier, country delivers the counts. Addressing duplicate confirmations and adding a composite index on (phone, sent_time) demonstrate production awareness.
Which SQL logic samples every fourth record from a transaction table ordered by timestamp?
After ordering by transaction_date, you can assign row numbers and select rows where row_number % 4 = 0. An alternative is NTILE() if proportional sampling is acceptable. The explanation should touch on why simple LIMIT or random sampling wouldn’t guarantee uniform spacing. Mention that an index on the date column maintains the order without extra sorting on large datasets.
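A sketch of the row-number version, assuming a hypothetical transactions table ordered by transaction_date:

SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY transaction_date) AS rn
    FROM transactions t
) numbered
WHERE rn % 4 = 0;   -- keeps every fourth row in timestamp order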
Join projects with employee_projects, count employees per project, filter out counts = 0, then compute budget / employee_cnt. Ordering by that ratio and limiting to five rows fulfills the requirement. Candidates should note why COUNT(DISTINCT employee_id) is safer and how to handle null budgets. Discussing an index on project_id in both tables shows performance mindfulness.
How would you identify the five users with the longest continuous daily-visit streaks?
Use a gaps-and-islands technique: for each user, subtract the row number from the visit date to group consecutive days, then COUNT(*) streak lengths within each group. Ordering by streak length DESC and limiting to five yields the answer. Including tie-breaking logic (e.g., earliest streak start) demonstrates completeness. Indexing by (user_id, visit_date) prevents full table scans.
A CASE expression encodes the rules: month = July overrides all, East region bumps to Premium, and amount ≥ 2000 defaults to Premium otherwise. Wrapping the CASE in a sub-query and then grouping by region, category provides totals. The explanation should highlight test-first thinking—verifying logic with sample rows—and date filtering to 2023. Discussing how this feeds dashboards for bonus planning connects to stakeholder needs.
Aggregate revenue by year, identify MIN(year) and MAX(year) with window functions or CTEs, then sum those two years’ revenue and divide by total revenue. Rounding via ROUND(…, 2) meets formatting rules. Explaining why including both years in one query avoids multiple scans, and touching on currency conversion if the years span rate changes, exhibits practical savvy.
The solution begins with a sub-query (or CTE) that aggregates each product’s revenue as SUM(price * quantity) and divides by COUNT(*) to obtain an average per sale. Joining this back to the products table provides access to product_price for the comparison. The outer query filters for rows where product_price > avg_transaction_total, and both numeric columns are wrapped in ROUND(…, 2) for presentation. Candidates should note that a window function variant can avoid re-scanning the table and that indexes on product_id and the transactional foreign key keep the query performant.
First, a CTE named top_ads pulls the three lowest id values, which represent the most-popular campaigns. The answer then UNIONs four SELECT COUNT(*) statements—each performing a different join—to create tidy output columns join_type and row_cnt. An insightful explanation clarifies why an INNER join returns exactly three rows, a LEFT join returns N (all ads), a RIGHT join again returns three, and a CROSS join multiplies the two set sizes. Discussing how anti-join filters or ad inactivation flags might influence real-world counts shows business savvy.
Begin by joining the projects table to employees (via employee_projects) to fetch annual salary totals. The prorated salary for each project is salary_total * (project_duration_days / 365), which you can compute inside a CTE. Comparing that amount to the project budget drives a CASE statement that emits the status label. A good discussion covers assumptions (employees on one project at a time), handles null end dates (ongoing projects), and suggests maintaining salary history in a slowly changing dimension.
The cleanest tactic unions the per-year tables into a CTE tagged with a year column, then leverages conditional aggregation: SUM(CASE WHEN year=2021 THEN total_sales END) AS sales_2021, etc. If the database supports dynamic SQL, a generated pivot scales to future years automatically. Good answers highlight the maintenance burden of one-table-per-year schemes and recommend folding data into a single partitioned fact table to simplify analytics and indexing.
A window function approach uses SUM(sales * weight) over a three-row frame: 0.5 * sales for the current row, 0.3 * LAG(sales,1), and 0.2 * LAG(sales,2). Filtering with WHERE LAG(sales,2) IS NOT NULL ensures that only rows with two predecessors are output. Explanations should mention choosing ORDER BY sale_date in the PARTITION BY product_id clause, discuss handling gaps in dates, and note that materializing daily sales in a pre-aggregated table avoids expensive on-the-fly computations.
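A sketch of that window, assuming a hypothetical daily_sales table with product_id, sale_date, and sales (one row per product per day):

SELECT product_id, sale_date, weighted_avg
FROM (
    SELECT product_id, sale_date,
           0.5 * sales
         + 0.3 * LAG(sales, 1) OVER w
         + 0.2 * LAG(sales, 2) OVER w AS weighted_avg
    FROM daily_sales
    WINDOW w AS (PARTITION BY product_id ORDER BY sale_date)
) t
WHERE weighted_avg IS NOT NULL;   -- rows without two prior days drop out automatically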
How many daily active users (DAU) does each platform have for every day in 2020, and which SQL logic would you use to calculate it?
A solid answer counts distinct user_id values per date and platform, grouping by the truncated date (event_date) after normalizing time zones. Candidates should propose a composite index on (event_date, platform) to avoid a full scan. They can mention that one user with multiple sessions in a day still counts once, so deduplication at the subquery level is important. Stronger responses also note how holidays or daylight-saving changes might affect the midnight cut-off.
How would you generate a monthly customer summary showing user counts, transaction counts, and total order amounts for 2020?
The query joins users to transactions, truncates order_date to the first of the month, and aggregates with COUNT(DISTINCT user_id) and SUM(order_total). A window or common-table expression keeps the logic clean and reusable for dashboards. Good explanations discuss excluding customers with zero spend to avoid inflated averages. Candidates should also mention rounding currency and adding a WITH ROLLUP row for company-wide totals if desired.
Which customers registered in 2020 spent more than $100 in their first 30 days, and how would you count them?
An efficient plan first calculates each customer’s registration date, then limits the transactions scan to a 30-day window using DATE_ADD or BETWEEN. A subquery or CTE aggregates spending and applies a HAVING SUM(amount) > 100 filter. Because each customer is counted once, the final result is a simple COUNT(*). Discussion of calendar vs. rolling 30-day windows shows analytical care.
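A sketch of that plan, assuming hypothetical users (user_id, registration_date) and transactions (user_id, amount, created_at) tables:

SELECT COUNT(*) AS qualifying_customers
FROM (
    SELECT u.user_id
    FROM users u
    JOIN transactions t
      ON t.user_id = u.user_id
     AND t.created_at >= u.registration_date
     AND t.created_at <  u.registration_date + INTERVAL '30 days'
    WHERE u.registration_date >= DATE '2020-01-01'
      AND u.registration_date <  DATE '2021-01-01'
    GROUP BY u.user_id
    HAVING SUM(t.amount) > 100
) spenders;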
What is the three-month retention rate for each monthly cohort and subscription plan, and how would you structure that SQL?
Cohort analysis begins by tagging every new subscriber with a start_month (e.g., DATE_TRUNC('month', signup_date)). Conditional aggregation—SUM(CASE WHEN month_diff = 3 THEN 1 END)—counts users active at month 3. Dividing by the cohort size yields the retention rate; wrapping the math in a CTE keeps the query tidy. Candidates who note edge cases like mid-month cancellations and leap-year offsets show attention to detail.
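One hedged way to structure it, assuming hypothetical subscriptions (user_id, plan, signup_date) and user_activity (user_id, activity_month, already truncated to the month) tables:

WITH cohorts AS (
    SELECT user_id, plan,
           DATE_TRUNC('month', signup_date) AS cohort_month
    FROM subscriptions
),
per_user AS (
    SELECT c.cohort_month, c.plan, c.user_id,
           MAX(CASE WHEN a.activity_month = c.cohort_month + INTERVAL '3 months'
                    THEN 1 ELSE 0 END) AS active_month_3
    FROM cohorts c
    LEFT JOIN user_activity a ON a.user_id = c.user_id
    GROUP BY c.cohort_month, c.plan, c.user_id
)
SELECT cohort_month, plan,
       ROUND(SUM(active_month_3)::numeric / COUNT(*), 2) AS month_3_retention
FROM per_user
GROUP BY cohort_month, plan
ORDER BY cohort_month, plan;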
How would you reset a running total of new-user sign-ups at the start of every month and output the daily cumulative count?
A window function such as SUM(1) OVER (PARTITION BY year_month ORDER BY signup_date) perfectly fits the requirement. This avoids subqueries that reprocess data for each date. Indexing signup_date accelerates the sort phase, especially for backfills. Mentioning how to cope with missing days (e.g., weekends) signals thoroughness.
How can you compute a three-day rolling average of daily deposits from a bank-transactions table?
Use AVG(deposit_amount) OVER (ORDER BY txn_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW). Unlike RANGE, the ROWS frame counts physical rows, so each average covers at most the current day and the two prior recorded days, even if dates are skipped. Excluding non-deposit transactions with a WHERE clause keeps the window small. Interviewers expect a note on handling the first two days that lack enough history.
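A sketch, assuming a hypothetical bank_transactions table with txn_date, txn_type, and amount columns:

SELECT txn_date,
       AVG(deposit_amount) OVER (
           ORDER BY txn_date
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_3day_avg
FROM (
    SELECT txn_date, SUM(amount) AS deposit_amount
    FROM bank_transactions
    WHERE txn_type = 'deposit'        -- keep the window over deposits only
    GROUP BY txn_date
) daily
ORDER BY txn_date;                    -- the first two rows average over fewer than three days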
How would you randomly sample one row from a very large table without causing performance issues?
Database-native sampling methods (TABLESAMPLE, LIMIT 1 on an auto-generated random key, or ORDER BY RAND() only on a sub-range) are preferred. A hash of the primary key can create a pseudo-random but repeatable slice for debugging. The candidate should caution against a full random sort, which can force a table scan. Including ways to seed the randomness for A/B experiments shows practical insight.
Which employees joined the company before their respective managers, and how would you find them?
Join the employees table to itself (or to a managers table) on manager_id, then filter where e.hire_date < m.hire_date. Adding an index on manager_id speeds the join substantially. Edge cases—such as interim managers or vacant positions—need explicit handling with left joins. Returning both employee and manager names improves report readability.
What is the average quantity purchased per product per year, and which ordering should the query follow?
Aggregate with AVG(quantity) after grouping by product_id and YEAR(order_date). Sorting by order_year then product_id keeps the timeline intuitive. Rounding to two decimals guards against fractional quantities due to returns or partial units. For time-zone safety, the query can cast timestamps to UTC before extracting the year.
Who manages the largest team, and what join strategy identifies manager-employee counts quickly?
Counting employees per manager_id using a straight GROUP BY is simplest, but creating a derived table of counts then joining back retrieves the manager’s details in one pass. Adding a filtered index (manager_id, employee_id) can improve performance on huge org charts. The candidate should also mention ties and how to return all co-leaders if team sizes match.
Which age-decade groups achieved the highest click-through rates (CTR) in 2021, and how would you rank tied groups by older age?
Create a decade bucket with FLOOR(age/10)*10, then calculate CTR = clicks / impressions. A DENSE_RANK() ordered by CTR DESC, decade DESC ensures ties favor older cohorts. Discussing filters for bots or zero-impression rows shows real-world thinking. Results limited to the top three keep dashboards concise.
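A sketch, assuming a hypothetical impressions table with user_age, a clicked flag (0/1), and a shown_at timestamp:

WITH by_decade AS (
    SELECT FLOOR(user_age / 10) * 10 AS age_decade,
           SUM(clicked)              AS clicks,
           COUNT(*)                  AS impressions
    FROM impressions
    WHERE shown_at >= DATE '2021-01-01'
      AND shown_at <  DATE '2022-01-01'
    GROUP BY 1
)
SELECT age_decade,
       ROUND(clicks::numeric / impressions, 4) AS ctr
FROM by_decade
ORDER BY clicks::numeric / impressions DESC, age_decade DESC   -- ties favor older decades
LIMIT 3;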
How would you build a histogram of comment counts per user for January 2020, including a “0” bucket for users with no comments that month?
Left-join an aggregated January comment count onto the full users list to capture zeros. Group the resulting counts and tally users per bucket, adding an explicit WHEN cnt IS NULL THEN 0 case. The explanation should mention indexing comment_date and optionally generating contiguous bins via a calendar table.
How many lifetime song plays occurred per user per date, and which SQL pattern retrieves this efficiently?
Group by user_id, play_date and COUNT(*) AS plays for each row. If the schema stores milliseconds, truncating to the date prevents duplicate rows. Partition pruning on play_date keeps scans fast. Windowing over user_id could add running totals for richer metrics.
How would you produce a table of unique origin–destination city pairs from flight records, ignoring duplicate orderings?
Normalize each pair with LEAST(origin, destination) and GREATEST(origin, destination), then SELECT DISTINCT. This technique collapses NYC → LAX and LAX → NYC into the same row. A compound index on (origin, destination) aids de-duplication. The result can feed route-popularity analytics.
How many users logged in exactly the same number of times on New Year’s Day 2022, and what aggregation steps would you use?
Compute per-user login counts on 2022-01-01, then group those counts and tally how many users share each. Ordering by the count or by frequency helps spot unusual spikes. Indexing (user_id, login_date) accelerates the first aggregation. Candidates can also discuss null-safety if some users lack any activity that day.
Which customers downgraded from a paid plan to a free plan within 60 days of sign-up, and how would you list the downgrade date next to each customer?
A sound solution joins the subscriptions history table to itself, matching a paid tier row to the first subsequent free-tier row for the same customer_id. The query filters sign-up dates to the last year, constrains the time gap with DATEDIFF(free.start_date, paid.start_date) <= 60, and selects the minimum free-tier date. Using window functions (ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY start_date)) cleanly captures the initial downgrade event. Strong answers also justify composite indexes on (customer_id, start_date) to speed both the self-join and the date filter.
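A sketch of the self-join, assuming a hypothetical subscriptions table with customer_id, plan_type ('paid' or 'free'), and start_date, and treating the paid plan's start date as the sign-up date:

SELECT s.customer_id,
       MIN(f.start_date) AS downgrade_date      -- first free-tier row after the paid plan
FROM subscriptions s
JOIN subscriptions f
  ON f.customer_id = s.customer_id
 AND f.plan_type   = 'free'
 AND f.start_date  > s.start_date
 AND f.start_date <= s.start_date + INTERVAL '60 days'
WHERE s.plan_type = 'paid'
GROUP BY s.customer_id;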
How would you calculate the average daily cart-abandonment rate for each product category last quarter, and which SQL techniques avoid double-counting visitors?
The logic begins by counting unique session_ids that added items to a cart (cart_events) and those that eventually purchased (order_events) within the same session and category. The abandonment rate is 1 – purchases / carts, aggregated by DATE_TRUNC('day', event_time) and category_id. A common table expression keeps cart and purchase counts separate, then a left join aligns sessions correctly, preventing duplicates from multi-item orders. Candidates should discuss date-dimension joins for missing days and recommend indexing (session_id, event_time) to accelerate both scans.
For more advanced roles, interview questions for technical business analysts often move beyond syntax and into query efficiency, architecture understanding, and how well you partner with data engineering. These questions may involve writing layered queries with CTEs, optimizing joins on large datasets, or modeling business logic into SQL pipelines. If you’re interviewing as a senior business systems analyst, be ready to explain your thought process, validate assumptions, and demonstrate how you use SQL to drive impact—not just extract data.
Which advertising channel generated the first-touch leading to conversion, and how would you rank channels by the number of first-touch conversions? – Two tables log every session (attribution) and map sessions to users (user_sessions). For users who eventually convert, isolate their very first session, capture its channel, and aggregate counts to rank channels. A performant solution uses a CTE with ROW_NUMBER() over (PARTITION BY user_id ORDER BY session_start) and filters to rn = 1, then counts channels. Candidates should clarify that only users with at least one conversion = true are in scope, discuss indexing on (user_id, session_start), and mention why left-joining prevents dropped channels with zero conversions.
Does any user have overlapping subscription date ranges, and how could you output a Boolean result per user? – Each record contains user_id, start_date, and end_date for completed subscriptions. Self-join the table on identical user_id where a.start_date < b.end_date and b.start_date < a.end_date and a.id <> b.id; any hit marks that user as overlapping. Returning all users with a has_overlap flag involves DISTINCT user_id from the join and a left join back to the full user list. Candidates should note that using LAG()/LEAD() over start_date avoids an O(n²) join on large data sets.
How would you surface the five product pairs most frequently bought together by the same customer? – Aggregate each user’s distinct products into sets, explode them into unordered pairs via a self-join or ARRAY unnest, group by (least(p1,p2), greatest(p1,p2)), count co-occurrences, and pick the top five with p2 alphabetically first for ties. Efficient implementations pre-deduplicate (user_id, product_id) to cut pair generation size and may leverage approximate counts or partitioned tables for billion-row scale. Discussion of bloom filters, partition pruning, and batch processing shows senior-level thinking.
Which January-2020 sign-ups moved more than $100 of successful volume in their first 30 days? – Join users (filtered to January-2020 signups) with payments twice—once as sender, once as recipient—keeping payment_state = 'success' and payment_date within 30 days of signup, sum amounts per user, and count those above $100. A union of the two roles preserves sign, and windowing by user_id plus a date filter ensures correctness. Good answers highlight indexes on (user_id, payment_date) and justify unions over OR conditions for planner efficiency.
What does the cumulative distribution of comment counts per user look like, by single-comment buckets? – First gather comment totals per user_id, then aggregate counts of users per num_comments, order the buckets, and compute a running fraction SUM(users)/total_users using a window function such as SUM(count) OVER (ORDER BY num_comments). If comment counts are sparse, generating a series with a left join fills gaps for smoother plots. The answer should mention why CUME_DIST() or PERCENT_RANK() can also be used and the importance of indexing the comments table on user_id.
How would you report annual retention rates for a yearly-billed SaaS product? – Determine each customer’s cohort year via their first payment, then for every subsequent year compute SUM(is_renewed)/COUNT(*) within that cohort to get retention. A self-join or window function retrieves the first year, and a pivot produces a tidy year-by-year grid. Discussing churn definitions, handling partial years, and creating a composite index on (customer_id, payment_year) demonstrates scalable thinking.
How many push notifications does a typical converting user receive before purchase, and what is the full distribution? – Join notification_deliveries to converting users where delivery_time < conversion_date, count notifications per user, then aggregate those counts into a histogram. The query produces push_count and num_users, enabling analysts to study drop-off curves. Explanations should include indexing suggestions, handling users who never convert via a left exclusion, and potential need for caps on extreme counts.
Produce a day-by-day cumulative new-user count that resets every month. – Use COUNT(*) OVER (PARTITION BY DATE_TRUNC('month', signup_date) ORDER BY signup_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) to compute the running total for each month. If date gaps exist, join to a calendar table to emit zero-signup days. Candidates should note that partitioning on the truncated month and ordering on signup_date leverages an index effectively.
Is click-through rate significantly related to human relevance ratings in search results, and what data will you return to test the hypothesis? – Group search events by rating, calculate ctr = SUM(has_clicked)/COUNT(*), and return counts plus CTR per rating to enable a chi-square or logistic regression. Including confidence intervals or filtering by position helps control confounders. Strong answers mention that a rating-CTR relationship might be spurious without controlling for rank or query intent and suggest exporting raw aggregates for statistical testing offline.
How would you calculate the rolling twelve-month customer lifetime value (CLV) for every active user, updating the metric at the end of each calendar month?
A strong answer joins transactional revenue to a calendar dimension, aggregates spend per user per month, then uses SUM(amount) OVER (PARTITION BY user_id ORDER BY month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) to create a continuously updated 12-month CLV window; handling users with fewer than twelve months of history, excluding refunds, and indexing on (user_id, month) should also be covered.
Given a fact table of page views with session_id, how would you flag sessions that contained at least one error page and then compute error-session rate by device type?
The candidate should aggregate page types per session_id, create a Boolean has_error with MAX(CASE WHEN page_type='error' THEN 1 END), join to the sessions dimension for device_type, and calculate error_sessions/total_sessions; discussion should include partial indexes on page_type, potential session sampling, and separating bot traffic.
Suppose your orders table occasionally receives late-arriving records. How would you backfill weekly revenue snapshots without double-counting existing rows?
A typical approach loads late rows into a staging table, then uses INSERT … ON CONFLICT (or a merge) keyed on the natural primary key to upsert into the snapshot; for historical deltas, a windowed delete/insert or EXCEPT-based diff ensures idempotency; efficient partitions on order_week keep the maintenance query performant.
How could you identify the earliest date on which each user took all three required onboarding actions (complete_profile, upload_avatar, first_post)?
Use conditional MIN(CASE WHEN action='complete_profile'…END) to capture dates, GREATEST() to find the last of the three, filter out users missing any action, and index the events table on (user_id, action) for speed; the result supports a funnel-completion analysis.
Design a query that detects price‐elastic products by correlating daily price changes with daily unit-sales changes over the past year.
The solution pairs each day with the previous day via LAG(price) and LAG(units), computes percent deltas, joins price and unit changes per product, and returns Pearson correlation or slope via covariance formulas; discussion should cover seasonality controls, minimum-volume filters, and why correlations could be skewed by promotions.
How would you compute the median duration between first site visit and first purchase for every marketing channel?
Determine each user’s first visit (MIN(visit_time)) and first purchase (MIN(purchase_time)), take the difference, then apply PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY lag_days) grouped by channel; highlight that medians resist outliers and mention cohort size requirements.
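A Postgres-flavored sketch, assuming hypothetical visits (user_id, channel, visit_time timestamp) and purchases (user_id, purchase_time timestamp) tables:

WITH first_visit AS (
    SELECT DISTINCT ON (user_id) user_id, channel, visit_time
    FROM visits
    ORDER BY user_id, visit_time                 -- keeps each user's earliest visit
),
first_purchase AS (
    SELECT user_id, MIN(purchase_time) AS purchase_time
    FROM purchases
    GROUP BY user_id
)
SELECT fv.channel,
       PERCENTILE_CONT(0.5) WITHIN GROUP (
           ORDER BY EXTRACT(EPOCH FROM fp.purchase_time - fv.visit_time) / 86400
       ) AS median_days_to_first_purchase
FROM first_visit fv
JOIN first_purchase fp ON fp.user_id = fv.user_id
GROUP BY fv.channel;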
Given an orders table with many-to-one mapping to shipments, calculate the share of orders that required multiple shipments and the average extra fulfillment cost per multi-shipment order.
Aggregate COUNT(DISTINCT shipment_id) per order_id, flag >1 shipments, compute cost deltas between summed shipment costs and original shipping fee, and summarize; indexing on (order_id, shipment_id) and window functions speed the classification.
How would you build a KPI that reports “active products” by week, where active means at least five units sold and viewed by 20 distinct users in the same week?
Create two sub-queries: weekly unit sales and weekly user views, join them on (product_id, week), apply thresholds, and count qualifying products; the explanation should note date truncation, composite indexes, and incremental materialization for dashboards.
Write a query to label each customer’s payment behavior as always on time, sometimes late (<25% late), or frequently late (≥25% late). How would you validate the thresholds?
Compute late_flag per invoice (due_date < paid_date), window AVG(late_flag) per customer, classify with CASE, and test sensitivity with histograms or quantiles; candidates should explain why analytic functions avoid subqueries and how to expose threshold tuning to stakeholders.
Your table of mobile events has separate lat/lon columns. How could you bucket events into 10-kilometer grid cells and return the top 10 busiest cells each day?
Using a simple equirectangular approximation, derive integer grid keys (FLOOR(lat/Δlat), FLOOR(lon/Δlon)), concatenate as a cell id, group by cell and event_date, count events, then ROW_NUMBER() per day to pick the top 10; explanation should mention spatial indexing approaches like H3 or PostGIS for production.
How would you detect outlier spikes in average order value (AOV) per region using only SQL?
Calculate daily AOV per region, then within each region compute rolling mean and standard deviation over a 30-day window via AVG() and STDDEV_SAMP(), flag days where AOV > mean + 3×std; discuss partitioning strategy and why z-score logic must consider low-volume days.
Write a recursive CTE that expands an employee hierarchy to find the total payroll cost of each manager’s entire reporting tree.
The recursive term repeatedly joins employees to itself through manager_id, aggregates salary down the tree, and finally groups by manager; efficiency points include indexing (manager_id) and limiting recursion depth with cycle detection.
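A sketch of that recursion, assuming a hypothetical employees table with employee_id, manager_id, and salary and an acyclic hierarchy:

WITH RECURSIVE tree AS (
    -- anchor: each manager paired with their direct reports
    SELECT manager_id AS root_manager_id, employee_id, salary
    FROM employees
    WHERE manager_id IS NOT NULL
    UNION ALL
    -- recursive step: walk down to indirect reports
    SELECT t.root_manager_id, e.employee_id, e.salary
    FROM tree t
    JOIN employees e ON e.manager_id = t.employee_id
)
SELECT root_manager_id AS manager_id,
       SUM(salary)     AS reporting_tree_payroll   -- excludes the manager's own salary
FROM tree
GROUP BY root_manager_id
ORDER BY reporting_tree_payroll DESC;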
Given daily inventory snapshots, compute days-of-supply (inventory ÷ trailing-30-day sales) for each SKU and highlight items below seven days.
Trailing sales are obtained with SUM(units_sold) OVER (PARTITION BY sku ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW), inventory comes from the snapshot, then a CASE WHEN filter selects low-coverage SKUs; commentary should address missing days and why a calendar table might be needed.
How would you write a query that surfaces the top three categories whose share of total revenue grew the most quarter-over-quarter?
Aggregate revenue by category and quarter, compute QoQ change with LAG(), derive growth rate (curr - prev)/prev, order, and limit; mention that small-base categories could distort percentages and propose adding a minimum revenue floor.
Propose a SQL solution for identifying “churn” as 45 consecutive inactive days after the last activity and then plotting monthly churned-user counts.
Using LEAD(activity_date) per user, find gaps ≥45 days or nulls, treat the gap start as churn date, extract the month, count distinct users, and group; senior candidates cite how a surrogate “future” date simplifies last-row handling, and suggest incremental processing to avoid full scans.
How would you isolate the first date on which each customer crossed their personal lifetime-spend milestone of $1,000 and then trend the count of “high-value” customers by week?
A complete answer derives cumulative spend per customer with a running SUM(amount) window, filters the first row where that sum ≥ 1000 using QUALIFY or a ROW_NUMBER() = 1, truncates to week boundaries, and aggregates counts; the discussion should mention indexing on (customer_id, purchase_date), why cumulative calculations benefit from clustering by date, and how analysts use the resulting cohort curve to gauge monetization speed.
Suppose product prices can change multiple times a day. How could you compute revenue at the price-in-effect when the order was placed, given separate orders and price_history tables?
The solution performs a range join (timestamp band join) between orders.order_time and price_history intervals via ON orders.product_id = price_history.product_id AND orders.order_time BETWEEN start_ts AND end_ts, multiplies quantity by price, and indexes (product_id, start_ts, end_ts) to avoid a brute-force scan; explanation covers why slowly-changing-dimension type-2 tables make the logic maintainable.
Design a query that flags “re-activated” users—accounts that had no activity for ≥ 60 consecutive days and then returned—together with the length of their dormant period.
Use LEAD(activity_date) to find gaps, filter those ≥ 60 days, compute gap length, and capture the return date; candidates should explain edge cases when the gap is at dataset end, and why a composite (user_id, activity_date) index improves the window scan.
How would you quantify funnel leakage by finding the proportion of users who viewed a promotion but never added to cart within the following 48 hours?
Join event logs to themselves via (user_id) where add_to_cart.time BETWEEN view.time AND view.time + INTERVAL '48 HOURS', then use an anti-join (NOT EXISTS) or left join with null check, finally compute leakage rate; the explanation should touch on window grouping and choosing composite keys on (user_id, event_time).
Write a query that returns the 95th-percentile order fulfillment time (warehouse scan → carrier pickup) for each fulfillment center and month. What indexing strategy keeps it fast?
The calculation requires time differences per order, then PERCENTILE_CONT(0.95) grouped by center, month; efficient execution depends on partitioning by center and sorting by carrier_pickup_ts so that percentiles use block pruning rather than full scans.
Given a table of coupon redemptions, how could you detect coupons that exhibit statistically significant day-of-week redemption bias?
Aggregate redemptions per coupon and weekday, compute Chi-square expected counts, and flag coupons whose p-value < 0.05; the answer should cover building a contingency table in SQL, the risk of low expected counts, and materializing intermediate aggregates for performance.
How would you compute the total “exposed” ad impressions that occurred before a purchase event within the same session, handling sessions with multiple purchases?
Use a window that orders events by timestamp inside each session, mark the first purchase row, then sum impressions where event_time < first_purchase_time; mention that storing events in a session-level array or MAP can accelerate the query but sacrifices flexibility.
Write a query that labels each marketing campaign as new, scaled, or sunset based on MoM spend change thresholds of +50% and -50%. How would you keep the classification table self-updating?
Join monthly spend to LAG(spend) per campaign, compute rate, categorize with CASE, and implement the logic inside a view or incremental merge task; explanation should note boundary-month exclusion and the benefit of partitioning spend tables by month.
How would you surface the top five “sticky” features—events whose users return at least four days in a seven-day window more than 30 % of the time?
Derive a rolling seven-day activity count per user-feature pair, flag counts ≥4, calculate ratio to total active users of that feature, order and limit; efficiency relies on partial pre-aggregation of daily actives and composite keys (feature, user_id, event_date).
Construct a query that outputs for every supplier the variance in unit cost across their SKUs and flags suppliers whose variance is above the 80th percentile for the peer group.
Group by supplier_id computing VAR_POP(cost), then compare to PERCENTILE_CONT(0.8) OVER (); the answer should mention why high variance might indicate inconsistent pricing and how to leverage a materialized supplier-cost summary.
You have sessions and page_views. How could you compute bounce rate and median session duration in the same query without double-scanning either table?
Join page_views aggregated per session (count pages and max-min timestamps) to sessions, calculate bounce flag and duration, then use conditional aggregation to get bounce rate plus PERCENTILE_CONT(0.5) for duration; discuss memory trade-offs of pushing pre-aggregation into a CTE.
How would you detect “near duplicates” in a products catalog by comparing normalized titles that differ by only one Levenshtein edit?
A performant pattern hashes titles to a phonetic or n-gram key, self-joins keys with small Hamming distance, then computes LEVENSHTEIN(title1, title2) to filter ≤1; explanation addresses why blocking keys limit O(n²) explosion and possibilities for incremental duplicate detection using triggers.
Write a query that estimates the 28-day retention curve for users acquired in a specific paid campaign, displaying Day 0 through Day 28 retention percentages in wide format.
The solution builds a cohort date for each user, joins daily logins, calculates DATEDIFF(login_day, cohort_day) as day_n, pivots with COUNT(DISTINCT user_id) over day_n, and divides by cohort size; highlight why wide pivots ease dashboard integration but explode columns.
How could you compute the incremental revenue attributable to an upsell feature by matching treated and control users on propensity score deciles inside SQL?
Store modeled propensity scores, assign deciles via NTILE(10), aggregate revenue by decile and treatment_flag, compute difference, and sum; discuss balancing diagnostics, treating deciles as fixed effects, and exporting the matched dataset for deeper causal analysis.
Design a SQL procedure that automatically rescales an orders fact table’s surrogate key sequence once it nears integer overflow, ensuring zero downtime. What edge cases must you watch?
The procedure checks MAX(id) vs. upper limit, creates a new sequence, alters default, and backfills unsequenced staging rows inside a transaction; the explanation should cover replication lag, foreign-key dependencies, lock escalation, and monitoring alerts so downstream ETLs stay consistent.
How would you detect seasonal dips in customer activity by computing a z-score of weekly active users for every week in the past two years?
The query first aggregates distinct user_ids by ISO-week to get weekly active users (WAU), then adds a window that calculates the overall mean and standard deviation across all weeks; dividing WAU minus mean by the standard deviation yields a z-score that highlights unusually low (or high) weeks. Discussing the solution should cover why ISO weeks avoid partial weeks at year boundaries, the need to order by week in the window, and how analysts interpret |z| > 2 as statistically meaningful seasonality without exporting data to Python.
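A sketch of the z-score query, assuming a hypothetical events table with user_id and event_date:

WITH weekly AS (
    SELECT DATE_TRUNC('week', event_date) AS week,   -- ISO weeks in Postgres
           COUNT(DISTINCT user_id)        AS wau
    FROM events
    WHERE event_date >= CURRENT_DATE - INTERVAL '2 years'
    GROUP BY 1
)
SELECT week, wau,
       ROUND((wau - AVG(wau) OVER ())
             / NULLIF(STDDEV_SAMP(wau) OVER (), 0), 2) AS z_score
FROM weekly
ORDER BY week;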
Suppose executives want to know the share of same-day repeat purchases: how could you compute the percentage of orders placed by customers who had already ordered earlier that same day?
A typical answer self-joins orders on customer_id where the second order’s order_ts falls on the same calendar date and is greater than the first, then uses EXISTS or COUNT(DISTINCT ...) to flag customers with ≥ 2 orders that day and divides by total daily customers. The explanation should mention indexing by (customer_id, order_ts) so the self-join remains scalable and why time-zone normalization is critical for global stores.
How would you rank product categories by YoY revenue growth, but only for categories whose absolute revenue increased by at least $2M?
The solution aggregates revenue by category and year, pivots or self-joins current to prior year, calculates both the dollar delta and the percentage growth, filters on delta ≥ 2,000,000, and orders by growth rate. A sound explanation references using COALESCE for categories absent in the prior year and suggests using materialized year-category rollups to accelerate repeated YoY dashboards.
Write a query that finds the median time (in minutes) from checkout to first customer-support ticket and segments the result by subscription tier. How would you keep it performant on a 100 M-row tickets table?
Join orders to tickets on customer_id, compute the time difference, and apply PERCENTILE_CONT(0.5) within a PARTITION BY tier; partitioning the tickets table by created_date and clustering on customer_id minimizes I/O, while a covering index on (customer_id, created_ts) avoids table scans.
How could you calculate the ratio of uninfluenced conversions—orders placed by users who had zero ad impressions in the 48 hours prior to checkout—over total conversions each day?
Left join the orders fact table to ad-impression logs on (user_id) and a BETWEEN checkout_ts-48h AND checkout_ts condition; orders with no matching impressions are uninfluenced. Aggregate daily counts and compute the ratio. The explanation addresses why left anti-join semantics isolate unimpressed users and how a composite key on (user_id, impression_ts) helps the range join.
Design a query that surfaces the five SKUs whose share of total refunds spiked the most month-over-month (MoM) in the last quarter.
Aggregate refund counts by sku and month, calculate each SKU’s share of total monthly refunds, then use LAG(share) to compute MoM change and ORDER BY change DESC LIMIT 5. A thorough answer mentions excluding SKUs with very low volume to avoid noise, choosing month-end snapshots to prevent partial-month bias, and indexing refunds on (refund_date, sku) for quick month filtering.
Preparing for SQL interviews as a business analyst means going beyond textbook syntax. It’s about learning how to extract insights from data in a business context. Whether you’re brushing up on basic SQL interview questions for business analyst roles or diving into more advanced SQL queries for interview settings, the key is structured, focused practice tied to real-world use cases. Below are essential strategies to help you succeed.
Most business analyst SQL interviews focus on foundational topics—joins, filtering, aggregations, and sorting. These are the building blocks of analytics work and will show up in any screening or live coding assessment.
You’ll also need GROUP BY, COUNT, DISTINCT, and subqueries—especially when slicing customer data or calculating metrics like conversion rates or retention. If you’re rusty, prioritize the basics. A clear grasp of these concepts is more valuable than memorizing obscure functions.
Don’t just rely on abstract textbook problems. To understand how SQL is used in business, download public datasets (e.g., sales data, marketing performance, churn reports) and simulate the types of queries you’d write in a real BA role.
This kind of context-rich practice helps you translate SQL logic into business value—exactly what hiring managers want to see.
Many interviews now test your ability to think through scenarios rather than just write a correct query.
These scenario prompts aren’t just technical questions—they’re about logic, prioritization, and business relevance. Prepare for both basic SQL interview questions for business analyst roles and case-style prompts that require deeper reasoning and interpretation.
To simulate real interview pressure, use platforms like Interview Query that provide role-specific SQL challenges under timed conditions.
This structured practice will help you improve both speed and accuracy—two key factors in passing SQL screens.
It’s easy to get caught up in syntax, but don’t forget: your job as a business analyst is to use SQL to solve business problems.
Reflecting on the business problem behind each query sharpens your judgment and makes you more confident in answering scenario-based and SQL for business analytics questions.
By combining technical fluency with business thinking, you’ll be well-equipped to handle everything from entry-level SQL for business analyst questions to more advanced case scenarios.
While SQL skills are critical, landing a business analyst role also requires demonstrating strong communication, stakeholder alignment, and problem-solving capabilities. Employers are increasingly seeking well-rounded candidates who not only know how to extract insights—but also how to present them, influence decisions, and shape business outcomes.
Understanding the broader business analyst responsibilities helps you stand out. These typically include gathering and refining requirements, aligning with cross-functional teams, and turning raw data into actionable narratives. That’s why behavioral questions like “Tell me about a time you influenced a business decision” or “Describe a time you managed conflicting priorities” are so common in interviews.
You should also prepare your own thoughtful questions to ask a business analyst on the team or the hiring manager.
Good questions signal curiosity and strategic thinking—two traits every hiring manager wants to see. Use this phase of the interview to show that you’re not just technically capable, but also business-savvy and collaborative. For more tips, check out our guide on business analyst questions and answers, which covers both behavioral and technical interview prep.
SQL is no longer optional for business analysts—it’s a core skill that hiring managers test rigorously, especially in 2025’s increasingly data-driven business environment. Whether you’re applying to a traditional enterprise or a fast-moving tech startup, mastering these SQL interview questions for business analysts will give you a decisive edge.
Continue sharpening your skills through targeted SQL practice problems, mock interviews that simulate real scenarios, and revisiting your resume to highlight analytics projects and SQL proficiency. For more preparation help, check out our SQL practice sets, mock interview platform, and SQL learning path. Each step you take will bring you closer to confidence—and offers—in your next business analyst role.