Modifying column values in SQL/Spark-SQL based on grouped record count exceeding one

Hey guys, I’m stuck with a tricky data problem. I’ve got this customer promo dataset where people can opt out. Each customer has their own ID, but some share a house. The snag is when housemates have different promo actions (add, delete, or blank) on the same day.

If one person in the house picks ‘delete’, we need to mark everyone in that house as deleted for the promo. How can I change the promo_action column when there are multiple records with the same household_id and promo_action_date, but different promo_action values? We need to set all to ‘DELETE’ in this case.

Here’s what the table looks like:

customer_id | household_id | promo_action | promo_action_date
101         | 54           | DELETE       | 2024-10-03
157         | 54           | NULL         | 2024-10-03

After the update, it should be:

customer_id | household_id | promo_action | promo_action_date
101         | 54           | DELETE       | 2024-10-03
157         | 54           | DELETE       | 2024-10-03

I tried this query to find the problematic records:

SELECT household_id, promo_action_date, COUNT(*)
FROM promo_action_household
GROUP BY household_id, promo_action_date
HAVING COUNT(*) > 1

But I’m not sure how to actually update the records. Any ideas using SQL or Spark-SQL? Thanks!

hey, try using a window function to flag any DELETE per house and update accordingly. for example:

WITH f AS (
SELECT MAX(CASE WHEN promo_action=‘DELETE’ THEN 1 ELSE 0 END) OVER(PARTITION BY household_id, promo_action_date) AS flag
FROM promo_action_household
)
UPDATE promo_action_household SET promo_action=‘DELETE’ WHERE flag=1.

hope it works

To tackle this issue, you can use a combination of window functions and a self-join in SQL or Spark-SQL. One approach is to create a common table expression (CTE) that identifies households with a ‘DELETE’ action for each date, and then join this CTE back to the original table to update all matching records. For example:

WITH delete_households AS (
  SELECT DISTINCT household_id, promo_action_date
  FROM promo_action_household
  WHERE promo_action = 'DELETE'
)
UPDATE promo_action_household AS pah
SET promo_action = 'DELETE'
FROM delete_households AS dh
WHERE pah.household_id = dh.household_id
  AND pah.promo_action_date = dh.promo_action_date;

This query updates all records for a household on a specific date to ‘DELETE’ if any member of that household has that action, addressing the requirement without needing to count records per household.

hmmm, interesting problem! have you thought about using a subquery to identify households with ‘DELETE’ actions? something like:

UPDATE promo_action_household
SET promo_action = ‘DELETE’
WHERE household_id IN (
SELECT household_id
FROM promo_action_household
WHERE promo_action = ‘DELETE’
)

what do you think? would that work for your data?