How can I retrieve rows with repeated values in multiple columns while having different entries in another column using SQL?

I’m seeking guidance on formulating an SQL query that extracts rows from larger-than-RAM datasets where two or more columns share identical values but another column has different entries. This question is a follow-up to a previous discussion and pertains specifically to DuckDB. In that discussion, I learned that the getDuplicates() function works well for smaller datasets that fit within RAM, but I now need to adapt the logic to SQL to handle larger data. Here’s a simplified example of the problem:

library(data.table)

customers_data <- data.table(
  Name = c("JOHN DOE", "JANE DOE", "JOHN DOE", "JANE DOE"),
  DateOfBirth = c("01-01-1990", "02-02-1991", "01-01-1990", "02-02-1992"),
  UserID = c(1, 2, 3, 4)
)

# Aim: Select rows where the Name and DateOfBirth are the same, but UserID is different.

In this hypothetical structure, I want to extract records with the same ‘Name’ and ‘DateOfBirth’ but distinct ‘UserID’. Furthermore, I need your advice on how to import the dataset into DuckDB effectively and ensure that the SQL query can process larger datasets efficiently. Your assistance in creating a suitable SQL command would be invaluable.

Hey there! You can use a subquery with GROUP BY and HAVING COUNT(DISTINCT UserID) > 1 to find the (Name, DateOfBirth) pairs that appear with more than one UserID, then JOIN that result back to the table to extract the matching rows. For larger datasets, open a disk-backed (persistent) DuckDB database rather than an in-memory one, so queries can spill to disk and your operations aren’t bounded by available memory.
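Concretely, here’s a minimal sketch of that pattern. I’m using Python’s built-in sqlite3 module purely so the example is self-contained and runnable; the SQL itself is portable and should run unchanged in DuckDB. The table name `customers` is an assumption based on your example data:

```python
import sqlite3

# Build a small table mirroring the data.table example from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (Name TEXT, DateOfBirth TEXT, UserID INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("JOHN DOE", "01-01-1990", 1),
        ("JANE DOE", "02-02-1991", 2),
        ("JOHN DOE", "01-01-1990", 3),
        ("JANE DOE", "02-02-1992", 4),
    ],
)

# Find (Name, DateOfBirth) pairs that occur with more than one UserID,
# then join back to the table to retrieve the full rows.
rows = conn.execute("""
    SELECT c.Name, c.DateOfBirth, c.UserID
    FROM customers AS c
    JOIN (
        SELECT Name, DateOfBirth
        FROM customers
        GROUP BY Name, DateOfBirth
        HAVING COUNT(DISTINCT UserID) > 1
    ) AS d
      ON c.Name = d.Name AND c.DateOfBirth = d.DateOfBirth
    ORDER BY c.UserID
""").fetchall()

print(rows)  # the two JOHN DOE rows (UserID 1 and 3)
```

Only the two JOHN DOE rows come back: the JANE DOE rows share a Name but have different dates of birth, so they don’t qualify.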

When dealing with larger-than-RAM datasets in DuckDB, open a persistent database file instead of an in-memory one; DuckDB’s execution engine can then process queries out of core, spilling intermediate results to disk so large datasets don’t run into memory issues. Use GROUP BY with a HAVING clause to identify the duplicated key combinations, and join the result back to the table to pull out the rows. You can import the data into DuckDB with the COPY ... FROM command or the read_csv function; CSV works fine, though Parquet files are generally faster to scan if you have a choice of format. With this approach you’ll be able to handle bigger datasets efficiently while executing your SQL queries.
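Here’s a sketch of the import-and-query workflow as DuckDB SQL. The file name `customers.csv` and the column names are assumptions based on your example; adjust them to your data:

```sql
-- Open a persistent database (e.g. `duckdb customers.db` from the CLI) so
-- the table lives on disk rather than purely in memory.
CREATE TABLE customers AS
    SELECT * FROM read_csv_auto('customers.csv');
-- Equivalent bulk-load form into an existing table:
-- COPY customers FROM 'customers.csv' (HEADER);

-- Rows whose (Name, DateOfBirth) pair appears with more than one UserID.
SELECT c.*
FROM customers AS c
JOIN (
    SELECT Name, DateOfBirth
    FROM customers
    GROUP BY Name, DateOfBirth
    HAVING COUNT(DISTINCT UserID) > 1
) AS d
  ON c.Name = d.Name AND c.DateOfBirth = d.DateOfBirth;
```

Because the table is backed by a database file, DuckDB can stream the scan and aggregation rather than holding everything in RAM at once.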