How to combine two datasets with mismatched company, address, and zip fields in SQL, Spark, or Polars?
SELECT t1.id, t2.id FROM TableOne t1 LEFT JOIN TableTwo t2 ON COALESCE(t1.comp, t1.addr, t1.post)=COALESCE(t2.comp, t2.addr, t2.post);
hey, i really like this approach! i wonder if a CASE expression for each field might handle nulls even better than COALESCE. anyone tried a similar method in spark or polars? curious what differences y'all noticed?
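To make the CASE-vs-COALESCE idea concrete outside SQL, here's a small stdlib Python sketch (the helper name `match_key` and the `"|"` separator are illustrative assumptions). A CASE-style rule can treat blanks and whitespace as missing per field, instead of just taking the first non-null value the way COALESCE does:

```python
def match_key(comp, addr, post):
    """Build a join key field by field, treating None and
    empty/whitespace strings as missing (CASE-style), rather
    than only picking the first non-null value (COALESCE-style)."""
    parts = []
    for value in (comp, addr, post):
        if value is None or not value.strip():
            parts.append("")  # normalize missing to an empty marker
        else:
            parts.append(value.strip().lower())
    return "|".join(parts)

# Rows match when their keys agree, even with different "kinds" of missing:
print(match_key("Acme Corp", None, "12345"))   # acme corp||12345
print(match_key(" ACME CORP ", "", "12345"))   # acme corp||12345
```

Joining on a composite key like this also avoids the subtle COALESCE pitfall where a company name in one table accidentally equals an address in the other.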
hey, i tried regex-based extraction on messy fields for matching in spark and polars, kinda like fuzzy matching but more controlled. ever given that a try? what are you thinking about its pros or cons in your experience?
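As a sketch of that regex extraction idea, here's a stdlib-only Python example that pulls a 5-digit US zip out of a free-text field before matching. The pattern and the assumption of US-style zips are illustrative; the same approach ports to Spark's `regexp_extract` or Polars' `str.extract`:

```python
import re

# Match a 5-digit zip, optionally followed by a +4 suffix we discard.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\b")

def extract_zip(raw):
    """Return the first 5-digit zip found in a messy field, or None."""
    if raw is None:
        return None
    m = ZIP_RE.search(raw)
    return m.group(1) if m else None

print(extract_zip("Acme Corp, 12 Main St, Springfield 62704-1234"))  # 62704
print(extract_zip("no zip here"))                                    # None
```

The upside over pure fuzzy matching is determinism: the same input always yields the same key. The downside is that the regex only finds what it was written to expect.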
hey, for spark i'm used to using a broadcast join to handle inconsistent columns, kinda speeds things up. in polars i tested fuzzy join functions which proved quite useful even when data doesn't match perfectly. i'd say try them and tweak as needed
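Polars itself doesn't ship a fuzzy join, so the comment above presumably leans on an external matcher. For a rough, dependency-free picture of what such a join scores under the hood, here's a sketch using stdlib `difflib` (standing in for a dedicated library like rapidfuzz; the 0.85 threshold and the O(n*m) scan are illustrative, not production choices):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(left_names, right_names, threshold=0.85):
    """Naive all-pairs fuzzy pairing: keep each left name's best
    right-side candidate if it clears the threshold."""
    pairs = []
    for l in left_names:
        best = max(right_names, key=lambda r: similarity(l, r))
        if similarity(l, best) >= threshold:
            pairs.append((l, best))
    return pairs

left = ["Acme Corp", "Globex LLC"]
right = ["ACME Corp.", "Initech Inc"]
print(fuzzy_match(left, right))  # [('Acme Corp', 'ACME Corp.')]
```

At scale you'd block on a cheap exact key (e.g. zip) first so the fuzzy comparison only runs within small candidate groups.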
Based on my experience, a flexible pre-processing step is essential when dealing with mismatched fields. I found that normalizing data by applying functions for trimming, standardizing case, and handling nulls significantly improved join accuracy. For example, constructing a temporary view that harmonizes company names, addresses, and zip codes can provide a cleaner dataset before performing joins. This approach allowed me to use more precise match criteria in SQL and ensured that subsequent operations in Spark or Polars worked reliably even with inconsistent source data.
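A minimal, dependency-free sketch of that normalization step (the specific rules, trim, lowercase, collapse whitespace, map blanks to None, are assumptions to tune against your own data). The same logic translates directly to a SQL view, Spark column expressions, or Polars string namespaces:

```python
import re

def normalize(value):
    """Trim, lowercase, collapse internal whitespace, and map
    empty strings to None so blanks and nulls join consistently."""
    if value is None:
        return None
    cleaned = re.sub(r"\s+", " ", value.strip().lower())
    return cleaned or None

def normalize_row(row):
    """Apply normalize() to every field in a record dict."""
    return {k: normalize(v) for k, v in row.items()}

print(normalize_row({"comp": "  ACME   Corp ", "addr": "", "post": "12345"}))
# {'comp': 'acme corp', 'addr': None, 'post': '12345'}
```

Running this over both tables before the join means the match criteria can stay exact and simple, which is usually cheaper and easier to debug than fuzzy matching on raw fields.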
hey guys, i ended up using a custom transformation script before joining. normalizing address & company data by unifying case and trimming spaces often gives better match output than just using coalesce. check your data patterns when tuning fuzzy comparison funcs.