How to combine two datasets with mismatched company, address, and zip fields in SQL, Spark, or Polars?
SELECT t1.id, t2.id FROM TableOne t1 LEFT JOIN TableTwo t2 ON COALESCE(t1.comp, t1.addr, t1.post)=COALESCE(t2.comp, t2.addr, t2.post);
hey, i really like this approach! i wonder if a CASE expression for each field might handle nulls even better than COALESCE. anyone tried a similar method in spark or polars? curious what differences y'all noticed?
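To make the CASE-vs-COALESCE idea concrete outside SQL, here's a small stdlib Python sketch (the helper name `match_key` and the `"|"` separator are illustrative assumptions). A CASE-style rule can treat blanks and whitespace as missing per field, instead of just taking the first non-null value the way COALESCE does:

```python
def match_key(comp, addr, post):
    """Build a join key field by field, treating None and
    empty/whitespace strings as missing (CASE-style), rather
    than only picking the first non-null value (COALESCE-style)."""
    parts = []
    for value in (comp, addr, post):
        if value is None or not value.strip():
            parts.append("")  # normalize missing to an empty marker
        else:
            parts.append(value.strip().lower())
    return "|".join(parts)

# Rows match when their keys agree, even with different "kinds" of missing:
print(match_key("Acme Corp", None, "12345"))   # acme corp||12345
print(match_key(" ACME CORP ", "", "12345"))   # acme corp||12345
```

Joining on a composite key like this also avoids the subtle COALESCE pitfall where a company name in one table accidentally equals an address in the other.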
hey, i tried regex-based extraction on messy fields for matching in spark and polars, kinda like fuzzy matching but more controlled. ever given that a try? what are you thinking about its pros or cons in your experience?
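As a sketch of that regex extraction idea, here's a stdlib-only Python example that pulls a 5-digit US zip out of a free-text field before matching. The pattern and the assumption of US-style zips are illustrative; the same approach ports to Spark's `regexp_extract` or Polars' `str.extract`:

```python
import re

# Match a 5-digit zip, optionally followed by a +4 suffix we discard.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\b")

def extract_zip(raw):
    """Return the first 5-digit zip found in a messy field, or None."""
    if raw is None:
        return None
    m = ZIP_RE.search(raw)
    return m.group(1) if m else None

print(extract_zip("Acme Corp, 12 Main St, Springfield 62704-1234"))  # 62704
print(extract_zip("no zip here"))                                    # None
```

The upside over pure fuzzy matching is determinism: the same input always yields the same key. The downside is that the regex only finds what it was written to expect.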
hey, for spark i'm used to using a broadcast join to handle inconsistent columns, kinda speeds things up. in polars i tested fuzzy join functions which proved quite useful even when data doesn't match perfectly. i'd say try them and tweak as needed
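Polars itself doesn't ship a fuzzy join, so the comment above presumably leans on an external matcher. For a rough, dependency-free picture of what such a join scores under the hood, here's a sketch using stdlib `difflib` (standing in for a dedicated library like rapidfuzz; the 0.85 threshold and the O(n*m) scan are illustrative, not production choices):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(left_names, right_names, threshold=0.85):
    """Naive all-pairs fuzzy pairing: keep each left name's best
    right-side candidate if it clears the threshold."""
    pairs = []
    for l in left_names:
        best = max(right_names, key=lambda r: similarity(l, r))
        if similarity(l, best) >= threshold:
            pairs.append((l, best))
    return pairs

left = ["Acme Corp", "Globex LLC"]
right = ["ACME Corp.", "Initech Inc"]
print(fuzzy_match(left, right))  # [('Acme Corp', 'ACME Corp.')]
```

At scale you'd block on a cheap exact key (e.g. zip) first so the fuzzy comparison only runs within small candidate groups.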
Based on my experience, a flexible pre-processing step is essential when dealing with mismatched fields. I found that normalizing data by applying functions for trimming, standardizing case, and handling nulls significantly improved join accuracy. For example, constructing a temporary view that harmonizes company names, addresses, and zip codes can provide a cleaner dataset before performing joins. This approach allowed me to use more precise match criteria in SQL and ensured that subsequent operations in Spark or Polars worked reliably even with inconsistent source data.
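A minimal, dependency-free sketch of that normalization step (the specific rules, trim, lowercase, collapse whitespace, map blanks to None, are assumptions to tune against your own data). The same logic translates directly to a SQL view, Spark column expressions, or Polars string namespaces:

```python
import re

def normalize(value):
    """Trim, lowercase, collapse internal whitespace, and map
    empty strings to None so blanks and nulls join consistently."""
    if value is None:
        return None
    cleaned = re.sub(r"\s+", " ", value.strip().lower())
    return cleaned or None

def normalize_row(row):
    """Apply normalize() to every field in a record dict."""
    return {k: normalize(v) for k, v in row.items()}

print(normalize_row({"comp": "  ACME   Corp ", "addr": "", "post": "12345"}))
# {'comp': 'acme corp', 'addr': None, 'post': '12345'}
```

Running this over both tables before the join means the match criteria can stay exact and simple, which is usually cheaper and easier to debug than fuzzy matching on raw fields.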
hey guys, i ended up using a custom transformation script before joining. normalizing address & company data by unifying case and trimming spaces often gives better match output than just using coalesce. check your data patterns when tuning fuzzy comparison funcs.