I’m working on converting a SQL query to work with Apache Pig on Hadoop but I’m stuck on something tricky. Usually converting queries is straightforward but this one has me confused.
Here’s what I’m dealing with:
SELECT x.col1, y.col2, y.col3
FROM orders x, products y
WHERE(
(
x.col1 = y.col2
AND
(
x.col2 = y.col1
OR
(
(x.col2 = 'value1' OR x.col2 = 'value2')
AND
(y.col1 is null OR y.col1 = 'value3' OR y.col1 = 'value4')
)
)
)
OR
(
x.col3 = y.col4
AND
(
x.col4 = y.col3
OR
(
(x.col4 = 'value5' OR x.col4 = 'value6')
AND
(y.col3 is null OR y.col3 = 'value7' OR y.col3 = 'value8')
)
)
)
)
GROUP BY -- additional stuff here
My main questions are:
- How can I break down these complex WHERE conditions with OR logic?
- What’s the best way to handle this kind of join pattern in Pig?
- Would this be easier to solve using Hive instead?
I know Pig doesn’t handle OR conditions well with joins. I’ve checked the Pig documentation but haven’t found a clear solution yet. Any ideas on how to approach this would be really helpful.