I’m stuck trying to update an Oracle table with PySpark. I can’t use spark.sql()
directly on the external database. The table already exists, so I can’t use overwrite mode. I tried this:
if new_records.count() > 0:
    new_records.write.jdbc(mode='append', url=db_url, table='my_table', properties=oracle_props)
else:
    update_query = """
        UPDATE my_table
        SET col_a = new_value_a,
            col_b = new_value_b
        WHERE some_condition
    """
    spark.sql(update_query)  # This doesn't work
I get an error saying the table can’t be found, even though I can read from and append to it. What’s the best way to update an Oracle table with PySpark? Can I read the data, update it in a DataFrame, and write back only the changes? Any ideas would be great!
Hey there! Have you considered using a MERGE operation? It's pretty handy for updates. You could do something like:
new_records.createOrReplaceTempView("new_data")
spark.sql("MERGE INTO my_table USING new_data ON…")
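For reference, a fuller sketch of that MERGE is below; the key column id and the value columns col_a/col_b are hypothetical, and it assumes my_table is registered with a catalog that supports MERGE INTO (for example a Delta Lake or Iceberg table), since Spark SQL won't run MERGE against a plain JDBC source.

# Sketch only: hypothetical columns id, col_a, col_b; my_table must live in a
# catalog that supports MERGE INTO (e.g. Delta Lake), not a raw JDBC relation.
new_records.createOrReplaceTempView("new_data")
spark.sql("""
    MERGE INTO my_table AS t
    USING new_data AS s
    ON t.id = s.id                      -- hypothetical join key
    WHEN MATCHED THEN UPDATE SET t.col_a = s.col_a, t.col_b = s.col_b
    WHEN NOT MATCHED THEN INSERT *
""")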
This way you can update existing rows and insert new ones in one go. Let me know if you need more help!
One approach you might consider is using the JDBC batch update functionality. This method allows you to perform bulk updates efficiently without processing row by row. You can begin by reading the existing data from the Oracle table into a DataFrame and applying the necessary modifications. Next, compute the differences between the original and updated data. Finally, use the JDBC batch update to commit these changes back to the database.
This approach generally results in better performance for large data sets and provides more granular control over the update process. Remember to manage your JDBC connection carefully and consider connection pooling for optimal performance.
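A minimal sketch of that flow is below. The key column id and value columns col_a/col_b are hypothetical, updated stands in for the DataFrame holding your modified data, and it assumes the Oracle JDBC driver is on the driver's classpath and that oracle_props carries the user and password; the changed rows are collected to the driver and pushed back as one JDBC batch through the py4j gateway.

from pyspark.sql import functions as F

# Read the current state of the Oracle table and keep only rows whose values changed.
existing = spark.read.jdbc(url=db_url, table='my_table', properties=oracle_props)
changed = (updated.alias('u')
           .join(existing.alias('e'), F.col('u.id') == F.col('e.id'))
           .where((F.col('u.col_a') != F.col('e.col_a')) |
                  (F.col('u.col_b') != F.col('e.col_b')))
           .select(F.col('u.id'), F.col('u.col_a'), F.col('u.col_b')))

rows = changed.collect()  # keep this modest; the batch is built and run on the driver

# Push the changes back with a single JDBC batch UPDATE.
conn = spark._jvm.java.sql.DriverManager.getConnection(
    db_url, oracle_props['user'], oracle_props['password'])
conn.setAutoCommit(False)
stmt = conn.prepareStatement("UPDATE my_table SET col_a = ?, col_b = ? WHERE id = ?")
try:
    for row in rows:
        stmt.setString(1, row['col_a'])
        stmt.setString(2, row['col_b'])
        stmt.setInt(3, row['id'])
        stmt.addBatch()
    stmt.executeBatch()  # one round trip for the whole batch instead of one per row
    conn.commit()
finally:
    stmt.close()
    conn.close()

Collecting only the changed rows keeps the driver-side batch small; for very large diffs you would want to run the batches inside foreachPartition with a database driver available on the executors instead.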
Hmm, interesting problem! Have you tried using a JDBC connection to execute the update query directly? Something like:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Open a plain JDBC connection through the JVM that backs the Spark driver
conn = spark._jvm.java.sql.DriverManager.getConnection(db_url, user, password)
stmt = conn.createStatement()
stmt.executeUpdate(update_query)  # runs the UPDATE statement directly in Oracle
stmt.close()
conn.close()
This bypasses the spark.sql() issue because the statement runs directly in Oracle rather than through Spark's catalog. What do you think? Have you explored any other approaches?