I’m having trouble with my PySpark code that writes data to Redis. It works fine on Ubuntu but fails on Windows 11 with connection reset errors.
My setup:
Windows 11
Spark 4.0.0
Anaconda Python 2024.02
VS Code
Here’s my code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct
import redis

def write_redis_data(partition_rows):
    # One Redis connection per partition; each worker writes its own rows
    client = redis.Redis(host='127.0.0.1', port=6379, db=0)
    for record in partition_rows:
        redis_key = f"employee:{record['emp_id']}"
        redis_value = record['json_data']
        client.set(redis_key, redis_value)
    client.close()

def run_job():
    spark = SparkSession.builder \
        .appName("Redis Writer") \
        .master('local[*]') \
        .getOrCreate()

    records = [(101, "John", 28), (102, "Jane", 32), (103, "Mike", 27)]
    dataframe = spark.createDataFrame(records, ["emp_id", "name", "age"])

    # Serialize each row to a JSON string, then write key/value pairs per partition
    json_df = dataframe.withColumn("json_data", to_json(struct("emp_id", "name", "age")))
    json_df.select("emp_id", "json_data") \
        .rdd.map(lambda row: row.asDict()) \
        .foreachPartition(write_redis_data)

    spark.stop()

if __name__ == "__main__":
    run_job()
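For reference, on Ubuntu the job completes and Redis ends up with keys roughly like this (the values are just to_json of each row):

employee:101 -> {"emp_id":101,"name":"John","age":28}
employee:102 -> {"emp_id":102,"name":"Jane","age":32}
employee:103 -> {"emp_id":103,"name":"Mike","age":27}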
The error I get:
java.io.IOException: Connection reset by peer
at java.base/sun.nio.ch.SocketDispatcher.write0(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:532)
at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.writeAdditionalInputToPythonWorker
The logs show Python worker crashes and task failures. This only happens on Windows - the exact same code runs perfectly on Ubuntu 22.04. Is there some Windows-specific Spark configuration I’m missing? Any help would be appreciated.
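In case it helps, if the fix turns out to be a Spark config or a Python-interpreter setting, this is where I would apply it in my script. The specific settings below are only guesses on my part, not things I have confirmed fix anything:

import os
import sys
from pyspark.sql import SparkSession

# Guess, not a confirmed fix: make Spark's Python workers use the same
# interpreter as the driver (my Anaconda python) instead of whatever it finds
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = (
    SparkSession.builder
    .appName("Redis Writer")
    .master("local[*]")
    # another guess: disable worker reuse so a crashed worker isn't handed more input
    .config("spark.python.worker.reuse", "false")
    .getOrCreate()
)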