Redis connection error when writing PySpark DataFrame on Windows - connection gets reset

I’m having trouble with my PySpark code that writes data to Redis. It works fine on Ubuntu but fails on Windows 11 with connection reset errors.

My setup:

Windows 11
Spark 4.0.0
Anaconda Python 2024.02
VS Code

Here’s my code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct
import redis

def write_redis_data(partition_rows):
    # One client per partition: executors can't reuse a driver-side client,
    # so the connection has to be created inside the partition function
    client = redis.Redis(host='127.0.0.1', port=6379, db=0)
    try:
        for record in partition_rows:
            redis_key = f"employee:{record['emp_id']}"
            client.set(redis_key, record['json_data'])
    finally:
        client.close()

def run_job():
    spark = SparkSession.builder \
        .appName("Redis Writer") \
        .master('local[*]') \
        .getOrCreate()
    
    records = [(101, "John", 28), (102, "Jane", 32), (103, "Mike", 27)]
    dataframe = spark.createDataFrame(records, ["emp_id", "name", "age"])
    
    json_df = dataframe.withColumn("json_data", to_json(struct("emp_id", "name", "age")))
    
    json_df.select("emp_id", "json_data").rdd.map(lambda row: row.asDict()).foreachPartition(write_redis_data)
    
    spark.stop()

if __name__ == "__main__":
    run_job()

The error I get:

java.io.IOException: Connection reset by peer
    at java.base/sun.nio.ch.SocketDispatcher.write0(Native Method)
    at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:532)
    at org.apache.spark.api.python.BasePythonRunner$ReaderInputStream.writeAdditionalInputToPythonWorker

The logs show Python worker crashes and task failures. This only happens on Windows - the exact same code runs perfectly on Ubuntu 22.04. Is there some Windows-specific Spark configuration I’m missing? Any help would be appreciated.

Interesting issue - what's your Redis connection timeout set to? Windows networking can be quirky compared to Linux. Which redis-py version are you on? And are you running Redis locally or in Docker?

Had the same issue last month. Try adding socket_keepalive=True (and socket_keepalive_options if you need finer control) to your Redis connection - Windows handles TCP connections differently from Linux. Also check whether Windows Defender or your firewall is blocking local connections.
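For reference, a sketch of what that client construction might look like. The timeout and health-check values here are illustrative assumptions, not recommended defaults; I've left out socket_keepalive_options because its keys are platform-specific TCP constants that don't all exist on Windows:

```python
# Keyword arguments for redis.Redis with keepalive enabled, per the
# suggestion above. All numeric values are illustrative assumptions.
def keepalive_redis_kwargs():
    return {
        "host": "127.0.0.1",
        "port": 6379,
        "db": 0,
        "socket_keepalive": True,      # ask the OS to probe idle connections
        "socket_timeout": 10,          # fail fast instead of hanging on a dead socket
        "socket_connect_timeout": 5,
        "retry_on_timeout": True,      # retry the command once after a timeout
        "health_check_interval": 30,   # ping before reusing a long-idle connection
    }

# Usage (assumes redis-py is installed):
# client = redis.Redis(**keepalive_redis_kwargs())
```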

This is a classic Windows networking issue with PySpark executors. On Windows, Spark can't fork Python workers the way it does on Linux; it spawns fresh worker processes for partition processing, and sockets behave differently across that boundary. I've hit this exact problem migrating data pipelines from Linux to Windows dev environments.

Here's what fixed it for me:

- Set spark.sql.execution.arrow.pyspark.enabled=false to disable Arrow-based columnar transfers.
- Set spark.python.worker.reuse=true so workers aren't torn down between tasks (it's the default, but worth confirming nothing has overridden it).
- Reuse one Redis connection per worker process. Note that you can't ship a connection pool through a broadcast variable - live sockets aren't picklable. Instead, create the client lazily inside the worker and cache it so every partition handled by that worker shares it.

The connection resets happen because Windows closes idle sockets more aggressively than Linux, especially when multiple Python workers compete for the same Redis endpoint.
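A minimal sketch of the per-worker caching idea - get_client and its factory argument are hypothetical names I'm introducing here, not PySpark or redis-py API:

```python
import os

# Cache one client per Python worker process. Keyed by PID so a freshly
# spawned worker builds its own client instead of inheriting a stale socket.
_client_cache = {}

def get_client(factory):
    """Return a cached client for this process, building it on first use.

    `factory` is any zero-argument callable, e.g.
    lambda: redis.Redis(host='127.0.0.1', port=6379, db=0).
    """
    pid = os.getpid()
    if pid not in _client_cache:
        _client_cache[pid] = factory()
    return _client_cache[pid]

# The partition function would then become something like:
# def write_redis_data(partition_rows):
#     client = get_client(lambda: redis.Redis(host='127.0.0.1', port=6379, db=0))
#     for record in partition_rows:
#         client.set(f"employee:{record['emp_id']}", record['json_data'])
```

With spark.python.worker.reuse=true, the same worker process handles many partitions, so the factory runs once per worker instead of once per partition.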