I’m working with a Windows 10 host system that has an Ubuntu 18.04 VM running through VirtualBox. My SQL Server instance is running on the Windows host, and I have Apache Spark 2.4.0 set up on the Ubuntu virtual machine.
Currently I can connect to my database from Windows using pandas like this:
import pandas as pd
import pyodbc
# Connection settings
host = 'localhost'
database_name = 'SalesDB'
# Establish connection
connection = pyodbc.connect('DRIVER={SQL Server};SERVER=' + host + ';DATABASE=' + database_name + ';Trusted_Connection=yes')
# SQL query
query = """
SELECT *
FROM [SalesDB].[dbo].[CUSTOMER_DATA]
"""
# Execute and fetch data
result_df = pd.read_sql(query, connection, chunksize=5000)
Is there a way to connect to this same SQL Server database from my Ubuntu VM using PySpark instead? What configuration steps would I need to follow to make this work?
Any help would be appreciated!
wait, are you using port 1433 (the default sql server port)? also, can your vm actually ping the windows host ip? virtualbox networking can be finicky with host connections.
Connecting to a SQL Server database from your Ubuntu VM using PySpark is certainly possible. You’ll need to use the Microsoft JDBC driver instead of ODBC, so make sure to download the driver (mssql-jdbc-x.x.x.jre8.jar) and place it in your Spark directory. Remember to use the IP address of your Windows host, as you’ll be connecting from the VM. In your PySpark code, you can create a SparkSession and then use DataFrameReader with the format set to ‘jdbc’. Ensure that your JDBC URL is correctly pointing to the Windows IP, and that you configure your connection properties to enable integrated security. Finally, confirm that your SQL Server instance allows remote connections and that the Windows Firewall allows traffic through port 1433.
hey, dont forget to enable TCP/IP in SQL Server Config Manager first. then, get the MSSQL JDBC driver jar, and add it to your Spark classpath like this: spark = SparkSession.builder.config("spark.jars", "/path/to/mssql-jdbc.jar").getOrCreate()
. also, update your connection string to use the Windows host IP instead of localhost.