I have a main function that sets up the Spark context as follows:
val sparkContext = new SparkContext(sparkConfiguration)
val sparkSqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
import sparkSqlContext.implicits._
Then, I create a DataFrame and apply various filters and validations:
import org.apache.spark.sql.functions.{col, udf}

val roundToFullHour = udf((time: String) => time.substring(0, time.indexOf(':')) + ":00:00")
val dataFrame = sparkSqlContext.read.schema(schemaDefinition).format("com.databricks.spark.csv").load(inputArgs(0))
// drop records with fewer than 3 non-null columns
.na.drop(3)
// convert to hourly timestamps
.withColumn("time", roundToFullHour(col("time")))
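For clarity, the UDF simply truncates everything after the hour. A plain-Scala sketch of the same string logic, assuming the times are formatted as HH:mm:ss (roundToFullHourPlain is just an illustrative name):

// Same string manipulation as the UDF body, without Spark
def roundToFullHourPlain(time: String): String =
  time.substring(0, time.indexOf(':')) + ":00:00"

// "17:45:30" becomes "17:00:00"
assert(roundToFullHourPlain("17:45:30") == "17:00:00")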
This works perfectly. However, when I attempt to move my validation logic to a separate file by sending the DataFrame to:
def ValidateAndTransform(dataFrame: DataFrame): DataFrame = {...}
I run into a problem: in the new file I also need:
import sparkSqlContext.implicits._
This import prevents the error “value $ is not a member of StringContext”, which occurs at:
.withColumn("time", roundToFullHour(col("time")))
To use import sparkSqlContext.implicits._, I must either define sparkSqlContext in the new file:
val sparkContext = new SparkContext(sparkConfiguration)
val sparkSqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
Or pass it into the validation function. It seems my attempt to split the code into two files (main and validation) is not structured correctly.
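For illustration, passing it in would look something like this (the Validation object and the extra parameter are my own sketch, not existing code):

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, udf}

object Validation {
  // sketch: the SQLContext is passed in so its implicits can be imported locally
  def ValidateAndTransform(dataFrame: DataFrame, sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._

    val roundToFullHour = udf((time: String) => time.substring(0, time.indexOf(':')) + ":00:00")

    dataFrame
      // drop records with fewer than 3 non-null columns
      .na.drop(3)
      // convert to hourly timestamps
      .withColumn("time", roundToFullHour(col("time")))
  }
}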
Can anyone suggest how I should design this? Should I simply pass sparkSqlContext to the validation function? Thank you!