Changing the Spark Context of an existing RDD

Spark RDDs are supposed to be Resilient. If something bad happens whilst computing, we can recover! At least, that’s the idea.

scala> val myRdd = sc.parallelize(Seq(1,2,3))
myRdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> sc.stop

If we stop the spark context for any reason, we now find our RDD is useless!

scala> myRdd.first
java.lang.IllegalStateException: SparkContext has been shutdown
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1316)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1339)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1353)
at org.apache.spark.rdd.RDD.take(RDD.scala:1098)

This isn’t good at all! Let’s make a new spark context.

scala> val sc = new org.apache.spark.SparkContext("local[8]", "new context")
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@542beecb

We now need to inject this back into our RDD. The spark context is stored in a private field, so we have to reach for reflection.

val rddClass = classOf[org.apache.spark.rdd.RDD[_]]
val scField = rddClass.getDeclaredField("_sc") // spark context stored in _sc
scField.setAccessible(true) // now we can access it

Now we just set the spark context.

scField.set(myRdd, sc)

Observe that this works!

scala> myRdd.sum
res5: Double = 6.0
scala> myRdd.first
res6: Int = 1

This is quite scary and probably should not be used for anything real. Additionally, if we had an RDD with many dependencies, we'd have to crawl the dependencies and swap the context out in every place (I think).
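
For what it's worth, here is a rough, untested sketch of what that crawl might look like: walk the lineage through each RDD's dependencies and re-point the same private _sc field on every RDD we find. It assumes the field is still called _sc in your Spark version, and that nothing else is caching the old context.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def rewireContext(rdd: RDD[_], newSc: SparkContext): Unit = {
  val scField = classOf[RDD[_]].getDeclaredField("_sc") // same private field as before
  scField.setAccessible(true)
  scField.set(rdd, newSc)
  rdd.dependencies.foreach(dep => rewireContext(dep.rdd, newSc)) // recurse into the lineage
}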

Another approach might be to instead produce a dynamic proxy for the spark context, one which forwards to some true spark context, and then just swap the real context out behind the proxy.
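
I haven't tried this, and there is an immediate catch: SparkContext is a concrete class rather than an interface, so java.lang.reflect.Proxy can't wrap it directly and you'd need something like cglib or ByteBuddy. The forwarding pattern itself looks roughly like this, demonstrated on a made-up Greeter trait rather than the real SparkContext.

import java.lang.reflect.{InvocationHandler, Method, Proxy}

class SwappableHandler(var target: AnyRef) extends InvocationHandler {
  def invoke(proxy: AnyRef, method: Method, args: Array[AnyRef]): AnyRef = {
    val safeArgs = if (args == null) Array.empty[AnyRef] else args
    method.invoke(target, safeArgs: _*) // always forward to whatever target currently is
  }
}

trait Greeter { def greet(name: String): String } // stand-in for the real context

val handler = new SwappableHandler(new Greeter { def greet(n: String) = s"hi $n" })
val proxied = Proxy.newProxyInstance(getClass.getClassLoader, Array[Class[_]](classOf[Greeter]), handler).asInstanceOf[Greeter]
proxied.greet("spark") // "hi spark"
handler.target = new Greeter { def greet(n: String) = s"hello again, $n" } // swap the backing object
proxied.greet("spark") // now forwarded to the new target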

What are we actually trying to do here? If we have a long-running application that allows users to create RDDs, it would be nice to be able to recover from spark cluster bounces. We could keep track of the operations required to produce the RDDs in the first place (which is arguably a better approach), but I decided to spend thirty minutes poking around anyway, and was pleasantly surprised at the (illusion of) progress I made!
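
For completeness, a very rough sketch of that "remember the recipe" approach (the names here are made up): store each RDD as a function of the current context, and rebuild it lazily against whichever context is alive at the time.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class RecoverableRdd[T](recipe: SparkContext => RDD[T]) {
  private var cached: Option[RDD[T]] = None
  def get(sc: SparkContext): RDD[T] = synchronized {
    cached match {
      case Some(rdd) if rdd.sparkContext eq sc => rdd // still built against the live context
      case _ =>
        val rebuilt = recipe(sc) // replay the recipe on the (possibly new) context
        cached = Some(rebuilt)
        rebuilt
    }
  }
}

// e.g. val numbers = new RecoverableRdd(sc => sc.parallelize(Seq(1, 2, 3)).map(_ * 2))
//      numbers.get(sc).sum // works against whichever context we pass in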
