How Spark does Class Loading

Using the spark shell, one can define classes on-the-fly and then use these classes in your distributed computation.

Contrived Example
scala> class Vector2D(val x: Double, val y: Double) extends Serializable {
| def length = Math.sqrt(x*x + y*y)
| }
defined class Vector2D
scala> val sourceRDD = sc.parallelize(Seq((3,4), (5,12), (8,15), (7,24)))
sourceRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[5] at parallelize at :13
scala> sourceRDD.map(x => new Vector2D(x._1, x._2)).map(_.length).collect()
14/03/30 09:21:59 INFO SparkContext: Starting job: collect at :17
...snip...
res1: Array[Double] = Array(5.0, 13.0, 17.0, 25.0)

In order for the remote executors here to actually run your code, they must have knowledge of the Vector2D class, yet they’re running on a different JVM (and probably different physical machine). How do they get it?

  • we choose a directory on disk to store the class files
  • a virtual directory is created at SparkIMain:101
  • a scala compiler is instantiated with this directory as the output directory at SparkIMain:299
  • this means that whenever a class is defined in the REPL, the class file is written to disk
  • a http server is created to serve the contents of this directory at SparkIMain:102
  • we can see info about the Http server in the logs:
    14/03/23 23:39:21 INFO HttpFileServer: HTTP File server directory is /var/folders/8t/bc2vylk13j14j13cccpv9r6r0000gn/T/spark-1c7fbed7-5c87-4c2c-89e8-be95c2c7ac54
    14/03/23 23:39:21 INFO Executor: Using REPL class URI: http://192.168.1.83:61626
  • the http server url is stored in the Spark Config, which is shipped out to the executors
  • the executors install a URL Classloader, pointing at the Http Class Server at Executor:74

For the curious, we can figure out what the url of a particular class is and then go check it out in a browser/with curl.

def urlOf[T:ClassTag] = {
   val clazz = implicitly[ClassTag[T]].erasure
   Seq(sc.getConf.get("spark.repl.class.uri"),
       clazz.getPackage.getName,
       clazz.getName).mkString("/")
}

Do it yourself

It’s pretty trivial to replicate this ourselves – in Spark’s case we have a scala compiler which writes the files to disk, but assuming we want to serve classes from a fairly normal JVM with a fairly standard classloader, we don’t even need to bother with the to disk. We can grab the class file using getResourceAsStream. It also doesn’t require any magic of scala – an example class server in java using Jetty:

class ClasspathClassServer {
	private Server server = null;
	private int port = -1;

	void start() throws Exception {
		System.out.println("Starting server...");
		if(server != null) {
			return;
		}

		server = new Server();
		NetworkTrafficSelectChannelConnector connector = new NetworkTrafficSelectChannelConnector(server);
		connector.setPort(0);
		server.addConnector(connector);

		ClasspathResourceHandler classpath = new ClasspathResourceHandler();

		server.setHandler(classpath);
		server.start();

		port = connector.getLocalPort();
		System.out.println("Running on port " + port);
	}

	class ClasspathResourceHandler extends AbstractHandler {
		@Override
		public void handle(String target, Request baseRequest, HttpServletRequest request, HttpServletResponse response)
					throws IOException, ServletException {
			System.out.println("Serving target: " + target);

			try {
				Class<?> clazz = Class.forName(target.substring(1));
				InputStream classStream = clazz.getResourceAsStream('/' + clazz.getName().replace('.', '/') + ".class");

				response.setContentType("text/html;charset=utf-8");
				response.setStatus(HttpServletResponse.SC_OK);
				baseRequest.setHandled(true);

				OutputStream os = response.getOutputStream();

				IOUtils.copy(classStream, os);
			} catch(Exception e) {
				System.out.println("Exception: " + e.getMessage());
				baseRequest.setHandled(false);
			}
		}
	}
}

It’s then just a matter of setting up a URL Classloader on the other side!

Further Examples

http://quoiquilensoit.blogspot.com/2008/02/generic-compute-server-in-scala-with.html: An example of using a similar technique to write a ‘compute server’ in scala – somewhat akin to a very stripped down version of Spark.

Advertisements
How Spark does Class Loading

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s