Connecting to Vertica from Spark

So you have a lot of data in Vertica, and you want to run analytics beyond what's easily expressible in vSQL, at scale, without writing nasty C++ UDFs; or perhaps you already have a lot of data sitting in HDFS to join against.

Enter Spark.

1. Grab the Vertica JDBC driver and Hadoop connector JARs from the Vertica support portal and put them on your Spark classpath (e.g. via ADD_JARS)

2. Use something like the following class. Note that query refers to sc, the SparkContext that spark-shell provides in scope:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce._
import org.apache.spark.rdd.RDD
import com.vertica.hadoop._

// Wraps the Vertica Hadoop connector so that a SQL query can be exposed as an RDD.
// Relies on sc, the SparkContext that spark-shell puts in scope.
class Vertica(val hostnames: String,
              val database: String,
              val username: String,
              val password: String,
              val port: String = "5433") extends Serializable {

    // Build a Hadoop Configuration carrying the Vertica connection settings.
    def configuration: Configuration = {
        val conf = new Configuration
        conf.set("mapred.vertica.hostnames", hostnames)
        conf.set("mapred.vertica.database", database)
        conf.set("mapred.vertica.username", username)
        conf.set("mapred.vertica.password", password)
        conf.set("mapred.vertica.port", port)
        conf
    }

    // Run a SQL query against Vertica and return the result rows as an RDD.
    def query(sql: String): RDD[VerticaRecord] = {
        val job = new Job(configuration)
        job.setInputFormatClass(classOf[VerticaInputFormat])
        VerticaInputFormat.setInput(job, sql)

        // Each element is a (LongWritable, VerticaRecord) pair; keep only the record.
        sc.newAPIHadoopRDD(job.getConfiguration,
                           classOf[VerticaInputFormat],
                           classOf[LongWritable],
                           classOf[VerticaRecord])
          .map(_._2)
    }
}

3. VoilĂ !

val vertica = new Vertica("my-node-1,my-node-2,my-node-3", "my-db", "username", "password")
val v:RDD[VerticaRecord] = vertica.query("select date, category, sum(amount) from my_transaction_table group by date, category;")
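The resulting RDD can then be transformed like any other. As a rough sketch of unpacking the three projected columns — this assumes VerticaRecord exposes a positional get(i) accessor returning the column value; check the connector's javadoc for the exact API before relying on it:

```scala
// Hypothetical sketch: extract (date, category, total) tuples from each record.
// Assumes VerticaRecord.get(i) returns the i-th column value as an Object;
// verify the accessor's name and indexing against the connector's javadoc.
val totals = v.map { r =>
  (r.get(0).toString,             // date
   r.get(1).toString,             // category
   r.get(2).toString.toDouble)    // sum(amount)
}
totals.take(10).foreach(println)
```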

One thought on “Connecting to Vertica from Spark”

  1. Ben Duffield says:

    n.b. an example way of adding the appropriate jars (working around a slight bug in the most recent spark whereby jars in ADD_JARS don’t automatically end up on the classpath)

MASTER='spark://my-spark-master:7077' ADD_JARS='hadoop-vertica.jar,vertica-jdbc-7.0.0-0.jar' SPARK_CLASSPATH='hadoop-vertica.jar:vertica-jdbc-7.0.0-0.jar' ./bin/spark-shell
