Monday, September 11, 2017

Apache Spark distributed SQL

I use Spark's DataFrameReader to run SQL queries against a database, and a SparkSession is required for each query. What I would like to do is: for each of my JavaPairRDDs, perform a map that invokes an SQL query with parameters taken from that RDD. This means I would need to pass the SparkSession into each lambda, which seems like bad design. What is the common approach to this kind of problem?

It could look like:

roots.map(r -> DBLoader.getData(sparkSession, r._1));
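One concrete reason this pattern fails, beyond design taste: SparkSession is not serializable, and functions passed to RDD transformations are serialized and shipped to executors, so a lambda that captures the session breaks at task-serialization time. The snippet below is a minimal plain-Java sketch of that mechanism, independent of Spark; `SessionLike` is a hypothetical stand-in for a non-serializable handle such as SparkSession.

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {
    // Stand-in for SparkSession: a heavyweight, non-serializable handle.
    static class SessionLike {
        String query(String sql) { return "rows for: " + sql; }
    }

    // Returns true if serializing a lambda that captures the session fails.
    static boolean capturingSessionFailsToSerialize() {
        SessionLike session = new SessionLike();
        // A serializable lambda, like the ones Spark ships to executors.
        Runnable task = (Runnable & Serializable) () -> session.query("SELECT 1");
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(task); // Spark does the same to send tasks to executors
            return false;
        } catch (NotSerializableException e) {
            return true; // the captured SessionLike is not Serializable
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("serialization failed: " + capturingSessionFailsToSerialize());
    }
}
```

The same NotSerializableException is what Spark reports (wrapped in a "Task not serializable" error) when a SparkSession is captured in a map lambda.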

How I load data now:

JavaRDD<Row> javaRDD = sparkSession.read().format("jdbc")
            .options(options)
            .load()
            .javaRDD();
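The common approach is to keep every JDBC read on the driver: instead of calling the session inside a map, collect the keys to the driver, issue one read per key, and push the parameter into the query via the `dbtable` option (Spark's JDBC source accepts a parenthesized subquery with an alias in place of a table name). The helper below is a hypothetical sketch that only builds the per-key options map; the URL, table, and column names are placeholders, and the actual `sparkSession.read()` loop is shown in a comment because it needs a live cluster.

```java
import java.util.HashMap;
import java.util.Map;

public class JdbcOptionsBuilder {
    // Builds JDBC options for one key, pushing the filter into the query itself
    // so it runs in the database. All connection details here are placeholders.
    static Map<String, String> optionsFor(String rootId) {
        Map<String, String> options = new HashMap<>();
        options.put("url", "jdbc:postgresql://localhost:5432/mydb");
        // A parenthesized subquery with an alias is accepted by Spark's
        // JDBC source wherever a plain table name would go.
        options.put("dbtable",
                "(SELECT * FROM data WHERE root_id = '" + rootId + "') AS t");
        return options;
    }

    public static void main(String[] args) {
        // Driver-side loop: one read per key, no session inside any lambda.
        // for (String rootId : roots.keys().collect()) {
        //     Dataset<Row> part = sparkSession.read().format("jdbc")
        //             .options(optionsFor(rootId))
        //             .load();
        //     result = (result == null) ? part : result.union(part);
        // }
        System.out.println(optionsFor("42").get("dbtable"));
    }
}
```

In real code the key should be sanitized before being spliced into the subquery, or the filtering expressed through the `predicates` array accepted by `DataFrameReader.jdbc`, which also lets Spark split the read into one partition per predicate.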
