Sunday, March 27, 2022

How to manage options in PySpark more efficiently

Let us consider the following PySpark code:

my_df = (spark.read.format("csv")
                   .option("header", "true")
                   .option("inferSchema", "true")
                   .load(my_data_path))

This is a relatively small snippet, but sometimes we have code with many options, and passing them as raw strings frequently causes typos. We also don't get any suggestions from our code editors. As a workaround, I am thinking of creating a named tuple (or a custom class) to hold all the options I need. For example:

from collections import namedtuple

# Field values are the literal option names expected by the reader
allOptions = namedtuple("allOptions", "csvFormat header inferSchema")
sparkOptions = allOptions("csv", "header", "inferSchema")

my_df = (spark.read.format(sparkOptions.csvFormat)
                   .option(sparkOptions.header, "true")
                   .option(sparkOptions.inferSchema, "true")
                   .load(my_data_path))

I am wondering if there are downsides to this approach, or if there is a better, more standard approach used by other PySpark developers.
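For reference, two patterns other PySpark developers commonly reach for are unpacking a dict with DataFrameReader.options(**kwargs) and calling the typed csv() reader method, whose keyword arguments editors can autocomplete. A minimal sketch, reusing spark and my_data_path from the snippets above:

csv_options = {"header": "true", "inferSchema": "true"}

# Unpacking a dict keeps every option name in a single place:
df_from_dict = (spark.read.format("csv")
                          .options(**csv_options)
                          .load(my_data_path))

# The csv() method exposes options as keyword arguments, so code
# editors can autocomplete them and flag typos:
df_from_csv = spark.read.csv(my_data_path, header=True, inferSchema=True)

Note that the dict still uses string keys, so it centralizes rather than eliminates the typo risk; the keyword-argument form is the one an editor can actually check.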
