Let us consider the following PySpark code:
my_df = (spark.read.format("csv")
.option("header","true")
.option("inferSchema", "true")
.load(my_data_path))
This is a relatively small snippet, but sometimes we have code with many options, and passing the options as string literals frequently causes typos. We also don't get any autocomplete suggestions from our code editors. As a workaround, I am thinking of creating a named tuple (or a custom class, which I sketch further below) to hold all the options I need. For example:
from collections import namedtuple
allOptions = namedtuple("allOptions", "csvFormat header inferSchema")
sparkOptions = allOptions("csv", "header", "inferSchema")
my_df = (spark.read.format(sparkOptions.csvFormat)
.option(sparkOptions.header,"true")
.option(sparkOptions.inferSchema, "true")
.load(my_data_path))
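The custom-class variant I have in mind would look roughly like this (a sketch of my own; the class and field names are made up, and I use a frozen dataclass so the option names cannot be reassigned by accident):

from dataclasses import dataclass

@dataclass(frozen=True)
class CsvOptions:
    # Field values are the literal option strings Spark expects.
    format: str = "csv"
    header: str = "header"
    infer_schema: str = "inferSchema"

opts = CsvOptions()
my_df = (spark.read.format(opts.format)
    .option(opts.header, "true")
    .option(opts.infer_schema, "true")
    .load(my_data_path))

Unlike the namedtuple, this gives the editor real attributes to autocomplete, and a type checker can catch a misspelled field name.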
I am wondering if there are downsides to this approach, or whether there is a better, more standard approach used by other PySpark developers.
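One alternative I have come across is keeping the options in a plain dict and unpacking it into DataFrameReader.options, which accepts keyword arguments. It removes the repeated .option(...) calls, although the keys are still strings, so on its own it does not solve the typo problem:

csv_read_options = {
    "header": "true",
    "inferSchema": "true",
}

my_df = (spark.read.format("csv")
    .options(**csv_read_options)  # options() takes the pairs as keyword arguments
    .load(my_data_path))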