This is more of a design question. I have been tasked with creating a "schema check" function that ingests a CSV in a PySpark Databricks notebook.
I have an expected schema. The way I am planning to do it is:
Step 1) Do the schema check by inferring the schema (spark_read.option('inferSchema', 'true')) and comparing types, column names, and column count against the expected schema. If the schema does not match, the function fails.
Step 2) Load the dataframe applying the expected schema (spark_read.schema(my_schema)), so that it is read with the correct types, and potentially drop corrupted records with DROPMALFORMED. A rough sketch of both steps is below.
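Roughly, this is what I have in mind. It is a minimal sketch only: the path, column names, and expected_schema are placeholders, and spark is the SparkSession that the Databricks notebook provides.

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

    # Placeholder expected schema and input path.
    expected_schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("amount", DoubleType(), True),
    ])
    path = "/mnt/raw/input.csv"

    # Step 1: infer the schema and compare it to the expected one.
    # Note: the comparison is strict, so column order and nullability
    # have to match what inferSchema produces (it marks columns nullable).
    inferred = (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(path))
    if inferred.schema != expected_schema:
        raise ValueError(
            "Schema mismatch: expected %s, got %s" % (expected_schema, inferred.schema)
        )

    # Step 2: reload with the expected schema applied, dropping malformed rows.
    df = (spark.read
          .option("header", "true")
          .option("mode", "DROPMALFORMED")
          .schema(expected_schema)
          .csv(path))

That second spark.read in step 2 is the extra pass over the file I am worried about.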
Does it make sense to do these two steps (which cost a second load of the dataframe), or should I just limit myself to the second one?
Any level of feedback would be much appreciated (whether or not you have a better solution), just so that we understand the limits and potential improvements of this design.