I have a function that loads a large dataset and does some convenience cleaning. It looks something like this:
import pandas as pd

DEFAULT_COLS = ['A', 'B', 'C']

def load_data(path, cols=DEFAULT_COLS):
    df = pd.read_csv(path, usecols=cols)
    df = _clean_A(df)
    df = _clean_B(df)
    return df

def _clean_A(df):
    # code that references column A and does some cleaning specific to that
    # column
    df['A'] = df['A'].str.replace(' ', '_')
    return df

def _clean_B(df):
    # code that references column B and does some cleaning specific to that
    # column
    df['B'] = df['B'].some_cleaning()
    return df
The problem I see with this design is a built-in dependency on loading columns 'A' and 'B'; without them the cleaning functions raise an error. The cols parameter is therefore 'lying'. Any ideas on how to call a cleaning step only when that function's dependencies are met? I considered a try/except solution, but it seemed 'smelly' because each function (and any future dev) would need to 'know' to raise an error (see below). I also considered 'registering' functions but have not come up with a good way to do that. Any advice, suggestions, or Google keywords would be much appreciated.
# pseudo code for try/except solution
def load_data(path, cols=DEFAULT_COLS):
    ...
    clean_funcs = [_clean_A, _clean_B]
    for func in clean_funcs:
        try:
            df = func(df)
        except KeyError:  # or some other custom error to not mask real KeyError
            pass
    return df