Wednesday, October 7, 2020

Optionally run functions based on function dependencies

I have a function that loads a large dataset and does some convenience cleaning. It looks something like this:

import pandas as pd

DEFAULT_COLS = ['A', 'B', 'C']

def load_data(path, cols=DEFAULT_COLS):
    df = pd.read_csv(path, usecols=cols)

    df = _clean_A(df)
    df = _clean_B(df)
    return df

def _clean_A(df):
    # code that references column A and does some cleaning specific to that
    # column
    df['A'] = df['A'].str.replace(' ', '_')

    return df

def _clean_B(df):
    # code that references column B and does some cleaning specific to that
    # column
    df['B'] = df['B'].some_cleaning()

    return df

The problem I see with this design is that there is a built-in dependency on loading columns 'A' and 'B'; without them the cleaning functions raise an error. The cols parameter is therefore 'lying'. Any ideas on how to call a cleaning step only when that function's dependencies are met? I considered a try/except solution, but it seems 'smelly' because each function (or a future dev) would need to 'know' to raise an error (see below). I also considered 'registering' functions but have not come up with a good way to do that. Any advice, suggestions, or Google keywords would be much appreciated.

# pseudo code for try/except solution
def load_data(path, cols=DEFAULT_COLS):
    ...
    clean_funcs = [_clean_A, _clean_B]
    for func in clean_funcs:
        try:
            df = func(df)
        except KeyError:  # or some other custom error to not mask real KeyError
            pass
    return df
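
For what it's worth, the 'registering' idea might look something like the sketch below. It is only a rough illustration: the `requires` decorator and the `CLEANERS` list are names made up here (they are not part of pandas), and `str.strip()` merely stands in for whatever the real column B cleaning is. Each cleaning function declares the columns it needs, and `load_data` skips any function whose columns were not loaded.

```python
import io

import pandas as pd

DEFAULT_COLS = ['A', 'B', 'C']

# Hypothetical registry: (required columns, cleaning function) pairs.
CLEANERS = []

def requires(*cols):
    """Register the decorated function together with the columns it depends on."""
    def decorator(func):
        CLEANERS.append((frozenset(cols), func))
        return func
    return decorator

@requires('A')
def _clean_A(df):
    df['A'] = df['A'].str.replace(' ', '_', regex=False)
    return df

@requires('B')
def _clean_B(df):
    # Stand-in for the real B-specific cleaning (assumption).
    df['B'] = df['B'].str.strip()
    return df

def load_data(path, cols=DEFAULT_COLS):
    df = pd.read_csv(path, usecols=cols)
    # Run only the cleaners whose required columns were actually loaded.
    for needed, func in CLEANERS:
        if needed.issubset(df.columns):
            df = func(df)
    return df

# Usage with an in-memory CSV: column B is not requested, so _clean_B is skipped.
csv_text = "A,B,C\nx y, z ,1\n"
partial = load_data(io.StringIO(csv_text), cols=['A', 'C'])
full = load_data(io.StringIO(csv_text))
```

With this shape the dependency lives next to the function that has it, so a new cleaning step only needs the decorator and `load_data` never has to catch a `KeyError`.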
