Saturday, May 1, 2021

Design patterns for chaining data transformation methods using pandas

Every month I receive a CSV file with a varying set of columns. Regardless of which columns I receive, I should output a CSV with columns C1, C2, C3, ... C29, C30 where possible, plus a log file with the steps I took.

I know that the order of my data transformations should be t1, t2, t3, t4, t5.

t1 generates columns C8, C9, C12, C22 using C1, C2, C3, C4
t2 generates columns C10, C11, C17 using C3, C6, C7, C8
t3 generates columns C13, C14, C15, C16 using C5, C8, C10, C11, C22
t4 generates columns C18, C19, C20, C21, C23, C24, C25 using C13, C15
t5 generates columns C26, C27, C28, C29, C30 using C5, C19, C20, C21
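The dependency list above can also be written down as plain data, which makes it possible to check that the order t1 to t5 is consistent. This is only a minimal sketch: `TransformSpec`, `cols`, `SPECS`, `RAW` and `check_order` are hypothetical names, and treating C1 to C7 as the raw input columns is an assumption.

```python
from dataclasses import dataclass


def cols(*numbers):
    """Turn 1, 2, 3 into ("C1", "C2", "C3")."""
    return tuple(f"C{n}" for n in numbers)


@dataclass(frozen=True)
class TransformSpec:
    """Hypothetical record: the columns a step reads and the columns it creates."""
    name: str
    inputs: tuple
    outputs: tuple


# The five steps exactly as listed in the text, in execution order.
SPECS = [
    TransformSpec("t1", cols(1, 2, 3, 4), cols(8, 9, 12, 22)),
    TransformSpec("t2", cols(3, 6, 7, 8), cols(10, 11, 17)),
    TransformSpec("t3", cols(5, 8, 10, 11, 22), cols(13, 14, 15, 16)),
    TransformSpec("t4", cols(13, 15), cols(18, 19, 20, 21, 23, 24, 25)),
    TransformSpec("t5", cols(5, 19, 20, 21), cols(26, 27, 28, 29, 30)),
]

# Assumption: C1..C7 are the columns that can only come from the input file.
RAW = set(cols(*range(1, 8)))


def check_order(specs, raw):
    """Return (step, column) pairs where a step reads a column not yet available."""
    available = set(raw)
    problems = []
    for spec in specs:
        problems += [(spec.name, c) for c in spec.inputs if c not in available]
        available |= set(spec.outputs)
    return problems


print(check_order(SPECS, RAW))  # → []
```

An empty result means every step only reads raw columns or columns produced by an earlier step. Adding a t6 would then be one more entry in `SPECS`.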

I cannot control what columns I get in my input data.

If my input data has C1, C2, C3, C4, C5, C6, C7 columns I can generate all the C1 ... C30 columns.

If my input data has C1, C2, C3, C4, C5, C6, C7, C8, C10, C11, C17 columns, I can generate all the C1 ... C30 columns, but I should skip t2, as it is not necessary.

If my input data has C1, C2, C3, C4, C6, C7, I can only do t1, t2, t3, t4. I cannot run t5, so I should create the C26, C27, C28, C29, C30 columns with NaN values only and add to the log: "Cannot perform t5 transformation because C5 is missing. C26, C27, C28, C29, C30 are filled with NaN values".
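The three cases above boil down to one decision per step: skip it if its outputs already exist, run it if its inputs are present, otherwise fill its outputs with NaN. Below is a rough sketch of that rule in plain Python, working on column names only. `STEPS`, `decide` and `plan` are hypothetical names, and the sketch assumes that NaN-filled columns count as present for later steps.

```python
# Each step maps to (input columns, output columns), copied from the list above.
STEPS = {
    "t1": ("C1 C2 C3 C4", "C8 C9 C12 C22"),
    "t2": ("C3 C6 C7 C8", "C10 C11 C17"),
    "t3": ("C5 C8 C10 C11 C22", "C13 C14 C15 C16"),
    "t4": ("C13 C15", "C18 C19 C20 C21 C23 C24 C25"),
    "t5": ("C5 C19 C20 C21", "C26 C27 C28 C29 C30"),
}


def decide(columns, inputs, outputs):
    """Return (action, missing_inputs) where action is "skip", "run" or "fill"."""
    if set(outputs) <= set(columns):
        return "skip", []  # every output column is already there
    missing = [c for c in inputs if c not in columns]
    return ("fill", missing) if missing else ("run", [])


def plan(columns, steps=STEPS):
    """Walk the steps in order and record the decision taken for each one."""
    available = set(columns)
    result = {}
    for name, (inputs, outputs) in steps.items():
        result[name] = decide(available, inputs.split(), outputs.split())
        # Assumption: after the step its outputs exist, as real values or as NaN.
        available |= set(outputs.split())
    return result


second_case = plan("C1 C2 C3 C4 C5 C6 C7 C8 C10 C11 C17".split())
print(second_case["t2"])  # → ('skip', [])

third_case = plan("C1 C2 C3 C4 C6 C7".split())
print(third_case["t5"])  # → ('fill', ['C5'])
```

With the dependency list as written, t3 also reads C5, so for the input without C5 this sketch reports t3 as `"fill"` as well as t5.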

My t1, t2, t3, t4, t5 are already written, but I don't know how to organize the code elegantly so that code repetition is minimal.

I had to develop my code in a very short amount of time. Consequently, all my t1, t2, t3, t4, t5 methods look like this:

import inspect
import numpy as np

def ti(df):
    name = inspect.stack()[0][3]  # name of the current function, e.g. "t1"
    output_cols = get_output_cols()
    if output_cols_already_exist(df, output_cols):
        return df, "{} skipped, the output cols {} already exist".format(name, output_cols)
    input_cols = get_required_input_cols()
    missing_cols = get_missing_cols(df, input_cols)
    if not missing_cols:
        # do stuff
        log = "Performed {} transformation. Created {} columns".format(name, output_cols)
    else:
        # the transformation cannot run: create its output columns filled with NaN
        for col in output_cols:
            df[col] = np.nan
        log = "Cannot perform {} transformation because {} columns are missing. {} are filled with NaN values".format(name, missing_cols, output_cols)
    return df, log

Also, I use the functions in the following way:

text = ""
df = pd.read_csv(input_path)
df, log_text = t1(df)
text = text + log_text + "\n"
df, log_text = t2(df)
text = text + log_text + "\n"
df, log_text = t3(df)
text = text + log_text + "\n"
df, log_text = t4(df)
text = text + log_text + "\n"
df, log_text = t5(df)
text = text + log_text + "\n"
df.to_csv("output_data.csv", index = False)
logging.info(text)

As you can see, my code is ugly and repetitive. Now I have time to refactor it, but I don't know what the best approach would be. I also want my code to be extensible, as I am thinking about adding a t6 transform. Can you help me by giving some directions / design patterns I could follow? (I am also open to using other Python libraries beyond pandas.)
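One possible direction, sketched here under assumptions rather than as a definitive design, is to keep only the "do stuff" part inside each transformation and move the skip / run / fill logic and the logging into a single loop. `Step` and `run_pipeline` are hypothetical names, and the two toy steps with columns `a`, `b`, `c`, `x` only stand in for the real t1 ... t5.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class Step:
    """Hypothetical wrapper: a transformation plus the columns it reads and writes."""
    name: str
    inputs: Sequence[str]
    outputs: Sequence[str]
    func: Callable[[pd.DataFrame], pd.DataFrame]  # only the "do stuff" part


def run_pipeline(df, steps):
    """Apply the steps in order; return the new frame and one log line per step."""
    df = df.copy()  # leave the caller's frame untouched
    log = []
    for step in steps:
        outputs = list(step.outputs)
        if all(c in df.columns for c in outputs):
            log.append(f"{step.name} skipped, the output cols {outputs} already exist")
            continue
        missing = [c for c in step.inputs if c not in df.columns]
        if missing:
            # Design choice of this sketch: only absent output columns are created.
            for col in outputs:
                if col not in df.columns:
                    df[col] = np.nan
            log.append(
                f"Cannot perform {step.name} transformation because {missing} "
                f"columns are missing. {outputs} are filled with NaN values"
            )
            continue
        df = step.func(df)
        log.append(f"Performed {step.name} transformation. Created {outputs} columns")
    return df, log


# Toy steps standing in for the real transformations (made-up column names).
steps = [
    Step("double", ["a"], ["b"], lambda d: d.assign(b=d["a"] * 2)),
    Step("total", ["b", "x"], ["c"], lambda d: d.assign(c=d["b"] + d["x"])),
]

out, log = run_pipeline(pd.DataFrame({"a": [1, 2]}), steps)
print("\n".join(log))
```

Here `double` runs and `total` is reported as blocked because `x` is missing, so `c` is created with NaN values. Under this layout, adding a t6 would mean appending one more `Step` to the list, and the joined log lines can be passed to `logging.info` as before.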
