Monday, June 21, 2021

Standard way of dealing with standardization for tabular data in PyTorch

This question is about structuring the code in an elegant way that supports training from scratch, training from an already trained model, and evaluating a model, while handling the data normalization accordingly in each case.

It is well known that a model usually trains better if the data is standardized. Here is how I deal with this in the context of regression on tabular data:

from sklearn.preprocessing import StandardScaler

df_train, df_test, df_val = get_data(config)
if config['scaler'] is not None:
    scaler = config['scaler']  # reuse the scaler of an already-trained model
else:
    # fit on the training split only; I only scale some numerical columns (target included)
    scaler = StandardScaler().fit(df_train[config['columns']])
df_train[config['columns']] = scaler.transform(df_train[config['columns']])
df_test[config['columns']] = scaler.transform(df_test[config['columns']])
df_val[config['columns']] = scaler.transform(df_val[config['columns']])
best_model, test_predictions = train(df_train, df_test, df_val, config)
# undo the target scaling so the predictions are reported in the original units
target_std, target_mean = get_scaler_coeffs(scaler)
test_predictions['y_hat'] = test_predictions['y_hat'] * target_std + target_mean
store_results(test_predictions, best_model, scaler, config)
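
For context, the coefficients that get_scaler_coeffs returns come from the fitted StandardScaler itself: its mean_ and scale_ attributes hold the per-column statistics, in the same order as the columns it was fitted on, so the target's pair can be looked up by position. A toy illustration (the column names here are made up):

import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({'feature_a': [1.0, 2.0, 3.0], 'target': [10.0, 20.0, 30.0]})
toy_scaler = StandardScaler().fit(toy[['feature_a', 'target']])

# mean_ / scale_ follow the column order used in fit()
target_mean = toy_scaler.mean_[1]   # mean of 'target' -> 20.0
target_std = toy_scaler.scale_[1]   # population std of 'target' -> ~8.16
print(target_std, target_mean)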

In short, I use StandardScaler from sklearn to scale the values (target included) and pandas to work with the CSV files. After training, I store the best model along with its scaler and its results.
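
For completeness, store_results could be as simple as the sketch below; the file names, the output_dir config key and the use of joblib are illustrative assumptions rather than exactly what my helper does:

import os
import joblib
import torch

def store_results(test_predictions, best_model, scaler, config):
    # Illustrative sketch: keep the model weights, the fitted scaler and the
    # (un-scaled) test predictions together, so the model can later be
    # evaluated or fine-tuned with the same scaling applied to new data.
    out_dir = config['output_dir']  # hypothetical config key
    os.makedirs(out_dir, exist_ok=True)
    torch.save(best_model.state_dict(), os.path.join(out_dir, 'model.pt'))
    joblib.dump(scaler, os.path.join(out_dir, 'scaler.joblib'))
    test_predictions.to_csv(os.path.join(out_dir, 'test_predictions.csv'), index=False)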

I am trying to write clean, elegant code, but as you can see this approach is not elegant at all, and I can't figure out how I should restructure it. What is the "standard" way of dealing with scaling and reverse scaling in PyTorch, preferably for regression?
