This question is about structuring code elegantly so that it supports three modes — training from scratch, continuing training from an already-trained model, and evaluating a model — while handling data normalization consistently in each case.
It is well known that models tend to train better when the data is standardized. Here is how I currently handle this for regression on tabular data:
    from sklearn.preprocessing import StandardScaler

    df_train, df_test, df_val = get_data(config)
    if config['scaler'] is not None:
        scaler = config['scaler']  # reuse the scaler of an already-trained model
    else:
        # fit on the training split only; config['columns'] lists the
        # numerical columns to scale (target included)
        scaler = StandardScaler().fit(df_train[config['columns']])
    for df in (df_train, df_test, df_val):
        df[config['columns']] = scaler.transform(df[config['columns']])
    best_model, test_predictions = train(df_train, df_test, df_val, config)
    # undo the target scaling so predictions come back in the original units
    target_std, target_mean = get_scaler_coeffs(scaler)
    test_predictions['y_hat'] = test_predictions['y_hat'] * target_std + target_mean
    store_results(test_predictions, best_model, scaler, config)
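The manual inversion at the end works because get_scaler_coeffs digs the target's mean and standard deviation out of the fitted scaler. If a second scaler were fitted on the raw target column alone (before the transform step above), the same inversion would be a single inverse_transform call — a sketch, with 'target' as a placeholder column name:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # fitted on the raw (not yet scaled) target column, before the
    # transform step above; 'target' is a placeholder column name
    target_scaler = StandardScaler().fit(df_train[['target']])

    # after training, predictions come back to original units in one call
    y_hat = np.asarray(test_predictions['y_hat']).reshape(-1, 1)
    test_predictions['y_hat'] = target_scaler.inverse_transform(y_hat).ravel()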
In short, I use StandardScaler from sklearn to scale the values (target included) and pandas to work with the CSV files. After training, I store the best model together with its scaler and its results.
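The storing step boils down to something like this (a simplified sketch; 'output_dir' stands in for my actual config key):

    import joblib
    import torch

    def store_results(test_predictions, best_model, scaler, config):
        out_dir = config['output_dir']  # placeholder config key
        # keep the model, its scaler and its predictions side by side,
        # so later evaluation reuses exactly the same scaling
        torch.save(best_model.state_dict(), f"{out_dir}/model.pt")
        joblib.dump(scaler, f"{out_dir}/scaler.joblib")
        test_predictions.to_csv(f"{out_dir}/test_predictions.csv", index=False)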
I am trying to write clean, elegant code, but as you can see, this approach is anything but, and I can't figure out how I should restructure it. What is the "standard" way of dealing with scaling and reverse scaling in PyTorch, preferably for regression?
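For illustration, one direction I have considered (a rough sketch with made-up names; I am not sure it is idiomatic) is bundling the whole normalization life cycle — fitting, transforming, inverting and persisting — into a single object:

    import joblib
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    class Normalizer:
        """Owns fitting, applying, inverting and persisting the scalers."""

        def __init__(self, feature_columns, target_column):
            self.feature_columns = feature_columns
            self.target_column = target_column
            self.feature_scaler = StandardScaler()
            self.target_scaler = StandardScaler()

        def fit(self, df_train):
            # fit both scalers on the raw training split only
            self.feature_scaler.fit(df_train[self.feature_columns])
            self.target_scaler.fit(df_train[[self.target_column]])
            return self

        def transform(self, df):
            df = df.copy()
            df[self.feature_columns] = self.feature_scaler.transform(df[self.feature_columns])
            df[[self.target_column]] = self.target_scaler.transform(df[[self.target_column]])
            return df

        def inverse_transform_target(self, y):
            # undo the target scaling on a 1-D array of predictions
            return self.target_scaler.inverse_transform(np.asarray(y).reshape(-1, 1)).ravel()

        def save(self, path):
            joblib.dump(self, path)

        @staticmethod
        def load(path):
            return joblib.load(path)

Training from scratch would fit and save one of these, while resumed training and evaluation would just load it, which removes the if/else above — but it still feels ad hoc to me.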