Saturday, February 13, 2021

Good patterns for handling column names in dataframe-like data structures across an application

Many applications handle data through a dataframe-like data structure, i.e., a table with rows and columns. These dataframes are used in different parts of the application, often manipulated by distinct classes and functions, but their column structure remains the same. Many of these manipulations need the column names. For instance, to keep only the records before some date, you would filter on a column, probably called "DATE", that stores the date of each record.

What are good patterns to handle these column names across the whole application? I have thought of the following options but would like to hear if there are better patterns.

  • Use the column name as a string literal every time you want to access a particular column. This is clearly not the best approach: if the column "DATE" needs to be renamed to "date", you have to find and replace every occurrence of that string.

  • Create a custom class that extends the dataframe class for each table you expect to use in different parts of your application, declaring the column names as attributes and writing all the methods that manipulate this dataframe there. In the case of pandas, would it be a good idea to extend the pandas.DataFrame class?

  • Write the column names in a constants file and import them wherever you need them.
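As a minimal sketch of the third option, assuming pandas: the class `RecordCols`, the column "AMOUNT", and the helper `records_before` are hypothetical names chosen for illustration, and in a real application the class would live in its own constants module.

```python
import pandas as pd


class RecordCols:
    """Hypothetical column names for a 'records' table, defined in one place."""

    DATE = "DATE"
    AMOUNT = "AMOUNT"


def records_before(df: pd.DataFrame, cutoff: str) -> pd.DataFrame:
    """Return the rows whose date is strictly earlier than `cutoff`."""
    # The column is referenced through the constant, never as a string literal.
    return df[df[RecordCols.DATE] < pd.Timestamp(cutoff)]


# Build a small example table using the same constants for the column names.
records = pd.DataFrame(
    {
        RecordCols.DATE: pd.to_datetime(["2021-01-05", "2021-02-13", "2021-03-01"]),
        RecordCols.AMOUNT: [10.0, 20.0, 30.0],
    }
)

early = records_before(records, "2021-02-01")
```

With this layout, renaming "DATE" to "date" would mean changing a single line in `RecordCols`, provided no other code spells the name out as a literal.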

I am interested in the reasons for preferring one pattern over another. I am mainly thinking of a context where you are using Python pandas or Spark, where dataframes are ubiquitous, but I am also interested in patterns that are well suited to other programming languages.
