Saturday, March 7, 2020

Pandas flexible determination of metrics

Imagine we have several pandas dataframes with different structures:

import pandas as pd

# creating the first dataframe
df1 = pd.DataFrame({
  "width": [1, 5],
  "height": [5, 8]})

# creating second dataframe
df2 = pd.DataFrame({
  "a": [7, 8], 
  "b": [11, 23],
  "c": [1, 3]})

# creating the third dataframe
df3 = pd.DataFrame({
  "radius": [7, 8], 
  "height": [11, 23]})

In general there may be more dataframes than these three. I want to build logic that maps sets of column names to specific functions in order to create a new column "metric" (think of it as an area or a volume, depending on which columns are present). I want to specify the column-name ensembles like this:

# The functions are defined first so that the dict below can reference them.
def area(width, height):
    return width * height

def volume_cube(a, b, c):
    return a * b * c

def volume_cylinder(radius, height):
    return (3.14159 * radius ** 2) * height

# Each entry pairs a set of column names with the function to apply to them.
column_name_ensembles = {
    "1": {
       "ensemble": ['height', 'width'],
       "method": area},
    "2": {
       "ensemble": ['a', 'b', 'c'],
       "method": volume_cube},
    "3": {
       "ensemble": ['radius', 'height'],
       "method": volume_cylinder}}

Here, the area function creates a new column for df1, df1['metric'] = df1['height'] * df1['width'], and the volume function creates a new column for df2, df2['metric'] = df2['a'] * df2['b'] * df2['c']. Note that the functions can have arbitrary form, but each takes the columns of its ensemble as parameters. The desired function metric(df, column_name_ensembles) should take an arbitrary dataframe as input and decide, by inspecting the column names, which function to apply.

Example input/output behaviour:

df1_with_metric = metric(df1, column_name_ensembles)
print(df1_with_metric)
# output
#    width height metric
#  0 1     5      5 
#  1 5     8      40
df2_with_metric = metric(df2, column_name_ensembles)
print(df2_with_metric)
# output
#    a  b  c  metric
#  0 7  11 1  77
#  1 8  23 3  552
df3_with_metric = metric(df3, column_name_ensembles)
print(df3_with_metric)
# output
#    radius  height  metric
#  0 7       11      1693.31701
#  1 8       23      4624.42048

The perfect solution would be a function that takes the dataframe and the column_name_ensembles as parameters and returns the dataframe with the appropriate 'metric' added to it.

I know this can be achieved by multiple if and else statements, but this does not seem to be the most intelligent solution. Maybe there is a design pattern that can solve this problem, but I am not an expert at design patterns.
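To make the goal concrete, here is a minimal sketch of the kind of lookup I have in mind, not a definitive implementation. It assumes that a dataframe's columns match exactly one ensemble, and that the names in each ensemble are the same as the parameter names of its function, so the columns can be passed as keyword arguments. Raising a ValueError when nothing matches is my own assumption about the desired behaviour.

```python
import pandas as pd

def area(width, height):
    return width * height

def volume_cube(a, b, c):
    return a * b * c

def volume_cylinder(radius, height):
    return (3.14159 * radius ** 2) * height

column_name_ensembles = {
    "1": {"ensemble": ["height", "width"], "method": area},
    "2": {"ensemble": ["a", "b", "c"], "method": volume_cube},
    "3": {"ensemble": ["radius", "height"], "method": volume_cylinder},
}

def metric(df, column_name_ensembles):
    """Return a copy of df with a 'metric' column picked by its column names."""
    columns = set(df.columns)
    for spec in column_name_ensembles.values():
        # Assumption: an ensemble applies only when it equals the full column set.
        if set(spec["ensemble"]) == columns:
            # Assumption: ensemble names equal the function's parameter names,
            # so each column can be handed over as a keyword argument.
            kwargs = {name: df[name] for name in spec["ensemble"]}
            return df.assign(metric=spec["method"](**kwargs))
    raise ValueError(f"No ensemble matches columns {sorted(columns)}")

df1 = pd.DataFrame({"width": [1, 5], "height": [5, 8]})
print(metric(df1, column_name_ensembles))
```

Matching on the full set of columns avoids any chain of if/else branches, but it would need refining if a dataframe carried extra columns beyond its ensemble.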

Thank you for reading my question! I am looking forward to your answers.
