I have a Python script that creates custom filters for pandas dataframes. These dataframes contain event logs with three columns: client, event, and next (the next event). client has recurring values, and each occurrence may have different values of event and next. I want to find all rows whose client value matches filter values in either of the other two columns, in order to either single those rows out or leave them out. This is captured by a FilterType enumeration, which can be 0 or 1, for selecting values (select) or leaving them out (hide).
Right now, I've created three sample filters (Filter_HasEvent, Filter_EventToNext, and Filter_EventFirst). Respectively, they find all clients for which the given value of event occurs in any row, a given combination of event and next occurs in any row, or the given value of event occurs as the first event for that client.
I know dataframes have a filter function of their own. However, I would also like to apply multiple filters to the same dataframe using both unions and intersections. That is why I create a boolean mask for the dataframe: masks can be combined before being applied.
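To illustrate what combining masks means here, the following is a minimal sketch on a tiny, made-up event log (the data and the names mask_a, mask_b, union, and intersection are invented for this illustration). Two boolean masks are built per client and then combined with | for a union and & for an intersection before indexing the dataframe.

```python
import pandas as pd

# Tiny hypothetical event log (values invented for illustration).
df = pd.DataFrame({
    "client": [1, 1, 2, 2, 3, 3],
    "event":  [5, 7, 5, 8, 9, 7],
})

# Mask A: all rows of clients that have event 5 at least once.
clients_a = df.loc[df["event"] == 5, "client"].unique()
mask_a = df["client"].isin(clients_a)

# Mask B: all rows of clients that have event 7 at least once.
clients_b = df.loc[df["event"] == 7, "client"].unique()
mask_b = df["client"].isin(clients_b)

# Masks are plain boolean Series, so they combine element-wise.
union = df[mask_a | mask_b]         # clients with event 5 OR event 7
intersection = df[mask_a & mask_b]  # clients with event 5 AND event 7
```

Because each mask is computed against the full dataframe, the masks share the same index and can be combined in any order before a single indexing step.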
Now, for my question: by using classes, I currently have to instantiate a filter for every value of event, and for every combination of event and next. What's more, the filter type is now selected within each filter using an if-else construction. This becomes increasingly illegible as I add more filter types. Is there a nicer (i.e., more object-oriented) way of defining these filters and implementing the different types, e.g., using decorators or design patterns?
Another issue is that when I add more filters (e.g., Filter_EventLast), I have to implement all filter types separately for each new filter, because they are not reusable. Ideally, I would like to make a template that dispatches to the appropriate filter depending on the input it receives, like in Java. Is this possible?
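One way to picture such a template is the sketch below. It is only an illustration under assumptions, not a definitive design: the names ClientFilter, matching_clients, HasEvent, and EventLast are hypothetical. The idea is that each subclass only states which clients match, while the base class applies the select/hide logic once.

```python
import abc
from enum import Enum

import pandas as pd


class FilterType(Enum):
    select = 0
    hide = 1


class ClientFilter(abc.ABC):
    """Hypothetical base class: subclasses only say which clients match."""

    def __init__(self, filter_type):
        self.filter_type = filter_type

    @abc.abstractmethod
    def matching_clients(self, df):
        """Return the client values that satisfy this filter's condition."""

    def get_mask(self, df):
        # The select/hide decision lives here once, shared by all subclasses.
        mask = df["client"].isin(self.matching_clients(df))
        return ~mask if self.filter_type is FilterType.hide else mask

    def apply(self, df):
        return df[self.get_mask(df)]


class HasEvent(ClientFilter):
    def __init__(self, value, filter_type):
        super().__init__(filter_type)
        self.value = value

    def matching_clients(self, df):
        return df.loc[df["event"] == self.value, "client"].unique()


class EventLast(ClientFilter):
    def __init__(self, value, filter_type):
        super().__init__(filter_type)
        self.value = value

    def matching_clients(self, df):
        # Last recorded event per client, indexed by client.
        last_events = df.groupby("client")["event"].last()
        return last_events[last_events == self.value].index


# Made-up data for a quick check of both filter types.
demo = pd.DataFrame({"client": [1, 1, 2, 2], "event": [5, 7, 7, 5]})
hidden = HasEvent(7, FilterType.hide).apply(demo)    # hide clients with event 7
last5 = EventLast(5, FilterType.select).apply(demo)  # keep clients ending on 5
```

With this split, adding a new filter means writing one matching_clients method, and adding a new filter type means changing get_mask in a single place.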
Below is the sample code as it looks now:
import abc
from enum import Enum

import numpy as np
import pandas as pd

# Build a random sample event log: 10 clients with 10 events each.
np.random.seed(64951)
client = [j for j in range(10) for i in range(10)]
event = pd.Series(np.random.choice(range(10), len(client)))
# The following event for each row (NaN for the last row);
# named next_event to avoid shadowing the built-in next().
next_event = event - event.diff(-1)

rand_df = pd.DataFrame({
    'client': client,
    'event': event,
    'next': next_event
})


class FilterType(Enum):
    select = 0
    hide = 1


class FilterTemplate(abc.ABC):
    # Inheriting from abc.ABC is what makes @abc.abstractmethod take effect.

    def __init__(self, value, filter_type):
        self.value = value
        self.filter_type = filter_type
        self.mask = None

    @abc.abstractmethod
    def getMask(self, df):
        raise NotImplementedError('Method needs to be implemented!')

    @abc.abstractmethod
    def apply(self, df):
        raise NotImplementedError('Method needs to be implemented!')


class Filter_HasEvent(FilterTemplate):

    def __init__(self, value, filter_type):
        super().__init__(value, filter_type)

    def getMask(self, df):
        # Clients for which the event occurs in at least one row.
        client_in_event = df[df.event == self.value].client.unique()

        if self.filter_type == FilterType.select:
            self.mask = df.client.isin(client_in_event)
        elif self.filter_type == FilterType.hide:
            self.mask = ~df.client.isin(client_in_event)

        return self.mask

    def apply(self, df):
        return df[self.getMask(df)]


class Filter_EventToNext(FilterTemplate):

    def __init__(self, value, next_value, filter_type):
        super().__init__(value, filter_type)
        self.next_value = next_value

    def getMask(self, df):
        # Clients for which the (event, next) combination occurs in at least one row.
        client_in_eventtonext = df[(df.event == self.value) & (df.next == self.next_value)].client.unique()

        if self.filter_type == FilterType.select:
            self.mask = df.client.isin(client_in_eventtonext)
        elif self.filter_type == FilterType.hide:
            self.mask = ~df.client.isin(client_in_eventtonext)

        return self.mask

    def apply(self, df):
        return df[self.getMask(df)]


class Filter_EventFirst(FilterTemplate):

    def __init__(self, value, filter_type):
        super().__init__(value, filter_type)

    def getMask(self, df):
        # Clients whose first recorded event equals the given value.
        client_unique = pd.Series(sorted(df.client.unique()), index=sorted(df.client.unique()))
        client_has_event_first = client_unique[df.groupby('client').event.first() == self.value]

        if self.filter_type == FilterType.select:
            self.mask = df.client.isin(client_has_event_first)
        elif self.filter_type == FilterType.hide:
            self.mask = ~df.client.isin(client_has_event_first)

        return self.mask

    def apply(self, df):
        return df[self.getMask(df)]


filter1 = Filter_HasEvent(1, FilterType.hide)
mask1 = filter1.getMask(rand_df)

filter2 = Filter_EventToNext(1, 2, FilterType.hide)
mask2 = filter2.getMask(rand_df)

# Apply a single filter, then the union of two filters.
print(filter1.apply(rand_df))
print('')
print(rand_df[mask1 | mask2])