Monday, 27 February 2017

Is there a design pattern that I can use for applying a list of functions to create machine learning features in Python?

I am building an address parsing tool in Python for labeling address parts. I have a pandas DataFrame that looks something like this:

import pandas as pd

df = pd.DataFrame({"TOKEN": ['123.', 'Fake', 'street']})

And I've got a number of feature functions that look like this:

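def f_ends_in_period(s):
    # endswith() also handles an empty string, where s[-1] would raise IndexError
    return 'ends_in_period' if s.endswith(".") else ''

def f_numeric(s):
    # true if the token contains at least one digit
    return 'numeric' if any(k.isdigit() for k in s) else ''

def f_capitalized(s):
    # s[:1] is '' for an empty string, so this never raises
    return 'capitalized' if s[:1].isupper() else ''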
...

The feature functions are fairly rigid. A feature function f_blah(s) returns "blah" if string s satisfies some condition (namely, condition "blah"), and otherwise returns an empty string. It's a little weird but there's a method to the madness.
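To make that convention concrete, here is a minimal sketch of how a function following it could be built from a label and a condition. The helper name `make_feature` is hypothetical and not part of the code above; it only illustrates the "return the label or an empty string" contract.

```python
def make_feature(label, predicate):
    """Build a feature function that returns `label` if predicate(s) is true, else ''."""
    def feature(s):
        return label if predicate(s) else ''
    # Name the generated function after its label, following the f_blah convention.
    feature.__name__ = 'f_' + label
    return feature

# Hypothetical re-creations of two of the functions shown earlier.
f_ends_in_period = make_feature('ends_in_period', lambda s: s.endswith('.'))
f_numeric = make_feature('numeric', lambda s: any(k.isdigit() for k in s))

print(f_ends_in_period('123.'))
print(f_numeric('Fake'))
```

With this shape the label is written once, and both the function name and the returned string are derived from it.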

Anyway, for now I'm simply going down the list:

df['f_ends_in_period'] = df['TOKEN'].apply(f_ends_in_period)
df['f_numeric'] = df['TOKEN'].apply(f_numeric)
df['f_capitalized'] = df['TOKEN'].apply(f_capitalized)

And that works fine, except that every time I want to add a new feature function, I have to type its name at least four times: in the definition, in the returned label, in the column name, and in the apply call. That gets annoying really fast, especially if I want to create dozens of features.

Is there a standard pattern that I can use to refactor this? I'm not sure exactly what the solution looks like; I'm just looking for suggestions to streamline the process.
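One possible direction, sketched here as an assumption rather than an established answer, is to keep the feature functions in a list and derive each column name from the function's `__name__` attribute. The list name `FEATURE_FUNCTIONS` is hypothetical, and the labels follow the f_blah returns "blah" convention described above.

```python
import pandas as pd

def f_ends_in_period(s):
    return 'ends_in_period' if s.endswith('.') else ''

def f_numeric(s):
    return 'numeric' if any(k.isdigit() for k in s) else ''

def f_capitalized(s):
    return 'capitalized' if s[:1].isupper() else ''

# Hypothetical registry: adding a feature means defining it and listing it here.
FEATURE_FUNCTIONS = [f_ends_in_period, f_numeric, f_capitalized]

df = pd.DataFrame({"TOKEN": ['123.', 'Fake', 'street']})

for func in FEATURE_FUNCTIONS:
    # func.__name__ is the name the function was defined with, e.g. 'f_numeric'.
    df[func.__name__] = df['TOKEN'].apply(func)

print(df)
```

This reduces the repetition to the definition, the label, and one list entry; whether a list, a decorator, or module introspection is the better registry is a design choice that depends on how many features there are.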
