I am working on building an address parsing tool in Python for labeling address parts. I have a pandas DataFrame that looks something like this:
import pandas as pd
df = pd.DataFrame({"TOKEN": ['123.', 'Fake', 'street']})
And I've got a number of feature functions that look like this:
def f_ends_in_period(s):
    return 'ends_in_period' if s.endswith(".") else ''

def f_numeric(s):
    return 'numeric' if any(k.isdigit() for k in s) else ''

def f_capitalized(s):
    return 'capitalized' if s[:1].isupper() else ''
...
The feature functions are fairly rigid. A feature function f_blah(s) returns "blah" if string s satisfies some condition (namely, condition "blah"), and otherwise returns an empty string. It's a little weird but there's a method to the madness.
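As a concrete illustration of that convention, here is a minimal sketch that applies one of the functions above to the sample tokens. The names `tokens` and `labels` are only for this example.

```python
def f_ends_in_period(s):
    # Returns the feature label when the condition holds, else an empty string.
    return 'ends_in_period' if s.endswith(".") else ''

tokens = ['123.', 'Fake', 'street']
labels = [f_ends_in_period(t) for t in tokens]
print(labels)  # ['ends_in_period', '', '']
```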
Anyway, for now I'm simply going down the list:
df['f_ends_in_period'] = df['TOKEN'].apply(f_ends_in_period)
df['f_numeric'] = df['TOKEN'].apply(f_numeric)
df['f_capitalized'] = df['TOKEN'].apply(f_capitalized)
And that works fine, except that every time I want to make a new feature function, I have to type the name of that feature function at least 4 times. That starts to get annoying really fast, especially if I want to create dozens of features.
Is there a standard pattern that I can use to refactor this? I'm not sure exactly what the solution looks like; I'm just looking for suggestions to streamline this process.
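One possible direction, sketched here only as an illustration, is a decorator that registers each feature and derives the label from the function name, so the name is typed once. The names `FEATURES`, `feature`, and `add_features` are hypothetical, and the sketch assumes every feature function is named `f_<label>` and can be written as a plain true/false predicate.

```python
import functools

FEATURES = {}  # hypothetical registry: function name -> wrapped feature function


def feature(predicate):
    """Wrap a boolean predicate named f_<label> so it returns '<label>' or ''."""
    name = predicate.__name__
    # Derive the label from the function name, e.g. 'f_numeric' -> 'numeric'.
    label = name[2:] if name.startswith("f_") else name

    @functools.wraps(predicate)
    def wrapper(s):
        return label if predicate(s) else ''

    FEATURES[name] = wrapper
    return wrapper


@feature
def f_ends_in_period(s):
    return s.endswith(".")


@feature
def f_numeric(s):
    return any(k.isdigit() for k in s)


@feature
def f_capitalized(s):
    return s[:1].isupper()


def add_features(df, column="TOKEN"):
    """Add one column per registered feature, named after the function."""
    for name, func in FEATURES.items():
        df[name] = df[column].apply(func)
    return df
```

With this in place, calling `add_features(df)` on the DataFrame above would replace the three hand-written `apply` lines, and a new feature would only need a decorated function definition. Whether a registry like this is preferable to an explicit list of functions is a design choice; the explicit list is easier to reorder or filter.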