I have more of a design question, but I am not sure how to handle that. I have a script preprocessing.py where I read a .csv file of text column that I would like to preprocess by removing punctuations, characters, ...etc.
What I have done now is that I have written a class with several functions as follows:
class Preprocessing(object):
def __init__(self, file):
self.my_data = pd.read_csv(file)
def remove_punctuation(self):
self.my_data['text'] = self.my_data['text'].str.replace('#','')
def remove_hyphen(self):
self.my_data['text'] = self.my_data['text'].str.replace('-','')
def remove_words(self):
self.my_data['text'] = self.my_data['text'].str.replace('reference','')
def save_data(self):
self.my_data.to_csv('my_data.csv')
def preprocessing(file_my):
f = Preprocessing(file_my)
f.remove_punctuation()
f.remove_hyphen()
f.remove_words()
f.save_data()
return f
if __name__ == '__main__':
preprocessing('/path/to/file.csv')
although it works fine, i would like to be able to expand the code easily and have smaller classes instead of having one large class. So i decided to use abstract class:
import pandas as pd
from abc import ABC, abstractmethod
my_data = pd.read_csv('/Users/kgz/Desktop/german_web_scraping/file.csv')
class Preprocessing(ABC):
@abstractmethod
def processor(self):
pass
class RemovePunctuation(Preprocessing):
def processor(self):
return my_data['text'].str.replace('#', '')
class RemoveHyphen(Preprocessing):
def processor(self):
return my_data['text'].str.replace('-', '')
class Removewords(Preprocessing):
def processor(self):
return my_data['text'].str.replace('reference', '')
final_result = [cls().processor() for cls in Preprocessing.__subclasses__()]
print(final_result)
So now each class is responsible for one task but there are a few issues I do not know how to handle since I am new to abstract classes. first, I am reading the file outside the classes, and I am not sure if that is good practice? if not, should i pass it as an argument to the processor function or have another class who is responsible to read the data.
Second, having one class with several functions allowed for a flow, so every transformation happened in order (i.e, first punctuation is removes, then hyphen is removed,...etc) but I do not know how to handle this order and dependency in abstract classes.
Aucun commentaire:
Enregistrer un commentaire