I have two teams of 10. One team prepares data to be ingested for machine learning: lots of XML parsing, preprocessing, and sorting. Basically, putting data into a format that scikit-learn/WEKA/etc. can consume easily.
The other team builds models for classification/regression from this cleaned up data.
As you can imagine, there is a lot of collaboration between the two teams, right?
Wrong.
Each team writes its own separate tools for preprocessing, XML parsing, data cleaning, model building, testing, and so on.
To solve this, I think a shared repository of the generic tools/programs we have written would be really useful.
What software do you think would be suitable for hosting this?
I was thinking it would be appropriate to have a structure like:
>TEAM 1 (Preprocessing)
  >XML-Parsing
  >HTML-Parsing
  >Word-doc-Parsing
  ...etc...
>TEAM 2 (Model-building)
  >Cross-validation
  >WEKA
  >Scikit-Learn
  ...etc...
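To make the layout concrete, here is a minimal sketch of how it could be scaffolded on disk. It assumes a plain folder tree with one README per tool folder. The names `LAYOUT` and `scaffold` and the folder names are hypothetical, made up for illustration; nothing like this exists in our teams yet.

```python
from pathlib import Path

# Hypothetical layout mirroring the structure above; folder names are
# illustrative assumptions, not an existing convention.
LAYOUT = {
    "team1-preprocessing": ["xml-parsing", "html-parsing", "word-doc-parsing"],
    "team2-model-building": ["cross-validation", "weka", "scikit-learn"],
}


def scaffold(root):
    """Create the folder tree under `root` and return the tool folders."""
    created = []
    for team, tools in LAYOUT.items():
        for tool in tools:
            path = Path(root) / team / tool
            path.mkdir(parents=True, exist_ok=True)
            # One README per tool folder, so that tools written in
            # different languages are at least documented the same way.
            readme = path / "README.txt"
            if not readme.exists():
                readme.write_text(f"{tool}: describe usage, language, and owner here.\n")
            created.append(path)
    return created
```

Calling `scaffold("shared-tools")` would create the six tool folders under `shared-tools/`; the folders themselves are language-agnostic, so Java, Python, and Bash tools could sit side by side.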
Any comments on this structure? Essentially, I am asking two things. First, is this, in your opinion, a good way to promote collaboration and understanding between the teams? Second, what is a good tool for sharing our tools/programs, given that we use a wide array of languages and styles (it's not just Java; some people use Python, some use Bash, etc.)?
Any comments at all would be really appreciated.
Also, I debated with myself over whether to post this in SuperUser or StackOverflow and SO seemed more appropriate - this is kind of a Software Engineering question, so I BELIEVE it should go here - my apologies if this is incorrect :).