Wednesday, August 16, 2017

Design pattern for handling large datasets for machine learning

I'm currently scraping data from websites and building a large (and potentially growing) dataset from it. I'm wondering whether there are any good practices to adopt when processing, saving, and loading large datasets.

More concretely, what should I do when the dataset I want to save is too large to hold in RAM and write to disk in one go, but writing it one data point at a time is too inefficient? Is there a smarter approach than writing a moderately sized batch to file at a time?
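For reference, one common shape the batch-at-a-time approach takes is an appendable, chunked on-disk dataset. Below is a minimal sketch using h5py, assuming the data points are fixed-size float vectors; the `scrape_batches` generator, the feature dimension, and the batch size are all hypothetical placeholders, not anything from the question itself.

```python
import h5py
import numpy as np

FEATURE_DIM = 128   # assumed fixed-size feature vectors
BATCH_SIZE = 1000   # a moderately sized batch that fits comfortably in RAM

def scrape_batches():
    """Hypothetical stand-in for the scraper: yields one batch array at a time."""
    for _ in range(10):
        yield np.random.rand(BATCH_SIZE, FEATURE_DIM).astype("float32")

with h5py.File("dataset.h5", "w") as f:
    # Resizable, chunked dataset: it grows on disk, so the full dataset
    # never has to be held in memory at once.
    dset = f.create_dataset(
        "features",
        shape=(0, FEATURE_DIM),
        maxshape=(None, FEATURE_DIM),
        chunks=(BATCH_SIZE, FEATURE_DIM),
        dtype="float32",
    )
    for batch in scrape_batches():
        start = dset.shape[0]
        dset.resize(start + batch.shape[0], axis=0)  # extend along the first axis
        dset[start:] = batch                         # write only this batch to disk

# Later, slices can be read back without loading the whole file into memory:
with h5py.File("dataset.h5", "r") as f:
    first_rows = f["features"][:32]
```

The same pattern (append a batch, flush, repeat) also works with other chunked formats; the point of the sketch is only that writes and reads both happen in batch-sized pieces rather than all at once or one record at a time.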

Thank you for your time!
