Wednesday, August 26, 2015

pattern for saving many small json files to filesystem

tl;dr: Is there a good pattern for dynamically saving lots of small JSON files to a filesystem?

I am hitting a REST API once a minute, looking for results from a form submission. Each form result is turned into a JSON document and saved to a database.

Often there's nothing coming back, and sometimes there are a few tens of documents per hit. The documents end up being very small, roughly 10-50 KB each.
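
For context, the polling loop is roughly shaped like this. This is a minimal sketch, not the real code: the endpoint URL and the save_to_db helper are placeholders I've made up.

import time
import requests

POLL_URL = 'https://example.com/api/form-results'  # hypothetical endpoint

def save_to_db(doc):
    """Placeholder for the existing database save."""
    pass

def poll_once():
    """Fetch any new form results; returns a (possibly empty) list of dicts."""
    resp = requests.get(POLL_URL, timeout=30)
    resp.raise_for_status()
    return resp.json()

while True:
    for doc in poll_once():
        save_to_db(doc)  # plus the new filesystem backup, discussed below
    time.sleep(60)       # hit the API once a minute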

I've been asked to also save each document as a separate gzipped file on the filesystem of a different server. The files do not need to be organized; accessing them via os.walk is fine. (I'm using Python.)
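
A minimal sketch of what writing one document to its own gzipped file might look like. The save_backup name, the uuid-based filename, and the directory argument are my own choices, not a given:

import gzip
import json
import os
import uuid

def save_backup(doc, dir_path):
    """Write a single JSON document as its own .json.gz file inside dir_path."""
    filename = uuid.uuid4().hex + '.json.gz'  # any unique filename scheme would do
    full_path = os.path.join(dir_path, filename)
    with gzip.open(full_path, 'wb') as f:     # 'wb' works on both Python 2 and 3
        f.write(json.dumps(doc).encode('utf-8'))
    return full_path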

The plan is to create a main directory manually and then add subdirectories as necessary, by checking whether the last one created holds more than 999 files:

# will manually create '~/data/poll/0000000000' just before running
# (paths are absolute, so it no longer matters which directory this runs from)
import os

def create_poll_dir():
    starting_path = os.path.join(os.path.expanduser('~'), 'data', 'poll')
    # find the most recently modified numbered subdirectory
    all_subdirs = [d for d in os.listdir(starting_path)
                   if os.path.isdir(os.path.join(starting_path, d))]
    latest_subdir = max(all_subdirs,
                        key=lambda d: os.path.getmtime(os.path.join(starting_path, d)))
    path = os.path.join(starting_path, latest_subdir)
    # roll over to a new zero-padded directory once this one holds 1000 files
    if len(os.listdir(path)) > 999:
        new_dir = str(int(latest_subdir) + 1).zfill(10)
        path = os.path.join(starting_path, new_dir)
        os.makedirs(path)
    return path
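
Tying that together with the hypothetical poll_once and save_backup sketches above, each polling pass that returns documents would do something like:

docs = poll_once()            # whatever the API returned this minute
if docs:
    target_dir = create_poll_dir()
    for doc in docs:
        save_backup(doc, target_dir)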

  1. Since this is simply for backup, with ease and speed of access not an issue, and the documents are so small, would appending everything to a single log-like file, one JSON document per line (jsonlines), be a better approach (i.e., less filesystem strain with just as much data integrity)? A sketch of that alternative follows this list.
  2. If I stay with this approach (which, as written, would eventually throw an exception, I know), should I be using a less flat organization scheme? And does 1000 seem like a happy medium for the number of small files in a directory?
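
For comparison, the jsonlines alternative from question 1 is only a few lines. Again a sketch: the file path and function name are made up, and each document becomes one line of JSON appended to a single file.

import json

def append_jsonl(docs, path='poll_backup.jsonl'):
    """Append each document as one JSON object per line (jsonlines)."""
    with open(path, 'a') as f:
        for doc in docs:
            f.write(json.dumps(doc) + '\n')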
