Monday, August 26, 2019

MongoDB and multiprocessing - large RAM usage

I am using multiprocessing to parse files for insertion into a database, and I have run into the very common problem of getting the data from these several processes into the database connection(s).
There are of course several different ways to do this, a few being:

  1. Create several processes in a pool, which feed their data to a single connection to the database. This requires a queue on the output end of the pool, and the pool has to wait until the queue has finished handling one result before it can hand out another job.

  2. Create several processes in a pool, each of which has its own connection to the database and operates in parallel (i.e. let the database driver deal with the concurrency).

  3. The same as (1), but with a pool of connections handling the queue.

A problem with many of these options is that if the data being passed around and stored is large, each process has to wait for its data to be persisted to disk by the database before it can pick up its next job.

This seems to be solved by the design in (1).
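Roughly, what I have in mind for (1) is something like the sketch below: a few worker processes only parse and push documents onto a bounded queue, and the parent process, which owns the only MongoClient, drains that queue and does every insert (parse_worker, the worker count and the queue size here are placeholders, not my real parsing code):

import multiprocessing as mp
import pymongo


def parse_worker(in_q, out_q):
    # Placeholder for the real parsing: turn one file into a list of documents
    for f in iter(in_q.get, None):
        out_q.put([{'file': str(f), 'row': i} for i in range(1000)])


def main():
    files = list(range(1000))
    in_q = mp.Queue()
    # Bounded output queue: parsers block here until the writer has caught up
    out_q = mp.Queue(maxsize=4)
    workers = [mp.Process(target=parse_worker, args=(in_q, out_q)) for _ in range(4)]
    for w in workers:
        w.start()
    for f in files:
        in_q.put(f)
    for _ in workers:
        in_q.put(None)  # one stop sentinel per worker

    # The only MongoClient lives in the parent process and does every insert
    conn = pymongo.MongoClient()
    cln = conn['testDB']['test_collection']
    for _ in files:
        cln.insert_many(documents=out_q.get())
    conn.close()

    for w in workers:
        w.join()


if __name__ == '__main__':
    main()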

I have tried the following (which works), but it leaves the mongod process accumulating an enormous amount of RAM:

from p_tqdm import p_imap
import gc
import pymongo
import time


def data_chunk_process(f):
    # Simulate some processing that takes time
    time.sleep(3)
    # Some random data that's fairly large (n documents, each with m long string values)
    m = 100
    n = 10000
    d = [{str(k): str([v] * 100) for k, v in zip(range(m), range(m))} for i in range(n)]

    # Each worker opens its own connection and inserts its own chunk
    conn = pymongo.MongoClient()
    cln = conn['testDB']['test_collection']
    cln.insert_many(documents=d)

    conn.close()


class Database:
    def __init__(self):
        # Perhaps we could simply use one connection?
        self.conn = pymongo.MongoClient()

    def database_insertions(self):
        # Process one thousand files, each consisting of several thousand documents
        files = list(range(1000))
        for file_doc_chunk in p_imap(data_chunk_process, files):
            # The workers insert their own chunks, so nothing useful comes back here
            gc.collect()

db = Database()
db.database_insertions()

glances output:

CPU%   MEM%  VIRT  RES     PID USER          TIME+ THR  NI S  R/s W/s Command
0.3    41.9  93.6G 92.2G 32344 mongodb       30:16 32    0 S    ? ?    /usr/bin/mongod

I am worried that if this script is executed on a computer with less RAM (e.g. ~4-8 GB), the system will crash due to using all of the available RAM.

What can be done to remove this constraint?
