Thursday, April 28, 2016

Node.js design: multiple async functions writing to database using function passed as a closure

I am writing a standalone web scraper in Node, run from the command line, which looks for specific data on a set of pages, fetches page views data from Google Analytics and saves it all in a MySQL database. Almost everything is ready, but today I found a problem with the way I write data to the db.

To make things easier, let's assume I have an index.js file and two controllers - db and web. Db reads/writes data to the db; web scrapes the pages using a configurable number of PhantomJS instances.

Web exposes one function checkTargetUrls(urls, writer), where urls is an array of urls to check and writer is an optional parameter, called only if it is a function and there is data to be written.
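To make the contract concrete, here is a hedged sketch of what that signature implies (the scraping step is stubbed out and the pageData shape is hypothetical; writer is only invoked when it is a function and the page yielded data):

```javascript
// Sketch of the described contract, with the PhantomJS scrape stubbed out.
function checkTargetUrls(urls, writer) {
  return Promise.all(urls.map(url => {
    // stub standing in for the real scrape; pageData shape is made up
    const pageData = url.length > 1 ? { matches: url.length } : null;
    if (typeof writer === 'function' && pageData) {
      return writer(url, pageData);
    }
    return Promise.resolve(); // nothing to write for this url
  }));
}
```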

Now the way I pass the writer is obviously wrong, but it looks as follows (in index.js):

    // some code here
    // ...
    let pageId = 0;
    // ... some promise code, which checks the validity of the urls,
    // creates a new execution in the database, etc. ...

    .then(urls => {
        return web.checkTargetUrls(urls,
            function(singleUrl, pageData) {
                // ...
                // a chain of promise-returning functions from the db controller,
                // which first looks up the page id in the db, puts it
                // in the pageId variable and then continues with the write
                // ...
            });
    }).then(() => {
        logger.info('All done captain!');
    }).catch(err => logger.error(err));

As a result, pageId randomly gets overwritten by the id of a preceding/succeeding page and invalid data is saved. Inside web there are up to 10 concurrent PhantomJS instances running, each of which calls the writer function after it has analyzed a page. Excuse my language, but for me an analogy for this situation would be having, say, 10 instances of some object which all rely for writing on a singleton, and that shared state causes the pageId overwriting problem (I don't know how to express this properly in JS/Node.js terms).
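The clobbering can be reproduced in isolation. The sketch below (hypothetical names; the db lookup and write are simulated with promises and timers) shows that when several in-flight writer calls share one outer pageId binding, a slower call reads the value set by a faster one:

```javascript
// Shared outer binding, as in the index.js snippet above.
let pageId = 0;

// stand-in for the async db lookup of a page id
const lookupPageId = url => Promise.resolve(url.length);

function writer(url, delayMs) {
  return lookupPageId(url)
    .then(id => { pageId = id; })                           // race starts here
    .then(() => new Promise(r => setTimeout(r, delayMs)))   // simulated db write
    .then(() => ({ url, savedId: pageId }));                // pageId may be stale by now
}

// 'a' should save id 1, but 'bb' overwrites pageId to 2 before 'a' finishes
Promise.all([writer('a', 30), writer('bb', 10)])
  .then(results => console.log(results));
```

Both lookups resolve before either simulated write completes, so by the time the slower call reads pageId it holds the other page's id.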

So far I have found one fix for the problem, but it is ugly, as it introduces tight coupling. If I put the writer code in a separate module and load it directly from inside the web controller, everything works. But to me that is a bad design pattern, and I would rather do it otherwise.

var writer = require('./writer');

function checkTargetUrls(urls, executionId) {

    return new Promise(function(resolve, reject) {

        let poolSize = config.phantomJs.concurrentInstances;
        let running = 0;

        // ... a bit of code goes here ...

        if (slots != null && slots.data.length > 0) {
            return writer.write(executionId, singleUrl, slots);
        }

        // ... more code follows ...
    });
}

I have a hard time finding a nicer solution where I could still pass writer as an argument to the checkTargetUrls(urls, writer) function. Can anyone point me in the right direction or suggest where to look for the answer?
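The pattern I am after would look something like the sketch below (db helpers findPageId/savePageData are stubbed, hypothetical names). The key difference from my current code is that pageId lives in a binding local to each writer call, never in a shared outer variable, while index.js still injects the writer:

```javascript
// Stub db controller so the sketch is self-contained.
const saved = [];
const db = {
  findPageId: url => Promise.resolve(url.length),           // stub lookup
  savePageData: (executionId, pageId, data) => {
    saved.push({ executionId, pageId, data });              // stub write
    return Promise.resolve();
  }
};

// Factory: builds a writer closed over executionId, with no shared mutable state.
function makeWriter(db, executionId) {
  return function writer(singleUrl, pageData) {
    return db.findPageId(singleUrl).then(pageId =>          // pageId is local to this call
      db.savePageData(executionId, pageId, pageData));
  };
}

// index.js would still inject it, e.g.:
// web.checkTargetUrls(urls, makeWriter(db, executionId));
const writer = makeWriter(db, 42);
Promise.all([writer('a', 'x'), writer('bbb', 'y')])
  .then(() => console.log(saved));
```

Because each invocation gets its own pageId, concurrent PhantomJS callbacks cannot overwrite each other's lookups.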
