Monday, July 17, 2017

What is the best method to minimize code complexity when saving stack data to exploit parallelism?

I am attempting to accelerate some code with CUDA, under the constraint of preserving code readability/maintainability as much as possible.

I have found and parallelized a function buried within several functions/loops. This function accounts for ~98% of processing time, but on its own it doesn't expose enough parallelism to be useful (on the order of a couple of blocks). When many instances of it are executed simultaneously, the code is much faster. However, as a result I am forced to maintain a big list of stack objects that I must iterate over several times; see the code below:

void do_work(int i, ...) {
    // computationally expensive stuff...
}

void prereq_stuff(int i) {

    // lots of big divergent control structures...

    do_work(i); // maybe arrive here..

    // output and what not....
}

int main() {

    for (int i = 0; i < BIG_NUMBER; i++) {
        prereq_stuff(i);
    }

    return 0;
}

Has turned into...

// a struct that contains all the stack data..
struct Stack {
    int foo;
    double bar;
};

void do_work_on_gpu(std::vector<Stack>& contexts) {
    // launch a kernel to handle the expensive stuff..
}

void prereq_stuff(Stack* context, int i) {
    // maybe queue up data for do_work_on_gpu()...
}

void cleanup_stuff(Stack* context, int i) {
    // output and what not...
}

int main() {

    std::vector<Stack> contexts; // container of per-iteration stack data

    for (int i = 0; i < BIG_NUMBER; i++) {
        contexts.emplace_back();
        prereq_stuff(&contexts.back(), i);
    }

    do_work_on_gpu(contexts); // calls the CUDA kernel

    for (size_t i = 0; i < contexts.size(); i++) {
        cleanup_stuff(&contexts[i], i);
    }

    return 0;
}
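
For reference, the body of do_work_on_gpu() is conceptually something like the sketch below: copy the batch of contexts to the device, run one thread per Stack, and copy the results back. This is simplified, with no error checking, and assumes Stack stays trivially copyable; do_work_kernel and the launch configuration are just placeholders:

#include <cuda_runtime.h>
#include <vector>

__global__ void do_work_kernel(Stack* contexts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // computationally expensive stuff on contexts[i]...
    }
}

void do_work_on_gpu(std::vector<Stack>& contexts) {
    Stack* d_contexts = nullptr;
    size_t bytes = contexts.size() * sizeof(Stack);

    // copy the whole batch of stack data to the device
    cudaMalloc(&d_contexts, bytes);
    cudaMemcpy(d_contexts, contexts.data(), bytes, cudaMemcpyHostToDevice);

    // one thread per Stack
    int threads = 256;
    int blocks = (int)((contexts.size() + threads - 1) / threads);
    do_work_kernel<<<blocks, threads>>>(d_contexts, (int)contexts.size());

    // bring the updated contexts back so cleanup_stuff() sees the results
    cudaMemcpy(contexts.data(), d_contexts, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_contexts);
}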

Is there some sort of design construct/pattern I can utilize here? Or is this as simple as it can get, given that all the data needed to call do_work() must be available simultaneously?

Thanks!
