I am attempting to accelerate some code with CUDA, under the constraint of preserving code readability/maintainability as much as possible.
I have found and parallelized a function buried within several layers of functions and loops. This function accounts for ~98% of processing time, but doesn't expose enough parallelism on its own to be useful (on the order of a couple of blocks). When many instances are executed simultaneously, though, the code is much faster. However, as a result I am forced to maintain a big list of stack objects that I must iterate over several times; see the code below:
void do_work(int i, ...) {
    // computationally expensive stuff...
}

void prereq_stuff(int i) {
    // lots of big divergent control structures...
    do_work(i); // maybe arrive here...
    // output and what not...
}

int main() {
    for (int i = 0; i < BIG_NUMBER; i++) {
        prereq_stuff(i);
    }
    return 0;
}
This has turned into...
// a struct that contains all the stack data...
struct Stack {
    int foo;
    double bar;
};

void do_work_on_gpu(List<Stack>& contexts) {
    // launch a kernel to handle the expensive stuff...
}

void prereq_stuff(Stack* context, int i) {
    // maybe queue up data for do_work_on_gpu()...
}

void cleanup_stuff(Stack* context, int i) {
    // output and what not...
}

int main() {
    List<Stack> contexts; // some container of stack objects

    for (int i = 0; i < BIG_NUMBER; i++) {
        Stack* context = contexts.add();
        prereq_stuff(context, i);
    }

    do_work_on_gpu(contexts); // calls the CUDA kernel

    for (int i = 0; i < contexts.size(); i++) {
        cleanup_stuff(&contexts[i], i);
    }
    return 0;
}
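
To make the setup concrete, here is a minimal sketch of what the "launch a kernel" step amounts to (std::vector stands in for the List container, the kernel body is a placeholder, and each queued context maps to one GPU thread):

#include <vector>
#include <cuda_runtime.h>

struct Stack {
    int foo;
    double bar;
};

// One thread per queued context; each thread runs the expensive body.
__global__ void do_work_kernel(Stack* contexts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        contexts[i].bar *= contexts[i].foo; // placeholder for the real work
    }
}

void do_work_on_gpu(std::vector<Stack>& contexts) {
    int n = static_cast<int>(contexts.size());
    if (n == 0) return;

    // Copy all queued contexts to the device in one shot.
    Stack* d_contexts = nullptr;
    cudaMalloc(&d_contexts, n * sizeof(Stack));
    cudaMemcpy(d_contexts, contexts.data(), n * sizeof(Stack),
               cudaMemcpyHostToDevice);

    // One launch over the whole batch, instead of BIG_NUMBER serial calls.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    do_work_kernel<<<blocks, threads>>>(d_contexts, n);

    // Bring the results back so cleanup_stuff() can read them.
    cudaMemcpy(contexts.data(), d_contexts, n * sizeof(Stack),
               cudaMemcpyDeviceToHost);
    cudaFree(d_contexts);
}

The point of the batching is that one launch over all the contexts amortizes the per-call overhead that made the one-call-per-iteration version too small to be worthwhile.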
Is there some sort of design construct/pattern I can utilize here? Or is this as simple as it can get, given that all the data needed to call do_work() has to be available at the same time?
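
For example, would hiding the queue/flush sequence behind a small wrapper be considered a reasonable pattern, or is there something more standard? A hypothetical sketch (the WorkBatch name and the stub bodies are made up for illustration):

#include <vector>

struct Stack {
    int foo;
    double bar;
};

constexpr int BIG_NUMBER = 1000000; // placeholder problem size

void prereq_stuff(Stack* context, int i) { context->foo = i; } // stub
void do_work_on_gpu(std::vector<Stack>& contexts) { /* kernel launch */ }
void cleanup_stuff(Stack* context, int i) { /* output and what not */ }

// Hypothetical wrapper that hides the batching so main() keeps its original shape.
class WorkBatch {
public:
    // Run the divergent prerequisite logic and record the context for later.
    void queue(int i) {
        contexts_.emplace_back();
        prereq_stuff(&contexts_.back(), i);
    }

    // Run the kernel once over everything queued, then the per-item cleanup.
    void flush() {
        do_work_on_gpu(contexts_);
        for (int i = 0; i < static_cast<int>(contexts_.size()); i++) {
            cleanup_stuff(&contexts_[i], i);
        }
        contexts_.clear();
    }

private:
    std::vector<Stack> contexts_;
};

int main() {
    WorkBatch batch;
    for (int i = 0; i < BIG_NUMBER; i++) {
        batch.queue(i);
    }
    batch.flush();
    return 0;
}

At least this keeps the call site in main() about as readable as the original serial loop, but I'm not sure whether it's the idiomatic way to structure this.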
Thanks!