Hello,
I am implementing an application that uses single-threaded IPP with external parallelization via MS OpenMP.
Below is the part of the source code that I used for some tests (the full code is attached to the post).
for (auto t = 1; t <= maxThreads; t++)
{
    auto start = clock();
    #pragma omp parallel default(shared) num_threads(t)
    {
        auto id = omp_get_thread_num();
        auto buffer = buffers[id];
        auto step = steps[id];
        #pragma omp for schedule(dynamic, 1)
        for (auto i = 0; i < count; i++)
            ippiDivC_32f_C1IR(1.0f, buffer, step, roi);
    }
    auto stop = clock();
    cout << "threads=" << t << " time=" << (stop - start) << endl;
}
The application is very simple. It just measures the execution time of an IPP computation as a function of the number of threads used for the processing.
For width=5000, height=5000 and count=100 I obtained the following results:
Intel Core i7-3770 CPU @ 3.40GHz
version=7.0 build 205.58 name=ippie9_l.lib
threads=1 time=982
threads=2 time=947
threads=3 time=945
threads=4 time=957
Intel Xeon CPU E5-1660 0 @ 3.30GHz
version=7.0 build 205.58 name=ippie9_l.lib
threads=1 time=988
threads=2 time=698
threads=3 time=679
threads=4 time=678
threads=5 time=678
threads=6 time=699
As you can see, it is very difficult to get any significant speedup using multiple threads. What is the reason for this behavior? Could you please tell me what the bottleneck of this solution is?
Thank you in advance for your help.
Krzysztof Piotrowski.