I downloaded the Intel IPP DMIP sample package, ipp-samples.7.1.1.013, and built the application\dmip_bench\ utility against IPP v7.1.1. It showed a significant performance boost for the DMIP flavor over the IPP flavor.
I then refactored the ModifyBrightness::DoIPP method to simply process the image row by row, and parallelized this processing with Concurrency::parallel_for. Then I rebuilt the solution twice, once with the _IPP_SEQUENTIAL_STATIC macro and once with the _IPP_PARALLEL_DYNAMIC macro. The results were unexpected.
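For reference, here is a minimal portable sketch of the kind of row-wise parallelization described above. It is not the code from the sample: brightenRow is a hypothetical stand-in for the per-row IPP call (it just adds a constant with saturation), brightenImage and all parameter names are made up for illustration, and std::thread with fixed row bands is used instead of Concurrency::parallel_for so that the sketch compiles outside Visual Studio.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-row IPP call: adds `delta` to every
// 8-bit sample of one row and saturates the result to [0, 255].
void brightenRow(std::uint8_t* row, int rowBytes, int delta) {
    for (int i = 0; i < rowBytes; ++i) {
        const int v = row[i] + delta;
        row[i] = static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
    }
}

// Splits the image rows into contiguous bands, one band per worker thread.
// Assumption: interleaved 8-bit pixels, `stepBytes` bytes between row starts.
// Concurrency::parallel_for would distribute the row indices itself;
// the manual banding below only approximates that behavior.
void brightenImage(std::uint8_t* data, int width, int height, int channels,
                   int stepBytes, int delta, int numThreads) {
    assert(numThreads > 0 && stepBytes >= width * channels);
    const int band = (height + numThreads - 1) / numThreads;
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        const int first = t * band;
        const int last = std::min(height, first + band);
        if (first >= last) break;  // no rows left for this worker
        workers.emplace_back([=] {
            for (int y = first; y < last; ++y) {
                brightenRow(data + static_cast<std::size_t>(y) * stepBytes,
                            width * channels, delta);
            }
        });
    }
    for (auto& w : workers) w.join();  // wait for all bands to finish
}
```

With a 1200x467 RGB buffer and two threads, a call would look like brightenImage(buf, 1200, 467, 3, 1200 * 3, 50, 2). Each thread touches only its own rows, so no locking is needed.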
With _IPP_SEQUENTIAL_STATIC:
DMIP 1.5 Jul 12 2012
ippIP SSSE3 (v8) 7.1.1 (r37466) Sep 24 2012
ippCV SSSE3 (v8) 7.1.1 (r37466) Sep 24 2012
ippCC SSSE3 (v8) 7.1.1 (r37466) Sep 25 2012
Number of threads: 2
DMIP Modify Brightness example time 3.16375 msec slice 34
IPP Modify Brightness example time 1.85974 msec slice 467
Close the session
With _IPP_PARALLEL_DYNAMIC:
DMIP 1.5 Jul 12 2012
ippIP SSSE3 (v8) 7.1.1 (r37466) Sep 27 2012
ippCV SSSE3 (v8) 7.1.1 (r37466) Sep 27 2012
ippCC SSSE3 (v8) 7.1.1 (r37466) Sep 28 2012
Number of threads: 2
DMIP Modify Brightness example time 2.34378 msec slice 34
IPP Modify Brightness example time 6.75662 msec slice 467
Close the session
As you can see, the manually parallelized IPP version built with _IPP_SEQUENTIAL_STATIC (1.85974 msec) works better than DMIP in either build (3.16375 and 2.34378 msec). Why?
I compiled with Visual Studio 2010 under Windows 7 x64, using the x86 solution configuration. The processor is an Intel E6550, and the test image is RGB, 1200x467.
I have attached the modified sample, together with the compiled executables and output logs.