Hi,
At the moment I'm using ippsDotProd_32f from IPP 7.0 quite extensively in one of my projects. I have now tested IPP 8.2 on a Haswell CPU (Xeon E5-2650 v3 in an HP Z640 workstation) with this project because I expected it to be significantly faster (see below). In fact, the code was about 10% slower with IPP 8.2, which I found quite disturbing.
I created a test program (see below) to verify this and found that ippsDotProd_32f (as well as some other functions) seems to be slower in IPP 8.2 than in IPP 7.0 when it is called many times on rather small arrays of about 100 entries. For larger arrays the speed seems to be equal.
Unfortunately this is exactly what I have to do in my project. Now two questions arise:
1. What can I do to make my code run at least as fast as with IPP 7.0, even if I use IPP 8.2?
2. Why is ippsDotProd_32f on a Haswell CPU not actually significantly faster? My assumptions are based on this article (section 3.1):
https://software.intel.com/en-us/articles/intel-xeon-processor-e5-2600-v...
There it is stated that Haswell CPUs have two FMA units and should therefore be much faster at calculating dot products. Furthermore, https://software.intel.com/en-us/articles/haswell-support-in-intel-ipp states that ippsDotProd_32f should actually profit from this, at least in IPP versions later than 7.0.
I would be very thankful for assistance here! Apparently I have misunderstood something? Here is my test code. It was compiled with Visual Studio 2012 on a non-Haswell computer, but the tests were run on the Haswell system mentioned above:
#include "stdafx.h"
#include "windows.h"
#include "ipp.h"
#include "ipps.h"
#include "ippcore.h"

int main(int argc, _TCHAR* argv[])
{
    IppStatus IPP_Init_status;
    IPP_Init_status = ippInit();
    printf("%s\n", ippGetStatusString(IPP_Init_status));

    const IppLibraryVersion *lib;
    lib = ippsGetLibVersion();
    printf("%s %s\n", lib->Name, lib->Version);
    //ippSetNumThreads(1);

    //generate two vectors
    float* vec1;
    float* vec2;
    vec1 = new float[1000]();
    vec2 = new float[1000]();

    //fill vectors with values
    for (int i = 0; i < 1000; i++) {
        vec1[i] = (float)i;
        vec2[i] = (float)(1000 - i);
    }

    //result variable
    float dotprod_result = 0.f;

    //start timing
    int dotprod_time = 0;
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
    LARGE_INTEGER Frequency;
    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);

    //run ippsDotProd
    for (int i = 0; i < 500000000; i++) {
        //ippsSum_32f(vec1,1000, &dotprod_result,ippAlgHintFast);
        ippsDotProd_32f(vec1, vec1, 100, &dotprod_result);
    }

    //stop timing
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    dotprod_time = (int)(ElapsedMicroseconds.QuadPart / 1000);
    printf("Total time [ms]: %d\n", dotprod_time);

    delete[] vec1;
    delete[] vec2;
    return 0;
}
The result for IPP 7.0:
ippStsNoErr: No errors, it's OK.
ippse9-7.0.dll 7.0 build 205.105
Total time [ms]: 7558
The result for IPP 8.2:
ippStsNoErr: No errors.
ippSP AVX2 (l9) 8.2.1 (r44077)
Total time [ms]: 8141