Hello,
I use a Denverton C3955 (stepping B1) @ 2.10 GHz with BIOS 0015D96 on a Harcuvar CRB.
Since Denverton doesn’t have a DMA controller for transferring blocks of data between RAM and PCIe, I use the method suggested by McCalpin John at https://software.intel.com/en-us/forums/software-tuning-performance-opti... who writes the following:
“If the PCIe device does not have its own DMA controller, then the fastest way to copy data from system memory to that IO device is to use a processor core. You would need to set up a memory-mapped IO range for the device with the write-combining attribute, then use a processor core (or thread) to read from (cacheable) system memory and write to the MMIO range using streaming stores”
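A minimal sketch of the approach described in the quote, assuming SSE2 is available and that the destination range has already been mapped write-combining. The helper name wc_stream_copy is hypothetical, and an ordinary aligned buffer can stand in for the MMIO range when trying it out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_load_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper (not part of IPP): copy `bytes` from cacheable system
 * memory to a write-combining destination using streaming stores.
 * Assumptions: both pointers are 16-byte aligned, `bytes` is a multiple of 64. */
static void wc_stream_copy(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    assert(bytes % 64 == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t n = bytes / sizeof(__m128i);

    /* Each iteration moves 64 bytes: four regular loads from cacheable
     * memory followed by four non-temporal stores to the destination. */
    for (size_t i = 0; i < n; i += 4) {
        __m128i a = _mm_load_si128(s + i + 0);
        __m128i b = _mm_load_si128(s + i + 1);
        __m128i c = _mm_load_si128(s + i + 2);
        __m128i e = _mm_load_si128(s + i + 3);
        _mm_stream_si128(d + i + 0, a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#else
    /* Fallback for targets without SSE2: a plain copy, no streaming stores. */
    memcpy(dst, src, bytes);
#endif
}
```

Grouping four 16-byte stores per iteration is a design choice meant to fill one 64-byte write-combining buffer at a time; whether that actually improves PCIe write bursts on a given platform would need to be measured.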
As the target PCIe region I use BAR 0 (video memory) of a Matrox Millennium G550 LP PCIE card installed on the Harcuvar CRB. This BAR 0 is mapped in the MMU as non-cacheable and write-combining.
I call ippInit() and ippiGetLibVersion(), which reports: “ippIP SSE4.2 (y8) 9.0.4 (r52811)”
After that I call ippsCopy_64s to copy 16 Mbytes of data from a local buffer in DDR SDRAM to BAR0.
The addresses of the local buffer and of BAR0 are both aligned to 64 bytes.
The throughput that I get is 90 Mbytes/s when copying from DDR SDRAM to PCIe and 10 Mbytes/s when copying from PCIe to DDR SDRAM.
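A minimal sketch of how a figure like this could be measured, assuming a POSIX system with clock_gettime and taking 1 Mbyte as 2^20 bytes. All names here (copy_fn, time_copy, mbytes_per_second, plain_memcpy) are hypothetical and not taken from the actual test code; plain_memcpy is only a stand-in for the copy routine under test:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for the copy routine being timed. */
typedef void (*copy_fn)(void *dst, const void *src, size_t bytes);

/* Stand-in copy routine; replace with the routine under test. */
static void plain_memcpy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Current monotonic time in seconds. */
static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run one copy and return the elapsed wall-clock time in seconds. */
static double time_copy(copy_fn fn, void *dst, const void *src, size_t bytes)
{
    double t0 = seconds_now();
    fn(dst, src, bytes);
    return seconds_now() - t0;
}

/* Convert a byte count and elapsed time to Mbytes/s (1 Mbyte = 2^20 bytes,
 * which is an assumption about the unit used in the post). */
static double mbytes_per_second(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / (1024.0 * 1024.0) / seconds;
}
```

For scale: at 90 Mbytes/s a 16-Mbyte copy takes roughly 0.18 s, and at 10 Mbytes/s roughly 1.6 s, so a single transfer is long enough to time with this clock.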
Q1. Do the above numbers make sense?
Q2. Is ippsCopy_64s the best option when there is no DMA engine?
Is there any other method of transferring data to/from PCIe that gives higher throughput?
Q3. I tried ippsCopy_32s, ippsCopy_16s and ippsCopy_8u, but the result is the same as with ippsCopy_64s. Could you please explain why?
Q4. I also tried ippiCopyManaged_8u_C1R with the parameter IPP_NONTEMPORAL_STORE, as suggested in https://software.intel.com/en-us/articles/ippscopy-vs-ippicopymanaged
but the result is still the same as with ippsCopy_64s. Could you please explain why?
Thanks.
Ilya.