I am trying to improve my multi-process application, which uses shared memory for communication. I did some profiling with simple tests, and something strange came up: when I copy data stored in shared memory, ReadProcessMemory is faster than memcpy.
I know that I should not use shared memory this way (it is better to read directly from the shared memory), but I'm still wondering why this happens. Further investigation turned up one more thing: if I do two consecutive memcpy calls on the same region of shared memory, the second copy is about twice as fast as the first.
Here is sample code showing the problem. In this example there is only one process, but the problem still shows up: running memcpy on a shared memory region is slower than running ReadProcessMemory on the same region from within my own process!
#include <tchar.h>
#include <basetsd.h>
#include <iostream>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/interprocess/windows_shared_memory.hpp>
#include <time.h>

namespace bip = boost::interprocess;

#include <boost/asio.hpp>

bip::windows_shared_memory* AllocateSharedMemory(UINT32 a_UI32_Size)
{
    bip::windows_shared_memory* l_pShm = new bip::windows_shared_memory(bip::create_only, "Global\\testSharedMemory", bip::read_write, a_UI32_Size);
    bip::mapped_region l_region(*l_pShm, bip::read_write);
    std::memset(l_region.get_address(), 1, l_region.get_size());
    return l_pShm;
}

//Copy the shared memory with memcpy
void CopySharedMemory(UINT32 a_UI32_Size)
{
    bip::windows_shared_memory m_shm(bip::open_only, "Global\\testSharedMemory", bip::read_only);
    bip::mapped_region l_region(m_shm, bip::read_only);
    void* l_pData = malloc(a_UI32_Size);
    memcpy(l_pData, l_region.get_address(), a_UI32_Size);
    free(l_pData);
}

//Copy the shared memory with ReadProcessMemory
void ProcessCopySharedMemory(UINT32 a_UI32_Size)
{
    bip::windows_shared_memory m_shm(bip::open_only, "Global\\testSharedMemory", bip::read_only);
    bip::mapped_region l_region(m_shm, bip::read_only);
    void* l_pData = malloc(a_UI32_Size);
    HANDLE hProcess = OpenProcess(PROCESS_ALL_ACCESS, FALSE, (DWORD) GetCurrentProcessId());
    size_t l_szt_CurRemote_Readsize;
    ReadProcessMemory(hProcess, (LPCVOID)((void*)l_region.get_address()), l_pData, a_UI32_Size, (SIZE_T*)&l_szt_CurRemote_Readsize);
    free(l_pData);
}

// do 2 memcpy on the same shared memory
void CopySharedMemory2(UINT32 a_UI32_Size)
{
    bip::windows_shared_memory m_shm(bip::open_only, "Global\\testSharedMemory", bip::read_only);
    bip::mapped_region l_region(m_shm, bip::read_only);

    clock_t begin = clock();
    void* l_pData = malloc(a_UI32_Size);
    memcpy(l_pData, l_region.get_address(), a_UI32_Size);
    clock_t end = clock();
    std::cout << "FirstCopy: " << (end - begin) * 1000 / CLOCKS_PER_SEC << " ms" << std::endl;
    free(l_pData);

    begin = clock();
    l_pData = malloc(a_UI32_Size);
    memcpy(l_pData, l_region.get_address(), a_UI32_Size);
    end = clock();
    std::cout << "SecondCopy: " << (end - begin) * 1000 / CLOCKS_PER_SEC << " ms" << std::endl;
    free(l_pData);
}

int _tmain(int argc, _TCHAR* argv[])
{
    UINT32 l_UI32_Size = 1048576000;
    bip::windows_shared_memory* l_pShm = AllocateSharedMemory(l_UI32_Size);

    clock_t begin = clock();
    for (int i=0; i<10 ; i++)
        CopySharedMemory(l_UI32_Size);
    clock_t end = clock();
    std::cout << "MemCopy: " << (end - begin) * 1000 / CLOCKS_PER_SEC << " ms" << std::endl;

    begin = clock();
    for (int i=0; i<10 ; i++)
        ProcessCopySharedMemory(l_UI32_Size);
    end = clock();
    std::cout << "ReadProcessMemory: " << (end - begin) * 1000 / CLOCKS_PER_SEC << " ms" << std::endl;

    for (int i=0; i<10 ; i++)
        CopySharedMemory2(l_UI32_Size);

    delete l_pShm;
    return 0;
}
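For anyone who wants to repeat the two-pass measurement with a finer timer, here is a minimal sketch, assuming C++11 is available. It uses std::chrono::steady_clock instead of clock(). TimeMs, TwoPassTiming and TimeTwoCopies are hypothetical names that are not part of my test above, and the sketch takes plain pointers rather than a mapped region, so it only illustrates the timing pattern, not the shared memory behaviour.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not in the original test): runs `work` once and
// returns the elapsed time in milliseconds, measured with a monotonic clock.
template <typename Work>
double TimeMs(Work work)
{
    const auto begin = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}

// Hypothetical result type: elapsed time of each of the two passes.
struct TwoPassTiming
{
    double first_ms;
    double second_ms;
};

// Copies `size` bytes from `src` to `dst` twice and times each pass
// separately, mirroring the FirstCopy / SecondCopy pattern.
TwoPassTiming TimeTwoCopies(const void* src, void* dst, std::size_t size)
{
    assert(src != nullptr && dst != nullptr);
    TwoPassTiming timing;
    timing.first_ms  = TimeMs([&] { std::memcpy(dst, src, size); });
    timing.second_ms = TimeMs([&] { std::memcpy(dst, src, size); });
    return timing;
}
```

In the real test, `src` would be `l_region.get_address()` and `dst` a buffer of `a_UI32_Size` bytes. Unlike CopySharedMemory2, this sketch reuses one destination buffer for both passes, so the allocation cost is left out of both timings.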
And here is the output:
MemCopy: 8891 ms
ReadProcessMemory: 6068 ms
FirstCopy: 796 ms
SecondCopy: 327 ms
FirstCopy: 795 ms
SecondCopy: 328 ms
FirstCopy: 780 ms
SecondCopy: 344 ms
FirstCopy: 780 ms
SecondCopy: 343 ms
FirstCopy: 780 ms
SecondCopy: 327 ms
FirstCopy: 795 ms
SecondCopy: 343 ms
FirstCopy: 780 ms
SecondCopy: 344 ms
FirstCopy: 796 ms
SecondCopy: 343 ms
FirstCopy: 796 ms
SecondCopy: 327 ms
FirstCopy: 780 ms
SecondCopy: 328 ms
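To make these timings easier to compare, they can be converted to throughput. The buffer is 1048576000 bytes, which is exactly 1000 MiB, and the first two figures each cover 10 copies. The helper below is a small sketch of that arithmetic; ThroughputMiBPerSec is a hypothetical name, not something from my test program.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts a byte count and an elapsed time in
// milliseconds into throughput in MiB per second.
double ThroughputMiBPerSec(std::uint64_t bytes, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    const double mib = static_cast<double>(bytes) / (1024.0 * 1024.0);
    return mib / (elapsed_ms / 1000.0);
}
```

With the numbers above, `ThroughputMiBPerSec(10ull * 1048576000, 8891)` gives roughly 1125 MiB/s for the memcpy loop and `ThroughputMiBPerSec(10ull * 1048576000, 6068)` roughly 1648 MiB/s for ReadProcessMemory. A single FirstCopy at 796 ms is about 1256 MiB/s and a SecondCopy at 327 ms about 3058 MiB/s. Note that the two loop figures also include opening and mapping the shared memory plus malloc and free on every iteration, because all of that sits inside the timed loop.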
If anyone has an idea why memcpy is so slow, and if there is a solution to this problem, I'm all ears.
Thanks.