Linux: controlling my process's virtual memory mapping for fast emulation

Recently, it occurred to me that many emulators are slow because they have to simulate not only the processor but also the memory of the emulated device. When a device has memory-mapped I/O, virtual memory, or simply unused address space, each memory access must be modeled in software.

It seems to me that this could be much faster if the OS did it for us using virtual memory. I will use Game Boy emulation as an example for simplicity, but obviously this method would pay off more for newer and more powerful machines.

The Game Boy memory map is approximately:

  • 0x0000 - 0x7FFF: mapped to cartridge ROM
    • Most cartridges have 0x0000 - 0x3FFF fixed and 0x4000 - 0x7FFF bank-switchable by writing to 0x2000
  • 0x8000 - 0x9FFF: video memory (accessible only when it is not being rendered)
  • 0xA000 - 0xBFFF: mapped to the cartridge (usually battery-backed RAM)
  • 0xC000 - 0xDFFF: internal RAM (0xD000 - 0xDFFF is bank-switchable on the Game Boy Color)
  • 0xE000 - 0xFDFF: mirror of internal RAM
  • 0xFE00 - 0xFE9F: object attribute memory (sprite memory)
  • 0xFEA0 - 0xFEFF: unmapped (open bus or something, not sure)
  • 0xFF00 - 0xFF7F: memory-mapped I/O (sound system, video control, etc.)
  • 0xFF80 - 0xFFFF: internal RAM

Thus, a traditional emulator should translate each memory access something like this:

    if(addr < 0x4000) return rom[addr];
    else if(addr < 0x8000) return rom[(addr - 0x4000) + (0x4000 * cur_rom_bank)];
    else if(addr < 0xA000) {
        if(vram_accessible) return vram[addr - 0x8000];
        else return 0xFF;
    }
    else if(addr < 0xC000) return saveram[addr - 0xA000];
    else if(addr < 0xE000) return ram[addr - 0xC000];
    else if(addr < 0xFE00) return ram[addr - 0xE000]; //mirror of internal RAM
    else if(addr < 0xFEA0) return oam[addr - 0xFE00]; //OAM is 0xFE00 - 0xFE9F inclusive
    else if(addr < 0xFF00) return 0xFF; //or whatever should be here
    else if(addr < 0xFF80) return handle_io_read(addr);
    else return hram[addr - 0xFF80];

Obviously, this can be optimized with a switch or a table, but it is still a lot of code to run on every memory access. We could improve emulation speed somewhat by mapping some pages at these addresses in our process's memory map:

  • 0x0000 - 0x3FFF: R-- (there is no Exec flag, because the native processor does not execute it)
  • 0x4000 - 0x7FFF: R--
  • 0x8000 - 0x9FFF: ---
  • 0xA000 - 0xBFFF: ---
  • 0xC000 - 0xDFFF: RW-
  • 0xE000 - 0xFDFF: RW- (and mapped to the same physical pages as 0xC000 - 0xDFFF)
  • 0xFE00 - 0xFE9F: ---
  • 0xFEA0 - 0xFEFF: ---
  • 0xFF00 - 0xFF7F: ---
  • 0xFF80 - 0xFFFF: RW-

Then handle the SIGSEGV (or whatever other signal is generated) that we receive when accessing the protected pages. Reading from ROM or writing to RAM can then simply be done directly, and writing to ROM raises an exception that we can handle. We can change the VRAM permissions (0x8000 - 0x9FFF) to RW- when it should be accessible and --- when it should not. In theory this could be much faster, since the emulator no longer needs to translate every memory access in software.
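To make the idea concrete, here is a minimal sketch of that scheme in C. It is an illustration under assumptions, not a tested emulator core: the names guest_init, guest_write, and fault_addr are made up for this example, a 4 KiB host page size is assumed, and only the ROM and internal RAM ranges get permissions (everything else stays inaccessible). Calling sigsetjmp on every access is only there to keep the demo portable; it would eat most of the gain, and a real emulator would more likely decode or patch the faulting instruction.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define GUEST_SIZE 0x10000u           /* 64 KiB Game Boy address space */

static uint8_t *guest;                /* host base of the guest address space */
static sigjmp_buf fault_jmp;          /* resume point after a guest fault */
static volatile sig_atomic_t in_guest_access;
static volatile uintptr_t fault_addr; /* host address of the last guest fault */

/* SIGSEGV handler: record the faulting address and unwind to guest_write(). */
static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)ctx;
    if (!in_guest_access) {           /* a genuine bug in the emulator itself */
        signal(sig, SIG_DFL);         /* re-fault with the default action */
        return;
    }
    fault_addr = (uintptr_t)si->si_addr;
    siglongjmp(fault_jmp, 1);
}

/* Install the handler, reserve the guest space, and set page permissions. */
static int guest_init(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;

    /* Everything starts out as "---". */
    guest = mmap(NULL, GUEST_SIZE, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guest == MAP_FAILED) return -1;
    /* 0x0000 - 0x7FFF: R-- (ROM) */
    if (mprotect(guest + 0x0000, 0x8000, PROT_READ) != 0) return -1;
    /* 0xC000 - 0xDFFF: RW- (internal RAM) */
    if (mprotect(guest + 0xC000, 0x2000, PROT_READ | PROT_WRITE) != 0) return -1;
    return 0;
}

/* Returns 0 if the store went straight to memory, 1 if it faulted and the
 * caller should take the slow path (bank switch, I/O register, ...). */
static int guest_write(uint16_t addr, uint8_t val) {
    in_guest_access = 1;
    if (sigsetjmp(fault_jmp, 1) == 0) {
        ((volatile uint8_t *)guest)[addr] = val;
        in_guest_access = 0;
        return 0;
    }
    in_guest_access = 0;
    return 1;
}
```

With this, a write to 0xC000 lands directly in RAM, while a write to 0x2000 (the bank-select address) comes back as 1 with fault_addr pointing at guest + 0x2000, which is where a bank-switch handler would hook in.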

I know that I can use mmap() to map pages at fixed addresses with different permissions. What I don't know is:

  • Can mappings overlap with different permissions?
  • Is it possible to map pages at arbitrary addresses, regardless of the system's page size? Is it possible to map address 0?
  • How do I change the memory a mapping points to? (For example, when the ROM bank is changed, we could simply switch which memory is mapped at 0x4000 - 0x7FFF, but how?)
  • In a real-world case, where the emulated system has a 32- or 64-bit CPU, can I map the entire first 4 GB, or potentially all of memory? How can I avoid conflicts with what is already mapped (for example, libraries, my stack, the kernel)?
  • Will it really be faster? Or does raising and catching SIGSEGV generate more overhead than the traditional way?
  • If this cannot be done in user space, does Linux provide a way to "take over" the kernel and do it there? That way, could I at least create an "emulator OS" that runs on bare metal while still having some features of the Linux kernel (such as video and file system drivers)?
2 answers

I would expect that generating a SIGSEGV, catching it, handling it, and resuming has more performance overhead than the same access had on the original hardware, so arrange things so that it only happens when there really was a fault, which is allowed to be slow.

This is a good technique for memory-protection / array-bounds checking when violations are rare and it is fine for them to be slow. Speeding up the common case a little, even if it makes the exceptional case much slower, is a win when the exceptional case does not occur in normal emulated code.

I have heard that JavaScript implementations do this to get cheaper array bounds checks: allocate the array so that it ends at the end of a page, with the next page left unmapped.
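As a rough illustration of that guard-page trick (not taken from any particular JavaScript engine), the sketch below places an array so that its last byte is the last byte of a page and makes the following page inaccessible. The helper names alloc_with_guard and read_faults are hypothetical; read_faults only exists to show the effect, by letting a forked child take the fault instead of the caller.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allocate len bytes so that the first byte past the array lies on a
 * PROT_NONE guard page. Returns NULL on failure. */
static uint8_t *alloc_with_guard(size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data = (len + page - 1) / page * page;  /* round up to whole pages */
    uint8_t *base = mmap(NULL, data + page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    if (mprotect(base + data, page, PROT_NONE) != 0) {
        munmap(base, data + page);
        return NULL;
    }
    return base + data - len;   /* the array ends exactly where the guard begins */
}

/* Returns 1 if reading *p kills a child process with SIGSEGV,
 * 0 if the read succeeds, -1 if fork/waitpid failed. */
static int read_faults(const volatile uint8_t *p) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        signal(SIGSEGV, SIG_DFL);
        uint8_t v = *p;         /* the access under test */
        (void)v;
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

For a 100-byte array a, a[99] is an ordinary access and a[100] faults, so the hardware does the upper bounds check for free. The lower bound is not covered by this layout.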


Take this with a grain of salt: I have never actually written code that uses any of this. I have just heard about it, and I think I understand how it works and some of the implications.

I hope this points you to the documentation that will tell you what can actually be done.

Updating page tables is pretty slow. Try to find a balance where you use user-space memory protection for some checks, but you are not constantly mapping / unmapping pages in your address space during the "common case" of what your emulated code does. Correctly predicted branches run very fast, especially if they are not taken.

I have seen Linux kernel discussion / notes indicating that playing tricks with mmap is not worth it compared to a plain memcpy when only a single page is involved. For larger amounts of memory, or when it saves many repeated access checks, the benefit will outweigh the setup overhead.


You want to use mprotect(2) to change the permissions on (ranges of) pages. No, mappings cannot overlap. See the MAP_FIXED option in mmap(2):

If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded.
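That replacement behavior is also one way to do the ROM bank switch from the question: keep the ROM image in a file and re-mmap a different file offset over 0x4000 - 0x7FFF. The following is only a sketch under assumptions: rom_init and map_bank are made-up names, the "ROM" is a dummy in-memory file created with memfd_create (Linux 3.17+, glibc 2.27+) in which bank n is filled with the byte n, and the host page size is assumed to divide 0x4000.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BANK_SIZE 0x4000u
#define NUM_BANKS 4u

static int rom_fd = -1;   /* in-memory file holding the whole ROM image */
static uint8_t *guest;    /* host base of the 64 KiB guest address space */

/* Build a dummy ROM (bank n filled with byte n) and reserve the guest space. */
static int rom_init(void) {
    uint8_t bank[BANK_SIZE];
    rom_fd = memfd_create("rom", 0);
    if (rom_fd < 0) return -1;
    for (unsigned n = 0; n < NUM_BANKS; n++) {
        memset(bank, (int)n, sizeof bank);
        if (pwrite(rom_fd, bank, sizeof bank, (off_t)n * BANK_SIZE)
                != (ssize_t)sizeof bank)
            return -1;
    }
    guest = mmap(NULL, 0x10000, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return guest == MAP_FAILED ? -1 : 0;
}

/* Make guest 0x4000 - 0x7FFF show ROM bank n. MAP_FIXED discards whatever
 * was mapped in that range before, so no munmap is needed in between. */
static int map_bank(unsigned n) {
    void *want = guest + 0x4000;
    void *got = mmap(want, BANK_SIZE, PROT_READ, MAP_SHARED | MAP_FIXED,
                     rom_fd, (off_t)n * BANK_SIZE);
    return got == want ? 0 : -1;
}
```

Each map_bank call is a system call that rewrites page tables, so whether this beats an index computation depends on how often the game switches banks.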

IDK if you can do anything useful with the x86 segment registers so that accesses to emulated memory map guest address 0 to a different address in the process's virtual address space. You can map virtual address 0, but Linux disables this by default so that NULL pointer dereferences don't silently work!

Your software users will have to futz with sysctl (same as for WINE) to enable it:

    # Ubuntu /etc/sysctl.d/10-zeropage.conf
    # Protect the zero page of memory from userspace mmap to prevent kernel
    # NULL-dereference attacks against potential future kernel security
    # vulnerabilities. (Added in kernel 2.6.23.)
    #
    # While this default is built into the Ubuntu kernel, there is no way to
    # restore the kernel default if the value is changed during runtime; for
    # example via package removal (eg wine, dosemu). Therefore, this value
    # is reset to the secure default each time the sysctl values are loaded.
    vm.mmap_min_addr = 65536

As I said, you could use a segment override on all loads / stores to guest (emulated) memory to remap it to a more reasonable page. Or maybe just use a constant offset of 64 KiB (or more, perhaps to place it above the text / data / bss (heap) of the emulator software). Or a non-constant offset, using a pointer to the base of your mmapped guest memory region, so that everything is relative to a global variable. With gcc, this might be a good candidate for asking gcc to keep that global in a register across your functions. IDK, you would have to measure whether it helps performance or not. A constant offset means every instruction that accesses guest memory needs a 32-bit displacement field in its addressing mode, rather than 0 or 8 bits.

The segment register approach, if it works the way I think it does (as a constant offset applied with a segment override prefix instead of a 32-bit displacement), would be much harder to get a compiler to generate, AFAIK. If it were only loads / stores, that would be one thing: you could use inline asm wrappers around the load and store insns. But for efficient x86 code, all kinds of ALU instructions should use memory operands, to reduce front-end bottlenecks through micro-fusion.

Perhaps you can simply define a global char *const guest_mem = (void*)0x2000000; or something like that, and then use mmap with MAP_FIXED to force the guest memory to be mapped there? Then accesses to guest memory could compile to more efficient single-register addressing modes.
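A minimal sketch of that last idea follows, with a few deliberate deviations that are my assumptions rather than anything from the answer: the base address is 0x10000000 instead of 0x2000000 (an arbitrary pick that is less likely to collide with the heap of a non-PIE binary), and it uses MAP_FIXED_NOREPLACE (Linux 4.17+) instead of plain MAP_FIXED so that an existing mapping is never silently clobbered. The names guest_map, guest_read, and guest_write are hypothetical.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>

/* Arbitrary fixed host address for guest address 0 (an assumption). */
#define GUEST_BASE 0x10000000UL
#define GUEST_SIZE 0x10000UL

static uint8_t *const guest_mem = (uint8_t *)GUEST_BASE;

/* Map 64 KiB of zeroed memory at exactly GUEST_BASE. Returns 0 on success,
 * -1 if the range is unavailable. On kernels older than 4.17 the flag is
 * ignored and the address is only a hint, hence the explicit check. */
static int guest_map(void) {
    void *p = mmap(guest_mem, GUEST_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED) return -1;
    if (p != (void *)guest_mem) {
        munmap(p, GUEST_SIZE);
        return -1;
    }
    return 0;
}

/* Because guest_mem is a compile-time constant, these can compile to a
 * single access with the guest address in one register plus a displacement. */
static inline uint8_t guest_read(uint16_t addr) { return guest_mem[addr]; }
static inline void guest_write(uint16_t addr, uint8_t v) { guest_mem[addr] = v; }
```

Whether the constant base actually beats a base pointer kept in a register is something to measure, as noted above.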


General information

The Dolphin emulator has a fastmem feature. AFAIU, JITed code blocks are generated on the assumption that memory accesses hit ordinary memory. If at some point an instruction accesses hardware (memory-mapped I/O) instead, the instruction is patched to use the slow memory path. This is triggered by a segfault, which the emulator handles:

  • a trampoline is generated that calls the appropriate slow-path memory access function;

  • the faulting instruction is patched and replaced with a jump to this trampoline.

This is somewhat similar to what you are describing. With JIT / patching, the cost of the page faults can be amortized (taking a page fault on every access to a hardware address would be inefficient).

By the way, you might be interested in how the emulated memory is set up. See MemoryMap_Setup().

Answers to your questions

Can mappings overlap with different permissions?

If you mmap something that overlaps the previous VMA, this replaces part of the old VMA with the new one.

Can I map pages at such arbitrary addresses, regardless of the system's page size?

No, VMAs are always aligned to page boundaries (4 KiB on x86 and x86_64). If you map a file / shared memory, the offset has an alignment restriction as well.
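A quick way to see what this means for the Game Boy map is to check which regions consist of whole pages. The helper below is a hypothetical illustration (region_is_page_aligned is not an existing API); in a real program the page size would come from sysconf(_SC_PAGESIZE) rather than a literal.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the inclusive guest range [first, last] starts on a page boundary
 * and ends on the last byte of a page, so it can get its own permissions. */
static bool region_is_page_aligned(uint32_t first, uint32_t last,
                                   uint32_t page_size) {
    return first % page_size == 0 && (last + 1) % page_size == 0;
}
```

With 4 KiB pages, 0xC000 - 0xDFFF qualifies, but 0xE000 - 0xFDFF, 0xFE00 - 0xFE9F, and 0xFF80 - 0xFFFF do not: everything from 0xF000 to 0xFFFF shares one page, so that page would have to stay protected and be handled in software.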

Is it possible to map address 0?

At least by default, Linux does not allow you to do this (the vm.mmap_min_addr sysctl sets the lowest address a process may map).

In a real-world case, where the emulated system has a 32- or 64-bit CPU, can I map the entire first 4 GB, or potentially all of memory?

You cannot map the entire address space. AFAIU, what Dolphin does is map the emulated 32-bit address space at a fixed offset within its own 64-bit address space.

How can I avoid conflicts with what is already mapped (for example, libraries, my stack, the kernel)?

Having a host address space that is larger than the emulated one helps.

If this cannot be done in user space, does Linux provide a way to "take over" the kernel and do it there? That way, could I at least create an "emulator OS" that runs on bare metal while still having some features of the Linux kernel (such as video and file system drivers)?

If you are trying to emulate your native processor (the same architecture as the host), you can use virtualization technology (for example, KVM).


Source: https://habr.com/ru/post/1239460/

