I have written a complete RISC-style system simulator/emulator (including all peripherals). Currently it uses an indirectly threaded inline interpretation scheme; that is, every instruction handler's footer is something along the lines of:
pc += 4; inst = loadWord(mem, pc); instp = decodeTable[opcode(inst)]; goto *instp;
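(For context, here is a minimal sketch of how that dispatch fits together in GNU C with computed gotos; loadWord, opcode, decodeTable, and mem are taken from the snippet above, while cpu_loop, the opcode set, and the declarations are my own assumptions:)

    #include <stdint.h>

    extern uint8_t  mem[];                         /* guest RAM (assumed) */
    uint32_t loadWord(uint8_t *m, uint32_t addr);  /* from the snippet above */
    unsigned opcode(uint32_t inst);                /* from the snippet above */

    void cpu_loop(uint32_t pc)
    {
        /* GNU C "labels as values": one handler label per opcode. */
        static void *decodeTable[] = { &&op_addi /* , ...one per opcode */ };
        uint32_t inst;
        void *instp;

        inst  = loadWord(mem, pc);
        instp = decodeTable[opcode(inst)];
        goto *instp;

    op_addi:
        /* ...execute the instruction, then the common footer: */
        pc += 4; inst = loadWord(mem, pc);
        instp = decodeTable[opcode(inst)]; goto *instp;
    }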
This works well: I get about 70-80 MIPS on a modern machine when booting Linux, which is not bad.
However, I am looking at moving to a directly threaded model with a predecoded instruction stream, which looks like this:
tPC += 1; instp = predecodeMem[tPC].operation; goto *instp;
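For concreteness, an entry of predecodeMem might look like the sketch below; the field names and the fixed operand layout are assumptions, not the actual design:

    #include <stdint.h>

    /* Hypothetical layout of one predecodeMem slot (one per guest
     * instruction word); the operand fields are extracted once, at
     * predecode time, instead of on every execution. */
    typedef struct {
        void    *operation;     /* &&handler_label, consumed by goto *instp */
        uint8_t  rd, rs1, rs2;  /* pre-extracted register numbers */
        int32_t  imm;           /* pre-extracted, sign-extended immediate */
    } PredecodedInsn;

    extern PredecodedInsn predecodeMem[];  /* shadow of guest RAM */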
Predecoding in itself is not a problem; it just replaces the existing decoder and adds some shadow memory. My main problem is self-modifying code (or semi-self-modifying code).
In the simple case, we can simply allocate the predecode pages lazily, when a previously unexecuted page is first visited. The soft TLB is then flushed of all entries for that page so that the next write to it goes through the full memory-simulation slow path; thus writes to executable pages also have to update the predecode information. This costs performance, but since such writes are rare we should not have problems with it (and we can speed it up further with per-page executed bits computed at runtime).
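Under those assumptions, the write slow path could look roughly like this (the page size, the page_executed array, and the helpers write_guest_mem/predecode_one are all hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                  /* assumed 4 KiB guest pages */

    extern bool page_executed[];           /* assumed per-page executed bits */
    void write_guest_mem(uint32_t paddr, uint32_t val);  /* assumed helper */
    void predecode_one(uint32_t paddr);    /* assumed: redecode one word */

    /* Slow-path store, reached only because the soft-TLB entry for this
     * page was flushed; ordinary data pages keep taking the fast path. */
    void store_word_slow(uint32_t paddr, uint32_t val)
    {
        write_guest_mem(paddr, val);       /* perform the actual store */
        if (page_executed[paddr >> PAGE_SHIFT])
            predecode_one(paddr);          /* keep shadow memory coherent */
    }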
The problem here is detecting, over the long term, when pages are reused by the operating system running inside the emulator. For example, a memory page can be allocated by the Linux kernel as code for one process; the next time around, the same page may be allocated as data, but under the scheme described this causes problems, since a pure data page now has to go through the rather slow predecode update on every single write.
I have a few ideas, but none of them strike me as particularly nice, i.e. all of them have significant drawbacks:
- Use mprotect() on executed pages and catch the protection faults with a signal handler (first sketch after this list). This slows down writes to such pages a lot and makes multi-threaded, multi-core emulation a pain.
- On a write, instead of updating the predecode information, flush the soft-TLB translations used for executing code from that page and flip the page's dirty bit. This slows down subsequent execution of code on the page, but at least reads and writes stay fast. The problem is that the next time the page is reused as code rather than data, we run into the same problem again.
- Move to a smaller predecode cache that stores predecode information for a limited number of pages, with pages aged out and evicted under an LRU policy or something similar (second sketch after this list). This approach punishes applications with a fixed memory layout (i.e. many embedded applications).
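For the first option, a rough sketch of the trap-based scheme, heavily simplified (invalidate_predecode_for is a hypothetical helper, and a real handler would also have to decode the faulting access and cope with async-signal-safety and racing threads, which is exactly the pain mentioned above):

    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define HOST_PAGE 4096ul                  /* assumed host page size */

    void invalidate_predecode_for(void *host_page);   /* assumed helper */

    /* After predecoding a page, write-protect its host backing so that
     * any later guest store to it traps. */
    void protect_code_page(void *host_page)
    {
        mprotect(host_page, HOST_PAGE, PROT_READ);
    }

    /* On a trapped store: drop the predecode info, re-enable writes,
     * and return so the faulting store restarts. */
    static void segv_handler(int sig, siginfo_t *si, void *ctx)
    {
        void *page = (void *)((uintptr_t)si->si_addr & ~(HOST_PAGE - 1));
        invalidate_predecode_for(page);
        mprotect(page, HOST_PAGE, PROT_READ | PROT_WRITE);
    }

    void install_segv_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags     = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }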
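For the third option, a minimal sketch of a bounded predecode cache with LRU aging, reusing the PredecodedInsn type from the earlier sketch; the pool size, list scheme, and helper names are all assumptions:

    #define POOL_PAGES     256          /* assumed cache capacity */
    #define INSNS_PER_PAGE (4096 / 4)   /* assumed 4 KiB pages, 4-byte insns */

    typedef struct PredecodePage {
        uint32_t guest_page;                /* guest page currently decoded */
        struct PredecodePage *prev, *next;  /* LRU list links */
        PredecodedInsn insns[INSNS_PER_PAGE];
    } PredecodePage;

    static PredecodePage pool[POOL_PAGES];
    static PredecodePage *lru_tail;     /* least recently executed page */

    void clear_executed_bit(uint32_t guest_page);             /* assumed */
    void flush_soft_tlb_for(uint32_t guest_page);             /* assumed */
    void predecode_page(uint32_t guest_page, PredecodedInsn *out); /* assumed */
    void move_to_front(PredecodePage *p);   /* list maintenance, omitted */

    /* Called when execution reaches a page that has no predecode buffer:
     * recycle the least recently executed buffer for the new page. */
    PredecodePage *acquire_predecode_page(uint32_t guest_page)
    {
        PredecodePage *p = lru_tail;        /* victim */
        clear_executed_bit(p->guest_page);  /* age out the old mapping */
        flush_soft_tlb_for(p->guest_page);
        p->guest_page = guest_page;
        predecode_page(guest_page, p->insns);
        move_to_front(p);                   /* now most recently used */
        return p;
    }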
I have not found this topic seriously discussed in the literature. What are the general techniques for aging pages once they are no longer used as code, so that we can, for example, clear the executed bit and reclaim the predecode memory associated with the page?