Pipelining, caches, and the fact that the processor core itself is no longer the main bottleneck have done two things to your question. First, processors today generally execute one instruction per clock cycle; second, it can take many (tens to hundreds of) clock cycles to feed the processor. More modern processors, even if their instruction sets are old, rarely bother to quote clock counts per instruction, because the answer is one clock and the "real" execution speed is too complicated to describe.
The cache and pipeline try to let the processor run at that one-instruction-per-clock rate, but a read from memory, for example, has to wait for the response to come back. If the item is not in the cache, that can take hundreds of clock cycles, because several locations have to be read to fill the cache line, and then it takes a few more clocks to get the data back through the caches to the processor.
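To put rough numbers on that, here is a minimal sketch of the usual average-cost arithmetic for a single cache level. The function name and the figures (1 clock for a hit, 200 clocks for a miss) are assumptions for illustration only, not measurements of any real processor.

```python
def average_access_cycles(hit_cycles, miss_penalty_cycles, miss_rate):
    """Average clocks per memory read for one cache level.

    Every read pays hit_cycles; a fraction miss_rate of reads
    additionally pays miss_penalty_cycles to fill the cache line.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    # Assumed figures: 1-clock hit, 200-clock miss penalty.
    for rate in (0.0, 0.01, 0.05):
        print(f"miss rate {rate:.2f}: "
              f"{average_access_cycles(1, 200, rate):.1f} clocks per read")
```

Even a small miss rate moves the average well away from one clock, which is why a single "cycles per instruction" figure stops being meaningful.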
Now, if you go back in time, or stay in the present but move to the microcontroller world, or to some other system where the memory can respond in one clock cycle or at least in a very deterministic number of clocks (say, two clocks for EEPROM and one for RAM, that kind of thing), then you can very easily count the exact number of clocks. Processors like that often publish a table of cycles per instruction. A two-word instruction, for example, would take two clocks to fetch and then another clock to execute, so at least 3 clocks. Some instructions actually take more than one clock to execute, so that gets added in as well.
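The counting described above can be sketched in a few lines. The table below is entirely hypothetical (made-up mnemonics, word counts and execute clocks), and it assumes one clock per fetched word with no overlap between fetch and execute; a real part's datasheet table would replace it.

```python
# Hypothetical cycle table: mnemonic -> (instruction words, execute clocks).
# These entries are illustrative assumptions, not from any real datasheet.
CYCLE_TABLE = {
    "nop": (1, 1),
    "add": (1, 1),
    "load_imm16": (2, 1),  # two-word instruction
    "mul": (1, 2),         # takes more than one clock to execute
}

FETCH_CLOCKS_PER_WORD = 1  # assumed: memory answers in one clock


def instruction_clocks(mnemonic):
    """Clocks for one instruction: fetch every word, then execute."""
    words, execute_clocks = CYCLE_TABLE[mnemonic]
    return words * FETCH_CLOCKS_PER_WORD + execute_clocks


def sequence_clocks(mnemonics):
    """Total clocks for a straight-line sequence of instructions."""
    return sum(instruction_clocks(m) for m in mnemonics)


if __name__ == "__main__":
    program = ["nop", "load_imm16", "mul"]
    print(sequence_clocks(program), "clocks")
```

With these assumed numbers the two-word instruction comes out at the minimum of 3 clocks mentioned above (two to fetch, one to execute).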
I highly recommend finding a (used) copy of Zen of Assembly Language by Michael Abrash. It was dated when it came out, but it is still an important work. Learning to juggle the relatively simple 8088/86 was tough enough; today's x86 and other systems are a bit more complicated.
If you are running Windows or Linux or something like that, trying to time your code is not necessarily going to get you where you want to be. Adding or removing a nop, which shifts the alignment of the rest of the code in memory by as little as a byte, can dramatically affect the performance of that code, even though nothing about it has changed other than its location in RAM. That is one simple example of the complex nature of the problem.
Which processor or system are you interested in? The STM32F4 Discovery board, about $20, has an ARM (Cortex-M) processor on it with instruction and data caches. It has the complexities of a larger system, but at the same time it is simple enough (relative to a larger system) to let you run controlled experiments.
If you are familiar with the Microchip PIC world, people there often count instruction cycles to produce precise delays between events. It is a very deterministic environment (until you use interrupts).
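As a rough sketch of that kind of delay arithmetic, the helper below works out how many loop iterations fit a requested delay. The function name and parameters are hypothetical. It assumes a classic 8-bit PIC where one instruction cycle is four oscillator clocks, and a loop body costing 3 instruction cycles per iteration (for example a decrement-and-skip followed by a goto); check the datasheet for your part before relying on either number.

```python
def delay_loop_iterations(delay_us, fosc_hz, cycles_per_iteration,
                          overhead_cycles=0, clocks_per_instruction_cycle=4):
    """Return (iterations, leftover_cycles) for a busy-wait delay loop.

    leftover_cycles is what remains to be padded with nops after the
    loop. All timing figures are assumptions supplied by the caller.
    """
    # Instruction cycles available in the requested delay.
    total_cycles = round(
        delay_us * fosc_hz / (clocks_per_instruction_cycle * 1_000_000)
    )
    loop_cycles = total_cycles - overhead_cycles
    if loop_cycles < 0:
        raise ValueError("delay is shorter than the fixed overhead")
    return divmod(loop_cycles, cycles_per_iteration)


if __name__ == "__main__":
    # Assumed: 4 MHz oscillator, 3 cycles per iteration, 1 cycle of setup.
    iterations, leftover = delay_loop_iterations(100, 4_000_000, 3,
                                                 overhead_cycles=1)
    print(iterations, "iterations,", leftover, "cycles of padding")
```

This only holds while nothing else steals cycles, which is exactly why interrupts break the determinism.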