How to find the bottleneck from the synthesis timing report

I implemented the 80c51 architecture in VHDL using the Xilinx tools. In an attempt to increase the clock speed, I pipelined all the 80c51 instructions. Instructions overlap: for example, while the first instruction is being executed, the second is being fetched.

However, according to the summary report, I only get a slightly higher clock speed (around +/- 10 Hz), despite creating a pipeline of depth 3. I realized that the bottleneck is due to one path listed in the summary report, but I could not understand the report.

May I ask what the data path from "SEQ/decode_3 to SEQ/i_ram_addr_7" is doing? (My guess is that it comes from the case statement that checks the 100+ opcodes, but I am not sure that this is the bottleneck.)

So I have two questions:

First, is it possible that pipelining does not increase the clock speed, and that a testbench is the only way to demonstrate the reduction in execution time?

Second, how can I determine which part of my code corresponds to the bottleneck path from SEQ/decode_3 to SEQ/i_ram_addr_7?

Thanks to everyone who can help explain my doubts!

    Timing Summary:
    ---------------
    Speed Grade: -4

       Minimum period: 12.542ns (Maximum Frequency: 79.730MHz)
       Minimum input arrival time before clock: 10.501ns
       Maximum output required time after clock: 5.698ns
       Maximum combinational path delay: No path found

    Timing Detail:
    --------------
    All values displayed in nanoseconds (ns)

    =========================================================================
    Timing constraint: Default period analysis for Clock 'clk'
      Clock period: 12.542ns (frequency: 79.730MHz)
      Total number of paths / destination ports: 113114 / 2670
    -------------------------------------------------------------------------
    Delay:               12.542ns (Levels of Logic = 10)
      Source:            SEQ/decode_3 (FF)
      Destination:       SEQ/i_ram_addr_7 (FF)
      Source Clock:      clk rising
      Destination Clock: clk rising

      Data Path: SEQ/decode_3 to SEQ/i_ram_addr_7
                                    Gate     Net
        Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
        ----------------------------------------  ------------
         FDC:C->Q            102   0.591   1.364  SEQ/decode_3 (SEQ/decode_3)
         LUT4_D:I1->O         10   0.643   0.885  SEQ/de_state_cmp_eq002111 (N314)
         LUT4:I3->O            7   0.648   0.740  SEQ/de_state_cmp_eq00711 (SEQ/de_state_cmp_eq0071)
         LUT4:I2->O            3   0.648   0.534  SEQ/i_ram_addr_mux0000<0>11111 (N2301)
         LUT4:I3->O            1   0.648   0.000  SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0_F (N1284)
         MUXF5:I0->O           1   0.276   0.423  SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0 (N955)
         LUT4_D:I3->O          6   0.648   0.701  SEQ/i_ram_addr_mux0000<0>11270 (SEQ/i_ram_addr_mux0000<0>11270)
         LUT3_L:I2->LO         1   0.648   0.103  SEQ/i_ram_addr_mux0000<7>221_SW2_SW0 (N1208)
         LUT4:I3->O            1   0.648   0.423  SEQ/i_ram_addr_mux0000<7>351_SW1 (N1085)
         LUT4:I3->O            1   0.648   0.423  SEQ/i_ram_addr_mux0000<7>2 (SEQ/i_ram_addr_mux0000<7>2)
         LUT4:I3->O            1   0.648   0.000  SEQ/i_ram_addr_mux0000<7>167 (SEQ/i_ram_addr_mux0000<7>)
         FDE:D                     0.252          SEQ/i_ram_addr_7
        ----------------------------------------
        Total                     12.542ns (6.946ns logic, 5.596ns route)
                                           (55.4% logic, 44.6% route)

    =========================================================================
    Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'
      Total number of paths / destination ports: 154 / 154
    -------------------------------------------------------------------------
    Offset:              8.946ns (Levels of Logic = 6)
      Source:            rst (PAD)
      Destination:       SEQ/i_ram_diByte_1 (FF)
      Destination Clock: clk rising

      Data Path: rst to SEQ/i_ram_diByte_1
                                    Gate     Net
        Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
        ----------------------------------------  ------------
         IBUF:I->O           444   0.849   1.392  rst_IBUF (REG/ext_int/fd_out1_0__or0000)
         BUF:I->O            445   0.648   1.425  rst_IBUF_1 (rst_IBUF_1)
         LUT3:I2->O            4   0.648   0.730  ROM/data<1>1 (i_rom_data<1>)
         LUT4:I0->O            1   0.648   0.500  SEQ/i_ram_diByte_mux0000<1>17_SW0 (N1262)
         LUT4:I1->O            1   0.643   0.563  SEQ/i_ram_diByte_mux0000<1>32 (SEQ/i_ram_diByte_mux0000<1>32)
         LUT4:I0->O            1   0.648   0.000  SEQ/i_ram_diByte_mux0000<1>60 (SEQ/i_ram_diByte_mux0000<1>)
         FDE:D                     0.252          SEQ/i_ram_diByte_1
        ----------------------------------------
        Total                      8.946ns (4.336ns logic, 4.610ns route)
                                           (48.5% logic, 51.5% route)

    =========================================================================

To be more precise, here is a snippet of sample code from the decoding phase for one opcode.

The following is one such case in the opcode decoder, for a MOV instruction. There are 100+ opcodes (100+ instructions), which means this case statement contains more than 100 branches.

    case OPCODE is

        -- MOV A, Rn
        when "11101000" | "11101001" | "11101010" | "11101011" |
             "11101100" | "11101101" | "11101110" | "11101111" =>
            case de_state is
                when E7 =>
                    de_state <= E8;
                when E8 =>
                    de_state <= E9;
                when E9 =>
                    de_state <= E10;
                when E10 =>
                    -- Fetch PSW
                    i_ram_addr   <= xD0;
                    i_ram_rdByte <= '1';
                    de_state     <= E11;
                when E11 =>
                    -- Fetch from Rn
                    i_ram_addr   <= "000" & i_ram_doByte(4 downto 3) & opcode(2 downto 0);
                    i_ram_rdByte <= '1';
                    de_state     <= E12;
                when E12 =>
                    -- Place into EDR
                    EDR <= i_ram_doByte;
                    -- Close rdByte
                    i_ram_rdByte <= '0';
                when others =>
                    null;
            end case;

I hope this helps you understand my VHDL code better. I would appreciate any help. Thanks!

2 answers

Since you are using Xilinx, I assume you also have access to PlanAhead? Try "Analyze Timing / Floorplan Design (PlanAhead)" (under "Implement Design" → "Place & Route").

PlanAhead should open and show your timing results at the bottom. Select the critical path (the one with the least slack), right-click it and choose "Schematic", which displays a graphical view of the primitives involved. You can then right-click the primitives and choose "Expand Cone" → "To Flops" to see the surrounding components.

This will help you better understand which signals are involved. Try tracing the input and output signals back to your VHDL code, and focus your optimization on this path.


No good answer can be given from this information alone; we can only guess what source code produced this hardware.

But it is clear that you need to study the source, hypothesize why it is slow, take steps to fix the problem, and test the solution.

And repeat until fast enough.

My guess, given your hint that there is a case statement for decoding opcodes, is that...

one of the arms is something like:

 when <some expression involving decode> => address <= <some address calculation>; 

The problem is that the two expressions are often chained together, so both have to be evaluated in the same cycle. One possible solution is to precompute the address expression (i.e., in the previous cycle) into a register and rewrite the case arm as:

 when <some expression involving decode> => address <= register; 

If I guessed right, the result will be a little faster, and you will have another (similar) bottleneck to fix. Repeat until fast enough ...

But without the source and the timing analysis, do not expect a more specific answer.
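A minimal sketch of that precompute idea, under stated assumptions: the entity name addr_precompute and the ports bank_sel and use_reg are hypothetical and do not come from the poster's design, and the address expression is only modelled on the one shown in the question. The sketch costs one extra cycle of latency, so the inputs must be valid one cycle before the address is needed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical illustration; none of these names come from the original design.
entity addr_precompute is
    port (
        clk      : in  std_logic;
        opcode   : in  std_logic_vector(7 downto 0);
        bank_sel : in  std_logic_vector(1 downto 0);  -- assumed: register bank bits, already available
        use_reg  : in  std_logic;                     -- assumed: result of the decode condition
        address  : out std_logic_vector(7 downto 0)
    );
end entity addr_precompute;

architecture rtl of addr_precompute is
    -- Register holding the address computed one cycle ahead of its use
    signal rn_addr : std_logic_vector(7 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- Cycle N: compute the candidate address unconditionally
            rn_addr <= "000" & bank_sel & opcode(2 downto 0);

            -- Cycle N+1: the decode arm only selects the registered value,
            -- so the address arithmetic is no longer in series with the decode
            if use_reg = '1' then
                address <= rn_addr;
            end if;
        end if;
    end process;
end architecture rtl;
```

Whether this helps depends on where the delay really is; the timing report should be re-checked after each such change.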

EDIT: Now that part of the source code has been posted, the picture is a little clearer: you have two nested case statements, each of which is quite large. You clearly need some simplification ...

I note that only 2 arms of the inner case assign i_ram_addr, but the timing analysis shows a huge and complex multiplexer feeding i_ram_addr; clearly there are many other arms that contribute to i_ram_addr ...

I would suggest handling i_ram_addr separately from the main case statement and writing the simplest machine you can, just for i_ram_addr. For example, note that this arm of the OPCODE case is equivalent to:

 if OPCODE(7 downto 3) = "11101" then ... 

and ask how simple you can make a decoder for i_ram_addr alone. You may find that many of the other arms do very similar things to i_ram_addr (the original 8051 designers would have jumped at such tricks to simplify the logic!). Synthesis tools can be quite smart at simplifying logic, but when things get too complicated, they can miss opportunities.

(At that point, I would comment out the assignments to i_ram_addr in the main case statement and leave the rest of the decoder alone.)
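As a rough sketch of what such a separate decoder could look like for the single MOV A, Rn arm shown in the question: the entity name ram_addr_decoder and the strobes fetch_psw and fetch_rn (standing in for states E10 and E11) are assumptions, as is the value x"D0" for the poster's xD0 constant (0xD0 is the PSW address on the 8051). A real design would fold in the other opcodes that touch i_ram_addr.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-alone decoder that drives only i_ram_addr.
entity ram_addr_decoder is
    port (
        clk          : in  std_logic;
        opcode       : in  std_logic_vector(7 downto 0);
        fetch_psw    : in  std_logic;  -- assumed strobe: '1' while de_state = E10
        fetch_rn     : in  std_logic;  -- assumed strobe: '1' while de_state = E11
        i_ram_doByte : in  std_logic_vector(7 downto 0);
        i_ram_addr   : out std_logic_vector(7 downto 0)
    );
end entity ram_addr_decoder;

architecture rtl of ram_addr_decoder is
    -- Assumed to match the xD0 constant in the question (PSW address)
    constant PSW_ADDR : std_logic_vector(7 downto 0) := x"D0";
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- One 5-bit compare replaces the eight-way "when ... | ..." choice
            if opcode(7 downto 3) = "11101" then  -- MOV A, Rn
                if fetch_psw = '1' then
                    i_ram_addr <= PSW_ADDR;
                elsif fetch_rn = '1' then
                    -- Bank bits come from the PSW byte read in the previous state
                    i_ram_addr <= "000" & i_ram_doByte(4 downto 3) & opcode(2 downto 0);
                end if;
            end if;
        end if;
    end process;
end architecture rtl;
```

Keeping this process separate makes it easier to see how many distinct address sources there really are, and the main case statement no longer contributes logic to the i_ram_addr path.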


Source: https://habr.com/ru/post/1445695/

