This is a theoretical question, I really don't run fab or anything; -)
For small N, the N-by-N factor can be implemented as a tree from 3 to 2 adders of the depth (N) log and with N ^ 2-gates - let it ignore Booth coding, etc. It is super-fast, but requires an unreasonable amount of hardware.
This goal counter will soon become unreasonable (as well as postings). But software multiplication of kN-by-kN through k ^ 2 2N-bits of partial products and their combination together will be very slow.
My question is: what compromises do we have for very fast hardware multiplication of moderate N, after the N ^ 2-gates become too much (for gates, as well as for wiring), but we still want to be better than pure software.
I can imagine that this has a lot to do with custom cryptographic chips, but I'm just curious.
source
share