CISC jokes aside, this is an interesting turn of events.
Classic ARM had LDM/STM, which could load/store a whole list of registers. While very handy, it was a nightmare from a hardware POV. For example, it made error handling and rollback much more complex in out-of-order implementations.
ARMv8 removed those in AArch64 and introduced LDP/STP, which only handle two registers at a time (the P is for pair, the M for multiple). This made things much easier, but it seems the performance hit was not negligible.
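To make the LDP/STP point concrete, here is a rough C sketch of the shape of a pair-at-a-time copy loop. `copy_pairs` is a made-up name, and the two 8-byte `memcpy` calls merely stand in for one LDP or STP of two 64-bit registers; real AArch64 memcpy routines are considerably more elaborate, so treat this as an illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies n bytes, 16 at a time, the way a simple
 * LDP/STP loop would move two 64-bit registers per iteration. */
static void copy_pairs(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (d != NULL && s != NULL));

    while (n >= 16) {
        uint64_t a, b;
        memcpy(&a, s, 8);       /* "LDP": load the first register of the pair */
        memcpy(&b, s + 8, 8);   /*        and the second */
        memcpy(d, &a, 8);       /* "STP": store the first register */
        memcpy(d + 8, &b, 8);   /*        and the second */
        s += 16;
        d += 16;
        n -= 16;
    }
    while (n--)                 /* byte-wise tail for the last 0..15 bytes */
        *d++ = *s++;
}
```

With LDM/STM the loop body could move many more registers per instruction, which is the convenience that was given up.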
Now with v8.8 and v9.3 we get this, which looks much nicer than Intel's ancient string instructions that have been around since the 8086. But I am curious how it affects other aspects of the CPU, especially those with very long and wide pipelines.
Note that in ARM-based controllers, LDM/STM also have a non-negligible impact on interrupt latency. They are defined in such a way that they cannot be interrupted mid-instruction, so worst-case interrupt latency is higher than would be expected with a RISC CPU (especially if the LDM/STM happens to run on a somewhat slower memory region).
AFAICS x86 "rep"-prefixed instructions are defined so that they can in fact be interrupted without problems. The remaining count is kept in (e)cx, so just doing an iret back into "rep stosb" etc. will continue the operation where it left off.
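A small C model of why this works, as I understand it: all progress lives in architectural registers, so stopping between iterations leaves nothing to roll back. The struct and function names are invented for illustration, the `budget` parameter just stands in for "an interrupt arrives after this many iterations", and the direction flag is assumed clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the register state "rep stosb" works on. */
struct rep_stosb_state {
    uint8_t  al;  /* byte value to store */
    uint8_t *di;  /* destination pointer, advanced after each store */
    size_t   cx;  /* remaining iteration count */
};

/* Runs until cx reaches zero or until `budget` iterations have been done.
 * Returns 1 when the operation is complete, 0 when it was "interrupted".
 * On interruption the state is fully consistent, so calling the function
 * again (the "iret" back into the instruction) simply resumes. */
static int rep_stosb_run(struct rep_stosb_state *s, size_t budget)
{
    assert(s != NULL);
    while (s->cx != 0) {
        if (budget == 0)
            return 0;       /* interrupt taken between two iterations */
        *s->di++ = s->al;   /* one stosb */
        s->cx--;
        budget--;
    }
    return 1;
}
```

Calling `rep_stosb_run` once with a small budget and then again with a large one fills the same bytes as a single uninterrupted run.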
I think VIA's hash/AES instruction set extension also made use of the "rep" prefix and kept all encryption/hash state in the x86 register set, so it could in fact hash large memory regions with a single opcode without hampering interrupts.
> AFAICS x86 "rep" prefixed instructions are defined so that they can in fact be interrupted without problems. The remaining count is kept in (e)cx, so just doing an iret into "rep stosb" etc. will continue its operation.
The 8086/8088 have a bug (one of very few!) where segment override prefixes are lost after an interrupted string instruction: