Measurements were made over loops of 32 instructions. The number of iterations is on the order of 100 million (more or less, depending on the instructions). The total execution time is divided by the total number of instructions executed in the main loops, giving a mean instruction execution time in nanoseconds.
The difficulty is the overhead of the loop instructions (three instructions: decrement counter, test counter, branch back). Two types of tests are made, one removing the loop overhead and one ignoring it.
- To remove the loop overhead, an empty loop is executed with the same number of iterations. The time of the empty loop is subtracted from the main loop time.
- However, it may be safe to ignore the overhead because the three loop instructions are probably executed in parallel with the body of the loop and the branch is probably correctly predicted. Therefore, removing the time of the empty loop may be misleading.
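
The computation described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, its parameters, and the timing values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mean_instruction_time_ns(loop_time_ns, iterations,
                             instructions_per_iteration=32,
                             empty_loop_time_ns=0.0):
    """Mean time per instruction of the loop body, in nanoseconds.

    empty_loop_time_ns is the time of an empty loop with the same number
    of iterations. Leave it at 0.0 to ignore the loop overhead instead of
    removing it.
    """
    if iterations <= 0 or instructions_per_iteration <= 0:
        raise ValueError("iteration and instruction counts must be positive")
    executed = iterations * instructions_per_iteration
    return (loop_time_ns - empty_loop_time_ns) / executed


# Illustrative values only, not measurements from this project.
iterations = 100_000_000
main_loop_ns = 320_000_000.0   # main loop: 0.32 s
empty_loop_ns = 32_000_000.0   # empty loop: 0.032 s

ignoring = mean_instruction_time_ns(main_loop_ns, iterations)
removing = mean_instruction_time_ns(main_loop_ns, iterations,
                                    empty_loop_time_ns=empty_loop_ns)
print(f"ignoring loop overhead: {ignoring:.3f} ns per instruction")
print(f"removing loop overhead: {removing:.3f} ns per instruction")
```

With these made-up inputs, the two methods differ by 0.010 ns per instruction, which gives an idea of how much the choice between the two can matter for fast instructions.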
The time measurement is based on the virtual counter register (CNTVCT_EL0) and its corresponding frequency register (CNTFRQ_EL0).
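
As a rough sketch of how two counter readings translate into a duration: the difference between the counter values is divided by the counter frequency. The function name and the sample values below (including the 24 MHz frequency) are hypothetical, and the sketch assumes the counter did not wrap between the two readings.

```python
def ticks_to_ns(start_ticks, end_ticks, frequency_hz):
    """Convert two counter readings into an elapsed time in nanoseconds.

    start_ticks and end_ticks stand for values read from CNTVCT_EL0,
    frequency_hz for the value read from CNTFRQ_EL0 (ticks per second).
    Assumes end_ticks >= start_ticks (no counter wrap in between).
    """
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    if end_ticks < start_ticks:
        raise ValueError("end reading is before start reading")
    return (end_ticks - start_ticks) * 1_000_000_000 / frequency_hz


# Hypothetical readings with an assumed 24 MHz counter frequency.
frequency_hz = 24_000_000
start = 1_000
end = start + 7_680_000

elapsed_ns = ticks_to_ns(start, end, frequency_hz)
tick_ns = ticks_to_ns(0, 1, frequency_hz)
print(f"elapsed: {elapsed_ns:.0f} ns, one tick: {tick_ns:.1f} ns")
```

At such a frequency, a single tick lasts tens of nanoseconds, far longer than one instruction. This is one reason to time a very large number of iterations and divide, rather than timing individual instructions.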
As a general rule, inside a loop, all input registers are identical and at least 16 distinct output registers are used.
System | CPU chip | CPU core |
---|---|---|
Raspberry Pi 4 Model B | Broadcom BCM2711 | Arm Cortex A72 |
Ampere Mt.Jade Server | Ampere Altra | Arm Neoverse N1 |
AWS EC2 instance c7g.xlarge | AWS Graviton 3 | Arm Neoverse V1 |
Supermicro ARS-221GL-NR | Nvidia Grace Superchip | Arm Neoverse V2 |
Apple MacBook M1 | Apple M1 | Apple Firestorm/Icestorm |
Apple iMac M3 | Apple M3 | Apple (unknown core) |
The Apple M1 CPU was tested using macOS. All other CPUs were tested on Linux.
Depending on the CPU, consecutive executions of the test program produce slightly different results. However, the difference is confined to the third decimal place, meaning a few picoseconds, and can be considered negligible.
The PAC instructions can be evaluated on Armv8.3-A onwards only. On older CPU cores, the PAC tests are automatically skipped.
On macOS, all Apple Silicon chips support PAC. However, PAC instructions can be disabled at system or application level (architecture `arm64` vs. `arm64e`, more details here). When PAC instructions are disabled (architecture `arm64`), they execute at the speed of a NOP. Therefore, the PAC tests are skipped in this configuration to avoid reporting meaningless instruction times.
The tables below give the mean instruction time, in nanoseconds, in each loop. The Excel file in this project adds the frequency information of each core and the corresponding relative performance information.
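
To relate a time in nanoseconds to the core clock, the mean instruction time can be multiplied by the clock frequency in GHz, giving a number of cycles per instruction. The sketch below is only illustrative: the function name is hypothetical and the 2.6 GHz frequency is an assumption made for the example, not a value taken from this section. The actual frequencies are in the Excel file.

```python
def ns_to_cycles(time_ns, frequency_ghz):
    """Convert a mean instruction time (ns) into clock cycles per instruction."""
    if frequency_ghz <= 0:
        raise ValueError("frequency must be positive")
    return time_ns * frequency_ghz


# ADCS on Neoverse V1, first table: 0.388 ns.
# ASSUMPTION: a 2.6 GHz clock, used here for illustration only.
adcs_ns = 0.388
assumed_ghz = 2.6

cycles = ns_to_cycles(adcs_ns, assumed_ghz)
print(f"{cycles:.2f} cycles per instruction, "
      f"{1.0 / cycles:.2f} instructions per cycle")
```

Expressing results in cycles makes cores with different clock frequencies comparable: a value near 1 means roughly one instruction retired per cycle, a value well below 1 means several instructions executed in parallel.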
Mean instruction time (nanoseconds) | Cortex A72 | Neoverse N1 | Neoverse V1 | Neoverse V2 | Apple M1 | Apple M3 |
---|---|---|---|---|---|---|
NOP | 0.560 | 0.092 | 0.036 | 0.029 | 0.045 | 0.033 |
ADD | 0.297 | 0.115 | 0.099 | 0.080 | 0.070 | 0.036 |
ADC | 0.297 | 0.115 | 0.099 | 0.083 | 0.108 | 0.065 |
ADDS | 0.315 | 0.120 | 0.131 | 0.124 | 0.107 | 0.066 |
ADCS | 0.559 | 0.334 | 0.388 | 0.309 | 0.313 | 0.249 |
MUL | 1.677 | 1.002 | 0.193 | 0.152 | 0.156 | 0.126 |
UMULH | 2.236 | 1.336 | 0.193 | 0.152 | 0.156 | 0.126 |
DIV | 3.353 | 2.005 | 2.311 | 1.824 | 0.625 | 0.498 |
MUL UMULH | 1.957 | 1.169 | 0.193 | 0.152 | 0.156 | 0.125 |
MUL ADCS UMULH ADCS | 0.978 | 0.584 | 0.294 | 0.238 | 0.157 | 0.125 |
MUL ADD | 0.839 | 0.501 | 0.105 | 0.095 | 0.104 | 0.063 |
MUL ADC | 0.838 | 0.501 | 0.105 | 0.102 | 0.105 | 0.064 |
MUL ADDS | 0.838 | 0.501 | 0.103 | 0.099 | 0.104 | 0.063 |
MUL ADCS | 0.839 | 0.501 | 0.196 | 0.188 | 0.156 | 0.124 |
MUL ADCS (alt) | 0.838 | 0.501 | 0.195 | 0.156 | 0.156 | 0.125 |
UMULH ADD | 1.118 | 0.668 | 0.131 | 0.106 | 0.105 | 0.062 |
UMULH ADC | 1.118 | 0.668 | 0.131 | 0.113 | 0.105 | 0.062 |
UMULH ADDS | 1.118 | 0.668 | 0.146 | 0.119 | 0.104 | 0.063 |
UMULH ADCS | 1.118 | 0.668 | 0.340 | 0.270 | 0.156 | 0.124 |
UMULH ADCS (alt) | 1.119 | 0.668 | 0.339 | 0.268 | 0.156 | 0.125 |
UMULH NOP ADCS | 0.746 | 0.445 | 0.224 | 0.179 | 0.105 | 0.083 |
UMULH ADD ADCS | 0.745 | 0.445 | 0.210 | 0.179 | 0.105 | 0.083 |
UMULH ADDS ADCS | 0.745 | 0.445 | 0.144 | 0.129 | 0.117 | 0.046 |
UMULH ADCS ADCS | 0.745 | 0.445 | 0.361 | 0.289 | 0.210 | 0.166 |
UMULH ADC ADDS | 0.745 | 0.445 | 0.113 | 0.101 | 0.098 | 0.045 |
UMULH ADC ADDS (dep. regs) | 0.745 | 0.445 | 0.337 | 0.256 | 0.210 | 0.167 |
MUL ADD UMULH ADD | 0.978 | 0.584 | 0.123 | 0.106 | 0.104 | 0.063 |
PACIA | 0.385 | 0.313 | 0.248 | |||
PACIA AUTIA | 0.385 | 0.313 | 0.250 | |||
PACIA ... AUTIA ... | 0.385 | 0.314 | 0.251 |
Mean instruction time (nanoseconds) | Cortex A72 | Neoverse N1 | Neoverse V1 | Neoverse V2 | Apple M1 | Apple M3 |
---|---|---|---|---|---|---|
NOP | 0.524 | 0.073 | 0.024 | 0.019 | 0.033 | 0.022 |
ADD | 0.261 | 0.094 | 0.087 | 0.069 | 0.060 | 0.027 |
ADC | 0.261 | 0.094 | 0.087 | 0.069 | 0.098 | 0.057 |
ADDS | 0.279 | 0.099 | 0.119 | 0.118 | 0.098 | 0.058 |
ADCS | 0.525 | 0.313 | 0.376 | 0.300 | 0.303 | 0.242 |
MUL | 1.642 | 0.981 | 0.181 | 0.143 | 0.147 | 0.116 |
UMULH | 2.201 | 1.315 | 0.181 | 0.143 | 0.147 | 0.117 |
DIV | 3.319 | 1.983 | 2.299 | 1.815 | 0.617 | 0.491 |
MUL UMULH | 1.922 | 1.148 | 0.181 | 0.143 | 0.147 | 0.117 |
MUL ADCS UMULH ADCS | 0.943 | 0.563 | 0.282 | 0.228 | 0.147 | 0.117 |
MUL ADD | 0.804 | 0.480 | 0.093 | 0.081 | 0.095 | 0.055 |
MUL ADC | 0.803 | 0.480 | 0.093 | 0.095 | 0.095 | 0.056 |
MUL ADDS | 0.804 | 0.480 | 0.091 | 0.098 | 0.095 | 0.055 |
MUL ADCS | 0.803 | 0.480 | 0.183 | 0.176 | 0.147 | 0.118 |
MUL ADCS (alt) | 0.804 | 0.480 | 0.183 | 0.146 | 0.147 | 0.117 |
UMULH ADD | 1.084 | 0.647 | 0.119 | 0.087 | 0.095 | 0.055 |
UMULH ADC | 1.083 | 0.647 | 0.119 | 0.091 | 0.095 | 0.055 |
UMULH ADDS | 1.083 | 0.647 | 0.133 | 0.124 | 0.095 | 0.055 |
UMULH ADCS | 1.083 | 0.647 | 0.328 | 0.260 | 0.147 | 0.117 |
UMULH ADCS (alt) | 1.084 | 0.647 | 0.327 | 0.258 | 0.147 | 0.117 |
UMULH NOP ADCS | 0.712 | 0.425 | 0.213 | 0.170 | 0.095 | 0.076 |
UMULH ADD ADCS | 0.712 | 0.425 | 0.199 | 0.169 | 0.095 | 0.076 |
UMULH ADDS ADCS | 0.711 | 0.425 | 0.132 | 0.116 | 0.107 | 0.039 |
UMULH ADCS ADCS | 0.711 | 0.425 | 0.350 | 0.279 | 0.200 | 0.158 |
UMULH ADC ADDS | 0.712 | 0.425 | 0.101 | 0.083 | 0.089 | 0.039 |
UMULH ADC ADDS (dep. regs) | 0.711 | 0.425 | 0.326 | 0.247 | 0.200 | 0.158 |
MUL ADD UMULH ADD | 0.944 | 0.563 | 0.111 | 0.095 | 0.095 | 0.056 |
PACIA | 0.373 | 0.303 | 0.242 | |||
PACIA AUTIA | 0.373 | 0.303 | 0.241 | |||
PACIA ... AUTIA ... | 0.373 | 0.303 | 0.241 |