lelegard/arm-instruction-time

Instruction execution time on Arm64

Principles

Measurements were made over loops of 32 instructions. Each loop runs on the order of 100 million iterations (more or less, depending on the instructions). The total execution time is divided by the total number of instructions executed in the main loops, giving a mean instruction execution time in nanoseconds.

The problem is the overhead of the loop instructions (three instructions: decrement counter, test counter, branch back). Two sets of results are reported: one removes the loop overhead, the other ignores it.

  • To remove the loop overhead, an empty loop is executed with the same number of iterations. The time of the empty loop is subtracted from the main loop time.
  • However, it may be safe to ignore the overhead because the three loop instructions are probably executed in parallel with the body of the loop and the branch is probably correctly predicted. Therefore, removing the time of the empty loop may be misleading.
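
The two computations above can be sketched as follows. This is a minimal illustration, not the project's code: the function name and the timings are hypothetical, and it assumes that only the 32 body instructions of each iteration are counted.

```python
INSTRUCTIONS_PER_LOOP = 32  # loop body size stated in the text


def mean_instruction_time_ns(main_loop_ns, iterations, empty_loop_ns=0.0):
    """Mean time per body instruction, in nanoseconds.

    Pass empty_loop_ns to remove the loop overhead, or leave it at 0.0
    to ignore the overhead. Assumption: only body instructions are counted.
    """
    executed = iterations * INSTRUCTIONS_PER_LOOP
    return (main_loop_ns - empty_loop_ns) / executed


# Hypothetical timings, not measurements from this document.
iterations = 100_000_000
main_ns = 320_000_000.0   # main loop: 0.32 s
empty_ns = 32_000_000.0   # empty loop: 0.032 s

ignoring = mean_instruction_time_ns(main_ns, iterations)
removing = mean_instruction_time_ns(main_ns, iterations, empty_ns)
print(f"ignoring the empty loop time: {ignoring:.3f} ns")
print(f"after subtracting the empty loop time: {removing:.3f} ns")
```

With these made-up numbers, the two methods differ by 0.010 ns per instruction, which is the same order of difference seen between the two result tables.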

The time measurement is based on the virtual counter register (CNTVCT_EL0) and its corresponding frequency register (CNTFRQ_EL0).
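
Converting two counter readings into nanoseconds is simple arithmetic, sketched below. The function name and the sample values are hypothetical; in a real test the two readings would come from CNTVCT_EL0 and the frequency from CNTFRQ_EL0.

```python
def ticks_to_ns(start_ticks, end_ticks, frequency_hz):
    """Elapsed time in nanoseconds between two counter readings."""
    return (end_ticks - start_ticks) * 1_000_000_000 / frequency_hz


# Hypothetical readings with an assumed 24 MHz counter frequency.
frequency_hz = 24_000_000
elapsed_ns = ticks_to_ns(1_000, 25_000, frequency_hz)
tick_ns = ticks_to_ns(0, 1, frequency_hz)

print(f"elapsed: {elapsed_ns:.0f} ns")
print(f"one tick: {tick_ns:.2f} ns")
```

At the assumed frequency, a single tick lasts about 41.67 ns, far longer than one instruction. This is why the measurement needs a very large number of iterations.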

As a general rule, inside a loop, all input registers are identical and at least 16 distinct output registers are used.
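
To illustrate this rule, the sketch below generates the text of a 32-instruction loop body in which every instruction reads the same two input registers and the destinations rotate over 16 registers. The register numbers (x0 and x1 as inputs, x2 to x17 as outputs) and the function name are assumptions chosen for the example, not the layout used by the actual test program.

```python
def loop_body(mnemonic, count=32, outputs=16):
    """Return `count` assembly lines for one loop body.

    Assumed layout: inputs are always x0 and x1, destinations cycle
    through x2..x17 so that consecutive instructions do not depend
    on each other's results.
    """
    lines = []
    for i in range(count):
        dest = 2 + (i % outputs)
        lines.append(f"{mnemonic} x{dest}, x0, x1")
    return lines


body = loop_body("add")
print("\n".join(body[:4]))
```

Rotating the destination registers avoids creating a dependency chain between consecutive instructions, so the loop measures throughput rather than latency.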

Tested systems

| System | CPU chip | CPU core |
| --- | --- | --- |
| Raspberry Pi 4 Model B | Broadcom BCM2711 | Arm Cortex A72 |
| Ampere Mt.Jade Server | Ampere Altra | Arm Neoverse N1 |
| AWS EC2 instance c7g.xlarge | AWS Graviton 3 | Arm Neoverse V1 |
| Supermicro ARS-221GL-NR | Nvidia Grace Superchip | Arm Neoverse V2 |
| Apple MacBook M1 | Apple M1 | Apple Firestorm/Icestorm |
| Apple iMac M3 | Apple M3 | Apple (unknown core) |

The Apple M1 CPU was tested using macOS. All other CPUs were tested on Linux.

Results

Depending on the CPU, consecutive executions of the test program produce slightly different results. However, the difference is limited to the third decimal place, that is, a few picoseconds, and can be considered negligible.

The PAC instructions can be evaluated on Armv8.3-A onwards only. On older CPU cores, the PAC tests are automatically skipped.

On macOS, all Apple Silicon chips support PAC. However, PAC instructions can be disabled at system or application level (architecture arm64 vs. arm64e, more details here). When PAC instructions are disabled (architecture arm64), they execute at the speed of a NOP. Therefore, the PAC tests are skipped in this configuration to avoid reporting meaningless instruction times.

The tables below give the mean instruction time, in nanoseconds, in each loop. The Excel file in this project adds the frequency information of each core and the corresponding relative performance information.
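
One simple way to relate a mean instruction time to a core's clock is to convert it into clock cycles, as sketched below. This is not necessarily how the Excel file computes relative performance. The 3.2 GHz frequency is an assumption made for the example and does not come from this document; the 0.313 ns value is the Apple M1 ADCS time from the first table.

```python
def ns_to_cycles(time_ns, frequency_ghz):
    """Convert a time in nanoseconds to clock cycles.

    One cycle lasts 1 / frequency_ghz nanoseconds.
    """
    return time_ns * frequency_ghz


adcs_m1_ns = 0.313          # from the table "Ignoring the empty loop time"
assumed_clock_ghz = 3.2     # assumption, not stated in this document

cycles = ns_to_cycles(adcs_m1_ns, assumed_clock_ghz)
print(f"{cycles:.2f} cycles per instruction")
```

Expressed in cycles, results from cores running at different frequencies become easier to compare.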

Ignoring the empty loop time

| Mean instruction time (nanoseconds) | Cortex A72 | Neoverse N1 | Neoverse V1 | Neoverse V2 | Apple M1 | Apple M3 |
| --- | --- | --- | --- | --- | --- | --- |
| NOP | 0.560 | 0.092 | 0.036 | 0.029 | 0.045 | 0.033 |
| ADD | 0.297 | 0.115 | 0.099 | 0.080 | 0.070 | 0.036 |
| ADC | 0.297 | 0.115 | 0.099 | 0.083 | 0.108 | 0.065 |
| ADDS | 0.315 | 0.120 | 0.131 | 0.124 | 0.107 | 0.066 |
| ADCS | 0.559 | 0.334 | 0.388 | 0.309 | 0.313 | 0.249 |
| MUL | 1.677 | 1.002 | 0.193 | 0.152 | 0.156 | 0.126 |
| UMULH | 2.236 | 1.336 | 0.193 | 0.152 | 0.156 | 0.126 |
| DIV | 3.353 | 2.005 | 2.311 | 1.824 | 0.625 | 0.498 |
| MUL UMULH | 1.957 | 1.169 | 0.193 | 0.152 | 0.156 | 0.125 |
| MUL ADCS UMULH ADCS | 0.978 | 0.584 | 0.294 | 0.238 | 0.157 | 0.125 |
| MUL ADD | 0.839 | 0.501 | 0.105 | 0.095 | 0.104 | 0.063 |
| MUL ADC | 0.838 | 0.501 | 0.105 | 0.102 | 0.105 | 0.064 |
| MUL ADDS | 0.838 | 0.501 | 0.103 | 0.099 | 0.104 | 0.063 |
| MUL ADCS | 0.839 | 0.501 | 0.196 | 0.188 | 0.156 | 0.124 |
| MUL ADCS (alt) | 0.838 | 0.501 | 0.195 | 0.156 | 0.156 | 0.125 |
| UMULH ADD | 1.118 | 0.668 | 0.131 | 0.106 | 0.105 | 0.062 |
| UMULH ADC | 1.118 | 0.668 | 0.131 | 0.113 | 0.105 | 0.062 |
| UMULH ADDS | 1.118 | 0.668 | 0.146 | 0.119 | 0.104 | 0.063 |
| UMULH ADCS | 1.118 | 0.668 | 0.340 | 0.270 | 0.156 | 0.124 |
| UMULH ADCS (alt) | 1.119 | 0.668 | 0.339 | 0.268 | 0.156 | 0.125 |
| UMULH NOP ADCS | 0.746 | 0.445 | 0.224 | 0.179 | 0.105 | 0.083 |
| UMULH ADD ADCS | 0.745 | 0.445 | 0.210 | 0.179 | 0.105 | 0.083 |
| UMULH ADDS ADCS | 0.745 | 0.445 | 0.144 | 0.129 | 0.117 | 0.046 |
| UMULH ADCS ADCS | 0.745 | 0.445 | 0.361 | 0.289 | 0.210 | 0.166 |
| UMULH ADC ADDS | 0.745 | 0.445 | 0.113 | 0.101 | 0.098 | 0.045 |
| UMULH ADC ADDS (dep. regs) | 0.745 | 0.445 | 0.337 | 0.256 | 0.210 | 0.167 |
| MUL ADD UMULH ADD | 0.978 | 0.584 | 0.123 | 0.106 | 0.104 | 0.063 |
PACIA 0.385 0.313 0.248
PACIA AUTIA 0.385 0.313 0.250
PACIA ... AUTIA ... 0.385 0.314 0.251

After subtracting the empty loop time

| Mean instruction time (nanoseconds) | Cortex A72 | Neoverse N1 | Neoverse V1 | Neoverse V2 | Apple M1 | Apple M3 |
| --- | --- | --- | --- | --- | --- | --- |
| NOP | 0.524 | 0.073 | 0.024 | 0.019 | 0.033 | 0.022 |
| ADD | 0.261 | 0.094 | 0.087 | 0.069 | 0.060 | 0.027 |
| ADC | 0.261 | 0.094 | 0.087 | 0.069 | 0.098 | 0.057 |
| ADDS | 0.279 | 0.099 | 0.119 | 0.118 | 0.098 | 0.058 |
| ADCS | 0.525 | 0.313 | 0.376 | 0.300 | 0.303 | 0.242 |
| MUL | 1.642 | 0.981 | 0.181 | 0.143 | 0.147 | 0.116 |
| UMULH | 2.201 | 1.315 | 0.181 | 0.143 | 0.147 | 0.117 |
| DIV | 3.319 | 1.983 | 2.299 | 1.815 | 0.617 | 0.491 |
| MUL UMULH | 1.922 | 1.148 | 0.181 | 0.143 | 0.147 | 0.117 |
| MUL ADCS UMULH ADCS | 0.943 | 0.563 | 0.282 | 0.228 | 0.147 | 0.117 |
| MUL ADD | 0.804 | 0.480 | 0.093 | 0.081 | 0.095 | 0.055 |
| MUL ADC | 0.803 | 0.480 | 0.093 | 0.095 | 0.095 | 0.056 |
| MUL ADDS | 0.804 | 0.480 | 0.091 | 0.098 | 0.095 | 0.055 |
| MUL ADCS | 0.803 | 0.480 | 0.183 | 0.176 | 0.147 | 0.118 |
| MUL ADCS (alt) | 0.804 | 0.480 | 0.183 | 0.146 | 0.147 | 0.117 |
| UMULH ADD | 1.084 | 0.647 | 0.119 | 0.087 | 0.095 | 0.055 |
| UMULH ADC | 1.083 | 0.647 | 0.119 | 0.091 | 0.095 | 0.055 |
| UMULH ADDS | 1.083 | 0.647 | 0.133 | 0.124 | 0.095 | 0.055 |
| UMULH ADCS | 1.083 | 0.647 | 0.328 | 0.260 | 0.147 | 0.117 |
| UMULH ADCS (alt) | 1.084 | 0.647 | 0.327 | 0.258 | 0.147 | 0.117 |
| UMULH NOP ADCS | 0.712 | 0.425 | 0.213 | 0.170 | 0.095 | 0.076 |
| UMULH ADD ADCS | 0.712 | 0.425 | 0.199 | 0.169 | 0.095 | 0.076 |
| UMULH ADDS ADCS | 0.711 | 0.425 | 0.132 | 0.116 | 0.107 | 0.039 |
| UMULH ADCS ADCS | 0.711 | 0.425 | 0.350 | 0.279 | 0.200 | 0.158 |
| UMULH ADC ADDS | 0.712 | 0.425 | 0.101 | 0.083 | 0.089 | 0.039 |
| UMULH ADC ADDS (dep. regs) | 0.711 | 0.425 | 0.326 | 0.247 | 0.200 | 0.158 |
| MUL ADD UMULH ADD | 0.944 | 0.563 | 0.111 | 0.095 | 0.095 | 0.056 |
PACIA 0.373 0.303 0.242
PACIA AUTIA 0.373 0.303 0.241
PACIA ... AUTIA ... 0.373 0.303 0.241

Reference public documents