Instruction execution time on Arm64

Principles

Measurements are made over loops of 32 instructions. The number of iterations is on the order of 100 million (more or less, depending on the instruction). The total execution time is divided by the total number of instructions executed in the main loop, giving a mean instruction execution time in nanoseconds.
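The division described above can be sketched in C as follows. This is only an illustration, not the code from instime.c; the function name `mean_instruction_ns` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mean time per instruction, in nanoseconds.
 *   total_ns       - measured time of the whole loop, in nanoseconds
 *   iterations     - number of loop iterations (about 100 million in the tests)
 *   instr_per_iter - instructions in the loop body (32 in the tests) */
double mean_instruction_ns(double total_ns, uint64_t iterations,
                           unsigned instr_per_iter)
{
    assert(iterations > 0 && instr_per_iter > 0);
    return total_ns / ((double)iterations * (double)instr_per_iter);
}
```

For example, a 32-instruction loop that runs 100,000,000 times in 0.32 seconds executes 3.2 billion instructions in 320 million nanoseconds, which gives 0.1 ns per instruction.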

The difficulty is the overhead of the loop-control instructions (three instructions: decrement counter, test counter, branch back). Results are therefore reported in two ways: with the loop overhead removed, and with it ignored.

To remove the loop overhead, an empty loop is executed with the same number of iterations. The time of the empty loop is subtracted from the main loop time.
However, it may be safe to ignore the overhead because the three loop instructions are probably executed in parallel with the loop body and the branch is probably predicted correctly. Subtracting the time of the empty loop may therefore be misleading.
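The two ways of reporting differ only in whether the empty-loop time is subtracted before dividing. A minimal sketch, with hypothetical names that are not taken from instime.c:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mean instruction time in nanoseconds.
 * If subtract_empty is nonzero, the time of the empty loop (same
 * iteration count, no body) is removed before dividing. */
double mean_ns(double main_loop_ns, double empty_loop_ns,
               uint64_t iterations, unsigned instr_per_iter,
               int subtract_empty)
{
    assert(iterations > 0 && instr_per_iter > 0);
    double t = subtract_empty ? main_loop_ns - empty_loop_ns : main_loop_ns;
    return t / ((double)iterations * (double)instr_per_iter);
}
```

If the loop-control instructions really do execute in parallel with the body, the subtraction removes time that was never added, so the subtracted figure would tend to underestimate the per-instruction time.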

The time measurement is based on the virtual counter register (CNTVCT_EL0) and its corresponding frequency register (CNTFRQ_EL0).
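As an illustration, the two registers can be read from C with inline assembly and the tick count converted to nanoseconds. This is a sketch under assumptions, not the project's code: the function names are hypothetical, and the ISB before the counter read is a common precaution that is assumed here rather than confirmed from the sources.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick count to nanoseconds, given the counter frequency in Hz. */
double ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz > 0);
    return (double)ticks * 1e9 / (double)freq_hz;
}

#if defined(__aarch64__)
/* Read the virtual counter (CNTVCT_EL0). The ISB keeps the read from
 * being executed ahead of earlier instructions (assumed precaution). */
uint64_t read_cntvct(void)
{
    uint64_t v;
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(v) : : "memory");
    return v;
}

/* Read the counter frequency in Hz (CNTFRQ_EL0). */
uint64_t read_cntfrq(void)
{
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
}
#endif
```

Timing a region then amounts to reading the counter before and after it and passing the difference, together with the value of `read_cntfrq()`, to `ticks_to_ns()`.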

As a general rule, inside a loop, all instructions read the same input registers and write to at least 16 distinct output registers.

Tested systems

System	CPU chip	CPU core
Raspberry Pi 4 Model B	Broadcom BCM2711	Arm Cortex A72
Ampere Mt.Jade Server	Ampere Altra	Arm Neoverse N1
AWS EC2 instance c7g.xlarge	AWS Graviton 3	Arm Neoverse V1
Supermicro ARS-221GL-NR	Nvidia Grace Superchip	Arm Neoverse V2
Apple MacBook M1	Apple M1	Apple Firestorm/Icestorm
Apple iMac M3	Apple M3	Apple (unknown core)

The Apple M1 CPU was tested using macOS. All other CPUs were tested on Linux.

Results

Depending on the CPU, consecutive executions of the test program produce slightly different results. However, the difference is confined to the third decimal place, that is, a few picoseconds, and can be considered negligible.

The PAC (pointer authentication) instructions can only be evaluated on Armv8.3-A and later. On older CPU cores, the PAC tests are automatically skipped.
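One possible way to decide at run time whether to skip the PAC tests on Linux is to query the hardware capabilities reported by the kernel. The sketch below uses `getauxval(AT_HWCAP)` and the `HWCAP_PACA` flag; the function name `pac_supported` is hypothetical, and the project may detect PAC support differently.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_PACA */
#endif

/* Hypothetical helper. Returns 1 if the kernel reports address-key
 * pointer authentication, 0 if it does not, and -1 if this sketch
 * cannot tell on the current platform. */
int pac_supported(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_PACA)
    return (getauxval(AT_HWCAP) & HWCAP_PACA) ? 1 : 0;
#else
    return -1;
#endif
}
```

A test driver could then run the PACIA and AUTIA loops only when this function returns 1.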

On macOS, all Apple Silicon chips support PAC. However, PAC instructions can be disabled at system or application level (architecture arm64 vs. arm64e, more details here). When PAC instructions are disabled (architecture arm64), they execute at the speed of a NOP. Therefore, the PAC tests are skipped in this configuration to avoid reporting meaningless instruction times.

The tables below give the mean instruction time, in nanoseconds, in each loop. The Excel file in this project adds the frequency information of each core and the corresponding relative performance information.
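One plausible way to relate these times to the core frequency is to express them in clock cycles per instruction. The sketch below is an assumption about that kind of normalization, not a description of the Excel file; the function name and the frequency parameter are hypothetical, and the actual core frequencies are not reproduced here.

```c
#include <assert.h>

/* Hypothetical helper: convert a mean instruction time in nanoseconds
 * into clock cycles per instruction. One nanosecond at 1 GHz is one
 * cycle, so cycles = nanoseconds * frequency in GHz. */
double ns_to_cycles(double time_ns, double core_ghz)
{
    assert(core_ghz > 0.0);
    return time_ns * core_ghz;
}
```

For instance, 0.5 ns on a core assumed to run at 2.0 GHz corresponds to one cycle per instruction. The 2.0 GHz value is a placeholder chosen for the arithmetic, not a measured frequency of any tested system.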

Ignoring the empty loop time

Mean instruction time (nanoseconds)	Cortex A72	Neoverse N1	Neoverse V1	Neoverse V2	Apple M1	Apple M3
NOP	0.560	0.092	0.036	0.029	0.045	0.033
ADD	0.297	0.115	0.099	0.080	0.070	0.036
ADC	0.297	0.115	0.099	0.083	0.108	0.065
ADDS	0.315	0.120	0.131	0.124	0.107	0.066
ADCS	0.559	0.334	0.388	0.309	0.313	0.249
MUL	1.677	1.002	0.193	0.152	0.156	0.126
UMULH	2.236	1.336	0.193	0.152	0.156	0.126
DIV	3.353	2.005	2.311	1.824	0.625	0.498
MUL UMULH	1.957	1.169	0.193	0.152	0.156	0.125
MUL ADCS UMULH ADCS	0.978	0.584	0.294	0.238	0.157	0.125
MUL ADD	0.839	0.501	0.105	0.095	0.104	0.063
MUL ADC	0.838	0.501	0.105	0.102	0.105	0.064
MUL ADDS	0.838	0.501	0.103	0.099	0.104	0.063
MUL ADCS	0.839	0.501	0.196	0.188	0.156	0.124
MUL ADCS (alt)	0.838	0.501	0.195	0.156	0.156	0.125
UMULH ADD	1.118	0.668	0.131	0.106	0.105	0.062
UMULH ADC	1.118	0.668	0.131	0.113	0.105	0.062
UMULH ADDS	1.118	0.668	0.146	0.119	0.104	0.063
UMULH ADCS	1.118	0.668	0.340	0.270	0.156	0.124
UMULH ADCS (alt)	1.119	0.668	0.339	0.268	0.156	0.125
UMULH NOP ADCS	0.746	0.445	0.224	0.179	0.105	0.083
UMULH ADD ADCS	0.745	0.445	0.210	0.179	0.105	0.083
UMULH ADDS ADCS	0.745	0.445	0.144	0.129	0.117	0.046
UMULH ADCS ADCS	0.745	0.445	0.361	0.289	0.210	0.166
UMULH ADC ADDS	0.745	0.445	0.113	0.101	0.098	0.045
UMULH ADC ADDS (dep. regs)	0.745	0.445	0.337	0.256	0.210	0.167
MUL ADD UMULH ADD	0.978	0.584	0.123	0.106	0.104	0.063
PACIA			0.385		0.313	0.248
PACIA AUTIA			0.385		0.313	0.250
PACIA ... AUTIA ...			0.385		0.314	0.251

After subtracting the empty loop time

Mean instruction time (nanoseconds)	Cortex A72	Neoverse N1	Neoverse V1	Neoverse V2	Apple M1	Apple M3
NOP	0.524	0.073	0.024	0.019	0.033	0.022
ADD	0.261	0.094	0.087	0.069	0.060	0.027
ADC	0.261	0.094	0.087	0.069	0.098	0.057
ADDS	0.279	0.099	0.119	0.118	0.098	0.058
ADCS	0.525	0.313	0.376	0.300	0.303	0.242
MUL	1.642	0.981	0.181	0.143	0.147	0.116
UMULH	2.201	1.315	0.181	0.143	0.147	0.117
DIV	3.319	1.983	2.299	1.815	0.617	0.491
MUL UMULH	1.922	1.148	0.181	0.143	0.147	0.117
MUL ADCS UMULH ADCS	0.943	0.563	0.282	0.228	0.147	0.117
MUL ADD	0.804	0.480	0.093	0.081	0.095	0.055
MUL ADC	0.803	0.480	0.093	0.095	0.095	0.056
MUL ADDS	0.804	0.480	0.091	0.098	0.095	0.055
MUL ADCS	0.803	0.480	0.183	0.176	0.147	0.118
MUL ADCS (alt)	0.804	0.480	0.183	0.146	0.147	0.117
UMULH ADD	1.084	0.647	0.119	0.087	0.095	0.055
UMULH ADC	1.083	0.647	0.119	0.091	0.095	0.055
UMULH ADDS	1.083	0.647	0.133	0.124	0.095	0.055
UMULH ADCS	1.083	0.647	0.328	0.260	0.147	0.117
UMULH ADCS (alt)	1.084	0.647	0.327	0.258	0.147	0.117
UMULH NOP ADCS	0.712	0.425	0.213	0.170	0.095	0.076
UMULH ADD ADCS	0.712	0.425	0.199	0.169	0.095	0.076
UMULH ADDS ADCS	0.711	0.425	0.132	0.116	0.107	0.039
UMULH ADCS ADCS	0.711	0.425	0.350	0.279	0.200	0.158
UMULH ADC ADDS	0.712	0.425	0.101	0.083	0.089	0.039
UMULH ADC ADDS (dep. regs)	0.711	0.425	0.326	0.247	0.200	0.158
MUL ADD UMULH ADD	0.944	0.563	0.111	0.095	0.095	0.056
PACIA			0.373		0.303	0.242
PACIA AUTIA			0.373		0.303	0.241
PACIA ... AUTIA ...			0.373		0.303	0.241

Reference public documents

Arm Cortex A72 Core Software Optimization Guide
Arm Neoverse N1 Core Software Optimization Guide
Arm Neoverse V1 Software Optimization Guide
Apple M1 Microarchitecture Research by Dougall Johnson (unofficial reverse engineering works)

lelegard/arm-instruction-time
