Streaming Data
The usual pattern for operating on streaming data is:
- prefetch
- load
- operation
- non-temporal store
We can talk about each of these in turn:
To compensate for latency (the time the processor takes to fetch data through the cache hierarchy), the prefetch distance can be tuned, that is, how far ahead of the current read position the prefetch is issued. Throughput, however, is often bound by the hardware. It's important to note that processors can ignore cache hints.
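As a minimal sketch of a tunable prefetch distance, the function below sums an array while issuing one prefetch per cache line, a fixed number of lines ahead. The names `sum_with_prefetch`, `CACHE_LINE` and `PREFETCH_LINES` are hypothetical, the 64-byte line size and the distance of 4 lines are assumptions to be tuned per machine, and `__builtin_prefetch` is the GCC/Clang builtin (other compilers fall through to a plain loop).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE     64   /* assumed cache line size in bytes */
#define PREFETCH_LINES 4    /* hypothetical prefetch distance, in cache lines */

/* Sum n 32-bit values, hinting one cache line at a time, PREFETCH_LINES ahead. */
uint64_t sum_with_prefetch(const uint32_t *src, size_t n)
{
    assert(src != NULL || n == 0);

    const size_t per_line = CACHE_LINE / sizeof(uint32_t);  /* elements per line */
    const size_t ahead = PREFETCH_LINES * per_line;         /* distance in elements */
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Issue one hint per line, and only for addresses inside the buffer.
           Arguments: rw = 0 (read), locality = 0 (no need to keep it cached).
           This is only a hint; the processor is free to ignore it. */
        if (i % per_line == 0 && i + ahead < n) {
            __builtin_prefetch(&src[i + ahead], 0, 0);
        }
#endif
        total += src[i];
    }
    return total;
}
```

Changing `PREFETCH_LINES` changes how early the hint is issued, not the result. Whether any given value helps has to be measured on the target hardware.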
Loading data in a bus-compatible way leaves more cycles for the operation. In practice this means reading cache-line-aligned chunks. Although cache lines are nominally 32 or 64 bytes, some processors can also preload the adjacent cache line (sometimes configurable in the BIOS).
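To illustrate why alignment matters, here is a small sketch that allocates a buffer on a cache-line boundary and counts how many lines a chunk touches. The helper names (`alloc_line_aligned`, `is_line_aligned`, `lines_touched`) are hypothetical, and the 64-byte line size is an assumption. `aligned_alloc` is the C11 standard library call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Allocate at least `bytes` bytes starting on a cache-line boundary.
   The size is rounded up to a whole number of lines, as C11
   aligned_alloc expects. Release the buffer with free(). */
void *alloc_line_aligned(size_t bytes)
{
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    if (rounded == 0) {
        rounded = CACHE_LINE;
    }
    return aligned_alloc(CACHE_LINE, rounded);
}

/* Nonzero when p sits exactly on a cache-line boundary. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}

/* Number of cache lines covered by `bytes` bytes starting at `addr`. */
size_t lines_touched(uintptr_t addr, size_t bytes)
{
    assert(bytes > 0);
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last = (addr + bytes - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}
```

A 64-byte chunk that starts on a line boundary touches one line. The same chunk starting 32 bytes later touches two, so every load costs twice the line traffic.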
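Suppose a non-temporal store takes only 0.5 cycles to write 32 bytes (a single MOVNTQ writes 8 bytes, so 32 bytes corresponds to a wider store such as a 256-bit VMOVNTDQ). At 5 GHz that is 320 GB/s, which the memory bus comes nowhere near sustaining (the uncached single-core limit). So we have a lot of cycles left for the operation.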
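The arithmetic can be written out as a rough sketch. The function names are hypothetical, and the 20 GB/s memory bandwidth used in the usage note is an assumed figure for illustration only, not a measurement from this page.

```c
#include <assert.h>

/* Peak rate, in bytes per second, at which the core could issue stores. */
double issue_rate(double bytes_per_store, double cycles_per_store, double clock_hz)
{
    assert(cycles_per_store > 0.0);
    return bytes_per_store / cycles_per_store * clock_hz;
}

/* Core cycles that elapse while memory absorbs one chunk of data. */
double cycles_per_chunk(double chunk_bytes, double memory_bytes_per_s, double clock_hz)
{
    assert(memory_bytes_per_s > 0.0);
    return clock_hz * chunk_bytes / memory_bytes_per_s;
}
```

With the figures from the text, `issue_rate(32.0, 0.5, 5e9)` gives 3.2e11 bytes per second, i.e. 320 GB/s. If memory could sustain an assumed 20 GB/s, `cycles_per_chunk(32.0, 20e9, 5e9)` gives 8 cycles per 32-byte chunk, of which the store itself needs only 0.5.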
More recent instruction set extensions have added further ways to tell the cache how data will be used, but one of the earliest was MOVNTQ, which lets the processor know the data isn't going to be used again. This speeds up eviction of the data from the cache. Again, it's important to note that processors can ignore cache hints.
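Putting the four steps together, a minimal sketch using SSE2 intrinsics might look like the following. It uses `_mm_stream_si128` (the 128-bit MOVNTDQ) rather than MOVNTQ itself, whose intrinsic is the MMX `_mm_stream_pi`. The function name `stream_add_one`, the add-one operation and the 256-byte prefetch distance are all assumptions for illustration. On targets without SSE2 the code falls back to a plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#define HAVE_SSE2 1
#include <emmintrin.h>   /* SSE2 intrinsics; also brings in _mm_prefetch and _mm_sfence */
#endif

#define PREFETCH_BYTES 256   /* hypothetical prefetch distance in bytes */

/* dst[i] = src[i] + 1 for n 32-bit values.
   Both pointers must be 16-byte aligned and n must be a multiple of 4. */
void stream_add_one(uint32_t *dst, const uint32_t *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert(n % 4 == 0);

#ifdef HAVE_SSE2
    const size_t total_bytes = n * sizeof(uint32_t);
    const __m128i one = _mm_set1_epi32(1);

    for (size_t i = 0; i < n; i += 4) {
        /* 1. prefetch: non-temporal hint, kept inside the buffer */
        size_t ahead = i * sizeof(uint32_t) + PREFETCH_BYTES;
        if (ahead < total_bytes) {
            _mm_prefetch((const char *)src + ahead, _MM_HINT_NTA);
        }
        /* 2. load: aligned 16-byte read */
        __m128i v = _mm_load_si128((const __m128i *)(src + i));
        /* 3. operation */
        v = _mm_add_epi32(v, one);
        /* 4. non-temporal store: write without keeping the line cached */
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
    /* Make the streamed stores visible before returning. */
    _mm_sfence();
#else
    /* Portable fallback: same result, no cache hints. */
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] + 1;
    }
#endif
}
```

The operation in step 3 is deliberately trivial. The point of the pattern is that the loop is bound by memory, so a more expensive operation can often take its place at little extra cost.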