Streaming Data

The usual pattern for operating on streaming data is:

prefetch
load
operation
non-temporal store

We can talk about each of these in turn:

To compensate for latency (the time the processor takes to fetch data through the cache hierarchy), the prefetch distance can be tuned, but throughput will often be bound by the hardware. It's important to note that processors can ignore cache hints.
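As a rough illustration of a tunable prefetch distance, the sketch below sums a byte buffer one cache line at a time and issues a prefetch hint a fixed number of bytes ahead. The names `sum_with_prefetch`, `PREFETCH_DISTANCE` and `CACHE_LINE`, and the values 512 and 64, are assumptions made for this example, not figures from the text above; a real distance would have to be measured on the target machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))   /* no-op on targets without SSE */
#endif

/* Hypothetical tuning knob: how many bytes ahead of the read to prefetch. */
#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 512
#endif

#define CACHE_LINE 64   /* assumed line size */

/* Sum n bytes, hinting the line PREFETCH_DISTANCE bytes ahead of each read. */
uint64_t sum_with_prefetch(const uint8_t *src, size_t n)
{
    assert(src != NULL || n == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i += CACHE_LINE) {
        /* Only a hint: the processor may drop it, so correctness never depends on it. */
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_NTA(src + i + PREFETCH_DISTANCE);
        size_t end = (i + CACHE_LINE < n) ? i + CACHE_LINE : n;
        for (size_t j = i; j < end; ++j)
            total += src[j];
    }
    return total;
}
```

Because `PREFETCH_DISTANCE` is guarded by `#ifndef`, it can be overridden at build time (for example with `-DPREFETCH_DISTANCE=1024`) to compare distances without touching the code.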

Loading data in a bus-friendly way leaves more cycles for the operation. In practice this means reading in cache-line-aligned chunks. Although cache lines are usually quoted as 32 or 64 bytes, some processors can also preload the adjacent cache line (sometimes configurable in the BIOS).
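One way to get cache-line-aligned chunks is to allocate the buffer on a line boundary to begin with. The sketch below assumes a 64-byte line and uses C11 `aligned_alloc`; the helper names `alloc_cache_aligned`, `is_cache_aligned` and `lines_touched` are made up for this example. `lines_touched` shows why alignment matters: a 64-byte chunk that starts off-boundary spans two lines instead of one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed; query the target CPU if it matters */

/* Allocate at least `bytes`, rounded up to whole cache lines and aligned to one.
   Release the result with free(). */
void *alloc_cache_aligned(size_t bytes)
{
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    if (rounded == 0)
        rounded = CACHE_LINE;
    return aligned_alloc(CACHE_LINE, rounded);   /* C11 */
}

/* Nonzero when p sits exactly on a cache-line boundary. */
int is_cache_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}

/* Number of cache lines covered by the byte range [p, p + bytes). */
size_t lines_touched(const void *p, size_t bytes)
{
    assert(bytes > 0);
    uintptr_t first = (uintptr_t)p / CACHE_LINE;
    uintptr_t last  = ((uintptr_t)p + bytes - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}
```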

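Suppose a non-temporal store takes only 0.5 cycles to write 32 bytes (MOVNTQ itself writes 8 bytes from an MMX register; the 32-byte form is the AVX VMOVNTDQ). At 5 GHz that is 320 GB/s, which is nowhere near what the memory bus can actually sustain (the uncached single-core limit). So we have a lot of cycles left for the operation.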
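The arithmetic behind that figure, and the cycle budget it implies, can be written out as two small helpers. This is only a back-of-the-envelope sketch: the function names are invented here, and any sustained bandwidth passed to `cycles_per_line` is a number you would have to measure yourself, since the text above does not give one.

```c
#include <assert.h>

/* Peak issue rate of a store in GB/s:
   bytes per store / cycles per store * clock in GHz. */
double peak_store_rate_gbs(double bytes_per_store,
                           double cycles_per_store,
                           double clock_ghz)
{
    assert(cycles_per_store > 0.0);
    return bytes_per_store / cycles_per_store * clock_ghz;
}

/* Core cycles that elapse while memory delivers one cache line, given a
   sustained memory bandwidth in GB/s (a measured value, not one from the text). */
double cycles_per_line(double line_bytes,
                       double clock_ghz,
                       double sustained_gbs)
{
    assert(sustained_gbs > 0.0);
    return line_bytes * clock_ghz / sustained_gbs;
}
```

With the numbers from the text, `peak_store_rate_gbs(32.0, 0.5, 5.0)` gives 320. Purely for illustration, if one core sustained 20 GB/s (a hypothetical figure), `cycles_per_line(64.0, 5.0, 20.0)` gives 16 cycles per 64-byte line, of which the stores themselves need only a small part.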

More recently, further instructions have been added to tell the cache how data will be used, but one of the earliest was MOVNTQ, which lets the processor know the data isn't going to be used again. This speeds eviction of the data from the cache.
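Putting the four steps together, a minimal sketch of the prefetch, load, operation, non-temporal store pattern might look like the following. It uses the 16-byte SSE2 store `_mm_stream_si128` (MOVNTDQ) rather than MOVNTQ, purely because SSE2 is available on every x86-64 target; the function name `stream_add`, the trivial "add a constant" operation, and the `PREFETCH_DISTANCE` value are assumptions for the example. On targets without SSE2 it falls back to a plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 loads and streaming stores; also pulls in xmmintrin.h */
#define HAVE_SSE2 1
#endif

#define CACHE_LINE 64           /* assumed line size */
#define PREFETCH_DISTANCE 512   /* hypothetical; tune per machine */

/* dst[i] = src[i] + delta (wrapping) for n bytes.
   Both pointers must be 16-byte aligned and n a multiple of CACHE_LINE. */
void stream_add(uint8_t *dst, const uint8_t *src, size_t n, uint8_t delta)
{
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    assert(n % CACHE_LINE == 0);
#ifdef HAVE_SSE2
    const __m128i d = _mm_set1_epi8((char)delta);
    for (size_t i = 0; i < n; i += CACHE_LINE) {
        /* 1. prefetch: hint a line further ahead in the source */
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)(src + i + PREFETCH_DISTANCE), _MM_HINT_NTA);
        for (size_t j = 0; j < CACHE_LINE; j += 16) {
            /* 2. load: aligned 16-byte read */
            __m128i v = _mm_load_si128((const __m128i *)(src + i + j));
            /* 3. operation: stand-in for the real work */
            v = _mm_add_epi8(v, d);
            /* 4. non-temporal store: write without keeping the line cached */
            _mm_stream_si128((__m128i *)(dst + i + j), v);
        }
    }
    /* Streaming stores are weakly ordered; fence before others read dst. */
    _mm_sfence();
#else
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint8_t)(src[i] + delta);
#endif
}
```

The inner loop writes a whole cache line with consecutive streaming stores, which gives the processor the chance to combine them into a single line-sized write.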
