This note documents how and why to use the -e cycles:pp flag with perf.
Always use this flag; it attributes samples to instructions more precisely.
The perf TODO list proposes making this flag the default.
The default “perf record” can sometimes attribute cycles to the wrong instruction. For example, if a mov instruction precedes a multiply, the cycles the multiply spends waiting for the mov are attributed to the multiply, even though the real bottleneck is the preceding mov.
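The misattribution described above can be illustrated with a toy model. The Python sketch below is not how perf or the hardware works internally. It assumes a hypothetical three-instruction loop with invented cycle costs, and it models skid as a sample landing a fixed number of instructions after the one that was actually executing. The names LOOP and profile are made up for this illustration.

```python
from collections import Counter

# Hypothetical loop body: (instruction name, cycles it occupies).
# The costs are invented for illustration, not measured.
LOOP = [("mov", 8), ("mul", 1), ("add", 1)]


def profile(loop, total_cycles, period, skid):
    """Take one sample every `period` cycles.

    `skid` is the number of instructions by which attribution is shifted
    forward; 0 models precise attribution, 1 models a sample that is
    charged to the instruction after the one that was executing.
    """
    # Expand the loop into a per-cycle timeline of instruction indices.
    timeline = [i for i, (_, cost) in enumerate(loop) for _ in range(cost)]
    counts = Counter()
    for cycle in range(0, total_cycles, period):
        executing = timeline[cycle % len(timeline)]
        blamed = loop[(executing + skid) % len(loop)][0]
        counts[blamed] += 1
    return counts


precise = profile(LOOP, total_cycles=10_000, period=7, skid=0)
skewed = profile(LOOP, total_cycles=10_000, period=7, skid=1)
print("precise:", precise.most_common())
print("skid:   ", skewed.most_common())
```

In this model the unshifted profile ranks mov first, while the shifted profile ranks mul first, even though the underlying timeline is identical. Real skid depends on the microarchitecture and is not a fixed one-instruction shift, so treat this only as an aid to intuition.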
Some analysis and reasoning can be found here:
Finally, why don’t cpu-clock profiles identify slow instructions? The answer has to do with the way cpu-clock events are attributed back to binary code when handling a timer interrupt. First, there is a delay between the time when a sampling interrupt is requested and when the sampling interrupt is honored and handled. Second, the program counter value that is captured in a sample is the program address at which execution will restart after interrupt handling is complete; it is not the program location where the interrupt is first asserted. The combination of these two factors is called skid and it affects the attribution and distribution of samples in the final profile. In the presence of skid, samples are attributed to the general neighborhood around performance culprits such as long latency memory load operations. The cpu-clock event cannot be precisely attributed to slow instructions, just the hot code region containing the culprit. This is why you shouldn’t conclude that the compare (cmp) instruction: