       7703.572978 task-clock (msec)         #    0.996 CPUs utilized
             1,575 context-switches          #    0.204 K/sec
                18 cpu-migrations            #    0.002 K/sec
            65,975 page-faults               #    0.009 M/sec
    25,719,058,036 cycles                    #    3.340 GHz
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
    12,323,855,909 instructions              #    0.48  insns per cycle
     2,337,484,352 branches                  #  303.429 M/sec
       200,227,908 branch-misses             #    8.57% of all branches
     3,167,237,318 L1-dcache-loads           #  411.139 M/sec
       454,416,650 L1-dcache-load-misses     #   14.35% of all L1-dcache hits
       326,345,389 LLC-loads                 #   42.363 M/sec
   <not supported> LLC-load-misses:HG

I profiled my C code, which is written with libCCC, using perf stat. It sorts a doubly linked list, which involves a lot of list traversal, so it probably requests data scattered across many different memory addresses. However, modern processors support multi-stage pipelining, branch prediction, and out-of-order execution, all of which should increase the average number of instructions executed per unit of time. Yet according to the profile, only about one instruction is executed every two cycles (0.48 instructions per cycle). What could cause this?