Lately, I've been working on benchmarks for the disruptor-- project. The scenario is simple: one thread writes to the buffer, and a second thread fetches those values and performs a simple addition. The goal is to measure an upper bound on the number of operations per second.
The first benchmarks were run on my old Core2 E8400, obtaining a satisfying ~32M ops/sec with a standard deviation of ~1M. Perfect, let's try it on newer hardware, i.e. a dual quad-core Xeon E5620. It didn't go as expected. In fact, the variance in performance was so high that I couldn't compare the numbers with the previous results. On bad runs I would get 10M ops/sec, and good ones were about 50M. Here's a small sample that shows the variance:
~/src/disruptor--$ perf/perf-1P-1EP
6008031.52854411 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
5017330.94644462 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
13854443.2957167 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
56655853.066027 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
10099522.7723398 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
55783234.1370777 ops/sec
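To put a number on that spread, the output of several runs can be piped through a small helper. This is only a sketch: the name stats is hypothetical and not part of disruptor--, and it assumes each run prints one line of the form "<number> ops/sec" (any other line, such as a shell prompt, is ignored).

```shell
#!/bin/sh
# stats (hypothetical helper): read benchmark output on stdin and print the
# mean and population standard deviation of the "<number> ops/sec" lines.
stats() {
    awk '
        /ops\/sec/ {
            # Welford update: numerically stable running mean and variance.
            n++
            delta = $1 - mean
            mean += delta / n
            m2 += delta * ($1 - mean)
        }
        END {
            if (n > 0)
                printf "mean=%.0f stddev=%.0f\n", mean, sqrt(m2 / n)
        }
    '
}
```

Something like: for i in 1 2 3 4 5; do perf/perf-1P-1EP; done | stats would then summarize five runs in one line.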
It occurred to me that the two threads were probably not running on the same socket, which would mean no shared L3 cache. This is where the taskset utility comes in. I first saw this tool used by Martin Thompson in his inter-thread latency benchmark. Unluckily for me, I was still getting poor performance, but it did fix the high variance:
~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
10938592.0983346 ops/sec
~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
10174226.1285803 ops/sec
~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
10901707.4716223 ops/sec
~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
10272055.0311054 ops/sec
~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
11507956.6466277 ops/sec
The culprit was not so easy to find. My computer scientist mindset tricked me: CPU numbering in Linux does not imply topological proximity, i.e. processors 0 and 1 are not guaranteed to be physically close. All the information needed to map the processor topology can be found in /proc/cpuinfo. The following script extracts the desired information.
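#!/bin/bash
# Print the numeric value of every /proc/cpuinfo line for field $1.
function filter {
    grep -E "^$1[[:space:]]*: [0-9]+" /proc/cpuinfo | sed -e 's/^.*: //'
}
CPU_ID=$(filter processor)
# Arrays indexed by logical CPU number (entries appear in processor order).
SOCKET_ID=($(filter 'physical id'))
CORE_ID=($(filter 'core id'))
for cpu_id in $CPU_ID; do
    echo "cpu $cpu_id: socket${SOCKET_ID[$cpu_id]}_core${CORE_ID[$cpu_id]}"
done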
For example, on the dual Xeon above, processors 0 and 1 sit on two distinct cores, while logical processors 0 and 8 share the same core thanks to Intel Hyper-Threading.
cpu 0: socket0_core0
cpu 1: socket0_core1
cpu 2: socket0_core9
cpu 3: socket0_core10
cpu 4: socket1_core0
cpu 5: socket1_core1
cpu 6: socket1_core9
cpu 7: socket1_core10
cpu 8: socket0_core0
cpu 9: socket0_core1
cpu 10: socket0_core9
cpu 11: socket0_core10
cpu 12: socket1_core0
cpu 13: socket1_core1
cpu 14: socket1_core9
cpu 15: socket1_core10
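Reading sibling pairs off a listing like this by eye gets tedious on bigger machines. As a rough sketch, the lines can be grouped by physical core with a little awk. The function name siblings is hypothetical, and it assumes its input is exactly in the "cpu N: socketX_coreY" format shown above.

```shell
#!/bin/sh
# siblings (hypothetical helper): read "cpu N: socketX_coreY" lines on stdin
# and print, for each physical core, the logical CPUs that live on it.
siblings() {
    awk '
        /^cpu [0-9]+: / {
            cpu = $2
            sub(/:$/, "", cpu)          # strip the trailing colon
            core = $3                   # e.g. socket0_core0
            if (core in list) {
                list[core] = list[core] "," cpu
            } else {
                order[++n] = core       # remember first-seen order
                list[core] = cpu
            }
        }
        END {
            for (i = 1; i <= n; i++)
                print order[i] ": " list[order[i]]
        }
    '
}
```

Each output line holds a comma-separated list that can be handed directly to taskset -c.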
We get a large performance boost by running the two threads on two logical processors of the same core, which share that core's L2 cache:
~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
55927128.0825528 ops/sec
~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
57074942.92239 ops/sec
~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
56078486.0066642 ops/sec
~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
57132681.2391817 ops/sec
~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
57538601.8447234 ops/sec
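On kernels that expose the CPU topology through sysfs, the sibling list of a logical CPU can also be read directly from its thread_siblings_list file instead of being derived from /proc/cpuinfo. A minimal sketch, assuming that file is present on the machine; the function name thread_siblings and its optional second argument (an alternate sysfs root, mostly useful for testing) are hypothetical:

```shell
#!/bin/sh
# thread_siblings (hypothetical helper): print the logical CPUs that share a
# physical core with CPU $1, as reported by sysfs.
# $2 optionally overrides the sysfs root (default: /sys/devices/system/cpu).
thread_siblings() {
    root="${2:-/sys/devices/system/cpu}"
    cat "$root/cpu$1/topology/thread_siblings_list"
}
```

The result is already in the list syntax taskset -c accepts, so taskset -c "$(thread_siblings 0)" perf/perf-1P-1EP should pin the benchmark to the same kind of pair as the 0,8 runs above.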
A C++ implementation of LMAX disruptor pattern, using latest C++0x atomic and thread features.
fsaintjacques@this.hostname
Born with a computer in hand, then fell in love with mathematics.