2011-10-11 21:25

Importance of CPU affinity in multi-threaded benchmarks

Lately, I've been working on benchmarks for the disruptor-- project. The scenario is simple: one thread writes to the buffer and a second thread fetches those values and does a simple addition. The goal is to measure an upper bound on the number of operations per second.

The first benchmarks were run on my old Core2 E8400 and gave a satisfying ~32M ops/sec with a standard deviation of ~1M. Perfect, let's try it on newer hardware, namely a dual quad-core Xeon E5620. It didn't go as expected. In fact, the variance in performance was so high that I couldn't compare against the previous results. Bad runs gave as little as 5-10M ops/sec, while good ones reached about 55M. Here's a small sample that shows the variance:

~/src/disruptor--$ perf/perf-1P-1EP
 6008031.52854411 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
 5017330.94644462 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
13854443.2957167 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
56655853.066027 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
10099522.7723398 ops/sec
~/src/disruptor--$ perf/perf-1P-1EP
55783234.1370777 ops/sec
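
To put a number on that spread, the benchmark output can be piped through a small summarizer. This is only a sketch: ops_stats is a hypothetical helper, not part of disruptor--, and it assumes each result line looks like "<number> ops/sec", as in the sample above.

```shell
#!/bin/bash
# Hypothetical helper (not part of disruptor--): summarize benchmark output.
# Reads lines such as "56655853.066027 ops/sec" on stdin and prints the
# number of runs, the mean and the sample standard deviation.
# Lines whose second field is not "ops/sec" (e.g. shell prompts) are ignored.
ops_stats() {
  awk '$2 ~ /^ops\/sec/ { n++; sum += $1; sumsq += $1 * $1 }
       END {
         if (n == 0) { print "no samples"; exit 1 }
         mean = sum / n
         # Sample variance (n - 1 denominator); zero when there is one run.
         var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
         if (var < 0) var = 0   # guard against floating-point rounding
         printf "runs=%d mean=%.0f stddev=%.0f\n", n, mean, sqrt(var)
       }'
}

# Usage: run the benchmark 10 times and summarize the results.
# for i in $(seq 10); do perf/perf-1P-1EP; done | ops_stats
```

A standard deviation comparable to the mean itself, as in the sample above, is a strong hint that the runs are not measuring the same thing.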
            

It occurred to me that the threads were probably not running on the same socket, implying no L3 cache sharing. This is where the taskset utility comes in: it restricts a process to a given set of logical CPUs. I first saw this tool used by Martin Thompson in his inter-thread latency benchmark. Unluckily for me, I was still getting poor performance, but it did fix the high variance:

~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
10938592.0983346 ops/sec
~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
10174226.1285803 ops/sec
~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
10901707.4716223 ops/sec
~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
10272055.0311054 ops/sec
~/src/disruptor--$ taskset -c 0,1 perf/perf-1P-1EP
11507956.6466277 ops/sec
            

The culprit was not so easy to find. My computer scientist mindset tricked me: CPU numbering in Linux does not imply topological proximity, i.e. processors 0 and 1 are not guaranteed to be physically close. All the information needed to map the processor topology can be found in /proc/cpuinfo. The following code snippet extracts the desired information.

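#!/bin/bash
# Print the value of every /proc/cpuinfo field whose name starts with $1.
function filter {
  grep -E "^$1.*: [0-9]+" /proc/cpuinfo | sed -e 's/^.*: //'
}

# One entry per logical CPU, in /proc/cpuinfo order.
CPU_ID=($(filter processor))
SOCKET_ID=($(filter 'physical id'))
CORE_ID=($(filter 'core id'))

# Assumes logical CPUs are listed as 0..N-1, so the CPU number indexes the arrays.
for cpu_id in "${CPU_ID[@]}"; do
    echo "cpu $cpu_id: socket${SOCKET_ID[$cpu_id]}_core${CORE_ID[$cpu_id]}"
done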
            

For example, on the previous dual Xeon, processors 0 and 1 are on two distinct cores, while (logical) processors 0 and 8 share the same core due to Intel Hyper-Threading.

cpu 0: socket0_core0
cpu 1: socket0_core1
cpu 2: socket0_core9
cpu 3: socket0_core10
cpu 4: socket1_core0
cpu 5: socket1_core1
cpu 6: socket1_core9
cpu 7: socket1_core10
cpu 8: socket0_core0
cpu 9: socket0_core1
cpu 10: socket0_core9
cpu 11: socket0_core10
cpu 12: socket1_core0
cpu 13: socket1_core1
cpu 14: socket1_core9
cpu 15: socket1_core10
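
The same information can be grouped the other way around, listing the logical CPUs that share each physical core. The sketch below is an illustration only: core_siblings is a hypothetical helper, and it assumes the x86 /proc/cpuinfo layout in which "physical id" appears before "core id" within each processor entry.

```shell
#!/bin/bash
# Hypothetical helper: list the logical CPUs sharing each physical core.
# Reads /proc/cpuinfo-style text from the file given as $1
# (default: /proc/cpuinfo).
core_siblings() {
  awk -F': *' '
    /^processor/   { cpu = $2 }
    /^physical id/ { socket = $2 }
    /^core id/ {
      key = "socket" socket "_core" $2
      if (key in cpus) {
        cpus[key] = cpus[key] "," cpu        # another thread on a known core
      } else {
        order[++n] = key                     # remember first-seen order
        cpus[key] = cpu
      }
    }
    END { for (i = 1; i <= n; i++) print order[i] ": " cpus[order[i]] }
  ' "${1:-/proc/cpuinfo}"
}

# Usage: print one line per physical core.
# core_siblings
```

Going by the listing above, the first line for the dual Xeon would read socket0_core0: 0,8, which is exactly the pair to hand to taskset.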
            

We obtain a huge performance boost, from ~11M to ~57M ops/sec, by running the two threads on two logical processors of the same core, thus sharing its L2 cache.

~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
55927128.0825528 ops/sec
~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
57074942.92239 ops/sec
~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
56078486.0066642 ops/sec
~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
57132681.2391817 ops/sec
~/src/disruptor--$ taskset -c 0,8 perf/perf-1P-1EP
57538601.8447234 ops/sec
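
Rather than hard-coding the 0,8 pair, the sibling list can be read from sysfs, where Linux exposes it per logical CPU in topology/thread_siblings_list. A minimal sketch, with thread_siblings as a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: print the hyper-thread siblings of logical CPU $1 as a
# taskset-style CPU list, read from sysfs.
# $2 optionally overrides the sysfs CPU directory (useful for testing only).
thread_siblings() {
  cat "${2:-/sys/devices/system/cpu}/cpu$1/topology/thread_siblings_list"
}

# Usage: pin the benchmark to both hardware threads of CPU 0's core.
# taskset -c "$(thread_siblings 0)" perf/perf-1P-1EP
```

On the dual Xeon above this should yield the same pair as the manual lookup, and it keeps the command portable to machines with a different numbering scheme.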
            

disruptor--

A C++ implementation of the LMAX Disruptor pattern, using the latest C++0x atomic and thread features.

Contact Information

fsaintjacques@this.hostname
