D1886R0.0: Error speed benchmarking

1. Introduction

In this paper, we will look at the relative speeds of error handling strategies in micro-benchmarking scenarios. "Happy" and "sad" path timings will be presented.

2. Measuring methodology

All benchmarks lie. It’s important to know how a benchmark is set up so that the useful parts can be distinguished from the misleading parts.

Gathering representative timings for very small pieces of code is difficult on modern hardware. One type of micro-architectural quirk that makes these measurements very painful is code alignment [Bakhvalov]. Each call and each branch can introduce stalls if jumping to a poorly aligned location. These stalls can dominate the timings. The author has seen code alignment differences swing a benchmark running at 451 cycles per iteration to 310 cycles per iteration, for a 45% increase in execution time. This involved no system calls, no atomic operations, no divisions, or any other highly expensive operations being performed.

This research used a novel technique to get representative timings across a wide range of jump alignments. At the beginning of each function, a random number of nop instructions (0 to 31 inclusive) were inserted. A matching set of nop instructions were inserted at a later point in the execution path so that a total of 31 user inserted nop instructions were always executed. 1024 instances of this program, all with different layouts of nops, were then built and run.

Tests were built to optimize for speed. The test case optimizes away in the face of link-time optimizations (LTO) / whole program optimizations, so they were not used. Some compilers only support profile-guided optimizations (PGO) after turning on link-time optimizations, so PGO wasn’t used either. Only the exception cases were built with exceptions turned on.

For all but the exception failure cases, 16K warm-up iterations were run before starting timing. Within the timer bounds, 512K iterations of the test case were run. The exception failure cases only had 256 warm-up iterations and 16K timed iterations, due to just how slow throwing an exception can be.

All tests were run on the same machine. The processor was an Intel Core i7-3770 running at 3.3915 GHz. The Core i7-3770 has an Ivy Bridge microarchitecture, and was released in April 2012. C-states, Hyper-Threading, and Turbo Boost were all disabled in order to improve benchmark reproducibility. MSVC benchmarks were run on the 64-bit version of Windows 10. Clang and GCC benchmarks were run on the Slax Linux distribution. (TODO: verify)

3. Starter test case for on-line analysis

To start with, we will look at code similar to the following:

struct Dtor[N] {
  extern ~Dtor[N]() {
    /* Random number of NOPs K[N] := [0,31] */
  }
};
struct inline_nops {
  inline inline_nops() {
    /* Random number of NOPs Z := [0,31] */
  }
  inline ~inline_nops() {
    /* 31 - Z NOPs */
  }
};
int global_int = 0;
int callee(bool do_err) {
  inline_nops n;
  if(do_err)
    return 1;
  return 0;
}
int caller[N](bool do_err) {
  /* 31 - K[N] NOPs */
  Dtor[N] d;
  /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */
  int e = NEXT_FUNC(do_err);
  if(e) return e;
  global_int = 0;
  return 0;
}
int main() {
  /* Test happy path */
  startTimer();
  /* X NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    (void)caller[0](false);
  }
  /* 31 - X NOPs */
  endTimer();
  print(duration() / ITERATIONS);

  /* Test sad path */
  startTimer();
  /* Y NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    (void)caller[0](true);
  }
  /* 31 - Y NOPs */
  endTimer();
  print(duration() / ITERATIONS);
}

This code has some important properties.

callee() will raise an error based off of runtime state
caller[N]() needs to clean up the d object in error and non-error conditions.
caller[N]() should only set global_int in success cases.
The inserted NOPs allow us to sample various alignments while keeping the same program structure.

In the actual tests all the function bodies are in separate .cpp files, and link-time / whole-program optimizations aren’t used. If they had been used, the entire program would get optimized away, removing our ability to measure error handling differences.

The initial part of this paper is only looking at three types of error handling.

abort: When an error is encountered, kill the program with std::abort. No "sad" path benchmarks were taken for this case.
return_val: Return an integer error code, where zero represents success.
throw_exception: Throw an exception deriving from std::exception, that contains only an int. This should represent typical exception usage cases. The exception is only caught with a catch(...).

The actual code used for the benchmark can be found on my github.

4. Happy path measurements

Abort and x64 exceptions are consistently faster than return codes. Windows x86 exceptions are slower than both return codes and abort semantics. While return codes are slower than abort and x64 exceptions, they aren’t much slower. The cost ranges from less than a cycle per function to four cycles per function in my experiments.

4.1. Write to a global

Write to global: Median cycles per iteration. A histogram can be found here

4.2. Write through an argument

Instead of writing to a global, we write through a pointer argument.

int callee(int *val, bool do_err) {
  (void)val;
  inline_nops n;
  if(do_err)
    return 1;
  return 0;
}
int caller[N](int *val, bool do_err) {
  /* 31 - K[N] NOPs */
  Dtor[N] d;
  /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */
  int e = NEXT_FUNC(val, do_err);
  if(e) return e;
  *val = 0;
  return 0;
}

Write through argument: Median cycles per iteration. A histogram can be found here

4.3. "Return" a value

In this case, we’ll use the output channel available to us to "return" an incremented value. When handling errors in an error code fashion, we end up using an output parameter.

int callee(int *val, bool do_err) {
  inline_nops n;
  if(do_err)
    return 1;
  *val = 0;
  return 0;
}
int caller[N](int *val, bool do_err) {
  /* 31 - K[N] NOPs */
  Dtor[N] d;
  int out_val = 0;
  /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */
  int e = NEXT_FUNC(&out_val, do_err);
  if(e) return e;
  *val = out_val + 1;
  return 0;
}

Return a value: Median cycles per iteration. A histogram can be found here

5. Sad path measurements

Sad path timings for abort aren’t shown, as the recovery path is very system specific. Throwing an exception is more than 100x slower than returning an error code.

5.1. Write to a global

Write to a global, sad path: Median cycles per iteration. A histogram can be found here

5.2. Write through an argument

Write through an argument, sad path: Median cycles per iteration. A histogram can be found here

5.3. "Return" a value

Return a value, sad path: Median cycles per iteration. A histogram can be found here

6. Off-line analysis

The following analysis is done by examining assembly, but not running it. This analysis is valuable, but is no substitute for data on real machines.

6.1. LLVM-MCA happy path off-line analysis

LLVM-MCA [MCA] is a tool that will analyze assembly and provide statistics on a hypothetical execution on a specific simulated CPU. This can be used to provide a hardware independent analysis of generated code. There is at least one substantial limitation though, in that calls and returns are typically not modeled. Regardless, the analysis is still useful.

I focused on analyzing the difference between exception builds and abort / no exception builds, all in an error neutral function. In the off-line analysis, I’m only focused on 64-bit x86-64 (Clang, GCC, and MSVC) and 64-bit ARM (Clang and GCC). Exceptions on 32-bit x86 on Windows is so much slower in the happy path that it isn’t interesting to dive in at the assembly level.

In general, there are some minor missed optimization opportunities in some of the smallest error handling examples.

I will be looking at this code snippet:

struct Dtor {~Dtor();};
void callee();
int global;
void caller() {
  Dtor d;
  callee();
  global = 0;
}

Compiler Exceptions No exceptions

MSVC x64

Compiler	Exceptions	No exceptions
MSVC x64	`sub rsp, 40 call void callee(void) ; Note the ordering difference mov DWORD PTR int global, 0 lea rcx, QWORD PTR d$[rsp] call Dtor::~Dtor(void) add rsp, 40 ret 0`	`sub rsp, 40 call void callee(void) ; Note the ordering difference lea rcx, QWORD PTR d$[rsp] mov DWORD PTR int global, 0 call Dtor::~Dtor(void) add rsp, 40 ret 0`

sub rsp, 40
call void callee(void)
; Note the ordering difference
mov DWORD PTR int global, 0
lea rcx, QWORD PTR d$[rsp]
call Dtor::~Dtor(void)
add rsp, 40
ret 0

sub rsp, 40
call void callee(void)
; Note the ordering difference
lea rcx, QWORD PTR d$[rsp]
mov DWORD PTR int global, 0
call Dtor::~Dtor(void)
add rsp, 40
ret 0

With MSVC x64, we get almost identical codegen between code with exceptions and without. On MSVC x64, there are some optimization passes that don’t run when exceptions are turned on, and one of those passes is responsible for switching the order of the lea and mov instructions. I have observed a 1-3 cycle difference in benchmark results just from switching the order of those two instructions. In the data I am providing, this results in the exception based code being slightly faster than the abort based code.

Compiler Exceptions No exceptions

Clang 9.0 x64

Compiler	Exceptions	No exceptions
Clang 9.0 x64	1397 cycles over 100 iterations. Counts on Intel Ivy Bridge microarchitecture `pushq %rbx subq $16, %rsp callq callee() movl $0, global(%rip) leaq 8(%rsp), %rdi callq Dtor::~Dtor() addq $16, %rsp popq %rbx retq`	1199 cycles over 100 iterations. Counts on Intel Ivy Bridge microarchitecture `pushq %rax ; no subq here callq callee() movl $0, global(%rip) movq %rsp, %rdi ; movq instead of leaq callq Dtor::~Dtor() ; no addq here popq %rax retq`
GCC 8.2 x64	1397 cycles over 100 iterations. Counts on Intel Ivy Bridge microarchitecture `pushq %rbx subq $16, %rsp call callee() ; ordering difference movl $0, global(%rip) leaq 15(%rsp), %rdi call Dtor::~Dtor() addq $16, %rsp popq %rbx ret`	752 cycles over 100 iterations. Counts on Intel Ivy Bridge microarchitecture `; no pushq subq $24, %rsp call callee() ; ordering difference leaq 15(%rsp), %rdi movl $0, global(%rip) call Dtor::~Dtor() addq $24, %rsp ; no popq ret`

1397 cycles over 100 iterations.
Counts on Intel Ivy Bridge microarchitecture

pushq %rbx
subq $16, %rsp
callq callee()
movl $0, global(%rip)
leaq 8(%rsp), %rdi
callq Dtor::~Dtor()
addq $16, %rsp
popq %rbx
retq

1199 cycles over 100 iterations.
Counts on Intel Ivy Bridge microarchitecture

pushq %rax
; no subq here
callq callee()
movl $0, global(%rip)
movq %rsp, %rdi ; movq instead of leaq
callq Dtor::~Dtor()
; no addq here
popq %rax
retq

GCC 8.2 x64

1397 cycles over 100 iterations.
Counts on Intel Ivy Bridge microarchitecture

pushq %rbx
subq $16, %rsp
call callee()
; ordering difference
movl $0, global(%rip)
leaq 15(%rsp), %rdi
call Dtor::~Dtor()
addq $16, %rsp
popq %rbx
ret

752 cycles over 100 iterations.
Counts on Intel Ivy Bridge microarchitecture

; no pushq
subq $24, %rsp
call callee()
; ordering difference
leaq 15(%rsp), %rdi
movl $0, global(%rip)
call Dtor::~Dtor()
addq $24, %rsp
; no popq
ret

x64 GCC and Clang both generate slower code for this use case when exceptions are enabled. This is a missed optimization, but isn’t inherent to the design of exceptions.

In my timings, GCC’s abort code was 2 cycles faster than GCC’s exception code. Clang’s abort code was 5 cycles faster than Clang’s exception code.

Compiler Exceptions No exceptions

Clang 9.0 ARM64

Compiler	Exceptions	No exceptions
Clang 9.0 ARM64	684 cycles over 100 iterations. Counts on exynos-m5 architecture `str x19, [sp, #-32]! stp x29, x30, [sp, #16] add x29, sp, #16 bl callee() adrp x8, global add x0, sp, #8 str wzr, [x8, :lo12:global] bl Dtor::~Dtor() ldp x29, x30, [sp, #16] ldr x19, [sp], #32 ret`	665 cycles over 100 iterations. Counts on exynos-m5 architecture `sub sp, sp, #32 // sub instead of str stp x29, x30, [sp, #16] add x29, sp, #16 bl callee() adrp x8, global add x0, sp, #8 str wzr, [x8, :lo12:global] bl Dtor::~Dtor() ldp x29, x30, [sp, #16] add sp, sp, #32 // add instead of ldr ret`
GCC 8.2 ARM64	639 cycles over 100 iterations. Counts on exynos-m5 architecture `stp x29, x30, [sp, -48]! mov x29, sp bl callee() adrp x1, .LANCHOR0 add x0, sp, 40 str wzr, [x1, #:lo12:.LANCHOR0] bl Dtor::~Dtor() ldp x29, x30, [sp], 48 ret`	639 cycles over 100 iterations. Counts on exynos-m5 architecture `stp x29, x30, [sp, -32]! // different constant mov x29, sp bl callee() adrp x1, .LANCHOR0 add x0, sp, 24 // different constant str wzr, [x1, #:lo12:.LANCHOR0] bl Dtor::~Dtor() ldp x29, x30, [sp], 32 // different constant ret`

684 cycles over 100 iterations.
Counts on exynos-m5 architecture

str x19, [sp, #-32]!
stp x29, x30, [sp, #16]
add x29, sp, #16
bl callee()
adrp x8, global
add x0, sp, #8
str wzr, [x8, :lo12:global]
bl Dtor::~Dtor()
ldp x29, x30, [sp, #16]
ldr x19, [sp], #32
ret

665 cycles over 100 iterations.
Counts on exynos-m5 architecture

sub sp, sp, #32 // sub instead of str
stp x29, x30, [sp, #16]
add x29, sp, #16
bl callee()
adrp x8, global
add x0, sp, #8
str wzr, [x8, :lo12:global]
bl Dtor::~Dtor()
ldp x29, x30, [sp, #16]
add sp, sp, #32 // add instead of ldr
ret

GCC 8.2 ARM64

639 cycles over 100 iterations.
Counts on exynos-m5 architecture

stp x29, x30, [sp, -48]!
mov x29, sp
bl callee()
adrp x1, .LANCHOR0
add x0, sp, 40
str wzr, [x1, #:lo12:.LANCHOR0]
bl Dtor::~Dtor()
ldp x29, x30, [sp], 48
ret

639 cycles over 100 iterations.
Counts on exynos-m5 architecture

stp x29, x30, [sp, -32]! // different constant
mov x29, sp
bl callee()
adrp x1, .LANCHOR0
add x0, sp, 24 // different constant
str wzr, [x1, #:lo12:.LANCHOR0]
bl Dtor::~Dtor()
ldp x29, x30, [sp], 32 // different constant
ret

ARM64 GCC and Clang both generate different code for this use case when exceptions are enabled. This is likely another missed optimization. With GCC, the only penalty that is paid is slightly more stack usage.

6.2. Sad path affecting the happy path and macro-benchmark theorizing

GCC and Clang x64 place an exception "landing pad" at the end of each function that needs to run user code. I’ll be looking at the assembly, including the landing pad, of the following code:

struct Dtor {~Dtor();};
void callee();
int global;
void caller() {
  Dtor d;
  callee();
  global = 0;
}

Clang 9.0 GCC 7.4 GCC 9.2

Clang 9.0	GCC 7.4	GCC 9.2
`push rbx sub rsp, 16 call callee() mov dword ptr [rip + global], 0 lea rdi, [rsp + 8] call Dtor::~Dtor() add rsp, 16 pop rbx ret ; Begin landing pad mov rbx, rax lea rdi, [rsp + 8] call Dtor::~Dtor() mov rdi, rbx call _Unwind_Resume`	`push rbx sub rsp, 16 call callee() lea rdi, [rsp+15] mov DWORD PTR global[rip], 0 call Dtor::~Dtor() add rsp, 16 pop rbx ret ; Begin landing pad lea rdi, [rsp+15] mov rbx, rax call Dtor::~Dtor() mov rdi, rbx call _Unwind_Resume`	`push rbp sub rsp, 16 call callee() mov DWORD PTR global[rip], 0 lea rdi, [rsp+15] call Dtor::~Dtor() add rsp, 16 pop rbp ret ; Begin landing pad mov rbp, rax jmp .L2 caller() [clone .cold]: .L2: lea rdi, [rsp+15] call Dtor::~Dtor() mov rdi, rbp call _Unwind_Resume`

push    rbx
sub     rsp, 16
call    callee()
mov     dword ptr [rip + global], 0
lea     rdi, [rsp + 8]
call    Dtor::~Dtor()
add     rsp, 16
pop     rbx
ret
; Begin landing pad
mov     rbx, rax
lea     rdi, [rsp + 8]
call    Dtor::~Dtor()
mov     rdi, rbx
call    _Unwind_Resume

push    rbx
sub     rsp, 16
call    callee()
lea     rdi, [rsp+15]
mov     DWORD PTR global[rip], 0
call    Dtor::~Dtor()
add     rsp, 16
pop     rbx
ret
; Begin landing pad
lea     rdi, [rsp+15]
mov     rbx, rax
call    Dtor::~Dtor()
mov     rdi, rbx
call    _Unwind_Resume

push    rbp
sub     rsp, 16
call    callee()
mov     DWORD PTR global[rip], 0
lea     rdi, [rsp+15]
call    Dtor::~Dtor()
add     rsp, 16
pop     rbp
ret
; Begin landing pad
mov     rbp, rax
jmp     .L2
caller() [clone .cold]:
.L2:
lea     rdi, [rsp+15]
call    Dtor::~Dtor()
mov     rdi, rbp
call    _Unwind_Resume

In all three cases, some cold landing pad code is directly adjacent to hot happy path code. This doesn’t cause any problems in microbenchmarks, but in theory, it could reduce instruction cache utilization in larger benchmarks. Starting in GCC 8.1, GCC will only put a tiny trampoline in the landing pad. The bulk of the landing pad code is moved into a different, cold location. This should, in theory, improve instruction cache utilization in the happy path compared to Clang and earlier GCCs, though it will still be at a minor disadvantage compared to terminate based implementations. I have no measurements to confirm or reject this theory though. [Spencer18] measured a small, non-zero performance degradation ( <1% ) in the happy path in a AAA game when turning on exceptions.

The MSVC x64 /d2FH4 implementation does not place any exception handling code near the happy path. The MSVC x86 sad path exception handling code is also far away from the happy path, but the happy path does exception bookkeeping.

7. Conclusion

Exceptions are reasonable error handling strategies in many environments, but they don’t serve all needs in all use cases. C++ needs good performance, standards conforming ways to signal failures in constructors and operator overloads. Currently, exceptions do well on the happy path, but there is no good answer for the sad path. C++ is built on the foundation that you don’t pay for what you don’t use, and that you can’t write the language abstractions better by hand. This paper provides evidence that you can write error handling code by hand that results in faster code on the sad path than the equivalent exception throwing code. In each of the sad path test cases, return values beat exceptions by more than a factor of 100.

8. Acknowledgments

Charts generated with [ECharts].

Appendix A: Additional error handling types

While researching this paper, I gathered data on some additional error handling mechanisms. I will provide a brief description of the error handling mechanism, and I will provide the data. Only a minimal amount of code and explanation will be provided though. This data could be useful in future error handling discussions.

In each of these cases, the struct involved is of the following form:

struct error_struct {
  void *error = nullptr;
  void *domain = nullptr;
};

A null error pointer indicates no error.

ref_struct: Pass an error_struct by reference, and check error after each call.
return_struct: Like return_val, but with an error_struct instead.
return_nt_struct: Like return_struct, but error_struct has a destructor of the form ~error_struct() {}. "nt" means "non-trivial".
expected_struct: This uses Sy Brand’s [Expected] library with error_struct as the error type. This should closely resemble the proposed std::expected.
outcome_std_error: This uses a branch of Niall Douglas’s [Outcome] library. This branch is specially optimized for GCC.

In the sad path graphs, exceptions are omitted so that the remaining values can be easily compared.

Write to a global (happy path)

Write to global: Median cycles per iteration. A histogram can be found here

Write through an argument (happy path)

Write to global: Median cycles per iteration. A histogram can be found here

"Return" a value (happy path)

Write to global: Median cycles per iteration. A histogram can be found here

Write to a global (sad path)

Write to global: Median cycles per iteration. A histogram can be found here

Write through an argument (sad path)

Write to global: Median cycles per iteration. A histogram can be found here

"Return" a value (sad path)

Write to global: Median cycles per iteration. A histogram can be found here

Appendix B: The build flags

MSVC

The compiler and flags are the same for 32-bit and 64-bit builds, except that the 32-bit linker uses /machine:x86 and the 64-bit linker uses /machine:x64

/d2FH4 is a critical flag for these benchmarks. Without it, the results are drastically different. See [MoFH4] for a description of this flag.

Compiler marketing version: Visual Studio 2019

Compiler toolkit version: 14.22.27905

cl.exe version: 19.22.27905

Compiler codegen flags (no exceptions): /GR /Gy /Gw /O2 /MT /d2FH4 /std:c++latest /permissive- /DNDEBUG

Compiler codegen flags (with exceptions): /EHs /GR /Gy /Gw /O2 /MT /d2FH4 /std:c++latest /permissive- /DNDEBUG

Compiler codegen flags (outcome): /GR- /Gy /Gw /O2 /MT /d2FH4 /std:c++latest /permissive- /DNDEBUG

Linker flags: /OPT:REF /release /subsystem:CONSOLE /incremental:no /OPT:ICF /NXCOMPAT /DYNAMICBASE *.obj

Clang x64

Toolchains used:

Clang 8.0.0 and libc++
System linker from Ubuntu 14.04.3’s GCC 4.8.4 installation

Compiler codegen flags (no exceptions): -fno-exceptions -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -stdlib=libc++ -static -DNDEBUG

Compiler codegen flags (exceptions): -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -stdlib=libc++ -static -DNDEBUG

Compiler codegen flags (outcome): -fno-exceptions -fno-rtti -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -stdlib=libc++ -static -DNDEBUG

Linking flags: -Wl,--gc-sections -pthread -static -static-libgcc -stdlib=libc++ *.o libc++abi.a

GCC x64

Toolchain used: GCC 7.3.1 from the Red Hat Developer Toolset 7.1

Compiler codegen flags (no exceptions): -fno-exceptions -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -static -DNDEBUG

Compiler codegen flags (exceptions): -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -static -DNDEBUG

Compiler codegen flags (outcome): -fno-exceptions -fno-rtti -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -static -DNDEBUG

Linking flags: -Wl,--gc-sections -pthread -static -static-libgcc -static-libstdc++ *.o

Appendix C: The code

As stated before, this isn’t the exact code that was benchmarked. In the benchmarked code, functions were placed in distinct translation units in order to avoid inlining. The following code is provided to demonstrate what the error handling code looks like.

Common support code

Expand to see code snippets

All cases

struct Dtor[N] {
  extern ~Dtor[N]() {
    /* Random number of NOPs K[N] := [0,31] */
  }
};
struct inline_nops {
  inline inline_nops() {
    /* Random number of NOPs Z := [0,31] */
  }
  inline ~inline_nops() {
    /* 31 - Z NOPs */
  }
};

abort, return_val

int main() {
  /* Test happy path */
  startTimer();
  /* X NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    (void)caller[0](false);
  }
  /* 31 - X NOPs */
  endTimer();
  print(duration() / ITERATIONS);

  startTimer();
  /* Y NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    (void)caller[0](true);
  }
  /* 31 - Y NOPs */
  endTimer();
  print(duration() / ITERATIONS);
}

throw_exception

class err_exception : public std::exception {
public:
  int val;
  explicit err_exception(int e) : val(e) {}
  const char *what() const noexcept override { return ""; }
};
int main() {
  /* Test happy path */
  startTimer();
  /* X NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    try { (void)caller[0](false); } catch(...){}
  }
  /* 31 - X NOPs */
  endTimer();
  print(duration() / ITERATIONS);

  /* Test sad path */
  startTimer();
  /* Y NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    try { (void)caller[0](true); } catch(...){}
  }
  /* 31 - Y NOPs */
  endTimer();
  print(duration() / ITERATIONS);
}

Write to a global

Expand to see code snippets

All cases

int global_int = 0;

return_val

int callee(bool do_err) {
  inline_nops n;
  if(do_err)
    return 1;
  return 0;
}
int caller[N](bool do_err) {
  /* 31 - K[N] NOPs */
  Dtor[N] d;
  /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */
  int e = NEXT_FUNC(do_err);
  if(e) return e;
  global_int = 0;
  return 0;
}

abort, throw_exception

void caller[N](bool do_err) {
  /* 31 - K[N] NOPs */
  Dtor[N] d;
  /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */
  NEXT_FUNC(do_err);
  global_int = 0;
}

abort

void callee(bool do_err) {
  inline_nops n;
  if(do_err)
    abort();
}

throw_exception

void callee(bool do_err) {
  inline_nops n;
  if(do_err)
    throw err_exception(1);
}

Write through an argument

Expand to see code snippets

abort, return_val

int main() {
  int val=0;
  /* Test happy path */
  startTimer();
  /* X NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    (void)caller[0](&val, false);
  }
  /* 31 - X NOPs */
  endTimer();
  print(duration() / ITERATIONS);

  /* Test sad path */
  startTimer();
  /* Y NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    (void)caller[0](&val, true);
  }
  /* 31 - Y NOPs */
  endTimer();
  print(duration() / ITERATIONS);
}

return_val

int callee(int *, bool do_err) {
  inline_nops n;
  if(do_err)
    return 1;
  return 0;
}
int caller[N](int *val, bool do_err) {
  /* 31 - K[N] NOPs */
  Dtor[N] d;
  /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */
  int e = NEXT_FUNC(val, do_err);
  if(e) return e;
  *val = 0;
  return 0;
}

abort, throw_exception

void caller[N](int *val, bool do_err) {
  /* 31 - K[N] NOPs */
  Dtor[N] d;
  /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */
  NEXT_FUNC(val, do_err);
  *val = 0;
}

abort

void callee(int *, bool do_err) {
  inline_nops n;
  if(do_err)
    abort();
}

throw_exception

void callee(int *, bool do_err) {
  inline_nops n;
  if(do_err)
    throw err_exception(1);
}
int main() {
  int val=0;
  /* Test happy path */
  startTimer();
  /* X NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    try { (void)caller[0](&val, false); } catch(...) {}
  }
  /* 31 - X NOPs */
  endTimer();
  print(duration() / ITERATIONS);

  /* Test sad path */
  startTimer();
  /* Y NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    try { (void)caller[0](&val, true); } catch(...) {}
  }
  /* 31 - Y NOPs */
  endTimer();
  print(duration() / ITERATIONS);
}

"Return" a value

Expand to see code snippets

return_val

int callee(int *val, bool do_err) {
  inline_nops n;
  if(do_err)
    return 1;
  *val = 0;
  return 0;
}
int caller[N](int *val, bool do_err) {
  /* 31 - K[N] NOPs */
  Dtor[N] d;
  int out_val = 0;
  /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */
  int e = NEXT_FUNC(&out_val, do_err);
  if(e) return e;
  *val = out_val + 1;
  return 0;
}
int main() {
  /* Test happy path */
  startTimer();
  /* X NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    int val=0;
    (void)caller[0](&val, false);
  }
  /* 31 - X NOPs */
  endTimer();
  print(duration() / ITERATIONS);

  /* Test sad path */
  startTimer();
  /* Y NOPs */
  for (int i = 0; i < ITERATIONS; ++i) {
    int val=0;
    (void)caller[0](&val, true);
  }
  /* 31 - Y NOPs */
  endTimer();
  print(duration() / ITERATIONS);
}

abort, throw_exception

int caller[N](bool do_err) {
  /* 31 - K[N] NOPs */
  Dtor[N] d;
  /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */
  return 1 + NEXT_FUNC(do_err);
}

abort

int callee(bool do_err) {
  inline_nops n;
  if(do_err)
    abort();
  return 0;
}

throw_exception

int callee(bool do_err) {
  inline_nops n;
  if(do_err)
    throw err_exception(1);
  return 0;
}

Appendix D: Histograms

Happy path: Write to a global

Write to global: Cycles per iteration histogram

Happy path: Write through an argument

Write through an argument: Cycles per iteration histogram

Happy path: "Return" a value

Return a value: Cycles per iteration histogram

Exception sad path: Write to a global

Write to a global: Cycles per iteration histogram. throw_exception and return_val only.

Exception sad path: Write through an argument

Write through an argument: Cycles per iteration histogram. throw_exception and return_val only.

Exception sad path: "Return" a value

Return a value: Cycles per iteration histogram. throw_exception and return_val only.

Other sad path: Write to a global

Write to a global: Cycles per iteration histogram. throw_exception omitted.

Other sad path: Write through an argument

Write through an argument: Cycles per iteration histogram. throw_exception omitted.

Other sad path: "Return" a value

Return a value: Cycles per iteration histogram. throw_exception omitted.

D1886R0.0Error speed benchmarking

Draft Proposal, 2019-09-28

Abstract

1. Introduction

2. Measuring methodology

3. Starter test case for on-line analysis

4. Happy path measurements

4.1. Write to a global

4.2. Write through an argument

4.3. "Return" a value

5. Sad path measurements

5.1. Write to a global

5.2. Write through an argument

5.3. "Return" a value

6. Off-line analysis

6.1. LLVM-MCA happy path off-line analysis

6.2. Sad path affecting the happy path and macro-benchmark theorizing

7. Conclusion

8. Acknowledgments

Appendix A: Additional error handling types

Write to a global (happy path)

Write through an argument (happy path)

"Return" a value (happy path)

Write to a global (sad path)

Write through an argument (sad path)

"Return" a value (sad path)

Appendix B: The build flags

MSVC

Clang x64

GCC x64

Appendix C: The code

Common support code

Write to a global

Write through an argument

"Return" a value

Appendix D: Histograms

Happy path: Write to a global

Happy path: Write through an argument

Happy path: "Return" a value

Exception sad path: Write to a global

Exception sad path: Write through an argument

Exception sad path: "Return" a value

Other sad path: Write to a global

Other sad path: Write through an argument

Other sad path: "Return" a value

References

Informative References

D1886R0.0
Error speed benchmarking