1. Introduction
In this paper, we will look at the relative speeds of error handling strategies in micro-benchmarking scenarios. "Happy" and "sad" path timings will be presented.2. Measuring methodology
All benchmarks lie. It’s important to know how a benchmark is set up so that the useful parts can be distinguished from the misleading parts.Gathering representative timings for very small pieces of code is difficult on modern hardware. One type of micro-architectural quirk that makes these measurements very painful is code alignment [Bakhvalov]. Each call and each branch can introduce stalls if jumping to a poorly aligned location. These stalls can dominate the timings. The author has seen code alignment differences swing a benchmark running at 451 cycles per iteration to 310 cycles per iteration, for a 45% increase in execution time. This involved no system calls, no atomic operations, no divisions, or any other highly expensive operations being performed.
This research used a novel technique to get representative timings across a wide range of jump alignments. At the beginning of each function, a random number of
instructions (0 to 31 inclusive) were inserted. A matching set of
instructions were inserted at a later point in the execution path so that a total of 31 user inserted
instructions were always executed. 1024 instances of this program, all with different layouts of nops, were then built and run.
Tests were built to optimize for speed. The test case optimizes away in the face of link-time optimizations (LTO) / whole program optimizations, so they were not used. Some compilers only support profile-guided optimizations (PGO) after turning on link-time optimizations, so PGO wasn’t used either. Only the exception cases were built with exceptions turned on.
For all but the exception failure cases, 16K warm-up iterations were run before starting timing. Within the timer bounds, 512K iterations of the test case were run. The exception failure cases only had 256 warm-up iterations and 16K timed iterations, due to just how slow throwing an exception can be.
All tests were run on the same machine. The processor was an Intel Core i7-3770 running at 3.3915 GHz. The Core i7-3770 has an Ivy Bridge microarchitecture, and was released in April 2012. C-states, Hyper-Threading, and Turbo Boost were all disabled in order to improve benchmark reproducibility. MSVC benchmarks were run on the 64-bit version of Windows 10. Clang and GCC benchmarks were run on the Slax Linux distribution. (TODO: verify)
3. Starter test case for on-line analysis
To start with, we will look at code similar to the following:This code has some important properties.struct Dtor [ N ] { extern ~ Dtor [ N ]() { /* Random number of NOPs K[N] := [0,31] */ } }; struct inline_nops { inline inline_nops () { /* Random number of NOPs Z := [0,31] */ } inline ~ inline_nops () { /* 31 - Z NOPs */ } }; int global_int = 0 ; int callee ( bool do_err ) { inline_nops n ; if ( do_err ) return 1 ; return 0 ; } int caller [ N ]( bool do_err ) { /* 31 - K[N] NOPs */ Dtor [ N ] d ; /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */ int e = NEXT_FUNC ( do_err ); if ( e ) return e ; global_int = 0 ; return 0 ; } int main () { /* Test happy path */ startTimer (); /* X NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { ( void ) caller [ 0 ]( false); } /* 31 - X NOPs */ endTimer (); ( duration () / ITERATIONS ); /* Test sad path */ startTimer (); /* Y NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { ( void ) caller [ 0 ]( true); } /* 31 - Y NOPs */ endTimer (); ( duration () / ITERATIONS ); }
-
will raise an error based off of runtime statecallee () -
needs to clean up thecaller [ N ]()
object in error and non-error conditions.d -
should only setcaller [ N ]()
in success cases.global_int -
The inserted NOPs allow us to sample various alignments while keeping the same program structure.
In the actual tests all the function bodies are in separate .cpp files, and link-time / whole-program optimizations aren’t used. If they had been used, the entire program would get optimized away, removing our ability to measure error handling differences.
The initial part of this paper is only looking at three types of error handling.
-
abort: When an error is encountered, kill the program with
. No "sad" path benchmarks were taken for this case.std :: abort -
return_val: Return an integer error code, where zero represents success.
-
throw_exception: Throw an exception deriving from std::exception, that contains only an int. This should represent typical exception usage cases. The exception is only caught with a
.catch (...)
The actual code used for the benchmark can be found on my github.
4. Happy path measurements
Abort and x64 exceptions are consistently faster than return codes. Windows x86 exceptions are slower than both return codes and abort semantics. While return codes are slower than abort and x64 exceptions, they aren’t much slower. The cost ranges from less than a cycle per function to four cycles per function in my experiments.
4.1. Write to a global
Write to global: Median cycles per iteration. A histogram can be found here
4.2. Write through an argument
Instead of writing to a global, we write through a pointer argument.int callee ( int * val , bool do_err ) { ( void ) val ; inline_nops n ; if ( do_err ) return 1 ; return 0 ; } int caller [ N ]( int * val , bool do_err ) { /* 31 - K[N] NOPs */ Dtor [ N ] d ; /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */ int e = NEXT_FUNC ( val , do_err ); if ( e ) return e ; * val = 0 ; return 0 ; }
Write through argument: Median cycles per iteration. A histogram can be found here
4.3. "Return" a value
In this case, we’ll use the output channel available to us to "return" an incremented value. When handling errors in an error code fashion, we end up using an output parameter.int callee ( int * val , bool do_err ) { inline_nops n ; if ( do_err ) return 1 ; * val = 0 ; return 0 ; } int caller [ N ]( int * val , bool do_err ) { /* 31 - K[N] NOPs */ Dtor [ N ] d ; int out_val = 0 ; /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */ int e = NEXT_FUNC ( & out_val , do_err ); if ( e ) return e ; * val = out_val + 1 ; return 0 ; }
Return a value: Median cycles per iteration. A histogram can be found here
5. Sad path measurements
Sad path timings for abort aren’t shown, as the recovery path is very system specific. Throwing an exception is more than 100x slower than returning an error code.
5.1. Write to a global
Write to a global, sad path: Median cycles per iteration. A histogram can be found here
5.2. Write through an argument
Write through an argument, sad path: Median cycles per iteration. A histogram can be found here
5.3. "Return" a value
Return a value, sad path: Median cycles per iteration. A histogram can be found here
6. Off-line analysis
The following analysis is done by examining assembly, but not running it. This analysis is valuable, but is no substitute for data on real machines.6.1. LLVM-MCA happy path off-line analysis
LLVM-MCA [MCA] is a tool that will analyze assembly and provide statistics on a hypothetical execution on a specific simulated CPU. This can be used to provide a hardware independent analysis of generated code. There is at least one substantial limitation though, in that calls and returns are typically not modeled. Regardless, the analysis is still useful.
I focused on analyzing the difference between exception builds and abort / no exception builds, all in an error neutral function. In the off-line analysis, I’m only focused on 64-bit x86-64 (Clang, GCC, and MSVC) and 64-bit ARM (Clang and GCC). Exceptions on 32-bit x86 on Windows is so much slower in the happy path that it isn’t interesting to dive in at the assembly level.
In general, there are some minor missed optimization opportunities in some of the smallest error handling examples.
I will be looking at this code snippet:
struct Dtor { ~ Dtor ();}; void callee (); int global ; void caller () { Dtor d ; callee (); global = 0 ; }
Compiler | Exceptions | No exceptions |
---|---|---|
MSVC x64 |
|
|
lea
and mov
instructions. I have observed a 1-3 cycle difference in benchmark results just from switching the order of those two instructions. In the data I am providing, this results in the exception based code being slightly faster than the abort based code.
Compiler | Exceptions | No exceptions |
---|---|---|
Clang 9.0 x64 | 1397 cycles over 100 iterations. Counts on Intel Ivy Bridge microarchitecture
| 1199 cycles over 100 iterations. Counts on Intel Ivy Bridge microarchitecture
|
GCC 8.2 x64 | 1397 cycles over 100 iterations. Counts on Intel Ivy Bridge microarchitecture
| 752 cycles over 100 iterations. Counts on Intel Ivy Bridge microarchitecture
|
In my timings, GCC’s abort code was 2 cycles faster than GCC’s exception code. Clang’s abort code was 5 cycles faster than Clang’s exception code.
Compiler | Exceptions | No exceptions |
---|---|---|
Clang 9.0 ARM64 | 684 cycles over 100 iterations. Counts on exynos-m5 architecture
| 665 cycles over 100 iterations. Counts on exynos-m5 architecture
|
GCC 8.2 ARM64 | 639 cycles over 100 iterations. Counts on exynos-m5 architecture
| 639 cycles over 100 iterations. Counts on exynos-m5 architecture
|
6.2. Sad path affecting the happy path and macro-benchmark theorizing
GCC and Clang x64 place an exception "landing pad" at the end of each function that needs to run user code. I’ll be looking at the assembly, including the landing pad, of the following code:struct Dtor { ~ Dtor ();}; void callee (); int global ; void caller () { Dtor d ; callee (); global = 0 ; }
Clang 9.0 | GCC 7.4 | GCC 9.2 |
---|---|---|
|
|
|
In all three cases, some cold landing pad code is directly adjacent to hot happy path code. This doesn’t cause any problems in microbenchmarks, but in theory, it could reduce instruction cache utilization in larger benchmarks. Starting in GCC 8.1, GCC will only put a tiny trampoline in the landing pad. The bulk of the landing pad code is moved into a different, cold location. This should, in theory, improve instruction cache utilization in the happy path compared to Clang and earlier GCCs, though it will still be at a minor disadvantage compared to terminate based implementations. I have no measurements to confirm or reject this theory though. [Spencer18] measured a small, non-zero performance degradation ( <1% ) in the happy path in a AAA game when turning on exceptions.
The MSVC x64 /d2FH4 implementation does not place any exception handling code near the happy path. The MSVC x86 sad path exception handling code is also far away from the happy path, but the happy path does exception bookkeeping.
7. Conclusion
Exceptions are reasonable error handling strategies in many environments, but they don’t serve all needs in all use cases. C++ needs good performance, standards conforming ways to signal failures in constructors and operator overloads. Currently, exceptions do well on the happy path, but there is no good answer for the sad path. C++ is built on the foundation that you don’t pay for what you don’t use, and that you can’t write the language abstractions better by hand. This paper provides evidence that you can write error handling code by hand that results in faster code on the sad path than the equivalent exception throwing code. In each of the sad path test cases, return values beat exceptions by more than a factor of 100.8. Acknowledgments
Charts generated with [ECharts].
Appendix A: Additional error handling types
While researching this paper, I gathered data on some additional error handling mechanisms. I will provide a brief description of the error handling mechanism, and I will provide the data. Only a minimal amount of code and explanation will be provided though. This data could be useful in future error handling discussions.In each of these cases, the struct involved is of the following form:
A null error pointer indicates no error.struct error_struct { void * error = nullptr ; void * domain = nullptr ; };
-
ref_struct: Pass an
by reference, and checkerror_struct
after each call.error -
return_struct: Like return_val, but with an
instead.error_struct -
return_nt_struct: Like return_struct, but
has a destructor of the formerror_struct
. "nt" means "non-trivial".~ error_struct () {} -
expected_struct: This uses Sy Brand’s [Expected] library with
as the error type. This should closely resemble the proposederror_struct
.std :: expected -
outcome_std_error: This uses a branch of Niall Douglas’s [Outcome] library. This branch is specially optimized for GCC.
In the sad path graphs, exceptions are omitted so that the remaining values can be easily compared.
Write to a global (happy path)
Write to global: Median cycles per iteration. A histogram can be found here
Write through an argument (happy path)
Write to global: Median cycles per iteration. A histogram can be found here
"Return" a value (happy path)
Write to global: Median cycles per iteration. A histogram can be found here
Write to a global (sad path)
Write to global: Median cycles per iteration. A histogram can be found here
Write through an argument (sad path)
Write to global: Median cycles per iteration. A histogram can be found here
"Return" a value (sad path)
Write to global: Median cycles per iteration. A histogram can be found here
Appendix B: The build flags
MSVC
The compiler and flags are the same for 32-bit and 64-bit builds, except that the 32-bit linker uses /machine:x86 and the 64-bit linker uses /machine:x64/d2FH4 is a critical flag for these benchmarks. Without it, the results are drastically different. See [MoFH4] for a description of this flag.
Compiler marketing version: Visual Studio 2019
Compiler toolkit version: 14.22.27905
cl.exe version: 19.22.27905
Compiler codegen flags (no exceptions): /GR /Gy /Gw /O2 /MT /d2FH4 /std:c++latest /permissive- /DNDEBUG
Compiler codegen flags (with exceptions): /EHs /GR /Gy /Gw /O2 /MT /d2FH4 /std:c++latest /permissive- /DNDEBUG
Compiler codegen flags (outcome): /GR- /Gy /Gw /O2 /MT /d2FH4 /std:c++latest /permissive- /DNDEBUG
Linker flags: /OPT:REF /release /subsystem:CONSOLE /incremental:no /OPT:ICF /NXCOMPAT /DYNAMICBASE *.obj
Clang x64
Toolchains used:-
Clang 8.0.0 and libc++
-
System linker from Ubuntu 14.04.3’s GCC 4.8.4 installation
Compiler codegen flags (no exceptions): -fno-exceptions -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -stdlib=libc++ -static -DNDEBUG
Compiler codegen flags (exceptions): -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -stdlib=libc++ -static -DNDEBUG
Compiler codegen flags (outcome): -fno-exceptions -fno-rtti -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -stdlib=libc++ -static -DNDEBUG
Linking flags: -Wl,--gc-sections -pthread -static -static-libgcc -stdlib=libc++ *.o libc++abi.a
GCC x64
Toolchain used: GCC 7.3.1 from the Red Hat Developer Toolset 7.1Compiler codegen flags (no exceptions): -fno-exceptions -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -static -DNDEBUG
Compiler codegen flags (exceptions): -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -static -DNDEBUG
Compiler codegen flags (outcome): -fno-exceptions -fno-rtti -O2 -mtune=ivybridge -ffunction-sections -fdata-sections -std=c++17 -static -DNDEBUG
Linking flags: -Wl,--gc-sections -pthread -static -static-libgcc -static-libstdc++ *.o
Appendix C: The code
As stated before, this isn’t the exact code that was benchmarked. In the benchmarked code, functions were placed in distinct translation units in order to avoid inlining. The following code is provided to demonstrate what the error handling code looks like.Common support code
Expand to see code snippets
All casesstruct Dtor [ N ] { extern ~ Dtor [ N ]() { /* Random number of NOPs K[N] := [0,31] */ } }; struct inline_nops { inline inline_nops () { /* Random number of NOPs Z := [0,31] */ } inline ~ inline_nops () { /* 31 - Z NOPs */ } };
abort, return_val
int main () { /* Test happy path */ startTimer (); /* X NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { ( void ) caller [ 0 ]( false); } /* 31 - X NOPs */ endTimer (); ( duration () / ITERATIONS ); startTimer (); /* Y NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { ( void ) caller [ 0 ]( true); } /* 31 - Y NOPs */ endTimer (); ( duration () / ITERATIONS ); }
throw_exception
class err_exception : public std :: exception { public : int val ; explicit err_exception ( int e ) : val ( e ) {} const char * what () const noexcept override { return "" ; } }; int main () { /* Test happy path */ startTimer (); /* X NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { try { ( void ) caller [ 0 ]( false); } catch (...){} } /* 31 - X NOPs */ endTimer (); ( duration () / ITERATIONS ); /* Test sad path */ startTimer (); /* Y NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { try { ( void ) caller [ 0 ]( true); } catch (...){} } /* 31 - Y NOPs */ endTimer (); ( duration () / ITERATIONS ); }
Write to a global
Expand to see code snippets
All casesint global_int = 0 ;
return_val
int callee ( bool do_err ) { inline_nops n ; if ( do_err ) return 1 ; return 0 ; } int caller [ N ]( bool do_err ) { /* 31 - K[N] NOPs */ Dtor [ N ] d ; /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */ int e = NEXT_FUNC ( do_err ); if ( e ) return e ; global_int = 0 ; return 0 ; }
abort, throw_exception
abortvoid caller [ N ]( bool do_err ) { /* 31 - K[N] NOPs */ Dtor [ N ] d ; /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */ NEXT_FUNC ( do_err ); global_int = 0 ; }
throw_exceptionvoid callee ( bool do_err ) { inline_nops n ; if ( do_err ) abort (); }
void callee ( bool do_err ) { inline_nops n ; if ( do_err ) throw err_exception ( 1 ); }
Write through an argument
Expand to see code snippets
abort, return_valint main () { int val = 0 ; /* Test happy path */ startTimer (); /* X NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { ( void ) caller [ 0 ]( & val , false); } /* 31 - X NOPs */ endTimer (); ( duration () / ITERATIONS ); /* Test sad path */ startTimer (); /* Y NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { ( void ) caller [ 0 ]( & val , true); } /* 31 - Y NOPs */ endTimer (); ( duration () / ITERATIONS ); }
return_val
int callee ( int * , bool do_err ) { inline_nops n ; if ( do_err ) return 1 ; return 0 ; } int caller [ N ]( int * val , bool do_err ) { /* 31 - K[N] NOPs */ Dtor [ N ] d ; /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */ int e = NEXT_FUNC ( val , do_err ); if ( e ) return e ; * val = 0 ; return 0 ; }
abort, throw_exception
abortvoid caller [ N ]( int * val , bool do_err ) { /* 31 - K[N] NOPs */ Dtor [ N ] d ; /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */ NEXT_FUNC ( val , do_err ); * val = 0 ; }
throw_exceptionvoid callee ( int * , bool do_err ) { inline_nops n ; if ( do_err ) abort (); }
void callee ( int * , bool do_err ) { inline_nops n ; if ( do_err ) throw err_exception ( 1 ); } int main () { int val = 0 ; /* Test happy path */ startTimer (); /* X NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { try { ( void ) caller [ 0 ]( & val , false); } catch (...) {} } /* 31 - X NOPs */ endTimer (); ( duration () / ITERATIONS ); /* Test sad path */ startTimer (); /* Y NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { try { ( void ) caller [ 0 ]( & val , true); } catch (...) {} } /* 31 - Y NOPs */ endTimer (); ( duration () / ITERATIONS ); }
"Return" a value
Expand to see code snippets
return_valint callee ( int * val , bool do_err ) { inline_nops n ; if ( do_err ) return 1 ; * val = 0 ; return 0 ; } int caller [ N ]( int * val , bool do_err ) { /* 31 - K[N] NOPs */ Dtor [ N ] d ; int out_val = 0 ; /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */ int e = NEXT_FUNC ( & out_val , do_err ); if ( e ) return e ; * val = out_val + 1 ; return 0 ; } int main () { /* Test happy path */ startTimer (); /* X NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { int val = 0 ; ( void ) caller [ 0 ]( & val , false); } /* 31 - X NOPs */ endTimer (); ( duration () / ITERATIONS ); /* Test sad path */ startTimer (); /* Y NOPs */ for ( int i = 0 ; i < ITERATIONS ; ++ i ) { int val = 0 ; ( void ) caller [ 0 ]( & val , true); } /* 31 - Y NOPs */ endTimer (); ( duration () / ITERATIONS ); }
abort, throw_exception
abortint caller [ N ]( bool do_err ) { /* 31 - K[N] NOPs */ Dtor [ N ] d ; /* NEXT_FUNC is callee when N == 7, or caller[N+1] otherwise */ return 1 + NEXT_FUNC ( do_err ); }
throw_exceptionint callee ( bool do_err ) { inline_nops n ; if ( do_err ) abort (); return 0 ; }
int callee ( bool do_err ) { inline_nops n ; if ( do_err ) throw err_exception ( 1 ); return 0 ; }
Appendix D: Histograms
Happy path: Write to a global
Write to global: Cycles per iteration histogram
Happy path: Write through an argument
Write through an argument: Cycles per iteration histogram
Happy path: "Return" a value
Return a value: Cycles per iteration histogram
Exception sad path: Write to a global
Write to a global: Cycles per iteration histogram. throw_exception and return_val only.
Exception sad path: Write through an argument
Write through an argument: Cycles per iteration histogram. throw_exception and return_val only.
Exception sad path: "Return" a value
Return a value: Cycles per iteration histogram. throw_exception and return_val only.
Other sad path: Write to a global
Write to a global: Cycles per iteration histogram. throw_exception omitted.
Other sad path: Write through an argument
Write through an argument: Cycles per iteration histogram. throw_exception omitted.
Other sad path: "Return" a value
Return a value: Cycles per iteration histogram. throw_exception omitted.