Parallel Mode

The libstdc++ parallel mode is an experimental parallel implementation of many algorithms of the C++ Standard Library. Several of the standard algorithms, for instance std::sort, are made parallel using OpenMP annotations. These parallel mode constructs can be invoked by explicit source declaration or by compiling existing sources with a specific compiler flag.

The parallel mode has not been kept up to date with recent C++ standards, and so it only conforms to the C++03 requirements. That means that move-only predicates may not work with parallel mode algorithms, and for C++20 most of the algorithms cannot be used in constexpr functions.

For C++17 and above there are new overloads of the standard algorithms which take an execution policy argument. You should consider using those instead of the non-standard parallel mode extensions.
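As a minimal sketch of the recommended alternative (assuming GCC 9 or later; with libstdc++ the parallel execution policies typically also require linking against Intel TBB, e.g. with -ltbb):

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main()
    {
      std::vector<int> v = {3, 1, 2};
      // Standard C++17 parallel overload, preferred over the
      // non-standard parallel mode extensions for new code.
      std::sort(std::execution::par, v.begin(), v.end());
      return 0;
    }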
Intro

The following library components in the include numeric are included in the parallel mode:

  std::accumulate
  std::adjacent_difference
  std::inner_product
  std::partial_sum

The following library components in the include algorithm are included in the parallel mode:

  std::adjacent_find
  std::count
  std::count_if
  std::equal
  std::find
  std::find_if
  std::find_first_of
  std::for_each
  std::generate
  std::generate_n
  std::lexicographical_compare
  std::mismatch
  std::search
  std::search_n
  std::transform
  std::replace
  std::replace_if
  std::max_element
  std::merge
  std::min_element
  std::nth_element
  std::partial_sort
  std::partition
  std::random_shuffle
  std::set_union
  std::set_intersection
  std::set_symmetric_difference
  std::set_difference
  std::sort
  std::stable_sort
  std::unique_copy
Semantics

The parallel mode STL algorithms are currently not exception-safe, i.e. user-defined functors must not throw exceptions. Also, the order of execution is not guaranteed for some functions. Therefore, user-defined functors should not have any concurrent side effects.

Since the current GCC OpenMP implementation does not support OpenMP parallel regions in concurrent threads, it is not possible to call parallel STL algorithms in concurrent threads, either. It might work with other compilers, though.
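For example, a functor that updates shared state has a concurrent side effect. A minimal sketch of the problem and one mitigation (the counter is purely illustrative; compile with -fopenmp):

    #include <atomic>
    #include <vector>
    #include <parallel/algorithm>

    int main()
    {
      std::vector<int> v(1000, 1);
      // A plain int counter here would be a data race once the functor
      // runs concurrently; std::atomic makes the side effect safe, but
      // a functor with no side effects at all is the safest choice.
      std::atomic<int> count(0);
      __gnu_parallel::for_each(v.begin(), v.end(),
                               [&count](int x) { if (x > 0) ++count; });
      return 0;
    }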
Using
Prerequisite Compiler Flags

Any use of parallel functionality requires additional compiler and runtime support, in particular support for OpenMP. Adding this support is not difficult: just compile your application with the compiler flag -fopenmp. This will link in libgomp, the GNU Offloading and Multi Processing Runtime Library, whose presence is mandatory.

In addition, hardware that supports atomic operations and a compiler capable of producing atomic operations is mandatory: GCC defaults to no support for atomic operations on some common hardware architectures. Activating atomic operations may require explicit compiler flags on some targets (like sparc and x86), such as -march=i686, -march=native or -mcpu=v9. See the GCC manual for more information.
Using Parallel Mode

To use the libstdc++ parallel mode, compile your application with the prerequisite flags as detailed above, and in addition add -D_GLIBCXX_PARALLEL. This will convert all use of the standard (sequential) algorithms to the appropriate parallel equivalents. Please note that this doesn't necessarily mean that everything will be executed in parallel, but rather that the heuristics and settings coded into the parallel versions will be used to determine if all, some, or no algorithms will be executed using parallel variants.

Note that the _GLIBCXX_PARALLEL define may change the sizes and behavior of standard class templates, and therefore one can only link code compiled with parallel mode and code compiled without parallel mode if no instantiation of a container is passed between the two translation units. Parallel mode functionality has distinct linkage, and cannot be confused with normal mode symbols.
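For example, a complete invocation for a typical x86-64 target might look like the following (the source file name is illustrative):

    g++ -fopenmp -D_GLIBCXX_PARALLEL -march=native -O2 app.cc -o app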
Using Specific Parallel Components

When it is not feasible to recompile your entire application, or only specific algorithms need to be parallel-aware, individual parallel algorithms can be made available explicitly. These parallel algorithms are functionally equivalent to the standard drop-in algorithms used in parallel mode, but they are available in a separate namespace as GNU extensions and may be used in programs compiled with either normal mode or parallel mode.

An example of using a parallel version of std::sort, but no other parallel algorithms, is:

    #include <vector>
    #include <parallel/algorithm>

    int main()
    {
      std::vector<int> v(100);
      // ...
      // Explicitly force a call to parallel sort.
      __gnu_parallel::sort(v.begin(), v.end());
      return 0;
    }

Then compile this code with the prerequisite compiler flags (-fopenmp and any necessary architecture-specific flags for atomic operations).

The following table provides the names and headers of all the parallel algorithms that can be used in a similar manner:

Parallel Algorithms

  Algorithm                      Header     Parallel algorithm                        Parallel header
  std::accumulate                numeric    __gnu_parallel::accumulate                parallel/numeric
  std::adjacent_difference       numeric    __gnu_parallel::adjacent_difference       parallel/numeric
  std::inner_product             numeric    __gnu_parallel::inner_product             parallel/numeric
  std::partial_sum               numeric    __gnu_parallel::partial_sum               parallel/numeric
  std::adjacent_find             algorithm  __gnu_parallel::adjacent_find             parallel/algorithm
  std::count                     algorithm  __gnu_parallel::count                     parallel/algorithm
  std::count_if                  algorithm  __gnu_parallel::count_if                  parallel/algorithm
  std::equal                     algorithm  __gnu_parallel::equal                     parallel/algorithm
  std::find                      algorithm  __gnu_parallel::find                      parallel/algorithm
  std::find_if                   algorithm  __gnu_parallel::find_if                   parallel/algorithm
  std::find_first_of             algorithm  __gnu_parallel::find_first_of             parallel/algorithm
  std::for_each                  algorithm  __gnu_parallel::for_each                  parallel/algorithm
  std::generate                  algorithm  __gnu_parallel::generate                  parallel/algorithm
  std::generate_n                algorithm  __gnu_parallel::generate_n                parallel/algorithm
  std::lexicographical_compare   algorithm  __gnu_parallel::lexicographical_compare   parallel/algorithm
  std::mismatch                  algorithm  __gnu_parallel::mismatch                  parallel/algorithm
  std::search                    algorithm  __gnu_parallel::search                    parallel/algorithm
  std::search_n                  algorithm  __gnu_parallel::search_n                  parallel/algorithm
  std::transform                 algorithm  __gnu_parallel::transform                 parallel/algorithm
  std::replace                   algorithm  __gnu_parallel::replace                   parallel/algorithm
  std::replace_if                algorithm  __gnu_parallel::replace_if                parallel/algorithm
  std::max_element               algorithm  __gnu_parallel::max_element               parallel/algorithm
  std::merge                     algorithm  __gnu_parallel::merge                     parallel/algorithm
  std::min_element               algorithm  __gnu_parallel::min_element               parallel/algorithm
  std::nth_element               algorithm  __gnu_parallel::nth_element               parallel/algorithm
  std::partial_sort              algorithm  __gnu_parallel::partial_sort              parallel/algorithm
  std::partition                 algorithm  __gnu_parallel::partition                 parallel/algorithm
  std::random_shuffle            algorithm  __gnu_parallel::random_shuffle            parallel/algorithm
  std::set_union                 algorithm  __gnu_parallel::set_union                 parallel/algorithm
  std::set_intersection          algorithm  __gnu_parallel::set_intersection          parallel/algorithm
  std::set_symmetric_difference  algorithm  __gnu_parallel::set_symmetric_difference  parallel/algorithm
  std::set_difference            algorithm  __gnu_parallel::set_difference            parallel/algorithm
  std::sort                      algorithm  __gnu_parallel::sort                      parallel/algorithm
  std::stable_sort               algorithm  __gnu_parallel::stable_sort               parallel/algorithm
  std::unique_copy               algorithm  __gnu_parallel::unique_copy               parallel/algorithm
Design
Interface Basics

All parallel algorithms are intended to have signatures that are equivalent to the ISO C++ algorithms replaced. For instance, the std::adjacent_find function is declared as:

    namespace std
    {
      template<typename _FIter>
        _FIter
        adjacent_find(_FIter, _FIter);
    }

This means that there should be something equivalent for the parallel version. Indeed, this is the case:

    namespace std
    {
      namespace __parallel
      {
        template<typename _FIter>
          _FIter
          adjacent_find(_FIter, _FIter);

        ...
      }
    }

But why the ellipsis?

The ellipsis in the example above represents additional overloads required for the parallel version of the function. These additional overloads are used to dispatch calls from the ISO C++ function signature to the appropriate parallel function (or sequential function, if no parallel functions are deemed worthy), based on either compile-time or run-time conditions.

The available signature options are specific to the different algorithms/algorithm classes. The general view of overloads for the parallel algorithms looks like this:

  ISO C++ signature
  ISO C++ signature + sequential_tag argument
  ISO C++ signature + algorithm-specific tag type (several signatures)

Please note that the implementation may use additional functions (designated with the _switch suffix) to dispatch from the ISO C++ signature to the correct parallel version. Also, some of the algorithms do not have support for run-time conditions, so the last overload is therefore missing.
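As a small sketch of these overloads in use (compile with -fopenmp and -D_GLIBCXX_PARALLEL), both of the following calls are then valid; the second uses an extra overload to force the sequential implementation for that particular call:

    #include <algorithm>
    #include <vector>

    int main()
    {
      std::vector<int> v(100);
      // Plain ISO C++ signature: the library decides how to dispatch.
      std::adjacent_find(v.begin(), v.end());
      // Extra overload: force the sequential implementation.
      std::adjacent_find(v.begin(), v.end(), __gnu_parallel::sequential_tag());
      return 0;
    }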
Configuration and Tuning
Setting up the OpenMP Environment

Several aspects of the overall runtime environment can be manipulated by standard OpenMP function calls. To specify the number of threads to be used for the algorithms globally, use the function omp_set_num_threads. An example:

    #include <stdlib.h>
    #include <omp.h>

    int main()
    {
      // Explicitly set number of threads.
      const int threads_wanted = 20;
      omp_set_dynamic(false);
      omp_set_num_threads(threads_wanted);

      // Call parallel mode algorithms.

      return 0;
    }

Some algorithms allow the number of threads to be set for a particular call, by augmenting the algorithm variant. See the next section for further information.

Other parts of the runtime environment able to be manipulated include nested parallelism (omp_set_nested), schedule kind (omp_set_schedule), and others. See the OpenMP documentation for more information.
Compile Time Switches

To force an algorithm to execute sequentially, even though parallelism is switched on in general via the macro _GLIBCXX_PARALLEL, add __gnu_parallel::sequential_tag() to the end of the algorithm's argument list, like so:

    std::sort(v.begin(), v.end(), __gnu_parallel::sequential_tag());

Some parallel algorithm variants can be excluded from compilation by preprocessor defines. See the doxygen documentation on compiletime_settings.h and features.h for details.

For some algorithms, the desired variant can be chosen at compile-time by appending a tag object. The available options are specific to the particular algorithm (class).

For the "embarrassingly parallel" algorithms, there is only one tag object type, the enum _Parallelism. It takes one of the following values: __gnu_parallel::parallel_tag, __gnu_parallel::balanced_tag, __gnu_parallel::unbalanced_tag, __gnu_parallel::omp_loop_tag, __gnu_parallel::omp_loop_static_tag. This means that the actual parallelization strategy is chosen at run-time. (Choosing the variants at compile-time will come soon.)

For the following algorithms in general, we have __gnu_parallel::parallel_tag and __gnu_parallel::default_parallel_tag, in addition to __gnu_parallel::sequential_tag. __gnu_parallel::default_parallel_tag chooses the default algorithm at compile time, as does omitting the tag. __gnu_parallel::parallel_tag postpones the decision to run time (see the next section). For all tags, the number of threads desired for this call can optionally be passed to the respective tag's constructor.

The multiway_merge algorithm comes with the additional choices __gnu_parallel::exact_tag and __gnu_parallel::sampling_tag. Exact and sampling are the two available splitting strategies.

For the sort and stable_sort algorithms, there are several additional choices, namely __gnu_parallel::multiway_mergesort_tag, __gnu_parallel::multiway_mergesort_exact_tag, __gnu_parallel::multiway_mergesort_sampling_tag, __gnu_parallel::quicksort_tag, and __gnu_parallel::balanced_quicksort_tag. Multiway mergesort comes with the two splitting strategies for multi-way merging. The quicksort options cannot be used for stable_sort.
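For instance, a sketch that selects the multiway mergesort variant for a single call and passes a thread count of four to the tag's constructor, per the convention described above (compile with -fopenmp):

    #include <vector>
    #include <parallel/algorithm>

    int main()
    {
      std::vector<int> v(1000000);
      // ... fill v ...
      // Choose the multiway mergesort variant at compile time,
      // requesting four threads for this particular call.
      __gnu_parallel::sort(v.begin(), v.end(),
                           __gnu_parallel::multiway_mergesort_tag(4));
      return 0;
    }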
Run Time Settings and Defaults

The default parallelization strategy, the choice of specific algorithm strategy, the minimum threshold limits for individual parallel algorithms, and aspects of the underlying hardware can be specified as desired via manipulation of __gnu_parallel::_Settings member data.

First off, the choice of parallelization strategy: serial, parallel, or heuristically deduced. This corresponds to __gnu_parallel::_Settings::algorithm_strategy and is a value of enum __gnu_parallel::_AlgorithmStrategy type. Choices include: heuristic, force_sequential, and force_parallel. The default is heuristic.

Next, the sub-choices for algorithm variant, if not fixed at compile-time. Specific algorithms like find or sort can be implemented in multiple ways: when this is the case, a __gnu_parallel::_Settings member exists to pick the default strategy. For example, __gnu_parallel::_Settings::sort_algorithm can have any value of enum __gnu_parallel::_SortAlgorithm: MWMS, QS, or QS_BALANCED.

Likewise for setting the minimal threshold for algorithm parallelization. Parallelism always incurs some overhead. Thus, it is not helpful to parallelize operations on very small sets of data. Because of this, measures are taken to avoid parallelizing below a certain, pre-determined threshold. For each algorithm, a minimum problem size is encoded as a variable in the active __gnu_parallel::_Settings object. This threshold variable follows the naming scheme __gnu_parallel::_Settings::[algorithm]_minimal_n. So, for fill, the threshold variable is __gnu_parallel::_Settings::fill_minimal_n.

Finally, hardware details like L1/L2 cache size can be hardwired via __gnu_parallel::_Settings::L1_cache_size and friends.

All these configuration variables can be changed by the user, if desired. There exists one global instance of the class _Settings, i.e. it is a singleton. It can be read and written by calling __gnu_parallel::_Settings::get and __gnu_parallel::_Settings::set, respectively. Please note that the first call returns a const object, so direct manipulation is forbidden. See <parallel/settings.h> for complete details.

A small example of tuning the default:

    #include <parallel/algorithm>
    #include <parallel/settings.h>

    int main()
    {
      __gnu_parallel::_Settings s;
      s.algorithm_strategy = __gnu_parallel::force_parallel;
      __gnu_parallel::_Settings::set(s);

      // Do work... all algorithms will be parallelized, always.

      return 0;
    }
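A similar sketch tunes a single threshold rather than the global strategy; the sort_minimal_n member is assumed here to follow the [algorithm]_minimal_n naming scheme described above:

    #include <parallel/settings.h>

    int main()
    {
      // Start from the currently active settings...
      __gnu_parallel::_Settings s = __gnu_parallel::_Settings::get();
      // ...and only parallelize sorts of at least 10000 elements
      // (sort_minimal_n assumed per the naming scheme above).
      s.sort_minimal_n = 10000;
      __gnu_parallel::_Settings::set(s);
      return 0;
    }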
Implementation Namespaces

One namespace contains versions of code that are always explicitly sequential: __gnu_serial.

Two namespaces contain the parallel mode: std::__parallel and __gnu_parallel.

Parallel implementations of standard components, including template helpers to select parallelism, are defined in namespace std::__parallel. For instance, std::transform from algorithm has a parallel counterpart in std::__parallel::transform from parallel/algorithm. In addition, these parallel implementations are injected into namespace __gnu_parallel with using declarations.

Support and general infrastructure is in namespace __gnu_parallel.

More information, and an organized index of types and functions related to the parallel mode on a per-namespace basis, can be found in the generated source documentation.
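A minimal sketch of this aliasing (compile with -fopenmp); both calls below name the same parallel implementation, though the __gnu_parallel spelling is the documented entry point:

    #include <vector>
    #include <parallel/algorithm>

    int main()
    {
      std::vector<int> in(100, 1), out(100);
      // Defined in namespace std::__parallel...
      std::__parallel::transform(in.begin(), in.end(), out.begin(),
                                 [](int x) { return x + 1; });
      // ...and also visible as __gnu_parallel::transform via a
      // using declaration.
      __gnu_parallel::transform(in.begin(), in.end(), out.begin(),
                                [](int x) { return x + 1; });
      return 0;
    }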
Testing

Both the normal conformance and regression tests and the supplemental performance tests work.

To run the conformance and regression tests with the parallel mode active:

    make check-parallel

The log and summary files for conformance testing are in the testsuite/parallel directory.

To run the performance tests with the parallel mode active:

    make check-performance-parallel

The result file for performance testing is in the testsuite directory, in the file libstdc++_performance.sum. In addition, the policy-based containers have their own visualizations, which have additional software dependencies beyond the usual bare-bones text file, and can be generated by using the make doc-performance rule in the testsuite's Makefile.