On-node parallelism continues to grow in importance for high-performance computing, and most newly deployed supercomputers have tens of processor cores per node. These higher levels of on-node parallelism exacerbate the impact of load imbalance and poor locality in parallel computations, and current programming systems either lack features for using these many cores efficiently or require significant modification of application codes. Our work is motivated by the need to address application-specific load balance and locality requirements with minimal changes to application codes. In this paper, we propose a new approach that extends the specification of parallel loops with inspection via user functions that specify iteration chunks. We also extend the runtime system to invoke these user functions while dividing up the iteration space of the target loop. Our runtime system starts from the subspaces specified by the user functions, balances the load across chunks concurrently, and stores the balanced groups of chunks to reduce load imbalance in future invocations. Our approach can be used to improve load balance and locality in many dynamic iterative applications, including graph and sparse matrix applications. We showcase the benefits of this work using MiniMD, a mini-app derived from LAMMPS, and three kernels from the GAP Benchmark Suite: Breadth-First Search, Connected Components, and PageRank, each evaluated with six different graph data sets. Our approach achieves geometric mean speedups of 1.16× to 1.54× over four standard OpenMP schedules, and of 1.07× over the static_steal schedule from recent research.