This file contains unreleased/unannounced changes for parallelization, in addition to those listed in VERSIONS.  It will be merged into VERSIONS once parallel SLiM is released publicly.

PARALLEL changes (now in the master branch, but disabled):
	TO DO: the openmp branch had changes to travis.yml to test the openmp branch, which should be replicated somehow for GitHub Actions:
		using macos xcode11.6/clang and linux bionic/gcc
		on osx, "brew install libomp;" "ls -l /usr/local/opt/libomp/include/omp.h;" "ls -l /usr/local/include;"
		note the paths for the headers/lib are likely different, so that will require a bit of legwork
	updated SLiM.xcodeproj for OpenMP configuration:
		duplicate the entire SLiM directory to SLiM_temp, open the SLiM.xcodeproj file in the copy, rename it in the Project Navigator to "SLiM_OpenMP", copy it back into the SLiM folder and delete the rest of the copy
		use grep to find "SLiM_temp" inside files in the new SLiM_OpenMP.xcodeproj and fix them to refer to the correct directory (might be just one, or none)
		add system header search path for targets eidos_multi and SLiM_multi, in Build Settings: /usr/local/include/ (non-recursive)
		add library search path for targets eidos_multi and SLiM_multi, in Build Settings: /usr/local/lib/ (non-recursive)
		add link to binary library, under "Build Phases" (Link Binary With Libraries) on targets eidos_multi and SLiM_multi, by dragging in /usr/local/lib/libomp.dylib
		add to Other C Flags and Other C++ Flags for targets eidos_multi and SLiM_multi, in Build Settings: -Xclang and -fopenmp
		disable library validation for targets eidos_multi and SLiM_multi, in Signing & Capabilities: https://developer.apple.com/documentation/bundleresources/entitlements/com_apple_security_cs_disable-library-validation
		set Enable Index-While-Building Functionality to NO in Build Settings on targets eidos_multi and SLiM_multi
		disable Link-Time Optimization, which appears to be incompatible with OpenMP
		switched "Build Active Architecture Only" in Build Settings to YES for Release builds, for targets eidos_multi and SLiM_multi, because the older libomp.dylib builds do not have arm64 in them
		see http://antonmenshov.com/2017/09/09/clang-openmp-setup-in-xcode/
		added -Rpass="loop|vect" -Rpass-missed="loop|vect" -Rpass-analysis="loop|vect" to Other C++ Flags for eidos_multi, to aid in optimization work
		added "-fno-math-errno" to the project-level Other C++ Flags, to match the main project
		added $(OTHER_CPLUSPLUSFLAGS) to that build setting for slim_multi and eidos_multi, to inherit from the project's settings
		require 10.13 or later for eidos_multi and slim_multi, since that's the macOS version that the R libomp packages target
		added SLiMguiLegacy_multi and EidosScribe_multi targets/schemes to the project for debugging/testing purposes; not intended for end users!
	modify CMakeLists.txt to add a "PARALLEL" flag that can be set to "ON" to enable OpenMP with the -fopenmp flag
	modify CMakeLists.txt to add /usr/local/include to all targets (through GSL_INCLUDES, which should maybe get cleaned up somehow)
	modify CMakeLists.txt to build executables eidos_multi and slim_multi when PARALLEL is ON, and to link in libomp
	disable parallel builds of EidosScribe, SLiMgui, and SLiMguiLegacy with #error directives; parallelization is for the command line only (active threads peg the CPU, passive threads are slower than single-threaded)
	add a command-line option, -maxThreads <n>, to control the maximum number of threads used in OpenMP
	add a header eidos_openmp.h that should be used instead of #include "omp.h", providing various useful definitions etc.
	add an initialization function for OpenMP, Eidos_WarmUpOpenMP(), that must be called when warming up; sets various options, logs diagnostic info
	set default values for OMP_WAIT_POLICY (active) and OMP_PROC_BIND (false) and OMP_DYNAMIC (false) for slim and eidos
	add THREAD_SAFETY_CHECK() convention for marking areas that are known not to be thread-safe, such as due to use of statics or globals
	add gEidosMaxThreads global that has the max number of threads for the session (for pre-allocating per-thread data structures)
	create per-thread SparseVector pools to allow high-scalability parallelization of spatial interactions
	add parallel sorting code - quicksort and mergesort, rather experimental at this point, just early work
	create per-thread random number generators to allow parallelization of code that involves RNG usage
	parallelize dnorm(), rbinom(), rdunif(), rexp(), rnorm(), rpois(), runif()
	parallelize tallying mutation refcounts, for the optimized fast case
	parallelize fitness evaluation for mutations, when there are no mutationEffect() or fitnessEffect() callbacks
	parallelize non-mutation-based fitness evaluation (i.e., only fitnessScaling values)
	parallelize some minor tick cycle bookkeeping: incrementing individual ages, clearing the migrant flag
	parallelize the viability/survival tick cycle stage's survival evaluation (without callbacks)
	parallelize localPopulationDensity(), clippedIntegral(), and neighborCount()
	add gEidosNumThreads global that has the currently requested number of threads (usually equal to gEidosMaxThreads)
	add functions to Eidos to support parallel execution: parallelGetNumThreads(), parallelGetMaxThreads(), and parallelSetNumThreads()
	add unit tests for parallelized functions in Eidos, comparing against the same functions non-parallel
	add nowait clause on parallel for loops that would otherwise have a double barrier
	shift to allocating per-thread RNGs in parallelized allocations, so "first touch" places each RNG in memory close to that thread's core
	switch to dynamic scheduling for rbinom() and rdunif(), which seem to need it for unclear reasons
	use a minimum chunk size of 16 for all dynamic schedules; problems that parallelize should always be at least that big, I imagine
		the exception is countOfMutationsOfType() / sumOfMutationsOfType() and similar, because they do so much work per iteration that they can be worth going parallel even for two iterations
		note that we never use schedule(guided) because its implementation in GCC is brain-dead; see https://stackoverflow.com/a/43047074/2752221
	switch to OMP_PROC_BIND == true by default; keep threads with the data they're working on, recommended practice
	parallelize more Eidos math functions: abs(float), ceil(), exp(float), floor(), log(float), log10(float), log2(float), round(), sqrt(float), trunc()
	parallelize Eidos min/max functions: max(int), max(float), min(int), min(float), pmax(int), pmax(float), pmin(int), pmin(float)
	parallelize match(), tabulate(), sample (int/float/object, replace=T, weights=NULL), mean(l/i/f)
	parallelize Individual -relatedness(), -countOfMutationsOfType(), -setSpatialPosition(); Species -individualsWithPedigreeIDs()
	parallelize Subpopulation -pointInBounds(), -pointReflected(), -pointStopped(), pointPeriodic(), pointUniform()
	parallelize Genome -containsMarkerMutation()
	change how RNG seeding works, for single-threaded and multi-threaded runs: use /dev/random to generate initial seeds, and for multi-threaded runs, use /dev/random for seeds for all threads beyond thread 0
	parallelize Genome -countOfMutationsOfType(), Subpopulation -spatialMapValue() and -sampleIndividuals()
	add unit tests for various SLiM methods that have been parallelized
	parallelize InteractionType nearestNeighbors(), nearestInteractingNeighbors(), drawByStrength() by adding a returnDict parameter that lets the user get a Dictionary as a result, with one entry per receiver
		their "receiver" parameter is no longer required to be singleton (when returnDict == T), and their return value is now just "object" since it can be Individual or Dictionary
	overhaul the internal MutationRun design to allow it to be used multithreaded - get rid of refcounting, switch to usage tallying
	parallelize offspring generation in WF models, when no callbacks (mateChoice(), modifyChild(), recombination(), mutation()) are active; no tree-seq, no cloning
	parallelize offspring generation in WF model: add support for cloning
	parallelize tallying of mutation runs, clearing of parental genomes at the end of WF model ticks, and mutation run uniquing
	add support for parallel reproduction when a non-default stacking policy is in use
	add support for parallel reproduction with tree-sequence recording
	add parallel reproduction in nonWF models, with no callbacks (recombination(), mutation()) active - modifyChild() callbacks are legal but cannot access child genomes since they are deferred
		this is achieved by passing defer=T to addCrossed(), addCloned(), addSelfed(), or addRecombinant()
	thread-safety work - break in backward reproducibility for scripts that use a type 's' DFE, because the code path for that shifted
	algorithm change for nearestNeighbors() and nearestNeighborsOfPoint(), when count is >1 and <N; the old algorithm was not thread-safe, and was inefficient; breaks backward reproducibility for the relevant case
		the break in backward reproducibility is because the old algorithm returned the neighbors in a scrambled order, whereas the new algorithm returns them in sorted order; the order is not documented, so both are compliant
	add max thread counts for each parallel loop, allowing flexible customization both by Eidos and by the user
	tweak the semantics of parallelSetNumThreads(): an integer requests that exact number of threads, whereas NULL requests "up to" the max number of threads (the default behavior)
		add gEidosNumThreadsOverride flag indicating whether the number of threads has been overridden explicitly
	add parallelSetTaskThreadCounts() function to set per-task thread counts, and parallelGetTaskThreadCounts() to get them
	add recipe 22.7, Parallelizing nonWF reproduction and spatial interactions
	add internal EidosBenchmark facility for timing of internal loops, for benchmarking of parallelization scaling
	add a command-line option, -perTaskThreads, to let the user control how many threads to use for different tasks (if not overridden)
	parallelize sorting, where possible/efficient, add requisite per-task thread count keys
	parallelize edge table sorting during simplification, add requisite per-task thread count keys

