6. Parallelisation

CROCO has been designed to run efficiently on both shared- and distributed-memory parallel computer architectures. Parallelization is done by partitioning the domain into two-dimensional sub-domains. Multiple sub-domains can be assigned to each processor in order to optimize the use of processor cache memory. This can lead to super-linear scaling, where performance grows even faster than the number of CPUs.

Related CPP options:

  • OPENMP : activate the OpenMP parallelization protocol

  • MPI : activate the MPI parallelization protocol

  • MPI_NOLAND : no computation on land-only CPUs (needs preprocessing)

  • Compute the best decomposition for OpenMP

  • PARALLEL_FILES : output one file per CPU

  • NC4_PAR : use NetCDF4 capabilities

  • XIOS : dedicated CPU for output (needs XIOS installed)

Preselected options:

# undef MPI
# undef OPENMP
# undef MPI_NOLAND
# undef NC4_PAR
# undef XIOS

6.1. Parallel strategy overview

Two kinds of parallelism are currently supported by CROCO: MPI (distributed memory) and OpenMP (shared memory). CROCO does not currently support hybrid MPI+OpenMP parallelization: the cpp keys MPI and OPENMP are mutually exclusive.

6.1.1. OpenMP (#define OPENMP)


Variables in param.h:

  • NPP : number of threads

  • NSUB_X : number of tiles in XI direction

  • NSUB_E : number of tiles in ETA direction

NSUB_X x NSUB_E has to be a multiple of NPP. Most of the time, we set NPP = NSUB_X x NSUB_E.

Example 1:

One node with 8 cores: NPP=8, NSUB_X=2, NSUB_E=4


Each thread computes one sub-domain.
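
For this first example, the settings in param.h would look like the following sketch (the variable names are the ones listed above; the surrounding parameter-statement syntax is an assumption based on typical Fortran configuration headers, not a verbatim excerpt of CROCO):

```fortran
! OpenMP tiling for one node with 8 cores (Example 1):
! 2 x 4 = 8 tiles, i.e. one sub-domain per thread
      integer NPP, NSUB_X, NSUB_E
      parameter (NPP=8, NSUB_X=2, NSUB_E=4)
```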

Example 2:

Still one node with 8 cores: NPP=8, NSUB_X=2, NSUB_E=8


Each thread computes two sub-domains.

Code structure

  • OpenMP is NOT implemented at loop level

  • but uses a domain decomposition (similar to MPI) with parallel region

  • use of first-touch initialization, so that each working array is placed in the memory of the thread that works on it

  • working arrays have the size of the sub-domain only


Example of a parallel region


Inside a parallel region

Here, Compute_1 and Compute_2 cannot write to the same index of a global array.
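
As a sketch (not actual CROCO source) of the tile-based pattern described above: the loop bounds use NSUB_X and NSUB_E from param.h, and Compute_1/Compute_2 stand for the routines named in the caption. The NOWAIT clause lets a thread start Compute_2 on its tiles while other threads are still in Compute_1, which is why the two routines must not write to the same index of a global array:

```fortran
!$OMP PARALLEL PRIVATE(tile)
!$OMP DO
      do tile = 0, NSUB_X*NSUB_E-1
        call Compute_1 (tile)       ! each thread processes whole sub-domains
      enddo
!$OMP END DO NOWAIT
!$OMP DO
      do tile = 0, NSUB_X*NSUB_E-1
        call Compute_2 (tile)       ! may overlap with Compute_1 on other threads
      enddo
!$OMP END DO
!$OMP END PARALLEL
```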

6.1.2. MPI (#define MPI)

Variables in param.h:

  • NP_XI : decomposition in XI direction

  • NP_ETA : decomposition in ETA direction

  • NNODES : number of cores (=NP_XI x NP_ETA, except with MPI_NOLAND)

  • NPP = 1

  • NSUB_X and NSUB_E : number of sub-tiles per MPI domain (almost always = 1)


Example 1:

8 cores:

  • NP_XI=2, NP_ETA=4, NNODES=8

  • NPP=1, NSUB_X=1, NSUB_E=1
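
For this example, the param.h settings would look like the following sketch (variable names as listed above; the parameter-statement form is an assumption, not a verbatim CROCO excerpt):

```fortran
! MPI decomposition for 8 cores (Example 1): 2 x 4 MPI domains,
! no further sub-tiling within each MPI domain
      integer NP_XI, NP_ETA, NNODES, NPP, NSUB_X, NSUB_E
      parameter (NP_XI=2, NP_ETA=4, NNODES=NP_XI*NP_ETA)
      parameter (NPP=1, NSUB_X=1, NSUB_E=1)
```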


Example 2:

8 cores:

  • NP_XI=2, NP_ETA=4, NNODES=8

  • NPP=1, NSUB_X=1, NSUB_E=2


6.2. Loops and indexes

Parallel/sequential correspondence:


Decomposition into 2 sub-domains (top), total (sequential) domain (bottom)


Example: 2 MPI domains, with 2 sub-domains (OpenMP or not) per MPI domain


Istr and Iend are the limits of the sub-domains (without overlap). They are calculated dynamically.
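
As an illustration of that dynamic computation, the bounds along XI could be derived as in the sketch below. Only Istr, Iend and NSUB_X come from the text; Lm (number of interior points in XI) and i_tile (tile index in XI, numbered from 0) are assumed names, not necessarily those used in CROCO:

```fortran
! Split Lm interior points into NSUB_X tiles along XI;
! tile i_tile gets the non-overlapping range Istr:Iend.
      chunk = (Lm + NSUB_X - 1)/NSUB_X
      Istr  = 1 + i_tile*chunk
      Iend  = min(Istr + chunk - 1, Lm)
```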


Computation of Istr, Iend and use of working arrays

6.3. Exchanges

CROCO makes use of 2 or 3 ghost cells, depending on the numerical schemes chosen.


In the example above (2 ghosts cells), for correct exchanges, after computation:

  • \(\eta\) has to be valid on (1:Iend)

  • \(u\) has to be valid on (1:Iend), except on the leftmost (western) domain, where it is (2:Iend)


IstrU is the limit of validity at U point


Computation of auxiliary indexes
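
The auxiliary-index computation can be sketched as follows. Only IstrU and Istr come from the text above; WESTERN_EDGE is an assumed name for a flag marking tiles that touch the physical western boundary (a common convention in ROMS-family codes), not necessarily the identifier used in CROCO:

```fortran
      if (WESTERN_EDGE) then
        IstrU = Istr + 1      ! no interior u-point at the western boundary
      else
        IstrU = Istr          ! interior tile: u is valid from Istr
      endif
```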

6.4. Dealing with outputs

By default, with MPI activated, input and output files are treated in a pseudo-sequential way, and one NetCDF file corresponds to the whole domain. This has drawbacks when using a large number of computational cores: since each core writes its part of the domain sequentially, the time dedicated to outputs increases with the number of cores. Three alternatives are implemented in CROCO.

Split files (#define PARALLEL_FILES)

In this case, each core writes only its part of the domain, in separate files (one per MPI domain). Writing is performed concurrently. Another advantage is that it avoids the creation of huge output files. The per-domain output files can be recombined using the ncjoin utility (in Fortran), compiled at the same time as CROCO. Note that in this case, input files have to be split as well, using the partit utility.

Parallel NetCDF (#define NC4_PAR)

This option requires a NetCDF4 library installed with parallel capabilities. All cores write concurrently into the same file.

IO server (#define XIOS)

XIOS is an external I/O server interfaced with CROCO. Information about its use and installation can be found at https://forge.ipsl.jussieu.fr/ioserver. In this case, output variables are defined in .xml files. See also the Diagnostics chapter.