6. Parallelisation¶

CROCO has been designed to be optimized on both shared and distributed memory parallel computer architectures. Parallelization is done by partitioning the domain into two-dimensional sub-domains. Multiple sub-domains can be assigned to each processor in order to optimize the use of processor cache memory. This allows super-linear scaling, where performance grows even faster than the number of CPUs.

Related CPP options:

• OPENMP : Activate OpenMP parallelization protocol

• MPI : Activate MPI parallelization protocol

• MPI_NOLAND : No computation on land-only CPUs (needs preprocessing)

• AUTO_TILING : Compute the best decomposition for OpenMP

• PARALLEL_FILES : Output one file per CPU

• NC4_PAR : Use NetCDF4 capabilities

• XIOS : Dedicated CPU for output (needs XIOS installed)

Preselected options:

# undef MPI
# undef OPENMP
# undef MPI_NOLAND
# undef AUTO_TILING
# undef PARALLEL_FILES
# undef NC4_PAR
# undef XIOS


6.1. Parallel strategy overview¶

Two kinds of parallelism are currently supported by CROCO: MPI (distributed memory) and OpenMP (shared memory). CROCO does not currently support hybrid MPI+OpenMP parallelization: the cpp keys MPI and OPENMP are mutually exclusive.

6.1.1. OpenMP (#define OPENMP)¶

Variables in param.h:

• NPP : number of threads

• NSUB_X : number of tiles in XI direction

• NSUB_E : number of tiles in ETA direction

NSUB_X x NSUB_E has to be a multiple of NPP. Most of the time, we set NPP = NSUB_X x NSUB_E.

Example 1:

One node with 8 cores: NPP=8, NSUB_X=2, NSUB_E=4

Each thread computes one sub-domain.

Example 2:

Still one node with 8 cores: NPP=8, NSUB_X=2, NSUB_E=8

Each thread computes two sub-domains.
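For Example 2, the corresponding settings in param.h would look like the following (a sketch only; the exact declarations and layout of param.h may differ):

```fortran
      integer NPP, NSUB_X, NSUB_E
      parameter (NPP=8)               ! 8 OpenMP threads
      parameter (NSUB_X=2, NSUB_E=8)  ! 16 tiles: 2 tiles per thread
```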

Code structure

• OpenMP is NOT implemented at loop level

• but uses a domain decomposition (similar to MPI) with parallel regions

• first-touch initialization is used, so that working arrays are attached to the thread that computes them

• working arrays have the size of the sub-domain only

With this decomposition, Compute_1 and Compute_2 cannot write to the same index of a global array.
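Schematically, the tile loop driving this decomposition can be pictured as follows (a simplified sketch with illustrative names, not the actual CROCO source):

```fortran
! Each thread processes whole tiles; working arrays are tile-sized
! and first-touched by the thread that owns them.
C$OMP PARALLEL DO PRIVATE(tile) SCHEDULE(static,1)
      do tile=0,NSUB_X*NSUB_E-1
        call compute_1(tile)   ! writes only inside its own sub-domain
      enddo
C$OMP END PARALLEL DO
```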

6.1.2. MPI (#define MPI)¶

Variables in param.h:

• NP_XI : decomposition in XI direction

• NP_ETA : decomposition in ETA direction

• NNODES : number of cores (=NP_XI x NP_ETA, except with MPI_NOLAND)

• NPP = 1

• NSUB_X and NSUB_E : number of sub-tiles per MPI domain (almost always = 1)

Example 1:

8 cores:

• NP_XI=2, NP_ETA=4, NNODES=8

• NPP=1, NSUB_X=1, NSUB_E=1

Example 2:

8 cores:

• NP_XI=2, NP_ETA=4, NNODES=8

• NPP=1, NSUB_X=1, NSUB_E=2
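For Example 2, the corresponding param.h settings would look like the following (a sketch only; the exact declarations may differ):

```fortran
      integer NP_XI, NP_ETA, NNODES, NPP, NSUB_X, NSUB_E
      parameter (NP_XI=2, NP_ETA=4, NNODES=NP_XI*NP_ETA) ! 8 MPI processes
      parameter (NPP=1)
      parameter (NSUB_X=1, NSUB_E=2)  ! 2 sub-tiles per MPI domain
```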

6.2. Loops and indexes¶

Parallel/sequential correspondence:

Decomposition:

Example: 2 MPI domains, each with 2 sub-domains (OpenMP or not) per MPI domain

Istr and Iend are the limits of each sub-domain (without overlap). They are calculated dynamically.
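As an illustration of how such bounds can be derived (a sketch only; CROCO computes them in its own include files), the XI-direction limits of tile number i_X (0-based), for a local domain of Lm interior points split into NSUB_X tiles, could be:

```fortran
      integer i_X, Lm, NSUB_X, chunk, Istr, Iend
      chunk = (Lm + NSUB_X - 1) / NSUB_X   ! ceiling(Lm/NSUB_X)
      Istr  = 1 + i_X*chunk
      Iend  = min(Istr + chunk - 1, Lm)
```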

6.3. Exchanges¶

CROCO uses 2 or 3 ghost cells, depending on the numerical schemes chosen.

In the example above (2 ghosts cells), for correct exchanges, after computation:

• $$\eta$$ has to be valid on (1:Iend)

• $$u$$ has to be valid on (1:Iend) except on the left domain (2:Iend)

IstrU is the limit of validity at U points.

6.4. Dealing with outputs¶

By default, with MPI activated, input and output files are handled in a pseudo-sequential way, and a single NetCDF file corresponds to the whole domain. This has drawbacks when using a large number of computational cores: since each core writes its part of the domain sequentially, the time dedicated to outputs increases with the number of cores. Three alternatives are implemented in CROCO.

Split files (#define PARALLEL_FILES)

In this case, each core writes only its part of the domain in a separate file (one per MPI domain). This writing is performed concurrently. Another advantage is that it avoids the creation of huge output files. The per-domain output files can be recombined using the ncjoin utility (written in Fortran), compiled at the same time as CROCO. Note that in this case, input files have to be split as well, using the partit utility.
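Typical usage might look like the following (illustrative; check each utility's help message for the exact arguments and file-name conventions):

```
# recombine per-MPI-domain output files into a single NetCDF file
./ncjoin croco_his.nc.*

# split an input file into NP_XI x NP_ETA pieces (here 2 x 4)
./partit 2 4 croco_grd.nc
```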

Parallel NetCDF (#define NC4_PAR)

This option requires a NetCDF-4 library installed with parallel capabilities. All cores write concurrently into the same file.

IO server (#define XIOS)

XIOS is an external IO server interfaced with CROCO. Information about its use and installation can be found at https://forge.ipsl.jussieu.fr/ioserver. In this case, output variables are defined in .xml files. See also the Diagnostics chapter.