6. Parallelisation

CROCO has been designed to run efficiently on both shared- and distributed-memory parallel computer architectures. Parallelization is done by partitioning the domain into two-dimensional sub-domains. Multiple sub-domains can be assigned to each processor in order to optimize the use of processor cache memory. This can lead to super-linear scaling, where performance grows even faster than the number of CPUs.

Related CPP options:

  • OPENMP : activate the OpenMP parallelization protocol

  • MPI : activate the MPI parallelization protocol

  • MPI_NOLAND : no computation on land-only CPUs (needs preprocessing)

  • Compute the best decomposition for OpenMP

  • PARALLEL_FILES : output one file per CPU

  • NC4_PAR : use NetCDF4 capabilities

  • XIOS : dedicated CPU for output (needs XIOS installed)

Preselected options:

# undef MPI
# undef OPENMP
# undef MPI_NOLAND
# undef NC4_PAR
# undef XIOS

6.1. Parallel strategy overview

Two kinds of parallelism are currently supported by CROCO: MPI (distributed memory) and OpenMP (shared memory). CROCO does not currently support hybrid MPI+OpenMP parallelization: the cpp keys MPI and OPENMP are mutually exclusive.

6.1.1. OpenMP (#define OPENMP)


Variables in param.h:

  • NPP : number of threads

  • NSUB_X : number of tiles in XI direction

  • NSUB_E : number of tiles in ETA direction

NSUB_X x NSUB_E has to be a multiple of NPP. Most of the time, we set NPP = NSUB_X x NSUB_E.

Example 1:

One node with 8 cores: NPP=8, NSUB_X=2, NSUB_E=4


Each thread computes one sub-domain.
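
For this first example, the settings in param.h would look like the following sketch (the variable names are the ones listed above; the surrounding parameter-statement syntax is an assumption based on typical Fortran configuration headers, not a verbatim excerpt of CROCO):

```fortran
! OpenMP tiling for one node with 8 cores (Example 1):
! 2 x 4 = 8 tiles, i.e. one sub-domain per thread
      integer NPP, NSUB_X, NSUB_E
      parameter (NPP=8, NSUB_X=2, NSUB_E=4)
```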

Example 2:

Still one node with 8 cores: NPP=8, NSUB_X=2, NSUB_E=8


Each thread computes two sub-domains.

Code structure

  • OpenMP is NOT implemented at loop level

  • but uses a domain decomposition (similar to MPI) with parallel region

  • use of first-touch initialization, so that each working array is placed in the memory of the thread that works on it

  • working arrays have the size of the sub-domain only


Example of a parallel region


Inside a parallel region

Here, Compute_1 and Compute_2 cannot write to the same index of a global array.
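
As a sketch (not actual CROCO source) of the tile-based pattern described above: the loop bounds use NSUB_X and NSUB_E from param.h, and Compute_1/Compute_2 stand for the routines named in the caption. The NOWAIT clause lets a thread start Compute_2 on its tiles while other threads are still in Compute_1, which is why the two routines must not write to the same index of a global array:

```fortran
!$OMP PARALLEL PRIVATE(tile)
!$OMP DO
      do tile = 0, NSUB_X*NSUB_E-1
        call Compute_1 (tile)       ! each thread processes whole sub-domains
      enddo
!$OMP END DO NOWAIT
!$OMP DO
      do tile = 0, NSUB_X*NSUB_E-1
        call Compute_2 (tile)       ! may overlap with Compute_1 on other threads
      enddo
!$OMP END DO
!$OMP END PARALLEL
```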

6.1.2. MPI (#define MPI)

Variables in param.h:

  • NP_XI : decomposition in XI direction

  • NP_ETA : decomposition in ETA direction

  • NNODES : number of cores (=NP_XI x NP_ETA, except with MPI_NOLAND)

  • NPP = 1

  • NSUB_X and NSUB_E : number of sub-tiles per MPI domain (almost always = 1)


Example 1:

8 cores:

  • NP_XI=2, NP_ETA=4, NNODES=8

  • NPP=1, NSUB_X=1, NSUB_E=1
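
For this example, the param.h settings would look like the following sketch (variable names as listed above; the parameter-statement form is an assumption, not a verbatim CROCO excerpt):

```fortran
! MPI decomposition for 8 cores (Example 1): 2 x 4 MPI domains,
! no further sub-tiling within each MPI domain
      integer NP_XI, NP_ETA, NNODES, NPP, NSUB_X, NSUB_E
      parameter (NP_XI=2, NP_ETA=4, NNODES=NP_XI*NP_ETA)
      parameter (NPP=1, NSUB_X=1, NSUB_E=1)
```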


Example 2:

8 cores:

  • NP_XI=2, NP_ETA=4, NNODES=8

  • NPP=1, NSUB_X=1, NSUB_E=2


6.2. Loops and indexes

Parallel/sequential correspondence:


Decomposition into 2 sub-domains (top), total (sequential) domain (bottom)


Example: 2 MPI domains, with 2 sub-domains (OpenMP or not) per MPI domain


Istr and Iend are the limits of the sub-domains (without overlap). They are calculated dynamically.
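
As an illustration of that dynamic computation, the bounds along XI could be derived as in the sketch below. Only Istr, Iend and NSUB_X come from the text; Lm (number of interior points in XI) and i_tile (tile index in XI, numbered from 0) are assumed names, not necessarily those used in CROCO:

```fortran
! Split Lm interior points into NSUB_X tiles along XI;
! tile i_tile gets the non-overlapping range Istr:Iend.
      chunk = (Lm + NSUB_X - 1)/NSUB_X
      Istr  = 1 + i_tile*chunk
      Iend  = min(Istr + chunk - 1, Lm)
```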


Computation of Istr, Iend and use of working arrays

6.3. Exchanges

CROCO makes use of 2 or 3 ghost cells, depending on the numerical schemes chosen.


In the example above (2 ghosts cells), for correct exchanges, after computation:

  • \(\eta\) has to be valid on (1:Iend)

  • \(u\) has to be valid on (1:Iend), except on the leftmost (western) domain, where it is (2:Iend)


IstrU is the limit of validity at U point


Computation of auxiliary indexes
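
The auxiliary-index computation can be sketched as follows. Only IstrU and Istr come from the text above; WESTERN_EDGE is an assumed name for a flag marking tiles that touch the physical western boundary (a common convention in ROMS-family codes), not necessarily the identifier used in CROCO:

```fortran
      if (WESTERN_EDGE) then
        IstrU = Istr + 1      ! no interior u-point at the western boundary
      else
        IstrU = Istr          ! interior tile: u is valid from Istr
      endif
```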

6.4. Dealing with outputs

By default, with MPI activated, input and output files are treated in a pseudo-sequential way, and one NetCDF file corresponds to the whole domain. This has drawbacks when using a large number of computational cores: since each core writes its part of the domain sequentially, the time dedicated to outputs increases with the number of cores. Three alternatives are implemented in CROCO.

Split files (#define PARALLEL_FILES)

In this case, each core writes only its part of the domain, in separate files (one per MPI domain). Writing is performed concurrently. Another advantage is that it avoids the creation of huge output files. The per-domain output files can be recombined using the ncjoin utility (in Fortran), compiled at the same time as CROCO. Note that in this case, input files have to be split as well, using the partit utility.

Parallel NetCDF (#define NC4_PAR)

This option requires a NetCDF4 library installed with parallel capabilities. All cores write concurrently into the same file.

IO server (#define XIOS)

XIOS is an external I/O server interfaced with CROCO. Information about its use and installation can be found at https://forge.ipsl.jussieu.fr/ioserver. In this case, output variables are defined in .xml files. See also the Diagnostics chapter.