ParallelStencil.jl API
The following API reference is generated from the docstrings available within the ParallelStencil.jl package and is included here for convenient look-up. Please refer to the official repository and ask the package authors if anything is unclear.
ParallelStencil.ParallelStencil — Module

Module ParallelStencil
Enables domain scientists to write high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs.
General overview and examples
https://github.com/omlins/ParallelStencil.jl
Primary macros
Macros available for @parallel_indices kernels
Submodules
ParallelStencil.FiniteDifferences1D
ParallelStencil.FiniteDifferences2D
ParallelStencil.FiniteDifferences3D
Modules generated in caller
To see a description of a macro or module, type ?<macroname> (including the @) or ?<modulename>, respectively.
ParallelStencil.@blockDim — Macro

@blockDim()

Return the block size (or "dimension") in the x, y and z dimensions. The block size in a specific dimension is commonly retrieved directly, as in this example for the x dimension: @blockDim().x.
ParallelStencil.@blockIdx — Macro

@blockIdx()

Return the block ID in the x, y and z dimensions within the grid. The block ID in a specific dimension is commonly retrieved directly, as in this example for the x dimension: @blockIdx().x.
ParallelStencil.@gridDim — Macro

@gridDim()

Return the grid size (or "dimension") in the x, y and z dimensions. The grid size in a specific dimension is commonly retrieved directly, as in this example for the x dimension: @gridDim().x.
ParallelStencil.@hide_communication — Macro

@hide_communication boundary_width block
@hide_communication ranges_outer ranges_inner block

Hide the communication behind the computation within the code block.
Arguments
boundary_width::Tuple{Integer,Integer,Integer} | Tuple{Integer,Integer} | Tuple{Integer}: the width of the boundaries in each dimension. The boundaries must include (at least) all the data that is accessed in the communication performed.
block: a code block which starts with exactly one @parallel call to perform computations, followed by code to set boundary conditions and to perform communication (e.g. update_halo! from the package ImplicitGlobalGrid). The @parallel call to perform computations cannot contain any positional arguments (ranges, nblocks or nthreads) nor the stream keyword argument (stream=...). The code to set boundary conditions and to perform communication must only access the elements in the boundary ranges of the fields modified in the @parallel call; all elements can be accessed from other fields. Moreover, this code must not include statements in array broadcasting notation, because they are always run on the default CUDA stream (for CUDA.jl < v2.0), which makes CUDA stream overlapping impossible. Instead, boundary region elements can, e.g., be accessed with @parallel calls passing a ranges argument that ensures that no threads mapping to elements outside of ranges_outer are launched. Note that these @parallel ranges calls cannot contain any other positional arguments (nblocks or nthreads) nor the stream keyword argument (stream=...).
ranges_outer::Tuple: a tuple with one or multiple ranges as required by the corresponding argument of @parallel. The ranges must together span (at least) all the data that is accessed in the communication and boundary conditions performed.
ranges_inner::Tuple: a tuple with one or multiple ranges as required by the corresponding argument of @parallel. The ranges must together span the data that is not included by ranges_outer.
Examples
@hide_communication (16, 2, 2) begin
@parallel diffusion3D_step!(Te2, Te, Ci, lam, dt, dx, dy, dz);
update_halo!(Te2);
end
@hide_communication (16, 2) begin
@parallel diffusion2D_step!(Te2, Te, Ci, lam, dt, dx, dy);
update_halo!(Te2);
end
@hide_communication ranges_outer ranges_inner begin
@parallel diffusion3D_step!(Te2, Te, Ci, lam, dt, dx, dy, dz);
update_halo!(Te2);
end
@parallel_indices (iy,iz) function bc_x(A)
A[ 1, iy, iz] = A[ 2, iy, iz]
A[end, iy, iz] = A[end-1, iy, iz]
return
end
@parallel_indices (ix,iz) function bc_y(A)
A[ ix, 1, iz] = A[ ix, 2, iz]
A[ ix,end, iz] = A[ ix,end-1, iz]
return
end
@parallel_indices (ix,iy) function bc_z(A)
A[ ix, iy, 1] = A[ ix, iy, 2]
A[ ix, iy,end] = A[ ix, iy,end-1]
return
end
@hide_communication (16, 2, 2) begin
@parallel diffusion3D_step!(Te2, Te, Ci, lam, dt, dx, dy, dz);
@parallel (1:size(Te,2), 1:size(Te,3)) bc_x(Te);
@parallel (1:size(Te,1), 1:size(Te,3)) bc_y(Te);
@parallel (1:size(Te,1), 1:size(Te,2)) bc_z(Te);
update_halo!(Te2);
end

The communication should not perform any blocking operations, to enable a maximal overlap of communication with computation.
See also: @parallel
ParallelStencil.@init_parallel_stencil — Macro

@init_parallel_stencil(package, numbertype, ndims)

Initialize the package ParallelStencil, giving access to its main functionality. Creates a module Data in the module where @init_parallel_stencil is called from. The module Data contains the types Data.Number, Data.Array and Data.DeviceArray (type ?Data after calling @init_parallel_stencil to see the full description of the module).
Arguments
package::Module: the package used for parallelization (CUDA or Threads).
numbertype::DataType: the type of numbers used by @zeros, @ones and @rand and in all array types of module Data (e.g. Float32 or Float64). It is contained in Data.Number after @init_parallel_stencil.
ndims::Integer: the number of dimensions used for the stencil computations in the kernels (1, 2 or 3).
See also: Data
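As a minimal sketch of typical usage (the backend, number type and array sizes here are illustrative), initializing ParallelStencil for the CPU makes the allocation macros and the Data module available:

```julia
using ParallelStencil

# Initialize for the CPU backend; swap Threads for CUDA to target GPUs.
@init_parallel_stencil(Threads, Float64, 3)

# Arrays allocated with the provided macros use the selected backend
# and the number type chosen at initialization (here Float64).
A = @zeros(4, 4, 4)
```

The same code can then be deployed on GPUs by changing only the first argument of @init_parallel_stencil.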
ParallelStencil.@ones — Macro

@ones(args...)

Call ones(numbertype, args...), where numbertype is the datatype selected with @init_parallel_stencil and the function ones is chosen to be compatible with the package for parallelization selected with @init_parallel_stencil (ones for Threads and CUDA.ones for CUDA).
ParallelStencil.@parallel — Macro

@parallel kernel

Declare the kernel parallel and containing stencil computations to be performed with one of the submodules ParallelStencil.FiniteDifferences{1D|2D|3D} (or with a compatible custom module or set of macros).
See also: @init_parallel_stencil
@parallel kernelcall
@parallel ranges kernelcall
@parallel nblocks nthreads kernelcall
@parallel ranges nblocks nthreads kernelcall
@parallel (...) kwargs... kernelcall

Declare the kernelcall parallel. The kernel will automatically be called as required by the package for parallelization selected with @init_parallel_stencil. Synchronizes at the end of the call (if a stream is given via keyword arguments, then it synchronizes only this stream).
Arguments
kernelcall: a call to a kernel that is declared parallel.
ranges::Tuple{UnitRange{},UnitRange{},UnitRange{}} | Tuple{UnitRange{},UnitRange{}} | Tuple{UnitRange{}} | UnitRange{}: the ranges of indices in each dimension for which computations must be performed.
nblocks::Tuple{Integer,Integer,Integer}: the number of blocks to be used if the package CUDA was selected with @init_parallel_stencil.
nthreads::Tuple{Integer,Integer,Integer}: the number of threads to be used if the package CUDA was selected with @init_parallel_stencil.
kwargs...: keyword arguments to be passed further to CUDA (ignored for Threads).
Kernel launch parameters are automatically defined with heuristics where they are not set with the optional kernel arguments. For CUDA, nthreads is set to (32,8,1) whenever reasonable, and nblocks accordingly, to ensure that enough threads are launched.
See also: @init_parallel_stencil
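As a sketch of the typical pattern (the kernel name, array sizes and time step are illustrative; grid spacing is taken as 1 for brevity), a kernel is declared with @parallel and then launched with @parallel, which picks the launch parameters automatically:

```julia
using ParallelStencil
using ParallelStencil.FiniteDifferences2D

@init_parallel_stencil(Threads, Float64, 2)

# A 2D diffusion step written with the FiniteDifferences2D macros:
# @inn selects inner points, @d2_xi/@d2_yi are second differences.
@parallel function diffusion2D_step!(T2, T, dt)
    @inn(T2) = @inn(T) + dt * (@d2_xi(T) + @d2_yi(T))
    return
end

T  = @zeros(5, 5); T[3, 3] = 1.0
T2 = @zeros(5, 5)

# Ranges are inferred from the array arguments; the call synchronizes on return.
@parallel diffusion2D_step!(T2, T, 0.1)
```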
ParallelStencil.@parallel_async — Macro

@parallel_async kernelcall
@parallel_async ranges kernelcall
@parallel_async nblocks nthreads kernelcall
@parallel_async ranges nblocks nthreads kernelcall
@parallel_async (...) kwargs... kernelcall

Declare the kernelcall parallel as with @parallel (see @parallel for more information); however, it deactivates the automatic synchronization at the end of the call. Use @synchronize for synchronizing.
@parallel_async currently falls back to running synchronously if the package Threads was selected with @init_parallel_stencil.
See also: @synchronize, @parallel
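A minimal sketch of the asynchronous pattern (the kernel and sizes are illustrative; with the Threads backend the call runs synchronously, as noted above, so the result is the same):

```julia
using ParallelStencil

@init_parallel_stencil(Threads, Float64, 2)

@parallel_indices (ix, iy) function scale!(A, s)
    A[ix, iy] = s * A[ix, iy]
    return
end

A = @ones(4, 4)

# Launch without automatic synchronization, then synchronize explicitly.
@parallel_async (1:4, 1:4) scale!(A, 2.0)
@synchronize()
```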
ParallelStencil.@parallel_indices — Macro

@parallel_indices indices kernel

Declare the kernel parallel and generate the given parallel indices inside the kernel using the package for parallelization selected with @init_parallel_stencil.
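A short sketch of an explicit-indices kernel (the kernel name and sizes are illustrative): the listed indices are generated inside the kernel, and the launch ranges are passed to @parallel, as in the boundary-condition examples above.

```julia
using ParallelStencil

@init_parallel_stencil(Threads, Float64, 2)

# ix and iy are generated by ParallelStencil for the selected backend.
@parallel_indices (ix, iy) function square!(A)
    A[ix, iy] = A[ix, iy]^2
    return
end

A = 2.0 .* @ones(3, 3)
@parallel (1:3, 1:3) square!(A)
```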
ParallelStencil.@ps_println — Macro

@ps_println(...)

Call a macro analogue to Base.println, compatible with the package for parallelization selected with @init_parallel_stencil (Base.println for Threads and CUDA.@cuprintln for CUDA).
ParallelStencil.@ps_show — Macro

@ps_show(...)

Call a macro analogue to Base.@show, compatible with the package for parallelization selected with @init_parallel_stencil (Base.@show for Threads and CUDA.@cushow for CUDA).
ParallelStencil.@rand — Macro

@rand(args...)

Call rand(numbertype, args...), where numbertype is the datatype selected with @init_parallel_stencil and the function rand is chosen/implemented to be compatible with the package for parallelization selected with @init_parallel_stencil.
ParallelStencil.@reset_parallel_stencil — Macro

ParallelStencil.@sharedMem — Macro

@sharedMem(T, dims)

Create an array that is shared between the threads of a block (i.e. accessible only by the threads of the same block), with element type T and size specified by dims.
The amount of shared memory needs to be specified when launching the kernel (keyword argument shmem).
ParallelStencil.@sync_threads — Macro

@sync_threads()

Synchronize the threads of the block: wait until all threads in the block have reached this point, and all global and shared memory accesses made by these threads prior to the @sync_threads() call are visible to all threads in the block.
ParallelStencil.@synchronize — Macro

ParallelStencil.@threadIdx — Macro

@threadIdx()

Return the thread ID in the x, y and z dimensions within the block. The thread ID in a specific dimension is commonly retrieved directly, as in this example for the x dimension: @threadIdx().x.
ParallelStencil.@zeros — Macro

@zeros(args...)

Call zeros(numbertype, args...), where numbertype is the datatype selected with @init_parallel_stencil and the function zeros is chosen to be compatible with the package for parallelization selected with @init_parallel_stencil (zeros for Threads and CUDA.zeros for CUDA).
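The three allocation macros @zeros, @ones and @rand share the same pattern; a brief sketch (array sizes are illustrative, Threads backend assumed):

```julia
using ParallelStencil

@init_parallel_stencil(Threads, Float64, 3)

# All three allocate arrays of the numbertype selected at initialization,
# on the backend selected at initialization (CPU arrays here, CuArrays with CUDA).
A = @zeros(2, 3, 4)
B = @ones(2, 3, 4)
C = @rand(2, 3, 4)
```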