ParallelStencil.jl API
The following API reference is generated from the docstrings available within the ParallelStencil.jl package and is included here for convenient look-up. Please refer to the official repository and ask the package authors if anything is unclear.
ParallelStencil.ParallelStencil — Module

Module ParallelStencil
Enables domain scientists to write high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs.
General overview and examples
https://github.com/omlins/ParallelStencil.jl
Primary macros
Macros available for @parallel_indices kernels
Submodules
ParallelStencil.FiniteDifferences1D
ParallelStencil.FiniteDifferences2D
ParallelStencil.FiniteDifferences3D
Modules generated in caller
To see a description of a macro or module, type ?<macroname> (including the @) or ?<modulename>, respectively.
ParallelStencil.@blockDim — Macro

@blockDim()

Return the block size (or "dimension") in the x, y and z dimensions. The block size in a specific dimension is commonly retrieved directly, as in this example for the x dimension: @blockDim().x.
ParallelStencil.@blockIdx — Macro

@blockIdx()

Return the block ID in the x, y and z dimensions within the grid. The block ID in a specific dimension is commonly retrieved directly, as in this example for the x dimension: @blockIdx().x.
ParallelStencil.@gridDim — Macro

@gridDim()

Return the grid size (or "dimension") in the x, y and z dimensions. The grid size in a specific dimension is commonly retrieved directly, as in this example for the x dimension: @gridDim().x.
ParallelStencil.@hide_communication — Macro

@hide_communication boundary_width block
@hide_communication ranges_outer ranges_inner block

Hide the communication behind the computation within the code block.
Arguments
boundary_width::Tuple{Integer,Integer,Integer} | Tuple{Integer,Integer} | Tuple{Integer}: the width of the boundaries in each dimension. The boundaries must include (at least) all the data that is accessed in the communication performed.
block: a code block which starts with exactly one @parallel call to perform computations, followed by code to set boundary conditions and to perform communication (e.g. update_halo! from the package ImplicitGlobalGrid). The @parallel call to perform computations cannot contain any positional arguments (ranges, nblocks or nthreads) nor the stream keyword argument (stream=...). The code to set boundary conditions and to perform communication must only access the elements in the boundary ranges of the fields modified in the @parallel call; all elements can be accessed from other fields. Moreover, this code must not include statements in array broadcasting notation, because they are always run on the default CUDA stream (for CUDA.jl < v2.0), which makes CUDA stream overlapping impossible. Instead, boundary region elements can, e.g., be accessed with @parallel calls passing a ranges argument that ensures that no threads mapping to elements outside of ranges_outer are launched. Note that these @parallel ranges calls cannot contain any other positional arguments (nblocks or nthreads) nor the stream keyword argument (stream=...).
ranges_outer::Tuple: a tuple with one or multiple ranges as required by the corresponding argument of @parallel. The ranges must together span (at least) all the data that is accessed in the communication and boundary conditions performed.
ranges_inner::Tuple: a tuple with one or multiple ranges as required by the corresponding argument of @parallel. The ranges must together span the data that is not included by ranges_outer.
Examples
@hide_communication (16, 2, 2) begin
@parallel diffusion3D_step!(Te2, Te, Ci, lam, dt, dx, dy, dz);
update_halo!(Te2);
end
@hide_communication (16, 2) begin
@parallel diffusion2D_step!(Te2, Te, Ci, lam, dt, dx, dy);
update_halo!(Te2);
end
@hide_communication ranges_outer ranges_inner begin
@parallel diffusion3D_step!(Te2, Te, Ci, lam, dt, dx, dy, dz);
update_halo!(Te2);
end
@parallel_indices (iy,iz) function bc_x(A)
A[ 1, iy, iz] = A[ 2, iy, iz]
A[end, iy, iz] = A[end-1, iy, iz]
return
end
@parallel_indices (ix,iz) function bc_y(A)
A[ ix, 1, iz] = A[ ix, 2, iz]
A[ ix,end, iz] = A[ ix,end-1, iz]
return
end
@parallel_indices (ix,iy) function bc_z(A)
A[ ix, iy, 1] = A[ ix, iy, 2]
A[ ix, iy,end] = A[ ix, iy,end-1]
return
end
@hide_communication (16, 2, 2) begin
@parallel diffusion3D_step!(Te2, Te, Ci, lam, dt, dx, dy, dz);
@parallel (1:size(Te,2), 1:size(Te,3)) bc_x(Te);
@parallel (1:size(Te,1), 1:size(Te,3)) bc_y(Te);
@parallel (1:size(Te,1), 1:size(Te,2)) bc_z(Te);
update_halo!(Te2);
end

The communication should not perform any blocking operations, to enable a maximal overlap of communication with computation.
See also: @parallel
ParallelStencil.@init_parallel_stencil — Macro

@init_parallel_stencil(package, numbertype, ndims)

Initialize the package ParallelStencil, giving access to its main functionality. Creates a module Data in the module where @init_parallel_stencil is called from. The module Data contains the types Data.Number, Data.Array and Data.DeviceArray (type ?Data after calling @init_parallel_stencil to see the full description of the module).
Arguments
package::Module: the package used for parallelization (CUDA or Threads).
numbertype::DataType: the type of numbers used by @zeros, @ones and @rand and in all array types of module Data (e.g. Float32 or Float64). It is contained in Data.Number after @init_parallel_stencil.
ndims::Integer: the number of dimensions used for the stencil computations in the kernels (1, 2 or 3).
See also: Data
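As a minimal sketch of typical usage (the backend, number type and array sizes here are illustrative), initializing ParallelStencil for the CPU makes the allocation macros and the Data module available:

```julia
using ParallelStencil

# Initialize for the CPU backend; swap Threads for CUDA to target GPUs.
@init_parallel_stencil(Threads, Float64, 3)

# Arrays allocated with the provided macros use the selected backend
# and the number type chosen at initialization (here Float64).
A = @zeros(4, 4, 4)
```

The same code can then be deployed on GPUs by changing only the first argument of @init_parallel_stencil.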
ParallelStencil.@ones — Macro

@ones(args...)

Call ones(numbertype, args...), where numbertype is the datatype selected with @init_parallel_stencil and the function ones is chosen to be compatible with the package for parallelization selected with @init_parallel_stencil (ones for Threads and CUDA.ones for CUDA).
ParallelStencil.@parallel — Macro

@parallel kernel

Declare the kernel parallel and containing stencil computations to be performed with one of the submodules ParallelStencil.FiniteDifferences{1D|2D|3D} (or with a compatible custom module or set of macros).
See also: @init_parallel_stencil
@parallel kernelcall
@parallel ranges kernelcall
@parallel nblocks nthreads kernelcall
@parallel ranges nblocks nthreads kernelcall
@parallel (...) kwargs... kernelcall

Declare the kernelcall parallel. The kernel will automatically be called as required by the package for parallelization selected with @init_parallel_stencil. Synchronizes at the end of the call (if a stream is given via keyword arguments, then it synchronizes only this stream).
Arguments
kernelcall: a call to a kernel that is declared parallel.
ranges::Tuple{UnitRange{},UnitRange{},UnitRange{}} | Tuple{UnitRange{},UnitRange{}} | Tuple{UnitRange{}} | UnitRange{}: the ranges of indices in each dimension for which computations must be performed.
nblocks::Tuple{Integer,Integer,Integer}: the number of blocks to be used if the package CUDA was selected with @init_parallel_stencil.
nthreads::Tuple{Integer,Integer,Integer}: the number of threads to be used if the package CUDA was selected with @init_parallel_stencil.
kwargs...: keyword arguments to be passed further to CUDA (ignored for Threads).
Kernel launch parameters are automatically defined with heuristics where they are not set with the optional kernel arguments. For CUDA, nthreads is set to (32,8,1) whenever reasonable, and nblocks accordingly, to ensure that enough threads are launched.
See also: @init_parallel_stencil
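As a sketch of the typical pattern (the kernel name, array sizes and time step are illustrative; grid spacing is taken as 1 for brevity), a kernel is declared with @parallel and then launched with @parallel, which picks the launch parameters automatically:

```julia
using ParallelStencil
using ParallelStencil.FiniteDifferences2D

@init_parallel_stencil(Threads, Float64, 2)

# A 2D diffusion step written with the FiniteDifferences2D macros:
# @inn selects inner points, @d2_xi/@d2_yi are second differences.
@parallel function diffusion2D_step!(T2, T, dt)
    @inn(T2) = @inn(T) + dt * (@d2_xi(T) + @d2_yi(T))
    return
end

T  = @zeros(5, 5); T[3, 3] = 1.0
T2 = @zeros(5, 5)

# Ranges are inferred from the array arguments; the call synchronizes on return.
@parallel diffusion2D_step!(T2, T, 0.1)
```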
ParallelStencil.@parallel_async — Macro

@parallel_async kernelcall
@parallel_async ranges kernelcall
@parallel_async nblocks nthreads kernelcall
@parallel_async ranges nblocks nthreads kernelcall
@parallel_async (...) kwargs... kernelcall

Declare the kernelcall parallel as with @parallel (see @parallel for more information); however, it deactivates the automatic synchronization at the end of the call. Use @synchronize for synchronizing.
@parallel_async currently falls back to running synchronously if the package Threads was selected with @init_parallel_stencil.
See also: @synchronize, @parallel
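A minimal sketch of the asynchronous pattern (the kernel and sizes are illustrative; with the Threads backend the call runs synchronously, as noted above, so the result is the same):

```julia
using ParallelStencil

@init_parallel_stencil(Threads, Float64, 2)

@parallel_indices (ix, iy) function scale!(A, s)
    A[ix, iy] = s * A[ix, iy]
    return
end

A = @ones(4, 4)

# Launch without automatic synchronization, then synchronize explicitly.
@parallel_async (1:4, 1:4) scale!(A, 2.0)
@synchronize()
```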
ParallelStencil.@parallel_indices — Macro

@parallel_indices indices kernel

Declare the kernel parallel and generate the given parallel indices inside the kernel using the package for parallelization selected with @init_parallel_stencil.
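A short sketch of an explicit-indices kernel (the kernel name and sizes are illustrative): the listed indices are generated inside the kernel, and the launch ranges are passed to @parallel, as in the boundary-condition examples above.

```julia
using ParallelStencil

@init_parallel_stencil(Threads, Float64, 2)

# ix and iy are generated by ParallelStencil for the selected backend.
@parallel_indices (ix, iy) function square!(A)
    A[ix, iy] = A[ix, iy]^2
    return
end

A = 2.0 .* @ones(3, 3)
@parallel (1:3, 1:3) square!(A)
```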
ParallelStencil.@ps_println — Macro

@ps_println(...)

Call a macro analogue to Base.println, compatible with the package for parallelization selected with @init_parallel_stencil (Base.println for Threads and CUDA.@cuprintln for CUDA).
ParallelStencil.@ps_show — Macro

@ps_show(...)

Call a macro analogue to Base.@show, compatible with the package for parallelization selected with @init_parallel_stencil (Base.@show for Threads and CUDA.@cushow for CUDA).
ParallelStencil.@rand — Macro

@rand(args...)

Call rand(numbertype, args...), where numbertype is the datatype selected with @init_parallel_stencil and the function rand is chosen/implemented to be compatible with the package for parallelization selected with @init_parallel_stencil.
ParallelStencil.@reset_parallel_stencil — Macro

ParallelStencil.@sharedMem — Macro

@sharedMem(T, dims)

Create an array that is shared between the threads of a block (i.e. accessible only by the threads of the same block), with element type T and size specified by dims.
The amount of shared memory needs to be specified when launching the kernel (keyword argument shmem).
ParallelStencil.@sync_threads — Macro

@sync_threads()

Synchronize the threads of the block: wait until all threads in the block have reached this point, and all global and shared memory accesses made by these threads prior to the @sync_threads() call are visible to all threads in the block.
ParallelStencil.@synchronize — Macro

ParallelStencil.@threadIdx — Macro

@threadIdx()

Return the thread ID in the x, y and z dimensions within the block. The thread ID in a specific dimension is commonly retrieved directly, as in this example for the x dimension: @threadIdx().x.
ParallelStencil.@zeros — Macro

@zeros(args...)

Call zeros(numbertype, args...), where numbertype is the datatype selected with @init_parallel_stencil and the function zeros is chosen to be compatible with the package for parallelization selected with @init_parallel_stencil (zeros for Threads and CUDA.zeros for CUDA).
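The three allocation macros @zeros, @ones and @rand share the same pattern; a brief sketch (array sizes are illustrative, Threads backend assumed):

```julia
using ParallelStencil

@init_parallel_stencil(Threads, Float64, 3)

# All three allocate arrays of the numbertype selected at initialization,
# on the backend selected at initialization (CPU arrays here, CuArrays with CUDA).
A = @zeros(2, 3, 4)
B = @ones(2, 3, 4)
C = @rand(2, 3, 4)
```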