$ curl -fsSL https://install.julialang.org | sh
$ julia
pkg> add CUDA # ~2min
   Resolving package versions...
   Installed CUDA_Driver_jll ── v0.6.0+3
   Installed LLVMExtra_jll ──── v0.0.26+0
   ...
   Installed CUDA ───────────── v5.0.0
 Downloading artifact: CUDA_Driver

; module list

Currently Loaded Modules:
  1) craype-x86-milan                        8) cray-mpich/8.1.25
  2) libfabric/1.15.2.0                      9) craype/2.7.20
  3) craype-network-ofi                     10) gcc/11.2.0
  4) xpmem/2.6.2-2.5_2.27__gd067c3f.shasta  11) perftools-base/23.03.0
  5) PrgEnv-gnu/8.3.3                       12) cpe/23.03
  6) cray-dsmml/0.2.2                       13) xalt/2.10.2
  7) cray-libsci/23.02.1.1                  14) cray-python/3.9.13.1   (dev)

  Where:
   dev:  Development Tools and Programming Languages

using CUDA

CUDA.versioninfo()

CUDA runtime 12.2, artifact installation
CUDA driver 12.2
NVIDIA driver 525.105.17, originally for CUDA 12.0

CUDA libraries: 
- CUBLAS: 12.2.5
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.2
- CUSPARSE: 12.1.2
- CUPTI: 20.0.0
- NVML: 12.0.0+525.105.17

Julia packages: 
- CUDA: 5.0.0
- CUDA_Driver_jll: 0.6.0+3
- CUDA_Runtime_jll: 0.9.2+0

Toolchain:
- Julia: 1.9.3
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

4 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.389 GiB / 40.000 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.389 GiB / 40.000 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.389 GiB / 40.000 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.389 GiB / 40.000 GiB available)

arr = rand(10_000_000)

10000000-element Vector{Float64}:
 0.20079028039355207
 0.2551683713911349
 0.07850631788245288
 ⋮
 0.18280216971091756
 0.5304310135460691

carr = cu(arr)

10000000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.20079029
 0.25516838
 0.07850632
 ⋮
 0.18280217
 0.53043103
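
The output above shows that `cu` changed the element type from `Float64` to `Float32`. A small CPU-only sketch of what that conversion costs in precision, using the first value printed above (the variable names are illustrative, and the claim that `cu` rounds each element this way is inferred from the printed output):

```julia
x64 = 0.20079028039355207      # first element of `arr` as printed above
x32 = Float32(x64)             # round to single precision, as the `cu` output suggests

# For values below 1 the absolute rounding error stays within the Float32 machine epsilon
rounding_error = abs(Float64(x32) - x64)
rounding_error <= eps(Float32)
```

If double precision is needed on the device, constructing the array with `CuArray(arr)` instead of `cu(arr)` keeps the `Float64` element type.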

sin.(carr) .+ 1

10000000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.1994438
 1.2524083
 1.0784256
 ⋮
 1.1817858
 1.5059052

using BenchmarkTools

@btime CUDA.@sync sin.(carr) .+ 1;

  83.051 μs (41 allocations: 2.00 KiB)

@btime sin.(arr) .+ 1;

  68.599 ms (6 allocations: 76.29 MiB)
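
The two timings above can be turned into a rough speedup and an effective memory bandwidth. This is back-of-the-envelope arithmetic only: it assumes one 4-byte read and one 4-byte write per element on the GPU, and note that the CPU run used `Float64` while the GPU run used `Float32`.

```julia
cpu_time = 68.599e-3    # seconds, from the CPU benchmark above
gpu_time = 83.051e-6    # seconds, from the GPU benchmark above
n = 10_000_000          # number of elements

speedup = cpu_time / gpu_time                 # ≈ 826

# Assumed traffic: read one Float32 and write one Float32 per element
bytes_moved = n * sizeof(Float32) * 2
bandwidth_GBs = bytes_moved / gpu_time / 1e9  # effective GB/s under that assumption
```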

CUDA.@profile sin.(carr) .+ 1;

Profiler ran for 326.63 µs, capturing 11 events.

Host-side activity: calling CUDA APIs took 61.75 µs (18.91% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬─────────────────────────┐
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                    │
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼─────────────────────────┤
│    9.56% │ 31.23 µs │     1 │ 31.23 µs │ 31.23 µs │ 31.23 µs │ cuLaunchKernel          │
│    7.66% │ 25.03 µs │     1 │ 25.03 µs │ 25.03 µs │ 25.03 µs │ cuMemAllocFromPoolAsync │
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴─────────────────────────┘

Device-side activity: GPU was busy for 81.54 µs (24.96% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                                                                                                         ⋯
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   24.96% │ 81.54 µs │     1 │ 81.54 µs │ 81.54 µs │ 81.54 µs │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5Tup ⋯
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                               1 column omitted

struct Point{T}
    x :: T
    y :: T
end

arr = Point.(rand(100), rand(100))
carr = cu(arr)

100-element CuArray{Point{Float64}, 1, CUDA.Mem.DeviceBuffer}:
 Point{Float64}(0.8490008946627912, 0.48658520886875856)
 Point{Float64}(0.06977616429006461, 0.2501647436222665)
 Point{Float64}(0.6924522464442648, 0.2656146874146924)
 ⋮
 Point{Float64}(0.8748131265382463, 0.2480993353552592)
 Point{Float64}(0.49503190701987954, 0.13355513219798332)

distance_from_origin(p::Point) = sqrt(p.x^2 + p.y^2)

distance_from_origin (generic function with 1 method)

distance_from_origin.(carr)

100-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
 0.9785538741572041
 0.2597135191988057
 0.7416476763100615
 ⋮
 0.9093136348737674
 0.5127314719267381

function distance_from_origin_bad(p::Point)
    sqrt(sum([p.x^2, p.y^2]))
end

distance_from_origin_bad (generic function with 1 method)

distance_from_origin_bad.(carr)

InvalidIRError: compiling MethodInstance for (::GPUArrays.var"#broadcast_kernel#32")(::CUDA.CuKernelContext, ::CuDeviceVector{Float64, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(distance_from_origin_bad), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Point{Float64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64) resulted in invalid LLVM IR
Reason: unsupported call through a literal pointer (call to ijl_alloc_array_1d)
Stacktrace:
  [1] Array
    @ ./boot.jl:477
  [2] Array
    @ ./boot.jl:486
  [3] similar
    @ ./abstractarray.jl:884
  [4] similar
    @ ./abstractarray.jl:883
  [5] _array_for
    @ ./array.jl:671
  [6] _array_for
    @ ./array.jl:674
  [7] vect
    @ ./array.jl:126
  [8] distance_from_origin_bad
    @ ./In[19]:2
  [9] _broadcast_getindex_evalf
    @ ./broadcast.jl:683
 [10] _broadcast_getindex
    @ ./broadcast.jl:656
 [11] getindex
    @ ./broadcast.jl:610
 [12] broadcast_kernel
    @ ~/.julia/packages/GPUArrays/EZkix/src/host/broadcast.jl:64
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl

Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/validation.jl:147
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:440 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:439 [inlined]
  [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/utils.jl:92
  [6] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:129
  [7] codegen
    @ ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:110 [inlined]
  [8] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:106
  [9] compile
    @ ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:98 [inlined]
 [10] #1042
    @ ~/.julia/packages/CUDA/nbRJk/src/compiler/compilation.jl:166 [inlined]
 [11] JuliaContext(f::CUDA.var"#1042#1045"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:47
 [12] compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/nbRJk/src/compiler/compilation.jl:165
 [13] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/execution.jl:125
 [14] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/execution.jl:103
 [15] macro expansion
    @ ~/.julia/packages/CUDA/nbRJk/src/compiler/execution.jl:323 [inlined]
 [16] macro expansion
    @ ./lock.jl:267 [inlined]
 [17] cufunction(f::GPUArrays.var"#broadcast_kernel#32", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(distance_from_origin_bad), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Point{Float64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/nbRJk/src/compiler/execution.jl:318
 [18] cufunction
    @ ~/.julia/packages/CUDA/nbRJk/src/compiler/execution.jl:315 [inlined]
 [19] macro expansion
    @ ~/.julia/packages/CUDA/nbRJk/src/compiler/execution.jl:104 [inlined]
 [20] #launch_heuristic#1087
    @ ~/.julia/packages/CUDA/nbRJk/src/gpuarrays.jl:17 [inlined]
 [21] launch_heuristic
    @ ~/.julia/packages/CUDA/nbRJk/src/gpuarrays.jl:15 [inlined]
 [22] _copyto!
    @ ~/.julia/packages/GPUArrays/EZkix/src/host/broadcast.jl:70 [inlined]
 [23] copyto!
    @ ~/.julia/packages/GPUArrays/EZkix/src/host/broadcast.jl:51 [inlined]
 [24] copy
    @ ~/.julia/packages/GPUArrays/EZkix/src/host/broadcast.jl:42 [inlined]
 [25] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(distance_from_origin_bad), Tuple{CuArray{Point{Float64}, 1, CUDA.Mem.DeviceBuffer}}})
    @ Base.Broadcast ./broadcast.jl:873
 [26] top-level scope
    @ In[20]:1
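
The error above comes from the array literal `[p.x^2, p.y^2]`, which needs a heap allocation (`ijl_alloc_array_1d`) that is not available inside a GPU kernel. A minimal sketch of one way around it, assuming a tuple is an acceptable substitute: tuples of numbers need no heap allocation. The name `distance_from_origin_tuple` is hypothetical, and the example below runs on the CPU so it can be checked without a GPU.

```julia
struct Point{T}
    x :: T
    y :: T
end

# Same computation as distance_from_origin_bad, but summing a tuple
# instead of a freshly allocated Vector
distance_from_origin_tuple(p::Point) = sqrt(sum((p.x^2, p.y^2)))

pts = Point.([3.0, 5.0], [4.0, 12.0])
distance_from_origin_tuple.(pts)   # → [5.0, 13.0]
```

Since this version contains no array allocation, broadcasting it over `cu(pts)` should avoid the `ijl_alloc_array_1d` call reported above.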

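function my_kernel(carr_out, carr)
    # global (1-based) index of this thread across all blocks
    start = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # total number of threads in the launch
    stride = blockDim().x * gridDim().x
    len = length(carr)
    for i = start:stride:len  # "grid-stride" loop
        carr_out[i] = sin(carr[i]) + 1
    end
    return  # kernels must return nothing
end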

my_kernel (generic function with 1 method)
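
The grid-stride pattern in `my_kernel` can be checked on the CPU by replaying the same index arithmetic for every (block, thread) pair. This is only an emulation sketch: `count_visits` is a hypothetical helper, and the loops stand in for what the GPU runs in parallel.

```julia
# Count how often each index is touched when `blocks` blocks of `threads`
# threads each run the grid-stride loop over an array of length `len`.
function count_visits(len, threads, blocks)
    visits = zeros(Int, len)
    for block in 1:blocks, thread in 1:threads
        start = (block - 1) * threads + thread   # same formula as in the kernel
        stride = threads * blocks
        for i in start:stride:len
            visits[i] += 1
        end
    end
    return visits
end

# Every element is handled exactly once, whatever the launch size
all(==(1), count_visits(1000, 256, 1))   # → true
all(==(1), count_visits(1000, 32, 4))    # → true
```

This is why a kernel written this way stays correct even when the launch uses far fewer threads than there are elements.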

carr = cu(rand(10_000_000))
carr_out = similar(carr);

@cuda threads=256 my_kernel(carr_out, carr)  # a single block of 256 threads (blocks defaults to 1); the grid-stride loop covers the rest

CUDA.HostKernel for my_kernel(CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1})

carr_out

10000000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.264759
 1.3415114
 1.0137947
 ⋮
 1.7018087
 1.215888

CUDA.devices()

CUDA.DeviceIterator() for 4 devices:
0. NVIDIA A100-SXM4-40GB
1. NVIDIA A100-SXM4-40GB
2. NVIDIA A100-SXM4-40GB
3. NVIDIA A100-SXM4-40GB

CUDA.device()

CuDevice(0): NVIDIA A100-SXM4-40GB

CUDA.device!(1)

CuDevice(1): NVIDIA A100-SXM4-40GB

arr = rand(10_000_000)
carr = cu(arr)
@btime CUDA.@sync sin.(carr) .+ 1;

  85.705 μs (41 allocations: 2.00 KiB)

GC.gc()
CUDA.reclaim()

CUDA.device!(0)

CuDevice(0): NVIDIA A100-SXM4-40GB

using Distributed

addprocs(3)

3-element Vector{Int64}:
 2
 3
 4

@everywhere using CUDA, BenchmarkTools

@everywhere procs() println((myid(), CUDA.device()))

(1, CuDevice(0))
      From worker 3:	(3, CuDevice(0))
      From worker 2:	(2, CuDevice(0))
      From worker 4:	(4, CuDevice(0))

@everywhere procs() CUDA.device!(myid()-1)

@everywhere procs() println((myid(), CUDA.device()))

(1, CuDevice(0))
      From worker 2:	(2, CuDevice(1))
      From worker 3:	(3, CuDevice(2))
      From worker 4:	(4, CuDevice(3))
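
The assignment `CUDA.device!(myid()-1)` above works because there are exactly as many processes as GPUs, with process ids starting at 1 and device ids starting at 0. A minimal sketch of a wrap-around variant, assuming there may be more processes than devices; `device_index_for` is a hypothetical helper, not part of CUDA.jl.

```julia
# Hypothetical helper: map a 1-based process id to a 0-based GPU index,
# wrapping around when there are more processes than devices.
device_index_for(id::Integer, ndevices::Integer) = mod(id - 1, ndevices)

# With 4 GPUs, processes 1:8 would be assigned round-robin
assignments = [device_index_for(id, 4) for id in 1:8]   # → [0, 1, 2, 3, 0, 1, 2, 3]
```

On a GPU node the same idea could presumably be applied as `@everywhere CUDA.device!(mod(myid() - 1, length(CUDA.devices())))`.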

let
    carr = cu(rand(10_000_000))
    pmap(WorkerPool(procs()), 1:4) do i
        @btime CUDA.@sync sin.($carr) .+ 1
    end
end

  85.255 μs (37 allocations: 1.91 KiB)
      From worker 3:	  81.938 μs (37 allocations: 1.91 KiB)
      From worker 2:	  82.238 μs (37 allocations: 1.91 KiB)
      From worker 4:	  81.597 μs (37 allocations: 1.91 KiB)

4-element Vector{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}:
 Float32[1.8228395, 1.0992318, 1.759596, 1.8266902, 1.1692553, 1.3392247, 1.7539189, 1.7294312, 1.705107, 1.6065087  …  1.4714574, 1.0195016, 1.0635127, 1.5888524, 1.7992377, 1.333357, 1.2087038, 1.0803039, 1.7202525, 1.5920569]
 Float32[1.8228395, 1.0992318, 1.759596, 1.8266902, 1.1692553, 1.3392247, 1.7539189, 1.7294312, 1.705107, 1.6065087  …  1.4714574, 1.0195016, 1.0635127, 1.5888524, 1.7992377, 1.333357, 1.2087038, 1.0803039, 1.7202525, 1.5920569]
 Float32[1.8228395, 1.0992318, 1.759596, 1.8266902, 1.1692553, 1.3392247, 1.7539189, 1.7294312, 1.705107, 1.6065087  …  1.4714574, 1.0195016, 1.0635127, 1.5888524, 1.7992377, 1.333357, 1.2087038, 1.0803039, 1.7202525, 1.5920569]
 Float32[1.8228395, 1.0992318, 1.759596, 1.8266902, 1.1692553, 1.3392247, 1.7539189, 1.7294312, 1.705107, 1.6065087  …  1.4714574, 1.0195016, 1.0635127, 1.5888524, 1.7992377, 1.333357, 1.2087038, 1.0803039, 1.7202525, 1.5920569]

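using ClusterManagers, Sockets  # Sockets provides IPv4

# Perlmutter specific: use the address of the high-speed network interface (hsn0)
hsn0_matches = filter(!isnothing, match.(r"inet (.*)/.*hsn0", readlines(`ip a show`)))
em = ElasticManager(
    addr = IPv4(first(hsn0_matches).captures[1]),
    port = 0  # 0 requests any free port (35449 in the output below)
);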

em

ElasticManager:
  Active workers : [ 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36]
  Number of workers to be added  : 0
  Terminated workers : []
  Worker connect command : 
    /global/u1/m/marius/.julia/juliaup/julia-1.9.3+0.x64.linux.gnu/bin/julia --project=/global/u1/m/marius/work/gpu_science_day_julia/Project.toml -e 'using ClusterManagers; ClusterManagers.elastic_worker("6Ty6RCu5sIy5CedV","10.249.6.77",35449)'
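
The `addr` expression used to construct the `ElasticManager` scans the output of `ip a show` for the `inet` line of the `hsn0` interface. A sketch of the same parsing on canned input, so it can be tried without that interface present; the sample lines and the helper name `hsn0_address` are made up for illustration.

```julia
using Sockets

# Hypothetical excerpt of `ip a show` output
sample_lines = [
    "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536",
    "    inet 127.0.0.1/8 scope host lo",
    "    inet 10.249.6.77/16 brd 10.249.255.255 scope global hsn0",
]

# Return the first IPv4 address found on an `inet ... hsn0` line, or nothing
function hsn0_address(lines)
    for line in lines
        m = match(r"inet (.*)/.*hsn0", line)
        m === nothing || return IPv4(m.captures[1])
    end
    return nothing
end

hsn0_address(sample_lines)   # → ip"10.249.6.77"
```

Returning `nothing` when no line matches makes the failure explicit, whereas calling `first` on an empty filter result would throw.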

$ salloc -C gpu -q regular -t 00:30:00 --cpus-per-task 32 --gpus-per-task 1 --ntasks-per-node 4 --nodes 8 -A mp107

using CUDADistributedTools

CUDADistributedTools.assign_GPU_workers()

┌ Info: Processes (36):
│  (myid = 1, host = nid001293, device = CuDevice(0): NVIDIA A100-SXM4-40GB 1c40175b))
│  (myid = 2, host = nid001293, device = CuDevice(1): NVIDIA A100-SXM4-40GB f179efe2))
│  (myid = 3, host = nid001293, device = CuDevice(2): NVIDIA A100-SXM4-40GB 36d32866))
│  (myid = 4, host = nid001293, device = CuDevice(3): NVIDIA A100-SXM4-40GB 634451b9))
│  (myid = 5, host = nid002532, device = CuDevice(0): NVIDIA A100-SXM4-40GB 892d65ed))
│  (myid = 6, host = nid002532, device = CuDevice(0): NVIDIA A100-SXM4-40GB 0212ac25))
│  (myid = 7, host = nid002532, device = CuDevice(0): NVIDIA A100-SXM4-40GB 9f1b5f73))
│  (myid = 8, host = nid002532, device = CuDevice(0): NVIDIA A100-SXM4-40GB b9ac9c36))
│  (myid = 9, host = nid002536, device = CuDevice(0): NVIDIA A100-SXM4-40GB 1ffb4f18))
│  (myid = 10, host = nid002536, device = CuDevice(0): NVIDIA A100-SXM4-40GB a25217d5))
│  (myid = 11, host = nid002536, device = CuDevice(0): NVIDIA A100-SXM4-40GB 2ad12529))
│  (myid = 12, host = nid002536, device = CuDevice(0): NVIDIA A100-SXM4-40GB 91817c8d))
│  (myid = 13, host = nid003320, device = CuDevice(0): NVIDIA A100-SXM4-40GB 6f8ab1df))
│  (myid = 14, host = nid003320, device = CuDevice(0): NVIDIA A100-SXM4-40GB 014e077e))
│  (myid = 15, host = nid003320, device = CuDevice(0): NVIDIA A100-SXM4-40GB 38a58e41))
│  (myid = 16, host = nid003320, device = CuDevice(0): NVIDIA A100-SXM4-40GB a860a000))
│  (myid = 17, host = nid003316, device = CuDevice(0): NVIDIA A100-SXM4-40GB fdfe719c))
│  (myid = 18, host = nid003316, device = CuDevice(0): NVIDIA A100-SXM4-40GB 547b9f5c))
│  (myid = 19, host = nid003316, device = CuDevice(0): NVIDIA A100-SXM4-40GB ee15a3a3))
│  (myid = 20, host = nid003316, device = CuDevice(0): NVIDIA A100-SXM4-40GB 32342d63))
│  (myid = 21, host = nid002533, device = CuDevice(0): NVIDIA A100-SXM4-40GB f5695274))
│  (myid = 22, host = nid002533, device = CuDevice(0): NVIDIA A100-SXM4-40GB 4cdbb673))
│  (myid = 23, host = nid002533, device = CuDevice(0): NVIDIA A100-SXM4-40GB 0130c469))
│  (myid = 24, host = nid003317, device = CuDevice(0): NVIDIA A100-SXM4-40GB 5aeab6da))
│  (myid = 25, host = nid003317, device = CuDevice(0): NVIDIA A100-SXM4-40GB 44081cfa))
│  (myid = 26, host = nid003317, device = CuDevice(0): NVIDIA A100-SXM4-40GB 0a4aa27d))
│  (myid = 27, host = nid003317, device = CuDevice(0): NVIDIA A100-SXM4-40GB 06c81dc5))
│  (myid = 28, host = nid002533, device = CuDevice(0): NVIDIA A100-SXM4-40GB 20734e54))
│  (myid = 29, host = nid003313, device = CuDevice(0): NVIDIA A100-SXM4-40GB 60aa76e5))
│  (myid = 30, host = nid003313, device = CuDevice(0): NVIDIA A100-SXM4-40GB 249cdaab))
│  (myid = 31, host = nid003313, device = CuDevice(0): NVIDIA A100-SXM4-40GB da05c388))
│  (myid = 32, host = nid003313, device = CuDevice(0): NVIDIA A100-SXM4-40GB dc1c5e9d))
│  (myid = 33, host = nid003321, device = CuDevice(0): NVIDIA A100-SXM4-40GB 3b627630))
│  (myid = 34, host = nid003321, device = CuDevice(0): NVIDIA A100-SXM4-40GB 76cf68e2))
│  (myid = 35, host = nid003321, device = CuDevice(0): NVIDIA A100-SXM4-40GB b2cb91b4))
└  (myid = 36, host = nid003321, device = CuDevice(0): NVIDIA A100-SXM4-40GB 4f421754))

@everywhere using CUDA, BenchmarkTools

let
    carr = cu(rand(10_000_000))
    pmap(WorkerPool(procs()), 1:nprocs()) do i
        @btime CUDA.@sync sin.($carr) .+ 1
        return nothing
    end
end;

  85.174 μs (37 allocations: 1.91 KiB)
      From worker 3:	  81.828 μs (37 allocations: 1.91 KiB)
      From worker 4:	  81.768 μs (37 allocations: 1.91 KiB)
      From worker 2:	  82.759 μs (37 allocations: 1.91 KiB)
      From worker 5:	  81.707 μs (37 allocations: 1.91 KiB)
      From worker 8:	  81.758 μs (37 allocations: 1.91 KiB)
      From worker 7:	  81.897 μs (37 allocations: 1.91 KiB)
      From worker 9:	  82.840 μs (37 allocations: 1.91 KiB)
      From worker 10:	  81.388 μs (37 allocations: 1.91 KiB)
      From worker 12:	  81.898 μs (37 allocations: 1.91 KiB)
      From worker 11:	  81.638 μs (37 allocations: 1.91 KiB)
      From worker 6:	  83.030 μs (37 allocations: 1.91 KiB)
      From worker 15:	  81.608 μs (37 allocations: 1.91 KiB)
      From worker 13:	  83.251 μs (37 allocations: 1.91 KiB)
      From worker 16:	  82.169 μs (37 allocations: 1.91 KiB)
      From worker 14:	  83.382 μs (37 allocations: 1.91 KiB)
      From worker 19:	  82.439 μs (37 allocations: 1.91 KiB)
      From worker 18:	  82.630 μs (37 allocations: 1.91 KiB)
      From worker 22:	  81.197 μs (37 allocations: 1.91 KiB)
      From worker 26:	  82.680 μs (37 allocations: 1.91 KiB)
      From worker 17:	  83.001 μs (37 allocations: 1.91 KiB)
      From worker 21:	  81.497 μs (37 allocations: 1.91 KiB)
      From worker 29:	  81.617 μs (37 allocations: 1.91 KiB)
      From worker 20:	  82.448 μs (37 allocations: 1.91 KiB)
      From worker 24:	  81.669 μs (37 allocations: 1.91 KiB)
      From worker 25:	  82.230 μs (37 allocations: 1.91 KiB)
      From worker 33:	  81.507 μs (37 allocations: 1.91 KiB)
      From worker 32:	  81.618 μs (37 allocations: 1.91 KiB)
      From worker 30:	  81.447 μs (37 allocations: 1.91 KiB)
      From worker 35:	  81.908 μs (37 allocations: 1.91 KiB)
      From worker 31:	  81.718 μs (37 allocations: 1.91 KiB)
      From worker 27:	  82.640 μs (37 allocations: 1.91 KiB)
      From worker 34:	  81.577 μs (37 allocations: 1.91 KiB)
      From worker 28:	  82.680 μs (37 allocations: 1.91 KiB)
      From worker 23:	  81.618 μs (37 allocations: 1.91 KiB)
      From worker 36:	  81.768 μs (37 allocations: 1.91 KiB)

pkg> add MPI MPIPreferences

julia> using MPIPreferences

julia> MPIPreferences.use_system_binary(;vendor="cray", mpiexec="srun") # <- options are Perlmutter specific

┌ Info: MPI implementation identified
│   libmpi = "libmpi_gnu_91.so"
│   version_string = "MPI VERSION    : CRAY MPICH version 8.1.25.17 (ANL base 3.4a2)\nMPI BUILD INFO : Sun Feb 26 15:15 2023 (git hash aecd99f)\n"
│   impl = "CrayMPICH"
│   version = v"8.1.25"
└   abi = "MPICH"
┌ Info: MPIPreferences changed
│   binary = "system"
│   libmpi = "libmpi_gnu_91.so"
│   abi = "MPICH"
│   mpiexec = "srun"
│   preloads =
│    1-element Vector{String}:
│     "libmpi_gtl_cuda.so"
└   preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"

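#!/bin/bash
#SBATCH -C gpu -q regular -A mp107
#SBATCH -t 00:05:00
#SBATCH --cpus-per-task 32 --gpus-per-task 1 --ntasks-per-node 4 --nodes 4
# This file is both a bash script and a Julia script: bash sees the next line as
# a comment and runs srun on this same file, while Julia skips the block comment.
#=
srun /global/u1/m/marius/.julia/juliaup/julia-1.9.3+0.x64.linux.gnu/bin/julia $(scontrol show job $SLURM_JOBID | awk -F= '/Command=/{print $2}')
exit 0
# =#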

using MPIClusterManagers, Distributed, CUDA, BenchmarkTools
mgr = MPIClusterManagers.start_main_loop(MPIClusterManagers.MPI_TRANSPORT_ALL)

let
    carr = cu(rand(10_000_000))
    pmap(WorkerPool(procs()), 1:nprocs()) do i
        @btime CUDA.@sync sin.($carr) .+ 1
    end
end

MPIClusterManagers.stop_main_loop(mgr)

let
    carr = cu(rand(10_000_000))
    pmap(WorkerPool(procs()), 1:nprocs()) do i
        @btime CUDA.@sync sin.($carr) .+ 1
        return nothing
    end
end;

  85.515 μs (37 allocations: 1.91 KiB)
      From worker 5:	  81.698 μs (37 allocations: 1.91 KiB)
      From worker 6:	  81.728 μs (37 allocations: 1.91 KiB)
      From worker 2:	  81.918 μs (37 allocations: 1.91 KiB)
      From worker 8:	  81.697 μs (37 allocations: 1.91 KiB)
      From worker 7:	  81.637 μs (37 allocations: 1.91 KiB)
      From worker 9:	  81.338 μs (37 allocations: 1.91 KiB)
      From worker 10:	  81.197 μs (37 allocations: 1.91 KiB)
      From worker 3:	  81.867 μs (37 allocations: 1.91 KiB)
      From worker 13:	  81.968 μs (37 allocations: 1.91 KiB)
      From worker 4:	  81.838 μs (37 allocations: 1.91 KiB)
      From worker 11:	  81.668 μs (37 allocations: 1.91 KiB)
      From worker 12:	  82.049 μs (37 allocations: 1.91 KiB)
      From worker 15:	  80.727 μs (37 allocations: 1.91 KiB)
      From worker 14:	  81.378 μs (37 allocations: 1.91 KiB)
      From worker 16:	  82.159 μs (37 allocations: 1.91 KiB)
      From worker 17:	  81.297 μs (37 allocations: 1.91 KiB)
      From worker 18:	  81.277 μs (37 allocations: 1.91 KiB)
      From worker 20:	  81.357 μs (37 allocations: 1.91 KiB)
      From worker 21:	  81.899 μs (37 allocations: 1.91 KiB)
      From worker 19:	  81.637 μs (37 allocations: 1.91 KiB)
      From worker 23:	  81.597 μs (37 allocations: 1.91 KiB)
      From worker 22:	  81.558 μs (37 allocations: 1.91 KiB)
      From worker 25:	  81.738 μs (37 allocations: 1.91 KiB)
      From worker 24:	  81.587 μs (37 allocations: 1.91 KiB)
      From worker 26:	  81.558 μs (37 allocations: 1.91 KiB)
      From worker 28:	  81.688 μs (37 allocations: 1.91 KiB)
      From worker 27:	  81.808 μs (37 allocations: 1.91 KiB)
      From worker 29:	  81.798 μs (37 allocations: 1.91 KiB)
      From worker 30:	  82.329 μs (37 allocations: 1.91 KiB)
      From worker 31:	  81.658 μs (37 allocations: 1.91 KiB)
      From worker 33:	  81.668 μs (37 allocations: 1.91 KiB)
      From worker 32:	  81.788 μs (37 allocations: 1.91 KiB)
      From worker 35:	  81.457 μs (37 allocations: 1.91 KiB)
      From worker 34:	  81.778 μs (37 allocations: 1.91 KiB)
      From worker 36:	  81.838 μs (37 allocations: 1.91 KiB)

using ParameterizedNotebooks

nb = ParameterizedNotebook("talk.ipynb", sections=("Some code in a notebook:",))

ParameterizedNotebook("talk.ipynb")
□ ~
  □ Julia + Jupyter + GPU = ⚗️🔬🧬🥰
    □ Outline
    □ Motivation
    □ Install
    □ Basic usage
    □ Power of Julia (1)
    □ Limitations
    □ Power of Julia (2)
    □ Multi-GPU (single node)
    □ Multi-GPU (multiple nodes, elastic)
    □ Multi-GPU (multiple nodes, MPI)
    □ Multi-GPU (multiple nodes, MPI, notebooks)
      ☒ Some code in a notebook:
        ☒ …
      □ Now use:
    □ Conclusions

nb()

  85.074 μs (37 allocations: 1.91 KiB)
      From worker 5:	  82.029 μs (37 allocations: 1.91 KiB)
      From worker 6:	  82.058 μs (37 allocations: 1.91 KiB)
      From worker 7:	  81.968 μs (37 allocations: 1.91 KiB)
      From worker 9:	  81.999 μs (37 allocations: 1.91 KiB)
      From worker 3:	  81.838 μs (37 allocations: 1.91 KiB)
      From worker 8:	  81.818 μs (37 allocations: 1.91 KiB)
      From worker 10:	  81.929 μs (37 allocations: 1.91 KiB)
      From worker 4:	  82.148 μs (37 allocations: 1.91 KiB)
      From worker 11:	  81.808 μs (37 allocations: 1.91 KiB)
      From worker 13:	  82.149 μs (37 allocations: 1.91 KiB)
      From worker 12:	  81.799 μs (37 allocations: 1.91 KiB)
      From worker 14:	  81.948 μs (37 allocations: 1.91 KiB)
      From worker 2:	  82.078 μs (37 allocations: 1.91 KiB)
      From worker 16:	  81.909 μs (37 allocations: 1.91 KiB)
      From worker 17:	  81.908 μs (37 allocations: 1.91 KiB)
      From worker 15:	  81.689 μs (37 allocations: 1.91 KiB)
      From worker 18:	  82.139 μs (37 allocations: 1.91 KiB)
      From worker 19:	  81.988 μs (37 allocations: 1.91 KiB)
      From worker 21:	  81.537 μs (37 allocations: 1.91 KiB)
      From worker 20:	  81.898 μs (37 allocations: 1.91 KiB)
      From worker 22:	  81.698 μs (37 allocations: 1.91 KiB)
      From worker 23:	  81.968 μs (37 allocations: 1.91 KiB)
      From worker 25:	  81.859 μs (37 allocations: 1.91 KiB)
      From worker 24:	  81.508 μs (37 allocations: 1.91 KiB)
      From worker 26:	  81.769 μs (37 allocations: 1.91 KiB)
      From worker 27:	  82.099 μs (37 allocations: 1.91 KiB)
      From worker 29:	  81.728 μs (37 allocations: 1.91 KiB)
      From worker 28:	  81.598 μs (37 allocations: 1.91 KiB)
      From worker 30:	  81.899 μs (37 allocations: 1.91 KiB)
      From worker 31:	  82.019 μs (37 allocations: 1.91 KiB)
      From worker 33:	  81.938 μs (37 allocations: 1.91 KiB)
      From worker 32:	  81.638 μs (37 allocations: 1.91 KiB)
      From worker 34:	  81.758 μs (37 allocations: 1.91 KiB)
      From worker 35:	  81.958 μs (37 allocations: 1.91 KiB)
      From worker 36:	  81.908 μs (37 allocations: 1.91 KiB)

#!/bin/bash
#SBATCH -C gpu -q regular -A mp107
#SBATCH -t 00:05:00 
#SBATCH --cpus-per-task 32 --gpus-per-task 1 --ntasks-per-node 4 --nodes 4
#=
srun /global/u1/m/marius/.julia/juliaup/julia-1.9.3+0.x64.linux.gnu/bin/julia $(scontrol show job $SLURM_JOBID | awk -F= '/Command=/{print $2}')
exit 0
# =#

using MPIClusterManagers, Distributed, CUDA, ParameterizedNotebooks
mgr = MPIClusterManagers.start_main_loop(MPIClusterManagers.MPI_TRANSPORT_ALL)

nb = ParameterizedNotebook("talk.ipynb", sections=("Some code in a notebook:",))
nb()

MPIClusterManagers.stop_main_loop(mgr)

Julia + Jupyter + GPU = ⚗️🔬🧬🥰

Outline

Motivation

Install

Basic usage

Power of Julia (1)

Limitations

Power of Julia (2)

Multi-GPU (single node)

Multi-GPU (multiple nodes, elastic)

Multi-GPU (multiple nodes, MPI)

Multi-GPU (multiple nodes, MPI, notebooks)

Some code in a notebook:

Now use:

Conclusions