Julia + Jupyter + GPU = ⚗️🔬🧬🥰¶
Marius Millea (Project Scientist @ UC Davis in Cosmology)
NERSC GPU Science Day, Oct 12, 2023 - Slide source code to run locally
Thanks to: Tim Besard + CUDA.jl/Julia contributors, Johannes Blaschke, Rollin Thomas
I work on analyzing maps of the Cosmic Microwave Background. Using tiny distortions imprinted by gravitational lensing, we can make maps of where all the dark matter is in the universe. We do so by solving millions-of-dimensional Bayesian inference problems.
Our basic code building blocks are array broadcasts and FFTs, which are perfectly suited to GPUs. Our group has been using GPUs since the Cori GPU testbed days.
This talk is not about the science, though. Instead, I want to share the workflow we've developed over the last ~5 years.
Outline¶
- Julia + Jupyter + GPU motivation
- Julia CUDA Installation
- Basic and advanced Julia CUDA usage
- Multi-GPU workflows for embarrassingly parallel problems
Motivation¶
- Julia
  - interactive but fast
  - powerful and flexible
  - less boilerplate: code looks like science
- Jupyter
  - convenient for interactive work
  - fast iterative development workflow
- GPU
  - duh
Install¶
Julia/CUDA install is drop-dead simple. Julia's CUDA package provides compatible binary drivers:
$ curl -fsSL https://install.julialang.org | sh
$ julia
pkg> add CUDA # ~2min
Resolving package versions...
Installed CUDA_Driver_jll ── v0.6.0+3
Installed LLVMExtra_jll ──── v0.0.26+0
...
Installed CUDA ───────────── v5.0.0
Downloading artifact: CUDA_Driver
(It is easy to select the CUDA version per project with e.g. CUDA.set_runtime_version!(v"11.4").)
I recommend this fully native Julia install over using any modules, i.e. I don't even have the gpu module loaded:
; module list
Currently Loaded Modules:
  1) craype-x86-milan                        8) cray-mpich/8.1.25
  2) libfabric/1.15.2.0                      9) craype/2.7.20
  3) craype-network-ofi                     10) gcc/11.2.0
  4) xpmem/2.6.2-2.5_2.27__gd067c3f.shasta  11) perftools-base/23.03.0
  5) PrgEnv-gnu/8.3.3                       12) cpe/23.03
  6) cray-dsmml/0.2.2                       13) xalt/2.10.2
  7) cray-libsci/23.02.1.1                  14) cray-python/3.9.13.1 (dev)

  Where:
   dev:  Development Tools and Programming Languages
This has proven robust across many clusters I've tried.
Checking everything is installed:
using CUDA
CUDA.versioninfo()
CUDA runtime 12.2, artifact installation
CUDA driver 12.2
NVIDIA driver 525.105.17, originally for CUDA 12.0

CUDA libraries:
- CUBLAS: 12.2.5
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.2
- CUSPARSE: 12.1.2
- CUPTI: 20.0.0
- NVML: 12.0.0+525.105.17

Julia packages:
- CUDA: 5.0.0
- CUDA_Driver_jll: 0.6.0+3
- CUDA_Runtime_jll: 0.9.2+0

Toolchain:
- Julia: 1.9.3
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

4 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.389 GiB / 40.000 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.389 GiB / 40.000 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.389 GiB / 40.000 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.389 GiB / 40.000 GiB available)
Basic usage¶
arr = rand(10_000_000)
10000000-element Vector{Float64}: 0.20079028039355207 0.2551683713911349 0.07850631788245288 ⋮ 0.18280216971091756 0.5304310135460691
carr = cu(arr)
10000000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}: 0.20079029 0.25516838 0.07850632 ⋮ 0.18280217 0.53043103
sin.(carr) .+ 1
10000000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}: 1.1994438 1.2524083 1.0784256 ⋮ 1.1817858 1.5059052
Let's benchmark:
using BenchmarkTools
@btime CUDA.@sync sin.(carr) .+ 1;
83.051 μs (41 allocations: 2.00 KiB)
@btime sin.(arr) .+ 1;
68.599 ms (6 allocations: 76.29 MiB)
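To put the two timings side by side, here is a quick back-of-the-envelope calculation that uses only the numbers printed above. It assumes the fused broadcast reads each Float32 once and writes each result once, which is my assumption about the memory traffic rather than something measured. Note also that `cu` converted the data to Float32, so the GPU run is single precision while the CPU run is Float64.

```julia
# Timings copied from the @btime output above
t_gpu = 83.051e-6   # seconds, CuArray{Float32}
t_cpu = 68.599e-3   # seconds, Vector{Float64}
n = 10_000_000      # number of elements

speedup = t_cpu / t_gpu

# Assumed traffic: one Float32 read plus one Float32 write per element
bytes_moved = 2 * n * sizeof(Float32)
gpu_bandwidth_GBs = bytes_moved / t_gpu / 1e9

println("speedup ≈ ", round(Int, speedup), "x")
println("effective GPU bandwidth ≈ ", round(Int, gpu_bandwidth_GBs), " GB/s")
```

Under these assumptions the speedup comes out at roughly 826x and the effective bandwidth at roughly 960 GB/s, which suggests this simple broadcast is limited by memory bandwidth rather than by arithmetic.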
CUDA.@profile sin.(carr) .+ 1;
Profiler ran for 326.63 µs, capturing 11 events.

Host-side activity: calling CUDA APIs took 61.75 µs (18.91% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬─────────────────────────┐
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                    │
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼─────────────────────────┤
│    9.56% │ 31.23 µs │     1 │ 31.23 µs │ 31.23 µs │ 31.23 µs │ cuLaunchKernel          │
│    7.66% │ 25.03 µs │     1 │ 25.03 µs │ 25.03 µs │ 25.03 µs │ cuMemAllocFromPoolAsync │
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴─────────────────────────┘

Device-side activity: GPU was busy for 81.54 µs (24.96% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                                                                                                         ⋯
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   24.96% │ 81.54 µs │     1 │ 81.54 µs │ 81.54 µs │ 81.54 µs │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float32Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5Tup ⋯
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 column omitted
Power of Julia (1)¶
In Julia, you can easily put many arbitrary objects on GPU:
struct Point{T}
x :: T
y :: T
end
arr = Point.(rand(100), rand(100))
carr = cu(arr)
100-element CuArray{Point{Float64}, 1, CUDA.Mem.DeviceBuffer}: Point{Float64}(0.8490008946627912, 0.48658520886875856) Point{Float64}(0.06977616429006461, 0.2501647436222665) Point{Float64}(0.6924522464442648, 0.2656146874146924) ⋮ Point{Float64}(0.8748131265382463, 0.2480993353552592) Point{Float64}(0.49503190701987954, 0.13355513219798332)
In e.g. Jax/PyTorch/TF, the only things you can stick inside of CUDA arrays are Int/Float/Complex. In Julia, anything with a static memory layout is fine.
distance_from_origin(p::Point) = sqrt(p.x^2 + p.y^2)
distance_from_origin (generic function with 1 method)
distance_from_origin.(carr)
100-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}: 0.9785538741572041 0.2597135191988057 0.7416476763100615 ⋮ 0.9093136348737674 0.5127314719267381
Limitations¶
function distance_from_origin_bad(p::Point)
sqrt(sum([p.x^2, p.y^2]))
end
distance_from_origin_bad (generic function with 1 method)
distance_from_origin_bad.(carr)
InvalidIRError: compiling MethodInstance for (::GPUArrays.var"#broadcast_kernel#32")(::CUDA.CuKernelContext, ::CuDeviceVector{Float64, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(distance_from_origin_bad), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Point{Float64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, ::Int64) resulted in invalid LLVM IR Reason: unsupported call through a literal pointer (call to ijl_alloc_array_1d) Stacktrace: [1] Array @ ./boot.jl:477 [2] Array @ ./boot.jl:486 [3] similar @ ./abstractarray.jl:884 [4] similar @ ./abstractarray.jl:883 [5] _array_for @ ./array.jl:671 [6] _array_for @ ./array.jl:674 [7] vect @ ./array.jl:126 [8] distance_from_origin_bad @ ./In[19]:2 [9] _broadcast_getindex_evalf @ ./broadcast.jl:683 [10] _broadcast_getindex @ ./broadcast.jl:656 [11] getindex @ ./broadcast.jl:610 [12] broadcast_kernel @ ~/.julia/packages/GPUArrays/EZkix/src/host/broadcast.jl:64 Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl Stacktrace: [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module) @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/validation.jl:147 [2] macro expansion @ ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:440 [inlined] [3] macro expansion @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined] [4] macro expansion @ ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:439 [inlined] [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool) @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/utils.jl:92 [6] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing) @ GPUCompiler 
~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:129 [7] codegen @ ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:110 [inlined] [8] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool) @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:106 [9] compile @ ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:98 [inlined] [10] #1042 @ ~/.julia/packages/CUDA/nbRJk/src/compiler/compilation.jl:166 [inlined] [11] JuliaContext(f::CUDA.var"#1042#1045"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}) @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/driver.jl:47 [12] compile(job::GPUCompiler.CompilerJob) @ CUDA ~/.julia/packages/CUDA/nbRJk/src/compiler/compilation.jl:165 [13] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link)) @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/execution.jl:125 [14] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function) @ GPUCompiler ~/.julia/packages/GPUCompiler/2mJjc/src/execution.jl:103 [15] macro expansion @ ~/.julia/packages/CUDA/nbRJk/src/compiler/execution.jl:323 [inlined] [16] macro expansion @ ./lock.jl:267 [inlined] [17] cufunction(f::GPUArrays.var"#broadcast_kernel#32", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceVector{Float64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(distance_from_origin_bad), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Point{Float64}, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}) @ CUDA 
~/.julia/packages/CUDA/nbRJk/src/compiler/execution.jl:318 [18] cufunction @ ~/.julia/packages/CUDA/nbRJk/src/compiler/execution.jl:315 [inlined] [19] macro expansion @ ~/.julia/packages/CUDA/nbRJk/src/compiler/execution.jl:104 [inlined] [20] #launch_heuristic#1087 @ ~/.julia/packages/CUDA/nbRJk/src/gpuarrays.jl:17 [inlined] [21] launch_heuristic @ ~/.julia/packages/CUDA/nbRJk/src/gpuarrays.jl:15 [inlined] [22] _copyto! @ ~/.julia/packages/GPUArrays/EZkix/src/host/broadcast.jl:70 [inlined] [23] copyto! @ ~/.julia/packages/GPUArrays/EZkix/src/host/broadcast.jl:51 [inlined] [24] copy @ ~/.julia/packages/GPUArrays/EZkix/src/host/broadcast.jl:42 [inlined] [25] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Nothing, typeof(distance_from_origin_bad), Tuple{CuArray{Point{Float64}, 1, CUDA.Mem.DeviceBuffer}}}) @ Base.Broadcast ./broadcast.jl:873 [26] top-level scope @ In[20]:1
Limitations on code in functions that will be compiled for GPU:
- No calls to CPU functions
  - E.g. creating Arrays (use StaticArrays.jl instead)
- No dynamic dispatch
  - Code should be type stable
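As a minimal sketch of how the failing function above might be repaired, the version below swaps the heap-allocated `[p.x^2, p.y^2]` for a tuple, which has a static layout and needs no `Array` allocation. The name `distance_from_origin_tuple` is made up for illustration, and the tuple is a stand-in for the StaticArrays.jl route mentioned in the list. The snippet only runs on the CPU; I would expect the same broadcast to compile for a `CuArray` of points, but that is not verified here.

```julia
# Same struct as earlier in the talk, repeated so the snippet is self-contained
struct Point{T}
    x::T
    y::T
end

# A tuple is immutable with a fixed size, so summing it creates no Array
distance_from_origin_tuple(p::Point) = sqrt(sum((p.x^2, p.y^2)))

pts = Point.(rand(100), rand(100))
d = distance_from_origin_tuple.(pts)   # CPU broadcast over the points
```

The design point is that anything whose size is known at compile time (tuples, small structs, static arrays) can live in registers, whereas an `Array` requires a call into the Julia runtime that does not exist on the device.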
Power of Julia (2)¶
You can also directly write kernels in Julia, giving the full power and flexibility of CUDA kernel programming:
function my_kernel(carr_out, carr)
start = (blockIdx().x - 1) * blockDim().x + threadIdx().x
stride = blockDim().x * gridDim().x
len = length(carr)
for i = start:stride:len # "grid-stride" loop
carr_out[i] = sin(carr[i]) + 1
end
return
end
my_kernel (generic function with 1 method)
carr = cu(rand(10_000_000))
carr_out = similar(carr);
@cuda threads=256 my_kernel(carr_out, carr)
CUDA.HostKernel for my_kernel(CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1})
carr_out
10000000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}: 1.264759 1.3415114 1.0137947 ⋮ 1.7018087 1.215888
See Kernel Programming for the full list of CUDA.jl kernel programming capabilities.
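To see why the grid-stride loop in `my_kernel` touches every element exactly once, the indexing can be emulated on the CPU. This is only a sketch of the index arithmetic: `grid_stride_indices` and `count_visits` are hypothetical helpers, and the nested loops stand in for GPU threads that would really run concurrently.

```julia
# Indices visited by one (block, thread) pair, using the same 1-based
# arithmetic as my_kernel: start at the global thread id, step by grid size
function grid_stride_indices(thread, block, threads_per_block, blocks, len)
    start  = (block - 1) * threads_per_block + thread
    stride = threads_per_block * blocks
    return start:stride:len
end

# Count how many times each element is visited across the whole grid
function count_visits(len, threads_per_block, blocks)
    visits = zeros(Int, len)
    for block in 1:blocks, thread in 1:threads_per_block
        for i in grid_stride_indices(thread, block, threads_per_block, blocks, len)
            visits[i] += 1
        end
    end
    return visits
end

# One block of 256 threads, as in the `@cuda threads=256` launch above
single_block = count_visits(1000, 256, 1)
# Several blocks, array length not a multiple of the grid size
multi_block = count_visits(10_000, 256, 7)
```

Both cases visit each index exactly once. Since the launch above passes no `blocks` argument, it runs as a single block, and the stride loop is what lets 256 threads still cover all 10 million elements; adding `blocks=...` to the `@cuda` call would spread the same loop over more of the GPU.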
Multi-GPU (single node)¶
CUDA.devices()
CUDA.DeviceIterator() for 4 devices: 0. NVIDIA A100-SXM4-40GB 1. NVIDIA A100-SXM4-40GB 2. NVIDIA A100-SXM4-40GB 3. NVIDIA A100-SXM4-40GB
CUDA.device()
CuDevice(0): NVIDIA A100-SXM4-40GB
CUDA.device!(1)
CuDevice(1): NVIDIA A100-SXM4-40GB
arr = rand(10_000_000)
carr = cu(arr)
@btime CUDA.@sync sin.(carr) .+ 1;
85.705 μs (41 allocations: 2.00 KiB)
CUDA.jl does its own memory management, so before switching back to GPU 0 we give the memory back. (You don't usually have to think about this unless you use the same GPU from multiple processes, which I do for the purposes of this demo.)
GC.gc()
CUDA.reclaim()
CUDA.device!(0)
CuDevice(0): NVIDIA A100-SXM4-40GB
You can use multiple GPUs via Julia processes, tasks, or threads.
The most robust and easiest way I have found (as of 2023), and the one I recommend starting with, is one process per GPU:
using Distributed
addprocs(3)
3-element Vector{Int64}: 2 3 4
@everywhere using CUDA, BenchmarkTools
@everywhere procs() println((myid(), CUDA.device()))
(1, CuDevice(0)) From worker 3: (3, CuDevice(0)) From worker 2: (2, CuDevice(0)) From worker 4: (4, CuDevice(0))
@everywhere procs() CUDA.device!(myid()-1)
@everywhere procs() println((myid(), CUDA.device()))
(1, CuDevice(0)) From worker 2: (2, CuDevice(1)) From worker 3: (3, CuDevice(2)) From worker 4: (4, CuDevice(3))
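The `myid()-1` mapping above works because there are exactly as many processes as GPUs. A small generalization, sketched here with a hypothetical helper `gpu_index`, wraps around when there are more processes than devices. It is plain integer arithmetic, so it can be checked without a GPU.

```julia
# Hypothetical helper: map a 1-based Distributed process id to a
# 0-based CUDA device index, wrapping around when ids exceed the device count
gpu_index(id, ndevices) = mod(id - 1, ndevices)

four_procs = [gpu_index(id, 4) for id in 1:4]   # one process per GPU
six_procs  = [gpu_index(id, 4) for id in 1:6]   # two GPUs get a second process

# Untested sketch of how it might be used with the cells above:
#   @everywhere gpu_index(id, ndevices) = mod(id - 1, ndevices)
#   @everywhere procs() CUDA.device!(gpu_index(myid(), length(CUDA.devices())))
```

This only makes sense on a single node where every process sees all the GPUs.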
Let's run our benchmark in parallel across all GPUs:
let
carr = cu(rand(10_000_000))
pmap(WorkerPool(procs()), 1:4) do i
@btime CUDA.@sync sin.($carr) .+ 1
end
end
85.255 μs (37 allocations: 1.91 KiB) From worker 3: 81.938 μs (37 allocations: 1.91 KiB) From worker 2: 82.238 μs (37 allocations: 1.91 KiB) From worker 4: 81.597 μs (37 allocations: 1.91 KiB)
4-element Vector{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}: Float32[1.8228395, 1.0992318, 1.759596, 1.8266902, 1.1692553, 1.3392247, 1.7539189, 1.7294312, 1.705107, 1.6065087 … 1.4714574, 1.0195016, 1.0635127, 1.5888524, 1.7992377, 1.333357, 1.2087038, 1.0803039, 1.7202525, 1.5920569] Float32[1.8228395, 1.0992318, 1.759596, 1.8266902, 1.1692553, 1.3392247, 1.7539189, 1.7294312, 1.705107, 1.6065087 … 1.4714574, 1.0195016, 1.0635127, 1.5888524, 1.7992377, 1.333357, 1.2087038, 1.0803039, 1.7202525, 1.5920569] Float32[1.8228395, 1.0992318, 1.759596, 1.8266902, 1.1692553, 1.3392247, 1.7539189, 1.7294312, 1.705107, 1.6065087 … 1.4714574, 1.0195016, 1.0635127, 1.5888524, 1.7992377, 1.333357, 1.2087038, 1.0803039, 1.7202525, 1.5920569] Float32[1.8228395, 1.0992318, 1.759596, 1.8266902, 1.1692553, 1.3392247, 1.7539189, 1.7294312, 1.705107, 1.6065087 … 1.4714574, 1.0195016, 1.0635127, 1.5888524, 1.7992377, 1.333357, 1.2087038, 1.0803039, 1.7202525, 1.5920569]
Note that carr was defined and moved to the GPU on the master process. Julia automatically sent it to the worker GPUs, then automatically sent the results back to the master GPU.
In doing so, the array passed through CPU memory, so it's not the most efficient approach (but it's the easiest).
To go straight GPU-to-GPU, you can use unified memory on a single node, or CUDA MPI transport (later in this talk).
Multi-GPU (multiple nodes, elastic)¶
using ClusterManagers, Sockets  # Sockets provides IPv4
em = ElasticManager(
# Perlmutter specific ↓
addr = IPv4(first(filter(!isnothing, match.(r"inet (.*)/.*hsn0", readlines(`ip a show`)))).captures[1]),
port = 0
);
em
ElasticManager: Active workers : [ 5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36] Number of workers to be added : 0 Terminated workers : [] Worker connect command : /global/u1/m/marius/.julia/juliaup/julia-1.9.3+0.x64.linux.gnu/bin/julia --project=/global/u1/m/marius/work/gpu_science_day_julia/Project.toml -e 'using ClusterManagers; ClusterManagers.elastic_worker("6Ty6RCu5sIy5CedV","10.249.6.77",35449)'
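The Perlmutter-specific `addr` expression above packs several steps into one line: it scans the output of `ip a show`, keeps the line that assigns an IPv4 address to the `hsn0` interface, and turns the captured text into an `IPv4`. The sketch below unpacks those steps into a hypothetical helper, `interface_address`, and runs it on made-up sample lines (the addresses are invented, not real Perlmutter output) so that it works on any machine.

```julia
using Sockets  # for IPv4

# Hypothetical helper: find the IPv4 address bound to a named interface
# in lines shaped like the output of `ip a show`
function interface_address(lines, iface)
    pattern = Regex("inet (.*)/.*" * iface)
    # match returns nothing for lines that do not fit the pattern
    matches = filter(!isnothing, match.(pattern, lines))
    return IPv4(first(matches).captures[1])
end

# Made-up sample input for illustration only
sample = [
    "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536",
    "    inet 127.0.0.1/8 scope host lo",
    "4: hsn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000",
    "    inet 10.0.0.5/16 brd 10.0.255.255 scope global hsn0",
]

addr = interface_address(sample, "hsn0")
```

On another cluster you would presumably swap `hsn0` for whichever interface the compute nodes can reach.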
Now submit a job, e.g. with:
salloc -C gpu -q regular -t 00:30:00 --cpus-per-task 32 --gpus-per-task 1 --ntasks-per-node 4 --nodes 8 -A mp107
then run the "worker connect command" printed above (could also do all-in-one as a batch job).
With more GPUs across different nodes, it's more complex to assign one unique GPU to each process. Instead we can use this utility function:
using CUDADistributedTools
CUDADistributedTools.assign_GPU_workers()
┌ Info: Processes (36): │ (myid = 1, host = nid001293, device = CuDevice(0): NVIDIA A100-SXM4-40GB 1c40175b)) │ (myid = 2, host = nid001293, device = CuDevice(1): NVIDIA A100-SXM4-40GB f179efe2)) │ (myid = 3, host = nid001293, device = CuDevice(2): NVIDIA A100-SXM4-40GB 36d32866)) │ (myid = 4, host = nid001293, device = CuDevice(3): NVIDIA A100-SXM4-40GB 634451b9)) │ (myid = 5, host = nid002532, device = CuDevice(0): NVIDIA A100-SXM4-40GB 892d65ed)) │ (myid = 6, host = nid002532, device = CuDevice(0): NVIDIA A100-SXM4-40GB 0212ac25)) │ (myid = 7, host = nid002532, device = CuDevice(0): NVIDIA A100-SXM4-40GB 9f1b5f73)) │ (myid = 8, host = nid002532, device = CuDevice(0): NVIDIA A100-SXM4-40GB b9ac9c36)) │ (myid = 9, host = nid002536, device = CuDevice(0): NVIDIA A100-SXM4-40GB 1ffb4f18)) │ (myid = 10, host = nid002536, device = CuDevice(0): NVIDIA A100-SXM4-40GB a25217d5)) │ (myid = 11, host = nid002536, device = CuDevice(0): NVIDIA A100-SXM4-40GB 2ad12529)) │ (myid = 12, host = nid002536, device = CuDevice(0): NVIDIA A100-SXM4-40GB 91817c8d)) │ (myid = 13, host = nid003320, device = CuDevice(0): NVIDIA A100-SXM4-40GB 6f8ab1df)) │ (myid = 14, host = nid003320, device = CuDevice(0): NVIDIA A100-SXM4-40GB 014e077e)) │ (myid = 15, host = nid003320, device = CuDevice(0): NVIDIA A100-SXM4-40GB 38a58e41)) │ (myid = 16, host = nid003320, device = CuDevice(0): NVIDIA A100-SXM4-40GB a860a000)) │ (myid = 17, host = nid003316, device = CuDevice(0): NVIDIA A100-SXM4-40GB fdfe719c)) │ (myid = 18, host = nid003316, device = CuDevice(0): NVIDIA A100-SXM4-40GB 547b9f5c)) │ (myid = 19, host = nid003316, device = CuDevice(0): NVIDIA A100-SXM4-40GB ee15a3a3)) │ (myid = 20, host = nid003316, device = CuDevice(0): NVIDIA A100-SXM4-40GB 32342d63)) │ (myid = 21, host = nid002533, device = CuDevice(0): NVIDIA A100-SXM4-40GB f5695274)) │ (myid = 22, host = nid002533, device = CuDevice(0): NVIDIA A100-SXM4-40GB 4cdbb673)) │ (myid = 23, host = nid002533, device = CuDevice(0): NVIDIA 
A100-SXM4-40GB 0130c469)) │ (myid = 24, host = nid003317, device = CuDevice(0): NVIDIA A100-SXM4-40GB 5aeab6da)) │ (myid = 25, host = nid003317, device = CuDevice(0): NVIDIA A100-SXM4-40GB 44081cfa)) │ (myid = 26, host = nid003317, device = CuDevice(0): NVIDIA A100-SXM4-40GB 0a4aa27d)) │ (myid = 27, host = nid003317, device = CuDevice(0): NVIDIA A100-SXM4-40GB 06c81dc5)) │ (myid = 28, host = nid002533, device = CuDevice(0): NVIDIA A100-SXM4-40GB 20734e54)) │ (myid = 29, host = nid003313, device = CuDevice(0): NVIDIA A100-SXM4-40GB 60aa76e5)) │ (myid = 30, host = nid003313, device = CuDevice(0): NVIDIA A100-SXM4-40GB 249cdaab)) │ (myid = 31, host = nid003313, device = CuDevice(0): NVIDIA A100-SXM4-40GB da05c388)) │ (myid = 32, host = nid003313, device = CuDevice(0): NVIDIA A100-SXM4-40GB dc1c5e9d)) │ (myid = 33, host = nid003321, device = CuDevice(0): NVIDIA A100-SXM4-40GB 3b627630)) │ (myid = 34, host = nid003321, device = CuDevice(0): NVIDIA A100-SXM4-40GB 76cf68e2)) │ (myid = 35, host = nid003321, device = CuDevice(0): NVIDIA A100-SXM4-40GB b2cb91b4)) └ (myid = 36, host = nid003321, device = CuDevice(0): NVIDIA A100-SXM4-40GB 4f421754))
Let's run parallel benchmarks again:
@everywhere using CUDA, BenchmarkTools
let
carr = cu(rand(10_000_000))
pmap(WorkerPool(procs()), 1:nprocs()) do i
@btime CUDA.@sync sin.($carr) .+ 1
return nothing
end
end;
85.174 μs (37 allocations: 1.91 KiB) From worker 3: 81.828 μs (37 allocations: 1.91 KiB) From worker 4: 81.768 μs (37 allocations: 1.91 KiB) From worker 2: 82.759 μs (37 allocations: 1.91 KiB) From worker 5: 81.707 μs (37 allocations: 1.91 KiB) From worker 8: 81.758 μs (37 allocations: 1.91 KiB) From worker 7: 81.897 μs (37 allocations: 1.91 KiB) From worker 9: 82.840 μs (37 allocations: 1.91 KiB) From worker 10: 81.388 μs (37 allocations: 1.91 KiB) From worker 12: 81.898 μs (37 allocations: 1.91 KiB) From worker 11: 81.638 μs (37 allocations: 1.91 KiB) From worker 6: 83.030 μs (37 allocations: 1.91 KiB) From worker 15: 81.608 μs (37 allocations: 1.91 KiB) From worker 13: 83.251 μs (37 allocations: 1.91 KiB) From worker 16: 82.169 μs (37 allocations: 1.91 KiB) From worker 14: 83.382 μs (37 allocations: 1.91 KiB) From worker 19: 82.439 μs (37 allocations: 1.91 KiB) From worker 18: 82.630 μs (37 allocations: 1.91 KiB) From worker 22: 81.197 μs (37 allocations: 1.91 KiB) From worker 26: 82.680 μs (37 allocations: 1.91 KiB) From worker 17: 83.001 μs (37 allocations: 1.91 KiB) From worker 21: 81.497 μs (37 allocations: 1.91 KiB) From worker 29: 81.617 μs (37 allocations: 1.91 KiB) From worker 20: 82.448 μs (37 allocations: 1.91 KiB) From worker 24: 81.669 μs (37 allocations: 1.91 KiB) From worker 25: 82.230 μs (37 allocations: 1.91 KiB) From worker 33: 81.507 μs (37 allocations: 1.91 KiB) From worker 32: 81.618 μs (37 allocations: 1.91 KiB) From worker 30: 81.447 μs (37 allocations: 1.91 KiB) From worker 35: 81.908 μs (37 allocations: 1.91 KiB) From worker 31: 81.718 μs (37 allocations: 1.91 KiB) From worker 27: 82.640 μs (37 allocations: 1.91 KiB) From worker 34: 81.577 μs (37 allocations: 1.91 KiB) From worker 28: 82.680 μs (37 allocations: 1.91 KiB) From worker 23: 81.618 μs (37 allocations: 1.91 KiB) From worker 36: 81.768 μs (37 allocations: 1.91 KiB)
Multi-GPU (multiple nodes, MPI)¶
Installing MPI for Julia and configuring:
pkg> add MPI MPIPreferences
julia> MPIPreferences.use_system_binary(;vendor="cray", mpiexec="srun") # <- options are Perlmutter specific
┌ Info: MPI implementation identified
│ libmpi = "libmpi_gnu_91.so"
│ version_string = "MPI VERSION : CRAY MPICH version 8.1.25.17 (ANL base 3.4a2)\nMPI BUILD INFO : Sun Feb 26 15:15 2023 (git hash aecd99f)\n"
│ impl = "CrayMPICH"
│ version = v"8.1.25"
└ abi = "MPICH"
┌ Info: MPIPreferences changed
│ binary = "system"
│ libmpi = "libmpi_gnu_91.so"
│ abi = "MPICH"
│ mpiexec = "srun"
│ preloads =
│ 1-element Vector{String}:
│ "libmpi_gtl_cuda.so"
└ preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
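(This works thanks to contributions to MPI.jl by, among others, NERSC's Johannes Blaschke.)
You can put the SLURM script and the Julia script in one file, test_script.jl. Bash treats the #= line as an ordinary comment and runs the srun command, while Julia treats everything between #= and =# as a block comment and skips straight to the Julia code: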
#!/bin/bash
#SBATCH -C gpu -q regular -A mp107
#SBATCH -t 00:05:00
#SBATCH --cpus-per-task 32 --gpus-per-task 1 --ntasks-per-node 4 --nodes 4
#=
srun /global/u1/m/marius/.julia/juliaup/julia-1.9.3+0.x64.linux.gnu/bin/julia $(scontrol show job $SLURM_JOBID | awk -F= '/Command=/{print $2}')
exit 0
# =#
using MPIClusterManagers, Distributed, CUDA, BenchmarkTools
mgr = MPIClusterManagers.start_main_loop(MPIClusterManagers.MPI_TRANSPORT_ALL)
let
carr = cu(rand(10_000_000))
pmap(WorkerPool(procs()), 1:nprocs()) do i
@btime CUDA.@sync sin.($carr) .+ 1
end
end
MPIClusterManagers.stop_main_loop(mgr)
Then submit it with sbatch test_script.jl.
Here, movement of memory between GPUs will happen via CUDA MPI transport 🚀
Multi-GPU (multiple nodes, MPI, notebooks)¶
Some code in a notebook:¶
let
carr = cu(rand(10_000_000))
pmap(WorkerPool(procs()), 1:nprocs()) do i
@btime CUDA.@sync sin.($carr) .+ 1
return nothing
end
end;
85.515 μs (37 allocations: 1.91 KiB) From worker 5: 81.698 μs (37 allocations: 1.91 KiB) From worker 6: 81.728 μs (37 allocations: 1.91 KiB) From worker 2: 81.918 μs (37 allocations: 1.91 KiB) From worker 8: 81.697 μs (37 allocations: 1.91 KiB) From worker 7: 81.637 μs (37 allocations: 1.91 KiB) From worker 9: 81.338 μs (37 allocations: 1.91 KiB) From worker 10: 81.197 μs (37 allocations: 1.91 KiB) From worker 3: 81.867 μs (37 allocations: 1.91 KiB) From worker 13: 81.968 μs (37 allocations: 1.91 KiB) From worker 4: 81.838 μs (37 allocations: 1.91 KiB) From worker 11: 81.668 μs (37 allocations: 1.91 KiB) From worker 12: 82.049 μs (37 allocations: 1.91 KiB) From worker 15: 80.727 μs (37 allocations: 1.91 KiB) From worker 14: 81.378 μs (37 allocations: 1.91 KiB) From worker 16: 82.159 μs (37 allocations: 1.91 KiB) From worker 17: 81.297 μs (37 allocations: 1.91 KiB) From worker 18: 81.277 μs (37 allocations: 1.91 KiB) From worker 20: 81.357 μs (37 allocations: 1.91 KiB) From worker 21: 81.899 μs (37 allocations: 1.91 KiB) From worker 19: 81.637 μs (37 allocations: 1.91 KiB) From worker 23: 81.597 μs (37 allocations: 1.91 KiB) From worker 22: 81.558 μs (37 allocations: 1.91 KiB) From worker 25: 81.738 μs (37 allocations: 1.91 KiB) From worker 24: 81.587 μs (37 allocations: 1.91 KiB) From worker 26: 81.558 μs (37 allocations: 1.91 KiB) From worker 28: 81.688 μs (37 allocations: 1.91 KiB) From worker 27: 81.808 μs (37 allocations: 1.91 KiB) From worker 29: 81.798 μs (37 allocations: 1.91 KiB) From worker 30: 82.329 μs (37 allocations: 1.91 KiB) From worker 31: 81.658 μs (37 allocations: 1.91 KiB) From worker 33: 81.668 μs (37 allocations: 1.91 KiB) From worker 32: 81.788 μs (37 allocations: 1.91 KiB) From worker 35: 81.457 μs (37 allocations: 1.91 KiB) From worker 34: 81.778 μs (37 allocations: 1.91 KiB) From worker 36: 81.838 μs (37 allocations: 1.91 KiB)
Now use:¶
using ParameterizedNotebooks
nb = ParameterizedNotebook("talk.ipynb", sections=("Some code in a notebook:",))
ParameterizedNotebook("talk.ipynb") □ ~ □ Julia + Jupyter + GPU = ⚗️🔬🧬🥰 □ Outline □ Motivation □ Install □ Basic usage □ Power of Julia (1) □ Limitations □ Power of Julia (2) □ Multi-GPU (single node) □ Multi-GPU (multiple nodes, elastic) □ Multi-GPU (multiple nodes, MPI) □ Multi-GPU (multiple nodes, MPI, notebooks) ☒ Some code in a notebook: ☒ … □ Now use: □ Conclusions
nb()
85.074 μs (37 allocations: 1.91 KiB) From worker 5: 82.029 μs (37 allocations: 1.91 KiB) From worker 6: 82.058 μs (37 allocations: 1.91 KiB) From worker 7: 81.968 μs (37 allocations: 1.91 KiB) From worker 9: 81.999 μs (37 allocations: 1.91 KiB) From worker 3: 81.838 μs (37 allocations: 1.91 KiB) From worker 8: 81.818 μs (37 allocations: 1.91 KiB) From worker 10: 81.929 μs (37 allocations: 1.91 KiB) From worker 4: 82.148 μs (37 allocations: 1.91 KiB) From worker 11: 81.808 μs (37 allocations: 1.91 KiB) From worker 13: 82.149 μs (37 allocations: 1.91 KiB) From worker 12: 81.799 μs (37 allocations: 1.91 KiB) From worker 14: 81.948 μs (37 allocations: 1.91 KiB) From worker 2: 82.078 μs (37 allocations: 1.91 KiB) From worker 16: 81.909 μs (37 allocations: 1.91 KiB) From worker 17: 81.908 μs (37 allocations: 1.91 KiB) From worker 15: 81.689 μs (37 allocations: 1.91 KiB) From worker 18: 82.139 μs (37 allocations: 1.91 KiB) From worker 19: 81.988 μs (37 allocations: 1.91 KiB) From worker 21: 81.537 μs (37 allocations: 1.91 KiB) From worker 20: 81.898 μs (37 allocations: 1.91 KiB) From worker 22: 81.698 μs (37 allocations: 1.91 KiB) From worker 23: 81.968 μs (37 allocations: 1.91 KiB) From worker 25: 81.859 μs (37 allocations: 1.91 KiB) From worker 24: 81.508 μs (37 allocations: 1.91 KiB) From worker 26: 81.769 μs (37 allocations: 1.91 KiB) From worker 27: 82.099 μs (37 allocations: 1.91 KiB) From worker 29: 81.728 μs (37 allocations: 1.91 KiB) From worker 28: 81.598 μs (37 allocations: 1.91 KiB) From worker 30: 81.899 μs (37 allocations: 1.91 KiB) From worker 31: 82.019 μs (37 allocations: 1.91 KiB) From worker 33: 81.938 μs (37 allocations: 1.91 KiB) From worker 32: 81.638 μs (37 allocations: 1.91 KiB) From worker 34: 81.758 μs (37 allocations: 1.91 KiB) From worker 35: 81.958 μs (37 allocations: 1.91 KiB) From worker 36: 81.908 μs (37 allocations: 1.91 KiB)
You can put the call to the notebook code directly in a script, test_script_2.jl:
#!/bin/bash
#SBATCH -C gpu -q regular -A mp107
#SBATCH -t 00:05:00
#SBATCH --cpus-per-task 32 --gpus-per-task 1 --ntasks-per-node 4 --nodes 4
#=
srun /global/u1/m/marius/.julia/juliaup/julia-1.9.3+0.x64.linux.gnu/bin/julia $(scontrol show job $SLURM_JOBID | awk -F= '/Command=/{print $2}')
exit 0
# =#
using MPIClusterManagers, Distributed, CUDA, BenchmarkTools, ParameterizedNotebooks
mgr = MPIClusterManagers.start_main_loop(MPIClusterManagers.MPI_TRANSPORT_ALL)
nb = ParameterizedNotebook("talk.ipynb", sections=("Some code in a notebook:",))
nb()
MPIClusterManagers.stop_main_loop(mgr)
With some care in the organization of your sections, you can iterate on code in the notebook, even test it in parallel using on-the-fly ElasticManager workers, then submit the identical code as an MPI job for larger-scale runs 🎉
Conclusions¶
- Julia + Jupyter + GPUs offer powerful scientific workflows
- Hopefully I've shared some efficient ways to do this that we've learned
- Wishlist
  - More robust and easier CUDA.jl task/threading support
  - An easy way to use MPI CUDA transport protocol from within Jupyter jobs
  - A multi-node GPU monitor, even just a command-line one
    - nvitop, btop (PR), and gpustat are some good command-line single-node options