---
name: GPU_memory_model
class: left
# GPU memory model
---
name: GPU_memory_model2
class: left
# GPU memory model
| Variable declaration | Memory | Scope | Lifetime | Performance penalty |
|---|---|---|---|---|
| `int localVar;` | register | thread | thread | ~ 1x |
| `__local__ int localVar;` | register | thread | thread | ~ 1x |
| `int localArray[10];` | local | thread | thread | ~ 100x |
| `__shared__ int sharedVar;` | shared | block | block | ~ 1x |
| `__shared__ int sharedArray[10];` | shared | block | block | ~ 1x |
| `__device__ int globalVar;` | global | grid + host | application | ~ 100x |
| `__constant__ int constantVar;` | constant | grid + host | application | ~ 1x |
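As a rough illustration of the table above, a minimal sketch (the kernel and variable names are made up for this example, not taken from the lecture code):
```.cu
// Illustrative sketch of the different memory spaces inside one kernel.
__device__   float globalVar;    // global memory: grid + host scope, application lifetime
__constant__ float constantVar;  // constant memory: read-only in kernels, cached

__global__ void memorySpaces(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float localVar = 1.0f;        // scalar: normally kept in a register
    float localArray[10];         // per-thread array: often spilled to slow local memory
    __shared__ float sharedVar;   // shared memory: one copy per block

    for (int k = 0; k < 10; k++)
        localArray[k] = k * localVar;
    if (threadIdx.x == 0)
        sharedVar = globalVar + constantVar;
    __syncthreads();

    out[i] = sharedVar + localArray[i % 10];
}
```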
---
name: coalesced_mem_access1
class: left
# Coalesced memory access
The GPU has a wide memory bus and executes memory accesses in naturally aligned chunks of 32, 64, or 128 bytes.
Memory accesses to neighboring locations by threads in the same warp can be combined.
This is called coalesced memory access.
--
- If all 32 threads in a warp access consecutive 4-byte elements (e.g. `float`s), aligned on a 128-byte boundary, only 1 memory transaction is needed to handle the access.
- Conversely, if all threads in a warp access disjoint memory locations, 32 separate transactions are needed.
- Coalescing memory accesses is crucial for good performance (see the sketch after this list).
- This stands in contrast to CPU performance guidelines, where consecutive elements should be processed by a single thread (to exploit the hardware prefetcher).
- Use `cudaMallocPitch` to allocate 2D structures, so that each row starts at a properly aligned address.
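A minimal sketch (hypothetical kernels, not from the lecture code) contrasting the two patterns:
```.cu
// Sketch: same amount of work, very different numbers of memory transactions.
__global__ void coalescedCopy(const float *in, float *out)
{
    // Threads in a warp read consecutive floats:
    // the 32 accesses coalesce into a few wide transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void stridedCopy(const float *in, float *out, int stride)
{
    // Neighbouring threads read elements `stride` floats apart:
    // with a large stride each access needs its own transaction.
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```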
---
name: coalesced_mem_access2
class: left
# Coalesced memory access
Good access pattern (coalesced):
Bad access pattern (strided, unaligned):
The memory bandwidth penalty in the strided, unaligned case is about 3x.
---
name: copying_memory
class: left
# Copying memory
Memory can be copied between host and device in two ways.
--
The synchronous call blocks the CPU until the copy is complete.
The copy begins once all preceding CUDA calls have completed.
```.c
cudaError_t cudaMemcpy(
    void* dst, const void* src, size_t count, cudaMemcpyKind kind)
// enum cudaMemcpyKind
// cudaMemcpyHostToDevice = 1
// cudaMemcpyDeviceToHost = 2
```
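For example, a minimal usage sketch (`d_data`, `h_data`, and `count` are hypothetical, already-set-up variables):
```.cu
// Usage sketch: copy `count` floats to the device and check the status.
cudaError_t err = cudaMemcpy(d_data, h_data, count * sizeof(float),
                             cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
```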
--
The asynchronous variant, which does not block the CPU, is
```.c
cudaError_t cudaMemcpyAsync(
    void* dst, const void* src, size_t count, cudaMemcpyKind kind,
    cudaStream_t stream = 0)
```
--
Note:
A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code.
While operations within a stream are guaranteed to execute in the prescribed order,
operations in different streams can be interleaved and, when possible, they can even run concurrently.
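A minimal sketch of an asynchronous copy in a non-default stream (assuming `d_A` and `h_A` are existing device and host buffers; the host buffer should be page-locked, e.g. allocated with `cudaMallocHost`, for the copy to be truly asynchronous):
```.cu
// Sketch: issue the copy into its own stream and overlap it with CPU work.
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_A, h_A, count * sizeof(float),
                cudaMemcpyHostToDevice, stream);

// ... the CPU is free to do other work here ...

cudaStreamSynchronize(stream);   // wait until the copy has finished
cudaStreamDestroy(stream);
```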
---
name: Compiling GPU application
class: left
# Compiling a GPU application
Any source file containing CUDA language extensions must be compiled with the
`nvcc` compiler.
.left-column[
Compile & run GPU version
```.sh
nvcc vecadd.cu -o vecadd_gpu
./vecadd_gpu
```
]
.right-column[
Compile & run CPU program
```.sh
gcc vecadd.c -o vecadd_cpu
./vecadd_cpu
```]
---
name: SampleCode
class: left
# Code sample
Let us add two vectors
---
name: Code
class: left
# CPU-Only Version
```.c
void vecAdd(int N, float* A, float* B, float* C) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main(int argc, char **argv)
{
    int N = 16384;                                // default vector size
    float *A = (float*)malloc(N * sizeof(float)); // memory allocation
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));
    vecAdd(N, A, B, C);                           // call compute kernel
    free(A); free(B); free(C);                    // memory de-allocation
}
```
---
name: Code
class: left
# With GPU support
```.cu
int main(int argc, char **argv)
{
    int N = 16384;                                // default vector size
    float *A = (float*)malloc(N * sizeof(float)); // memory allocation
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    float *devPtrA, *devPtrB, *devPtrC;
    cudaMalloc((void**)&devPtrA, N * sizeof(float)); // GPU memory allocation
    cudaMalloc((void**)&devPtrB, N * sizeof(float));
    cudaMalloc((void**)&devPtrC, N * sizeof(float));

    // copy data from the CPU (host) memory to the GPU (device) memory
    cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);

    // call compute kernel with a <<<blocks, threads per block>>> configuration
    vecAdd<<<N/256, 256>>>(devPtrA, devPtrB, devPtrC);

    // copy results from device memory to the host memory
    cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(devPtrA);                            // GPU memory de-allocation
    cudaFree(devPtrB);
    cudaFree(devPtrC);
    free(A); free(B); free(C);                    // host memory de-allocation
}
```
---
name: Code
class: left
# GPU kernel
CPU version
```.c
void vecAdd(int N, float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}
```
GPU version
```.cu
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}
```
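Note that this kernel assumes the vector length is an exact multiple of the block size. A common variant (a sketch, passing `N` explicitly as in the CPU version) guards against the last, partially filled block:
```.cu
// Sketch: bounds-checked kernel plus a launch that covers all N elements.
__global__ void vecAddChecked(int N, float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// host side: round the number of blocks up
int threads = 256;
int blocks  = (N + threads - 1) / threads;
vecAddChecked<<<blocks, threads>>>(N, devPtrA, devPtrB, devPtrC);
```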
---
name: GPU and AI
class: left
# Simple Neural Network
[Img source.](https://www.jeremyjordan.me/intro-to-neural-networks/)
---
name: GPU and AI
class: left
# Simple Neural Network
[Img source.](https://www.jeremyjordan.me/intro-to-neural-networks/)
---
name: GPU and AI
class: left
# MLP - implementation
```.py
import tensorflow as tf
def multi_layer_perceptron(x, weights, biases):
    num_layers = len(weights) + 1
    H = 2.0*(x - min(x))/(max(x) - min(x)) - 1.0
    for l in range(0, num_layers-2):
        W = weights[l]
        b = biases[l]
        H = tf.tanh(tf.add(tf.matmul(H, W), b))
    W = weights[-1]
    b = biases[-1]
    Y = tf.add(tf.matmul(H, W), b)
    return Y
```
[Source code.](https://github.com/maziarraissi/PINNs/blob/0542794b0a91b9e8764a38f5fc9cd9647a3929ba/appendix/continuous_time_inference%20%28Burgers%29/Burgers.py#L91)
---
name: porting_issues
class: left
# Memory layout - porting issues
Consider a set of points (x,y,z) describing some geometry...
---
class: left
# Memory layout - porting issues
.left-column[
```.cpp
// SoA
struct pointlist3D {
    float x[N];
    float y[N];
    float z[N];
};
struct pointlist3D points;
```
]
.right-column[
```.cpp
// AoS
struct point3D {
    float x;
    float y;
    float z;
};
struct point3D points[N];
```
]
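A minimal sketch (hypothetical kernels reusing the structs above) of why SoA usually coalesces on the GPU while AoS does not:
```.cu
// Sketch: scale all x coordinates; only the layout differs.
__global__ void scaleX_SoA(pointlist3D *p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p->x[i] *= s;   // consecutive threads touch consecutive floats (coalesced)
}

__global__ void scaleX_AoS(point3D *p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].x *= s;    // consecutive threads are sizeof(point3D) = 12 bytes apart
}
```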
--
.left-column[
```.py
# Each MxM matrix
# stores only x, y or z coordinates.
import numpy as np
a = np.arange(M*M*3, dtype=np.int8)
a = a.reshape(3,M,M)
a.strides
# (M*M, M, 1)
```
]
.right-column[
```.py
# MxM matrix stores points(x,y,z)
# in its most inner dimension
import numpy as np
a = np.arange(M*M*3, dtype=np.int8)
a = a.reshape(M,M,3)
a.strides
# (3*M, 3, 1)
```
]
---
name: porting_issues
class: left
# Memory layout - img processing
The channels (colors, in the case of an RGB image) can be stored in two ways:
---
name: porting_issues
class: left
# Memory layout - convolutions
[Link.](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html)
---
name: porting_issues
class: left
# Memory layout - NCHW vs NHWC
```
"Layout choice has an effect on performance,
as convolutions implemented for Tensor Cores require NHWC layout
and are fastest when input tensors are laid out in NHWC."
```
[Source - nvidia.](https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html#tensor-layout)
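In terms of flat indexing, the two layouts differ only in which dimension varies fastest. A sketch (with `N, C, H, W` as the tensor extents and `n, c, h, w` as indices):
```.cu
// Sketch: flat offset of element (n, c, h, w) in each layout.
// NCHW: each channel is a contiguous H*W plane; w varies fastest within a plane.
size_t idx_nchw = ((size_t)n * C + c) * H * W + (size_t)h * W + w;

// NHWC: all C channels of one pixel are contiguous; c varies fastest.
size_t idx_nhwc = ((size_t)n * H + h) * W * C + (size_t)w * C + c;
```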
---
class: left
# Command line utilities
```
nvidia-smi # NVIDIA System Management Interface program
cuda-memcheck # checks for memory errors within the program
cuda-gdb # Linux and mac (debugger)
```
---
class: left
# Simple debugging
Compile your application with the debug flags `-g -G` (host and device debug information, respectively):
```.sh
nvcc -g -G program.cu -o program
```
Launch your application with a debugger,
for example the `Nsight Visual Studio Code Edition` plugin for Visual Studio Code.
---
name: DeviceQuery
class: left
# Device Query
```
Tesla V100-PCIE-32GB
Release date: 27 March 2018
$10,664 for 16GB; $11,458* for 32GB
7 TFLOPS (FP64)
```
```.sh
./deviceQuery Starting...
Device 0:
CUDA Driver Version / Runtime Version 11.5 / 11.4
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 32510 MBytes (34089730048 bytes)
(080) Multiprocessors, (064) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1380 MHz (1.38 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 98304 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
```
---
name: DeviceQuery
class: left
# Device Query
```.sh
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 7 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled <-- Error correction code memory (ECC memory)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
```
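Most of these values can also be queried programmatically through the runtime API; a minimal sketch:
```.cu
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device:              %s\n",  prop.name);
    printf("Global memory:       %zu bytes\n", prop.totalGlobalMem);
    printf("Multiprocessors:     %d\n",  prop.multiProcessorCount);
    printf("Warp size:           %d\n",  prop.warpSize);
    printf("Max threads/block:   %d\n",  prop.maxThreadsPerBlock);
    printf("Shared mem/block:    %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```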
---
name: questions
class: left
# Questions
---
class: middle, center
# That's All!
For other cool stuff, check out:
Have fun!