This challenge is another easy one, and it’s basically the first challenge, except instead of a vector we are dealing with a matrix. Let’s get into it!
The Challenge: Matrix Addition
The task is to perform element-wise addition of two N x N matrices A and B, and store the result in C.
Constraints:
- Input matrices A and B have identical dimensions
- 1 ≤ N ≤ 4096
Example
Input: A = [[1.0, 2.0], [3.0, 4.0]], B = [[5.0, 6.0], [7.0, 8.0]]

Output: C = [[6.0, 8.0], [10.0, 12.0]]

As always, here’s how we’d solve this on the CPU:
```python
N = 2
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]

C = []
for i in range(N):
    C.append([])  # Initialize empty row on output
    for j in range(N):
        C[i].append(A[i][j] + B[i][j])

# C is now [[6.0, 8.0], [10.0, 12.0]]
```

It basically does the following:
- Loop over each row
- Loop over each column
- Add `A[i][j] + B[i][j]` and store the result into `C[i][j]`
Solving the Challenge
You might think this would have a different solution from vector addition because it’s a matrix, but once you remember matrices are stored in row-major format, it’s easy to see how we can just treat it like a vector and reuse the same solution we used in vector addition.
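To see why the row-major trick works, here’s a quick pure-Python sketch (the variable names are mine, not from the challenge) showing that flattening an N x N matrix row by row maps element `[i][j]` to flat index `i * N + j`:

```python
N = 2
A = [[1.0, 2.0], [3.0, 4.0]]

# Row-major layout: rows are placed one after another in memory
flat = [x for row in A for x in row]  # [1.0, 2.0, 3.0, 4.0]

# Element [i][j] of the matrix lives at flat index i * N + j
for i in range(N):
    for j in range(N):
        assert flat[i * N + j] == A[i][j]
```

This is exactly the layout a contiguous PyTorch tensor uses, which is why the kernel can ignore rows and columns entirely.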
We’re just going to:
- Treat `A`, `B`, and `C` as flat arrays with `n_elements = N * N`
- Launch a 1D grid like we did in vector addition
- Each program loads a chunk from `A` and `B`, adds the values, and stores them into `C`
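The plan above can be sketched in plain Python before we write any Triton, with each loop iteration standing in for one program in the 1D grid (the numbers and helper names here are mine):

```python
import math

N = 2
BLOCK_SIZE = 3  # Deliberately not a divisor of n_elements, to exercise the mask

# Flat, row-major views of the matrices
a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
c = [0.0] * (N * N)

n_elements = N * N
num_programs = math.ceil(n_elements / BLOCK_SIZE)  # like triton.cdiv

for pid in range(num_programs):  # each iteration plays the role of one program
    for off in range(pid * BLOCK_SIZE, (pid + 1) * BLOCK_SIZE):
        if off < n_elements:  # the "mask": skip out-of-bounds offsets
            c[off] = a[off] + b[off]

print(c)  # [6.0, 8.0, 10.0, 12.0]
```

On the GPU the programs run in parallel rather than sequentially, but since each flat index is touched by exactly one program, the result is the same.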
No need for any fancy optimizations. The reads and writes already hit contiguous memory that no other program in the kernel touches, so the simplest solution is also the most performant :)
I think at this point we’re comfortable enough with GPU programming in Triton that we don’t need pseudocode anymore, especially for super easy challenges like this one. So, unlike my past posts, let’s get straight to the solution code.
The Solution
Step 1. Boilerplate
Let’s see what LeetGPU gives us for the boilerplate:
```python
import torch
import triton
import triton.language as tl

@triton.jit
def matrix_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    pass

# a, b, c are tensors on the GPU
def solve(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor, N: int):
    BLOCK_SIZE = 1024
    n_elements = N * N
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    matrix_add_kernel[grid](a, b, c, n_elements, BLOCK_SIZE)
```

This is basically just like the first post, except it calculates the “array” length using `N * N`.
The kernel doesn’t need to know about rows and columns of the matrix because the memory layout is contiguous and we’re just gonna treat it as a 1D array.
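For a sense of the launch geometry, here’s a quick ceil-division sketch (plain Python standing in for `triton.cdiv`; the example sizes are mine) showing how many programs the grid launches for a few values of N:

```python
def cdiv(a, b):
    # Same result as triton.cdiv: ceil(a / b) using integer arithmetic
    return (a + b - 1) // b

BLOCK_SIZE = 1024
for N in (1, 32, 4096):
    n_elements = N * N
    print(N, n_elements, cdiv(n_elements, BLOCK_SIZE))

# N=1    -> 1 element,           1 program
# N=32   -> 1024 elements,       1 program
# N=4096 -> 16777216 elements,   16384 programs
```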
Step 2. Loading values
Now let’s fill in the kernel with the usual offsets and mask:
```python
@triton.jit
def matrix_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)

    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    a_vals = tl.load(a + offsets, mask=mask)
    b_vals = tl.load(b + offsets, mask=mask)
```

You should be very familiar with this pattern by now, since we’ve used it in basically every challenge so far.
We just get the program ID and use it to calculate the offsets of the values this program should handle in the input arrays, with a mask to prevent out-of-bounds reads.
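As a concrete example (the sizes are mine), with N = 3 and BLOCK_SIZE = 4, here is what each program’s offsets and mask look like, computed with plain Python lists in place of `tl.arange`:

```python
N = 3
BLOCK_SIZE = 4
n_elements = N * N  # 9

for pid in range(3):  # triton.cdiv(9, 4) = 3 programs
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    mask = [off < n_elements for off in offsets]
    print(pid, offsets, mask)

# pid 0: offsets [0, 1, 2, 3],   mask [True, True, True, True]
# pid 1: offsets [4, 5, 6, 7],   mask [True, True, True, True]
# pid 2: offsets [8, 9, 10, 11], mask [True, False, False, False]
```

The last program’s mask is what keeps the kernel from reading past the end of the 9-element array.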
Step 3. Addition and Storage
Now we just add the values we loaded from A and B and store the result in C with the same mask as above to prevent out of bounds writes:
```python
    tl.store(c + offsets, a_vals + b_vals, mask=mask)
```

Step 4. Final Code
We’re done! Here’s the final code :)
```python
import torch
import triton
import triton.language as tl

@triton.jit
def matrix_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)

    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    a_vals = tl.load(a + offsets, mask=mask)
    b_vals = tl.load(b + offsets, mask=mask)

    tl.store(c + offsets, a_vals + b_vals, mask=mask)

# a, b, c are tensors on the GPU
def solve(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor, N: int):
    BLOCK_SIZE = 128  # Experimentally chosen for this problem
    n_elements = N * N
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    matrix_add_kernel[grid](a, b, c, n_elements, BLOCK_SIZE)
```

What’s Next
That’s it for this one. It was pretty simple 😆. Onto the next challenge!