
LeetGPU Challenge #5: Matrix Addition (Triton)

January 7, 2026
4 min read

This challenge is another easy one, and it’s basically the first challenge, except instead of a vector we are dealing with a matrix. Let’s get into it!

The Challenge: Matrix Addition

The task is to do element-wise addition of two N × N matrices A and B, and store the result in C.

Constraints:

  • Input matrices A and B have identical dimensions
  • 1 ≤ N ≤ 4096

Example

Input:  A = [[1.0, 2.0],
             [3.0, 4.0]]
        B = [[5.0, 6.0],
             [7.0, 8.0]]
Output: C = [[6.0, 8.0],
             [10.0, 12.0]]

As always, here’s how we’d solve this on the CPU:

N = 2
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = []
for i in range(N):
    C.append([])  # Initialize empty row on output
    for j in range(N):
        C[i].append(A[i][j] + B[i][j])
# C is now [[6.0, 8.0], [10.0, 12.0]]

It basically does the following:

  1. Loop over each row
  2. Loop over each column
  3. Add A[i][j] + B[i][j] and store the result into C[i][j]

Solving the Challenge

You might think this needs a different solution from vector addition because it’s a matrix, but matrices are stored in row-major format, i.e. contiguously in memory, so we can treat each one as a flat vector and reuse the exact solution we used for vector addition.
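To see why this works, note that in row-major order element A[i][j] lives at flat index i * N + j. Here’s a quick pure-Python sketch of that mapping (illustrative only, not part of the solution):

```python
# Flattening an N x N matrix row by row matches indexing with i * N + j.
N = 2
A = [[1.0, 2.0], [3.0, 4.0]]
flat = [A[i][j] for i in range(N) for j in range(N)]  # row-major flatten
print(flat)  # [1.0, 2.0, 3.0, 4.0]

for i in range(N):
    for j in range(N):
        assert A[i][j] == flat[i * N + j]  # same value, flat index
```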

We’re just going to:

  • Treat A, B, and C as flat arrays with n_elements = N * N
  • Launch a 1D grid like we did in vector addition
  • Each program loads a chunk from A and B, adds the values, and stores them into C

No need for any fancy optimizations. The reads and writes already hit contiguous memory that no other program in the kernel touches, so the simplest solution is also the most performant :)

I think at this point we’re comfortable enough with GPU programming in Triton that we don’t need pseudocode anymore, especially for super easy challenges like this. So, unlike in my past posts, let’s get straight to the solution code.

The Solution

Step 1. Boilerplate

Let’s see what LeetGPU gives us for the boilerplate:

import torch
import triton
import triton.language as tl

@triton.jit
def matrix_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    pass

# a, b, c are tensors on the GPU
def solve(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor, N: int):
    BLOCK_SIZE = 1024
    n_elements = N * N
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    matrix_add_kernel[grid](a, b, c, n_elements, BLOCK_SIZE)

This is basically just like the first post, except it calculates the “array” length as N * N. The kernel doesn’t need to know about the rows and columns of the matrix because the memory layout is contiguous and we’re just gonna treat it as a 1D array.
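As a quick sanity check on the grid math, here’s what triton.cdiv works out to for the largest allowed input, using a plain-Python stand-in for cdiv (just for illustration):

```python
# cdiv rounds up, so a partially-filled last block still gets a program.
def cdiv(x, y):
    return (x + y - 1) // y

N = 4096                       # largest N allowed by the constraints
BLOCK_SIZE = 1024
n_elements = N * N             # 16,777,216 elements
grid = cdiv(n_elements, BLOCK_SIZE)
print(grid)  # 16384 programs, each handling 1024 elements
```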

Step 2. Loading values

Now let’s fill in the kernel with the usual offsets and mask:

@triton.jit
def matrix_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    a_vals = tl.load(a + offsets, mask=mask)
    b_vals = tl.load(b + offsets, mask=mask)

You should be very familiar with this pattern by now since it’s something we’ve used in basically every challenge so far.

We just get the program ID and use it to calculate the offsets of the values this program should handle in the input arrays, with a mask to prevent out-of-bounds reads.
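To make the pattern concrete, here’s the same offsets/mask computation done in plain Python on a toy size (the numbers below are illustrative, not from the challenge):

```python
# Emulate what one program computes. With 10 elements and BLOCK_SIZE=4,
# the grid has 3 programs and the last one (pid=2) is only half full.
BLOCK_SIZE = 4
n_elements = 10
pid = 2  # the last program in the grid

offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
print(offsets)  # [8, 9, 10, 11]

mask = [o < n_elements for o in offsets]
print(mask)     # [True, True, False, False] -> lanes 10 and 11 are masked off
```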

Step 3. Addition and Storage

Now we just add the values we loaded from A and B and store the result in C, using the same mask as above to prevent out-of-bounds writes:

tl.store(c + offsets, a_vals + b_vals, mask=mask)
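Put together, one program’s work amounts to the following plain-Python sketch (not Triton; `emulate_program` is a hypothetical helper name, just for illustration), with the mask guarding both the loads and the store:

```python
def emulate_program(a, b, c, n_elements, BLOCK_SIZE, pid):
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    for o in offsets:
        if o < n_elements:  # mask: skip out-of-bounds lanes entirely
            c[o] = a[o] + b[o]

a = [1.0, 2.0, 3.0, 4.0]   # A from the example, flattened row-major
b = [5.0, 6.0, 7.0, 8.0]   # B flattened
c = [0.0] * 4
emulate_program(a, b, c, n_elements=4, BLOCK_SIZE=4, pid=0)
print(c)  # [6.0, 8.0, 10.0, 12.0]
```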

Step 4. Final Code

We’re done! Here’s the final code :)

import torch
import triton
import triton.language as tl

@triton.jit
def matrix_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    a_vals = tl.load(a + offsets, mask=mask)
    b_vals = tl.load(b + offsets, mask=mask)
    tl.store(c + offsets, a_vals + b_vals, mask=mask)

# a, b, c are tensors on the GPU
def solve(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor, N: int):
    BLOCK_SIZE = 128  # Experimentally chosen for this problem
    n_elements = N * N
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    matrix_add_kernel[grid](a, b, c, n_elements, BLOCK_SIZE)
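If you want to sanity-check the flat-indexing logic without a GPU, here’s a pure-Python emulation of the whole launch (not Triton; the sizes are chosen so the last program is only partially full and the mask actually matters):

```python
def solve_emulated(a, b, c, N, BLOCK_SIZE=4):
    n_elements = N * N
    num_programs = (n_elements + BLOCK_SIZE - 1) // BLOCK_SIZE  # cdiv
    for pid in range(num_programs):
        for i in range(BLOCK_SIZE):
            o = pid * BLOCK_SIZE + i
            if o < n_elements:  # the mask
                c[o] = a[o] + b[o]

N = 3  # 9 elements, so the third program only covers one element
a = [float(i) for i in range(N * N)]  # [0.0, 1.0, ..., 8.0]
b = [10.0] * (N * N)
c = [0.0] * (N * N)
solve_emulated(a, b, c, N)
print(c)  # [10.0, 11.0, ..., 18.0]
```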

What’s Next

That’s it for this one. It was pretty simple 😆. Onto the next challenge!
