
LeetGPU Challenge #5: Matrix Addition (Triton)

January 7, 2026
4 min read

This challenge is another easy one, and it’s basically the first challenge, except instead of a vector we are dealing with a matrix. Let’s get into it!

The Challenge: Matrix Addition

The task is to do element-wise addition of two N × N matrices A and B, and store the result in C.

Constraints:

  • Input matrices A and B have identical dimensions
  • 1 ≤ N ≤ 4096

Example

Input:  A = [[1.0, 2.0],
             [3.0, 4.0]]
        B = [[5.0, 6.0],
             [7.0, 8.0]]
Output: C = [[6.0, 8.0],
             [10.0, 12.0]]

As always, here’s how we’d solve this on the CPU:

N = 2
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = []
for i in range(N):
    C.append([])  # Initialize empty row on output
    for j in range(N):
        C[i].append(A[i][j] + B[i][j])
# C is now [[6.0, 8.0], [10.0, 12.0]]

It basically does the following:

  1. Loop over each row
  2. Loop over each column
  3. Add A[i][j] + B[i][j] and store the result into C[i][j]

Solving the Challenge

You might think this needs a different solution from vector addition because it’s a matrix, but matrices are stored in row-major format, i.e. contiguously in memory, so we can treat each one as a flat vector and reuse the exact solution we used for vector addition.
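To see why this works, note that in row-major order element A[i][j] lives at flat index i * N + j. Here’s a quick pure-Python sketch of that mapping (illustrative only, not part of the solution):

```python
# Flattening an N x N matrix row by row matches indexing with i * N + j.
N = 2
A = [[1.0, 2.0], [3.0, 4.0]]
flat = [A[i][j] for i in range(N) for j in range(N)]  # row-major flatten
print(flat)  # [1.0, 2.0, 3.0, 4.0]

for i in range(N):
    for j in range(N):
        assert A[i][j] == flat[i * N + j]  # same value, flat index
```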

We’re just going to:

  • Treat A, B, and C as flat arrays with n_elements = N * N
  • Launch a 1D grid like we did in vector addition
  • Each program loads a chunk from A and B, adds the values, and stores them into C

No need for any fancy optimizations. The reads and writes already hit contiguous memory that no other program in the kernel touches, so the simplest solution is also the most performant :)

I think at this point we’re comfortable enough with GPU programming in Triton that we don’t need pseudocode anymore, especially for super easy challenges like this. So, unlike in my past posts, let’s get straight to the solution code.

The Solution

Step 1. Boilerplate

Let’s see what LeetGPU gives us for the boilerplate:

import torch
import triton
import triton.language as tl

@triton.jit
def matrix_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    pass

# a, b, c are tensors on the GPU
def solve(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor, N: int):
    BLOCK_SIZE = 1024
    n_elements = N * N
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    matrix_add_kernel[grid](a, b, c, n_elements, BLOCK_SIZE)

This is basically just like the first post, except it calculates the “array” length as N * N. The kernel doesn’t need to know about the rows and columns of the matrix because the memory layout is contiguous and we’re just gonna treat it as a 1D array.
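As a quick sanity check on the grid math, here’s what triton.cdiv works out to for the largest allowed input, using a plain-Python stand-in for cdiv (just for illustration):

```python
# cdiv rounds up, so a partially-filled last block still gets a program.
def cdiv(x, y):
    return (x + y - 1) // y

N = 4096                       # largest N allowed by the constraints
BLOCK_SIZE = 1024
n_elements = N * N             # 16,777,216 elements
grid = cdiv(n_elements, BLOCK_SIZE)
print(grid)  # 16384 programs, each handling 1024 elements
```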

Step 2. Loading values

Now let’s fill in the kernel with the usual offsets and mask:

@triton.jit
def matrix_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    a_vals = tl.load(a + offsets, mask=mask)
    b_vals = tl.load(b + offsets, mask=mask)

You should be very familiar with this pattern by now since it’s something we’ve used in basically every challenge so far.

We just get the program ID and use it to calculate the offsets of the values this program should handle in the input arrays, with a mask to prevent out-of-bounds reads.
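To make the pattern concrete, here’s the same offsets/mask computation done in plain Python on a toy size (the numbers below are illustrative, not from the challenge):

```python
# Emulate what one program computes. With 10 elements and BLOCK_SIZE=4,
# the grid has 3 programs and the last one (pid=2) is only half full.
BLOCK_SIZE = 4
n_elements = 10
pid = 2  # the last program in the grid

offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
print(offsets)  # [8, 9, 10, 11]

mask = [o < n_elements for o in offsets]
print(mask)     # [True, True, False, False] -> lanes 10 and 11 are masked off
```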

Step 3. Addition and Storage

Now we just add the values we loaded from A and B and store the result in C, using the same mask as above to prevent out-of-bounds writes:

tl.store(c + offsets, a_vals + b_vals, mask=mask)
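Put together, one program’s work amounts to the following plain-Python sketch (not Triton; `emulate_program` is a hypothetical helper name, just for illustration), with the mask guarding both the loads and the store:

```python
def emulate_program(a, b, c, n_elements, BLOCK_SIZE, pid):
    offsets = [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]
    for o in offsets:
        if o < n_elements:  # mask: skip out-of-bounds lanes entirely
            c[o] = a[o] + b[o]

a = [1.0, 2.0, 3.0, 4.0]   # A from the example, flattened row-major
b = [5.0, 6.0, 7.0, 8.0]   # B flattened
c = [0.0] * 4
emulate_program(a, b, c, n_elements=4, BLOCK_SIZE=4, pid=0)
print(c)  # [6.0, 8.0, 10.0, 12.0]
```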

Step 4. Final Code

We’re done! Here’s the final code :)

import torch
import triton
import triton.language as tl

@triton.jit
def matrix_add_kernel(a, b, c, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    a_vals = tl.load(a + offsets, mask=mask)
    b_vals = tl.load(b + offsets, mask=mask)
    tl.store(c + offsets, a_vals + b_vals, mask=mask)

# a, b, c are tensors on the GPU
def solve(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor, N: int):
    BLOCK_SIZE = 128  # Experimentally chosen for this problem
    n_elements = N * N
    grid = (triton.cdiv(n_elements, BLOCK_SIZE),)
    matrix_add_kernel[grid](a, b, c, n_elements, BLOCK_SIZE)
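If you want to sanity-check the flat-indexing logic without a GPU, here’s a pure-Python emulation of the whole launch (not Triton; the sizes are chosen so the last program is only partially full and the mask actually matters):

```python
def solve_emulated(a, b, c, N, BLOCK_SIZE=4):
    n_elements = N * N
    num_programs = (n_elements + BLOCK_SIZE - 1) // BLOCK_SIZE  # cdiv
    for pid in range(num_programs):
        for i in range(BLOCK_SIZE):
            o = pid * BLOCK_SIZE + i
            if o < n_elements:  # the mask
                c[o] = a[o] + b[o]

N = 3  # 9 elements, so the third program only covers one element
a = [float(i) for i in range(N * N)]  # [0.0, 1.0, ..., 8.0]
b = [10.0] * (N * N)
c = [0.0] * (N * N)
solve_emulated(a, b, c, N)
print(c)  # [10.0, 11.0, ..., 18.0]
```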

What’s Next

That’s it for this one. It was pretty simple 😆. Onto the next challenge!
