It’s been a while since the last post. Life got in the way a bit, but I’m back with a new challenge! This one felt like a fun puzzle to solve.
The Challenge: 1D Convolution
The task is to implement 1D convolution. We get an input array and a kernel (filter) and we need to produce the convolved output.
We do not pad the input, and the kernel is only applied at positions where it fully overlaps the input (a “valid” convolution).
This means the output size is going to be input_size - kernel_size + 1.
The convolution operation is defined as:

```
output[i] = Σ_{j=0}^{kernel_size-1} input[i + j] * kernel[j]
```

where i ranges from 0 to input_size - kernel_size.
Constraints:
- 1 ≤ input_size ≤ 1,500,000
- 1 ≤ kernel_size ≤ 2047
- kernel_size ≤ input_size
Example
Input:

```
input = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
```

Output:

```
output = [-2, -2, -2]
```

For instance, output[0] = 1*1 + 2*0 + 3*(-1) = -2.

Here’s how we would solve this on the CPU:
```python
input = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
output = []
output_size = len(input) - len(kernel) + 1

for i in range(output_size):
    acc = 0.0
    for j in range(len(kernel)):
        acc += input[i + j] * kernel[j]
    output.append(acc)

# output now contains [-2.0, -2.0, -2.0]
```

We basically do the following:
- For each position `i` in the output (from `0` to `input_size - kernel_size`)
- We go over the kernel elements (from `0` to `kernel_size - 1`)
- We multiply `input[i + j]` by `kernel[j]` and accumulate the result
- We store the accumulated result in `output[i]`
Solving the Challenge
This challenge is pretty simple to solve once you understand the pattern. We need to parallelize the outer loop (over output positions) and handle the inner loop (over kernel elements) inside each program with a local accumulator. This is similar to how we solved matrix multiplication, where each program also ran an inner loop that accumulated into a local accumulator.
The only tricky part is loading the input elements efficiently.
For example, if `kernel_size` is 3, we need to load:

- For output position 0: `input[0], input[1], input[2]`
- For output position 1: `input[1], input[2], input[3]`
- For output position 2: `input[2], input[3], input[4]`
- And so on…
So we’re dealing with a sliding window pattern where each output position needs a different window from the input array.
The clean way to handle this is using broadcasting. We can create a 2D array where:
- Each row represents an output position
- Each column represents a kernel position
- We use broadcasting to create all the input indices we need at once
For example, if we have output positions `[0, 1, 2]` and kernel positions `[0, 1, 2]`, broadcasting `output_positions[:, None] + kernel_positions[None, :]` gives us:

```
[[0+0, 0+1, 0+2],    [[0, 1, 2],
 [1+0, 1+1, 1+2],  =  [1, 2, 3],
 [2+0, 2+1, 2+2]]     [2, 3, 4]]
```

Note

If you’re not familiar with numpy, the `[:, None]` and `[None, :]` syntax might look weird.
Here are some easy-to-understand examples:

`[:, None]`:

```
[1, 2, 3, 4] -> [[1],
                 [2],
                 [3],
                 [4]]
```

`[None, :]`:

```
[1, 2, 3, 4] -> [[1, 2, 3, 4]]
```

It basically just changes the “shape” of the array and how we treat it.
This gives us exactly the sliding window pattern we need!
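To see the trick outside a kernel, here’s a small NumPy sketch of the same index construction (NumPy is only used for illustration here; the actual solution builds the same shapes with Triton’s `tl.arange`):

```python
import numpy as np

input_arr = np.array([1, 2, 3, 4, 5])
output_positions = np.arange(3)  # [0, 1, 2]
kernel_positions = np.arange(3)  # [0, 1, 2]

# Column vector + row vector broadcasts to a 2D grid of indices
indices = output_positions[:, None] + kernel_positions[None, :]
print(indices)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]

# Indexing with the 2D grid gathers every sliding window at once
windows = input_arr[indices]
print(windows)
# [[1 2 3]
#  [2 3 4]
#  [3 4 5]]
```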
As always, let’s first get started with the pseudocode:
Note
If you’re new to Triton or GPU programming, the following pseudocode might be hard to understand. If that’s the case please start from the first challenge’s post and then come back!
```
def conv1d_gpu(input, kernel, output, input_size, kernel_size):
    BLOCK_SIZE = 2048  # output elements per program
    K_BLOCK = 4        # kernel elements to process at a time

    TOTAL = input_size - kernel_size + 1  # total output size
    n_programs = ceil(TOTAL / BLOCK_SIZE)
    gpu_programs = get_gpu_programs(count=n_programs)

    # Each program handles BLOCK_SIZE output elements
    # Calculate which output elements this program handles
    gpu_programs.generate_output_indices(
        offset=program_id * BLOCK_SIZE,
        count=BLOCK_SIZE,
        as=output_indices
    )
    # For program_id=0 and BLOCK_SIZE=3, output_indices would be: [0, 1, 2]

    # Initialize accumulator for all output elements this program handles
    gpu_programs.initialize_accumulator(
        size=BLOCK_SIZE,
        initial_value=0.0,
        as=accumulator
    )

    # Process kernel in chunks
    num_k_chunks = ceil(kernel_size / K_BLOCK)
    for k_chunk in range(num_k_chunks):
        k_chunk_start = k_chunk * K_BLOCK
        k_chunk_end = min(k_chunk_start + K_BLOCK, kernel_size)

        # Generate kernel indices for this chunk
        gpu_programs.generate_kernel_indices(
            start=k_chunk_start,
            end=k_chunk_end,
            as=kernel_indices_chunk
        )
        # For k_chunk=0 and K_BLOCK=4, kernel_indices_chunk would be: [0, 1, 2, 3]

        # Load kernel values for this chunk
        gpu_programs.load_values(
            from=kernel,
            indices=kernel_indices_chunk,
            as=kernel_values_chunk
        )

        # Use broadcasting to create input indices for all output positions at once
        # We create a 2D array where:
        # - Each row represents one output position
        # - Each column represents one kernel position in this chunk
        # - We use broadcasting: output_indices[:, None] + kernel_indices_chunk[None, :]
        #   This creates a column vector from output_indices and a row vector from
        #   kernel_indices_chunk; broadcasting adds them together to create all combinations
        gpu_programs.create_broadcasted_indices(
            output_indices=output_indices,
            kernel_indices=kernel_indices_chunk,
            as=input_indices_2d
        )
        # For output_indices=[0, 1, 2] and kernel_indices_chunk=[0, 1, 2, 3],
        # input_indices_2d would be:
        # [[0+0, 0+1, 0+2, 0+3],    [[0, 1, 2, 3],
        #  [1+0, 1+1, 1+2, 1+3],  =  [1, 2, 3, 4],
        #  [2+0, 2+1, 2+2, 2+3]]     [2, 3, 4, 5]]
        # Each row is the sliding window of input indices we need for that output position

        # Load input values for all output positions at once using the indices we just created
        gpu_programs.load_values(
            from=input,
            indices=input_indices_2d,
            as=input_values_2d
        )
        # This loads a 2D array where each row contains the input window for one output position
        # For our example above, input_values_2d would be:
        # [[input[0], input[1], input[2], input[3]],
        #  [input[1], input[2], input[3], input[4]],
        #  [input[2], input[3], input[4], input[5]]]

        # Multiply the elements and accumulate the results along the kernel dimension
        # Each row gets multiplied elementwise with kernel_values_chunk and summed,
        # then we add this result to the accumulator
        gpu_programs.multiply_and_accumulate(
            input_values_2d=input_values_2d,
            kernel_values=kernel_values_chunk,
            accumulator=accumulator
        )
        # For each output position i:
        #   accumulator[i] += sum(input_values_2d[i] * kernel_values_chunk)

    # After processing all kernel chunks, accumulator contains the final result
    # Store results
    gpu_programs.store_values(
        into=output,
        indices=output_indices,
        values=accumulator
    )

def main(input, kernel, output, input_size, kernel_size):
    conv1d_gpu(input, kernel, output, input_size, kernel_size)
```

The Solution
Now let’s write the actual solution code :)
Step 1. Boilerplate
We’ll start with LeetGPU’s boilerplate (it’s modified a bit to fit how we’re going to actually implement it):
```python
import torch
import triton
import triton.language as tl


@triton.jit
def conv1d_kernel(
    input, kernel, output,
    kernel_size,
    TOTAL,
    BLOCK_SIZE: tl.constexpr,
    K_BLOCK: tl.constexpr,
):
    pass


# input, kernel, output are tensors on the GPU
def solve(input: torch.Tensor, kernel: torch.Tensor, output: torch.Tensor, input_size: int, kernel_size: int):
    BLOCK_SIZE = 2048
    K_BLOCK = 4
    TOTAL = input_size - kernel_size + 1
    n_blocks = triton.cdiv(TOTAL, BLOCK_SIZE)
    grid = (n_blocks,)

    conv1d_kernel[grid](
        input, kernel, output,
        kernel_size,
        TOTAL=TOTAL,
        BLOCK_SIZE=BLOCK_SIZE,
        K_BLOCK=K_BLOCK,
    )
```

From the top in solve:
- Since we get arrays, we’re just doing a 1D grid, similar to what we did in the vector addition challenge
- `BLOCK_SIZE = 2048` is the number of output elements each program handles
- `K_BLOCK = 4` is the chunk size for handling the kernel (we’ll process the kernel in chunks of 4 elements at a time)
- `TOTAL = input_size - kernel_size + 1` is the total number of output elements we need to compute
- We calculate `n_blocks = triton.cdiv(TOTAL, BLOCK_SIZE)` to determine how many programs we need
- The kernel function takes `TOTAL` as a parameter instead of `input_size` since that’s what we actually need for bounds checking
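`triton.cdiv` is just ceiling division. A plain-Python sketch of what it computes, plugged into the challenge’s largest allowed sizes (the concrete numbers here are my own illustration, not part of the challenge):

```python
def cdiv(a: int, b: int) -> int:
    # Ceiling division: how many blocks of size b are needed to cover a elements
    return (a + b - 1) // b

# Largest allowed input with the largest allowed kernel:
TOTAL = 1_500_000 - 2047 + 1  # input_size - kernel_size + 1
print(cdiv(TOTAL, 2048))      # -> 732 programs in the grid
```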
Step 2. Setting Up the Indices and the Accumulator
Next we set up the indices for the output elements and initialize our accumulator:
```python
@triton.jit
def conv1d_kernel(
    input, kernel, output,
    kernel_size,
    TOTAL,
    BLOCK_SIZE: tl.constexpr,
    K_BLOCK: tl.constexpr,
):
    pid = tl.program_id(0)

    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)[:, None]
    mask = offsets < TOTAL

    k_offsets = tl.arange(0, K_BLOCK)[None, :]
    acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
```

From the top:

- `pid = tl.program_id(0)` gets the program ID from our 1D grid
- `offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)[:, None]` creates a column vector of output indices we handle in this program
- `mask = offsets < TOTAL` creates a mask for us that will help prevent out-of-bounds memory access
- `k_offsets = tl.arange(0, K_BLOCK)[None, :]` creates the base row vector `[0, 1, 2, ..., K_BLOCK-1]` for kernel indices
- `acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)` initializes our accumulator as a 1D array of zeros, one per output element we handle in this program
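If the shapes are hard to picture, here’s the same index setup mirrored in NumPy with tiny, made-up sizes (`BLOCK_SIZE=4`, `TOTAL=6`, `pid=1`; this is only a shape illustration, not Triton code):

```python
import numpy as np

BLOCK_SIZE, K_BLOCK, TOTAL = 4, 2, 6
pid = 1

# Column vector of the output indices this "program" handles, shape (4, 1)
offsets = pid * BLOCK_SIZE + np.arange(BLOCK_SIZE)[:, None]
mask = offsets < TOTAL

print(offsets.ravel())  # [4 5 6 7]
print(mask.ravel())     # [ True  True False False] -> 6 and 7 are out of bounds

# Row vector of kernel offsets, shape (1, 2)
k_offsets = np.arange(K_BLOCK)[None, :]
```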
Step 3. Processing the Kernel in Chunks
Now let’s add the loop that processes the kernel in chunks:
```python
    for k in range(0, kernel_size, K_BLOCK):
        k_mask = (k + k_offsets) < kernel_size
        k_vals = tl.load(kernel + k + k_offsets, mask=k_mask)

        in_ptrs = input + offsets + k + k_offsets
        in_mask = mask & k_mask
        in_vals = tl.load(in_ptrs, mask=in_mask)

        acc += tl.sum(in_vals * k_vals, axis=1)
```

From the top:

- `for k in range(0, kernel_size, K_BLOCK)` iterates over the kernel in chunks of `K_BLOCK` elements
- `k_mask = (k + k_offsets) < kernel_size` creates a mask for the kernel chunk to prevent out-of-bounds memory access
- `k_vals = tl.load(kernel + k + k_offsets, mask=k_mask)` loads the kernel values for this chunk
- `in_ptrs = input + offsets + k + k_offsets` uses broadcasting to create a `BLOCK_SIZE x K_BLOCK` grid of input pointers
- `in_mask = mask & k_mask` combines both masks to prevent out-of-bounds memory access
- `in_vals = tl.load(in_ptrs, mask=in_mask)` loads the input values
- `acc += tl.sum(in_vals * k_vals, axis=1)` multiplies and sums along the kernel dimension, then adds the result to our accumulator
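The chunked loop can be mirrored in NumPy to convince yourself the masking works when `kernel_size` isn’t a multiple of `K_BLOCK` (masked loads become zero-filled gathers here; the sizes and values are made up for the demo):

```python
import numpy as np

def conv1d_chunked(inp, kern, K_BLOCK=4):
    # NumPy mirror of the kernel's loop: process the kernel K_BLOCK
    # elements at a time and accumulate per output position.
    total = len(inp) - len(kern) + 1
    offsets = np.arange(total)[:, None]  # column: output positions
    acc = np.zeros(total, dtype=np.float32)
    for k in range(0, len(kern), K_BLOCK):
        k_offsets = k + np.arange(K_BLOCK)[None, :]  # row: this chunk's kernel indices
        k_mask = k_offsets < len(kern)
        # "Masked loads": clamp indices into range, then zero out masked-off lanes
        k_vals = np.where(k_mask, kern[np.minimum(k_offsets, len(kern) - 1)], 0.0)
        in_vals = np.where(k_mask, inp[np.minimum(offsets + k_offsets, len(inp) - 1)], 0.0)
        acc += np.sum(in_vals * k_vals, axis=1)
    return acc

inp = np.arange(1.0, 11.0)                    # [1, 2, ..., 10]
kern = np.array([1.0, 0.0, -1.0, 2.0, 0.5])   # kernel_size=5, not a multiple of K_BLOCK
print(conv1d_chunked(inp, kern))
```

The last chunk only has one valid kernel element (`k=4`), and the mask zeroes out the other three lanes, so the result matches the plain double loop from the CPU version.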
Step 4. Storing the Results
Now all that’s left to do is to store the results:
```python
    tl.store(output + offsets, acc[:, None], mask=mask)
```

Here we need to use `acc[:, None]` to reshape the accumulator into a column vector so it matches the shape of our output indices. This is just something `tl.store` needs us to do so it can figure out where to store the values.
Step 5. Final Code
We’re done! Here’s the complete solution:
```python
import torch
import triton
import triton.language as tl


@triton.jit
def conv1d_kernel(
    input, kernel, output,
    kernel_size,
    TOTAL,
    BLOCK_SIZE: tl.constexpr,
    K_BLOCK: tl.constexpr,
):
    pid = tl.program_id(0)

    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)[:, None]
    mask = offsets < TOTAL

    k_offsets = tl.arange(0, K_BLOCK)[None, :]
    acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)

    for k in range(0, kernel_size, K_BLOCK):
        k_mask = (k + k_offsets) < kernel_size
        k_vals = tl.load(kernel + k + k_offsets, mask=k_mask)

        in_ptrs = input + offsets + k + k_offsets
        in_mask = mask & k_mask
        in_vals = tl.load(in_ptrs, mask=in_mask)

        acc += tl.sum(in_vals * k_vals, axis=1)

    tl.store(output + offsets, acc[:, None], mask=mask)


# input, kernel, output are tensors on the GPU
def solve(input: torch.Tensor, kernel: torch.Tensor, output: torch.Tensor, input_size: int, kernel_size: int):
    BLOCK_SIZE = 2048
    K_BLOCK = 4
    TOTAL = input_size - kernel_size + 1
    n_blocks = triton.cdiv(TOTAL, BLOCK_SIZE)
    grid = (n_blocks,)

    conv1d_kernel[grid](
        input, kernel, output,
        kernel_size,
        TOTAL=TOTAL,
        BLOCK_SIZE=BLOCK_SIZE,
        K_BLOCK=K_BLOCK,
    )
```

What’s Next
In this challenge, figuring out how to structure the broadcasting for the inner loop felt like a pretty fun puzzle! I won’t be able to post as often as I’d like for a while, but I promise to post at least once a week, so please stay tuned :)