Today’s challenge is extremely simple, so let’s make it quick 😆.
The Challenge: Reverse Array
The task is to reverse an array in place. We get an array of numbers and we need to reverse it, modifying the original array.
Constraints:
- 1 ≤ N ≤ 100,000,000
Example
Input: [1.0, 2.0, 3.0, 4.0]
Output: [4.0, 3.0, 2.0, 1.0]

Here’s how we’d solve this on the CPU:

```python
arr = [1.0, 2.0, 3.0, 4.0]
N = len(arr)

for i in range(N // 2):
    j = N - 1 - i
    arr[i], arr[j] = arr[j], arr[i]

# arr is now [4.0, 3.0, 2.0, 1.0]
```

We basically do the following:
- Loop over the first half of the array (from `0` to `N // 2 - 1`)
- For each position `i`, calculate the corresponding position on the other side: `j = N - 1 - i`
- Swap the values at positions `i` and `j`
Notice how if the array has an odd number of elements, we automatically skip the middle element since it doesn’t need to be swapped.
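For example, with five elements the loop only runs twice and the middle index is never touched. Here’s a quick sketch reusing the loop above:

```python
# Reversing an odd-length array: the loop runs N // 2 = 2 times,
# swapping (0, 4) and (1, 3); the middle element at index 2 stays put.
arr = [1.0, 2.0, 3.0, 4.0, 5.0]
N = len(arr)

for i in range(N // 2):
    j = N - 1 - i
    arr[i], arr[j] = arr[j], arr[i]

print(arr)  # [5.0, 4.0, 3.0, 2.0, 1.0]
```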
Solving the Challenge
To solve this challenge we’re just going to parallelize the CPU solution above on the GPU.
Given how simple this challenge is and how many challenges we’ve covered so far with in-depth pseudocode and explanations, I think this time we can skip the pseudocode and get straight to the solution :)
The Solution
Step 1. Boilerplate
Let’s start with a basic boilerplate:
```python
import torch
import triton
import triton.language as tl


@triton.jit
def reverse_kernel(
    input,
    half,
    N,
    BLOCK_SIZE: tl.constexpr
):
    pass


# input is a tensor on the GPU
def solve(input: torch.Tensor, N: int):
    BLOCK_SIZE = 256
    half = N // 2
    n_blocks = triton.cdiv(half, BLOCK_SIZE)
    grid = (n_blocks,)

    reverse_kernel[grid](
        input,
        half,
        N,
        BLOCK_SIZE=BLOCK_SIZE
    )
```

From the top in `solve`:
- Since we get an array, we’re just doing a 1D grid, similar to what we did in the vector addition challenge
- `BLOCK_SIZE = 256` is the number of elements we want each program to handle
- `half = N // 2` is the midpoint. We only need to process elements up to this point since swapping is symmetric, as we covered above.
- We calculate `n_blocks = triton.cdiv(half, BLOCK_SIZE)` to determine how many programs we need
- We pass `half` and `N` to the kernel so it knows where to stop and how to calculate the right side’s indices
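To make the launch math concrete, here’s a quick sketch with a hypothetical N (the numbers are just for illustration; `math.ceil` stands in for `triton.cdiv`):

```python
import math

# Hypothetical problem size to illustrate the grid arithmetic in solve().
N = 1_000_000
BLOCK_SIZE = 256

half = N // 2                            # 500_000 pairs to swap
n_blocks = math.ceil(half / BLOCK_SIZE)  # same result as triton.cdiv(half, BLOCK_SIZE)

print(n_blocks)  # 1954
```

So for a million elements we launch 1954 programs, each responsible for up to 256 swaps.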
Step 2. Calculating the Indices and the Mask
Now let’s fill in the kernel with the index calculations and the mask:
```python
@triton.jit
def reverse_kernel(
    input,
    half,
    N,
    BLOCK_SIZE: tl.constexpr
):
    pid = tl.program_id(axis=0)
    left_offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    right_offsets = (N - 1) - left_offsets
    mask = left_offsets < half
```

From the top:

- `pid = tl.program_id(axis=0)` gets the program ID from our 1D grid
- `left_offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)` creates the indices for the left side elements this program handles
- `right_offsets = (N - 1) - left_offsets` calculates the corresponding right side indices
- `mask = left_offsets < half` creates a mask that prevents us from processing beyond the midpoint. This makes sure we don’t try to swap elements in the second half — those pairs are already handled from the left side, so swapping them again would undo the reversal.
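To see the index math in action, here’s a host-side sketch with toy values (N = 10, BLOCK_SIZE = 4, looking at the program with pid = 1 — numbers picked purely for illustration):

```python
import numpy as np

# Simulating the kernel's index math on the host for toy values.
N, BLOCK_SIZE, pid = 10, 4, 1
half = N // 2  # 5

left_offsets = pid * BLOCK_SIZE + np.arange(BLOCK_SIZE)  # [4 5 6 7]
right_offsets = (N - 1) - left_offsets                   # [5 4 3 2]
mask = left_offsets < half                               # [True False False False]
```

Only the first lane (left index 4, paired with right index 5) is active; the other three lanes are masked off because their left indices fall past the midpoint.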
Step 3. Loading, Swapping, and Storing
Now let’s load the values and perform the swap:
```python
    l_val = tl.load(input + left_offsets, mask=mask)
    r_val = tl.load(input + right_offsets, mask=mask)

    tl.store(input + left_offsets, r_val, mask=mask)
    tl.store(input + right_offsets, l_val, mask=mask)
```

From the top:

- `l_val = tl.load(input + left_offsets, mask=mask)` loads the values from the left side positions, using the mask to prevent out-of-bounds reads
- `r_val = tl.load(input + right_offsets, mask=mask)` does the same for the right side positions
- `tl.store(input + left_offsets, r_val, mask=mask)` stores the right side values into the left side positions; the mask keeps inactive lanes from writing and undoing swaps we already made
- `tl.store(input + right_offsets, l_val, mask=mask)` stores the left side values into the right side positions, masked for the same reason
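To convince ourselves the masked swap really reverses the array, here’s a host-side NumPy emulation of the kernel’s logic — a sketch, not the actual GPU code, just mirroring the per-program load/swap/store on toy values:

```python
import numpy as np

# Emulating the kernel on the host for a small odd-length array.
arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
N, BLOCK_SIZE = len(arr), 4
half = N // 2
n_blocks = -(-half // BLOCK_SIZE)  # ceil division, like triton.cdiv

for pid in range(n_blocks):
    left = pid * BLOCK_SIZE + np.arange(BLOCK_SIZE)
    right = (N - 1) - left
    mask = left < half

    # Masked loads: only active lanes participate
    l_val = arr[left[mask]].copy()
    r_val = arr[right[mask]].copy()

    # Masked stores: write each side's values to the opposite side
    arr[left[mask]] = r_val
    arr[right[mask]] = l_val

print(arr)  # [7. 6. 5. 4. 3. 2. 1.]
```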
Step 4. Final Code
We’re done! Here’s the complete solution:
```python
import torch
import triton
import triton.language as tl


@triton.jit
def reverse_kernel(
    input,
    half,
    N,
    BLOCK_SIZE: tl.constexpr
):
    pid = tl.program_id(axis=0)
    left_offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    right_offsets = (N - 1) - left_offsets
    mask = left_offsets < half

    l_val = tl.load(input + left_offsets, mask=mask)
    r_val = tl.load(input + right_offsets, mask=mask)

    tl.store(input + left_offsets, r_val, mask=mask)
    tl.store(input + right_offsets, l_val, mask=mask)


# input is a tensor on the GPU
def solve(input: torch.Tensor, N: int):
    BLOCK_SIZE = 256
    half = N // 2
    n_blocks = triton.cdiv(half, BLOCK_SIZE)
    grid = (n_blocks,)

    reverse_kernel[grid](
        input,
        half,
        N,
        BLOCK_SIZE=BLOCK_SIZE
    )
```

What’s Next
That’s it for this one! The next few challenges are going to be similarly simple up to a certain point, so I’ll be able to post them faster than usual since the write-ups take less time. Stay tuned!