cuTile Kernels
In this blog post, we will explore one of NVIDIA’s latest releases: cuTile. It represents the next progression in writing CUDA kernels by moving away from thread-level management.
A YouTube video tutorial can also be found here, and a deep dive into the architecture and background here.
The official release docs for cuTile are available here, but a possibly better introduction is here, with the TL;DR: “NVIDIA Finally Adds Native Python Support to CUDA.”
The key takeaway of cuTile is its core programming model: operations are expressed on arrays (tiles) rather than on individual threads. As a result, kernels are developed differently than in standard CUDA code, where the mental model is centered around threads. Developers write code at a higher level, and cuTile takes care of mapping tiles to CUDA threads. The goal of cuTile is to let developers focus on the algorithm rather than on GPU hardware specifics.
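Before diving in, here is a rough, CPU-only NumPy sketch of the two mental models (nothing cuTile-specific, just an analogy): in the thread-centric view each worker owns a single element, while in the tile-centric view each worker owns a whole sub-array and expresses its work as array operations.

import numpy as np

a = np.arange(64, dtype=np.float32).reshape(8, 8)
out = np.empty_like(a)

# Thread-style mental model: one scalar element per worker
for i in range(8):
    for j in range(8):
        out[i, j] = a[i, j] * 2.0

# Tile-style mental model: one sub-array (tile) per worker,
# expressed as a whole-array operation
TILE = 4
for bi in range(8 // TILE):
    for bj in range(8 // TILE):
        out[bi*TILE:(bi+1)*TILE, bj*TILE:(bj+1)*TILE] = \
            a[bi*TILE:(bi+1)*TILE, bj*TILE:(bj+1)*TILE] * 2.0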
While this sounds good, let’s try it out and see the difference.
Setting Up the Environment
This turned out to be the trickiest step. Anyone tinkering with GPUs knows the slippery slope of driver versions and environment issues. After clearing the initial hurdles, I discovered the next requirement: the GPU drivers need to be at a very recent version. That is not something you can easily change as a developer on many cloud providers, which led me to a dead end with RunPod.
The solution was to use a B200 on Modal. While a bit overkill for a grayscale kernel, it provided the necessary environment to get running.
cuTile vs. CUDA
To compare the two, we’ll implement a classic example from the GPU Mode series by Jeremy Howard: converting an RGB image to grayscale.
The lecture teaches us how to take an RGB image and turn it into grayscale. The computation is highly parallelizable over the pixels of the image, with each output pixel depending on the R, G, and B values at that location. If we have an image of shape (C, H, W), we can turn it into H·W independent tasks to distribute across threads.
For each pixel $i$, we calculate the grayscale value:
$$ \text{Gray}[i] = 0.2989 \cdot \text{Red}[i] + 0.5870 \cdot \text{Green}[i] + 0.1140 \cdot \text{Blue}[i] $$
As an example for an image, we can do this in Python as follows (taken from the lecture code linked above):
import torch

def rgb2grey_py(x):
    # x has shape (3, h, w): channel-first RGB
    c, h, w = x.shape
    n = h * w
    x = x.flatten()  # channels are laid out one after another: R, then G, then B
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    for i in range(n):
        res[i] = 0.2989 * x[i] + 0.5870 * x[i + n] + 0.1140 * x[i + 2 * n]
    return res.view(h, w)
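As a side note, the loop above is painfully slow in pure Python. A vectorized PyTorch equivalent (the helper name rgb2grey_torch is mine, not from the lecture) is handy as a reference to check the kernel output against later:

def rgb2grey_torch(x):
    # x has shape (3, h, w); weight the channels and sum them
    return 0.2989 * x[0] + 0.5870 * x[1] + 0.1140 * x[2]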
Notice that in the loop version we offset by n because of how the (C, H, W) tensor gets flattened into a 1D array. Now, let's jump to our cuTile example. The first step is to define the kernel itself. Its arguments include both the input matrix and the output matrix: in cuTile, we don't return data; we write it into the output matrix using ct.store().
TILE_SIZE = 16

@ct.kernel
def rgb_kernel(a, out, tn, tm):
    # Get the tile indices for this kernel invocation
    row_block = ct.bid(0)
    col_block = ct.bid(1)

    # Grayscale weights
    R_WEIGHT = 0.2989
    G_WEIGHT = 0.5870
    B_WEIGHT = 0.1140

    # Load tiles for each channel (R, G, B)
    # index = (channel, row, col)
    r = ct.load(a, index=(0, row_block, col_block), shape=(1, tn, tm))
    g = ct.load(a, index=(1, row_block, col_block), shape=(1, tn, tm))
    b = ct.load(a, index=(2, row_block, col_block), shape=(1, tn, tm))

    # Compute grayscale using element-wise math
    # cuTile automatically handles the broadcasting
    grey_tile = (r * R_WEIGHT) + (g * G_WEIGHT) + (b * B_WEIGHT)

    # Reshape from (1, tn, tm) to (tn, tm) for the 2D output
    grey_tile = ct.reshape(grey_tile, shape=(tn, tm))

    # Store the result back to global memory
    ct.store(out, index=(row_block, col_block), tile=grey_tile)
The row_block and col_block are obtained via ct.bid (Block ID), which indexes us into the different tiles. Unlike the standard CUDA version where we spin up a thread for each pixel, we instead process the image in tiles. The key here is that inside each kernel invocation, we use ct.bid to select the correct part of the input matrix. We have many parallel kernel invocations for the same matrices, differentiated by their ct.bid.
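As a quick sanity check on that mapping (plain Python arithmetic, independent of cuTile), here is which pixel range a given block ID pair covers when TILE_SIZE = 16:

TILE_SIZE = 16
row_block, col_block = 2, 3  # example values that ct.bid(0), ct.bid(1) could return

row_start, row_end = row_block * TILE_SIZE, (row_block + 1) * TILE_SIZE - 1
col_start, col_end = col_block * TILE_SIZE, (col_block + 1) * TILE_SIZE - 1
print(f"block (2, 3) covers rows {row_start}-{row_end} and cols {col_start}-{col_end}")
# block (2, 3) covers rows 32-47 and cols 48-63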
A small thing to be aware of here is that I had to perform three separate loads, one for each channel, to bring the data from global memory into the tiles the kernel operates on.
To run the kernel, we define a grid. The grid tells cuTile how many invocations of the kernel are needed to cover the entire calculation. This grid determines the values of ct.bid. While the grid can have up to 3 dimensions, we only need 2 for this operation.
import cupy

# Calculate the grid size based on image dimensions (C, H, W) and tile size
grid = (ct.cdiv(img.shape[1], TILE_SIZE), ct.cdiv(img.shape[2], TILE_SIZE), 1)

# Allocate the output and move the input to the GPU
result = torch.zeros((img.shape[1], img.shape[2]), dtype=torch.float32, device='cuda')
img_cuda = img.to(device='cuda')

# Launch the kernel; the grayscale image is written into `result` via ct.store
ct.launch(cupy.cuda.get_current_stream(), grid, rgb_kernel, (img_cuda, result, TILE_SIZE, TILE_SIZE))
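With the launch done, the grayscale image lives in result. A simple sanity check, assuming the rgb2grey_torch helper defined earlier, is to compare against the vectorized reference:

# Compare the kernel output against the vectorized reference defined earlier
expected = rgb2grey_torch(img_cuda.float())
print(torch.allclose(result, expected.to(result.dtype), atol=1e-3))  # should print True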
For a more in-depth look at the equivalent CUDA code, check out the original YouTube video from Jeremy Howard on the problem.
Hopefully, I will have more time to play around with cuTile and return with a follow-up blog post soon. The full code is available here.