Tiling Through Reductions

#include "helpers.h"
#include "library/array/annot/cpu-array.h"
#include "library/array/impl/cpu-array.h"


Similar to the tiling the output, users can
introduce tiling over a "Reducible" parameter. For instance, the reduction function shown below takes a parameter `k` that specifies how many elements of array `a` contribute to each output element in array `b`:

```C++
inline void reduction(ArrayCPU a, ArrayCPU b, int64_t k) {
    for (int64_t i = 0; i < b.size; i++) {
        for (int64_t j = 0; j < k; j++) {
            b.data[i] += a.data[j];
        }
    }
}

The annotation for the function indicates that the function can also be split across k.

return annotate(Tileable(x = Expr(0), output["size"], step,
                        Computes(
                            Produces::Subset(output, {x, step}),
                            Consumes::Subsets(
                                // Indicate a reducible loop
                                Reducible(r = Expr(0), k, reduce,
                                    SubsetObjMany{
                                    SubsetObj(input, {r, reduce})})))));

The key point is that the reducible parameter must be part of the function interface so that users can tile a reducible loop if needed. A reducible loop is similar to a tilable loop, but with one important distinction: its output is accumulated into a value that is staged outside the loop. This external accumulator enables reduction across tiles or iterations.

using namespace gern;

int main() {
    // ***** PROGRAM DEFINITION *****
    auto input = mk_array("input");
    auto output = mk_array("output");

    gern::annot::reduction reduce;
    Variable len("len");
    Variable tile_size("tile_size");

    Composable program({
        Reduce(len, tile_size)(
            reduce(input, output, len)),
    });

    // ***** PROGRAM EVALUATION *****
    library::impl::ArrayCPU a(10);
    a.ascending();
    library::impl::ArrayCPU b(10);
    int64_t len_val = 10;
    int64_t tile_size_val = 2;

    auto runner = compile_program(program);
    runner.evaluate({
        {"input", &a},
        {"output", &b},
        {"len", &len_val},
        {"tile_size", &tile_size_val},
    }); 
}

The program produces the following code:

void function_9(library::impl::ArrayCPU &input, library::impl::ArrayCPU &output, int64_t len, int64_t tile_size) {
    for (int64_t _gern_r_1_5 = 0; (_gern_r_1_5 < len); _gern_r_1_5 = (_gern_r_1_5 + tile_size)) {
        auto _query_input_10 = input.query(_gern_r_1_5, tile_size);

        library::impl::reduction(_query_input_10, output, tile_size);
    }
}