
[Backend][AIE] Support global indexing, tensor access and kernel reindex for AIE kernel mapping #300

Merged · 24 commits · Feb 20, 2025

Conversation

EthanMeng324 (Contributor)

Description

The original AIE backend with the dataflow interface only supports simple element-wise applications, such as vector addition. It used local indexing, which is insufficient for more complex tensor access patterns such as matrix multiplication.

Problems

In the original AIE interface, the compiler lacks sufficient information to determine which dimension (M, K, or N) should be used for tiling in this GEMM implementation.

Ty = int32
M, N, K = 16, 16, 16
P0 = 1
Mt = M // P0

@df.region()
def top():
    @df.kernel(mapping=[P0])
    def gemm(A: Ty[M, K], B: Ty[K, N], C: Ty[M, N]):
        for i, j, k in allo.grid(Mt, K, N):
            C[i, j] += A[i, k] * B[k, j]

Proposed Solutions

In the new interface, we adopt global indexing for AIE, similar to the HLS backend. Additionally, we leverage tensor-slice accesses, inspired by Triton, so the compiler can determine the precise access range of each tensor, which facilitates backend reindexing. Furthermore, we use primitives such as allo.matmul to encapsulate fundamental operations, streamlining MLIR construction and lowering.

Ty = int32
M, N, K = 16, 16, 16
P0 = 2
Mt = M // P0

@df.region()
def top():
    @df.kernel(mapping=[P0])
    def gemm(A: Ty[M, K], B: Ty[K, N], C: Ty[M, N]):
        pi = df.get_pid()
        C[pi * Mt: (pi + 1) * Mt, :] = allo.matmul(
            A[pi * Mt: (pi + 1) * Mt, :], B)
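The tiling scheme in the example above can be sketched in plain Python (a hypothetical illustration without the Allo/AIE runtime; `matmul_rows` stands in for `allo.matmul`): each of the P0 kernel instances computes an Mt-row slice of C, with global row offsets derived from its PE id (`df.get_pid()`).

```python
M = N = K = 4   # small sizes for illustration; the PR uses 16
P0 = 2
Mt = M // P0

A = [[i * K + k for k in range(K)] for i in range(M)]
B = [[k * N + j for j in range(N)] for k in range(K)]
C = [[0] * N for _ in range(M)]

def matmul_rows(A_rows, B):
    """Reference matmul over a row block, standing in for allo.matmul."""
    return [[sum(a[k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for a in A_rows]

for pi in range(P0):                     # one iteration per compute tile
    lo, hi = pi * Mt, (pi + 1) * Mt      # global row range for this PE
    C[lo:hi] = matmul_rows(A[lo:hi], B)  # C[pi*Mt:(pi+1)*Mt, :] = matmul(A[...], B)

# Partitioned result matches a full matmul over all rows
assert C == matmul_rows(A, B)
```

Each PE sees only its own row block of A but all of B, which is exactly the distribute/broadcast split reflected in the object FIFOs of the generated MLIR.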
The generated AIE MLIR for this example:

module {
  aie.device(npu1_2col) {
    %tile_shim = aie.tile(0, 0)
    %tile_mem0 = aie.tile(0, 1)
    %tile_mem1 = aie.tile(1, 1)
    %tile_comp0 = aie.tile(0, 2)
    %tile_comp0_buf0 = aie.buffer(%tile_comp0) : memref<8x16xi32>
    %tile_comp1 = aie.tile(0, 3)
    %tile_comp1_buf0 = aie.buffer(%tile_comp1) : memref<8x16xi32>
    aie.objectfifo @in_sh0(%tile_shim, {%tile_mem0}, 2 : i32) : !aie.objectfifo<memref<16x16xi32>>
    aie.objectfifo @in0_p0(%tile_mem0, {%tile_comp0}, 2 : i32) : !aie.objectfifo<memref<8x16xi32>>
    aie.objectfifo @in0_p1(%tile_mem0, {%tile_comp1}, 2 : i32) : !aie.objectfifo<memref<8x16xi32>>
    aie.objectfifo.link [@in_sh0] -> [@in0_p0, @in0_p1]([] [0, 128])
    aie.objectfifo @in_sh1(%tile_shim, {%tile_mem1}, 2 : i32) : !aie.objectfifo<memref<16x16xi32>>
    aie.objectfifo @in1_p0(%tile_mem1, {%tile_comp0, %tile_comp1}, 2 : i32) : !aie.objectfifo<memref<16x16xi32>>
    aie.objectfifo.link [@in_sh1] -> [@in1_p0]([] [])
    aie.objectfifo @out_p0(%tile_comp0, {%tile_mem0}, 2 : i32) : !aie.objectfifo<memref<8x16xi32>>
    aie.objectfifo @out_p1(%tile_comp1, {%tile_mem0}, 2 : i32) : !aie.objectfifo<memref<8x16xi32>>
    aie.objectfifo @out_sh(%tile_mem0, {%tile_shim}, 2 : i32) : !aie.objectfifo<memref<16x16xi32>>
    aie.objectfifo.link [@out_p0, @out_p1] -> [@out_sh]([0, 128] [])
    %core_0_2 = aie.core(%tile_comp0) {
      %c1000 = arith.constant 0 : index
      %c1001 = arith.constant 1 : index
      %c9223372036854775807 = arith.constant 9223372036854775807 : index
      scf.for %arg0 = %c1000 to %c9223372036854775807 step %c1001 {
        %fifo0 = aie.objectfifo.acquire @in0_p0(Consume, 1) : !aie.objectfifosubview<memref<8x16xi32>>
        %local0 = aie.objectfifo.subview.access %fifo0[0] : !aie.objectfifosubview<memref<8x16xi32>> -> memref<8x16xi32>
        %fifo1 = aie.objectfifo.acquire @in1_p0(Consume, 1) : !aie.objectfifosubview<memref<16x16xi32>>
        %local1 = aie.objectfifo.subview.access %fifo1[0] : !aie.objectfifosubview<memref<16x16xi32>> -> memref<16x16xi32>
        %fifo_out = aie.objectfifo.acquire @out_p0(Produce, 1) : !aie.objectfifosubview<memref<8x16xi32>>
        %local_out = aie.objectfifo.subview.access %fifo_out[0] : !aie.objectfifosubview<memref<8x16xi32>> -> memref<8x16xi32>
      %c0_i32 = arith.constant 0 : i32
      %subview = memref.subview %local0[0, 0] [8, 16] [1, 1] : memref<8x16xi32> to memref<8x16xi32, strided<[16, 1]>>
      %c0 = arith.constant 0 : index
      %c8 = arith.constant 8 : index
      %c1 = arith.constant 1 : index
      scf.for %arg3 = %c0 to %c8 step %c1 {
        %c0_4 = arith.constant 0 : index
        %c16 = arith.constant 16 : index
        %c1_5 = arith.constant 1 : index
        scf.for %arg4 = %c0_4 to %c16 step %c1_5 {
          memref.store %c0_i32, %tile_comp0_buf0[%arg3, %arg4] : memref<8x16xi32>
        }
      }
      %c0_0 = arith.constant 0 : index
      %c8_1 = arith.constant 8 : index
      %c1_2 = arith.constant 1 : index
      scf.for %arg3 = %c0_0 to %c8_1 step %c1_2 {
        %c0_4 = arith.constant 0 : index
        %c16 = arith.constant 16 : index
        %c1_5 = arith.constant 1 : index
        scf.for %arg4 = %c0_4 to %c16 step %c1_5 {
          %c0_6 = arith.constant 0 : index
          %c16_7 = arith.constant 16 : index
          %c1_8 = arith.constant 1 : index
          scf.for %arg5 = %c0_6 to %c16_7 step %c1_8 {
            %0 = memref.load %subview[%arg3, %arg5] : memref<8x16xi32, strided<[16, 1]>>
            %1 = memref.load %local1[%arg5, %arg4] : memref<16x16xi32>
            %2 = memref.load %tile_comp0_buf0[%arg3, %arg4] : memref<8x16xi32>
            %3 = arith.muli %0, %1 : i32
            %4 = arith.addi %2, %3 : i32
            memref.store %4, %tile_comp0_buf0[%arg3, %arg4] : memref<8x16xi32>
          }
        }
      }
      %subview_3 = memref.subview %local_out[0, 0] [8, 16] [1, 1] : memref<8x16xi32> to memref<8x16xi32, strided<[16, 1]>>
      memref.copy %tile_comp0_buf0, %subview_3 : memref<8x16xi32> to memref<8x16xi32, strided<[16, 1]>>
        aie.objectfifo.release @in0_p0(Consume, 1)
        aie.objectfifo.release @in1_p0(Consume, 1)
        aie.objectfifo.release @out_p0(Produce, 1)
      }
      aie.end
    }
    %core_0_3 = aie.core(%tile_comp1) {
      %c1000 = arith.constant 0 : index
      %c1001 = arith.constant 1 : index
      %c9223372036854775807 = arith.constant 9223372036854775807 : index
      scf.for %arg0 = %c1000 to %c9223372036854775807 step %c1001 {
        %fifo0 = aie.objectfifo.acquire @in0_p1(Consume, 1) : !aie.objectfifosubview<memref<8x16xi32>>
        %local0 = aie.objectfifo.subview.access %fifo0[0] : !aie.objectfifosubview<memref<8x16xi32>> -> memref<8x16xi32>
        %fifo1 = aie.objectfifo.acquire @in1_p0(Consume, 1) : !aie.objectfifosubview<memref<16x16xi32>>
        %local1 = aie.objectfifo.subview.access %fifo1[0] : !aie.objectfifosubview<memref<16x16xi32>> -> memref<16x16xi32>
        %fifo_out = aie.objectfifo.acquire @out_p1(Produce, 1) : !aie.objectfifosubview<memref<8x16xi32>>
        %local_out = aie.objectfifo.subview.access %fifo_out[0] : !aie.objectfifosubview<memref<8x16xi32>> -> memref<8x16xi32>
      %c0_i32 = arith.constant 0 : i32
      %subview = memref.subview %local0[0, 0] [8, 16] [1, 1] : memref<8x16xi32> to memref<8x16xi32, strided<[16, 1]>>
      %c0 = arith.constant 0 : index
      %c8 = arith.constant 8 : index
      %c1 = arith.constant 1 : index
      scf.for %arg3 = %c0 to %c8 step %c1 {
        %c0_4 = arith.constant 0 : index
        %c16 = arith.constant 16 : index
        %c1_5 = arith.constant 1 : index
        scf.for %arg4 = %c0_4 to %c16 step %c1_5 {
          memref.store %c0_i32, %tile_comp1_buf0[%arg3, %arg4] : memref<8x16xi32>
        }
      }
      %c0_0 = arith.constant 0 : index
      %c8_1 = arith.constant 8 : index
      %c1_2 = arith.constant 1 : index
      scf.for %arg3 = %c0_0 to %c8_1 step %c1_2 {
        %c0_4 = arith.constant 0 : index
        %c16 = arith.constant 16 : index
        %c1_5 = arith.constant 1 : index
        scf.for %arg4 = %c0_4 to %c16 step %c1_5 {
          %c0_6 = arith.constant 0 : index
          %c16_7 = arith.constant 16 : index
          %c1_8 = arith.constant 1 : index
          scf.for %arg5 = %c0_6 to %c16_7 step %c1_8 {
            %0 = memref.load %subview[%arg3, %arg5] : memref<8x16xi32, strided<[16, 1]>>
            %1 = memref.load %local1[%arg5, %arg4] : memref<16x16xi32>
            %2 = memref.load %tile_comp1_buf0[%arg3, %arg4] : memref<8x16xi32>
            %3 = arith.muli %0, %1 : i32
            %4 = arith.addi %2, %3 : i32
            memref.store %4, %tile_comp1_buf0[%arg3, %arg4] : memref<8x16xi32>
          }
        }
      }
      %subview_3 = memref.subview %local_out[0, 0] [8, 16] [1, 1] : memref<8x16xi32> to memref<8x16xi32, strided<[16, 1]>>
      memref.copy %tile_comp1_buf0, %subview_3 : memref<8x16xi32> to memref<8x16xi32, strided<[16, 1]>>
        aie.objectfifo.release @in0_p1(Consume, 1)
        aie.objectfifo.release @in1_p0(Consume, 1)
        aie.objectfifo.release @out_p1(Produce, 1)
      }
      aie.end
    }
    aiex.runtime_sequence(%arg0: memref<16x16xi32>, %arg1: memref<16x16xi32>, %arg2: memref<16x16xi32>) {
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][1, 1, 16, 16][0, 0, 16, 1]) {id = 1 : i64, issue_token = true, metadata = @in_sh0} : memref<16x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 1, 32, 16][0, 0, 16, 1]) {id = 2 : i64, issue_token = true, metadata = @in_sh1} : memref<16x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg2[0, 0, 0, 0][1, 1, 16, 16][0, 0, 16, 1]) {id = 0 : i64, metadata = @out_sh} : memref<16x16xi32>
      aiex.npu.dma_wait {symbol = @in_sh0}
      aiex.npu.dma_wait {symbol = @in_sh1}
      aiex.npu.dma_wait {symbol = @out_sh}
    }
  }
}
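The `objectfifo.link` offsets in the MLIR above can be read as element offsets into the flattened buffer. A Python sketch (an illustration only, not the AIE runtime) of what `link [@in_sh0] -> [@in0_p0, @in0_p1]([] [0, 128])` does: a 16x16 buffer of 256 i32 elements is split row-major into two 8x16 halves for the two compute tiles.

```python
ROWS, COLS = 16, 16
buf = list(range(ROWS * COLS))     # flattened 16x16 input from the shim tile

offsets = [0, 128]                 # destination offsets from the link op
half = (ROWS // 2) * COLS          # 8x16 = 128 elements per consumer

tiles = [buf[off:off + half] for off in offsets]

assert tiles[0] == buf[:128]       # rows 0-7 flow to %tile_comp0 via @in0_p0
assert tiles[1] == buf[128:]       # rows 8-15 flow to %tile_comp1 via @in0_p1
```

The output side mirrors this: `link [@out_p0, @out_p1] -> [@out_sh]([0, 128] [])` joins the two 8x16 result halves back into one 16x16 buffer at the same offsets.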

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [IR], [Builder], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage (It would be better to provide ~2 different test cases to test the robustness of your code)
  • Code is well-documented

@chhzh123 (Member) left a comment:

Mostly looks good to me. We need more comments in your implementation to make later development easier.

@@ -221,86 +237,153 @@ def codegen_aie_mlir(mod, orig_input_args, mapping):
code += format_str("%tile_shim = aie.tile(0, 0)")
for mid in range(mem_tile_size):
code += format_str(f"%tile_mem{mid} = aie.tile({mid}, 1)")
assert len(mapping) == 1, "Only support 1D mapping for now"
pe_size = mapping[0]
# assert len(mapping) == 1, "Only support 1D mapping for now"
chhzh123 (Member):
Can it support 2D now?

EthanMeng324 (Contributor, Author):
Not yet. I think I will support it in the next PR.

chhzh123 (Member):
Since you removed the mapping argument, you should also remove the assertion here

code += format_str(
f"aie.objectfifo @in_sh{i}(%tile_shim, {{%tile_mem{i}}}, 2 : i32) : !aie.objectfifo<{orig_in_type}>"
)
linkings = [False] * len(input_args)
chhzh123 (Member):
Add comment for linkings

chhzh123 (Member):
Based on your description, a clearer name might be dist_alloc, which reflects that when set to True the memory is distributed among compute tiles, and when False it is replicated to each tile.
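The suggested `dist_alloc` flag could behave roughly as follows (a hypothetical sketch based only on the review comment; the names and the helper are not from the actual implementation):

```python
def plan_input_fifo(arg_rows, num_tiles, dist_alloc):
    """Hypothetical sketch of the dist_alloc semantics suggested above.

    dist_alloc=True  -> the argument is split into per-tile row blocks
    dist_alloc=False -> every tile receives the full argument (replicated)
    """
    if dist_alloc:
        block = arg_rows // num_tiles
        return [(t * block, (t + 1) * block) for t in range(num_tiles)]
    return [(0, arg_rows)] * num_tiles

# A (16 rows) is distributed: each of 2 tiles gets an 8-row block
assert plan_input_fifo(16, 2, dist_alloc=True) == [(0, 8), (8, 16)]
# B (16 rows) is replicated: both tiles see all 16 rows
assert plan_input_fifo(16, 2, dist_alloc=False) == [(0, 16), (0, 16)]
```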

Comment on lines 357 to 358
code += format_str("%c1000 = arith.constant 0 : index")
code += format_str("%c1001 = arith.constant 1 : index")
chhzh123 (Member):

Why use 1000 and 1001?

EthanMeng324 (Contributor, Author):

Because there would be a naming conflict if %c0 and %c1 were also used in the actual computation, so I chose two large numbers.

chhzh123 (Member):
Give a better name then. Probably %global_c0 and %global_c1

@chhzh123 (Member) left a comment:

Just some naming issues.


@chhzh123 (Member) left a comment:
LGTM. Thx!

@chhzh123 chhzh123 merged commit 2f4c197 into cornell-zhang:main Feb 20, 2025
1 check passed