Test status log
-- Sachille
func mmul(a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)) {
  for (let i = 0..3) {
    for (let j = 0..3) unroll 3 {
      c[i] := a[i*3+j] + b[i*3+j];
    }
  }
}
This example should fail: although the unrolled version can be mapped onto hardware directly, the unrolled writes to c[i] would need some muxing. The type system does not currently check that condition.
However, this example fails in the current system for a different reason: in order to index with i, the access needs to be static, as i is unrolled (pointed out by Ted in Slack). This can be seen in the tests run on the vvadd examples. Compare those with the first instance, which fails.
Therefore, it is important to note that the two indexing methods in seashell,
- physical indexing i.e.
array c[static bank integer][variable or static integer bank index]
- logical indexing i.e.
array c[logical index variable or static integer]
cannot be used interchangeably in loops.
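To make the relation between the two indexing methods concrete, here is a minimal Python sketch. It assumes cyclic (interleaved) banking, which is the partitioning these notes assume further down; the function names are hypothetical and not part of seashell.

```python
from typing import Tuple


def logical_to_physical(index: int, num_banks: int) -> Tuple[int, int]:
    """Map a logical index to (bank, offset), assuming cyclic interleaving."""
    return index % num_banks, index // num_banks


def physical_to_logical(bank: int, offset: int, num_banks: int) -> int:
    """Inverse mapping: recover the logical index from (bank, offset)."""
    return offset * num_banks + bank


# For a: float[9] bank(3), logical a[4] corresponds to physical a[1][1].
print(logical_to_physical(4, 3))  # (1, 1)
```

Under this assumption a non-banked array (one bank) always maps index i to c[0][i], which matches the static-bank form used for regular loops.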
If a loop is not unrolled, the designer has to specify the bank. This is evident when arrays are banked, but it is necessary even when they are not. Therefore, any array used in a regular loop needs static banking, i.e. c[0][i] for a non-banked array.
If a loop is unrolled, the programmer can still use explicit bank indexing, which can lead to issues such as
for i = 1..n unroll K:
  A[0][0] := i
being type checked by seashell. (Pointed out by Adrian in Slack)
This issue is avoided by using unroll in the non unrolled arrays. Edited multaccess example
The original intent of the previous example was to highlight this issue: unrolling the inner loop forces the hardware to write the same element multiple times, so HLS has to insert muxes to select the last write. This is a non-issue for array reads, however, as a single element can be fanned out to multiple compute units. We decided to restrict only writes and allow such reads in this example.
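The write restriction can be illustrated with a small Python sketch. This is not the seashell type checker, only a brute-force model under the assumption that every unrolled copy of the loop body issues its write in the same cycle; write_conflicts and its parameters are hypothetical names.

```python
from collections import Counter


def write_conflicts(write_index, seq_range, unroll_range):
    """Return {sequential iteration: elements written by more than one unrolled copy}.

    write_index(s, u) gives the element written for sequential iterator s
    and unrolled iterator u.
    """
    conflicts = {}
    for s in seq_range:
        counts = Counter(write_index(s, u) for u in unroll_range)
        duplicated = sorted(elem for elem, n in counts.items() if n > 1)
        if duplicated:
            conflicts[s] = duplicated
    return conflicts


# c[i] := ... inside "for j unroll 3": all three copies write c[i].
print(write_conflicts(lambda i, j: i, range(3), range(3)))  # {0: [0], 1: [1], 2: [2]}

# A write indexed by i*3+j touches a distinct element in every copy.
print(write_conflicts(lambda i, j: i * 3 + j, range(3), range(3)))  # {}
```

Reads are deliberately not checked here, mirroring the decision to allow fanned-out reads.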
This approach does create an inconsistency between unrolled access and non-unrolled access, as non-unrolled bank access requires even reads to be single access, as highlighted by Ted. One motivation for keeping this inconsistency is that the programmer is already aware of hardware duplication when using unroll, so implicit banking can add some more wires to propagate the read data. With explicit access, the programmer makes these multiple wires visible by using explicit local variables.
This led to the question of whether it is worthwhile to support nested loops, since a programmer should, albeit with some difficulty, be able to write the same program with a single loop. The second nested-loop issue is handling multiple loop unrolls. A single loop would lead to complex expressions within array indices, but would make type checking the loop itself easier.
But it turns out that this benefit of a simple loop would be lost when type checking the array indices. Type systems are on the dumb side and benefit from having simpler indices to operate with.
With nested loop unrolling thus handled with
- multiple access
- nested loops
- nested loop unrolling
the next issue is to handle compute-reduction (?) operations and to find a neat way to bring multi-dimensional logical arrays down to the physical arrays seashell has.
The issue with operations like += is that they require two accesses to the same memory element, one read and one write, in the same operation. In hardware terms, this requires two registers (memory elements). I also need to understand better why the implicit commutative property of +, which no longer holds in a +=, is a concern here. These operations can be handled with parallel compute followed, ultimately, by reduce operators.
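One possible reading of "parallel compute, then reduce" is sketched below in Python. It is an illustration rather than a seashell design: the unrolled copies each produce an independent partial value, and a separate pairwise (tree) reduction combines them, so no copy needs to read and write the same element. The helper names are hypothetical, and the tree order assumes the operator may be reassociated, which is exact for integers but can change rounding for floats.

```python
import operator


def unrolled_partials(a, b, i, cols):
    """Independent per-copy results for row i; models the unrolled compute units."""
    return [a[i * cols + j] + b[i * cols + j] for j in range(cols)]


def tree_reduce(values, op=operator.add):
    """Combine values pairwise, level by level, like a hardware adder tree."""
    values = list(values)
    if not values:
        raise ValueError("tree_reduce needs at least one value")
    while len(values) > 1:
        paired = [op(values[k], values[k + 1]) for k in range(0, len(values) - 1, 2)]
        if len(values) % 2:  # odd element passes through to the next level
            paired.append(values[-1])
        values = paired
    return values[0]


a = list(range(9))
b = list(range(9))
c = [tree_reduce(unrolled_partials(a, b, i, 3)) for i in range(3)]
print(c)  # [6, 24, 42]
```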
Then, concerning multi-dimensional arrays: logical dimensionality is at play for applications such as matrix multiplication or convolution, but actual hardware arrays are 1D. However, hardware optimizations such as unrolling depend on this logical dimensionality, as it dictates the access pattern and thus locality. How this translation can be done is the question. There is some further subtlety to it, which I need to refresh.
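As a starting point for that translation, the examples in these notes already flatten a 3x3 array by hand as i*3+j. A minimal Python sketch of that row-major convention follows; the function names are hypothetical, and row-major order is an assumption taken from the i*3+j indices used here.

```python
def flatten_index(i, j, cols):
    """Row-major mapping from a logical (i, j) to a 1D index."""
    return i * cols + j


def unflatten_index(k, cols):
    """Inverse mapping from a 1D index back to (i, j)."""
    return divmod(k, cols)


ROWS, COLS = 3, 3

# Unrolling the inner (j) loop walks consecutive elements ...
inner_unrolled = [flatten_index(0, j, COLS) for j in range(COLS)]
print(inner_unrolled)  # [0, 1, 2]

# ... while unrolling the outer (i) loop walks elements COLS apart.
outer_unrolled = [flatten_index(i, 0, COLS) for i in range(ROWS)]
print(outer_unrolled)  # [0, 3, 6]
```

The two strides are what make the choice of unrolled dimension interact with the banking scheme.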
We have come to the conclusion we need nested loops and also that we need to support nested loop unrolling.
a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)
for (let i = 0..3) {
  for (let j = 0..3) unroll 3 {
    c[i] := a[i*3+j] + b[i*3+j];
  }
}
The current array partitioning we assume is interleaved partitioning, cyclical like this
0 3 6 | 1 4 7 | 2 5 8
which works alright for unrolling consecutive elements in a single loop unroll.
for (let j = 0..9) unroll 3 {
  c := a[j];
}
is equivalent to
for (let j = 0..3) unroll 3 {
  c := a[j*3+0];
  c := a[j*3+1];
  c := a[j*3+2];
}
But the type system doesn't currently support an unrolling of the following nature,
for (let j = 0..3) unroll 3 {
  c := a[0*3+j];
  c := a[1*3+j];
  c := a[2*3+j];
}
which would need a banking in blocks like
0 1 2 | 3 4 5 | 6 7 8
Therefore, we need at least two types of banking (maybe three, as HLS has; see pg. 148 of UG902 (v2018.2), June 6, 2018).
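The two layouts shown above can be generated with a short Python sketch. The function names are hypothetical; the layouts themselves are the cyclic and block groupings from these notes.

```python
def cyclic_layout(size, num_banks):
    """Bank b holds elements b, b + num_banks, b + 2*num_banks, ..."""
    return [list(range(b, size, num_banks)) for b in range(num_banks)]


def block_layout(size, num_banks):
    """Bank b holds one contiguous block of elements."""
    per_bank = -(-size // num_banks)  # ceiling division
    return [
        list(range(b * per_bank, min((b + 1) * per_bank, size)))
        for b in range(num_banks)
    ]


print(cyclic_layout(9, 3))  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
print(block_layout(9, 3))   # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```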
Furthermore, considering an example like
a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)
for (let i = 0..3) {
  for (let j = 0..3) unroll 3 {
    c[i] := a[i*3+j] + b[i*3+j];
  }
}
this has further impact. Notice the arrays need to be banked cyclically for this access, validating the type check.
However, if we change the code to,
a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)
for (let i = 0..3) unroll 3 {
  for (let j = 0..3) {
    c[i] := a[i*3+j] + b[i*3+j];
  }
}
or
a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)
for (let i = 0..3) {
  for (let j = 0..3) unroll 3 {
    c[i] := a[j*3+i] + b[j*3+i];
  }
}
we need a blockwise partitioning. It also shows we need a mechanism to figure out whether the block partitioning we use is accurate.
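One way such a mechanism could work is sketched below in Python, as a brute-force check rather than a type rule. It assumes each bank can serve only one access per cycle and that all unrolled copies issue their accesses together; every name in it is hypothetical.

```python
def cyclic_bank(index, size, num_banks):
    """Bank of a logical index under cyclic (interleaved) partitioning."""
    return index % num_banks


def block_bank(index, size, num_banks):
    """Bank of a logical index under block partitioning."""
    per_bank = -(-size // num_banks)  # ceiling division
    return index // per_bank


def is_conflict_free(index_fn, seq_range, unroll_range, bank_fn, size, num_banks):
    """True if, in every sequential iteration, the unrolled copies hit distinct banks.

    index_fn(s, u) is the accessed index for sequential iterator s and
    unrolled iterator u.
    """
    for s in seq_range:
        banks = [bank_fn(index_fn(s, u), size, num_banks) for u in unroll_range]
        if len(set(banks)) != len(banks):
            return False
    return True


r = range(3)

# a[i*3+j], j unrolled: cyclic works, block does not.
print(is_conflict_free(lambda i, j: i * 3 + j, r, r, cyclic_bank, 9, 3))  # True
print(is_conflict_free(lambda i, j: i * 3 + j, r, r, block_bank, 9, 3))   # False

# a[i*3+j] with i unrolled (s = j, u = i), or equivalently a[j*3+i] with j
# unrolled: block works, cyclic does not.
print(is_conflict_free(lambda s, u: u * 3 + s, r, r, cyclic_bank, 9, 3))  # False
print(is_conflict_free(lambda s, u: u * 3 + s, r, r, block_bank, 9, 3))   # True
```

Enumerating iterations like this only works for small static bounds; a type system would need a symbolic argument over the index expression instead.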