-
Notifications
You must be signed in to change notification settings - Fork 8
Test status log
-- Sachille
func mmul(a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)) {
for (let i = 0..3) {
for (let j = 0..3) unroll 3 {
c[i] := a[i*3+j] + b[i*3+j];
}
}
}
This example should fail because even though the unrolled version can be mapped on to hardware directly, unrolling would need some muxing. But that condition is not currently checked by the type system.
However, this example fails in the current system because in order to use indexing of i, it needs to be static as i is unrolled. (Pointed out by Ted in Slack) This can be seen with the tests run on vvadd, vvadd example. Compare that with 1st instance which fails.
Therefore, it is important to note the two indexing methods in seashell, which are
- physical indexing i.e.
array c[static bank integer][variable or static integer bank index]
- logical indexing i.e.
array c[logical index variable or static integer]
cannot be used interchangeably in loops.
If a loop is not unrolled, the designer has to specify the bank. This is evident when arrays are banked. But this is necessary even when they are not. Therefore, note any array used in a regular loop needs static banking i.e. for a non banked array c[0][i]
.
If a loop is banked, programmer can still use banked arrays, which can lead to issues such as
for i = 1..n unroll K:
A[0][0] := i
being type checked by seashell. (Pointed out by Adrian in Slack)
This issue is avoided by using unroll in the non unrolled arrays. Edited multaccess example
The original expectation of the previous example is to highlight this issue, where unrolling the inner loop will force hardware to access the same element multiple times as write, forcing HLS to put some muxes to select the last access to write. However, this is a non-issue for array reads, as a single element can be fanned out to multiple computes. We arrived at a decision to restrict only writes and allow such reads in this example.
This approach does create in inconsistency between unrolled access and non-unrolled access, as non-unrolled bank access requires even reads to be single access as highlighted by Ted here, inconsistency. One motivation to maintain this inconsistency would be, programmer already aware of hardware duplication when using unroll, which allows implicit banking to add some more wires to propagate the read data. But in explicit access, programmer is aware of this multiple wires by using explicit local variables.
This led to the question, is it worthwhile to support nested loops, as a programmer, albeit with some difficulty, should be able to write the same program with one loop structure. The second nested loop issue being, handling multiple loop unrolls. This would lead to complex expressions within array indices, but would make type checking a single loop easier.
But it turns out, this benefit in terms of simple loop would be lost when type checking the array indices. Type systems are on the dumb side and would benefit from having simpler indices to operate with.
With nested loop unrolling thus handled with
- multiple access
- nested loops
- nested loop unrolling
the next issue is to handle compute-reduction (?) operations and a neat way to handle multi-dimensional logical arrays to bring them down to the physical arrays seashell has.
Issue with operations like += is that these require two access to the same memory element both as read and as write in the same operation. In hardware terms, this requires two registers (memory elements). I also need to understand more about why the implicit commutative property of + which is no longer followed in a += be a concern here. These operations can be handled with parallel compute and ultimately reduce method operators.
Then, concerning multi-dimensional arrays, logical dimensionality is at play for applications such as matrix multiplication or convolution, but actual hardware arrays are 1D. However, hardware optimizations such as unrolling depend on this logical dimensionality as they dictate the access pattern and thus better locality. How this translation can be done is the question. It has some subtlety to it beyond that, which I need to refresh.
We have come to the conclusion we need nested loops and also that we need to support nested loop unrolling.
a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)
for (let i = 0..3) {
for (let j = 0..3) unroll 3 {
c[i] := a[i*3+j] + b[i*3+j];
}
}
The current array partitioning we assume is interleaved partitioning, cyclical like this
0 3 6 | 1 4 7 | 2 5 8 |
---|
, which works alright for unrolling consecutive elements in single loop unroll.
for (let j = 0..9) unroll 3 {
c:= a[j];
}
is equivalent to
for (let j = 0..3) unroll 3 {
c:= a[j*3+0];
c:= a[j*3+1];
c:= a[j*3+2];
}
But the type system doesn't currently support an unrolling of the following nature,
for (let j = 0..3) unroll 3 {
c:= a[0*3+j];
c:= a[1*3+j];
c:= a[2*3+j];
}
which would need a banking in blocks like
0 1 2 | 3 4 5 | 6 7 8 |
---|
Therefore, we need at least two types of banking (maybe 3 as HLS has, pg. 148 UG902 (v2018.2) June 6, 2018).
Furthermore, considering an example like
a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)
for (let i = 0..3) {
for (let j = 0..3) unroll 3 {
c[i] := a[i*3+j] + b[i*3+j];
}
}
this has further impact. Notice the arrays need to be banked cyclically for this access, validating the type check.
However, if we change the code to,
a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)
for (let i = 0..3) unroll 3 {
for (let j = 0..3) {
c[i] := a[i*3+j] + b[i*3+j];
}
}
or
a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)
for (let i = 0..3) {
for (let j = 0..3) unroll 3 {
c[i] := a[j*3+i] + b[j*3+i];
}
}
we need a blockwise partitioning. It also shows we need a mechanism to figure out whether the block partitioning we use is accurate.
Logically multi arrays are extremely useful to FPGA apps. Simple 1D apps are usually trivial enough for GP-CPUs to outperform FPGAs. But CPUs struggle when it comes to complex memory accesses. FPGAs, with the capability to bank memories as they please, allow exploiting spatial locality. Also, it is simpler to index an array as [i][j]
than always be computing the correct 1D index such as [i*3+j]
.
However, this representation must be robust enough to handle arbitrary number of dimensions (i.e. map to hardware), and also maintain capabilities we gave for the 1D variant.
In order to capture these requirements, I am considering an example of a 3D array with partial reordering across two dimensions.
a: float[32] bank(4);
for (let n = 0..1) {
for (let i = 0..3) unroll 2 {
for (let j = 0..3) unroll 2 {
access(a[ n*16 + i*4 + j ]);
}
}
}
The parameters here are
-
array size- p1
-
banks- p2
-
number of loop elements- p3, p5, p7
-
largest loop element- p9, p10, p11
-
unroll factors- p4, p6, p8
-
intermediate index multiplier- v
-
largest index multiplier- w
Array mapping is determined by p1,v, w and p4,p6,p8
Possible array definitions are,
# current version, declaring physical parameters
a: float [32] bank(4);
# new version with logical parameters
a: float [2]:[4]bank(2):[4]bank(2);
Additional parameters for the possible new version,
- dimension sizes- d1,d2,d3
- dimension banks- b1,b2,b3
We'll consider n to be the frames, therefore we have two frames each of size 16. These frames would look like,
n=0 | n=1 |
---|---|
0 1 2 3 | 0 1 2 3 |
4 5 6 7 | 4 5 6 7 |
8 9 10 11 | 8 9 10 11 |
12 13 14 15 | 12 13 14 15 |
Default mapping for the three loop program without unrolling would be,
00 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 01 11 21 31 41 51 61 71 81 91 101 111 121 131 141 151 |
---|
Considering b3 only, we know it needs interleaving. To be more general, we need to partition into two in columns (j is the lowest loop, so interleaved by v/b3= 2), leading to a memory like this,
00 10 40 50 80 90 120 130 01 11 41 51 81 91 121 131 | 20 30 60 70 100 110 140 150 21 31 61 71 101 111 141 151 |
---|
Considering b2 only, i.e. blockwise partitioning, we need to partition in terms of rows like (interleaved by w/b2= 8),
00 10 20 30 40 50 60 70 01 11 21 31 41 51 61 71 | 80 90 100 110 120 130 140 150 81 91 101 111 121 131 141 151 |
---|
Adding both unroll factors lead to needing 4 banks, leading to a memory (first interleave by 8 to two banks, then interleave each by 2),
00 10 40 50 01 11 41 51 | 20 30 60 70 21 31 61 71 | 80 90 120 130 81 91 121 131 | 100 110 140 150 101 111 141 151 |
---|
Now we have the physical mapping and the parameters to do that, 1st decision would be how to extract these parameters from user program.
- b1, b2 and b3 are provided bank factors
- v= (p7 or) d3
- w= (p5*p7 or) d2*d3
For now, we feel the new version is clear as the programmer gets to explicitly say the banking they need.
Next decision we need is how to represent implicit and explicit array declarations with multiD arrays.
For the sake of visualization, this is how explicitly unrolled version would look like with a funky explicit bank in each dimension.
a: float[32] bank(4);
for (let n' = 0..1) {
for (let i' = 0..1) {
for (let j' = 0..1) {
access( a [[0][n']] [[0][i']] [[0][j']] );
}
}
}
However, this is a horrible explicit representation,
- first of all it's not explicit, actual hardware has just 1D arrays as we illustrated earlier
- we need banks for each dimension as definition doesn't tell us which dimension to partition in
- it's super ugly and confusing
Alternate version with new array declaration is,
a: float [2]:[4]bank(2):[4]bank(2);
for (let n' = 0..1) {
for (let i' = 0..1) {
for (let j' = 0..1) {
access( a [[0]][n'*4 + i'*2 + j'] );
}
}
}
This is a better hardware representation, but very difficult for a programmer to figure out which bank and which element. But we would not want programmers to really use this explicit version, unless for really fine control. Therefore, this seems a good enough trade-off. (The above example uses [[..]]
for banks, which is only a temporary symbol. Other suggestions for explicit banking include {..}{..}
and why not {..}[..]
)
Therefore, we think maintaining this explicit access is prudent.
Concluding this note, it is of use to jot down translation from implicit access to physical access in HLS for C code generation and explicit access in seashell for type checking.
a: float [d1]:[d2]bank(b2):[d3]bank(b3);
for (let n = 0..1) {
for (let i = 0..3) unroll p6 {
for (let j = 0..3) unroll p8 {
access(a[n][i][j]);
}
}
}
Implicit access a[n][i][j]
will be converted to,
physical access a[n*d3*d2 + i*d3 + j]
explicit access a{(i%b2)*b3 + j%b3}[n*(d3/b3)*(d2/b2) + (i/b2)*d3/b3 + j/b3]
(need to verify)
- I went through the example with the assumption
b1=p4
,b2=p6
andb3=p8
. But it seems to me I needn't worry about the unroll factors but use the banking factor for partitioning and translation. Unrolling factor should just be 'safe' from a hardware perspective. Example is altered accordingly.
This probably needs more thought, but it should be valid to partition an array more than it's unroll factor, i.e. unrolling should probably be less than the bank factor as each unroll would need a dedicated RW port. A possible example for this,
a: float [8] bank(2);
for (let i = 0..3) {
access(a[i] + a[i+4]);
}
unrolling this would be independent of this access,
a: float [8] bank(8);
for (let i = 0..3) unroll 4 {
access(a[i] + a[i+4]);
}
A version like this could also be similarly mapped to hardware, but I need to check how smart the compiler can be to do it without any muxing,
a: float [10] bank(2);
for (let i = 0..7) {
access(a[i] + a[i+1]);
}
Note array a
only has 9 elements but I'm specifying it larger to have a symmetric partition for now.
An attempt to humor an example with more unroll than banking required,
a: float [4], b: float [4];
for (let i = 0..7) unroll 2 {
if (i%2==0)
access(a[i/2]);
else
access(b[i/2]);
}
This is a very bad example, for one we don't need an if condition here, and even then it is possible unroll 2 would reflect in a different array banking which will be needed to make this useful. Disregarding the fact that this is a bad program, this example is only to consider a valid design where unroll > banking.
Summarizing notes on slack from Adrian, the key idea of a MUX/DEMUX is to accept a dynamic signal as a select and use it to forward a selected data line (Multiplex) or forward data to a selected output (Demultiplex). A general representation of it, taken from the slack thread is, function mymux(i: int) { return mux(i) {expr1}{expr2}{expr3} }
.
However, it is handy to come up with a version which could access an arbitrary memory bank, essentially in an expression like A[i]
where A
is partitioned in terms of i
. We can write this as, (again from slack thread)
# writing in general representation
function Am(i: int) { return mux(i) {A[0]}{A[1]}{A[2]}{A[3]} }
# simplify with the prior knowledge of # of banks
function Am(i: int) { return mux(i) {A[i]} }
# simplifying this to a banked array template
mux n Am(A);
One thing to consider regardless of the general or array specific version is how to use this as a mux and demux. Adrian suggests muxes to be used both as 'lvalues' and 'rvalues'. lvalues and rvalues. This might result in following expressions being correcly identified as a mux or a demux.
a: float[10] bank(5);
mux 5 Am(A)
mux 5 Adm(A)
for (let i = 0..4) {
for (let j = 0..1) {
# mux on A selects the correct input and send only that to c
c = Am(A[i][j]);
d = c*10 +7;
# demux on A takes the input and selects the correct output to go to
Adm(A[i][j]) = d;
}
}
We already considered one example of muxes with arbitrary access to banks. Another common example is if conditions.
Linear search
a: float[100] bank(5);
let float b = 20;
let out = 100;
for (let i = 0..99) {
if(A[i]==b) {
out := i;
i := 100;
}
}
Binary search
a: float[100] bank(5);
let float b = 20;
let L = 0;
let R= 99;
let i= 0;
let out= 100;
while (L<=R){
i:= (L+R)/2;
if(A[i]<b)
L:= i+1;
else if(A[i]>b)
R:= i-1;
else
out= i;
}
After getting to know a type system can be aware of lvalues and rvalues, I'm curious if this can be used to treat memory reads and writes differently. Adrian said on slack, 'maybe the type of writable memories could be “consumed” from the context differently depending on whether it’s used as an lvaue or an rvalue'.
Assumption for this note is banking is only done in cyclic interleaving.
Let's consider the baseline valid design (we can access multiple banks in parallel),
a: float[9] bank(3); c: float[9] bank(3);
for (let i = 0..3) {
c[0][i]:= a[0][i];
c[1][i]:= a[1][i];
c[2][i]:= a[2][i];
}
We cannot access same bank in parallel due to resource limitation,
for read
a: float[9] bank(3); c: float[9] bank(3);
for (let i = 0..3) {
c[0][i]:= a[i][0];
c[1][i]:= a[i][1];
c[2][i]:= a[i][2];
}
and for write
a: float[9] bank(3); c: float[9] bank(3);
for (let i = 0..3) {
c[i][0]:= a[0][i];
c[i][1]:= a[1][i];
c[i][2]:= a[2][i];
}
We can access same location in a bank for read, it's just more fanout
a: float[9] bank(3); c: float[9] bank(3);
for (let i = 0..3) {
c[0][i]:= a[i][0];
c[1][i]:= a[i][0];
c[2][i]:= a[i][0];
}
or
a: float[9] bank(3); c: float[9] bank(3);
for (let i = 0..3) {
c[0][i]:= a[0][i];
c[1][i]:= a[0][i];
c[2][i]:= a[0][i];
}
But we cannot access the same element for write, because it is physically impossible to do that, (therefore, hardware would put some Muxes and pick which would go in)
a: float[9] bank(3); c: float[9] bank(3);
for (let i = 0..3) {
c[i][0]:= a[0][i];
c[i][0]:= a[1][i];
c[i][0]:= a[2][i];
}
or
a: float[9] bank(3); c: float[9] bank(3);
for (let i = 0..3) {
c[0][i]:= a[0][i];
c[0][i]:= a[1][i];
c[0][i]:= a[2][i];
}