Test status log

2018 July 11th

-- Sachille

Things to note when accessing arrays in Seashell

func mmul(a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)) {

  for (let i = 0..3) {
    for (let j = 0..3) unroll 3 {
      c[i] := a[i*3+j] + b[i*3+j];
    }
  }

}

This example should fail because even though the unrolled version can be mapped on to hardware directly, unrolling would need some muxing. But that condition is not currently checked by the type system.

However, this example fails in the current system because in order to use indexing of i, it needs to be static as i is unrolled. (Pointed out by Ted in Slack) This can be seen with the tests run on vvadd, vvadd example. Compare that with 1st instance which fails.

Therefore, it is important to note the two indexing methods in seashell, which are

physical indexing i.e. array c[static bank integer][variable or static integer bank index]
logical indexing i.e. array c[logical index variable or static integer] cannot be used interchangeably in loops.

If a loop is not unrolled, the designer has to specify the bank. This is evident when arrays are banked. But this is necessary even when they are not. Therefore, note any array used in a regular loop needs static banking i.e. for a non banked array c[0][i].

If a loop is banked, programmer can still use banked arrays, which can lead to issues such as

for i = 1..n unroll K:
  A[0][0] := i

being type checked by seashell. (Pointed out by Adrian in Slack)

This issue is avoided by using unroll in the non unrolled arrays. Edited multaccess example

How to deal with multiple write access issues?

The original expectation of the previous example is to highlight this issue, where unrolling the inner loop will force hardware to access the same element multiple times as write, forcing HLS to put some muxes to select the last access to write. However, this is a non-issue for array reads, as a single element can be fanned out to multiple computes. We arrived at a decision to restrict only writes and allow such reads in this example.

2018 July 12th

Inconsistency between explicit and implicit access in terms of multiple read access

This approach does create in inconsistency between unrolled access and non-unrolled access, as non-unrolled bank access requires even reads to be single access as highlighted by Ted here, inconsistency. One motivation to maintain this inconsistency would be, programmer already aware of hardware duplication when using unroll, which allows implicit banking to add some more wires to propagate the read data. But in explicit access, programmer is aware of this multiple wires by using explicit local variables.

Why do we need nested loops?

This led to the question, is it worthwhile to support nested loops, as a programmer, albeit with some difficulty, should be able to write the same program with one loop structure. The second nested loop issue being, handling multiple loop unrolls. This would lead to complex expressions within array indices, but would make type checking a single loop easier.

But it turns out, this benefit in terms of simple loop would be lost when type checking the array indices. Type systems are on the dumb side and would benefit from having simpler indices to operate with.

How to deal with multi dimensional arrays in seashell

With nested loop unrolling thus handled with

multiple access
nested loops
nested loop unrolling

the next issue is to handle compute-reduction (?) operations and a neat way to handle multi-dimensional logical arrays to bring them down to the physical arrays seashell has.

Issue with operations like += is that these require two access to the same memory element both as read and as write in the same operation. In hardware terms, this requires two registers (memory elements). I also need to understand more about why the implicit commutative property of + which is no longer followed in a += be a concern here. These operations can be handled with parallel compute and ultimately reduce method operators.

Then, concerning multi-dimensional arrays, logical dimensionality is at play for applications such as matrix multiplication or convolution, but actual hardware arrays are 1D. However, hardware optimizations such as unrolling depend on this logical dimensionality as they dictate the access pattern and thus better locality. How this translation can be done is the question. It has some subtlety to it beyond that, which I need to refresh.

2018 July 16th

Array partitioning seems to need different types of partitioning

We have come to the conclusion we need nested loops and also that we need to support nested loop unrolling.

a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)

for (let i = 0..3) {
  for (let j = 0..3) unroll 3 {
    c[i] := a[i*3+j] + b[i*3+j];
  }
}

The current array partitioning we assume is interleaved partitioning, cyclical like this

0 3 6	1 4 7	2 5 8

, which works alright for unrolling consecutive elements in single loop unroll.

for (let j = 0..9) unroll 3 {
  c:= a[j];
}

is equivalent to

for (let j = 0..3) unroll 3 {
  c:= a[j*3+0];
  c:= a[j*3+1];
  c:= a[j*3+2];
}

But the type system doesn't currently support an unrolling of the following nature,

for (let j = 0..3) unroll 3 {
  c:= a[0*3+j];
  c:= a[1*3+j];
  c:= a[2*3+j];
}

which would need a banking in blocks like

0 1 2	3 4 5	6 7 8

Therefore, we need at least two types of banking (maybe 3 as HLS has, pg. 148 UG902 (v2018.2) June 6, 2018).

Furthermore, considering an example like

a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)

for (let i = 0..3) {
  for (let j = 0..3) unroll 3 {
    c[i] := a[i*3+j] + b[i*3+j];
  }
}

this has further impact. Notice the arrays need to be banked cyclically for this access, validating the type check.

However, if we change the code to,

a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)

for (let i = 0..3) unroll 3 {
  for (let j = 0..3) {
    c[i] := a[i*3+j] + b[i*3+j];
  }
}

or

a: float[9] bank(3), b: float[9] bank(3), c: float[3] bank(1)

for (let i = 0..3) {
  for (let j = 0..3) unroll 3 {
    c[i] := a[j*3+i] + b[j*3+i];
  }
}

we need a blockwise partitioning. It also shows we need a mechanism to figure out whether the block partitioning we use is accurate.

Need a coherent method to access logically multi-D arrays

Logically multi arrays are extremely useful to FPGA apps. Simple 1D apps are usually trivial enough for GP-CPUs to outperform FPGAs. But CPUs struggle when it comes to complex memory accesses. FPGAs, with the capability to bank memories as they please, allow exploiting spatial locality. Also, it is simpler to index an array as [i][j] than always be computing the correct 1D index such as [i*3+j].

However, this representation must be robust enough to handle arbitrary number of dimensions (i.e. map to hardware), and also maintain capabilities we gave for the 1D variant.

In order to capture these requirements, I am considering an example of a 3D array with partial reordering across two dimensions.

a: float[32] bank(4);

for (let n = 0..1) {
  for (let i = 0..3) unroll 2 {
    for (let j = 0..3) unroll 2 {
      access(a[ n*16 + i*4 + j ]);
    }
  }
}

The parameters here are

array size- p1
banks- p2
number of loop elements- p3, p5, p7
largest loop element- p9, p10, p11
unroll factors- p4, p6, p8
intermediate index multiplier- v
largest index multiplier- w

Array mapping is determined by p1,v, w and p4,p6,p8

Possible array definitions are,

# current version, declaring physical parameters
a: float [32] bank(4);

# new version with logical parameters
a: float [2]:[4]bank(2):[4]bank(2);

Additional parameters for the possible new version,

dimension sizes- d1,d2,d3
dimension banks- b1,b2,b3

We'll consider n to be the frames, therefore we have two frames each of size 16. These frames would look like,

n=0	n=1
0 1 2 3	0 1 2 3
4 5 6 7	4 5 6 7
8 9 10 11	8 9 10 11
12 13 14 15	12 13 14 15

Default mapping for the three loop program without unrolling would be,

0₀ 1₀ 2₀ 3₀ 4₀ 5₀ 6₀ 7₀ 8₀ 9₀ 10₀ 11₀ 12₀ 13₀ 14₀ 15₀ 0₁ 1₁ 2₁ 3₁ 4₁ 5₁ 6₁ 7₁ 8₁ 9₁ 10₁ 11₁ 12₁ 13₁ 14₁ 15₁

Considering b3 only, we know it needs interleaving. To be more general, we need to partition into two in columns (j is the lowest loop, so interleaved by v/b3= 2), leading to a memory like this,

0₀ 1₀ 4₀ 5₀ 8₀ 9₀ 12₀ 13₀ 0₁ 1₁ 4₁ 5₁ 8₁ 9₁ 12₁ 13₁	2₀ 3₀ 6₀ 7₀ 10₀ 11₀ 14₀ 15₀ 2₁ 3₁ 6₁ 7₁ 10₁ 11₁ 14₁ 15₁

Considering b2 only, i.e. blockwise partitioning, we need to partition in terms of rows like (interleaved by w/b2= 8),

0₀ 1₀ 2₀ 3₀ 4₀ 5₀ 6₀ 7₀ 0₁ 1₁ 2₁ 3₁ 4₁ 5₁ 6₁ 7₁	8₀ 9₀ 10₀ 11₀ 12₀ 13₀ 14₀ 15₀ 8₁ 9₁ 10₁ 11₁ 12₁ 13₁ 14₁ 15₁

Adding both unroll factors lead to needing 4 banks, leading to a memory (first interleave by 8 to two banks, then interleave each by 2),

0₀ 1₀ 4₀ 5₀ 0₁ 1₁ 4₁ 5₁	2₀ 3₀ 6₀ 7₀ 2₁ 3₁ 6₁ 7₁	8₀ 9₀ 12₀ 13₀ 8₁ 9₁ 12₁ 13₁	10₀ 11₀ 14₀ 15₀ 10₁ 11₁ 14₁ 15₁

Now we have the physical mapping and the parameters to do that, 1st decision would be how to extract these parameters from user program.

b1, b2 and b3 are provided bank factors
v= (p7 or) d3
w= (p5*p7 or) d2*d3

For now, we feel the new version is clear as the programmer gets to explicitly say the banking they need.

Next decision we need is how to represent implicit and explicit array declarations with multiD arrays.

For the sake of visualization, this is how explicitly unrolled version would look like with a funky explicit bank in each dimension.

a: float[32] bank(4);

for (let n' = 0..1) {
  for (let i' = 0..1) {
    for (let j' = 0..1) {
      access( a [[0][n']] [[0][i']] [[0][j']] );
    }
  }
}

However, this is a horrible explicit representation,

first of all it's not explicit, actual hardware has just 1D arrays as we illustrated earlier
we need banks for each dimension as definition doesn't tell us which dimension to partition in
it's super ugly and confusing

Alternate version with new array declaration is,

a: float [2]:[4]bank(2):[4]bank(2);  

for (let n' = 0..1) {
  for (let i' = 0..1) {
    for (let j' = 0..1) {
      access( a [[0]][n'*4 + i'*2 + j'] );
    }
  }
}

This is a better hardware representation, but very difficult for a programmer to figure out which bank and which element. But we would not want programmers to really use this explicit version, unless for really fine control. Therefore, this seems a good enough trade-off. (The above example uses [[..]] for banks, which is only a temporary symbol. Other suggestions for explicit banking include {..}{..} and why not {..}[..])

Therefore, we think maintaining this explicit access is prudent.

Concluding this note, it is of use to jot down translation from implicit access to physical access in HLS for C code generation and explicit access in seashell for type checking.

a: float [d1]:[d2]bank(b2):[d3]bank(b3); 

for (let n = 0..1) {
  for (let i = 0..3) unroll p6 {
    for (let j = 0..3) unroll p8 {
      access(a[n][i][j]);
    }
  }
}

Implicit access a[n][i][j] will be converted to,
physical access a[n*d3*d2 + i*d3 + j]
explicit access a{(i%b2)*b3 + j%b3}[n*(d3/b3)*(d2/b2) + (i/b2)*d3/b3 + j/b3] (need to verify)

I went through the example with the assumption b1=p4, b2=p6 and b3=p8. But it seems to me I needn't worry about the unroll factors but use the banking factor for partitioning and translation. Unrolling factor should just be 'safe' from a hardware perspective. Example is altered accordingly.

2018 July 18th

Examples where unroll factor can deviate from array partitioning

This probably needs more thought, but it should be valid to partition an array more than it's unroll factor, i.e. unrolling should probably be less than the bank factor as each unroll would need a dedicated RW port. A possible example for this,

a: float [8] bank(2); 
for (let i = 0..3) {
    access(a[i] + a[i+4]);
}

unrolling this would be independent of this access,

a: float [8] bank(8); 
for (let i = 0..3) unroll 4 {
    access(a[i] + a[i+4]);
}

A version like this could also be similarly mapped to hardware, but I need to check how smart the compiler can be to do it without any muxing,

a: float [10] bank(2); 
for (let i = 0..7) {
    access(a[i] + a[i+1]);
}

Note array a only has 9 elements but I'm specifying it larger to have a symmetric partition for now.

An attempt to humor an example with more unroll than banking required,

a: float [4], b: float [4]; 
for (let i = 0..7) unroll 2 {
  if (i%2==0)
    access(a[i/2]);
  else 
    access(b[i/2]);
}

This is a very bad example, for one we don't need an if condition here, and even then it is possible unroll 2 would reflect in a different array banking which will be needed to make this useful. Disregarding this is a bad program, this example is only to consider a valid design where unroll > banking.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly