
Test Flow Status (Archived)

Theodore Bauer edited this page Aug 8, 2018 · 1 revision

This page is a sequence of language and compiler goals on the way to a milestone where we can compile and run a simple convolution benchmark. In fact, our goal is somewhat loftier than just compiling a single function: we want to be able to explore a reasonable design space for each program by twiddling certain parameters to get a high-performance hardware design. The work is divided into three stages: two smaller benchmarks first, then the full convolution example. (The latter two stages are empty at the moment; we'll fill them out when we get to them.)

The primary language features we're focusing on here are memory banking and unrolling.

Stage 1: Vector/Scalar Add and Vector/Vector Add

The first goal in this stage is to run the vsadd example, which has both banking and unrolling. It should then be straightforward to run the similar vvadd. As an intermediate goal, we'll run a version of vsadd without unrolling.

  • Generate valid C output.

    We can generate reasonable C code from the non-unrolled code that seems to produce the right output. 🎉

  • Functions.

    To run the C code through the SDSoC compiler, we need to link together C code written for the host (i.e., the CPU) and HLS-ified C code for the FPGA kernel. The current plan is to provide manually-written host code in a separate file. The Seashell compiler needs to generate code inside of proper functions (with arguments and names that aren't main) that this hand-written code can invoke.

    Here's what we'll need to do:

    • Change the Seashell syntax to allow code written inside C-like function definitions.
    • These functions should have parameters and return types. (Edit: we decided for now not to have return statements.)
    • Add function call expressions so functions can call each other. (But probably not themselves---there's no need to handle recursion at the moment.)
  • Type variables.

    Numeric data types can have a surprisingly large impact on hardware efficiency: a design that operates on 32-bit numbers might be up to twice as big as one that operates on 16-bit numbers, for example. So it's important to have an easy way to switch a program between different numerical representations. (The idea is that the programmer might try a few different number types, running the program on test inputs each time to see if the precision is good enough, and then select the smallest size that yields good overall output accuracy.)

    From a language design perspective, this is a surprisingly deep topic. Eventually, we probably want to add parametric polymorphism, so every function (or every module) can be instantiated with a different (set of) types depending on how it's used. For example, there might be a polymorphic vsadd function where a program could use both vsadd<int16> and vsadd<int32> to get vector/scalar adds for different data types. To make things more complicated, we'll need to handle relative data sizes: for example, we might want to say "this variable is twice the width of that variable" or "this variable is one bit wider than that variable."

    For now, let's keep it simple and add type aliases, which are similar to C's typedef statement. Seashell will support syntax like type <id> = <type>, which creates an alias for <type> named <id>. This way, a program can use something like:

    type NumT = int32;
    

    and then use NumT everywhere to declare variables. Then changing a single line of code adjusts the entire design to work on a different data type.

    Here's our to-do list:

    • Add type alias statements to the language.
    • Adjust the type checker to work with type aliases the same way as with concrete types.
    • Eliminate type aliases in the compiler: i.e., the generated code should have concrete types.
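
    As a point of comparison (an analogy with C's typedef, not actual Seashell output), the alias mechanism might behave like this sketch; the names NumT and vsadd_elem are illustrative assumptions:

    ```c
    #include <assert.h>
    #include <stdint.h>

    // Analogy with C's typedef: the whole design switches data types
    // by editing this one line (e.g., int32_t -> int16_t).
    typedef int32_t NumT;

    // Every declaration uses the alias instead of a concrete type.
    NumT vsadd_elem(NumT a, NumT b) {
        return a + b;
    }
    ```

    After the alias-elimination pass, the generated code would use the concrete type directly wherever NumT appears.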

Minor debugging needed for non-unrolled vsadd example

Current result

void madd(int *a, b, int *c) {  
        for (int i = 0; i <= 1; i += 1) {  
                c[0 + 5*(i)] = a[0 + 5*(i)]+b;  
                c[1 + 5*(i)] = a[1 + 5*(i)]+b;  
        }  
}  
  • Data type for non-array arguments (int b above)

  • Possibly use float data type

  • Insert array partition pragma

  • Multiplying factor should reflect loop banking factor

    i.e., if looping with a banking factor of 2, the access can be arr[0*ArraySize/Banking + i] or arr[0 + Banking*i], depending on the interleaving. This is accurate for the above example, but when I change the banking factor to 4, the multiplier doesn't change to, say, 2 (assuming the array size is 10). But fixing this probably requires knowing ArraySize?

  • Some method to input array size?
    ArraySize already exists in seashell.
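
Putting these fixes together, the corrected output might look something like this sketch. It assumes ArraySize = 10, a banking factor of 2, and interleaved banking; the pragma placement mirrors the commented examples below and is a guess:

```c
#include <assert.h>

// Sketch of corrected output for the non-unrolled vsadd example.
// Assumes ArraySize = 10 and an interleaved banking factor of 2,
// so bank k holds elements k, k+2, k+4, ...
void madd(float *a, float b, float *c) {
    // #pragma HLS ARRAY_PARTITION variable=a factor=2
    // #pragma HLS ARRAY_PARTITION variable=c factor=2
    for (int i = 0; i <= 4; i += 1) {
        c[0 + 2*i] = a[0 + 2*i] + b;  // bank 0
        c[1 + 2*i] = a[1 + 2*i] + b;  // bank 1
    }
}
```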

Design decision to make

  • Specifying array size in seashell

There are two designs accepted by HLS:

  1. Using size in arguments
void madd(float a[1024], float b[1024], float c[1024]) {
//      #pragma HLS ARRAY_PARTITION variable=a factor=32
//      #pragma HLS ARRAY_PARTITION variable=b factor=32
//      #pragma HLS ARRAY_PARTITION variable=c factor=32
        for (int i = 0; i <= 1023; i += 1) {
//              #pragma HLS UNROLL factor=32
                c[i] = a[i]+b[i];
        }
}
  2. Using pragmas
#pragma SDS data copy(a[0:1024],b[0:1024],c[0:1024])
#pragma SDS data access_pattern(a:SEQUENTIAL,b:SEQUENTIAL,c:SEQUENTIAL)
void madd(float *a, float *b, float *c) {
//      #pragma HLS ARRAY_PARTITION variable=a factor=32
//      #pragma HLS ARRAY_PARTITION variable=b factor=32
//      #pragma HLS ARRAY_PARTITION variable=c factor=32
        for (int i = 0; i <= 1023; i += 1) {
//              #pragma HLS UNROLL factor=32
                c[i] = a[i]+b[i];
        }
}

However, these two designs oddly behave differently. The closest I could get to the first design using the second approach is:

#pragma SDS data access_pattern(a:SEQUENTIAL,b:SEQUENTIAL,c:SEQUENTIAL)
void madd(float a[1024], float b[1024], float c[1024]) {
//      #pragma HLS ARRAY_PARTITION variable=a factor=32
//      #pragma HLS ARRAY_PARTITION variable=b factor=32
//      #pragma HLS ARRAY_PARTITION variable=c factor=32
        for (int i = 0; i <= 1023; i += 1) {
//              #pragma HLS UNROLL factor=32
                c[i] = a[i]+b[i];
        }
}

And a design with a RANDOM access pattern using pointers fails both with and without the data copy pragma.

I also think the design we need is the RANDOM data access pattern, to simulate data explicitly being copied over. pragma list

Thus, a prudent decision for now might be to use the first design.
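
If we do adopt the first design point, the hand-written host code might invoke the kernel like this sketch (run_host is a hypothetical name; on real SDSoC hardware the buffers would come from sds_alloc rather than plain arrays):

```c
#include <assert.h>

// Kernel signature from the first design point (sized array arguments).
void madd(float a[1024], float b[1024], float c[1024]) {
    for (int i = 0; i <= 1023; i += 1) {
        c[i] = a[i] + b[i];
    }
}

// Sketch of hand-written host code: fill the input buffers, then
// invoke the kernel as an ordinary function call.
void run_host(void) {
    static float a[1024], b[1024], c[1024];
    for (int i = 0; i < 1024; i++) {
        a[i] = (float)i;
        b[i] = 2.0f;
    }
    madd(a, b, c);
    assert(c[10] == 12.0f);
}
```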

At this point, we should be able to support a fairly satisfying version of the non-unrolled vsadd example. The next step will be to refine our support for unrolled loops:

  • Support loop unrolling with an arbitrary factor.

    Generate the proper #pragmas to unroll loops. Perhaps this is already working? I'm not sure.

  • Automated test flow for vs/vvadd example DSE (Sachille)

  • Avoid an off-by-one error.

    Sachille noticed that there might be a bug in the loop bounds in the generated C code: the end should perhaps be n (the size of the array) instead of n-1?

    I think it's not a bug, but something I mistook in the semantics. In Seashell we write a for loop as for (let i = 0..N). I felt this would intuitively mean i = {0, 1, .., N-1, N}, but only later realized that it means i = {0, 1, .., N-1}.

    edit: it is now i = {0, 1, .., N-1, N}.

  • Emit simple array indices in generated code.

    You can see in the current output from the tool that the code generator is currently emitting explicit array indexing expressions like c[0+2*i]. In unrolled loops, we can rely on the backend HLS tool to do the indexing for us, so the generated indexing expressions can look very simple, like c[i].
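
Concretely, the simplified output for an unrolled loop might look like this sketch (an unroll factor of 2 and the loop bound are assumed for illustration):

```c
#include <assert.h>

// Sketch: instead of emitting explicit banked indices like c[0 + 2*i]
// and c[1 + 2*i], emit a plain loop with simple indices and let the
// backend HLS tool's unroll pragma take care of the indexing.
void madd(float *a, float b, float *c) {
    for (int i = 0; i <= 9; i += 1) {
        // #pragma HLS UNROLL factor=2
        c[i] = a[i] + b;
    }
}
```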

To finish off this sub-milestone, we will deal with streaming provisionally by sidestepping it:

  • Generate #pragmas that disable streaming.

    HLS tools are very eager to set up hardware that streams data into the accelerator instead of loading it all at once. To isolate the effects of streaming vs. other hardware changes, we would like to limit this eagerness so we can be sure we get a non-streamed design (i.e., one that loads the vector arguments into on-chip memories all at once).

    SDx creates a non-streaming interface by default. The documentation implies that the RAM interface is not designed for streaming and that an array is given a RAM interface by default.

    We hope this should be easy:

    • Sachille needs to figure out which magical pragmas we can generate to get non-streaming access.
      The task has now become finding the magical pragmas for streaming, since streaming has to be explicit. This seems to involve several stages: a streaming system port can be specified, and the data mover can be explicitly made a FIFO. Some tests with the latter show the design streaming one datum at a time. More tests are needed to figure out how to stream larger blocks.
    • Once we have that, Ted can generate this pragma by default.
      To the best of my (Sachille's) understanding, generating no pragmas yields a non-streaming design, so I removed this task unless I can verify otherwise.

After this (perhaps as part of the next milestone), it would be awesome to enable streaming optionally! We will need to invent syntax and type constraints that reflect how streaming access is implemented. For example, maybe we can direct the HLS tool to stream in B bytes at a time; we will need to add constraints to the type system to ensure that the code accesses each B-sized block of the input in order. (But accesses within each block might occur out of order.)
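
As an illustration of that constraint, code like this sketch would satisfy it (the block size B and the function are hypothetical): blocks are visited strictly in order, as streaming would require, while accesses within each block may be reordered freely:

```c
#include <assert.h>

#define B 4  // hypothetical block size

// Doubles every element of a. Blocks are processed in order, but
// elements within each block are deliberately accessed in reverse,
// which the proposed type constraints would still allow.
void scale_blocks(float *a, int n) {
    for (int blk = 0; blk < n / B; blk++) {
        for (int j = B - 1; j >= 0; j--) {
            a[blk * B + j] *= 2.0f;
        }
    }
}
```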

Stage 2: Matrix/Matrix Multiplication

Array access inconsistency

There is an inconsistency among the different types of array accesses: with an explicit array access, one cannot read a[0][i] twice, as in the following example:

for (let i = 0..9) {
  let x = a[0][i];
  let z = a[0][i];
}

Instead, one must use a local variable, like so:

for (let i = 0..9) {
  let temp = a[0][i];
  let x = temp;
  let z = temp;
}

Whereas we can access multiple values just fine with an implicit array access:

for (let i = 0..9) unroll 3 {
  let x = a[i];
}

There is a concern about this inconsistency: the explicit array access resembles the fine-grained control available in Verilog, whereas the implicit access looks more like plain old HLS.

  • Prevent writing to the same address multiple times, which currently occurs when an unrolled access is written to an explicit integer access.
  • Examples for multiple write access
  • Implement a "+=" operator, which will involve some complications relating to accessing arrays.
  • Handling multiple unrolled loops
  • Fuzziest goal: how to write a good matrix multiplication in Seashell
    • Intermediate step to come up with a hybrid loop
    • Replacement for hybrid loop with an exception of variable array access for totally unrolled loops
  • Change syntax for physical access
  • Come up with multi D implicit access representation
    • Consider how explicit access of multiD can be translated to C speak 1D access
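
For the last point, the usual row-major translation from multi-dimensional to 1-D indexing is a sketch like this (the helper name and dimensions are illustrative assumptions):

```c
#include <assert.h>

// Row-major flattening: an access a[i][j] on an R x C matrix becomes
// a[i * C + j] on a 1-D array of length R * C.
int flatten_index(int i, int j, int cols) {
    return i * cols + j;
}
```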

Stage 3: Convolution

Stage *: Multiplexing

  • Adding mux capability to seashell
  • Come up with some mux examples
    - How about linear search and binary search?
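
If we do use search as a mux example, a binary search kernel might look like this sketch (a plain C version for reference, not Seashell; the function name is an assumption):

```c
#include <assert.h>

// Candidate mux benchmark: binary search over a sorted array.
// Returns the index of key, or -1 if it is absent. The data-dependent
// branch on a[mid] is the kind of control flow a mux would implement.
int bsearch_idx(const int *a, int n, int key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return mid;
        else if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}
```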

Things of interest

  • More descriptive error messages
  • Comments
  • Type check for invalid access beyond array size?