DNN kernels to support GPT decoder models and additional utilities (#87)
* gemm: Add loop unrolled FP32 baseline

* gemm: Add multi-precision verification

* gemm: Add tiling

* gemm: Add options to parallelize over K and bypass DMA

* dnn: Refactor and verify Layernorm

* dnn: Add trivial multi-cluster Layernorm implementation

* dnn: Add IPC verification for Softmax

* dnn: Add FlashAttention-2 layer

* dnn: Refactor and verify GeLU

* dnn: Add Concat layer

* dnn: Add FusedConcatLinear layer

* dnn: Add IPC verification for Conv2D and FusedConv

* dnn: Remove GEMM and Linear layers

* snRuntime: Add global reduction function

* treewide: Add configuration with HW FDIV unit

* docs: Add `data_utils` documentation

* container: Update Bender installation method after cargo fail

* ci: Free up disk space in `build-docker` workflow

---------

Co-authored-by: Viviane Potocnik <vivianep@iis.ee.ethz.ch>
Co-authored-by: Luca Colagrande <luca.colagrande3@gmail.com>
3 people authored Feb 9, 2024
1 parent ae02a03 commit 27c9e85
Showing 116 changed files with 4,350 additions and 2,455 deletions.
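The tiling added in `gemm: Add tiling` splits M, N and K into `m_tiles`/`n_tiles`/`k_tiles` blocks. A minimal numpy sketch of the scheme follows (illustrative only — the function name and signature are not the kernel's actual C API):

```python
import numpy as np

def tiled_gemm(a, b, c, beta, m_tiles, n_tiles, k_tiles):
    # Compute beta*c + a @ b one (m, n, k) tile at a time, mirroring
    # the divisibility checks added to the GEMM data generator.
    M, K = a.shape
    _, N = b.shape
    assert M % m_tiles == 0 and N % n_tiles == 0 and K % k_tiles == 0
    fm, fn, fk = M // m_tiles, N // n_tiles, K // k_tiles
    out = beta * c.astype(np.float64)
    for mi in range(m_tiles):
        for ni in range(n_tiles):
            for ki in range(k_tiles):
                out[mi*fm:(mi+1)*fm, ni*fn:(ni+1)*fn] += (
                    a[mi*fm:(mi+1)*fm, ki*fk:(ki+1)*fk]
                    @ b[ki*fk:(ki+1)*fk, ni*fn:(ni+1)*fn])
    return out
```

Parallelizing over M distributes the `mi` loop, while parallelizing over K requires combining partial sums — presumably via the global reduction function added to snRuntime — which would explain why the generator forbids enabling both at once.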
2 changes: 1 addition & 1 deletion .clang-format
@@ -5,4 +5,4 @@
# The CI runs on `clang-format` version 10
BasedOnStyle: Google
IndentWidth: 4
IncludeBlocks: Preserve
IncludeBlocks: Preserve
12 changes: 12 additions & 0 deletions .github/workflows/build-docker.yml
@@ -13,6 +13,18 @@ jobs:
name: Deploy Docker image
runs-on: ubuntu-22.04
steps:
# Free up disk space on Github-hosted runner
- name: Disk usage
run: df -h
- uses: jlumbroso/free-disk-space@v1.3.1
with:
android: true
dotnet: true
haskell: true
large-packages: true
- name: Disk usage after freeing up space
run: df -h
# Actually build the Docker container
- uses: actions/checkout@v2
- uses: docker/setup-buildx-action@v1
- name: GHCR Log-in
24 changes: 24 additions & 0 deletions .github/workflows/ci.yml
@@ -46,6 +46,30 @@ jobs:
run: |
./run.py sw/run.yaml --simulator verilator -j
# Tests requiring hardware FDIV unit
sw-snitch-cluster-fdiv-vlt:
name: Simulate FDIV SW on Snitch Cluster w/ Verilator
runs-on: ubuntu-22.04
container:
image: ghcr.io/pulp-platform/snitch_cluster:main
steps:
- uses: actions/checkout@v2
with:
submodules: 'recursive'
- name: Build Software
working-directory: target/snitch_cluster
run: |
bender vendor init
make CFG_OVERRIDE=cfg/fdiv.hjson sw
- name: Build Hardware
working-directory: target/snitch_cluster
run: |
make CFG_OVERRIDE=cfg/fdiv.hjson bin/snitch_cluster.vlt
- name: Run Tests
working-directory: target/snitch_cluster
run: |
./run.py sw/fdiv.yaml --simulator verilator -j
#########################################
# Build SW on Snitch Cluster w/ Banshee #
#########################################
8 changes: 8 additions & 0 deletions .gitlab-ci.yml
@@ -134,3 +134,11 @@ snitch-cluster-banshee:
- cargo install --debug --path .
- cd ../target/snitch_cluster
- ./run.py sw/run.yaml --simulator banshee -j --run-dir runs/banshee

# Tests requiring hardware FDIV unit
snitch-cluster-fdiv-vsim:
script:
- cd target/snitch_cluster
- make CFG_OVERRIDE=cfg/fdiv.hjson sw
- make bin/snitch_cluster.vsim
- ./run.py sw/fdiv.yaml --simulator vsim -j --run-dir runs/vsim
1 change: 1 addition & 0 deletions docs/rm/sim/data_utils.md
@@ -0,0 +1 @@
::: data_utils
5 changes: 3 additions & 2 deletions mkdocs.yml
@@ -18,8 +18,8 @@ markdown_extensions:
- pymdownx.superfences
- pymdownx.tabbed
- pymdownx.emoji:
emoji_index: !!python/name:materialx.emoji.twemoji
emoji_generator: !!python/name:materialx.emoji.to_svg
emoji_index: !!python/name:material.extensions.emoji.twemoji
emoji_generator: !!python/name:material.extensions.emoji.to_svg
plugins:
- include-markdown
- mkdocstrings:
@@ -54,6 +54,7 @@ nav:
# - Solder: rm/solder.md
- Software:
- Simulation Utilities:
- data_utils: rm/sim/data_utils.md
- sim_utils: rm/sim/sim_utils.md
- rm/sim/Simulation.md
- rm/sim/Simulator.md
1 change: 1 addition & 0 deletions python-requirements.txt
@@ -11,6 +11,7 @@ hjson
jsonref
jsonschema
mako
mkdocs-material
progressbar2
tabulate
yamllint
14 changes: 7 additions & 7 deletions sw/blas/axpy/data/datagen.py
@@ -11,8 +11,8 @@
import os

sys.path.append(os.path.join(os.path.dirname(__file__), "../../../../util/sim/"))
from data_utils import format_scalar_definition, format_vector_definition, \
format_vector_declaration, format_ifdef_wrapper # noqa: E402
from data_utils import format_scalar_definition, format_array_definition, \
format_array_declaration, format_ifdef_wrapper # noqa: E402

MIN = -1000
MAX = +1000
@@ -47,16 +47,16 @@ def main():
a = np.random.uniform(MIN, MAX, 1)
x = np.random.uniform(MIN, MAX, length)
y = np.random.uniform(MIN, MAX, length)
z = np.zeros(length)
g = golden_model(a, x, y)

# Format header file
l_str = format_scalar_definition('const uint32_t', 'l', length)
a_str = format_scalar_definition('const double', 'a', a[0])
x_str = format_vector_definition('double', 'x', x, alignment=BURST_ALIGNMENT, section=section)
y_str = format_vector_definition('double', 'y', y, alignment=BURST_ALIGNMENT, section=section)
z_str = format_vector_declaration('double', 'z', z, alignment=BURST_ALIGNMENT, section=section)
g_str = format_vector_definition('double', 'g', g)
x_str = format_array_definition('double', 'x', x, alignment=BURST_ALIGNMENT, section=section)
y_str = format_array_definition('double', 'y', y, alignment=BURST_ALIGNMENT, section=section)
z_str = format_array_declaration('double', 'z', [length],
alignment=BURST_ALIGNMENT, section=section)
g_str = format_array_definition('double', 'g', g)
g_str = format_ifdef_wrapper('BIST', g_str)
f_str = '\n\n'.join([l_str, a_str, x_str, y_str, z_str, g_str])
f_str += '\n'
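The axpy generator now uses the renamed `format_array_*` helpers; note that `z` switches from a zero-filled definition to a shape-only declaration, so no output data is baked into the header. A rough sketch of what these helpers plausibly emit (a hypothetical reimplementation, not the actual `data_utils` code):

```python
import numpy as np

def format_array_definition(ctype, name, values):
    # Hypothetical sketch: emit a C array with an initializer list.
    arr = np.asarray(values).flatten()
    body = ', '.join(repr(float(v)) for v in arr)
    return f'{ctype} {name}[{arr.size}] = {{{body}}};'

def format_array_declaration(ctype, name, shape):
    # Declaration only: reserves space for an output buffer (like `z`)
    # without emitting any data values.
    dims = ''.join(f'[{d}]' for d in shape)
    return f'{ctype} {name}{dims};'

print(format_array_declaration('double', 'z', [4]))
# double z[4];
```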
10 changes: 5 additions & 5 deletions sw/blas/axpy/verify.py
@@ -13,7 +13,7 @@
sys.path.append(str(Path(__file__).parent / '../../../util/sim/'))
import verification # noqa: E402
from elf import Elf # noqa: E402
from data_utils import bytes_to_doubles # noqa: E402
from data_utils import from_buffer # noqa: E402


ERR_THRESHOLD = 1E-10
@@ -27,16 +27,16 @@ def main():
symbols_bin=args.symbols_bin,
log=args.log,
output_uids=['z'])
z_actual = np.array(bytes_to_doubles(raw_results['z']))
z_actual = from_buffer(raw_results['z'], 'double')

# Extract input operands from ELF file
if args.symbols_bin:
elf = Elf(args.symbols_bin)
else:
elf = Elf(args.snitch_bin)
a = np.array(bytes_to_doubles(elf.get_symbol_contents('a')))
x = np.array(bytes_to_doubles(elf.get_symbol_contents('x')))
y = np.array(bytes_to_doubles(elf.get_symbol_contents('y')))
a = elf.from_symbol('a', 'double')
x = elf.from_symbol('x', 'double')
y = elf.from_symbol('y', 'double')

# Verify results
z_golden = golden_model(a, x, y)
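`from_buffer` generalizes the old `bytes_to_doubles` to arbitrary C types, so one helper serves all precisions. A plausible minimal sketch (the real `data_utils.from_buffer` may differ):

```python
import numpy as np

def from_buffer(buf, ctype='double'):
    # Map a C type name to a numpy dtype and reinterpret the raw
    # bytes read from the simulation output or ELF symbol.
    dtype = {'double': np.float64, 'float': np.float32,
             'uint32_t': np.uint32}[ctype]
    return np.frombuffer(buf, dtype=dtype)

z = from_buffer(np.array([1.5, -2.0]).tobytes(), 'double')
```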
2 changes: 1 addition & 1 deletion sw/blas/gemm/Makefile
@@ -9,7 +9,7 @@ MK_DIR := $(dir $(realpath $(lastword $(MAKEFILE_LIST))))
DATA_DIR := $(realpath $(MK_DIR)/data)
SRC_DIR := $(realpath $(MK_DIR)/src)

DATA_CFG ?= $(DATA_DIR)/params.hjson
DATA_CFG ?= $(DATA_DIR)/params.json
SECTION ?=

APP ?= gemm
72 changes: 47 additions & 25 deletions sw/blas/gemm/data/datagen.py
@@ -9,13 +9,13 @@
import numpy as np
import argparse
import pathlib
import hjson
import json5
import sys
import os

sys.path.append(os.path.join(os.path.dirname(__file__), "../../../../util/sim/"))
from data_utils import emit_license, format_scalar_definition, \
format_vector_definition, format_ifdef_wrapper # noqa: E402
format_array_definition, format_ifdef_wrapper # noqa: E402


np.random.seed(42)
@@ -52,25 +52,41 @@ def emit_header(**kwargs):

# Generate random input matrices
dtype = NUMPY_TYPES[str(kwargs['prec'])]
M, N, K = kwargs['M'], kwargs['N'], kwargs['K']
m_tiles = kwargs['m_tiles']
n_tiles = kwargs['n_tiles']
k_tiles = kwargs['k_tiles']
parallelize_m = kwargs['parallelize_m']
parallelize_k = kwargs['parallelize_k']
baseline = kwargs['baseline']

assert (M % m_tiles) == 0, 'M is not an integer multiple of tile size'
assert (N % n_tiles) == 0, 'N is not an integer multiple of tile size'
assert (K % k_tiles) == 0, 'K is not an integer multiple of tile size'
frac_m = M / m_tiles
assert (frac_m % 8) == 0, 'frac_m is not an integer multiple of the number of cores per ' \
'cluster'
assert not (parallelize_m and parallelize_k), 'Cannot parallelize K and M simultaneously'

if (kwargs['prec']) == 8:
# sign -1 or 1
sign_a = np.random.randint(0, 2, (kwargs['M'], kwargs['K'])).astype(dtype)
sign_a = np.random.randint(0, 2, (M, K)).astype(dtype)
# exponent < 0b01111
exponent_a = np.random.randint(0, 16, (kwargs['M'], kwargs['K'])).astype(dtype)
exponent_a = np.random.randint(0, 16, (M, K)).astype(dtype)
# mantissa can be arbitrary
mantissa_a = np.random.randint(0, 4, (kwargs['M'], kwargs['K'])).astype(dtype)
mantissa_a = np.random.randint(0, 4, (M, K)).astype(dtype)
# sign -1 or 1
sign_b = np.random.randint(0, 2, (kwargs['K'], kwargs['N'])).astype(dtype)
sign_b = np.random.randint(0, 2, (K, N)).astype(dtype)
# exponent < 0b01111
exponent_b = np.random.randint(0, 16, (kwargs['K'], kwargs['N'])).astype(dtype)
exponent_b = np.random.randint(0, 16, (K, N)).astype(dtype)
# mantissa can be arbitrary
mantissa_b = np.random.randint(0, 4, (kwargs['K'], kwargs['N'])).astype(dtype)
mantissa_b = np.random.randint(0, 4, (K, N)).astype(dtype)
# sign -1 or 1
sign_c = np.random.randint(0, 2, (kwargs['M'], kwargs['N'])).astype(dtype)
sign_c = np.random.randint(0, 2, (M, N)).astype(dtype)
# exponent < 0b01111
exponent_c = np.random.randint(0, 16, (kwargs['M'], kwargs['N'])).astype(dtype)
exponent_c = np.random.randint(0, 16, (M, N)).astype(dtype)
# mantissa can be arbitrary
mantissa_c = np.random.randint(0, 4, (kwargs['M'], kwargs['N'])).astype(dtype)
mantissa_c = np.random.randint(0, 4, (M, N)).astype(dtype)
_a = ((-1.0)**sign_a.astype(np.double))*(2.0**(exponent_a.astype(np.double)-15.0)) \
* (1.0 + mantissa_a.astype(np.double) / (2**2))
_b = ((-1.0)**sign_b.astype(np.double))*(2.0**(exponent_b.astype(np.double)-15.0)) \
@@ -82,36 +98,42 @@ def emit_header(**kwargs):
b = sign_b << 7 | exponent_b << FP8_FORMATS['fp8']['mant'] | mantissa_b
c = sign_c << 7 | exponent_c << FP8_FORMATS['fp8']['mant'] | mantissa_c
else:
a = np.random.rand(kwargs['M'], kwargs['K']).astype(dtype)
b = np.random.rand(kwargs['K'], kwargs['N']).astype(dtype)
c = np.random.rand(kwargs['M'], kwargs['N']).astype(dtype)
a = np.random.rand(M, K).astype(dtype)
b = np.random.rand(K, N).astype(dtype)
c = np.random.rand(M, N).astype(dtype)
result = golden_model(1, a, b, kwargs['beta'], c)

# Store matrices in transposed form if requested
a = a.T if kwargs['ta'] else a
b = b.T if kwargs['tb'] else b

data_str = [emit_license()]
data_str += [format_scalar_definition('uint32_t', 'M', kwargs['M'])]
data_str += [format_scalar_definition('uint32_t', 'N', kwargs['N'])]
data_str += [format_scalar_definition('uint32_t', 'K', kwargs['K'])]
data_str += [format_scalar_definition('uint32_t', 'M', M)]
data_str += [format_scalar_definition('uint32_t', 'N', N)]
data_str += [format_scalar_definition('uint32_t', 'K', K)]
data_str += [format_scalar_definition('uint32_t', 'TA', int(kwargs['ta']))]
data_str += [format_scalar_definition('uint32_t', 'TB', int(kwargs['tb']))]
data_str += [format_scalar_definition('uint32_t', 'BETA', kwargs['beta'])]
data_str += [format_scalar_definition('uint32_t', 'dtype_size', kwargs['prec']//8)]
data_str += [format_scalar_definition('uint32_t', 'expand', kwargs['expand'])]
data_str += [format_vector_definition(C_TYPES[str(kwargs['prec'])], 'a', a.flatten(),
data_str += [format_scalar_definition('uint32_t', 'm_tiles', kwargs['m_tiles'])]
data_str += [format_scalar_definition('uint32_t', 'n_tiles', kwargs['n_tiles'])]
data_str += [format_scalar_definition('uint32_t', 'k_tiles', kwargs['k_tiles'])]
data_str += [format_scalar_definition('uint32_t', 'parallelize_m', kwargs['parallelize_m'])]
data_str += [format_scalar_definition('uint32_t', 'parallelize_k', kwargs['parallelize_k'])]
data_str += [format_scalar_definition('uint32_t', 'baseline', int(baseline))]
data_str += [format_array_definition(C_TYPES[str(kwargs['prec'])], 'a', a.flatten(),
alignment=BURST_ALIGNMENT, section=kwargs['section'])]
data_str += [format_vector_definition(C_TYPES[str(kwargs['prec'])], 'b', b.flatten(),
data_str += [format_array_definition(C_TYPES[str(kwargs['prec'])], 'b', b.flatten(),
alignment=BURST_ALIGNMENT, section=kwargs['section'])]
data_str += [format_vector_definition(C_TYPES[str(kwargs['prec'])], 'c', c.flatten(),
data_str += [format_array_definition(C_TYPES[str(kwargs['prec'])], 'c', c.flatten(),
alignment=BURST_ALIGNMENT, section=kwargs['section'])]
if kwargs['prec'] == 8:
result_def = format_vector_definition(C_TYPES['64'], 'result', result.flatten())
result_def = format_array_definition(C_TYPES['64'], 'result', result.flatten())
else:
result_def = format_vector_definition(C_TYPES[str(kwargs['prec'])],
'result',
result.flatten())
result_def = format_array_definition(C_TYPES[str(kwargs['prec'])],
'result',
result.flatten())
data_str += [format_ifdef_wrapper('BIST', result_def)]
data_str = '\n\n'.join(data_str)

@@ -135,7 +157,7 @@ def main():

# Load param config file
with args.cfg.open() as f:
param = hjson.loads(f.read())
param = json5.loads(f.read())
param['section'] = args.section

# Emit header file
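For the FP8 path, the generator packs `sign << 7 | exponent << mant | mantissa` and computes the reference value in double precision. Assuming the 2-bit mantissa and exponent bias of 15 implied by the diff's formulas, decoding one element looks like:

```python
def fp8_value(sign, exponent, mantissa, mant_bits=2, bias=15):
    # Decode the bit fields the generator packs as
    # sign << 7 | exponent << mant_bits | mantissa.
    # (2-bit mantissa and bias 15 are inferred from the diff, not
    # taken from FP8_FORMATS itself.)
    return ((-1.0) ** sign) * 2.0 ** (exponent - bias) \
        * (1.0 + mantissa / 2 ** mant_bits)

# sign=0, exponent=15 (the bias), mantissa=0 encodes 1.0
assert fp8_value(0, 15, 0) == 1.0
```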
@@ -12,5 +12,11 @@
ta: false,
tb: true, // must be true for SIMD
prec: 64,
expand: 0
expand: 0,
m_tiles: 2, // number of tiles in M dimension
k_tiles: 1, // number of tiles in K dimension
n_tiles: 1, // number of tiles in N dimension
parallelize_k: 0,
parallelize_m: 0,
baseline: false
}
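These parameters must satisfy the divisibility checks the data generator enforces (8 cores per cluster); an illustrative standalone check (the M/N/K values below are placeholders, not the config's defaults):

```python
def check_tiling(M, N, K, m_tiles, n_tiles, k_tiles, cores_per_cluster=8):
    # Mirrors the asserts added in sw/blas/gemm/data/datagen.py.
    assert M % m_tiles == 0, 'M is not an integer multiple of tile size'
    assert N % n_tiles == 0, 'N is not an integer multiple of tile size'
    assert K % k_tiles == 0, 'K is not an integer multiple of tile size'
    assert (M // m_tiles) % cores_per_cluster == 0, \
        'frac_m is not an integer multiple of the number of cores per cluster'

check_tiling(M=16, N=16, K=16, m_tiles=2, n_tiles=1, k_tiles=1)
```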