This document provides a short description about producing Ahead Of Time (AOT) compiled executable bundles. The motivation for this work is to remove the cost of compile time by allowing the users of Glow to compile a network model Ahead Of Time.
A bundle is a self-contained compiled network model that can be used to execute the model in a standalone mode. Glow has multiple backends which can run models but not all of them are capable of saving a compiled model.
The main backend used to generate bundles is the CPU backend where the bundle exists as a self-contained object file (library) containing all the necessary code to run a model. The bundle has a single entry function which performs inference for the model.
After following this document you will be able to compile models into bundles. You can view the CMake instructions (CMakeLists.txt) used to build the bundles and also the applications (main.cpp) which integrate the bundles for the ResNet50 and LeNetMnist models here:
When building Glow you can choose to build the ResNet50 and the LeNetMnist
examples by enabling the GLOW_WITH_BUNDLES
option as below:
mkdir build_Debug
cd build_Debug
cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug -DGLOW_WITH_BUNDLES=ON ../glow
ninja all
After the build is complete, you can run the example applications using one of the
following commands (the executables are found in the build_Debug/bundles
folder):
./ResNet50Bundle cat_285.png
./QuantizedResNet50Bundle cat_285.png
./LeNetMnistBundle 0_1009.png
./QuantizedLeNetMnistBundle 0_1009.png
You can find the sample images used in the examples above in the following directories:
- Sample images for the ResNet50 model from the IMAGENET database here
- Sample image for the LeNetMnist model from the MNIST database here
The model-compiler front-end tool is the main Glow tool used to compile ONNX and Caffe2 models into bundles. The tool is generic in the sense that it can compile models with any number of inputs or outputs, without being limited to a particular application.
The command used to build a bundle using the CPU backend is the following:
model-compiler -backend=CPU -model=<model-path> -emit-bundle=<bundle-dir>
- The option
backend
specifies the backend used to generated the bundle. - The option
model
specifies the path for the Caffe2 or ONNX model:- For Caffe2 models the
<model-path>
is expected to point to a directory which contains two model files:init_net.pb
andpredict_net.pb
. Some Caffe2 model directories also contain a file namedpredict_net.pbtxt
but that file is not required. - For ONNX models the
<model-path>
is expected to point to a file with the extension onnx or onnxtxt.
- For Caffe2 models the
- The option
emit-bundle
specifies the output directory where all the bundle artifacts will be generated. If the directory does not exist, it will be created.
There is a small difference when using this tool with ONNX versus Caffe2 models:
- For ONNX models the tool can infer automatically the inputs of the model
since the description of the input tensors is part of the model. Therefore the tool
will be used in the form shown above:
model-compiler -backend=CPU -model=<onnx-model-path> -emit-bundle=<bundle-dir>
- For Caffe2 models the user must also explicitly provide the description
of the input tensors which are not part of model. The option
model-input
will be used to specify the name, type and shape for each model input. The model input names can be deduced by inspecting the model with Netron. The option can be used multiple times to describe each model input separately:For quantized types the format of themodel-compiler -backend=CPU -model=<caffe2-model-path> -emit-bundle=<bundle-dir> \ -model-input=<inputName1>,<inputType1>,<inputShape1> \ -model-input=<inputName2>,<inputType2>,<inputShape2> \ ....................................................
-model-input
is slightly different since the scale and offset parameters should also be provided:For example we can can provide one or more inputs with:-model-input=<name>,<type>,<scale>,<offset>,<shape>
-model-input=input_03_data,float,[1] -model-input=data_bias,int32,[1,32,32] -model-input=data,int8q,0.123,-13,[1,10]
After running the model-compiler tool, the following bundle artifacts will be generated in the output directory:
<network_name>.o
- the bundle object file (code).<network_name>.h
- the bundle header file (API).<network_name>.weights.bin
- the model weights in binary format.<network_name>.weights.txt
- the model weights in text format as C text array.
For example, to compile the ResNet50 model provided by Glow in Caffe2 format, you can use the following command. The name of the input tensor is gpu_0/data, the type is float and the shape is 1 x 3 x 224 x 224 corresponding to a RGB image with NCHW layout:
model-compiler -backend=CPU -model=resnet50 -emit-bundle=build -model-input=gpu_0/data,float,[1,3,224,224]
Similarly, to compile the LeNetMnist model provided by Glow in Caffe2 format, you can use the following command. The name of the input tensor is data, the type is float and the shape is 1 x 1 x 28 x 28 corresponding to a grayscale image with NCHW layout:
model-compiler -backend=CPU -model=lenet_mnist -emit-bundle=build -model-input=data,float,[1,1,28,28]
For more information about the options of the model-compiler tool use:
model-compiler -help
Glow support for out-of-the-box quantized models is a work in progress mainly because the quantized tensors and operators are not well standardized in formats like Caffe2 and ONNX.
The way Glow produces a quantized bundle is by taking a floating-point model and converting it to a quantized model using it's own internal representation before compiling to a bundle.
The procedure used for quantizing the model is called profile-guided quantization (more details here). Before the model is quantized and compiled with the model-compiler tool, the quantization profile must be acquired.
In order to compute the quantization profile, the image-classifier tool is used. This application is used for image classification models only and requires a set of images to do the inference and compute the profile:
image-classifier <images> <image-opts> -model=<model-path> -model-input-name=<name> -dump-profile=profile.yaml
- The image paths are specified one after the other, separated by space. Only images in PNG format are supported (RGB or grayscale).
model
specifies the path for the Caffe2 or ONNX model.model-input-name
specifies the name of the model input.- The option
dump-profile
specifies the file name used to dump the profile in YAML format.
Extra options are available to specify how the images are pre-processed before they are fed to the model during inference:
- The option
image-mode
can be used to specify how the images are normalized:neg1to1
for the range [-1, 1].0to1
for the range [0, 1].0to255
for the range [0, 255] (Default).neg128to127
for the range [-128,127].
- The option
image-channel-order
can be used to specify the order of the channels:RGB
for R,G,B channel ordering.BGR
for B,G,R channel ordering (Default).
- The option
image-layout
can be used to specify the tensor dimension ordering (layout) which must match the layout of the model input:NCHW
for NCHW tensor layout (Default).NHWC
for NHWC tensor layout.
For more information about the options of the image-classifier tool use:
image-classifier -help
After the quantization profile profile.yaml
has been generated, we can use the model-compiler
tool to compile the model into a bundle by loading the previously generated profile:
model-compiler ... -load-profile=profile.yaml
If a specific quantization schema is desired, same schema must be provided during both
profiling and compiling using the quantization-schema
option. More details about the
quantization schema can be found here.
For example, in order to profile, quantize and compile the ResNet50 model you can use the commands:
image-classifier cat_285.png ... -image-mode=0to1 -model=resnet50 -model-input-name=gpu_0/data -dump-profile=profile.yml
model-compiler -backend=CPU -model=resnet50 -emit-bundle=build -model-input=gpu_0/data,float,[1,3,224,224] -load-profile=profile.yml
Similarly, for the LeNetMnist model:
image-classifier 0_1009.png ... -image-mode=0to1 -model=lenet_mnist -model-input-name=data -dump-profile=profile.yml
model-compiler -backend=CPU -model=lenet_mnist -emit-bundle=build -model-input=data,float,[1,1,28,28] -load-profile=profile.yml
It is important to note that currently the quantization of a model is performed only for the intermediate nodes of the graph, without affecting the data types of the model inputs and outputs. In the examples above, the data type for the model input (image tensor) and the model output remains float even though the intermediate operators and tensors use int8 data type.
When compiling a quantized bundle you can choose to disable the quantization of some
of the graph operators which might be more susceptible to quantization by using
the option keep-original-precision-for-nodes
. For example in order to disable the
quantization for the operators Add,Sub,Mul,Div,Exp and SoftMax you can use:
model-compiler ... -keep-original-precision-for-nodes=Add,Sub,Mul,Div,Exp,SoftMax
Since the CPU backend is based on LLVM, the Glow tools can be used to
cross-compile bundles for different target architectures. To specify
the target architecture you must use the -target
and -mcpu
flags
(if no target flags are provided the bundle will be generated by default
for the native architecture - the one which is running Glow). Below you
have a table with examples for some of the common architecures:
Architecture | Option |
---|---|
x86 - 32 bit | -target=i386 |
x86 - 64 bit | -target=x86_64 |
ARM Cortex M0 | -target=arm -mcpu=cortex-m0 |
ARM Cortex M4 | -target=arm -mcpu=cortex-m4 |
ARM Cortex M7 | -target=arm -mcpu=cortex-m7 |
ARM Cortex M33 | -target=arm -mcpu=cortex-m33 |
ARM Cortex A53 | -target=aarch64 -mcpu=cortex-a53 |
ARM Cortex A72 | -target=aarch64 -mcpu=cortex-a72 |
The bundle can be cross-compiled for any target architecture supported by
LLVM. For the complete list of LLVM target architectures you can type
llc -version
command in Linux (assuming you have LLVM installed). For
example the LLVM 8.0.1 has the following supported architectures:
LLVM (http://llvm.org/):
LLVM version 8.0.1
Optimized build.
Default target: x86_64-pc-linux-gnu
Host CPU: skylake
Registered Targets:
aarch64 - AArch64 (little endian)
aarch64_be - AArch64 (big endian)
amdgcn - AMD GCN GPUs
arm - ARM
arm64 - ARM64 (little endian)
armeb - ARM (big endian)
avr - Atmel AVR Microcontroller
bpf - BPF (host endian)
bpfeb - BPF (big endian)
bpfel - BPF (little endian)
hexagon - Hexagon
lanai - Lanai
mips - MIPS (32-bit big endian)
mips64 - MIPS (64-bit big endian)
mips64el - MIPS (64-bit little endian)
mipsel - MIPS (32-bit little endian)
msp430 - MSP430 [experimental]
nvptx - NVIDIA PTX 32-bit
nvptx64 - NVIDIA PTX 64-bit
ppc32 - PowerPC 32
ppc64 - PowerPC 64
ppc64le - PowerPC 64 LE
r600 - AMD GPUs HD2XXX-HD6XXX
sparc - Sparc
sparcel - Sparc LE
sparcv9 - Sparc V9
systemz - SystemZ
thumb - Thumb
thumbeb - Thumb (big endian)
wasm32 - WebAssembly 32-bit
wasm64 - WebAssembly 64-bit
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
xcore - XCore
-
If you want to change the base name of the bundle artifacts you can use the option:
-network-name=<network-name>
For example by using the option
-network-name=mb2
the generated bundle artifacts will be named mb2.o, mb2.h, mb2.weights.bin and mb2.weights.txt. -
The generated bundle object file
<network_name>.o
is non-relocatable by default. It is possible to control the relocation model with the command line option-relocation-model=<mode>
. This option supports two modes:static
: (Default) Produce non-relocatable code.pic
: Produce position independent code.
-
The ONNX format allows some of the tensor dimensions to be undefined (that is, marked as symbols and not as actual sizes) within the model. In the ONNX proto description the symbolic sizes are marked with
dim_param
while the actual sizes are marked withdim_value
. For example, when inspecting an image classification model with Netron one might see that the input tensor size is defined asNone x 3 x 224 x 224
whereNone
is the undefined size (symbol) associated with the batch size of the model. Glow cannot compile a model with undefined sizes, therefore the user must assign manually the actual values for all the symbols. The user will be prompted with an error if there is an undefined symbol (for exampleONNX model symbol 'None' is undefined.
). In the above example, the user can define the symbolNone
to have the value1
by using the following option:-onnx-define-symbol=None,1
Multiple symbols can be defined using this option, for example:
-onnx-define-symbol=<symbol_name1>,<symbol_value1> -onnx-define-symbol=<symbol_name2>,<symbol_value2> ..................................................
-
When cross-compiling bundles for some target architectures you might be interested in generating a bundle compatible with a given float ABI (Application Binary Interface) type (soft or hard). The LLVM backend can be instructed to generate an object file using a specific float ABI by using the option
-float-abi=hard
or-float-abi=soft
. -
When compiling the bundle it is useful to view the final form of the graph after all the transformations and optimizations performed by Glow (which might differ from the initial model). You can generate the graph visual representation in .dot format by using the
-dump-graph-DAG
option like in this:-dump-graph-DAG=graph.dot
Additionally, you can convert the .dot file to .pdf format using the dot utility available on Linux like this:
dot -Tpdf graph.dot -o graph.pdf
The memory of a bundle is organized in three separate memory regions which must be allocated by the user application code and provided through the bundle interface:
-
constantWeight
- contains the model constant weights. The user application must:- allocate this memory region (statically or dynamically)
- initialize this memory region with the content of the generated weights file in
one of two possible formats:
- binary format (
<network_name>.weights.bin
) used to initialize this memory region (allocated statically or dynamically) by loading the binary file dynamically at run-time using standard C function like fopen. - text format (
<network_name>.weights.txt
) used to initialize this memory region (only if statically allocated) by including the text file statically at compile-time as a C array using the #include pre-processor directive. This format is suitable for target architectures which do not have file systems (for example microcontrollers).
- binary format (
- provide the base address of this memory region to the inference function
-
mutableWeight
- contains all the model inputs and outputs (graph placeholders). The tensors corresponding to different inputs and outputs are identified using offsets relative to the base address of this memory region. The user application must:- allocate this memory region (statically or dynamically)
- initialize the model input tensors from this memory region with the desired input data before running the inference
- provide the base address of this memory region to the inference function
- read the model output tensors from this memory region after running the inference
-
activations
- this memory region is a scratch memory required for the bundle code to store the intermediate results of the graph computation (activations). The user application must:- allocate this memory region (statically or dynamically)
- provide the base address of this memory region to the inference function
- this memory region is NOT required to be initialized
The required sizes for all the memory regions described above are provided in the bundle interface. Also all the memory regions must be allocated with a minimum alignment which is also provided in the interface (typically 64 bytes). For example, for aligning a statically allocated buffer one can use the following C syntax:
__attribute__((aligned(64)))
uint8_t aligned_buffer[BUFFER_SIZE];
This is the default bundle API obtained by generating the bundle with the option
-bundle-api=static
. Below is an example of how the auto-generated header file
looks like for the Lenet Mnist model:
// Placeholder address offsets within mutable buffer (bytes)
#define LENET_MNIST_data 0
#define LENET_MNIST_softmax__1 3136
// Memory sizes (bytes)
#define LENET_MNIST_CONSTANT_MEM_SIZE 1724672
#define LENET_MNIST_MUTABLE_MEM_SIZE 3200
#define LENET_MNIST_ACTIVATIONS_MEM_SIZE 57600
// Memory alignment (bytes)
#define LENET_MNIST_MEM_ALIGN 64
// Bundle entry point (inference function)
void lenet_mnist(uint8_t *constantWeight, uint8_t *mutableWeight, uint8_t *activations);
The header file contains all the information required to run the bundle, defined in a static manner using macro defines:
- the offsets of all the placeholders (graph inputs/outputs) within the
mutableWeight
memory - the sizes for all the memory regions
- the alignment required for allocating the memory regions
- the inference function prototype
All the definitions names (the macros and the inference function) are prefixed
with the model name, in this example with lenet_mnist. If you want to change
the model name you can use the command line option -network-name
, for example
-network-name=my_bundle
.
The auto-generated header file file also contains some extra defines to help with writing the user application code:
// Memory alignment definition with given alignment size
// for static allocation of memory.
#define GLOW_MEM_ALIGN(size) __attribute__((aligned(size)))
// Macro function to get the absolute address of a
// placeholder using the base address of the mutable
// weight buffer and placeholder offset definition.
#define GLOW_GET_ADDR(mutableBaseAddr, placeholderOff) (((uint8_t*)(mutableBaseAddr)) + placeholderOff)
For example, in order to allocate and initialize all the memory regions, you need to write the following in the user application (lenet_mnist.weights.txt is the file containing the model weights serialized as text):
GLOW_MEM_ALIGN(LENET_MNIST_MEM_ALIGN)
uint8_t constantWeight[LENET_MNIST_CONSTANT_MEM_SIZE] = {
#include "lenet_mnist.weights.txt"
};
GLOW_MEM_ALIGN(LENET_MNIST_MEM_ALIGN)
uint8_t mutableWeight[LENET_MNIST_MUTABLE_MEM_SIZE];
GLOW_MEM_ALIGN(LENET_MNIST_MEM_ALIGN)
uint8_t activations[LENET_MNIST_ACTIVATIONS_MEM_SIZE];
In order to obtain the absolute addresses of the model inputs/outputs you need to write the following in the user application:
uint8_t *inputAddr = GLOW_GET_ADDR(mutableWeight, LENET_MNIST_data);
uint8_t *outputAddr = GLOW_GET_ADDR(mutableWeight, LENET_MNIST_softmax__1);
This is the bundle API obtained by generating the bundle with the option
-bundle-api=dynamic
. Below is an example of how the auto-generated header
file looks like for the Resnet50 model:
// Bundle memory configuration (memory layout)
extern BundleConfig resnet50_config;
// Bundle entry point (inference function)
void resnet50(uint8_t *constantWeight, uint8_t *mutableWeight, uint8_t *activations);
This API has all the information about the memory configuration encapsulated
in a structure named <network_name>_config
. The layout of this structure is
defined by the type BundleConfig
which is also included in the generated
header file:
// Type describing the config of a generated bundle.
struct BundleConfig {
// Size of the constant weight variables memory area.
uint64_t constantWeightVarsMemSize;
// Size of the mutable weight variables memory area.
uint64_t mutableWeightVarsMemSize;
// Size of the activations memory area.
uint64_t activationsMemSize;
// Alignment to be used for weights and activations.
uint64_t alignment;
// Number of symbols in the symbol table.
uint64_t numSymbols;
// Symbol table.
const SymbolTableEntry *symbolTable;
};
Similar to the static API, this structure contains:
- the sizes for all the memory regions
- the alignment required for allocating all the memory regions
- the number of symbols
- the descriptions of all the symbols as an array of symbol entries
In this case the notion of symbol might include not only the model
placeholders but also the model constant weights. Each symbol is
described according to the SymbolTableEntry
structure definition
(included also in the header file):
// Type describing a symbol table entry of a generated bundle.
struct SymbolTableEntry {
// Name of a variable.
const char *name;
// Offset of the variable inside the memory area.
uint64_t offset;
// The number of elements inside this variable.
uint64_t size;
// Variable kind: 1 if it is a mutable variable, 0 otherwise.
char kind;
};
For each symbol the following information is registered:
- the symbol name
- the symbol kind: whether is mutable (placeholder) or not (constant)
- the size in bytes
- the offset: if the symbol is mutable this is the offset of the variable
within the
mutableWeight
buffer, otherwise this is the offset of the variable within theconstantWeight
buffer
The user has to look up the symbol entries to find the model variables (placeholders or constants) at run-time (dynamically).
This section describes how to use the bundle produced with the CPU backend. Other backends may have different interfaces. To integrate the bundle into your project, you need to do the following:
- Link the object file
<network_name>.o
to your application. - Include the header file
<network_name>.h
in your application. - Allocate memory for all the required buffers:
constantWeight
,mutableWeight
,activations
:- For the static API the buffer sizes and alignments are given by the macro defines from the
header file
<network_name>.h
and are available both at compile-time and run-time. - For the dynamic API the buffer sizes and alignments are provided in the
<network_name>_config
structure and are only available at run-time.
- For the static API the buffer sizes and alignments are given by the macro defines from the
header file
- Initialize the content of the
constantWeight
buffer with the model weights using either the<network_name>.weights.bin
file or the<network_name>.weights.txt
file. - Initialize the model input tensors from the
mutableWeight
buffer (e.g. image data). - Invoke the main entry function
<network_name>
by providing the base addresses of the memory regions previously allocated. - After the main entry function has returned, you can find the results in the corresponding model
output tensors from the
mutableWeight
buffer.
A very popular tool for visualizing the original model before compiling with Glow is Netron which has:
- an online browser version here (drag & drop a model into the browser window).
- an offline standalone version here (drag & drop a model into the GUI window).
The Glow compiler currently has support only for Caffe2 and ONNX model formats. Since a lot of popular models are available in other formats, for example TensorFlow, it is useful to have tools to convert models between different formats. The most used tools for format conversion are:
Here we demonstrate how to convert a TensorFlow model to ONNX. We will convert a MobileNet V1 image classification model which operates on 128 x 128 RGB images and 1001 classes. You can download the model archive in TensorFlow format from here.
In order to convert the model to ONNX format using the MMdnn tool:
- Install MMdnn from here.
- Download the MobileNet V1 model archive and unzip it.
- Run the following command to convert the TensorFlow frozen file
mobilenet_v1_0.25_128_frozen.pb
to the ONNX model filemobilenet_v1_0.25_128.onnx
:mmconvert \ --srcFramework tensorflow \ --inputWeight mobilenet_v1_0.25_128_frozen.pb \ --inNodeName input \ --inputShape 128,128,3 \ --dstNodeName MobilenetV1/Predictions/Softmax \ --dstFramework onnx \ --outputModel mobilenet_v1_0.25_128.onnx
In order to convert the model to ONNX format using the tf2onnx tool:
- Install tf2onnx from here.
- Download the MobileNet V1 model archive and unzip it.
- Run the following command to convert the TensorFlow frozen file
mobilenet_v1_0.25_128_frozen.pb
to the ONNX model filemobilenet_v1_0.25_128.onnx
:python -m tf2onnx.convert \ --input mobilenet_v1_0.25_128_frozen.pb \ --inputs input:0 \ --outputs MobilenetV1/Predictions/Softmax:0 \ --output mobilenet_v1_0.25_128.onnx
After converting the model to ONNX you can build a bundle using the Glow model-compiler tool with the command:
model-compiler -backend=CPU -model=mobilenet_v1_0.25_128.onnx -emit-bundle=bundle
When converting the model with tf2onnx some undefined symbols might end up in the ONNX model. For example,
the input batch size might be encoded as a symbol unk__286
which can be defined to a value of 1
by using
the extra option:
-onnx-define-symbol=unk__286,1
There are times when it is useful to manually edit the protocol buffers which are used for model formats like ONNX and TensorFlow. One such example is when one needs to modify an ONNX model in order to circumvent a limitation of the Glow compiler (for example when an attribute is not relevant but causes compiler issues, etc).
In order modify a protocol buffer you need:
- the proto compiler application protoc which can be found here.
- the proto definition:
When you need to manually edit an ONNX model you will need to:
- decode the model proto to text format.
- edit the model in text format.
- encode the model back to proto format.
After you install the protoc application and you download the proto definitions, you can perform the actions defined below:
-
Decode an ONNX model from proto
model.onnx
to text filemodel.onnxtxt
:protoc onnx.proto --decode=onnx.ModelProto < model.onnx > model.onnxtxt
-
Encode an ONNX model from text file
model.onnxtxt
to protomodel.onnx
:protoc onnx.proto --encode=onnx.ModelProto < model.onnxtxt > model.onnx
-
Decode an ONNX tensor from proto
tensor.pb
to text filetensor.pbtxt
:protoc onnx.proto --decode=onnx.TensorProto < tensor.pb > tensor.pbtxt
-
Encode an ONNX tensor from text file
tensor.pbtxt
to prototensor.pb
:protoc onnx.proto --encode=onnx.TensorProto < tensor.pbtxt > tensor.pb
-
Decode/Encode TensorFlow saved model to/from text file:
protoc saved_model.proto --decode tensorflow.SavedModel < model.pb > model.pbtxt protoc saved_model.proto --encode tensorflow.SavedModel < model.pbtxt < model.pb
-
Decode/Encode TensorFlow frozen graph to/from text file:
protoc saved_model.proto --decode tensorflow.GraphDef < model.pb > model.pbtxt protoc saved_model.proto --encode tensorflow.GraphDef < model.pbtxt > model.pb