DeepSparse v0.12.0
Released by jeanniefinks on 22 Apr (branch release/0.12).
New Features:
Documentation:
- SparseServer.UI: a Streamlit app that deploys the DeepSparse Server to explore the inference performance of BERT on the question answering task.
- DeepSparse Server README: documents `deepsparse.server` capabilities, including single-model and multi-model inferencing (see the request sketch after this list).
- Twitter NLP Inference Examples added.
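For illustration, the hedged sketch below queries a running `deepsparse.server` instance for question answering from Python. The host, port (5543), `/predict` route, and JSON field names are assumptions, not confirmed by these notes; consult the DeepSparse Server README for the exact endpoint in this release.

```python
import requests

# Hypothetical endpoint: assumes deepsparse.server was started for the
# question answering task and is listening on localhost:5543.
url = "http://localhost:5543/predict"

# Assumed request schema for the question answering task.
payload = {
    "question": "What does the DeepSparse Engine accelerate?",
    "context": "The DeepSparse Engine accelerates sparse, quantized models on CPUs.",
}

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json())  # expected to include the extracted answer span
```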
Changes:
Performance:
- Speedup for large batch sizes when using sync mode on AMD EPYC processors.
- AVX2 improvements:
  - Up to 40% speedup out of the box for dense quantized models.
  - Up to 20% speedup for pruned quantized BERT, ResNet-50, and MobileNet.
- Speedup from sparsity realized for ConvInteger operators.
- Model compilation time decreased on systems with many cores.
- Multi-stream Scheduler: certain computations that were executed during runtime are now precomputed.
- Hugging Face Transformers integration updated to latest state from upstream main branch.
Documentation:
- DeepSparse README: references to `deepsparse.server`, `deepsparse.benchmark`, and Transformer pipelines.
- DeepSparse Benchmark README: highlights of the `deepsparse.benchmark` CLI command.
- Transformers 🤗 Inference Pipelines: examples included on how to run inference via Python for several NLP tasks (see the pipeline sketch after this list).
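As a minimal sketch of the Python pipeline usage, assuming a `deepsparse.transformers.pipeline` helper that takes `task` and `model_path` arguments (the model path below is a placeholder, not a released artifact; see the Transformers integration README for the exact API in this release):

```python
from deepsparse.transformers import pipeline

# Assumption: pipeline() accepts a task name plus a local deployment directory
# or SparseZoo stub; replace model_path with a real sparse BERT deployment.
qa = pipeline(
    task="question-answering",
    model_path="/path/to/deployment",
)

result = qa(
    question="Which tasks do the inference pipelines cover?",
    context="The pipelines cover several NLP tasks, including question answering.",
)
print(result)  # e.g. a dict with the answer text and its score
```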
Resolved Issues:
- When running quantized BERT with a sequence length not divisible by 4, the DeepSparse Engine no longer disables optimizations, which previously resulted in very poor performance.
- Users executing `arch.bin` now receive a correct architecture profile of their system.
Known Issues:
- When running the DeepSparse Engine on a system with a nonuniform system topology, for example, an AMD EPYC processor where some cores per core-complex (CCX) have been disabled, model compilation will never terminate. A workaround is to set the environment variable `NM_SERIAL_UNIT_GENERATION=1`, as sketched below.
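A hedged sketch of applying the workaround from Python, assuming the variable is read when the engine compiles the model and therefore must be set beforehand (the model path is a placeholder):

```python
import os

# Set the workaround before compiling on an affected (nonuniform-topology) system.
os.environ["NM_SERIAL_UNIT_GENERATION"] = "1"

from deepsparse import compile_model  # assumed engine entry point in this release

# Placeholder ONNX path; compilation should now terminate on affected systems.
engine = compile_model("/path/to/model.onnx", batch_size=1)
print(engine)
```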