From 1cb007bb1b7bfffa401dc18f9b2a6ebc51068f17 Mon Sep 17 00:00:00 2001
From: livion <52649461+Livioni@users.noreply.github.com>
Date: Wed, 17 Apr 2024 17:45:46 +0800
Subject: [PATCH] Update README.md

---
 README.md | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/README.md b/README.md
index d3e8048..773deb7 100644
--- a/README.md
+++ b/README.md
@@ -2,14 +2,12 @@
 
 This is the repository that contains source code for the [Arena website](https://livioni.github.io/Arena_pages/).
 
-**Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics**
+**[Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics](https://arxiv.org/abs/2404.09245)**
 
-[Haosong Peng](https://livioni.github.io/) ^1^, [Wei Feng](https://github.com/Couteaux123) ^1^, [Hao Li](https://lifuguan.github.io/)^2^, [Yufeng Zhan](https://ray-zhan.github.io/)^1^, [Qihua Zhou](http://qihuazhou.com/)^3^, and Yuanqing Xia^1^
+[Haosong Peng](https://livioni.github.io/)<sup>1</sup>, [Wei Feng](https://github.com/Couteaux123)<sup>1</sup>, [Hao Li](https://lifuguan.github.io/)<sup>2</sup>, [Yufeng Zhan](https://ray-zhan.github.io/)<sup>1</sup>, [Qihua Zhou](http://qihuazhou.com/)<sup>3</sup>, and Yuanqing Xia<sup>1</sup>
 
 1. Beijing Institute of Technology Beijing,
-
 2. Northwestern Polytechnology University
-
 3. Hong Kong University of Science and Technology Hong Kong
 
 
@@ -18,7 +16,7 @@ This is the repository that contains source code for the [Arena website](https:/
 
 ## Abstract
 
-The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architecture (e.g., CNN, RNN, etc.), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have shown great performance in adverse environments due to their amazing generalization capability. However, they require a large amount of computation power, which limits their applications in real-time intelligent video analytics. In this paper, we find visual foundation models like Vision Transformer (ViT) also have a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We leverage the capability of ViT that can be accelerated through token pruning by only offloading and feeding Patches-of-Interest (PoIs) to the downstream models. Additionally, we employ probability-based patch sampling, which provides a simple but efficient mechanism for determining PoIs where the probable locations of objects are in subsequent frames. Through extensive evaluations on public datasets, our findings reveal that Arena can boost inference speeds by up to $1.58\times$ and $1.82\times$ on average while consuming only 54% and 34% of the bandwidth, respectively, all with high inference accuracy.
+The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architecture (e.g., CNN, RNN, etc.), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have shown great performance in adverse environments due to their amazing generalization capability. However, they require a large amount of computation power, which limits their applications in real-time intelligent video analytics. In this paper, we find visual foundation models like Vision Transformer (ViT) also have a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We leverage the capability of ViT that can be accelerated through token pruning by only offloading and feeding Patches-of-Interest (PoIs) to the downstream models. Additionally, we employ probability-based patch sampling, which provides a simple but efficient mechanism for determining PoIs where the probable locations of objects are in subsequent frames. Through extensive evaluations on public datasets, our findings reveal that Arena can boost inference speeds by up to 1.58× and 1.82× on average while consuming only 54% and 34% of the bandwidth, respectively, all with high inference accuracy.
 
 ![overview](README.assets/overview.png)
 
@@ -51,5 +49,4 @@ The advent of edge computing has made real-time intelligent video analytics feas
 ```
 
 # Website License
-
-`<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" />``</a><br />`This work is licensed under a `<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">`Creative Commons Attribution-ShareAlike 4.0 International License`</a>`.
+<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.