- Server
  - Sync: synchronous handling in FastAPI
  - Async: asynchronous handling in FastAPI
  - Rep: number of replicas for fastapi and triton-inference-server
  - Ensemble: pre-/post-processing and visualization handled by an ensemble inside triton-inference-server (fastapi runs asynchronously)
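As a minimal illustration of why the Async setup behaves differently under simultaneous load, here is a stdlib-only sketch (no FastAPI involved). `blocking_inference` is a hypothetical stand-in for the call to triton-inference-server; offloading it with `asyncio.to_thread` is what keeps an `async def` handler from serializing overlapping requests.

```python
import asyncio
import time

# Stand-in for the blocking model call an endpoint makes, e.g. a
# request to triton-inference-server that takes ~0.1 s.
def blocking_inference() -> str:
    time.sleep(0.1)
    return "ok"

async def handle_async() -> str:
    # What an `async def` handler should do with blocking work:
    # push it onto a worker thread so the event loop stays free
    # to accept the other concurrent requests.
    return await asyncio.to_thread(blocking_inference)

async def serve_concurrently(n: int) -> list:
    # n "requests" arriving at once, handled concurrently.
    return await asyncio.gather(*(handle_async() for _ in range(n)))

start = time.perf_counter()
results = asyncio.run(serve_concurrently(10))
elapsed = time.perf_counter() - start
# The 10 overlapping calls finish well under the 1 s that
# 10 sequential 0.1 s calls would need.
print(f"{elapsed:.2f} s")
```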
- Client (calls FastAPI 100 times, repeated over 10 runs)
  - Serial: sequential calls in a for loop
  - Concurrency: simultaneous calls via ThreadPoolExecutor
  - Random: calls via ThreadPoolExecutor, each preceded by a random delay of 0-20 s
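The three client patterns might look like the following sketch. `send_request` is a stub standing in for the real `requests.post(...).json()` call, and the function names here are assumptions, not the project's actual client code.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real call, which would be something
# like requests.post(url, json=payload).json().
def send_request(i: int) -> int:
    return i

def serial(n: int = 100) -> list:
    # Serial: one call at a time in a for loop.
    return [send_request(i) for i in range(n)]

def concurrent(n: int = 100) -> list:
    # Concurrency: all calls fired at once from a thread pool.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(send_request, range(n)))

def random_delayed(n: int = 100, max_delay: float = 20.0) -> list:
    # Random: same thread pool, but each call first sleeps a
    # random 0 to max_delay seconds.
    def delayed(i: int) -> int:
        time.sleep(random.uniform(0.0, max_delay))
        return send_request(i)
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(delayed, range(n)))
```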
[Sec]
Server Arch. | Mean(Serial) | End(Serial) | Mean(Concurrency) | End(Concurrency) | Mean(Random) | End(Random) |
---|---|---|---|---|---|---|
Sync&Rep=1 | 0.69 | 78.01 | 41.93 | 129.61 | 40.05 | 128.63 |
Sync&Rep=5 | 0.60 | 68.99 | 25.57 | 61.38 | 26.88 | 81.69 |
Async&Rep=1 | 0.68 | 77.02 | 0.80 | 82.22 | 0.78 | 80.34 |
Async&Rep=1-5 | 0.61 | 69.07 | 0.60 | 62.11 | - | - |
Async&Rep=5 | 0.62 | 69.77 | 1.84 | 39.77 | 1.91 | 41.84 |
Ensemble&Rep=1 | 0.70 | 78.02 | 0.77 | 78.50 | - | - |
Ensemble&Rep=5 | 0.66 | 74.52 | 1.90 | 42.03 | - | - |
[Sec]
Server Arch. | Mean(Serial) | End(Serial) | Mean(Concurrency) | End(Concurrency) | Mean(Random) | End(Random) |
---|---|---|---|---|---|---|
Sync | 0.647 | 73.499 | 33.752 | 95.496 | 33.460 | 105.160 |
Async | 0.652 | 73.395 | 1.320 | 60.991 | 1.345 | 61.094 |
Ensemble | 0.680 | 76.270 | 1.332 | 60.269 | - | - |
With serial calls there is no difference between the synchronous and asynchronous designs.
Under simultaneous calls, however, the asynchronous design responds about 36.51% faster than the synchronous one, and about 41.90% faster under random calls.
The ensemble design showed no clear advantage, but this may be a limitation of this experiment (resources, data scale, ...).
A FastAPI app defined with async def raises an error under the Random condition:
Traceback (most recent call last):
File "anaconda3\lib\site-packages\requests\models.py", line 972, in json
return complexjson.loads(self.text, **kwargs)
File "anaconda3\lib\site-packages\simplejson\__init__.py", line 514, in loads
return _default_decoder.decode(s)
File "anaconda3\lib\site-packages\simplejson\decoder.py", line 386, in decode
obj, end = self.raw_decode(s)
File "anaconda3\lib\site-packages\simplejson\decoder.py", line 416, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Downloads\curl.py", line 70, in <module>
main(i)
File "Downloads\curl.py", line 53, in main
responses = list(
File "anaconda3\lib\concurrent\futures\_base.py", line 609, in result_iterator
yield fs.pop().result()
File "anaconda3\lib\concurrent\futures\_base.py", line 439, in result
return self.__get_result()
File "anaconda3\lib\concurrent\futures\_base.py", line 391, in __get_result
raise self._exception
File "anaconda3\lib\concurrent\futures\thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "Downloads\curl.py", line 24, in send_request
res = response.json()
File "anaconda3\lib\site-packages\requests\models.py", line 976, in json
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This error occurs under the Random condition, yet oddly it never appears under the Concurrency condition.
I checked whether any pod had left a log for it, but found nothing.
It presumably arises while calling the triton-inference-server service asynchronously.
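One way to narrow this down (a sketch, not the project's actual client code): have the client check the status code and keep the raw body before parsing JSON, so the failing response's content becomes visible instead of only the `Expecting value` message.

```python
import json

# Defensive sketch of the client's response handling. The function
# name and shape are assumptions; the point is to surface *what*
# the server returned when .json() would otherwise fail.
def parse_response(status_code: int, text: str) -> dict:
    if status_code != 200:
        raise RuntimeError(f"HTTP {status_code}: {text[:200]!r}")
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # An empty body or an HTML error page (e.g. from a proxy)
        # ends up here with its content attached, rather than deep
        # inside requests' traceback.
        raise RuntimeError(f"non-JSON body: {text[:200]!r}") from e
```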
ChatGPT's answer
When an error occurs only for a specific request pattern while using FastAPI, there can be various causes. Here are a few possibilities:
- Resource limits: the server, or a database or other resource it uses, may hit a connection limit or similar cap. When 100 requests are sent at a single instant, each may be processed quickly and the resources released fast enough that nothing goes wrong. But when 100 requests are spread continuously over 20 seconds, resources may not be released in between, and failures can occur.
- Rate limiting: some systems use rate limiting to avoid processing too many requests in a short time, so sending steady requests for 20 seconds can hit that limit.
- Memory issues: as requests accumulate, the server's memory usage can grow. If memory runs short, errors can occur while handling requests.
- Duplicate request handling: when requests are sent continuously over 20 seconds, the result or state of an earlier request can affect the handling of a later one. Duplicated requests or state changes could be the cause.
- Logging or monitoring tools: if logging or monitoring runs while requests are handled, a higher request rate adds extra load.
- Remedies
  - Logging: inspect the logs for the error to pin down the specific cause.
  - Profiling: monitor the server's resource usage (CPU, memory, network) to spot bottlenecks or shortages.
  - Testing: test various request scenarios against the server to isolate the cause.
Through these checks you can identify the cause and take appropriate action.
[Sec]
Server Arch. | Mean(Serial) | End(Serial) | Mean(Concurrency) | End(Concurrency) | Mean(Random) | End(Random) |
---|---|---|---|---|---|---|
Rep=1 | 0.691 | 77.682 | 14.501 | 96.777 | 20.415 | 104.487 |
Rep=5 | 0.629 | 71.094 | 9.767 | 47.726 | 14.391 | 61.767 |
Increasing the replica count demonstrably speeds up the API's responses. (The client can be configured to talk to a single Service, and traffic to that Service is automatically load-balanced across the Service's member pods.)
The improvement is especially large for simultaneous calls.
WORKER TIMEOUT
An error that does not occur with 1 fastapi replica and 5 triton-inference-server replicas appears as below once both fastapi and triton-inference-server run 5 replicas.
It was resolved by adding "--timeout", "120" to the Dockerfile.
[1] [CRITICAL] WORKER TIMEOUT (pid:8)
[1] [WARNING] Worker with pid 8 was terminated due to signal 6
[379] [INFO] Booting worker with pid: 379
[379] [INFO] Started server process [379]
[379] [INFO] Waiting for application startup.
[379] [INFO] Application startup complete.
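For reference, the fix amounts to passing `--timeout 120` on the gunicorn launch line in the Dockerfile. Everything else in this snippet (module path, worker class, worker count, port) is an assumption; only the `--timeout` flag comes from the text. Gunicorn's default worker timeout is 30 s, after which the arbiter logs `WORKER TIMEOUT` and kills the worker, exactly as in the log above.

```dockerfile
# Hypothetical launch command; only "--timeout", "120" is from the fix.
CMD ["gunicorn", "main:app", \
     "--worker-class", "uvicorn.workers.UvicornWorker", \
     "--workers", "5", \
     "--bind", "0.0.0.0:8000", \
     "--timeout", "120"]
```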
HPA
When 100 requests arrive in a single instant, they all land on the lone fastapi pod before any new replica is created, so autoscaling brings no benefit.
For autoscaling to work smoothly, new metrics are needed beyond the Resource-based criteria.
Example: hpa.yaml
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-inference-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference-server
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
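A sketch of a non-Resource alternative is a `Pods`-type metric in the same `metrics:` list. This assumes a custom-metrics adapter (such as prometheus-adapter) is installed and exposing a per-pod request-rate metric; the metric name below is hypothetical.

```yaml
# Scale on per-pod request rate instead of CPU/memory utilization.
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical; must be served by a custom-metrics adapter
      target:
        type: AverageValue
        averageValue: "10"
```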