PPL LLM Serving

Overview

ppl.llm.serving is part of the PPL.LLM system.

[Figure: PPL.LLM system overview]

Users who are new to this project are encouraged to read the system Overview first.

ppl.llm.serving is a serving framework built on ppl.nn for various Large Language Models (LLMs). This repository contains a gRPC-based server and inference support for LLaMA.

Prerequisites

  • Linux running on x86_64 or arm64 CPUs
  • GCC >= 9.4.0
  • CMake >= 3.18
  • Git >= 2.7.0
  • CUDA Toolkit >= 11.4, 11.6 recommended (for CUDA)
  • Rust & cargo >= 1.8.0 (for the Hugging Face tokenizer)
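
The minimum versions above can be checked before building. The following is a minimal sketch rather than part of the repository: the helper names version_ge, first_version and report are hypothetical, and it assumes GNU sort (for -V) and GNU grep, which typical Linux distributions provide. It only prints a status line per tool and never aborts.

```shell
#!/bin/sh
# Succeeds when version $1 is greater than or equal to version $2.
# Uses GNU `sort -V` (version sort) to find the smaller of the two.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Reads text on stdin and prints the first dotted version number found.
first_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Prints one status line for a tool.
# $1: display name, $2: detected version (empty if not found), $3: minimum.
report() {
    if [ -z "$2" ]; then
        echo "$1: not found (need >= $3)"
    elif version_ge "$2" "$3"; then
        echo "$1: $2 (ok, need >= $3)"
    else
        echo "$1: $2 (too old, need >= $3)"
    fi
}

# Minimum versions are taken from the Prerequisites list above.
report "GCC"   "$(gcc -dumpfullversion 2>/dev/null)" "9.4.0"
report "CMake" "$(cmake --version 2>/dev/null | first_version)" "3.18"
report "Git"   "$(git --version 2>/dev/null | first_version)" "2.7.0"
report "CUDA"  "$(nvcc --version 2>/dev/null | grep -i release | first_version)" "11.4"
report "cargo" "$(cargo --version 2>/dev/null | first_version)" "1.8.0"
```

A tool reported as "not found" may simply be missing from PATH, for example nvcc when the CUDA Toolkit's bin directory has not been added to it.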

PPL Server Quick Start

Here is a brief tutorial; refer to the LLaMA Guide for more details.

  • Installing Prerequisites (on Debian or Ubuntu, for example)

    apt-get install build-essential cmake git
  • Cloning Source Code

    git clone https://github.com/openppl-public/ppl.llm.serving.git
  • Building from Source

    ./build.sh  -DPPLNN_USE_LLM_CUDA=ON  -DPPLNN_CUDA_ENABLE_NCCL=ON -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'" -DPPL_LLM_ENABLE_GRPC_SERVING=ON

    NCCL is required if multiple GPU devices are used.

    We support a Sync Decode feature (mainly for offline_inference), in which model forward and decode run in the same thread. To enable this feature, compile with the macro -DPPL_LLM_SERVING_SYNC_DECODE=ON.

  • Exporting Models

    Refer to ppl.pmx for details.

  • Running Server

    ./ppl_llm_server \
        --model-dir /data/model \
        --model-param-path /data/model/params.json \
        --tokenizer-path /data/tokenizer.model \
        --tensor-parallel-size 1 \
        --top-p 0.0 \
        --top-k 1 \
        --max-tokens-scale 0.94 \
        --max-input-tokens-per-request 4096 \
        --max-output-tokens-per-request 4096 \
        --max-total-tokens-per-request 8192 \
        --max-running-batch 1024 \
        --max-tokens-per-step 8192 \
        --host 127.0.0.1 \
        --port 23333 

    Set these arguments to the correct values for your environment before running the server.

    • model-dir: path to the models exported by ppl.pmx.
    • model-param-path: path to the model parameters file, $model_dir/params.json.
    • tokenizer-path: path to the tokenizer file for sentencepiece.
  • Running Client: send requests through gRPC to query the model

    ./ppl-build/client_sample 127.0.0.1:23333

    See tools/client_sample.cc for more details.

  • Benchmarking

    ./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=inf

    See tools/client_qps_measure.cc for more details. --request_rate is the number of requests per second; the value inf sends all client requests with no interval between them.

  • Running Inference Offline

    ./offline_inference \
        --model-dir /data/model \
        --model-param-path /data/model/params.json \
        --tokenizer-path /data/tokenizer.model \
        --tensor-parallel-size 1 \
        --top-p 0.0 \
        --top-k 1 \
        --max-tokens-scale 0.94 \
        --max-input-tokens-per-request 4096 \
        --max-output-tokens-per-request 4096 \
        --max-total-tokens-per-request 8192 \
        --max-running-batch 1024 \
        --max-tokens-per-step 8192 \
        --host 127.0.0.1 \
        --port 23333 

    See tools/offline_inference.cc for more details.
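
Because ppl_llm_server and offline_inference take the same model and tokenizer arguments, a small wrapper can verify the paths before launching either one. The following is a sketch under stated assumptions, not a script shipped with the repository: check_paths and run_checked are hypothetical helpers, and the only facts taken from this guide are the three flag names and that the parameter file is $model_dir/params.json.

```shell
#!/bin/sh
# Checks that the model directory, its params.json and the tokenizer exist.
# $1: model directory, $2: tokenizer file.
# Prints each missing path to stderr and returns non-zero if any is missing.
check_paths() {
    status=0
    if [ ! -d "$1" ]; then
        echo "model dir not found: $1" >&2
        status=1
    fi
    if [ ! -f "$1/params.json" ]; then
        echo "params.json not found in: $1" >&2
        status=1
    fi
    if [ ! -f "$2" ]; then
        echo "tokenizer file not found: $2" >&2
        status=1
    fi
    return "$status"
}

# Runs a binary with the three path arguments filled in, after checking them.
# Usage: run_checked BINARY MODEL_DIR TOKENIZER [extra arguments...]
run_checked() {
    bin="$1"
    model_dir="$2"
    tokenizer="$3"
    shift 3
    check_paths "$model_dir" "$tokenizer" || return 1
    "$bin" --model-dir "$model_dir" \
           --model-param-path "$model_dir/params.json" \
           --tokenizer-path "$tokenizer" \
           "$@"
}

# Example invocation (commented out so the sketch has no side effects):
# run_checked ./ppl_llm_server /data/model /data/tokenizer.model \
#     --tensor-parallel-size 1 --host 127.0.0.1 --port 23333
```

Passing echo as the binary prints the assembled argument list without starting anything, which is a convenient way to review the command line first.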

License

This project is distributed under the Apache License, Version 2.0.
