Deploy large language models with vLLM
Parameters
- --worker-use-ray: run workers in separate Ray processes to avoid the probe failure
- --gpu-memory-utilization=0.85: reserve more GPU memory
- --max-num-batched-tokens: increase for longer sequences
- --tensor-parallel-size: enable tensor parallelism across multiple GPUs
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-13b-hf --worker-use-ray --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8080 --tensor-parallel-size 2
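Once the server is up, you can sanity-check it by listing the served models through the OpenAI-compatible /v1/models route (the port matches the command above):
curl http://localhost:8080/v1/models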
To disable the Uvicorn access log, set the following environment variable:
UVICORN_ACCESS_LOG=False
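For example, it can be set inline when launching the server (same flags as above):
UVICORN_ACCESS_LOG=False python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-13b-hf --host 0.0.0.0 --port 8080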
Dockerfile
- check vllm-env
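If you want to build your own image instead, a minimal Dockerfile sketch might look like the following (the base image tag and installing vllm from PyPI are assumptions; adjust them to your CUDA driver and environment):
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip3 install vllm
EXPOSE 8080
# serve the same model and flags as the launch command above
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "meta-llama/Llama-2-13b-hf", "--host", "0.0.0.0", "--port", "8080"]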
Warmup
Download the model from HuggingFace:
# warmup.py: pre-download the weights into the local Hugging Face cache
from huggingface_hub import snapshot_download

snapshot_download("meta-llama/Llama-2-13b-hf")
You might need to set up your token like this:
HUGGING_FACE_HUB_TOKEN=xxx python warmup.py
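Alternatively, you can log in once with the Hugging Face CLI so the stored token is picked up automatically:
huggingface-cli login
python warmup.py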
Test with OpenAI client
from random import randint
import concurrent.futures

import openai  # note: this example uses the legacy openai<1.0 client API

# point the client at the vLLM OpenAI-compatible server started above
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8080/v1"

MODEL = "meta-llama/Llama-2-13b-hf"
def query(max_tokens=20):
    # stream a single chat completion and print tokens as they arrive
    chat = openai.ChatCompletion.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Who are you?",
        }],
        stream=True,
        max_tokens=max_tokens,
    )
    for result in chat:
        delta = result.choices[0].delta
        print(delta.get('content', ''), end='', flush=True)
    print()


def batch_test():
    # fire 20 concurrent requests of random length to exercise batching
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(query, max_tokens=randint(20, 200))
            for _ in range(20)
        ]
        for future in concurrent.futures.as_completed(futures):
            future.result()


if __name__ == "__main__":
    batch_test()
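If you prefer to test without the Python client, the same endpoint can be hit with curl (again assuming the server from the launch command above, on port 8080):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-13b-hf", "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 20}'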
References
- https://modelz.ai/blog/vllm