Deploy large language models with vLLM
Parameters
- --worker-use-ray: run workers in separate Ray processes to avoid the probe failure
- --gpu-memory-utilization=0.85: reserve more GPU memory
- --max-num-batched-tokens: increase for longer sequences
- --tensor-parallel-size: enable tensor parallelism across multiple GPUs
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-13b-hf --worker-use-ray --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8080 --tensor-parallel-size 2
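Once the server is up, you can sanity-check it by listing the served models through the OpenAI-compatible /v1/models route (the port matches the command above):
curl http://localhost:8080/v1/models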
To disable the Uvicorn access log, set the following environment variable:
UVICORN_ACCESS_LOG=False
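For example, it can be set inline when launching the server (same flags as above):
UVICORN_ACCESS_LOG=False python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-13b-hf --host 0.0.0.0 --port 8080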
Dockerfile
- check vllm-env
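If you want to build your own image instead, a minimal Dockerfile sketch might look like the following (the base image tag and installing vllm from PyPI are assumptions; adjust them to your CUDA driver and environment):
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip3 install vllm
EXPOSE 8080
# serve the same model and flags as the launch command above
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "meta-llama/Llama-2-13b-hf", "--host", "0.0.0.0", "--port", "8080"]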
Warmup
Download the model from HuggingFace:
# warmup.py: pre-download the weights into the local Hugging Face cache
from huggingface_hub import snapshot_download

snapshot_download("meta-llama/Llama-2-13b-hf")
You might need to set up your token like this:
HUGGING_FACE_HUB_TOKEN=xxx python warmup.py
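Alternatively, you can log in once with the Hugging Face CLI so the stored token is picked up automatically:
huggingface-cli login
python warmup.py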
Test with OpenAI client
from random import randint
import concurrent.futures

import openai  # note: this example uses the legacy openai<1.0 client API

# point the client at the vLLM OpenAI-compatible server started above
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8080/v1"

MODEL = "meta-llama/Llama-2-13b-hf"
def query(max_tokens=20):
    # stream a single chat completion and print tokens as they arrive
    chat = openai.ChatCompletion.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": "Who are you?",
        }],
        stream=True,
        max_tokens=max_tokens,
    )
    for result in chat:
        delta = result.choices[0].delta
        print(delta.get('content', ''), end='', flush=True)
    print()


def batch_test():
    # fire 20 concurrent requests of random length to exercise batching
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(query, max_tokens=randint(20, 200))
            for _ in range(20)
        ]
        for future in concurrent.futures.as_completed(futures):
            future.result()


if __name__ == "__main__":
    batch_test()
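If you prefer to test without the Python client, the same endpoint can be hit with curl (again assuming the server from the launch command above, on port 8080):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-13b-hf", "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 20}'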
References
- https://modelz.ai/blog/vllm