
Container Image

We will use the container image provided by vLLM, vllm/vllm-openai, available on Docker Hub.

Deploy vLLM OpenAI API

Let’s get started. Follow these steps to create the app.
1

Compute

5090

32 GiB GPU • 4 CPU • 32 GiB RAM
Pick the above option for this example. You may need more RAM if your model is larger and needs to be offloaded to CPU RAM.
2

Container

App Name
string
required
Set to vLLM Qwen
Container Image
string
required
Set to vllm/vllm-openai:latest
Registry Auth
string
Leave it empty
Entrypoint
string
Set to python3 -m vllm.entrypoints.openai.api_server --port 80 --model /8scale_hf_model --gpu_memory_utilization 0.8
/8scale_hf_model is the model cache directory provided by 8scale (see the sketch after this list for an optional local sanity check of these settings).
Container Disk
number
default:"1"
required
Leave it as 1 GiB
Cache Volume
number
default:"1"
Set to 0
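
Before deploying, you can optionally sanity-check that the model fits within the chosen GPU memory budget. Here is a minimal sketch using vLLM's offline Python API, assuming you have vllm installed on a machine with a compatible GPU; the model name matches the one configured in the Environment step below, while locally we load it by its Hugging Face ID rather than the /8scale_hf_model cache path:

```python
from vllm import LLM, SamplingParams

# Load the model with the same memory budget as the container
# entrypoint (--gpu_memory_utilization 0.8).
llm = LLM(model="Qwen/Qwen2.5-3B", gpu_memory_utilization=0.8)

# Run a single prompt to confirm the model loads and generates.
outputs = llm.generate(
    ["list top countries by GDP"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```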
3

Scaling

Min Replica(s)
number
default:"0"
required
Leave it as 0
Max Replica(s)
number
default:"1"
required
Leave it as 1
Requests Per Replica
number
default:"1"
required
Set to 25
Scale Down Delay
number
required
Set to 20 seconds
4

Environment

Set the following environment variables:

8SCALE_HF_MODEL=Qwen/Qwen2.5-3B
VLLM_DISABLE_COMPILE_CACHE=1

The 8SCALE_HF_MODEL environment variable enables 8scale model caching and mounts the model at the path /8scale_hf_model. 8scale downloads the model automatically; the download is part of the CACHING status for a replica. For private models, or if you want to supply your own Hugging Face token, set the HF_TOKEN environment variable.

Click Deploy App, then go to the Apps page and click the vLLM Qwen app to open its overview page.

Test vLLM API

Click the Playground tab. Type the following body into POST /v1/completions and hit Send.
{"prompt": "list top countries by GDP"}
You will see the result with a 200 status.
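
You can also call the endpoint from code instead of the playground. Below is a minimal sketch using Python's requests library; the base URL is a placeholder for your app's endpoint from the overview page:

```python
import requests

# Placeholder: replace with your app's endpoint from the overview page.
BASE_URL = "https://YOUR-APP-ENDPOINT"

resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        # vLLM serves the model under the --model value by default,
        # so the served model name is the mount path /8scale_hf_model.
        "model": "/8scale_hf_model",
        "prompt": "list top countries by GDP",
        "max_tokens": 64,
    },
    # A generous timeout: the first request may wait for a replica
    # to scale up from 0 and finish caching the model.
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```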
You have successfully deployed the vLLM OpenAI API and tested scaling up a replica to handle a POST /v1/completions request.
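
Because the server implements the OpenAI completions API, you can also point the official openai Python client at it. A short sketch, again using a placeholder base URL; the api_key is a dummy value, since the entrypoint above does not configure vLLM's --api-key flag:

```python
from openai import OpenAI

# Placeholder: replace with your app's endpoint from the overview page.
BASE_URL = "https://YOUR-APP-ENDPOINT"

# No --api-key is set on the server, so any placeholder key works.
client = OpenAI(base_url=f"{BASE_URL}/v1", api_key="EMPTY")

completion = client.completions.create(
    model="/8scale_hf_model",
    prompt="list top countries by GDP",
    max_tokens=64,
)
print(completion.choices[0].text)
```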