
Commit 39baaf5

docker : add server-first container images (#5157)
* feat: add Dockerfiles for each platform that use ./server instead of ./main
* feat: update .github/workflows/docker.yml to build server-first docker containers
* doc: add information about running the server with Docker to README.md
* doc: add information about running with docker to the server README
* doc: update n-gpu-layers to show correct GPU usage
* fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA
1 parent 6db2b41 commit 39baaf5

File tree

7 files changed: +147 −1 lines changed


.devops/server-cuda.Dockerfile

Lines changed: 32 additions & 0 deletions
```dockerfile
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=11.7.1
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
# Target the CUDA runtime image
ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

FROM ${BASE_CUDA_DEV_CONTAINER} as build

# Unless otherwise specified, we make a fat build.
ARG CUDA_DOCKER_ARCH=all

RUN apt-get update && \
    apt-get install -y build-essential git

WORKDIR /app

COPY . .

# Set nvcc architecture
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable cuBLAS
ENV LLAMA_CUBLAS=1

RUN make

FROM ${BASE_CUDA_RUN_CONTAINER} as runtime

COPY --from=build /app/server /server

ENTRYPOINT [ "/server" ]
```

.devops/server-intel.Dockerfile

Lines changed: 25 additions & 0 deletions
```dockerfile
ARG ONEAPI_VERSION=2024.0.1-devel-ubuntu22.04
ARG UBUNTU_VERSION=22.04

FROM intel/hpckit:$ONEAPI_VERSION as build

RUN apt-get update && \
    apt-get install -y git

WORKDIR /app

COPY . .

# For some reason, "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp -DLLAMA_NATIVE=ON" gives worse performance
RUN mkdir build && \
    cd build && \
    cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx && \
    cmake --build . --config Release --target main server

FROM ubuntu:$UBUNTU_VERSION as runtime

COPY --from=build /app/build/bin/server /server

ENV LC_ALL=C.utf8

ENTRYPOINT [ "/server" ]
```

.devops/server-rocm.Dockerfile

Lines changed: 45 additions & 0 deletions
```dockerfile
ARG UBUNTU_VERSION=22.04

# This needs to generally match the container host's environment.
ARG ROCM_VERSION=5.6

# Target the ROCm build image
ARG BASE_ROCM_DEV_CONTAINER=rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION}-complete

FROM ${BASE_ROCM_DEV_CONTAINER} as build

# Unless otherwise specified, we make a fat build.
# List from https://github.com/ggerganov/llama.cpp/pull/1087#issuecomment-1682807878
# This is mostly tied to rocBLAS supported archs.
ARG ROCM_DOCKER_ARCH=\
    gfx803 \
    gfx900 \
    gfx906 \
    gfx908 \
    gfx90a \
    gfx1010 \
    gfx1030 \
    gfx1100 \
    gfx1101 \
    gfx1102

COPY requirements.txt requirements.txt
COPY requirements requirements

RUN pip install --upgrade pip setuptools wheel \
    && pip install -r requirements.txt

WORKDIR /app

COPY . .

# Set ROCm GPU architecture targets
ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
# Enable ROCm
ENV LLAMA_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
ENV CXX=/opt/rocm/llvm/bin/clang++

RUN make

ENTRYPOINT [ "/app/server" ]
```

.devops/server.Dockerfile

Lines changed: 20 additions & 0 deletions
```dockerfile
ARG UBUNTU_VERSION=22.04

FROM ubuntu:$UBUNTU_VERSION as build

RUN apt-get update && \
    apt-get install -y build-essential git

WORKDIR /app

COPY . .

RUN make

FROM ubuntu:$UBUNTU_VERSION as runtime

COPY --from=build /app/server /server

ENV LC_ALL=C.utf8

ENTRYPOINT [ "/server" ]
```

.github/workflows/docker.yml

Lines changed: 4 additions & 0 deletions
```diff
@@ -28,14 +28,18 @@ jobs:
         config:
           - { tag: "light", dockerfile: ".devops/main.Dockerfile", platforms: "linux/amd64,linux/arm64" }
           - { tag: "full", dockerfile: ".devops/full.Dockerfile", platforms: "linux/amd64,linux/arm64" }
+          - { tag: "server", dockerfile: ".devops/server.Dockerfile", platforms: "linux/amd64,linux/arm64" }
           # NOTE(canardletter): The CUDA builds on arm64 are very slow, so I
           #                     have disabled them for now until the reason why
           #                     is understood.
           - { tag: "light-cuda", dockerfile: ".devops/main-cuda.Dockerfile", platforms: "linux/amd64" }
           - { tag: "full-cuda", dockerfile: ".devops/full-cuda.Dockerfile", platforms: "linux/amd64" }
+          - { tag: "server-cuda", dockerfile: ".devops/server-cuda.Dockerfile", platforms: "linux/amd64" }
           - { tag: "light-rocm", dockerfile: ".devops/main-rocm.Dockerfile", platforms: "linux/amd64,linux/arm64" }
           - { tag: "full-rocm", dockerfile: ".devops/full-rocm.Dockerfile", platforms: "linux/amd64,linux/arm64" }
+          - { tag: "server-rocm", dockerfile: ".devops/server-rocm.Dockerfile", platforms: "linux/amd64,linux/arm64" }
           - { tag: "light-intel", dockerfile: ".devops/main-intel.Dockerfile", platforms: "linux/amd64" }
+          - { tag: "server-intel", dockerfile: ".devops/server-intel.Dockerfile", platforms: "linux/amd64" }
     steps:
       - name: Check out the repo
         uses: actions/checkout@v3
```
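Each matrix entry publishes an image under the corresponding tag, so once CI has run, the new variants should be pullable directly. A hedged example, assuming the GHCR naming already used for the `light` and `full` images:

```bash
docker pull ghcr.io/ggerganov/llama.cpp:server
docker pull ghcr.io/ggerganov/llama.cpp:server-cuda
docker pull ghcr.io/ggerganov/llama.cpp:server-rocm
docker pull ghcr.io/ggerganov/llama.cpp:server-intel
```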

README.md

Lines changed: 13 additions & 1 deletion
````diff
@@ -931,17 +931,20 @@ Place your desired model into the `~/llama.cpp/models/` directory and execute th
 * Create a folder to store big models & intermediate files (ex. /llama/models)
 
 #### Images
-We have two Docker images available for this project:
+We have three Docker images available for this project:
 
 1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
 2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`)
+3. `ghcr.io/ggerganov/llama.cpp:server`: This image only includes the server executable file. (platforms: `linux/amd64`, `linux/arm64`)
 
 Additionally, there are the following images, similar to the above:
 
 - `ghcr.io/ggerganov/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA support. (platforms: `linux/amd64`)
 - `ghcr.io/ggerganov/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA support. (platforms: `linux/amd64`)
+- `ghcr.io/ggerganov/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA support. (platforms: `linux/amd64`)
 - `ghcr.io/ggerganov/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
 - `ghcr.io/ggerganov/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
+- `ghcr.io/ggerganov/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
 
 The GPU enabled images are not currently tested by CI beyond being built. They are not built with any variation from the ones in the Dockerfiles defined in [.devops/](.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](.github/workflows/docker.yml). If you need different settings (for example, a different CUDA or ROCm library), you'll need to build the images locally for now.
 
@@ -967,6 +970,12 @@ or with a light image:
 docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
 ```
 
+or with a server image:
+
+```bash
+docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggerganov/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512
+```
+
 ### Docker With CUDA
 
 Assuming one has the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) properly installed on Linux, or is using a GPU enabled cloud, `cuBLAS` should be accessible inside the container.
@@ -976,6 +985,7 @@ Assuming one has the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia
 ```bash
 docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
 docker build -t local/llama.cpp:light-cuda -f .devops/main-cuda.Dockerfile .
+docker build -t local/llama.cpp:server-cuda -f .devops/server-cuda.Dockerfile .
 ```
 
 You may want to pass in some different `ARGS`, depending on the CUDA environment supported by your container host, as well as the GPU architecture.
@@ -989,6 +999,7 @@ The resulting images, are essentially the same as the non-CUDA images:
 
 1. `local/llama.cpp:full-cuda`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
 2. `local/llama.cpp:light-cuda`: This image only includes the main executable file.
+3. `local/llama.cpp:server-cuda`: This image only includes the server executable file.
 
 #### Usage
 
@@ -997,6 +1008,7 @@ After building locally, Usage is similar to the non-CUDA examples, but you'll ne
 ```bash
 docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
 docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
+docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
 ```
 
 ### Contributing
````
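The `ARGS` mentioned in the README diff correspond to the `ARG`s at the top of the new Dockerfile, so a hedged local build that pins them might look like the sketch below; the CUDA version and architecture values are illustrative, and you should check the Makefile for how `CUDA_DOCKER_ARCH` is consumed:

```bash
# Build server-cuda against a specific CUDA toolkit and a single GPU architecture
# (sm_86 covers RTX 30xx-class cards) instead of the default fat "all" build.
docker build -t local/llama.cpp:server-cuda \
  --build-arg CUDA_VERSION=12.3.1 \
  --build-arg CUDA_DOCKER_ARCH=sm_86 \
  -f .devops/server-cuda.Dockerfile .
```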

examples/server/README.md

Lines changed: 8 additions & 0 deletions
````diff
@@ -66,6 +66,14 @@ server.exe -m models\7B\ggml-model.gguf -c 2048
 The above command will start a server that by default listens on `127.0.0.1:8080`.
 You can consume the endpoints with Postman or NodeJS with the axios library. You can visit the web front end at the same url.
 
+### Docker:
+```bash
+docker run -p 8080:8080 -v /path/to/models:/models ggerganov/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080
+
+# or, with CUDA:
+docker run -p 8080:8080 -v /path/to/models:/models --gpus all ggerganov/llama.cpp:server-cuda -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080 --n-gpu-layers 99
+```
+
 ## Testing with CURL
 
 Using [curl](https://curl.se/). On Windows `curl.exe` should be available in the base OS.
````
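Once either container is up, a quick hedged smoke test against the server's documented `/completion` endpoint; the host port and prompt are arbitrary:

```bash
# Ask the containerized server for a short completion.
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 64}'
```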

0 commit comments
