Compare commits

9 Commits

Aleksander Grygier
253ba110bc webui: Move static build output from repo code to HF Bucket (#22937)
* ci: add workflow to publish webui to Hugging Face bucket

* ci: add webui release job to release workflow

* ci: test webui release job

* chore: Return to default minification strategy for build output files

* ci: extract webui build into separate workflow and job

* chore: Ignore webui static output + clean up references

* chore: Delete legacy webui static output

* chore: Ignore webui build static output

* fix: Workflow

* fix: Versioning naming

* chore: Update package name

* test: Test CI fix

* refactor: Naming

* server: implement webui build strategy with HF Bucket support

* chore: Remove test workflow

* chore: Use WebUI build workflow call in other workflows

* server: HF Buckets fallback for WebUI build

* refactor: App name variable

* refactor: Naming

* fix: Retrieve loading.html

* fix: workflow syntax

* fix: Rewrite malformed release.yml

* fix: Req param

* test: Re-add missing Playwright installation for CI tests

* refactor: Logic & security improvements

* refactor: Retrieve publishing jobs and DRY the workflows

* fix: Test workflow syntax

* fix: Upstream Release Tag for test workflow

* chore: Remove test workflow

* ci: Run WebUI jobs on `ubuntu-24.04-arm`

* refactor: Post-CR cleanup

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: CI cleanup

* refactor: Cleanup

* test: Test workflow

* refactor: use LLAMA_BUILD_NUMBER instead of LLAMA_BUILD_TAG for HF Bucket webui downloads

* server: add fallback mechanism for HF Bucket webui downloads from latest directory

* fix: Incorrect argument order in file(SHA256) calls for checksum verification

* refactor: Use cmake script for handling the HF Bucket download on build time

* feat: support local npm build for WebUI assets

* refactor: add `HF_ENABLED` flag to control WebUI build/download provisioning

* refactor: Cleanup

* chore: Remove test workflow

* fix: remove s390x from release workflow

* fix: add webui-build dependency to ubuntu-22-rocm and windows-hip

* Revert "fix: remove s390x from release workflow"

This reverts commit debcfffa9bc1e3112eae41f2d29741b682e4eb19.

* fix: Release workflow file

* fix: Proper release tag used for HF Bucket upload

* fix: Remove duplicate steps in release workflow

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-14 13:21:41 +02:00
Georgi Gerganov
67b2b7f2f2 logs : reduce (#23021)
* logs : reduce

* args : fix envs

* server : fix build

* common : print verbosity level at start

* server : clean-up logs

* server : print prompt processing timings + sampling params

* minor : whitespaces
2026-05-14 13:05:52 +03:00
alex-spacemit
81b0d882ae ggml-cpu: Add IME2 Instruction Support for the SpacemiT Backend (#22863) 2026-05-14 17:39:30 +08:00
Neo Zhang
0f45f1a35c docker : revert stable version of intel compute-runtime (#22968) 2026-05-14 11:30:40 +02:00
Kabir Potdar
42532afff4 unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests (#22110)
* unicode,test: add Qwen3.5 non-backtracking tokenizer handler and regression tests

- Add unicode_regex_split_custom_qwen35() to [src/unicode.cpp](src/unicode.cpp), a non-backtracking handler for Qwen3.5's [\p{L}\p{M}]+ regex (letters + combining marks).
- Register the handler in the custom tokenizer dispatch table to prevent stack overflows on long inputs (fixes #21919).
- Add [models/ggml-vocab-qwen35.gguf](models/ggml-vocab-qwen35.gguf) (test vocab), [models/ggml-vocab-qwen35.gguf.inp](models/ggml-vocab-qwen35.gguf.inp) (test cases), and [models/ggml-vocab-qwen35.gguf.out](models/ggml-vocab-qwen35.gguf.out) (expected output) for regression testing.
- Update [tests/CMakeLists.txt](tests/CMakeLists.txt) to include the new test entry.

This mirrors the Qwen2 fix (commit 0d049d6), but adapts it for Qwen3.5's regex. It ensures robust Unicode tokenization and prevents std::regex stack overflows.

Closes #21919.
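
For illustration, a minimal sketch of the non-backtracking idea (not the actual unicode_regex_split_custom_qwen35() implementation; the toy classifiers stand in for the project's Unicode category tables):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy stand-ins for the real \p{L} / \p{M} category lookups in unicode.cpp.
static bool cpt_is_letter(uint32_t c) { return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }
static bool cpt_is_mark(uint32_t)     { return false; }

// Single left-to-right pass over codepoints: each maximal run of letters/marks
// becomes one span. Greedy matching never backs up and there is no recursion,
// so long inputs cannot overflow the stack the way std::regex backtracking can.
static std::vector<std::pair<size_t, size_t>> split_letter_mark_runs(const std::vector<uint32_t> & cpts) {
    std::vector<std::pair<size_t, size_t>> spans; // [begin, end) offsets
    for (size_t i = 0; i < cpts.size(); ) {
        if (!(cpt_is_letter(cpts[i]) || cpt_is_mark(cpts[i]))) {
            i++; // non-matching codepoint: skip
            continue;
        }
        const size_t begin = i;
        while (i < cpts.size() && (cpt_is_letter(cpts[i]) || cpt_is_mark(cpts[i]))) {
            i++;
        }
        spans.emplace_back(begin, i);
    }
    return spans;
}
```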

* fix: enhance regex handling for Qwen3.5 tokenizer to include accent marks

* cont : remove trailing whitespace

---------

Co-authored-by: Kabir <kabir@example.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
2026-05-14 11:03:40 +02:00
Ruben Ortlam
dbe7901ca6 vulkan: fix matmul integer pipeline selection (#23005)
* vulkan: fix matmul integer pipeline selection

* gate pipeline creation with the right bools
2026-05-14 10:36:54 +02:00
Aleksander Grygier
320a6a44a5 fix: Autoscroll detection (#23026) 2026-05-14 08:09:29 +02:00
Katostrofik
9ed6e19b9d SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations (#21597)
* SYCL: fix multi-GPU system RAM exhaustion by using Level Zero allocations

Replace sycl::malloc_device with zeMemAllocDevice for GPU memory allocation
in the SYCL backend. sycl::malloc_device triggers the xe kernel driver's
DMA-buf/TTM path which mirrors every VRAM allocation 1:1 in system RAM.
zeMemAllocDevice uses the SVM/P2P path with no host staging.

On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), a 15.6 GiB model
consumed 60 GiB of system RAM via sycl::malloc_device, causing OOM crashes.
With zeMemAllocDevice, the same workload uses ~6.7 GiB of system RAM with
no performance regression.

All Level Zero calls include automatic fallback to the original SYCL
allocation path if Level Zero interop is unavailable.
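
A condensed sketch of the allocation strategy described above (not the backend's exact code; the SYCL_CHECK error wrapping and the matching zeMemFree path are omitted):

```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/backend/level_zero.hpp>
#include <level_zero/ze_api.h>

// Allocate device memory via zeMemAllocDevice (SVM/P2P path, no host staging)
// when the queue is backed by Level Zero on a GPU; otherwise fall back to
// sycl::malloc_device (the DMA-buf/TTM path that mirrors VRAM in system RAM).
static void * alloc_device_mem(sycl::queue & q, size_t size) {
    if (q.get_backend() == sycl::backend::ext_oneapi_level_zero && q.get_device().is_gpu()) {
        auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
        auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());

        ze_device_mem_alloc_desc_t desc = {};
        desc.stype = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC;

        void * ptr = nullptr;
        if (zeMemAllocDevice(ze_ctx, &desc, size, /*alignment =*/ 0, ze_dev, &ptr) == ZE_RESULT_SUCCESS) {
            return ptr;
        }
    }
    return sycl::malloc_device(size, q); // original SYCL allocation path
}
```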

* SYCL: address review feedback - remove try/catch, check device types, deduplicate

- Remove try/catch from malloc/free/memcpy helpers, check backend and
  device type upfront instead (ggml_sycl_is_level_zero, ggml_sycl_is_dgpu)
- Move shared helpers (is_level_zero, is_dgpu, free_device) to common.cpp
  and declare in common.hpp to eliminate code duplication
- Use SYCL_CHECK(CHECK_TRY_ERROR()) for fallback sycl::free calls
- Guard dev2dev_memcpy L0 path to dGPU-to-dGPU only, preserving the
  host-staged path for iGPU-to-dGPU transfers
- Add Windows Level Zero SDK path detection (LEVEL_ZERO_V1_SDK_PATH)
  in CMakeLists.txt (co-authored with @arthw)

* SYCL: add build/runtime flags for Level Zero, address review feedback

Implements the architecture suggested by @arthw: compile-time and runtime
flags to cleanly separate Level Zero and SYCL memory API paths.

- Add GGML_SYCL_SUPPORT_LEVEL_ZERO cmake option (default ON). All Level
  Zero code is wrapped in #ifdef so the build works on systems without
  the Level Zero SDK installed (e.g. CPU-only CI servers). Both the
  loader library and headers are checked before enabling.

- Add GGML_SYCL_ENABLE_LEVEL_ZERO runtime env var (default 1). Controls
  whether Level Zero or SYCL memory APIs are used. Only one API style is
  used per session, no mixing. If Level Zero is enabled but the devices
  don't support the Level Zero backend, it auto-disables with a warning.

- Remove Level Zero code from dpct_malloc. It was unused (dpct::device_memory
  is not called anywhere in the backend) and used try/catch for flow control.

- Update SYCL.md with documentation for both new parameters.

Tested on Intel Arc Pro B70 (32GB), single-GPU and dual-GPU, with both
GGML_SYCL_SUPPORT_LEVEL_ZERO=ON and OFF builds. AI-assisted development
(Claude). Code reviewed and tested on my hardware.
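
The two-level gating can be pictured as follows (a sketch following the names in this message, not the exact backend wiring):

```cpp
#include <cstdlib>
#include <cstring>

// Runtime switch; stays 0 when Level Zero support is not compiled in.
static int g_ggml_sycl_enable_level_zero = 0;

static void init_level_zero_flag(bool gpus_support_level_zero) {
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
    // GGML_SYCL_ENABLE_LEVEL_ZERO defaults to 1; one API style per session, no mixing.
    const char * env = std::getenv("GGML_SYCL_ENABLE_LEVEL_ZERO");
    g_ggml_sycl_enable_level_zero = (env == nullptr) || (std::strcmp(env, "0") != 0);
    if (g_ggml_sycl_enable_level_zero && !gpus_support_level_zero) {
        // devices lack Level Zero backing: auto-disable (with a warning in the real code)
        g_ggml_sycl_enable_level_zero = 0;
    }
#else
    (void) gpus_support_level_zero; // Level Zero not compiled in: flag stays 0
#endif
}
```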

* SYCL: unify Level Zero malloc/free call sites, address review feedback

Move ggml_sycl_malloc_device to common.cpp alongside ggml_sycl_free_device.
Both functions are now unconditionally available — Level Zero code is
#ifdef'd inside the functions, not at call sites. All call sites use
uniform SYCL_CHECK(CHECK_TRY_ERROR()) wrapping with no #ifdef blocks.

Addresses arthw's review: wrap all malloc/free in SYCL_CHECK for stack
traces on failure, eliminate duplicated #ifdef/else patterns at 6 call
sites (-29 lines net).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add Level Zero SDK to CI, fix device check and missed alloc paths

Add Level Zero SDK installation to Ubuntu and Windows SYCL CI jobs
so the Level Zero code path is compiled and tested in CI.

Fix two bugs found during extended dual-GPU testing (no
ONEAPI_DEVICE_SELECTOR set):

- The Level Zero backend check was iterating all SYCL devices
  including CPU. The OpenCL CPU device caused Level Zero to be
  disabled for the GPUs, defeating the fix on multi-GPU systems.
  Added is_gpu() filter so only GPU devices are checked.

- sycl_ext_malloc_device/sycl_ext_free (tensor reorder temp buffers)
  were still calling sycl::malloc/sycl::free directly, bypassing the
  Level Zero path. Routed through ggml_sycl_malloc_device/free_device
  for consistency with the other device memory call sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
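
The corrected device check amounts to something like this sketch (not the exact code):

```cpp
#include <sycl/sycl.hpp>

// Decide whether the Level Zero path can be used: consider GPU devices only,
// so an OpenCL CPU device in the enumeration can no longer disable Level Zero
// for the GPUs on multi-GPU systems.
static bool gpus_support_level_zero() {
    bool found_gpu = false;
    for (const auto & dev : sycl::device::get_devices()) {
        if (!dev.is_gpu()) {
            continue; // skip CPU/accelerator devices (the bug fixed here)
        }
        found_gpu = true;
        if (dev.get_backend() != sycl::backend::ext_oneapi_level_zero) {
            return false;
        }
    }
    return found_gpu;
}
```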

* SYCL: address arthw review feedback on Level Zero memory API structure

- Move ggml_sycl_malloc_device to static function in ggml-sycl.cpp;
  only ggml_sycl_free_device (used by common.cpp) stays in common.cpp
- Switch both helpers to use g_ggml_sycl_enable_level_zero global
  instead of per-call queue backend checks
- Remove #ifdef wrapper from global definition; always declare at 0,
  add #else branch in init block so it stays 0 when L0 not compiled in
- Update init loop comment to explain GPU-only device check
- CMakeLists: message(STATUS) before the if block; align option wording

AI-assisted implementation. Reviewed and tested on dual Intel Arc Pro
B70 (32 GB each): test-backend-ops OK on both GPUs, single/dual-GPU
Q4_K_M and Q8_0 bench correct, zeMemAllocDevice GTT delta confirmed
<5 MiB per 4 GiB allocation (vs ~4 GiB shadow with sycl::malloc_device).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* SYCL: remove unused cstdio/cstdlib includes from common.cpp

Leftover from the deleted ggml_sycl_queue_supports_level_zero helper.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Apply suggestions from code review

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* SYCL: preserve Level Zero allocation path during early malloc

* ci: fix Level Zero package conflict in Intel Docker build

* ci: find Level Zero loader in oneAPI package step

* ci: allow Windows SYCL package without Level Zero DLL

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2026-05-14 13:39:14 +08:00
Zheyuan Chen
4c1c3ac09d ggml-webgpu: only use subgroup-matrix path when head dims are divisible by sg_mat_k / sg_mat_n (#23020) 2026-05-13 15:12:40 -07:00
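
The gating condition amounts to roughly the following sketch (parameter naming follows the commit title; the real shader-selection logic in the WebGPU backend differs in detail):

```cpp
#include <cstdint>

// Take the subgroup-matrix fast path only when the head dimension tiles
// evenly into the sg_mat_k x sg_mat_n hardware matrix shape; otherwise
// fall back to the generic path.
static bool use_subgroup_matrix(uint32_t head_dim, uint32_t sg_mat_k, uint32_t sg_mat_n) {
    return head_dim % sg_mat_k == 0 && head_dim % sg_mat_n == 0;
}
```
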
86 changed files with 16196 additions and 17166 deletions

@@ -5,8 +5,15 @@ ARG ONEAPI_VERSION=2025.3.3-0-devel-ubuntu24.04
FROM intel/deep-learning-essentials:$ONEAPI_VERSION AS build
ARG GGML_SYCL_F16=OFF
ARG LEVEL_ZERO_VERSION=1.28.2
ARG LEVEL_ZERO_UBUNTU_VERSION=u24.04
RUN apt-get update && \
apt-get install -y git libssl-dev
apt-get install -y git libssl-dev wget ca-certificates && \
cd /tmp && \
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb && \
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb && \
apt-get -o Dpkg::Options::="--force-overwrite" install -y ./level-zero.deb ./level-zero-devel.deb && \
rm -f /tmp/level-zero.deb /tmp/level-zero-devel.deb
WORKDIR /app
@@ -33,11 +40,11 @@ RUN mkdir -p /app/full \
FROM intel/deep-learning-essentials:$ONEAPI_VERSION AS base
ARG IGC_VERSION=v2.32.7
ARG IGC_VERSION_FULL=2_2.32.7+21184
ARG COMPUTE_RUNTIME_VERSION=26.14.37833.4
ARG COMPUTE_RUNTIME_VERSION_FULL=26.14.37833.4-0
ARG IGDGMM_VERSION=22.9.0
ARG IGC_VERSION=v2.20.5
ARG IGC_VERSION_FULL=2_2.20.5+19972
ARG COMPUTE_RUNTIME_VERSION=25.40.35563.10
ARG COMPUTE_RUNTIME_VERSION_FULL=25.40.35563.10-0
ARG IGDGMM_VERSION=22.8.2
RUN mkdir /tmp/neo/ && cd /tmp/neo/ \
&& wget https://github.com/intel/intel-graphics-compiler/releases/download/$IGC_VERSION/intel-igc-core-${IGC_VERSION_FULL}_amd64.deb \
&& wget https://github.com/intel/intel-graphics-compiler/releases/download/$IGC_VERSION/intel-igc-opencl-${IGC_VERSION_FULL}_amd64.deb \
@@ -109,4 +116,3 @@ WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]

@@ -53,14 +53,6 @@ charset = unset
trim_trailing_whitespace = unset
insert_final_newline = unset
[tools/server/public/**]
indent_style = unset
indent_size = unset
end_of_line = unset
charset = unset
trim_trailing_whitespace = unset
insert_final_newline = unset
[benches/**]
indent_style = unset
indent_size = unset

.gitattributes

@@ -1,4 +0,0 @@
# Treat the generated single-file WebUI build as binary for diff purposes.
# Git's pack-file delta compression still works (byte-level), but this prevents
# git diff from printing the entire minified file on every change.
tools/server/public/index.html -diff

.github/labeler.yml

@@ -77,7 +77,6 @@ server/webui:
- changed-files:
- any-glob-to-any-file:
- tools/server/webui/**
- tools/server/public/**
server:
- changed-files:
- any-glob-to-any-file:

@@ -301,16 +301,17 @@ jobs:
export RISCV_ROOT_PATH=${PWD}/spacemit_toolchain
cmake -B build -DLLAMA_OPENSSL=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DGGML_CPU_REPACK=OFF \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DGGML_CPU_RISCV64_SPACEMIT=ON \
-DGGML_RVV=ON \
-DGGML_RV_ZVFH=ON \
-DGGML_RV_ZFH=ON \
-DGGML_RV_ZICBOP=ON \
-DGGML_RV_ZIHINTPAUSE=ON \
-DRISCV64_SPACEMIT_IME_SPEC=RISCV64_SPACEMIT_IME1 \
-DGGML_RV_ZBA=ON \
-DCMAKE_TOOLCHAIN_FILE=${PWD}/cmake/riscv64-spacemit-linux-gnu-gcc.cmake
cmake --build build --config Release -j $(nproc)

@@ -50,6 +50,8 @@ jobs:
env:
ONEAPI_ROOT: /opt/intel/oneapi/
ONEAPI_INSTALLER_VERSION: "2025.3.3"
LEVEL_ZERO_VERSION: "1.28.2"
LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
continue-on-error: true
@@ -71,6 +73,14 @@ jobs:
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
- name: Install Level Zero SDK
shell: bash
run: |
cd /tmp
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
- name: Clone
id: checkout
uses: actions/checkout@v6
@@ -107,6 +117,7 @@ jobs:
env:
WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
ONEAPI_INSTALLER_VERSION: "2025.3.3"
steps:
@@ -127,6 +138,13 @@ jobs:
run: |
scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
- name: Install Level Zero SDK
shell: pwsh
run: |
Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
"LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:

@@ -36,7 +36,14 @@ env:
CMAKE_ARGS: "-DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON"
jobs:
webui-build:
name: Build WebUI
uses: ./.github/workflows/webui-build.yml
macOS-cpu:
needs:
- webui-build
strategy:
matrix:
include:
@@ -64,6 +71,12 @@ jobs:
with:
fetch-depth: 0
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
@@ -100,6 +113,9 @@ jobs:
name: llama-bin-macos-${{ matrix.build }}.tar.gz
ubuntu-cpu:
needs:
- webui-build
strategy:
matrix:
include:
@@ -119,6 +135,12 @@ jobs:
with:
fetch-depth: 0
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: ccache
if: ${{ matrix.build != 's390x' }}
uses: ggml-org/ccache-action@v1.2.21
@@ -169,6 +191,9 @@ jobs:
name: llama-bin-ubuntu-${{ matrix.build }}.tar.gz
ubuntu-vulkan:
needs:
- webui-build
strategy:
matrix:
include:
@@ -186,6 +211,12 @@ jobs:
with:
fetch-depth: 0
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
@@ -237,6 +268,9 @@ jobs:
name: llama-bin-ubuntu-vulkan-${{ matrix.build }}.tar.gz
android-arm64:
needs:
- webui-build
runs-on: ubuntu-latest
env:
@@ -249,6 +283,12 @@ jobs:
with:
fetch-depth: 0
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
@@ -306,6 +346,9 @@ jobs:
name: llama-bin-android-arm64.tar.gz
ubuntu-24-openvino:
needs:
- webui-build
runs-on: ubuntu-24.04
outputs:
@@ -327,6 +370,12 @@ jobs:
with:
fetch-depth: 0
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
@@ -386,6 +435,9 @@ jobs:
name: llama-bin-ubuntu-openvino-${{ env.OPENVINO_VERSION_MAJOR }}-x64.tar.gz
windows-cpu:
needs:
- webui-build
runs-on: windows-2025
strategy:
@@ -400,6 +452,12 @@ jobs:
with:
fetch-depth: 0
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
@@ -438,6 +496,9 @@ jobs:
name: llama-bin-win-cpu-${{ matrix.arch }}.zip
windows:
needs:
- webui-build
runs-on: windows-2025
env:
@@ -461,6 +522,12 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
@@ -520,6 +587,9 @@ jobs:
name: llama-bin-win-${{ matrix.backend }}-${{ matrix.arch }}.zip
windows-cuda:
needs:
- webui-build
runs-on: windows-2022
strategy:
@@ -531,6 +601,12 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.21
with:
@@ -591,6 +667,9 @@ jobs:
name: cudart-llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip
windows-sycl:
needs:
- webui-build
runs-on: windows-2022
defaults:
@@ -600,6 +679,7 @@ jobs:
env:
WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
ONEAPI_INSTALLER_VERSION: "2025.3.3"
@@ -621,6 +701,19 @@ jobs:
run: |
scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
- name: Install Level Zero SDK
shell: pwsh
run: |
Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
"LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
@@ -655,6 +748,13 @@ jobs:
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_opencl.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_loader.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_win_proxy_loader.dll" ./build/bin
ZE_LOADER_DLL=$(find "${{ env.ONEAPI_ROOT }}" "$LEVEL_ZERO_V1_SDK_PATH" -iname ze_loader.dll -print -quit 2>/dev/null || true)
if [ -n "$ZE_LOADER_DLL" ]; then
echo "Using Level Zero loader: $ZE_LOADER_DLL"
cp "$ZE_LOADER_DLL" ./build/bin
else
echo "Level Zero loader DLL not found in oneAPI or SDK; relying on system driver/runtime"
fi
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/sycl8.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/svml_dispmd.dll" ./build/bin
@@ -681,6 +781,9 @@ jobs:
name: llama-bin-win-sycl-x64.zip
ubuntu-24-sycl:
needs:
- webui-build
strategy:
matrix:
build: [fp32, fp16]
@@ -695,6 +798,8 @@ jobs:
env:
ONEAPI_ROOT: /opt/intel/oneapi/
ONEAPI_INSTALLER_VERSION: "2025.3.3"
LEVEL_ZERO_VERSION: "1.28.2"
LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
steps:
- name: Clone
@@ -718,6 +823,20 @@ jobs:
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
- name: Install Level Zero SDK
shell: bash
run: |
cd /tmp
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
@@ -757,6 +876,9 @@ jobs:
name: llama-bin-ubuntu-sycl-${{ matrix.build }}-x64.tar.gz
ubuntu-22-rocm:
needs:
- webui-build
runs-on: ubuntu-22.04
strategy:
@@ -773,6 +895,12 @@ jobs:
with:
fetch-depth: 0
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: Free up disk space
uses: ggml-org/free-disk-space@v1.3.1
with:
@@ -860,6 +988,9 @@ jobs:
name: llama-bin-ubuntu-rocm-${{ env.ROCM_VERSION_SHORT }}-${{ matrix.build }}.tar.gz
windows-hip:
needs:
- webui-build
runs-on: windows-2022
env:
@@ -876,6 +1007,12 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: Grab rocWMMA package
id: grab_rocwmma
run: |
@@ -1122,6 +1259,7 @@ jobs:
runs-on: ubuntu-slim
needs:
- webui-build
- windows
- windows-cpu
- windows-cuda
@@ -1137,6 +1275,9 @@ jobs:
- ios-xcode-build
- openEuler-cann
outputs:
tag_name: ${{ steps.tag.outputs.name }}
steps:
- name: Clone
id: checkout
@@ -1262,3 +1403,15 @@ jobs:
});
}
}
webui-publish:
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
needs:
- release
uses: ./.github/workflows/webui-publish.yml
with:
version_tag: ${{ needs.release.outputs.tag_name }}
secrets:
hf_token: ${{ secrets.HF_TOKEN_WEBUI_STATIC_OUTPUT }}

@@ -39,7 +39,12 @@ concurrency:
cancel-in-progress: true
jobs:
webui-build:
name: Build WebUI
uses: ./.github/workflows/webui-build.yml
server-metal:
needs: webui-build
runs-on: [self-hosted, llama-server, macOS, ARM64]
name: server-metal (${{ matrix.wf_name }})
@@ -67,6 +72,12 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: Build
id: cmake_build
run: |

@@ -1,7 +1,7 @@
name: Server WebUI
on:
workflow_dispatch: # allows manual triggering
workflow_dispatch:
inputs:
sha:
description: 'Commit SHA1 to build'
@@ -13,16 +13,14 @@ on:
paths: [
'.github/workflows/server-webui.yml',
'tools/server/webui/**.*',
'tools/server/tests/**.*',
'tools/server/public/**'
'tools/server/tests/**.*'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/server-webui.yml',
'tools/server/webui/**.*',
'tools/server/tests/**.*',
'tools/server/public/**'
'tools/server/tests/**.*'
]
env:
@@ -36,9 +34,14 @@ concurrency:
cancel-in-progress: true
jobs:
webui-check:
webui-build:
name: Build WebUI
uses: ./.github/workflows/webui-build.yml
webui-checks:
name: WebUI Checks
runs-on: ${{ 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
needs: webui-build
runs-on: ubuntu-24.04-arm
continue-on-error: true
steps:
- name: Checkout code
@@ -51,7 +54,7 @@ jobs:
id: node
uses: actions/setup-node@v6
with:
node-version: "22"
node-version: "24"
cache: "npm"
cache-dependency-path: "tools/server/webui/package-lock.json"
@@ -71,6 +74,47 @@ jobs:
run: npm run lint
working-directory: tools/server/webui
- name: Install Playwright browsers
id: playwright
if: ${{ always() && steps.setup.conclusion == 'success' }}
run: npx playwright install --with-deps
working-directory: tools/server/webui
- name: Run Client tests
if: ${{ always() && steps.playwright.conclusion == 'success' }}
run: npm run test:client
working-directory: tools/server/webui
- name: Run Unit tests
if: ${{ always() && steps.playwright.conclusion == 'success' }}
run: npm run test:unit
working-directory: tools/server/webui
e2e-tests:
name: E2E Tests
needs: webui-build
runs-on: ubuntu-24.04-arm
steps:
- name: Checkout code
uses: actions/checkout@v6
with:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Setup Node.js
id: node
uses: actions/setup-node@v6
with:
node-version: "24"
cache: "npm"
cache-dependency-path: "tools/server/webui/package-lock.json"
- name: Install dependencies
id: setup
if: ${{ steps.node.conclusion == 'success' }}
run: npm ci
working-directory: tools/server/webui
- name: Build application
if: ${{ always() && steps.setup.conclusion == 'success' }}
run: npm run build
@@ -87,16 +131,6 @@ jobs:
run: npm run build-storybook
working-directory: tools/server/webui
- name: Run Client tests
if: ${{ always() && steps.playwright.conclusion == 'success' }}
run: npm run test:client
working-directory: tools/server/webui
- name: Run Unit tests
if: ${{ always() && steps.playwright.conclusion == 'success' }}
run: npm run test:unit
working-directory: tools/server/webui
- name: Run UI tests
if: ${{ always() && steps.playwright.conclusion == 'success' }}
run: npm run test:ui -- --testTimeout=60000

@@ -54,7 +54,12 @@ concurrency:
cancel-in-progress: true
jobs:
webui-build:
name: Build WebUI
uses: ./.github/workflows/webui-build.yml
server:
needs: webui-build
runs-on: ubuntu-latest
name: server (${{ matrix.wf_name }})
@@ -93,6 +98,12 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: Build
id: cmake_build
run: |
@@ -125,6 +136,7 @@ jobs:
SLOW_TESTS=1 pytest -v -x
server-windows:
needs: webui-build
runs-on: windows-2022
steps:
@@ -135,6 +147,12 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: Build
id: cmake_build
run: |

.github/workflows/webui-build.yml (new file)

@@ -0,0 +1,44 @@
name: Build WebUI
on:
workflow_call:
jobs:
build:
name: Build WebUI
runs-on: ubuntu-slim
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
steps:
- name: Checkout code
uses: actions/checkout@v6
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: "24"
cache: "npm"
cache-dependency-path: "tools/server/webui/package-lock.json"
- name: Install dependencies
run: npm ci
working-directory: tools/server/webui
- name: Build application
run: npm run build
working-directory: tools/server/webui
- name: Generate checksums
run: |
cd tools/server/public
for f in *; do
sha256sum "$f" | awk '{print $1, $2}' >> checksums.txt
done
- name: Upload built webui
uses: actions/upload-artifact@v6
with:
name: webui-build
path: tools/server/public/
retention-days: 1

.github/workflows/webui-publish.yml (new file)

@@ -0,0 +1,65 @@
name: WebUI Publish
on:
workflow_call:
inputs:
version_tag:
description: 'Version tag to publish under (e.g., b1234)'
required: true
type: string
secrets:
hf_token:
description: 'Hugging Face token with write access'
required: true
jobs:
publish:
name: Publish WebUI Static Output
runs-on: ubuntu-24.04-arm
permissions:
contents: read
env:
HF_BUCKET_NAME: ${{ vars.HF_BUCKET_WEBUI_STATIC_OUTPUT }}
steps:
- name: Checkout code
uses: actions/checkout@v6
with:
fetch-depth: 1
- name: Download WebUI build artifact
uses: actions/download-artifact@v7
with:
name: webui-build
path: tools/server/public/
- name: Install Hugging Face Hub CLI
run: pip install -U huggingface_hub
- name: Authenticate with Hugging Face
run: hf auth login --token ${{ secrets.hf_token }}
- name: Sync built files to Hugging Face bucket (version tag)
run: |
# Upload the built files to the Hugging Face bucket under the release version
hf buckets sync tools/server/public hf://buckets/ggml-org/${{ env.HF_BUCKET_NAME }}/${{ inputs.version_tag }} --delete --quiet
- name: Sync built files to Hugging Face bucket (latest)
run: |
# Also upload to the 'latest' directory for fallback downloads
hf buckets sync tools/server/public hf://buckets/ggml-org/${{ env.HF_BUCKET_NAME }}/latest --delete --quiet
- name: Verify upload
run: |
# List the files in the bucket to verify the upload
hf buckets list hf://buckets/ggml-org/${{ env.HF_BUCKET_NAME }}/${{ inputs.version_tag }} -R -h
- name: Clean up root-level files
run: |
# Clean up any old root-level files from previous non-versioned deployments
hf buckets rm ggml-org/${{ env.HF_BUCKET_NAME }}/index.html --yes 2>/dev/null || true
hf buckets rm ggml-org/${{ env.HF_BUCKET_NAME }}/bundle.js --yes 2>/dev/null || true
hf buckets rm ggml-org/${{ env.HF_BUCKET_NAME }}/bundle.css --yes 2>/dev/null || true
hf buckets rm ggml-org/${{ env.HF_BUCKET_NAME }}/loading.html --yes 2>/dev/null || true

.gitignore

@@ -54,6 +54,7 @@
/tmp/
/autogen-*.md
/common/build-info.cpp
/tools/server/public
# Deprecated
@@ -96,8 +97,6 @@
/tools/server/webui/node_modules
/tools/server/webui/dist
# we no longer use gz for index.html
/tools/server/public/index.html.gz
# Python

@@ -104,13 +104,14 @@ option(LLAMA_SANITIZE_UNDEFINED "llama: enable undefined sanitizer" OFF)
option(LLAMA_BUILD_COMMON "llama: build common utils library" ${LLAMA_STANDALONE})
# extra artifacts
option(LLAMA_BUILD_TESTS "llama: build tests" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_TOOLS "llama: build tools" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_EXAMPLES "llama: build examples" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_SERVER "llama: build server example" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_WEBUI "llama: build the embedded Web UI for server" ON)
option(LLAMA_TOOLS_INSTALL "llama: install tools" ${LLAMA_TOOLS_INSTALL_DEFAULT})
option(LLAMA_TESTS_INSTALL "llama: install tests" ON)
option(LLAMA_BUILD_TESTS "llama: build tests" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_TOOLS "llama: build tools" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_EXAMPLES "llama: build examples" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_SERVER "llama: build server example" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_WEBUI "llama: build the embedded Web UI for server" ON)
option(LLAMA_USE_PREBUILT_WEBUI "llama: use prebuilt WebUI from HF Bucket when available (requires LLAMA_BUILD_WEBUI=ON)" ON)
option(LLAMA_TOOLS_INSTALL "llama: install tools" ${LLAMA_TOOLS_INSTALL_DEFAULT})
option(LLAMA_TESTS_INSTALL "llama: install tests" ON)
# 3rd party libs
option(LLAMA_OPENSSL "llama: use openssl to support HTTPS" ON)

@@ -24,6 +24,6 @@ set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
set(CMAKE_C_FLAGS "-march=rv64gcv_zfh_zba_zicbop -mabi=lp64d ${CMAKE_C_FLAGS}")
set(CMAKE_CXX_FLAGS "-march=rv64gcv_zfh_zba_zicbop -mabi=lp64d ${CXX_FLAGS}")
set(CMAKE_C_FLAGS "-march=rv64gcv_zfh_zvfh_zba_zicbop -mabi=lp64d -fno-tree-vectorize -fno-tree-loop-vectorize ${CMAKE_C_FLAGS}")
set(CMAKE_CXX_FLAGS "-march=rv64gcv_zfh_zvfh_zba_zicbop -mabi=lp64d -fno-tree-vectorize -fno-tree-loop-vectorize ${CMAKE_CXX_FLAGS}")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -latomic")

@@ -308,12 +308,14 @@ static bool common_params_handle_remote_preset(common_params & params, llama_exa
common_download_opts opts;
opts.bearer_token = params.hf_token;
opts.offline = params.offline;
LOG_TRC("%s: looking for remote preset at %s\n", __func__, preset_url.c_str());
const int status = common_download_file_single(preset_url, preset_path, opts);
const bool has_preset = status >= 200 && status < 400;
// remote preset is optional, so we don't error out if not found
if (has_preset) {
LOG_INF("applying remote preset from %s\n", preset_url.c_str());
LOG_TRC("%s: applying remote preset from %s\n", __func__, preset_url.c_str());
common_preset_context ctx(ex, /* only_remote_allowed */ true);
common_preset global;
auto remote_presets = ctx.load_from_ini(preset_path, global);
@@ -326,7 +328,7 @@ static bool common_params_handle_remote_preset(common_params & params, llama_exa
throw std::runtime_error("Remote preset.ini does not contain [" + std::string(hf_tag) + "] section");
}
} else {
LOG_INF("%s", "no remote preset found, skipping\n");
LOG_TRC("%s: no remote preset found, skipping\n", __func__);
}
return has_preset;
@@ -3301,18 +3303,20 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_env("LLAMA_LOG_VERBOSITY"));
add_opt(common_arg(
{"--log-prefix"},
{"--no-log-prefix"},
"Enable prefix in log messages",
[](common_params &) {
common_log_set_prefix(common_log_main(), true);
[](common_params &, bool value) {
common_log_set_prefix(common_log_main(), value);
}
).set_env("LLAMA_LOG_PREFIX"));
).set_env("LLAMA_ARG_LOG_PREFIX"));
add_opt(common_arg(
{"--log-timestamps"},
{"--no-log-timestamps"},
"Enable timestamps in log messages",
[](common_params &) {
common_log_set_timestamps(common_log_main(), true);
[](common_params &, bool value) {
common_log_set_timestamps(common_log_main(), value);
}
).set_env("LLAMA_LOG_TIMESTAMPS"));
).set_env("LLAMA_ARG_LOG_TIMESTAMPS"));
//
// speculative parameters

@@ -366,15 +366,29 @@ void common_init() {
SetConsoleCP(CP_UTF8);
#endif
llama_log_set(common_log_default_callback, NULL);
common_log_set_prefix(common_log_main(), true);
common_log_set_timestamps(common_log_main(), true);
llama_log_set(common_log_default_callback, NULL);
}
void common_params_print_info(const common_params & params) {
#ifdef NDEBUG
const char * build_type = "";
#else
const char * build_type = " (debug)";
#endif
LOG_TRC("%s: build %d (%s) with %s for %s%s\n", __func__, llama_build_number(), llama_commit(), llama_compiler(), llama_build_target(), build_type);
LOG_DBG("build: %d (%s) with %s for %s%s\n", llama_build_number(), llama_commit(), llama_compiler(), llama_build_target(), build_type);
LOG_INF("log_info: verbosity = %d (adjust with the `-lv N` CLI arg)\n", common_log_get_verbosity_thold());
LOG_INF("device_info:\n");
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
auto * dev = ggml_backend_dev_get(i);
size_t free, total;
ggml_backend_dev_memory(dev, &free, &total);
LOG_INF(" - %-8s: %s (%zu MiB, %zu MiB free)\n", ggml_backend_dev_name(dev), ggml_backend_dev_description(dev), total / 1024 / 1024, free / 1024 / 1024);
}
LOG_INF("%s\n", common_params_get_system_info(params).c_str());
}
std::string common_params_get_system_info(const common_params & params) {
@@ -1147,7 +1161,8 @@ common_init_result::common_init_result(common_params & params) :
auto cparams = common_context_params_to_llama(params);
if (params.fit_params) {
LOG_INF("%s: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on\n", __func__);
LOG_INF("%s: fitting params to device memory ...\n", __func__);
LOG_INF("%s: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)\n", __func__);
common_fit_params(params.model.path.c_str(), &mparams, &cparams,
params.tensor_split,
params.tensor_buft_overrides.data(),
@@ -1196,7 +1211,7 @@ common_init_result::common_init_result(common_params & params) :
// initialize once
for (llama_token i = 0; i < llama_vocab_n_tokens(vocab); i++) {
if (llama_vocab_is_eog(vocab, i)) {
LOG_INF("%s: added %s logit bias = %f\n", __func__, common_token_to_piece(vocab, i).c_str(), -INFINITY);
LOG_TRC("%s: added %s logit bias = %f\n", __func__, common_token_to_piece(vocab, i).c_str(), -INFINITY);
params.sampling.logit_bias_eog.push_back({i, -INFINITY});
}
}
@@ -1209,12 +1224,12 @@ common_init_result::common_init_result(common_params & params) :
}
//if (params.sampling.penalty_last_n == -1) {
// LOG_INF("%s: setting penalty_last_n to ctx_size = %d\n", __func__, llama_n_ctx(lctx));
// LOG_TRC("%s: setting penalty_last_n to ctx_size = %d\n", __func__, llama_n_ctx(lctx));
// params.sampling.penalty_last_n = llama_n_ctx(lctx);
//}
//if (params.sampling.dry_penalty_last_n == -1) {
// LOG_INF("%s: setting dry_penalty_last_n to ctx_size = %d\n", __func__, llama_n_ctx(lctx));
// LOG_TRC("%s: setting dry_penalty_last_n to ctx_size = %d\n", __func__, llama_n_ctx(lctx));
// params.sampling.dry_penalty_last_n = llama_n_ctx(lctx);
//}
@@ -1422,7 +1437,7 @@ common_context_seq_rm_type common_context_can_seq_rm(llama_context * ctx) {
// try to remove the last tokens
if (!llama_memory_seq_rm(mem, 0, 1, -1)) {
LOG_WRN("%s: the context does not support partial sequence removal\n", __func__);
LOG_TRC("%s: the context does not support partial sequence removal\n", __func__);
res = COMMON_CONTEXT_SEQ_RM_TYPE_FULL;
goto done;
}

@@ -605,7 +605,11 @@ struct common_params {
std::map<std::string, std::string> default_template_kwargs;
// webui configs
bool webui = true;
#ifdef LLAMA_WEBUI_DEFAULT_ENABLED
bool webui = LLAMA_WEBUI_DEFAULT_ENABLED != 0;
#else
bool webui = true; // default to enabled when not set
#endif
bool webui_mcp_proxy = false;
std::string webui_config_json;
@@ -686,6 +690,7 @@ struct common_params {
// initializes the logging system and prints info about the build
void common_init();
void common_params_print_info(const common_params & params);
std::string common_params_get_system_info(const common_params & params);
bool parse_cpu_range(const std::string & range, bool(&boolmask)[GGML_MAX_N_THREADS]);

@@ -320,9 +320,9 @@ static int common_download_file_single_online(const std::string & url,
auto head = cli.Head(parts.path);
if (!head || head->status < 200 || head->status >= 300) {
LOG_WRN("%s: HEAD failed, status: %d\n", __func__, head ? head->status : -1);
LOG_TRC("%s: HEAD failed, status: %d\n", __func__, head ? head->status : -1);
if (file_exists) {
LOG_INF("%s: using cached file (HEAD failed): %s\n", __func__, path.c_str());
LOG_TRC("%s: using cached file (HEAD failed): %s\n", __func__, path.c_str());
return 304; // 304 Not Modified - fake cached response
}
return head ? head->status : -1;

@@ -168,7 +168,7 @@ static void common_params_fit_impl(
// step 1: get data for default parameters and check whether any changes are necessary in the first place
LOG_INF("%s: getting device memory data for initial parameters:\n", __func__);
LOG_TRC("%s: getting device memory data for initial parameters:\n", __func__);
const dmds_t dmds_full = common_get_device_memory_data(path_model, mparams, cparams, devs, hp_ngl, hp_nct, hp_nex, log_level);
const size_t nd = devs.size(); // number of devices
@@ -213,13 +213,13 @@ static void common_params_fit_impl(
LOG_INF("%s: projected to use %" PRId64 " MiB of host memory vs. %" PRId64 " MiB of total host memory\n",
__func__, sum_projected_used/MiB, sum_free/MiB);
if (sum_projected_free >= margins[0]) {
LOG_INF("%s: will leave %" PRId64 " >= %" PRId64 " MiB of system memory, no changes needed\n",
LOG_TRC("%s: will leave %" PRId64 " >= %" PRId64 " MiB of system memory, no changes needed\n",
__func__, sum_projected_free/MiB, margins[0]/MiB);
return;
}
} else {
if (nd > 1) {
LOG_INF("%s: projected memory use with initial parameters [MiB]:\n", __func__);
LOG_TRC("%s: projected memory use with initial parameters [MiB]:\n", __func__);
}
for (size_t id = 0; id < nd; id++) {
const llama_device_memory_data & dmd = dmds_full[id];
@@ -234,16 +234,16 @@ static void common_params_fit_impl(
sum_projected_model += dmd.mb.model;
if (nd > 1) {
LOG_INF("%s: - %s: %6" PRId64 " total, %6" PRId64 " used, %6" PRId64 " free vs. target of %6" PRId64 "\n",
LOG_TRC("%s: - %s: %6" PRId64 " total, %6" PRId64 " used, %6" PRId64 " free vs. target of %6" PRId64 "\n",
__func__, dev_names[id].c_str(), dmd.total/MiB, projected_used/MiB, projected_free/MiB, margins[id]/MiB);
}
}
assert(sum_free >= 0 && sum_projected_used >= 0);
LOG_INF("%s: projected to use %" PRId64 " MiB of device memory vs. %" PRId64 " MiB of free device memory\n",
LOG_TRC("%s: projected to use %" PRId64 " MiB of device memory vs. %" PRId64 " MiB of free device memory\n",
__func__, sum_projected_used/MiB, sum_free/MiB);
if (nd == 1) {
if (projected_free_per_device[0] >= margins[0]) {
LOG_INF("%s: will leave %" PRId64 " >= %" PRId64 " MiB of free device memory, no changes needed\n",
LOG_TRC("%s: will leave %" PRId64 " >= %" PRId64 " MiB of free device memory, no changes needed\n",
__func__, projected_free_per_device[0]/MiB, margins[0]/MiB);
return;
}
@@ -256,7 +256,7 @@ static void common_params_fit_impl(
}
}
if (!changes_needed) {
LOG_INF("%s: targets for free memory can be met on all devices, no changes needed\n", __func__);
LOG_TRC("%s: targets for free memory can be met on all devices, no changes needed\n", __func__);
return;
}
}
@@ -275,10 +275,10 @@ static void common_params_fit_impl(
}
if (global_surplus < 0) {
if (nd <= 1) {
LOG_INF("%s: cannot meet free memory target of %" PRId64 " MiB, need to reduce device memory by %" PRId64 " MiB\n",
LOG_TRC("%s: cannot meet free memory target of %" PRId64 " MiB, need to reduce device memory by %" PRId64 " MiB\n",
__func__, margins[0]/MiB, -global_surplus/MiB);
} else {
LOG_INF(
LOG_TRC(
"%s: cannot meet free memory targets on all devices, need to use %" PRId64 " MiB less in total\n",
__func__, -global_surplus/MiB);
}
@@ -320,28 +320,28 @@ static void common_params_fit_impl(
const int64_t bytes_per_ctx = (sum_projected_used - sum_projected_used_min_ctx) / (hp_nct - n_ctx_min);
const int64_t memory_reduction = (hp_nct - cparams->n_ctx) * bytes_per_ctx;
LOG_INF("%s: context size reduced from %" PRIu32 " to %" PRIu32 " -> need %" PRId64 " MiB less memory in total\n",
LOG_TRC("%s: context size reduced from %" PRIu32 " to %" PRIu32 " -> need %" PRId64 " MiB less memory in total\n",
__func__, hp_nct, cparams->n_ctx, memory_reduction/MiB);
if (nd <= 1) {
LOG_INF("%s: entire model can be fit by reducing context\n", __func__);
LOG_TRC("%s: entire model can be fit by reducing context\n", __func__);
return;
}
LOG_INF("%s: entire model should be fit across devices by reducing context\n", __func__);
LOG_TRC("%s: entire model should be fit across devices by reducing context\n", __func__);
} else {
const int64_t memory_reduction = sum_projected_used - sum_projected_used_min_ctx;
LOG_INF("%s: context size reduced from %" PRIu32 " to %" PRIu32 " -> need %" PRId64 " MiB less memory in total\n",
LOG_TRC("%s: context size reduced from %" PRIu32 " to %" PRIu32 " -> need %" PRId64 " MiB less memory in total\n",
__func__, hp_nct, cparams->n_ctx, memory_reduction/MiB);
}
} else {
if (n_ctx_min == UINT32_MAX) {
LOG_INF("%s: user has requested full context size of %" PRIu32 " -> no change\n", __func__, hp_nct);
LOG_TRC("%s: user has requested full context size of %" PRIu32 " -> no change\n", __func__, hp_nct);
} else {
LOG_INF("%s: default model context size is %" PRIu32 " which is <= the min. context size of %" PRIu32 " -> no change\n",
LOG_TRC("%s: default model context size is %" PRIu32 " which is <= the min. context size of %" PRIu32 " -> no change\n",
__func__, hp_nct, n_ctx_min);
}
}
} else {
LOG_INF("%s: context size set by user to %" PRIu32 " -> no change\n", __func__, cparams->n_ctx);
LOG_TRC("%s: context size set by user to %" PRIu32 " -> no change\n", __func__, cparams->n_ctx);
}
}
}
@@ -485,10 +485,10 @@ static void common_params_fit_impl(
const dmds_t dmd_nl = common_get_device_memory_data(
path_model, &mparams_copy, cparams, devs, hp_ngl, hp_nct, hp_nex, log_level);
LOG_INF("%s: memory for test allocation by device:\n", func_name);
LOG_TRC("%s: memory for test allocation by device:\n", func_name);
for (size_t id = 0; id < nd; id++) {
const ngl_t & n = ngl_per_device[id];
LOG_INF(
LOG_TRC(
"%s: id=%zu, n_layer=%2" PRIu32 ", n_part=%2" PRIu32 ", overflow_type=%d, mem=%6" PRId64 " MiB\n",
func_name, id, n.n_layer, n.n_part, int(n.overflow_type), dmd_nl[id].mb.total()/MiB);
}
@@ -509,7 +509,7 @@ static void common_params_fit_impl(
tensor_buft_overrides[1] = {nullptr, nullptr};
mparams->tensor_buft_overrides = tensor_buft_overrides;
LOG_INF("%s: getting device memory data with all MoE tensors moved to system memory:\n", __func__);
LOG_TRC("%s: getting device memory data with all MoE tensors moved to system memory:\n", __func__);
const dmds_t dmds_cpu_moe = common_get_device_memory_data(
path_model, mparams, cparams, devs, hp_ngl, hp_nct, hp_nex, log_level);
@@ -519,10 +519,10 @@ static void common_params_fit_impl(
}
if (global_surplus_cpu_moe > 0) {
LOG_INF("%s: with only dense weights in device memory there is a total surplus of %" PRId64 " MiB\n",
LOG_TRC("%s: with only dense weights in device memory there is a total surplus of %" PRId64 " MiB\n",
__func__, global_surplus_cpu_moe/MiB);
} else {
LOG_INF("%s: with only dense weights in device memory there is still a total deficit of %" PRId64 " MiB\n",
LOG_TRC("%s: with only dense weights in device memory there is still a total deficit of %" PRId64 " MiB\n",
__func__, -global_surplus_cpu_moe/MiB);
}
@@ -535,7 +535,7 @@ static void common_params_fit_impl(
targets.reserve(nd);
for (size_t id = 0; id < nd; id++) {
targets.push_back(dmds_full[id].free - margins[id]);
LOG_INF("%s: id=%zu, target=%" PRId64 " MiB\n", __func__, id, targets[id]/MiB);
LOG_TRC("%s: id=%zu, target=%" PRId64 " MiB\n", __func__, id, targets[id]/MiB);
}
std::vector<ggml_backend_buffer_type_t> overflow_bufts; // which bufts the first partial layer of a device overflows to:
@@ -555,9 +555,9 @@ static void common_params_fit_impl(
// - once we only have a difference of a single layer, stop and return the lower bound that just barely still fits
// - the last device has the output layer, which cannot be a partial layer
if (hp_nex == 0) {
LOG_INF("%s: filling dense layers back-to-front:\n", __func__);
LOG_TRC("%s: filling dense layers back-to-front:\n", __func__);
} else {
LOG_INF("%s: filling dense-only layers back-to-front:\n", __func__);
LOG_TRC("%s: filling dense-only layers back-to-front:\n", __func__);
}
for (int id = nd - 1; id >= 0; id--) {
uint32_t n_unassigned = hp_ngl + 1;
@@ -576,7 +576,7 @@ static void common_params_fit_impl(
if (mem_high[id] > targets[id]) {
assert(ngl_per_device_high[id].n_layer > ngl_per_device[id].n_layer);
uint32_t delta = ngl_per_device_high[id].n_layer - ngl_per_device[id].n_layer;
LOG_INF("%s: start filling device %" PRIu32 ", delta=%" PRIu32 "\n", __func__, id, delta);
LOG_TRC("%s: start filling device %" PRIu32 ", delta=%" PRIu32 "\n", __func__, id, delta);
while (delta > 1) {
uint32_t step_size = int64_t(delta) * (targets[id] - mem[id]) / (mem_high[id] - mem[id]);
step_size = std::max(step_size, uint32_t(1));
@@ -593,11 +593,11 @@ static void common_params_fit_impl(
if (mem_test[id] <= targets[id]) {
ngl_per_device = ngl_per_device_test;
mem = mem_test;
LOG_INF("%s: set ngl_per_device[%d].n_layer=%" PRIu32 "\n", __func__, id, ngl_per_device[id].n_layer);
LOG_TRC("%s: set ngl_per_device[%d].n_layer=%" PRIu32 "\n", __func__, id, ngl_per_device[id].n_layer);
} else {
ngl_per_device_high = ngl_per_device_test;
mem_high = mem_test;
LOG_INF("%s: set ngl_per_device_high[%d].n_layer=%" PRIu32 "\n", __func__, id, ngl_per_device_high[id].n_layer);
LOG_TRC("%s: set ngl_per_device_high[%d].n_layer=%" PRIu32 "\n", __func__, id, ngl_per_device_high[id].n_layer);
}
delta = ngl_per_device_high[id].n_layer - ngl_per_device[id].n_layer;
}
@@ -605,12 +605,12 @@ static void common_params_fit_impl(
assert(ngl_per_device_high[id].n_layer == n_unassigned);
ngl_per_device = ngl_per_device_high;
mem = mem_high;
LOG_INF("%s: set ngl_per_device[%d].n_layer=%" PRIu32 "\n", __func__, id, ngl_per_device[id].n_layer);
LOG_TRC("%s: set ngl_per_device[%d].n_layer=%" PRIu32 "\n", __func__, id, ngl_per_device[id].n_layer);
}
}
const int64_t projected_margin = dmds_full[id].free - mem[id];
LOG_INF(
LOG_TRC(
"%s: - %s: %2" PRIu32 " layers, %6" PRId64 " MiB used, %6" PRId64 " MiB free\n",
__func__, dev_names[id].c_str(), ngl_per_device[id].n_layer, mem[id]/MiB, projected_margin/MiB);
}
@@ -634,7 +634,7 @@ static void common_params_fit_impl(
}
assert(id_dense_start < nd);
LOG_INF("%s: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:\n", __func__);
LOG_TRC("%s: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:\n", __func__);
for (size_t id = 0; id <= id_dense_start && id_dense_start < nd; id++) {
std::vector<ngl_t> ngl_per_device_high = ngl_per_device;
for (size_t jd = id_dense_start; jd < nd; jd++) {
@@ -674,13 +674,13 @@ static void common_params_fit_impl(
ngl_per_device = ngl_per_device_test;
mem = mem_test;
id_dense_start = id_dense_start_test;
LOG_INF("%s: set ngl_per_device[%zu].(n_layer, n_part)=(%" PRIu32 ", %" PRIu32 "), id_dense_start=%zu\n",
LOG_TRC("%s: set ngl_per_device[%zu].(n_layer, n_part)=(%" PRIu32 ", %" PRIu32 "), id_dense_start=%zu\n",
__func__, id, ngl_per_device[id].n_layer, ngl_per_device[id].n_part, id_dense_start);
} else {
ngl_per_device_high = ngl_per_device_test;
mem_high = mem_test;
id_dense_start_high = id_dense_start_test;
LOG_INF("%s: set ngl_per_device_high[%zu].(n_layer, n_part)=(%" PRIu32 ", %" PRIu32 "), id_dense_start_high=%zu\n",
LOG_TRC("%s: set ngl_per_device_high[%zu].(n_layer, n_part)=(%" PRIu32 ", %" PRIu32 "), id_dense_start_high=%zu\n",
__func__, id, ngl_per_device_high[id].n_layer, ngl_per_device_high[id].n_part, id_dense_start_high);
}
assert(ngl_per_device_high[id].n_full() >= ngl_per_device[id].n_full());
@@ -690,7 +690,7 @@ static void common_params_fit_impl(
ngl_per_device = ngl_per_device_high;
mem = mem_high;
id_dense_start = id_dense_start_high;
LOG_INF("%s: set ngl_per_device[%zu].(n_layer, n_part)=(%" PRIu32 ", %" PRIu32 "), id_dense_start=%zu\n",
LOG_TRC("%s: set ngl_per_device[%zu].(n_layer, n_part)=(%" PRIu32 ", %" PRIu32 "), id_dense_start=%zu\n",
__func__, id, ngl_per_device[id].n_layer, ngl_per_device[id].n_part, id_dense_start);
}
@@ -710,44 +710,44 @@ static void common_params_fit_impl(
if (id < nd - 1) {
overflow_bufts_test[id] = ggml_backend_dev_buffer_type(devs[id + 1]);
}
LOG_INF("%s: trying to fit one extra layer with overflow_type=LAYER_FRACTION_UP\n", __func__);
LOG_TRC("%s: trying to fit one extra layer with overflow_type=LAYER_FRACTION_UP\n", __func__);
std::vector<int64_t> mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts_test);
if (mem_test[id] < targets[id] && (id + 1 == nd || mem_test[id + 1] < targets[id + 1])) {
ngl_per_device = ngl_per_device_test;
overflow_bufts = overflow_bufts_test;
mem = mem_test;
id_dense_start = id_dense_start_test;
LOG_INF("%s: set ngl_per_device[%zu].(n_layer, n_part, overflow_type)=(%" PRIu32 ", %" PRIu32 ", UP), id_dense_start=%zu\n",
LOG_TRC("%s: set ngl_per_device[%zu].(n_layer, n_part, overflow_type)=(%" PRIu32 ", %" PRIu32 ", UP), id_dense_start=%zu\n",
__func__, id, ngl_per_device[id].n_layer, ngl_per_device[id].n_part, id_dense_start);
ngl_per_device_test[id].overflow_type = LAYER_FRACTION_GATE;
LOG_INF("%s: trying to fit one extra layer with overflow_type=LAYER_FRACTION_GATE\n", __func__);
LOG_TRC("%s: trying to fit one extra layer with overflow_type=LAYER_FRACTION_GATE\n", __func__);
mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts_test);
if (mem_test[id] < targets[id] && (id + 1 == nd || mem_test[id + 1] < targets[id + 1])) {
ngl_per_device = ngl_per_device_test;
overflow_bufts = overflow_bufts_test;
mem = mem_test;
id_dense_start = id_dense_start_test;
LOG_INF("%s: set ngl_per_device[%zu].(n_layer, n_part, overflow_type)=(%" PRIu32 ", %" PRIu32 ", GATE), id_dense_start=%zu\n",
LOG_TRC("%s: set ngl_per_device[%zu].(n_layer, n_part, overflow_type)=(%" PRIu32 ", %" PRIu32 ", GATE), id_dense_start=%zu\n",
__func__, id, ngl_per_device[id].n_layer, ngl_per_device[id].n_part, id_dense_start);
}
} else {
ngl_per_device_test[id].overflow_type = LAYER_FRACTION_ATTN;
LOG_INF("%s: trying to fit one extra layer with overflow_type=LAYER_FRACTION_ATTN\n", __func__);
LOG_TRC("%s: trying to fit one extra layer with overflow_type=LAYER_FRACTION_ATTN\n", __func__);
mem_test = get_memory_for_layers(__func__, ngl_per_device_test, overflow_bufts_test);
if (mem_test[id] < targets[id] && (id + 1 == nd || mem_test[id + 1] < targets[id + 1])) {
ngl_per_device = ngl_per_device_test;
overflow_bufts = overflow_bufts_test;
mem = mem_test;
id_dense_start = id_dense_start_test;
LOG_INF("%s: set ngl_per_device[%zu].(n_layer, n_part, overflow_type)=(%" PRIu32 ", %" PRIu32 ", ATTN), id_dense_start=%zu\n",
LOG_TRC("%s: set ngl_per_device[%zu].(n_layer, n_part, overflow_type)=(%" PRIu32 ", %" PRIu32 ", ATTN), id_dense_start=%zu\n",
__func__, id, ngl_per_device[id].n_layer, ngl_per_device[id].n_part, id_dense_start);
}
}
}
const int64_t projected_margin = dmds_full[id].free - mem[id];
LOG_INF(
LOG_TRC(
"%s: - %s: %2" PRIu32 " layers (%2" PRIu32 " overflowing), %6" PRId64 " MiB used, %6" PRId64 " MiB free\n",
__func__, dev_names[id].c_str(), ngl_per_device[id].n_layer, ngl_per_device[id].n_part, mem[id]/MiB, projected_margin/MiB);
}
@@ -755,7 +755,7 @@ static void common_params_fit_impl(
// print info for devices that were not changed during the conversion from dense only to full layers:
for (size_t id = id_dense_start + 1; id < nd; id++) {
const int64_t projected_margin = dmds_full[id].free - mem[id];
LOG_INF(
LOG_TRC(
"%s: - %s: %2" PRIu32 " layers (%2" PRIu32 " overflowing), %6" PRId64 " MiB used, %6" PRId64 " MiB free\n",
__func__, dev_names[id].c_str(), ngl_per_device[id].n_layer, ngl_per_device[id].n_part, mem[id]/MiB, projected_margin/MiB);
}
@@ -776,7 +776,7 @@ enum common_params_fit_status common_fit_params(
common_params_fit_status status = COMMON_PARAMS_FIT_STATUS_SUCCESS;
try {
common_params_fit_impl(path_model, mparams, cparams, tensor_split, tensor_buft_overrides, margins, n_ctx_min, log_level);
LOG_INF("%s: successfully fit params to free device memory\n", __func__);
LOG_TRC("%s: successfully fit params to free device memory\n", __func__);
} catch (const common_params_fit_exception & e) {
LOG_WRN("%s: failed to fit params to free device memory: %s\n", __func__, e.what());
status = COMMON_PARAMS_FIT_STATUS_FAILURE;
@@ -785,7 +785,7 @@ enum common_params_fit_status common_fit_params(
status = COMMON_PARAMS_FIT_STATUS_ERROR;
}
const int64_t t1_us = llama_time_us();
LOG_INF("%s: fitting params to free memory took %.2f seconds\n", __func__, (t1_us - t0_us) * 1e-6);
LOG_TRC("%s: fitting params to free memory took %.2f seconds\n", __func__, (t1_us - t0_us) * 1e-6);
return status;
}
@@ -925,7 +925,7 @@ void common_memory_breakdown_print(const struct llama_context * ctx) {
}
}
for (const auto & td : table_data) {
LOG_INF(td[0].c_str(),
LOG_TRC(td[0].c_str(),
__func__, td[1].c_str(), td[2].c_str(), td[3].c_str(), td[4].c_str(), td[5].c_str(),
td[6].c_str(), td[7].c_str(), td[8].c_str());
}

View File

@@ -435,10 +435,10 @@ void common_log_flush(struct common_log * log) {
static int common_get_verbosity(enum ggml_log_level level) {
switch (level) {
case GGML_LOG_LEVEL_DEBUG: return LOG_LEVEL_DEBUG;
case GGML_LOG_LEVEL_INFO: return LOG_LEVEL_INFO;
case GGML_LOG_LEVEL_INFO: return LOG_LEVEL_TRACE;
case GGML_LOG_LEVEL_WARN: return LOG_LEVEL_WARN;
case GGML_LOG_LEVEL_ERROR: return LOG_LEVEL_ERROR;
case GGML_LOG_LEVEL_CONT: return LOG_LEVEL_INFO; // same as INFO
case GGML_LOG_LEVEL_CONT: return LOG_LEVEL_TRACE;
case GGML_LOG_LEVEL_NONE:
default:
return LOG_LEVEL_OUTPUT;

View File

@@ -21,7 +21,8 @@
# define LOG_ATTRIBUTE_FORMAT(...) __attribute__((format(printf, __VA_ARGS__)))
#endif
#define LOG_LEVEL_DEBUG 4
#define LOG_LEVEL_DEBUG 5
#define LOG_LEVEL_TRACE 4
#define LOG_LEVEL_INFO 3
#define LOG_LEVEL_WARN 2
#define LOG_LEVEL_ERROR 1
@@ -111,13 +112,15 @@ void common_log_flush (struct common_log * log); // f
#define LOGV(verbosity, ...) LOG_TMPL(GGML_LOG_LEVEL_NONE, verbosity, __VA_ARGS__)
#define LOG_DBG(...) LOG_TMPL(GGML_LOG_LEVEL_DEBUG, LOG_LEVEL_DEBUG, __VA_ARGS__)
#define LOG_TRC(...) LOG_TMPL(GGML_LOG_LEVEL_INFO, LOG_LEVEL_TRACE, __VA_ARGS__)
#define LOG_INF(...) LOG_TMPL(GGML_LOG_LEVEL_INFO, LOG_LEVEL_INFO, __VA_ARGS__)
#define LOG_WRN(...) LOG_TMPL(GGML_LOG_LEVEL_WARN, LOG_LEVEL_WARN, __VA_ARGS__)
#define LOG_ERR(...) LOG_TMPL(GGML_LOG_LEVEL_ERROR, LOG_LEVEL_ERROR, __VA_ARGS__)
#define LOG_CNT(...) LOG_TMPL(GGML_LOG_LEVEL_CONT, LOG_LEVEL_INFO, __VA_ARGS__) // same as INFO
#define LOG_DBGV(verbosity, ...) LOG_TMPL(GGML_LOG_LEVEL_DEBUG, verbosity, __VA_ARGS__)
#define LOG_TRCV(verbosity, ...) LOG_TMPL(GGML_LOG_LEVEL_TRACE, verbosity, __VA_ARGS__)
#define LOG_INFV(verbosity, ...) LOG_TMPL(GGML_LOG_LEVEL_INFO, verbosity, __VA_ARGS__)
#define LOG_WRNV(verbosity, ...) LOG_TMPL(GGML_LOG_LEVEL_WARN, verbosity, __VA_ARGS__)
#define LOG_ERRV(verbosity, ...) LOG_TMPL(GGML_LOG_LEVEL_ERROR, verbosity, __VA_ARGS__)
#define LOG_DBGV(verbosity, ...) LOG_TMPL(GGML_LOG_LEVEL_DEBUG, verbosity, __VA_ARGS__)
#define LOG_CNTV(verbosity, ...) LOG_TMPL(GGML_LOG_LEVEL_CONT, verbosity, __VA_ARGS__)
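The call-site impact of the new level is mechanical: routine status output moves from `LOG_INF` to `LOG_TRC`, which is gated behind verbosity `LOG_LEVEL_TRACE` (4), while warnings and errors keep their levels. A minimal sketch with illustrative call sites (`n_layers` is a placeholder, not from the diff):

```cpp
// Illustrative call sites after this change (not taken from the diff):
LOG_TRC("%s: loaded %d layers\n", __func__, n_layers);  // routine detail, hidden by default
LOG_INF("%s: starting server\n",  __func__);            // still user-facing info
LOG_WRN("%s: falling back to CPU\n", __func__);         // unchanged
```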

View File

@@ -984,7 +984,7 @@ common_speculative * common_speculative_init(common_params_speculative & params,
}
if (impls.empty()) {
LOG_WRN("%s", "no implementations specified for speculative decoding\n");
LOG_WRN("%s: no implementations specified for speculative decoding\n", __func__);
return nullptr;
}

View File

@@ -720,6 +720,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_GRAPH | OFF *(default)* \|ON *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
| GGML_SYCL_HOST_MEM_FALLBACK | ON *(default)* \|OFF *(Optional)* | Allow host memory fallback when device memory is full during quantized weight reorder. Enables inference to continue at reduced speed (reading over PCIe) instead of failing. Requires Linux kernel 6.8+. |
| GGML_SYCL_SUPPORT_LEVEL_ZERO | ON *(default)* \|OFF *(Optional)* | Enable Level Zero API for device memory allocation. Requires Level Zero headers/library at build time and Intel GPU driver (Level Zero runtime) at run time. Reduces system RAM usage during multi-GPU inference. |
| CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. |
| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |
@@ -733,9 +734,10 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_ENABLE_FLASH_ATTN | 1 (default) or 0| Enable Flash-Attention. It can reduce memory usage. The performance impact depends on the LLM.|
| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features for Intel GPUs. (Recommended to 1 for intel devices older than Gen 10) |
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because SYCL Graph is still on development, no better performance. |
| GGML_SYCL_ENABLE_LEVEL_ZERO | 1 (default) or 0 | Use Level Zero API for device memory allocation instead of SYCL. Reduces system RAM usage on Intel dGPUs by avoiding DMA-buf/TTM host memory staging. Requires GGML_SYCL_SUPPORT_LEVEL_ZERO=ON at build time. |
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Support malloc device memory more than 4GB.|
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Allow SYCL/Unified Runtime Level Zero device allocations larger than 4 GiB. llama.cpp's direct Level Zero allocation path requests the relaxed maximum-size limit itself when GGML_SYCL_ENABLE_LEVEL_ZERO=1. |
## Compile-time Flags
@@ -819,7 +821,7 @@ Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spo
- `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5000000000 Bytes of memory on device`
You need to enable to support 4GB memory malloc by:
With the default `GGML_SYCL_ENABLE_LEVEL_ZERO=1`, llama.cpp requests Level Zero's relaxed maximum-size allocation limit directly. If Level Zero support is disabled at build time or runtime and the allocation goes through SYCL/Unified Runtime instead, enable support for allocations larger than 4 GiB by:
```
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1

View File

@@ -9,18 +9,20 @@ wget https://archive.spacemit.com/toolchain/spacemit-toolchain-linux-glibc-x86_6
~~~
2. Build
Below is the build script: it requires utilizing RISC-V vector instructions for acceleration. Ensure the `GGML_CPU_RISCV64_SPACEMIT` compilation option is enabled. The currently supported optimization version is `RISCV64_SPACEMIT_IME1`, corresponding to the `RISCV64_SPACEMIT_IME_SPEC` compilation option. Compiler configurations are defined in the `riscv64-spacemit-linux-gnu-gcc.cmake` file. Please ensure you have installed the RISC-V compiler and set the environment variable via `export RISCV_ROOT_PATH={your_compiler_path}`.
Below is the build script; it relies on RISC-V vector instructions for acceleration. Ensure the `GGML_CPU_RISCV64_SPACEMIT` compilation option is enabled. The currently supported optimization versions are `RISCV64_SPACEMIT_IME1` and `RISCV64_SPACEMIT_IME2`, selected via the `RISCV64_SPACEMIT_IME_SPEC` compilation option. Compiler configurations are defined in the `riscv64-spacemit-linux-gnu-gcc.cmake` file. Please ensure you have installed the RISC-V compiler and set the environment variable via `export RISCV_ROOT_PATH={your_compiler_path}`.
```bash
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CPU_RISCV64_SPACEMIT=ON \
-DGGML_CPU_REPACK=OFF \
-DLLAMA_OPENSSL=OFF \
-DGGML_RVV=ON \
-DGGML_RV_ZVFH=ON \
-DGGML_RV_ZFH=ON \
-DGGML_RV_ZICBOP=ON \
-DGGML_RV_ZIHINTPAUSE=ON \
-DRISCV64_SPACEMIT_IME_SPEC=RISCV64_SPACEMIT_IME1 \
-DGGML_RV_ZBA=ON \
-DCMAKE_TOOLCHAIN_FILE=${PWD}/cmake/riscv64-spacemit-linux-gnu-gcc.cmake \
-DCMAKE_INSTALL_PREFIX=build/installed
@@ -47,8 +49,25 @@ export RISCV_ROOT_PATH_IME1={your RISC-V compiler path}
${QEMU_ROOT_PATH}/bin/qemu-riscv64 -L ${RISCV_ROOT_PATH_IME1}/sysroot -cpu max,vlen=256,elen=64,vext_spec=v1.0 ${PWD}/build/bin/llama-cli -m ${PWD}/models/Qwen2.5-0.5B-Instruct-Q4_0.gguf -t 1
~~~
## Quantization Support For Matrix
| Quantization Type | X60 | A100 |
| ---: | ---: | ---: |
| Q2_K | | :heavy_check_mark: |
| Q3_K | | :heavy_check_mark: |
| Q4_0 | :heavy_check_mark: | :heavy_check_mark: |
| Q4_1 | :heavy_check_mark: | :heavy_check_mark: |
| Q4_K | :heavy_check_mark: | :heavy_check_mark: |
| Q5_0 | | :heavy_check_mark: |
| Q5_1 | | :heavy_check_mark: |
| Q5_K | | :heavy_check_mark: |
| Q6_K | | :heavy_check_mark: |
| Q8_0 | | :heavy_check_mark: |
## Performance
#### Quantization Support For Matrix
* Spacemit(R) X60
~~~
model name : Spacemit(R) X60
isa : rv64imafdcv_zicbom_zicboz_zicntr_zicond_zicsr_zifencei_zihintpause_zihpm_zfh_zfhmin_zca_zcd_zba_zbb_zbc_zbs_zkt_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkt_sscofpmf_sstc_svinval_svnapot_svpbmt
@@ -58,33 +77,34 @@ mvendorid : 0x710
marchid : 0x8000000058000001
~~~
Q4_0
| Model | Size | Params | backend | threads | test | t/s |
| -----------| -------- | ------ | ------- | ------- | ---- |------|
| Qwen2.5 0.5B | 403.20 MiB | 630.17 M | cpu | 4 | pp512 | 64.12 ± 0.26 |
| Qwen2.5 0.5B | 403.20 MiB | 630.17 M | cpu | 4 | tg128 | 10.03 ± 0.01 |
| Qwen2.5 1.5B | 1011.16 MiB | 1.78 B | cpu | 4 | pp512 | 24.16 ± 0.02 |
| Qwen2.5 1.5B | 1011.16 MiB | 1.78 B | cpu | 4 | tg128 | 3.83 ± 0.06 |
| Qwen2.5 3B | 1.86 GiB | 3.40 B | cpu | 4 | pp512 | 12.08 ± 0.02 |
| Qwen2.5 3B | 1.86 GiB | 3.40 B | cpu | 4 | tg128 | 2.23 ± 0.02 |
Q4_1
| Model | Size | Params | backend | threads | test | t/s |
| -----------| -------- | ------ | ------- | ------- | ---- |------|
| Qwen2.5 0.5B | 351.50 MiB | 494.03 M | cpu | 4 | pp512 | 62.07 ± 0.12 |
| Qwen2.5 0.5B | 351.50 MiB | 494.03 M | cpu | 4 | tg128 | 9.91 ± 0.01 |
| Qwen2.5 1.5B | 964.06 MiB | 1.54 B | cpu | 4 | pp512 | 22.95 ± 0.25 |
| Qwen2.5 1.5B | 964.06 MiB | 1.54 B | cpu | 4 | tg128 | 4.01 ± 0.15 |
| Qwen2.5 3B | 1.85 GiB | 3.09 B | cpu | 4 | pp512 | 11.55 ± 0.16 |
| Qwen2.5 3B | 1.85 GiB | 3.09 B | cpu | 4 | tg128 | 2.25 ± 0.04 |
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | ---: | --------------: | -------------------: |
| qwen35 2B Q4_1 | 1.19 GiB | 1.88 B | CPU | 4 | 128 | 1 | 0 | pp128 | 10.32 ± 0.02 |
| qwen35 2B Q4_1 | 1.19 GiB | 1.88 B | CPU | 4 | 128 | 1 | 0 | tg128 | 3.07 ± 0.01 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 128 | 1 | 0 | pp128 | 49.15 ± 0.25 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 4 | 128 | 1 | 0 | tg128 | 11.73 ± 0.02 |
Q4_K
| Model | Size | Params | backend | threads | test | t/s |
| -----------| -------- | ------ | ------- | ------- | ---- |------|
| Qwen2.5 0.5B | 462.96 MiB | 630.17 M | cpu | 4 | pp512 | 9.29 ± 0.05 |
| Qwen2.5 0.5B | 462.96 MiB | 630.17 M | cpu | 4 | tg128 | 5.67 ± 0.04 |
| Qwen2.5 1.5B | 1.04 GiB | 1.78 B | cpu | 4 | pp512 | 10.38 ± 0.10 |
| Qwen2.5 1.5B | 1.04 GiB | 1.78 B | cpu | 4 | tg128 | 3.17 ± 0.08 |
| Qwen2.5 3B | 1.95 GiB | 3.40 B | cpu | 4 | pp512 | 4.23 ± 0.04 |
| Qwen2.5 3B | 1.95 GiB | 3.40 B | cpu | 4 | tg128 | 1.73 ± 0.00 |
* Spacemit(R) A100
~~~
model name : Spacemit(R) A100
isa : rv64imafdcvh_zicbom_zicbop_zicboz_zicntr_zicond_zicsr_zifencei_zihintntl_zihintpause_zihpm_zimop_zaamo_zalrsc_zawrs_zfa_zfh_zfhmin_zca_zcb_zcd_zcmop_zba_zbb_zbc_zbs_zkt_zvbb_zvbc_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkb_zvkg_zvkned_zvknha_zvknhb_zvksed_zvksh_zvkt_smaia_smstateen_ssaia_sscofpmf_sstc_svinval_svnapot_svpbmt_sdtrig
mmu : sv39
mvendorid : 0x710
marchid : 0x8000000041000002
mimpid : 0x10000000d5686200
hart isa : rv64imafdcv_zicbom_zicbop_zicboz_zicntr_zicond_zicsr_zifencei_zihintntl_zihintpause_zihpm_zimop_zaamo_zalrsc_zawrs_zfa_zfh_zfhmin_zca_zcb_zcd_zcmop_zba_zbb_zbc_zbs_zkt_zvbb_zvbc_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkb_zvkg_zvkned_zvknha_zvknhb_zvksed_zvksh_zvkt_smaia_smstateen_ssaia_sscofpmf_sstc_svinval_svnapot_svpbmt_sdtrig
~~~
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 8 | 128 | 1 | 0 | pp128 | 565.83 ± 0.31 |
| qwen3 0.6B Q4_0 | 358.78 MiB | 596.05 M | CPU | 8 | 128 | 1 | 0 | tg128 | 55.77 ± 0.02 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | CPU | 8 | 128 | 1 | 0 | pp128 | 79.74 ± 0.04 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | CPU | 8 | 128 | 1 | 0 | tg128 | 11.29 ± 0.00 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CPU | 8 | 128 | 1 | 0 | pp128 | 57.88 ± 0.31 |
| qwen3moe 30B.A3B Q4_0 | 16.18 GiB | 30.53 B | CPU | 8 | 128 | 1 | 0 | tg128 | 12.79 ± 0.00 |
| qwen35 2B Q4_1 | 1.19 GiB | 1.88 B | CPU | 8 | 128 | 1 | 0 | pp128 | 115.23 ± 0.04 |
| qwen35 2B Q4_1 | 1.19 GiB | 1.88 B | CPU | 8 | 128 | 1 | 0 | tg128 | 16.49 ± 0.01 |
| gemma4 E4B Q4_K - Medium | 4.76 GiB | 7.52 B | CPU | 8 | 128 | 1 | 0 | pp128 | 21.13 ± 0.01 |
| gemma4 E4B Q4_K - Medium | 4.76 GiB | 7.52 B | CPU | 8 | 128 | 1 | 0 | tg128 | 5.66 ± 0.00 |

View File

@@ -249,6 +249,7 @@ option(GGML_SYCL "ggml: use SYCL"
option(GGML_SYCL_F16 "ggml: use 16 bit floats for sycl calculations" OFF)
option(GGML_SYCL_GRAPH "ggml: enable graphs in the SYCL backend" ON)
option(GGML_SYCL_HOST_MEM_FALLBACK "ggml: allow host memory fallback in SYCL reorder (requires kernel 6.8+)" ON)
option(GGML_SYCL_SUPPORT_LEVEL_ZERO "ggml: use Level Zero API in SYCL backend" ON)
option(GGML_SYCL_DNN "ggml: enable oneDNN in the SYCL backend" ON)
set (GGML_SYCL_TARGET "INTEL" CACHE STRING
"ggml: sycl target device")

View File

@@ -450,12 +450,22 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
ggml-cpu/arch/riscv/repack.cpp
)
if (GGML_CPU_RISCV64_SPACEMIT)
include(ggml-cpu/cmake/FindSMTIME.cmake)
target_compile_definitions(${GGML_CPU_NAME} PRIVATE GGML_USE_CPU_RISCV64_SPACEMIT ${RISCV64_SPACEMIT_IME_SPEC})
list(APPEND GGML_CPU_SOURCES
ggml-cpu/spacemit/ime.cpp
ggml-cpu/spacemit/ime.h
ggml-cpu/spacemit/spine_mem_pool.cpp
ggml-cpu/spacemit/spine_mem_pool.h
ggml-cpu/spacemit/repack.cpp
ggml-cpu/spacemit/repack.h
ggml-cpu/spacemit/ime_env.cpp
ggml-cpu/spacemit/ime_env.h
ggml-cpu/spacemit/ime1_kernels.cpp
ggml-cpu/spacemit/ime2_kernels.cpp
ggml-cpu/spacemit/ime_kernels.h
ggml-cpu/spacemit/rvv_kernels.cpp
ggml-cpu/spacemit/rvv_kernels.h
)
endif()
if(NOT GGML_CPU_ALL_VARIANTS)
@@ -485,6 +495,9 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
if (GGML_RV_ZIHINTPAUSE)
string(APPEND MARCH_STR "_zihintpause")
endif()
if (GGML_RV_ZBA)
string(APPEND MARCH_STR "_zba")
endif()
if (GGML_CPU_RISCV64_SPACEMIT)
# `xsmtvdotii' is only required for GCC >= 15.
if (CMAKE_C_COMPILER_ID STREQUAL "GNU" AND

View File

@@ -0,0 +1,32 @@
include(CheckCSourceCompiles)
if (CMAKE_SYSTEM_PROCESSOR MATCHES "^(riscv)" AND GGML_CPU_RISCV64_SPACEMIT)
set(SMT_MARCH_STR "-march=rv64gcv_zfh_zvfh_zba_zicbop")
if (CMAKE_C_COMPILER_ID STREQUAL "GNU" AND
CMAKE_C_COMPILER_VERSION VERSION_GREATER_EQUAL 15)
string(APPEND SMT_MARCH_STR "_xsmtvdotii")
endif()
set(CMAKE_REQUIRED_FLAGS "${SMT_MARCH_STR}")
check_c_source_compiles("int main() {__asm__ volatile(\"vmadot v2, v0, v1\");}" SPACEMIT_RISCV_COMPILER_SUPPORT_IME1)
check_c_source_compiles("int main() {__asm__ volatile(\"vmadot v2, v0, v1, i4\");}" SPACEMIT_RISCV_COMPILER_SUPPORT_VMADOT_S4)
check_c_source_compiles("int main() {__asm__ volatile(\"vmadot v2, v0, v1, i8\");}" SPACEMIT_RISCV_COMPILER_SUPPORT_VMADOT_S8)
check_c_source_compiles("int main() {__asm__ volatile(\"vfwmadot v2, v0, v1, fp16\");}" SPACEMIT_RISCV_COMPILER_SUPPORT_VFWMADOT_FP16)
check_c_source_compiles("int main() {__asm__ volatile(\"vmadot.hp v2, v0, v1, v0, 0, i4\");}" SPACEMIT_RISCV_COMPILER_SUPPORT_VFMADOT_S4)
check_c_source_compiles("int main() {__asm__ volatile(\"vmadot.hp v2, v0, v1, v0, 0, i8\");}" SPACEMIT_RISCV_COMPILER_SUPPORT_VFMADOT_S8)
check_c_source_compiles("int main() {__asm__ volatile(\"vmadot1 v2, v0, v1\");}" SPACEMIT_RISCV_COMPILER_SUPPORT_VMADOTN)
check_c_source_compiles("int main() {__asm__ volatile(\"vpack.vv v2, v0, v1, 2\");}" SPACEMIT_RISCV_COMPILER_SUPPORT_VPACK)
check_c_source_compiles("int main() {__asm__ volatile(\"vnspack.vv v2, v0, v1, 2\");}" SPACEMIT_RISCV_COMPILER_SUPPORT_VNPACK)
unset(CMAKE_REQUIRED_FLAGS)
list(APPEND RISCV64_SPACEMIT_IME_SPEC "")
if (SPACEMIT_RISCV_COMPILER_SUPPORT_IME1)
set(RISCV64_SPACEMIT_IME_SPEC "RISCV64_SPACEMIT_IME1")
endif()
if (SPACEMIT_RISCV_COMPILER_SUPPORT_VMADOT_S4 AND SPACEMIT_RISCV_COMPILER_SUPPORT_VPACK AND SPACEMIT_RISCV_COMPILER_SUPPORT_VNPACK)
list(APPEND RISCV64_SPACEMIT_IME_SPEC "RISCV64_SPACEMIT_IME2")
endif()
message("RISCV64_SPACEMIT_IME_SPEC: ${RISCV64_SPACEMIT_IME_SPEC}")
endif()
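Each `check_c_source_compiles` probe above is a one-liner built with the SpacemiT `-march` string; written out as a standalone translation unit, the IME1 probe is simply the following (a sketch; it compiles only when the toolchain accepts the IME instructions, e.g. GCC >= 15 with `_xsmtvdotii` appended to `-march`):

```cpp
// Standalone form of the SPACEMIT_RISCV_COMPILER_SUPPORT_IME1 probe above.
int main() {
    __asm__ volatile("vmadot v2, v0, v1");  // IME1 integer matrix dot-product
    return 0;
}
```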

View File

@@ -50,6 +50,10 @@
#include "llamafile/sgemm.h"
#endif
#ifdef GGML_USE_CPU_RISCV64_SPACEMIT
# include "spacemit/ime.h"
#endif
// Note: once we move threading into a separate C++ file
// will use std::hardware_destructive_interference_size instead of hardcoding it here
// and we'll use C++ attribute syntax.
@@ -3011,7 +3015,11 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
const struct ggml_cgraph * cgraph = tp->cgraph;
const struct ggml_cplan * cplan = tp->cplan;
#ifdef GGML_USE_CPU_RISCV64_SPACEMIT
ggml_backend_cpu_riscv64_spacemit_set_numa_thread_affinity(state->ith);
#else
set_numa_thread_affinity(state->ith);
#endif
struct ggml_compute_params params = {
/*.ith =*/ state->ith,
@@ -3068,6 +3076,10 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
ggml_barrier(state->threadpool);
#ifdef GGML_USE_CPU_RISCV64_SPACEMIT
ggml_backend_cpu_riscv64_spacemit_clear_numa_thread_affinity_threaded(state->ith);
#endif
return 0;
}
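Condensed, the thread body now pins each graph worker via the backend-specific affinity call on entry and clears it on exit. A sketch of the resulting shape (surrounding code elided; the function name here is illustrative):

```cpp
// Sketch of the compute-thread affinity handling after this change.
static thread_ret_t graph_compute_thread_sketch(struct ggml_compute_state * state) {
#ifdef GGML_USE_CPU_RISCV64_SPACEMIT
    ggml_backend_cpu_riscv64_spacemit_set_numa_thread_affinity(state->ith);  // pin to preferred cores
#else
    set_numa_thread_affinity(state->ith);                                    // generic NUMA pinning
#endif
    // ... compute the graph ...
#ifdef GGML_USE_CPU_RISCV64_SPACEMIT
    ggml_backend_cpu_riscv64_spacemit_clear_numa_thread_affinity_threaded(state->ith);
#endif
    return 0;
}
```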

File diff suppressed because it is too large

View File

@@ -8,6 +8,14 @@ extern "C" {
ggml_backend_buffer_type_t ggml_backend_cpu_riscv64_spacemit_buffer_type(void);
void ggml_backend_cpu_riscv64_spacemit_set_numa_thread_affinity(int thread_n);
void ggml_backend_cpu_riscv64_spacemit_clear_numa_thread_affinity_threaded(int thread_n);
void * ggml_backend_cpu_riscv64_spacemit_alloc_shared(size_t size, size_t alignment);
void ggml_backend_cpu_riscv64_spacemit_free_shared(void * ptr);
#ifdef __cplusplus
}
#endif
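A hedged usage sketch for the two new shared-memory helpers declared above (the backing allocator is selected at run time via `SPACEMIT_MEM_BACKEND`; the size and alignment here are arbitrary):

```cpp
// Hypothetical caller of the shared-memory helpers declared above.
void * buf = ggml_backend_cpu_riscv64_spacemit_alloc_shared(/*size=*/4096, /*alignment=*/64);
if (buf != nullptr) {
    // ... share the buffer across worker threads ...
    ggml_backend_cpu_riscv64_spacemit_free_shared(buf);
}
```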

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -0,0 +1,320 @@
#include "ime_env.h"
#include "ggml-impl.h"
#include "spine_mem_pool.h"
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <array>
#include <cctype>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <limits>
#include <stdexcept>
#include <string>
#include <thread>
#include <unordered_map>
namespace ggml::cpu::riscv64_spacemit {
bool spine_core_info::get_spine_core_info(std::vector<spine_core_info> & result) {
static std::unordered_map<uint64_t, spine_core_arch_id> spine_march_mapping_ = {
{0x8000000058000001, spine_core_arch_id::core_arch_x60 },
{ 0x8000000041000001, spine_core_arch_id::core_arch_a60 },
{ 0x8000000058000002, spine_core_arch_id::core_arch_x100},
{ 0x8000000041000002, spine_core_arch_id::core_arch_a100},
};
result.clear();
std::ifstream file("/proc/cpuinfo");
std::string line;
std::vector<std::array<uint64_t, 2>> cpu_info_list;
uint64_t current_processor = spine_invalid_core_id;
uint64_t current_marchid = 0;
bool has_processor = false;
bool has_marchid = false;
if (!file.is_open()) {
return false;
}
while (std::getline(file, line)) {
if (line.substr(0, 9) == "processor") {
if (has_processor && has_marchid) {
cpu_info_list.push_back({ current_processor, current_marchid });
}
size_t colon_pos = line.find(':');
if (colon_pos != std::string::npos) {
current_processor = std::stoi(line.substr(colon_pos + 1));
has_processor = true;
}
has_marchid = false;
} else if (line.substr(0, 7) == "marchid") {
size_t colon_pos = line.find(':');
if (colon_pos != std::string::npos) {
std::string marchid_str = line.substr(colon_pos + 1);
marchid_str.erase(std::remove_if(marchid_str.begin(), marchid_str.end(), [](unsigned char ch) { return std::isspace(ch) != 0; }), marchid_str.end());
current_marchid = std::stoull(marchid_str, nullptr, 16);
has_marchid = true;
}
}
}
if (has_processor && has_marchid) {
cpu_info_list.push_back({ current_processor, current_marchid });
}
if (has_processor && has_marchid) {
for (auto & cpu_info : cpu_info_list) {
if (cpu_info[0] != spine_invalid_core_id &&
spine_march_mapping_.find(cpu_info[1]) != spine_march_mapping_.end()) {
auto core_info = spine_core_info();
core_info.core_id = cpu_info[0];
core_info.arch_id = spine_core_arch_id(spine_march_mapping_[cpu_info[1]]);
result.push_back(core_info);
}
}
}
return has_processor && has_marchid;
}
namespace {
uint16_t hex_string_to_u16(const std::string & hex_str) {
try {
size_t pos = 0;
if (hex_str.substr(0, 2) == "0x" || hex_str.substr(0, 2) == "0X") {
pos = 2;
}
unsigned long result = std::stoul(hex_str.substr(pos), nullptr, 16);
if (result > std::numeric_limits<uint16_t>::max()) {
throw std::out_of_range("Converted value is out of range for uint16_t");
}
return static_cast<uint16_t>(result);
} catch (const std::invalid_argument & e) {
throw std::invalid_argument("Invalid hexadecimal string");
} catch (const std::out_of_range & e) {
throw;
}
}
const char * spine_mem_pool_backend_to_string(spine_mem_pool_backend backend) {
switch (backend) {
case spine_mem_pool_backend::none:
return "NONE";
case spine_mem_pool_backend::posix_memalign:
return "POSIX";
case spine_mem_pool_backend::transparent_hugepage:
return "HPAGE";
case spine_mem_pool_backend::hugetlb_1g:
return "HPAGE1GB";
}
return "unknown";
}
spine_mem_pool_backend parse_mem_backend(const char * mem_backend_str) {
if (mem_backend_str == nullptr || mem_backend_str[0] == '\0') {
return spine_mem_pool_backend::transparent_hugepage;
}
std::string value(mem_backend_str);
std::transform(value.begin(), value.end(), value.begin(),
[](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
if (value == "none") {
return spine_mem_pool_backend::none;
}
if (value == "posix") {
return spine_mem_pool_backend::posix_memalign;
}
if (value == "hpage") {
return spine_mem_pool_backend::transparent_hugepage;
}
if (value == "hpage1gb") {
return spine_mem_pool_backend::hugetlb_1g;
}
throw std::runtime_error("invalid SPACEMIT_MEM_BACKEND: " + value + ", expected NONE, POSIX, HPAGE or HPAGE1GB");
}
} // namespace
spine_env_info::spine_env_info() {
num_cores = static_cast<int>(std::thread::hardware_concurrency());
spine_core_info::get_spine_core_info(core_info_list);
// special for x60 K1
if (core_info_list.size() == 8 && core_info_list[0].arch_id == spine_core_arch_id::core_arch_x60) {
for (int i = 0; i < 4; i++) {
core_info_list[i].arch_id = spine_core_arch_id::core_arch_a60;
}
}
// special for qemu
if (core_info_list.size() == 0) {
char * spine_core_arch_str = getenv("SPACEMIT_CORE_ARCH");
if (spine_core_arch_str != nullptr) {
auto arch_id = hex_string_to_u16(spine_core_arch_str);
for (int i = 0; i < num_cores; i++) {
auto core_info = spine_core_info();
core_info.core_id = i;
core_info.arch_id = spine_core_arch_id{ arch_id };
core_info_list.push_back(core_info);
}
}
}
if (core_info_list.size() == 0) {
throw std::runtime_error(
"Failed to get SPACEMIT_CORE_ARCH from environment or failed to parse it from /proc/cpuinfo");
}
char * spine_perfer_core_arch_str = getenv("SPACEMIT_PERFER_CORE_ARCH");
if (spine_perfer_core_arch_str != nullptr && spine_perfer_core_arch_str[0] != '\0') {
perfer_core_arch_id = spine_core_arch_id{ hex_string_to_u16(spine_perfer_core_arch_str) };
}
char * spine_perfer_core_id_str = getenv("SPACEMIT_PERFER_CORE_ID");
std::vector<int> perfer_core_id_vec;
if (spine_perfer_core_id_str != nullptr && spine_perfer_core_id_str[0] != '\0') {
std::string perfer_core_id_str(spine_perfer_core_id_str);
size_t start = 0;
size_t end = 0;
while ((end = perfer_core_id_str.find(',', start)) != std::string::npos) {
std::string core_id_substr = perfer_core_id_str.substr(start, end - start);
perfer_core_id_vec.push_back(std::stoi(core_id_substr));
start = end + 1;
}
std::string core_id_substr = perfer_core_id_str.substr(start);
perfer_core_id_vec.push_back(std::stoi(core_id_substr));
}
perfer_core_ids.reserve(num_cores);
if (perfer_core_arch_id == spine_core_arch_id::core_arch_none) {
for (auto & core_info : core_info_list) {
auto core_arch_id = core_info.arch_id;
auto core_arch_head = (uint16_t) (core_arch_id) >> 12;
if (core_arch_head == 0xA) {
num_perfer_cores++;
perfer_core_arch_id = core_arch_id;
cpu_mask |= (1ULL << core_info.core_id);
perfer_core_ids.push_back(core_info.core_id);
}
}
} else {
for (auto & core_info : core_info_list) {
auto core_arch_id = core_info.arch_id;
if (core_arch_id == perfer_core_arch_id) {
num_perfer_cores++;
cpu_mask |= (1ULL << core_info.core_id);
auto core_arch_head = (uint16_t) (core_arch_id) >> 12;
if (core_arch_head == 0xA) {
perfer_core_ids.push_back(core_info.core_id);
}
}
}
if (num_perfer_cores == 0) {
GGML_ABORT("can not find core with arch id %x for SPACEMIT_PERFER_CORE_ARCH in core info list\n",
(uint16_t) perfer_core_arch_id);
}
}
if (perfer_core_id_vec.size() > 0) {
perfer_core_ids.clear();
cpu_mask = 0;
num_perfer_cores = 0;
for (int core_id : perfer_core_id_vec) {
if (core_id < 0 || core_id >= num_cores) {
GGML_ABORT("invalid core id in SPACEMIT_PERFER_CORE_ID: %d, should be between 0 and %d\n", core_id,
num_cores - 1);
}
auto core_info = core_info_list[core_id];
auto core_arch_id = core_info.arch_id;
if (core_arch_id == perfer_core_arch_id) {
cpu_mask |= (1ULL << core_id);
perfer_core_ids.push_back(core_id);
} else {
GGML_ABORT(
"core id %d in SPACEMIT_PERFER_CORE_ID has arch id %x which does not match "
"SPACEMIT_PERFER_CORE_ARCH %x\n",
core_id, (uint16_t) core_arch_id, (uint16_t) perfer_core_arch_id);
}
}
std::string perfer_core_id_vec_str;
for (int core_id : perfer_core_id_vec) {
perfer_core_id_vec_str += std::to_string(core_id) + ",";
}
perfer_core_id_vec_str.pop_back();
GGML_LOG_DEBUG("SPACEMIT_PERFER_CORE_ID is set, perferred core ids: %s\n", perfer_core_id_vec_str.c_str());
num_perfer_cores = static_cast<int>(perfer_core_id_vec.size());
}
use_ime1 = perfer_core_arch_id == spine_core_arch_id::core_arch_a60 ||
perfer_core_arch_id == spine_core_arch_id::core_arch_x100;
use_ime2 = perfer_core_arch_id == spine_core_arch_id::core_arch_a100;
mem_backend = parse_mem_backend(getenv("SPACEMIT_MEM_BACKEND"));
char * spine_disable_tcm_str = getenv("SPACEMIT_DISABLE_TCM");
auto user_disable_tcm = spine_disable_tcm_str != nullptr && strcmp(spine_disable_tcm_str, "0") != 0;
if (!user_disable_tcm) {
spine_mem_pool_tcm_info tcm_info;
if (spine_mem_pool_tcm_init(&tcm_info)) {
use_tcm = tcm_info.available;
tcm_blk_size = tcm_info.blk_size;
GGML_LOG_DEBUG("CPU_RISCV64_SPACEMIT: tcm is available, blk_size: %zu, blk_num: %zu, is_fake_tcm: %d\n",
tcm_info.blk_size, tcm_info.blk_num, tcm_info.is_fake_tcm);
for (auto & core_info : core_info_list) {
auto core_arch_head = (uint16_t) (core_info.arch_id) >> 12;
if (core_arch_head != 0xA) {
aicpu_id_offset++;
} else {
break;
}
}
}
}
GGML_LOG_DEBUG(
"CPU_RISCV64_SPACEMIT: num_cores: %d, num_perfer_cores: %d, perfer_core_arch_id: %x, exclude_main_thread: %d, "
"use_ime1: %d, use_ime2: %d, mem_backend: %s, cpu_mask: %lx, aicpu_id_offset: %d\n",
num_cores, num_perfer_cores, (uint16_t) perfer_core_arch_id, exclude_main_thread, use_ime1, use_ime2,
spine_mem_pool_backend_to_string(mem_backend), cpu_mask, aicpu_id_offset);
const size_t init_barrier_size = sizeof(spine_barrier_t) * spine_init_barrier_count;
init_barrier =
static_cast<spine_barrier_t *>(spine_mem_pool_shared_mem_alloc(init_barrier_size, alignof(spine_barrier_t)));
if (init_barrier != nullptr) {
init_barrier_is_shared_mem = true;
} else {
GGML_LOG_WARN("CPU_RISCV64_SPACEMIT: failed to allocate init_barrier from shared mem, falling back to heap\n",
__func__);
init_barrier = new spine_barrier_t[spine_init_barrier_count];
}
spine_barrier_init(init_barrier, spine_init_barrier_count, 2);
}
spine_env_info::~spine_env_info() {
if (init_barrier_is_shared_mem) {
spine_mem_pool_shared_mem_free(init_barrier);
} else {
delete[] init_barrier;
}
init_barrier = nullptr;
init_barrier_is_shared_mem = false;
}
spine_env_info global_spine_env_info;
} // namespace ggml::cpu::riscv64_spacemit
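The `core_arch_head == 0xA` checks above encode the convention that the top nibble of the arch id marks AI cores (see the `core_arch_a*` values in `ime_env.h`). A worked example under that assumption:

```cpp
// Worked example of the arch-id convention assumed above.
#include <cstdint>
#include <cstdio>
int main() {
    uint16_t arch = 0xA064;                          // core_arch_a100
    std::printf("head nibble: 0x%X\n", arch >> 12);  // prints 0xA -> counted as a preferred AI core
    return 0;
}
```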

View File

@@ -0,0 +1,55 @@
#pragma once
#include "spine_barrier.h"
#include "spine_mem_pool.h"
#include <cstddef>
#include <cstdint>
#include <vector>
namespace ggml::cpu::riscv64_spacemit {
constexpr uint64_t spine_invalid_core_id = 0xFFFFFFFF;
constexpr size_t spine_init_barrier_count = 16;
enum class spine_core_arch_id : uint16_t {
core_arch_none = 0,
core_arch_x60 = 0x503C,
core_arch_x100 = 0x5064,
core_arch_x200 = 0x50C8,
core_arch_a60 = 0xA03C,
core_arch_a100 = 0xA064,
core_arch_a200 = 0xA0C8,
};
struct spine_core_info {
uint64_t core_id{ spine_invalid_core_id };
spine_core_arch_id arch_id{ spine_core_arch_id::core_arch_none };
static bool get_spine_core_info(std::vector<spine_core_info> & result);
};
struct spine_env_info {
std::vector<spine_core_info> core_info_list;
std::vector<int> perfer_core_ids;
int aicpu_id_offset{ 0 };
int num_cores{ 0 };
int num_perfer_cores{ 0 };
spine_core_arch_id perfer_core_arch_id{ spine_core_arch_id::core_arch_none };
bool exclude_main_thread{ false };
bool use_ime2{ false };
bool use_ime1{ false };
bool use_tcm{ false };
spine_mem_pool_backend mem_backend{ spine_mem_pool_backend::transparent_hugepage };
uint64_t tcm_blk_size{ 0 };
uint64_t cpu_mask{ 0 };
spine_barrier_t * init_barrier{ nullptr };
bool init_barrier_is_shared_mem{ false };
spine_env_info();
~spine_env_info();
};
extern spine_env_info global_spine_env_info;
} // namespace ggml::cpu::riscv64_spacemit
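A plausible consumer of the global above, sketched from the `use_ime1`/`use_ime2` flags set in `ime_env.cpp` (the dispatch bodies are placeholders):

```cpp
// Hypothetical kernel-family dispatch based on the detected core architecture.
using ggml::cpu::riscv64_spacemit::global_spine_env_info;
void pick_kernels() {
    const auto & env = global_spine_env_info;
    if (env.use_ime2) {
        // dispatch to ime2 kernels (A100-class cores)
    } else if (env.use_ime1) {
        // dispatch to ime1 kernels (A60/X100-class cores)
    }
}
```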

View File

@@ -1,26 +1,189 @@
#pragma once
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
namespace spacemit_kernels {
#define BLOCK_QNK_LEN 256
template <int N> struct nrow_block_q2_k {
// [4bit scale + 4bit zp] * N * 16
uint8_t scales[N * BLOCK_QNK_LEN / 16];
// [b0, b16, b32, b48] [b1, b17, b33, b49] ... [b15, b31, b47, b63]
// [b64, b80, b96, b112] ...[b79, b95, b111, b127]
// [b128, b144, b160, b176] ...[b143, b159, b175, b191]
// [b192, b208, b224, b240] ...[b207, b223, b239, b255]
uint8_t qs[N * BLOCK_QNK_LEN / 4];
uint16_t scales16[N];
uint16_t zeros16[N];
};
template <int N> struct nrow_block_q3_k {
// [8bit scale] * N * 16
int8_t scales[N * 16];
// [b0, b1, b2, b3, b4, b5, b6, b7] ... [b248, b249, b250, b251, b252, b253, b254, b255]
uint8_t hmask[N * BLOCK_QNK_LEN / 8];
// [b0, b16, b32, b48] [b1, b17, b33, b49] ... [b15, b31, b47, b63]
// [b64, b80, b96, b112] ...[b79, b95, b111, b127]
// [b128, b144, b160, b176] ...[b143, b159, b175, b191]
// [b192, b208, b224, b240] ...[b207, b223, b239, b255]
uint8_t qs[N * BLOCK_QNK_LEN / 4];
uint16_t scales16[N];
};
template <int N> struct nrow_block_mxfp4 {
uint8_t e[N];
uint8_t qh[4 * N];
uint8_t qs[16 * N];
};
template <int N> struct __attribute__((packed)) nrow_block_q5_1 {
uint16_t scales16[N];
uint8_t zp[N];
// n0 [bh0, bh1, bh2, bh3, bh4, bh5, bh6, bh7] ....
uint8_t qh[4 * N];
// n0 [b0, b1], [b2, b3] .... [b30, b31]
// n1 [b0, b1], [b2, b3] .... [b30, b31]
uint8_t qs[16 * N];
};
static_assert(sizeof(nrow_block_q5_1<1>) == sizeof(uint8_t) + 22, "wrong nrow_block_q5_1 block size/padding");
template <int N> struct __attribute__((packed)) nrow_block_q5_0 {
uint16_t scales16[N];
// n0 [bh0, bh1, bh2, bh3, bh4, bh5, bh6, bh7] ....
uint8_t qh[4 * N];
// n0 [b0, b1], [b2, b3] .... [b30, b31]
// n1 [b0, b1], [b2, b3] .... [b30, b31]
uint8_t qs[16 * N];
};
static_assert(sizeof(nrow_block_q5_0<1>) == 22, "wrong nrow_block_q5_0 block size/padding");
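As a further sanity check on the packed layouts, the same arithmetic holds for N > 1 (a sketch, evaluated with N = 2; `__attribute__((packed))` guarantees no padding):

```cpp
// scales16 + zp + qh + qs, all per-row arrays scaled by N = 2.
static_assert(sizeof(spacemit_kernels::nrow_block_q5_1<2>) == 2*2 + 1*2 + 4*2 + 16*2, "q5_1 layout");
static_assert(sizeof(spacemit_kernels::nrow_block_q5_0<2>) == 2*2 + 4*2 + 16*2,       "q5_0 layout");
```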
using gemm_kernel_quantize_def = std::function<
size_t(size_t, const uint8_t *, const uint8_t *, const uint8_t *, float *, size_t, size_t, size_t, size_t)>;
using moe_gemm_kernel_quantize_def = std::function<
size_t(size_t, const uint8_t **, const uint8_t *, const uint8_t *, float **, size_t, size_t, size_t, size_t)>;
namespace sqnbitgemm_spacemit_ime {
namespace ime1 {
size_t gemm_kernel_i8i4(size_t blk_len,
const std::byte * quant_a_ptr,
const std::byte * quant_b_data,
const float * quant_b_scale,
const std::byte * quant_b_zp,
float * c_ptr,
size_t count_m,
size_t count_n,
size_t count_k,
size_t block_count_k,
size_t ldc,
const float * bias,
const size_t scale_stride);
size_t gemm_kernel_i8i4(size_t blk_len,
const uint8_t * quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float * c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
void quantize_a_row_i8(size_t blk_len, const float * a_ptr, size_t count_k, std::byte * quant_a_ptr);
void quantize_a_row_i8(size_t blk_len, const float * a_ptr, size_t count_k, uint8_t * quant_a_ptr);
void quantize_a_4row_i8(size_t blk_len, const float * a_ptr, size_t count_k, std::byte * quant_a_ptr);
void quantize_a_4row_i8(size_t blk_len, const float * a_ptr, size_t count_k, uint8_t * quant_a_ptr);
} // namespace ime1
} // namespace sqnbitgemm_spacemit_ime
namespace ime2 {
size_t gemm_kernel_i8i2k(size_t blk_len,
const uint8_t * quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float * c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
size_t gemm_kernel_i8i3k(size_t blk_len,
const uint8_t * quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float * c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
size_t gemm_kernel_i8i4(size_t blk_len,
const uint8_t * quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float * c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
size_t gemm_kernel_i8i4_hp(size_t blk_len,
const uint8_t * quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float * c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
size_t moe_m2_gemm_kernel_i8i4(size_t blk_len,
const uint8_t ** quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float ** c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
size_t gemm_kernel_i8i8(size_t blk_len,
const uint8_t * quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float * c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
size_t gemm_kernel_i8mxfp4(size_t blk_len,
const uint8_t * quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float * c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
size_t moe_m2_gemm_kernel_i8mxfp4(size_t blk_len,
const uint8_t ** quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float ** c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
size_t gemm_kernel_i8i5(size_t blk_len,
const uint8_t * quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float * c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
size_t moe_m2_gemm_kernel_i8i5(size_t blk_len,
const uint8_t ** quant_a_ptr,
const uint8_t * quant_b_data,
const uint8_t * quant_b_zp,
float ** c_ptr,
size_t count_m,
size_t count_n,
size_t k_blks,
size_t ldc);
} // namespace ime2
} // namespace spacemit_kernels

File diff suppressed because it is too large

View File

@@ -0,0 +1,14 @@
#pragma once
#include "ggml-common.h"
#include "ggml.h"
#include <cstddef>
#include <cstdint>
namespace ggml::cpu::riscv64_spacemit {
template <typename BLOC_TYPE, int64_t INTER_SIZE, int64_t NB_COLS>
int repack(ggml_tensor * t, const void * data, size_t data_size);
} // namespace ggml::cpu::riscv64_spacemit

File diff suppressed because it is too large

View File

@@ -0,0 +1,95 @@
#pragma once
#include "ggml-cpu-impl.h"
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
namespace spacemit_kernels {
constexpr auto div_round_up(auto up, auto down) {
return (up + down - 1) / down;
}
// Q8 Blk [f32] [s16] [int8 * blk_len]
// Q8 Blk N [f32 * N] [s16 * N] [int8 * blk_len * N]
constexpr size_t q8_blk_size(size_t blk_len, bool with_blk_sum = false) {
const size_t blk_size = sizeof(float) + blk_len * sizeof(int8_t) + (with_blk_sum ? sizeof(int16_t) : 0);
return blk_size;
}
// Q8 HP row block: K is split into K32 subblocks.
// Each subblock stores [f32 scale] [int8 * 32], with an optional fp16 sum trailer per subblock.
constexpr size_t q8_hp_blk_size(size_t blk_len, bool with_blk_sum = false, bool with_blk_scale = false) {
const size_t subblk_count = div_round_up(blk_len, size_t(32));
const size_t blk_size = blk_len * sizeof(int8_t) + subblk_count * sizeof(_Float16) +
(with_blk_sum ? subblk_count * sizeof(_Float16) : 0) +
(with_blk_scale ? sizeof(_Float16) : 0);
return blk_size;
}
// Q8K Blk [f32] [s16 * (blk_len / 16)] [int8 * blk_len]
// Q8K Blk N [f32 * N] [s16 * (blk_len / 16) * N] [int8 * blk_len * N]
constexpr size_t q8k_blk_size(size_t blk_len) {
const size_t blk_size = sizeof(float) + blk_len * sizeof(int8_t) + sizeof(int16_t) * blk_len / 16;
return blk_size;
}
using quantize_a_row_def = std::function<void(size_t, const float *, size_t, uint8_t *)>;
namespace rvv {
void memcpy1d(void * dst, const void * src, int64_t size);
void memcpy2d(void * dst, int64_t dst_stride, const void * src, int64_t src_stride, int64_t tile_rows, int64_t size);
void forward_flash_attn_ext_f16_one_chunk_vlen1024_vf16(const ggml_compute_params * params,
ggml_tensor * dst,
int ir0,
int ir1,
void * tcm_buffer,
size_t tcm_buffer_size);
void forward_flash_attn_ext_f16_tiled_vlen1024_vf16(const ggml_compute_params * params,
ggml_tensor * dst,
int ir0,
int ir1,
void * tcm_buffer,
size_t tcm_buffer_size);
void forward_rms_norm_f32(ggml_compute_params * params, ggml_tensor * op);
void forward_norm_f32(ggml_compute_params * params, ggml_tensor * op);
void forward_cont_with_permute(ggml_compute_params * params, ggml_tensor * op);
void forward_cpy_with_permute(ggml_compute_params * params, ggml_tensor * op);
template <typename T> void forward_get_rows(ggml_compute_params * params, ggml_tensor * op);
template <typename T> void forward_concat(ggml_compute_params * params, ggml_tensor * op);
template <ggml_op op_type, typename T> void forward_binary(ggml_compute_params * params, ggml_tensor * op);
template <typename T> void forward_sum_rows(const ggml_compute_params * params, ggml_tensor * op);
template <typename T> void forward_repeat_nrows(ggml_compute_params * params, ggml_tensor * op);
template <typename T> void forward_repeat_dim1(ggml_compute_params * params, ggml_tensor * op);
void quantize_a_row_i8(size_t blk_len, const float * a_ptr, size_t count_k, uint8_t * quant_a_ptr);
void quantize_a_4row_i8(size_t blk_len, const float * a_ptr, size_t count_k, uint8_t * quant_a_ptr);
void quantize_a_row_i8_hp(size_t blk_len, const float * a_ptr, size_t count_k, uint8_t * quant_a_ptr);
void quantize_a_4row_i8_hp(size_t blk_len, const float * a_ptr, size_t count_k, uint8_t * quant_a_ptr);
void quantize_a_row_i8k(size_t blk_len, const float * a_ptr, size_t count_k, uint8_t * quant_a_ptr);
void quantize_a_4row_i8k(size_t blk_len, const float * a_ptr, size_t count_k, uint8_t * quant_a_ptr);
} // namespace rvv
} // namespace spacemit_kernels
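The layout comments above fully determine the block sizes, so the `constexpr` helpers can be spot-checked at compile time. Worked values for a few inputs (a sketch; all optional flags off):

```cpp
// f32 scale + blk_len int8s
static_assert(spacemit_kernels::q8_blk_size(32) == 4 + 32, "q8 blk");
// blk_len int8s + one fp16 scale per K32 subblock
static_assert(spacemit_kernels::q8_hp_blk_size(32) == 32 + 2, "q8 hp blk");
// f32 scale + blk_len int8s + one int16 sum per 16 elements
static_assert(spacemit_kernels::q8k_blk_size(32) == 4 + 32 + 2 * (32 / 16), "q8k blk");
```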

View File

@@ -0,0 +1,34 @@
#pragma once
#include <atomic>
#include <cstdint>
#define SPINE_CACHE_LINE 64
#define SPINE_CACHE_ALIGN __attribute__((aligned(SPINE_CACHE_LINE)))
struct spine_barrier_t {
SPINE_CACHE_ALIGN std::atomic<int64_t> pending_;
SPINE_CACHE_ALIGN std::atomic<int64_t> rounds_;
SPINE_CACHE_ALIGN int64_t total_;
};
inline void spine_barrier_wait(spine_barrier_t * b) {
auto cur_round = b->rounds_.load(std::memory_order_acquire);
auto cnt = --b->pending_;
if (cnt == 0) {
b->pending_.store(b->total_);
b->rounds_.store(cur_round + 1);
} else {
while (cur_round == b->rounds_.load(std::memory_order_relaxed)) {
__asm__ volatile("pause " ::: "memory");
}
}
}
inline void spine_barrier_init(spine_barrier_t * b, int num_barriers, uint64_t thread_count) {
for (int i = 0; i < num_barriers; i++) {
b[i].total_ = thread_count;
b[i].pending_.store(thread_count);
b[i].rounds_.store(0);
}
}
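A minimal usage sketch of the sense-reversing barrier above (assumes this header is included): two threads meet at one barrier; the last arrival resets `pending_` and advances `rounds_`, releasing the spinner.

```cpp
#include <thread>
int main() {
    spine_barrier_t barrier;
    spine_barrier_init(&barrier, /*num_barriers=*/1, /*thread_count=*/2);
    std::thread t([&] { spine_barrier_wait(&barrier); });
    spine_barrier_wait(&barrier);  // returns once both threads have arrived
    t.join();
    return 0;
}
```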

View File

@@ -0,0 +1,760 @@
#include "spine_mem_pool.h"
#include "common.h"
#include "ime_env.h"
#include "spine_tcm.h"
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <algorithm>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <memory>
#include <mutex>
#include <new>
#include <unordered_map>
#include <vector>
namespace ggml::cpu::riscv64_spacemit {
namespace {
constexpr size_t SPINE_MEM_POOL_CHUNK_SIZE = 512ull * 1024ull * 1024ull;
constexpr size_t SPINE_SHARE_MEM_POOL_CHUNK_SIZE = 512ull * 1024ull;
constexpr size_t SPINE_MEM_POOL_1G_REGION_SIZE = 1ull << 30;
constexpr uint64_t HUGETLB_1G_FLAG_REQUIRE_PUD = 1ull << 0;
constexpr char SPINE_MEM_POOL_HUGETLB_1G_DEV[] = "/dev/hugetlb_1g";
constexpr char SPINE_MEM_POOL_TCM_SYNC_MEM_DEV[] = "/dev/tcm_sync_mem";
struct hugetlb_1g_region {
uint64_t size{ 0 };
uint64_t dma_addr{ 0 };
uint64_t flags{ 0 };
uint64_t reserved{ 0 };
};
#define HUGETLB_1G_IOC_MAGIC 'M'
#define HUGETLB_1G_IOC_ALLOC _IOWR(HUGETLB_1G_IOC_MAGIC, 0x00, struct hugetlb_1g_region)
#define HUGETLB_1G_IOC_FREE _IO(HUGETLB_1G_IOC_MAGIC, 0x01)
struct free_block {
size_t offset{ 0 };
size_t size{ 0 };
};
struct pool_chunk {
uint8_t * base{ nullptr };
size_t size{ 0 };
int fd{ -1 };
std::vector<free_block> free_blocks;
};
struct pool_allocation {
void * chunk_base{ nullptr };
size_t chunk_size{ 0 };
void * base{ nullptr };
size_t size{ 0 };
};
bool is_power_of_two(size_t value) {
return value != 0 && (value & (value - 1)) == 0;
}
bool align_up(size_t value, size_t alignment, size_t * aligned_value) {
if (aligned_value == nullptr || alignment == 0) {
return false;
}
const size_t remainder = value % alignment;
if (remainder == 0) {
*aligned_value = value;
return true;
}
const size_t padding = alignment - remainder;
if (value > std::numeric_limits<size_t>::max() - padding) {
return false;
}
*aligned_value = value + padding;
return true;
}
bool align_up_uintptr(uintptr_t value, size_t alignment, uintptr_t * aligned_value) {
if (aligned_value == nullptr || alignment == 0) {
return false;
}
const uintptr_t remainder = value % alignment;
if (remainder == 0) {
*aligned_value = value;
return true;
}
const uintptr_t padding = alignment - remainder;
if (value > std::numeric_limits<uintptr_t>::max() - padding) {
return false;
}
*aligned_value = value + padding;
return true;
}
class spine_mem_pool_manager {
public:
explicit spine_mem_pool_manager(size_t default_chunk_size) : default_chunk_size_(default_chunk_size) {}
virtual ~spine_mem_pool_manager() = default;
void * alloc(size_t size, size_t alignment) {
if (size == 0 || !is_power_of_two(alignment)) {
return nullptr;
}
size_t aligned_size = 0;
if (!align_up(size, alignment, &aligned_size)) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: align_up failed for size %zu alignment %zu\n", __func__, size,
alignment);
return nullptr;
}
pool_allocation allocation;
std::lock_guard<std::mutex> lock(mutex_);
if (!try_alloc_locked(aligned_size, alignment, &allocation)) {
if (!add_chunk_locked(aligned_size, alignment)) {
return nullptr;
}
if (!try_alloc_locked(aligned_size, alignment, &allocation)) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: allocation retry failed for size %zu alignment %zu\n",
__func__, aligned_size, alignment);
return nullptr;
}
}
try {
const auto [allocation_it, inserted] = allocations_.emplace(allocation.base, allocation);
if (!inserted) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: duplicate allocation key %p\n", __func__, allocation.base);
rollback_allocation_locked(allocation);
return nullptr;
}
} catch (const std::bad_alloc &) {
rollback_allocation_locked(allocation);
throw;
}
return allocation.base;
}
void free(void * base) {
if (base == nullptr) {
return;
}
std::lock_guard<std::mutex> lock(mutex_);
auto allocation_it = allocations_.find(base);
if (allocation_it == allocations_.end()) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: unknown allocation %p\n", __func__, base);
return;
}
pool_allocation allocation = allocation_it->second;
allocations_.erase(allocation_it);
auto chunk_it = find_chunk_locked(allocation);
if (chunk_it == chunks_.end()) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: unknown chunk for allocation %p size %zu\n", __func__,
allocation.base, allocation.size);
return;
}
auto * chunk_base = chunk_it->base;
auto * alloc_base = static_cast<uint8_t *>(allocation.base);
if (alloc_base < chunk_base || alloc_base >= chunk_base + chunk_it->size) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: allocation %p out of chunk range %p..%p\n", __func__,
allocation.base, chunk_base, chunk_base + chunk_it->size);
return;
}
const size_t offset = static_cast<size_t>(alloc_base - chunk_base);
if (offset > chunk_it->size || allocation.size > chunk_it->size - offset) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: allocation %p size %zu exceeds chunk size %zu\n", __func__,
allocation.base, allocation.size, chunk_it->size);
return;
}
insert_free_block_locked(*chunk_it, { offset, allocation.size });
maybe_release_empty_chunk_locked(chunk_it);
}
protected:
void release_chunks() {
std::lock_guard<std::mutex> lock(mutex_);
allocations_.clear();
for (auto & chunk : chunks_) {
dealloc_chunk(&chunk);
}
chunks_.clear();
}
size_t default_chunk_size() const { return default_chunk_size_; }
static void clear_chunk(pool_chunk * chunk) {
chunk->base = nullptr;
chunk->size = 0;
chunk->fd = -1;
chunk->free_blocks.clear();
}
virtual bool alloc_chunk(size_t min_size, size_t alignment, void * hint_addr, pool_chunk * chunk) = 0;
virtual void dealloc_chunk(pool_chunk * chunk) = 0;
private:
struct alloc_candidate {
size_t chunk_index{ 0 };
size_t block_index{ 0 };
size_t aligned_offset{ 0 };
uintptr_t address{ std::numeric_limits<uintptr_t>::max() };
bool valid{ false };
};
std::vector<pool_chunk>::iterator find_chunk_locked(const pool_allocation & allocation) {
return std::find_if(chunks_.begin(), chunks_.end(), [&](const pool_chunk & chunk) {
return chunk.base == allocation.chunk_base && chunk.size == allocation.chunk_size;
});
}
bool add_chunk_locked(size_t min_size, size_t alignment) {
pool_chunk chunk;
const size_t chunk_request = default_chunk_size_ == 0 ? min_size : std::max(min_size, default_chunk_size_);
void * hint_addr = nullptr;
for (const auto & existing_chunk : chunks_) {
auto * chunk_end = existing_chunk.base + existing_chunk.size;
if (hint_addr == nullptr || chunk_end > hint_addr) {
hint_addr = chunk_end;
}
}
if (!alloc_chunk(chunk_request, alignment, hint_addr, &chunk)) {
return false;
}
if (chunk.base == nullptr || chunk.size < min_size) {
GGML_LOG_ERROR(
"CPU_RISCV64_SPACEMIT: %s: invalid chunk returned for request size %zu, chunk_base=%p chunk_size=%zu\n",
__func__, min_size, chunk.base, chunk.size);
dealloc_chunk(&chunk);
return false;
}
try {
chunk.free_blocks.push_back({ 0, chunk.size });
chunks_.push_back(std::move(chunk));
} catch (const std::bad_alloc &) {
dealloc_chunk(&chunk);
throw;
}
return true;
}
void rollback_allocation_locked(const pool_allocation & allocation) {
auto chunk_it = find_chunk_locked(allocation);
if (chunk_it == chunks_.end()) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: failed to rollback allocation %p, owning chunk not found\n",
__func__, allocation.base);
return;
}
auto * chunk_base = chunk_it->base;
auto * alloc_base = static_cast<uint8_t *>(allocation.base);
if (alloc_base < chunk_base || alloc_base >= chunk_base + chunk_it->size) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: failed to rollback allocation %p, chunk range is invalid\n",
__func__, allocation.base);
return;
}
const size_t offset = static_cast<size_t>(alloc_base - chunk_base);
if (offset > chunk_it->size || allocation.size > chunk_it->size - offset) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: failed to rollback allocation %p size %zu\n", __func__,
allocation.base, allocation.size);
return;
}
insert_free_block_locked(*chunk_it, { offset, allocation.size });
maybe_release_empty_chunk_locked(chunk_it);
}
bool try_alloc_locked(size_t size, size_t alignment, pool_allocation * allocation) {
alloc_candidate best;
for (size_t chunk_index = 0; chunk_index < chunks_.size(); ++chunk_index) {
const auto & chunk = chunks_[chunk_index];
for (size_t block_index = 0; block_index < chunk.free_blocks.size(); ++block_index) {
const auto & block = chunk.free_blocks[block_index];
uintptr_t aligned_addr = 0;
const auto block_addr = reinterpret_cast<uintptr_t>(chunk.base + block.offset);
if (!align_up_uintptr(block_addr, alignment, &aligned_addr)) {
continue;
}
if (aligned_addr < block_addr) {
continue;
}
const size_t aligned_offset = block.offset + static_cast<size_t>(aligned_addr - block_addr);
const size_t padding = aligned_offset - block.offset;
if (padding > block.size || size > block.size - padding) {
continue;
}
if (!best.valid || aligned_addr < best.address) {
best.chunk_index = chunk_index;
best.block_index = block_index;
best.aligned_offset = aligned_offset;
best.address = aligned_addr;
best.valid = true;
}
}
}
if (!best.valid) {
return false;
}
auto & chunk = chunks_[best.chunk_index];
const free_block block = chunk.free_blocks[best.block_index];
const size_t padding = best.aligned_offset - block.offset;
const size_t alloc_end = best.aligned_offset + size;
const size_t block_end = block.offset + block.size;
chunk.free_blocks.erase(chunk.free_blocks.begin() + best.block_index);
auto insert_it = chunk.free_blocks.begin() + best.block_index;
if (padding != 0) {
insert_it = chunk.free_blocks.insert(insert_it, { block.offset, padding });
++insert_it;
}
if (alloc_end < block_end) {
chunk.free_blocks.insert(insert_it, { alloc_end, block_end - alloc_end });
}
allocation->chunk_base = chunk.base;
allocation->chunk_size = chunk.size;
allocation->base = chunk.base + best.aligned_offset;
allocation->size = size;
return true;
}
void maybe_release_empty_chunk_locked(std::vector<pool_chunk>::iterator chunk_it) {
if (chunk_it->free_blocks.size() != 1) {
return;
}
const auto & block = chunk_it->free_blocks.front();
if (block.offset != 0 || block.size != chunk_it->size) {
return;
}
dealloc_chunk(&*chunk_it);
chunks_.erase(chunk_it);
}
void insert_free_block_locked(pool_chunk & chunk, free_block block) {
auto it = chunk.free_blocks.begin();
while (it != chunk.free_blocks.end() && it->offset < block.offset) {
++it;
}
if (it != chunk.free_blocks.begin()) {
const auto & prev = *(it - 1);
if (prev.offset + prev.size > block.offset) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: overlapping free block at offset %zu size %zu\n", __func__,
block.offset, block.size);
return;
}
}
if (it != chunk.free_blocks.end() && block.offset + block.size > it->offset) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: overlapping next free block at offset %zu size %zu\n", __func__,
block.offset, block.size);
return;
}
it = chunk.free_blocks.insert(it, block);
if (it != chunk.free_blocks.begin()) {
auto prev = it - 1;
if (prev->offset + prev->size == it->offset) {
it->offset = prev->offset;
it->size += prev->size;
it = chunk.free_blocks.erase(prev);
}
}
if (it + 1 != chunk.free_blocks.end() && it->offset + it->size == (it + 1)->offset) {
it->size += (it + 1)->size;
chunk.free_blocks.erase(it + 1);
}
}
std::mutex mutex_;
std::vector<pool_chunk> chunks_;
std::unordered_map<void *, pool_allocation> allocations_;
size_t default_chunk_size_{ 0 };
};
class spine_mem_pool_posix final : public spine_mem_pool_manager {
public:
spine_mem_pool_posix() : spine_mem_pool_manager(0) {}
~spine_mem_pool_posix() override { release_chunks(); }
private:
bool alloc_chunk(size_t min_size, size_t alignment, void * hint_addr, pool_chunk * chunk) override {
(void) hint_addr;
const size_t alloc_alignment = std::max(alignment, sizeof(void *));
void * base = nullptr;
const int rc = posix_memalign(&base, alloc_alignment, min_size);
if (rc != 0) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: posix_memalign failed for size %zu alignment %zu, rc=%d\n",
__func__, min_size, alloc_alignment, rc);
return false;
}
chunk->base = static_cast<uint8_t *>(base);
chunk->size = min_size;
chunk->fd = -1;
return true;
}
void dealloc_chunk(pool_chunk * chunk) override {
std::free(chunk->base);
clear_chunk(chunk);
}
};
class spine_mem_pool_transparent_hugepage final : public spine_mem_pool_manager {
public:
spine_mem_pool_transparent_hugepage() : spine_mem_pool_manager(SPINE_MEM_POOL_CHUNK_SIZE) {}
~spine_mem_pool_transparent_hugepage() override { release_chunks(); }
private:
bool alloc_chunk(size_t min_size, size_t alignment, void * hint_addr, pool_chunk * chunk) override {
(void) alignment;
size_t chunk_size = 0;
if (!align_up(min_size, default_chunk_size(), &chunk_size)) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: failed to round chunk size for %zu\n", __func__, min_size);
return false;
}
void * map_addr = mmap(hint_addr, chunk_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (map_addr == MAP_FAILED) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: mmap failed for chunk size %zu, errno=%d\n", __func__, chunk_size,
errno);
return false;
}
if (madvise(map_addr, chunk_size, MADV_HUGEPAGE) != 0) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: madvise(MADV_HUGEPAGE) failed for chunk size %zu, errno=%d\n",
__func__, chunk_size, errno);
munmap(map_addr, chunk_size);
return false;
}
chunk->base = static_cast<uint8_t *>(map_addr);
chunk->size = chunk_size;
chunk->fd = -1;
return true;
}
void dealloc_chunk(pool_chunk * chunk) override {
if (chunk->base != nullptr && chunk->size != 0 && munmap(chunk->base, chunk->size) != 0) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: munmap failed for chunk %p size %zu, errno=%d\n", __func__,
chunk->base, chunk->size, errno);
}
clear_chunk(chunk);
}
};
class spine_mem_pool_hugetlb_1g final : public spine_mem_pool_manager {
public:
spine_mem_pool_hugetlb_1g() : spine_mem_pool_manager(SPINE_MEM_POOL_1G_REGION_SIZE) {}
~spine_mem_pool_hugetlb_1g() override { release_chunks(); }
private:
bool alloc_chunk(size_t min_size, size_t alignment, void * hint_addr, pool_chunk * chunk) override {
(void) alignment;
(void) hint_addr;
size_t region_size = 0;
if (!align_up(min_size, SPINE_MEM_POOL_1G_REGION_SIZE, &region_size)) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: failed to round hugetlb_1g size for %zu\n", __func__, min_size);
return false;
}
const int fd = open(SPINE_MEM_POOL_HUGETLB_1G_DEV, O_RDWR);
if (fd < 0) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: open(%s) failed, errno=%d\n", __func__,
SPINE_MEM_POOL_HUGETLB_1G_DEV, errno);
return false;
}
hugetlb_1g_region region;
region.size = region_size;
region.flags = HUGETLB_1G_FLAG_REQUIRE_PUD;
if (ioctl(fd, HUGETLB_1G_IOC_ALLOC, &region) < 0) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: HUGETLB_1G_IOC_ALLOC failed for size %zu, errno=%d\n", __func__,
region_size, errno);
close(fd);
return false;
}
void * map_addr = mmap(nullptr, region.size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (map_addr == MAP_FAILED) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: mmap failed for hugetlb_1g size %llu, errno=%d\n", __func__,
static_cast<unsigned long long>(region.size), errno);
ioctl(fd, HUGETLB_1G_IOC_FREE);
close(fd);
return false;
}
chunk->base = static_cast<uint8_t *>(map_addr);
chunk->size = region.size;
chunk->fd = fd;
return true;
}
void dealloc_chunk(pool_chunk * chunk) override {
if (chunk->base != nullptr && chunk->size != 0 && munmap(chunk->base, chunk->size) != 0) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: munmap failed for hugetlb_1g chunk %p size %zu, errno=%d\n",
__func__, chunk->base, chunk->size, errno);
}
if (chunk->fd >= 0) {
if (ioctl(chunk->fd, HUGETLB_1G_IOC_FREE) < 0) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: HUGETLB_1G_IOC_FREE failed for chunk %p, errno=%d\n",
__func__, chunk->base, errno);
}
close(chunk->fd);
}
clear_chunk(chunk);
}
};
class spine_mem_pool_shared_mem final : public spine_mem_pool_manager {
public:
spine_mem_pool_shared_mem() : spine_mem_pool_manager(SPINE_SHARE_MEM_POOL_CHUNK_SIZE) {}
~spine_mem_pool_shared_mem() override { release_chunks(); }
private:
bool alloc_chunk(size_t min_size, size_t alignment, void * hint_addr, pool_chunk * chunk) override {
(void) alignment;
if (hint_addr != nullptr) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: shared_mem does not support multiple active chunks\n", __func__);
return false;
}
if (min_size > default_chunk_size()) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: shared_mem request %zu exceeds chunk size %zu\n", __func__,
min_size, default_chunk_size());
return false;
}
const int fd = open(SPINE_MEM_POOL_TCM_SYNC_MEM_DEV, O_RDWR | O_SYNC);
if (fd < 0) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: open(%s) failed, errno=%d\n", __func__,
SPINE_MEM_POOL_TCM_SYNC_MEM_DEV, errno);
return false;
}
void * map_addr = mmap(nullptr, default_chunk_size(), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (map_addr == MAP_FAILED) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: mmap failed for %s size %zu, errno=%d\n", __func__,
SPINE_MEM_POOL_TCM_SYNC_MEM_DEV, default_chunk_size(), errno);
close(fd);
return false;
}
chunk->base = static_cast<uint8_t *>(map_addr);
chunk->size = default_chunk_size();
chunk->fd = fd;
return true;
}
void dealloc_chunk(pool_chunk * chunk) override {
if (chunk->base != nullptr && chunk->size != 0 && munmap(chunk->base, chunk->size) != 0) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: munmap failed for shared_mem chunk %p size %zu, errno=%d\n",
__func__, chunk->base, chunk->size, errno);
}
if (chunk->fd >= 0) {
close(chunk->fd);
}
clear_chunk(chunk);
}
};
spine_mem_pool_manager & get_spine_mem_pool_manager() {
static std::once_flag pool_once;
static std::unique_ptr<spine_mem_pool_manager> selected_pool;
static spine_mem_pool_backend selected_backend = spine_mem_pool_backend::none;
spine_mem_pool_backend backend = global_spine_env_info.mem_backend;
if (backend == spine_mem_pool_backend::none) {
backend = spine_mem_pool_backend::transparent_hugepage;
}
std::call_once(pool_once, [&]() {
selected_backend = backend;
switch (selected_backend) {
case spine_mem_pool_backend::posix_memalign:
selected_pool = std::make_unique<spine_mem_pool_posix>();
break;
case spine_mem_pool_backend::transparent_hugepage:
selected_pool = std::make_unique<spine_mem_pool_transparent_hugepage>();
break;
case spine_mem_pool_backend::hugetlb_1g:
selected_pool = std::make_unique<spine_mem_pool_hugetlb_1g>();
break;
case spine_mem_pool_backend::none:
selected_backend = spine_mem_pool_backend::transparent_hugepage;
selected_pool = std::make_unique<spine_mem_pool_transparent_hugepage>();
break;
}
});
if (backend != selected_backend) {
GGML_LOG_ERROR(
"CPU_RISCV64_SPACEMIT: %s: mem pool backend is process-global and mutually exclusive, requested=%d but "
"selected=%d\n",
__func__, static_cast<int>(backend), static_cast<int>(selected_backend));
}
if (selected_pool) {
return *selected_pool;
}
throw std::bad_alloc();
}
spine_mem_pool_manager & get_spine_mem_pool_shared_mem_manager() {
static std::once_flag shared_mem_pool_once;
static std::unique_ptr<spine_mem_pool_shared_mem> shared_mem_pool;
std::call_once(shared_mem_pool_once, [&]() { shared_mem_pool = std::make_unique<spine_mem_pool_shared_mem>(); });
if (shared_mem_pool) {
return *shared_mem_pool;
}
throw std::bad_alloc();
}
} // namespace
bool spine_mem_pool_tcm_init(spine_mem_pool_tcm_info * info) noexcept {
if (info == nullptr) {
return false;
}
*info = {};
if (spine_tcm_open_handle(NULL) != 0 || !spine_tcm_is_available()) {
return false;
}
spine_tcm_mem_info_t mem_info;
if (spine_tcm_mem_info(&mem_info) != 0) {
return false;
}
info->available = true;
info->blk_size = mem_info.blk_size;
info->blk_num = mem_info.blk_num;
info->is_fake_tcm = mem_info.is_fake_tcm != 0;
return true;
}
void * spine_mem_pool_tcm_mem_get(int cpu_id) noexcept {
return spine_tcm_mem_get(cpu_id);
}
void * spine_mem_pool_tcm_mem_wait(int cpu_id) noexcept {
return spine_tcm_mem_try_wait(cpu_id, 1000 * 1000);
}
int spine_mem_pool_tcm_mem_release(int cpu_id) noexcept {
return spine_tcm_mem_release(cpu_id);
}
void * spine_mem_pool_alloc(size_t size, size_t alignment) noexcept {
try {
return get_spine_mem_pool_manager().alloc(size, alignment);
} catch (const std::bad_alloc &) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: bad_alloc while allocating size %zu\n", __func__, size);
return nullptr;
}
}
void * spine_mem_pool_shared_mem_alloc(size_t size, size_t alignment) noexcept {
try {
return get_spine_mem_pool_shared_mem_manager().alloc(size, alignment);
} catch (const std::bad_alloc &) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: bad_alloc while allocating shared memory size %zu\n", __func__, size);
return nullptr;
}
}
void spine_mem_pool_free(void * base) noexcept {
try {
get_spine_mem_pool_manager().free(base);
} catch (const std::bad_alloc &) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: bad_alloc while freeing allocation %p\n", __func__, base);
}
}
void spine_mem_pool_shared_mem_free(void * base) noexcept {
try {
get_spine_mem_pool_shared_mem_manager().free(base);
} catch (const std::bad_alloc &) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: bad_alloc while freeing shared allocation %p\n", __func__, base);
}
}
} // namespace ggml::cpu::riscv64_spacemit
extern "C" {
void * ggml_backend_cpu_riscv64_spacemit_alloc_shared(size_t size, size_t alignment) {
void * result = ggml::cpu::riscv64_spacemit::spine_mem_pool_shared_mem_alloc(size, alignment);
if (result == nullptr) {
GGML_LOG_ERROR("CPU_RISCV64_SPACEMIT: %s: failed to allocate shared memory size %zu alignment %zu\n", __func__,
size, alignment);
}
return result;
}
void ggml_backend_cpu_riscv64_spacemit_free_shared(void * ptr) {
ggml::cpu::riscv64_spacemit::spine_mem_pool_shared_mem_free(ptr);
}
}
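
To make the exported surface concrete, here is a minimal caller-side sketch; the buffer size, alignment, and error handling are illustrative assumptions rather than values the backend requires:

#include <cstddef>
#include <cstdio>

extern "C" {
// Prototypes as exported by the translation unit above.
void * ggml_backend_cpu_riscv64_spacemit_alloc_shared(size_t size, size_t alignment);
void   ggml_backend_cpu_riscv64_spacemit_free_shared(void * ptr);
}

int main() {
    // Request a 1 MiB scratch buffer from the process-global shared-memory pool.
    void * buf = ggml_backend_cpu_riscv64_spacemit_alloc_shared(1u << 20, 64);
    if (buf == nullptr) {
        // The pool already logged the failure; the caller only sees nullptr.
        std::fprintf(stderr, "shared-memory allocation failed\n");
        return 1;
    }
    // ... stage tensor data in buf ...
    ggml_backend_cpu_riscv64_spacemit_free_shared(buf);
    return 0;
}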


@@ -0,0 +1,32 @@
#pragma once
#include <cstddef>
#include <cstdint>
namespace ggml::cpu::riscv64_spacemit {
enum class spine_mem_pool_backend : uint8_t {
none,
posix_memalign,
transparent_hugepage,
hugetlb_1g,
};
struct spine_mem_pool_tcm_info {
bool available{ false };
size_t blk_size{ 0 };
size_t blk_num{ 0 };
bool is_fake_tcm{ false };
};
bool spine_mem_pool_tcm_init(spine_mem_pool_tcm_info * info) noexcept;
void * spine_mem_pool_tcm_mem_get(int cpu_id) noexcept;
void * spine_mem_pool_tcm_mem_wait(int cpu_id) noexcept;
int spine_mem_pool_tcm_mem_release(int cpu_id) noexcept;
void * spine_mem_pool_alloc(size_t size, size_t alignment) noexcept;
void * spine_mem_pool_shared_mem_alloc(size_t size, size_t alignment) noexcept;
void spine_mem_pool_free(void * base) noexcept;
void spine_mem_pool_shared_mem_free(void * base) noexcept;
} // namespace ggml::cpu::riscv64_spacemit
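
A hedged sketch of the TCM handoff flow declared above, pairing the blocking wait with release, which is the pairing the underlying spine_tcm API documents; the header name and cpu_id value are assumptions for illustration:

#include "riscv64-spacemit-mem-pool.h" // hypothetical name for the header above

using namespace ggml::cpu::riscv64_spacemit;

bool run_on_tcm(int cpu_id) {
    spine_mem_pool_tcm_info info;
    if (!spine_mem_pool_tcm_init(&info) || !info.available) {
        return false; // TCM driver missing or unusable
    }
    // info.blk_size and info.blk_num describe the per-core TCM geometry;
    // info.is_fake_tcm signals an emulated backend.
    // spine_mem_pool_tcm_mem_get() is the non-blocking variant; the wait
    // wrapper blocks up to the built-in 1 s timeout.
    void * blk = spine_mem_pool_tcm_mem_wait(cpu_id);
    if (blk == nullptr) {
        return false;
    }
    // ... stage hot data into blk ...
    spine_mem_pool_tcm_mem_release(cpu_id);
    return true;
}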


@@ -0,0 +1,409 @@
#ifndef SPINE_TCM_PUBLIC_H_
#define SPINE_TCM_PUBLIC_H_
/*
* spine_tcm public API
*
* Usage:
* 1. Direct link mode
* Define SPINE_TCM_DIRECT_LINK and link against libspine_tcm.so.
*
* if (spine_tcm_is_available()) {
* void *buffer = spine_tcm_mem_get(0);
* spine_tcm_mem_free(0);
* }
*
* 2. Header-only loader mode
* Include this header without linking libspine_tcm.so. The loader first
* tries to reuse a process-global spine_tcm instance and falls back to
* dlopen("libspine_tcm.so") when needed.
*
* spine_tcm_open_handle(NULL); // optional pre-bind
* if (spine_tcm_is_available()) {
* void *buffer = spine_tcm_mem_get(0);
* spine_tcm_mem_free(0);
* }
*/
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if !defined(SPINE_TCM_BUILD_SHARED) && !defined(SPINE_TCM_DIRECT_LINK)
# include <dlfcn.h>
#endif
#ifdef __cplusplus
extern "C" {
#endif
#if defined(_WIN32)
# if defined(SPINE_TCM_BUILD_SHARED)
# define SPINE_TCM_API __declspec(dllexport)
# else
# define SPINE_TCM_API __declspec(dllimport)
# endif
#else
# define SPINE_TCM_API __attribute__((visibility("default")))
#endif
typedef struct spine_tcm_mem_info {
size_t blk_size;
size_t blk_num;
int is_fake_tcm;
} spine_tcm_mem_info_t;
typedef struct spine_tcm_block_info {
int id;
void * va;
size_t size;
uint64_t phys_addr;
uint64_t cpu_affinity_mask;
int owner_tid;
int is_acquired;
} spine_tcm_block_info_t;
/* Shared-library runtime ABI exported by libspine_tcm.so. */
SPINE_TCM_API const char * spine_tcm_runtime_version(void);
SPINE_TCM_API int spine_tcm_runtime_is_available(void);
SPINE_TCM_API int spine_tcm_runtime_layout_info(spine_tcm_mem_info_t * info);
SPINE_TCM_API int spine_tcm_runtime_mem_info(int id, spine_tcm_block_info_t * info);
SPINE_TCM_API void * spine_tcm_runtime_mem_get(int id);
SPINE_TCM_API int spine_tcm_runtime_mem_free(int id);
SPINE_TCM_API void * spine_tcm_runtime_mem_try_wait(int id, size_t timeout_us);
SPINE_TCM_API int spine_tcm_runtime_mem_release(int id);
SPINE_TCM_API int spine_tcm_runtime_mem_force_release(int id);
SPINE_TCM_API int spine_tcm_runtime_mem_query(int id);
#if defined(SPINE_TCM_DIRECT_LINK)
/* Pre-bind entry point; a no-op in direct-link mode. */
static inline int spine_tcm_open_handle(const char * so_path) {
(void) so_path;
return 0;
}
static inline const char * spine_tcm_version(void) {
return spine_tcm_runtime_version();
}
/* Returns 1 when the runtime driver is available, otherwise 0. */
static inline int spine_tcm_is_available(void) {
return spine_tcm_runtime_is_available();
}
/* Returns runtime memory geometry and whether the current backend is fake TCM. */
static inline int spine_tcm_mem_info(spine_tcm_mem_info_t * info) {
return spine_tcm_runtime_layout_info(info);
}
/* Returns per-block runtime metadata for the given TCM id. */
static inline int spine_tcm_block_info(int id, spine_tcm_block_info_t * info) {
return spine_tcm_runtime_mem_info(id, info);
}
/* Returns a cached buffer for the given TCM id, or NULL on failure. */
static inline void * spine_tcm_mem_get(int id) {
return spine_tcm_runtime_mem_get(id);
}
/* Releases one reference acquired by spine_tcm_mem_get(id). */
static inline int spine_tcm_mem_free(int id) {
return spine_tcm_runtime_mem_free(id);
}
/* Waits for a TCM block handoff and returns the driver-owned buffer when available. */
static inline void * spine_tcm_mem_try_wait(int id, size_t over_time) {
return spine_tcm_runtime_mem_try_wait(id, over_time);
}
/* Releases a buffer acquired by spine_tcm_mem_try_wait(id, over_time). */
static inline int spine_tcm_mem_release(int id) {
return spine_tcm_runtime_mem_release(id);
}
/* Forces a release for the given TCM id when the backend supports it. */
static inline int spine_tcm_mem_force_release(int id) {
return spine_tcm_runtime_mem_force_release(id);
}
/* Returns whether the given TCM id is currently acquired. */
static inline int spine_tcm_mem_query(int id) {
return spine_tcm_runtime_mem_query(id);
}
#elif !defined(SPINE_TCM_BUILD_SHARED)
typedef struct spine_tcm_handle {
void * module_handle;
int use_global_scope;
int owns_module_handle;
const char * (*runtime_version)(void);
int (*runtime_is_available)(void);
int (*runtime_layout_info)(spine_tcm_mem_info_t * info);
int (*runtime_mem_info)(int id, spine_tcm_block_info_t * info);
void * (*runtime_mem_get)(int id);
int (*runtime_mem_free)(int id);
void * (*runtime_mem_try_wait)(int id, size_t over_time);
int (*runtime_mem_release)(int id);
int (*runtime_mem_force_release)(int id);
int (*runtime_mem_query)(int id);
} spine_tcm_handle_t;
static inline spine_tcm_handle_t * spine_tcm_default_handle(void) {
static spine_tcm_handle_t handle = { 0 };
return &handle;
}
static inline void spine_tcm_handle_reset(spine_tcm_handle_t * handle) {
if (handle != NULL) {
memset(handle, 0, sizeof(*handle));
}
}
static inline int spine_tcm_handle_bind(spine_tcm_handle_t * handle) {
void * symbol_scope = handle->use_global_scope ? RTLD_DEFAULT : handle->module_handle;
handle->runtime_version = (const char * (*) (void) ) dlsym(symbol_scope, "spine_tcm_runtime_version");
handle->runtime_is_available = (int (*)(void)) dlsym(symbol_scope, "spine_tcm_runtime_is_available");
handle->runtime_layout_info =
(int (*)(spine_tcm_mem_info_t *)) dlsym(symbol_scope, "spine_tcm_runtime_layout_info");
handle->runtime_mem_info =
(int (*)(int, spine_tcm_block_info_t *)) dlsym(symbol_scope, "spine_tcm_runtime_mem_info");
handle->runtime_mem_get = (void * (*) (int) ) dlsym(symbol_scope, "spine_tcm_runtime_mem_get");
handle->runtime_mem_free = (int (*)(int)) dlsym(symbol_scope, "spine_tcm_runtime_mem_free");
handle->runtime_mem_try_wait = (void * (*) (int, size_t)) dlsym(symbol_scope, "spine_tcm_runtime_mem_try_wait");
handle->runtime_mem_release = (int (*)(int)) dlsym(symbol_scope, "spine_tcm_runtime_mem_release");
handle->runtime_mem_force_release = (int (*)(int)) dlsym(symbol_scope, "spine_tcm_runtime_mem_force_release");
handle->runtime_mem_query = (int (*)(int)) dlsym(symbol_scope, "spine_tcm_runtime_mem_query");
return handle->runtime_version != NULL && handle->runtime_is_available != NULL &&
handle->runtime_layout_info != NULL && handle->runtime_mem_info != NULL &&
handle->runtime_mem_get != NULL && handle->runtime_mem_free != NULL &&
handle->runtime_mem_try_wait != NULL && handle->runtime_mem_release != NULL &&
handle->runtime_mem_force_release != NULL && handle->runtime_mem_query != NULL ?
0 :
-1;
}
/*
* Try to bind against an already-loaded process-global spine_tcm instance.
* The shared library exports spine_tcm_runtime_marker only for this probe.
*/
static inline int spine_tcm_try_bind_global(spine_tcm_handle_t * handle) {
if (dlsym(RTLD_DEFAULT, "spine_tcm_runtime_marker") == NULL) {
return -1;
}
handle->use_global_scope = 1;
return spine_tcm_handle_bind(handle);
}
/*
* Optional pre-bind entry point.
*
* Behavior:
* - Reuses an already-loaded global spine_tcm instance when available.
* - Otherwise loads the shared library from so_path or the default soname.
* - Repeated calls are safe and return 0 after the first successful bind.
*/
static inline int spine_tcm_open_handle(const char * so_path) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
const char * library = (so_path != NULL && so_path[0] != '\0') ? so_path : "libspine_tcm.so";
if (resolved->module_handle != NULL || resolved->use_global_scope) {
return 0;
}
if (spine_tcm_try_bind_global(resolved) == 0) {
return 0;
}
spine_tcm_handle_reset(resolved);
resolved->module_handle = dlopen(library, RTLD_LAZY | RTLD_GLOBAL);
resolved->owns_module_handle = resolved->module_handle != NULL ? 1 : 0;
if (resolved->module_handle == NULL) {
spine_tcm_handle_reset(resolved);
return -1;
}
if (spine_tcm_handle_bind(resolved) != 0) {
if (resolved->owns_module_handle) {
dlclose(resolved->module_handle);
}
spine_tcm_handle_reset(resolved);
return -1;
}
return 0;
}
/* Returns 1 when the runtime driver is available, otherwise 0. */
static inline int spine_tcm_is_available(void) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if ((resolved->module_handle == NULL && !resolved->use_global_scope) || resolved->runtime_is_available == NULL) {
return 0;
}
return resolved->runtime_is_available();
}
/* Returns runtime memory geometry and whether the current backend is fake TCM. */
static inline int spine_tcm_mem_info(spine_tcm_mem_info_t * info) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if ((resolved->module_handle == NULL && !resolved->use_global_scope) || resolved->runtime_layout_info == NULL) {
return -1;
}
return resolved->runtime_layout_info(info);
}
static inline const char * spine_tcm_version(void) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if ((resolved->module_handle == NULL && !resolved->use_global_scope) || resolved->runtime_version == NULL) {
return "unknown";
}
return resolved->runtime_version();
}
/* Returns per-block runtime metadata for the given TCM id. */
static inline int spine_tcm_block_info(int id, spine_tcm_block_info_t * info) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if ((resolved->module_handle == NULL && !resolved->use_global_scope) || resolved->runtime_mem_info == NULL) {
return -1;
}
return resolved->runtime_mem_info(id, info);
}
/* Returns a cached buffer for the given TCM id, or NULL on failure. */
static inline void * spine_tcm_mem_get(int id) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
return NULL;
}
if (resolved->runtime_mem_get == NULL) {
return NULL;
}
return resolved->runtime_mem_get(id);
}
/* Releases one reference acquired by spine_tcm_mem_get(id). */
static inline int spine_tcm_mem_free(int id) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if ((resolved->module_handle == NULL && !resolved->use_global_scope) || resolved->runtime_mem_free == NULL) {
return -1;
}
return resolved->runtime_mem_free(id);
}
/* Waits for a TCM block handoff and returns the driver-owned buffer when available. */
static inline void * spine_tcm_mem_try_wait(int id, size_t over_time) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
return NULL;
}
if (resolved->runtime_mem_try_wait == NULL) {
return NULL;
}
return resolved->runtime_mem_try_wait(id, over_time);
}
/* Releases a buffer acquired by spine_tcm_mem_try_wait(id, over_time). */
static inline int spine_tcm_mem_release(int id) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if ((resolved->module_handle == NULL && !resolved->use_global_scope) || resolved->runtime_mem_release == NULL) {
return -1;
}
return resolved->runtime_mem_release(id);
}
/* Forces a release for the given TCM id when the backend supports it. */
static inline int spine_tcm_mem_force_release(int id) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if ((resolved->module_handle == NULL && !resolved->use_global_scope) ||
resolved->runtime_mem_force_release == NULL) {
return -1;
}
return resolved->runtime_mem_force_release(id);
}
/* Returns whether the given TCM id is currently acquired. */
static inline int spine_tcm_mem_query(int id) {
spine_tcm_handle_t * resolved = spine_tcm_default_handle();
if (resolved->module_handle == NULL && !resolved->use_global_scope) {
(void) spine_tcm_open_handle(NULL);
}
if ((resolved->module_handle == NULL && !resolved->use_global_scope) || resolved->runtime_mem_query == NULL) {
return -1;
}
return resolved->runtime_mem_query(id);
}
#else
static inline const char * spine_tcm_version(void) {
return spine_tcm_runtime_version();
}
#endif
#define SPINE_TCM_VERSION (spine_tcm_version())
#ifdef __cplusplus
}
#endif
#endif
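
Putting the header-only loader mode together, a small hedged example; the block id, the NULL soname (which selects the default libspine_tcm.so), and the error handling are illustrative:

#include <stdio.h>
#include "spine_tcm_public.h" /* assumed install name of the header above */

int main(void) {
    /* Optional pre-bind; reuses a process-global instance or dlopens the library. */
    if (spine_tcm_open_handle(NULL) != 0) {
        fprintf(stderr, "libspine_tcm.so not found\n");
        return 1;
    }
    printf("spine_tcm runtime: %s\n", SPINE_TCM_VERSION);
    if (!spine_tcm_is_available()) {
        return 1;
    }
    spine_tcm_mem_info_t info;
    if (spine_tcm_mem_info(&info) == 0) {
        printf("blk_size=%zu blk_num=%zu fake=%d\n", info.blk_size, info.blk_num, info.is_fake_tcm);
    }
    void * buffer = spine_tcm_mem_get(0);
    if (buffer != NULL) {
        /* ... use the driver-owned block ... */
        spine_tcm_mem_free(0);
    }
    return 0;
}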


@@ -672,7 +672,7 @@ ggml_metal_device_t ggml_metal_device_init(int device) {
![[dev->mtl_device name] containsString:@"M6"] &&
![[dev->mtl_device name] containsString:@"A19"] &&
![[dev->mtl_device name] containsString:@"A20"]) {
GGML_LOG_WARN("%s: tensor API disabled for pre-M5 and pre-A19 devices\n", __func__);
GGML_LOG_INFO("%s: tensor API disabled for pre-M5 and pre-A19 devices\n", __func__);
dev->props.has_tensor = false;
}


@@ -39,6 +39,18 @@ if (WIN32)
set(CMAKE_CXX_COMPILER "icx")
set(CMAKE_CXX_COMPILER_ID "IntelLLVM")
endif()
# Level Zero SDK path for Windows (only when GGML_SYCL_SUPPORT_LEVEL_ZERO is enabled)
if(GGML_SYCL_SUPPORT_LEVEL_ZERO)
if(DEFINED ENV{LEVEL_ZERO_V1_SDK_PATH})
set(LEVEL_ZERO_V1_SDK_PATH $ENV{LEVEL_ZERO_V1_SDK_PATH})
if(EXISTS "${LEVEL_ZERO_V1_SDK_PATH}")
target_include_directories(ggml-sycl PRIVATE "${LEVEL_ZERO_V1_SDK_PATH}/include")
set(LEVEL_ZERO_V1_SDK_LIB_PATH "${LEVEL_ZERO_V1_SDK_PATH}/lib")
else()
message(WARNING "LEVEL_ZERO_V1_SDK_PATH set but folder not found: ${LEVEL_ZERO_V1_SDK_PATH}")
endif()
endif()
endif()
endif()
macro(detect_and_find_package package_name)
@@ -93,6 +105,23 @@ endif()
target_compile_options(ggml-sycl PRIVATE "-Wno-narrowing")
message(STATUS "GGML_SYCL_SUPPORT_LEVEL_ZERO ${GGML_SYCL_SUPPORT_LEVEL_ZERO}")
if (GGML_SYCL_SUPPORT_LEVEL_ZERO)
# Link against Level Zero loader for direct device memory allocation.
# Avoids sycl::malloc_device triggering DMA-buf/TTM system RAM staging
# in the xe kernel driver during multi-GPU inference.
find_path(LEVEL_ZERO_INCLUDE_DIR level_zero/ze_api.h HINTS ${ONEAPI_ROOT}/include ${LEVEL_ZERO_V1_SDK_PATH}/include)
find_library(ZE_LOADER_LIB ze_loader HINTS ${ONEAPI_ROOT}/lib ${LEVEL_ZERO_V1_SDK_LIB_PATH} ENV LD_LIBRARY_PATH)
if(ZE_LOADER_LIB AND LEVEL_ZERO_INCLUDE_DIR)
target_link_libraries(ggml-sycl PRIVATE ${ZE_LOADER_LIB})
target_compile_definitions(ggml-sycl PRIVATE GGML_SYCL_SUPPORT_LEVEL_ZERO)
message(STATUS "Level Zero loader found: ${ZE_LOADER_LIB}")
message(STATUS "Level Zero headers found: ${LEVEL_ZERO_INCLUDE_DIR}")
else()
message(WARNING "Level Zero loader or headers not found, Level Zero support disabled")
endif()
endif()
# Link against oneDNN
set(GGML_SYCL_DNNL 0)
if(GGML_SYCL_DNN)


@@ -11,6 +11,10 @@
//
#include "common.hpp"
#include <sycl/backend.hpp>
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
#include <level_zero/ze_api.h>
#endif
#include "ggml-backend-impl.h"
#include "ggml-impl.h"
@@ -55,6 +59,20 @@ bool gpu_has_xmx(sycl::device &dev) {
return dev.has(sycl::aspect::ext_intel_matrix);
}
static int ggml_sycl_get_env(const char *env_name, int default_val) {
char *user_device_string = getenv(env_name);
int user_number = default_val;
unsigned n;
if (user_device_string != NULL &&
sscanf(user_device_string, " %u", &n) == 1) {
user_number = (int)n;
} else {
user_number = default_val;
}
return user_number;
}
int64_t downsample_sycl_global_range(int64_t accumulate_block_num, int64_t block_size) {
const int64_t max_range = std::numeric_limits<int>::max();
int64_t sycl_down_blk_size = block_size;
@@ -66,6 +84,61 @@ int64_t downsample_sycl_global_range(int64_t accumulate_block_num, int64_t block
return sycl_down_blk_size;
}
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
static bool ggml_sycl_use_level_zero_device_alloc(sycl::queue &q) {
return ggml_sycl_get_env("GGML_SYCL_ENABLE_LEVEL_ZERO", 1) &&
q.get_device().is_gpu() &&
q.get_backend() == sycl::backend::ext_oneapi_level_zero;
}
#endif
// Use Level Zero zeMemAllocDevice to avoid sycl::malloc_device triggering
// DMA-buf/TTM system RAM staging in the xe kernel driver during multi-GPU inference.
// The decision is made from the queue and runtime env because large buffers can be
// allocated before ggml_check_sycl() initializes g_ggml_sycl_enable_level_zero.
void * ggml_sycl_malloc_device(size_t size, sycl::queue &q) {
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
if (ggml_sycl_use_level_zero_device_alloc(q)) {
void *ptr = nullptr;
auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());
#ifdef ZE_RELAXED_ALLOCATION_LIMITS_EXP_NAME
ze_relaxed_allocation_limits_exp_desc_t relaxed_desc = {
ZE_STRUCTURE_TYPE_RELAXED_ALLOCATION_LIMITS_EXP_DESC,
nullptr,
ZE_RELAXED_ALLOCATION_LIMITS_EXP_FLAG_MAX_SIZE,
};
ze_device_mem_alloc_desc_t alloc_desc = {
ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC,
&relaxed_desc,
0,
0,
};
#else
ze_device_mem_alloc_desc_t alloc_desc = {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC, nullptr, 0, 0};
#endif
ze_result_t r = zeMemAllocDevice(ze_ctx, &alloc_desc, size, 64, ze_dev, &ptr);
if (r == ZE_RESULT_SUCCESS && ptr) {
return ptr;
}
return nullptr;
}
#endif
return sycl::malloc_device(size, q);
}
void ggml_sycl_free_device(void *ptr, sycl::queue &q) {
if (!ptr) return;
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
if (ggml_sycl_use_level_zero_device_alloc(q)) {
auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
zeMemFree(ze_ctx, ptr);
return;
}
#endif
SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(ptr, q)));
}
void release_extra_gpu(ggml_tensor_extra_gpu * extra, std::vector<queue_ptr> streams) {
for (int i = 0; i < ggml_sycl_info().device_count; ++i) {
for (int64_t is = 0; is < GGML_SYCL_MAX_STREAMS; ++is) {
@@ -75,8 +148,7 @@ void release_extra_gpu(ggml_tensor_extra_gpu * extra, std::vector<queue_ptr> str
}
if (extra->data_device[i] != nullptr && streams.size()>0) {
ggml_sycl_set_device(i);
SYCL_CHECK(
CHECK_TRY_ERROR(sycl::free(extra->data_device[i], *(streams[i]))));
SYCL_CHECK(CHECK_TRY_ERROR(ggml_sycl_free_device(extra->data_device[i], *(streams[i]))));
}
}
delete extra;


@@ -310,6 +310,10 @@ struct ggml_tensor_extra_gpu {
optimize_feature optimized_feature;
};
extern int g_ggml_sycl_enable_level_zero;
void * ggml_sycl_malloc_device(size_t size, sycl::queue &q);
void ggml_sycl_free_device(void *ptr, sycl::queue &q);
void release_extra_gpu(ggml_tensor_extra_gpu * extra, std::vector<queue_ptr> streams={});
namespace sycl_ex = sycl::ext::oneapi::experimental;


@@ -30,6 +30,10 @@
#include <regex>
#include <sycl/sycl.hpp>
#include <sycl/backend.hpp>
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
#include <level_zero/ze_api.h>
#endif
#if defined(GGML_SYCL_GRAPH) && SYCL_EXT_ONEAPI_ASYNC_MEMORY_ALLOC
# include <sycl/ext/oneapi/experimental/async_alloc/async_alloc.hpp>
#endif
@@ -68,6 +72,7 @@ int g_ggml_sycl_disable_graph = 0;
int g_ggml_sycl_disable_dnn = 0;
int g_ggml_sycl_prioritize_dmmv = 0;
int g_ggml_sycl_use_async_mem_op = 0;
int g_ggml_sycl_enable_level_zero = 0;
int g_ggml_sycl_enable_flash_attention = 1;
@@ -223,6 +228,27 @@ static void ggml_check_sycl() try {
g_ggml_sycl_disable_graph = get_sycl_env("GGML_SYCL_DISABLE_GRAPH", 1);
g_ggml_sycl_disable_dnn = get_sycl_env("GGML_SYCL_DISABLE_DNN", 0);
g_ggml_sycl_prioritize_dmmv = get_sycl_env("GGML_SYCL_PRIORITIZE_DMMV", 0);
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
g_ggml_sycl_enable_level_zero = get_sycl_env("GGML_SYCL_ENABLE_LEVEL_ZERO", 1);
#else
g_ggml_sycl_enable_level_zero = 0;
#endif
if (g_ggml_sycl_enable_level_zero) {
// Verify all GPU devices use the Level Zero backend before enabling L0 APIs.
// Only check GPU devices; CPU devices use OpenCL and would otherwise
// disable Level Zero for the GPUs on systems without ONEAPI_DEVICE_SELECTOR set.
for (unsigned int i = 0; i < dpct::dev_mgr::instance().device_count(); i++) {
auto & q = dpct::dev_mgr::instance().get_device(i).default_queue();
if (!q.get_device().is_gpu()) {
continue;
}
if (q.get_backend() != sycl::backend::ext_oneapi_level_zero) {
GGML_LOG_WARN("SYCL GPU device %d does not use Level Zero backend, disabling Level Zero memory API\n", i);
g_ggml_sycl_enable_level_zero = 0;
break;
}
}
}
#ifdef SYCL_FLASH_ATTN
g_ggml_sycl_enable_flash_attention = get_sycl_env("GGML_SYCL_ENABLE_FLASH_ATTN", 1);
@@ -253,6 +279,11 @@ static void ggml_check_sycl() try {
#else
GGML_LOG_INFO(" GGML_SYCL_DNNL: no\n");
#endif
#if defined(GGML_SYCL_SUPPORT_LEVEL_ZERO)
GGML_LOG_INFO(" GGML_SYCL_SUPPORT_LEVEL_ZERO: yes\n");
#else
GGML_LOG_INFO(" GGML_SYCL_SUPPORT_LEVEL_ZERO: no\n");
#endif
GGML_LOG_INFO("Running with Environment Variables:\n");
GGML_LOG_INFO(" GGML_SYCL_DEBUG: %d\n", g_ggml_sycl_debug);
@@ -262,6 +293,11 @@ static void ggml_check_sycl() try {
#else
GGML_LOG_INFO(" GGML_SYCL_DISABLE_GRAPH: graph disabled by compile flag\n");
#endif
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
GGML_LOG_INFO(" GGML_SYCL_ENABLE_LEVEL_ZERO: %d\n", g_ggml_sycl_enable_level_zero);
#else
GGML_LOG_INFO(" GGML_SYCL_ENABLE_LEVEL_ZERO: Level Zero disabled by compile flag\n");
#endif
#if GGML_SYCL_DNNL
GGML_LOG_INFO(" GGML_SYCL_DISABLE_DNN: %d\n", g_ggml_sycl_disable_dnn);
#else
@@ -371,7 +407,7 @@ struct ggml_backend_sycl_buffer_context {
~ggml_backend_sycl_buffer_context() {
if (dev_ptr != nullptr) {
ggml_sycl_set_device(device);
SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(dev_ptr, *stream)));
SYCL_CHECK(CHECK_TRY_ERROR(ggml_sycl_free_device(dev_ptr, *stream)));
}
//release extra used by tensors
@@ -504,8 +540,43 @@ catch (sycl::exception const &exc) {
std::exit(1);
}
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
static bool ggml_sycl_is_l0_discrete_gpu(sycl::queue &q) {
if (!q.get_device().is_gpu() || q.get_backend() != sycl::backend::ext_oneapi_level_zero) {
return false;
}
ze_device_handle_t ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());
ze_device_properties_t props = {};
props.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
ze_result_t r = zeDeviceGetProperties(ze_dev, &props);
return r == ZE_RESULT_SUCCESS && !(props.flags & ZE_DEVICE_PROPERTY_FLAG_INTEGRATED);
}
#endif
static void dev2dev_memcpy(sycl::queue &q_dst, sycl::queue &q_src, void *ptr_dst,
const void *ptr_src, size_t size) {
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
// Use Level Zero direct copy for dGPU-to-dGPU transfers.
const bool l0_copy_supported =
ggml_sycl_is_l0_discrete_gpu(q_dst) && ggml_sycl_is_l0_discrete_gpu(q_src);
if (g_ggml_sycl_enable_level_zero && l0_copy_supported) {
auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q_dst.get_context());
auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q_dst.get_device());
ze_command_queue_desc_t cq_desc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC, nullptr, 0, 0,
0, ZE_COMMAND_QUEUE_MODE_SYNCHRONOUS, ZE_COMMAND_QUEUE_PRIORITY_NORMAL};
ze_command_list_handle_t cl;
ze_result_t r = zeCommandListCreateImmediate(ze_ctx, ze_dev, &cq_desc, &cl);
if (r == ZE_RESULT_SUCCESS) {
r = zeCommandListAppendMemoryCopy(cl, ptr_dst, ptr_src, size, nullptr, 0, nullptr);
zeCommandListDestroy(cl);
if (r == ZE_RESULT_SUCCESS) {
return;
}
}
}
#endif
// Host-staged copy
char *host_buf = (char *)malloc(size);
q_src.memcpy(host_buf, (const char *)ptr_src, size).wait();
q_dst.memcpy((char *)ptr_dst, host_buf, size).wait();
@@ -675,8 +746,7 @@ ggml_backend_sycl_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft,
size = std::max(size, (size_t)1); // syclMalloc returns null for size 0
void * dev_ptr;
SYCL_CHECK(CHECK_TRY_ERROR(dev_ptr = (void *)sycl::malloc_device(
size, *stream)));
SYCL_CHECK(CHECK_TRY_ERROR(dev_ptr = (void *)ggml_sycl_malloc_device(size, *stream)));
if (!dev_ptr) {
GGML_LOG_ERROR("%s: can't allocate %lu Bytes of memory on device\n", __func__, size);
return nullptr;
@@ -917,18 +987,10 @@ ggml_backend_sycl_split_buffer_init_tensor(ggml_backend_buffer_t buffer,
size += ggml_row_size(tensor->type, MATRIX_ROW_PADDING - ne0 % MATRIX_ROW_PADDING);
}
// FIXME: do not crash if SYCL Buffer alloc fails
// currently, init_tensor cannot fail, it needs to be fixed in ggml-backend first
ggml_sycl_set_device(i);
const queue_ptr stream = ctx->streams[i];
char * buf;
/*
DPCT1009:208: SYCL uses exceptions to report errors and does not use the
error codes. The original code was commented out and a warning string
was inserted. You need to rewrite this code.
*/
SYCL_CHECK(CHECK_TRY_ERROR(buf = (char *)sycl::malloc_device(
size, *stream)));
SYCL_CHECK(CHECK_TRY_ERROR(buf = (char *)ggml_sycl_malloc_device(size, *stream)));
if (!buf) {
char err_buf[1024];
snprintf(err_buf, 1023, "%s: can't allocate %lu Bytes of memory on device\n", __func__, size);
@@ -1306,7 +1368,7 @@ struct ggml_sycl_pool_leg : public ggml_sycl_pool {
for (int i = 0; i < MAX_SYCL_BUFFERS; ++i) {
ggml_sycl_buffer & b = buffer_pool[i];
if (b.ptr != nullptr) {
SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(b.ptr, *qptr)));
SYCL_CHECK(CHECK_TRY_ERROR(ggml_sycl_free_device(b.ptr, *qptr)));
pool_size -= b.size;
}
}
@@ -1374,9 +1436,7 @@ struct ggml_sycl_pool_leg : public ggml_sycl_pool {
void * ptr;
size_t look_ahead_size = (size_t) (1.05 * size);
SYCL_CHECK(
CHECK_TRY_ERROR(ptr = (void *)sycl::malloc_device(
look_ahead_size, *qptr)));
SYCL_CHECK(CHECK_TRY_ERROR(ptr = (void *)ggml_sycl_malloc_device(look_ahead_size, *qptr)));
if (!ptr) {
GGML_LOG_ERROR("%s: can't allocate %lu Bytes of memory on device/GPU\n", __func__, look_ahead_size);
return nullptr;
@@ -1404,7 +1464,7 @@ struct ggml_sycl_pool_leg : public ggml_sycl_pool {
}
}
GGML_LOG_WARN("WARNING: sycl buffer pool full, increase MAX_sycl_BUFFERS\n");
SYCL_CHECK(CHECK_TRY_ERROR(sycl::free(ptr, *qptr)));
SYCL_CHECK(CHECK_TRY_ERROR(ggml_sycl_free_device(ptr, *qptr)));
pool_size -= size;
}
};
@@ -3405,7 +3465,7 @@ static inline void * sycl_ext_malloc_device(dpct::queue_ptr stream, size_t size)
// If async allocation extension is not available, use_async should always be false.
GGML_ASSERT(!use_async);
#endif
return sycl::malloc(size, *stream, sycl::usm::alloc::device);
return ggml_sycl_malloc_device(size, *stream);
}
static inline void sycl_ext_free(dpct::queue_ptr stream, void * ptr) {
@@ -3419,7 +3479,7 @@ static inline void sycl_ext_free(dpct::queue_ptr stream, void * ptr) {
// If async allocation extension is not available, use_async should always be false.
GGML_ASSERT(!use_async);
#endif
sycl::free(ptr, *stream);
ggml_sycl_free_device(ptr, *stream);
}
// RAII wrapper for temporary reorder buffers with optional host memory fallback.
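
For orientation, a minimal caller-side sketch of the new allocation seam, assuming a live SYCL queue; whether the Level Zero path or plain sycl::malloc_device is taken is decided internally from the queue's backend and the GGML_SYCL_ENABLE_LEVEL_ZERO environment variable, and a failed Level Zero allocation is reported as nullptr rather than retried:

#include <sycl/sycl.hpp>

// Declared in the backend's common.hpp (see the hunk above).
void * ggml_sycl_malloc_device(size_t size, sycl::queue & q);
void   ggml_sycl_free_device(void * ptr, sycl::queue & q);

static void demo(sycl::queue & q) {
    const size_t size = size_t(256) << 20; // illustrative 256 MiB device buffer
    void * ptr = ggml_sycl_malloc_device(size, q);
    if (ptr == nullptr) {
        return; // callers must handle allocation failure themselves
    }
    // ... enqueue kernels that use ptr ...
    ggml_sycl_free_device(ptr, q);
}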


@@ -3954,13 +3954,13 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->a_s, #NAMELC #F16ACC "_aligned_s", NAMELC ## _aligned ## F16ACC ## _len, NAMELC ## _aligned ## F16ACC ## _data, "main", PARAMCOUNT, sizeof(PUSHCONST), s_ ## WG_DENOMS, s_ ## WARPTILE, s_align, false, REQSUBGROUPSIZE > 0, REQSUBGROUPSIZE); \
#define CREATE_MMQ(TYPE, PIPELINE_NAME, NAMELC, WG_DENOMS, WARPTILE, PUSHCONST, PARAMCOUNT, ID, REQSUBGROUPSIZE) \
if (device->mul_mat ## ID ## _l[TYPE]) { \
if (device->mul_mat ## ID ## _l_int[TYPE]) { \
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME .f32acc->l, #NAMELC "_l", NAMELC ## _len, NAMELC ## _data, "main", PARAMCOUNT, sizeof(PUSHCONST), l_ ## WG_DENOMS, l_ ## WARPTILE, 1, false, REQSUBGROUPSIZE > 0, REQSUBGROUPSIZE); \
} \
if (device->mul_mat ## ID ## _m[TYPE]) { \
if (device->mul_mat ## ID ## _m_int[TYPE]) { \
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME .f32acc->m, #NAMELC "_m", NAMELC ## _len, NAMELC ## _data, "main", PARAMCOUNT, sizeof(PUSHCONST), m_ ## WG_DENOMS, m_ ## WARPTILE, 1, false, REQSUBGROUPSIZE > 0, REQSUBGROUPSIZE); \
} \
if (device->mul_mat ## ID ## _s[TYPE]) { \
if (device->mul_mat ## ID ## _s_int[TYPE]) { \
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME .f32acc->s, #NAMELC "_s", NAMELC ## _len, NAMELC ## _data, "main", PARAMCOUNT, sizeof(PUSHCONST), s_ ## WG_DENOMS, s_ ## WARPTILE, 1, false, REQSUBGROUPSIZE > 0, REQSUBGROUPSIZE); \
} \
@@ -4131,11 +4131,11 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->a_s, #NAMELC #F16ACC "_aligned_s", NAMELC ## _aligned ## F16ACC ## _fp32_len, NAMELC ## _aligned ## F16ACC ## _fp32_data, "main", PARAMCOUNT, sizeof(PUSHCONST), s_ ## WG_DENOMS, s_ ## WARPTILE, s_align, false, REQSUBGROUPSIZE > 0, REQSUBGROUPSIZE); \
#define CREATE_MMQ(TYPE, PIPELINE_NAME, NAMELC, WG_DENOMS, WARPTILE, PUSHCONST, PARAMCOUNT, ID) \
if (device->mul_mat ## ID ## _l[TYPE]) \
if (device->mul_mat ## ID ## _l_int[TYPE]) \
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->l, #NAMELC "_l", NAMELC ## _fp32_len, NAMELC ## _fp32_data, "main", PARAMCOUNT, sizeof(PUSHCONST), l_ ## WG_DENOMS, l_ ## WARPTILE, 1); \
if (device->mul_mat ## ID ## _m[TYPE]) \
if (device->mul_mat ## ID ## _m_int[TYPE]) \
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->m, #NAMELC "_m", NAMELC ## _fp32_len, NAMELC ## _fp32_data, "main", PARAMCOUNT, sizeof(PUSHCONST), m_ ## WG_DENOMS, m_ ## WARPTILE, 1); \
if (device->mul_mat ## ID ## _s[TYPE]) \
if (device->mul_mat ## ID ## _s_int[TYPE]) \
ggml_vk_create_pipeline(device, device-> PIPELINE_NAME ->s, #NAMELC "_s", NAMELC ## _fp32_len, NAMELC ## _fp32_data, "main", PARAMCOUNT, sizeof(PUSHCONST), s_ ## WG_DENOMS, s_ ## WARPTILE, 1); \
CREATE_MM(GGML_TYPE_F32, pipeline_matmul_f32, matmul_f32_f32, , wg_denoms, warptile, vk_mat_mat_push_constants, 3, , 0);
@@ -5716,12 +5716,12 @@ static vk_device ggml_vk_get_device(size_t idx) {
break;
}
device->mul_mat_l_int[i] = true;
device->mul_mat_m_int[i] = true;
device->mul_mat_s_int[i] = true;
device->mul_mat_id_l_int[i] = true;
device->mul_mat_id_m_int[i] = true;
device->mul_mat_id_s_int[i] = true;
device->mul_mat_l_int[i] = device->mul_mat_l[i];
device->mul_mat_m_int[i] = device->mul_mat_m[i];
device->mul_mat_s_int[i] = device->mul_mat_s[i];
device->mul_mat_id_l_int[i] = device->mul_mat_id_l[i];
device->mul_mat_id_m_int[i] = device->mul_mat_id_m[i];
device->mul_mat_id_s_int[i] = device->mul_mat_id_s[i];
}


@@ -777,7 +777,10 @@ inline ggml_webgpu_flash_attn_decisions ggml_webgpu_flash_attn_get_decisions(
const bool tile_can_dispatch_all_q_rows =
context.max_subgroup_size > 0 &&
context.max_wg_size >= GGML_WEBGPU_FLASH_ATTN_TILE_Q_TILE * context.max_subgroup_size;
const bool use_tile = context.supports_subgroups && !context.supports_subgroup_matrix && K->type == GGML_TYPE_F16 &&
const bool use_subgroup_matrix =
context.supports_subgroup_matrix && context.sg_mat_k > 0 && context.sg_mat_n > 0 &&
context.src0->ne[0] % context.sg_mat_k == 0 && context.src2->ne[0] % context.sg_mat_n == 0;
const bool use_tile = context.supports_subgroups && !use_subgroup_matrix && K->type == GGML_TYPE_F16 &&
V->type == GGML_TYPE_F16 && f16_vec4_aligned &&
(context.src0->ne[0] % GGML_WEBGPU_FLASH_ATTN_TILE_KV_VEC_WIDTH == 0) &&
(context.src2->ne[0] % GGML_WEBGPU_FLASH_ATTN_TILE_KV_VEC_WIDTH == 0) &&
@@ -785,7 +788,7 @@ inline ggml_webgpu_flash_attn_decisions ggml_webgpu_flash_attn_get_decisions(
decisions.path = use_vec ? GGML_WEBGPU_FLASH_ATTN_PATH_VEC :
use_tile ? GGML_WEBGPU_FLASH_ATTN_PATH_TILE :
context.supports_subgroup_matrix ? GGML_WEBGPU_FLASH_ATTN_PATH_SUBGROUP_MATRIX :
use_subgroup_matrix ? GGML_WEBGPU_FLASH_ATTN_PATH_SUBGROUP_MATRIX :
GGML_WEBGPU_FLASH_ATTN_PATH_NONE;
if (decisions.path == GGML_WEBGPU_FLASH_ATTN_PATH_NONE) {
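
Read together, the change above narrows the subgroup-matrix path from "supported" to "supported and tile-aligned". A hedged restatement of the new gate, with names abstracted since only the shown expressions are certain; the leading dimensions are read from context.src0->ne[0] and context.src2->ne[0]:

#include <cstdint>

// Mirrors the condition introduced above; the parameter names stand in for
// fields of the surrounding context struct, which is not shown in full here.
static bool can_use_subgroup_matrix(bool supported, int sg_mat_k, int sg_mat_n,
                                    int64_t ne_src0_0, int64_t ne_src2_0) {
    return supported && sg_mat_k > 0 && sg_mat_n > 0 &&
           ne_src0_0 % sg_mat_k == 0 && ne_src2_0 % sg_mat_n == 0;
}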


@@ -148,7 +148,6 @@ You can use GBNF grammars:
- In [llama-cli](../tools/cli) and [llama-completion](../tools/completion), passed as the `--json` / `-j` flag
- To convert to a grammar ahead of time:
- in CLI, with [examples/json_schema_to_grammar.py](../examples/json_schema_to_grammar.py)
- in JavaScript with [json-schema-to-grammar.mjs](../tools/server/public_legacy/json-schema-to-grammar.mjs) (this is used by the [server](../tools/server)'s Web UI)
> [!NOTE]
> The JSON schema is only used to constrain the model output and is not injected into the prompt. The model has no visibility into the schema, so if you want it to understand the expected structure, describe it explicitly in your prompt. This does not apply to tool calling, where schemas are injected into the prompt.

Binary file not shown.


@@ -0,0 +1,120 @@
ied 4 ½ months
__ggml_vocab_test__
Äpfel
__ggml_vocab_test__
__ggml_vocab_test__
__ggml_vocab_test__
__ggml_vocab_test__
__ggml_vocab_test__
__ggml_vocab_test__
__ggml_vocab_test__
__ggml_vocab_test__
__ggml_vocab_test__
__ggml_vocab_test__
Hello world
__ggml_vocab_test__
Hello world
__ggml_vocab_test__
Hello World
__ggml_vocab_test__
Hello World
__ggml_vocab_test__
Hello World!
__ggml_vocab_test__
Hello, world!
__ggml_vocab_test__
Hello, world!
__ggml_vocab_test__
this is 🦙.cpp
__ggml_vocab_test__
w048 7tuijk dsdfhu
__ggml_vocab_test__
нещо на Български
__ggml_vocab_test__
កាន់តែពិសេសអាចខលចេញ
__ggml_vocab_test__
🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token)
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
__ggml_vocab_test__
Hello
Hello
__ggml_vocab_test__
(
__ggml_vocab_test__
=
__ggml_vocab_test__
' era
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33
__ggml_vocab_test__
333
__ggml_vocab_test__
3333
__ggml_vocab_test__
33333
__ggml_vocab_test__
333333
__ggml_vocab_test__
3333333
__ggml_vocab_test__
33333333
__ggml_vocab_test__
333333333
__ggml_vocab_test__
Cửa Việt
__ggml_vocab_test__
discards
__ggml_vocab_test__
🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天 ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL
__ggml_vocab_test__
__ggml_vocab_test__
résumé
__ggml_vocab_test__
àààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààààà
__ggml_vocab_test__
Vieết Nam
__ggml_vocab_test__


@@ -0,0 +1,50 @@
1122 220 19 220 26062 3951
86975 15897 301
220
256
262
197
198
271
1406
1572
9707 1879
21927 1879
9707 4337
21927 4337
21927 4337 0
9707 11 1879 0
21927 11 1879 0
419 374 11162 99 247 13 10821
86 15 19 23 220 22 83 1963 41808 11472 2940 16739
78762 14144 1456 13073 63471 33594 3038 133178 79012
146394 97529 241 44258 233 146568 44258 224 147603 20879 115 146280 44258 223 146280 147272 97529 227 147805 148301 147270 44258 223 146848
145836 320 8252 8 26525 114 378 235 149921 30543 320 35673 99066 97534 8 25521 227 320 3243 42365 429 702 1181 1828 3950 8
9707
21927
220 21927
256 21927
262 21927
262 21927 198 262 21927
320
198 284
6 11385
9707 11 379 64848 0 2585 525 498 26525 223 937 104100 18493 22377 99257 16 18 16 19 16 20 16 35727 21216
17085 2928
18
18 18
18 18 18
18 18 18 18
18 18 18 18 18
18 18 18 18 18 18
18 18 18 18 18 18 18
18 18 18 18 18 18 18 18
18 18 18 18 18 18 18 18 18
34 90063 128324
2560 2347
198 4710 14731 65497 7847 1572 2303 78672 10947 145836 320 8252 8 26525 114 378 235 149921 30543 320 35673 99066 97534 8 25521 227 11162 99 247 149955 220 18 220 18 18 220 18 18 18 220 18 18 18 18 220 18 18 18 18 18 220 18 18 18 18 18 18 220 18 18 18 18 18 18 18 220 18 18 18 18 18 18 18 18 220 18 13 18 220 18 496 18 220 18 1112 18 220 146394 97529 241 44258 233 146568 44258 224 147603 20879 115 146280 44258 223 146280 147272 97529 227 144534 937 104100 18493 22377 99257 16 18 16 19 16 20 16 35727 21216 55460 53237 18658 14144 1456 13073 63471 33594 3038 133178 79012 3355 4605 4605 13874 13874 73594 3014 3014 28149 17085 2928 26610 7646 358 3003 1012 364 83 813 566 594 1052 11 364 787 498 2704 30 364 44 537 2704 358 3278 1281 432 11 364 35 498 1075 1045 15243 30 1205 6 42612 264 63866 43
68 53839
265 53839 31323 53839
64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549 64 96549
53 645 51580 29974


@@ -0,0 +1,213 @@
# Download webui assets from Hugging Face Bucket at build time
# Usage: cmake -DPUBLIC_DIR=... -DHF_BUCKET=... -DHF_VERSION=... -DASSETS="a;b;c" -P scripts/webui-download.cmake
#
# Asset provisioning priority:
# 1. Pre-built assets already in PUBLIC_DIR (cached from a previous run)
# 2. Local npm build (if NPM_DIR is provided and has package.json)
# 3. Hugging Face Bucket download (version-specific, then 'latest' fallback)
cmake_minimum_required(VERSION 3.16)
set(PUBLIC_DIR "" CACHE STRING "Directory to store/download assets")
set(HF_BUCKET "" CACHE STRING "Hugging Face bucket name")
set(HF_VERSION "" CACHE STRING "Version to download (empty = resolve from git)")
set(ASSETS "" CACHE STRING "Semicolon-separated list of asset filenames")
set(STAMP_FILE "" CACHE STRING "Stamp file to create on success (optional)")
set(SOURCE_DIR "" CACHE STRING "Project source root (to resolve version from git)")
set(NPM_DIR "" CACHE STRING "WebUI source directory (to run npm build)")
set(HF_ENABLED "" CACHE STRING "Whether to allow HF Bucket download (ON/OFF)")
# ---------------------------------------------------------------------------
# 1. Resolve version from git if not provided at configure time
# ---------------------------------------------------------------------------
set(RESOLVED_VERSION "${HF_VERSION}")
if("${RESOLVED_VERSION}" STREQUAL "" AND NOT "${SOURCE_DIR}" STREQUAL "")
if(EXISTS "${SOURCE_DIR}/cmake/build-info.cmake")
include("${SOURCE_DIR}/cmake/build-info.cmake")
if(NOT "${BUILD_NUMBER}" STREQUAL "" AND NOT BUILD_NUMBER EQUAL 0)
set(RESOLVED_VERSION "${BUILD_NUMBER}")
message(STATUS "WebUI: resolved version from git: ${RESOLVED_VERSION}")
endif()
endif()
endif()
# ---------------------------------------------------------------------------
# 2. Check stamp freshness — re-download if resolved version changed
# ---------------------------------------------------------------------------
set(FORCE_REBUILD FALSE)
if(NOT "${STAMP_FILE}" STREQUAL "" AND EXISTS "${STAMP_FILE}")
file(READ "${STAMP_FILE}" STAMPED_VERSION)
string(STRIP "${STAMPED_VERSION}" STAMPED_VERSION)
if(NOT "${STAMPED_VERSION}" STREQUAL "${RESOLVED_VERSION}")
message(STATUS "WebUI: version changed (${STAMPED_VERSION} -> ${RESOLVED_VERSION}), re-building")
set(FORCE_REBUILD TRUE)
endif()
endif()
# ---------------------------------------------------------------------------
# 3. Check if assets already exist (cached from a previous run)
# ---------------------------------------------------------------------------
set(ALL_EXISTS TRUE)
foreach(asset ${ASSETS})
if(NOT EXISTS "${PUBLIC_DIR}/${asset}")
set(ALL_EXISTS FALSE)
break()
endif()
endforeach()
if(ALL_EXISTS AND NOT FORCE_REBUILD)
message(STATUS "WebUI: all assets already exist in ${PUBLIC_DIR}, skipping")
return()
endif()
file(MAKE_DIRECTORY "${PUBLIC_DIR}")
# ---------------------------------------------------------------------------
# 4. Priority 2: build from source via npm (fast path for developers)
# ---------------------------------------------------------------------------
set(PROVISION_SUCCESS FALSE)
if(NOT PROVISION_SUCCESS AND NOT "${NPM_DIR}" STREQUAL "")
if(EXISTS "${NPM_DIR}/package.json")
message(STATUS "WebUI: building from source in ${NPM_DIR}")
# Run npm install if node_modules is missing
if(NOT EXISTS "${NPM_DIR}/node_modules")
message(STATUS "WebUI: running npm install (first time)")
execute_process(
COMMAND npm install
WORKING_DIRECTORY "${NPM_DIR}"
RESULT_VARIABLE NPM_INSTALL_RESULT
OUTPUT_VARIABLE NPM_OUT
ERROR_VARIABLE NPM_ERR
)
if(NOT NPM_INSTALL_RESULT EQUAL 0)
message(STATUS "WebUI: npm install failed (${NPM_INSTALL_RESULT}), falling back to download")
message(STATUS " stderr: ${NPM_ERR}")
endif()
endif()
# Run the build
execute_process(
COMMAND npm run build
WORKING_DIRECTORY "${NPM_DIR}"
RESULT_VARIABLE NPM_BUILD_RESULT
OUTPUT_VARIABLE NPM_OUT
ERROR_VARIABLE NPM_ERR
)
if(NPM_BUILD_RESULT EQUAL 0)
# Verify that the expected assets were produced
set(ALL_BUILT TRUE)
foreach(asset ${ASSETS})
if(NOT EXISTS "${PUBLIC_DIR}/${asset}")
set(ALL_BUILT FALSE)
break()
endif()
endforeach()
if(ALL_BUILT)
message(STATUS "WebUI: local npm build succeeded")
set(PROVISION_SUCCESS TRUE)
else()
message(STATUS "WebUI: npm build completed but assets missing from ${PUBLIC_DIR}, falling back to download")
endif()
else()
message(STATUS "WebUI: npm build failed (${NPM_BUILD_RESULT}), falling back to download")
message(STATUS " stderr: ${NPM_ERR}")
endif()
else()
message(STATUS "WebUI: NPM_DIR (${NPM_DIR}) has no package.json, skipping npm build")
endif()
endif()
# ---------------------------------------------------------------------------
# 5. Priority 3: download from Hugging Face Bucket (if enabled)
# ---------------------------------------------------------------------------
if(NOT PROVISION_SUCCESS AND HF_ENABLED)
# Build list of URLs to try — version-specific first, then 'latest'
set(URL_ENTRIES "")
if(NOT "${RESOLVED_VERSION}" STREQUAL "")
list(APPEND URL_ENTRIES
"version:https://huggingface.co/buckets/ggml-org/${HF_BUCKET}/resolve/${RESOLVED_VERSION}")
endif()
list(APPEND URL_ENTRIES
"latest:https://huggingface.co/buckets/ggml-org/${HF_BUCKET}/resolve/latest")
foreach(entry ${URL_ENTRIES})
string(REGEX REPLACE "^([^:]+):.*$" "\\1" url_label "${entry}")
string(REGEX REPLACE "^[^:]+:(.*)$" "\\1" base_url "${entry}")
message(STATUS "WebUI: downloading assets from ${url_label}: ${base_url}")
# Download each asset
set(ALL_OK TRUE)
foreach(asset ${ASSETS})
set(download_url "${base_url}/${asset}?download=true")
set(download_path "${PUBLIC_DIR}/${asset}")
file(DOWNLOAD "${download_url}" "${download_path}"
STATUS download_status TIMEOUT 60
)
list(GET download_status 0 download_result)
if(NOT download_result EQUAL 0)
list(GET download_status 1 error_message)
message(STATUS "WebUI: failed to download ${asset} from ${url_label}: ${error_message}")
set(ALL_OK FALSE)
break()
endif()
message(STATUS "WebUI: downloaded ${asset}")
endforeach()
if(NOT ALL_OK)
continue()
endif()
# Verify checksums if the server provides them
file(DOWNLOAD "${base_url}/checksums.txt?download=true"
"${PUBLIC_DIR}/checksums.txt"
STATUS checksum_status TIMEOUT 30
)
list(GET checksum_status 0 checksum_result)
if(checksum_result EQUAL 0)
message(STATUS "WebUI: verifying checksums...")
file(STRINGS "${PUBLIC_DIR}/checksums.txt" CHECKSUMS_CONTENT)
foreach(asset ${ASSETS})
set(download_path "${PUBLIC_DIR}/${asset}")
file(SHA256 "${download_path}" asset_hash)
string(TOUPPER "${asset_hash}" EXPECTED_HASH_UPPER)
string(REGEX MATCH "${EXPECTED_HASH_UPPER}[ \\t]+${asset}" CHECKSUM_LINE "${CHECKSUMS_CONTENT}")
if(NOT CHECKSUM_LINE)
message(WARNING "WebUI: checksum verification failed for ${asset}")
set(ALL_OK FALSE)
break()
endif()
endforeach()
if(ALL_OK)
message(STATUS "WebUI: all checksums verified")
endif()
endif()
if(ALL_OK)
set(PROVISION_SUCCESS TRUE)
break()
endif()
endforeach()
if(PROVISION_SUCCESS)
message(STATUS "WebUI: provisioning complete")
else()
message(WARNING "WebUI: failed to download assets from HF Bucket (${HF_BUCKET})")
endif()
endif()
# ---------------------------------------------------------------------------
# 6. Write stamp file on success (stores resolved version for freshness check)
# ---------------------------------------------------------------------------
if(PROVISION_SUCCESS)
if(NOT "${STAMP_FILE}" STREQUAL "")
file(WRITE "${STAMP_FILE}" "${RESOLVED_VERSION}")
endif()
else()
message(WARNING "WebUI: no source available. Neither local build (${NPM_DIR}) nor HF Bucket download succeeded.")
message(WARNING "WebUI: building server without embedded WebUI. Set LLAMA_BUILD_WEBUI=OFF to suppress this warning.")
endif()


@@ -605,6 +605,136 @@ static std::vector<size_t> unicode_regex_split_custom_qwen2(const std::string &
return bpe_offsets;
}
// Qwen3.5 system regex: "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
// Compared to Qwen2, letter-runs also consume Unicode combining marks (\p{M}): [\p{L}\p{M}]+ instead of \p{L}+
static std::vector<size_t> unicode_regex_split_custom_qwen35(const std::string & text, const std::vector<size_t> & offsets) {
std::vector<size_t> bpe_offsets; // store the offset of each word
bpe_offsets.reserve(offsets.size()); // Reserve memory for the approximate size
const auto cpts = unicode_cpts_from_utf8(text);
size_t start = 0;
for (auto offset : offsets) {
const size_t offset_ini = start;
const size_t offset_end = start + offset;
assert(offset_end <= cpts.size());
start = offset_end;
static const uint32_t OUT_OF_RANGE = 0xFFFFFFFF;
auto _get_cpt = [&] (const size_t pos) -> uint32_t {
return (offset_ini <= pos && pos < offset_end) ? cpts[pos] : OUT_OF_RANGE;
};
auto _get_flags = [&] (const size_t pos) -> unicode_cpt_flags {
return (offset_ini <= pos && pos < offset_end) ? unicode_cpt_flags_from_cpt(cpts[pos]) : unicode_cpt_flags{};
};
size_t _prev_end = offset_ini;
auto _add_token = [&] (const size_t end) -> size_t {
assert(_prev_end <= end && end <= offset_end);
size_t len = end - _prev_end;
if (len > 0) {
bpe_offsets.push_back(len);
}
_prev_end = end;
return len;
};
for (size_t pos = offset_ini; pos < offset_end; /*pos++*/ ) {
const uint32_t cpt = _get_cpt(pos);
const auto flags = _get_flags(pos);
// regex: (?i:'s|'t|'re|'ve|'m|'ll|'d) // case insensitive
if (cpt == '\'' && pos+1 < offset_end) {
uint32_t cpt_next = unicode_tolower(_get_cpt(pos+1));
if (cpt_next == 's' || cpt_next == 't' || cpt_next == 'm' || cpt_next == 'd') {
pos += _add_token(pos+2);
continue;
}
if (pos+2 < offset_end) {
uint32_t cpt_next_next = unicode_tolower(_get_cpt(pos+2));
if ((cpt_next == 'r' && cpt_next_next == 'e') ||
(cpt_next == 'v' && cpt_next_next == 'e') ||
(cpt_next == 'l' && cpt_next_next == 'l')) {
pos += _add_token(pos+3);
continue;
}
}
}
// regex: [^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+
if (!(cpt == '\r' || cpt == '\n' || flags.is_number)) {
if (flags.is_letter || flags.is_accent_mark || _get_flags(pos + 1).is_accent_mark || _get_flags(pos+1).is_letter) {
pos++;
while (_get_flags(pos).is_letter || _get_flags(pos).is_accent_mark) {
pos++;
}
_add_token(pos);
continue;
}
}
// regex: \p{N}
if (flags.is_number) {
pos++;
_add_token(pos);
continue;
}
// regex: <space>?[^\s\p{L}\p{M}\p{N}]+[\r\n]*
auto flags2 = (cpt == ' ' ? _get_flags(pos+1) : flags);
if (!(flags2.is_whitespace | flags2.is_letter | flags2.is_accent_mark | flags2.is_number) && flags.as_uint()) {
pos += (cpt == ' ');
while (!(flags2.is_whitespace | flags2.is_letter | flags2.is_accent_mark | flags2.is_number) && flags2.as_uint()) {
flags2 = _get_flags(++pos);
}
uint32_t cpt2 = _get_cpt(pos);
while (cpt2 == '\r' || cpt2 == '\n') {
cpt2 = _get_cpt(++pos);
}
_add_token(pos);
continue;
}
size_t num_whitespaces = 0;
size_t last_end_r_or_n = 0;
while (_get_flags(pos+num_whitespaces).is_whitespace) {
uint32_t cpt2 = _get_cpt(pos+num_whitespaces);
if (cpt2 == '\r' || cpt2 == '\n') {
last_end_r_or_n = pos + num_whitespaces + 1;
}
num_whitespaces++;
}
// regex: \s*[\r\n]+
if (last_end_r_or_n > 0) {
pos = last_end_r_or_n;
_add_token(pos);
continue;
}
// regex: \s+(?!\S)
if (num_whitespaces > 1 && _get_cpt(pos+num_whitespaces) != OUT_OF_RANGE) {
pos += num_whitespaces - 1;
_add_token(pos);
continue;
}
// regex: \s+
if (num_whitespaces > 0) {
pos += num_whitespaces;
_add_token(pos);
continue;
}
// no matches
_add_token(++pos);
}
}
return bpe_offsets;
}
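// Informal worked example (assuming the usual flag classification, where a
// combining mark such as U+0303 carries is_accent_mark): for the input
// "n\u0303a 42" (n, combining tilde, a, space, '4', '2') this handler emits
// the pre-BPE pieces "ña" (a single [\p{L}\p{M}]+ run spanning the mark),
// " " (\s+), then "4" and "2" (\p{N} consumes one codepoint per piece),
// whereas the Qwen2 rule \p{L}+ above would break the letter run at the
// mark into "n", the bare combining mark, and "a".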
template <typename CharT>
static std::vector<size_t> unicode_regex_split_stl(const std::basic_string<CharT> & text, const std::basic_string<CharT> & regex, const std::vector<size_t> & offsets) {
using BidirIt = typename std::basic_string<CharT>::const_iterator;
@@ -929,6 +1059,9 @@ static std::vector<size_t> unicode_regex_split_custom(const std::string & text,
} else if (
regex_expr == "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+") {
bpe_offsets = unicode_regex_split_custom_qwen2(text, offsets);
} else if (
regex_expr == "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+") {
bpe_offsets = unicode_regex_split_custom_qwen35(text, offsets);
} else if (regex_expr == "\\p{Han}+") {
// K2's first pattern - handle all K2 patterns together
bpe_offsets = unicode_regex_split_custom_kimi_k2(text, offsets);


@@ -131,6 +131,7 @@ llama_test(test-tokenizer-0 NAME test-tokenizer-0-llama-spm ARGS ${PROJE
llama_test(test-tokenizer-0 NAME test-tokenizer-0-mpt ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-mpt.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-phi-3 ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-phi-3.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-qwen2 ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-qwen2.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-qwen35 ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-qwen35.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-refact ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-refact.gguf)
llama_test(test-tokenizer-0 NAME test-tokenizer-0-starcoder ARGS ${PROJECT_SOURCE_DIR}/models/ggml-vocab-starcoder.gguf)


@@ -473,10 +473,10 @@ static void clip_log_internal(enum ggml_log_level level, const char * format, ..
va_end(args);
}
#define LOG_DBG(...) clip_log_internal(GGML_LOG_LEVEL_DEBUG, __VA_ARGS__)
#define LOG_INF(...) clip_log_internal(GGML_LOG_LEVEL_INFO, __VA_ARGS__)
#define LOG_WRN(...) clip_log_internal(GGML_LOG_LEVEL_WARN, __VA_ARGS__)
#define LOG_ERR(...) clip_log_internal(GGML_LOG_LEVEL_ERROR, __VA_ARGS__)
#define LOG_DBG(...) clip_log_internal(GGML_LOG_LEVEL_DEBUG, __VA_ARGS__)
#define LOG_CNT(...) clip_log_internal(GGML_LOG_LEVEL_CONT, __VA_ARGS__)
//


@@ -169,7 +169,7 @@ struct clip_ctx {
throw std::runtime_error("failed to initialize CPU backend");
}
if (ctx_params.use_gpu) {
auto backend_name = std::getenv("MTMD_BACKEND_DEVICE");
auto * backend_name = std::getenv("MTMD_BACKEND_DEVICE");
if (backend_name != nullptr) {
backend = ggml_backend_init_by_name(backend_name, nullptr);
if (!backend) {


@@ -40,7 +40,9 @@ set(TARGET_SRCS
server-models.h
)
option(LLAMA_BUILD_WEBUI "Build the embedded Web UI" ON)
# Option to specify custom HF bucket for webui (defaults to llama-ui)
# Usage: cmake -B build -DLLAMA_WEBUI_HF_BUCKET=llama-ui
set(LLAMA_WEBUI_HF_BUCKET "llama-ui" CACHE STRING "Hugging Face bucket name for prebuilt webui assets")
if (LLAMA_BUILD_WEBUI)
set(PUBLIC_ASSETS
@@ -50,19 +52,108 @@ if (LLAMA_BUILD_WEBUI)
loading.html
)
# Determine source of webui assets (priority: local > HF Bucket)
set(WEBUI_SOURCE "")
set(WEBUI_SOURCE_DIR "")
# Priority 1: Check for local webui build output
set(LOCAL_WEBUI_DIR "${CMAKE_CURRENT_SOURCE_DIR}/public")
# Verify all required assets exist before declaring local source valid
set(ALL_ASSETS_PRESENT TRUE)
foreach(asset ${PUBLIC_ASSETS})
set(input "${CMAKE_CURRENT_SOURCE_DIR}/public/${asset}")
set(output "${CMAKE_CURRENT_BINARY_DIR}/${asset}.hpp")
list(APPEND TARGET_SRCS ${output})
add_custom_command(
DEPENDS "${input}"
OUTPUT "${output}"
COMMAND "${CMAKE_COMMAND}" "-DINPUT=${input}" "-DOUTPUT=${output}" -P "${PROJECT_SOURCE_DIR}/scripts/xxd.cmake"
)
set_source_files_properties(${output} PROPERTIES GENERATED TRUE)
if(NOT EXISTS "${LOCAL_WEBUI_DIR}/${asset}")
set(ALL_ASSETS_PRESENT FALSE)
break()
endif()
endforeach()
add_definitions(-DLLAMA_BUILD_WEBUI)
if(ALL_ASSETS_PRESENT)
set(WEBUI_SOURCE "local")
set(WEBUI_SOURCE_DIR "${LOCAL_WEBUI_DIR}")
message(STATUS "WebUI: using local build from ${WEBUI_SOURCE_DIR}")
endif()
# Priority 2: Build-time asset provisioning (npm build → HF Bucket fallback)
if(NOT WEBUI_SOURCE_DIR)
if(DEFINED LLAMA_BUILD_NUMBER)
set(HF_WEBUI_VERSION "${LLAMA_BUILD_NUMBER}")
message(STATUS "WebUI: using LLAMA_BUILD_NUMBER=${HF_WEBUI_VERSION}")
else()
set(HF_WEBUI_VERSION "")
message(STATUS "WebUI: LLAMA_BUILD_NUMBER not defined")
endif()
# Stamp file embeds the version tag so a changed build number triggers
# a fresh provision run on the next `cmake --build` without reconfiguring.
if("${HF_WEBUI_VERSION}" STREQUAL "")
set(WEBUI_VERSION_TAG "provisioned")
else()
set(WEBUI_VERSION_TAG "${HF_WEBUI_VERSION}")
endif()
set(WEBUI_STAMP "${CMAKE_CURRENT_BINARY_DIR}/.webui-${WEBUI_VERSION_TAG}.stamp")
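# e.g. ".webui-6123.stamp" when LLAMA_BUILD_NUMBER=6123 (illustrative build number),
# or ".webui-provisioned.stamp" when no build number is available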
# Escape semicolons so the CMake list is passed as a single -D argument
string(REPLACE ";" "\\;" PUBLIC_ASSETS_ESC "${PUBLIC_ASSETS}")
add_custom_command(
OUTPUT ${WEBUI_STAMP}
COMMAND ${CMAKE_COMMAND}
"-DSOURCE_DIR=${PROJECT_SOURCE_DIR}"
"-DPUBLIC_DIR=${CMAKE_CURRENT_SOURCE_DIR}/public"
"-DHF_BUCKET=${LLAMA_WEBUI_HF_BUCKET}"
"-DHF_VERSION=${HF_WEBUI_VERSION}"
"-DHF_ENABLED=${LLAMA_USE_PREBUILT_WEBUI}"
"-DASSETS=${PUBLIC_ASSETS_ESC}"
"-DSTAMP_FILE=${WEBUI_STAMP}"
"-DNPM_DIR=${CMAKE_CURRENT_SOURCE_DIR}/webui"
-P ${PROJECT_SOURCE_DIR}/scripts/webui-download.cmake
COMMENT "Building/provisioning WebUI assets (npm build -> HF Bucket fallback)"
)
set(WEBUI_SOURCE "provisioned")
set(WEBUI_SOURCE_DIR "${CMAKE_CURRENT_SOURCE_DIR}/public")
endif()
# Process assets from the determined source
if(WEBUI_SOURCE_DIR)
foreach(asset ${PUBLIC_ASSETS})
set(input "${WEBUI_SOURCE_DIR}/${asset}")
set(output "${CMAKE_CURRENT_BINARY_DIR}/${asset}.hpp")
list(APPEND TARGET_SRCS ${output})
if(WEBUI_SOURCE STREQUAL "local")
# Local build: files exist at configure time
if(NOT EXISTS "${input}")
message(FATAL_ERROR "WebUI asset not found: ${input}")
endif()
set(dependency "${input}")
else()
# HF Bucket: files are downloaded at build time
set(dependency "${WEBUI_STAMP}")
endif()
add_custom_command(
DEPENDS ${dependency}
OUTPUT "${output}"
COMMAND "${CMAKE_COMMAND}" "-DINPUT=${input}" "-DOUTPUT=${output}" -P "${PROJECT_SOURCE_DIR}/scripts/xxd.cmake"
)
set_source_files_properties(${output} PROPERTIES GENERATED TRUE)
endforeach()
add_definitions(-DLLAMA_BUILD_WEBUI)
add_definitions(-DLLAMA_WEBUI_DEFAULT_ENABLED=1)
message(STATUS "WebUI: embedded with source: ${WEBUI_SOURCE}")
else()
# WebUI source not found - issue warning but don't fail the build
# The server will still build but without webui embedded
message(WARNING "WebUI: no source available. Neither local build (tools/server/public/) nor HF Bucket download succeeded.")
message(WARNING "WebUI: building server without embedded WebUI. Set LLAMA_BUILD_WEBUI=OFF to suppress this warning.")
add_definitions(-DLLAMA_WEBUI_DEFAULT_ENABLED=0)
endif()
else()
# WebUI is disabled at build time
add_definitions(-DLLAMA_WEBUI_DEFAULT_ENABLED=0)
endif()
add_executable(${TARGET} ${TARGET_SRCS})
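For reference, each entry of `PUBLIC_ASSETS` is run through `scripts/xxd.cmake`, which generates an `<asset>.hpp` exposing a byte array plus a length — the same pattern the server uses for `loading_html` / `loading_html_len` later in this diff. A minimal sketch of serving such an embedded asset from memory; the header name and symbols (`index_html`, `index_html_len`) are illustrative, not the exact generated names:

```cpp
#include <httplib.h>

#include "index.html.hpp" // hypothetical output of scripts/xxd.cmake

// register the embedded asset as an in-memory route; no filesystem access
static void register_webui_root(httplib::Server & srv) {
    srv.Get("/", [](const httplib::Request &, httplib::Response & res) {
        res.set_content(reinterpret_cast<const char *>(index_html),
                        index_html_len, "text/html; charset=utf-8");
    });
}
```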


@@ -1822,43 +1822,3 @@ You can specify default preferences for the web UI using `--webui-config <JSON c
```
You may find available preferences in [settings-config.ts](webui/src/lib/constants/settings-config.ts).
### Legacy completion web UI
A new chat-based UI has replaced the old completion-based since [this PR](https://github.com/ggml-org/llama.cpp/pull/10175). If you want to use the old completion, start the server with `--path ./tools/server/public_legacy`
For example:
```sh
./llama-server -m my_model.gguf -c 8192 --path ./tools/server/public_legacy
```
### Extending or building alternative Web Front End
You can extend the front end by running the server binary with `--path` set to `./your-directory` and importing `/completion.js` to get access to the llamaComplete() method.
Read the documentation in `/completion.js` to see convenient ways to access llama.
A simple example is below:
```html
<html>
<body>
<pre>
<script type="module">
import { llama } from '/completion.js'
const prompt = `### Instruction:
Write dad jokes, each one paragraph.
You can use html formatting if needed.
### Response:`
for await (const chunk of llama(prompt)) {
document.write(chunk.data.content)
}
</script>
</pre>
</body>
</html>
```

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long


@@ -1,12 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="refresh" content="5">
</head>
<body>
<div id="loading">
The model is loading. Please wait.<br/>
The user interface will appear soon.
</div>
</body>
</html>


@@ -15,17 +15,19 @@
using json = nlohmann::ordered_json;
#define SLT_DBG(slot, fmt, ...) LOG_DBG("slot %12.*s: id %2d | task %d | " fmt, 12, __func__, (slot).id, ((slot).task ? (slot).task->id : -1), __VA_ARGS__)
#define SLT_TRC(slot, fmt, ...) LOG_TRC("slot %12.*s: id %2d | task %d | " fmt, 12, __func__, (slot).id, ((slot).task ? (slot).task->id : -1), __VA_ARGS__)
#define SLT_INF(slot, fmt, ...) LOG_INF("slot %12.*s: id %2d | task %d | " fmt, 12, __func__, (slot).id, ((slot).task ? (slot).task->id : -1), __VA_ARGS__)
#define SLT_CNT(slot, fmt, ...) LOG_CNT("" fmt, __VA_ARGS__)
#define SLT_WRN(slot, fmt, ...) LOG_WRN("slot %12.*s: id %2d | task %d | " fmt, 12, __func__, (slot).id, ((slot).task ? (slot).task->id : -1), __VA_ARGS__)
#define SLT_ERR(slot, fmt, ...) LOG_ERR("slot %12.*s: id %2d | task %d | " fmt, 12, __func__, (slot).id, ((slot).task ? (slot).task->id : -1), __VA_ARGS__)
#define SLT_DBG(slot, fmt, ...) LOG_DBG("slot %12.*s: id %2d | task %d | " fmt, 12, __func__, (slot).id, ((slot).task ? (slot).task->id : -1), __VA_ARGS__)
#define SLT_CNT(slot, fmt, ...) LOG_CNT("" fmt, __VA_ARGS__)
#define SRV_DBG(fmt, ...) LOG_DBG("srv %12.*s: " fmt, 12, __func__, __VA_ARGS__)
#define SRV_TRC(fmt, ...) LOG_TRC("srv %12.*s: " fmt, 12, __func__, __VA_ARGS__)
#define SRV_INF(fmt, ...) LOG_INF("srv %12.*s: " fmt, 12, __func__, __VA_ARGS__)
#define SRV_CNT(fmt, ...) LOG_CNT("" fmt, __VA_ARGS__)
#define SRV_WRN(fmt, ...) LOG_WRN("srv %12.*s: " fmt, 12, __func__, __VA_ARGS__)
#define SRV_ERR(fmt, ...) LOG_ERR("srv %12.*s: " fmt, 12, __func__, __VA_ARGS__)
#define SRV_DBG(fmt, ...) LOG_DBG("srv %12.*s: " fmt, 12, __func__, __VA_ARGS__)
#define SRV_CNT(fmt, ...) LOG_CNT("" fmt, __VA_ARGS__)
using raw_buffer = std::vector<uint8_t>;
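Two conventions in the macro block above are worth spelling out. The `*_CNT` variants log at `GGML_LOG_LEVEL_CONT` and omit the `slot ...` / `srv ...` prefix, so they continue a line started by a prefixed macro. And the `SRV_INF("%s", "literal\n")` idiom seen throughout this diff exists because the macros require at least one vararg; routing the literal through `%s` also keeps format-security warnings quiet. A hedged usage sketch (token ids illustrative):

```cpp
// start a prefixed line, then extend it with unprefixed continuations
SRV_INF("%s", "prompt tokens:");
for (const llama_token id : { 1, 15043, 3186 }) { // illustrative ids
    SRV_CNT(" %d", id);
}
SRV_CNT("%s", "\n");
```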


@@ -166,6 +166,7 @@ struct server_slot {
// stats
size_t n_sent_text = 0; // number of sent text characters
int64_t t_print_last = 0;
int64_t t_start_process_prompt;
int64_t t_start_generation;
@@ -233,7 +234,7 @@ struct server_slot {
}
}
SLT_INF(*this, "init sampler, took %0.2f ms, tokens: text = %d, total = %d\n",
SLT_TRC(*this, "init sampler, took %0.2f ms, tokens: text = %d, total = %d\n",
(ggml_time_us() - t_start) / 1000.0, n_text, (int) prompt.tokens.size());
}
@@ -417,6 +418,36 @@ struct server_slot {
return stop_pos;
}
void print_timings_tg() {
if (n_decoded < 100) {
return;
}
const int64_t t_now = ggml_time_us();
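// throttle: at most one progress line every 3 seconds (ggml_time_us() returns microseconds)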
if (t_now - t_print_last < 3*1000*1000) {
return;
}
t_print_last = t_now;
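// t_token_generation is in milliseconds: tokens/s = n_decoded / (t_token_generation / 1e3)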
const double n_gen_second = 1e3 / t_token_generation * n_decoded;
SLT_INF(*this, "n_decoded = %6d, tg = %6.2f t/s\n", n_decoded, n_gen_second);
}
void print_timings_pp() const {
const double n_prompt_second = 1e3 / t_prompt_processing * n_prompt_tokens_processed;
const double f_progress = (float) prompt.n_tokens() / task->n_tokens();
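// f_progress: fraction of the task prompt already in this slot's context (1.0 once fully ingested)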
if (t_prompt_processing < 3000.0) {
return;
}
SLT_INF(*this, "prompt processing, n_tokens = %6d, progress = %.2f, t = %6.2f s / %.2f tokens per second\n",
n_prompt_tokens_processed, f_progress, t_prompt_processing / 1e3, n_prompt_second);
}
void print_timings() const {
const double t_prompt = t_prompt_processing / n_prompt_tokens_processed;
const double n_prompt_second = 1e3 / t_prompt_processing * n_prompt_tokens_processed;
@@ -588,6 +619,10 @@ public:
// note: chat_params must not be refreshed upon existing sleeping state
server_chat_params chat_params;
server_context_impl() {
mtmd_helper_log_set(common_log_default_callback, nullptr);
}
~server_context_impl() {
if (!sleeping) {
// destroy() is already called when entering sleeping state
@@ -749,10 +784,6 @@ private:
std::string & mmproj_path = params_base.mmproj.path;
if (!mmproj_path.empty()) {
if (!is_resume) {
mtmd_helper_log_set(common_log_default_callback, nullptr);
}
mtmd_context_params mparams = mtmd_context_params_default();
mparams.use_gpu = params_base.mmproj_use_gpu;
@@ -896,17 +927,17 @@ private:
if (params_base.cache_ram_mib != 0) {
if (params_base.cache_ram_mib < 0) {
SRV_WRN("prompt cache is enabled, size limit: %s\n", "no limit");
SRV_INF("prompt cache is enabled, size limit: %s\n", "no limit");
} else {
SRV_WRN("prompt cache is enabled, size limit: %d MiB\n", params_base.cache_ram_mib);
SRV_INF("prompt cache is enabled, size limit: %d MiB\n", params_base.cache_ram_mib);
}
SRV_WRN("%s", "use `--cache-ram 0` to disable the prompt cache\n");
SRV_INF("%s", "use `--cache-ram 0` to disable the prompt cache\n");
prompt_cache = std::make_unique<server_prompt_cache>(params_base.cache_ram_mib, n_ctx);
} else {
SRV_WRN("%s", "prompt cache is disabled - use `--cache-ram N` to enable it\n");
SRV_INF("%s", "prompt cache is disabled - use `--cache-ram N` to enable it\n");
}
SRV_WRN("%s", "for more info see https://github.com/ggml-org/llama.cpp/pull/16391\n");
SRV_INF("%s", "for more info see https://github.com/ggml-org/llama.cpp/pull/16391\n");
if (!params_base.model_alias.empty()) {
// backward compat: use first alias as model name
@@ -954,13 +985,13 @@ private:
if (params_base.cache_idle_slots) {
if (!params_base.kv_unified) {
SRV_WRN("%s: --cache-idle-slots requires --kv-unified, disabling\n", __func__);
SRV_WRN("%s", "--cache-idle-slots requires --kv-unified, disabling\n");
params_base.cache_idle_slots = false;
} else if (params_base.cache_ram_mib == 0) {
SRV_WRN("%s: --cache-idle-slots requires --cache-ram, disabling\n", __func__);
SRV_WRN("%s", "--cache-idle-slots requires --cache-ram, disabling\n");
params_base.cache_idle_slots = false;
} else {
SRV_INF("%s: idle slots will be saved to prompt cache and cleared upon starting a new task\n", __func__);
SRV_INF("%s", "idle slots will be saved to prompt cache and cleared upon starting a new task\n");
SRV_DBG("%s", "__TEST_TAG_CACHE_IDLE_SLOTS_ENABLED__\n");
}
}
@@ -1112,7 +1143,7 @@ private:
update_cache = update_cache && task.type == SERVER_TASK_TYPE_COMPLETION;
if (update_cache) {
SRV_WRN("%s", "updating prompt cache\n");
SRV_INF("%s", "updating prompt cache\n");
const int64_t t_start = ggml_time_us();
@@ -1127,7 +1158,7 @@ private:
prompt_cache->update();
SRV_WRN("prompt cache update took %.2f ms\n", (ggml_time_us() - t_start) / 1000.0);
SRV_INF("prompt cache update took %.2f ms\n", (ggml_time_us() - t_start) / 1000.0);
}
}
@@ -1186,10 +1217,10 @@ private:
if (!are_lora_equal(task_loras, slot.lora)) {
// if lora has changed, check to see if the cache should be cleared
if (lora_should_clear_cache(slot.lora, task_loras)) {
SLT_INF(slot, "clearing cache for lora change. %zu loras -> %zu loras\n", slot.lora.size(), task.params.lora.size());
SLT_TRC(slot, "clearing cache for lora change. %zu loras -> %zu loras\n", slot.lora.size(), task.params.lora.size());
slot.prompt.tokens.clear();
} else {
SLT_INF(slot, "keeping cache for alora. %zu target loras\n", task_loras.size());
SLT_TRC(slot, "keeping cache for alora. %zu target loras\n", task_loras.size());
}
slot.lora = task_loras;
}
@@ -1281,7 +1312,8 @@ private:
llama_set_sampler(ctx_tgt, slot.id, nullptr);
}
SLT_INF(slot, "sampler chain: %s\n", common_sampler_print(slot.smpl.get()).c_str());
SLT_TRC(slot, "sampler chain: %s\n", common_sampler_print(slot.smpl.get()).c_str());
SLT_TRC(slot, "sampler params: \n%s\n", task.params.sampling.print().c_str());
} else {
slot.smpl.reset();
}
@@ -1800,7 +1832,7 @@ private:
cur.update_tgt(ctx_tgt, slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
cur.update_dft(ctx_dft.get(), slot.id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
SLT_WRN(slot,
SLT_INF(slot,
"created context checkpoint %d of %d (pos_min = %d, pos_max = %d, n_tokens = %" PRId64 ", size = %.3f MiB)\n",
(int) slot.prompt.checkpoints.size(), params_base.n_ctx_checkpoints, cur.pos_min,
cur.pos_max, cur.n_tokens, (float) cur.size() / 1024 / 1024);
@@ -2339,7 +2371,7 @@ private:
slot.state = SLOT_STATE_PROCESSING_PROMPT;
SLT_INF(slot, "new prompt, n_ctx_slot = %d, n_keep = %d, task.n_tokens = %d\n",
SLT_TRC(slot, "new prompt, n_ctx_slot = %d, n_keep = %d, task.n_tokens = %d\n",
slot.n_ctx, slot.task->params.n_keep, slot.task->n_tokens());
// print prompt tokens (for debugging)
@@ -2454,7 +2486,7 @@ private:
}
if (n_match >= (size_t) n_cache_reuse) {
SLT_INF(slot, "reusing chunk with size %zu, shifting KV cache [%zu, %zu) -> [%zu, %zu)\n", n_match, head_c, head_c + n_match, head_p, head_p + n_match);
SLT_TRC(slot, "reusing chunk with size %zu, shifting KV cache [%zu, %zu) -> [%zu, %zu)\n", n_match, head_c, head_c + n_match, head_p, head_p + n_match);
//for (size_t i = head_p; i < head_p + n_match; i++) {
// SLT_DBG(slot, "cache token %3zu: %6d '%s'\n", i, prompt_tokens[i], common_token_to_piece(ctx_tgt, prompt_tokens[i]).c_str());
//}
@@ -2620,10 +2652,14 @@ private:
}
}
const int64_t t_current = ggml_time_us();
slot.t_prompt_processing = (t_current - slot.t_start_process_prompt) / 1e3;
slot.print_timings_pp();
// truncate any tokens that are beyond n_past for this slot
const llama_pos p0 = slot.prompt.tokens.pos_next();
SLT_INF(slot, "n_tokens = %d, memory_seq_rm [%d, end)\n", slot.prompt.n_tokens(), p0);
SLT_TRC(slot, "cached n_tokens = %d, memory_seq_rm [%d, end)\n", slot.prompt.n_tokens(), p0);
if (!llama_memory_seq_rm(llama_get_memory(ctx_tgt), slot.id, p0, -1)) {
SLT_WRN(slot, "failed to truncate tokens with position >= %d - clearing the memory\n", p0);
@@ -2764,7 +2800,6 @@ private:
slot.i_batch = batch.n_tokens - 1;
slot.init_sampler();
SLT_INF(slot, "prompt processing done, n_tokens = %d, batch.n_tokens = %d\n", slot.prompt.n_tokens(), batch.n_tokens);
} else {
if (slot.task->n_tokens() < slot.prompt.n_tokens() + n_ubatch) {
// near the end of the prompt
@@ -2786,8 +2821,6 @@ private:
}
}
}
SLT_INF(slot, "prompt processing progress, n_tokens = %d, batch.n_tokens = %d, progress = %f\n", slot.prompt.n_tokens(), batch.n_tokens, (float) slot.prompt.n_tokens() / slot.task->n_tokens());
}
const auto pos_min = llama_memory_seq_pos_min(llama_get_memory(ctx_tgt), slot.id);
@@ -3084,6 +3117,8 @@ private:
continue;
}
slot.print_timings_tg();
}
// speculative decoding - main model sample and accept
@@ -3196,6 +3231,8 @@ private:
}
}
slot.print_timings_tg();
SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n", (int) ids.size() - 1, (int) n_draft, slot.prompt.n_tokens());
}
}


@@ -47,7 +47,7 @@ static void log_server_request(const httplib::Request & req, const httplib::Resp
// reminder: this function is not covered by httplib's exception handler; if someone does more complicated stuff, think about wrapping it in try-catch
SRV_INF("done request: %s %s %s %d\n", req.method.c_str(), req.path.c_str(), req.remote_addr.c_str(), res.status);
SRV_TRC("done request: %s %s %s %d\n", req.method.c_str(), req.path.c_str(), req.remote_addr.c_str(), res.status);
SRV_DBG("request: %s\n", req.body.c_str());
SRV_DBG("response: %s\n", res.body.c_str());
@@ -89,10 +89,10 @@ bool server_http_context::init(const common_params & params) {
hostname = params.hostname;
if (gcp.enabled) {
LOG_INF("%s: Google Cloud Platform compat: health route = %s, predict route = %s, port = %d\n", __func__, gcp.path_health.c_str(), gcp.path_predict.c_str(), gcp.port);
SRV_INF("Google Cloud Platform compat: health route = %s, predict route = %s, port = %d\n", gcp.path_health.c_str(), gcp.path_predict.c_str(), gcp.port);
if (port != gcp.port) {
LOG_WRN("%s: Google Cloud Platform compat: overriding server port %d with AIP_HTTP_PORT %d\n", __func__, port, gcp.port);
SRV_WRN("Google Cloud Platform compat: overriding server port %d with AIP_HTTP_PORT %d\n", port, gcp.port);
}
port = gcp.port;
@@ -102,17 +102,17 @@ bool server_http_context::init(const common_params & params) {
#ifdef CPPHTTPLIB_OPENSSL_SUPPORT
if (params.ssl_file_key != "" && params.ssl_file_cert != "") {
LOG_INF("Running with SSL: key = %s, cert = %s\n", params.ssl_file_key.c_str(), params.ssl_file_cert.c_str());
SRV_INF("running with SSL: key = %s, cert = %s\n", params.ssl_file_key.c_str(), params.ssl_file_cert.c_str());
srv.reset(
new httplib::SSLServer(params.ssl_file_cert.c_str(), params.ssl_file_key.c_str())
);
} else {
LOG_INF("Running without SSL\n");
SRV_INF("%s", "running without SSL\n");
srv.reset(new httplib::Server());
}
#else
if (params.ssl_file_key != "" && params.ssl_file_cert != "") {
LOG_ERR("Server is built without SSL support\n");
SRV_ERR("%s", "the server is built without SSL support\n");
return false;
}
srv.reset(new httplib::Server());
@@ -134,7 +134,7 @@ bool server_http_context::init(const common_params & params) {
res.status = 500;
res.set_content(message, "text/plain");
LOG_ERR("got exception: %s\n", message.c_str());
SRV_ERR("got exception: %s\n", message.c_str());
});
srv->set_error_handler([](const httplib::Request &, httplib::Response & res) {
@@ -162,7 +162,7 @@ bool server_http_context::init(const common_params & params) {
#ifdef SO_REUSEPORT
httplib::set_socket_opt(sock, SOL_SOCKET, SO_REUSEPORT, 1);
#else
LOG_WRN("%s: SO_REUSEPORT is not supported\n", __func__);
SRV_WRN("%s", "SO_REUSEPORT is not supported\n");
#endif
}
});
@@ -170,9 +170,9 @@ bool server_http_context::init(const common_params & params) {
if (params.api_keys.size() == 1) {
auto key = params.api_keys[0];
std::string substr = key.substr(std::max((int)(key.length() - 4), 0));
LOG_INF("%s: api_keys: ****%s\n", __func__, substr.c_str());
SRV_INF("api_keys: ****%s\n", substr.c_str());
} else if (params.api_keys.size() > 1) {
LOG_INF("%s: api_keys: %zu keys loaded\n", __func__, params.api_keys.size());
SRV_INF("api_keys: %zu keys loaded\n", params.api_keys.size());
}
//
@@ -232,36 +232,36 @@ bool server_http_context::init(const common_params & params) {
"application/json; charset=utf-8"
);
LOG_WRN("Unauthorized: Invalid API Key\n");
SRV_WRN("%s", "unauthorized: Invalid API Key\n");
return false;
};
auto middleware_server_state = [this](const httplib::Request & req, httplib::Response & res) {
(void)req; // suppress unused parameter warning when LLAMA_BUILD_WEBUI is not defined
bool ready = is_ready.load();
if (!ready) {
#ifdef LLAMA_BUILD_WEBUI
auto tmp = string_split<std::string>(req.path, '.');
if (req.path == "/" || tmp.back() == "html") {
if (req.path == "/" || (tmp.size() > 0 && tmp.back() == "html")) {
res.status = 503;
res.set_content(reinterpret_cast<const char*>(loading_html), loading_html_len, "text/html; charset=utf-8");
} else
#endif
{
// no endpoints is allowed to be accessed when the server is not ready
// this is to prevent any data races or inconsistent states
res.status = 503;
res.set_content(
safe_json_to_str(json {
{"error", {
{"message", "Loading model"},
{"type", "unavailable_error"},
{"code", 503}
}}
}),
"application/json; charset=utf-8"
);
return false;
}
#endif
// no endpoints are allowed to be accessed when the server is not ready
// this is to prevent any data races or inconsistent states
res.status = 503;
res.set_content(
safe_json_to_str(json {
{"error", {
{"message", "Loading model"},
{"type", "unavailable_error"},
{"code", 503}
}}
}),
"application/json; charset=utf-8"
);
return false;
}
return true;
@@ -292,7 +292,7 @@ bool server_http_context::init(const common_params & params) {
// +4 threads for monitoring and health, plus some reserved for MCP and other future tasks
n_threads_http = std::max(params.n_parallel + 4, (int32_t) std::thread::hardware_concurrency() - 1);
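// e.g. n_parallel = 8 on a 16-core machine: max(8 + 4, 16 - 1) = 15 HTTP threads (illustrative)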
}
LOG_INF("%s: using %d threads for HTTP server\n", __func__, n_threads_http);
SRV_INF("using %d threads for HTTP server\n", n_threads_http);
srv->new_task_queue = [n_threads_http] {
// spawn n_threads_http fixed threads (always alive), while allowing up to 1024 additional threads
// once all fixed threads are busy, the server creates new "dynamic" threads that are destroyed after processing each request
@@ -306,14 +306,14 @@ bool server_http_context::init(const common_params & params) {
//
if (!params.webui) {
LOG_INF("Web UI is disabled\n");
SRV_INF("%s", "the WebUI is disabled\n");
} else {
// register static assets routes
if (!params.public_path.empty()) {
// Set the base directory for serving static files
bool is_found = srv->set_mount_point(params.api_prefix + "/", params.public_path);
if (!is_found) {
LOG_ERR("%s: static assets path not found: %s\n", __func__, params.public_path.c_str());
SRV_ERR("static assets path not found: %s\n", params.public_path.c_str());
return 1;
}
} else {
@@ -348,13 +348,13 @@ bool server_http_context::start() {
bool is_sock = false;
if (string_ends_with(std::string(hostname), ".sock")) {
is_sock = true;
LOG_INF("%s: setting address family to AF_UNIX\n", __func__);
SRV_INF("%s", "setting address family to AF_UNIX\n");
srv->set_address_family(AF_UNIX);
// bind_to_port requires a second arg; any value other than 0 is
// simply ignored
was_bound = srv->bind_to_port(hostname, 8080);
} else {
LOG_INF("%s: binding port with default address family\n", __func__);
SRV_INF("%s", "binding port with default address family\n");
// bind HTTP listen port
if (port == 0) {
int bound_port = srv->bind_to_any_port(hostname);
@@ -368,7 +368,7 @@ bool server_http_context::start() {
}
if (!was_bound) {
LOG_ERR("%s: couldn't bind HTTP server socket, hostname: %s, port: %d\n", __func__, hostname.c_str(), port);
SRV_ERR("couldn't bind HTTP server socket, hostname: %s, port: %d\n", hostname.c_str(), port);
return false;
}
@@ -580,7 +580,7 @@ void server_http_context::register_gcp_compat() {
}
if (handlers.count(gcp.path_predict)) {
LOG_ERR("%s: AIP_PREDICT_ROUTE=%s conflicts with an existing llama-server route\n", __func__, gcp.path_predict.c_str());
SRV_ERR("AIP_PREDICT_ROUTE=%s conflicts with an existing llama-server route\n", gcp.path_predict.c_str());
exit(1);
}
@@ -651,7 +651,7 @@ void server_http_context::register_gcp_compat() {
payload.erase("@requestFormat");
if (payload.contains("stream")) {
LOG_WRN("%s: ignoring client-provided stream field in instance, streaming is not supported in predict route\n", __func__);
SRV_WRN("%s", "ignoring client-provided stream field in instance, streaming is not supported in predict route\n");
payload["stream"] = false;
}


@@ -1988,7 +1988,7 @@ server_prompt * server_prompt_cache::alloc(const server_prompt & prompt, size_t
const int cur_lcp_len = it->tokens.get_common_prefix(prompt.tokens);
if (cur_lcp_len == (int) prompt.tokens.size()) {
SRV_WRN("%s", " - prompt is already in the cache, skipping\n");
SRV_INF("%s", " - prompt is already in the cache, skipping\n");
return nullptr;
}
}
@@ -2043,7 +2043,7 @@ bool server_prompt_cache::load(server_prompt & prompt, const server_tokens & tok
float f_keep_best = prompt.tokens.size() > 0 ? float(lcp_best) / prompt.tokens.size() : -1.0f; // empty slot: any cache entry wins
float sim_best = float(lcp_best) / tokens_new.size();
SRV_WRN(" - looking for better prompt, base f_keep = %.3f, sim = %.3f\n", f_keep_best, sim_best);
SRV_INF(" - looking for better prompt, base f_keep = %.3f, sim = %.3f\n", f_keep_best, sim_best);
auto it_best = states.end();
@@ -2068,7 +2068,7 @@ bool server_prompt_cache::load(server_prompt & prompt, const server_tokens & tok
}
if (it_best != states.end()) {
SRV_WRN(" - found better prompt with f_keep = %.3f, sim = %.3f\n", f_keep_best, sim_best);
SRV_INF(" - found better prompt with f_keep = %.3f, sim = %.3f\n", f_keep_best, sim_best);
{
auto & data = it_best->data.main;
@@ -2076,7 +2076,7 @@ bool server_prompt_cache::load(server_prompt & prompt, const server_tokens & tok
const size_t size = data.size();
const size_t n = llama_state_seq_set_data_ext(ctx_tgt, data.data(), size, id_slot, 0);
if (n != size) {
SRV_WRN("failed to restore state with size %zu\n", size);
SRV_ERR("failed to restore state with size %zu\n", size);
return false;
}
@@ -2145,11 +2145,11 @@ void server_prompt_cache::update() {
}
}
SRV_WRN(" - cache state: %zu prompts, %.3f MiB (limits: %.3f MiB, %zu tokens, %zu est)\n",
SRV_INF(" - cache state: %zu prompts, %.3f MiB (limits: %.3f MiB, %zu tokens, %zu est)\n",
states.size(), size() / (1024.0 * 1024.0), limit_size / (1024.0 * 1024.0), limit_tokens, limit_tokens_cur);
for (const auto & state : states) {
SRV_WRN(" - prompt %p: %7d tokens, checkpoints: %2zu, %9.3f MiB\n",
SRV_INF(" - prompt %p: %7d tokens, checkpoints: %2zu, %9.3f MiB\n",
(const void *)&state, state.n_tokens(), state.checkpoints.size(), state.size() / (1024.0 * 1024.0));
}
}
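The ratios logged above derive from the longest common token prefix (lcp) between a cached state and the prompt under consideration: `f_keep` is the fraction of the resident state that would survive, `sim` the fraction of the incoming prompt it covers. A minimal sketch of the arithmetic, with an illustrative helper name:

```cpp
#include <utility>

// f_keep = lcp / n_cached -> fraction of the resident state reused
// sim    = lcp / n_new    -> fraction of the incoming prompt covered
static std::pair<float, float> cache_scores(int lcp, int n_cached, int n_new) {
    const float f_keep = n_cached > 0 ? float(lcp) / n_cached : -1.0f; // empty slot: any entry wins
    const float sim    = n_new    > 0 ? float(lcp) / n_new    :  0.0f;
    return { f_keep, sim };
}
```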


@@ -83,17 +83,22 @@ int main(int argc, char ** argv) {
return 1;
}
llama_backend_init();
llama_numa_init(params.numa);
common_params_print_info(params);
// validate batch size for embeddings
// embeddings require all tokens to be processed in a single ubatch
// see https://github.com/ggml-org/llama.cpp/issues/12836
if (params.embedding && params.n_batch > params.n_ubatch) {
LOG_WRN("%s: embeddings enabled with n_batch (%d) > n_ubatch (%d)\n", __func__, params.n_batch, params.n_ubatch);
LOG_WRN("%s: setting n_batch = n_ubatch = %d to avoid assertion failure\n", __func__, params.n_ubatch);
SRV_WRN("embeddings enabled with n_batch (%d) > n_ubatch (%d)\n", params.n_batch, params.n_ubatch);
SRV_WRN("setting n_batch = n_ubatch = %d to avoid assertion failure\n", params.n_ubatch);
params.n_batch = params.n_ubatch;
}
if (params.n_parallel < 0) {
LOG_INF("%s: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true\n", __func__);
SRV_INF("%s", "n_parallel is set to auto, using n_parallel = 4 and kv_unified = true\n");
params.n_parallel = 4;
params.kv_unified = true;
@@ -107,15 +112,9 @@ int main(int argc, char ** argv) {
// struct that contains llama context and inference
server_context ctx_server;
llama_backend_init();
llama_numa_init(params.numa);
LOG_INF("build_info: %s\n", llama_build_info());
LOG_INF("%s\n", common_params_get_system_info(params).c_str());
server_http_context ctx_http;
if (!ctx_http.init(params)) {
LOG_ERR("%s: failed to initialize HTTP server\n", __func__);
SRV_ERR("%s", "failed to initialize HTTP server\n");
return 1;
}
@@ -134,7 +133,7 @@ int main(int argc, char ** argv) {
try {
models_routes.emplace(params, argc, argv);
} catch (const std::exception & e) {
LOG_ERR("%s: failed to initialize router models: %s\n", __func__, e.what());
SRV_ERR("failed to initialize router models: %s\n", e.what());
return 1;
}
@@ -222,7 +221,7 @@ int main(int argc, char ** argv) {
try {
tools.setup(params.server_tools);
} catch (const std::exception & e) {
LOG_ERR("%s: tools setup failed: %s\n", __func__, e.what());
SRV_ERR("tools setup failed: %s\n", e.what());
return 1;
}
SRV_WRN("%s", "-----------------\n");
@@ -240,7 +239,7 @@ int main(int argc, char ** argv) {
std::function<void()> clean_up;
if (is_router_server) {
LOG_INF("%s: starting router server, no model will be loaded in this process\n", __func__);
SRV_INF("%s", "starting router server, no model will be loaded in this process\n");
clean_up = [&models_routes]() {
SRV_INF("%s: cleaning up before exit...\n", __func__);
@@ -252,7 +251,7 @@ int main(int argc, char ** argv) {
if (!ctx_http.start()) {
clean_up();
LOG_ERR("%s: exiting due to HTTP server error\n", __func__);
SRV_ERR("%s", "exiting due to HTTP server error\n");
return 1;
}
ctx_http.is_ready.store(true);
@@ -273,12 +272,12 @@ int main(int argc, char ** argv) {
// start the HTTP server before loading the model to be able to serve /health requests
if (!ctx_http.start()) {
clean_up();
LOG_ERR("%s: exiting due to HTTP server error\n", __func__);
SRV_ERR("%s", "exiting due to HTTP server error\n");
return 1;
}
// load the model
LOG_INF("%s: loading model\n", __func__);
SRV_INF("%s", "loading model\n");
if (server_models::is_child_server()) {
ctx_server.on_sleeping_changed([&](bool sleeping) {
@@ -291,14 +290,14 @@ int main(int argc, char ** argv) {
if (ctx_http.thread.joinable()) {
ctx_http.thread.join();
}
LOG_ERR("%s: exiting due to model loading error\n", __func__);
SRV_ERR("%s", "exiting due to model loading error\n");
return 1;
}
routes.update_meta(ctx_server);
ctx_http.is_ready.store(true);
LOG_INF("%s: model loaded\n", __func__);
SRV_INF("%s", "model loaded\n");
shutdown_handler = [&](int) {
// this will unblock start_loop()
@@ -322,9 +321,9 @@ int main(int argc, char ** argv) {
#endif
if (is_router_server) {
LOG_INF("%s: router server is listening on %s\n", __func__, ctx_http.listening_address.c_str());
LOG_INF("%s: NOTE: router mode is experimental\n", __func__);
LOG_INF("%s: it is not recommended to use this mode in untrusted environments\n", __func__);
SRV_INF("router server is listening on %s\n", ctx_http.listening_address.c_str());
SRV_WRN("%s", "NOTE: router mode is experimental\n");
SRV_WRN("%s", " it is not recommended to use this mode in untrusted environments\n");
if (ctx_http.thread.joinable()) {
ctx_http.thread.join(); // keep the main thread alive
}
@@ -332,8 +331,7 @@ int main(int argc, char ** argv) {
// when the HTTP server stops, clean up and exit
clean_up();
} else {
LOG_INF("%s: server is listening on %s\n", __func__, ctx_http.listening_address.c_str());
LOG_INF("%s: starting the main loop...\n", __func__);
SRV_INF("server is listening on %s\n", ctx_http.listening_address.c_str());
// optionally, notify router server that this instance is ready
std::thread monitor_thread;


@@ -1,6 +1,6 @@
# llama.cpp Web UI
# llama-ui
A modern, feature-rich web interface for llama.cpp built with SvelteKit. This UI provides an intuitive chat interface with advanced file handling, conversation management, and comprehensive model interaction capabilities.
A modern, feature-rich web interface for llama-server built with SvelteKit. This UI provides an intuitive chat interface with advanced file handling, conversation management, and comprehensive model interaction capabilities.
The WebUI supports two server operation modes:


@@ -1,11 +1,11 @@
{
"name": "llama-server-webui",
"name": "llama-ui",
"version": "1.0.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "llama-server-webui",
"name": "llama-ui",
"version": "1.0.0",
"dependencies": {
"@modelcontextprotocol/sdk": "^1.25.1",


@@ -1,5 +1,5 @@
{
"name": "llama-server-webui",
"name": "llama-ui",
"private": true,
"version": "1.0.0",
"type": "module",


@@ -1,7 +1,7 @@
#!/bin/bash
# Development script for llama.cpp webui
#
# Development script for llama-ui
#
# This script starts the webui development servers (Storybook and Vite).
# Note: You need to start llama-server separately.
#
@@ -14,12 +14,12 @@ cd ../../../
# Check and install git hooks if missing
check_and_install_hooks() {
local hooks_missing=false
# Check for required hooks
if [ ! -f ".git/hooks/pre-commit" ] || [ ! -f ".git/hooks/pre-push" ] || [ ! -f ".git/hooks/post-push" ]; then
hooks_missing=true
fi
if [ "$hooks_missing" = true ]; then
echo "🔧 Git hooks missing, installing them..."
cd tools/server/webui


@@ -1,7 +1,7 @@
#!/bin/bash
# Script to install pre-commit hook for webui
# Pre-commit: formats, checks, builds, and stages build output
# Pre-commit: formats, checks, and builds webui
REPO_ROOT=$(git rev-parse --show-toplevel)
PRE_COMMIT_HOOK="$REPO_ROOT/.git/hooks/pre-commit"
@@ -56,11 +56,7 @@ if git diff --cached --name-only | grep -q "^tools/server/webui/"; then
exit 1
fi
# Stage the build output alongside the source changes
cd "$REPO_ROOT"
git add tools/server/public/
echo "✅ Webui built and build output staged"
echo "✅ Webui built successfully"
fi
exit 0
@@ -75,7 +71,7 @@ if [ $? -eq 0 ]; then
echo ""
echo "The hook will automatically:"
echo " • Format, lint and check webui code before commits"
echo " • Build webui and stage tools/server/public/ into the same commit"
echo " • Build webui"
else
echo "❌ Failed to make hook executable"
exit 1


@@ -323,7 +323,7 @@
});
function handleMessagesReady() {
if (!disableAutoScroll) {
if (!disableAutoScroll && !autoScroll.userScrolledUp) {
requestAnimationFrame(() => {
autoScroll.scrollToBottom('instant');
});
@@ -379,7 +379,9 @@
messages={activeMessages()}
onUserAction={() => {
autoScroll.enable();
autoScroll.scrollToBottom();
if (!autoScroll.userScrolledUp) {
autoScroll.scrollToBottom();
}
}}
onMessagesReady={handleMessagesReady}
/>


@@ -8,8 +8,7 @@
message?: string;
}
let { class: className = '', message = 'Initializing connection to llama.cpp server...' }: Props =
$props();
let { class: className = '', message = 'Initializing connection to server...' }: Props = $props();
</script>
<div class="flex h-full items-center justify-center {className}">


@@ -5,7 +5,7 @@ import { ROUTES } from './routes';
export const FORK_TREE_DEPTH_PADDING = 8;
export const SYSTEM_MESSAGE_PLACEHOLDER = 'System message';
export const APP_NAME = import.meta.env.VITE_PUBLIC_APP_NAME || 'llama.cpp';
export const APP_NAME = import.meta.env.VITE_PUBLIC_APP_NAME || 'llama-ui';
export const ICON_STRIP_TRANSITION_DURATION = 150;
export const ICON_STRIP_TRANSITION_DELAY_MULTIPLIER = 50;


@@ -17,7 +17,6 @@ export class AutoScrollController {
private _userScrolledUp = $state(false);
private _lastScrollTop = $state(0);
private _scrollInterval: ReturnType<typeof setInterval> | undefined;
private _scrollTimeout: ReturnType<typeof setTimeout> | undefined;
private _container: HTMLElement | undefined;
private _disabled: boolean;
private _mutationObserver: MutationObserver | null = null;
@@ -51,6 +50,7 @@ export class AutoScrollController {
* Updates the disabled state.
*/
setDisabled(disabled: boolean): void {
if (this._disabled === disabled) return;
this._disabled = disabled;
if (disabled) {
this._autoScrollEnabled = false;
@@ -80,17 +80,6 @@ export class AutoScrollController {
this._autoScrollEnabled = true;
}
if (this._scrollTimeout) {
clearTimeout(this._scrollTimeout);
}
this._scrollTimeout = setTimeout(() => {
if (isAtBottom) {
this._userScrolledUp = false;
this._autoScrollEnabled = true;
}
}, AUTO_SCROLL_INTERVAL);
this._lastScrollTop = scrollTop;
}
@@ -157,11 +146,6 @@ export class AutoScrollController {
destroy(): void {
this.stopInterval();
this._doStopObserving();
if (this._scrollTimeout) {
clearTimeout(this._scrollTimeout);
this._scrollTimeout = undefined;
}
}
/**


@@ -76,7 +76,7 @@ export class ChatService {
*/
/**
* Sends a chat completion request to the llama.cpp server.
* Sends a chat completion request to the llama-server.
* Supports both streaming and non-streaming responses with comprehensive parameter configuration.
* Automatically converts database messages with attachments to the appropriate API format.
*


@@ -3,7 +3,7 @@
import { page } from '$app/state';
import { afterNavigate } from '$app/navigation';
import { DialogModelNotAvailable } from '$lib/components/app';
import { ROUTES } from '$lib/constants/routes';
import { APP_NAME, ROUTES } from '$lib/constants';
import { chatStore, isLoading } from '$lib/stores/chat.svelte';
import { conversationsStore, activeConversation } from '$lib/stores/conversations.svelte';
import { modelsStore, modelOptions } from '$lib/stores/models.svelte';
@@ -125,7 +125,7 @@
</script>
<svelte:head>
<title>{activeConversation()?.name || 'Chat'} - llama.cpp</title>
<title>{activeConversation()?.name || 'Chat'} - {APP_NAME}</title>
</svelte:head>
<DialogModelNotAvailable


@@ -27,7 +27,7 @@ const config = {
$styles: 'src/styles'
},
version: {
name: 'llama-server-webui'
name: 'llama-ui'
}
},


@@ -4,11 +4,11 @@ import { Meta } from '@storybook/addon-docs/blocks';
# llama.cpp Web UI
Welcome to the **llama.cpp Web UI** component library! This Storybook showcases the components used in the modern web interface for the llama.cpp server.
Welcome to the **llama-ui** component library! This Storybook showcases the components used in the modern web interface for the llama-server.
## 🚀 About This Project
WebUI is a modern web interface for the llama.cpp server, built with SvelteKit and ShadCN UI. Features include:
WebUI is a modern web interface for the llama-server, built with SvelteKit and ShadCN UI. Features include:
- **Real-time chat conversations** with AI assistants
- **Multi-conversation management** with persistent storage


@@ -23,11 +23,6 @@ export default defineConfig({
minify: true
},
esbuild: {
lineLimit: 500,
minifyIdentifiers: false
},
css: {
preprocessorOptions: {
scss: {


@@ -604,8 +604,8 @@ int main(int argc, char ** argv) {
}
LOG_INF("sampler seed: %u\n", common_sampler_get_seed(smpl[0]));
LOG_INF("sampler params: \n%s\n", params.sampling.print().c_str());
LOG_INF("sampler chain: %s\n", common_sampler_print(smpl[0]).c_str());
LOG_INF("sampler params: \n%s\n", params.sampling.print().c_str());
LOG_INF("%s: loading done\n", __func__);