Compare commits

...

48 Commits
b9283 ... b9331

Author SHA1 Message Date
Georgi Gerganov
302e2c2652 ci : reduce PR jobs by matching backend paths (#23675)
* ci : disable SYCL f16 builds

* ci : extract android and hip into separate workflows

* ci : move webgpu to separate workflow

* ci : move the rpc to a separate workflow

* ci : extract s309x and ppcl jobs

* ci : extract opencl job into a separate workflow
2026-05-25 20:54:54 +03:00
Pascal
328874d054 model: tag ffn_latent as MUL_MAT to fix buft probe (#23664)
ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but
nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks
the backend about the declared op, so it tested an elementwise MUL on a
q8_0 weight. That used to return true unconditionally and the weight
stayed on GPU by luck. Once supports_op told the truth, the probe got a
no and the loader pushed the weight and its matmul to CPU, splitting the
graph. Tagging it MUL_MAT asks the real question, the math is unchanged.

Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.
2026-05-25 16:05:04 +02:00
Aman Gupta
c1f1e28d29 CUDA: add fast walsh-hadamard transform (#23615)
* CUDA: add fast walsh-hadamard transform

* review: add unrolls + change size_t -> int

* warp size 64

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-25 21:12:10 +08:00
Pascal
5a4126adc1 ui: fix stop/continue during an agentic loop (#23356) 2026-05-25 14:18:59 +02:00
Michael Wand
a4d2d4ae41 convert : add compressed-tensors NVFP4 support (#21095)
* Refactored Compressed Tensors NVFP4 support for new base.py

* Support compressed-tensors NVFP4 conversion

* Moved Qwen MTP remap into filter_tensors

* simplify

* pathlib no longer used

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-25 14:16:11 +02:00
Georgi Gerganov
d161ea7071 sync : ggml 2026-05-25 12:43:27 +03:00
Georgi Gerganov
45158f460e ggml : bump version to 0.13.0 (ggml/1510) 2026-05-25 12:43:27 +03:00
Georgi Gerganov
22307b3e8b sync : ggml 2026-05-25 12:38:01 +03:00
Georgi Gerganov
ce5890b5f7 ggml : bump version to 0.12.1 (ggml/1508) 2026-05-25 12:38:01 +03:00
Ori Pekelman
b251f74f49 ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500) 2026-05-25 12:38:01 +03:00
Dev-X25874
fa97041524 ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (ggml/1492) 2026-05-25 12:38:01 +03:00
Johannes Gäßler
ae251b5ff2 TP: fix ggml context size calculation (#22616)
* TP: fix ggml context size calculation, memory leak

* move split state cache back into the context

* revert to constant ggml context size for cgraphs

* increase headroom for statically allocated tensors

* remove obsolete include
2026-05-25 12:37:25 +03:00
Gilad S.
66efd13375 ggml: gguf_init_from_callback and gguf_init_from_buffer (#22341)
* ggml: implement `gguf_init_from_buffer`

* test: `gguf_init_from_buffer`

* fix: memory breakdown for a model loaded with `no_alloc` from a file is consistent with being loaded from a buffer

* fix: use `GGML_UNUSED`

Co-authored-by: Copilot <copilot@github.com>

* fix: remove `total_size` from `gguf_reader`

* fix: file offset calculation, rename `offset` to `data_offset`

Co-authored-by: Copilot <copilot@github.com>

* refactor: extract model loader bug fixes to another PR

* feat: add `gguf_init_from_callback`

* fix: always require a max expected size

* fix: change `gguf_reader_callback_t`'s `output` type to `void *`, change `max_expected_size` and offsets to `uint64_t`

* fix: harden against offset overflow in buffer read

* fix: remove seek behavior from the callback

* feat: `max_chunk_read == 0` means `SIZE_MAX`

* fix: seeking in a gguf file with no tensors

---------

Co-authored-by: Copilot <copilot@github.com>
2026-05-25 11:33:29 +02:00
Aman Gupta
6c4cbdc70b server: MTP layer kv-cache should respect draft type ctk (#23646) 2026-05-25 16:46:23 +08:00
alex-spacemit
5fdf07e33b ci : update spacemit toolchain url and enhance curl command (#23642)
* fix(action): update SpacemiT toolchain URL and version

Change-Id: If4cc1c738a855274103f8c3ad52daa33528acd0c

* fix(action): add -L flag to curl command for URL redirection

Change-Id: I9b6c37390f0c7a733a36308c8fb53d22d234ab06
2026-05-25 10:43:24 +02:00
Sigbjørn Skjæret
062d3115aa ci : fix pre-tokenizer-hashes check (#23651) 2026-05-25 10:41:25 +02:00
Tim Neumann
314e729347 llama : document that only one on-device state can be saved per sequence (#23520) 2026-05-25 10:29:28 +03:00
Aldehir Rojas
d55fb97174 ci : install host compiler on android-ndk build (#23630) 2026-05-25 10:18:08 +03:00
Jeff Bolz
826539ce59 ggml : Parallelize quant LUT init (#23595)
- Use OpenMP to parallelize iq2xs_init_impl and iq3xs_init_impl.
- Move the OpenMP detection from ggml-cpu to ggml-base.
- Update OpenMP dependencies in ggml-config.cmake.in.
2026-05-25 10:15:46 +03:00
Saba Fallah
b96487645c ui: media attachments before text (#23467)
* ui: media attachments before text

* fix prettier formatting
2026-05-25 08:50:41 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
9627d0f540 vendor : update cpp-httplib to 0.45.1 (#23639) 2026-05-25 09:45:22 +03:00
jacekpoplawski
e2ef8fe42c server: fix checkpoints creation (#22929)
* common : add common_chat_split_by_role

* cont : fix spans to reach end of message

* server: fix checkpoints creation

- extract message_spans from chat templates
- find the prompt token position before the latest user message
- split prompt batching at that position
- create a context checkpoint before the latest user input
- avoid periodic mid-prompt checkpoints when that position is known
- handle multimodal prompts when mapping text/template positions to server prompt tokens
- add --checkpoint-min-step to control minimum spacing between checkpoints

* cont : clean-up

* Support autoparser detection for message barriers

* server: fix message span delimiter and update docs

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
2026-05-25 08:56:18 +03:00
fairydreaming
6d57c26ef8 perplexity : fix even more integer overflows (#23623)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-25 08:12:39 +03:00
Georgi Gerganov
28123a3937 ci : move most slim jobs to self-hosted runners (#23619)
* ci : remove tag from build-self-hosted.yml

* ci : slim -> self-hosted

* ci : prevent heavy CPU jobs from running on fast runners

* ci : prevent cmake pkg to run on dedicated fast runners

* ci : try to bump 3.11 -> 3.13

* ci : move lint back to 3.11

* ci : back to 3.11

* ci : add comment about UI jobs

* ci : move python requirements check to CPU runners

this job is a bit slow for a dedicated "fast" runner

* ci : add self-hosted ui workflow

* ci : fix UI naming

* tmp to check if arm64 fast is compatible with all jobs

* revert last commit
2026-05-25 08:11:19 +03:00
Georgi Gerganov
549b9d8433 ci : update build-self-hosted.yml (#23616) 2026-05-24 18:20:10 +03:00
Sigbjørn Skjæret
5d246a792d convert : minor fixes for numpy 2.x (#23571) 2026-05-24 09:51:31 +02:00
Aldehir Rojas
63248fc3e3 cmake : fix ui build (#23592)
* cmake/ui : add -fPIC to llama-ui static lib

* cmake : rename host compiled embed helper
2026-05-24 02:37:28 -05:00
Aman Gupta
83eebe9d08 server: add margin for draft model for fit (#23485) 2026-05-24 14:43:08 +08:00
Johannes Gäßler
fff63b5108 TP: fix entirely zero-sized slices per device (#23525) 2026-05-24 08:19:33 +02:00
shaofeiqi
f3061116ff opencl: batch profiling to improve speed and prevent memory leaks (#23495) 2026-05-23 23:11:43 -07:00
Yiwei Shao
1c0f6db545 hexagon: apply repl optimization in flash attn softmax as #22993 (#23455) 2026-05-23 19:56:59 -07:00
Aparna M P
cec51c7a7d snapdragon: update windows toolchain to use hsdk v6.6.0.0 (#23552) 2026-05-23 19:56:41 -07:00
Aldehir Rojas
b22ff4b7b4 cmake/ui : refactor the build (#23352) 2026-05-23 17:08:22 -04:00
Aditya Singh
c0c7e147e7 requirements : bump torch to 2.11.0 (#23503)
* requirements: relax torch~=2.6.0 to torch>=2.6.0 for convert_hf_to_gguf

The ~=2.6.0 operator resolves to >=2.6.0, <2.7.0, which fails on
PyPI for platform/CPython combinations where 2.6.x is not present.
The accompanying comment already says 'PyTorch 2.6.0 or later', so
the looser >=2.6.0 matches the documented intent and unblocks
pip install -r requirements/requirements-convert_hf_to_gguf.txt.

Fixes #23408

* requirements: bump torch floor to 2.11.0 per maintainer

* requirements: pin torch to ==2.11.0 per project policy

* requirements: pin mtmd torch and torchvision to 2.11.0/0.26.0 per project policy

* requirements: suppress check_requirements pin warning on mtmd

The check_requirements script flags '==' on lines in files matched by
*/**/requirements*.txt. Append the documented suppression comment to the
pinned torch and torchvision lines (and to the s390x platform marker lines)
so the check passes while keeping the pins required by project policy.

* ty: silence Tensor/Module union check on model[0].auto_model

With torch 2.11.0 stubs, nn.Sequential.__getitem__ now returns
Tensor | Module rather than Module, so model[0].auto_model fails ty
on the SentenceTransformer code path. The runtime behavior is
unchanged because SentenceTransformer always wraps a Module at
index 0. Adding a targeted unresolved-attribute ignore keeps the
type-check green without altering behavior. A follow-up issue
tracks typing the variable explicitly.
2026-05-23 18:24:39 +02:00
Michael Wand
b0df4c0cfd model : add NVFP4 MTP scale tensors (#23563)
* Add NVFP4 MTP scale tensors

* Link Qwen3.5 MTP tensors

* Aligned nullptr
2026-05-23 13:30:31 +02:00
dskwe
a497476330 ggml : Check the right iface method before using the fallback 2d get (#23514) 2026-05-23 12:49:24 +02:00
Jeff Bolz
95405ac65f vulkan: fix windows find_package of SPIRV-Headers (#23215)
* vulkan: fix windows find_package of SPIRV-Headers

* not windows-only
2026-05-23 09:44:46 +02:00
Shawn Gu
0f3cb3fc8b opencl: generalize Adreno MoE kernels on M (#23449) 2026-05-22 17:08:41 -07:00
Aldehir Rojas
1acee6bf89 server: only parse empty msg if continuing an assistant msg (#23506) 2026-05-22 11:58:15 -04:00
fairydreaming
ef570f6308 perplexity : fix integer overflow (#23496)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-22 15:50:44 +03:00
Alexey Kopytko
cc9e331213 SYCL: improve MoE prefill throughput (#23142)
- change `k_copy_src1_to_contiguous` so that uses a precomputed contiguous mapping where all rows "owned" by an expert are in one slice with a know starts and ends
- switch the `O(n_as * n_routed_rows)` contraption to a counting sort-based procedure with `O(n_as + n_routed_rows)` complexity
2026-05-22 15:50:17 +03:00
Alexey Kopytko
bcfd1989e9 sycl : Level Zero detection in ggml_sycl_init (#23097)
* [SYCL] Centralize Level Zero detection in ggml_sycl_init

* use the same wording

* get back the warning
2026-05-22 15:49:45 +03:00
karavayev
56f16f235c SYCL : gated_delta_net K>1 (#23174)
* sycl_gated_delta_net K>1

* editor_config
2026-05-22 15:48:56 +03:00
Katostrofik
8cc67efcd4 SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (#21580)
* SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup

BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.

Fixes #20478

* SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0

The qk=1 kernel (used for F16 and BF16) iterates with stride
2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When
ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last
warp iteration accesses elements at col >= ncols, producing NaN for the
final row and wrong values for interior rows.

Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] %
(2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16
launcher to match. Quantized types use block-structured kernels with
different access patterns and keep the existing DMMV_X check.

Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70.
Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 15:48:24 +03:00
Jesus Talavera
95feeab52e docs: Update documentation with Granite 4.0/4.1 (#23404) 2026-05-22 20:35:46 +08:00
Sachin Sharma
99d4026b11 ggml-zendnn : add Q8_0 quantization support (#23414)
* ggml-zendnn : add Q8_0 quantization support

* ggml-zendnn : sync with latest ZenDNN

* ggml-zendnn : address review comments for Q8_0
2026-05-22 13:16:55 +02:00
fairydreaming
9c92e96a64 cmake : build router app only during standalone builds (#23521)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-22 12:55:29 +03:00
Kashif Rasul
afcda09d15 vocab : fix HybridDNA tokenizer (#23466)
* vocab : mark hybriddna k-mers to avoid BPE token collisions

* improved loop

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-22 11:17:31 +02:00
137 changed files with 4234 additions and 1678 deletions

View File

@@ -15,6 +15,6 @@ runs:
id: setup
uses: ./.github/actions/unarchive-tar
with:
url: https://archive.spacemit.com/toolchain/spacemit-toolchain-linux-glibc-x86_64-v${{ inputs.version }}.tar.xz
url: https://github.com/spacemit-com/toolchain/releases/download/v${{ inputs.version }}/spacemit-toolchain-linux-glibc-x86_64-v${{ inputs.version }}.tar.xz
path: ${{ inputs.path }}
strip: 1

View File

@@ -24,4 +24,4 @@ runs:
run: |
mkdir -p ${{ inputs.path }}
cd ${{ inputs.path }}
curl --no-progress-meter ${{ inputs.url }} | tar -${{ inputs.type }}x --strip-components=${{ inputs.strip }}
curl --no-progress-meter -L ${{ inputs.url }} | tar -${{ inputs.type }}x --strip-components=${{ inputs.strip }}

View File

@@ -73,6 +73,11 @@ jobs:
fetch-depth: 0
lfs: false
- name: Dependencies
run: |
apt-get update
apt-get install -y build-essential
- name: Build
id: ndk_build
run: |
@@ -86,3 +91,53 @@ jobs:
with:
name: llama-cpp-android-arm64-cpu
path: pkg-adb/llama.cpp
android-arm64:
runs-on: ubuntu-latest
env:
NDK_VERSION: "29.0.14206865"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: android-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Set up JDK
uses: actions/setup-java@v5
with:
java-version: 17
distribution: temurin
- name: Setup Android SDK
uses: android-actions/setup-android@40fd30fb8d7440372e1316f5d1809ec01dcd3699 # v4.0.1
with:
log-accepted-android-sdk-licenses: false
- name: Install NDK
run: |
sdkmanager "ndk;${{ env.NDK_VERSION }}"
echo "ANDROID_NDK=${ANDROID_SDK_ROOT}/ndk/${{ env.NDK_VERSION }}" >> $GITHUB_ENV
- name: Build
id: cmake_build
run: |
cmake -B build \
-DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-28 \
-DLLAMA_FATAL_WARNINGS=ON \
-DGGML_BACKEND_DL=ON \
-DGGML_NATIVE=OFF \
-DGGML_CPU_ALL_VARIANTS=ON \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(nproc)

View File

@@ -5,17 +5,12 @@ on:
jobs:
linux:
runs-on: ubuntu-slim
runs-on: [self-hosted, Linux, CPU]
steps:
- uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Install dependencies
run: |
sudo apt update
sudo apt install -y build-essential tcl cmake
- name: Build
run: |
PREFIX="$(pwd)"/inst

View File

@@ -277,7 +277,7 @@ jobs:
env:
# Make sure this is in sync with build-cache.yml
SPACEMIT_IME_TOOLCHAIN_VERSION: "1.1.2"
SPACEMIT_IME_TOOLCHAIN_VERSION: "1.2.4"
steps:
- uses: actions/checkout@v6

167
.github/workflows/build-hip.yml vendored Normal file
View File

@@ -0,0 +1,167 @@
name: CI (hip)
on:
workflow_dispatch: # allows manual triggering
push:
branches:
- master
paths: [
'.github/workflows/build-hip.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp',
'**/*.cu',
'**/*.cuh'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build-hip.yml',
'ggml/src/ggml-cuda/**'
]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
jobs:
ubuntu-22-hip:
runs-on: ubuntu-22.04
container: rocm/dev-ubuntu-22.04:6.1.2
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install -y build-essential git cmake rocblas-dev hipblas-dev libssl-dev rocwmma-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-hip
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with native CMake HIP support
id: cmake_build
run: |
cmake -B build -S . \
-DCMAKE_HIP_COMPILER="$(hipconfig -l)/clang" \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DGPU_TARGETS="gfx1030" \
-DGGML_HIP=ON
cmake --build build --config Release -j $(nproc)
windows-latest-hip:
runs-on: windows-2022
env:
# Make sure this is in sync with build-cache.yml
HIPSDK_INSTALLER_VERSION: "26.Q1"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Grab rocWMMA package
id: grab_rocwmma
run: |
curl -o rocwmma.deb "https://repo.radeon.com/rocm/apt/7.2.1/pool/main/r/rocwmma-dev/rocwmma-dev_2.2.0.70201-81~24.04_amd64.deb"
7z x rocwmma.deb
7z x data.tar
- name: Use ROCm Installation Cache
uses: actions/cache@v5
id: cache-rocm
with:
path: C:\Program Files\AMD\ROCm
key: rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: Setup ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'
uses: ./.github/actions/windows-setup-rocm
with:
version: ${{ env.HIPSDK_INSTALLER_VERSION }}
- name: Verify ROCm
id: verify
run: |
# Find and test ROCm installation
$clangPath = Get-ChildItem 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Select-Object -First 1
if (-not $clangPath) {
Write-Error "ROCm installation not found"
exit 1
}
& $clangPath.FullName --version
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ${{ github.job }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
$env:HIP_PATH=$(Resolve-Path 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | split-path | split-path)
$env:CMAKE_PREFIX_PATH="${env:HIP_PATH}"
cmake -G "Unix Makefiles" -B build -S . `
-DCMAKE_C_COMPILER="${env:HIP_PATH}\bin\clang.exe" `
-DCMAKE_CXX_COMPILER="${env:HIP_PATH}\bin\clang++.exe" `
-DCMAKE_CXX_FLAGS="-I$($PWD.Path.Replace('\', '/'))/opt/rocm-7.2.1/include/" `
-DCMAKE_BUILD_TYPE=Release `
-DLLAMA_BUILD_BORINGSSL=ON `
-DROCM_DIR="${env:HIP_PATH}" `
-DGGML_HIP=ON `
-DGGML_HIP_ROCWMMA_FATTN=ON `
-DGPU_TARGETS="gfx1100" `
-DGGML_RPC=ON
cmake --build build -j ${env:NUMBER_OF_PROCESSORS}
ubuntu-22-musa:
runs-on: ubuntu-22.04
container: mthreads/musa:rc4.3.0-devel-ubuntu22.04-amd64
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
apt-get update
apt-get install -y build-essential git cmake libssl-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-musa
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with native CMake MUSA support
id: cmake_build
run: |
cmake -B build -S . \
-DGGML_MUSA=ON
time cmake --build build --config Release -j $(nproc)

150
.github/workflows/build-ibm.yml vendored Normal file
View File

@@ -0,0 +1,150 @@
name: CI (ibm)
on:
workflow_dispatch: # allows manual triggering
push:
branches:
- master
paths: [
'.github/workflows/build-ibm.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build-ibm.yml',
'ggml/src/ggml-cpu/**'
]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-s390x:
runs-on: ubuntu-24.04-s390x
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Build Dependencies
id: build_depends
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
python3 python3-pip python3-dev python3-wheel \
libjpeg-dev build-essential libssl-dev \
git-lfs
- name: Toolchain workaround (GCC 14)
run: |
sudo apt-get install -y gcc-14 g++-14
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: Python Dependencies
id: python_depends
run: |
export PIP_BREAK_SYSTEM_PACKAGES="1"
python3 -m pip install --upgrade pip setuptools
pip3 install ./gguf-py
- name: Swap Endianness
id: endianness
run: |
for f in models/*.gguf; do
echo YES | python3 gguf-py/gguf/scripts/gguf_convert_endian.py $f big
done
- name: Build
id: cmake_build
run: |
cmake -B build \
-DLLAMA_FATAL_WARNINGS=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
- name: Test llama2c (s390x)
id: llama2c_test_s390x
run: |
cd build
echo "Fetch llama2c big-endian model"
wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K-be.gguf
./bin/llama-completion -m stories260K-be.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
ubuntu-24-ppc64le:
runs-on: ubuntu-24.04-ppc64le
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Build Dependencies
id: build_depends
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
python3 python3-pip python3-dev python3-wheel \
libjpeg-dev build-essential libssl-dev \
git-lfs
- name: Toolchain workaround (GCC 14)
run: |
sudo apt-get install -y gcc-14 g++-14
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: Python Dependencies
id: python_depends
run: |
export PIP_BREAK_SYSTEM_PACKAGES="1"
python3 -m pip install --upgrade pip setuptools
pip3 install ./gguf-py
- name: Build
id: cmake_build
run: |
cmake -B build \
-DLLAMA_FATAL_WARNINGS=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
- name: Test llama2c conversion
id: llama2c_test
run: |
cd build
echo "Fetch tokenizer"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/tok512.bin
echo "Fetch llama2c model"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256

83
.github/workflows/build-opencl.yml vendored Normal file
View File

@@ -0,0 +1,83 @@
name: CI (opencl)
on:
workflow_dispatch: # allows manual triggering
push:
branches:
- master
paths: [
'.github/workflows/build-opencl.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp',
'**/*.cl'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build-opencl.yml',
'ggml/src/ggml-opencl/**'
]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
jobs:
windows-latest-opencl-adreno:
runs-on: windows-2025
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-llvm-arm64-opencl-adreno
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Install Ninja
id: install_ninja
run: |
choco install ninja
- name: Install OpenCL Headers and Libs
id: install_opencl
run: |
git clone https://github.com/KhronosGroup/OpenCL-Headers
cd OpenCL-Headers
cmake -B build `
-DBUILD_TESTING=OFF `
-DOPENCL_HEADERS_BUILD_TESTING=OFF `
-DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
-DCMAKE_INSTALL_PREFIX="$env:RUNNER_TEMP/opencl-arm64-release"
cmake --build build --target install
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cd OpenCL-ICD-Loader
cmake -B build-arm64-release `
-A arm64 `
-DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" `
-DCMAKE_INSTALL_PREFIX="$env:RUNNER_TEMP/opencl-arm64-release"
cmake --build build-arm64-release --target install --config release
- name: Build
id: cmake_build
run: |
cmake -S . -B build -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON -DLLAMA_BUILD_BORINGSSL=ON
cmake --build build --config Release -j ${env:NUMBER_OF_PROCESSORS}

View File

@@ -34,6 +34,76 @@ env:
LLAMA_LOG_TIMESTAMPS: 1
jobs:
ubuntu-cpu-riscv64-native:
runs-on: ubuntu-24.04-riscv
steps:
- name: Install dependencies
run: |
# Install necessary packages
sudo apt-get update
sudo apt-get install -y libssl-dev
# Set gcc-14 and g++-14 as the default compilers
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100
git lfs install
- name: Check environment
run: |
uname -a
gcc --version
g++ --version
ldd --version
cmake --version
rustc --version
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
with:
key: ubuntu-cpu-riscv64-native
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_TESTS=ON \
-DCMAKE_C_COMPILER_LAUNCHER=ccache \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DGGML_RPC=ON \
-DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc-14 \
-DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++-14
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
- name: Test llama2c conversion
id: llama2c_test
run: |
cd build
echo "Fetch tokenizer"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/tok512.bin
echo "Fetch llama2c model"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
ubuntu-riscv64-native-sanitizer:
runs-on: ubuntu-24.04-riscv

67
.github/workflows/build-rpc.yml vendored Normal file
View File

@@ -0,0 +1,67 @@
name: CI (rpc)
on:
workflow_dispatch: # allows manual triggering
push:
branches:
- master
paths: [
'.github/workflows/build-rpc.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build-rpc.yml',
'ggml/src/ggml-rpc/**'
]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
jobs:
ubuntu-latest-rpc:
runs-on: ubuntu-latest
continue-on-error: true
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential libssl-dev ninja-build
- name: Build
id: cmake_build
run: |
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose

View File

@@ -55,24 +55,7 @@ env:
LLAMA_LOG_TIMESTAMPS: 1
jobs:
determine-tag:
name: Determine tag name
runs-on: ubuntu-slim
outputs:
tag_name: ${{ steps.tag.outputs.name }}
steps:
- name: Clone
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
ggml-ci-nvidia-cuda:
needs: determine-tag
runs-on: [self-hosted, Linux, NVIDIA]
steps:
@@ -82,14 +65,11 @@ jobs:
- name: Test
id: ggml-ci
env:
HF_UI_VERSION: ${{ needs.determine-tag.outputs.tag_name }}
run: |
nvidia-smi
GG_BUILD_CUDA=1 bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp
GG_BUILD_CUDA=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-nvidia-vulkan-cm:
needs: determine-tag
runs-on: [self-hosted, Linux, NVIDIA]
steps:
@@ -99,14 +79,11 @@ jobs:
- name: Test
id: ggml-ci
env:
HF_UI_VERSION: ${{ needs.determine-tag.outputs.tag_name }}
run: |
vulkaninfo --summary
GG_BUILD_VULKAN=1 GGML_VK_DISABLE_COOPMAT2=1 bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp
GG_BUILD_VULKAN=1 GGML_VK_DISABLE_COOPMAT2=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-nvidia-vulkan-cm2:
needs: determine-tag
runs-on: [self-hosted, Linux, NVIDIA, COOPMAT2]
steps:
@@ -116,14 +93,12 @@ jobs:
- name: Test
id: ggml-ci
env:
HF_UI_VERSION: ${{ needs.determine-tag.outputs.tag_name }}
run: |
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-nvidia-webgpu:
runs-on: [self-hosted, Linux, NVIDIA]
runs-on: [self-hosted, Linux, NVIDIA, X64]
steps:
- name: Clone
@@ -149,7 +124,7 @@ jobs:
GG_BUILD_WEBGPU=1 \
GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
GG_BUILD_WEBGPU_DAWN_DIR="$GITHUB_WORKSPACE/dawn/lib64/cmake/Dawn" \
bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: provision AMX-compatible machine
#ggml-ci-cpu-amx:
@@ -163,7 +138,7 @@ jobs:
# - name: Test
# id: ggml-ci
# run: |
# bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp
# bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: provision AMD GPU machine
# ggml-ci-amd-vulkan:
@@ -178,7 +153,7 @@ jobs:
# id: ggml-ci
# run: |
# vulkaninfo --summary
# GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp
# GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: provision AMD GPU machine
# ggml-ci-amd-rocm:
@@ -193,10 +168,9 @@ jobs:
# id: ggml-ci
# run: |
# amd-smi static
# GG_BUILD_ROCM=1 GG_BUILD_AMDGPU_TARGETS="gfx1101" bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp
# GG_BUILD_ROCM=1 GG_BUILD_AMDGPU_TARGETS="gfx1101" bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-metal:
needs: determine-tag
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -206,13 +180,10 @@ jobs:
- name: Test
id: ggml-ci
env:
HF_UI_VERSION: ${{ needs.determine-tag.outputs.tag_name }}
run: |
GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-webgpu:
needs: determine-tag
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -235,14 +206,11 @@ jobs:
- name: Test
id: ggml-ci
env:
HF_UI_VERSION: ${{ needs.determine-tag.outputs.tag_name }}
run: |
GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-vulkan:
needs: determine-tag
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -252,14 +220,11 @@ jobs:
- name: Test
id: ggml-ci
env:
HF_UI_VERSION: ${{ needs.determine-tag.outputs.tag_name }}
run: |
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-linux-intel-vulkan:
needs: determine-tag
runs-on: [self-hosted, Linux, Intel]
steps:
@@ -271,14 +236,11 @@ jobs:
- name: Test
id: ggml-ci
env:
HF_UI_VERSION: ${{ needs.determine-tag.outputs.tag_name }}
run: |
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-win-intel-vulkan:
needs: determine-tag
runs-on: [self-hosted, Windows, X64, Intel]
steps:
@@ -293,7 +255,6 @@ jobs:
MSYSTEM: UCRT64
CHERE_INVOKING: 1
PATH: C:\msys64\ucrt64\bin;C:\msys64\usr\bin;C:\Windows\System32;${{ env.PATH }}
HF_UI_VERSION: ${{ needs.determine-tag.outputs.tag_name }}
run: |
vulkaninfo --summary
# Skip python related tests with GG_BUILD_LOW_PERF=1 since Windows MSYS2 UCRT64 currently fails to create
@@ -301,7 +262,6 @@ jobs:
LLAMA_FATAL_WARNINGS=OFF GG_BUILD_NINJA=1 GG_BUILD_VULKAN=1 GG_BUILD_LOW_PERF=1 ./ci/run.sh ./results/llama.cpp ./mnt/llama.cpp
ggml-ci-intel-openvino-gpu-low-perf:
needs: determine-tag
runs-on: [self-hosted, Linux, Intel, OpenVINO]
concurrency:
@@ -333,8 +293,64 @@ jobs:
- name: Test
id: ggml-ci
env:
HF_UI_VERSION: ${{ needs.determine-tag.outputs.tag_name }}
run: |
source ./openvino_toolkit/setupvars.sh
GG_BUILD_OPENVINO=1 GGML_OPENVINO_DEVICE=GPU GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
GG_BUILD_OPENVINO=1 GGML_OPENVINO_DEVICE=GPU GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-arm64-cpu-low-perf:
runs-on: [self-hosted, Linux, ARM64, CPU]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-arm64-cpu-high-perf:
runs-on: [self-hosted, Linux, ARM64, CPU]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_SVE=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: not sure how to detect ARM flags on DGX Spark. currently get this error during cmake:
# CMake Warning at ggml/src/ggml-cpu/CMakeLists.txt:147 (message):
# ARM -march/-mcpu not found, -mcpu=native will be used
#
# if we resolve this, we should be able to offload these jobs to the self-hosted runners
#
# ggml-ci-arm64-cpu-high-perf-sve:
# runs-on: [self-hosted, Linux, ARM64, CPU]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# LLAMA_ARG_THREADS=$(nproc) GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
#
# ggml-ci-arm64-cpu-kleidiai:
# runs-on: [self-hosted, Linux, ARM64, CPU]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# GG_BUILD_KLEIDIAI=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp

View File

@@ -38,12 +38,10 @@ jobs:
ubuntu-24-sycl:
strategy:
matrix:
build: [fp32, fp16]
build: [fp32]
include:
- build: fp32
fp16: OFF
- build: fp16
fp16: ON
runs-on: ubuntu-24.04

186
.github/workflows/build-webgpu.yml vendored Normal file
View File

@@ -0,0 +1,186 @@
name: CI (webgpu)
on:
workflow_dispatch: # allows manual triggering
push:
branches:
- master
paths: [
'.github/workflows/build-webgpu.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp',
'**/*.wgsl'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build-webgpu.yml',
'ggml/src/ggml-webgpu/**'
]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
jobs:
macOS-latest-arm64-webgpu:
runs-on: macos-latest
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64-webgpu
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dawn Dependency
id: dawn-depends
run: |
DAWN_VERSION="v20260317.182325"
DAWN_OWNER="google"
DAWN_REPO="dawn"
DAWN_ASSET_NAME="Dawn-18eb229ef5f707c1464cc581252e7603c73a3ef0-macos-latest-Release"
echo "Fetching release asset from https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
curl -L -o artifact.tar.gz \
"https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
mkdir dawn
tar -xvf artifact.tar.gz -C dawn --strip-components=1
- name: Build
id: cmake_build
run: |
export CMAKE_PREFIX_PATH=dawn
cmake -B build -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DGGML_WEBGPU=ON -DGGML_METAL=OFF -DGGML_BLAS=OFF
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
ubuntu-24-webgpu:
runs-on: ubuntu-24.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-webgpu
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo add-apt-repository -y ppa:kisak/kisak-mesa
sudo apt-get update -y
sudo apt-get install -y build-essential mesa-vulkan-drivers \
libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libssl-dev
- name: Get latest Vulkan SDK version
id: vulkan_sdk_version
run: |
echo "VULKAN_SDK_VERSION=$(curl https://vulkan.lunarg.com/sdk/latest/linux.txt)" >> "$GITHUB_ENV"
- name: Use Vulkan SDK Cache
uses: actions/cache@v5
id: cache-sdk
with:
path: ./vulkan_sdk
key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'
uses: ./.github/actions/linux-setup-vulkan
with:
path: ./vulkan_sdk
version: ${{ env.VULKAN_SDK_VERSION }}
- name: Dawn Dependency
id: dawn-depends
run: |
sudo apt-get install -y libxrandr-dev libxinerama-dev libxcursor-dev mesa-common-dev libx11-xcb-dev libxi-dev
DAWN_VERSION="v20260317.182325"
DAWN_OWNER="google"
DAWN_REPO="dawn"
DAWN_ASSET_NAME="Dawn-18eb229ef5f707c1464cc581252e7603c73a3ef0-ubuntu-latest-Release"
echo "Fetching release asset from https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
curl -L -o artifact.tar.gz \
"https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
mkdir dawn
tar -xvf artifact.tar.gz -C dawn --strip-components=1
- name: Build
id: cmake_build
run: |
export Dawn_DIR=dawn/lib64/cmake/Dawn
cmake -B build \
-DGGML_WEBGPU=ON
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
# This is using llvmpipe and runs slower than other backends
# test-backend-ops is too slow on llvmpipe, skip it
ctest -L main -E test-backend-ops --verbose --timeout 900
ubuntu-24-webgpu-wasm:
runs-on: ${{ 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Install Emscripten
run: |
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
- name: Fetch emdawnwebgpu
run: |
DAWN_TAG="v20260317.182325"
EMDAWN_PKG="emdawnwebgpu_pkg-${DAWN_TAG}.zip"
echo "Downloading ${EMDAWN_PKG}"
curl -L -o emdawn.zip \
"https://github.com/google/dawn/releases/download/${DAWN_TAG}/${EMDAWN_PKG}"
unzip emdawn.zip
- name: Build WASM WebGPU
run: |
source emsdk/emsdk_env.sh
emcmake cmake -B build-wasm \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_WEBGPU=ON \
-DLLAMA_OPENSSL=OFF \
-DEMDAWNWEBGPU_DIR=emdawnwebgpu_pkg
time cmake --build build-wasm --config Release --target test-backend-ops -j $(nproc)

View File

@@ -132,47 +132,6 @@ jobs:
cd build
ctest -L main --verbose --timeout 900
macOS-latest-arm64-webgpu:
runs-on: macos-latest
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64-webgpu
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dawn Dependency
id: dawn-depends
run: |
DAWN_VERSION="v20260317.182325"
DAWN_OWNER="google"
DAWN_REPO="dawn"
DAWN_ASSET_NAME="Dawn-18eb229ef5f707c1464cc581252e7603c73a3ef0-macos-latest-Release"
echo "Fetching release asset from https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
curl -L -o artifact.tar.gz \
"https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
mkdir dawn
tar -xvf artifact.tar.gz -C dawn --strip-components=1
- name: Build
id: cmake_build
run: |
export CMAKE_PREFIX_PATH=dawn
cmake -B build -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DGGML_WEBGPU=ON -DGGML_METAL=OFF -DGGML_BLAS=OFF
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
ubuntu-cpu:
strategy:
matrix:
@@ -181,10 +140,6 @@ jobs:
os: ubuntu-22.04
- build: 'arm64'
os: ubuntu-24.04-arm
- build: 's390x'
os: ubuntu-24.04-s390x
- build: 'ppc64le'
os: ubuntu-24.04-ppc64le
runs-on: ${{ matrix.os }}
@@ -194,7 +149,6 @@ jobs:
uses: actions/checkout@v6
- name: ccache
if: ${{ matrix.build != 's390x' && matrix.build != 'ppc64le' }}
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-cpu-${{ matrix.build }}
@@ -224,14 +178,6 @@ jobs:
python3 -m pip install --upgrade pip setuptools
pip3 install ./gguf-py
- name: Swap Endianness
id: endianness
if: ${{ matrix.build == 's390x' }}
run: |
for f in models/*.gguf; do
echo YES | python3 gguf-py/gguf/scripts/gguf_convert_endian.py $f big
done
- name: Build
id: cmake_build
run: |
@@ -248,7 +194,6 @@ jobs:
- name: Test llama2c conversion
id: llama2c_test
if: ${{ matrix.build != 's390x' }}
run: |
cd build
echo "Fetch tokenizer"
@@ -258,96 +203,6 @@ jobs:
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
- name: Test llama2c (s390x)
id: llama2c_test_s390x
if: ${{ matrix.build == 's390x' }}
run: |
cd build
echo "Fetch llama2c big-endian model"
wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K-be.gguf
./bin/llama-completion -m stories260K-be.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
android-arm64:
runs-on: ubuntu-latest
env:
NDK_VERSION: "29.0.14206865"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: android-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Set up JDK
uses: actions/setup-java@v5
with:
java-version: 17
distribution: temurin
- name: Setup Android SDK
uses: android-actions/setup-android@40fd30fb8d7440372e1316f5d1809ec01dcd3699 # v4.0.1
with:
log-accepted-android-sdk-licenses: false
- name: Install NDK
run: |
sdkmanager "ndk;${{ env.NDK_VERSION }}"
echo "ANDROID_NDK=${ANDROID_SDK_ROOT}/ndk/${{ env.NDK_VERSION }}" >> $GITHUB_ENV
- name: Build
id: cmake_build
run: |
cmake -B build \
-DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-28 \
-DLLAMA_FATAL_WARNINGS=ON \
-DGGML_BACKEND_DL=ON \
-DGGML_NATIVE=OFF \
-DGGML_CPU_ALL_VARIANTS=ON \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(nproc)
ubuntu-latest-rpc:
runs-on: ubuntu-latest
continue-on-error: true
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential libssl-dev ninja-build
- name: Build
id: cmake_build
run: |
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose
ubuntu-24-vulkan:
strategy:
matrix:
@@ -387,176 +242,6 @@ jobs:
run: |
time cmake --build build -j $(nproc)
ubuntu-24-webgpu:
runs-on: ubuntu-24.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-webgpu
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo add-apt-repository -y ppa:kisak/kisak-mesa
sudo apt-get update -y
sudo apt-get install -y build-essential mesa-vulkan-drivers \
libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libssl-dev
- name: Get latest Vulkan SDK version
id: vulkan_sdk_version
run: |
echo "VULKAN_SDK_VERSION=$(curl https://vulkan.lunarg.com/sdk/latest/linux.txt)" >> "$GITHUB_ENV"
- name: Use Vulkan SDK Cache
uses: actions/cache@v5
id: cache-sdk
with:
path: ./vulkan_sdk
key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'
uses: ./.github/actions/linux-setup-vulkan
with:
path: ./vulkan_sdk
version: ${{ env.VULKAN_SDK_VERSION }}
- name: Dawn Dependency
id: dawn-depends
run: |
sudo apt-get install -y libxrandr-dev libxinerama-dev libxcursor-dev mesa-common-dev libx11-xcb-dev libxi-dev
DAWN_VERSION="v20260317.182325"
DAWN_OWNER="google"
DAWN_REPO="dawn"
DAWN_ASSET_NAME="Dawn-18eb229ef5f707c1464cc581252e7603c73a3ef0-ubuntu-latest-Release"
echo "Fetching release asset from https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
curl -L -o artifact.tar.gz \
"https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
mkdir dawn
tar -xvf artifact.tar.gz -C dawn --strip-components=1
- name: Build
id: cmake_build
run: |
export Dawn_DIR=dawn/lib64/cmake/Dawn
cmake -B build \
-DGGML_WEBGPU=ON
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
# This is using llvmpipe and runs slower than other backends
# test-backend-ops is too slow on llvmpipe, skip it
ctest -L main -E test-backend-ops --verbose --timeout 900
ubuntu-24-webgpu-wasm:
runs-on: ${{ 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Install Emscripten
run: |
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
- name: Fetch emdawnwebgpu
run: |
DAWN_TAG="v20260317.182325"
EMDAWN_PKG="emdawnwebgpu_pkg-${DAWN_TAG}.zip"
echo "Downloading ${EMDAWN_PKG}"
curl -L -o emdawn.zip \
"https://github.com/google/dawn/releases/download/${DAWN_TAG}/${EMDAWN_PKG}"
unzip emdawn.zip
- name: Build WASM WebGPU
run: |
source emsdk/emsdk_env.sh
emcmake cmake -B build-wasm \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_WEBGPU=ON \
-DLLAMA_OPENSSL=OFF \
-DEMDAWNWEBGPU_DIR=emdawnwebgpu_pkg
time cmake --build build-wasm --config Release --target test-backend-ops -j $(nproc)
ubuntu-22-hip:
runs-on: ubuntu-22.04
container: rocm/dev-ubuntu-22.04:6.1.2
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install -y build-essential git cmake rocblas-dev hipblas-dev libssl-dev rocwmma-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-hip
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with native CMake HIP support
id: cmake_build
run: |
cmake -B build -S . \
-DCMAKE_HIP_COMPILER="$(hipconfig -l)/clang" \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DGPU_TARGETS="gfx1030" \
-DGGML_HIP=ON
cmake --build build --config Release -j $(nproc)
ubuntu-22-musa:
runs-on: ubuntu-22.04
container: mthreads/musa:rc4.3.0-devel-ubuntu22.04-amd64
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
apt-get update
apt-get install -y build-essential git cmake libssl-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-musa
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with native CMake MUSA support
id: cmake_build
run: |
cmake -B build -S . \
-DGGML_MUSA=ON
time cmake --build build --config Release -j $(nproc)
windows-latest:
runs-on: windows-2025
@@ -580,9 +265,6 @@ jobs:
- build: 'llvm-arm64'
arch: 'arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
- build: 'llvm-arm64-opencl-adreno'
arch: 'arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON'
steps:
- name: Clone
@@ -624,26 +306,6 @@ jobs:
run: |
choco install ninja
- name: Install OpenCL Headers and Libs
id: install_opencl
if: ${{ matrix.build == 'llvm-arm64-opencl-adreno' }}
run: |
git clone https://github.com/KhronosGroup/OpenCL-Headers
cd OpenCL-Headers
cmake -B build `
-DBUILD_TESTING=OFF `
-DOPENCL_HEADERS_BUILD_TESTING=OFF `
-DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
-DCMAKE_INSTALL_PREFIX="$env:RUNNER_TEMP/opencl-arm64-release"
cmake --build build --target install
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cd OpenCL-ICD-Loader
cmake -B build-arm64-release `
-A arm64 `
-DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" `
-DCMAKE_INSTALL_PREFIX="$env:RUNNER_TEMP/opencl-arm64-release"
cmake --build build-arm64-release --target install --config release
- name: Build
id: cmake_build
run: |
@@ -764,145 +426,6 @@ jobs:
cmake --build build --config Release -j %NINJA_JOBS% -t ggml
cmake --build build --config Release
windows-latest-hip:
runs-on: windows-2022
env:
# Make sure this is in sync with build-cache.yml
HIPSDK_INSTALLER_VERSION: "26.Q1"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Grab rocWMMA package
id: grab_rocwmma
run: |
curl -o rocwmma.deb "https://repo.radeon.com/rocm/apt/7.2.1/pool/main/r/rocwmma-dev/rocwmma-dev_2.2.0.70201-81~24.04_amd64.deb"
7z x rocwmma.deb
7z x data.tar
- name: Use ROCm Installation Cache
uses: actions/cache@v5
id: cache-rocm
with:
path: C:\Program Files\AMD\ROCm
key: rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: Setup ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'
uses: ./.github/actions/windows-setup-rocm
with:
version: ${{ env.HIPSDK_INSTALLER_VERSION }}
- name: Verify ROCm
id: verify
run: |
# Find and test ROCm installation
$clangPath = Get-ChildItem 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Select-Object -First 1
if (-not $clangPath) {
Write-Error "ROCm installation not found"
exit 1
}
& $clangPath.FullName --version
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ${{ github.job }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
$env:HIP_PATH=$(Resolve-Path 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | split-path | split-path)
$env:CMAKE_PREFIX_PATH="${env:HIP_PATH}"
cmake -G "Unix Makefiles" -B build -S . `
-DCMAKE_C_COMPILER="${env:HIP_PATH}\bin\clang.exe" `
-DCMAKE_CXX_COMPILER="${env:HIP_PATH}\bin\clang++.exe" `
-DCMAKE_CXX_FLAGS="-I$($PWD.Path.Replace('\', '/'))/opt/rocm-7.2.1/include/" `
-DCMAKE_BUILD_TYPE=Release `
-DLLAMA_BUILD_BORINGSSL=ON `
-DROCM_DIR="${env:HIP_PATH}" `
-DGGML_HIP=ON `
-DGGML_HIP_ROCWMMA_FATTN=ON `
-DGPU_TARGETS="gfx1100" `
-DGGML_RPC=ON
cmake --build build -j ${env:NUMBER_OF_PROCESSORS}
ubuntu-cpu-riscv64-native:
runs-on: ubuntu-24.04-riscv
steps:
- name: Install dependencies
run: |
# Install necessary packages
sudo apt-get update
sudo apt-get install -y libssl-dev
# Set gcc-14 and g++-14 as the default compilers
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100
git lfs install
- name: Check environment
run: |
uname -a
gcc --version
g++ --version
ldd --version
cmake --version
rustc --version
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
with:
key: ubuntu-cpu-riscv64-native
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_TESTS=ON \
-DCMAKE_C_COMPILER_LAUNCHER=ccache \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DGGML_RPC=ON \
-DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc-14 \
-DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++-14
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
- name: Test llama2c conversion
id: llama2c_test
run: |
cd build
echo "Fetch tokenizer"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/tok512.bin
echo "Fetch llama2c model"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
# TODO: simplify the following workflows using a matrix
# TODO: run lighter CI on PRs and the full CI only on master (if needed)
ggml-ci-x64-cpu-low-perf:
@@ -931,31 +454,32 @@ jobs:
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-arm64-cpu-low-perf:
runs-on: ubuntu-22.04-arm
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-low-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
# note: moved to build-self-hosted.yml - can remove from here when everything is stable
# ggml-ci-arm64-cpu-low-perf:
# runs-on: ubuntu-22.04-arm
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ggml-ci-arm64-cpu-low-perf
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# - name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get install build-essential
#
# - name: Test
# id: ggml-ci
# run: |
# LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-x64-cpu-high-perf:
runs-on: ubuntu-22.04
@@ -983,31 +507,32 @@ jobs:
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-arm64-cpu-high-perf:
runs-on: ubuntu-22.04-arm
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-high-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_SVE=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
# note: moved to build-self-hosted.yml - can remove from here when everything is stable
# ggml-ci-arm64-cpu-high-perf:
# runs-on: ubuntu-22.04-arm
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ggml-ci-arm64-cpu-high-perf
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# - name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get install build-essential
#
# - name: Test
# id: ggml-ci
# run: |
# LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_SVE=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-arm64-cpu-high-perf-sve:
runs-on: ubuntu-22.04-arm

View File

@@ -19,7 +19,7 @@ on:
jobs:
check-vendor:
runs-on: ubuntu-slim
runs-on: [self-hosted, fast]
steps:
- name: Checkout

View File

@@ -15,7 +15,7 @@ concurrency:
jobs:
model-naming:
runs-on: ubuntu-slim
runs-on: [self-hosted, fast]
steps:
- uses: actions/checkout@v6
- name: Check model naming conventions

View File

@@ -15,7 +15,7 @@ concurrency:
jobs:
editorconfig:
runs-on: ubuntu-slim
runs-on: [self-hosted, fast]
steps:
- uses: actions/checkout@v6
- uses: editorconfig-checker/action-editorconfig-checker@840e866d93b8e032123c23bac69dece044d4d84c # v2.2.0

View File

@@ -3,16 +3,16 @@ name: Check Pre-Tokenizer Hashes
on:
push:
paths:
- 'convert_hf_to_gguf.py'
- 'conversion/base.py'
- 'convert_hf_to_gguf_update.py'
pull_request:
paths:
- 'convert_hf_to_gguf.py'
- 'conversion/base.py'
- 'convert_hf_to_gguf_update.py'
jobs:
pre-tokenizer-hashes:
runs-on: ubuntu-slim
runs-on: [self-hosted, fast]
steps:
- name: Checkout repository
@@ -30,16 +30,16 @@ jobs:
- name: Update pre-tokenizer hashes
run: |
cp convert_hf_to_gguf.py /tmp
cp conversion/base.py /tmp
.venv/bin/python convert_hf_to_gguf_update.py --check-missing
- name: Check if committed pre-tokenizer hashes matches generated version
run: |
if ! diff -q convert_hf_to_gguf.py /tmp/convert_hf_to_gguf.py; then
echo "Model pre-tokenizer hashes (in convert_hf_to_gguf.py) do not match generated hashes (from convert_hf_to_gguf_update.py)."
echo "To fix: run ./convert_hf_to_gguf_update.py and commit the updated convert_hf_to_gguf.py along with your changes"
if ! diff -q conversion/base.py /tmp/base.py; then
echo "Model pre-tokenizer hashes (in conversion/base.py) do not match generated hashes (from convert_hf_to_gguf_update.py)."
echo "To fix: run ./convert_hf_to_gguf_update.py and commit the updated conversion/base.py along with your changes"
echo "Differences found:"
diff convert_hf_to_gguf.py /tmp/convert_hf_to_gguf.py || true
diff conversion/base.py /tmp/base.py || true
exit 1
fi
echo "Model pre-tokenizer hashes are up to date."

View File

@@ -20,7 +20,7 @@ concurrency:
jobs:
python-check-requirements:
runs-on: ubuntu-slim
runs-on: [self-hosted, CPU, fast]
name: check-requirements
steps:
- name: Check out source repository

View File

@@ -21,7 +21,7 @@ concurrency:
jobs:
flake8-lint:
runs-on: ubuntu-slim
runs-on: [self-hosted, fast]
name: Lint
steps:
- name: Check out source repository

View File

@@ -22,7 +22,7 @@ concurrency:
jobs:
python-type-check:
runs-on: ubuntu-slim
runs-on: [self-hosted, fast]
name: python type-check
steps:
- name: Check out source repository

View File

@@ -772,12 +772,10 @@ jobs:
strategy:
matrix:
build: [fp32, fp16]
build: [fp32]
include:
- build: fp32
fp16: OFF
- build: fp16
fp16: ON
runs-on: ubuntu-24.04
@@ -1234,6 +1232,9 @@ jobs:
path: llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz
name: llama-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz
ui-build:
uses: ./.github/workflows/ui-build.yml
release:
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
@@ -1259,6 +1260,7 @@ jobs:
- macOS-cpu
- ios-xcode-build
- openEuler-cann
- ui-build
outputs:
tag_name: ${{ steps.tag.outputs.name }}
@@ -1318,6 +1320,18 @@ jobs:
mv -v artifact/*.zip release
mv -v artifact/*.tar.gz release
- name: Download UI build
id: download_ui
uses: actions/download-artifact@v7
with:
name: ui-build
path: ./ui-dist
- name: Package UI
id: package_ui
run: |
tar -czvf release/llama-${{ steps.tag.outputs.name }}-ui.tar.gz --transform "s,^\.,llama-${{ steps.tag.outputs.name }}," -C ./ui-dist .
- name: Create release
id: create_release
uses: ggml-org/action-create-release@v1
@@ -1347,7 +1361,6 @@ jobs:
- [Ubuntu x64 (ROCm 7.2)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-rocm-7.2-x64.tar.gz)
- [Ubuntu x64 (OpenVINO)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-openvino-${{ needs.ubuntu-24-openvino.outputs.openvino_version }}-x64.tar.gz)
- [Ubuntu x64 (SYCL FP32)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-sycl-fp32-x64.tar.gz)
- [Ubuntu x64 (SYCL FP16)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-sycl-fp16-x64.tar.gz)
**Android:**
- [Android arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-android-arm64.tar.gz)
@@ -1367,6 +1380,9 @@ jobs:
- [openEuler aarch64 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-310p-openEuler-aarch64.tar.gz)
- [openEuler aarch64 (910b, ACL Graph)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-910b-openEuler-aarch64-aclgraph.tar.gz)
**UI:**
- [UI](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-ui.tar.gz)
- name: Upload release
id: upload_release
uses: actions/github-script@v8

View File

@@ -91,45 +91,44 @@ jobs:
export ${{ matrix.extra_args }}
pytest -v -x -m "not slow"
# TODO: provision CUDA runner
# server-cuda:
# runs-on: [self-hosted, llama-server, Linux, NVIDIA]
#
# name: server-cuda (${{ matrix.wf_name }})
# strategy:
# matrix:
# build_type: [Release]
# wf_name: ["GPUx1"]
# include:
# - build_type: Release
# extra_args: "LLAMA_ARG_BACKEND_SAMPLING=1"
# wf_name: "GPUx1, backend-sampling"
# fail-fast: false
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
# with:
# fetch-depth: 0
# ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
#
# - name: Build
# id: cmake_build
# run: |
# cmake -B build -DGGML_SCHED_NO_REALLOC=ON
# cmake --build build --config ${{ matrix.build_type }} -j $(sysctl -n hw.logicalcpu) --target llama-server
#
# - name: Tests
# id: server_integration_tests
# if: ${{ (!matrix.disabled_on_pr || !github.event.pull_request) }}
# run: |
# cd tools/server/tests
# python3 -m venv venv
# source venv/bin/activate
# pip install -r requirements.txt
# export ${{ matrix.extra_args }}
# pytest -v -x -m "not slow"
server-cuda:
runs-on: [self-hosted, llama-server, Linux, NVIDIA]
name: server-cuda (${{ matrix.wf_name }})
strategy:
matrix:
build_type: [Release]
wf_name: ["GPUx1"]
include:
- build_type: Release
extra_args: "LLAMA_ARG_BACKEND_SAMPLING=1"
wf_name: "GPUx1, backend-sampling"
fail-fast: false
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
with:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Build
id: cmake_build
run: |
cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_NO_REALLOC=ON
cmake --build build --config ${{ matrix.build_type }} -j $(nproc) --target llama-server
- name: Tests
id: server_integration_tests
if: ${{ (!matrix.disabled_on_pr || !github.event.pull_request) }}
run: |
cd tools/server/tests
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
export ${{ matrix.extra_args }}
pytest -v -x -m "not slow"
server-kleidiai:
runs-on: ah-ubuntu_22_04-c8g_8x

View File

@@ -54,8 +54,13 @@ concurrency:
cancel-in-progress: true
jobs:
ui-build:
name: Build Web UI
uses: ./.github/workflows/ui-build.yml
server:
runs-on: ubuntu-latest
needs: ui-build
name: server (${{ matrix.wf_name }})
strategy:
@@ -93,12 +98,11 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Setup Node.js
uses: actions/setup-node@v6
- name: Download built UI
uses: actions/download-artifact@v7
with:
node-version: "24"
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
name: ui-build
path: tools/ui/dist
- name: Build
id: cmake_build

View File

@@ -5,8 +5,7 @@ on:
jobs:
build:
name: Build static output
runs-on: ubuntu-slim
runs-on: [self-hosted, fast]
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
@@ -31,7 +30,7 @@ jobs:
- name: Generate checksums
run: |
cd build/tools/ui/dist
cd tools/ui/dist
for f in *; do
sha256sum "$f" | awk '{print $1, $2}' >> checksums.txt
done
@@ -40,5 +39,5 @@ jobs:
uses: actions/upload-artifact@v6
with:
name: ui-build
path: build/tools/ui/dist/
path: tools/ui/dist/
retention-days: 1

View File

@@ -38,7 +38,7 @@ jobs:
uses: actions/download-artifact@v7
with:
name: ui-build
path: build/tools/ui/dist/
path: tools/ui/dist/
- name: Install Hugging Face Hub CLI
run: pip install -U huggingface_hub
@@ -49,12 +49,12 @@ jobs:
- name: Sync built files to Hugging Face bucket (version tag)
run: |
# Upload the built files to the Hugging Face bucket under the release version
hf buckets sync build/tools/ui/dist hf://buckets/ggml-org/${{ env.HF_BUCKET_NAME }}/${{ inputs.version_tag }} --delete --quiet
hf buckets sync tools/ui/dist hf://buckets/ggml-org/${{ env.HF_BUCKET_NAME }}/${{ inputs.version_tag }} --delete --quiet
- name: Sync built files to Hugging Face bucket (latest)
run: |
# Also upload to the 'latest' directory for fallback downloads
hf buckets sync build/tools/ui/dist hf://buckets/ggml-org/${{ env.HF_BUCKET_NAME }}/latest --delete --quiet
hf buckets sync tools/ui/dist hf://buckets/ggml-org/${{ env.HF_BUCKET_NAME }}/latest --delete --quiet
- name: Verify upload
run: |

118
.github/workflows/ui-self-hosted.yml vendored Normal file
View File

@@ -0,0 +1,118 @@
name: UI (self-hosted)
# these are the same as ui.yml, but with self-hosted runners
# the runners come with pre-installed Playwright browsers version: 1.56.1
# the jobs are much lighter because they don't need to install node and playwright browsers
on:
workflow_dispatch:
inputs:
sha:
description: 'Commit SHA1 to build'
required: false
type: string
push:
branches:
- master
paths: [
'.github/workflows/ui-self-hosted.yml',
'.github/workflows/ui-build.yml',
'tools/ui/**.*',
'tools/server/tests/**.*'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/ui-self-hosted.yml',
'.github/workflows/ui-build.yml',
'tools/ui/**.*',
'tools/server/tests/**.*'
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
jobs:
ui-build:
name: Build static output
uses: ./.github/workflows/ui-build.yml
ui-checks:
name: Checks
needs: ui-build
runs-on: [self-hosted, PLAYWRIGHT]
continue-on-error: true
steps:
- name: Checkout code
uses: actions/checkout@v6
with:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Install dependencies
id: setup
run: npm ci
working-directory: tools/ui
- name: Run type checking
if: ${{ always() && steps.setup.conclusion == 'success' }}
run: npm run check
working-directory: tools/ui
- name: Run linting
if: ${{ always() && steps.setup.conclusion == 'success' }}
run: npm run lint
working-directory: tools/ui
- name: Run Client tests
if: ${{ always() }}
run: npm run test:client
working-directory: tools/ui
- name: Run Unit tests
if: ${{ always() }}
run: npm run test:unit
working-directory: tools/ui
e2e-tests:
name: E2E Tests
needs: ui-build
runs-on: [self-hosted, PLAYWRIGHT]
steps:
- name: Checkout code
uses: actions/checkout@v6
with:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Install dependencies
id: setup
run: npm ci
working-directory: tools/ui
- name: Build application
if: ${{ always() && steps.setup.conclusion == 'success' }}
run: npm run build
working-directory: tools/ui
- name: Build Storybook
if: ${{ always() }}
run: npm run build-storybook
working-directory: tools/ui
- name: Run UI tests
if: ${{ always() }}
run: npm run test:ui -- --testTimeout=60000
working-directory: tools/ui
- name: Run E2E tests
if: ${{ always() }}
run: npm run test:e2e
working-directory: tools/ui

View File

@@ -1,4 +1,4 @@
name: CI (UI)
name: UI
on:
workflow_dispatch:
@@ -11,14 +11,16 @@ on:
branches:
- master
paths: [
'.github/workflows/ui-ci.yml',
'.github/workflows/ui.yml',
'.github/workflows/ui-build.yml',
'tools/ui/**.*',
'tools/server/tests/**.*'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/ui-ci.yml',
'.github/workflows/ui.yml',
'.github/workflows/ui-build.yml',
'tools/ui/**.*',
'tools/server/tests/**.*'
]
@@ -39,7 +41,7 @@ jobs:
uses: ./.github/workflows/ui-build.yml
ui-checks:
name: UI Checks
name: Checks
needs: ui-build
runs-on: ubuntu-latest
continue-on-error: true

View File

@@ -3,18 +3,20 @@ name: Update Operations Documentation
on:
push:
paths:
- '.github/workflows/update-ops-docs.yml'
- 'docs/ops.md'
- 'docs/ops/**'
- 'scripts/create_ops_docs.py'
pull_request:
paths:
- '.github/workflows/update-ops-docs.yml'
- 'docs/ops.md'
- 'docs/ops/**'
- 'scripts/create_ops_docs.py'
jobs:
update-ops-docs:
runs-on: ubuntu-slim
runs-on: [self-hosted, fast, ARM64]
steps:
- name: Checkout repository

View File

@@ -108,20 +108,10 @@ option(LLAMA_BUILD_TESTS "llama: build tests"
option(LLAMA_BUILD_TOOLS "llama: build tools" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_EXAMPLES "llama: build examples" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_SERVER "llama: build server example" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_APP "llama: build the unified binary" ON)
option(LLAMA_BUILD_APP "llama: build the unified binary" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_UI "llama: build the embedded Web UI for server" ON)
option(LLAMA_USE_PREBUILT_UI "llama: use prebuilt UI from HF Bucket when available (requires LLAMA_BUILD_UI=ON)" ON)
# Backward compat: when old var is set but new one isn't, forward the value
if(DEFINED LLAMA_BUILD_WEBUI)
set(LLAMA_BUILD_UI ${LLAMA_BUILD_WEBUI})
message(DEPRECATION "LLAMA_BUILD_WEBUI is deprecated, use LLAMA_BUILD_UI instead")
endif()
if(DEFINED LLAMA_USE_PREBUILT_WEBUI)
set(LLAMA_USE_PREBUILT_UI ${LLAMA_USE_PREBUILT_WEBUI})
message(DEPRECATION "LLAMA_USE_PREBUILT_WEBUI is deprecated, use LLAMA_USE_PREBUILT_UI instead")
endif()
option(LLAMA_TOOLS_INSTALL "llama: install tools" ${LLAMA_TOOLS_INSTALL_DEFAULT})
option(LLAMA_TESTS_INSTALL "llama: install tests" ON)

View File

@@ -238,7 +238,7 @@ function gg_run_ctest_debug {
(cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Debug ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time cmake --build . --config Debug -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log
(time ctest -C Debug --output-on-failure -L main -E "test-opt|test-backend-ops" ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
(time ctest -C Debug --output-on-failure -L main -E "test-opt|test-backend-ops|test-llama-archs" ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
set +e
}

View File

@@ -1334,12 +1334,15 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_env("LLAMA_ARG_CTX_CHECKPOINTS").set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
add_opt(common_arg(
{"-cpent", "--checkpoint-every-n-tokens"}, "N",
string_format("create a checkpoint every n tokens during prefill (processing), -1 to disable (default: %d)", params.checkpoint_every_nt),
{"-cms", "--checkpoint-min-step"}, "N",
string_format("minimum spacing between context checkpoints in tokens (default: %d, 0 = no minimum)", params.checkpoint_min_step),
[](common_params & params, int value) {
params.checkpoint_every_nt = value;
if (value < 0) {
throw std::invalid_argument("checkpoint-min-step must be non-negative");
}
params.checkpoint_min_step = value;
}
).set_env("LLAMA_ARG_CHECKPOINT_EVERY_NT").set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
).set_env("LLAMA_ARG_CHECKPOINT_MIN_SPACING_NT").set_examples({LLAMA_EXAMPLE_SERVER}));
add_opt(common_arg(
{"-cram", "--cache-ram"}, "N",
string_format("set the maximum cache size in MiB (default: %d, -1 - no limit, 0 - disable)"

View File

@@ -310,6 +310,8 @@ std::vector<segment> prune_whitespace_segments(const std::vector<segment> & segm
namespace autoparser {
static const std::string ERR_TMPL = "#**ERROR**#";
std::string apply_template(const common_chat_template & tmpl, const template_params & params) {
generation_params tmpl_params;
tmpl_params.messages = params.messages;
@@ -326,7 +328,7 @@ std::string apply_template(const common_chat_template & tmpl, const template_par
return common_chat_template_direct_apply(tmpl, tmpl_params);
} catch (const std::exception & e) {
LOG_DBG("Template application failed: %s\n", e.what());
return "";
return ERR_TMPL;
}
}
@@ -347,7 +349,7 @@ std::optional<compare_variants_result> compare_variants(
std::string output_B = apply_template(tmpl, params_B);
// Check for template application failures
if (output_A.empty() || output_B.empty()) {
if (output_A == ERR_TMPL || output_B == ERR_TMPL) {
return std::nullopt;
}

View File

@@ -377,6 +377,8 @@ struct analyze_tools : analyze_base {
struct autoparser {
jinja::caps jinja_caps;
std::string user_start;
std::string assistant_start;
analyze_reasoning reasoning;
analyze_content content;
analyze_tools tools;
@@ -387,6 +389,10 @@ struct autoparser {
autoparser() = default;
// Find the starting marker for the user message and assistant message
std::string detect_user_start_marker(const common_chat_template & tmpl);
std::string detect_assistant_start_marker(const common_chat_template & tmpl);
// Run full differential analysis on a template
void analyze_template(const common_chat_template & tmpl);

View File

@@ -8,6 +8,9 @@
#include "peg-parser.h"
#include <algorithm>
#include <cctype>
#include <ostream>
#include <sstream>
#define ANSI_RESET "\033[0m"
#define ANSI_PURPLE "\033[1m\x1b[38;5;126m"
@@ -23,6 +26,7 @@ static const std::string FUN_SECOND = "SSS_SECOND_FUN_S";
static const std::string ARG_FIRST = "AA_ARG_FST_AA";
static const std::string ARG_SECOND = "BB_ARG_SND_BB";
static const std::string USER_MSG = "U_USER_MSG Hello END_U";
static const std::string USER_MSG_TWO = "V_USER_MSG Hello END_V";
static const std::string ASSISTANT_MSG = "A_ASST_MSG I can help END_A";
static const std::string THINKING_CONTENT = "REASON_PART I am thinking END_R";
static const std::string CALL_ID_001 = "call00001";
@@ -71,6 +75,7 @@ static std::vector<std::function<void(const common_chat_template & tmpl, autopar
analysis.content.end = "<|END_OF_TURN_TOKEN|>";
analysis.preserved_tokens.push_back("<|CHATBOT_TOKEN|>");
analysis.preserved_tokens.push_back("<|END_OF_TURN_TOKEN|>");
analysis.user_start = "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>";
LOG_DBG(ANSI_ORANGE "[Patch: Cohere Command R+]\n" ANSI_RESET);
}
},
@@ -108,7 +113,59 @@ static std::vector<std::function<void(const common_chat_template & tmpl, autopar
analysis.tools.function.close = "```";
LOG_DBG(ANSI_ORANGE "[Patch: DeepSeek-R1-Distill-Qwen]\n" ANSI_RESET);
}
}
},
// Nemotron Nano v2
[](const common_chat_template & tmpl, autoparser & analysis) -> void {
if (tmpl.src.find("<SPECIAL_10>") != std::string::npos && tmpl.src.find("<SPECIAL_11>") != std::string::npos &&
tmpl.src.find("<SPECIAL_12>") != std::string::npos && tmpl.src.find("<TOOL_RESPONSE>") != std::string::npos) {
analysis.tools.format.mode = tool_format::JSON_NATIVE;
analysis.tools.format.section_start = "";
analysis.tools.format.section_end = "";
analysis.tools.format.per_call_start = "<TOOLCALL>";
analysis.tools.format.per_call_end = "</TOOLCALL>";
analysis.content.mode = content_mode::PLAIN;
analysis.content.start = "";
analysis.content.end = "";
analysis.reasoning.mode = reasoning_mode::TAG_BASED;
analysis.reasoning.start = "<think>\n\n";
analysis.reasoning.end = "</think>";
analysis.assistant_start = "<SPECIAL_11>Assistant";
analysis.user_start = "<SPECIAL_11>User";
analysis.preserved_tokens.clear();
analysis.preserved_tokens.push_back("<SPECIAL_12>");
analysis.preserved_tokens.push_back("<SPECIAL_11>");
analysis.preserved_tokens.push_back("</think>");
analysis.preserved_tokens.push_back("<TOOLCALL>");
analysis.preserved_tokens.push_back("</TOOLCALL>");
LOG_DBG(ANSI_ORANGE "[Patch: Nemotron Nano v2]\n" ANSI_RESET);
}
},
// Fireworks
[](const common_chat_template & tmpl, autoparser & analysis) -> void {
if (tmpl.src.find("{%- set system_prompt = '<|start_header_id|>' + 'system' + '<|end_header_id|>\\n\\n'"
" + message['content'] | trim + '\\n' + system_prompt_suffix + '<|eot_id|>' -%}") != std::string::npos) {
analysis.assistant_start = "<|start_header_id|>assistant<|end_header_id|>";
analysis.user_start = "<|start_header_id|>user<|end_header_id|>";
LOG_DBG(ANSI_ORANGE "[Patch: Fireworks v2]\n" ANSI_RESET);
}
},
// Solar Open
[](const common_chat_template & tmpl, autoparser & analysis) -> void {
if (tmpl.src.find("<|begin|>assistant<|think|><|end|>") != std::string::npos) {
analysis.assistant_start = "<|begin|>assistant";
LOG_DBG(ANSI_ORANGE "[Patch: Solar Open]\n" ANSI_RESET);
}
},
// Apriel 1.6
[](const common_chat_template & tmpl, autoparser & analysis) -> void {
if (tmpl.src.find("if not loop.last and '[BEGIN FINAL RESPONSE]' in asst_text") != std::string::npos) {
analysis.user_start = "<|begin_user|>";
analysis.assistant_start = "<|begin_assistant|>";
LOG_DBG(ANSI_ORANGE "[Patch: Apriel 1.6]\n" ANSI_RESET);
}
},
});
// Common JSON structures
@@ -166,6 +223,8 @@ void autoparser::analyze_template(const common_chat_template & tmpl) {
reasoning = analyze_reasoning(tmpl, jinja_caps.supports_tool_calls);
content = analyze_content(tmpl, reasoning);
tools = analyze_tools(jinja_caps.supports_tool_calls ? analyze_tools(tmpl, jinja_caps, reasoning) : analyze_tools());
assistant_start = detect_assistant_start_marker(tmpl);
user_start = detect_user_start_marker(tmpl);
collect_preserved_tokens();
for (auto & workaround : workarounds) {
@@ -173,6 +232,8 @@ void autoparser::analyze_template(const common_chat_template & tmpl) {
}
LOG_DBG("\n--- Reasoning & Content Structure ---\n");
LOG_DBG("user_msg_start: %s\n", user_start.c_str());
LOG_DBG("assistant_msg_start: %s\n", assistant_start.c_str());
LOG_DBG("reasoning_mode: %s\n", mode_to_str(reasoning.mode).c_str());
LOG_DBG("reasoning_start: '%s'\n", reasoning.start.c_str());
LOG_DBG("reasoning_end: '%s'\n", reasoning.end.c_str());
@@ -245,6 +306,120 @@ void autoparser::collect_preserved_tokens() {
add_token(tools.call_id.suffix);
}
std::string autoparser::detect_assistant_start_marker(const common_chat_template & tmpl) {
json user_msg = json{
{ "role", "user" },
{ "content", USER_MSG }
};
json assistant_no_reasoning = json{
{ "role", "assistant" },
{ "content", ASSISTANT_MSG }
};
template_params params;
params.messages = json::array({ user_msg });
params.add_generation_prompt = false;
params.enable_thinking = true;
auto comparison = compare_variants(
tmpl, params, [&](template_params & p) {
p.messages = json::array({ user_msg, assistant_no_reasoning });
}
);
if (!comparison) {
LOG_DBG(ANSI_ORANGE "%s: Template application failed, skipping assistant start detection\n" ANSI_RESET, __func__);
return "";
}
auto usermsg = comparison->diff.right;
if (usermsg.find(ASSISTANT_MSG) == std::string::npos) {
LOG_DBG(ANSI_ORANGE "%s: Did not find assistant message in assistant message block, skipping detection\n" ANSI_RESET, __func__);
}
auto ast_prefix = usermsg.substr(0, usermsg.find(ASSISTANT_MSG));
if (!reasoning.start.empty() && ast_prefix.find(trim_whitespace(reasoning.start)) != std::string::npos) {
ast_prefix = ast_prefix.substr(0, ast_prefix.find(trim_whitespace(reasoning.start)));
}
if (!reasoning.end.empty() && ast_prefix.find(trim_whitespace(reasoning.end)) != std::string::npos) {
ast_prefix = ast_prefix.substr(0, ast_prefix.find(trim_whitespace(reasoning.end)));
}
return trim_whitespace(ast_prefix);
}
std::string autoparser::detect_user_start_marker(const common_chat_template & tmpl) {
json user_msg = json{
{ "role", "user" },
{ "content", USER_MSG }
};
json assistant = json{
{ "role", "assistant" },
{ "content", ASSISTANT_MSG }
};
json user_msg_two = json{
{ "role", "user" },
{ "content", USER_MSG_TWO }
};
template_params params;
params.messages = json::array({});
params.add_generation_prompt = false;
params.enable_thinking = true;
auto comparison = compare_variants(
tmpl, params, [&](template_params & p) {
p.messages = json::array({ user_msg });
}
);
if (!comparison) {
LOG_DBG(ANSI_ORANGE "%s: Template application failed, unsupported empty messages? trying complex variant\n" ANSI_RESET, __func__);
params.messages = json::array({ user_msg_two, assistant });
comparison = compare_variants(
tmpl, params, [&](template_params & p) {
p.messages = json::array({ user_msg_two, assistant, user_msg });
}
);
if (!comparison) {
LOG_DBG(ANSI_ORANGE "%s: Template application failed for reserve variant, aborting\n" ANSI_RESET, __func__);
return "";
}
}
auto usermsg = comparison->diff.right;
if (usermsg.find(USER_MSG) == std::string::npos) {
LOG_DBG(ANSI_ORANGE "%s: Did not find user message in user message block, aborting detection\n" ANSI_RESET, __func__);
}
if (usermsg.find(ASSISTANT_MSG) != std::string::npos) {
usermsg = usermsg.substr(usermsg.find(ASSISTANT_MSG) + ASSISTANT_MSG.size());
}
auto candidate = usermsg.substr(0, usermsg.find(USER_MSG));
auto candidate_split = segmentize_markers(candidate);
std::stringstream result;
bool encountered_marker = false;
for (const auto & mrk : candidate_split) {
std::string lower_mrk = std::string(mrk.value);
std::transform(lower_mrk.begin(), lower_mrk.end(), lower_mrk.begin(),
[](unsigned char c) { return std::tolower(c); });
// heuristic to weed out potential end markers, but only at the start
if (mrk.type == segment_type::MARKER && !encountered_marker &&
(lower_mrk.find("end") != std::string::npos || lower_mrk.find("close") != std::string::npos)) {
continue;
}
if (mrk.type == segment_type::TEXT && !encountered_marker && trim_whitespace(mrk.value).empty()) {
continue;
}
encountered_marker |= mrk.type == segment_type::MARKER;
result << mrk.value;
}
return trim_whitespace(result.str());
}
analyze_reasoning::analyze_reasoning(const common_chat_template & tmpl, bool supports_tools)
: analyze_base(tmpl) {
LOG_DBG(ANSI_PURPLE "=== Starting differential analysis ===\n" ANSI_RESET);

View File

@@ -90,6 +90,45 @@ std::string common_chat_msg::render_content(const std::string & delimiter) const
return text;
}
std::vector<common_chat_msg_span> common_chat_split_by_role(const std::string & prompt, const std::vector<common_chat_msg_delimiter> & delims) {
if (delims.empty() || prompt.empty()) {
return {};
}
auto parser = build_peg_parser([&](common_peg_parser_builder & p) {
std::vector<std::string> all_delims;
std::vector<common_peg_parser> tagged_messages;
all_delims.reserve(delims.size());
tagged_messages.reserve(delims.size());
for (const auto & d : delims) {
all_delims.push_back(d.delimiter);
}
auto any_delim = p.until_one_of(all_delims);
for (const auto & d : delims) {
tagged_messages.push_back(p.tag(d.role, p.literal(d.delimiter) + any_delim));
}
return any_delim + p.zero_or_more(p.choice(tagged_messages)) + p.end();
});
common_peg_parse_context ctx(prompt);
const auto result = parser.parse(ctx);
if (!result.success()) {
return {};
}
std::vector<common_chat_msg_span> spans;
ctx.ast.visit(result, [&](const common_peg_ast_node & node) {
if (!node.tag.empty()) {
spans.push_back({ node.tag, node.start, node.end - node.start });
}
});
return spans;
}
json common_chat_msg::to_json_oaicompat(bool concat_typed_text) const {
if (!content.empty() && !content_parts.empty()) {
throw std::runtime_error("Cannot specify both content and content_parts");
@@ -1042,6 +1081,14 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
data.prompt = prompt;
data.generation_prompt = common_chat_template_generation_prompt_impl(tmpl, inputs, /* messages_override= */ adjusted_messages);
data.message_spans = common_chat_split_by_role(prompt, {
{ "assistant", "<|start|>assistant" },
{ "user", "<|start|>user" },
{ "system", "<|start|>developer" },
{ "system", "<|start|>system" },
{ "tool", "<|start|>functions" },
});
data.format = COMMON_CHAT_FORMAT_PEG_NATIVE;
data.supports_thinking = true;
@@ -1181,6 +1228,11 @@ static common_chat_params common_chat_params_init_gemma4(const common_chat_templ
data.prompt += data.generation_prompt;
}
data.message_spans = common_chat_split_by_role(data.prompt, {
{ "user", "<|turn>user\n" },
{ "assistant", "<|turn>model\n" },
});
data.format = COMMON_CHAT_FORMAT_PEG_GEMMA4;
data.supports_thinking = true;
data.thinking_start_tag = "<|channel>thought";
@@ -2393,6 +2445,19 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
struct autoparser::autoparser autoparser;
autoparser.analyze_template(tmpl);
auto auto_params = autoparser::peg_generator::generate_parser(tmpl, params, autoparser);
std::vector<common_chat_msg_delimiter> delimiters;
if (!autoparser.assistant_start.empty()) {
delimiters.push_back({ "assistant", autoparser.assistant_start });
}
if (!autoparser.user_start.empty()) {
delimiters.push_back({ "user", autoparser.user_start });
}
if (!delimiters.empty()) {
auto_params.message_spans = common_chat_split_by_role(auto_params.prompt, delimiters);
}
auto_params.supports_thinking = autoparser.reasoning.mode != autoparser::reasoning_mode::NONE;
if (auto_params.supports_thinking) {
auto_params.thinking_start_tag = trim_whitespace(autoparser.reasoning.start);

View File

@@ -143,6 +143,17 @@ struct common_chat_msg_diff {
}
};
struct common_chat_msg_span {
std::string role;
std::size_t pos = 0;
std::size_t len = 0;
};
struct common_chat_msg_delimiter {
std::string role;
std::string delimiter;
};
struct common_chat_tool {
std::string name;
std::string description;
@@ -208,6 +219,7 @@ struct common_chat_params {
std::vector<std::string> preserved_tokens;
std::vector<std::string> additional_stops;
std::string parser;
std::vector<common_chat_msg_span> message_spans;
};
// per-message parsing syntax
@@ -219,6 +231,7 @@ struct common_chat_parser_params {
bool reasoning_in_content = false;
std::string generation_prompt;
bool parse_tool_calls = true;
bool is_continuation = false;
bool echo = false; // Include assistant prefilled msg in output
bool debug = false; // Enable debug output for PEG parser
common_peg_arena parser = {};
@@ -303,6 +316,7 @@ std::optional<common_chat_params> common_chat_try_specialized_template(
const std::string & src,
autoparser::generation_params & params);
// specialized per-task preset
struct common_chat_prompt_preset {
std::string system;
@@ -310,3 +324,6 @@ struct common_chat_prompt_preset {
};
common_chat_prompt_preset common_chat_get_asr_prompt(const common_chat_templates * chat_templates);
std::vector<common_chat_msg_span> common_chat_split_by_role(const std::string & prompt, const std::vector<common_chat_msg_delimiter> & delims);

View File

@@ -445,6 +445,27 @@ std::string string_strip(const std::string & str) {
return str.substr(start, end - start);
}
std::string string_lcs(std::string_view a, std::string_view b) {
if (a.empty() || b.empty()) return {};
std::vector<std::vector<size_t>> dp(a.size() + 1, std::vector<size_t>(b.size() + 1, 0));
size_t best_len = 0;
size_t best_end_a = 0;
for (size_t i = 1; i <= a.size(); ++i) {
for (size_t j = 1; j <= b.size(); ++j) {
if (a[i - 1] == b[j - 1]) {
dp[i][j] = dp[i - 1][j - 1] + 1;
if (dp[i][j] > best_len) {
best_len = dp[i][j];
best_end_a = i;
}
}
}
}
return std::string(a.substr(best_end_a - best_len, best_len));
}
std::string string_get_sortable_timestamp() {
using clock = std::chrono::system_clock;

View File

@@ -594,7 +594,7 @@ struct common_params {
bool cache_prompt = true; // whether to enable prompt caching
bool cache_idle_slots = true; // save and clear idle slots upon starting a new task
int32_t n_ctx_checkpoints = 32; // max number of context checkpoints per slot
int32_t checkpoint_every_nt = 8192; // make a checkpoint every n tokens during prefill
int32_t checkpoint_min_step = 256; // minimum spacing between context checkpoints
int32_t cache_ram_mib = 8192; // -1 = no limit, 0 - disable, 1 = 1 MiB, etc.
std::string hostname = "127.0.0.1";
@@ -617,11 +617,7 @@ struct common_params {
std::map<std::string, std::string> default_template_kwargs;
// UI configs
#ifdef LLAMA_UI_DEFAULT_ENABLED
bool ui = LLAMA_UI_DEFAULT_ENABLED != 0;
#else
bool ui = true; // default to enabled when not set
#endif
bool ui = true;
// Deprecated: use ui, ui_mcp_proxy, ui_config_json instead
bool webui = ui;
@@ -735,6 +731,7 @@ std::string string_format(const char * fmt, ...);
std::string string_strip(const std::string & str);
std::string string_get_sortable_timestamp();
std::string string_lcs(std::string_view a, std::string_view b);
std::string string_join(const std::vector<std::string> & values, const std::string & separator);
std::vector<std::string> string_split(const std::string & str, const std::string & delimiter);

View File

@@ -26,7 +26,7 @@ class common_params_fit_exception : public std::runtime_error {
using std::runtime_error::runtime_error;
};
static std::vector<llama_device_memory_data> common_get_device_memory_data(
std::vector<llama_device_memory_data> common_get_device_memory_data(
const char * path_model,
const llama_model_params * mparams,
const llama_context_params * cparams,

View File

@@ -1,6 +1,11 @@
#pragma once
#include "ggml.h"
#include "ggml-backend.h"
#include "llama.h"
#include "../src/llama-ext.h"
#include <vector>
enum common_params_fit_status {
COMMON_PARAMS_FIT_STATUS_SUCCESS = 0, // found allocations that are projected to fit
@@ -30,3 +35,14 @@ void common_fit_print(
struct llama_context_params * cparams);
void common_memory_breakdown_print(const struct llama_context * ctx);
// Load a model + context with no_alloc and return the per-device memory breakdown.
std::vector<llama_device_memory_data> common_get_device_memory_data(
const char * path_model,
const struct llama_model_params * mparams,
const struct llama_context_params * cparams,
std::vector<ggml_backend_dev_t> & devs,
uint32_t & hp_ngl,
uint32_t & hp_n_ctx_train,
uint32_t & hp_n_expert,
enum ggml_log_level log_level);

View File

@@ -467,7 +467,14 @@ class ModelBase:
elif quant_method == "compressed-tensors":
quant_format = quant_config["format"]
groups = quant_config["config_groups"]
if len(groups) > 1:
nvfp4_compressed_tensors = (
quant_format == "nvfp4-pack-quantized"
or quant_format == "mixed-precision"
and bool(groups)
and all(g.get("format") == "nvfp4-pack-quantized" for g in groups.values() if isinstance(g, dict))
)
if len(groups) > 1 and not nvfp4_compressed_tensors:
raise NotImplementedError("Can't handle multiple config groups for compressed-tensors yet")
weight_config = tuple(groups.values())[0]["weights"]
@@ -505,6 +512,9 @@ class ModelBase:
tensors_to_remove += [base_name + n for n in ("_packed", "_shape", "_scale")]
if (base_name + "_zero_point") in self.model_tensors:
tensors_to_remove.append(base_name + "_zero_point")
elif nvfp4_compressed_tensors:
# Don't error from compressed-tensors, we'll handle them in _generate_nvfp4_tensors
pass
else:
raise NotImplementedError(f"Quant format {quant_format!r} for method {quant_method!r} is not yet supported")
elif quant_method == "modelopt":
@@ -746,10 +756,13 @@ class ModelBase:
del experts, merged
def prepare_tensors(self):
# detect NVFP4 quantization (ModelOpt format)
quant_algo = (self.hparams.get("quantization_config") or {}).get("quant_algo")
quant_method = (self.hparams.get("quantization_config") or {}).get("quant_method")
quant_layers = (self.hparams.get("quantization_config") or {}).get("quantized_layers") or {}
# detect NVFP4 quantization (ModelOpt and Compressed-tensors formats)
quantization_config = self.hparams.get("quantization_config") or {}
quant_algo = quantization_config.get("quant_algo")
quant_method = quantization_config.get("quant_method")
quant_format = quantization_config.get("format")
quant_groups = quantization_config.get("config_groups") or {}
quant_layers = quantization_config.get("quantized_layers") or {}
quant_config_file = self.dir_model / "hf_quant_config.json"
if (not quant_algo or not quant_layers) and quant_config_file.is_file():
@@ -760,13 +773,25 @@ class ModelBase:
producer_name = (producer.get("name") or "").lower()
if quant_method is None:
self.hparams.setdefault("quantization_config", {})["quant_method"] = producer_name
quant_method = producer_name
quant_algo = quant_config.get("quant_algo", quant_algo)
quant_method = quant_config.get("quant_method", quant_method)
quant_format = quant_config.get("format", quant_format)
quant_groups = quant_config.get("config_groups", quant_groups) or {}
quant_layers = quant_config.get("quantized_layers", quant_layers) or {}
# Some models use per-tensor quant_algo (e.g. "MIXED_PRECISION" with
# per-layer NVFP4/FP8) instead of a single global "NVFP4" value.
nvfp4_compressed_tensors = quant_method == "compressed-tensors" and (
quant_format == "nvfp4-pack-quantized"
or quant_format == "mixed-precision"
and bool(quant_groups)
and all(g.get("format") == "nvfp4-pack-quantized" for g in quant_groups.values() if isinstance(g, dict))
)
if quant_algo != "NVFP4":
if any(v.get("quant_algo") == "NVFP4" for v in quant_layers.values() if isinstance(v, dict)):
if nvfp4_compressed_tensors:
quant_algo = "NVFP4"
elif any(v.get("quant_algo") == "NVFP4" for v in quant_layers.values() if isinstance(v, dict)):
quant_algo = "NVFP4"
self._is_nvfp4 = quant_algo == "NVFP4"
@@ -776,6 +801,28 @@ class ModelBase:
# This must run before dequant_model so NVFP4 tensors are removed
# from model_tensors, leaving only non-NVFP4 (e.g. FP8) for dequant.
if self._is_nvfp4:
if nvfp4_compressed_tensors:
# Convert compressed-tensors 'global' scales into the reciprocal
def inverse_scale(gen):
def load():
scale = LazyTorchTensor.to_eager(gen()).float()
return 1.0 / scale
return load
# Change the compressed-tensors names to the ModelOpt names for handling consistently later
for name in list(self.model_tensors.keys()):
if name.endswith(".weight_packed"):
weight_name = name.removesuffix("_packed")
if weight_name not in self.model_tensors:
self.model_tensors[weight_name] = self.model_tensors.pop(name)
elif name.endswith(".weight_global_scale"):
scale2_name = name.replace(".weight_global_scale", ".weight_scale_2")
if scale2_name not in self.model_tensors:
self.model_tensors[scale2_name] = inverse_scale(self.model_tensors.pop(name))
elif name.endswith(".input_global_scale"):
input_scale_name = name.replace(".input_global_scale", ".input_scale")
if input_scale_name not in self.model_tensors:
self.model_tensors[input_scale_name] = inverse_scale(self.model_tensors.pop(name))
self._generate_nvfp4_tensors()
self.dequant_model()
@@ -1617,6 +1664,11 @@ class TextModel(ModelBase):
assert max(tokenizer.vocab.values()) < vocab_size # ty: ignore[unresolved-attribute]
reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()} # ty: ignore[unresolved-attribute]
# k-mers can share text with a base-vocab BPE token (e.g. CCCCCC) and get
# dropped by get_vocab(); a reserved marker suffix (U+E000) keeps each
# k-mer's own id (llama.cpp strips it on detokenization)
for kmer in tokenizer.kmers: # ty: ignore[unresolved-attribute]
reverse_vocab[tokenizer.dna_token_to_id[kmer]] = kmer + "\ue000" # ty: ignore[unresolved-attribute]
added_vocab = tokenizer.get_added_vocab() # ty: ignore[unresolved-attribute]
added_tokens_decoder = tokenizer.added_tokens_decoder # ty: ignore[unresolved-attribute]

View File

@@ -1,6 +1,5 @@
from __future__ import annotations
from pathlib import Path
from typing import Any, Callable, Iterable, TYPE_CHECKING
import torch
@@ -549,6 +548,7 @@ class _Qwen35MtpMixin:
tensor_map: gguf.TensorNameMap
no_mtp: bool
mtp_only: bool
_original_block_count: int | None = None
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
@@ -557,22 +557,44 @@ class _Qwen35MtpMixin:
self.block_count += self.hparams.get("mtp_num_hidden_layers", 0)
self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
def index_tensors(self, remote_hf_model_id: str | None = None) -> dict[str, Callable[[], Tensor]]:
hparams = {**self.hparams, **self.hparams.get("text_config", {})}
key = next((k for k in ["n_layers", "num_hidden_layers", "n_layer", "num_layers"] if k in hparams), None)
type(self)._original_block_count = hparams.get(key)
return super().index_tensors(remote_hf_model_id=remote_hf_model_id) # ty: ignore[unresolved-attribute]
@classmethod
def filter_tensors(cls, item):
name, _ = item
assert cls._original_block_count is not None
# TODO: change TextModel to super()
if (titem := TextModel.filter_tensors(item)) is None:
return None
name, gen = titem
if name.startswith("model.mtp."):
name = name.replace("model.", "", 1)
if name.startswith("mtp."):
if cls.no_mtp:
return None
return item
if cls.mtp_only:
canonical = name.replace("language_model.", "")
keep = canonical in (
remapper = {
"fc": "eh_proj",
"pre_fc_norm_embedding": "enorm",
"pre_fc_norm_hidden": "hnorm",
"norm": "shared_head.norm",
}
parts = name.split(".", 3)
if len(parts) == 4 and parts[1] == "layers" and parts[2].isdecimal():
mtp_idx = int(parts[2])
name = f"model.layers.{cls._original_block_count + mtp_idx}.{parts[3]}"
elif len(parts) == 3 and parts[1] in remapper:
name = f"model.layers.{cls._original_block_count}.{remapper[parts[1]]}.{parts[2]}"
elif cls.mtp_only:
keep = name in (
"model.embed_tokens.weight", "model.norm.weight", "lm_head.weight",
"embed_tokens.weight", "norm.weight",
)
if not keep:
return None
return super().filter_tensors(item) # ty: ignore[unresolved-attribute]
return name, gen
def set_gguf_parameters(self):
super().set_gguf_parameters() # ty: ignore[unresolved-attribute]
@@ -594,29 +616,6 @@ class _Qwen35MtpMixin:
self.metadata.version, size_label=None, output_type=output_type, model_type=None) # pyright: ignore[reportAttributeAccessIssue] # ty: ignore[unresolved-attribute]
self.fname_out = self.fname_out.parent / f"mtp-{fname_default}.gguf"
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.startswith("mtp."):
n_layer = self.hparams["num_hidden_layers"]
if name.find("layers.") != -1:
assert bid is not None
name = name.replace(f"mtp.layers.{bid}", f"model.layers.{bid + n_layer}")
bid = bid + n_layer
else:
remapper = {
"mtp.fc": "model.layers.{bid}.eh_proj",
"mtp.pre_fc_norm_embedding": "model.layers.{bid}.enorm",
"mtp.pre_fc_norm_hidden": "model.layers.{bid}.hnorm",
"mtp.norm": "model.layers.{bid}.shared_head.norm",
}
stem = Path(name).stem
suffix = Path(name).suffix
tmpl = remapper[stem] + suffix
for b in range(n_layer, self.block_count):
yield from super().modify_tensors(data_torch, tmpl.format(bid=b), b) # ty: ignore[unresolved-attribute]
return
yield from super().modify_tensors(data_torch, name, bid) # ty: ignore[unresolved-attribute]
@ModelBase.register("Qwen3_5ForConditionalGeneration", "Qwen3_5ForCausalLM")
class Qwen3_5TextModel(_Qwen35MtpMixin, _Qwen35MRopeMixin, _LinearAttentionVReorderBase):

View File

@@ -489,6 +489,7 @@ The following templates have active tests in `tests/test-chat.cpp`:
| Qwen-QwQ-32B | Reasoning | Forced-open thinking |
| NousResearch Hermes 2 Pro | JSON_NATIVE | `<tool_call>` wrapper |
| IBM Granite 3.3 | JSON_NATIVE | `<think></think>` + `<response></response>` |
| IBM Granite 4.0 | JSON_NATIVE | `<tool_call>` wrapper (same template used by 4.1) |
| ByteDance Seed-OSS | TAG_WITH_TAGGED | Custom `<seed:think>` and `<seed:tool_call>` tags |
| Qwen3-Coder | TAG_WITH_TAGGED | XML-style tool format |
| DeepSeek V3.1 | JSON_NATIVE | Forced thinking mode |

View File

@@ -33,8 +33,8 @@
"name": "arm64-windows-snapdragon",
"inherits": [ "base", "arm64-windows-llvm" ],
"cacheVariables": {
"CMAKE_C_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -flto -D_GNU_SOURCE",
"CMAKE_CXX_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -flto -D_GNU_SOURCE",
"CMAKE_C_FLAGS": "-march=armv8.7a+fp16+dotprod+i8mm -fvectorize -ffp-model=fast -flto -D_GNU_SOURCE",
"CMAKE_CXX_FLAGS": "-march=armv8.7a+fp16+dotprod+i8mm -fvectorize -ffp-model=fast -flto -D_GNU_SOURCE",
"CMAKE_C_FLAGS_RELEASE": "-O3 -DNDEBUG",
"CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG",
"CMAKE_C_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g",

View File

@@ -24,7 +24,7 @@ Native Windows 11 arm64 builds has the following tools dependencies:
- UCRT and Driver Kit
- LLVM core libraries and Clang compiler (winget)
- CMake, Git, Python (winget)
- Hexagon SDK Community Edition 6.4 or later (see windows.md)
- Hexagon SDK Community Edition 6.6 or later (see windows.md)
- OpenCL SDK 2.3 or later (see windows.md)
Note: The rest of the **Windows** build process assumes that you're running natively in Powershell.
@@ -45,7 +45,7 @@ Preset CMake variables:
GGML_HEXAGON="ON"
GGML_OPENCL="ON"
GGML_OPENMP="OFF"
HEXAGON_SDK_ROOT="/opt/hexagon/6.4.0.2"
HEXAGON_SDK_ROOT="/opt/hexagon/6.6.0.0"
...
-- Including OpenCL backend
-- Including Hexagon backend

View File

@@ -28,15 +28,15 @@ c:\Qualcomm\OpenCL_SDK\2.3.2
Either use the trimmed down version (optimized for CI) from
https://github.com/snapdragon-toolchain/hexagon-sdk/releases/download/v6.4.0.2/hexagon-sdk-v6.4.0.2-arm64-wos.tar.xz
https://github.com/snapdragon-toolchain/hexagon-sdk/releases/download/v6.6.0.0/hexagon-sdk-v6.6.0.0-arm64-wos.tar.xz
Or download the complete official version from
https://softwarecenter.qualcomm.com/catalog/item/Hexagon_SDK?version=6.4.0.2
https://softwarecenter.qualcomm.com/catalog/item/Hexagon_SDK?version=6.6.0.0
Unzip/untar the archive into
```
c:\Qualcomm\Hexagon_SDK\6.4.0.2
c:\Qualcomm\Hexagon_SDK\6.6.0.0
```
## Install the latest Adreno GPU driver
@@ -123,10 +123,10 @@ The overall Hexagon backend build procedure for Windows on Snapdragon is the sam
However, additional settings are required for generating and signing HTP Ops libraries.
```
> $env:OPENCL_SDK_ROOT="C:\Qualcomm\OpenCL_SDK\2.3.2"
> $env:HEXAGON_SDK_ROOT="C:\Qualcomm\Hexagon_SDK\6.4.0.2"
> $env:HEXAGON_TOOLS_ROOT="C:\Qualcomm\Hexagon_SDK\6.4.0.2\tools\HEXAGON_Tools\19.0.04"
> $env:HEXAGON_SDK_ROOT="C:\Qualcomm\Hexagon_SDK\6.6.0.0"
> $env:HEXAGON_TOOLS_ROOT="C:\Qualcomm\Hexagon_SDK\6.6.0.0\tools\HEXAGON_Tools\19.0.07"
> $env:HEXAGON_HTP_CERT="c:\Users\MyUsers\Certs\ggml-htp-v1.pfx"
> $env:WINDOWS_SDK_BIN="C:\Program Files (x86)\Windows Kits\10\bin\10.0.26100.0\arm64"
> $env:WINDOWS_SDK_BIN="C:\Program Files (x86)\Windows Kits\10\bin\10.0.26100.0"
> cmake --preset arm64-windows-snapdragon-release -B build-wos
...

View File

@@ -5,7 +5,7 @@
1. Prepare Toolchain For RISCV
~~~
wget https://archive.spacemit.com/toolchain/spacemit-toolchain-linux-glibc-x86_64-v1.1.2.tar.xz
wget https://github.com/spacemit-com/toolchain/releases/download/v1.2.4/spacemit-toolchain-linux-glibc-x86_64-v1.2.4.tar.xz
~~~
2. Build

View File

@@ -291,6 +291,7 @@ Here are some models known to work (w/ chat template override when needed):
llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L
llama-server --jinja -fa -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
llama-server --jinja -fa -hf ibm-granite/granite-4.1-3b-GGUF:Q4_K_M
# Native support for DeepSeek R1 works best w/ our template override (official template is buggy, although we do work around it)

View File

@@ -1308,7 +1308,8 @@ def do_dump_model(model_plus: ModelPlus) -> None:
def main(args_in: list[str] | None = None) -> None:
output_choices = ["f32", "f16"]
if np.uint32(1) == np.uint32(1).newbyteorder("<"):
dummy_val = np.uint32(1)
if dummy_val == dummy_val.view(dummy_val.dtype.newbyteorder("<")):
# We currently only support Q8_0 output on little endian systems.
output_choices.append("q8_0")
parser = argparse.ArgumentParser(description="Convert a LLaMA model to a GGML compatible file")

View File

@@ -64,7 +64,7 @@ def load_model_and_tokenizer(model_path, use_sentence_transformers=False, device
print("Using SentenceTransformer to apply all numbered layers")
model = SentenceTransformer(model_path)
tokenizer = model.tokenizer
config = model[0].auto_model.config
config = model[0].auto_model.config # ty: ignore[unresolved-attribute]
else:
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

View File

@@ -4,7 +4,7 @@ project("ggml" C CXX ASM)
### GGML Version
set(GGML_VERSION_MAJOR 0)
set(GGML_VERSION_MINOR 12)
set(GGML_VERSION_MINOR 13)
set(GGML_VERSION_PATCH 0)
set(GGML_VERSION_BASE "${GGML_VERSION_MAJOR}.${GGML_VERSION_MINOR}.${GGML_VERSION_PATCH}")

View File

@@ -6,6 +6,7 @@
include(CMakeFindDependencyMacro)
find_dependency(Threads)
if (NOT GGML_SHARED_LIB)
set(GGML_BASE_INTERFACE_LINK_LIBRARIES "")
set(GGML_CPU_INTERFACE_LINK_LIBRARIES "")
set(GGML_CPU_INTERFACE_LINK_OPTIONS "")
@@ -20,7 +21,15 @@ if (NOT GGML_SHARED_LIB)
if (GGML_OPENMP_ENABLED)
find_dependency(OpenMP)
list(APPEND GGML_CPU_INTERFACE_LINK_LIBRARIES OpenMP::OpenMP_C OpenMP::OpenMP_CXX)
set(GGML_OPENMP_INTERFACE_LINK_LIBRARIES "")
if (TARGET OpenMP::OpenMP_C)
list(APPEND GGML_OPENMP_INTERFACE_LINK_LIBRARIES OpenMP::OpenMP_C)
endif()
if (TARGET OpenMP::OpenMP_CXX)
list(APPEND GGML_OPENMP_INTERFACE_LINK_LIBRARIES OpenMP::OpenMP_CXX)
endif()
list(APPEND GGML_BASE_INTERFACE_LINK_LIBRARIES ${GGML_OPENMP_INTERFACE_LINK_LIBRARIES})
list(APPEND GGML_CPU_INTERFACE_LINK_LIBRARIES ${GGML_OPENMP_INTERFACE_LINK_LIBRARIES})
endif()
if (GGML_CPU_HBM)
@@ -122,7 +131,8 @@ if(NOT TARGET ggml::ggml)
add_library(ggml::ggml-base UNKNOWN IMPORTED)
set_target_properties(ggml::ggml-base
PROPERTIES
IMPORTED_LOCATION "${GGML_BASE_LIBRARY}")
IMPORTED_LOCATION "${GGML_BASE_LIBRARY}"
INTERFACE_LINK_LIBRARIES "${GGML_BASE_INTERFACE_LINK_LIBRARIES}")
set(_ggml_all_targets "")
if (NOT GGML_BACKEND_DL)

View File

@@ -76,6 +76,7 @@ GGML_API size_t ggml_gallocr_get_buffer_size(ggml_gallocr_t galloc, int buffer_i
// Utils
// Create a buffer and allocate all the tensors in a ggml_context
// ggml_backend_alloc_ctx_tensors_from_buft_size returns the size of the buffer that would be allocated by ggml_backend_alloc_ctx_tensors_from_buft
// ggml_backend_alloc_ctx_tensors_from_buft returns NULL on failure or if all tensors in ctx are already allocated or zero-sized
GGML_API size_t ggml_backend_alloc_ctx_tensors_from_buft_size(struct ggml_context * ctx, ggml_backend_buffer_type_t buft);
GGML_API struct ggml_backend_buffer * ggml_backend_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft);
GGML_API struct ggml_backend_buffer * ggml_backend_alloc_ctx_tensors(struct ggml_context * ctx, ggml_backend_t backend);

View File

@@ -1189,8 +1189,8 @@ extern "C" {
struct ggml_context * ctx,
struct ggml_tensor * a);
// a - x
// b - dy
// a - dy
// b - x
GGML_API struct ggml_tensor * ggml_silu_back(
struct ggml_context * ctx,
struct ggml_tensor * a,

View File

@@ -76,10 +76,16 @@ extern "C" {
struct ggml_context ** ctx;
};
// callback to simulate or wrap a FILE pointer - read up to `len` bytes at `offset` into `output` and return the number of bytes read
typedef size_t (*gguf_reader_callback_t)(void * userdata, void * output, uint64_t offset, size_t len);
GGML_API struct gguf_context * gguf_init_empty(void);
GGML_API struct gguf_context * gguf_init_from_file_ptr(FILE * file, struct gguf_init_params params);
GGML_API struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_params params);
//GGML_API struct gguf_context * gguf_init_from_buffer(..);
GGML_API struct gguf_context * gguf_init_from_buffer(const void * data, size_t size, struct gguf_init_params params);
// max_chunk_read is the maximum number of bytes that the GGUF code will read at once from the callback, a value of 0 means no limit
GGML_API struct gguf_context * gguf_init_from_callback(gguf_reader_callback_t callback, void * userdata, size_t max_chunk_read, uint64_t max_expected_size, struct gguf_init_params params);
GGML_API void gguf_free(struct gguf_context * ctx);
@@ -87,7 +93,7 @@ extern "C" {
GGML_API uint32_t gguf_get_version (const struct gguf_context * ctx);
GGML_API size_t gguf_get_alignment (const struct gguf_context * ctx);
GGML_API size_t gguf_get_data_offset(const struct gguf_context * ctx);
GGML_API size_t gguf_get_data_offset(const struct gguf_context * ctx); // padded to gguf_get_alignment if and only if the gguf_context contains at least one tensor
GGML_API int64_t gguf_get_n_kv(const struct gguf_context * ctx);
GGML_API int64_t gguf_find_key(const struct gguf_context * ctx, const char * key); // returns -1 if key is not found

View File

@@ -222,6 +222,23 @@ if (GGML_SCHED_NO_REALLOC)
target_compile_definitions(ggml-base PUBLIC GGML_SCHED_NO_REALLOC)
endif()
if (GGML_OPENMP)
find_package(OpenMP)
if (OpenMP_FOUND)
set(GGML_OPENMP_ENABLED "ON" CACHE INTERNAL "")
else()
set(GGML_OPENMP_ENABLED "OFF" CACHE INTERNAL "")
message(WARNING "OpenMP not found")
endif()
else()
set(GGML_OPENMP_ENABLED "OFF" CACHE INTERNAL "")
endif()
if (GGML_OPENMP_ENABLED)
target_compile_definitions(ggml-base PRIVATE GGML_USE_OPENMP)
target_link_libraries(ggml-base PRIVATE OpenMP::OpenMP_C OpenMP::OpenMP_CXX)
endif()
add_library(ggml
ggml-backend-dl.cpp
ggml-backend-reg.cpp)

View File

@@ -150,7 +150,7 @@ static void ggml_dyn_tallocr_insert_block(struct tallocr_chunk * chunk, size_t o
static void ggml_dyn_tallocr_remove_block(struct tallocr_chunk * chunk, int idx) {
// shift all elements after idx by 1 to the left, overwriting the element at idx
for (int i = idx; i < chunk->n_free_blocks; i++) {
for (int i = idx; i < chunk->n_free_blocks - 1; i++) {
chunk->free_blocks[i] = chunk->free_blocks[i+1];
}
chunk->n_free_blocks--;

View File

@@ -13,6 +13,7 @@
#include <cstring>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <tuple>
#include <utility>
@@ -392,64 +393,100 @@ static ggml_backend_buffer_type_t ggml_backend_meta_device_get_host_buffer_type(
// meta backend buffer
//
// Container to hold the tensor slices per simple ggml backend buffer.
struct ggml_backend_meta_simple_tensor_container {
std::vector<ggml_context_ptr> ctxs;
std::map<const ggml_tensor *, std::vector<ggml_tensor *>> simple_tensors;
ggml_backend_meta_simple_tensor_container(const ggml_init_params & params, const int n_simple) {
ctxs.reserve(n_simple);
for (int i = 0; i < n_simple; i++) {
ctxs.emplace_back(ggml_init(params));
}
}
ggml_backend_meta_simple_tensor_container() {}
};
struct ggml_backend_meta_buffer_context {
// FIXME
// Most tensors can simply be stored statically in their own buffer.
// Externally created views however also need a mapping to simple tensors but they use the buffer of the view source.
// If external views are simply using that buffer they will slowly deplete its memory.
// Current solution: rotating set of 2 "compute" containers to hold external views, works correctly for llama.cpp.
// Long-term: tie the lifetime of external views to the meta backend executing the graph instead,
// currently not possible due to graph-external operations in the backend scheduler.
ggml_backend_meta_simple_tensor_container stc_static;
ggml_backend_meta_simple_tensor_container stc_compute[2];
int stc_compute_index = 0;
int stc_compute_index_next = 0;
std::vector<ggml_backend_buffer_ptr> bufs;
// FIXME
// The size of the split state cache is unbounded and can theoretically grow infinitely large.
// However, it is also expensive to build and clearing it on every rebuild in ggml_backend_meta_graph_compute is too expensive.
static constexpr size_t nbtc = GGML_TENSOR_SIZE - sizeof(ggml_tensor::padding);
std::map<std::pair<const ggml_tensor *, bool>, std::pair<ggml_backend_meta_split_state, char[nbtc]>> split_state_cache;
std::map< const ggml_tensor *, std::vector<ggml_tensor *>> simple_tensors;
struct buffer_config {
ggml_context * ctx;
ggml_backend_buffer_t buf;
buffer_config(ggml_context * ctx, ggml_backend_buffer_t buf) : ctx(ctx), buf(buf) {}
};
std::vector<buffer_config> buf_configs;
int debug;
ggml_backend_meta_buffer_context() {
ggml_backend_meta_buffer_context(
ggml_backend_meta_simple_tensor_container & stc_static,
ggml_backend_meta_simple_tensor_container & stc_compute_0,
ggml_backend_meta_simple_tensor_container & stc_compute_1,
const std::vector<ggml_backend_buffer_t> & bufs)
: stc_static(std::move(stc_static)), stc_compute{std::move(stc_compute_0), std::move(stc_compute_1)} {
this->bufs.reserve(bufs.size());
for (ggml_backend_buffer_t buf : bufs) {
this->bufs.emplace_back(buf);
}
const char * GGML_META_DEBUG = getenv("GGML_META_DEBUG");
debug = GGML_META_DEBUG ? atoi(GGML_META_DEBUG) : 0;
}
ggml_backend_meta_simple_tensor_container & get_simple_tensor_container(const ggml_tensor * tensor) {
if (stc_static.simple_tensors.find(tensor) != stc_static.simple_tensors.end()) {
return stc_static;
}
return stc_compute[stc_compute_index];
}
};
static void ggml_backend_meta_buffer_free_buffer(ggml_backend_buffer_t buffer) {
GGML_ASSERT(ggml_backend_buffer_is_meta(buffer));
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) buffer->context;
for (auto & [ctx, buf] : buf_ctx->buf_configs) {
ggml_backend_buffer_free(buf);
ggml_free(ctx);
}
delete buf_ctx;
}
static size_t ggml_backend_meta_buffer_n_bufs(ggml_backend_buffer_t meta_buf) {
GGML_ASSERT(ggml_backend_buffer_is_meta(meta_buf));
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) meta_buf->context;
return buf_ctx->buf_configs.size();
return buf_ctx->bufs.size();
}
static ggml_backend_buffer_t ggml_backend_meta_buffer_simple_buffer(ggml_backend_buffer_t meta_buf, size_t index) {
GGML_ASSERT(ggml_backend_buffer_is_meta(meta_buf));
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) meta_buf->context;
GGML_ASSERT(index < buf_ctx->buf_configs.size());
return buf_ctx->buf_configs[index].buf;
GGML_ASSERT(index < buf_ctx->bufs.size());
return buf_ctx->bufs[index].get();
}
static struct ggml_tensor * ggml_backend_meta_buffer_simple_tensor(const struct ggml_tensor * tensor, size_t index) {
GGML_ASSERT(ggml_backend_buffer_is_meta(tensor->buffer));
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) tensor->buffer->context;
GGML_ASSERT(index < buf_ctx->buf_configs.size());
GGML_ASSERT(index < buf_ctx->bufs.size());
auto it = buf_ctx->simple_tensors.find(tensor);
if (it == buf_ctx->simple_tensors.end()) {
ggml_backend_meta_simple_tensor_container & stc = buf_ctx->get_simple_tensor_container(tensor);
auto it = stc.simple_tensors.find(tensor);
if (it == stc.simple_tensors.end()) {
return nullptr;
}
return it->second[index];
}
static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(const struct ggml_tensor * tensor, bool assume_sync) {
static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(const struct ggml_tensor * tensor, bool assume_sync);
static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
ggml_backend_meta_simple_tensor_container & stc, const struct ggml_tensor * tensor, bool assume_sync) {
const size_t n_bufs = ggml_backend_meta_buffer_n_bufs(tensor->buffer);
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) tensor->buffer->context;
@@ -785,7 +822,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(co
src_ss[i] = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
continue;
}
src_ss[i] = ggml_backend_meta_get_split_state(tensor->src[i], /*assume_sync =*/ true);
src_ss[i] = ggml_backend_meta_get_split_state(stc, tensor->src[i], /*assume_sync =*/ true);
GGML_ASSERT(src_ss[i].axis != GGML_BACKEND_SPLIT_AXIS_UNKNOWN);
}
@@ -1079,17 +1116,23 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(co
return ret;
}
static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(const struct ggml_tensor * tensor, bool assume_sync) {
GGML_ASSERT(ggml_backend_buffer_is_meta(tensor->buffer));
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) tensor->buffer->context;
return ggml_backend_meta_get_split_state(buf_ctx->get_simple_tensor_container(tensor), tensor, assume_sync);
}
static void * ggml_backend_meta_buffer_get_base(ggml_backend_buffer_t buffer) {
GGML_UNUSED(buffer);
return (void *) 0x1000000000000000; // FIXME
}
static enum ggml_status ggml_backend_meta_buffer_init_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor) {
GGML_ASSERT(ggml_backend_buffer_is_meta(buffer));
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) buffer->context;
const size_t n_simple_bufs = ggml_backend_meta_buffer_n_bufs(buffer);
static enum ggml_status ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_meta_simple_tensor_container & stc, ggml_tensor * tensor) {
GGML_ASSERT(ggml_backend_buffer_is_meta(tensor->buffer));
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) tensor->buffer->context;
const size_t n_simple_bufs = ggml_backend_meta_buffer_n_bufs(tensor->buffer);
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor, /*assume_sync =*/ true);
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(stc, tensor, /*assume_sync =*/ true);
GGML_ASSERT(ggml_nelements(tensor) == 0 || split_state.axis != GGML_BACKEND_SPLIT_AXIS_UNKNOWN);
GGML_ASSERT(split_state.n_segments <= 16);
@@ -1104,8 +1147,8 @@ static enum ggml_status ggml_backend_meta_buffer_init_tensor(ggml_backend_buffer
std::vector<ggml_tensor *> simple_tensors;
simple_tensors.reserve(n_simple_bufs);
for (size_t j = 0; j < n_simple_bufs; j++) {
ggml_context * simple_ctx = buf_ctx->buf_configs[j].ctx;
ggml_backend_buffer_t simple_buf = buf_ctx->buf_configs[j].buf;
ggml_context * simple_ctx = stc.ctxs[j].get();
ggml_backend_buffer_t simple_buf = buf_ctx->bufs[j].get();
if (split_dim >= 0 && split_dim < GGML_MAX_DIMS) {
// TODO: the following assert fails for llama-parallel even though the results are correct:
@@ -1158,7 +1201,7 @@ static enum ggml_status ggml_backend_meta_buffer_init_tensor(ggml_backend_buffer
t_ij->data = (char *) t_ij->view_src->data + t_ij->view_offs;
} else if (simple_buf != nullptr) {
t_ij->data = (char *) ggml_backend_buffer_get_base(simple_buf)
+ size_t(tensor->data) - size_t(ggml_backend_buffer_get_base(buffer));
+ size_t(tensor->data) - size_t(ggml_backend_buffer_get_base(tensor->buffer));
}
t_ij->extra = tensor->extra;
for (int i = 0; i < GGML_MAX_SRC; i++) {
@@ -1194,11 +1237,18 @@ static enum ggml_status ggml_backend_meta_buffer_init_tensor(ggml_backend_buffer
}
}
buf_ctx->simple_tensors[tensor] = simple_tensors;
stc.simple_tensors[tensor] = simple_tensors;
return GGML_STATUS_SUCCESS;
}
static enum ggml_status ggml_backend_meta_buffer_init_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor) {
GGML_ASSERT(ggml_backend_buffer_is_meta(buffer));
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) buffer->context;
buf_ctx->stc_compute_index = buf_ctx->stc_compute_index_next;
return ggml_backend_meta_buffer_init_tensor_impl(buf_ctx->get_simple_tensor_container(tensor), tensor);
}
static void ggml_backend_meta_buffer_set_tensor(ggml_backend_buffer_t buffer, ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
const size_t n_bufs = ggml_backend_meta_buffer_n_bufs(buffer);
GGML_ASSERT(ggml_is_contiguous(tensor));
@@ -1275,6 +1325,9 @@ static void ggml_backend_meta_buffer_set_tensor(ggml_backend_buffer_t buffer, gg
for (size_t j = 0; j < n_bufs; j++) {
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t chunk_size_j = simple_tensor->nb[split_state.axis + 1];
if (chunk_size_j == 0) {
continue;
}
const size_t simple_offset = i_start * chunk_size_j;
ggml_backend_tensor_set_2d(simple_tensor, (const char *) data + offset_j, simple_offset, chunk_size_j, i_stop - i_start, chunk_size_j, chunk_size_full);
offset_j += chunk_size_j;
@@ -1382,6 +1435,9 @@ static void ggml_backend_meta_buffer_get_tensor(ggml_backend_buffer_t buffer, co
for (size_t j = 0; j < n_bufs; j++){
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t chunk_size_j = simple_tensor->nb[split_state.axis + 1];
if (chunk_size_j == 0) {
continue;
}
const size_t simple_offset = i_start * chunk_size_j;
ggml_backend_tensor_get_2d(simple_tensor, (char *) data + offset_j, simple_offset, chunk_size_j, i_stop - i_start, chunk_size_j, chunk_size_full);
offset_j += chunk_size_j;
@@ -1407,8 +1463,9 @@ static void ggml_backend_meta_buffer_clear(ggml_backend_buffer_t buffer, uint8_t
}
static void ggml_backend_meta_buffer_reset(ggml_backend_buffer_t buffer) {
const size_t n_buffers = ggml_backend_meta_buffer_n_bufs(buffer);
for (size_t i = 0; i < n_buffers; i++) {
GGML_ASSERT(ggml_backend_buffer_is_meta(buffer));
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) buffer->context;
for (size_t i = 0; i < buf_ctx->bufs.size(); i++) {
ggml_backend_buffer_reset(ggml_backend_meta_buffer_simple_buffer(buffer, i));
}
}
@@ -1434,20 +1491,24 @@ bool ggml_backend_buffer_is_meta(ggml_backend_buffer_t buf) {
static ggml_backend_buffer_t ggml_backend_meta_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
const size_t n_simple_bufts = ggml_backend_meta_buft_n_bufts(buft);
ggml_init_params params = {
/*.mem_size =*/ 1024*1024*1024, // FIXME
const ggml_init_params params = {
/*.mem_size =*/ 1024*1024*ggml_tensor_overhead(), // FIXME
/*.mem_buffer =*/ nullptr,
/*.no_alloc =*/ true,
};
ggml_backend_meta_simple_tensor_container stc_static;
ggml_backend_meta_simple_tensor_container stc_compute_0(params, n_simple_bufts);
ggml_backend_meta_simple_tensor_container stc_compute_1(params, n_simple_bufts);
ggml_backend_meta_buffer_context * buf_ctx = new ggml_backend_meta_buffer_context();
size_t max_size = 0;
buf_ctx->buf_configs.reserve(n_simple_bufts);
std::vector<ggml_backend_buffer_t> bufs;
bufs.reserve(n_simple_bufts);
for (size_t i = 0; i < n_simple_bufts; i++) {
ggml_backend_buffer_t simple_buf = ggml_backend_buft_alloc_buffer(ggml_backend_meta_buft_simple_buft(buft, i), size);
max_size = std::max(max_size, ggml_backend_buffer_get_size(simple_buf));
buf_ctx->buf_configs.emplace_back(ggml_init(params), simple_buf);
bufs.push_back(ggml_backend_buft_alloc_buffer(ggml_backend_meta_buft_simple_buft(buft, i), size));
GGML_ASSERT(bufs.back() != nullptr);
max_size = std::max(max_size, ggml_backend_buffer_get_size(bufs.back()));
}
ggml_backend_meta_buffer_context * buf_ctx = new ggml_backend_meta_buffer_context(stc_static, stc_compute_0, stc_compute_1, bufs);
return ggml_backend_buffer_init(buft, ggml_backend_meta_buffer_iface, buf_ctx, max_size);
}
@@ -1455,28 +1516,53 @@ static ggml_backend_buffer_t ggml_backend_meta_buffer_type_alloc_buffer(ggml_bac
struct ggml_backend_buffer * ggml_backend_meta_alloc_ctx_tensors_from_buft(struct ggml_context * ctx, ggml_backend_buffer_type_t buft) {
const size_t n_simple_bufts = ggml_backend_meta_buft_n_bufts(buft);
ggml_init_params params = {
/*.mem_size =*/ 1024*1024*1024, // FIXME
constexpr size_t compute_headroom = 16; // Maximum number of views per statically allocated tensor that can be created between evals.
const ggml_init_params params_static = {
/*.mem_size =*/ ggml_get_mem_size(ctx),
/*.mem_buffer =*/ nullptr,
/*.no_alloc =*/ true,
};
const ggml_init_params params_compute = {
/*.mem_size =*/ compute_headroom*ggml_get_mem_size(ctx),
/*.mem_buffer =*/ nullptr,
/*.no_alloc =*/ true,
};
ggml_backend_meta_simple_tensor_container stc_static (params_static, n_simple_bufts);
ggml_backend_meta_simple_tensor_container stc_compute_0(params_compute, n_simple_bufts);
ggml_backend_meta_simple_tensor_container stc_compute_1(params_compute, n_simple_bufts);
ggml_backend_meta_buffer_context * meta_buf_ctx = new ggml_backend_meta_buffer_context();
meta_buf_ctx->buf_configs.reserve(n_simple_bufts);
for (size_t i = 0; i < n_simple_bufts; i++) {
meta_buf_ctx->buf_configs.emplace_back(ggml_init(params), nullptr);
}
std::vector<ggml_backend_buffer_t> bufs(n_simple_bufts, nullptr);
ggml_backend_meta_buffer_context * meta_buf_ctx = new ggml_backend_meta_buffer_context(stc_static, stc_compute_0, stc_compute_1, bufs);
ggml_backend_buffer_t meta_buf = ggml_backend_buffer_init(buft, ggml_backend_meta_buffer_iface, meta_buf_ctx, 0);
for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != nullptr; t = ggml_get_next_tensor(ctx, t)) {
t->buffer = meta_buf;
ggml_backend_meta_buffer_init_tensor(meta_buf, t);
ggml_backend_meta_buffer_init_tensor_impl(meta_buf_ctx->stc_static, t);
t->data = (void *) 0x2000000000000000; // FIXME
}
for (size_t i = 0; i < n_simple_bufts; i++) {
meta_buf_ctx->buf_configs[i].buf = ggml_backend_alloc_ctx_tensors_from_buft(
meta_buf_ctx->buf_configs[i].ctx, ggml_backend_meta_buft_simple_buft(buft, i));
meta_buf->size = std::max(meta_buf->size, ggml_backend_buffer_get_size(meta_buf_ctx->buf_configs[i].buf));
ggml_context * ctx = meta_buf_ctx->stc_static.ctxs[i].get();
ggml_backend_buffer_type_t simple_buft = ggml_backend_meta_buft_simple_buft(buft, i);
// If a ggml_context only has zero-sized tensors, ggml_backend_alloc_ctx_tensors_from_buft returns NULL.
// For those edge cases, allocate a dummy buffer instead.
bool any_nonzero_slice = false;
for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != nullptr; t = ggml_get_next_tensor(ctx, t)) {
if (ggml_nelements(t) != 0) {
any_nonzero_slice = true;
break;
}
}
if (any_nonzero_slice) {
meta_buf_ctx->bufs[i].reset(ggml_backend_alloc_ctx_tensors_from_buft(ctx, simple_buft));
} else {
meta_buf_ctx->bufs[i].reset(ggml_backend_buft_alloc_buffer(simple_buft, 0));
for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != nullptr; t = ggml_get_next_tensor(ctx, t)) {
t->buffer = meta_buf_ctx->bufs[i].get();
}
}
GGML_ASSERT(meta_buf_ctx->bufs[i]);
meta_buf->size = std::max(meta_buf->size, ggml_backend_buffer_get_size(meta_buf_ctx->bufs[i].get()));
}
return meta_buf;
}
@@ -1605,6 +1691,9 @@ static void ggml_backend_meta_set_tensor_async(ggml_backend_t backend, ggml_tens
ggml_backend_t simple_backend = ggml_backend_meta_simple_backend(backend, j);
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t chunk_size_j = simple_tensor->nb[split_state.axis + 1];
if (chunk_size_j == 0) {
continue;
}
ggml_backend_tensor_set_2d_async(simple_backend, simple_tensor, (const char *) data + offset_j, offset, chunk_size_j,
i_stop - i_start, chunk_size_j, chunk_size_full);
offset_j += chunk_size_j;
@@ -1646,6 +1735,9 @@ static void ggml_backend_meta_get_tensor_async(ggml_backend_t backend, const ggm
ggml_backend_t simple_backend = ggml_backend_meta_simple_backend(backend, j);
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t chunk_size_j = simple_tensor->nb[split_state.axis + 1];
if (chunk_size_j == 0) {
continue;
}
ggml_backend_tensor_get_2d_async(simple_backend, simple_tensor, (char *) data + offset_j, offset, chunk_size_j,
i_stop - i_start, chunk_size_j, chunk_size_full);
offset_j += chunk_size_j;
@@ -1692,6 +1784,26 @@ static enum ggml_status ggml_backend_meta_graph_compute(ggml_backend_t backend,
}
if (needs_rebuild) {
std::set<ggml_backend_buffer_t> used_buffers;
for (int i = 0; i < cgraph->n_leafs; i++) {
if (ggml_backend_buffer_is_meta(cgraph->leafs[i]->buffer)) {
used_buffers.emplace(cgraph->leafs[i]->buffer);
}
}
for (int i = 0; i < cgraph->n_nodes; i++) {
if (ggml_backend_buffer_is_meta(cgraph->nodes[i]->buffer)) {
used_buffers.emplace(cgraph->nodes[i]->buffer);
}
}
for (ggml_backend_buffer_t buf : used_buffers) {
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) buf->context;
buf_ctx->stc_compute_index_next = buf_ctx->stc_compute_index ^ 1;
ggml_backend_meta_simple_tensor_container & stc = buf_ctx->stc_compute[buf_ctx->stc_compute_index_next];
for (ggml_context_ptr & ctx : stc.ctxs) {
ggml_reset(ctx.get());
}
stc.simple_tensors.clear();
}
size_t n_subgraphs = 0;
size_t max_tmp_size = 0;
@@ -1877,7 +1989,7 @@ static enum ggml_status ggml_backend_meta_graph_compute(ggml_backend_t backend,
const size_t mem_per_device_graphs_main = backend_ctx->max_subgraphs*ggml_graph_overhead_custom(backend_ctx->max_nnodes, cgraph->grads);
const size_t mem_per_device_graphs_aux = n_cgraphs_per_device*backend_ctx->max_subgraphs*ggml_graph_overhead_custom(1, cgraph->grads);
const size_t mem_per_device_nodes_aux = n_nodes_per_device*backend_ctx->max_subgraphs*ggml_tensor_overhead();
ggml_init_params params = {
const ggml_init_params params = {
/*.mem_size =*/ n_backends * (mem_per_device_graphs_main + mem_per_device_graphs_aux + mem_per_device_nodes_aux),
/*.mem_buffer =*/ nullptr,
/*.no_alloc =*/ true,

View File

@@ -306,7 +306,7 @@ void ggml_backend_tensor_get_2d_async(ggml_backend_t backend, const struct ggml_
GGML_ASSERT(tensor);
GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
if (n_copies <= 1 || backend->iface.set_tensor_2d_async == NULL) {
if (n_copies <= 1 || backend->iface.get_tensor_2d_async == NULL) {
for (size_t i = 0; i < n_copies; i++) {
ggml_backend_tensor_get_async(backend, tensor, (char *) data + i*stride_data, offset + i*stride_tensor, size);
}
@@ -317,7 +317,7 @@ void ggml_backend_tensor_get_2d_async(ggml_backend_t backend, const struct ggml_
}
GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
GGML_ASSERT(offset + (n_copies-1)*stride_tensor + size <= ggml_nbytes(tensor) && "tensor write out of bounds");
GGML_ASSERT(offset + (n_copies-1)*stride_tensor + size <= ggml_nbytes(tensor) && "tensor read out of bounds");
backend->iface.get_tensor_2d_async(backend, tensor, data, offset, size, n_copies, stride_tensor, stride_data);
}

View File

@@ -72,17 +72,9 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
endif()
endif()
if (GGML_OPENMP)
find_package(OpenMP)
if (OpenMP_FOUND)
set(GGML_OPENMP_ENABLED "ON" CACHE INTERNAL "")
target_compile_definitions(${GGML_CPU_NAME} PRIVATE GGML_USE_OPENMP)
target_link_libraries(${GGML_CPU_NAME} PRIVATE OpenMP::OpenMP_C OpenMP::OpenMP_CXX)
else()
set(GGML_OPENMP_ENABLED "OFF" CACHE INTERNAL "")
message(WARNING "OpenMP not found")
endif()
if (GGML_OPENMP_ENABLED)
target_compile_definitions(${GGML_CPU_NAME} PRIVATE GGML_USE_OPENMP)
target_link_libraries(${GGML_CPU_NAME} PRIVATE OpenMP::OpenMP_C OpenMP::OpenMP_CXX)
endif()
if (GGML_LLAMAFILE)

108
ggml/src/ggml-cuda/fwht.cu Normal file
View File

@@ -0,0 +1,108 @@
#include "common.cuh"
#include "fwht.cuh"
template <int N>
__launch_bounds__(4*ggml_cuda_get_physical_warp_size(), 1)
__global__ void fwht_cuda(const float * src, float * dst, const int64_t n_rows, const float scale) {
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
const int64_t r = (int64_t) blockIdx.x * blockDim.y + threadIdx.y;
if (r >= n_rows) {
return;
}
src += r * N;
dst += r * N;
static constexpr int el_w = N / warp_size;
float reg[el_w];
const int lane = threadIdx.x;
#pragma unroll
for (int i = 0; i < el_w; ++i) {
reg[i] = src[i * warp_size + lane] * scale;
}
#pragma unroll
for (int h = 1; h < warp_size; h *= 2) {
#pragma unroll
for (int j = 0; j < el_w; j++) {
const float val = reg[j];
const float val2 = __shfl_xor_sync(0xFFFFFFFF, val, h, warp_size);
reg[j] = (lane & h) == 0 ? val + val2 : val2 - val;
}
}
#pragma unroll
for (int h = warp_size; h < N; h *= 2) {
const int step = h / warp_size;
#pragma unroll
for (int j = 0; j < el_w; j += 2 * step) {
#pragma unroll
for (int k = 0; k < step; k++) {
const float x = reg[j + k];
const float y = reg[j + k + step];
reg[j + k] = x + y;
reg[j + k + step] = x - y;
}
}
}
#pragma unroll
for (int i = 0; i < el_w; ++i) {
dst[i * warp_size + lane] = reg[i];
}
}
void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst) {
GGML_ASSERT(ggml_are_same_shape(src, dst));
GGML_ASSERT(ggml_is_contiguous(src));
GGML_ASSERT(ggml_is_contiguous(dst));
const int n = src->ne[0];
const int64_t rows = ggml_nrows(src);
const float * src_d = (const float *) src->data;
float * dst_d = (float *) dst->data;
const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
GGML_ASSERT(n % warp_size == 0);
const int rows_per_block = 4;
const int64_t num_blocks = (rows + rows_per_block - 1) / rows_per_block;
cudaStream_t stream = ctx.stream();
dim3 grid_dims(num_blocks, 1, 1);
dim3 block_dims(warp_size, rows_per_block, 1);
const ggml_cuda_kernel_launch_params launch_params =
ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
const float scale = 1 / sqrtf(n);
switch (n) {
case 64:
{
ggml_cuda_kernel_launch(fwht_cuda<64>, launch_params, src_d, dst_d, rows, scale);
break;
}
case 128:
{
ggml_cuda_kernel_launch(fwht_cuda<128>, launch_params, src_d, dst_d, rows, scale);
break;
}
case 256:
{
ggml_cuda_kernel_launch(fwht_cuda<256>, launch_params, src_d, dst_d, rows, scale);
break;
}
case 512:
{
ggml_cuda_kernel_launch(fwht_cuda<512>, launch_params, src_d, dst_d, rows, scale);
break;
}
default:
GGML_ABORT("fatal error");
}
}

View File

@@ -0,0 +1,3 @@
#include "common.cuh"
void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst);

View File

@@ -24,6 +24,7 @@
#include "ggml-cuda/diagmask.cuh"
#include "ggml-cuda/diag.cuh"
#include "ggml-cuda/fattn.cuh"
#include "ggml-cuda/fwht.cuh"
#include "ggml-cuda/getrows.cuh"
#include "ggml-cuda/im2col.cuh"
#include "ggml-cuda/mmf.cuh"
@@ -2594,6 +2595,13 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
bool use_batched_cublas_bf16 = src0->type == GGML_TYPE_BF16 && bf16_mma_hardware_available(cc);
bool use_batched_cublas_f32 = src0->type == GGML_TYPE_F32;
const int32_t hint = ggml_get_op_params_i32(dst, 1);
if (hint == GGML_HINT_SRC0_IS_HADAMARD) {
GGML_ASSERT(!split);
ggml_cuda_op_fwht(ctx, src1, dst);
return;
}
if (!split && use_mul_mat_vec_f) {
// the custom F16 vector kernel can be used over batched cuBLAS GEMM
// but this is only faster for GPUs without tensor cores or with a thin src0 matrix (particularly KQV in attention)

View File

@@ -852,9 +852,10 @@ static void fa_softmax_thread(unsigned int n, unsigned int i, void * data) {
v_s_rowmax1 = hvx_vec_reduce_max_f16(v_s_rowmax1);
// Splat m_prev[r], m_prev[r+1] from the per-row accumulator.
// vror brings the target lane to lane 0, then extract + re-splat.
HVX_Vector v_m_prev0 = hvx_vec_splat_f16(hvx_vec_get_f16(Q6_V_vror_VR(m_prev_v, r_vec_off * 2)));
HVX_Vector v_m_prev1 = hvx_vec_splat_f16(hvx_vec_get_f16(Q6_V_vror_VR(m_prev_v, (r_vec_off + 1) * 2)));
// vror brings the target lane to lane 0, then vdelta replicates it
// across all lanes — stays in the vector domain (no store/reload).
HVX_Vector v_m_prev0 = hvx_vec_repl_f16(Q6_V_vror_VR(m_prev_v, r_vec_off * 2));
HVX_Vector v_m_prev1 = hvx_vec_repl_f16(Q6_V_vror_VR(m_prev_v, (r_vec_off + 1) * 2));
// HVX max — both operands are splats, so result is splat of m_new.
HVX_Vector v_dup_m0 = Q6_Vhf_vmax_VhfVhf(v_m_prev0, v_s_rowmax0);

View File

@@ -661,11 +661,10 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_mul_mm_iq4_nl_f32_l4_lm;
std::vector<ProfilingInfo> profiling_info;
std::vector<ProfilingInfo> profiling_results;
void write_profiling_info() {
FILE * fperf = fopen("cl_profiling.csv", "w");
if (!fperf) {
GGML_LOG_ERROR("Failed to open cl_profiling.csv\n");
void flush_profiling_batch() {
if (profiling_info.empty()) {
return;
}
@@ -689,6 +688,7 @@ struct ggml_backend_opencl_context {
CL_CHECK(clGetEventProfilingInfo(
info.evt, CL_PROFILING_COMMAND_COMPLETE, sizeof(cl_ulong), &cmd_complete, NULL));
CL_CHECK(clReleaseEvent(info.evt));
info.evt = nullptr;
char kernel_name[512];
CL_CHECK(clGetKernelInfo(info.kernel, CL_KERNEL_FUNCTION_NAME,
@@ -706,10 +706,26 @@ struct ggml_backend_opencl_context {
info.cmd_complete_duration_ns = cmd_complete - cmd_end;
info.cmd_total_duration_ns = cmd_complete - cmd_queued;
}
profiling_results.insert(profiling_results.end(),
std::make_move_iterator(profiling_info.begin()),
std::make_move_iterator(profiling_info.end()));
profiling_info.clear();
}
void write_profiling_info() {
if (profiling_results.empty()) {
return;
}
// Dump a csv
FILE * fperf = fopen("cl_profiling.csv", "w");
if (!fperf) {
GGML_LOG_ERROR("Failed to open cl_profiling.csv\n");
return;
}
fprintf(fperf, "op name, kernel name, exec duration (ms), global size, local size, output size\n");
for (const ProfilingInfo & info : profiling_info) {
for (const ProfilingInfo & info : profiling_results) {
fprintf(fperf, "%s,%s,%f,%zux%zux%zu,%zux%zux%zu,%zux%zux%zux%zu\n",
info.op_name.c_str(), info.kernel_name.c_str(),
info.cmd_duration_ns/1.e6f,
@@ -720,14 +736,14 @@ struct ggml_backend_opencl_context {
fclose(fperf);
// Dump a simple chrome trace
FILE* ftrace = fopen("cl_trace.json", "w");
FILE * ftrace = fopen("cl_trace.json", "w");
if (!ftrace) {
GGML_LOG_ERROR("Failed to open cl_trace.json\n");
return;
}
fprintf(ftrace, "[\n");
for (const ProfilingInfo & info : profiling_info) {
for (const ProfilingInfo & info : profiling_results) {
fprintf(ftrace, "{\"name\": \"%s\", \"cat\": \"OpenCL\", \"ph\": \"B\", \"ts\": %" PRIu64 ", \"pid\": \"\", \"tid\": \"Host\"},\n",
info.kernel_name.c_str(), info.cmd_queued/1000);
fprintf(ftrace, "{\"name\": \"%s\", \"cat\": \"OpenCL\", \"ph\": \"E\", \"ts\": %" PRIu64 ", \"pid\": \"\", \"tid\": \"Host\"},\n",
@@ -738,6 +754,7 @@ struct ggml_backend_opencl_context {
fprintf(ftrace, "{\"name\": \"%s\", \"cat\": \"OpenCL\", \"ph\": \"E\", \"ts\": %" PRIu64 ", \"pid\": \"\", \"tid\": \"Device\"},\n",
info.kernel_name.c_str(), info.cmd_end/1000);
}
fprintf(ftrace, "]\n");
fclose(ftrace);
}
@@ -758,6 +775,9 @@ struct ggml_backend_opencl_context {
profiling_info.emplace_back();
populateProfilingInfo(profiling_info.back(), evt, kernel, work_dim, global_work_size, local_work_size, tensor);
if (profiling_info.size() >= 2048) {
flush_profiling_batch();
}
#else
GGML_UNUSED(tensor);
CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, work_dim, NULL, global_work_size, local_work_size, 0, NULL, NULL));
@@ -804,7 +824,7 @@ struct ggml_backend_opencl_context {
if (ref_count == 0) {
#ifdef GGML_OPENCL_PROFILING
write_profiling_info();
profiling_info.clear();
profiling_results.clear();
#endif
}
}
@@ -4693,7 +4713,7 @@ inline bool use_adreno_kernels(const ggml_backend_opencl_context *backend_ctx, c
inline bool use_adreno_moe_kernels(const ggml_backend_opencl_context *backend_ctx, const ggml_tensor *tensor) {
GGML_UNUSED(backend_ctx);
int ne01 = tensor->ne[1];
return (((strstr(tensor->name, "ffn") != NULL) && (strstr(tensor->name, "exps") != NULL)) || (strstr(tensor->name, "as") != NULL)) && (ne01 % 64 == 0);
return (((strstr(tensor->name, "ffn") != NULL) && (strstr(tensor->name, "exps") != NULL)) || (strstr(tensor->name, "as") != NULL)) && (ne01 % 32 == 0);
}
inline bool enable_adreno_trans_weight(const ggml_backend_opencl_context *backend_ctx, const ggml_tensor *tensor) {
@@ -14297,7 +14317,7 @@ static void ggml_cl_mul_mat_id(ggml_backend_t backend, const ggml_tensor * src0,
CL_CHECK(status);
// set thread grid
global_size[0] = static_cast<size_t>(ne01);
global_size[0] = static_cast<size_t>(((ne01 + 63) / 64) * 64);
global_size[1] = 4;
global_size[2] = static_cast<size_t>(ne20);
local_size[1] = 4;
@@ -14513,7 +14533,7 @@ static void ggml_cl_mul_mat_id(ggml_backend_t backend, const ggml_tensor * src0,
CL_CHECK(status);
// set thread grid
global_size[0] = static_cast<size_t>(ne01);
global_size[0] = static_cast<size_t>(((ne01 + 63) / 64) * 64);
global_size[1] = 4;
global_size[2] = static_cast<size_t>(ne20);
local_size[1] = 4;
@@ -14689,7 +14709,7 @@ static void ggml_cl_mul_mat_id(ggml_backend_t backend, const ggml_tensor * src0,
CL_CHECK(status);
// set thread grid
global_size[0] = static_cast<size_t>(ne01);
global_size[0] = static_cast<size_t>(((ne01 + 63) / 64) * 64);
global_size[1] = 4;
global_size[2] = static_cast<size_t>(ne20);
local_size[1] = 4;
@@ -14865,7 +14885,7 @@ static void ggml_cl_mul_mat_id(ggml_backend_t backend, const ggml_tensor * src0,
CL_CHECK(status);
// set thread grid
global_size[0] = static_cast<size_t>(ne01);
global_size[0] = static_cast<size_t>(((ne01 + 63) / 64) * 64);
global_size[1] = 4;
global_size[2] = static_cast<size_t>(ne20);
local_size[1] = 4;
@@ -15118,7 +15138,7 @@ static void ggml_cl_mul_mat_id(ggml_backend_t backend, const ggml_tensor * src0,
CL_CHECK(status);
// set thread grid
global_size[0] = static_cast<size_t>(ne01);
global_size[0] = static_cast<size_t>(((ne01 + 63) / 64) * 64);
global_size[1] = 4;
global_size[2] = static_cast<size_t>(ne20);
local_size[1] = 4;
@@ -15291,7 +15311,7 @@ static void ggml_cl_mul_mat_id(ggml_backend_t backend, const ggml_tensor * src0,
CL_CHECK(status);
// set thread grid
global_size[0] = static_cast<size_t>(ne01);
global_size[0] = static_cast<size_t>(((ne01 + 63) / 64) * 64);
global_size[1] = 4;
global_size[2] = static_cast<size_t>(ne20);
local_size[1] = 4;
@@ -15469,7 +15489,7 @@ static void ggml_cl_mul_mat_id(ggml_backend_t backend, const ggml_tensor * src0,
CL_CHECK(status);
// set thread grid
global_size[0] = static_cast<size_t>(ne01);
global_size[0] = static_cast<size_t>(((ne01 + 63) / 64) * 64);
global_size[1] = 4;
global_size[2] = static_cast<size_t>(ne20);
local_size[1] = 4;
@@ -15644,7 +15664,7 @@ static void ggml_cl_mul_mat_id(ggml_backend_t backend, const ggml_tensor * src0,
CL_CHECK(status);
// set thread grid
global_size[0] = static_cast<size_t>(ne01);
global_size[0] = static_cast<size_t>(((ne01 + 63) / 64) * 64);
global_size[1] = 4;
global_size[2] = static_cast<size_t>(ne20);
local_size[1] = 4;

View File

@@ -220,6 +220,10 @@ kernel void kernel_convert_block_q4_0_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK4_0;
uint src_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint dst_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -263,6 +267,10 @@ kernel void kernel_restore_block_q4_0_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK4_0;
uint dst_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint src_d_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -401,6 +409,10 @@ kernel void kernel_convert_block_q4_1_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK4_1;
uint src_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint dst_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -446,6 +458,10 @@ kernel void kernel_restore_block_q4_1_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK4_1;
uint dst_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint src_dm_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -491,6 +507,10 @@ kernel void kernel_convert_block_q5_0_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK5_0;
uint src_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint dst_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -536,6 +556,10 @@ kernel void kernel_restore_block_q5_0_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK5_0;
uint dst_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint src_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -583,6 +607,10 @@ kernel void kernel_convert_block_q5_1_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK5_1;
uint src_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint dst_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -630,6 +658,10 @@ kernel void kernel_restore_block_q5_1_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK5_1;
uint dst_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint src_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -679,6 +711,10 @@ kernel void kernel_convert_block_q4_k_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK_K;
uint src_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint dst_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -732,6 +768,10 @@ kernel void kernel_restore_block_q4_k_trans4_ns(
uint i01 = get_global_id(0); // row index
uint i02 = get_global_id(2); // batch index
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK_K;
uint src_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -784,6 +824,10 @@ kernel void kernel_convert_block_q5_k_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK_K;
uint src_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint dst_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -850,6 +894,10 @@ kernel void kernel_restore_block_q5_k_trans4_ns(
uint i01 = get_global_id(0); // row index
uint i02 = get_global_id(2); // batch index
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK_K;
uint src_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -916,6 +964,10 @@ kernel void kernel_convert_block_q6_k_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK_K;
uint src_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
@@ -993,6 +1045,10 @@ kernel void kernel_restore_block_q6_k_trans4_ns(
uint i01 = get_global_id(0); // row index
uint i02 = get_global_id(2); // batch index
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK_K;
uint src_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -1147,6 +1203,10 @@ kernel void kernel_convert_block_mxfp4_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK_MXFP4;
uint src_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint dst_blk_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;
@@ -1190,6 +1250,10 @@ kernel void kernel_restore_block_mxfp4_trans4_ns(
uint i01 = get_global_id(0);
uint i02 = get_global_id(2);
if (i01 >= ne01) {
return;
}
uint ne00_blk = ne00 / QK_MXFP4;
uint dst_blk_offset = i00 + i01 * ne00_blk + i02 * ne00_blk * ne01;
uint src_d_offset = i01 + i00 * ne01 + i02 * ne00_blk * ne01;

View File

@@ -163,7 +163,7 @@ kernel void kernel_gemm_moe_mxfp4_f32_ns(
uint block_id_n = get_global_id(2); // n_tile
// Boundary check
if (((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) || (block_id_n >= total_tiles[0])) {
if (block_id_n >= total_tiles[0]) {
return;
}
@@ -248,6 +248,10 @@ kernel void kernel_gemm_moe_mxfp4_f32_ns(
dotx16_reduce8(reg_a, shared_b, reg_c.hi, 16);
}
if ((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) {
return;
}
// Load poster router and share in LM
__local uint out_idx[TILESIZE_N];

View File

@@ -115,7 +115,7 @@ kernel void kernel_gemm_moe_q4_0_f32_ns(
uint block_id_n = get_global_id(2); // n_tile
// Boundary check
if (((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) || (block_id_n >= total_tiles[0])) {
if (block_id_n >= total_tiles[0]) {
return;
}
@@ -198,6 +198,10 @@ kernel void kernel_gemm_moe_q4_0_f32_ns(
dotx16_reduce8(reg_a, shared_b, reg_c.hi, 16);
}
if ((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) {
return;
}
// Load poster router and share in LM
__local uint out_idx[TILESIZE_N];

View File

@@ -116,7 +116,7 @@ kernel void kernel_gemm_moe_q4_1_f32_ns(
uint block_id_n = get_global_id(2); // n_tile
// Boundary check
if (((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) || (block_id_n >= total_tiles[0])) {
if (block_id_n >= total_tiles[0]) {
return;
}
@@ -200,6 +200,10 @@ kernel void kernel_gemm_moe_q4_1_f32_ns(
dotx16_reduce8(reg_a, shared_b, reg_c.hi, 16);
}
if ((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) {
return;
}
// Load poster router and share in LM
__local uint out_idx[TILESIZE_N];

View File

@@ -133,7 +133,7 @@ kernel void kernel_gemm_moe_q4_k_f32_ns(
uint block_id_n = get_global_id(2); // n_tile
// Boundary check
if (((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) || (block_id_n >= total_tiles[0])) {
if (block_id_n >= total_tiles[0]) {
return;
}
@@ -225,6 +225,10 @@ kernel void kernel_gemm_moe_q4_k_f32_ns(
dotx16_reduce8(reg_a, shared_b, reg_c.hi, 16);
}
if ((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) {
return;
}
// Load post router and share in LM
__local uint out_idx[TILESIZE_N];

View File

@@ -116,7 +116,7 @@ kernel void kernel_gemm_moe_q5_0_f32_ns(
uint block_id_n = get_global_id(2); // n_tile
// Boundary check
if (((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) || (block_id_n >= total_tiles[0])) {
if (block_id_n >= total_tiles[0]) {
return;
}
@@ -202,6 +202,10 @@ kernel void kernel_gemm_moe_q5_0_f32_ns(
dotx16_reduce8(reg_a, shared_b, reg_c.hi, 16);
}
if ((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) {
return;
}
// Load poster router and share in LM
__local uint out_idx[TILESIZE_N];

View File

@@ -117,7 +117,7 @@ kernel void kernel_gemm_moe_q5_1_f32_ns(
uint block_id_n = get_global_id(2); // n_tile
// Boundary check
if (((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) || (block_id_n >= total_tiles[0])) {
if (block_id_n >= total_tiles[0]) {
return;
}
@@ -204,6 +204,10 @@ kernel void kernel_gemm_moe_q5_1_f32_ns(
dotx16_reduce8(reg_a, shared_b, reg_c.hi, 16);
}
if ((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) {
return;
}
// Load poster router and share in LM
__local uint out_idx[TILESIZE_N];

View File

@@ -134,7 +134,7 @@ kernel void kernel_gemm_moe_q5_k_f32_ns(
uint block_id_n = get_global_id(2); // n_tile
// Boundary check
if (((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) || (block_id_n >= total_tiles[0])) {
if (block_id_n >= total_tiles[0]) {
return;
}
@@ -230,6 +230,10 @@ kernel void kernel_gemm_moe_q5_k_f32_ns(
dotx16_reduce8(reg_a, shared_b, reg_c.hi, 16);
}
if ((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) {
return;
}
// Load post router and share in LM
__local uint out_idx[TILESIZE_N];

View File

@@ -117,7 +117,7 @@ kernel void kernel_gemm_moe_q6_k_f32_ns(
uint block_id_n = get_global_id(2); // n_tile
// Boundary check
if (((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) || (block_id_n >= total_tiles[0])) {
if (block_id_n >= total_tiles[0]) {
return;
}
@@ -209,6 +209,10 @@ kernel void kernel_gemm_moe_q6_k_f32_ns(
dotx16_reduce8(reg_a, shared_b, reg_c.hi, 16);
}
if ((get_global_id(0) + block_id_m * TILESIZE_M) >= ne01) {
return;
}
// Load post router and share in LM
__local uint out_idx[TILESIZE_N];

View File

@@ -82,6 +82,10 @@ __kernel void kernel_gemv_moe_mxfp4_f32_ns(
uint sgid = get_local_id(1);
uint slid = get_sub_group_local_id();
if (i01 >= ne01) {
return;
}
uint i11 = i20 % ne11;
uint expert_id = src2[i20];

View File

@@ -37,6 +37,10 @@ __kernel void kernel_gemv_moe_q4_0_f32_ns(
uint sgid = get_local_id(1);
uint slid = get_sub_group_local_id();
if (i01 >= ne01) {
return;
}
uint i11 = i20 % ne11;
uint expert_id = src2[i20];

View File

@@ -38,6 +38,10 @@ __kernel void kernel_gemv_moe_q4_1_f32_ns(
uint sgid = get_local_id(1);
uint slid = get_sub_group_local_id();
if (i01 >= ne01) {
return;
}
uint i11 = i20 % ne11;
uint expert_id = src2[i20];

View File

@@ -54,6 +54,10 @@ __kernel void kernel_gemv_moe_q4_k_f32_ns(
uint sgid = get_local_id(1);
uint slid = get_sub_group_local_id();
if (i01 >= ne01) {
return;
}
uint i11 = i20 % ne11;
uint expert_id = src2[i20];

View File

@@ -38,6 +38,10 @@ __kernel void kernel_gemv_moe_q5_0_f32_ns(
uint sgid = get_local_id(1);
uint slid = get_sub_group_local_id();
if (i01 >= ne01) {
return;
}
uint i11 = i20 % ne11;
uint expert_id = src2[i20];

View File

@@ -39,6 +39,10 @@ __kernel void kernel_gemv_moe_q5_1_f32_ns(
uint sgid = get_local_id(1);
uint slid = get_sub_group_local_id();
if (i01 >= ne01) {
return;
}
uint i11 = i20 % ne11;
uint expert_id = src2[i20];

View File

@@ -55,6 +55,10 @@ __kernel void kernel_gemv_moe_q5_k_f32_ns(
uint sgid = get_local_id(1);
uint slid = get_sub_group_local_id();
if (i01 >= ne01) {
return;
}
uint i11 = i20 % ne11;
uint expert_id = src2[i20];

View File

@@ -38,6 +38,10 @@ __kernel void kernel_gemv_moe_q6_k_f32_ns(
uint sgid = get_local_id(1);
uint slid = get_sub_group_local_id();
if (i01 >= ne01) {
return;
}
uint i11 = i20 % ne11;
uint expert_id = src2[i20];

View File

@@ -13,6 +13,10 @@
#include <stdlib.h> // for qsort
#include <stdio.h> // for GGML_ASSERT
#ifdef GGML_USE_OPENMP
#include <omp.h>
#endif
#define GROUP_MAX_EPS 1e-15f
#define GROUP_MAX_EPS_IQ3_XXS 1e-8f
#define GROUP_MAX_EPS_IQ2_S 1e-8f
@@ -3064,70 +3068,121 @@ void iq2xs_init_impl(enum ggml_type type) {
}
kmap_q2xs[index] = i;
}
int8_t pos[8];
int * dist2 = (int *)malloc(2*grid_size*sizeof(int));
// The neighbour search runs in three passes:
// 1. Parallel: for each i, qsort and count its neighbours into n_per_i,
// and reduce the totals (num_neighbors, num_not_in_map).
// 2. Serial: prefix-sum n_per_i into offsets[], so each i has a
// pre-assigned slice of kneighbors_q2xs to write into.
// 3. Parallel: redo the qsort and write each i's neighbour list at
// offsets[i].
int * n_per_i = (int *)malloc(kmap_size*sizeof(int));
GGML_ASSERT(n_per_i);
int num_neighbors = 0, num_not_in_map = 0;
for (int i = 0; i < kmap_size; ++i) {
if (kmap_q2xs[i] >= 0) continue;
++num_not_in_map;
for (int k = 0; k < 8; ++k) {
int l = (i >> 2*k) & 0x3;
pos[k] = 2*l + 1;
}
for (int j = 0; j < grid_size; ++j) {
const int8_t * pg = (const int8_t *)(kgrid_q2xs + j);
int d2 = 0;
for (int k = 0; k < 8; ++k) d2 += (pg[k] - pos[k])*(pg[k] - pos[k]);
dist2[2*j+0] = d2;
dist2[2*j+1] = j;
}
qsort(dist2, grid_size, 2*sizeof(int), iq2_compare_func);
int n = 0; int d2 = dist2[0];
int nhave = 1;
for (int j = 0; j < grid_size; ++j) {
if (dist2[2*j] > d2) {
if (nhave == nwant) break;
d2 = dist2[2*j];
++nhave;
#ifdef GGML_USE_OPENMP
#pragma omp parallel reduction(+:num_neighbors,num_not_in_map)
#endif
{
int * dist2 = (int *)malloc(2*grid_size*sizeof(int));
GGML_ASSERT(dist2);
int8_t pos[8];
int i;
#ifdef GGML_USE_OPENMP
#pragma omp for schedule(dynamic, 64)
#endif
for (i = 0; i < kmap_size; ++i) {
if (kmap_q2xs[i] >= 0) {
n_per_i[i] = 0;
continue;
}
++n;
++num_not_in_map;
for (int k = 0; k < 8; ++k) {
int l = (i >> 2*k) & 0x3;
pos[k] = 2*l + 1;
}
for (int j = 0; j < grid_size; ++j) {
const int8_t * pg = (const int8_t *)(kgrid_q2xs + j);
int d2 = 0;
for (int k = 0; k < 8; ++k) d2 += (pg[k] - pos[k])*(pg[k] - pos[k]);
dist2[2*j+0] = d2;
dist2[2*j+1] = j;
}
qsort(dist2, grid_size, 2*sizeof(int), iq2_compare_func);
int n = 0; int d2 = dist2[0];
int nhave = 1;
for (int j = 0; j < grid_size; ++j) {
if (dist2[2*j] > d2) {
if (nhave == nwant) break;
d2 = dist2[2*j];
++nhave;
}
++n;
}
n_per_i[i] = n;
num_neighbors += n;
}
num_neighbors += n;
free(dist2);
}
//printf("%s: %d neighbours in total\n", __func__, num_neighbors);
kneighbors_q2xs = (uint16_t *)malloc((num_neighbors + num_not_in_map)*sizeof(uint16_t));
iq2_data[gindex].neighbours = kneighbors_q2xs;
int * offsets = (int *)malloc(kmap_size*sizeof(int));
GGML_ASSERT(offsets);
int counter = 0;
for (int i = 0; i < kmap_size; ++i) {
if (kmap_q2xs[i] >= 0) continue;
for (int k = 0; k < 8; ++k) {
int l = (i >> 2*k) & 0x3;
pos[k] = 2*l + 1;
if (kmap_q2xs[i] >= 0) {
offsets[i] = -1;
continue;
}
for (int j = 0; j < grid_size; ++j) {
const int8_t * pg = (const int8_t *)(kgrid_q2xs + j);
int d2 = 0;
for (int k = 0; k < 8; ++k) d2 += (pg[k] - pos[k])*(pg[k] - pos[k]);
dist2[2*j+0] = d2;
dist2[2*j+1] = j;
}
qsort(dist2, grid_size, 2*sizeof(int), iq2_compare_func);
kmap_q2xs[i] = -(counter + 1);
int d2 = dist2[0];
uint16_t * start = &kneighbors_q2xs[counter++];
int n = 0, nhave = 1;
for (int j = 0; j < grid_size; ++j) {
if (dist2[2*j] > d2) {
if (nhave == nwant) break;
d2 = dist2[2*j];
++nhave;
}
kneighbors_q2xs[counter++] = dist2[2*j+1];
++n;
}
*start = n;
offsets[i] = counter;
counter += 1 + n_per_i[i];
}
free(dist2);
#ifdef GGML_USE_OPENMP
#pragma omp parallel
#endif
{
int * dist2 = (int *)malloc(2*grid_size*sizeof(int));
GGML_ASSERT(dist2);
int8_t pos[8];
int i;
#ifdef GGML_USE_OPENMP
#pragma omp for schedule(dynamic, 64)
#endif
for (i = 0; i < kmap_size; ++i) {
if (kmap_q2xs[i] >= 0) continue;
for (int k = 0; k < 8; ++k) {
int l = (i >> 2*k) & 0x3;
pos[k] = 2*l + 1;
}
for (int j = 0; j < grid_size; ++j) {
const int8_t * pg = (const int8_t *)(kgrid_q2xs + j);
int d2 = 0;
for (int k = 0; k < 8; ++k) d2 += (pg[k] - pos[k])*(pg[k] - pos[k]);
dist2[2*j+0] = d2;
dist2[2*j+1] = j;
}
qsort(dist2, grid_size, 2*sizeof(int), iq2_compare_func);
int local_counter = offsets[i];
kmap_q2xs[i] = -(local_counter + 1);
int d2 = dist2[0];
uint16_t * start = &kneighbors_q2xs[local_counter++];
int n = 0, nhave = 1;
for (int j = 0; j < grid_size; ++j) {
if (dist2[2*j] > d2) {
if (nhave == nwant) break;
d2 = dist2[2*j];
++nhave;
}
kneighbors_q2xs[local_counter++] = dist2[2*j+1];
++n;
}
*start = n;
}
free(dist2);
}
free(offsets);
free(n_per_i);
}
void iq2xs_free_impl(enum ggml_type type) {
@@ -3663,70 +3718,115 @@ void iq3xs_init_impl(int grid_size) {
}
kmap_q3xs[index] = i;
}
int8_t pos[4];
int * dist2 = (int *)malloc(2*grid_size*sizeof(int));
// See explanation of parallelism in iq2xs_init_impl
int * n_per_i = (int *)malloc(kmap_size*sizeof(int));
GGML_ASSERT(n_per_i);
int num_neighbors = 0, num_not_in_map = 0;
for (int i = 0; i < kmap_size; ++i) {
if (kmap_q3xs[i] >= 0) continue;
++num_not_in_map;
for (int k = 0; k < 4; ++k) {
int l = (i >> 3*k) & 0x7;
pos[k] = 2*l + 1;
}
for (int j = 0; j < grid_size; ++j) {
const int8_t * pg = (const int8_t *)(kgrid_q3xs + j);
int d2 = 0;
for (int k = 0; k < 4; ++k) d2 += (pg[k] - pos[k])*(pg[k] - pos[k]);
dist2[2*j+0] = d2;
dist2[2*j+1] = j;
}
qsort(dist2, grid_size, 2*sizeof(int), iq3_compare_func);
int n = 0; int d2 = dist2[0];
int nhave = 1;
for (int j = 0; j < grid_size; ++j) {
if (dist2[2*j] > d2) {
if (nhave == nwant) break;
d2 = dist2[2*j];
++nhave;
#ifdef GGML_USE_OPENMP
#pragma omp parallel reduction(+:num_neighbors,num_not_in_map)
#endif
{
int * dist2 = (int *)malloc(2*grid_size*sizeof(int));
GGML_ASSERT(dist2);
int8_t pos[4];
int i;
#ifdef GGML_USE_OPENMP
#pragma omp for schedule(dynamic, 64)
#endif
for (i = 0; i < kmap_size; ++i) {
if (kmap_q3xs[i] >= 0) {
n_per_i[i] = 0;
continue;
}
++n;
++num_not_in_map;
for (int k = 0; k < 4; ++k) {
int l = (i >> 3*k) & 0x7;
pos[k] = 2*l + 1;
}
for (int j = 0; j < grid_size; ++j) {
const int8_t * pg = (const int8_t *)(kgrid_q3xs + j);
int d2 = 0;
for (int k = 0; k < 4; ++k) d2 += (pg[k] - pos[k])*(pg[k] - pos[k]);
dist2[2*j+0] = d2;
dist2[2*j+1] = j;
}
qsort(dist2, grid_size, 2*sizeof(int), iq3_compare_func);
int n = 0; int d2 = dist2[0];
int nhave = 1;
for (int j = 0; j < grid_size; ++j) {
if (dist2[2*j] > d2) {
if (nhave == nwant) break;
d2 = dist2[2*j];
++nhave;
}
++n;
}
n_per_i[i] = n;
num_neighbors += n;
}
num_neighbors += n;
free(dist2);
}
//printf("%s: %d neighbours in total\n", __func__, num_neighbors);
kneighbors_q3xs = (uint16_t *)malloc((num_neighbors + num_not_in_map)*sizeof(uint16_t));
iq3_data[gindex].neighbours = kneighbors_q3xs;
int * offsets = (int *)malloc(kmap_size*sizeof(int));
GGML_ASSERT(offsets);
int counter = 0;
for (int i = 0; i < kmap_size; ++i) {
if (kmap_q3xs[i] >= 0) continue;
for (int k = 0; k < 4; ++k) {
int l = (i >> 3*k) & 0x7;
pos[k] = 2*l + 1;
if (kmap_q3xs[i] >= 0) {
offsets[i] = -1;
continue;
}
for (int j = 0; j < grid_size; ++j) {
const int8_t * pg = (const int8_t *)(kgrid_q3xs + j);
int d2 = 0;
for (int k = 0; k < 4; ++k) d2 += (pg[k] - pos[k])*(pg[k] - pos[k]);
dist2[2*j+0] = d2;
dist2[2*j+1] = j;
}
qsort(dist2, grid_size, 2*sizeof(int), iq3_compare_func);
kmap_q3xs[i] = -(counter + 1);
int d2 = dist2[0];
uint16_t * start = &kneighbors_q3xs[counter++];
int n = 0, nhave = 1;
for (int j = 0; j < grid_size; ++j) {
if (dist2[2*j] > d2) {
if (nhave == nwant) break;
d2 = dist2[2*j];
++nhave;
}
kneighbors_q3xs[counter++] = dist2[2*j+1];
++n;
}
*start = n;
offsets[i] = counter;
counter += 1 + n_per_i[i];
}
free(dist2);
#ifdef GGML_USE_OPENMP
#pragma omp parallel
#endif
{
int * dist2 = (int *)malloc(2*grid_size*sizeof(int));
GGML_ASSERT(dist2);
int8_t pos[4];
int i;
#ifdef GGML_USE_OPENMP
#pragma omp for schedule(dynamic, 64)
#endif
for (i = 0; i < kmap_size; ++i) {
if (kmap_q3xs[i] >= 0) continue;
for (int k = 0; k < 4; ++k) {
int l = (i >> 3*k) & 0x7;
pos[k] = 2*l + 1;
}
for (int j = 0; j < grid_size; ++j) {
const int8_t * pg = (const int8_t *)(kgrid_q3xs + j);
int d2 = 0;
for (int k = 0; k < 4; ++k) d2 += (pg[k] - pos[k])*(pg[k] - pos[k]);
dist2[2*j+0] = d2;
dist2[2*j+1] = j;
}
qsort(dist2, grid_size, 2*sizeof(int), iq3_compare_func);
int local_counter = offsets[i];
kmap_q3xs[i] = -(local_counter + 1);
int d2 = dist2[0];
uint16_t * start = &kneighbors_q3xs[local_counter++];
int n = 0, nhave = 1;
for (int j = 0; j < grid_size; ++j) {
if (dist2[2*j] > d2) {
if (nhave == nwant) break;
d2 = dist2[2*j];
++nhave;
}
kneighbors_q3xs[local_counter++] = dist2[2*j+1];
++n;
}
*start = n;
}
free(dist2);
}
free(offsets);
free(n_per_i);
}
void iq3xs_free_impl(int grid_size) {

View File

@@ -238,6 +238,8 @@ struct ggml_sycl_device_info {
std::array<float, GGML_SYCL_MAX_DEVICES> default_tensor_split = {};
int max_work_group_sizes[GGML_SYCL_MAX_DEVICES] = {0};
bool ext_oneapi_level_zero = true; // sycl::backend::ext_oneapi_level_zero used by all enumerated GPU devices
};
const ggml_sycl_device_info & ggml_sycl_info();

View File

@@ -3,6 +3,13 @@
#include "dequantize.hpp"
#include "presets.hpp"
#if defined(__INTEL_LLVM_COMPILER)
#if __has_include(<sycl/ext/oneapi/bfloat16.hpp>)
#include <sycl/ext/oneapi/bfloat16.hpp>
#define GGML_SYCL_DMMV_HAS_BF16
#endif
#endif
static void convert_f16(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
const sycl::half *x = (const sycl::half *)vx;
@@ -11,6 +18,16 @@ static void convert_f16(const void * vx, const int64_t ib, const int iqs, dfloat
v.y() = x[ib + iqs + 1];
}
#ifdef GGML_SYCL_DMMV_HAS_BF16
static void convert_bf16(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
const sycl::ext::oneapi::bfloat16 *x = (const sycl::ext::oneapi::bfloat16 *)vx;
// automatic bfloat16 -> float type cast if dfloat == float
v.x() = x[ib + iqs + 0];
v.y() = x[ib + iqs + 1];
}
#endif
static void convert_f32(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
const float * x = (const float *) vx;
@@ -217,6 +234,28 @@ static void convert_mul_mat_vec_f16_sycl(const void *vx, const dfloat *y,
}
}
#ifdef GGML_SYCL_DMMV_HAS_BF16
static void convert_mul_mat_vec_bf16_sycl(const void *vx, const dfloat *y,
float *dst, const int ncols,
const int nrows,
dpct::queue_ptr stream) {
// The qk=1 kernel iterates with stride 2*GGML_SYCL_DMMV_X, so ncols must be a
// multiple of that — not just GGML_SYCL_DMMV_X — to avoid out-of-bounds reads.
GGML_ASSERT(ncols % (2*GGML_SYCL_DMMV_X) == 0);
const int block_num_y = (nrows + GGML_SYCL_MMV_Y - 1) / GGML_SYCL_MMV_Y;
const sycl::range<3> block_nums(1, 1, block_num_y);
const sycl::range<3> block_dims(1, GGML_SYCL_MMV_Y, WARP_SIZE);
{
stream->parallel_for(
sycl::nd_range<3>(block_nums * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
dequantize_mul_mat_vec<1, 1, convert_bf16>(vx, y, dst, ncols,
nrows, item_ct1);
});
}
}
#endif
/*
DPCT1110:4: The total declared local variable size in device function
dequantize_mul_mat_vec_q2_k exceeds 128 bytes and may cause high register
@@ -1497,7 +1536,8 @@ void ggml_sycl_op_dequantize_mul_mat_vec(
bool src1_convert_f16 =
src0->type == GGML_TYPE_Q4_0 || src0->type == GGML_TYPE_Q4_1 ||
src0->type == GGML_TYPE_Q5_0 || src0->type == GGML_TYPE_Q5_1 ||
src0->type == GGML_TYPE_Q8_0 || src0->type == GGML_TYPE_F16;
src0->type == GGML_TYPE_Q8_0 || src0->type == GGML_TYPE_F16 ||
src0->type == GGML_TYPE_BF16;
if (src1_convert_f16) {
scope_op_debug_print scope_dbg_print(__func__, "/to_fp16_sycl", dst, /*num_src=*/2,
@@ -1565,6 +1605,11 @@ void ggml_sycl_op_dequantize_mul_mat_vec(
case GGML_TYPE_F16:
convert_mul_mat_vec_f16_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
break;
#ifdef GGML_SYCL_DMMV_HAS_BF16
case GGML_TYPE_BF16:
convert_mul_mat_vec_bf16_sycl(src0_dd_i, src1_dfloat, dst_dd_i, ne00, row_diff, stream);
break;
#endif
default:
printf("ggml_sycl_op_dequantize_mul_mat_vec unsupported GGML_TYPE %d\n", src0->type);
GGML_ABORT("fatal error");

View File

@@ -6,7 +6,7 @@
#include <cmath>
template <int S_v, bool KDA>
template <int S_v, bool KDA, bool keep_rs_t>
void gated_delta_net_sycl(const float * q,
const float * k,
const float * v,
@@ -28,7 +28,8 @@ void gated_delta_net_sycl(const float * q,
int64_t sb3,
const sycl::uint3 neqk1_magic,
const sycl::uint3 rq3_magic,
float scale) {
float scale,
int K) {
auto item_ct1 = sycl::ext::oneapi::this_work_item::get_nd_item<3>();
const uint32_t h_idx = item_ct1.get_group(2);
const uint32_t sequence = item_ct1.get_group(1);
@@ -43,9 +44,13 @@ void gated_delta_net_sycl(const float * q,
float * attn_data = dst;
float * state = dst + attn_score_elems;
const int64_t state_offset = (sequence * H + h_idx) * S_v * S_v;
state += state_offset;
curr_state += state_offset;
// input state layout (D, K, n_seqs) — seq stride is K * D = K * H * S_v * S_v.
// output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
const int64_t state_in_offset = sequence * K * H * S_v * S_v + h_idx * S_v * S_v;
const int64_t state_out_offset = (sequence * H + h_idx) * S_v * S_v;
const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
state += state_out_offset;
curr_state += state_in_offset + col * S_v;
attn_data += (sequence * n_tokens * H + h_idx) * S_v;
constexpr int warp_size = ggml_sycl_get_physical_warp_size() < S_v ? ggml_sycl_get_physical_warp_size() : S_v;
@@ -55,9 +60,13 @@ void gated_delta_net_sycl(const float * q,
#pragma unroll
for (int r = 0; r < rows_per_lane; r++) {
const int i = r * warp_size + lane;
s_shard[r] = curr_state[col * S_v + i];
s_shard[r] = curr_state[i];
}
// slot mapping: target_slot = t - shift. When n_tokens < K only the last n_tokens slots
// are written; earlier slots are left untouched (caller-owned).
const int shift = (int) n_tokens - K;
for (int t = 0; t < n_tokens; t++) {
const float * q_t = q + iq3 * sq3 + t * sq2 + iq1 * sq1;
const float * k_t = k + iq3 * sq3 + t * sq2 + iq1 * sq1;
@@ -131,17 +140,32 @@ void gated_delta_net_sycl(const float * q,
}
attn_data += S_v * H;
}
// Write state back to global memory
if constexpr (keep_rs_t) {
const int target_slot = t - shift;
if (target_slot >= 0 && target_slot < K) {
float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
#pragma unroll
for (int r = 0; r < rows_per_lane; r++) {
const int i = r * warp_size + lane;
state[col * S_v + i] = s_shard[r];
for (int r = 0; r < rows_per_lane; r++) {
const int i = r * warp_size + lane;
curr_state[col * S_v + i] = s_shard[r];
}
}
}
}
if constexpr (!keep_rs_t) {
#pragma unroll
for (int r = 0; r < rows_per_lane; r++) {
const int i = r * warp_size + lane;
state[col * S_v + i] = s_shard[r];
}
}
}
template <bool KDA>
template <bool KDA, bool keep_rs_t>
static void launch_gated_delta_net(const float * q_d,
const float * k_d,
const float * v_d,
@@ -165,6 +189,7 @@ static void launch_gated_delta_net(const float * q_d,
int64_t neqk1,
int64_t rq3,
float scale,
int K,
dpct::queue_ptr stream) {
//TODO: Add chunked kernel for even faster pre-fill
const int warp_size = ggml_sycl_info().devices[ggml_sycl_get_device()].warp_size;
@@ -182,9 +207,9 @@ static void launch_gated_delta_net(const float * q_d,
constexpr int sv = 16;
stream->parallel_for(sycl::nd_range<3>(grid_dims * block_dims, block_dims),
[=](sycl::nd_item<3> /*item_ct1*/) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
gated_delta_net_sycl<sv, KDA>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, n_tokens,
gated_delta_net_sycl<sv, KDA, keep_rs_t>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, n_tokens,
n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2,
sb3, neqk1_magic, rq3_magic, scale);
sb3, neqk1_magic, rq3_magic, scale, K);
});
}
break;
@@ -193,9 +218,9 @@ static void launch_gated_delta_net(const float * q_d,
constexpr int sv = 32;
stream->parallel_for(sycl::nd_range<3>(grid_dims * block_dims, block_dims),
[=](sycl::nd_item<3> /*item_ct1*/) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
gated_delta_net_sycl<sv, KDA>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, n_tokens,
gated_delta_net_sycl<sv, KDA, keep_rs_t>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, n_tokens,
n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2,
sb3, neqk1_magic, rq3_magic, scale);
sb3, neqk1_magic, rq3_magic, scale, K);
});
}
break;
@@ -204,9 +229,9 @@ static void launch_gated_delta_net(const float * q_d,
constexpr int sv = 64;
stream->parallel_for(sycl::nd_range<3>(grid_dims * block_dims, block_dims),
[=](sycl::nd_item<3> /*item_ct1*/) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
gated_delta_net_sycl<sv, KDA>(
gated_delta_net_sycl<sv, KDA, keep_rs_t>(
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, n_tokens, n_seqs, sq1, sq2,
sq3, sv1, sv2, sv3, sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
sq3, sv1, sv2, sv3, sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
});
}
break;
@@ -216,9 +241,9 @@ static void launch_gated_delta_net(const float * q_d,
constexpr int sv = 128;
stream->parallel_for(sycl::nd_range<3>(grid_dims * block_dims, block_dims),
[=](sycl::nd_item<3> /*item_ct1*/) [[sycl::reqd_sub_group_size(WARP_SIZE)]] {
gated_delta_net_sycl<sv, KDA>(
gated_delta_net_sycl<sv, KDA, keep_rs_t>(
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, n_tokens, n_seqs, sq1, sq2,
sq3, sv1, sv2, sv3, sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
sq3, sv1, sv2, sv3, sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
});
}
break;
@@ -290,14 +315,30 @@ void ggml_sycl_op_gated_delta_net(ggml_backend_sycl_context & ctx, ggml_tensor *
dpct::queue_ptr stream = ctx.stream();
// state is 3D (S_v*S_v*H, K, n_seqs); K is the snapshot slot count.
const int K = (int) src_state->ne[1];
const bool keep_rs = K > 1;
if (kda) {
launch_gated_delta_net<true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, stream);
if (keep_rs) {
launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
} else {
launch_gated_delta_net<false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, stream);
if (keep_rs) {
launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
}
}

View File

@@ -98,7 +98,7 @@ static ggml_sycl_device_info ggml_sycl_init() {
for (int i = 0; i < info.device_count; ++i) {
info.devices[i].vmm = 0;
dpct::device_info prop;
sycl::device device = dpct::dev_mgr::instance().get_device(i);
auto & device = dpct::dev_mgr::instance().get_device(i);
SYCL_CHECK(CHECK_TRY_ERROR(dpct::get_device_info(
prop, device)));
@@ -117,6 +117,12 @@ static ggml_sycl_device_info ggml_sycl_init() {
info.devices[i].max_wg_per_cu = info.max_work_group_sizes[i] / prop.get_max_compute_units();
info.devices[i].hw_info = get_device_hw_info(&device);
// Only check GPU devices; CPU devices use OpenCL and would otherwise
// disable Level Zero for the GPUs on systems without ONEAPI_DEVICE_SELECTOR set.
if (device.is_gpu() && device.default_queue().get_backend() != sycl::backend::ext_oneapi_level_zero) {
GGML_LOG_WARN("SYCL GPU device %d does not use Level Zero backend, disabling Level Zero memory API\n", i);
info.ext_oneapi_level_zero = false;
}
}
for (int id = 0; id < info.device_count; ++id) {
@@ -230,26 +236,10 @@ static void ggml_check_sycl() try {
g_ggml_sycl_disable_dnn = get_sycl_env("GGML_SYCL_DISABLE_DNN", 0);
g_ggml_sycl_prioritize_dmmv = get_sycl_env("GGML_SYCL_PRIORITIZE_DMMV", 0);
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
g_ggml_sycl_enable_level_zero = get_sycl_env("GGML_SYCL_ENABLE_LEVEL_ZERO", 1);
g_ggml_sycl_enable_level_zero = get_sycl_env("GGML_SYCL_ENABLE_LEVEL_ZERO", ggml_sycl_info().ext_oneapi_level_zero);
#else
g_ggml_sycl_enable_level_zero = 0;
#endif
if (g_ggml_sycl_enable_level_zero) {
// Verify all GPU devices use the Level Zero backend before enabling L0 APIs.
// Only check GPU devices; CPU devices use OpenCL and would otherwise
// disable Level Zero for the GPUs on systems without ONEAPI_DEVICE_SELECTOR set.
for (unsigned int i = 0; i < dpct::dev_mgr::instance().device_count(); i++) {
auto & q = dpct::dev_mgr::instance().get_device(i).default_queue();
if (!q.get_device().is_gpu()) {
continue;
}
if (q.get_backend() != sycl::backend::ext_oneapi_level_zero) {
GGML_LOG_WARN("SYCL GPU device %d does not use Level Zero backend, disabling Level Zero memory API\n", i);
g_ggml_sycl_enable_level_zero = 0;
break;
}
}
}
#ifdef SYCL_FLASH_ATTN
g_ggml_sycl_enable_flash_attention = get_sycl_env("GGML_SYCL_ENABLE_FLASH_ATTN", 1);
@@ -3455,6 +3445,7 @@ static bool ggml_sycl_supports_dmmv(enum ggml_type type) {
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q6_K:
case GGML_TYPE_F16:
case GGML_TYPE_BF16:
return true;
default:
return false;
@@ -3818,8 +3809,13 @@ static void opt_for_reorder(ggml_backend_sycl_context * ctx, const ggml_tensor *
static bool can_use_dequantize_mul_mat_vec(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
// The F16/BF16 qk=1 kernel iterates with stride 2*DMMV_X, requiring ne[0] to be
// a multiple of 2*DMMV_X. Quantized types use block-structured kernels that only
// need ne[0] % DMMV_X == 0.
const int64_t dmmv_x_required = (src0->type == GGML_TYPE_BF16 || src0->type == GGML_TYPE_F16) ?
2*GGML_SYCL_DMMV_X : GGML_SYCL_DMMV_X;
return ggml_sycl_supports_dmmv(src0->type) && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32 &&
src0->ne[0] % GGML_SYCL_DMMV_X == 0 && src1->ne[1] == 1;
src0->ne[0] % dmmv_x_required == 0 && src1->ne[1] == 1;
}
static bool can_use_mul_mat_vec_q(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
@@ -3923,35 +3919,17 @@ struct mmid_row_mapping {
__dpct_inline__ static void k_copy_src1_to_contiguous(
const char *__restrict__ src1_original, char *__restrict__ src1_contiguous,
int *__restrict__ cur_src1_row, mmid_row_mapping *__restrict__ row_mapping,
const char *__restrict ids, int64_t i02, size_t ids_nb1, size_t ids_nb0,
const mmid_row_mapping *__restrict__ row_mapping,
int64_t ne11, int64_t ne10, size_t nb11, size_t nb12,
const sycl::nd_item<3> &item_ct1, int &src1_row) {
int32_t iid1 = item_ct1.get_group(2);
int32_t id = item_ct1.get_group(1);
const sycl::nd_item<3> &item_ct1) {
const int32_t src1_row = item_ct1.get_group(2);
const int32_t row_id_i = *(const int32_t *) (ids + iid1*ids_nb1 + id*ids_nb0);
if (row_id_i != i02) {
return;
}
const int32_t iid1 = row_mapping[src1_row].i2;
const int32_t id = row_mapping[src1_row].i1;
const int64_t i11 = id % ne11;
const int64_t i12 = iid1;
if (item_ct1.get_local_id(2) == 0) {
src1_row =
dpct::atomic_fetch_add<sycl::access::address_space::generic_space>(
cur_src1_row, 1);
row_mapping[src1_row] = {id, iid1};
}
/*
DPCT1065:194: Consider replacing sycl::nd_item::barrier() with
sycl::nd_item::barrier(sycl::access::fence_space::local_space) for better
performance if there is no access to global memory.
*/
item_ct1.barrier();
const float * src1_row_original = (const float *)(src1_original + i11*nb11 + i12*nb12);
float * src1_row_contiguous = (float *)(src1_contiguous + src1_row*nb11);
@@ -4026,6 +4004,47 @@ static bool ggml_sycl_mul_mat_id_mmvq_fused(
src1_row_stride, stream);
}
// counting sort of the routed rows by expert id (row_id_i, as chosen by the router):
// builds a projection of a memory layout where each expert's slice is contiguous
static void mmid_counting_sort_rows(
const ggml_tensor * ids, const char * ids_host,
int64_t n_ids, int64_t n_as, int64_t n_routed_rows,
std::vector<int64_t> & expert_counts,
std::vector<int64_t> & expert_row_offsets,
std::vector<mmid_row_mapping> & routed_row_src) {
// frequencies: how many routed rows each expert "owns"
expert_counts.assign(n_as, 0);
for (int64_t iid1 = 0; iid1 < ids->ne[1]; iid1++) {
for (int64_t id = 0; id < n_ids; id++) {
const int32_t row_id_i = *(const int32_t *) (ids_host + iid1*ids->nb[1] + id*ids->nb[0]);
GGML_ASSERT(row_id_i >= 0 && row_id_i < n_as);
expert_counts[row_id_i]++;
}
}
// where each expert's slice starts (row indices) and the previous ends
expert_row_offsets.assign(n_as + 1, 0);
for (int64_t i02 = 0; i02 < n_as; i02++) {
expert_row_offsets[i02 + 1] = expert_row_offsets[i02] + expert_counts[i02];
}
std::vector<int64_t> expert_row_next = expert_row_offsets;
routed_row_src.resize(n_routed_rows);
for (int64_t iid1 = 0; iid1 < ids->ne[1]; iid1++) {
for (int64_t id = 0; id < n_ids; id++) {
const int32_t row_id_i = *(const int32_t *) (ids_host + iid1*ids->nb[1] + id*ids->nb[0]);
GGML_ASSERT(row_id_i >= 0 && row_id_i < n_as);
// find and validate the next free row for a given expert (row_id_i)
const int64_t routed_row = expert_row_next[row_id_i]++;
GGML_ASSERT(routed_row >= expert_row_offsets[row_id_i]);
GGML_ASSERT(routed_row < expert_row_offsets[row_id_i + 1]);
routed_row_src[routed_row] = {(int32_t) id, (int32_t) iid1};
}
}
}
static void ggml_sycl_mul_mat_id(ggml_backend_sycl_context & ctx,
ggml_tensor *dst) try {
scope_op_debug_print scope_dbg_print(__func__, dst, /*num_src=*/3);
@@ -4104,99 +4123,91 @@ static void ggml_sycl_mul_mat_id(ggml_backend_sycl_context & ctx,
src1_row.data = src1_contiguous.get();
dst_row.data = dst_contiguous.get();
// how many "owned" routed rows to pass to each expert
std::vector<int64_t> expert_row_counts;
// where each expert's slice starts and the previous ends (row indices, right-exclusive)
std::vector<int64_t> expert_row_offsets;
// the sources (slot/token pairs) of contiguous rows to guide k_copy_src1_to_contiguous
std::vector<mmid_row_mapping> routed_row_src;
mmid_counting_sort_rows(ids, ids_host.data(), n_ids, n_as, n_routed_rows,
expert_row_counts, expert_row_offsets, routed_row_src);
ggml_sycl_pool_alloc<mmid_row_mapping> dev_row_mapping(ctx.pool(), n_routed_rows);
SYCL_CHECK(CHECK_TRY_ERROR(
stream->memcpy(dev_row_mapping.get(), routed_row_src.data(), n_routed_rows*sizeof(mmid_row_mapping))));
const unsigned int max_work_group_size = ggml_sycl_info().max_work_group_sizes[ctx.device];
assert(max_work_group_size % (WARP_SIZE * WARP_SIZE) == 0);
{
sycl::range<3> block_dims(1, 1, std::min((unsigned int)ne10, max_work_group_size));
sycl::range<3> grid_dims(1, 1, n_routed_rows);
stream->submit([&](sycl::handler &cgh) {
char *__restrict src1_contiguous_get =
src1_contiguous.get();
mmid_row_mapping *__restrict dev_row_mapping_get =
dev_row_mapping.get();
cgh.parallel_for(
sycl::nd_range<3>(grid_dims * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) {
k_copy_src1_to_contiguous(
src1_original, src1_contiguous_get,
dev_row_mapping_get,
ne11, ne10, nb11, nb12,
item_ct1);
});
});
}
for (int64_t i02 = 0; i02 < n_as; i02++) {
int64_t num_src1_rows = 0;
for (int64_t iid1 = 0; iid1 < ids->ne[1]; iid1++) {
for (int64_t id = 0; id < n_ids; id++) {
const int32_t row_id_i = *(const int32_t *) (ids_host.data() + iid1*ids->nb[1] + id*ids->nb[0]);
GGML_ASSERT(row_id_i >= 0 && row_id_i < n_as);
if (row_id_i != i02) {
continue;
}
num_src1_rows++;
}
}
const int64_t num_src1_rows = expert_row_counts[i02];
if (num_src1_rows == 0) {
continue;
}
ggml_sycl_pool_alloc<int> dev_cur_src1_row(ctx.pool(), 1);
ggml_sycl_pool_alloc<mmid_row_mapping> dev_row_mapping(ctx.pool(), num_src1_rows);
SYCL_CHECK(CHECK_TRY_ERROR(
stream->memset(dev_cur_src1_row.get(), 0, sizeof(int))));
const unsigned int max_work_group_size = ggml_sycl_info().max_work_group_sizes[ctx.device];
assert(max_work_group_size % (WARP_SIZE * WARP_SIZE) == 0);
{
sycl::range<3> block_dims(1, 1, std::min((unsigned int)ne10, max_work_group_size));
sycl::range<3> grid_dims(1, n_ids, ids->ne[1]);
stream->submit([&](sycl::handler &cgh) {
sycl::local_accessor<int, 0> src1_row_acc(cgh);
char *__restrict src1_contiguous_get =
src1_contiguous.get();
int *__restrict dev_cur_src1_row_get =
dev_cur_src1_row.get();
mmid_row_mapping *__restrict dev_row_mapping_get =
dev_row_mapping.get();
size_t ids_nb_ct6 = ids->nb[1];
size_t ids_nb_ct7 = ids->nb[0];
cgh.parallel_for(
sycl::nd_range<3>(grid_dims * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) {
k_copy_src1_to_contiguous(
src1_original, src1_contiguous_get,
dev_cur_src1_row_get,
dev_row_mapping_get, ids_dev, i02,
ids_nb_ct6, ids_nb_ct7, ne11, ne10, nb11, nb12,
item_ct1, src1_row_acc);
});
});
}
const int64_t expert_row_offset = expert_row_offsets[i02];
src0_row.data = src0_original + i02*nb02;
GGML_ASSERT(nb11 == sizeof(float)*ne10);
GGML_ASSERT(nb1 == sizeof(float)*ne0);
src1_row.data = src1_contiguous.get() + expert_row_offset*nb11;
src1_row.ne[1] = num_src1_rows;
src1_row.nb[1] = nb11;
src1_row.nb[2] = num_src1_rows*nb11;
src1_row.nb[3] = num_src1_rows*nb11;
dst_row.data = dst_contiguous.get() + expert_row_offset*nb1;
dst_row.ne[1] = num_src1_rows;
dst_row.nb[1] = nb1;
dst_row.nb[2] = num_src1_rows*nb1;
dst_row.nb[3] = num_src1_rows*nb1;
ggml_sycl_mul_mat(ctx, &src0_row, &src1_row, &dst_row);
}
{
sycl::range<3> block_dims(1, 1, std::min((unsigned int)ne0, max_work_group_size));
sycl::range<3> grid_dims(1, 1, num_src1_rows);
stream->submit([&](sycl::handler &cgh) {
const char *__restrict dst_contiguous_get =
dst_contiguous.get();
const mmid_row_mapping *__restrict dev_row_mapping_get =
dev_row_mapping.get();
{
sycl::range<3> block_dims(1, 1, std::min((unsigned int)ne0, max_work_group_size));
sycl::range<3> grid_dims(1, 1, n_routed_rows);
stream->submit([&](sycl::handler &cgh) {
const char *__restrict dst_contiguous_get =
dst_contiguous.get();
const mmid_row_mapping *__restrict dev_row_mapping_get =
dev_row_mapping.get();
cgh.parallel_for(
sycl::nd_range<3>(grid_dims * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) {
k_copy_dst_from_contiguous(dst_original,
dst_contiguous_get,
dev_row_mapping_get,
ne0, nb1, nb2, item_ct1);
});
});
}
cgh.parallel_for(
sycl::nd_range<3>(grid_dims * block_dims, block_dims),
[=](sycl::nd_item<3> item_ct1) {
k_copy_dst_from_contiguous(dst_original,
dst_contiguous_get,
dev_row_mapping_get,
ne0, nb1, nb2, item_ct1);
});
});
}
}
}

View File

@@ -8,7 +8,10 @@ endif()
find_package(Vulkan COMPONENTS glslc REQUIRED)
find_package(SPIRV-Headers REQUIRED)
if (DEFINED ENV{VULKAN_SDK})
list(APPEND CMAKE_PREFIX_PATH "$ENV{VULKAN_SDK}")
endif()
find_package(SPIRV-Headers CONFIG REQUIRED)
if (CMAKE_CXX_COMPILER_ID STREQUAL "MSVC")
# Parallel build object files

View File

@@ -28,7 +28,7 @@ if (NOT ZENDNN_ROOT OR ZENDNN_ROOT STREQUAL "" OR ZENDNN_ROOT STREQUAL "OFF")
ExternalProject_Add(
zendnn
GIT_REPOSITORY https://github.com/amd/ZenDNN.git
GIT_TAG ac9e580d9434b7b98985f2627a7ebfb5eba4bb0d # ZenDNN-2026-WW17
GIT_TAG 253b94ce0d7e9284c265fefb485714944caff9d3 # ZenDNN-2026-WW19
PREFIX ${ZENDNN_PREFIX}
SOURCE_DIR ${ZENDNN_SOURCE_DIR}
BINARY_DIR ${ZENDNN_BUILD_DIR}

View File

@@ -2,6 +2,10 @@
#include "ggml-backend-impl.h"
#include "ggml-impl.h"
#define GGML_COMMON_DECL_CPP
#include "ggml-common.h"
#include "zendnnl.hpp"
#include <cstring>
@@ -19,6 +23,8 @@ zendnnl::common::data_type_t ggml_to_zendnn_type() {
return zendnnl::common::data_type_t::f32;
} else if constexpr (std::is_same_v<T, ggml_bf16_t>) {
return zendnnl::common::data_type_t::bf16;
} else if constexpr (std::is_same_v<T, block_q8_0>) {
return zendnnl::common::data_type_t::s8;
} else {
return zendnnl::common::data_type_t::none;
}
@@ -48,6 +54,17 @@ static bool ggml_zendnn_matmul(ggml_backend_zendnn_context * ctx, int64_t m, int
params.num_threads = ctx->n_threads;
zendnnl::lowoha::matmul::matmul_batch_params_t batch_params;
if constexpr (std::is_same_v<TA, block_q8_0>) {
params.dtypes.compute = zendnnl::common::data_type_t::s8;
const int64_t num_groups = k / QK8_0;
params.dynamic_quant = true;
params.quant_params.src_scale.buff = nullptr;
params.quant_params.src_scale.dt = zendnnl::common::data_type_t::bf16;
params.quant_params.src_scale.dims = {n, num_groups};
params.packing.pack_format_b = 1;
}
zendnnl::error_handling::status_t status = zendnnl::lowoha::matmul::matmul_direct(
'r', false, true, // row-major, don't transpose B, transpose A (because it's column-major)
n, // M: rows of B and C
@@ -108,6 +125,14 @@ static bool ggml_zendnn_sgemm(ggml_backend_zendnn_context * ctx, int64_t m, int6
(const ggml_bf16_t *)B, ldb,
(float *)C, ldc);
return false;
case GGML_TYPE_Q8_0:
if (Btype != GGML_TYPE_F32 || Ctype != GGML_TYPE_F32)
return false;
return ggml_zendnn_matmul<block_q8_0, float, float>(
ctx, m, n, k,
(const block_q8_0 *)A, lda,
(const float *)B, ldb,
(float *)C, ldc);
default:
return false; // unsupported type
}
@@ -145,7 +170,9 @@ static void ggml_zendnn_compute_forward_mul_mat(
const int64_t r3 = ne13/ne03;
void * work_data = ctx->work_data.get();
if (src1->type != vec_dot_type) {
// ZenDNN requires FP32 for dynamic quantization, so conversion is skipped
if (src1->type != vec_dot_type && src0->type != GGML_TYPE_Q8_0) {
const size_t nbw1 = ggml_row_size(vec_dot_type, ne10);
const size_t nbw2 = nbw1 * ne11;
const size_t nbw3 = nbw2 * ne12;
@@ -171,7 +198,7 @@ static void ggml_zendnn_compute_forward_mul_mat(
for (int64_t i13 = 0; i13 < ne13; i13++) {
for (int64_t i12 = 0; i12 < ne12; i12++) {
const void* wdata = src1->type == vec_dot_type ? src1->data : work_data;
const void* wdata = (src1->type == vec_dot_type || src0->type == GGML_TYPE_Q8_0) ? src1->data : work_data;
const size_t row_size = ggml_row_size(vec_dot_type, ne10);
if (!ggml_zendnn_sgemm(ctx,
ne01, // m
@@ -184,7 +211,7 @@ static void ggml_zendnn_compute_forward_mul_mat(
static_cast<char *>(dst->data) + i12*nb2 + i13*nb3,
ne01, // ldc
src0->type,
vec_dot_type,
src0->type == GGML_TYPE_Q8_0 ? GGML_TYPE_F32 : vec_dot_type,
dst->type))
GGML_ABORT("%s: ZenDNN sgemm failed\n", __func__);
}
@@ -261,10 +288,15 @@ static void ggml_zendnn_compute_forward_mul_mat_id(
const size_t nbw1 = row_size;
const size_t nbw2 = nbw1 * ne11;
const size_t nbw3 = nbw2 * ne12;
const size_t src1_conv_size = (src1->type != vec_dot_type) ? ne13 * nbw3 : 0;
const size_t src1_conv_size = (src1->type != vec_dot_type && src0->type != GGML_TYPE_Q8_0) ? ne13 * nbw3 : 0;
// For Q8_0, src1 is always F32; the gather buffer must hold F32 rows (ne10*4 bytes),
// not Q8_0-encoded rows (row_size ≈ ne10/32*34 bytes) — they differ by ~4x.
const size_t f32_row_size = (size_t)ne10 * sizeof(float);
const size_t gather_row_size = (src0->type == GGML_TYPE_Q8_0) ? f32_row_size : row_size;
// size for MoE gather/scatter buffers
const size_t wdata_cur_size = max_rows * row_size;
const size_t wdata_cur_size = max_rows * gather_row_size;
const size_t dst_cur_size = max_rows * ggml_row_size(dst->type, ne01);
// allocate single buffer for all needs
@@ -279,7 +311,8 @@ static void ggml_zendnn_compute_forward_mul_mat_id(
char * wdata_cur = work_data + src1_conv_size;
char * dst_cur = wdata_cur + wdata_cur_size;
if (src1->type != vec_dot_type) {
// ZenDNN requires FP32 for dynamic quantization, so conversion is skipped
if (src1->type != vec_dot_type && src0->type != GGML_TYPE_Q8_0) {
GGML_ASSERT(src1->type == GGML_TYPE_F32);
#pragma omp parallel for collapse(3) num_threads(ctx->n_threads) schedule(static)
@@ -294,7 +327,7 @@ static void ggml_zendnn_compute_forward_mul_mat_id(
}
}
const void * wdata = src1->type == vec_dot_type ? src1->data : work_data;
const void * wdata = (src1->type == vec_dot_type || src0->type == GGML_TYPE_Q8_0) ? src1->data : work_data;
// process each expert with gather -> gemm -> scatter pattern
for (int64_t cur_a = 0; cur_a < n_as; ++cur_a) {
@@ -315,9 +348,9 @@ static void ggml_zendnn_compute_forward_mul_mat_id(
const int64_t i12 = row_mapping.i2;
std::memcpy(
wdata_cur + ir1 * row_size,
(const char *) wdata + (i11 + i12*ne11) * row_size,
row_size
wdata_cur + ir1 * gather_row_size,
(const char *) wdata + (i11 + i12*ne11) * gather_row_size,
gather_row_size
);
}
@@ -333,7 +366,7 @@ static void ggml_zendnn_compute_forward_mul_mat_id(
dst_cur,
ne01, // ldc
src0->type,
vec_dot_type,
src0->type == GGML_TYPE_Q8_0 ? GGML_TYPE_F32 : vec_dot_type,
dst->type)) {
GGML_ABORT("%s: ZenDNN sgemm failed\n", __func__);
}
@@ -577,6 +610,7 @@ static bool ggml_backend_zendnn_device_supports_op(ggml_backend_dev_t dev, const
switch (weights->type) {
case GGML_TYPE_F32:
case GGML_TYPE_BF16:
case GGML_TYPE_Q8_0:
return true;
default:
return false;

View File

@@ -228,9 +228,18 @@ struct gguf_context {
};
struct gguf_reader {
gguf_reader(FILE * file) : file(file) {
// read the remaining bytes once and update on each read
nbytes_remain = file_remain(file);
gguf_reader(
gguf_reader_callback_t callback,
void * userdata,
size_t max_chunk_read,
uint64_t data_offset = 0,
uint64_t nbytes_remain = 0)
: callback(callback),
userdata(userdata),
max_chunk_read(max_chunk_read),
data_offset(data_offset),
nbytes_remain(nbytes_remain) {
GGML_ASSERT(max_chunk_read > 0);
}
// helper for remaining bytes in a file
@@ -257,12 +266,10 @@ struct gguf_reader {
template <typename T>
bool read(T & dst) const {
const size_t size = sizeof(dst);
if (nbytes_remain < size) {
if (size > nbytes_remain) {
return false;
}
const size_t nread = fread(&dst, 1, size, file);
nbytes_remain -= nread;
return nread == size;
return read_raw(&dst, size) == size;
}
template <typename T>
@@ -344,24 +351,71 @@ struct gguf_reader {
return false;
}
dst.resize(static_cast<size_t>(size));
const size_t nread = fread(dst.data(), 1, size, file);
nbytes_remain -= nread;
return nread == size;
return read_raw(dst.data(), static_cast<size_t>(size)) == size;
}
bool read(void * dst, const size_t size) const {
if (size > nbytes_remain) {
return false;
}
const size_t nread = fread(dst, 1, size, file);
nbytes_remain -= nread;
return nread == size;
return read_raw(dst, size) == size;
}
uint64_t tell() const {
return data_offset;
}
bool seek(uint64_t absolute_offset) const {
const uint64_t end_offset = uint64_t(data_offset) + nbytes_remain;
if (absolute_offset > end_offset) {
return false;
}
data_offset = absolute_offset;
nbytes_remain = end_offset - absolute_offset;
return true;
}
private:
FILE * file;
size_t read_raw(void * dst, size_t size) const {
if (callback == nullptr || size == 0) {
return 0;
}
mutable uint64_t nbytes_remain;
uint8_t * data = static_cast<uint8_t *>(dst);
size_t total_nread = 0;
bool reached_eof = false;
while (total_nread < size) {
const size_t chunk_size = std::min(max_chunk_read, size - total_nread);
if (data_offset + total_nread < data_offset) {
break;
}
const size_t nread = callback(userdata, static_cast<void *>(data + total_nread), data_offset + total_nread, chunk_size);
total_nread += nread;
if (nread != chunk_size) {
reached_eof = true;
break;
}
}
data_offset += total_nread;
GGML_ASSERT(total_nread <= nbytes_remain);
nbytes_remain -= total_nread;
if (reached_eof) {
nbytes_remain = 0;
}
return total_nread;
}
gguf_reader_callback_t callback = nullptr;
void * userdata = nullptr;
size_t max_chunk_read = 0;
mutable uint64_t data_offset = 0;
mutable uint64_t nbytes_remain = 0;
};
struct gguf_context * gguf_init_empty(void) {
@@ -394,12 +448,7 @@ bool gguf_read_emplace_helper(const struct gguf_reader & gr, std::vector<struct
return true;
}
struct gguf_context * gguf_init_from_file_ptr(FILE * file, struct gguf_init_params params) {
if (!file) {
return nullptr;
}
const struct gguf_reader gr(file);
static struct gguf_context * gguf_init_from_reader(const struct gguf_reader & gr, struct gguf_init_params params) {
struct gguf_context * ctx = new gguf_context;
bool ok = true;
@@ -700,14 +749,14 @@ struct gguf_context * gguf_init_from_file_ptr(FILE * file, struct gguf_init_para
GGML_ASSERT(int64_t(ctx->info.size()) == n_tensors);
// we require the data section to be aligned, so take into account any padding
if (gguf_fseek(file, GGML_PAD(gguf_ftell(file), ctx->alignment), SEEK_SET) != 0) {
if (n_tensors > 0 && !gr.seek(GGML_PAD(gr.tell(), ctx->alignment))) {
GGML_LOG_ERROR("%s: failed to seek to beginning of data section\n", __func__);
gguf_free(ctx);
return nullptr;
}
// store the current file offset - this is where the data section starts
ctx->offset = gguf_ftell(file);
ctx->offset = gr.tell();
// compute the total size of the data section, taking into account the alignment
{
@@ -844,6 +893,89 @@ struct gguf_context * gguf_init_from_file_ptr(FILE * file, struct gguf_init_para
return ctx;
}
struct gguf_context * gguf_init_from_callback(gguf_reader_callback_t callback, void * userdata, size_t max_chunk_read, uint64_t max_expected_size, struct gguf_init_params params) {
if (callback == nullptr) {
return nullptr;
}
const struct gguf_reader gr(callback, userdata, max_chunk_read == 0 ? SIZE_MAX : max_chunk_read, 0, max_expected_size);
return gguf_init_from_reader(gr, params);
}
struct gguf_file_reader {
FILE * file;
uint64_t offset;
};
static size_t gguf_file_reader_callback(void * userdata, void * output, uint64_t offset, size_t len) {
GGML_ASSERT(len > 0);
gguf_file_reader & reader = *static_cast<gguf_file_reader *>(userdata);
if (reader.offset != offset) {
if (offset > INT64_MAX || gguf_fseek(reader.file, static_cast<int64_t>(offset), SEEK_SET) != 0) {
return 0;
}
reader.offset = offset;
}
const size_t nread = fread(static_cast<uint8_t *>(output), 1, len, reader.file);
reader.offset += nread;
return nread;
}
struct gguf_context * gguf_init_from_file_ptr(FILE * file, struct gguf_init_params params) {
if (!file) {
return nullptr;
}
const int64_t cur = gguf_ftell(file);
if (cur < 0) {
return nullptr;
}
gguf_file_reader reader = {
/*.file = */ file,
/*.offset = */ static_cast<uint64_t>(cur),
};
const struct gguf_reader gr(gguf_file_reader_callback, &reader, SIZE_MAX, reader.offset, gguf_reader::file_remain(file));
return gguf_init_from_reader(gr, params);
}
struct gguf_buffer_reader {
const uint8_t * data;
size_t size;
};
static size_t gguf_buffer_reader_callback(void * userdata, void * output, uint64_t offset, size_t len) {
GGML_ASSERT(len > 0);
const gguf_buffer_reader & reader = *static_cast<gguf_buffer_reader *>(userdata);
if (offset > reader.size || len > reader.size - offset) {
return 0;
}
const size_t data_offset = static_cast<size_t>(offset);
const size_t nread = std::min(len, reader.size - data_offset);
memcpy(static_cast<uint8_t *>(output), reader.data + data_offset, nread);
return nread;
}
struct gguf_context * gguf_init_from_buffer(const void * data, size_t size, struct gguf_init_params params) {
if (data == nullptr || size == 0) {
return nullptr;
}
gguf_buffer_reader reader = {
/*.data = */ static_cast<const uint8_t *>(data),
/*.size = */ size,
};
const struct gguf_reader gr(gguf_buffer_reader_callback, &reader, SIZE_MAX, 0, size);
return gguf_init_from_reader(gr, params);
}
struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_params params) {
FILE * file = ggml_fopen(fname, "rb");

View File

@@ -28,6 +28,7 @@ def quant_shape_from_byte_shape(shape: Sequence[int], quant_type: GGMLQuantizati
# This is faster than np.vectorize and np.apply_along_axis because it works on more than one row at a time
def _apply_over_grouped_rows(func: Callable[[np.ndarray], np.ndarray], arr: np.ndarray, otype: DTypeLike, oshape: tuple[int, ...]) -> np.ndarray:
rows = arr.reshape((-1, arr.shape[-1]))
assert len(rows.shape)
osize = 1
for dim in oshape:
osize *= dim

View File

@@ -874,7 +874,8 @@ extern "C" {
// work only with partial states, such as SWA KV cache or recurrent cache (e.g. Mamba)
#define LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY 1
// keeps the tensor data on device buffers (i.e. not accessible in host memory, but faster save/load)
// Keeps the tensor data on device buffers (i.e. not accessible in host memory, but faster save/load).
// Getting the state for a seq_id with this flag invalidates all prior states gotten for that seq_id with this flag.
#define LLAMA_STATE_SEQ_FLAGS_ON_DEVICE 2
typedef uint32_t llama_state_seq_flags;

View File

@@ -1,8 +1,8 @@
-r ./requirements-convert_legacy_llama.txt
--extra-index-url https://download.pytorch.org/whl/cpu
## Embedding Gemma requires PyTorch 2.6.0 or later
torch~=2.6.0; platform_machine != "s390x"
## Embedding Gemma requires PyTorch 2.6.0 or later, bumped to 2.11.0 for compatibility
torch==2.11.0; platform_machine != "s390x"
# torch s390x packages can only be found from nightly builds
--extra-index-url https://download.pytorch.org/whl/nightly

View File

@@ -7,10 +7,10 @@ $ErrorActionPreference = "Stop"
$BaseDir = "C:\Qualcomm"
# SDK 1: Hexagon
$HexagonUrl = "https://github.com/snapdragon-toolchain/hexagon-sdk/releases/download/v6.4.0.2/hexagon-sdk-v6.4.0.2-arm64-wos.tar.xz"
$HexagonUrl = "https://github.com/snapdragon-toolchain/hexagon-sdk/releases/download/v6.6.0.0/hexagon-sdk-v6.6.0.0-arm64-wos.tar.xz"
$HexagonParent = Join-Path $BaseDir "Hexagon_SDK"
$HexagonSdkVersion = "6.4.0.2"
$HexagonToolsVersion = "19.0.04"
$HexagonSdkVersion = "6.6.0.0"
$HexagonToolsVersion = "19.0.07"
$HexagonSdkTarget = Join-Path $HexagonParent $HexagonSdkVersion
$HexagonToolsTarget = Join-Path $HexagonSdkTarget "\tools\HEXAGON_Tools\$HexagonToolsVersion"

View File

@@ -1 +1 @@
0ce7ad348a3151e1da9f65d962044546bcaad421
e705c5fed490514458bdd2eaddc43bd098fcce9b

View File

@@ -5,7 +5,7 @@ import os
import sys
import subprocess
HTTPLIB_VERSION = "refs/tags/v0.45.0"
HTTPLIB_VERSION = "refs/tags/v0.45.1"
vendor = {
"https://github.com/nlohmann/json/releases/latest/download/json.hpp": "vendor/nlohmann/json.hpp",

342
scripts/ui-assets.cmake Normal file
View File

@@ -0,0 +1,342 @@
# Provision UI assets and generate ui.cpp/ui.h.
#
# Asset provisioning priority:
# 1. Pre-built assets in SRC_DIST_DIR (manually built by user)
# 2. If BUILD_UI=ON: npm build
# 3. If above did not produce assets and HF_ENABLED=ON: HF Bucket download
cmake_minimum_required(VERSION 3.16)
set(UI_SOURCE_DIR "" CACHE STRING "UI source directory (to run npm build)")
set(UI_BINARY_DIR "" CACHE STRING "UI binary directory (to store generated files)")
set(LLAMA_SOURCE_DIR "" CACHE STRING "Project source root (to resolve version from git)")
set(HF_BUCKET "" CACHE STRING "Hugging Face bucket name")
set(HF_VERSION "" CACHE STRING "Version to download (empty = resolve from git)")
set(HF_ENABLED "" CACHE STRING "Whether to allow HF Bucket download (ON/OFF)")
set(BUILD_UI "" CACHE STRING "Build UI via npm (ON/OFF)")
set(LLAMA_UI_EMBED "" CACHE STRING "Path to llama-ui-embed helper")
set(ASSETS
bundle.css
bundle.js
index.html
loading.html
)
set(DIST_DIR "${UI_BINARY_DIR}/dist")
set(SRC_DIST_DIR "${UI_SOURCE_DIR}/dist")
set(STAMP_FILE "${UI_BINARY_DIR}/.ui-stamp")
set(UI_CPP "${UI_BINARY_DIR}/ui.cpp")
set(UI_H "${UI_BINARY_DIR}/ui.h")
function(assets_present out_var)
set(present TRUE)
foreach(asset ${ASSETS})
if(NOT EXISTS "${DIST_DIR}/${asset}")
set(present FALSE)
break()
endif()
endforeach()
set(${out_var} ${present} PARENT_SCOPE)
endfunction()
function(copy_src_dist out_var)
set(${out_var} FALSE PARENT_SCOPE)
foreach(asset ${ASSETS})
if(NOT EXISTS "${SRC_DIST_DIR}/${asset}")
return()
endif()
endforeach()
file(MAKE_DIRECTORY "${DIST_DIR}")
message(STATUS "UI: using pre-built assets from ${SRC_DIST_DIR}")
foreach(asset ${ASSETS})
execute_process(
COMMAND ${CMAKE_COMMAND} -E copy_if_different
"${SRC_DIST_DIR}/${asset}" "${DIST_DIR}/${asset}"
)
endforeach()
set(${out_var} TRUE PARENT_SCOPE)
endfunction()
function(npm_build_should_skip out_var)
set(${out_var} FALSE PARENT_SCOPE)
assets_present(present)
if(NOT present)
return()
endif()
if(EXISTS "${STAMP_FILE}")
return()
endif()
if(NOT EXISTS "${UI_SOURCE_DIR}/sources.cmake")
return()
endif()
include("${UI_SOURCE_DIR}/sources.cmake")
set(globs "")
foreach(g ${UI_SOURCE_GLOBS})
list(APPEND globs "${UI_SOURCE_DIR}/${g}")
endforeach()
file(GLOB_RECURSE sources ${globs})
foreach(f ${UI_SOURCE_FILES})
list(APPEND sources "${UI_SOURCE_DIR}/${f}")
endforeach()
file(TIMESTAMP "${DIST_DIR}/index.html" out_ts)
foreach(s ${sources})
if(NOT EXISTS "${s}")
continue()
endif()
file(TIMESTAMP "${s}" s_ts)
if(s_ts STRGREATER out_ts)
return()
endif()
endforeach()
set(${out_var} TRUE PARENT_SCOPE)
endfunction()
function(npm_build out_var)
set(${out_var} FALSE PARENT_SCOPE)
if(NOT EXISTS "${UI_SOURCE_DIR}/package.json")
message(STATUS "UI: ${UI_SOURCE_DIR}/package.json not found, skipping npm")
return()
endif()
npm_build_should_skip(skip)
if(skip)
message(STATUS "UI: npm output up-to-date, skipping build")
set(${out_var} TRUE PARENT_SCOPE)
return()
endif()
if(CMAKE_HOST_WIN32)
find_program(NPM_EXECUTABLE NAMES npm.cmd npm.bat npm)
else()
find_program(NPM_EXECUTABLE npm)
endif()
if(NOT NPM_EXECUTABLE)
message(STATUS "UI: npm not found, skipping npm build")
return()
endif()
if(NOT EXISTS "${UI_SOURCE_DIR}/node_modules")
message(STATUS "UI: running npm install (first time)")
execute_process(
COMMAND ${NPM_EXECUTABLE} install
WORKING_DIRECTORY "${UI_SOURCE_DIR}"
RESULT_VARIABLE rc
ERROR_VARIABLE err
)
if(NOT rc EQUAL 0)
message(STATUS "UI: npm install failed (${rc})")
message(STATUS " stderr: ${err}")
return()
endif()
endif()
file(MAKE_DIRECTORY "${DIST_DIR}")
message(STATUS "UI: running npm run build, output -> ${DIST_DIR}")
execute_process(
COMMAND ${CMAKE_COMMAND} -E env "LLAMA_UI_OUT_DIR=${DIST_DIR}"
${NPM_EXECUTABLE} run build
WORKING_DIRECTORY "${UI_SOURCE_DIR}"
RESULT_VARIABLE rc
ERROR_VARIABLE err
)
if(NOT rc EQUAL 0)
message(STATUS "UI: npm run build failed (${rc})")
message(STATUS " stderr: ${err}")
return()
endif()
assets_present(present)
if(NOT present)
message(STATUS "UI: npm build finished but assets missing in ${DIST_DIR}")
return()
endif()
message(STATUS "UI: npm build succeeded")
file(REMOVE "${STAMP_FILE}")
set(${out_var} TRUE PARENT_SCOPE)
endfunction()
function(resolve_version out_var)
if(NOT "${HF_VERSION}" STREQUAL "")
set(${out_var} "${HF_VERSION}" PARENT_SCOPE)
return()
endif()
if(EXISTS "${LLAMA_SOURCE_DIR}/cmake/build-info.cmake")
include("${LLAMA_SOURCE_DIR}/cmake/build-info.cmake")
if(NOT "${BUILD_NUMBER}" STREQUAL "" AND NOT BUILD_NUMBER EQUAL 0)
set(${out_var} "b${BUILD_NUMBER}" PARENT_SCOPE)
return()
endif()
endif()
set(${out_var} "" PARENT_SCOPE)
endfunction()
function(hf_download version out_var out_resolved)
set(${out_var} FALSE PARENT_SCOPE)
set(${out_resolved} "" PARENT_SCOPE)
file(MAKE_DIRECTORY "${DIST_DIR}")
set(candidates "")
if(NOT "${version}" STREQUAL "")
list(APPEND candidates "${version}")
endif()
list(APPEND candidates "latest")
foreach(resolved ${candidates})
set(base "https://huggingface.co/buckets/ggml-org/${HF_BUCKET}/resolve/${resolved}")
message(STATUS "UI: downloading from ${resolved}: ${base}")
set(ok TRUE)
foreach(asset ${ASSETS})
file(DOWNLOAD "${base}/${asset}?download=true" "${DIST_DIR}/${asset}"
STATUS status TIMEOUT 60
)
list(GET status 0 rc)
if(NOT rc EQUAL 0)
list(GET status 1 errmsg)
message(STATUS "UI: download ${asset} from ${resolved} failed: ${errmsg}")
set(ok FALSE)
break()
endif()
message(STATUS "UI: downloaded ${asset}")
endforeach()
if(NOT ok)
continue()
endif()
# Best-effort checksum verification
file(DOWNLOAD "${base}/checksums.txt?download=true" "${DIST_DIR}/checksums.txt"
STATUS cs_status TIMEOUT 30
)
list(GET cs_status 0 cs_rc)
if(cs_rc EQUAL 0)
message(STATUS "UI: verifying checksums")
file(STRINGS "${DIST_DIR}/checksums.txt" cs_lines)
foreach(asset ${ASSETS})
file(SHA256 "${DIST_DIR}/${asset}" h)
string(TOLOWER "${h}" h)
string(REGEX MATCH "${h}[ \t]+${asset}" m "${cs_lines}")
if(NOT m)
message(WARNING "UI: checksum verification failed for ${asset}")
set(ok FALSE)
break()
endif()
endforeach()
if(ok)
message(STATUS "UI: all checksums verified")
endif()
endif()
if(ok)
set(${out_var} TRUE PARENT_SCOPE)
set(${out_resolved} "${resolved}" PARENT_SCOPE)
return()
endif()
endforeach()
endfunction()
function(emit_files)
assets_present(present)
set(args "${UI_CPP}" "${UI_H}")
if(present)
foreach(asset ${ASSETS})
list(APPEND args "${asset}" "${DIST_DIR}/${asset}")
endforeach()
endif()
execute_process(
COMMAND "${LLAMA_UI_EMBED}" ${args}
RESULT_VARIABLE rc
)
if(NOT rc EQUAL 0)
message(FATAL_ERROR "UI: llama-ui-embed failed (${rc})")
endif()
endfunction()
# ---------------------------------------------------------------------------
# 1. Priority 1: pre-built assets supplied in tools/ui/dist
# ---------------------------------------------------------------------------
copy_src_dist(SRC_OK)
if(SRC_OK)
emit_files()
return()
endif()
# ---------------------------------------------------------------------------
# 2. Priority 2: npm build (if BUILD_UI=ON)
# ---------------------------------------------------------------------------
set(provisioned FALSE)
if(BUILD_UI)
npm_build(NPM_OK)
if(NPM_OK)
set(provisioned TRUE)
endif()
endif()
# ---------------------------------------------------------------------------
# 3. Priority 3: HF Bucket download (if npm did not produce assets and HF_ENABLED=ON)
# ---------------------------------------------------------------------------
if(NOT provisioned AND HF_ENABLED)
resolve_version(VERSION)
set(stamp_ok FALSE)
if(EXISTS "${STAMP_FILE}" AND NOT "${VERSION}" STREQUAL "")
file(READ "${STAMP_FILE}" stamped)
string(STRIP "${stamped}" stamped)
if("${stamped}" STREQUAL "${VERSION}")
set(stamp_ok TRUE)
endif()
endif()
assets_present(have_assets)
if(stamp_ok AND have_assets)
message(STATUS "UI: HF stamp '${stamped}' matches version, skipping HF fetch")
set(provisioned TRUE)
else()
hf_download("${VERSION}" HF_OK HF_RESOLVED)
if(HF_OK)
file(WRITE "${STAMP_FILE}" "${HF_RESOLVED}")
message(STATUS "UI: HF download succeeded, stamp updated (${HF_RESOLVED})")
set(provisioned TRUE)
else()
message(STATUS "UI: HF download failed")
endif()
endif()
endif()
# ---------------------------------------------------------------------------
# 4. Fallback: warn about stale or missing assets, then emit whatever we have
# ---------------------------------------------------------------------------
if(NOT provisioned)
assets_present(have_assets)
if(have_assets)
message(WARNING "UI: provisioning failed; embedding stale assets from ${DIST_DIR}")
else()
message(WARNING "UI: no assets available - building without an embedded UI. "
"In a disconnected environment, download the pre-built UI "
"from a llama.cpp release at "
"https://github.com/ggml-org/llama.cpp/releases and "
"extract to tools/ui/dist.")
endif()
endif()
emit_files()

View File

@@ -1,223 +0,0 @@
# Download UI assets from Hugging Face Bucket at build time
# Usage: cmake -DPUBLIC_DIR=... -DHF_BUCKET=... -DHF_VERSION=... -DASSETS="a;b;c" -P scripts/ui-download.cmake
#
# Asset provisioning priority:
# 1. Pre-built assets already in PUBLIC_DIR (cached from a previous run)
# 2. Local npm build (if NPM_DIR is provided and has package.json)
# 3. Hugging Face Bucket download (version-specific, then 'latest' fallback)
cmake_minimum_required(VERSION 3.16)
set(PUBLIC_DIR "" CACHE STRING "Directory to store/download assets")
set(HF_BUCKET "" CACHE STRING "Hugging Face bucket name")
set(HF_VERSION "" CACHE STRING "Version to download (empty = resolve from git)")
set(ASSETS "" CACHE STRING "Plus-separated list of asset filenames (+)")
set(STAMP_FILE "" CACHE STRING "Stamp file to create on success (optional)")
set(SOURCE_DIR "" CACHE STRING "Project source root (to resolve version from git)")
set(NPM_DIR "" CACHE STRING "UI source directory (to run npm build)")
set(HF_ENABLED "" CACHE STRING "Whether to allow HF Bucket download (ON/OFF)")
# ---------------------------------------------------------------------------
# 1. Resolve version from git if not provided at configure time
# ---------------------------------------------------------------------------
set(RESOLVED_VERSION "${HF_VERSION}")
if("${RESOLVED_VERSION}" STREQUAL "" AND NOT "${SOURCE_DIR}" STREQUAL "")
if(EXISTS "${SOURCE_DIR}/cmake/build-info.cmake")
include("${SOURCE_DIR}/cmake/build-info.cmake")
if(NOT "${BUILD_NUMBER}" STREQUAL "" AND NOT BUILD_NUMBER EQUAL 0)
set(RESOLVED_VERSION "b${BUILD_NUMBER}")
message(STATUS "UI: resolved version from git: ${RESOLVED_VERSION}")
endif()
endif()
endif()
# Convert + back to CMake list (+ is used as separator instead of ; to
# avoid platform-specific escaping issues when passing via -D arguments)
string(REGEX REPLACE "\\+" ";" ASSETS "${ASSETS}")
# ---------------------------------------------------------------------------
# 2. Check stamp freshness — re-download if resolved version changed
# ---------------------------------------------------------------------------
set(FORCE_REBUILD FALSE)
if(NOT "${STAMP_FILE}" STREQUAL "" AND EXISTS "${STAMP_FILE}")
file(READ "${STAMP_FILE}" STAMPED_VERSION)
string(STRIP "${STAMPED_VERSION}" STAMPED_VERSION)
if(NOT "${STAMPED_VERSION}" STREQUAL "${RESOLVED_VERSION}")
message(STATUS "UI: version changed (${STAMPED_VERSION} -> ${RESOLVED_VERSION}), re-building")
set(FORCE_REBUILD TRUE)
endif()
endif()
# ---------------------------------------------------------------------------
# 3. Check if assets already exist (cached from a previous run)
# ---------------------------------------------------------------------------
set(ALL_EXISTS TRUE)
foreach(asset ${ASSETS})
if(NOT EXISTS "${PUBLIC_DIR}/${asset}")
set(ALL_EXISTS FALSE)
break()
endif()
endforeach()
if(ALL_EXISTS AND NOT FORCE_REBUILD)
message(STATUS "UI: all assets already exist in ${PUBLIC_DIR}, skipping")
return()
endif()
file(MAKE_DIRECTORY "${PUBLIC_DIR}")
# ---------------------------------------------------------------------------
# 4. Priority 2: build from source via npm (fast path for developers)
# ---------------------------------------------------------------------------
set(PROVISION_SUCCESS FALSE)
if(NOT PROVISION_SUCCESS AND NOT "${NPM_DIR}" STREQUAL "")
if(EXISTS "${NPM_DIR}/package.json")
# Check if npm is available before attempting npm build
find_program(NPM_EXECUTABLE npm)
if(NPM_EXECUTABLE)
message(STATUS "UI: building from source in ${NPM_DIR}")
# Run npm install if node_modules is missing
if(NOT EXISTS "${NPM_DIR}/node_modules")
message(STATUS "UI: running npm install (first time)")
execute_process(
COMMAND ${NPM_EXECUTABLE} install
WORKING_DIRECTORY "${NPM_DIR}"
RESULT_VARIABLE NPM_INSTALL_RESULT
OUTPUT_VARIABLE NPM_OUT
ERROR_VARIABLE NPM_ERR
)
if(NOT NPM_INSTALL_RESULT EQUAL 0)
message(STATUS "UI: npm install failed (${NPM_INSTALL_RESULT}), falling back to download")
message(STATUS " stderr: ${NPM_ERR}")
endif()
endif()
# Run the build
execute_process(
COMMAND ${NPM_EXECUTABLE} run build
WORKING_DIRECTORY "${NPM_DIR}"
RESULT_VARIABLE NPM_BUILD_RESULT
OUTPUT_VARIABLE NPM_OUT
ERROR_VARIABLE NPM_ERR
)
if(NPM_BUILD_RESULT EQUAL 0)
# Verify that the expected assets were produced
set(ALL_BUILT TRUE)
foreach(asset ${ASSETS})
if(NOT EXISTS "${PUBLIC_DIR}/${asset}")
set(ALL_BUILT FALSE)
break()
endif()
endforeach()
if(ALL_BUILT)
message(STATUS "UI: local npm build succeeded")
set(PROVISION_SUCCESS TRUE)
else()
message(STATUS "UI: npm build completed but assets missing from ${PUBLIC_DIR}, falling back to download")
endif()
else()
message(STATUS "UI: npm build failed (${NPM_BUILD_RESULT}), falling back to download")
message(STATUS " stderr: ${NPM_ERR}")
endif()
else()
message(STATUS "UI: npm not found, skipping npm build and trying HF Bucket download")
endif()
else()
message(STATUS "UI: NPM_DIR (${NPM_DIR}) has no package.json, skipping npm build")
endif()
endif()
# ---------------------------------------------------------------------------
# 5. Priority 3: download from Hugging Face Bucket (if enabled)
# ---------------------------------------------------------------------------
if(NOT PROVISION_SUCCESS AND HF_ENABLED)
# Build list of URLs to try — version-specific first, then 'latest'
set(URL_ENTRIES "")
if(NOT "${RESOLVED_VERSION}" STREQUAL "")
list(APPEND URL_ENTRIES
"version:https://huggingface.co/buckets/ggml-org/${HF_BUCKET}/resolve/${RESOLVED_VERSION}")
endif()
list(APPEND URL_ENTRIES
"latest:https://huggingface.co/buckets/ggml-org/${HF_BUCKET}/resolve/latest")
foreach(entry ${URL_ENTRIES})
string(REGEX REPLACE "^([^:]+):.*$" "\\1" url_label "${entry}")
string(REGEX REPLACE "^[^:]+:(.*)$" "\\1" base_url "${entry}")
message(STATUS "UI: downloading assets from ${url_label}: ${base_url}")
# Download each asset
set(ALL_OK TRUE)
foreach(asset ${ASSETS})
set(download_url "${base_url}/${asset}?download=true")
set(download_path "${PUBLIC_DIR}/${asset}")
file(DOWNLOAD "${download_url}" "${download_path}"
STATUS download_status TIMEOUT 60
)
list(GET download_status 0 download_result)
if(NOT download_result EQUAL 0)
list(GET download_status 1 error_message)
message(STATUS "UI: failed to download ${asset} from ${url_label}: ${error_message}")
set(ALL_OK FALSE)
break()
endif()
message(STATUS "UI: downloaded ${asset}")
endforeach()
if(NOT ALL_OK)
continue()
endif()
# Verify checksums if the server provides them
file(DOWNLOAD "${base_url}/checksums.txt?download=true"
"${PUBLIC_DIR}/checksums.txt"
STATUS checksum_status TIMEOUT 30
)
list(GET checksum_status 0 checksum_result)
if(checksum_result EQUAL 0)
message(STATUS "UI: verifying checksums...")
file(STRINGS "${PUBLIC_DIR}/checksums.txt" CHECKSUMS_CONTENT)
foreach(asset ${ASSETS})
set(download_path "${PUBLIC_DIR}/${asset}")
file(SHA256 "${download_path}" asset_hash)
string(TOLOWER "${asset_hash}" EXPECTED_HASH_LOWER)
string(REGEX MATCH "${EXPECTED_HASH_LOWER}[ \\t]+${asset}" CHECKSUM_LINE "${CHECKSUMS_CONTENT}")
if(NOT CHECKSUM_LINE)
message(WARNING "UI: checksum verification failed for ${asset}")
set(ALL_OK FALSE)
break()
endif()
endforeach()
if(ALL_OK)
message(STATUS "UI: all checksums verified")
endif()
endif()
if(ALL_OK)
set(PROVISION_SUCCESS TRUE)
break()
endif()
endforeach()
if(PROVISION_SUCCESS)
message(STATUS "UI: provisioning complete")
else()
message(WARNING "UI: failed to download assets from HF Bucket (${HF_BUCKET})")
endif()
endif()
# ---------------------------------------------------------------------------
# 6. Write stamp file on success (stores resolved version for freshness check)
# ---------------------------------------------------------------------------
if(PROVISION_SUCCESS)
if(NOT "${STAMP_FILE}" STREQUAL "")
file(WRITE "${STAMP_FILE}" "${RESOLVED_VERSION}")
endif()
else()
message(WARNING "UI: no source available. Neither local build (${NPM_DIR}) nor HF Bucket download succeeded.")
message(WARNING "UI: building server without embedded UI. Set LLAMA_BUILD_UI=OFF to suppress this warning.")
endif()

Some files were not shown because too many files have changed in this diff Show More