Compare commits

...

23 Commits

Author SHA1 Message Date
Mikolaj Kucharski
7fb1e70b59 arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167) 2026-05-28 16:25:40 +02:00
Johannes Gäßler
d374e71e55 test-llama-archs: fix table format [no release] (#23810) 2026-05-28 15:53:54 +02:00
fl0rianr
30af6e2b98 ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007) 2026-05-28 15:01:14 +02:00
redfox
d7be46189f mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729)
* mmvq Optim:  add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING

* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-28 14:51:14 +02:00
Jaden_Mach
bc81d47aba CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#23227)
* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware

The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.

This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):

  Q3_K, Q4_K, Q5_K  : MMVQ <= 3   (MMQ wins from batch=4: +5% .. +76%)
  Q2_K, Q6_K        : MMVQ <= 5   (MMQ wins from batch=6: +8% .. +35%)
  others            : MMVQ <= 8   (legacy & IQ regress under MMQ; unchanged)

Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.

Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.

  Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S:  559 -> 940  (+68%)
    Q5_K_S:  503 -> 884  (+76%)
    Q3_K_S:  629 -> 879  (+40%)
    Q2_K  :  615 -> 809  (+32%)
    Q6_K  :  582 -> 776  (+33%)

  Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S:  444 -> 480  (+ 8%)
    Q4_0  :  682 -> 685  (+ 0%)   (no regression - retains MMVQ)
    IQ4_XS:  706 -> 698  (- 1%)   (no regression - retains MMVQ)

* CUDA: address review — inline MMVQ batch table, drop env hatch & doc block

* tune kernel selection logic for CDNA1

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-28 14:50:25 +02:00
Funtowicz Morgan
0b246862b9 server: minor tweaks to use more cpp features (#23785)
* misc(server): add default port to impl RAII

* misc(server): register_gcp_compat() can be const

* misc(server): use proper cpp const/auto methods

* misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe
2026-05-28 14:00:25 +02:00
Max Krasnyansky
a919001134 hexagon: minor refresh for HMX FA and MM (#23796)
* hex-fa: clean up qf32/fp32 handling and stride handling

* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79

* hex-fa: vectorize leftover handling

* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity

* hmx-mm: remove dead code

* hmx-mm: use fastdiv in x4x2 dequant

* hmx-mm: sandwich dequant and scatter to improve perf

* hmx-mm: fixed rebase conflicts

* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

* hmx-mm: an even earlier dispatch for per-type dequant

* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-05-28 04:49:11 -07:00
Jeff Bolz
48e7078ee0 vulkan: fast path for walsh-hadamard transform (#23687)
* vulkan: fast path for walsh-hadamard transform

* disable for intel due to segfault
2026-05-28 13:18:43 +02:00
Jesus Talavera
bb771cbd2b chat : add Granite 4.1 chat template (#23518) 2026-05-28 13:13:33 +02:00
Winston Ma
7c48fb81ce vulkan: fix wrong index variable in inner loop (#23665) 2026-05-28 12:48:34 +02:00
Winston Ma
91eb8f4fa0 vulkan: Fix memory logger unsafe iterator access (#23667) 2026-05-28 12:46:07 +02:00
Markus Tavenrath
d205df6812 server, ui : Add support for HTTP ETags in llama-server (#23701)
* allow caching of ui elements in llama-server

* use fnv_hash

* Update tools/server/server-http.cpp

etag has to be set always

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-05-28 12:21:24 +02:00
Sachin Sharma
e8d2567429 docker : add ZenDNN Dockerfile (#23716) 2026-05-28 11:40:49 +02:00
fairydreaming
09e7b76c93 cuda : fix KQ mask offset integer overflow in fattn MMA kernel (#23610)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-28 10:55:42 +02:00
Adrien Gallouët
48e7eae41c perplexity : fix format specifier in LOG_ERR (#23788)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-28 10:34:58 +03:00
ynankani
c5229087a5 convert : add FP8 to Q8 conversion (#23250)
Signed-off-by: ynankani <ynankani@nvidia.com>
2026-05-28 10:16:17 +03:00
Martin Klacer
e31cdaa0eb ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (#22841)
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16



Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
2026-05-28 10:04:21 +03:00
Georgi Gerganov
491c4d7d2e ci : refactor (#23789)
* ci : separate CUDA windows workflow + fix names

* ci : rename workflow

* ci : prefix cache names with workflow name

* ci : rename build.yml -> build-cpu.yml

* ci : cache keys

* ci : fix windows cuda/hip concurrency of release workflow

* ci : fix apple cache names

* ci : add TODOs

* cont : keep just the last cache

* ci : update release concurrency to queue

* ci : move the release trigger to ubuntu-slim

* ci : hip add TODO

* cont : improve words

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-28 09:44:25 +03:00
ymcki
939a7dd648 Hexagon: OP_GATED_DELTA_NET K>1 support (#23531)
* K>1 state snapshot support

* removed picky indent multiple of 4 fixes
2026-05-27 23:05:25 -07:00
ymcki
8ad8aef447 opencl: OP_GATED_DELTA_NET (#23312)
* OP_GATED_DELTA_NET impl

* add back lanes_per_column declaration

* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce

* removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot

* support for K>1 state snapshot

* removed picky indent multiple of 4 fixes

* removed return that won\'t be executed
2026-05-27 21:23:21 -07:00
Reese Levine
f12cc6d0fa ggml-webgpu: remove legacy constants (#23672) 2026-05-27 14:22:33 -07:00
Max Krasnyansky
aa50b2c2ae hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647)
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now

* hmx-mm: add support for Q4_1

* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot

* hexagon: fix repack scratch buffer overflow

* hex-mm: fix Q4_1 repack buffer sizing

* hexagon: flip the build order for mm and fa (seems to help LTO)

* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1

* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output

* hexagon: resurrect early-wake and add support for polling for op-batch completions

With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking.

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2026-05-27 10:46:11 -07:00
Masashi Yoshimura
c40006a62e ggml-webgpu: Fix how to dispatch WG to some ops (#23750) 2026-05-27 09:48:12 -07:00
60 changed files with 3938 additions and 892 deletions

101
.devops/zendnn.Dockerfile Normal file
View File

@@ -0,0 +1,101 @@
ARG UBUNTU_VERSION=24.04
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
FROM ubuntu:$UBUNTU_VERSION AS build
RUN apt-get update && \
apt-get install -y gcc-13 g++-13 build-essential git cmake libssl-dev libomp-dev libnuma-dev python3 ca-certificates
ENV CC=gcc-13 CXX=g++-13
WORKDIR /app
COPY . .
RUN cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_TESTS=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_ZENDNN=ON && \
cmake --build build -j $(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so*" -exec cp -P {} /app/lib \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
## Base image
FROM ubuntu:$UBUNTU_VERSION AS base
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE
RUN apt-get update \
&& apt-get install -y libgomp1 libnuma1 curl \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
### Full
FROM base AS full
COPY --from=build /app/full /app
WORKDIR /app
RUN apt-get update \
&& apt-get install -y \
git \
python3 \
python3-pip \
python3-wheel \
&& pip install --break-system-packages --upgrade setuptools \
&& pip install --break-system-packages -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-cli" ]
### Server, Server only
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]

View File

@@ -32,7 +32,7 @@ env:
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
android:
default:
runs-on: ubuntu-latest
steps:
@@ -58,7 +58,7 @@ jobs:
cd examples/llama.android
./gradlew build --no-daemon
android-ndk:
ndk:
runs-on: ubuntu-latest
container:
image: 'ghcr.io/snapdragon-toolchain/arm64-android:v0.3'
@@ -92,7 +92,7 @@ jobs:
name: llama-cpp-android-arm64-cpu
path: pkg-adb/llama.cpp
android-arm64:
arm64:
runs-on: ubuntu-latest
env:
@@ -103,12 +103,18 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: android-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
# note : disabled to spare some cache space (https://github.com/ggml-org/llama.cpp/pull/23789)
# for some reason, the ccache does not improve the build time in this case
# example:
# cache off: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78160400831
# cache on: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78224189394
#
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: android-ubuntu-arm64
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Set up JDK
uses: actions/setup-java@v5

View File

@@ -48,7 +48,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macos-latest-arm64
key: apple-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -84,7 +84,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macos-latest-x64
key: apple-x64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -117,10 +117,11 @@ jobs:
id: checkout
uses: actions/checkout@v6
# TODO: this likely does not do anything - if yes, remove it
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macos-latest-ios
key: apple-ios
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -197,10 +198,11 @@ jobs:
id: checkout
uses: actions/checkout@v6
# TODO: this likely does not do anything - if yes, remove it
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macos-latest-tvos
key: apple-tvos
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -230,6 +232,14 @@ jobs:
id: checkout
uses: actions/checkout@v6
# TODO: this likely does not do anything - if yes, remove it
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: apple-visionos
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
@@ -261,10 +271,11 @@ jobs:
id: checkout
uses: actions/checkout@v6
# TODO: this likely does not do anything - if yes, remove it
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macos-latest-swift
key: apple-swift
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -1,4 +1,4 @@
name: CI
name: CI (cpu)
on:
workflow_dispatch: # allows manual triggering
@@ -6,7 +6,7 @@ on:
branches:
- master
paths: [
'.github/workflows/build.yml',
'.github/workflows/build-cpu.yml',
'.github/workflows/build-cmake-pkg.yml',
'**/CMakeLists.txt',
'**/.cmake',
@@ -27,7 +27,7 @@ on:
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build.yml',
'.github/workflows/build-cpu.yml',
'.github/workflows/build-cmake-pkg.yml',
'**/CMakeLists.txt',
'**/.cmake',
@@ -60,7 +60,7 @@ jobs:
build-cmake-pkg:
uses: ./.github/workflows/build-cmake-pkg.yml
ubuntu-cpu:
ubuntu:
strategy:
matrix:
include:
@@ -79,7 +79,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-cpu-${{ matrix.build }}
key: cpu-${{ matrix.os }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -131,46 +131,7 @@ jobs:
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
ubuntu-24-vulkan:
strategy:
matrix:
include:
- build: 'x64'
os: ubuntu-24.04
- build: 'arm64'
os: ubuntu-24.04-arm
runs-on: ${{ matrix.os }}
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev spirv-headers libssl-dev ninja-build
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: Configure
id: cmake_configure
run: |
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DGGML_BACKEND_DL=ON \
-DGGML_CPU_ALL_VARIANTS=ON \
-DGGML_VULKAN=ON
- name: Build
id: cmake_build
run: |
time cmake --build build -j $(nproc)
windows-latest:
windows:
runs-on: windows-2025
env:
@@ -202,7 +163,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-${{ matrix.build }}
key: cpu-windows-2025-${{ matrix.build }}
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -268,88 +229,3 @@ jobs:
# cd build
# $env:LLAMA_SKIP_TESTS_SLOW_ON_EMULATOR = 1
# & $sde -future -- ctest -L main -C Release --verbose --timeout 900
ubuntu-latest-cuda:
runs-on: ubuntu-latest
container: nvidia/cuda:12.6.2-devel-ubuntu24.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Install dependencies
env:
DEBIAN_FRONTEND: noninteractive
run: |
apt update
apt install -y cmake build-essential ninja-build libgomp1 git libssl-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-latest-cuda
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with CMake
# TODO: Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project
run: |
cmake -S . -B build -G Ninja \
-DLLAMA_FATAL_WARNINGS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=89-real \
-DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined \
-DGGML_NATIVE=OFF \
-DGGML_CUDA=ON \
-DGGML_CUDA_CUB_3DOT2=ON
cmake --build build
windows-2022-cuda:
runs-on: windows-2022
strategy:
matrix:
cuda: ['12.4']
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-cuda-${{ matrix.cuda }}
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Install Cuda Toolkit
uses: ./.github/actions/windows-setup-cuda
with:
cuda_version: ${{ matrix.cuda }}
- name: Install Ninja
id: install_ninja
run: |
choco install ninja
- name: Build
id: cmake_build
shell: cmd
# TODO: Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project
run: |
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
cmake -S . -B build -G "Ninja Multi-Config" ^
-DLLAMA_BUILD_SERVER=ON ^
-DLLAMA_BUILD_BORINGSSL=ON ^
-DGGML_NATIVE=OFF ^
-DGGML_BACKEND_DL=ON ^
-DGGML_CPU_ALL_VARIANTS=ON ^
-DGGML_CUDA=ON ^
-DGGML_RPC=ON ^
-DGGML_CUDA_CUB_3DOT2=ON
set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
cmake --build build --config Release -j %NINJA_JOBS% -t ggml
cmake --build build --config Release

View File

@@ -1,4 +1,4 @@
name: CI (hip)
name: CI (CUDA, ubuntu)
on:
workflow_dispatch: # allows manual triggering
@@ -6,7 +6,7 @@ on:
branches:
- master
paths: [
'.github/workflows/build-hip.yml',
'.github/workflows/build-cuda-ubuntu.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
@@ -20,7 +20,7 @@ on:
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build-hip.yml',
'.github/workflows/build-cuda-ubuntu.yml',
'ggml/src/ggml-cuda/**'
]
@@ -36,8 +36,43 @@ env:
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
cuda:
runs-on: ubuntu-24.04
container: nvidia/cuda:12.6.2-devel-ubuntu24.04
ubuntu-22-hip:
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Install dependencies
env:
DEBIAN_FRONTEND: noninteractive
run: |
apt update
apt install -y cmake build-essential ninja-build libgomp1 git libssl-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: cuda-ubuntu-24.04-cuda
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with CMake
# TODO: Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project
run: |
cmake -S . -B build -G Ninja \
-DLLAMA_FATAL_WARNINGS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=89-real \
-DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined \
-DGGML_NATIVE=OFF \
-DGGML_CUDA=ON \
-DGGML_CUDA_CUB_3DOT2=ON
cmake --build build
hip:
runs-on: ubuntu-22.04
container: rocm/dev-ubuntu-22.04:6.1.2
@@ -55,7 +90,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-hip
key: cuda-ubuntu-22.04-hip
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -69,75 +104,7 @@ jobs:
-DGGML_HIP=ON
cmake --build build --config Release -j $(nproc)
windows-latest-hip:
runs-on: windows-2022
env:
# Make sure this is in sync with build-cache.yml
HIPSDK_INSTALLER_VERSION: "26.Q1"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Grab rocWMMA package
id: grab_rocwmma
run: |
curl -o rocwmma.deb "https://repo.radeon.com/rocm/apt/7.2.1/pool/main/r/rocwmma-dev/rocwmma-dev_2.2.0.70201-81~24.04_amd64.deb"
7z x rocwmma.deb
7z x data.tar
- name: Use ROCm Installation Cache
uses: actions/cache@v5
id: cache-rocm
with:
path: C:\Program Files\AMD\ROCm
key: cache-gha-rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: Setup ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'
uses: ./.github/actions/windows-setup-rocm
with:
version: ${{ env.HIPSDK_INSTALLER_VERSION }}
- name: Verify ROCm
id: verify
run: |
# Find and test ROCm installation
$clangPath = Get-ChildItem 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Select-Object -First 1
if (-not $clangPath) {
Write-Error "ROCm installation not found"
exit 1
}
& $clangPath.FullName --version
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ${{ github.job }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
$env:HIP_PATH=$(Resolve-Path 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | split-path | split-path)
$env:CMAKE_PREFIX_PATH="${env:HIP_PATH}"
cmake -G "Unix Makefiles" -B build -S . `
-DCMAKE_C_COMPILER="${env:HIP_PATH}\bin\clang.exe" `
-DCMAKE_CXX_COMPILER="${env:HIP_PATH}\bin\clang++.exe" `
-DCMAKE_CXX_FLAGS="-I$($PWD.Path.Replace('\', '/'))/opt/rocm-7.2.1/include/" `
-DCMAKE_BUILD_TYPE=Release `
-DLLAMA_BUILD_BORINGSSL=ON `
-DROCM_DIR="${env:HIP_PATH}" `
-DGGML_HIP=ON `
-DGGML_HIP_ROCWMMA_FATTN=ON `
-DGPU_TARGETS="gfx1100" `
-DGGML_RPC=ON
cmake --build build -j ${env:NUMBER_OF_PROCESSORS}
ubuntu-22-musa:
musa:
runs-on: ubuntu-22.04
container: mthreads/musa:rc4.3.0-devel-ubuntu22.04-amd64
@@ -155,7 +122,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-musa
key: cuda-ubuntu-22.04-musa
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

146
.github/workflows/build-cuda-windows.yml vendored Normal file
View File

@@ -0,0 +1,146 @@
name: CI (CUDA, windows)
# TODO: this workflow is only triggered manually because it is very heavy on the CI
# when we provision dedicated windows runners, we can enable it for pushes too
# note: running this workflow manually will populate the ccache for the release builds
# this can be used before merging a PR to speed up the release workflow
on:
workflow_dispatch: # allows manual triggering
# note: this will run in queue with the release workflow
concurrency:
group: release
queue: max
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
cuda:
runs-on: windows-2022
strategy:
matrix:
cuda: ['12.4', '13.3']
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-windows-2022-x64-cuda-${{ matrix.cuda }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install Cuda Toolkit
uses: ./.github/actions/windows-setup-cuda
with:
cuda_version: ${{ matrix.cuda }}
- name: Install Ninja
id: install_ninja
run: |
choco install ninja
- name: Build
id: cmake_build
shell: cmd
# TODO: Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project
run: |
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
cmake -S . -B build -G "Ninja Multi-Config" ^
-DLLAMA_BUILD_SERVER=ON ^
-DLLAMA_BUILD_BORINGSSL=ON ^
-DGGML_NATIVE=OFF ^
-DGGML_BACKEND_DL=ON ^
-DGGML_CPU_ALL_VARIANTS=ON ^
-DGGML_CUDA=ON ^
-DGGML_RPC=ON ^
-DGGML_CUDA_CUB_3DOT2=ON
set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
cmake --build build --config Release -j %NINJA_JOBS% -t ggml
cmake --build build --config Release
hip:
runs-on: windows-2022
env:
# Make sure this is in sync with build-cache.yml
HIPSDK_INSTALLER_VERSION: "26.Q1"
strategy:
matrix:
include:
# sync with release.yml
- name: "radeon"
gpu_targets: "gfx1150;gfx1151;gfx1200;gfx1201;gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Grab rocWMMA package
id: grab_rocwmma
run: |
curl -o rocwmma.deb "https://repo.radeon.com/rocm/apt/7.2.1/pool/main/r/rocwmma-dev/rocwmma-dev_2.2.0.70201-81~24.04_amd64.deb"
7z x rocwmma.deb
7z x data.tar
- name: Use ROCm Installation Cache
uses: actions/cache@v5
id: cache-rocm
with:
path: C:\Program Files\AMD\ROCm
key: cache-gha-rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: Setup ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'
uses: ./.github/actions/windows-setup-rocm
with:
version: ${{ env.HIPSDK_INSTALLER_VERSION }}
- name: Verify ROCm
id: verify
run: |
# Find and test ROCm installation
$clangPath = Get-ChildItem 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Select-Object -First 1
if (-not $clangPath) {
Write-Error "ROCm installation not found"
exit 1
}
& $clangPath.FullName --version
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
# TODO: this build does not match the build in release.yml, so we use a different cache key
# ideally, the builds should match, similar to the CUDA build above so that we would be able
# to populate the ccache for the release with manual runs of this workflow
#key: release-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
key: cuda-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Build
id: cmake_build
run: |
$env:HIP_PATH=$(Resolve-Path 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | split-path | split-path)
$env:CMAKE_PREFIX_PATH="${env:HIP_PATH}"
cmake -G "Unix Makefiles" -B build -S . `
-DCMAKE_C_COMPILER="${env:HIP_PATH}\bin\clang.exe" `
-DCMAKE_CXX_COMPILER="${env:HIP_PATH}\bin\clang++.exe" `
-DCMAKE_CXX_FLAGS="-I$($PWD.Path.Replace('\', '/'))/opt/rocm-7.2.1/include/" `
-DCMAKE_BUILD_TYPE=Release `
-DLLAMA_BUILD_BORINGSSL=ON `
-DROCM_DIR="${env:HIP_PATH}" `
-DGGML_HIP=ON `
-DGGML_HIP_ROCWMMA_FATTN=ON `
-DGPU_TARGETS="gfx1100" `
-DGGML_RPC=ON
cmake --build build -j ${env:NUMBER_OF_PROCESSORS}

View File

@@ -37,7 +37,7 @@ jobs:
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.16
# with:
# key: windows-msys2
# key: msys-windows-2025-x64
# variant: ccache
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -35,8 +35,7 @@ env:
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
windows-latest-opencl-adreno:
windows-2025-opencl-adreno:
runs-on: windows-2025
steps:
@@ -47,7 +46,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-llvm-arm64-opencl-adreno
key: opencl-windows-2025-x64
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -67,7 +67,7 @@ jobs:
if: runner.environment == 'github-hosted'
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-openvino-${{ matrix.variant }}-no-preset-v1
key: openvino-ubuntu-24.04-${{ matrix.variant }}-no-preset-v1
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -69,7 +69,7 @@ jobs:
#- name: ccache
# uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
# with:
# key: ubuntu-cpu-riscv64-native
# key: riscv-ubuntu-native
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -139,7 +139,7 @@ jobs:
#- name: ccache
# uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
# with:
# key: ubuntu-riscv64-native-sanitizer-${{ matrix.sanitizer }}-${{ matrix.build_type }}
# key: riscv-ubuntu-native-sanitizer-${{ matrix.sanitizer }}-${{ matrix.build_type }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -34,7 +34,6 @@ env:
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-latest-rpc:
runs-on: ubuntu-latest

View File

@@ -41,19 +41,6 @@ jobs:
id: checkout
uses: actions/checkout@v6
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-latest-sanitizer-${{ matrix.sanitizer }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#- name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get install build-essential libssl-dev
# with UNDEFINED sanitizer, we have to build in Debug to avoid GCC 13 false-positive warnings
- name: Build (undefined)
id: cmake_build_undefined

View File

@@ -396,14 +396,6 @@ jobs:
sudo apt-get update
sudo apt-get install -y cmake
# note: sparing some ccache since these jobs run on dedicated runners that are not part of the organitzation
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: arm64-cpu-kleidiai-graviton4
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Test
id: ggml-ci
run: |

View File

@@ -88,7 +88,7 @@ jobs:
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-24-sycl-${{ matrix.build }}
# key: sycl-ubuntu-24-${{ matrix.build }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
@@ -150,7 +150,7 @@ jobs:
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: windows-latest-sycl
# key: sycl-windows-latest
# variant: ccache
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -36,7 +36,54 @@ env:
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-vulkan-llvmpipe:
ubuntu:
strategy:
matrix:
include:
- build: 'x64'
os: ubuntu-24.04
- build: 'arm64'
os: ubuntu-24.04-arm
runs-on: ${{ matrix.os }}
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: vulkan-${{ matrix.os }}
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev spirv-headers libssl-dev ninja-build
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: Configure
id: cmake_configure
run: |
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DGGML_BACKEND_DL=ON \
-DGGML_CPU_ALL_VARIANTS=ON \
-DGGML_VULKAN=ON
- name: Build
id: cmake_build
run: |
time cmake --build build -j $(nproc)
ubuntu-llvmpipe:
runs-on: ubuntu-24.04
steps:
@@ -47,7 +94,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-vulkan-llvmpipe
key: vulkan-ubuntu-24.04-llvmpipe
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -35,7 +35,7 @@ env:
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
macos-latest-webgpu:
macos:
runs-on: macos-latest
steps:
@@ -46,7 +46,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macos-latest-webgpu
key: webgpu-macos-latest
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -76,7 +76,7 @@ jobs:
cd build
ctest -L main --verbose --timeout 900
ubuntu-24-webgpu:
ubuntu:
runs-on: ubuntu-24.04
steps:
@@ -87,7 +87,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-webgpu
key: webgpu-ubuntu-24.04
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -129,8 +129,16 @@ jobs:
# test-backend-ops is too slow on llvmpipe, skip it
ctest -L main -E test-backend-ops --verbose --timeout 900
ubuntu-24-webgpu-wasm:
runs-on: ${{ 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
ubuntu-wasm:
strategy:
matrix:
include:
- build: 'x64'
os: ubuntu-24.04
- build: 'arm64'
os: ubuntu-24.04-arm
runs-on: ${{ matrix.os }}
steps:
- name: Clone
@@ -140,7 +148,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-webgpu-wasm
key: webgpu-${{ matrix.os }}-wasm
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -50,7 +50,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-hip-quality-check
key: hip-quality-check-ubuntu-22.04
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -27,18 +27,18 @@ on:
'**/*.glsl'
]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
CMAKE_ARGS: "-DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON"
jobs:
# note: run this workflow one at a time for better cache reuse
concurrency:
group: release
queue: max
jobs:
check_release:
runs-on: [self-hosted, fast]
runs-on: ubuntu-slim
outputs:
should_release: ${{ steps.check.outputs.should_release }}
@@ -100,8 +100,8 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macos-latest-${{ matrix.arch }}
evict-old-files: 1d
key: release-${{ matrix.os }}-${{ matrix.arch }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Build
id: cmake_build
@@ -165,8 +165,8 @@ jobs:
if: ${{ matrix.build != 's390x' }}
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-cpu-${{ matrix.build }}
evict-old-files: 1d
key: release-${{ matrix.os }}-cpu
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Dependencies
id: depends
@@ -241,8 +241,8 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-vulkan-${{ matrix.build }}
evict-old-files: 1d
key: release-${{ matrix.os }}-vulkan
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Dependencies
id: depends
@@ -311,11 +311,17 @@ jobs:
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: android-arm64
evict-old-files: 1d
# note : disabled to spare some cache space (https://github.com/ggml-org/llama.cpp/pull/23789)
# for some reason, the ccache does not improve the build time in this case
# example:
# cache off: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78160400831
# cache on: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78224189394
#
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: release-android-arm64
# append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Set up JDK
uses: actions/setup-java@v5
@@ -402,8 +408,8 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-openvino-release-no-preset-v1
evict-old-files: 1d
key: release-ubuntu-24.04-openvino-release-no-preset-v1
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Dependencies
run: |
@@ -485,9 +491,8 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-cpu-${{ matrix.arch }}
variant: ccache
evict-old-files: 1d
key: release-windows-2025-${{ matrix.arch }}-cpu
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install Ninja
run: |
@@ -556,9 +561,8 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-${{ matrix.backend }}-${{ matrix.arch }}
variant: ccache
evict-old-files: 1d
key: release-windows-2025-${{ matrix.arch }}-${{ matrix.backend }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install Vulkan SDK
id: get_vulkan
@@ -633,12 +637,11 @@ jobs:
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: Install ccache
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-cuda-${{ matrix.cuda }}
variant: ccache
evict-old-files: 1d
key: release-windows-2022-x64-cuda-${{ matrix.cuda }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install Cuda Toolkit
uses: ./.github/actions/windows-setup-cuda
@@ -744,9 +747,8 @@ jobs:
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: windows-latest-sycl
# variant: ccache
# evict-old-files: 1d
# key: release-windows-2022-x64-sycl
# append-timestamp: false # note: use this only with non-concurrent jobs!
#
# - name: Build
# id: cmake_build
@@ -866,9 +868,8 @@ jobs:
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-24-sycl-${{ matrix.build }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
# key: release-ubuntu-24.04-sycl
# append-timestamp: false # note: use this only with non-concurrent jobs!
#
# - name: Build
# id: cmake_build
@@ -936,8 +937,8 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-rocm-${{ matrix.ROCM_VERSION }}-${{ matrix.build }}
evict-old-files: 1d
key: release-ubuntu-22.04-rocm-${{ matrix.ROCM_VERSION }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Dependencies
id: depends
@@ -1058,8 +1059,8 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}-x64
evict-old-files: 1d
key: release-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
append-timestamp: false # note: use this only with non-concurrent jobs!
- name: Install ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'

View File

@@ -55,7 +55,7 @@ concurrency:
jobs:
ubuntu:
runs-on: ubuntu-latest
runs-on: ubuntu-24.04
name: ubuntu (${{ matrix.wf_name }})
strategy:
@@ -96,7 +96,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: server-ubuntu-default
key: server-ubuntu-24.04-x64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -144,7 +144,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: server-windows-default
key: server-windows-2025-x64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -2998,7 +2998,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
key_file.close();
}
).set_examples({LLAMA_EXAMPLE_SERVER}));
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_API_KEY_FILE"));
add_opt(common_arg(
{"--ssl-key-file"}, "FNAME",
"path to file a PEM-encoded SSL private key",

View File

@@ -119,7 +119,8 @@ class ModelBase:
small_first_shard: bool = False, hparams: dict[str, Any] | None = None, remote_hf_model_id: str | None = None,
disable_mistral_community_chat_template: bool = False,
sentence_transformers_dense_modules: bool = False,
fuse_gate_up_exps: bool = False):
fuse_gate_up_exps: bool = False,
fp8_as_q8: bool = False):
if type(self) is ModelBase or \
type(self) is TextModel or \
type(self) is MmprojModel:
@@ -148,6 +149,8 @@ class ModelBase:
self.dir_model_card = dir_model # overridden in convert_lora_to_gguf.py
self._is_nvfp4 = False
self._is_mxfp4 = False
self._fp8_as_q8 = fp8_as_q8
self._fp8_dequantized: set[str] = set()
# Apply heuristics to figure out typical tensor encoding based on first tensor's dtype
# NOTE: can't use field "torch_dtype" in config.json, because some finetunes lie.
@@ -429,6 +432,8 @@ class ModelBase:
s = self.model_tensors[name]
self.model_tensors[weight_name] = lambda w=w, s=s, bs=block_size: dequant_simple(w(), s(), bs)
tensors_to_remove.append(name)
if self._fp8_as_q8:
self._fp8_dequantized.add(weight_name)
if name.endswith(".activation_scale"): # unused
tensors_to_remove.append(name)
if name.endswith("_activation_scale"): # Mistral-Small-4-119B-2602, unused
@@ -440,6 +445,8 @@ class ModelBase:
s = self.model_tensors[name]
self.model_tensors[weight_name] = lambda w=w, s=s, bs=block_size: dequant_simple(w(), s(), bs)
tensors_to_remove.append(name)
if self._fp8_as_q8:
self._fp8_dequantized.add(weight_name)
if name.endswith(".qscale_act"):
tensors_to_remove.append(name)
elif quant_method == "gptq":
@@ -483,6 +490,11 @@ class ModelBase:
strategy = weight_config.get("strategy")
assert strategy == "channel" or strategy == "block"
assert weight_config.get("group_size") is None # didn't find a model using this yet
is_fp8 = (
quant_format == "float-quantized"
and weight_config.get("type") == "float"
and weight_config.get("num_bits") == 8
)
for name in self.model_tensors.keys():
if name.endswith(".weight_scale"):
weight_name = name.removesuffix("_scale")
@@ -490,6 +502,8 @@ class ModelBase:
s = self.model_tensors[name]
self.model_tensors[weight_name] = lambda w=w, s=s: dequant_simple(w(), s(), block_size)
tensors_to_remove.append(name)
if self._fp8_as_q8 and is_fp8:
self._fp8_dequantized.add(weight_name)
elif quant_format == "pack-quantized":
assert weight_config.get("strategy") == "group"
assert weight_config.get("type", "int") == "int"
@@ -524,10 +538,18 @@ class ModelBase:
for name in self.model_tensors.keys():
if name.endswith(".weight_scale"):
weight_name = name.removesuffix("_scale")
if weight_name not in self.model_tensors:
tensors_to_remove.append(name)
continue
w = self.model_tensors[weight_name]
s = self.model_tensors[name]
is_fp8_weight = False
if self._fp8_as_q8:
is_fp8_weight = w().dtype in (torch.float8_e4m3fn, torch.float8_e5m2)
self.model_tensors[weight_name] = lambda w=w, s=s: dequant_simple(w(), s(), None)
tensors_to_remove.append(name)
if is_fp8_weight:
self._fp8_dequantized.add(weight_name)
if name.endswith((".input_scale", ".k_scale", ".v_scale")):
tensors_to_remove.append(name)
elif quant_method is not None:
@@ -615,8 +637,10 @@ class ModelBase:
return [(new_name, data_torch)]
def tensor_force_quant(self, name: str, new_name: str, bid: int | None, n_dims: int) -> gguf.GGMLQuantizationType | bool:
del name, new_name, bid, n_dims # unused
del new_name, bid # unused
# Force FP8-original tensors to Q8_0 when requested; Q8_0 is faster than F16/BF16.
if self._fp8_as_q8 and name in self._fp8_dequantized and n_dims >= 2:
return gguf.GGMLQuantizationType.Q8_0
return False
# some models need extra generated tensors (like rope_freqs)
@@ -791,7 +815,7 @@ class ModelBase:
if quant_algo != "NVFP4":
if nvfp4_compressed_tensors:
quant_algo = "NVFP4"
elif any(v.get("quant_algo") == "NVFP4" for v in quant_layers.values() if isinstance(v, dict)):
elif any(str(v.get("quant_algo")).endswith("NVFP4") for v in quant_layers.values() if isinstance(v, dict)):
quant_algo = "NVFP4"
self._is_nvfp4 = quant_algo == "NVFP4"
@@ -2417,10 +2441,9 @@ class MmprojModel(ModelBase):
raise KeyError(f"could not find any of: {keys}")
def tensor_force_quant(self, name, new_name, bid, n_dims):
del bid, name, n_dims # unused
if ".patch_embd.weight" in new_name or ".patch_merger.weight" in new_name:
return gguf.GGMLQuantizationType.F16 if self.ftype == gguf.LlamaFileType.MOSTLY_F16 else gguf.GGMLQuantizationType.F32
return False
return super().tensor_force_quant(name, new_name, bid, n_dims)
class LazyTorchTensor(gguf.LazyBase):

View File

@@ -148,6 +148,10 @@ def parse_args() -> argparse.Namespace:
"--fuse-gate-up-exps", action="store_true",
help="Fuse gate_exps and up_exps tensors into a single gate_up_exps tensor for MoE models.",
)
parser.add_argument(
"--fp8-as-q8", action="store_true",
help="Store tensors dequantized from FP8 as Q8_0 instead of BF16/F16.",
)
args = parser.parse_args()
if not args.print_supported_models and args.model is None:
@@ -264,7 +268,8 @@ def main() -> None:
small_first_shard=args.no_tensor_first_split,
remote_hf_model_id=hf_repo_id, disable_mistral_community_chat_template=disable_mistral_community_chat_template,
sentence_transformers_dense_modules=args.sentence_transformers_dense_modules,
fuse_gate_up_exps=args.fuse_gate_up_exps
fuse_gate_up_exps=args.fuse_gate_up_exps,
fp8_as_q8=args.fp8_as_q8,
)
if args.vocab_only:

View File

@@ -273,67 +273,51 @@ void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * G
#if defined(GGML_SIMD)
#if defined(__ARM_FEATURE_SVE)
const int sve_register_length = svcntb() * 8; //get vector length
const int ggml_f16_epr = sve_register_length / 16; // running when 16
const int ggml_f16_step = 8 * ggml_f16_epr; // choose 8 SVE registers
const int ggml_f16_epr = svcnth();
const int ggml_f16_step = 8 * ggml_f16_epr;
const int np = n - (n % ggml_f16_step);
const int np2 = n - (n % ggml_f16_epr);
const int np= (n & ~(ggml_f16_step - 1));
svfloat16_t sum1 = svdup_n_f16(0.0f);
svfloat16_t sum2 = svdup_n_f16(0.0f);
svfloat16_t sum3 = svdup_n_f16(0.0f);
svfloat16_t sum4 = svdup_n_f16(0.0f);
svfloat32_t sum1_lo = svdup_n_f32(0.0f);
svfloat32_t sum1_hi = svdup_n_f32(0.0f);
svfloat32_t sum2_lo = svdup_n_f32(0.0f);
svfloat32_t sum2_hi = svdup_n_f32(0.0f);
svfloat32_t sum3_lo = svdup_n_f32(0.0f);
svfloat32_t sum3_hi = svdup_n_f32(0.0f);
svfloat32_t sum4_lo = svdup_n_f32(0.0f);
svfloat32_t sum4_hi = svdup_n_f32(0.0f);
svfloat16_t ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8;
svfloat16_t ay1, ay2, ay3, ay4, ay5, ay6, ay7, ay8;
for (int i = 0; i < np; i += ggml_f16_step) {
ax1 = GGML_F16x_VEC_LOAD(x + i + 0 * ggml_f16_epr, 0);
ay1 = GGML_F16x_VEC_LOAD(y + i + 0 * ggml_f16_epr, 0);
sum1 = GGML_F16x_VEC_FMA(sum1, ax1, ay1);
ax2 = GGML_F16x_VEC_LOAD(x + i + 1 * ggml_f16_epr, 1);
ay2 = GGML_F16x_VEC_LOAD(y + i + 1 * ggml_f16_epr, 1);
sum2 = GGML_F16x_VEC_FMA(sum2, ax2, ay2);
ax3 = GGML_F16x_VEC_LOAD(x + i + 2 * ggml_f16_epr, 2);
ay3 = GGML_F16x_VEC_LOAD(y + i + 2 * ggml_f16_epr, 2);
sum3 = GGML_F16x_VEC_FMA(sum3, ax3, ay3);
ax4 = GGML_F16x_VEC_LOAD(x + i + 3 * ggml_f16_epr, 3);
ay4 = GGML_F16x_VEC_LOAD(y + i + 3 * ggml_f16_epr, 3);
sum4 = GGML_F16x_VEC_FMA(sum4, ax4, ay4);
ax5 = GGML_F16x_VEC_LOAD(x + i + 4 * ggml_f16_epr, 4);
ay5 = GGML_F16x_VEC_LOAD(y + i + 4 * ggml_f16_epr, 4);
sum1 = GGML_F16x_VEC_FMA(sum1, ax5, ay5);
ax6 = GGML_F16x_VEC_LOAD(x + i + 5 * ggml_f16_epr, 5);
ay6 = GGML_F16x_VEC_LOAD(y + i + 5 * ggml_f16_epr, 5);
sum2 = GGML_F16x_VEC_FMA(sum2, ax6, ay6);
ax7 = GGML_F16x_VEC_LOAD(x + i + 6 * ggml_f16_epr, 6);
ay7 = GGML_F16x_VEC_LOAD(y + i + 6 * ggml_f16_epr, 6);
sum3 = GGML_F16x_VEC_FMA(sum3, ax7, ay7);
ax8 = GGML_F16x_VEC_LOAD(x + i + 7 * ggml_f16_epr, 7);
ay8 = GGML_F16x_VEC_LOAD(y + i + 7 * ggml_f16_epr, 7);
sum4 = GGML_F16x_VEC_FMA(sum4, ax8, ay8);
ggml_sve_f16_fma_widened(&sum1_lo, &sum1_hi, GGML_F16x_VEC_LOAD(x + i + 0 * ggml_f16_epr, 0), GGML_F16x_VEC_LOAD(y + i + 0 * ggml_f16_epr, 0));
ggml_sve_f16_fma_widened(&sum2_lo, &sum2_hi, GGML_F16x_VEC_LOAD(x + i + 1 * ggml_f16_epr, 1), GGML_F16x_VEC_LOAD(y + i + 1 * ggml_f16_epr, 1));
ggml_sve_f16_fma_widened(&sum3_lo, &sum3_hi, GGML_F16x_VEC_LOAD(x + i + 2 * ggml_f16_epr, 2), GGML_F16x_VEC_LOAD(y + i + 2 * ggml_f16_epr, 2));
ggml_sve_f16_fma_widened(&sum4_lo, &sum4_hi, GGML_F16x_VEC_LOAD(x + i + 3 * ggml_f16_epr, 3), GGML_F16x_VEC_LOAD(y + i + 3 * ggml_f16_epr, 3));
ggml_sve_f16_fma_widened(&sum1_lo, &sum1_hi, GGML_F16x_VEC_LOAD(x + i + 4 * ggml_f16_epr, 4), GGML_F16x_VEC_LOAD(y + i + 4 * ggml_f16_epr, 4));
ggml_sve_f16_fma_widened(&sum2_lo, &sum2_hi, GGML_F16x_VEC_LOAD(x + i + 5 * ggml_f16_epr, 5), GGML_F16x_VEC_LOAD(y + i + 5 * ggml_f16_epr, 5));
ggml_sve_f16_fma_widened(&sum3_lo, &sum3_hi, GGML_F16x_VEC_LOAD(x + i + 6 * ggml_f16_epr, 6), GGML_F16x_VEC_LOAD(y + i + 6 * ggml_f16_epr, 6));
ggml_sve_f16_fma_widened(&sum4_lo, &sum4_hi, GGML_F16x_VEC_LOAD(x + i + 7 * ggml_f16_epr, 7), GGML_F16x_VEC_LOAD(y + i + 7 * ggml_f16_epr, 7));
}
const int np2 = (n & ~(ggml_f16_epr - 1)); // round down to multiple of 8
for (int k = np; k < np2; k += ggml_f16_epr) {
svfloat16_t rx = GGML_F16x_VEC_LOAD(x + k, 0);
svfloat16_t ry = GGML_F16x_VEC_LOAD(y + k, 0);
sum1 = GGML_F16x_VEC_FMA(sum1, rx, ry);
for (int i = np; i < np2; i += ggml_f16_epr) {
ggml_sve_f16_fma_widened(&sum1_lo, &sum1_hi, GGML_F16x_VEC_LOAD(x + i, 0), GGML_F16x_VEC_LOAD(y + i, 0));
}
if (np2 < n) {
svbool_t pg = svwhilelt_b16(np2, n);
svfloat16_t hx = svld1_f16(pg, (const __fp16 *)(x + np2));
svfloat16_t hy = svld1_f16(pg, (const __fp16 *)(y + np2));
const svbool_t pg = svwhilelt_b16(np2, n);
const svfloat16_t rx = svld1_f16(pg, (const __fp16 *)(x + np2));
const svfloat16_t ry = svld1_f16(pg, (const __fp16 *)(y + np2));
sum1 = svmad_f16_x(pg, hx, hy, sum1);
ggml_sve_f16_fma_widened(&sum1_lo, &sum1_hi, rx, ry);
}
GGML_F16x_VEC_REDUCE(sumf, sum1, sum2, sum3, sum4);
sum1_lo = svadd_f32_m(DEFAULT_PG32, sum1_lo, sum2_lo);
sum1_hi = svadd_f32_m(DEFAULT_PG32, sum1_hi, sum2_hi);
sum3_lo = svadd_f32_m(DEFAULT_PG32, sum3_lo, sum4_lo);
sum3_hi = svadd_f32_m(DEFAULT_PG32, sum3_hi, sum4_hi);
sum1_lo = svadd_f32_m(DEFAULT_PG32, sum1_lo, sum3_lo);
sum1_hi = svadd_f32_m(DEFAULT_PG32, sum1_hi, sum3_hi);
sumf = ggml_sve_sum_f32x2(sum1_lo, sum1_hi);
#elif defined(__riscv_v_intrinsic)
#if defined(__riscv_zvfh)
int vl = __riscv_vsetvlmax_e32m2();

View File

@@ -14,6 +14,35 @@
// floating point type used to accumulate sums
typedef double ggml_float;
#if defined(__ARM_FEATURE_SVE)
inline static void ggml_sve_f16_fma_widened(
svfloat32_t * acc_lo,
svfloat32_t * acc_hi,
svfloat16_t x,
svfloat16_t y) {
#if defined(__ARM_FEATURE_SVE2)
*acc_lo = svmlalb_f32(*acc_lo, x, y);
*acc_hi = svmlalt_f32(*acc_hi, x, y);
#else
// Plain SVE fallback path if SVE2 instructions not available
svfloat16_t x_even = svtrn1_f16(x, x);
svfloat16_t x_odd = svtrn2_f16(x, x);
svfloat16_t y_even = svtrn1_f16(y, y);
svfloat16_t y_odd = svtrn2_f16(y, y);
svbool_t pg = svptrue_b32();
*acc_lo = svmla_f32_x(pg, *acc_lo, svcvt_f32_f16_x(pg, x_even), svcvt_f32_f16_x(pg, y_even));
*acc_hi = svmla_f32_x(pg, *acc_hi, svcvt_f32_f16_x(pg, x_odd), svcvt_f32_f16_x(pg, y_odd));
#endif
}
inline static ggml_float ggml_sve_sum_f32x2(svfloat32_t sum_lo, svfloat32_t sum_hi) {
return (ggml_float) (svaddv_f32(svptrue_b32(), sum_lo) + svaddv_f32(svptrue_b32(), sum_hi));
}
#endif
#define GGML_GELU_FP16
#define GGML_GELU_QUICK_FP16
@@ -122,108 +151,61 @@ inline static void ggml_vec_dot_f16_unroll(const int n, const int xs, float * GG
#if defined(GGML_SIMD)
#if defined(__ARM_FEATURE_SVE)
const int sve_register_length = svcntb() * 8;
const int ggml_f16_epr = sve_register_length / 16; // running when 16
const int ggml_f16_step = 8 * ggml_f16_epr; // choose 8 SVE registers
const int ggml_f16_epr = svcnth();
const int ggml_f16_step = 2 * ggml_f16_epr;
int np = n - (n % ggml_f16_step);
int np2 = n - (n % ggml_f16_epr);
int np = (n & ~(ggml_f16_step - 1));
svfloat16_t sum_00 = svdup_n_f16(0.0f);
svfloat16_t sum_01 = svdup_n_f16(0.0f);
svfloat16_t sum_02 = svdup_n_f16(0.0f);
svfloat16_t sum_03 = svdup_n_f16(0.0f);
svfloat16_t sum_10 = svdup_n_f16(0.0f);
svfloat16_t sum_11 = svdup_n_f16(0.0f);
svfloat16_t sum_12 = svdup_n_f16(0.0f);
svfloat16_t sum_13 = svdup_n_f16(0.0f);
svfloat16_t ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8;
svfloat16_t ay1, ay2, ay3, ay4, ay5, ay6, ay7, ay8;
svfloat32_t sum_0_0_lo = svdup_n_f32(0.0f);
svfloat32_t sum_0_0_hi = svdup_n_f32(0.0f);
svfloat32_t sum_0_1_lo = svdup_n_f32(0.0f);
svfloat32_t sum_0_1_hi = svdup_n_f32(0.0f);
svfloat32_t sum_1_0_lo = svdup_n_f32(0.0f);
svfloat32_t sum_1_0_hi = svdup_n_f32(0.0f);
svfloat32_t sum_1_1_lo = svdup_n_f32(0.0f);
svfloat32_t sum_1_1_hi = svdup_n_f32(0.0f);
for (int i = 0; i < np; i += ggml_f16_step) {
ay1 = GGML_F16x_VEC_LOAD(y + i + 0 * ggml_f16_epr, 0); // 8 elements
const svfloat16_t ay0 = GGML_F16x_VEC_LOAD(y + i, 0);
const svfloat16_t ax00 = GGML_F16x_VEC_LOAD(x[0] + i, 0);
const svfloat16_t ax01 = GGML_F16x_VEC_LOAD(x[1] + i, 0);
ax1 = GGML_F16x_VEC_LOAD(x[0] + i + 0*ggml_f16_epr, 0); // 8 elements
sum_00 = GGML_F16x_VEC_FMA(sum_00, ax1, ay1); // sum_00 = sum_00+ax1*ay1
ax1 = GGML_F16x_VEC_LOAD(x[1] + i + 0*ggml_f16_epr, 0); // 8 elements
sum_10 = GGML_F16x_VEC_FMA(sum_10, ax1, ay1);
ggml_sve_f16_fma_widened(&sum_0_0_lo, &sum_0_0_hi, ax00, ay0);
ggml_sve_f16_fma_widened(&sum_1_0_lo, &sum_1_0_hi, ax01, ay0);
ay2 = GGML_F16x_VEC_LOAD(y + i + 1 * ggml_f16_epr, 1); // next 8 elements
const svfloat16_t ay1 = GGML_F16x_VEC_LOAD(y + i + 1 * ggml_f16_epr, 0);
const svfloat16_t ax10 = GGML_F16x_VEC_LOAD(x[0] + i + 1 * ggml_f16_epr, 0);
const svfloat16_t ax11 = GGML_F16x_VEC_LOAD(x[1] + i + 1 * ggml_f16_epr, 0);
ax2 = GGML_F16x_VEC_LOAD(x[0] + i + 1*ggml_f16_epr, 1); // next 8 elements
sum_01 = GGML_F16x_VEC_FMA(sum_01, ax2, ay2);
ax2 = GGML_F16x_VEC_LOAD(x[1] + i + 1*ggml_f16_epr, 1);
sum_11 = GGML_F16x_VEC_FMA(sum_11, ax2, ay2);
ay3 = GGML_F16x_VEC_LOAD(y + i + 2 * ggml_f16_epr, 2);
ax3 = GGML_F16x_VEC_LOAD(x[0] + i + 2*ggml_f16_epr, 2);
sum_02 = GGML_F16x_VEC_FMA(sum_02, ax3, ay3);
ax3 = GGML_F16x_VEC_LOAD(x[1] + i + 2*ggml_f16_epr, 2);
sum_12 = GGML_F16x_VEC_FMA(sum_12, ax3, ay3);
ay4 = GGML_F16x_VEC_LOAD(y + i + 3 * ggml_f16_epr, 3);
ax4 = GGML_F16x_VEC_LOAD(x[0] + i + 3*ggml_f16_epr, 3);
sum_03 = GGML_F16x_VEC_FMA(sum_03, ax4, ay4);
ax4 = GGML_F16x_VEC_LOAD(x[1] + i + 3*ggml_f16_epr, 3);
sum_13 = GGML_F16x_VEC_FMA(sum_13, ax4, ay4);
ay5 = GGML_F16x_VEC_LOAD(y + i + 4 * ggml_f16_epr, 4);
ax5 = GGML_F16x_VEC_LOAD(x[0] + i + 4*ggml_f16_epr, 4);
sum_00 = GGML_F16x_VEC_FMA(sum_00, ax5, ay5);
ax5 = GGML_F16x_VEC_LOAD(x[1] + i + 4*ggml_f16_epr, 4);
sum_10 = GGML_F16x_VEC_FMA(sum_10, ax5, ay5);
ay6 = GGML_F16x_VEC_LOAD(y + i + 5 * ggml_f16_epr, 5);
ax6 = GGML_F16x_VEC_LOAD(x[0] + i + 5*ggml_f16_epr, 5);
sum_01 = GGML_F16x_VEC_FMA(sum_01, ax6, ay6);
ax6 = GGML_F16x_VEC_LOAD(x[1] + i + 5*ggml_f16_epr, 5);
sum_11 = GGML_F16x_VEC_FMA(sum_11, ax6, ay6);
ay7 = GGML_F16x_VEC_LOAD(y + i + 6 * ggml_f16_epr, 6);
ax7 = GGML_F16x_VEC_LOAD(x[0] + i + 6*ggml_f16_epr, 6);
sum_02 = GGML_F16x_VEC_FMA(sum_02, ax7, ay7);
ax7 = GGML_F16x_VEC_LOAD(x[1] + i + 6*ggml_f16_epr, 6);
sum_12 = GGML_F16x_VEC_FMA(sum_12, ax7, ay7);
ay8 = GGML_F16x_VEC_LOAD(y + i + 7 * ggml_f16_epr, 7);
ax8 = GGML_F16x_VEC_LOAD(x[0] + i + 7*ggml_f16_epr, 7);
sum_03 = GGML_F16x_VEC_FMA(sum_03, ax8, ay8);
ax8 = GGML_F16x_VEC_LOAD(x[1] + i + 7*ggml_f16_epr, 7);
sum_13 = GGML_F16x_VEC_FMA(sum_13, ax8, ay8);
ggml_sve_f16_fma_widened(&sum_0_1_lo, &sum_0_1_hi, ax10, ay1);
ggml_sve_f16_fma_widened(&sum_1_1_lo, &sum_1_1_hi, ax11, ay1);
}
const int np2 = (n & ~(ggml_f16_epr - 1));
for (int k = np; k < np2; k += ggml_f16_epr) {
svfloat16_t ry = GGML_F16x_VEC_LOAD(y + k, 0);
for (int i = np; i < np2; i += ggml_f16_epr) {
const svfloat16_t ry = GGML_F16x_VEC_LOAD(y + i, 0);
const svfloat16_t rx0 = GGML_F16x_VEC_LOAD(x[0] + i, 0);
const svfloat16_t rx1 = GGML_F16x_VEC_LOAD(x[1] + i, 0);
svfloat16_t rx = GGML_F16x_VEC_LOAD(x[0] + k, 0);
sum_00 = GGML_F16x_VEC_FMA(sum_00, rx, ry);
rx = GGML_F16x_VEC_LOAD(x[1] + k, 0);
sum_10 = GGML_F16x_VEC_FMA(sum_10, rx, ry);
ggml_sve_f16_fma_widened(&sum_0_0_lo, &sum_0_0_hi, rx0, ry);
ggml_sve_f16_fma_widened(&sum_1_0_lo, &sum_1_0_hi, rx1, ry);
}
if (np2 < n) {
svbool_t pg = svwhilelt_b16(np2, n);
svfloat16_t hx_0 = svld1_f16(pg, (const __fp16 *)(x[0] + np2));
svfloat16_t hx_1 = svld1_f16(pg, (const __fp16 *)(x[1] + np2));
svfloat16_t hy = svld1_f16(pg, (const __fp16 *)(y + np2));
const svbool_t pg = svwhilelt_b16(np2, n);
const svfloat16_t ay = svld1_f16(pg, (const __fp16 *)(y + np2));
const svfloat16_t ax0 = svld1_f16(pg, (const __fp16 *)(x[0] + np2));
const svfloat16_t ax1 = svld1_f16(pg, (const __fp16 *)(x[1] + np2));
sum_00 = svmad_f16_x(pg, hx_0, hy, sum_00);
sum_10 = svmad_f16_x(pg, hx_1, hy, sum_10);
ggml_sve_f16_fma_widened(&sum_0_0_lo, &sum_0_0_hi, ax0, ay);
ggml_sve_f16_fma_widened(&sum_1_0_lo, &sum_1_0_hi, ax1, ay);
}
GGML_F16x_VEC_REDUCE(sumf[0], sum_00, sum_01, sum_02, sum_03);
GGML_F16x_VEC_REDUCE(sumf[1], sum_10, sum_11, sum_12, sum_13);
svfloat32_t sum_0_lo = svadd_f32_x(DEFAULT_PG32, sum_0_0_lo, sum_0_1_lo);
svfloat32_t sum_0_hi = svadd_f32_x(DEFAULT_PG32, sum_0_0_hi, sum_0_1_hi);
svfloat32_t sum_1_lo = svadd_f32_x(DEFAULT_PG32, sum_1_0_lo, sum_1_1_lo);
svfloat32_t sum_1_hi = svadd_f32_x(DEFAULT_PG32, sum_1_0_hi, sum_1_1_hi);
sumf[0] = ggml_sve_sum_f32x2(sum_0_lo, sum_0_hi);
sumf[1] = ggml_sve_sum_f32x2(sum_1_lo, sum_1_hi);
np = n;
#elif defined(__riscv_v_intrinsic)
#if defined(__riscv_zvfh)

View File

@@ -472,7 +472,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
const int i = 8 * (threadIdx.x % (nbatch_fa/8));
cp_async_cg_16<preload>(tile_mask_32 + j_sram*(nbatch_fa*sizeof(half) + 16) + i*sizeof(half), mask_h + j_vram*stride_mask + i);
cp_async_cg_16<preload>(tile_mask_32 + j_sram*(nbatch_fa*sizeof(half) + 16) + i*sizeof(half), mask_h + int64_t(j_vram)*stride_mask + i);
}
} else if constexpr (oob_check) {
#pragma unroll
@@ -488,7 +488,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
for (int i0 = 0; i0 < nbatch_fa; i0 += warp_size) {
const int i = i0 + threadIdx.x;
tile_mask[j_sram*(nbatch_fa + 8) + i] = i < i_sup ? mask_h[j_vram*stride_mask + i] : half(0.0f);
tile_mask[j_sram*(nbatch_fa + 8) + i] = i < i_sup ? mask_h[int64_t(j_vram)*stride_mask + i] : half(0.0f);
}
}
} else if constexpr (nbatch_fa < 2*warp_size) {
@@ -505,7 +505,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
const int i = threadIdx.x % (warp_size/cols_per_warp);
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + 2*i, mask_h + j_vram*stride_mask + 2*i);
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + 2*i, mask_h + int64_t(j_vram)*stride_mask + 2*i);
}
} else {
#pragma unroll
@@ -521,7 +521,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
for (int i0 = 0; i0 < nbatch_fa; i0 += 2*warp_size) {
const int i = i0 + 2*threadIdx.x;
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + i, mask_h + j_vram*stride_mask + i);
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + i, mask_h + int64_t(j_vram)*stride_mask + i);
}
}
}

View File

@@ -2570,6 +2570,7 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
use_mul_mat_q = use_mul_mat_q && ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[1], /*n_experts=*/0);
use_mul_mat_f = use_mul_mat_f && ggml_cuda_should_use_mmf(src0->type, cc, warp_size, src0->ne, src0->nb, src1->ne[1], /*mul_mat_id=*/false);
use_mul_mat_vec_f = use_mul_mat_vec_f && ggml_cuda_should_use_mmvf(src0->type, cc, src0->ne, src0->nb, src1->ne[1]);
use_mul_mat_vec_q = use_mul_mat_vec_q && ggml_cuda_should_use_mmvq(src0->type, cc, src1->ne[1]);
any_gpus_with_slow_fp16 = any_gpus_with_slow_fp16 || !fast_fp16_hardware_available(cc);
}
} else {
@@ -2578,6 +2579,7 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
use_mul_mat_q = use_mul_mat_q && ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[1], /*n_experts=*/0);
use_mul_mat_f = use_mul_mat_f && ggml_cuda_should_use_mmf(src0->type, cc, warp_size, src0->ne, src0->nb, src1->ne[1], /*mul_mat_id=*/false);
use_mul_mat_vec_f = use_mul_mat_vec_f && ggml_cuda_should_use_mmvf(src0->type, cc, src0->ne, src0->nb, src1->ne[1]);
use_mul_mat_vec_q = use_mul_mat_vec_q && ggml_cuda_should_use_mmvq(src0->type, cc, src1->ne[1]);
any_gpus_with_slow_fp16 = any_gpus_with_slow_fp16 || !fast_fp16_hardware_available(cc);
}
@@ -4992,8 +4994,14 @@ static void ggml_backend_cuda_device_get_memory(ggml_backend_dev_t dev, size_t *
}
static enum ggml_backend_dev_type ggml_backend_cuda_device_get_type(ggml_backend_dev_t dev) {
GGML_UNUSED(dev);
return GGML_BACKEND_DEVICE_TYPE_GPU;
ggml_backend_cuda_device_context * ctx = (ggml_backend_cuda_device_context *) dev->context;
cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, ctx->device));
return prop.integrated
? GGML_BACKEND_DEVICE_TYPE_IGPU
: GGML_BACKEND_DEVICE_TYPE_GPU;
}
static void ggml_backend_cuda_device_get_props(ggml_backend_dev_t dev, ggml_backend_dev_props * props) {

View File

@@ -63,6 +63,7 @@ static constexpr __host__ __device__ int get_vdr_mmvq(ggml_type type) {
enum mmvq_parameter_table_id {
MMVQ_PARAMETERS_GENERIC = 0,
MMVQ_PARAMETERS_TURING,
MMVQ_PARAMETERS_GCN,
MMVQ_PARAMETERS_RDNA2,
MMVQ_PARAMETERS_RDNA3_0,
@@ -78,6 +79,8 @@ static constexpr __device__ mmvq_parameter_table_id get_device_table_id() {
return MMVQ_PARAMETERS_RDNA2;
#elif defined(GCN) || defined(CDNA)
return MMVQ_PARAMETERS_GCN;
#elif defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= GGML_CUDA_CC_TURING && __CUDA_ARCH__ < GGML_CUDA_CC_AMPERE
return MMVQ_PARAMETERS_TURING;
#else
return MMVQ_PARAMETERS_GENERIC;
#endif
@@ -96,6 +99,9 @@ static __host__ mmvq_parameter_table_id get_device_table_id(int cc) {
if (GGML_CUDA_CC_IS_GCN(cc) || GGML_CUDA_CC_IS_CDNA(cc)) {
return MMVQ_PARAMETERS_GCN;
}
if (GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_TURING && ggml_cuda_highest_compiled_arch(cc) < GGML_CUDA_CC_AMPERE) {
return MMVQ_PARAMETERS_TURING;
}
return MMVQ_PARAMETERS_GENERIC;
}
@@ -271,6 +277,53 @@ int get_mmvq_mmid_max_batch(ggml_type type, int cc) {
return MMVQ_MAX_BATCH_SIZE;
}
bool ggml_cuda_should_use_mmvq(enum ggml_type type, int cc, int64_t ne11) {
if (GGML_CUDA_CC_IS_CDNA(cc)) {
if (GGML_CUDA_CC_IS_CDNA1(cc)) {
switch (type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
return ne11 <= 7;
case GGML_TYPE_Q5_1:
return ne11 <= 7;
case GGML_TYPE_Q8_0:
return ne11 <= 6;
case GGML_TYPE_Q2_K:
return ne11 <= 4;
case GGML_TYPE_Q3_K:
return ne11 <= 3;
case GGML_TYPE_Q4_K:
return ne11 <= 2;
case GGML_TYPE_Q5_K:
return ne11 <= 3;
case GGML_TYPE_Q6_K:
return ne11 <= 4;
case GGML_TYPE_IQ1_S:
return ne11 <= 5;
case GGML_TYPE_IQ2_XXS:
case GGML_TYPE_IQ3_S:
case GGML_TYPE_IQ4_XS:
return ne11 <= 6;
default:
return ne11 <= MMVQ_MAX_BATCH_SIZE;
}
}
switch (type) { // tuned for CDNA2
case GGML_TYPE_Q2_K:
return ne11 <= 5;
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
return ne11 <= 3;
case GGML_TYPE_Q6_K:
return ne11 <= 5;
default:
return ne11 <= MMVQ_MAX_BATCH_SIZE;
}
}
return ne11 <= MMVQ_MAX_BATCH_SIZE;
}
// Device constexpr: returns the max batch size for the current arch+type at compile time.
template <ggml_type type>
static constexpr __device__ int get_mmvq_mmid_max_batch_for_device() {
@@ -370,11 +423,38 @@ static constexpr __host__ __device__ int calc_nwarps(ggml_type type, int ncols_d
}
return 1;
}
if (table_id == MMVQ_PARAMETERS_TURING) {
if (ncols_dst == 1) {
switch (type) {
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q6_K:
return 2;
default:
return 4;
}
}
switch (ncols_dst) {
case 2:
case 3:
case 4:
return 4;
case 5:
case 6:
case 7:
case 8:
return 2;
default:
return 1;
}
}
return 1;
}
static constexpr __host__ __device__ int calc_rows_per_block(int ncols_dst, int table_id, bool small_k = false, int nwarps = 1) {
if (table_id == MMVQ_PARAMETERS_GENERIC || table_id == MMVQ_PARAMETERS_GCN) {
if (table_id == MMVQ_PARAMETERS_GENERIC || table_id == MMVQ_PARAMETERS_GCN || table_id == MMVQ_PARAMETERS_TURING) {
switch (ncols_dst) {
case 1:
return small_k ? nwarps : 1;

View File

@@ -2,6 +2,8 @@
#define MMVQ_MAX_BATCH_SIZE 8 // Max. batch size for which to use MMVQ kernels.
bool ggml_cuda_should_use_mmvq(enum ggml_type type, int cc, int64_t ne11);
// Returns the maximum batch size for which MMVQ should be used for MUL_MAT_ID,
// based on the quantization type and GPU architecture (compute capability).
int get_mmvq_mmid_max_batch(ggml_type type, int cc);

View File

@@ -68,6 +68,7 @@ static u32vec opt_pmu_evt { 0x3, 0x111, 0x100, 0x105, 0x240, 0x256, 0x7D, 0x8C }
static int opt_opstage = HTP_OPSTAGE_QUEUE | HTP_OPSTAGE_COMPUTE;
static int opt_opbatch = 1024; // max number of ops in a batch
static int opt_opqueue = 16; // max number of pending batches
static int opt_oppoll = 0; // polling for batch completions
static std::regex* opt_opfilter = NULL; // regex of ops to not claim
@@ -550,7 +551,7 @@ static void repack_q4_0_q4x4x2(ggml_tensor * t, const void * data, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to read more data than is available in the source buffer 'data'
// or write more than the tensor can hold.
@@ -611,7 +612,7 @@ static void repack_q4x4x2_q4_0(void * data, const ggml_tensor * t, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to copy more data than the tensor actually contains.
const size_t total_tensor_size = (size_t)nrows * row_size;
@@ -660,6 +661,239 @@ static void repack_q4x4x2_q4_0(void * data, const ggml_tensor * t, size_t size)
ggml_aligned_free(buf_rp, row_size_rp);
}
static void unpack_q4_1_quants(uint8_t * qs, const block_q4_1 * x, unsigned int bi) {
static const int qk = QK4_1;
for (unsigned int i = 0; i < qk / 2; ++i) {
const int x0 = (x->qs[i] & 0x0F);
const int x1 = (x->qs[i] >> 4);
qs[bi * qk + i + 0] = x0;
qs[bi * qk + i + qk / 2] = x1;
}
}
static void pack_q4_1_quants(block_q4_1 * x, const uint8_t * qs, unsigned int bi) {
static const int qk = QK4_1;
for (unsigned int i = 0; i < qk / 2; ++i) {
const uint8_t x0 = qs[bi * qk + i + 0];
const uint8_t x1 = qs[bi * qk + i + qk / 2];
x->qs[i] = x0 | (x1 << 4);
}
}
static void repack_row_q4_1x4x2(uint8_t * y, const block_q4_1 * x, int64_t k) {
static const int qk = QK_Q4_0x4x2;
const int nb = (k + qk - 1) / qk; // number of blocks (padded)
const int nloe = k % qk; // leftovers
const int dblk_size = 8 * 4; // 8x (d, m) __fp16 = 32 bytes
const int qblk_size = qk / 2; // int4 = 128 bytes
const int qrow_size = k / 2; // int4 (not padded to blocks)
uint8_t * y_q = y + 0; // quants first
uint8_t * y_d = y + qrow_size; // then scales/offsets
// Repack the quants
for (int i = 0; i < nb; i++) {
uint8_t qs[QK_Q4_0x4x2]; // unpacked quants
unpack_q4_1_quants(qs, &x[i * 8 + 0], 0);
unpack_q4_1_quants(qs, &x[i * 8 + 1], 1);
unpack_q4_1_quants(qs, &x[i * 8 + 2], 2);
unpack_q4_1_quants(qs, &x[i * 8 + 3], 3);
unpack_q4_1_quants(qs, &x[i * 8 + 4], 4);
unpack_q4_1_quants(qs, &x[i * 8 + 5], 5);
unpack_q4_1_quants(qs, &x[i * 8 + 6], 6);
unpack_q4_1_quants(qs, &x[i * 8 + 7], 7);
bool partial = (nloe && i == nb-1);
uint8_t * q = y_q + (i * qblk_size);
for (int j = 0; j < qk / 2; j++) {
q[j] = partial ? (qs[j*2+1] << 4) | qs[j*2+0] : (qs[j+128] << 4) | qs[j+000];
}
}
// Repack the scales and offsets
for (int i = 0; i < nb; i++) {
ggml_half * d_m = (ggml_half *) (y_d + i * dblk_size);
for (int j = 0; j < 8; j++) {
d_m[j * 2 + 0] = x[i * 8 + j].d;
d_m[j * 2 + 1] = x[i * 8 + j].m;
}
}
}
static void unpack_row_q4_1x4x2(block_q4_1 * x, const uint8_t * y, int64_t k) {
static const int qk = QK_Q4_0x4x2;
const int nb = (k + qk - 1) / qk; // number of blocks (padded)
const int nloe = k % qk; // leftovers
const int dblk_size = 8 * 4; // 8x (d, m) __fp16 = 32 bytes
const int qblk_size = qk / 2; // int4 = 128 bytes
const int qrow_size = k / 2; // int4 (not padded to blocks)
const uint8_t * y_q = y + 0; // quants first
const uint8_t * y_d = y + qrow_size; // then scales/offsets
// Unpack the quants
for (int i = 0; i < nb; i++) {
uint8_t qs[QK_Q4_0x4x2];
bool partial = (nloe && i == nb-1);
const uint8_t * q = y_q + (i * qblk_size);
for (int j = 0; j < qk / 2; j++) {
if (partial) {
qs[j*2+0] = q[j] & 0x0F;
qs[j*2+1] = q[j] >> 4;
} else {
qs[j+000] = q[j] & 0x0F;
qs[j+128] = q[j] >> 4;
}
}
pack_q4_1_quants(&x[i * 8 + 0], qs, 0);
pack_q4_1_quants(&x[i * 8 + 1], qs, 1);
pack_q4_1_quants(&x[i * 8 + 2], qs, 2);
pack_q4_1_quants(&x[i * 8 + 3], qs, 3);
pack_q4_1_quants(&x[i * 8 + 4], qs, 4);
pack_q4_1_quants(&x[i * 8 + 5], qs, 5);
pack_q4_1_quants(&x[i * 8 + 6], qs, 6);
pack_q4_1_quants(&x[i * 8 + 7], qs, 7);
}
// Unpack the scales and offsets
for (int i = 0; i < nb; i++) {
const ggml_half * d_m = (const ggml_half *) (y_d + i * dblk_size);
for (int j = 0; j < 8; j++) {
x[i * 8 + j].d = d_m[j * 2 + 0];
x[i * 8 + j].m = d_m[j * 2 + 1];
}
}
}
static void init_row_q4_1x4x2(block_q4_1 * x, int64_t k) {
static const int qk = QK_Q4_0x4x2;
const int nb = (k + qk - 1) / qk; // number of blocks (padded)
uint8_t qs[QK_Q4_0x4x2]; // unpacked quants
memset(qs, 0, sizeof(qs));
for (int i = 0; i < nb; i++) {
pack_q4_1_quants(&x[i * 8 + 0], qs, 0);
pack_q4_1_quants(&x[i * 8 + 1], qs, 1);
pack_q4_1_quants(&x[i * 8 + 2], qs, 2);
pack_q4_1_quants(&x[i * 8 + 3], qs, 3);
pack_q4_1_quants(&x[i * 8 + 4], qs, 4);
pack_q4_1_quants(&x[i * 8 + 5], qs, 5);
pack_q4_1_quants(&x[i * 8 + 6], qs, 6);
pack_q4_1_quants(&x[i * 8 + 7], qs, 7);
}
for (int i = 0; i < nb; i++) {
for (int j = 0; j < 8; j++) {
x[i * 8 + j].d = 0;
x[i * 8 + j].m = 0;
}
}
}
static void repack_q4_1_q4x4x2(ggml_tensor * t, const void * data, size_t size) {
int64_t nrows = ggml_nrows(t);
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2));
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
const size_t total_tensor_size = (size_t)nrows * row_size;
const size_t n_bytes_to_copy = size < total_tensor_size ? size : total_tensor_size;
const int64_t n_full_rows = n_bytes_to_copy / row_size;
const size_t n_rem_bytes = n_bytes_to_copy % row_size;
void * buf_pd = ggml_aligned_malloc(row_size_pd);
GGML_ASSERT(buf_pd != NULL);
void * buf_rp = ggml_aligned_malloc(row_size_rp);
GGML_ASSERT(buf_rp != NULL);
HEX_VERBOSE("ggml-hex: repack-q4_1-q4x4x2 %s : data %p size %zu dims %ldx%ld row-size %zu\n", t->name, data, size,
t->ne[0], nrows, row_size);
init_row_q4_1x4x2((block_q4_1 *) buf_pd, t->ne[0]);
for (int64_t i = 0; i < n_full_rows; i++) {
const uint8_t * src = (const uint8_t *) data + (i * row_size);
uint8_t * dst = (uint8_t *) t->data + (i * row_size);
memcpy(buf_pd, src, row_size);
repack_row_q4_1x4x2((uint8_t *) buf_rp, (const block_q4_1 *) buf_pd, t->ne[0]);
memcpy(dst, buf_rp, row_size);
}
if (n_rem_bytes > 0) {
const int64_t i = n_full_rows;
const uint8_t * src = (const uint8_t *) data + (i * row_size);
uint8_t * dst = (uint8_t *) t->data + (i * row_size);
init_row_q4_1x4x2((block_q4_1 *) buf_pd, t->ne[0]);
memcpy(buf_pd, src, n_rem_bytes);
repack_row_q4_1x4x2((uint8_t *) buf_rp, (const block_q4_1 *) buf_pd, t->ne[0]);
memcpy(dst, buf_rp, n_rem_bytes);
}
ggml_aligned_free(buf_pd, row_size_pd);
ggml_aligned_free(buf_rp, row_size_rp);
}
static void repack_q4x4x2_q4_1(void * data, const ggml_tensor * t, size_t size) {
int64_t nrows = ggml_nrows(t);
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2));
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
const size_t total_tensor_size = (size_t)nrows * row_size;
const size_t n_bytes_to_copy = size < total_tensor_size ? size : total_tensor_size;
const int64_t n_full_rows = n_bytes_to_copy / row_size;
const size_t n_rem_bytes = n_bytes_to_copy % row_size;
void * buf_pd = ggml_aligned_malloc(row_size_pd);
GGML_ASSERT(buf_pd != NULL);
void * buf_rp = ggml_aligned_malloc(row_size_rp);
GGML_ASSERT(buf_rp != NULL);
HEX_VERBOSE("ggml-hex: repack-q4x4x2-q4_1 %s : data %p size %zu dims %ldx%ld row-size %zu\n", t->name, data, size,
t->ne[0], nrows, row_size);
memset(buf_rp, 0, row_size_rp); // clear-out padded buffer to make sure the tail is all zeros
for (int64_t i = 0; i < n_full_rows; i++) {
const uint8_t * src = (const uint8_t *) t->data + (i * row_size);
uint8_t * dst = (uint8_t *) data + (i * row_size);
memcpy(buf_rp, src, row_size);
unpack_row_q4_1x4x2((block_q4_1 *) buf_pd, (const uint8_t *) buf_rp, t->ne[0]);
memcpy(dst, buf_pd, row_size);
}
if (n_rem_bytes > 0) {
const int64_t i = n_full_rows;
const uint8_t * src = (const uint8_t *) t->data + (i * row_size);
uint8_t * dst = (uint8_t *) data + (i * row_size);
// We still need to read and unpack the entire source row because quantization is block-based.
memcpy(buf_rp, src, row_size);
unpack_row_q4_1x4x2((block_q4_1 *) buf_pd, (const uint8_t *) buf_rp, t->ne[0]);
memcpy(dst, buf_pd, n_rem_bytes);
}
ggml_aligned_free(buf_pd, row_size_pd);
ggml_aligned_free(buf_rp, row_size_rp);
}
// ======== Q8x4x2 ====================
static void dump_block_q8_0(const block_q8_0 * b, int i) {
HEX_VERBOSE("ggml-hex: repack q8_0 %d: %d %d %d %d ... %d %d %d %d : %.6f\n", i, b->qs[0], b->qs[1], b->qs[2],
@@ -876,7 +1110,7 @@ static void repack_q8_0_q8x4x2(ggml_tensor * t, const void * data, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q8_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size quants + scales)
// Ensure we don't try to read more data than is available in the source buffer 'data'
// or write more than the tensor can hold.
@@ -937,7 +1171,7 @@ static void repack_q8x4x2_q8_0(void * data, const ggml_tensor * t, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q8_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size quants + scales)
// Ensure we don't try to copy more data than the tensor actually contains.
const size_t total_tensor_size = (size_t)nrows * row_size;
@@ -1238,7 +1472,7 @@ static void repack_mxfp4_mxfp4x4x2(ggml_tensor * t, const void * data, size_t si
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_MXFP4x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to read more data than is available in the source buffer 'data'
// or write more than the tensor can hold.
@@ -1299,7 +1533,7 @@ static void repack_mxfp4x4x2_mxfp4(void * data, const ggml_tensor * t, size_t si
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_MXFP4x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to copy more data than the tensor actually contains.
const size_t total_tensor_size = (size_t)nrows * row_size;
@@ -1365,6 +1599,12 @@ static void ggml_backend_hexagon_buffer_set_tensor(ggml_backend_buffer_t buffer,
repack_q4_0_q4x4x2(tensor, data, size);
break;
case GGML_TYPE_Q4_1:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
repack_q4_1_q4x4x2(tensor, data, size);
break;
case GGML_TYPE_Q8_0:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
@@ -1407,6 +1647,12 @@ static void ggml_backend_hexagon_buffer_get_tensor(ggml_backend_buffer_t buffer,
repack_q4x4x2_q4_0(data, tensor, size);
break;
case GGML_TYPE_Q4_1:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
repack_q4x4x2_q4_1(data, tensor, size);
break;
case GGML_TYPE_Q8_0:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
@@ -1886,7 +2132,8 @@ void ggml_hexagon_session::flush_pending(bool all) {
uint32_t n_dbufs;
// Read response packet from queue
int err = dspqueue_read(this->queue, &flags, 1, &n_dbufs, &dbuf, sizeof(rsp), &rsp_size, (uint8_t *) &rsp, DSPQUEUE_TIMEOUT);
const uint32_t timeo = opt_oppoll ? 0 : DSPQUEUE_TIMEOUT;
int err = dspqueue_read(this->queue, &flags, 1, &n_dbufs, &dbuf, sizeof(rsp), &rsp_size, (uint8_t *) &rsp, timeo);
if (err == AEE_EEXPIRED) {
continue;
}
@@ -2290,6 +2537,7 @@ static bool ggml_hexagon_supported_gated_delta_net(const struct ggml_hexagon_ses
const int64_t H = v->ne[1];
const int64_t n_tokens = v->ne[2];
const int64_t n_seqs = v->ne[3];
const int64_t K = state->ne[1];
if (S_v <= 0 || S_v > 128 || H <= 0 || n_tokens <= 0 || n_seqs <= 0) {
return false;
@@ -2302,10 +2550,10 @@ static bool ggml_hexagon_supported_gated_delta_net(const struct ggml_hexagon_ses
if ((g->ne[0] != 1 && g->ne[0] != S_v) || beta->ne[0] != 1) {
return false;
}
if (ggml_nelements(state) != S_v * S_v * H * n_seqs) {
if (ggml_nelements(state) != S_v * S_v * H * n_seqs * K) {
return false;
}
if (dst->ne[0] != S_v * H || dst->ne[1] != n_tokens * n_seqs + S_v * n_seqs) {
if (dst->ne[0] != S_v * H || dst->ne[1] != n_tokens * n_seqs + S_v * n_seqs * K) {
return false;
}
@@ -2327,6 +2575,7 @@ static bool ggml_hexagon_supported_mul_mat(const struct ggml_hexagon_session * s
switch (src0->type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q8_0:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_MXFP4:
@@ -2377,6 +2626,7 @@ static bool ggml_hexagon_supported_mul_mat_id(const struct ggml_hexagon_session
switch (src0->type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q8_0:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_MXFP4:
@@ -3622,6 +3872,8 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
// Basic sanity checks to make sure definitions match
static_assert((unsigned int) HTP_TYPE_Q4_0 == (unsigned int) GGML_TYPE_Q4_0,
"please update hexagon_type to match ggml_type");
static_assert((unsigned int) HTP_TYPE_Q4_1 == (unsigned int) GGML_TYPE_Q4_1,
"please update hexagon_type to match ggml_type");
static_assert((unsigned int) HTP_TYPE_Q8_0 == (unsigned int) GGML_TYPE_Q8_0,
"please update hexagon_type to match ggml_type");
static_assert((unsigned int) HTP_TYPE_MXFP4 == (unsigned int) GGML_TYPE_MXFP4,
@@ -3634,6 +3886,7 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
const char * str_opstage = getenv("GGML_HEXAGON_OPSTAGE");
const char * str_opbatch = getenv("GGML_HEXAGON_OPBATCH");
const char * str_opqueue = getenv("GGML_HEXAGON_OPQUEUE");
const char * str_oppoll = getenv("GGML_HEXAGON_OPPOLL");
const char * str_opfilter = getenv("GGML_HEXAGON_OPFILTER");
const char * str_profile = getenv("GGML_HEXAGON_PROFILE");
const char * str_etm = getenv("GGML_HEXAGON_ETM");
@@ -3671,6 +3924,7 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
opt_opstage = str_opstage ? strtoul(str_opstage, NULL, 0) : opt_opstage;
opt_opbatch = str_opbatch ? strtoul(str_opbatch, NULL, 0) : opt_opbatch;
opt_opqueue = str_opqueue ? strtoul(str_opqueue, NULL, 0) : opt_opqueue;
opt_oppoll = str_oppoll ? strtoul(str_oppoll, NULL, 0) : opt_oppoll;
opt_profile = str_profile ? atoi(str_profile) : 0;
opt_etm = str_etm ? atoi(str_etm) : 0;
opt_nhvx = str_nhvx ? strtoul(str_nhvx, NULL, 0) : opt_nhvx;

View File

@@ -58,15 +58,16 @@ list(FIND HTP_HMX_VERSIONS ${DSP_VERSION} _hmx_idx)
if (_hmx_idx GREATER_EQUAL 0)
target_sources(${HTP_LIB} PRIVATE
hmx-queue.c
hmx-matmul-ops.c
hmx-flash-attn-ops.c
hmx-matmul-ops.c
hmx-queue.c
)
# -mhmx enables HMX instruction set (needed by files that include hmx-utils.h)
set_source_files_properties(
hmx-matmul-ops.c
hmx-flash-attn-ops.c
hmx-matmul-ops.c
hmx-queue.c
PROPERTIES COMPILE_OPTIONS "-mhmx"
)

View File

@@ -22,6 +22,16 @@
// Must be multiple of 32
#define FLASH_ATTN_BLOCK_SIZE (32 * 2)
#if __HVX_ARCH__ < 79
#define HVX_OP_ADD_F32(a, b) Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(a, b))
#define HVX_OP_SUB_F32(a, b) Q6_Vsf_equals_Vqf32(Q6_Vqf32_vsub_VsfVsf(a, b))
#define HVX_OP_MUL_F32(a, b) Q6_Vsf_equals_Vqf32(Q6_Vqf32_vmpy_VsfVsf(a, b))
#else
#define HVX_OP_ADD_F32(a, b) Q6_Vsf_vadd_VsfVsf(a, b)
#define HVX_OP_SUB_F32(a, b) Q6_Vsf_vsub_VsfVsf(a, b)
#define HVX_OP_MUL_F32(a, b) Q6_Vsf_vmpy_VsfVsf(a, b)
#endif
// This is a bit of a hack because the compiler is strugling to properly inline
// the default hvx_vec_f32_to_f16 with output into the local array.
static __attribute__((noinline)) void hvx_vec_f32_to_f16_a(void *ptr, HVX_Vector v0, HVX_Vector v1)
@@ -54,8 +64,8 @@ static inline void hvx_dot_f16_f16_aa(float * restrict r, const void * restrict
rsum_p = hvx_vec_mpyacc_f32_f16(rsum_p, x_hf, y_hf);
}
HVX_Vector rsum = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum_p), Q6_V_hi_W(rsum_p)));
rsum = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vmpy_VsfVsf(hvx_vec_splat_f32(s), hvx_vec_reduce_sum_f32(rsum)));
HVX_Vector rsum = HVX_OP_ADD_F32(Q6_V_lo_W(rsum_p), Q6_V_hi_W(rsum_p));
rsum = HVX_OP_MUL_F32(hvx_vec_splat_f32(s), hvx_vec_reduce_sum_f32(rsum));
hvx_vec_store_u(r, 4, rsum);
}
@@ -105,10 +115,10 @@ static inline HVX_Vector hvx_dot_f16_f16_aa_rx4(const void * restrict y,
rsum3_p = hvx_vec_mpyacc_f32_f16(rsum3_p, x3_hf, y_hf);
}
HVX_Vector rsum0 = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum0_p), Q6_V_hi_W(rsum0_p)));
HVX_Vector rsum1 = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum1_p), Q6_V_hi_W(rsum1_p)));
HVX_Vector rsum2 = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum2_p), Q6_V_hi_W(rsum2_p)));
HVX_Vector rsum3 = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_V_lo_W(rsum3_p), Q6_V_hi_W(rsum3_p)));
HVX_Vector rsum0 = HVX_OP_ADD_F32(Q6_V_lo_W(rsum0_p), Q6_V_hi_W(rsum0_p));
HVX_Vector rsum1 = HVX_OP_ADD_F32(Q6_V_lo_W(rsum1_p), Q6_V_hi_W(rsum1_p));
HVX_Vector rsum2 = HVX_OP_ADD_F32(Q6_V_lo_W(rsum2_p), Q6_V_hi_W(rsum2_p));
HVX_Vector rsum3 = HVX_OP_ADD_F32(Q6_V_lo_W(rsum3_p), Q6_V_hi_W(rsum3_p));
HVX_Vector_x4 rsum0123 = { .v = { rsum0, rsum1, rsum2, rsum3 } };
return hvx_vec_reduce_sum_f32x4(rsum0123);
@@ -123,7 +133,7 @@ static inline HVX_Vector hvx_dot_f16_f16_aa_rx32(const void * restrict y,
const size_t nvec = n / VLEN_FP16; // num full fp16 hvx vectors
const size_t nloe = n % VLEN_FP16; // leftover elements
HVX_Vector sums; // initialize at j = 0
HVX_Vector sums = Q6_V_vzero();
const size_t stride_x_4 = stride_x * 4;
for (uint32_t j = 0; j < VLEN_FP32; j += 4) {
HVX_Vector sums_x4 = hvx_dot_f16_f16_aa_rx4(y, x, stride_x, nvec, nloe);
@@ -132,8 +142,7 @@ static inline HVX_Vector hvx_dot_f16_f16_aa_rx32(const void * restrict y,
x += stride_x_4;
}
sums = Q6_Vqf32_vmpy_VsfVsf(hvx_vec_splat_f32(s), sums);
return Q6_Vsf_equals_Vqf32(sums);
return HVX_OP_MUL_F32(hvx_vec_splat_f32(s), sums);
}
// MAD: y (F32) += x (F16) * s (F16)
@@ -268,11 +277,10 @@ static inline void hvx_scale_vec_f32_aa(uint8_t * restrict dst, const uint8_t *
uint32_t i = 0;
#pragma unroll(4)
for (; i < nvec; ++i) {
vdst[i] = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vmpy_VsfVsf(vsrc[i], vs));
vdst[i] = HVX_OP_MUL_F32(vsrc[i], vs);
}
if (nloe) {
HVX_Vector v = Q6_Vqf32_vmpy_VsfVsf(vsrc[i], vs);
hvx_vec_store_a(&vdst[i], nloe * sizeof(float), Q6_Vsf_equals_Vqf32(v));
hvx_vec_store_a(&vdst[i], nloe * sizeof(float), HVX_OP_MUL_F32(vsrc[i], vs));
}
}
@@ -438,25 +446,44 @@ static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void *
// Process in sub-blocks of 32 (VLEN_FP32)
HVX_Vector sb_scores[FLASH_ATTN_BLOCK_SIZE / VLEN_FP32];
HVX_Vector v_max = hvx_vec_splat_f32(-INFINITY);
for (uint32_t iv = 0; ic + VLEN_FP32 <= current_block_size; ic += VLEN_FP32, ++iv) {
for (uint32_t iv = 0; ic < current_block_size; ic += VLEN_FP32, ++iv) {
// 1. Compute scores
HVX_Vector scores = hvx_dot_f16_f16_aa_rx32(q_ptr_vtcm, k_base + ic * factx->size_k_row_padded, factx->size_k_row_padded, DK, factx->scale);
// 2. Softcap
if (factx->logit_softcap != 0.0f) {
scores = hvx_vec_tanh_f32(scores);
scores = Q6_Vqf32_vmpy_VsfVsf(scores, logit_cap);
scores = Q6_Vsf_equals_Vqf32(scores);
scores = HVX_OP_MUL_F32(scores, logit_cap);
}
// 3. Mask
if (mask) {
const __fp16 * mp = m_base + ic;
HVX_Vector m_vals_f16 = *(const HVX_UVector *) mp;
HVX_VectorPair m_vals_f32_pair = Q6_Wqf32_vmpy_VhfVhf(Q6_Vh_vshuff_Vh(m_vals_f16), slope_vec);
HVX_Vector add_val = Q6_V_lo_W(m_vals_f32_pair);
scores = Q6_Vqf32_vadd_Vqf32Vsf(add_val, scores);
scores = Q6_Vsf_equals_Vqf32(scores);
// Multiplying -INFINITY (0xFC00) by a slope in VhfVhf instructions can incorrectly produce NaN on v79.
// Clamp -INFINITY to the max negative fp16 finite value (-65504.0f).
HVX_Vector vinf = Q6_Vh_vsplat_R(0xFC00);
HVX_Vector vmin = Q6_Vh_vsplat_R(0xFBFF);
HVX_VectorPred is_inf = Q6_Q_vcmp_eq_VhVh(m_vals_f16, vinf);
m_vals_f16 = Q6_V_vmux_QVV(is_inf, vmin, m_vals_f16);
#if __HVX_ARCH__ >= 79
HVX_VectorPair m_vals_f32_pair = Q6_Wsf_vmpy_VhfVhf(Q6_Vh_vshuff_Vh(m_vals_f16), slope_vec);
HVX_Vector add_val = Q6_V_lo_W(m_vals_f32_pair);
scores = Q6_Vsf_vadd_VsfVsf(add_val, scores);
#else
HVX_VectorPair m_vals_f32_pair = Q6_Wqf32_vmpy_VhfVhf(Q6_Vh_vshuff_Vh(m_vals_f16), slope_vec);
HVX_Vector add_val = Q6_V_lo_W(m_vals_f32_pair);
scores = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_Vqf32Vsf(add_val, scores));
#endif
}
// Mask out invalid lanes for leftover handling
uint32_t valid_lanes = current_block_size - ic;
if (valid_lanes < VLEN_FP32) {
HVX_VectorPred valid_pred = Q6_Q_vsetq_R(valid_lanes * 4); // 4 bytes per fp32 lane
scores = Q6_V_vmux_QVV(valid_pred, scores, hvx_vec_splat_f32(-INFINITY));
}
sb_scores[iv] = scores;
@@ -466,78 +493,55 @@ static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void *
{
// 4. Online Softmax Update
HVX_Vector M_new_vec = Q6_Vsf_vmax_VsfVsf(v_max, M_vec);
HVX_Vector diff_vec = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vsub_VsfVsf(M_vec, M_new_vec));
HVX_Vector diff_vec = HVX_OP_SUB_F32(M_vec, M_new_vec);
HVX_Vector ms_vec = hvx_vec_exp_f32(diff_vec);
M_vec = M_new_vec;
hvx_scale_vec_f32_aa((uint8_t *) VKQ32, (const uint8_t *) VKQ32, DV, ms_vec);
HVX_Vector p_sum_vec = hvx_vec_splat_f32(0.0f);
for (uint32_t ic2 = 0, iv = 0; ic2 + VLEN_FP32 <= current_block_size; ic2 += VLEN_FP32, ++iv) {
for (uint32_t ic2 = 0, iv = 0; ic2 < current_block_size; ic2 += VLEN_FP32, ++iv) {
HVX_Vector scores = sb_scores[iv];
HVX_Vector scores_shifted = Q6_Vqf32_vsub_VsfVsf(scores, M_vec);
HVX_Vector P = hvx_vec_exp_f32(Q6_Vsf_equals_Vqf32(scores_shifted));
HVX_Vector scores_shifted = HVX_OP_SUB_F32(scores, M_vec);
HVX_Vector P = hvx_vec_exp_f32(scores_shifted);
p_sum_vec = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(p_sum_vec, P));
p_sum_vec = HVX_OP_ADD_F32(p_sum_vec, P);
// 5. Accumulate V
__fp16 __attribute__((aligned(VLEN))) p_arr[VLEN_FP16];
hvx_vec_f32_to_f16_a(p_arr, P, hvx_vec_splat_f32(0));
float __attribute__((aligned(128))) P_arr[VLEN_FP32];
hvx_vec_store_a(P_arr, 128, P);
for (uint32_t j = 0; j < VLEN_FP32; j += 2) {
const uint32_t cur_ic = ic2 + j;
const uint8_t * v_ptr = v_base + cur_ic * factx->size_v_row_padded;
const uint32_t cur_ic = ic2 + j;
if (cur_ic >= current_block_size) {
break;
}
if (cur_ic + 1 == current_block_size) {
// Odd leftover, process single row
if (P_arr[j] != 0.0f) {
const uint8_t * v_ptr = v_base + cur_ic * factx->size_v_row_padded;
hvx_mad_f32_f16_aa(VKQ32, v_ptr, (p_arr + j), DV);
}
break;
}
// Avoid NaN * 0.0 = NaN for uninitialized V cache rows.
// Check the f32 values to safely avoid strict aliasing violations.
if (P_arr[j] == 0.0f && P_arr[j + 1] == 0.0f) {
continue;
}
const uint8_t * v_ptr = v_base + cur_ic * factx->size_v_row_padded;
hvx_mad_f32_f16_aa_rx2(VKQ32, v_ptr, v_ptr + factx->size_v_row_padded, (p_arr + j), (p_arr + j + 1), DV);
}
}
p_sum_vec = hvx_vec_reduce_sum_f32(p_sum_vec);
S_vec = Q6_Vsf_equals_Vqf32(Q6_Vqf32_vadd_VsfVsf(Q6_Vsf_equals_Vqf32(Q6_Vqf32_vmpy_VsfVsf(S_vec, ms_vec)), p_sum_vec));
}
if (ic < current_block_size) {
// Sync scalars for leftover/next block if needed
float M = hvx_vec_get_f32(M_vec);
float S = hvx_vec_get_f32(S_vec);
// Leftover
for (; ic < current_block_size; ++ic) {
float s_val;
const uint8_t * k_ptr = k_base + ic * factx->size_k_row_padded;
hvx_dot_f16_f16_aa(&s_val, q_ptr_vtcm, k_ptr, DK, factx->scale);
if (factx->logit_softcap != 0.0f) {
s_val = factx->logit_softcap * tanhf(s_val);
}
if (mask) {
const float m_val = m_base[ic];
s_val += slope * m_val;
}
const float Mold = M;
__fp16 vs = 1.0f;
if (s_val > M) {
M = s_val;
HVX_Vector diff_vec = hvx_vec_splat_f32(Mold - M);
HVX_Vector ms_vec = hvx_vec_exp_f32(diff_vec);
hvx_scale_vec_f32_aa((uint8_t *) VKQ32, (const uint8_t *) VKQ32, DV, ms_vec);
float ms = hvx_vec_get_f32(ms_vec);
S = S * ms + vs;
} else {
HVX_Vector diff_vec = hvx_vec_splat_f32(s_val - M);
vs = hvx_vec_get_f32(hvx_vec_exp_f32(diff_vec));
S += vs;
}
const uint8_t * v_ptr = v_base + ic * factx->size_v_row_padded;
hvx_mad_f32_f16_aa(VKQ32, v_ptr, &vs, DV);
}
M_vec = hvx_vec_splat_f32(M);
S_vec = hvx_vec_splat_f32(S);
S_vec = HVX_OP_ADD_F32(HVX_OP_MUL_F32(S_vec, ms_vec), p_sum_vec);
}
// Issue DMA for next+1 block (if exists)
@@ -599,8 +603,9 @@ static void flash_attn_ext_f16_thread(unsigned int nth, unsigned int ith, void *
const int i2 = iq2;
const int i3 = iq3;
// dst is permuted
uint8_t * dst_ptr = (uint8_t *) dst->data + (i3*ne2*ne1 + i2 + i1*ne1) * nb1;
// dst is permuted: [DV, n_heads, n_tokens, n_seq]
// head stride is nb[1], token stride is nb[2], batch stride is nb[3]
uint8_t * dst_ptr = (uint8_t *) dst->data + i2 * dst->nb[1] + i1 * dst->nb[2] + i3 * dst->nb[3];
if (dst->type == HTP_TYPE_F32) {
hvx_copy_f32_ua(dst_ptr, (uint8_t *) VKQ32, DV);
@@ -623,8 +628,8 @@ int op_flash_attn_ext(struct htp_ops_context * octx) {
}
#ifdef HTP_HAS_HMX
// HMX path: prefill (neq1 >= 32), head_dim multiple of 32, F16 KV
if (k->type == HTP_TYPE_F16 && v->type == HTP_TYPE_F16 && k->ne[0] % 32 == 0 && q->ne[1] >= 32) {
// HMX path: head_dim multiple of 32, F16 KV
if (k->type == HTP_TYPE_F16 && v->type == HTP_TYPE_F16 && k->ne[0] % 32 == 0) {
int ret = hmx_flash_attn_ext(octx);
if (ret == HTP_STATUS_OK) {
return ret;

View File

@@ -586,6 +586,7 @@ static void gated_delta_net_f32_pp_thread(unsigned int nth, unsigned int ith, vo
const uint32_t H = v->ne[1];
const uint32_t n_tokens = v->ne[2];
const uint32_t n_seqs = v->ne[3];
const uint32_t K = state->ne[1];
const uint32_t total_rows = H * n_seqs;
if (ith >= total_rows) {
@@ -606,6 +607,10 @@ static void gated_delta_net_f32_pp_thread(unsigned int nth, unsigned int ith, vo
float local_k[HTP_GDN_MAX_SV] __attribute__((aligned(128)));
float local_sums[4] __attribute__((aligned(128)));
const uint64_t state_seq_stride = state->nb[2] / sizeof(float);
const uint64_t state_size_per_snap = (uint64_t) S_v * S_v * H * n_seqs;
const int64_t shift = (int64_t) n_tokens - (int64_t) K;
for (uint32_t ir = ith; ir < total_rows; ir += nth) {
const uint32_t iv1 = ir % H;
const uint32_t iv3 = ir / H;
@@ -615,8 +620,8 @@ static void gated_delta_net_f32_pp_thread(unsigned int nth, unsigned int ith, vo
const uint32_t iq3 = iv3 / rq3;
const uint32_t ik3 = iv3 / rk3;
float * s_out = state_out_base + ((uint64_t) iv3 * H + iv1) * S_v * S_v;
const float * s_in = state_in_base + ((uint64_t) iv3 * H + iv1) * S_v * S_v;
float * s_out = state_out_base + (uint64_t) (K - 1) * state_size_per_snap + ((uint64_t) iv3 * H + iv1) * S_v * S_v;
const float * s_in = state_in_base + (uint64_t) iv3 * state_seq_stride + (uint64_t) iv1 * S_v * S_v;
memcpy(s_out, s_in, gctx->state_bytes);
float * s_work = s_out;
@@ -689,6 +694,16 @@ static void gated_delta_net_f32_pp_thread(unsigned int nth, unsigned int ith, vo
}
}
if (K > 1) {
const int64_t target_slot = (int64_t) t - shift;
if (target_slot >= 0 && target_slot < (int64_t) K) {
float * curr_state_o = state_out_base + (uint64_t) target_slot * state_size_per_snap + ((uint64_t) iv3 * H + iv1) * S_v * S_v;
if (curr_state_o != s_work) {
memcpy(curr_state_o, s_work, gctx->state_bytes);
}
}
}
attn_data += (uint64_t) S_v * H;
}
}
@@ -709,6 +724,7 @@ static void gated_delta_net_f32_tg_thread(unsigned int nth, unsigned int ith, vo
const uint32_t S_v = v->ne[0];
const uint32_t H = v->ne[1];
const uint32_t n_seqs = v->ne[3];
const uint32_t K = state->ne[1];
const uint32_t total_rows = H * n_seqs;
if (ith >= total_rows) {
@@ -736,6 +752,9 @@ static void gated_delta_net_f32_tg_thread(unsigned int nth, unsigned int ith, vo
spad = gctx->vtcm_state_base + gctx->vtcm_state_per_thread * ith;
}
const uint64_t state_seq_stride = state->nb[2] / sizeof(float);
const uint64_t state_size_per_snap = (uint64_t) S_v * S_v * H * n_seqs;
for (uint32_t ir = ith; ir < total_rows; ir += nth) {
const uint32_t iv1 = ir % H;
const uint32_t iv3 = ir / H;
@@ -745,8 +764,8 @@ static void gated_delta_net_f32_tg_thread(unsigned int nth, unsigned int ith, vo
const uint32_t iq3 = iv3 / rq3;
const uint32_t ik3 = iv3 / rk3;
float * s_out = state_out_base + ((uint64_t) iv3 * H + iv1) * S_v * S_v;
const float * s_in = state_in_base + ((uint64_t) iv3 * H + iv1) * S_v * S_v;
float * s_out = state_out_base + (uint64_t) (K - 1) * state_size_per_snap + ((uint64_t) iv3 * H + iv1) * S_v * S_v;
const float * s_in = state_in_base + (uint64_t) iv3 * state_seq_stride + (uint64_t) iv1 * S_v * S_v;
float * s_work;
if (spad) {
@@ -901,6 +920,7 @@ int op_gated_delta_net(struct htp_ops_context * octx) {
const uint32_t H = v->ne[1];
const uint32_t n_tokens = v->ne[2];
const uint32_t n_seqs = v->ne[3];
const uint32_t K = state->ne[1];
if (S_v == 0 || S_v > HTP_GDN_MAX_SV || H == 0 || n_tokens == 0 || n_seqs == 0) {
return HTP_STATUS_NO_SUPPORT;
@@ -913,10 +933,10 @@ int op_gated_delta_net(struct htp_ops_context * octx) {
(n_seqs % q->ne[3]) != 0 || (n_seqs % k->ne[3]) != 0) {
return HTP_STATUS_NO_SUPPORT;
}
if (state->ne[0] * state->ne[1] * state->ne[2] * state->ne[3] != S_v * S_v * H * n_seqs) {
if (state->ne[0] * state->ne[2] * state->ne[3] != S_v * S_v * H * n_seqs) {
return HTP_STATUS_NO_SUPPORT;
}
if (dst->ne[0] != S_v * H || dst->ne[1] != n_tokens * n_seqs + S_v * n_seqs) {
if (dst->ne[0] != S_v * H || dst->ne[1] != n_tokens * n_seqs + S_v * n_seqs * K) {
return HTP_STATUS_NO_SUPPORT;
}

View File

@@ -1248,9 +1248,6 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
if (DK % 32 != 0 || DV % 32 != 0) {
return HTP_STATUS_NO_SUPPORT;
}
if (neq1 < 32) {
return HTP_STATUS_NO_SUPPORT;
}
// GQA factor
const uint32_t n_kv_heads = k->ne[2];

View File

@@ -16,6 +16,7 @@
#include "ggml-common.h"
#include "hex-dma.h"
#include "hex-fastdiv.h"
#include "worker-pool.h"
#include "hvx-utils.h"
@@ -34,6 +35,10 @@ static const __fp16 q4_0_to_fp16_lut[64] __attribute__((aligned(VLEN))) = {
-8, 0, -7, 0, -6, 0, -5, 0, -4, 0, -3, 0, -2, 0, -1, 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0,
};
static const __fp16 q4_1_to_fp16_lut[64] __attribute__((aligned(VLEN))) = {
0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0, 9, 0, 10, 0, 11, 0, 12, 0, 13, 0, 14, 0, 15, 0,
};
// MXFP4 dequantization LUT: maps 4-bit index to fp16 mantissa value
// kvalues: 0, 0.5, 1, 1.5, 2, 3, 4, 6, 0, -0.5, -1, -1.5, -2, -3, -4, -6
static const __fp16 mxfp4_to_fp16_lut[64] __attribute__((aligned(VLEN))) = {
@@ -62,6 +67,8 @@ static inline size_t get_x4x2_row_stride(int weight_type, int k) {
case HTP_TYPE_Q4_0:
case HTP_TYPE_IQ4_NL:
return (size_t) nb * (QK_Q4_0x4x2 / 2 + HMX_X4X2_DBLK_SIZE); // 144 * nb
case HTP_TYPE_Q4_1:
return (size_t) nb * (QK_Q4_0x4x2 / 2 + 32); // 160 * nb
case HTP_TYPE_Q8_0:
return (size_t) nb * (QK_Q8_0x4x2 + HMX_X4X2_DBLK_SIZE); // 272 * nb
case HTP_TYPE_MXFP4:
@@ -181,45 +188,44 @@ next_nc:
// In x4x2, sub-blocks 0..3 use lower nibbles, sub-blocks 4..7 use upper nibbles
// of the same 32 packed bytes.
static inline HVX_Vector dequantize_x4x2_q4_0_group_hvx(const uint8_t *packed_32, bool upper_nibbles, const __fp16 *scale, const HVX_Vector vlut_cvt) {
(void)vlut_cvt;
HVX_Vector vq = hvx_vmemu(packed_32);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
const HVX_Vector i8 = Q6_Vb_vsplat_R(8);
HVX_Vector v_scales = hvx_vec_repl_f16(hvx_vmemu(scale));
// q4x4x2 stores two int4 values per byte. Keep only the selected nibble.
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
// Shuffle before LUT
v_quants = Q6_Vb_vshuff_Vb(v_quants);
// Use standard vlut16 (not _nomatch) to avoid stale-register NaN.
// _nomatch retains the previous destination-register value for colliding
// indices, but the C intrinsic doesn't model the implicit read so the
// compiler may allocate a register containing garbage/NaN.
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_hf = Q6_V_lo_W(vp);
HVX_Vector v_int8 = Q6_Vb_vsub_VbVb(v_quants, i8);
HVX_Vector v0 = Q6_V_lo_W(Q6_Wh_vunpack_Vb(v_int8));
HVX_Vector v_hf = Q6_Vhf_equals_Vh(v0);
return Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_hf, v_scales));
}
// Batch-dequantize 4 contiguous x4x2 Q4_0 groups (4x32 = 128 packed bytes) using
// full HVX vector width. One vmemu + one vlut16 replaces 4 separate calls.
// full HVX vector width.
// Output: vector_x2 each hold 32 FP16 values in the first 64 bytes.
static inline HVX_Vector_x2 dequantize_x4x2_q4_0_x4groups_hvx(
const uint8_t *packed_128, bool upper_nibbles,
const __fp16 *scales_4, const HVX_Vector vlut_cvt) {
// Load all 128 packed bytes (4 contiguous 32-byte groups)
(void)vlut_cvt;
HVX_Vector vq = hvx_vmemu(packed_128);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
const HVX_Vector i8 = Q6_Vb_vsplat_R(8);
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
// Shuffle before LUT
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_Vector v_int8 = Q6_Vb_vsub_VbVb(v_quants, i8);
// Full-width vlut16: 128 byte lookups -> 128 fp16 results in a VectorPair
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_lo = Q6_V_lo_W(vp); // [group0: 32 fp16 | group1: 32 fp16]
HVX_Vector v_hi = Q6_V_hi_W(vp); // [group2: 32 fp16 | group3: 32 fp16]
HVX_VectorPair vp_int16 = Q6_Wh_vunpack_Vb(v_int8);
HVX_Vector v_lo = Q6_V_lo_W(vp_int16);
HVX_Vector v_hi = Q6_V_hi_W(vp_int16);
v_lo = Q6_Vhf_equals_Vh(v_lo);
v_hi = Q6_Vhf_equals_Vh(v_hi);
// Build per-group scale vectors: first 64 bytes use scale_a, last 64 use scale_b
HVX_Vector vscale = hvx_vmemu(scales_4);
HVX_Vector v_sc01 = hvx_vec_repl_2x_f16(vscale);
HVX_Vector v_sc23 = hvx_vec_repl_2x_f16(Q6_V_vror_VR(vscale, 4));
@@ -227,9 +233,97 @@ static inline HVX_Vector_x2 dequantize_x4x2_q4_0_x4groups_hvx(
v_lo = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_lo, v_sc01));
v_hi = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_hi, v_sc23));
// Extract individual groups: scatter uses q_mask64 so only first 64 bytes matter
HVX_Vector_x2 r = { v_lo,/* group1 already in [0:63] */
v_hi /* group2 already in [0:63] */ };
HVX_Vector_x2 r = { v_lo, v_hi };
return r;
}
static inline HVX_Vector dequantize_x4x2_q4_1_group_hvx(const uint8_t *packed_32, bool upper_nibbles, const __fp16 *scale_offset, const HVX_Vector vlut_cvt) {
(void)vlut_cvt;
HVX_Vector vq = hvx_vmemu(packed_32);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_dm = hvx_vmemu(scale_offset);
HVX_Vector v_scales = hvx_vec_repl_f16(v_dm);
HVX_Vector v_offsets = hvx_vec_repl_f16(Q6_V_vror_VR(v_dm, 2));
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
HVX_Vector v0 = Q6_V_lo_W(Q6_Wh_vunpack_Vb(v_quants));
HVX_Vector v_hf = Q6_Vhf_equals_Vh(v0);
return Q6_Vhf_equals_Vqf16(Q6_Vqf16_vadd_Vqf16Vhf(Q6_Vqf16_vmpy_VhfVhf(v_hf, v_scales), v_offsets));
}
static inline HVX_Vector_x2 dequantize_x4x2_q4_1_x4groups_hvx(
const uint8_t *packed_128, bool upper_nibbles,
const __fp16 *scales_offsets_4, const HVX_Vector vlut_cvt) {
(void)vlut_cvt;
HVX_Vector vq = hvx_vmemu(packed_128);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
HVX_VectorPair vp_int16 = Q6_Wh_vunpack_Vb(v_quants);
HVX_Vector v_lo = Q6_V_lo_W(vp_int16);
HVX_Vector v_hi = Q6_V_hi_W(vp_int16);
v_lo = Q6_Vhf_equals_Vh(v_lo);
v_hi = Q6_Vhf_equals_Vh(v_hi);
HVX_Vector vscale_offset = hvx_vmemu(scales_offsets_4);
HVX_VectorPair dm_deal = Q6_W_vdeal_VVR(vscale_offset, vscale_offset, -2);
HVX_Vector vd = Q6_V_lo_W(dm_deal);
HVX_Vector vm = Q6_V_hi_W(dm_deal);
HVX_Vector v_sc01 = hvx_vec_repl_2x_f16(vd);
HVX_Vector v_sc23 = hvx_vec_repl_2x_f16(Q6_V_vror_VR(vd, 4));
HVX_Vector v_os01 = hvx_vec_repl_2x_f16(vm);
HVX_Vector v_os23 = hvx_vec_repl_2x_f16(Q6_V_vror_VR(vm, 4));
v_lo = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vadd_Vqf16Vhf(Q6_Vqf16_vmpy_VhfVhf(v_lo, v_sc01), v_os01));
v_hi = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vadd_Vqf16Vhf(Q6_Vqf16_vmpy_VhfVhf(v_hi, v_sc23), v_os23));
HVX_Vector_x2 r = { v_lo, v_hi };
return r;
}
// LUT-based dequantizers for non-linear IQ4_NL format.
static inline HVX_Vector dequantize_x4x2_iq4_nl_group_hvx(const uint8_t *packed_32, bool upper_nibbles, const __fp16 *scale, const HVX_Vector vlut_cvt) {
HVX_Vector vq = hvx_vmemu(packed_32);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_scales = hvx_vec_repl_f16(hvx_vmemu(scale));
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_hf = Q6_V_lo_W(vp);
return Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_hf, v_scales));
}
static inline HVX_Vector_x2 dequantize_x4x2_iq4_nl_x4groups_hvx(
const uint8_t *packed_128, bool upper_nibbles,
const __fp16 *scales_4, const HVX_Vector vlut_cvt) {
HVX_Vector vq = hvx_vmemu(packed_128);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_lo = Q6_V_lo_W(vp);
HVX_Vector v_hi = Q6_V_hi_W(vp);
HVX_Vector vscale = hvx_vmemu(scales_4);
HVX_Vector v_sc01 = hvx_vec_repl_2x_f16(vscale);
HVX_Vector v_sc23 = hvx_vec_repl_2x_f16(Q6_V_vror_VR(vscale, 4));
v_lo = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_lo, v_sc01));
v_hi = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vmpy_VhfVhf(v_hi, v_sc23));
HVX_Vector_x2 r = { v_lo, v_hi };
return r;
}
@@ -320,100 +414,176 @@ static inline HVX_Vector_x4 dequantize_x4x2_mxfp4_x4groups_hvx(const uint8_t *
return r;
}
typedef struct {
__fp16 *dst;
const uint8_t *src;
int n_cols;
int k_block;
size_t row_stride;
int weight_type;
int n_tot_tiles;
int n_tiles_per_task;
int n_tasks;
int n_k_tiles;
struct fastdiv_values n_k_tiles_div;
} x4x2_dequantize_state_t;
// Dequantize a tile range from x4x2 weight data (already in VTCM) to tile-major FP16.
// Input: vtcm_src has n_cols rows of x4x2 data, each row_stride bytes.
// Output: vtcm_dst in tile-major FP16 layout.
static void dequantize_x4x2_weight_to_fp16_tiles_task(
__fp16 *restrict vtcm_dst,
const uint8_t *restrict vtcm_src,
int n_cols, int k_block,
size_t row_stride, int weight_type,
#define DEFINE_DEQUANTIZE_Q4_TASK(suffix, lut_name, helper_prefix, dblk_size, scale_step) \
static void dequantize_x4x2_weight_to_fp16_tiles_task_##suffix( \
const x4x2_dequantize_state_t *state, \
int start_tile, int end_tile) { \
\
const int n_k_tiles = state->n_k_tiles; \
const int qrow_size = (unsigned)state->k_block / 2; \
const struct fastdiv_values n_k_tiles_div = state->n_k_tiles_div; \
const HVX_Vector vlut_cvt = hvx_vmem(lut_name); \
\
const HVX_Vector v_scat_base = hvx_vmem(hmx_transpose_scatter_offsets); \
const HVX_Vector v_scat_step = Q6_V_vsplat_R(4); \
const HVX_VectorPred q_mask64 = Q6_Q_vsetq_R(64); \
\
unsigned ct = fastdiv((unsigned)start_tile, &n_k_tiles_div); \
unsigned kt = fastmodulo((unsigned)start_tile, n_k_tiles, &n_k_tiles_div); \
\
for (unsigned t = start_tile; t < (unsigned)end_tile; ) { \
if (kt >= (unsigned)n_k_tiles) { kt = 0; ct++; } \
\
if ((kt % 4 == 0) && (t + 4 <= (unsigned)end_tile) && (fastdiv(t + 3, &n_k_tiles_div) == ct)) { \
unsigned blk_idx = ((kt * 32) / QK_Q4_0x4x2); \
unsigned sub_blk_base = ((kt * 32) % QK_Q4_0x4x2) / 32; \
bool upper = (sub_blk_base >= 4); \
unsigned packed_off = blk_idx * (QK_Q4_0x4x2 / 2); \
unsigned scale_off = qrow_size + blk_idx * (dblk_size) + sub_blk_base * (scale_step); \
\
__fp16 *tile_bases[4]; \
for (unsigned g = 0; g < 4; g++) { \
tile_bases[g] = state->dst + (t + g) * HMX_FP16_TILE_N_ELMS; \
} \
\
HVX_Vector v_off = v_scat_base; \
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * state->row_stride; \
\
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2) { \
const uint8_t *r0 = state->src + row_offset; row_offset += state->row_stride; \
const uint8_t *r1 = state->src + row_offset; row_offset += state->row_stride; \
\
HVX_Vector_x2 dv0 = dequantize_x4x2_##helper_prefix##_x4groups_hvx( \
r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt); \
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[0]); \
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[1]); \
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step); \
\
HVX_Vector_x2 dv1 = dequantize_x4x2_##helper_prefix##_x4groups_hvx( \
r1 + packed_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt); \
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[0]); \
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[1]); \
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step); \
} \
\
for (int g = 0; g < 4; g++) { (void) *(volatile HVX_Vector *)(tile_bases[g]); } \
t += 4; kt += 4; \
continue; \
} \
\
__fp16 *tile_base = state->dst + t * HMX_FP16_TILE_N_ELMS; \
{ \
unsigned blk_idx = (kt * 32) / QK_Q4_0x4x2; \
unsigned sub_blk = ((kt * 32) % QK_Q4_0x4x2) / 32; \
bool upper = (sub_blk >= 4); \
unsigned byte_off = blk_idx * (QK_Q4_0x4x2 / 2) + (upper ? (sub_blk - 4) : sub_blk) * 32; \
unsigned scale_off = qrow_size + blk_idx * (dblk_size) + sub_blk * (scale_step); \
\
HVX_Vector v_off = v_scat_base; \
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * state->row_stride; \
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1; \
\
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) { \
const uint8_t *r0 = state->src + row_offset; row_offset += state->row_stride; \
const uint8_t *r1 = state->src + row_offset; row_offset += state->row_stride; \
\
HVX_Vector v0 = dequantize_x4x2_##helper_prefix##_group_hvx( \
r0 + byte_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt); \
HVX_Vector v1 = (row1 < (unsigned)state->n_cols) \
? dequantize_x4x2_##helper_prefix##_group_hvx( \
r1 + byte_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt) \
: Q6_V_vzero(); \
\
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0); \
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step); \
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v1); \
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step); \
} \
(void) *(volatile HVX_Vector *)(tile_base); \
} \
++t; ++kt; \
} \
\
if (start_tile < end_tile) { \
(void) *(volatile HVX_Vector *)(state->dst + (end_tile - 1) * HMX_FP16_TILE_N_ELMS); \
} \
} \
\
static void dequantize_x4x2_worker_loop_##suffix(unsigned int n, unsigned int i, void *data) { \
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data; \
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) { \
int start = task_id * state->n_tiles_per_task; \
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles); \
dequantize_x4x2_weight_to_fp16_tiles_task_##suffix(state, start, end); \
} \
}
DEFINE_DEQUANTIZE_Q4_TASK(q4_0, q4_0_to_fp16_lut, q4_0, HMX_X4X2_DBLK_SIZE, (int)sizeof(__fp16))
DEFINE_DEQUANTIZE_Q4_TASK(q4_1, q4_1_to_fp16_lut, q4_1, 32, 4)
DEFINE_DEQUANTIZE_Q4_TASK(iq4_nl, iq4_nl_to_fp16_lut, iq4_nl, HMX_X4X2_DBLK_SIZE, (int)sizeof(__fp16))
static void dequantize_x4x2_weight_to_fp16_tiles_task_mxfp4(
const x4x2_dequantize_state_t *state,
int start_tile, int end_tile) {
const int n_k_tiles = (unsigned)k_block / HMX_FP16_TILE_N_COLS;
const bool is_q4 = (weight_type == HTP_TYPE_Q4_0 || weight_type == HTP_TYPE_IQ4_NL);
const int qrow_size = is_q4 ? ((unsigned)k_block / 2) : k_block;
const int n_k_tiles = state->n_k_tiles;
const int qrow_size = state->k_block;
const struct fastdiv_values n_k_tiles_div = state->n_k_tiles_div;
const HVX_Vector vlut_cvt = hvx_vmem(mxfp4_to_fp16_lut);
const HVX_Vector vlut_cvt = (weight_type == HTP_TYPE_IQ4_NL) ? hvx_vmem(iq4_nl_to_fp16_lut) :
(weight_type == HTP_TYPE_MXFP4) ? hvx_vmem(mxfp4_to_fp16_lut) :
hvx_vmem(q4_0_to_fp16_lut);
// vscatter setup: write dequantized K-values directly to transposed [K][N] tile positions.
// Each int32 element holds a K-row-pair (2 adjacent fp16 values). word[i] at offset i*128
// maps to K-rows 2i and 2i+1. Column offset (n*4) added per row.
const HVX_Vector v_scat_base = hvx_vmem(hmx_transpose_scatter_offsets);
const HVX_Vector v_scat_step = Q6_V_vsplat_R(4); // 4 bytes = 1 column step
const HVX_VectorPred q_mask64 = Q6_Q_vsetq_R(64); // first 16 words (64 bytes)
const HVX_Vector v_scat_step = Q6_V_vsplat_R(4);
const HVX_VectorPred q_mask64 = Q6_Q_vsetq_R(64);
unsigned ct = (unsigned)start_tile / n_k_tiles; // column tile index
unsigned kt = (unsigned)start_tile % n_k_tiles; // K tile index
for (unsigned t = start_tile; t < end_tile; ) {
if (kt >= n_k_tiles) { kt = 0; ct++; }
unsigned ct = fastdiv((unsigned)start_tile, &n_k_tiles_div);
unsigned kt = fastmodulo((unsigned)start_tile, n_k_tiles, &n_k_tiles_div);
// --- Batch-4 fast path for Q4: process 4 contiguous K-tiles with one vlut16 per row ---
if (is_q4 && (kt % 4 == 0) && (t + 4 <= end_tile) && ((t + 3) / n_k_tiles == ct)) {
unsigned blk_idx = (kt * 32) / QK_Q4_0x4x2;
unsigned sub_blk_base = ((kt * 32) % QK_Q4_0x4x2) / 32; // 0 or 4
bool upper = (sub_blk_base >= 4);
unsigned packed_off = blk_idx * (QK_Q4_0x4x2 / 2); // 128 contiguous packed bytes
unsigned scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE
+ sub_blk_base * (int)sizeof(__fp16); // 4 consecutive scales
for (unsigned t = start_tile; t < (unsigned)end_tile; ) {
if (kt >= (unsigned)n_k_tiles) { kt = 0; ct++; }
__fp16 *tile_bases[4];
for (unsigned g = 0; g < 4; g++) { tile_bases[g] = vtcm_dst + (t + g) * HMX_FP16_TILE_N_ELMS; }
HVX_Vector v_off = v_scat_base;
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * row_stride;
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector_x2 dv0 = dequantize_x4x2_q4_0_x4groups_hvx(r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector_x2 dv1 = dequantize_x4x2_q4_0_x4groups_hvx(r1 + packed_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
for (int g = 0; g < 4; g++) { (void) *(volatile HVX_Vector *)(tile_bases[g]); }
t += 4; kt += 4;
continue;
}
// --- Batch-4 fast path for MXFP4: same nibble layout but E8M0 scales ---
if (weight_type == HTP_TYPE_MXFP4 && (kt % 4 == 0) && (t + 4 <= end_tile) && ((t + 3) / n_k_tiles == ct)) {
// Batch-4 fast path for MXFP4
if ((kt % 4 == 0) && (t + 4 <= (unsigned)end_tile) && (fastdiv(t + 3, &n_k_tiles_div) == ct)) {
int blk_idx = (kt * 32) / QK_MXFP4x4x2;
int sub_blk_base = ((kt * 32) % QK_MXFP4x4x2) / 32; // 0 or 4
int sub_blk_base = ((kt * 32) % QK_MXFP4x4x2) / 32;
bool upper = (sub_blk_base >= 4);
int packed_off = blk_idx * (QK_MXFP4x4x2 / 2); // 128 contiguous packed bytes
int e8m0_blk_off = qrow_size + blk_idx * HMX_X4X2_MXFP4_EBLK_SIZE; // all 8 E8M0 scales
int packed_off = blk_idx * (QK_MXFP4x4x2 / 2);
int e8m0_blk_off = qrow_size + blk_idx * HMX_X4X2_MXFP4_EBLK_SIZE;
__fp16 * tile_bases[4];
for (int g = 0; g < 4; g++) {
tile_bases[g] = vtcm_dst + (t + g) * HMX_FP16_TILE_N_ELMS;
tile_bases[g] = state->dst + (t + g) * HMX_FP16_TILE_N_ELMS;
}
HVX_Vector v_off = v_scat_base;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2) {
int row0 = ct * HMX_FP16_TILE_N_COLS + r;
int row1 = row0 + 1;
const uint8_t * r0 = vtcm_src + row0 * row_stride;
const uint8_t * r1 = vtcm_src + row1 * row_stride;
const uint8_t * r0 = state->src + row0 * state->row_stride;
const uint8_t * r1 = state->src + row1 * state->row_stride;
// Batch-convert all 8 E8M0 scales once per row (stays in HVX register)
mxfp4_scales_t r0_e8 = mxfp4_convert_scales(r0 + e8m0_blk_off);
HVX_Vector_x4 dv0, dv1;
dv0 = dequantize_x4x2_mxfp4_x4groups_hvx(r0 + packed_off, upper, sub_blk_base, vlut_cvt, r0_e8);
if (row1 < n_cols) {
if (row1 < state->n_cols) {
mxfp4_scales_t r1_e8 = mxfp4_convert_scales(r1 + e8m0_blk_off);
dv1 = dequantize_x4x2_mxfp4_x4groups_hvx(r1 + packed_off, upper, sub_blk_base, vlut_cvt, r1_e8);
} else {
@@ -434,41 +604,13 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
(void) *(volatile HVX_Vector *) (tile_bases[g]);
}
t += 4;
t += 4; kt += 4;
continue;
}
// --- Single-tile fallback ---
__fp16 *tile_base = vtcm_dst + t * HMX_FP16_TILE_N_ELMS;
if (is_q4) {
unsigned blk_idx = (kt * 32) / QK_Q4_0x4x2;
unsigned sub_blk = ((kt * 32) % QK_Q4_0x4x2) / 32;
bool upper = (sub_blk >= 4);
unsigned byte_off = blk_idx * (QK_Q4_0x4x2 / 2) + (upper ? (sub_blk - 4) : sub_blk) * 32;
unsigned scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE + sub_blk * (int)sizeof(__fp16);
HVX_Vector v_off = v_scat_base; // reset to column 0
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * row_stride;
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector v0 = dequantize_x4x2_q4_0_group_hvx(
r0 + byte_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector v1 = (row1 < n_cols)
? dequantize_x4x2_q4_0_group_hvx(
r1 + byte_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt)
: Q6_V_vzero();
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v1);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
(void) *(volatile HVX_Vector *)(tile_base);
} else if (weight_type == HTP_TYPE_MXFP4) {
// Single-tile fallback
__fp16 *tile_base = state->dst + t * HMX_FP16_TILE_N_ELMS;
{
int blk_idx = (kt * 32) / QK_MXFP4x4x2;
int sub_blk = ((kt * 32) % QK_MXFP4x4x2) / 32;
bool upper = (sub_blk >= 4);
@@ -480,15 +622,14 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
int row0 = ct * HMX_FP16_TILE_N_COLS + r;
int row1 = row0 + 1;
const uint8_t * r0 = vtcm_src + row0 * row_stride;
const uint8_t * r1 = vtcm_src + row1 * row_stride;
const uint8_t * r0 = state->src + row0 * state->row_stride;
const uint8_t * r1 = state->src + row1 * state->row_stride;
// Batch-convert all 8 E8M0 scales once per row (stays in HVX register)
mxfp4_scales_t r0_e8 = mxfp4_convert_scales(r0 + e8m0_blk_off);
HVX_Vector v0 = dequantize_x4x2_mxfp4_group_hvx(r0 + byte_off, upper, sub_blk, vlut_cvt, r0_e8);
HVX_Vector v1;
if (row1 < n_cols) {
if (row1 < state->n_cols) {
mxfp4_scales_t r1_e8 = mxfp4_convert_scales(r1 + e8m0_blk_off);
v1 = dequantize_x4x2_mxfp4_group_hvx(r1 + byte_off, upper, sub_blk, vlut_cvt, r1_e8);
} else {
@@ -501,23 +642,59 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
(void) *(volatile HVX_Vector *) (tile_base);
} else {
// Q8_0
}
++t; ++kt;
}
if (start_tile < end_tile) {
(void) *(volatile HVX_Vector *)(state->dst + (end_tile - 1) * HMX_FP16_TILE_N_ELMS);
}
}
static void dequantize_x4x2_worker_loop_mxfp4(unsigned int n, unsigned int i, void *data) {
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data;
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) {
int start = task_id * state->n_tiles_per_task;
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles);
dequantize_x4x2_weight_to_fp16_tiles_task_mxfp4(state, start, end);
}
}
static void dequantize_x4x2_weight_to_fp16_tiles_task_q8_0(
const x4x2_dequantize_state_t *state,
int start_tile, int end_tile) {
const int n_k_tiles = state->n_k_tiles;
const int qrow_size = state->k_block;
const struct fastdiv_values n_k_tiles_div = state->n_k_tiles_div;
const HVX_Vector v_scat_base = hvx_vmem(hmx_transpose_scatter_offsets);
const HVX_Vector v_scat_step = Q6_V_vsplat_R(4);
const HVX_VectorPred q_mask64 = Q6_Q_vsetq_R(64);
unsigned ct = fastdiv((unsigned)start_tile, &n_k_tiles_div);
unsigned kt = fastmodulo((unsigned)start_tile, n_k_tiles, &n_k_tiles_div);
for (unsigned t = start_tile; t < (unsigned)end_tile; ) {
if (kt >= (unsigned)n_k_tiles) { kt = 0; ct++; }
__fp16 *tile_base = state->dst + t * HMX_FP16_TILE_N_ELMS;
{
int blk_idx = (kt * 32) / QK_Q8_0x4x2;
int sub_blk = ((kt * 32) % QK_Q8_0x4x2) / 32;
int byte_off = blk_idx * QK_Q8_0x4x2 + sub_blk * 32;
int scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE + sub_blk * (int)sizeof(__fp16);
HVX_Vector v_off = v_scat_base; // reset to column 0
HVX_Vector v_off = v_scat_base;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2) {
int row0 = ct * HMX_FP16_TILE_N_COLS + r;
int row1 = row0 + 1;
const uint8_t *r0 = vtcm_src + row0 * row_stride;
const uint8_t *r1 = vtcm_src + row1 * row_stride;
const uint8_t *r0 = state->src + row0 * state->row_stride;
const uint8_t *r1 = state->src + row1 * state->row_stride;
HVX_Vector v0 = dequantize_x4x2_q8_0_group_hvx((const int8_t *)(r0 + byte_off), (const __fp16 *)(r0 + scale_off));
HVX_Vector v1 = (row1 < n_cols) ? dequantize_x4x2_q8_0_group_hvx((const int8_t *)(r1 + byte_off), (const __fp16 *)(r1 + scale_off)) : Q6_V_vzero();
HVX_Vector v1 = (row1 < state->n_cols) ? dequantize_x4x2_q8_0_group_hvx((const int8_t *)(r1 + byte_off), (const __fp16 *)(r1 + scale_off)) : Q6_V_vzero();
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
@@ -529,50 +706,31 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
++t; ++kt;
}
// Drain HVX scatter write buffer: a vmem load on the same HW thread retires
// all pending scatter entries to VTCM. Without this, the main thread's HMX
// reads may see stale data because atomic_fetch_sub (release) only orders
// regular stores, not the HVX scatter buffer.
if (start_tile < end_tile) {
(void) *(volatile HVX_Vector *)(vtcm_dst + (end_tile - 1) * HMX_FP16_TILE_N_ELMS);
(void) *(volatile HVX_Vector *)(state->dst + (end_tile - 1) * HMX_FP16_TILE_N_ELMS);
}
}
typedef struct {
__fp16 *dst;
const uint8_t *src;
int n_cols;
int k_block;
size_t row_stride;
int weight_type;
int n_tot_tiles;
int n_tiles_per_task;
int n_tasks;
} x4x2_dequantize_state_t;
static void dequantize_x4x2_worker_loop(unsigned int n, unsigned int i, void *data) {
static void dequantize_x4x2_worker_loop_q8_0(unsigned int n, unsigned int i, void *data) {
x4x2_dequantize_state_t *state = (x4x2_dequantize_state_t *)data;
for (unsigned int task_id = i; task_id < (unsigned int)state->n_tasks; task_id += n) {
int start = task_id * state->n_tiles_per_task;
int end = hex_smin(start + state->n_tiles_per_task, state->n_tot_tiles);
dequantize_x4x2_weight_to_fp16_tiles_task(
state->dst, state->src, state->n_cols, state->k_block,
state->row_stride, state->weight_type, start, end);
dequantize_x4x2_weight_to_fp16_tiles_task_q8_0(state, start, end);
}
}
static void dequantize_x4x2_weight_chunk_to_fp16_tiles(
struct htp_context *ctx, __fp16 *vtcm_dst,
const void *vtcm_src, int n_cols, int k_block,
size_t row_stride, int weight_type) {
size_t row_stride, int weight_type,
int n_k_tiles, struct fastdiv_values n_k_tiles_div,
worker_callback_t dequant_worker_fn) {
assert(n_cols % HMX_FP16_TILE_N_COLS == 0);
assert(k_block % HMX_FP16_TILE_N_COLS == 0);
size_t n_col_tiles = n_cols / HMX_FP16_TILE_N_COLS;
size_t n_k_tiles = k_block / HMX_FP16_TILE_N_COLS;
size_t n_tot_tiles = n_col_tiles * n_k_tiles;
size_t n_tiles_per_task = hmx_ceil_div(n_tot_tiles, ctx->n_threads);
@@ -587,12 +745,16 @@ static void dequantize_x4x2_weight_chunk_to_fp16_tiles(
state.k_block = k_block;
state.row_stride = row_stride;
state.weight_type = weight_type;
state.n_k_tiles = n_k_tiles;
state.n_k_tiles_div = n_k_tiles_div;
worker_pool_run_func(ctx->worker_pool, dequantize_x4x2_worker_loop, &state, ctx->n_threads);
worker_pool_run_func(ctx->worker_pool, dequant_worker_fn, &state, ctx->n_threads);
}
// --- End x4x2 dequantizers ---
#pragma clang diagnostic ignored "-Wbackend-plugin" // spurios warning for hmx intrinsics
// requires external HMX lock
static void core_dot_chunk_fp16(__fp16 *restrict output, const __fp16 *restrict activation, const __fp16 *restrict weight, const __fp16 *restrict scales,
int n_row_tiles, int n_col_tiles, int n_dot_tiles) {
@@ -883,6 +1045,20 @@ int hmx_matmul_q_f32(struct htp_context *ctx, float *restrict dst, const float *
return -1;
}
worker_callback_t dequant_worker_fn = NULL;
switch (weight_type) {
case HTP_TYPE_Q4_0: dequant_worker_fn = dequantize_x4x2_worker_loop_q4_0; break;
case HTP_TYPE_IQ4_NL: dequant_worker_fn = dequantize_x4x2_worker_loop_iq4_nl; break;
case HTP_TYPE_Q4_1: dequant_worker_fn = dequantize_x4x2_worker_loop_q4_1; break;
case HTP_TYPE_MXFP4: dequant_worker_fn = dequantize_x4x2_worker_loop_mxfp4; break;
case HTP_TYPE_Q8_0: dequant_worker_fn = dequantize_x4x2_worker_loop_q8_0; break;
default:
return -1;
}
const int n_k_tiles = k / HMX_FP16_TILE_N_COLS;
const struct fastdiv_values n_k_tiles_div = init_fastdiv_values(n_k_tiles);
// --- Dynamic VTCM layout ---
const size_t vec_dot_size = k * sizeof(__fp16);
const size_t vtcm_budget = ctx->vtcm_size;
@@ -975,7 +1151,7 @@ int hmx_matmul_q_f32(struct htp_context *ctx, float *restrict dst, const float *
{
// B0: wait for DMA, dequant weight chunk 0
dma_queue_pop(ctx->dma[0]);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[0], vtcm_qweight, n_cols_A0, k, row_stride, weight_type);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[0], vtcm_qweight, n_cols_A0, k, row_stride, weight_type, n_k_tiles, n_k_tiles_div, dequant_worker_fn);
// A1: issue DMA for weight chunk 1
const size_t n_cols_A1 = hex_smin(n - 1 * n_chunk_n_cols, n_chunk_n_cols);
@@ -994,7 +1170,7 @@ int hmx_matmul_q_f32(struct htp_context *ctx, float *restrict dst, const float *
// B1: DMA pop + dequant (runs in parallel with C0 on HMX worker)
if (1 < n_chunk_cnt) {
dma_queue_pop(ctx->dma[0]);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[1], vtcm_qweight, n_cols_A1, k, row_stride, weight_type);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[1], vtcm_qweight, n_cols_A1, k, row_stride, weight_type, n_k_tiles, n_k_tiles_div, dequant_worker_fn);
}
}
@@ -1036,7 +1212,7 @@ int hmx_matmul_q_f32(struct htp_context *ctx, float *restrict dst, const float *
// B_{i+2}: DMA pop + dequant (multi-thread HVX, parallel with C_{i+1})
if (i + 2 < n_chunk_cnt) {
dma_queue_pop(ctx->dma[0]);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[(i + 2) % 2], vtcm_qweight, n_cols_p2, k, row_stride, weight_type);
dequantize_x4x2_weight_chunk_to_fp16_tiles(ctx, vtcm_weight_bufs[(i + 2) % 2], vtcm_qweight, n_cols_p2, k, row_stride, weight_type, n_k_tiles, n_k_tiles_div, dequant_worker_fn);
}
}
}

View File

@@ -20,6 +20,7 @@ enum htp_data_type {
HTP_TYPE_F32 = 0,
HTP_TYPE_F16 = 1,
HTP_TYPE_Q4_0 = 2,
HTP_TYPE_Q4_1 = 3,
HTP_TYPE_Q8_0 = 8,
HTP_TYPE_IQ4_NL = 20,
HTP_TYPE_I32 = 26,
@@ -28,6 +29,7 @@ enum htp_data_type {
// types used internally for repack, dyn.quant, etc
HTP_TYPE_Q4_0x4x2 = 200,
HTP_TYPE_Q4_1x4x2,
HTP_TYPE_Q8_0x4x2,
HTP_TYPE_MXFP4x4x2,

View File

@@ -853,6 +853,11 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
for (uint32_t i=0; i < n_ops; i++) {
struct profile_data prof;
if (i == (n_ops-1)) {
// wake up the host before starting the last op
dspqueue_write_early_wakeup_noblock(queue, 0, 0);
}
profile_start(ctx->profiler, &prof);
proc_op_req(octx, tens, i, &ops[i]);
@@ -869,8 +874,6 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
}
}
// dspqueue_write_early_wakeup_noblock(ctx->queue, 10, 0);
struct htp_opbatch_rsp rsp;
rsp.id = req.id;
rsp.status = HTP_STATUS_OK;

File diff suppressed because it is too large Load Diff

View File

@@ -164,6 +164,7 @@ set(GGML_OPENCL_KERNELS
sqr
sqrt
ssm_conv
gated_delta_net
sub
sum_rows
cumsum

View File

@@ -412,6 +412,7 @@ struct ggml_backend_opencl_context {
size_t max_workgroup_size;
bool fp16_support;
bool has_vector_subgroup_broadcast;
bool has_qcom_subgroup_shuffle = false; // cl_qcom_subgroup_shuffle
bool disable_fusion;
std::regex *opfilter = nullptr; // regex of ops to not claim
@@ -634,6 +635,10 @@ struct ggml_backend_opencl_context {
cl_kernel kernel_conv_2d_f32;
cl_kernel kernel_conv_2d_f16_f32;
cl_kernel kernel_ssm_conv_f32_f32, kernel_ssm_conv_f32_f32_4;
// [size_idx][kda][tgpp] where size_idx: 0=S_V=16, 1=32, 2=64, 3=128; kda: 0 or 1.
// tgpp 0 = TG variant (COLS_PER_LANE_GROUP=1), tgpp 1 = prefill variant (COLS_PER_LANE_GROUP=4).
cl_kernel kernel_gated_delta_net_f32[4][2][2] = {};
cl_kernel kernel_timestep_embedding;
cl_kernel kernel_gemv_moe_q4_0_f32_ns, kernel_gemm_moe_q4_0_f32_ns;
cl_kernel kernel_gemv_moe_q4_1_f32_ns, kernel_gemm_moe_q4_1_f32_ns;
@@ -837,16 +842,16 @@ static std::vector<ggml_backend_device> g_ggml_backend_opencl_devices;
static std::vector<std::unique_ptr<ggml_backend_opencl_device_context>> g_ggml_backend_opencl_dev_ctxs;
inline std::string read_file(const std::string &path) {
std::ifstream ifs(path);
if (!ifs) {
return "";
}
std::string text;
ifs.seekg(0, std::ios::end);
text.resize(ifs.tellg());
ifs.seekg(0, std::ios::beg);
ifs.read(&text[0], text.size());
return text;
std::ifstream ifs(path);
if (!ifs) {
return "";
}
std::string text;
ifs.seekg(0, std::ios::end);
text.resize(ifs.tellg());
ifs.seekg(0, std::ios::beg);
ifs.read(&text[0], text.size());
return text;
}
static cl_program build_program_from_source(cl_context ctx, cl_device_id dev, const char* program_buffer, const std::string &compile_opts) {
@@ -2463,12 +2468,12 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), compile_opts);
CL_CHECK((backend_ctx->kernel_upscale = clCreateKernel(backend_ctx->program_upscale, "kernel_upscale", &err), err));
if (backend_ctx->program_upscale) {
cl_int err_bilinear;
backend_ctx->kernel_upscale_bilinear = clCreateKernel(backend_ctx->program_upscale, "kernel_upscale_bilinear", &err_bilinear);
if (err_bilinear != CL_SUCCESS) {
cl_int err_bilinear;
backend_ctx->kernel_upscale_bilinear = clCreateKernel(backend_ctx->program_upscale, "kernel_upscale_bilinear", &err_bilinear);
if (err_bilinear != CL_SUCCESS) {
GGML_LOG_WARN("ggml_opencl: kernel_upscale_bilinear not found in upscale.cl. Bilinear upscale will not be available. Error: %d\n", err_bilinear);
backend_ctx->kernel_upscale_bilinear = nullptr;
}
}
} else {
backend_ctx->kernel_upscale_bilinear = nullptr;
}
@@ -2538,8 +2543,8 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
GGML_LOG_CONT(".");
}
// conv2d
{
// conv2d
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "conv2d.cl.h"
@@ -2597,6 +2602,86 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
GGML_LOG_CONT(".");
}
// gated_delta_net: one kernel per (S_V, KDA, tgpp) triple.
{
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "gated_delta_net.cl.h"
};
#else
const std::string kernel_src = read_file("gated_delta_net.cl");
#endif
const int gdn_sizes[4] = { 16, 32, 64, 128 };
const int sg_size = backend_ctx->gpu_family == GPU_FAMILY::ADRENO ? 64 : backend_ctx->gpu_family == GPU_FAMILY::INTEL ? 32 : -1;
if (sg_size < 0) {
GGML_LOG_ERROR("Unsupported GPU Family: only Adreno and Intel are supported.\n");
exit(1);
}
for (int si = 0; si < 4; si++) {
const int S_V = gdn_sizes[si];
// MUST match the dispatcher heuristic in ggml_cl_gated_delta_net exactly.
int lanes_per_column;
if (S_V >= 128) {
lanes_per_column = 8;
} else {
lanes_per_column = std::min(S_V, sg_size);
}
// Round LANES_PER_COLUMN down until it is:
// * power-of-two
// * divides both S_V and sg_size
while (lanes_per_column > 1 &&
(((lanes_per_column & (lanes_per_column - 1)) != 0) ||
(S_V % lanes_per_column) != 0 ||
(sg_size % lanes_per_column) != 0)) {
lanes_per_column >>= 1;
}
GGML_ASSERT(lanes_per_column >= 1);
GGML_ASSERT(((lanes_per_column & (lanes_per_column - 1)) == 0));
GGML_ASSERT((S_V % lanes_per_column) == 0);
GGML_ASSERT((sg_size % lanes_per_column) == 0);
const bool is_partial_reduce = (lanes_per_column != 1) && (lanes_per_column < sg_size);
int use_qcom_shuffle = 0;
if (is_partial_reduce) {
if (backend_ctx->has_qcom_subgroup_shuffle) {
use_qcom_shuffle = 1;
}
}
for (int kda = 0; kda < 2; kda++) {
for (int tgpp = 0; tgpp < 2; tgpp++) {
const int cpl = (tgpp == 0) ? 1 : 4;
const int spw = (tgpp == 0) ? 1 : 1;
std::string opts = compile_opts;
opts += " -DS_V=" + std::to_string(S_V);
opts += " -DKDA=" + std::to_string(kda);
opts += " -DSUBGROUP_SIZE=" + std::to_string(sg_size);
opts += " -DLANES_PER_COLUMN=" + std::to_string(lanes_per_column);
opts += " -DCOLS_PER_LANE_GROUP=" + std::to_string(cpl);
opts += " -DUSE_QCOM_SUBGROUP_SHUFFLE=" + std::to_string(use_qcom_shuffle);
// Since spw=1 is found to be optimal, SUBGROUPS_PER_WG > 1 code in
// the kernel is removed. If you want to experiment with spw > 1,
// Please remember to implement code to handle it.
opts += " -DSUBGROUPS_PER_WG=" + std::to_string(spw);
cl_program prog = build_program_from_source(
backend_ctx->context, backend_ctx->device, kernel_src.c_str(), opts);
CL_CHECK((backend_ctx->kernel_gated_delta_net_f32[si][kda][tgpp] =
clCreateKernel(prog, "kernel_gated_delta_net", &err), err));
CL_CHECK(clReleaseProgram(prog));
}
}
}
GGML_LOG_CONT(".");
}
// mul_mv_id_q4_0_f32_8x_flat
{
#ifdef GGML_OPENCL_EMBED_KERNELS
@@ -2827,7 +2912,7 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "gemm_noshuffle_q4_1_f32.cl.h"
};
};
#else
const std::string kernel_src = read_file("gemm_noshuffle_q4_1_f32.cl");
#endif
@@ -2866,7 +2951,7 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "gemm_noshuffle_iq4_nl_f32.cl.h"
};
};
#else
const std::string kernel_src = read_file("gemm_noshuffle_iq4_nl_f32.cl");
#endif
@@ -2905,7 +2990,7 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "gemm_noshuffle_q8_0_f32.cl.h"
};
};
#else
const std::string kernel_src = read_file("gemm_noshuffle_q8_0_f32.cl");
#endif
@@ -2946,7 +3031,7 @@ static void load_cl_kernels(ggml_backend_opencl_context *backend_ctx) {
#ifdef GGML_OPENCL_EMBED_KERNELS
const std::string kernel_src {
#include "gemm_noshuffle_q4_k_f32.cl.h"
};
};
#else
const std::string kernel_src = read_file("gemm_noshuffle_q4_k_f32.cl");
#endif
@@ -3781,6 +3866,16 @@ static ggml_backend_opencl_context * ggml_cl_init(ggml_backend_dev_t dev) {
clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, ext_str_size, ext_buffer, NULL);
ext_buffer[ext_str_size] = '\0'; // ensure it is null terminated
// check support for qcom_subgroup_shuffle
if (opencl_c_version.major == 3 && strstr(ext_buffer, "cl_khr_subgroups") != NULL) {
GGML_LOG_INFO("ggml_opencl: cl_khr_subgroups support: true\n");
if (strstr(ext_buffer, "cl_qcom_subgroup_shuffle") != NULL) {
backend_ctx->has_qcom_subgroup_shuffle = true;
}
}
GGML_LOG_INFO("ggml_opencl: cl_qcom_subgroup_shuffle support: %s\n",
backend_ctx->has_qcom_subgroup_shuffle ? "true" : "false");
// Check if ext_buffer contains cl_khr_fp16
backend_ctx->fp16_support = strstr(ext_buffer, "cl_khr_fp16") != NULL;
GGML_LOG_INFO("ggml_opencl: device FP16 support: %s\n", backend_ctx->fp16_support ? "true" : "false");
@@ -4832,17 +4927,17 @@ static bool ggml_opencl_supports_op(ggml_backend_dev_t dev, const struct ggml_te
case GGML_UNARY_OP_RELU:
case GGML_UNARY_OP_GELU_ERF:
case GGML_UNARY_OP_GELU_QUICK:
return ggml_is_contiguous(op->src[0]) && op->src[0]->type == GGML_TYPE_F32;
return ggml_is_contiguous(op->src[0]) && op->src[0]->type == GGML_TYPE_F32;
case GGML_UNARY_OP_SIGMOID:
return ggml_is_contiguous(op->src[0]);
case GGML_UNARY_OP_TANH:
case GGML_UNARY_OP_NEG:
case GGML_UNARY_OP_EXP:
return op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16;
return op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16;
case GGML_UNARY_OP_EXPM1:
return op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16;
return op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16;
case GGML_UNARY_OP_SOFTPLUS:
return op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16;
return op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16;
default:
return false;
}
@@ -4891,6 +4986,15 @@ static bool ggml_opencl_supports_op(ggml_backend_dev_t dev, const struct ggml_te
(op->src[0]->type == GGML_TYPE_F16 && op->src[1]->type == GGML_TYPE_F32 && op->type == GGML_TYPE_F32);
case GGML_OP_SSM_CONV:
return (op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32 && op->type == GGML_TYPE_F32);
case GGML_OP_GATED_DELTA_NET:
{
// Match the Vulkan backend: only F32 -> F32, S_v in {16, 32, 64, 128}.
if (op->src[0]->type != GGML_TYPE_F32 || op->type != GGML_TYPE_F32) {
return false;
}
const int64_t S_v = op->src[2]->ne[0];
return S_v == 16 || S_v == 32 || S_v == 64 || S_v == 128;
}
case GGML_OP_CONCAT:
return op->src[0]->type == GGML_TYPE_F32 && op->src[1]->type == GGML_TYPE_F32 && op->type == GGML_TYPE_F32;
case GGML_OP_TIMESTEP_EMBEDDING:
@@ -10555,7 +10659,7 @@ static void ggml_cl_pad(ggml_backend_t backend, const ggml_tensor * src0, ggml_t
size_t local_work_size[] = { lws0, 1, 1 };
size_t * local_work_size_ptr = local_work_size;
if (d_ne0 % lws0 != 0 && !backend_ctx->non_uniform_workgroups) {
if (d_ne0 % lws0 != 0 && !backend_ctx->non_uniform_workgroups) {
local_work_size_ptr = nullptr;
}
@@ -17052,6 +17156,185 @@ static void ggml_cl_glu(ggml_backend_t backend, const ggml_tensor * src0, const
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
}
static void ggml_cl_gated_delta_net(ggml_backend_t backend, ggml_tensor * dst) {
GGML_ASSERT(dst);
GGML_ASSERT(dst->extra);
const ggml_tensor * src_q = dst->src[0];
const ggml_tensor * src_k = dst->src[1];
const ggml_tensor * src_v = dst->src[2];
const ggml_tensor * src_g = dst->src[3];
const ggml_tensor * src_beta = dst->src[4];
const ggml_tensor * src_state = dst->src[5];
GGML_ASSERT(src_q && src_q->extra);
GGML_ASSERT(src_k && src_k->extra);
GGML_ASSERT(src_v && src_v->extra);
GGML_ASSERT(src_g && src_g->extra);
GGML_ASSERT(src_beta && src_beta->extra);
GGML_ASSERT(src_state && src_state->extra);
ggml_backend_opencl_context * backend_ctx = (ggml_backend_opencl_context *) backend->context;
const cl_uint S_v = (cl_uint) src_v->ne[0];
const cl_uint H_v = (cl_uint) src_v->ne[1];
const cl_uint n_tokens = (cl_uint) src_v->ne[2];
const cl_uint n_seqs = (cl_uint) src_v->ne[3];
const cl_uint K = (cl_uint) src_state->ne[1];
int si;
switch (S_v) {
case 16: si = 0; break;
case 32: si = 1; break;
case 64: si = 2; break;
case 128: si = 3; break;
default:
GGML_ASSERT(false && "ggml_cl_gated_delta_net: unsupported S_v");
}
const int kda = (src_g->ne[0] == (int64_t) S_v) ? 1 : 0;
// TODO: Optimize when S_v!=128. Not necessary for now as Qwen3.5/6 are all S_v=128
// token generation mode (tgpp=0):
// process 1 token at a time, so columns per lane (cpl) == 1
// prompt processing mode (tgpp=1):
// cpl=4 to process 4 tokens for single-token. 4 is chosen for Adreno 750 as per
// work-item/thread has at most 128 registers.
// All Qwen3.5/6 models are S_v == 128, so LANES_PER_COLUMN == 8
// such that ROWS_PER_LANE = 128/8 = 16
// Variables in the kernel:
// k_reg, q_reg, g_exp are all 16 floats
// s_shard has cpl*ROWS_PER_LANE = 4*16 = 64 floats
// Total 112 registers used.
// subgroups_per_workgroup (spw) can be set to 1,2,4,8,16 for tg and 1,2,4 for pp
// for S_v=128.
// Empirically found that when spw=1, we get the best performance for both tg and pp
const int tgpp = (n_tokens == 1) ? 0 : 1;
const int cpl = (tgpp == 0) ? 1 : 4;
// spw needs adjustment when S_v != 128
const int spw = (tgpp == 0) ? 1 : 1;
cl_kernel kernel = backend_ctx->kernel_gated_delta_net_f32[si][kda][tgpp];
GGML_ASSERT(kernel != nullptr);
const cl_uint s_off = S_v * H_v * n_tokens * n_seqs;
const cl_uint sq1 = (cl_uint)(src_q->nb[1] / sizeof(float));
const cl_uint sq2 = (cl_uint)(src_q->nb[2] / sizeof(float));
const cl_uint sq3 = (cl_uint)(src_q->nb[3] / sizeof(float));
const cl_uint sv1 = (cl_uint)(src_v->nb[1] / sizeof(float));
const cl_uint sv2 = (cl_uint)(src_v->nb[2] / sizeof(float));
const cl_uint sv3 = (cl_uint)(src_v->nb[3] / sizeof(float));
const cl_uint sb1 = (cl_uint)(src_beta->nb[1] / sizeof(float));
const cl_uint sb2 = (cl_uint)(src_beta->nb[2] / sizeof(float));
const cl_uint sb3 = (cl_uint)(src_beta->nb[3] / sizeof(float));
const cl_uint H_k = (cl_uint) src_q->ne[1];
const cl_uint rq3 = (cl_uint)(src_v->ne[3] / src_q->ne[3]);
const float scale = 1.0f / sqrtf((float) S_v);
ggml_tensor_extra_cl * extra_q = (ggml_tensor_extra_cl *) src_q->extra;
ggml_tensor_extra_cl * extra_k = (ggml_tensor_extra_cl *) src_k->extra;
ggml_tensor_extra_cl * extra_v = (ggml_tensor_extra_cl *) src_v->extra;
ggml_tensor_extra_cl * extra_g = (ggml_tensor_extra_cl *) src_g->extra;
ggml_tensor_extra_cl * extra_beta = (ggml_tensor_extra_cl *) src_beta->extra;
ggml_tensor_extra_cl * extra_state = (ggml_tensor_extra_cl *) src_state->extra;
ggml_tensor_extra_cl * extra_dst = (ggml_tensor_extra_cl *) dst->extra;
const cl_ulong off_q = extra_q->offset + src_q->view_offs;
const cl_ulong off_k = extra_k->offset + src_k->view_offs;
const cl_ulong off_v = extra_v->offset + src_v->view_offs;
const cl_ulong off_g = extra_g->offset + src_g->view_offs;
const cl_ulong off_beta = extra_beta->offset + src_beta->view_offs;
const cl_ulong off_state = extra_state->offset + src_state->view_offs;
const cl_ulong off_dst = extra_dst->offset + dst->view_offs;
int idx = 0;
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_mem), &extra_q->data_device));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_ulong), &off_q));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_mem), &extra_k->data_device));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_ulong), &off_k));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_mem), &extra_v->data_device));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_ulong), &off_v));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_mem), &extra_g->data_device));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_ulong), &off_g));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_mem), &extra_beta->data_device));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_ulong), &off_beta));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_mem), &extra_state->data_device));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_ulong), &off_state));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_mem), &extra_dst->data_device));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_ulong), &off_dst));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &H_v));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &n_tokens));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &n_seqs));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &s_off));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &sq1));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &sq2));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &sq3));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &sv1));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &sv2));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &sv3));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &sb1));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &sb2));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &sb3));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &H_k));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &rq3));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(float), &scale));
CL_CHECK(clSetKernelArg(kernel, idx++, sizeof(cl_uint), &K));
// Subgroup size is 64 for Adreno and 32 for Intel
const int sg_size = backend_ctx->gpu_family == GPU_FAMILY::ADRENO ? 64 : backend_ctx->gpu_family == GPU_FAMILY::INTEL ? 32 : -1;
if (sg_size < 0) {
GGML_LOG_ERROR("Unsupported GPU Family: only Adreno and Intel are supported.\n");
exit(1);
}
// For the subgroup-shuffle kernel, we can safely prefer 8 lanes/column for S_v>=128
// For the subgroup-shuffle kernel:
// S_v >= 128 -> prefer 8 lanes/column (good occupancy & register pressure tradeoff)
// else -> min(S_v, subgroup_size)
int lanes_per_column;
if ((int)S_v >= 128) {
lanes_per_column = 8;
} else {
lanes_per_column = std::min((int)S_v, sg_size);
}
// Max workgroup size for Adreno 750 is 1024
const int wg_size = sg_size * spw;
// Ensure lanes_per_column is a power-of-two and divides both S_v and subgroup_size.
// (Required for lane-group shuffle-xor reduction correctness.)
while (lanes_per_column > 1 &&
(((lanes_per_column & (lanes_per_column - 1)) != 0) ||
(((int)S_v % lanes_per_column) != 0) ||
(sg_size % lanes_per_column) != 0)) {
lanes_per_column >>= 1;
}
GGML_ASSERT(lanes_per_column >= 1);
GGML_ASSERT(((lanes_per_column & (lanes_per_column - 1)) == 0));
GGML_ASSERT(((int)S_v % lanes_per_column) == 0);
GGML_ASSERT((sg_size % lanes_per_column) == 0);
const int cols_per_wg = spw * (sg_size / lanes_per_column) * cpl;
GGML_ASSERT(cols_per_wg > 0);
GGML_ASSERT(((int)S_v % cols_per_wg) == 0);
size_t global_work_size[3];
size_t local_work_size[3];
global_work_size[0] = (size_t) H_v * (size_t) wg_size;
global_work_size[1] = (size_t) n_seqs;
global_work_size[2] = (size_t) S_v / (size_t) cols_per_wg;
local_work_size[0] = (size_t) wg_size;
local_work_size[1] = 1;
local_work_size[2] = 1;
backend_ctx->enqueue_ndrange_kernel(kernel, 3, global_work_size, local_work_size, dst);
}
//------------------------------------------------------------------------------
// Op offloading
//------------------------------------------------------------------------------
@@ -17267,8 +17550,8 @@ bool ggml_cl_compute_forward(ggml_backend_t backend, struct ggml_tensor * tensor
}
func = ggml_cl_group_norm;
break;
case GGML_OP_REPEAT:
if (!any_on_device) {
case GGML_OP_REPEAT:
if (!any_on_device) {
return false;
}
func = ggml_cl_repeat;
@@ -17297,6 +17580,14 @@ bool ggml_cl_compute_forward(ggml_backend_t backend, struct ggml_tensor * tensor
}
func = ggml_cl_ssm_conv;
break;
case GGML_OP_GATED_DELTA_NET:
if (!any_on_device) {
return false;
}
// GDN has 6 source tensors, so it cannot use the standard
// (src0, src1, dst) func signature. Dispatch directly and return.
ggml_cl_gated_delta_net(backend, tensor);
return true;
case GGML_OP_CONCAT:
if (!any_on_device) {
return false;

View File

@@ -0,0 +1,247 @@
#pragma OPENCL EXTENSION cl_khr_subgroups : enable
#ifdef cl_intel_required_subgroup_size
#pragma OPENCL EXTENSION cl_intel_required_subgroup_size : enable
#define INTEL_GPU 1
#define REQD_SUBGROUP_SIZE_16 __attribute__((intel_reqd_sub_group_size(16)))
#define REQD_SUBGROUP_SIZE_32 __attribute__((intel_reqd_sub_group_size(32)))
#elif defined(cl_qcom_reqd_sub_group_size)
#pragma OPENCL EXTENSION cl_qcom_reqd_sub_group_size : enable
#define ADRENO_GPU 1
#define REQD_SUBGROUP_SIZE_64 __attribute__((qcom_reqd_sub_group_size("half")))
#define REQD_SUBGROUP_SIZE_128 __attribute__((qcom_reqd_sub_group_size("full")))
#endif
#ifndef S_V
#define S_V 128
#endif
#ifndef KDA
#define KDA 0
#endif
#ifndef SUBGROUP_SIZE
#define SUBGROUP_SIZE 64
#endif
#ifndef LANES_PER_COLUMN
#define LANES_PER_COLUMN 8
#endif
#ifndef COLS_PER_LANE_GROUP
#define COLS_PER_LANE_GROUP 1
#endif
#ifndef SUBGROUPS_PER_WG
#define SUBGROUPS_PER_WG 1
#endif
#ifndef USE_QCOM_SUBGROUP_SHUFFLE
#define USE_QCOM_SUBGROUP_SHUFFLE 0
#endif
#define WG_SIZE (SUBGROUP_SIZE * SUBGROUPS_PER_WG)
#define LANE_GROUPS_PER_SG (SUBGROUP_SIZE / LANES_PER_COLUMN)
#define COLS_PER_SG (LANE_GROUPS_PER_SG * COLS_PER_LANE_GROUP)
#define COLS_PER_WG (SUBGROUPS_PER_WG * COLS_PER_SG)
#define ROWS_PER_LANE (S_V / LANES_PER_COLUMN)
#if USE_QCOM_SUBGROUP_SHUFFLE
#pragma OPENCL EXTENSION cl_qcom_subgroup_shuffle : enable
#endif
// XOR-based parallel sum
// This does a reduction across groups of LANES_PER_COLUMN
static inline float reduce_add_shmem(float partial, __local float * temp, uint lane) {
#if USE_QCOM_SUBGROUP_SHUFFLE
#pragma unroll
for (uint s = LANES_PER_COLUMN / 2u; s > 0u; s >>= 1u) {
partial += qcom_sub_group_shuffle_xor(partial, s, CLK_SUB_GROUP_SHUFFLE_WIDTH_WAVE_SIZE_QCOM, partial);
}
return partial;
#else
temp[lane] = partial;
sub_group_barrier(CLK_LOCAL_MEM_FENCE);
#pragma unroll
for (uint s = LANES_PER_COLUMN / 2u; s > 0u; s >>= 1u) {
float other = temp[lane ^ s];
sub_group_barrier(CLK_LOCAL_MEM_FENCE);
temp[lane] += other;
sub_group_barrier(CLK_LOCAL_MEM_FENCE);
}
const float result = temp[lane];
sub_group_barrier(CLK_LOCAL_MEM_FENCE);
return result;
#endif
}
#define REDUCE_PARTIAL(partial, temp_ptr, lid) \
((LANES_PER_COLUMN == 1u) ? (partial) : reduce_add_shmem((partial), (temp_ptr), (lid)))
// force compiler to optimize kernel for a specific fixed work-group size
__attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
#ifdef INTEL_GPU
REQD_SUBGROUP_SIZE_32
#elif defined (ADRENO_GPU)
REQD_SUBGROUP_SIZE_64
#endif
kernel void kernel_gated_delta_net(
global const char * q_buf, ulong off_q,
global const char * k_buf, ulong off_k,
global const char * v_buf, ulong off_v,
global const char * g_buf, ulong off_g,
global const char * beta_buf, ulong off_beta,
global const char * state_buf, ulong off_state,
global char * dst_buf, ulong off_dst,
uint H_v,
uint n_tokens,
uint n_seqs,
uint s_off,
uint sq1, uint sq2, uint sq3,
uint sv1, uint sv2, uint sv3,
uint sb1, uint sb2, uint sb3,
uint H_k,
uint rq3,
float scale,
uint K) {
global const float * data_q = (global const float *)(q_buf + off_q);
global const float * data_k = (global const float *)(k_buf + off_k);
global const float * data_v = (global const float *)(v_buf + off_v);
global const float * data_g = (global const float *)(g_buf + off_g);
global const float * data_beta = (global const float *)(beta_buf + off_beta);
global const float * data_state = (global const float *)(state_buf + off_state);
global float * data_dst = (global float *)(dst_buf + off_dst);
const uint head_id = get_group_id(0);
const uint seq_id = get_group_id(1);
const uint tid = (uint)get_local_id(0);
const uint sg_id = get_sub_group_id(); // subgroup id
const uint sg_lid = get_sub_group_local_id(); // subgroup lane id
const uint lane = sg_lid % LANES_PER_COLUMN;
const uint lane_group = sg_lid / LANES_PER_COLUMN;
const uint wg_col_base = get_group_id(2) * COLS_PER_WG;
const uint sg_col_base = wg_col_base + sg_id * COLS_PER_SG;
const uint iq1 = head_id % H_k; // head index for Q and K
const uint iq3 = seq_id / rq3; // seq index for Q and K
const uint state_size = S_V * S_V;
const uint state_base = (seq_id * K * H_v + head_id) * state_size;
const uint q_off_base = iq3 * sq3 + iq1 * sq1;
const uint v_off_base = seq_id * sv3 + head_id * sv1;
const uint gb_off_base = seq_id * sb3 + head_id * sb1;
const uint state_out_base = (seq_id * H_v + head_id) * state_size;
const uint state_size_per_snap = state_size * H_v * n_seqs;
__local float reduce_temp[WG_SIZE];
__local float * temp_ptr = reduce_temp + sg_id * SUBGROUP_SIZE;
float s_shard[COLS_PER_LANE_GROUP][ROWS_PER_LANE];
#pragma unroll
for (uint cg = 0; cg < COLS_PER_LANE_GROUP; cg++) {
const uint col = sg_col_base + cg * LANE_GROUPS_PER_SG + lane_group;
#pragma unroll
for (uint r = 0; r < ROWS_PER_LANE; r++) {
s_shard[cg][r] = data_state[state_base + col * S_V + r * LANES_PER_COLUMN + lane];
}
}
const int shift = (int)n_tokens - (int)K;
uint attn_off = (seq_id * n_tokens * H_v + head_id) * S_V;
for (uint t = 0; t < n_tokens; t++) {
const uint q_off = q_off_base + t * sq2;
const uint k_off = q_off;
const uint v_off = v_off_base + t * sv2;
const uint gb_off = gb_off_base + t * sb2;
const float beta_val = data_beta[gb_off];
float k_reg[ROWS_PER_LANE];
float q_reg[ROWS_PER_LANE];
#if KDA
float g_exp[ROWS_PER_LANE];
#pragma unroll
for (uint r = 0; r < ROWS_PER_LANE; r++) {
const uint i = r * LANES_PER_COLUMN + lane;
k_reg[r] = data_k[k_off + i];
q_reg[r] = data_q[q_off + i];
g_exp[r] = exp(data_g[gb_off * S_V + i]);
}
#else
const float g_val = exp(data_g[gb_off]);
#pragma unroll
for (uint r = 0; r < ROWS_PER_LANE; r++) {
const uint i = r * LANES_PER_COLUMN + lane;
k_reg[r] = data_k[k_off + i];
q_reg[r] = data_q[q_off + i];
}
#endif
#pragma unroll
for (uint cg = 0; cg < COLS_PER_LANE_GROUP; cg++) {
const uint col = sg_col_base + cg * LANE_GROUPS_PER_SG + lane_group;
float v_val = data_v[v_off + col];
float kv_shard = 0.0f;
#pragma unroll
for (uint r = 0; r < ROWS_PER_LANE; r++) {
#if KDA
float gs = g_exp[r] * s_shard[cg][r];
kv_shard += gs * k_reg[r];
#else
kv_shard += s_shard[cg][r] * k_reg[r];
#endif
}
#if !KDA
kv_shard *= g_val; // Applied once instead of ROWS_PER_LANE times
#endif
const float kv_col = REDUCE_PARTIAL(kv_shard, temp_ptr, sg_lid);
const float delta_col = (v_val - kv_col) * beta_val;
float attn_partial = 0.0f;
#pragma unroll
for (uint r = 0; r < ROWS_PER_LANE; r++) {
#if KDA
float gs = g_exp[r] * s_shard[cg][r];
#else
float gs = g_val * s_shard[cg][r];
#endif
s_shard[cg][r] = gs + k_reg[r] * delta_col;
attn_partial += s_shard[cg][r] * q_reg[r];
}
const float attn_col = REDUCE_PARTIAL(attn_partial, temp_ptr, sg_lid);
if (lane == 0) {
data_dst[attn_off + col] = attn_col * scale;
}
}
attn_off += S_V * H_v;
if (K > 1u) {
const int target_slot = (int)t - shift;
if (target_slot >= 0 && target_slot < (int)K) {
#pragma unroll
for (uint cg = 0; cg < COLS_PER_LANE_GROUP; cg++) {
const uint col = sg_col_base + cg * LANE_GROUPS_PER_SG + lane_group;
const uint slot_base = s_off + (uint)target_slot * state_size_per_snap + state_out_base;
#pragma unroll
for (uint r = 0; r < ROWS_PER_LANE; r++) {
data_dst[slot_base + col * S_V + r * LANES_PER_COLUMN + lane] = s_shard[cg][r];
}
}
}
}
}
if (K == 1u) {
#pragma unroll
for (uint cg = 0; cg < COLS_PER_LANE_GROUP; cg++) {
const uint col = sg_col_base + cg * LANE_GROUPS_PER_SG + lane_group;
#pragma unroll
for (uint r = 0; r < ROWS_PER_LANE; r++) {
data_dst[s_off + state_base + col * S_V + r * LANES_PER_COLUMN + lane] = s_shard[cg][r];
}
}
}
}

View File

@@ -860,6 +860,7 @@ struct vk_device_struct {
vk_pipeline pipeline_argsort_large_f32[num_argsort_pipelines];
vk_pipeline pipeline_topk_f32[num_topk_pipelines];
vk_pipeline pipeline_sum_rows_f32;
vk_pipeline pipeline_fwht_f32[4];
vk_pipeline pipeline_cumsum_f32;
vk_pipeline pipeline_cumsum_small_f32;
vk_pipeline pipeline_cumsum_multipass1_f32;
@@ -1150,6 +1151,13 @@ struct vk_op_push_constants {
float param4;
};
struct vk_op_fwht_push_constants {
uint32_t n_rows;
uint32_t src_offset;
uint32_t dst_offset;
float scale;
};
struct vk_op_count_experts_push_constants {
uint32_t ne00;
uint32_t ne01;
@@ -2055,6 +2063,15 @@ template <> void init_pushconst_tensor_offsets(ggml_backend_vk_context * ctx, vk
GGML_UNUSED(src3);
}
template <> void init_pushconst_tensor_offsets(ggml_backend_vk_context * ctx, vk_op_fwht_push_constants &p, const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * src2, const ggml_tensor * src3, ggml_tensor * dst) {
p.src_offset = get_misalign_bytes(ctx, src0) / ggml_type_size(src0->type);
p.dst_offset = get_misalign_bytes(ctx, dst) / ggml_type_size(dst->type);
GGML_UNUSED(src1);
GGML_UNUSED(src2);
GGML_UNUSED(src3);
}
struct ggml_backend_vk_buffer_context {
vk_device_ref device;
vk_buffer dev_buffer;
@@ -2095,9 +2112,9 @@ void vk_memory_logger::log_deallocation(vk_buffer_ref buf_ref) {
const bool device = bool(buf->memory_property_flags & vk::MemoryPropertyFlagBits::eDeviceLocal);
std::string type = device ? "device" : "host";
auto it = allocations.find(buf->buffer);
total_device -= device ? it->second : 0;
total_host -= device ? 0 : it->second;
if (it != allocations.end()) {
total_device -= device ? it->second : 0;
total_host -= device ? 0 : it->second;
VK_LOG_MEMORY(buf->device->name << ": -" << format_size(it->second) << " " << type << " at " << buf->buffer << ". Total device: " << format_size(total_device) << ", total host: " << format_size(total_host));
allocations.erase(it);
} else {
@@ -4982,6 +4999,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device->pipeline_argmax_f32, "argmax_f32", argmax_f32_len, argmax_f32_data, "main", 2, sizeof(vk_op_push_constants), {1, 1, 1}, { device->subgroup_size }, 1);
ggml_vk_create_pipeline(device, device->pipeline_sum_rows_f32, "sum_rows_f32", sum_rows_f32_len, sum_rows_f32_data, "main", 2, sizeof(vk_op_sum_rows_push_constants), {1, 1, 1}, { device->subgroup_size }, 1);
// Intel Arc B390 was observed segfaulting with this shader.
if (device->subgroup_basic && device->subgroup_shuffle && device->vendor_id != VK_VENDOR_ID_INTEL) {
int idx = 0;
for (uint32_t n : {64, 128, 256, 512}) {
if (device->subgroup_size <= n) {
ggml_vk_create_pipeline(device, device->pipeline_fwht_f32[idx], "fwht_f32", fwht_f32_len, fwht_f32_data, "main", 2, sizeof(vk_op_fwht_push_constants), {1, 1, 1}, { device->subgroup_size, n }, 1, true, true, device->subgroup_size);
}
++idx;
}
}
const uint32_t cumsum_elem_per_thread = (device->vendor_id == VK_VENDOR_ID_AMD || device->vendor_id == VK_VENDOR_ID_INTEL) ? 2 : 4;
ggml_vk_create_pipeline(device, device->pipeline_cumsum_f32, "cumsum_f32", cumsum_f32_len, cumsum_f32_data, "main", 2, sizeof(vk_op_sum_rows_push_constants), {1, 1, 1}, { 256, device->subgroup_size, cumsum_elem_per_thread }, 1, true, true, device->subgroup_size);
@@ -7233,7 +7260,7 @@ static void ggml_vk_buffer_write_nc_async(ggml_backend_vk_context * ctx, vk_cont
const uint64_t s_off = buf_offset + i3*nb3 + i2*nb2 + i1*nb1;
const uint64_t d_off = offset + i3*dstnb3 + i2*dstnb2 + i1*dstnb1;
for (uint64_t i0 = 0; i0 < ne0; i0++) {
slices.push_back({ s_off + i1*nb0, d_off + i0*dstnb0, dstnb0 });
slices.push_back({ s_off + i0*nb0, d_off + i0*dstnb0, dstnb0 });
}
}
}
@@ -8741,6 +8768,68 @@ static void ggml_vk_mul_mat_vec_nc_f16_f32(ggml_backend_vk_context * ctx, vk_con
}, pc, { (uint32_t)ne03, (uint32_t)ne01, (uint32_t)ne12 });
}
static int ggml_vk_fwht_pipeline_idx(int64_t n) {
switch (n) {
case 64: return 0;
case 128: return 1;
case 256: return 2;
case 512: return 3;
default: return -1;
}
}
static bool ggml_vk_can_use_fwht(const ggml_backend_vk_context * ctx, const ggml_tensor * src1, const ggml_tensor * dst) {
if (ctx->num_additional_fused_ops != 0) {
return false;
}
if (ggml_get_op_params_i32(dst, 1) != GGML_HINT_SRC0_IS_HADAMARD) {
return false;
}
const int idx = ggml_vk_fwht_pipeline_idx(src1->ne[0]);
if (idx < 0 || ctx->device->pipeline_fwht_f32[idx] == nullptr) {
return false;
}
if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
return false;
}
if (!ggml_is_contiguous(src1)) {
return false;
}
GGML_ASSERT(ggml_is_contiguous(dst));
return true;
}
static void ggml_vk_fwht(ggml_backend_vk_context * ctx, vk_context& subctx, const ggml_tensor * src, ggml_tensor * dst) {
const int idx = ggml_vk_fwht_pipeline_idx(src->ne[0]);
vk_pipeline pipeline = ctx->device->pipeline_fwht_f32[idx];
const uint32_t rows_per_workgroup = 4;
const uint32_t n_rows = (uint32_t)ggml_nrows(src);
const uint32_t max_workgroups_x = ctx->device->properties.limits.maxComputeWorkGroupCount[0];
const uint32_t total_workgroups = CEIL_DIV(n_rows, rows_per_workgroup);
const uint32_t workgroups_x = std::min(total_workgroups, max_workgroups_x);
ggml_pipeline_request_descriptor_sets(ctx, pipeline, 1);
const vk_subbuffer src_buf = ggml_vk_tensor_subbuffer(ctx, src, true);
const vk_subbuffer dst_buf = ggml_vk_tensor_subbuffer(ctx, dst, true);
vk_op_fwht_push_constants pc = {
n_rows,
0,
0,
1.0f / std::sqrt((float)src->ne[0]),
};
init_pushconst_tensor_offsets(ctx, pc, src, nullptr, nullptr, nullptr, dst);
ggml_vk_dispatch_pipeline(ctx, subctx, pipeline, { src_buf, dst_buf }, pc, { workgroups_x, 1, 1 });
}
static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, const struct ggml_cgraph * cgraph, int node_idx) {
ggml_tensor * dst = cgraph->nodes[node_idx];
ggml_tensor * src0 = dst->src[0];
@@ -8774,6 +8863,8 @@ static void ggml_vk_mul_mat(ggml_backend_vk_context * ctx, vk_context& subctx, c
m_offset += cur_M_size;
}
} else if (ggml_vk_can_use_fwht(ctx, src1, dst)) {
ggml_vk_fwht(ctx, subctx, src1, dst);
} else if (src0->type == GGML_TYPE_F16 && ggml_is_permuted(src0) && ggml_is_permuted(src1) && dst->ne[1] == 1 &&
// detect 0213 permutation, and batch size of 1
src0->nb[0] <= src0->nb[2] &&

View File

@@ -0,0 +1,69 @@
#version 450
#extension GL_EXT_control_flow_attributes : require
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_shuffle : enable
layout(local_size_x_id = 0, local_size_y = 4, local_size_z = 1) in;
layout(constant_id = 0) const uint WARP_SIZE = 32;
layout(constant_id = 1) const uint N = 128;
layout(push_constant) uniform parameter
{
uint n_rows;
uint src_offset;
uint dst_offset;
float scale;
};
layout(binding = 0, std430) readonly buffer A { float data_a[]; };
layout(binding = 1, std430) writeonly buffer D { float data_d[]; };
const uint EL_W = N / WARP_SIZE;
void main() {
const uint lane = gl_SubgroupInvocationID;
for (uint row = gl_WorkGroupID.x * gl_WorkGroupSize.y + gl_SubgroupID;
row < n_rows;
row += gl_NumWorkGroups.x * gl_WorkGroupSize.y) {
const uint row_offset = row * N;
float reg[EL_W];
[[unroll]]
for (uint i = 0; i < EL_W; ++i) {
reg[i] = data_a[src_offset + row_offset + i * WARP_SIZE + lane] * scale;
}
[[unroll]]
for (uint h = 1; h < WARP_SIZE; h <<= 1) {
[[unroll]]
for (uint j = 0; j < EL_W; ++j) {
const float val = reg[j];
const float val2 = subgroupShuffleXor(val, h);
reg[j] = (lane & h) == 0 ? val + val2 : val2 - val;
}
}
[[unroll]]
for (uint h = WARP_SIZE; h < N; h <<= 1) {
const uint step = h / WARP_SIZE;
[[unroll]]
for (uint j = 0; j < EL_W; j += 2 * step) {
[[unroll]]
for (uint k = 0; k < step; ++k) {
const float x = reg[j + k];
const float y = reg[j + k + step];
reg[j + k] = x + y;
reg[j + k + step] = x - y;
}
}
}
[[unroll]]
for (uint i = 0; i < EL_W; ++i) {
data_d[dst_offset + row_offset + i * WARP_SIZE + lane] = reg[i];
}
}
}

View File

@@ -934,6 +934,7 @@ void process_shaders() {
string_to_spv("argmax_f32", "argmax.comp", merge_maps(base_dict, {{"A_TYPE", "float"}, {"D_TYPE", "int"}}));
string_to_spv("sum_rows_f32", "sum_rows.comp", merge_maps(base_dict, {{"A_TYPE", "float"}, {"D_TYPE", "float"}}));
string_to_spv("fwht_f32", "fwht.comp", {});
string_to_spv("count_equal_i32", "count_equal.comp", merge_maps(base_dict, {{"A_TYPE", "int"}, {"B_TYPE", "int"}, {"D_TYPE", "int"}}));
string_to_spv("cumsum_f32", "cumsum.comp", merge_maps(base_dict, {{"A_TYPE", "float"}, {"D_TYPE", "float"}}));
string_to_spv("cumsum_multipass1_f32", "cumsum_multipass1.comp", merge_maps(base_dict, {{"A_TYPE", "float"}, {"D_TYPE", "float"}}));

View File

@@ -94,14 +94,6 @@ static inline uint32_t ggml_webgpu_u32_from_f32(float value) {
#define WEBGPU_SET_ROWS_ERROR_BUF_SIZE_BYTES 4
#define WEBGPU_STORAGE_BUF_BINDING_MULT 4 // a storage buffer binding size must be a multiple of 4
// For operations which process a row in parallel, this seems like a reasonable
// default
#define WEBGPU_ROW_SPLIT_WG_SIZE 64
// Track https://github.com/gpuweb/gpuweb/issues/5315 for fixes to
// implementations so this can be removed, necessary only for get_rows right now
#define WEBGPU_MAX_WG_SIZE 288
/* End Constants */
// This is a "fake" base pointer, since WebGPU buffers do not have pointers to
@@ -631,7 +623,7 @@ static void ggml_backend_webgpu_buffer_memset(webgpu_global_context & ctx,
size_t size) {
std::vector<uint32_t> params = { (uint32_t) offset, (uint32_t) size, value };
std::vector<wgpu::BindGroupEntry> entries = { ggml_webgpu_make_bind_group_entry(0, buf, 0, buf.GetSize()) };
size_t bytes_per_wg = WEBGPU_MAX_WG_SIZE * ctx->capabilities.memset_bytes_per_thread;
size_t bytes_per_wg = ctx->capabilities.limits.maxComputeInvocationsPerWorkgroup * ctx->capabilities.memset_bytes_per_thread;
uint32_t wg_x = CEIL_DIV(size + 3, bytes_per_wg);
ctx->queue.WriteBuffer(ctx->memset_params_buf, 0, params.data(), params.size() * sizeof(uint32_t));
@@ -749,8 +741,11 @@ static webgpu_encoded_op ggml_webgpu_cpy(webgpu_context & ctx, ggml_tensor * src
ggml_webgpu_make_tensor_bind_group_entry(ctx, 1, dst),
};
uint32_t wg_x = CEIL_DIV(ne, decisions->wg_size);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x);
uint32_t wg_x;
uint32_t wg_y;
uint32_t total_wg = CEIL_DIV(ne, decisions->wg_size);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
static webgpu_encoded_op ggml_webgpu_set(webgpu_context & ctx,
@@ -974,9 +969,10 @@ static webgpu_encoded_op ggml_webgpu_conv_2d(webgpu_context & ctx,
auto * decisions = static_cast<ggml_webgpu_generic_shader_decisions *>(pipeline.context.get());
uint32_t wg_x;
uint32_t wg_y;
uint32_t total_wg = CEIL_DIV((uint32_t) ggml_nelements(dst), decisions->wg_size);
uint32_t wg_x = std::min(ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, total_wg);
uint32_t wg_y = CEIL_DIV(total_wg, wg_x);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
@@ -1064,9 +1060,10 @@ static webgpu_encoded_op ggml_webgpu_im2col(webgpu_context & ctx,
auto * decisions = static_cast<ggml_webgpu_generic_shader_decisions *>(pipeline.context.get());
uint32_t wg_x;
uint32_t wg_y;
uint32_t total_wg = CEIL_DIV((uint32_t) ggml_nelements(dst), decisions->wg_size);
uint32_t wg_x = std::min(ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, total_wg);
uint32_t wg_y = CEIL_DIV(total_wg, wg_x);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
@@ -1361,7 +1358,7 @@ static webgpu_encoded_op ggml_webgpu_get_rows(webgpu_context & ctx,
shader_lib_ctx.src0 = src;
shader_lib_ctx.src1 = nullptr;
shader_lib_ctx.dst = dst;
shader_lib_ctx.max_wg_size = WEBGPU_MAX_WG_SIZE;
shader_lib_ctx.max_wg_size = ctx->global_ctx->capabilities.limits.maxComputeInvocationsPerWorkgroup;
webgpu_pipeline pipeline = ctx->shader_lib->get_get_rows_pipeline(shader_lib_ctx);
auto * decisions = static_cast<ggml_webgpu_generic_shader_decisions *>(pipeline.context.get());
@@ -1689,14 +1686,11 @@ static webgpu_encoded_op ggml_webgpu_mul_mat_id(webgpu_context & ctx,
gathered_count_ids_binding_size),
};
const uint32_t max_wg_per_dim = ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
const uint32_t gather_total_wg = param_n_expert;
const uint32_t gather_wg_x = std::min(gather_total_wg, max_wg_per_dim);
const uint32_t gather_wg_y = CEIL_DIV(gather_total_wg, gather_wg_x);
// n_expert is much less than maxComputeWorkgroupsPerDimension (e.g., n_exeprt=256 at Qwen3.5-35B-A3B)
const uint32_t gather_wg_x = param_n_expert;
dispatches.push_back({
gather_pipeline, std::move(gather_params), std::move(gather_entries), { gather_wg_x, gather_wg_y }
gather_pipeline, std::move(gather_params), std::move(gather_entries), { gather_wg_x, 1 }
});
// params for mul_mat_id.wgsl
@@ -1748,7 +1742,7 @@ static webgpu_encoded_op ggml_webgpu_mul_mat_id(webgpu_context & ctx,
uint32_t max_wg_n = CEIL_DIV(total_gathered, tile_n_s) + max_active_experts;
uint32_t total_wg = wg_m * max_wg_n;
compute_2d_workgroups(total_wg, max_wg_per_dim, wg_x, wg_y);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
dispatches.push_back({
main_pipeline, std::move(main_params), std::move(main_entries), { wg_x, wg_y }
@@ -2771,10 +2765,12 @@ static webgpu_encoded_op ggml_webgpu_argsort(webgpu_context & ctx, ggml_tensor *
block_size, npr, nrows
};
const uint32_t total_wg_init = npr * nrows;
const uint32_t max_wg = ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
const uint32_t wg_x_init = std::min(total_wg_init, max_wg);
const uint32_t wg_y_init = CEIL_DIV(total_wg_init, wg_x_init);
uint32_t wg_x_init;
uint32_t wg_y_init;
const uint32_t total_wg_init = npr * nrows;
const uint32_t max_wg_per_dim = ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
compute_2d_workgroups(total_wg_init, max_wg_per_dim, wg_x_init, wg_y_init);
std::vector<wgpu::BindGroupEntry> init_entries = {
ggml_webgpu_make_tensor_bind_group_entry(ctx, 0, src),
ggml_webgpu_make_bind_group_entry(1, ggml_webgpu_tensor_buf(dst), init_align_offset, init_binding_size)
@@ -2831,9 +2827,11 @@ static webgpu_encoded_op ggml_webgpu_argsort(webgpu_context & ctx, ggml_tensor *
ggml_webgpu_make_bind_group_entry(2, ggml_webgpu_tensor_buf(dst), align_out, size_out)
};
uint32_t wg_x_merge;
uint32_t wg_y_merge;
const uint32_t total_wg_merge = nm * nrows;
const uint32_t wg_x_merge = std::min(total_wg_merge, max_wg);
const uint32_t wg_y_merge = CEIL_DIV(total_wg_merge, wg_x_merge);
compute_2d_workgroups(total_wg_merge, max_wg_per_dim, wg_x_merge, wg_y_merge);
dispatches.push_back({
argsort_merge_pipeline, std::move(merge_params), std::move(merge_entries), { wg_x_merge, wg_y_merge }
});
@@ -2953,9 +2951,12 @@ static webgpu_encoded_op ggml_webgpu_upscale(webgpu_context ctx, ggml_tensor * s
webgpu_pipeline pipeline = ctx->shader_lib->get_upscale_pipeline(shader_lib_ctx);
auto * decisions = static_cast<ggml_webgpu_generic_shader_decisions *>(pipeline.context.get());
uint32_t total_wg = CEIL_DIV((uint32_t) ggml_nelements(dst), decisions->wg_size);
uint32_t wg_x = std::min(ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, total_wg);
uint32_t wg_y = CEIL_DIV(total_wg, wg_x);
uint32_t wg_x;
uint32_t wg_y;
uint32_t total_wg = CEIL_DIV((uint32_t) ggml_nelements(dst), decisions->wg_size);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
@@ -3707,13 +3708,13 @@ static ggml_guid_t ggml_backend_webgpu_guid(void) {
static void ggml_webgpu_init_memset_pipeline(webgpu_global_context & ctx) {
// we use the maximum workgroup size for the memset pipeline
size_t max_threads = WEBGPU_MAX_WG_SIZE * ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
size_t max_threads = ctx->capabilities.limits.maxComputeInvocationsPerWorkgroup * ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
// Size the bytes_per_thread so that the largest buffer size can be handled
ctx->capabilities.memset_bytes_per_thread =
CEIL_DIV(ctx->capabilities.limits.maxStorageBufferBindingSize, max_threads);
std::vector<wgpu::ConstantEntry> constants(2);
constants[0].key = "wg_size";
constants[0].value = WEBGPU_MAX_WG_SIZE;
constants[0].value = ctx->capabilities.limits.maxComputeInvocationsPerWorkgroup;
constants[1].key = "bytes_per_thread";
constants[1].value = ctx->capabilities.memset_bytes_per_thread;
ctx->memset_pipeline = ggml_webgpu_create_pipeline(ctx->device, wgsl_memset, "memset", constants);

View File

@@ -49,12 +49,14 @@ struct Params{
var<uniform> params: Params;
@compute @workgroup_size(WG_SIZE)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
if (gid.x >= params.ne) {
fn main(
@builtin(global_invocation_index) gindex: u32,
) {
if (gindex >= params.ne) {
return;
}
var i = gid.x;
var i = gindex;
let i3 = i / (params.src_ne2 * params.src_ne1 * params.src_ne0);
i = i % (params.src_ne2 * params.src_ne1 * params.src_ne0);
let i2 = i / (params.src_ne1 * params.src_ne0);
@@ -62,7 +64,7 @@ fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
let i1 = i / params.src_ne0;
let i0 = i % params.src_ne0;
var j = gid.x;
var j = gindex;
let j3 = j / (params.dst_ne2 * params.dst_ne1 * params.dst_ne0);
j = j % (params.dst_ne2 * params.dst_ne1 * params.dst_ne0);
let j2 = j / (params.dst_ne1 * params.dst_ne0);

View File

@@ -21,35 +21,32 @@ var<workgroup> count:atomic<u32>;
@compute @workgroup_size(WG_SIZE)
fn main(@builtin(workgroup_id) wg_id: vec3<u32>,
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(num_workgroups) num_wg: vec3<u32>) {
@builtin(local_invocation_id) local_id: vec3<u32>) {
let thread_id = local_id.x;
let own_expert = wg_id.y * num_wg.x + wg_id.x; // the expert assigned to this workgroup
let own_expert = wg_id.x; // the expert assigned to this workgroup
if (own_expert < params.n_expert) {
if (thread_id == 0u) {
atomicStore(&count, 0);
}
if (thread_id == 0u) {
atomicStore(&count, 0);
}
workgroupBarrier();
workgroupBarrier();
for (var i = thread_id;i < params.n_expert_used * params.n_tokens;i += WG_SIZE) {
let row = i / params.n_expert_used;
let col = i % params.n_expert_used;
let expert = u32(ids[params.offset_ids + row * params.stride_ids_1 + col]);
if (own_expert == expert) {
let pos = atomicAdd(&count, 1u);
let gathered_id = own_expert * params.n_tokens + pos;
global_gathered_expert_used[gathered_id] = col;
global_gathered_tokens[gathered_id] = row;
}
}
workgroupBarrier();
if (thread_id == 0u) {
gathered_count_ids[own_expert] = atomicLoad(&count);
for (var i = thread_id;i < params.n_expert_used * params.n_tokens;i += WG_SIZE) {
let row = i / params.n_expert_used;
let col = i % params.n_expert_used;
let expert = u32(ids[params.offset_ids + row * params.stride_ids_1 + col]);
if (own_expert == expert) {
let pos = atomicAdd(&count, 1u);
let gathered_id = own_expert * params.n_tokens + pos;
global_gathered_expert_used[gathered_id] = col;
global_gathered_tokens[gathered_id] = row;
}
}
workgroupBarrier();
if (thread_id == 0u) {
gathered_count_ids[own_expert] = atomicLoad(&count);
}
}

View File

@@ -0,0 +1,114 @@
{%- set tools_system_message_prefix = 'You are a helpful assistant with access to the following tools. You may call one or more tools to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>' %}
{%- set tools_system_message_suffix = '\n</tools>\n\nFor each tool call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>. If a tool does not exist in the provided list of tools, notify the user that you do not have the ability to fulfill the request.' %}
{%- set documents_system_message_prefix = 'You are a helpful assistant with access to the following documents. You may use one or more documents to assist with the user query.\n\nYou are given a list of documents within <documents></documents> XML tags:\n<documents>' %}
{%- set documents_system_message_suffix = '\n</documents>\n\nWrite the response to the user\'s input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.' %}
{%- if available_tools is defined and available_tools %}
{%- set tools = available_tools %}
{%- endif %}
{%- set ns = namespace(tools_system_message=tools_system_message_prefix,
documents_system_message=documents_system_message_prefix,
system_message=''
) %}
{%- if tools %}
{%- for tool in tools %}
{%- set ns.tools_system_message = ns.tools_system_message + '\n' + (tool | tojson) %}
{%- endfor %}
{%- set ns.tools_system_message = ns.tools_system_message + tools_system_message_suffix %}
{%- else %}
{%- set ns.tools_system_message = '' %}
{%- endif %}
{%- if documents %}
{%- for document in documents %}
{%- set ns.documents_system_message = ns.documents_system_message + '\n' + (document | tojson) %}
{%- endfor %}
{%- set ns.documents_system_message = ns.documents_system_message + documents_system_message_suffix %}
{%- else %}
{%- set ns.documents_system_message = '' %}
{%- endif %}
{%- if messages[0].role == 'system' %}
{%- if messages[0].content is string %}
{%- set ns.system_message = messages[0].content %}
{%- elif messages[0].content is iterable %}
{%- for entry in messages[0].content %}
{%- if entry.type== 'text' %}
{%- if ns.system_message != '' %}
{%- set ns.system_message = ns.system_message + '\n' %}
{%- endif %}
{%- set ns.system_message = ns.system_message + entry.text %}
{%- endif %}
{%- endfor %}
{%- endif %}
{%- if tools and documents %}
{%- set ns.system_message = ns.system_message + '\n\n' + ns.tools_system_message + '\n\n' + ns.documents_system_message %}
{%- elif tools %}
{%- set ns.system_message = ns.system_message + '\n\n' + ns.tools_system_message %}
{%- elif documents %}
{%- set ns.system_message = ns.system_message + '\n\n' + ns.documents_system_message %}
{%- endif %}
{%- else %}
{%- if tools and documents %}
{%- set ns.system_message = ns.tools_system_message + '\n\n' + ns.documents_system_message %}
{%- elif tools %}
{%- set ns.system_message = ns.tools_system_message %}
{%- elif documents %}
{%- set ns.system_message = ns.documents_system_message %}
{%- endif %}
{%- endif %}
{%- if ns.system_message %}
{{- '<|start_of_role|>system<|end_of_role|>' + ns.system_message + '<|end_of_text|>\n' }}
{%- endif %}
{%- for message in messages %}
{%- set content = namespace(val='') %}
{%- if message.content is string %}
{%- set content.val = message.content %}
{%- else %}
{%- if message.content is iterable %}
{%- for entry in message.content %}
{%- if entry.type== 'text' %}
{%- if content.val != '' %}
{%- set content.val = content.val + '\n' %}
{%- endif %}
{%- set content.val = content.val + entry.text %}
{%- endif %}
{%- endfor %}
{%- endif %}
{%- endif %}
{%- if (message.role == 'user') or (message.role == 'system' and not loop.first) %}
{{- '<|start_of_role|>' + message.role + '<|end_of_role|>' + content.val + '<|end_of_text|>\n' }}
{%- elif message.role == 'assistant' %}
{{- '<|start_of_role|>' + message.role + '<|end_of_role|>' + content.val }}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content.val) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|end_of_text|>\n' }}
{%- elif message.role == 'tool' %}
{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') %}
{{- '<|start_of_role|>user<|end_of_role|>' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content.val }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') %}
{{- '<|end_of_text|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|start_of_role|>assistant<|end_of_role|>' }}
{%- endif %}

View File

@@ -51,6 +51,9 @@ opbatch=
opqueue=
[ "$OQ" != "" ] && opqueue="GGML_HEXAGON_OPQUEUE=$OQ"
oppoll=
[ "$OP" != "" ] && oppoll="GGML_HEXAGON_OPPOLL=$OP"
opflt=
[ "$OF" != "" ] && opflt="GGML_HEXAGON_OPFILTER=$OF"
@@ -66,7 +69,7 @@ adb $adbserial $adbhost shell " \
cd $basedir; ulimit -c unlimited; \
LD_LIBRARY_PATH=$basedir/$branch/lib \
ADSP_LIBRARY_PATH=$basedir/$branch/lib \
$verbose $sched $opmask $profile $nhvx $hmx $ndev $hb $opbatch $opqueue $opflt $vmem $mbuf \
$verbose $sched $opmask $profile $nhvx $hmx $ndev $hb $opbatch $opqueue $oppoll $opflt $vmem $mbuf \
./$branch/bin/llama-completion --no-mmap -m $basedir/../gguf/$model \
--poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
--ctx-size 8192 --ubatch-size 1024 -fa on \

View File

@@ -42,6 +42,15 @@ ndev=
hb=
[ "$HB" != "" ] && hb="GGML_HEXAGON_HOSTBUF=$HB"
opbatch=
[ "$OB" != "" ] && opbatch="GGML_HEXAGON_OPBATCH=$OB"
opqueue=
[ "$OQ" != "" ] && opqueue="GGML_HEXAGON_OPQUEUE=$OQ"
oppoll=
[ "$OP" != "" ] && oppoll="GGML_HEXAGON_OPPOLL=$OP"
set -x
tool=$1; shift
@@ -50,5 +59,5 @@ adb $adbserial $adbhost shell " \
cd $basedir; ulimit -c unlimited; \
LD_LIBRARY_PATH=$basedir/$branch/lib \
ADSP_LIBRARY_PATH=$basedir/$branch/lib \
$verbose $sched $opmask $profile $nhvx $hmx $ndev $hb ./$branch/bin/$tool $@ \
$verbose $sched $opmask $profile $nhvx $hmx $ndev $hb $opbatch $opqueue $oppoll ./$branch/bin/$tool $@ \
"

View File

@@ -62,6 +62,7 @@ static const std::map<std::string, llm_chat_template> LLM_CHAT_TEMPLATES = {
{ "rwkv-world", LLM_CHAT_TEMPLATE_RWKV_WORLD },
{ "granite", LLM_CHAT_TEMPLATE_GRANITE_3_X },
{ "granite-4.0", LLM_CHAT_TEMPLATE_GRANITE_4_0 },
{ "granite-4.1", LLM_CHAT_TEMPLATE_GRANITE_4_1 },
{ "gigachat", LLM_CHAT_TEMPLATE_GIGACHAT },
{ "megrez", LLM_CHAT_TEMPLATE_MEGREZ },
{ "yandex", LLM_CHAT_TEMPLATE_YANDEX },
@@ -194,7 +195,10 @@ llm_chat_template llm_chat_detect_template(const std::string & tmpl) {
return LLM_CHAT_TEMPLATE_RWKV_WORLD;
} else if (tmpl_contains("<|start_of_role|>")) {
if (tmpl_contains("<tool_call>") || tmpl_contains("<tools>")) {
return LLM_CHAT_TEMPLATE_GRANITE_4_0;
if (tmpl_contains("g4_default_system_message")) {
return LLM_CHAT_TEMPLATE_GRANITE_4_0;
}
return LLM_CHAT_TEMPLATE_GRANITE_4_1;
}
return LLM_CHAT_TEMPLATE_GRANITE_3_X;
} else if (tmpl_contains("message['role'] + additional_special_tokens[0] + message['content'] + additional_special_tokens[1]")) {
@@ -651,6 +655,20 @@ int32_t llm_chat_apply_template(
if (add_ass) {
ss << "<|start_of_role|>assistant<|end_of_role|>";
}
} else if (tmpl == LLM_CHAT_TEMPLATE_GRANITE_4_1) {
// IBM Granite 4.1 template
for (const auto & message : chat) {
std::string role(message->role);
if (role == "assistant_tool_call") {
ss << "<|start_of_role|>assistant<|end_of_role|><|tool_call|>";
} else {
ss << "<|start_of_role|>" << role << "<|end_of_role|>";
}
ss << message->content << "<|end_of_text|>\n";
}
if (add_ass) {
ss << "<|start_of_role|>assistant<|end_of_role|>";
}
} else if (tmpl == LLM_CHAT_TEMPLATE_GIGACHAT) {
// GigaChat template
bool has_system = !chat.empty() && std::string(chat[0]->role) == "system";

View File

@@ -41,6 +41,7 @@ enum llm_chat_template {
LLM_CHAT_TEMPLATE_RWKV_WORLD,
LLM_CHAT_TEMPLATE_GRANITE_3_X,
LLM_CHAT_TEMPLATE_GRANITE_4_0,
LLM_CHAT_TEMPLATE_GRANITE_4_1,
LLM_CHAT_TEMPLATE_GIGACHAT,
LLM_CHAT_TEMPLATE_MEGREZ,
LLM_CHAT_TEMPLATE_YANDEX,

View File

@@ -8318,6 +8318,7 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
test_cases.emplace_back(new test_mul_mat_hadamard(GGML_TYPE_F32, GGML_TYPE_F32, 128, 1, 128));
test_cases.emplace_back(new test_mul_mat_hadamard(GGML_TYPE_F32, GGML_TYPE_F32, 64, 1, 64));
test_cases.emplace_back(new test_mul_mat_hadamard(GGML_TYPE_F32, GGML_TYPE_F32, 256, 1, 256));
test_cases.emplace_back(new test_mul_mat_hadamard(GGML_TYPE_F32, GGML_TYPE_F32, 512, 1, 512));
test_cases.emplace_back(new test_mul_mat_hadamard(GGML_TYPE_F32, GGML_TYPE_F32, 128, 32, 128));
test_cases.emplace_back(new test_mul_mat_hadamard(GGML_TYPE_F32, GGML_TYPE_F32, 128, 4, 128, {2, 3}));

View File

@@ -618,6 +618,16 @@ int main_automated_tests(void) {
},
{
/* .name= */ "ibm-granite/granite-4.0 (tool call)",
/* .template_str= */ "{%- for message in messages %}\n {%- if message['role'] == 'assistant_tool_call' %}\n {{- '<|start_of_role|>assistant<|end_of_role|><|tool_call|>' + message['content'] + '<|end_of_text|>\\n' }}\n {%- else %}\n {{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' + message['content'] + '<|end_of_text|>\\n' }}\n {%- endif %}\n {%- if loop.last and add_generation_prompt %}\n {{- '<|start_of_role|>assistant<|end_of_role|>' }}\n {%- endif %}\n{%- endfor %}\n{# <tool_call> <tools> g4_default_system_message #}",
/* .expected_output= */ "<|start_of_role|>system<|end_of_role|>You are a helpful assistant<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>Hi there<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>Who are you<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|> I am an assistant <|end_of_text|>\n<|start_of_role|>user<|end_of_role|>Another question<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>What is the weather?<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|><|tool_call|><tool_call>\n{\"name\": \"get_weather\", \"arguments\": {\"location\": \"NYC\"}}\n</tool_call><|end_of_text|>\n<|start_of_role|>tool_response<|end_of_role|>{\"temperature\": 72}<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>",
/* .expected_output_jinja= */ "",
/* .bos_token= */ "",
/* .eos_token= */ "",
/* .supported_with_jinja= */ true,
/* .extra_conversation= */ {{"user", "What is the weather?"}, {"assistant_tool_call", "<tool_call>\n{\"name\": \"get_weather\", \"arguments\": {\"location\": \"NYC\"}}\n</tool_call>"}, {"tool_response", "{\"temperature\": 72}"}},
},
{
/* .name= */ "ibm-granite/granite-4.1 (tool call)",
/* .template_str= */ "{%- for message in messages %}\n {%- if message['role'] == 'assistant_tool_call' %}\n {{- '<|start_of_role|>assistant<|end_of_role|><|tool_call|>' + message['content'] + '<|end_of_text|>\\n' }}\n {%- else %}\n {{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' + message['content'] + '<|end_of_text|>\\n' }}\n {%- endif %}\n {%- if loop.last and add_generation_prompt %}\n {{- '<|start_of_role|>assistant<|end_of_role|>' }}\n {%- endif %}\n{%- endfor %}\n{# <tool_call> <tools> #}",
/* .expected_output= */ "<|start_of_role|>system<|end_of_role|>You are a helpful assistant<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>Hi there<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>Who are you<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|> I am an assistant <|end_of_text|>\n<|start_of_role|>user<|end_of_role|>Another question<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>What is the weather?<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|><|tool_call|><tool_call>\n{\"name\": \"get_weather\", \"arguments\": {\"location\": \"NYC\"}}\n</tool_call><|end_of_text|>\n<|start_of_role|>tool_response<|end_of_role|>{\"temperature\": 72}<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>",
/* .expected_output_jinja= */ "",

View File

@@ -2914,6 +2914,21 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
.run();
}
{
// IBM Granite 4.1 (same format as 4.0)
auto tst = peg_tester("models/templates/ibm-granite-granite-4.1.jinja", detailed_debug);
tst.test("Hello, world!\nWhat's up?").expect(message_assist).run();
tst.test(
"<tool_call>\n"
"{\"name\": \"special_function\", \"arguments\": {\"arg1\": 1}}\n"
"</tool_call>")
.tools({ special_function_tool })
.expect(message_assist_call)
.run();
}
{
// ByteDance-Seed-OSS (reasoning and tool calling model)
auto tst = peg_tester("models/templates/ByteDance-Seed-OSS.jinja", detailed_debug);

View File

@@ -12,6 +12,7 @@
#include "../src/llama-model-saver.h"
#include <cinttypes>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <cstdint>
@@ -497,6 +498,7 @@ static int test_backends(const llm_arch target_arch, const size_t seed, const gg
};
std::vector<device_config> dev_configs;
size_t max_device_label_length = 4;
{
std::vector<ggml_backend_dev_t> devices_meta;
{
@@ -504,6 +506,7 @@ static int test_backends(const llm_arch target_arch, const size_t seed, const gg
for (size_t i = 0; i < device_count; i++) {
ggml_backend_dev_t dev = ggml_backend_dev_get(i);
dev_configs.emplace_back(std::vector<ggml_backend_dev_t>{dev}, ggml_backend_dev_description(dev), LLAMA_SPLIT_MODE_LAYER);
max_device_label_length = std::max(max_device_label_length, dev_configs.back().label.length());
// cpu-based devices cannot be used in tensor split mode
if (ggml_backend_dev_buffer_type(dev) != ggml_backend_cpu_buffer_type()) {
@@ -515,10 +518,26 @@ static int test_backends(const llm_arch target_arch, const size_t seed, const gg
dev_configs.emplace_back(devices_meta, "Meta", LLAMA_SPLIT_MODE_TENSOR);
}
size_t max_arch_name_length = 0;
for (const llm_arch & arch : llm_arch_all()) {
max_arch_name_length = std::max(max_arch_name_length, strlen(llm_arch_name(arch)));
}
const std::string template_header = std::string("|%" + std::to_string(max_arch_name_length) + "s|%") + std::to_string(max_device_label_length) + "s|%6s|%15s|%9s|\n";
const std::string template_row = std::string("|%" + std::to_string(max_arch_name_length) + "s|%") + std::to_string(max_device_label_length) + "s|%6s|%15s %10s|%20s|\n";
bool all_ok = true;
common_log_flush(common_log_main());
printf("|%16s|%30s|%6s|%15s|%9s|\n", "Model arch.", "Device", "Config", "NMSE vs. CPU", "Roundtrip");
printf("|----------------|------------------------------|------|---------------|---------|\n");
printf(template_header.c_str(), "Model arch.", "Device", "Config", "NMSE vs. CPU", "Roundtrip");
printf("|");
for (size_t i = 0; i < max_arch_name_length; i++) {
printf("-");
}
printf("|");
for (size_t i = 0; i < max_device_label_length; i++) {
printf("-");
}
printf("|------|---------------|---------|\n");
for (const llm_arch & arch : llm_arch_all()) {
if (arch == LLM_ARCH_UNKNOWN) {
continue;
@@ -595,7 +614,7 @@ static int test_backends(const llm_arch target_arch, const size_t seed, const gg
}
}
printf("|%16s|%30s|%6s|%15s %10s|%20s|\n", llm_arch_name(arch), dc.label.c_str(),
printf(template_row.c_str(), llm_arch_name(arch), dc.label.c_str(),
config_name.c_str(), status_nmse.c_str(), nmse_str, status_roundtrip.c_str());
}
}

View File

@@ -923,7 +923,7 @@ static void hellaswag_score(llama_context * ctx, const common_params & params) {
}
if (i0 == i1) {
LOG_ERR("%s : task %zu does not fit in the context window (requires %lu tokens)\n", __func__, i0, hs_data[i0].required_tokens);
LOG_ERR("%s : task %zu does not fit in the context window (requires %zu tokens)\n", __func__, i0, hs_data[i0].required_tokens);
return;
}
@@ -1216,7 +1216,7 @@ static void winogrande_score(llama_context * ctx, const common_params & params)
}
if (i0 == i1) {
LOG_ERR("%s : task %zu does not fit in the context window (requires %lu tokens)\n", __func__, i0, data[i0].required_tokens);
LOG_ERR("%s : task %zu does not fit in the context window (requires %zu tokens)\n", __func__, i0, data[i0].required_tokens);
return;
}
@@ -1595,7 +1595,7 @@ static void multiple_choice_score(llama_context * ctx, const common_params & par
}
if (i0 == i1) {
LOG_ERR("%s : task %zu does not fit in the context window (requires %lu tokens)\n", __func__, i0, tasks[i0].required_tokens);
LOG_ERR("%s : task %zu does not fit in the context window (requires %zu tokens)\n", __func__, i0, tasks[i0].required_tokens);
return;
}

View File

@@ -201,7 +201,7 @@ For the full list of features, please refer to [server's changelog](https://gith
| `--embedding, --embeddings` | restrict to only support embedding use case; use only with dedicated embedding models (default: disabled)<br/>(env: LLAMA_ARG_EMBEDDINGS) |
| `--rerank, --reranking` | enable reranking endpoint on server (default: disabled)<br/>(env: LLAMA_ARG_RERANKING) |
| `--api-key KEY` | API key to use for authentication, multiple keys can be provided as a comma-separated list (default: none)<br/>(env: LLAMA_API_KEY) |
| `--api-key-file FNAME` | path to file containing API keys (default: none) |
| `--api-key-file FNAME` | path to file containing API keys, one per line (default: none)<br/>(env: LLAMA_ARG_API_KEY_FILE) |
| `--ssl-key-file FNAME` | path to file a PEM-encoded SSL private key<br/>(env: LLAMA_ARG_SSL_KEY_FILE) |
| `--ssl-cert-file FNAME` | path to file a PEM-encoded SSL certificate<br/>(env: LLAMA_ARG_SSL_CERT_FILE) |
| `--chat-template-kwargs STRING` | sets additional params for the json template parser, must be a valid json object string, e.g. '{"key1":"value1","key2":"value2"}'<br/>(env: LLAMA_ARG_CHAT_TEMPLATE_KWARGS) |

View File

@@ -5,9 +5,9 @@
#include <cpp-httplib/httplib.h>
#include <cstdlib>
#include <functional>
#include <future>
#include <memory>
#include <string>
#include <thread>
@@ -21,7 +21,7 @@ public:
};
server_http_context::server_http_context()
: pimpl(std::make_unique<server_http_context::Impl>())
: pimpl(std::make_unique<Impl>())
{}
server_http_context::~server_http_context() = default;
@@ -62,7 +62,7 @@ struct gcp_params {
}
static std::string getenv(const char * name, const std::string & default_value, bool ensure_leading_slash = false) {
const char * value = std::getenv(name);
const auto * value = std::getenv(name);
if (value == nullptr || value[0] == '\0') {
return default_value;
}
@@ -94,15 +94,15 @@ bool server_http_context::init(const common_params & params) {
auto & srv = pimpl->srv;
#ifdef CPPHTTPLIB_OPENSSL_SUPPORT
if (params.ssl_file_key != "" && params.ssl_file_cert != "") {
if (!params.ssl_file_key.empty() && !params.ssl_file_cert.empty()) {
SRV_INF("running with SSL: key = %s, cert = %s\n", params.ssl_file_key.c_str(), params.ssl_file_cert.c_str());
srv.reset(
new httplib::SSLServer(params.ssl_file_cert.c_str(), params.ssl_file_key.c_str())
srv = std::make_unique<httplib::SSLServer>(
params.ssl_file_cert.c_str(), params.ssl_file_key.c_str()
);
is_ssl = true;
} else {
SRV_INF("%s", "running without SSL\n");
srv.reset(new httplib::Server());
srv = std::make_unique<httplib::Server>();
}
#else
if (params.ssl_file_key != "" && params.ssl_file_cert != "") {
@@ -150,7 +150,7 @@ bool server_http_context::init(const common_params & params) {
// set timeouts and change hostname and port
srv->set_read_timeout (params.timeout_read);
srv->set_write_timeout(params.timeout_write);
srv->set_socket_options([reuse_port = params.reuse_port](socket_t sock) {
srv->set_socket_options([reuse_port = params.reuse_port](const socket_t sock) {
httplib::set_socket_opt(sock, SOL_SOCKET, SO_REUSEADDR, 1);
if (reuse_port) {
#ifdef SO_REUSEPORT
@@ -162,8 +162,8 @@ bool server_http_context::init(const common_params & params) {
});
if (params.api_keys.size() == 1) {
auto key = params.api_keys[0];
std::string substr = key.substr(std::max((int)(key.length() - 4), 0));
const auto key = params.api_keys[0];
const std::string substr = key.substr(std::max(static_cast<int>(key.length() - 4), 0));
SRV_INF("api_keys: ****%s\n", substr.c_str());
} else if (params.api_keys.size() > 1) {
SRV_INF("api_keys: %zu keys loaded\n", params.api_keys.size());
@@ -203,7 +203,7 @@ bool server_http_context::init(const common_params & params) {
}
// remove the "Bearer " prefix if needed
std::string prefix = "Bearer ";
static std::string prefix = "Bearer ";
if (req_api_key.substr(0, prefix.size()) == prefix) {
req_api_key = req_api_key.substr(prefix.size());
}
@@ -232,11 +232,10 @@ bool server_http_context::init(const common_params & params) {
};
auto middleware_server_state = [this](const httplib::Request & req, httplib::Response & res) {
bool ready = is_ready.load();
if (!ready) {
if (!is_ready.load()) {
#if defined(LLAMA_UI_HAS_ASSETS)
auto tmp = string_split<std::string>(req.path, '.');
if (req.path == "/" || (tmp.size() > 0 && tmp.back() == "html")) {
if (const auto tmp = string_split<std::string>(req.path, '.');
req.path == "/" || (!tmp.empty() && tmp.back() == "html")) {
if (const llama_ui_asset * a = llama_ui_find_asset("loading.html")) {
res.status = 503;
res.set_content(reinterpret_cast<const char*>(a->data), a->size, "text/html; charset=utf-8");
@@ -284,17 +283,17 @@ bool server_http_context::init(const common_params & params) {
return httplib::Server::HandlerResponse::Unhandled;
});
int n_threads_http = params.n_threads_http;
auto n_threads_http = params.n_threads_http;
if (n_threads_http < 1) {
// +4 threads for monitoring, health and some threads reserved for MCP and other tasks in the future
n_threads_http = std::max(params.n_parallel + 4, (int32_t) std::thread::hardware_concurrency() - 1);
n_threads_http = std::max(params.n_parallel + 4, static_cast<int32_t>(std::thread::hardware_concurrency() - 1));
}
SRV_INF("using %d threads for HTTP server\n", n_threads_http);
srv->new_task_queue = [n_threads_http] {
// spawn n_threads_http fixed thread (always alive), while allow up to 1024 max possible additional threads
// when n_threads_http is used, server will create new "dynamic" threads that will be destroyed after processing each request
// ref: https://github.com/yhirose/cpp-httplib/pull/2368
size_t max_threads = (size_t)n_threads_http + 1024;
const auto max_threads = static_cast<size_t>(n_threads_http + 1024);
return new httplib::ThreadPool(n_threads_http, max_threads);
};
@@ -310,20 +309,26 @@ bool server_http_context::init(const common_params & params) {
// register static assets routes
if (!params.public_path.empty()) {
// Set the base directory for serving static files
bool is_found = srv->set_mount_point(params.api_prefix + "/", params.public_path);
if (!is_found) {
if (const auto is_found = srv->set_mount_point(params.api_prefix + "/", params.public_path); !is_found) {
SRV_ERR("static assets path not found: %s\n", params.public_path.c_str());
return 1;
return false;
}
} else {
#if defined(LLAMA_UI_HAS_ASSETS)
auto serve_asset = [](const std::string & name, const char * mime, bool with_isolation_headers) {
return [name, mime, with_isolation_headers](const httplib::Request & /*req*/, httplib::Response & res) {
return [name, mime, with_isolation_headers](const httplib::Request & req, httplib::Response & res) {
const llama_ui_asset * a = llama_ui_find_asset(name.c_str());
if (!a) {
res.status = 404;
return false;
}
res.set_header("ETag", a->etag);
// Check If-None-Match for conditional GET (304 Not Modified)
if (const std::string & inm = req.get_header_value("If-None-Match");
!inm.empty() && inm == a->etag) {
res.status = 304;
return false;
}
if (with_isolation_headers) {
// COEP and COOP headers, required by pyodide (python interpreter)
res.set_header("Cross-Origin-Embedder-Policy", "require-corp");
@@ -346,9 +351,9 @@ bool server_http_context::init(const common_params & params) {
bool server_http_context::start() {
// Bind and listen
auto & srv = pimpl->srv;
bool was_bound = false;
bool is_sock = false;
const auto & srv = pimpl->srv;
auto was_bound = false;
auto is_sock = false;
if (string_ends_with(std::string(hostname), ".sock")) {
is_sock = true;
SRV_INF("%s", "setting address family to AF_UNIX\n");
@@ -360,7 +365,7 @@ bool server_http_context::start() {
SRV_INF("%s", "binding port with default address family\n");
// bind HTTP listen port
if (port == 0) {
int bound_port = srv->bind_to_any_port(hostname);
const auto bound_port = srv->bind_to_any_port(hostname);
was_bound = (bound_port >= 0);
if (was_bound) {
port = bound_port;
@@ -376,7 +381,7 @@ bool server_http_context::start() {
}
// run the HTTP server in a thread
thread = std::thread([this]() { pimpl->srv->listen_after_bind(); });
thread = std::thread([this] { pimpl->srv->listen_after_bind(); });
srv->wait_until_ready();
listening_address = is_sock ? string_format("unix://%s", hostname.c_str())
@@ -433,13 +438,13 @@ static void process_handler_response(server_http_req_ptr && request, server_http
if (response->is_stream()) {
res.status = response->status;
set_headers(res, response->headers);
std::string content_type = response->content_type;
const std::string content_type = response->content_type;
// convert to shared_ptr as both chunked_content_provider() and on_complete() need to use it
std::shared_ptr<server_http_req> q_ptr = std::move(request);
std::shared_ptr<server_http_res> r_ptr = std::move(response);
const auto chunked_content_provider = [response = r_ptr](size_t, httplib::DataSink & sink) -> bool {
std::shared_ptr q_ptr = std::move(request);
std::shared_ptr r_ptr = std::move(response);
const auto chunked_content_provider = [response = r_ptr](size_t, const httplib::DataSink & sink) -> bool {
std::string chunk;
bool has_next = response->next(chunk);
const bool has_next = response->next(chunk);
if (!chunk.empty()) {
if (!sink.write(chunk.data(), chunk.size())) {
return false;
@@ -550,7 +555,7 @@ static std::string path_to_gcp_format(const std::string & path) {
if (c == '/' || c == '-' || c == '_') {
cap = true;
} else {
result += cap ? (char)std::toupper(c) : (char)c;
result += static_cast<char>(cap ? std::toupper(c) : c);
cap = false;
}
}
@@ -574,7 +579,7 @@ static json parse_gcp_predict_response(const server_http_res_ptr & res) {
}
}
void server_http_context::register_gcp_compat() {
void server_http_context::register_gcp_compat() const {
const gcp_params gcp;
if (!gcp.enabled) {
@@ -595,7 +600,7 @@ void server_http_context::register_gcp_compat() {
}
if (!gcp.path_health.empty()) {
auto health_handler = handlers.find("/health");
const auto health_handler = handlers.find("/health");
GGML_ASSERT(health_handler != handlers.end());
get(gcp.path_health, health_handler->second);
}

View File

@@ -73,7 +73,7 @@ struct server_http_context {
std::string path_prefix;
std::string hostname;
int port;
int port = 8080;
bool is_ssl = false;
server_http_context();
@@ -88,7 +88,7 @@ struct server_http_context {
// Register the Google Cloud Platform (Vertex AI) compat (AIP_PREDICT_ROUTE env var, or /predict)
// Must be called AFTER all other API routes are registered
void register_gcp_compat();
void register_gcp_compat() const;
// for debugging
std::string listening_address;

View File

@@ -9,6 +9,19 @@
#include <fstream>
#include <string>
#include <vector>
#include <cstdint>
// Computes FNV-1a hash of the data
static uint64_t fnv_hash(const uint8_t * data, size_t len) {
const uint64_t fnv_prime = 0x100000001b3ULL;
uint64_t hash = 0xcbf29ce484222325ULL;
for (size_t i = 0; i < len; ++i) {
hash ^= data[i];
hash *= fnv_prime;
}
return hash;
}
static bool read_file(const std::string & path, std::vector<unsigned char> & out) {
std::ifstream f(path, std::ios::binary | std::ios::ate);
@@ -95,6 +108,7 @@ int main(int argc, char ** argv) {
" const char * name;\n"
" const unsigned char * data;\n"
" size_t size;\n"
" const char * etag;\n"
"};\n\n"
"const llama_ui_asset * llama_ui_find_asset(const char * name);\n";
@@ -110,14 +124,18 @@ int main(int argc, char ** argv) {
}
cpp += fmt("static const unsigned char asset_%d_data[] = {", i);
append_bytes_hex(cpp, bytes);
cpp += fmt("};\nstatic const size_t asset_%d_size = %lu;\n\n",
const auto hash = fnv_hash(bytes.data(), bytes.size());
cpp += fmt("};\nstatic const size_t asset_%d_size = %lu;\n",
i, static_cast<unsigned long>(bytes.size()));
cpp += fmt("static const char asset_%d_etag[] = \"\\\"0x%016lx\\\"\";\n\n",
i, static_cast<unsigned long>(hash));
}
cpp += "static const llama_ui_asset g_assets[] = {\n";
for (int i = 0; i < n_assets; i++) {
const char * name = argv[3 + i * 2];
cpp += fmt(" { \"%s\", asset_%d_data, asset_%d_size },\n", name, i, i);
cpp += fmt(" { \"%s\", asset_%d_data, asset_%d_size, asset_%d_etag },\n",
argv[3 + i * 2], i, i, i);
}
cpp += "};\n\n";