Compare commits

37 Commits
b9333 ... b9370

Author SHA1 Message Date
Max Krasnyansky
aa50b2c2ae hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647)
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now

* hmx-mm: add support for Q4_1

* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot

* hexagon: fix repack scratch buffer overflow

* hex-mm: fix Q4_1 repack buffer sizing

* hexagon: flip the build order for mm and fa (seems to help LTO)

* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1

* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output

* hexagon: resurrect early-wake and add support for polling for op-batch completions

With Q4_1, ggml-hexagon now claims pretty much the entire graph, which gives the CPU more time to chillax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in the normal runs, and op-batch polling is just for benchmarking.

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2026-05-27 10:46:11 -07:00
Masashi Yoshimura
c40006a62e ggml-webgpu: Fix how to dispatch WG to some ops (#23750) 2026-05-27 09:48:12 -07:00
Matt Corallo
c6e4088376 vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (#22887)
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32

Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

Note that this breaks some tests until the last commit which fixes
OOB A reads.

* vulkan: Use aligned loads in mul_mat_vec when available

Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec

Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.

Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes

There was a TODO to fix the OOB reads from the A matrix which we do
here.

It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
2026-05-27 17:19:23 +02:00
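
As a rough CPU-side illustration of two ideas from the vulkan MUL_MAT_VEC commit above (consuming four K elements per loop iteration, and handling odd sizes without reading past the end of A), a minimal C++ sketch might look like the following. The function name `dot_unrolled4` and the four-accumulator layout are assumptions made for illustration only; the actual change lives in a Vulkan compute shader and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: dot product of one row of A (`a`) with the vector `b`,
// both of length k, processing 4 elements along K per iteration.
float dot_unrolled4(const float * a, const float * b, size_t k) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

    // Largest multiple of 4 that is <= k: the unrolled loop never touches
    // indices >= k4, so it cannot read out of bounds when k is odd.
    const size_t k4 = k - (k % 4);

    for (size_t i = 0; i < k4; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    float acc = (acc0 + acc1) + (acc2 + acc3);

    // Scalar tail for the remaining 0..3 elements.
    for (size_t i = k4; i < k; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Calling `dot_unrolled4(a.data(), b.data(), a.size())` on two equally sized `std::vector<float>` inputs exercises both the unrolled loop and the tail whenever the size is not a multiple of 4.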
Jeff Bolz
b36eefc1b3 vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (#23541) 2026-05-27 17:18:28 +02:00
l8bloom
837bb6b447 vulkan: add REPEAT op support for f16 to f16. (#23298)
* feat: extend repeat op for vulkan

* feat: add repeat_f16 vulkan pipeline

* fix: ensure same dst and src types

* fix: use type_size instead of data types

* fix: use int16 and int32 for repeat shader op

* chore: rename repeat_f* to repeat_i*

* chore: rename repeat vulkan pipelines
2026-05-27 16:59:08 +02:00
Georgi Gerganov
ba4dd0bc67 ci : move ARM jobs to self-hosted + disable kleidiai mac release (#23780)
* ci : move ARM jobs to 3rd-party runners + disable kleidiai release

* cont : fix deps + fix names

* ocd : fix names

* cont : fix PR links
2026-05-27 17:22:20 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
617255d437 vendor : update cpp-httplib to 0.46.0 (#23650) 2026-05-27 21:36:24 +08:00
Sigbjørn Skjæret
87b0a60cdd pyproject : add conversion folder and update dependencies (#23746)
* add conversion folder and update dependencies

* limit python version for triton

* update dev-dependencies section
2026-05-27 15:06:18 +02:00
Oliver Simons
fda8528aa8 CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (#23742) 2026-05-27 15:21:04 +03:00
Sigbjørn Skjæret
2d0656fbdd ci : bump cuda release to 13.3 (#23749) 2026-05-27 15:06:08 +03:00
Georgi Gerganov
6b4e4bd582 common : fix env names to all have LLAMA_ARG_ prefix (#23778) 2026-05-27 14:52:47 +03:00
Georgi Gerganov
9f0e4b14d2 ci : fix windows ccaches (#23777)
* ci : server windows set build type explicitly

* cont : try windows-2025

* ci : use llvm

* cont : use ninja

* cont : fix shell

* ci : set number of jobs correctly

* ci : fix windows with vulkan ccache by using llvm

* ci : server ccache only on master

* ocd : fix job names

[no release]
2026-05-27 13:54:21 +03:00
Sigbjørn Skjæret
b3a739c9b6 ci : remove wasm test (#23733)
* run tests in correct build folder

* remove wasm test
2026-05-27 13:11:37 +03:00
Winston Ma
4d8cc0c56f vulkan: avoid preferring transfer queue on AMD UMA devices (#22455) 2026-05-27 11:48:40 +02:00
Georgi Gerganov
0d227ec358 ci : add ccache to server builds + fix undefined sanitizer build (#23763)
* ci : fix undefined sanitizer build to use Debug build type only

* ci : ccache the server builds

* cont : remove ui dependency + reuse ccache for both ubuntu jobs

* tmp : force ccache save

* Revert "tmp : force ccache save"

This reverts commit a857b03a10.

* cont : no need for node.js
2026-05-27 11:45:12 +03:00
quyentonndbs
1d971bba36 docs : fix duplicated "the" in granitevision and model-conversion docs (#23767)
Co-authored-by: Kai Tanaka <275430420+quyentonndbs@users.noreply.github.com>
2026-05-27 09:34:06 +02:00
zhangtao2-1
9777256c31 convert: add MiniCPM5 tokenizer support (#23384)
Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and
implement hardcoded regex handling in llama-vocab.cpp, consistent with
other BPE pre-tokenizers.

Co-authored-by: zhangtao <zhangtao2@modelbest.cn>
2026-05-27 08:08:33 +03:00
Radoslav Gerganov
7085492c6f server : fix the log message when using SSL (#23393)
When llama-server is started with SSL key and cert, the log says that it
listens on http instead of https. This patch fixes this.
2026-05-27 08:06:30 +03:00
Vladislav
b4c0549a49 ggml-zendnn : fixed naming of matmul function (#20964)
* ggml-zendnn: fixed naming of matmul function

* ggml-zendnn: fixed naming of mul_mat_id function

* ggml-zendnn: fixed print in mul_mat_id

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
2026-05-27 00:59:35 +02:00
Georgi Gerganov
0d18aaa9d1 ci : do not allocate ccache for 3rd-party hosted runners (#23730)
* ci : do not allocate ccache for 3rd-party hosted runners

[no release]

* cont : add prints

[no ci]
[no release]
2026-05-26 20:15:01 +03:00
Georgi Gerganov
08bc21b459 ci : move [no release] check to dedicated check_release job (#23734)
* ci : move [no release] check to dedicated check_release job

Move the workflow-level `if` condition that skips builds when the commit
message contains `[no release]` into a lightweight `check_release` job.
All build jobs now depend on it via `needs` and check its output.

This ensures the skip logic is evaluated at the job level rather than at
the workflow level, which is the recommended approach for conditional jobs.

Assisted-by: llama.cpp:local pi

* cont : use `fast` runner
2026-05-26 19:49:41 +03:00
Georgi Gerganov
35a74c8fb9 ci : add [no release] keyword + fix sanitizer builds (#23728)
* ci : skip release workflow on master when commit message contains [no release]

Assisted-by: llama.cpp:local pi

* ci : restrict sanitizer builds to x86_64 + fix build type

the spark is apparently too slow for some reason

* tests : fix undefined warning

[no ci]
2026-05-26 19:05:48 +03:00
Georgi Gerganov
5190c2ea8d ci : move macos jobs to the apple workflow + fix names (#23721) 2026-05-26 16:57:55 +03:00
Jeff Bolz
7799d31e68 vulkan: optimize conv2d and implement coopmat1 support (#22620)
* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d

* vulkan: skip conv2d bounds checks when shapes align with tile sizes

* vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d

* vulkan: stage cm2 conv2d accumulator through shmem before global store

* vulkan: add coopmat1 conv2d path

* fallback when using too much shared memory. clean up comments

* Require 16x16x16 and subgroup size 32 or 64

* check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values
2026-05-26 15:48:05 +02:00
Georgi Gerganov
3a3ed153d9 ci : remove vulkan SDK dep from webgpu job (#23718)
* ci : remove vulkan dep from webgpu build

* cont : add ccache to `ubuntu-24-webgpu-wasm`

* ci : fix name + add wasm test
2026-05-26 16:40:30 +03:00
Max Krasnyansky
ef66bfab68 hexagon: add support for CONCAT op (#23648)
* hexagon: add support for CONCAT with optimized concat_2d_transposed

qwen3.5 models are quite heavy on the CONCAT with large and transposed src1.

* hex-concat: use fastdiv in generic version

* hex-concat: make checks for transposed a bit more readable

* hex-concat: reorder dma ops for better pipelining

* hex-cont/cpy: optimize CPY and CONT ops

The primary change is to avoid scalar divs in the inner loops.
We were calling hvx_copy_uu(... type_size) where type_size is not a constexpr.
This causes runtime divs by that value which is normally just 4 or 2 (f32/f16).

* hex-get-rows: optimize GET_ROWS for large rows

We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models
that do lots of GET_ROWS with huge (2MB+ rows).

Also bump the DMA queue depth now that we can take advantage of it.

* hex-concat: unroll the inner loops of concat_2d

* hex-concat: more updates to concat_2d to improve perf a bit further

* hex-cpy: fixed n_rows per thread checks in the copy ops

* hmx-fa: fix alignment issues while computing dma sizes

* hex-set-rows: add early returns for idle threads

* hvx-rope: minor optimization to replace loops with fastdiv logic

* hex-rope: replace scalar tail processing with HVX

* hex-rope: optimize rope cache init with HVX

Add hvx-utils sin/cos helpers that use an approx method (similar to rsqrt, inverse, etc).
Use the helpers to optimize ROPE.
2026-05-26 06:20:05 -07:00
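
The CPY/CONT note in the commit above attributes the slowdown to dividing by a `type_size` that is only known at run time. One common way to avoid that, sketched below in plain C++ as an assumption rather than as the actual Hexagon change, is to dispatch once on the element size and call a helper where the size is a compile-time constant. The names `copy_elems` and `copy_rows` are hypothetical and do not correspond to the real `hvx_copy_uu` signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: ElemSize is a template parameter, so the division
// below has a constant divisor and can be lowered to a shift.
template <size_t ElemSize>
size_t copy_elems(uint8_t * dst, const uint8_t * src, size_t n_bytes) {
    static_assert(ElemSize == 2 || ElemSize == 4, "sketch only covers f16/f32 sizes");
    const size_t n_elems = n_bytes / ElemSize;
    std::memcpy(dst, src, n_elems * ElemSize);
    return n_elems;
}

// Dispatch on the runtime type_size once, outside the per-row loop.
// Returns the total number of elements copied.
size_t copy_rows(uint8_t * dst, const uint8_t * src,
                 size_t n_rows, size_t row_bytes, size_t type_size) {
    assert(type_size > 0);
    size_t total = 0;
    switch (type_size) {
        case 2:
            for (size_t r = 0; r < n_rows; ++r) {
                total += copy_elems<2>(dst + r * row_bytes, src + r * row_bytes, row_bytes);
            }
            break;
        case 4:
            for (size_t r = 0; r < n_rows; ++r) {
                total += copy_elems<4>(dst + r * row_bytes, src + r * row_bytes, row_bytes);
            }
            break;
        default:
            // Generic fallback: keeps the runtime division for unusual sizes.
            for (size_t r = 0; r < n_rows; ++r) {
                const size_t n_elems = row_bytes / type_size;
                std::memcpy(dst + r * row_bytes, src + r * row_bytes, n_elems * type_size);
                total += n_elems;
            }
            break;
    }
    return total;
}
```

The common f32/f16 cases take the specialized paths, while any other element size still works through the slower generic branch.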
Georgi Gerganov
678d43d720 ci : move more CPU jobs to self-hosted runners (#23715) 2026-05-26 15:37:40 +03:00
Georgi Gerganov
ef41a69179 ci : move sanitizer jobs to self-hosted runners (#23713) 2026-05-26 15:22:09 +03:00
Georgi Gerganov
3dc7684f39 ci : reduce (disable SYCL and CANN builds/releases) (#23705)
* ci : reduce

[no ci]

* cont : disable sycl, cann + rename caches

[no ci]

* cont : cann

[no ci]
2026-05-26 15:21:21 +03:00
ghleg
dbe9c0c8ce convert : support Gemma4ForCausalLM architecture (#23682)
* convert : support Gemma4ForCausalLM architecture (#23674)

* fix indent

---------

Co-authored-by: Oleg Afonin <your.email@example.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-26 08:00:31 +03:00
Michael Wand
6fe90deffa models : Attach Mistral3 NVFP4 weight scales (#23629) 2026-05-26 07:59:59 +03:00
Alexey Kopytko
581d020b12 SYCL: implement ggml_sycl_pool_vmm (#22862)
* SYCL: implement ggml_sycl_pool_vmm

* Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM

* Clean up debugging logging

* document GGML_SYCL_DISABLE_VMM

* Multi-stream MoE optimization

* Revert "Multi-stream MoE optimization"

This reverts commit 938929c3f1.

* Update common.hpp

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM

* add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro)

* Apply suggestions from code review

Co-authored-by: Alexey Kopytko <alexey@kopytko.com>

* Apply suggestion from @sanmai

* Apply suggestion from @sanmai

---------

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2026-05-26 07:59:00 +03:00
Jeff Bolz
7623de11d9 tests: test-backend-ops -j <N> to run tests in parallel (#23637)
Create a pool of N threads that grab a chunk of up to 100 tests at a time to
iterate through. The number of tests at a time decreases as fewer remain.

Each thread uses its own dev and cpu backend, and set_n_threads_fn is not
called on the cpu backend.

Fix some TSAN issues that arose:
- In init_tensor_uniform, don't use static vector of generators.
- Replace gmtime with versions that don't use a global variable.
- Mutex calls to print_test_result.
2026-05-26 07:57:56 +03:00
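
The scheduling described in the `test-backend-ops -j <N>` commit above (a pool of N threads, each grabbing a chunk of up to 100 tests, with chunks shrinking as fewer tests remain) could be sketched roughly as follows. This is not the code from the PR: the `chunk_queue` and `run_parallel` names are hypothetical, and the shrink rule (remaining tests divided by the thread count, clamped to the range 1..100) is an assumption chosen to match the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical work queue handing out [begin, end) ranges of test indices.
struct chunk_queue {
    std::mutex mtx;
    size_t     next = 0;
    size_t     total;
    size_t     n_threads;
    size_t     max_chunk;

    chunk_queue(size_t total, size_t n_threads, size_t max_chunk = 100)
        : total(total), n_threads(std::max<size_t>(n_threads, 1)), max_chunk(max_chunk) {
        assert(max_chunk >= 1);
    }

    // Returns false once no tests remain.
    bool grab(size_t & begin, size_t & end) {
        std::lock_guard<std::mutex> lock(mtx);
        if (next >= total) {
            return false;
        }
        const size_t remaining = total - next;
        // Assumed rule: chunks shrink as the remaining work shrinks.
        const size_t chunk = std::clamp<size_t>(remaining / n_threads, 1, max_chunk);
        begin = next;
        end   = next + chunk;
        next  = end;
        return true;
    }
};

// Runs n_tests placeholder "tests" on n_threads workers and returns how many
// times each test index was executed (expected: exactly once each).
std::vector<int> run_parallel(size_t n_tests, size_t n_threads) {
    std::vector<int> runs(n_tests, 0);
    chunk_queue queue(n_tests, n_threads);

    std::vector<std::thread> workers;
    for (size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&]() {
            size_t b = 0, e = 0;
            while (queue.grab(b, e)) {
                for (size_t i = b; i < e; ++i) {
                    runs[i] += 1; // stand-in for running test i; each index belongs to one chunk
                }
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return runs;
}
```

Shrinking the chunks near the end keeps one thread from being left with a large final chunk while the others sit idle. The commit additionally gives each thread its own device and CPU backend, which this sketch leaves out.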
Niklas Sheth
c9d98295a3 model : add support for talkie-1930-13b (#22596)
* initial talkie support, coherent

* reorder to follow convention

* absorb inverse rope

* stop folding scalars to improve quantization

* use broadcasting instead of duplication

* style cleanup

* add scaling support to LoraTorchTensor; use that path in conversion

* use layer_out_scale instead of embd_skip_scale
2026-05-26 07:57:38 +03:00
Masashi Yoshimura
1506d39e76 ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (#23594)
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K

* Fix to pass editorconfig checks

* Remove mul-mat-legacy pipeline

* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
2026-05-25 20:42:49 -07:00
Nikhil Jain
54121f7325 [WebGPU] Check batch_compute_passes before sending passes when not doing GPU profiling (#23457)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled

* restore build.yml
2026-05-25 20:32:49 -07:00
Johannes Gäßler
192d8ae8b8 CUDA: missing PDL sync for FWHT, better fallback (#23690) 2026-05-26 11:05:51 +08:00
110 changed files with 6789 additions and 2570 deletions

View File

@@ -96,3 +96,34 @@ runs:
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
echo "CUDA_PATH_V13_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
- name: Install Cuda Toolkit 13.3
if: ${{ inputs.cuda_version == '13.3' }}
shell: pwsh
run: |
mkdir -p "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3"
choco install unzip -y
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_crt/windows-x86_64/cuda_crt-windows-x86_64-13.3.33-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_cudart/windows-x86_64/cuda_cudart-windows-x86_64-13.3.29-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvcc/windows-x86_64/cuda_nvcc-windows-x86_64-13.3.33-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvrtc/windows-x86_64/cuda_nvrtc-windows-x86_64-13.3.33-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/libcublas/windows-x86_64/libcublas-windows-x86_64-13.5.1.27-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/libnvvm/windows-x86_64/libnvvm-windows-x86_64-13.3.33-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvtx/windows-x86_64/cuda_nvtx-windows-x86_64-13.3.29-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_profiler_api/windows-x86_64/cuda_profiler_api-windows-x86_64-13.3.27-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/visual_studio_integration/windows-x86_64/visual_studio_integration-windows-x86_64-13.3.27-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cccl/windows-x86_64/cccl-windows-x86_64-13.3.3.3.1-archive.zip"
unzip '*.zip' -d "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3"
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_crt-windows-x86_64-13.3.33-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_cudart-windows-x86_64-13.3.29-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_nvcc-windows-x86_64-13.3.33-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_nvrtc-windows-x86_64-13.3.33-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\libcublas-windows-x86_64-13.5.1.27-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\libnvvm-windows-x86_64-13.3.33-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_nvtx-windows-x86_64-13.3.29-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_profiler_api-windows-x86_64-13.3.27-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\visual_studio_integration-windows-x86_64-13.3.27-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cccl-windows-x86_64-13.3.3.3.1-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
echo "CUDA_PATH_V13_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8

View File

@@ -22,9 +22,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-llguidance:

View File

@@ -27,9 +27,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
android:

View File

@@ -32,12 +32,12 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
macOS-latest-ios:
macos-latest-arm64:
runs-on: macos-latest
steps:
@@ -48,7 +48,79 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-ios
key: macos-latest-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
sysctl -a
cmake -B build \
-DCMAKE_BUILD_RPATH="@loader_path" \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL_USE_BF16=ON \
-DGGML_METAL_EMBED_LIBRARY=OFF \
-DGGML_METAL_SHADER_DEBUG=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
leaks -atExit -- ./build/bin/test-thread-safety -hf ggml-org/gemma-3-270m-qat-GGUF -ngl 99 -p "$(printf 'hello %.0s' {1..128})" -n 16 -c 512 -ub 32 -np 2 -t 2 -lv 1
- name: Test
id: cmake_test
run: |
cd build
ctest -L main -E "test-llama-archs" --verbose --timeout 900
macos-latest-x64:
runs-on: macos-15-intel
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macos-latest-x64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
sysctl -a
# Metal is disabled due to intermittent failures with Github runners not having a GPU:
# https://github.com/ggml-org/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
cmake -B build \
-DCMAKE_BUILD_RPATH="@loader_path" \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL=OFF \
-DGGML_RPC=ON \
-DCMAKE_OSX_DEPLOYMENT_TARGET=13.3
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
macos-latest-ios:
runs-on: macos-latest
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macos-latest-ios
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -117,7 +189,7 @@ jobs:
xcodebuild -downloadPlatform iOS
xcodebuild -project examples/llama.swiftui/llama.swiftui.xcodeproj -scheme llama.swiftui -sdk iphoneos CODE_SIGNING_REQUIRED=NO CODE_SIGN_IDENTITY= -destination 'generic/platform=iOS' FRAMEWORK_FOLDER_PATH=./build-ios build
macOS-latest-tvos:
macos-latest-tvos:
runs-on: macos-latest
steps:
@@ -128,7 +200,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-tvos
key: macos-latest-tvos
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -150,7 +222,7 @@ jobs:
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO
macOS-latest-visionos:
macos-latest-visionos:
runs-on: macos-latest
steps:
@@ -176,7 +248,7 @@ jobs:
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO
macOS-latest-swift:
macos-latest-swift:
runs-on: macos-latest
needs: macos-latest-ios-xcode
@@ -192,7 +264,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-swift
key: macos-latest-swift
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -28,7 +28,7 @@ jobs:
id: cache-sdk
with:
path: ./vulkan_sdk
key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
key: cache-gha-vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'
@@ -54,7 +54,7 @@ jobs:
# id: cache-toolchain
# with:
# path: ./spacemit_toolchain
# key: spacemit-ime-toolchain-v${{ env.SPACEMIT_IME_TOOLCHAIN_VERSION }}-${{ runner.os }}
# key: cache-gha-spacemit-ime-toolchain-v${{ env.SPACEMIT_IME_TOOLCHAIN_VERSION }}-${{ runner.os }}
# - name: Setup SpacemiT Toolchain
# if: steps.cache-toolchain.outputs.cache-hit != 'true'
@@ -81,7 +81,7 @@ jobs:
id: cache-openvino
with:
path: ./openvino_toolkit
key: openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
key: cache-gha-openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
- name: Setup OpenVINO Toolkit
if: steps.cache-openvino.outputs.cache-hit != 'true'
@@ -108,7 +108,7 @@ jobs:
id: cache-rocm
with:
path: C:\Program Files\AMD\ROCm
key: rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
key: cache-gha-rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: Setup ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'

View File

@@ -29,74 +29,76 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
openEuler-latest-cann:
defaults:
run:
shell: bash -el {0}
strategy:
matrix:
arch: [x86, aarch64]
chip_type: ['910b', '310p']
build: ['Release']
use_acl_graph: ['on', 'off']
exclude:
# 310P does not support USE_ACL_GRAPH=on
- chip_type: '310p'
use_acl_graph: 'on'
runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
steps:
- name: Checkout
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Free up disk space
uses: ggml-org/free-disk-space@v1.3.1
with:
tool-cache: true
- name: Set container image
id: cann-image
run: |
image="ascendai/cann:${{ matrix.chip_type == '910b' && '8.5.0-910b-openeuler24.03-py3.11' || '8.5.0-310p-openeuler24.03-py3.11' }}"
echo "image=${image}" >> "${GITHUB_OUTPUT}"
- name: Pull container image
run: docker pull "${{ steps.cann-image.outputs.image }}"
- name: Build
env:
BUILD_TYPE: ${{ matrix.build }}
SOC_TYPE: ascend${{ matrix.chip_type }}
USE_ACL_GRAPH: ${{ matrix.use_acl_graph }}
run: |
HOST_UID=$(id -u)
HOST_GID=$(id -g)
docker run --rm \
-v "${PWD}:/workspace" \
-w /workspace \
-e SOC_TYPE=${SOC_TYPE} \
-e BUILD_TYPE=${BUILD_TYPE} \
-e USE_ACL_GRAPH=${USE_ACL_GRAPH} \
"${{ steps.cann-image.outputs.image }}" \
bash -lc '
set -e
yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake openssl-devel
yum clean all && rm -rf /var/cache/yum
git config --global --add safe.directory "/workspace"
export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
-DGGML_CANN=on \
-DSOC_TYPE=${SOC_TYPE} \
-DUSE_ACL_GRAPH=${USE_ACL_GRAPH}
cmake --build build -j $(nproc)
chown -R '"${HOST_UID}"':'"${HOST_GID}"' /workspace/build
'
# TODO: this build is disabled to save Github Actions resources (https://github.com/ggml-org/llama.cpp/pull/23705)
# in order to enable it again, we have to provision dedicated runners to run it
# openEuler-latest-cann:
# defaults:
# run:
# shell: bash -el {0}
# strategy:
# matrix:
# arch: [x86, aarch64]
# chip_type: ['910b', '310p']
# build: ['Release']
# use_acl_graph: ['on', 'off']
# exclude:
# # 310P does not support USE_ACL_GRAPH=on
# - chip_type: '310p'
# use_acl_graph: 'on'
# runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
# steps:
# - name: Checkout
# uses: actions/checkout@v6
# with:
# fetch-depth: 0
#
# - name: Free up disk space
# uses: ggml-org/free-disk-space@v1.3.1
# with:
# tool-cache: true
#
# - name: Set container image
# id: cann-image
# run: |
# image="ascendai/cann:${{ matrix.chip_type == '910b' && '8.5.0-910b-openeuler24.03-py3.11' || '8.5.0-310p-openeuler24.03-py3.11' }}"
# echo "image=${image}" >> "${GITHUB_OUTPUT}"
#
# - name: Pull container image
# run: docker pull "${{ steps.cann-image.outputs.image }}"
#
# - name: Build
# env:
# BUILD_TYPE: ${{ matrix.build }}
# SOC_TYPE: ascend${{ matrix.chip_type }}
# USE_ACL_GRAPH: ${{ matrix.use_acl_graph }}
# run: |
# HOST_UID=$(id -u)
# HOST_GID=$(id -g)
#
# docker run --rm \
# -v "${PWD}:/workspace" \
# -w /workspace \
# -e SOC_TYPE=${SOC_TYPE} \
# -e BUILD_TYPE=${BUILD_TYPE} \
# -e USE_ACL_GRAPH=${USE_ACL_GRAPH} \
# "${{ steps.cann-image.outputs.image }}" \
# bash -lc '
# set -e
# yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake openssl-devel
# yum clean all && rm -rf /var/cache/yum
# git config --global --add safe.directory "/workspace"
# export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
# cmake -S . -B build \
# -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
# -DGGML_CANN=on \
# -DSOC_TYPE=${SOC_TYPE} \
# -DUSE_ACL_GRAPH=${USE_ACL_GRAPH}
# cmake --build build -j $(nproc)
#
# chown -R '"${HOST_UID}"':'"${HOST_GID}"' /workspace/build
# '

View File

@@ -287,7 +287,7 @@ jobs:
# id: cache-toolchain
# with:
# path: ./spacemit_toolchain
# key: spacemit-ime-toolchain-v${{ env.SPACEMIT_IME_TOOLCHAIN_VERSION }}-${{ runner.os }}
# key: cache-gha-spacemit-ime-toolchain-v${{ env.SPACEMIT_IME_TOOLCHAIN_VERSION }}-${{ runner.os }}
- name: Setup SpacemiT Toolchain
#if: steps.cache-toolchain.outputs.cache-hit != 'true'

View File

@@ -31,9 +31,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
@@ -93,7 +93,7 @@ jobs:
id: cache-rocm
with:
path: C:\Program Files\AMD\ROCm
key: rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
key: cache-gha-rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: Setup ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'

View File

@@ -29,9 +29,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:

View File

@@ -15,9 +15,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
windows-msys2:

View File

@@ -30,9 +30,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:

View File

@@ -29,9 +29,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-openvino:
@@ -84,7 +84,7 @@ jobs:
id: cache-openvino
with:
path: ./openvino_toolkit
key: openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
key: cache-gha-openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
- name: Setup OpenVINO Toolkit
if: steps.cache-openvino.outputs.cache-hit != 'true'

View File

@@ -29,9 +29,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-cpu-riscv64-native:
@@ -58,17 +58,20 @@ jobs:
ldd --version
cmake --version
rustc --version
env
echo "nproc=$(nproc)"
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
with:
key: ubuntu-cpu-riscv64-native
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
# note: sparing some ccache since these jobs run on dedicated runners that are not part of the organization
#- name: ccache
# uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
# with:
# key: ubuntu-cpu-riscv64-native
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
@@ -132,12 +135,13 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
with:
key: ubuntu-riscv64-native-sanitizer-${{ matrix.sanitizer }}-${{ matrix.build_type }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
# note: sparing some ccache since these jobs run on dedicated runners that are not part of the organization
#- name: ccache
# uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
# with:
# key: ubuntu-riscv64-native-sanitizer-${{ matrix.sanitizer }}-${{ matrix.build_type }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build

View File

@@ -29,9 +29,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:

View File

@@ -22,66 +22,78 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-latest-sanitizer:
runs-on: ubuntu-latest
ctest:
runs-on: [self-hosted, X64, CPU, Linux]
continue-on-error: true
strategy:
matrix:
sanitizer: [ADDRESS, THREAD, UNDEFINED]
build_type: [Debug]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-latest-sanitizer-${{ matrix.sanitizer }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-latest-sanitizer-${{ matrix.sanitizer }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
#- name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get install build-essential libssl-dev
# with UNDEFINED sanitizer, we have to build in Debug to avoid GCC 13 false-positive warnings
- name: Build (undefined)
id: cmake_build_undefined
if: ${{ matrix.sanitizer == 'UNDEFINED' }}
run: |
sudo apt-get update
sudo apt-get install build-essential libssl-dev
cmake -B build \
-DCMAKE_BUILD_TYPE=Debug \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON \
-DGGML_SANITIZE_${{ matrix.sanitizer }}=ON
cmake --build build --config Debug -j $(nproc)
- name: Build
id: cmake_build
if: ${{ matrix.sanitizer != 'THREAD' }}
if: ${{ matrix.sanitizer == 'ADDRESS' }}
run: |
cmake -B build \
-DLLAMA_FATAL_WARNINGS=ON \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON \
-DGGML_SANITIZE_${{ matrix.sanitizer }}=ON \
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }}
-DGGML_SANITIZE_${{ matrix.sanitizer }}=ON
cmake --build build --config ${{ matrix.build_type }} -j $(nproc)
cmake --build build --config RelWithDebInfo -j $(nproc)
- name: Build (no OpenMP)
id: cmake_build_no_openmp
if: ${{ matrix.sanitizer == 'THREAD' }}
run: |
cmake -B build \
-DLLAMA_FATAL_WARNINGS=ON \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON \
-DGGML_SANITIZE_${{ matrix.sanitizer }}=ON \
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }} \
-DGGML_OPENMP=OFF
cmake --build build --config ${{ matrix.build_type }} -j $(nproc)
cmake --build build --config RelWithDebInfo -j $(nproc)
- name: Test
id: cmake_test
# skip run in Debug - very slow
if: ${{ matrix.sanitizer != 'UNDEFINED' }}
run: |
cd build
ctest -L main --verbose --timeout 900
ctest -L main -E tokenizer --verbose --timeout 900
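
Since the +/- markers are lost in this view, the following is one reading of the new sanitizer job, not a confirmed description: `UNDEFINED` is configured as `Debug`, `ADDRESS` as `RelWithDebInfo`, and `THREAD` as `RelWithDebInfo` with OpenMP disabled. Under that assumption, the per-sanitizer configure flags can be sketched as a single function; the name `sanitizer_cmake_args` is hypothetical and does not exist in the repository.

```shell
# Hypothetical helper (not part of the workflow): print the cmake configure
# arguments for one sanitizer, following the build steps as read from the diff.
sanitizer_cmake_args() {
    sanitizer="$1"
    case "$sanitizer" in
        # Debug build, per the workflow comment about GCC 13 false-positive warnings
        UNDEFINED) build_type=Debug;          extra="" ;;
        ADDRESS)   build_type=RelWithDebInfo; extra="" ;;
        # the thread sanitizer build turns OpenMP off
        THREAD)    build_type=RelWithDebInfo; extra=" -DGGML_OPENMP=OFF" ;;
        *) echo "unknown sanitizer: $sanitizer" >&2; return 1 ;;
    esac
    echo "-DCMAKE_BUILD_TYPE=${build_type} -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_SANITIZE_${sanitizer}=ON -DGGML_SANITIZE_${sanitizer}=ON${extra}"
}
```

A local run could then look like `cmake -B build $(sanitizer_cmake_args THREAD)`, with the command substitution left unquoted on purpose so that each flag becomes its own argument.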

@@ -50,12 +50,12 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ggml-ci-nvidia-cuda:
gpu-cuda:
runs-on: [self-hosted, Linux, NVIDIA]
steps:
@@ -69,7 +69,7 @@ jobs:
nvidia-smi
GG_BUILD_CUDA=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-nvidia-vulkan-cm:
gpu-vulkan-nvidia-cm:
runs-on: [self-hosted, Linux, NVIDIA]
steps:
@@ -83,7 +83,7 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 GGML_VK_DISABLE_COOPMAT2=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-nvidia-vulkan-cm2:
gpu-vulkan-nvidia-cm2:
runs-on: [self-hosted, Linux, NVIDIA, COOPMAT2]
steps:
@@ -97,7 +97,7 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-nvidia-webgpu:
gpu-webgpu-nvidia:
runs-on: [self-hosted, Linux, NVIDIA, X64]
steps:
@@ -127,7 +127,7 @@ jobs:
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: provision AMX-compatible machine
#ggml-ci-cpu-amx:
#cpu-amx:
# runs-on: [self-hosted, Linux, CPU, AMX]
# steps:
@@ -141,7 +141,7 @@ jobs:
# bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: provision AMD GPU machine
# ggml-ci-amd-vulkan:
# amd-vulkan:
# runs-on: [self-hosted, Linux, AMD]
# steps:
@@ -156,7 +156,7 @@ jobs:
# GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: provision AMD GPU machine
# ggml-ci-amd-rocm:
# amd-rocm:
# runs-on: [self-hosted, Linux, AMD]
# steps:
@@ -170,7 +170,7 @@ jobs:
# amd-smi static
# GG_BUILD_ROCM=1 GG_BUILD_AMDGPU_TARGETS="gfx1101" bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-metal:
gpu-metal:
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -183,7 +183,7 @@ jobs:
run: |
GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-webgpu:
gpu-webgpu-apple:
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -210,7 +210,7 @@ jobs:
GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-vulkan:
gpu-vulkan:
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -224,7 +224,7 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-linux-intel-vulkan:
gpu-vulkan-intel-linux:
runs-on: [self-hosted, Linux, Intel]
steps:
@@ -240,7 +240,7 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-win-intel-vulkan:
gpu-vulkan-intel-windows:
runs-on: [self-hosted, Windows, X64, Intel]
steps:
@@ -261,7 +261,7 @@ jobs:
# a valid python environment for testing
LLAMA_FATAL_WARNINGS=OFF GG_BUILD_NINJA=1 GG_BUILD_VULKAN=1 GG_BUILD_LOW_PERF=1 ./ci/run.sh ./results/llama.cpp ./mnt/llama.cpp
ggml-ci-intel-openvino-gpu-low-perf:
cpu-openvino-low-perf:
runs-on: [self-hosted, Linux, Intel, OpenVINO]
concurrency:
@@ -297,8 +297,8 @@ jobs:
source ./openvino_toolkit/setupvars.sh
GG_BUILD_OPENVINO=1 GGML_OPENVINO_DEVICE=GPU GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-arm64-cpu-low-perf:
runs-on: [self-hosted, Linux, ARM64, CPU]
cpu-any-low-perf:
runs-on: [self-hosted, CPU]
steps:
- name: Clone
@@ -310,8 +310,8 @@ jobs:
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-arm64-cpu-high-perf:
runs-on: [self-hosted, Linux, ARM64, CPU]
cpu-any-high-perf:
runs-on: [self-hosted, CPU]
steps:
- name: Clone
@@ -323,34 +323,90 @@ jobs:
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_SVE=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: not sure how to detect ARM flags on DGX Spark. currently get this error during cmake:
# CMake Warning at ggml/src/ggml-cpu/CMakeLists.txt:147 (message):
# ARM -march/-mcpu not found, -mcpu=native will be used
#
# if we resolve this, we should be able to offload these jobs to the self-hosted runners
#
# ggml-ci-arm64-cpu-high-perf-sve:
# runs-on: [self-hosted, Linux, ARM64, CPU]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# LLAMA_ARG_THREADS=$(nproc) GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
#
# ggml-ci-arm64-cpu-kleidiai:
# runs-on: [self-hosted, Linux, ARM64, CPU]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# GG_BUILD_KLEIDIAI=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
cpu-arm64-graviton4:
runs-on: ah-ubuntu_22_04-c8g_8x
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
set -euxo pipefail
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a \
apt-get install -y \
build-essential \
python3-venv \
gpg \
wget \
time \
git-lfs
git lfs install
# install the latest cmake
sudo install -d /usr/share/keyrings
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc \
| gpg --dearmor \
| sudo tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ jammy main' \
| sudo tee /etc/apt/sources.list.d/kitware.list
sudo apt-get update
sudo apt-get install -y cmake
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
cpu-arm64-graviton4-kleidiai:
runs-on: ah-ubuntu_22_04-c8g_8x
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
set -euxo pipefail
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a \
apt-get install -y \
build-essential \
python3-venv \
gpg \
wget \
time \
git-lfs
git lfs install
# install the latest cmake
sudo install -d /usr/share/keyrings
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc \
| gpg --dearmor \
| sudo tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ jammy main' \
| sudo tee /etc/apt/sources.list.d/kitware.list
sudo apt-get update
sudo apt-get install -y cmake
# note: sparing some ccache since these jobs run on dedicated runners that are not part of the organization
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: arm64-cpu-kleidiai-graviton4
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Test
id: ggml-ci
run: |
GG_BUILD_KLEIDIAI=1 \
GG_BUILD_EXTRA_TESTS_0=1 \
bash ./ci/run.sh ./tmp/results ./tmp/mnt

@@ -29,130 +29,134 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-sycl:
strategy:
matrix:
build: [fp32]
include:
- build: fp32
fp16: OFF
# TODO: this build is disabled to save GitHub Actions resources (https://github.com/ggml-org/llama.cpp/pull/23705)
# in order to enable it again, we have to provision dedicated runners to run it
# ubuntu-24-sycl:
# strategy:
# matrix:
# build: [fp32]
# include:
# - build: fp32
# fp16: OFF
#
# runs-on: ubuntu-24.04
#
# env:
# ONEAPI_ROOT: /opt/intel/oneapi/
# ONEAPI_INSTALLER_VERSION: "2025.3.3"
# LEVEL_ZERO_VERSION: "1.28.2"
# LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
#
# continue-on-error: true
#
# steps:
# - uses: actions/checkout@v6
#
# - name: Use oneAPI Installation Cache
# uses: actions/cache@v5
# id: cache-sycl
# with:
# path: ${{ env.ONEAPI_ROOT }}
# key: cache-gha-oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
#
# - name: Download & Install oneAPI
# shell: bash
# if: steps.cache-sycl.outputs.cache-hit != 'true'
# run: |
# cd /tmp
# wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
# sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
#
# - name: Install Level Zero SDK
# shell: bash
# run: |
# cd /tmp
# wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
# wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
# sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
#
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-24-sycl-${{ matrix.build }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# - name: Build
# id: cmake_build
# run: |
# source /opt/intel/oneapi/setvars.sh
# cmake -B build \
# -G "Ninja" \
# -DCMAKE_BUILD_TYPE=Release \
# -DGGML_SYCL=ON \
# -DCMAKE_C_COMPILER=icx \
# -DCMAKE_CXX_COMPILER=icpx \
# -DLLAMA_OPENSSL=OFF \
# -DGGML_NATIVE=OFF \
# -DGGML_SYCL_F16=${{ matrix.fp16 }}
# time cmake --build build --config Release -j $(nproc)
runs-on: ubuntu-24.04
env:
ONEAPI_ROOT: /opt/intel/oneapi/
ONEAPI_INSTALLER_VERSION: "2025.3.3"
LEVEL_ZERO_VERSION: "1.28.2"
LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
continue-on-error: true
steps:
- uses: actions/checkout@v6
- name: Use oneAPI Installation Cache
uses: actions/cache@v5
id: cache-sycl
with:
path: ${{ env.ONEAPI_ROOT }}
key: oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
- name: Download & Install oneAPI
shell: bash
if: steps.cache-sycl.outputs.cache-hit != 'true'
run: |
cd /tmp
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
- name: Install Level Zero SDK
shell: bash
run: |
cd /tmp
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-sycl-${{ matrix.build }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
source /opt/intel/oneapi/setvars.sh
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DLLAMA_OPENSSL=OFF \
-DGGML_NATIVE=OFF \
-DGGML_SYCL_F16=${{ matrix.fp16 }}
time cmake --build build --config Release -j $(nproc)
windows-latest-sycl:
runs-on: windows-2022
defaults:
run:
shell: bash
env:
WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
ONEAPI_INSTALLER_VERSION: "2025.3.3"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Use oneAPI Installation Cache
uses: actions/cache@v5
id: cache-sycl
with:
path: ${{ env.ONEAPI_ROOT }}
key: oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
- name: Download & Install oneAPI
shell: bash
if: steps.cache-sycl.outputs.cache-hit != 'true'
run: |
scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
- name: Install Level Zero SDK
shell: pwsh
run: |
Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
"LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-sycl
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
# TODO: add ssl support ; we will also need to modify win-build-sycl.bat to accept user-specified args
- name: Build
id: cmake_build
run: examples/sycl/win-build-sycl.bat
# TODO: this build is disabled to save GitHub Actions resources (https://github.com/ggml-org/llama.cpp/pull/23705)
# in order to enable it again, we have to provision dedicated runners to run it
# windows-latest-sycl:
# runs-on: windows-2022
#
# defaults:
# run:
# shell: bash
#
# env:
# WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
# WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
# LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
# ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
# ONEAPI_INSTALLER_VERSION: "2025.3.3"
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Use oneAPI Installation Cache
# uses: actions/cache@v5
# id: cache-sycl
# with:
# path: ${{ env.ONEAPI_ROOT }}
# key: cache-gha-oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
#
# - name: Download & Install oneAPI
# shell: bash
# if: steps.cache-sycl.outputs.cache-hit != 'true'
# run: |
# scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
#
# - name: Install Level Zero SDK
# shell: pwsh
# run: |
# Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
# Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
# "LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: windows-latest-sycl
# variant: ccache
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# # TODO: add ssl support ; we will also need to modify win-build-sycl.bat to accept user-specified args
#
# - name: Build
# id: cmake_build
# run: examples/sycl/win-build-sycl.bat

@@ -31,9 +31,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-vulkan-llvmpipe:
@@ -68,7 +68,7 @@ jobs:
id: cache-sdk
with:
path: ./vulkan_sdk
key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
key: cache-gha-vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'

@@ -30,13 +30,12 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
macOS-latest-arm64-webgpu:
macos-latest-webgpu:
runs-on: macos-latest
steps:
@@ -47,7 +46,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64-webgpu
key: macos-latest-webgpu
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -100,25 +99,6 @@ jobs:
sudo apt-get install -y build-essential mesa-vulkan-drivers \
libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libssl-dev
- name: Get latest Vulkan SDK version
id: vulkan_sdk_version
run: |
echo "VULKAN_SDK_VERSION=$(curl https://vulkan.lunarg.com/sdk/latest/linux.txt)" >> "$GITHUB_ENV"
- name: Use Vulkan SDK Cache
uses: actions/cache@v5
id: cache-sdk
with:
path: ./vulkan_sdk
key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'
uses: ./.github/actions/linux-setup-vulkan
with:
path: ./vulkan_sdk
version: ${{ env.VULKAN_SDK_VERSION }}
- name: Dawn Dependency
id: dawn-depends
run: |
@@ -157,6 +137,13 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-webgpu-wasm
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Install Emscripten
run: |
git clone https://github.com/emscripten-core/emsdk.git

@@ -52,86 +52,14 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
build-cmake-pkg:
uses: ./.github/workflows/build-cmake-pkg.yml
macOS-latest-arm64:
runs-on: macos-latest
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
sysctl -a
cmake -B build \
-DCMAKE_BUILD_RPATH="@loader_path" \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL_USE_BF16=ON \
-DGGML_METAL_EMBED_LIBRARY=OFF \
-DGGML_METAL_SHADER_DEBUG=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
leaks -atExit -- ./build/bin/test-thread-safety -hf ggml-org/gemma-3-270m-qat-GGUF -ngl 99 -p "$(printf 'hello %.0s' {1..128})" -n 16 -c 512 -ub 32 -np 2 -t 2 -lv 1
- name: Test
id: cmake_test
run: |
cd build
ctest -L main -E "test-llama-archs" --verbose --timeout 900
macOS-latest-x64:
runs-on: macos-15-intel
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-x64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
sysctl -a
# Metal is disabled due to intermittent failures with Github runners not having a GPU:
# https://github.com/ggml-org/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
cmake -B build \
-DCMAKE_BUILD_RPATH="@loader_path" \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL=OFF \
-DGGML_RPC=ON \
-DCMAKE_OSX_DEPLOYMENT_TARGET=13.3
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
ubuntu-cpu:
strategy:
matrix:
@@ -253,16 +181,16 @@ jobs:
strategy:
matrix:
include:
- build: 'cpu-x64 (static)'
- build: 'x64-cpu-static'
arch: 'x64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF'
- build: 'openblas-x64'
- build: 'x64-openblas'
arch: 'x64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS="$env:RUNNER_TEMP/openblas/include" -DBLAS_LIBRARIES="$env:RUNNER_TEMP/openblas/lib/openblas.lib"'
- build: 'vulkan-x64'
- build: 'x64-vulkan'
arch: 'x64'
defines: '-DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON'
- build: 'llvm-arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON'
- build: 'arm64'
arch: 'arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
@@ -281,7 +209,7 @@ jobs:
- name: Download OpenBLAS
id: get_openblas
if: ${{ matrix.build == 'openblas-x64' }}
if: ${{ matrix.build == 'x64-openblas' }}
run: |
curl.exe -o $env:RUNNER_TEMP/openblas.zip -L "https://github.com/xianyi/OpenBLAS/releases/download/v${env:OPENBLAS_VERSION}/OpenBLAS-${env:OPENBLAS_VERSION}-x64.zip"
curl.exe -o $env:RUNNER_TEMP/OpenBLAS.LICENSE.txt -L "https://github.com/xianyi/OpenBLAS/raw/v${env:OPENBLAS_VERSION}/LICENSE"
@@ -294,7 +222,7 @@ jobs:
- name: Install Vulkan SDK
id: get_vulkan
if: ${{ matrix.build == 'vulkan-x64' }}
if: ${{ matrix.build == 'x64-vulkan' }}
run: |
curl.exe -o $env:RUNNER_TEMP/VulkanSDK-Installer.exe -L "https://sdk.lunarg.com/sdk/download/${env:VULKAN_VERSION}/windows/vulkansdk-windows-X64-${env:VULKAN_VERSION}.exe"
& "$env:RUNNER_TEMP\VulkanSDK-Installer.exe" --accept-licenses --default-answer --confirm-command install
@@ -315,7 +243,7 @@ jobs:
- name: Add libopenblas.dll
id: add_libopenblas_dll
if: ${{ matrix.build == 'openblas-x64' }}
if: ${{ matrix.build == 'x64-openblas' }}
run: |
cp $env:RUNNER_TEMP/openblas/bin/libopenblas.dll ./build/bin/Release/openblas.dll
cp $env:RUNNER_TEMP/OpenBLAS.LICENSE.txt ./build/bin/Release/OpenBLAS-${env:OPENBLAS_VERSION}.txt
@@ -425,211 +353,3 @@ jobs:
set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
cmake --build build --config Release -j %NINJA_JOBS% -t ggml
cmake --build build --config Release
# TODO: simplify the following workflows using a matrix
# TODO: run lighter CI on PRs and the full CI only on master (if needed)
ggml-ci-x64-cpu-low-perf:
runs-on: ubuntu-22.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-x64-cpu-low-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
# note: moved to build-self-hosted.yml - can remove from here when everything is stable
# ggml-ci-arm64-cpu-low-perf:
# runs-on: ubuntu-22.04-arm
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ggml-ci-arm64-cpu-low-perf
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# - name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get install build-essential
#
# - name: Test
# id: ggml-ci
# run: |
# LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-x64-cpu-high-perf:
runs-on: ubuntu-22.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-x64-cpu-high-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
# note: moved to build-self-hosted.yml - can remove from here when everything is stable
# ggml-ci-arm64-cpu-high-perf:
# runs-on: ubuntu-22.04-arm
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ggml-ci-arm64-cpu-high-perf
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# - name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get install build-essential
#
# - name: Test
# id: ggml-ci
# run: |
# LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_SVE=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-arm64-cpu-high-perf-sve:
runs-on: ubuntu-22.04-arm
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-high-perf-sve
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-arm64-cpu-kleidiai:
runs-on: ubuntu-22.04-arm
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-kleidiai
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install -y build-essential
- name: Test
id: ggml-ci
run: |
GG_BUILD_KLEIDIAI=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-arm64-cpu-kleidiai-graviton4:
runs-on: ah-ubuntu_22_04-c8g_8x
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
set -euxo pipefail
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a \
apt-get install -y \
build-essential \
python3-venv \
gpg \
wget \
time \
git-lfs
git lfs install
# install the latest cmake
sudo install -d /usr/share/keyrings
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc \
| gpg --dearmor \
| sudo tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ jammy main' \
| sudo tee /etc/apt/sources.list.d/kitware.list
sudo apt-get update
sudo apt-get install -y cmake
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-kleidiai-graviton4
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Test
id: ggml-ci
run: |
GG_BUILD_KLEIDIAI=1 \
GG_BUILD_EXTRA_TESTS_0=1 \
bash ./ci/run.sh ./tmp/results ./tmp/mnt

@@ -28,9 +28,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-22-hip-quality-check:

@@ -37,8 +37,30 @@ env:
jobs:
macOS-cpu:
check_release:
runs-on: [self-hosted, fast]
outputs:
should_release: ${{ steps.check.outputs.should_release }}
steps:
- id: check
run: |
if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
echo "should_release=true" >> $GITHUB_OUTPUT
elif [[ "${{ github.event_name }}" == "push" && "${{ github.ref }}" == "refs/heads/master" ]]; then
if echo "${{ github.event.head_commit.message }}" | grep -q '\[no release\]'; then
echo "should_release=false" >> $GITHUB_OUTPUT
else
echo "should_release=true" >> $GITHUB_OUTPUT
fi
else
echo "should_release=false" >> $GITHUB_OUTPUT
fi
macos-cpu:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
strategy:
matrix:
include:
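
The `check_release` step in this hunk gates every release job on three inputs: the event name, the ref, and the head commit message. A minimal sketch of the same decision as a plain shell function is given below so it can be exercised outside of Actions. The function name `should_release` and its positional arguments are assumptions for illustration; the workflow itself interpolates the `github.*` expressions inline and writes the result to `$GITHUB_OUTPUT`.

```shell
# Hypothetical helper mirroring the check_release step (not part of the workflow).
#   $1 = event name, $2 = git ref, $3 = head commit message
should_release() {
    if [ "$1" = "workflow_dispatch" ]; then
        # manual runs always release
        echo true
    elif [ "$1" = "push" ] && [ "$2" = "refs/heads/master" ]; then
        # pushes to master release unless the commit message opts out
        if printf '%s\n' "$3" | grep -q '\[no release\]'; then
            echo false
        else
            echo true
        fi
    else
        # any other event or branch never releases
        echo false
    fi
}

# Example: a push to master whose message opts out of the release.
should_release push refs/heads/master 'ci: tweak runners [no release]'
```

In a workflow, the three values could be passed through `env:` and read as shell variables instead of being expanded directly into the script text, which keeps quotes or other shell metacharacters in a commit message from being interpreted as part of the script; that variant is a suggestion and is not what the diff does.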
@@ -46,10 +68,12 @@ jobs:
arch: 'arm64'
os: macos-14
defines: "-DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON"
- build: 'arm64-kleidiai'
arch: 'arm64'
os: macos-14
defines: "-DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON -DGGML_CPU_KLEIDIAI=ON"
# TODO: this build is disabled to save GitHub Actions resources (https://github.com/ggml-org/llama.cpp/pull/23780)
# in order to enable it again, we have to provision dedicated runners to run it
#- build: 'arm64-kleidiai'
# arch: 'arm64'
# os: macos-14
# defines: "-DGGML_METAL_USE_BF16=ON -DGGML_METAL_EMBED_LIBRARY=ON -DGGML_CPU_KLEIDIAI=ON"
- build: 'x64'
arch: 'x64'
os: macos-15-intel
@@ -76,7 +100,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-${{ matrix.arch }}
key: macos-latest-${{ matrix.arch }}
evict-old-files: 1d
- name: Build
@@ -109,7 +133,8 @@ jobs:
name: llama-bin-macos-${{ matrix.build }}.tar.gz
ubuntu-cpu:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
strategy:
matrix:
include:
@@ -186,6 +211,8 @@ jobs:
name: llama-bin-ubuntu-${{ matrix.build }}.tar.gz
ubuntu-vulkan:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
strategy:
matrix:
@@ -262,6 +289,8 @@ jobs:
name: llama-bin-ubuntu-vulkan-${{ matrix.build }}.tar.gz
android-arm64:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
runs-on: ubuntu-latest
@@ -339,6 +368,8 @@ jobs:
name: llama-bin-android-arm64.tar.gz
ubuntu-24-openvino:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
runs-on: ubuntu-24.04
@@ -385,7 +416,7 @@ jobs:
id: cache-openvino
with:
path: ./openvino_toolkit
key: openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
key: cache-gha-openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
- name: Setup OpenVINO Toolkit
if: steps.cache-openvino.outputs.cache-hit != 'true'
@@ -427,6 +458,8 @@ jobs:
name: llama-bin-ubuntu-openvino-${{ env.OPENVINO_VERSION_MAJOR }}-x64.tar.gz
windows-cpu:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
runs-on: windows-2025
@@ -487,6 +520,8 @@ jobs:
name: llama-bin-win-cpu-${{ matrix.arch }}.zip
windows:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
runs-on: windows-2025
@@ -577,12 +612,14 @@ jobs:
name: llama-bin-win-${{ matrix.backend }}-${{ matrix.arch }}.zip
windows-cuda:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
runs-on: windows-2022
strategy:
matrix:
cuda: ['12.4', '13.1']
cuda: ['12.4', '13.3']
steps:
- name: Clone
@@ -655,212 +692,218 @@ jobs:
path: cudart-llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip
name: cudart-llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip
windows-sycl:
# TODO: this build is disabled to save Github Actions resources (https://github.com/ggml-org/llama.cpp/pull/23705)
# in order to enable it again, we have to provision dedicated runners to run it
# windows-sycl:
#
# runs-on: windows-2022
#
# defaults:
# run:
# shell: bash
#
# env:
# WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
# WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
# LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
# ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
# ONEAPI_INSTALLER_VERSION: "2025.3.3"
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Use oneAPI Installation Cache
# uses: actions/cache@v5
# id: cache-sycl
# with:
# path: ${{ env.ONEAPI_ROOT }}
# key: cache-gha-oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
#
# - name: Download & Install oneAPI
# shell: bash
# if: steps.cache-sycl.outputs.cache-hit != 'true'
# run: |
# scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
#
# - name: Install Level Zero SDK
# shell: pwsh
# run: |
# Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
# Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
# "LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
#
# - name: Setup Node.js
# uses: actions/setup-node@v6
# with:
# node-version: "24"
# cache: "npm"
# cache-dependency-path: "tools/ui/package-lock.json"
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: windows-latest-sycl
# variant: ccache
# evict-old-files: 1d
#
# - name: Build
# id: cmake_build
# shell: cmd
# run: |
# call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
# cmake -G "Ninja" -B build ^
# -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx ^
# -DCMAKE_BUILD_TYPE=Release ^
# -DGGML_BACKEND_DL=ON -DBUILD_SHARED_LIBS=ON ^
# -DGGML_CPU=OFF -DGGML_SYCL=ON ^
# -DLLAMA_BUILD_BORINGSSL=ON
# cmake --build build --target ggml-sycl -j
#
# - name: Build the release package
# id: pack_artifacts
# run: |
# echo "cp oneAPI running time dll files in ${{ env.ONEAPI_ROOT }} to ./build/bin"
#
# cp "${{ env.ONEAPI_ROOT }}/mkl/latest/bin/mkl_sycl_blas.5.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/mkl/latest/bin/mkl_core.2.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/mkl/latest/bin/mkl_tbb_thread.2.dll" ./build/bin
#
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_level_zero.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_level_zero_v2.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_opencl.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_loader.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_win_proxy_loader.dll" ./build/bin
# ZE_LOADER_DLL=$(find "${{ env.ONEAPI_ROOT }}" "$LEVEL_ZERO_V1_SDK_PATH" -iname ze_loader.dll -print -quit 2>/dev/null || true)
# if [ -n "$ZE_LOADER_DLL" ]; then
# echo "Using Level Zero loader: $ZE_LOADER_DLL"
# cp "$ZE_LOADER_DLL" ./build/bin
# else
# echo "Level Zero loader DLL not found in oneAPI or SDK; relying on system driver/runtime"
# fi
#
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/sycl8.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/svml_dispmd.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libmmd.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libiomp5md.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/sycl-ls.exe" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libsycl-fallback-bfloat16.spv" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libsycl-native-bfloat16.spv" ./build/bin
#
# cp "${{ env.ONEAPI_ROOT }}/dnnl/latest/bin/dnnl.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/tbb/latest/bin/tbb12.dll" ./build/bin
#
# cp "${{ env.ONEAPI_ROOT }}/tcm/latest/bin/tcm.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/tcm/latest/bin/libhwloc-15.dll" ./build/bin
# cp "${{ env.ONEAPI_ROOT }}/umf/latest/bin/umf.dll" ./build/bin
#
# echo "cp oneAPI running time dll files to ./build/bin done"
# 7z a -snl llama-bin-win-sycl-x64.zip ./build/bin/*
#
# - name: Upload the release package
# uses: actions/upload-artifact@v6
# with:
# path: llama-bin-win-sycl-x64.zip
# name: llama-bin-win-sycl-x64.zip
runs-on: windows-2022
defaults:
run:
shell: bash
env:
WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
ONEAPI_INSTALLER_VERSION: "2025.3.3"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Use oneAPI Installation Cache
uses: actions/cache@v5
id: cache-sycl
with:
path: ${{ env.ONEAPI_ROOT }}
key: oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
- name: Download & Install oneAPI
shell: bash
if: steps.cache-sycl.outputs.cache-hit != 'true'
run: |
scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
- name: Install Level Zero SDK
shell: pwsh
run: |
Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
"LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: "24"
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-sycl
variant: ccache
evict-old-files: 1d
- name: Build
id: cmake_build
shell: cmd
run: |
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
cmake -G "Ninja" -B build ^
-DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx ^
-DCMAKE_BUILD_TYPE=Release ^
-DGGML_BACKEND_DL=ON -DBUILD_SHARED_LIBS=ON ^
-DGGML_CPU=OFF -DGGML_SYCL=ON ^
-DLLAMA_BUILD_BORINGSSL=ON
cmake --build build --target ggml-sycl -j
- name: Build the release package
id: pack_artifacts
run: |
echo "cp oneAPI running time dll files in ${{ env.ONEAPI_ROOT }} to ./build/bin"
cp "${{ env.ONEAPI_ROOT }}/mkl/latest/bin/mkl_sycl_blas.5.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/mkl/latest/bin/mkl_core.2.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/mkl/latest/bin/mkl_tbb_thread.2.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_level_zero.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_level_zero_v2.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_adapter_opencl.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_loader.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/ur_win_proxy_loader.dll" ./build/bin
ZE_LOADER_DLL=$(find "${{ env.ONEAPI_ROOT }}" "$LEVEL_ZERO_V1_SDK_PATH" -iname ze_loader.dll -print -quit 2>/dev/null || true)
if [ -n "$ZE_LOADER_DLL" ]; then
echo "Using Level Zero loader: $ZE_LOADER_DLL"
cp "$ZE_LOADER_DLL" ./build/bin
else
echo "Level Zero loader DLL not found in oneAPI or SDK; relying on system driver/runtime"
fi
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/sycl8.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/svml_dispmd.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libmmd.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libiomp5md.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/sycl-ls.exe" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libsycl-fallback-bfloat16.spv" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libsycl-native-bfloat16.spv" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/dnnl/latest/bin/dnnl.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/tbb/latest/bin/tbb12.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/tcm/latest/bin/tcm.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/tcm/latest/bin/libhwloc-15.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/umf/latest/bin/umf.dll" ./build/bin
echo "cp oneAPI running time dll files to ./build/bin done"
7z a -snl llama-bin-win-sycl-x64.zip ./build/bin/*
- name: Upload the release package
uses: actions/upload-artifact@v6
with:
path: llama-bin-win-sycl-x64.zip
name: llama-bin-win-sycl-x64.zip
ubuntu-24-sycl:
strategy:
matrix:
build: [fp32]
include:
- build: fp32
fp16: OFF
runs-on: ubuntu-24.04
env:
ONEAPI_ROOT: /opt/intel/oneapi/
ONEAPI_INSTALLER_VERSION: "2025.3.3"
LEVEL_ZERO_VERSION: "1.28.2"
LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Use oneAPI Installation Cache
uses: actions/cache@v5
id: cache-sycl
with:
path: ${{ env.ONEAPI_ROOT }}
key: oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
- name: Download & Install oneAPI
shell: bash
if: steps.cache-sycl.outputs.cache-hit != 'true'
run: |
cd /tmp
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
- name: Install Level Zero SDK
shell: bash
run: |
cd /tmp
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: "24"
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-sycl-${{ matrix.build }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
source /opt/intel/oneapi/setvars.sh
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DLLAMA_OPENSSL=OFF \
-DGGML_NATIVE=OFF \
-DGGML_SYCL_F16=${{ matrix.fp16 }}
time cmake --build build --config Release -j $(nproc)
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Pack artifacts
id: pack_artifacts
run: |
cp LICENSE ./build/bin/
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-ubuntu-sycl-${{ matrix.build }}-x64.tar.gz --transform "s,^\.,llama-${{ steps.tag.outputs.name }}," -C ./build/bin .
- name: Upload artifacts
uses: actions/upload-artifact@v6
with:
path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-sycl-${{ matrix.build }}-x64.tar.gz
name: llama-bin-ubuntu-sycl-${{ matrix.build }}-x64.tar.gz
# TODO: this build is disabled to save Github Actions resources (https://github.com/ggml-org/llama.cpp/pull/23705)
# in order to enable it again, we have to provision dedicated runners to run it
# ubuntu-24-sycl:
#
# strategy:
# matrix:
# build: [fp32]
# include:
# - build: fp32
# fp16: OFF
#
# runs-on: ubuntu-24.04
#
# env:
# ONEAPI_ROOT: /opt/intel/oneapi/
# ONEAPI_INSTALLER_VERSION: "2025.3.3"
# LEVEL_ZERO_VERSION: "1.28.2"
# LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
# with:
# fetch-depth: 0
#
# - name: Use oneAPI Installation Cache
# uses: actions/cache@v5
# id: cache-sycl
# with:
# path: ${{ env.ONEAPI_ROOT }}
# key: cache-gha-oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
#
# - name: Download & Install oneAPI
# shell: bash
# if: steps.cache-sycl.outputs.cache-hit != 'true'
# run: |
# cd /tmp
# wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
# sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
#
# - name: Install Level Zero SDK
# shell: bash
# run: |
# cd /tmp
# wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
# wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
# sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
#
# - name: Setup Node.js
# uses: actions/setup-node@v6
# with:
# node-version: "24"
# cache: "npm"
# cache-dependency-path: "tools/ui/package-lock.json"
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-24-sycl-${{ matrix.build }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# - name: Build
# id: cmake_build
# run: |
# source /opt/intel/oneapi/setvars.sh
# cmake -B build \
# -G "Ninja" \
# -DCMAKE_BUILD_TYPE=Release \
# -DGGML_SYCL=ON \
# -DCMAKE_C_COMPILER=icx \
# -DCMAKE_CXX_COMPILER=icpx \
# -DLLAMA_OPENSSL=OFF \
# -DGGML_NATIVE=OFF \
# -DGGML_SYCL_F16=${{ matrix.fp16 }}
# time cmake --build build --config Release -j $(nproc)
#
# - name: Determine tag name
# id: tag
# uses: ./.github/actions/get-tag-name
#
# - name: Pack artifacts
# id: pack_artifacts
# run: |
# cp LICENSE ./build/bin/
# tar -czvf llama-${{ steps.tag.outputs.name }}-bin-ubuntu-sycl-${{ matrix.build }}-x64.tar.gz --transform "s,^\.,llama-${{ steps.tag.outputs.name }}," -C ./build/bin .
#
# - name: Upload artifacts
# uses: actions/upload-artifact@v6
# with:
# path: llama-${{ steps.tag.outputs.name }}-bin-ubuntu-sycl-${{ matrix.build }}-x64.tar.gz
# name: llama-bin-ubuntu-sycl-${{ matrix.build }}-x64.tar.gz
ubuntu-22-rocm:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
runs-on: ubuntu-22.04
@@ -972,6 +1015,8 @@ jobs:
name: llama-bin-ubuntu-rocm-${{ env.ROCM_VERSION_SHORT }}-${{ matrix.build }}.tar.gz
windows-hip:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
runs-on: windows-2022
@@ -1008,7 +1053,7 @@ jobs:
uses: actions/cache@v5
with:
path: C:\Program Files\AMD\ROCm
key: rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
key: cache-gha-rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
@@ -1086,6 +1131,8 @@ jobs:
name: llama-bin-win-hip-${{ matrix.name }}-x64.zip
ios-xcode-build:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
runs-on: macos-15
steps:
@@ -1141,98 +1188,101 @@ jobs:
path: llama-${{ steps.tag.outputs.name }}-xcframework.zip
name: llama-${{ steps.tag.outputs.name }}-xcframework.zip
openEuler-cann:
strategy:
matrix:
include:
# 910b with aclgraph (both architectures)
- arch: x86
chip_type: '910b'
build: 'Release'
use_acl_graph: 'on'
- arch: aarch64
chip_type: '910b'
build: 'Release'
use_acl_graph: 'on'
# 310p without aclgraph (both architectures)
- arch: x86
chip_type: '310p'
build: 'Release'
use_acl_graph: 'off'
- arch: aarch64
chip_type: '310p'
build: 'Release'
use_acl_graph: 'off'
runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
steps:
- name: Checkout
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Free up disk space
uses: ggml-org/free-disk-space@v1.3.1
with:
tool-cache: true
- name: Set container image
id: cann-image
run: |
image="ascendai/cann:${{ matrix.chip_type == '910b' && '8.5.0-910b-openeuler24.03-py3.11' || '8.5.0-310p-openeuler24.03-py3.11' }}"
echo "image=${image}" >> "${GITHUB_OUTPUT}"
- name: Pull container image
run: docker pull "${{ steps.cann-image.outputs.image }}"
- name: Build
env:
BUILD_TYPE: ${{ matrix.build }}
SOC_TYPE: ascend${{ matrix.chip_type }}
USE_ACL_GRAPH: ${{ matrix.use_acl_graph }}
run: |
HOST_UID=$(id -u)
HOST_GID=$(id -g)
docker run --rm \
-v "${PWD}:/workspace" \
-w /workspace \
-e SOC_TYPE=${SOC_TYPE} \
-e BUILD_TYPE=${BUILD_TYPE} \
-e USE_ACL_GRAPH=${USE_ACL_GRAPH} \
"${{ steps.cann-image.outputs.image }}" \
bash -lc '
set -e
yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake openssl-devel
yum clean all && rm -rf /var/cache/yum
git config --global --add safe.directory "/workspace"
export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
-DGGML_CANN=on \
-DSOC_TYPE=${SOC_TYPE} \
-DUSE_ACL_GRAPH=${USE_ACL_GRAPH}
cmake --build build -j $(nproc)
chown -R '"${HOST_UID}"':'"${HOST_GID}"' /workspace/build
'
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Pack artifacts
run: |
cp LICENSE ./build/bin/
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz --transform "s,^\.,llama-${{ steps.tag.outputs.name }}," -C ./build/bin .
- name: Upload artifacts
uses: actions/upload-artifact@v6
with:
path: llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz
name: llama-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz
# TODO: this build is disabled to save Github Actions resources (https://github.com/ggml-org/llama.cpp/pull/23705)
# in order to enable it again, we have to provision dedicated runners to run it
# openEuler-cann:
# strategy:
# matrix:
# include:
# # 910b with aclgraph (both architectures)
# - arch: x86
# chip_type: '910b'
# build: 'Release'
# use_acl_graph: 'on'
# - arch: aarch64
# chip_type: '910b'
# build: 'Release'
# use_acl_graph: 'on'
# # 310p without aclgraph (both architectures)
# - arch: x86
# chip_type: '310p'
# build: 'Release'
# use_acl_graph: 'off'
# - arch: aarch64
# chip_type: '310p'
# build: 'Release'
# use_acl_graph: 'off'
# runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
# steps:
# - name: Checkout
# uses: actions/checkout@v6
# with:
# fetch-depth: 0
#
# - name: Free up disk space
# uses: ggml-org/free-disk-space@v1.3.1
# with:
# tool-cache: true
#
# - name: Set container image
# id: cann-image
# run: |
# image="ascendai/cann:${{ matrix.chip_type == '910b' && '8.5.0-910b-openeuler24.03-py3.11' || '8.5.0-310p-openeuler24.03-py3.11' }}"
# echo "image=${image}" >> "${GITHUB_OUTPUT}"
#
# - name: Pull container image
# run: docker pull "${{ steps.cann-image.outputs.image }}"
#
# - name: Build
# env:
# BUILD_TYPE: ${{ matrix.build }}
# SOC_TYPE: ascend${{ matrix.chip_type }}
# USE_ACL_GRAPH: ${{ matrix.use_acl_graph }}
# run: |
# HOST_UID=$(id -u)
# HOST_GID=$(id -g)
#
# docker run --rm \
# -v "${PWD}:/workspace" \
# -w /workspace \
# -e SOC_TYPE=${SOC_TYPE} \
# -e BUILD_TYPE=${BUILD_TYPE} \
# -e USE_ACL_GRAPH=${USE_ACL_GRAPH} \
# "${{ steps.cann-image.outputs.image }}" \
# bash -lc '
# set -e
# yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake openssl-devel
# yum clean all && rm -rf /var/cache/yum
# git config --global --add safe.directory "/workspace"
# export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
# cmake -S . -B build \
# -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
# -DGGML_CANN=on \
# -DSOC_TYPE=${SOC_TYPE} \
# -DUSE_ACL_GRAPH=${USE_ACL_GRAPH}
# cmake --build build -j $(nproc)
#
# chown -R '"${HOST_UID}"':'"${HOST_GID}"' /workspace/build
# '
#
# - name: Determine tag name
# id: tag
# uses: ./.github/actions/get-tag-name
#
# - name: Pack artifacts
# run: |
# cp LICENSE ./build/bin/
# tar -czvf llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz --transform "s,^\.,llama-${{ steps.tag.outputs.name }}," -C ./build/bin .
#
# - name: Upload artifacts
# uses: actions/upload-artifact@v6
# with:
# path: llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz
# name: llama-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}${{ matrix.use_acl_graph == 'on' && '-aclgraph' || '' }}.tar.gz
ui-build:
needs: [check_release]
if: ${{ needs.check_release.outputs.should_release == 'true' }}
uses: ./.github/workflows/ui-build.yml
release:
@@ -1249,17 +1299,17 @@ jobs:
- windows
- windows-cpu
- windows-cuda
- windows-sycl
#- windows-sycl
- windows-hip
- ubuntu-22-rocm
- ubuntu-cpu
- ubuntu-vulkan
- ubuntu-24-openvino
- ubuntu-24-sycl
#- ubuntu-24-sycl
- android-arm64
- macOS-cpu
- macos-cpu
- ios-xcode-build
- openEuler-cann
#- openEuler-cann
- ui-build
outputs:
@@ -1348,7 +1398,7 @@ jobs:
**macOS/iOS:**
- [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz)
- [macOS Apple Silicon (arm64, KleidiAI enabled)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-macos-arm64-kleidiai.tar.gz)
- macOS Apple Silicon (arm64, KleidiAI enabled) [DISABLED](https://github.com/ggml-org/llama.cpp/pull/23780)
- [macOS Intel (x64)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-macos-x64.tar.gz)
- [iOS XCFramework](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-xcframework.zip)
@@ -1360,7 +1410,7 @@ jobs:
- [Ubuntu arm64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-vulkan-arm64.tar.gz)
- [Ubuntu x64 (ROCm 7.2)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-rocm-7.2-x64.tar.gz)
- [Ubuntu x64 (OpenVINO)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-openvino-${{ needs.ubuntu-24-openvino.outputs.openvino_version }}-x64.tar.gz)
- [Ubuntu x64 (SYCL FP32)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-ubuntu-sycl-fp32-x64.tar.gz)
- Ubuntu x64 (SYCL FP32) [DISABLED](https://github.com/ggml-org/llama.cpp/pull/23705)
**Android:**
- [Android arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-android-arm64.tar.gz)
@@ -1369,16 +1419,17 @@ jobs:
- [Windows x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cpu-x64.zip)
- [Windows arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cpu-arm64.zip)
- [Windows x64 (CUDA 12)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-12.4-x64.zip) - [CUDA 12.4 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/cudart-llama-bin-win-cuda-12.4-x64.zip)
- [Windows x64 (CUDA 13)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-13.1-x64.zip) - [CUDA 13.1 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/cudart-llama-bin-win-cuda-13.1-x64.zip)
- [Windows x64 (CUDA 13)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-13.3-x64.zip) - [CUDA 13.3 DLLs](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/cudart-llama-bin-win-cuda-13.3-x64.zip)
- [Windows x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-vulkan-x64.zip)
- [Windows x64 (SYCL)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-sycl-x64.zip)
- Windows x64 (SYCL) [DISABLED](https://github.com/ggml-org/llama.cpp/pull/23705)
- [Windows x64 (HIP)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-hip-radeon-x64.zip)
**openEuler:**
- [openEuler x86 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-310p-openEuler-x86.tar.gz)
- [openEuler x86 (910b, ACL Graph)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-910b-openEuler-x86-aclgraph.tar.gz)
- [openEuler aarch64 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-310p-openEuler-aarch64.tar.gz)
- [openEuler aarch64 (910b, ACL Graph)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-910b-openEuler-aarch64-aclgraph.tar.gz)
- [DISABLED](https://github.com/ggml-org/llama.cpp/pull/23705)
- openEuler x86 (310p)
- openEuler x86 (910b, ACL Graph)
- openEuler aarch64 (310p)
- openEuler aarch64 (910b, ACL Graph)
**UI:**
- [UI](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-ui.tar.gz)

@@ -26,10 +26,10 @@ on:
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}
@@ -37,7 +37,7 @@ concurrency:
jobs:
server:
runs-on: ubuntu-latest
runs-on: [self-hosted, CPU, Linux, llama-server]
strategy:
matrix:
@@ -46,19 +46,19 @@ jobs:
fail-fast: false
steps:
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get -y install \
build-essential \
xxd \
git \
cmake \
curl \
wget \
language-pack-en \
libssl-dev
#- name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get -y install \
# build-essential \
# xxd \
# git \
# cmake \
# curl \
# wget \
# language-pack-en \
# libssl-dev
- name: Clone
id: checkout

@@ -29,10 +29,10 @@ on:
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}

@@ -44,25 +44,20 @@ on:
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
jobs:
ui-build:
name: Build Web UI
uses: ./.github/workflows/ui-build.yml
server:
ubuntu:
runs-on: ubuntu-latest
needs: ui-build
name: server (${{ matrix.wf_name }})
name: ubuntu (${{ matrix.wf_name }})
strategy:
matrix:
build_type: [Release]
@@ -98,17 +93,17 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Download built UI
uses: actions/download-artifact@v7
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
name: ui-build
path: tools/ui/dist
key: server-ubuntu-default
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
cmake -B build \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_SCHED_NO_REALLOC=ON
cmake --build build --config ${{ matrix.build_type }} -j $(nproc) --target llama-server
@@ -135,8 +130,8 @@ jobs:
export ${{ matrix.extra_args }}
SLOW_TESTS=1 pytest -v -x
server-windows:
runs-on: windows-2022
windows:
runs-on: windows-2025
steps:
- name: Clone
@@ -146,16 +141,24 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Setup Node.js
uses: actions/setup-node@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
node-version: "24"
key: server-windows-default
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
shell: cmd
run: |
cmake -B build -DLLAMA_BUILD_BORINGSSL=ON -DGGML_SCHED_NO_REALLOC=ON
cmake --build build --config Release -j ${env:NUMBER_OF_PROCESSORS} --target llama-server
cmake -B build -G "Ninja Multi-Config" ^
-DCMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake ^
-DCMAKE_BUILD_TYPE=Release ^
-DLLAMA_BUILD_BORINGSSL=ON ^
-DGGML_SCHED_NO_REALLOC=ON
set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
cmake --build build --config Release -j %NINJA_JOBS% --target llama-server
- name: Python setup
id: setup_python

@@ -30,10 +30,10 @@ on:
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}

@@ -26,10 +26,10 @@ on:
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}

@@ -63,6 +63,7 @@ After submitting your PR:
- Optionally pick a `<module>` from here: https://github.com/ggml-org/llama.cpp/wiki/Modules
- Let other maintainers merge their own PRs
- When merging a PR, make sure you have a good understanding of the changes
- If a PR does not warrant a new release, add `[no release]` in the squashed commit to spare CI resources
- Be mindful of maintenance: most of the work going into a feature happens after the PR is merged. If the PR author is not committed to contribute long-term, someone else needs to take responsibility (you)
Maintainers reserve the right to decline review or close pull requests for any reason, without any questions, particularly under any of the following conditions:

@@ -66,6 +66,8 @@ fi
if [ ! -z ${GG_BUILD_METAL} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=ON"
else
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=OFF"
fi
if [ ! -z ${GG_BUILD_CUDA} ]; then
@@ -114,10 +116,7 @@ fi
if [ ! -z ${GG_BUILD_VULKAN} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_VULKAN=1"
# if on Mac, disable METAL
if [[ "$OSTYPE" == "darwin"* ]]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=OFF -DGGML_BLAS=OFF"
MACOS_RUNNER_CUSTOM_VULKAN_CMAKE_LOCATION="/usr/local/lib/cmake/vulkan"
MACOS_RUNNER_CUSTOM_SPIRV_HEADERS_LOCATION="${MACOS_RUNNER_CUSTOM_VULKAN_CMAKE_LOCATION}/SPIRV-Headers/SPIRV-HeadersConfig.cmake"
if [[ -f "${MACOS_RUNNER_CUSTOM_SPIRV_HEADERS_LOCATION}" || -h "${MACOS_RUNNER_CUSTOM_SPIRV_HEADERS_LOCATION}" ]]; then
@@ -133,7 +132,7 @@ if [ ! -z ${GG_BUILD_VULKAN} ]; then
fi
if [ ! -z ${GG_BUILD_WEBGPU} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_WEBGPU=1 -DGGML_METAL=OFF -DGGML_BLAS=OFF"
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_WEBGPU=1"
if [ ! -z "${GG_BUILD_WEBGPU_DAWN_PREFIX}" ]; then
if [ -z "${CMAKE_PREFIX_PATH}" ]; then
@@ -167,6 +166,8 @@ fi
if [ ! -z ${GG_BUILD_BLAS} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=${GG_BUILD_BLAS_VENDOR:-OpenBLAS}"
else
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_BLAS=OFF"
fi
if [ ! -z ${GG_BUILD_OPENVINO} ]; then
@@ -700,8 +701,8 @@ function gg_sum_test_backend_ops_cpu {
## main
export LLAMA_LOG_PREFIX=1
export LLAMA_LOG_TIMESTAMPS=1
export LLAMA_ARG_LOG_PREFIX=1
export LLAMA_ARG_LOG_TIMESTAMPS=1
if [ -z ${GG_BUILD_LOW_PERF} ]; then
# Create symlink: ./llama.cpp/models-mnt -> $MNT/models

@@ -3026,7 +3026,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.default_template_kwargs[item.key()] = item.value().dump();
}
}
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_CHAT_TEMPLATE_KWARGS"));
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_CHAT_TEMPLATE_KWARGS"));
add_opt(common_arg(
{"-to", "--timeout"}, "N",
string_format("server read/write timeout in seconds (default: %d)", params.timeout_read),
@@ -3327,7 +3327,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params &, const std::string & value) {
common_log_set_file(common_log_main(), value.c_str());
}
).set_env("LLAMA_LOG_FILE"));
).set_env("LLAMA_ARG_LOG_FILE"));
add_opt(common_arg(
{"--log-colors"}, "[on|off|auto]",
"Set colored logging ('on', 'off', or 'auto', default: 'auto')\n"
@@ -3344,7 +3344,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
string_format("error: unknown value for --log-colors: '%s'\n", value.c_str()));
}
}
).set_env("LLAMA_LOG_COLORS"));
).set_env("LLAMA_ARG_LOG_COLORS"));
add_opt(common_arg(
{"-v", "--verbose", "--log-verbose"},
"Set verbosity level to infinity (i.e. log all messages, useful for debugging)",
@@ -3359,7 +3359,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params) {
params.offline = true;
}
).set_env("LLAMA_OFFLINE"));
).set_env("LLAMA_ARG_OFFLINE"));
add_opt(common_arg(
{"-lv", "--verbosity", "--log-verbosity"}, "N",
string_format("Set the verbosity threshold. Messages with a higher verbosity will be ignored. Values:\n"
@@ -3374,7 +3374,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.verbosity = value;
common_log_set_verbosity_thold(value);
}
).set_env("LLAMA_LOG_VERBOSITY"));
).set_env("LLAMA_ARG_LOG_VERBOSITY"));
add_opt(common_arg(
{"--log-prefix"},
{"--no-log-prefix"},

@@ -74,6 +74,7 @@ TEXT_MODEL_MAP: dict[str, str] = {
"Gemma3nForCausalLM": "gemma",
"Gemma3nForConditionalGeneration": "gemma",
"Gemma4ForConditionalGeneration": "gemma",
"Gemma4ForCausalLM": "gemma",
"GemmaForCausalLM": "gemma",
"Glm4ForCausalLM": "glm",
"Glm4MoeForCausalLM": "glm",
@@ -215,6 +216,7 @@ TEXT_MODEL_MAP: dict[str, str] = {
"T5EncoderModel": "t5",
"T5ForConditionalGeneration": "t5",
"T5WithLMHeadModel": "t5",
"TalkieForCausalLM": "talkie",
"UMT5ForConditionalGeneration": "t5",
"UMT5Model": "t5",
"UltravoxModel": "ultravox",

@@ -1622,6 +1622,12 @@ class TextModel(ModelBase):
if chkhsh == "62f6fb0a6fd5098caeabb19b07a5c1099cafc8b9c40eab6ea89ece4ec02fbc57":
# ref: https://huggingface.co/sarvamai/sarvam-30b
res = "sarvam-moe"
if chkhsh == "f728162c1315c26e40249849799b4ba3fe584c32084b4795b03eb295e63cb5af":
# ref: https://huggingface.co/lewtun/talkie-1930-13b-it-hf
res = "talkie"
if chkhsh == "36f3066e97b7f3994b379aaacde306c1444c6ae84e81a5ae3cd2b7ed3b8c42d4":
# ref: https://huggingface.co/openbmb/MiniCPM5-1B
res = "minicpm5"
if res is None:
logger.warning("\n")

@@ -614,7 +614,7 @@ class Gemma3NModel(Gemma3Model):
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("Gemma4ForConditionalGeneration")
@ModelBase.register("Gemma4ForConditionalGeneration", "Gemma4ForCausalLM")
class Gemma4Model(Gemma3Model):
model_arch = gguf.MODEL_ARCH.GEMMA4

conversion/talkie.py (new file, 53 lines)

@@ -0,0 +1,53 @@
from __future__ import annotations
from typing import Iterable, TYPE_CHECKING
import torch
if TYPE_CHECKING:
from torch import Tensor
from .base import LazyTorchTensor, ModelBase, TextModel, gguf
@ModelBase.register("TalkieForCausalLM")
class TalkieModel(TextModel):
model_arch = gguf.MODEL_ARCH.TALKIE
def set_gguf_parameters(self):
super().set_gguf_parameters()
# Talkie used F.rms_norm without an explicit eps
self.gguf_writer.add_layer_norm_rms_eps(torch.finfo(torch.float32).eps)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
prefix = f"model.blocks.{bid}." if bid is not None else ""
suffix = name.removeprefix(prefix)
if suffix == "attn_gain.a_g":
yield self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_OUT, bid, ".scale"), data_torch
return
elif suffix == "mlp_gain.a_g":
yield self.format_tensor_name(gguf.MODEL_TENSOR.FFN_DOWN, bid, ".scale"), data_torch
return
elif suffix == "lm_head_gain.w_g":
self.gguf_writer.add_logit_scale(LazyTorchTensor.to_eager(data_torch).item())
return
elif suffix in ("attn.attn_query.weight", "attn.attn_key.weight"):
# absorb inverse rope
head_dim = self.hparams["head_dim"]
shape = data_torch.shape
data_torch = torch.reshape(data_torch, (-1, head_dim, shape[-1]))
signs = torch.ones((1, head_dim, 1), dtype=data_torch.dtype)
signs[:, head_dim // 2 :, :] = -1
if self.lazy:
signs = LazyTorchTensor.from_eager(signs)
# (n_head, head_dim, n_in) -> (n_out, n_in)
data_torch = torch.reshape(data_torch * signs, shape)
elif suffix == "attn.head_gain.head_g":
# allow head gain to broadcast
data_torch = data_torch.unsqueeze(-1)
if not name.endswith(".weight"):
name += ".weight"
yield from super().modify_tensors(data_torch, name, bid)
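The query/key branch in `modify_tensors` above reshapes the weight to `(n_head, head_dim, n_in)`, negates the second half of every head's rows, and reshapes back. A minimal pure-Python sketch of that sign pattern follows; `absorb_rope_signs` and the toy shapes are hypothetical names chosen for illustration, and nested lists stand in for the torch tensors used by the converter.

```python
def absorb_rope_signs(weight, head_dim):
    """Negate rows in the second half of each head.

    `weight` is a nested list of shape (n_head * head_dim, n_in). Row r belongs
    to head r // head_dim and has head-local index r % head_dim, which mirrors
    the reshape to (n_head, head_dim, n_in) in the diff.
    """
    half = head_dim // 2
    return [
        [-v for v in row] if (r % head_dim) >= half else list(row)
        for r, row in enumerate(weight)
    ]


# Toy weight: 2 heads, head_dim = 4, n_in = 3 (illustrative sizes only).
toy_weight = [[float(r * 3 + c + 1) for c in range(3)] for r in range(8)]
flipped = absorb_rope_signs(toy_weight, head_dim=4)

# Rows 2, 3 (head 0) and 6, 7 (head 1) change sign; the others are untouched.
negated_rows = [r for r in range(8) if flipped[r] != toy_weight[r]]
```

Because the sign vector only contains +1 and -1, applying the function twice returns the original weight, which makes it easy to sanity-check a conversion.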

@@ -156,6 +156,8 @@ models = [
{"name": "kanana2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/kakaocorp/kanana-2-30b-a3b-instruct-2601", },
{"name": "f2llmv2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/codefuse-ai/F2LLM-v2-4B", },
{"name": "sarvam-moe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/sarvamai/sarvam-30b", },
{"name": "talkie", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/lewtun/talkie-1930-13b-it-hf", },
{"name": "minicpm5", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/openbmb/MiniCPM5-1B"},
]
# some models are known to be broken upstream, so we will skip them as exceptions

@@ -208,6 +208,16 @@ class LoraTorchTensor:
def to(self, *args, **kwargs):
return LoraTorchTensor(self._lora_A.to(*args, **kwargs), self._lora_B.to(*args, **kwargs))
def __mul__(self, other) -> LoraTorchTensor:
# Only output-side multiplication for now
# W = B @ A, so M_out * W == (M_out * B) @ A
if not isinstance(other, (int, float)) and other.shape and other.shape[-1] != 1:
raise NotImplementedError
return LoraTorchTensor(self._lora_A, self._lora_B * other)
def __rmul__(self, other) -> LoraTorchTensor:
return self * other
@classmethod
def __torch_function__(cls, func: Callable, types, args=(), kwargs=None):
del types # unused
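The comment in `__mul__` above relies on the identity `M_out * (B @ A) == (M_out * B) @ A` when `M_out` carries one factor per output row. The sketch below checks that identity on small nested lists; `matmul`, `scale_rows`, and the toy matrices are hypothetical helpers for illustration, not part of the converter.

```python
def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def scale_rows(factors, w):
    """Multiply row i of w by factors[i], i.e. broadcast an (n_out, 1) factor."""
    return [[f * v for v in row] for f, row in zip(factors, w)]


lora_b = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]           # (n_out=3, rank=2)
lora_a = [[2.0, 0.0, 1.0, -1.0], [1.0, 4.0, 0.0, 2.0]]   # (rank=2, n_in=4)
m_out = [2.0, -1.0, 0.5]                                  # one factor per output row

scaled_full = scale_rows(m_out, matmul(lora_b, lora_a))      # M_out * (B @ A)
scaled_factored = matmul(scale_rows(m_out, lora_b), lora_a)  # (M_out * B) @ A

lora_scale_matches = all(
    abs(p - q) < 1e-12
    for row_p, row_q in zip(scaled_full, scaled_factored)
    for p, q in zip(row_p, row_q)
)
```

A factor that varies along the input dimension would have to be folded into `A` rather than `B`, which is presumably why the diff raises `NotImplementedError` when `other.shape[-1] != 1`.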

@@ -459,7 +459,7 @@ Each returned parser is wrapped by `wrap_for_generation_prompt()`, which prepend
- Usage: `./bin/llama-template-analysis path/to/template.jinja`
**Debug Logging**: Enable with `LLAMA_LOG_VERBOSITY=2`
**Debug Logging**: Enable with `LLAMA_ARG_LOG_VERBOSITY=2`
- Shows detailed analysis steps, pattern extraction results, and generated parser structure

@@ -743,6 +743,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because SYCL Graph is still on development, no better performance. |
| GGML_SYCL_ENABLE_LEVEL_ZERO | 1 (default) or 0 | Use Level Zero API for device memory allocation instead of SYCL. Reduces system RAM usage on Intel dGPUs by avoiding DMA-buf/TTM host memory staging. Requires GGML_SYCL_SUPPORT_LEVEL_ZERO=ON at build time. |
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
| GGML_SYCL_ENABLE_VMM | 0 or 1 (default) | Enable the virtual-memory device pool. |
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Allow SYCL/Unified Runtime Level Zero device allocations larger than 4 GiB. llama.cpp's direct Level Zero allocation path requests the relaxed maximum-size limit itself when GGML_SYCL_ENABLE_LEVEL_ZERO=1. |
@@ -753,6 +754,7 @@ Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spo
| Name | Function |
|-----------------|----------------------------------------------------------------------------------|
| DEBUG_SYCL_POOL | Enable device memory pool logging on teardown. Useful for profiling allocations. |
| DEBUG_SYCL_MALLOC | Enable verbose per-call logging of device pool alloc/free operations. |
## Design Rule

@@ -176,7 +176,7 @@ Note that currently you cannot quantize the visual encoder because granite visio
### 5. Running the Model in Llama cpp
Build llama cpp normally; you should have a target binary named `llama-mtmd-cli`, which you can pass two binaries to. As an example, we pass the the llama.cpp banner.
Build llama cpp normally; you should have a target binary named `llama-mtmd-cli`, which you can pass two binaries to. As an example, we pass the llama.cpp banner.
```bash
$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \

@@ -335,7 +335,7 @@ $ make perplexity-run-full QUANTIZED_MODEL=~/path/to/quantized/model-Qxx.gguf LO
## HuggingFace utilities
The following targets are useful for creating collections and model repositories
on Hugging Face in the the ggml-org. These can be used when preparing a release
on Hugging Face in the ggml-org. These can be used when preparing a release
to script the process for new model releases.
For the following targets a `HF_TOKEN` environment variable is required.

@@ -110,11 +110,14 @@
# define GGML_CUDA_USE_CUB
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && CUDART_VERSION >= 11070
// PDL host-side support (cudaLaunchKernelEx) requires CUDART >= 11.8 and excludes HIP/MUSA.
// PDL host-side support (cudaLaunchKernelEx) requires CUDART >= 11.8.
// However, this has been bugged in CTK < 12.3 for MSVC builds, see
// https://github.com/ggml-org/llama.cpp/pull/22522#discussion_r3302393293
// __CUDA_ARCH__ is undefined in host passes; GPU arch check happens in device-side code.
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && CUDART_VERSION >= 11080
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && \
(CUDART_VERSION >= 12030 || (!(defined(_MSC_VER) && !defined(__clang__)) && CUDART_VERSION >= 11080))
# define GGML_CUDA_USE_PDL
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && CUDART_VERSION >= 11080
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && (CUDART_VERSION >= 12030 || (!(defined(_MSC_VER) && !defined(__clang__)) && CUDART_VERSION >= 11080))
static __device__ __forceinline__ void ggml_cuda_pdl_sync() {
#if defined(GGML_CUDA_USE_PDL) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= GGML_CUDA_CC_HOPPER
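The new `GGML_CUDA_USE_PDL` guard above combines a backend exclusion, a CUDA runtime version check, and an MSVC carve-out. As a reading aid, the same boolean logic can be sketched as a plain function; `pdl_enabled` and its keyword arguments are hypothetical stand-ins for the preprocessor symbols (`_MSC_VER`, `__clang__`, `GGML_USE_HIP`, `GGML_USE_MUSA`), not an API of the project.

```python
def pdl_enabled(cudart_version, msvc=False, clang=False, hip=False, musa=False):
    """Mirror of the preprocessor condition that defines GGML_CUDA_USE_PDL."""
    if hip or musa:
        return False
    # "MSVC host compiler" means _MSC_VER is defined and __clang__ is not.
    msvc_host = msvc and not clang
    return cudart_version >= 12030 or (not msvc_host and cudart_version >= 11080)


pdl_cases = {
    "cuda 11.8, non-MSVC": pdl_enabled(11080),
    "cuda 11.8, MSVC": pdl_enabled(11080, msvc=True),
    "cuda 12.3, MSVC": pdl_enabled(12030, msvc=True),
    "cuda 12.2, clang-cl": pdl_enabled(12020, msvc=True, clang=True),
    "cuda 11.7, non-MSVC": pdl_enabled(11070),
    "HIP build": pdl_enabled(12030, hip=True),
}
```

Under this reading, only pure MSVC builds are held back until CUDA 12.3; every other supported host compiler keeps the earlier 11.8 threshold.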

@@ -19,6 +19,7 @@ __global__ void fwht_cuda(const float * src, float * dst, const int64_t n_rows,
float reg[el_w];
const int lane = threadIdx.x;
ggml_cuda_pdl_sync();
#pragma unroll
for (int i = 0; i < el_w; ++i) {
reg[i] = src[i * warp_size + lane] * scale;
@@ -57,10 +58,11 @@ __global__ void fwht_cuda(const float * src, float * dst, const int64_t n_rows,
}
}
void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst) {
bool ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst) {
GGML_ASSERT(ggml_are_same_shape(src, dst));
GGML_ASSERT(ggml_is_contiguous(src));
GGML_ASSERT(ggml_is_contiguous(dst));
if (!ggml_is_contiguous(src) || !ggml_is_contiguous(dst)) {
return false;
}
const int n = src->ne[0];
const int64_t rows = ggml_nrows(src);
@@ -68,7 +70,6 @@ void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src,
float * dst_d = (float *) dst->data;
const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
GGML_ASSERT(n % warp_size == 0);
const int rows_per_block = 4;
const int64_t num_blocks = (rows + rows_per_block - 1) / rows_per_block;
@@ -83,26 +84,18 @@ void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src,
switch (n) {
case 64:
{
ggml_cuda_kernel_launch(fwht_cuda<64>, launch_params, src_d, dst_d, rows, scale);
break;
}
ggml_cuda_kernel_launch(fwht_cuda<64>, launch_params, src_d, dst_d, rows, scale);
return true;
case 128:
{
ggml_cuda_kernel_launch(fwht_cuda<128>, launch_params, src_d, dst_d, rows, scale);
break;
}
ggml_cuda_kernel_launch(fwht_cuda<128>, launch_params, src_d, dst_d, rows, scale);
return true;
case 256:
{
ggml_cuda_kernel_launch(fwht_cuda<256>, launch_params, src_d, dst_d, rows, scale);
break;
}
ggml_cuda_kernel_launch(fwht_cuda<256>, launch_params, src_d, dst_d, rows, scale);
return true;
case 512:
{
ggml_cuda_kernel_launch(fwht_cuda<512>, launch_params, src_d, dst_d, rows, scale);
break;
}
ggml_cuda_kernel_launch(fwht_cuda<512>, launch_params, src_d, dst_d, rows, scale);
return true;
default:
GGML_ABORT("fatal error");
return false;
}
}

@@ -1,3 +1,4 @@
#include "common.cuh"
void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst);
// Returns whether the Fast Walsh-Hadamard transform could be used.
bool ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst);

@@ -2596,9 +2596,7 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
bool use_batched_cublas_f32 = src0->type == GGML_TYPE_F32;
const int32_t hint = ggml_get_op_params_i32(dst, 1);
if (hint == GGML_HINT_SRC0_IS_HADAMARD) {
GGML_ASSERT(!split);
ggml_cuda_op_fwht(ctx, src1, dst);
if (hint == GGML_HINT_SRC0_IS_HADAMARD && !split && ggml_cuda_op_fwht(ctx, src1, dst)) {
return;
}
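The `GGML_HINT_SRC0_IS_HADAMARD` branch above swaps a matrix multiplication for a fast Walsh-Hadamard transform when `ggml_cuda_op_fwht` reports that it could run, and otherwise falls through to the regular path. The sketch below shows why the two are interchangeable, under the assumption of a Sylvester-ordered Hadamard matrix and no scaling (the CUDA kernel additionally applies a `scale` factor); `fwht` and `hadamard` are illustrative helpers, not ggml functions.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; len(values) must be a power of two."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def hadamard(n):
    """Sylvester construction: H_2n = [[H, H], [H, -H]]."""
    rows = [[1]]
    while len(rows) < n:
        rows = [r + r for r in rows] + [r + [-v for v in r] for r in rows]
    return rows


x = [3, 1, 4, 1, 5, 9, 2, 6]
via_fwht = fwht(x)
via_matmul = [sum(h * v for h, v in zip(row, x)) for row in hadamard(len(x))]
```

The butterfly version needs on the order of `n log n` additions instead of the `n * n` multiply-adds of the dense product, which is the motivation for taking the fast path whenever its preconditions hold.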

@@ -68,6 +68,7 @@ static u32vec opt_pmu_evt { 0x3, 0x111, 0x100, 0x105, 0x240, 0x256, 0x7D, 0x8C }
static int opt_opstage = HTP_OPSTAGE_QUEUE | HTP_OPSTAGE_COMPUTE;
static int opt_opbatch = 1024; // max number of ops in a batch
static int opt_opqueue = 16; // max number of pending batches
static int opt_oppoll = 0; // polling for batch completions
static std::regex* opt_opfilter = NULL; // regex of ops to not claim
@@ -550,7 +551,7 @@ static void repack_q4_0_q4x4x2(ggml_tensor * t, const void * data, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to read more data than is available in the source buffer 'data'
// or write more than the tensor can hold.
@@ -611,7 +612,7 @@ static void repack_q4x4x2_q4_0(void * data, const ggml_tensor * t, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to copy more data than the tensor actually contains.
const size_t total_tensor_size = (size_t)nrows * row_size;
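The two hunks above change the scratch size from `row_size * 2` to `row_size_pd`. The arithmetic below is a rough sketch of how the two differ for Q4_0 rows, assuming an 18-byte block of 32 elements and a 256-element repack tile (inferred from the `qblk_size = qk / 2` "128 bytes" comment elsewhere in this diff); the helper names are hypothetical and this is not a claim about which shapes triggered the reported overflow.

```python
Q4_0_BLOCK_ELEMS = 32   # elements per Q4_0 block
Q4_0_BLOCK_BYTES = 18   # fp16 scale + 16 bytes of packed 4-bit quants
TILE_ELEMS = 256        # assumed value of QK_Q4_0x4x2


def round_up(n, multiple):
    return (n + multiple - 1) // multiple * multiple


def row_size(ne0):
    """Bytes in one quantized row of ne0 elements."""
    return ne0 // Q4_0_BLOCK_ELEMS * Q4_0_BLOCK_BYTES


def scratch_sizes(ne0):
    """Old and new scratch sizes for a row of ne0 elements."""
    return {
        "old": row_size(ne0) * 2,                     # row_size * 2
        "new": row_size(round_up(ne0, TILE_ELEMS)),   # row_size_pd
    }


short_row = scratch_sizes(32)
boundary_row = scratch_sizes(128)
long_row = scratch_sizes(4096)
```

Under these assumptions the old formula is smaller than one padded tile only for rows shorter than 128 elements, and it over-allocates by 2x for tile-aligned rows.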
@@ -660,6 +661,239 @@ static void repack_q4x4x2_q4_0(void * data, const ggml_tensor * t, size_t size)
ggml_aligned_free(buf_rp, row_size_rp);
}
static void unpack_q4_1_quants(uint8_t * qs, const block_q4_1 * x, unsigned int bi) {
static const int qk = QK4_1;
for (unsigned int i = 0; i < qk / 2; ++i) {
const int x0 = (x->qs[i] & 0x0F);
const int x1 = (x->qs[i] >> 4);
qs[bi * qk + i + 0] = x0;
qs[bi * qk + i + qk / 2] = x1;
}
}
static void pack_q4_1_quants(block_q4_1 * x, const uint8_t * qs, unsigned int bi) {
static const int qk = QK4_1;
for (unsigned int i = 0; i < qk / 2; ++i) {
const uint8_t x0 = qs[bi * qk + i + 0];
const uint8_t x1 = qs[bi * qk + i + qk / 2];
x->qs[i] = x0 | (x1 << 4);
}
}
static void repack_row_q4_1x4x2(uint8_t * y, const block_q4_1 * x, int64_t k) {
static const int qk = QK_Q4_0x4x2;
const int nb = (k + qk - 1) / qk; // number of blocks (padded)
const int nloe = k % qk; // leftovers
const int dblk_size = 8 * 4; // 8x (d, m) __fp16 = 32 bytes
const int qblk_size = qk / 2; // int4 = 128 bytes
const int qrow_size = k / 2; // int4 (not padded to blocks)
uint8_t * y_q = y + 0; // quants first
uint8_t * y_d = y + qrow_size; // then scales/offsets
// Repack the quants
for (int i = 0; i < nb; i++) {
uint8_t qs[QK_Q4_0x4x2]; // unpacked quants
unpack_q4_1_quants(qs, &x[i * 8 + 0], 0);
unpack_q4_1_quants(qs, &x[i * 8 + 1], 1);
unpack_q4_1_quants(qs, &x[i * 8 + 2], 2);
unpack_q4_1_quants(qs, &x[i * 8 + 3], 3);
unpack_q4_1_quants(qs, &x[i * 8 + 4], 4);
unpack_q4_1_quants(qs, &x[i * 8 + 5], 5);
unpack_q4_1_quants(qs, &x[i * 8 + 6], 6);
unpack_q4_1_quants(qs, &x[i * 8 + 7], 7);
bool partial = (nloe && i == nb-1);
uint8_t * q = y_q + (i * qblk_size);
for (int j = 0; j < qk / 2; j++) {
q[j] = partial ? (qs[j*2+1] << 4) | qs[j*2+0] : (qs[j+128] << 4) | qs[j+000];
}
}
// Repack the scales and offsets
for (int i = 0; i < nb; i++) {
ggml_half * d_m = (ggml_half *) (y_d + i * dblk_size);
for (int j = 0; j < 8; j++) {
d_m[j * 2 + 0] = x[i * 8 + j].d;
d_m[j * 2 + 1] = x[i * 8 + j].m;
}
}
}
static void unpack_row_q4_1x4x2(block_q4_1 * x, const uint8_t * y, int64_t k) {
static const int qk = QK_Q4_0x4x2;
const int nb = (k + qk - 1) / qk; // number of blocks (padded)
const int nloe = k % qk; // leftovers
const int dblk_size = 8 * 4; // 8x (d, m) __fp16 = 32 bytes
const int qblk_size = qk / 2; // int4 = 128 bytes
const int qrow_size = k / 2; // int4 (not padded to blocks)
const uint8_t * y_q = y + 0; // quants first
const uint8_t * y_d = y + qrow_size; // then scales/offsets
// Unpack the quants
for (int i = 0; i < nb; i++) {
uint8_t qs[QK_Q4_0x4x2];
bool partial = (nloe && i == nb-1);
const uint8_t * q = y_q + (i * qblk_size);
for (int j = 0; j < qk / 2; j++) {
if (partial) {
qs[j*2+0] = q[j] & 0x0F;
qs[j*2+1] = q[j] >> 4;
} else {
qs[j+000] = q[j] & 0x0F;
qs[j+128] = q[j] >> 4;
}
}
pack_q4_1_quants(&x[i * 8 + 0], qs, 0);
pack_q4_1_quants(&x[i * 8 + 1], qs, 1);
pack_q4_1_quants(&x[i * 8 + 2], qs, 2);
pack_q4_1_quants(&x[i * 8 + 3], qs, 3);
pack_q4_1_quants(&x[i * 8 + 4], qs, 4);
pack_q4_1_quants(&x[i * 8 + 5], qs, 5);
pack_q4_1_quants(&x[i * 8 + 6], qs, 6);
pack_q4_1_quants(&x[i * 8 + 7], qs, 7);
}
// Unpack the scales and offsets
for (int i = 0; i < nb; i++) {
const ggml_half * d_m = (const ggml_half *) (y_d + i * dblk_size);
for (int j = 0; j < 8; j++) {
x[i * 8 + j].d = d_m[j * 2 + 0];
x[i * 8 + j].m = d_m[j * 2 + 1];
}
}
}
static void init_row_q4_1x4x2(block_q4_1 * x, int64_t k) {
static const int qk = QK_Q4_0x4x2;
const int nb = (k + qk - 1) / qk; // number of blocks (padded)
uint8_t qs[QK_Q4_0x4x2]; // unpacked quants
memset(qs, 0, sizeof(qs));
for (int i = 0; i < nb; i++) {
pack_q4_1_quants(&x[i * 8 + 0], qs, 0);
pack_q4_1_quants(&x[i * 8 + 1], qs, 1);
pack_q4_1_quants(&x[i * 8 + 2], qs, 2);
pack_q4_1_quants(&x[i * 8 + 3], qs, 3);
pack_q4_1_quants(&x[i * 8 + 4], qs, 4);
pack_q4_1_quants(&x[i * 8 + 5], qs, 5);
pack_q4_1_quants(&x[i * 8 + 6], qs, 6);
pack_q4_1_quants(&x[i * 8 + 7], qs, 7);
}
for (int i = 0; i < nb; i++) {
for (int j = 0; j < 8; j++) {
x[i * 8 + j].d = 0;
x[i * 8 + j].m = 0;
}
}
}
static void repack_q4_1_q4x4x2(ggml_tensor * t, const void * data, size_t size) {
int64_t nrows = ggml_nrows(t);
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2));
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
const size_t total_tensor_size = (size_t)nrows * row_size;
const size_t n_bytes_to_copy = size < total_tensor_size ? size : total_tensor_size;
const int64_t n_full_rows = n_bytes_to_copy / row_size;
const size_t n_rem_bytes = n_bytes_to_copy % row_size;
void * buf_pd = ggml_aligned_malloc(row_size_pd);
GGML_ASSERT(buf_pd != NULL);
void * buf_rp = ggml_aligned_malloc(row_size_rp);
GGML_ASSERT(buf_rp != NULL);
HEX_VERBOSE("ggml-hex: repack-q4_1-q4x4x2 %s : data %p size %zu dims %ldx%ld row-size %zu\n", t->name, data, size,
t->ne[0], nrows, row_size);
init_row_q4_1x4x2((block_q4_1 *) buf_pd, t->ne[0]);
for (int64_t i = 0; i < n_full_rows; i++) {
const uint8_t * src = (const uint8_t *) data + (i * row_size);
uint8_t * dst = (uint8_t *) t->data + (i * row_size);
memcpy(buf_pd, src, row_size);
repack_row_q4_1x4x2((uint8_t *) buf_rp, (const block_q4_1 *) buf_pd, t->ne[0]);
memcpy(dst, buf_rp, row_size);
}
if (n_rem_bytes > 0) {
const int64_t i = n_full_rows;
const uint8_t * src = (const uint8_t *) data + (i * row_size);
uint8_t * dst = (uint8_t *) t->data + (i * row_size);
init_row_q4_1x4x2((block_q4_1 *) buf_pd, t->ne[0]);
memcpy(buf_pd, src, n_rem_bytes);
repack_row_q4_1x4x2((uint8_t *) buf_rp, (const block_q4_1 *) buf_pd, t->ne[0]);
memcpy(dst, buf_rp, n_rem_bytes);
}
ggml_aligned_free(buf_pd, row_size_pd);
ggml_aligned_free(buf_rp, row_size_rp);
}
static void repack_q4x4x2_q4_1(void * data, const ggml_tensor * t, size_t size) {
int64_t nrows = ggml_nrows(t);
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2));
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
const size_t total_tensor_size = (size_t)nrows * row_size;
const size_t n_bytes_to_copy = size < total_tensor_size ? size : total_tensor_size;
const int64_t n_full_rows = n_bytes_to_copy / row_size;
const size_t n_rem_bytes = n_bytes_to_copy % row_size;
void * buf_pd = ggml_aligned_malloc(row_size_pd);
GGML_ASSERT(buf_pd != NULL);
void * buf_rp = ggml_aligned_malloc(row_size_rp);
GGML_ASSERT(buf_rp != NULL);
HEX_VERBOSE("ggml-hex: repack-q4x4x2-q4_1 %s : data %p size %zu dims %ldx%ld row-size %zu\n", t->name, data, size,
t->ne[0], nrows, row_size);
memset(buf_rp, 0, row_size_rp); // clear-out padded buffer to make sure the tail is all zeros
for (int64_t i = 0; i < n_full_rows; i++) {
const uint8_t * src = (const uint8_t *) t->data + (i * row_size);
uint8_t * dst = (uint8_t *) data + (i * row_size);
memcpy(buf_rp, src, row_size);
unpack_row_q4_1x4x2((block_q4_1 *) buf_pd, (const uint8_t *) buf_rp, t->ne[0]);
memcpy(dst, buf_pd, row_size);
}
if (n_rem_bytes > 0) {
const int64_t i = n_full_rows;
const uint8_t * src = (const uint8_t *) t->data + (i * row_size);
uint8_t * dst = (uint8_t *) data + (i * row_size);
// We still need to read and unpack the entire source row because quantization is block-based.
memcpy(buf_rp, src, row_size);
unpack_row_q4_1x4x2((block_q4_1 *) buf_pd, (const uint8_t *) buf_rp, t->ne[0]);
memcpy(dst, buf_pd, n_rem_bytes);
}
ggml_aligned_free(buf_pd, row_size_pd);
ggml_aligned_free(buf_rp, row_size_rp);
}
// ======== Q8x4x2 ====================
static void dump_block_q8_0(const block_q8_0 * b, int i) {
HEX_VERBOSE("ggml-hex: repack q8_0 %d: %d %d %d %d ... %d %d %d %d : %.6f\n", i, b->qs[0], b->qs[1], b->qs[2],
@@ -876,7 +1110,7 @@ static void repack_q8_0_q8x4x2(ggml_tensor * t, const void * data, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q8_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size quants + scales)
// Ensure we don't try to read more data than is available in the source buffer 'data'
// or write more than the tensor can hold.
@@ -937,7 +1171,7 @@ static void repack_q8x4x2_q8_0(void * data, const ggml_tensor * t, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q8_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size quants + scales)
// Ensure we don't try to copy more data than the tensor actually contains.
const size_t total_tensor_size = (size_t)nrows * row_size;
@@ -1238,7 +1472,7 @@ static void repack_mxfp4_mxfp4x4x2(ggml_tensor * t, const void * data, size_t si
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_MXFP4x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to read more data than is available in the source buffer 'data'
// or write more than the tensor can hold.
@@ -1299,7 +1533,7 @@ static void repack_mxfp4x4x2_mxfp4(void * data, const ggml_tensor * t, size_t si
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_MXFP4x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to copy more data than the tensor actually contains.
const size_t total_tensor_size = (size_t)nrows * row_size;
@@ -1365,6 +1599,12 @@ static void ggml_backend_hexagon_buffer_set_tensor(ggml_backend_buffer_t buffer,
repack_q4_0_q4x4x2(tensor, data, size);
break;
case GGML_TYPE_Q4_1:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
repack_q4_1_q4x4x2(tensor, data, size);
break;
case GGML_TYPE_Q8_0:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
@@ -1407,6 +1647,12 @@ static void ggml_backend_hexagon_buffer_get_tensor(ggml_backend_buffer_t buffer,
repack_q4x4x2_q4_0(data, tensor, size);
break;
case GGML_TYPE_Q4_1:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
repack_q4x4x2_q4_1(data, tensor, size);
break;
case GGML_TYPE_Q8_0:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
@@ -1886,7 +2132,8 @@ void ggml_hexagon_session::flush_pending(bool all) {
uint32_t n_dbufs;
// Read response packet from queue
int err = dspqueue_read(this->queue, &flags, 1, &n_dbufs, &dbuf, sizeof(rsp), &rsp_size, (uint8_t *) &rsp, DSPQUEUE_TIMEOUT);
const uint32_t timeo = opt_oppoll ? 0 : DSPQUEUE_TIMEOUT;
int err = dspqueue_read(this->queue, &flags, 1, &n_dbufs, &dbuf, sizeof(rsp), &rsp_size, (uint8_t *) &rsp, timeo);
if (err == AEE_EEXPIRED) {
continue;
}
@@ -2327,6 +2574,7 @@ static bool ggml_hexagon_supported_mul_mat(const struct ggml_hexagon_session * s
switch (src0->type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q8_0:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_MXFP4:
@@ -2377,6 +2625,7 @@ static bool ggml_hexagon_supported_mul_mat_id(const struct ggml_hexagon_session
switch (src0->type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q8_0:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_MXFP4:
@@ -2874,6 +3123,7 @@ static htp_op_code op_remap_to_htp(const ggml_tensor * t) {
case GGML_OP_NORM: return HTP_OP_NORM;
case GGML_OP_L2_NORM: return HTP_OP_L2_NORM;
case GGML_OP_RMS_NORM: return HTP_OP_RMS_NORM;
case GGML_OP_CONCAT: return HTP_OP_CONCAT;
case GGML_OP_SCALE: return HTP_OP_SCALE;
case GGML_OP_SQR: return HTP_OP_SQR;
case GGML_OP_SQRT: return HTP_OP_SQRT;
@@ -3286,6 +3536,25 @@ static bool ggml_hexagon_supported_repeat(const struct ggml_hexagon_session * se
return true;
}
static bool ggml_hexagon_supported_concat(const struct ggml_hexagon_session * sess, const struct ggml_tensor * op) {
int dim = ((const int32_t *) op->op_params)[0];
if (dim < 0 || dim >= GGML_MAX_DIMS) {
return false;
}
for (int i = 0; i < GGML_MAX_SRC; ++i) {
const struct ggml_tensor * src = op->src[i];
if (!src) {
continue;
}
if (src->type != GGML_TYPE_F32 && src->type != GGML_TYPE_I32 && src->type != GGML_TYPE_F16) {
return false;
}
}
return true;
}
static bool ggml_hexagon_supported_fill(const struct ggml_hexagon_session * sess, const struct ggml_tensor * op) {
const struct ggml_tensor * dst = op;
@@ -3434,6 +3703,10 @@ static bool ggml_backend_hexagon_device_supports_op(ggml_backend_dev_t dev, cons
supp = ggml_hexagon_supported_cumsum(sess, op);
break;
case GGML_OP_CONCAT:
supp = ggml_hexagon_supported_concat(sess, op);
break;
case GGML_OP_FILL:
supp = ggml_hexagon_supported_fill(sess, op);
break;
@@ -3598,6 +3871,8 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
// Basic sanity checks to make sure definitions match
static_assert((unsigned int) HTP_TYPE_Q4_0 == (unsigned int) GGML_TYPE_Q4_0,
"please update hexagon_type to match ggml_type");
static_assert((unsigned int) HTP_TYPE_Q4_1 == (unsigned int) GGML_TYPE_Q4_1,
"please update hexagon_type to match ggml_type");
static_assert((unsigned int) HTP_TYPE_Q8_0 == (unsigned int) GGML_TYPE_Q8_0,
"please update hexagon_type to match ggml_type");
static_assert((unsigned int) HTP_TYPE_MXFP4 == (unsigned int) GGML_TYPE_MXFP4,
@@ -3610,6 +3885,7 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
const char * str_opstage = getenv("GGML_HEXAGON_OPSTAGE");
const char * str_opbatch = getenv("GGML_HEXAGON_OPBATCH");
const char * str_opqueue = getenv("GGML_HEXAGON_OPQUEUE");
const char * str_oppoll = getenv("GGML_HEXAGON_OPPOLL");
const char * str_opfilter = getenv("GGML_HEXAGON_OPFILTER");
const char * str_profile = getenv("GGML_HEXAGON_PROFILE");
const char * str_etm = getenv("GGML_HEXAGON_ETM");
@@ -3647,6 +3923,7 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
opt_opstage = str_opstage ? strtoul(str_opstage, NULL, 0) : opt_opstage;
opt_opbatch = str_opbatch ? strtoul(str_opbatch, NULL, 0) : opt_opbatch;
opt_opqueue = str_opqueue ? strtoul(str_opqueue, NULL, 0) : opt_opqueue;
opt_oppoll = str_oppoll ? strtoul(str_oppoll, NULL, 0) : opt_oppoll;
opt_profile = str_profile ? atoi(str_profile) : 0;
opt_etm = str_etm ? atoi(str_etm) : 0;
opt_nhvx = str_nhvx ? strtoul(str_nhvx, NULL, 0) : opt_nhvx;
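The `unpack_q4_1_quants` / `pack_q4_1_quants` helpers added earlier in this file's diff split each packed byte into a low nibble (element `i`) and a high nibble (element `i + qk / 2`). A small Python sketch of that round trip for a single block is shown here; it drops the `bi` block-index offset used in the C++ code and operates on one 16-byte payload, so treat it as an illustration rather than a port.

```python
QK4_1 = 32  # elements per Q4_1 block, stored as 16 packed bytes


def unpack_q4_1_quants(block_qs):
    """Expand 16 packed bytes into 32 values in the range 0..15."""
    half = QK4_1 // 2
    out = [0] * QK4_1
    for i in range(half):
        out[i] = block_qs[i] & 0x0F      # low nibble -> first half of the block
        out[i + half] = block_qs[i] >> 4  # high nibble -> second half of the block
    return out


def pack_q4_1_quants(quants):
    """Inverse of unpack_q4_1_quants: 32 nibbles back into 16 bytes."""
    half = QK4_1 // 2
    return bytes(quants[i] | (quants[i + half] << 4) for i in range(half))


packed = bytes((i * 37 + 11) & 0xFF for i in range(QK4_1 // 2))
unpacked = unpack_q4_1_quants(packed)
repacked = pack_q4_1_quants(unpacked)
```

The split-halves layout means the first 16 unpacked values come from low nibbles and the last 16 from high nibbles, which is the ordering the row repack code then rearranges.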

@@ -35,6 +35,7 @@ add_library(${HTP_LIB} SHARED
ssm-conv.c
cumsum-ops.c
fill-ops.c
concat-ops.c
diag-ops.c
solve-tri-ops.c
gated-delta-net-ops.c
@@ -58,14 +59,14 @@ list(FIND HTP_HMX_VERSIONS ${DSP_VERSION} _hmx_idx)
if (_hmx_idx GREATER_EQUAL 0)
target_sources(${HTP_LIB} PRIVATE
hmx-queue.c
hmx-matmul-ops.c
hmx-flash-attn-ops.c
hmx-matmul-ops.c
)
# -mhmx enables HMX instruction set (needed by files that include hmx-utils.h)
set_source_files_properties(
hmx-matmul-ops.c
hmx-flash-attn-ops.c
hmx-matmul-ops.c
PROPERTIES COMPILE_OPTIONS "-mhmx"
)

@@ -0,0 +1,275 @@
#include "htp-ctx.h"
#include "htp-ops.h"
#include "hexagon_types.h"
#include "hexagon_protos.h"
#include "hvx_hexagon_protos.h"
#include "hex-dma.h"
#include "vtcm-utils.h"
#include "hvx-utils.h"
#include "hex-fastdiv.h"
#include <string.h>
struct htp_concat_context {
struct htp_ops_context * octx;
uint32_t dim;
uint32_t nrows_per_thread;
struct fastdiv_values div_ne0;
struct fastdiv_values div_ne1;
struct fastdiv_values div_ne2;
};
static void concat_2d_f32_transposed(unsigned int nth, unsigned int ith, void * data) {
struct htp_concat_context * cctx = (struct htp_concat_context *) data;
struct htp_ops_context * octx = cctx->octx;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * dst = octx->dst;
const uint32_t src0_ne0 = src0->ne[0];
const uint32_t src1_ne0 = src1->ne[0];
const uint32_t ne1 = dst->ne[1];
const uint32_t start_i = ith * cctx->nrows_per_thread;
const uint32_t end_i = (start_i + cctx->nrows_per_thread < ne1) ? (start_i + cctx->nrows_per_thread) : ne1;
if (start_i >= end_i) return;
dma_queue * q = octx->ctx->dma[ith];
uint8_t * spad0_base = octx->src0_spad.data + ith * octx->src0_spad.size_per_thread;
uint8_t * spad1_base = octx->src1_spad.data + ith * octx->src1_spad.size_per_thread;
const uint32_t block_i = 32;
const uint32_t spad1_stride = block_i * sizeof(float);
int32_t offsets[32] __attribute__((aligned(128)));
for(int k=0; k<32; k++) {
offsets[k] = k * spad1_stride;
}
HVX_Vector vv = *(HVX_Vector*)offsets;
const uint32_t src1_ne0_padded = hex_round_up(src1_ne0, 32);
const uint32_t spad0_row_bytes = hex_round_up((src0_ne0 + src1_ne0_padded) * sizeof(float), VLEN);
uint32_t mu = src1_ne0_padded * spad1_stride;
for (uint32_t i = start_i; i < end_i; i += block_i) {
uint32_t current_block_i = (end_i - i < block_i) ? (end_i - i) : block_i;
uint32_t src1_width_bytes = current_block_i * sizeof(float);
uint8_t * src1_ptr = (uint8_t *)src1->data + i * src1->nb[1];
dma_queue_push(q, dma_make_ptr(spad1_base, src1_ptr), spad1_stride, src1->nb[0], src1_width_bytes, src1_ne0);
uint32_t src0_row_bytes = src0_ne0 * sizeof(float);
uint8_t * src0_ptr = (uint8_t *)src0->data + i * src0->nb[1];
dma_queue_push(q, dma_make_ptr(spad0_base, src0_ptr), spad0_row_bytes, src0->nb[1], src0_row_bytes, current_block_i);
dma_queue_pop(q); // src1
HVX_Vector * vtcm_tmp = (HVX_Vector *)(spad1_base + src1_ne0_padded * spad1_stride);
for (uint32_t j = 0; j < src1_ne0_padded; j += 32) {
#pragma unroll(4)
for (uint32_t ii = 0; ii < current_block_i; ii++) {
size_t rt = (size_t)(spad1_base + j * spad1_stride + ii * sizeof(float));
Q6_vgather_ARMVw(&vtcm_tmp[ii], rt, mu, vv);
uint8_t * dst_ptr = spad0_base + ii * spad0_row_bytes + (src0_ne0 + j) * sizeof(float);
hvx_vmemu(dst_ptr) = vtcm_tmp[ii];
}
}
dma_queue_pop(q); // src0
uint8_t * dst_ptr = (uint8_t *)dst->data + i * dst->nb[1];
dma_queue_push(q, dma_make_ptr(dst_ptr, spad0_base), dst->nb[1], spad0_row_bytes, (src0_ne0 + src1_ne0) * sizeof(float), current_block_i);
dma_queue_pop(q);
}
}
static void concat_2d_f16_transposed(unsigned int nth, unsigned int ith, void * data) {
struct htp_concat_context * cctx = (struct htp_concat_context *) data;
struct htp_ops_context * octx = cctx->octx;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * dst = octx->dst;
const uint32_t src0_ne0 = src0->ne[0];
const uint32_t src1_ne0 = src1->ne[0];
const uint32_t ne1 = dst->ne[1];
const uint32_t start_i = ith * cctx->nrows_per_thread;
const uint32_t end_i = (start_i + cctx->nrows_per_thread < ne1) ? (start_i + cctx->nrows_per_thread) : ne1;
if (start_i >= end_i) return;
dma_queue * q = octx->ctx->dma[ith];
uint8_t * spad0_base = octx->src0_spad.data + ith * octx->src0_spad.size_per_thread;
uint8_t * spad1_base = octx->src1_spad.data + ith * octx->src1_spad.size_per_thread;
const uint32_t block_i = 64;
const uint32_t spad1_stride = block_i * sizeof(__fp16);
int16_t offsets[64] __attribute__((aligned(128)));
for(int k=0; k<64; k++) {
offsets[k] = k * spad1_stride;
}
HVX_Vector vv = *(HVX_Vector*)offsets;
const uint32_t src1_ne0_padded = hex_round_up(src1_ne0, 64);
const uint32_t spad0_row_bytes = hex_round_up((src0_ne0 + src1_ne0_padded) * sizeof(__fp16), VLEN);
uint32_t mu = src1_ne0_padded * spad1_stride;
for (uint32_t i = start_i; i < end_i; i += block_i) {
uint32_t current_block_i = (end_i - i < block_i) ? (end_i - i) : block_i;
uint32_t src1_width_bytes = current_block_i * sizeof(__fp16);
uint8_t * src1_ptr = (uint8_t *)src1->data + i * src1->nb[1];
dma_queue_push(q, dma_make_ptr(spad1_base, src1_ptr), spad1_stride, src1->nb[0], src1_width_bytes, src1_ne0);
uint32_t src0_row_bytes = src0_ne0 * sizeof(__fp16);
uint8_t * src0_ptr = (uint8_t *)src0->data + i * src0->nb[1];
dma_queue_push(q, dma_make_ptr(spad0_base, src0_ptr), spad0_row_bytes, src0->nb[1], src0_row_bytes, current_block_i);
dma_queue_pop(q); // src1
HVX_Vector * vtcm_tmp = (HVX_Vector *)(spad1_base + src1_ne0_padded * spad1_stride);
for (uint32_t j = 0; j < src1_ne0_padded; j += 64) {
#pragma unroll(4)
for (uint32_t ii = 0; ii < current_block_i; ii++) {
size_t rt = (size_t)(spad1_base + j * spad1_stride + ii * sizeof(__fp16));
Q6_vgather_ARMVh(&vtcm_tmp[ii], rt, mu, vv);
uint8_t * dst_ptr = spad0_base + ii * spad0_row_bytes + (src0_ne0 + j) * sizeof(__fp16);
hvx_vmemu(dst_ptr) = vtcm_tmp[ii];
}
}
dma_queue_pop(q); // src0
uint8_t * dst_ptr = (uint8_t *)dst->data + i * dst->nb[1];
dma_queue_push(q, dma_make_ptr(dst_ptr, spad0_base), dst->nb[1], spad0_row_bytes, (src0_ne0 + src1_ne0) * sizeof(__fp16), current_block_i);
dma_queue_pop(q);
}
}
static void concat_generic(unsigned int nth, unsigned int ith, void * data) {
struct htp_concat_context * cctx = (struct htp_concat_context *) data;
struct htp_ops_context * octx = cctx->octx;
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * dst = octx->dst;
const int dim = cctx->dim;
const uint32_t type_size = (dst->type == HTP_TYPE_F32 || dst->type == HTP_TYPE_I32) ? 4 : 2;
const uint32_t ne[4] = {dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3]};
const uint32_t total_elements = ne[0] * ne[1] * ne[2] * ne[3];
const uint32_t chunk_size = (total_elements + nth - 1) / nth;
const uint32_t start_idx = MIN(ith * chunk_size, total_elements);
const uint32_t end_idx = MIN(start_idx + chunk_size, total_elements);
// Naive scalar element-wise copy
for (uint32_t idx = start_idx; idx < end_idx; idx++) {
uint32_t idx_div_ne0 = fastdiv(idx, &cctx->div_ne0);
uint32_t i0 = idx - idx_div_ne0 * ne[0];
uint32_t idx_div_ne01 = fastdiv(idx_div_ne0, &cctx->div_ne1);
uint32_t i1 = idx_div_ne0 - idx_div_ne01 * ne[1];
uint32_t idx_div_ne012 = fastdiv(idx_div_ne01, &cctx->div_ne2);
uint32_t i2 = idx_div_ne01 - idx_div_ne012 * ne[2];
uint32_t i3 = idx_div_ne012;
uint8_t * dst_ptr = (uint8_t *)dst->data + i3 * dst->nb[3] + i2 * dst->nb[2] + i1 * dst->nb[1] + i0 * dst->nb[0];
uint32_t idx_dim = 0;
if (dim == 0) idx_dim = i0;
else if (dim == 1) idx_dim = i1;
else if (dim == 2) idx_dim = i2;
else if (dim == 3) idx_dim = i3;
const struct htp_tensor * src = (idx_dim < src0->ne[dim]) ? src0 : src1;
uint32_t s0 = i0;
uint32_t s1 = i1;
uint32_t s2 = i2;
uint32_t s3 = i3;
if (dim == 0 && src == src1) s0 -= src0->ne[0];
if (dim == 1 && src == src1) s1 -= src0->ne[1];
if (dim == 2 && src == src1) s2 -= src0->ne[2];
if (dim == 3 && src == src1) s3 -= src0->ne[3];
uint8_t * src_ptr = (uint8_t *)src->data + s3 * src->nb[3] + s2 * src->nb[2] + s1 * src->nb[1] + s0 * src->nb[0];
if (type_size == 4) {
*(float*)dst_ptr = *(float*)src_ptr;
} else {
*(__fp16*)dst_ptr = *(__fp16*)src_ptr;
}
}
}
int op_concat(struct htp_ops_context * octx) {
const struct htp_tensor * src0 = octx->src[0];
const struct htp_tensor * src1 = octx->src[1];
const struct htp_tensor * dst = octx->dst;
int dim = octx->op_params[0];
bool is_2d = dst->ne[2] == 1 && dst->ne[3] == 1;
const uint32_t type_size = (dst->type == HTP_TYPE_F32 || dst->type == HTP_TYPE_I32) ? 4 : 2;
bool is_src1_transposed = (src1->nb[0] > src1->nb[1]);
bool is_src0_transposed = (src0->nb[0] > src0->nb[1]);
uint32_t n_threads = octx->n_threads;
struct htp_concat_context cctx;
cctx.octx = octx;
cctx.dim = dim;
cctx.div_ne0 = init_fastdiv_values(dst->ne[0]);
cctx.div_ne1 = init_fastdiv_values(dst->ne[1]);
cctx.div_ne2 = init_fastdiv_values(dst->ne[2]);
void (*worker_func)(unsigned int, unsigned int, void *) = concat_generic;
if (dim == 0 && is_2d && is_src1_transposed && !is_src0_transposed) {
n_threads = MIN(dst->ne[1], n_threads);
if (n_threads < 1) {
n_threads = 1;
}
uint32_t block_i = (type_size == 4) ? 32 : 64;
cctx.nrows_per_thread = hmx_ceil_div(dst->ne[1], n_threads);
// Allocate VTCM
uint32_t spad1_stride = block_i * type_size;
uint32_t src1_ne0_padded = hex_round_up(src1->ne[0], block_i);
uint32_t spad0_row_bytes = hex_round_up((src0->ne[0] + src1_ne0_padded) * type_size, VLEN);
octx->src0_spad.size_per_thread = block_i * spad0_row_bytes;
octx->src1_spad.size_per_thread = src1_ne0_padded * spad1_stride + block_i * VLEN;
octx->src0_spad.size = n_threads * octx->src0_spad.size_per_thread;
octx->src1_spad.size = n_threads * octx->src1_spad.size_per_thread;
if (octx->src0_spad.size + octx->src1_spad.size > octx->ctx->vtcm_size) {
return HTP_STATUS_VTCM_TOO_SMALL;
}
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->src1_spad.data = octx->src0_spad.data + octx->src0_spad.size;
if (type_size == 4) {
worker_func = concat_2d_f32_transposed;
} else {
worker_func = concat_2d_f16_transposed;
}
}
worker_pool_run_func(octx->ctx->worker_pool, worker_func, &cctx, n_threads);
return HTP_STATUS_OK;
}
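
For checking the HVX/DMA paths above against a known-good result, a plain scalar reference can help. The sketch below mirrors the index decomposition used by concat_generic (flat destination index split into i0..i3, then routed to src0 or src1 along `dim`), but it assumes contiguous float tensors and uses ordinary division instead of fastdiv. The name `concat_ref_f32` is hypothetical and not part of the patch.

```c
#include <stdint.h>

// Reference concat of two contiguous 4-D float tensors along `dim`.
// ne0 and ne1 are the source shapes (element 0 is the fastest dimension);
// all dimensions except `dim` are assumed to match.
static void concat_ref_f32(const float * src0, const uint32_t ne0[4],
                           const float * src1, const uint32_t ne1[4],
                           float * dst, int dim) {
    uint32_t ne[4];
    for (int d = 0; d < 4; d++) {
        ne[d] = (d == dim) ? ne0[d] + ne1[d] : ne0[d];
    }
    const uint32_t total = ne[0] * ne[1] * ne[2] * ne[3];

    for (uint32_t idx = 0; idx < total; idx++) {
        // Decompose the flat dst index into (i0, i1, i2, i3).
        uint32_t i[4];
        uint32_t rest = idx;
        for (int d = 0; d < 3; d++) {
            i[d] = rest % ne[d];
            rest /= ne[d];
        }
        i[3] = rest;

        // Pick the source tensor and shift the index along `dim` for src1.
        const int        from_src1 = i[dim] >= ne0[dim];
        const float    * src       = from_src1 ? src1 : src0;
        const uint32_t * sne       = from_src1 ? ne1 : ne0;
        if (from_src1) {
            i[dim] -= ne0[dim];
        }

        const uint32_t sidx = ((i[3] * sne[2] + i[2]) * sne[1] + i[1]) * sne[0] + i[0];
        dst[idx] = src[sidx];
    }
}
```

Concatenating a 2x2 tensor {1,2,3,4} with a 1x2 tensor {9,8} along dim 0 yields rows {1,2,9} and {3,4,8}. Strided or transposed inputs, which the patch handles through nb[], are out of scope for this sketch.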

@@ -28,158 +28,170 @@ struct htp_copy_context {
uint32_t dst_blocks_per_row;
uint32_t src0_nrows_per_thread;
void (*copy)(struct htp_copy_context * ct, struct htp_ops_context * octx, int nth, int ith);
};
#define cpy_preamble \
const struct htp_tensor *src0 = octx->src[0]; \
const struct htp_tensor *dst = octx->dst; \
\
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
const uint32_t ne02 = src0->ne[2]; \
const uint32_t ne03 = src0->ne[3]; \
\
const uint32_t nb00 = src0->nb[0]; \
const uint32_t nb01 = src0->nb[1]; \
const uint32_t nb02 = src0->nb[2]; \
const uint32_t nb03 = src0->nb[3]; \
\
const uint32_t ne0 = dst->ne[0]; \
const uint32_t ne1 = dst->ne[1]; \
const uint32_t ne2 = dst->ne[2]; \
const uint32_t ne3 = dst->ne[3]; \
\
const uint32_t nb0 = dst->nb[0]; \
const uint32_t nb1 = dst->nb[1]; \
const uint32_t nb2 = dst->nb[2]; \
const uint32_t nb3 = dst->nb[3]; \
\
const uint32_t ne00 = src0->ne[0]; \
const uint32_t ne01 = src0->ne[1]; \
const uint32_t ne02 = src0->ne[2]; \
const uint32_t ne03 = src0->ne[3]; \
\
const uint32_t nb00 = src0->nb[0]; \
const uint32_t nb01 = src0->nb[1]; \
const uint32_t nb02 = src0->nb[2]; \
const uint32_t nb03 = src0->nb[3]; \
\
const uint32_t ne0 = dst->ne[0]; \
const uint32_t ne1 = dst->ne[1]; \
const uint32_t ne2 = dst->ne[2]; \
const uint32_t ne3 = dst->ne[3]; \
\
const uint32_t nb0 = dst->nb[0]; \
const uint32_t nb1 = dst->nb[1]; \
const uint32_t nb2 = dst->nb[2]; \
const uint32_t nb3 = dst->nb[3]; \
\
const uint32_t nr = ne01;
static void cpy_thread_sametype_sameshape(struct htp_copy_context * ct, struct htp_ops_context * octx, const int nth, const int ith) {
cpy_preamble;
// parallelize by src0 rows
const uint32_t dr = ct->src0_nrows_per_thread;
const uint32_t ir0 = dr * ith;
const uint32_t ir1 = (ir0 + dr) < nr ? (ir0 + dr) : nr;
// copy by rows
for (uint32_t i03 = 0; i03 < ne03; i03++) {
for (uint32_t i02 = 0; i02 < ne02; i02++) {
#pragma unroll(2)
for (uint32_t i01 = ir0; i01 < ir1; i01++) {
uint8_t* dst_ptr = (uint8_t*) dst->data + i01*nb1 + i02*nb2 + i03*nb3;
uint8_t* src0_ptr = (uint8_t*) src0->data + i01*nb01 + i02*nb02 + i03*nb03;
hex_l2fetch(src0_ptr, ne00 * ct->src0_type_size, nb01, 2);
hvx_copy_uu(dst_ptr, src0_ptr, ne00, ct->src0_type_size);
}
}
}
#define DEFINE_CPY_SAMESHAPE(NAME, ELEM_TYPE, ELEM_SIZE) \
static void cpy_thread_##NAME##_sameshape(unsigned int nth, unsigned int ith, void * data) { \
struct htp_copy_context * ct = (struct htp_copy_context *) data; \
struct htp_ops_context * octx = ct->octx; \
cpy_preamble; \
const uint32_t dr = ct->src0_nrows_per_thread; \
const uint32_t ir0 = dr * ith; \
const uint32_t ir1 = (ir0 + dr) < nr ? (ir0 + dr) : nr; \
if (ir0 >= nr) return; \
for (uint32_t i03 = 0; i03 < ne03; i03++) { \
for (uint32_t i02 = 0; i02 < ne02; i02++) { \
_Pragma("unroll(4)") \
for (uint32_t i01 = ir0; i01 < ir1; i01++) { \
uint8_t* dst_ptr = (uint8_t*) dst->data + i01*nb1 + i02*nb2 + i03*nb3; \
uint8_t* src0_ptr = (uint8_t*) src0->data + i01*nb01 + i02*nb02 + i03*nb03; \
hex_l2fetch(src0_ptr, ne00 * ELEM_SIZE, nb01, 2); \
hvx_copy_uu(dst_ptr, src0_ptr, ne00, ELEM_SIZE); \
} \
} \
} \
}
static void cpy_thread_sametype_reshape(struct htp_copy_context * ct, struct htp_ops_context * octx, int nth, int ith) {
cpy_preamble;
DEFINE_CPY_SAMESHAPE(f32, float, 4)
DEFINE_CPY_SAMESHAPE(f16, __fp16, 2)
// parallelize by src0 rows
const uint32_t dr = ct->src0_nrows_per_thread;
const uint32_t ir0 = dr * ith;
const uint32_t ir1 = (ir0 + dr) < nr ? (ir0 + dr) : nr;
// Fast path: when both src0 and dst are contiguous in memory
// Replace the element-by-element loop with a single bulk HVX copy per (i03, i02) slice.
const bool src0_contig = (nb00 == ct->src0_type_size) &&
(nb01 == ne00 * nb00) &&
(nb02 == ne01 * nb01) &&
(nb03 == ne02 * nb02);
const bool dst_contig = (nb0 == ct->dst_type_size) &&
(nb1 == ne0 * nb0) &&
(nb2 == ne1 * nb1) &&
(nb3 == ne2 * nb2);
if (src0_contig && dst_contig) {
for (int64_t i03 = 0; i03 < ne03; i03++) {
for (int64_t i02 = 0; i02 < ne02; i02++) {
uint8_t * src_ptr = (uint8_t *) src0->data + i03*nb03 + i02*nb02 + ir0*nb01;
uint32_t flat = ((i03*ne02 + i02)*ne01 + ir0) * ne00;
uint8_t * dst_ptr = (uint8_t *) dst->data + flat * ct->src0_type_size;
hvx_copy_uu(dst_ptr, src_ptr, (ir1 - ir0) * ne00, ct->src0_type_size);
}
}
return;
}
// dst counters
int64_t k10 = 0;
int64_t i11 = 0;
int64_t i12 = 0;
int64_t i13 = 0;
// number of blocks in a row
const int64_t nk00 = ct->src0_blocks_per_row;
const int64_t nk0 = ct->dst_blocks_per_row;
for (int64_t i03 = 0; i03 < ne03; i03++) {
for (int64_t i02 = 0; i02 < ne02; i02++) {
k10 += nk00 * ir0;
while (k10 >= nk0) {
k10 -= nk0;
if (++i11 == ne1) {
i11 = 0;
if (++i12 == ne2) {
i12 = 0;
if (++i13 == ne3) {
i13 = 0;
}
}
}
}
for (int64_t i01 = ir0; i01 < ir1; i01++) {
for (int64_t k00 = 0; k00 < nk00; k00++) {
const char * src0_ptr = ((char *) src0->data + k00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);
char * dst_ptr = ((char *) dst->data + k10*nb0 + i11*nb1 + i12*nb2 + i13*nb3);
memcpy(dst_ptr, src0_ptr, ct->dst_type_size);
if (++k10 == nk0) {
k10 = 0;
if (++i11 == ne1) {
i11 = 0;
if (++i12 == ne2) {
i12 = 0;
if (++i13 == ne3) {
i13 = 0;
}
}
}
}
}
}
k10 += nk00 * (ne01 - ir1);
while (k10 >= nk0) {
k10 -= nk0;
if (++i11 == ne1) {
i11 = 0;
if (++i12 == ne2) {
i12 = 0;
if (++i13 == ne3) {
i13 = 0;
}
}
}
}
}
}
#define DEFINE_CPY_RESHAPE(NAME, ELEM_TYPE, ELEM_SIZE) \
static void cpy_thread_##NAME##_reshape(unsigned int nth, unsigned int ith, void * data) { \
struct htp_copy_context * ct = (struct htp_copy_context *) data; \
struct htp_ops_context * octx = ct->octx; \
cpy_preamble; \
const uint32_t dr = ct->src0_nrows_per_thread; \
const uint32_t ir0 = dr * ith; \
const uint32_t ir1 = (ir0 + dr) < nr ? (ir0 + dr) : nr; \
if (ir0 >= nr) return; \
const bool src0_contig = (nb00 == ELEM_SIZE) && \
(nb01 == ne00 * nb00) && \
(nb02 == ne01 * nb01) && \
(nb03 == ne02 * nb02); \
const bool dst_contig = (nb0 == ELEM_SIZE) && \
(nb1 == ne0 * nb0) && \
(nb2 == ne1 * nb1) && \
(nb3 == ne2 * nb2); \
if (src0_contig && dst_contig) { \
for (int64_t i03 = 0; i03 < ne03; i03++) { \
for (int64_t i02 = 0; i02 < ne02; i02++) { \
uint8_t * src_ptr = (uint8_t *) src0->data + i03*nb03 + i02*nb02 + ir0*nb01; \
uint32_t flat = ((i03*ne02 + i02)*ne01 + ir0) * ne00; \
uint8_t * dst_ptr = (uint8_t *) dst->data + flat * ELEM_SIZE; \
hvx_copy_uu(dst_ptr, src_ptr, (ir1 - ir0) * ne00, ELEM_SIZE); \
} \
} \
return; \
} \
const bool reshape_flat_fast = (ne03 == 1 && ne2 == 1 && ne3 == 1) && \
(ne0 == ne00 * ne01) && (ne1 == ne02) && \
(nb00 == ELEM_SIZE) && (nb0 == ELEM_SIZE); \
if (reshape_flat_fast) { \
for (uint32_t i02 = 0; i02 < ne02; i02++) { \
for (uint32_t i01 = ir0; i01 < ir1; i01++) { \
uint8_t * src0_ptr = (uint8_t *) src0->data + i01 * nb01 + i02 * nb02; \
uint8_t * dst_ptr = (uint8_t *) dst->data + i01 * ne00 * ELEM_SIZE + i02 * nb1; \
hvx_copy_uu(dst_ptr, src0_ptr, ne00, ELEM_SIZE); \
} \
} \
return; \
} \
int64_t k10 = 0; \
int64_t i11 = 0; \
int64_t i12 = 0; \
int64_t i13 = 0; \
const int64_t nk00 = ct->src0_blocks_per_row; \
const int64_t nk0 = ct->dst_blocks_per_row; \
for (int64_t i03 = 0; i03 < ne03; i03++) { \
for (int64_t i02 = 0; i02 < ne02; i02++) { \
k10 += nk00 * ir0; \
while (k10 >= nk0) { \
k10 -= nk0; \
if (++i11 == ne1) { \
i11 = 0; \
if (++i12 == ne2) { \
i12 = 0; \
if (++i13 == ne3) { \
i13 = 0; \
} \
} \
} \
} \
for (int64_t i01 = ir0; i01 < ir1; i01++) { \
for (int64_t k00 = 0; k00 < nk00; k00++) { \
const char * src0_ptr = ((char *) src0->data + k00*nb00 + i01*nb01 + i02*nb02 + i03*nb03); \
char * dst_ptr = ((char *) dst->data + k10*nb0 + i11*nb1 + i12*nb2 + i13*nb3); \
memcpy(dst_ptr, src0_ptr, ELEM_SIZE); \
if (++k10 == nk0) { \
k10 = 0; \
if (++i11 == ne1) { \
i11 = 0; \
if (++i12 == ne2) { \
i12 = 0; \
if (++i13 == ne3) { \
i13 = 0; \
} \
} \
} \
} \
} \
} \
k10 += nk00 * (ne01 - ir1); \
while (k10 >= nk0) { \
k10 -= nk0; \
if (++i11 == ne1) { \
i11 = 0; \
if (++i12 == ne2) { \
i12 = 0; \
if (++i13 == ne3) { \
i13 = 0; \
} \
} \
} \
} \
} \
} \
}
static void cpy_thread_f16_f32_sameshape(struct htp_copy_context * ct, struct htp_ops_context * octx, const int nth, const int ith) {
DEFINE_CPY_RESHAPE(f32, float, 4)
DEFINE_CPY_RESHAPE(f16, __fp16, 2)
static void cpy_thread_f16_f32_sameshape(unsigned int nth, unsigned int ith, void * data) {
struct htp_copy_context * ct = (struct htp_copy_context *) data;
struct htp_ops_context * octx = ct->octx;
cpy_preamble;
// parallelize by src0 rows
const uint32_t dr = ct->src0_nrows_per_thread;
const uint32_t ir0 = dr * ith;
const uint32_t ir1 = (ir0 + dr) < nr ? (ir0 + dr) : nr;
if (ir0 >= nr) return;
// copy by rows
for (uint32_t i03 = 0; i03 < ne03; i03++) {
@@ -195,13 +207,16 @@ static void cpy_thread_f16_f32_sameshape(struct htp_copy_context * ct, struct ht
}
}
static void cpy_thread_f32_f16_sameshape(struct htp_copy_context * ct, struct htp_ops_context * octx, const int nth, const int ith) {
static void cpy_thread_f32_f16_sameshape(unsigned int nth, unsigned int ith, void * data) {
struct htp_copy_context * ct = (struct htp_copy_context *) data;
struct htp_ops_context * octx = ct->octx;
cpy_preamble;
// parallelize by src0 rows
const uint32_t dr = ct->src0_nrows_per_thread;
const uint32_t ir0 = dr * ith;
const uint32_t ir1 = (ir0 + dr) < nr ? (ir0 + dr) : nr;
if (ir0 >= nr) return;
// copy by rows
for (uint32_t i03 = 0; i03 < ne03; i03++) {
@@ -217,11 +232,6 @@ static void cpy_thread_f32_f16_sameshape(struct htp_copy_context * ct, struct ht
}
}
static void cpy_work_func(unsigned int n, unsigned int i, void *data) {
struct htp_copy_context *ct = (struct htp_copy_context *) data;
ct->copy(ct, ct->octx, n, i);
}
int op_cpy(struct htp_ops_context * octx) {
cpy_preamble;
@@ -254,22 +264,32 @@ int op_cpy(struct htp_ops_context * octx) {
ct.src0_nrows_per_thread = (nr + n_threads - 1) / n_threads;
worker_callback_t copy_fun;
if (sametype && sameshape) {
ct.copy = cpy_thread_sametype_sameshape;
if (src0->type == HTP_TYPE_F32) {
copy_fun = cpy_thread_f32_sameshape;
} else {
copy_fun = cpy_thread_f16_sameshape;
}
} else if (sameshape) {
/**/ if (dst->type == HTP_TYPE_F16 && src0->type == HTP_TYPE_F32)
ct.copy = cpy_thread_f16_f32_sameshape;
copy_fun = cpy_thread_f16_f32_sameshape;
else if (dst->type == HTP_TYPE_F32 && src0->type == HTP_TYPE_F16)
ct.copy = cpy_thread_f32_f16_sameshape;
copy_fun = cpy_thread_f32_f16_sameshape;
else
return HTP_STATUS_NO_SUPPORT;
} else if (sametype) {
ct.copy = cpy_thread_sametype_reshape;
if (src0->type == HTP_TYPE_F32) {
copy_fun = cpy_thread_f32_reshape;
} else {
copy_fun = cpy_thread_f16_reshape;
}
} else {
return HTP_STATUS_NO_SUPPORT;
}
worker_pool_run_func(octx->ctx->worker_pool, cpy_work_func, &ct, n_threads);
worker_pool_run_func(octx->ctx->worker_pool, copy_fun, &ct, n_threads);
return HTP_STATUS_OK;
}
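
The copy workers above all split the rows with the same arithmetic: a ceiling division gives the rows per thread, and each thread clamps its end to nr. A likely reason for the added `if (ir0 >= nr) return;` guard is that, with ceiling division, trailing threads can start past the last row, and an unsigned `ir1 - ir0` would then wrap around in the contiguous fast path. The sketch below isolates that arithmetic. The function name `thread_row_range` is hypothetical.

```c
#include <stdint.h>

// Hypothetical helper: compute the half-open row range [*ir0, *ir1) for
// thread `ith` out of `nth`, using the ceil-div split seen in the patch.
// A thread that starts past the last row gets an empty range.
static void thread_row_range(uint32_t nr, uint32_t nth, uint32_t ith,
                             uint32_t * ir0, uint32_t * ir1) {
    const uint32_t dr = (nr + nth - 1) / nth;  // rows per thread, rounded up
    uint32_t lo = dr * ith;
    uint32_t hi = (lo + dr < nr) ? (lo + dr) : nr;
    if (lo >= nr) {
        // No work for this thread; keep hi - lo from wrapping around.
        lo = nr;
        hi = nr;
    }
    *ir0 = lo;
    *ir1 = hi;
}
```

For nr = 5 and nth = 4, dr is 2, so threads 0 to 2 receive rows [0,2), [2,4) and [4,5), while thread 3 would start at row 6 and is given an empty range.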

@@ -17,9 +17,13 @@
struct get_rows_context {
struct htp_ops_context * octx;
uint32_t src1_nrows_per_thread;
uint32_t tasks_per_thread;
uint32_t total_tasks;
uint32_t chunks_per_row;
uint32_t chunk_size;
struct fastdiv_values get_rows_div_ne10;
struct fastdiv_values get_rows_div_ne10_ne11;
struct fastdiv_values get_rows_div_chunks_per_row;
};
#define get_rows_preamble \
@@ -52,20 +56,23 @@ struct get_rows_context {
\
const uint32_t nr = ne10 * ne11 * ne12;
static void get_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *data) {
static void get_rows_thread_f32_f32_dma(unsigned int nth, unsigned int ith, void *data) {
struct get_rows_context * grctx = (struct get_rows_context *)data;
struct htp_ops_context * octx = grctx->octx;
get_rows_preamble;
uint64_t qt = HAP_perf_get_qtimer_count();
// parallelize by src1 elements (which correspond to dst rows)
const uint32_t dr = grctx->src1_nrows_per_thread;
const uint32_t dr = grctx->tasks_per_thread;
const uint32_t ir0 = dr * ith;
const uint32_t ir1 = (ir0 + dr < nr) ? (ir0 + dr) : nr;
if (ir0 >= grctx->total_tasks) {
return;
}
const uint32_t ir1 = MIN(ir0 + dr, grctx->total_tasks);
const bool is_i32 = (octx->src[1]->type == HTP_TYPE_I32);
dma_queue * dma_queue = octx->ctx->dma[ith];
for (uint32_t i = ir0; i < ir1; ++i) {
const uint32_t i12 = fastdiv(i, &grctx->get_rows_div_ne10_ne11);
const uint32_t rem = i - i12 * ne11 * ne10;
@@ -73,29 +80,77 @@ static void get_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *da
const uint32_t i10 = rem - i11 * ne10;
const uintptr_t src1_addr = octx->src[1]->data + i10*nb10 + i11*nb11 + i12*nb12;
uint32_t i01 = is_i32 ? *(int32_t *)src1_addr : *(int64_t *)src1_addr;
if (i01 >= ne01) {
// invalid index, skip for now to avoid crash
continue;
}
const uintptr_t src0_ptr = octx->src[0]->data + i01*nb01 + i11*nb02 + i12*nb03;
const uintptr_t dst_ptr = octx->dst->data + i10*nb1 + i11*nb2 + i12*nb3;
hvx_copy_f32_uu((uint8_t *)dst_ptr, (const uint8_t *)src0_ptr, ne00);
while (!dma_queue_push(dma_queue, dma_make_ptr((void *)dst_ptr, (const void *)src0_ptr), nb1, nb01, ne00 * sizeof(float), 1)) {
dma_queue_pop(dma_queue);
}
}
dma_queue_flush(dma_queue);
qt = HAP_perf_qtimer_count_to_us(HAP_perf_get_qtimer_count() - qt);
FARF(HIGH, "get-rows-f32-f32-dma %d/%d: %ux%ux%ux%u (%u:%u) x %ux%ux%ux%u -> %ux%ux%ux%u usec %u\n", ith, nth,
ne00, ne01, ne02, ne03, ir0, ir1, ne10, ne11, ne12, ne13, ne0, ne1, ne2, ne3, (unsigned) qt);
}
static void get_rows_thread_f32_f32_hvx(unsigned int nth, unsigned int ith, void *data) {
struct get_rows_context * grctx = (struct get_rows_context *)data;
struct htp_ops_context * octx = grctx->octx;
get_rows_preamble;
uint64_t qt = HAP_perf_get_qtimer_count();
const uint32_t dr = grctx->tasks_per_thread;
const uint32_t ir0 = dr * ith;
if (ir0 >= grctx->total_tasks) {
return;
}
const uint32_t ir1 = MIN(ir0 + dr, grctx->total_tasks);
const bool is_i32 = (octx->src[1]->type == HTP_TYPE_I32);
const uint32_t chunks_per_row = grctx->chunks_per_row;
const uint32_t chunk_size = grctx->chunk_size;
for (uint32_t i = ir0; i < ir1; ++i) {
const uint32_t row_idx = fastdiv(i, &grctx->get_rows_div_chunks_per_row);
const uint32_t chunk_idx = i - row_idx * chunks_per_row;
const uint32_t i12 = fastdiv(row_idx, &grctx->get_rows_div_ne10_ne11);
const uint32_t rem = row_idx - i12 * ne11 * ne10;
const uint32_t i11 = fastdiv(rem, &grctx->get_rows_div_ne10);
const uint32_t i10 = rem - i11 * ne10;
const uintptr_t src1_addr = octx->src[1]->data + i10*nb10 + i11*nb11 + i12*nb12;
uint32_t i01 = is_i32 ? *(int32_t *)src1_addr : *(int64_t *)src1_addr;
if (i01 >= ne01) {
continue;
}
const uint32_t offset = chunk_idx * chunk_size;
if (offset < ne00) {
const uint32_t copy_size = MIN(chunk_size, ne00 - offset);
const uintptr_t src0_ptr = octx->src[0]->data + i01*nb01 + i11*nb02 + i12*nb03 + offset * sizeof(float);
const uintptr_t dst_ptr = octx->dst->data + i10*nb1 + i11*nb2 + i12*nb3 + offset * sizeof(float);
hvx_copy_f32_uu((uint8_t *)dst_ptr, (const uint8_t *)src0_ptr, copy_size);
}
}
qt = HAP_perf_qtimer_count_to_us(HAP_perf_get_qtimer_count() - qt);
FARF(HIGH, "get-rows-f32-f32 %d/%d: %ux%ux%ux%u (%u:%u) x %ux%ux%ux%u -> %ux%ux%ux%u usec %u\n", ith, nth,
FARF(HIGH, "get-rows-f32-f32-hvx %d/%d: %ux%ux%ux%u (%u:%u) x %ux%ux%ux%u -> %ux%ux%ux%u usec %u\n", ith, nth,
ne00, ne01, ne02, ne03, ir0, ir1, ne10, ne11, ne12, ne13, ne0, ne1, ne2, ne3, (unsigned) qt);
}
int op_get_rows(struct htp_ops_context * octx) {
get_rows_preamble;
const uint32_t n_threads = MIN(nr, octx->n_threads);
if (octx->src[0]->type != HTP_TYPE_F32) {
return HTP_STATUS_NO_SUPPORT;
}
@@ -112,13 +167,52 @@ int op_get_rows(struct htp_ops_context * octx) {
return HTP_STATUS_OK;
}
const uint32_t nb00 = octx->src[0]->nb[0];
const uint32_t nb0 = octx->dst->nb[0];
const bool can_use_dma = (nb00 == sizeof(float)) && (nb0 == sizeof(float));
const bool use_dma = can_use_dma && (ne00 >= 2048);
struct get_rows_context grctx;
grctx.octx = octx;
grctx.get_rows_div_ne10 = init_fastdiv_values(octx->src[1]->ne[0]);
grctx.get_rows_div_ne10_ne11 = init_fastdiv_values(octx->src[1]->ne[0] * octx->src[1]->ne[1]);
grctx.src1_nrows_per_thread = (nr + n_threads - 1) / n_threads;
if (use_dma) {
grctx.chunks_per_row = 1;
grctx.chunk_size = ne00;
grctx.total_tasks = nr;
grctx.get_rows_div_chunks_per_row = init_fastdiv_values(1);
worker_pool_run_func(octx->ctx->worker_pool, get_rows_thread_f32_f32, &grctx, n_threads);
const uint32_t n_threads = MIN(nr, octx->n_threads);
grctx.tasks_per_thread = (nr + n_threads - 1) / n_threads;
worker_pool_run_func(octx->ctx->worker_pool, get_rows_thread_f32_f32_dma, &grctx, n_threads);
} else {
uint32_t chunks_per_row = 1;
uint32_t chunk_size = ne00;
uint32_t total_tasks = nr;
if (nr < octx->n_threads) {
const uint32_t min_chunk_size = 1024;
uint32_t max_chunks = ne00 / min_chunk_size;
if (max_chunks == 0) {
max_chunks = 1;
}
chunks_per_row = MIN((octx->n_threads + nr - 1) / nr, max_chunks);
chunk_size = (ne00 + chunks_per_row - 1) / chunks_per_row;
total_tasks = nr * chunks_per_row;
}
grctx.chunks_per_row = chunks_per_row;
grctx.chunk_size = chunk_size;
grctx.total_tasks = total_tasks;
grctx.get_rows_div_chunks_per_row = init_fastdiv_values(chunks_per_row);
const uint32_t n_threads = MIN(total_tasks, octx->n_threads);
grctx.tasks_per_thread = (total_tasks + n_threads - 1) / n_threads;
worker_pool_run_func(octx->ctx->worker_pool, get_rows_thread_f32_f32_hvx, &grctx, n_threads);
}
return HTP_STATUS_OK;
}
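
The HVX branch of op_get_rows splits each row into chunks when there are fewer rows than threads, so that idle threads can share the copy. The sketch below reproduces only that planning step under the same constants (minimum chunk of 1024 elements). The `chunk_plan` struct and `plan_chunks` function are hypothetical names, and the `nr > 0` guard is an addition of this sketch, on the assumption that the real function has already handled the empty case before reaching this code.

```c
#include <stdint.h>

// Hypothetical container for the three values the patch stores in grctx.
typedef struct {
    uint32_t chunks_per_row;
    uint32_t chunk_size;   // elements per chunk
    uint32_t total_tasks;  // rows * chunks_per_row
} chunk_plan;

// Decide how to split `nr` rows of `ne00` elements across `n_threads`.
static chunk_plan plan_chunks(uint32_t nr, uint32_t ne00, uint32_t n_threads) {
    chunk_plan p = { 1, ne00, nr };  // default: one task per row
    if (nr > 0 && nr < n_threads) {
        const uint32_t min_chunk_size = 1024;
        uint32_t max_chunks = ne00 / min_chunk_size;
        if (max_chunks == 0) {
            max_chunks = 1;
        }
        // Enough chunks to give every thread a task, capped by row length.
        const uint32_t want = (n_threads + nr - 1) / nr;
        p.chunks_per_row = want < max_chunks ? want : max_chunks;
        p.chunk_size     = (ne00 + p.chunks_per_row - 1) / p.chunks_per_row;
        p.total_tasks    = nr * p.chunks_per_row;
    }
    return p;
}
```

With 2 rows of 4096 elements and 6 threads, the plan is 3 chunks per row of 1366 elements each, giving 6 tasks. Because the chunk size is rounded up, the last chunk of a row is shorter, which is why the worker clamps the copy with `MIN(chunk_size, ne00 - offset)`.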

@@ -50,8 +50,8 @@ static size_t hmx_fa_compute_vtcm_usage(size_t gqa_factor, size_t DK, size_t DV,
const size_t g_br = hex_align_up(gqa_factor * Br, HMX_FP16_TILE_N_ROWS);
const size_t q_tile_size = hex_align_up(g_br * DK * sizeof(__fp16), 4096); // Q: [g_br, DK]
const size_t o_tile_size = hex_align_up(g_br * DV * sizeof(__fp16), 4096); // O: [g_br, DV] x2 ping-pong
const size_t k_dma_size = hex_align_up(Bc * DK * sizeof(__fp16), 4096); // K DMA: [Bc, DK] x2 double-buf
const size_t v_dma_size = hex_align_up(Bc * DV * sizeof(__fp16), 4096); // V DMA: [Bc, DV] x2 double-buf
const size_t k_dma_size = hex_align_up(Bc * hex_round_up(DK * sizeof(__fp16), 128), 4096); // K DMA: [Bc, DK] x2 double-buf
const size_t v_dma_size = hex_align_up(Bc * hex_round_up(DV * sizeof(__fp16), 128), 4096); // V DMA: [Bc, DV] x2 double-buf
const size_t k_tile_size = hex_align_up(Bc * DK * sizeof(__fp16), 4096); // K tiles: [Bc, DK] interleaved
const size_t v_tile_size = hex_align_up(Bc * DV * sizeof(__fp16), 4096); // V tiles: [Bc, DV] interleaved
const size_t s_tile_size = hex_align_up(g_br * Bc * sizeof(__fp16), 4096); // S/P:[g_br, Bc]
@@ -1278,7 +1278,7 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
struct hmx_fa_context factx;
memset(&factx, 0, sizeof(factx));
factx.octx = octx;
factx.n_threads = octx->ctx->n_threads;
factx.n_threads = n_threads;
factx.DK = DK;
factx.DV = DV;
factx.n_kv = nek1;
@@ -1328,10 +1328,15 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
factx.m1 = powf(2.0f, -(max_bias / 2.0f) / factx.n_head_log2);
// ======== VTCM allocation (GQA-aware) ========
const size_t size_k_row = DK * sizeof(__fp16);
const size_t size_v_row = DV * sizeof(__fp16);
const size_t size_k_row_padded = hex_round_up(size_k_row, 128);
const size_t size_v_row_padded = hex_round_up(size_v_row, 128);
const size_t q_tile_bytes = hex_align_up(g_br * DK * sizeof(__fp16), 4096);
const size_t o_tile_bytes = hex_align_up(g_br * DV * sizeof(__fp16), 4096);
const size_t k_dma_bytes = hex_align_up(Bc * DK * sizeof(__fp16), 4096);
const size_t v_dma_bytes = hex_align_up(Bc * DV * sizeof(__fp16), 4096);
const size_t k_dma_bytes = hex_align_up(Bc * size_k_row_padded, 4096);
const size_t v_dma_bytes = hex_align_up(Bc * size_v_row_padded, 4096);
const size_t k_tile_bytes = hex_align_up(Bc * DK * sizeof(__fp16), 4096);
const size_t v_tile_bytes = hex_align_up(Bc * DV * sizeof(__fp16), 4096);
const size_t s_tile_bytes = hex_align_up(g_br * Bc * sizeof(__fp16), 4096);
@@ -1401,11 +1406,7 @@ int hmx_flash_attn_ext(struct htp_ops_context * octx) {
// ======== DMA setup ========
dma_queue * const dma = ctx->dma[0];
// Padded row sizes for DMA
const size_t size_k_row = nek0 * sizeof(__fp16);
const size_t size_v_row = nev0 * sizeof(__fp16);
const size_t size_k_row_padded = hex_round_up(nek0 * sizeof(__fp16), 128);
const size_t size_v_row_padded = hex_round_up(nev0 * sizeof(__fp16), 128);
// Padded row sizes for DMA (defined in outer scope)
const size_t n_row_tiles_g_br = g_br / HMX_FP16_TILE_N_ROWS;
const size_t n_tiles_per_bc = Bc / HMX_FP16_TILE_N_COLS;
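
The VTCM sizing change above pads each K/V row to 128 bytes before multiplying by Bc, so the budget matches the padded row stride the DMA setup uses. The sketch below shows the effect on the buffer size. It assumes hex_round_up and hex_align_up both round up to the next multiple of their second argument; `round_up_to` and `kv_dma_bytes` are hypothetical stand-ins, not functions from the patch.

```c
#include <stddef.h>

// Assumed behaviour of hex_round_up / hex_align_up: round v up to a
// multiple of m (m > 0).
static size_t round_up_to(size_t v, size_t m) {
    return (v + m - 1) / m * m;
}

// Bytes reserved for one K or V DMA buffer holding `Bc` rows of `D`
// fp16 values, with each row padded to 128 bytes as in the patch.
static size_t kv_dma_bytes(size_t Bc, size_t D) {
    const size_t row_bytes        = D * 2;  // sizeof(__fp16) == 2
    const size_t row_bytes_padded = round_up_to(row_bytes, 128);
    return round_up_to(Bc * row_bytes_padded, 4096);
}
```

For Bc = 64 and D = 80, a row is 160 bytes and pads to 256, so the buffer needs 16384 bytes. The previous formula, which aligned 64 * 160 = 10240 bytes to 4096, would have reserved only 12288.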

@@ -34,6 +34,10 @@ static const __fp16 q4_0_to_fp16_lut[64] __attribute__((aligned(VLEN))) = {
-8, 0, -7, 0, -6, 0, -5, 0, -4, 0, -3, 0, -2, 0, -1, 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0,
};
static const __fp16 q4_1_to_fp16_lut[64] __attribute__((aligned(VLEN))) = {
0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0, 9, 0, 10, 0, 11, 0, 12, 0, 13, 0, 14, 0, 15, 0,
};
// MXFP4 dequantization LUT: maps 4-bit index to fp16 mantissa value
// kvalues: 0, 0.5, 1, 1.5, 2, 3, 4, 6, 0, -0.5, -1, -1.5, -2, -3, -4, -6
static const __fp16 mxfp4_to_fp16_lut[64] __attribute__((aligned(VLEN))) = {
@@ -62,6 +66,8 @@ static inline size_t get_x4x2_row_stride(int weight_type, int k) {
case HTP_TYPE_Q4_0:
case HTP_TYPE_IQ4_NL:
return (size_t) nb * (QK_Q4_0x4x2 / 2 + HMX_X4X2_DBLK_SIZE); // 144 * nb
case HTP_TYPE_Q4_1:
return (size_t) nb * (QK_Q4_0x4x2 / 2 + 32); // 160 * nb
case HTP_TYPE_Q8_0:
return (size_t) nb * (QK_Q8_0x4x2 + HMX_X4X2_DBLK_SIZE); // 272 * nb
case HTP_TYPE_MXFP4:
@@ -233,6 +239,54 @@ static inline HVX_Vector_x2 dequantize_x4x2_q4_0_x4groups_hvx(
return r;
}
static inline HVX_Vector dequantize_x4x2_q4_1_group_hvx(const uint8_t *packed_32, bool upper_nibbles, const __fp16 *scale_offset, const HVX_Vector vlut_cvt) {
HVX_Vector vq = hvx_vmemu(packed_32);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_dm = hvx_vmemu(scale_offset);
HVX_Vector v_scales = hvx_vec_repl_f16(v_dm);
HVX_Vector v_offsets = hvx_vec_repl_f16(Q6_V_vror_VR(v_dm, 2));
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_hf = Q6_V_lo_W(vp);
return Q6_Vhf_equals_Vqf16(Q6_Vqf16_vadd_Vqf16Vhf(Q6_Vqf16_vmpy_VhfVhf(v_hf, v_scales), v_offsets));
}
static inline HVX_Vector_x2 dequantize_x4x2_q4_1_x4groups_hvx(
const uint8_t *packed_128, bool upper_nibbles,
const __fp16 *scales_offsets_4, const HVX_Vector vlut_cvt) {
HVX_Vector vq = hvx_vmemu(packed_128);
const HVX_Vector mask_h4 = Q6_Vb_vsplat_R(0x0F);
HVX_Vector v_quants = Q6_Vub_vlsr_VubR(vq, 4 * upper_nibbles);
v_quants = Q6_V_vand_VV(v_quants, mask_h4);
v_quants = Q6_Vb_vshuff_Vb(v_quants);
HVX_VectorPair vp = Q6_Wh_vlut16_VbVhR(v_quants, vlut_cvt, 0);
HVX_Vector v_lo = Q6_V_lo_W(vp);
HVX_Vector v_hi = Q6_V_hi_W(vp);
HVX_Vector vscale_offset = hvx_vmemu(scales_offsets_4);
HVX_VectorPair dm_deal = Q6_W_vdeal_VVR(vscale_offset, vscale_offset, -2);
HVX_Vector vd = Q6_V_lo_W(dm_deal);
HVX_Vector vm = Q6_V_hi_W(dm_deal);
HVX_Vector v_sc01 = hvx_vec_repl_2x_f16(vd);
HVX_Vector v_sc23 = hvx_vec_repl_2x_f16(Q6_V_vror_VR(vd, 4));
HVX_Vector v_os01 = hvx_vec_repl_2x_f16(vm);
HVX_Vector v_os23 = hvx_vec_repl_2x_f16(Q6_V_vror_VR(vm, 4));
v_lo = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vadd_Vqf16Vhf(Q6_Vqf16_vmpy_VhfVhf(v_lo, v_sc01), v_os01));
v_hi = Q6_Vhf_equals_Vqf16(Q6_Vqf16_vadd_Vqf16Vhf(Q6_Vqf16_vmpy_VhfVhf(v_hi, v_sc23), v_os23));
HVX_Vector_x2 r = { v_lo, v_hi };
return r;
}
// Dequantize one x4x2 Q8_0 group (32 int8 quants) -> 32 FP16 in first 64 bytes.
static inline HVX_Vector dequantize_x4x2_q8_0_group_hvx(const int8_t *quants_32, const __fp16 *scale) {
HVX_Vector vq = hvx_vmemu(quants_32);
@@ -331,11 +385,13 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
int start_tile, int end_tile) {
const int n_k_tiles = (unsigned)k_block / HMX_FP16_TILE_N_COLS;
const bool is_q4 = (weight_type == HTP_TYPE_Q4_0 || weight_type == HTP_TYPE_IQ4_NL);
const bool is_q4 = (weight_type == HTP_TYPE_Q4_0 || weight_type == HTP_TYPE_Q4_1 || weight_type == HTP_TYPE_IQ4_NL);
const bool is_q4_1 = (weight_type == HTP_TYPE_Q4_1);
const int qrow_size = is_q4 ? ((unsigned)k_block / 2) : k_block;
const HVX_Vector vlut_cvt = (weight_type == HTP_TYPE_IQ4_NL) ? hvx_vmem(iq4_nl_to_fp16_lut) :
(weight_type == HTP_TYPE_MXFP4) ? hvx_vmem(mxfp4_to_fp16_lut) :
(weight_type == HTP_TYPE_Q4_1) ? hvx_vmem(q4_1_to_fp16_lut) :
hvx_vmem(q4_0_to_fp16_lut);
// vscatter setup: write dequantized K-values directly to transposed [K][N] tile positions.
@@ -356,8 +412,10 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
unsigned sub_blk_base = ((kt * 32) % QK_Q4_0x4x2) / 32; // 0 or 4
bool upper = (sub_blk_base >= 4);
unsigned packed_off = blk_idx * (QK_Q4_0x4x2 / 2); // 128 contiguous packed bytes
unsigned scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE
+ sub_blk_base * (int)sizeof(__fp16); // 4 consecutive scales
unsigned dblk_size = is_q4_1 ? 32 : HMX_X4X2_DBLK_SIZE;
unsigned scale_step = is_q4_1 ? 4 : (int)sizeof(__fp16);
unsigned scale_off = qrow_size + blk_idx * dblk_size
+ sub_blk_base * scale_step;
__fp16 *tile_bases[4];
for (unsigned g = 0; g < 4; g++) { tile_bases[g] = vtcm_dst + (t + g) * HMX_FP16_TILE_N_ELMS; }
@@ -367,20 +425,38 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * row_stride;
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
if (is_q4_1) {
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector_x2 dv0 = dequantize_x4x2_q4_0_x4groups_hvx(r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector_x2 dv1 = dequantize_x4x2_q4_0_x4groups_hvx(r1 + packed_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt);
HVX_Vector_x2 dv0 = dequantize_x4x2_q4_1_x4groups_hvx(r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector_x2 dv1 = dequantize_x4x2_q4_1_x4groups_hvx(r1 + packed_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
} else {
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector_x2 dv0 = dequantize_x4x2_q4_0_x4groups_hvx(r0 + packed_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector_x2 dv1 = dequantize_x4x2_q4_0_x4groups_hvx(r1 + packed_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv0.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_RMVwV((size_t)tile_bases[0], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[0]);
Q6_vscatter_RMVwV((size_t)tile_bases[2], 2 * HMX_FP16_TILE_SIZE - 1, v_off, dv1.v[1]);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
}
for (int g = 0; g < 4; g++) { (void) *(volatile HVX_Vector *)(tile_bases[g]); }
@@ -446,26 +522,43 @@ static void dequantize_x4x2_weight_to_fp16_tiles_task(
unsigned sub_blk = ((kt * 32) % QK_Q4_0x4x2) / 32;
bool upper = (sub_blk >= 4);
unsigned byte_off = blk_idx * (QK_Q4_0x4x2 / 2) + (upper ? (sub_blk - 4) : sub_blk) * 32;
unsigned scale_off = qrow_size + blk_idx * HMX_X4X2_DBLK_SIZE + sub_blk * (int)sizeof(__fp16);
unsigned dblk_size = is_q4_1 ? 32 : HMX_X4X2_DBLK_SIZE;
unsigned scale_step = is_q4_1 ? 4 : (int)sizeof(__fp16);
unsigned scale_off = qrow_size + blk_idx * dblk_size + sub_blk * scale_step;
HVX_Vector v_off = v_scat_base; // reset to column 0
unsigned row_offset = ct * HMX_FP16_TILE_N_COLS * row_stride;
unsigned row1 = ct * HMX_FP16_TILE_N_COLS + 1;
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
if (is_q4_1) {
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector v0 = dequantize_x4x2_q4_0_group_hvx(
r0 + byte_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector v1 = (row1 < n_cols)
? dequantize_x4x2_q4_0_group_hvx(
r1 + byte_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt)
: Q6_V_vzero();
HVX_Vector v0 = dequantize_x4x2_q4_1_group_hvx(r0 + byte_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector v1 = (row1 < n_cols)
? dequantize_x4x2_q4_1_group_hvx(r1 + byte_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt)
: Q6_V_vzero();
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v1);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v1);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
} else {
for (int r = 0; r < HMX_FP16_TILE_N_ROWS; r += 2, row1 += 2) {
const uint8_t *r0 = vtcm_src + row_offset; row_offset += row_stride;
const uint8_t *r1 = vtcm_src + row_offset; row_offset += row_stride;
HVX_Vector v0 = dequantize_x4x2_q4_0_group_hvx(r0 + byte_off, upper, (const __fp16 *)(r0 + scale_off), vlut_cvt);
HVX_Vector v1 = (row1 < n_cols)
? dequantize_x4x2_q4_0_group_hvx(r1 + byte_off, upper, (const __fp16 *)(r1 + scale_off), vlut_cvt)
: Q6_V_vzero();
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v0);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
Q6_vscatter_QRMVwV(q_mask64, (size_t)tile_base, HMX_FP16_TILE_SIZE - 1, v_off, v1);
v_off = Q6_Vw_vadd_VwVw(v_off, v_scat_step);
}
}
(void) *(volatile HVX_Vector *)(tile_base);
} else if (weight_type == HTP_TYPE_MXFP4) {
@@ -593,6 +686,8 @@ static void dequantize_x4x2_weight_chunk_to_fp16_tiles(
// --- End x4x2 dequantizers ---
#pragma clang diagnostic ignored "-Wbackend-plugin" // spurious warning for hmx intrinsics
// requires external HMX lock
static void core_dot_chunk_fp16(__fp16 *restrict output, const __fp16 *restrict activation, const __fp16 *restrict weight, const __fp16 *restrict scales,
int n_row_tiles, int n_col_tiles, int n_dot_tiles) {
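
For readers following the Q4_1 hunks above, here is a minimal scalar sketch of what the HVX group dequantizer appears to compute. It is not code from the patch. The function names are hypothetical, and the 256-weight super-block size is inferred from the `144 * nb` and `160 * nb` comments rather than stated in the diff. Because `q4_1_to_fp16_lut` maps each nibble to itself (0..15), the result per weight is `scale * q + offset`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar reference, not part of the patch.
// One group is 32 quants, each taken from one nibble of 32 packed bytes.
static void q4_1_dequant_group_ref(const uint8_t *packed_32, bool upper_nibbles,
                                   float d, float m, float *out) {
    assert(packed_32 != NULL && out != NULL);
    for (int i = 0; i < 32; i++) {
        // Same nibble selection as the shift by (4 * upper_nibbles) plus the 0x0F mask.
        const uint8_t q = upper_nibbles ? (uint8_t) (packed_32[i] >> 4)
                                        : (uint8_t) (packed_32[i] & 0x0F);
        // Identity LUT (0..15), then scale and offset: d * q + m.
        out[i] = d * (float) q + m;
    }
}

// Row stride in bytes for nb super-blocks, assuming 256 weights per super-block:
// 128 bytes of packed nibbles + 8 groups * (scale, offset) * 2 bytes of fp16 = 160.
static size_t q4_1_x4x2_row_stride_ref(size_t nb) {
    return nb * (256 / 2 + 8 * 2 * sizeof(uint16_t));
}
```

The 4-byte `scale_step` in the diff matches this layout: each group owns one interleaved (scale, offset) fp16 pair, whereas Q4_0 advances by a single 2-byte scale.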

@@ -104,6 +104,7 @@ int op_argsort(struct htp_ops_context * octx);
int op_ssm_conv(struct htp_ops_context * octx);
int op_cumsum(struct htp_ops_context * octx);
int op_fill(struct htp_ops_context * octx);
int op_concat(struct htp_ops_context * octx);
int op_diag(struct htp_ops_context * octx);
int op_solve_tri(struct htp_ops_context * octx);
int op_gated_delta_net(struct htp_ops_context * octx);

@@ -20,6 +20,7 @@ enum htp_data_type {
HTP_TYPE_F32 = 0,
HTP_TYPE_F16 = 1,
HTP_TYPE_Q4_0 = 2,
HTP_TYPE_Q4_1 = 3,
HTP_TYPE_Q8_0 = 8,
HTP_TYPE_IQ4_NL = 20,
HTP_TYPE_I32 = 26,
@@ -28,6 +29,7 @@ enum htp_data_type {
// types used internally for repack, dyn.quant, etc
HTP_TYPE_Q4_0x4x2 = 200,
HTP_TYPE_Q4_1x4x2,
HTP_TYPE_Q8_0x4x2,
HTP_TYPE_MXFP4x4x2,
@@ -89,6 +91,7 @@ enum htp_op_code {
HTP_OP_TRI,
HTP_OP_PAD,
HTP_OP_NORM,
HTP_OP_CONCAT,
HTP_OP_INVALID
};

@@ -0,0 +1,90 @@
#ifndef HVX_SIN_COS_H
#define HVX_SIN_COS_H
#include "hvx-base.h"
#include "hvx-floor.h"
static inline HVX_Vector hvx_vec_cos_f32(HVX_Vector x) {
HVX_Vector const_inv_pi = hvx_vec_splat_f32(0.3183098861837907f);
HVX_Vector const_half = hvx_vec_splat_f32(0.5f);
HVX_Vector const_pi = hvx_vec_splat_f32(3.141592653589793f);
HVX_Vector const_one = hvx_vec_splat_f32(1.0f);
HVX_Vector const_neg_one = hvx_vec_splat_f32(-1.0f);
// n = floor(x * (1/pi) + 0.5)
HVX_Vector n_float = hvx_vec_floor_f32(hvx_vec_add_f32_f32(hvx_vec_mul_f32_f32(x, const_inv_pi), const_half));
// y = x - n * pi
HVX_Vector y = hvx_vec_sub_f32_f32(x, hvx_vec_mul_f32_f32(n_float, const_pi));
// Sign determination: if n is odd, sign is -1.0f, else 1.0f
// half_n = n * 0.5f
HVX_Vector half_n = hvx_vec_mul_f32_f32(n_float, const_half);
// floor_half_n = floor(half_n)
HVX_Vector floor_half_n = hvx_vec_floor_f32(half_n);
// is_odd = half_n > floor_half_n
HVX_VectorPred is_odd = Q6_Q_vcmp_gt_VsfVsf(half_n, floor_half_n);
// sign = vmux(is_odd, -1.0f, 1.0f)
HVX_Vector sign = Q6_V_vmux_QVV(is_odd, const_neg_one, const_one);
// z = y^2
HVX_Vector z = hvx_vec_mul_f32_f32(y, y);
// Chebyshev approximation for cos(y)
HVX_Vector c4 = hvx_vec_splat_f32(2.3557242013849433e-05f);
HVX_Vector c3 = hvx_vec_splat_f32(-0.0013871428263450528f);
HVX_Vector c2 = hvx_vec_splat_f32(0.041665895266688284f);
HVX_Vector c1 = hvx_vec_splat_f32(-0.4999999360426369f);
HVX_Vector c0 = hvx_vec_splat_f32(0.9999999999071725f);
HVX_Vector cos_y = hvx_vec_add_f32_f32(c3, hvx_vec_mul_f32_f32(z, c4));
cos_y = hvx_vec_add_f32_f32(c2, hvx_vec_mul_f32_f32(z, cos_y));
cos_y = hvx_vec_add_f32_f32(c1, hvx_vec_mul_f32_f32(z, cos_y));
cos_y = hvx_vec_add_f32_f32(c0, hvx_vec_mul_f32_f32(z, cos_y));
return hvx_vec_mul_f32_f32(cos_y, sign);
}
static inline HVX_Vector hvx_vec_sin_f32(HVX_Vector x) {
HVX_Vector const_inv_pi = hvx_vec_splat_f32(0.3183098861837907f);
HVX_Vector const_half = hvx_vec_splat_f32(0.5f);
HVX_Vector const_pi = hvx_vec_splat_f32(3.141592653589793f);
HVX_Vector const_one = hvx_vec_splat_f32(1.0f);
HVX_Vector const_neg_one = hvx_vec_splat_f32(-1.0f);
// n = floor(x * (1/pi) + 0.5)
HVX_Vector n_float = hvx_vec_floor_f32(hvx_vec_add_f32_f32(hvx_vec_mul_f32_f32(x, const_inv_pi), const_half));
// y = x - n * pi
HVX_Vector y = hvx_vec_sub_f32_f32(x, hvx_vec_mul_f32_f32(n_float, const_pi));
// Sign determination: if n is odd, sign is -1.0f, else 1.0f
// half_n = n * 0.5f
HVX_Vector half_n = hvx_vec_mul_f32_f32(n_float, const_half);
// floor_half_n = floor(half_n)
HVX_Vector floor_half_n = hvx_vec_floor_f32(half_n);
// is_odd = half_n > floor_half_n
HVX_VectorPred is_odd = Q6_Q_vcmp_gt_VsfVsf(half_n, floor_half_n);
// sign = vmux(is_odd, -1.0f, 1.0f)
HVX_Vector sign = Q6_V_vmux_QVV(is_odd, const_neg_one, const_one);
// z = y^2
HVX_Vector z = hvx_vec_mul_f32_f32(y, y);
// Chebyshev approximation for sin(y)
HVX_Vector s4 = hvx_vec_splat_f32(2.642186986152672e-06f);
HVX_Vector s3 = hvx_vec_splat_f32(-0.00019825318964070864f);
HVX_Vector s2 = hvx_vec_splat_f32(0.00833326283319605f);
HVX_Vector s1 = hvx_vec_splat_f32(-0.16666666082087775f);
HVX_Vector s0 = hvx_vec_splat_f32(0.999999999915155f);
HVX_Vector sin_y = hvx_vec_add_f32_f32(s3, hvx_vec_mul_f32_f32(z, s4));
sin_y = hvx_vec_add_f32_f32(s2, hvx_vec_mul_f32_f32(z, sin_y));
sin_y = hvx_vec_add_f32_f32(s1, hvx_vec_mul_f32_f32(z, sin_y));
sin_y = hvx_vec_add_f32_f32(s0, hvx_vec_mul_f32_f32(z, sin_y));
sin_y = hvx_vec_mul_f32_f32(y, sin_y);
return hvx_vec_mul_f32_f32(sin_y, sign);
}
#endif /* HVX_SIN_COS_H */
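
The range reduction used by `hvx_vec_cos_f32` and `hvx_vec_sin_f32` can be checked on a host with a scalar sketch like the one below. It is an illustration only. The helper names are hypothetical, and `floor_ref` is a stand-in for `hvx_vec_floor_f32` that avoids a libm dependency. The polynomial coefficients are copied from the header above.

```c
#include <assert.h>

// Stand-in for hvx_vec_floor_f32; valid only for moderate magnitudes (assumption).
static float floor_ref(float v) {
    assert(v > -1.0e9f && v < 1.0e9f);
    const float t = (float) (long) v;   // truncates toward zero
    return (t > v) ? t - 1.0f : t;      // step down for negative non-integers
}

// n = floor(x / pi + 0.5), y = x - n * pi, sign = -1 when n is odd.
static float reduce_ref(float x, float *sign) {
    const float n      = floor_ref(x * 0.3183098861837907f + 0.5f);
    const float half_n = n * 0.5f;
    *sign = (half_n > floor_ref(half_n)) ? -1.0f : 1.0f;
    return x - n * 3.141592653589793f;
}

static float cos_ref(float x) {
    float sign;
    const float y = reduce_ref(x, &sign);
    const float z = y * y;
    // Horner evaluation in z = y^2, same coefficient order as c4..c0 above.
    float p = -0.0013871428263450528f + z * 2.3557242013849433e-05f;
    p = 0.041665895266688284f + z * p;
    p = -0.4999999360426369f + z * p;
    p = 0.9999999999071725f + z * p;
    return p * sign;
}

static float sin_ref(float x) {
    float sign;
    const float y = reduce_ref(x, &sign);
    const float z = y * y;
    // Same scheme with s4..s0, followed by the extra multiply by y.
    float p = -0.00019825318964070864f + z * 2.642186986152672e-06f;
    p = 0.00833326283319605f + z * p;
    p = -0.16666666082087775f + z * p;
    p = 0.999999999915155f + z * p;
    return y * p * sign;
}
```

Since cos(y + n*pi) = (-1)^n cos(y), and likewise for sin, one reduction and one sign serve both functions, and y stays within [-pi/2, pi/2] where the polynomials are fitted.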

@@ -14,6 +14,8 @@
#include "hvx-sqrt.h"
#include "hvx-arith.h"
#include "hvx-div.h"
#include "hvx-floor.h"
#include "hvx-sin-cos.h"
#include "hvx-base.h"
#endif /* HVX_UTILS_H */

@@ -420,8 +420,7 @@ AEEResult htp_iface_start(remote_handle64 handle, uint32 sess_id, uint64 dsp_que
ctx->n_threads = n_hvx;
for (int i = 0; i < ctx->n_threads; i++) {
// see discussion https://github.com/ggml-org/llama.cpp/pull/18151#discussion_r2632388541
ctx->dma[i] = dma_queue_create(128);
ctx->dma[i] = dma_queue_create(256); // queue depth
}
// init worker pool
@@ -601,6 +600,9 @@ static int execute_op(struct htp_ops_context * octx) {
case HTP_OP_PAD:
return op_pad(octx);
case HTP_OP_CONCAT:
return op_concat(octx);
case HTP_OP_GATED_DELTA_NET:
return op_gated_delta_net(octx);
@@ -851,6 +853,11 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
for (uint32_t i=0; i < n_ops; i++) {
struct profile_data prof;
if (i == (n_ops-1)) {
// wake up the host before starting the last op
dspqueue_write_early_wakeup_noblock(queue, 0, 0);
}
profile_start(ctx->profiler, &prof);
proc_op_req(octx, tens, i, &ops[i]);
@@ -867,8 +874,6 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
}
}
// dspqueue_write_early_wakeup_noblock(ctx->queue, 10, 0);
struct htp_opbatch_rsp rsp;
rsp.id = req.id;
rsp.status = HTP_STATUS_OK;

File diff suppressed because it is too large.

@@ -7,6 +7,7 @@
#include <math.h>
#include <string.h>
#include <stdlib.h>
#include "hex-dma.h"
#include "hvx-utils.h"
@@ -75,6 +76,9 @@ struct htp_rope_context {
size_t theta_cache_offset;
uint32_t src0_nrows;
struct fastdiv_values div_ne2_ne1;
struct fastdiv_values div_ne1;
uint64_t t_start;
};
@@ -117,13 +121,84 @@ static __attribute__((noinline)) void rope_cache_init(const float theta_base,
float * cache,
const float theta_scale) {
// ref: https://github.com/jquesnelle/yarn/blob/master/scaled_rope/LlamaYaRNScaledRotaryEmbedding.py
float theta = theta_base;
#if __HVX_ARCH__ >= 79
const bool is_v79_or_newer = true;
#else
const bool is_v79_or_newer = false;
#endif
for (uint32_t i0 = 0; i0 < ne0; i0 += 2) {
const float ff = freq_factors ? freq_factors[i0 / 2] : 1.0f;
rope_yarn_one(theta / ff, freq_scale, corr_dims, i0, ext_factor, mscale, cache);
if (is_v79_or_newer && ext_factor == 0.0f) {
// Fast path: fully vectorized
// We process 32 pairs (64 elements) per iteration.
const uint32_t n_blocks = ne0 / 64;
theta *= theta_scale;
// Initialize theta scale powers: [1.0f, theta_scale, theta_scale^2, ..., theta_scale^31]
float __attribute__((aligned(128))) theta_powers[32];
theta_powers[0] = 1.0f;
for (int j = 1; j < 32; j++) {
theta_powers[j] = theta_powers[j - 1] * theta_scale;
}
HVX_Vector v_theta_powers = hvx_vmem(theta_powers);
HVX_Vector v_freq_scale = hvx_vec_splat_f32(freq_scale);
HVX_Vector v_mscale = hvx_vec_splat_f32(mscale);
// Base theta starts at theta_base
float theta_block = theta_base;
// The scale factor for the next block is theta_scale^32
float theta_scale_32 = 1.0f;
for (int j = 0; j < 32; j++) {
theta_scale_32 *= theta_scale;
}
for (uint32_t b = 0; b < n_blocks; b++) {
uint32_t i0 = b * 64;
HVX_Vector v_theta_base = hvx_vec_splat_f32(theta_block);
HVX_Vector v_theta = hvx_vec_mul_f32_f32(v_theta_base, v_theta_powers);
if (freq_factors) {
// Load 32 elements of freq_factors
HVX_Vector v_ff = hvx_vmemu(freq_factors + i0 / 2);
HVX_Vector v_inv_ff = hvx_vec_inverse_f32(v_ff);
v_theta = hvx_vec_mul_f32_f32(v_theta, v_inv_ff);
}
HVX_Vector v_theta_final = hvx_vec_mul_f32_f32(v_theta, v_freq_scale);
HVX_Vector vcos = hvx_vec_cos_f32(v_theta_final);
HVX_Vector vsin = hvx_vec_sin_f32(v_theta_final);
vcos = hvx_vec_mul_f32_f32(vcos, v_mscale);
vsin = hvx_vec_mul_f32_f32(vsin, v_mscale);
HVX_VectorPair vstore = Q6_W_vshuff_VVR(vsin, vcos, -4);
if (((uintptr_t)cache) % 128 == 0) {
hvx_vmem(cache + i0 + 0) = Q6_V_lo_W(vstore);
hvx_vmem(cache + i0 + 32) = Q6_V_hi_W(vstore);
} else {
hvx_vec_store_u(cache + i0 + 0, 32 * sizeof(float), Q6_V_lo_W(vstore));
hvx_vec_store_u(cache + i0 + 32, 32 * sizeof(float), Q6_V_hi_W(vstore));
}
theta_block *= theta_scale_32;
}
// Leftovers
float theta = theta_block;
for (uint32_t i0 = n_blocks * 64; i0 < ne0; i0 += 2) {
const float ff = freq_factors ? freq_factors[i0 / 2] : 1.0f;
rope_yarn_one(theta / ff, freq_scale, corr_dims, i0, ext_factor, mscale, cache);
theta *= theta_scale;
}
} else {
// Fallback to original scalar loop
float theta = theta_base;
for (uint32_t i0 = 0; i0 < ne0; i0 += 2) {
const float ff = freq_factors ? freq_factors[i0 / 2] : 1.0f;
rope_yarn_one(theta / ff, freq_scale, corr_dims, i0, ext_factor, mscale, cache);
theta *= theta_scale;
}
}
}
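
The fast path in `rope_cache_init` replaces the per-pair recurrence `theta *= theta_scale` with a block-wise form: 32 precomputed powers per block and one multiply by `theta_scale^32` between blocks. The sketch below compares the two on the host. The function names are hypothetical and only the theta values are modeled; the cos/sin evaluation, `freq_factors` and `mscale` are left out.

```c
#include <assert.h>
#include <stdint.h>

// Sequential form used by the scalar loop: theta_k = theta_base * theta_scale^k.
static void theta_seq_ref(float theta_base, float theta_scale, uint32_t n_pairs, float *out) {
    assert(out != 0);
    float theta = theta_base;
    for (uint32_t k = 0; k < n_pairs; k++) {
        out[k] = theta;
        theta *= theta_scale;
    }
}

// Block-wise form: 32 pairs (64 cache elements) per block, scalar leftovers at the end.
static void theta_blocked_ref(float theta_base, float theta_scale, uint32_t n_pairs, float *out) {
    assert(out != 0);
    float powers[32];
    powers[0] = 1.0f;
    for (int j = 1; j < 32; j++) {
        powers[j] = powers[j - 1] * theta_scale;
    }
    const float scale_32 = powers[31] * theta_scale;  // theta_scale^32

    const uint32_t n_blocks = n_pairs / 32;
    float theta_block = theta_base;
    for (uint32_t b = 0; b < n_blocks; b++) {
        for (uint32_t j = 0; j < 32; j++) {
            out[b * 32 + j] = theta_block * powers[j];  // one vector multiply in the HVX code
        }
        theta_block *= scale_32;
    }

    // Leftover pairs continue the scalar recurrence from the last block base.
    float theta = theta_block;
    for (uint32_t k = n_blocks * 32; k < n_pairs; k++) {
        out[k] = theta;
        theta *= theta_scale;
    }
}
```

For a general `theta_scale` the two forms can differ by a few ULPs because the multiplications are grouped differently. They agree exactly when every intermediate product is representable, for example with a power-of-two scale.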
@@ -195,24 +270,18 @@ static void rope_corr_dims(int n_dims,
}
static inline void hvx_rope_neox_f32_aa(float * restrict dst, const float * restrict src0, uint32_t ne, const float * restrict theta_cache) {
const HVX_Vector * restrict vsrc = (const HVX_Vector *) src0;
const HVX_Vector * restrict vtheta = (const HVX_Vector *) theta_cache;
HVX_Vector * restrict vdst = (HVX_Vector *) dst;
const uint32_t he = ne / 2;
const uint32_t nvec = he / 32;
const uint32_t nloe = he % 32;
uint32_t nvec = (ne / (VLEN_FP32 * 2) * 2); // 2 vecs per loop, step of 2
for (uint32_t i = 0; i < nvec; i++) {
HVX_Vector v0 = ((const HVX_Vector *) src0)[i];
HVX_Vector v1 = hvx_vmemu(src0 + he + i * 32);
uint32_t he = ne / 2; // half_dims offset in elements
uint32_t hv = he / VLEN_FP32; // half_dims offset in vectors
HVX_Vector v2 = ((const HVX_Vector *) theta_cache)[i * 2 + 0];
HVX_Vector v3 = ((const HVX_Vector *) theta_cache)[i * 2 + 1];
#pragma unroll(2)
for (uint32_t i = 0; i < nvec; i += 2) {
HVX_Vector v0 = vsrc[i/2+0];
HVX_Vector v1 = vsrc[i/2+hv];
HVX_Vector v2 = vtheta[i+0];
HVX_Vector v3 = vtheta[i+1];
HVX_VectorPair vcos_sin = Q6_W_vdeal_VVR(v3, v2, -4); // vcos_sin[0] = cos_theta, vcos_sin[1] = sin_theta
HVX_VectorPair vcos_sin = Q6_W_vdeal_VVR(v3, v2, -4);
HVX_Vector vx0_c = Q6_Vqf32_vmpy_VsfVsf(v0, Q6_V_lo_W(vcos_sin));
HVX_Vector vx0_s = Q6_Vqf32_vmpy_VsfVsf(v0, Q6_V_hi_W(vcos_sin));
@@ -222,37 +291,45 @@ static inline void hvx_rope_neox_f32_aa(float * restrict dst, const float * rest
HVX_Vector v4 = Q6_Vqf32_vsub_Vqf32Vqf32(vx0_c, vx1_s);
HVX_Vector v5 = Q6_Vqf32_vadd_Vqf32Vqf32(vx0_s, vx1_c);
vdst[i/2+0] = Q6_Vsf_equals_Vqf32(v4);
vdst[i/2+hv] = Q6_Vsf_equals_Vqf32(v5);
((HVX_Vector *) dst)[i] = Q6_Vsf_equals_Vqf32(v4);
hvx_vmemu(dst + he + i * 32) = Q6_Vsf_equals_Vqf32(v5);
}
for (uint32_t i = nvec * VLEN_FP32; i < ne; i += 2) {
const float cos_theta = theta_cache[i+0];
const float sin_theta = theta_cache[i+1];
float x0 = src0[i/2];
float x1 = src0[i/2 + he];
dst[i/2] = x0 * cos_theta - x1 * sin_theta;
dst[i/2 + he] = x0 * sin_theta + x1 * cos_theta;
if (nloe > 0) {
HVX_Vector v0 = hvx_vmemu(src0 + nvec * 32);
HVX_Vector v1 = hvx_vmemu(src0 + he + nvec * 32);
HVX_Vector v2 = ((const HVX_Vector *) theta_cache)[nvec * 2 + 0];
HVX_Vector v3 = ((const HVX_Vector *) theta_cache)[nvec * 2 + 1];
HVX_VectorPair vcos_sin = Q6_W_vdeal_VVR(v3, v2, -4);
HVX_Vector vx0_c = Q6_Vqf32_vmpy_VsfVsf(v0, Q6_V_lo_W(vcos_sin));
HVX_Vector vx0_s = Q6_Vqf32_vmpy_VsfVsf(v0, Q6_V_hi_W(vcos_sin));
HVX_Vector vx1_c = Q6_Vqf32_vmpy_VsfVsf(v1, Q6_V_lo_W(vcos_sin));
HVX_Vector vx1_s = Q6_Vqf32_vmpy_VsfVsf(v1, Q6_V_hi_W(vcos_sin));
HVX_Vector v4 = Q6_Vqf32_vsub_Vqf32Vqf32(vx0_c, vx1_s);
HVX_Vector v5 = Q6_Vqf32_vadd_Vqf32Vqf32(vx0_s, vx1_c);
hvx_vec_store_u(dst + nvec * 32, nloe * sizeof(float), Q6_Vsf_equals_Vqf32(v4));
hvx_vec_store_u(dst + he + nvec * 32, nloe * sizeof(float), Q6_Vsf_equals_Vqf32(v5));
}
}
static inline void hvx_rope_f32_aa(float * restrict dst, const float * restrict src0, uint32_t ne, const float * restrict theta_cache) {
const HVX_Vector * restrict vsrc = (const HVX_Vector *) src0;
const HVX_Vector * restrict vtheta = (const HVX_Vector *) theta_cache;
HVX_Vector * restrict vdst = (HVX_Vector *) dst;
const uint32_t nvec = ne / 64;
const uint32_t nloe = ne % 64;
uint32_t nvec = (ne / (VLEN_FP32 * 2)) * 2; // 2 vecs per loop, step of two
for (uint32_t i = 0; i < nvec; i++) {
HVX_Vector v0 = ((const HVX_Vector *) src0)[i * 2 + 0];
HVX_Vector v1 = ((const HVX_Vector *) src0)[i * 2 + 1];
#pragma unroll(2)
for (uint32_t i = 0; i < nvec; i+=2) {
HVX_Vector v0 = vsrc[i+0];
HVX_Vector v1 = vsrc[i+1];
HVX_Vector v2 = ((const HVX_Vector *) theta_cache)[i * 2 + 0];
HVX_Vector v3 = ((const HVX_Vector *) theta_cache)[i * 2 + 1];
HVX_Vector v2 = vtheta[i+0];
HVX_Vector v3 = vtheta[i+1];
HVX_VectorPair vx0_x1 = Q6_W_vdeal_VVR(v1, v0, -4); // vx0_x1[0] = x0, vx0_x1[1] = x1
HVX_VectorPair vcos_sin = Q6_W_vdeal_VVR(v3, v2, -4); // vcos_sin[0] = cos_theta, vcos_sin[1] = sin_theta
HVX_VectorPair vx0_x1 = Q6_W_vdeal_VVR(v1, v0, -4);
HVX_VectorPair vcos_sin = Q6_W_vdeal_VVR(v3, v2, -4);
HVX_Vector vx0_c = Q6_Vqf32_vmpy_VsfVsf(Q6_V_lo_W(vx0_x1), Q6_V_lo_W(vcos_sin));
HVX_Vector vx0_s = Q6_Vqf32_vmpy_VsfVsf(Q6_V_lo_W(vx0_x1), Q6_V_hi_W(vcos_sin));
@@ -264,17 +341,52 @@ static inline void hvx_rope_f32_aa(float * restrict dst, const float * restrict
HVX_VectorPair vstore = Q6_W_vshuff_VVR(Q6_Vsf_equals_Vqf32(v5), Q6_Vsf_equals_Vqf32(v4), -4);
vdst[i+0] = Q6_V_lo_W(vstore);
vdst[i+1] = Q6_V_hi_W(vstore);
((HVX_Vector *) dst)[i * 2 + 0] = Q6_V_lo_W(vstore);
((HVX_Vector *) dst)[i * 2 + 1] = Q6_V_hi_W(vstore);
}
for (uint32_t i = nvec * VLEN_FP32; i < ne; i += 2) {
const float cos_theta = theta_cache[i+0];
const float sin_theta = theta_cache[i+1];
float x0 = src0[i+0];
float x1 = src0[i+1];
dst[i+0] = x0 * cos_theta - x1 * sin_theta;
dst[i+1] = x0 * sin_theta + x1 * cos_theta;
if (nloe > 0) {
if (nloe <= 32) {
HVX_Vector v0 = hvx_vmemu(src0 + nvec * 64);
HVX_Vector v2 = hvx_vmemu(theta_cache + nvec * 64);
HVX_VectorPair vx0_x1 = Q6_W_vdeal_VVR(Q6_V_vzero(), v0, -4);
HVX_VectorPair vcos_sin = Q6_W_vdeal_VVR(Q6_V_vzero(), v2, -4);
HVX_Vector vx0_c = Q6_Vqf32_vmpy_VsfVsf(Q6_V_lo_W(vx0_x1), Q6_V_lo_W(vcos_sin));
HVX_Vector vx0_s = Q6_Vqf32_vmpy_VsfVsf(Q6_V_lo_W(vx0_x1), Q6_V_hi_W(vcos_sin));
HVX_Vector vx1_c = Q6_Vqf32_vmpy_VsfVsf(Q6_V_hi_W(vx0_x1), Q6_V_lo_W(vcos_sin));
HVX_Vector vx1_s = Q6_Vqf32_vmpy_VsfVsf(Q6_V_hi_W(vx0_x1), Q6_V_hi_W(vcos_sin));
HVX_Vector v4 = Q6_Vqf32_vsub_Vqf32Vqf32(vx0_c, vx1_s);
HVX_Vector v5 = Q6_Vqf32_vadd_Vqf32Vqf32(vx0_s, vx1_c);
HVX_VectorPair vstore = Q6_W_vshuff_VVR(Q6_Vsf_equals_Vqf32(v5), Q6_Vsf_equals_Vqf32(v4), -4);
hvx_vec_store_u(dst + nvec * 64, nloe * sizeof(float), Q6_V_lo_W(vstore));
} else {
HVX_Vector v0 = hvx_vmemu(src0 + nvec * 64);
HVX_Vector v1 = hvx_vmemu(src0 + nvec * 64 + 32);
HVX_Vector v2 = hvx_vmemu(theta_cache + nvec * 64);
HVX_Vector v3 = hvx_vmemu(theta_cache + nvec * 64 + 32);
HVX_VectorPair vx0_x1 = Q6_W_vdeal_VVR(v1, v0, -4);
HVX_VectorPair vcos_sin = Q6_W_vdeal_VVR(v3, v2, -4);
HVX_Vector vx0_c = Q6_Vqf32_vmpy_VsfVsf(Q6_V_lo_W(vx0_x1), Q6_V_lo_W(vcos_sin));
HVX_Vector vx0_s = Q6_Vqf32_vmpy_VsfVsf(Q6_V_lo_W(vx0_x1), Q6_V_hi_W(vcos_sin));
HVX_Vector vx1_c = Q6_Vqf32_vmpy_VsfVsf(Q6_V_hi_W(vx0_x1), Q6_V_lo_W(vcos_sin));
HVX_Vector vx1_s = Q6_Vqf32_vmpy_VsfVsf(Q6_V_hi_W(vx0_x1), Q6_V_hi_W(vcos_sin));
HVX_Vector v4 = Q6_Vqf32_vsub_Vqf32Vqf32(vx0_c, vx1_s);
HVX_Vector v5 = Q6_Vqf32_vadd_Vqf32Vqf32(vx0_s, vx1_c);
HVX_VectorPair vstore = Q6_W_vshuff_VVR(Q6_Vsf_equals_Vqf32(v5), Q6_Vsf_equals_Vqf32(v4), -4);
((HVX_Vector *) dst)[nvec * 2 + 0] = Q6_V_lo_W(vstore);
hvx_vec_store_u(dst + nvec * 64 + 32, (nloe - 32) * sizeof(float), Q6_V_hi_W(vstore));
}
}
}
@@ -348,13 +460,19 @@ static void rope_job_f32(unsigned int nth, unsigned int ith, void * data) {
const int32_t * pos = (const int32_t *) src1->data;
const float * freq_factors = src2 ? (const float *) src2->data : NULL;
uint32_t ir = 0;
const uint32_t i3_start = fastdiv(src0_start_row, &rctx->div_ne2_ne1);
const uint32_t rem = fastmodulo(src0_start_row, ne2 * ne1, &rctx->div_ne2_ne1);
const uint32_t i2_start = fastdiv(rem, &rctx->div_ne1);
const uint32_t i1_start = fastmodulo(rem, ne1, &rctx->div_ne1);
uint32_t ir = src0_start_row;
uint32_t prev_i2 = (uint32_t) -1;
for (uint32_t i3 = 0; i3 < ne3; i3++) { // batch
for (uint32_t i2 = 0; i2 < ne2; i2++) { // seq-len
for (uint32_t i1 = 0; i1 < ne1; ) { // attn-heads
if (ir < src0_start_row) { ir++; i1++; continue; }
for (uint32_t i3 = i3_start; i3 < ne3; i3++) { // batch
const uint32_t i2_init = (i3 == i3_start) ? i2_start : 0;
for (uint32_t i2 = i2_init; i2 < ne2; i2++) { // seq-len
const uint32_t i1_init = (i3 == i3_start && i2 == i2_start) ? i1_start : 0;
for (uint32_t i1 = i1_init; i1 < ne1; ) { // attn-heads
if (ir >= src0_end_row) goto done;
// Rows in this block
@@ -407,9 +525,6 @@ static void rope_job_f32(unsigned int nth, unsigned int ith, void * data) {
ne0, rctx->ext_factor, rctx->attn_factor,
theta_cache, rctx->theta_scale);
}
// FARF(HIGH, "rope-theta %u: ir %u i1 %u i2 %u i3 %u cache %p : usec %u", ith, ir, i1, i2, i3, theta_cache,
// (unsigned) HAP_perf_qtimer_count_to_us(HAP_perf_get_qtimer_count() - rctx->t_start));
}
// Skip output DMA transactions from prev block (if any)
@@ -489,7 +604,7 @@ static int execute_op_rope_f32(struct htp_ops_context * octx) {
// Aligned row sizes for VTCM
const size_t src0_row_size_aligned = hex_round_up(src0_row_size, VLEN);
const size_t dst_row_size_aligned = hex_round_up(dst_row_size, VLEN);
const size_t theta_cache_size_aligned = hex_round_up(src0->ne[0] * sizeof(float), 128);
const size_t theta_cache_size_aligned = hex_round_up(src0->ne[0] * sizeof(float), 256);
// Calculate spad sizes per thread
size_t src0_spad_per_thread = theta_cache_size_aligned + HTP_ROPE_SPAD_NROWS * src0_row_size_aligned;
@@ -546,6 +661,11 @@ static int execute_op_rope_f32(struct htp_ops_context * octx) {
rctx.src0_nrows = src0_nrows;
rctx.src0_nrows_per_thread = (src0_nrows + n_threads - 1) / n_threads;
if (src0_nrows > 0) {
rctx.div_ne2_ne1 = init_fastdiv_values(dst->ne[2] * dst->ne[1]);
rctx.div_ne1 = init_fastdiv_values(dst->ne[1]);
}
FARF(HIGH, "rope-f32 n-rows %u n-dims %d ne0 %u ext-factor %.6f theta-scale %.6f attn-factor %.6f\n", rctx.src0_nrows, rctx.n_dims, ne0,
rctx.ext_factor, rctx.theta_scale, rctx.attn_factor);
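
The `rope_job_f32` hunk above stops skipping rows one at a time and instead computes each thread's starting `(i3, i2, i1)` directly from `src0_start_row`. A minimal sketch of that decomposition follows, using plain `/` and `%` as stand-ins for the `fastdiv` and `fastmodulo` helpers. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct row_index_ref {
    uint32_t i1;  // attn-head index
    uint32_t i2;  // seq-len index
    uint32_t i3;  // batch index
};

// Map a flat row number to (i3, i2, i1) for a tensor with ne1 rows per i2 and ne2 per i3.
static struct row_index_ref row_to_index_ref(uint32_t row, uint32_t ne1, uint32_t ne2) {
    assert(ne1 > 0 && ne2 > 0);
    struct row_index_ref r;
    const uint32_t rem = row % (ne2 * ne1);  // offset inside the current batch
    r.i3 = row / (ne2 * ne1);
    r.i2 = rem / ne1;
    r.i1 = rem % ne1;
    return r;
}
```

With `ne1 = 4` and `ne2 = 3`, row 17 maps to `i3 = 1, i2 = 1, i1 = 1`, so a thread starting there can enter the nested loops at that position instead of counting up from row 0.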

@@ -65,6 +65,9 @@ static void set_rows_thread_f32_f32(unsigned int nth, unsigned int ith, void *da
// parallelize by rows of src0
const uint32_t dr = srctx->src0_nrows_per_thread;
const uint32_t ir0 = dr * ith;
if (ir0 >= nr) {
return;
}
const uint32_t ir1 = (ir0 + dr < nr) ? (ir0 + dr) : nr;
const bool is_i32 = (octx->src[1]->type == HTP_TYPE_I32);
@@ -109,6 +112,9 @@ static void set_rows_thread_f16_f32(unsigned int nth, unsigned int ith, void *da
// parallelize by rows of src0
const uint32_t dr = srctx->src0_nrows_per_thread;
const uint32_t ir0 = dr * ith;
if (ir0 >= nr) {
return;
}
const uint32_t ir1 = (ir0 + dr < nr) ? (ir0 + dr) : nr;
const bool is_i32 = (octx->src[1]->type == HTP_TYPE_I32);
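
The early return added to both `set_rows` workers covers the case where a thread's first row already lies past the end of the tensor. The sketch below models that partitioning. It assumes `src0_nrows_per_thread` is a rounded-up division, as in the rope code elsewhere in this diff, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Returns false when thread ith has no rows; otherwise fills the half-open range [*ir0, *ir1).
static bool thread_row_range_ref(uint32_t nr, uint32_t nth, uint32_t ith,
                                 uint32_t *ir0, uint32_t *ir1) {
    assert(nth > 0 && ith < nth);
    const uint32_t dr    = (nr + nth - 1) / nth;  // rows per thread, rounded up (assumption)
    const uint32_t start = dr * ith;
    if (start >= nr) {
        return false;                             // mirrors the new early return
    }
    *ir0 = start;
    *ir1 = (start + dr < nr) ? (start + dr) : nr; // clamp the last range to nr
    return true;
}
```

With 5 rows and 4 threads, `dr` is 2, so threads 0 to 2 get `[0,2)`, `[2,4)` and `[4,5)`, while thread 3 would start at row 6 and returns immediately.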

@@ -207,7 +207,7 @@ static void hvx_fast_norm_f32(const uint8_t * restrict src,
// scale = rsqrt(variance + epsilon), mean_x broadcast for subtraction
HVX_Vector scale_v = hvx_vec_rsqrt_f32(Q6_Vsf_equals_Vqf32(var_epsilon_v));
HVX_Vector mean_x_b = hvx_vec_splat_f32(hvx_vec_get_f32(Q6_Vsf_equals_Vqf32(mean_x_v)));
HVX_Vector mean_x_b = hvx_vec_repl_f32(Q6_Vsf_equals_Vqf32(mean_x_v));
#pragma unroll(4)
for (int i = 0; i < nvec; i++) {

@@ -224,6 +224,7 @@ struct sycl_device_info {
int max_wg_per_cu; // max work groups per compute unit - refer to
// cudaOccupancyMaxActiveBlocksPerMultiprocessor
bool vmm; // virtual memory support
size_t vmm_granularity; // granularity of virtual memory
size_t total_vram;
sycl_hw_info hw_info;
optimize_feature opt_feature;
@@ -244,6 +245,8 @@ struct ggml_sycl_device_info {
const ggml_sycl_device_info & ggml_sycl_info();
static constexpr size_t SYCL_BUFFER_ALIGNMENT = 128;
struct ggml_sycl_pool {
virtual ~ggml_sycl_pool() = default;

@@ -19,6 +19,7 @@
#include <cstdlib>
#include <float.h>
#include <limits>
#include <optional>
#include <stdint.h>
#include <stdio.h>
#include <vector>
@@ -37,6 +38,11 @@
#if defined(GGML_SYCL_GRAPH) && SYCL_EXT_ONEAPI_ASYNC_MEMORY_ALLOC
# include <sycl/ext/oneapi/experimental/async_alloc/async_alloc.hpp>
#endif
#if SYCL_EXT_ONEAPI_VIRTUAL_MEM
# include <sycl/ext/oneapi/virtual_mem/physical_mem.hpp>
# include <sycl/ext/oneapi/virtual_mem/virtual_mem.hpp>
# define GGML_SYCL_USE_VMM
#endif
#include <sycl/half_type.hpp>
#include "ggml.h"
@@ -70,6 +76,7 @@ int g_ggml_sycl_debug = 0;
int g_ggml_sycl_disable_optimize = 0;
int g_ggml_sycl_disable_graph = 0;
int g_ggml_sycl_disable_dnn = 0;
int g_ggml_sycl_enable_vmm = 1;
int g_ggml_sycl_prioritize_dmmv = 0;
int g_ggml_sycl_use_async_mem_op = 0;
int g_ggml_sycl_use_async_mem_op_requested = 1;
@@ -96,13 +103,30 @@ static ggml_sycl_device_info ggml_sycl_init() {
// GGML_LOG_INFO("%s: SYCL_USE_XMX: no\n", __func__);
// #endif
for (int i = 0; i < info.device_count; ++i) {
info.devices[i].vmm = 0;
dpct::device_info prop;
auto & device = dpct::dev_mgr::instance().get_device(i);
SYCL_CHECK(CHECK_TRY_ERROR(dpct::get_device_info(
prop, device)));
#if !defined(GGML_SYCL_USE_VMM)
info.devices[i].vmm = 0;
#else
info.devices[i].vmm = device.has(sycl::aspect::ext_oneapi_virtual_mem);
if (info.devices[i].vmm) {
// NB: SYCL's get_mem_granularity always returns the _minimum_ granularity,
// but the L0 API requires a larger page size for allocs above 2 MiB and
// rejects non-multiples with UR_RESULT_ERROR_INVALID_VALUE [sic].
// Here we clamp it to 2 MiB for simplicity, but other devices may require
// calling zeVirtualMemQueryPageSize or yet unexposed public API.
const size_t physical_page = 2ull << 20; // 2 MiB
info.devices[i].vmm_granularity = std::max<size_t>(
sycl::ext::oneapi::experimental::get_mem_granularity(
device, sycl::context(device)),
physical_page);
}
#endif
info.default_tensor_split[i] = total_vram;
total_vram += prop.get_global_mem_size();
@@ -234,6 +258,7 @@ static void ggml_check_sycl() try {
g_ggml_sycl_disable_optimize = get_sycl_env("GGML_SYCL_DISABLE_OPT", 0);
g_ggml_sycl_disable_graph = get_sycl_env("GGML_SYCL_DISABLE_GRAPH", 1);
g_ggml_sycl_disable_dnn = get_sycl_env("GGML_SYCL_DISABLE_DNN", 0);
g_ggml_sycl_enable_vmm = get_sycl_env("GGML_SYCL_ENABLE_VMM", 1);
g_ggml_sycl_prioritize_dmmv = get_sycl_env("GGML_SYCL_PRIORITIZE_DMMV", 0);
#ifdef GGML_SYCL_SUPPORT_LEVEL_ZERO
g_ggml_sycl_enable_level_zero = get_sycl_env("GGML_SYCL_ENABLE_LEVEL_ZERO", ggml_sycl_info().ext_oneapi_level_zero);
@@ -275,6 +300,11 @@ static void ggml_check_sycl() try {
#else
GGML_LOG_INFO(" GGML_SYCL_SUPPORT_LEVEL_ZERO: no\n");
#endif
#if defined(GGML_SYCL_USE_VMM)
GGML_LOG_INFO(" GGML_SYCL_USE_VMM: yes\n");
#else
GGML_LOG_INFO(" GGML_SYCL_USE_VMM: no\n");
#endif
GGML_LOG_INFO("Running with Environment Variables:\n");
GGML_LOG_INFO(" GGML_SYCL_DEBUG: %d\n", g_ggml_sycl_debug);
@@ -293,6 +323,11 @@ static void ggml_check_sycl() try {
GGML_LOG_INFO(" GGML_SYCL_DISABLE_DNN: %d\n", g_ggml_sycl_disable_dnn);
#else
GGML_LOG_INFO(" GGML_SYCL_DISABLE_DNN: DNN disabled by compile flag\n");
#endif
#if defined(GGML_SYCL_USE_VMM)
GGML_LOG_INFO(" GGML_SYCL_ENABLE_VMM: %d\n", g_ggml_sycl_enable_vmm);
#else
GGML_LOG_INFO(" GGML_SYCL_ENABLE_VMM: virtual memory extension is not available\n");
#endif
GGML_LOG_INFO(" GGML_SYCL_PRIORITIZE_DMMV: %d\n", g_ggml_sycl_prioritize_dmmv);
g_ggml_sycl_use_async_mem_op_requested = get_sycl_env("GGML_SYCL_USE_ASYNC_MEM_OP", 1);
@@ -754,7 +789,7 @@ catch (sycl::exception const &exc) {
}
static size_t ggml_backend_sycl_buffer_type_get_alignment(ggml_backend_buffer_type_t buft) {
return 128;
return SYCL_BUFFER_ALIGNMENT;
GGML_UNUSED(buft);
}
@@ -1177,7 +1212,7 @@ static ggml_backend_buffer_t ggml_backend_sycl_split_buffer_type_alloc_buffer(gg
}
static size_t ggml_backend_sycl_split_buffer_type_get_alignment(ggml_backend_buffer_type_t buft) {
return 128;
return SYCL_BUFFER_ALIGNMENT;
GGML_UNUSED(buft);
}
@@ -1462,6 +1497,121 @@ struct ggml_sycl_pool_leg : public ggml_sycl_pool {
}
};
// pool with virtual memory management
#if defined(GGML_SYCL_USE_VMM)
struct ggml_sycl_pool_vmm : public ggml_sycl_pool {
static const size_t SYCL_POOL_VMM_MAX_SIZE = 1ull << 35; // 32 GB
int device;
sycl::context ctx;
sycl::device dev;
uintptr_t pool_addr = 0;
size_t pool_used = 0;
size_t pool_size = 0;
size_t granularity;
// physical_mem owns the commits (unlike cuMemMap)
struct mapping {
sycl::ext::oneapi::experimental::physical_mem phys;
void * map_ptr;
};
std::vector<mapping> mappings;
explicit ggml_sycl_pool_vmm(queue_ptr qptr_, int device_) :
device(device_),
ctx(qptr_->get_context()),
dev(qptr_->get_device()),
granularity(ggml_sycl_info().devices[device_].vmm_granularity) {
}
~ggml_sycl_pool_vmm() {
if (pool_addr == 0) {
return;
}
// Per spec, unmap must (a) match the exact (ptr, size) of an earlier
// physical_mem::map() call and (b) precede destruction of the
// physical_mem objects (their dtors won't unmap).
for (auto & m : mappings) {
SYCL_CHECK(CHECK_TRY_ERROR(sycl::ext::oneapi::experimental::unmap(
m.map_ptr, m.phys.size(), ctx)));
}
SYCL_CHECK(CHECK_TRY_ERROR(sycl::ext::oneapi::experimental::free_virtual_mem(
pool_addr, SYCL_POOL_VMM_MAX_SIZE, ctx)));
}
void * alloc(size_t size, size_t * actual_size) override {
// round up the allocation size to the alignment to ensure that all allocations are aligned for all data types
size = GGML_PAD(size, SYCL_BUFFER_ALIGNMENT);
size_t avail = pool_size - pool_used;
if (size > avail) {
// round up to the next multiple of the granularity
size_t reserve_size = GGML_PAD(size - avail, granularity);
GGML_ASSERT(pool_size + reserve_size <= SYCL_POOL_VMM_MAX_SIZE);
// allocate more physical memory
std::optional<sycl::ext::oneapi::experimental::physical_mem> phys;
SYCL_CHECK(CHECK_TRY_ERROR(phys.emplace(dev, ctx, reserve_size)));
// reserve virtual address space (if not already reserved)
if (pool_addr == 0) {
SYCL_CHECK(CHECK_TRY_ERROR(
pool_addr = sycl::ext::oneapi::experimental::reserve_virtual_mem(
SYCL_POOL_VMM_MAX_SIZE, ctx)));
}
// map at the end of the pool
void * map_ptr = nullptr;
SYCL_CHECK(CHECK_TRY_ERROR(
map_ptr = phys->map(pool_addr + pool_size, reserve_size,
sycl::ext::oneapi::experimental::address_access_mode::read_write)));
// stash these so we could unmap this exact range in dtor
mappings.push_back({
std::move(*phys),
map_ptr,
});
// add to the pool
pool_size += reserve_size;
#ifdef DEBUG_SYCL_MALLOC
GGML_LOG_INFO("sycl pool[%d]: size increased to %llu MB (reserved %llu MB)\n",
device, (unsigned long long) (pool_size/1024/1024),
(unsigned long long) (reserve_size/1024/1024));
#endif
}
GGML_ASSERT(pool_addr != 0);
void * ptr = reinterpret_cast<void *>(pool_addr + pool_used);
*actual_size = size;
pool_used += size;
#ifdef DEBUG_SYCL_MALLOC
GGML_LOG_INFO("sycl pool[%d]: allocated %llu bytes at %p\n", device, (unsigned long long) size, ptr);
#endif
return ptr;
}
void free(void * ptr, size_t size) override {
#ifdef DEBUG_SYCL_MALLOC
GGML_LOG_INFO("sycl pool[%d]: freed %llu bytes at %p\n", device, (unsigned long long) size, ptr);
#endif
pool_used -= size;
// all deallocations must be in reverse order of the allocations
GGML_ASSERT(ptr == reinterpret_cast<void *>(pool_addr + pool_used));
}
};
#endif // defined(GGML_SYCL_USE_VMM)
struct ggml_sycl_pool_host : public ggml_sycl_pool {
queue_ptr qptr;
int device;
@@ -1542,20 +1692,19 @@ std::unique_ptr<ggml_sycl_pool> ggml_backend_sycl_context::new_pool_for_host(que
}
std::unique_ptr<ggml_sycl_pool> ggml_backend_sycl_context::new_pool_for_device(queue_ptr qptr, int device) {
// TBD: NO VMM support
// if (ggml_sycl_info().devices[device].vmm) {
// return std::unique_ptr<ggml_sycl_pool>(new ggml_sycl_pool_vmm(device));
// }
return std::unique_ptr<ggml_sycl_pool>(new ggml_sycl_pool_leg(qptr, device));
#if defined(GGML_SYCL_USE_VMM)
if (g_ggml_sycl_enable_vmm && ggml_sycl_info().devices[device].vmm) {
return std::unique_ptr<ggml_sycl_pool>(new ggml_sycl_pool_vmm(qptr, device));
}
#endif // defined(GGML_SYCL_USE_VMM)
return std::unique_ptr<ggml_sycl_pool>(new ggml_sycl_pool_leg(qptr, device));
}
std::unique_ptr<ggml_sycl_fattn_kv_buffers> ggml_backend_sycl_context::new_fattn_kv_buffers(queue_ptr qptr, int device) {
return std::unique_ptr<ggml_sycl_fattn_kv_buffers>(new ggml_sycl_fattn_kv_buffers(qptr, device));
}
// TBD pool with virtual memory management
// struct ggml_sycl_pool_vmm : public ggml_sycl_pool
/// kernels
typedef void (*ggml_sycl_op_mul_mat_t)(
ggml_backend_sycl_context & ctx,
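Reviewer sketch (not part of the commit): the allocation discipline of ggml_sycl_pool_vmm above can be modelled without any SYCL calls. The struct below is a hypothetical stand-in: `bump_pool`, its made-up base address, and the alignment value of 128 (assumed from the `return 128;` line that SYCL_BUFFER_ALIGNMENT replaces) are illustration-only names and values. Mapping physical memory is reduced to growing a counter in granularity-sized steps, which is enough to show the round-up logic and the reverse-order free requirement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a grow-only, stack-ordered pool. Plain integers stand
// in for the reserved virtual range and the mapped physical memory.
struct bump_pool {
    static constexpr size_t alignment = 128;     // assumed SYCL_BUFFER_ALIGNMENT
    size_t    granularity = size_t(2) << 20;     // 2 MiB, as clamped in the diff
    size_t    max_size    = size_t(1) << 35;     // size of the reserved range
    uintptr_t base        = 0x10000000;          // made-up base address
    size_t    used        = 0;                   // bytes handed out so far
    size_t    mapped      = 0;                   // bytes currently backed

    static size_t pad(size_t x, size_t n) { return (x + n - 1) / n * n; }

    uintptr_t alloc(size_t size, size_t * actual_size) {
        size = pad(size, alignment);
        const size_t avail = mapped - used;
        if (size > avail) {
            // grow by the shortfall, rounded up to the mapping granularity
            const size_t grow = pad(size - avail, granularity);
            assert(mapped + grow <= max_size);
            mapped += grow;
        }
        const uintptr_t ptr = base + used;
        *actual_size = size;
        used += size;
        return ptr;
    }

    // `size` must be the actual_size reported by alloc; frees must come in
    // reverse order of the allocations, otherwise the assert fires.
    void free(uintptr_t ptr, size_t size) {
        used -= size;
        assert(ptr == base + used);
    }
};
```

Note that the backing never shrinks in this model: freeing only moves `used` back, so a later allocation of similar size reuses the already mapped range.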


@@ -79,6 +79,12 @@ if (Vulkan_FOUND)
"GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT"
)
test_shader_extension_support(
"GL_NV_cooperative_matrix_decode_vector"
"${CMAKE_CURRENT_SOURCE_DIR}/vulkan-shaders/feature-tests/coopmat2_decode_vector.comp"
"GGML_VULKAN_COOPMAT2_DECODE_VECTOR_GLSLC_SUPPORT"
)
test_shader_extension_support(
"GL_EXT_integer_dot_product"
"${CMAKE_CURRENT_SOURCE_DIR}/vulkan-shaders/feature-tests/integer_dot.comp"


@@ -21,6 +21,19 @@ DispatchLoaderDynamic & ggml_vk_default_dispatcher();
#include <vulkan/vulkan.hpp>
// Fallback definitions for VK_NV_cooperative_matrix_decode_vector in case the
// installed Vulkan headers predate the extension.
#ifndef VK_NV_cooperative_matrix_decode_vector
#define VK_NV_cooperative_matrix_decode_vector 1
#define VK_NV_COOPERATIVE_MATRIX_DECODE_VECTOR_EXTENSION_NAME "VK_NV_cooperative_matrix_decode_vector"
#define VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_DECODE_VECTOR_FEATURES_NV ((VkStructureType)1000689000)
typedef struct VkPhysicalDeviceCooperativeMatrixDecodeVectorFeaturesNV {
VkStructureType sType;
void* pNext;
VkBool32 cooperativeMatrixDecodeVector;
} VkPhysicalDeviceCooperativeMatrixDecodeVectorFeaturesNV;
#endif
// SPIR-V Headers: different SDK installations expose different include paths.
// LunarG Vulkan SDK on Windows typically provides <spirv-headers/spirv.hpp>.
// Linux packages, MSYS2 and MinGW often use the Khronos layout <spirv/unified1/spirv.hpp>.
@@ -398,6 +411,7 @@ enum vk_conv_shapes {
CONV_SHAPE_128x128,
CONV_SHAPE_64x32,
CONV_SHAPE_32x256,
CONV_SHAPE_64x128,
CONV_SHAPE_COUNT,
};
@@ -412,6 +426,7 @@ vk_conv_block_size vk_conv_block_sizes[CONV_SHAPE_COUNT] = {
{ 128, 128, 16 }, // CONV_SHAPE_128x128
{ 64, 32, 32 }, // CONV_SHAPE_64x32
{ 32, 256, 16 }, // CONV_SHAPE_32x256
{ 64, 128, 16 }, // CONV_SHAPE_64x128
};
enum dmmv_wg_sizes {
@@ -447,14 +462,16 @@ struct vk_fa_pipeline_state {
};
struct vk_conv2d_pipeline_state {
vk_conv2d_pipeline_state(uint32_t s0, uint32_t s1, uint32_t p0, uint32_t p1, uint32_t d0, uint32_t d1, uint32_t KW, uint32_t KH)
: s0(s0), s1(s1), p0(p0), p1(p1), d0(d0), d1(d1), KW(KW), KH(KH) {}
vk_conv2d_pipeline_state(uint32_t s0, uint32_t s1, uint32_t p0, uint32_t p1, uint32_t d0, uint32_t d1, uint32_t KW, uint32_t KH, uint32_t aligned)
: s0(s0), s1(s1), p0(p0), p1(p1), d0(d0), d1(d1), KW(KW), KH(KH), aligned(aligned) {}
uint32_t s0, s1, p0, p1, d0, d1, KW, KH;
// when set, shader can skip K/CRS/NPQ bounds checks and address clamps
uint32_t aligned;
bool operator<(const vk_conv2d_pipeline_state &b) const {
return std::tie(s0, s1, p0, p1, d0, d1, KW, KH) <
std::tie(b.s0, b.s1, b.p0, b.p1, b.d0, b.d1, b.KW, b.KH);
return std::tie(s0, s1, p0, p1, d0, d1, KW, KH, aligned) <
std::tie(b.s0, b.s1, b.p0, b.p1, b.d0, b.d1, b.KW, b.KH, b.aligned);
}
};
@@ -674,6 +691,7 @@ struct vk_device_struct {
uint32_t coopmat_int_k;
bool coopmat2;
bool coopmat2_decode_vector;
bool pipeline_executable_properties_support {};
@@ -764,7 +782,8 @@ struct vk_device_struct {
vk_pipeline pipeline_clamp_f32;
vk_pipeline pipeline_pad_f32;
vk_pipeline pipeline_roll_f32;
vk_pipeline pipeline_repeat_f32, pipeline_repeat_back_f32;
vk_pipeline pipeline_repeat_i32, pipeline_repeat_back_f32;
vk_pipeline pipeline_repeat_i16;
vk_pipeline pipeline_cpy_f32_f32, pipeline_cpy_f32_f16, pipeline_cpy_f16_f16, pipeline_cpy_f16_f32, pipeline_cpy_f32_bf16, pipeline_cpy_bf16_f32, pipeline_cpy_f32_i32, pipeline_cpy_i32_f32;
vk_pipeline pipeline_contig_cpy_f32_f32, pipeline_contig_cpy_f32_f16, pipeline_contig_cpy_f16_f16, pipeline_contig_cpy_f16_f32, pipeline_contig_cpy_f32_bf16, pipeline_contig_cpy_bf16_f32, pipeline_contig_cpy_f32_i32, pipeline_contig_cpy_i32_f32;
vk_pipeline pipeline_cpy_f32_quant[GGML_TYPE_COUNT];
@@ -2162,6 +2181,136 @@ static uint32_t compile_count = 0;
static std::mutex compile_count_mutex;
static std::condition_variable compile_count_cond;
static constexpr uint32_t kSpvOpCooperativeMatrixLoadTensorNV = 5367;
static constexpr uint32_t kSpvCapabilityCooperativeMatrixDecodeVectorNV = 5447;
static constexpr uint32_t kSpvTensorAddressingDecodeVectorFuncBit = 0x4;
// Remove SPV_NV_cooperative_matrix_decode_vector usage from a SPIR-V module so it
// can be loaded on drivers that only support SPV_NV_cooperative_matrix2. Drops the
// OpExtension declaration, the CooperativeMatrixDecodeVectorNV OpCapability, and the
// DecodeVectorFunc operand from any OpCooperativeMatrixLoadTensorNV instruction.
// Returns true when the input used the extension (and `out` was populated with a
// stripped copy); returns false otherwise without touching `out`.
static bool ggml_vk_strip_decode_vector(const uint32_t * code, size_t word_count, std::vector<uint32_t> & out) {
static const char kDecodeVectorExt[] = "SPV_NV_cooperative_matrix_decode_vector";
if (word_count < 5) {
return false;
}
bool uses_decode_vector = false;
for (size_t pos = 5; pos < word_count; ) {
uint32_t word = code[pos];
uint32_t wc = word >> spv::WordCountShift;
uint32_t op = word & spv::OpCodeMask;
GGML_ASSERT(wc > 0 && pos + wc <= word_count);
if (op == spv::OpExtension && wc >= 2) {
const char * s = reinterpret_cast<const char *>(&code[pos + 1]);
if (strcmp(s, kDecodeVectorExt) == 0) {
uses_decode_vector = true;
break;
}
}
pos += wc;
}
if (!uses_decode_vector) {
return false;
}
VK_LOG_DEBUG("ggml_vk_strip_decode_vector: stripping SPV_NV_cooperative_matrix_decode_vector");
// Bulk-copy unchanged runs and only break the run when an instruction needs to
// be dropped or patched. Use reserve + insert/push_back so the destination buffer
// is touched exactly once (no zero-initialization pass from resize()).
out.clear();
out.reserve(word_count);
size_t run_start = 0;
auto flush_run = [&](size_t up_to) {
if (up_to > run_start) {
out.insert(out.end(), code + run_start, code + up_to);
}
};
for (size_t pos = 5; pos < word_count; ) {
uint32_t word = code[pos];
uint32_t wc = word >> spv::WordCountShift;
uint32_t op = word & spv::OpCodeMask;
GGML_ASSERT(wc > 0 && pos + wc <= word_count);
if (op == spv::OpExtension && wc >= 2) {
const char * s = reinterpret_cast<const char *>(&code[pos + 1]);
if (strcmp(s, kDecodeVectorExt) == 0) {
flush_run(pos);
pos += wc;
run_start = pos;
continue;
}
}
if (op == spv::OpCapability && wc == 2 && code[pos + 1] == kSpvCapabilityCooperativeMatrixDecodeVectorNV) {
flush_run(pos);
pos += wc;
run_start = pos;
continue;
}
if (op == kSpvOpCooperativeMatrixLoadTensorNV) {
// [opcode/wc][ResultType][Result][Pointer][Object][TensorLayout][MemOperand mask][mem extras...][TA mask][ta extras...]
GGML_ASSERT(wc >= 8);
uint32_t mem_mask = code[pos + 6];
size_t cur = pos + 7;
// Each of these MemoryAccess bits (when set) carries one trailing operand.
cur += (mem_mask & 0x2) ? 1 : 0; // Aligned
cur += (mem_mask & 0x8) ? 1 : 0; // MakePointerAvailable
cur += (mem_mask & 0x10) ? 1 : 0; // MakePointerVisible
cur += (mem_mask & 0x10000) ? 1 : 0; // AliasScopeINTELMask
cur += (mem_mask & 0x20000) ? 1 : 0; // NoAliasINTELMask
GGML_ASSERT(cur < pos + wc);
uint32_t ta_mask = code[cur];
if ((ta_mask & kSpvTensorAddressingDecodeVectorFuncBit) == 0) {
pos += wc;
continue; // leave instruction inside the current unchanged run
}
flush_run(pos);
// Append unchanged prefix of the instruction (header through the mem-extras).
size_t inst_start = out.size();
size_t pre_n = cur - pos;
out.insert(out.end(), code + pos, code + pos + pre_n);
// Emit TA mask with the DecodeVectorFunc bit cleared.
out.push_back(ta_mask & ~kSpvTensorAddressingDecodeVectorFuncBit);
// TA extras: TensorView (0x1) and DecodeFunc (0x2) are kept verbatim;
// DecodeVectorFunc (0x4) is dropped along with its trailing id operand.
size_t keep_ta_extras = ((ta_mask & 0x1) ? 1 : 0) + ((ta_mask & 0x2) ? 1 : 0);
if (keep_ta_extras) {
out.insert(out.end(), code + cur + 1, code + cur + 1 + keep_ta_extras);
}
GGML_ASSERT(wc == pre_n + 1 + keep_ta_extras + 1);
// Patch the instruction header with the new (one-shorter) word count.
uint32_t new_wc = wc - 1;
out[inst_start] = (new_wc << spv::WordCountShift) | op;
pos += wc;
run_start = pos;
continue;
}
pos += wc;
}
flush_run(word_count);
return true;
}
static void ggml_vk_create_pipeline_func(vk_device& device, vk_pipeline& pipeline, size_t spv_size, const void* spv_data, const std::string entrypoint,
uint32_t parameter_count, std::array<uint32_t, 3> wg_denoms, std::vector<uint32_t> specialization_constants,
bool disable_robustness, bool require_full_subgroups, uint32_t required_subgroup_size) {
@@ -2233,6 +2382,18 @@ static void ggml_vk_create_pipeline_func(vk_device& device, vk_pipeline& pipelin
shader_module_create_info = vk::ShaderModuleCreateInfo({}, spirv.size() * sizeof(uint32_t), spirv.data());
}
#if defined(GGML_VULKAN_COOPMAT2_DECODE_VECTOR_GLSLC_SUPPORT)
if (device->coopmat2 && !device->coopmat2_decode_vector) {
const uint32_t * src = spirv.empty() ? reinterpret_cast<const uint32_t *>(spv_data) : spirv.data();
size_t src_n = spirv.empty() ? spv_size / sizeof(uint32_t) : spirv.size();
std::vector<uint32_t> stripped;
if (ggml_vk_strip_decode_vector(src, src_n, stripped)) {
spirv = std::move(stripped);
shader_module_create_info = vk::ShaderModuleCreateInfo({}, spirv.size() * sizeof(uint32_t), spirv.data());
}
}
#endif
pipeline->shader_module = device->device.createShaderModule(shader_module_create_info);
vk::PushConstantRange pcr(
@@ -4704,9 +4865,11 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device->pipeline_roll_f32, "roll_f32", roll_f32_len, roll_f32_data, "main", 2, sizeof(vk_op_unary_push_constants), {512, 1, 1}, {}, 1);
ggml_vk_create_pipeline(device, device->pipeline_repeat_f32, "repeat_f32", repeat_f32_len, repeat_f32_data, "main", 2, sizeof(vk_op_unary_push_constants), {512, 1, 1}, {}, 1);
ggml_vk_create_pipeline(device, device->pipeline_repeat_i32, "repeat_i32", repeat_i32_len, repeat_i32_data, "main", 2, sizeof(vk_op_unary_push_constants), {512, 1, 1}, {}, 1);
ggml_vk_create_pipeline(device, device->pipeline_repeat_back_f32, "repeat_back_f32", repeat_back_f32_len, repeat_back_f32_data, "main", 2, sizeof(vk_op_unary_push_constants), {512, 1, 1}, {}, 1);
ggml_vk_create_pipeline(device, device->pipeline_repeat_i16, "repeat_i16", repeat_i16_len, repeat_i16_data, "main", 2, sizeof(vk_op_unary_push_constants), {512, 1, 1}, {}, 1);
#define CREATE_UNARY(name) \
ggml_vk_create_pipeline(device, device->pipeline_ ## name [0], #name "_f32", name ## _f32_len, name ## _f32_data, "main", 2, sizeof(vk_op_push_constants), {512, 1, 1}, {}, 1); \
ggml_vk_create_pipeline(device, device->pipeline_ ## name [1], #name "_f16", name ## _f16_len, name ## _f16_data, "main", 2, sizeof(vk_op_push_constants), {512, 1, 1}, {}, 1);
@@ -4934,7 +5097,8 @@ static void ggml_vk_load_shaders(vk_device& device) {
// conv2d, conv_transpose_2d
for (uint32_t s = 0; s < CONV_SHAPE_COUNT; ++s) {
uint32_t conv2d_WG_SIZE = 256;
// smaller WG for the small-tile fallback gives more concurrent WGs per SM
uint32_t conv2d_WG_SIZE = (s == CONV_SHAPE_64x32) ? 128 : 256;
uint32_t use_collectives = 0; // Enables subgroup ops for preventing the re-calculation of indices.
uint32_t conv2d_TS_K = (s == CONV_SHAPE_64x32) ? 4 : 8;
uint32_t conv2d_SHMEM_PAD = 4;
@@ -4973,18 +5137,77 @@ static void ggml_vk_load_shaders(vk_device& device) {
conv2d_BS.CRS); // CRS block size should be capped at subgroup size for correctness when shuffle is used.
}
uint32_t conv2d_shmem_req =
(conv2d_BS.K * (conv2d_BS.CRS + conv2d_SHMEM_PAD) + conv2d_BS.CRS * (conv2d_BS.NPQ + conv2d_SHMEM_PAD)) * sizeof(float);
if (device->properties.limits.maxComputeSharedMemorySize < conv2d_shmem_req) {
// cm1 is used only when cm2 is unavailable; capped at 64x128 (due to shared memory size).
// Requires 16x16x16 f16-acc since that's the fragment shape hard-coded in the shader.
// Subgroup size must be 32 or 64 (to keep WG_SIZE sane) and we need
// subgroup_size_control to force the driver to actually use it.
bool conv2d_use_cm1 = false;
#if defined(VK_KHR_cooperative_matrix) && defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
conv2d_use_cm1 = !device->coopmat2 &&
device->coopmat_support && device->coopmat_support_16x16x16_f16acc &&
device->subgroup_size_control &&
(device->subgroup_size == 32 || device->subgroup_size == 64) &&
s != CONV_SHAPE_128x128;
#endif
const uint32_t conv2d_cm1_shmem_pad = 8;
auto shmem_req = [&](uint32_t pad, bool csh_store, bool fp16_shmem) {
const uint32_t elem_size = fp16_shmem ? (uint32_t)sizeof(uint16_t) : (uint32_t)sizeof(float);
const uint32_t csh_elems = csh_store ? conv2d_BS.K * conv2d_BS.NPQ : 0u;
return (conv2d_BS.K * (conv2d_BS.CRS + pad) + conv2d_BS.CRS * (conv2d_BS.NPQ + pad) + csh_elems) * elem_size;
};
// coopmat1 needs to store the output through shared memory, so check up front
// whether it'll fit and disable it before applying coopmat1 parameters.
if (conv2d_use_cm1 && device->properties.limits.maxComputeSharedMemorySize < shmem_req(conv2d_cm1_shmem_pad, true, true)) {
conv2d_use_cm1 = false;
}
uint32_t conv2d_WM = 16, conv2d_WN = 16; // cm1 subgroup tile, ignored otherwise
if (conv2d_use_cm1) {
conv2d_SHMEM_PAD = conv2d_cm1_shmem_pad;
// 16x16x16 fragments; pick WM/WN to keep WG_SIZE at 256
// (i.e. 8 subgroups for sg=32, 4 subgroups for sg=64).
const bool sg64 = (device->subgroup_size == 64);
switch (s) {
case CONV_SHAPE_64x32: conv2d_WM = sg64 ? 32 : 16; conv2d_WN = 16; break;
case CONV_SHAPE_64x128: conv2d_WM = 32; conv2d_WN = sg64 ? 64 : 32; break;
case CONV_SHAPE_32x256: conv2d_WM = sg64 ? 16 : 32; conv2d_WN = sg64 ? 128 : 32; break;
default: break;
}
const uint32_t warps_M = conv2d_BS.K / conv2d_WM;
const uint32_t warps_N = conv2d_BS.NPQ / conv2d_WN;
conv2d_WG_SIZE = warps_M * warps_N * device->subgroup_size;
}
// stage cm2 accumulator through shmem for coalesced global stores;
// skipped on 128x128 where the extra Csh footprint hurts occupancy.
// cm1 always uses the staged path.
uint32_t conv2d_csh_store = (device->coopmat2 && s != CONV_SHAPE_128x128) ? 1u : 0u;
if (conv2d_use_cm1) {
conv2d_csh_store = 1;
}
// shmem is fp16 on cm2/cm1 (matches Csh), fp32 on scalar
const bool conv2d_use_fp16_shmem = device->coopmat2 || conv2d_use_cm1;
// shrink CRS if the non-cm1 config still doesn't fit
if (device->properties.limits.maxComputeSharedMemorySize < shmem_req(conv2d_SHMEM_PAD, conv2d_csh_store, conv2d_use_fp16_shmem)) {
GGML_ASSERT(!conv2d_use_cm1);
conv2d_BS.CRS = 8;
if (use_collectives) {
conv2d_BS.CRS = std::min(device->subgroup_size, conv2d_BS.CRS);
}
conv2d_csh_store = 0;
}
std::array<uint32_t, 3> wg_denoms = { conv2d_BS.K, 1, 1 };
std::vector<uint32_t> spec_constants = { conv2d_WG_SIZE, conv2d_BS.K, conv2d_BS.CRS, conv2d_BS.NPQ, conv2d_TS_K, use_collectives, conv2d_SHMEM_PAD };
// cm1 needs a fixed subgroup width to match the WG_SIZE we computed
const uint32_t conv2d_required_subgroup_size = conv2d_use_cm1 ? device->subgroup_size : 0;
#define CREATE_CONV(name, type_suffix, spv_suffix) \
for (auto &c : device->pipeline_##name##type_suffix[s]) { \
const vk_conv2d_pipeline_state &state = c.first; \
@@ -4997,10 +5220,14 @@ static void ggml_vk_load_shaders(vk_device& device) {
spec_constants_cpy.push_back(state.d1); \
spec_constants_cpy.push_back(state.KW); \
spec_constants_cpy.push_back(state.KH); \
spec_constants_cpy.push_back(state.aligned); \
spec_constants_cpy.push_back(conv2d_csh_store); \
spec_constants_cpy.push_back(conv2d_WM); \
spec_constants_cpy.push_back(conv2d_WN); \
ggml_vk_create_pipeline( \
device, c.second, #name #type_suffix, \
name##type_suffix##spv_suffix##_len, name##type_suffix##spv_suffix##_data, "main", 3, \
sizeof(vk_op_conv2d_push_constants), wg_denoms, spec_constants_cpy, 1, true, use_collectives); \
sizeof(vk_op_conv2d_push_constants), wg_denoms, spec_constants_cpy, 1, true, use_collectives || conv2d_required_subgroup_size, conv2d_required_subgroup_size); \
}
#define CREATE_CONVS(spv_suffix) \
CREATE_CONV(conv2d, _f32, spv_suffix) \
@@ -5011,6 +5238,11 @@ static void ggml_vk_load_shaders(vk_device& device) {
if (device->coopmat2) {
CREATE_CONVS(_cm2)
} else
#endif
#if defined(VK_KHR_cooperative_matrix) && defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
if (conv2d_use_cm1) {
CREATE_CONVS(_cm1)
} else
#endif
if (conv2d_UNROLL) {
CREATE_CONVS(_unroll)
@@ -5083,6 +5315,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
bool amd_shader_core_properties2 = false;
bool pipeline_robustness = false;
bool coopmat2_support = false;
bool coopmat2_decode_vector_support = false;
bool pipeline_executable_properties_support = false;
device->coopmat_support = false;
device->integer_dot_product = false;
@@ -5117,6 +5350,9 @@ static vk_device ggml_vk_get_device(size_t idx) {
!getenv("GGML_VK_DISABLE_COOPMAT2")) {
coopmat2_support = true;
#endif
} else if (strcmp(VK_NV_COOPERATIVE_MATRIX_DECODE_VECTOR_EXTENSION_NAME, properties.extensionName) == 0 &&
!getenv("GGML_VK_DISABLE_COOPMAT2_DECODE_VECTOR")) {
coopmat2_decode_vector_support = true;
#if defined(GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
} else if (strcmp("VK_KHR_shader_integer_dot_product", properties.extensionName) == 0 &&
!getenv("GGML_VK_DISABLE_INTEGER_DOT_PRODUCT")) {
@@ -5394,6 +5630,14 @@ static vk_device ggml_vk_get_device(size_t idx) {
}
#endif
VkPhysicalDeviceCooperativeMatrixDecodeVectorFeaturesNV coopmat2_decode_vector_features {};
coopmat2_decode_vector_features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_DECODE_VECTOR_FEATURES_NV;
if (coopmat2_decode_vector_support) {
last_struct->pNext = (VkBaseOutStructure *)&coopmat2_decode_vector_features;
last_struct = (VkBaseOutStructure *)&coopmat2_decode_vector_features;
device_extensions.push_back(VK_NV_COOPERATIVE_MATRIX_DECODE_VECTOR_EXTENSION_NAME);
}
#if defined(VK_KHR_shader_bfloat16)
VkPhysicalDeviceShaderBfloat16FeaturesKHR bfloat16_features {};
bfloat16_features.pNext = nullptr;
@@ -5553,6 +5797,7 @@ static vk_device ggml_vk_get_device(size_t idx) {
found_fp32_128 && found_fp32_256 &&
coopmat2_props.cooperativeMatrixFlexibleDimensionsMaxDimension >= 512) {
device->coopmat2 = true;
device->coopmat2_decode_vector = coopmat2_decode_vector_support && coopmat2_decode_vector_features.cooperativeMatrixDecodeVector;
}
}
#endif
@@ -5768,8 +6013,12 @@ static vk_device ggml_vk_get_device(size_t idx) {
ggml_vk_load_shaders(device);
// Only use transfer queue on AMD non-GCN, when the graphics queue is not enabled
const bool prefers_transfer_queue = device->vendor_id == VK_VENDOR_ID_AMD && device->architecture != AMD_GCN && !allow_graphics_queue;
// Prefer a dedicated transfer queue on AMD dGPUs (non-GCN) when graphics queue use is disabled.
const bool prefers_transfer_queue =
device->vendor_id == VK_VENDOR_ID_AMD &&
device->architecture != AMD_GCN &&
!device->uma &&
!allow_graphics_queue;
if (!device->single_queue) {
const uint32_t transfer_queue_index = compute_queue_family_index == transfer_queue_family_index ? 1 : 0;
@@ -5835,6 +6084,7 @@ static void ggml_vk_print_gpu_info(size_t idx) {
bool fp16_compute = false;
bool coopmat_support = false;
bool coopmat2_support = false;
bool coopmat2_decode_vector_support = false;
bool integer_dot_product = false;
bool bfloat16_support = false;
@@ -5853,6 +6103,9 @@ static void ggml_vk_print_gpu_info(size_t idx) {
!getenv("GGML_VK_DISABLE_COOPMAT2")) {
coopmat2_support = true;
#endif
} else if (strcmp(VK_NV_COOPERATIVE_MATRIX_DECODE_VECTOR_EXTENSION_NAME, properties.extensionName) == 0 &&
!getenv("GGML_VK_DISABLE_COOPMAT2_DECODE_VECTOR")) {
coopmat2_decode_vector_support = true;
#if defined(GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
} else if (strcmp("VK_KHR_shader_integer_dot_product", properties.extensionName) == 0 &&
!getenv("GGML_VK_DISABLE_INTEGER_DOT_PRODUCT")) {
@@ -5937,6 +6190,13 @@ static void ggml_vk_print_gpu_info(size_t idx) {
}
#endif
VkPhysicalDeviceCooperativeMatrixDecodeVectorFeaturesNV coopmat2_decode_vector_features {};
coopmat2_decode_vector_features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_DECODE_VECTOR_FEATURES_NV;
if (coopmat2_decode_vector_support) {
last_struct->pNext = (VkBaseOutStructure *)&coopmat2_decode_vector_features;
last_struct = (VkBaseOutStructure *)&coopmat2_decode_vector_features;
}
vkGetPhysicalDeviceFeatures2(physical_device, &device_features2);
fp16 = fp16 && vk12_features.shaderFloat16;
@@ -5961,7 +6221,14 @@ static void ggml_vk_print_gpu_info(size_t idx) {
#endif
&& ggml_vk_khr_cooperative_matrix_support(props2.properties, driver_props, device_architecture);
std::string matrix_cores = coopmat2_support ? "NV_coopmat2" : coopmat_support ? "KHR_coopmat" : "none";
coopmat2_decode_vector_support = coopmat2_decode_vector_support && coopmat2_decode_vector_features.cooperativeMatrixDecodeVector;
#if !defined(GGML_VULKAN_COOPMAT2_DECODE_VECTOR_GLSLC_SUPPORT)
coopmat2_decode_vector_support = false;
#endif
std::string matrix_cores = coopmat2_support ? (coopmat2_decode_vector_support ? "NV_coopmat2v" : "NV_coopmat2")
: coopmat_support ? "KHR_coopmat"
: "none";
std::string device_name = props2.properties.deviceName.data();
GGML_LOG_DEBUG("ggml_vulkan: %zu = %s (%s) | uma: %d | fp16: %d | bf16: %d | warp size: %zu | shared memory: %d | int dot: %d | matrix cores: %s\n",
@@ -9473,10 +9740,23 @@ static vk_conv_shapes ggml_vk_conv_select_shape(ggml_backend_vk_context * ctx, u
// so small convolutions will still choose a smaller tile.
const uint32_t shader_core_count = ctx->device->shader_core_count > 0 ? ctx->device->shader_core_count : 32;
if (K > 64 && n_tiles(CONV_SHAPE_128x128) >= shader_core_count * 2) {
// 128x128 isn't used with cm1 due to shared memory size; fall through to a smaller tile.
bool allow_128x128 = true;
#if defined(VK_KHR_cooperative_matrix) && defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
if (!ctx->device->coopmat2 && ctx->device->coopmat_support && ctx->device->coopmat_support_16x16x16_f16acc) {
allow_128x128 = false;
}
#endif
if (allow_128x128 && K > 64 && n_tiles(CONV_SHAPE_128x128) >= shader_core_count * 2) {
return CONV_SHAPE_128x128;
} else if (K <= 32 && n_tiles(CONV_SHAPE_32x256) >= shader_core_count * 2) {
return CONV_SHAPE_32x256;
} else if (K <= 64 && n_tiles(CONV_SHAPE_64x128) >= shader_core_count * 2) {
return CONV_SHAPE_64x128;
} else if (!allow_128x128 && K > 64 && n_tiles(CONV_SHAPE_64x128) >= shader_core_count * 2) {
// cm1 fallback for large K when 128x128 isn't available
return CONV_SHAPE_64x128;
} else {
return CONV_SHAPE_64x32;
}
@@ -9648,7 +9928,10 @@ static vk_pipeline ggml_vk_op_get_pipeline(ggml_backend_vk_context * ctx, const
return nullptr;
case GGML_OP_REPEAT:
if (ggml_type_size(src0->type) == sizeof(float) && ggml_type_size(dst->type) == sizeof(float)) {
return ctx->device->pipeline_repeat_f32;
return ctx->device->pipeline_repeat_i32;
}
if (ggml_type_size(src0->type) == 2 && ggml_type_size(dst->type) == 2) {
return ctx->device->pipeline_repeat_i16;
}
return nullptr;
case GGML_OP_REPEAT_BACK:
@@ -10008,7 +10291,18 @@ static vk_pipeline ggml_vk_op_get_pipeline(ggml_backend_vk_context * ctx, const
uint32_t p1 = !transpose ? (uint32_t)ggml_get_op_params_i32(dst, 3) : 0;
uint32_t d0 = !transpose ? (uint32_t)ggml_get_op_params_i32(dst, 4) : 1;
uint32_t d1 = !transpose ? (uint32_t)ggml_get_op_params_i32(dst, 5) : 1;
vk_conv2d_pipeline_state conv2d_pipeline_state(s0, s1, p0, p1, d0, d1, KW, KH);
// tile-aligned shapes let the shader skip bounds checks
const uint32_t Cin = (uint32_t)src1->ne[2];
const uint32_t CRS = Cin * KW * KH;
const uint32_t BS_K = vk_conv_block_sizes[shape].K;
const uint32_t BS_CRS = vk_conv_block_sizes[shape].CRS;
const uint32_t BS_NPQ = vk_conv_block_sizes[shape].NPQ;
const uint32_t aligned = ((K % BS_K == 0) &&
(CRS % BS_CRS == 0) &&
(NPQ % BS_NPQ == 0)) ? 1u : 0u;
vk_conv2d_pipeline_state conv2d_pipeline_state(s0, s1, p0, p1, d0, d1, KW, KH, aligned);
std::map<vk_conv2d_pipeline_state, vk_pipeline> *pipelines = nullptr;
if (op == GGML_OP_CONV_2D) {
@@ -16152,7 +16446,8 @@ static bool ggml_backend_vk_device_supports_op(ggml_backend_dev_t dev, const ggm
return false;
}
case GGML_OP_REPEAT:
return ggml_type_size(op->type) == sizeof(float) && ggml_type_size(op->src[0]->type) == sizeof(float);
return ggml_type_size(op->type) == ggml_type_size(op->src[0]->type) &&
(ggml_type_size(op->type) == sizeof(float) || ggml_type_size(op->type) == 2);
case GGML_OP_REPEAT_BACK:
return op->type == GGML_TYPE_F32 && op->src[0]->type == GGML_TYPE_F32;
case GGML_OP_ROPE:
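Reviewer sketch (not part of the commit): the instruction walk used by ggml_vk_strip_decode_vector above relies on the SPIR-V physical layout, where a module starts with a five-word header and every instruction begins with a word holding the word count in the high 16 bits and the opcode in the low 16 bits. The helper below is a reduced, hypothetical version of that idea. `drop_capability` is an illustration-only name, and it handles only the OpCapability case, not OpExtension or the OpCooperativeMatrixLoadTensorNV operand patching. Unlike the commit, which asserts on malformed input, this sketch stops walking and copies the remainder unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Values from the SPIR-V specification.
constexpr uint32_t kWordCountShift = 16;
constexpr uint32_t kOpCodeMask     = 0xFFFF;
constexpr uint32_t kOpCapability   = 17;
constexpr size_t   kHeaderWords    = 5;

// Hypothetical helper: return a copy of `code` without any OpCapability
// instruction whose operand equals `capability`. Unchanged runs are copied
// in bulk; a run is only broken where an instruction is dropped.
static std::vector<uint32_t> drop_capability(const std::vector<uint32_t> & code, uint32_t capability) {
    if (code.size() < kHeaderWords) {
        return code;
    }
    std::vector<uint32_t> out;
    out.reserve(code.size());
    size_t run_start = 0;  // the header is part of the first unchanged run
    size_t pos = kHeaderWords;
    while (pos < code.size()) {
        const uint32_t wc = code[pos] >> kWordCountShift;
        const uint32_t op = code[pos] & kOpCodeMask;
        if (wc == 0 || pos + wc > code.size()) {
            break;  // malformed stream: keep the rest verbatim
        }
        if (op == kOpCapability && wc == 2 && code[pos + 1] == capability) {
            out.insert(out.end(), code.begin() + run_start, code.begin() + pos);
            run_start = pos + wc;  // skip the dropped instruction
        }
        pos += wc;
    }
    out.insert(out.end(), code.begin() + run_start, code.end());
    return out;
}
```

Called with 5447 (the kSpvCapabilityCooperativeMatrixDecodeVectorNV value from the diff), this removes only that capability declaration and leaves every other instruction in place.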


@@ -11,6 +11,10 @@ if (GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
add_compile_definitions(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
message(STATUS "Enabling coopmat2 glslc support")
endif()
if (GGML_VULKAN_COOPMAT2_DECODE_VECTOR_GLSLC_SUPPORT)
add_compile_definitions(GGML_VULKAN_COOPMAT2_DECODE_VECTOR_GLSLC_SUPPORT)
message(STATUS "Enabling coopmat2 decode_vector glslc support")
endif()
if (GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
add_compile_definitions(GGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT)
message(STATUS "Enabling dot glslc support")
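Reviewer sketch (not part of the commit): two small formulas from the Vulkan conv2d changes can be checked in isolation, namely the shared-memory estimate from the `shmem_req` lambda and the tile-alignment test that feeds the new `aligned` specialization constant. The code below restates both as free functions. `block_size`, `shmem_req_bytes` and `tile_aligned` are hypothetical names, and the tile values used in the usage note (K=64 or 128, NPQ=128, CRS=16) are an assumed reading of the vk_conv_block_sizes table, whose field order is not visible in this diff.

```cpp
#include <cstdint>

// Hypothetical mirror of a conv2d tile: output channels (K), flattened
// Cin*KW*KH reduction (CRS) and flattened N*OH*OW output positions (NPQ).
struct block_size {
    uint32_t K;
    uint32_t CRS;
    uint32_t NPQ;
};

// Bytes of shared memory for Ash (K x (CRS+pad)), Bsh (CRS x (NPQ+pad)) and,
// when the result is staged through shared memory, Csh (K x NPQ).
static uint32_t shmem_req_bytes(block_size bs, uint32_t pad, bool csh_store, bool fp16_shmem) {
    const uint32_t elem_size = fp16_shmem ? 2u : 4u;
    const uint32_t csh_elems = csh_store ? bs.K * bs.NPQ : 0u;
    return (bs.K * (bs.CRS + pad) + bs.CRS * (bs.NPQ + pad) + csh_elems) * elem_size;
}

// True when every problem dimension is a whole number of tiles, which is the
// condition under which the shader may skip bounds checks and address clamps.
static bool tile_aligned(block_size bs, uint32_t K, uint32_t CRS, uint32_t NPQ) {
    return K % bs.K == 0 && CRS % bs.CRS == 0 && NPQ % bs.NPQ == 0;
}
```

Under those assumed tile values, with the cm1 padding of 8, staged output and fp16 storage, a 64x128 tile needs 23808 bytes while a 128x128 tile would need 43264 bytes. That is consistent with the comment that cm1 is capped at 64x128 because of shared memory size, although the actual limit depends on the device's maxComputeSharedMemorySize.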


@@ -7,6 +7,13 @@
#extension GL_KHR_memory_scope_semantics : enable
#endif
#ifdef COOPMAT
#extension GL_KHR_cooperative_matrix : enable
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require
#extension GL_KHR_memory_scope_semantics : enable
#endif
#ifdef USE_COLLECTIVES
# extension GL_KHR_shader_subgroup_shuffle : enable
#endif
@@ -77,6 +84,39 @@ layout(constant_id = 12) const uint d1 = 1;
// Kernel spatial sizes
layout(constant_id = 13) const uint KW = 1;
layout(constant_id = 14) const uint KH = 1;
// when set, skip bounds checks and address clamps (K/CRS/NPQ are tile-aligned)
layout(constant_id = 15) const uint aligned = 0;
// stage cm2 result through shmem (Csh) for coalesced stores. cm1 always does this.
layout(constant_id = 16) const uint csh_store = 0;
#ifdef COOPMAT
// cm1 subgroup tile: each subgroup computes a WM x WN region as a grid of
// TM x TN x TK fragments. Requires WM%TM == WN%TN == BS_K%WM == BS_NPQ%WN ==
// BS_CRS%TK == 0, and WG_SIZE == (BS_K/WM) * (BS_NPQ/WN) * subgroup_size.
layout(constant_id = 17) const uint WM = 32;
layout(constant_id = 18) const uint WN = 32;
const uint TM = 16;
const uint TN = 16;
const uint TK = 16;
const uint cms_per_row = WM / TM;
const uint cms_per_col = WN / TN;
const uint warps_M = BS_K / WM;
const uint warps_N = BS_NPQ / WN;
#endif
// without padding, H_idx/W_idx are in bounds by construction (non-TRANSPOSE only)
#ifdef TRANSPOSE
const bool hw_in_bounds = false;
#else
const bool hw_in_bounds = (p0 == 0) && (p1 == 0);
#endif
// TRANSPOSE stride alignment is trivially satisfied for stride 1
#ifdef TRANSPOSE
const bool stride_in_bounds = (s0 == 1) && (s1 == 1);
#else
const bool stride_in_bounds = true;
#endif
uint32_t tid = gl_LocalInvocationID.x;
const uint32_t WG_SIZE = gl_WorkGroupSize.x;
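Review note: the cm1 comment in this hunk lists divisibility requirements and a workgroup-size identity. The host-side C++ sketch below restates them so they can be checked before a pipeline is created. The `Cm1Config` struct, the function names, and the example sizes are hypothetical and not part of the diff; only the constraints and the `warp_r` / `warp_c` arithmetic are taken from the shader.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side mirror of the cm1 specialization constants.
struct Cm1Config {
    uint32_t BS_K, BS_NPQ, BS_CRS;   // workgroup block tile
    uint32_t WM, WN;                 // per-subgroup tile
    uint32_t subgroup_size, WG_SIZE;
};

// Fragment sizes are fixed at 16 in the shader.
constexpr uint32_t TM = 16, TN = 16, TK = 16;

// True when the configuration satisfies the constraints stated in the shader comment:
// WM%TM == WN%TN == BS_K%WM == BS_NPQ%WN == BS_CRS%TK == 0 and
// WG_SIZE == (BS_K/WM) * (BS_NPQ/WN) * subgroup_size.
static bool cm1_config_valid(const Cm1Config &c) {
    if (c.WM == 0 || c.WN == 0) return false;
    if (c.WM % TM || c.WN % TN || c.BS_K % c.WM || c.BS_NPQ % c.WN || c.BS_CRS % TK) return false;
    return c.WG_SIZE == (c.BS_K / c.WM) * (c.BS_NPQ / c.WN) * c.subgroup_size;
}

struct TileOrigin { uint32_t row, col; };

// Top-left corner (in the BS_K x BS_NPQ block) of the region owned by one subgroup,
// using the same warp_r / warp_c split as the shader.
static TileOrigin subgroup_tile_origin(const Cm1Config &c, uint32_t subgroup_id) {
    const uint32_t warps_N = c.BS_NPQ / c.WN;
    return { (subgroup_id / warps_N) * c.WM, (subgroup_id % warps_N) * c.WN };
}
```

With a 64 x 64 block, 32 x 32 subgroup tiles and 32-wide subgroups, four subgroups are needed, so `WG_SIZE` must be 128; any other value fails the check.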
@@ -94,7 +134,7 @@ uint32_t n_elems_out = K * NPQ;
// Number of blocktiles per input
uint32_t NB_CRS = splitWork(CRS, BS_CRS);
#ifdef COOPMAT2
#if defined(COOPMAT2) || defined(COOPMAT)
#define SHMEM_TYPE float16_t
#else
#define SHMEM_TYPE float
@@ -112,6 +152,17 @@ const uint32_t Bsh_len = BS_CRS * Bsh_stride;
shared SHMEM_TYPE Ash[Ash_len]; // K x CRS
shared SHMEM_TYPE Bsh[Bsh_len]; // CRS x NPQ
#if defined(COOPMAT2) || defined(COOPMAT)
// stage matC through shmem so global stores are row-major (NPQ-contiguous)
const uint32_t Csh_stride = BS_NPQ;
#ifdef COOPMAT
const uint32_t Csh_len = BS_K * Csh_stride;
#else
const uint32_t Csh_len = csh_store != 0 ? BS_K * Csh_stride : 1;
#endif
shared SHMEM_TYPE Csh[Csh_len]; // K x NPQ
#endif
// Threadtile sizes
const uint32_t TS_NPQ = BS_K * BS_NPQ / WG_SIZE / TS_K;
@@ -161,7 +212,7 @@ ACC_TYPE perElemOpStore(const in uint32_t r, const in uint32_t c, const in ACC_T
uint32_t OH_idx = fastdiv(NPQ_idx - N_idx * p.OH * p.OW, p.OWmp, p.OWL); // divide by p.OW;
uint32_t OW_idx = NPQ_idx - N_idx * p.OH * p.OW - OH_idx * p.OW;
uint32_t dst_idx = OW_idx + OH_idx * p.nb1 + K_idx * p.nb2 + N_idx * p.nb3;
if (K_idx < K && NPQ_idx < NPQ) {
if (aligned != 0 || (K_idx < K && NPQ_idx < NPQ)) {
dst_data[dst_idx] = D_TYPE(elem);
}
return elem;
@@ -176,6 +227,13 @@ void main() {
#ifdef COOPMAT2
coopmat<ACC_TYPE, gl_ScopeWorkgroup, BS_K, BS_NPQ, gl_MatrixUseAccumulator> matC;
matC = coopmat<ACC_TYPE, gl_ScopeWorkgroup, BS_K, BS_NPQ, gl_MatrixUseAccumulator>(0.0);
#elif defined(COOPMAT)
coopmat<float16_t, gl_ScopeSubgroup, TM, TN, gl_MatrixUseAccumulator> sums[cms_per_row * cms_per_col];
[[unroll]] for (uint i = 0; i < cms_per_row * cms_per_col; i++) {
sums[i] = coopmat<float16_t, gl_ScopeSubgroup, TM, TN, gl_MatrixUseAccumulator>(0.0);
}
const uint warp_r = gl_SubgroupID / warps_N;
const uint warp_c = gl_SubgroupID % warps_N;
#else
float regC[TS_K][TS_NPQ];
for (uint32_t T_ly = 0; T_ly < TS_K; T_ly++) {
@@ -228,12 +286,15 @@ void main() {
uint32_t B_lx = Ac;
uint32_t K_idx = B_idx_K * BS_K + B_ly; /* Global K_idx (row index of A)*/
#ifdef TRANSPOSE
uint32_t knl_idx = min(KW_idx_a + KH_idx_a * p.nb01 + K_idx * p.nb02 + Cin_idx_a * p.nb03, K * CRS - 1);
uint32_t knl_idx = KW_idx_a + KH_idx_a * p.nb01 + K_idx * p.nb02 + Cin_idx_a * p.nb03;
#else
uint32_t knl_idx = min(KW_idx_a + KH_idx_a * p.nb01 + Cin_idx_a * p.nb02 + K_idx * p.nb03, K * CRS - 1);
uint32_t knl_idx = KW_idx_a + KH_idx_a * p.nb01 + Cin_idx_a * p.nb02 + K_idx * p.nb03;
#endif
if (aligned == 0) {
knl_idx = min(knl_idx, K * CRS - 1);
}
float val = knl_data[knl_idx];
if (K_idx >= K || CRS_idx_a >= CRS) {
if (aligned == 0 && (K_idx >= K || CRS_idx_a >= CRS)) {
val = 0.0;
}
Ash[B_ly * Ash_stride + B_lx] = SHMEM_TYPE(val);
@@ -282,15 +343,27 @@ void main() {
uint32_t H_idx = OH_idx * s1 + KH_idx_b * d1 - p1;
uint32_t W_idx = OW_idx * s0 + KW_idx_b * d0 - p0;
#endif
uint32_t src_idx =
min(max(W_idx + H_idx * p.nb11 + Cin_idx_b * p.nb12 + N_idx * p.nb13, 0), p.Cin * p.N * p.W * p.H - 1);
uint32_t src_idx = W_idx + H_idx * p.nb11 + Cin_idx_b * p.nb12 + N_idx * p.nb13;
// skip clamp when address can't go OOB
if (aligned == 0 || !hw_in_bounds || !stride_in_bounds) {
src_idx = min(max(src_idx, 0), p.Cin * p.N * p.W * p.H - 1);
}
float val = src_data[src_idx];
if (CRS_idx_b >= CRS || NPQ_idx >= NPQ
|| H_idx >= p.H || W_idx >= p.W // Lower bound checks aren't necessary. (idx >= 0x80000000 for such case)
bool oob = false;
if (aligned == 0 && (CRS_idx_b >= CRS || NPQ_idx >= NPQ)) {
oob = true;
}
// also catches lower-bound underflow (idx wraps to 0x80000000+)
if (!hw_in_bounds && (H_idx >= p.H || W_idx >= p.W)) {
oob = true;
}
#ifdef TRANSPOSE
|| (H_idx_x_s1 - H_idx * s1 != 0) || (W_idx_x_s0 - W_idx * s0 != 0)
if (!stride_in_bounds &&
((H_idx_x_s1 - H_idx * s1 != 0) || (W_idx_x_s0 - W_idx * s0 != 0))) {
oob = true;
}
#endif
) {
if (oob) {
val = 0.0;
}
Bsh[B_ly * Bsh_stride + B_lx] = SHMEM_TYPE(val);
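Review note: the "also catches lower-bound underflow" comment relies on unsigned wraparound. In the non-TRANSPOSE path `H_idx = OH_idx * s1 + KH_idx_b * d1 - p1` is computed in `uint32_t`, so a position inside the top padding wraps to a very large value and fails the single `H_idx >= p.H` test. A minimal C++ sketch of that reasoning follows; the function name and argument order are hypothetical, and it assumes `H` is far below 2^31.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the input row addressed by (oh, kh) lies inside [0, H).
// The subtraction wraps modulo 2^32 when the true row is negative, so one
// unsigned comparison rejects both "above the top" and "below the bottom".
static bool row_in_bounds(uint32_t oh, uint32_t stride, uint32_t kh,
                          uint32_t dilation, uint32_t pad, uint32_t H) {
    const uint32_t h = oh * stride + kh * dilation - pad;
    return h < H;
}
```

For example, with `pad = 1` and `H = 8`, output row 0 with kernel row 0 maps to row -1 and is rejected, while kernel row 1 maps to row 0 and is accepted.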
@@ -303,6 +376,23 @@ void main() {
coopMatLoad(matA, Ash, 0, Ash_stride, gl_CooperativeMatrixLayoutRowMajor);
coopMatLoad(matB, Bsh, 0, Bsh_stride, gl_CooperativeMatrixLayoutRowMajor);
matC = coopMatMulAdd(matA, matB, matC);
#elif defined(COOPMAT)
// each subgroup multiplies its grid of fragments per TK-sized CRS chunk
[[unroll]] for (uint k_step = 0; k_step < BS_CRS / TK; k_step++) {
coopmat<float16_t, gl_ScopeSubgroup, TM, TK, gl_MatrixUseA> cache_a[cms_per_row];
[[unroll]] for (uint cm_row = 0; cm_row < cms_per_row; cm_row++) {
const uint a_off = (warp_r * WM + cm_row * TM) * Ash_stride + k_step * TK;
coopMatLoad(cache_a[cm_row], Ash, a_off, Ash_stride, gl_CooperativeMatrixLayoutRowMajor);
}
[[unroll]] for (uint cm_col = 0; cm_col < cms_per_col; cm_col++) {
coopmat<float16_t, gl_ScopeSubgroup, TK, TN, gl_MatrixUseB> cache_b;
const uint b_off = k_step * TK * Bsh_stride + warp_c * WN + cm_col * TN;
coopMatLoad(cache_b, Bsh, b_off, Bsh_stride, gl_CooperativeMatrixLayoutRowMajor);
[[unroll]] for (uint cm_row = 0; cm_row < cms_per_row; cm_row++) {
sums[cm_col * cms_per_row + cm_row] = coopMatMulAdd(cache_a[cm_row], cache_b, sums[cm_col * cms_per_row + cm_row]);
}
}
}
#else
if (T_y * TS_K < K) {
UNROLL for (uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++) {
@@ -325,8 +415,51 @@ void main() {
barrier();
}
/* Save C* */
#if defined(COOPMAT2) || defined(COOPMAT)
// stage matC into Csh, then write to dst with coalesced NPQ-contiguous stores
#ifdef COOPMAT
const bool use_staged_store = true;
#else
const bool use_staged_store = (csh_store != 0);
#endif
if (use_staged_store) {
#ifdef COOPMAT
// cm1: each subgroup stores its fragment grid into its Csh slot
[[unroll]] for (uint cm_row = 0; cm_row < cms_per_row; cm_row++) {
[[unroll]] for (uint cm_col = 0; cm_col < cms_per_col; cm_col++) {
const uint csh_off = (warp_r * WM + cm_row * TM) * Csh_stride + warp_c * WN + cm_col * TN;
coopMatStore(sums[cm_col * cms_per_row + cm_row], Csh, csh_off, Csh_stride, gl_CooperativeMatrixLayoutRowMajor);
}
}
#else
coopMatStore(matC, Csh, 0, Csh_stride, gl_CooperativeMatrixLayoutRowMajor);
#endif
barrier();
// cooperative shmem->global: WG threads spread across BS_NPQ (the
// contiguous direction of dst), each iter covers store_rows_per_iter K-rows
const uint32_t store_rows_per_iter = WG_SIZE / BS_NPQ;
const uint32_t store_iters = BS_K / store_rows_per_iter;
const uint32_t k_thread_offset = tid / BS_NPQ;
const uint32_t npq_thread = tid % BS_NPQ;
[[unroll]] for (uint32_t i = 0; i < store_iters; i++) {
uint32_t k_local = i * store_rows_per_iter + k_thread_offset;
uint32_t K_idx = B_idx_K * BS_K + k_local;
uint32_t NPQ_idx = B_idx_NPQ * BS_NPQ + npq_thread;
uint32_t N_idx = fastdiv(NPQ_idx, p.OWOHmp, p.OWOHL);
uint32_t OH_idx = fastdiv(NPQ_idx - N_idx * p.OH * p.OW, p.OWmp, p.OWL);
uint32_t OW_idx = NPQ_idx - N_idx * p.OH * p.OW - OH_idx * p.OW;
uint32_t dst_idx = OW_idx + OH_idx * p.nb1 + K_idx * p.nb2 + N_idx * p.nb3;
if (aligned != 0 || (K_idx < K && NPQ_idx < NPQ)) {
dst_data[dst_idx] = D_TYPE(Csh[k_local * Csh_stride + npq_thread]);
}
}
}
#ifdef COOPMAT2
coopMatPerElementNV(matC, matC, perElemOpStore);
else {
coopMatPerElementNV(matC, matC, perElemOpStore);
}
#endif
#else
if (T_y * TS_K < K) {
for (uint32_t T_ly = 0; T_ly < TS_K; T_ly++) {
@@ -337,7 +470,7 @@ void main() {
uint32_t OH_idx = fastdiv(NPQ_idx - N_idx * p.OH * p.OW, p.OWmp, p.OWL); // divide by p.OW;
uint32_t OW_idx = NPQ_idx - N_idx * p.OH * p.OW - OH_idx * p.OW;
uint32_t dst_idx = OW_idx + OH_idx * p.nb1 + K_idx * p.nb2 + N_idx * p.nb3;
if (K_idx < K && NPQ_idx < NPQ) {
if (aligned != 0 || (K_idx < K && NPQ_idx < NPQ)) {
dst_data[dst_idx] = regC[T_ly][T_lx];
}
}
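Review note: the staged store maps each thread to one `npq_thread` column and a set of `k_local` rows. The C++ sketch below replays that mapping on the host and checks that every element of the `BS_K x BS_NPQ` tile is written exactly once. The function name and the tile sizes used in the example are hypothetical; the index arithmetic is copied from the store loop. It also makes explicit two conditions the loop appears to assume: `WG_SIZE` is a non-zero multiple of `BS_NPQ`, and `BS_K` is a multiple of `WG_SIZE / BS_NPQ`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Replays the tid -> (k_local, npq_thread) mapping of the staged store loop and
// returns true when every element of the BS_K x BS_NPQ tile is visited exactly once.
bool staged_store_covers_tile(uint32_t wg_size, uint32_t bs_k, uint32_t bs_npq) {
    if (bs_npq == 0 || wg_size < bs_npq || wg_size % bs_npq != 0) return false;
    const uint32_t store_rows_per_iter = wg_size / bs_npq;
    if (bs_k % store_rows_per_iter != 0) return false;
    const uint32_t store_iters = bs_k / store_rows_per_iter;

    std::vector<uint32_t> visits(bs_k * bs_npq, 0);
    for (uint32_t tid = 0; tid < wg_size; tid++) {
        const uint32_t k_thread_offset = tid / bs_npq;
        const uint32_t npq_thread      = tid % bs_npq;
        for (uint32_t i = 0; i < store_iters; i++) {
            const uint32_t k_local = i * store_rows_per_iter + k_thread_offset;
            visits[k_local * bs_npq + npq_thread]++;
        }
    }
    for (uint32_t v : visits) {
        if (v != 1) return false;
    }
    return true;
}
```

Because consecutive `tid` values share a row and differ only in `npq_thread`, neighbouring threads write neighbouring `dst` addresses, which is the coalescing the comment in the hunk refers to.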


@@ -5,21 +5,60 @@
#include "types.glsl"
#if defined(DATA_A_F32)
FLOAT_TYPE dequantize1(uint ib, uint iqs, uint a_offset) {
return data_a[a_offset + ib];
}
vec2 dequantize(uint ib, uint iqs, uint a_offset) {
return vec2(data_a[a_offset + ib], data_a[a_offset + ib + 1]);
}
vec4 dequantize4(uint ib, uint iqs, uint a_offset) {
return vec4(data_a[a_offset + ib ], data_a[a_offset + ib + 1],
data_a[a_offset + ib + 2], data_a[a_offset + ib + 3]);
}
vec4 dequantize4_2aligned(uint ib, uint iqs, uint a_offset) {
return vec4(data_a[a_offset + ib ], data_a[a_offset + ib + 1],
data_a[a_offset + ib + 2], data_a[a_offset + ib + 3]);
}
#endif
#if defined(DATA_A_F16)
FLOAT_TYPE dequantize1(uint ib, uint iqs, uint a_offset) {
return data_a[a_offset + ib];
}
vec2 dequantize(uint ib, uint iqs, uint a_offset) {
return vec2(data_a[a_offset + ib], data_a[a_offset + ib + 1]);
}
vec4 dequantize4(uint ib, uint iqs, uint a_offset) {
return vec4(data_a[a_offset + ib ], data_a[a_offset + ib + 1],
data_a[a_offset + ib + 2], data_a[a_offset + ib + 3]);
}
vec4 dequantize4_2aligned(uint ib, uint iqs, uint a_offset) {
const vec2 a = data_a_packed32[(a_offset + ib)/2];
const vec2 b = data_a_packed32[(a_offset + ib)/2 + 1];
return vec4(a, b);
}
#endif
#if defined(DATA_A_BF16)
FLOAT_TYPE dequantize1(uint ib, uint iqs, uint a_offset) {
return bf16_to_fp32(data_a[a_offset + ib]);
}
vec2 dequantize(uint ib, uint iqs, uint a_offset) {
return vec2(bf16_to_fp32(data_a[a_offset + ib]), bf16_to_fp32(data_a[a_offset + ib + 1]));
}
vec4 dequantize4(uint ib, uint iqs, uint a_offset) {
return vec4(bf16_to_fp32(data_a[a_offset + ib ]), bf16_to_fp32(data_a[a_offset + ib + 1]),
bf16_to_fp32(data_a[a_offset + ib + 2]), bf16_to_fp32(data_a[a_offset + ib + 3]));
}
vec4 dequantize4_2aligned(uint ib, uint iqs, uint a_offset) {
const uint a = data_a_packed32[(a_offset + ib)/2];
const uint b = data_a_packed32[(a_offset + ib)/2 + 1];
return vec4(uintBitsToFloat((a & 0x0000ffff) << 16),
uintBitsToFloat( a & 0xffff0000),
uintBitsToFloat((b & 0x0000ffff) << 16),
uintBitsToFloat( b & 0xffff0000));
}
#endif
#if defined(DATA_A_Q4_0)
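Review note: the BF16 `dequantize4_2aligned` path reads two 32-bit words and widens four bf16 values with masks and shifts instead of four scalar loads. The C++ sketch below shows the bit manipulation for one word. Helper names are hypothetical, and it assumes the element with the lower index sits in the low 16 bits of the word, as the shader's masks imply.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float (the role uintBitsToFloat plays in GLSL).
static float bits_to_float(uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Scalar path: a bf16 value is the top 16 bits of the matching fp32 value.
static float bf16_to_fp32(uint16_t h) {
    return bits_to_float(uint32_t(h) << 16);
}

struct Float2 { float lo, hi; };

// Packed path: one 32-bit word carries two bf16 values.
// The low half is shifted up; the high half is already in place and only needs masking.
static Float2 unpack_bf16_pair(uint32_t word) {
    return { bits_to_float((word & 0x0000ffffu) << 16),
             bits_to_float( word & 0xffff0000u) };
}
```

The packed form produces the same values as two calls to the scalar helper, with one memory read per pair instead of two.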


@@ -1,4 +1,12 @@
// Each format defines a scalar dequantFunc<T> plus a V=4 dequantFunc<T>_v
// passed as the optional vector decoder to coopMatLoadTensorNV via
// GL_NV_cooperative_matrix_decode_vector. When the driver doesn't support
// the extension, ggml-vulkan.cpp strips it from the compiled SPIR-V.
#ifdef GL_NV_cooperative_matrix_decode_vector
#extension GL_NV_cooperative_matrix_decode_vector : enable
#endif
#include "types.glsl"
layout(buffer_reference, std430, buffer_reference_align = 16) buffer decodeBufF32 {
@@ -25,6 +33,19 @@ float16_t dequantFuncQ1_0(const in decodeBufQ1_0 bl, const in uint blockCoords[2
return bit != 0u ? d : -d;
}
f16vec4 dequantFuncQ1_0_v(const in decodeBufQ1_0 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float16_t d = bl.block.d;
const float16_t md = -d;
const uint idx = coordInBlock[1];
const uint qs_nib = uint(bl.block.qs[idx >> 3]) >> (idx & 0x4u);
return f16vec4(
(qs_nib & 1u) != 0u ? d : md,
(qs_nib & 2u) != 0u ? d : md,
(qs_nib & 4u) != 0u ? d : md,
(qs_nib & 8u) != 0u ? d : md);
}
layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufQ4_0 {
block_q4_0_packed16 block;
};
@@ -42,10 +63,28 @@ float16_t dequantFuncQ4_0(const in decodeBufQ4_0 bl, const in uint blockCoords[2
return ret;
}
f16vec4 dequantFuncQ4_0_v(const in decodeBufQ4_0 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float16_t d = bl.block.d;
const uint idx = coordInBlock[1];
const uint shift = (idx & 0x10) >> 2; // 0 or 4
const uint qs_i = (idx & 0xE) >> 1; // even, in {0,2,4,6}
const uint qsw = uint32_t(bl.block.qs[qs_i ])
| (uint32_t(bl.block.qs[qs_i + 1u]) << 16);
// shift in {0,4}: per-byte mask 0x0F isolates the wanted nibble in each byte.
const uint q4 = (qsw >> shift) & 0x0F0F0F0Fu;
const u8vec4 q = unpack8(q4);
return f16vec4((vec4(q) - vec4(8.0)) * vec4(float(d)));
}
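Review note: `dequantFuncQ4_0_v` decodes four elements with one shift and one per-byte mask. The C++ sketch below compares that word-wise form against a per-element reference on the host. It assumes the usual Q4_0 nibble layout (elements 0..15 in the low nibbles of `qs[0..15]`, elements 16..31 in the high nibbles) and omits the multiplication by `d`; function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Per-element reference: quant value of element idx, before scaling by d.
static int q4_0_scalar(const std::array<uint8_t, 16> &qs, uint32_t idx) {
    const uint32_t shift = (idx & 0x10u) >> 2;   // 0 for elements 0..15, 4 for 16..31
    return int((qs[idx & 0xFu] >> shift) & 0xFu) - 8;
}

// Word-wise form: elements idx..idx+3 (idx must be a multiple of 4).
// Four adjacent bytes are combined into one 32-bit word; since shift is 0 or 4,
// the mask 0x0F0F0F0F keeps exactly the wanted nibble of each byte.
static std::array<int, 4> q4_0_vec4(const std::array<uint8_t, 16> &qs, uint32_t idx) {
    const uint32_t shift = (idx & 0x10u) >> 2;
    const uint32_t base  = idx & 0xCu;           // first of the four bytes
    const uint32_t w = uint32_t(qs[base])
                     | (uint32_t(qs[base + 1]) << 8)
                     | (uint32_t(qs[base + 2]) << 16)
                     | (uint32_t(qs[base + 3]) << 24);
    const uint32_t q4 = (w >> shift) & 0x0F0F0F0Fu;
    return { int( q4        & 0xFFu) - 8,
             int((q4 >> 8)  & 0xFFu) - 8,
             int((q4 >> 16) & 0xFFu) - 8,
             int((q4 >> 24) & 0xFFu) - 8 };
}
```

The same shift-and-mask pattern recurs in the Q4_1, Q5_0 and Q5_1 vector decoders, with the high bits added afterwards where the format has them.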
layout(buffer_reference, std430, buffer_reference_align = 4) buffer decodeBufQ4_1 {
block_q4_1 block;
};
layout(buffer_reference, std430, buffer_reference_align = 4) buffer decodeBufQ4_1_packed32 {
block_q4_1_packed32 block;
};
float16_t dequantFuncQ4_1(const in decodeBufQ4_1 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float16_t d = bl.block.d;
@@ -60,10 +99,27 @@ float16_t dequantFuncQ4_1(const in decodeBufQ4_1 bl, const in uint blockCoords[2
return ret;
}
f16vec4 dequantFuncQ4_1_v(const in decodeBufQ4_1 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ4_1_packed32 bl32 = decodeBufQ4_1_packed32(bl);
const float16_t d = bl.block.d;
const float16_t m = bl.block.m;
const uint idx = coordInBlock[1];
const uint shift = (idx & 0x10) >> 2; // 0 or 4
const uint qs_w = (idx & 0xC) >> 2; // iqs / 4 in [0,4)
const uint qsw = uint32_t(bl32.block.qs[qs_w]);
const u8vec4 q = unpack8((qsw >> shift) & 0x0F0F0F0Fu);
return f16vec4(vec4(q) * vec4(float(d)) + vec4(float(m)));
}
layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufQ5_0 {
block_q5_0 block;
};
layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufQ5_0_packed16 {
block_q5_0_packed16 block;
};
float16_t dequantFuncQ5_0(const in decodeBufQ5_0 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float16_t d = bl.block.d;
@@ -82,10 +138,32 @@ float16_t dequantFuncQ5_0(const in decodeBufQ5_0 bl, const in uint blockCoords[2
return ret;
}
f16vec4 dequantFuncQ5_0_v(const in decodeBufQ5_0 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ5_0_packed16 bl16 = decodeBufQ5_0_packed16(bl);
const float16_t d = bl.block.d;
const uint idx = coordInBlock[1];
const uint shift = (idx & 0x10) >> 2; // 0 or 4
const uint qs_i = (idx & 0xC) >> 1; // packed16 word index, in {0,2,4,6}
const uint qsw = uint32_t(bl16.block.qs[qs_i ])
| (uint32_t(bl16.block.qs[qs_i + 1u]) << 16);
const u8vec4 ql = unpack8((qsw >> shift) & 0x0F0F0F0Fu);
const uint uint_qh = uint(bl16.block.qh[1]) << 16 | uint(bl16.block.qh[0]);
const uint qh_pack = uint_qh >> idx; // bits 0..3 = element idx..idx+3 high bits
const uvec4 qh_high = (uvec4(qh_pack, qh_pack >> 1u, qh_pack >> 2u, qh_pack >> 3u) & uvec4(0x01u)) << 4u;
return f16vec4((vec4(ql) + vec4(qh_high) - vec4(16.0)) * vec4(float(d)));
}
layout(buffer_reference, std430, buffer_reference_align = 8) buffer decodeBufQ5_1 {
block_q5_1 block;
};
layout(buffer_reference, std430, buffer_reference_align = 8) buffer decodeBufQ5_1_packed32 {
block_q5_1_packed32 block;
};
float16_t dequantFuncQ5_1(const in decodeBufQ5_1 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float16_t d = bl.block.d;
@@ -105,6 +183,23 @@ float16_t dequantFuncQ5_1(const in decodeBufQ5_1 bl, const in uint blockCoords[2
return ret;
}
f16vec4 dequantFuncQ5_1_v(const in decodeBufQ5_1 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ5_1_packed32 bl32 = decodeBufQ5_1_packed32(bl);
const float16_t d = bl.block.d;
const float16_t m = bl.block.m;
const uint idx = coordInBlock[1];
const uint shift = (idx & 0x10) >> 2; // 0 or 4
const uint qs_w = (idx & 0xC) >> 2; // iqs / 4 in [0,4)
const uint qsw = uint32_t(bl32.block.qs[qs_w]);
const u8vec4 ql = unpack8((qsw >> shift) & 0x0F0F0F0Fu);
const uint qh_pack = bl.block.qh >> idx; // bits 0..3 = element idx..idx+3 high bits
const uvec4 qh_high = (uvec4(qh_pack, qh_pack >> 1u, qh_pack >> 2u, qh_pack >> 3u) & uvec4(0x01u)) << 4u;
return f16vec4((vec4(ql) + vec4(qh_high)) * vec4(float(d)) + vec4(float(m)));
}
layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufQ8_0 {
block_q8_0_packed16 block;
};
@@ -121,6 +216,17 @@ float16_t dequantFuncQ8_0(const in decodeBufQ8_0 bl, const in uint blockCoords[2
return ret;
}
f16vec4 dequantFuncQ8_0_v(const in decodeBufQ8_0 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float16_t d = bl.block.d;
const uint idx = coordInBlock[1];
const uint base = idx >> 1u;
const uint w = uint(uint16_t(bl.block.qs[base]))
| (uint(uint16_t(bl.block.qs[base + 1u])) << 16u);
const i8vec4 qi = unpack8(int32_t(w));
return f16vec4(vec4(qi) * vec4(float(d)));
}
layout(buffer_reference, std430, buffer_reference_align = 4) buffer decodeBufQ2_K {
block_q2_K block;
};
@@ -129,6 +235,10 @@ layout(buffer_reference, std430, buffer_reference_align = 16) buffer decodeBufQ2
block_q2_K_packed16 block;
};
layout(buffer_reference, std430, buffer_reference_align = 4) buffer decodeBufQ2_K_packed32 {
block_q2_K_packed32 block;
};
float16_t dequantFuncQ2_K(const in decodeBufQ2_K bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ2_K_packed16 bl16 = decodeBufQ2_K_packed16(bl);
@@ -147,10 +257,36 @@ float16_t dequantFuncQ2_K(const in decodeBufQ2_K bl, const in uint blockCoords[2
return ret;
}
f16vec4 dequantFuncQ2_K_v(const in decodeBufQ2_K bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ2_K_packed32 bl32 = decodeBufQ2_K_packed32(bl);
const f16vec2 dm = bl.block.dm;
const uint idx = coordInBlock[1];
const uint scalesi = idx >> 4; // 0..15
const uint qsshift = (idx & 0x60) >> 4; // 0,2,4,6
// qs_i (packed16) = ((idx & 0x80) >> 3) + ((idx & 0x1E) >> 1) is even for idx % 4 == 0,
// so qs_w (packed32) = qs_i / 2 = ((idx & 0x80) >> 4) + ((idx & 0x1Cu) >> 2).
const uint qs_w = ((idx & 0x80) >> 4) + ((idx & 0x1Cu) >> 2);
const uint qsw = uint32_t(bl32.block.qs[qs_w]);
const uint qs4 = (qsw >> qsshift) & 0x03030303u;
const u8vec4 qi = unpack8(qs4);
const uint scales = bl.block.scales[scalesi];
const float16_t d_sub = dm.x * float16_t(scales & 0xF);
const float16_t m_sub = dm.y * float16_t(scales >> 4);
return f16vec4(vec4(qi) * vec4(float(d_sub)) - vec4(float(m_sub)));
}
layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufQ3_K {
block_q3_K block;
};
layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufQ3_K_packed16 {
block_q3_K_packed16 block;
};
float16_t dequantFuncQ3_K(const in decodeBufQ3_K bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const uint idx = coordInBlock[1];
@@ -179,6 +315,47 @@ float16_t dequantFuncQ3_K(const in decodeBufQ3_K bl, const in uint blockCoords[2
return ret;
}
f16vec4 dequantFuncQ3_K_v(const in decodeBufQ3_K bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ3_K_packed16 bl16 = decodeBufQ3_K_packed16(bl);
const uint idx = coordInBlock[1];
const uint n = idx >> 7; // 0,1
const uint is = idx >> 4; // 0..15
const uint halfsplit = (idx & 0x60) >> 5; // 0,1,2,3
const uint qsshift = halfsplit << 1; // 0,2,4,6
const uint hbit = (n << 2) + halfsplit; // 0..7 (bit position in hmask byte)
uint32_t scaleidx0 = (is < 8) ? is : (is - 8);
uint32_t scaleidx0shift = (is < 8) ? 0u : 4u;
uint32_t scaleidx1 = is + 8 - (is / 4) * 4;
uint32_t scaleidx1shift = (is / 4) * 2;
const int8_t us = int8_t(
((bl.block.scales[scaleidx0] >> scaleidx0shift) & 0xF) |
(((bl.block.scales[scaleidx1] >> scaleidx1shift) & 3) << 4));
const float16_t dl = bl.block.d * float16_t(int(us) - 32);
// For idx % 4 == 0: (idx & 0x1F) == (idx & 0x1C) is a multiple of 4.
const uint qsi = (n << 5) + (idx & 0x1Cu);
const uint hmi = (idx & 0x1Cu);
// Two adjacent uint16 packed16 reads, combined into a uint32 in registers.
// After this: byte j of qsw / hmw holds the data for element idx+j.
const uint qsw = uint32_t(bl16.block.qs[qsi >> 1])
| (uint32_t(bl16.block.qs[(qsi >> 1) + 1u]) << 16);
const uint hmw = uint32_t(bl16.block.hmask[hmi >> 1])
| (uint32_t(bl16.block.hmask[(hmi >> 1) + 1u]) << 16);
// qsshift in {0,2,4,6} and hbit in {0..7}: per-byte masks isolate the wanted bits
// with no inter-byte leakage.
const uint ql4 = (qsw >> qsshift) & 0x03030303u;
const uint qh4 = (hmw >> hbit) & 0x01010101u;
const ivec4 q = ivec4(unpack8(ql4 | (qh4 << 2))) - ivec4(4);
return f16vec4(vec4(q) * vec4(float(dl)));
}
layout(buffer_reference, std430, buffer_reference_align = 16) buffer decodeBufQ4_K {
block_q4_K block;
};
@@ -187,6 +364,10 @@ layout(buffer_reference, std430, buffer_reference_align = 16) buffer decodeBufQ4
block_q4_K_packed16 block;
};
layout(buffer_reference, std430, buffer_reference_align = 16) buffer decodeBufQ4_K_packed32 {
block_q4_K_packed32 block;
};
layout(buffer_reference, std430, buffer_reference_align = 16) buffer decodeBufQ4_K_packed128 {
block_q4_K_packed128 block;
};
@@ -334,6 +515,55 @@ float16_t dequantFuncQ4_K(const in decodeBufQ4_K bl, const in uint blockCoords[2
return float16_t(ret);
}
f16vec4 dequantFuncQ4_K_v(const in decodeBufQ4_K bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ4_K_packed32 bl32 = decodeBufQ4_K_packed32(bl);
decodeBufQ4_K_packed128 bl128 = decodeBufQ4_K_packed128(bl);
const uint idx = coordInBlock[1];
const uint is = idx >> 5; // 0..7
#if defined(IS_MUL_MM2) && defined(DATA_A_Q4_K)
vec2 v = shAscales[is * shAscales_stride + (blockCoords[0] % BM)];
float d = v.x;
float m = v.y;
#else
uvec4 v = bl128.block.q4k[0];
const vec2 loadd = vec2(unpackFloat2x16(v.x));
uint32_t sc;
uint32_t mbyte;
uint32_t scale0 = v.y;
uint32_t scale4 = v.z;
uint32_t scale8 = v.w;
uint32_t sc_lo = scale0;
uint32_t mb_lo = scale4;
uint32_t sc_hi = (scale8 & 0x0F0F0F0F) | ((scale0 & 0xC0C0C0C0) >> 2);
uint32_t mb_hi = ((scale8 & 0xF0F0F0F0) >> 4) | ((scale4 & 0xC0C0C0C0) >> 2);
sc = is < 4 ? sc_lo : sc_hi;
mbyte = is < 4 ? mb_lo : mb_hi;
sc = sc >> (8 * (is & 3));
mbyte = mbyte >> (8 * (is & 3));
sc &= 0x3F;
mbyte &= 0x3F;
const float d = loadd.x * float(sc);
const float m = loadd.y * float(mbyte);
#endif
// idx in [0,256); vector decode uses idx a multiple of 4. packed32 word index:
// (qs_i >> 1) == (idx >> 6) * 8 + ((idx & 0x1E) >> 2). sh is 0 or 4 only, so a
// single (w >> sh) & 0x0F0F0F0F isolates all four nibbles without inter-byte leakage.
const uint sh = (idx & 0x20u) >> 3u;
const uint w = uint32_t(bl32.block.qs[(idx >> 6) * 8u + ((idx & 0x1Eu) >> 2)]);
const u8vec4 q = unpack8((w >> sh) & 0x0F0F0F0Fu);
return f16vec4(vec4(d) * vec4(q) - vec4(m));
}
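Review note: the non-`IS_MUL_MM2` branch of `dequantFuncQ4_K_v` extracts the 6-bit scale and min for sub-block `is` from three 32-bit words rather than byte by byte. The C++ sketch below checks the word-wise form against a per-byte reference on the host. The per-byte function reflects my understanding of the 12-byte K-quant scale layout and should be treated as an assumption; all names are hypothetical, and the final multiplication by `d` / `dmin` is left out.

```cpp
#include <cassert>
#include <cstdint>

struct ScaleMin { uint32_t sc, m; };

// Per-byte reference (assumed layout): sub-blocks 0..3 keep their 6-bit values in
// q[0..7]; sub-blocks 4..7 take the low 4 bits from q[8..11] and the top 2 bits
// from bits 6..7 of q[0..7].
static ScaleMin scale_min_bytes(const uint8_t q[12], uint32_t j) {
    if (j < 4) {
        return { uint32_t(q[j] & 63), uint32_t(q[j + 4] & 63) };
    }
    return { uint32_t((q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4)),
             uint32_t((q[j + 4] >> 4)  | ((q[j]     >> 6) << 4)) };
}

// Little-endian 32-bit load, matching how the shader sees the packed block.
static uint32_t load_le32(const uint8_t *p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// Word-wise form mirroring the shader: build all four "high" sub-block values at
// once with per-byte masks, then pick one byte with a shift of 8 * (is & 3).
static ScaleMin scale_min_words(const uint8_t q[12], uint32_t is) {
    const uint32_t scale0 = load_le32(q + 0);
    const uint32_t scale4 = load_le32(q + 4);
    const uint32_t scale8 = load_le32(q + 8);
    const uint32_t sc_hi = (scale8 & 0x0F0F0F0Fu)        | ((scale0 & 0xC0C0C0C0u) >> 2);
    const uint32_t mb_hi = ((scale8 & 0xF0F0F0F0u) >> 4) | ((scale4 & 0xC0C0C0C0u) >> 2);
    uint32_t sc = is < 4 ? scale0 : sc_hi;
    uint32_t mb = is < 4 ? scale4 : mb_hi;
    sc = (sc >> (8 * (is & 3))) & 0x3Fu;
    mb = (mb >> (8 * (is & 3))) & 0x3Fu;
    return { sc, mb };
}
```

The masks `0xC0C0C0C0` and `0xF0F0F0F0` act on each byte independently, so the right shifts by 2 and 4 never move bits across a byte boundary.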
layout(buffer_reference, std430, buffer_reference_align = 16) buffer decodeBufQ5_K {
block_q5_K block;
};
@@ -346,6 +576,10 @@ layout(buffer_reference, std430, buffer_reference_align = 16) buffer decodeBufQ5
block_q5_K_packed128 block;
};
layout(buffer_reference, std430, buffer_reference_align = 16) buffer decodeBufQ5_K_packed32 {
block_q5_K_packed32 block;
};
float16_t dequantFuncQ5_K(const in decodeBufQ5_K bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ5_K_packed16 bl16 = decodeBufQ5_K_packed16(bl);
@@ -399,6 +633,58 @@ float16_t dequantFuncQ5_K(const in decodeBufQ5_K bl, const in uint blockCoords[2
return float16_t(ret);
}
f16vec4 dequantFuncQ5_K_v(const in decodeBufQ5_K bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ5_K_packed32 bl32 = decodeBufQ5_K_packed32(bl);
decodeBufQ5_K_packed128 bl128 = decodeBufQ5_K_packed128(bl);
const uint idx = coordInBlock[1];
const uint is = idx >> 5;
#if defined(IS_MUL_MM2) && defined(DATA_A_Q5_K)
vec2 v = shAscales[is * shAscales_stride + (blockCoords[0] % BM)];
float d = v.x;
float m = v.y;
#else
uvec4 v = bl128.block.q5k[0];
const f16vec2 loadd = unpackFloat2x16(v.x);
uint32_t sc;
uint32_t mbyte;
uint32_t scale0 = v.y;
uint32_t scale4 = v.z;
uint32_t scale8 = v.w;
uint32_t sc_lo = scale0;
uint32_t mb_lo = scale4;
uint32_t sc_hi = (scale8 & 0x0F0F0F0F) | ((scale0 & 0xC0C0C0C0) >> 2);
uint32_t mb_hi = ((scale8 & 0xF0F0F0F0) >> 4) | ((scale4 & 0xC0C0C0C0) >> 2);
sc = is < 4 ? sc_lo : sc_hi;
mbyte = is < 4 ? mb_lo : mb_hi;
sc = sc >> (8 * (is & 3));
mbyte = mbyte >> (8 * (is & 3));
sc &= 0x3F;
mbyte &= 0x3F;
const float16_t d = loadd.x * float16_t(sc);
const float16_t m = loadd.y * float16_t(mbyte);
#endif
// sh is 0 or 4; mask 0x0F0F0F0F covers the four nibbles regardless (no inter-byte leakage).
const uint sh = (idx & 0x20u) >> 3u;
const uint qs_w = (idx >> 6) * 8u + ((idx & 0x1Eu) >> 2);
const uint qh_w = (idx & 0x1Eu) >> 2;
const uint ql4 = (uint32_t(bl32.block.qs[qs_w]) >> sh) & 0x0F0F0F0Fu;
// qh stores bit `is` per element across 4 consecutive bytes; one shift+mask handles all 4.
const uint qh4 = ((uint32_t(bl32.block.qh[qh_w]) >> is) & 0x01010101u) << 4u;
const u8vec4 qi = unpack8(ql4 | qh4);
return f16vec4(vec4(qi) * vec4(d) - vec4(m));
}
layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufQ6_K {
block_q6_K block;
};
@@ -431,6 +717,35 @@ float16_t dequantFuncQ6_K(const in decodeBufQ6_K bl, const in uint blockCoords[2
return ret;
}
f16vec4 dequantFuncQ6_K_v(const in decodeBufQ6_K bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufQ6_K_packed16 bl16 = decodeBufQ6_K_packed16(bl);
const uint idx = coordInBlock[1];
const uint b = (idx & 0x40) >> 6;
const uint qhshift = (idx & 0x60) >> 4; // 0,2,4,6
const uint is = idx >> 4;
const uint sh = b * 4; // 0 or 4
const float16_t dscale = bl.block.d * float16_t(bl.block.scales[is]);
const uint ql_i = ((idx & 0x80) >> 2) + ((idx & 0x3E) >> 1);
const uint qh_i = ((idx & 0x80) >> 3) + ((idx & 0x1E) >> 1);
// Two adjacent uint16 packed16 reads, combined into a uint32 in registers.
// After this: byte j of qlw / qhw holds the data for element idx+j.
const uint qlw = uint32_t(bl16.block.ql[ql_i ]) | (uint32_t(bl16.block.ql[ql_i + 1]) << 16);
const uint qhw = uint32_t(bl16.block.qh[qh_i ]) | (uint32_t(bl16.block.qh[qh_i + 1]) << 16);
// sh in {0,4} and qhshift in {0,2,4,6}: per-byte masks 0x0F / 0x03 keep only the
// wanted bits with no inter-byte leakage; place qh's 2 bits at nibble high position.
const uint ql4 = (qlw >> sh) & 0x0F0F0F0Fu;
const uint qh4 = ((qhw >> qhshift) & 0x03030303u) << 4u;
const ivec4 qi = ivec4(unpack8(ql4 | qh4));
return f16vec4((vec4(qi) - vec4(32.0f)) * vec4(float(dscale)));
}
#if defined(DATA_A_IQ1_S)
layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufIQ1_S {
block_iq1_s block;
@@ -453,6 +768,29 @@ float16_t dequantFuncIQ1_S(const in decodeBufIQ1_S bl, const in uint blockCoords
float16_t ret = float16_t(dl) * (float16_t(bitfieldExtract(int(grid), 2 * int(idx % 8), 2)) + float16_t(delta));
return ret;
}
f16vec4 dequantFuncIQ1_S_v(const in decodeBufIQ1_S bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float16_t d = bl.block.d;
const uint idx = coordInBlock[1];
const uint ib32 = idx >> 5;
const uint ib8 = idx >> 3;
const int i8b = int(idx & 4); // 0 or 4
const uint qh = bl.block.qh[ib32];
const uint qs = bl.block.qs[ib8];
const float dl = float(d) * float(2 * bitfieldExtract(qh, 12, 3) + 1);
const float delta = ((qh & 0x8000u) != 0u) ? -IQ1S_DELTA : IQ1S_DELTA;
const uint grid = iq1s_grid[qs | (bitfieldExtract(qh, 3 * int(ib8 & 3), 3) << 8)];
const ivec4 q = ivec4(
bitfieldExtract(int(grid), 2 * (i8b + 0), 2),
bitfieldExtract(int(grid), 2 * (i8b + 1), 2),
bitfieldExtract(int(grid), 2 * (i8b + 2), 2),
bitfieldExtract(int(grid), 2 * (i8b + 3), 2));
return f16vec4((vec4(q) + vec4(delta)) * dl);
}
#endif
#if defined(DATA_A_IQ1_M)
@@ -485,6 +823,33 @@ float16_t dequantFuncIQ1_M(const in decodeBufIQ1_M bl, const in uint blockCoords
float16_t ret = d * float16_t(dl) * (float16_t(bitfieldExtract(int(grid), 2 * i8, 2)) + float16_t(delta));
return ret;
}
f16vec4 dequantFuncIQ1_M_v(const in decodeBufIQ1_M bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufIQ1_M_packed64 bl64 = decodeBufIQ1_M_packed64(bl);
const uint idx = coordInBlock[1];
uvec2 scales = unpack32(bl64.block.scales);
const float16_t d = uint16BitsToHalf(uint16_t(((scales.x & 0xF000) >> 12) | ((scales.x & 0xF0000000) >> 24) | ((scales.y & 0xF000) >> 4) | ((scales.y & 0xF0000000) >> 16)));
const uint ib8 = idx >> 3;
const uint ib16 = idx >> 4;
const int i8b = int(idx & 4); // 0 or 4 -- i8 base for the V=4 group
const uint sc = bl.block.scales[ib8 / 8];
const uint qs = bl.block.qs[ib8];
const uint qh = bl.block.qh[ib16] >> (4 * (ib8 & 1));
const float dl = 2.0 * float(bitfieldExtract(sc, 3 * int(ib16 & 3), 3)) + 1.0;
const float delta = ((qh & 8u) != 0u) ? -IQ1S_DELTA : IQ1S_DELTA;
const uint grid = iq1s_grid[qs | ((qh & 7u) << 8)];
const ivec4 q = ivec4(
bitfieldExtract(int(grid), 2 * (i8b + 0), 2),
bitfieldExtract(int(grid), 2 * (i8b + 1), 2),
bitfieldExtract(int(grid), 2 * (i8b + 2), 2),
bitfieldExtract(int(grid), 2 * (i8b + 3), 2));
return f16vec4((vec4(q) + vec4(delta)) * (float(d) * dl));
}
#endif
#if defined(DATA_A_IQ2_XXS)
@@ -520,6 +885,33 @@ float16_t dequantFuncIQ2_XXS(const in decodeBufIQ2_XXS bl, const in uint blockCo
vec2 ret = dscale * g * ((sign & (1 << (idx & 7))) != 0 ? -1.0hf : 1.0hf);
return float16_t(ret[idx & 1]);
}
f16vec4 dequantFuncIQ2_XXS_v(const in decodeBufIQ2_XXS bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufIQ2_XXS_packed16 bl16 = decodeBufIQ2_XXS_packed16(bl);
const uint idx = coordInBlock[1];
const uint ib32 = idx >> 5;
const uint ib8 = (idx & 0x18) >> 3;
const uint iqs = 8 * ib32 + ib8;
const uint qs = bl.block.qs[iqs];
const uint signscale = pack32(u16vec2(bl16.block.qs[4*ib32+2], bl16.block.qs[4*ib32+3]));
const float dscale = float(bl.block.d) * 0.25 * (0.5 + float(signscale >> 28));
uint sign = bitfieldExtract(signscale, 7 * int(ib8), 7);
sign |= bitCount(sign) << 7;
const uint sb = sign >> (idx & 7u);
const uint g2 = iq2xxs_grid[qs][(idx & 4) >> 2];
const u8vec4 g = unpack8(g2);
return f16vec4(
dscale * float(g.x) * ((sb & 1u) != 0u ? -1.0 : 1.0),
dscale * float(g.y) * ((sb & 2u) != 0u ? -1.0 : 1.0),
dscale * float(g.z) * ((sb & 4u) != 0u ? -1.0 : 1.0),
dscale * float(g.w) * ((sb & 8u) != 0u ? -1.0 : 1.0));
}
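Review note: in `dequantFuncIQ2_XXS_v` only seven sign bits are extracted, and `sign |= bitCount(sign) << 7` places the parity of those seven bits at position 7. The shifted popcount can also set bits above 7, but the decoder only inspects bits 0..3 after shifting by 0 or 4, so those never matter. A minimal C++ sketch of the expansion, with hypothetical function names:

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>

// Expand 7 stored sign bits to 8: bit 7 becomes the parity of bits 0..6.
// Bits above 7 may be set as well; callers read bits 0..7 only.
static uint32_t expand_signs(uint32_t sign7) {
    return sign7 | (uint32_t(std::bitset<7>(sign7).count()) << 7);
}

// Signs (+1 or -1) for elements idx..idx+3 of an 8-element group.
// idx must be a multiple of 4, so the shift amount is 0 or 4.
static std::array<int, 4> signs_vec4(uint32_t sign7, uint32_t idx) {
    const uint32_t sb = expand_signs(sign7) >> (idx & 7u);
    return { (sb & 1u) != 0u ? -1 : 1,
             (sb & 2u) != 0u ? -1 : 1,
             (sb & 4u) != 0u ? -1 : 1,
             (sb & 8u) != 0u ? -1 : 1 };
}
```

A consequence of the parity bit is that each group of eight elements always carries an even number of negative signs, which is what lets the format store seven bits instead of eight.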
#endif
#if defined(DATA_A_IQ2_XS)
@@ -548,6 +940,31 @@ float16_t dequantFuncIQ2_XS(const in decodeBufIQ2_XS bl, const in uint blockCoor
vec2 ret = dscale * g * ((sign & (1 << (idx & 7))) != 0 ? -1.0hf : 1.0hf);
return float16_t(ret[idx & 1]);
}
f16vec4 dequantFuncIQ2_XS_v(const in decodeBufIQ2_XS bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const uint idx = coordInBlock[1];
const uint is = idx >> 5;
const uint sshift = (idx & 0x10) >> 2;
const uint iqs = idx >> 3;
const uint16_t qs = bl.block.qs[iqs];
const float dscale = float(bl.block.d) * 0.25 * (0.5 + float((bl.block.scales[is] >> sshift) & 0xF));
uint sign = uint(qs >> 9);
sign |= bitCount(sign) << 7;
const uint sb = sign >> (idx & 7u);
const uint g2 = iq2xs_grid[qs & 0x1FF][(idx & 4) >> 2];
const u8vec4 g = unpack8(g2);
return f16vec4(
dscale * float(g.x) * ((sb & 1u) != 0u ? -1.0 : 1.0),
dscale * float(g.y) * ((sb & 2u) != 0u ? -1.0 : 1.0),
dscale * float(g.z) * ((sb & 4u) != 0u ? -1.0 : 1.0),
dscale * float(g.w) * ((sb & 8u) != 0u ? -1.0 : 1.0));
}
#endif
#if defined(DATA_A_IQ2_S)
@@ -576,6 +993,32 @@ float16_t dequantFuncIQ2_S(const in decodeBufIQ2_S bl, const in uint blockCoords
const vec2 v = db * vec2(sign01) * vec2(unpack8(g2));
return float16_t(v[idx & 1]);
}
f16vec4 dequantFuncIQ2_S_v(const in decodeBufIQ2_S bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const uint idx = coordInBlock[1];
const uint ib32 = idx >> 5;
const uint ib8 = idx >> 3;
const uint qhshift = 2 * (ib8 % 4);
const uint scale = (bl.block.scales[ib32] >> ((idx & 0x10) >> 2)) & 0xf;
const uint qs = bl.block.qs[ib8];
const uint qh = bl.block.qh[ib32];
const uint sb = uint(bl.block.qs[QUANT_K / 8 + ib8]) >> (idx & 0x6u);
const float d = float(bl.block.d);
const float db = d * 0.25 * (0.5 + scale);
const uint g2 = iq2s_grid[qs | ((qh << (8 - qhshift)) & 0x300)][(idx & 4) >> 2];
const u8vec4 g = unpack8(g2);
return f16vec4(
db * float(g.x) * ((sb & 1u) != 0u ? -1.0 : 1.0),
db * float(g.y) * ((sb & 2u) != 0u ? -1.0 : 1.0),
db * float(g.z) * ((sb & 4u) != 0u ? -1.0 : 1.0),
db * float(g.w) * ((sb & 8u) != 0u ? -1.0 : 1.0));
}
#endif
#if defined(DATA_A_IQ3_XXS)
@@ -609,6 +1052,32 @@ float16_t dequantFuncIQ3_XXS(const in decodeBufIQ3_XXS bl, const in uint blockCo
const vec2 v = db * vec2(sign01) * vec2(unpack8(grid).xy);
return float16_t(v[idx & 1]);
}
f16vec4 dequantFuncIQ3_XXS_v(const in decodeBufIQ3_XXS bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufIQ3_XXS_packed16 bl16 = decodeBufIQ3_XXS_packed16(bl);
const uint idx = coordInBlock[1];
const uint iqs = idx >> 2;
const uint is = QUANT_K / 4 + ((idx & 0xE0) >> 3);
const float d = float(bl.block.d);
const uint qs = bl.block.qs[iqs];
const uint signs = pack32(u16vec2(bl16.block.qs[is/2+0], bl16.block.qs[is/2+1]));
const float db = d * 0.5 * (0.5 + (signs >> 28));
const uint sign7 = bitfieldExtract(signs, 7 * (int(iqs / 2) % 4), 7);
const uint sb = (sign7 | (bitCount(sign7) << 7)) >> (idx & 0x6u);
const uint grid = iq3xxs_grid[qs];
const u8vec4 g = unpack8(grid);
return f16vec4(
db * float(g.x) * ((sb & 1u) != 0u ? -1.0 : 1.0),
db * float(g.y) * ((sb & 2u) != 0u ? -1.0 : 1.0),
db * float(g.z) * ((sb & 4u) != 0u ? -1.0 : 1.0),
db * float(g.w) * ((sb & 8u) != 0u ? -1.0 : 1.0));
}
#endif
#if defined(DATA_A_IQ3_S)
@@ -635,6 +1104,30 @@ float16_t dequantFuncIQ3_S(const in decodeBufIQ3_S bl, const in uint blockCoords
return float16_t(v[idx & 1]);
}
f16vec4 dequantFuncIQ3_S_v(const in decodeBufIQ3_S bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const uint idx = coordInBlock[1];
const uint iqs = idx >> 2;
const uint iqh = idx >> 5;
const float d = float(bl.block.d);
const uint qs = bl.block.qs[iqs];
const uint qh = bl.block.qh[iqh];
const uint sb = uint(bl.block.signs[iqs / 2]) >> (idx & 0x6u);
const uint scale = bl.block.scales[iqs / 16];
const float db = d * (1 + 2 * ((scale >> (4 * (iqh & 1))) & 0xf));
const uint grid = iq3s_grid[qs | ((qh << (8 - (iqs % 8))) & 256)];
const u8vec4 g = unpack8(grid);
return f16vec4(
db * float(g.x) * ((sb & 1u) != 0u ? -1.0 : 1.0),
db * float(g.y) * ((sb & 2u) != 0u ? -1.0 : 1.0),
db * float(g.z) * ((sb & 4u) != 0u ? -1.0 : 1.0),
db * float(g.w) * ((sb & 8u) != 0u ? -1.0 : 1.0));
}
#endif
#if defined(DATA_A_IQ4_XS)
@@ -642,6 +1135,10 @@ layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufIQ4
block_iq4_xs block;
};
layout(buffer_reference, std430, buffer_reference_align = 4) buffer decodeBufIQ4_XS_packed32 {
block_iq4_xs_packed32 block;
};
float16_t dequantFuncIQ4_XS(const in decodeBufIQ4_XS bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float16_t d = bl.block.d;
@@ -657,6 +1154,30 @@ float16_t dequantFuncIQ4_XS(const in decodeBufIQ4_XS bl, const in uint blockCoor
float16_t ret = d * float16_t(int(sl | (sh << 4)) - 32) * float16_t(kvalues_iq4nl[q]);
return ret;
}
f16vec4 dequantFuncIQ4_XS_v(const in decodeBufIQ4_XS bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufIQ4_XS_packed32 bl32 = decodeBufIQ4_XS_packed32(bl);
const float16_t d = bl.block.d;
const uint idx = coordInBlock[1];
const uint ib32 = idx >> 5; // 0..7
const uint sl = (bl32.block.scales_l >> (4 * ib32)) & 0xF;
const uint sh = (uint(bl32.block.scales_h) >> (2 * ib32)) & 0x3;
const uint qshift = (idx & 0x10) >> 2; // {0, 4}
const uint qs_w = 4 * ib32 + ((idx & 0xC) >> 2); // iqs / 4, in [0,32)
const float16_t dl = d * float16_t(int(sl | (sh << 4)) - 32);
const uint qsw = bl32.block.qs[qs_w];
const u8vec4 qv = unpack8((qsw >> qshift) & 0x0F0F0F0Fu);
const vec4 ret = vec4(
float(kvalues_iq4nl[qv.x]),
float(kvalues_iq4nl[qv.y]),
float(kvalues_iq4nl[qv.z]),
float(kvalues_iq4nl[qv.w])) * float(dl);
return f16vec4(ret);
}
#endif
#if defined(DATA_A_IQ4_NL)
@@ -664,6 +1185,10 @@ layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufIQ4
block_iq4_nl block;
};
layout(buffer_reference, std430, buffer_reference_align = 2) buffer decodeBufIQ4_NL_packed16 {
block_iq4_nl_packed16 block;
};
float16_t dequantFuncIQ4_NL(const in decodeBufIQ4_NL bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float16_t d = bl.block.d;
@@ -676,6 +1201,24 @@ float16_t dequantFuncIQ4_NL(const in decodeBufIQ4_NL bl, const in uint blockCoor
float16_t ret = float16_t(kvalues_iq4nl[qs]) * d;
return ret;
}
f16vec4 dequantFuncIQ4_NL_v(const in decodeBufIQ4_NL bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufIQ4_NL_packed16 bl16 = decodeBufIQ4_NL_packed16(bl);
const float16_t d = bl.block.d;
const uint idx = coordInBlock[1];
const uint shift = (idx & 0x10) >> 2; // 0 or 4
const uint qs_i = (idx & 0xC) >> 1; // packed16 word index, in {0,2,4,6}
const uint qsw = uint32_t(bl16.block.qs[qs_i ])
| (uint32_t(bl16.block.qs[qs_i + 1u]) << 16);
// shift in {0,4}: per-byte mask 0x0F isolates the wanted nibble in each byte.
const u8vec4 q = unpack8((qsw >> shift) & 0x0F0F0F0Fu);
return f16vec4(
float(d) * float(kvalues_iq4nl[q.x]),
float(d) * float(kvalues_iq4nl[q.y]),
float(d) * float(kvalues_iq4nl[q.z]),
float(d) * float(kvalues_iq4nl[q.w]));
}
#endif
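The IQ4_NL and IQ4_XS vector decoders rely on one trick: shift a 32-bit word by 0 or 4 and mask with `0x0F0F0F0F`, which isolates the low or high nibble of all four bytes at once. The C++ sketch below mirrors that step on the host; `unpack_nibbles` is a hypothetical name, and the lookup into `kvalues_iq4nl` is deliberately left out.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: join two 16-bit words into one 32-bit word (low word
// first), then extract the low (shift == 0) or high (shift == 4) nibble of
// each of its four bytes.
inline std::array<uint8_t, 4> unpack_nibbles(uint16_t lo, uint16_t hi, uint32_t shift) {
    const uint32_t word   = static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
    const uint32_t masked = (word >> shift) & 0x0F0F0F0Fu;
    return { static_cast<uint8_t>(masked & 0xFFu),
             static_cast<uint8_t>((masked >> 8) & 0xFFu),
             static_cast<uint8_t>((masked >> 16) & 0xFFu),
             static_cast<uint8_t>((masked >> 24) & 0xFFu) };
}
```

Each returned byte is a 4-bit index that the shader would then feed into its value table and multiply by the block scale.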
#if defined(DATA_A_MXFP4)
@@ -695,6 +1238,26 @@ float16_t dequantFuncMXFP4(const in decodeBufMXFP4 bl, const in uint blockCoords
float16_t ret = float16_t(kvalues_mxfp4[qs] * d * 0.5);
return ret;
}
f16vec4 dequantFuncMXFP4_v(const in decodeBufMXFP4 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const float d = e8m0_to_fp32(bl.block.e);
const uint idx = coordInBlock[1];
const uint iqs = idx & 0xF;
const uint shift = (idx & 0x10) >> 2;
uvec4 qv = uvec4(
uint(bl.block.qs[iqs]),
uint(bl.block.qs[iqs + 1u]),
uint(bl.block.qs[iqs + 2u]),
uint(bl.block.qs[iqs + 3u]));
qv = (qv >> shift) & 0xFu;
const vec4 ret = vec4(
float(kvalues_mxfp4[qv.x]),
float(kvalues_mxfp4[qv.y]),
float(kvalues_mxfp4[qv.z]),
float(kvalues_mxfp4[qv.w])) * d * 0.5f;
return f16vec4(ret);
}
#endif
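The MXFP4 decoder multiplies a table entry by a shared block scale and by 0.5. The sketch below shows that arithmetic under two assumptions that the diff does not confirm: that `e8m0_to_fp32` implements the usual exponent-only encoding 2^(e - 127) with no special cases, and that the value table is passed in by the caller. `e8m0_scale` and `dequant_fp4` are hypothetical names.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Assumption: an E8M0 byte encodes 2^(e - 127); special values are ignored here.
inline float e8m0_scale(uint8_t e) {
    return std::ldexp(1.0f, static_cast<int>(e) - 127);
}

// Hypothetical helper: dequantize one 4-bit index using a caller-supplied
// 16-entry table, the shared block exponent, and the 0.5 factor the shader applies.
inline float dequant_fp4(const std::array<int8_t, 16> & table, uint8_t q, uint8_t e) {
    return static_cast<float>(table[q & 0xFu]) * e8m0_scale(e) * 0.5f;
}
```

The constant 0.5 suggests the table stores doubled magnitudes so that it can hold integers, but that is an inference from the shader code, not something the diff states.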
#if defined(DATA_A_NVFP4)
@@ -702,6 +1265,10 @@ layout(buffer_reference, std430, buffer_reference_align = 4) buffer decodeBufNVF
block_nvfp4 block;
};
layout(buffer_reference, std430, buffer_reference_align = 4) buffer decodeBufNVFP4_packed32 {
block_nvfp4_packed32 block;
};
float16_t dequantFuncNVFP4(const in decodeBufNVFP4 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
const uint idx = coordInBlock[1];
@@ -713,56 +1280,97 @@ float16_t dequantFuncNVFP4(const in decodeBufNVFP4 bl, const in uint blockCoords
qs = (qs >> shift) & 0xF;
return float16_t(kvalues_mxfp4[qs] * d * 0.5);
}
f16vec4 dequantFuncNVFP4_v(const in decodeBufNVFP4 bl, const in uint blockCoords[2], const in uint coordInBlock[2])
{
decodeBufNVFP4_packed32 bl32 = decodeBufNVFP4_packed32(bl);
const uint idx = coordInBlock[1];
const uint sub = idx >> 4;
const uint qs_w = ((idx & 0x30) >> 3) + ((idx & 0x4u) >> 2); // iqs / 4, in [0,8)
const uint shift = (idx & 0x8) >> 1;
const float d = ue4m3_to_fp32(bl.block.d[sub]);
const uint qsw = uint32_t(bl32.block.qs[qs_w]);
const u8vec4 qv = unpack8((qsw >> shift) & 0x0F0F0F0Fu);
const vec4 ret = vec4(
float(kvalues_mxfp4[qv.x]),
float(kvalues_mxfp4[qv.y]),
float(kvalues_mxfp4[qv.z]),
float(kvalues_mxfp4[qv.w])) * d * 0.5f;
return f16vec4(ret);
}
#endif
#if defined(DATA_A_Q1_0)
#define dequantFuncA dequantFuncQ1_0
#define dequantFuncA_v dequantFuncQ1_0_v
#elif defined(DATA_A_Q4_0)
#define dequantFuncA dequantFuncQ4_0
#define dequantFuncA_v dequantFuncQ4_0_v
#elif defined(DATA_A_Q4_1)
#define dequantFuncA dequantFuncQ4_1
#define dequantFuncA_v dequantFuncQ4_1_v
#elif defined(DATA_A_Q5_0)
#define dequantFuncA dequantFuncQ5_0
#define dequantFuncA_v dequantFuncQ5_0_v
#elif defined(DATA_A_Q5_1)
#define dequantFuncA dequantFuncQ5_1
#define dequantFuncA_v dequantFuncQ5_1_v
#elif defined(DATA_A_Q8_0)
#define dequantFuncA dequantFuncQ8_0
#define dequantFuncA_v dequantFuncQ8_0_v
#elif defined(DATA_A_Q2_K)
#define dequantFuncA dequantFuncQ2_K
#define dequantFuncA_v dequantFuncQ2_K_v
#elif defined(DATA_A_Q3_K)
#define dequantFuncA dequantFuncQ3_K
#define dequantFuncA_v dequantFuncQ3_K_v
#elif defined(DATA_A_Q4_K)
#define dequantFuncA dequantFuncQ4_K
#define dequantFuncA_v dequantFuncQ4_K_v
#define fetch_scales fetch_scalesQ4_K
#define store_scales store_scalesQ4_K
#elif defined(DATA_A_Q5_K)
#define dequantFuncA dequantFuncQ5_K
#define dequantFuncA_v dequantFuncQ5_K_v
#define fetch_scales fetch_scalesQ5_K
#define store_scales store_scalesQ4_K
#elif defined(DATA_A_Q6_K)
#define dequantFuncA dequantFuncQ6_K
#define dequantFuncA_v dequantFuncQ6_K_v
#elif defined(DATA_A_IQ1_S)
#define dequantFuncA dequantFuncIQ1_S
#define dequantFuncA_v dequantFuncIQ1_S_v
#elif defined(DATA_A_IQ1_M)
#define dequantFuncA dequantFuncIQ1_M
#define dequantFuncA_v dequantFuncIQ1_M_v
#elif defined(DATA_A_IQ2_XXS)
#define dequantFuncA dequantFuncIQ2_XXS
#define dequantFuncA_v dequantFuncIQ2_XXS_v
#elif defined(DATA_A_IQ2_XS)
#define dequantFuncA dequantFuncIQ2_XS
#define dequantFuncA_v dequantFuncIQ2_XS_v
#elif defined(DATA_A_IQ2_S)
#define dequantFuncA dequantFuncIQ2_S
#define dequantFuncA_v dequantFuncIQ2_S_v
#elif defined(DATA_A_IQ3_XXS)
#define dequantFuncA dequantFuncIQ3_XXS
#define dequantFuncA_v dequantFuncIQ3_XXS_v
#elif defined(DATA_A_IQ3_S)
#define dequantFuncA dequantFuncIQ3_S
#define dequantFuncA_v dequantFuncIQ3_S_v
#elif defined(DATA_A_IQ4_XS)
#define dequantFuncA dequantFuncIQ4_XS
#define dequantFuncA_v dequantFuncIQ4_XS_v
#elif defined(DATA_A_IQ4_NL)
#define dequantFuncA dequantFuncIQ4_NL
#define dequantFuncA_v dequantFuncIQ4_NL_v
#elif defined(DATA_A_MXFP4)
#define dequantFuncA dequantFuncMXFP4
#define dequantFuncA_v dequantFuncMXFP4_v
#elif defined(DATA_A_NVFP4)
#define dequantFuncA dequantFuncNVFP4
#define dequantFuncA_v dequantFuncNVFP4_v
#elif defined(DATA_A_F32)
#define dequantFuncA dequantFuncF32
#endif

@@ -0,0 +1,7 @@
#version 460
#extension GL_NV_cooperative_matrix_decode_vector : require
void main()
{
}

@@ -11,6 +11,9 @@
#extension GL_KHR_memory_scope_semantics : enable
#extension GL_KHR_cooperative_matrix : enable
#extension GL_NV_cooperative_matrix2 : enable
#ifdef GL_NV_cooperative_matrix_decode_vector
#extension GL_NV_cooperative_matrix_decode_vector : enable
#endif
#extension GL_EXT_buffer_reference : enable
#extension GL_KHR_shader_subgroup_ballot : enable
#extension GL_KHR_shader_subgroup_vote : enable
@@ -54,6 +57,41 @@ float16_t faDecodeV(const decodeBufFA_V bl_in, const uint blockCoords[2], const
}
}
// V=4 vector decode for K/V; dispatches to per-format _v decoders.
f16vec4 faDecodeKVector(const decodeBufFA_K bl_in, const uint blockCoords[2], const uint coordInBlock[2]) {
switch (FaTypeK) {
case 0u: return f16vec4(decodeBufF32(bl_in).block);
case 2u: return dequantFuncQ4_0_v(decodeBufQ4_0(bl_in), blockCoords, coordInBlock);
case 3u: return dequantFuncQ4_1_v(decodeBufQ4_1(bl_in), blockCoords, coordInBlock);
case 6u: return dequantFuncQ5_0_v(decodeBufQ5_0(bl_in), blockCoords, coordInBlock);
case 7u: return dequantFuncQ5_1_v(decodeBufQ5_1(bl_in), blockCoords, coordInBlock);
case 8u: return dequantFuncQ8_0_v(decodeBufQ8_0(bl_in), blockCoords, coordInBlock);
case 41u: return dequantFuncQ1_0_v(decodeBufQ1_0(bl_in), blockCoords, coordInBlock);
default: return f16vec4(0);
}
}
f16vec4 faDecodeVVector(const decodeBufFA_V bl_in, const uint blockCoords[2], const uint coordInBlock[2]) {
switch (FaTypeV) {
case 0u: return f16vec4(decodeBufF32(bl_in).block);
case 2u: return dequantFuncQ4_0_v(decodeBufQ4_0(bl_in), blockCoords, coordInBlock);
case 3u: return dequantFuncQ4_1_v(decodeBufQ4_1(bl_in), blockCoords, coordInBlock);
case 6u: return dequantFuncQ5_0_v(decodeBufQ5_0(bl_in), blockCoords, coordInBlock);
case 7u: return dequantFuncQ5_1_v(decodeBufQ5_1(bl_in), blockCoords, coordInBlock);
case 8u: return dequantFuncQ8_0_v(decodeBufQ8_0(bl_in), blockCoords, coordInBlock);
case 41u: return dequantFuncQ1_0_v(decodeBufQ1_0(bl_in), blockCoords, coordInBlock);
default: return f16vec4(0);
}
}
#ifdef GL_NV_cooperative_matrix_decode_vector
#define FADECODEK , faDecodeK, faDecodeKVector
#define FADECODEV , faDecodeV, faDecodeVVector
#else
#define FADECODEK , faDecodeK
#define FADECODEV , faDecodeV
#endif
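`FADECODEK` and `FADECODEV` (and `DECODEFUNCA` in the matmul shader) are macros whose expansion begins with a comma, so a single call site can pass one or two extra decode callbacks depending on whether the extension is available. The C++ sketch below shows the same preprocessor pattern. Every name in it (`HAVE_VECTOR_DECODE`, `DECODE_ARGS`, `load`, `load_with_decode`) is hypothetical.

```cpp
using decode_fn = int (*)(int);

inline int scalar_decode(int x) { return x; }
inline int vector_decode(int x) { return x; }

// Overloads standing in for a loader that accepts zero, one, or two decoders.
// Each returns the number of decode callbacks it was given.
inline int load(int) { return 0; }
inline int load(int, decode_fn) { return 1; }
inline int load(int, decode_fn, decode_fn) { return 2; }

// The macro carries its own leading comma, so the call site stays identical
// whether or not the vector decoder is compiled in.
#ifdef HAVE_VECTOR_DECODE
#define DECODE_ARGS , scalar_decode, vector_decode
#else
#define DECODE_ARGS , scalar_decode
#endif

inline int load_with_decode(int base) {
    return load(base DECODE_ARGS);
}
```

With `HAVE_VECTOR_DECODE` undefined, the call expands to the one-decoder overload; defining it switches to the two-decoder overload without touching `load_with_decode`.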
layout (binding = 0) readonly buffer Q {uint8_t data_q[];};
layout (binding = 1) readonly buffer K {uint8_t data_k[];};
layout (binding = 2) readonly buffer V {uint8_t data_v[];};
@@ -259,7 +297,7 @@ void main() {
// F16: bs_k==1 (direct load). F32: bs_k==4 (vec4 / dequantFuncF32). Q4/Q8 family: bs_k==32. Q1_0: bs_k==128.
const bool k_use_decode = (bs_k > 1u);
if (k_use_decode) {
coopMatLoadTensorNV(K_T, data_k, k_offset, sliceTensorLayoutNV(tensorLayoutK, j * Bc, Bc, 0, HSK_pad), tensorViewTranspose, faDecodeK);
coopMatLoadTensorNV(K_T, data_k, k_offset, sliceTensorLayoutNV(tensorLayoutK, j * Bc, Bc, 0, HSK_pad), tensorViewTranspose FADECODEK);
} else {
coopMatLoadTensorNV(K_T, data_k, k_offset, sliceTensorLayoutNV(tensorLayoutK, j * Bc, Bc, 0, HSK_pad), tensorViewTranspose);
}
@@ -325,7 +363,7 @@ void main() {
uint32_t v_offset = iv2*p.nb22 + iv3*p.nb23;
const bool v_use_decode = (bs_v > 1u);
if (v_use_decode) {
coopMatLoadTensorNV(V, data_v, v_offset, sliceTensorLayoutNV(tensorLayoutV, j * Bc, Bc, 0, HSV_pad), faDecodeV);
coopMatLoadTensorNV(V, data_v, v_offset, sliceTensorLayoutNV(tensorLayoutV, j * Bc, Bc, 0, HSV_pad) FADECODEV);
} else {
coopMatLoadTensorNV(V, data_v, v_offset, sliceTensorLayoutNV(tensorLayoutV, j * Bc, Bc, 0, HSV_pad));
}

@@ -10,12 +10,38 @@ layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
#if !defined(DATA_A_F32) && !defined(DATA_A_F16) && !defined(DATA_A_BF16)
#define K_PER_ITER 8
#else
#define K_PER_ITER 2
#define K_PER_ITER 4
#endif
uint a_offset, b_offset, d_offset, y_offset;
vec4 load_b(const uint j, const uint iybs, const uint iqs, const bool lastiter, out bool OOB_y, out bool OOB_z, out bool OOB_w) {
// Check if the latter elements are OOB, and don't fetch B or accumulate it.
OOB_y = lastiter && (iybs + iqs + y_offset >= p.ncols);
OOB_z = lastiter && (iybs + iqs + y_offset*2 >= p.ncols);
OOB_w = lastiter && (iybs + iqs + y_offset*3 >= p.ncols);
if (!OOB_w) {
return vec4(FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs]),
FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs + y_offset]),
FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs + y_offset*2]),
FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs + y_offset*3]));
} else if (!OOB_z) {
return vec4(FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs]),
FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs + y_offset]),
FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs + y_offset*2]),
0);
} else if (!OOB_y) {
return vec4(FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs]),
FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs + y_offset]),
0, 0);
} else {
return vec4(FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs]),
0, 0, 0);
}
}
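`load_b` fetches four B elements spaced `y_offset` apart and replaces any lane that would fall past `p.ncols` with zero. A simplified host-side sketch of that guarded load follows. `load4_guarded` is a hypothetical name, and unlike the shader it checks every lane on every call rather than only on the last iteration.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical helper: read four values spaced `stride` apart starting at
// `base`. A lane whose index is at or beyond `ncols` (or beyond the buffer)
// is left at 0 instead of being fetched.
inline std::array<float, 4> load4_guarded(const std::vector<float> & b,
                                          std::size_t base, std::size_t stride,
                                          std::size_t ncols) {
    std::array<float, 4> out{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        const std::size_t idx = base + lane * stride;
        if (idx < ncols && idx < b.size()) {
            out[lane] = b[idx];
        }
    }
    return out;
}
```

Zero-filling keeps the later `dot` well defined; the shader additionally avoids reading the matching A elements by branching on `OOB_w`, `OOB_z`, and `OOB_y`.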
void iter(inout FLOAT_TYPE temp[NUM_COLS][NUM_ROWS], const uint first_row, const uint num_rows, const uint tid, const uint i, bool lastiter)
{
[[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
@@ -25,6 +51,8 @@ void iter(inout FLOAT_TYPE temp[NUM_COLS][NUM_ROWS], const uint first_row, const
#if K_PER_ITER == 8
#if QUANT_R == 2
// Note that we end up fetching bogus elements here, but it's fine as they'll be
// within an accessible block.
const vec4 bv02 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs) / 4]);
const vec4 bv13 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs + y_offset) / 4]);
const vec4 bv0 = vec4(bv02.x, bv13.x, bv02.y, bv13.y);
@@ -34,18 +62,11 @@ void iter(inout FLOAT_TYPE temp[NUM_COLS][NUM_ROWS], const uint first_row, const
const vec4 bv1 = vec4(data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs) / 4 + 1]);
#endif
#else
// Check if the second of the pair of elements is OOB, and don't fetch B or
// accumulate it. We still fetch a pair of elements for A, which is fine for
// quantized formats since they'll be within the same block. We should
// probably skip fetching the second element for F16/F32, but as of now we
// still do.
const bool OOB = lastiter && (iybs + iqs + y_offset >= p.ncols);
bool OOB_y;
bool OOB_z;
bool OOB_w;
FLOAT_TYPE b0 = 0, b1 = 0;
b0 = FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs]);
if (!OOB) {
b1 = FLOAT_TYPE(data_b[j*p.batch_stride_b + b_offset + iybs + iqs + y_offset]);
}
const vec4 b = load_b(j, iybs, iqs, lastiter, OOB_y, OOB_z, OOB_w);
#endif
uint ibi = first_row*p.ncols;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
@@ -71,22 +92,60 @@ void iter(inout FLOAT_TYPE temp[NUM_COLS][NUM_ROWS], const uint first_row, const
temp[j][n] += rowtmp;
#else
const vec2 v = dequantize(ib, iqs, a_offset);
// matrix multiplication
temp[j][n] = fma(FLOAT_TYPE(v.x), b0, temp[j][n]);
if (!OOB) {
temp[j][n] = fma(FLOAT_TYPE(v.y), b1, temp[j][n]);
if (!OOB_w) {
const vec4 v = dequantize4(ib, iqs, a_offset);
temp[j][n] += dot(v, b);
} else if (!OOB_z) {
const vec2 v0 = dequantize(ib, iqs, a_offset);
const FLOAT_TYPE v1 = dequantize1(ib + 2/QUANT_R, iqs, a_offset);
const vec3 v = vec3(v0.x, v0.y, v1);
const vec3 b0 = vec3(b.x, b.y, b.z);
temp[j][n] += dot(v, b0);
} else if (!OOB_y) {
const vec2 v0 = dequantize(ib, iqs, a_offset);
const vec2 b0 = vec2(b.x, b.y);
temp[j][n] += dot(v0, b0);
} else {
const FLOAT_TYPE v = dequantize1(ib, iqs, a_offset);
temp[j][n] = fma(v, b.x, temp[j][n]);
}
#endif
}
}
}
#if defined(DATA_A_F32) || defined(DATA_A_F16) || defined(DATA_A_BF16)
void iter_aligned_nonquant(inout FLOAT_TYPE temp[NUM_COLS][NUM_ROWS], const uint first_row, const uint num_rows, const uint tid, const uint i)
{
[[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
const uint col = i*BLOCK_SIZE + K_PER_ITER*tid;
const uint iqs = 0; // quant index
const uint iybs = col; // y block start index
const vec4 b = data_b_v4[(j*p.batch_stride_b + b_offset + iybs + iqs) / 4];
uint ibi = first_row*p.ncols;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib = (ibi + col)/QUANT_K; // block index
ibi += p.ncols;
const vec4 v = dequantize4_2aligned(ib, iqs, a_offset);
// matrix multiplication
temp[j][n] += dot(v, b);
}
}
}
#endif
void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
const uint tid = gl_LocalInvocationID.x;
get_offsets(a_offset, b_offset, d_offset);
const bool is_aligned_nonquant =
p.batch_stride_b % 4 == 0 && b_offset % 4 == 0 &&
p.ncols % 4 == 0 && BLOCK_SIZE % 4 == 0 &&
K_PER_ITER == 4;
y_offset = QUANT_R == 1 ? 1 : QUANT_K/2;
@@ -105,17 +164,26 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
int unroll_count = 4;
uint unrolled_iters = num_iters & ~(unroll_count - 1);
#if K_PER_ITER == 2
uint i = 0;
#if K_PER_ITER == 4
// If the K dimension is not a multiple of 4, we need lastiter==true on the last
// iteration so OOB is computed correctly. Skip some unrolling to make that happen.
if ((p.ncols & 1) != 0 &&
if ((p.ncols & 3) != 0 &&
unrolled_iters == num_iters &&
unrolled_iters > 0) {
unrolled_iters -= unroll_count;
}
if (is_aligned_nonquant) {
while (i < unrolled_iters) {
// Manually partially unroll the loop
[[unroll]] for (uint k = 0; k < unroll_count; ++k) {
iter_aligned_nonquant(temp, first_row, num_rows, tid, i*K_PER_ITER);
i++;
}
}
} else {
#endif
uint i = 0;
while (i < unrolled_iters) {
// Manually partially unroll the loop
[[unroll]] for (uint k = 0; k < unroll_count; ++k) {
@@ -123,18 +191,30 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
i++;
}
}
#if K_PER_ITER == 4
}
#endif
unroll_count = 2;
unrolled_iters = num_iters & ~(unroll_count - 1);
#if K_PER_ITER == 2
if ((p.ncols & 1) != 0 &&
#if K_PER_ITER == 4
if ((p.ncols & 3) != 0 &&
unrolled_iters == num_iters &&
unrolled_iters > 0) {
unrolled_iters -= unroll_count;
}
#endif
if (is_aligned_nonquant) {
while (i < unrolled_iters && is_aligned_nonquant) {
// Manually partially unroll the loop
[[unroll]] for (uint k = 0; k < unroll_count; ++k) {
iter_aligned_nonquant(temp, first_row, num_rows, tid, i*K_PER_ITER);
i++;
}
}
} else {
#endif
while (i < unrolled_iters) {
// Manually partially unroll the loop
[[unroll]] for (uint k = 0; k < unroll_count; ++k) {
@@ -142,10 +222,25 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
i++;
}
}
#if K_PER_ITER == 4
}
#endif
#if K_PER_ITER == 4
if (is_aligned_nonquant) {
while (i < num_iters) {
iter_aligned_nonquant(temp, first_row, num_rows, tid, i*K_PER_ITER);
i++;
}
} else {
#endif
while (i < num_iters) {
iter(temp, first_row, num_rows, tid, i*K_PER_ITER, true);
i++;
}
#if K_PER_ITER == 4
}
#endif
reduce_result(temp, d_offset, first_row, num_rows, tid);
}
@@ -164,6 +259,6 @@ void main() {
if (first_row >= p.stride_d) {
return;
}
compute_outputs(first_row, p.stride_d - first_row);
compute_outputs(first_row, min(NUM_ROWS, p.stride_d - first_row));
}
}
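`compute_outputs` splits the K loop into a 4x-unrolled phase, a 2x-unrolled phase, and a tail that runs with `lastiter == true`. When `p.ncols` is not a multiple of 4, it trims the unrolled phases so that at least one tail iteration remains to handle out-of-bounds lanes. The sketch below reproduces only that bookkeeping. `unrolled_limit`, `plan_iters`, and `iter_plan` are hypothetical names.

```cpp
#include <cstdint>

struct iter_plan {
    uint32_t unrolled4;  // iterations run in the 4x-unrolled loop
    uint32_t unrolled2;  // iterations run in the 2x-unrolled loop
    uint32_t tail;       // iterations run one at a time with lastiter == true
};

// Round num_iters down to a multiple of unroll_count; if that would consume
// every iteration while ncols is not a multiple of 4, give one unrolled group
// back so the tail loop still runs.
inline uint32_t unrolled_limit(uint32_t num_iters, uint32_t ncols, uint32_t unroll_count) {
    uint32_t limit = num_iters & ~(unroll_count - 1u);
    if ((ncols & 3u) != 0u && limit == num_iters && limit > 0u) {
        limit -= unroll_count;
    }
    return limit;
}

inline iter_plan plan_iters(uint32_t num_iters, uint32_t ncols) {
    const uint32_t end4 = unrolled_limit(num_iters, ncols, 4u);
    const uint32_t end2 = unrolled_limit(num_iters, ncols, 2u);
    return { end4, end2 - end4, num_iters - end2 };
}
```

For example, eight iterations over an aligned row stay fully unrolled, while the same eight iterations over a row of 30 columns split into 4 + 2 + 2.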

@@ -71,10 +71,12 @@ layout (binding = 1) readonly buffer B {B_TYPE data_b[];};
layout (binding = 2) writeonly buffer D {D_TYPE data_d[];};
#if QUANT_K > 1
#define DECODEFUNCA , dequantFuncA
#include "dequant_funcs_cm2.glsl"
#if defined(dequantFuncA_v) && defined(GL_NV_cooperative_matrix_decode_vector)
#define DECODEFUNCA , dequantFuncA, dequantFuncA_v
#else
#define DECODEFUNCA , dequantFuncA
#endif
#else
#define DECODEFUNCA
#endif

@@ -31,6 +31,7 @@
#else
#define A_TYPE float16_t
#endif
#define A_TYPE_PACKED32 f16vec2
#endif
#if defined(DATA_A_BF16)
@@ -44,6 +45,7 @@
#else
#define A_TYPE uint16_t
#endif
#define A_TYPE_PACKED32 uint32_t
#endif
#define QUANT_K_Q4_0 32
@@ -1722,11 +1724,18 @@ struct block_nvfp4
uint8_t qs[QUANT_K_NVFP4 / 2];
};
struct block_nvfp4_packed32
{
uint32_t d[QUANT_K_NVFP4 / 16 / 4];
uint32_t qs[QUANT_K_NVFP4 / 2 / 4];
};
#if defined(DATA_A_NVFP4)
#define QUANT_K QUANT_K_NVFP4
#define QUANT_R QUANT_R_NVFP4
#define QUANT_AUXF 1
#define A_TYPE block_nvfp4
#define A_TYPE_PACKED32 block_nvfp4_packed32
#endif
#if defined(DATA_A_IQ4_NL) || defined(DATA_A_IQ4_XS)

@@ -798,9 +798,11 @@ void process_shaders() {
string_to_spv("div_f32", "div.comp", {{"A_TYPE", "float"}, {"B_TYPE", "float"}, {"D_TYPE", "float"}, {"FLOAT_TYPE", "float"}});
string_to_spv("repeat_f32", "repeat.comp", {{"A_TYPE", "float"}, {"D_TYPE", "float"}});
string_to_spv("repeat_i32", "repeat.comp", {{"A_TYPE", "int32_t"}, {"D_TYPE", "int32_t"}});
string_to_spv("repeat_back_f32", "repeat_back.comp", {{"A_TYPE", "float"}, {"D_TYPE", "float"}});
string_to_spv("repeat_i16", "repeat.comp", {{"A_TYPE", "int16_t"}, {"D_TYPE", "int16_t"}});
string_to_spv("scale_f32", "scale.comp", {{"A_TYPE", "float"}, {"D_TYPE", "float"}, {"FLOAT_TYPE", "float"}});
string_to_spv("sqr_f32", "square.comp", {{"A_TYPE", "float"}, {"D_TYPE", "float"}, {"FLOAT_TYPE", "float"}});
@@ -984,8 +986,16 @@ void process_shaders() {
string_to_spv(name + (unroll ? "_unroll" : ""), "conv2d_mm.comp", defines);
#if defined(GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT)
if (unroll) {
defines["COOPMAT2"] = "1";
string_to_spv(name, "conv2d_mm.comp", defines, true, false, true);
auto cm2_defines = defines;
cm2_defines["COOPMAT2"] = "1";
string_to_spv(name, "conv2d_mm.comp", cm2_defines, true, false, true);
}
#endif
#if defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT)
if (unroll) {
auto cm1_defines = defines;
cm1_defines["COOPMAT"] = "1";
string_to_spv(name, "conv2d_mm.comp", cm1_defines, true, true, false);
}
#endif
}
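This hunk stops mutating the shared `defines` map and instead copies it for each cooperative-matrix variant, which keeps `COOPMAT2` from leaking into the newly added `COOPMAT` build. A minimal sketch of the pattern follows, assuming the defines are a `std::map<std::string, std::string>`; `build_variants` is a hypothetical stand-in for the calls to `string_to_spv`.

```cpp
#include <map>
#include <string>
#include <vector>

using defines_t = std::map<std::string, std::string>;

// Hypothetical stand-in for the shader generator: records the define set
// each variant would be compiled with.
inline std::vector<defines_t> build_variants(const defines_t & base, bool with_cm2, bool with_cm1) {
    std::vector<defines_t> compiled;
    compiled.push_back(base);  // plain variant

    if (with_cm2) {
        defines_t cm2 = base;  // copy, so the base set stays untouched
        cm2["COOPMAT2"] = "1";
        compiled.push_back(cm2);
    }
    if (with_cm1) {
        defines_t cm1 = base;  // starts from the clean base, not from cm2
        cm1["COOPMAT"] = "1";
        compiled.push_back(cm1);
    }
    return compiled;
}
```

Had the second variant been built from a map that was modified in place, it would have been compiled with both `COOPMAT2` and `COOPMAT` set.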

@@ -52,7 +52,7 @@
#define WEBGPU_MUL_MAT_VEC_LEGACY_Q_OUTPUTS_PER_WG 4
#define WEBGPU_MUL_MAT_VEC_K_Q_OUTPUTS_PER_WG 4
// default size for legacy matrix multiplication
// default size for reg-tile matrix multiplication
#define WEBGPU_MUL_MAT_WG_SIZE 256
// Same hash combine function as in boost
@@ -93,6 +93,8 @@ struct ggml_webgpu_shader_lib_context {
uint32_t sg_mat_k = 0;
uint32_t min_subgroup_size = 0;
uint32_t max_subgroup_size = 0;
bool supports_dot_product = false;
std::string vendor;
};
struct webgpu_pipeline {
@@ -850,31 +852,15 @@ inline ggml_webgpu_flash_attn_decisions ggml_webgpu_flash_attn_get_decisions(
/** Matrix Multiplication **/
struct ggml_webgpu_legacy_mul_mat_pipeline_key {
ggml_type src0_type;
ggml_type src1_type;
bool operator==(const ggml_webgpu_legacy_mul_mat_pipeline_key & other) const {
return src0_type == other.src0_type && src1_type == other.src1_type;
}
};
struct ggml_webgpu_legacy_mul_mat_pipeline_key_hash {
size_t operator()(const ggml_webgpu_legacy_mul_mat_pipeline_key & key) const {
size_t seed = 0;
ggml_webgpu_hash_combine(seed, key.src0_type);
ggml_webgpu_hash_combine(seed, key.src1_type);
return seed;
}
};
struct ggml_webgpu_mul_mat_vec_pipeline_key {
ggml_type src0_type;
ggml_type src1_type;
int vectorized;
bool use_mmvq;
bool operator==(const ggml_webgpu_mul_mat_vec_pipeline_key & other) const {
return src0_type == other.src0_type && src1_type == other.src1_type && vectorized == other.vectorized;
return src0_type == other.src0_type && src1_type == other.src1_type && vectorized == other.vectorized &&
use_mmvq == other.use_mmvq;
}
};
@@ -884,6 +870,7 @@ struct ggml_webgpu_mul_mat_vec_pipeline_key_hash {
ggml_webgpu_hash_combine(seed, key.src0_type);
ggml_webgpu_hash_combine(seed, key.src1_type);
ggml_webgpu_hash_combine(seed, key.vectorized);
ggml_webgpu_hash_combine(seed, key.use_mmvq);
return seed;
}
};
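Adding `use_mmvq` to the key requires updating both `operator==` and the hash; otherwise the MMVQ and non-MMVQ pipelines for the same tensor types would share one cache slot. The sketch below illustrates this with a reduced key. `vec_key`, `vec_key_hash`, and `hash_combine` are hypothetical, and the combine step is the well-known Boost formula, assumed to resemble `ggml_webgpu_hash_combine`, whose body is not shown in this diff.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

// Boost-style hash combine (assumed equivalent to the project's helper).
template <typename T>
inline void hash_combine(std::size_t & seed, const T & v) {
    seed ^= std::hash<T>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical reduced pipeline key.
struct vec_key {
    int  src0_type;
    int  src1_type;
    int  vectorized;
    bool use_mmvq;

    bool operator==(const vec_key & o) const {
        return src0_type == o.src0_type && src1_type == o.src1_type &&
               vectorized == o.vectorized && use_mmvq == o.use_mmvq;
    }
};

struct vec_key_hash {
    std::size_t operator()(const vec_key & k) const {
        std::size_t seed = 0;
        hash_combine(seed, k.src0_type);
        hash_combine(seed, k.src1_type);
        hash_combine(seed, k.vectorized);
        hash_combine(seed, k.use_mmvq);
        return seed;
    }
};
```

Equality is what guarantees correctness; including the new field in the hash as well keeps the two variants from systematically colliding in the same bucket.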
@@ -894,6 +881,20 @@ struct ggml_webgpu_mul_mat_vec_shader_decisions {
uint32_t vec_size;
};
struct ggml_webgpu_quantize_q8_pipeline_key {
ggml_type src0_type;
bool operator==(const ggml_webgpu_quantize_q8_pipeline_key & other) const { return src0_type == other.src0_type; }
};
struct ggml_webgpu_quantize_q8_pipeline_key_hash {
size_t operator()(const ggml_webgpu_quantize_q8_pipeline_key & key) const {
size_t seed = 0;
ggml_webgpu_hash_combine(seed, key.src0_type);
return seed;
}
};
struct ggml_webgpu_mul_mat_pipeline_key {
ggml_type src0_type;
ggml_type src1_type;
@@ -1051,6 +1052,36 @@ struct ggml_webgpu_soft_max_pipeline_key_hash {
}
};
/** MMVQ **/
inline bool ggml_webgpu_can_use_mmvq(const ggml_tensor * src0,
const ggml_tensor * src1,
bool supports_dot_product,
const std::string & vendor) {
if (src1->ne[1] == 1) {
bool supports_dp4a = vendor == "amd" || vendor == "intel" || vendor == "nvidia";
if (supports_dp4a && supports_dot_product) {
switch (src1->type) {
case GGML_TYPE_F32:
switch (src0->type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q4_K:
return src0->ne[0] % 4 == 0;
default:
break;
}
break;
default:
break;
}
}
}
return false;
}
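`ggml_webgpu_can_use_mmvq` enables the quantized mat-vec path only when all of the following hold: a single output column, a vendor on the allow-list, dot-product support, an F32 activation, one of a handful of weight types, and a K dimension divisible by 4. The sketch below restates that predicate with early returns. The `qtype` enum is a hypothetical subset standing in for `ggml_type`, and the tensor fields are passed as plain integers.

```cpp
#include <cstdint>
#include <string>

// Hypothetical subset of tensor types, for illustration only.
enum class qtype { F32, F16, Q4_0, Q4_1, Q5_0, Q8_0, Q2_K, Q4_K };

// k is the shared inner dimension (src0->ne[0]); n is src1->ne[1].
inline bool can_use_mmvq(qtype src0, qtype src1, int64_t k, int64_t n,
                         bool supports_dot_product, const std::string & vendor) {
    if (n != 1) {
        return false;  // mat-vec only
    }
    const bool vendor_ok = vendor == "amd" || vendor == "intel" || vendor == "nvidia";
    if (!vendor_ok || !supports_dot_product) {
        return false;
    }
    if (src1 != qtype::F32) {
        return false;
    }
    switch (src0) {
        case qtype::Q4_0:
        case qtype::Q4_1:
        case qtype::Q8_0:
        case qtype::Q2_K:
        case qtype::Q4_K:
            return k % 4 == 0;
        default:
            return false;
    }
}
```

The early-return form is only a readability choice for the sketch; it is meant to accept and reject the same inputs as the nested `switch` in the diff.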
class ggml_webgpu_shader_lib {
wgpu::Device device;
pre_wgsl::Preprocessor preprocessor;
@@ -1099,14 +1130,12 @@ class ggml_webgpu_shader_lib {
webgpu_pipeline,
ggml_webgpu_flash_attn_blk_pipeline_key_hash>
flash_attn_blk_pipelines;
std::unordered_map<ggml_webgpu_legacy_mul_mat_pipeline_key,
webgpu_pipeline,
ggml_webgpu_legacy_mul_mat_pipeline_key_hash>
mul_mat_legacy_pipelines; // legacy mul_mat (non-subgroup/non-regtile/non-vec)
std::unordered_map<ggml_webgpu_mul_mat_vec_pipeline_key, webgpu_pipeline, ggml_webgpu_mul_mat_vec_pipeline_key_hash>
mul_mat_vec_pipelines; // fast mat-vec (n==1)
std::unordered_map<ggml_webgpu_mul_mat_pipeline_key, webgpu_pipeline, ggml_webgpu_mul_mat_pipeline_key_hash>
mul_mat_fast_pipelines; // fast mat-mat (reg-tile or subgroup)
std::unordered_map<ggml_webgpu_quantize_q8_pipeline_key, webgpu_pipeline, ggml_webgpu_quantize_q8_pipeline_key_hash>
quantize_q8_pipelines;
std::unordered_map<int, webgpu_pipeline> mul_mat_id_gather_pipelines; // key is fixed
std::unordered_map<ggml_webgpu_mul_mat_id_pipeline_key, webgpu_pipeline, ggml_webgpu_mul_mat_id_pipeline_key_hash>
mul_mat_id_pipelines; // src0_type/src1_type
@@ -1631,7 +1660,7 @@ class ggml_webgpu_shader_lib {
key.type = context.dst->type;
key.d_state = (int) context.src0->ne[0];
key.xbc_overlap = ggml_webgpu_tensor_overlap(context.src1, context.src4) &&
ggml_webgpu_tensor_overlap(context.src1, context.src5);
ggml_webgpu_tensor_overlap(context.src1, context.src5);
auto it = ssm_scan_pipelines.find(key);
if (it != ssm_scan_pipelines.end()) {
@@ -1744,6 +1773,44 @@ class ggml_webgpu_shader_lib {
return pad_pipelines[key];
}
webgpu_pipeline get_quantize_q8_pipeline(const ggml_webgpu_shader_lib_context & context) {
ggml_webgpu_quantize_q8_pipeline_key key = {};
key.src0_type = context.src0->type;
auto it = quantize_q8_pipelines.find(key);
if (it != quantize_q8_pipelines.end()) {
return it->second;
}
const char * shader_src = wgsl_quantize_q8;
std::vector<std::string> defines;
std::string variant = "quantize_q8";
uint32_t wg_size = WEBGPU_MUL_MAT_VEC_WG_SIZE;
defines.push_back("SRC1_INNER_TYPE=f32");
defines.push_back(std::string("WG_SIZE=") + std::to_string(wg_size));
const struct ggml_type_traits * src0_traits = ggml_get_type_traits(context.src0->type);
std::string src0_name = src0_traits->type_name;
std::string type_upper = src0_name;
variant += "_" + src0_name;
std::transform(type_upper.begin(), type_upper.end(), type_upper.begin(), ::toupper);
defines.push_back("MUL_ACC_" + type_upper);
defines.push_back("Q8_1_T");
defines.push_back(context.supports_subgroups ? "USE_SUBGROUP_REDUCTION" : "USE_WORKGROUP_REDUCTION");
variant += context.supports_subgroups ? "_sg_reduce" : "_wg_reduce";
auto processed = preprocessor.preprocess(shader_src, defines);
auto decisions = std::make_shared<ggml_webgpu_generic_shader_decisions>();
decisions->wg_size = wg_size;
webgpu_pipeline pipeline = ggml_webgpu_create_pipeline(device, processed, variant);
pipeline.context = decisions;
quantize_q8_pipelines[key] = pipeline;
return quantize_q8_pipelines[key];
}
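`get_quantize_q8_pipeline` derives both the variant name and the preprocessor defines from the weight type name and the subgroup capability. The sketch below isolates that string construction, following the order of the `push_back` calls in the hunk above. `shader_variant` and `make_quantize_variant` are hypothetical names, and pipeline creation and caching are omitted.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

struct shader_variant {
    std::string              name;
    std::vector<std::string> defines;
};

// Hypothetical helper: build the variant name and define list for a
// Q8 quantization shader from a lowercase type name such as "q4_0".
inline shader_variant make_quantize_variant(const std::string & type_name,
                                            bool supports_subgroups, unsigned wg_size) {
    std::string upper = type_name;
    // Cast through unsigned char, since std::toupper is undefined for
    // negative char values.
    std::transform(upper.begin(), upper.end(), upper.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });

    shader_variant v;
    v.name = "quantize_q8_" + type_name + (supports_subgroups ? "_sg_reduce" : "_wg_reduce");
    v.defines = {
        "SRC1_INNER_TYPE=f32",
        "WG_SIZE=" + std::to_string(wg_size),
        "MUL_ACC_" + upper,
        "Q8_1_T",
        supports_subgroups ? "USE_SUBGROUP_REDUCTION" : "USE_WORKGROUP_REDUCTION",
    };
    return v;
}
```

Because the cache key in the diff contains only `src0_type`, the subgroup choice is effectively assumed to be constant for the lifetime of the device.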
webgpu_pipeline get_mul_mat_vec_pipeline(const ggml_webgpu_shader_lib_context & context) {
ggml_webgpu_mul_mat_vec_pipeline_key key = {};
key.src0_type = context.src0->type;
@@ -1752,6 +1819,8 @@ class ggml_webgpu_shader_lib {
(context.src0->type == GGML_TYPE_F32 || context.src0->type == GGML_TYPE_F16)) ?
1 :
0;
key.use_mmvq =
ggml_webgpu_can_use_mmvq(context.src0, context.src1, context.supports_dot_product, context.vendor);
auto it = mul_mat_vec_pipelines.find(key);
if (it != mul_mat_vec_pipelines.end()) {
@@ -1788,6 +1857,19 @@ class ggml_webgpu_shader_lib {
defines.push_back("U32_DEQUANT_HELPERS");
defines.push_back("SRC0_INNER_TYPE=u32");
switch (context.src0->type) {
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
if (key.use_mmvq) {
defines.push_back("LEGACY_QUANTS");
}
break;
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q4_K:
if (key.use_mmvq) {
defines.push_back("K_QUANTS");
}
break;
case GGML_TYPE_IQ1_S:
case GGML_TYPE_IQ1_M:
case GGML_TYPE_IQ2_S:
@@ -1840,6 +1922,11 @@ class ggml_webgpu_shader_lib {
outputs_per_wg = WEBGPU_MUL_MAT_VEC_LEGACY_Q_OUTPUTS_PER_WG;
}
if (key.use_mmvq) {
defines.push_back("MMVQ");
defines.push_back("Q8_1_T");
}
defines.push_back(std::string("WG_SIZE=") + std::to_string(wg_size));
defines.push_back(std::string("OUTPUTS_PER_WG=") + std::to_string(outputs_per_wg));
defines.push_back(context.supports_subgroups ? "USE_SUBGROUP_REDUCTION" : "USE_WORKGROUP_REDUCTION");
@@ -2018,100 +2105,6 @@ class ggml_webgpu_shader_lib {
return mul_mat_fast_pipelines[key];
}
webgpu_pipeline get_mul_mat_legacy_pipeline(const ggml_webgpu_shader_lib_context & context) {
ggml_webgpu_legacy_mul_mat_pipeline_key key = {};
key.src0_type = context.src0->type;
key.src1_type = context.src1->type;
auto it = mul_mat_legacy_pipelines.find(key);
if (it != mul_mat_legacy_pipelines.end()) {
return it->second;
}
std::vector<std::string> defines;
std::string variant = "mul_mat";
switch (context.src1->type) {
case GGML_TYPE_F32:
defines.push_back("SRC1_TYPE=f32");
variant += "_f32";
break;
case GGML_TYPE_F16:
defines.push_back("SRC1_TYPE=f16");
variant += "_f16";
break;
default:
GGML_ABORT("Unsupported src1 type for mul_mat legacy shader");
}
const struct ggml_type_traits * src0_traits = ggml_get_type_traits(context.src0->type);
const char * src0_name = src0_traits->type_name;
switch (context.src0->type) {
case GGML_TYPE_F32:
defines.push_back("SRC0_TYPE=f32");
defines.push_back("FLOAT");
variant += "_f32";
break;
case GGML_TYPE_F16:
defines.push_back("SRC0_TYPE=f16");
defines.push_back("FLOAT");
variant += "_f16";
break;
default:
{
std::string type_upper = src0_name;
std::transform(type_upper.begin(), type_upper.end(), type_upper.begin(), ::toupper);
switch (context.src0->type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q5_0:
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q6_K:
case GGML_TYPE_IQ2_XXS:
case GGML_TYPE_IQ2_XS:
case GGML_TYPE_IQ2_S:
case GGML_TYPE_IQ3_XXS:
case GGML_TYPE_IQ3_S:
case GGML_TYPE_IQ1_S:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_MXFP4:
{
// Quantized types using u32 buffers for portability.
defines.push_back("SRC0_TYPE=u32");
defines.push_back("U32_DEQUANT_HELPERS");
break;
}
default:
{
defines.push_back(std::string("SRC0_TYPE=") + src0_name);
}
}
defines.push_back("BYTE_HELPERS");
defines.push_back(type_upper + "_T");
defines.push_back(type_upper);
defines.push_back(type_upper + "_SCALE_MIN");
defines.push_back(type_upper + "_TABLES");
defines.push_back(type_upper + "_GRID");
variant += std::string("_") + src0_name;
break;
}
}
auto processed = preprocessor.preprocess(wgsl_mul_mat, defines);
auto decisions = std::make_shared<ggml_webgpu_generic_shader_decisions>();
decisions->wg_size = WEBGPU_MUL_MAT_WG_SIZE;
webgpu_pipeline pipeline = ggml_webgpu_create_pipeline(device, processed, variant);
pipeline.context = decisions;
mul_mat_legacy_pipelines[key] = pipeline;
return mul_mat_legacy_pipelines[key];
}
webgpu_pipeline get_mul_mat_id_gather_pipeline(const ggml_webgpu_shader_lib_context & context) {
auto it = mul_mat_id_gather_pipelines.find(1);
if (it != mul_mat_id_gather_pipelines.end()) {

@@ -181,6 +181,7 @@ struct webgpu_capabilities {
wgpu::Limits limits;
bool supports_subgroups = false;
bool supports_subgroup_matrix = false;
bool supports_dot_product = false;
uint32_t sg_mat_m = 0;
uint32_t sg_mat_n = 0;
@@ -210,6 +211,8 @@ struct webgpu_global_context_struct {
wgpu::Buffer memset_params_buf;
webgpu_pipeline memset_pipeline;
std::string vendor;
// TODO: We should rework the CPU profiling time handling to make it more useful. ref: https://github.com/ggml-org/llama.cpp/pull/22050
#ifdef GGML_WEBGPU_CPU_PROFILE
// Profiling: labeled CPU time in ms (total)
@@ -259,6 +262,7 @@ struct webgpu_context_struct {
wgpu::Buffer set_rows_host_error_buf;
wgpu::CommandEncoder active_command_encoder;
wgpu::ComputePassEncoder active_compute_pass;
bool batch_compute_passes = true;
size_t memset_bytes_per_thread;
@@ -590,9 +594,18 @@ static webgpu_encoded_op ggml_backend_webgpu_build_multi(webgpu_context &
}
#else
for (size_t i = 0; i < dispatches.size(); i++) {
ctx->active_compute_pass.SetPipeline(dispatches[i].pipeline.pipeline);
ctx->active_compute_pass.SetBindGroup(0, bind_groups[i]);
ctx->active_compute_pass.DispatchWorkgroups(dispatches[i].workgroups.first, dispatches[i].workgroups.second, 1);
if (ctx->batch_compute_passes) {
ctx->active_compute_pass.SetPipeline(dispatches[i].pipeline.pipeline);
ctx->active_compute_pass.SetBindGroup(0, bind_groups[i]);
ctx->active_compute_pass.DispatchWorkgroups(dispatches[i].workgroups.first, dispatches[i].workgroups.second,
1);
} else {
wgpu::ComputePassEncoder pass = ctx->active_command_encoder.BeginComputePass();
pass.SetPipeline(dispatches[i].pipeline.pipeline);
pass.SetBindGroup(0, bind_groups[i]);
pass.DispatchWorkgroups(dispatches[i].workgroups.first, dispatches[i].workgroups.second, 1);
pass.End();
}
}
#endif
@@ -736,8 +749,11 @@ static webgpu_encoded_op ggml_webgpu_cpy(webgpu_context & ctx, ggml_tensor * src
ggml_webgpu_make_tensor_bind_group_entry(ctx, 1, dst),
};
uint32_t wg_x = CEIL_DIV(ne, decisions->wg_size);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x);
uint32_t wg_x;
uint32_t wg_y;
uint32_t total_wg = CEIL_DIV(ne, decisions->wg_size);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
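This hunk routes the copy dispatch through `compute_2d_workgroups`, whose body does not appear in this diff. A minimal sketch of what such a helper might do, assuming it mirrors the inline `std::min` / `CEIL_DIV` logic removed elsewhere in this change (the name `split_workgroups_2d` is hypothetical, not the backend's function):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for compute_2d_workgroups; the real helper may differ.
// Splits a linear workgroup count into an (x, y) grid so that neither
// dimension exceeds the per-dimension device limit.
static void split_workgroups_2d(uint32_t total_wg, uint32_t max_per_dim, uint32_t & wg_x, uint32_t & wg_y) {
    assert(max_per_dim > 0);
    // Clamp x to the limit, but keep it at least 1 to avoid dividing by zero.
    wg_x = std::max<uint32_t>(1, std::min(total_wg, max_per_dim));
    // Ceiling division: enough rows to cover every requested workgroup.
    wg_y = (total_wg + wg_x - 1) / wg_x;
}
```

Because `wg_x * wg_y` can exceed the requested total, the shader side still needs a bounds check on the flattened index.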
static webgpu_encoded_op ggml_webgpu_set(webgpu_context & ctx,
@@ -961,9 +977,10 @@ static webgpu_encoded_op ggml_webgpu_conv_2d(webgpu_context & ctx,
auto * decisions = static_cast<ggml_webgpu_generic_shader_decisions *>(pipeline.context.get());
uint32_t wg_x;
uint32_t wg_y;
uint32_t total_wg = CEIL_DIV((uint32_t) ggml_nelements(dst), decisions->wg_size);
uint32_t wg_x = std::min(ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, total_wg);
uint32_t wg_y = CEIL_DIV(total_wg, wg_x);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
@@ -1051,9 +1068,10 @@ static webgpu_encoded_op ggml_webgpu_im2col(webgpu_context & ctx,
auto * decisions = static_cast<ggml_webgpu_generic_shader_decisions *>(pipeline.context.get());
uint32_t wg_x;
uint32_t wg_y;
uint32_t total_wg = CEIL_DIV((uint32_t) ggml_nelements(dst), decisions->wg_size);
uint32_t wg_x = std::min(ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, total_wg);
uint32_t wg_y = CEIL_DIV(total_wg, wg_x);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
@@ -1384,6 +1402,58 @@ static webgpu_encoded_op ggml_webgpu_get_rows(webgpu_context & ctx,
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x);
}
static void ggml_webgpu_quantize_q8_dispatch(webgpu_context & ctx,
ggml_tensor * src0,
ggml_tensor * src1,
ggml_tensor * dst,
std::vector<webgpu_dispatch_desc> & dispatches) {
ggml_webgpu_shader_lib_context shader_lib_ctx = {};
shader_lib_ctx.src0 = src0;
shader_lib_ctx.src1 = src1;
shader_lib_ctx.dst = dst;
shader_lib_ctx.max_wg_size = ctx->global_ctx->capabilities.limits.maxComputeInvocationsPerWorkgroup;
shader_lib_ctx.supports_subgroups = ctx->global_ctx->capabilities.supports_subgroups;
webgpu_pipeline qq8_pipeline = ctx->shader_lib->get_quantize_q8_pipeline(shader_lib_ctx);
// quantize_q8 pipeline
const size_t dst_offset = ggml_webgpu_tensor_offset(dst);
const size_t q8_src1_align_offset = ROUNDUP_POW2(
dst_offset + ggml_nbytes(dst), ctx->global_ctx->capabilities.limits.minStorageBufferOffsetAlignment);
const size_t q8_src1_binding_size =
ROUNDUP_POW2(src1->ne[3] * src1->ne[2] * (36 /* sizeof(q8_1) */ * (src1->ne[0] / /* block_size */ 32)),
WEBGPU_STORAGE_BUF_BINDING_MULT);
std::vector<uint32_t> q8_params = {
(uint32_t) (ggml_webgpu_tensor_misalignment(ctx, src1) / ggml_type_size(src1->type)),
(uint32_t) (src1->nb[2] / ggml_type_size(src1->type)),
(uint32_t) (src1->nb[3] / ggml_type_size(src1->type)),
(uint32_t) src1->ne[0],
(uint32_t) src1->ne[2],
(uint32_t) src1->ne[3],
};
std::vector<wgpu::BindGroupEntry> q8_entries = {
ggml_webgpu_make_tensor_bind_group_entry(ctx, 0, src1),
ggml_webgpu_make_bind_group_entry(1, ggml_webgpu_tensor_buf(dst), q8_src1_align_offset, q8_src1_binding_size)
};
auto q8_decisions = static_cast<ggml_webgpu_generic_shader_decisions *>(qq8_pipeline.context.get());
uint32_t q8_wg_size = q8_decisions->wg_size;
uint32_t q8_wg_x = 1;
uint32_t q8_wg_y = 1;
const uint32_t wg_per_vec = (src0->ne[0] / 4 + (q8_wg_size - 1)) / q8_wg_size;
const uint32_t q8_total_wg = src1->ne[2] * src1->ne[3] * wg_per_vec;
const uint32_t max_wg_per_dim = ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
compute_2d_workgroups(q8_total_wg, max_wg_per_dim, q8_wg_x, q8_wg_y);
dispatches.push_back({
qq8_pipeline, std::move(q8_params), std::move(q8_entries), { q8_wg_x, q8_wg_y }
});
}
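The function above sizes the quantized `src1` scratch region as 36 bytes per 32-element block and places it after the `dst` data at an aligned offset. The arithmetic can be sketched as follows, assuming `ROUNDUP_POW2(x, a)` rounds `x` up to a multiple of a power-of-two `a` (the names `round_up_pow2` and `q8_1_scratch_bytes` are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed semantics of ROUNDUP_POW2: round x up to a multiple of align,
// where align must be a power of two.
static size_t round_up_pow2(size_t x, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

// Bytes needed for the quantized src1: one 36-byte q8_1 block
// (two f16 values plus eight u32 words) per 32 elements of a row.
static size_t q8_1_scratch_bytes(int64_t ne0, int64_t ne2, int64_t ne3) {
    const size_t block_bytes = 36;
    const size_t block_elems = 32;
    return (size_t) ne3 * (size_t) ne2 * (block_bytes * ((size_t) ne0 / block_elems));
}
```

For a single row of 4096 elements this gives 128 blocks, or 4608 bytes, before rounding up to the binding-size multiple.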
static webgpu_encoded_op ggml_webgpu_mul_mat(webgpu_context & ctx,
ggml_tensor * src0,
ggml_tensor * src1,
@@ -1391,47 +1461,9 @@ static webgpu_encoded_op ggml_webgpu_mul_mat(webgpu_context & ctx,
// Determine if this is a mat-vec operation
bool is_vec = (dst->ne[1] == 1);
// Determine if we should use fast path
bool use_fast = false;
switch (src1->type) {
case GGML_TYPE_F16:
use_fast = (src0->type == GGML_TYPE_F16);
break;
case GGML_TYPE_F32:
// TODO: implement better mat-mat for k-quants, mat-vec for all k-quants except q6_K
switch (src0->type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
case GGML_TYPE_Q5_1:
case GGML_TYPE_Q8_0:
case GGML_TYPE_Q6_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q1_0:
case GGML_TYPE_IQ1_S:
case GGML_TYPE_IQ1_M:
case GGML_TYPE_IQ2_XXS:
case GGML_TYPE_IQ2_XS:
case GGML_TYPE_IQ2_S:
case GGML_TYPE_IQ3_XXS:
case GGML_TYPE_IQ3_S:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_IQ4_XS:
case GGML_TYPE_MXFP4:
use_fast = true;
break;
default:
break;
}
break;
default:
break;
}
// use MMVQ path for mat-vec
bool use_mmvq = ggml_webgpu_can_use_mmvq(src0, src1, ctx->global_ctx->capabilities.supports_dot_product,
ctx->global_ctx->vendor);
ggml_webgpu_shader_lib_context shader_lib_ctx = {};
@@ -1446,16 +1478,20 @@ static webgpu_encoded_op ggml_webgpu_mul_mat(webgpu_context & ctx,
shader_lib_ctx.sg_mat_k = ctx->global_ctx->capabilities.sg_mat_k;
shader_lib_ctx.min_subgroup_size = ctx->global_ctx->capabilities.min_subgroup_size;
shader_lib_ctx.max_subgroup_size = ctx->global_ctx->capabilities.max_subgroup_size;
shader_lib_ctx.supports_dot_product = ctx->global_ctx->capabilities.supports_dot_product;
shader_lib_ctx.vendor = ctx->global_ctx->vendor;
// Get or create pipeline
webgpu_pipeline pipeline;
webgpu_pipeline pipeline;
std::vector<webgpu_dispatch_desc> dispatches;
if (use_fast && is_vec) {
if (is_vec) {
if (use_mmvq) {
ggml_webgpu_quantize_q8_dispatch(ctx, src0, src1, dst, dispatches);
}
pipeline = ctx->shader_lib->get_mul_mat_vec_pipeline(shader_lib_ctx);
} else if (use_fast) {
pipeline = ctx->shader_lib->get_mul_mat_fast_pipeline(shader_lib_ctx);
} else {
pipeline = ctx->shader_lib->get_mul_mat_legacy_pipeline(shader_lib_ctx);
pipeline = ctx->shader_lib->get_mul_mat_fast_pipeline(shader_lib_ctx);
}
// Build params
@@ -1479,25 +1515,31 @@ static webgpu_encoded_op ggml_webgpu_mul_mat(webgpu_context & ctx,
};
// Build bind group entries
std::vector<wgpu::BindGroupEntry> entries = {
ggml_webgpu_make_tensor_bind_group_entry(ctx, 0, src0),
ggml_webgpu_make_tensor_bind_group_entry(ctx, 1, src1),
ggml_webgpu_make_tensor_bind_group_entry(ctx, 2, dst),
};
std::vector<wgpu::BindGroupEntry> entries = {};
entries.push_back(ggml_webgpu_make_tensor_bind_group_entry(ctx, 0, src0));
if (use_mmvq) {
auto & mmvq_qq8_entry = dispatches[0].bind_group_entries[1];
entries.push_back(ggml_webgpu_make_bind_group_entry(1, ggml_webgpu_tensor_buf(dst), mmvq_qq8_entry.offset,
mmvq_qq8_entry.size));
} else {
entries.push_back(ggml_webgpu_make_tensor_bind_group_entry(ctx, 1, src1));
}
entries.push_back(ggml_webgpu_make_tensor_bind_group_entry(ctx, 2, dst));
// Calculate workgroup dimensions
uint32_t wg_x = 1;
uint32_t wg_y = 1;
const uint32_t max_wg_per_dim = ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
if (use_fast && is_vec) {
if (is_vec) {
auto * decisions = static_cast<ggml_webgpu_mul_mat_vec_shader_decisions *>(pipeline.context.get());
uint32_t batches = dst->ne[2] * dst->ne[3];
uint32_t output_groups = CEIL_DIV(dst->ne[0], decisions->outputs_per_wg);
uint32_t total_wg = output_groups * batches;
compute_2d_workgroups(total_wg, max_wg_per_dim, wg_x, wg_y);
} else if (use_fast) {
} else {
auto * decisions = static_cast<ggml_webgpu_mul_mat_shader_decisions *>(pipeline.context.get());
// Fast-path tiled/subgroup calculations
@@ -1518,15 +1560,13 @@ static webgpu_encoded_op ggml_webgpu_mul_mat(webgpu_context & ctx,
}
uint32_t total_wg = wg_m * wg_n * dst->ne[2] * dst->ne[3];
compute_2d_workgroups(total_wg, max_wg_per_dim, wg_x, wg_y);
} else { // legacy
auto * decisions = static_cast<ggml_webgpu_generic_shader_decisions *>(pipeline.context.get());
uint32_t wg_size = decisions->wg_size;
uint32_t total_wg = CEIL_DIV(dst->ne[0] * dst->ne[1] * dst->ne[2] * dst->ne[3], wg_size);
compute_2d_workgroups(total_wg, max_wg_per_dim, wg_x, wg_y);
}
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
dispatches.push_back({
pipeline, std::move(params), std::move(entries), { wg_x, wg_y }
});
return ggml_backend_webgpu_build_multi(ctx, dispatches);
}
static webgpu_encoded_op ggml_webgpu_mul_mat_id_vec(webgpu_context & ctx,
@@ -1654,14 +1694,11 @@ static webgpu_encoded_op ggml_webgpu_mul_mat_id(webgpu_context & ctx,
gathered_count_ids_binding_size),
};
const uint32_t max_wg_per_dim = ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
const uint32_t gather_total_wg = param_n_expert;
const uint32_t gather_wg_x = std::min(gather_total_wg, max_wg_per_dim);
const uint32_t gather_wg_y = CEIL_DIV(gather_total_wg, gather_wg_x);
// n_expert is much less than maxComputeWorkgroupsPerDimension (e.g., n_expert=256 for Qwen3.5-35B-A3B)
const uint32_t gather_wg_x = param_n_expert;
dispatches.push_back({
gather_pipeline, std::move(gather_params), std::move(gather_entries), { gather_wg_x, gather_wg_y }
gather_pipeline, std::move(gather_params), std::move(gather_entries), { gather_wg_x, 1 }
});
// params for mul_mat_id.wgsl
@@ -1713,7 +1750,7 @@ static webgpu_encoded_op ggml_webgpu_mul_mat_id(webgpu_context & ctx,
uint32_t max_wg_n = CEIL_DIV(total_gathered, tile_n_s) + max_active_experts;
uint32_t total_wg = wg_m * max_wg_n;
compute_2d_workgroups(total_wg, max_wg_per_dim, wg_x, wg_y);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
dispatches.push_back({
main_pipeline, std::move(main_params), std::move(main_entries), { wg_x, wg_y }
@@ -1956,10 +1993,10 @@ static webgpu_encoded_op ggml_webgpu_flash_attn(webgpu_context & ctx,
std::vector<wgpu::BindGroupEntry> reduce_entries;
if (use_vec_reduce) {
const uint32_t reduce_sg_size = ctx->global_ctx->capabilities.max_subgroup_size;
const uint32_t reduce_wg_size =
std::max(reduce_sg_size, (uint32_t) std::min<uint64_t>(
(uint64_t) nwg * reduce_sg_size,
ctx->global_ctx->capabilities.limits.maxComputeInvocationsPerWorkgroup));
const uint32_t reduce_wg_size = std::max(
reduce_sg_size,
(uint32_t) std::min<uint64_t>((uint64_t) nwg * reduce_sg_size,
ctx->global_ctx->capabilities.limits.maxComputeInvocationsPerWorkgroup));
ggml_webgpu_shader_lib_context reduce_shader_ctx = shader_lib_ctx;
reduce_shader_ctx.max_wg_size = reduce_wg_size;
reduce_pipeline = ctx->shader_lib->get_flash_attn_vec_reduce_pipeline(reduce_shader_ctx);
@@ -2736,10 +2773,12 @@ static webgpu_encoded_op ggml_webgpu_argsort(webgpu_context & ctx, ggml_tensor *
block_size, npr, nrows
};
const uint32_t total_wg_init = npr * nrows;
const uint32_t max_wg = ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
const uint32_t wg_x_init = std::min(total_wg_init, max_wg);
const uint32_t wg_y_init = CEIL_DIV(total_wg_init, wg_x_init);
uint32_t wg_x_init;
uint32_t wg_y_init;
const uint32_t total_wg_init = npr * nrows;
const uint32_t max_wg_per_dim = ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension;
compute_2d_workgroups(total_wg_init, max_wg_per_dim, wg_x_init, wg_y_init);
std::vector<wgpu::BindGroupEntry> init_entries = {
ggml_webgpu_make_tensor_bind_group_entry(ctx, 0, src),
ggml_webgpu_make_bind_group_entry(1, ggml_webgpu_tensor_buf(dst), init_align_offset, init_binding_size)
@@ -2796,9 +2835,11 @@ static webgpu_encoded_op ggml_webgpu_argsort(webgpu_context & ctx, ggml_tensor *
ggml_webgpu_make_bind_group_entry(2, ggml_webgpu_tensor_buf(dst), align_out, size_out)
};
uint32_t wg_x_merge;
uint32_t wg_y_merge;
const uint32_t total_wg_merge = nm * nrows;
const uint32_t wg_x_merge = std::min(total_wg_merge, max_wg);
const uint32_t wg_y_merge = CEIL_DIV(total_wg_merge, wg_x_merge);
compute_2d_workgroups(total_wg_merge, max_wg_per_dim, wg_x_merge, wg_y_merge);
dispatches.push_back({
argsort_merge_pipeline, std::move(merge_params), std::move(merge_entries), { wg_x_merge, wg_y_merge }
});
@@ -2918,9 +2959,12 @@ static webgpu_encoded_op ggml_webgpu_upscale(webgpu_context ctx, ggml_tensor * s
webgpu_pipeline pipeline = ctx->shader_lib->get_upscale_pipeline(shader_lib_ctx);
auto * decisions = static_cast<ggml_webgpu_generic_shader_decisions *>(pipeline.context.get());
uint32_t total_wg = CEIL_DIV((uint32_t) ggml_nelements(dst), decisions->wg_size);
uint32_t wg_x = std::min(ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, total_wg);
uint32_t wg_y = CEIL_DIV(total_wg, wg_x);
uint32_t wg_x;
uint32_t wg_y;
uint32_t total_wg = CEIL_DIV((uint32_t) ggml_nelements(dst), decisions->wg_size);
compute_2d_workgroups(total_wg, ctx->global_ctx->capabilities.limits.maxComputeWorkgroupsPerDimension, wg_x, wg_y);
return ggml_backend_webgpu_build(ctx, pipeline, params, entries, wg_x, wg_y);
}
@@ -3110,18 +3154,16 @@ static ggml_status ggml_backend_webgpu_graph_compute(ggml_backend_t backend, str
uint32_t num_batched_kernels = 0;
uint32_t num_inflight_batches = 0;
bool contains_set_rows = false;
bool batch_compute_passes = true;
int num_encoded_ops = 1;
int node_idx = 0;
#ifdef GGML_WEBGPU_GPU_PROFILE
ctx->profile_timestamp_query_count = 0;
batch_compute_passes = false;
std::vector<std::string> profile_pipeline_names;
#endif
ctx->active_command_encoder = ctx->global_ctx->device.CreateCommandEncoder();
if (batch_compute_passes) {
if (ctx->batch_compute_passes) {
ctx->active_compute_pass = ctx->active_command_encoder.BeginComputePass();
}
@@ -3148,7 +3190,7 @@ static ggml_status ggml_backend_webgpu_graph_compute(ggml_backend_t backend, str
// reset state for next batch
ctx->active_command_encoder = ctx->global_ctx->device.CreateCommandEncoder();
if (batch_compute_passes) {
if (ctx->batch_compute_passes) {
ctx->active_compute_pass = ctx->active_command_encoder.BeginComputePass();
}
ctx->param_arena.reset();
@@ -3548,8 +3590,8 @@ static size_t ggml_backend_webgpu_buffer_type_get_alloc_size(ggml_backend_buffer
const uint32_t kv_tile = decisions.kv_tile;
const uint32_t vec_nwg_cap = ctx->webgpu_global_ctx->capabilities.min_subgroup_size;
uint32_t nwg = 1u;
const uint64_t kv_span = (uint64_t) std::max(1u, kv_tile);
uint32_t nwg = 1u;
const uint64_t kv_span = (uint64_t) std::max(1u, kv_tile);
while ((2u * nwg * kv_span) < (uint64_t) K->ne[1] && nwg < vec_nwg_cap) {
nwg <<= 1;
}
@@ -3582,6 +3624,22 @@ static size_t ggml_backend_webgpu_buffer_type_get_alloc_size(ggml_backend_buffer
}
}
break;
case GGML_OP_MUL_MAT:
{
const ggml_tensor * src0 = tensor->src[0];
const ggml_tensor * src1 = tensor->src[1];
bool use_mmvq =
ggml_webgpu_can_use_mmvq(src0, src1, ctx->webgpu_global_ctx->capabilities.supports_dot_product,
ctx->webgpu_global_ctx->vendor);
if (use_mmvq) {
const size_t q8_src1_size =
src1->ne[3] * src1->ne[2] * (36 /* sizeof(q8_1) */ * (src1->ne[0] / /* block_size */ 32));
res = ROUNDUP_POW2(res + q8_src1_size +
ctx->webgpu_global_ctx->capabilities.limits.minStorageBufferOffsetAlignment,
WEBGPU_STORAGE_BUF_BINDING_MULT);
}
}
break;
case GGML_OP_MUL_MAT_ID:
{
const ggml_tensor * src0 = tensor->src[0];
@@ -3707,12 +3765,16 @@ static bool create_webgpu_device(ggml_backend_webgpu_reg_context * ctx) {
ctx->webgpu_global_ctx->adapter.GetInfo(&info);
ctx->webgpu_global_ctx->command_submit_batch_size = ggml_backend_webgpu_get_command_submit_batch_size();
ctx->webgpu_global_ctx->max_inflight_batches = ggml_backend_webgpu_get_max_inflight_batches();
ctx->webgpu_global_ctx->vendor = info.vendor;
wgpu::SupportedFeatures features;
ctx->webgpu_global_ctx->adapter.GetFeatures(&features);
// we require f16 support
GGML_ASSERT(ctx->webgpu_global_ctx->adapter.HasFeature(wgpu::FeatureName::ShaderF16));
ctx->webgpu_global_ctx->capabilities.supports_subgroups =
ctx->webgpu_global_ctx->adapter.HasFeature(wgpu::FeatureName::Subgroups);
// for dot4I8packed
ctx->webgpu_global_ctx->capabilities.supports_dot_product = ctx->webgpu_global_ctx->instance.HasWGSLLanguageFeature(
wgpu::WGSLLanguageFeatureName::Packed4x8IntegerDotProduct);
bool valid_subgroup_matrix_config = false;
#ifndef __EMSCRIPTEN__
@@ -3839,6 +3901,7 @@ static webgpu_context initialize_webgpu_context(ggml_backend_dev_t dev) {
wgpu::BufferUsage::CopyDst | wgpu::BufferUsage::MapRead, "set_rows_host_error_buf");
#ifdef GGML_WEBGPU_GPU_PROFILE
webgpu_ctx->batch_compute_passes = false;
ggml_webgpu_create_buffer(
webgpu_ctx->global_ctx->device, webgpu_ctx->profile_timestamp_dev_buf, WEBGPU_TIMESTAMP_QUERY_BUF_SIZE_BYTES,
wgpu::BufferUsage::QueryResolve | wgpu::BufferUsage::CopySrc, "profile_timestamp_dev_buf");

@@ -95,11 +95,10 @@ struct q5_1 {
};
#endif
#ifdef Q8_1_T
struct q8_1 {
d: f16,
m: f16,
s: f16, // d * sum(qs[i])
qs: array<u32, 8>
};
#endif
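The `s` field is documented as `d * sum(qs[i])`. One plausible reason to store it, as in ggml's CPU Q4_1 x Q8_1 dot products, is that the weight offset `m` then contributes `m * s` without any per-element work. The sketch below illustrates that identity for one 32-element block; it is not the shader from this change, and the function name and unpacked array layout are assumptions made for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Weights are w[i] = d_w * q[i] + m_w (Q4_1 style, q[i] in 0..15) and
// activations are x[i] = d_x * p[i] (Q8_1 style), with s_x = d_x * sum(p[i]).
// Then sum(w[i] * x[i]) = d_w * d_x * sum(q[i] * p[i]) + m_w * s_x.
static float q4_1_q8_1_block_dot(float d_w, float m_w, const std::array<uint8_t, 32> & q,
                                 float d_x, float s_x, const std::array<int8_t, 32> & p) {
    int32_t isum = 0;
    for (size_t i = 0; i < q.size(); i++) {
        isum += int32_t(q[i]) * int32_t(p[i]);  // pure integer multiply-accumulate
    }
    return d_w * d_x * float(isum) + m_w * s_x;
}
```

The inner loop is integer-only, which is the shape that a packed 4x8-bit dot product instruction can accelerate.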

@@ -49,12 +49,14 @@ struct Params{
var<uniform> params: Params;
@compute @workgroup_size(WG_SIZE)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
if (gid.x >= params.ne) {
fn main(
@builtin(global_invocation_index) gindex: u32,
) {
if (gindex >= params.ne) {
return;
}
var i = gid.x;
var i = gindex;
let i3 = i / (params.src_ne2 * params.src_ne1 * params.src_ne0);
i = i % (params.src_ne2 * params.src_ne1 * params.src_ne0);
let i2 = i / (params.src_ne1 * params.src_ne0);
@@ -62,7 +64,7 @@ fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
let i1 = i / params.src_ne0;
let i0 = i % params.src_ne0;
var j = gid.x;
var j = gindex;
let j3 = j / (params.dst_ne2 * params.dst_ne1 * params.dst_ne0);
j = j % (params.dst_ne2 * params.dst_ne1 * params.dst_ne0);
let j2 = j / (params.dst_ne1 * params.dst_ne0);
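The shader now derives tensor coordinates from a flat invocation index instead of `gid.x`, which is what lets the host dispatch a 2D grid. The index decomposition it performs can be sketched in C++ (the helper name `unflatten_index` is hypothetical):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: split a flat element index i into (i0, i1, i2, i3)
// for a tensor with extents ne0, ne1, ne2 in the three innermost dimensions.
static std::array<uint32_t, 4> unflatten_index(uint32_t i, uint32_t ne0, uint32_t ne1, uint32_t ne2) {
    assert(ne0 > 0 && ne1 > 0 && ne2 > 0);
    const uint32_t i3 = i / (ne2 * ne1 * ne0);
    i %= ne2 * ne1 * ne0;
    const uint32_t i2 = i / (ne1 * ne0);
    i %= ne1 * ne0;
    const uint32_t i1 = i / ne0;
    const uint32_t i0 = i % ne0;
    return { i0, i1, i2, i3 };
}
```

The shader applies the same steps twice, once with the source extents and once with the destination extents, so that differently shaped tensors with equal element counts map element-for-element.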

@@ -1,747 +0,0 @@
enable f16;
#define DECLARE_BYTE_LOADERS_SRC0
#include "common_decls.tmpl"
#ifdef FLOAT
const BLOCK_SIZE = 1u;
#elif defined(Q4_0) || defined(Q4_1) || defined(Q5_0) || defined(Q5_1) || defined(Q8_0) || defined(Q8_1) || defined(IQ4_NL)
const BLOCK_SIZE = 32u;
#elif defined(Q2_K) || defined(Q3_K) || defined(Q4_K) || defined(Q5_K) || defined(Q6_K) || defined(IQ2_XXS) || defined(IQ2_XS) || defined(IQ2_S) || defined(IQ3_XXS) || defined(IQ3_S) || defined(IQ1_S) || defined(IQ1_M) || defined(IQ4_XS)
const BLOCK_SIZE = 256u;
#endif
#ifdef FLOAT
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
return f32(src0[src0_idx_base + offset]) * f32(src1[src1_idx_base + offset]);
}
#endif
#ifdef Q4_0
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 18; // Block stride: 18 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var sum: f32 = 0.0;
for (var j: u32 = 0; j < 4; j++) {
let q_byte_offset = block_byte_base + 2 + j * 4;
let q_packed = load_u32_at_src0(q_byte_offset);
for (var k: u32 = 0; k < 4; k++) {
let q_byte = get_byte(q_packed, k);
let q_hi = (f32((q_byte >> 4) & 0xF) - 8.0f) * d;
let q_lo = (f32(q_byte & 0xF) - 8.0f) * d;
let src1_offset = src1_idx_base + offset * 32 + j * 4 + k;
sum += q_lo * f32(src1[src1_offset]);
sum += q_hi * f32(src1[src1_offset + 16]);
}
}
return sum;
}
#endif
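The removed Q4_0 path above pairs the low nibble of packed byte `b` with `src1` element `b` and the high nibble with element `b + 16`, subtracting 8 and scaling by `d` in both cases. A C++ sketch of the same per-block arithmetic, assuming the f16 scale has already been converted to `float` (the function name is hypothetical):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Dot product of one Q4_0 block (32 weights) with 32 float activations.
// d is the block scale; qs holds the 16 packed nibble bytes that follow it.
static float q4_0_block_dot(float d, const std::array<uint8_t, 16> & qs, const std::array<float, 32> & x) {
    float sum = 0.0f;
    for (size_t b = 0; b < qs.size(); b++) {
        const float q_lo = (float(qs[b] & 0x0F) - 8.0f) * d;         // weight for element b
        const float q_hi = (float((qs[b] >> 4) & 0x0F) - 8.0f) * d;  // weight for element b + 16
        sum += q_lo * x[b];
        sum += q_hi * x[b + 16];
    }
    return sum;
}
```

Unlike the shader, this version indexes bytes directly rather than unpacking them from `u32` loads.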
#ifdef Q4_1
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_q4_1 = src0[src0_idx_base + offset];
let d = f32(block_q4_1.d);
let m = f32(block_q4_1.m);
var sum: f32 = 0.0;
for (var j: u32 = 0; j < 4; j++) {
let q_packed = block_q4_1.qs[j];
for (var k: u32 = 0; k < 4; k++) {
let q_byte = get_byte(q_packed, k);
let q_hi = f32((q_byte >> 4) & 0xF) * d + m;
let q_lo = f32(q_byte & 0xF) * d + m;
let src1_offset = src1_idx_base + offset * 32 + j * 4 + k;
sum += q_lo * f32(src1[src1_offset]);
sum += q_hi * f32(src1[src1_offset + 16]);
}
}
return sum;
}
#endif
#ifdef Q5_0
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 22; // Block stride: 22 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var sum: f32 = 0.0;
let qh_packed = load_u32_at_src0(block_byte_base + 2);
for (var j: u32 = 0; j < 4; j++) {
let q_byte_offset = block_byte_base + 6 + j * 4;
let q_packed = load_u32_at_src0(q_byte_offset);
for (var k: u32 = 0; k < 4; k++) {
let q_byte = get_byte(q_packed, k);
let qh_hi = (qh_packed >> (j * 4 + k + 12)) & 0x10;
let q_hi = (f32(((q_byte >> 4) & 0xF) | qh_hi) - 16.0) * d;
let qh_lo = ((qh_packed >> (j * 4 + k)) << 4) & 0x10;
let q_lo = (f32((q_byte & 0xF) | qh_lo) - 16.0) * d;
let src1_offset = src1_idx_base + offset * 32 + j * 4 + k;
sum += q_lo * f32(src1[src1_offset]);
sum += q_hi * f32(src1[src1_offset + 16]);
}
}
return sum;
}
#endif
#ifdef Q5_1
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_q5_1 = src0[src0_idx_base + offset];
let d = f32(block_q5_1.d);
let m = f32(block_q5_1.m);
var sum: f32 = 0.0;
for (var j: u32 = 0; j < 4; j++) {
let q_packed = block_q5_1.qs[j];
for (var k: u32 = 0; k < 4; k++) {
let q_byte = get_byte(q_packed, k);
let qh_hi = (block_q5_1.qh >> (j * 4 + k + 12)) & 0x10;
let q_hi = f32(((q_byte >> 4) & 0xF) | qh_hi) * d + m;
let qh_lo = ((block_q5_1.qh >> (j * 4 + k)) << 4) & 0x10;
let q_lo = f32((q_byte & 0xF) | qh_lo) * d + m;
let src1_offset = src1_idx_base + offset * 32 + j * 4 + k;
sum += q_lo * f32(src1[src1_offset]);
sum += q_hi * f32(src1[src1_offset + 16]);
}
}
return sum;
}
#endif
#ifdef Q8_0
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 34; // Block stride: 34 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var sum: f32 = 0.0;
for (var j: u32 = 0; j < 8; j++) {
let q_byte_offset = block_byte_base + 2 + j * 4;
let q_packed = load_u32_at_src0(q_byte_offset);
for (var k: u32 = 0u; k < 4u; k++) {
let q_byte = get_byte_i32(q_packed, k);
let q_val = f32(q_byte) * d;
let src1_offset = src1_idx_base + offset * 32 + j * 4 + k;
sum += q_val * f32(src1[src1_offset]);
}
}
return sum;
}
#endif
#ifdef Q8_1
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_q8_1 = src0[src0_idx_base + offset];
let d = f32(block_q8_1.d);
let m = f32(block_q8_1.m);
var sum: f32 = 0.0;
for (var j: u32 = 0; j < 8; j++) {
let q_packed = block_q8_1.qs[j];
for (var k: u32 = 0; k < 4; k++) {
let q_byte = get_byte_i32(q_packed, k);
let q_val = f32(q_byte) * d + m;
let src1_offset = src1_idx_base + offset * 32 + j * 4 + k;
sum += q_val * f32(src1[src1_offset]);
}
}
return sum;
}
#endif
#ifdef Q2_K
// 16 blocks of 16 elements each
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block = src0[src0_idx_base + offset];
let d = f32(block.d);
let m = f32(block.dmin);
var sum = 0.0;
var src1_i = src1_idx_base + offset * 256;
var is: u32 = 0;
// 2 halves of the block (128 elements each)
for (var q_b_idx: u32 = 0; q_b_idx < 64; q_b_idx += 32) {
// 4 groups (each group has 2 blocks of 16 elements)
for (var shift: u32 = 0; shift < 8; shift += 2) {
// 2 blocks
for (var k: u32 = 0; k < 32; k += 16) {
let sc = get_byte(block.scales[is / 4], is % 4);
is++;
let dl = d * f32(sc & 0xF);
let ml = m * f32(sc >> 4);
for (var l: u32 = 0u; l < 16; l++) {
let q_idx = q_b_idx + k + l;
let q_byte = get_byte(block.qs[q_idx / 4], q_idx % 4);
let qs_val = (q_byte >> shift) & 3;
sum += (f32(qs_val) * dl - ml) * src1[src1_i];
src1_i++;
}
}
}
}
return sum;
}
#endif
#ifdef Q3_K
// 16 blocks of 16 elements each
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 110; // Block stride: 110 bytes
// Bytes 108-109: f16 scale 'd'
let d = load_f16_as_f32_at_src0(block_byte_base + 108);
// extract 6-bit scales, which consist of 4-bits from first 8 bytes of scale,
// and 2-bits from the last 4 bytes
// Bytes 96-107: 12 bytes of scales (3 u32s)
let kmask1: u32 = 0x03030303;
let kmask2: u32 = 0x0f0f0f0f;
var scale_vals: array<u32, 4>;
scale_vals[0] = load_u32_at_src0(block_byte_base + 96);
scale_vals[1] = load_u32_at_src0(block_byte_base + 100);
scale_vals[2] = load_u32_at_src0(block_byte_base + 104);
var tmp: u32 = scale_vals[2];
scale_vals[2] = ((scale_vals[0] >> 4) & kmask2) | (((tmp >> 4) & kmask1) << 4);
scale_vals[3] = ((scale_vals[1] >> 4) & kmask2) | (((tmp >> 6) & kmask1) << 4);
scale_vals[0] = (scale_vals[0] & kmask2) | ((tmp & kmask1) << 4);
scale_vals[1] = (scale_vals[1] & kmask2) | (((tmp >> 2) & kmask1) << 4);
// Bytes 0-31: 32 bytes of hmask (8 u32s)
var hmask_vals: array<u32, 8>;
for (var i: u32 = 0; i < 8; i++) {
hmask_vals[i] = load_u32_at_src0(block_byte_base + i * 4);
}
// Bytes 32-95: 64 bytes of qs (16 u32s)
var qs_vals: array<u32, 16>;
for (var i: u32 = 0u; i < 16; i++) {
qs_vals[i] = load_u32_at_src0(block_byte_base + 32 + i * 4);
}
var sum = 0.0;
var src1_i = src1_idx_base + offset * 256;
var is: u32 = 0;
var m: u32 = 1;
// 2 halves of the block (128 elements each)
for (var q_b_idx: u32 = 0; q_b_idx < 64; q_b_idx += 32) {
// 4 groups (each group has 2 blocks of 16 elements)
for (var shift: u32 = 0; shift < 8; shift += 2) {
// 2 blocks
for (var k: u32 = 0; k < 32; k += 16) {
let sc = get_byte(scale_vals[is / 4], is % 4);
is++;
let dl = d * (f32(sc) - 32.0);
for (var l: u32 = 0u; l < 16u; l++) {
let q_idx = q_b_idx + k + l;
let hm_idx = k + l;
let q_byte = get_byte(qs_vals[q_idx / 4], q_idx % 4);
let hmask_byte = get_byte(hmask_vals[hm_idx / 4], hm_idx % 4);
let hm = select(4.0, 0.0, (hmask_byte & m) != 0);
let qs_val = (q_byte >> shift) & 3;
sum += ((f32(qs_val) - hm) * dl) * src1[src1_i];
src1_i++;
}
}
m <<= 1;
}
}
return sum;
}
#endif
#ifdef Q4_K
// 8 blocks of 32 elements each
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block = src0[src0_idx_base + offset];
let d = f32(block.d);
let m = f32(block.dmin);
var sum = 0.0;
var src1_i = src1_idx_base + offset * 256;
var is: u32 = 0;
// 2 blocks each iteration
for (var q_b_idx: u32 = 0; q_b_idx < 128; q_b_idx += 32) {
for (var shift: u32 = 0; shift < 8; shift += 4) {
let scale_min = get_scale_min(is, block.scales);
is++;
let dl = d * scale_min.x;
let ml = m * scale_min.y;
for (var l: u32 = 0; l < 32; l++) {
let q_idx = q_b_idx + l;
let q_byte = get_byte(block.qs[q_idx / 4], q_idx % 4);
let qs_val = (q_byte >> shift) & 0xF;
sum += (f32(qs_val) * dl - ml) * src1[src1_i];
src1_i++;
}
}
}
return sum;
}
#endif
#ifdef Q5_K
// 8 blocks of 32 elements each
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block = src0[src0_idx_base + offset];
let d = f32(block.d);
let m = f32(block.dmin);
var sum = 0.0;
var src1_i = src1_idx_base + offset * 256;
var is: u32 = 0;
var u: u32 = 1;
// 2 blocks each iteration
for (var q_b_idx: u32 = 0; q_b_idx < 128; q_b_idx += 32) {
for (var shift: u32 = 0; shift < 8; shift += 4) {
let scale_min = get_scale_min(is, block.scales);
is++;
let dl = d * scale_min.x;
let ml = m * scale_min.y;
for (var l: u32 = 0; l < 32; l++) {
let q_idx = q_b_idx + l;
let q_byte = get_byte(block.qs[q_idx / 4], q_idx % 4);
let qh_byte = get_byte(block.qh[l / 4], l % 4);
let qs_val = (q_byte >> shift) & 0xF;
let qh_val = select(0.0, 16.0, (qh_byte & u) != 0);
sum += ((f32(qs_val) + qh_val) * dl - ml) * src1[src1_i];
src1_i++;
}
u <<= 1;
}
}
return sum;
}
#endif
#ifdef Q6_K
// 16 blocks of 16 elements each
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 210; // Block stride: 210 bytes
// Bytes 208-209: f16 scale 'd'
let d = load_f16_as_f32_at_src0(block_byte_base + 208);
// Bytes 0-127: 128 bytes of ql (32 u32s)
var ql_vals: array<u32, 32>;
for (var i: u32 = 0; i < 32; i++) {
ql_vals[i] = load_u32_at_src0(block_byte_base + i * 4);
}
// Bytes 128-191: 64 bytes of qh (16 u32s)
var qh_vals: array<u32, 16>;
for (var i: u32 = 0; i < 16; i++) {
qh_vals[i] = load_u32_at_src0(block_byte_base + 128 + i * 4);
}
// Bytes 192-207: 16 bytes of scales (4 u32s)
var scale_vals: array<u32, 4>;
for (var i: u32 = 0; i < 4; i++) {
scale_vals[i] = load_u32_at_src0(block_byte_base + 192 + i * 4);
}
var sum = 0.0;
var src1_i = src1_idx_base + offset * 256;
var qh_b_idx: u32 = 0;
var sc_b_idx: u32 = 0;
for (var ql_b_idx: u32 = 0; ql_b_idx < 128; ql_b_idx += 64) {
for (var l: u32 = 0; l < 32; l++) {
let ql13_b = get_byte(ql_vals[(ql_b_idx + l) / 4], (ql_b_idx + l) % 4);
let ql24_b = get_byte(ql_vals[(ql_b_idx + l + 32) / 4], (ql_b_idx + l + 32) % 4);
let qh_b = get_byte(qh_vals[(qh_b_idx + l) / 4], (qh_b_idx + l) % 4);
let q1 = f32((ql13_b & 0xF) | ((qh_b & 3) << 4)) - 32.0;
let q2 = f32((ql24_b & 0xF) | (((qh_b >> 2) & 3) << 4)) - 32.0;
let q3 = f32((ql13_b >> 4) | (((qh_b >> 4) & 3) << 4)) - 32.0;
let q4 = f32((ql24_b >> 4) | (((qh_b >> 6) & 3) << 4)) - 32.0;
let is = l/16;
let is1 = sc_b_idx + is;
let sc1 = get_byte_i32(scale_vals[is1 / 4], is1 % 4);
let is2 = sc_b_idx + is + 2;
let sc2 = get_byte_i32(scale_vals[is2 / 4], is2 % 4);
let is3 = sc_b_idx + is + 4;
let sc3 = get_byte_i32(scale_vals[is3 / 4], is3 % 4);
let is4 = sc_b_idx + is + 6;
let sc4 = get_byte_i32(scale_vals[is4 / 4], is4 % 4);
sum += d * f32(sc1) * q1 * src1[src1_i + l];
sum += d * f32(sc2) * q2 * src1[src1_i + l + 32];
sum += d * f32(sc3) * q3 * src1[src1_i + l + 64];
sum += d * f32(sc4) * q4 * src1[src1_i + l + 96];
}
src1_i += 128;
qh_b_idx += 32;
sc_b_idx += 8;
}
return sum;
}
#endif
#ifdef IQ2_XXS
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 66; // Block stride: 66 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var src1_i = src1_idx_base + offset * 256;
var sum = 0.0;
for (var ib: u32 = 0; ib < 32; ib += 4) {
let aux0_offset = block_byte_base + 2 + ib * 2;
let aux1_offset = block_byte_base + 2 + (ib + 2) * 2;
let aux0 = load_u32_at_src0(aux0_offset);
let aux1 = load_u32_at_src0(aux1_offset);
let db = d * (0.5 + f32(aux1 >> 28)) * 0.25;
for (var l: u32 = 0; l < 4; l++) {
let ig = get_byte(aux0, l) * 8;
let is = (aux1 >> (7 * l)) & 127;
let signs = get_byte(ksigns_iq2xs[is / 4], is % 4);
for (var j: u32 = 0; j < 8; j++) {
let g = get_byte(iq2xxs_grid[(ig + j) / 4], (ig + j) % 4);
let m = select(1.0, -1.0, (get_byte(kmask_iq2xs[j / 4], j % 4) & signs) != 0);
sum += db * f32(g) * m * src1[src1_i];
src1_i++;
}
}
}
return sum;
}
#endif
#ifdef IQ2_XS
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 74; // Block stride: 74 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var src1_i = src1_idx_base + offset * 256;
var scale_vals = array<u32, 2>(
load_u32_at_src0(block_byte_base + 66),
load_u32_at_src0(block_byte_base + 70)
);
var sum = 0.0;
for (var ib: u32 = 0; ib < 32; ib += 4) {
let s = get_byte(scale_vals[ib / 16], (ib % 16) / 4);
let db = array<f32, 2>(
d * (0.5 + f32(s & 0xF)) * 0.25,
d * (0.5 + f32(s >> 4)) * 0.25
);
for (var l: u32 = 0; l < 4; l++) {
let qs_offset = block_byte_base + 2 + (ib + l) * 2;
let qs_val = load_u32_at_src0(qs_offset) & 0xFFFF;
let ig = (qs_val & 511) * 8;
let is = qs_val >> 9;
let signs = get_byte(ksigns_iq2xs[is / 4], is % 4);
let dl = db[l/2];
for (var j: u32 = 0; j < 8; j++) {
let g = get_byte(iq2xs_grid[(ig + j) / 4], (ig + j) % 4);
let m = select(1.0, -1.0, (get_byte(kmask_iq2xs[j / 4], j % 4) & signs) != 0);
sum += dl * f32(g) * m * src1[src1_i];
src1_i++;
}
}
}
return sum;
}
#endif
#ifdef IQ2_S
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 82; // Block stride: 82 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var src1_i = src1_idx_base + offset * 256;
var qs_vals : array<u32, 16>;
for (var i: u32 = 0; i < 16; i++) {
qs_vals[i] = load_u32_at_src0(block_byte_base + 2 + i * 4);
}
var qh_vals: array<u32, 2>;
qh_vals[0] = load_u32_at_src0(block_byte_base + 66);
qh_vals[1] = load_u32_at_src0(block_byte_base + 70);
var scale_vals: array<u32, 2>;
scale_vals[0] = load_u32_at_src0(block_byte_base + 74);
scale_vals[1] = load_u32_at_src0(block_byte_base + 78);
var sum = 0.0;
for (var ib: u32 = 0; ib < 8; ib ++) {
let s = get_byte(scale_vals[ib / 4], ib % 4);
let db = array<f32, 2>(
d * (0.5 + f32(s & 0xF)) * 0.25,
d * (0.5 + f32(s >> 4)) * 0.25
);
let qs_w = qs_vals[ib];
for (var l: u32 = 0; l < 4; l++) {
let qh_b = (get_byte(qh_vals[ib / 4], ib % 4) << (8 - 2 * l)) & 0x300;
let ig = (get_byte(qs_w, l) | qh_b) * 8;
let signs = get_byte(qs_vals[ib + 8], l);
let dl = db[l/2];
for (var j: u32 = 0; j < 8; j++) {
let g = get_byte(iq2s_grid[(ig + j) / 4], (ig + j) % 4);
let m = select(1.0, -1.0, (get_byte(kmask_iq2xs[j / 4], j % 4) & signs) != 0);
sum += dl * f32(g) * m * src1[src1_i];
src1_i++;
}
}
}
return sum;
}
#endif
#ifdef IQ3_XXS
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 98; // Block stride: 98 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var src1_i = src1_idx_base + offset * 256;
var sum = 0.0;
for (var ib: u32 = 0; ib < 16; ib += 2) {
let sc_sign_offset = block_byte_base + 2 + (ib + 32) * 2;
let sc_sign = load_u32_at_src0(sc_sign_offset);
let db = d * (0.5 + f32(sc_sign >> 28)) * 0.5;
for (var l: u32 = 0; l < 4; l++) {
let is = (sc_sign >> (7 * l)) & 127;
let signs = get_byte(ksigns_iq2xs[is / 4], is % 4);
let ig_val = load_u32_at_src0(block_byte_base + 2 + (ib * 2 + l) * 2) & 0xFFFF;
let ig1 = get_byte(ig_val, 0);
let ig2 = get_byte(ig_val, 1);
for (var j: u32 = 0; j < 4; j++) {
let g1 = get_byte(iq3xxs_grid[ig1], j);
let g2 = get_byte(iq3xxs_grid[ig2], j);
let m1 = select(1.0, -1.0, (get_byte(kmask_iq2xs[0], j) & signs) != 0);
let m2 = select(1.0, -1.0, (get_byte(kmask_iq2xs[1], j) & signs) != 0);
sum += db * f32(g1) * m1 * src1[src1_i];
sum += db * f32(g2) * m2 * src1[src1_i + 4];
src1_i++;
}
src1_i += 4;
}
}
return sum;
}
#endif
#ifdef IQ3_S
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 110; // Block stride: 110 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var src1_i = src1_idx_base + offset * 256;
var qh_vals = array<u32, 2>(
load_u32_at_src0(block_byte_base + 66),
load_u32_at_src0(block_byte_base + 70)
);
var sign_vals: array<u32, 8>;
for (var i: u32 = 0; i < 8; i++) {
sign_vals[i] = load_u32_at_src0(block_byte_base + 74 + i * 4);
}
var scale_vals = load_u32_at_src0(block_byte_base + 106);
var sum = 0.0;
for (var ib: u32 = 0; ib < 4; ib++) {
let s = get_byte(scale_vals, ib);
let db = array<f32, 2>(
d * (1.0 + 2.0 * f32(s & 0xF)),
d * (1.0 + 2.0 * f32(s >> 4))
);
for (var k: u32 = 0; k < 2; k++) {
let dl = db[k];
let qh_byte = get_byte(qh_vals[ib / 2], (ib % 2) * 2 + k);
let sign_w = sign_vals[ib * 2 + k];
for (var l: u32 = 0; l < 4; l++) {
let signs = get_byte(sign_w, l);
let ig_val = load_u32_at_src0(block_byte_base + 2 + (ib * 8 + k * 4 + l) * 2) & 0xFFFF;
let ig1 = get_byte(ig_val, 0) | ((qh_byte << ((8 - (2 * l)))) & 256);
let ig2 = get_byte(ig_val, 1) | ((qh_byte << ((7 - (2 * l)))) & 256);
for (var j: u32 = 0; j < 4; j++) {
let g1 = get_byte(iq3s_grid[ig1], j);
let g2 = get_byte(iq3s_grid[ig2], j);
let m1 = select(1.0, -1.0, (get_byte(kmask_iq2xs[0], j) & signs) != 0);
let m2 = select(1.0, -1.0, (get_byte(kmask_iq2xs[1], j) & signs) != 0);
sum += dl * f32(g1) * m1 * src1[src1_i];
sum += dl * f32(g2) * m2 * src1[src1_i + 4];
src1_i++;
}
src1_i += 4;
}
}
}
return sum;
}
#endif
#ifdef IQ1_S
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 50; // Block stride: 50 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var src1_i = src1_idx_base + offset * 256;
var sum = 0.0;
for (var ib: u32 = 0; ib < 8; ib++) {
let qh = load_u32_at_src0(block_byte_base + 34 + ib * 2) & 0xFFFF;
let dl = d * (2.0 * f32((qh >> 12) & 7) + 1.0);
let delta = select(IQ1_DELTA, -IQ1_DELTA, (qh & 0x8000) != 0);
let qs_w = load_u32_at_src0(block_byte_base + 2 + ib * 4);
for (var l: u32 = 0; l < 4; l++) {
let ig = (get_byte(qs_w, l) | (((qh >> (3 * l)) & 7) << 8)) * 8;
for (var j: u32 = 0; j < 8; j++) {
let gw = iq1_grid[(ig + j) / 16];
let g = (gw >> (((ig + j) % 16) * 2)) & 3;
let gs = bitcast<i32>(g << 30) >> 30;
sum += dl * (f32(gs) + delta) * src1[src1_i];
src1_i++;
}
}
}
return sum;
}
#endif
#ifdef IQ1_M
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block = src0[src0_idx_base + offset];
let scale = ((block.scales[0] >> 12) & 0xF) | ((block.scales[0] >> 24) & 0x00F0) | ((block.scales[1] >> 4) & 0x0F00) | ((block.scales[1] >> 16) & 0xF000);
let d = f32(bitcast<vec2<f16>>(scale).x);
var src1_i = src1_idx_base + offset * 256;
var sum = 0.0;
for (var ib: u32 = 0; ib < 8; ib++) {
let sw = (block.scales[ib / 4] >> (16 * ((ib / 2) % 2))) & 0xFFFF;
let s1 : u32 = (sw >> (6 * (ib % 2))) & 0x7;
let s2 : u32 = (sw >> (6 * (ib % 2) + 3)) & 0x7;
var dl = array<f32, 2>(
d * f32(2 * s1 + 1),
d * f32(2 * s2 + 1)
);
let qh = block.qh[ib / 2] >> (16 * (ib % 2));
var idx = array<u32, 4>(
get_byte(block.qs[ib], 0) | ((qh << 8) & 0x700),
get_byte(block.qs[ib], 1) | ((qh << 4) & 0x700),
get_byte(block.qs[ib], 2) | ((qh) & 0x700),
get_byte(block.qs[ib], 3) | ((qh >> 4) & 0x700)
);
var delta = array<f32, 4>(
select(IQ1_DELTA, -IQ1_DELTA, (qh & 0x08) != 0),
select(IQ1_DELTA, -IQ1_DELTA, (qh & 0x80) != 0),
select(IQ1_DELTA, -IQ1_DELTA, ((qh >> 8) & 0x08) != 0),
select(IQ1_DELTA, -IQ1_DELTA, ((qh >> 8) & 0x80) != 0)
);
for (var l: u32 = 0; l < 4; l++) {
let ig = idx[l] * 8;
for (var j: u32 = 0; j < 8; j++) {
let gw = iq1_grid[(ig + j) / 16];
let g = (gw >> (((ig + j) % 16) * 2)) & 3;
let gs = bitcast<i32>(g << 30) >> 30;
sum += dl[l/2] * (f32(gs) + delta[l]) * src1[src1_i];
src1_i++;
}
}
}
return sum;
}
#endif
#ifdef IQ4_NL
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block_byte_base = (src0_idx_base + offset) * 18; // Block stride: 18 bytes
let d = load_f16_as_f32_at_src0(block_byte_base);
var src1_i = src1_idx_base + offset * 32;
var sum = 0.0;
var qs: array<u32, 4>;
for (var i: u32 = 0; i < 4; i++) {
qs[i] = load_u32_at_src0(block_byte_base + 2 + i * 4);
}
for (var j: u32 = 0; j < 16; j++) {
let qsb = get_byte(qs[j / 4], j % 4);
sum += d * f32(kvalues_iq4nl[qsb & 0xF]) * src1[src1_i];
sum += d * f32(kvalues_iq4nl[qsb >> 4]) * src1[src1_i + 16];
src1_i++;
}
return sum;
}
#endif
#ifdef IQ4_XS
fn multiply_add(src0_idx_base: u32, src1_idx_base: u32, offset: u32) -> f32 {
let block = src0[src0_idx_base + offset];
let d = unpack2x16float(block.d_scales_h)[0];
let scales_h = block.d_scales_h >> 16;
var src1_i = src1_idx_base + offset * 256;
var sum = 0.0;
for (var ib: u32 = 0; ib < 8; ib++) {
let ls = ((get_byte(block.scales_l, ib / 2) >> (4 * (ib % 2))) & 0xF) | (((scales_h >> (2 * ib)) & 3) << 4);
let dl = d * (f32(ls) - 32.0);
for (var j: u32 = 0; j < 16; j++) {
let iqs = ib * 16 + j;
let qsb = get_byte(block.qs[iqs / 4], iqs % 4);
sum += dl * f32(kvalues_iq4nl[qsb & 0xF]) * src1[src1_i];
sum += dl * f32(kvalues_iq4nl[qsb >> 4]) * src1[src1_i + 16];
src1_i++;
}
src1_i += 16;
}
return sum;
}
#endif
struct MulMatParams {
offset_src0: u32, // in elements/blocks
offset_src1: u32, // in elements/blocks
offset_dst: u32, // in elements/blocks
m: u32,
n: u32,
k: u32,
// all strides are in elements/blocks
stride_01: u32,
stride_11: u32,
stride_02: u32,
stride_12: u32,
stride_03: u32,
stride_13: u32,
bs02: u32,
bs03: u32,
broadcast2: u32,
broadcast3: u32
};
@group(0) @binding(0) var<storage, read_write> src0: array<SRC0_TYPE>; // M rows, K columns
@group(0) @binding(1) var<storage, read_write> src1: array<SRC1_TYPE>; // K rows, N columns (transposed)
@group(0) @binding(2) var<storage, read_write> dst: array<f32>; // M rows, N columns
@group(0) @binding(3) var<uniform> params: MulMatParams;
@compute @workgroup_size(256)
fn main(@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(workgroup_id) wg_id: vec3<u32>,
@builtin(num_workgroups) num_wg: vec3<u32>) {
let wg_linear = wg_id.y * num_wg.x + wg_id.x;
let global_idx = wg_linear * 256u + local_id.x;
let total = params.m * params.n * params.bs02 * params.broadcast2 * params.bs03 * params.broadcast3;
if (global_idx >= total) {
return;
}
let dst2_stride = params.m * params.n;
let dst3_stride = dst2_stride * params.bs02 * params.broadcast2;
let dst3_idx = global_idx / dst3_stride;
let src03_idx = dst3_idx / params.broadcast3; // src0 may be broadcast along the third dimension
let src13_idx = dst3_idx; // src1 is not broadcast
let dst3_rem = global_idx % dst3_stride;
let dst2_idx = dst3_rem / dst2_stride;
let src02_idx = dst2_idx / params.broadcast2; // src0 may also be broadcast along the second dimension
let src12_idx = dst2_idx; // src1 is not broadcast
let dst2_rem = dst3_rem % dst2_stride;
let row = dst2_rem / params.m; // output row
let col = dst2_rem % params.m; // output column
let src0_idx_base = params.offset_src0 + src03_idx * params.stride_03 + src02_idx * params.stride_02 + col * params.stride_01;
let src1_idx_base = params.offset_src1 + src13_idx * params.stride_13 + src12_idx * params.stride_12 + row * params.stride_11;
var sum = 0.0;
for (var i: u32 = 0u; i < params.k/BLOCK_SIZE; i = i + 1u) {
sum += multiply_add(src0_idx_base, src1_idx_base, i);
}
dst[params.offset_dst + dst3_idx * dst3_stride + dst2_idx * dst2_stride + row * params.m + col] = sum;
}
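
Reviewer note: the index arithmetic in `main` above can be checked on the CPU. The following is a minimal Python sketch, assuming `offset_dst` is zero; the function name `decompose_index` and the dictionary layout are illustrative and not part of the shader.

```python
def decompose_index(global_idx, m, n, bs02, broadcast2, broadcast3):
    """Mirror the integer arithmetic in main() (hypothetical helper, offset_dst assumed 0)."""
    dst2_stride = m * n
    dst3_stride = dst2_stride * bs02 * broadcast2
    # Peel off the outermost (dim 3) index, then dim 2, then row/col.
    dst3_idx, dst3_rem = divmod(global_idx, dst3_stride)
    dst2_idx, dst2_rem = divmod(dst3_rem, dst2_stride)
    row, col = divmod(dst2_rem, m)
    return {
        "dst3_idx": dst3_idx,
        "dst2_idx": dst2_idx,
        # src0 may be broadcast, so several dst batches share one src0 batch.
        "src03_idx": dst3_idx // broadcast3,
        "src02_idx": dst2_idx // broadcast2,
        "row": row,
        "col": col,
    }


def dst_index(idx, m, n, bs02, broadcast2):
    """Recompute the flat dst index the shader writes to (offset_dst assumed 0)."""
    dst2_stride = m * n
    dst3_stride = dst2_stride * bs02 * broadcast2
    return (idx["dst3_idx"] * dst3_stride + idx["dst2_idx"] * dst2_stride
            + idx["row"] * m + idx["col"])
```

With these definitions, the flat index written to `dst` equals `global_idx`, so each invocation owns exactly one output element.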


@@ -21,35 +21,32 @@ var<workgroup> count:atomic<u32>;
@compute @workgroup_size(WG_SIZE)
fn main(@builtin(workgroup_id) wg_id: vec3<u32>,
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(num_workgroups) num_wg: vec3<u32>) {
@builtin(local_invocation_id) local_id: vec3<u32>) {
let thread_id = local_id.x;
let own_expert = wg_id.y * num_wg.x + wg_id.x; // the expert assigned to this workgroup
let own_expert = wg_id.x; // the expert assigned to this workgroup
if (own_expert < params.n_expert) {
if (thread_id == 0u) {
atomicStore(&count, 0);
}
if (thread_id == 0u) {
atomicStore(&count, 0);
}
workgroupBarrier();
workgroupBarrier();
for (var i = thread_id;i < params.n_expert_used * params.n_tokens;i += WG_SIZE) {
let row = i / params.n_expert_used;
let col = i % params.n_expert_used;
let expert = u32(ids[params.offset_ids + row * params.stride_ids_1 + col]);
if (own_expert == expert) {
let pos = atomicAdd(&count, 1u);
let gathered_id = own_expert * params.n_tokens + pos;
global_gathered_expert_used[gathered_id] = col;
global_gathered_tokens[gathered_id] = row;
}
}
workgroupBarrier();
if (thread_id == 0u) {
gathered_count_ids[own_expert] = atomicLoad(&count);
for (var i = thread_id;i < params.n_expert_used * params.n_tokens;i += WG_SIZE) {
let row = i / params.n_expert_used;
let col = i % params.n_expert_used;
let expert = u32(ids[params.offset_ids + row * params.stride_ids_1 + col]);
if (own_expert == expert) {
let pos = atomicAdd(&count, 1u);
let gathered_id = own_expert * params.n_tokens + pos;
global_gathered_expert_used[gathered_id] = col;
global_gathered_tokens[gathered_id] = row;
}
}
workgroupBarrier();
if (thread_id == 0u) {
gathered_count_ids[own_expert] = atomicLoad(&count);
}
}
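
Reviewer note: the gather step in this hunk groups `(token, slot)` pairs by the expert they were routed to. Here is a minimal Python sketch of the same grouping, assuming `ids` is a dense list of rows (`offset_ids` of zero, contiguous stride); `gather_by_expert` is an illustrative name. The shader has every workgroup scan all entries for its own expert and appends through `atomicAdd`, so the order within one expert is not fixed on the GPU. The sequential version below produces one valid ordering.

```python
def gather_by_expert(ids, n_expert):
    """Group (token row, expert slot) pairs by expert id (hypothetical CPU reference)."""
    n_tokens = len(ids)
    n_expert_used = len(ids[0]) if n_tokens else 0
    gathered = [[] for _ in range(n_expert)]
    for i in range(n_tokens * n_expert_used):
        # Same row/col split as the shader loop.
        row, col = divmod(i, n_expert_used)
        gathered[ids[row][col]].append((row, col))
    counts = [len(g) for g in gathered]
    return gathered, counts
```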


@@ -3,10 +3,18 @@ enable subgroups;
#endif
enable f16;
#ifdef MMVQ
requires packed_4x8_integer_dot_product;
#endif
#define DECLARE_BYTE_LOADERS_SRC0
#include "common_decls.tmpl"
#ifdef MMVQ
#include "mul_mat_vec_q_acc.tmpl"
#else
#include "mul_mat_vec_acc.tmpl"
#endif
struct MulMatParams {
offset_src0: u32,
@@ -28,9 +36,14 @@ struct MulMatParams {
};
@group(0) @binding(0) var<storage, read_write> src0: array<SRC0_TYPE>;
@group(0) @binding(1) var<storage, read_write> src1: array<SRC1_TYPE>;
@group(0) @binding(2) var<storage, read_write> dst: array<f32>;
#ifdef MMVQ
@group(0) @binding(1) var<storage, read_write> src1q: array<q8_1>;
#else
@group(0) @binding(1) var<storage, read_write> src1: array<SRC1_TYPE>;
#endif
@group(0) @binding(2) var<storage, read_write> dst: array<f32>;
// "mul_mat_vec_acc.tmpl" requires params.k, params.m, params.stride_01
@group(0) @binding(3) var<uniform> params: MulMatParams;
@@ -75,10 +88,15 @@ fn main(
let src12_idx = dst2_idx;
let src0_batch_offset = params.offset_src0 + src03_idx * params.stride_03 + src02_idx * params.stride_02;
let src1_idx_base = params.offset_src1 + src13_idx * params.stride_13 + src12_idx * params.stride_12;
let dst_idx_base = params.offset_dst + dst3_idx * dst3_stride + dst2_idx * dst2_stride + row_base;
#ifdef MMVQ
let src1q_idx_base = (src13_idx * params.bs02 * params.broadcast2 + src12_idx) * (params.k / 32u);
let acc = accumulate_vec_q_dot(thread_id, row_base, src0_batch_offset, src1q_idx_base);
#else
let src1_idx_base = params.offset_src1 + src13_idx * params.stride_13 + src12_idx * params.stride_12;
let acc = accumulate_vec_dot(thread_id, row_base, src0_batch_offset, src1_idx_base);
#endif
#ifdef USE_SUBGROUP_REDUCTION
for (var row = 0u; row < OUTPUTS_PER_WG; row++) {


@@ -436,7 +436,6 @@ fn accumulate_vec_dot(thread_id: u32, row_base: u32, src0_batch_offset: u32, src
}
#endif
#ifdef MUL_ACC_Q3_K
#define BLOCK_SIZE 256
#define BLOCK_SIZE_BYTES 110


@@ -0,0 +1,303 @@
#ifdef U32_DEQUANT_HELPERS
#define SRC0_TYPE u32
fn byte_of(v: u32, b: u32) -> u32 {
return (v >> (b * 8u)) & 0xFFu;
}
fn sbyte_of(v: u32, b: u32) -> i32 {
let raw = i32((v >> (b * 8u)) & 0xFFu);
return select(raw, raw - 256, raw >= 128);
}
#endif
#define SRC0_TYPE SRC0_INNER_TYPE
#define SRC1_TYPE SRC1_INNER_TYPE
#ifdef LEGACY_QUANTS
#define BLOCK_SIZE 32
#define THREADS_PER_BLOCK 4
#elif K_QUANTS
#define BLOCK_SIZE 256
#define THREADS_PER_BLOCK 16
#endif
#define ELEMS_PER_THREAD (BLOCK_SIZE/THREADS_PER_BLOCK)
#define Q8_BLOCK_SIZE 32
#ifdef MUL_ACC_Q4_0
#define BLOCK_SIZE_BYTES 18
#define B_DS_TYPE vec2<f32>
fn repack_a(block_byte_base: u32, inner_id: u32) -> vec2<u32> {
let qs_packed = load_u32_at_src0(block_byte_base + 2u + 4u * inner_id);
return vec2<u32>(
qs_packed & 0x0F0F0F0Fu,
(qs_packed >> 4u) & 0x0F0F0F0Fu
);
}
fn repack_b_qs(block:u32, inner_id: u32) -> vec2<u32> {
return vec2<u32>(
src1q[block].qs[inner_id],
src1q[block].qs[inner_id + 4u],
);
}
fn repack_b_dm(block: u32) -> B_DS_TYPE {
return B_DS_TYPE(
f32(src1q[block].d),
f32(src1q[block].s)
);
}
fn get_dm(block_byte_base: u32) -> f32 {
return f32(load_f16_at_src0(block_byte_base));
}
fn mul_q8_1(row_sum: i32, da: f32, b_ds: B_DS_TYPE) -> f32 {
return f32(row_sum) * (da * b_ds.x) - 8.0 * da * b_ds.y / THREADS_PER_BLOCK;
}
#endif
#ifdef MUL_ACC_Q4_1
#define BLOCK_SIZE_BYTES 20
#define B_DS_TYPE vec2<f32>
fn repack_a(block_byte_base: u32, inner_id: u32) -> vec2<u32> {
let qs_packed = load_u32_at_src0(block_byte_base + 4u + 4u * inner_id);
return vec2<u32>(
qs_packed & 0x0F0F0F0Fu,
(qs_packed >> 4u) & 0x0F0F0F0Fu
);
}
fn repack_b_qs(block:u32, inner_id: u32) -> vec2<u32> {
return vec2<u32>(
src1q[block].qs[inner_id],
src1q[block].qs[inner_id + 4u],
);
}
fn repack_b_dm(block: u32) -> B_DS_TYPE {
return B_DS_TYPE(
f32(src1q[block].d),
f32(src1q[block].s)
);
}
fn get_dm(block_byte_base: u32) -> vec2<f32> {
return vec2<f32>(
f32(load_f16_at_src0(block_byte_base)),
f32(load_f16_at_src0(block_byte_base + 2u))
);
}
fn mul_q8_1(row_sum: i32, dma: vec2<f32>, b_ds: B_DS_TYPE) -> f32 {
return f32(row_sum) * (dma.x * b_ds.x) + dma.y * b_ds.y / THREADS_PER_BLOCK;
}
#endif
#ifdef MUL_ACC_Q8_0
#define BLOCK_SIZE_BYTES 34
#define B_DS_TYPE f32
fn repack_a(block_byte_base: u32, inner_id: u32) -> vec2<u32> {
return vec2<u32>(
load_u32_at_src0(block_byte_base + 2u + 4u * (inner_id * 2u)),
load_u32_at_src0(block_byte_base + 2u + 4u * (inner_id * 2u + 1))
);
}
fn repack_b_qs(block:u32, inner_id: u32) -> vec2<u32> {
return vec2<u32>(
src1q[block].qs[inner_id * 2u],
src1q[block].qs[inner_id * 2u + 1],
);
}
fn repack_b_dm(block: u32) -> B_DS_TYPE {
return B_DS_TYPE(src1q[block].d);
}
fn get_dm(block_byte_base: u32) -> f32 {
return f32(load_f16_at_src0(block_byte_base));
}
fn mul_q8_1(row_sum: i32, da: f32, b_ds: B_DS_TYPE) -> f32 {
return f32(row_sum) * (da * b_ds);
}
#endif
#ifdef LEGACY_QUANTS
fn mmvq_dot_product(a_byte_base: u32, b_inner_id: u32, b_repacked: vec2<u32>, b_ds: B_DS_TYPE) -> f32 {
var row_sum = 0;
let a_repacked = repack_a(a_byte_base, b_inner_id);
row_sum += dot4I8Packed(a_repacked[0], b_repacked[0]);
row_sum += dot4I8Packed(a_repacked[1], b_repacked[1]);
return mul_q8_1(row_sum, get_dm(a_byte_base), b_ds);
}
fn accumulate_vec_q_dot(thread_id: u32, row_base: u32, src0_batch_offset: u32, src1q_idx_base: u32) -> array<f32, OUTPUTS_PER_WG> {
var acc: array<f32, OUTPUTS_PER_WG>;
let num_blocks = params.k / BLOCK_SIZE;
for (var block = thread_id / THREADS_PER_BLOCK; block < num_blocks; block += WG_SIZE / THREADS_PER_BLOCK) {
let b_inner_id = thread_id % THREADS_PER_BLOCK;
let b_block_idx = src1q_idx_base + block;
let b_repacked = repack_b_qs(b_block_idx, b_inner_id);
let b_ds = repack_b_dm(b_block_idx);
for (var row = 0u; row < OUTPUTS_PER_WG; row++) {
let output_row = row_base + row;
if (output_row < params.m) {
let block_byte_base = (src0_batch_offset + output_row * params.stride_01 + block) * BLOCK_SIZE_BYTES;
acc[row] += mmvq_dot_product(block_byte_base, b_inner_id, b_repacked, b_ds);
}
}
}
return acc;
}
#endif
#ifdef MUL_ACC_Q2_K
#define BLOCK_SIZE_BYTES 84
#define B_DS_TYPE f32
fn repack_a(block_byte_base: u32, tid: u32) -> vec4<u32> {
let ih2 = tid / 8u;
let phase = tid % 2u;
let iq4_idx = 2u * ih2 + phase;
let qs_byte_base = block_byte_base + 16u + 16u * iq4_idx;
let qs_shift = tid & 6u;
return vec4<u32>(
(load_u32_at_src0_aligned(qs_byte_base) >> qs_shift) & 0x03030303u,
(load_u32_at_src0_aligned(qs_byte_base + 4u) >> qs_shift) & 0x03030303u,
(load_u32_at_src0_aligned(qs_byte_base + 8u) >> qs_shift) & 0x03030303u,
(load_u32_at_src0_aligned(qs_byte_base + 12u) >> qs_shift) & 0x03030303u,
);
}
fn repack_b_qs(q8_block_idx: u32, tid: u32) -> vec4<u32> {
let phase = tid % 2u;
return vec4<u32>(
src1q[q8_block_idx].qs[4u * phase],
src1q[q8_block_idx].qs[4u * phase + 1u],
src1q[q8_block_idx].qs[4u * phase + 2u],
src1q[q8_block_idx].qs[4u * phase + 3u],
);
}
fn repack_b_dm(q8_block_idx: u32) -> B_DS_TYPE {
return B_DS_TYPE(src1q[q8_block_idx].d);
}
fn get_dm(block_byte_base: u32) -> vec2<f32> {
return vec2<f32>(
f32(load_f16_at_src0(block_byte_base + 80u)),
f32(load_f16_at_src0(block_byte_base + 82u)),
);
}
fn get_scale_min(block_byte_base: u32, tid: u32) -> vec2<f32> {
let scale_byte = block_byte_base + tid;
let scale = byte_of(load_u32_at_src0_aligned(scale_byte), scale_byte & 3u);
return vec2<f32>(f32(scale & 0xFu), f32(scale >> 4u));
}
fn mmvq_dot_product(a_byte_base: u32, tid: u32, b_repacked: vec4<u32>, b_ds: B_DS_TYPE) -> f32 {
let a_repacked = repack_a(a_byte_base, tid);
let dm = get_dm(a_byte_base);
let scale_min = get_scale_min(a_byte_base, tid);
let scale_q = i32(scale_min.x);
let scale_m_i8x4 = u32(scale_min.y) * 0x01010101u;
let row_sum_d = (dot4I8Packed(b_repacked[0], a_repacked[0]) + dot4I8Packed(b_repacked[1], a_repacked[1])
+ dot4I8Packed(b_repacked[2], a_repacked[2]) + dot4I8Packed(b_repacked[3], a_repacked[3])) * scale_q;
let row_sum_m = dot4I8Packed(b_repacked[0], scale_m_i8x4) + dot4I8Packed(b_repacked[1], scale_m_i8x4)
+ dot4I8Packed(b_repacked[2], scale_m_i8x4) + dot4I8Packed(b_repacked[3], scale_m_i8x4);
return b_ds * (dm.x * f32(row_sum_d) - dm.y * f32(row_sum_m));
}
#endif
#ifdef MUL_ACC_Q4_K
#define BLOCK_SIZE_BYTES 144
#define B_DS_TYPE vec2<f32>
fn repack_a(block_byte_base: u32, tid: u32) -> vec4<u32> {
let iq4 = tid / 4u;
let phase = tid % 2u;
let nibble = (tid >> 1u) % 2u;
let q_qs_byte_base = block_byte_base + 16u + 32u * iq4 + 16u * phase;
let qs_shift = 4u * nibble;
return vec4<u32>(
(load_u32_at_src0_aligned(q_qs_byte_base) >> qs_shift) & 0x0F0F0F0Fu,
(load_u32_at_src0_aligned(q_qs_byte_base + 4u) >> qs_shift) & 0x0F0F0F0Fu,
(load_u32_at_src0_aligned(q_qs_byte_base + 8u) >> qs_shift) & 0x0F0F0F0Fu,
(load_u32_at_src0_aligned(q_qs_byte_base + 12u) >> qs_shift) & 0x0F0F0F0Fu,
);
}
fn repack_b_qs(q8_block_idx: u32, tid: u32) -> vec4<u32> {
let phase = tid % 2u;
return vec4<u32>(
src1q[q8_block_idx].qs[4u * phase],
src1q[q8_block_idx].qs[4u * phase + 1u],
src1q[q8_block_idx].qs[4u * phase + 2u],
src1q[q8_block_idx].qs[4u * phase + 3u],
);
}
fn repack_b_dm(q8_block_idx: u32) -> B_DS_TYPE {
return B_DS_TYPE(
f32(src1q[q8_block_idx].d),
f32(src1q[q8_block_idx].s),
);
}
fn get_dm(block_byte_base: u32) -> vec2<f32> {
return vec2<f32>(
f32(load_f16_at_src0(block_byte_base + 0u)),
f32(load_f16_at_src0(block_byte_base + 2u)),
);
}
fn get_scale_min(block_byte_base: u32, tid: u32) -> vec2<f32> {
let sc_m_idx = tid / 2u;
let scales_byte_base = block_byte_base + 4u;
let scales0_3 = load_u32_at_src0_aligned(scales_byte_base);
let scales4_7 = load_u32_at_src0_aligned(scales_byte_base + 4u);
let scales8_11 = load_u32_at_src0_aligned(scales_byte_base + 8u);
let byte_idx = sc_m_idx & 3u;
let is_high = sc_m_idx >= 4u;
let sc_low = byte_of(scales0_3, byte_idx) & 0x3Fu;
let sc_high = (byte_of(scales8_11, byte_idx) & 0x0Fu) | ((byte_of(scales0_3, byte_idx) & 0xC0u) >> 2u);
let scale = f32(select(sc_low, sc_high, is_high));
let mn_low = byte_of(scales4_7, byte_idx) & 0x3Fu;
let mn_high = (byte_of(scales8_11, byte_idx) >> 4u) | ((byte_of(scales4_7, byte_idx) & 0xC0u) >> 2u);
let min_val = f32(select(mn_low, mn_high, is_high));
return vec2<f32>(scale, min_val);
}
fn mmvq_dot_product(a_byte_base: u32, tid: u32, b_repacked: vec4<u32>, b_ds: B_DS_TYPE) -> f32 {
let a_repacked = repack_a(a_byte_base, tid);
let dm = get_dm(a_byte_base);
let scale_min = get_scale_min(a_byte_base, tid);
let row_sum = dot4I8Packed(a_repacked[0], b_repacked[0]) + dot4I8Packed(a_repacked[1], b_repacked[1])
+ dot4I8Packed(a_repacked[2], b_repacked[2]) + dot4I8Packed(a_repacked[3], b_repacked[3]);
// Each thread covers half of a Q8_1 block, so it subtracts only half of the block-sum term (b_ds.y / 2).
return b_ds.x * dm.x * scale_min.x * f32(row_sum) - dm.y * scale_min.y * (b_ds.y / (Q8_BLOCK_SIZE / ELEMS_PER_THREAD));
}
#endif
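
Reviewer note: the 6-bit scale/min unpacking in the Q4_K `get_scale_min` above is easy to get wrong, so a CPU round-trip check may help. This is a minimal Python sketch over a 12-byte `scales` array; `unpack_scale_min` and `pack_scale_min` are illustrative names, and the packer is an assumption written only to be the inverse of the unpacking shown in the shader.

```python
def unpack_scale_min(j, scales):
    """Return the 6-bit (scale, min) pair for sub-block j in 0..7 from 12 packed bytes."""
    if j < 4:
        return scales[j] & 0x3F, scales[j + 4] & 0x3F
    # Upper sub-blocks: low 4 bits live in bytes 8..11, high 2 bits in the
    # top bits of bytes 0..3 (scale) and 4..7 (min).
    sc = (scales[j + 4] & 0x0F) | ((scales[j - 4] & 0xC0) >> 2)
    mn = (scales[j + 4] >> 4) | ((scales[j] & 0xC0) >> 2)
    return sc, mn


def pack_scale_min(sc, mn):
    """Hypothetical inverse of unpack_scale_min, used only for testing."""
    scales = [0] * 12
    for j in range(4):
        scales[j] = (sc[j] & 0x3F) | ((sc[j + 4] >> 4) << 6)
        scales[j + 4] = (mn[j] & 0x3F) | ((mn[j + 4] >> 4) << 6)
        scales[j + 8] = (sc[j + 4] & 0x0F) | ((mn[j + 4] & 0x0F) << 4)
    return scales
```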
#ifdef K_QUANTS
fn accumulate_vec_q_dot(thread_id: u32, row_base: u32, src0_batch_offset: u32, src1q_idx_base: u32) -> array<f32, OUTPUTS_PER_WG> {
var acc: array<f32, OUTPUTS_PER_WG>;
let tid = thread_id % THREADS_PER_BLOCK;
for (var block = thread_id / THREADS_PER_BLOCK; block < params.k / BLOCK_SIZE; block += WG_SIZE / THREADS_PER_BLOCK) {
let src1q_idx = src1q_idx_base + (block * BLOCK_SIZE + ELEMS_PER_THREAD * tid) / Q8_BLOCK_SIZE;
let b_repacked = repack_b_qs(src1q_idx, tid);
let b_ds = repack_b_dm(src1q_idx);
for (var row = 0u; row < OUTPUTS_PER_WG; row++) {
let output_row = row_base + row;
if (output_row < params.m) {
let block_byte_base = (src0_batch_offset + output_row * params.stride_01 + block) * BLOCK_SIZE_BYTES;
acc[row] += mmvq_dot_product(block_byte_base, tid, b_repacked, b_ds);
}
}
}
return acc;
}
#endif
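
Reviewer note: the Q4_0 `mul_q8_1` above folds the `- 8` offset of the 4-bit weights into a correction built from the Q8_1 block sum `s`. The algebra is `sum(da * (qa - 8) * db * qb) = da * db * sum(qa * qb) - 8 * da * s`, with `s = db * sum(qb)`. Because each of the `THREADS_PER_BLOCK` threads applies `1 / THREADS_PER_BLOCK` of the correction, the per-thread results add up to the full term. Below is a minimal Python sketch of that identity; the function names are illustrative, and the contiguous per-thread slices are a simplification of the shader's interleaved word layout.

```python
def q4_0_dot_q8_1_reference(da, qa, db, qb):
    """Dequantize both 32-element blocks, then take the dot product."""
    return sum(da * (a - 8) * db * b for a, b in zip(qa, qb))


def q4_0_dot_q8_1_split(da, qa, db, qb, threads_per_block=4):
    """Per-thread integer dot product plus a share of the -8 * da * s correction."""
    s = db * sum(qb)  # plays the role of the Q8_1 's' field
    elems = len(qa) // threads_per_block
    total = 0.0
    for t in range(threads_per_block):
        # Assumption: contiguous slices; the shader uses an interleaved layout.
        lo, hi = t * elems, (t + 1) * elems
        row_sum = sum(a * b for a, b in zip(qa[lo:hi], qb[lo:hi]))
        total += row_sum * (da * db) - 8.0 * da * s / threads_per_block
    return total
```

The Q4_1 variant follows the same pattern, with `+ dma.y * s` taking the place of the `- 8 * da * s` term.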


@@ -0,0 +1,173 @@
#ifdef USE_SUBGROUP_REDUCTION
enable subgroups;
#endif
enable f16;
requires packed_4x8_integer_dot_product;
#include "common_decls.tmpl"
struct Params {
offset_src1: u32,
stride_12: u32,
stride_13: u32,
ne0: u32,
ne2: u32,
ne3: u32,
};
#define SRC1_TYPE vec4<SRC1_INNER_TYPE>
@group(0) @binding(0) var<storage, read_write> src1: array<SRC1_TYPE>;
@group(0) @binding(1) var<storage, read_write> src1q: array<q8_1>;
@group(0) @binding(2) var<uniform> params: Params;
#ifdef USE_SUBGROUP_REDUCTION
fn cluster_max_8(v: f32) -> f32 {
var r = v;
r = max(r, subgroupShuffleXor(r, 1u));
r = max(r, subgroupShuffleXor(r, 2u));
r = max(r, subgroupShuffleXor(r, 4u));
return r;
}
#if defined(MUL_ACC_Q4_0) || defined(MUL_ACC_Q4_1) || defined(MUL_ACC_Q4_K)
fn cluster_add_i4x8(v: i32) -> i32 {
var r= v;
r += subgroupShuffleXor(r, 1u);
r += subgroupShuffleXor(r, 2u);
r += subgroupShuffleXor(r, 4u);
return r;
}
#endif
#endif
#ifdef USE_WORKGROUP_REDUCTION
#define CLUSTER_SIZE 8
var<workgroup> partial_amaxs: array<array<f32, CLUSTER_SIZE>, WG_SIZE / CLUSTER_SIZE>;
var<workgroup> partial_sums: array<array<i32, CLUSTER_SIZE>, WG_SIZE / CLUSTER_SIZE>;
#endif
@compute @workgroup_size(WG_SIZE)
fn main(
@builtin(local_invocation_id) local_id: vec3<u32>,
@builtin(workgroup_id) wg_id: vec3<u32>,
@builtin(num_workgroups) num_wg: vec3<u32>
) {
let thread_id = local_id.x;
let num_vec4 = params.ne0 / 4u;
let wg_per_vec = (num_vec4 + (WG_SIZE - 1u)) / WG_SIZE;
let total_batches = wg_per_vec * params.ne2 * params.ne3;
let wg_linear = wg_id.y * num_wg.x + wg_id.x;
if (wg_linear >= total_batches) {
return;
}
let src13_idx = wg_linear / (params.ne2 * wg_per_vec);
let src12_idx = (wg_linear - src13_idx * (params.ne2 * wg_per_vec)) / wg_per_vec;
let src11_wg_idx = wg_linear % wg_per_vec;
let src1_idx_base = params.offset_src1 + src13_idx * params.stride_13 + src12_idx * params.stride_12;
let src1_idx_vec4_base = src1_idx_base / 4u;
let blocks_per_row = params.ne0 / 32u;
let blocks_per_wg = (WG_SIZE * 4u) / 32u;
let src1q_idx_base = (src13_idx * params.ne2 + src12_idx) * blocks_per_row;
let src1q_idx = src1q_idx_base + src11_wg_idx * blocks_per_wg + thread_id / 8u;
let qs_idx = thread_id % 8u;
// reduction
var q4 = vec4<f32>(0.0);
var q4_quants = 0u;
var thread_amax = 0.0;
let src11_vec4_idx = src11_wg_idx * WG_SIZE + thread_id;
let is_valid = src11_vec4_idx < num_vec4;
#ifdef USE_SUBGROUP_REDUCTION
var d = 0.0;
if (is_valid) {
q4 = src1[src1_idx_vec4_base + src11_vec4_idx];
let abs_q4 = abs(q4);
thread_amax = max(max(abs_q4[0u], abs_q4[1u]), max(abs_q4[2], abs_q4[3]));
}
d = cluster_max_8(thread_amax) / 127.0;
if (is_valid) {
let id = select(0.0, 1.0 / d, d > 0.0);
q4_quants = pack4xI8(vec4<i32>(round(q4 * id)));
if (qs_idx == 0u) {
src1q[src1q_idx].d = f16(d);
}
src1q[src1q_idx].qs[qs_idx] = q4_quants;
}
#if defined(MUL_ACC_Q4_0) || defined(MUL_ACC_Q4_1) || defined(MUL_ACC_Q4_K)
let q4_quants_sum = dot4I8Packed(q4_quants, 0x01010101u);
let s = f16(d * f32(cluster_add_i4x8(q4_quants_sum)));
if (is_valid) {
if (qs_idx == 0u) {
src1q[src1q_idx].s = s;
}
}
#endif
#endif
#ifdef USE_WORKGROUP_REDUCTION
var d = 0.0;
let cluster_id = thread_id / 8u;
if (is_valid) {
q4 = src1[src1_idx_vec4_base + src11_vec4_idx];
let abs_q4 = abs(q4);
thread_amax = max(max(abs_q4[0], abs_q4[1]), max(abs_q4[2], abs_q4[3]));
partial_amaxs[cluster_id][qs_idx] = thread_amax;
}
workgroupBarrier();
if (is_valid) {
let amax = max(
max(
max(partial_amaxs[cluster_id][0], partial_amaxs[cluster_id][1]), max(partial_amaxs[cluster_id][2], partial_amaxs[cluster_id][3])),
max(
max(partial_amaxs[cluster_id][4], partial_amaxs[cluster_id][5]), max(partial_amaxs[cluster_id][6], partial_amaxs[cluster_id][7]))
);
d = amax / 127.0;
let id = select(0.0f, 1.0f / d, d > 0.0f);
q4_quants = pack4xI8(vec4<i32>(round(q4 * id)));
src1q[src1q_idx].qs[qs_idx] = q4_quants;
if (qs_idx == 0u) {
src1q[src1q_idx].d = f16(d);
}
}
#if defined(MUL_ACC_Q4_0) || defined(MUL_ACC_Q4_1) || defined(MUL_ACC_Q4_K)
partial_sums[cluster_id][qs_idx] = dot4I8Packed(q4_quants, 0x01010101u);
workgroupBarrier();
if (is_valid) {
if (qs_idx == 0u) {
let s = d * f32(partial_sums[cluster_id][0] + partial_sums[cluster_id][1] + partial_sums[cluster_id][2] + partial_sums[cluster_id][3]
+ partial_sums[cluster_id][4] + partial_sums[cluster_id][5] + partial_sums[cluster_id][6] + partial_sums[cluster_id][7]);
src1q[src1q_idx].s = f16(s);
}
}
#endif
#endif
}
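
Reviewer note: the quantization shader above produces one Q8_1 block per cluster of 8 threads, with each thread handling one `vec4`. Here is a minimal Python sketch of the per-block math, assuming a block of 32 floats; `quantize_q8_1_block` is an illustrative name. The shader stores `d` and `s` as `f16`, while this sketch keeps full Python floats, so values can differ slightly after the `f16` rounding.

```python
def quantize_q8_1_block(x):
    """Quantize 32 floats to int8-range values plus scale d and block-sum term s."""
    assert len(x) == 32
    amax = max(abs(v) for v in x)
    d = amax / 127.0
    # Guard against an all-zero block, as the shader does with select().
    inv_d = 1.0 / d if d > 0.0 else 0.0
    qs = [round(v * inv_d) for v in x]
    # s = d * sum(q) lets the matmul side apply offset corrections without
    # re-summing the activations.
    s = d * sum(qs)
    return d, s, qs
```

For an all-zero input the scale is zero and every quantized value is zero, which matches the `d > 0.0` guard in the shader.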


@@ -88,7 +88,7 @@ static bool ggml_zendnn_matmul(ggml_backend_zendnn_context * ctx, int64_t m, int
return true;
}
static bool ggml_zendnn_sgemm(ggml_backend_zendnn_context * ctx, int64_t m, int64_t n, int64_t k,
static bool ggml_zendnn_gemm(ggml_backend_zendnn_context * ctx, int64_t m, int64_t n, int64_t k,
const void * A, int64_t lda, const void * B, int64_t ldb, void * C,
int64_t ldc, int Atype, int Btype, int Ctype) {
@@ -200,7 +200,7 @@ static void ggml_zendnn_compute_forward_mul_mat(
for (int64_t i12 = 0; i12 < ne12; i12++) {
const void* wdata = (src1->type == vec_dot_type || src0->type == GGML_TYPE_Q8_0) ? src1->data : work_data;
const size_t row_size = ggml_row_size(vec_dot_type, ne10);
if (!ggml_zendnn_sgemm(ctx,
if (!ggml_zendnn_gemm(ctx,
ne01, // m
ne11, // n
ne10, // k
@@ -213,7 +213,7 @@ static void ggml_zendnn_compute_forward_mul_mat(
src0->type,
src0->type == GGML_TYPE_Q8_0 ? GGML_TYPE_F32 : vec_dot_type,
dst->type))
GGML_ABORT("%s: ZenDNN sgemm failed\n", __func__);
GGML_ABORT("%s: ZenDNN gemm failed\n", __func__);
}
}
}
@@ -355,7 +355,7 @@ static void ggml_zendnn_compute_forward_mul_mat_id(
}
// batched gemm for all tokens in this expert
-if (!ggml_zendnn_sgemm(ctx,
+if (!ggml_zendnn_gemm(ctx,
ne01, // m
cne1, // n
ne10, // k
@@ -368,7 +368,7 @@ static void ggml_zendnn_compute_forward_mul_mat_id(
src0->type,
src0->type == GGML_TYPE_Q8_0 ? GGML_TYPE_F32 : vec_dot_type,
dst->type)) {
-GGML_ABORT("%s: ZenDNN sgemm failed\n", __func__);
+GGML_ABORT("%s: ZenDNN gemm failed\n", __func__);
}
// scatter output rows to destination


@@ -505,6 +505,7 @@ class MODEL_ARCH(IntEnum):
LLAMA_EMBED = auto()
MAINCODER = auto()
KIMI_LINEAR = auto()
TALKIE = auto()
class VISION_PROJECTOR_TYPE(IntEnum):
@@ -1021,6 +1022,7 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
MODEL_ARCH.LLAMA_EMBED: "llama-embed",
MODEL_ARCH.MAINCODER: "maincoder",
MODEL_ARCH.KIMI_LINEAR: "kimi-linear",
MODEL_ARCH.TALKIE: "talkie",
}
VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
@@ -4013,6 +4015,19 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.FFN_DOWN_SHEXP,
MODEL_TENSOR.FFN_UP_SHEXP,
],
MODEL_ARCH.TALKIE: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_Q_NORM,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_V,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
MODEL_TENSOR.LAYER_OUT_SCALE,
],
# TODO
}


@@ -34,6 +34,7 @@ class TensorNameMap:
"encoder", # neobert
"model.transformer.wte", # llada
"embed_tokens", # qwen3-embedding
"model.embed", # talkie
),
# Token type embeddings
@@ -259,6 +260,7 @@ class TensorNameMap:
"model.transformer.blocks.{bid}.q_proj", # llada
"layers.{bid}.self_attn.q_proj", # qwen3-embedding
"backbone.layers.{bid}.mixer.q_proj", # nemotron-h
"model.blocks.{bid}.attn.attn_query", # talkie
),
# Attention key
@@ -279,6 +281,7 @@ class TensorNameMap:
"model.transformer.blocks.{bid}.k_proj", # llada
"layers.{bid}.self_attn.k_proj", # qwen3-embedding
"backbone.layers.{bid}.mixer.k_proj", # nemotron-h
"model.blocks.{bid}.attn.attn_key", # talkie
),
# Attention value
@@ -298,6 +301,7 @@ class TensorNameMap:
"model.transformer.blocks.{bid}.v_proj", # llada
"layers.{bid}.self_attn.v_proj", # qwen3-embedding
"backbone.layers.{bid}.mixer.v_proj", # nemotron-h
"model.blocks.{bid}.attn.attn_value", # talkie
),
# Attention output
@@ -336,6 +340,7 @@ class TensorNameMap:
"layers.{bid}.self_attn.o_proj", # qwen3-embedding
"backbone.layers.{bid}.mixer.o_proj", # nemotron-h
"model.layers.{bid}.self_attn.language_expert_dense", # cogvlm
"model.blocks.{bid}.attn.attn_resid", # talkie
),
# Attention output norm
@@ -508,6 +513,7 @@ class TensorNameMap:
"layers.{bid}.mlp.up_proj", # qwen3-embedding
"backbone.layers.{bid}.mixer.up_proj", # nemotron-h
"model.layers.{bid}.mlp.language_mlp.up_proj", # cogvlm
"model.blocks.{bid}.mlp.mlp_linear", # talkie
),
MODEL_TENSOR.FFN_UP_EXP: (
@@ -561,6 +567,7 @@ class TensorNameMap:
"model.transformer.blocks.{bid}.ff_proj", # llada
"layers.{bid}.mlp.gate_proj", # qwen3-embedding
"model.layers.{bid}.mlp.language_mlp.gate_proj", # cogvlm
"model.blocks.{bid}.mlp.mlp_gate", # talkie
),
MODEL_TENSOR.FFN_GATE_EXP: (
@@ -636,6 +643,7 @@ class TensorNameMap:
"layers.{bid}.mlp.down_proj", # qwen3-embedding
"backbone.layers.{bid}.mixer.down_proj", # nemotron-h
"model.layers.{bid}.mlp.language_mlp.down_proj", # cogvlm
"model.blocks.{bid}.mlp.mlp_resid", # talkie
),
MODEL_TENSOR.FFN_DOWN_EXP: (
@@ -682,6 +690,7 @@ class TensorNameMap:
"model.layers.layers.{bid}.mixer.q_norm", # plamo3
"layers.{bid}.self_attn.q_norm", # qwen3-embedding
"model.layers.{bid}.attention.query_layernorm", # apertus
"model.blocks.{bid}.attn.head_gain.head_g", # talkie
),
MODEL_TENSOR.ATTN_K_NORM: (
@@ -716,6 +725,7 @@ class TensorNameMap:
MODEL_TENSOR.LAYER_OUT_SCALE: (
"model.layers.{bid}.layer_scalar", # gemma4
"model.blocks.{bid}.embed_skip.a_g", # talkie
),
MODEL_TENSOR.PER_LAYER_TOKEN_EMBD: (
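
Review note: the talkie entries added in the hunks above are `{bid}` templates, one per block-level tensor. The sketch below shows how such templates can be expanded into a name lookup. It is not the gguf-py `TensorNameMap` API: the dictionary `TALKIE_BLOCK_NAMES`, the string keys (standing in for `MODEL_TENSOR` members), and the helper `build_lookup` are all hypothetical. Only the template strings are taken from the diff.

```python
# Hypothetical, trimmed-down mapping: only the talkie templates from the hunks above.
# Keys are plain strings standing in for MODEL_TENSOR enum members.
TALKIE_BLOCK_NAMES = {
    "ATTN_Q": "model.blocks.{bid}.attn.attn_query",
    "ATTN_K": "model.blocks.{bid}.attn.attn_key",
    "ATTN_V": "model.blocks.{bid}.attn.attn_value",
    "ATTN_OUT": "model.blocks.{bid}.attn.attn_resid",
    "ATTN_Q_NORM": "model.blocks.{bid}.attn.head_gain.head_g",
    "FFN_UP": "model.blocks.{bid}.mlp.mlp_linear",
    "FFN_GATE": "model.blocks.{bid}.mlp.mlp_gate",
    "FFN_DOWN": "model.blocks.{bid}.mlp.mlp_resid",
    "LAYER_OUT_SCALE": "model.blocks.{bid}.embed_skip.a_g",
}


def build_lookup(n_blocks):
    """Expand every template for block ids 0..n_blocks-1.

    Returns a dict from checkpoint tensor name to (tensor kind, block id).
    """
    lookup = {}
    for kind, template in TALKIE_BLOCK_NAMES.items():
        for bid in range(n_blocks):
            lookup[template.format(bid=bid)] = (kind, bid)
    return lookup


lookup = build_lookup(n_blocks=4)
resolved = lookup["model.blocks.3.attn.attn_query"]
```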


@@ -37,7 +37,7 @@ packages = [
[tool.poetry.dependencies]
python = ">=3.10"
-[tool.poetry.dev-dependencies]
+[tool.poetry.group.dev.dependencies]
pytest = "^5.2"
[build-system]


@@ -6,13 +6,13 @@ version = "0.0.0"
dynamic = ["classifiers"]
readme = "README.md"
authors = [{name = "GGML", email = "ggml@ggml.ai"}]
-requires-python = '>=3.10'
+requires-python = '>=3.10,<3.15'
dependencies = [
-'numpy (>=1.25.0,<2.0.0)',
+'numpy (>=1.26.4,<3.0.0)',
'sentencepiece (>=0.1.98,<0.3.0)',
'transformers (==5.5.1)',
-'protobuf (>=4.21.0)',
-'torch (>=2.2.0,<3.0.0)',
+'protobuf (>=4.21.0,<5.0.0)',
+'torch (>=2.6.0,<3.0.0)',
'gguf @ ./gguf-py',
]
classifiers = [
@@ -32,17 +32,20 @@ llama-convert-llama-ggml-to-gguf = "convert_llama_ggml_to_gguf:main"
llama-ggml-vk-generate-shaders = "ggml_vk_generate_shaders:main"
[tool.poetry]
-packages = [{ include = "*.py", from = "." }]
+packages = [
+{ include = "*.py", from = "." },
+{ include = "conversion", from = "." },
+]
[tool.poetry.dependencies]
torch = [
-{ version = "~=2.6.0", source = "pypi", markers = "sys_platform == 'darwin'" },
-{ version = "~=2.6.0+cpu", source = "pytorch", markers = "sys_platform == 'linux'" },
-{ version = "~=2.6.0", source = "pypi", markers = "sys_platform == 'win32'" }
+{ version = "==2.11.0", source = "pypi", markers = "sys_platform == 'darwin'" },
+{ version = "==2.11.0+cpu", source = "pytorch", markers = "sys_platform == 'linux'" },
+{ version = "==2.11.0", source = "pypi", markers = "sys_platform == 'win32'" },
]
[tool.poetry.group.dev.dependencies]
-pytest = "^5.2"
+pytest = "~=8.3.3"
# Force wheel + cpu
# For discussion and context see https://github.com/python-poetry/poetry#6409


@@ -51,6 +51,9 @@ opbatch=
opqueue=
[ "$OQ" != "" ] && opqueue="GGML_HEXAGON_OPQUEUE=$OQ"
oppoll=
[ "$OP" != "" ] && oppoll="GGML_HEXAGON_OPPOLL=$OP"
opflt=
[ "$OF" != "" ] && opflt="GGML_HEXAGON_OPFILTER=$OF"
@@ -66,7 +69,7 @@ adb $adbserial $adbhost shell " \
cd $basedir; ulimit -c unlimited; \
LD_LIBRARY_PATH=$basedir/$branch/lib \
ADSP_LIBRARY_PATH=$basedir/$branch/lib \
-$verbose $sched $opmask $profile $nhvx $hmx $ndev $hb $opbatch $opqueue $opflt $vmem $mbuf \
+$verbose $sched $opmask $profile $nhvx $hmx $ndev $hb $opbatch $opqueue $oppoll $opflt $vmem $mbuf \
./$branch/bin/llama-completion --no-mmap -m $basedir/../gguf/$model \
--poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
--ctx-size 8192 --ubatch-size 1024 -fa on \


@@ -42,6 +42,15 @@ ndev=
hb=
[ "$HB" != "" ] && hb="GGML_HEXAGON_HOSTBUF=$HB"
opbatch=
[ "$OB" != "" ] && opbatch="GGML_HEXAGON_OPBATCH=$OB"
opqueue=
[ "$OQ" != "" ] && opqueue="GGML_HEXAGON_OPQUEUE=$OQ"
oppoll=
[ "$OP" != "" ] && oppoll="GGML_HEXAGON_OPPOLL=$OP"
set -x
tool=$1; shift
@@ -50,5 +59,5 @@ adb $adbserial $adbhost shell " \
cd $basedir; ulimit -c unlimited; \
LD_LIBRARY_PATH=$basedir/$branch/lib \
ADSP_LIBRARY_PATH=$basedir/$branch/lib \
-$verbose $sched $opmask $profile $nhvx $hmx $ndev $hb ./$branch/bin/$tool $@ \
+$verbose $sched $opmask $profile $nhvx $hmx $ndev $hb $opbatch $opqueue $oppoll ./$branch/bin/$tool $@ \
"


@@ -5,7 +5,7 @@ import os
import sys
import subprocess
-HTTPLIB_VERSION = "refs/tags/v0.45.1"
+HTTPLIB_VERSION = "refs/tags/v0.46.0"
vendor = {
"https://github.com/nlohmann/json/releases/latest/download/json.hpp": "vendor/nlohmann/json.hpp",


@@ -133,6 +133,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_LLAMA_EMBED, "llama-embed" },
{ LLM_ARCH_MAINCODER, "maincoder" },
{ LLM_ARCH_KIMI_LINEAR, "kimi-linear" },
{ LLM_ARCH_TALKIE, "talkie" },
{ LLM_ARCH_UNKNOWN, "(unknown)" },
};


@@ -137,6 +137,7 @@ enum llm_arch {
LLM_ARCH_LLAMA_EMBED,
LLM_ARCH_MAINCODER,
LLM_ARCH_KIMI_LINEAR,
LLM_ARCH_TALKIE,
LLM_ARCH_UNKNOWN,
};


@@ -44,6 +44,8 @@ static llama_model * llama_model_mapping(llm_arch arch, const llama_model_params
return new llama_model_llama_embed(params);
case LLM_ARCH_MAINCODER:
return new llama_model_maincoder(params);
case LLM_ARCH_TALKIE:
return new llama_model_talkie(params);
case LLM_ARCH_DECI:
return new llama_model_deci(params);
case LLM_ARCH_BAICHUAN:
@@ -2353,6 +2355,7 @@ llama_rope_type llama_model_rope_type(const llama_model * model) {
case LLM_ARCH_QWEN3NEXT:
case LLM_ARCH_MIMO2:
case LLM_ARCH_STEP35:
case LLM_ARCH_TALKIE:
return LLAMA_ROPE_TYPE_NEOX;
case LLM_ARCH_QWEN2VL:


@@ -488,7 +488,7 @@ struct llama_layer {
struct ggml_tensor * indexer_attn_k = nullptr;
struct ggml_tensor * indexer_attn_q_b = nullptr; // note: for lora a/b, not bias
-// gemma4 layer output scale
+// gemma4 layer output scale, reused for talkie embedding skip scale
struct ggml_tensor * out_scale = nullptr;
struct llama_layer_posnet posnet;


@@ -511,6 +511,14 @@ struct llm_tokenizer_bpe : llm_tokenizer {
};
byte_encode = false;
break;
case LLAMA_VOCAB_PRE_TYPE_MINICPM5:
regex_exprs = {
// original regex from tokenizer.json (openbmb/MiniCPM5-1B)
"\\p{N}{1,3}",
// "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}+| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}+| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
};
break;
default:
// default regex for BPE tokenization pre-processing
regex_exprs = {
@@ -2039,6 +2047,9 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
pre_type = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
} else if (tokenizer_pre == "default") {
pre_type = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
} else if (tokenizer_pre == "minicpm5") {
pre_type = LLAMA_VOCAB_PRE_TYPE_MINICPM5;
ignore_merges = true;
} else if (
tokenizer_pre == "llama3" ||
tokenizer_pre == "llama-v3" ||
@@ -2196,7 +2207,8 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
} else if (
tokenizer_pre == "gpt-4o" ||
tokenizer_pre == "llama4" ||
-tokenizer_pre == "kanana2") {
+tokenizer_pre == "kanana2" ||
+tokenizer_pre == "talkie") {
pre_type = LLAMA_VOCAB_PRE_TYPE_GPT4O;
clean_spaces = false;
} else if (
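
Review note: the MiniCPM5 hunk above keeps the original `(?i:...)` contraction pattern as a comment and uses an alternation with explicit upper/lower character classes instead. The sketch below compares the two forms in Python's `re` module on a handful of ASCII inputs. Two assumptions to flag: Python's regex engine is not the one llama.cpp uses, and `\d{1,3}` is used here as a stand-in for `\p{N}{1,3}` because the standard `re` module does not support `\p{N}`. The sample strings are made up for illustration.

```python
import re

# Original form from tokenizer.json (commented out in the hunk above).
CONTRACTION_FLAGGED = re.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)")

# Explicit-case form used in the added code.
CONTRACTION_EXPLICIT = re.compile(
    r"(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])"
)

# Stand-in for \p{N}{1,3}: \d covers decimal digits only (an approximation).
DIGIT_CHUNKS = re.compile(r"\d{1,3}")

# ASCII-only samples; non-ASCII case folding is deliberately not exercised.
samples = ["'s", "'S", "'Re", "'LL", "'d", "'vE", "'x", "re"]
agree = all(
    bool(CONTRACTION_FLAGGED.fullmatch(s)) == bool(CONTRACTION_EXPLICIT.fullmatch(s))
    for s in samples
)

# A run of digits is split greedily into groups of at most three.
chunks = DIGIT_CHUNKS.findall("1234567")
```

On these samples the two contraction patterns accept and reject the same strings, and the digit pattern splits `1234567` into `123`, `456`, `7`.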


@@ -60,6 +60,7 @@ enum llama_vocab_pre_type {
LLAMA_VOCAB_PRE_TYPE_JAIS2 = 49,
LLAMA_VOCAB_PRE_TYPE_GEMMA4 = 50,
LLAMA_VOCAB_PRE_TYPE_SARVAM_MOE = 51,
LLAMA_VOCAB_PRE_TYPE_MINICPM5 = 52,
};
struct LLM_KV;


@@ -177,9 +177,9 @@ llama_model_mistral3::graph::graph(const llama_model & model, const llm_graph_pa
cb(cur, "ffn_norm", il);
cur = build_ffn(cur,
-model.layers[il].ffn_up, model.layers[il].ffn_up_b, NULL,
-model.layers[il].ffn_gate, model.layers[il].ffn_gate_b, NULL,
-model.layers[il].ffn_down, model.layers[il].ffn_down_b, NULL,
+model.layers[il].ffn_up, model.layers[il].ffn_up_b, model.layers[il].ffn_up_s,
+model.layers[il].ffn_gate, model.layers[il].ffn_gate_b, model.layers[il].ffn_gate_s,
+model.layers[il].ffn_down, model.layers[il].ffn_down_b, model.layers[il].ffn_down_s,
NULL,
LLM_FFN_SILU, LLM_FFN_PAR, il);
cb(cur, "ffn_out", il);
@@ -200,7 +200,11 @@ llama_model_mistral3::graph::graph(const llama_model & model, const llm_graph_pa
LLM_FFN_SILU, true,
hparams.expert_weights_scale,
LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX,
-il);
+il,
+nullptr, nullptr,
+model.layers[il].ffn_up_exps_s,
+model.layers[il].ffn_gate_exps_s,
+model.layers[il].ffn_down_exps_s);
cb(cur, "ffn_moe_out", il);
}
cur = ggml_add(ctx0, cur, ffn_inp);


@@ -186,6 +186,19 @@ struct llama_model_maincoder : public llama_model_base {
};
struct llama_model_talkie : public llama_model_base {
llama_model_talkie(const struct llama_model_params & params) : llama_model_base(params) {}
void load_arch_hparams(llama_model_loader & ml) override;
void load_arch_tensors(llama_model_loader & ml) override;
struct graph : public llm_graph_context {
graph(const llama_model & model, const llm_graph_params & params);
};
std::unique_ptr<llm_graph_context> build_arch_graph(const llm_graph_params & params) const override;
};
struct llama_model_deci : public llama_model_base {
llama_model_deci(const struct llama_model_params & params) : llama_model_base(params) {}
void load_arch_hparams(llama_model_loader & ml) override;

Some files were not shown because too many files have changed in this diff.