Compare commits

...

73 Commits
b7307 ... b7380

Author SHA1 Message Date
Sigbjørn Skjæret
8e4d678528 common : skip model validation when --completion-bash is requested (#17975) 2025-12-13 08:40:50 +01:00
Jeff Bolz
07a10c1090 vulkan: Allow non-pow2 n_experts in topk_moe (#17872) 2025-12-13 08:40:04 +01:00
Sigbjørn Skjæret
2bc94e7928 add llama-completion to completion-bash executables (#17976) 2025-12-13 08:35:50 +01:00
Daniel Bevenius
fd1085ffb7 model-conversion : use CONVERTED_MODEL value for converted model [no ci] (#17984)
* model-conversion : use CONVERTED_MODEL value for converted model [no ci]

This commit updates the model verification scripts to use the
CONVERTED_MODEL environment variable instead of using the MODEL_PATH
(the original model path) as the basis for the converted model file
name.

The motivation for this is that currently, if the converted model file
name differs from the original model directory/name, the verification
scripts will look for the wrong .bin files that were generated when
running the models.
For example, the following steps were not possible:
```console
(venv) $ huggingface-cli download google/gemma-3-270m-it --local-dir ggml-org/gemma-3-270m
(venv) $ python3 convert_hf_to_gguf.py ggml-org/gemma-3-270m --outfile test-bf16.gguf --outtype bf16
(venv) $ cd examples/model-conversion/
(venv) $ export MODEL_PATH=../../ggml-org/gemma-3-270m
(venv) $ export CONVERTED_MODEL=../../test-bf16.gguf
(venv) $ make causal-verify-logits
...
Data saved to data/llamacpp-test-bf16.bin
Data saved to data/llamacpp-test-bf16.txt
Error: llama.cpp logits file not found: data/llamacpp-gemma-3-270m.bin
Please run scripts/run-converted-model.sh first to generate this file.
make: *** [Makefile:62: causal-verify-logits] Error 1
```

With the changes in this commit, the above steps will now work as
expected.
2025-12-13 08:34:26 +01:00
Xuan-Son Nguyen
380b4c984e common: support negated args (#17919)
* args: support negated args

* update docs

* fix typo

* add more neg options

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* rm duplicated arg

* fix LLAMA_ARG_NO_HOST

* add test

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-12 23:58:53 +01:00
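A minimal console sketch of the "no-" convention this commit introduces; the specific flag is chosen for illustration and assumed to participate in the new negation scheme:

```console
# per this commit, a boolean option gains a matching negated "no-" form
$ llama-server -m model.gguf --no-mmap   # explicitly disable memory mapping
```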
Xuan-Son Nguyen
e39a2ce66d clip: move model cgraphs into their own files (#17965)
* clip: move model cgraphs into their own files

* more explicit enums

* fix linux build

* fix naming

* missing headers

* nits: add comments for contributors
2025-12-12 21:14:48 +01:00
jiahao su
a8c7f33d79 ci : change the cann version and the container pull method (#17953)
fix error format

Update build.yml

Remove unnecessary zip files

fix

update
2025-12-12 20:43:00 +01:00
Sigbjørn Skjæret
b7f5f46e03 docker : include legacy llama-completion binary (#17964) 2025-12-12 19:39:23 +01:00
Johannes Gäßler
482211438d CUDA: fix overflow in MMA kernel without stream-k (#17939) 2025-12-12 17:43:58 +01:00
Georgi Gerganov
7bed317f53 models : fix the attn_factor for mistral3 graphs + improve consistency (#17945)
* models : fix the attn_factor for mistral3 graphs

* cont : rework attn_factor correction logic

* cont : make deepseek2 consistent

* cont : add TODO

* cont : special-case DSv2

* cont : revert Mistral 3 Large changes

* cont : fix DS2 to use the original attn_factor

* cont : minor comments
2025-12-12 17:12:40 +02:00
Sigbjørn Skjæret
dcb7d17758 cann : fix ops broken by circular padding guard (#17825) 2025-12-12 15:49:27 +01:00
ixgbe
51604435e8 ggml-cpu : fix RISC-V Q4_0 repack select and RVV feature reporting (#17951)
* ggml-cpu:fix RISC-V Q4_0 repack select and RVV feature reporting

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

* using the name VLEN instead of CNT

* Update ggml/include/ggml-cpu.h

---------

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-12 16:26:03 +02:00
Xuan-Son Nguyen
17158965ac mtmd: explicitly forbid inclusion of private header and libcommon (#17946) 2025-12-12 15:16:06 +01:00
Aleksander Grygier
12280ae905 webui: Fix parsing non-LaTeX occurrences of \( or \) (#17810)
* fix: Improve latex protection logic to prevent turning non-latex `\(` into `$`

* chore: update webui build output
2025-12-12 15:13:36 +01:00
Xuan-Son Nguyen
54a0fee4b7 arg: add -mm and -mmu as short form of --mmproj and --mmproj-url (#17958)
* arg: add -mm and -mmu as short form of --mmproj and --mmproj-url

* correct order

* update docs
2025-12-12 14:06:06 +01:00
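A hedged usage sketch of the new short forms (model and projector paths are placeholders):

```console
# -mm / -mmu are the new short forms of --mmproj / --mmproj-url
$ llama-mtmd-cli -m model.gguf -mm mmproj.gguf --image img.png -p "describe this"
```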
Daniel Bevenius
dada4c846d model-conversion : remove max diff check in compare-logits [no ci] (#17954)
This commit removes the maximum difference check from the
compare-logits.py which would stop early if the difference between
the logits exceeded a threshold.

The motivation for removing this is that it can be useful to be able to
get the complete log for debugging/reporting purposes.
2025-12-12 13:25:16 +01:00
Adrien Gallouët
b8ee22cfde common : add minimalist multi-thread progress bar (#17602)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-12-12 12:44:35 +01:00
Gustavo Rocha Dias
2eaa2c65cb cmake: link ws2_32 for MinGW/w64devkit builds in cpp-httplib (#17949) 2025-12-12 12:02:28 +01:00
yulo
c33a58bced HIP: enable mmf for RDNA3 (#17879)
* enable mmf for RDNA3

* disable mmf for some shape

* move some mmvf to mmf

* more mmvf to mmf

* 3 is good in mmvf

---------

Co-authored-by: zhang hui <you@example.com>
2025-12-12 11:34:33 +01:00
Pascal
a81a569577 Add a search field on model selector / improve mobile display (#17765)
* webui: add search field to model selector and fixes mobile viewport overflow

* webui: simplify model search style and code

* refactor: Search Input component & consistent UI for Models Selector search

* feat: Use Popover component + improve interactions

* fix: Fetching props for only loaded models in ROUTER mode

* webui: prevent models selector popover from overflowing viewport

Use Floating UI's auto-positioning with 50dvh height limit and proper
collision detection instead of forcing top positioning. Fixes overflow
on desktop and mobile keyboard issues

* webui: keep search field near trigger in models selector

Place search at the 'near end' (closest to trigger) by swapping layout
with CSS flexbox order based on popover direction. Prevents input from
moving during typing as list shrinks

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-11 18:21:21 +01:00
Piotr Wilkin (ilintar)
53ecd4fdb9 SOLVE_TRI extension to more dimensions (#17793)
* Extended TRI

* Fix whitespace

* chore: update webui build output

* Just use cuBLAS for everything...

* Merge both versions

* Remove incorrect imports causing failures for CI

* Still failing... remove all direct cublas imports and rely on common imports from "common.cuh"

* Defines for hipBlas

* Aaaand MUSA defines...

* I hate this job...

* Stupid typo...

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-11 17:20:43 +01:00
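For context, SOLVE_TRI is a triangular solve: given a lower-triangular $L$ and right-hand side $b$, it computes $x$ with $Lx = b$ by forward substitution:

$$ x_i = \frac{1}{L_{ii}} \left( b_i - \sum_{j<i} L_{ij}\, x_j \right) $$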
Georgi Gerganov
c6f6e4f96a ggml-alloc : fix reuse-parent logic for misaligned sizes (#17884) 2025-12-11 14:30:10 +02:00
Georgi Gerganov
d9f8f60618 batch : fix sequence id ownership (#17915)
* batch : fix sequence id ownership

* cont : reduce allocations
2025-12-11 14:29:47 +02:00
Yuichiro Utsumi
e4ae383317 docs: use port 8080 in Docker examples (#17903) 2025-12-11 17:12:07 +08:00
nullname
34ce48d97a ggml-hexagon: fix rope failure at test-backend-ops (#17565)
* fix test failure

* fix: correct scaling calculations in rope_cache_init

* fix: optimize element copying in rope_hex_f32 using memcpy

* fix: optimize loop boundaries in rope_hex_f32 for better performance

* feat: add profiling macros for performance measurement in operations
2025-12-10 14:45:43 -08:00
Sigbjørn Skjæret
45e350e3d3 ci: fix riscv64-native build (#17916) 2025-12-10 23:24:31 +01:00
Xuan-Son Nguyen
c6b2c9310c mtmd: some small clean up (#17909)
* clip: add support for fused qkv in build_vit

* use build_ffn whenever possible

* fix internvl

* mtmd-cli: move image to beginning

* test script: support custom args
2025-12-10 22:20:06 +01:00
Xuan-Son Nguyen
34a6d86982 cli: enable jinja by default (#17911)
* cli: enable jinja by default

* Update common/arg.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-10 22:19:42 +01:00
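With Jinja templating now the llama-cli default, opting out would use the negated form; the spelling below is an assumption based on the "no-" convention from #17919:

```console
$ llama-cli -m model.gguf             # uses the model's Jinja chat template
$ llama-cli -m model.gguf --no-jinja  # assumed negated form to opt out
```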
Pascal
f32ca51bfe server: add presets (config) when using multiple models (#17859)
* llama-server: recursive GGUF loading

Replace flat directory scan with recursive traversal using
std::filesystem::recursive_directory_iterator. Support for
nested vendor/model layouts (e.g. vendor/model/*.gguf).
Model name now reflects the relative path within --models-dir
instead of just the filename. Aggregate files by parent
directory via std::map before constructing local_model

* server : router config POC (INI-based per-model settings)

* server: address review feedback from @aldehir and @ngxson

PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>

* server: adopt aldehir's line-oriented PEG parser

Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule

Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs

Co-authored-by: aldehir <aldehir@users.noreply.github.com>

* server: fix CLI/env duplication in child processes

Children now receive minimal CLI args (executable, model, port, alias)
instead of inheriting all router args. Global settings pass through
LLAMA_ARG_* environment variables only, eliminating duplicate config
warnings.

Fixes: Router args like -ngl, -fa were passed both via CLI and env,
causing 'will be overwritten' warnings on every child spawn

* add common/preset.cpp

* fix compile

* cont

* allow custom-path models

* add falsey check

* server: fix router model discovery and child process spawning

- Sanitize model names: replace / and \ with _ for display
- Recursive directory scan with relative path storage
- Convert relative paths to absolute when spawning children
- Filter router control args from child processes
- Refresh args after port assignment for correct port value
- Fallback preset lookup for compatibility
- Fix missing argv[0]: store server binary path before base_args parsing

* Revert "server: fix router model discovery and child process spawning"

This reverts commit e3832b42ee.

* clarify about "no-" prefix

* correct render_args() to include binary path

* also remove arg LLAMA_ARG_MODELS_PRESET for child

* add co-author for ini parser code

Co-authored-by: aldehir <hello@alde.dev>

* also set LLAMA_ARG_HOST

* add CHILD_ADDR

* Remove dead code

---------

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: aldehir <hello@alde.dev>
2025-12-10 22:18:21 +01:00
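A hedged sketch of the config.ini layout described above (section name and values are assumptions; only LLAMA_ARG_* keys are accepted, section names must not start with LLAMA_ARG_, and end-of-line comments are supported):

```console
$ cat models/config.ini
[vendor/model]                  # explicit section required before key-value pairs
LLAMA_ARG_N_GPU_LAYERS = 99     # end-of-line comment
LLAMA_ARG_CTX_SIZE = 8192
```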
Max Krasnyansky
e1f4921980 Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes (#17748)
* tests: update barrier test to check for race condition in active threads

* cpu: combine n_graph and n_threads into a single atomic update

* tests: add multi-graph test for test_barrier
2025-12-10 12:32:23 -08:00
Georgi Gerganov
4dff236a52 ggml : remove GGML_KQ_MASK_PAD constant (#17910)
* ggml : remove GGML_KQ_MASK_PAD constant

* cont : remove comment
2025-12-10 20:53:16 +02:00
Sigbjørn Skjæret
4df6e859e9 cuda : add missing support check for xielu (#17895) 2025-12-10 16:16:20 +01:00
Xuan-Son Nguyen
6c2131773c cli: new CLI experience (#17824)
* wip

* wip

* fix logging, add display info

* handle commands

* add args

* wip

* move old cli to llama-completion

* rm deprecation notice

* move server to a shared library

* move ci to llama-completion

* add loading animation

* add --show-timings arg

* add /read command, improve LOG_ERR

* add args for speculative decoding, enable show timings by default

* add arg --image and --audio

* fix windows build

* support reasoning_content

* fix llama2c workflow

* color default is auto

* fix merge conflicts

* properly fix color problem

Co-authored-by: bandoti <bandoti@users.noreply.github.com>

* better loading spinner

* make sure to clean color on force-exit

* also clear input files on "/clear"

* simplify common_log_flush

* add warning in mtmd-cli

* implement console writer

* fix data race

* add attribute

* fix llama-completion and mtmd-cli

* add some notes about console::log

* fix compilation

---------

Co-authored-by: bandoti <bandoti@users.noreply.github.com>
2025-12-10 15:28:59 +01:00
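A hedged walkthrough of the new CLI surface, using only the flags and commands named in the bullets above (prompt markers are illustrative):

```console
$ llama-cli -m model.gguf --show-timings --image photo.png
> /read notes.txt   # inject a file into the conversation
> /clear            # reset the chat; also clears input files, per above
```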
Eric Zhang
b677721819 model : Qwen3-Next-80B-A3B has 48 layers (#17898)
* model : Qwen3-Next-80B-A3B has 48 layers

* model : Add 80B-A3B type name
2025-12-10 15:22:40 +01:00
lhez
2d2e1030e3 docs : update opencl ops (#17904) 2025-12-10 15:20:00 +01:00
Johannes Gäßler
17f7f4baad CUDA: fix unpadded strides in MMA FA kernel (#17891) 2025-12-10 12:39:56 +01:00
Xuan-Son Nguyen
9e79b0116e convert: allow using quantized Mistral weight (#17889)
* convert: allow using quantized Mistral weight

* data_torch.ndim

* update dequant fn

Co-authored-by: compilade <compilade@users.noreply.github.com>

---------

Co-authored-by: compilade <compilade@users.noreply.github.com>
2025-12-10 10:26:22 +01:00
Neo Zhang Jianyu
2e9eab80c2 fix softmax for iGPU (#17838) 2025-12-10 16:59:57 +08:00
Aldehir Rojas
2fbe3b7bb7 common : add parser for ministral/mistral large 3/devstral 2 (#17713) 2025-12-09 17:31:04 -06:00
Sigbjørn Skjæret
63391852b0 docs : update cpu and cuda ops (#17890)
* update cuda ops

* update CPU as well
2025-12-09 23:31:29 +01:00
Gabe Goodhart
086a63e3a5 metal: SSM kernel improvements (#17876)
* feat: Add a batched version of ssm_conv

This was done using Claude Code. It found a number of optimizations around
how the threads were organized, resulting in a huge performance boost!

Branch: Mamba2SSD

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Optimized SSM_SCAN kernel for metal

This used Claude Code and resulted in a modest performance improvement
while maintaining correctness.

Branch: Mamba2SSD

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Add test-backend-ops perf tests for SSM_CONV

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: Real representative tests for SSM_CONV

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use function constant for ssm_conv batch size

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* test: backend op tests for ssm_scan from granite4 1b-h

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: remove commented out templates

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: float4 version of ssm_conv_batched

Branch: SSMKernelImprovements

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing ggml_metal_cv_free

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-09 21:30:02 +02:00
Piotr Wilkin (ilintar)
b63509262a Add DIAG for CUDA (#17873)
* Add DIAG for CUDA

* Refactor parameters
2025-12-09 20:28:57 +01:00
Johannes Gäßler
48f47565a7 docs: clarify that CPU support should be first (#17886) 2025-12-09 20:10:36 +01:00
Gabe Goodhart
02e409a5be ggml : Provide macos-specific backtrace printing to avoid terminal death (#17869)
* fix: Provide macos-specific backtrace printing to avoid terminal death

Branch: MacOSSafeBacktrace

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add GGML_BACKTRACE_LLDB env var to enable using lldb for backtrace

Branch: MacOSSafeBacktrace

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-12-09 18:29:07 +02:00
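A minimal sketch of the opt-in from the second bullet (the debug scenario is illustrative):

```console
# GGML_BACKTRACE_LLDB enables lldb-based backtraces on macOS (per this commit)
$ GGML_BACKTRACE_LLDB=1 ./llama-cli -m model.gguf -p "hello"
```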
Georgi Gerganov
6b82eb7883 metal : print node names for debugging (#17882) 2025-12-09 15:25:49 +02:00
Sigbjørn Skjæret
86a3f0fad8 ggml : allow fill node alloc inplace (#17870) 2025-12-09 12:23:47 +01:00
Rhys-T
63908b631a cmake: fix Mach-O current version number (#17877)
PR #17091 set the VERSION of various libraries to 0.0.abcd, where abcd
is the LLAMA_BUILD_NUMBER. That build number is too large to fit in the
Mach-O 'current version' field's 'micro' part, which only goes up to
255. This just sets the Mach-O current version to 0 to get it building
properly again.

Fixes #17258.
2025-12-09 13:17:41 +02:00
Sigbjørn Skjæret
42b12b5608 model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652)
* nit, DeepSeek V1 MoE is 16B

* base type on n_ff_exp instead
2025-12-09 12:15:06 +01:00
Xuan-Son Nguyen
4e842d5120 console: allow using arrow left/right, home/end keys and history mode (#17836)
* console: allow using arrow left/right to edit the line (with UTF-8 support)

* console: fix arrow keys on Windows using private-use Unicode

* console: add Home/End key support for Windows and Linux

* console: add basic Up/Down history navigation

* fix build

* console: allow using arrow left/right to edit the line (with UTF-8 support)

* console: fix arrow keys on Windows using private-use Unicode

* console: add Home/End key support for Windows and Linux

* console: add basic Up/Down history navigation

* console: remove unreachable wc == 0 check after VK switch

* console: add Ctrl+Left/Right word navigation

- Add KEY_CTRL_ARROW_LEFT and KEY_CTRL_ARROW_RIGHT codes
- Windows: detect CTRL modifier via dwControlKeyState
- Linux: parse ANSI sequences with modifier (1;5D/C)
- Implement move_word_left/right with space-skipping logic
- Refactor escape sequence parsing to accumulate params

* console: add Delete key support

- Windows: VK_DELETE detection
- Linux: ESC[3~ sequence parsing
- Forward character deletion with UTF-8 support

* console: implement bash-style history editing

- Edit any history line during UP/DOWN navigation, edits persist
- Pressing Enter appends edited version as new history entry
- Original lines stay untouched in their positions

* clean up

* better history impl

* fix decode_utf8

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2025-12-09 11:53:59 +01:00
Chenguang Li
ca709e427b CANN: add support for partial RoPE and Vision mode (#17543)
* cann: add support for partial RoPE and Vision mode

Add support for two important RoPE variants: partial rotation (rope_dims < ne0)
and Vision mode rotation.

1. Support for partial RoPE (rope_dims < ne0):
   - Split tensor into head (first rope_dims dimensions) and tail portions
   - Apply rotation only to head portion using RotaryPositionEmbedding operator
   - Copy unrotated tail portion directly from source to destination
   - Handle both contiguous and non-contiguous tensor layouts

2. Support for Vision mode (GGML_ROPE_TYPE_VISION):
   - Set rope_dims = ne0 for Vision mode to rotate entire tensor
   - Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
   - No tail handling needed since entire tensor is rotated

Implementation details:
   - Use has_tail flag to determine execution path: head/tail splitting when
     rope_dims < ne0, or full tensor rotation when rope_dims == ne0
   - Support both F32 and F16 data types with intermediate F32 conversion
   - Copy non-contiguous tensors to contiguous buffers before calling
     RotaryPositionEmbedding operator for compatibility
   - Improve cache invalidation logic to include rope_dims and indep_sects
     parameters

These enhancements enable CANN backend to handle various RoPE configurations
used in modern vision-language models and models with partial rotation.

* cann: fix review comment
2025-12-09 17:53:23 +08:00
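In math terms, the Vision-mode pairing described above rotates dimension $i$ together with dimension $i + n_\text{dims}$, where $n_\text{dims} = ne0/2$:

$$
\begin{pmatrix} x'_i \\ x'_{i+n_\text{dims}} \end{pmatrix}
=
\begin{pmatrix} \cos\theta_i & -\sin\theta_i \\ \sin\theta_i & \cos\theta_i \end{pmatrix}
\begin{pmatrix} x_i \\ x_{i+n_\text{dims}} \end{pmatrix}
$$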
Johannes Gäßler
0cdce38a97 CUDA: fix FP16 overflow in tile FA kernel (#17875) 2025-12-09 09:34:02 +01:00
Aldehir Rojas
e39502e74b llama : add token matching support to llama-grammar (#17816)
* llama : add token support to llama-grammar

* fix inverse token comment

* refactor trigger_patterns to replay tokens instead of the entire string

* add token documentation

* fix test-llama-grammar

* improve test cases for tokens
2025-12-09 00:32:57 -06:00
philip-essential
1d2a1ab73d model : support Rnj-1 (#17811)
* add support for rnj1

* refactor gemma3 to support rnj-1

* address review comments
2025-12-09 04:49:03 +01:00
Sigbjørn Skjæret
c8554b66e0 graph : use fill instead of scale_bias in grouped expert selection (#17867)
* use fill instead of scale_bias in grouped expert selection

* do not explicitly use _inplace
2025-12-08 21:29:59 +01:00
Daniel Bevenius
2fa51c19b0 model-conversion : add token ids to prompt token output [no ci] (#17863)
This commit adds the token ids to the printed prompt outputs.

The motivation for this is that it can be useful to see the actual
token ids alongside the token strings for debugging.
2025-12-08 17:13:08 +01:00
Xuan-Son Nguyen
951520ddb0 server: delegate result_state creation to server_task (#17835)
* server: delegate result_state creation to server_task

* remove unued states

* add more docs
2025-12-08 17:04:38 +01:00
Neo Zhang
68522c678d ci : support bfloat16 SYCL release package (#17855)
* support bfloat16 release package

* add fallback file
2025-12-08 15:09:39 +01:00
Xuan-Son Nguyen
f896d2c34f server: improve speed of speculative decoding (#17808)
* server: improve speed of speculative decoding

* fix small draft case

* add link to the PR

* server : fix generation time measurement

* server : fix draft acceptance logs (add SRV_CNT, SLT_CNT macros)

* server : add comment

* add PR to docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:35:28 +01:00
Piotr Wilkin (ilintar)
e4e9c4329c Make graph_max_nodes vary by ubatch size (#17794)
* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph

* Update src/llama-context.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Add missing const

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-08 14:32:41 +01:00
hksdpc255
636fc17a37 Fix Kimi-K2 tool-call parsing issues (#17376)
* Fix kimi-k2 parsing

* fix template & add more tests for kimi-k2

* Another fix for Kimi-K2 chat template.

* enable allow_toolcall_in_think for Kimi-K2

* Refine key-value separator and value end format

* Enable tool call in think for kimi-k2

* allow_toolcall_in_think is now tested with Kimi-K2

* Remove outdated TODO comment in XML tool call parser

Removed TODO comment about untested tool call feature.

* Rename function from "utf8_truncate_safe" to "utf8_truncate_safe_len"
2025-12-08 14:32:04 +01:00
Jay Zenith
51e0c2d917 cuda : add FILL op support (#17851)
* cuda : add FILL op support

* cuda : add missing FILL op files
2025-12-08 21:10:12 +08:00
Xuan-Son Nguyen
37a4f63244 server : add development documentation (#17760)
* first draft

* rewrite

* update & remove duplicated sections
2025-12-08 13:54:58 +01:00
Georgi Gerganov
2bc96931d2 server : make cache_reuse configurable per request (#17858) 2025-12-08 12:43:12 +02:00
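A hedged request sketch; the JSON field name is an assumption inferred from the commit title and the existing --cache-reuse server option:

```console
# assumed per-request override of the server's cache_reuse setting
$ curl http://localhost:8080/completion -d '{"prompt": "Hello", "cache_reuse": 256}'
```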
wsbagnsv1
5814b4dce1 cuda: optimize SOLVE_TRI using registers and FMAF (#17703)
* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling

- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.

* Small cleanup

* Remove comments in solve_tri.cu

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/solve_tri.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Use const for variables in solve_tri.cu

* Replace fmaf with more readable code

* remove last fmaf

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-08 10:41:08 +01:00
ixgbe
79d61896d3 ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support (#17784)
* ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

* cmake: enable RISC-V zihintpause extension for Spacemit builds

* readme : add ZIHINTPAUSE support for RISC-V

---------

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-12-08 10:41:34 +02:00
Xuan-Son Nguyen
4d3726278b model: add llama 4 scaling for mistral-large (deepseek arch) (#17744)
2025-12-07 22:29:54 +01:00
lovedheart
08f9d3cc1d Vulkan: improve mul_mat_vec_iq1_m (#16907)
* Optimize Vulkan shader for matrix-vector multiplication

* Revert changes on compute_outputs and main

Refactor compute_outputs to handle remaining rows correctly.

* Fix trailing whitespace
2025-12-07 18:40:42 +01:00
Sigbjørn Skjæret
0a540f9abd ci : add windows-cuda 13.1 release (#17839) 2025-12-07 14:02:04 +01:00
Sigbjørn Skjæret
22577583a3 common : change --color to accept on/off/auto, default to auto (#17827) 2025-12-07 03:43:50 +01:00
Law Po Ying
d9e03db1e7 sycl: add missing BF16 conversion support for Intel oneAPI (#17780)
* sycl: add missing BF16 conversion support for Intel oneAPI

* Fix Line 645: Trailing whitespace
2025-12-07 09:18:18 +08:00
Jeff Bolz
db97837385 vulkan: perf_logger improvements (#17672)
* vulkan: perf_logger improvements

- Move perf_logger from device to ctx.
- Add an env var to control the frequency we dump the stats. If you set a very
large value, it just dumps when the ctx is destroyed.
- Add a fusion info string to the tracking, only log one item per fused op.
- Fix MUL_MAT_ID flops calculation.

* fix vector sizes
2025-12-06 18:46:46 +01:00
Vishal Singh
017761daf5 ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690)
* ggml-zendnn: add ZenDNN backend support

* ggml-zendnn : address ZenDNN backend review fixes and suggestions

* docs : apply blockquote syntax to ZenDNN docs

---------

Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>
2025-12-07 00:13:33 +08:00
Xuan-Son Nguyen
c42712b056 server: support multiple generations from one prompt (OAI "n" option) (#17775)
* backend support

* server: support multiple generations from one prompt (OAI "n" option)

* fix invalid batch

* format oai

* clean up

* disable ctx shift

* add test

* update comments

* fix style

* add n_cmpl to docs [no ci]

* allow using both n_cmpl and n
2025-12-06 15:54:38 +01:00
187 changed files with 44290 additions and 8513 deletions

View File

@@ -4,7 +4,7 @@
# Define the CANN base image for easier version updates later
ARG CHIP_TYPE=910b
ARG CANN_BASE_IMAGE=quay.io/ascend/cann:8.3.rc1.alpha001-${CHIP_TYPE}-openeuler22.03-py3.11
ARG CANN_BASE_IMAGE=quay.io/ascend/cann:8.3.rc2-${CHIP_TYPE}-openeuler24.03-py3.11
# ==============================================================================
# BUILD STAGE
@@ -111,7 +111,7 @@ ENTRYPOINT ["/app/tools.sh"]
# ==============================================================================
FROM base AS light
COPY --from=build /app/full/llama-cli /app
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
ENTRYPOINT [ "/app/llama-cli" ]

View File

@@ -68,7 +68,7 @@ ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
WORKDIR /app

View File

@@ -74,7 +74,7 @@ ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
WORKDIR /app

View File

@@ -73,7 +73,7 @@ ENTRYPOINT ["/app/tools.sh"]
FROM base AS light
COPY --from=build /app/lib/ /app
COPY --from=build /app/full/llama-cli /app
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
WORKDIR /app

View File

@@ -81,7 +81,7 @@ ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
WORKDIR /app

View File

@@ -94,7 +94,7 @@ ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
WORKDIR /app

View File

@@ -105,7 +105,7 @@ WORKDIR /llama.cpp/bin
# Copy llama.cpp binaries and libraries
COPY --from=collector /llama.cpp/bin/*.so /llama.cpp/bin
COPY --from=collector /llama.cpp/bin/llama-cli /llama.cpp/bin
COPY --from=collector /llama.cpp/bin/llama-cli /llama.cpp/bin/llama-completion /llama.cpp/bin
ENTRYPOINT [ "/llama.cpp/bin/llama-cli" ]

View File

@@ -13,6 +13,8 @@ elif [[ "$arg1" == '--quantize' || "$arg1" == '-q' ]]; then
exec ./llama-quantize "$@"
elif [[ "$arg1" == '--run' || "$arg1" == '-r' ]]; then
exec ./llama-cli "$@"
elif [[ "$arg1" == '--run-legacy' || "$arg1" == '-l' ]]; then
exec ./llama-completion "$@"
elif [[ "$arg1" == '--bench' || "$arg1" == '-b' ]]; then
exec ./llama-bench "$@"
elif [[ "$arg1" == '--perplexity' || "$arg1" == '-p' ]]; then
@@ -32,8 +34,10 @@ elif [[ "$arg1" == '--server' || "$arg1" == '-s' ]]; then
else
echo "Unknown command: $arg1"
echo "Available commands: "
echo " --run (-r): Run a model previously converted into ggml"
echo " ex: -m /models/7B/ggml-model-q4_0.bin -p \"Building a website can be done in 10 simple steps:\" -n 512"
echo " --run (-r): Run a model (chat) previously converted into ggml"
echo " ex: -m /models/7B/ggml-model-q4_0.bin"
echo " --run-legacy (-l): Run a model (legacy completion) previously converted into ggml"
echo " ex: -m /models/7B/ggml-model-q4_0.bin -no-cnv -p \"Building a website can be done in 10 simple steps:\" -n 512"
echo " --bench (-b): Benchmark the performance of the inference for various parameters."
echo " ex: -m model.gguf"
echo " --perplexity (-p): Measure the perplexity of a model over a given text."

View File

@@ -68,7 +68,7 @@ ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
WORKDIR /app

View File

@@ -65,3 +65,34 @@ runs:
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\libnvvp" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
echo "CUDA_PATH_V12_4=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
- name: Install Cuda Toolkit 13.1
if: ${{ inputs.cuda_version == '13.1' }}
shell: pwsh
run: |
mkdir -p "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1"
choco install unzip -y
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_crt/windows-x86_64/cuda_crt-windows-x86_64-13.1.80-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_cudart/windows-x86_64/cuda_cudart-windows-x86_64-13.1.80-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvcc/windows-x86_64/cuda_nvcc-windows-x86_64-13.1.80-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvrtc/windows-x86_64/cuda_nvrtc-windows-x86_64-13.1.80-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/libcublas/windows-x86_64/libcublas-windows-x86_64-13.2.0.9-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/libnvvm/windows-x86_64/libnvvm-windows-x86_64-13.1.80-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvtx/windows-x86_64/cuda_nvtx-windows-x86_64-13.1.68-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_profiler_api/windows-x86_64/cuda_profiler_api-windows-x86_64-13.1.80-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/visual_studio_integration/windows-x86_64/visual_studio_integration-windows-x86_64-13.1.68-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_cccl/windows-x86_64/cuda_cccl-windows-x86_64-13.1.78-archive.zip"
unzip '*.zip' -d "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1"
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\cuda_crt-windows-x86_64-13.1.80-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\cuda_cudart-windows-x86_64-13.1.80-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\cuda_nvcc-windows-x86_64-13.1.80-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\cuda_nvrtc-windows-x86_64-13.1.80-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\libcublas-windows-x86_64-13.2.0.9-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\libnvvm-windows-x86_64-13.1.80-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\cuda_nvtx-windows-x86_64-13.1.68-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\cuda_profiler_api-windows-x86_64-13.1.80-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\visual_studio_integration-windows-x86_64-13.1.68-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\cuda_cccl-windows-x86_64-13.1.78-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" /E /I /H /Y
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
echo "CUDA_PATH_V13_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8

View File

@@ -291,6 +291,7 @@ jobs:
-DGGML_RVV=ON \
-DGGML_RV_ZFH=ON \
-DGGML_RV_ZICBOP=ON \
-DGGML_RV_ZIHINTPAUSE=ON \
-DRISCV64_SPACEMIT_IME_SPEC=RISCV64_SPACEMIT_IME1 \
-DCMAKE_TOOLCHAIN_FILE=${PWD}/cmake/riscv64-spacemit-linux-gnu-gcc.cmake

View File

@@ -243,7 +243,7 @@ jobs:
echo "Fetch llama2c model"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
./bin/llama-cli -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
- name: Test llama2c (s390x)
id: llama2c_test_s390x
@@ -252,7 +252,7 @@ jobs:
cd build
echo "Fetch llama2c big-endian model"
wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories260K-be.gguf
./bin/llama-cli -m stories260K-be.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
./bin/llama-completion -m stories260K-be.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
ubuntu-latest-cmake-sanitizer:
runs-on: ubuntu-latest
@@ -1400,25 +1400,54 @@ jobs:
chip_type: ['910b', '310p']
build: ['Release']
runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
container: ascendai/cann:${{ matrix.chip_type == '910b' && '8.3.rc1.alpha001-910b-openeuler22.03-py3.11' || '8.2.rc1-310p-openeuler22.03-py3.11' }}
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Dependencies
- name: Free up disk space
uses: ggml-org/free-disk-space@v1.3.1
with:
tool-cache: true
- name: Set container image
id: cann-image
run: |
yum update -y
yum install -y git gcc gcc-c++ make cmake libcurl-devel
image="ascendai/cann:${{ matrix.chip_type == '910b' && '8.3.rc2-910b-openeuler24.03-py3.11' || '8.3.rc2-310p-openeuler24.03-py3.11' }}"
echo "image=${image}" >> "${GITHUB_OUTPUT}"
- name: Pull container image
run: docker pull "${{ steps.cann-image.outputs.image }}"
- name: Build
env:
BUILD_TYPE: ${{ matrix.build }}
SOC_TYPE: ascend${{ matrix.chip_type }}
run: |
export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
HOST_UID=$(id -u)
HOST_GID=$(id -g)
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=${{ matrix.build }} \
-DGGML_CANN=on \
-DSOC_TYPE=ascend${{ matrix.chip_type }}
cmake --build build -j $(nproc)
docker run --rm \
-v "${PWD}:/workspace" \
-w /workspace \
-e SOC_TYPE=${SOC_TYPE} \
-e BUILD_TYPE=${BUILD_TYPE} \
"${{ steps.cann-image.outputs.image }}" \
bash -lc '
set -e
yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake libcurl-devel
yum clean all && rm -rf /var/cache/yum
git config --global --add safe.directory "/workspace"
export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
-DGGML_CANN=on \
-DSOC_TYPE=${SOC_TYPE}
cmake --build build -j $(nproc)
chown -R '"${HOST_UID}"':'"${HOST_GID}"' /workspace/build
'
# TODO: simplify the following workflows using a matrix
# TODO: run lighter CI on PRs and the full CI only on master (if needed)
@@ -1770,7 +1799,7 @@ jobs:
echo "Fetch llama2c model"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
./bin/llama-cli -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
ubuntu-cmake-sanitizer-riscv64-native:
runs-on: RISCV64

View File

@@ -421,7 +421,7 @@ jobs:
strategy:
matrix:
cuda: ['12.4']
cuda: ['12.4', '13.1']
steps:
- name: Clone
@@ -476,6 +476,7 @@ jobs:
$dst='.\build\bin\cudart\'
robocopy "${{env.CUDA_PATH}}\bin" $dst cudart64_*.dll cublas64_*.dll cublasLt64_*.dll
robocopy "${{env.CUDA_PATH}}\lib" $dst cudart64_*.dll cublas64_*.dll cublasLt64_*.dll
robocopy "${{env.CUDA_PATH}}\bin\x64" $dst cudart64_*.dll cublas64_*.dll cublasLt64_*.dll
7z a cudart-llama-bin-win-cuda-${{ matrix.cuda }}-x64.zip $dst\*
- name: Upload Cuda runtime
@@ -545,6 +546,8 @@ jobs:
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libmmd.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libiomp5md.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/sycl-ls.exe" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libsycl-fallback-bfloat16.spv" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/compiler/latest/bin/libsycl-native-bfloat16.spv" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/dnnl/latest/bin/dnnl.dll" ./build/bin
cp "${{ env.ONEAPI_ROOT }}/tbb/latest/bin/tbb12.dll" ./build/bin
@@ -728,6 +731,78 @@ jobs:
path: llama-${{ steps.tag.outputs.name }}-xcframework.tar.gz
name: llama-${{ steps.tag.outputs.name }}-xcframework.tar.gz
openEuler-cann:
strategy:
matrix:
arch: [x86, aarch64]
chip_type: ['910b', '310p']
build: ['Release']
runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
steps:
- name: Checkout
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Free up disk space
uses: ggml-org/free-disk-space@v1.3.1
with:
tool-cache: true
- name: Set container image
id: cann-image
run: |
image="ascendai/cann:${{ matrix.chip_type == '910b' && '8.3.rc2-910b-openeuler24.03-py3.11' || '8.3.rc2-310p-openeuler24.03-py3.11' }}"
echo "image=${image}" >> "${GITHUB_OUTPUT}"
- name: Pull container image
run: docker pull "${{ steps.cann-image.outputs.image }}"
- name: Build
env:
BUILD_TYPE: ${{ matrix.build }}
SOC_TYPE: ascend${{ matrix.chip_type }}
run: |
HOST_UID=$(id -u)
HOST_GID=$(id -g)
docker run --rm \
-v "${PWD}:/workspace" \
-w /workspace \
-e SOC_TYPE=${SOC_TYPE} \
-e BUILD_TYPE=${BUILD_TYPE} \
"${{ steps.cann-image.outputs.image }}" \
bash -lc '
set -e
yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake libcurl-devel
yum clean all && rm -rf /var/cache/yum
git config --global --add safe.directory "/workspace"
export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
-DGGML_CANN=on \
-DSOC_TYPE=${SOC_TYPE}
cmake --build build -j $(nproc)
chown -R '"${HOST_UID}"':'"${HOST_GID}"' /workspace/build
'
- name: Determine tag name
id: tag
uses: ./.github/actions/get-tag-name
- name: Pack artifacts
run: |
cp LICENSE ./build/bin/
tar -czvf llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.tar.gz --transform "s,./,llama-${{ steps.tag.outputs.name }}/," -C ./build/bin .
- name: Upload artifacts (tar)
uses: actions/upload-artifact@v4
with:
path: llama-${{ steps.tag.outputs.name }}-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.tar.gz
name: llama-bin-${{ matrix.chip_type }}-openEuler-${{ matrix.arch }}.tar.gz
release:
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
@@ -749,6 +824,7 @@ jobs:
- macOS-arm64
- macOS-x64
- ios-xcode-build
- openEuler-cann
steps:
- name: Clone
@@ -835,11 +911,18 @@ jobs:
**Windows:**
- [Windows x64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cpu-x64.zip)
- [Windows arm64 (CPU)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cpu-arm64.zip)
- [Windows x64 (CUDA)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-12.4-x64.zip)
- [Windows x64 (CUDA 12)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-12.4-x64.zip)
- [Windows x64 (CUDA 13)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-cuda-13.1-x64.zip)
- [Windows x64 (Vulkan)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-vulkan-x64.zip)
- [Windows x64 (SYCL)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-sycl-x64.zip)
- [Windows x64 (HIP)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-win-hip-radeon-x64.zip)
**openEuler:**
- [openEuler x86 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-310p-openEuler-x86.tar.gz)
- [openEuler x86 (910b)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-910b-openEuler-x86.tar.gz)
- [openEuler aarch64 (310p)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-310p-openEuler-aarch64.tar.gz)
- [openEuler aarch64 (910b)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-910b-openEuler-aarch64.tar.gz)
- name: Upload release
id: upload_release
uses: actions/github-script@v3

View File

@@ -15,6 +15,7 @@ The project differentiates between 3 levels of contributors:
- If you modified the `ggml` source, run the `test-backend-ops` tool to check whether different backend implementations of the `ggml` operators produce consistent results (this requires access to at least two different `ggml` backends)
- If you modified a `ggml` operator or added a new one, add the corresponding test cases to `test-backend-ops`
- Create separate PRs for each feature or fix. Avoid combining unrelated changes in a single PR
- When adding support for a new model or feature, focus on **CPU support only** in the initial PR unless you have a good reason not to. Add support for other backends like CUDA in follow-up PRs
- Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
- If your PR becomes stale, rebase it on top of the latest `master` to get the maintainers' attention
- Maintainers will rely on your insights and approval when making a final decision to approve and merge a PR

View File

@@ -61,7 +61,7 @@ range of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- RVV, ZVFH, ZFH and ZICBOP support for RISC-V architectures
- RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
@@ -276,6 +276,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
| [MUSA](docs/build.md#musa) | Moore Threads GPU |
| [CUDA](docs/build.md#cuda) | Nvidia GPU |
| [HIP](docs/build.md#hip) | AMD GPU |
| [ZenDNN](docs/build.md#zendnn) | AMD CPU |
| [Vulkan](docs/build.md#vulkan) | GPU |
| [CANN](docs/build.md#cann) | Ascend NPU |
| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
@@ -346,19 +347,6 @@ To learn more about model quantization, [read this documentation](tools/quantize
</details>
- <details>
<summary>Run simple text completion</summary>
To disable conversation mode explicitly, use `-no-cnv`
```bash
llama-cli -m model.gguf -p "I believe the meaning of life is" -n 128 -no-cnv
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
```
</details>
- <details>
<summary>Constrain the output with a custom grammar</summary>

View File

@@ -398,18 +398,18 @@ function gg_run_qwen3_0_6b {
./bin/llama-quantize ${model_bf16} ${model_q5_k} q5_k $(nproc)
./bin/llama-quantize ${model_bf16} ${model_q6_k} q6_k $(nproc)
(time ./bin/llama-cli -no-cnv --model ${model_f16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-cli -no-cnv --model ${model_bf16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-bf16.log
(time ./bin/llama-cli -no-cnv --model ${model_q8_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q2_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q3_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q6_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-completion -no-cnv --model ${model_f16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-completion -no-cnv --model ${model_bf16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-bf16.log
(time ./bin/llama-completion -no-cnv --model ${model_q8_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-completion -no-cnv --model ${model_q4_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-completion -no-cnv --model ${model_q4_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-completion -no-cnv --model ${model_q5_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-completion -no-cnv --model ${model_q5_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-completion -no-cnv --model ${model_q2_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-completion -no-cnv --model ${model_q3_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-completion -no-cnv --model ${model_q4_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-completion -no-cnv --model ${model_q5_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-completion -no-cnv --model ${model_q6_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is" ) 2>&1 | tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-perplexity --model ${model_f16} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2 ) 2>&1 | tee -a $OUT/${ci}-tg-f16.log
if [ -z ${GG_BUILD_NO_BF16} ]; then

View File

@@ -73,6 +73,8 @@ add_library(${TARGET} STATIC
ngram-cache.h
peg-parser.cpp
peg-parser.h
preset.cpp
preset.h
regex-partial.cpp
regex-partial.h
sampling.cpp

File diff suppressed because it is too large

View File

@@ -3,8 +3,10 @@
#include "common.h"
#include <set>
#include <map>
#include <string>
#include <vector>
#include <cstring>
//
// CLI argument parsing
@@ -14,6 +16,7 @@ struct common_arg {
std::set<enum llama_example> examples = {LLAMA_EXAMPLE_COMMON};
std::set<enum llama_example> excludes = {};
std::vector<const char *> args;
std::vector<const char *> args_neg; // for negated args like --no-xxx
const char * value_hint = nullptr; // help text or example for arg value
const char * value_hint_2 = nullptr; // for second arg value
const char * env = nullptr;
@@ -23,6 +26,9 @@ struct common_arg {
void (*handler_string) (common_params & params, const std::string &) = nullptr;
void (*handler_str_str)(common_params & params, const std::string &, const std::string &) = nullptr;
void (*handler_int) (common_params & params, int) = nullptr;
void (*handler_bool) (common_params & params, bool) = nullptr;
common_arg() = default;
common_arg(
const std::initializer_list<const char *> & args,
@@ -44,6 +50,13 @@ struct common_arg {
void (*handler)(common_params & params)
) : args(args), help(help), handler_void(handler) {}
common_arg(
const std::initializer_list<const char *> & args,
const std::initializer_list<const char *> & args_neg,
const std::string & help,
void (*handler)(common_params & params, bool)
) : args(args), args_neg(args_neg), help(help), handler_bool(handler) {}
// support 2 values for arg
common_arg(
const std::initializer_list<const char *> & args,
@@ -61,9 +74,33 @@ struct common_arg {
bool is_exclude(enum llama_example ex);
bool get_value_from_env(std::string & output) const;
bool has_value_from_env() const;
std::string to_string();
std::string to_string() const;
// for using as key in std::map
bool operator<(const common_arg& other) const {
if (args.empty() || other.args.empty()) {
return false;
}
return strcmp(args[0], other.args[0]) < 0;
}
bool operator==(const common_arg& other) const {
if (args.empty() || other.args.empty()) {
return false;
}
return strcmp(args[0], other.args[0]) == 0;
}
// get all args and env vars (including negated args/env)
std::vector<std::string> get_args() const;
std::vector<std::string> get_env() const;
};
namespace common_arg_utils {
bool is_truthy(const std::string & value);
bool is_falsey(const std::string & value);
bool is_autoy(const std::string & value);
}
struct common_params_context {
enum llama_example ex = LLAMA_EXAMPLE_COMMON;
common_params & params;
@@ -76,7 +113,11 @@ struct common_params_context {
// if one argument has invalid value, it will automatically display usage of the specific argument (and not the full usage message)
bool common_params_parse(int argc, char ** argv, common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);
// function to be used by test-arg-parser
// parse input arguments from CLI into a map
// TODO: support repeated args in the future
bool common_params_parse(int argc, char ** argv, llama_example ex, std::map<common_arg, std::string> & out_map);
// initialize argument parser context - used by test-arg-parser and preset
common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);
struct common_remote_params {

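Read together, the new `args_neg` field, the `handler_bool` member, and the added constructor imply a registration pattern along the following lines. This is a minimal sketch: the `--check-tensors` flag and the `params.check_tensors` field are assumed purely for illustration, not taken from the actual wiring.

```cpp
// Sketch only (depends on common/arg.h shown above): a boolean option with a
// negated alias, using the new (args, args_neg, help, bool-handler) constructor.
common_arg(
    {"--check-tensors"},      // positive forms
    {"--no-check-tensors"},   // negated forms, stored in args_neg
    "validate tensor data while loading the model",
    [](common_params & params, bool value) {
        // handler_bool presumably receives true for the positive form
        // and false for the negated one
        params.check_tensors = value;
    }
);
```

The `common_arg_utils::is_truthy`/`is_falsey`/`is_autoy` helpers then give string values (for example from environment variables) a consistent route into the same boolean handler.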
View File

@@ -724,16 +724,10 @@ inline void parse_msg_with_xml_tool_calls(common_chat_msg_parser & builder, cons
if (reasoning_unclosed) {
if (auto pos = content.find(end_think); pos == std::string::npos && builder.pos() != builder.input().size()) {
unclosed_reasoning_content += content;
if (form.allow_toolcall_in_think) {
builder.move_to(tc->groups[0].begin);
if (!builder.try_consume_xml_tool_calls(form)) {
unclosed_reasoning_content += tool_call_start;
builder.move_to(tc->groups[0].end);
}
} else {
if (!(form.allow_toolcall_in_think && tc)) {
unclosed_reasoning_content += tool_call_start;
continue;
}
continue;
} else {
reasoning_unclosed = false;
std::string reasoning_content;
@@ -781,8 +775,12 @@ inline void parse_msg_with_xml_tool_calls(common_chat_msg_parser & builder, cons
}
} else {
// This <tool_call> start is in thinking block, skip this tool call
auto pos = think_start + start_think.size();
unclosed_reasoning_content = content.substr(pos) + tool_call_start;
// This <tool_call> start is in thinking block
if (form.allow_toolcall_in_think) {
unclosed_reasoning_content = content.substr(think_start + start_think.size());
} else {
unclosed_reasoning_content = content.substr(think_start + start_think.size()) + tool_call_start;
}
reasoning_unclosed = true;
content.resize(think_start);
toolcall_in_think = true;
@@ -805,14 +803,35 @@ inline void parse_msg_with_xml_tool_calls(common_chat_msg_parser & builder, cons
}
// remove potential partial suffix
if (content.size() > 0 && builder.pos() == builder.input().size() && unclosed_reasoning_content.empty()) {
rstrip(content);
trim_potential_partial_word(content);
rstrip(content);
if (builder.pos() == builder.input().size()) {
if (unclosed_reasoning_content.empty()) {
rstrip(content);
trim_potential_partial_word(content);
rstrip(content);
} else {
rstrip(unclosed_reasoning_content);
trim_potential_partial_word(unclosed_reasoning_content);
rstrip(unclosed_reasoning_content);
}
}
// consume unclosed_reasoning_content if allow_toolcall_in_think is set
if (form.allow_toolcall_in_think && !unclosed_reasoning_content.empty()) {
if (builder.syntax().reasoning_format != COMMON_REASONING_FORMAT_NONE && !builder.syntax().reasoning_in_content) {
builder.add_reasoning_content(unclosed_reasoning_content);
} else {
if (content.empty()) {
content = start_think + unclosed_reasoning_content;
} else {
content += "\n\n" + start_think;
content += unclosed_reasoning_content;
}
}
unclosed_reasoning_content.clear();
}
// Add content
if (content.size() != 0) {
if (!content.empty()) {
// If there are multiple content blocks
if (builder.syntax().reasoning_format != COMMON_REASONING_FORMAT_NONE && !builder.syntax().reasoning_in_content && builder.result().content.size() != 0) {
builder.add_content("\n\n");
@@ -820,7 +839,7 @@ inline void parse_msg_with_xml_tool_calls(common_chat_msg_parser & builder, cons
builder.add_content(content);
}
// This <tool_call> start is in thinking block, skip this tool call
// This <tool_call> start is in thinking block and toolcall_in_think not set, skip this tool call
if (toolcall_in_think && !form.allow_toolcall_in_think) {
continue;
}
@@ -829,7 +848,7 @@ inline void parse_msg_with_xml_tool_calls(common_chat_msg_parser & builder, cons
if (!tc) {
GGML_ASSERT(builder.pos() == builder.input().size());
GGML_ASSERT(unclosed_reasoning_content.empty());
GGML_ASSERT(!reasoning_unclosed);
if (!form.allow_toolcall_in_think) GGML_ASSERT(!reasoning_unclosed);
break;
}
@@ -854,7 +873,6 @@ inline void parse_msg_with_xml_tool_calls(common_chat_msg_parser & builder, cons
/**
* Parse content that uses reasoning and XML-style tool calls
* TODO: Note that form.allow_toolcall_in_think is not tested yet. If anyone confirms it works, this comment can be removed.
*/
void common_chat_msg_parser::consume_reasoning_with_xml_tool_calls(const struct xml_tool_call_format & form, const std::string & start_think, const std::string & end_think) {
parse_msg_with_xml_tool_calls(*this, form, start_think, end_think);

View File

@@ -31,7 +31,7 @@ struct xml_tool_call_format {
std::optional<std::string> last_val_end = std::nullopt;
std::optional<std::string> last_tool_end = std::nullopt;
bool trim_raw_argval = false;
bool allow_toolcall_in_think = false; // TODO: UNTESTED!!!
bool allow_toolcall_in_think = false;
};
// make a GBNF that accept any strings except those containing any of the forbidden strings.

View File

@@ -917,12 +917,13 @@ static void common_chat_parse_kimi_k2(common_chat_msg_parser & builder) {
form.tool_start = "<|tool_call_begin|>";
form.tool_sep = "<|tool_call_argument_begin|>{";
form.key_start = "\"";
form.key_val_sep = "\": ";
form.val_end = ", ";
form.key_val_sep = "\":";
form.val_end = ",";
form.tool_end = "}<|tool_call_end|>";
form.scope_end = "<|tool_calls_section_end|>";
form.raw_argval = false;
form.last_val_end = "";
form.allow_toolcall_in_think = true;
return form;
})();
builder.consume_reasoning_with_xml_tool_calls(form, "<think>", "</think>");

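To make the `allow_toolcall_in_think` change concrete, a hedged illustration of the kind of partial stream it now handles. Token names are taken from the Kimi K2 form above; real output also carries section markers omitted here, and the exact tool-name encoding may differ:

```cpp
// Illustrative partial model output: an unclosed <think> block that already
// contains a complete tool call (compact JSON, matching the tightened
// key_val_sep/val_end above).
const char * partial_output =
    "<think>I should check the forecast first."
    "<|tool_call_begin|>get_weather<|tool_call_argument_begin|>"
    "{\"city\":\"Oslo\"}<|tool_call_end|>";
// With allow_toolcall_in_think = true the parser consumes the call even though
// </think> never arrives, and surfaces the preceding text as reasoning_content
// instead of folding the call back into the reasoning text.
```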
View File

@@ -1,5 +1,6 @@
#include "chat.h"
#include "chat-parser.h"
#include "chat-peg-parser.h"
#include "common.h"
#include "json-partial.h"
#include "json-schema-to-grammar.h"
@@ -150,6 +151,7 @@ struct templates_params {
common_chat_tool_choice tool_choice;
json json_schema;
bool parallel_tool_calls;
common_reasoning_format reasoning_format;
bool stream;
std::string grammar;
bool add_generation_prompt = true;
@@ -589,6 +591,16 @@ common_chat_templates_ptr common_chat_templates_init(
"{%- if false %}");
}
// TODO @aldehir : this is a temporary fix, pending Minja changes
// Ref: https://github.com/ggml-org/llama.cpp/pull/17713#issuecomment-3631342664
if (default_template_src.find("[TOOL_CALLS]") != std::string::npos
// search for the error message and patch it
&& default_template_src.find("if (message['content'] is none or") != std::string::npos) {
string_replace_all(default_template_src,
"{%- if (message['content'] is none or message['content'] == '' or message['content']|length == 0) and (message['tool_calls'] is not defined or message['tool_calls'] is none or message['tool_calls']|length == 0) %}",
"{%- if false %}");
}
std::string token_bos = bos_token_override;
std::string token_eos = eos_token_override;
bool add_bos = false;
@@ -987,6 +999,118 @@ static common_chat_params common_chat_params_init_lfm2(const common_chat_templat
return data;
}
static common_chat_params common_chat_params_init_ministral_3(const common_chat_template & tmpl, const struct templates_params & inputs) {
common_chat_params data;
// Build up messages to follow the format: https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512/blob/main/chat_template.jinja
auto adjusted_messages = json::array();
for (const auto & msg : inputs.messages) {
auto role = msg.value("role", "");
if (role != "system" && role != "assistant") {
// Only adjust system and assistant messages. Interestingly, the system message may contain thinking.
adjusted_messages.push_back(msg);
continue;
}
auto content = json::array();
// If message contains `reasoning_content`, add it as a block of type `thinking`
if (msg.contains("reasoning_content") && msg.at("reasoning_content").is_string()) {
content.push_back({
{"type", "thinking"},
{"thinking", msg.at("reasoning_content").get<std::string>()},
});
}
// If message contains `content`, add it as a block of type `text`
if (msg.contains("content")) {
if (msg.at("content").is_string()) {
content.push_back({
{"type", "text"},
{"text", msg.at("content").get<std::string>()},
});
} else if (msg.at("content").is_array()) {
auto blocks = msg.at("content");
content.insert(content.end(), blocks.begin(), blocks.end());
}
}
auto adjusted = msg;
adjusted["content"] = content;
adjusted.erase("reasoning_content");
adjusted_messages.push_back(adjusted);
}
auto has_tools = inputs.tools.is_array() && !inputs.tools.empty();
auto extract_reasoning = inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE;
auto include_grammar = true;
data.prompt = apply(tmpl, inputs, /* messages_override = */ adjusted_messages);
data.format = COMMON_CHAT_FORMAT_PEG_NATIVE;
data.preserved_tokens = {
"[THINK]",
"[/THINK]",
"[TOOL_CALLS]",
"[ARGS]",
};
auto parser = build_chat_peg_native_parser([&](common_chat_peg_native_builder & p) {
auto reasoning = extract_reasoning ? p.optional("[THINK]" + p.reasoning(p.until("[/THINK]")) + "[/THINK]") : p.eps();
// Response format parser
if (inputs.json_schema.is_object() && !inputs.json_schema.empty()) {
// Ministral wants to emit json surrounded by code fences
return reasoning << "```json" << p.content(p.schema(p.json(), "response-format", inputs.json_schema)) << "```";
}
// Tool call parser
if (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE) {
auto tool_choice = p.choice();
foreach_function(inputs.tools, [&](const json & tool) {
const auto & function = tool.at("function");
std::string name = function.at("name");
const auto & schema = function.at("parameters");
tool_choice |= p.rule("tool-" + name,
p.tool_open(p.tool_name(p.literal(name)) + "[ARGS]")
+ p.tool_args(p.schema(p.json(), "tool-" + name + "-schema", schema))
);
});
auto min_calls = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED ? 1 : 0;
auto max_calls = inputs.parallel_tool_calls ? -1 : 1;
auto tool_calls = p.trigger_rule("tool-call", p.repeat("[TOOL_CALLS]" + tool_choice, min_calls, max_calls));
return reasoning << p.content(p.until("[TOOL_CALLS]")) << tool_calls;
}
// Content only parser
include_grammar = false;
return reasoning << p.content(p.rest());
});
data.parser = parser.save();
if (include_grammar) {
data.grammar_lazy = has_tools && inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_AUTO;
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
foreach_function(inputs.tools, [&](const json & tool) {
const auto & function = tool.at("function");
auto schema = function.at("parameters");
builder.resolve_refs(schema);
});
parser.build_grammar(builder, data.grammar_lazy);
});
data.grammar_triggers = {
{COMMON_GRAMMAR_TRIGGER_TYPE_WORD, "[TOOL_CALLS]"}
};
}
return data;
}
static common_chat_params common_chat_params_init_magistral(const common_chat_template & tmpl, const struct templates_params & inputs) {
common_chat_params data;
data.prompt = apply(tmpl, inputs);
@@ -2341,6 +2465,7 @@ static common_chat_params common_chat_templates_apply_jinja(
params.messages = common_chat_msgs_to_json_oaicompat<json>(inputs.messages, /* concat_text= */ !tmpl.original_caps().requires_typed_content);
params.add_generation_prompt = inputs.add_generation_prompt;
params.tool_choice = inputs.tool_choice;
params.reasoning_format = inputs.reasoning_format;
params.enable_thinking = inputs.enable_thinking;
params.grammar = inputs.grammar;
params.now = inputs.now;
@@ -2504,6 +2629,13 @@ static common_chat_params common_chat_templates_apply_jinja(
return common_chat_params_init_llama_3_x(tmpl, params, allow_python_tag_builtin_tools);
}
// Ministral/Mistral Large 3
if (src.find("[SYSTEM_PROMPT]") != std::string::npos &&
src.find("[TOOL_CALLS]") != std::string::npos &&
src.find("[ARGS]") != std::string::npos) {
return common_chat_params_init_ministral_3(tmpl, params);
}
if (src.find("[THINK]") != std::string::npos && src.find("[/THINK]") != std::string::npos) {
return common_chat_params_init_magistral(tmpl, params);
}

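For orientation, the assistant output this parser is built to accept can be sketched from the preserved tokens and rules above; spacing and the presence of reasoning depend on the model and settings:

```cpp
// Inferred wire format, illustrative only:
//
//   [THINK]The user asked for the weather in Paris.[/THINK]
//   Let me check that.[TOOL_CALLS]get_weather[ARGS]{"city":"Paris"}
//
// - an optional [THINK]...[/THINK] span becomes reasoning (when extraction is on),
// - free text up to [TOOL_CALLS] becomes content,
// - each call is a literal tool name followed by [ARGS] and JSON arguments
//   validated against that tool's parameter schema.
```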
View File

@@ -982,6 +982,32 @@ std::vector<common_file_info> fs_list(const std::string & path, bool include_dir
return files;
}
//
// TTY utils
//
bool tty_can_use_colors() {
// Check NO_COLOR environment variable (https://no-color.org/)
if (const char * no_color = std::getenv("NO_COLOR")) {
if (no_color[0] != '\0') {
return false;
}
}
// Check TERM environment variable
if (const char * term = std::getenv("TERM")) {
if (std::strcmp(term, "dumb") == 0) {
return false;
}
}
// Check if stdout and stderr are connected to a terminal
// We check both because log messages can go to either
bool stdout_is_tty = isatty(fileno(stdout));
bool stderr_is_tty = isatty(fileno(stderr));
return stdout_is_tty || stderr_is_tty;
}
//
// Model utils

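This pairs naturally with the earlier `--color on/off/auto` change. Below is a minimal sketch of how auto mode could resolve through the new helper; the enum and function are assumptions for illustration, not the actual implementation:

```cpp
// Sketch only (depends on tty_can_use_colors() shown above):
// tri-state color resolution; names are illustrative.
enum color_mode { COLOR_MODE_OFF, COLOR_MODE_ON, COLOR_MODE_AUTO };

static bool resolve_colors(color_mode mode) {
    switch (mode) {
        case COLOR_MODE_ON:   return true;
        case COLOR_MODE_OFF:  return false;
        case COLOR_MODE_AUTO: return tty_can_use_colors(); // NO_COLOR, TERM=dumb, isatty
    }
    return false;
}
```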
View File

@@ -82,7 +82,8 @@ int32_t cpu_get_num_math();
enum llama_example {
LLAMA_EXAMPLE_COMMON,
LLAMA_EXAMPLE_SPECULATIVE,
LLAMA_EXAMPLE_MAIN,
LLAMA_EXAMPLE_COMPLETION,
LLAMA_EXAMPLE_CLI,
LLAMA_EXAMPLE_EMBEDDING,
LLAMA_EXAMPLE_PERPLEXITY,
LLAMA_EXAMPLE_RETRIEVAL,
@@ -406,6 +407,7 @@ struct common_params {
bool simple_io = false; // improves compatibility with subprocesses and limited consoles
bool cont_batching = true; // insert new sequences for decoding on-the-fly
bool no_perf = false; // disable performance metrics
bool show_timings = true; // show timing information on CLI
bool ctx_shift = false; // context shift on infinite text generation
bool swa_full = false; // use full-size SWA cache (https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
bool kv_unified = false; // enable unified KV cache
@@ -462,7 +464,7 @@ struct common_params {
std::string public_path = ""; // NOLINT
std::string api_prefix = ""; // NOLINT
std::string chat_template = ""; // NOLINT
bool use_jinja = false; // NOLINT
bool use_jinja = true; // NOLINT
bool enable_chat_template = true;
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
int reasoning_budget = -1;
@@ -482,9 +484,10 @@ struct common_params {
bool endpoint_metrics = false;
// router server configs
std::string models_dir = ""; // directory containing models for the router server
int models_max = 4; // maximum number of models to load simultaneously
bool models_autoload = true; // automatically load models when requested via the router server
std::string models_dir = ""; // directory containing models for the router server
std::string models_preset = ""; // directory containing model presets for the router server
int models_max = 4; // maximum number of models to load simultaneously
bool models_autoload = true; // automatically load models when requested via the router server
bool log_json = false;
@@ -655,6 +658,13 @@ struct common_file_info {
};
std::vector<common_file_info> fs_list(const std::string & path, bool include_directories);
//
// TTY utils
//
// Auto-detect if colors can be enabled based on terminal and environment
bool tty_can_use_colors();
//
// Model utils
//

View File

@@ -1,6 +1,16 @@
#include "console.h"
#include "log.h"
#include <vector>
#include <iostream>
#include <cassert>
#include <cstddef>
#include <cctype>
#include <cwctype>
#include <cstdint>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <stdarg.h>
#if defined(_WIN32)
#define WIN32_LEAN_AND_MEAN
@@ -30,26 +40,44 @@
#define ANSI_COLOR_BLUE "\x1b[34m"
#define ANSI_COLOR_MAGENTA "\x1b[35m"
#define ANSI_COLOR_CYAN "\x1b[36m"
#define ANSI_COLOR_GRAY "\x1b[90m"
#define ANSI_COLOR_RESET "\x1b[0m"
#define ANSI_BOLD "\x1b[1m"
namespace console {
#if defined (_WIN32)
namespace {
// Use private-use unicode values to represent special keys that are not reported
// as characters (e.g. arrows on Windows). These values should never clash with
// real input and let the rest of the code handle navigation uniformly.
static constexpr char32_t KEY_ARROW_LEFT = 0xE000;
static constexpr char32_t KEY_ARROW_RIGHT = 0xE001;
static constexpr char32_t KEY_ARROW_UP = 0xE002;
static constexpr char32_t KEY_ARROW_DOWN = 0xE003;
static constexpr char32_t KEY_HOME = 0xE004;
static constexpr char32_t KEY_END = 0xE005;
static constexpr char32_t KEY_CTRL_ARROW_LEFT = 0xE006;
static constexpr char32_t KEY_CTRL_ARROW_RIGHT = 0xE007;
static constexpr char32_t KEY_DELETE = 0xE008;
}
//
// Console state
//
#endif
static bool advanced_display = false;
static bool simple_io = true;
static display_t current_display = reset;
static bool advanced_display = false;
static bool simple_io = true;
static display_type current_display = DISPLAY_TYPE_RESET;
static FILE* out = stdout;
static FILE* out = stdout;
#if defined (_WIN32)
static void* hConsole;
static void* hConsole;
#else
static FILE* tty = nullptr;
static termios initial_state;
static FILE* tty = nullptr;
static termios initial_state;
#endif
//
@@ -120,7 +148,7 @@ namespace console {
void cleanup() {
// Reset console display
set_display(reset);
set_display(DISPLAY_TYPE_RESET);
#if !defined(_WIN32)
// Restore settings on POSIX systems
@@ -140,20 +168,26 @@ namespace console {
//
// Keep track of current display and only emit ANSI code if it changes
void set_display(display_t display) {
void set_display(display_type display) {
if (advanced_display && current_display != display) {
fflush(stdout);
common_log_flush(common_log_main());
switch(display) {
case reset:
case DISPLAY_TYPE_RESET:
fprintf(out, ANSI_COLOR_RESET);
break;
case prompt:
case DISPLAY_TYPE_INFO:
fprintf(out, ANSI_COLOR_MAGENTA);
break;
case DISPLAY_TYPE_PROMPT:
fprintf(out, ANSI_COLOR_YELLOW);
break;
case user_input:
case DISPLAY_TYPE_REASONING:
fprintf(out, ANSI_COLOR_GRAY);
break;
case DISPLAY_TYPE_USER_INPUT:
fprintf(out, ANSI_BOLD ANSI_COLOR_GREEN);
break;
case error:
case DISPLAY_TYPE_ERROR:
fprintf(out, ANSI_BOLD ANSI_COLOR_RED);
}
current_display = display;
@@ -176,7 +210,18 @@ namespace console {
if (record.EventType == KEY_EVENT && record.Event.KeyEvent.bKeyDown) {
wchar_t wc = record.Event.KeyEvent.uChar.UnicodeChar;
if (wc == 0) {
continue;
const DWORD ctrl_mask = LEFT_CTRL_PRESSED | RIGHT_CTRL_PRESSED;
const bool ctrl_pressed = (record.Event.KeyEvent.dwControlKeyState & ctrl_mask) != 0;
switch (record.Event.KeyEvent.wVirtualKeyCode) {
case VK_LEFT: return ctrl_pressed ? KEY_CTRL_ARROW_LEFT : KEY_ARROW_LEFT;
case VK_RIGHT: return ctrl_pressed ? KEY_CTRL_ARROW_RIGHT : KEY_ARROW_RIGHT;
case VK_UP: return KEY_ARROW_UP;
case VK_DOWN: return KEY_ARROW_DOWN;
case VK_HOME: return KEY_HOME;
case VK_END: return KEY_END;
case VK_DELETE: return KEY_DELETE;
default: continue;
}
}
if ((wc >= 0xD800) && (wc <= 0xDBFF)) { // Check if wc is a high surrogate
@@ -315,6 +360,52 @@ namespace console {
#endif
}
static char32_t decode_utf8(const std::string & input, size_t pos, size_t & advance) {
unsigned char c = static_cast<unsigned char>(input[pos]);
if ((c & 0x80u) == 0u) {
advance = 1;
return c;
}
if ((c & 0xE0u) == 0xC0u && pos + 1 < input.size()) {
unsigned char c1 = static_cast<unsigned char>(input[pos + 1]);
if ((c1 & 0xC0u) != 0x80u) {
advance = 1;
return 0xFFFD;
}
advance = 2;
return ((c & 0x1Fu) << 6) | (static_cast<unsigned char>(input[pos + 1]) & 0x3Fu);
}
if ((c & 0xF0u) == 0xE0u && pos + 2 < input.size()) {
unsigned char c1 = static_cast<unsigned char>(input[pos + 1]);
unsigned char c2 = static_cast<unsigned char>(input[pos + 2]);
if ((c1 & 0xC0u) != 0x80u || (c2 & 0xC0u) != 0x80u) {
advance = 1;
return 0xFFFD;
}
advance = 3;
return ((c & 0x0Fu) << 12) |
((static_cast<unsigned char>(input[pos + 1]) & 0x3Fu) << 6) |
(static_cast<unsigned char>(input[pos + 2]) & 0x3Fu);
}
if ((c & 0xF8u) == 0xF0u && pos + 3 < input.size()) {
unsigned char c1 = static_cast<unsigned char>(input[pos + 1]);
unsigned char c2 = static_cast<unsigned char>(input[pos + 2]);
unsigned char c3 = static_cast<unsigned char>(input[pos + 3]);
if ((c1 & 0xC0u) != 0x80u || (c2 & 0xC0u) != 0x80u || (c3 & 0xC0u) != 0x80u) {
advance = 1;
return 0xFFFD;
}
advance = 4;
return ((c & 0x07u) << 18) |
((static_cast<unsigned char>(input[pos + 1]) & 0x3Fu) << 12) |
((static_cast<unsigned char>(input[pos + 2]) & 0x3Fu) << 6) |
(static_cast<unsigned char>(input[pos + 3]) & 0x3Fu);
}
advance = 1;
return 0xFFFD; // replacement character for invalid input
}
static void append_utf8(char32_t ch, std::string & out) {
if (ch <= 0x7F) {
out.push_back(static_cast<unsigned char>(ch));
@@ -336,22 +427,319 @@ namespace console {
}
// Helper function to remove the last UTF-8 character from a string
static void pop_back_utf8_char(std::string & line) {
if (line.empty()) {
static size_t prev_utf8_char_pos(const std::string & line, size_t pos) {
if (pos == 0) return 0;
pos--;
while (pos > 0 && (line[pos] & 0xC0) == 0x80) {
pos--;
}
return pos;
}
static size_t next_utf8_char_pos(const std::string & line, size_t pos) {
if (pos >= line.length()) return line.length();
pos++;
while (pos < line.length() && (line[pos] & 0xC0) == 0x80) {
pos++;
}
return pos;
}
static void move_cursor(int delta);
static void move_word_left(size_t & char_pos, size_t & byte_pos, const std::vector<int> & widths, const std::string & line);
static void move_word_right(size_t & char_pos, size_t & byte_pos, const std::vector<int> & widths, const std::string & line);
static void move_to_line_start(size_t & char_pos, size_t & byte_pos, const std::vector<int> & widths);
static void move_to_line_end(size_t & char_pos, size_t & byte_pos, const std::vector<int> & widths, const std::string & line);
static void delete_at_cursor(std::string & line, std::vector<int> & widths, size_t & char_pos, size_t & byte_pos) {
if (char_pos >= widths.size()) {
return;
}
size_t pos = line.length() - 1;
size_t next_pos = next_utf8_char_pos(line, byte_pos);
int w = widths[char_pos];
size_t char_len = next_pos - byte_pos;
// Find the start of the last UTF-8 character (checking up to 4 bytes back)
for (size_t i = 0; i < 3 && pos > 0; ++i, --pos) {
if ((line[pos] & 0xC0) != 0x80) {
break; // Found the start of the character
}
line.erase(byte_pos, char_len);
widths.erase(widths.begin() + char_pos);
size_t p = byte_pos;
int tail_width = 0;
for (size_t i = char_pos; i < widths.size(); ++i) {
size_t following = next_utf8_char_pos(line, p);
put_codepoint(line.c_str() + p, following - p, widths[i]);
tail_width += widths[i];
p = following;
}
line.erase(pos);
for (int i = 0; i < w; ++i) {
fputc(' ', out);
}
move_cursor(-(tail_width + w));
}
static void clear_current_line(const std::vector<int> & widths) {
int total_width = 0;
for (int w : widths) {
total_width += (w > 0 ? w : 1);
}
if (total_width > 0) {
std::string spaces(total_width, ' ');
fwrite(spaces.c_str(), 1, total_width, out);
move_cursor(-total_width);
}
}
static void set_line_contents(std::string new_line, std::string & line, std::vector<int> & widths, size_t & char_pos,
size_t & byte_pos) {
move_to_line_start(char_pos, byte_pos, widths);
clear_current_line(widths);
line = std::move(new_line);
widths.clear();
byte_pos = 0;
char_pos = 0;
size_t idx = 0;
while (idx < line.size()) {
size_t advance = 0;
char32_t cp = decode_utf8(line, idx, advance);
int expected_width = estimateWidth(cp);
int real_width = put_codepoint(line.c_str() + idx, advance, expected_width);
if (real_width < 0) real_width = 0;
widths.push_back(real_width);
idx += advance;
++char_pos;
byte_pos = idx;
}
}
static void move_to_line_start(size_t & char_pos, size_t & byte_pos, const std::vector<int> & widths) {
int back_width = 0;
for (size_t i = 0; i < char_pos; ++i) {
back_width += widths[i];
}
move_cursor(-back_width);
char_pos = 0;
byte_pos = 0;
}
static void move_to_line_end(size_t & char_pos, size_t & byte_pos, const std::vector<int> & widths, const std::string & line) {
int forward_width = 0;
for (size_t i = char_pos; i < widths.size(); ++i) {
forward_width += widths[i];
}
move_cursor(forward_width);
char_pos = widths.size();
byte_pos = line.length();
}
static bool has_ctrl_modifier(const std::string & params) {
size_t start = 0;
while (start < params.size()) {
size_t end = params.find(';', start);
size_t len = (end == std::string::npos) ? params.size() - start : end - start;
if (len > 0) {
int value = 0;
for (size_t i = 0; i < len; ++i) {
char ch = params[start + i];
if (!std::isdigit(static_cast<unsigned char>(ch))) {
value = -1;
break;
}
value = value * 10 + (ch - '0');
}
if (value == 5) {
return true;
}
}
if (end == std::string::npos) {
break;
}
start = end + 1;
}
return false;
}
static bool is_space_codepoint(char32_t cp) {
return std::iswspace(static_cast<wint_t>(cp)) != 0;
}
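// move the cursor left past any whitespace, then to the start of the previous word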
static void move_word_left(size_t & char_pos, size_t & byte_pos, const std::vector<int> & widths, const std::string & line) {
if (char_pos == 0) {
return;
}
size_t new_char_pos = char_pos;
size_t new_byte_pos = byte_pos;
int move_width = 0;
while (new_char_pos > 0) {
size_t prev_byte = prev_utf8_char_pos(line, new_byte_pos);
size_t advance = 0;
char32_t cp = decode_utf8(line, prev_byte, advance);
if (!is_space_codepoint(cp)) {
break;
}
move_width += widths[new_char_pos - 1];
new_char_pos--;
new_byte_pos = prev_byte;
}
while (new_char_pos > 0) {
size_t prev_byte = prev_utf8_char_pos(line, new_byte_pos);
size_t advance = 0;
char32_t cp = decode_utf8(line, prev_byte, advance);
if (is_space_codepoint(cp)) {
break;
}
move_width += widths[new_char_pos - 1];
new_char_pos--;
new_byte_pos = prev_byte;
}
move_cursor(-move_width);
char_pos = new_char_pos;
byte_pos = new_byte_pos;
}
static void move_word_right(size_t & char_pos, size_t & byte_pos, const std::vector<int> & widths, const std::string & line) {
if (char_pos >= widths.size()) {
return;
}
size_t new_char_pos = char_pos;
size_t new_byte_pos = byte_pos;
int move_width = 0;
while (new_char_pos < widths.size()) {
size_t advance = 0;
char32_t cp = decode_utf8(line, new_byte_pos, advance);
if (!is_space_codepoint(cp)) {
break;
}
move_width += widths[new_char_pos];
new_char_pos++;
new_byte_pos += advance;
}
while (new_char_pos < widths.size()) {
size_t advance = 0;
char32_t cp = decode_utf8(line, new_byte_pos, advance);
if (is_space_codepoint(cp)) {
break;
}
move_width += widths[new_char_pos];
new_char_pos++;
new_byte_pos += advance;
}
while (new_char_pos < widths.size()) {
size_t advance = 0;
char32_t cp = decode_utf8(line, new_byte_pos, advance);
if (!is_space_codepoint(cp)) {
break;
}
move_width += widths[new_char_pos];
new_char_pos++;
new_byte_pos += advance;
}
move_cursor(move_width);
char_pos = new_char_pos;
byte_pos = new_byte_pos;
}
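// Example: with line = "foo   bar" and the cursor on 'f', one Ctrl+Right
// consumes the word "foo" and the following run of spaces, landing on 'b'
// (the start of the next word).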
static void move_cursor(int delta) {
if (delta == 0) return;
#if defined(_WIN32)
if (hConsole != NULL) {
CONSOLE_SCREEN_BUFFER_INFO bufferInfo;
GetConsoleScreenBufferInfo(hConsole, &bufferInfo);
COORD newCursorPosition = bufferInfo.dwCursorPosition;
int width = bufferInfo.dwSize.X;
int newX = newCursorPosition.X + delta;
int newY = newCursorPosition.Y;
while (newX >= width) {
newX -= width;
newY++;
}
while (newX < 0) {
newX += width;
newY--;
}
newCursorPosition.X = newX;
newCursorPosition.Y = newY;
SetConsoleCursorPosition(hConsole, newCursorPosition);
}
#else
if (delta < 0) {
for (int i = 0; i < -delta; i++) fprintf(out, "\b");
} else {
for (int i = 0; i < delta; i++) fprintf(out, "\033[C");
}
#endif
}
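// Example: with an 80-column Windows console buffer, delta = -3 from
// (X=1, Y=5) wraps to (X=78, Y=4); the POSIX branch instead emits '\b'
// or "\033[C" and stays within the current line.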
struct history_t {
std::vector<std::string> entries;
size_t viewing_idx = SIZE_MAX;
std::string backup_line; // current line before viewing history
void add(const std::string & line) {
if (line.empty()) {
return;
}
// avoid duplicates with the last entry
if (entries.empty() || entries.back() != line) {
entries.push_back(line);
}
// also clear viewing state
end_viewing();
}
bool prev(std::string & cur_line) {
if (entries.empty()) {
return false;
}
if (viewing_idx == SIZE_MAX) {
return false;
}
if (viewing_idx > 0) {
viewing_idx--;
}
cur_line = entries[viewing_idx];
return true;
}
bool next(std::string & cur_line) {
if (entries.empty() || viewing_idx == SIZE_MAX) {
return false;
}
viewing_idx++;
if (viewing_idx >= entries.size()) {
cur_line = backup_line;
end_viewing();
} else {
cur_line = entries[viewing_idx];
}
return true;
}
void begin_viewing(const std::string & line) {
backup_line = line;
viewing_idx = entries.size();
}
void end_viewing() {
viewing_idx = SIZE_MAX;
backup_line.clear();
}
bool is_viewing() const {
return viewing_idx != SIZE_MAX;
}
} history;
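// Usage sketch (hypothetical values) for the viewing state machine above:
//   history.add("first"); history.add("second");
//   std::string cur, line = "draft";
//   history.begin_viewing(line);   // viewing_idx = entries.size()
//   history.prev(cur);             // cur == "second"
//   history.prev(cur);             // cur == "first"
//   history.next(cur);             // cur == "second"
//   history.next(cur);             // cur == "draft" (backup restored, viewing ends)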
static bool readline_advanced(std::string & line, bool multiline_input) {
if (out != stdout) {
fflush(stdout);
@@ -362,8 +750,33 @@ namespace console {
bool is_special_char = false;
bool end_of_stream = false;
size_t byte_pos = 0; // current byte index
size_t char_pos = 0; // current character index (one char can be multiple bytes)
char32_t input_char;
while (true) {
assert(char_pos <= byte_pos);
assert(char_pos <= widths.size());
auto history_prev = [&]() {
if (!history.is_viewing()) {
history.begin_viewing(line);
}
std::string new_line;
if (!history.prev(new_line)) {
return;
}
set_line_contents(new_line, line, widths, char_pos, byte_pos);
};
auto history_next = [&]() {
if (history.is_viewing()) {
std::string new_line;
if (!history.next(new_line)) {
return;
}
set_line_contents(new_line, line, widths, char_pos, byte_pos);
}
};
fflush(out); // Ensure all output is displayed before waiting for input
input_char = getchar32();
@@ -371,20 +784,83 @@ namespace console {
break;
}
if (input_char == (char32_t) WEOF || input_char == 0x04 /* Ctrl+D */) {
end_of_stream = true;
break;
}
if (is_special_char) {
set_display(user_input);
replace_last(line.back());
is_special_char = false;
}
if (input_char == '\033') { // Escape sequence
char32_t code = getchar32();
if (code == '[') {
std::string params;
while (true) {
code = getchar32();
if ((code >= 'A' && code <= 'Z') || (code >= 'a' && code <= 'z') || code == '~' || code == (char32_t) WEOF) {
break;
}
params.push_back(static_cast<char>(code));
}
const bool ctrl_modifier = has_ctrl_modifier(params);
if (code == 'D') { // left
if (ctrl_modifier) {
move_word_left(char_pos, byte_pos, widths, line);
} else if (char_pos > 0) {
int w = widths[char_pos - 1];
move_cursor(-w);
char_pos--;
byte_pos = prev_utf8_char_pos(line, byte_pos);
}
} else if (code == 'C') { // right
if (ctrl_modifier) {
move_word_right(char_pos, byte_pos, widths, line);
} else if (char_pos < widths.size()) {
int w = widths[char_pos];
move_cursor(w);
char_pos++;
byte_pos = next_utf8_char_pos(line, byte_pos);
}
} else if (code == 'H') { // home
move_to_line_start(char_pos, byte_pos, widths);
} else if (code == 'F') { // end
move_to_line_end(char_pos, byte_pos, widths, line);
} else if (code == 'A' || code == 'B') {
// up/down
if (code == 'A') {
history_prev();
is_special_char = false;
} else if (code == 'B') {
history_next();
is_special_char = false;
}
} else if ((code == '~' || (code >= 'A' && code <= 'Z') || (code >= 'a' && code <= 'z')) && !params.empty()) {
std::string digits;
for (char ch : params) {
if (ch == ';') {
break;
}
if (std::isdigit(static_cast<unsigned char>(ch))) {
digits.push_back(ch);
}
}
if (code == '~') {
if (digits == "1" || digits == "7") { // home
move_to_line_start(char_pos, byte_pos, widths);
} else if (digits == "4" || digits == "8") { // end
move_to_line_end(char_pos, byte_pos, widths, line);
} else if (digits == "3") { // delete
delete_at_cursor(line, widths, char_pos, byte_pos);
}
}
}
} else if (code == 0x1B) {
// Discard the rest of the escape sequence
while ((code = getchar32()) != (char32_t) WEOF) {
if ((code >= 'A' && code <= 'Z') || (code >= 'a' && code <= 'z') || code == '~') {
@@ -392,32 +868,110 @@ namespace console {
}
}
}
#if defined(_WIN32)
} else if (input_char == KEY_ARROW_LEFT) {
if (char_pos > 0) {
int w = widths[char_pos - 1];
move_cursor(-w);
char_pos--;
byte_pos = prev_utf8_char_pos(line, byte_pos);
}
} else if (input_char == KEY_ARROW_RIGHT) {
if (char_pos < widths.size()) {
int w = widths[char_pos];
move_cursor(w);
char_pos++;
byte_pos = next_utf8_char_pos(line, byte_pos);
}
} else if (input_char == KEY_CTRL_ARROW_LEFT) {
move_word_left(char_pos, byte_pos, widths, line);
} else if (input_char == KEY_CTRL_ARROW_RIGHT) {
move_word_right(char_pos, byte_pos, widths, line);
} else if (input_char == KEY_HOME) {
move_to_line_start(char_pos, byte_pos, widths);
} else if (input_char == KEY_END) {
move_to_line_end(char_pos, byte_pos, widths, line);
} else if (input_char == KEY_DELETE) {
delete_at_cursor(line, widths, char_pos, byte_pos);
} else if (input_char == KEY_ARROW_UP || input_char == KEY_ARROW_DOWN) {
if (input_char == KEY_ARROW_UP) {
history_prev();
is_special_char = false;
} else if (input_char == KEY_ARROW_DOWN) {
history_next();
is_special_char = false;
}
#endif
} else if (input_char == 0x08 || input_char == 0x7F) { // Backspace
    if (char_pos > 0) {
        int w = widths[char_pos - 1];
        move_cursor(-w);
        char_pos--;
        size_t prev_pos = prev_utf8_char_pos(line, byte_pos);
        size_t char_len = byte_pos - prev_pos;
        byte_pos = prev_pos;
        // remove the character
        line.erase(byte_pos, char_len);
        widths.erase(widths.begin() + char_pos);
        // redraw tail
        size_t p = byte_pos;
        int tail_width = 0;
        for (size_t i = char_pos; i < widths.size(); ++i) {
            size_t next_p = next_utf8_char_pos(line, p);
            put_codepoint(line.c_str() + p, next_p - p, widths[i]);
            tail_width += widths[i];
            p = next_p;
        }
        // clear display
        for (int i = 0; i < w; ++i) {
            fputc(' ', out);
        }
        move_cursor(-(tail_width + w));
    }
} else {
    // insert character
    std::string new_char_str;
    append_utf8(input_char, new_char_str);
    int w = estimateWidth(input_char);
    if (char_pos == widths.size()) {
        // insert at the end
        line += new_char_str;
        int real_w = put_codepoint(new_char_str.c_str(), new_char_str.length(), w);
        if (real_w < 0) real_w = 0;
        widths.push_back(real_w);
        byte_pos += new_char_str.length();
        char_pos++;
    } else {
        // insert in the middle
        line.insert(byte_pos, new_char_str);
        int real_w = put_codepoint(new_char_str.c_str(), new_char_str.length(), w);
        if (real_w < 0) real_w = 0;
        widths.insert(widths.begin() + char_pos, real_w);
        // print the tail
        size_t p = byte_pos + new_char_str.length();
        int tail_width = 0;
        for (size_t i = char_pos + 1; i < widths.size(); ++i) {
            size_t next_p = next_utf8_char_pos(line, p);
            put_codepoint(line.c_str() + p, next_p - p, widths[i]);
            tail_width += widths[i];
            p = next_p;
        }
        move_cursor(-tail_width);
        byte_pos += new_char_str.length();
        char_pos++;
    }
}
if (!line.empty() && (line.back() == '\\' || line.back() == '/')) {
set_display(prompt);
replace_last(line.back());
is_special_char = true;
}
@@ -451,6 +1005,15 @@ namespace console {
}
}
if (!end_of_stream && !line.empty()) {
// remove the trailing newline for history storage
if (!line.empty() && line.back() == '\n') {
line.pop_back();
}
// TODO: maybe support multiline history entries?
history.add(line);
}
fflush(out);
return has_more;
}
@@ -493,12 +1056,82 @@ namespace console {
}
bool readline(std::string & line, bool multiline_input) {
set_display(user_input);
if (simple_io) {
return readline_simple(line, multiline_input);
}
return readline_advanced(line, multiline_input);
}
namespace spinner {
static const char LOADING_CHARS[] = {'|', '/', '-', '\\'};
static std::condition_variable cv_stop;
static std::thread th;
static size_t frame = 0; // only modified by one thread
static bool running = false;
static std::mutex mtx;
static auto wait_time = std::chrono::milliseconds(100);
static void draw_next_frame() {
// no extra synchronization needed here: the spinner thread holds mtx
// while drawing and frame is only ever advanced by this function
frame = (frame + 1) % sizeof(LOADING_CHARS);
replace_last(LOADING_CHARS[frame]);
fflush(out);
}
void start() {
std::unique_lock<std::mutex> lock(mtx);
if (simple_io || running) {
return;
}
common_log_flush(common_log_main());
fprintf(out, "%c", LOADING_CHARS[0]);
fflush(out);
frame = 1;
running = true;
th = std::thread([]() {
std::unique_lock<std::mutex> lock(mtx);
while (true) {
if (cv_stop.wait_for(lock, wait_time, []{ return !running; })) {
break;
}
draw_next_frame();
}
});
}
void stop() {
{
std::unique_lock<std::mutex> lock(mtx);
if (simple_io || !running) {
return;
}
running = false;
cv_stop.notify_all();
}
if (th.joinable()) {
th.join();
}
replace_last(' ');
pop_cursor();
fflush(out);
}
}
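// Usage sketch (hypothetical call site, not part of the original sources):
// bracket slow, silent work so the user sees activity, e.g.
//   console::spinner::start();
//   load_model(...);   // long-running call, placeholder name
//   console::spinner::stop();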
void log(const char * fmt, ...) {
va_list args;
va_start(args, fmt);
vfprintf(out, fmt, args);
va_end(args);
}
void error(const char * fmt, ...) {
va_list args;
va_start(args, fmt);
display_type cur = current_display;
set_display(DISPLAY_TYPE_ERROR);
vfprintf(out, fmt, args);
set_display(cur); // restore previous color
va_end(args);
}
void flush() {
fflush(out);
}
}

View File

@@ -2,18 +2,40 @@
#pragma once
#include "common.h"
#include <string>
enum display_type {
DISPLAY_TYPE_RESET = 0,
DISPLAY_TYPE_INFO,
DISPLAY_TYPE_PROMPT,
DISPLAY_TYPE_REASONING,
DISPLAY_TYPE_USER_INPUT,
DISPLAY_TYPE_ERROR
};
namespace console {
void init(bool use_simple_io, bool use_advanced_display);
void cleanup();
void set_display(display_type display);
bool readline(std::string & line, bool multiline_input);
namespace spinner {
void start();
void stop();
}
// note: the logging API below outputs directly to stdout
// it can negatively impact performance if used on the inference thread
// only use it in a dedicated CLI thread
// for logging in the inference thread, use log.h instead
LLAMA_COMMON_ATTRIBUTE_FORMAT(1, 2)
void log(const char * fmt, ...);
LLAMA_COMMON_ATTRIBUTE_FORMAT(1, 2)
void error(const char * fmt, ...);
void flush();
}

View File

@@ -12,6 +12,8 @@
#include <filesystem>
#include <fstream>
#include <future>
#include <map>
#include <mutex>
#include <regex>
#include <string>
#include <thread>
@@ -472,36 +474,79 @@ std::pair<long, std::vector<char>> common_remote_get_content(const std::string &
#elif defined(LLAMA_USE_HTTPLIB)
class ProgressBar {
    static inline std::mutex mutex;
    static inline std::map<const ProgressBar *, int> lines;
    static inline int max_line = 0;

    static void cleanup(const ProgressBar * line) {
        lines.erase(line);
        if (lines.empty()) {
            max_line = 0;
        }
    }

    static bool is_output_a_tty() {
#if defined(_WIN32)
        return _isatty(_fileno(stdout));
#else
        return isatty(1);
#endif
    }

  public:
    ProgressBar() = default;

    ~ProgressBar() {
        std::lock_guard<std::mutex> lock(mutex);
        cleanup(this);
    }

    void update(size_t current, size_t total) {
        if (!is_output_a_tty()) {
            return;
        }
        if (!total) {
            return;
        }
        std::lock_guard<std::mutex> lock(mutex);
        if (lines.find(this) == lines.end()) {
            lines[this] = max_line++;
            std::cout << "\n";
        }
        int lines_up = max_line - lines[this];
        size_t width = 50;
        size_t pct = (100 * current) / total;
        size_t pos = (width * current) / total;
        std::cout << "\033[s";
        if (lines_up > 0) {
            std::cout << "\033[" << lines_up << "A";
        }
        std::cout << "\033[2K\r["
                  << std::string(pos, '=')
                  << (pos < width ? ">" : "")
                  << std::string(width - pos, ' ')
                  << "] " << std::setw(3) << pct << "% ("
                  << current / (1024 * 1024) << " MB / "
                  << total / (1024 * 1024) << " MB) "
                  << "\033[u";
        std::cout.flush();
        if (current == total) {
            cleanup(this);
        }
    }

    ProgressBar(const ProgressBar &) = delete;
    ProgressBar & operator=(const ProgressBar &) = delete;
};
static bool common_pull_file(httplib::Client & cli,
const std::string & resolve_path,
@@ -523,6 +568,7 @@ static bool common_pull_file(httplib::Client & cli,
const char * func = __func__; // avoid __func__ inside a lambda
size_t downloaded = existing_size;
size_t progress_step = 0;
ProgressBar bar;
auto res = cli.Get(resolve_path, headers,
[&](const httplib::Response &response) {
@@ -554,7 +600,7 @@ static bool common_pull_file(httplib::Client & cli,
progress_step += len;
if (progress_step >= total_size / 1000 || downloaded == total_size) {
bar.update(downloaded, total_size);
progress_step = 0;
}
return true;
@@ -562,8 +608,6 @@ static bool common_pull_file(httplib::Client & cli,
nullptr
);
if (!res) {
LOG_ERR("%s: error during download. Status: %d\n", __func__, res ? res->status : -1);
return false;

View File

@@ -1,3 +1,4 @@
#include "common.h"
#include "log.h"
#include <chrono>
@@ -26,30 +27,6 @@ void common_log_set_verbosity_thold(int verbosity) {
common_log_verbosity_thold = verbosity;
}
// Auto-detect if colors should be enabled based on terminal and environment
static bool common_log_should_use_colors_auto() {
// Check NO_COLOR environment variable (https://no-color.org/)
if (const char * no_color = std::getenv("NO_COLOR")) {
if (no_color[0] != '\0') {
return false;
}
}
// Check TERM environment variable
if (const char * term = std::getenv("TERM")) {
if (std::strcmp(term, "dumb") == 0) {
return false;
}
}
// Check if stdout and stderr are connected to a terminal
// We check both because log messages can go to either
bool stdout_is_tty = isatty(fileno(stdout));
bool stderr_is_tty = isatty(fileno(stderr));
return stdout_is_tty || stderr_is_tty;
}
static int64_t t_us() {
return std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
}
@@ -391,7 +368,7 @@ struct common_log * common_log_main() {
static std::once_flag init_flag;
std::call_once(init_flag, [&]() {
// Set default to auto-detect colors
log.set_colors(tty_can_use_colors());
});
return &log;
@@ -422,7 +399,7 @@ void common_log_set_file(struct common_log * log, const char * file) {
void common_log_set_colors(struct common_log * log, log_colors colors) {
if (colors == LOG_COLORS_AUTO) {
log->set_colors(tty_can_use_colors());
return;
}
@@ -443,6 +420,11 @@ void common_log_set_timestamps(struct common_log * log, bool timestamps) {
log->set_timestamps(timestamps);
}
void common_log_flush(struct common_log * log) {
log->pause();
log->resume();
}
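// (a note on the design choice, assuming common_log's usual behavior:
// pause() drains the worker queue before returning, so pause() + resume()
// doubles as a synchronous flush without introducing new state)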
static int common_get_verbosity(enum ggml_log_level level) {
switch (level) {
case GGML_LOG_LEVEL_DEBUG: return LOG_LEVEL_DEBUG;

View File

@@ -84,6 +84,7 @@ void common_log_set_file (struct common_log * log, const char * file); // n
void common_log_set_colors (struct common_log * log, log_colors colors); // not thread-safe
void common_log_set_prefix (struct common_log * log, bool prefix); // whether to output prefix to each log
void common_log_set_timestamps(struct common_log * log, bool timestamps); // whether to output timestamps in the prefix
void common_log_flush (struct common_log * log); // flush all pending log messages
// helper macros for logging
// use these to avoid computing log arguments if the verbosity of the log is higher than the threshold

common/preset.cpp Normal file
View File

@@ -0,0 +1,186 @@
#include "arg.h"
#include "preset.h"
#include "peg-parser.h"
#include "log.h"
#include <fstream>
#include <sstream>
#include <filesystem>
static std::string rm_leading_dashes(const std::string & str) {
size_t pos = 0;
while (pos < str.size() && str[pos] == '-') {
++pos;
}
return str.substr(pos);
}
std::vector<std::string> common_preset::to_args() const {
std::vector<std::string> args;
for (const auto & [opt, value] : options) {
args.push_back(opt.args.back()); // use the last arg as the main arg
if (opt.value_hint == nullptr && opt.value_hint_2 == nullptr) {
// flag option, no value
if (common_arg_utils::is_falsey(value)) {
// use negative arg if available
if (!opt.args_neg.empty()) {
args.back() = opt.args_neg.back();
} else {
// otherwise, skip the flag
// TODO: maybe throw an error instead?
args.pop_back();
}
}
}
if (opt.value_hint != nullptr) {
// single value
args.push_back(value);
}
if (opt.value_hint != nullptr && opt.value_hint_2 != nullptr) {
throw std::runtime_error(string_format(
"common_preset::to_args(): option '%s' has two values, which is not supported yet",
opt.args.back()
));
}
}
return args;
}
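// Illustrative mapping (hypothetical option names, not from the sources):
// a falsey flag value emits the negated form, e.g. {"--jinja": "false"}
// -> "--no-jinja", while a valued option {"--ctx-size": "4096"}
// -> {"--ctx-size", "4096"}.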
std::string common_preset::to_ini() const {
std::ostringstream ss;
ss << "[" << name << "]\n";
for (const auto & [opt, value] : options) {
auto escaped_value = value;
string_replace_all(escaped_value, "\n", "\\\n");
ss << rm_leading_dashes(opt.args.back()) << " = ";
ss << escaped_value << "\n";
}
ss << "\n";
return ss.str();
}
static std::map<std::string, std::map<std::string, std::string>> parse_ini_from_file(const std::string & path) {
std::map<std::string, std::map<std::string, std::string>> parsed;
if (!std::filesystem::exists(path)) {
throw std::runtime_error("preset file does not exist: " + path);
}
std::ifstream file(path);
if (!file.good()) {
throw std::runtime_error("failed to open server preset file: " + path);
}
std::string contents((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
static const auto parser = build_peg_parser([](auto & p) {
// newline ::= "\r\n" / "\n" / "\r"
auto newline = p.rule("newline", p.literal("\r\n") | p.literal("\n") | p.literal("\r"));
// ws ::= [ \t]*
auto ws = p.rule("ws", p.chars("[ \t]", 0, -1));
// comment ::= [;#] (!newline .)*
auto comment = p.rule("comment", p.chars("[;#]", 1, 1) + p.zero_or_more(p.negate(newline) + p.any()));
// eol ::= ws comment? (newline / EOF)
auto eol = p.rule("eol", ws + p.optional(comment) + (newline | p.end()));
// ident ::= [a-zA-Z_] [a-zA-Z0-9_.-]*
auto ident = p.rule("ident", p.chars("[a-zA-Z_]", 1, 1) + p.chars("[a-zA-Z0-9_.-]", 0, -1));
// value ::= (!eol-start .)*
auto eol_start = p.rule("eol-start", ws + (p.chars("[;#]", 1, 1) | newline | p.end()));
auto value = p.rule("value", p.zero_or_more(p.negate(eol_start) + p.any()));
// header-line ::= "[" ws ident ws "]" eol
auto header_line = p.rule("header-line", "[" + ws + p.tag("section-name", p.chars("[^]]")) + ws + "]" + eol);
// kv-line ::= ident ws "=" ws value eol
auto kv_line = p.rule("kv-line", p.tag("key", ident) + ws + "=" + ws + p.tag("value", value) + eol);
// comment-line ::= ws comment (newline / EOF)
auto comment_line = p.rule("comment-line", ws + comment + (newline | p.end()));
// blank-line ::= ws (newline / EOF)
auto blank_line = p.rule("blank-line", ws + (newline | p.end()));
// line ::= header-line / kv-line / comment-line / blank-line
auto line = p.rule("line", header_line | kv_line | comment_line | blank_line);
// ini ::= line* EOF
auto ini = p.rule("ini", p.zero_or_more(line) + p.end());
return ini;
});
common_peg_parse_context ctx(contents);
const auto result = parser.parse(ctx);
if (!result.success()) {
throw std::runtime_error("failed to parse server config file: " + path);
}
std::string current_section = COMMON_PRESET_DEFAULT_NAME;
std::string current_key;
ctx.ast.visit(result, [&](const auto & node) {
if (node.tag == "section-name") {
const std::string section = std::string(node.text);
current_section = section;
parsed[current_section] = {};
} else if (node.tag == "key") {
const std::string key = std::string(node.text);
current_key = key;
} else if (node.tag == "value" && !current_key.empty() && !current_section.empty()) {
parsed[current_section][current_key] = std::string(node.text);
current_key.clear();
}
});
return parsed;
}
static std::map<std::string, common_arg> get_map_key_opt(common_params_context & ctx_params) {
std::map<std::string, common_arg> mapping;
for (const auto & opt : ctx_params.options) {
for (const auto & env : opt.get_env()) {
mapping[env] = opt;
}
for (const auto & arg : opt.get_args()) {
mapping[rm_leading_dashes(arg)] = opt;
}
}
return mapping;
}
common_presets common_presets_load(const std::string & path, common_params_context & ctx_params) {
common_presets out;
auto key_to_opt = get_map_key_opt(ctx_params);
auto ini_data = parse_ini_from_file(path);
for (const auto & section : ini_data) {
common_preset preset;
if (section.first.empty()) {
preset.name = COMMON_PRESET_DEFAULT_NAME;
} else {
preset.name = section.first;
}
LOG_DBG("loading preset: %s\n", preset.name.c_str());
for (const auto & [key, value] : section.second) {
LOG_DBG("option: %s = %s\n", key.c_str(), value.c_str());
if (key_to_opt.find(key) != key_to_opt.end()) {
preset.options[key_to_opt[key]] = value;
LOG_DBG("accepted option: %s = %s\n", key.c_str(), value.c_str());
} else {
// TODO: maybe warn about unknown key?
}
}
out[preset.name] = preset;
}
return out;
}
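// Illustrative preset file (hypothetical contents, not from the sources)
// accepted by the PEG grammar in parse_ini_from_file(); keys are long
// option names without leading dashes and ';' or '#' start comments:
//
//   [default]
//   ctx-size = 4096     ; forwarded as --ctx-size 4096 by to_args()
//
//   [small]
//   model = /models/tiny.gguf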

common/preset.h Normal file
View File

@@ -0,0 +1,32 @@
#pragma once
#include "common.h"
#include "arg.h"
#include <string>
#include <vector>
#include <map>
//
// INI preset parser and writer
//
constexpr const char * COMMON_PRESET_DEFAULT_NAME = "default";
struct common_preset {
std::string name;
// TODO: support repeated args in the future
std::map<common_arg, std::string> options;
// convert preset to CLI argument list
std::vector<std::string> to_args() const;
// convert preset to INI format string
std::string to_ini() const;
// TODO: maybe implement to_env() if needed
};
// interface for multiple presets in one file
using common_presets = std::map<std::string, common_preset>;
common_presets common_presets_load(const std::string & path, common_params_context & ctx_params);

View File

@@ -383,6 +383,17 @@ class ModelBase:
s = self.model_tensors[name]
self.model_tensors[weight_name] = lambda w=w, s=s, bs=block_size: dequant_simple(w(), s(), bs)
tensors_to_remove.append(name)
if name.endswith(".activation_scale"): # unused
tensors_to_remove.append(name)
# mistral format
if name.endswith(".qscale_weight"):
weight_name = name.removesuffix("qscale_weight") + "weight"
w = self.model_tensors[weight_name]
s = self.model_tensors[name]
self.model_tensors[weight_name] = lambda w=w, s=s, bs=block_size: dequant_simple(w(), s(), bs)
tensors_to_remove.append(name)
if name.endswith(".qscale_act"):
tensors_to_remove.append(name)
elif quant_method == "gptq":
for name in self.model_tensors.keys():
if name.endswith(".qweight"):
@@ -2854,13 +2865,10 @@ class Mistral3Model(LlamaModel):
self.gguf_writer.add_attn_temperature_scale(rope_params["llama_4_scaling_beta"])
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None):
# TODO: probably not worth supporting quantized weight, as official BF16 is also available
if name.endswith("weight_scale_inv"):
raise ValueError("This is a quantized weight, please use BF16 weight instead")
name = name.replace("language_model.", "")
if "multi_modal_projector" in name or "vision_tower" in name:
return []
return super().modify_tensors(data_torch, name, bid)
@@ -5825,9 +5833,11 @@ class Gemma3Model(TextModel):
norm_shift = 1.0 # Gemma3RMSNorm adds 1.0 to the norm value
def set_vocab(self):
if (self.dir_model / "tokenizer.model").is_file():
self._set_vocab_sentencepiece()
self.gguf_writer.add_add_space_prefix(False)
else:
self._set_vocab_gpt2()
def set_gguf_parameters(self):
hparams = self.hparams
@@ -5845,13 +5855,24 @@ class Gemma3Model(TextModel):
self.gguf_writer.add_rope_freq_base(hparams.get("rope_theta", 1_000_000.0)) # for global layers
# attn_logit_softcapping is removed in Gemma3
assert hparams.get("attn_logit_softcapping") is None
if (final_logit_softcap := hparams.get("final_logit_softcapping")):
self.gguf_writer.add_final_logit_softcapping(final_logit_softcap)
if hparams.get("sliding_window_pattern") != 1:
self.gguf_writer.add_sliding_window(hparams["sliding_window"])
self.gguf_writer.add_head_count_kv(hparams.get("num_key_value_heads", 4))
if hparams.get("rope_scaling") is not None:
rope_scaling = hparams["rope_scaling"]
if rope_scaling["rope_type"] == "linear":
# important: this rope_scaling is only applied for global layers, and not used by 1B model
self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.LINEAR)
self.gguf_writer.add_rope_scaling_factor(rope_scaling["factor"])
elif rope_scaling["rope_type"] == "yarn":
self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
self.gguf_writer.add_rope_scaling_factor(rope_scaling["factor"])
self.gguf_writer.add_rope_scaling_orig_ctx_len(rope_scaling["original_max_position_embeddings"])
self.gguf_writer.add_rope_scaling_yarn_ext_factor(rope_scaling["extrapolation_factor"])
self.gguf_writer.add_rope_scaling_yarn_beta_fast(rope_scaling["beta_fast"])
self.gguf_writer.add_rope_scaling_yarn_beta_slow(rope_scaling["beta_slow"])
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
del bid # unused
@@ -5865,8 +5886,10 @@ class Gemma3Model(TextModel):
# remove OOV (out-of-vocabulary) rows in token_embd
if "embed_tokens.weight" in name:
if (self.dir_model / "tokenizer.model").is_file():
tokens = self._create_vocab_sentencepiece()[0]
else:
tokens = self.get_vocab_base()[0]
data_torch = data_torch[:len(tokens)]
# ref code in Gemma3RMSNorm
@@ -7263,6 +7286,10 @@ class DeepseekV2Model(TextModel):
self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.YARN)
self.gguf_writer.add_rope_scaling_factor(rope_scaling["factor"])
self.gguf_writer.add_rope_scaling_orig_ctx_len(rope_scaling["original_max_position_embeddings"])
# [TAG_DEEPSEEK2_YARN_LOG_MUL_FIX]
# note: for legacy reasons, this is not consistent with the other usages of self.gguf_writer.add_rope_scaling_yarn_log_mul
# ref https://github.com/ggml-org/llama.cpp/pull/17945
self.gguf_writer.add_rope_scaling_yarn_log_mul(0.1 * rope_scaling["mscale_all_dim"])
_experts: list[dict[str, Tensor]] | None = None
@@ -9883,6 +9910,18 @@ class MistralModel(LlamaModel):
self.gguf_writer.add_architecture()
self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
def dequant_model(self):
# transform quantization config into HF format
quant_config = self.hparams.get("quantization")
if quant_config is not None:
assert quant_config["qformat_weight"] == "fp8_e4m3"
self.hparams["quantization_config"] = {
"activation_scheme": "static",
"quant_method": "fp8",
"weight_block_size": None,
}
return super().dequant_model()
@staticmethod
def get_community_chat_template(vocab: MistralVocab, templates_dir: Path, is_mistral_format: bool):
assert TokenizerVersion is not None and Tekkenizer is not None and SentencePieceTokenizer is not None, _mistral_import_error_msg
@@ -10006,6 +10045,10 @@ class MistralMoeModel(DeepseekV2Model):
MistralModel.set_mistral_config(self.gguf_writer, self.hparams)
yarn_params = self.hparams["yarn"]
self.gguf_writer.add_attn_temperature_length(yarn_params["original_max_position_embeddings"])
# [TAG_DEEPSEEK2_YARN_LOG_MUL_FIX]
# note: for legacy reasons, this is not consistent with the other usages of self.gguf_writer.add_rope_scaling_yarn_log_mul
# ref https://github.com/ggml-org/llama.cpp/pull/17945
self.gguf_writer.add_rope_scaling_yarn_log_mul(0.1) # mscale_all_dim * 0.1
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None):

docs/backend/ZenDNN.md Normal file
View File

@@ -0,0 +1,258 @@
# llama.cpp for AMD ZenDNN
> [!WARNING]
> **Note:** ZenDNN is **not** the same as zDNN.
> - **ZenDNN** (this page): AMD's deep learning library for AMD EPYC CPUs
> - **zDNN**: IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes ([see zDNN documentation](zDNN.md))
- [Background](#background)
- [OS](#os)
- [Hardware](#hardware)
- [Supported Operations](#supported-operations)
- [DataType Supports](#datatype-supports)
- [Linux](#linux)
- [Environment Variable](#environment-variable)
- [Performance Optimization](#performance-optimization)
- [Known Issues](#known-issues)
- [Q&A](#qa)
- [TODO](#todo)
## Background
**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures.
**Llama.cpp + ZenDNN**
The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It utilizes ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL BLIS, LibXSMM, OneDNN).
For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html
## OS
| OS | Status | Verified |
|:-------:|:-------:|:----------------------------------------------:|
| Linux | Support | Ubuntu 20.04, 22.04, 24.04 |
For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/zendnnl/README.md#15-supported-os).
## Hardware
### AMD CPUs
**Recommended Processors**
ZenDNN is optimized for AMD EPYC™ processors and AMD Ryzen™ processors based on "Zen" microarchitecture and newer.
| CPU Family | Status | Notes |
|:-----------------------------:|:-------:|:----------------------------------:|
| AMD EPYC™ 9005 Series (Turin)| Support | 5th Gen - Zen 5 architecture |
| AMD EPYC™ 9004 Series (Genoa)| Support | 4th Gen - Zen 4 architecture |
| AMD EPYC™ 7003 Series (Milan)| Support | 3rd Gen - Zen 3 architecture |
| AMD Ryzen™ AI MAX (Strix Halo)| Support | High-performance mobile processors |
*Notes:*
- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., EPYC 9005 series).
- ZenDNN leverages AMD's advanced CPU features including AVX2 and AVX-512 instruction sets.
- For optimal performance, ensure your system has sufficient memory bandwidth.
## Supported Operations
The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend.
| Operation | Status | Notes |
|:-------------|:-------:|:----------------------------------------------:|
| MUL_MAT | ✓ | Accelerated via ZenDNN LowOHA MatMul |
*Note:* Since only MUL_MAT is accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs).
## DataType Supports
| DataType | Status | Notes |
|:----------------------:|:-------:|:---------------------------------------------:|
| FP32 | Support | Full precision floating point |
| BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) |
*Notes:*
- **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
## Linux
### I. Setup Environment
You have two options to set up ZenDNN:
#### Option 1: Automatic Download and Build (Recommended)
CMake will automatically download and build ZenDNN for you:
```sh
# Build llama.cpp - ZenDNN will be automatically downloaded and built
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```
No manual ZenDNN installation required. CMake will handle everything automatically.
#### Option 2: Use Custom ZenDNN Installation
If you want to build ZenDNN yourself or use a specific version:
**Step 1: Build ZenDNN from source**
```sh
# Clone ZenDNN repository
git clone https://github.com/amd/ZenDNN.git
cd ZenDNN
git checkout zendnnl
# Build and install (requires CMake >= 3.25)
mkdir build && cd build
cmake ..
cmake --build . --target all
```
Default installation path: `ZenDNN/build/install`
**For detailed build instructions**, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/zendnnl/README.md).
**Step 2: Build llama.cpp with custom ZenDNN path**
```sh
# Using environment variable
export ZENDNN_ROOT=/path/to/ZenDNN/build/install
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
# OR specify path directly in CMake
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
```
### II. Run the Server
#### 1. Download Model
Download LLaMA 3.1 8B Instruct BF16 model:
```sh
# Download from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```
#### 2. Start Server
Run llama.cpp server with ZenDNN acceleration:
```sh
# Set optimal configuration
export OMP_NUM_THREADS=64 # Adjust to your CPU core count
export ZENDNNL_MATMUL_ALGO=2 # Blocked AOCL BLIS for best performance
# Start server
./build/bin/llama-server \
-m models/Llama-3.1-8B-Instruct.BF16.gguf \
--host 0.0.0.0 \
--port 8080 \
-t 64
```
Access the server at `http://localhost:8080`.
**Performance tips**:
- Set `OMP_NUM_THREADS` to match your physical core count
- Use `ZENDNNL_MATMUL_ALGO=2` for optimal performance
- For NUMA systems: `numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server ...`
## Environment Variable
### Build Time
| Name | Value | Function |
|--------------------|---------------------------------------|---------------------------------------------|
| GGML_ZENDNN | ON/OFF | Enable ZenDNN backend support |
| ZENDNN_ROOT | Path to ZenDNN installation | Set ZenDNN installation directory |
| GGML_OPENMP | ON/OFF (recommended: ON) | Enable OpenMP for multi-threading |
### Runtime
| Name | Value | Function |
|-------------------------|--------------------------|-------------------------------------------------------------------|
| OMP_NUM_THREADS | Number (e.g., 64) | Set number of OpenMP threads (recommended: physical core count) |
| ZENDNNL_MATMUL_ALGO | 0-5 | Select MatMul backend algorithm (see Performance Optimization) |
| ZENDNNL_PROFILE_LOG_LEVEL | 0-4 | Profiling log level (0=disabled, 4=verbose) |
| ZENDNNL_ENABLE_PROFILER | 0 or 1 | Enable detailed profiling (1=enabled) |
| ZENDNNL_API_LOG_LEVEL | 0-4 | API log level (0=disabled, 4=verbose) |
**Example**:
```sh
export OMP_NUM_THREADS=64
export ZENDNNL_MATMUL_ALGO=2 # Use Blocked AOCL BLIS for best performance
./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Test" -n 100
```
## Performance Optimization
### MatMul Algorithm Selection
ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL BLIS** algorithm:
```sh
export ZENDNNL_MATMUL_ALGO=2 # Blocked AOCL BLIS (recommended)
```
**Available algorithms**:
| Value | Algorithm | Description |
|:-----:|:-----------------------|:----------------------------------------------|
| 0 | Dynamic Dispatch | Automatic backend selection (default) |
| 1 | AOCL BLIS | AOCL BLIS backend |
| 2 | AOCL BLIS Blocked | **Blocked AOCL BLIS (recommended)** |
| 3 | OneDNN | OneDNN backend |
| 4 | OneDNN Blocked | Blocked OneDNN |
| 5 | LibXSMM | LibXSMM backend |
### Profiling and Debugging
For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/zendnnl/docs/logging.md).
## Known Issues
- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend.
- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
## Q&A
**Q: How do I verify that ZenDNN backend is being used?**
A: Check the log output when running llama.cpp. You should see messages indicating the ZenDNN backend is initialized. You can also check the backend name in the output.
**Q: What performance improvement can I expect?**
A: Performance gains vary depending on the model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect 1.1x-2x speedup compared to standard CPU inference for matrix multiplication operations.
**Q: Can I use ZenDNN on non-AMD processors?**
A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures.
**Q: Does ZenDNN support quantized models?**
A: Currently, ZenDNN primarily supports FP32 and BF16 data types. Quantized model support is not available at this time.
**Q: Why is my inference not faster with ZenDNN?**
A: Ensure:
1. You're using an AMD EPYC or Ryzen processor (Zen 2 or newer)
2. `OMP_NUM_THREADS` is set appropriately (physical core count)
3. `ZENDNNL_MATMUL_ALGO=2` is set for best performance (Blocked AOCL BLIS)
4. You're using a sufficiently large model (small models may not benefit as much)
5. Enable profiling to verify ZenDNN MatMul is being called
### GitHub Contribution
Please add the **[ZenDNN]** prefix/tag to issue/PR titles so the ZenDNN team can check and address them without delay.
## TODO
- Expand operation support beyond MUL_MAT (attention operations, activations, etc.)

View File

@@ -1,5 +1,10 @@
# llama.cpp for IBM zDNN Accelerator
> [!WARNING]
> **Note:** zDNN is **not** the same as ZenDNN.
> - **zDNN** (this page): IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes
> - **ZenDNN**: AMD's deep learning library for AMD EPYC CPUs ([see ZenDNN documentation](ZenDNN.md))
## Background
IBM zDNN (Z Deep Neural Network) is a hardware acceleration library designed specifically to leverage the IBM NNPA (Neural Network Processor Assist) accelerator located within IBM Telum I and II processors. It provides significant performance improvements for neural network inference operations.

View File

@@ -19,6 +19,7 @@ cmake -B build \
-DGGML_RVV=ON \
-DGGML_RV_ZFH=ON \
-DGGML_RV_ZICBOP=ON \
-DGGML_RV_ZIHINTPAUSE=ON \
-DRISCV64_SPACEMIT_IME_SPEC=RISCV64_SPACEMIT_IME1 \
-DCMAKE_TOOLCHAIN_FILE=${PWD}/cmake/riscv64-spacemit-linux-gnu-gcc.cmake \
-DCMAKE_INSTALL_PREFIX=build/installed

View File

@@ -495,6 +495,38 @@ llama_new_context_with_model: CANN compute buffer size = 1260.81 MiB
For detailed info, such as model/device supports, CANN install, please refer to [llama.cpp for CANN](./backend/CANN.md).
## ZenDNN
ZenDNN provides optimized deep learning primitives for AMD EPYC™ CPUs. It accelerates matrix multiplication operations for inference workloads.
### Compilation
- Using `CMake` on Linux (automatic build):
```bash
cmake -B build -DGGML_ZENDNN=ON
cmake --build build --config Release
```
The first build will automatically download and build ZenDNN, which may take 5-10 minutes. Subsequent builds will be much faster.
- Using `CMake` with custom ZenDNN installation:
```bash
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/zendnn/install
cmake --build build --config Release
```
### Testing
You can test with:
```bash
./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -n 50
```
For detailed information about hardware support, setup instructions, and performance optimization, refer to [llama.cpp for ZenDNN](./backend/ZenDNN.md).
## Arm® KleidiAI™
KleidiAI is a library of optimized microkernels for AI workloads, specifically designed for Arm CPUs. These microkernels enhance performance and can be enabled for use by the CPU backend.

View File

@@ -56,7 +56,7 @@ docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:light -m /model
or with a server image:
```bash
docker run -v /path/to/models:/models -p 8080:8080 ghcr.io/ggml-org/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8080 --host 0.0.0.0 -n 512
```
## Docker With CUDA
@@ -91,7 +91,7 @@ After building locally, Usage is similar to the non-CUDA examples, but you'll ne
```bash
docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8080 --host 0.0.0.0 -n 512 --n-gpu-layers 1
```
## Docker With MUSA
@@ -125,5 +125,5 @@ After building locally, Usage is similar to the non-MUSA examples, but you'll ne
```bash
docker run -v /path/to/models:/models local/llama.cpp:full-musa --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models local/llama.cpp:light-musa -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models local/llama.cpp:server-musa -m /models/7B/ggml-model-q4_0.gguf --port 8080 --host 0.0.0.0 -n 512 --n-gpu-layers 1
```

View File

@@ -12,111 +12,112 @@ Legend:
- 🟡 Partially supported by this backend
- ❌ Not supported by this backend
| Operation | BLAS | CANN | CPU | CUDA | Metal | OpenCL | SYCL | Vulkan | WebGPU | ZenDNN | zDNN |
|-----------|------|------|------|------|------|------|------|------|------|------|------|
| ABS | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| ACC | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ADD | ❌ | ✅ | ✅ | ✅ | 🟡 | | ✅ | ✅ | ✅ | ❌ | ❌ |
| ADD1 | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ADD_ID | ❌ | ❌ | ✅ | ✅ | ✅ | | ❌ | ✅ | ❌ | ❌ | ❌ |
| ARANGE | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ARGMAX | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ARGSORT | ❌ | ✅ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | | ❌ | ❌ |
| CEIL | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| CLAMP | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| CONCAT | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| CONT | ❌ | 🟡 | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| CONV_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| CONV_2D_DW | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| CONV_3D | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| COS | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| COUNT_EQUAL | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CPY | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| CROSS_ENTROPY_LOSS | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CROSS_ENTROPY_LOSS_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CUMSUM | ❌ | ❌ | ✅ | | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| DIAG | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | | | ❌ | ❌ |
| DIAG_MASK_INF | ❌ | ✅ | ✅ | ✅ | | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| DIV | ❌ | ✅ | ✅ | ✅ | 🟡 | | | ✅ | ✅ | ❌ | ❌ |
| DUP | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | ✅ | ❌ | | ❌ |
| ELU | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | | ✅ | ❌ | ❌ |
| EXP | ❌ | | ✅ | 🟡 | 🟡 | ❌ | | 🟡 | ✅ | ❌ | ❌ |
| EXPM1 | ❌ | ❌ | ✅ | 🟡 | 🟡 | ❌ | ❌ | ❌ | | ❌ | ❌ |
| FILL | ❌ | | ✅ | | | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| FLASH_ATTN_EXT | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | ❌ | 🟡 | | ❌ | ❌ |
| FLOOR | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| GATED_LINEAR_ATTN | ❌ | ❌ | ✅ | ✅ | | | ✅ | | | | ❌ |
| GEGLU | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GEGLU_ERF | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GEGLU_QUICK | ❌ | ✅ | ✅ | ✅ | 🟡 | | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_ERF | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_QUICK | ❌ | | ✅ | 🟡 | 🟡 | 🟡 | | 🟡 | ✅ | ❌ | ❌ |
| GET_ROWS | ❌ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| GET_ROWS_BACK | ❌ | | 🟡 | 🟡 | | ❌ | ❌ | | | ❌ | ❌ |
| GROUP_NORM | ❌ | | | | | | ✅ | ✅ | ❌ | ❌ | ❌ |
| GROUP_NORM_MUL_ADD | ❌ | | | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| HARDSIGMOID | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| HARDSWISH | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| IM2COL | ❌ | ✅ | ✅ | ✅ | ✅ | | | ✅ | ❌ | ❌ | ❌ |
| IM2COL_3D | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | | ❌ | ❌ |
| L2_NORM | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| LEAKY_RELU | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | ✅ | 🟡 | | ❌ | ❌ |
| LOG | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| MEAN | ❌ | ✅ | ✅ | ✅ | | | ✅ | ✅ | ❌ | ❌ | ❌ |
| MUL | | | ✅ | ✅ | 🟡 | ✅ | | | | | |
| MUL_MAT | 🟡 | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| MUL_MAT_ID | ❌ | 🟡 | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | | | ❌ |
| NEG | ❌ | ✅ | ✅ | 🟡 | 🟡 | | ✅ | 🟡 | ✅ | ❌ | ❌ |
| NORM | ❌ | | | | | | ✅ | 🟡 | ❌ | ❌ | ❌ |
| NORM_MUL_ADD | ❌ | ❌ | ❌ | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| OPT_STEP_ADAMW | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| OPT_STEP_SGD | | ❌ | ✅ | ✅ | | ❌ | ❌ | | ❌ | ❌ | ❌ |
| OUT_PROD | 🟡 | | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ | ❌ |
| PAD | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ✅ | ❌ | ❌ | ❌ |
| PAD_REFLECT_1D | ❌ | | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| POOL_2D | ❌ | 🟡 | ✅ | ✅ | ✅ | | ✅ | ✅ | | | ❌ |
| REGLU | ❌ | ✅ | ✅ | | 🟡 | | ✅ | 🟡 | ✅ | ❌ | ❌ |
| RELU | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| REPEAT | ❌ | | ✅ | 🟡 | | 🟡 | ✅ | 🟡 | ❌ | ❌ | ❌ |
| REPEAT_BACK | ❌ | ❌ | ✅ | ✅ | | | ✅ | ✅ | ❌ | ❌ | ❌ |
| RMS_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | ❌ | ❌ |
| RMS_NORM_BACK | ❌ | ❌ | ✅ | | ❌ | ❌ | ✅ | | ❌ | ❌ | ❌ |
| RMS_NORM_MUL_ADD | ❌ | ✅ | ❌ | | | ❌ | ❌ | | | ❌ | ❌ |
| ROLL | ❌ | | ✅ | ✅ | | | ✅ | ✅ | ❌ | ❌ | ❌ |
| ROPE | ❌ | 🟡 | ✅ | ✅ | ✅ | ✅ | | | ✅ | ❌ | ❌ |
| ROPE_BACK | ❌ | ❌ | ✅ | | ❌ | ❌ | | ✅ | ❌ | ❌ | ❌ |
| ROUND | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| RWKV_WKV6 | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| RWKV_WKV7 | ❌ | ❌ | | ✅ | ✅ | | ✅ | ✅ | | | ❌ |
| SCALE | ❌ | 🟡 | ✅ | ✅ | | | | ✅ | ✅ | ❌ | ❌ |
| SET | ❌ | ❌ | | | ❌ | ❌ | 🟡 | | | | ❌ |
| SET_ROWS | ❌ | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | | ❌ |
| SGN | ❌ | ✅ | ✅ | 🟡 | 🟡 | | ✅ | | ✅ | ❌ | ❌ |
| SIGMOID | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SILU | ❌ | | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SILU_BACK | ❌ | ❌ | ✅ | ✅ | | | ❌ | | | ❌ | ❌ |
| SIN | ❌ | | ✅ | ✅ | 🟡 | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| SOFTCAP | ❌ | ❌ | | | ❌ | ❌ | ❌ | ❌ | | ❌ | ❌ |
| SOFTPLUS | ❌ | | ✅ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | | | ❌ |
| SOFT_MAX | ❌ | 🟡 | ✅ | ✅ | ✅ | | | | ✅ | ❌ | ❌ |
| SOFT_MAX_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| SOLVE_TRI | ❌ | ❌ | ✅ | 🟡 | | | ❌ | 🟡 | | ❌ | ❌ |
| SQR | ❌ | ✅ | ✅ | ✅ | 🟡 | | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| SQRT | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| SSM_CONV | ❌ | ❌ | ✅ | ✅ | ✅ | | | ✅ | ❌ | ❌ | ❌ |
| SSM_SCAN | ❌ | ❌ | ✅ | ✅ | | | ❌ | 🟡 | | | ❌ |
| STEP | ❌ | ✅ | ✅ | 🟡 | 🟡 | | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SUB | ❌ | ✅ | ✅ | | 🟡 | | | ✅ | ✅ | ❌ | ❌ |
| SUM | ❌ | ✅ | ✅ | 🟡 | 🟡 | | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| SUM_ROWS | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | 🟡 | ✅ | | | ❌ |
| SWIGLU | ❌ | | ✅ | ✅ | 🟡 | | | 🟡 | ✅ | ❌ | ❌ |
| SWIGLU_OAI | ❌ | ❌ | ✅ | ✅ | | ✅ | | 🟡 | ✅ | ❌ | ❌ |
| TANH | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| TIMESTEP_EMBEDDING | ❌ | ✅ | ✅ | | | ✅ | | | | ❌ | ❌ |
| TOP_K | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ |
| TRI | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | | | ❌ | ❌ |
| TRUNC | ❌ | | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| UPSCALE | ❌ | 🟡 | | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ | ❌ |
| XIELU | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |

View File

@@ -4964,6 +4964,7 @@
"CPU","CONV_TRANSPOSE_1D","ne_input=[2,1,1,1],ne_kernel=[3,1,1,1],s0=1,p0=0,d0=1","support","1","yes","CPU"
"CPU","CONV_TRANSPOSE_2D","ne_input=[3,2,3,1],ne_kernel=[2,2,1,3],stride=1","support","1","yes","CPU"
"CPU","CONV_TRANSPOSE_2D","ne_input=[10,10,9,1],ne_kernel=[3,3,1,9],stride=2","support","1","yes","CPU"
"CPU","CONV_TRANSPOSE_2D","ne_input=[129,63,35,1],ne_kernel=[3,3,48,35],stride=1","support","1","yes","CPU"
"CPU","COUNT_EQUAL","type=f32,ne=[4,500,1,1]","support","1","yes","CPU"
"CPU","COUNT_EQUAL","type=f32,ne=[4,5000,1,1]","support","1","yes","CPU"
"CPU","ARGMAX","type=f32,ne=[32,1,1,1]","support","1","yes","CPU"
@@ -5419,17 +5420,45 @@
"CPU","CPY","type_src=f16,type_dst=f16,ne=[256,4,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1","support","1","yes","CPU"
"CPU","CPY","type_src=f32,type_dst=f32,ne=[256,4,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1","support","1","yes","CPU"
"CPU","CPY","type_src=bf16,type_dst=bf16,ne=[256,4,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1","support","1","yes","CPU"
"CPU","CPY","type_src=i32,type_dst=i32,ne=[256,4,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1","support","1","yes","CPU"
"CPU","CPY","type_src=i32,type_dst=i32,ne=[256,1,4,1],permute_src=[1,2,0,3],permute_dst=[0,0,0,0],_src_transpose=0","support","1","yes","CPU"
"CPU","CPY","type_src=f32,type_dst=f32,ne=[256,1,4,1],permute_src=[1,2,0,3],permute_dst=[0,0,0,0],_src_transpose=0","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[10,10,10,1]","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[2,1,1,1]","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[2,1,3,5]","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[2,3,5,7]","support","1","yes","CPU"
"CPU","CONT","type=f16,ne=[2,1,1,1]","support","1","yes","CPU"
"CPU","CONT","type=f16,ne=[2,1,3,5]","support","1","yes","CPU"
"CPU","CONT","type=f16,ne=[2,3,5,7]","support","1","yes","CPU"
"CPU","CONT","type=bf16,ne=[2,1,1,1]","support","1","yes","CPU"
"CPU","CONT","type=bf16,ne=[2,1,3,5]","support","1","yes","CPU"
"CPU","CONT","type=bf16,ne=[2,3,5,7]","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[2,1,1,1],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[2,1,3,5],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[2,3,5,7],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[1,4,4,1],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[1,8,17,1],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[10,10,10,1],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[2,1,1,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[2,1,3,5],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[2,3,5,7],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[1,4,4,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[1,8,17,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f32,ne=[10,10,10,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[2,1,1,1],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[2,1,3,5],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[2,3,5,7],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[1,4,4,1],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[1,8,17,1],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[10,10,10,1],use_view_slice=1","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[2,1,1,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[2,1,3,5],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[2,3,5,7],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[1,4,4,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[1,8,17,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=i32,ne=[10,10,10,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f16,ne=[2,1,1,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f16,ne=[2,1,3,5],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f16,ne=[2,3,5,7],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f16,ne=[1,4,4,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f16,ne=[1,8,17,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=f16,ne=[10,10,10,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=bf16,ne=[2,1,1,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=bf16,ne=[2,1,3,5],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=bf16,ne=[2,3,5,7],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=bf16,ne=[1,4,4,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=bf16,ne=[1,8,17,1],use_view_slice=0","support","1","yes","CPU"
"CPU","CONT","type=bf16,ne=[10,10,10,1],use_view_slice=0","support","1","yes","CPU"
"CPU","ADD","type=f16,ne=[1,1,8,1],nr=[1,1,1,1],nf=1","support","1","yes","CPU"
"CPU","SUB","type=f16,ne=[1,1,8,1],nr=[1,1,1,1],nf=1","support","1","yes","CPU"
"CPU","MUL","type=f16,ne=[1,1,8,1],nr=[1,1,1,1],nf=1","support","1","yes","CPU"
@@ -5655,6 +5684,7 @@
"CPU","MUL","type=f32,ne=[64,262144,1,1],nr=[1,1,1,1],nf=1","support","1","yes","CPU"
"CPU","DIV","type=f32,ne=[64,262144,1,1],nr=[1,1,1,1],nf=1","support","1","yes","CPU"
"CPU","ADD1","type=f32,ne=[10,5,4,3]","support","1","yes","CPU"
"CPU","ADD1","type=f32,ne=[1024,1024,1,1]","support","1","yes","CPU"
"CPU","SCALE","type=f32,ne=[10,10,10,10],scale=2.000000,bias=0.000000,inplace=0","support","1","yes","CPU"
"CPU","SCALE","type=f32,ne=[10,10,10,10],scale=2.000000,bias=1.000000,inplace=0","support","1","yes","CPU"
"CPU","SCALE","type=f32,ne=[10,10,10,10],scale=2.000000,bias=1.000000,inplace=1","support","1","yes","CPU"
@@ -8644,9 +8674,13 @@
"CPU","CLAMP","type=f16,ne=[7,1,5,3],min=-0.500000,max=0.500000","support","1","yes","CPU"
"CPU","LEAKY_RELU","type=f16,ne_a=[7,1,5,3],negative_slope=0.100000","support","1","yes","CPU"
"CPU","FLOOR","type=f16,ne=[7,1,5,3]","support","1","yes","CPU"
"CPU","FLOOR","type=f16,ne=[1024,1024,1,1]","support","1","yes","CPU"
"CPU","CEIL","type=f16,ne=[7,1,5,3]","support","1","yes","CPU"
"CPU","CEIL","type=f16,ne=[1024,1024,1,1]","support","1","yes","CPU"
"CPU","ROUND","type=f16,ne=[7,1,5,3]","support","1","yes","CPU"
"CPU","ROUND","type=f16,ne=[1024,1024,1,1]","support","1","yes","CPU"
"CPU","TRUNC","type=f16,ne=[7,1,5,3]","support","1","yes","CPU"
"CPU","TRUNC","type=f16,ne=[1024,1024,1,1]","support","1","yes","CPU"
"CPU","SQR","type=f32,ne=[10,5,4,3]","support","1","yes","CPU"
"CPU","SQRT","type=f32,ne=[10,3,3,2]","support","1","yes","CPU"
"CPU","LOG","type=f32,ne=[10,5,4,3]","support","1","yes","CPU"
@@ -8666,9 +8700,13 @@
"CPU","CLAMP","type=f32,ne=[7,1,5,3],min=-0.500000,max=0.500000","support","1","yes","CPU"
"CPU","LEAKY_RELU","type=f32,ne_a=[7,1,5,3],negative_slope=0.100000","support","1","yes","CPU"
"CPU","FLOOR","type=f32,ne=[7,1,5,3]","support","1","yes","CPU"
"CPU","FLOOR","type=f32,ne=[1024,1024,1,1]","support","1","yes","CPU"
"CPU","CEIL","type=f32,ne=[7,1,5,3]","support","1","yes","CPU"
"CPU","CEIL","type=f32,ne=[1024,1024,1,1]","support","1","yes","CPU"
"CPU","ROUND","type=f32,ne=[7,1,5,3]","support","1","yes","CPU"
"CPU","ROUND","type=f32,ne=[1024,1024,1,1]","support","1","yes","CPU"
"CPU","TRUNC","type=f32,ne=[7,1,5,3]","support","1","yes","CPU"
"CPU","TRUNC","type=f32,ne=[1024,1024,1,1]","support","1","yes","CPU"
"CPU","DIAG_MASK_INF","type=f32,ne=[10,10,1,1],n_past=5","support","1","yes","CPU"
"CPU","DIAG_MASK_INF","type=f32,ne=[10,10,3,1],n_past=5","support","1","yes","CPU"
"CPU","DIAG_MASK_INF","type=f32,ne=[10,10,3,2],n_past=5","support","1","yes","CPU"
@@ -9411,18 +9449,405 @@
"CPU","CONCAT","type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=3","support","1","yes","CPU"
"CPU","CONCAT","type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=3","support","1","yes","CPU"
"CPU","CONCAT","type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=3","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[3,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[4,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[7,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[8,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[15,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[31,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[32,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[63,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[64,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[127,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[128,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[255,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[256,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[511,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[512,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1023,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1024,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2047,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2048,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[4095,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[4096,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[8191,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[8192,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16383,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16384,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[32767,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[32768,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[65535,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[65536,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[131071,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[131072,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[262143,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[262144,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[524287,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[524288,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1048575,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1048576,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16,10,10,10],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[60,10,10,10],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1024,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16384,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1023,2,1,3],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1024,2,1,3],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1025,2,1,3],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2047,2,1,3],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2048,2,1,3],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2049,2,1,3],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2,8,8192,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[8,1,1,1],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[3,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[4,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[7,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[8,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[15,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[31,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[32,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[63,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[64,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[127,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[128,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[255,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[256,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[511,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[512,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1023,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1024,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2047,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2048,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[4095,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[4096,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[8191,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[8192,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16383,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16384,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[32767,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[32768,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[65535,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[65536,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[131071,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[131072,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[262143,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[262144,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[524287,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[524288,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1048575,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1048576,1,1,1],order=0","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16,10,10,10],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[60,10,10,10],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1024,1,1,1],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[16384,1,1,1],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1023,2,1,3],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1024,2,1,3],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[1025,2,1,3],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2047,2,1,3],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2048,2,1,3],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2049,2,1,3],order=1","support","1","yes","CPU"
"CPU","ARGSORT","type=f32,ne=[2,8,8192,1],order=1","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[12,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[13,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[13,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[15,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[15,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[15,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[19,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[19,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[19,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[19,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[27,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[27,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[27,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[27,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[27,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[43,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[43,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[43,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[43,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[43,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[64,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[75,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[64,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[75,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[64,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[75,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[64,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[75,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[64,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[75,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[128,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[139,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[128,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[139,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[128,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[139,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[128,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[139,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[128,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[139,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[128,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[139,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[256,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[267,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[256,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[267,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[256,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[267,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[256,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[267,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[256,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[267,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[256,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[267,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[512,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[523,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[512,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[523,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[512,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[523,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[512,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[523,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[512,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[523,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[512,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[523,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[512,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[523,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1035,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1035,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1035,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1035,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1035,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1035,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1035,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1035,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2059,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2059,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2059,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2059,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2059,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2059,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2059,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2059,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4096,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4107,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4096,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4107,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4096,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4107,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4096,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4107,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4096,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4107,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4096,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4107,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4096,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4107,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4096,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[4107,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8192,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8203,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8192,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8203,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8192,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8203,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8192,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8203,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8192,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8203,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8192,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8203,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8192,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8203,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8192,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[8203,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16395,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16395,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16395,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16395,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16395,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16395,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16395,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16395,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16395,1,2,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32768,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32779,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32768,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32779,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32768,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32779,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32768,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32779,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32768,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32779,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32768,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32779,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32768,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32779,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32768,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32779,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32768,1,1,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[32779,1,2,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65536,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65547,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65536,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65547,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65536,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65547,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65536,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65547,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65536,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65547,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65536,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65547,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65536,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65547,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65536,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65547,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65536,1,1,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[65547,1,2,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131072,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131083,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131072,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131083,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131072,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131083,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131072,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131083,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131072,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131083,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131072,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131083,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131072,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131083,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131072,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131083,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131072,1,1,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[131083,1,2,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262144,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262155,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262144,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262155,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262144,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262155,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262144,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262155,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262144,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262155,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262144,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262155,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262144,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262155,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262144,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262155,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262144,1,1,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[262155,1,2,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524288,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524299,1,2,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524288,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524299,1,2,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524288,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524299,1,2,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524288,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524299,1,2,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524288,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524299,1,2,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524288,1,1,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524299,1,2,1],k=100,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524288,1,1,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524299,1,2,1],k=500,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524288,1,1,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524299,1,2,1],k=1023,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524288,1,1,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[524299,1,2,1],k=9999,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,10,10,10],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[60,10,10,10],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1023,2,1,3],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,2,1,3],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1025,2,1,3],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2047,2,1,3],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,2,1,3],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2049,2,1,3],k=1,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,10,10,10],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[60,10,10,10],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1023,2,1,3],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,2,1,3],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1025,2,1,3],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2047,2,1,3],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,2,1,3],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2049,2,1,3],k=2,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,10,10,10],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[60,10,10,10],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1023,2,1,3],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,2,1,3],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1025,2,1,3],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2047,2,1,3],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,2,1,3],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2049,2,1,3],k=3,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,10,10,10],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[60,10,10,10],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1023,2,1,3],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,2,1,3],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1025,2,1,3],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2047,2,1,3],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,2,1,3],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2049,2,1,3],k=7,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16,10,10,10],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[60,10,10,10],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1023,2,1,3],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1024,2,1,3],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[1025,2,1,3],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[16384,1,1,1],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2047,2,1,3],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2048,2,1,3],k=15,ties=0","support","1","yes","CPU"
"CPU","TOP_K","type=f32,ne=[2049,2,1,3],k=15,ties=0","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=nearest,transpose=0","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=nearest,transpose=1","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=nearest,flags=none","support","1","yes","CPU"
@@ -9435,6 +9860,10 @@
"CPU","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bicubic,transpose=1","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bicubic,flags=none","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bicubic,flags=none","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=513,transpose=0","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=513,transpose=1","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear,flags=none","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bilinear,flags=none","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear,flags=align_corners","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[1,4,3,2],ne_tgt=[2,8,3,2],mode=bilinear,flags=align_corners","support","1","yes","CPU"
"CPU","UPSCALE","type=f32,ne=[4,1,3,2],ne_tgt=[1,1,3,2],mode=bilinear,flags=align_corners","support","1","yes","CPU"
@@ -9463,15 +9892,30 @@
"CPU","GROUP_NORM","type=f32,ne=[64,64,320,1],num_groups=32,eps=0.000001","support","1","yes","CPU"
"CPU","GROUP_NORM","type=f32,ne=[9,9,1280,1],num_groups=32,eps=0.000001","support","1","yes","CPU"
"CPU","ACC","type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1]","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,1,1],pad_0=1,pad_1=1","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,3,1],lp0=1,rp0=1,lp1=1,rp1=1,lp2=1,rp2=1,lp3=1,rp3=1,v=0","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,1,1],pad_0=1,pad_1=1,circular=0","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[33,17,2,1],pad_0=4,pad_1=3,circular=1","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,3,1],lp0=1,rp0=1,lp1=1,rp1=1,lp2=1,rp2=1,lp3=1,rp3=1,v=0,circular=0","support","1","yes","CPU"
"CPU","PAD_REFLECT_1D","type=f32,ne_a=[512,34,2,1],pad_0=10,pad_1=9","support","1","yes","CPU"
"CPU","PAD_REFLECT_1D","type=f32,ne_a=[3000,384,4,1],pad_0=10,pad_1=9","support","1","yes","CPU"
"CPU","ROLL","shift0=3,shift1=-2,shift3=1,shift4=-1","support","1","yes","CPU"
"CPU","ARANGE","type=f32,start=0.000000,stop=10.000000,step=1.000000","support","1","yes","CPU"
"CPU","ARANGE","type=f32,start=0.000000,stop=1048576.000000,step=1.000000","support","1","yes","CPU"
"CPU","TIMESTEP_EMBEDDING","type=f32,ne_a=[2,1,1,1],dim=320,max_period=10000","support","1","yes","CPU"
"CPU","LEAKY_RELU","type=f32,ne_a=[10,5,4,3],negative_slope=0.100000","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[10,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[127,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[128,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[128,128,4,4]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[255,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[256,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[511,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[512,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[1023,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[1024,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[2047,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[2048,5,4,3]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[242004,1,1,1]","support","1","yes","CPU"
"CPU","CUMSUM","type=f32,ne=[375960,1,1,1]","support","1","yes","CPU"
"CPU","XIELU","type=f32,ne=[10,5,4,3]","support","1","yes","CPU"
"CPU","TRI","type=f32,ne=[10,10,4,3],tri_type=3","support","1","yes","CPU"
"CPU","TRI","type=f32,ne=[10,10,4,3],tri_type=2","support","1","yes","CPU"
@@ -9480,6 +9924,10 @@
"CPU","FILL","type=f32,ne=[10,10,4,3],c=0.000000","support","1","yes","CPU"
"CPU","FILL","type=f32,ne=[303,207,11,3],c=2.000000","support","1","yes","CPU"
"CPU","FILL","type=f32,ne=[800,600,4,4],c=-152.000000","support","1","yes","CPU"
"CPU","FILL","type=f32,ne=[2048,512,2,2],c=3.500000","support","1","yes","CPU"
"CPU","DIAG","type=f32,ne=[10,1,4,3]","support","1","yes","CPU"
"CPU","DIAG","type=f32,ne=[79,1,19,13]","support","1","yes","CPU"
"CPU","DIAG","type=f32,ne=[256,1,8,16]","support","1","yes","CPU"
"CPU","SOLVE_TRI","type=f32,ne_lhs=[10,10,4,3],ne_rhs=[3,10,4,3]","support","1","yes","CPU"
"CPU","SOLVE_TRI","type=f32,ne_lhs=[11,11,1,1],ne_rhs=[5,11,1,1]","support","1","yes","CPU"
"CPU","SOLVE_TRI","type=f32,ne_lhs=[17,17,2,4],ne_rhs=[9,17,2,4]","support","1","yes","CPU"
@@ -9487,10 +9935,16 @@
"CPU","SOLVE_TRI","type=f32,ne_lhs=[42,42,5,2],ne_rhs=[10,42,5,2]","support","1","yes","CPU"
"CPU","SOLVE_TRI","type=f32,ne_lhs=[64,64,2,2],ne_rhs=[10,64,2,2]","support","1","yes","CPU"
"CPU","SOLVE_TRI","type=f32,ne_lhs=[100,100,4,4],ne_rhs=[41,100,4,4]","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=0","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=0","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=1","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=1","support","1","yes","CPU"
"CPU","SOLVE_TRI","type=f32,ne_lhs=[128,128,4,4],ne_rhs=[31,128,4,4]","support","1","yes","CPU"
"CPU","SOLVE_TRI","type=f32,ne_lhs=[64,64,4,4],ne_rhs=[300,64,4,4]","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=0,circular=0","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=0,circular=0","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=0,circular=1","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=0,circular=1","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=1,circular=0","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=1,circular=0","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=1,circular=1","support","1","yes","CPU"
"CPU","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=1,circular=1","support","1","yes","CPU"
"CPU","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f32,permute=[0,1,2,3]","support","1","yes","CPU"
"CPU","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f16,permute=[0,1,2,3]","support","1","yes","CPU"
"CPU","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=bf16,permute=[0,1,2,3]","support","1","yes","CPU"
(Remainder of this file's diff omitted: the file is too large to render.)


@@ -4964,6 +4964,7 @@
"CUDA0","CONV_TRANSPOSE_1D","ne_input=[2,1,1,1],ne_kernel=[3,1,1,1],s0=1,p0=0,d0=1","support","1","yes","CUDA"
"CUDA0","CONV_TRANSPOSE_2D","ne_input=[3,2,3,1],ne_kernel=[2,2,1,3],stride=1","support","1","yes","CUDA"
"CUDA0","CONV_TRANSPOSE_2D","ne_input=[10,10,9,1],ne_kernel=[3,3,1,9],stride=2","support","1","yes","CUDA"
"CUDA0","CONV_TRANSPOSE_2D","ne_input=[129,63,35,1],ne_kernel=[3,3,48,35],stride=1","support","1","yes","CUDA"
"CUDA0","COUNT_EQUAL","type=f32,ne=[4,500,1,1]","support","1","yes","CUDA"
"CUDA0","COUNT_EQUAL","type=f32,ne=[4,5000,1,1]","support","1","yes","CUDA"
"CUDA0","ARGMAX","type=f32,ne=[32,1,1,1]","support","1","yes","CUDA"
@@ -5419,17 +5420,45 @@
"CUDA0","CPY","type_src=f16,type_dst=f16,ne=[256,4,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1","support","1","yes","CUDA"
"CUDA0","CPY","type_src=f32,type_dst=f32,ne=[256,4,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1","support","1","yes","CUDA"
"CUDA0","CPY","type_src=bf16,type_dst=bf16,ne=[256,4,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1","support","1","yes","CUDA"
"CUDA0","CPY","type_src=i32,type_dst=i32,ne=[256,4,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1","support","1","yes","CUDA"
"CUDA0","CPY","type_src=i32,type_dst=i32,ne=[256,1,4,1],permute_src=[1,2,0,3],permute_dst=[0,0,0,0],_src_transpose=0","support","1","yes","CUDA"
"CUDA0","CPY","type_src=f32,type_dst=f32,ne=[256,1,4,1],permute_src=[1,2,0,3],permute_dst=[0,0,0,0],_src_transpose=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[10,10,10,1]","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[2,1,1,1]","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[2,1,3,5]","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[2,3,5,7]","support","1","yes","CUDA"
"CUDA0","CONT","type=f16,ne=[2,1,1,1]","support","1","yes","CUDA"
"CUDA0","CONT","type=f16,ne=[2,1,3,5]","support","1","yes","CUDA"
"CUDA0","CONT","type=f16,ne=[2,3,5,7]","support","1","yes","CUDA"
"CUDA0","CONT","type=bf16,ne=[2,1,1,1]","support","1","yes","CUDA"
"CUDA0","CONT","type=bf16,ne=[2,1,3,5]","support","1","yes","CUDA"
"CUDA0","CONT","type=bf16,ne=[2,3,5,7]","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[2,1,1,1],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[2,1,3,5],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[2,3,5,7],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[1,4,4,1],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[1,8,17,1],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[10,10,10,1],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[2,1,1,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[2,1,3,5],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[2,3,5,7],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[1,4,4,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[1,8,17,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f32,ne=[10,10,10,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[2,1,1,1],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[2,1,3,5],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[2,3,5,7],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[1,4,4,1],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[1,8,17,1],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[10,10,10,1],use_view_slice=1","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[2,1,1,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[2,1,3,5],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[2,3,5,7],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[1,4,4,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[1,8,17,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=i32,ne=[10,10,10,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f16,ne=[2,1,1,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f16,ne=[2,1,3,5],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f16,ne=[2,3,5,7],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f16,ne=[1,4,4,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f16,ne=[1,8,17,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=f16,ne=[10,10,10,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=bf16,ne=[2,1,1,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=bf16,ne=[2,1,3,5],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=bf16,ne=[2,3,5,7],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=bf16,ne=[1,4,4,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=bf16,ne=[1,8,17,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","CONT","type=bf16,ne=[10,10,10,1],use_view_slice=0","support","1","yes","CUDA"
"CUDA0","ADD","type=f16,ne=[1,1,8,1],nr=[1,1,1,1],nf=1","support","1","yes","CUDA"
"CUDA0","SUB","type=f16,ne=[1,1,8,1],nr=[1,1,1,1],nf=1","support","1","yes","CUDA"
"CUDA0","MUL","type=f16,ne=[1,1,8,1],nr=[1,1,1,1],nf=1","support","1","yes","CUDA"
@@ -5655,6 +5684,7 @@
"CUDA0","MUL","type=f32,ne=[64,262144,1,1],nr=[1,1,1,1],nf=1","support","1","yes","CUDA"
"CUDA0","DIV","type=f32,ne=[64,262144,1,1],nr=[1,1,1,1],nf=1","support","1","yes","CUDA"
"CUDA0","ADD1","type=f32,ne=[10,5,4,3]","support","1","yes","CUDA"
"CUDA0","ADD1","type=f32,ne=[1024,1024,1,1]","support","1","yes","CUDA"
"CUDA0","SCALE","type=f32,ne=[10,10,10,10],scale=2.000000,bias=0.000000,inplace=0","support","1","yes","CUDA"
"CUDA0","SCALE","type=f32,ne=[10,10,10,10],scale=2.000000,bias=1.000000,inplace=0","support","1","yes","CUDA"
"CUDA0","SCALE","type=f32,ne=[10,10,10,10],scale=2.000000,bias=1.000000,inplace=1","support","1","yes","CUDA"
@@ -8644,9 +8674,13 @@
"CUDA0","CLAMP","type=f16,ne=[7,1,5,3],min=-0.500000,max=0.500000","support","1","yes","CUDA"
"CUDA0","LEAKY_RELU","type=f16,ne_a=[7,1,5,3],negative_slope=0.100000","support","1","yes","CUDA"
"CUDA0","FLOOR","type=f16,ne=[7,1,5,3]","support","1","yes","CUDA"
"CUDA0","FLOOR","type=f16,ne=[1024,1024,1,1]","support","1","yes","CUDA"
"CUDA0","CEIL","type=f16,ne=[7,1,5,3]","support","1","yes","CUDA"
"CUDA0","CEIL","type=f16,ne=[1024,1024,1,1]","support","1","yes","CUDA"
"CUDA0","ROUND","type=f16,ne=[7,1,5,3]","support","1","yes","CUDA"
"CUDA0","ROUND","type=f16,ne=[1024,1024,1,1]","support","1","yes","CUDA"
"CUDA0","TRUNC","type=f16,ne=[7,1,5,3]","support","1","yes","CUDA"
"CUDA0","TRUNC","type=f16,ne=[1024,1024,1,1]","support","1","yes","CUDA"
"CUDA0","SQR","type=f32,ne=[10,5,4,3]","support","1","yes","CUDA"
"CUDA0","SQRT","type=f32,ne=[10,3,3,2]","support","1","yes","CUDA"
"CUDA0","LOG","type=f32,ne=[10,5,4,3]","support","1","yes","CUDA"
@@ -8666,9 +8700,13 @@
"CUDA0","CLAMP","type=f32,ne=[7,1,5,3],min=-0.500000,max=0.500000","support","1","yes","CUDA"
"CUDA0","LEAKY_RELU","type=f32,ne_a=[7,1,5,3],negative_slope=0.100000","support","1","yes","CUDA"
"CUDA0","FLOOR","type=f32,ne=[7,1,5,3]","support","1","yes","CUDA"
"CUDA0","FLOOR","type=f32,ne=[1024,1024,1,1]","support","1","yes","CUDA"
"CUDA0","CEIL","type=f32,ne=[7,1,5,3]","support","1","yes","CUDA"
"CUDA0","CEIL","type=f32,ne=[1024,1024,1,1]","support","1","yes","CUDA"
"CUDA0","ROUND","type=f32,ne=[7,1,5,3]","support","1","yes","CUDA"
"CUDA0","ROUND","type=f32,ne=[1024,1024,1,1]","support","1","yes","CUDA"
"CUDA0","TRUNC","type=f32,ne=[7,1,5,3]","support","1","yes","CUDA"
"CUDA0","TRUNC","type=f32,ne=[1024,1024,1,1]","support","1","yes","CUDA"
"CUDA0","DIAG_MASK_INF","type=f32,ne=[10,10,1,1],n_past=5","support","1","yes","CUDA"
"CUDA0","DIAG_MASK_INF","type=f32,ne=[10,10,3,1],n_past=5","support","1","yes","CUDA"
"CUDA0","DIAG_MASK_INF","type=f32,ne=[10,10,3,2],n_past=5","support","1","yes","CUDA"
@@ -9411,18 +9449,405 @@
"CUDA0","CONCAT","type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=2,v=3","support","0","no","CUDA"
"CUDA0","CONCAT","type=f32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=3","support","1","yes","CUDA"
"CUDA0","CONCAT","type=i32,ne_a=[11,12,13,14],ne_b_d=7,dim=3,v=3","support","0","no","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[3,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[4,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[7,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[8,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[15,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[31,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[32,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[63,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[64,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[127,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[128,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[255,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[256,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[511,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[512,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1023,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1024,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2047,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2048,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[4095,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[4096,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[8191,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[8192,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16383,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16384,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[32767,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[32768,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[65535,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[65536,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[131071,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[131072,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[262143,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[262144,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[524287,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[524288,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1048575,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1048576,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16,10,10,10],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[60,10,10,10],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1024,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16384,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1023,2,1,3],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1024,2,1,3],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1025,2,1,3],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2047,2,1,3],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2048,2,1,3],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2049,2,1,3],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2,8,8192,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[8,1,1,1],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[3,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[4,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[7,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[8,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[15,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[31,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[32,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[63,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[64,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[127,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[128,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[255,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[256,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[511,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[512,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1023,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1024,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2047,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2048,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[4095,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[4096,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[8191,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[8192,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16383,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16384,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[32767,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[32768,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[65535,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[65536,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[131071,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[131072,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[262143,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[262144,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[524287,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[524288,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1048575,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1048576,1,1,1],order=0","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16,10,10,10],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[60,10,10,10],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1024,1,1,1],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[16384,1,1,1],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1023,2,1,3],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1024,2,1,3],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[1025,2,1,3],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2047,2,1,3],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2048,2,1,3],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2049,2,1,3],order=1","support","1","yes","CUDA"
"CUDA0","ARGSORT","type=f32,ne=[2,8,8192,1],order=1","support","1","yes","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[12,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[13,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[13,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[15,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[15,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[15,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[19,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[19,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[19,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[19,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[27,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[27,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[27,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[27,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[27,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[43,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[43,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[43,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[43,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[43,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[64,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[75,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[64,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[75,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[64,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[75,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[64,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[75,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[64,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[75,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[128,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[139,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[128,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[139,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[128,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[139,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[128,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[139,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[128,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[139,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[128,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[139,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[256,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[267,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[256,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[267,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[256,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[267,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[256,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[267,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[256,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[267,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[256,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[267,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[512,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[523,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[512,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[523,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[512,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[523,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[512,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[523,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[512,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[523,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[512,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[523,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[512,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[523,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1035,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1035,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1035,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1035,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1035,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1035,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1035,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1035,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2059,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2059,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2059,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2059,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2059,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2059,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2059,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2059,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4096,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4107,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4096,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4107,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4096,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4107,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4096,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4107,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4096,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4107,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4096,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4107,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4096,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4107,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4096,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[4107,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8192,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8203,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8192,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8203,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8192,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8203,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8192,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8203,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8192,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8203,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8192,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8203,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8192,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8203,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8192,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[8203,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16395,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16395,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16395,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16395,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16395,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16395,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16395,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16395,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16395,1,2,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32768,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32779,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32768,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32779,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32768,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32779,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32768,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32779,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32768,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32779,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32768,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32779,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32768,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32779,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32768,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32779,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32768,1,1,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[32779,1,2,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65536,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65547,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65536,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65547,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65536,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65547,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65536,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65547,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65536,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65547,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65536,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65547,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65536,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65547,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65536,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65547,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65536,1,1,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[65547,1,2,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131072,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131083,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131072,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131083,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131072,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131083,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131072,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131083,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131072,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131083,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131072,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131083,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131072,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131083,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131072,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131083,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131072,1,1,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[131083,1,2,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262144,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262155,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262144,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262155,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262144,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262155,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262144,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262155,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262144,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262155,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262144,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262155,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262144,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262155,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262144,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262155,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262144,1,1,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[262155,1,2,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524288,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524299,1,2,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524288,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524299,1,2,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524288,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524299,1,2,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524288,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524299,1,2,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524288,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524299,1,2,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524288,1,1,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524299,1,2,1],k=100,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524288,1,1,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524299,1,2,1],k=500,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524288,1,1,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524299,1,2,1],k=1023,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524288,1,1,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[524299,1,2,1],k=9999,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,10,10,10],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[60,10,10,10],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1023,2,1,3],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,2,1,3],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1025,2,1,3],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2047,2,1,3],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,2,1,3],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2049,2,1,3],k=1,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,10,10,10],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[60,10,10,10],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1023,2,1,3],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,2,1,3],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1025,2,1,3],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2047,2,1,3],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,2,1,3],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2049,2,1,3],k=2,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,10,10,10],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[60,10,10,10],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1023,2,1,3],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,2,1,3],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1025,2,1,3],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2047,2,1,3],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,2,1,3],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2049,2,1,3],k=3,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,10,10,10],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[60,10,10,10],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1023,2,1,3],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,2,1,3],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1025,2,1,3],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2047,2,1,3],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,2,1,3],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2049,2,1,3],k=7,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16,10,10,10],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[60,10,10,10],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1023,2,1,3],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1024,2,1,3],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[1025,2,1,3],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[16384,1,1,1],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2047,2,1,3],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2048,2,1,3],k=15,ties=0","support","0","no","CUDA"
"CUDA0","TOP_K","type=f32,ne=[2049,2,1,3],k=15,ties=0","support","0","no","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=nearest,transpose=0","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=nearest,transpose=1","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=nearest,flags=none","support","1","yes","CUDA"
@@ -9435,6 +9860,10 @@
"CUDA0","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=bicubic,transpose=1","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bicubic,flags=none","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bicubic,flags=none","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=513,transpose=0","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[512,512,3,2],scale_factor=2,mode=513,transpose=1","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear,flags=none","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[5,7,11,13],ne_tgt=[2,5,7,11],mode=bilinear,flags=none","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[2,5,7,11],ne_tgt=[5,7,11,13],mode=bilinear,flags=align_corners","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[1,4,3,2],ne_tgt=[2,8,3,2],mode=bilinear,flags=align_corners","support","1","yes","CUDA"
"CUDA0","UPSCALE","type=f32,ne=[4,1,3,2],ne_tgt=[1,1,3,2],mode=bilinear,flags=align_corners","support","1","yes","CUDA"
@@ -9463,34 +9892,59 @@
"CUDA0","GROUP_NORM","type=f32,ne=[64,64,320,1],num_groups=32,eps=0.000001","support","1","yes","CUDA"
"CUDA0","GROUP_NORM","type=f32,ne=[9,9,1280,1],num_groups=32,eps=0.000001","support","1","yes","CUDA"
"CUDA0","ACC","type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1]","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,1,1],pad_0=1,pad_1=1","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,3,1],lp0=1,rp0=1,lp1=1,rp1=1,lp2=1,rp2=1,lp3=1,rp3=1,v=0","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,1,1],pad_0=1,pad_1=1,circular=0","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[33,17,2,1],pad_0=4,pad_1=3,circular=1","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,3,1],lp0=1,rp0=1,lp1=1,rp1=1,lp2=1,rp2=1,lp3=1,rp3=1,v=0,circular=0","support","1","yes","CUDA"
"CUDA0","PAD_REFLECT_1D","type=f32,ne_a=[512,34,2,1],pad_0=10,pad_1=9","support","1","yes","CUDA"
"CUDA0","PAD_REFLECT_1D","type=f32,ne_a=[3000,384,4,1],pad_0=10,pad_1=9","support","1","yes","CUDA"
"CUDA0","ROLL","shift0=3,shift1=-2,shift3=1,shift4=-1","support","1","yes","CUDA"
"CUDA0","ARANGE","type=f32,start=0.000000,stop=10.000000,step=1.000000","support","1","yes","CUDA"
"CUDA0","ARANGE","type=f32,start=0.000000,stop=1048576.000000,step=1.000000","support","1","yes","CUDA"
"CUDA0","TIMESTEP_EMBEDDING","type=f32,ne_a=[2,1,1,1],dim=320,max_period=10000","support","1","yes","CUDA"
"CUDA0","LEAKY_RELU","type=f32,ne_a=[10,5,4,3],negative_slope=0.100000","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[10,5,4,3]","support","0","no","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[10,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[127,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[128,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[128,128,4,4]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[255,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[256,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[511,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[512,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[1023,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[1024,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[2047,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[2048,5,4,3]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[242004,1,1,1]","support","1","yes","CUDA"
"CUDA0","CUMSUM","type=f32,ne=[375960,1,1,1]","support","1","yes","CUDA"
"CUDA0","XIELU","type=f32,ne=[10,5,4,3]","support","0","no","CUDA"
"CUDA0","TRI","type=f32,ne=[10,10,4,3],tri_type=3","support","0","no","CUDA"
"CUDA0","TRI","type=f32,ne=[10,10,4,3],tri_type=2","support","0","no","CUDA"
"CUDA0","TRI","type=f32,ne=[10,10,4,3],tri_type=1","support","0","no","CUDA"
"CUDA0","TRI","type=f32,ne=[10,10,4,3],tri_type=0","support","0","no","CUDA"
"CUDA0","FILL","type=f32,ne=[10,10,4,3],c=0.000000","support","0","no","CUDA"
"CUDA0","FILL","type=f32,ne=[303,207,11,3],c=2.000000","support","0","no","CUDA"
"CUDA0","FILL","type=f32,ne=[800,600,4,4],c=-152.000000","support","0","no","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[10,10,4,3],ne_rhs=[3,10,4,3]","support","0","no","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[11,11,1,1],ne_rhs=[5,11,1,1]","support","0","no","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[17,17,2,4],ne_rhs=[9,17,2,4]","support","0","no","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[30,30,7,1],ne_rhs=[8,30,7,1]","support","0","no","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[42,42,5,2],ne_rhs=[10,42,5,2]","support","0","no","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[64,64,2,2],ne_rhs=[10,64,2,2]","support","0","no","CUDA"
"CUDA0","TRI","type=f32,ne=[10,10,4,3],tri_type=3","support","1","yes","CUDA"
"CUDA0","TRI","type=f32,ne=[10,10,4,3],tri_type=2","support","1","yes","CUDA"
"CUDA0","TRI","type=f32,ne=[10,10,4,3],tri_type=1","support","1","yes","CUDA"
"CUDA0","TRI","type=f32,ne=[10,10,4,3],tri_type=0","support","1","yes","CUDA"
"CUDA0","FILL","type=f32,ne=[10,10,4,3],c=0.000000","support","1","yes","CUDA"
"CUDA0","FILL","type=f32,ne=[303,207,11,3],c=2.000000","support","1","yes","CUDA"
"CUDA0","FILL","type=f32,ne=[800,600,4,4],c=-152.000000","support","1","yes","CUDA"
"CUDA0","FILL","type=f32,ne=[2048,512,2,2],c=3.500000","support","1","yes","CUDA"
"CUDA0","DIAG","type=f32,ne=[10,1,4,3]","support","1","yes","CUDA"
"CUDA0","DIAG","type=f32,ne=[79,1,19,13]","support","1","yes","CUDA"
"CUDA0","DIAG","type=f32,ne=[256,1,8,16]","support","1","yes","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[10,10,4,3],ne_rhs=[3,10,4,3]","support","1","yes","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[11,11,1,1],ne_rhs=[5,11,1,1]","support","1","yes","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[17,17,2,4],ne_rhs=[9,17,2,4]","support","1","yes","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[30,30,7,1],ne_rhs=[8,30,7,1]","support","1","yes","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[42,42,5,2],ne_rhs=[10,42,5,2]","support","1","yes","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[64,64,2,2],ne_rhs=[10,64,2,2]","support","1","yes","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[100,100,4,4],ne_rhs=[41,100,4,4]","support","0","no","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=0","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=0","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=1","support","0","no","CUDA"
"CUDA0","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=1","support","0","no","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[128,128,4,4],ne_rhs=[31,128,4,4]","support","0","no","CUDA"
"CUDA0","SOLVE_TRI","type=f32,ne_lhs=[64,64,4,4],ne_rhs=[300,64,4,4]","support","0","no","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=0,circular=0","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=0,circular=0","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=0,circular=1","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=0,circular=1","support","1","yes","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=1,circular=0","support","0","no","CUDA"
"CUDA0","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=1,circular=0","support","0","no","CUDA"
"CUDA0","PAD","type=f32,ne_a=[512,512,1,1],lp0=0,rp0=1,lp1=0,rp1=1,lp2=0,rp2=0,lp3=0,rp3=0,v=1,circular=1","support","0","no","CUDA"
"CUDA0","PAD","type=f32,ne_a=[11,22,33,44],lp0=1,rp0=2,lp1=3,rp1=4,lp2=5,rp2=6,lp3=7,rp3=8,v=1,circular=1","support","0","no","CUDA"
"CUDA0","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f32,permute=[0,1,2,3]","support","1","yes","CUDA"
"CUDA0","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f16,permute=[0,1,2,3]","support","1","yes","CUDA"
"CUDA0","FLASH_ATTN_EXT","hsk=40,hsv=40,nh=4,nr23=[1,1],kv=113,nb=1,mask=1,sinks=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=bf16,permute=[0,1,2,3]","support","0","no","CUDA"
Can't render this file because it is too large.

docs/ops/ZenDNN.csv (new file, 18741 lines): diff suppressed because it is too large.

View File

@@ -14,12 +14,13 @@ static void write_table_header(std::ofstream & file) {
static void write_table_entry(std::ofstream & file, const common_arg & opt) {
file << "| `";
// args
for (const auto & arg : opt.args) {
if (arg == opt.args.front()) {
auto all_args = opt.get_args();
for (const auto & arg : all_args) {
if (arg == all_args.front()) {
file << arg;
if (opt.args.size() > 1) file << ", ";
if (all_args.size() > 1) file << ", ";
} else {
file << arg << (arg != opt.args.back() ? ", " : "");
file << arg << (arg != all_args.back() ? ", " : "");
}
}
// value hint
@@ -76,7 +77,7 @@ static void export_md(std::string fname, llama_example ex) {
}
int main(int, char **) {
export_md("autogen-main.md", LLAMA_EXAMPLE_MAIN);
export_md("autogen-main.md", LLAMA_EXAMPLE_COMPLETION);
export_md("autogen-server.md", LLAMA_EXAMPLE_SERVER);
return 0;

View File

@@ -144,7 +144,7 @@ int main(int argc, char ** argv) {
return 1;
}
std::string s(buf, n);
printf("%s", s.c_str());
printf("%s (%d)", s.c_str(), id);
}
printf("\n");

View File

@@ -1,10 +1,13 @@
#!/usr/bin/env python3
import numpy as np
import sys
import os
import numpy as np
from pathlib import Path
# Add utils directory to path for direct script execution
sys.path.insert(0, str(Path(__file__).parent.parent / "utils"))
from common import get_model_name_from_env_path # type: ignore[import-not-found]

def quick_logits_check(pytorch_file, llamacpp_file):
    """Lightweight sanity check before NMSE"""
@@ -32,27 +35,16 @@ def quick_logits_check(pytorch_file, llamacpp_file):
print(f"Top 10 llama.cpp logits: {llamacpp_logits[llamacpp_top10]}")
print(f"Max absolute difference: {max_diff:.4f}")
if max_diff > 1.0:
print(f"❌ NOK: Large differences detected - max diff: {max_diff:.4f}")
return False
return True
def main():
    model_path = os.getenv('MODEL_PATH')
    if not model_path:
        print("Error: MODEL_PATH environment variable not set")
        sys.exit(1)
    if not os.path.exists(model_path):
        print(f"Error: Model file not found: {model_path}")
        sys.exit(1)
    model_name = os.path.basename(model_path)
    model_name = get_model_name_from_env_path('MODEL_PATH')

    data_dir = Path("data")
    pytorch_file = data_dir / f"pytorch-{model_name}.bin"
    llamacpp_file = data_dir / f"llamacpp-{model_name}.bin"
    llamacpp_model_name = get_model_name_from_env_path('CONVERTED_MODEL')
    print(f"Using converted model: {llamacpp_model_name}")
    llamacpp_file = data_dir / f"llamacpp-{llamacpp_model_name}.bin"

    if not pytorch_file.exists():
        print(f"Error: PyTorch logits file not found: {pytorch_file}")

View File

@@ -5,6 +5,7 @@ import sys
import os
import argparse
from pathlib import Path
from common import get_model_name_from_env_path # type: ignore[import-not-found]

def calculate_nmse(reference, test):
    mse = np.mean((test - reference) ** 2)
@@ -67,11 +68,13 @@ def main():
    parser.add_argument('-m', '--model-path', required=True, help='Path to the model directory')
    args = parser.parse_args()

    model_name = os.path.basename(args.model_path)
    model_name = get_model_name_from_env_path('MODEL_PATH')

    data_dir = Path("data")
    pytorch_file = data_dir / f"pytorch-{model_name}.bin"
    llamacpp_file = data_dir / f"llamacpp-{model_name}.bin"
    llamacpp_model_name = get_model_name_from_env_path('CONVERTED_MODEL')
    llamacpp_file = data_dir / f"llamacpp-{llamacpp_model_name}.bin"

    print(f"Model name: {model_name}")
    print(f"PyTorch logits file: {pytorch_file}")

View File

@@ -0,0 +1,20 @@
#!/usr/bin/env python3

import os
import sys

def get_model_name_from_env_path(env_path_name):
    model_path = os.getenv(env_path_name)
    if not model_path:
        print(f"Error: {env_path_name} environment variable not set")
        sys.exit(1)
    if not os.path.exists(model_path):
        print(f"Error: Model file not found: {model_path}")
        sys.exit(1)
    name = os.path.basename(os.path.normpath(model_path))
    if name.endswith(".gguf"):
        name = name[:-5]
    return name

View File

@@ -168,6 +168,7 @@ option(GGML_RVV "ggml: enable rvv" ON)
option(GGML_RV_ZFH "ggml: enable riscv zfh" ON)
option(GGML_RV_ZVFH "ggml: enable riscv zvfh" ON)
option(GGML_RV_ZICBOP "ggml: enable riscv zicbop" ON)
option(GGML_RV_ZIHINTPAUSE "ggml: enable riscv zihintpause" ON)
option(GGML_XTHEADVECTOR "ggml: enable xtheadvector" OFF)
option(GGML_VXE "ggml: enable vxe" ${GGML_NATIVE})
@@ -253,6 +254,9 @@ option(GGML_HEXAGON "ggml: enable Hexagon backend"
# toolchain for vulkan-shaders-gen
set (GGML_VULKAN_SHADERS_GEN_TOOLCHAIN "" CACHE FILEPATH "ggml: toolchain file for vulkan-shaders-gen")
option(GGML_ZENDNN "ggml: use ZenDNN" OFF)
option(ZENDNN_ROOT "ggml: path to ZenDNN installation" "")
# extra artifacts
option(GGML_BUILD_TESTS "ggml: build tests" ${GGML_STANDALONE})
option(GGML_BUILD_EXAMPLES "ggml: build examples" ${GGML_STANDALONE})
@@ -314,6 +318,7 @@ set(GGML_PUBLIC_HEADERS
include/ggml-sycl.h
include/ggml-vulkan.h
include/ggml-webgpu.h
include/ggml-zendnn.h
include/gguf.h)
set_target_properties(ggml PROPERTIES PUBLIC_HEADER "${GGML_PUBLIC_HEADERS}")

View File

@@ -99,6 +99,7 @@ extern "C" {
GGML_BACKEND_API int ggml_cpu_has_sme (void);
// other
GGML_BACKEND_API int ggml_cpu_has_riscv_v (void);
GGML_BACKEND_API int ggml_cpu_get_rvv_vlen (void); // risc-v vector length in bytes
GGML_BACKEND_API int ggml_cpu_has_vsx (void);
GGML_BACKEND_API int ggml_cpu_has_vxe (void);
GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
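
For readers wiring this up, a minimal sketch of how the new accessor pairs with the existing feature check; the `main()` harness here is invented for illustration and assumes a build that links the ggml CPU backend:

```cpp
// Minimal sketch: query RISC-V vector (RVV) support and the newly exposed
// vector length, using only the functions declared in ggml-cpu.h.
#include <cstdio>
#include "ggml-cpu.h"

int main() {
    if (ggml_cpu_has_riscv_v()) {
        // ggml_cpu_get_rvv_vlen() is the accessor added above (bytes)
        std::printf("RVV available, VLEN = %d bytes\n", ggml_cpu_get_rvv_vlen());
    } else {
        std::printf("RVV not available on this CPU/build\n");
    }
    return 0;
}
```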

View File

@@ -0,0 +1,22 @@
#pragma once
#include "ggml-backend.h"
#include "ggml.h"
#ifdef __cplusplus
extern "C" {
#endif
// backend API
GGML_BACKEND_API ggml_backend_t ggml_backend_zendnn_init(void);
GGML_BACKEND_API bool ggml_backend_is_zendnn(ggml_backend_t backend);
// number of threads used for zendnn operations
GGML_BACKEND_API void ggml_backend_zendnn_set_n_threads(ggml_backend_t backend_zendnn, int n_threads);
GGML_BACKEND_API ggml_backend_reg_t ggml_backend_zendnn_reg(void);
#ifdef __cplusplus
}
#endif
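
A hedged sketch of client-side usage, assuming only the declarations above plus `ggml_backend_free()` from ggml-backend.h (which this header includes); the thread count is an arbitrary example, not a recommended value:

```cpp
// Sketch, not a tested example: initialize the ZenDNN backend, set its
// thread count, and release it. Assumes a build with GGML_ZENDNN enabled.
#include <cstdio>
#include "ggml-zendnn.h"

int main() {
    ggml_backend_t backend = ggml_backend_zendnn_init();
    if (backend == nullptr) {
        std::fprintf(stderr, "failed to initialize ZenDNN backend\n");
        return 1;
    }
    if (ggml_backend_is_zendnn(backend)) {
        ggml_backend_zendnn_set_n_threads(backend, 8); // 8 is an arbitrary choice
    }
    ggml_backend_free(backend); // from ggml-backend.h
    return 0;
}
```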

View File

@@ -2305,13 +2305,11 @@ extern "C" {
float stop,
float step);
#define GGML_KQ_MASK_PAD 1
// q: [n_embd_k, n_batch, n_head, ne3 ]
// k: [n_embd_k, n_kv, n_head_kv, ne3 ]
// v: [n_embd_v, n_kv, n_head_kv, ne3 ] !! not transposed !!
// mask: [n_kv, n_batch_pad, ne32, ne33] !! n_batch_pad = GGML_PAD(n_batch, GGML_KQ_MASK_PAD) !!
// res: [n_embd_v, n_head, n_batch, ne3 ] !! permuted !!
// q: [n_embd_k, n_batch, n_head, ne3 ]
// k: [n_embd_k, n_kv, n_head_kv, ne3 ]
// v: [n_embd_v, n_kv, n_head_kv, ne3 ] !! not transposed !!
// mask: [n_kv, n_batch, ne32, ne33]
// res: [n_embd_v, n_head, n_batch, ne3 ] !! permuted !!
//
// broadcast:
// n_head % n_head_kv == 0
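
To make the shape contract above concrete, a minimal sketch (not from the PR); the sizes `n_embd_k`, `n_kv`, and so on are invented example values, and only the public API in ggml.h is assumed:

```cpp
// Sketch of the ggml_flash_attn_ext shape contract described above.
// Per the updated comment, the mask's second dimension is now n_batch,
// with no padding to GGML_KQ_MASK_PAD required.
#include <cmath>
#include "ggml.h"

int main() {
    ggml_init_params ip = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ nullptr, /*no_alloc*/ false };
    ggml_context * ctx = ggml_init(ip);

    const int64_t n_embd_k = 128, n_embd_v = 128, n_head = 8, n_head_kv = 8;
    const int64_t n_batch = 4, n_kv = 32;

    ggml_tensor * q    = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, n_embd_k, n_batch, n_head,    1);
    ggml_tensor * k    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_k, n_kv,    n_head_kv, 1);
    ggml_tensor * v    = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_embd_v, n_kv,    n_head_kv, 1); // not transposed
    ggml_tensor * mask = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, n_kv,     n_batch, 1,         1);

    // res: [n_embd_v, n_head, n_batch, 1], permuted as documented
    ggml_tensor * res = ggml_flash_attn_ext(ctx, q, k, v, mask,
                                            1.0f / std::sqrt((float) n_embd_k), 0.0f, 0.0f);
    (void) res;
    ggml_free(ctx);
    return 0;
}
```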

View File

@@ -440,6 +440,7 @@ ggml_add_backend(WebGPU)
ggml_add_backend(zDNN)
ggml_add_backend(OpenCL)
ggml_add_backend(Hexagon)
ggml_add_backend(ZenDNN)
foreach (target ggml-base ggml)
target_include_directories(${target} PUBLIC $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/../include> $<INSTALL_INTERFACE:include>)

View File

@@ -25,6 +25,7 @@ static bool ggml_is_view(const struct ggml_tensor * t) {
// ops that return true for this function must not use restrict pointers for their backend implementations
bool ggml_op_can_inplace(enum ggml_op op) {
switch (op) {
case GGML_OP_FILL:
case GGML_OP_SCALE:
case GGML_OP_DIAG_MASK_ZERO:
case GGML_OP_DIAG_MASK_INF:
@@ -311,16 +312,9 @@ static struct buffer_address ggml_dyn_tallocr_alloc(struct ggml_dyn_tallocr * al
}
// this is a very naive implementation, but for our case the number of free blocks should be very small
static void ggml_dyn_tallocr_free_tensor(struct ggml_dyn_tallocr * alloc, struct buffer_address addr, size_t size, const struct ggml_tensor * tensor) {
static void ggml_dyn_tallocr_free_bytes(struct ggml_dyn_tallocr * alloc, struct buffer_address addr, size_t size) {
size = aligned_offset(NULL, size, alloc->alignment);
AT_PRINTF("%s: freeing %s at {chunk=%d, offset=%zu} (%zu bytes) - n_free_blocks = %d\n",
__func__, tensor->name, addr.chunk, addr.offset, size, alloc->chunks[addr.chunk]->n_free_blocks);
#ifdef GGML_ALLOCATOR_DEBUG
remove_allocated_tensor(alloc, addr, tensor);
#endif
struct tallocr_chunk * chunk = alloc->chunks[addr.chunk];
// see if we can merge with an existing block
@@ -356,8 +350,6 @@ static void ggml_dyn_tallocr_free_tensor(struct ggml_dyn_tallocr * alloc, struct
}
// otherwise, add a new block
ggml_dyn_tallocr_insert_block(chunk, addr.offset, size);
GGML_UNUSED(tensor);
}
static void ggml_dyn_tallocr_reset(struct ggml_dyn_tallocr * alloc) {
@@ -615,13 +607,17 @@ static void ggml_gallocr_free_extra_space(ggml_gallocr_t galloc, struct ggml_ten
GGML_ASSERT(parent_size >= node_size);
// note: we want after the freeing the chunks to continue to be aligned
struct ggml_dyn_tallocr * p_alloc = galloc->buf_tallocs[p_hn->buffer_id];
parent_size = aligned_offset(NULL, parent_size, p_alloc->alignment);
node_size = aligned_offset(NULL, node_size, p_alloc->alignment);
if (parent_size > node_size) {
struct ggml_dyn_tallocr * p_alloc = galloc->buf_tallocs[p_hn->buffer_id];
struct buffer_address p_addr = p_hn->addr;
p_addr.offset += node_size;
size_t extra_size = parent_size - node_size;
AT_PRINTF("freeing extra %zu bytes from parent %s for %s\n", extra_size, parent->name, node->name);
ggml_dyn_tallocr_free_tensor(p_alloc, p_addr, extra_size, parent);
ggml_dyn_tallocr_free_bytes(p_alloc, p_addr, extra_size);
}
}
@@ -705,7 +701,14 @@ static void ggml_gallocr_free_node(ggml_gallocr_t galloc, struct ggml_tensor * n
struct ggml_dyn_tallocr * alloc = galloc->buf_tallocs[buffer_id];
ggml_backend_buffer_type_t buft = galloc->bufts[buffer_id];
size_t size = ggml_backend_buft_get_alloc_size(buft, node);
ggml_dyn_tallocr_free_tensor(alloc, hn->addr, size, node);
AT_PRINTF("%s: freeing %s at {chunk=%d, offset=%zu} (%zu bytes) - n_free_blocks = %d\n",
__func__, node->name, hn->addr.chunk, hn->addr.offset, size, alloc->chunks[hn->addr.chunk]->n_free_blocks);
#ifdef GGML_ALLOCATOR_DEBUG
remove_allocated_tensor(alloc, hn->addr, node);
#endif
ggml_dyn_tallocr_free_bytes(alloc, hn->addr, size);
hn->allocated = false;
}
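
For orientation, the renamed ggml_dyn_tallocr_free_bytes keeps the merge-or-append free-list logic described in the comments above; the following is a simplified sketch of that pattern, not the actual allocator code:

```cpp
// Simplified illustration only; the real allocator also tracks chunks,
// alignment, and debug bookkeeping, which this sketch omits.
#include <cstddef>
#include <vector>

struct free_block { size_t offset; size_t size; };

void free_bytes(std::vector<free_block> & blocks, size_t offset, size_t size) {
    for (free_block & b : blocks) {
        if (b.offset + b.size == offset) { // freed range starts where a block ends
            b.size += size;
            return;
        }
        if (offset + size == b.offset) {   // freed range ends where a block starts
            b.offset = offset;
            b.size  += size;
            return;
        }
    }
    blocks.push_back({offset, size});      // otherwise, add a new block
}
```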

View File

@@ -73,6 +73,10 @@
#include "ggml-cann.h"
#endif
#ifdef GGML_USE_ZENDNN
#include "ggml-zendnn.h"
#endif
// disable C++17 deprecation warning for std::codecvt_utf8
#if defined(__clang__)
# pragma clang diagnostic push
@@ -203,6 +207,9 @@ struct ggml_backend_registry {
#ifdef GGML_USE_OPENCL
register_backend(ggml_backend_opencl_reg());
#endif
#ifdef GGML_USE_ZENDNN
register_backend(ggml_backend_zendnn_reg());
#endif
#ifdef GGML_USE_HEXAGON
register_backend(ggml_backend_hexagon_reg());
#endif
@@ -605,6 +612,7 @@ void ggml_backend_load_all_from_path(const char * dir_path) {
#endif
ggml_backend_load_best("blas", silent, dir_path);
ggml_backend_load_best("zendnn", silent, dir_path);
ggml_backend_load_best("cann", silent, dir_path);
ggml_backend_load_best("cuda", silent, dir_path);
ggml_backend_load_best("hip", silent, dir_path);

View File

@@ -2251,12 +2251,12 @@ static void aclnn_rope_cache_init(ggml_backend_cann_context & ctx,
int sections[4],
bool mrope_used,
bool is_imrope,
bool indep_sects) {
ggml_tensor * src0 = dst->src[0]; // input
bool indep_sects,
int64_t rope_dims) {
ggml_tensor * src1 = dst->src[1]; // position
ggml_tensor * src2 = dst->src[2]; // freq_factors
int64_t theta_scale_length = src0->ne[0] / 2;
int64_t theta_scale_length = rope_dims / 2;
int64_t position_length = dst->ne[2];
// TODO: check theta_scale_length and position_length.
@@ -2331,18 +2331,17 @@ static void aclnn_rope_cache_init(ggml_backend_cann_context & ctx,
ACL_CHECK(aclrtMemcpyAsync(ctx.rope_cache.theta_scale_cache, theta_scale_length * sizeof(float),
ctx.rope_cache.theta_scale_exp_host, theta_scale_length * sizeof(float),
ACL_MEMCPY_HOST_TO_DEVICE, ctx.stream()));
acl_theta_scale_tensor = ggml_cann_create_tensor(ctx.rope_cache.theta_scale_cache, ACL_FLOAT, sizeof(float),
theta_scale_ne, theta_scale_nb, 1);
}
acl_theta_scale_tensor = ggml_cann_create_tensor(ctx.rope_cache.theta_scale_cache, ACL_FLOAT, sizeof(float),
theta_scale_ne, theta_scale_nb, 1);
// Step1.2: prepare rope_yarn_ramp, if this part updated, should update theta_scale_tensor.
// TODO: acl_yarn_ramp_tensor use rope cache.
bool yarn_ramp_tensor_updated = false;
ggml_cann_pool_alloc yarn_ramp_allocator(ctx.pool());
acl_tensor_ptr acl_yarn_ramp_tensor;
if (ext_factor != 0 &&
// TODO: check more parameter.
(ctx.rope_cache.theta_scale_length != theta_scale_length || ctx.rope_cache.freq_scale != freq_scale)) {
if (ext_factor != 0 && (theta_scale_updated || ctx.rope_cache.theta_scale_length != theta_scale_length ||
ctx.rope_cache.freq_scale != freq_scale)) {
yarn_ramp_tensor_updated = true;
// -rope_yarn_ramp
@@ -2590,7 +2589,7 @@ static void aclnn_rope_cache_init(ggml_backend_cann_context & ctx,
aclnn_muls(ctx, acl_cos_tensor.get(), attn_factor, nullptr, true);
}
int64_t sin_reshape_ne[4] = { src0->ne[0], 1, dst->ne[2], 1 };
int64_t sin_reshape_ne[4] = { rope_dims, 1, dst->ne[2], 1 };
size_t sin_reshape_nb[GGML_MAX_DIMS];
sin_reshape_nb[0] = sizeof(float);
for (int i = 1; i < GGML_MAX_DIMS; i++) {
@@ -2645,7 +2644,7 @@ void ggml_cann_rope(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
// param
float freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow;
int sections[4];
// const int n_past = ((int32_t *) dst->op_params)[0];
const int n_dims = ((int32_t *) dst->op_params)[1];
const int mode = ((int32_t *) dst->op_params)[2];
@@ -2654,44 +2653,60 @@ void ggml_cann_rope(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
GGML_TENSOR_UNARY_OP_LOCALS
memcpy(&freq_base, (int32_t *) dst->op_params + 5, sizeof(float));
memcpy(&freq_scale, (int32_t *) dst->op_params + 6, sizeof(float));
memcpy(&ext_factor, (int32_t *) dst->op_params + 7, sizeof(float));
memcpy(&attn_factor, (int32_t *) dst->op_params + 8, sizeof(float));
memcpy(&beta_fast, (int32_t *) dst->op_params + 9, sizeof(float));
memcpy(&beta_slow, (int32_t *) dst->op_params + 10, sizeof(float));
memcpy(&sections, (int32_t *) dst->op_params + 11, sizeof(int)*4);
memcpy(&freq_base, (int32_t *) dst->op_params + 5, sizeof(float));
memcpy(&freq_scale, (int32_t *) dst->op_params + 6, sizeof(float));
memcpy(&ext_factor, (int32_t *) dst->op_params + 7, sizeof(float));
memcpy(&attn_factor, (int32_t *) dst->op_params + 8, sizeof(float));
memcpy(&beta_fast, (int32_t *) dst->op_params + 9, sizeof(float));
memcpy(&beta_slow, (int32_t *) dst->op_params + 10, sizeof(float));
memcpy(&sections, (int32_t *) dst->op_params + 11, sizeof(int) * 4);
// TODO: n_dims <= ne0
GGML_ASSERT(n_dims == ne0);
GGML_ASSERT(n_dims % 2 == 0);
GGML_ASSERT(n_dims <= ne00);
const float theta_scale = powf(freq_base, -2.0f / n_dims);
float corr_dims[2];
ggml_rope_yarn_corr_dims(n_dims, n_ctx_orig, freq_base, beta_fast, beta_slow, corr_dims);
bool is_neox = mode & GGML_ROPE_TYPE_NEOX;
const bool is_imrope = mode == GGML_ROPE_TYPE_IMROPE; // qwen3vl apply interleaved mrope
const bool mrope_used = mode & GGML_ROPE_TYPE_MROPE; // ggml_rope_multi, note: also true for vision (24 & 8 == true) and for imrope
const bool is_vision = mode == GGML_ROPE_TYPE_VISION;
bool is_neox = mode & GGML_ROPE_TYPE_NEOX;
const bool is_imrope = mode == GGML_ROPE_TYPE_IMROPE; // qwen3vl apply interleaved mrope
// mrope_used means the GGML_ROPE_TYPE_MROPE bit is set.
// Note: this bit is also set for imrope and some vision modes,
// so mrope_used does NOT exclusively indicate pure mrope.
const bool mrope_used = mode & GGML_ROPE_TYPE_MROPE;
const bool is_vision = mode == GGML_ROPE_TYPE_VISION;
if (mrope_used) {
GGML_ASSERT(sections[0] > 0 || sections[1] > 0 || sections[2] > 0);
}
if (is_vision) {
GGML_ASSERT(n_dims == ne0/2);
GGML_ASSERT(n_dims == ne0 / 2);
}
if (is_imrope || mrope_used) {
is_neox = true;
}
// init ctx.rope_cos/rope_sin cache
aclnn_rope_cache_init(ctx, dst, corr_dims, ext_factor, theta_scale, freq_scale, attn_factor, is_neox, sections, mrope_used, is_imrope, is_vision);
int64_t rope_dims = n_dims;
int64_t sin_reshape_ne[4] = { ne00, 1, ne02, 1 };
// Our current RotaryPositionEmbedding does not support the VISION mode,
// but essentially it only modifies theta_base in mrope,
// then repeats it at the end in the same way as is_neox.
// In fact, RoPE is still applied across all dimensions.
if (is_vision) {
rope_dims = src0->ne[0];
}
int64_t tail_dims = ne00 - rope_dims;
bool has_tail = tail_dims > 0;
// init ctx.rope_cos/rope_sin cache
aclnn_rope_cache_init(ctx, dst, corr_dims, ext_factor, theta_scale, freq_scale, attn_factor, is_neox, sections,
mrope_used, is_imrope, is_vision, rope_dims);
// The sin/cos cache is generated with rope_dims entries per position, so reshape with rope_dims
int64_t sin_reshape_ne[4] = { rope_dims, 1, ne02, 1 };
size_t sin_reshape_nb[GGML_MAX_DIMS];
sin_reshape_nb[0] = sizeof(float);
for (int i = 1; i < GGML_MAX_DIMS; i++) {
@@ -2704,7 +2719,6 @@ void ggml_cann_rope(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
acl_tensor_ptr acl_src = ggml_cann_create_tensor(src0);
acl_tensor_ptr acl_dst = ggml_cann_create_tensor(dst);
#ifdef ASCEND_310P
// Special ROPE operation for 310P
@@ -2844,46 +2858,124 @@ void ggml_cann_rope(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
}
return;
#endif
int64_t acl_mode = is_neox ? 0 : 1;
// Pre-define head and tail dimensions for reuse
int64_t head_ne[GGML_MAX_DIMS] = { rope_dims, ne01, ne02, ne03 };
int64_t tail_ne[GGML_MAX_DIMS] = { tail_dims, ne01, ne02, ne03 };
// Step 1: Prepare trans tensors for F16 type conversion to F32 if needed
bool src_dst_need_trans = false;
ggml_cann_pool_alloc src_trans_allocator(ctx.pool());
ggml_cann_pool_alloc dst_trans_allocator(ctx.pool());
acl_tensor_ptr acl_src_trans_tensor;
acl_tensor_ptr acl_dst_trans_tensor;
void * src_trans_buffer = nullptr;
void * dst_trans_buffer = nullptr;
size_t src_dst_trans_nb[GGML_MAX_DIMS];
if (src0->type == GGML_TYPE_F16) {
src_dst_need_trans = true;
src_trans_buffer = src_trans_allocator.alloc(ggml_nelements(src0) * sizeof(float));
dst_trans_buffer = dst_trans_allocator.alloc(ggml_nelements(dst) * sizeof(float));
src_dst_trans_nb[0] = sizeof(float);
for (int i = 1; i < GGML_MAX_DIMS; i++) {
src_dst_trans_nb[i] = src_dst_trans_nb[i - 1] * src0->ne[i - 1];
}
acl_src_trans_tensor = ggml_cann_create_tensor(src_trans_buffer, ACL_FLOAT, sizeof(float), src0->ne,
src_dst_trans_nb, GGML_MAX_DIMS);
acl_dst_trans_tensor = ggml_cann_create_tensor(dst_trans_buffer, ACL_FLOAT, sizeof(float), dst->ne,
src_dst_trans_nb, GGML_MAX_DIMS);
aclnn_cast(ctx, acl_src.get(), acl_src_trans_tensor.get(), ACL_FLOAT);
}
// Step 2: Prepare head tensors for tail splitting if needed
acl_tensor_ptr acl_src_head;
acl_tensor_ptr acl_dst_head;
if (has_tail) {
// Create head views for RotaryPositionEmbedding (only first rope_dims dimensions)
// RotaryPositionEmbedding requires contiguous dst tensor, so we use a temporary buffer
if (src_dst_need_trans) {
// Use F32 trans tensor strides
acl_src_head = ggml_cann_create_tensor((char *) src_trans_buffer, ACL_FLOAT, sizeof(float), head_ne,
src_dst_trans_nb, GGML_MAX_DIMS);
} else {
// Use original F32 tensor strides
acl_src_head = ggml_cann_create_tensor((char *) src0->data, ACL_FLOAT, sizeof(float), head_ne, src0->nb,
GGML_MAX_DIMS);
}
int64_t head_elements = rope_dims * ne01 * ne02 * ne03;
ggml_cann_pool_alloc dst_head_contiguous_allocator(ctx.pool(), head_elements * sizeof(float));
void * dst_head_contiguous_buffer = dst_head_contiguous_allocator.get();
size_t head_contiguous_nb[GGML_MAX_DIMS];
head_contiguous_nb[0] = sizeof(float);
for (int i = 1; i < GGML_MAX_DIMS; i++) {
head_contiguous_nb[i] = head_contiguous_nb[i - 1] * head_ne[i - 1];
}
acl_dst_head = ggml_cann_create_tensor(dst_head_contiguous_buffer, ACL_FLOAT, sizeof(float), head_ne,
head_contiguous_nb, GGML_MAX_DIMS);
}
// Step 3: Execute RotaryPositionEmbedding
if (has_tail) {
// Rotate only the head portion (first rope_dims dimensions)
GGML_CANN_CALL_ACLNN_OP(ctx, RotaryPositionEmbedding, acl_src_head.get(), acl_cos_reshape_tensor.get(),
acl_sin_reshape_tensor.get(), acl_mode, acl_dst_head.get());
// Copy head result from contiguous buffer back to destination tensor
if (src_dst_need_trans) {
acl_tensor_ptr acl_dst_head_target = ggml_cann_create_tensor(
(char *) dst_trans_buffer, ACL_FLOAT, sizeof(float), head_ne, src_dst_trans_nb, GGML_MAX_DIMS);
cann_copy(ctx, acl_dst_head.get(), acl_dst_head_target.get());
} else {
acl_tensor_ptr acl_dst_head_target =
ggml_cann_create_tensor((char *) dst->data, ACL_FLOAT, sizeof(float), head_ne, dst->nb, GGML_MAX_DIMS);
cann_copy(ctx, acl_dst_head.get(), acl_dst_head_target.get());
}
} else if (src_dst_need_trans) {
// Rotate full tensor (no tail), using trans tensors
GGML_CANN_CALL_ACLNN_OP(ctx, RotaryPositionEmbedding, acl_src_trans_tensor.get(), acl_cos_reshape_tensor.get(),
acl_sin_reshape_tensor.get(), acl_mode, acl_dst_trans_tensor.get());
} else {
// Rotate full tensor (no tail), using original tensors
GGML_CANN_CALL_ACLNN_OP(ctx, RotaryPositionEmbedding, acl_src.get(), acl_cos_reshape_tensor.get(),
acl_sin_reshape_tensor.get(), acl_mode, acl_dst.get());
}
// Step 4: Copy unrotated tail portion from source to destination
if (has_tail) {
size_t src_tail_offset;
size_t dst_tail_offset;
auto copy_tail_device = [&](void * src_ptr, void * dst_ptr, aclDataType dtype, size_t elem_size,
size_t * nb_src_arr, size_t * nb_dst_arr) {
acl_tensor_ptr acl_src_tail =
ggml_cann_create_tensor(src_ptr, dtype, elem_size, tail_ne, nb_src_arr, GGML_MAX_DIMS);
acl_tensor_ptr acl_dst_tail =
ggml_cann_create_tensor(dst_ptr, dtype, elem_size, tail_ne, nb_dst_arr, GGML_MAX_DIMS);
cann_copy(ctx, acl_src_tail.get(), acl_dst_tail.get());
};
if (src_dst_need_trans) {
// Use F32 trans tensor strides and offsets
src_tail_offset = rope_dims * src_dst_trans_nb[0];
dst_tail_offset = rope_dims * src_dst_trans_nb[0];
copy_tail_device((char *) src_trans_buffer + src_tail_offset, (char *) dst_trans_buffer + dst_tail_offset,
ACL_FLOAT, sizeof(float), src_dst_trans_nb, src_dst_trans_nb);
} else {
// Use original tensor strides and offsets
src_tail_offset = rope_dims * nb00;
dst_tail_offset = rope_dims * nb0;
copy_tail_device((char *) src0->data + src_tail_offset, (char *) dst->data + dst_tail_offset,
ggml_cann_type_mapping(dst->type), ggml_element_size(dst), src0->nb, dst->nb);
}
}
// Step 5: Cast back to F16 if needed
if (src_dst_need_trans) {
aclnn_cast(ctx, acl_dst_trans_tensor.get(), acl_dst.get(), ACL_FLOAT16);
}
}
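Summary note on the rewritten op above: it now rotates only the first rope_dims elements of each row and copies the remaining tail through unchanged, routing F16 data through an F32 staging buffer. A minimal CPU sketch of that per-row contract, assuming the NeoX half-split pairing used when is_neox is true (hypothetical helper, not the CANN implementation):

```cpp
#include <cstring>

// Sketch: rotate the first rope_dims elements of one row (NeoX pairing),
// copy the remaining ne0 - rope_dims elements verbatim.
static void rope_row_with_tail(const float * src, float * dst, int ne0,
                               int rope_dims, const float * cos_c, const float * sin_c) {
    const int half = rope_dims / 2;
    for (int i = 0; i < half; ++i) {
        const float x0 = src[i];
        const float x1 = src[i + half];
        dst[i]        = x0 * cos_c[i] - x1 * sin_c[i];
        dst[i + half] = x0 * sin_c[i] + x1 * cos_c[i];
    }
    // tail pass-through: mirrors the device-side copy of the last ne0 - rope_dims elements
    memcpy(dst + rope_dims, src + rope_dims, (ne0 - rope_dims) * sizeof(float));
}
```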

View File

@@ -315,7 +315,7 @@ struct ggml_cann_rope_cache {
if (theta_scale_exp_host) {
free(theta_scale_exp_host);
}
if (position_select_index_host) {
free(position_select_index_host);
}
}
@@ -340,7 +340,7 @@ struct ggml_cann_rope_cache {
void set(int64_t theta_scale_length,
int64_t position_length,
float ext_factor,
float theta_scale,
float freq_scale,
float attn_factor,

View File

@@ -2308,7 +2308,7 @@ static enum ggml_status ggml_backend_cann_graph_compute(ggml_backend_t backend,
bool cann_graph_update_required = false;
#ifdef USE_ACL_GRAPH
bool use_cann_graph = true;
static bool prefill_use_graph = parse_bool(get_env("GGML_CANN_PREFILL_USE_GRAPH").value_or(""));
if (!prefill_use_graph) {
@@ -2338,7 +2338,7 @@ static enum ggml_status ggml_backend_cann_graph_compute(ggml_backend_t backend,
}
}
#else
bool use_cann_graph = false;
#endif // USE_ACL_GRAPH
evaluate_and_capture_cann_graph(cann_ctx, cgraph, use_cann_graph, cann_graph_update_required);
@@ -2474,16 +2474,14 @@ static bool ggml_backend_cann_supports_op(ggml_backend_dev_t dev, const ggml_ten
}
case GGML_OP_ROPE:
{
// TODO: with ops-test v == 1
if (op->src[0]->ne[0] > 896) {
return false;
}
#ifdef ASCEND_310P
// TODO: Support rope_dim < ne00(dim)
if (op->src[0]->ne[0] != op->op_params[1]) {
return false;
}
if (!ggml_is_contiguous(op->src[0])) {
return false;
}
@@ -2550,6 +2548,7 @@ static bool ggml_backend_cann_supports_op(ggml_backend_dev_t dev, const ggml_ten
case GGML_OP_ARGSORT:
case GGML_OP_ACC:
case GGML_OP_GROUP_NORM:
return true;
case GGML_OP_PAD:
// TODO: add circular padding support for cann, see https://github.com/ggml-org/llama.cpp/pull/16985
return ggml_get_op_params_i32(op, 8) == 0;

View File

@@ -469,6 +469,9 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
if (GGML_RV_ZICBOP)
string(APPEND MARCH_STR "_zicbop")
endif()
if (GGML_RV_ZIHINTPAUSE)
string(APPEND MARCH_STR "_zihintpause")
endif()
list(APPEND ARCH_FLAGS "-march=${MARCH_STR}" -mabi=lp64d)
else()
# Begin with the lowest baseline

View File

@@ -81,6 +81,11 @@ struct ggml_arm_arch_features_type {
} ggml_arm_arch_features = { 0 };
#endif
#if defined(__riscv)
struct ggml_riscv_arch_features_type {
int rvv_vlen;
} ggml_riscv_arch_features = { 0 };
#endif
#if defined(_WIN32)
@@ -187,6 +192,9 @@ typedef void * thread_ret_t;
typedef pthread_t ggml_thread_t;
#define GGML_THREADPOOL_N_THREADS_MASK (0xffffU)
#define GGML_THREADPOOL_N_THREADS_BITS (16)
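The two constants above define how a single atomic now carries both values: the low 16 bits hold the active thread count and the upper bits hold a monotonically increasing graph counter. A self-contained sketch of the pack/unpack arithmetic used by the kickoff path further down (plain C++, same mask/shift values assumed):

```cpp
#include <cstdint>
#include <cassert>

constexpr uint32_t N_THREADS_MASK = 0xffffU; // GGML_THREADPOOL_N_THREADS_MASK
constexpr int      N_THREADS_BITS = 16;      // GGML_THREADPOOL_N_THREADS_BITS

// pack: bump the graph counter and store the thread count for this graph
static uint32_t kickoff(uint32_t n_graph_word, int n_threads) {
    const uint32_t graph = (n_graph_word >> N_THREADS_BITS) + 1;
    return (graph << N_THREADS_BITS) | ((uint32_t) n_threads & N_THREADS_MASK);
}

int main() {
    uint32_t w = 0;
    w = kickoff(w, 8);                          // graph 1, 8 active threads
    assert((w & N_THREADS_MASK) == 8);
    w = kickoff(w, 4);                          // graph 2, 4 active threads
    assert((w >> N_THREADS_BITS) == 2);
    return 0;
}
```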
#if defined(__APPLE__)
#include <unistd.h>
#include <mach/mach.h>
@@ -449,7 +457,7 @@ struct ggml_threadpool {
struct ggml_cplan * cplan;
// synchronization primitives
atomic_int n_graph; // updated when there is work to be done (i.e. each graph); packs the graph counter and the active thread count
atomic_int GGML_CACHE_ALIGN n_barrier;
atomic_int GGML_CACHE_ALIGN n_barrier_passed;
atomic_int GGML_CACHE_ALIGN current_chunk; // currently processing chunk during Mat_Mul, shared between all the threads.
@@ -457,12 +465,10 @@ struct ggml_threadpool {
// these are atomic as an annotation for thread-sanitizer
atomic_bool stop; // Used for stopping the threadpool altogether
atomic_bool pause; // Used for pausing the threadpool or individual threads
atomic_int abort; // Used for aborting processing of a graph
struct ggml_compute_state * workers; // per thread state
int n_threads; // Number of threads in the pool
int32_t prio; // Scheduling priority
uint32_t poll; // Polling level (0 - no polling)
@@ -490,6 +496,15 @@ static inline void ggml_thread_cpu_relax(void) {
static inline void ggml_thread_cpu_relax(void) {
_mm_pause();
}
#elif defined(__riscv)
static inline void ggml_thread_cpu_relax(void) {
#ifdef __riscv_zihintpause
__asm__ __volatile__ ("pause");
#else
/* Encoding of the pause instruction */
__asm__ __volatile__ (".4byte 0x100000F");
#endif
}
#else
static inline void ggml_thread_cpu_relax(void) {;}
#endif
@@ -530,7 +545,7 @@ struct ggml_state {
static struct ggml_state g_state = {0};
void ggml_barrier(struct ggml_threadpool * tp) {
int n_threads = atomic_load_explicit(&tp->n_graph, memory_order_relaxed) & GGML_THREADPOOL_N_THREADS_MASK;
if (n_threads == 1) {
return;
}
@@ -547,7 +562,7 @@ void ggml_barrier(struct ggml_threadpool * tp) {
// last thread
atomic_store_explicit(&tp->n_barrier, 0, memory_order_relaxed);
// exit barrier (full seq-cst fence)
atomic_fetch_add_explicit(&tp->n_barrier_passed, 1, memory_order_seq_cst);
return;
}
@@ -693,6 +708,15 @@ static void ggml_init_arm_arch_features(void) {}
#endif
#endif // __ARM_ARCH
#if defined(__riscv) && defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
static void ggml_init_riscv_arch_features(void) {
ggml_riscv_arch_features.rvv_vlen = __riscv_vlenb();
}
#else
static void ggml_init_riscv_arch_features(void) {}
#endif
struct ggml_tensor * ggml_new_i32(struct ggml_context * ctx, int32_t value) {
GGML_ASSERT(!ggml_get_no_alloc(ctx));
@@ -2619,7 +2643,7 @@ static void ggml_thread_cpumask_next(const bool * global_mask, bool * local_mask
void ggml_threadpool_free(struct ggml_threadpool* threadpool) {
if (!threadpool) return;
const int n_threads = threadpool->n_threads;
#ifndef GGML_USE_OPENMP
struct ggml_compute_state* workers = threadpool->workers;
@@ -2695,7 +2719,7 @@ struct ggml_cplan ggml_graph_plan(
//GGML_PRINT_DEBUG("Threadpool is not specified. Will create a disposable threadpool : n_threads %d\n", n_threads);
}
if (n_threads <= 0) {
n_threads = threadpool ? threadpool->n_threads : GGML_DEFAULT_N_THREADS;
}
#if defined(__EMSCRIPTEN__) && !defined(__EMSCRIPTEN_PTHREADS__)
@@ -2903,12 +2927,14 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
struct ggml_compute_params params = {
/*.ith =*/ state->ith,
/*.nth =*/ atomic_load_explicit(&tp->n_graph, memory_order_relaxed) & GGML_THREADPOOL_N_THREADS_MASK,
/*.wsize =*/ cplan->work_size,
/*.wdata =*/ cplan->work_data,
/*.threadpool=*/ tp,
};
GGML_PRINT_DEBUG("thread #%d compute-start cplan %p last-graph %d \n", state->ith, cplan, state->last_graph);
for (int node_n = 0; node_n < cgraph->n_nodes && atomic_load_explicit(&tp->abort, memory_order_relaxed) != node_n; node_n++) {
struct ggml_tensor * node = cgraph->nodes[node_n];
@@ -2930,6 +2956,8 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
}
}
GGML_PRINT_DEBUG("thread #%d compute-done cplan %p last-graph %d \n", state->ith, cplan, state->last_graph);
ggml_barrier(state->threadpool);
return 0;
@@ -2937,27 +2965,23 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
#ifndef GGML_USE_OPENMP
// check if thread is ready to proceed (exit from polling or sleeping)
// returns true if loops should exit, sets state->pending to indicate new work
static inline bool ggml_graph_compute_thread_ready(struct ggml_compute_state * state) {
struct ggml_threadpool * threadpool = state->threadpool;
if (state->pending || threadpool->stop || threadpool->pause) { return true; }
// check for new graph/work
int n_graph = atomic_load_explicit(&threadpool->n_graph, memory_order_relaxed);
int n_threads = n_graph & GGML_THREADPOOL_N_THREADS_MASK;
if (n_graph != state->last_graph) {
state->pending = (state->ith < n_threads);
state->last_graph = n_graph;
return true;
}
return false;
}
// sync thread state after polling
@@ -2974,11 +2998,6 @@ static inline void ggml_graph_compute_thread_sync(struct ggml_compute_state * st
static inline bool ggml_graph_compute_poll_for_work(struct ggml_compute_state * state) {
struct ggml_threadpool * threadpool = state->threadpool;
// This seems to make 0 ... 100 a decent range for polling level across modern processors.
// Perhaps, we can adjust it dynamically based on load and things.
const uint64_t n_rounds = 1024UL * 128 * threadpool->poll;
@@ -3040,7 +3059,6 @@ static thread_ret_t ggml_graph_compute_secondary_thread(void* data) {
ggml_graph_compute_check_for_work(state);
if (state->pending) {
ggml_graph_compute_thread(state);
}
}
@@ -3055,14 +3073,15 @@ static void ggml_graph_compute_kickoff(struct ggml_threadpool * threadpool, int
ggml_mutex_lock(&threadpool->mutex);
GGML_PRINT_DEBUG("threadpool: n_threads_cur %d n_threads %d\n", threadpool->n_threads_cur, n_threads);
// Update the number of active threads and the graph count
int n_graph = atomic_load_explicit(&threadpool->n_graph, memory_order_relaxed) >> GGML_THREADPOOL_N_THREADS_BITS;
n_graph = ((n_graph + 1) << GGML_THREADPOOL_N_THREADS_BITS) | (n_threads & GGML_THREADPOOL_N_THREADS_MASK);
GGML_PRINT_DEBUG("compute-kickoff: n_threads %d n_graph %d\n", n_threads, n_graph);
// Indicate the graph is ready to be processed
// We need the full seq-cst fence here because of the polling threads (used in thread_sync)
atomic_store_explicit(&threadpool->n_graph, n_graph, memory_order_seq_cst);
if (threadpool->pause) {
// Update main thread prio and affinity to match the threadpool settings
@@ -3100,8 +3119,7 @@ static struct ggml_threadpool * ggml_threadpool_new_impl(
threadpool->pause = tpp->paused;
threadpool->abort = -1;
threadpool->workers = NULL;
threadpool->n_threads = tpp->n_threads;
threadpool->poll = tpp->poll;
threadpool->prio = tpp->prio;
threadpool->ec = GGML_STATUS_SUCCESS;
@@ -3196,7 +3214,7 @@ enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cpl
{
// update the number of threads from the actual number of threads that we got from OpenMP
n_threads = omp_get_num_threads();
atomic_store_explicit(&threadpool->n_graph, n_threads, memory_order_relaxed);
}
// Apply thread CPU mask and priority
@@ -3209,13 +3227,13 @@ enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cpl
ggml_graph_compute_thread(&threadpool->workers[ith]);
}
} else {
atomic_store_explicit(&threadpool->n_graph, 1, memory_order_relaxed);
ggml_graph_compute_thread(&threadpool->workers[0]);
}
#else
if (n_threads > threadpool->n_threads) {
GGML_LOG_WARN("cplan requested more threads (%d) than available (%d)\n", n_threads, threadpool->n_threads);
n_threads = threadpool->n_threads;
}
// Kick all threads to start the new graph
@@ -3455,6 +3473,14 @@ int ggml_cpu_has_riscv_v(void) {
#endif
}
int ggml_cpu_get_rvv_vlen(void) {
#if defined(__riscv) && defined(__riscv_v_intrinsic)
return ggml_riscv_arch_features.rvv_vlen;
#else
return 0;
#endif
}
int ggml_cpu_has_f16c(void) {
#if defined(__F16C__)
return 1;
@@ -3621,6 +3647,10 @@ void ggml_cpu_init(void) {
ggml_init_arm_arch_features();
#endif
#if defined(__riscv)
ggml_init_riscv_arch_features();
#endif
is_first_call = false;
}

View File

@@ -583,6 +583,10 @@ static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r
if (ggml_cpu_has_riscv_v()) {
features.push_back({ "RISCV_V", "1" });
}
if (ggml_cpu_get_rvv_vlen() > 0) {
static std::string rvv_vlen = std::to_string(ggml_cpu_get_rvv_vlen());
features.push_back({ "RVV_VLEN", rvv_vlen.c_str() });
}
if (ggml_cpu_has_vsx()) {
features.push_back({ "VSX", "1" });
}

View File

@@ -2169,7 +2169,8 @@ static const ggml::cpu::tensor_traits * ggml_repack_get_optimal_repack_type(cons
static const ggml::cpu::repack::tensor_traits<block_iq4_nl, 8, 8, GGML_TYPE_Q8_0> iq4_nl_8x8_q8_0;
if (cur->type == GGML_TYPE_Q4_0) {
if (ggml_cpu_has_avx2() || (ggml_cpu_has_sve() && ggml_cpu_has_matmul_int8() && ggml_cpu_get_sve_cnt() == QK8_0)
|| (ggml_cpu_has_riscv_v() && (ggml_cpu_get_rvv_vlen() >= QK4_0))) {
if (cur->ne[1] % 8 == 0) {
return &q4_0_8x8_q8_0;
}
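Note on the new RISC-V condition: ggml_cpu_get_rvv_vlen() reports __riscv_vlenb(), i.e. the vector length in bytes, so rvv_vlen >= QK4_0 (with QK4_0 == 32) enables the 8x8 repack only on cores with at least 256-bit vectors. A quick check of that arithmetic:

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    const int QK4_0 = 32;                 // Q4_0 block size, in elements
    for (int vlen_bits : { 128, 256, 512 }) {
        const int vlenb = vlen_bits / 8;  // what __riscv_vlenb() would report
        printf("VLEN=%4d bits -> vlenb=%2d -> q4_0_8x8 repack: %s\n",
               vlen_bits, vlenb, vlenb >= QK4_0 ? "yes" : "no");
    }
    return 0;
}
```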

View File

@@ -67,19 +67,22 @@
#define GGML_CUDA_CC_RDNA1 (GGML_CUDA_CC_OFFSET_AMD + 0x1010) // RX 5000
#define GGML_CUDA_CC_RDNA2 (GGML_CUDA_CC_OFFSET_AMD + 0x1030) // RX 6000, minimum for dp4a
#define GGML_CUDA_CC_RDNA3 (GGML_CUDA_CC_OFFSET_AMD + 0x1100) // RX 7000, minimum for WMMA
#define GGML_CUDA_CC_RDNA3_5 (GGML_CUDA_CC_OFFSET_AMD + 0x1150) // AI 370, AI Max 395 laptops.
#define GGML_CUDA_CC_RDNA4 (GGML_CUDA_CC_OFFSET_AMD + 0x1200) // RX 9000
#define GGML_CUDA_CC_IS_AMD(cc) (cc >= GGML_CUDA_CC_OFFSET_AMD)
#define GGML_CUDA_CC_IS_RDNA(cc) (cc >= GGML_CUDA_CC_RDNA1)
#define GGML_CUDA_CC_IS_RDNA1(cc) (cc >= GGML_CUDA_CC_RDNA1 && cc < GGML_CUDA_CC_RDNA2)
#define GGML_CUDA_CC_IS_RDNA2(cc) (cc >= GGML_CUDA_CC_RDNA2 && cc < GGML_CUDA_CC_RDNA3)
#define GGML_CUDA_CC_IS_RDNA3_0(cc) (cc >= GGML_CUDA_CC_RDNA3 && cc < GGML_CUDA_CC_RDNA3_5)
#define GGML_CUDA_CC_IS_RDNA3_5(cc) (cc >= GGML_CUDA_CC_RDNA3_5 && cc < GGML_CUDA_CC_RDNA4)
#define GGML_CUDA_CC_IS_RDNA3(cc) (GGML_CUDA_CC_IS_RDNA3_0(cc) || GGML_CUDA_CC_IS_RDNA3_5(cc))
#define GGML_CUDA_CC_IS_RDNA4(cc) (cc >= GGML_CUDA_CC_RDNA4)
#define GGML_CUDA_CC_IS_GCN(cc) (cc > GGML_CUDA_CC_OFFSET_AMD && cc < GGML_CUDA_CC_CDNA1)
#define GGML_CUDA_CC_IS_CDNA(cc) (cc >= GGML_CUDA_CC_CDNA1 && cc < GGML_CUDA_CC_RDNA1)
#define GGML_CUDA_CC_IS_CDNA1(cc) (cc >= GGML_CUDA_CC_CDNA1 && cc < GGML_CUDA_CC_CDNA2)
#define GGML_CUDA_CC_IS_CDNA2(cc) (cc >= GGML_CUDA_CC_CDNA2 && cc < GGML_CUDA_CC_CDNA3)
#define GGML_CUDA_CC_IS_CDNA3(cc) (cc >= GGML_CUDA_CC_CDNA3 && cc < GGML_CUDA_CC_RDNA1)
// Moore Threads
#define MUSART_HMASK 40300 // MUSA rc4.3, min. ver. for half2 -> uint mask comparisons

View File

@@ -0,0 +1,77 @@
#include "convert.cuh"
#include "diag.cuh"
#include "ggml.h"
template <typename T>
static __global__ void diag_kernel(T * __restrict__ dst,
const T * __restrict__ src,
const int64_t ne0,
const int64_t ne1,
const int64_t ne2,
const int64_t ne3,
const int64_t total_elements) {
const int64_t global_idx = blockIdx.x * blockDim.x + threadIdx.x;
if (global_idx >= total_elements) {
return;
}
const int64_t i0 = global_idx % ne0;
const int64_t i1 = (global_idx / ne0) % ne1;
const int64_t i2 = (global_idx / (ne0 * ne1)) % ne2;
const int64_t i3 = global_idx / (ne0 * ne1 * ne2);
const int64_t dst_idx = ((i3 * ne2 + i2) * ne1 + i1) * ne0 + i0;
if (i0 == i1) {
const int64_t batch_idx = i3 * ne2 + i2;
const int64_t src_idx = batch_idx * ne0 + i0;
dst[dst_idx] = src[src_idx];
} else {
dst[dst_idx] = ggml_cuda_cast<T>(0);
}
GGML_UNUSED_VARS(ne3);
}
void ggml_cuda_op_diag(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0];
void * dst_d = dst->data;
const void * src0_d = src0->data;
cudaStream_t stream = ctx.stream();
GGML_ASSERT(ggml_is_contiguous(dst));
GGML_ASSERT(ggml_is_contiguous(src0));
const int64_t ne00 = src0->ne[0];
const int64_t ne01 = src0->ne[1];
const int64_t ne02 = src0->ne[2];
const int64_t ne03 = src0->ne[3];
const int64_t ne0 = dst->ne[0];
const int64_t ne1 = dst->ne[1];
const int64_t ne2 = dst->ne[2];
const int64_t ne3 = dst->ne[3];
GGML_ASSERT(ne00 == ne0);
GGML_ASSERT(ne01 == 1);
GGML_ASSERT(ne02 == ne2);
GGML_ASSERT(ne03 == ne3);
const int64_t n_elems = ggml_nelements(dst);
const int64_t num_blocks = (n_elems + CUDA_DIAG_BLOCK_SIZE - 1) / CUDA_DIAG_BLOCK_SIZE;
switch (dst->type) {
case GGML_TYPE_F32:
diag_kernel<<<num_blocks, CUDA_DIAG_BLOCK_SIZE, 0, stream>>>((float *) dst_d, (const float *) src0_d, ne0,
ne1, ne2, ne3, n_elems);
break;
case GGML_TYPE_F16:
diag_kernel<<<num_blocks, CUDA_DIAG_BLOCK_SIZE, 0, stream>>>((half *) dst_d, (const half *) src0_d, ne0,
ne1, ne2, ne3, n_elems);
break;
default:
GGML_ABORT("unsupported type");
}
}
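From the graph side this kernel backs GGML_OP_DIAG, which ggml exposes as ggml_diag(): a (ne0, 1, ne2, ne3) row is expanded into an (ne0, ne0, ne2, ne3) tensor with that row on the diagonal and zeros elsewhere. A minimal host-side sketch, assuming a ggml build with the CUDA backend (backend scheduling boilerplate omitted):

```cpp
#include "ggml.h"

int main() {
    struct ggml_init_params params = { 16u * 1024 * 1024, NULL, false };
    struct ggml_context * ctx = ggml_init(params);
    struct ggml_tensor  * a   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    // GGML_OP_DIAG: expands the length-4 row into a 4x4 diagonal matrix;
    // on a CUDA build this now dispatches to diag_kernel above
    struct ggml_tensor  * d   = ggml_diag(ctx, a);
    struct ggml_cgraph  * gf  = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, d);
    // ... run gf on a backend (e.g. via ggml_backend_graph_compute) ...
    ggml_free(ctx);
    return 0;
}
```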

View File

@@ -0,0 +1,5 @@
#include "common.cuh"
#define CUDA_DIAG_BLOCK_SIZE 256
void ggml_cuda_op_diag(ggml_backend_cuda_context & ctx, ggml_tensor * dst);

View File

@@ -642,8 +642,8 @@ static __global__ void flash_attn_stream_k_fixup(
const int iter_k = (ne11 + (nbatch_fa - 1)) / nbatch_fa;
const int iter_j = (ne01 + (ncols1 - 1)) / ncols1;
const int kbc0 = int64_t(bidx0 + 0)*(iter_k*iter_j*(ne02/ncols2)*ne03) / gridDim.x;
const int kbc0_stop = int64_t(bidx0 + 1)*(iter_k*iter_j*(ne02/ncols2)*ne03) / gridDim.x;
const bool did_not_have_any_data = kbc0 == kbc0_stop;
const bool wrote_beginning_of_tile = kbc0 % iter_k == 0;
@@ -679,7 +679,7 @@ static __global__ void flash_attn_stream_k_fixup(
int bidx = bidx0 - 1;
int kbc_stop = kbc0;
while(true) {
const int kbc = int64_t(bidx)*(iter_k*iter_j*(ne02/ncols2)*ne03) / gridDim.x;
if (kbc == kbc_stop) { // Did not have any data.
bidx--;
kbc_stop = kbc;
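The fix in this hunk is pure integer-width hygiene: iter_k*iter_j*(ne02/ncols2)*ne03 can approach 2^31 on its own, so multiplying by the block index in 32-bit wraps before the division. A small demonstration with stand-in numbers (the unsigned cast only keeps the demo itself free of undefined behaviour):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const int bidx0 = 1000;
    const int work  = 3000000;   // stands in for iter_k*iter_j*(ne02/ncols2)*ne03
    const int grid  = 1000;      // gridDim.x
    // 32-bit product: 3'000'000'000 > INT_MAX, wraps around
    const int bad   = (int) (unsigned(bidx0) * unsigned(work)) / grid;
    // 64-bit product, as in the fixed kernel
    const int good  = (int) (int64_t(bidx0) * work / grid);
    printf("32-bit: %d, 64-bit: %d\n", bad, good);
    return 0;
}
```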

View File

@@ -955,22 +955,11 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
(K_h2 + int64_t(kb0)*nbatch_fa*stride_K, tile_K, nbatch_K2, stride_K, k_VKQ_sup);
}
for (; kb0 < kb0_stop-1; ++kb0) {
constexpr bool last_iter = false;
constexpr bool oob_check = false;
constexpr int k_VKQ_sup = nbatch_fa;
flash_attn_ext_f16_iter
<DKQ, DV, ncols1, ncols2, nwarps, use_logit_softcap, mla, needs_fixup, is_fixup, last_iter, oob_check,
T_A_KQ, T_B_KQ, T_C_KQ, T_A_VKQ, T_B_VKQ, T_C_VKQ>
(Q_f2, K_h2, V_h2, mask_h, dstk, dstk_fixup, scale, slope, logit_softcap,
ne01, ne02, stride_K, stride_V, stride_mask, tile_Q, tile_K, tile_V, tile_mask, Q_B, VKQ_C,
KQ_max, KQ_rowsum, jt, kb0, k_VKQ_sup);
}
// kb0_start is always < kb0_stop so the last iter can be executed unconditionally.
if constexpr (ncols2 == 1) {
if (ne11 % nbatch_fa == 0) {
constexpr bool last_iter = true;
constexpr bool oob_check = false;
constexpr bool oob_check = true;
for (; kb0 < kb0_stop-1; ++kb0) {
constexpr bool last_iter = false;
constexpr int k_VKQ_sup = nbatch_fa;
flash_attn_ext_f16_iter
<DKQ, DV, ncols1, ncols2, nwarps, use_logit_softcap, mla, needs_fixup, is_fixup, last_iter, oob_check,
@@ -978,10 +967,20 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
(Q_f2, K_h2, V_h2, mask_h, dstk, dstk_fixup, scale, slope, logit_softcap,
ne01, ne02, stride_K, stride_V, stride_mask, tile_Q, tile_K, tile_V, tile_mask, Q_B, VKQ_C,
KQ_max, KQ_rowsum, jt, kb0, k_VKQ_sup);
} else {
constexpr bool last_iter = true;
constexpr bool oob_check = true;
const int k_VKQ_sup = ne11 - kb0*nbatch_fa;
}
constexpr bool last_iter = true;
const int k_VKQ_sup = ne11 - kb0*nbatch_fa;
flash_attn_ext_f16_iter
<DKQ, DV, ncols1, ncols2, nwarps, use_logit_softcap, mla, needs_fixup, is_fixup, last_iter, oob_check,
T_A_KQ, T_B_KQ, T_C_KQ, T_A_VKQ, T_B_VKQ, T_C_VKQ>
(Q_f2, K_h2, V_h2, mask_h, dstk, dstk_fixup, scale, slope, logit_softcap,
ne01, ne02, stride_K, stride_V, stride_mask, tile_Q, tile_K, tile_V, tile_mask, Q_B, VKQ_C,
KQ_max, KQ_rowsum, jt, kb0, k_VKQ_sup);
} else {
constexpr bool oob_check = false;
for (; kb0 < kb0_stop-1; ++kb0) {
constexpr bool last_iter = false;
constexpr int k_VKQ_sup = nbatch_fa;
flash_attn_ext_f16_iter
<DKQ, DV, ncols1, ncols2, nwarps, use_logit_softcap, mla, needs_fixup, is_fixup, last_iter, oob_check,
T_A_KQ, T_B_KQ, T_C_KQ, T_A_VKQ, T_B_VKQ, T_C_VKQ>
@@ -989,9 +988,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
ne01, ne02, stride_K, stride_V, stride_mask, tile_Q, tile_K, tile_V, tile_mask, Q_B, VKQ_C,
KQ_max, KQ_rowsum, jt, kb0, k_VKQ_sup);
}
} else {
constexpr bool last_iter = true;
constexpr bool oob_check = false;
constexpr int k_VKQ_sup = nbatch_fa;
flash_attn_ext_f16_iter
<DKQ, DV, ncols1, ncols2, nwarps, use_logit_softcap, mla, needs_fixup, is_fixup, last_iter, oob_check,
@@ -1383,8 +1380,8 @@ static __global__ void flash_attn_ext_f16(
const int iter_j = (ne01.z + (ncols1 - 1)) / ncols1;
// kbc == k block continuous, current index in continuous ijk space.
int kbc = (blockIdx.x + 0)*(iter_k*iter_j*(ne02/ncols2)*ne03) / gridDim.x;
const int kbc_stop = (blockIdx.x + 1)*(iter_k*iter_j*(ne02/ncols2)*ne03) / gridDim.x;
int kbc = int64_t(blockIdx.x + 0)*(iter_k*iter_j*(ne02/ncols2)*ne03) / gridDim.x;
const int kbc_stop = int64_t(blockIdx.x + 1)*(iter_k*iter_j*(ne02/ncols2)*ne03) / gridDim.x;
// If the seams of 2 CUDA blocks fall within an output tile their results need to be combined.
// For this we need to track both the block that starts the tile (needs_fixup) and the block that finishes the tile (is_fixup).
@@ -1404,7 +1401,7 @@ static __global__ void flash_attn_ext_f16(
const float2 * Q_f2 = (const float2 *) (Q + nb03*sequence + nb02* head0);
const half2 * K_h2 = (const half2 *) (K + nb13*sequence + nb12*(head0 / gqa_ratio));
const half * mask_h = ncols2 == 1 && !mask ? nullptr :
(const half *) (mask + nb33*(sequence % ne33));
float2 * dstk = ((float2 *) dst) + (sequence*ne01.z*ne02 + head0) * (DV/2);
const half2 * V_h2 = mla ? K_h2 + (DKQ/2 - DV/2) : (const half2 *) (V + nb23*sequence + nb22*(head0 / gqa_ratio));

View File

@@ -564,6 +564,12 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
for (int i_KQ_0 = 0; i_KQ_0 < nbatch_fa; i_KQ_0 += np*warp_size) {
const int i_KQ = i_KQ_0 + (threadIdx.y % np)*warp_size + threadIdx.x;
#if defined(FAST_FP16_AVAILABLE) && !defined(V_DOT2_F32_F16_AVAILABLE)
// Without the v_dot2_f32_f16 instruction there is a higher risk of numerical overflow in the KQ calculation.
// Therefore, scale down Q values and apply the inverse scale the FP32 KQ values afterwards again.
KQ_acc[i_KQ_0/(np*warp_size)*cpw + jc0] *= 4.0f;
#endif // defined(FAST_FP16_AVAILABLE) && !defined(V_DOT2_F32_F16_AVAILABLE)
if (use_logit_softcap) {
KQ_acc[(i_KQ_0/(np*warp_size))*cpw + jc0] = logit_softcap * tanhf(KQ_acc[(i_KQ_0/(np*warp_size))*cpw + jc0]);
}
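The 4.0f here pairs with the 0.25f applied to Q in the second hunk below: powers of two only change the floating-point exponent, so scaling Q down before the FP16 dot product and scaling the FP32 accumulator back up is lossless while keeping the intermediate sum under the FP16 maximum of 65504. A worked example with plausible magnitudes (values are illustrative only):

```cpp
#include <cstdio>

int main() {
    const float q = 30.0f, k = 60.0f;  // per-element magnitudes in the KQ dot product
    const int   d = 64;                // head size: number of products accumulated
    printf("unscaled dot: %g (overflows FP16, max 65504)\n", q * k * d);
    printf("scaled dot:   %g (fits FP16), restored: %g\n",
           0.25f * q * k * d, 4.0f * (0.25f * q * k * d));
    return 0;
}
```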
@@ -858,6 +864,11 @@ static __global__ void flash_attn_tile(
#pragma unroll
for (int i1 = 0; i1 < cpy_ne_D; i1 += 2) {
tmp_h2[i1/2] = make_half2(tmp_f[i1 + 0], tmp_f[i1 + 1]);
#if defined(FAST_FP16_AVAILABLE) && !defined(V_DOT2_F32_F16_AVAILABLE)
// Without the v_dot2_f32_f16 instruction there is a higher risk of numerical overflow in the KQ calculation.
// Therefore, scale down Q values and apply the inverse scale the FP32 KQ values afterwards again.
tmp_h2[i1/2] *= make_half2(0.25f, 0.25f);
#endif // defined(FAST_FP16_AVAILABLE) && !defined(V_DOT2_F32_F16_AVAILABLE)
}
ggml_cuda_memcpy_1<sizeof(tmp_h2)>(
&Q_tmp[jc*(DKQ/2) + i0/2 + (threadIdx.y % np)*(warp_size*cpy_ne_D/2) + threadIdx.x*(cpy_ne_D/2)],

View File

@@ -36,12 +36,26 @@ static void ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2(ggml_backend_cuda_con
const ggml_tensor * KQV = dst;
const ggml_tensor * Q = dst->src[0];
const ggml_tensor * K = dst->src[1];
const ggml_tensor * V = dst->src[2];
const ggml_tensor * mask = dst->src[3];
float max_bias = 0.0f;
memcpy(&max_bias, (const float *) KQV->op_params + 1, sizeof(float));
// Edge cases like no mask, ALiBi, unpadded K/V, or misaligned addresses for large data transfers
// are put into the template specialization without GQA optimizations.
bool use_gqa_opt = mask && max_bias == 0.0f && K->ne[1] % FATTN_KQ_STRIDE == 0;
for (const ggml_tensor * t : {Q, K, V, mask}) {
if (t == nullptr) {
continue;
}
for (size_t i = 1; i < GGML_MAX_DIMS; ++i) {
if (t->nb[i] % 16 != 0) {
use_gqa_opt = false;
break;
}
}
}
GGML_ASSERT(Q->ne[2] % K->ne[2] == 0);
const int gqa_ratio = Q->ne[2] / K->ne[2];

View File

@@ -0,0 +1,37 @@
#include "fill.cuh"
#include "convert.cuh"
#define CUDA_FILL_BLOCK_SIZE 256
template <typename T>
static __global__ void fill_kernel(T * dst, const int64_t k, const T value) {
const int64_t i = (int64_t)blockDim.x * blockIdx.x + threadIdx.x;
if (i >= k) {
return;
}
dst[i] = value;
}
void ggml_cuda_op_fill(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
void * dst_d = dst->data;
cudaStream_t stream = ctx.stream();
GGML_ASSERT(ggml_is_contiguous(dst));
float value;
memcpy(&value, dst->op_params, sizeof(float));
const int64_t k = ggml_nelements(dst);
const int64_t num_blocks = (k + CUDA_FILL_BLOCK_SIZE - 1) / CUDA_FILL_BLOCK_SIZE;
switch (dst->type) {
case GGML_TYPE_F32:
fill_kernel<<<num_blocks, CUDA_FILL_BLOCK_SIZE, 0, stream>>>((float *)dst_d, k, value);
break;
case GGML_TYPE_F16:
fill_kernel<<<num_blocks, CUDA_FILL_BLOCK_SIZE, 0, stream>>>((half *)dst_d, k, ggml_cuda_cast<half>(value));
break;
default:
GGML_ABORT("unsupported type");
}
}
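The memcpy above follows the usual ggml convention: op_params is an int32 array, and the fill value is stored in it bit-for-bit, so it must be read back with memcpy rather than a cast of the pointed-to value. A minimal round-trip of that convention:

```cpp
#include <cstdio>
#include <cstring>
#include <cstdint>

int main() {
    int32_t op_params[4] = { 0 };
    const float v = 1.5f;
    memcpy(&op_params[0], &v, sizeof(float));   // how the graph stores the fill value
    float out;
    memcpy(&out, &op_params[0], sizeof(float)); // how the kernel wrapper reads it back
    printf("stored %g, read back %g\n", v, out);
    return 0;
}
```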

View File

@@ -0,0 +1,3 @@
#include "common.cuh"
void ggml_cuda_op_fill(ggml_backend_cuda_context & ctx, ggml_tensor * dst);

View File

@@ -20,6 +20,7 @@
#include "ggml-cuda/cpy.cuh"
#include "ggml-cuda/cross-entropy-loss.cuh"
#include "ggml-cuda/diagmask.cuh"
#include "ggml-cuda/diag.cuh"
#include "ggml-cuda/fattn.cuh"
#include "ggml-cuda/getrows.cuh"
#include "ggml-cuda/im2col.cuh"
@@ -56,6 +57,7 @@
#include "ggml-cuda/solve_tri.cuh"
#include "ggml-cuda/tri.cuh"
#include "ggml-cuda/cumsum.cuh"
#include "ggml-cuda/fill.cuh"
#include "ggml.h"
#include <algorithm>
@@ -2640,6 +2642,9 @@ static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct gg
case GGML_OP_PERMUTE:
case GGML_OP_TRANSPOSE:
break;
case GGML_OP_DIAG:
ggml_cuda_op_diag(ctx, dst);
break;
case GGML_OP_DIAG_MASK_INF:
ggml_cuda_op_diag_mask_inf(ctx, dst);
break;
@@ -2730,6 +2735,9 @@ static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct gg
case GGML_OP_SOLVE_TRI:
ggml_cuda_op_solve_tri(ctx, dst);
break;
case GGML_OP_FILL:
ggml_cuda_op_fill(ctx, dst);
break;
default:
return false;
}
@@ -4305,6 +4313,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_UNARY_OP_EXPM1:
case GGML_UNARY_OP_SOFTPLUS:
case GGML_UNARY_OP_ELU:
case GGML_UNARY_OP_XIELU:
case GGML_UNARY_OP_FLOOR:
case GGML_UNARY_OP_CEIL:
case GGML_UNARY_OP_ROUND:
@@ -4617,11 +4626,13 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_OP_CROSS_ENTROPY_LOSS_BACK:
case GGML_OP_OPT_STEP_ADAMW:
case GGML_OP_OPT_STEP_SGD:
case GGML_OP_FILL:
case GGML_OP_CUMSUM:
case GGML_OP_TRI:
return true;
case GGML_OP_DIAG:
case GGML_OP_SOLVE_TRI:
return true;
default:
return false;
}

View File

@@ -189,6 +189,9 @@ namespace ggml_cuda_mma {
return 8 * (threadIdx.x / 16) + l;
#elif defined(RDNA3)
return 2 * l + (threadIdx.x / 16);
#else
NO_DEVICE_CODE;
return -1;
#endif // defined(RDNA4)
} else {
NO_DEVICE_CODE;
@@ -290,8 +293,12 @@ namespace ggml_cuda_mma {
}
}
#elif defined(AMD_WMMA_AVAILABLE)
#if defined(RDNA3)
// RDNA3 has duplicated data as input.
static constexpr int ne = I * J / 32 * 2;
#else
static constexpr int ne = I * J / 32;
#endif // defined(RDNA3)
half2 x[ne] = {{0.0f, 0.0f}};
static constexpr __device__ bool supported() {
@@ -310,7 +317,14 @@ namespace ggml_cuda_mma {
static __device__ __forceinline__ int get_j(const int l) {
if constexpr (I == 16 && J == 8) {
#if defined(RDNA4)
return 4 * (threadIdx.x / 16) + l;
#elif defined(RDNA3)
return l;
#else
NO_DEVICE_CODE;
return -1;
#endif // defined(RDNA4)
} else {
NO_DEVICE_CODE;
return -1;
@@ -366,11 +380,16 @@ namespace ggml_cuda_mma {
static constexpr int I = I_;
static constexpr int J = J_;
static constexpr data_layout dl = DATA_LAYOUT_I_MAJOR;
#if defined(AMD_WMMA_AVAILABLE)
#if defined(RDNA3)
// RDNA3 has duplicated data as input.
static constexpr int ne = I * J / 32 * 2;
#else
static constexpr int ne = I * J / 32;
#endif // defined(RDNA3)
nv_bfloat162 x[ne] = {{0.0f, 0.0f}};
static constexpr __device__ bool supported() {
if (I == 16 && J == 8) return true;
return false;
@@ -387,13 +406,23 @@ namespace ggml_cuda_mma {
static __device__ __forceinline__ int get_j(const int l) {
if constexpr (I == 16 && J == 8) {
#if defined(RDNA4)
return 4 * (threadIdx.x / 16) + l;
#elif defined(RDNA3)
return l;
#else
NO_DEVICE_CODE;
return -1;
#endif // defined(RDNA4)
} else {
NO_DEVICE_CODE;
return -1;
}
}
#else
static constexpr int ne = I * J / WARP_SIZE;
nv_bfloat162 x[ne] = {{0.0f, 0.0f}};
static constexpr __device__ bool supported() {
if (I == 8 && J == 8) return true;
if (I == 16 && J == 4) return true;
@@ -546,8 +575,14 @@ namespace ggml_cuda_mma {
}
#elif defined(AMD_WMMA_AVAILABLE)
if constexpr (std::is_same_v<T, half2> || std::is_same_v<T, nv_bfloat162>) {
#if defined(RDNA4)
ggml_cuda_memcpy_1<sizeof(t.x)>(t.x, xs0 + t.get_i(0) * stride + t.get_j(0));
#elif defined(RDNA3)
ggml_cuda_memcpy_1<sizeof(t.x)/2>(t.x, xs0 + t.get_i(0) * stride + t.get_j(0));
ggml_cuda_memcpy_1<sizeof(t.x)/2>(t.x + t.ne/2, xs0 + t.get_i(0) * stride + t.get_j(t.ne/2));
#else
NO_DEVICE_CODE;
#endif // defined(RDNA4)
} else if constexpr (std::is_same_v<T, int>) {
if constexpr (I == 16 && J == 4) {
int64_t * xi = (int64_t *) t.x;
@@ -888,6 +923,16 @@ namespace ggml_cuda_mma {
const halfx8_t& a_frag = reinterpret_cast<const halfx8_t&>(A.x[0]);
const halfx8_t& b_frag = reinterpret_cast<const halfx8_t&>(B.x[0]);
acc_frag = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12(a_frag, b_frag, acc_frag);
#elif defined(RDNA3)
using halfx16_t = __attribute__((ext_vector_type(16))) _Float16;
using floatx8_t = __attribute__((ext_vector_type(8))) float;
floatx8_t& acc_frag = reinterpret_cast<floatx8_t&>(D.x[0]);
const halfx16_t& a_frag = reinterpret_cast<const halfx16_t&>(A.x[0]);
const halfx16_t& b_frag = reinterpret_cast<const halfx16_t&>(B.x[0]);
acc_frag = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a_frag, b_frag, acc_frag);
#else
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // RDNA4
#else
GGML_UNUSED_VARS(D, A, B);
@@ -905,6 +950,16 @@ namespace ggml_cuda_mma {
const bf16x8_t& a_frag = reinterpret_cast<const bf16x8_t&>(A.x[0]);
const bf16x8_t& b_frag = reinterpret_cast<const bf16x8_t&>(B.x[0]);
acc_frag = __builtin_amdgcn_wmma_f32_16x16x16_bf16_w32_gfx12(a_frag, b_frag, acc_frag);
#elif defined(RDNA3)
using bf16x16_t = __attribute__((ext_vector_type(16))) __bf16;
using floatx8_t = __attribute__((ext_vector_type(8))) float;
floatx8_t& acc_frag = reinterpret_cast<floatx8_t&>(D.x[0]);
const bf16x16_t& a_frag = reinterpret_cast<const bf16x16_t&>(A.x[0]);
const bf16x16_t& b_frag = reinterpret_cast<const bf16x16_t&>(B.x[0]);
acc_frag = __builtin_amdgcn_wmma_f32_16x16x16_bf16_w32(a_frag, b_frag, acc_frag);
#else
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // RDNA4
#else
GGML_UNUSED_VARS(D, A, B);

View File

@@ -151,7 +151,9 @@ bool ggml_cuda_should_use_mmf(enum ggml_type type, int cc, int warp_size, const
return false;
}
} else {
if (GGML_CUDA_CC_IS_RDNA3_0(cc) && src1_ncols > 8) {
return false;
} else if (src1_ncols > 16) {
return false;
}
}
@@ -160,9 +162,9 @@ bool ggml_cuda_should_use_mmf(enum ggml_type type, int cc, int warp_size, const
case GGML_TYPE_F32:
return ampere_mma_available(cc);
case GGML_TYPE_F16:
return volta_mma_available(cc) || turing_mma_available(cc) || amd_wmma_available(cc);
case GGML_TYPE_BF16:
return ampere_mma_available(cc) || amd_wmma_available(cc);
default:
return false;
}

View File

@@ -765,7 +765,10 @@ bool ggml_cuda_should_use_mmvf(enum ggml_type type, int cc, const int64_t * src0
return ne11 <= 8;
} else if (GGML_CUDA_CC_IS_AMD(cc)) {
if (fp16_mma_hardware_available(cc)) {
if (GGML_CUDA_CC_IS_RDNA3(cc)) {
return ne11 <= 3;
}
if (GGML_CUDA_CC_IS_RDNA4(cc)) {
return ne11 <= 5;
}
return ne11 <= 2;

View File

@@ -5,6 +5,79 @@
#define MAX_N_FAST 64
#define MAX_K_FAST 32
static __global__ void get_batch_pointers(const float * A,
float * X,
const float ** A_ptrs,
float ** X_ptrs,
int64_t ne02,
int64_t total_batches,
size_t s02,
size_t s03,
size_t s2,
size_t s3) {
const int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= total_batches) {
return;
}
const int64_t i3 = idx / ne02;
const int64_t i2 = idx % ne02;
A_ptrs[idx] = A + i3 * s03 + i2 * s02;
X_ptrs[idx] = X + i3 * s3 + i2 * s2;
}
static void solve_tri_f32_cublas(ggml_backend_cuda_context & ctx,
const float * A,
const float * B,
float * X,
int n,
int k,
int64_t ne02,
int64_t ne03,
size_t s02,
size_t s03,
size_t s12,
size_t s13,
size_t s2,
size_t s3,
cudaStream_t stream) {
const float alpha = 1.0f;
const int64_t total_batches = ne02 * ne03;
if (total_batches == 0) {
return;
}
// Bulk copy B -> X (contiguous tensors)
if (X != B) {
const int64_t total_elements_BX = n * k * total_batches;
CUDA_CHECK(cudaMemcpyAsync(X, B, total_elements_BX * sizeof(float), cudaMemcpyDeviceToDevice, stream));
}
const int id = ggml_cuda_get_device();
ggml_cuda_pool_alloc<const float *> A_ptrs_alloc(ctx.pool(id), total_batches);
ggml_cuda_pool_alloc<float *> X_ptrs_alloc(ctx.pool(id), total_batches);
const float ** A_ptrs_dev = A_ptrs_alloc.get();
float ** X_ptrs_dev = X_ptrs_alloc.get();
get_batch_pointers<<<(total_batches + 255) / 256, 256, 0, stream>>>(A, X, A_ptrs_dev, X_ptrs_dev, ne02,
total_batches, s02, s03, s2, s3);
CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream));
// Force default (full FP32) math: with TF32 enabled the triangular solve fails RMSE checks
CUBLAS_CHECK(cublasSetMathMode(ctx.cublas_handle(id), CUBLAS_DEFAULT_MATH));
CUBLAS_CHECK(cublasStrsmBatched(ctx.cublas_handle(id), CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
CUBLAS_DIAG_NON_UNIT, k, n, &alpha, A_ptrs_dev, n, X_ptrs_dev, k, total_batches));
// revert to standard mode from common.cuh
CUBLAS_CHECK(cublasSetMathMode(ctx.cublas_handle(id), CUBLAS_TF32_TENSOR_OP_MATH));
GGML_UNUSED_VARS(s12, s13);
}
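A note on the cuBLAS parameters above (my reading, not stated in the code): ggml tensors are row-major while cuBLAS assumes column-major, so the row-major lower-triangular $A$ is seen by cuBLAS as the upper-triangular $A^{\top}$, and under the same reinterpretation the $X$ buffer with leading dimension $k$ is $X^{\top}$. The row-major system therefore becomes a right-side, upper-triangular solve:

$$
A X = B \;\Longleftrightarrow\; X^{\top} A^{\top} = B^{\top},
$$

which is exactly CUBLAS_SIDE_RIGHT with CUBLAS_FILL_MODE_UPPER and leading dimensions $n$ and $k$.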
// ======================
// Fast Kernel (n <= 64, k <= 32) - Warp-based parallel reduction
// ======================
@@ -48,65 +121,58 @@ static __global__ void solve_tri_f32_fast(const float * __restrict__ A,
float * X_batch = (float *) (X + i02 * nb2 + i03 * nb3);
__shared__ float sA[MAX_N_FAST * MAX_N_FAST];
const int offset = threadIdx.x + threadIdx.y * blockDim.x;
#pragma unroll
for (int i = 0; i < n * n; i += k * WARP_SIZE) {
const int i0 = i + offset;
if (i0 < n * n) {
sA[i0] = A_batch[i0];
}
}
__syncthreads();
float x_low = (lane < n) ? B_batch[lane * k + col_idx] : 0.0f;
float x_high = (WARP_SIZE + lane < n) ? B_batch[(WARP_SIZE + lane) * k + col_idx] : 0.0f;
const int half = WARP_SIZE;
const int nrows_low = (n < half) ? n : half;
#pragma unroll
for (int row = 0; row < nrows_low; ++row) {
float sum = 0.0f;
if (lane < row) {
sum += sA[row * n + lane] * x_low;
}
sum = warp_reduce_sum(sum);
// no safeguards for division by zero because that indicates corrupt
// data anyway
if (lane == row) {
x_low = (x_low - sum) / sA[row * n + row];
}
}
__syncthreads();
#pragma unroll
for (int row = half; row < n; ++row) {
float sum = sA[row * n + lane] * x_low;
const int j = half + lane;
if (j < row) {
sum += sA[row * n + j] * x_high;
}
sum = warp_reduce_sum(sum);
if (lane == row - half) {
x_high = (x_high - sum) / sA[row * n + row];
}
}
#pragma unroll
for (int rr = 0; rr < 2; ++rr) {
const int row = rr * WARP_SIZE + lane;
if (row < n) {
const float val = (row < half) ? x_low : x_high;
X_batch[row * k + col_idx] = val;
}
}
}
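For readers of the rewritten fast kernel: each lane now keeps two solution elements in registers (x_low for row = lane, x_high for row = 32 + lane) instead of staging the column in the old sXt shared buffer, and the warp reduction sums the lane-distributed products A[row][j] * x[j]. A CPU analogue of the two-phase forward substitution, under the same n <= 64 assumption:

```cpp
#include <vector>

// Solve L * x = b for lower-triangular L (n <= 64), phrased in the same
// low/high halves the kernel keeps in registers (half = warp size = 32).
static std::vector<float> solve_lower(const std::vector<float> & L,
                                      std::vector<float> b, int n) {
    const int half = 32;
    // phase 1: rows [0, min(n, 32)) depend only on earlier "low" entries
    for (int row = 0; row < (n < half ? n : half); ++row) {
        float sum = 0.0f;
        for (int j = 0; j < row; ++j) sum += L[row * n + j] * b[j];
        b[row] = (b[row] - sum) / L[row * n + row];
    }
    // phase 2: rows [32, n) additionally consume the finished low half
    for (int row = half; row < n; ++row) {
        float sum = 0.0f;
        for (int j = 0; j < row; ++j) sum += L[row * n + j] * b[j];
        b[row] = (b[row] - sum) / L[row * n + row];
    }
    return b; // b now holds x
}
```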
@@ -184,20 +250,26 @@ static void solve_tri_f32_cuda(const float * A,
}
void ggml_cuda_op_solve_tri(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0]; // A (n×n, lower triangular)
const ggml_tensor * src1 = dst->src[1]; // B (n×k)
GGML_ASSERT(ggml_is_contiguous(src0));
GGML_ASSERT(ggml_is_contiguous(src1));
const int64_t n = src0->ne[0];
const int64_t k = src1->ne[0];
const int64_t ne02 = src0->ne[2];
const int64_t ne03 = src0->ne[3];
if (n <= MAX_N_FAST && k <= MAX_K_FAST) {
solve_tri_f32_cuda((const float *) src0->data, (const float *) src1->data, (float *) dst->data, n, k,
src0->ne[2], src0->ne[3], src0->nb[2] / sizeof(float), src0->nb[3] / sizeof(float),
src1->nb[2] / sizeof(float), src1->nb[3] / sizeof(float), dst->nb[2] / sizeof(float),
dst->nb[3] / sizeof(float), ctx.stream());
} else {
solve_tri_f32_cublas(ctx, (const float *) src0->data, (const float *) src1->data, (float *) dst->data, n, k,
ne02, ne03, src0->nb[2] / sizeof(float), src0->nb[3] / sizeof(float),
src1->nb[2] / sizeof(float), src1->nb[3] / sizeof(float), dst->nb[2] / sizeof(float),
dst->nb[3] / sizeof(float), ctx.stream());
}
}

View File

@@ -19,6 +19,9 @@
#define CUDA_R_16F HIPBLAS_R_16F
#define CUDA_R_16BF HIPBLAS_R_16B
#define CUDA_R_32F HIPBLAS_R_32F
#define CUBLAS_SIDE_RIGHT HIPBLAS_SIDE_RIGHT
#define CUBLAS_FILL_MODE_UPPER HIPBLAS_FILL_MODE_UPPER
#define CUBLAS_DIAG_NON_UNIT HIPBLAS_DIAG_NON_UNIT
#define CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED hipDeviceAttributeVirtualMemoryManagementSupported
#define CU_MEM_ALLOC_GRANULARITY_RECOMMENDED hipMemAllocationGranularityRecommended
#define CU_MEM_ALLOCATION_TYPE_PINNED hipMemAllocationTypePinned
@@ -30,6 +33,7 @@
#define __shfl_xor_sync(mask, var, laneMask, width) __shfl_xor(var, laneMask, width)
#define __all_sync(mask, var) __all(var)
#define __any_sync(mask, var) __any(var)
#define cublasStrsmBatched hipblasStrsmBatched
#define cublasCreate hipblasCreate
#define cublasDestroy hipblasDestroy
#define cublasGemmEx hipblasGemmEx

View File

@@ -12,11 +12,16 @@
#define CUBLAS_GEMM_DEFAULT_TENSOR_OP MUBLAS_GEMM_DEFAULT
#define CUBLAS_OP_N MUBLAS_OP_N
#define CUBLAS_OP_T MUBLAS_OP_T
#define CUBLAS_DEFAULT_MATH MUBLAS_DEFAULT_MATH
#define CUBLAS_SIDE_RIGHT MUBLAS_SIDE_RIGHT
#define CUBLAS_FILL_MODE_UPPER MUBLAS_FILL_MODE_UPPER
#define CUBLAS_DIAG_NON_UNIT MUBLAS_DIAG_NON_UNIT
#define CUBLAS_STATUS_SUCCESS MUBLAS_STATUS_SUCCESS
#define CUBLAS_TF32_TENSOR_OP_MATH MUBLAS_TENSOR_OP_MATH
#define CUDA_R_16F MUSA_R_16F
#define CUDA_R_16BF MUSA_R_16BF
#define CUDA_R_32F MUSA_R_32F
#define cublasStrsmBatched mublasStrsmBatched
#define cublasComputeType_t cudaDataType_t
#define cublasCreate mublasCreate
#define cublasDestroy mublasDestroy

View File

@@ -73,15 +73,15 @@ static float rope_yarn_ramp(const float low, const float high, const int i0) {
return (1 - MIN(1, MAX(0, y)));
}
static void rope_cache_init(const float theta_base,
const float freq_scale,
const float * freq_factors,
float * corr_dims,
const uint32_t ne0,
const float ext_factor,
const float mscale,
float * cache,
const float theta_scale) {
// ref: https://github.com/jquesnelle/yarn/blob/master/scaled_rope/LlamaYaRNScaledRotaryEmbedding.py
float theta = theta_base;
@@ -92,18 +92,19 @@ static void rope_cache_init(const float theta_base,
// Get n-d rotational scaling corrected for extrapolation
float theta_interp = freq_scale * theta_extrap;
float theta_final = theta_interp;
float mscale_final = mscale;
if (ext_factor != 0.0f) {
float ramp_mix = rope_yarn_ramp(corr_dims[0], corr_dims[1], i0) * ext_factor;
theta_final = theta_interp * (1 - ramp_mix) + theta_extrap * ramp_mix;
// Get n-d magnitude scaling corrected for interpolation
mscale_final *= 1.0f + 0.1f * logf(1.0f / freq_scale);
}
cache[i0 + 0] = cosf(theta_final) * mscale_final;
cache[i0 + 1] = sinf(theta_final) * mscale_final;
theta *= theta_scale;
}
@@ -151,9 +152,9 @@ static void init_rope_ctx(struct rope_th_ctx * rope_ctx, struct htp_ops_context
}
static void hvx_calc_rope_neox_f32(const float * restrict src0,
float * restrict dst,
const int num_elems,
const float * restrict theta_cache) {
// for (int i = 0; i < num_elems; i += 2) {
//const float cos_theta = theta_cache[i + 0];
//const float sin_theta = theta_cache[i + 1];
@@ -192,7 +193,7 @@ static void hvx_calc_rope_neox_f32(const float * restrict src0,
HVX_Vector v4 = Q6_Vqf32_vsub_Vqf32Vqf32(vx0_c, vx1_s);
HVX_Vector v5 = Q6_Vqf32_vadd_Vqf32Vqf32(vx0_s, vx1_c);
*(HVX_Vector *) dst_curr = Q6_Vsf_equals_Vqf32(v4);
*(HVX_Vector *) (dst_curr + half_size) = Q6_Vsf_equals_Vqf32(v5);
src0_curr += VLEN;
@@ -259,7 +260,7 @@ static void rope_hex_f32(struct rope_th_ctx * rope_ctx,
const uint32_t ir1,
int nth,
int ith,
const int opt_path) {
struct htp_ops_context * octx = rope_ctx->octx;
const struct htp_tensor * src0 = &octx->src0;
@@ -267,8 +268,8 @@ static void rope_hex_f32(struct rope_th_ctx * rope_ctx,
const struct htp_tensor * src2 = &octx->src2;
struct htp_tensor * dst = &octx->dst;
const int32_t mode = rope_ctx->mode;
const bool is_neox = mode & HTP_ROPE_TYPE_NEOX;
htp_rope_preamble;
@@ -281,8 +282,9 @@ static void rope_hex_f32(struct rope_th_ctx * rope_ctx,
freq_factors = (const float *) src2->data;
}
const uint32_t i1_end = MIN(ir1, ne1);
const int32_t half_dims = rope_ctx->n_dims / 2;
const size_t remain_bytes = (ne0 - rope_ctx->n_dims) * sizeof(float);
for (uint32_t i3 = 0; i3 < ne3; i3++) { // batch
for (uint32_t i2 = 0; i2 < ne2; i2++) { // seq-len
const int32_t p = pos[i2];
@@ -290,14 +292,7 @@ static void rope_hex_f32(struct rope_th_ctx * rope_ctx,
rope_cache_init(p, rope_ctx->freq_scale, freq_factors, rope_ctx->corr_dims, ne0, rope_ctx->ext_factor,
rope_ctx->attn_factor, wp0, rope_ctx->theta_scale);
for (uint32_t i1 = ir0; i1 < i1_end; i1++) { // attn-heads
const float * src = (float *) ((char *) src0->data + i3 * nb03 + i2 * nb02 + i1 * nb01);
float * dst_data = (float *) ((char *) dst->data + i3 * nb3 + i2 * nb2 + i1 * nb1);
@@ -310,6 +305,9 @@ static void rope_hex_f32(struct rope_th_ctx * rope_ctx,
} else {
hvx_calc_rope_f32(src_loc, dst_data_loc, rope_ctx->n_dims, wp0);
}
src_loc += rope_ctx->n_dims;
dst_data_loc += rope_ctx->n_dims;
} else {
for (uint32_t i0 = 0; i0 < rope_ctx->n_dims; i0 += 2) {
const float cos_theta = wp0[i0 + 0];
@@ -317,10 +315,10 @@ static void rope_hex_f32(struct rope_th_ctx * rope_ctx,
if (is_neox) {
const float x0 = src_loc[0];
const float x1 = src_loc[half_dims];
dst_data_loc[0] = x0 * cos_theta - x1 * sin_theta;
dst_data_loc[half_dims] = x0 * sin_theta + x1 * cos_theta;
src_loc += 1;
dst_data_loc += 1;
@@ -335,15 +333,13 @@ static void rope_hex_f32(struct rope_th_ctx * rope_ctx,
dst_data_loc += 2;
}
}
src_loc += (is_neox ? half_dims : 0);
dst_data_loc += (is_neox ? half_dims : 0);
}
// TODO: use simd to speed up the remaining elements copy
memcpy(dst_data_loc, src_loc, remain_bytes);
}
}
}

View File

@@ -411,6 +411,38 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_ssm_conv(ggml_me
return res;
}
ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_ssm_conv_batched(ggml_metal_library_t lib, const ggml_tensor * op, int ssm_conv_bs) {
GGML_ASSERT(op->src[0]->type == GGML_TYPE_F32);
GGML_ASSERT(op->src[1]->type == GGML_TYPE_F32);
GGML_ASSERT(ggml_is_contiguous(op->src[0]));
GGML_ASSERT(ggml_is_contiguous(op->src[1]));
char base[256];
char name[256];
const char * suffix = "";
if (op->src[1]->ne[0] % 4 == 0) {
suffix = "_4";
}
snprintf(base, 256, "kernel_ssm_conv_%s_%s_batched%s", ggml_type_name(op->src[0]->type), ggml_type_name(op->src[1]->type), suffix);
snprintf(name, 256, "%s_ssm_conv_bs=%d", base, ssm_conv_bs);
ggml_metal_pipeline_with_params res = ggml_metal_library_get_pipeline(lib, name);
if (!res.pipeline) {
ggml_metal_cv_t cv = ggml_metal_cv_init();
ggml_metal_cv_set_int16(cv, ssm_conv_bs, FC_SSM_CONV + 0);
res = ggml_metal_library_compile_pipeline(lib, base, name, cv);
ggml_metal_cv_free(cv);
}
return res;
}
ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_ssm_scan(ggml_metal_library_t lib, const ggml_tensor * op) {
GGML_TENSOR_LOCALS( int32_t, ne0, op->src[0], ne);
@@ -427,7 +459,12 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_ssm_scan(ggml_me
res = ggml_metal_library_compile_pipeline(lib, base, name, nullptr);
}
// Shared memory layout:
// - sgptg * NW floats for partial sums (nsg * 32)
// - sgptg floats for shared_x_dt (nsg)
// - sgptg floats for shared_dA (nsg)
// Total: nsg * (32 + 2) floats
res.smem = (32 + 2)*sizeof(float)*nsg;
return res;
}
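The layout comment above pins the size formula down; concretely, for nsg = 4 simdgroups the reservation is 4 * (32 + 2) * sizeof(float) = 544 bytes: 512 for the partial sums plus 16 each for shared_x_dt and shared_dA. A quick check of the formula (nsg values are illustrative):

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    for (int nsg : { 2, 4, 8 }) {
        const size_t smem = (32 + 2) * sizeof(float) * nsg;
        printf("nsg=%d -> smem=%zu bytes (%d partial-sum + %d x_dt + %d dA floats)\n",
               nsg, smem, 32 * nsg, nsg, nsg);
    }
    return 0;
}
```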

View File

@@ -117,6 +117,7 @@ struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_cumsum_ad
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_tri (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_soft_max (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_ssm_conv (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_ssm_conv_batched (ggml_metal_library_t lib, const struct ggml_tensor * op, int ssm_conv_bs);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_ssm_scan (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_rwkv (ggml_metal_library_t lib, const struct ggml_tensor * op);
struct ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_mul_mv_ext (ggml_metal_library_t lib, enum ggml_type tsrc0, enum ggml_type tsrc1, int nsg, int nxpsg, int r1ptg);

View File

@@ -77,6 +77,7 @@
#define FC_MUL_MV 600
#define FC_MUL_MM 700
#define FC_ROPE 800
#define FC_SSM_CONV 900
// op-specific constants
#define OP_FLASH_ATTN_EXT_NQPTG 8

View File

@@ -221,7 +221,7 @@ static int ggml_metal_op_encode_impl(ggml_metal_op_t ctx, int idx) {
}
if (ctx->debug_graph > 0) {
GGML_LOG_DEBUG("%s: node[%5d] - %-12s %s\n", __func__, idx, ggml_op_name(node->op), is_concurrent ? "(concurrent)" : "");
GGML_LOG_DEBUG("%s: node[%5d] - %-12s %-12s %s\n", __func__, idx, ggml_op_name(node->op), ggml_get_name(node), is_concurrent ? "(concurrent)" : "");
}
if (ctx->debug_graph > 1) {
GGML_TENSOR_LOCALS( int64_t, ne0, node->src[0], ne);
@@ -1365,15 +1365,43 @@ int ggml_metal_op_ssm_conv(ggml_metal_op_t ctx, int idx) {
/*.nb2 =*/ nb2,
};
// Use batched kernel for prefill (ne1 > 1) to reduce threadgroup dispatch overhead
const bool use_batched = (ne1 > 1);
if (use_batched) {
// Choose a power-of-two batch size based on ne1, capped at 256 (ne1 <= 4 falls back to a batch of 2)
int BATCH_SIZE;
if (ne1 > 128) BATCH_SIZE = 256;
else if (ne1 > 64 ) BATCH_SIZE = 128;
else if (ne1 > 32 ) BATCH_SIZE = 64;
else if (ne1 > 16 ) BATCH_SIZE = 32;
else if (ne1 > 8 ) BATCH_SIZE = 16;
else if (ne1 > 4 ) BATCH_SIZE = 8;
else BATCH_SIZE = 2;
auto pipeline = ggml_metal_library_get_pipeline_ssm_conv_batched(lib, op, BATCH_SIZE);
ggml_metal_encoder_set_pipeline(enc, pipeline);
ggml_metal_encoder_set_bytes(enc, &args, sizeof(args), 0);
ggml_metal_encoder_set_buffer(enc, ggml_metal_get_buffer_id(op->src[0]), 1);
ggml_metal_encoder_set_buffer(enc, ggml_metal_get_buffer_id(op->src[1]), 2);
ggml_metal_encoder_set_buffer(enc, ggml_metal_get_buffer_id(op), 3);
// Dispatch: ne01 rows, ceil(ne1/BATCH_SIZE) token batches, ne02 sequences
// Each threadgroup has BATCH_SIZE threads, each handling one token
const int n_token_batches = (ne1 + BATCH_SIZE - 1) / BATCH_SIZE;
ggml_metal_encoder_dispatch_threadgroups(enc, ne01, n_token_batches, ne02, BATCH_SIZE, 1, 1);
} else {
auto pipeline = ggml_metal_library_get_pipeline_ssm_conv(lib, op);
ggml_metal_encoder_set_pipeline(enc, pipeline);
ggml_metal_encoder_set_bytes(enc, &args, sizeof(args), 0);
ggml_metal_encoder_set_buffer(enc, ggml_metal_get_buffer_id(op->src[0]), 1);
ggml_metal_encoder_set_buffer(enc, ggml_metal_get_buffer_id(op->src[1]), 2);
ggml_metal_encoder_set_buffer(enc, ggml_metal_get_buffer_id(op), 3);
ggml_metal_encoder_dispatch_threadgroups(enc, ne01, ne1, ne02, 1, 1, 1);
}
return 1;
}
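For intuition, the batch-size ladder and threadgroup count above can be restated as a small host-side sketch (plain C++, function name hypothetical):

```cpp
#include <cstdio>

// Mirrors the BATCH_SIZE ladder and the ceil-divide used by the encoder above.
static void plan_ssm_conv_dispatch(int ne1) {
    int batch_size;
    if      (ne1 > 128) batch_size = 256;
    else if (ne1 >  64) batch_size = 128;
    else if (ne1 >  32) batch_size =  64;
    else if (ne1 >  16) batch_size =  32;
    else if (ne1 >   8) batch_size =  16;
    else if (ne1 >   4) batch_size =   8;
    else                batch_size =   2;
    const int n_token_batches = (ne1 + batch_size - 1) / batch_size; // ceil(ne1/batch_size)
    printf("ne1=%d -> BATCH_SIZE=%d, token batches=%d\n", ne1, batch_size, n_token_batches);
}
```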

View File

@@ -2343,7 +2343,102 @@ kernel void kernel_ssm_conv_f32_f32_4(
x[0] = sumf;
}
constant short FC_ssm_conv_bs [[function_constant(FC_SSM_CONV + 0)]];
// Batched version: each threadgroup processes multiple tokens for better efficiency
// Thread layout: each thread handles one token, threadgroup covers BATCH_SIZE tokens
kernel void kernel_ssm_conv_f32_f32_batched(
constant ggml_metal_kargs_ssm_conv & args,
device const void * src0,
device const void * src1,
device float * dst,
uint3 tgpig[[threadgroup_position_in_grid]],
uint3 tpitg[[thread_position_in_threadgroup]],
uint3 ntg[[threads_per_threadgroup]]) {
// tgpig.x = row index (ir)
// tgpig.y = batch of tokens (i2_base / BATCH_SIZE)
// tgpig.z = sequence index (i3)
// tpitg.x = thread within batch (0..BATCH_SIZE-1)
const short BATCH_SIZE = FC_ssm_conv_bs;
const int64_t ir = tgpig.x;
const int64_t i2_base = tgpig.y * BATCH_SIZE;
const int64_t i3 = tgpig.z;
const int64_t i2_off = tpitg.x;
const int64_t i2 = i2_base + i2_off;
const int64_t nc = args.ne10; // conv kernel size (typically 4)
const int64_t n_t = args.ne1; // number of tokens
// Bounds check for partial batches at the end
if (i2 >= n_t) {
return;
}
// Load conv weights (shared across all tokens for this row)
device const float * c = (device const float *) ((device const char *) src1 + ir*args.nb11);
// Load source for this specific token
device const float * s = (device const float *) ((device const char *) src0 + ir*args.nb01 + i2*args.nb00 + i3*args.nb02);
// Output location for this token
device float * x = (device float *) ((device char *) dst + ir*args.nb0 + i2*args.nb1 + i3*args.nb2);
float sumf = 0.0f;
for (int64_t i0 = 0; i0 < nc; ++i0) {
sumf += s[i0] * c[i0];
}
x[0] = sumf;
}
kernel void kernel_ssm_conv_f32_f32_batched_4(
constant ggml_metal_kargs_ssm_conv & args,
device const void * src0,
device const void * src1,
device float * dst,
uint3 tgpig[[threadgroup_position_in_grid]],
uint3 tpitg[[thread_position_in_threadgroup]],
uint3 ntg[[threads_per_threadgroup]]) {
// tgpig.x = row index (ir)
// tgpig.y = batch of tokens (i2_base / BATCH_SIZE)
// tgpig.z = sequence index (i3)
// tpitg.x = thread within batch (0..BATCH_SIZE-1)
const short BATCH_SIZE = FC_ssm_conv_bs;
const int64_t ir = tgpig.x;
const int64_t i2_base = tgpig.y * BATCH_SIZE;
const int64_t i3 = tgpig.z;
const int64_t i2_off = tpitg.x;
const int64_t i2 = i2_base + i2_off;
const int64_t nc = args.ne10; // conv kernel size (typically 4)
const int64_t n_t = args.ne1; // number of tokens
// Bounds check for partial batches at the end
if (i2 >= n_t) {
return;
}
// Load conv weights (shared across all tokens for this row)
device const float4 * c = (device const float4 *) ((device const char *) src1 + ir*args.nb11);
// Load source for this specific token
device const float4 * s = (device const float4 *) ((device const char *) src0 + ir*args.nb01 + i2*args.nb00 + i3*args.nb02);
// Output location for this token
device float * x = (device float *) ((device char *) dst + ir*args.nb0 + i2*args.nb1 + i3*args.nb2);
float sumf = 0.0f;
for (int64_t i0 = 0; i0 < nc/4; ++i0) {
sumf += dot(s[i0], c[i0]);
}
x[0] = sumf;
}
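Each thread in these batched kernels produces exactly one output element: the dot product of its token's source window with the row's conv weights. A minimal CPU reference of that per-thread work (a sketch; addressing and strides elided):

```cpp
// Reference for one (row ir, token i2, sequence i3) output element:
// the sum over the kernel taps of source window times conv weights.
static float ssm_conv_one(const float * s, const float * c, int nc) {
    float sumf = 0.0f;
    for (int i0 = 0; i0 < nc; ++i0) {
        sumf += s[i0] * c[i0]; // nc is the conv kernel size, typically 4
    }
    return sumf;
}
```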
// ref: ggml.c:ggml_compute_forward_ssm_scan_f32, Mamba-2 part
// Optimized version: reduces redundant memory loads by having one thread load shared values
kernel void kernel_ssm_scan_f32(
constant ggml_metal_kargs_ssm_scan & args,
device const void * src0,
@@ -2363,7 +2458,15 @@ kernel void kernel_ssm_scan_f32(
uint3 tgpg[[threadgroups_per_grid]]) {
constexpr short NW = N_SIMDWIDTH;
// Shared memory layout:
// [0..sgptg*NW-1]: partial sums for reduction (existing)
// [sgptg*NW..sgptg*NW+sgptg-1]: pre-computed x_dt values for each token in batch
// [sgptg*NW+sgptg..sgptg*NW+2*sgptg-1]: pre-computed dA values for each token in batch
threadgroup float * shared_sums = shared;
threadgroup float * shared_x_dt = shared + sgptg * NW;
threadgroup float * shared_dA = shared + sgptg * NW + sgptg;
shared_sums[tpitg.x] = 0.0f;
const int32_t i0 = tpitg.x;
const int32_t i1 = tgpig.x;
@@ -2403,32 +2506,47 @@ kernel void kernel_ssm_scan_f32(
for (int i2 = 0; i2 < n_t; i2 += sgptg) {
threadgroup_barrier(mem_flags::mem_threadgroup);
// Pre-compute x_dt and dA for this batch of tokens
// Only first sgptg threads do the loads and expensive math
if (i0 < sgptg && i2 + i0 < n_t) {
// ns12 and ns21 are element strides (nb12/nb10, nb21/nb20)
device const float * x_t = x + i0 * args.ns12;
device const float * dt_t = dt + i0 * args.ns21;
const float dt0 = dt_t[0];
const float dtsp = dt0 <= 20.0f ? log(1.0f + exp(dt0)) : dt0;
shared_x_dt[i0] = x_t[0] * dtsp;
shared_dA[i0] = dtsp; // Store dtsp, compute exp(dtsp * A0) per-thread since A0 varies
}
threadgroup_barrier(mem_flags::mem_threadgroup);
for (int t = 0; t < sgptg && i2 + t < n_t; t++) {
const float x_dt = shared_x_dt[t];
const float dA = exp(shared_dA[t] * A0);
s = (s0 * dA) + (B[i0] * x_dt);
const float sumf = simd_sum(s * C[i0]);
if (tiisg == 0) {
shared_sums[t*NW + sgitg] = sumf;
}
// recurse
s0 = s;
B += args.ns42;
C += args.ns52;
}
// Advance pointers for next batch
x += sgptg * args.ns12;
dt += sgptg * args.ns21;
threadgroup_barrier(mem_flags::mem_threadgroup);
const float sumf = simd_sum(shared_sums[sgitg*NW + tiisg]);
if (tiisg == 0 && i2 + sgitg < n_t) {
y[sgitg*nh*nr] = sumf;
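The `dtsp` expression above is the usual numerically stable softplus: the exact form is used for small inputs, while for large inputs softplus(x) ≈ x and `exp(x)` would overflow. A scalar sketch of the same guard:

```cpp
#include <cmath>

// Stable softplus as used for dt in the scan: exact form for small x,
// identity for large x where exp(x) would overflow and log(1+exp(x)) ~= x.
static float softplus(float x) {
    return x <= 20.0f ? std::log(1.0f + std::exp(x)) : x;
}
```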

View File

@@ -2,6 +2,13 @@
#include "dequantize.hpp"
#include "presets.hpp"
#if defined(__INTEL_LLVM_COMPILER)
#if __has_include(<sycl/ext/oneapi/bfloat16.hpp>)
#include <sycl/ext/oneapi/bfloat16.hpp>
#define GGML_SYCL_HAS_BF16
#endif
#endif
template <int qk, int qr, dequantize_kernel_t dequantize_kernel, typename dst_t>
static void dequantize_block(const void * __restrict__ vx, dst_t * __restrict__ y, const int64_t k,
const sycl::nd_item<3> &item_ct1) {
@@ -566,6 +573,10 @@ to_fp16_sycl_t ggml_get_to_fp16_sycl(ggml_type type, ggml_tensor * dst) {
return dequantize_row_iq4_nl_sycl;
case GGML_TYPE_F32:
return convert_unary_sycl<float>;
#ifdef GGML_SYCL_HAS_BF16
case GGML_TYPE_BF16:
return convert_unary_sycl<sycl::ext::oneapi::bfloat16>;
#endif
default:
return nullptr;
}
@@ -627,6 +638,10 @@ to_fp32_sycl_t ggml_get_to_fp32_sycl(ggml_type type, ggml_tensor *dst) {
return dequantize_row_iq4_nl_sycl;
case GGML_TYPE_F16:
return convert_unary_sycl<sycl::half>;
#ifdef GGML_SYCL_HAS_BF16
case GGML_TYPE_BF16:
return convert_unary_sycl<sycl::ext::oneapi::bfloat16>;
#endif
default:
return nullptr;
}
@@ -636,6 +651,10 @@ to_fp16_nc_sycl_t get_to_fp16_nc_sycl(ggml_type type) {
switch (type) {
case GGML_TYPE_F32:
return convert_unary_nc_sycl<float>;
#ifdef GGML_SYCL_HAS_BF16
case GGML_TYPE_BF16:
return convert_unary_nc_sycl<sycl::ext::oneapi::bfloat16>;
#endif
default:
return nullptr;
}
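For background: bf16 is simply the top 16 bits of an IEEE-754 float (sign, exponent, and 7 mantissa bits), which is why a dedicated conversion path suffices. A host-side sketch of the conversion, independent of the SYCL `bfloat16` type used above (truncating for brevity; production code typically rounds to nearest even):

```cpp
#include <cstdint>
#include <cstring>

// Truncating float -> bf16: keep sign + exponent + top 7 mantissa bits.
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return (uint16_t)(bits >> 16);
}

// bf16 -> float is exact: place the 16 bits back in the high half.
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t) h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```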

View File

@@ -277,7 +277,7 @@ static void soft_max_f32_sycl(const float *x, const T *mask,
const int id = get_current_device_id();
const size_t smpbo = ggml_sycl_info().devices[id].smpbo;
if (nbytes_shared <= smpbo && ncols_x <= max_block_size) {
launch_soft_max_kernels<32, 64, 128, 256, 512, 1024, 2048, 4096>(
x, mask, sinks, dst, params, stream, block_dims, block_nums,
nbytes_shared);

View File

@@ -757,7 +757,8 @@ struct vk_device_struct {
vk_pipeline pipeline_flash_attn_split_k_reduce;
// [2] is for whether to take n_experts from spec constant (0) or push constant (1)
vk_pipeline pipeline_topk_moe[num_topk_moe_pipelines][TOPK_MOE_COUNT][2];
std::vector<vk_pipeline_ref> all_pipelines;
@@ -777,11 +778,6 @@ struct vk_device_struct {
std::unique_ptr<vk_memory_logger> memory_logger;
#endif
~vk_device_struct() {
VK_LOG_DEBUG("destroy device " << name);
@@ -1154,6 +1150,7 @@ static_assert(sizeof(vk_op_multi_add_push_constants) <= 256);
struct vk_op_topk_moe_push_constants {
uint32_t n_rows;
uint32_t n_experts_push;
uint32_t n_expert_used;
float clamp_min;
float clamp_max;
@@ -1523,12 +1520,21 @@ private:
#define VK_LOG_MEMORY(msg) ((void) 0)
#endif // GGML_VULKAN_MEMORY_DEBUG
static bool vk_perf_logger_enabled = false;
// number of calls between perf logger prints
static uint32_t vk_perf_logger_frequency = 1;
class vk_perf_logger {
public:
void print_timings(bool force = false) {
if (timings.empty()) {
return;
}
print_count++;
if ((print_count % vk_perf_logger_frequency) != 0 && !force) {
return;
}
print_count = 0;
uint64_t total_all_op_times = 0;
std::cerr << "----------------\nVulkan Timings:" << std::endl;
for (const auto & t : timings) {
@@ -1565,16 +1571,20 @@ class vk_perf_logger {
flops.clear();
}
void log_timing(const ggml_tensor * node, const char *fusion_name, uint64_t time) {
std::string fusion_str;
if (fusion_name) {
fusion_str = fusion_name + std::string(" ");
}
if (node->op == GGML_OP_UNARY) {
timings[fusion_str + ggml_unary_op_name(ggml_get_unary_op(node))].push_back(time);
return;
}
if (node->op == GGML_OP_MUL_MAT || node->op == GGML_OP_MUL_MAT_ID) {
const uint64_t m = node->ne[0];
const uint64_t n = node->ne[1];
const uint64_t k = node->src[1]->ne[0];
const uint64_t batch = node->ne[2] * node->ne[3];
std::string name = ggml_op_name(node->op);
if ((node->op == GGML_OP_MUL_MAT && n <= mul_mat_vec_max_cols) ||
(node->op == GGML_OP_MUL_MAT_ID && node->src[2]->ne[1] == 1)) {
@@ -1583,9 +1593,13 @@ class vk_perf_logger {
name += " ";
name += ggml_type_name(node->src[0]->type);
name += " m=" + std::to_string(m) + " n=" + std::to_string(n) + " k=" + std::to_string(k);
if (node->op == GGML_OP_MUL_MAT_ID) {
name += " n_expert=" + std::to_string(node->src[0]->ne[2]);
}
if (batch > 1) {
name += " batch=" + std::to_string(batch);
}
name = fusion_str + name;
timings[name].push_back(time);
flops[name].push_back(m * n * (k + (k - 1)) * batch);
return;
@@ -1607,6 +1621,7 @@ class vk_perf_logger {
uint64_t n_flops = size_M * size_N * (size_K + (size_K - 1));
name += " M=Cout=" + std::to_string(size_M) + ", K=Cin*KW*KH=" + std::to_string(size_K) +
", N=N*OW*OH=" + std::to_string(size_N);
name = fusion_str + name;
flops[name].push_back(n_flops);
timings[name].push_back(time);
return;
@@ -1614,6 +1629,7 @@ class vk_perf_logger {
if (node->op == GGML_OP_RMS_NORM) {
std::string name = ggml_op_name(node->op);
name += "(" + std::to_string(node->ne[0]) + "," + std::to_string(node->ne[1]) + "," + std::to_string(node->ne[2]) + "," + std::to_string(node->ne[3]) + ")";
name = fusion_str + name;
timings[name].push_back(time);
return;
}
@@ -1624,6 +1640,7 @@ class vk_perf_logger {
const ggml_tensor * v = node->src[2];
const ggml_tensor * m = node->src[3];
std::stringstream name;
name << fusion_str;
name << ggml_op_name(node->op) <<
" dst(" << dst->ne[0] << "," << dst->ne[1] << "," << dst->ne[2] << "," << dst->ne[3] << "), " <<
" q(" << q->ne[0] << "," << q->ne[1] << "," << q->ne[2] << "," << q->ne[3] << "), " <<
@@ -1635,17 +1652,19 @@ class vk_perf_logger {
}
if (node->op == GGML_OP_TOP_K) {
std::stringstream name;
name << fusion_str;
name << ggml_op_name(node->op) <<
" K=" << node->ne[0] <<
" (" << node->src[0]->ne[0] << "," << node->src[0]->ne[1] << "," << node->src[0]->ne[2] << "," << node->src[0]->ne[3] << ")";
timings[name.str()].push_back(time);
return;
}
timings[fusion_str + ggml_op_name(node->op)].push_back(time);
}
private:
std::map<std::string, std::vector<uint64_t>> timings;
std::map<std::string, std::vector<uint64_t>> flops;
uint32_t print_count {};
};
struct ggml_backend_vk_context {
@@ -1699,6 +1718,14 @@ struct ggml_backend_vk_context {
// Bit 'i' means nodes[start_of_fusion + i] writes to memory.
// If there's no fusion, bit 0 is still set.
int fused_ops_write_mask {};
// for GGML_VK_PERF_LOGGER
std::unique_ptr<vk_perf_logger> perf_logger;
vk::QueryPool query_pool;
std::vector<const char *> query_fusion_names;
std::vector<ggml_tensor *> query_nodes;
int32_t num_queries {};
int32_t query_idx {};
};
static void * const vk_ptr_base = (void *)(uintptr_t) 0x1000; // NOLINT
@@ -1824,8 +1851,6 @@ struct vk_instance_t {
static bool vk_instance_initialized = false;
static vk_instance_t vk_instance;
#ifdef GGML_VULKAN_CHECK_RESULTS
static size_t vk_skip_checks;
static size_t vk_output_tensor;
@@ -4181,10 +4206,12 @@ static void ggml_vk_load_shaders(vk_device& device) {
ggml_vk_create_pipeline(device, device->pipeline_conv2d_dw_whcn_f16_f32, "conv2d_dw_whcn_f16_f32", conv2d_dw_whcn_f16_f32_len, conv2d_dw_whcn_f16_f32_data, "main", 3, sizeof(vk_op_conv2d_dw_push_constants), {512, 1, 1}, {}, 1);
ggml_vk_create_pipeline(device, device->pipeline_conv2d_dw_cwhn_f16_f32, "conv2d_dw_cwhn_f16_f32", conv2d_dw_cwhn_f16_f32_len, conv2d_dw_cwhn_f16_f32_data, "main", 3, sizeof(vk_op_conv2d_dw_push_constants), {512, 1, 1}, {}, 1);
for (uint32_t use_push = 0; use_push < 2; ++use_push) {
for (uint32_t i = 0; i < num_topk_moe_pipelines; ++i) {
ggml_vk_create_pipeline2(device, device->pipeline_topk_moe[i][TOPK_MOE_EARLY_SOFTMAX][use_push], "topk_moe_f32_early_softmax_"+std::to_string(i), topk_moe_f32_len, topk_moe_f32_data, "main", 3, sizeof(vk_op_topk_moe_push_constants), {1, 1, 1}, {device->subgroup_size, 1u<<i, 0, 0, use_push}, 1, true, true, device->subgroup_size);
ggml_vk_create_pipeline2(device, device->pipeline_topk_moe[i][TOPK_MOE_EARLY_SOFTMAX_NORM][use_push], "topk_moe_f32_early_softmax_norm"+std::to_string(i), topk_moe_f32_len, topk_moe_f32_data, "main", 3, sizeof(vk_op_topk_moe_push_constants), {1, 1, 1}, {device->subgroup_size, 1u<<i, 1, 0, use_push}, 1, true, true, device->subgroup_size);
ggml_vk_create_pipeline2(device, device->pipeline_topk_moe[i][TOPK_MOE_LATE_SOFTMAX][use_push], "topk_moe_f32_late_softmax"+std::to_string(i), topk_moe_f32_len, topk_moe_f32_data, "main", 3, sizeof(vk_op_topk_moe_push_constants), {1, 1, 1}, {device->subgroup_size, 1u<<i, 0, 1, use_push}, 1, true, true, device->subgroup_size);
}
}
for (auto &c : compiles) {
@@ -4205,9 +4232,6 @@ static vk_device ggml_vk_get_device(size_t idx) {
#ifdef GGML_VULKAN_MEMORY_DEBUG
device->memory_logger = std::unique_ptr<vk_memory_logger>(new vk_memory_logger());
#endif
size_t dev_num = vk_instance.device_indices[idx];
@@ -5153,6 +5177,11 @@ static void ggml_vk_instance_init() {
}
vk_perf_logger_enabled = getenv("GGML_VK_PERF_LOGGER") != nullptr;
const char* GGML_VK_PERF_LOGGER_FREQUENCY = getenv("GGML_VK_PERF_LOGGER_FREQUENCY");
if (GGML_VK_PERF_LOGGER_FREQUENCY != nullptr) {
vk_perf_logger_frequency = std::stoul(GGML_VK_PERF_LOGGER_FREQUENCY);
}
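Usage is environment-driven; for example (binary name illustrative), to print aggregated timings every 100 graph computes instead of every one:

```console
$ GGML_VK_PERF_LOGGER=1 GGML_VK_PERF_LOGGER_FREQUENCY=100 ./llama-bench -m model.gguf
```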
// See https://github.com/KhronosGroup/Vulkan-Hpp?tab=readme-ov-file#extensions--per-device-function-pointers-
VULKAN_HPP_DEFAULT_DISPATCHER.init(vk_instance.instance);
@@ -5330,6 +5359,10 @@ static void ggml_vk_init(ggml_backend_vk_context * ctx, size_t idx) {
ctx->compute_cmd_pool.init(ctx->device, &ctx->device->compute_queue);
ctx->transfer_cmd_pool.init(ctx->device, &ctx->device->transfer_queue);
if (vk_perf_logger_enabled) {
ctx->perf_logger = std::unique_ptr<vk_perf_logger>(new vk_perf_logger());
}
#ifdef GGML_VULKAN_CHECK_RESULTS
const char* skip_checks = getenv("GGML_VULKAN_SKIP_CHECKS");
vk_skip_checks = (skip_checks == NULL ? 0 : atoi(skip_checks));
@@ -8525,7 +8558,9 @@ static vk_pipeline ggml_vk_op_get_pipeline(ggml_backend_vk_context * ctx, const
uint32_t idx = (uint32_t)ceilf(log2f(float(dst->ne[0])));
GGML_ASSERT(idx < num_topk_moe_pipelines);
topk_moe_mode mode = ggml_vk_num_additional_ops_to_topk_moe_mode(ctx->num_additional_fused_ops);
// use n_experts from push constant if it's not equal to the power of two spec constant
bool use_push = dst->ne[0] != (1u << idx);
return ctx->device->pipeline_topk_moe[idx][mode][use_push];
}
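A compact restatement of the selection above (a sketch under the same assumptions; the `num_topk_moe_pipelines` bound check is elided): round the expert count up to a power of two to pick the pipeline, and use the push-constant variant whenever the true count is not exactly that power of two:

```cpp
#include <cmath>
#include <cstdint>

struct topk_moe_choice { uint32_t idx; bool use_push; };

static topk_moe_choice choose_topk_moe_pipeline(uint32_t n_experts) {
    const uint32_t idx = (uint32_t) std::ceil(std::log2((float) n_experts));
    return { idx, n_experts != (1u << idx) }; // non-pow2 counts read n_experts from the push constant
}
```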
if (src0->type == GGML_TYPE_F32 && (src1 == nullptr || src1->type == GGML_TYPE_F32) && dst->type == GGML_TYPE_F32) {
@@ -10129,6 +10164,7 @@ static void ggml_vk_topk_moe(ggml_backend_vk_context * ctx, vk_context& subctx,
vk_op_topk_moe_push_constants pc {};
pc.n_rows = n_rows;
pc.n_experts_push = n_experts;
pc.n_expert_used = n_expert_used;
if (mode == TOPK_MOE_EARLY_SOFTMAX_NORM) {
ggml_tensor * clamp = cgraph->nodes[node_idx + 7];
@@ -12205,6 +12241,9 @@ static void ggml_vk_cleanup(ggml_backend_vk_context * ctx) {
ctx->compute_cmd_pool.destroy(ctx->device->device);
ctx->transfer_cmd_pool.destroy(ctx->device->device);
if (vk_perf_logger_enabled) {
ctx->perf_logger->print_timings(true);
}
}
static int ggml_vk_get_device_count() {
@@ -12800,8 +12839,7 @@ static bool ggml_vk_can_fuse_topk_moe(ggml_backend_vk_context * ctx, const struc
}
const int n_expert = softmax->ne[0];
// n_expert must be a power of 2
if (n_expert > (1 << (num_topk_moe_pipelines-1))) {
return false;
}
@@ -13003,24 +13041,29 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
vk_context compute_ctx;
if (vk_perf_logger_enabled) {
// allocate/resize the query pool
if (ctx->num_queries < cgraph->n_nodes + 1) {
if (ctx->query_pool) {
ctx->device->device.destroyQueryPool(ctx->query_pool);
}
vk::QueryPoolCreateInfo query_create_info;
query_create_info.queryType = vk::QueryType::eTimestamp;
query_create_info.queryCount = cgraph->n_nodes + 100;
ctx->query_pool = ctx->device->device.createQueryPool(query_create_info);
ctx->num_queries = query_create_info.queryCount;
ctx->query_fusion_names.resize(ctx->num_queries);
ctx->query_nodes.resize(ctx->num_queries);
}
ctx->device->device.resetQueryPool(ctx->query_pool, 0, cgraph->n_nodes+1);
std::fill(ctx->query_fusion_names.begin(), ctx->query_fusion_names.end(), nullptr);
std::fill(ctx->query_nodes.begin(), ctx->query_nodes.end(), nullptr);
GGML_ASSERT(ctx->compute_ctx.expired());
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
ctx->query_idx = 0;
compute_ctx->s->buffer.writeTimestamp(vk::PipelineStageFlagBits::eAllCommands, ctx->query_pool, ctx->query_idx++);
}
ctx->prealloc_y_last_pipeline_used = nullptr;
@@ -13061,52 +13104,66 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
total_mul_mat_bytes += bytes;
}
const char *fusion_string {};
if (!ctx->device->disable_fusion) {
uint32_t num_adds = ggml_vk_fuse_multi_add(ctx, cgraph, i);
if (num_adds) {
ctx->num_additional_fused_ops = num_adds - 1;
fusion_string = "MULTI_ADD";
} else if (ggml_vk_can_fuse(ctx, cgraph, i, { GGML_OP_MUL_MAT, GGML_OP_ADD, GGML_OP_ADD })) {
ctx->num_additional_fused_ops = 2;
fusion_string = "MUL_MAT_ADD_ADD";
} else if (ggml_vk_can_fuse(ctx, cgraph, i, { GGML_OP_MUL_MAT, GGML_OP_ADD })) {
ctx->num_additional_fused_ops = 1;
fusion_string = "MUL_MAT_ADD";
} else if (ggml_vk_can_fuse(ctx, cgraph, i, { GGML_OP_MUL_MAT_ID, GGML_OP_ADD_ID, GGML_OP_MUL })) {
ctx->num_additional_fused_ops = 2;
fusion_string = "MUL_MAT_ID_ADD_ID_MUL";
} else if (ggml_vk_can_fuse(ctx, cgraph, i, { GGML_OP_MUL_MAT_ID, GGML_OP_ADD_ID })) {
ctx->num_additional_fused_ops = 1;
fusion_string = "MUL_MAT_ID_ADD_ID";
} else if (ggml_vk_can_fuse(ctx, cgraph, i, { GGML_OP_MUL_MAT_ID, GGML_OP_MUL })) {
ctx->num_additional_fused_ops = 1;
fusion_string = "MUL_MAT_ID_MUL";
} else if (ggml_can_fuse_subgraph(cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_ROPE, GGML_OP_VIEW, GGML_OP_SET_ROWS }, { i + 4 }) &&
ggml_check_edges(cgraph, i, rms_norm_mul_rope_view_set_rows_edges) &&
ggml_vk_can_fuse_rms_norm_mul_rope(ctx, cgraph, i) &&
ggml_vk_can_fuse_rope_set_rows(ctx, cgraph, i + 2)) {
ctx->num_additional_fused_ops = 4;
fusion_string = "RMS_NORM_MUL_ROPE_VIEW_SET_ROWS";
} else if (ggml_vk_can_fuse(ctx, cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_ROPE })&&
ggml_vk_can_fuse_rms_norm_mul_rope(ctx, cgraph, i)) {
ctx->num_additional_fused_ops = 2;
fusion_string = "RMS_NORM_MUL_ROPE";
} else if (ggml_vk_can_fuse(ctx, cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL })) {
ctx->num_additional_fused_ops = 1;
fusion_string = "RMS_NORM_MUL";
} else if (ggml_can_fuse_subgraph(cgraph, i, { GGML_OP_ROPE, GGML_OP_VIEW, GGML_OP_SET_ROWS }, { i + 2 }) &&
ggml_check_edges(cgraph, i, rope_view_set_rows_edges) &&
ggml_vk_can_fuse_rope_set_rows(ctx, cgraph, i)) {
ctx->num_additional_fused_ops = 2;
fusion_string = "ROPE_VIEW_SET_ROWS";
} else if (ggml_can_fuse_subgraph(cgraph, i, topk_moe_early_softmax_norm, { i + 3, i + 9 }) &&
ggml_check_edges(cgraph, i, topk_moe_early_softmax_norm_edges) &&
ggml_vk_can_fuse_topk_moe(ctx, cgraph, i, TOPK_MOE_EARLY_SOFTMAX_NORM)) {
ctx->num_additional_fused_ops = topk_moe_early_softmax_norm.size() - 1;
// view of argsort writes to memory
ctx->fused_ops_write_mask |= 1 << 3;
fusion_string = "TOPK_MOE_EARLY_SOFTMAX_NORM";
} else if (ggml_can_fuse_subgraph(cgraph, i, topk_moe_early_softmax, { i + 3, i + 4 }) &&
ggml_check_edges(cgraph, i, topk_moe_early_softmax_edges) &&
ggml_vk_can_fuse_topk_moe(ctx, cgraph, i, TOPK_MOE_EARLY_SOFTMAX)) {
ctx->num_additional_fused_ops = topk_moe_early_softmax.size() - 1;
// view of argsort writes to memory
ctx->fused_ops_write_mask |= 1 << 3;
fusion_string = "TOPK_MOE_EARLY_SOFTMAX";
} else if (ggml_can_fuse_subgraph(cgraph, i, topk_moe_late_softmax, { i + 1, i + 5 }) &&
ggml_check_edges(cgraph, i, topk_moe_late_softmax_edges) &&
ggml_vk_can_fuse_topk_moe(ctx, cgraph, i, TOPK_MOE_LATE_SOFTMAX)) {
ctx->num_additional_fused_ops = topk_moe_late_softmax.size() - 1;
// view of argsort writes to memory
ctx->fused_ops_write_mask |= 1 << 1;
fusion_string = "TOPK_MOE_LATE_SOFTMAX";
}
}
ctx->fused_ops_write_mask |= 1 << ctx->num_additional_fused_ops;
@@ -13120,7 +13177,7 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
bool enqueued = ggml_vk_build_graph(ctx, cgraph, i, cgraph->nodes[submit_node_idx], submit_node_idx, i + ctx->num_additional_fused_ops >= last_node, almost_ready, submit);
if (vk_perf_logger_enabled && enqueued) {
if (ctx->compute_ctx.expired()) {
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
@@ -13128,10 +13185,9 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
} else {
compute_ctx = ctx->compute_ctx.lock();
}
ctx->query_nodes[ctx->query_idx] = cgraph->nodes[i];
ctx->query_fusion_names[ctx->query_idx] = fusion_string;
compute_ctx->s->buffer.writeTimestamp(vk::PipelineStageFlagBits::eAllCommands, ctx->query_pool, ctx->query_idx++);
}
if (enqueued) {
@@ -13172,14 +13228,14 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
// Get the results and pass them to the logger
std::vector<uint64_t> timestamps(cgraph->n_nodes + 1);
VK_CHECK(ctx->device->device.getQueryPoolResults(ctx->query_pool, 0, ctx->query_idx, (cgraph->n_nodes + 1)*sizeof(uint64_t), timestamps.data(), sizeof(uint64_t), vk::QueryResultFlagBits::e64 | vk::QueryResultFlagBits::eWait), "get timestamp results");
for (int i = 1; i < ctx->query_idx; i++) {
auto node = ctx->query_nodes[i];
auto name = ctx->query_fusion_names[i];
ctx->perf_logger->log_timing(node, name, uint64_t((timestamps[i] - timestamps[i-1]) * ctx->device->properties.limits.timestampPeriod));
}
ctx->perf_logger->print_timings();
}
if (!ctx->device->support_async) {

View File

@@ -7,35 +7,85 @@ layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;
FLOAT_TYPE temp[NUM_COLS][NUM_ROWS];
void calc_superblock(const uint a_offset, const uint b_offset, const uint ib32, const uint i,
const uint num_blocks_per_row, const uint first_row, const uint num_rows) {
// Compute starting index in matrix B for this superblock
const uint y_idx = i * QUANT_K + 32 * ib32;
uint ibi = a_offset / QUANT_K + first_row * num_blocks_per_row + i;
// Precompute indices for quantization lookup tables
const uint qh_base = 2 * ib32;
const uint qs_base = 4 * ib32;
const uint sc_index = ib32 / 2;
const uint sc_shift = 6 * (ib32 & 1);
// Loop over rows in the superblock
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
// Load per-block scales and shift for quantization
const uint16_t[4] scales = data_a[ibi].scales;
const u16vec4 s = u16vec4(scales[0], scales[1], scales[2], scales[3]) >> 12;
const float d = float(unpackHalf2x16(s.x | (s.y << 4) | (s.z << 8) | (s.w << 12)).x);
const uint sc = data_a[ibi].scales[sc_index] >> sc_shift;
// Temporary caches for decoding
FLOAT_TYPE dl_cache[4];
uint16_t gvf_cache[4];
float delta_cache[4];
// Precompute the multiplier and lookup values for 4 sub-blocks
[[unroll]] for (uint l = 0; l < 4; ++l) {
dl_cache[l] = FLOAT_TYPE(d * (2 * bitfieldExtract(sc, 3 * int(l / 2), 3) + 1));
const uint qh = data_a[ibi].qh[qh_base + l / 2] >> (4 * (l & 1));
const uint qs = data_a[ibi].qs[qs_base + l];
gvf_cache[l] = iq1s_grid[qs | ((qh & 7) << 8)];
delta_cache[l] = ((qh & 8) != 0) ? -IQ1M_DELTA : IQ1M_DELTA;
}
// Loop over columns of the output
[[unroll]] for (uint j = 0; j < NUM_COLS; ++j) {
// Compute base index for matrix B
const uint base_b_idx = (j * p.batch_stride_b + b_offset + y_idx) / 4;
vec4 b_vals[8];
// Load 8 vec4 values from matrix B
[[unroll]] for (int idx = 0; idx < 8; ++idx) {
b_vals[idx] = vec4(data_b_v4[base_b_idx + idx]);
}
FLOAT_TYPE col_sum = FLOAT_TYPE(0.0);
// Loop over sub-blocks
[[unroll]] for (uint l = 0; l < 4; ++l) {
const uint16_t grid = gvf_cache[l];
const float dl = dl_cache[l];
// Decode 8 2-bit fbits from gvf_cache
float f0 = float(bitfieldExtract(grid, 0, 2));
float f1 = float(bitfieldExtract(grid, 2, 2));
float f2 = float(bitfieldExtract(grid, 4, 2));
float f3 = float(bitfieldExtract(grid, 6, 2));
float f4 = float(bitfieldExtract(grid, 8, 2));
float f5 = float(bitfieldExtract(grid, 10, 2));
float f6 = float(bitfieldExtract(grid, 12, 2));
float f7 = float(bitfieldExtract(grid, 14, 2));
// Pack into vec4 for vectorized FMA
const vec4 fbits_v0 = vec4(f0, f1, f2, f3);
const vec4 fbits_v1 = vec4(f4, f5, f6, f7);
const vec4 delta_v = vec4(delta_cache[l]);
// Vectorized fused multiply-add
vec4 sum_v = fma(b_vals[2*l + 0], fbits_v0 + delta_v, vec4(0.0));
sum_v = fma(b_vals[2*l + 1], fbits_v1 + delta_v, sum_v);
// Horizontal add to get scalar sum
FLOAT_TYPE sum = sum_v.x + sum_v.y + sum_v.z + sum_v.w;
// Accumulate to column sum
col_sum = fma(dl, sum, col_sum);
}
// Write result to temporary buffer
temp[j][n] += col_sum;
}
ibi += num_blocks_per_row;
}
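For reference, the scalar decode that the vectorized loop reproduces: each IQ1_M sub-block yields 8 weights of the form dl * (2-bit grid value + ±delta). A minimal host-side sketch (inputs assumed already unpacked, as in the caches above):

```cpp
#include <cstdint>

// Decode one IQ1_M sub-block into 8 floats: out[k] = dl * (grid2bit(k) + delta).
// 'grid' packs eight 2-bit values; 'delta' is +/-IQ1M_DELTA; 'dl' is the block scale.
static void iq1m_decode_subblock(uint16_t grid, float delta, float dl, float out[8]) {
    for (int k = 0; k < 8; ++k) {
        const int q = (grid >> (2 * k)) & 0x3; // 2-bit grid value
        out[k] = dl * ((float) q + delta);
    }
}
```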

View File

@@ -10,6 +10,7 @@
layout (push_constant) uniform parameter
{
uint n_rows;
uint n_experts_push;
uint n_expert_used;
float clamp_min;
float clamp_max;
@@ -18,11 +19,16 @@ layout (push_constant) uniform parameter
layout(local_size_x_id = 0, local_size_y = 4, local_size_z = 1) in;
layout(constant_id = 0) const uint WARP_SIZE = 32;
layout(constant_id = 1) const uint n_experts_spec = 512;
layout(constant_id = 2) const bool with_norm = true;
layout(constant_id = 3) const bool late_softmax = false;
layout(constant_id = 4) const bool nexperts_use_push = false;
uint n_experts = nexperts_use_push ? n_experts_push : n_experts_spec;
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
const uint experts_per_thread = CEIL_DIV(n_experts_spec, WARP_SIZE);
layout (binding = 0, std430) readonly buffer Logits {float logits[];};
layout (binding = 1, std430) writeonly buffer Weights {float weights[];};
@@ -94,7 +100,7 @@ void main() {
}
if (!late_softmax) {
softmax_warp_inplace(wt, n_experts, lane, nexperts_use_push);
}
// at this point, each thread holds a portion of softmax,

View File

@@ -0,0 +1,92 @@
ggml_add_backend_library(ggml-zendnn
ggml-zendnn.cpp)
# Get ZenDNN path
if (NOT DEFINED ZENDNN_ROOT OR ZENDNN_ROOT STREQUAL "")
set(ZENDNN_ROOT "$ENV{ZENDNN_ROOT}")
endif()
# Check if path is still empty or OFF
if (NOT ZENDNN_ROOT OR ZENDNN_ROOT STREQUAL "" OR ZENDNN_ROOT STREQUAL "OFF")
message(STATUS "ZENDNN_ROOT not set. Automatically downloading and building ZenDNN...")
message(STATUS "This will take several minutes on first build...")
include(ExternalProject)
set(ZENDNN_PREFIX ${CMAKE_BINARY_DIR}/_deps/zendnn-prefix)
set(ZENDNN_SOURCE_DIR ${ZENDNN_PREFIX}/src/zendnn)
set(ZENDNN_BUILD_DIR ${ZENDNN_PREFIX}/build)
set(ZENDNN_INSTALL_DIR ${ZENDNN_BUILD_DIR}/install)
ExternalProject_Add(
zendnn
GIT_REPOSITORY https://github.com/amd/ZenDNN.git
GIT_TAG zendnnl
PREFIX ${ZENDNN_PREFIX}
SOURCE_DIR ${ZENDNN_SOURCE_DIR}
BINARY_DIR ${ZENDNN_BUILD_DIR}
CMAKE_ARGS
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_INSTALL_PREFIX=${ZENDNN_INSTALL_DIR}
-DZENDNNL_BUILD_EXAMPLES=OFF
-DZENDNNL_BUILD_DOXYGEN=OFF
-DZENDNNL_BUILD_GTEST=OFF
-DZENDNNL_BUILD_BENCHDNN=OFF
# Enable ALL matmul algorithm backends
-DZENDNNL_DEPENDS_AOCLDLP=ON
-DZENDNNL_DEPENDS_ONEDNN=ON
-DZENDNNL_DEPENDS_LIBXSMM=ON
BUILD_COMMAND ${CMAKE_COMMAND} --build ${ZENDNN_BUILD_DIR} --target zendnnl
INSTALL_COMMAND ${CMAKE_COMMAND} --build ${ZENDNN_BUILD_DIR} --target install
BUILD_ALWAYS OFF
LOG_DOWNLOAD ON
LOG_CONFIGURE ON
LOG_BUILD ON
LOG_INSTALL ON
)
# Add dependency so ZenDNN builds before our library
add_dependencies(ggml-zendnn zendnn)
# Set ZENDNN_ROOT to the installation directory
set(ZENDNN_ROOT ${ZENDNN_INSTALL_DIR})
message(STATUS "ZenDNN will be built to: ${ZENDNN_ROOT}")
else()
message(STATUS "Using custom ZenDNN installation at: ${ZENDNN_ROOT}")
endif()
# ZenDNN headers + libs
target_include_directories(ggml-zendnn PRIVATE
${ZENDNN_ROOT}/zendnnl/include
${ZENDNN_ROOT}/deps/aocldlp/include
${ZENDNN_ROOT}/deps/aoclutils/include
${ZENDNN_ROOT}/deps/json/include
${ZENDNN_ROOT}/deps/libxsmm/include
${ZENDNN_ROOT}/deps/onednn/include
)
target_link_directories(ggml-zendnn PRIVATE
${ZENDNN_ROOT}/zendnnl/lib
${ZENDNN_ROOT}/deps/aocldlp/lib
${ZENDNN_ROOT}/deps/aoclutils/lib
${ZENDNN_ROOT}/deps/libxsmm/lib
${ZENDNN_ROOT}/deps/onednn/lib
)
target_link_libraries(ggml-zendnn PRIVATE
zendnnl_archive # ZenDNN main
aocl-dlp # AOCL libraries
aoclutils
au_cpuid
dnnl # OneDNN
xsmm # libxsmm small matrix math
xsmmext
xsmmnoblas
m
pthread
)
if (GGML_OPENMP)
target_link_libraries(ggml-zendnn PRIVATE OpenMP::OpenMP_CXX)
endif()
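A typical configure step might look as follows (the backend toggle name is an assumption, not taken from this diff; check the top-level CMake options): either let the build download ZenDNN automatically, or point `ZENDNN_ROOT` at an existing install:

```console
$ cmake -B build -DGGML_ZENDNN=ON                            # auto-download and build ZenDNN
$ cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/opt/zendnn  # use an existing installation
```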

View File

@@ -0,0 +1,466 @@
#include "ggml-zendnn.h"
#include "ggml-backend-impl.h"
#include "ggml-impl.h"
#include "ggml-cpu.h"
#include "zendnnl.hpp"
#include <cstring>
struct ggml_backend_zendnn_context {
int n_threads = GGML_DEFAULT_N_THREADS;
std::unique_ptr<char[]> work_data;
size_t work_size = 0;
};
template<typename T>
zendnnl::common::data_type_t ggml_to_zendnn_type() {
if constexpr (std::is_same_v<T, float>) {
return zendnnl::common::data_type_t::f32;
} else if constexpr (std::is_same_v<T, ggml_bf16_t>) {
return zendnnl::common::data_type_t::bf16;
} else {
return zendnnl::common::data_type_t::none;
}
}
/**
* ZenDNN matmul: computes C = B * A.
*
* - A: weights, shape (k, m), column-major (each column is a weight vector for one output).
* - B: input, shape (n, k), row-major (each row is an input sample).
* - C: output, shape (n, m), row-major.
*
* Dimensions:
* m = output features (columns of C, columns of A)
* n = batch size (rows of C, rows of B)
* k = inner dimension (columns of B, rows of A)
*/
template <typename TA, typename TB, typename TC>
static bool ggml_zendnn_matmul(ggml_backend_zendnn_context * ctx, int64_t m, int64_t n, int64_t k,
const TA * A, int64_t lda, const TB * B, int64_t ldb, TC * C,
int64_t ldc) {
zendnnl::lowoha::lowoha_params params;
params.dtypes.src = ggml_to_zendnn_type<TB>();
params.dtypes.wei = ggml_to_zendnn_type<TA>();
params.dtypes.dst = ggml_to_zendnn_type<TC>();
params.num_threads = ctx->n_threads;
zendnnl::lowoha::status_t status = zendnnl::lowoha::matmul_direct(
'r', false, true, // row-major, don't transpose B, transpose A (because it's column-major)
n, // M: rows of B and C
m, // N: cols of A^T and C
k, // K: cols of B, rows of A
1.0f, // alpha
B, ldb, // src: B[n,k]
A, lda, // weight: A[k,m] column-major (transposed)
nullptr, // bias
0.0f, // beta
C, ldc, // output C[n,m]
true, // is_weights_const
{}, // batch_params
params // params
);
if (status != zendnnl::lowoha::status_t::success) {
GGML_LOG_ERROR("%s, ZenDNN matmul failed: status=%d\n", __func__, static_cast<int>(status));
return false;
}
return true;
}
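The convention in the header comment is easiest to see as a plain reference loop: A stored column-major as (k, m), B row-major as (n, k), producing row-major C[n, m] = B·A. A sketch of that reference (hypothetical, float-only; not the ZenDNN code path):

```cpp
#include <cstdint>

// Reference for the convention above: C[i*ldc + j] = sum_k B[i*ldb + k] * A[j*lda + k].
// A is (k x m) column-major, so column j is contiguous with leading dimension lda;
// B and C are row-major with leading dimensions ldb and ldc.
static void matmul_ref(int64_t m, int64_t n, int64_t k,
                       const float * A, int64_t lda,
                       const float * B, int64_t ldb,
                       float * C, int64_t ldc) {
    for (int64_t i = 0; i < n; ++i) {     // rows of B and C
        for (int64_t j = 0; j < m; ++j) { // columns of A and C
            float acc = 0.0f;
            for (int64_t kk = 0; kk < k; ++kk) {
                acc += B[i*ldb + kk] * A[j*lda + kk];
            }
            C[i*ldc + j] = acc;
        }
    }
}
```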
static bool ggml_zendnn_sgemm(ggml_backend_zendnn_context * ctx, int64_t m, int64_t n, int64_t k,
const void * A, int64_t lda, const void * B, int64_t ldb, void * C,
int64_t ldc, int Atype, int Btype, int Ctype) {
assert(m >= 0);
assert(n >= 0);
assert(k >= 0);
assert(lda >= k);
assert(ldb >= k);
assert(ldc >= m);
// categorize types
switch (Atype) {
case GGML_TYPE_F32:
if (Btype != GGML_TYPE_F32 || Ctype != GGML_TYPE_F32)
return false;
return ggml_zendnn_matmul<float, float, float>(
ctx, m, n, k,
(const float *)A, lda,
(const float *)B, ldb,
(float *)C, ldc);
case GGML_TYPE_BF16:
if (Btype != GGML_TYPE_BF16)
return false;
if (Ctype == GGML_TYPE_BF16)
return ggml_zendnn_matmul<ggml_bf16_t, ggml_bf16_t, ggml_bf16_t>(
ctx, m, n, k,
(const ggml_bf16_t *)A, lda,
(const ggml_bf16_t *)B, ldb,
(ggml_bf16_t *)C, ldc);
if (Ctype == GGML_TYPE_F32)
return ggml_zendnn_matmul<ggml_bf16_t, ggml_bf16_t, float>(
ctx, m, n, k,
(const ggml_bf16_t *)A, lda,
(const ggml_bf16_t *)B, ldb,
(float *)C, ldc);
return false;
default:
return false; // unsupported type
}
}
static void ggml_zendnn_compute_forward_mul_mat(
ggml_backend_zendnn_context * ctx,
ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0]; // weights
const ggml_tensor * src1 = dst->src[1]; // inputs
GGML_TENSOR_BINARY_OP_LOCALS
ggml_type const vec_dot_type = ggml_get_type_traits_cpu(src0->type)->vec_dot_type;
ggml_from_float_t const from_float = ggml_get_type_traits_cpu(vec_dot_type)->from_float;
GGML_ASSERT(ne0 == ne01);
GGML_ASSERT(ne1 == ne11);
GGML_ASSERT(ne2 == ne12);
GGML_ASSERT(ne3 == ne13);
// we don't support permuted src0 or src1
GGML_ASSERT(nb00 == ggml_type_size(src0->type));
GGML_ASSERT(nb10 == ggml_type_size(src1->type));
// dst cannot be transposed or permuted
GGML_ASSERT(nb0 == sizeof(float));
GGML_ASSERT(nb0 <= nb1);
GGML_ASSERT(nb1 <= nb2);
GGML_ASSERT(nb2 <= nb3);
// broadcast factors
const int64_t r2 = ne12/ne02;
const int64_t r3 = ne13/ne03;
void * work_data = ctx->work_data.get();
if (src1->type != vec_dot_type) {
const size_t nbw1 = ggml_row_size(vec_dot_type, ne10);
const size_t nbw2 = nbw1 * ne11;
const size_t nbw3 = nbw2 * ne12;
const size_t desired_wsize = ne13 * nbw3;
if (ctx->work_size < desired_wsize) {
ctx->work_data.reset(new char[desired_wsize]);
ctx->work_size = desired_wsize;
}
work_data = ctx->work_data.get();
// #pragma omp parallel for num_threads(ctx->n_threads)
#pragma omp parallel for collapse(3) num_threads(ctx->n_threads) schedule(static)
for (int64_t i13 = 0; i13 < ne13; ++i13) {
for (int64_t i12 = 0; i12 < ne12; ++i12) {
for (int64_t i11 = 0; i11 < ne11; ++i11) {
const float * src1_f32 = (float *)((char *)src1->data + i11*nb11 + i12*nb12 + i13*nb13);
void * src1_conv = (char *)work_data + i11*nbw1 + i12*nbw2 + i13*nbw3;
from_float(src1_f32, src1_conv, ne10);
}
}
}
}
for (int64_t i13 = 0; i13 < ne13; i13++) {
for (int64_t i12 = 0; i12 < ne12; i12++) {
const void* wdata = src1->type == vec_dot_type ? src1->data : work_data;
const size_t row_size = ggml_row_size(vec_dot_type, ne10);
if (!ggml_zendnn_sgemm(ctx,
ne01, // m
ne11, // n
ne10, // k
static_cast<const char *>(src0->data) + (i12/r2)*nb02 + (i13/r3)*nb03,
ne00, // lda
static_cast<const char *>(wdata) + (i12*ne11 + i13*ne12*ne11)*row_size,
ne10, // ldb
static_cast<char *>(dst->data) + i12*nb2 + i13*nb3,
ne01, // ldc
src0->type,
vec_dot_type,
dst->type))
GGML_ABORT("%s: ZenDNN sgemm failed\n", __func__);
}
}
}
// backend interface
static const char * ggml_backend_zendnn_get_name(ggml_backend_t backend) {
return "ZenDNN";
GGML_UNUSED(backend);
}
static void ggml_backend_zendnn_free(ggml_backend_t backend) {
ggml_backend_zendnn_context * ctx = (ggml_backend_zendnn_context *)backend->context;
delete ctx;
delete backend;
}
static ggml_status ggml_backend_zendnn_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
ggml_backend_zendnn_context * ctx = (ggml_backend_zendnn_context *)backend->context;
for (int i = 0; i < cgraph->n_nodes; i++) {
struct ggml_tensor * node = cgraph->nodes[i];
switch (node->op) {
case GGML_OP_MUL_MAT:
ggml_zendnn_compute_forward_mul_mat(ctx, node);
break;
case GGML_OP_NONE:
case GGML_OP_RESHAPE:
case GGML_OP_VIEW:
case GGML_OP_PERMUTE:
case GGML_OP_TRANSPOSE:
break;
default:
GGML_ABORT("%s: unsupported op %s\n", __func__, ggml_op_desc(node));
}
}
return GGML_STATUS_SUCCESS;
GGML_UNUSED(backend);
}
static struct ggml_backend_i ggml_backend_zendnn_i = {
/* .get_name = */ ggml_backend_zendnn_get_name,
/* .free = */ ggml_backend_zendnn_free,
/* .set_tensor_async = */ NULL,
/* .get_tensor_async = */ NULL,
/* .cpy_tensor_async = */ NULL,
/* .synchronize = */ NULL,
/* .graph_plan_create = */ NULL,
/* .graph_plan_free = */ NULL,
/* .graph_plan_update = */ NULL,
/* .graph_plan_compute = */ NULL,
/* .graph_compute = */ ggml_backend_zendnn_graph_compute,
/* .event_record = */ NULL,
/* .event_wait = */ NULL,
/* .graph_optimize = */ NULL,
};
static ggml_guid_t ggml_backend_zendnn_guid(void) {
static const char * guid_str = "AMD-ZENDNN-ACCEL";
return reinterpret_cast<ggml_guid_t>(const_cast<char*>(guid_str));
}
ggml_backend_t ggml_backend_zendnn_init(void) {
ggml_backend_zendnn_context * ctx = new ggml_backend_zendnn_context;
ggml_backend_t backend = new ggml_backend {
/* .guid = */ ggml_backend_zendnn_guid(),
/* .iface = */ ggml_backend_zendnn_i,
/* .device = */ ggml_backend_reg_dev_get(ggml_backend_zendnn_reg(), 0),
/* .context = */ ctx,
};
return backend;
}
bool ggml_backend_is_zendnn(ggml_backend_t backend) {
return backend != NULL && ggml_guid_matches(backend->guid, ggml_backend_zendnn_guid());
}
void ggml_backend_zendnn_set_n_threads(ggml_backend_t backend_zendnn, int n_threads) {
GGML_ASSERT(ggml_backend_is_zendnn(backend_zendnn));
ggml_backend_zendnn_context * ctx = (ggml_backend_zendnn_context *)backend_zendnn->context;
ctx->n_threads = n_threads;
}
// device interface
static const char * ggml_backend_zendnn_device_get_name(ggml_backend_dev_t dev) {
return "ZenDNN";
GGML_UNUSED(dev);
}
/**
* ZenDNN is AMD's performance library providing optimized primitives and implementations
* for deep learning workloads on AMD CPUs. It targets improved performance for common
* neural network operations on AMD architectures. For more information, see:
* https://www.amd.com/en/developer/zendnn.html
*/
static const char * ggml_backend_zendnn_device_get_description(ggml_backend_dev_t dev) {
return "ZenDNN: AMD optimized primitives backend for GGML (optimized for AMD CPUs)";
GGML_UNUSED(dev);
}
static void ggml_backend_zendnn_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
*free = 0;
*total = 0;
GGML_UNUSED(dev);
}
static enum ggml_backend_dev_type ggml_backend_zendnn_device_get_type(ggml_backend_dev_t dev) {
return GGML_BACKEND_DEVICE_TYPE_ACCEL;
GGML_UNUSED(dev);
}
static void ggml_backend_zendnn_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
props->name = ggml_backend_zendnn_device_get_name(dev);
props->description = ggml_backend_zendnn_device_get_description(dev);
props->type = ggml_backend_zendnn_device_get_type(dev);
ggml_backend_zendnn_device_get_memory(dev, &props->memory_free, &props->memory_total);
props->caps = {
/* .async = */ false,
/* .host_buffer = */ false,
/* .buffer_from_host_ptr = */ true,
/* .events = */ false
};
}
static ggml_backend_t ggml_backend_zendnn_device_init_backend(ggml_backend_dev_t dev, const char * params) {
ggml_backend_t backend = ggml_backend_zendnn_init();
if (backend == NULL) {
GGML_LOG_ERROR("%s: error: failed to initialize ZenDNN backend\n", __func__);
return NULL;
}
return backend;
GGML_UNUSED(dev);
GGML_UNUSED(params);
}
static ggml_backend_buffer_type_t ggml_backend_zendnn_device_get_buffer_type(ggml_backend_dev_t dev) {
return ggml_backend_cpu_buffer_type();
GGML_UNUSED(dev);
}
static ggml_backend_buffer_t ggml_backend_zendnn_device_buffer_from_host_ptr(ggml_backend_dev_t dev, void * ptr, size_t size, size_t max_tensor_size) {
return ggml_backend_cpu_buffer_from_ptr(ptr, size);
GGML_UNUSED(dev);
GGML_UNUSED(max_tensor_size);
}
static bool ggml_backend_zendnn_device_supports_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
switch (op->op) {
case GGML_OP_NONE:
case GGML_OP_RESHAPE:
case GGML_OP_VIEW:
case GGML_OP_PERMUTE:
case GGML_OP_TRANSPOSE:
return true;
case GGML_OP_MUL_MAT:
{
const ggml_tensor * weights = op->src[0];
const ggml_tensor * inputs = op->src[1];
const int64_t ne10 = inputs->ne[0];
const int64_t ne0 = op->ne[0];
const int64_t ne1 = op->ne[1];
const int64_t min_batch = 1;
if (!ggml_is_contiguous(weights) || !ggml_is_contiguous(inputs) ||
ne0 < min_batch || ne1 < min_batch || ne10 < min_batch) {
return false;
}
switch (weights->type) {
case GGML_TYPE_F32:
case GGML_TYPE_BF16:
return true;
default:
return false;
}
} break;
default:
return false;
}
GGML_UNUSED(dev);
}
static bool ggml_backend_zendnn_device_supports_buft(ggml_backend_dev_t dev, ggml_backend_buffer_type_t buft) {
return ggml_backend_buft_is_host(buft);
GGML_UNUSED(dev);
}
static const struct ggml_backend_device_i ggml_backend_zendnn_device_i = {
/* .get_name = */ ggml_backend_zendnn_device_get_name,
/* .get_description = */ ggml_backend_zendnn_device_get_description,
/* .get_memory = */ ggml_backend_zendnn_device_get_memory,
/* .get_type = */ ggml_backend_zendnn_device_get_type,
/* .get_props = */ ggml_backend_zendnn_device_get_props,
/* .init_backend = */ ggml_backend_zendnn_device_init_backend,
/* .get_buffer_type = */ ggml_backend_zendnn_device_get_buffer_type,
/* .get_host_buffer_type = */ NULL,
/* .buffer_from_host_ptr = */ ggml_backend_zendnn_device_buffer_from_host_ptr,
/* .supports_op = */ ggml_backend_zendnn_device_supports_op,
/* .supports_buft = */ ggml_backend_zendnn_device_supports_buft,
/* .offload_op = */ NULL,
/* .event_new = */ NULL,
/* .event_free = */ NULL,
/* .event_synchronize = */ NULL,
};
// backend reg interface
static const char * ggml_backend_zendnn_reg_get_name(ggml_backend_reg_t reg) {
return "ZenDNN";
GGML_UNUSED(reg);
}
static size_t ggml_backend_zendnn_reg_get_device_count(ggml_backend_reg_t reg) {
return 1;
GGML_UNUSED(reg);
}
static ggml_backend_dev_t ggml_backend_zendnn_reg_get_device(ggml_backend_reg_t reg, size_t index) {
GGML_ASSERT(index == 0);
static ggml_backend_device ggml_backend_zendnn_device = {
/* .iface = */ ggml_backend_zendnn_device_i,
/* .reg = */ reg,
/* .context = */ nullptr,
};
return &ggml_backend_zendnn_device;
}
static void * ggml_backend_zendnn_get_proc_address(ggml_backend_reg_t reg, const char * name) {
if (std::strcmp(name, "ggml_backend_set_n_threads") == 0) {
return (void *) ggml_backend_zendnn_set_n_threads;
}
return NULL;
GGML_UNUSED(reg);
GGML_UNUSED(name);
}
static const struct ggml_backend_reg_i ggml_backend_zendnn_reg_i = {
/* .get_name = */ ggml_backend_zendnn_reg_get_name,
/* .get_device_count = */ ggml_backend_zendnn_reg_get_device_count,
/* .get_device = */ ggml_backend_zendnn_reg_get_device,
/* .get_proc_address = */ ggml_backend_zendnn_get_proc_address,
};
ggml_backend_reg_t ggml_backend_zendnn_reg(void) {
static struct ggml_backend_reg ggml_backend_zendnn_reg = {
/* .api_version = */ GGML_BACKEND_API_VERSION,
/* .iface = */ ggml_backend_zendnn_reg_i,
/* .context = */ NULL,
};
return &ggml_backend_zendnn_reg;
}
GGML_BACKEND_DL_IMPL(ggml_backend_zendnn_reg)

View File

@@ -124,6 +124,13 @@ static void ggml_print_backtrace_symbols(void) {
int nptrs = backtrace(trace, sizeof(trace)/sizeof(trace[0]));
backtrace_symbols_fd(trace, nptrs, STDERR_FILENO);
}
#elif defined(__APPLE__)
#include <execinfo.h>
static void ggml_print_backtrace_symbols(void) {
void * trace[100];
int nptrs = backtrace(trace, sizeof(trace)/sizeof(trace[0]));
backtrace_symbols_fd(trace, nptrs, STDERR_FILENO);
}
#else
static void ggml_print_backtrace_symbols(void) {
// platform not supported
@@ -135,6 +142,20 @@ void ggml_print_backtrace(void) {
if (GGML_NO_BACKTRACE) {
return;
}
#if defined(__APPLE__)
// On macOS, fork+debugger attachment is problematic due to:
// 1. libdispatch "poisons" forked child processes
// 2. lldb has issues attaching to parent from forked child
// Use simple backtrace() instead to avoid Terminal.app crashes
const char * GGML_BACKTRACE_LLDB = getenv("GGML_BACKTRACE_LLDB");
if (!GGML_BACKTRACE_LLDB) {
fprintf(stderr, "WARNING: Using native backtrace. Set GGML_BACKTRACE_LLDB for more info.\n");
fprintf(stderr, "WARNING: GGML_BACKTRACE_LLDB may cause native MacOS Terminal.app to crash.\n");
fprintf(stderr, "See: https://github.com/ggml-org/llama.cpp/pull/17869\n");
ggml_print_backtrace_symbols();
return;
}
#endif
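Opting back into the lldb-based backtrace on macOS is a single environment variable (binary name illustrative):

```console
$ GGML_BACKTRACE_LLDB=1 ./llama-cli -m model.gguf -p "hello"
```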
#if defined(__linux__)
FILE * f = fopen("/proc/self/status", "r");
size_t size = 0;
@@ -5239,8 +5260,6 @@ struct ggml_tensor * ggml_flash_attn_ext(
if (mask) {
GGML_ASSERT(ggml_is_contiguous(mask));
GGML_ASSERT(mask->ne[1] >= GGML_PAD(q->ne[1], GGML_KQ_MASK_PAD) &&
"the Flash-Attention kernel requires the mask to be padded to GGML_KQ_MASK_PAD and at least n_queries big");
//GGML_ASSERT(ggml_can_repeat_rows(mask, qk));
GGML_ASSERT(q->ne[2] % mask->ne[2] == 0);

View File

@@ -67,6 +67,30 @@ Parentheses `()` can be used to group sequences, which allows for embedding alte
- `{m,n}` repeats the preceding symbol or sequence between `m` and `n` times (inclusive)
- `{0,n}` repeats the preceding symbol or sequence at most `n` times (inclusive)
## Tokens
Tokens allow grammars to match specific tokenizer tokens rather than character sequences. This is useful for constraining outputs based on special tokens (like `<think>` or `</think>`).
Tokens can be specified in two ways:
1. **Token ID**: Use angle brackets with the token ID in square brackets: `<[token-id]>`. For example, `<[1000]>` matches the token with ID 1000.
2. **Token string**: Use angle brackets with the token text directly: `<token>`. For example, `<think>` will match the token whose text is exactly `<think>`. This only works if the string tokenizes to exactly one token in the vocabulary, otherwise the grammar will fail to parse.
You can negate token matches using the `!` prefix: `!<[1000]>` or `!<think>` matches any token *except* the specified one.
```
# Match a thinking block: <think>...</think>
# Using token strings (requires these to be single tokens in the vocab)
root ::= <think> thinking </think> .*
thinking ::= !</think>*
# Equivalent grammar using explicit token IDs
# Assumes token 1000 = <think>, token 1001 = </think>
root ::= <[1000]> thinking <[1001]> .*
thinking ::= !<[1001]>*
```
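Such grammars are passed to the tools in the usual way, e.g. via `--grammar-file` (model and prompt illustrative):

```console
$ llama-cli -m model.gguf --grammar-file think.gbnf -p "Solve 12*17."
```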
## Comments and newlines
Comments can be specified with `#`:

View File

@@ -14,7 +14,7 @@
{%- endmacro %}
{%- set tool_response_queue = namespace(ids=[]) -%}
{%- set tool_call_counter = namespace(value=0) -%}
{%- if tools -%}
<|im_system|>tool_declare<|im_middle|>{{ tools | tojson }}<|im_end|>
@@ -36,12 +36,8 @@
{%- if message['role'] == 'assistant' and message.get('tool_calls') -%}
{{render_content(message)}}<|tool_calls_section_begin|>
{%- for tool_call in message['tool_calls'] -%}
{%- set formatted_id = 'functions.' + tool_call['function']['name'] + ':' + (tool_call_counter.value | string) -%}
{%- set tool_call_counter.value = tool_call_counter.value + 1 -%}
{%- set _ = tool_response_queue.ids.append(formatted_id) -%}
<|tool_call_begin|>{{ formatted_id }}<|tool_call_argument_begin|>{% if tool_call['function']['arguments'] is string %}{{ tool_call['function']['arguments'] }}{% else %}{{ tool_call['function']['arguments'] | tojson }}{% endif %}<|tool_call_end|>
{%- endfor -%}
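The net effect of these template changes: tool-call IDs are always regenerated from the function name plus a zero-based counter, ignoring any caller-provided `id`. For two successive calls to a tool named `get_weather` (name illustrative), the rendered IDs would be:

```
functions.get_weather:0
functions.get_weather:1
```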

View File

@@ -25,17 +25,13 @@
{%- endmacro -%}
{%- set tool_response_queue = namespace(ids=[]) -%}
{%- set tool_call_counter = namespace(value=0) -%}
{%- macro render_toolcalls(message) -%}
<|tool_calls_section_begin|>
{%- for tool_call in message['tool_calls'] -%}
{%- set formatted_id = 'functions.' + tool_call['function']['name'] + ':' + (tool_call_counter.value | string) -%}
{%- set tool_call_counter.value = tool_call_counter.value + 1 -%}
{%- set _ = tool_response_queue.ids.append(formatted_id) -%}
<|tool_call_begin|>{{ formatted_id }}<|tool_call_argument_begin|>{% if tool_call['function']['arguments'] is string %}{{ tool_call['function']['arguments'] }}{% else %}{{ tool_call['function']['arguments'] | tojson }}{% endif %}<|tool_call_end|>
{%- endfor -%}

View File

@@ -0,0 +1,126 @@
{#- Default system message if no system prompt is passed. #}
{%- set default_system_message = '# HOW YOU SHOULD THINK AND ANSWER\n\nFirst draft your thinking process (inner monologue) until you arrive at a response. Format your response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and the response in the same language as the input.\n\nYour thinking process must follow the template below:[THINK]Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate the response to the user.[/THINK]Here, provide a self-contained response.' %}
{#- Begin of sequence token. #}
{{- bos_token }}
{#- Handle system prompt if it exists. #}
{#- System prompt supports text content or text and thinking chunks. #}
{%- if messages[0]['role'] == 'system' %}
{{- '[SYSTEM_PROMPT]' -}}
{%- if messages[0]['content'] is string %}
{{- messages[0]['content'] -}}
{%- else %}
{%- for block in messages[0]['content'] %}
{%- if block['type'] == 'text' %}
{{- block['text'] }}
{%- elif block['type'] == 'thinking' %}
{{- '[THINK]' + block['thinking'] + '[/THINK]' }}
{%- else %}
{{- raise_exception('Only text and thinking chunks are supported in system message contents.') }}
{%- endif %}
{%- endfor %}
{%- endif %}
{{- '[/SYSTEM_PROMPT]' -}}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set loop_messages = messages %}
{%- if default_system_message != '' %}
{{- '[SYSTEM_PROMPT]' + default_system_message + '[/SYSTEM_PROMPT]' }}
{%- endif %}
{%- endif %}
{#- Tools definition #}
{%- set tools_definition = '' %}
{%- set has_tools = false %}
{%- if tools is defined and tools is not none and tools|length > 0 %}
{%- set has_tools = true %}
{%- set tools_definition = '[AVAILABLE_TOOLS]' + (tools| tojson) + '[/AVAILABLE_TOOLS]' %}
{{- tools_definition }}
{%- endif %}
{#- Checks for alternating user/assistant messages. #}
{%- set ns = namespace(index=0) %}
{%- for message in loop_messages %}
{%- if message.role == 'user' or (message.role == 'assistant' and (message.tool_calls is not defined or message.tool_calls is none or message.tool_calls | length == 0)) %}
{%- if (message['role'] == 'user') != (ns.index % 2 == 0) %}
{{- raise_exception('After the optional system message, conversation roles must alternate user and assistant roles except for tool calls and results.') }}
{%- endif %}
{%- set ns.index = ns.index + 1 %}
{%- endif %}
{%- endfor %}
{#- Handle conversation messages. #}
{%- for message in loop_messages %}
{#- User messages supports text content or text and image chunks. #}
{%- if message['role'] == 'user' %}
{%- if message['content'] is string %}
{{- '[INST]' + message['content'] + '[/INST]' }}
{%- elif message['content'] | length > 0 %}
{{- '[INST]' }}
{%- if message['content'] | length == 2 %}
{%- set blocks = message['content'] | sort(attribute='type') %}
{%- else %}
{%- set blocks = message['content'] %}
{%- endif %}
{%- for block in blocks %}
{%- if block['type'] == 'text' %}
{{- block['text'] }}
{%- elif block['type'] in ['image', 'image_url'] %}
{{- '[IMG]' }}
{%- else %}
{{- raise_exception('Only text, image and image_url chunks are supported in user message content.') }}
{%- endif %}
{%- endfor %}
{{- '[/INST]' }}
{%- else %}
{{- raise_exception('User message must have a string or a list of chunks in content') }}
{%- endif %}
{#- Assistant messages supports text content or text, image and thinking chunks. #}
{%- elif message['role'] == 'assistant' %}
{%- if (message['content'] is none or message['content'] == '' or message['content']|length == 0) and (message['tool_calls'] is not defined or message['tool_calls'] is none or message['tool_calls']|length == 0) %}
{{- raise_exception('Assistant message must have a string or a list of chunks in content or a list of tool calls.') }}
{%- endif %}
{%- if message['content'] is string and message['content'] != '' %}
{{- message['content'] }}
{%- elif message['content'] | length > 0 %}
{%- for block in message['content'] %}
{%- if block['type'] == 'text' %}
{{- block['text'] }}
{%- elif block['type'] == 'thinking' %}
{{- '[THINK]' + block['thinking'] + '[/THINK]' }}
{%- else %}
{{- raise_exception('Only text and thinking chunks are supported in assistant message contents.') }}
{%- endif %}
{%- endfor %}
{%- endif %}
{%- if message['tool_calls'] is defined and message['tool_calls'] is not none and message['tool_calls']|length > 0 %}
{%- for tool in message['tool_calls'] %}
{{- '[TOOL_CALLS]' }}
{%- set name = tool['function']['name'] %}
{%- set arguments = tool['function']['arguments'] %}
{%- if arguments is not string %}
{%- set arguments = arguments|tojson|safe %}
{%- elif arguments == '' %}
{%- set arguments = '{}' %}
{%- endif %}
{{- name + '[ARGS]' + arguments }}
{%- endfor %}
{%- endif %}
{{- eos_token }}
{#- Tool messages only supports text content. #}
{%- elif message['role'] == 'tool' %}
{{- '[TOOL_RESULTS]' + message['content']|string + '[/TOOL_RESULTS]' }}
{#- Raise exception for unsupported roles. #}
{%- else %}
{{- raise_exception('Only user, assistant and tool roles are supported, got ' + message['role'] + '.') }}
{%- endif %}
{%- endfor %}
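To see the shape of the output, a hypothetical rendering of a short conversation through this template (assuming `bos_token` is `<s>` and `eos_token` is `</s>`; the message text and thinking block are invented):
```
<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT][INST]What is 2+2?[/INST][THINK]2+2 is basic arithmetic; the answer is 4.[/THINK]2 + 2 = 4</s>
```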

View File

@@ -1,5 +1,5 @@
 {
-    "extraPaths": ["gguf-py"],
+    "extraPaths": ["gguf-py", "examples/model-conversion/scripts"],
     "pythonVersion": "3.9",
     "pythonPlatform": "All",
     "reportUnusedImport": "warning",

View File

@@ -46,7 +46,7 @@ adb $adbserial shell " \
 LD_LIBRARY_PATH=$basedir/$branch/lib \
 ADSP_LIBRARY_PATH=$basedir/$branch/lib \
 $verbose $experimental $sched $opmask $profile $nhvx $ndev \
-./$branch/bin/llama-cli --no-mmap -m $basedir/../gguf/$model \
+./$branch/bin/llama-completion --no-mmap -m $basedir/../gguf/$model \
 --poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
 --ctx-size 8192 --batch-size 128 -ctk q8_0 -ctv q8_0 -fa on \
 -ngl 99 --device $device $cli_opts $@ \

View File

@@ -67,7 +67,7 @@ add_library(llama
 models/gemma-embedding.cpp
 models/gemma.cpp
 models/gemma2-iswa.cpp
-models/gemma3-iswa.cpp
+models/gemma3.cpp
 models/gemma3n-iswa.cpp
 models/glm4-moe.cpp
 models/glm4.cpp
@@ -139,6 +139,7 @@ add_library(llama
 set_target_properties(llama PROPERTIES
 VERSION ${LLAMA_INSTALL_VERSION}
 SOVERSION 0
+MACHO_CURRENT_VERSION 0 # keep macOS linker from seeing oversized version number
 )
 target_include_directories(llama PRIVATE .)

Some files were not shown because too many files have changed in this diff.