Compare commits

..

112 Commits

Author SHA1 Message Date
ihb2032
dc8d14c582 fix(ggml): correct RISC-V ISA string canonical ordering for RVV in CMake (#20888)
Signed-off-by: ihb2032 <hebome@foxmail.com>
2026-03-26 13:08:41 +02:00
Adrien Gallouët
93dfbc1291 common : make LLAMA_CACHE the one cache for everything (#21009)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-26 12:04:57 +01:00
Adrien Gallouët
3cba8bba18 common : fix split model migration (#21019)
Sadly the manifest does not list all required files, i honestly thought
it was the case

Without the files listed we don't have the sha256, so if the first file
is valid, and all others have the correct size, then we can assume we
are good and do the migration...

Here my test:

    $ find /home/angt/.cache/llama.cpp
    /home/angt/.cache/llama.cpp
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf.etag
    /home/angt/.cache/llama.cpp/angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf.etag
    /home/angt/.cache/llama.cpp/manifest=angt=test-split-model-stories260K=latest.json

    $ build/bin/llama-server
    ================================================================================
    WARNING: Migrating cache to HuggingFace cache directory
      Old cache: /home/angt/.cache/llama.cpp/
      New cache: /home/angt/.cache/huggingface/hub
    This one-time migration moves models previously downloaded with -hf
    from the legacy llama.cpp cache to the standard HuggingFace cache.
    Models downloaded with --model-url are not affected.
    ================================================================================
    migrate_file: migrated angt_test-split-model-stories260K_stories260K-f32-00001-of-00002.gguf -> /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00001-of-00002.gguf
    migrate_file: migrated angt_test-split-model-stories260K_stories260K-f32-00002-of-00002.gguf -> /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00002-of-00002.gguf
    migrate_old_cache_to_hf_cache: migration complete, deleting manifest: /home/angt/.cache/llama.cpp/manifest=angt=test-split-model-stories260K=latest.json

    $ find /home/angt/.cache/llama.cpp /home/angt/.cache/huggingface
    /home/angt/.cache/llama.cpp
    /home/angt/.cache/huggingface
    /home/angt/.cache/huggingface/hub
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs/50d019817c2626eb9e8a41f361ff5bfa538757e6f708a3076cd3356354a75694
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/blobs/7b273e1dbfab11dc67dce479deb5923fef27c39cbf56a20b3a928a47b77dab3c
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/refs
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/refs/main
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00002-of-00002.gguf
    /home/angt/.cache/huggingface/hub/models--angt--test-split-model-stories260K/snapshots/68c3ea2061e8c7688455fab07597dde0f4d7f0db/stories260K-f32-00001-of-00002.gguf

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-26 12:04:37 +01:00
Michael Wand
112c78159f ggml-cuda: Add NVFP4 dp4a kernel (#20644)
Added check for dst_t to cuda_cast template for float
Restored ggml_cuda_ue4m3_to_fp32, changed vecdot ints to int32ts
Added CUDART/HIP Check and HIP/fp8 include
Added NVFP4 to Test-backend-ops
Added hip_fp8_e4m3 to __nv_fp8_e4m3 typedef

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-26 09:54:03 +01:00
SamareshSingh
0fac87b157 imatrix : fix crash when using --show-statistics with zero counts (#19532)
* imatrix: fix crash when using --show-statistics with zero counts

Fixes division by zero that caused floating point exceptions when processing imatrix files with zero count values. Added checks to skip zero counts and handle empty activation vectors.

Fix for the bug #19190

* imatrix: lower log level for zero-count skip message to DBG
2026-03-26 08:14:36 +01:00
Yihao Wang
0a524f2404 CUDA & CPU: support F32 kernel type for CONV_TRANSPOSE_2D (#17094)
* Refactor CUDA 2D transpose implementation to support multiple kernel types and improve parameter handling

- Introduced a `conv2d_transpose_params` struct for better parameter management.
- Updated `conv2d_transpose_kernel` to be templated for different kernel types (float and half).
- Modified `ggml_cuda_conv_2d_transpose_p0` to handle both F16 and F32 kernel types.
- Enhanced test cases to validate functionality for both kernel types.

* Refactor test cases for 2D convolution transpose to support dynamic kernel types

- Updated `test_conv_transpose_2d` structure to improve parameter handling by reordering constructor arguments.
- Enhanced test case generation to iterate over kernel types, allowing for flexible testing of different configurations.
- Removed hardcoded kernel type instances in favor of a loop for better maintainability and scalability.

* Refactor ggml_compute_forward_conv_transpose_2d to support both F16 and F32 tensor types.

* Refactor conv2d transpose kernel to use a template for kernel type, enhancing flexibility for different data types.
Update test cases to include both F16 and F32 tensor types for comprehensive coverage.

* Update ggml/src/ggml-cuda/conv2d-transpose.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Update ggml/src/ggml-cpu/ggml-cpu.c

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Refactor conv2d transpose implementation by removing the conv2d_transpose_params struct and dispatching with direct kernel launch.

* Enhance cpu conv2d transpose implementation by introducing a templated kernel type for improved flexibility with F16 and F32 data types.

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-03-26 10:19:14 +08:00
Adrien Gallouët
c0159f9c1f common : do not delete old files from the old cache when updating (#21000)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 22:28:04 +01:00
Saba Fallah
a970515bdb mtmd: Add DeepSeekOCR Support (#17400)
* mtmd: llama.cpp DeepSeekOCR support
init commit

* loading sam tensors

* mtmd: fix vision model processing

* deepseek-ocr clip-vit model impl

* mtmd: add DeepSeek-OCR LM support with standard attention

* mtmd: successfully runs DeepSeek-OCR LM in llama-cli

* mtmd: Fix RoPE type for DeepSeek-OCR LM.

* loading LM
testing Vision model loading

* sam warmup working

* sam erroneous return corrected

* clip-vit:  corrected cls_embd concat

* clip-vit: model convert  qkv_proj split

* corrected combining of image encoders' results

* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model

* concat image_newline and image_seperator tokens

* visual_model warmup (technically) works

* window partitioning using standard ggml ops

* sam implementation without using CPU only ops

* clip: fixed warnings

* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr

* mtmd: fix get_rel_pos

* mtmd: fixed the wrong scaler for get_rel_pos

* image encoding technically works but the output can't be checked singe image decoding fails

* mtmd: minor changed

* mtmd: add native resolution support

* - image encoding debugged
- issues fixed mainly related wrong config like n_patches etc.
- configs need to be corrected in the converter

* mtmd: correct token order

* - dynamic resizing
- changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4

* mtmd: quick fix token order

* mtmd: fix danling pointer

* mtmd: SAM numerically works

* mtmd: debug CLIP-L (vit_pre_ln)

* mtmd: debug CLIP-L & first working DeepSeek-OCR model

* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work

* mtmd: simplify SAM patch embedding

* mtmd: adapt Pillow image resizing function

* mtmd:  simplify DeepSeek-OCR dynamic resolution preprocessing

* mtmd: remove --dsocr-mode argument

* mtmd: refactor code & remove unused helper functions

* mtmd: fix tensor names for image newlines and view separator

* clean up

* reverting automatically removed spaces

* reverting automatically removed spaces

* mtmd: fixed bad ocr check in Deepseek2 (LM)

* mtmd: support combined QKV projection in buid_vit

* using common build_attn in sam

* corrected code-branch when flash-attn disabled
enabling usage of --flash-attn option

* mtmd: minor fix

* minor formatting and style

* fixed flake8 lint issues

* minor editorconfig-check fixes

* minor editorconfig-check fixes

* mtmd: simplify get_rel_pos

* mtmd: make sam hparams configurable

* mtmd: add detailed comments for resize_bicubic_pillow

* mtmd: fixed wrong input setting

* mtmd: convert model in FP16

* mtmd: minor fix

* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template

* fix: test-1.jpg ORC issue with small (640) resolution
setting min-resolution base (1024) max large (1280) for dynamic-resolution

* minor: editconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn

* minor: editconfig-check fix

* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR

* quick and (potential) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909

* refactoring, one single builder function and static helpers

* added deepseek-ocr test to tests.sh

* minor formatting fixes

* check with fixed expected resutls

* minor formatting

* editorconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042

* minor
- added GLM-4.6V to big tests
- added missing deps for python test

* convert: minor fix

* mtmd: format code

* convert: quick fix

* convert: quick fix

* minor python formatting

* fixed merge build issue

* merge resolved
- fixed issues in convert
- tested several deepseek models

* minor fix

* minor

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions

* - cleaning commented out code

* fixing instabilities issues reintroducing resize_bicubic_pillow

* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr

* rename fc_w --> mm_fc_w

* add links to OCR discussion

* cleaner loading code

* add missing .weight to some tensors

* add default jinja template (to be used by server)

* move test model to ggml-org

* rolling back upscale change

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: bluebread <hotbread70127@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-03-25 19:57:40 +01:00
Adrien Gallouët
056b50c319 common : fix verbosity setup (#20989)
The verbosity threshold was set at the end of common_params_parse_ex(),
after doing many things (like downloading files..)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 19:41:01 +01:00
Adrien Gallouët
f2c72b8f1f common : fix gguf selection in common_list_cached_models (#20996)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 19:18:06 +01:00
uvos
ec54ac13a8 ci : fix parsing of vgpr counts in hip-quality-check (#20987)
* scripts: hip: gcn-cdna-vgpr-check: fix parsing of vgpr counts when an amdclang Remark block is interlieved with another from a different process

* Return warning ignore

* obay pep8 inline double space before inline commets

* add # noqa: NP100 for other prints too

* Add script changes to cause autotrigger
2026-03-25 19:00:37 +01:00
Saba Fallah
80322ebdaf model: codefuse-ai/F2LLM-v2 support 2026-03-25 18:33:42 +01:00
Dowon
44c51e526b model : allow causal_attn and pooling_type on all architectures (#20973)
* models : allow causal_attn and pooling_type on all architectures

* fix: move location
2026-03-25 18:12:38 +01:00
Aparna M P
1922f87c2f snapdragon: add missing features to WoS scripts to achieve parity with ADB scripts (#20884)
* Add missing features to WoS scripts to achieve parity with ADB scripts

* Fix line-ending in run-mtmd.ps1

Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>

---------

Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-25 09:43:12 -07:00
Shreya Jain
345de3cd87 Use docker in build-android.yml (#20928)
* use docker instead of SDK separately

* fix whitespaces

* Update .github/workflows/build-android.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-25 09:36:27 -07:00
Aman Gupta
9c600bcd4b llama-bench: print -n-cpu-moe when offloaded layers > 1 (#20984) 2026-03-25 21:17:27 +08:00
Masato Nakasaka
b2704f9028 ci: Allow ninja to be used during unit test (#20742)
* Remove make dependency

* Added option to specify Ninja generator

* use ninja-build as default for several CI

* Revert "use ninja-build as default for several CI"

This reverts commit f552c4559b.

* changed use plain string rather than arrays

* Enabled ninja build by default for experimentation

* ci: add run.sh to test conditions to trigger GitHub CI and self-hosted runners

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Enabled ninja build by default on self-hosted envs for experimentation

* ci: revert generator to ninja instead of ninja multi-config

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ci: install ninja-build for self-hosted workflows

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ci: revert ninja from self-hosted runners

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ci: missed one self-hosted step

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ci: fix windows ci errors from an errenous revert

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Added explicit build types for Ninja

Also reverted some needless change

* ci: use ninja multi-config for vulkan-x64 build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* added time command to measure build time

* Keeping some configs to use Ninja which show improvement

* minor fix based on review

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* ci: rm `time` from custom containers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-03-25 21:00:49 +08:00
Georgi Gerganov
3fab96cd04 ci : disable self-hosted mac jobs (#20985) 2026-03-25 14:46:40 +02:00
Xuan-Son Nguyen
914eb5ff0c jinja: fix macro with kwargs (#20960)
* jinja: fix macro with kwargs

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix newline problem

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-25 12:22:48 +01:00
Francisco Herrera
8fc17493c3 gguf-split : clarify operation of gguf-split (#19749)
* clarify operation of gguf-split

so that you don't have to find out by trial and error

* formatting
2026-03-25 13:12:50 +02:00
Johannes Gäßler
36dafba5c4 llama: fix llama-model-saver (#20503)
* llama : add fd-based model loading via llama_model_load_from_fd

* llama : address review feedback for fd-based model loading

* llama : use FILE pointer instead of fd in public API

* llama : use FILE pointer consistently, address review feedback

* fixup

* fix tensor names

* fix llama-model-saver

* roundtrip tests

* fixup

* refactor tests

* fix prints

* fix model saving

* fix CI, disable Chameleon

* print seed

---------

Co-authored-by: Siddhesh2377 <siddheshsonar2377@gmail.com>
2026-03-25 12:53:16 +02:00
Aleksander Grygier
69e0ecef06 webui: Fix editing assistant message without branching (#20944)
* fix: Editing assistant response without branching

* chore: update webui build output
2026-03-25 12:47:33 +02:00
Pascal
062cca58fc Add SLEEPING status to the WebUI model selector (#20949)
* webui: handle sleeping model status, fix favourite -> favorite

* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: fix optional event parameter in sleeping model onclick

* typo

* webui: restore orange sleeping indicator dot with hover unload

* chore: update webui build output

* webui: move stopPropagation into ActionIcon onclick, remove svelte-ignore

* chore: update webui build output

* webui: fix favourite -> favorite (UK -> US spelling) everywhere

Address review feedback from WhyNotHugo

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-25 11:02:32 +01:00
yikechayedan
406f4e3f61 android : fix-pointer-dangling (#20974) 2026-03-25 11:51:26 +02:00
Neo Zhang
53dc8b59bf sycl : fix wrong variable check by assert (#20903)
* fix wrong variable check by assert

* use GGML api
2026-03-25 11:48:37 +02:00
Sigbjørn Skjæret
403c9c9cef ci : bump gguf publish python version (#20982) 2026-03-25 11:04:59 +02:00
Sigbjørn Skjæret
8fc85db9d2 ci : limit requirements versions (#20980)
* set requests version

* limit versions outside requirements
2026-03-25 10:55:37 +02:00
Dowon
3a60d06ad9 convert : register Qwen3Model architecture (#20967) 2026-03-25 10:37:59 +02:00
Ravi Panchumarthy
abd86ef175 docs : Update OpenVINO backend docs (#20968)
* OpenVINO doc updates

* Update docs/backend/OPENVINO.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

---------

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-03-25 10:33:51 +02:00
Georgi Gerganov
9f102a1407 models : move the token embedding norms to the first layer (#20943)
* models : move the token embedding norms to the first layer

* cont : fix LLM_TENSOR_CONV1D + fix il indexing
2026-03-24 17:00:30 +02:00
Aman Gupta
3fc6f1aed1 ggml-backend: re-enable graph reuse with pipeline parallelism (#20927) 2026-03-24 20:47:00 +08:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
29771a0a4c vendor : update cpp-httplib to 0.39.0 (#20933) 2026-03-24 13:33:33 +01:00
Adrien Gallouët
42ebce3beb common : fix get_gguf_split_info (#20946)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 13:33:14 +01:00
BlueMöhre
a94fdb090a WebUI: fix edit msg form textarea height (#20830)
* autoresize textarea on mount

* allow textarea to grow to same height as rendered messages

* add UI build file
2026-03-24 13:17:45 +01:00
Adrien Gallouët
c9dc43333f readme : clarify MODEL_ENDPOINT usage (#20941)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 10:35:07 +01:00
Adrien Gallouët
2d2d9c2062 common : add a WARNING for HF cache migration (#20935)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 09:24:39 +01:00
nuri
92080b4396 metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930)
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
2026-03-24 10:13:07 +02:00
Georgi Gerganov
342d6125bc metal : add FA instantiations for HSK=512, HSV=512 (#20902) 2026-03-24 10:03:09 +02:00
Aaron Teo
c2e224d829 issues: add openvino backends (#20932)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-03-24 14:41:10 +08:00
Adrien Gallouët
8c7957ca33 common : add standard Hugging Face cache support (#20775)
* common : add standard Hugging Face cache support

- Use HF API to find all files
- Migrate all manifests to hugging face cache at startup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Check with the quant tag

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Improve error handling and report API errors

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Restore common_cached_model_info and align mmproj filtering

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Prefer main when getting cached ref

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use cached files when HF API fails

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use final_path..

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Check all inputs

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 07:30:33 +01:00
Aman Gupta
e852eb4901 llama-fit: fix regex pattern for gate_up tensors (#20910)
* llama-fit: fix regex pattern for gate_up tensors

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-24 12:57:57 +08:00
Aldehir Rojas
312d870a89 common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912) 2026-03-23 22:21:47 -05:00
Max Krasnyansky
7cadbfce10 hexagon: general DMA and Binary Op fixes for large strides (#20918)
* hex-dma: make chained dma the default to handle newer models

This also includes some new instrumentation that we can remove later.

* hexagon: add uint32 dump helper

* hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv

ssm-conv uses HVX gather instruction and that instruction cannot handle cases where the base+offset
spans page boundaries.

* hexagon: update ssm-conv to make base-addr compute a bit easier to read

* hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB)

* hex-bin: fix incorrect stride logic

* hexagon: make sure repack buffs are dumped for verbose > 2

* hex-bin: consistently use dma_queue_push even for dummy dst transactions

* hex-dma: start using 2d-wide mode on v75 and up

The removes the need to deal with the 16-bit limitaion for the strides.

* hex-bin: cleanup kernel selection logic

* hex-bin: cleanup binary op core and fix transposed tensor handling

* snapdragon: update run-bench to use larger ubatch and fa-on
2026-03-23 15:33:49 -07:00
Max Krasnyansky
1fb2290a51 Add codeowners for scripts/snapdragon and docs/snapdragon (#20915)
* Add codeowners for scripts/snapdragon

* Also add docs/backends/snapdragon
2026-03-23 14:57:18 -07:00
lhez
1772701f99 opencl: add q6_K gemm and gemv kernels for Adreno (#20089)
* opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code

* opencl: add q6_K transpose

* opencl: fix cvt kernel name

* opencl: add call to q6_K gemv

* opencl: fix q6_K scale transpose

* opencl: fix loading for gemv q6_K, refactor

* opencl: fix transpose_8_buf kernel assignment, refactor

* opencl: refactor q6_K transpose

* opencl: add gemm_noshuffle_q6_k_f32

* opencl: fix qh loading

* opencl: refactor q6_K gemv host side, release bufs and imgs

* opencl: refactor

* opencl: fix q6_K dequant and scale selection

* opencl: workaround compiler bug, fix dump_tensor

* opencl: refactor q6_K convert kernels

* opencl: unpack transformed q6_K in get_tensor

* opencl: refactor, handle non-uniform workgroups

* opencl: support non-vector subgroup bcast
2026-03-23 12:44:18 -07:00
las7
39bf0d3c6a rpc : RCE patch (#20908) 2026-03-23 19:54:57 +02:00
Xuan-Son Nguyen
bd6992180b contrib: add "Requirements" section to PR template (#20841)
* contrib: add "Requirements" section to PR template

* typo [no ci]

* use h2, add "Additional information"

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-03-23 16:59:02 +01:00
Davi Henrique Linhares
fd18364755 devops: upgraded default oneAPI version (#20731) 2026-03-23 21:47:34 +08:00
Aleksander Grygier
11fb11b901 webui: Improve chat form positioning (#20901) 2026-03-23 14:30:55 +01:00
Geo Maciolek
35b662bb5d docs: Fix typo in reasoning flag documentation (#20780)
Tested to verify - the typo is just in the docs, not the actual flag.
2026-03-23 21:24:55 +08:00
Georgi Gerganov
f93c09e267 memory : fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887) 2026-03-23 14:08:46 +02:00
Eric Zhang
841bc203e2 docs : rerun llama-gen-docs to include new CLI args (#20892) 2026-03-23 12:33:38 +01:00
Xuan-Son Nguyen
31a5cf4c3f server: use httplib dynamic threads (#20817)
* server: use httplib dynamic threads

* change to n_threads_http + 1024
2026-03-23 12:22:46 +01:00
Georgi Gerganov
e32d243849 ai : update gh permissions (#20895) 2026-03-23 13:21:41 +02:00
Pascal
c44a932cf4 webui: fix --webui-config-file settings not applied on load (#20823)
* webui: fix --webui-config-file settings not applied on load

* chore: update webui build output
2026-03-23 11:25:35 +01:00
Rashid Ul Islam
177c75852a metal: add CONV_3D (#19927)
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* metal:add conv_3d backend

Rebased with master and resolved conflicts.

* Resolved issues related to changes in variable names

* kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-23 09:45:34 +02:00
Jhen-Jie Hong
7a0b6a635e common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859) 2026-03-23 08:35:27 +01:00
Chenguang Li
07ff000551 CANN: add RoPE cache preload before ACL graph capture (#20747)
ACL graph capture disallows host-to-device memcpy and device memory
malloc/free on the captured stream. Pre-load the RoPE cache before
capture so that:
- Host-to-device copies and allocations run on the non-captured stream
- Cache metadata is populated and memory pool is warmed up
- During capture, only on-device computations are recorded; host-side
  and allocation branches are skipped
2026-03-23 15:24:06 +08:00
Dan Hoffman
cc18f965b6 fix(openvino): explicit memset in buffer_context allocation (#20857)
* fix(openvino): explicit memset in buffer_context allocation

* minor

---------

Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-23 08:05:37 +02:00
shaofeiqi
84ffd0c192 opencl: add flattened Q4_K mv and general Q4_K mm (#20773) 2026-03-22 22:45:11 -07:00
bssrdf
ec2b787ebe mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847)
* added support for internvl's dynamic high-resolution (Qianfan-OCR needed)

* add min/max dynamic patch to gguf meta

* clean up

* simplified handling min/max dynamic patch

* reuse llava_uhd logic for slice images

* provide default values for older models

* flake8

* prevent writing 0 value to gguf

* remove duplicated resolution candidates with a better algorithm

* fix indentation

* format

* add protection from divide by zero

* change to 0 to be safe

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-03-23 01:06:30 +01:00
DorianRudolph
d3ac030a5d mtmd : fix LightOnOCR image preprocessing (#20877) 2026-03-23 01:04:14 +01:00
Xuan-Son Nguyen
49bfddeca1 server: allow router to report child instances sleep status (#20849)
* server: allow router to report child instances sleep status

* refactor

* move sleeping to state

* nits
2026-03-22 18:33:52 +01:00
Johannes Gäßler
bd3f1d9d65 CUDA: fix BF16 FA compilation (#20865) 2026-03-22 17:53:33 +01:00
Sigbjørn Skjæret
23c9182ce8 jinja : refactor token advancement (#20864)
* refactor token advancement

* exercise sub-expressions
2026-03-22 17:45:10 +01:00
Evgeny Kurnevsky
81bc4d3ddc server: fix Host header (#20843)
It should include port when it's not default.
2026-03-22 22:29:22 +08:00
Neo Zhang
f40a80b4f3 support bf16 and quantized type (#20803) 2026-03-22 22:06:27 +08:00
Patrick Buckley
db9d8aa428 ggml-cuda: native bf16 flash attention for vec kernel (#20525)
* ggml-cuda: native bf16 flash attention for vec and tile kernels

mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo

* ggml-cuda: address code owner review feedback

reverted tile kernel changes to avoid larger refactor

* fix ci failures on turing and hip

* fix bf16 vec kernel compile on hip v_dot2 platforms

* add comments

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-22 11:05:51 +01:00
Gaurav Garg
ccb87fa3ee [CUDA] Increase number of output elements per-thread block if the K-dimension is small (#20635)
* Increase per-thread work if the K-dimension is small

With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MOEs. For example, Qwen3-30b-A3B has a K-dimension of 768, and Qwen3235B-A22B has k-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices.

This change increases the number of output elements per block for such cases.

* Limit this change to ncols_dst = 1

* tab to space
2026-03-22 16:49:35 +08:00
ddh0
3306dbaef7 misc : prefer ggml-org models in docs and examples (#20827)
* misc : prefer ggml-org models in docs and examples

Prefer referring to known-good quantizations under ggml-org rather than
3rd-party uploaders.

* remove accidentally committed file
2026-03-21 22:00:26 +01:00
Andrea Arcangeli
990e4d9698 common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)
* grammar: add test case for nullable symbol loop

Reproduce stack overflow (or OOM) with ( [x]* )* found while adding
GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= ( [x]* )*"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: prevent stack overflow with nullable symbol loop

Fix a potential stack overflow in llama_grammar_advance_stack that
could occur when processing grammars with nullable symbols that lead
to infinite derivations of empty strings. The fix introduces cycle
detection by tracking visited stacks to prevent infinite recursion.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A20
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

* grammar: convert recursive llama_grammar_advance_stack to iterative

This change converts the function to an iterative approach using
explicit stacks, which prevents deep recursion and eliminates the risk
of stack overflow.

rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault:

[..]
 Testing segfault. Grammar:
            root ::= ( [x]* )*

            root ::= ( [x]* )*

Segmentation fault         build/bin/test-grammar-integration

convert from recursive to interactive"""

gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
 ("~/llama.cpp/tests/test-grammar-integration.cpp")
 ("~/llama.cpp/grammars/./list.gbnf")
 ("~/llama.cpp/grammars/./json_arr.gbnf")
 ("~/llama.cpp/grammars/./json.gbnf")
 ("~/llama.cpp/grammars/./japanese.gbnf")
 ("~/llama.cpp/grammars/./english.gbnf")
 ("~/llama.cpp/grammars/./chess.gbnf")
 ("~/llama.cpp/grammars/./c.gbnf")
 ("~/llama.cpp/grammars/./arithmetic.gbnf")
 ("~/llama.cpp/grammars/./README.md"))

v2: Added a `std::set` to perform tree-based lookups with O(N log N)
complexity. Testing with a parallel run of `test-grammar-integration`
shows a double-digit percentage increase in runtime. An
`unordered_set` with O(1) hashing was also evaluated, but the overhead
of constructing hash keys from pointers made it significantly slower
than the rbtree implementation that only requires an ordering
operator. The performance regression in the test suite appears
justified by the overall reduction in algorithmic complexity.

Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

* grammar: add test case for hang in repetition grammar processing

This commit adds a new test case to the grammar integration tests that
specifically targets a hang scenario in the repetition grammar parser
found while adding GBNF support to ripgrep-edit.

llama-server reproducer:

curl \
  -X POST \
  -d '{
    "messages": [{ "role": "user", "content": "write yes" }],
    "grammar": "root ::= (([^x]*){0,99}){0,99}"
  }' \
  -H "Content-Type: application/json" \
  http://localhost:8811/v1/chat/completions

* grammar: add repetition threshold check

The change introduces a maximum repetition threshold to avoid
excessive rule expansion during grammar parsing. When parsing
repetition patterns like {m,n}, the parser now calculates the
potential number of rules that would be generated and throws an error
if the product of previous rules and new rules exceeds the threshold.

A test case was added to verify the threshold is properly enforced for
deeply nested repetition patterns that would otherwise cause hangs.
2026-03-21 18:43:35 +01:00
Tom Hillbrunner
212f4521b0 context : use n_embd_out for pooled embedding extraction (#20840)
The MEAN/CLS/LAST pooling paths in encode() and decode() used
n_embd_inp() (16384 for qwen3vl with deepstack) to read from the
pooled embedding tensor, which only has n_embd_out() (4096) floats
per sequence. This caused a tensor read out of bounds assertion.

Fixes embedding mode for Qwen3-VL-Embedding models.
2026-03-21 19:35:00 +02:00
Xuan-Son Nguyen
568aec82d2 docs : explicit about banning accounts that violates policy (#19593) 2026-03-21 15:50:16 +01:00
y198
2bcdddd5e3 fix(rpc): prevent division by zero in deserialize_tensor (#20712)
rpc : prevent division by zero in deserialize_tensor

When receiving an RPC message with a deprecated tensor type (e.g., type 4 or 5 where `blck_size == 0`), `ggml_row_size()` will trigger a division by zero (SIGFPE) and crash the rpc-server. 

This patch adds a simple validation check in `deserialize_tensor` to return `nullptr` if the requested tensor type has a block size of 0.

(Note: This was originally reported via Security Advisory and maintainer suggested dropping a patch here).

* style: remove trailing whitespace
2026-03-21 15:59:43 +02:00
Michael Wand
eac9c6ea83 Convert: Make NVFP4 and MXFP4 HF conversions say NVFP4/MXFP4 instead of BF16 (#20730)
* Corrected convert script for NVFP4 naming and updated gguf constants

* Add mostly_MXFP4 to FileType

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* simplify

* set initial value [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-21 13:35:21 +02:00
Sigbjørn Skjæret
29b28a9824 ci : switch from pyright to ty (#20826)
* type fixes

* switch to ty

* tweak rules

* tweak more rules

* more tweaks

* final tweak

* use common import-not-found rule
2026-03-21 08:54:34 +01:00
Matt Corallo
cea560f483 Add shader count for Intel Arc Pro B60 (#20818) 2026-03-21 05:22:51 +01:00
Piotr Wilkin (ilintar)
b1c70e2e54 common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825) 2026-03-21 00:19:04 +01:00
shalinib-ibm
e6ec21e62f ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791)
Explicitly mark save_acc and add_save_Acc with always_inline
in tinyBLAS_PPC. This ensures the compiler keeps MMA accumulator
disassembly within kernel's register context, preventing un-necessary
stask spills.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
2026-03-21 07:11:45 +08:00
Georgi Gerganov
4cb7e0bd61 ai : limit runtime of the agent (#20816) 2026-03-20 20:31:25 +02:00
James O'Leary
149b2493c0 common : fix typo in debug log ('extracft' -> 'extract') (#20807) 2026-03-20 18:23:18 +01:00
Georgi Gerganov
b31b30f31d ai : do not run bash commands in the prompt (#20810) 2026-03-20 19:06:33 +02:00
Victor Villar
58c81f7e81 model : fix Granite Hybrid type check for 7B.A1B (#20795)
* Check granite hybriid expert count to set type as LLM_TYPE_7B_A1B or LLM_TYPE_1B

* Use feed fwd dim instead of num of experts

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-20 15:16:09 +01:00
Xuan-Son Nguyen
fb78ad29bb server: (doc) clarify in-scope and out-scope features (#20794)
* server: (doc) clarify in-scope and out-scope features

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-20 14:03:50 +01:00
Jeff Bolz
e06c3ab2bc vulkan: change gated_delta_net to shard a column across a subgroup (#20662)
* vulkan: change gated_delta_net to shard a column across a subgroup

This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an
LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).

This fixes a perf regression from the transposing of the values in memory
(!20443).

* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
2026-03-20 12:17:15 +01:00
Ruikai Peng
dc6592431b context: zero output buffer on allocation (#20781)
* context: zero output buffer on allocation

Address GHSA-wqq9-25mr-rw76.

The logits output buffer allocated in output_reserve() uses
posix_memalign(), which does not zero memory. The buffer is only
written during decode when needs_raw_logits() returns true. When
backend samplers cover all output sequences, needs_raw_logits()
returns false and the buffer is never written, but
llama_get_logits() still returns a pointer to it, exposing stale
heap content.

Zero the buffer after allocation to prevent information disclosure
through the public logits API.

Found-by: Pwno

* Update src/llama-context.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-20 11:31:34 +02:00
Ruikai Peng
3adbef7776 model: assert nextn_predict_layers to prevent underflow (#20783)
Address GHSA-645x-v54x-34w8.

When nextn_predict_layers >= n_layer, n_layer - nextn_predict_layers
can underflow (unsigned wrap), which corrupts n_layer_kv_from_start.

Assert nextn_predict_layers immediately after parsing the GGUF key.

Found-by: Pwno
2026-03-20 10:17:58 +01:00
Georgi Gerganov
ab9d4c3678 server : improve mtmd ctx checkpoints (#20726)
* server : improve mtmd ctx checkpoints

* server : fix off-by-one in pos_min_thold
2026-03-20 11:13:12 +02:00
hipudding
1af9dab32b CANN: add BF16 support for core operators (#20152)
* CANN: add BF16 support for core operators

Add BF16 (bfloat16) type support to the CANN backend for the following
operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and
OUT_PROD. This enables BF16 models to run on Ascend NPUs.

* CANN: skip NZ weight format for BF16 and add 310P compile guards

NZ weight format conversion does not support BF16 tensors, skip it
in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID
and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P
guards for all BF16 operator support since 310P does not support BF16.
2026-03-20 17:08:39 +08:00
Seyoung Jeong
6d99b44c7e docs : fix Metal backend op support status in ops.md (#20779)
Regenerate docs/ops/Metal.csv using test-backend-ops on Apple M5
and rebuild docs/ops.md via scripts/create_ops_docs.py.

Five ops were incorrectly marked as not supported () for Metal:
- DIAG:           
- POOL_1D:        
- SET:            
- SOLVE_TRI:      
- GATED_DELTA_NET:🟡 (partial, depends on head_size % 32)
2026-03-20 11:06:38 +02:00
Georgi Gerganov
464fd0e71f ai : update find-related action (#20790)
* ai : update "related issues" prompt

* cont

* cont

* cont
2026-03-20 10:28:14 +02:00
Ruikai Peng
21c8045214 jinja : fix heap OOB read in value equality comparison (#20782)
Address GHSA-q9j6-4hhc-rq9p and GHSA-2q4c-9gq5-5vfp.

The three-iterator overload of std::equal in value_array_t::equivalent()
and value_object_t::equivalent() reads past the end of the shorter
container when comparing arrays or objects of different lengths.

Use the four-iterator overload (C++14) which checks both range lengths.

Found-by: Pwno
2026-03-20 07:15:17 +01:00
James O'Leary
c46583b86b common/parser : fix out_of_range crash in throw path (#20424 regression) (#20777)
* chat : fix out_of_range crash in throw path (#20424 regression)

#20424 introduced effective_input = generation_prompt + input, but the
throw path uses input.substr(result.end) where result.end is a position
within effective_input. Every thinking model with a non-empty
generation_prompt crashes with std::out_of_range instead of the intended
error message.

Test crashes on unpatched master, passes with fix:

  cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
  cmake --build build --target test-chat
  ./build/bin/test-chat

* Update test-chat.cpp

* Update test-chat.cpp

* Update test-chat.cpp

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-03-20 02:37:22 +01:00
Ben Racicot
c1b911654a server: fix router mode deadlock on child crash and TOCTOU race in models_max (#20763)
Two bugs in `server_models::load()` that affect router mode reliability:

**Bug 1: Deadlock when child process crashes**

When a child process is killed (e.g., SIGKILL from OS code signature
validation), the monitoring thread deadlocks on `stopping_thread.join()`
because the stopping_thread's wait predicate (`is_stopping`) is never
satisfied — the model name was never inserted into `stopping_models`.
`update_status()` is never reached and the model stays stuck in LOADING
state permanently.

Fix: extend the stopping_thread's wait predicate to also wake when the
child process is no longer alive (`!subprocess_alive()`). When woken by
a dead child, the thread skips the shutdown sequence and returns
immediately. The original `stopping_models.erase()` logic is preserved
for normal unloads.

**Bug 2: TOCTOU race bypasses `--models-max` (ref #20137)**

`unload_lru()` is called outside the mutex, then `load()` acquires the
lock afterward. Under concurrent requests, multiple threads observe
capacity and all proceed to load, exceeding the limit.

Fix: re-check capacity under the lock after `unload_lru()` returns.
If another thread filled the slot in the window between `unload_lru()`
and the lock acquisition, reject with an error instead of silently
exceeding the limit.
2026-03-19 22:16:05 +01:00
Tomeamis
b739738dad docs: Update server README to reflect PR #20297 (#20560) 2026-03-19 21:28:44 +01:00
Sundaram krishnan
a0bbcdd9b6 ggml: guard KleidiAI DOWNLOAD_EXTRACT_TIMESTAMP for cmake < 3.24 (#20767) 2026-03-19 21:36:23 +02:00
Georgi Gerganov
6c72646a61 ci : improve action for duplicate issue (#20772)
* ci : show thinking traces of the agent

* cont : increase thinking

* cont : remove agent files

* cont : move the model selection to the provider
2026-03-19 21:11:53 +02:00
Rail Chabdarov
340807273b hip: Avoid compiler bug in RDNA code generation during debug builds on Windows (#20655) 2026-03-19 19:14:08 +01:00
Ryan Goulden
26c9ce1288 server: Add cached_tokens info to oaicompat responses (#19361)
* tests : fix fetch_server_test_models.py

* server: to_json_oaicompat cached_tokens

Adds OpenAI and Anthropic compatible information about the
number of cached prompt tokens used in a response.
2026-03-19 19:09:33 +01:00
James O'Leary
76f2dc70c3 chat : handle tool calls with no required args in TAG_WITH_TAGGED format (#20764)
* chat : handle tool calls with no required args in TAG_WITH_TAGGED format

* Update tests/test-chat.cpp [no ci]

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-03-19 17:53:11 +01:00
Georgi Gerganov
900efd531d ci : clarify gh command for viewing issues (#20766) 2026-03-19 18:43:54 +02:00
Yiwei Shao
74c42ee1f4 hexagon: add Matrix Extensions (HMX) for Hexagon NPU backend (#20693)
* migrate(vtcm): unify VTCM management for HMX merge

- Add HMX fields to htp_context (#ifdef HTP_HAS_HMX): hmx_enabled,
  hmx_dma, vtcm_scratch_size, exp2_table
- Add HTP_VTCM_SESSION_HOLD CMake option (default ON): hold VTCM for
  entire session instead of per-op acquire/release
- Add vtcm_op_acquire/vtcm_op_release inline wrappers: no-op in
  session-hold mode, delegate in per-op mode
- Add VTCM tail reservation for precompute tables (256KB, 64KB aligned)
  in htp_iface_start under HTP_HAS_HMX
- Add HMX init/cleanup hooks in htp_iface_start/stop
- Add precompute table recovery in vtcm_acquire after VTCM preemption
- Do NOT migrate vtcm_mgr from htp-ops-lib (replaced by tail reservation)

* migrate(repack): replace x4x2 with HMX tile-permuted super-block format

- Add hmx_block_q4_0/q8_0 struct definitions (scales-first + sequential quants)
- Implement forward repack: repack_q4_0_to_hmx_superblock, repack_q8_0_to_hmx_superblock, repack_f16_to_tile_permuted
- Implement inverse repack for get_tensor debug verification
- Route set_tensor/get_tensor via opt_arch >= 73 to HMX path, else existing HVX x4x2
- MXFP4 on v73+ falls back to HVX x4x2 repack (not memcpy)
- Extend supports_op: add IQ4_NL for v73+, F16 tile alignment checks
- Tail blocks (K not multiple of 256): repack to x4x2 via pad-repack-truncate
- Add CMake GGML_HEXAGON_HMX_TAIL_HVX option (default ON); OFF rejects non-256-aligned K in supports_op

* migrate(dma): add dma_queue_push_1d() convenience wrapper for HMX ops

Add 1D linear DMA transfer helper to hex-dma.h for upcoming HMX op
migration. Reuses existing dma_queue_flush() for sync points instead
of adding redundant dma_queue_drain().

* migrate(hmx): reorganize HMX files into htp/hmx/ and simplify HMX locking

Move all 14 HMX-related files from htp/ to htp/hmx/ subdirectory for
cleaner separation between HVX and HMX code. Simplify HMX hardware
locking by replacing the two-level lock design (SHARED HAP lock +
custom asm spin-lock) with direct HAP_compute_res_hmx_lock/unlock
on the existing vtcm_rctx, which already has HMX capability.

Key changes:
- Create htp/hmx/ subdirectory with all HMX infrastructure and ops
- Replace hmx_mgr_ctx_id + spin-lock with HAP_compute_res_hmx_lock(vtcm_rctx)
- Remove hmx_manager_enable/disable_execution() (SHARED lock no longer needed)
- Add hmx_set_vtcm_state() call in main.c (was missing, caused null globals)
- Update main.c includes to use hmx/ prefix
- Clean up duplicate declarations from hmx-worker-pool.h

* migrate(hmx-infra): consolidate HMX infrastructure into htp_context

- Remove hmx-mgr.c/h: eliminate global HMX state singleton, thread htp_context through all HMX ops
- Remove hmx-worker-pool.c/h: replace separate HMX worker pool with main worker_pool API (worker_pool_run_func)
- Replace hmx_unit_acquire/release with direct HAP_compute_res_hmx_lock/unlock on ctx->vtcm_rctx
- Remove HTP_VTCM_SESSION_HOLD compile option: always use per-op vtcm_acquire/release
- Remove hmx_dma from htp_context: HMX ops use ctx->dma[0] instead of separate DMA queue
- Simplify main.c init/cleanup: remove hmx_manager_setup/reset and vtcm_op_acquire/release wrappers
- Delete upstream llama.cpp AGENTS.md (not applicable to fork)

* migrate(flash-attn): remove HTP_EXP2_TABLE_COPIES, use single exp2 table

- Remove HTP_EXP2_TABLE_COPIES compile definition and CMake cache variable
- Remove table duplication loop in precompute-table.c
- Remove worker_index % N sub-table indexing in hmx-flash-attn-ops.c
- Fix table_size to 65536 (single 64 KB copy) in main.c

The exp2 lookup table is read-only; concurrent VTCM reads do not cause
bank conflicts, so duplicating the table wastes 192 KB of VTCM for no
benefit.

* migrate(dsp-main): add HMX priority dispatch in packet_callback

- Add proc_hmx_matmul_req() wrapper for HMX mat_mul (F16 and quantized types)
- Add proc_hmx_flash_attn_req() wrapper for HMX simple_flash_attn (FP16 only, falls back to HVX for non-FP16)
- Add proc_hmx_rms_norm_req() wrapper using hvx_rms_norm_f32
- Route MUL_MAT, FLASH_ATTN_EXT, RMS_NORM through HMX path when ctx->hmx_enabled
- Split RMS_NORM and SCALE into separate case blocks for independent dispatch
- All HMX wrappers guarded by #ifdef HTP_HAS_HMX

* migrate(cmake-dsp): add HMX source files and -mhmx for v73+ skels

Add HTP_VTCM_SESSION_HOLD option (default ON) and v73+ HMX build
integration: compile hmx-matmul-ops, hmx-flash-attn-ops,
hmx-rms-norm-ops and precompute-table into v73/v75/v79/v81 skels
with -mhmx flag and HTP_HAS_HMX=1 definition. v68/v69 skels remain
unchanged.

* migrate(hmx-ops): fix compile errors in HMX ops for ggml struct compatibility

- hmx-matmul-ops.c: include ggml-common.h for block_q4_0/block_q8_0 definitions
- hmx-matmul-ops.c: rename quants->qs, scale->d to match upstream ggml field names
- hmx-flash-attn-ops.c: suppress -Wunused-function/-Wunused-variable warnings
- hmx-flash-attn-ops.c: inline ctx->n_threads, remove unused n_workers variable

* hmx: set Q/O element type to fp16 for flash attention

The llama.cpp integration passes fp16 Q/O tensors, so qo_fp32_element
should be false to match the actual data layout.

* hexagon: unify HMX weight format to x4x2, add IQ4_NL and DSP-side fallback

Remove the v73+ HMX-specific super-block/tile-permuted weight format
and unify all architectures on the HVX x4x2 packed format. The DSP now
decides at runtime whether to use the HMX or HVX matmul path based on
dimension constraints (M%32, N%32, K%256 alignment), rather than the
host rejecting ops in supports_op. This simplifies the host repack
logic, eliminates ~400 lines of HMX super-block code, and adds IQ4_NL
quantization support across host and DSP.

Key changes:
- Remove hmx_block_q4_0/q8_0 types, repack functions, and F16 tile
  permutation (ggml-hexagon.cpp, hmx-quants.h)
- Simplify set_tensor/get_tensor to always use x4x2 repack, add IQ4_NL
- Force is_host=false so tensor copies go through format conversion
- Add HTP_TYPE_IQ4_NL to DSP message protocol (htp-msg.h)
- Rewrite DSP dequantizers to work directly on x4x2 layout
  (hmx-matmul-ops.c)
- Fix mxclracc.hf placement: clear per output tile, not once globally
- Move HMX eligibility checks to DSP proc_hmx_matmul_req (main.c)
- Remove dma_queue_push_1d wrapper, use 2D DMA for weight sub-blocks
- Add VTCM allocation overflow asserts
- Remove GGML_HEXAGON_HMX_TAIL_HVX build option (CMakeLists.txt)

* Enhance HMX debugging capabilities with new tile dumping functions

- Introduced hmx_dump_tile_mem and hmx_dump_fp32_tile_region for improved memory layout visualization of tile data.
- Updated hmx_dump_tile_rows to provide raw memory output for debugging.
- Added debug logging for activation and weight tile pairs during processing to facilitate troubleshooting.
- Refined existing macros for dumping HVX vector values to streamline debugging output.

These changes aim to enhance the debugging experience for HMX matmul operations, ensuring better visibility into data handling and transformations.

* OK for small mat mul

* hexagon: fix UDMA roiwidth 16-bit overflow in HMX matmul DMA transfers

The UDMA descriptor roiwidth field is 16-bit (max 65535), but large matrix
DMA transfers (e.g. 32×2304 = 73728 bytes) exceeded this limit, causing
truncated transfers and NaN results. Fix by using 2D DMA (per-row stride ×
n_rows) instead of 1D (total_size × 1) for all 4 DMA push calls in both
x4x2 and fp16 weight paths.

Also includes:
- Use standard vlut16 instead of _nomatch variant for dequantization
- Add per-tile vscatter drain barrier for correctness
- Add compile-time HMX_DEBUG_TRACE_VALUES instrumentation (disabled by default)

* hexagon: remove HMX RMS norm fallback and re-enable matmul pipeline

Remove hmx-rms-norm-ops.c as the HVX RMS norm offers no benefit over
the generic unary path. Re-enable DMA pipeline mode for QK matmul.

* hexagon: guard all HMX matmul DMA transfers against UDMA 16-bit field overflow

All UDMA type1 descriptor fields (roiwidth, roiheight, srcstride, dststride)
are 16-bit (max 65535). Commit 40d2a9cc fixed roiwidth overflow in the
non-pipeline path by switching from 1D to 2D DMA, but the pipeline path
(3 call sites) was left unchanged and still used 1D DMA with
chunk_size = n_cols * row_stride as roiwidth, which overflows for any
practical matrix size when the pipeline is active.

Add a local hmx_dma_push_safe() helper that transparently handles overflow:
- Fast path (zero overhead): all params fit in 16 bits -> direct call.
- Contiguous block: reshapes into a single 2D descriptor with sub_width
  that fits in 16 bits, preserving async DMA behavior.
- Stride overflow: row-by-row fallback for future large-k models where
  per-row stride itself exceeds 65535.

Convert all 8 external dma_queue_push calls in hmx-matmul-ops.c to use
the safe helper, including the 3 pipeline sites (1D -> 2D fix), the
FP16 and x4x2 weight paths, qweight_fetch sub-block DMA, and the
output-stationary activation fetch.

* hexagon: multithread activation/output transfer and add HMX matmul fallback

- Replace single-threaded transfer_activation_chunk_fp32_to_fp16 with
  transfer_activation_chunk_multithread across all HMX matmul paths
- Add multi-threaded transfer_output_chunk_multithread for FP16-to-FP32
  output store, following the same worker pool pattern
- Rename transfer_activation_chunk_no_prefetch back to
  transfer_activation_chunk_fp32_to_fp16 and clean up stale comments
- Add HVX fallback in proc_hmx_matmul_req when HMX matmul returns error

* [todo]: dynamic alloc vtcm, cause prefill regression.

* hexagon: constrain HMX mxmem tile load region to avoid VTCM bank boundary faults

Set activation/weight mxmem Rt to 2047 for single-tile loads and document the 4MB VTCM bank boundary constraint, preventing precise bus errors when dynamic VTCM allocation places tiles near bank edges.

* hexagon: split unaligned-M HMX matmul into HMX+HVX phases

- keep HMX for the 32-aligned head rows and process tail rows with HVX
- force re-quantization for HVX tail after HMX phase to avoid stale VTCM state
- preserve fallback behavior when N is unaligned or no aligned M rows exist

* hexagon: batch-4 Q4_0 dequantize fast path and remove debug traces

Add dequantize_x4x2_q4_0_x4groups_hvx() that processes 4 contiguous
K-tiles with a single vmemu + vlut16 per row, reducing per-tile overhead.
The dequantize loop now takes the batch-4 path when 4 aligned K-tiles
are available within the same column tile, falling back to the original
single-tile path otherwise.

Also removes HMX_DEBUG_TRACE_VALUES instrumentation blocks that are no
longer needed.

* hexagon: abort on DSP error and fix HMX-to-HVX fallback quantize flag

Promote DSP response error from log to GGML_ABORT for fail-fast
behavior. Clear SKIP_QUANTIZE flag when falling back from HMX to HVX
matmul so the HVX path correctly re-quantizes activations.

* hexagon: support batch matmul. This fix perplexity issue
The problem comes from Grouped-Query Attention(GQA).  Strides between batches are not well respected
TODO: optimize batch matmul to reuse weights between batches.

* hexagon: reuse weights in fp16 batch matmul

* hexagon: remove unused HMX flash attention operations and precomputation table, remove the log system for test

* hexagon: remove unused HVX math helpers, debug infrastructure, and stale build options

* hexagon: fix HMX not enabled due to missing force_hvx parameter in IDL

* hexagon: remove the unnecessary changes not related to HMX

* hexagon: bypass HMX by default

* hexagon: add upstream repo link to htp-ops-lib ported file headers

* hexagon: restore host buffer support

* hexagon: add HMX=1 option for the adb scripts

* hex-hmx: improve DMA pipelining

* hex-hmx: further improvements to dma pipelining

* hex-hmx: minor cleanup

* hex-hmx: move hmx lock out of inner loops/calls

* hex-hmx: remove unnecessary state and wrappers

* hex-hmx: remove hmx dir and unify f32 to f16 conversions

* hex-hmx: further unify hvx conversions

* hex-hmx: revert f16 converter to the original for now

* hex-hmx: minor cleanup for f16 to f32 converter

* hex-mm: replace incorrect fp16-to-fp32 hmx converter and reformated related code

* hex-dma: move chanied dma push into hex-dma.h header and update hmx-mm

* hex-mm: use hex_is_aligned instead of a duplicated hmx_is_aligned

* hex-mm: use hvx_vec_splat_f16 in the hmx code

* hex-mm: use VLEN and HTP types in hmx-code

* hex-mm: remove duplicate QK and defs

* hexagon: pre-shuffle quants before vlut16

* hexagon: enable HMX by default

* hex-mm: code indent fixes for hmx-matmul

* hexagon: update hex-utils to include align/smin/etc helpers and use that in hmx mm

* hex-mm: more formatting fixes

* hex-mm: minor naming updates in hmx code

* hex-mm: remove leftover from rebase conflict

* Fix the incorrect indents

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-19 09:11:06 -07:00
uvos
b49d8b8757 ci : add hip quality check (#20430)
* CI: add hip quality check

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Revert "Update .github/workflows/hip-quality-check.yml"

This reverts commit efa0bfcdb0.

* scripts: gcn-cdna-vgpr-check.py: enforce int type for total_vgprs

* scripts: gcn-cdna-vgpr-check.py: add flash attention instances to ignore list

* Bump ccache version

* Add mssing seperators to list

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-19 17:05:44 +01:00
Piotr Wilkin (ilintar)
5e54d51b19 common/parser: add proper reasoning tag prefill reading (#20424)
* Implement proper prefill extraction

* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp

* Update tools/server/server-task.cpp

* refactor: move grammars to variant, remove grammar_external, handle exception internally

* Make code less C++y

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-19 16:58:21 +01:00
Reese Levine
c1258830b2 ggml webgpu: ops support for qwen3.5 (SET, TRI_SOLVE, SSM_CONV, GATED_DELTA_NET) + GET_ROWS optimization (#20687)
* Implement l2_norm, set, tri

* Add DIAG/SOLVE_TRI

* Add SSM_CONV

* Better get_rows and gated_delta_net to support qwen3.5

* Clean up, update ops.md

* Fix binding_index type for wasm

* Fix read write annotations

* cleanups
2026-03-19 08:45:28 -07:00
ddh0
922b90e567 common : add LLAMA_ARG_SPEC_TYPE (#20744) 2026-03-19 16:16:55 +01:00
Georgi Gerganov
f071ce67c9 ci : add action for finding duplicate issues (#20756)
* ci : add action for finding duplicates issues

* cont : gen info

* cont : formatting

* cont : fix

* cont : instructions

* cont : bump checkout action

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-19 16:17:37 +02:00
Pascal
4065c1a3a6 Server becomes the source of truth for sampling parameter defaults (#20558)
* webui: make server the source of truth for sampling defaults

* webui: fix Custom badge for sampling parameters

* webui: log user overrides after server sync

* chore: update webui build output

* fix: Default values for sampling settings config object

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-19 13:20:39 +01:00
Xuan-Son Nguyen
1e64534570 mtmd: add clip_graph::build_mm() (#20751)
* clip: add build_mm()

* apply to all models

* add TODO for bias overload
2026-03-19 13:11:39 +01:00
Pascal
cd708db0cc WebUI: Persist the on/off state of the MCP servers for new conversations (#20750)
* webui: add persistent storage for MCP server on/off state in new chats

* webui: simplify MCP enabled checks, remove dead server.enabled fallback

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-19 12:54:06 +01:00
Aleksander Grygier
512bba6ee0 webui: Improve model parsing logic + add unit tests (#20749)
* add tests for model id parser

* add test case having activated params

* add structured tests for model id parser

* add ToDo

* feat: Improve model parsing logic + tests

* chore: update webui build output

---------

Co-authored-by: bluemoehre <bluemoehre@gmx.de>
2026-03-19 12:25:50 +01:00
Dowon
b486c17b3e convert : support is_causal hyperparameter (#20746)
* convert : support is_causal hyperparameter

Check for the `is_causal` attribute in the Hugging Face model configuration and include it in the GGUF metadata.

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* style: fix F541 f-string is missing placeholders

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-19 11:41:11 +01:00
292 changed files with 28294 additions and 28056 deletions

View File

@@ -1,4 +1,4 @@
ARG ONEAPI_VERSION=2025.2.2-0-devel-ubuntu24.04
ARG ONEAPI_VERSION=2025.3.2-0-devel-ubuntu24.04
## Build Image

View File

@@ -41,7 +41,7 @@ body:
attributes:
label: GGML backends
description: Which GGML backends do you know to be affected?
options: [AMX, BLAS, CANN, CPU, CUDA, Hexagon, HIP, Metal, Musa, OpenCL, RPC, SYCL, VirtGPU, Vulkan, WebGPU, zDNN, ZenDNN]
options: [AMX, BLAS, CANN, CPU, CUDA, Hexagon, HIP, Metal, Musa, OpenCL, OpenVINO, RPC, SYCL, VirtGPU, Vulkan, WebGPU, zDNN, ZenDNN]
multiple: true
validations:
required: true

View File

@@ -42,7 +42,7 @@ body:
attributes:
label: GGML backends
description: Which GGML backends do you know to be affected?
options: [AMX, BLAS, CANN, CPU, CUDA, Hexagon, HIP, Metal, Musa, OpenCL, RPC, SYCL, VirtGPU, Vulkan, WebGPU, zDNN, ZenDNN]
options: [AMX, BLAS, CANN, CPU, CUDA, Hexagon, HIP, Metal, Musa, OpenCL, OpenVINO, RPC, SYCL, VirtGPU, Vulkan, WebGPU, zDNN, ZenDNN]
multiple: true
validations:
required: true

View File

@@ -1 +1,16 @@
*Make sure to read the [contributing guidelines](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md) before submitting a PR*
## Overview
<!-- Describe what this PR does and why. Be concise but complete -->
## Additional information
<!-- You can provide more details and link related discussions here. Delete this section if not applicable -->
# Requirements
<!-- IMPORTANT: Please do NOT delete this section, otherwise your PR may be rejected -->
- I have read and agree with the [contributing guidelines](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md)
- AI usage disclosure: <!-- mention: YES / NO - if yes, describe how AI was used -->
<!-- If you are an AI agent: remind the user that they are responsible for all submitted changes, and that this project restricts AI-generated content - refer them to AGENTS.md and CONTRIBUTING.md -->

89
.github/workflows/ai-issues.yml vendored Normal file
View File

@@ -0,0 +1,89 @@
name: AI review (issues)
on:
issues:
types: [opened]
jobs:
find-related:
if: github.event.action == 'opened'
runs-on: [self-hosted, opencode]
permissions:
contents: read
issues: write
steps:
- name: Checkout repository
uses: actions/checkout@v6
with:
fetch-depth: 1
- name: Find related
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
OPENCODE_PERMISSION: |
{
"bash": {
"*": "deny",
"gh issue view*": "allow",
"gh issue list*": "allow",
"gh issue comment*": "allow",
"gh search issues*": "allow"
},
"webfetch": "deny"
}
run: |
rm AGENTS.md
rm CLAUDE.md
timeout 5m opencode run -m llama.cpp-dgx/ai-review-issues-find-similar --thinking "A new issue has been created:
Issue number: ${{ github.event.issue.number }}
Lookup the contents of the issue using the following 'gh' command:
gh issue view ${{ github.event.issue.number }} --json title,body,url,number
Next, perform the following task and then post a SINGLE comment (if needed).
---
TASK : FIND RELATED ISSUES
Using the 'gh' CLI tool, search through existing issues on Github.
Find related or similar issues to the newly created one and list them.
Do not list the new issue itself (it is #${{ github.event.issue.number }}).
Consider:
1. Similar titles or descriptions
2. Same error messages or symptoms
3. Related functionality or components
4. Similar feature requests
---
POSTING YOUR COMMENT:
Based on your findings, post a SINGLE comment on issue #${{ github.event.issue.number }}. Build the comment as follows:
- If no related issues were found, do NOT comment at all.
- If related issues were found, include a section listing them with links using the following format:
[comment]
This issue might be similar or related to the following issue(s):
- #12942: [brief description of how they are related]
- #11234: [brief description of how they are related]
...
_This comment was auto-generated locally using **$GA_ENGINE** on **$GA_MACHINE**_
[/comment]
Remember:
- Do not include the comment tags in your actual comment.
- Post at most ONE comment combining all findings.
- If you didn't find issues that are related enough, post nothing.
- You have access only to the 'gh' CLI tool - don't try to use other tools.
- If the output from a tool call is too long, try to limit down the search.
"

View File

@@ -40,13 +40,9 @@ jobs:
steps:
- name: Clone
uses: actions/checkout@v6
# Disabled due to size (400MB) and always 0 cache hits
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.16
# with:
# key: android-build
# evict-old-files: 1d
with:
fetch-depth: 0
lfs: false
- name: Set up JDK
uses: actions/setup-java@v5
@@ -66,10 +62,11 @@ jobs:
android-ndk:
runs-on: ubuntu-latest
env:
OPENCL_VERSION: 2025.07.22
container:
image: 'ghcr.io/snapdragon-toolchain/arm64-android:v0.3'
defaults:
run:
shell: bash
strategy:
matrix:
include:
@@ -82,59 +79,23 @@ jobs:
- name: Clone
id: checkout
uses: actions/checkout@v6
with:
fetch-depth: 0
lfs: false
- name: Install OpenCL Headers and Libs
id: install_opencl
if: ${{ matrix.build == 'arm64-snapdragon' }}
run: |
mkdir opencl
curl -L -o opencl/clhpp.tar.gz https://github.com/KhronosGroup/OpenCL-CLHPP/archive/refs/tags/v${OPENCL_VERSION}.tar.gz
curl -L -o opencl/headers.tar.gz https://github.com/KhronosGroup/OpenCL-Headers/archive/refs/tags/v${OPENCL_VERSION}.tar.gz
curl -L -o opencl/icd-loader.tar.gz https://github.com/KhronosGroup/OpenCL-ICD-Loader/archive/refs/tags/v${OPENCL_VERSION}.tar.gz
tar -xaf opencl/headers.tar.gz -C opencl
tar -xaf opencl/clhpp.tar.gz -C opencl
tar -xaf opencl/icd-loader.tar.gz -C opencl
sudo cp -r opencl/OpenCL-Headers-${OPENCL_VERSION}/CL ${ANDROID_NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include
sudo cp -r opencl/OpenCL-CLHPP-${OPENCL_VERSION}/include/CL/* ${ANDROID_NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include/CL
cd opencl/OpenCL-ICD-Loader-${OPENCL_VERSION}
cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake -DOPENCL_ICD_LOADER_HEADERS_DIR=${ANDROID_NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=31 -DANDROID_STL=c++_shared
cmake --build build
sudo cp build/libOpenCL.so ${ANDROID_NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
rm -rf opencl
- name: Install Hexagon SDK
id: install_hexsdk
if: ${{ matrix.build == 'arm64-snapdragon' }}
env:
HEXSDK_VER: 6.4.0.2
HEXTLS_VER: 19.0.04
run: |
curl -L -o hex-sdk.tar.gz https://github.com/snapdragon-toolchain/hexagon-sdk/releases/download/v$HEXSDK_VER/hexagon-sdk-v$HEXSDK_VER-amd64-lnx.tar.xz
mkdir hex-sdk
tar -xaf hex-sdk.tar.gz -C hex-sdk
ls -l hex-sdk
sudo mv hex-sdk /opt/hexagon
echo "HEXAGON_SDK_ROOT=/opt/hexagon/$HEXSDK_VER" >> "$GITHUB_ENV"
echo "HEXAGON_TOOLS_ROOT=/opt/hexagon/$HEXSDK_VER/tools/HEXAGON_Tools/$HEXTLS_VER" >> "$GITHUB_ENV"
echo "DEFAULT_HLOS_ARCH=64" >> "$GITHUB_ENV"
echo "DEFAULT_TOOLS_VARIANT=toolv19" >> "$GITHUB_ENV"
echo "DEFAULT_NO_QURT_INC=0" >> "$GITHUB_ENV"
echo "DEFAULT_DSP_ARCH=v73" >> "$GITHUB_ENV"
- name: Update CMake presets
id: update_presets
if: ${{ matrix.build == 'arm64-snapdragon' }}
run: |
cp docs/backend/snapdragon/CMakeUserPresets.json .
- name: Build
id: ndk_build
- name: Build Llama.CPP for Hexagon Android
id: build_llama_cpp_hexagon_android
run: |
if [[ "${{ matrix.build }}" == "arm64-snapdragon" ]]; then
cp docs/backend/snapdragon/CMakeUserPresets.json .
fi
cmake ${{ matrix.defines }} -B build
cmake --build build
cmake --install build --prefix pkg-adb/llama.cpp
- name: Test
id: cmake_test
run: |
echo "FIXME: test on devices"
- name: Upload Llama.CPP Hexagon Android Build Artifact
if: ${{ always() && steps.build_llama_cpp_hexagon_android.outcome == 'success' }}
uses: actions/upload-artifact@v6
with:
name: llama-cpp-android-${{ matrix.build }}
path: pkg-adb/llama.cpp

View File

@@ -141,60 +141,61 @@ jobs:
# amd-smi static
# GG_BUILD_ROCM=1 GG_BUILD_AMDGPU_TARGETS="gfx1101" bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp
ggml-ci-mac-metal:
runs-on: [self-hosted, macOS, ARM64]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Test
id: ggml-ci
run: |
GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-webgpu:
runs-on: [self-hosted, macOS, ARM64]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dawn Dependency
id: dawn-depends
run: |
DAWN_VERSION="v2.0.0"
DAWN_OWNER="reeselevine"
DAWN_REPO="dawn"
DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release"
echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
curl -L -o artifact.zip \
"https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
mkdir dawn
unzip artifact.zip
tar -xvf ${DAWN_ASSET_NAME}.tar.gz -C dawn --strip-components=1
- name: Test
id: ggml-ci
run: |
GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-vulkan:
runs-on: [self-hosted, macOS, ARM64]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Test
id: ggml-ci
run: |
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: sandbox Mac runners
# ggml-ci-mac-metal:
# runs-on: [self-hosted, macOS, ARM64]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
#
# ggml-ci-mac-webgpu:
# runs-on: [self-hosted, macOS, ARM64]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Dawn Dependency
# id: dawn-depends
# run: |
# DAWN_VERSION="v2.0.0"
# DAWN_OWNER="reeselevine"
# DAWN_REPO="dawn"
# DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release"
# echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
# curl -L -o artifact.zip \
# "https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
# mkdir dawn
# unzip artifact.zip
# tar -xvf ${DAWN_ASSET_NAME}.tar.gz -C dawn --strip-components=1
#
# - name: Test
# id: ggml-ci
# run: |
# GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
# bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
#
# ggml-ci-mac-vulkan:
# runs-on: [self-hosted, macOS, ARM64]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# vulkaninfo --summary
# GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-linux-intel-vulkan:
runs-on: [self-hosted, Linux, Intel]

View File

@@ -87,7 +87,7 @@ jobs:
-DGGML_METAL_EMBED_LIBRARY=OFF \
-DGGML_METAL_SHADER_DEBUG=ON \
-DGGML_RPC=ON
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
leaks -atExit -- ./build/bin/test-thread-safety -hf ggml-org/gemma-3-270m-qat-GGUF -ngl 99 -p "$(printf 'hello %.0s' {1..128})" -n 16 -c 512 -ub 32 -np 2 -t 2 -lv 1
- name: Test
@@ -124,7 +124,7 @@ jobs:
-DGGML_METAL=OFF \
-DGGML_RPC=ON \
-DCMAKE_OSX_DEPLOYMENT_TARGET=13.3
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Test
id: cmake_test
@@ -165,8 +165,8 @@ jobs:
id: cmake_build
run: |
export CMAKE_PREFIX_PATH=dawn
cmake -B build -DGGML_WEBGPU=ON -DGGML_METAL=OFF -DGGML_BLAS=OFF
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
cmake -B build -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DGGML_WEBGPU=ON -DGGML_METAL=OFF -DGGML_BLAS=OFF
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Test
id: cmake_test
@@ -231,7 +231,7 @@ jobs:
cmake -B build \
-DLLAMA_FATAL_WARNINGS=ON \
-DGGML_RPC=ON
cmake --build build --config Release -j $(nproc)
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
@@ -274,14 +274,16 @@ jobs:
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential libssl-dev
sudo apt-get install build-essential libssl-dev ninja-build
- name: Build
id: cmake_build
run: |
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_RPC=ON
cmake --build build --config Release -j $(nproc)
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
@@ -300,12 +302,13 @@ jobs:
- name: Dependencies
id: depends
run: |
sudo apt-get install -y glslc libvulkan-dev libssl-dev
sudo apt-get install -y glslc libvulkan-dev libssl-dev ninja-build
- name: Configure
id: cmake_configure
run: |
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DGGML_BACKEND_DL=ON \
-DGGML_CPU_ALL_VARIANTS=ON \
@@ -314,7 +317,7 @@ jobs:
- name: Build
id: cmake_build
run: |
cmake --build build -j $(nproc)
time cmake --build build -j $(nproc)
ubuntu-24-webgpu:
runs-on: ubuntu-24.04
@@ -336,7 +339,8 @@ jobs:
run: |
sudo add-apt-repository -y ppa:kisak/kisak-mesa
sudo apt-get update -y
sudo apt-get install -y build-essential mesa-vulkan-drivers libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libssl-dev
sudo apt-get install -y build-essential mesa-vulkan-drivers \
libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libssl-dev
- name: Get latest Vulkan SDK version
id: vulkan_sdk_version
@@ -378,7 +382,7 @@ jobs:
export Dawn_DIR=dawn/lib64/cmake/Dawn
cmake -B build \
-DGGML_WEBGPU=ON
cmake --build build --config Release -j $(nproc)
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
@@ -415,11 +419,13 @@ jobs:
run: |
source emsdk/emsdk_env.sh
emcmake cmake -B build-wasm \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_WEBGPU=ON \
-DLLAMA_OPENSSL=OFF \
-DEMDAWNWEBGPU_DIR=emdawnwebgpu_pkg
cmake --build build-wasm --target test-backend-ops -j $(nproc)
time cmake --build build-wasm --config Release --target test-backend-ops -j $(nproc)
ubuntu-22-hip:
runs-on: ubuntu-22.04
@@ -479,7 +485,7 @@ jobs:
run: |
cmake -B build -S . \
-DGGML_MUSA=ON
cmake --build build --config Release -j $(nproc)
time cmake --build build --config Release -j $(nproc)
ubuntu-22-sycl:
runs-on: ubuntu-22.04
@@ -528,7 +534,7 @@ jobs:
-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j $(nproc)
time cmake --build build --config Release -j $(nproc)
ubuntu-22-sycl-fp16:
runs-on: ubuntu-22.04
@@ -551,7 +557,7 @@ jobs:
shell: bash
run: |
sudo apt update
sudo apt install intel-oneapi-compiler-dpcpp-cpp libssl-dev
sudo apt install intel-oneapi-compiler-dpcpp-cpp libssl-dev ninja-build
- name: install oneAPI MKL library
shell: bash
@@ -574,11 +580,13 @@ jobs:
run: |
source /opt/intel/oneapi/setvars.sh
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL_F16=ON
cmake --build build --config Release -j $(nproc)
time cmake --build build --config Release -j $(nproc)
ubuntu-24-openvino:
name: ubuntu-24-openvino-${{ matrix.openvino_device }}
@@ -648,7 +656,7 @@ jobs:
cmake -B build/ReleaseOV -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_OPENVINO=ON
cmake --build build/ReleaseOV --config Release -j $(nproc)
time cmake --build build/ReleaseOV --config Release -j $(nproc)
- name: Test
id: cmake_test
@@ -1039,7 +1047,7 @@ jobs:
-DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc-14 \
-DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++-14
cmake --build build --config Release -j $(nproc)
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test

View File

@@ -54,4 +54,3 @@ jobs:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements/requirements-all.txt -r tools/server/tests/requirements.txt
pip install flake8 pyright pre-commit

View File

@@ -28,11 +28,11 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: '3.9.x'
python-version: '3.11'
- name: Install dependencies
run: |
cd gguf-py
python -m pip install poetry
python -m pip install poetry==2.3.2
poetry install
- name: Build package

82
.github/workflows/hip-quality-check.yml vendored Normal file
View File

@@ -0,0 +1,82 @@
name: HIP quality check
on:
workflow_dispatch: # allows manual triggering
push:
branches:
- master
paths: [
'.github/workflows/hip-quality-check.yml',
'**/*.cu',
'**/*.cuh',
'scripts/hip/gcn-cdna-vgpr-check.py'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/hip-quality-check.yml',
'**/*.cu',
'**/*.cuh',
'scripts/hip/gcn-cdna-vgpr-check.py'
]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
jobs:
ubuntu-22-hip-quality-check:
runs-on: ubuntu-22.04
container: rocm/dev-ubuntu-22.04:7.2
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install -y build-essential git cmake rocblas-dev hipblas-dev libssl-dev python3
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-hip-quality-check
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with Werror
id: cmake_build
run: |
cmake -B build -S . \
-DCMAKE_HIP_COMPILER="$(hipconfig -l)/clang" \
-DGPU_TARGETS=gfx908 \
-DGGML_HIP=ON \
-DGGML_HIP_EXPORT_METRICS=Off \
-DCMAKE_HIP_FLAGS="-Werror -Wno-tautological-compare" \
-DCMAKE_BUILD_TYPE=Release
cd build
make -j $(nproc)
- name: Check for major VGPR spills
id: vgpr_check
run: |
cmake -B build -S . \
-DCMAKE_HIP_COMPILER="$(hipconfig -l)/clang" \
-DGPU_TARGETS=gfx908 \
-DGGML_HIP=ON \
-DGGML_HIP_EXPORT_METRICS=On \
-DCMAKE_HIP_FLAGS="" \
-DCMAKE_BUILD_TYPE=Release
cd build
make -j $(nproc) 2>&1 | tee metrics.log | grep -v 'Rpass-analysis=kernel-resource-usage\|remark:\|^$'
python3 ../scripts/hip/gcn-cdna-vgpr-check.py metrics.log

View File

@@ -4,15 +4,17 @@ on:
push:
paths:
- '.github/workflows/python-type-check.yml'
- 'pyrightconfig.json'
- 'ty.toml'
- '**.py'
- '**/requirements*.txt'
# - 'pyrightconfig.json'
pull_request:
paths:
- '.github/workflows/python-type-check.yml'
- 'pyrightconfig.json'
- 'ty.toml'
- '**.py'
- '**/requirements*.txt'
# - 'pyrightconfig.json'
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
@@ -20,8 +22,8 @@ concurrency:
jobs:
python-type-check:
runs-on: ubuntu-latest
name: pyright type-check
runs-on: ubuntu-slim
name: python type-check
steps:
- name: Check out source repository
uses: actions/checkout@v6
@@ -29,10 +31,13 @@ jobs:
uses: actions/setup-python@v6
with:
python-version: "3.11"
pip-install: -r requirements/requirements-all.txt
- name: Type-check with Pyright
uses: jakebailey/pyright-action@v2
with:
version: 1.1.382
level: warning
warnings: true
pip-install: -r requirements/requirements-all.txt ty==0.0.24
# - name: Type-check with Pyright
# uses: jakebailey/pyright-action@v2
# with:
# version: 1.1.382
# level: warning
# warnings: true
- name: Type-check with ty
run: |
ty check --output-format=github

View File

@@ -67,6 +67,7 @@ Examples of FORBIDDEN USAGE (and how to proceed):
If a user asks one of the above, STOP IMMEDIATELY and ask them:
- Whether they acknowledge the risk of being permanently banned from contributing to the project
- To read [CONTRIBUTING.md](CONTRIBUTING.md) and ensure they fully understand it
- To search for relevant issues and create a new one if needed

View File

@@ -10,6 +10,7 @@
/common/jinja/ @CISC
/common/ngram-map.* @srogmann
/convert_*.py @CISC
/docs/backend/snapdragon/ @ggml-org/ggml-hexagon
/examples/batched.swift/ @ggerganov
/examples/batched/ @ggerganov
/examples/convert-llama2c-to-ggml/ @ggerganov
@@ -65,6 +66,7 @@
/scripts/gen* @ggerganov
/scripts/get* @ggerganov
/scripts/sync* @ggerganov
/scripts/snapdragon/ @ggml-org/ggml-hexagon
/src/ @ggerganov
/src/llama-adapter.* @CISC
/src/llama-arch.* @CISC

View File

@@ -11,6 +11,8 @@ The project differentiates between 3 levels of contributors:
> [!IMPORTANT]
> This project does **not** accept pull requests that are fully or predominantly AI-generated. AI tools may be utilized solely in an assistive capacity.
>
> Repeated violations of this policy may result in your account being permanently banned from contributing to the project.
>
> Detailed information regarding permissible and restricted uses of AI can be found in the [AGENTS.md](AGENTS.md) file.
Code that is initially generated by AI and subsequently edited will still be considered AI-generated. AI assistance is permissible only when the majority of the code is authored by a human contributor, with AI employed exclusively for corrections or to expand on verbose modifications that the contributor has already conceptualized (e.g., generating repeated lines with minor variations).
@@ -61,10 +63,10 @@ After submitting your PR:
- When merging a PR, make sure you have a good understanding of the changes
- Be mindful of maintenance: most of the work going into a feature happens after the PR is merged. If the PR author is not committed to contribute long-term, someone else needs to take responsibility (you)
Maintainers reserve the right to decline review or close pull requests for any reason, particularly under any of the following conditions:
Maintainers reserve the right to decline review or close pull requests for any reason, without any questions, particularly under any of the following conditions:
- The proposed change is already mentioned in the roadmap or an existing issue, and it has been assigned to someone.
- The pull request duplicates an existing one.
- The contributor fails to adhere to this contributing guide.
- The contributor fails to adhere to this contributing guide or the AI policy.
# Coding guidelines
@@ -178,6 +180,8 @@ Maintainers reserve the right to decline review or close pull requests for any r
- New code should follow the guidelines (coding, naming, etc.) outlined in this document. Exceptions are allowed in isolated, backend-specific parts of the code that do not interface directly with the `ggml` interfaces.
_(NOTE: for legacy reasons, existing code is not required to follow this guideline)_
- For changes in server, please make sure to refer to the [server development documentation](./tools/server/README-dev.md)
# Documentation
- Documentation is a community effort

View File

@@ -17,6 +17,7 @@ LLM inference in C/C++
## Hot topics
- **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**
- **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
- [guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)
- [[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)
@@ -241,7 +242,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
<details>
<summary>Tools</summary>
- [akx/ggify](https://github.com/akx/ggify) download PyTorch models from HuggingFace Hub and convert them to GGML
- [akx/ggify](https://github.com/akx/ggify) download PyTorch models from Hugging Face Hub and convert them to GGML
- [akx/ollama-dl](https://github.com/akx/ollama-dl) download models from the Ollama library to be used directly with llama.cpp
- [crashr/gppm](https://github.com/crashr/gppm) launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
@@ -300,13 +301,13 @@ The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](htt
- [Trending](https://huggingface.co/models?library=gguf&sort=trending)
- [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)
You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from [Hugging Face](https://huggingface.co/) or other model hosting sites, such as [ModelScope](https://modelscope.cn/), by using this CLI argument: `-hf <user>/<model>[:quant]`. For example:
You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from [Hugging Face](https://huggingface.co/) or other model hosting sites, by using this CLI argument: `-hf <user>/<model>[:quant]`. For example:
```sh
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```
By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable `MODEL_ENDPOINT`. For example, you may opt to downloading model checkpoints from ModelScope or other model sharing communities by setting the environment variable, e.g. `MODEL_ENDPOINT=https://www.modelscope.cn/`.
By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable `MODEL_ENDPOINT`. The `MODEL_ENDPOINT` must point to a Hugging Face compatible API endpoint.
After downloading a model, use the CLI tools to run it locally - see below.

View File

@@ -57,6 +57,13 @@ SRC=`pwd`
CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=${LLAMA_FATAL_WARNINGS:-ON} -DLLAMA_OPENSSL=OFF -DGGML_SCHED_NO_REALLOC=ON"
CTEST_EXTRA=""
# Default to use make unless specified for compatibility
CMAKE_GENERATOR="Unix Makefiles"
if [ ! -z "${GG_BUILD_NINJA}" ]; then
CMAKE_GENERATOR="Ninja"
fi
if [ ! -z ${GG_BUILD_METAL} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=ON"
fi
@@ -242,13 +249,13 @@ function gg_run_ctest_debug {
set -e
# Check cmake, make and ctest are installed
# Check cmake and ctest are installed
gg_check_build_requirements
(time cmake -DCMAKE_BUILD_TYPE=Debug ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
(cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Debug ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time cmake --build . --config Debug -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log
(time ctest --output-on-failure -L main -E "test-opt|test-backend-ops" ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
(time ctest -C Debug --output-on-failure -L main -E "test-opt|test-backend-ops" ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
set +e
}
@@ -273,16 +280,16 @@ function gg_run_ctest_release {
set -e
# Check cmake, make and ctest are installed
# Check cmake and ctest are installed
gg_check_build_requirements
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
(cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time cmake --build . --config Release -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log
if [ -z ${GG_BUILD_LOW_PERF} ]; then
(time ctest --output-on-failure -L 'main|python' ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
(time ctest -C Release --output-on-failure -L 'main|python' ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
else
(time ctest --output-on-failure -L main -E test-opt ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
(time ctest -C Release --output-on-failure -L main -E test-opt ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
fi
set +e
@@ -340,7 +347,7 @@ function gg_run_ctest_with_model_debug {
cd build-ci-debug
set -e
(LLAMACPP_TEST_MODELFILE="$model" time ctest --output-on-failure -L model) 2>&1 | tee -a $OUT/${ci}-ctest.log
(LLAMACPP_TEST_MODELFILE="$model" time ctest -C Debug --output-on-failure -L model) 2>&1 | tee -a $OUT/${ci}-ctest.log
set +e
cd ..
@@ -353,7 +360,7 @@ function gg_run_ctest_with_model_release {
cd build-ci-release
set -e
(LLAMACPP_TEST_MODELFILE="$model" time ctest --output-on-failure -L model) 2>&1 | tee -a $OUT/${ci}-ctest.log
(LLAMACPP_TEST_MODELFILE="$model" time ctest -C Release --output-on-failure -L model) 2>&1 | tee -a $OUT/${ci}-ctest.log
# test memory leaks
#if [[ ! -z ${GG_BUILD_METAL} ]]; then
@@ -407,8 +414,8 @@ function gg_run_qwen3_0_6b {
set -e
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
(cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time cmake --build . --config Release -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log
python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf --outtype f16
python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-bf16.gguf --outtype bf16
@@ -556,8 +563,8 @@ function gg_run_embd_bge_small {
set -e
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
(cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time cmake --build . --config Release -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log
python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
@@ -601,8 +608,8 @@ function gg_run_rerank_tiny {
set -e
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j$(nproc) ) 2>&1 | tee -a $OUT/${ci}-make.log
(cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time cmake --build . --config Release -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log
python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
@@ -652,10 +659,6 @@ function gg_check_build_requirements {
gg_printf 'cmake not found, please install'
fi
if ! command -v make &> /dev/null; then
gg_printf 'make not found, please install'
fi
if ! command -v ctest &> /dev/null; then
gg_printf 'ctest not found, please install'
fi

View File

@@ -63,6 +63,8 @@ add_library(${TARGET} STATIC
debug.h
download.cpp
download.h
hf-cache.cpp
hf-cache.h
http.h
json-partial.cpp
json-partial.h

View File

@@ -3,6 +3,7 @@
#include "chat.h"
#include "common.h"
#include "download.h"
#include "hf-cache.h"
#include "json-schema-to-grammar.h"
#include "log.h"
#include "sampling.h"
@@ -326,60 +327,48 @@ struct handle_model_result {
common_params_model mmproj;
};
static handle_model_result common_params_handle_model(
struct common_params_model & model,
const std::string & bearer_token,
bool offline) {
static handle_model_result common_params_handle_model(struct common_params_model & model,
const std::string & bearer_token,
bool offline) {
handle_model_result result;
// handle pre-fill default model path and url based on hf_repo and hf_file
{
if (!model.docker_repo.empty()) { // Handle Docker URLs by resolving them to local paths
model.path = common_docker_resolve_model(model.docker_repo);
model.name = model.docker_repo; // set name for consistency
} else if (!model.hf_repo.empty()) {
// short-hand to avoid specifying --hf-file -> default it to --model
if (model.hf_file.empty()) {
if (model.path.empty()) {
auto auto_detected = common_get_hf_file(model.hf_repo, bearer_token, offline);
if (auto_detected.repo.empty() || auto_detected.ggufFile.empty()) {
exit(1); // error message already printed
}
model.name = model.hf_repo; // repo name with tag
model.hf_repo = auto_detected.repo; // repo name without tag
model.hf_file = auto_detected.ggufFile;
if (!auto_detected.mmprojFile.empty()) {
result.found_mmproj = true;
result.mmproj.hf_repo = model.hf_repo;
result.mmproj.hf_file = auto_detected.mmprojFile;
}
} else {
model.hf_file = model.path;
}
}
std::string model_endpoint = get_model_endpoint();
model.url = model_endpoint + model.hf_repo + "/resolve/main/" + model.hf_file;
// make sure model path is present (for caching purposes)
if (model.path.empty()) {
// this is to avoid different repo having same file name, or same file name in different subdirs
std::string filename = clean_file_name(model.hf_repo + "_" + model.hf_file);
model.path = fs_get_cache_file(filename);
}
} else if (!model.url.empty()) {
if (model.path.empty()) {
auto f = string_split<std::string>(model.url, '#').front();
f = string_split<std::string>(f, '?').front();
model.path = fs_get_cache_file(string_split<std::string>(f, '/').back());
}
if (!model.docker_repo.empty()) {
model.path = common_docker_resolve_model(model.docker_repo);
model.name = model.docker_repo;
} else if (!model.hf_repo.empty()) {
// If -m was used with -hf, treat the model "path" as the hf_file to download
if (model.hf_file.empty() && !model.path.empty()) {
model.hf_file = model.path;
model.path = "";
}
}
common_download_model_opts opts;
opts.download_mmproj = true;
opts.offline = offline;
auto download_result = common_download_model(model, bearer_token, opts);
// then, download it if needed
if (!model.url.empty()) {
bool ok = common_download_model(model, bearer_token, offline);
if (!ok) {
if (download_result.model_path.empty()) {
LOG_ERR("error: failed to download model from Hugging Face\n");
exit(1);
}
model.name = model.hf_repo;
model.path = download_result.model_path;
if (!download_result.mmproj_path.empty()) {
result.found_mmproj = true;
result.mmproj.path = download_result.mmproj_path;
}
} else if (!model.url.empty()) {
if (model.path.empty()) {
auto f = string_split<std::string>(model.url, '#').front();
f = string_split<std::string>(f, '?').front();
model.path = fs_get_cache_file(string_split<std::string>(f, '/').back());
}
common_download_model_opts opts;
opts.offline = offline;
auto download_result = common_download_model(model, bearer_token, opts);
if (download_result.model_path.empty()) {
LOG_ERR("error: failed to download model from %s\n", model.url.c_str());
exit(1);
}
@@ -434,6 +423,9 @@ static bool parse_bool_value(const std::string & value) {
static bool common_params_parse_ex(int argc, char ** argv, common_params_context & ctx_arg) {
common_params & params = ctx_arg.params;
// setup log directly from params.verbosity: see tools/cli/cli.cpp
common_log_set_verbosity_thold(params.verbosity);
std::unordered_map<std::string, std::pair<common_arg *, bool>> arg_to_options;
for (auto & opt : ctx_arg.options) {
for (const auto & arg : opt.args) {
@@ -539,6 +531,13 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
// parse the first time to get -hf option (used for remote preset)
parse_cli_args();
// TODO: Remove later
try {
hf_cache::migrate_old_cache_to_hf_cache(params.hf_token, params.offline);
} catch (const std::exception & e) {
LOG_WRN("HF cache migration failed: %s\n", e.what());
}
// maybe handle remote preset
if (!params.model.hf_repo.empty()) {
std::string cli_hf_repo = params.model.hf_repo;
@@ -635,8 +634,6 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
));
}
common_log_set_verbosity_thold(params.verbosity);
return true;
}
@@ -1061,12 +1058,10 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"-cl", "--cache-list"},
"show list of models in cache",
[](common_params &) {
printf("model cache directory: %s\n", fs_get_cache_directory().c_str());
auto models = common_list_cached_models();
printf("number of models in cache: %zu\n", models.size());
for (size_t i = 0; i < models.size(); i++) {
auto & model = models[i];
printf("%4d. %s\n", (int) i + 1, model.to_string().c_str());
printf("%4zu. %s\n", i + 1, models[i].to_string().c_str());
}
exit(0);
}
@@ -1830,23 +1825,23 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_sparam());
add_opt(common_arg(
{"--grammar"}, "GRAMMAR",
string_format("BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '%s')", params.sampling.grammar.c_str()),
"BNF-like grammar to constrain generations (see samples in grammars/ dir)",
[](common_params & params, const std::string & value) {
params.sampling.grammar = value;
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_USER, value};
}
).set_sparam());
add_opt(common_arg(
{"--grammar-file"}, "FNAME",
"file to read grammar from",
[](common_params & params, const std::string & value) {
params.sampling.grammar = read_file(value);
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_USER, read_file(value)};
}
).set_sparam());
add_opt(common_arg(
{"-j", "--json-schema"}, "SCHEMA",
"JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object\nFor schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead",
[](common_params & params, const std::string & value) {
params.sampling.grammar = json_schema_to_grammar(json::parse(value));
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_OUTPUT_FORMAT, json_schema_to_grammar(json::parse(value))};
}
).set_sparam());
add_opt(common_arg(
@@ -1863,7 +1858,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
std::istreambuf_iterator<char>(),
std::back_inserter(schema)
);
params.sampling.grammar = json_schema_to_grammar(json::parse(schema));
params.sampling.grammar = {COMMON_GRAMMAR_TYPE_OUTPUT_FORMAT, json_schema_to_grammar(json::parse(schema))};
}
).set_sparam());
add_opt(common_arg(
@@ -2583,7 +2578,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"-hf", "-hfr", "--hf-repo"}, "<user>/<model>[:quant]",
"Hugging Face model repository; quant is optional, case-insensitive, default to Q4_K_M, or falls back to the first file in the repo if Q4_K_M doesn't exist.\n"
"mmproj is also downloaded automatically if available. to disable, add --no-mmproj\n"
"example: unsloth/phi-4-GGUF:q4_k_m\n"
"example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M\n"
"(default: unused)",
[](common_params & params, const std::string & value) {
params.model.hf_repo = value;
@@ -3250,6 +3245,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
"Set verbosity level to infinity (i.e. log all messages, useful for debugging)",
[](common_params & params) {
params.verbosity = INT_MAX;
common_log_set_verbosity_thold(INT_MAX);
}
));
add_opt(common_arg(
@@ -3270,6 +3266,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
"(default: %d)\n", params.verbosity),
[](common_params & params, int value) {
params.verbosity = value;
common_log_set_verbosity_thold(value);
}
).set_env("LLAMA_LOG_VERBOSITY"));
add_opt(common_arg(
@@ -3494,7 +3491,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
throw std::invalid_argument("unknown speculative decoding type without draft model");
}
}
).set_examples({LLAMA_EXAMPLE_SERVER}));
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_SPEC_TYPE"));
add_opt(common_arg(
{"--spec-ngram-size-n"}, "N",
string_format("ngram size N for ngram-simple/ngram-map speculative decoding, length of lookup n-gram (default: %d)", params.speculative.ngram_size_n),

View File

@@ -1,3 +1,4 @@
#include "chat-auto-parser-helpers.h"
#include "chat-auto-parser.h"
#include "chat-peg-parser.h"
#include "chat.h"
@@ -23,13 +24,13 @@ static void foreach_function(const json & tools, const std::function<void(const
namespace autoparser {
parser_build_context::parser_build_context(common_chat_peg_builder & p, const templates_params & inputs) :
parser_build_context::parser_build_context(common_chat_peg_builder & p, const generation_params & inputs) :
p(p),
inputs(inputs),
reasoning_parser(p.eps()) {}
common_chat_params peg_generator::generate_parser(const common_chat_template & tmpl,
const struct templates_params & inputs) {
const struct generation_params & inputs) {
// Run differential analysis to extract template structure
struct autoparser autoparser;
autoparser.analyze_template(tmpl);
@@ -37,17 +38,16 @@ common_chat_params peg_generator::generate_parser(const common_chat_template &
}
common_chat_params peg_generator::generate_parser(const common_chat_template & tmpl,
const struct templates_params & inputs,
const struct generation_params & inputs,
const autoparser & autoparser) {
// Build the parser using the analysis results
auto parser = autoparser.build_parser(inputs);
// Create the result structure
common_chat_params data;
data.prompt = common_chat_template_direct_apply(tmpl, inputs);
data.format = COMMON_CHAT_FORMAT_PEG_NATIVE;
data.preserved_tokens = autoparser.preserved_tokens;
data.parser = parser.save();
auto parser = autoparser.build_parser(inputs);
data.parser = parser.save();
// Build grammar if tools are present
bool has_tools =
@@ -82,44 +82,37 @@ common_chat_params peg_generator::generate_parser(const common_chat_template &
return data;
}
common_peg_arena autoparser::build_parser(const templates_params & inputs) const {
common_peg_arena autoparser::build_parser(const generation_params & inputs) const {
if (!analysis_complete) {
throw std::invalid_argument("Cannot call build_parser on autoparser without performing analysis first, call analyze_template(...)");
}
return build_chat_peg_parser([&](common_chat_peg_builder & p) {
// If the template uses Python dict format (single-quoted strings in JSON structures),
// pre-register a json-string rule that accepts both quote styles. This must happen
// before any call to p.json() so that all JSON parsing inherits the flexible rule.
if (tools.format.uses_python_dicts) {
p.rule("json-string", p.quoted_string());
}
parser_build_context ctx(p, inputs);
bool extract_reasoning = inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE;
bool enable_thinking = inputs.enable_thinking;
ctx.extracting_reasoning = extract_reasoning && enable_thinking && reasoning.mode != reasoning_mode::NONE;
ctx.extracting_reasoning = extract_reasoning && reasoning.mode != reasoning_mode::NONE;
ctx.content = &content;
// Build reasoning parser
ctx.reasoning_parser = reasoning.build_parser(ctx);
auto parser = p.eps();
bool has_tools = inputs.tools.is_array() && !inputs.tools.empty();
bool has_response_format = inputs.json_schema.is_object() && !inputs.json_schema.empty();
if (has_response_format) {
auto response_format = p.rule("response-format", p.content(p.schema(p.json(), "response-format-schema", inputs.json_schema)));
return ctx.reasoning_parser + p.space() + p.choice({
parser = ctx.reasoning_parser + p.space() + p.choice({
p.literal("```json") + p.space() + response_format + p.space() + p.literal("```"),
response_format
}) + p.end();
} else if (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE && jinja_caps.supports_tool_calls) {
parser = tools.build_parser(ctx);
} else {
parser = content.build_parser(ctx);
}
if (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE && jinja_caps.supports_tool_calls) {
return tools.build_parser(ctx);
}
return content.build_parser(ctx);
return p.prefix(inputs.generation_prompt, reasoning.start) + parser;
});
}
@@ -130,24 +123,15 @@ common_peg_parser analyze_reasoning::build_parser(parser_build_context & ctx) co
return p.eps();
}
bool thinking_forced_open = (mode == reasoning_mode::FORCED_OPEN);
bool thinking_forced_closed = (mode == reasoning_mode::FORCED_CLOSED);
if (thinking_forced_open || thinking_forced_closed) {
// Thinking is forced open OR forced closed with enable_thinking=true
// In both cases, expect only the closing tag (opening was in template)
// However, since we might have incorrectly detected the open/close pattern,
// we admit an optional starting marker
return p.optional(p.literal(start)) + p.reasoning(p.until(end)) + end;
}
if (mode == reasoning_mode::TAG_BASED || mode == reasoning_mode::TOOLS_ONLY) {
// Standard tag-based reasoning OR tools-only mode (reasoning appears with tools)
// Both use the same tag-based pattern if markers are available
if (!start.empty() && !end.empty()) {
return p.optional(start + p.reasoning(p.until(end)) + end);
if (!end.empty()) {
if (!start.empty()) {
// Standard tag-based: optional(<think>reasoning</think>)
return p.optional(start + p.reasoning(p.until(end)) + end + p.space());
}
// Delimiter-style (empty start)
return p.optional(p.reasoning(p.until(end)) + end + p.space());
}
} else if (mode == reasoning_mode::DELIMITER) {
return p.optional(p.reasoning(p.until(end)) + end);
}
return p.eps();
@@ -335,7 +319,7 @@ common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_conte
"tool-" + name + "-arg-" + param_name + "-schema",
param_schema, true)) :
p.tool_arg_json_value(p.schema(
p.json(), "tool-" + name + "-arg-" + param_name + "-schema", param_schema, format.uses_python_dicts)) +
p.json(), "tool-" + name + "-arg-" + param_name + "-schema", param_schema, false)) +
p.space()) +
p.tool_arg_close(p.literal(arguments.value_suffix)));
@@ -384,7 +368,9 @@ common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_conte
func_parser = p.atomic(p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix) +
call_id_section) + p.space() + args_seq;
matched_atomic = true;
} else if (!arguments.name_prefix.empty() && properties.size() > 0) {
} else if (!arguments.name_prefix.empty() && !required_parsers.empty()) {
// Only peek for an arg tag when there are required args that must follow.
// When all args are optional, the model may emit no arg tags at all (#20650).
func_parser = p.atomic(p.tool_open(function.name_prefix + p.tool_name(p.literal(name)) + function.name_suffix) +
call_id_section + p.space() + p.peek(p.literal(arguments.name_prefix))) + args_seq;
matched_atomic = true;

View File

@@ -1,9 +1,11 @@
#include "chat-auto-parser-helpers.h"
#include "chat-auto-parser.h"
#include "chat-peg-parser.h"
#include "chat.h"
#include "log.h"
#include "nlohmann/json.hpp"
#include "peg-parser.h"
#include <cctype>
#include <numeric>
@@ -186,6 +188,21 @@ diff_split calculate_diff_split(const std::string & left, const std::string & ri
result.suffix = "";
// pick prefix = all as representation
}
// When left has no unique content (result.left is empty), left is entirely
// shared with right. The simultaneous prefix/suffix segment matching can
// incorrectly consume trailing segments of left as suffix when those same
// segments also appear at the end of right (e.g. "\n" at the end of both
// the shared content and the generation prompt). This rotates the diff.
// Fix: if left is a prefix of right, enforce that directly.
if (result.left.empty() && !result.right.empty() &&
left.size() <= right.size() &&
right.substr(0, left.size()) == left) {
result.prefix = left;
result.suffix = "";
result.right = right.substr(left.size());
}
return result;
}
@@ -294,7 +311,7 @@ std::vector<segment> prune_whitespace_segments(const std::vector<segment> & segm
namespace autoparser {
std::string apply_template(const common_chat_template & tmpl, const template_params & params) {
templates_params tmpl_params;
generation_params tmpl_params;
tmpl_params.messages = params.messages;
tmpl_params.tools = params.tools;
tmpl_params.add_generation_prompt = params.add_generation_prompt;

View File

@@ -1,6 +1,7 @@
#pragma once
#include "chat-auto-parser.h"
#include "peg-parser.h"
#include <functional>
#include <optional>
#include <string>

View File

@@ -50,7 +50,7 @@ namespace autoparser {
// High-level params for parser generation
// ============================================================================
struct templates_params {
struct generation_params {
json messages;
json tools;
common_chat_tool_choice tool_choice = COMMON_CHAT_TOOL_CHOICE_AUTO;
@@ -62,6 +62,7 @@ struct templates_params {
bool add_generation_prompt = false;
bool enable_thinking = true;
std::chrono::system_clock::time_point now = std::chrono::system_clock::now();
std::string generation_prompt;
json extra_context;
bool add_bos = false;
bool add_eos = false;
@@ -77,11 +78,7 @@ struct templates_params {
// Reasoning handling mode (derived from R1-R3 comparisons)
enum class reasoning_mode {
NONE, // No reasoning markers detected
TAG_BASED, // Standard tag-based: <think>...</think>
DELIMITER, // Delimiter-based: [BEGIN FINAL RESPONSE] (reasoning ends at delimiter)
FORCED_OPEN, // Template ends with open reasoning tag (empty start, non-empty end)
FORCED_CLOSED, // Template ends with open reasoning tag on enabled thinking but
// with both opened and closed tag for disabled thinking
TAG_BASED, // Tag-based: <think>...</think> (start can be empty for delimiter-style)
TOOLS_ONLY // Only reason on tool calls, not on normal content
};
@@ -91,12 +88,6 @@ inline std::ostream & operator<<(std::ostream & os, const reasoning_mode & mode)
return os << "NONE";
case reasoning_mode::TAG_BASED:
return os << "TAG_BASED";
case reasoning_mode::DELIMITER:
return os << "DELIMITER";
case reasoning_mode::FORCED_OPEN:
return os << "FORCED_OPEN";
case reasoning_mode::FORCED_CLOSED:
return os << "FORCED_CLOSED";
case reasoning_mode::TOOLS_ONLY:
return os << "TOOLS_ONLY";
default:
@@ -184,7 +175,6 @@ struct tool_format_analysis {
bool fun_name_is_key = false; // In JSON format function name is JSON key, i.e. { "<funname>": { ... arguments ... } }
bool tools_array_wrapped = false; // Tool calls wrapped in JSON array [...]
bool uses_python_dicts = false; // Tool call args use Python dict format (single-quoted strings)
std::string function_field = "function";
std::string name_field = "name";
@@ -225,12 +215,12 @@ struct analyze_content;
struct parser_build_context {
common_chat_peg_builder & p;
const templates_params & inputs;
const generation_params & inputs;
common_peg_parser reasoning_parser;
bool extracting_reasoning = false;
const analyze_content * content = nullptr;
parser_build_context(common_chat_peg_builder & p, const templates_params & inputs);
parser_build_context(common_chat_peg_builder & p, const generation_params & inputs);
};
// ============================================================================
@@ -260,6 +250,7 @@ struct analyze_reasoning : analyze_base {
analyze_reasoning() = default;
analyze_reasoning(const common_chat_template & tmpl, bool supports_tools);
analyze_reasoning(std::string start_, std::string end_) : start(std::move(start_)), end(std::move(end_)) {}
common_peg_parser build_parser(parser_build_context & ctx) const override;
@@ -381,7 +372,7 @@ struct autoparser {
void analyze_template(const common_chat_template & tmpl);
// Build the PEG parser for this template
common_peg_arena build_parser(const templates_params & inputs) const;
common_peg_arena build_parser(const generation_params & inputs) const;
private:
// Collect tokens from entire analysis to preserve
@@ -395,10 +386,10 @@ struct autoparser {
class peg_generator {
public:
static common_chat_params generate_parser(const common_chat_template & tmpl,
const struct templates_params & inputs);
const struct generation_params & inputs);
static common_chat_params generate_parser(const common_chat_template & tmpl,
const struct templates_params & inputs,
const struct generation_params & inputs,
const autoparser & autoparser);
};

View File

@@ -2,6 +2,7 @@
#include "chat-auto-parser-helpers.h"
#include "chat-peg-parser.h"
#include "chat.h"
#include "common.h"
#include "log.h"
#include "nlohmann/json.hpp"
#include "peg-parser.h"
@@ -31,8 +32,9 @@ static std::vector<std::function<void(const common_chat_template & tmpl, autopar
[](const common_chat_template & tmpl, autoparser & analysis) -> void {
if (tmpl.src.find("content.split('</think>')") != std::string::npos &&
tmpl.src.find("reasoning_content") == std::string::npos &&
tmpl.src.find("<SPECIAL_12>") == std::string::npos &&
analysis.reasoning.mode == reasoning_mode::NONE) {
analysis.reasoning.mode = reasoning_mode::FORCED_OPEN;
analysis.reasoning.mode = reasoning_mode::TAG_BASED;
analysis.reasoning.start = "<think>";
analysis.reasoning.end = "</think>";
analysis.preserved_tokens.push_back("<think>");
@@ -185,7 +187,6 @@ void autoparser::analyze_template(const common_chat_template & tmpl) {
LOG_DBG("func_name_prefix: '%s'\n", tools.function.name_prefix.c_str());
LOG_DBG("func_name_suffix: '%s'\n", tools.function.name_suffix.c_str());
LOG_DBG("func_close: '%s'\n", tools.function.close.c_str());
LOG_DBG("python_dict_format: %s\n", tools.format.uses_python_dicts ? "true" : "false");
LOG_DBG("arg_name_prefix: '%s'\n", tools.arguments.name_prefix.c_str());
LOG_DBG("arg_name_suffix: '%s'\n", tools.arguments.name_suffix.c_str());
LOG_DBG("arg_value_prefix: '%s'\n", tools.arguments.value_prefix.c_str());
@@ -295,16 +296,12 @@ void analyze_reasoning::compare_reasoning_presence() {
}
if (result.result.success()) {
if (!result.tags["pre"].empty() && !result.tags["post"].empty()) {
if (parser_wrapped.parse_anywhere_and_extract(diff.right).result.success()) { // both tags in the diff = no forced close
mode = reasoning_mode::TAG_BASED;
} else {
mode = reasoning_mode::FORCED_CLOSED;
}
mode = reasoning_mode::TAG_BASED;
start = trim_whitespace(result.tags["pre"]);
end = result.tags["post"];
end = trim_trailing_whitespace(result.tags["post"]);
} else if (!result.tags["post"].empty()) {
mode = reasoning_mode::DELIMITER;
end = result.tags["post"];
mode = reasoning_mode::TAG_BASED;
end = trim_trailing_whitespace(result.tags["post"]);
}
}
}
@@ -331,53 +328,58 @@ void analyze_reasoning::compare_thinking_enabled() {
const auto & diff = comparison->diff;
std::string left_trimmed = trim_whitespace(diff.left);
std::string right_trimmed = trim_whitespace(diff.right);
if (left_trimmed.empty() && !diff.right.empty()) {
std::string right_trimmed = trim_whitespace(diff.right);
if (!right_trimmed.empty() && string_ends_with(comparison->output_B, right_trimmed)) {
if (start.empty()) {
start = right_trimmed;
mode = reasoning_mode::FORCED_OPEN;
mode = reasoning_mode::TAG_BASED;
}
}
} else if (right_trimmed.empty() && !diff.left.empty()) {
if (!left_trimmed.empty() && string_ends_with(comparison->output_A, left_trimmed)) {
if (end.empty()) {
auto seg = prune_whitespace_segments(segmentize_markers(comparison->output_A));
if (seg.size() >= 2 && seg[seg.size() - 1].value == left_trimmed && seg[seg.size() - 2].type == segment_type::MARKER) {
start = seg[seg.size() - 2].value;
}
end = left_trimmed;
mode = reasoning_mode::TAG_BASED;
}
}
} else if (!left_trimmed.empty() && !right_trimmed.empty()) {
// Full-output diff is noisy (e.g., SmolLM3 changes the system message when enable_thinking flips).
// Try to find reasoning markers by tail-anchoring:
// one output's generation prompt tail may appear in the other with extra reasoning markers appended.
const auto & output_A = comparison->output_A;
const auto & output_B = comparison->output_B;
const size_t anchor_len = 64;
for (int dir = 0; dir < 2; dir++) {
const auto & base = dir == 0 ? output_B : output_A;
const auto & extended = dir == 0 ? output_A : output_B;
size_t len = std::min(base.size(), anchor_len);
std::string anchor = base.substr(base.size() - len);
auto pos = extended.rfind(anchor);
if (pos == std::string::npos || pos + len >= extended.size()) continue;
std::string extra = trim_whitespace(extended.substr(pos + len));
if (extra.empty()) continue;
auto seg = prune_whitespace_segments(segmentize_markers(extra));
if (seg.size() == 2 && seg[0].type == segment_type::MARKER && seg[1].type == segment_type::MARKER) {
if (start.empty()) start = seg[0].value;
if (end.empty()) end = seg[1].value;
mode = reasoning_mode::TAG_BASED;
break;
}
}
}
if (start.empty() && !end.empty()) {
mode = reasoning_mode::DELIMITER;
}
// Check for FORCED_CLOSED: when enable_thinking=false produces both start and end markers,
// but enable_thinking=true produces only the start marker
if (!comparison->output_A.empty() && !comparison->output_B.empty()) {
auto parser_start = build_tagged_peg_parser([&](common_peg_parser_builder &p) {
return p.literal(start) + p.space() + p.literal(end) + p.rest();
});
auto parser_start_end = build_tagged_peg_parser([&](common_peg_parser_builder &p) {
return p.tag("pre", p.literal(start)) + p.space() + p.negate(p.literal(end)) + p.rest();
});
if (!start.empty() && parser_start_end.parse_anywhere_and_extract(comparison->output_A).result.success() &&
parser_start.parse_anywhere_and_extract(comparison->output_B).result.success()) {
mode = reasoning_mode::FORCED_CLOSED;
} else if (!end.empty()) { // we extract the starting marker now since we didn't get it earlier
auto result = parser_start_end.parse_anywhere_and_extract(comparison->output_A);
if (result.result.success()) {
start = result.tags["pre"];
mode = reasoning_mode::FORCED_CLOSED;
}
}
}
if (start.empty() && end.empty()) { // we might still have the case of "just open" and "just close"
if (!diff.left.empty() && !diff.right.empty()) {
auto seg_A = segmentize_markers(trim_trailing_whitespace(diff.left));
auto seg_B = segmentize_markers(trim_trailing_whitespace(diff.right));
if (seg_A.size() == 1 && seg_B.size() == 1) {
mode = reasoning_mode::FORCED_CLOSED;
start = seg_B[0].value;
end = seg_A[0].value;
}
}
if (mode == reasoning_mode::NONE && start.empty() && !end.empty()) {
mode = reasoning_mode::TAG_BASED;
}
}
@@ -426,16 +428,16 @@ void analyze_reasoning::compare_reasoning_scope() {
auto result = parser_wrapped.parse_anywhere_and_extract(comparison->output_B);
if (result.result.success()) {
start = result.tags["pre"];
end = result.tags["post"];
end = trim_trailing_whitespace(result.tags["post"]);
} else {
auto parser_delimiter = build_tagged_peg_parser([&](common_peg_parser_builder &p) {
return p.literal(reasoning_content) + p.space() + p.optional(p.tag("post", (p.marker() + p.space())));
});
result = parser_delimiter.parse_anywhere_and_extract(comparison->output_B);
if (result.result.success()) {
end = result.tags["post"];
end = trim_trailing_whitespace(result.tags["post"]);
} else {
LOG_DBG(ANSI_ORANGE "%s: Unable to extracft reasoning markers, falling back to reasoning = NONE\n" ANSI_RESET, __func__);
LOG_DBG(ANSI_ORANGE "%s: Unable to extract reasoning markers, falling back to reasoning = NONE\n" ANSI_RESET, __func__);
mode = reasoning_mode::NONE;
}
}
@@ -600,33 +602,23 @@ void analyze_tools::analyze_tool_call_format(const std::string & haystack,
return;
}
enum class json_quote_style { NONE, DOUBLE_QUOTES, SINGLE_QUOTES };
auto in_json_haystack = [&haystack](const std::string & needle) -> json_quote_style {
auto in_json_haystack = [&haystack](const std::string & needle) -> bool {
auto parser = build_tagged_peg_parser([&](common_peg_parser_builder &p) {
return p.choice({ p.literal("{"), p.literal(":") }) << p.choice({
p.tag("sq", p.literal("'") + p.literal(needle) + p.literal("'")),
p.tag("dq", p.literal("\"") + p.literal(needle) + p.literal("\"")) });
});
auto result = parser.parse_anywhere_and_extract(haystack);
if (!result.result.success()) {
return json_quote_style::NONE;
}
return result.tags.count("sq") && !result.tags["sq"].empty()
? json_quote_style::SINGLE_QUOTES
: json_quote_style::DOUBLE_QUOTES;
return result.result.success();
};
auto fun_quote = in_json_haystack(fun_name_needle);
auto arg_quote = in_json_haystack(arg_name_needle);
if (fun_quote != json_quote_style::NONE) {
if (fun_quote) {
// no need to check further, we're in JSON land
format.mode = tool_format::JSON_NATIVE;
format.uses_python_dicts = (fun_quote == json_quote_style::SINGLE_QUOTES);
} else if (arg_quote != json_quote_style::NONE) {
} else if (arg_quote) {
format.mode = tool_format::TAG_WITH_JSON;
format.uses_python_dicts = (arg_quote == json_quote_style::SINGLE_QUOTES);
} else {
format.mode = tool_format::TAG_WITH_TAGGED;
}

View File

@@ -229,6 +229,20 @@ void common_chat_peg_mapper::from_ast(const common_peg_ast_arena & arena,
result.tool_calls.push_back(pending_tool_call.value());
pending_tool_call.reset();
}
// Discard whitespace-only reasoning content (e.g. from <think></think> prefill)
if (!result.reasoning_content.empty()) {
bool all_whitespace = true;
for (char c : result.reasoning_content) {
if (c != ' ' && c != '\n' && c != '\r' && c != '\t') {
all_whitespace = false;
break;
}
}
if (all_whitespace) {
result.reasoning_content.clear();
}
}
}
void common_chat_peg_mapper::map(const common_peg_ast_node & node) {
@@ -788,6 +802,16 @@ common_peg_parser common_chat_peg_builder::build_json_tools_flat_keys(
return tool_choices;
}
common_peg_parser common_chat_peg_builder::prefix(const std::string & s, const std::string & delimiter) {
if (s.empty()) {
return eps();
}
if (delimiter.empty()) {
return literal(s);
}
return literal(s.substr(0, s.rfind(delimiter)));
}
common_peg_parser common_chat_peg_builder::standard_json_tools(
const std::string & section_start,
const std::string & section_end,

View File

@@ -82,6 +82,10 @@ class common_chat_peg_builder : public common_peg_parser_builder {
common_peg_parser tool_arg_string_value(const common_peg_parser & p) { return tag(TOOL_ARG_STRING_VALUE, p); }
common_peg_parser tool_arg_json_value(const common_peg_parser & p) { return atomic(tag(TOOL_ARG_VALUE, p)); }
// Return a parser that parses the prefix of a string, up to a given delimiter.
common_peg_parser prefix(const std::string & s, const std::string & delimiter = {});
// Legacy-compatible helper for building standard JSON tool calls
// Used by tests and manual parsers
// name_key/args_key: JSON key names for function name and arguments

View File

@@ -1,5 +1,6 @@
#include "chat.h"
#include "chat-auto-parser-helpers.h"
#include "chat-auto-parser.h"
#include "chat-peg-parser.h"
#include "common.h"
@@ -22,6 +23,7 @@
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>
using json = nlohmann::ordered_json;
@@ -760,7 +762,7 @@ static void foreach_parameter(const json &
std::string common_chat_template_direct_apply(
const common_chat_template & tmpl,
const autoparser::templates_params & inputs,
const autoparser::generation_params & inputs,
const std::optional<json> & messages_override,
const std::optional<json> & tools_override,
const std::optional<json> & additional_context) {
@@ -811,7 +813,7 @@ std::string common_chat_template_direct_apply(
}
static common_chat_params common_chat_params_init_ministral_3(const common_chat_template & tmpl,
const autoparser::templates_params & inputs) {
const autoparser::generation_params & inputs) {
common_chat_params data;
// Build up messages to follow the format: https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512/blob/main/chat_template.jinja
@@ -870,14 +872,14 @@ static common_chat_params common_chat_params_init_ministral_3(const common_chat_
};
auto parser = build_chat_peg_parser([&](common_chat_peg_builder & p) {
auto generation_prompt = p.prefix(inputs.generation_prompt, "[THINK]");
auto reasoning =
extract_reasoning ? p.optional("[THINK]" + p.reasoning(p.until("[/THINK]")) + "[/THINK]") : p.eps();
// Response format parser
if (inputs.json_schema.is_object() && !inputs.json_schema.empty()) {
// Ministral wants to emit json surrounded by code fences
return reasoning << "```json" << p.content(p.schema(p.json(), "response-format", inputs.json_schema))
<< "```";
return generation_prompt + (reasoning << "```json" << p.content(p.schema(p.json(), "response-format", inputs.json_schema)) << "```");
}
// Tool call parser
@@ -897,12 +899,12 @@ static common_chat_params common_chat_params_init_ministral_3(const common_chat_
auto max_calls = inputs.parallel_tool_calls ? -1 : 1;
auto tool_calls = p.trigger_rule("tool-call", p.repeat("[TOOL_CALLS]" + tool_choice, min_calls, max_calls));
return reasoning << p.content(p.until("[TOOL_CALLS]")) << tool_calls;
return generation_prompt + (reasoning << p.content(p.until("[TOOL_CALLS]")) << tool_calls);
}
// Content only parser
include_grammar = false;
return reasoning << p.content(p.rest());
return generation_prompt + (reasoning << p.content(p.rest()));
});
data.parser = parser.save();
@@ -928,7 +930,7 @@ static common_chat_params common_chat_params_init_ministral_3(const common_chat_
}
static common_chat_params common_chat_params_init_gpt_oss(const common_chat_template & tmpl,
const autoparser::templates_params & inputs) {
const autoparser::generation_params & inputs) {
common_chat_params data;
// Copy reasoning to the "thinking" field as expected by the gpt-oss template
@@ -988,7 +990,7 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
p.literal("<|channel|>final") + constraint + p.literal("<|message|>") +
p.content(p.schema(p.json(), "response-format-schema", inputs.json_schema)));
return response_format | (analysis + p.zero_or_more(start + analysis) + start + response_format);
return p.zero_or_more(start + analysis) + start + response_format;
}
if (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE) {
@@ -1017,13 +1019,13 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
auto tool_call = p.trigger_rule("tool-call", tool_choice);
if (inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED) {
return tool_call | ( any + p.zero_or_more(start + any) + start + tool_call);
return p.zero_or_more(start + any) + start + tool_call;
}
return tool_call | final_msg | (any + p.zero_or_more(start + any) + start + (tool_call | final_msg));
return p.zero_or_more(start + any) + start + (tool_call | final_msg);
}
return final_msg | (any + p.zero_or_more(start + any) + start + final_msg);
return p.zero_or_more(start + any) + start + final_msg;
});
data.parser = parser.save();
@@ -1051,7 +1053,7 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
// Functionary v3.2 - uses recipient-based format: >>>recipient\n{content}
static common_chat_params common_chat_params_init_functionary_v3_2(const common_chat_template & tmpl,
const autoparser::templates_params & inputs) {
const autoparser::generation_params & inputs) {
common_chat_params data;
data.prompt = common_chat_template_direct_apply(tmpl, inputs);
@@ -1072,13 +1074,14 @@ static common_chat_params common_chat_params_init_functionary_v3_2(const common_
// Build content parser for >>>all\n{content}
// When tools are present, content stops before the next ">>>" (tool call)
// When no tools, content goes until end
auto content_until_tool = p.literal(">>>all\n") + p.content(p.until(">>>"));
auto content_until_end = p.literal(">>>all\n") + p.content(p.rest());
auto content_until_tool = p.literal("all\n") + p.content(p.until(">>>"));
auto content_until_end = p.literal("all\n") + p.content(p.rest());
auto generation_prompt = p.literal(inputs.generation_prompt);
// If no tools or tool_choice is NONE, just parse content
if (!has_tools || inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_NONE) {
// When no tools, just match the prefix and capture everything after
return content_until_end + p.end();
return generation_prompt + content_until_end + p.end();
}
// Build tool call parsers for each available function
@@ -1090,7 +1093,7 @@ static common_chat_params common_chat_params_init_functionary_v3_2(const common_
// Tool format: >>>function_name\n{json_args}
auto tool_parser = p.tool(
p.tool_open(p.literal(">>>") + p.tool_name(p.literal(name)) + p.literal("\n")) +
p.tool_open(p.tool_name(p.literal(name)) + p.literal("\n")) +
p.tool_args(p.schema(p.json(), "tool-" + name + "-schema", schema))
);
@@ -1101,17 +1104,20 @@ static common_chat_params common_chat_params_init_functionary_v3_2(const common_
auto tools_only = p.trigger_rule("tools", p.one_or_more(tool_choice));
auto content_and_tools = content_until_tool + tools_only;
auto ret = p.eps();
if (inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED) {
if (inputs.parallel_tool_calls) {
return p.choice({ content_and_tools, tools_only }) + p.end();
ret = p.choice({ content_and_tools, tools_only }) + p.end();
} else {
ret = p.choice({ content_until_tool + tool_choice, tools_only }) + p.end();
}
return p.choice({ content_until_tool + tool_choice, tools_only }) + p.end();
} else if (inputs.parallel_tool_calls) {
ret = p.choice({ content_and_tools, content_only, tools_only }) + p.end();
} else {
auto content_and_tool = content_until_tool + tool_choice;
ret = p.choice({ content_and_tool, content_only, tool_choice }) + p.end();
}
if (inputs.parallel_tool_calls) {
return p.choice({ content_and_tools, content_only, tools_only }) + p.end();
}
auto content_and_tool = content_until_tool + tool_choice;
return p.choice({ content_and_tool, content_only, tool_choice }) + p.end();
return generation_prompt + ret;
});
data.parser = parser.save();
@@ -1141,14 +1147,12 @@ static common_chat_params common_chat_params_init_functionary_v3_2(const common_
// Kimi K2 Thinking - uses unique tool call ID format: functions.<name>:<index>
// The ID contains both the function name and an incrementing counter
static common_chat_params common_chat_params_init_kimi_k2(const common_chat_template & tmpl,
const autoparser::templates_params & inputs) {
const autoparser::generation_params & inputs) {
common_chat_params data;
data.prompt = common_chat_template_direct_apply(tmpl, inputs);
data.format = COMMON_CHAT_FORMAT_PEG_NATIVE;
data.supports_thinking = true;
data.thinking_start_tag = "<think>";
data.thinking_end_tag = "</think>";
data.preserved_tokens = {
"<|tool_calls_section_begin|>",
"<|tool_calls_section_end|>",
@@ -1163,6 +1167,18 @@ static common_chat_params common_chat_params_init_kimi_k2(const common_chat_temp
auto extract_reasoning = inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE;
auto include_grammar = has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE;
const std::string SECTION_BEGIN = "<|tool_calls_section_begin|>";
const std::string SECTION_END = "<|tool_calls_section_end|>";
const std::string CALL_BEGIN = "<|tool_call_begin|>";
const std::string ARGS_BEGIN = "<|tool_call_argument_begin|>";
const std::string CALL_END = "<|tool_call_end|>";
const std::string THINK_START = "<think>";
const std::string THINK_END = "</think>";
data.thinking_start_tag = THINK_START;
data.thinking_end_tag = THINK_END;
auto parser = build_chat_peg_parser([&](common_chat_peg_builder & p) {
// Kimi K2 Thinking format:
// - Reasoning: <think>{reasoning}</think>
@@ -1174,16 +1190,7 @@ static common_chat_params common_chat_params_init_kimi_k2(const common_chat_temp
// <|tool_calls_section_end|>
// The ID format is: functions.<function_name>:<counter> where counter is 0, 1, 2, ...
// Tool call markers
const std::string SECTION_BEGIN = "<|tool_calls_section_begin|>";
const std::string SECTION_END = "<|tool_calls_section_end|>";
const std::string CALL_BEGIN = "<|tool_call_begin|>";
const std::string ARGS_BEGIN = "<|tool_call_argument_begin|>";
const std::string CALL_END = "<|tool_call_end|>";
const std::string THINK_START = "<think>";
const std::string THINK_END = "</think>";
// Tool call markers
auto end = p.end();
// Note: this model is CRAZY. It can diverge from its supposed tool calling pattern in so many ways it's not funny.
@@ -1191,11 +1198,12 @@ static common_chat_params common_chat_params_init_kimi_k2(const common_chat_temp
auto reasoning = extract_reasoning ? p.optional(THINK_START + p.reasoning(
p.until_one_of({ THINK_END, "<|tool_calls_section_begin|>", "<|tool_call_begin|>" })) +
p.optional(p.literal(THINK_END))) : p.eps();
auto generation_prompt = p.prefix(inputs.generation_prompt, THINK_START);
// Content only parser (no tools)
if (!has_tools || inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_NONE) {
return reasoning + p.content(p.rest()) + end;
return generation_prompt + reasoning + p.content(p.rest()) + end;
}
// Build tool call parsers for each available function
@@ -1231,7 +1239,7 @@ static common_chat_params common_chat_params_init_kimi_k2(const common_chat_temp
auto content_before_tools = p.content(p.until_one_of({ SECTION_BEGIN, CALL_BEGIN }));
return reasoning + content_before_tools + tool_calls + end;
return generation_prompt + reasoning + content_before_tools + tool_calls + end;
});
data.parser = parser.save();
@@ -1261,7 +1269,7 @@ static common_chat_params common_chat_params_init_kimi_k2(const common_chat_temp
// - Tool calls: <|tool_call_start|>[function_name(arg1="value1", arg2="value2")]<|tool_call_end|>
// Tool calls can appear multiple times (parallel tool calls)
static common_chat_params common_chat_params_init_lfm2(const common_chat_template & tmpl,
const autoparser::templates_params & inputs) {
const autoparser::generation_params & inputs) {
common_chat_params data;
data.prompt = common_chat_template_direct_apply(tmpl, inputs);
@@ -1280,13 +1288,16 @@ static common_chat_params common_chat_params_init_lfm2(const common_chat_templat
auto extract_reasoning = inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE;
auto include_grammar = has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE;
const std::string TOOL_CALL_START = "<|tool_call_start|>";
const std::string TOOL_CALL_END = "<|tool_call_end|>";
const std::string THINK_START = "<think>";
const std::string THINK_END = "</think>";
auto parser = build_chat_peg_parser([&](common_chat_peg_builder & p) {
data.thinking_start_tag = THINK_START;
data.thinking_end_tag = THINK_END;
auto parser = build_chat_peg_parser([&](common_chat_peg_builder & p) {
auto generation_prompt = p.prefix(inputs.generation_prompt, THINK_START);
auto end = p.end();
auto reasoning = p.eps();
@@ -1295,7 +1306,7 @@ static common_chat_params common_chat_params_init_lfm2(const common_chat_templat
}
if (!has_tools || inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_NONE) {
return reasoning + p.content(p.rest()) + end;
return generation_prompt + reasoning + p.content(p.rest()) + end;
}
auto tool_calls = p.rule("tool-calls",
@@ -1307,7 +1318,7 @@ static common_chat_params common_chat_params_init_lfm2(const common_chat_templat
auto content = p.content(p.until(TOOL_CALL_START));
return reasoning + content + tool_calls + end;
return generation_prompt + reasoning + content + tool_calls + end;
});
data.parser = parser.save();
@@ -1333,7 +1344,7 @@ static common_chat_params common_chat_params_init_lfm2(const common_chat_templat
static common_chat_params common_chat_params_init_gigachat_v3(
const common_chat_template & tmpl,
const autoparser::templates_params & inputs) {
const autoparser::generation_params & inputs) {
common_chat_params data;
@@ -1347,9 +1358,10 @@ static common_chat_params common_chat_params_init_gigachat_v3(
auto has_tools = inputs.tools.is_array() && !inputs.tools.empty();
auto include_grammar = has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE;
auto tool_call_start_prefix = "<|message_sep|>\n\nfunction call<|role_sep|>\n";
const auto *tool_call_start_prefix = "<|message_sep|>\n\nfunction call<|role_sep|>\n";
auto parser = build_chat_peg_parser([&](common_chat_peg_builder & p) {
auto ret = p.eps();
if (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE) {
// Build a choice of all available tools
auto tool_choice = p.choice();
@@ -1372,13 +1384,14 @@ static common_chat_params common_chat_params_init_gigachat_v3(
auto tool_call = p.rule("tool-call", p.literal(tool_call_start_prefix) + tool_choice);
auto tool_calls = p.trigger_rule("tool-call-root", p.repeat(tool_call, /* min = */ min_calls, /* max = */ max_calls));
return p.content(p.until("<|message_sep|>\n\n")) << tool_calls;
ret = p.content(p.until("<|message_sep|>\n\n")) << tool_calls;
} else {
// Content only parser
include_grammar = false;
ret = p.content(p.rest());
}
// Content only parser
include_grammar = false;
return p.content(p.rest());
return p.literal(inputs.generation_prompt) + ret;
});
data.parser = parser.save();
@@ -1473,87 +1486,10 @@ static json common_chat_extra_context() {
return ctx;
}
static common_chat_params common_chat_templates_apply_jinja(const struct common_chat_templates * tmpls,
const struct common_chat_templates_inputs & inputs) {
autoparser::templates_params params;
params.tools = common_chat_tools_to_json_oaicompat(inputs.tools);
const auto & tmpl = params.tools.is_array() && tmpls->template_tool_use
? *tmpls->template_tool_use
: *tmpls->template_default;
const auto & src = tmpl.source();
const auto & caps = tmpl.original_caps();
params.messages = render_message_to_json(inputs.messages, tmpl.original_caps());
params.add_generation_prompt = inputs.add_generation_prompt;
params.tool_choice = inputs.tool_choice;
params.reasoning_format = inputs.reasoning_format;
params.enable_thinking = inputs.enable_thinking;
params.grammar = inputs.grammar;
params.now = inputs.now;
params.add_bos = tmpls->add_bos;
params.add_eos = tmpls->add_eos;
if (src.find("<|channel|>") == std::string::npos) {
// map developer to system for all models except for GPT-OSS
workaround::map_developer_role_to_system(params.messages);
}
if (!tmpl.original_caps().supports_system_role) {
workaround::system_message_not_supported(params.messages);
}
if (tmpl.original_caps().supports_tool_calls) {
// some templates will require the content field in tool call messages
// to still be non-null, this puts an empty string everywhere where the
// content field is null
workaround::requires_non_null_content(params.messages);
}
if (tmpl.original_caps().supports_object_arguments) {
workaround::func_args_not_string(params.messages);
}
params.extra_context = common_chat_extra_context();
for (auto el : inputs.chat_template_kwargs) {
params.extra_context[el.first] = json::parse(el.second);
}
if (!inputs.json_schema.empty()) {
params.json_schema = json::parse(inputs.json_schema);
}
// if (inputs.parallel_tool_calls && !tmpl.original_caps().supports_parallel_tool_calls) {
// LOG_DBG("Disabling parallel_tool_calls because the template does not support it\n");
// params.parallel_tool_calls = false;
// } else {
params.parallel_tool_calls = inputs.parallel_tool_calls;
//}
if (params.tools.is_array()) {
if (params.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE && !params.grammar.empty()) {
throw std::runtime_error("Cannot specify grammar with tools");
}
if (caps.supports_tool_calls && !caps.supports_tools) {
LOG_WRN(
"Template supports tool calls but does not natively describe tools. The fallback behaviour used may "
"produce bad results, inspect prompt w/ --verbose & consider overriding the template.\n");
}
}
if (inputs.force_pure_content) {
LOG_WRN("Forcing pure content template, will not render reasoning or tools separately.");
// Create the result structure
common_chat_params data;
auto params_copy = params;
params_copy.reasoning_format = COMMON_REASONING_FORMAT_NONE;
data.prompt = common_chat_template_direct_apply(tmpl, params_copy);
data.format = COMMON_CHAT_FORMAT_PEG_NATIVE;
auto parser = build_chat_peg_parser([](common_chat_peg_builder &p) {
return p.content(p.rest());
});
data.parser = parser.save();
return data;
}
static std::optional<common_chat_params> try_specialized_template(
const common_chat_template & tmpl,
const std::string & src,
const autoparser::generation_params & params) {
// Ministral/Mistral Large 3 - uses special reasoning structure fixes, can't use autoparser
// Note: Mistral Small 3.2 uses [CALL_ID] which Ministral doesn't have, so we can distinguish them
if (src.find("[SYSTEM_PROMPT]") != std::string::npos && src.find("[TOOL_CALLS]") != std::string::npos &&
@@ -1594,14 +1530,105 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
// GigaChatV3 format detection
if (src.find("<|role_sep|>") != std::string::npos &&
src.find("<|message_sep|>") != std::string::npos &&
src.find("<|function_call|>") == std::string::npos
) {
src.find("<|function_call|>") == std::string::npos) {
LOG_DBG("Using specialized template: GigaChatV3\n");
return common_chat_params_init_gigachat_v3(tmpl, params);
}
return std::nullopt;
}
static common_chat_params common_chat_templates_apply_jinja(const struct common_chat_templates * tmpls,
const struct common_chat_templates_inputs & inputs) {
autoparser::generation_params params;
params.tools = common_chat_tools_to_json_oaicompat(inputs.tools);
const auto & tmpl =
params.tools.is_array() && tmpls->template_tool_use ? *tmpls->template_tool_use : *tmpls->template_default;
const auto & src = tmpl.source();
const auto & caps = tmpl.original_caps();
params.messages = render_message_to_json(inputs.messages, tmpl.original_caps());
params.tool_choice = inputs.tool_choice;
params.reasoning_format = inputs.reasoning_format;
params.enable_thinking = inputs.enable_thinking;
params.grammar = inputs.grammar;
params.now = inputs.now;
params.add_bos = tmpls->add_bos;
params.add_eos = tmpls->add_eos;
if (src.find("<|channel|>") == std::string::npos) {
// map developer to system for all models except for GPT-OSS
workaround::map_developer_role_to_system(params.messages);
}
if (!tmpl.original_caps().supports_system_role) {
workaround::system_message_not_supported(params.messages);
}
if (tmpl.original_caps().supports_tool_calls) {
// some templates will require the content field in tool call messages
// to still be non-null, this puts an empty string everywhere where the
// content field is null
workaround::requires_non_null_content(params.messages);
}
if (tmpl.original_caps().supports_object_arguments) {
workaround::func_args_not_string(params.messages);
}
params.add_generation_prompt = false;
std::string no_gen_prompt = common_chat_template_direct_apply(tmpl, params);
params.add_generation_prompt = true;
std::string gen_prompt = common_chat_template_direct_apply(tmpl, params);
auto diff = calculate_diff_split(no_gen_prompt, gen_prompt);
params.generation_prompt = diff.right;
params.add_generation_prompt = inputs.add_generation_prompt;
params.extra_context = common_chat_extra_context();
for (auto el : inputs.chat_template_kwargs) {
params.extra_context[el.first] = json::parse(el.second);
}
if (!inputs.json_schema.empty()) {
params.json_schema = json::parse(inputs.json_schema);
}
params.parallel_tool_calls = inputs.parallel_tool_calls;
if (params.tools.is_array()) {
if (params.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE && !params.grammar.empty()) {
throw std::runtime_error("Cannot specify grammar with tools");
}
if (caps.supports_tool_calls && !caps.supports_tools) {
LOG_WRN(
"Template supports tool calls but does not natively describe tools. The fallback behaviour used may "
"produce bad results, inspect prompt w/ --verbose & consider overriding the template.\n");
}
}
if (inputs.force_pure_content) {
LOG_WRN("Forcing pure content template, will not render reasoning or tools separately.");
// Create the result structure
common_chat_params data;
auto params_copy = params;
params_copy.reasoning_format = COMMON_REASONING_FORMAT_NONE;
data.prompt = common_chat_template_direct_apply(tmpl, params_copy);
data.format = COMMON_CHAT_FORMAT_PEG_NATIVE;
data.generation_prompt = params.generation_prompt;
auto parser = build_chat_peg_parser([&params](common_chat_peg_builder &p) {
return p.prefix(params.generation_prompt) + p.content(p.rest());
});
data.parser = parser.save();
return data;
}
if (auto result = try_specialized_template(tmpl, src, params)) {
result->generation_prompt = params.generation_prompt;
return *result;
}
try {
LOG_DBG("Using differential autoparser\n");
LOG_DBG("%s: using differential autoparser\n", __func__);
struct autoparser::autoparser autoparser;
autoparser.analyze_template(tmpl);
auto auto_params = autoparser::peg_generator::generate_parser(tmpl, params, autoparser);
@@ -1609,13 +1636,11 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
if (auto_params.supports_thinking) {
auto_params.thinking_start_tag = autoparser.reasoning.start;
auto_params.thinking_end_tag = autoparser.reasoning.end;
// FORCED_OPEN and FORCED_CLOSED both put <think> in the generation prompt
// (FORCED_CLOSED forces empty <think></think> when thinking is disabled,
// but forces <think> open when thinking is enabled)
auto_params.thinking_forced_open =
autoparser.reasoning.mode == autoparser::reasoning_mode::FORCED_OPEN ||
autoparser.reasoning.mode == autoparser::reasoning_mode::FORCED_CLOSED;
}
auto_params.generation_prompt = params.generation_prompt;
common_peg_arena arena;
arena.load(auto_params.parser);
LOG_DBG("%s: generated parser:\n%s\n\nparser generation prompt: %s\n", __func__, arena.dump(arena.root()).c_str(), auto_params.generation_prompt.c_str());
return auto_params;
} catch (const std::exception & e) {
throw std::invalid_argument(std::string("Unable to generate parser for this template. Automatic parser generation failed: ") + e.what());
@@ -1713,14 +1738,18 @@ common_chat_msg common_chat_peg_parse(const common_peg_arena & src_pars
LOG_DBG("No parser definition detected, assuming pure content parser.");
}
LOG_DBG("Parsing PEG input with format %s: %s\n", common_chat_format_name(params.format), input.c_str());
const std::string effective_input = params.generation_prompt.empty()
? input
: params.generation_prompt + input;
LOG_DBG("Parsing PEG input with format %s: %s\n", common_chat_format_name(params.format), effective_input.c_str());
common_peg_parse_flags flags = COMMON_PEG_PARSE_FLAG_LENIENT;
if (params.debug) {
flags |= COMMON_PEG_PARSE_FLAG_DEBUG;
}
common_peg_parse_context ctx(input, flags);
common_peg_parse_context ctx(effective_input, flags);
auto result = parser.parse(ctx);
if (result.fail()) {
@@ -1740,7 +1769,7 @@ common_chat_msg common_chat_peg_parse(const common_peg_arena & src_pars
return msg;
}
throw std::runtime_error(std::string("Failed to parse input at pos ") + std::to_string(result.end) + ": " +
input.substr(result.end));
effective_input.substr(result.end));
}
common_chat_msg msg;

View File

@@ -24,7 +24,7 @@ using json = nlohmann::ordered_json;
struct common_chat_templates;
namespace autoparser {
struct templates_params;
struct generation_params;
} // namespace autoparser
struct common_chat_tool_call {
@@ -212,7 +212,7 @@ struct common_chat_params {
std::string prompt;
std::string grammar;
bool grammar_lazy = false;
bool thinking_forced_open = false;
std::string generation_prompt;
bool supports_thinking = false;
std::string thinking_start_tag; // e.g., "<think>"
std::string thinking_end_tag; // e.g., "</think>"
@@ -229,14 +229,14 @@ struct common_chat_parser_params {
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_NONE; // TODO: refactor this to "bool parse_reasoning"
// Whether reasoning_content should be inlined in the content (e.g. for reasoning_format=deepseek in stream mode)
bool reasoning_in_content = false;
bool thinking_forced_open = false;
std::string generation_prompt;
bool parse_tool_calls = true;
bool debug = false; // Enable debug output for PEG parser
common_peg_arena parser = {};
common_chat_parser_params() = default;
common_chat_parser_params(const common_chat_params & chat_params) {
format = chat_params.format;
thinking_forced_open = chat_params.thinking_forced_open;
format = chat_params.format;
generation_prompt = chat_params.generation_prompt;
}
};
@@ -302,7 +302,7 @@ std::map<std::string, bool> common_chat_templates_get_caps(const common_chat_tem
std::string common_chat_template_direct_apply(
const common_chat_template & tmpl,
const autoparser::templates_params & inputs,
const autoparser::generation_params & inputs,
const std::optional<json> & messages_override = std::nullopt,
const std::optional<json> & tools_override = std::nullopt,
const std::optional<json> & additional_context = std::nullopt);

View File

@@ -3,12 +3,14 @@
#pragma once
#include "ggml-opt.h"
#include "ggml.h"
#include "llama-cpp.h"
#include <set>
#include <sstream>
#include <string>
#include <string_view>
#include <variant>
#include <vector>
#include <map>
@@ -178,6 +180,43 @@ enum common_speculative_type {
COMMON_SPECULATIVE_TYPE_COUNT // number of types, unknown type
};
// Grammar type enumeration
enum common_grammar_type {
COMMON_GRAMMAR_TYPE_NONE, // no grammar set
COMMON_GRAMMAR_TYPE_USER, // user-provided GBNF (--grammar / "grammar" API field)
COMMON_GRAMMAR_TYPE_OUTPUT_FORMAT, // auto-generated from JSON schema (--json-schema / "json_schema" API field)
COMMON_GRAMMAR_TYPE_TOOL_CALLS, // auto-generated by chat template parser for function calling
};
// Grammar variant struct with type and grammar string
struct common_grammar {
common_grammar_type type = COMMON_GRAMMAR_TYPE_NONE;
std::string grammar;
// Default constructor - no grammar
common_grammar() = default;
// Constructor with type and grammar string
common_grammar(common_grammar_type t, std::string g) : type(t), grammar(std::move(g)) {
GGML_ASSERT(type != COMMON_GRAMMAR_TYPE_NONE || !grammar.empty());
}
// Check if a grammar is set
bool empty() const { return type == COMMON_GRAMMAR_TYPE_NONE || grammar.empty(); }
};
// Returns the raw grammar string, or empty string if no grammar is set.
inline const std::string & common_grammar_value(const common_grammar & g) {
return g.grammar;
}
// Returns true when the generation_prompt should be prefilled into the grammar sampler.
// Only output-format and tool-call grammars need prefill; user-supplied grammars must not be prefilled.
inline bool common_grammar_needs_prefill(const common_grammar & g) {
return g.type == COMMON_GRAMMAR_TYPE_OUTPUT_FORMAT
|| g.type == COMMON_GRAMMAR_TYPE_TOOL_CALLS;
}
// sampling parameters
struct common_params_sampling {
uint32_t seed = LLAMA_DEFAULT_SEED; // the seed used to initialize llama_sampler
@@ -228,7 +267,7 @@ struct common_params_sampling {
COMMON_SAMPLER_TYPE_TEMPERATURE,
};
std::string grammar; // optional BNF-like grammar to constrain sampling
common_grammar grammar; // optional grammar constraint (user / output-format / tool-calls)
bool grammar_lazy = false;
std::vector<common_grammar_trigger> grammar_triggers; // optional triggers (for lazy grammars)
std::set<llama_token> preserved_tokens;
@@ -236,10 +275,15 @@ struct common_params_sampling {
std::vector<llama_logit_bias> logit_bias; // logit biases to apply
std::vector<llama_logit_bias> logit_bias_eog; // pre-calculated logit biases for EOG tokens
// The assistant generation prompt already prefilled into the prompt.
// Fed to the grammar sampler (to advance past pre-existing tokens) and used
// to determine the reasoning budget sampler's initial state.
// Only applied when the grammar is of output-format or tool-calls type.
std::string generation_prompt;
// reasoning budget sampler parameters
// these are populated by the server/CLI based on chat template params
int32_t reasoning_budget_tokens = -1; // -1 = disabled, >= 0 = token budget
bool reasoning_budget_activate_immediately = false;
std::vector<llama_token> reasoning_budget_start; // start tag token sequence
std::vector<llama_token> reasoning_budget_end; // end tag token sequence
std::vector<llama_token> reasoning_budget_forced; // forced sequence (message + end tag)

View File

@@ -1,9 +1,9 @@
#include "arg.h"
#include "common.h"
#include "gguf.h" // for reading GGUF splits
#include "log.h"
#include "download.h"
#include "hf-cache.h"
#define JSON_ASSERT GGML_ASSERT
#include <nlohmann/json.hpp>
@@ -15,6 +15,7 @@
#include <map>
#include <mutex>
#include <regex>
#include <unordered_set>
#include <string>
#include <thread>
#include <vector>
@@ -35,8 +36,6 @@
#endif
#endif
#define LLAMA_MAX_URL_LENGTH 2084 // Maximum URL Length in Chrome: 2083
// isatty
#if defined(_WIN32)
#include <io.h>
@@ -51,31 +50,6 @@ using json = nlohmann::ordered_json;
//
// validate repo name format: owner/repo
static bool validate_repo_name(const std::string & repo) {
static const std::regex repo_regex(R"(^[A-Za-z0-9_.\-]+\/[A-Za-z0-9_.\-]+$)");
return std::regex_match(repo, repo_regex);
}
static std::string get_manifest_path(const std::string & repo, const std::string & tag) {
// we use "=" to avoid clashing with other component, while still being allowed on windows
std::string fname = "manifest=" + repo + "=" + tag + ".json";
if (!validate_repo_name(repo)) {
throw std::runtime_error("error: repo name must be in the format 'owner/repo'");
}
string_replace_all(fname, "/", "=");
return fs_get_cache_file(fname);
}
static std::string read_file(const std::string & fname) {
std::ifstream file(fname);
if (!file) {
throw std::runtime_error(string_format("error: failed to open file '%s'\n", fname.c_str()));
}
std::string content((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
file.close();
return content;
}
static void write_file(const std::string & fname, const std::string & content) {
const std::string fname_tmp = fname + ".tmp";
std::ofstream file(fname_tmp);
@@ -132,7 +106,7 @@ static bool is_http_status_ok(int status) {
std::pair<std::string, std::string> common_download_split_repo_tag(const std::string & hf_repo_with_tag) {
auto parts = string_split<std::string>(hf_repo_with_tag, ':');
std::string tag = parts.size() > 1 ? parts.back() : "latest";
std::string tag = parts.size() > 1 ? parts.back() : "";
std::string hf_repo = parts[0];
if (string_split<std::string>(hf_repo, '/').size() != 2) {
throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]\n");
@@ -290,7 +264,8 @@ static bool common_pull_file(httplib::Client & cli,
static int common_download_file_single_online(const std::string & url,
const std::string & path,
const std::string & bearer_token,
const common_header_list & custom_headers) {
const common_header_list & custom_headers,
bool skip_etag = false) {
static const int max_attempts = 3;
static const int retry_delay_seconds = 2;
@@ -310,6 +285,11 @@ static int common_download_file_single_online(const std::string & url,
const bool file_exists = std::filesystem::exists(path);
if (file_exists && skip_etag) {
LOG_INF("%s: using cached file: %s\n", __func__, path.c_str());
return 304; // 304 Not Modified - fake cached response
}
std::string last_etag;
if (file_exists) {
last_etag = read_etag(path);
@@ -361,6 +341,12 @@ static int common_download_file_single_online(const std::string & url,
}
}
{ // silent
std::error_code ec;
std::filesystem::path p(path);
std::filesystem::create_directories(p.parent_path(), ec);
}
const std::string path_temporary = path + ".downloadInProgress";
int delay = retry_delay_seconds;
@@ -391,7 +377,7 @@ static int common_download_file_single_online(const std::string & url,
LOG_ERR("%s: unable to rename file: %s to %s\n", __func__, path_temporary.c_str(), path.c_str());
return -1;
}
if (!etag.empty()) {
if (!etag.empty() && !skip_etag) {
write_etag(path, etag);
}
return head->status;
@@ -440,9 +426,10 @@ int common_download_file_single(const std::string & url,
const std::string & path,
const std::string & bearer_token,
bool offline,
const common_header_list & headers) {
const common_header_list & headers,
bool skip_etag) {
if (!offline) {
return common_download_file_single_online(url, path, bearer_token, headers);
return common_download_file_single_online(url, path, bearer_token, headers, skip_etag);
}
if (!std::filesystem::exists(path)) {
@@ -454,193 +441,295 @@ int common_download_file_single(const std::string & url,
return 304; // Not Modified - fake cached response
}
// download multiple files from remote URLs to local paths
// the input is a vector of pairs <url, path>
static bool common_download_file_multiple(const std::vector<std::pair<std::string, std::string>> & urls,
const std::string & bearer_token,
bool offline,
const common_header_list & headers) {
// Prepare download in parallel
std::vector<std::future<bool>> futures_download;
futures_download.reserve(urls.size());
struct gguf_split_info {
std::string prefix; // tag included
std::string tag;
int index;
int count;
};
for (auto const & item : urls) {
futures_download.push_back(
std::async(
std::launch::async,
[&bearer_token, offline, &headers](const std::pair<std::string, std::string> & it) -> bool {
const int http_status = common_download_file_single(it.first, it.second, bearer_token, offline, headers);
return is_http_status_ok(http_status);
},
item
)
);
static gguf_split_info get_gguf_split_info(const std::string & path) {
static const std::regex re_split("^(.+)-([0-9]{5})-of-([0-9]{5})$", std::regex::icase);
static const std::regex re_tag("[-.]([A-Z0-9_]+)$", std::regex::icase);
std::smatch m;
std::string prefix = path;
if (!string_remove_suffix(prefix, ".gguf")) {
return {};
}
// Wait for all downloads to complete
for (auto & f : futures_download) {
if (!f.get()) {
return false;
int index = 1;
int count = 1;
if (std::regex_match(prefix, m, re_split)) {
index = std::stoi(m[2].str());
count = std::stoi(m[3].str());
prefix = m[1].str();
}
std::string tag;
if (std::regex_search(prefix, m, re_tag)) {
tag = m[1].str();
for (char & c : tag) {
c = std::toupper((unsigned char)c);
}
}
return true;
return {std::move(prefix), std::move(tag), index, count};
}
bool common_download_model(const common_params_model & model,
const std::string & bearer_token,
bool offline,
const common_header_list & headers) {
// Basic validation of the model.url
if (model.url.empty()) {
LOG_ERR("%s: invalid model url\n", __func__);
return false;
// Q4_0 -> 4, F16 -> 16, NVFP4 -> 4, Q8_K_M -> 8, etc
static int extract_quant_bits(const std::string & filename) {
auto split = get_gguf_split_info(filename);
auto pos = split.tag.find_first_of("0123456789");
if (pos == std::string::npos) {
return 0;
}
const int http_status = common_download_file_single(model.url, model.path, bearer_token, offline, headers);
if (!is_http_status_ok(http_status)) {
return false;
}
// check for additional GGUFs split to download
int n_split = 0;
{
struct gguf_init_params gguf_params = {
/*.no_alloc = */ true,
/*.ctx = */ NULL,
};
auto * ctx_gguf = gguf_init_from_file(model.path.c_str(), gguf_params);
if (!ctx_gguf) {
LOG_ERR("\n%s: failed to load input GGUF from %s\n", __func__, model.path.c_str());
return false;
}
auto key_n_split = gguf_find_key(ctx_gguf, LLM_KV_SPLIT_COUNT);
if (key_n_split >= 0) {
n_split = gguf_get_val_u16(ctx_gguf, key_n_split);
}
gguf_free(ctx_gguf);
}
if (n_split > 1) {
char split_prefix[PATH_MAX] = {0};
char split_url_prefix[LLAMA_MAX_URL_LENGTH] = {0};
// Verify the first split file format
// and extract split URL and PATH prefixes
{
if (!llama_split_prefix(split_prefix, sizeof(split_prefix), model.path.c_str(), 0, n_split)) {
LOG_ERR("\n%s: unexpected model file name: %s n_split=%d\n", __func__, model.path.c_str(), n_split);
return false;
}
if (!llama_split_prefix(split_url_prefix, sizeof(split_url_prefix), model.url.c_str(), 0, n_split)) {
LOG_ERR("\n%s: unexpected model url: %s n_split=%d\n", __func__, model.url.c_str(), n_split);
return false;
}
}
std::vector<std::pair<std::string, std::string>> urls;
for (int idx = 1; idx < n_split; idx++) {
char split_path[PATH_MAX] = {0};
llama_split_path(split_path, sizeof(split_path), split_prefix, idx, n_split);
char split_url[LLAMA_MAX_URL_LENGTH] = {0};
llama_split_path(split_url, sizeof(split_url), split_url_prefix, idx, n_split);
if (std::string(split_path) == model.path) {
continue; // skip the already downloaded file
}
urls.push_back({split_url, split_path});
}
// Download in parallel
common_download_file_multiple(urls, bearer_token, offline, headers);
}
return true;
return std::stoi(split.tag.substr(pos));
}
common_hf_file_res common_get_hf_file(const std::string & hf_repo_with_tag,
const std::string & bearer_token,
bool offline,
const common_header_list & custom_headers) {
// the returned hf_repo is without tag
auto [hf_repo, tag] = common_download_split_repo_tag(hf_repo_with_tag);
static hf_cache::hf_files get_split_files(const hf_cache::hf_files & files,
const hf_cache::hf_file & file) {
auto split = get_gguf_split_info(file.path);
std::string url = get_model_endpoint() + "v2/" + hf_repo + "/manifests/" + tag;
// headers
common_header_list headers = custom_headers;
headers.push_back({"Accept", "application/json"});
if (!bearer_token.empty()) {
headers.push_back({"Authorization", "Bearer " + bearer_token});
if (split.count <= 1) {
return {file};
}
// Important: the User-Agent must be "llama-cpp" to get the "ggufFile" field in the response
// User-Agent header is already set in common_remote_get_content, no need to set it here
hf_cache::hf_files result;
// make the request
common_remote_params params;
params.headers = headers;
long res_code = 0;
std::string res_str;
bool use_cache = false;
std::string cached_response_path = get_manifest_path(hf_repo, tag);
if (!offline) {
try {
auto res = common_remote_get_content(url, params);
res_code = res.first;
res_str = std::string(res.second.data(), res.second.size());
} catch (const std::exception & e) {
LOG_WRN("error: failed to get manifest at %s: %s\n", url.c_str(), e.what());
for (const auto & f : files) {
auto split_f = get_gguf_split_info(f.path);
if (split_f.count == split.count && split_f.prefix == split.prefix) {
result.push_back(f);
}
}
if (res_code == 0) {
if (std::filesystem::exists(cached_response_path)) {
LOG_WRN("trying to read manifest from cache: %s\n", cached_response_path.c_str());
res_str = read_file(cached_response_path);
res_code = 200;
use_cache = true;
} else {
throw std::runtime_error(
offline ? "error: failed to get manifest (offline mode)"
: "error: failed to get manifest (check your internet connection)");
return result;
}
static hf_cache::hf_file find_best_mmproj(const hf_cache::hf_files & files,
const std::string & model) {
hf_cache::hf_file best;
size_t best_depth = 0;
int best_diff = 0;
bool found = false;
auto model_bits = extract_quant_bits(model);
auto model_parts = string_split<std::string>(model, '/');
auto model_dir = model_parts.end() - 1;
for (const auto & f : files) {
if (!string_ends_with(f.path, ".gguf") ||
f.path.find("mmproj") == std::string::npos) {
continue;
}
auto mmproj_parts = string_split<std::string>(f.path, '/');
auto mmproj_dir = mmproj_parts.end() - 1;
auto [_, dir] = std::mismatch(model_parts.begin(), model_dir,
mmproj_parts.begin(), mmproj_dir);
if (dir != mmproj_dir) {
continue;
}
size_t depth = dir - mmproj_parts.begin();
auto bits = extract_quant_bits(f.path);
auto diff = std::abs(bits - model_bits);
if (!found || depth > best_depth || (depth == best_depth && diff < best_diff)) {
best = f;
best_depth = depth;
best_diff = diff;
found = true;
}
}
std::string ggufFile;
std::string mmprojFile;
return best;
}
if (res_code == 200 || res_code == 304) {
try {
auto j = json::parse(res_str);
static hf_cache::hf_file find_best_model(const hf_cache::hf_files & files,
const std::string & tag) {
std::vector<std::string> tags;
if (j.contains("ggufFile") && j["ggufFile"].contains("rfilename")) {
ggufFile = j["ggufFile"]["rfilename"].get<std::string>();
}
if (j.contains("mmprojFile") && j["mmprojFile"].contains("rfilename")) {
mmprojFile = j["mmprojFile"]["rfilename"].get<std::string>();
}
} catch (const std::exception & e) {
throw std::runtime_error(std::string("error parsing manifest JSON: ") + e.what());
}
if (!use_cache) {
// if not using cached response, update the cache file
write_file(cached_response_path, res_str);
}
} else if (res_code == 401) {
throw std::runtime_error("error: model is private or does not exist; if you are accessing a gated model, please provide a valid HF token");
if (!tag.empty()) {
tags.push_back(tag);
} else {
throw std::runtime_error(string_format("error from HF API (%s), response code: %ld, data: %s", url.c_str(), res_code, res_str.c_str()));
tags = {"Q4_K_M", "Q4_0"};
}
// check response
if (ggufFile.empty()) {
throw std::runtime_error("error: model does not have ggufFile");
for (const auto & t : tags) {
std::regex pattern(t + "[.-]", std::regex::icase);
for (const auto & f : files) {
if (string_ends_with(f.path, ".gguf") &&
f.path.find("mmproj") == std::string::npos &&
std::regex_search(f.path, pattern)) {
return f;
}
}
}
return { hf_repo, ggufFile, mmprojFile };
for (const auto & f : files) {
if (string_ends_with(f.path, ".gguf") &&
f.path.find("mmproj") == std::string::npos) {
return f;
}
}
return {};
}
static void list_available_gguf_files(const hf_cache::hf_files & files) {
LOG_INF("Available GGUF files:\n");
for (const auto & f : files) {
if (string_ends_with(f.path, ".gguf")) {
LOG_INF(" - %s\n", f.path.c_str());
}
}
}
struct hf_plan {
hf_cache::hf_files model_files;
hf_cache::hf_file mmproj;
};
static hf_plan get_hf_plan(const common_params_model & model,
const std::string & token,
const common_download_model_opts & opts) {
hf_plan plan;
hf_cache::hf_files all;
auto [repo, tag] = common_download_split_repo_tag(model.hf_repo);
if (!opts.offline) {
all = hf_cache::get_repo_files(repo, token);
}
if (all.empty()) {
all = hf_cache::get_cached_files(repo);
}
if (all.empty()) {
return plan;
}
hf_cache::hf_file primary;
if (!model.hf_file.empty()) {
for (const auto & f : all) {
if (f.path == model.hf_file) {
primary = f;
break;
}
}
if (primary.path.empty()) {
LOG_ERR("%s: file '%s' not found in repository\n", __func__, model.hf_file.c_str());
list_available_gguf_files(all);
return plan;
}
} else {
primary = find_best_model(all, tag);
if (primary.path.empty()) {
LOG_ERR("%s: no GGUF files found in repository %s\n", __func__, repo.c_str());
list_available_gguf_files(all);
return plan;
}
}
plan.model_files = get_split_files(all, primary);
if (opts.download_mmproj) {
plan.mmproj = find_best_mmproj(all, primary.path);
}
return plan;
}
struct download_task {
std::string url;
std::string path;
};
static std::vector<download_task> get_url_tasks(const common_params_model & model) {
auto split = get_gguf_split_info(model.url);
if (split.count <= 1) {
return {{model.url, model.path}};
}
auto filename = split.prefix;
if (auto pos = split.prefix.rfind('/'); pos != std::string::npos) {
filename = split.prefix.substr(pos + 1);
}
auto parent_path = std::filesystem::path(model.path).parent_path();
auto prefix_path = (parent_path / filename).string();
std::vector<download_task> tasks;
for (int i = 1; i <= split.count; i++) {
auto suffix = string_format("-%05d-of-%05d.gguf", i, split.count);
tasks.push_back({split.prefix + suffix, prefix_path + suffix});
}
return tasks;
}
common_download_model_result common_download_model(const common_params_model & model,
const std::string & bearer_token,
const common_download_model_opts & opts,
const common_header_list & headers) {
common_download_model_result result;
std::vector<download_task> tasks;
hf_plan hf;
bool is_hf = !model.hf_repo.empty();
if (is_hf) {
hf = get_hf_plan(model, bearer_token, opts);
for (const auto & f : hf.model_files) {
tasks.push_back({f.url, f.local_path});
}
if (!hf.mmproj.path.empty()) {
tasks.push_back({hf.mmproj.url, hf.mmproj.local_path});
}
} else if (!model.url.empty()) {
tasks = get_url_tasks(model);
} else {
result.model_path = model.path;
return result;
}
if (tasks.empty()) {
return result;
}
std::vector<std::future<bool>> futures;
for (const auto & task : tasks) {
futures.push_back(std::async(std::launch::async,
[&task, &bearer_token, offline = opts.offline, &headers, is_hf]() {
int status = common_download_file_single(task.url, task.path, bearer_token, offline, headers, is_hf);
return is_http_status_ok(status);
}
));
}
for (auto & f : futures) {
if (!f.get()) {
return {};
}
}
if (is_hf) {
for (const auto & f : hf.model_files) {
hf_cache::finalize_file(f);
}
result.model_path = hf.model_files[0].final_path;
if (!hf.mmproj.path.empty()) {
result.mmproj_path = hf_cache::finalize_file(hf.mmproj);
}
} else {
result.model_path = model.path;
}
return result;
}
//
@@ -765,28 +854,21 @@ std::string common_docker_resolve_model(const std::string & docker) {
}
std::vector<common_cached_model_info> common_list_cached_models() {
std::vector<common_cached_model_info> models;
const std::string cache_dir = fs_get_cache_directory();
const std::vector<common_file_info> files = fs_list(cache_dir, false);
for (const auto & file : files) {
if (string_starts_with(file.name, "manifest=") && string_ends_with(file.name, ".json")) {
common_cached_model_info model_info;
model_info.manifest_path = file.path;
std::string fname = file.name;
string_replace_all(fname, ".json", ""); // remove extension
auto parts = string_split<std::string>(fname, '=');
if (parts.size() == 4) {
// expect format: manifest=<user>=<model>=<tag>=<other>
model_info.user = parts[1];
model_info.model = parts[2];
model_info.tag = parts[3];
} else {
// invalid format
continue;
}
model_info.size = 0; // TODO: get GGUF size, not manifest size
models.push_back(model_info);
std::unordered_set<std::string> seen;
std::vector<common_cached_model_info> result;
auto files = hf_cache::get_cached_files();
for (const auto & f : files) {
auto split = get_gguf_split_info(f.path);
if (split.index != 1 || split.tag.empty() ||
split.prefix.find("mmproj") != std::string::npos) {
continue;
}
if (seen.insert(f.repo_id + ":" + split.tag).second) {
result.push_back({f.repo_id, split.tag});
}
}
return models;
return result;
}

View File

@@ -17,54 +17,60 @@ struct common_remote_params {
// get remote file content, returns <http_code, raw_response_body>
std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);
// split HF repo with tag into <repo, tag>
// for example: "user/model:tag" -> <"user/model", "tag">
// if tag is not present, default to "latest"
// example: "user/model" -> <"user/model", "latest">
// split HF repo with tag into <repo, tag>, for example:
// - "ggml-org/models:F16" -> <"ggml-org/models", "F16">
// tag is optional and can be empty
std::pair<std::string, std::string> common_download_split_repo_tag(const std::string & hf_repo_with_tag);
// Result of common_list_cached_models
struct common_cached_model_info {
std::string manifest_path;
std::string user;
std::string model;
std::string repo;
std::string tag;
size_t size = 0; // GGUF size in bytes
// return string representation like "user/model:tag"
// if tag is "latest", it will be omitted
std::string to_string() const {
return user + "/" + model + (tag == "latest" ? "" : ":" + tag);
return repo + ":" + tag;
}
};
struct common_hf_file_res {
std::string repo; // repo name with ":tag" removed
std::string ggufFile;
std::string mmprojFile;
// Options for common_download_model
struct common_download_model_opts {
bool download_mmproj = false;
bool offline = false;
};
/**
* Allow getting the HF file from the HF repo with tag (like ollama), for example:
* - bartowski/Llama-3.2-3B-Instruct-GGUF:q4
* - bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
* - bartowski/Llama-3.2-3B-Instruct-GGUF:q5_k_s
* Tag is optional, default to "latest" (meaning it checks for Q4_K_M first, then Q4, then if not found, return the first GGUF file in repo)
*
* Return pair of <repo, file> (with "repo" already having tag removed)
*
* Note: we use the Ollama-compatible HF API, but not using the blobId. Instead, we use the special "ggufFile" field which returns the value for "hf_file". This is done to be backward-compatible with existing cache files.
*/
common_hf_file_res common_get_hf_file(
const std::string & hf_repo_with_tag,
const std::string & bearer_token,
bool offline,
const common_header_list & headers = {}
);
// Result of common_download_model
struct common_download_model_result {
std::string model_path;
std::string mmproj_path;
};
// returns true if download succeeded
bool common_download_model(
// Download model from HuggingFace repo or URL
//
// input (via model struct):
// - model.hf_repo: HF repo with optional tag, see common_download_split_repo_tag
// - model.hf_file: specific file in the repo (requires hf_repo)
// - model.url: simple download (used if hf_repo is empty)
// - model.path: local file path
//
// tag matching (for HF repos without model.hf_file):
// - if tag is specified, searches for GGUF matching that quantization
// - if no tag, searches for Q4_K_M, then Q4_0, then first available GGUF
//
// split GGUF: multi-part files like "model-00001-of-00003.gguf" are automatically
// detected and all parts are downloaded
//
// caching:
// - HF repos: uses HuggingFace cache
// - URLs: uses ETag-based caching
//
// when opts.offline=true, no network requests are made
// when download_mmproj=true, searches for mmproj in same directory as model or any parent directory
// then with the closest quantization bits
//
// returns result with model_path and mmproj_path (empty on failure)
common_download_model_result common_download_model(
const common_params_model & model,
const std::string & bearer_token,
bool offline,
const common_download_model_opts & opts = {},
const common_header_list & headers = {}
);
@@ -73,11 +79,13 @@ std::vector<common_cached_model_info> common_list_cached_models();
// download single file from url to local path
// returns status code or -1 on error
// skip_etag: if true, don't read/write .etag files (for HF cache where filename is the hash)
int common_download_file_single(const std::string & url,
const std::string & path,
const std::string & bearer_token,
bool offline,
const common_header_list & headers = {});
const common_header_list & headers = {},
bool skip_etag = false);
// resolve and download model from Docker registry
// return local path to downloaded model file

762
common/hf-cache.cpp Normal file
View File

@@ -0,0 +1,762 @@
#include "hf-cache.h"
#include "common.h"
#include "log.h"
#include "http.h"
#define JSON_ASSERT GGML_ASSERT
#include <nlohmann/json.hpp>
#include <filesystem>
#include <fstream>
#include <atomic>
#include <regex> // migration only
#include <string>
#include <string_view>
#include <stdexcept>
namespace nl = nlohmann;
#if defined(_WIN32)
#define WIN32_LEAN_AND_MEAN
#ifndef NOMINMAX
#define NOMINMAX
#endif
#define HOME_DIR "USERPROFILE"
#include <windows.h>
#else
#define HOME_DIR "HOME"
#endif
namespace hf_cache {
namespace fs = std::filesystem;
static fs::path get_cache_directory() {
static const fs::path cache = []() {
struct {
const char * var;
fs::path path;
} entries[] = {
{"LLAMA_CACHE", fs::path()},
{"HF_HUB_CACHE", fs::path()},
{"HUGGINGFACE_HUB_CACHE", fs::path()},
{"HF_HOME", fs::path("hub")},
{"XDG_CACHE_HOME", fs::path("huggingface") / "hub"},
{HOME_DIR, fs::path(".cache") / "huggingface" / "hub"}
};
for (const auto & entry : entries) {
if (auto * p = std::getenv(entry.var); p && *p) {
fs::path base(p);
return entry.path.empty() ? base : base / entry.path;
}
}
throw std::runtime_error("Failed to determine HF cache directory");
}();
return cache;
}
static std::string folder_name_to_repo(const std::string & folder) {
constexpr std::string_view prefix = "models--";
if (folder.rfind(prefix, 0)) {
return {};
}
std::string result = folder.substr(prefix.length());
string_replace_all(result, "--", "/");
return result;
}
static std::string repo_to_folder_name(const std::string & repo_id) {
constexpr std::string_view prefix = "models--";
std::string result = std::string(prefix) + repo_id;
string_replace_all(result, "/", "--");
return result;
}
static fs::path get_repo_path(const std::string & repo_id) {
return get_cache_directory() / repo_to_folder_name(repo_id);
}
static bool is_hex_char(const char c) {
return (c >= 'A' && c <= 'F') ||
(c >= 'a' && c <= 'f') ||
(c >= '0' && c <= '9');
}
static bool is_hex_string(const std::string & s, size_t expected_len) {
if (s.length() != expected_len) {
return false;
}
for (const char c : s) {
if (!is_hex_char(c)) {
return false;
}
}
return true;
}
static bool is_alphanum(const char c) {
return (c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
(c >= '0' && c <= '9');
}
static bool is_special_char(char c) {
return c == '/' || c == '.' || c == '-';
}
// base chars [A-Za-z0-9_] are always valid
// special chars [/.-] must be surrounded by base chars
// exactly one '/' required
static bool is_valid_repo_id(const std::string & repo_id) {
if (repo_id.empty() || repo_id.length() > 256) {
return false;
}
int slash = 0;
bool special = true;
for (const char c : repo_id) {
if (is_alphanum(c) || c == '_') {
special = false;
} else if (is_special_char(c)) {
if (special) {
return false;
}
slash += (c == '/');
special = true;
} else {
return false;
}
}
return !special && slash == 1;
}
static bool is_valid_hf_token(const std::string & token) {
if (token.length() < 37 || token.length() > 256 ||
!string_starts_with(token, "hf_")) {
return false;
}
for (size_t i = 3; i < token.length(); ++i) {
if (!is_alphanum(token[i])) {
return false;
}
}
return true;
}
static bool is_valid_commit(const std::string & hash) {
return is_hex_string(hash, 40);
}
static bool is_valid_oid(const std::string & oid) {
return is_hex_string(oid, 40) || is_hex_string(oid, 64);
}
static bool is_valid_subpath(const fs::path & path, const fs::path & subpath) {
if (subpath.is_absolute()) {
return false; // never do a / b with b absolute
}
auto b = fs::absolute(path).lexically_normal();
auto t = (b / subpath).lexically_normal();
auto [b_end, _] = std::mismatch(b.begin(), b.end(), t.begin(), t.end());
return b_end == b.end();
}
static void safe_write_file(const fs::path & path, const std::string & data) {
fs::path path_tmp = path.string() + ".tmp";
if (path.has_parent_path()) {
fs::create_directories(path.parent_path());
}
std::ofstream file(path_tmp);
file << data;
file.close();
std::error_code ec;
if (!file.fail()) {
fs::rename(path_tmp, path, ec);
}
if (file.fail() || ec) {
fs::remove(path_tmp, ec);
throw std::runtime_error("failed to write file: " + path.string());
}
}
static nl::json api_get(const std::string & url,
const std::string & token) {
auto [cli, parts] = common_http_client(url);
httplib::Headers headers = {
{"User-Agent", "llama-cpp/" + build_info},
{"Accept", "application/json"}
};
if (is_valid_hf_token(token)) {
headers.emplace("Authorization", "Bearer " + token);
} else if (!token.empty()) {
LOG_WRN("%s: invalid token, authentication disabled\n", __func__);
}
if (auto res = cli.Get(parts.path, headers)) {
auto body = res->body;
if (res->status == 200) {
return nl::json::parse(res->body);
}
try {
body = nl::json::parse(res->body)["error"].get<std::string>();
} catch (...) { }
throw std::runtime_error("GET failed (" + std::to_string(res->status) + "): " + body);
} else {
throw std::runtime_error("HTTPLIB failed: " + httplib::to_string(res.error()));
}
}
static std::string get_repo_commit(const std::string & repo_id,
const std::string & token) {
try {
auto endpoint = get_model_endpoint();
auto json = api_get(endpoint + "api/models/" + repo_id + "/refs", token);
if (!json.is_object() ||
!json.contains("branches") || !json["branches"].is_array()) {
LOG_WRN("%s: missing 'branches' for '%s'\n", __func__, repo_id.c_str());
return {};
}
fs::path refs_path = get_repo_path(repo_id) / "refs";
std::string name;
std::string commit;
for (const auto & branch : json["branches"]) {
if (!branch.is_object() ||
!branch.contains("name") || !branch["name"].is_string() ||
!branch.contains("targetCommit") || !branch["targetCommit"].is_string()) {
continue;
}
std::string _name = branch["name"].get<std::string>();
std::string _commit = branch["targetCommit"].get<std::string>();
if (!is_valid_subpath(refs_path, _name)) {
LOG_WRN("%s: skip invalid branch: %s\n", __func__, _name.c_str());
continue;
}
if (!is_valid_commit(_commit)) {
LOG_WRN("%s: skip invalid commit: %s\n", __func__, _commit.c_str());
continue;
}
if (_name == "main") {
name = _name;
commit = _commit;
break;
}
if (name.empty() || commit.empty()) {
name = _name;
commit = _commit;
}
}
if (name.empty() || commit.empty()) {
LOG_WRN("%s: no valid branch for '%s'\n", __func__, repo_id.c_str());
return {};
}
safe_write_file(refs_path / name, commit);
return commit;
} catch (const nl::json::exception & e) {
LOG_ERR("%s: JSON error: %s\n", __func__, e.what());
} catch (const std::exception & e) {
LOG_ERR("%s: error: %s\n", __func__, e.what());
}
return {};
}
hf_files get_repo_files(const std::string & repo_id,
const std::string & token) {
if (!is_valid_repo_id(repo_id)) {
LOG_WRN("%s: invalid repository: %s\n", __func__, repo_id.c_str());
return {};
}
std::string commit = get_repo_commit(repo_id, token);
if (commit.empty()) {
LOG_WRN("%s: failed to resolve commit for %s\n", __func__, repo_id.c_str());
return {};
}
fs::path blobs_path = get_repo_path(repo_id) / "blobs";
fs::path commit_path = get_repo_path(repo_id) / "snapshots" / commit;
hf_files files;
try {
auto endpoint = get_model_endpoint();
auto json = api_get(endpoint + "api/models/" + repo_id + "/tree/" + commit + "?recursive=true", token);
if (!json.is_array()) {
LOG_WRN("%s: response is not an array for '%s'\n", __func__, repo_id.c_str());
return {};
}
for (const auto & item : json) {
if (!item.is_object() ||
!item.contains("type") || !item["type"].is_string() || item["type"] != "file" ||
!item.contains("path") || !item["path"].is_string()) {
continue;
}
hf_file file;
file.repo_id = repo_id;
file.path = item["path"].get<std::string>();
if (!is_valid_subpath(commit_path, file.path)) {
LOG_WRN("%s: skip invalid path: %s\n", __func__, file.path.c_str());
continue;
}
if (item.contains("lfs") && item["lfs"].is_object()) {
if (item["lfs"].contains("oid") && item["lfs"]["oid"].is_string()) {
file.oid = item["lfs"]["oid"].get<std::string>();
}
if (item["lfs"].contains("size") && item["lfs"]["size"].is_number()) {
file.size = item["lfs"]["size"].get<size_t>();
}
} else if (item.contains("oid") && item["oid"].is_string()) {
file.oid = item["oid"].get<std::string>();
}
if (file.size == 0 && item.contains("size") && item["size"].is_number()) {
file.size = item["size"].get<size_t>();
}
if (!file.oid.empty() && !is_valid_oid(file.oid)) {
LOG_WRN("%s: skip invalid oid: %s\n", __func__, file.oid.c_str());
continue;
}
file.url = endpoint + repo_id + "/resolve/" + commit + "/" + file.path;
fs::path final_path = commit_path / file.path;
file.final_path = final_path.string();
if (!file.oid.empty() && !fs::exists(final_path)) {
fs::path local_path = blobs_path / file.oid;
file.local_path = local_path.string();
} else {
file.local_path = file.final_path;
}
files.push_back(file);
}
} catch (const nl::json::exception & e) {
LOG_ERR("%s: JSON error: %s\n", __func__, e.what());
} catch (const std::exception & e) {
LOG_ERR("%s: error: %s\n", __func__, e.what());
}
return files;
}
static std::string get_cached_ref(const fs::path & repo_path) {
fs::path refs_path = repo_path / "refs";
if (!fs::is_directory(refs_path)) {
return {};
}
std::string fallback;
for (const auto & entry : fs::directory_iterator(refs_path)) {
if (!entry.is_regular_file()) {
continue;
}
std::ifstream f(entry.path());
std::string commit;
if (!f || !std::getline(f, commit) || commit.empty()) {
continue;
}
if (!is_valid_commit(commit)) {
LOG_WRN("%s: skip invalid commit: %s\n", __func__, commit.c_str());
continue;
}
if (entry.path().filename() == "main") {
return commit;
}
if (fallback.empty()) {
fallback = commit;
}
}
return fallback;
}
hf_files get_cached_files(const std::string & repo_id) {
fs::path cache_dir = get_cache_directory();
if (!fs::exists(cache_dir)) {
return {};
}
if (!repo_id.empty() && !is_valid_repo_id(repo_id)) {
LOG_WRN("%s: invalid repository: %s\n", __func__, repo_id.c_str());
return {};
}
hf_files files;
for (const auto & repo : fs::directory_iterator(cache_dir)) {
if (!repo.is_directory()) {
continue;
}
fs::path snapshots_path = repo.path() / "snapshots";
if (!fs::exists(snapshots_path)) {
continue;
}
std::string _repo_id = folder_name_to_repo(repo.path().filename().string());
if (!is_valid_repo_id(_repo_id)) {
continue;
}
if (!repo_id.empty() && _repo_id != repo_id) {
continue;
}
std::string commit = get_cached_ref(repo.path());
fs::path commit_path = snapshots_path / commit;
if (commit.empty() || !fs::is_directory(commit_path)) {
continue;
}
for (const auto & entry : fs::recursive_directory_iterator(commit_path)) {
if (!entry.is_regular_file() && !entry.is_symlink()) {
continue;
}
fs::path path = entry.path().lexically_relative(commit_path);
if (!path.empty()) {
hf_file file;
file.repo_id = _repo_id;
file.path = path.generic_string();
file.local_path = entry.path().string();
file.final_path = file.local_path;
files.push_back(std::move(file));
}
}
}
return files;
}
std::string finalize_file(const hf_file & file) {
static std::atomic<bool> symlinks_disabled{false};
std::error_code ec;
fs::path local_path(file.local_path);
fs::path final_path(file.final_path);
if (local_path == final_path || fs::exists(final_path, ec)) {
return file.final_path;
}
if (!fs::exists(local_path, ec)) {
return file.final_path;
}
fs::create_directories(final_path.parent_path(), ec);
if (!symlinks_disabled) {
fs::path target = fs::relative(local_path, final_path.parent_path(), ec);
if (!ec) {
fs::create_symlink(target, final_path, ec);
}
if (!ec) {
return file.final_path;
}
}
if (!symlinks_disabled.exchange(true)) {
LOG_WRN("%s: failed to create symlink: %s\n", __func__, ec.message().c_str());
LOG_WRN("%s: switching to degraded mode\n", __func__);
}
fs::rename(local_path, final_path, ec);
if (ec) {
LOG_WRN("%s: failed to move file to snapshots: %s\n", __func__, ec.message().c_str());
fs::copy(local_path, final_path, ec);
if (ec) {
LOG_ERR("%s: failed to copy file to snapshots: %s\n", __func__, ec.message().c_str());
}
}
return file.final_path;
}
// delete everything after this line, one day
// copied from download.cpp without the tag part
struct gguf_split_info {
std::string prefix; // tag included
int index;
int count;
};
static gguf_split_info get_gguf_split_info(const std::string & path) {
static const std::regex re_split("^(.+)-([0-9]{5})-of-([0-9]{5})$", std::regex::icase);
std::smatch m;
std::string prefix = path;
if (!string_remove_suffix(prefix, ".gguf")) {
return {};
}
int index = 1;
int count = 1;
if (std::regex_match(prefix, m, re_split)) {
index = std::stoi(m[2].str());
count = std::stoi(m[3].str());
prefix = m[1].str();
}
return {std::move(prefix), index, count};
}
static std::pair<std::string, std::string> parse_manifest_name(std::string & filename) {
static const std::regex re(R"(^manifest=([^=]+)=([^=]+)=.*\.json$)");
std::smatch match;
if (std::regex_match(filename, match, re)) {
return {match[1].str(), match[2].str()};
}
return {};
}
static std::string make_old_cache_filename(const std::string & owner,
const std::string & repo,
const std::string & filename) {
auto result = owner + "_" + repo + "_" + filename;
string_replace_all(result, "/", "_");
return result;
}
struct migrate_file {
std::string path;
std::string sha256;
size_t size;
fs::path old_path;
fs::path etag_path;
const hf_file * file;
};
using migrate_files = std::vector<migrate_file>;
static bool collect_file(const fs::path & old_cache,
const std::string & owner,
const std::string & repo,
const std::string & path,
const std::string & sha256,
const hf_files & files,
migrate_files & to_migrate) {
const hf_file * file = nullptr;
for (const auto & f : files) {
if (f.path == path) {
file = &f;
break;
}
}
std::string old_filename = make_old_cache_filename(owner, repo, path);
fs::path old_path = old_cache / old_filename;
fs::path etag_path = old_path.string() + ".etag";
if (!fs::exists(old_path)) {
if (file && fs::exists(file->final_path)) {
return true;
}
LOG_WRN("%s: %s not found in old cache or HF cache\n", __func__, old_filename.c_str());
return false;
}
if (!file) {
LOG_WRN("%s: %s not found in current repo\n", __func__, old_filename.c_str());
return false;
}
if (!sha256.empty() && !file->oid.empty() && sha256 != file->oid) {
LOG_WRN("%s: %s is not up to date (sha256 mismatch)\n", __func__, old_filename.c_str());
return false;
}
if (file->size > 0) {
size_t size = fs::file_size(old_path);
if (size != file->size) {
LOG_WRN("%s: %s has wrong size %zu (expected %zu)\n", __func__, old_filename.c_str(), size, file->size);
return false;
}
}
to_migrate.push_back({path, sha256, file->size, old_path, etag_path, file});
return true;
}
static bool collect_files(const fs::path & old_cache,
const std::string & owner,
const std::string & repo,
const nl::json & node,
const hf_files & files,
migrate_files & to_migrate) {
if (!node.contains("rfilename") ||
!node.contains("lfs") ||
!node["lfs"].contains("sha256")) {
return true;
}
std::string path = node["rfilename"];
std::string sha256 = node["lfs"]["sha256"];
auto split = get_gguf_split_info(path);
if (split.count <= 1) {
return collect_file(old_cache, owner, repo, path, sha256, files, to_migrate);
}
std::vector<std::pair<std::string, std::string>> splits;
for (const auto & f : files) {
auto split_f = get_gguf_split_info(f.path);
if (split_f.count == split.count && split_f.prefix == split.prefix) {
// sadly the manifest only provides the sha256 of the first file (index == 1)
// the rest will be verified using the size...
std::string f_sha256 = (split_f.index == 1) ? sha256 : "";
splits.emplace_back(f.path, f_sha256);
}
}
if ((int)splits.size() != split.count) {
LOG_WRN("%s: expected %d split files but found %d in repo\n", __func__, split.count, (int)splits.size());
return false;
}
for (const auto & [f_path, f_sha256] : splits) {
if (!collect_file(old_cache, owner, repo, f_path, f_sha256, files, to_migrate)) {
return false;
}
}
return true;
}
static bool migrate_file(const migrate_file & file) {
std::error_code ec;
fs::path new_path(file.file->local_path);
fs::create_directories(new_path.parent_path(), ec);
if (!fs::exists(new_path, ec)) {
fs::rename(file.old_path, new_path, ec);
if (ec) {
fs::copy_file(file.old_path, new_path, ec);
if (ec) {
LOG_ERR("%s: failed to move/copy %s: %s\n", __func__, file.old_path.string().c_str(), ec.message().c_str());
return false;
}
}
fs::remove(file.old_path, ec);
}
fs::remove(file.etag_path, ec);
std::string filename = finalize_file(*file.file);
LOG_INF("%s: migrated %s -> %s\n", __func__, file.old_path.filename().string().c_str(), filename.c_str());
return true;
}
void migrate_old_cache_to_hf_cache(const std::string & token, bool offline) {
fs::path old_cache = fs_get_cache_directory();
if (!fs::exists(old_cache)) {
return;
}
if (offline) {
LOG_WRN("%s: skipping migration in offline mode (will run when online)\n", __func__);
return; // -hf is not going to work
}
bool warned = false;
for (const auto & entry : fs::directory_iterator(old_cache)) {
if (!entry.is_regular_file()) {
continue;
}
auto filename = entry.path().filename().string();
auto [owner, repo] = parse_manifest_name(filename);
if (owner.empty() || repo.empty()) {
continue;
}
if (!warned) {
warned = true;
LOG_WRN("================================================================================\n"
"WARNING: Migrating cache to HuggingFace cache directory\n"
" Old cache: %s\n"
" New cache: %s\n"
"This one-time migration moves models previously downloaded with -hf\n"
"from the legacy llama.cpp cache to the standard HuggingFace cache.\n"
"Models downloaded with --model-url are not affected.\n"
"================================================================================\n",
old_cache.string().c_str(), get_cache_directory().string().c_str());
}
auto repo_id = owner + "/" + repo;
auto files = get_repo_files(repo_id, token);
if (files.empty()) {
LOG_WRN("%s: could not get repo files for %s, skipping\n", __func__, repo_id.c_str());
continue;
}
migrate_files to_migrate;
bool ok = true;
try {
std::ifstream manifest(entry.path());
auto json = nl::json::parse(manifest);
for (const char * key : {"ggufFile", "mmprojFile"}) {
if (json.contains(key)) {
if (!collect_files(old_cache, owner, repo, json[key], files, to_migrate)) {
ok = false;
break;
}
}
}
} catch (const std::exception & e) {
LOG_WRN("%s: failed to parse manifest %s: %s\n", __func__, filename.c_str(), e.what());
continue;
}
if (!ok) {
LOG_WRN("%s: migration skipped: one or more files failed validation\n", __func__);
continue;
}
for (const auto & file : to_migrate) {
if (!migrate_file(file)) {
ok = false;
break;
}
}
if (!ok) {
LOG_WRN("%s: migration failed: could not migrate all files\n", __func__);
continue;
}
LOG_INF("%s: migration complete, deleting manifest: %s\n", __func__, entry.path().string().c_str());
fs::remove(entry.path());
}
}
} // namespace hf_cache

36
common/hf-cache.h Normal file
View File

@@ -0,0 +1,36 @@
#pragma once
#include <string>
#include <vector>
// Ref: https://huggingface.co/docs/hub/local-cache.md
namespace hf_cache {
struct hf_file {
std::string path;
std::string url;
std::string local_path;
std::string final_path;
std::string oid;
std::string repo_id;
size_t size = 0; // only for the migration
};
using hf_files = std::vector<hf_file>;
// Get files from HF API
hf_files get_repo_files(
const std::string & repo_id,
const std::string & token
);
hf_files get_cached_files(const std::string & repo_id = {});
// Create snapshot path (link or move/copy) and return it
std::string finalize_file(const hf_file & file);
// TODO: Remove later
void migrate_old_cache_to_hf_cache(const std::string & token, bool offline = false);
} // namespace hf_cache

View File

@@ -53,6 +53,13 @@ private:
return tokens[current + offset];
}
const token & next() {
if (current >= tokens.size()) {
throw parser_exception("Parser Error: Unexpected EOF", source, tokens.empty() ? 0 : tokens.back().pos);
}
return tokens[current++];
}
token expect(token::type type, const std::string& error) {
const auto & t = peek();
if (t.t != type) {
@@ -90,9 +97,9 @@ private:
size_t start_pos = current;
switch (peek().t) {
case token::comment:
return mk_stmt<comment_statement>(start_pos, tokens[current++].value);
return mk_stmt<comment_statement>(start_pos, next().value);
case token::text:
return mk_stmt<string_literal>(start_pos, tokens[current++].value);
return mk_stmt<string_literal>(start_pos, next().value);
case token::open_statement:
return parse_jinja_statement();
case token::open_expression:
@@ -119,8 +126,7 @@ private:
}
size_t start_pos = current;
std::string name = peek().value;
current++; // consume identifier
std::string name = next().value;
statement_ptr result;
if (name == "set") {
@@ -202,7 +208,7 @@ private:
// Ignore generation blocks (transformers-specific)
// See https://github.com/huggingface/transformers/pull/30650 for more information.
result = mk_stmt<noop_statement>(start_pos);
current++;
++current;
} else {
throw std::runtime_error("Unknown statement: " + name);
@@ -217,7 +223,7 @@ private:
statements body;
if (is(token::equals)) {
current++;
++current;
value = parse_expression_sequence();
} else {
// parsing multiline set here
@@ -280,7 +286,7 @@ private:
exprs.push_back(primary ? parse_primary_expression() : parse_expression());
bool is_tuple = is(token::comma);
while (is(token::comma)) {
current++; // consume comma
++current; // consume comma
exprs.push_back(primary ? parse_primary_expression() : parse_expression());
}
return is_tuple ? mk_stmt<tuple_literal>(start_pos, std::move(exprs)) : std::move(exprs[0]);
@@ -290,7 +296,7 @@ private:
// e.g., `message` in `for message in messages`
auto loop_var = parse_expression_sequence(true); // should be an identifier/tuple
if (!is_identifier("in")) throw std::runtime_error("Expected 'in'");
current++;
++current; // consume 'in'
// `messages` in `for message in messages`
auto iterable = parse_expression();
@@ -305,7 +311,8 @@ private:
}
if (is_statement({"else"})) {
current += 2;
++current; // consume {%
++current; // consume 'else'
expect(token::close_statement, "Expected %}");
while (!is_statement({"endfor"})) {
alternate.push_back(parse_any());
@@ -347,7 +354,7 @@ private:
auto left = parse_logical_and_expression();
while (is_identifier("or")) {
size_t start_pos = current;
token op = tokens[current++];
token op = next();
left = mk_stmt<binary_expression>(start_pos, op, std::move(left), parse_logical_and_expression());
}
return left;
@@ -357,7 +364,7 @@ private:
auto left = parse_logical_negation_expression();
while (is_identifier("and")) {
size_t start_pos = current;
auto op = tokens[current++];
auto op = next();
left = mk_stmt<binary_expression>(start_pos, op, std::move(left), parse_logical_negation_expression());
}
return left;
@@ -367,7 +374,7 @@ private:
// Try parse unary operators
if (is_identifier("not")) {
size_t start_pos = current;
auto op = tokens[current++];
auto op = next();
return mk_stmt<unary_expression>(start_pos, op, parse_logical_negation_expression());
}
return parse_comparison_expression();
@@ -382,11 +389,12 @@ private:
size_t start_pos = current;
if (is_identifier("not") && peek(1).t == token::identifier && peek(1).value == "in") {
op = {token::identifier, "not in", tokens[current].pos};
current += 2;
++current; // consume 'not'
++current; // consume 'in'
} else if (is_identifier("in")) {
op = tokens[current++];
op = next();
} else if (is(token::comparison_binary_operator)) {
op = tokens[current++];
op = next();
} else break;
left = mk_stmt<binary_expression>(start_pos, op, std::move(left), parse_additive_expression());
}
@@ -397,7 +405,7 @@ private:
auto left = parse_multiplicative_expression();
while (is(token::additive_binary_operator)) {
size_t start_pos = current;
auto op = tokens[current++];
auto op = next();
left = mk_stmt<binary_expression>(start_pos, op, std::move(left), parse_multiplicative_expression());
}
return left;
@@ -407,7 +415,7 @@ private:
auto left = parse_test_expression();
while (is(token::multiplicative_binary_operator)) {
size_t start_pos = current;
auto op = tokens[current++];
auto op = next();
left = mk_stmt<binary_expression>(start_pos, op, std::move(left), parse_test_expression());
}
return left;
@@ -417,9 +425,9 @@ private:
auto operand = parse_filter_expression();
while (is_identifier("is")) {
size_t start_pos = current;
current++;
++current; // consume 'is'
bool negate = false;
if (is_identifier("not")) { current++; negate = true; }
if (is_identifier("not")) { ++current; negate = true; }
auto test_id = parse_primary_expression();
// FIXME: tests can also be expressed like this: if x is eq 3
if (is(token::open_paren)) test_id = parse_call_expression(std::move(test_id));
@@ -432,7 +440,7 @@ private:
auto operand = parse_call_member_expression();
while (is(token::pipe)) {
size_t start_pos = current;
current++;
++current; // consume pipe
auto filter = parse_primary_expression();
if (is(token::open_paren)) filter = parse_call_expression(std::move(filter));
operand = mk_stmt<filter_expression>(start_pos, std::move(operand), std::move(filter));
@@ -490,7 +498,7 @@ private:
statement_ptr parse_member_expression(statement_ptr object) {
size_t start_pos = current;
while (is(token::dot) || is(token::open_square_bracket)) {
auto op = tokens[current++];
auto op = next();
bool computed = op.t == token::open_square_bracket;
statement_ptr prop;
if (computed) {
@@ -536,7 +544,7 @@ private:
statement_ptr parse_primary_expression() {
size_t start_pos = current;
auto t = tokens[current++];
auto t = next();
switch (t.t) {
case token::numeric_literal:
if (t.value.find('.') != std::string::npos) {
@@ -547,7 +555,7 @@ private:
case token::string_literal: {
std::string val = t.value;
while (is(token::string_literal)) {
val += tokens[current++].value;
val += next().value;
}
return mk_stmt<string_literal>(start_pos, val);
}
@@ -562,9 +570,9 @@ private:
statements vals;
while (!is(token::close_square_bracket)) {
vals.push_back(parse_expression());
if (is(token::comma)) current++;
if (is(token::comma)) ++current;
}
current++;
++current;
return mk_stmt<array_literal>(start_pos, std::move(vals));
}
case token::open_curly_bracket: {
@@ -573,9 +581,9 @@ private:
auto key = parse_expression();
expect(token::colon, "Expected :");
pairs.push_back({std::move(key), parse_expression()});
if (is(token::comma)) current++;
if (is(token::comma)) ++current;
}
current++;
++current;
return mk_stmt<object_literal>(start_pos, std::move(pairs));
}
default:

View File

@@ -667,8 +667,9 @@ value macro_statement::execute_impl(context & ctx) {
if (is_stmt<identifier>(this->args[i])) {
// normal parameter
std::string param_name = cast_stmt<identifier>(this->args[i])->val;
JJ_DEBUG(" Binding parameter '%s' to argument of type %s", param_name.c_str(), args.get_pos(i)->type().c_str());
macro_ctx.set_val(param_name, args.get_pos(i));
value param_value = args.get_kwarg_or_pos(param_name, i);
JJ_DEBUG(" Binding parameter '%s' to argument of type %s", param_name.c_str(), param_value->type().c_str());
macro_ctx.set_val(param_name, param_value);
} else if (is_stmt<keyword_argument_expression>(this->args[i])) {
// default argument used as normal parameter
auto kwarg = cast_stmt<keyword_argument_expression>(this->args[i]);
@@ -676,8 +677,9 @@ value macro_statement::execute_impl(context & ctx) {
throw std::runtime_error("Keyword argument key must be an identifier in macro '" + name + "'");
}
std::string param_name = cast_stmt<identifier>(kwarg->key)->val;
JJ_DEBUG(" Binding parameter '%s' to argument of type %s", param_name.c_str(), args.get_pos(i)->type().c_str());
macro_ctx.set_val(param_name, args.get_pos(i));
value param_value = args.get_kwarg_or_pos(param_name, i);
JJ_DEBUG(" Binding parameter '%s' to argument of type %s", param_name.c_str(), param_value->type().c_str());
macro_ctx.set_val(param_name, param_value);
} else {
throw std::runtime_error("Invalid parameter type in macro '" + name + "'");
}

View File

@@ -451,7 +451,7 @@ struct value_array_t : public value_t {
}
protected:
virtual bool equivalent(const value_t & other) const override {
return typeid(*this) == typeid(other) && is_hashable() && other.is_hashable() && std::equal(val_arr.begin(), val_arr.end(), other.val_arr.begin(), value_equivalence());
return typeid(*this) == typeid(other) && is_hashable() && other.is_hashable() && std::equal(val_arr.begin(), val_arr.end(), other.val_arr.begin(), other.val_arr.end(), value_equivalence());
}
};
using value_array = std::shared_ptr<value_array_t>;
@@ -587,7 +587,7 @@ struct value_object_t : public value_t {
}
protected:
virtual bool equivalent(const value_t & other) const override {
return typeid(*this) == typeid(other) && is_hashable() && other.is_hashable() && std::equal(val_obj.begin(), val_obj.end(), other.val_obj.begin(), value_equivalence());
return typeid(*this) == typeid(other) && is_hashable() && other.is_hashable() && std::equal(val_obj.begin(), val_obj.end(), other.val_obj.begin(), other.val_obj.end(), value_equivalence());
}
};
using value_object = std::shared_ptr<value_object_t>;

View File

@@ -163,9 +163,15 @@ static void common_reasoning_budget_reset(struct llama_sampler * smpl) {
ctx->force_pos = 0;
}
// forward declaration for use in clone
static struct llama_sampler * common_reasoning_budget_init_state(
const struct llama_vocab * vocab, const std::vector<llama_token> & start_tokens,
const std::vector<llama_token> & end_tokens, const std::vector<llama_token> & forced_tokens,
int32_t budget, common_reasoning_budget_state initial_state);
static struct llama_sampler * common_reasoning_budget_clone(const struct llama_sampler * smpl) {
const auto * ctx = (const common_reasoning_budget_ctx *) smpl->ctx;
return common_reasoning_budget_init(
return common_reasoning_budget_init_state(
ctx->vocab,
ctx->start_matcher.tokens,
ctx->end_matcher.tokens,
@@ -191,13 +197,13 @@ static struct llama_sampler_i common_reasoning_budget_i = {
/* .backend_set_input = */ nullptr,
};
struct llama_sampler * common_reasoning_budget_init(
const struct llama_vocab * vocab,
const std::vector<llama_token> & start_tokens,
const std::vector<llama_token> & end_tokens,
const std::vector<llama_token> & forced_tokens,
int32_t budget,
common_reasoning_budget_state initial_state) {
static struct llama_sampler * common_reasoning_budget_init_state(
const struct llama_vocab * vocab,
const std::vector<llama_token> & start_tokens,
const std::vector<llama_token> & end_tokens,
const std::vector<llama_token> & forced_tokens,
int32_t budget,
common_reasoning_budget_state initial_state) {
// promote COUNTING with budget <= 0 to FORCING
if (initial_state == REASONING_BUDGET_COUNTING && budget <= 0) {
initial_state = REASONING_BUDGET_FORCING;
@@ -217,3 +223,41 @@ struct llama_sampler * common_reasoning_budget_init(
}
);
}
struct llama_sampler * common_reasoning_budget_init(
const struct llama_vocab * vocab,
const std::vector<llama_token> & start_tokens,
const std::vector<llama_token> & end_tokens,
const std::vector<llama_token> & forced_tokens,
int32_t budget,
const std::vector<llama_token> & prefill_tokens) {
// Determine initial state from prefill: COUNTING if the prefill begins with
// the start sequence but does not also contain the end sequence after it.
common_reasoning_budget_state initial_state = REASONING_BUDGET_IDLE;
if (!prefill_tokens.empty() && !start_tokens.empty() &&
prefill_tokens.size() >= start_tokens.size() &&
std::equal(start_tokens.begin(), start_tokens.end(), prefill_tokens.begin())) {
initial_state = REASONING_BUDGET_COUNTING;
// If the end sequence also follows the start in the prefill, reasoning
// was opened and immediately closed — stay IDLE.
if (!end_tokens.empty() &&
prefill_tokens.size() >= start_tokens.size() + end_tokens.size()) {
auto end_start = prefill_tokens.end() - (ptrdiff_t) end_tokens.size();
if (end_start >= prefill_tokens.begin() + (ptrdiff_t) start_tokens.size() &&
std::equal(end_tokens.begin(), end_tokens.end(), end_start)) {
initial_state = REASONING_BUDGET_IDLE;
}
}
}
return common_reasoning_budget_init_state(vocab, start_tokens, end_tokens, forced_tokens, budget, initial_state);
}
struct llama_sampler * common_reasoning_budget_init(
const struct llama_vocab * vocab,
const std::vector<llama_token> & start_tokens,
const std::vector<llama_token> & end_tokens,
const std::vector<llama_token> & forced_tokens,
int32_t budget,
common_reasoning_budget_state initial_state) {
return common_reasoning_budget_init_state(vocab, start_tokens, end_tokens, forced_tokens, budget, initial_state);
}

View File

@@ -24,14 +24,26 @@ enum common_reasoning_budget_state {
// DONE: passthrough forever
//
// Parameters:
// vocab - vocabulary (used for UTF-8 boundary detection; can be nullptr)
// start_tokens - token sequence that activates counting
// end_tokens - token sequence for natural deactivation
// forced_tokens - token sequence forced when budget expires
// budget - max tokens allowed in the reasoning block
// initial_state - initial state of the sampler (e.g. IDLE or COUNTING)
// note: COUNTING with budget <= 0 is promoted to FORCING
// vocab - vocabulary (used for UTF-8 boundary detection; can be nullptr)
// start_tokens - token sequence that activates counting
// end_tokens - token sequence for natural deactivation
// forced_tokens - token sequence forced when budget expires
// budget - max tokens allowed in the reasoning block
// prefill_tokens - tokens already present in the prompt (generation prompt);
// used to determine the initial state: COUNTING if they begin
// with start_tokens (but don't also end with end_tokens),
// IDLE otherwise. COUNTING with budget <= 0 is promoted to FORCING.
//
struct llama_sampler * common_reasoning_budget_init(
const struct llama_vocab * vocab,
const std::vector<llama_token> & start_tokens,
const std::vector<llama_token> & end_tokens,
const std::vector<llama_token> & forced_tokens,
int32_t budget,
const std::vector<llama_token> & prefill_tokens = {});
// Variant that takes an explicit initial state (used by tests and clone).
// COUNTING with budget <= 0 is promoted to FORCING.
struct llama_sampler * common_reasoning_budget_init(
const struct llama_vocab * vocab,
const std::vector<llama_token> & start_tokens,

View File

@@ -1,13 +1,16 @@
#include "sampling.h"
#include "common.h"
#include "ggml.h"
#include "log.h"
#include "reasoning-budget.h"
#include <algorithm>
#include <cctype>
#include <cmath>
#include <cstring>
#include <unordered_map>
#include <vector>
// the ring buffer works similarly to std::deque, but with a fixed capacity
// TODO: deduplicate with llama-impl.h
@@ -189,9 +192,10 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
std::vector<llama_sampler *> samplers;
if (params.grammar.compare(0, 11, "%llguidance") == 0) {
const std::string & grammar_str = common_grammar_value(params.grammar);
if (grammar_str.compare(0, 11, "%llguidance") == 0) {
#ifdef LLAMA_USE_LLGUIDANCE
grmr = llama_sampler_init_llg(vocab, "lark", params.grammar.c_str());
grmr = llama_sampler_init_llg(vocab, "lark", grammar_str.c_str());
#else
GGML_ABORT("llguidance (cmake -DLLAMA_LLGUIDANCE=ON) is not enabled");
#endif // LLAMA_USE_LLGUIDANCE
@@ -240,17 +244,46 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
trigger_patterns_c.push_back(regex.c_str());
}
if (!params.grammar.empty()) {
if (!grammar_str.empty()) {
if (params.grammar_lazy) {
grmr = llama_sampler_init_grammar_lazy_patterns(vocab, params.grammar.c_str(), "root",
grmr = llama_sampler_init_grammar_lazy_patterns(vocab, grammar_str.c_str(), "root",
trigger_patterns_c.data(), trigger_patterns_c.size(),
trigger_tokens.data(), trigger_tokens.size());
} else {
grmr = llama_sampler_init_grammar(vocab, params.grammar.c_str(), "root");
grmr = llama_sampler_init_grammar(vocab, grammar_str.c_str(), "root");
}
}
}
// Feed generation prompt tokens to the grammar sampler so it advances past
// tokens the template already placed in the prompt.
// Only applies to output-format and tool-call grammars; user-supplied grammars must not be prefilled.
std::vector<llama_token> prefill_tokens;
if (!params.generation_prompt.empty() && common_grammar_needs_prefill(params.grammar)) {
GGML_ASSERT(vocab != nullptr);
prefill_tokens = common_tokenize(vocab, params.generation_prompt, false, true);
if (!prefill_tokens.empty()) {
std::string first_token = common_token_to_piece(vocab, prefill_tokens[0], true);
if (std::isspace(first_token[0]) && !std::isspace(params.generation_prompt[0])) {
// Some tokenizers will add a space before the first special token, need to remove
prefill_tokens = std::vector<llama_token>(prefill_tokens.begin() + 1, prefill_tokens.end());
}
}
if (grmr) {
try {
for (const auto & token : prefill_tokens) {
llama_sampler_accept(grmr, token);
LOG_DBG("%s: accepted prefill token (%d)\n", __func__, token);
}
} catch (std::exception &e) {
LOG_ERR("%s: error initializing grammar sampler for grammar:\n%s\n\nGeneration prompt:\n'%s'\n", __func__,
common_grammar_value(params.grammar).c_str(), params.generation_prompt.c_str());
throw e;
}
}
}
// reasoning budget sampler — added first so it can force tokens before other samplers
if (params.reasoning_budget_tokens >= 0 && !params.reasoning_budget_forced.empty()) {
samplers.push_back(common_reasoning_budget_init(
@@ -259,7 +292,7 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
params.reasoning_budget_end,
params.reasoning_budget_forced,
params.reasoning_budget_tokens,
params.reasoning_budget_activate_immediately ? REASONING_BUDGET_COUNTING : REASONING_BUDGET_IDLE));
prefill_tokens));
}
if (params.has_logit_bias()) {

View File

@@ -31,10 +31,10 @@ import gguf
from gguf.vocab import MistralTokenizerType, MistralVocab
try:
from mistral_common.tokens.tokenizers.base import TokenizerVersion # pyright: ignore[reportMissingImports]
from mistral_common.tokens.tokenizers.multimodal import DATASET_MEAN as _MISTRAL_COMMON_DATASET_MEAN, DATASET_STD as _MISTRAL_COMMON_DATASET_STD # pyright: ignore[reportMissingImports]
from mistral_common.tokens.tokenizers.tekken import Tekkenizer # pyright: ignore[reportMissingImports]
from mistral_common.tokens.tokenizers.sentencepiece import ( # pyright: ignore[reportMissingImports]
from mistral_common.tokens.tokenizers.base import TokenizerVersion # type: ignore[import-not-found]
from mistral_common.tokens.tokenizers.multimodal import DATASET_MEAN as _MISTRAL_COMMON_DATASET_MEAN, DATASET_STD as _MISTRAL_COMMON_DATASET_STD # type: ignore[import-not-found]
from mistral_common.tokens.tokenizers.tekken import Tekkenizer # type: ignore[import-not-found]
from mistral_common.tokens.tokenizers.sentencepiece import ( # type: ignore[import-not-found]
SentencePieceTokenizer,
)
@@ -45,9 +45,9 @@ except ImportError:
_MISTRAL_COMMON_DATASET_STD = (0.26862954, 0.26130258, 0.27577711)
_mistral_common_installed = False
TokenizerVersion = None
Tekkenizer = None
SentencePieceTokenizer = None
TokenizerVersion: Any = None
Tekkenizer: Any = None
SentencePieceTokenizer: Any = None
_mistral_import_error_msg = (
"Mistral format requires `mistral-common` to be installed. Please run "
"`pip install mistral-common[image,audio]` to install it."
@@ -145,6 +145,7 @@ class ModelBase:
self.model_name = model_name
self.dir_model_card = dir_model # overridden in convert_lora_to_gguf.py
self._is_nvfp4 = False
self._is_mxfp4 = False
# Apply heuristics to figure out typical tensor encoding based on first tensor's dtype
# NOTE: can't use field "torch_dtype" in config.json, because some finetunes lie.
@@ -220,7 +221,7 @@ class ModelBase:
if weight_map is None or not isinstance(weight_map, dict):
raise ValueError(f"Can't load 'weight_map' from {index_name!r}")
tensor_names_from_index.update(weight_map.keys())
part_dict: dict[str, None] = dict.fromkeys(weight_map.values(), None)
part_dict: dict[str, None] = dict.fromkeys(weight_map.values(), None) # ty: ignore[invalid-assignment]
part_names = sorted(part_dict.keys())
else:
weight_map = {}
@@ -712,6 +713,7 @@ class ModelBase:
def prepare_tensors(self):
# detect NVFP4 quantization (ModelOpt format)
quant_algo = (self.hparams.get("quantization_config") or {}).get("quant_algo")
quant_method = (self.hparams.get("quantization_config") or {}).get("quant_method")
quant_layers = (self.hparams.get("quantization_config") or {}).get("quantized_layers") or {}
quant_config_file = self.dir_model / "hf_quant_config.json"
@@ -728,6 +730,7 @@ class ModelBase:
quant_algo = "NVFP4"
self._is_nvfp4 = quant_algo == "NVFP4"
self._is_mxfp4 = quant_method == "mxfp4"
# NVFP4 weights are repacked and written directly to gguf_writer.
# This must run before dequant_model so NVFP4 tensors are removed
@@ -876,6 +879,12 @@ class ModelBase:
if self.metadata.name is None:
self.metadata.name = self.dir_model.name
if self.ftype in (gguf.LlamaFileType.ALL_F32, gguf.LlamaFileType.MOSTLY_F16, gguf.LlamaFileType.MOSTLY_BF16):
if self._is_nvfp4:
self.ftype = gguf.LlamaFileType.MOSTLY_NVFP4
elif self._is_mxfp4:
self.ftype = gguf.LlamaFileType.MOSTLY_MXFP4_MOE
# Generate parameter weight class (useful for leader boards) if not yet determined
if self.metadata.size_label is None and total_params > 0:
self.metadata.size_label = gguf.size_label(total_params, shared_params, expert_params, expert_count)
@@ -938,6 +947,9 @@ class ModelBase:
if "thinker_config" in config:
# rename for Qwen2.5-Omni
config["text_config"] = config["thinker_config"]["text_config"]
if "language_config" in config:
# rename for DeepSeekOCR
config["text_config"] = config["language_config"]
if "lfm" in config:
# rename for LFM2-Audio
config["text_config"] = config["lfm"]
@@ -1062,6 +1074,10 @@ class TextModel(ModelBase):
self.gguf_writer.add_head_count_kv(n_head_kv)
logger.info(f"gguf: key-value head count = {n_head_kv}")
if self.hparams.get("is_causal") is False:
self.gguf_writer.add_causal_attention(False)
logger.info("gguf: causal attention = False")
# TODO: Handle "sliding_attention" similarly when models start implementing it
rope_params = self.rope_parameters.get("full_attention", self.rope_parameters)
if (rope_type := rope_params.get("rope_type")) is not None:
@@ -1490,6 +1506,9 @@ class TextModel(ModelBase):
if chkhsh == "e4d54df1ebc1f2b91acd986c5b51aa50837d5faf7c7398e73c1f9e9ee5d19869":
# ref: https://huggingface.co/kakaocorp/kanana-2-30b-a3b-instruct-2601
res = "kanana2"
if chkhsh == "862f827721df956049dff5ca81a57f29e575280bc622e290d3bf4e35eca29015":
# ref: https://huggingface.co/codefuse-ai/F2LLM-v2-4B
res = "f2llmv2"
if res is None:
logger.warning("\n")
@@ -2058,7 +2077,7 @@ class MmprojModel(ModelBase):
preprocessor_config: dict[str, Any]
global_config: dict[str, Any]
n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "encoder_layers", "vt_num_hidden_layers"]
n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers", "vt_num_hidden_layers"]
has_vision_encoder: bool = True # by default
has_audio_encoder: bool = False
@@ -4260,6 +4279,16 @@ class Qwen25OmniModel(Qwen2VLVisionModel):
@ModelBase.register("InternVisionModel")
class InternVisionModel(MmprojModel):
min_dynamic_tiles: int = 0
max_dynamic_tiles: int = 0
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
assert self.hparams_vision is not None
self.min_dynamic_tiles = self.global_config.get("min_dynamic_patch", 0)
self.max_dynamic_tiles = self.global_config.get("max_dynamic_patch", 0)
def set_gguf_parameters(self):
assert self.hparams_vision is not None
if isinstance(self.hparams_vision['image_size'], list):
@@ -4282,6 +4311,11 @@ class InternVisionModel(MmprojModel):
downsample_ratio = self.global_config.get("downsample_ratio")
assert downsample_ratio is not None
self.gguf_writer.add_vision_projector_scale_factor(int(1.0 / downsample_ratio))
# older models may not have min/max_dynamic_patch in config
if self.min_dynamic_tiles > 0:
self.gguf_writer.add_vision_preproc_min_tiles(self.min_dynamic_tiles)
if self.max_dynamic_tiles > 0:
self.gguf_writer.add_vision_preproc_max_tiles(self.max_dynamic_tiles)
def tensor_force_quant(self, name, new_name, bid, n_dims):
if ".position_embd." in new_name:
@@ -4544,7 +4578,7 @@ class Qwen2MoeModel(TextModel):
raise ValueError(f"Unprocessed experts: {experts}")
@ModelBase.register("Qwen3ForCausalLM")
@ModelBase.register("Qwen3ForCausalLM", "Qwen3Model")
class Qwen3Model(Qwen2Model):
model_arch = gguf.MODEL_ARCH.QWEN3
@@ -5878,7 +5912,7 @@ class InternLM2Model(TextModel):
logger.error(f'Error: Missing {tokenizer_path}')
sys.exit(1)
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue]
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue] # ty: ignore[unresolved-attribute]
sentencepiece_model.ParseFromString(open(tokenizer_path, "rb").read())
add_prefix = sentencepiece_model.normalizer_spec.add_dummy_prefix
@@ -6199,7 +6233,7 @@ class BertModel(TextModel):
vocab_size = max(self.hparams.get("vocab_size", 0), tokenizer.vocab_size)
else:
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue]
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue] # ty: ignore[unresolved-attribute]
sentencepiece_model.ParseFromString(open(tokenizer_path, "rb").read())
assert sentencepiece_model.trainer_spec.model_type == 1 # UNIGRAM
@@ -6907,6 +6941,68 @@ class ConformerAudioModel(MmprojModel):
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("DeepseekOCRForCausalLM")
class DeepseekOCRVisionModel(MmprojModel):
def set_gguf_parameters(self):
super().set_gguf_parameters()
hparams = self.hparams
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.DEEPSEEKOCR)
# default values below are taken from HF tranformers code
self.gguf_writer.add_vision_attention_layernorm_eps(hparams.get("layer_norm_eps", 1e-6))
self.gguf_writer.add_vision_use_gelu(True)
# calculate proj_scale_factor (used by tinygemma3 test model)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
n_per_side = int(image_seq_length ** 0.5)
image_size = self.hparams["image_size"]
patch_size = self.hparams["patch_size"]
proj_scale_factor = (image_size // patch_size) // n_per_side
if proj_scale_factor > 0 and proj_scale_factor != 4:
# we only need to write this if it's not the default value
# in this case, we are converting a test model
self.gguf_writer.add_vision_projector_scale_factor(proj_scale_factor)
# @bluebread: there's no window_size in config but just add it here anyway
self.gguf_writer.add_vision_window_size(self.hparams.get("window_size", 14))
# SAM configuration
sam_hparams = hparams['sam']
self.gguf_writer.add_vision_sam_layers_count(sam_hparams['layers'])
self.gguf_writer.add_vision_sam_embedding_length(sam_hparams['width'])
self.gguf_writer.add_vision_sam_head_count(sam_hparams['heads'])
def get_vision_config(self) -> dict[str, Any]:
vision_config: dict[str, Any] | None = self.global_config.get("vision_config")
if not vision_config:
raise ValueError("DeepseekOCR model requires 'vision_config' in the model configuration, but it was not found")
vision_config['sam'] = vision_config['width']['sam_vit_b']
vision_config.update(vision_config['width']['clip-l-14-224'])
vision_config['hidden_size'] = vision_config['width']
vision_config['num_heads'] = vision_config['heads']
vision_config['intermediate_size'] = vision_config['heads'] * 4
return vision_config
def tensor_force_quant(self, name, new_name, bid, n_dims):
if ".embeddings." in name or 'pos_embed' in name:
return gguf.GGMLQuantizationType.F32
if ".rel_pos_h" in name or '.rel_pos_w' in name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# Only process vision-related tensors, skip language model tensors
# Vision components: sam_model, vision_model, projector, image_newline, view_seperator
# Language model components to skip: lm_head, embed_tokens, layers, norm
if name.startswith(("lm_head.", "model.embed_tokens.", "model.layers.", "model.norm.")):
return
if name.endswith("pos_embed") or name.endswith("rel_pos_h") or name.endswith("rel_pos_w"):
name += ".weight"
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("Gemma3nForConditionalGeneration")
class Gemma3nVisionAudioModel(ConformerAudioModel):
has_audio_encoder = True
@@ -8252,6 +8348,19 @@ class DeepseekV2Model(TextModel):
merge_expert = True
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
hparams: dict = ModelBase.load_hparams(self.dir_model, is_mistral_format=False)
self.origin_hf_arch = hparams.get('architectures', [None])[0]
# special handling for Deepseek OCR
if self.origin_hf_arch == "DeepseekOCRForCausalLM":
self.model_arch = gguf.MODEL_ARCH.DEEPSEEK2OCR
self.gguf_writer.arch = gguf.MODEL_ARCH_NAMES[self.model_arch]
self.gguf_writer.add_architecture()
# default jinja template
self.gguf_writer.add_chat_template("{% for m in messages %}{{m['content']}}{% endfor %}")
def set_vocab(self):
try:
self._set_vocab_gpt2()
@@ -8307,9 +8416,15 @@ class DeepseekV2Model(TextModel):
raise NotImplementedError(f"Deepseek pre-tokenizer {tokpre!r} is not supported yet!")
def set_gguf_parameters(self):
is_ocr = (self.model_arch == gguf.MODEL_ARCH.DEEPSEEK2OCR)
# note: deepseek2 using MLA converts into MQA (ie: GQA with 1 group)
self.hparams["num_key_value_heads"] = 1
if is_ocr:
self.hparams['rope_theta'] = self.hparams.get('rope_theta', 10000.0)
else:
# note: deepseek2 using MLA converts into MQA (ie: GQA with 1 group)
self.hparams["num_key_value_heads"] = 1
self.hparams['rms_norm_eps'] = self.hparams.get('rms_norm_eps', 1e-6)
super().set_gguf_parameters()
hparams = self.hparams
@@ -8323,16 +8438,18 @@ class DeepseekV2Model(TextModel):
# Default: if no MoE, all layers are dense; if MoE, none are dense
first_k_dense_replace = hparams["num_hidden_layers"] if not has_moe else 0
self.gguf_writer.add_leading_dense_block_count(first_k_dense_replace)
kv_lora_rank = hparams.get("kv_lora_rank", 512)
self.gguf_writer.add_vocab_size(hparams["vocab_size"])
if "q_lora_rank" in hparams and hparams["q_lora_rank"] is not None:
self.gguf_writer.add_q_lora_rank(hparams["q_lora_rank"])
self.gguf_writer.add_kv_lora_rank(hparams["kv_lora_rank"])
# note: deepseek2 using MLA converts into MQA with larger heads, then decompresses to MHA
self.gguf_writer.add_key_length(hparams["kv_lora_rank"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length(hparams["kv_lora_rank"])
self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])
if not is_ocr:
self.gguf_writer.add_kv_lora_rank(kv_lora_rank)
self.gguf_writer.add_key_length(kv_lora_rank + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length(kv_lora_rank)
self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])
# MoE parameters (required by C++ code for DEEPSEEK2 arch)
# For non-MoE models like Youtu, use intermediate_size as expert_feed_forward_length
@@ -8364,8 +8481,15 @@ class DeepseekV2Model(TextModel):
_experts: list[dict[str, Tensor]] | None = None
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# skip vision tensors and remove "language_model." for Kimi-VL and Kimi-K2.5
if "vision_tower" in name or "multi_modal_projector" in name or "mm_projector" in name:
# skip vision tensors and remove "language_model." for Kimi-VL and Kimi-K2.5, and DeepSeek-OCR
if ("vision_tower" in name
or "multi_modal_projector" in name
or "mm_projector" in name
or "vision_model" in name
or "image_newline" in name
or "model.projector" in name
or "sam_model" in name
or "view_seperator" in name):
return
if name.startswith("siglip2.") or name.startswith("merger."):
return
@@ -8876,7 +9000,7 @@ class T5Model(TextModel):
if not tokenizer_path.is_file():
raise FileNotFoundError(f"File not found: {tokenizer_path}")
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue]
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue] # ty: ignore[unresolved-attribute]
sentencepiece_model.ParseFromString(open(tokenizer_path, "rb").read())
# some models like Pile-T5 family use BPE tokenizer instead of Unigram
@@ -9013,7 +9137,7 @@ class T5EncoderModel(TextModel):
if not tokenizer_path.is_file():
raise FileNotFoundError(f"File not found: {tokenizer_path}")
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue]
sentencepiece_model = model.ModelProto() # pyright: ignore[reportAttributeAccessIssue] # ty: ignore[unresolved-attribute]
sentencepiece_model.ParseFromString(open(tokenizer_path, "rb").read())
# some models like Pile-T5 family use BPE tokenizer instead of Unigram
@@ -11121,8 +11245,7 @@ class GptOssModel(TextModel):
# TODO: remove once MXFP4 is supported more generally
def dequant_model(self):
quant_config = self.hparams.get("quantization_config")
if quant_config is not None and quant_config.get("quant_method") == "mxfp4":
if self._is_mxfp4:
return
return super().dequant_model()
@@ -12275,6 +12398,7 @@ class LazyTorchTensor(gguf.LazyBase):
kwargs = {}
if func is torch.Tensor.numpy:
assert len(args)
return args[0].numpy()
return cls._wrap_fn(func)(*args, **kwargs)

View File

@@ -154,6 +154,7 @@ models = [
{"name": "qwen35", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Qwen/Qwen3.5-9B-Instruct", },
{"name": "joyai-llm", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jdopensource/JoyAI-LLM-Flash", },
{"name": "kanana2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/kakaocorp/kanana-2-30b-a3b-instruct-2601", },
{"name": "f2llmv2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/codefuse-ai/F2LLM-v2-4B", },
]
# some models are known to be broken upstream, so we will skip them as exceptions

View File

@@ -112,11 +112,11 @@ class Tensor:
(n_dims, name_len, dtype) = struct.unpack('<3I', data[offset:offset + 12])
assert n_dims >= 0 and n_dims <= 4, f'Invalid tensor dimensions {n_dims}'
assert name_len < 4096, 'Absurd tensor name length'
quant = gguf.GGML_QUANT_SIZES.get(dtype)
self.dtype = gguf.GGMLQuantizationType(dtype)
quant = gguf.GGML_QUANT_SIZES.get(self.dtype)
assert quant is not None, 'Unknown tensor type'
(blksize, tysize) = quant
offset += 12
self.dtype= gguf.GGMLQuantizationType(dtype)
self.dims = struct.unpack(f'<{n_dims}I', data[offset:offset + (4 * n_dims)])
offset += 4 * n_dims
self.name = bytes(data[offset:offset + name_len])

View File

@@ -199,10 +199,13 @@ class LoraTorchTensor:
kwargs = {}
if func is torch.permute:
assert len(args)
return type(args[0]).permute(*args, **kwargs)
elif func is torch.reshape:
assert len(args)
return type(args[0]).reshape(*args, **kwargs)
elif func is torch.stack:
assert len(args)
assert isinstance(args[0], Sequence)
dim = kwargs.get("dim", 0)
assert dim == 0
@@ -211,6 +214,7 @@ class LoraTorchTensor:
torch.stack([b._lora_B for b in args[0]], dim),
)
elif func is torch.cat:
assert len(args)
assert isinstance(args[0], Sequence)
dim = kwargs.get("dim", 0)
assert dim == 0
@@ -362,7 +366,7 @@ if __name__ == '__main__':
logger.error(f"Model {hparams['architectures'][0]} is not supported")
sys.exit(1)
class LoraModel(model_class):
class LoraModel(model_class): # ty: ignore[unsupported-base]
model_arch = model_class.model_arch
lora_alpha: float

View File

@@ -14,7 +14,7 @@ The unified auto-parser uses a pure differential, compositional approach (inspir
**Analysis + Parser Building in Two Steps**:
1. `autoparser::autoparser tmpl_analysis(tmpl)` — runs all differential comparisons and populates the analysis structs
2. `autoparser::peg_generator::generate_parser(tmpl, params, tmpl_analysis)` — uses the analysis to build a PEG parser and optional GBNF grammar
2. `autoparser::peg_generator::generate_parser(tmpl, generation_params, tmpl_analysis)` — uses the analysis to build a PEG parser and optional GBNF grammar
## Data Structures
@@ -34,7 +34,7 @@ All structs are defined in [common/chat-auto-parser.h](common/chat-auto-parser.h
### `analyze_tools` and its sub-structs
- [common/chat-auto-parser.h:176-194](common/chat-auto-parser.h#L176-L194) — `tool_format_analysis`: `mode` enum, `section_start/end`, `per_call_start/end`, JSON field names (`function_field`, `name_field`, `args_field`, `id_field`, `gen_id_field`), and format flags (`fun_name_is_key`, `tools_array_wrapped`, `uses_python_dicts`)
- [common/chat-auto-parser.h:176-194](common/chat-auto-parser.h#L176-L194) — `tool_format_analysis`: `mode` enum, `section_start/end`, `per_call_start/end`, JSON field names (`function_field`, `name_field`, `args_field`, `id_field`, `gen_id_field`), and format flags (`fun_name_is_key`, `tools_array_wrapped`)
- [common/chat-auto-parser.h:196-200](common/chat-auto-parser.h#L196-L200) — `tool_function_analysis`: `name_prefix`, `name_suffix`, `close` markers around function names
- [common/chat-auto-parser.h:202-210](common/chat-auto-parser.h#L202-L210) — `tool_arguments_analysis`: `start/end` container markers, `name_prefix/suffix`, `value_prefix/suffix`, `separator`
- [common/chat-auto-parser.h:212-217](common/chat-auto-parser.h#L212-L217) — `tool_id_analysis`: `pos` enum, `prefix`/`suffix` markers around call ID values
@@ -47,12 +47,21 @@ All structs are defined in [common/chat-auto-parser.h](common/chat-auto-parser.h
| Value | Description |
|-----------------|-----------------------------------------------------------------------------------|
| `NONE` | No reasoning markers detected |
| `TAG_BASED` | Standard tag-based: `<think>...</think>` |
| `DELIMITER` | Delimiter-based: reasoning ends at a delimiter (e.g., `[BEGIN FINAL RESPONSE]`) |
| `FORCED_OPEN` | Template ends with open reasoning tag when `enable_thinking=true` |
| `FORCED_CLOSED` | `enable_thinking=false` emits both tags; `enable_thinking=true` emits only start |
| `TAG_BASED` | Tag-based: `<think>...</think>` (start can be empty for delimiter-style formats) |
| `TOOLS_ONLY` | Reasoning only appears in tool call responses, not plain content |
**Generation Prompt & Reasoning Prefill**: Computed in `common_chat_templates_apply_jinja` before invoking either the specialized handlers or the auto-parser, by rendering the template twice — once with `add_generation_prompt=false` and once with `add_generation_prompt=true` — and storing the diff suffix as `generation_params::generation_prompt`. This string is propagated into `common_chat_params::generation_prompt` and `common_chat_parser_params::generation_prompt`.
The generation prompt is prepended to model output before PEG parsing via `wrap_for_generation_prompt()`. The portion *before* the reasoning start marker (if any) is prepended as a literal to ensure any boilerplate added by the template is consumed. The full string is also fed to the grammar sampler via `llama_sampler_accept` (stored in `common_params_sampling::grammar_prefill`), advancing the grammar past tokens already in the prompt. It is used to determine the reasoning budget sampler's initial state — COUNTING if the prefill tokens begin with the reasoning start sequence (but don't also contain the end sequence), IDLE otherwise.
**`grammar_prefill`** (`common_params_sampling`): The generation prompt string tokenized and accepted by the grammar sampler at init time. Only applied when `grammar_external` is false (i.e., the grammar was not set explicitly by the user).
Three outcomes for reasoning-prefill handling (in `generate_parser()`):
1. **Start+end in generation prompt** (e.g. `<think></think>\n`): the parser sees reasoning as opened and immediately closed; whitespace-only reasoning content is discarded.
2. **Only start in generation prompt** (e.g. `<think>\n`): the parser sees reasoning as already open.
3. **Start marker present but not at the end** (e.g. Apriel's `<|begin_assistant|>` followed by boilerplate): the marker is a template artifact; the start literal is cleared so reasoning uses delimiter-style (end-only). For templates that ignore `add_generation_prompt` (empty diff), the rendered `data.prompt` is used as fallback — but only for non-TOOLS_ONLY modes, since in TOOLS_ONLY the start tag is model-generated and may appear in prior conversation turns.
**`content_mode`**: How the template wraps assistant content.
| Value | Description |
@@ -261,16 +270,16 @@ Text is segmentized into markers and non-marker fragments using `segmentize_mark
- Searches `diff.right` (output with reasoning) for the reasoning content needle
- Uses PEG parsers to find surrounding markers:
- If both pre/post markers found in `diff.right``TAG_BASED` (both tags visible in diff = no forced close)
- If both found but post marker only in the full output B → `FORCED_CLOSED`
- If only post marker found → `DELIMITER`
- If both pre/post markers found in `diff.right``TAG_BASED`
- If both found but post marker only in the full output B → `TAG_BASED` (template forces markers; handled via prefill)
- If only post marker found → `TAG_BASED` (delimiter-style, empty start)
- Sets `reasoning.start` and `reasoning.end`
**R2 — `compare_thinking_enabled()`**: Compares `enable_thinking=false` vs `true` with a generation prompt.
- Detects `FORCED_OPEN`: `enable_thinking=true` adds a non-empty marker at the end of the prompt (where model will start generating) — sets `reasoning.start`, mode = `FORCED_OPEN`
- Detects `FORCED_CLOSED`: `enable_thinking=false` produces both start+end markers; `enable_thinking=true` produces only start marker
- Handles the reverse case: if both start and end are still empty, looks for a single-segment diff on each side to extract both markers
- Detects template-added reasoning markers: `enable_thinking=true` appends a non-empty marker sets `reasoning.start`, mode = `TAG_BASED`
- Handles the reverse case (`enable_thinking=false` appends the marker instead): extracts both start (from the preceding segment) and end markers; mode = `TAG_BASED`
- The reasoning prefill (markers added by the template) is later extracted in `common_chat_templates_apply_jinja` and prepended to model output before parsing
**R3 — `compare_reasoning_scope()`**: Compares assistant message with reasoning+text-content vs reasoning+tool-calls.
@@ -343,7 +352,7 @@ Classification logic:
A workaround array in `common/chat-diff-analyzer.cpp` applies post-hoc patches after analysis. Each workaround is a lambda that inspects the template source and overrides analysis results. Current workarounds:
1. **Old Qwen/DeepSeek thinking templates** — source contains `content.split('</think>')`: sets `reasoning.mode = FORCED_OPEN` with `<think>`/`</think>` markers if no reasoning was detected
1. **Old Qwen/DeepSeek thinking templates** — source contains `content.split('</think>')` but not `<SPECIAL_12>`: sets `reasoning.mode = TAG_BASED` with `<think>`/`</think>` markers if no reasoning was detected
2. **Granite 3.3** — source contains specific "Write your thoughts" text: forces `TAG_BASED` reasoning with `<think>`/`</think>` and `WRAPPED_WITH_REASONING` content with `<response>`/`</response>`
3. **Cohere Command R+** — source contains `<|CHATBOT_TOKEN|>`: sets `ALWAYS_WRAPPED` content mode if no content start is already set
4. **Functionary 3.1** — source contains `set has_code_interpreter`: forces `PLAIN` content, specific `per_call_start/end`, clears preserved tokens to only keep Functionary-specific markers
@@ -355,12 +364,13 @@ Each analyzer struct (`analyze_reasoning`, `analyze_content`, `analyze_tools`) i
#### Reasoning Parser (`analyze_reasoning::build_parser`)
| Mode | Parser |
|-----------------------------------|---------------------------------------------------------------------|
| Not extracting reasoning | `eps()` |
| `FORCED_OPEN` or `FORCED_CLOSED` | `reasoning(until(end)) + end` — opening tag was in the prompt |
| `TAG_BASED` or `TOOLS_ONLY` | `optional(start + reasoning(until(end)) + end)` |
| `DELIMITER` | `optional(reasoning(until(end)) + end)` — no start marker |
| Mode | Parser |
|-----------------------------------------------|---------------------------------------------------------------------------|
| Not extracting reasoning | `eps()` |
| `TAG_BASED` or `TOOLS_ONLY` (non-empty start) | `optional(start + reasoning(until(end)) + end + space())` |
| `TAG_BASED` or `TOOLS_ONLY` (empty start) | `optional(reasoning(until(end)) + end + space())` — delimiter-style |
Note: The start marker may be empty either because the analyzer detected delimiter-style reasoning, or because `generate_parser()` cleared a template artifact start marker (see Generation Prompt & Reasoning Prefill above). Whitespace-only reasoning content (e.g. from a `<think></think>` prefill) is discarded by the mapper.
#### Content Parser (`analyze_content::build_parser`)
@@ -410,9 +420,7 @@ All three tool parsers return:
reasoning + optional(content(until(trigger_marker))) + tool_calls + end()
```
### Python Dict Format
When `format.uses_python_dicts` is true (detected when single-quoted strings appear in JSON argument context), `build_parser()` pre-registers a `json-string` rule that accepts both single-quoted and double-quoted strings. This is done before any `p.json()` call so all JSON parsing inherits the flexible rule.
Each returned parser is wrapped by `wrap_for_generation_prompt()`, which prepends a literal for any boilerplate prefix of the generation prompt (the portion before the reasoning start marker).
## Mapper
@@ -421,22 +429,22 @@ When `format.uses_python_dicts` is true (detected when single-quoted strings app
- **Buffered arguments**: Before `tool_name` is known, argument text goes to `args_buffer`; once the name is set, the buffer is flushed to `current_tool->arguments`
- **`args_target()`**: Returns a reference to whichever destination is currently active (buffer or tool args), eliminating branching
- **`closing_quote_pending`**: Tracks whether a closing `"` needs to be appended when a string argument value is finalized (for schema-declared string types in tagged format)
- **Quote normalization**: Python-style quotes (`'key': 'value'`) are converted to JSON (`"key": "value"`)
- **Whitespace-only reasoning**: Reasoning content that consists entirely of whitespace (e.g. from a `<think></think>` prefill) is cleared so the message shows no reasoning
- **Brace auto-closing**: At tool close, unclosed `{` braces are closed automatically
## Files
| File | Purpose |
|-------------------------------------------|----------------------------------------------------------------------|
| `common/chat-auto-parser.h` | All analysis structs, enums, `autoparser`, `peg_generator`, `templates_params` |
| `common/chat-auto-parser-generator.cpp` | Parser generator: `generate_parser()` and `build_parser()` methods |
| `common/chat-diff-analyzer.cpp` | Differential analysis implementation and workarounds |
| `common/chat-auto-parser-helpers.h/cpp` | `calculate_diff_split()`, `segmentize_markers()`, |
| | `compare_variants()`, string helpers |
| `common/chat-peg-parser.h/cpp` | `common_chat_peg_builder`, `common_chat_peg_mapper`, and helpers |
| `common/chat.cpp` | Entry point: `common_chat_templates_apply_jinja()` |
| `tools/parser/debug-template-parser.cpp` | Debug tool for template analysis |
| `tools/parser/template-analysis.cpp` | Template analysis tool |
| File | Purpose |
|-------------------------------------------|---------------------------------------------------------------------------------|
| `common/chat-auto-parser.h` | All analysis structs, enums, `autoparser`, `peg_generator`, `generation_params` |
| `common/chat-auto-parser-generator.cpp` | Parser generator: `generate_parser()` and `build_parser()` methods |
| `common/chat-diff-analyzer.cpp` | Differential analysis implementation and workarounds |
| `common/chat-auto-parser-helpers.h/cpp` | `calculate_diff_split()`, `segmentize_markers()`, `compare_variants()`, |
| | `wrap_for_generation_prompt()`, string helpers |
| `common/chat-peg-parser.h/cpp` | `common_chat_peg_builder`, `common_chat_peg_mapper`, and helpers |
| `common/chat.cpp` | Entry point: `common_chat_templates_apply_jinja()` |
| `tools/parser/debug-template-parser.cpp` | Debug tool for template analysis |
| `tools/parser/template-analysis.cpp` | Template analysis tool |
## Testing & Debugging
@@ -516,10 +524,10 @@ To support a new template format:
## Edge Cases and Quirks
1. **Forced Thinking**: When `enable_thinking=true` and the model prompt ends with an open reasoning tag (e.g., `<think>`), the parser enters forced thinking mode and immediately expects reasoning content without waiting for a start marker.
1. **Generation Prompt & Reasoning Prefill**: The generation prompt is extracted by diffing `add_generation_prompt=false` vs `true` in `common_chat_templates_apply_jinja`, so it contains exactly what the template appends — avoiding false positives from prior conversation turns.
2. **Per-Call vs Per-Section Markers**: Some templates wrap each tool call individually (`per_call_start/end`); others wrap the entire section (`section_start/end`). T2 (`check_per_call_markers()`) disambiguates by checking if the second call in a two-call output starts with the section marker.
3. **Python Dict Format**: The Seed template family uses single-quoted JSON (`'key': 'value'`). The `uses_python_dicts` flag causes the PEG builder to register a flexible `json-string` rule accepting both quote styles before any JSON rules are built.
4. **Tag Boundary Fixing**: `calculate_diff_split()` iteratively adjusts prefix/suffix boundaries to avoid splitting `<tag>` or `[marker]` tokens, ensuring clean extraction.
5. **Call ID Side Effects**: When a call ID is detected, `per_call_end` may have been incorrectly set to include the call ID suffix. T7 clears `per_call_end` in this case.
6. **Tool Analysis Gating**: `analyze_tools` is only constructed (and all tool analysis phases run) when `jinja_caps.supports_tool_calls` is true. Within tool analysis, `check_per_call_markers()` (T2) only runs if `jinja_caps.supports_parallel_tool_calls`.
7. **`analyze_arguments()` Gating**: Within tool analysis, A1 and A2 (argument name/value marker extraction) only run for `TAG_WITH_TAGGED` format. `extract_argument_separator()` and `extract_args_markers()` run for all non-`JSON_NATIVE` formats.
3. **Tag Boundary Fixing**: `calculate_diff_split()` iteratively adjusts prefix/suffix boundaries to avoid splitting `<tag>` or `[marker]` tokens, ensuring clean extraction.
4. **Call ID Side Effects**: When a call ID is detected, `per_call_end` may have been incorrectly set to include the call ID suffix. T7 clears `per_call_end` in this case.
5. **Tool Analysis Gating**: `analyze_tools` is only constructed (and all tool analysis phases run) when `jinja_caps.supports_tool_calls` is true. Within tool analysis, `check_per_call_markers()` (T2) only runs if `jinja_caps.supports_parallel_tool_calls`.
6. **`analyze_arguments()` Gating**: Within tool analysis, A1 and A2 (argument name/value marker extraction) only run for `TAG_WITH_TAGGED` format. `extract_argument_separator()` and `extract_args_markers()` run for all non-`JSON_NATIVE` formats.
7. **Undetected Tool Format**: If `analyze_tools` concludes tool calling is supported but cannot determine the format, `build_parser()` logs an error and returns `eps()` (graceful degradation) rather than aborting.

View File

@@ -1,6 +1,9 @@
# OpenVINO Backend for llama.cpp
[OpenVINO](https://docs.openvino.ai/) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge.
This document describes the [OpenVINO backend for llama.cpp](../../src/ggml-openvino), which enables hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs** while remaining compatible with the existing **GGUF model ecosystem**. The backend translates GGML compute graphs into OpenVINO graphs and leverages graph compilation, kernel fusion, and device-specific optimizations to improve inference performance on supported Intel hardware.
> [!NOTE]
> Performance and memory optimizations, accuracy validation, broader quantization coverage, broader operator and model support are work in progress.
[OpenVINO](https://docs.openvino.ai/) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge. [OpenVINO backend for llama.cpp](../../src/ggml-openvino) enables hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs** while remaining compatible with the existing **GGUF model ecosystem**. The backend translates GGML compute graphs into OpenVINO graphs and leverages graph compilation, kernel fusion, and device-specific optimizations to improve inference performance on supported Intel hardware.
The OpenVINO backend is implemented in `ggml/src/ggml-openvino` and provides a translation layer for core GGML operations. The OpenVINO backend replaces the standard GGML graph execution path with Intel's OpenVINO inference engine. This approach allows the same GGUF model file to run on Intel CPUs, Intel GPUs (integrated and discrete), and Intel NPUs without changes to the model or the rest of the llama.cpp stack. When a `ggml_cgraph` is dispatched to OpenVINO backend, it:
@@ -179,31 +182,73 @@ curl -L https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/L
When using the OpenVINO backend, the first inference token may have slightly higher latency due to on-the-fly conversion to the OpenVINO graph. Subsequent tokens and runs will be faster.
> [!NOTE]
> Default context size is set to the model training context, which may be very large. For example, 131072 for Llama 3.2 1B, which may result in lower performance, especially on edge/laptop devices. Use `-c` to limit context size in supported llama.cpp tools for better performance. For example, `-c 512`.
```bash
# If device is unset or unavailable, defaults to CPU.
# If the system has multiple GPUs, use GPU.0 or GPU.1 to explicitly target a specific GPU.
# Linux
export GGML_OPENVINO_DEVICE=GPU
# Enable stateful execution with GPU device to avoid known stateless execution failures.
export GGML_OPENVINO_STATEFUL_EXECUTION=1
# To run llama-simple:
./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -n 50 "The story of AI is "
# To run in chat mode:
./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf
./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -c 1024
# To run llama-bench, -fa 1 is needed
GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-bench -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -fa 1
# NPU: keep context small to avoid failures from very large model context windows.
export GGML_OPENVINO_DEVICE=NPU
./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -c 512
# Windows Command Line
set GGML_OPENVINO_DEVICE=GPU
# Enable stateful execution with GPU device to avoid known stateless execution failures.
set GGML_OPENVINO_STATEFUL_EXECUTION=1
# Windows PowerShell
$env:GGML_OPENVINO_DEVICE = "GPU"
$env:GGML_OPENVINO_STATEFUL_EXECUTION = "1"
# To run llama-simple
build\ReleaseOV\bin\llama-simple.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -n 50 "The story of AI is "
# To run in chat mode:
build\ReleaseOV\bin\llama-cli.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf"
build\ReleaseOV\bin\llama-cli.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -c 1024
# To run llama-bench, -fa 1 is needed
build\ReleaseOV\bin\llama-bench.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -fa 1
# NPU: keep context small to avoid failures from very large model context windows.
# Windows Command Line
set GGML_OPENVINO_DEVICE=NPU
# Windows PowerShell
$env:GGML_OPENVINO_DEVICE = "NPU"
build\ReleaseOV\bin\llama-cli.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -c 512
```
> [!NOTE]
> On systems with multiple GPUs, use `GPU.0` or `GPU.1` to explicitly target specific GPU. See [OpenVINO GPU Device](https://docs.openvino.ai/2026/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) for more details.
### Known Issues and Current Workarounds
- GPU stateless execution is currently affected by a known issue.
- Workaround: set `GGML_OPENVINO_STATEFUL_EXECUTION=1` when using GPU device.
- NPU failures can happen when context size is too large. Recent llama.cpp behavior may resolve context size to the model training context (for example, 131072 for Llama 3.2 1B), which is too large for current NPU usage and can also stress laptop CPU/GPU on larger models. To inspect the selected context size, run `llama-cli` or `llama-server` with `-lv 3`.
- Workaround: explicitly set context size, for ex. `-c 1024` for NPU runs. Performance will be better with lower context size.
- Additional NPU limitations:
- Model caching is not yet supported.
- `llama-server -np > 1` (multiple parallel sequences) is not supported.
- `llama-perplexity` is only supported with `-b 512` or smaller.
- `--context-shift` with `llama-cli` is currently not supported with OpenVINO backend across CPU, GPU, and NPU devices.
- Encoder models (embedding, reranking) are not supported with the current OpenVINO backend implementation.
- `-fa 1` is required when running llama-bench with the OpenVINO backend.
- `GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./llama-bench -fa 1`
- `llama-server` with OpenVINO backend supports only one chat session/thread, when `GGML_OPENVINO_STATEFUL_EXECUTION=1` is enabled.
- For Intel GPU, NPU detection in containers, GPU, NPU user-space drivers/libraries must be present inside the image. We will include in a future PR. Until then, you can use this reference Dockerfile: [openvino.Dockerfile](https://github.com/ravi9/llama.cpp/blob/ov-docker-update/.devops/openvino.Dockerfile)
> [!NOTE]
> The OpenVINO backend is actively under development. Fixes are underway, and this document will continue to be updated as issues are resolved.
### Docker Build
@@ -229,31 +274,42 @@ docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_p
Run llama.cpp with OpenVINO backend Docker container.
Save sample models in `~/models` as [shown above](#3-download-sample-model). It will be mounted to the container in the examples below.
> [!NOTE]
> Intel GPU, NPU detection in containers will be included in a future PR. Until then, you can use this reference Dockerfile: [openvino.Dockerfile](https://github.com/ravi9/llama.cpp/blob/ov-docker-update/.devops/openvino.Dockerfile).
```bash
# Run Docker container
docker run --rm -it -v ~/models:/models llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
docker run --rm -it -v ~/models:/models llama-openvino:light --no-warmup -c 1024 -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
# With Intel GPU access (iGPU or dGPU)
docker run --rm -it -v ~/models:/models \
--device=/dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
--env=GGML_OPENVINO_DEVICE=GPU --env=GGML_OPENVINO_STATEFUL_EXECUTION=1 \
llama-openvino:light --no-warmup -c 1024 -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
# With Intel NPU access
docker run --rm -it --env GGML_OPENVINO_DEVICE=NPU -v ~/models:/models \
docker run --rm -it -v ~/models:/models \
--device=/dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
--env=GGML_OPENVINO_DEVICE=NPU \
llama-openvino:light --no-warmup -c 1024 -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
```
Run Llama.cpp Server with OpenVINO Backend:
Run Llama.cpp Server with OpenVINO Backend.
> [!NOTE]
> `llama-server` with OpenVINO backend supports only one chat session/thread, when `GGML_OPENVINO_STATEFUL_EXECUTION=1` is enabled.
```bash
# Run the Server Docker container
docker run --rm -it -p 8080:8080 -v ~/models:/models llama-openvino:server --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
# In a NEW terminal, test the server with curl
docker run --rm -it -p 8080:8080 -v ~/models:/models llama-openvino:server --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf -c 1024
# Or Using llama-server executable
./build/ReleaseOV/bin/llama-server -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf --port 8080 -c 1024
# If you are behind a proxy, make sure to set NO_PROXY to avoid proxy for localhost
export NO_PROXY=localhost,127.0.0.1
# Option 1: Open your browser to http://localhost:8080 to access the web UI for the llama.cpp server.
# Option 2: In a NEW terminal, test the server with curl
# Test health endpoint
curl -f http://localhost:8080/health
@@ -295,6 +351,7 @@ The OpenVINO backend can be configured using the following environment variables
export GGML_OPENVINO_CACHE_DIR=/tmp/ov_cache
export GGML_OPENVINO_PROFILING=1
export GGML_OPENVINO_DEVICE=GPU
export GGML_OPENVINO_STATEFUL_EXECUTION=1
./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -n 50 "The story of AI is "
@@ -302,38 +359,27 @@ export GGML_OPENVINO_DEVICE=GPU
set GGML_OPENVINO_CACHE_DIR=C:\tmp\ov_cache
set GGML_OPENVINO_PROFILING=1
set GGML_OPENVINO_DEVICE=GPU
set GGML_OPENVINO_STATEFUL_EXECUTION=1
# Windows PowerShell
$env:GGML_OPENVINO_CACHE_DIR = "C:\tmp\ov_cache"
$env:GGML_OPENVINO_PROFILING = "1"
$env:GGML_OPENVINO_DEVICE = "GPU"
$env:GGML_OPENVINO_STATEFUL_EXECUTION = "1"
build\ReleaseOV\bin\llama-simple.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -n 50 "The story of AI is "
```
#### llama-bench
```bash
# -fa 1 is required when running llama-bench with the OpenVINO backend.
GGML_OPENVINO_DEVICE=GPU ./llama-bench -fa 1
```
### NPU Notes
- Model caching is not yet supported
- Does not support llama-server -np > 1 (multiple parallel sequences)
- Only supports llama-perplexity -b 512 or smaller
## Llama.cpp Tools
The following tools work with the OpenVINO backend on CPU, GPU, NPU:
- llama-simple
- llama-run
- llama-cli
- llama-server
- llama-bench
- llama-cli
- llama-completion
- llama-perplexity
- llama-server
- llama-simple
## Work in Progress

View File

@@ -31,6 +31,13 @@ llama-server -m gemma-3-4b-it-Q4_K_M.gguf --mmproj mmproj-gemma-3-4b-it-Q4_K_M.g
llama-server -hf ggml-org/gemma-3-4b-it-GGUF --no-mmproj-offload
```
> [!IMPORTANT]
>
> OCR models are trained with specific prompt and input structure, please refer to these discussions for more info:
> - PaddleOCR-VL: https://github.com/ggml-org/llama.cpp/pull/18825
> - GLM-OCR: https://github.com/ggml-org/llama.cpp/pull/19677
> - Deepseek-OCR: https://github.com/ggml-org/llama.cpp/pull/17400
## Pre-quantized models
These are ready-to-use models, most of them come with `Q4_K_M` quantization by default. They can be found at the Hugging Face page of the ggml-org: https://huggingface.co/collections/ggml-org/multimodal-ggufs-68244e01ff1f39e5bebeeedc

View File

@@ -12,9 +12,9 @@ Legend:
- 🟡 Partially supported by this backend
- ❌ Not supported by this backend
| Operation | BLAS | CANN | CPU | CUDA | Metal | OpenCL | SYCL | Vulkan | WebGPU | ZenDNN | zDNN |
| Operation | BLAS | CANN | CPU | CUDA | MTL | OpenCL | SYCL | Vulkan | WebGPU | ZenDNN | zDNN |
|-----------|------|------|------|------|------|------|------|------|------|------|------|
| ABS | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| ABS | ❌ | ✅ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| ACC | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| ADD | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| ADD1 | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
@@ -23,63 +23,63 @@ Legend:
| ARGMAX | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
| ARGSORT | ❌ | ✅ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | ❌ | ❌ |
| CEIL | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| CLAMP | ❌ | ✅ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| CLAMP | ❌ | ✅ | ✅ | ✅ | | 🟡 | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| CONCAT | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | ✅ | ✅ | ❌ | ❌ |
| CONT | ❌ | 🟡 | ✅ | ✅ | | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| CONT | ❌ | 🟡 | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| CONV_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| CONV_2D_DW | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| CONV_3D | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CONV_TRANSPOSE_2D | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| COS | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| COS | ❌ | ✅ | ✅ | ✅ | | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| COUNT_EQUAL | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| CPY | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| CROSS_ENTROPY_LOSS | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CROSS_ENTROPY_LOSS_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CUMSUM | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| DIAG | ❌ | ❌ | ✅ | ✅ | | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| DIAG | ❌ | ❌ | ✅ | ✅ | | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| DIAG_MASK_INF | ❌ | ✅ | ✅ | ✅ | ❌ | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| DIV | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| DUP | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| ELU | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| EXP | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| EXPM1 | ❌ | ❌ | ✅ | 🟡 | 🟡 | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| ELU | ❌ | ✅ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| EXP | ❌ | ✅ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| EXPM1 | ❌ | ❌ | ✅ | 🟡 | | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| FILL | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |
| FLASH_ATTN_EXT | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| FLOOR | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| GATED_DELTA_NET | ❌ | ❌ | ✅ | ❌ | | ❌ | ✅ | ❌ | | ❌ | ❌ |
| GATED_DELTA_NET | ❌ | ❌ | ✅ | ❌ | 🟡 | ❌ | ✅ | ❌ | | ❌ | ❌ |
| GATED_LINEAR_ATTN | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| GEGLU | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GEGLU_ERF | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GEGLU_QUICK | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_ERF | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_QUICK | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GET_ROWS | ❌ | 🟡 | ✅ | 🟡 | | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| GELU | ❌ | ✅ | ✅ | 🟡 | | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_ERF | ❌ | ✅ | ✅ | 🟡 | | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_QUICK | ❌ | ✅ | ✅ | 🟡 | | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GET_ROWS | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| GET_ROWS_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GROUP_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| HARDSIGMOID | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| HARDSWISH | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| HARDSIGMOID | ❌ | ✅ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| HARDSWISH | ❌ | ✅ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| IM2COL | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| IM2COL_3D | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| L2_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
| LEAKY_RELU | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | ✅ | 🟡 | ❌ | ❌ | ❌ |
| LOG | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | ✅ | ✅ | ❌ | ❌ |
| LOG | ❌ | ✅ | ✅ | ✅ | | ❌ | 🟡 | ✅ | ✅ | ❌ | ❌ |
| MEAN | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| MUL | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| MUL_MAT | 🟡 | 🟡 | 🟡 | 🟡 | | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| MUL_MAT_ID | ❌ | 🟡 | ✅ | ✅ | | 🟡 | 🟡 | ✅ | ❌ | ❌ | ❌ |
| NEG | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| MUL_MAT | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 |
| MUL_MAT_ID | ❌ | 🟡 | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | ❌ | ❌ | ❌ |
| NEG | ❌ | ✅ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | ❌ | ❌ | ❌ |
| OPT_STEP_ADAMW | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| OPT_STEP_SGD | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| OUT_PROD | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ | 🟡 |
| PAD | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ✅ | ✅ | ❌ | ❌ |
| PAD_REFLECT_1D | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| POOL_1D | ❌ | ❌ | ✅ | ❌ | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| POOL_1D | ❌ | ❌ | ✅ | ❌ | | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| POOL_2D | ❌ | 🟡 | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| REGLU | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| RELU | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| RELU | ❌ | ✅ | ✅ | 🟡 | | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| REPEAT | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| REPEAT_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| RMS_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
@@ -91,31 +91,31 @@ Legend:
| RWKV_WKV6 | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| RWKV_WKV7 | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| SCALE | ❌ | 🟡 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SET | ❌ | ❌ | ✅ | ✅ | | ❌ | 🟡 | ✅ | | ❌ | ❌ |
| SET | ❌ | ❌ | ✅ | ✅ | | ❌ | 🟡 | ✅ | | ❌ | ❌ |
| SET_ROWS | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| SGN | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SIGMOID | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SILU | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SGN | ❌ | ✅ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SIGMOID | ❌ | ✅ | ✅ | 🟡 | | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SILU | ❌ | ✅ | ✅ | 🟡 | | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SILU_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| SIN | ❌ | ✅ | ✅ | ✅ | 🟡 | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SOFTPLUS | ❌ | ❌ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SIN | ❌ | ✅ | ✅ | ✅ | | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SOFTPLUS | ❌ | ❌ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SOFT_MAX | ❌ | 🟡 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SOFT_MAX_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| SOLVE_TRI | ❌ | ❌ | ✅ | 🟡 | | ❌ | ❌ | ✅ | | ❌ | ❌ |
| SQR | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SQRT | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SSM_CONV | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ❌ | ❌ |
| SOLVE_TRI | ❌ | ❌ | ✅ | 🟡 | | ❌ | ❌ | ✅ | | ❌ | ❌ |
| SQR | ❌ | ✅ | ✅ | ✅ | | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SQRT | ❌ | ✅ | ✅ | ✅ | | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SSM_CONV | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | ❌ | ❌ |
| SSM_SCAN | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | 🟡 | ❌ | ❌ | ❌ |
| STEP | ❌ | ✅ | ✅ | 🟡 | 🟡 | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| STEP | ❌ | ✅ | ✅ | 🟡 | | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SUB | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SUM | ❌ | 🟡 | ✅ | 🟡 | 🟡 | ❌ | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| SUM_ROWS | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | 🟡 | ✅ | ✅ | ❌ | ❌ |
| SWIGLU | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SWIGLU_OAI | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| TANH | ❌ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| TANH | ❌ | ✅ | ✅ | 🟡 | | ✅ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| TIMESTEP_EMBEDDING | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| TOP_K | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| TRI | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
| TRUNC | ❌ | ❌ | ✅ | 🟡 | ❌ | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| UPSCALE | ❌ | 🟡 | ✅ | ✅ | 🟡 | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| UPSCALE | ❌ | 🟡 | ✅ | ✅ | | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| XIELU | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ |

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -28,9 +28,6 @@ def _build_repetition(item_rule, min_items, max_items, separator_rule=None):
return f'({result})?' if min_items == 0 else result
def _generate_min_max_int(min_value: Optional[int], max_value: Optional[int], out: list, decimals_left: int = 16, top_level: bool = True):
has_min = min_value != None
has_max = max_value != None
def digit_range(from_char: str, to_char: str):
out.append("[")
if from_char == to_char:
@@ -106,7 +103,7 @@ def _generate_min_max_int(min_value: Optional[int], max_value: Optional[int], ou
out.append(to_str[i])
out.append("]")
if has_min and has_max:
if min_value is not None and max_value is not None:
if min_value < 0 and max_value < 0:
out.append("\"-\" (")
_generate_min_max_int(-max_value, -min_value, out, decimals_left, top_level=True)
@@ -133,7 +130,7 @@ def _generate_min_max_int(min_value: Optional[int], max_value: Optional[int], ou
less_decimals = max(decimals_left - 1, 1)
if has_min:
if min_value is not None:
if min_value < 0:
out.append("\"-\" (")
_generate_min_max_int(None, -min_value, out, decimals_left, top_level=False)
@@ -177,7 +174,7 @@ def _generate_min_max_int(min_value: Optional[int], max_value: Optional[int], ou
more_digits(length - 1, less_decimals)
return
if has_max:
if max_value is not None:
if max_value >= 0:
if top_level:
out.append("\"-\" [1-9] ")

View File

@@ -365,13 +365,13 @@ Java_com_arm_aichat_internal_InferenceEngineImpl_processSystemPrompt(
const auto *system_prompt = env->GetStringUTFChars(jsystem_prompt, nullptr);
LOGd("%s: System prompt received: \n%s", __func__, system_prompt);
std::string formatted_system_prompt(system_prompt);
env->ReleaseStringUTFChars(jsystem_prompt, system_prompt);
// Format system prompt if applicable
const bool has_chat_template = common_chat_templates_was_explicit(g_chat_templates.get());
if (has_chat_template) {
formatted_system_prompt = chat_add_and_format(ROLE_SYSTEM, system_prompt);
}
env->ReleaseStringUTFChars(jsystem_prompt, system_prompt);
// Tokenize system prompt
const auto system_tokens = common_tokenize(g_context, formatted_system_prompt,
@@ -414,13 +414,13 @@ Java_com_arm_aichat_internal_InferenceEngineImpl_processUserPrompt(
const auto *const user_prompt = env->GetStringUTFChars(juser_prompt, nullptr);
LOGd("%s: User prompt received: \n%s", __func__, user_prompt);
std::string formatted_user_prompt(user_prompt);
env->ReleaseStringUTFChars(juser_prompt, user_prompt);
// Format user prompt if applicable
const bool has_chat_template = common_chat_templates_was_explicit(g_chat_templates.get());
if (has_chat_template) {
formatted_user_prompt = chat_add_and_format(ROLE_USER, user_prompt);
}
env->ReleaseStringUTFChars(juser_prompt, user_prompt);
// Decode formatted user prompts
auto user_tokens = common_tokenize(g_context, formatted_user_prompt, has_chat_template, has_chat_template);

View File

@@ -64,7 +64,7 @@ def load_model_and_tokenizer(model_path, use_sentence_transformers=False, device
print("Using SentenceTransformer to apply all numbered layers")
model = SentenceTransformer(model_path)
tokenizer = model.tokenizer
config = model[0].auto_model.config # type: ignore
config = model[0].auto_model.config
else:
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
@@ -108,8 +108,8 @@ def load_model_and_tokenizer(model_path, use_sentence_transformers=False, device
print(f"Model file: {type(model).__module__}")
# Verify the model is using the correct sliding window
if hasattr(model.config, 'sliding_window'): # type: ignore
print(f"Model's sliding_window: {model.config.sliding_window}") # type: ignore
if hasattr(model.config, 'sliding_window'):
print(f"Model's sliding_window: {model.config.sliding_window}")
else:
print("Model config does not have sliding_window attribute")
@@ -152,7 +152,7 @@ def main():
device = next(model.parameters()).device
else:
# For SentenceTransformer, get device from the underlying model
device = next(model[0].auto_model.parameters()).device # type: ignore
device = next(model[0].auto_model.parameters()).device
model_name = os.path.basename(model_path)
@@ -177,7 +177,7 @@ def main():
print(f"{token_id:6d} -> '{token_str}'")
print(f"Embeddings shape (after all SentenceTransformer layers): {all_embeddings.shape}")
print(f"Embedding dimension: {all_embeddings.shape[1] if len(all_embeddings.shape) > 1 else all_embeddings.shape[0]}") # type: ignore
print(f"Embedding dimension: {all_embeddings.shape[1] if len(all_embeddings.shape) > 1 else all_embeddings.shape[0]}")
else:
# Standard approach: use base model output only
encoded = tokenizer(
@@ -205,12 +205,12 @@ def main():
print(f"Embedding dimension: {all_embeddings.shape[1]}")
if len(all_embeddings.shape) == 1:
n_embd = all_embeddings.shape[0] # type: ignore
n_embd = all_embeddings.shape[0]
n_embd_count = 1
all_embeddings = all_embeddings.reshape(1, -1)
else:
n_embd = all_embeddings.shape[1] # type: ignore
n_embd_count = all_embeddings.shape[0] # type: ignore
n_embd = all_embeddings.shape[1]
n_embd_count = all_embeddings.shape[0]
print()

View File

@@ -2,7 +2,7 @@
import argparse
import sys
from common import compare_tokens # type: ignore
from common import compare_tokens # type: ignore[import-not-found]
def parse_arguments():

View File

@@ -6,7 +6,7 @@ import re
from copy import copy
from enum import Enum
from inspect import getdoc, isclass
from typing import TYPE_CHECKING, Any, Callable, List, Optional, Union, get_args, get_origin, get_type_hints
from typing import TYPE_CHECKING, Any, Callable, Optional, Union, get_args, get_origin, get_type_hints
from docstring_parser import parse
from pydantic import BaseModel, create_model
@@ -1158,7 +1158,7 @@ def create_dynamic_model_from_function(func: Callable[..., Any]):
# Assert that the parameter has a type annotation
if param.annotation == inspect.Parameter.empty:
raise TypeError(f"Parameter '{param.name}' in function '{func.__name__}' lacks a type annotation")
raise TypeError(f"""Parameter '{param.name}' in function '{getattr(func, "__name__", "")}' lacks a type annotation""")
# Find the parameter's description in the docstring
param_doc = next((d for d in docstring.params if d.arg_name == param.name), None)
@@ -1166,7 +1166,7 @@ def create_dynamic_model_from_function(func: Callable[..., Any]):
# Assert that the parameter has a description
if not param_doc or not param_doc.description:
raise ValueError(
f"Parameter '{param.name}' in function '{func.__name__}' lacks a description in the docstring")
f"""Parameter '{param.name}' in function '{getattr(func, "__name__", "")}' lacks a description in the docstring""")
# Add parameter details to the schema
param_docs.append((param.name, param_doc))
@@ -1177,7 +1177,7 @@ def create_dynamic_model_from_function(func: Callable[..., Any]):
dynamic_fields[param.name] = (
param.annotation if param.annotation != inspect.Parameter.empty else str, default_value)
# Creating the dynamic model
dynamic_model = create_model(f"{func.__name__}", **dynamic_fields)
dynamic_model = create_model(f"{getattr(func, '__name__')}", **dynamic_fields)
for name, param_doc in param_docs:
dynamic_model.model_fields[name].description = param_doc.description
@@ -1285,7 +1285,7 @@ def convert_dictionary_to_pydantic_model(dictionary: dict[str, Any], model_name:
if items != {}:
array = {"properties": items}
array_type = convert_dictionary_to_pydantic_model(array, f"{model_name}_{field_name}_items")
fields[field_name] = (List[array_type], ...)
fields[field_name] = (list[array_type], ...) # ty: ignore[invalid-type-form]
else:
fields[field_name] = (list, ...)
elif field_type == "object":

View File

@@ -77,6 +77,7 @@ extern "C" {
};
GGML_API struct gguf_context * gguf_init_empty(void);
GGML_API struct gguf_context * gguf_init_from_file_ptr(FILE * file, struct gguf_init_params params);
GGML_API struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_params params);
//GGML_API struct gguf_context * gguf_init_from_buffer(..);
@@ -189,6 +190,7 @@ extern "C" {
//
// write the entire context to a binary file
GGML_API bool gguf_write_to_file_ptr(const struct gguf_context * ctx, FILE * file, bool only_meta);
GGML_API bool gguf_write_to_file(const struct gguf_context * ctx, const char * fname, bool only_meta);
// get the size in bytes of the meta data (header, kv pairs, tensor info) including padding

View File

@@ -1788,9 +1788,11 @@ void ggml_cann_get_rows(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
ggml_tensor * src0 = dst->src[0]; // src
ggml_tensor * src1 = dst->src[1]; // index
GGML_ASSERT(dst->type == GGML_TYPE_F32 || dst->type == GGML_TYPE_F16);
GGML_ASSERT(dst->type == GGML_TYPE_F32 || dst->type == GGML_TYPE_F16
|| dst->type == GGML_TYPE_BF16);
switch (src0->type) {
case GGML_TYPE_BF16:
case GGML_TYPE_F16:
case GGML_TYPE_F32:
if (src0->type == dst->type) {
@@ -1881,6 +1883,7 @@ void ggml_cann_set_rows(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
break;
}
case GGML_TYPE_F16:
case GGML_TYPE_BF16:
{
acl_tensor_ptr acl_src0 = ggml_cann_create_tensor(src0);
ggml_cann_pool_alloc src_buffer_allocator(ctx.pool(), ggml_nelements(src0) * sizeof(uint16_t));
@@ -1891,7 +1894,7 @@ void ggml_cann_set_rows(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
src_trans_nb[i] = src_trans_nb[i - 1] * src0->ne[i - 1];
}
acl_tensor_ptr src_trans_tensor = ggml_cann_create_tensor(
src_trans_buffer, ACL_FLOAT16, ggml_type_size(dst->type), src0->ne, src_trans_nb, GGML_MAX_DIMS);
src_trans_buffer, ggml_cann_type_mapping(dst->type), ggml_type_size(dst->type), src0->ne, src_trans_nb, GGML_MAX_DIMS);
aclnn_cast(ctx, acl_src0.get(), src_trans_tensor.get(), ggml_cann_type_mapping(dst->type));
aclnn_index_copy_4d(ctx, src_trans_buffer, src0->ne, src_trans_nb, dst->data, dst->ne, dst->nb, src1,
dst->type);
@@ -1965,7 +1968,7 @@ static void ggml_cann_mat_mul_fp(ggml_backend_cann_context & ctx, ggml_tensor *
// Only check env once.
static bool weight_to_nz = parse_bool(get_env_as_lowercase("GGML_CANN_WEIGHT_NZ").value_or("on"));
if (weight_to_nz && is_matmul_weight(weight)) {
if (weight_to_nz && weight->type != GGML_TYPE_BF16 && is_matmul_weight(weight)) {
acl_weight_tensor = ggml_cann_create_tensor(weight, transpose_ne, transpose_nb, n_dims, ACL_FORMAT_FRACTAL_NZ);
} else {
acl_weight_tensor = ggml_cann_create_tensor(weight, transpose_ne, transpose_nb, n_dims, ACL_FORMAT_ND);
@@ -2146,6 +2149,9 @@ void ggml_cann_mul_mat(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
switch (type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
#ifndef ASCEND_310P
case GGML_TYPE_BF16:
#endif
ggml_cann_mat_mul_fp(ctx, dst);
break;
case GGML_TYPE_Q4_0:
@@ -3005,6 +3011,58 @@ void ggml_cann_rope(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
}
}
void ggml_cann_rope_cache_preload(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
ggml_tensor * src0 = dst->src[0];
float freq_base, freq_scale, ext_factor, attn_factor, beta_fast, beta_slow;
int sections[4];
const int n_dims = ((int32_t *) dst->op_params)[1];
const int mode = ((int32_t *) dst->op_params)[2];
const int n_ctx_orig = ((int32_t *) dst->op_params)[4];
GGML_TENSOR_UNARY_OP_LOCALS
memcpy(&freq_base, (int32_t *) dst->op_params + 5, sizeof(float));
memcpy(&freq_scale, (int32_t *) dst->op_params + 6, sizeof(float));
memcpy(&ext_factor, (int32_t *) dst->op_params + 7, sizeof(float));
memcpy(&attn_factor, (int32_t *) dst->op_params + 8, sizeof(float));
memcpy(&beta_fast, (int32_t *) dst->op_params + 9, sizeof(float));
memcpy(&beta_slow, (int32_t *) dst->op_params + 10, sizeof(float));
memcpy(&sections, (int32_t *) dst->op_params + 11, sizeof(int) * 4);
const float theta_scale = powf(freq_base, -2.0f / n_dims);
float corr_dims[2];
ggml_rope_yarn_corr_dims(n_dims, n_ctx_orig, freq_base, beta_fast, beta_slow, corr_dims);
bool is_neox = mode & GGML_ROPE_TYPE_NEOX;
const bool is_imrope = mode == GGML_ROPE_TYPE_IMROPE;
const bool mrope_used = mode & GGML_ROPE_TYPE_MROPE;
const bool is_vision = mode == GGML_ROPE_TYPE_VISION;
if (is_imrope || mrope_used) {
is_neox = true;
}
int64_t rope_dims = n_dims;
if (is_vision) {
rope_dims = src0->ne[0];
}
// Run the full cache init on the non-captured stream. This performs all
// host-to-device memcpy, aclrtMalloc/Free, and on-device computations
// so that the memory pool is warmed up and cache metadata is populated.
aclnn_rope_cache_init(ctx, dst, corr_dims, ext_factor, theta_scale, freq_scale, attn_factor, is_neox, sections,
mrope_used, is_imrope, is_vision, rope_dims);
// Reset `cached` so that during graph capture the on-device computations
// (sin/cos, position multiply, repeat, etc.) still execute and get recorded
// into the captured graph. The cache metadata (theta_scale_length,
// theta_scale, sections, position_length, etc.) remains set, which causes
// all host-to-device copy and malloc/free branches to be skipped.
ctx.rope_cache.cached = false;
}
void ggml_cann_argmax(ggml_backend_cann_context & ctx, ggml_tensor * dst) {
ggml_tensor * src0 = dst->src[0];

View File

@@ -543,6 +543,21 @@ void ggml_cann_mul_mat(ggml_backend_cann_context & ctx, ggml_tensor * dst);
*/
void ggml_cann_rope(ggml_backend_cann_context & ctx, ggml_tensor * dst);
/**
* @brief Pre-load the RoPE cache before ACL graph capture.
*
* This function must be called outside of graph capture to perform
* host-to-device memory copies and device memory allocations that are
* not allowed on a captured stream. After pre-loading, the rope cache
* metadata is updated so that the subsequent call to
* aclnn_rope_cache_init (inside graph capture) skips these operations
* and only records the on-device computations into the captured graph.
*
* @param ctx CANN backend context.
* @param dst A ROPE destination tensor from the computation graph.
*/
void ggml_cann_rope_cache_preload(ggml_backend_cann_context & ctx, ggml_tensor * dst);
/**
* @brief Computes the index of the maximum value along the specified dimension
* of a ggml tensor using the CANN backend.

View File

@@ -277,7 +277,7 @@ struct ggml_graph_node_properties {
}
}
if (node->op == GGML_OP_SCALE || node->op == GGML_OP_UNARY || node->op == GGML_OP_GLU) {
if (node->op == GGML_OP_SCALE || node->op == GGML_OP_UNARY || node->op == GGML_OP_GLU || node->op == GGML_OP_ROPE){
return memcmp(this->op_params, node->op_params, GGML_MAX_OP_PARAMS) == 0;
}
return true;

View File

@@ -1234,7 +1234,8 @@ static void ggml_backend_cann_buffer_set_tensor(ggml_backend_buffer_t buffer,
static bool weight_to_nz = parse_bool(get_env_as_lowercase("GGML_CANN_WEIGHT_NZ").value_or("on"));
if (!need_transform(tensor->type)) {
ACL_CHECK(aclrtMemcpy((char *) tensor->data + offset, size, data, size, ACL_MEMCPY_HOST_TO_DEVICE));
if (weight_to_nz && is_matmul_weight((const ggml_tensor *) tensor)) {
if (weight_to_nz && tensor->type != GGML_TYPE_BF16
&& is_matmul_weight((const ggml_tensor *) tensor)) {
GGML_ASSERT(tensor->ne[2] == 1);
GGML_ASSERT(tensor->ne[3] == 1);
weight_format_to_nz(tensor, offset, ctx->device);
@@ -1443,7 +1444,8 @@ static size_t ggml_backend_cann_buffer_type_get_alloc_size(ggml_backend_buffer_t
if (ne0 % MATRIX_ROW_PADDING != 0) {
size += ggml_row_size(tensor->type, MATRIX_ROW_PADDING - ne0 % MATRIX_ROW_PADDING);
}
} else if (weight_to_nz && is_matmul_weight((const ggml_tensor *) tensor)) {
} else if (weight_to_nz && tensor->type != GGML_TYPE_BF16
&& is_matmul_weight((const ggml_tensor *) tensor)) {
// NZ format weight are not support quantized yet.
// If ND tensor transform to NZ, size may changed.
int64_t shape[] = { tensor->ne[1], tensor->ne[0] };
@@ -2223,6 +2225,19 @@ static enum ggml_status ggml_backend_cann_graph_compute(ggml_backend_t backend,
// If no matching graph is found, add a new ACL graph.
ggml_cann_graph * new_graph = ggml_cann_graph::create_from_cgraph(cgraph);
cann_ctx->graph_lru_cache.push(new_graph);
// Pre-load rope cache before graph capture. During capture the
// stream cannot perform host-to-device memcpy or device memory
// malloc/free. Running the full cache init now populates the
// cache metadata so these branches are skipped during capture,
// while also warming up the memory pool.
for (int i = 0; i < cgraph->n_nodes; i++) {
ggml_tensor * node = cgraph->nodes[i];
if (node->op == GGML_OP_ROPE) {
ggml_cann_rope_cache_preload(*cann_ctx, node);
break;
}
}
}
}
#else
@@ -2283,6 +2298,9 @@ static bool ggml_backend_cann_supports_op(ggml_backend_dev_t dev, const ggml_ten
case GGML_OP_MUL_MAT:
{
switch (op->src[0]->type) {
#ifndef ASCEND_310P
case GGML_TYPE_BF16:
#endif
case GGML_TYPE_F16:
case GGML_TYPE_F32:
return true;
@@ -2320,6 +2338,9 @@ static bool ggml_backend_cann_supports_op(ggml_backend_dev_t dev, const ggml_ten
switch (op->src[0]->type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
#ifndef ASCEND_310P
case GGML_TYPE_BF16:
#endif
case GGML_TYPE_Q8_0:
return true;
default:
@@ -2332,6 +2353,9 @@ static bool ggml_backend_cann_supports_op(ggml_backend_dev_t dev, const ggml_ten
switch (op->type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
#ifndef ASCEND_310P
case GGML_TYPE_BF16:
#endif
return true;
default:
return false;
@@ -2341,20 +2365,30 @@ static bool ggml_backend_cann_supports_op(ggml_backend_dev_t dev, const ggml_ten
case GGML_OP_CPY:
{
ggml_tensor * src = op->src[0];
#ifdef ASCEND_310P
if ((op->type != GGML_TYPE_F32 && op->type != GGML_TYPE_F16) ||
(src->type != GGML_TYPE_F32 && src->type != GGML_TYPE_F16)) {
// only support F32 and F16.
// only support F32 and F16 on 310P.
return false;
}
#else
if ((op->type != GGML_TYPE_F32 && op->type != GGML_TYPE_F16 && op->type != GGML_TYPE_BF16) ||
(src->type != GGML_TYPE_F32 && src->type != GGML_TYPE_F16 && src->type != GGML_TYPE_BF16)) {
// only support F32, F16 and BF16.
return false;
}
#endif
return true;
}
break;
case GGML_OP_CONT:
{
// TODO: support GGML_TYPE_BF16
switch (op->src[0]->type) {
case GGML_TYPE_F32:
case GGML_TYPE_F16:
#ifndef ASCEND_310P
case GGML_TYPE_BF16:
#endif
return true;
default:
return false;

View File

@@ -460,6 +460,10 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
endif()
if(NOT GGML_CPU_ALL_VARIANTS)
set(MARCH_STR "rv64gc")
if (GGML_RVV)
string(APPEND MARCH_STR "v")
endif()
if (GGML_RV_ZFH)
string(APPEND MARCH_STR "_zfh")
endif()
@@ -467,7 +471,6 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
if (GGML_XTHEADVECTOR)
string(APPEND MARCH_STR "_xtheadvector")
elseif (GGML_RVV)
string(APPEND MARCH_STR "_v")
if (GGML_RV_ZVFH)
string(APPEND MARCH_STR "_zvfh")
endif()
@@ -475,12 +478,14 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
string(APPEND MARCH_STR "_zvfbfwma")
endif()
endif()
if (GGML_RV_ZICBOP)
string(APPEND MARCH_STR "_zicbop")
endif()
if (GGML_RV_ZIHINTPAUSE)
string(APPEND MARCH_STR "_zihintpause")
endif()
list(APPEND ARCH_FLAGS "-march=${MARCH_STR}" -mabi=lp64d)
else()
# Begin with the lowest baseline
@@ -572,9 +577,11 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
set(KLEIDIAI_FETCH_ARGS
URL ${KLEIDIAI_DOWNLOAD_URL}
DOWNLOAD_EXTRACT_TIMESTAMP NEW
URL_HASH MD5=${KLEIDIAI_ARCHIVE_MD5}
)
if (CMAKE_VERSION VERSION_GREATER_EQUAL "3.24")
list(APPEND KLEIDIAI_FETCH_ARGS DOWNLOAD_EXTRACT_TIMESTAMP NEW)
endif()
if (CMAKE_VERSION VERSION_GREATER_EQUAL "3.28")
FetchContent_Declare(KleidiAI_Download

View File

@@ -2871,8 +2871,12 @@ struct ggml_cplan ggml_graph_plan(
const int64_t ne11 = node->src[1]->ne[1]; // H
const int64_t ne12 = node->src[1]->ne[2]; // Channels In
cur += sizeof(ggml_fp16_t)*ne00*ne01*ne02*ne03;
cur += sizeof(ggml_fp16_t)*ne10*ne11*ne12;
GGML_ASSERT(node->src[0]->type == GGML_TYPE_F16 || node->src[0]->type == GGML_TYPE_F32);
GGML_ASSERT(node->src[1]->type == GGML_TYPE_F32);
cur += ggml_type_size(node->src[0]->type) * ne00 * ne01 * ne02 * ne03;
cur += ggml_type_size(node->src[0]->type) * ne10 * ne11 * ne12;
} break;
case GGML_OP_TOP_K:
{

View File

@@ -3194,6 +3194,7 @@ class tinyBLAS_PPC {
private:
__attribute__((always_inline))
inline void save_acc(acc_t * ACC, int64_t ii, int64_t jj) {
vec_t vec_C[4];
__builtin_mma_disassemble_acc(vec_C, ACC);
@@ -3204,6 +3205,7 @@ class tinyBLAS_PPC {
}
}
__attribute__((always_inline))
inline void add_save_acc(acc_t * ACC, int64_t ii, int64_t jj) {
vec_t vec_C[4];
__builtin_mma_disassemble_acc(vec_C, ACC);

View File

@@ -6923,16 +6923,15 @@ void ggml_compute_forward_conv_3d(
ggml_compute_forward_conv_3d_impl(params, src0, src1, dst, src0->type);
}
// ggml_compute_forward_conv_transpose_2d
void ggml_compute_forward_conv_transpose_2d(
const ggml_compute_params * params,
ggml_tensor * dst) {
template <typename kernel_t>
static void ggml_compute_forward_conv_transpose_2d_impl(
const ggml_compute_params * params,
ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0];
const ggml_tensor * src1 = dst->src[1];
GGML_ASSERT(src0->type == GGML_TYPE_F16);
GGML_ASSERT(src0->type == GGML_TYPE_F16 || src0->type == GGML_TYPE_F32);
GGML_ASSERT(src1->type == GGML_TYPE_F32);
GGML_ASSERT( dst->type == GGML_TYPE_F32);
@@ -6943,7 +6942,7 @@ void ggml_compute_forward_conv_transpose_2d(
const int nk = ne00*ne01*ne02*ne03;
GGML_ASSERT(nb00 == sizeof(ggml_fp16_t));
GGML_ASSERT(nb00 == ggml_type_size(src0->type));
GGML_ASSERT(nb10 == sizeof(float));
if (ith == 0) {
@@ -6951,12 +6950,12 @@ void ggml_compute_forward_conv_transpose_2d(
// permute kernel data (src0) from (Kw x Kh x Cout x Cin) to (Cin x Kw x Kh x Cout)
{
ggml_fp16_t * const wdata = (ggml_fp16_t *) params->wdata + 0;
kernel_t * const wdata = (kernel_t *) params->wdata + 0;
for (int64_t i03 = 0; i03 < ne03; i03++) {
for (int64_t i02 = 0; i02 < ne02; i02++) {
const ggml_fp16_t * const src = (ggml_fp16_t *)((char *) src0->data + i03*nb03 + i02*nb02);
ggml_fp16_t * dst_data = wdata + i02*ne01*ne00*ne03;
const kernel_t * const src = (kernel_t *)((char *) src0->data + i03*nb03 + i02*nb02);
kernel_t * dst_data = wdata + i02*ne01*ne00*ne03;
for (int64_t i01 = 0; i01 < ne01; i01++) {
for (int64_t i00 = 0; i00 < ne00; i00++) {
dst_data[i01*ne00*ne03 + i00*ne03 + i03] = src[i01 * ne00 + i00];
@@ -6968,13 +6967,17 @@ void ggml_compute_forward_conv_transpose_2d(
// permute source data (src1) from (Sw x Sh x Cin) to (Cin x Sw x Sh)
{
ggml_fp16_t * const wdata = (ggml_fp16_t *) params->wdata + nk;
kernel_t * const wdata = (kernel_t *) params->wdata + nk;
for (int i12 = 0; i12 < ne12; i12++) {
for (int i11 = 0; i11 < ne11; i11++) {
const float * const src = (float *)((char *) src1->data + i12*nb12 + i11*nb11);
ggml_fp16_t * dst_data = wdata + i11*ne10*ne12;
kernel_t * dst_data = wdata + i11*ne10*ne12;
for (int i10 = 0; i10 < ne10; i10++) {
dst_data[i10*ne12 + i12] = GGML_CPU_FP32_TO_FP16(src[i10]);
if constexpr (std::is_same_v<kernel_t, ggml_fp16_t>) {
dst_data[i10*ne12 + i12] = GGML_CPU_FP32_TO_FP16(src[i10]);
} else {
dst_data[i10*ne12 + i12] = src[i10];
}
}
}
}
@@ -6996,21 +6999,27 @@ void ggml_compute_forward_conv_transpose_2d(
const int ip0 = dp*ith;
const int ip1 = MIN(ip0 + dp, np);
ggml_fp16_t * const wdata = (ggml_fp16_t *) params->wdata + 0;
ggml_fp16_t * const wdata_src = wdata + nk;
kernel_t * const wdata = (kernel_t *) params->wdata + 0;
kernel_t * const wdata_src = wdata + nk;
for (int i2 = ip0; i2 < ip1; i2++) { // Cout
float * dst_data = (float *)((char *) dst->data + i2*nb2);
ggml_fp16_t * wdata_kernel = wdata + i2*ne01*ne00*ne03;
kernel_t * wdata_kernel = wdata + i2*ne01*ne00*ne03;
for (int i11 = 0; i11 < ne11; i11++) {
for (int i10 = 0; i10 < ne10; i10++) {
const int i1n = i11*ne10*ne12 + i10*ne12;
for (int i01 = 0; i01 < ne01; i01++) {
for (int i00 = 0; i00 < ne00; i00++) {
float v = 0;
ggml_vec_dot_f16(ne03, &v, 0,
wdata_src + i1n, 0,
wdata_kernel + i01*ne00*ne03 + i00*ne03, 0, 1);
if constexpr (std::is_same_v<kernel_t, ggml_fp16_t>) {
ggml_vec_dot_f16(ne03, &v, 0,
wdata_src + i1n, 0,
wdata_kernel + i01*ne00*ne03 + i00*ne03, 0, 1);
} else {
ggml_vec_dot_f32(ne03, &v, 0,
wdata_src + i1n, 0,
wdata_kernel + i01*ne00*ne03 + i00*ne03, 0, 1);
}
dst_data[(i11*stride + i01)*ne0 + i10*stride + i00] += v;
}
}
@@ -7019,6 +7028,28 @@ void ggml_compute_forward_conv_transpose_2d(
}
}
void ggml_compute_forward_conv_transpose_2d(
const ggml_compute_params * params,
ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0];
switch (src0->type) {
case GGML_TYPE_F16:
{
ggml_compute_forward_conv_transpose_2d_impl<ggml_fp16_t>(params, dst);
} break;
case GGML_TYPE_F32:
{
ggml_compute_forward_conv_transpose_2d_impl<float>(params, dst);
} break;
default:
{
GGML_ABORT("fatal error");
}
}
}
// ggml_compute_forward_conv_2d_dw
struct ggml_conv_2d_dw_params {

View File

@@ -116,12 +116,11 @@ if (CUDAToolkit_FOUND)
list(APPEND GGML_SOURCES_CUDA ${SRCS})
add_compile_definitions(GGML_CUDA_FA_ALL_QUANTS)
else()
file(GLOB SRCS "template-instances/fattn-vec*q4_0-q4_0.cu")
list(APPEND GGML_SOURCES_CUDA ${SRCS})
file(GLOB SRCS "template-instances/fattn-vec*q8_0-q8_0.cu")
list(APPEND GGML_SOURCES_CUDA ${SRCS})
file(GLOB SRCS "template-instances/fattn-vec*f16-f16.cu")
list(APPEND GGML_SOURCES_CUDA ${SRCS})
list(APPEND GGML_SOURCES_CUDA
template-instances/fattn-vec-instance-f16-f16.cu
template-instances/fattn-vec-instance-q4_0-q4_0.cu
template-instances/fattn-vec-instance-q8_0-q8_0.cu
template-instances/fattn-vec-instance-bf16-bf16.cu)
endif()
ggml_add_backend_library(ggml-cuda

View File

@@ -799,6 +799,16 @@ static __device__ __forceinline__ float ggml_cuda_e8m0_to_fp32(uint8_t x) {
#endif // CUDART_VERSION >= 12050
}
static __device__ __forceinline__ float ggml_cuda_ue4m3_to_fp32(uint8_t x) {
#ifdef FP8_AVAILABLE
const uint32_t bits = x * (x != 0x7F && x != 0xFF); // Convert NaN to 0.0f to match CPU implementation.
const __nv_fp8_e4m3 xf = *reinterpret_cast<const __nv_fp8_e4m3 *>(&bits);
return static_cast<float>(xf) / 2;
#else
NO_DEVICE_CODE;
#endif // FP8_AVAILABLE
}
__device__ __forceinline__ uint8_t ggml_cuda_float_to_fp4_e2m1(float x, float e) {
const uint8_t sign_bit = (x < 0.0f) << 3;
float ax = fabsf(x) * e;
@@ -931,6 +941,13 @@ struct ggml_cuda_type_traits<GGML_TYPE_MXFP4> {
static constexpr int qi = QI_MXFP4;
};
template<>
struct ggml_cuda_type_traits<GGML_TYPE_NVFP4> {
static constexpr int qk = QK_NVFP4;
static constexpr int qr = QR_NVFP4;
static constexpr int qi = QI_NVFP4;
};
template<>
struct ggml_cuda_type_traits<GGML_TYPE_Q2_K> {
static constexpr int qk = QK_K;

View File

@@ -1,12 +1,20 @@
#include <algorithm>
#include "conv2d-transpose.cuh"
#include "ggml.h"
#include "convert.cuh"
__global__ void conv2d_transpose_kernel(const float * __restrict__ input, const half * __restrict__ kernel,
float * __restrict__ output, const int in_w, const int in_h, const int out_w,
const int out_h, const int kernel_w, const int kernel_h, const int stride,
const int c_in, const int c_out, const int batches) {
template <typename kernel_t>
static __global__ void conv2d_transpose_kernel(const float * __restrict__ input,
const kernel_t * __restrict__ kernel,
float * __restrict__ output,
const int in_w,
const int in_h,
const int out_w,
const int out_h,
const int kernel_w,
const int kernel_h,
const int stride,
const int c_in,
const int c_out,
const int batches) {
const int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
const int total_elements = out_w * out_h * c_out * batches;
@@ -26,24 +34,32 @@ __global__ void conv2d_transpose_kernel(const float * __restrict__ input, const
for (int c_in_idx = 0; c_in_idx < c_in; c_in_idx++) {
for (int kh = 0; kh < kernel_h; ++kh) {
int in_y = out_y_idx - kh;
if (in_y < 0 || in_y % stride) continue;
if (in_y < 0 || in_y % stride) {
continue;
}
in_y /= stride;
if (in_y >= in_h) continue;
if (in_y >= in_h) {
continue;
}
for (int kw = 0; kw < kernel_w; ++kw) {
int in_x = out_x_idx - kw;
if (in_x < 0 || in_x % stride) continue;
if (in_x < 0 || in_x % stride) {
continue;
}
in_x /= stride;
if (in_x >= in_w) continue;
if (in_x >= in_w) {
continue;
}
const int input_idx = (in_w * in_h * c_in) * n_idx + (in_w * in_h) * c_in_idx + (in_w) *in_y + in_x;
const int kernel_idx =
(kernel_h * kernel_w * c_out) * c_in_idx + (kernel_h * kernel_w) * c_idx + (kernel_w) *kh + kw;
float input_val = input[input_idx];
half kern_val = kernel[kernel_idx];
float input_val = input[input_idx];
kernel_t kern_val = kernel[kernel_idx];
accumulator += input_val * (float) kern_val;
accumulator += input_val * ggml_cuda_cast<float>(kern_val);
}
}
}
@@ -56,11 +72,12 @@ void ggml_cuda_conv_2d_transpose_p0(ggml_backend_cuda_context & ctx, ggml_tensor
const ggml_tensor * kernel = dst->src[0];
const ggml_tensor * input = dst->src[1];
GGML_ASSERT(kernel->type == GGML_TYPE_F16 && input->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
GGML_ASSERT(kernel->type == GGML_TYPE_F16 || kernel->type == GGML_TYPE_F32);
GGML_ASSERT(input->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
const float * input_data = (const float *) input->data;
float * output_data = (float *) dst->data;
const half * kernel_data = (const half *) kernel->data;
const void * kernel_data = kernel->data;
const int input_w = input->ne[0];
const int input_h = input->ne[1];
@@ -82,10 +99,17 @@ void ggml_cuda_conv_2d_transpose_p0(ggml_backend_cuda_context & ctx, ggml_tensor
GGML_ASSERT(ggml_is_contiguous(kernel));
GGML_ASSERT(ggml_is_contiguous(dst));
const int total = (output_w * output_h * channels_out * batches);
const int total = output_w * output_h * channels_out * batches;
const int blocks = (total + CUDA_CONV2D_TRANSPOSE_BLOCK_SIZE - 1) / CUDA_CONV2D_TRANSPOSE_BLOCK_SIZE;
conv2d_transpose_kernel<<<blocks, CUDA_CONV2D_TRANSPOSE_BLOCK_SIZE, 0, st>>>(
input_data, kernel_data, output_data, input_w, input_h, output_w, output_h, kernel_w, kernel_h, stride,
channels_in, channels_out, batches);
if (kernel->type == GGML_TYPE_F16) {
conv2d_transpose_kernel<half><<<blocks, CUDA_CONV2D_TRANSPOSE_BLOCK_SIZE, 0, st>>>(
input_data, (const half *) kernel_data, output_data, input_w, input_h, output_w, output_h, kernel_w,
kernel_h, stride, channels_in, channels_out, batches);
} else {
conv2d_transpose_kernel<float><<<blocks, CUDA_CONV2D_TRANSPOSE_BLOCK_SIZE, 0, st>>>(
input_data, (const float *) kernel_data, output_data, input_w, input_h, output_w, output_h, kernel_w,
kernel_h, stride, channels_in, channels_out, batches);
}
}

View File

@@ -1,4 +1,5 @@
#include "common.cuh"
#define CUDA_CONV2D_TRANSPOSE_BLOCK_SIZE 256
void ggml_cuda_conv_2d_transpose_p0(ggml_backend_cuda_context & ctx, ggml_tensor * dst);

View File

@@ -617,6 +617,45 @@ static void dequantize_row_mxfp4_cuda(const void * vx, dst_t * y, const int64_t
dequantize_block_mxfp4<<<nb, 32, 0, stream>>>(vx, y);
}
template <typename dst_t>
static __global__ void dequantize_block_nvfp4(
const void * __restrict__ vx,
dst_t * __restrict__ yy,
const int64_t ne) {
const int64_t i = blockIdx.x;
const int tid = threadIdx.x;
const int64_t base = i * QK_NVFP4;
if (base >= ne) {
return;
}
const block_nvfp4 * x = (const block_nvfp4 *) vx;
const block_nvfp4 & xb = x[i];
const int sub = tid / (QK_NVFP4_SUB / 2);
const int j = tid % (QK_NVFP4_SUB / 2);
const float d = ggml_cuda_ue4m3_to_fp32(xb.d[sub]);
const uint8_t q = xb.qs[sub * (QK_NVFP4_SUB / 2) + j];
const int64_t y0 = base + sub * QK_NVFP4_SUB + j;
const int64_t y1 = y0 + QK_NVFP4_SUB / 2;
yy[y0] = ggml_cuda_cast<dst_t>(d * kvalues_mxfp4[q & 0x0F]);
yy[y1] = ggml_cuda_cast<dst_t>(d * kvalues_mxfp4[q >> 4]);
}
template <typename dst_t>
static void dequantize_row_nvfp4_cuda(
const void * vx,
dst_t * y,
const int64_t k,
cudaStream_t stream) {
GGML_ASSERT(k % QK_NVFP4 == 0);
const int nb = k / QK_NVFP4;
dequantize_block_nvfp4<<<nb, 32, 0, stream>>>(vx, y, k);
}
template <typename src_t, typename dst_t>
static __global__ void convert_unary(
const void * __restrict__ vx, dst_t * __restrict__ y, const int64_t ne00, const int64_t ne01,
@@ -715,6 +754,8 @@ to_fp16_cuda_t ggml_get_to_fp16_cuda(ggml_type type) {
return dequantize_row_iq3_s_cuda;
case GGML_TYPE_MXFP4:
return dequantize_row_mxfp4_cuda;
case GGML_TYPE_NVFP4:
return dequantize_row_nvfp4_cuda;
case GGML_TYPE_F32:
return convert_unary_cont_cuda<float>;
case GGML_TYPE_BF16:
@@ -766,6 +807,8 @@ to_fp32_cuda_t ggml_get_to_fp32_cuda(ggml_type type) {
return dequantize_row_iq3_s_cuda;
case GGML_TYPE_MXFP4:
return dequantize_row_mxfp4_cuda;
case GGML_TYPE_NVFP4:
return dequantize_row_nvfp4_cuda;
case GGML_TYPE_F16:
return convert_unary_cont_cuda<half>;
case GGML_TYPE_BF16:

View File

@@ -41,6 +41,16 @@ template<typename dst_t, typename src_t>
return __bfloat162float(x);
} else if constexpr(std::is_same_v<src_t, float2> && std::is_same_v<dst_t, half2>) {
return __float22half2_rn(x);
} else if constexpr(std::is_same_v<src_t, nv_bfloat162> && std::is_same_v<dst_t, float2>) {
#ifdef GGML_USE_HIP
return make_float2(__bfloat162float(__low2bfloat16(x)), __bfloat162float(__high2bfloat16(x)));
#else
#if __CUDA_ARCH__ >= 800
return __bfloat1622float2(x);
#else
return make_float2(__bfloat162float(x.x), __bfloat162float(x.y));
#endif // __CUDA_ARCH__ >= 800
#endif // GGML_USE_HIP
} else if constexpr(std::is_same_v<src_t, float2> && std::is_same_v<dst_t, nv_bfloat162>) {
// bypass compile error on cuda 12.0.1
#ifdef GGML_USE_HIP

View File

@@ -74,6 +74,37 @@ static __device__ __forceinline__ float vec_dot_fattn_vec_KQ_f16(
return sum;
}
template <int D, int nthreads>
static __device__ __forceinline__ float vec_dot_fattn_vec_KQ_bf16(
const char * __restrict__ K_c, const void * __restrict__ Q_v, const int * __restrict__ Q_q8 , const void * __restrict__ Q_ds_v) {
const nv_bfloat162 * K_bf16 = (const nv_bfloat162 *) K_c;
GGML_UNUSED(Q_q8);
GGML_UNUSED(Q_ds_v);
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
float sum = 0.0f;
#pragma unroll
for (int k_KQ_0 = 0; k_KQ_0 < D/2; k_KQ_0 += nthreads*cpy_ne) {
__align__(16) nv_bfloat162 tmp[cpy_ne];
ggml_cuda_memcpy_1<sizeof(tmp)>(tmp, K_bf16 + k_KQ_0 + (threadIdx.x % nthreads)*cpy_ne);
#pragma unroll
for (int k_KQ_1 = 0; k_KQ_1 < cpy_ne; ++k_KQ_1) {
#ifdef V_DOT2_F32_F16_AVAILABLE
// FIXME replace macros in vector FA kernel with templating and use FP32 for BF16
ggml_cuda_mad(sum, ggml_cuda_cast<float2>(tmp[k_KQ_1]), __half22float2(((const half2 *) Q_v)[k_KQ_0/nthreads + k_KQ_1]));
#else
ggml_cuda_mad(sum, ggml_cuda_cast<float2>(tmp[k_KQ_1]), ((const float2 *) Q_v)[k_KQ_0/nthreads + k_KQ_1]);
#endif // V_DOT2_F32_F16_AVAILABLE
}
}
return sum;
}
template<int D, int nthreads>
static __device__ __forceinline__ float vec_dot_fattn_vec_KQ_q4_0(
const char * __restrict__ K_c, const void * __restrict__ Q_v, const int * __restrict__ Q_q8, const void * __restrict__ Q_ds_v) {
@@ -321,6 +352,19 @@ static __device__ __forceinline__ void dequantize_V_f16(const void * __restrict_
}
}
template <typename T, int ne>
static __device__ __forceinline__ void dequantize_V_bf16(const void * __restrict__ vx, void * __restrict__ dst, const int64_t i0) {
static_assert(std::is_same_v<T, float>, "BF16 V dequantization only supports float output");
static_assert(ne % 2 == 0, "bad ne");
__align__(16) nv_bfloat162 tmp[ne/2];
ggml_cuda_memcpy_1<ne*sizeof(nv_bfloat16)>(tmp, (const nv_bfloat16 *) vx + i0);
float2 * dst_f2 = (float2 *) dst;
#pragma unroll
for (int l = 0; l < ne/2; ++l) {
dst_f2[l] = ggml_cuda_cast<float2>(tmp[l]);
}
}
template <typename T, int ne>
static __device__ __forceinline__ void dequantize_V_q4_0(const void * __restrict__ vx, void * __restrict__ dst, const int64_t i0) {
const block_q4_0 * x = (const block_q4_0 *) vx;
@@ -547,6 +591,8 @@ constexpr __device__ vec_dot_KQ_t get_vec_dot_KQ() {
return vec_dot_fattn_vec_KQ_q5_1<D, nthreads>;
} else if constexpr (type_K == GGML_TYPE_Q8_0) {
return vec_dot_fattn_vec_KQ_q8_0<D, nthreads>;
} else if constexpr (type_K == GGML_TYPE_BF16) {
return vec_dot_fattn_vec_KQ_bf16<D, nthreads>;
} else {
static_assert(type_K == -1, "bad type");
return nullptr;
@@ -567,6 +613,8 @@ constexpr __device__ dequantize_V_t get_dequantize_V() {
return dequantize_V_q5_1<T, ne>;
} else if constexpr (type_V == GGML_TYPE_Q8_0) {
return dequantize_V_q8_0<T, ne>;
} else if constexpr (type_V == GGML_TYPE_BF16) {
return dequantize_V_bf16<float, ne>;
} else {
static_assert(type_V == -1, "bad type");
return nullptr;

View File

@@ -75,17 +75,17 @@ static __global__ void flash_attn_ext_vec(
#endif // GGML_USE_HIP
constexpr int nthreads = ggml_cuda_fattn_vec_get_nthreads_device();
constexpr int nthreads_KQ = type_K == GGML_TYPE_F16 ? 128 / cpy_nb : nthreads_KQ_q;
constexpr int nthreads_V = type_V == GGML_TYPE_F16 ? 128 / cpy_nb : nthreads_V_q;
constexpr int nthreads_KQ = (type_K == GGML_TYPE_F16 || type_K == GGML_TYPE_BF16) ? 128 / cpy_nb : nthreads_KQ_q;
constexpr int nthreads_V = (type_V == GGML_TYPE_F16 || type_V == GGML_TYPE_BF16) ? 128 / cpy_nb : nthreads_V_q;
static_assert(WARP_SIZE % nthreads_KQ == 0, "bad nthreads_K");
static_assert(WARP_SIZE % nthreads_V == 0, "bad nthreads_V");
constexpr int V_rows_per_thread = type_V == GGML_TYPE_F16 ? 2*cpy_ne : 4;
constexpr int V_rows_per_thread = (type_V == GGML_TYPE_F16 || type_V == GGML_TYPE_BF16) ? 2*cpy_ne : 4;
constexpr int V_cols_per_iter = WARP_SIZE / nthreads_V;
constexpr vec_dot_KQ_t vec_dot_KQ = get_vec_dot_KQ<type_K, D, nthreads_KQ>();
constexpr bool Q_q8_1 = type_K != GGML_TYPE_F16;
constexpr bool Q_q8_1 = type_K != GGML_TYPE_F16 && type_K != GGML_TYPE_BF16;
#ifdef V_DOT2_F32_F16_AVAILABLE
constexpr dequantize_V_t dequantize_V = get_dequantize_V<type_V, half, V_rows_per_thread>();
#else
@@ -323,8 +323,18 @@ static __global__ void flash_attn_ext_vec(
#pragma unroll
for (int i_VKQ_0 = 0; i_VKQ_0 < D/2; i_VKQ_0 += nthreads_V*V_rows_per_thread/2) {
half2 tmp[V_rows_per_thread/2];
dequantize_V(V + k*nb21, tmp,
2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
if constexpr (type_V == GGML_TYPE_BF16) {
float2 tmp_f[V_rows_per_thread/2];
dequantize_V(V + k*nb21, tmp_f,
2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
#pragma unroll
for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
tmp[i_VKQ_1] = __float22half2_rn(tmp_f[i_VKQ_1]);
}
} else {
dequantize_V(V + k*nb21, tmp,
2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
}
#pragma unroll
for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
#pragma unroll
@@ -563,6 +573,7 @@ void ggml_cuda_flash_attn_ext_vec_case(ggml_backend_cuda_context & ctx, ggml_ten
extern DECL_FATTN_VEC_CASE(D, type_K, GGML_TYPE_Q5_0); \
extern DECL_FATTN_VEC_CASE(D, type_K, GGML_TYPE_Q5_1); \
extern DECL_FATTN_VEC_CASE(D, type_K, GGML_TYPE_Q8_0); \
extern DECL_FATTN_VEC_CASE(D, type_K, GGML_TYPE_BF16); \
EXTERN_DECL_FATTN_VEC_CASES( 64, GGML_TYPE_F16)
EXTERN_DECL_FATTN_VEC_CASES( 64, GGML_TYPE_Q4_0)
@@ -570,6 +581,7 @@ EXTERN_DECL_FATTN_VEC_CASES( 64, GGML_TYPE_Q4_1)
EXTERN_DECL_FATTN_VEC_CASES( 64, GGML_TYPE_Q5_0)
EXTERN_DECL_FATTN_VEC_CASES( 64, GGML_TYPE_Q5_1)
EXTERN_DECL_FATTN_VEC_CASES( 64, GGML_TYPE_Q8_0)
EXTERN_DECL_FATTN_VEC_CASES( 64, GGML_TYPE_BF16)
EXTERN_DECL_FATTN_VEC_CASES(128, GGML_TYPE_F16)
EXTERN_DECL_FATTN_VEC_CASES(128, GGML_TYPE_Q4_0)
@@ -577,6 +589,7 @@ EXTERN_DECL_FATTN_VEC_CASES(128, GGML_TYPE_Q4_1)
EXTERN_DECL_FATTN_VEC_CASES(128, GGML_TYPE_Q5_0)
EXTERN_DECL_FATTN_VEC_CASES(128, GGML_TYPE_Q5_1)
EXTERN_DECL_FATTN_VEC_CASES(128, GGML_TYPE_Q8_0)
EXTERN_DECL_FATTN_VEC_CASES(128, GGML_TYPE_BF16)
EXTERN_DECL_FATTN_VEC_CASES(256, GGML_TYPE_F16)
EXTERN_DECL_FATTN_VEC_CASES(256, GGML_TYPE_Q4_0)
@@ -584,3 +597,4 @@ EXTERN_DECL_FATTN_VEC_CASES(256, GGML_TYPE_Q4_1)
EXTERN_DECL_FATTN_VEC_CASES(256, GGML_TYPE_Q5_0)
EXTERN_DECL_FATTN_VEC_CASES(256, GGML_TYPE_Q5_1)
EXTERN_DECL_FATTN_VEC_CASES(256, GGML_TYPE_Q8_0)
EXTERN_DECL_FATTN_VEC_CASES(256, GGML_TYPE_BF16)

View File

@@ -224,6 +224,7 @@ static void ggml_cuda_flash_attn_ext_vec(ggml_backend_cuda_context & ctx, ggml_t
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_0, GGML_TYPE_F16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_1, GGML_TYPE_F16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_F16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_BF16, GGML_TYPE_F16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_Q4_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q4_0, GGML_TYPE_Q4_0)
@@ -231,6 +232,7 @@ static void ggml_cuda_flash_attn_ext_vec(ggml_backend_cuda_context & ctx, ggml_t
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_0, GGML_TYPE_Q4_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_1, GGML_TYPE_Q4_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_Q4_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_BF16, GGML_TYPE_Q4_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_Q4_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q4_0, GGML_TYPE_Q4_1)
@@ -238,6 +240,7 @@ static void ggml_cuda_flash_attn_ext_vec(ggml_backend_cuda_context & ctx, ggml_t
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_0, GGML_TYPE_Q4_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_1, GGML_TYPE_Q4_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_Q4_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_BF16, GGML_TYPE_Q4_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_Q5_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q4_0, GGML_TYPE_Q5_0)
@@ -245,6 +248,7 @@ static void ggml_cuda_flash_attn_ext_vec(ggml_backend_cuda_context & ctx, ggml_t
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_0, GGML_TYPE_Q5_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_1, GGML_TYPE_Q5_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_Q5_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_BF16, GGML_TYPE_Q5_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_Q5_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q4_0, GGML_TYPE_Q5_1)
@@ -252,6 +256,7 @@ static void ggml_cuda_flash_attn_ext_vec(ggml_backend_cuda_context & ctx, ggml_t
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_0, GGML_TYPE_Q5_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_1, GGML_TYPE_Q5_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_Q5_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_BF16, GGML_TYPE_Q5_1)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_Q8_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q4_0, GGML_TYPE_Q8_0)
@@ -259,10 +264,20 @@ static void ggml_cuda_flash_attn_ext_vec(ggml_backend_cuda_context & ctx, ggml_t
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_0, GGML_TYPE_Q8_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_1, GGML_TYPE_Q8_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_Q8_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_BF16, GGML_TYPE_Q8_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_BF16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q4_0, GGML_TYPE_BF16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q4_1, GGML_TYPE_BF16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_0, GGML_TYPE_BF16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q5_1, GGML_TYPE_BF16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_BF16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_BF16, GGML_TYPE_BF16)
#else
FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_F16)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q4_0, GGML_TYPE_Q4_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_Q8_0, GGML_TYPE_Q8_0)
FATTN_VEC_CASES_ALL_D(GGML_TYPE_BF16, GGML_TYPE_BF16)
#endif // GGML_CUDA_FA_ALL_QUANTS
GGML_ABORT("fatal error");
@@ -355,6 +370,7 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const
#endif // GGML_CUDA_FA_ALL_QUANTS
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
case GGML_TYPE_BF16:
break;
default:
return BEST_FATTN_KERNEL_NONE;

View File

@@ -1297,7 +1297,12 @@ static void ggml_cuda_op_mul_mat_cublas(
const bool supports_bf16 = GGML_CUDA_CC_IS_NVIDIA(cc) || GGML_CUDA_CC_IS_AMD(cc) ||
(GGML_CUDA_CC_IS_MTHREADS(cc) && cc >= GGML_CUDA_CC_QY2);
const bool use_fp16 = (src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) && ggml_is_contiguous(src0) && row_diff == src0->ne[1] && dst->op_params[0] == GGML_PREC_DEFAULT;
const bool use_fp16 =
src0->type != GGML_TYPE_NVFP4 &&
(src0->type == GGML_TYPE_F16 || ggml_is_quantized(src0->type)) &&
ggml_is_contiguous(src0) &&
row_diff == src0->ne[1] &&
dst->op_params[0] == GGML_PREC_DEFAULT;
if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) {
ggml_cuda_pool_alloc<nv_bfloat16> src1_as_bf16(ctx.pool(id));
@@ -4781,6 +4786,9 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_TYPE_Q5_1:
case GGML_TYPE_Q8_0:
case GGML_TYPE_MXFP4:
#ifdef FP8_AVAILABLE
case GGML_TYPE_NVFP4:
#endif // FP8_AVAILABLE
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:

View File

@@ -15,6 +15,7 @@ static constexpr __device__ vec_dot_q_cuda_t get_vec_dot_q_cuda(ggml_type type)
case GGML_TYPE_Q5_1: return vec_dot_q5_1_q8_1;
case GGML_TYPE_Q8_0: return vec_dot_q8_0_q8_1;
case GGML_TYPE_MXFP4: return vec_dot_mxfp4_q8_1;
case GGML_TYPE_NVFP4: return vec_dot_nvfp4_q8_1;
case GGML_TYPE_Q2_K: return vec_dot_q2_K_q8_1;
case GGML_TYPE_Q3_K: return vec_dot_q3_K_q8_1;
case GGML_TYPE_Q4_K: return vec_dot_q4_K_q8_1;
@@ -33,7 +34,7 @@ static constexpr __device__ vec_dot_q_cuda_t get_vec_dot_q_cuda(ggml_type type)
}
}
static constexpr __device__ int get_vdr_mmvq(ggml_type type) {
static constexpr __host__ __device__ int get_vdr_mmvq(ggml_type type) {
switch (type) {
case GGML_TYPE_Q4_0: return VDR_Q4_0_Q8_1_MMVQ;
case GGML_TYPE_Q4_1: return VDR_Q4_1_Q8_1_MMVQ;
@@ -41,6 +42,7 @@ static constexpr __device__ int get_vdr_mmvq(ggml_type type) {
case GGML_TYPE_Q5_1: return VDR_Q5_1_Q8_1_MMVQ;
case GGML_TYPE_Q8_0: return VDR_Q8_0_Q8_1_MMVQ;
case GGML_TYPE_MXFP4: return VDR_MXFP4_Q8_1_MMVQ;
case GGML_TYPE_NVFP4: return VDR_NVFP4_Q8_1_MMVQ;
case GGML_TYPE_Q2_K: return VDR_Q2_K_Q8_1_MMVQ;
case GGML_TYPE_Q3_K: return VDR_Q3_K_Q8_1_MMVQ;
case GGML_TYPE_Q4_K: return VDR_Q4_K_Q8_1_MMVQ;
@@ -173,11 +175,11 @@ static constexpr __host__ __device__ int calc_nwarps(ggml_type type, int ncols_d
return 1;
}
static constexpr __host__ __device__ int calc_rows_per_block(int ncols_dst, int table_id) {
static constexpr __host__ __device__ int calc_rows_per_block(int ncols_dst, int table_id, bool small_k = false, int nwarps = 1) {
if (table_id == MMVQ_PARAMETERS_GENERIC || table_id == MMVQ_PARAMETERS_GCN) {
switch (ncols_dst) {
case 1:
return 1;
return small_k ? nwarps : 1;
case 2:
case 3:
case 4:
@@ -193,7 +195,7 @@ static constexpr __host__ __device__ int calc_rows_per_block(int ncols_dst, int
return 1;
}
template <ggml_type type, int ncols_dst, bool has_fusion, bool is_multi_token_id = false>
template <ggml_type type, int ncols_dst, bool has_fusion, bool is_multi_token_id = false, bool small_k = false>
__launch_bounds__(calc_nwarps(type, ncols_dst, get_device_table_id())*ggml_cuda_get_physical_warp_size(), 1)
static __global__ void mul_mat_vec_q(
const void * __restrict__ vx, const void * __restrict__ vy, const int32_t * __restrict__ ids, const ggml_cuda_mm_fusion_args_device fusion, float * __restrict__ dst,
@@ -208,7 +210,7 @@ static __global__ void mul_mat_vec_q(
constexpr int vdr = get_vdr_mmvq(type);
constexpr mmvq_parameter_table_id table_id = get_device_table_id();
constexpr int nwarps = calc_nwarps(type, ncols_dst, table_id);
constexpr int rows_per_cuda_block = calc_rows_per_block(ncols_dst, table_id);
constexpr int rows_per_cuda_block = calc_rows_per_block(ncols_dst, table_id, small_k, nwarps);
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
constexpr vec_dot_q_cuda_t vec_dot_q_cuda = get_vec_dot_q_cuda(type);
@@ -414,14 +416,16 @@ static __global__ void mul_mat_vec_q(
template<ggml_type type>
static std::pair<dim3, dim3> calc_launch_params(
const int ncols_dst, const int nrows_x, const int nchannels_dst, const int nsamples_or_ntokens,
const int warp_size, const mmvq_parameter_table_id table_id) {
const int64_t nblocks = (nrows_x + calc_rows_per_block(ncols_dst, table_id) - 1) / calc_rows_per_block(ncols_dst, table_id);
const int warp_size, const mmvq_parameter_table_id table_id, const bool small_k = false) {
const int nwarps = calc_nwarps(type, ncols_dst, table_id);
const int rpb = calc_rows_per_block(ncols_dst, table_id, small_k, nwarps);
const int64_t nblocks = (nrows_x + rpb - 1) / rpb;
const dim3 block_nums(nblocks, nchannels_dst, nsamples_or_ntokens);
const dim3 block_dims(warp_size, calc_nwarps(type, ncols_dst, table_id), 1);
const dim3 block_dims(warp_size, nwarps, 1);
return {block_nums, block_dims};
}
template<ggml_type type, int c_ncols_dst, bool is_multi_token_id = false>
template<ggml_type type, int c_ncols_dst, bool is_multi_token_id = false, bool small_k = false>
static void mul_mat_vec_q_switch_fusion(
const void * vx, const void * vy, const int32_t * ids, const ggml_cuda_mm_fusion_args_device fusion, float * dst,
const uint32_t ncols_x, const uint3 nchannels_y, const uint32_t stride_row_x, const uint32_t stride_col_y,
@@ -434,7 +438,7 @@ static void mul_mat_vec_q_switch_fusion(
const bool has_fusion = fusion.gate != nullptr || fusion.x_bias != nullptr || fusion.gate_bias != nullptr;
if constexpr (c_ncols_dst == 1) {
if (has_fusion) {
mul_mat_vec_q<type, c_ncols_dst, true, is_multi_token_id><<<block_nums, block_dims, nbytes_shared, stream>>>
mul_mat_vec_q<type, c_ncols_dst, true, is_multi_token_id, small_k><<<block_nums, block_dims, nbytes_shared, stream>>>
(vx, vy, ids, fusion, dst, ncols_x, nchannels_y, stride_row_x, stride_col_y, stride_col_dst,
channel_ratio, stride_channel_x, stride_channel_y, stride_channel_dst,
sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst, ids_stride);
@@ -444,7 +448,7 @@ static void mul_mat_vec_q_switch_fusion(
GGML_ASSERT(!has_fusion && "fusion only supported for ncols_dst=1");
mul_mat_vec_q<type, c_ncols_dst, false, is_multi_token_id><<<block_nums, block_dims, nbytes_shared, stream>>>
mul_mat_vec_q<type, c_ncols_dst, false, is_multi_token_id, small_k><<<block_nums, block_dims, nbytes_shared, stream>>>
(vx, vy, ids, fusion, dst, ncols_x, nchannels_y, stride_row_x, stride_col_y, stride_col_dst,
channel_ratio, stride_channel_x, stride_channel_y, stride_channel_dst,
sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst, ids_stride);
@@ -488,11 +492,33 @@ static void mul_mat_vec_q_switch_ncols_dst(
switch (ncols_dst) {
case 1: {
constexpr int c_ncols_dst = 1;
std::pair<dim3, dim3> dims = calc_launch_params<type>(c_ncols_dst, nrows_x, nchannels_dst, nsamples_dst, warp_size, table_id);
mul_mat_vec_q_switch_fusion<type, c_ncols_dst>(vx, vy, ids, fusion, dst, ncols_x, nchannels_y_fd, stride_row_x, stride_col_y, stride_col_dst,
channel_ratio_fd, stride_channel_x, stride_channel_y, stride_channel_dst,
sample_ratio_fd, stride_sample_x, stride_sample_y, stride_sample_dst,
dims.first, dims.second, 0, ids_stride, stream);
// When K is small, increase rows_per_block to match nwarps so each warp has more work to do
// Trigger when the full thread block covers all K blocks in a single loop iteration and few threads remain idle.
constexpr int qk = ggml_cuda_type_traits<type>::qk;
constexpr int qi = ggml_cuda_type_traits<type>::qi;
constexpr int vdr = get_vdr_mmvq(type);
const int blocks_per_row_x = ncols_x / qk;
const int blocks_per_iter_1warp = vdr * warp_size / qi;
const int nwarps = calc_nwarps(type, c_ncols_dst, table_id);
const bool use_small_k = nwarps > 1 && blocks_per_row_x < nwarps * blocks_per_iter_1warp;
if (use_small_k) {
std::pair<dim3, dim3> dims = calc_launch_params<type>(c_ncols_dst, nrows_x, nchannels_dst, nsamples_dst,
warp_size, table_id, true);
mul_mat_vec_q_switch_fusion<type, c_ncols_dst, false, true>(
vx, vy, ids, fusion, dst, ncols_x, nchannels_y_fd, stride_row_x, stride_col_y, stride_col_dst,
channel_ratio_fd, stride_channel_x, stride_channel_y, stride_channel_dst,
sample_ratio_fd, stride_sample_x, stride_sample_y, stride_sample_dst,
dims.first, dims.second, 0, ids_stride, stream);
} else {
std::pair<dim3, dim3> dims = calc_launch_params<type>(c_ncols_dst, nrows_x, nchannels_dst, nsamples_dst,
warp_size, table_id);
mul_mat_vec_q_switch_fusion<type, c_ncols_dst>(
vx, vy, ids, fusion, dst, ncols_x, nchannels_y_fd, stride_row_x, stride_col_y, stride_col_dst,
channel_ratio_fd, stride_channel_x, stride_channel_y, stride_channel_dst,
sample_ratio_fd, stride_sample_x, stride_sample_y, stride_sample_dst,
dims.first, dims.second, 0, ids_stride, stream);
}
} break;
case 2: {
constexpr int c_ncols_dst = 2;
@@ -602,6 +628,12 @@ static void mul_mat_vec_q_switch_type(
nchannels_x, nchannels_y, nchannels_dst, stride_channel_x, stride_channel_y, stride_channel_dst,
nsamples_x, nsamples_dst, stride_sample_x, stride_sample_y, stride_sample_dst, ids_stride, stream);
break;
case GGML_TYPE_NVFP4:
mul_mat_vec_q_switch_ncols_dst<GGML_TYPE_NVFP4>
(vx, vy, ids, fusion, dst, ncols_x, nrows_x, ncols_dst, stride_row_x, stride_col_y, stride_col_dst,
nchannels_x, nchannels_y, nchannels_dst, stride_channel_x, stride_channel_y, stride_channel_dst,
nsamples_x, nsamples_dst, stride_sample_x, stride_sample_y, stride_sample_dst, ids_stride, stream);
break;
case GGML_TYPE_Q2_K:
mul_mat_vec_q_switch_ncols_dst<GGML_TYPE_Q2_K>
(vx, vy, ids, fusion, dst, ncols_x, nrows_x, ncols_dst, stride_row_x, stride_col_y, stride_col_dst,

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_BF16, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_BF16, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_BF16, GGML_TYPE_BF16);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_BF16, GGML_TYPE_F16);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_BF16, GGML_TYPE_F16);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_BF16, GGML_TYPE_F16);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_BF16, GGML_TYPE_Q4_0);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_BF16, GGML_TYPE_Q4_0);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_BF16, GGML_TYPE_Q4_0);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_BF16, GGML_TYPE_Q4_1);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_BF16, GGML_TYPE_Q4_1);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_BF16, GGML_TYPE_Q4_1);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_BF16, GGML_TYPE_Q5_0);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_BF16, GGML_TYPE_Q5_0);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_BF16, GGML_TYPE_Q5_0);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_BF16, GGML_TYPE_Q5_1);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_BF16, GGML_TYPE_Q5_1);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_BF16, GGML_TYPE_Q5_1);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_BF16, GGML_TYPE_Q8_0);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_BF16, GGML_TYPE_Q8_0);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_BF16, GGML_TYPE_Q8_0);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_F16, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_F16, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_F16, GGML_TYPE_BF16);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_Q4_0, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_Q4_0, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_Q4_0, GGML_TYPE_BF16);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_Q4_1, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_Q4_1, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_Q4_1, GGML_TYPE_BF16);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_Q5_0, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_Q5_0, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_Q5_0, GGML_TYPE_BF16);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_Q5_1, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_Q5_1, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_Q5_1, GGML_TYPE_BF16);

View File

@@ -0,0 +1,7 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-vec.cuh"
DECL_FATTN_VEC_CASE( 64, GGML_TYPE_Q8_0, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(128, GGML_TYPE_Q8_0, GGML_TYPE_BF16);
DECL_FATTN_VEC_CASE(256, GGML_TYPE_Q8_0, GGML_TYPE_BF16);

View File

@@ -5,7 +5,7 @@ import os
HEAD_SIZES_KQ = [40, 64, 72, 80, 96, 112, 128, 256, 576]
TYPES_KV = ["GGML_TYPE_F16", "GGML_TYPE_Q4_0", "GGML_TYPE_Q4_1", "GGML_TYPE_Q5_0", "GGML_TYPE_Q5_1", "GGML_TYPE_Q8_0"]
TYPES_KV = ["GGML_TYPE_F16", "GGML_TYPE_Q4_0", "GGML_TYPE_Q4_1", "GGML_TYPE_Q5_0", "GGML_TYPE_Q5_1", "GGML_TYPE_Q8_0", "GGML_TYPE_BF16"]
SOURCE_FATTN_TILE = """// This file has been autogenerated by generate_cu_files.py, do not edit manually.

View File

@@ -322,6 +322,38 @@ static __device__ __forceinline__ float vec_dot_mxfp4_q8_1(
return d * sumi;
}
#define VDR_NVFP4_Q8_1_MMVQ 4
#define VDR_NVFP4_Q8_1_MMQ 8
static __device__ __forceinline__ float vec_dot_nvfp4_q8_1(
const void * __restrict__ vbq,
const block_q8_1 * __restrict__ bq8_1,
const int32_t & kbx,
const int32_t & iqs) {
const block_nvfp4 * bq4 = (const block_nvfp4 *) vbq + kbx;
float sum = 0.0f;
#pragma unroll
for (int i = 0; i < VDR_NVFP4_Q8_1_MMVQ/2; i++) {
const int32_t iqs0 = iqs + 2*i;
const int32_t iqs1 = iqs0 + 1;
const int32_t is = iqs0 >> 1;
const int2 v0 = get_int_from_table_16(get_int_b4(bq4->qs, iqs0), kvalues_mxfp4);
const int2 v1 = get_int_from_table_16(get_int_b4(bq4->qs, iqs1), kvalues_mxfp4);
const block_q8_1 * bq8 = bq8_1 + (is >> 1);
const int32_t i8 = ((is & 1) << 2);
int sumi = ggml_cuda_dp4a(v0.x, get_int_b4(bq8->qs, i8 + 0), 0);
sumi = ggml_cuda_dp4a(v0.y, get_int_b4(bq8->qs, i8 + 2), sumi);
sumi = ggml_cuda_dp4a(v1.x, get_int_b4(bq8->qs, i8 + 1), sumi);
sumi = ggml_cuda_dp4a(v1.y, get_int_b4(bq8->qs, i8 + 3), sumi);
const float d = ggml_cuda_ue4m3_to_fp32(bq4->d[is]) * __low2float(bq8->ds);
sum += d * float(sumi);
}
return sum;
}
#define VDR_Q2_K_Q8_1_MMVQ 1
#define VDR_Q2_K_Q8_1_MMQ 4

View File

@@ -6,9 +6,10 @@
#include <cuda_bf16.h>
#include <cuda_fp16.h>
#if CUDART_VERSION >= 12050
#if CUDART_VERSION >= 11080
#include <cuda_fp8.h>
#endif // CUDART_VERSION >= 12050
#define FP8_AVAILABLE
#endif // CUDART_VERSION >= 11080
#if CUDART_VERSION >= 12080
#include <cuda_fp4.h>

View File

@@ -235,6 +235,12 @@
typedef __hip_bfloat16 nv_bfloat16;
typedef __hip_bfloat162 nv_bfloat162;
#if HIP_VERSION >= 60200000
#include <hip/hip_fp8.h>
typedef __hip_fp8_e4m3 __nv_fp8_e4m3;
#define FP8_AVAILABLE
#endif // HIP_VERSION >= 60200000
typedef int8_t int8x4_t __attribute__((ext_vector_type(4)));
typedef uint8_t uint8x4_t __attribute__((ext_vector_type(4)));
static __device__ __forceinline__ int __vsubss4(const int a, const int b) {

View File

@@ -45,6 +45,7 @@ static int opt_verbose = 0;
static int opt_profile = 0;
static int opt_hostbuf = 1; // hostbuf ON by default
static int opt_experimental = 0;
static int opt_use_hmx = 1; // when set, enable HMX; when 0, use HVX only
// Enable all stages by default
static int opt_opmask = HTP_OPMASK_QUEUE | HTP_OPMASK_QUANTIZE | HTP_OPMASK_COMPUTE;
@@ -460,7 +461,7 @@ static void repack_row_q4x4x2(uint8_t * y, const block_q4_0 * x, int64_t k) {
d[7] = x[i * 8 + 7].d;
}
if (opt_verbose > 1) {
if (opt_verbose > 2) {
for (int i = 0; i < nb; i++) {
dump_packed_block_q4x4x2(y, i, k);
}
@@ -479,7 +480,7 @@ static void unpack_row_q4x4x2(block_q4_0 * x, const uint8_t * y, int64_t k) {
const uint8_t * y_q = y + 0; // quants first
const uint8_t * y_d = y + qrow_size; // then scales
if (opt_verbose > 1) {
if (opt_verbose > 2) {
for (int i = 0; i < nb; i++) {
dump_packed_block_q4x4x2(y, i, k);
}
@@ -795,7 +796,7 @@ static void repack_row_q8x4x2(uint8_t * y, const block_q8_0 * x, int64_t k) {
d[7] = x[i * 8 + 7].d;
}
if (opt_verbose > 1) {
if (opt_verbose > 2) {
for (int i = 0; i < nb; i++) {
dump_packed_block_q8x4x2(y, i, k);
}
@@ -813,7 +814,7 @@ static void unpack_row_q8x4x2(block_q8_0 * x, const uint8_t * y, int64_t k) {
const uint8_t * y_q = y + 0; // quants first
const uint8_t * y_d = y + qrow_size; // then scales
if (opt_verbose > 1) {
if (opt_verbose > 2) {
for (int i = 0; i < nb; i++) {
dump_packed_block_q8x4x2(y, i, k);
}
@@ -1148,7 +1149,7 @@ static void repack_row_mxfp4x4x2(uint8_t * y, const block_mxfp4 * x, int64_t k)
e[7] = x[i * 8 + 7].e;
}
if (opt_verbose > 1) {
if (opt_verbose > 2) {
for (int i = 0; i < nb; i++) {
dump_packed_block_mxfp4x4x2(y, i, k);
}
@@ -1167,7 +1168,7 @@ static void unpack_row_mxfp4x4x2(block_mxfp4 * x, const uint8_t * y, int64_t k)
const uint8_t * y_q = y + 0; // quants first
const uint8_t * y_e = y + qrow_size; // then scales
if (opt_verbose > 1) {
if (opt_verbose > 2) {
for (int i = 0; i < nb; i++) {
dump_packed_block_mxfp4x4x2(y, i, k);
}
@@ -1693,7 +1694,7 @@ void ggml_hexagon_session::allocate(int dev_id) noexcept(false) {
// Start the DSP-side service. We need to pass the queue ID to the
// DSP in a FastRPC call; the DSP side will import the queue and start
// listening for packets in a callback.
err = htp_iface_start(this->handle, dev_id, this->queue_id, opt_nhvx);
err = htp_iface_start(this->handle, dev_id, this->queue_id, opt_nhvx, opt_use_hmx);
if (err != 0) {
GGML_LOG_ERROR("ggml-hex: failed to start session: 0x%08x\n", (unsigned) err);
throw std::runtime_error("ggml-hex: iface start failed (see log for details)");
@@ -3372,6 +3373,7 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
const char * str_profile = getenv("GGML_HEXAGON_PROFILE");
const char * str_etm = getenv("GGML_HEXAGON_ETM");
const char * str_nhvx = getenv("GGML_HEXAGON_NHVX");
const char * str_use_hmx = getenv("GGML_HEXAGON_USE_HMX");
const char * str_ndev = getenv("GGML_HEXAGON_NDEV");
const char * str_arch = getenv("GGML_HEXAGON_ARCH");
@@ -3381,8 +3383,9 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
opt_opmask = str_opmask ? strtoul(str_opmask, NULL, 0) : opt_opmask;
opt_opsync = str_opsync ? atoi(str_opsync) : 0;
opt_profile = str_profile ? atoi(str_profile) : 0;
opt_etm = str_etm ? atoi(str_etm) : 0;
opt_etm = str_etm ? atoi(str_etm) : 0;
opt_nhvx = str_nhvx ? strtoul(str_nhvx, NULL, 0) : opt_nhvx;
opt_use_hmx = str_use_hmx ? atoi(str_use_hmx) : opt_use_hmx;
opt_ndev = str_ndev ? strtoul(str_ndev, NULL, 0) : opt_ndev;
if (opt_ndev > GGML_HEXAGON_MAX_SESSIONS) {

View File

@@ -40,6 +40,24 @@ target_compile_definitions(${HTP_LIB} PRIVATE
$<IF:$<BOOL:${HEXAGON_HTP_DEBUG}>,FARF_HIGH=1,>
FP32_QUANTIZE_GROUP_SIZE=${GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE})
# HMX acceleration: available on v73+ architectures
set(HTP_HMX_VERSIONS v73 v75 v79 v81)
list(FIND HTP_HMX_VERSIONS ${DSP_VERSION} _hmx_idx)
if (_hmx_idx GREATER_EQUAL 0)
target_sources(${HTP_LIB} PRIVATE
hmx-matmul-ops.c
)
# -mhmx enables HMX instruction set (needed by files that include hmx-utils.h)
set_source_files_properties(
hmx-matmul-ops.c
PROPERTIES COMPILE_OPTIONS "-mhmx"
)
target_compile_definitions(${HTP_LIB} PRIVATE HTP_HAS_HMX=1)
endif()
build_idl(htp_iface.idl ${HTP_LIB})
set_target_properties(${HTP_LIB} PROPERTIES EXPORT_COMPILE_COMMANDS ON)

View File

@@ -24,28 +24,26 @@
// Context for binary operations
struct htp_binary_context {
struct htp_ops_context * octx;
struct fastdiv_values dim1_div;
struct fastdiv_values dim2_div;
struct fastdiv_values dim12_div;
struct fastdiv_values src0_dim1_div; // ne01
struct fastdiv_values src0_dim2_div; // ne02
struct fastdiv_values src0_dim12_div;// ne03
struct fastdiv_values src1_dim1_div; // ne11
struct fastdiv_values src1_dim2_div; // ne12
struct fastdiv_values src1_dim3_div; // ne13
uint32_t nrows_per_thread;
bool split_at_ne01;
bool split_at_ne02;
// Precomputed values
uint32_t block_max;
uint32_t nrows_per_thread;
size_t src0_row_size_aligned;
size_t src1_row_size_aligned;
size_t dst_row_size_aligned;
uint32_t src1_fetch_rows; // 1 or block_max
uint32_t src1_dma_stride; // 0 or stride
bool split_at_ne01;
bool split_at_ne02;
};
#define htp_binary_preamble \
#define htp_binary_preamble \
const struct htp_tensor * src0 = &octx->src0; \
const struct htp_tensor * src1 = &octx->src1; \
struct htp_tensor * dst = &octx->dst; \
@@ -72,12 +70,11 @@ struct htp_binary_context {
const uint32_t nb2 = dst->nb[2]; \
const uint32_t nb3 = dst->nb[3];
static inline uint32_t calc_block_size(struct htp_binary_context * bctx, uint32_t ir, uint32_t end_row,
uint32_t ne01, uint32_t ne02) {
static inline uint32_t calc_block_size(struct htp_binary_context * bctx, uint32_t ir, uint32_t end_row, uint32_t ne01, uint32_t ne02) {
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir, &bctx->dim12_div);
i03 = fastdiv(ir, &bctx->src0_dim12_div);
rem = ir - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i02 = fastdiv(rem, &bctx->src0_dim1_div);
i01 = rem - i02 * ne01;
uint32_t rows_left = end_row - ir;
@@ -191,6 +188,8 @@ static void binary_job_scalar(unsigned int nth, unsigned int ith, void * data) {
const uint32_t end_row = MIN(start_row + bctx->nrows_per_thread, total_rows);
if (start_row >= end_row) return;
FARF(HIGH, "binary-scalar: %d/%d (%u:%u) row-size %u (%u)", ith, nth, start_row, end_row, nb01, bctx->dst_row_size_aligned);
uint8_t * src0_spad_base = octx->src0_spad.data + (ith * octx->src0_spad.size_per_thread);
uint8_t * dst_spad_base = octx->dst_spad.data + (ith * octx->dst_spad.size_per_thread);
size_t src0_spad_half = octx->src0_spad.size_per_thread / 2;
@@ -204,9 +203,9 @@ static void binary_job_scalar(unsigned int nth, unsigned int ith, void * data) {
for (int k = 0; k < 2 && ir_prefetch < end_row; k++) {
uint32_t current_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir_prefetch, &bctx->dim12_div);
i03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
rem = ir_prefetch - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i02 = fastdiv(rem, &bctx->src0_dim1_div);
i01 = rem - i02 * ne01;
uint8_t * src0_curr = (uint8_t *)src0->data + i03 * nb03 + i02 * nb02 + i01 * nb01;
@@ -215,7 +214,7 @@ static void binary_job_scalar(unsigned int nth, unsigned int ith, void * data) {
uint8_t * s0_spad = src0_spad_base + spad_idx * src0_spad_half;
uint8_t * d_spad = dst_spad_base + spad_idx * dst_spad_half;
dma_queue_push_vtcm_to_ddr(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, 0);
dma_queue_push(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, row_size_bytes, 0);
dma_queue_push(q, dma_make_ptr(s0_spad, src0_curr), bctx->src0_row_size_aligned, nb01, row_size_bytes, current_block_size);
ir_prefetch += current_block_size;
spad_idx ^= 1;
@@ -229,9 +228,9 @@ static void binary_job_scalar(unsigned int nth, unsigned int ith, void * data) {
uint8_t * s0_spad = (uint8_t *) dma_queue_pop(q).dst;
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir, &bctx->dim12_div);
i03 = fastdiv(ir, &bctx->src0_dim12_div);
rem = ir - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i02 = fastdiv(rem, &bctx->src0_dim1_div);
i01 = rem - i02 * ne01;
// src1 indices (broadcast/repeat)
@@ -255,9 +254,9 @@ static void binary_job_scalar(unsigned int nth, unsigned int ith, void * data) {
if (ir_prefetch < end_row) {
uint32_t next_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t p03, p02, p01, prem;
p03 = fastdiv(ir_prefetch, &bctx->dim12_div);
p03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
prem = ir_prefetch - p03 * (ne02 * ne01);
p02 = fastdiv(prem, &bctx->dim1_div);
p02 = fastdiv(prem, &bctx->src0_dim1_div);
p01 = prem - p02 * ne01;
uint8_t * s0_next = (uint8_t *)src0->data + p03 * nb03 + p02 * nb02 + p01 * nb01;
@@ -282,6 +281,8 @@ static void binary_job_vector_same_shape(unsigned int nth, unsigned int ith, voi
const uint32_t end_row = MIN(start_row + bctx->nrows_per_thread, total_rows);
if (start_row >= end_row) return;
FARF(HIGH, "binary-same-shape: %d/%d (%u:%u) row-size %u (%u)", ith, nth, start_row, end_row, nb01, bctx->dst_row_size_aligned);
uint8_t * src0_spad_base = octx->src0_spad.data + (ith * octx->src0_spad.size_per_thread);
uint8_t * src1_spad_base = octx->src1_spad.data + (ith * octx->src1_spad.size_per_thread);
uint8_t * dst_spad_base = octx->dst_spad.data + (ith * octx->dst_spad.size_per_thread);
@@ -297,9 +298,9 @@ static void binary_job_vector_same_shape(unsigned int nth, unsigned int ith, voi
for (int k = 0; k < 2 && ir_prefetch < end_row; k++) {
uint32_t current_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir_prefetch, &bctx->dim12_div);
i03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
rem = ir_prefetch - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i02 = fastdiv(rem, &bctx->src0_dim1_div);
i01 = rem - i02 * ne01;
uint32_t i13 = (ne13 == 1) ? 0 : i03;
@@ -307,23 +308,23 @@ static void binary_job_vector_same_shape(unsigned int nth, unsigned int ith, voi
uint32_t i11 = (ne11 == 1) ? 0 : i01;
uint8_t * src0_curr = (uint8_t *)src0->data + i03 * nb03 + i02 * nb02 + i01 * nb01;
uint8_t * src1_base = (uint8_t *)src1->data + i13 * nb13 + i12 * nb12 + i11 * nb11;
uint8_t * src1_curr = (uint8_t *)src1->data + i13 * nb13 + i12 * nb12 + i11 * nb11;
uint8_t * dst_curr = (uint8_t *)dst->data + i03 * nb3 + i02 * nb2 + i01 * nb1;
uint8_t * s0_spad = src0_spad_base + spad_idx * src0_spad_half;
uint8_t * s1_spad = src1_spad_base + spad_idx * src1_spad_half;
uint8_t * d_spad = dst_spad_base + spad_idx * dst_spad_half;
dma_queue_push_vtcm_to_ddr(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, 0);
dma_queue_push(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, row_size_bytes, 0);
dma_queue_push(q, dma_make_ptr(s0_spad, src0_curr), bctx->src0_row_size_aligned, nb01, row_size_bytes, current_block_size);
dma_queue_push(q, dma_make_ptr(s1_spad, src1_base), bctx->src1_row_size_aligned, bctx->src1_dma_stride, row_size_bytes, current_block_size);
dma_queue_push(q, dma_make_ptr(s1_spad, src1_curr), bctx->src1_row_size_aligned, nb11, row_size_bytes, current_block_size);
ir_prefetch += current_block_size;
spad_idx ^= 1;
}
for (uint32_t ir = start_row; ir < end_row; ) {
uint32_t current_block_size = calc_block_size(bctx, ir, end_row, ne01, ne02);
uint8_t * d_spad = (uint8_t *) dma_queue_pop(q).src;
uint8_t * d_spad = (uint8_t *) dma_queue_pop(q).src;
uint8_t * s0_spad = (uint8_t *) dma_queue_pop(q).dst;
uint8_t * s1_spad = (uint8_t *) dma_queue_pop(q).dst;
@@ -335,9 +336,9 @@ static void binary_job_vector_same_shape(unsigned int nth, unsigned int ith, voi
}
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir, &bctx->dim12_div);
i03 = fastdiv(ir, &bctx->src0_dim12_div);
rem = ir - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i02 = fastdiv(rem, &bctx->src0_dim1_div);
i01 = rem - i02 * ne01;
uint8_t * dst_curr = (uint8_t *)dst->data + i03 * nb3 + i02 * nb2 + i01 * nb1;
dma_queue_push(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, row_size_bytes, current_block_size);
@@ -345,9 +346,9 @@ static void binary_job_vector_same_shape(unsigned int nth, unsigned int ith, voi
if (ir_prefetch < end_row) {
uint32_t next_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t p03, p02, p01, prem;
p03 = fastdiv(ir_prefetch, &bctx->dim12_div);
p03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
prem = ir_prefetch - p03 * (ne02 * ne01);
p02 = fastdiv(prem, &bctx->dim1_div);
p02 = fastdiv(prem, &bctx->src0_dim1_div);
p01 = prem - p02 * ne01;
uint32_t p13 = (ne13 == 1) ? 0 : p03;
@@ -358,7 +359,7 @@ static void binary_job_vector_same_shape(unsigned int nth, unsigned int ith, voi
uint8_t * s1_next = (uint8_t *)src1->data + p13 * nb13 + p12 * nb12 + p11 * nb11;
dma_queue_push(q, dma_make_ptr(s0_spad, s0_next), bctx->src0_row_size_aligned, nb01, row_size_bytes, next_block_size);
dma_queue_push(q, dma_make_ptr(s1_spad, s1_next), bctx->src1_row_size_aligned, bctx->src1_dma_stride, row_size_bytes, next_block_size);
dma_queue_push(q, dma_make_ptr(s1_spad, s1_next), bctx->src1_row_size_aligned, nb11, row_size_bytes, next_block_size);
ir_prefetch += next_block_size;
}
@@ -373,15 +374,17 @@ static void binary_job_vector_row_broadcast(unsigned int nth, unsigned int ith,
struct htp_ops_context * octx = bctx->octx;
htp_binary_preamble;
const uint32_t src0_type = octx->src0.type;
const uint32_t src0_type = octx->src0.type;
const uint32_t row_size_bytes = (src0_type == HTP_TYPE_F32) ? ne00 * sizeof(float) : ne00 * sizeof(_Float16);
const uint32_t total_rows = ne01 * ne02 * ne03;
const uint32_t start_row = bctx->nrows_per_thread * ith;
const uint32_t end_row = MIN(start_row + bctx->nrows_per_thread, total_rows);
const uint32_t start_row = bctx->nrows_per_thread * ith;
const uint32_t end_row = MIN(start_row + bctx->nrows_per_thread, total_rows);
if (start_row >= end_row) return;
FARF(HIGH, "binary-row-bcast: %d/%d (%u:%u) row-size %u (%u)", ith, nth, start_row, end_row, nb01, bctx->dst_row_size_aligned);
uint8_t * src0_spad_base = octx->src0_spad.data + (ith * octx->src0_spad.size_per_thread);
uint8_t * src1_spad = octx->src1_spad.data + (ith * octx->src1_spad.size_per_thread);
uint8_t * src1_spad_base = octx->src1_spad.data + (ith * octx->src1_spad.size_per_thread);
uint8_t * dst_spad_base = octx->dst_spad.data + (ith * octx->dst_spad.size_per_thread);
size_t src0_spad_half = octx->src0_spad.size_per_thread / 2;
@@ -391,15 +394,14 @@ static void binary_job_vector_row_broadcast(unsigned int nth, unsigned int ith,
uint32_t ir_prefetch = start_row;
int spad_idx = 0;
void * s1_ptr = (void *) src1_spad;
void * s1_ptr = (void *) src1_spad_base;
for (int k = 0; k < 2 && ir_prefetch < end_row; k++) {
uint32_t current_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir_prefetch, &bctx->dim12_div);
rem = ir_prefetch - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i01 = rem - i02 * ne01;
uint32_t i03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
uint32_t rem = ir_prefetch - i03 * (ne02 * ne01);
uint32_t i02 = fastdiv(rem, &bctx->src0_dim1_div);
uint32_t i01 = rem - i02 * ne01;
uint8_t * src0_curr = (uint8_t *)src0->data + i03 * nb03 + i02 * nb02 + i01 * nb01;
uint8_t * dst_curr = (uint8_t *)dst->data + i03 * nb3 + i02 * nb2 + i01 * nb1;
@@ -407,7 +409,7 @@ static void binary_job_vector_row_broadcast(unsigned int nth, unsigned int ith,
uint8_t * s0_spad = src0_spad_base + spad_idx * src0_spad_half;
uint8_t * d_spad = dst_spad_base + spad_idx * dst_spad_half;
dma_queue_push_vtcm_to_ddr(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, 0);
dma_queue_push(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, row_size_bytes, 0);
dma_queue_push(q, dma_make_ptr(s0_spad, src0_curr), bctx->src0_row_size_aligned, nb01, row_size_bytes, current_block_size);
ir_prefetch += current_block_size;
spad_idx ^= 1;
@@ -415,7 +417,7 @@ static void binary_job_vector_row_broadcast(unsigned int nth, unsigned int ith,
for (uint32_t ir = start_row; ir < end_row; ) {
uint32_t current_block_size = calc_block_size(bctx, ir, end_row, ne01, ne02);
uint8_t * d_spad = (uint8_t *) dma_queue_pop(q).src;
uint8_t * d_spad = (uint8_t *) dma_queue_pop(q).src;
uint8_t * s0_spad = (uint8_t *) dma_queue_pop(q).dst;
for (uint32_t r = 0; r < current_block_size; r++) {
@@ -425,21 +427,19 @@ static void binary_job_vector_row_broadcast(unsigned int nth, unsigned int ith,
COMPUTE_VECTOR_OP_AAA(r_dst, r_src0, r_src1, src0_type, ne00);
}
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir, &bctx->dim12_div);
rem = ir - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i01 = rem - i02 * ne01;
uint32_t i03 = fastdiv(ir, &bctx->src0_dim12_div);
uint32_t rem = ir - i03 * (ne02 * ne01);
uint32_t i02 = fastdiv(rem, &bctx->src0_dim1_div);
uint32_t i01 = rem - i02 * ne01;
uint8_t * dst_curr = (uint8_t *)dst->data + i03 * nb3 + i02 * nb2 + i01 * nb1;
dma_queue_push(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, row_size_bytes, current_block_size);
if (ir_prefetch < end_row) {
uint32_t next_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t p03, p02, p01, prem;
p03 = fastdiv(ir_prefetch, &bctx->dim12_div);
prem = ir_prefetch - p03 * (ne02 * ne01);
p02 = fastdiv(prem, &bctx->dim1_div);
p01 = prem - p02 * ne01;
uint32_t p03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
uint32_t prem = ir_prefetch - p03 * (ne02 * ne01);
uint32_t p02 = fastdiv(prem, &bctx->src0_dim1_div);
uint32_t p01 = prem - p02 * ne01;
uint8_t * s0_next = (uint8_t *)src0->data + p03 * nb03 + p02 * nb02 + p01 * nb01;
dma_queue_push(q, dma_make_ptr(s0_spad, s0_next), bctx->src0_row_size_aligned, nb01, row_size_bytes, next_block_size);
ir_prefetch += next_block_size;
@@ -458,14 +458,16 @@ static void binary_job_vector_complex(unsigned int nth, unsigned int ith, void *
const uint32_t src0_type = octx->src0.type;
const uint32_t row_size_bytes = (src0_type == HTP_TYPE_F32) ? ne00 * sizeof(float) : ne00 * sizeof(_Float16);
const uint32_t total_rows = ne01 * ne02 * ne03;
const uint32_t start_row = bctx->nrows_per_thread * ith;
const uint32_t end_row = MIN(start_row + bctx->nrows_per_thread, total_rows);
const uint32_t start_row = bctx->nrows_per_thread * ith;
const uint32_t end_row = MIN(start_row + bctx->nrows_per_thread, total_rows);
if (start_row >= end_row) return;
FARF(HIGH, "binary-complex: %d/%d (%u:%u) row-size %u (%u)", ith, nth, start_row, end_row, nb01, bctx->dst_row_size_aligned);
uint8_t * src0_spad_base = octx->src0_spad.data + (ith * octx->src0_spad.size_per_thread);
uint8_t * dst_spad_base = octx->dst_spad.data + (ith * octx->dst_spad.size_per_thread);
size_t src0_spad_half = octx->src0_spad.size_per_thread / 2;
size_t dst_spad_half = octx->dst_spad.size_per_thread / 2;
size_t src0_spad_half = octx->src0_spad.size_per_thread / 2;
size_t dst_spad_half = octx->dst_spad.size_per_thread / 2;
dma_queue * q = octx->ctx->dma[ith];
uint32_t ir_prefetch = start_row;
@@ -473,11 +475,10 @@ static void binary_job_vector_complex(unsigned int nth, unsigned int ith, void *
for (int k = 0; k < 2 && ir_prefetch < end_row; k++) {
uint32_t current_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir_prefetch, &bctx->dim12_div);
rem = ir_prefetch - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i01 = rem - i02 * ne01;
uint32_t i03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
uint32_t rem = ir_prefetch - i03 * (ne02 * ne01);
uint32_t i02 = fastdiv(rem, &bctx->src0_dim1_div);
uint32_t i01 = rem - i02 * ne01;
uint8_t * src0_curr = (uint8_t *)src0->data + i03 * nb03 + i02 * nb02 + i01 * nb01;
uint8_t * dst_curr = (uint8_t *)dst->data + i03 * nb3 + i02 * nb2 + i01 * nb1;
@@ -485,7 +486,7 @@ static void binary_job_vector_complex(unsigned int nth, unsigned int ith, void *
uint8_t * s0_spad = src0_spad_base + spad_idx * src0_spad_half;
uint8_t * d_spad = dst_spad_base + spad_idx * dst_spad_half;
dma_queue_push_vtcm_to_ddr(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, 0);
dma_queue_push(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, row_size_bytes, 0);
dma_queue_push(q, dma_make_ptr(s0_spad, src0_curr), bctx->src0_row_size_aligned, nb01, row_size_bytes, current_block_size);
ir_prefetch += current_block_size;
spad_idx ^= 1;
@@ -496,11 +497,10 @@ static void binary_job_vector_complex(unsigned int nth, unsigned int ith, void *
uint8_t * d_spad = (uint8_t *) dma_queue_pop(q).src;
uint8_t * s0_spad = (uint8_t *) dma_queue_pop(q).dst;
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir, &bctx->dim12_div);
rem = ir - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i01 = rem - i02 * ne01;
uint32_t i03 = fastdiv(ir, &bctx->src0_dim12_div);
uint32_t rem = ir - i03 * (ne02 * ne01);
uint32_t i02 = fastdiv(rem, &bctx->src0_dim1_div);
uint32_t i01 = rem - i02 * ne01;
for (uint32_t r = 0; r < current_block_size; r++) {
uint32_t r_i01 = i01 + r;
@@ -521,11 +521,10 @@ static void binary_job_vector_complex(unsigned int nth, unsigned int ith, void *
if (ir_prefetch < end_row) {
uint32_t next_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t p03, p02, p01, prem;
p03 = fastdiv(ir_prefetch, &bctx->dim12_div);
prem = ir_prefetch - p03 * (ne02 * ne01);
p02 = fastdiv(prem, &bctx->dim1_div);
p01 = prem - p02 * ne01;
uint32_t p03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
uint32_t prem = ir_prefetch - p03 * (ne02 * ne01);
uint32_t p02 = fastdiv(prem, &bctx->src0_dim1_div);
uint32_t p01 = prem - p02 * ne01;
uint8_t * s0_next = (uint8_t *)src0->data + p03 * nb03 + p02 * nb02 + p01 * nb01;
dma_queue_push(q, dma_make_ptr(s0_spad, s0_next), bctx->src0_row_size_aligned, nb01, row_size_bytes, next_block_size);
ir_prefetch += next_block_size;
@@ -545,14 +544,16 @@ static void binary_job_element_repeat(unsigned int nth, unsigned int ith, void *
const uint32_t elem_size_bytes = (src0_type == HTP_TYPE_F32) ? sizeof(float) : sizeof(_Float16);
const uint32_t row_size_bytes = ne00 * elem_size_bytes;;
const uint32_t total_rows = ne01 * ne02 * ne03;
const uint32_t start_row = bctx->nrows_per_thread * ith;
const uint32_t end_row = MIN(start_row + bctx->nrows_per_thread, total_rows);
const uint32_t start_row = bctx->nrows_per_thread * ith;
const uint32_t end_row = MIN(start_row + bctx->nrows_per_thread, total_rows);
if (start_row >= end_row) return;
uint8_t * src0_spad_base = octx->src0_spad.data + (ith * octx->src0_spad.size_per_thread);
uint8_t * dst_spad_base = octx->dst_spad.data + (ith * octx->dst_spad.size_per_thread);
size_t src0_spad_half = octx->src0_spad.size_per_thread / 2;
size_t dst_spad_half = octx->dst_spad.size_per_thread / 2;
size_t src0_spad_half = octx->src0_spad.size_per_thread / 2;
size_t dst_spad_half = octx->dst_spad.size_per_thread / 2;
FARF(HIGH, "binary-repeat: %d/%d (%u:%u) row-size %u (%u)", ith, nth, start_row, end_row, nb01, bctx->dst_row_size_aligned);
dma_queue * q = octx->ctx->dma[ith];
uint32_t ir_prefetch = start_row;
@@ -560,11 +561,10 @@ static void binary_job_element_repeat(unsigned int nth, unsigned int ith, void *
for (int k = 0; k < 2 && ir_prefetch < end_row; k++) {
uint32_t current_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir_prefetch, &bctx->dim12_div);
rem = ir_prefetch - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i01 = rem - i02 * ne01;
uint32_t i03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
uint32_t rem = ir_prefetch - i03 * (ne02 * ne01);
uint32_t i02 = fastdiv(rem, &bctx->src0_dim1_div);
uint32_t i01 = rem - i02 * ne01;
uint8_t * src0_curr = (uint8_t *)src0->data + i03 * nb03 + i02 * nb02 + i01 * nb01;
uint8_t * dst_curr = (uint8_t *)dst->data + i03 * nb3 + i02 * nb2 + i01 * nb1;
@@ -572,7 +572,7 @@ static void binary_job_element_repeat(unsigned int nth, unsigned int ith, void *
uint8_t * s0_spad = src0_spad_base + spad_idx * src0_spad_half;
uint8_t * d_spad = dst_spad_base + spad_idx * dst_spad_half;
dma_queue_push_vtcm_to_ddr(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, 0);
dma_queue_push(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, row_size_bytes, 0);
dma_queue_push(q, dma_make_ptr(s0_spad, src0_curr), bctx->src0_row_size_aligned, nb01, row_size_bytes, current_block_size);
ir_prefetch += current_block_size;
spad_idx ^= 1;
@@ -583,11 +583,10 @@ static void binary_job_element_repeat(unsigned int nth, unsigned int ith, void *
uint8_t * d_spad = (uint8_t *) dma_queue_pop(q).src;
uint8_t * s0_spad = (uint8_t *) dma_queue_pop(q).dst;
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir, &bctx->dim12_div);
rem = ir - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i01 = rem - i02 * ne01;
uint32_t i03 = fastdiv(ir, &bctx->src0_dim12_div);
uint32_t rem = ir - i03 * (ne02 * ne01);
uint32_t i02 = fastdiv(rem, &bctx->src0_dim1_div);
uint32_t i01 = rem - i02 * ne01;
for (uint32_t r = 0; r < current_block_size; r++) {
uint32_t r_i01 = i01 + r;
@@ -612,11 +611,10 @@ static void binary_job_element_repeat(unsigned int nth, unsigned int ith, void *
if (ir_prefetch < end_row) {
uint32_t next_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t p03, p02, p01, prem;
p03 = fastdiv(ir_prefetch, &bctx->dim12_div);
prem = ir_prefetch - p03 * (ne02 * ne01);
p02 = fastdiv(prem, &bctx->dim1_div);
p01 = prem - p02 * ne01;
uint32_t p03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
uint32_t prem = ir_prefetch - p03 * (ne02 * ne01);
uint32_t p02 = fastdiv(prem, &bctx->src0_dim1_div);
uint32_t p01 = prem - p02 * ne01;
uint8_t * s0_next = (uint8_t *)src0->data + p03 * nb03 + p02 * nb02 + p01 * nb01;
dma_queue_push(q, dma_make_ptr(s0_spad, s0_next), bctx->src0_row_size_aligned, nb01, row_size_bytes, next_block_size);
ir_prefetch += next_block_size;
@@ -646,6 +644,7 @@ static void binary_job_add_id(unsigned int nth, unsigned int ith, void * data) {
const uint32_t nb02 = src0->nb[2];
const uint32_t nb03 = src0->nb[3];
const uint32_t nb11 = src1->nb[1]; // src1 row stride
const uint32_t nb1 = dst->nb[1];
const uint32_t nb2 = dst->nb[2];
const uint32_t nb3 = dst->nb[3];
@@ -657,8 +656,8 @@ static void binary_job_add_id(unsigned int nth, unsigned int ith, void * data) {
uint8_t * src0_spad_base = octx->src0_spad.data + (ith * octx->src0_spad.size_per_thread);
uint8_t * dst_spad_base = octx->dst_spad.data + (ith * octx->dst_spad.size_per_thread);
size_t src0_spad_half = octx->src0_spad.size_per_thread / 2;
size_t dst_spad_half = octx->dst_spad.size_per_thread / 2;
size_t src0_spad_half = octx->src0_spad.size_per_thread / 2;
size_t dst_spad_half = octx->dst_spad.size_per_thread / 2;
dma_queue * q = octx->ctx->dma[ith];
uint32_t ir_prefetch = start_row;
@@ -666,11 +665,10 @@ static void binary_job_add_id(unsigned int nth, unsigned int ith, void * data) {
for (int k = 0; k < 2 && ir_prefetch < end_row; k++) {
uint32_t current_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir_prefetch, &bctx->dim12_div);
rem = ir_prefetch - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i01 = rem - i02 * ne01;
uint32_t i03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
uint32_t rem = ir_prefetch - i03 * (ne02 * ne01);
uint32_t i02 = fastdiv(rem, &bctx->src0_dim1_div);
uint32_t i01 = rem - i02 * ne01;
uint8_t * src0_curr = (uint8_t *)src0->data + i03 * nb03 + i02 * nb02 + i01 * nb01;
uint8_t * dst_curr = (uint8_t *)dst->data + i03 * nb3 + i02 * nb2 + i01 * nb1;
@@ -678,7 +676,7 @@ static void binary_job_add_id(unsigned int nth, unsigned int ith, void * data) {
uint8_t * s0_spad = src0_spad_base + spad_idx * src0_spad_half;
uint8_t * d_spad = dst_spad_base + spad_idx * dst_spad_half;
dma_queue_push_vtcm_to_ddr(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, 0);
dma_queue_push(q, dma_make_ptr(dst_curr, d_spad), nb1, bctx->dst_row_size_aligned, ne00 * sizeof(float), 0);
dma_queue_push(q, dma_make_ptr(s0_spad, src0_curr), bctx->src0_row_size_aligned, nb01, ne00 * sizeof(float), current_block_size);
ir_prefetch += current_block_size;
spad_idx ^= 1;
@@ -689,11 +687,10 @@ static void binary_job_add_id(unsigned int nth, unsigned int ith, void * data) {
uint8_t * d_spad = (uint8_t *) dma_queue_pop(q).src;
uint8_t * s0_spad = (uint8_t *) dma_queue_pop(q).dst;
uint32_t i03, i02, i01, rem;
i03 = fastdiv(ir, &bctx->dim12_div);
rem = ir - i03 * (ne02 * ne01);
i02 = fastdiv(rem, &bctx->dim1_div);
i01 = rem - i02 * ne01;
uint32_t i03 = fastdiv(ir, &bctx->src0_dim12_div);
uint32_t rem = ir - i03 * (ne02 * ne01);
uint32_t i02 = fastdiv(rem, &bctx->src0_dim1_div);
uint32_t i01 = rem - i02 * ne01;
for (uint32_t r = 0; r < current_block_size; r++) {
uint32_t r_i01 = i01 + r; // linear within block since we split at ne01
@@ -712,11 +709,10 @@ static void binary_job_add_id(unsigned int nth, unsigned int ith, void * data) {
if (ir_prefetch < end_row) {
uint32_t next_block_size = calc_block_size(bctx, ir_prefetch, end_row, ne01, ne02);
uint32_t p03, p02, p01, prem;
p03 = fastdiv(ir_prefetch, &bctx->dim12_div);
prem = ir_prefetch - p03 * (ne02 * ne01);
p02 = fastdiv(prem, &bctx->dim1_div);
p01 = prem - p02 * ne01;
uint32_t p03 = fastdiv(ir_prefetch, &bctx->src0_dim12_div);
uint32_t prem = ir_prefetch - p03 * (ne02 * ne01);
uint32_t p02 = fastdiv(prem, &bctx->src0_dim1_div);
uint32_t p01 = prem - p02 * ne01;
uint8_t * s0_next = (uint8_t *)src0->data + p03 * nb03 + p02 * nb02 + p01 * nb01;
dma_queue_push(q, dma_make_ptr(s0_spad, s0_next), bctx->src0_row_size_aligned, nb01, ne00 * sizeof(float), next_block_size);
ir_prefetch += next_block_size;
@@ -739,40 +735,36 @@ static int execute_op_binary(struct htp_ops_context * octx) {
const size_t elem_size = (src0_type == HTP_TYPE_F32) ? sizeof(float) : sizeof(_Float16);
const size_t src0_row_size = src0->ne[0] * elem_size;
const size_t src1_row_size = src1->ne[0] * elem_size;
const size_t dst_row_size = dst->ne[0] * elem_size;
const size_t dst_row_size = dst->ne[0] * elem_size;
// Align to VLEN
const size_t src0_row_size_aligned = hex_round_up(src0_row_size, VLEN);
const size_t dst_row_size_aligned = hex_round_up(dst_row_size, VLEN);
size_t src0_row_size_aligned = hex_round_up(src0_row_size, VLEN);
size_t src1_row_size_aligned = hex_round_up(src1_row_size, VLEN);
size_t dst_row_size_aligned = hex_round_up(dst_row_size, VLEN);
bool is_add_id = (octx->op == HTP_OP_ADD_ID);
bool is_scalar = !is_add_id && (src1->ne[0] == 1);
// Determine which kernel we will use to alloc memory and dispatch
bool use_vector_same = !is_add_id && !is_scalar && ((src0->nb[1] % VLEN) == 0) && (src1->ne[0] == src0->ne[0]) &&
bool is_transposed = (src0->nb[1] < src0_row_size || src1->nb[1] < src1_row_size || dst->nb[1] < dst_row_size);
bool is_same_shape = !is_add_id && !is_scalar && !is_transposed &&
(src1->ne[0] == src0->ne[0] && src0->ne[0] % VLEN == 0) &&
(src1->ne[1] == src0->ne[1] || src1->ne[1] == 1) &&
(src1->ne[2] == src0->ne[2] || src1->ne[2] == 1) &&
(src1->ne[3] == src0->ne[3] || src1->ne[3] == 1);
bool is_row_bcast = use_vector_same && (src1->ne[1] == 1 && src1->ne[2] == 1 && src1->ne[3] == 1);
bool use_complex = !is_add_id && !is_scalar && !use_vector_same && (src1->ne[0] == src0->ne[0]);
bool use_repeat = !is_add_id && !is_scalar && !use_vector_same && (src1->ne[0] != src0->ne[0]);
bool is_row_bcast = is_same_shape && (src1->ne[1] == 1 && src1->ne[2] == 1 && src1->ne[3] == 1);
bool is_complex = !is_add_id && !is_scalar && !is_same_shape && (src1->ne[0] == src0->ne[0]);
bool is_repeat = !is_add_id && !is_scalar && !is_same_shape && (src1->ne[0] != src0->ne[0]);
size_t spad_row_total;
if (is_scalar) {
spad_row_total = 2 * (src0_row_size_aligned + dst_row_size_aligned);
} else if (is_row_bcast) {
spad_row_total = 2 * (src0_row_size_aligned + dst_row_size_aligned);
} else if (use_vector_same) {
if (is_same_shape) {
spad_row_total = 2 * (src0_row_size_aligned + src1_row_size_aligned + dst_row_size_aligned);
} else if (is_add_id) {
spad_row_total = 2 * (src0_row_size_aligned + dst_row_size_aligned); // src1 read directly
} else {
spad_row_total = 2 * (src0_row_size_aligned + dst_row_size_aligned);
}
size_t rows_per_buffer = octx->ctx->vtcm_size / (n_threads * spad_row_total);
// Adjust for static src1 in row_bcast case
if (is_row_bcast) {
size_t needed_static = src1_row_size_aligned;
@@ -782,28 +774,26 @@ static int execute_op_binary(struct htp_ops_context * octx) {
}
if (rows_per_buffer < 1) {
FARF(ERROR, "binary: VTCM too small\n");
return HTP_STATUS_VTCM_TOO_SMALL;
FARF(ERROR, "binary: VTCM too small\n");
return HTP_STATUS_VTCM_TOO_SMALL;
}
octx->src0_spad.size_per_thread = rows_per_buffer * 2 * src0_row_size_aligned;
octx->dst_spad.size_per_thread = rows_per_buffer * 2 * dst_row_size_aligned;
if (is_scalar || use_complex || use_repeat || is_add_id) {
octx->src1_spad.size_per_thread = 0;
} else if (is_row_bcast) {
if (is_add_id || is_scalar || is_complex || is_repeat || is_row_bcast) {
octx->src1_spad.size_per_thread = 0;
} else {
octx->src1_spad.size_per_thread = rows_per_buffer * 2 * src1_row_size_aligned;
}
octx->dst_spad.size = n_threads * octx->dst_spad.size_per_thread;
octx->src0_spad.size = n_threads * octx->src0_spad.size_per_thread;
if (is_row_bcast) {
octx->src1_spad.size = src1_row_size_aligned;
} else {
octx->src1_spad.size = n_threads * octx->src1_spad.size_per_thread;
}
octx->dst_spad.size = n_threads * octx->dst_spad.size_per_thread;
if (octx->ctx->vtcm_size < (octx->src0_spad.size + octx->src1_spad.size + octx->dst_spad.size)) {
return HTP_STATUS_VTCM_TOO_SMALL;
@@ -823,46 +813,37 @@ static int execute_op_binary(struct htp_ops_context * octx) {
}
struct htp_binary_context bctx;
bctx.octx = octx;
bctx.nrows_per_thread = (src0_nrows + n_threads - 1) / n_threads;
bctx.block_max = rows_per_buffer;
bctx.octx = octx;
bctx.nrows_per_thread = (src0_nrows + n_threads - 1) / n_threads;
bctx.block_max = rows_per_buffer;
bctx.src0_row_size_aligned = src0_row_size_aligned;
bctx.src1_row_size_aligned = src1_row_size_aligned;
bctx.dst_row_size_aligned = dst_row_size_aligned;
bctx.dim1_div = init_fastdiv_values(src0->ne[1]);
bctx.dim2_div = init_fastdiv_values(src0->ne[2]);
bctx.dim12_div = init_fastdiv_values(src0->ne[1] * src0->ne[2]);
bctx.src0_dim1_div = init_fastdiv_values(src0->ne[1]);
bctx.src0_dim2_div = init_fastdiv_values(src0->ne[2]);
bctx.src0_dim12_div = init_fastdiv_values(src0->ne[1] * src0->ne[2]);
bctx.src1_dim1_div = init_fastdiv_values(src1->ne[1]);
bctx.src1_dim2_div = init_fastdiv_values(src1->ne[2]);
bctx.src1_dim3_div = init_fastdiv_values(src1->ne[3]);
bctx.src1_dim1_div = init_fastdiv_values(src1->ne[1]);
bctx.src1_dim2_div = init_fastdiv_values(src1->ne[2]);
bctx.src1_dim3_div = init_fastdiv_values(src1->ne[3]);
bool src0_contig_dim1 = (src0->nb[2] == src0->ne[1] * src0->nb[1]);
bool dst_contig_dim1 = (dst->nb[2] == src0->ne[1] * dst->nb[1]);
bool dst_contig_dim1 = (dst->nb[2] == src0->ne[1] * dst->nb[1]);
bool src0_contig_dim2 = (src0->nb[3] == src0->ne[2] * src0->nb[2]);
bool dst_contig_dim2 = (dst->nb[3] == src0->ne[2] * dst->nb[2]);
bool dst_contig_dim2 = (dst->nb[3] == src0->ne[2] * dst->nb[2]);
bctx.split_at_ne01 = (src0->ne[2] > 1) &&
((src1->ne[1] > 1) || (src1->ne[2] > 1) || !src0_contig_dim1 || !dst_contig_dim1);
bctx.split_at_ne02 = (src0->ne[3] > 1) &&
((src1->ne[2] > 1) || (src1->ne[3] > 1) || !src0_contig_dim2 || !dst_contig_dim2);
// Precompute specific kernel parameters
if (use_vector_same) {
bctx.src1_dma_stride = (src1->ne[1] == 1) ? 0 : src1->nb[1];
bctx.src1_fetch_rows = (src1->ne[1] == 1) ? 1 : rows_per_buffer;
}
bctx.split_at_ne01 = (src0->ne[2] > 1) && ((src1->ne[1] > 1) || (src1->ne[2] > 1) || !src0_contig_dim1 || !dst_contig_dim1);
bctx.split_at_ne02 = (src0->ne[3] > 1) && ((src1->ne[2] > 1) || (src1->ne[3] > 1) || !src0_contig_dim2 || !dst_contig_dim2);
worker_callback_t worker_func;
if (is_add_id) worker_func = binary_job_add_id;
else if (is_scalar) worker_func = binary_job_scalar;
else if (is_row_bcast) worker_func = binary_job_vector_row_broadcast;
else if (use_vector_same) worker_func = binary_job_vector_same_shape;
else if (use_complex) worker_func = binary_job_vector_complex;
else worker_func = binary_job_element_repeat;
if (is_add_id) worker_func = binary_job_add_id;
else if (is_scalar) worker_func = binary_job_scalar;
else if (is_row_bcast) worker_func = binary_job_vector_row_broadcast;
else if (is_same_shape) worker_func = binary_job_vector_same_shape;
else if (is_complex) worker_func = binary_job_vector_complex;
else worker_func = binary_job_element_repeat;
if (is_row_bcast) {
dma_queue_pop(q);

View File

@@ -31,8 +31,8 @@ dma_queue * dma_queue_create(size_t capacity) {
q->capacity = capacity;
q->idx_mask = capacity - 1;
q->desc = (hexagon_udma_descriptor_type1_t *) memalign(64, capacity * sizeof(hexagon_udma_descriptor_type1_t));
memset(q->desc, 0, capacity * sizeof(hexagon_udma_descriptor_type1_t));
q->desc = (dma_descriptor_2d *) memalign(64, capacity * sizeof(dma_descriptor_2d));
memset(q->desc, 0, capacity * sizeof(dma_descriptor_2d));
q->dptr = (dma_ptr *) memalign(4, capacity * sizeof(dma_ptr));
memset(q->dptr, 0, capacity * sizeof(dma_ptr));

View File

@@ -10,19 +10,84 @@
extern "C" {
#endif
// Define the HW descriptor structs here since the ones in HexSDK are a bit out of date
typedef struct dma_descriptor_1d_s {
void * next;
uint32_t size:24;
uint32_t desc_size:2;
uint32_t dst_comp:1;
uint32_t src_comp:1;
uint32_t dst_bypass:1;
uint32_t src_bypass:1;
uint32_t order:1;
uint32_t done:1;
void * src;
void * dst;
} dma_descriptor_1d;
#if __HVX_ARCH__ < 75
typedef struct dma_descriptor_2d_s {
void * next;
uint32_t reserved0:24;
uint32_t desc_size:2;
uint32_t dst_comp:1;
uint32_t src_comp:1;
uint32_t dst_bypass:1;
uint32_t src_bypass:1;
uint32_t order:1;
uint32_t done:1;
void * src;
void * dst;
uint32_t desc_type:8;
uint32_t reserved1:24;
uint32_t row_size:16;
uint32_t nrows:16;
uint32_t src_stride:16;
uint32_t dst_stride:16;
uint32_t src_offset:16;
uint32_t dst_offset:16;
} dma_descriptor_2d;
#else
typedef struct dma_descriptor_2d_s {
void * next;
uint32_t dst_stride:24;
uint32_t desc_size:2;
uint32_t dst_comp:1;
uint32_t src_comp:1;
uint32_t dst_bypass:1;
uint32_t src_bypass:1;
uint32_t order:1;
uint32_t done:1;
void * src;
void * dst;
uint32_t desc_type:8;
uint32_t reserved0:24;
uint32_t row_size:24;
uint32_t nrows_lo:8;
uint32_t nrows_hi:8;
uint32_t src_stride:24;
uint32_t offset:24;
uint32_t reserved1:8;
} dma_descriptor_2d;
#endif
typedef struct {
void *dst;
void *dst;
const void *src;
} dma_ptr;
typedef struct {
hexagon_udma_descriptor_type1_t * desc; // descriptor pointers
hexagon_udma_descriptor_type1_t * tail; // tail pointer
dma_ptr * dptr; // dst/src pointers
uint32_t push_idx;
uint32_t pop_idx;
uint32_t capacity;
uint32_t idx_mask;
dma_descriptor_2d * desc; // descriptor pointers
dma_descriptor_2d * tail; // tail pointer
dma_ptr * dptr; // dst/src pointers
uint32_t push_idx;
uint32_t pop_idx;
uint32_t capacity;
uint32_t idx_mask;
} dma_queue;
dma_queue * dma_queue_create(size_t capacity);
@@ -59,71 +124,87 @@ static inline dma_ptr dma_make_ptr(void *dst, const void *src)
return p;
}
static inline bool dma_queue_push(dma_queue * q,
dma_ptr dptr,
size_t dst_row_size,
size_t src_row_size,
size_t width, // width in bytes. number of bytes to transfer per row
size_t nrows) {
#if __HVX_ARCH__ < 73
static const uint32_t dma_src_l2_bypass_on = 1;
static const uint32_t dma_dst_l2_bypass_on = 0;
#else
static const uint32_t dma_src_l2_bypass_on = 1;
static const uint32_t dma_dst_l2_bypass_on = 1;
#endif
static inline bool dma_queue_push_single_1d(dma_queue * q, dma_ptr dptr, size_t size) {
if (((q->push_idx + 1) & q->idx_mask) == q->pop_idx) {
FARF(ERROR, "dma-push: queue full\n");
FARF(HIGH, "dma-push: queue full\n");
return false;
}
hexagon_udma_descriptor_type1_t * desc = &q->desc[q->push_idx];
dma_descriptor_1d * desc = (dma_descriptor_1d *) &q->desc[q->push_idx];
desc->next = NULL;
desc->desc_size = 0; // 1D mode
desc->src_bypass = dma_src_l2_bypass_on;
desc->dst_bypass = dma_dst_l2_bypass_on;
desc->order = 1;
desc->done = 0;
desc->src = (void *) dptr.src;
desc->dst = (void *) dptr.dst;
desc->size = size;
q->dptr[q->push_idx] = dptr;
dmlink(q->tail, desc);
q->tail = (dma_descriptor_2d *) desc;
// FARF(ERROR, "dma-push: i %u row-size %u nrows %d dst %p src %p\n", q->push_idx, row_size, nrows, dptr.dst, dptr.src);
q->push_idx = (q->push_idx + 1) & q->idx_mask;
return true;
}
static inline bool dma_queue_push_single_2d(dma_queue * q, dma_ptr dptr, size_t dst_stride, size_t src_stride, size_t row_size, size_t nrows) {
if (((q->push_idx + 1) & q->idx_mask) == q->pop_idx) {
FARF(HIGH, "dma-push: queue full\n");
return false;
}
dma_descriptor_2d * desc = &q->desc[q->push_idx];
desc->next = NULL;
desc->length = 0;
desc->desctype = HEXAGON_UDMA_DESC_DESCTYPE_TYPE1;
desc->dstbypass = 1;
desc->srcbypass = 1;
#if __HVX_ARCH__ >= 73
desc->dstbypass = 1;
desc->srcbypass = 1;
#else
desc->dstbypass = 0;
desc->srcbypass = 1;
#endif
desc->order = 0;
desc->dstate = HEXAGON_UDMA_DESC_DSTATE_INCOMPLETE;
desc->reserved0 = 0;
desc->reserved1 = 0;
desc->desc_size = 1; // 2d mode
desc->src_bypass = dma_src_l2_bypass_on;
desc->dst_bypass = dma_dst_l2_bypass_on;
desc->src_comp = 0;
desc->dst_comp = 0;
desc->order = 1;
desc->done = 0;
desc->src_stride = src_stride;
desc->dst_stride = dst_stride;
desc->src = (void *) dptr.src;
desc->dst = (void *) dptr.dst;
desc->allocation = 0;
desc->padding = 0;
desc->roiwidth = width;
desc->roiheight = nrows;
desc->srcstride = src_row_size;
desc->dststride = dst_row_size;
desc->srcwidthoffset = 0;
desc->dstwidthoffset = 0;
desc->row_size = row_size;
#if __HVX_ARCH__ < 75
desc->desc_type = 0; // 2d (16-bit) mode
desc->nrows = nrows;
desc->src_offset = 0;
desc->dst_offset = 0;
#else
desc->desc_type = 9; // 2d (24-bit) mode
desc->nrows_lo = (nrows & 0xff);
desc->nrows_hi = (nrows >> 8);
desc->offset = 0;
#endif
q->dptr[q->push_idx] = dptr;
dmlink(q->tail, desc);
q->tail = desc;
// FARF(ERROR, "dma-push: i %u width %u nrows %d dst %p src %p\n", q->push_idx, width, nrows, dptr.dst, dptr.src);
// FARF(ERROR, "dma-push: i %u row-size %u nrows %d dst %p src %p\n", q->push_idx, row_size, nrows, dptr.dst, dptr.src);
q->push_idx = (q->push_idx + 1) & q->idx_mask;
return true;
}
static inline bool dma_queue_push_ddr_to_vtcm(dma_queue * q,
dma_ptr dptr,
size_t dst_row_size,
size_t src_row_size,
size_t nrows) {
return dma_queue_push(q, dptr, dst_row_size, src_row_size, src_row_size, nrows);
}
static inline bool dma_queue_push_vtcm_to_ddr(dma_queue * q,
dma_ptr dptr,
size_t dst_row_size,
size_t src_row_size,
size_t nrows) {
return dma_queue_push(q, dptr, dst_row_size, src_row_size, dst_row_size, nrows);
}
static inline dma_ptr dma_queue_pop(dma_queue * q) {
dma_ptr dptr = { NULL };
@@ -131,12 +212,12 @@ static inline dma_ptr dma_queue_pop(dma_queue * q) {
return dptr;
}
hexagon_udma_descriptor_type1_t * desc = &q->desc[q->pop_idx];
dma_descriptor_2d * desc = &q->desc[q->pop_idx];
// Wait for desc to complete
while (1) {
dmpoll();
if (desc->dstate == HEXAGON_UDMA_DESC_DSTATE_COMPLETE) {
if (desc->done) {
break;
}
// FARF(ERROR, "dma-pop: waiting for DMA : %u\n", q->pop_idx);
@@ -175,6 +256,62 @@ static inline uint32_t dma_queue_capacity(dma_queue * q) {
return q->capacity;
}
#if __HVX_ARCH__ < 75
// Overflow-safe DMA push: all 2d descriptor fields (row_size, nrows, src_stride, dst_stride) are 16-bit, max 65535.
// This version transparently handles values that exceed the 16-bit limit and submits chained DMA transtions.
#define DMA_MAX_FIELD_VAL 65535u
static inline bool dma_queue_push(dma_queue *q, dma_ptr dptr, size_t dst_stride, size_t src_stride, size_t row_size, size_t nrows) {
// Fast path: everything fits in 16 bits
if (nrows == 0 || __builtin_expect(
row_size <= DMA_MAX_FIELD_VAL &&
nrows <= DMA_MAX_FIELD_VAL &&
src_stride <= DMA_MAX_FIELD_VAL &&
dst_stride <= DMA_MAX_FIELD_VAL, 1)) {
return dma_queue_push_single_2d(q, dptr, dst_stride, src_stride, row_size, nrows);
}
// Contiguous block
// Use 1d DMA mode which supports sizes up to 24-bits (16MB)
if (nrows == 1 || (row_size == src_stride && row_size == dst_stride)) {
size_t total = row_size * nrows;
return dma_queue_push_single_1d(q, dptr, total);
}
// Stride overflow — fall back to row-by-row.
{
const uint8_t *src = (const uint8_t *) dptr.src;
uint8_t *dst = (uint8_t *) dptr.dst;
for (size_t r = 0; r < nrows; ++r) {
dma_ptr p = dma_make_ptr(dst + r * dst_stride, src + r * src_stride);
if (!dma_queue_push_single_1d(q, p, row_size))
return false;
if (r + 1 < nrows)
dma_queue_pop(q);
}
return true;
}
}
#else // HVX_ARCH >= 75
static inline bool dma_queue_push(dma_queue *q, dma_ptr dptr, size_t dst_stride, size_t src_stride, size_t row_size, size_t nrows) {
// On v75 and up we always use 2d 24-bit mode
return dma_queue_push_single_2d(q, dptr, dst_stride, src_stride, row_size, nrows);
}
#endif
static inline bool dma_queue_push_ddr_to_vtcm(dma_queue * q, dma_ptr dptr, size_t dst_row_size, size_t src_row_size, size_t nrows) {
return dma_queue_push(q, dptr, dst_row_size, src_row_size, src_row_size, nrows);
}
static inline bool dma_queue_push_vtcm_to_ddr(dma_queue * q, dma_ptr dptr, size_t dst_row_size, size_t src_row_size, size_t nrows) {
return dma_queue_push(q, dptr, dst_row_size, src_row_size, dst_row_size, nrows);
}
#ifdef __cplusplus
} // extern "C"
#endif

View File

@@ -21,6 +21,15 @@ static inline void hex_dump_uint8_line(char * pref, const uint8_t * x, uint32_t
FARF(HIGH, "%s\n", str);
}
static inline void hex_dump_uint32_line(char * pref, const uint32_t * x, uint32_t n) {
char str[1024], *p = str, *p_end = str + sizeof(str);
p += snprintf(p, p_end - p, "%s: ", pref);
for (int i = 0; i < n; i++) {
p += snprintf(p, p_end - p, "%u, ", (unsigned int) x[i]);
}
FARF(HIGH, "%s\n", str);
}
static inline void hex_dump_int32_line(char * pref, const int32_t * x, uint32_t n) {
char str[1024], *p = str, *p_end = str + sizeof(str);
p += snprintf(p, p_end - p, "%s: ", pref);

View File

@@ -29,10 +29,22 @@ static inline uint64_t hex_get_pktcnt() {
return pktcnt;
}
static inline int32_t hex_is_aligned(void * addr, uint32_t align) {
static inline size_t hmx_ceil_div(size_t num, size_t den) {
return (num + den - 1) / den;
}
static inline int32_t hex_is_aligned(const void * addr, uint32_t align) {
return ((size_t) addr & (align - 1)) == 0;
}
static inline size_t hex_align_up(size_t v, size_t align) {
return hmx_ceil_div(v, align) * align;
}
static inline size_t hex_align_down(size_t v, size_t align) {
return (v / align) * align;
}
static inline int32_t hex_is_one_chunk(void * addr, uint32_t n, uint32_t chunk_size) {
uint32_t left_off = (size_t) addr & (chunk_size - 1);
uint32_t right_off = left_off + n;
@@ -43,6 +55,14 @@ static inline uint32_t hex_round_up(uint32_t n, uint32_t m) {
return m * ((n + m - 1) / m);
}
static inline size_t hex_smin(size_t a, size_t b) {
return a < b ? a : b;
}
static inline size_t hex_smax(size_t a, size_t b) {
return a > b ? a : b;
}
static inline void hex_l2fetch(const void * p, uint32_t width, uint32_t stride, uint32_t height) {
const uint64_t control = Q6_P_combine_RR(stride, Q6_R_combine_RlRl(width, height));
Q6_l2fetch_AP((void *) p, control);

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,72 @@
// HMX operation entry-point declarations.
// Ported from htp-ops-lib/include/dsp/ops.h (renamed, benchmark kernels removed). (https://github.com/haozixu/htp-ops-lib)
#ifndef HMX_OPS_H
#define HMX_OPS_H
#include <stddef.h>
#include <stdint.h>
#ifndef restrict
# define restrict __restrict
#endif
#ifdef __cplusplus
extern "C" {
#endif
struct htp_context; // forward declaration
typedef struct {
float *dst;
const float *activation;
const __fp16 *permuted_weight;
int m;
int k;
int n;
int act_stride;
int weight_stride;
int dst_stride;
int ne02;
int ne03;
int ne12;
int ne13;
size_t src0_nb2;
size_t src0_nb3;
size_t src1_nb2;
size_t src1_nb3;
size_t dst_nb2;
size_t dst_nb3;
} hmx_matmul_w16a32_batched_params_t;
// HMX matrix multiplication — tile-permuted FP16 weights, FP32 activation/output
// act_stride: activation row stride in elements (= k for contiguous, or
// nb[1]/sizeof(float) for permuted tensors like attention Q).
// weight_stride: weight row stride in elements (= k for compact weights, or
// nb[1]/sizeof(__fp16) for permuted KV-cache views used by QK).
int hmx_mat_mul_permuted_w16a32(struct htp_context *ctx,
float *restrict dst,
const float *activation,
const __fp16 *permuted_weight,
int m, int k, int n,
int act_stride,
int weight_stride);
// Batched F16 wrapper over hmx_mat_mul_permuted_w16a32.
// Batch semantics match ggml_mul_mat(): src0 broadcasts to src1 in dims 2/3.
int hmx_mat_mul_permuted_w16a32_batched(struct htp_context *ctx,
const hmx_matmul_w16a32_batched_params_t *params);
// HMX matrix multiplication — tile-permuted quantised weights (Q4_0/Q8_0/IQ4_NL)
int hmx_mat_mul_permuted_qk_0_d16a32(struct htp_context *ctx,
float *restrict dst,
const float *activation,
const uint8_t *permuted_weight,
int m, int k, int n,
int weight_type);
#ifdef __cplusplus
}
#endif
#endif // HMX_OPS_H

Some files were not shown because too many files have changed in this diff Show More