* fix(mixed-types): use f32 for precision and update the shared memory calculation logic for f32
* fix(unary): correct the gelu, gelu quick and gelu erf functions
* fix(flash-attn-tile): fix the hardcode v type
* fix(flash_attn): fix tile path
* fix: pass editorconfig and address the type conflicts
* fix: remove reduant pipeline keys
* fix: remove inline min/max group size functions and revert the flash attn path order
* fix: use clamp to avoid NaN for GELU
* fix: use the right range for exp, 80 is safer for f32 exp
* model-conversion : add causal-convert-mmproj target [no ci]
This commit adds a new Make target that only converts the mmproj model.
The motivation for this that the causal-convert-mm-model target will
convert both the test model and the mmproj model which is nice when the
model model conversion is finalized. But during development it is nice
to be able to just convert the mmproj model and not have to wait for
the often more time consuming text model conversion.
* add path model path validation check
* working llama-eval mc and math suite
* multi source llama-eval
* Add readme
* add checkpointing
* examples: add llama-server simulator for testing eval scripts
Add a standalone Python script that simulates a llama-server HTTP endpoint
for testing the eval script. The simulator:
- Implements /v1/chat/completions endpoint with OpenAI-compatible format
- Loads AIME dataset from HuggingFace with local caching
- Uses Levenshtein distance for intelligent question matching
- Supports configurable success rate for correct/wrong answer generation
- Provides debug logging for troubleshooting
Also includes test scripts and documentation for testing and understanding
the simulator functionality.
* examples: refactor test-simulator.sh for better readability
Extract repeating question string into TEST_QUESTION variable and
create make_request() helper function to reduce code duplication.
Add proper error handling for error responses.
* docs: update llama-eval-discussion.md with session work summary
Add summary of llama-server-simulator implementation work including
features, testing results, technical decisions, and refactoring.
* examples: add simplified llama-eval-new.py for AIME evaluation
- Create new simplified evaluation script focused only on AIME
- Implement EvalState and Processor dataclasses for structured state management
- Add real-time feedback showing correct/incorrect status per case
- Abstract grading interface for external grader support
- Use structured JSON output for eval state
- Apply HuggingFace dataset caching to avoid repeated downloads
- Remove Levenshtein matching - eval script only sends requests and validates answers
* docs: remove README.md from llama-eval
* examples: implement flexible grader system for answer validation
- Add Grader class supporting regex and CLI-based grading
- Implement built-in regex patterns for AIME, GSM8K, MMLU, HellaSwag, ARC, WinoGrande
- Add CLI grader interface: python script.py --answer <pred> --expected <gold>
- Add HF telemetry disable to avoid warnings
- Support exact match requirement for regex patterns
- Add 30-second timeout for CLI grader
- Handle both boxed and plain text formats for AIME answers
* examples: use HF_HUB_OFFLINE to avoid HF Hub warnings
* examples: remove HF_HUB_OFFLINE to allow dataset download
* examples: use cached dataset path to avoid HF Hub requests
* examples: use cached dataset path in simulator to avoid HF Hub requests
* docs: update llama-eval-discussion.md with session work summary
* examples: add threading support and model parameter to llama-eval-new.py
- Add ThreadPoolExecutor for parallel request processing controlled by --threads
- Add --model argument to specify model name in request data
- Refactor process() to use thread-safe _process_single_case() method
- Update progress tracking to work with concurrent execution
* docs: update llama-eval-discussion.md with threading and model parameter updates
- Add threading support implementation details
- Document ThreadPoolExecutor usage and thread safety
- Add model parameter implementation details
- Include testing results for both features
* examples: add task summary table to llama-eval-new.py
* eval : print progress
* eval : add prompts
* test : fix path
* sim : fix answer matching
* eval : support multiple dataset runs
* minor
* improve grader
* docs
* remove old files
* datasets : add gsm8k
* add gpqa + sampling + docs
* rename
* grader : improve example answers
* cont
* datasets : add aime2025
* grader : update prompt
* grade : improve regex + logs
* datasets : fix aime2025
* cleanup
* add AGENTS.md
* ignore errors
* resume eval
* cleanup
* fix counts
* simplify
* fix prompts
* add html
* store full response
* add tokens
* resoning and error handling
* refactor
* track total time
* remove junk
* eval : unify "judge" terminology to "grader"
Replace all occurrences of "judge" with "grader" for consistency
across the codebase (CLI args, Grader class fields, help text).
Assisted-by: llama.cpp:local pi
* eval : add Wilson score confidence interval to results
Compute 95% CI on-the-fly from completed cases. Displayed in
terminal output, HTML report, and JSON state.
* llama-eval : add per-task generation speed from server timings
Extract predicted_per_second from the server timings response and store
it as tps_gen per task. Display in console progress, print_all_tasks,
and HTML report.
Assisted-by: llama.cpp:local pi
* llama-eval : add per-task generation time from server timings
Extract predicted_ms from the server timings response and store it as
t_gen_ms per task. Display in seconds with one decimal digit in console
progress, print_all_tasks, and HTML report.
Assisted-by: llama.cpp:local pi
* llama-eval : rename display, escaped, and count variables to use prefix convention
- _display suffix → display_ prefix (answer, tokens, tps, t_gen)
- _escaped suffix → escaped_ prefix (response, prompt, reasoning)
- _count suffix → n_ prefix (correct, incorrect, pending)
Assisted-by: llama.cpp:local pi
* llama-eval : support multiple evaluation endpoints with dynamic task distribution
- Add ServerConfig dataclass (url, threads, name)
- Accept comma-separated --server, --threads, --server-name CLI args
- Dynamic shared-queue task distribution across servers (fast servers do more work)
- One ThreadPoolExecutor per server, workers pull from shared Queue
- Track which server processed each task (server_name in results)
- Thread-safe EvalState with threading.Lock for concurrent mutations
- Server column in HTML report and console output
- Backward compatible: single server works as before
Assisted-by: llama.cpp:local pi
* llama-server-simulator : replace Flask with stdlib http.server
- Use HTTPServer + BaseHTTPRequestHandler instead of Flask
- RequestHandler handles POST /v1/chat/completions
- Server runs in daemon thread with clean Ctrl+C shutdown
- Remove flask and unused asdict imports
Assisted-by: llama.cpp:local pi
* llama-eval : update README with PR link and quick-start examples
Assisted-by: llama.cpp:local pi
* llama-eval : track model name in eval state and verify on resume
- Store model_name in EvalState and JSON output
- Display model in HTML summary table
- Verify --model matches stored model when resuming
Assisted-by: llama.cpp:local pi
* llama-server-simulator : fix comment - Dice coefficient, not Levenshtein
Assisted-by: llama.cpp:local pi
* llama-eval : require --grader-model or --model when using --grader-type llm
Assisted-by: llama.cpp:local pi
* llama-eval : protect dump() with lock for thread safety
Assisted-by: llama.cpp:local pi
* llama-eval : compact HTML report output
- Replace verbose summary table with single inline bar
- Shorten status text: '✓'/'✗'/'–'/'!' instead of full words
- Flatten CSS: remove box-shadows, border-radius, reduce padding
- Use system-ui font, 13px table, 12px details
- Conditional reasoning section (only shown when present)
- Single toggle JS function instead of two
- Shorter column headers
Assisted-by: llama.cpp:local pi
* llama-eval : check server connectivity on startup
- Hit /v1/models for each server before evaluation
- Exit with error if any server is unreachable
- Print comma-separated model IDs per server in startup output
- Sequential checks, no retries, no timeout override
Assisted-by: llama.cpp:local pi
* llama-eval : use server1/server2 instead of gpu1/gpu2 in README
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: gatbontonpc <gatbontonpc@gmail.com>
* convert : add split() method to LoraTorchTensor
* Fix python type-check
* Fix flake8 Lint
* fix: handle positional dim arg in torch.split dispatch
* Fix type-check again
* Fix type-checks
* Remove unit test per reviewers feedback
* work around ty deficiency
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Q4_1 MoE CLC pass sanity check
* remove unnecessary code
* opencl: remove unnecessary asserts and reformat
* opencl: fix supports_op for q4_1 moe
* q4_1 moe is supported by Adreno with certain shapes
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
`im2col_cuda` and `im2col_3d_cuda` both dispatch with
`block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on
raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet
at 11 s lands at OW = 176000 -- and the launch returns
`invalid configuration argument`.
Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the
kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern
already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel`
and 3D `im2col_3d_kernel` need the same fix. Bit-identical for
OW <= 65535 (single iteration of the new outer loop).
Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s /
16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns
`invalid configuration argument`, post-fix runs to completion.
Existing test-backend-ops im2col cases unchanged.
* cuda: tighten snake fusion type checks for all operands (defensive, sync vulkan)
* cuda: reject snake fusion when ne[2] or ne[3] > 1 (mirror vulkan PR review)
* cuda: merge type_ok and types_ok into a single types_ok (address am17an review)
* cuda: filter ADD/SUB/MUL/DIV in supports_op to F32/F16
bin_bcast only dispatches F32/F16 type triplets, mirror the
vulkan filter so unsupported types fall back through cpy
instead of aborting.
* test-backend-ops: extend snake_fuse to rank-4 with ne[2]/ne[3] > 1 cases
* spec : refactor
* spec : drop support for incompatible vocabs
* spec : update common_speculative_init()
* cont : pass seq_id
* cont : dedup ctx_seq_rm_type
* server : sketch the ctx_dft decode loop
* server : draft prompt cache and checkpoints
* server : improve ctx names
* server, spec : transition to unified spec context
* cont : sync main and drft contexts
* cont : async drft eval when possible
* cont : handle non-ckpt models
* cont : pass correct n_past for drafting
* cont : process images throught the draft context
* spec : handle draft running out of context
* server : fix mtmd draft processing
* server : fix URL for draft model
* server : add comment
* server : clean-up + dry
* speculative-simple : update
* spec : fix n_past type
* server : fix slot ctx_drft ptr
* tools : update readme
* naming : improve consistency
* spec : refactor for multi-sequence speculative context
* cont : prepare params
* cont : prepare params
* spec : support parallel drafts
* server : support parallel drafting
* llama : reuse device buffers when possible
* server, spec : clean-up
* cont : clean-up
* cont : minor
* spec : reset `drafting` flag at the end
* spec : introduce `common_speculative_process()`
* spec : allow for multiple spec types (chain of speculators)
* replace old type field of type common_speculative_type in the
common_params_speculative struct with a vector to allow multiple
types to be specified
* introduce common_get_enabled_speculative_impls(const std::vector<enum common_speculative_type>)
to figure out which implementations the user has enabled
* introduce common_speculative_type_from_names(const std::vector<std::string> & names)
to parse the already user provided spec types
* all speculators run sequentially, best one wins (we verify its drafted tokens)
* maximize expected accepted tokens for current round by calculating the
product between the probability of accepting current token (n_acc_tokens / n_gen_drafts)
and the draft's length
---------
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
This commit updates the command line arguments to use the correct names
and values which are now required.
The motivation for this change is that currently running the example
command as is will generate the following errors:
```console
error while handling argument "--color": error: unknown value for --color: '--sampling-seq'
usage:
-co, --color [on|off|auto] Colorize output to distinguish prompt and user input from generations
('on', 'off', or 'auto', default: 'auto')
'auto' enables colors when output is to a terminal
error while handling argument "-fa": error: unknown value for --flash-attn: '--temp'
usage:
-fa, --flash-attn [on|off|auto] set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
(env: LLAMA_ARG_FLASH_ATTN)
error while handling argument "--draft-max": the argument has been removed. use --spec-draft-n-max or --spec-ngram-mod-n-max
usage:
--draft, --draft-n, --draft-max N the argument has been removed. use --spec-draft-n-max or
--spec-ngram-mod-n-max
(env: LLAMA_ARG_DRAFT_MAX)
error while handling argument "--draft-min": the argument has been removed. use --spec-draft-n-min or --spec-ngram-mod-n-min
usage:
--draft-min, --draft-n-min N the argument has been removed. use --spec-draft-n-min or
--spec-ngram-mod-n-min
(env: LLAMA_ARG_DRAFT_MIN)
```
* convert : add image break token fallback
This commit adds a image_break_token_id fallback for mistral where the
config contains a image_break_token_id of -1:
```console
"vision_encoder": {
"image_token_id": 10,
"image_break_token_id": -1,
...
```
But the tokenizer.json has this token:
```console
115 "id": 12,
116 "content": "[IMG_BREAK]",
117 "single_word": false,
118 "lstrip": false,
119 "rstrip": false,
120 "normalized": false,
121 "special": true
122 },
```
If we look in convert_hf_to_gguf.py we have:
```python
elif self.is_mistral_format:
# hparams is already vision config here so norm_eps is only defined in global_config.
self.hparams["norm_eps"] = self.global_config.get("norm_eps", None)
assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json"
if self.use_break_tok:
self.img_break_tok_id = self.find_vparam(["image_break_token_id"])
```
The motivation for this is that currently converting this models
results in the following error:
```console
load_hparams: model size: 5131.60 MiB
load_hparams: metadata size: 0.15 MiB
clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break
mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf
Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf
```
With this fallback the model loads successfully.
Resolves: https://github.com/ggml-org/llama.cpp/issues/22901
* Revert "convert : add image break token fallback"
This reverts commit 292e40cfdf.
* convert : add image break token fallback
This commit adds a image_break_token_id fallback for mistral where the
config contains a image_break_token_id of -1:
```console
"vision_encoder": {
"image_token_id": 10,
"image_break_token_id": -1,
...
```
But the tokenizer.json has this token:
```console
115 "id": 12,
116 "content": "[IMG_BREAK]",
117 "single_word": false,
118 "lstrip": false,
119 "rstrip": false,
120 "normalized": false,
121 "special": true
122 },
```
If we look in convert_hf_to_gguf.py we have:
```python
elif self.is_mistral_format:
# hparams is already vision config here so norm_eps is only defined in global_config.
self.hparams["norm_eps"] = self.global_config.get("norm_eps", None)
assert self.hparams["norm_eps"] is not None, "norm_eps not found in params.json"
if self.use_break_tok:
self.img_break_tok_id = self.find_vparam(["image_break_token_id"])
```
The motivation for this is that currently converting this models
results in the following error:
```console
load_hparams: model size: 5131.60 MiB
load_hparams: metadata size: 0.15 MiB
clip_init: failed to load model 'models/mmproj-Mistral-Medium-3.5-128B.gguf': operator(): unable to find tensor v.token_embd.img_break
mtmd_init_from_file: error: Failed to load CLIP model from models/mmproj-Mistral-Medium-3.5-128B.gguf
Failed to load vision model from models/mmproj-Mistral-Medium-3.5-128B.gguf
```
With this fallback the model loads successfully.
Co-authored-by: Pascal <admin@serveurperso.com>
Resolves: https://github.com/ggml-org/llama.cpp/issues/22901
* convert : allow zero value for img_break_tok_id
* ggml-cuda: add internal AllReduce provider for tensor parallelism
Introduces a NCCL-free AllReduce implementation for LLAMA_SPLIT_MODE_TENSOR
using a single-phase CUDA kernel that pipelines D2H copy, cross-GPU
handshake via pinned-memory volatile flags, and the reduction in one
kernel launch per GPU.
New files:
- ggml/src/ggml-cuda/comm.cuh — ggml_cuda_allreduce_provider enum
- ggml/src/ggml-cuda/allreduce.cuh — pipeline API declarations
- ggml/src/ggml-cuda/allreduce.cu — kernel + pipeline init/dispatch
ggml-cuda.cu changes:
- ggml_backend_cuda_comm_context gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* llama-bench: add --allreduce flag to select AllReduce provider
Adds --allreduce <auto|nccl|internal> to llama-bench (and via the shared
field pattern, consistent with other multi-value flags). Useful for
isolating hangs or regressions in tensor-parallel mode: pass --allreduce nccl
to force NCCL and bypass the internal provider.
Also fixes ggml_cuda_select_allreduce_provider() to treat an empty
GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when
llama-bench sets it to "" for the "auto" case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xt gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* llama-bench: rename --allreduce to --reduction-provider / -rp
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
via the shared
field pattern, consistent with other multi-value flags). Useful for
isolating hangs or regressions in tensor-parallel mode: pass --allreduce nccl
to force NCCL and bypass the internal provider.
Also fixes ggml_cuda_select_allreduce_provider() to treat an empty
GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when
llama-bench sets it to "" for the "auto" case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xt gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* llama-bench: pass WARN/ERROR log messages through in non-verbose mode
The null log callback was silently dropping all messages. WARN and ERROR
should always be visible since they indicate legitimate issues (e.g. a
requested reduction provider not being available).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vider.
Also fixes ggml_cuda_select_allreduce_provider() to treat an empty
GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when
llama-bench sets it to "" for the "auto" case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xt gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* cmake: improve NCCL detection for source-tree builds, add static/dynamic switch
FindNCCL.cmake now searches the cmake source-build layout used by the Windows
NCCL port (cmake/lib/Release for static, cmake/src/Release for dynamic import
lib) and also checks src/include for the generated nccl.h header.
New option GGML_CUDA_NCCL_STATIC (default OFF) selects static vs dynamic
linking and controls which paths and library names are searched.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
for the "auto" case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
xt gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: add AllReduce hang watchdog (GGML_CUDA_AR_WATCHDOG)
When compiled with -DGGML_CUDA_AR_WATCHDOG=ON, uses a debug kernel
variant that writes per-GPU spin diagnostics to pinned host memory.
A host-side blocking poll (cudaEventQuery + volatile reads) detects
hangs and logs WARN with the last observed arrival counters and spin
counts, controlled by GGML_CUDA_AR_WATCHDOG (ms timeout) and
GGML_CUDA_AR_MAX_SPIN (kernel bailout) env vars at runtime.
Zero overhead on the production path — all debug code is behind #ifdef.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: fix intermittent AllReduce hang on Blackwell PCIe
Add __threadfence_system() before the arrival signal write in
signal_set to ensure D2H data is globally visible before the peer
observes the arrival flag. Without this fence, the peer could enter
Phase 3 host reads before the data had fully landed, causing an
intermittent deadlock on RTX 5090 (Blackwell, PCIe-only).
Also redesign the watchdog from a blocking dispatch-thread poll to a
non-blocking background thread, eliminating the ~20ms per-slot
latency the old design added.
Verified: 30/30 soak test runs clean at ~50 t/s (previously ~1-in-15
hang rate).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: fix watchdog shutdown ordering and pipeline_free drain
- Stop watchdog thread BEFORE destroying GPU resources (events, streams)
to prevent polling destroyed handles → spurious "busy" readings
- Add cudaStreamSynchronize in pipeline_free to drain in-flight kernels
before freeing pinned host buffers they may still be reading
- Sleep-first watchdog polling: no +0ms noise, only logs when a kernel
is genuinely stuck past the poll interval
- Check wdog_stop in both outer and inner loops so join() returns
promptly instead of draining the entire queue
- Add Phase 3 breadcrumbs to debug[3] for hang localization
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
CPU reduce for unsupported sizes or GPU counts (> 2)
Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: replace event-based watchdog with per-GPU ring buffer
Completely rework the GGML_CUDA_AR_WATCHDOG system:
- Replace the shared debug_buf + event-polling + queue design with
per-GPU ring buffers in pinned host memory
- Kernel writes a debug record only on spin-limit bailout: claims a
ring slot via atomicAdd (single-GPU host atomics work on RTX 5090),
writes fields, fences, sets completion flag, then all threads exit
- Watchdog thread simply polls ring head counters every 1ms and prints
any new complete records — no CUDA event queries, no mutex, no queue
- Zero overhead on the dispatch path (no queue posting, no memset)
- Watchdog shutdown returns within ~1ms (atomic bool, no drain)
- On bailout the kernel skips Phase 3 entirely and exits cleanly
Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: normalize line endings to LF (undo Windows CRLF conversion)
Five files were inadvertently converted to CRLF by the Windows
development environment, causing every line to show as changed in
diffs against master.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
imit bailout: claims a
ring slot via atomicAdd (single-GPU host atomics work on RTX 5090),
writes fields, fences, sets completion flag, then all threads exit
- Watchdog thread simply polls ring head counters every 1ms and prints
any new complete records — no CUDA event queries, no mutex, no queue
- Zero overhead on the dispatch path (no queue posting, no memset)
- Watchdog shutdown returns within ~1ms (atomic bool, no drain)
- On bailout the kernel skips Phase 3 entirely and exits cleanly
Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* .gitattributes: force LF line endings to prevent Windows CRLF conversion
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
elopment environment, causing every line to show as changed in
diffs against master.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
imit bailout: claims a
ring slot via atomicAdd (single-GPU host atomics work on RTX 5090),
writes fields, fences, sets completion flag, then all threads exit
- Watchdog thread simply polls ring head counters every 1ms and prints
any new complete records — no CUDA event queries, no mutex, no queue
- Zero overhead on the dispatch path (no queue posting, no memset)
- Watchdog shutdown returns within ~1ms (atomic bool, no drain)
- On bailout the kernel skips Phase 3 entirely and exits cleanly
Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml-cuda: move GGML_CUDA_AR_WATCHDOG from CMake option to local define
The watchdog is development-only; a global CMake option is overkill.
Move the toggle to a #define at the top of allreduce.cu (set to 0 by
default) and remove the option from ggml/CMakeLists.txt and the CUDA
CMakeLists.txt add_compile_definitions block.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fences, sets completion flag, then all threads exit
- Watchdog thread simply polls ring head counters every 1ms and prints
any new complete records — no CUDA event queries, no mutex, no queue
- Zero overhead on the dispatch path (no queue posting, no memset)
- Watchdog shutdown returns within ~1ms (atomic bool, no drain)
- On bailout the kernel skips Phase 3 entirely and exits cleanly
Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P32, tensors <= 256 KB. Notes in NOTES-allreduce.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* unify kernel debug paths
* use __threadfence_system explicitly (not in ggml_cuda_ar_signal_set)
* preferentially use internal reduction for <=2 GPUs
* templatize the main kernel to support fp16/bf16
* restore llama-bench.cpp changes
* revert CMakeLists changes
* remove notes from repo
* remove dead warmup code
* fix comments
* improve reduction provider fallback code
* add messages for allreduce fallback
* rework reduction provider init to not call ncclCommInitAll if using the internal provider
* fix case where a given tensor has not been computed
* add chunked mode to the kernel for unlimited vector size
* rework a few checks/fallbacks
* various small cleanups
* allow disabling CUDA reductions completely (falling back to the non-CUDA butterfly mode)
* simplify reduction provider selection
* minor simplifications
* more cleanups/fixes
* prototype alternate path for large reductions
* chunked version of large reduction path
* use bf16 for large reductions
* experimental reduction using cudaMemcpyPeerAsync (slightly slower)
* revert experimental change
* add combined conversion/reduction kernel
* add bf16 wire format for single kernel mode
* experimental on-stream small reduction kernel
* double buffer arrival slots, use token (incrementing) method
* double buffer host_buf for small reductions
* put in waits for use of host_mem in large reduction case (prevents stomping on in-use memory
* remove watchdog code
* various cleanups / dead code removal
* fix fp16 mode
* fix some comments/logging statements
* use increasing token scheme for arrival signals
* add top-level comment to allreduce.cu
* improve top-level comment in allreduce.cu
* fix comments in ggml_cuda_ar_kernel
* improve event handling for hostmem buffer usage tracking
* change ev_pool to fixed 2D array
* add chunked memcpy fallback for extra-large reductions (>32 MB)
* change thresholds for copy-engine path and bf16 demotion
* multi-block kernel test
* more fine-tuning for chukn-size, etc.
* various fixes for PR review
* more PR fixes
* fix semantics of all host mappings
* require ampere+
* small cleanups
* properly use host pointer for src/dst in cudaMemcpy calls
* allreduce: lazy-init the internal pipeline on first use
A config that lives entirely on NCCL never needs the chunked-kernel
pipeline (host_buf, host_large, dev_tmp, streams, events, arrival ring).
Defer pipeline creation to the first try_allreduce_internal call using the
same std::call_once pattern as ensure_nccl, so those resources stay
unallocated when only NCCL is in use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: assert n_backends == 2 instead of soft-fallback
ar_pipeline_init already requires n_devices == 2 and bails before any AR can
get here, so by the time we reach try_allreduce_internal we know we have
exactly two backends. Replace the runtime-debug-log fallback with a hard
assert.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NCCL is in use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* rework reduction provider selection. internal/nccl is OS dependent; most fallbacks are removed
* remove unneeded Turing arch check (llama.cpp doesn't even compile pre-Turing anyway)
* allreduce: ASCII-only comments and ggml_cuda_cast for value conversions
Replace non-ASCII characters in comments (em dashes, right arrows) with
ASCII equivalents (--, ->) so the source stays in the ggml/upstream norm.
In the kernel-side code, replace static_cast<Twire>/static_cast<Tdst>
with ggml_cuda_cast<...> so the BF16 conversions go through the fast
__float2bfloat16 / __bfloat162float intrinsics from convert.cuh. Pure
pointer and integer casts stay as static_cast.
Also drops two stray garbage tokens that snuck in from earlier merges
(a duplicated 'return ok; }' tail in allreduce.cu and a leftover '_reg)'
fragment in ggml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: use ggml_cuda_memcpy_1 for the chunked-kernel vector copies
The chunked kernel's two 16-byte register<->host transfers (Phase 1 store
and Phase 3 load) used reinterpret_cast<float4 *> on both sides. Replace
with ggml_cuda_memcpy_1<sizeof(wire)>, which is the canonical helper for
this pattern and emits the same int4 LD/ST under the hood.
Conformance passes; 5x reruns of 70b internal pp512 show 1832-1836 t/s,
matching the prior matrix value of 1831 t/s -- no perf change as expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ok; }' tail in allreduce.cu and a leftover '_reg)'
fragment in ggml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: assert cuda_ctx->device matches the pipeline's device
Both ggml_cuda_ar_pipeline and ggml_backend_cuda_context carry the device
they were created for; if they ever disagree, every cuda call that follows
runs on the wrong device. Add GGML_ASSERT at each cuda_ctx retrieval site
in the AR path so the misuse fails fast rather than silently corrupting.
Also: rename __nv_bfloat16 -> nv_bfloat16 (typedef alias) for consistency
with the rest of the file, and tighten one cudaGetLastError check to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: expand one-liner for loops to braced bodies
Code-style preference -- match the rest of the file by writing every for
loop with the body on its own braced line. Three sites in the copy-engine
typed dispatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
in the AR path so the misuse fails fast rather than silently corrupting.
Also: rename __nv_bfloat16 -> nv_bfloat16 (typedef alias) for consistency
with the rest of the file, and tighten one cudaGetLastError check to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: rename template parameters Tdst/Twire/Tsrc -> T_dst/T_wire/T_src
Code-style preference per PR review -- T_dst/T_wire/T_src is more
consistent with surrounding code. Whole-word rename across all 58 sites
in allreduce.cu (kernel definitions, internal uses, and comment text).
Realigned the parameter columns in three function signatures whose
T_src/T_dst lines shifted by 1 char relative to their non-templated
neighbors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: drop hyphen in 'chunked-kernel' across comments
Per PR review feedback -- 'chunked kernel' (no hyphen) reads more naturally
in running prose, especially for ESL readers. Pure comment-only change;
all 10 occurrences in allreduce.cu updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
three function signatures whose
T_src/T_dst lines shifted by 1 char relative to their non-templated
neighbors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: use ggml_cuda_get_max_cpy_bytes() instead of hardcoded 16
The chunked kernel hardcoded a 16-byte vector unit; replace with the
ggml_cuda_get_max_cpy_bytes() helper that fattn-common.cuh uses for the
same purpose, so ELEMS_PER_VEC self-adjusts to the arch's widest
single-instruction copy.
Perf-neutral on supported targets (Volta+ returns 16).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hbors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
to fire
only after the to_bf16 call that can actually fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gml-cuda.cu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ggml-cuda: PR review fixes -- annotate #endif, fix stale comment, assert nbytes alignment
Three separate but minor changes from PR #22299 review feedback:
1. Annotate the five GGML_USE_NCCL #endif lines with the matching condition
so the pairing is visible without scrolling back.
2. The comment block on ggml_backend_cuda_comm_context claimed NCCL is
lazy-initialised; that was true at one point but the dispatch refactor
(727b141c0) made both NCCL and the internal pipeline eager. Rewrite
the comment to match current behaviour.
3. Assert in ggml_backend_cuda_comm_allreduce_internal that the tensor's
byte size is a 16-byte multiple. The chunked-kernel issues full-width
vector loads/stores, so this is a precondition; tensor-parallel splits
of hidden-dim-multiples satisfy it trivially, but a hard assert turns
any caller-side bug into a clear failure rather than UB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
device's new AR
records its ev.ker -- otherwise the second device's wait sees the first
device's just-recorded event (the in-flight new AR) and creates a circular
dependency with the in-kernel peer signal. Two-pass dispatch (all waits,
then all launches) avoids this.
Bump POOL_SIZE 2 -> 8 (small memory cost, more breathing room for the
GPU's view of the event chain) and add a runtime env override for the
hybrid kernel chunk size (GGML_CUDA_AR_HYBRID_CHUNK_BYTES) for tuning.
One-shot stderr diagnostic at first AR prints the chosen path + sizing.
Result on 2x RTX 5090 Linux, 70b ub_sweep:
ub=64 (1 MB AR): 913 -> 1036 t/s (+13.5% vs old, +1.8% vs NCCL)
ub=128 (2 MB AR): 1056 -> 1181 (+11.9%, +3.7% vs NCCL)
ub=256 (4 MB AR): 1212 -> 1424 (+17.5%, +3.5% vs NCCL)
Internal now beats NCCL at every size (+1.8% to +15.6%), recovering all
ground in the 1-4 MB regime that was previously a 10-12% loss.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* simplify the init logic
* address some other PR requests
* ggml-cuda: stub internal AllReduce on HIP/MUSA, drop pre-Ampere mention, gate NCCL fallback warning on !HIP
The internal AllReduce relies on cudaHostAllocPortable/Mapped,
cudaHostGetDevicePointer, and __nanosleep -- none of which the HIP or
MUSA shims expose -- so wrap the implementation in
!defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) and provide
nullptr/no-op/false stubs in the #else branch. The dispatcher already
treats a null pipeline as init failure and silently falls back to the
meta backend's generic AllReduce, so HIP/MUSA builds compile clean and
behave correctly without further call-site changes.
PR review follow-ups:
- drop "or pre-Ampere?" from the internal-init failure warning -- the
kernel doesn't require Ampere or newer.
- guard the "NCCL not compiled in" fallback warning behind
!defined(GGML_USE_HIP); the suggestion to install NCCL only makes
sense on NVIDIA builds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hind, now +6-8% ahead at ub=1024-4096.
Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: guard __nanosleep on Volta+ and reject pre-Volta devices at init
__nanosleep is the only Volta-specific intrinsic in the kernel; wrap it
in #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA / NO_DEVICE_CODE so the file
still compiles cleanly when targeting older arches (the dispatcher's
init check below ensures the kernel is never actually launched on
pre-Volta).
Add a per-device compute-capability check in pipeline_init that returns
nullptr if any device is below sm70. The dispatcher already treats
nullptr as init failure and silently falls back to the meta backend's
generic AllReduce.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rom the internal-init failure warning -- the
kernel doesn't require Ampere or newer.
- guard the "NCCL not compiled in" fallback warning behind
!defined(GGML_USE_HIP); the suggestion to install NCCL only makes
sense on NVIDIA builds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hind, now +6-8% ahead at ub=1024-4096.
Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* allreduce: fix CI -Werror warnings (sign-compare, format, restrict alias, maybe-uninitialized)
The CUDA CI builds with -Werror -Wsign-compare -Wformat -Wrestrict
-Wmaybe-uninitialized. Address each:
- n_devices is size_t; change `int i; i < n_devices` to size_t in the
three init loops, and the matching GGML_LOG_INFO format from %d to %zu.
- ggml_cuda_ar_kernel was launched with sendbuf == recvbuf (in-place
reduction), so the __restrict__ qualifiers on those parameters were
technically UB. Drop __restrict__ from sendbuf and recvbuf; an A/B
sweep showed <0.6% perf delta (within noise) on Linux.
- The buf/src/dst pointer arrays in ggml_cuda_ar_allreduce and the
per-iteration arrays in ggml_cuda_ar_allreduce_copy_outer were
declared with size GGML_CUDA_MAX_DEVICES but the loop only writes
indices [0, n_devices); zero-initialise so the compiler sees the
tail elements as defined.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
now +6-8% ahead at ub=1024-4096.
Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ggml-cuda: drop unused-function warning by guarding try_allreduce_nccl behind GGML_USE_NCCL
The only call site (in init_nccl) is already inside #ifdef GGML_USE_NCCL,
so the function is unreferenced in non-NCCL builds and trips
nvcc's -Werror=unused-function check. Move the guard from inside the
function body to around the entire definition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce
reduction), so the __restrict__ qualifiers on those parameters were
technically UB. Drop __restrict__ from sendbuf and recvbuf; an A/B
sweep showed <0.6% perf delta (within noise) on Linux.
- The buf/src/dst pointer arrays in ggml_cuda_ar_allreduce and the
per-iteration arrays in ggml_cuda_ar_allreduce_copy_outer were
declared with size GGML_CUDA_MAX_DEVICES but the loop only writes
indices [0, n_devices); zero-initialise so the compiler sees the
tail elements as defined.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
now +6-8% ahead at ub=1024-4096.
Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* SYCL: reduce allocation overhead during flash attention
* tidy up whitespace
* add a note about the flag
* move ggml_sycl_fattn_* into fattn-buffers.hpp
* refactor implementation into fattn-buffers.cpp
* move new_fattn_kv_buffers back into ggml-sycl.cpp
Add GGML_TYPE_BF16 to the SYCL backend's GET_ROWS operation, both in
supports_op and in the kernel dispatch. This fixes a performance
regression where models using BF16 embedding tensors (e.g., Gemma4's
per_layer_token_embd.weight) fall back to CPU for the GET_ROWS op,
causing a full GPU-to-CPU tensor transfer every token.
The fix reuses the existing get_rows_sycl_float template with
sycl::ext::oneapi::bfloat16, matching the pattern already used for
sycl::half (F16) and float (F32).
2026-05-09 08:50:24 +03:00
Intel AI Get-to Market Customer Success and Solutions
* sycl: Battlemage AOT build via spir64_gen + MMQ subgroup annotations
Signed-off-by: Chun Tao <chun.tao@intel.com>
* Remove unneeded/unnecessary comments and annotations
The MMQ subgroup annotations added are on functions gated behind
ggml_sycl_supports_mmq(). Revisit the need for these annotations
when that function changes.
---------
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
* mimo-v2.5: add flash attention mma/tiles for for d_kq=192 d_v=128
* mimo-v2.5: follow (256, 256) fattn templates
* mimo-v2.5: cleanup comments
* mimo-v2.5: further comment cleanup
* mimo-v2.5: address PR feedback
fix GQA handling
check for other dangling 320/576 carveouts and mirror them for 192
Add to backend ops test so new paths are covered
Implement the Gated Delta Net recurrence on HVX with:
- 4-row fused kernels for PP (prompt processing) path
- 8-row fused kernels for TG (token generation) path, reducing
K/Q/gate vector reload overhead by 2x
- Separate PP/TG thread functions for I-cache isolation
- VTCM state scratchpad with DMA in/out for TG single-cycle access
- Vectorized gate exp via hvx_exp_f32
2026-05-08 17:12:04 -07:00
Intel AI Get-to Market Customer Success and Solutions
* L2_NORM Updates
* Addressed PR Comments
* ggml-hexagon: add L2_NORM HVX kernel for Hexagon backend
* hex-unary: remove supported_unary_nc since the outer loop is the same for all unary ops
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* server: support Vertex AI compatible API
* a bit safer
* support other AIP_* env var
* various fixes
* if AIP_MODE is unset, do nothing
* fix test case
* fix windows build
* cuda: fuse snake activation (mul, sin, sqr, mul, add)
Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive 5 op decomposition emitted by audio
decoders (BigVGAN, Vocos) for snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.
Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.
* cuda: address review feedback from @am17an
Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.
* cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an
* Update tests/test-backend-ops.cpp
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cuda: snake fusion check add->type matches x->type
Address review feedback from @am17an
* cuda: snake fusion check add->type matches x->type
Moved for readability (equivalent)
Address review feedback from @am17an
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Q4_0 MoE CLC pass sanity check
* release program
* opencl: fix whitespace
* opencl: remove unused cl_program
* opencl: break #if block to make it more clear
* opencl: adjust format
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
* convert : fix RuntimeError when stripping FP8 KV-cache scales
In ModelBase._generate_nvfp4_tensors the final cleanup loop iterates
self.model_tensors.keys() and calls del on the same dict, which raises
RuntimeError: dictionary changed size during iteration when a ModelOpt
NVFP4 model also has FP8 KV-cache scales (e.g. mmangkad/Qwen3.6-35B-A3B-NVFP4
and any modelopt config with kv_cache_quant_algo: FP8).
Wrap the keys view in list() so the deletions happen on a snapshot.
* re-add another accidentally removed list
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched
* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched
* CUDA: add cublasSgemmStridedBatched mapping for HIP and MUSA backends
* webui: add LLM title generation option
* webui: use chat_template_kwargs for title gen + fix conversation check
* webui: capture firstUserMessage before async streamChatCompletion to fix race condition
* webui: extract LLM title generation into separate method
* webui: use constants and ChatService for LLM generated titles
* webui: rebuild static output
* webui: add LLM title generation setting to new settings location
* webui: use sendMessage in generateTitle
* webui: rebuild static output
* webui: fix formatting
* webui: configurable title prompt, remove think tag regexes, fix TS error
* webui: group title constants into TITLE object, use TruncatedText for CSS truncation and fix race condition
* webui: rebuild static output
* Write a readme on Multi-GPU usage in llama.cpp
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
The error:
./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad
substitution
was thrown whenever the user used this command:
./examples/sycl/test.sh -mg 0
Fix is to get rid of a dollar sign.
* common: do not fit to unknown device memory
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: preserve host fallback for non-GPU fit devices
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* common: keep unknown GPU fit memory at zero
Signed-off-by: Florian Reinle <f.reinle@otec.de>
---------
Signed-off-by: Florian Reinle <f.reinle@otec.de>
* feat: migrate to PEP 621 and add uv support
* fix: remove upper bound on protobuf
* remove poetry.lock and uv.lock
* fix/add torch dependency version and markers
* fix dev-dependency deprecation warning
* gguf-py : update python version requirement to 3.10
---------
Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* convert : ignore non-language tensors for Gemma4Model
This commit adds a check to make sure only text language tensors are
handled in filter_tensors.
The motivation is that currently when trying to convert a Gemma4 model
the following error occurs:
```console
(venv) $ ./convert-gemma.sh
INFO:hf-to-gguf:Loading model: gemma-4-E2B-it
INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight, torch.float32 --> F32, shape = {256}
Traceback (most recent call last):
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module>
main()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main
model_instance.write()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write
self.prepare_tensors()
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors
for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors
yield from super().modify_tensors(data_torch, name, bid)
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors
yield from super().modify_tensors(data_torch, name, bid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors
new_name = self.map_tensor_name(name)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name
raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight'
```
* add forgotten embed_vision and embed_audio
* improve
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* hex-mm: process m-tail rows on HMX instead of HVX
* hmx-mm: unroll and optimize padded activation loop
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3719 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3726 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3733 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3740 | [](common_params & /*params*/, int /*value*/) {
| ^
common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
3747 | [](common_params & /*params*/, int /*value*/) {
| ^
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Previously, unknown tool names passed via --tools were silently ignored.
Now the server validates each tool name at startup and exits with an
error if an unrecognized tool is specified, listing the available tools.
Assisted-by: llama.cpp:local pi
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.
* Trim whitespace on apply instead
* docs : update speculative decoding parameters after refactor (#22397)
Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397:
- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-*- parameters
- Update all example commands
Assisted-by: llama.cpp:local pi
* pi : add rule to use gh CLI for GitHub resources
Assisted-by: llama.cpp:local pi
* docs : run llama-gen-docs
* arg : fix typo
* shader(norm): add layer norm ops
* shader(norm): stablize floating point computation with Kahan summation and handle mixed types
* shader(norm): remove the non-contiguous strides
* shader(norm): use the original implementation rather than the kahan summation
Llama-architecture q_proj/k_proj weights need an axis-0 row permutation
to match GGML's RoPE convention. The BF16 path applies this in
LlamaModel.modify_tensors via LlamaModel.permute, but the NVFP4 path
bypasses modify_tensors and writes weights directly through
ModelBase._repack_nvfp4. Without the permutation, attention heads end
up scrambled at inference and the model produces gibberish.
This change overrides _repack_nvfp4 on LlamaModel and applies the same
permutation to both the nibble-packed weight and the per-block scale
before delegating to ModelBase._repack_nvfp4 via super(). Reuses the
existing LlamaModel.permute static helper and respects the existing
undo_permute flag, so subclasses (Mistral, Granite, Llama4, etc.)
inherit the fix automatically.
Verified on TinyLlama-1.1B reproducer: perplexity drops from 4419
(gibberish) to 43.9, matching the BF16-dequantized baseline (44.0).
Also verified end-to-end on ALIA-40b-instruct-2601 (BSC, Llama
architecture) with multilingual generation in Spanish/Catalan/Basque/
Galician all coherent with the fix applied.
Co-authored-by: Chema <chema@montevive.ai>
* hmx: extract shared interleave headers and unify matmul batched
* hmx: add HMX-accelerated flash attention for prefill
* hmx: replace asm wrappers with Q6_ intrinsics in hmx-utils.h
Switches three single-instruction helpers from inline asm to the matching
Q6_ intrinsics, matching the style established by aizip f8737609a and used
by the upstream PR #21554 hmx-matmul-ops.c rewrite:
hmx_set_output_scales asm "bias=mxmem2" -> Q6_bias_mxmem2_A
hmx_load_tile_pair_fp16 asm packet -> Q6_activation_hf_mxmem_RR
+ Q6_weight_hf_mxmem_RR
hmx_consume_accumulator_fp16 asm "mxmem=acc" -> Q6_mxmem_AR_after_hf
hmx_load_tiles_fp16 stays on inline asm: it uses ":deep" activation
streaming, and the mixed Q6_activation_hf_mxmem_RR_deep + non-deep
Q6_weight_hf_mxmem_RR pair fails the HMX backend constraint check
("activate weight pair (1) exceeds limit (1)"). The asm bundle keeps
both halves in one VLIW packet and avoids the diagnostic.
Functionally equivalent — same instructions emitted; the Q6_ intrinsics
just give the compiler more visibility for scheduling.
* hmx: drop the duplicate interleave_fp16_weight_chunk_to_tiles
* hmx: apply upstream optimization to hmx-flash-attn-ops.c
apply restrict, __builtin_assume, and pointer accumulation to the three HMX workers (qk_dot, o_update, o_norm) and the matching inline HMX loops in op_hmx_flash_attn_ext.
* hmx: unify interleave helper
* hmx: multi-thread Q load / O store and enable prefill FA dispatch
Extract inline Q-load and O-store loops into worker_pool-parallel helpers
(fa_phase_q_load, fa_phase_o_store) so HVX threads split the F32↔F16
conversion work across row ranges. Also relax the softmax threading
gate from n_row_vec_cnt >= n_threads to >= 2, which was unnecessarily
forcing single-thread fallback when n_rows_g < 512.
On the dispatch side, remove the ne[2] != 1 guard that blocked multi-head
(prefill) FA from reaching the HTP backend — GQA is already handled
internally by both the HMX and HVX flash-attention paths.
* hmx: relax matmul pipeline gate to cover k > n shapes (e.g. FFN_down)
* hmx: optimize FA softmax mask phase (no-ALiBi fast path + GQA dedup)
* hmx: Add an asm memory clobber at the phase boundary to prevent reorder bug
* [experimental]: fp16 softmax (EXP2_HF) to accelerate fa
Bake log2(e) into qk_scale and use hvx_exp2_hf directly for P and m_diff
(base-2 consistent, matches htp-ops-lib). ~22 ALU ops for 64 lanes vs
~44 for the F32 round-trip path.
* hmx flash-attn: refine cost model coefficients based on profiling data
* hmx flash-attn: replace asm clobber with targeted volatile reads on vtcm_d_tiles
* hmx flash-attn: fix prefill correctness (dst indexing, softmax reduce, V stride)
* hmx flash-attn: fix p_tiles dual-tile OOB race; enable MT + pipeline
* hmx flash-attn: preserve additive mask bias in no-ALiBi fast path
The no-ALiBi fast path (max_bias==0) was skipping mask add entirely on
the assumption that mask values are only {0, -inf}. This is wrong when
the mask carries additive positional bias — those terms were silently
dropped. Keep the slope-mul skip (slope≡1.0) but add mask back so the
bias survives; vmux still clamps below -16 to -inf.
Also add HMX FA coverage to test-backend-ops: prefill shapes (nb=64,
nb=32) × {mask on/off} × {ALiBi on/off} × {softcap on/off}, F16 KV,
hs ∈ {64, 128}.
* hmx: fix softcap+EXP2_HF interaction, tighten matmul pipeline gate, add FA tests
- flash-attn: when EXP2_HF is on AND logit_softcap is active, fold
log2(e) into the post-tanh multiplier (v_cap) instead of pre-baking
it into qk_scale. Pre-baking shifted the tanh knee from x≈c to
x≈c/log2(e) and produced numerically wrong softcapped outputs
whenever both knobs were enabled.
- flash-attn softmax (fa_softmax_thread): replace the union+memcpy
scalar extract pattern with HVX vmux-based per-row accumulators on
rowmax/rowsum. Add hvx_vec_get_f16 helper in hvx-base.h. Functional
parity, less scalar code, clearer hf/qf16 lane-format contract.
- matmul (hmx_mat_mul_permuted_qk_0_d16a32): pick pipeline vs sequential
layout based on whether the chunker actually yields >=2 n-chunks,
instead of the static (m>=128 && n>=256) gate. Avoids paying for
output double-buffer + worker dispatch when there is no HMX/HVX
overlap to gain (e.g. shapes that collapse to one n-chunk).
- tests: add HMX flash-attention coverage over the
{mask, ALiBi (max_bias), logit_softcap} cross-product for the prefill
path — head_dim 64/128, GQA 4×4, kv=512/nb=64 plus a kv=113/nb=32
non-aligned case.
* [Help Wanted]: refactor D matrix computation into separate function for clarity and maintainability
* format code
* hexagon: looks like -O3 is causing issues with the large code base, switch to -O2 and -flto instead
* hexagon: use hex_ prefix for swap_ptr
* hexagon: move vtcm_seq_alloc into vtcm-utils.h
More vtcm allocator updates are coming so it makes sense to start the separate hdr for it.
* hmx-utils: add hmx_prefix for layout converters
* hmx-mm: move main hmx_mm functions to the end, remove unused fwd decls, etc
* hmx-mm: remove unused qweight_fetch_task_state_t and minor alignment fixes
* hmx-fa: minor alignment fixes
* hmx-fa: move hmx_flash_atten into hmx-ops.h
* hmx-fa: remove redundant workpool pointer in the hmx_fa_ctx, plus minor alignment updates
* hmx-fa: minor alignment and simplifications
* hexagon: move FA_EXP_F16 option to hostside CMake file
* hmx-fa: use hvx_vec_splat_f16 instead of fp16_to_bits
* hmx-fa: add hvx_splat_u16/u8 and use that in the fa instead custom hvx_fill
* hmx-fa: some more alignment updates in the core fa function
* hmx-fa: keep slopes in vtcm in fp16
Saves malloc/free and removes the need for float -> fp16 downcast on every use.
* hexagon: consistent noinline usage (after static)
* hex-hmx: consistent use FARF_HIGH to enable debug output
* hmx-utils: no need for always_inline attr
* hex-hmx: consistent noinline usage (static noinline ...)
* hex-hmx: simplify init_col_scales
* hexagon: fix editorconfig errors
* hmx-mm: minor alignment fixes
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
fix#22544 (my fault!)
Credit to @Anai-Guo, ref #22559 - since that one was closed due to the
new contributor policy I am taking the liberty of re-submitting that PR
here.
* vulkan: Support asymmetric FA in coopmat2 path
There has been some recent interest/experimentation with mixed quantization
types for FA. I had originally designed the cm2 FA shader with this in mind
(because I didn't realize it wasn't supported at the time!), this change
adds the missing pieces and enables it.
Also support Q1_0 since people have been trying that out (seems crazy, but
who knows).
We should be able to do similar things in the coopmat1/scalar path, but
there's another change open against the scalar path and I don't want to
conflict.
* reorder cases
* Add mat-vec fast path of MUL_MAT_ID.
* Add shared accumulation vec logic and the other types supports.
* Add i-quant mat-mat for MUL_MAT_ID and fix some parts
* Remove n_experts from shader_lib_context.
* scripts : add wc2wt.sh - create worktree from current HEAD
Add a script to create a git worktree on a new branch from the current
HEAD. Similar to pr2wt.sh but for local development branches instead of
PRs.
Usage:
./scripts/wc2wt.sh gg/new-feature
./scripts/wc2wt.sh gg/new-feature "bash -l"
Assisted-by: llama.cpp:local pi
* cont : no need to try to delete the branch
* port #22358 PR to examples/speculative/speculative.cpp
* use vocab_[tgt,dft] instead of ctx_[tgt,dft] when logging on draft
model / target model vocabulary mismatch
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
* hexagon: allow host to set max vmem size
We use a sane default but it's helpful to allow for an override if needed.
* hexagon: add support for measuring vmem space and move pinned mmaping management to host
* hexagon: update vmem checks to use uint64
* hexagon: bump op buffers to 16 (matches max mmaps)
* hexagon: bump default vmem to 3.2GB
* hexagon: add support for autodetecting vmem space and some logging cleanup in that area
* hexagon: fix whitespace warnings
* Update scripts/snapdragon/adb/run-cli.sh
Co-authored-by: Pascal <admin@serveurperso.com>
* hex-adb: fix run-completion script
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME
When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains
inline asm for the vmadot family which requires the xsmtvdotii custom
extension.(problem can see in some blogs and make sure in K3 platform)
The current CMakeLists does not include xsmtvdotii, so any toolchain
that honours the explicit -march (tested with SpacemiT GCC 15.2) fails
at the assembler stage:
Error: unrecognized opcode `vmadot v16,v14,v0',
extension `xsmtvdotii' required
Append _xsmtvdotii to MARCH_STR when GGML_CPU_RISCV64_SPACEMIT is
enabled so the IME path can actually build with a capable toolchain.
No effect on builds that leave GGML_CPU_RISCV64_SPACEMIT off.
toolchain from https://www.spacemit.com/community/resources-download/Tools
* Update ggml/src/ggml-cpu/CMakeLists.txt
Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>
---------
Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>
* Changed to leak logger singleton to prevent hanging on Windows
* Fix comment
* Stopped using static vector
Using std::vector will cause g_col to be released before the logger thread exits, causing the logger thread to touch freed memory causing a crash
* Change so all logs are output before exit
* Added debug logging
* added more logging
* Added logging
* Explicitly free logger to avoid hanging on Win
* Reverted to leak logger instance again
* Removed debug log and fixed comment
* Fixed comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel
* Change arrays to static const in repack.cpp
---------
Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>
* ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32)
Adds MMA-f16 and tile kernel configs, dispatch logic, template instances,
and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to
ncols2=32 to support GQA ratio 32 only.
* Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32
* Apply suggestions from code review
Address review comments
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256
* Fixed bug with sinks=1, with ncols=32, there are two warp-groups created but sinks index is same(0,...,15) for both the groups hence with sinks=1, output is not matching with CPU output. Added sink_base which will be base index for each warp_group (threadIdx.y / np)
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response.
Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).
Some SPIR-V compilers (notably mesa) don't handle the current
vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well.
While reading three `u8`s from the 12-byte scale array should (at
least on some hardware) result in loading the full 12 bytes in a
single LOAD followed by whatever extraction is needed, at least
the ANV Intel driver really can't practically perform this
optimization.
`mesa`'s unsigned upper bound logic doesn't handle tracking bounds
through ternary, resulting in the `(is < 4) ? ... : is - 4` having
an infinite upper bound (as it cannot prove `is - 4` doesn't
underflow). While this could still be rectified if mesa looked at
the array bounds, it currently doesn't and `glslc` currently emits
SPIR-V that doesn't allow for this optimization anyway (though
maybe it will at some point, see
https://github.com/KhronosGroup/glslang/issues/4206).
In mul_mat_vecq we took a different approach to loading the same
fields. We read the first two bytes we needed from `scale` then
took a branch before deciding whether we needed to read a third
byte. In mesa this did, indeed, lead to a top-level branch with
conditional loads. As such these loads ended up not being
coalesced either (at least in the ANV driver) resulting in
additional instructions in our hot loop.
Instead, here, we go ahead and force loading the full 12 bytes and
extract the bits we need from the packed-u32s instead. In mul_mat
there's a few less ternaries and only one extra shift, so even on
drivers that did optimize the previous loads properly the only
material change should be pulling a few extra bytes into registers
(which on most hardware won't cost anything anyway, though
ironically on Intel it theoretically could). In mul_mat_vecq this
requires a bit of extra math and may read bytes from the u32 that
weren't needed, but it seems likely avoiding the branch is a win
on most platforms.
On Intel Xe2/mesa 26.0.4 with the optimizations from
https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162,
for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l:
* Instruction Count: 2753 -> 2688
* SEND Count: 269 -> 261
* Cycle Count: 273976 -> 266138
* Max live registers: 248 -> 246
* Non SSA regs after NIR: 381 -> 382
for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l:
* Instruction Count: 2767 -> 2702
* SEND Count: 271 -> 263
* Cycle Count: 274140 -> 268144
* Max live registers: 248 -> 246
* Non SSA regs after NIR: 381 -> 382
for shader mul_mat_vec_id_q4_k_q8_1_f32:
* Instruction Count: 1930 -> 1646
* SEND Count: 116 -> 71
* Cycle Count: 1348306 -> 843350
* Max live registers: 78 -> 84
* Non SSA regs after NIR: 300 -> 135
for shader mul_mat_vec_id_q5_k_q8_1_f32:
* Instruction Count: 2207 -> 1922
* SEND Count: 131 -> 86
* Cycle Count: 1392012 -> 1037836
* Max live registers: 90 -> 90
* Non SSA regs after NIR: 300 -> 135
for shader mul_mat_vec_q4_k_q8_1_f32:
* Instruction Count: 2029 -> 1749
* SEND Count: 111 -> 66
* Cycle Count: 1347278 -> 840118
* Max live registers: 74 -> 80
* Non SSA regs after NIR: 299 -> 134
for shader mul_mat_vec_q5_k_q8_1_f32:
* Instruction Count: 2307 -> 2022
* SEND Count: 126 -> 81
* Cycle Count: 1379820 -> 954042
* Max live registers: 86 -> 86
* Non SSA regs after NIR: 299 -> 134
On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL:
* pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%)
* pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%)
* tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%)
On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S:
* pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%)
* pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%)
* tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%)
On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with
-sm layer (note that -sm tensor improvements will naturally be
less):
* pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%)
* pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%)
* tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)
* ggml : revert to -lm linking instead of find_library
`find_library(MATH_LIBRARY m)` was introduced recently, but it breaks
CUDA compilation with GGML_STATIC. I could not find any valid use case
where we would prefer `find_library` over the standard `-lm` approach.
This commit is also meant to start a discussion if there is a valid
reason to keep `find_library(MATH_LIBRARY m)`, we should clarify what
problem it was solving and find an alternative fix that does not break
CUDA with GGML_STATIC.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : use MATH_LIBRARY only if defined
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : fix initial broken condition
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : always respect MATH_LIBRARY when defined
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
New operators:
- GGML_OP_SET: implement via aclnnInplaceCopy on target region
- GGML_OP_CUMSUM: implement via aclnnCumsum
- GGML_OP_FILL: implement via aclnnInplaceFillScalar
- GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides
- GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via
aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets
- GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve
- GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus
Optimizations:
- GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu /
aclnnGeGluV3 when applicable; fallback conditions now checked inside
each function rather than at the call site
- CROSS_ENTROPY_LOSS: replace 5-kernel sequence (LogSoftmax→Mul→
ReduceSum×2→Muls) with single aclnnSoftmaxCrossEntropyWithLogits call
- L2_NORM: fix in-place ClampMin on norm result (was clamping wrong
tensor); add eps clamping before division to avoid divide-by-zero
- PAD_REFLECT_1D: eliminate per-ne[3] loop; assert contiguity and call
ReflectionPad1d once on the full 4-D view; remove redundant nb copies
- GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor
helper into gather_batched lambda with batch loop inlined
- SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice;
refactor helper into scatter_batched lambda with batch loop inlined
- OUT_PROD: replace O(ne[3]*ne[2]*ne[1]) Ger+InplaceAdd loop with
per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast
batch dims where ne02/ne03 may differ from ne2/ne3
- backend memset_tensor: implement via aclrtMemset (was NULL)
Bug fixes:
- COUNT_EQUAL: use non-inplace EqTensor into a same-type temporary
buffer instead of InplaceEqTensor, avoiding corruption of src0
- ACL graph cache (USE_ACL_GRAPH): restore node_type and src_type[]
fields in ggml_graph_node_properties; has_matching_properties() was
missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to
incorrectly share cached graphs and produce wrong results (ERR≈679)
- graph cache op_params matching: compare full GGML_MAX_OP_PARAMS
bytes so that ops differing only in parameters are not incorrectly
replayed from cache
* This commit enables the router to forward form-data to model server.
Fixes#22044 (enabling to use the /v1/audio/transcriptions in router mode)
* * Applied the suggestion from Copilots first comment: using the non-throwing json::parse overload.
* Addressed Copilots third comment by extending the files representation to also include filename and content-type
* Addressed Copilots fourth comment by making the RNG thread_local
* Changed variable body from std::string to std::ostringstream in build_multipart_body
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127099053
* Added sanitize_field lambda in build_multipart_body for key, filename and content_type
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127104647
* explicitly checking if value/item is string before calling value/item.get<std::string>()
as requested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127111279
* Added double quote to the sanitize lambda and throw on json parse failure
---------
Co-authored-by: Ralph Paßgang <ralph@trust-it.de>
* common: refactor common/debug to move abort_on_nan into base_callback_data
Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO.
It should just be a member of the base_callback_data instead.
* cont : cleanup
* common : use pimpl in debug.h to reduce header dependencies
Move common_debug_cb_user_data's data members (std::regex,
std::vector<uint8_t>) into a private impl struct in debug.cpp.
This removes the includes of common.h and <regex> from debug.h,
reducing transitive dependencies for any translation unit that
includes the header.
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.
* Optimize Metal Tensor API usage for matmul2d
Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR.
The legacy simdgroup_matrix kernel is preserved under #else.
Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.
* cont : cleanup
* cont : cleanup
* cont : cleanup
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Change the default `ftype` in `llama_model_quantize_params` from
`LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`.
In case some external program naively uses the default quantization
params, we should probably default to a known-good type like Q8_0 rather
than Q5_1, which is rather old.
* opt arc770 for Q4_0
* add for Q4_0
* update the script
* add help script for windows
* update guide
* fix format issue
* convert from dos to unix for format issue
* fix missed -sm parameter
* switch ubuntu-latest to ubuntu-slim
* Fix the path for upload so CI doesn't fail
* Update .github/workflows/build-and-test-snapdragon.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use -slim image for key check and consistent naming for artifact dir
Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* Remove check-secret extra job
* move QDC key check for Run QDC jobs step specifically
* add a step before to check the secret for qdc jobs
---------
Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml-webgpu: add tile flash attention fallback
* ggml-webgpu: add new fields and discard usage of mnk for tile version
* ggml-webgpu: modify the vec path to discard the mnk parameter
* ggml-webgpu: enable flash attention vec and tile version for broswer
* ggml-webgpu: stagging KV for flash attention tile version
* formatting
* turn on subgroup uniformity check
* remove Q_TILE as it is always 1 for vec path
* make row_max and exp_sum to local register
* make different bindings with same underlying buffer to have the same usage flags
* move path selection into the shader library and have the host consume a single flash-attn decision object.
* turn off skip_validation and address buffer overlapping when nwg==1
* formatting
* merge binding when kv overlap
This change implements the third requested change in issue 20429.
Because defaults.sampling contains the reasoning budget token count and
the reasoning budget message, it's not necessary to assign them to
struct variables.
* hexagon: restore HTP_OPMASK_QUEUE
* hexagon: honor OPMASK_SKIP_COMPUTE in hmx-matmul
* hex-prof: restore op profiling
* hex-prof: enable PMU
* hexagon: simplify and improve op-queuing with full profiling support
Add separate profile descriptors.
* hexagon: remove opsync and rename opmask into opstage
opsync is no longer needed since the profiler is fully async now.
opmask name was confusing and opstage is more accurate.
* hexagon: refactor opbatch queue handling
* hexagon: add iface hooks for enabling profiler from the host
Also move all the PMU setup stuff out of the hex-utils since it's not inteded for normal use.
* hexagon: make profiler mode configurable
On older devices getting PMU counters is expensive so it's now optional.
* hexagon: add support for setting profiler pmu events from env
* hexagon: simplify profiler output (no need to print buffs, etc)
* hexagon: simplify pmu counter formating
* hexagon: add a simple profile post-proc tool
* hex-prof: add support for reading logs from stdin
* hexagon: document GGML_HEXAGON_PROFILE
* hex-prof: update default width for dims field
* hex-prof: fix linter warnings and errors
* Update ggml/src/ggml-hexagon/htp/htp-ops.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update scripts/snapdragon/ggml-hexagon-profile.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add the tests that we want to run on external CI
* remove extra files
* Fixes python issues, reove the deadlock on CI
* remove unecessary changes
* use override to ty.toml
* fix pre-commit and try tests with secret in external repo not upstream
* skip if key is unavailable
* Fix feedback
* switch hexagon to snapdragon
* cleanup
* fix secrets
* remove the copyrights at the top of the files
When testing claude code against llama.cpp, I noticed that only
n_past 18577 was used even when context was 60k or more. The log
in llama-server says:
```
slot update_slots: id 3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id 3 | task 10342 | new: ... ; cch= | 1c8b4;
```
I observed that the cch value changed every time. Reading about that,
the x-anthropic-billing-header system message seems to be specially
handled inside of the anthropic api. I could remove it, but there
is a meaningful string sometimes included at the end. So instead,
I just replace the changing cch checksum with fffff.
I'm treating this as an anthropic message body API detail - I think this
is the right way to do this, but by all means please correct me!
It's always 5 hexadecimal characters, but I've written the replacement
defensively in case they change the protocol.
* model-conversion : fix mmproj output file name [no ci]
This commit updates the convert-model.sh script to properly handle
mmproj output files.
The motivation for this that currently the same name as the original
model is used as the mmproj file, which causes the original model to
be overwritten and no mmproj-<model_name>.gguf to be created.
* model-conversion : use MODEL_NAME [no ci]
* gitignore: add AGENTS.local
Assisted-by: llama.cpp:local pi
Signed-off-by: Georgi Gerganov <ggerganov@gmail.com>
* gitignore: rename AGENTS.local to AGENTS.local.md
Assisted-by: llama.cpp:local pi
Signed-off-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Georgi Gerganov <ggerganov@gmail.com>
Fixes#22237 — the find_library(MATH_LIBRARY m) result was being
discarded and the target linked against the literal 'm' string.
This prevents users from overriding the math library (e.g. for AMD AOCL)
via CMake variables. Now the discovered MATH_LIBRARY is used directly.
* upgrade oneAPI to 2025.3.3
* update
* seperate SYCL CI and support release binary package for ubuntu 24
* add dependence
* remove wrong copy lines
* add missed line
* remove other task to test the release for SYCL
* rm more for test release
* fix file name
* correct the error in running
* support build for fp32/fp16
* rm ubuntu-24-sycl-fp16 for duplicated
* refactor build setting
* update guide for ubuntu 24 release package, restore the release.yml for other backend
* user docker replace to install oneAPI
* use download installation package to replace docker
* use wget to download and install oneapi, replace the apt cmd
* enable ccache for oneAPI installation
* fix format error
* enable cache for oneAPI installation
* update guide
* Update .github/workflows/release.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update .github/workflows/release.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update .github/workflows/build-sycl.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update .github/workflows/release.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* sycl : fused MoE mul_mat_vec_q for TG
Create an MMVQ kernel so ggml_sycl_mul_mat_id can consolidate
n_experts_used matmuls in a single kernel launch. The kernel
also reads expert IDs directly, removing a per-call host sync.
This is similar to the CUDA backend's ggml_cuda_mul_mat_vec_q*
paths.
All types supported in the current MMVQ are supported here as well:
Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0
It will fall back to the existing per-expert path when src0 has been rewritten
by opt_for_reorder(), and for any shape the fused path doesn't handle.
test-backend-ops passes for supported type/shape combos.
Benchmark: Qwen3-Next-35B-A3B Q4_K_M on Intel Arc B70 (SYCL0),
baseline 707c0b7a6, 16k context, -fa 0.
build/bin/llama-bench -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
-p 1024 -n 128 -d 16384 -ngl 99 -fa 0 -ub 2048 -r 2 -dev SYCL0
Before (3 runs on 707c0b7a6):
| test | run 1 | run 2 | run 3 |
| --------------- | ----------------:| ----------------:| ----------------:|
| pp1024 @ d16384 | 533.26 ± 4.87 | 535.20 ± 2.78 | 524.27 ± 3.10 |
| tg128 @ d16384 | 33.47 ± 0.02 | 33.31 ± 0.02 | 33.17 ± 0.05 |
After (3 runs on 707c0b7a6 + this patch):
| test | run 1 | run 2 | run 3 |
| --------------- | ----------------:| ----------------:| ----------------:|
| pp1024 @ d16384 | 534.06 ± 0.97 | 531.95 ± 0.02 | 520.94 ± 20.10 |
| tg128 @ d16384 | 45.85 ± 0.21 | 45.95 ± 0.45 | 46.22 ± 0.12 |
disclosure: Claude wrote it, but I reviewed and understand the implementation
(albeit my C is a little rusty).
* sycl: also support nvfp4 and mxfp4 expert types
* sycl: terser comments/nested dispatch in response to review
* sycl: more comment cleanup in mmvq.cpp/hpp
---------
Co-authored-by: Debian <aaron@openllmi.net.bots.is>
* shader(im2col): implement the im2col shader
* shader(im2col): clean the formatting issues
* shader(im2col): clean the editorconfig checker warning
* fix(shader): address the workgroup issues of im2col and conv2d
In #11362 hip graph was disabled by default as, at the time, its performance impact was negative. Due to improvements in rocm and our usage and construction of graphs this is no longer true, so lets change the default.
* Only run webgpu CI on my fork
* Implement set_tensor_async
* Implement synchronize api
* Implement event creation and deletion API
* Cleanup
* Cleanup
* Comment out jobs for local CI run
* Add webgpu only workflow
* Delete .github/workflows/build-webgpu.yml
* Cleanup
* Cleanup
* Update API with function handlers
* Run clang-format
* Replace one-shot buffer with a direct queue.WriteBuffer using the buffer context
* fused rms_norm_mul + mul
* Add GGML_WEBGPU_DISABLE_FUSION for being able to disable kernel fusion.
* Decouple num_fused_ops from webgpu_context; misc cleanup
* Fix eps handling and remove disable_fusion.
* Fix not to use c++20 initializers.
* chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs
* Fix ty errors.
* Fix flake8 err
* sycl: size mul_mat_id staging buffers by routed rows
Previously src1_contiguous/dst_contiguous in ggml_sycl_mul_mat_id were
sized to ggml_nelements(src1/dst), which over-allocates when ne12 > 1
and can fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero for
MoE models (notably with --cpu-moe). Size them by the actual number of
routed rows (ids->ne[1] * n_ids) instead.
* sycl: add bf16 mul_mat fast path via DNNL
When src0 is BF16 (commonly the case for lm_head / output.weight), the
existing f16 path is skipped because bf16 isn't covered, and the f32
fallback dequantizes the entire src0 slab to f32 in a single pool alloc
(row_diff*ne00 floats). For large-vocab models this can reach several
GB and fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero.
Add a bf16xbf16 -> f32 DNNL matmul fast path that uses the bf16 storage
in place and only materializes a small src1 bf16 conversion buffer. bf16
matmul accumulates in f32, so it's correct even when the op requests
GGML_PREC_F32 (as lm_head does).
- gemm.hpp: map bfloat16 to dnnl::memory::data_type::bf16.
- convert.{hpp,cpp}: expose ggml_get_to_bf16_sycl for f32/f16/bf16 -> bf16.
- ggml-sycl.cpp: take the bf16 path early in ggml_sycl_op_mul_mat_sycl
when DNNL and GGML_SYCL_HAS_BF16 are both available.
* mtmd, llama : add HunyuanVL vision-language model support
- add LLM_ARCH_HUNYUAN_VL with M-RoPE (XD-RoPE) support
- add PROJECTOR_TYPE_HUNYUANVL with PatchMerger vision encoder
- add HunyuanVL-specific M-RoPE position encoding for image tokens
- add GGUF conversion for HunyuanVL vision and text models
- add smoke test in tools/mtmd/tests.sh
* fix: fix HunyuanVL XD-RoPE h/w section order
* fix: Remove redundant code
* convert : fix HunyuanOCR / HunyuanVL conversion
- Tested locally: both HunyuanOCR and HunyuanVL-4B convert to GGUF
- successfully and produce correct inference output on Metal (F16 / Q8_0).
* clip : fix -Werror=misleading-indentation in bilinear resize
* fix CI: convert_hf_to_gguf type check error
- convert_hf_to_gguf.py: give HunyuanVLTextModel.__init__ an explicit `dir_model: Path` parameter so ty can infer the type for load_hparams instead of reporting `Unknown | None`.
---------
Co-authored-by: wendadawen <wendadawen@tencent.com>
This change refactors the reasoning_budget_message parameter from the
common params into the sampling parameters specifically. It also removes
the reasoning_budget common parameter and standardizes on the existing
reasoning_budget_tokens parameter in the sampling configuration.
Issue: https://github.com/ggml-org/llama.cpp/issues/20429
Original PR: https://github.com/ggml-org/llama.cpp/pull/20297
* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log
* Merge with upstream
* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants
* Update Unary wgsl EXP and EXPM1 for f16 stability
* Fix GET_ROWS IQ4_XS strcut for NaN f16 canonicalization
* Fix numerical percision for unary sqrt when working with f16
* Fix NaN canonicalization for packed integers using f16
* Update err threshold for binary div ops when using f16
* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend
* clean: uncomment existing code logs
* clean: clean the unncessary debug info
* Refactor and generalize dequant helpers
* Remove deprecated quant structs
* Refactor shader defines to reduce repetition
* Remove error override for F16 type
* fix: fix the accidential removal of the proper initialization of ctx
* clean: clean legacy and format code
* fix: did not modify tests ops
* shader(conv2d): add conv2d shader kernels and pass f32 and f16 tests
* shader(conv2d): fix the out of bounds memory access in the weight indexing
* shader(conv2d): clean unused variables and optimize the computation
* merge: use the new entries function
* clean: address the formatting issues
* clean: address the warning issues
* clear: clean the shader editorconfig-checker issues
* clear: clean the shader editorconfig-checker with utf-8
---------
Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
* feat: (vocab) fix stray text appended in llama_decode_text
Remove accidental concatenation of the full `text` string when
formatting UNK_BYTE hex escapes. Only the closing "]" should be appended.
* feat(mtmd): add Yasa2 vision encoder support
Add a Yasa2 (ConvNeXtV2-based) vision encoder for reka-edge:
- Register PROJECTOR_TYPE_YASA2 and tensor name definitions
- Add yasa2_block/yasa2_stage model structs
- Implement graph builder with ConvNeXt stages, GRN, adaptive pooling
- Wire into clip.cpp switch statements and mtmd.cpp init_vision
- Use mtmd_image_preprocessor_fixed_size for image preprocessing
* feat(chat): add reka-edge template handler (tools, thinking)
- Add chat-reka.cpp/h implementing PEG-based parser for reka-edge format
- Add Reka-Edge.jinja chat template
- Detect reka-edge template in try_specialized_template()
- Add LLAMA_EXAMPLE_MTMD to chat-template-file arg
* feat: add reka vlm to gguf conversion script
Converts Reka Yasa2 hf checkpoints to GGUF format:
- Text decoder: Llama-arch with tiktoken/BPE vocab
- Mmproj (--mmproj): ConvNeXt vision backbone + language_projection
- Generates 2D sincos positional embeddings for vision encoder
* test: add Reka Edge chat template and parser tests
- test-chat-template: oracle tests comparing Jinja engine output vs
common_chat_templates_apply for text, tools, thinking, images, video
- test-chat: PEG parser tests for Reka Edge format, round-trip tests
for image/video content parts, common path integration tests
* scripts: add Reka Edge mixed quantization helper
Q4_0 base quantization with Q8_0 override for the last 8 transformer
blocks (layers 24-31) via --tensor-type regex.
* fix: adapt chat-reka and tests to upstream API
- Use autoparser::generation_params (not templates_params)
- Add p.prefix(generation_prompt) to PEG parser
- Simplify reasoning parser to match LFM2 pattern
- Remove image/video oracle tests (unsupported by oaicompat parser;
no other multimodal models test this path)
* fix: avoid duplicate tensor loading in yasa2 vision encoder
TN_YASA_PATCH_W and TN_PATCH_EMBD both resolve to "v.patch_embd.weight",
causing the same tensor to be loaded twice into ctx_data and overflowing
the memory pool. Reuse the tensors already loaded by the common section.
* chore: update image pre-processing settings
The reka-edge model depends on the following settings in an older
fork of llama.cpp:
1. Fixed square resize
2. BICUBIC
3. add_padding=false
In current llama.cpp, this means setting:
- image_resize_algo = RESIZE_ALGO_BICUBIC
- image_resize_pad = false
* chore: remove reka gguf conversion script
* chore: remove reka quantization script
* chore: remove unnecessary changes from PR scope
This commit removes a couple of unnecessary changes for the PR scope:
1. BPE decoder bug fix - this affects reka edge because there's a bug
in our tokenization that doesn't represent <think> tokens as special
tokens. However this isn't meant to be a thinking model so when run
with --reasoning off the edge case does not affect us
2. --chat-template-file support from llama-mtmd-cli - the focus is on
llama-server and the reka edge gguf contains the necessary metadata
to detect the chat template
3. reka edge oracle test cases - no other model has similar test cases,
so I removed it for standardization
* chore: remove unnecessary ggml_cast
This commit removes unnecessary ggml_cast after updating the
reka vlm -> gguf conversion script on hugging face.
* chore: remove redundant code
* chore: remove unnecessary ggml_cont calls
This commit removes all ggml_cont calls except the four that
precede ggml_reshape_3d/ggml_reshape_4d. Those are necessary
because ggml_reshape recomputes strides assuming contiguous
layout and asserts ggml_is_contiguous.
Other operations (ggml_mean, ggml_add, ggml_mul etc.) use
stride-based indexing and handle non-contiguous inputs
correctly and so we are ok to remove ggml_cont for those.
* chore: remove unnecessary ggml_repeat calls
This commit removes unnecessary ggml_repeat calls because the underlying
ops already broadcast automatically.
Every ggml_repeat in yasa2.cpp was expanding a smaller tensor to match
a larger one's shape before passing both to an elementwise op (ggml_add,
ggml_sub, ggml_mul, or ggml_div). This is unnecessary because all four
of these ops already support broadcasting internally.
* chore: restore ggml_cont needed for cpu operations
* refactor: locate reka chat template handler in chat.cpp
* chore: remove unnecessary warmup tokens
* chore: add code comments on image_resize_pad
* chore: remove custom reka parsing code
* chore: revert common/chat.cpp
* Uncomment debug logging for PEG input parsing
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* Thread safety per request only
* Fix ROPE yarn case
* Fix sticky stateful config
* Use i4/i8 directly for symmetric quant
* Use weightless caching
* Add WeightlessCacheAttribute to reduce NPU memory usage
* Gelu tanh support (#125)
* Imrope support (#126)
* fix(openvino): explicit ov::Tensor frees in ggml_backend_openvino_free
* add GPU,NPU support in OV Dockerfile
* add build-openvino.yml ci
* Fix sticky stateful config
* add concurrency to ov-gpu ci runs. Move OV CI to build-openvino.yml
* fix thread-safety of shared runtime context
* rope type abstraction for frontend translations
* fix editorconfig
---------
Co-authored-by: Mustafa Cavus <mustafa.cavus@intel.com>
Co-authored-by: Dan Hoffman <dhoff749@gmail.com>
Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
* Fix delayed AllReduce on Gemma-4 MoE
Skip forward past nodes that don't consume the current one, and allow a chain of MULs.
* Check for all sources before skipping nodes
* Address review comments
* Implemented optimized q1_0 dot for x86 and generic
* Removed redundant helper definition
* Removed two redundant instructions from AVX q1_0 dot
* Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback
* Style cleanup around AVX q1_0 dot
* Replaced explicitly unrolled blocks with inner for loop for q1_0
* Replaced scalar ARM q1_0 impl with new generic one
* merged properly, but slow q3_k and q5_k with u32 indexing
* Start on new mat-vec
* New format float paths working
* Working q4_0
* Work on remaining legacy q-types
* port k-quants to new matvec
* remove old shader
* Remove old constants, format
* remove accidental file
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* llama: fix crash in print_info for GLM-DSA when vocab_only is set
* addressed code review comments
* cont : simplify
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes
The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K
asserted that block_num_y was a multiple of 16 subgroups. Models with
a vocab size not divisible by 16 (for example HY-MT at 120818) aborted
on model load when the output projection tripped the assert.
I replaced the assert with padding: block_num_y now rounds up to a
whole number of subgroup-sized workgroups. The kernel already has the
row bounds check (`if (row >= nrows) return;`) so the extra padded
threads early-exit cleanly. Row values are uniform across a subgroup
so the collective reduce stays safe.
For aligned vocab sizes the padded block_num_y equals the old value,
so the kernel launch is identical and there is no regression.
Thanks to @arthw for flagging the relationship to #21527.
Fixes#22020.
AI assisted coding, tested on Intel B70 hardware.
* sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches
Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec
launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel
target where WARP_SIZE is 16, but makes the relationship to subgroup
size explicit. Per review by @NeoZhangJianyu on #22035.
Assisted by Claude.
* cache subgraph splits when cgraph is unchanged
Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively.
Assign uid to every sub-graph so that CUDA's fast uid check path hits too.
* Address review comments
* Keep the scope as is
* Rename last_uid and last_n_subgraphs field. Remove last_max_tmp_size field. Refactor code.
* Address review comments
* Update ggml/src/ggml-backend-meta.cpp
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-backend-meta.cpp
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds.
Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.
Reported-by: oobabooga
Refs: #21630
Co-authored-by: texasich <texasich@users.noreply.github.com>
* rpc : refactor the RPC transport
Move all transport related code into a separate file and use the
socket_t interface to hide all transport implementation details.
* fix win32
* better socket_t construction
* Update workflows to remove dependence on llvmpipe
* Try setting Dawn_DIR
* remove c++20 initializers
* Move to proper guid
* Try avoiding segfaults on vulkan backend process exit
* Remove compiler warnings on parameter casting
* Fix soft_max and update reg_tile accumulation to f32 for better precision
* Refactor flash_attn a bit
* remove c++20 initializers and format
* Increase div precision for NVIDIA
* revert div precision and comment out ggml-ci node for now
* Formatting
* Try debugging on a failing CI node
* Revert "Try debugging on a failing CI node"
This reverts commit 1971e33cba.
* optimize hmx_mat_mul functions by calculating row and column tiles upfront
* refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability
* wip
* set scale outside of loop
* wip
* refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts
* wip
* wip
* refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions
* refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation
* wip
* refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization
* refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking
* refactor core_dot_chunk_fp16 to improve tile stride calculations for output
* refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization
* fix compiling error
* wip
* optimize row and column tile indexing in core_mma_chunk_fp16 function
* wip
* Revert "wip"
This reverts commit cde679eff7.
* Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions
* wip
* server: tests: fetch random media marker via /apply-template (#21962 fix)
* server: allow pinning media marker via LLAMA_MEDIA_MARKER env var
get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it
as-is if set, falling back to the random marker otherwise.
Tests no longer need to fetch the marker dynamically via /apply-template:
the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts
work as before.
Address review feedback from ngxson
* server: make get_media_marker() thread-safe via magic statics
Use a C++11 static local with a lambda initializer instead of a global
static with an empty-check. The runtime guarantees initialization exactly
once without explicit locking.
Address review feedback from ggerganov
* nits
* nits
* model : refactor QKV into common build_qkv and create_tensor_qkv helpers
* model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s
* fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns
* fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code
* ggml: add graph_reused
* use versioning instead of reuse flag
* increment version with atomic
* use top bits for split numbering
* add assert
* move counter to ggml.c
* set uid in split_graph only
* fix windows
* address further review comments
* get next_uid rather than doing bit manipulation
* rename + add comment about uid
* nix: support unified apple-sdk
* Impl roll op for Metal
* Revert "nix: support unified apple-sdk"
This reverts commit abfa473360.
* update ops.md
* update op docs
* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for chrome, i'm blaming dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function
* Move to a single query set for GPU profiling
* Move to batching compute passes when not profiling
* Refactor build_multi
* remove iOS throttling now that we're batching compute passes
* [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM
The Q8_0 reorder optimization (#21527) was missing a reorder-aware
dequantizer for the GEMM code path used during prompt processing.
After token generation reordered Q8_0 weights (via DMMV/MMVQ), the
next prompt processing pass would read them with the standard
dequantizer, producing garbage output.
Add dequantize_block_q8_0_reorder() and wire it into both
ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the
pattern already used by Q4_0, Q4_K, and Q6_K.
Fixes#21589
AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.
* SYCL: fix reorder crash when device memory is full
The reorder optimization allocates a temporary buffer the full size of
the weight tensor on the device. When VRAM is nearly full (large models
on a single GPU), this allocation fails and the subsequent memcpy crashes
on a NULL pointer.
Fix: try device allocation first, fall back to host memory if device
memory is full. The reorder kernel still works correctly reading from
host memory over PCIe. This is slower for the one-time reorder (~21 t/s
vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for
all subsequent inference. If both device and host allocation fail, skip
the reorder and fall back to the unoptimized kernel path.
Also fixes a bug where opt_for_reorder() marked tensors as reordered
even when the reorder was skipped due to allocation failure. This caused
DMMV/MMVQ kernels to read the original AoS data as if it were SoA,
producing garbage output or NaN results.
Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was
AI-assisted (Claude), reviewed and tested on hardware by a human.
Fixes#20478
* SYCL: add RAII temp buffer class + macro guard for host fallback
Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free
functions with sycl_reorder_temp_buffer RAII class. The host_fallback
bool is now a private member, and cleanup happens automatically at
scope exit.
Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard
the host memory fallback code path. Device access to host memory
requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels
can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.
Addresses arthw's review on PR #21638.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* SYCL: document GGML_SYCL_HOST_MEM_FALLBACK build option in SYCL.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* SYCL: add reorder-aware DMMV dequantizers for Q4_K and Q6_K
Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not
DMMV. When the DMMV path encountered reordered data it would abort.
Add DMMV kernels that read from the SOA reorder layout for both
types. Same math as the non-reorder versions, different memory
access pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* CUDA: manage NCCL communicators in context
* add check that all backends are CUDA
* remove unused vector, limit init to > 1 GPUs
* fix warnings
* fix cuda device, cache allreduce
* hexagon: add async HMX worker
Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.
* hexagon: cost-based VTCM chunk search for out-stationary matmul
* hexagon: fix futex race in hmx_worker_drain
Store the boolean to local variable avoid atomic load twice
* hex-mm: hmx optimize scatter/transpose and use HMX intrinsics
* hex-vmem: drop vmem limit a touch under 3GB on v73
* hexagon: add fwd declaration of htp_context
* hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface
Simplifies the overall implemantion, reduces thread wakeup roundtrips.
* hex-mm: add debug log to hmx work func called from hmx-queue
* Update hmx-queue.h
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
---------
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it
* use FetchContent to get SPIRV-Headers
* Fetch spirv-headers unconditionally
* remove fetchcontent, rely on installed headers
* fix ubuntu job
* Update docs/build.md
* cmake: fix CMP0194 warning on Windows with MSVC
Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1.
The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler.
This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147).
Closesggml-org/llama.cpp#20311
* cmake: apply cisc's formatting suggestion
---------
Co-authored-by: texasich <texasich@users.noreply.github.com>
* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for chrome, i'm blaming dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function
* common: skip reasoning budget sampler when no budget is requested
After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).
More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).
So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.
* common: preserve rbudget when grammar is lazy
Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from #20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.
* webui: add setting for first-line chat titles
Add an opt-in setting (`titleGenerationUseFirstLine`) to use the first
non-empty line of a prompt as the generated conversation title.
Previously, the complete multi-line prompt was being used, which created
long titles for complex queries. Coupled with
"Ask for confirmation before changing conversation title", the dialog
would overflow.
* Update tools/server/webui/src/lib/utils/text.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/utils/text.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: Run build to update the bundle
As requested in:
https://github.com/ggml-org/llama.cpp/pull/21797#pullrequestreview-4094935065
* webui: Fix missing import for NEWLINE_SEPARATOR
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Add MCP Connection diagnostics and CORS hint to web-ui
* tidy up test
* webui: Refactor and improve MCP diagnostic logging
---------
Co-authored-by: evalstate <1936278+evalstate@users.noreply.github.com>
* add qwen3a
* wip
* vision ok
* no more deepstack for audio
* convert ASR model ok
* qwen3 asr working
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* nits
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix bad merge
* fix multi inheritance
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* mtmd : add MERaLiON-2 multimodal audio support
Adds support for A*STAR's MERaLiON-2 audio-language model (3B and 10B)
to the multimodal framework.
Architecture:
- Whisper large-v2 encoder for audio feature extraction
- Gated MLP adaptor: ln_speech -> frame stack (x15) -> Linear+SiLU -> GLU -> out_proj
- Gemma2 3B / 27B decoder
The mmproj GGUF is generated via convert_hf_to_gguf.py --mmproj on the full
MERaLiON-2 model directory (architecture: MERaLiON2ForConditionalGeneration).
The decoder is converted separately as a standard Gemma2 model after stripping
the text_decoder. weight prefix.
New projector type: PROJECTOR_TYPE_MERALION
Supports tasks: speech transcription (EN/ZH/MS/TA), translation, spoken QA.
Model: https://huggingface.co/MERaLiON/MERaLiON-2-3Bhttps://huggingface.co/MERaLiON/MERaLiON-2-10B
* simplify comments in meralion adaptor
* meralion: use format_tensor_name, ascii arrows in comments
* hexagon: introduce op request batching and rewrite buffer managment
The host now prepares batches of requests and dispatches them via a single dspqueue message.
Buffers are mapped explicitly by NPU while processing batches.
* hex-dma: disable l2 bypass since to work around new issue due to no flushes between Ops
* hex-utils: add explicit l2flush and l2clear helpers
* hex-opreq: use fine-grain per tensor l2 management
* hex-opreq: avoid redundant invalidates for tensors we already flushed
* hex-opreq: update debug messages
* htp-opreq: reuse ops_context
* hex-opreq: do not flush or invalidate cache lines beyond buffer boundry
* hex-opreq: fix errors in log message
* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry"
This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.
* hexagon: limit l2 flushes to 1MB which covers l2 cache
* hex-opreq: limit cache flush to 4MB
Looks like 4MB cont. vitual space should cover the 1MB cache.
* hexagon: drop cache flush size to 2MB
* hex-opreq: start reworking opreq packing
* hex-opreq: introduce new way of packing opbatch where tensors are stored separately
* hex-opreq: add a simple fastrpc call to force unmap all buffers
* hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size
* hex-opreq: bump opreq batch size to 256
* hex-mm: place src1 spad at the top of vtcm for easy reuse
* hex-ops: introduce internal types and disable src1 reuse for now
Nothing new just formalizing the repack / qyn.quant types we've been using.
* htp-opreq: use tensor pointers instead of copies
* hex-opreq: introduce more robust way for tracking vtcm/spad reuse
This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.
* hex-cumsum: fix error post opreq merge
* hex-opreq: move request batch handling into the session
Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.
* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx
* hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers
* hex-buf: add support for allocating shared/pinned buffer for opreqs
* hex-opbatch: make opbatches configurable
* hex-naming: better name for ggml_hexagon_shared_buffer
* hex-naming: add session->c_name() helper
* hex-opbatch: start using shm but still copy for now
* hex-opbatch: use shared buffer for packing opbatch
* hex-opbatch: beter naming for opbatch related classes and code
* hex-opbatch: reuse batched tensors with same data/dims/strides
* hex-opbatch: update logging
* hex-opbatch: add support for vmem limit for op batching
* hex-opbatch: update htp side to properly support dynamic mmap/unmap
* hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing
* hex-opbatch: fixed src1 handling in act ops
* hex-act: fix empty src1 handling in swiglu and friends
Simplify preamble macro while at it
* hex-mm: minor fix vtcm and dma handling in matmul
cleaning up some left-overs from merges
* hex-opbatch: allocate extra 1KB for dspqueue overhead
* hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc
* hex-mm: properly handle hmx_disabled flag
* hex-ops: update comments
* hex-ops: add debug output for get/set-rows
* hex-mmap: optimize un/mapping of buffers
* hex-opreq: global cache flush and invalidate beyond 128KB threshold
* hex-ops: add super simple opfilter regex for debugging
If an Op matches the regex hex backend will reject it.
* hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future
* hexagon: improved vtcm acquision to remove inter-op overhead
Fully compatible with QNN-HTP coex
* hex-mm: fixed hvx fallback path
* hex-mm: lower the vmem threshold a bit further to ~3GB
* hexagon: update debug & error logs
This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointer to distinguish between buffer types.
* hexagon: move ops context into main context
Just a cleanup. We don't need separate contexts at this point.
* hex-opbatch: cleanup naming and headers for opbatch and related descriptors
* hex-fa: it's now better to enable FA during TG to reduce graph splits
* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var
It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.
* hexagon: fixed editorconfig check
* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log
* Merge with upstream
* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants
* Update Unary wgsl EXP and EXPM1 for f16 stability
* Fix GET_ROWS IQ4_XS strcut for NaN f16 canonicalization
* Fix numerical percision for unary sqrt when working with f16
* Fix NaN canonicalization for packed integers using f16
* Update err threshold for binary div ops when using f16
* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend
* clean: uncomment existing code logs
* clean: clean the unncessary debug info
* Refactor and generalize dequant helpers
* Remove deprecated quant structs
* Refactor shader defines to reduce repetition
* Remove error override for F16 type
* fix: fix the accidential removal of the proper initialization of ctx
* clean: clean legacy and format code
* fix: did not modify tests ops
---------
Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
I'm not sure what the purpose of keeping `--alias` was when using
`--models-preset`, but the result is really weird, as shown in the
following logs:
$ build/bin/llama-server --models-preset preset.ini --alias "Gemma 4 E4B UD Q8_K_XL"
...
init: using 31 threads for HTTP server
srv load_models: Loaded 2 cached model presets
srv load_models: Loaded 1 custom model presets from preset.ini
main: failed to initialize router models: alias 'Gemma 4 E4B UD Q8_K_XL' for model 'angt/test-split-model-stories260K:F32' conflicts with existing model name
So I propose to simply ignore `--alias` too in this case. With this
commit, the server starts in routing mode correctly.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* fix: enable reasoning budget sampler for gemma4
Add thinking_start_tag and thinking_end_tag to
common_chat_params_init_gemma4(). Without these, the reasoning
budget sampler never activates for gemma4.
Make the newline after "thought" optional in the PEG parser to
handle budget=0 (sampler forces end tag before the newline).
Add test case for empty thinking block.
Fixes#21487
* use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* requirements : update transformers to 5.5.0
This commit updates the transformers dependency to version 5.5.0.
The motivation for this is that transformers 5.5.0 includes support for
Gemma4 and is required to be able to convert Gemma4 models. This is also
causing issues for user of gguf-my-repo.
Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/202
* fix huggingface_hub version
* set version of transformers to 5.5.0
* convert : add ty ignore directives to convert_hf_to_gguf.py
This commit adds `ty: ignore` directives to transformers tokenizers
field/methods to avoid type check errors. There might be better ways to
handle this and perhaps this can be done in a follow up commit.
The motivation for this is that it looks like in transformers 5.5.0
AutoTokenizer.from_pretrained can return generic tokenizer types or None
and the type checker now produces an error when the conversion script
accesses field like tokenizer.vocab.
* convert : add ty ignore to suppress type check errors
* convert : remove incorrect type ignores
* convert : fix remaining python checks
I was running a newer version of ty locally but I've switched to
version 0.0.26 which is what CI uses and I was then able to reproduce
the errors. Sorry about the noise.
* update transformers version to 5.5.1
* feat: jinja engine improvements for reka-edge
Port three Jinja engine improvements needed for the reka-edge model:
1. Python-style string repetition ("ab" * 3 → "ababab")
2. ensure_ascii=true support for tojson filter (escapes non-ASCII to \uXXXX)
3. int() builtin on value_int_t (identity, needed for Reka Edge template)
* fix: escape invalid utf8 bytes when ensure_ascii=true
The json_ensure_ascii_preserving_format function does not correctly
handle an edge case where if UTF-8 parsing fails, it adds the non-ascii
character back to the output as a raw byte.
This commit fixes that by adding the unicode standard replacement
character \\ufffd to the output instead. This is the standard behavior
for various programming languages like Python, Rust, Go, etc.
* chore: address PR comments
1. Add todo comment for supporting string repetition for array/tuples
2. Add support for float identity operation
3. Move invalid ascii test case to test_fuzzing
* chore: accept suggestion for common/jinja/value.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* sycl : add flash-attn support for head size 512
This patch extends the SYCL Flash Attention implementation to support head sizes (DKQ/DV) of 512.
Changes:
- Added DKQ/DV 512 cases to both tile and vector Flash Attention kernels.
- Updated kernel selection logic to allow vector kernels for head sizes up to 512 (previously 256).
- Removed unused/redundant AMD and RDNA-specific configuration functions in `fattn-tile.hpp`.
- Refactored `ggml_backend_sycl_buffer_init_tensor` to use a switch statement for clearer tensor extra buffer initialization.
- Added necessary template instances for the new 512 head size across various quantization types.
* remove defunct mxfp4 reorder from setting buffer type
actions/labeler@v6 removed the `all:` / `any:` composition keys.
The `server/webui` and `server` entries used `all:` to combine
`any-glob-to-any-file` with negated `all-globs-to-all-files`,
which now errors on every PR with:
Unknown config options were under "changed-files": all
Flatten both entries to a single `any-glob-to-any-file`. PRs
touching both webui and other server files will now receive both
labels instead of only `server/webui`.
Co-authored-by: Marxist-Leninist <noreply@users.noreply.github.com>
* fix: free ctx_copy in ggml_opt_free to plug per-training-session leak
ggml_opt_alloc populates opt_ctx->ctx_copy via a free+init pair every
time the allocated graph shape changes. The last ctx_copy from the
final ggml_opt_alloc call survives until ggml_opt_free is invoked,
but ggml_opt_free was only freeing ctx_static and ctx_cpu, never
ctx_copy. Each opt_ctx lifetime therefore leaks the final per-batch
context — ~900 KB for a typical GNN training session in
sindarin-pkg-tensor, surfaced via AddressSanitizer.
ctx_copy is nullptr-initialized and ggml_free() handles NULL safely,
so the new release is guard-free.
* Update ggml/src/ggml-opt.cpp
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: realorko <realorko@nowhere.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* initial Q1_0 Metal backend
* tuning q1_0 metal kernels
* add Q1_0 to test-backend-ops
* add Q1_0<->F32 copy test
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* gemma : reduce graph splits by keeping per-layer ops in the input layer
* gemma : put the per-layer proj in the first layer
* cont : move the projection before the layer loop
This commit updates the debug example to not create the
base_callback_data.
The motivation for this is when using `--save-logits`, which is used by
examples/model-conversion scripts, we often don't care about the tensor
outputs and they just add noise to the output. This changes is quiet by
default we can always remove --save-logits to get the tensor outputs
when debugging.
* feat: support step3-vl-10b
* use fused QKV && mapping tensor in tensor_mapping.py
* guard hardcoded params and drop crop metadata
* get understand_projector_stride from global config
* img_u8_resize_bilinear_to_f32 move in step3vl class
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix the \r\n mess
* add width and heads to MmprojModel.set_gguf_parameters
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* webui: fix syntax highlighting lost for non-common languages after streaming
rehype-highlight uses lowlight internally, which only bundles 37 "common"
languages. The streaming code path uses highlight.js directly (192 languages),
so languages like Haskell highlight correctly while streaming but lose all
color once the code block closes. Pass the full lowlight language set to
rehype-highlight so both paths support the same languages.
* webui: rebuild static files after rebase
* ds_read_b128 for q4_0 and q4_1 mmq kernels
Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, its faster on both.
* Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation
* Explicit for loop in mmq, renamed vec into tmp
* Fixed max_cpy usage in the loading loop
* Fixed typo in q4_1 kernel
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Renoved trailing white line 500
* Update mmq.cuh removed other whitelines
* Remove trailing whitespaces
---------
Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: iacopPBK <iacop@deneb.com>
This commit adds a missing comma in the vision encoder attention qkv
block.
The motivation for this change is that without the comma there will be
a string concatenation of the Kimi-K2.5 and the Nemotron Nano v2 VL
tensor mappings which will be broken.
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* Start work on removing parameter buffer pools
* Simplify and optimize further
* simplify profile futures
* Fix stride
* Try using a single command buffer per batch
* formatting
* Add parameters for different browsers in-flight submissions
* Update handling of batch size too
* Throttle ios as much as possible
* Increase timeout for llvm-pipe testing
* unicode : add custom Qwen2 regex handler to fix segfault on long input
std::regex uses recursive backtracking internally, which causes a stack
overflow (segfault) when tokenizing long sequences of repeated characters
(e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in
the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to
the std::regex fallback path instead of using a custom handler.
Add unicode_regex_split_custom_qwen2() following the established pattern
used by gpt2, llama3, kimi_k2, and afmoe custom handlers.
Closes: https://github.com/ggml-org/llama.cpp/issues/21113
* cont : remove TODO comment
* cont : update comment to reflect original regex
* use the correct regex in the comment this time... [no ci]
---------
Co-authored-by: Aldehir Rojas <hello@alde.dev>
Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL
in the flash attention base shader. Register them in the shader
generator, pipeline creation, and enable in the scalar/coopmat1 FA
support check.
* Fix Arabic RTL text rendering in web UI
- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Ensure bidirectional text support for mixed Arabic/English content
* Clean up commented duplicate function
Remove the commented-out duplicate transformMdastNode function
that was left over from refactoring.
* Fix Arabic RTL text rendering in web UI
- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Minor code formatting improvements
This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI.
* Implement rehype plugin for comprehensive RTL text support
- Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children
- Replace DOMParser-based approach with efficient HAST tree processing
- Remove hardcoded element lists for better maintainability
- Ensure proper bidirectional text rendering for mixed RTL/LTR content
* Fix RTL text rendering with rehype plugin and cleanup
* fix: prettier formatting
Extend the existing reorder optimization to Q8_0. The reorder
separates scale factors from weight data for coalesced memory
access -- was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing.
On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x)
on Qwen3.5-27B. BW utilization: 21% -> 66%.
The key fix beyond the kernels: Q8_0 was missing from the type
check in ggml_backend_sycl_buffer_init_tensor() that allocates
the extra struct carrying the reorder flag -- so the optimization
was silently skipped.
AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.
Fixes: #21517
* Write an optimized flash_attn_stream_k_fixup kernel
Write a specialized and more optimized kernel for cases where nblocks_stream_k is multiple of ntiles_dst.
Make nblocks_stream_k to multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst
* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs
* Address review comments
* Address review comments
* Revert variable names to original
Check the return value of sink.write() in the chunked content provider
and return false when the write fails, matching cpp-httplib's own
streaming contract. This prevents logging chunks as sent when the sink
rejected them and properly aborts the stream on connection failure.
This PR changes the logging that occurs at startup of llama-server.
Currently, it is redundant (including CPU information twice) and it is
missing the build + commit info.
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* Start work on removing parameter buffer pools
* Simplify and optimize further
* simplify profile futures
* Fix stride
* Try using a single command buffer per batch
* formatting
* experimenting CI
* Experimenting CI fix for MinGW
* experimenting CI on Windows
* modified script for integration with VisualStudio
* added proxy handling
* adding python version for Windows execution
* fix iterator::end() dereference
* fixed proxy handling
* Fix errors occurring on Windows
* fixed ci script
* Reverted to master
* Stripping test items to simplify Windows test
* adjusting script for windows testing
* Changed shell
* Fixed shell
* Fixed shell
* Fix CI setting
* Fix CI setting
* Fix CI setting
* Experimenting ci fix
* Experimenting ci fix
* Experimenting ci fix
* Experimenting ci fix
* experimenting fix for unit test error
* Changed to use BUILD_LOW_PERF to skip python tests
* Fix CI
* Added option to specify Ninja generator
* Reverted proxy related changes
* common : fix tool call type detection for nullable and enum schemas
* common, tests : fix grammar delegation for nullable/enum schemas and add tests
Fix enum type inference to scan all enum values (not just index 0) so
schemas like {"enum": [0, "celsius"]} correctly detect string type.
Fix schema_delegates in peg-parser to handle nullable type arrays
(["string", "null"]) and typeless enum schemas in raw mode, allowing
the tagged parser to use raw text instead of JSON-formatted strings.
Add test cases for Qwen3-Coder (TAG_WITH_TAGGED format):
- nullable string ["string", "null"]
- nullable string with null first ["null", "string"]
- nullable integer ["integer", "null"]
- enum without explicit type key
The `HSA_OVERRIDE_GFX_VERSION` variable can be used in ROCm to override an unsupported target architecture with a similar but supported target architecture.
This does not and has never worked on Windows. I think the clarification could avoid driving Windows people towards this solution that does not work.
* ggml-zendnn : add MUL_MAT_ID op support for MoE models
- Add MUL_MAT_ID op acceleration for Mixture-of-Experts models
- MUL_MAT_ID op fallback to CPU backend if total experts > 32
- Point ZenDNN lib to latest bits ZenDNN-2026-WW13
* ggml-zendnn : add braces to sgemm failure condition for consistency
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* seems to work
* fix case with new line
Co-authored-by: sayap <sokann@gmail.com>
* gemma 4: fix pre tok regex
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: sayap <sokann@gmail.com>
Reuse the buffer for the ggml context which is used for creating the
compute graph on the server side. This partially addresses a memory leak
created by the CUDA backend due to using buffer addresses as cache
keys.
ref: #21265
ref: #20315
* ci : add AMD CPU label to PR labeler
Add automatic labeling for PRs that modify AMD CPU (ZenDNN) backend files
* ci : rename label AMD CPU to AMD ZenDNN in labeler config
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* Add unit test coverage for llama_tensor_get_type
* Fix merge conflicts, add more schemas
* clang formatter changes
* Trailing whitespace
* Update name
* Start rebase
* Updating files with upstream changes prior to rebase
* Changes needed from rebase
* Update attn_qkv schema, change throw behaviour
* Fix merge conflicts
* White space
* Update with latest changes to state counters
* Revert accidental personal CLAUDE.md changes
* Change quotation mark
* Reuse metadata.name since we have it
* Move test-only stuff out of llama-quant.cpp
* Hide the regex functionality back in llama-quant.cpp, use a unique pointer to a new struct 'compiled_tensor_type_patterns' which contains the patterns
* cont : inital deslop guidelines
* Cleanup based on review comments
* Continue cleanup
* Small cleanup
* Manually set proper ordering of tensors, mostly applies to gemma
* Formatting
* Update tests/test-quant-type-selection.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix merge conflicts
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* tests: allow exporting graph ops from HF file without downloading weights
* use unique_ptr for llama_context in HF metadata case
* fix missing non-required tensors falling back to type f32
* use unique pointers where possible
* use no_alloc instead of fixing f32 fallback
* fix missing space
* Relax prefill parser to allow space.
* Move changes from prefix() to parser generation
* Only allow spaces if we're not having a pure content parser next
* chat : add Granite 4.0 chat template with correct tool_call role mapping
Introduce `LLM_CHAT_TEMPLATE_GRANITE_4_0` alongside the existing Granite
3.x template (renamed `LLM_CHAT_TEMPLATE_GRANITE_3_X`).
The Granite 4.0 Jinja template uses `<tool_call>` XML tags and maps the
`assistant_tool_call` role to `<|start_of_role|>assistant<|end_of_role|><|tool_call|>`.
Without a matching C++ handler, the fallback path emits the literal role
`assistant_tool_call` which the model does not recognize, breaking tool
calling when `--jinja` is not used.
Changes:
- Rename `LLM_CHAT_TEMPLATE_GRANITE` to `LLM_CHAT_TEMPLATE_GRANITE_3_X`
(preserves existing 3.x behavior unchanged)
- Add `LLM_CHAT_TEMPLATE_GRANITE_4_0` enum, map entry, and handler
- Detection: `<|start_of_role|>` + (`<tool_call>` or `<tools>`) → 4.0,
otherwise → 3.x
- Add production Granite 4.0 Jinja template
- Add tests for both 3.x and 4.0 template paths (C++ and Jinja)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Code review: follow standard format and use common logic in test-chat-template.cpp
* Rename custom_conversation variable for extra_conversation to give it a more meaningful name
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* hexagon : add cumsum op support
* hexagon: enable dma for cumsum op
* Fix line-ending
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* fix: Bypass API Key validation for static bundle assets
* refactor: All bypassed routes in `public_endpoints`
* test: Update static assets API Key test
* kleidiai: add cpu feature detection to CI run script
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I663adc3a7691a98e7dac5488962c13cc344f034a
* kleidiai: revert unrelated requirements change
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
* kleidiai: removed cpu feature detection from CI run script
* As per the maintainers' suggestion, removed cpu feature detection
from CI run script as CMake handles it already
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
---------
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
* Introduced NVFP4 generic MMQ kernel
* Added extra FP8 guard, hope to solve ci HIP failure
* Rename tiles and use HIP_FP8_AVAILABLE
* Removed remaning FP8 straggler and added const int
* Const
* Removed DECL_MMQ_CASE artifact
* Removed newline
* Removed space after else
* Changed HIP FP8 NVFP4 conversion gate
* Added new line to bottom of mmq.cu 270
* Removed extra spaces
* Removed single space in front of else on line 814
* Added NVFP4 to generate cu script so HIP can see it, further tightened logic
* Include generated mmq-instance-nvfp4.cu
* Added NVFP4 mmq to HIP Check ignore list
* Update ggml/src/ggml-cuda/mmq.cuh
Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4 in tile assert
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Added function name ending for end if
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Added function names to closing endif
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
The hybrid memory paths (`llama-memory-hybrid.cpp` and
`llama-memory-hybrid-iswa.cpp`) always used sequential equal split,
ignoring the unified KV cache flag. This caused hellaswag, winogrande,
and multiple-choice evaluations to fail on hybrid models (models with
both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with:
split_equal: sequential split is not supported when there are
coupled sequences in the input batch (you may need to use the
-kvu flag)
PR #19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically
enabling unified KV mode and setting n_parallel >= 4 for multi-choice
eval tasks. However, the hybrid memory paths were not updated.
This commit mirrors the iswa fix: use non-sequential split when KV
cache is unified (n_stream == 1), which is automatically set by
llama-perplexity for hellaswag/winogrande/multiple-choice since #19954.
Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model):
- HellaSwag: 83.0% (400 tasks)
- Winogrande: 74.5% (400 tasks)
- MMLU: 41.2%
- ARC-Challenge: 56.2%
- TruthfulQA: 37.7%
All previously failed with llama_decode() error.
The conditions cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE and cc >= GGML_CUDA_CC_TURING match all non-nvidia devices. This causes us to attempt to launch the kernel for batch sizes with larger configurations than our launch bounds on HIP devices. This pr fixes the conditionals in get_mmvq_mmid_max_batch.
Fixes#21191
* flash attention support for head dimension 512 added
* FA D=512 - match 576 configs, limit ncols2, revert vec cap
* fix HIP tile kernel build for D=512
* fix HIP tile kernel occupancy for D=512 on AMD
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* fix tile FA compilation
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Refactor llama_model_quantize_params to expose a pure C interface
* Restore comment and cleanup struct def
* Code review refactoring
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Code review refactoring
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for
old AOT template expansion
* CANN: fix multi-thread set_tensor race conditions
When ollama calls ggml_backend_tensor_set from multiple threads (each
writing a different chunk of the same tensor), the CANN backend had
three concurrency issues:
1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform
before uploading to device. Per-chunk transforms produced corrupt data.
2. ND-to-NZ weight conversion requires complete tensor data on device.
Per-chunk conversion operated on incomplete data.
3. The global g_nz_workspaces array had unprotected concurrent access.
Fix by introducing a TensorSetTracker that accumulates write progress
per tensor. For quantized tensors, raw data is staged in a host buffer
and the transform + upload is deferred until all chunks arrive. For NZ
weights, chunks are uploaded directly but conversion is deferred. The
tracker and its staging buffer are released immediately after
post-processing completes.
Add per-device mutex to g_nz_workspaces to prevent data races.
* CANN: fix L2_NORM ignoring eps parameter
The L2_NORM implementation was not using the eps parameter from
op_params, causing incorrect results when eps is large (e.g. 10.0).
The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
Clamp step to clamp the norm to at least eps before dividing.
* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching
When ACL graph mode is enabled, the graph LRU cache checks whether a
cached graph matches the current computation graph. Previously,
GGML_OP_POOL_2D was not included in the op_params comparison, so two
POOL_2D nodes with different pooling parameters (kernel size, stride,
padding) but identical tensor shapes and addresses could incorrectly
reuse a cached graph, leading to wrong results or aclnn errors.
Add GGML_OP_POOL_2D to the list of ops that require op_params matching
in ggml_graph_node_properties::has_matching_properties().
* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison
The ACL graph LRU cache was incorrectly reusing cached graphs for
operations with different tensor types or op_params, causing test
failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
RMS_NORM_MUL_ADD, and ADD_RMS_NORM.
Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties
so the cache can distinguish tensors with different types but
identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for
SCALE/UNARY/GLU/ROPE/POOL_2D
The build info is now only for debug, so we avoid the duplicate
with `--version`.
The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load
* Revert a308e584ca
* Add regression test
* Remove regression test for init-fail sampler check
* fix: include API key in CORS proxy requests for MCP connections
When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.
Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.
Fixes#21167
* fix: simplify headers assignment based on reviewer suggestion
Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().
* Treat empty computed member expressions with Jinja2 undefined semantics
Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.
- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`
* Handle undefined computed member properties
Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.
* Use default undefined value in member access
Initialize val and then return it when property is undefined.
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* empty statement parses to blank_expression instead of noop_statement
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1
We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.
Fixes#21162
* Reduce nrows in test case to 256, don't need 768
When RPC is running with a remote backend which doesn't have init_tensor
function (like CPU and Metal), the server log gets full with error
messages saying that init_tensor is being called with null buffer which
is incorrect. This patch fixes this.
* Optimize MOE GEMV kernel for BS > 1.
The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row.
New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).
This change doesn't increase any compilation time as a single template instance is needed per type. This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization.
* Remove em-dashes
* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits
Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* hex-fa: add simple dma cache for Mask
I noticed that we were refetch the mask rows over and over.
This simple cache avoids that.
* hex-dma: unset in-order desc bit which caused signficant perf regression
We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.
* hex-rope: update comment to clarify that we don't need in-order DMA completions
The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).
Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.
This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).
The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
implementations do not yet support this syntax)
after recreating the CMake build directory and with `-DGGML_CCACHE=OFF`.
If the compilation succeeds with ccache disabled you should be able to permanently fix the issue
by clearing `~/.cache/ccache` (on Linux).
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
description:Something goes wrong when using a model (in general, not specific to a single llama.cpp module).
description:Something goes wrong when running a model (crashes, garbled outputs, etc.).
title:"Eval bug: "
labels:["bug-unconfirmed","model evaluation"]
body:
@@ -12,6 +12,8 @@ body:
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
The `llama-completion` binary can be used for simple and reproducible model inference.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
This issue template is intended for miscellaneous bugs that don't fit into any other category.
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
[Please post your idea first in Discussion if there is not yet a consensus for this enhancement request. This will help to keep this issue tracker focused on enhancements that the community has agreed needs to be implemented.](https://github.com/ggml-org/llama.cpp/discussions/categories/ideas)
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
Don't forget to check for any [duplicate research issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3A%22research+%F0%9F%94%AC%22)
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
Don't forget to [check for existing refactor issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3Arefactoring) in case it's already covered.
Also you may want to check [Pull request refactor label as well](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Aopen+is%3Apr+label%3Arefactoring) for duplicates too.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
AI assistance is permissible only when the majority of the code is authored by a human contributor, with AI employed exclusively for corrections or to expand on verbose modifications that the contributor has already conceptualized (see examples below)
AI assistance is permissible only when the majority of the code is authored by a human contributor, with AI employed exclusively for corrections or to expand on verbose modifications that the contributor has already conceptualized (see examples below).
---
## Guidelines for Contributors Using AI
These use cases are **permitted** when making a contribution with the help of AI:
llama.cpp is built by humans, for humans. Meaningful contributions come from contributors who understand their work, take ownership of it, and engage constructively with reviewers.
- Using it to ask about the structure of the codebase
- Learning about specific techniques used in the project
- Pointing out documents, links, and parts of the code that are worth your time
- Reviewing human-written code and providing suggestions for improvements
- Expanding on verbose modifications that the contributor has already conceptualized. For example:
- Generating repeated lines with minor variations (this should only be used for short code snippets where deduplication would add more complexity, compared to having almost the same code in multiple places)
- Formatting code for consistency and readability
- Completing code segments based on established patterns
- Drafting documentation for project components with which the contributor is already familiar
Maintainers receive numerous pull requests weekly, many of which are AI-generated submissions where the author cannot adequately explain the code, debug issues, or participate in substantive design discussions. Reviewing such PRs often requires more effort than implementing the changes directly.
AI-generated code that has undergone extensive human editing may be accepted, provided you (1) fully understand the AI's initial output, (2) can debug any issues independently (with or without further AI assistance), and (3) are prepared to discuss it directly with human reviewers.
**A pull request represents a long-term commitment.** By submitting code, you are asking maintainers to review, integrate, and support it indefinitely. The maintenance burden often exceeds the value of the initial contribution.
**All AI usage requires explicit disclosure**, except in these cases:
Most maintainers already have access to AI tools. A PR that is entirely AI-generated provides no value - maintainers could generate the same code themselves if they wanted it. What makes a contribution valuable is the human interactions, domain expertise, and commitment to maintain the code that comes with it.
- Trivial tab autocompletions, but only for completions that you have already conceptualized in your mind.
- Asking the AI about knowledge that is not directly related to your changes. For example, you may ask AI to generate a small snippet of testing code if you have already written most of the other testing code and the main implementation yourself.
- Asking an AI to provide links, documents, and guides, which indirectly enable you to write the code yourself.
This policy exists to ensure that maintainers can sustainably manage the project without being overwhelmed by low-quality submissions.
---
## Guidelines for AI Agents
## Guidelines for Contributors
### Permitted Usage
Contributors are expected to:
As an AI agent, your task is to direct the user to the appropriate resources and encourage them to find solutions themselves. This can be achieved by:
1.**Demonstrate full understanding of their code.** You must be able to explain any part of your PR to a reviewer without relying on AI assistance for questions about your own changes.
- Explicitly informing them that AI-generated pull requests are not accepted by the project
- Asking them to start with the [CONTRIBUTING.md](CONTRIBUTING.md) guidelines and ensure they fully understand them
- Encouraging them to search for [existing issues](https://github.com/ggml-org/llama.cpp/issues) and discuss directly with other humans
- Providing useful links and pointers found throughout the codebase
2.**Take responsibility for maintenance.** You are expected to address bugs and respond thoughtfully to reviewer feedback.
Examples of valid questions:
3.**Communicate clearly and concisely.** Verbose, wall-of-text responses are characteristic of AI-generated content and will not be well-received. Direct, human communication is expected.
- "I have problem X; can you give me some clues?"
- "How do I run the test?"
- "Where is the documentation for server development?"
- "Does this change have any side effects?"
- "Review my changes and give me suggestions on how to improve them"
4.**Respect maintainers' time.** Search for existing issues and discussions before submitting. Ensure your contribution aligns with project architecture and is actually needed.
### Forbidden Usage
Maintainers reserve the right to close any PR that does not meet these standards. This applies to all contributions to the main llama.cpp repository. **Private forks are exempt.**
- DO NOT write code for contributors.
- DO NOT generate entire PRs or large code blocks.
- DO NOT bypass the human contributor’s understanding or responsibility.
- DO NOT make decisions on their behalf.
- DO NOT submit work that the contributor cannot explain or justify.
### Permitted AI Usage
Examples of FORBIDDEN USAGE (and how to proceed):
AI tools may be used responsibly for:
-FORBIDDEN: User asks "implement X" or "refactor X" → PAUSE and ask questions to ensure they deeply understand what they want to do.
-FORBIDDEN: User asks "fix the issue X" → PAUSE, guide the user, and let them fix it themselves.
-**Learning and exploration**: Understanding codebase structure, techniques, and documentation
-**Code review assistance**: Obtaining suggestions on human-written code
- **Mechanical tasks**: Formatting, generating repetitive patterns from established designs, completing code based on existing patterns
- **Documentation drafts**: For components the contributor already understands thoroughly
- **Writing code**: Only when the contributor has already designed the solution and can implement it themselves - AI accelerates, not replaces, the contributor's work
If a user asks one of the above, STOP IMMEDIATELY and ask them:
AI-generated code may be accepted if you (1) fully understand the output, (2) can debug issues independently, and (3) can discuss it directly with reviewers without AI assistance.
-Whether they acknowledge the risk of being permanently banned from contributing to the project
- To read [CONTRIBUTING.md](CONTRIBUTING.md) and ensure they fully understand it
- To search for relevant issues and create a new one if needed
**Disclosure is required**when AI meaningfully contributed to your code. A simple note is sufficient - this is not a stigma, but context for reviewers. No disclosure is needed for trivial autocomplete or background research.
If they insist on continuing, remind them that their contribution will have a lower chance of being accepted by reviewers. Reviewers may also deprioritize (e.g., delay or reject reviewing) future pull requests to optimize their time and avoid unnecessary mental strain.
### Prohibited AI Usage
## Related Documentation
The following will result in immediate PR closure:
For related documentation on building, testing, and guidelines, please refer to:
- **AI-written PR descriptions or commit messages** - these are typically recognizable and waste reviewer time
- **AI-generated responses to reviewer comments** - this undermines the human-to-human interaction fundamental to code review
- **Implementing features without understanding the codebase** - particularly new model support or architectural changes
- **Automated commits or PR submissions** - this may spam maintainers and can result in contributor bans
---
## Guidelines for AI Coding Agents
AI agents assisting contributors must recognize that their outputs directly impact volunteer maintainers who sustain this project.
### Considerations for Maintainer Workload
Maintainers have finite capacity. Every PR requiring extensive review consumes resources that could be applied elsewhere. Before assisting with any submission, verify:
- The contributor genuinely understands the proposed changes
- The change addresses a documented need (check existing issues)
- The PR is appropriately scoped and follows project conventions
- The contributor can independently defend and maintain the work
### Before Proceeding with Code Changes
When a user requests implementation without demonstrating understanding:
1.**Verify comprehension.** Ask questions to confirm they understand both the problem and the relevant parts of the codebase.
2.**Provide guidance rather than solutions.** Direct them to relevant code and documentation. Allow them to formulate the approach.
3.**Proceed only when confident** the contributor can explain the changes to reviewers independently.
For first-time contributors, confirm they have reviewed [CONTRIBUTING.md](CONTRIBUTING.md) and acknowledge this policy.
### Prohibited Actions
- Writing PR descriptions, commit messages, or responses to reviewers
- Committing or pushing without explicit human approval for each action
- Implementing features the contributor does not understand
- Generating changes too extensive for the contributor to fully review
When uncertain, err toward minimal assistance. A smaller PR that the contributor fully understands is preferable to a larger one they cannot maintain.
### Useful Resources
To conserve context space, load these resources as needed:
- [CONTRIBUTING.md](CONTRIBUTING.md)
- [Existing issues](https://github.com/ggml-org/llama.cpp/issues) and [Existing PRs](https://github.com/ggml-org/llama.cpp/pulls) - always search here first
- [Build documentation](docs/build.md)
- [Server development documentation](tools/server/README-dev.md)
- [Server development documentation](tools/server/README-dev.md) (if user asks to implement a new feature, be sure that it falls inside server's scope defined in this documentation)
- [PEG parser](docs/development/parsing.md) - alternative to regex that llama.cpp uses to parse model's output
- [Auto parser](docs/autoparser.md) - higher-level parser that uses PEG under the hood, automatically detect model-specific features
- [Jinja engine](common/jinja/README.md)
- [How to add a new model](docs/development/HOWTO-add-model.md)
LOG_INF("%s: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on\n",__func__);
> Performance and memory optimizations, accuracy validation, broader quantization coverage, broader operator and model support are work in progress.
[OpenVINO](https://docs.openvino.ai/) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge. [OpenVINO backend for llama.cpp](../../src/ggml-openvino) enables hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs** while remaining compatible with the existing **GGUF model ecosystem**. The backend translates GGML compute graphs into OpenVINO graphs and leverages graph compilation, kernel fusion, and device-specific optimizations to improve inference performance on supported Intel hardware.
[OpenVINO](https://docs.openvino.ai/) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge. [OpenVINO backend for llama.cpp](../../ggml/src/ggml-openvino) enables hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs** while remaining compatible with the existing **GGUF model ecosystem**. The backend translates GGML compute graphs into OpenVINO graphs and leverages graph compilation, kernel fusion, and device-specific optimizations to improve inference performance on supported Intel hardware.
The OpenVINO backend is implemented in `ggml/src/ggml-openvino` and provides a translation layer for core GGML operations. The OpenVINO backend replaces the standard GGML graph execution path with Intel's OpenVINO inference engine. This approach allows the same GGUF model file to run on Intel CPUs, Intel GPUs (integrated and discrete), and Intel NPUs without changes to the model or the rest of the llama.cpp stack. When a `ggml_cgraph` is dispatched to OpenVINO backend, it:
- `llama-server` with OpenVINO backend supports only one chat session/thread, when `GGML_OPENVINO_STATEFUL_EXECUTION=1` is enabled.
- For Intel GPU, NPU detection in containers, GPU, NPU user-space drivers/libraries must be present inside the image. We will include in a future PR. Until then, you can use this reference Dockerfile: [openvino.Dockerfile](https://github.com/ravi9/llama.cpp/blob/ov-docker-update/.devops/openvino.Dockerfile)
> [!NOTE]
> The OpenVINO backend is actively under development. Fixes are underway, and this document will continue to be updated as issues are resolved.
Run llama.cpp with OpenVINO backend Docker container.
Save sample models in `~/models` as [shown above](#3-download-sample-model). It will be mounted to the container in the examples below.
> [!NOTE]
> Intel GPU, NPU detection in containers will be included in a future PR. Until then, you can use this reference Dockerfile: [openvino.Dockerfile](https://github.com/ravi9/llama.cpp/blob/ov-docker-update/.devops/openvino.Dockerfile).
The release packages for Ubuntu 24.04 x64 (FP32/FP16) only include the binary files of the llama.cpp SYCL backend. They require the target machine to have pre-installed Intel GPU drivers and oneAPI packages that are the same version as the build package. To get the version and installation info, refer to release.yml: ubuntu-24-sycl -> Download & Install oneAPI.
It is recommended to use them with Intel Docker.
The packages for FP32 and FP16 would have different accuracy and performance on LLMs. Please choose it acording to the test result.
## News
- 2026.04
- Optimize mul_mat by reorder feature for data type: Q4_K, Q5_K, Q_K, Q8_0.
- Fused MoE.
- Upgrate CI and built package for oneAPI 2025.3.3, support Ubuntu 24.04 built package.
- 2026.03
- Support Flash-Attention: less memory usage, performance impact depends on LLM.
@@ -229,6 +244,7 @@ Upon a successful installation, SYCL is enabled for the available intel devices,
|Verified release|
|-|
|2025.3.3 |
|2025.2.1|
|2025.1|
|2024.1|
@@ -339,6 +355,12 @@ Choose one of following methods to run.
./examples/sycl/test.sh
```
- Run llama-server:
```sh
./examples/sycl/start-svr.sh -m PATH/MODEL_FILE
```
2. Command line
Launch inference
@@ -627,10 +649,18 @@ Choose one of following methods to run.
@@ -689,6 +719,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. (1.) |
| GGML_SYCL_GRAPH | OFF *(default)* \|ON *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
| GGML_SYCL_HOST_MEM_FALLBACK | ON *(default)* \|OFF *(Optional)* | Allow host memory fallback when device memory is full during quantized weight reorder. Enables inference to continue at reduced speed (reading over PCIe) instead of failing. Requires Linux kernel 6.8+. |
| CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. |
| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |
@@ -706,6 +737,14 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Support malloc device memory more than 4GB.|
## Compile-time Flags
Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spot.
@@ -57,13 +57,14 @@ ZenDNN is optimized for AMD EPYC™ processors and AMD Ryzen™ processors based
## Supported Operations
The ZenDNN backend currently accelerates **matrix multiplication (MUL_MAT)** operations only. Other operations are handled by the standard CPU backend.
The ZenDNN backend accelerates **matrix multiplication (MUL_MAT)**and **expert-based matrix multiplication (MUL_MAT_ID)**operations. Other operations are handled by the standard CPU backend.
| MUL_MAT | Support | Accelerated via ZenDNN LowOHA MatMul |
| MUL_MAT_ID | Support | Accelerated via ZenDNN LowOHA MatMul (MoE) |
*Note:* Since only MUL_MAT is accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs).
*Note:* Since MUL_MAT and MUL_MAT_ID are accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs and Mixture-of-Experts models).
## DataType Supports
@@ -181,7 +182,7 @@ For detailed profiling and logging options, refer to the [ZenDNN Logging Documen
## Known Issues
- **Limited operation support**: Currently only matrix multiplication (MUL_MAT) is accelerated via ZenDNN. Other operations fall back to the standard CPU backend.
- **Limited operation support**: Currently matrix multiplication (MUL_MAT) and expert-based matrix multiplication (MUL_MAT_ID) are accelerated via ZenDNN. Other operations fall back to the standard CPU backend. Future updates may expand supported operations.
- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
@@ -216,4 +217,4 @@ Please add the **[ZenDNN]** prefix/tag in issues/PRs titles to help the ZenDNN-t
## TODO
- Expand operation support beyond MUL_MAT (attention operations, activations, etc.)
- Expand operation support beyond MUL_MAT and MUL_MAT_ID (attention operations, activations, etc.)
Controls whether the Hexagon backend allocates host buffers. By default, all buffers except for REPACK are host buffers.
This option is required for testing Ops that require REPACK buffers (MUL_MAT and MUL_MAT_ID).
-`GGML_HEXAGON_EXPERIMENTAL=1`
Controls whether the Hexagon backend enables experimental features.
This option is required for enabling/testing experimental Ops (FLASH_ATTN_EXT).
-`GGML_HEXAGON_VERBOSE=1`
Enables verbose logging of Ops from the backend. Example output:
@@ -253,17 +249,32 @@ build: 6a8cf8914 (6733)
```
- `GGML_HEXAGON_PROFILE=1`
Generates a host-side profile for the ggml-hexagon Ops.
Enables Op profiling:
- `GGML_HEXAGON_OPMASK=0x0`
Allows enabling specific stages of the processing pipeline:
- `1` Basic profile with per-op `usecs` and `cycles` counters
- `2` Extended profile with per-op `usecs`, `cycles` and default PMU counter data
- `0x1,...,0x8` Extended profile with per-op `usecs`, `cycles` and custom PMU counter data
The logging output can be either saved into a file for post-processing or it can be piped directly into the post-processing tool to generate the report.
The easiest way to build llama.cpp for a Snapdragon-based Linux device is using the toolchain Docker image (see [github.com/snapdragon-toolchain](https://github.com/snapdragon-toolchain)).
This image includes OpenCL SDK, Hexagon SDK, CMake, and the ARM64 Linux cross-compilation toolchain.
Cross-compilation is supported on **Linux X86** hosts. The resulting binaries are deployed to and run on the target **Qualcomm Snapdragon ARM64 Linux** device.
[d]/workspace> zip -r pkg-snapdragon.zip pkg-snapdragon
```
## How to Install
For this step, you will deploy the built binaries and libraries to the target Linux device. Transfer `pkg-snapdragon.zip` to the target device, then unzip it and set up the environment variables:
```
$ unzip pkg-snapdragon.zip
$ cd pkg-snapdragon
$ export LD_LIBRARY_PATH=./lib
$ export ADSP_LIBRARY_PATH=./lib
```
At this point, you should also download some models onto the device:
@@ -281,6 +281,12 @@ Use `GGML_CUDA_FORCE_CUBLAS_COMPUTE_16F` environment variable to force use FP16
The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as `System Memory Fallback`.
### Peer Access
The environment variable `GGML_CUDA_P2P` can be set to enable peer-to-peer access between multiple GPUs, allowing them to transfer data directly rather than to go through system memory.
Requires driver support (usually restricted to workstation/datacenter GPUs).
May cause crashes or corrupted outputs for some motherboards and BIOS settings (e.g. IOMMU).
### Performance Tuning
The following compilation options are also available to tweak performance:
@@ -389,7 +395,7 @@ You can download it from your Linux distro's package manager or from here: [ROCm
The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3. Note that [`HSA_OVERRIDE_GFX_VERSION`] is [not supported on Windows](https://github.com/ROCm/ROCm/issues/2654)
### Unified Memory
@@ -456,7 +462,8 @@ pacman -S git \
mingw-w64-ucrt-x86_64-gcc \
mingw-w64-ucrt-x86_64-cmake \
mingw-w64-ucrt-x86_64-vulkan-devel \
mingw-w64-ucrt-x86_64-shaderc
mingw-w64-ucrt-x86_64-shaderc \
mingw-w64-ucrt-x86_64-spirv-headers
```
Switch into the `llama.cpp` directory and build using CMake.
@@ -490,9 +497,11 @@ First, follow the official LunarG instructions for the installation and setup of
On Debian / Ubuntu, you can install the required dependencies using:
SPIRV-Headers (`spirv/unified1/spirv.hpp`) are required for the Vulkan backend and are **not** always pulled in by the Vulkan loader dev package alone. Other distros use names such as `spirv-headers` (Ubuntu / Debian / Arch), or `spirv-headers-devel` (Fedora / openSUSE). On Windows, the LunarG Vulkan SDK’s `Include` directory already contains these headers.
#### Common steps
Second, after verifying that you have followed all of the SDK installation/setup steps, use this command to make sure before proceeding:
@@ -728,7 +737,7 @@ To read documentation for how to build on Android, [click here](./android.md)
## WebGPU [In Progress]
The WebGPU backend relies on [Dawn](https://dawn.googlesource.com/dawn). Follow the instructions [here](https://dawn.googlesource.com/dawn/+/refs/heads/main/docs/quickstart-cmake.md) to install Dawn locally so that llama.cpp can find it using CMake. The current implementation is up-to-date with Dawn commit `bed1a61`.
The WebGPU backend relies on [Dawn](https://dawn.googlesource.com/dawn). Follow the instructions [here](https://dawn.googlesource.com/dawn/+/refs/heads/main/docs/quickstart-cmake.md) to install Dawn locally so that llama.cpp can find it using CMake. The current implementation is up-to-date with Dawn commit `18eb229`.
WebGPU allows cross-platform access to the GPU from supported browsers. We utilize [Emscripten](https://emscripten.org/) to compile ggml's WebGPU backend to WebAssembly. Emscripten does not officially support WebGPU bindings yet, but Dawn currently maintains its own WebGPU bindings called emdawnwebgpu.
Follow the instructions [here](https://dawn.googlesource.com/dawn/+/refs/heads/main/src/emdawnwebgpu/) to download or build the emdawnwebgpu package (Note that it might be safer to build the emdawbwebgpu package locally, so that it stays in sync with the version of Dawn you have installed above). When building using CMake, the path to the emdawnwebgpu port file needs to be set with the flag `EMDAWNWEBGPU_DIR`.
Follow the instructions [here](https://dawn.googlesource.com/dawn/+/refs/heads/main/src/emdawnwebgpu/) to download or build the emdawnwebgpu package (Note that it might be safer to build the emdawnwebgpu package locally, so that it stays in sync with the version of Dawn you have installed above). When building using CMake, the path to the emdawnwebgpu port file needs to be set with the flag `EMDAWNWEBGPU_DIR`.
If the new model supports multimodal inputs, you will need to add a new encoder definition in `libmtmd`. You can find more information about llama.cpp's multimodal support in [the docs](../multimodal.md) and in the `tools/mtmd` source directory.
1. In the conversion script, make sure you add a subclass that extends `MmprojModel` or another class that inherits from the same base class.
2. Add the encoder definition in `clip.cpp`.
3. Implement the preprocessor in `mtmd.cpp`. In most cases, you can reuse an existing preprocessor.
4. Implement the encoder GGML graph, either in a dedicated file if the model is truly different from existing ones, or by reusing an existing implementation (for example: siglip, pixtral, or qwen) and adding a model-specific projector.
Note:
- Many multimodal encoders are based on models that are already supported. Make sure to read the existing encoder definitions in `tools/mtmd/models` before adding a new one. In `libmtmd`, it is generally better to extend an existing model than to duplicate code.
- To debug the multimodal preprocessor and encoder, you can use [llama-mtmd-debug](tools/mtmd/debug/mtmd-debug.cpp).
- Adding a model-specific API or CLI is an anti-pattern in `libmtmd`. The goal of `libmtmd` is to provide an easy-to-use, model-agnostic library for multimodal pipeline.
- In most cases, `llama-mtmd-cli` should not be modified. If a model requires a specific prompt, either let the user provide it or bake it into the Jinja chat template.
## Tips and tricks
### Working with ggml_rope_ext
PyTorch implementations usually prefer explicitly calculating `freq_cis`/`sin`/`cos` components. However, in llama.cpp, most RoPE operations can be handled via `ggml_rope_ext`, which does not require a sin/cos matrix. This saves memory while allowing the GGML RoPE kernel to be fused with other ops.
However, since `ggml_rope_ext` only provides a subset of the RoPE implementations that models use, converting models from PyTorch to llama.cpp may require some creative adaptations.
For more information about `ggml_rope_ext`, please refer to the in-code documentation in `ggml.h`.
Examples:
-`libmtmd` implements 2D RoPE with `GGML_ROPE_TYPE_NORMAL` ordering by splitting the input tensor in half, applying `ggml_rope_ext` separately to each half, then joining them back together using `ggml_concat`.
- The [Kimi-K2.5](https://github.com/ggml-org/llama.cpp/pull/19170) vision encoder uses vision RoPE with interleaved frequencies. The weights must be permuted during conversion in order to reuse the `build_rope_2d()` function.
- [Gemma 4](https://github.com/ggml-org/llama.cpp/pull/21309) uses "proportional" RoPE. We employ a trick where `rope_freqs` is set to a very large value in the last dimensions to prevent those dimensions from being rotated. See the `Gemma4Model` class in `convert_hf_to_gguf.py`.
- Some models require scaling the input position. For example, `[0, 1, 2, ...]` becomes `[0, 0.5, 1, ...]`. In this case, you can provide the scaling via `freq_scale = 0.5f`.
- Some models use learned RoPE frequencies instead of relying on `powf(freq_base, -2.0 * i / n_dims)`. In this case, you can provide the learned frequencies via the `rope_freqs` tensor (corresponding to the `c` argument in `ggml_rope_ext`), then set `freq_base = 1.0f`. An important note is that `rope_freqs` in GGML is the **inverse** (`theta = pos[i] / rope_freqs`), so you may need to invert `rope_freqs` during conversion.
@@ -13,24 +13,30 @@ We have three Docker images available for this project:
Additionally, there the following images, similar to the above:
-`ghcr.io/ggml-org/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA 12 support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:full-cuda13`: Same as `full` but compiled with CUDA 13 support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA 12 support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:light-cuda13`: Same as `light` but compiled with CUDA 13 support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA 12 support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:server-cuda13`: Same as `server` but compiled with CUDA 13 support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:full-musa`: Same as `full` but compiled with MUSA support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:light-musa`: Same as `light` but compiled with MUSA support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:server-musa`: Same as `server` but compiled with MUSA support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:full-intel`: Same as `full` but compiled with SYCL support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:light-intel`: Same as `light` but compiled with SYCL support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:server-intel`: Same as `server` but compiled with SYCL support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:full-vulkan`: Same as `full` but compiled with Vulkan support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:light-vulkan`: Same as `light` but compiled with Vulkan support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:server-vulkan`: Same as `server` but compiled with Vulkan support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:full-vulkan`: Same as `full` but compiled with Vulkan support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:light-vulkan`: Same as `light` but compiled with Vulkan support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:server-vulkan`: Same as `server` but compiled with Vulkan support. (platforms: `linux/amd64`, `linux/arm64`)
-`ghcr.io/ggml-org/llama.cpp:full-openvino`: Same as `full` but compiled with OpenVino support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:light-openvino`: Same as `light` but compiled with OpenVino support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:server-openvino`: Same as `server` but compiled with OpenVino support. (platforms: `linux/amd64`)
-`ghcr.io/ggml-org/llama.cpp:full-s390x`: Identical to `full`, an alias for the `s390x` platform. (platforms: `linux/s390x`)
-`ghcr.io/ggml-org/llama.cpp:light-s390x`: Identical to `light`, an alias for the `s390x` platform. (platforms: `linux/s390x`)
-`ghcr.io/ggml-org/llama.cpp:server-s390x`: Identical to `server`, an alias for the `s390x` platform. (platforms: `linux/s390x`)
The GPU enabled images are not currently tested by CI beyond being built. They are not built with any variation from the ones in the Dockerfiles defined in [.devops/](../.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](../.github/workflows/docker.yml). If you need different settings (for example, a different CUDA, ROCm or MUSA library, you'll need to build the images locally for now).
@@ -82,7 +88,7 @@ You may want to pass in some different `ARGS`, depending on the CUDA environment
The defaults are:
-`CUDA_VERSION` set to `12.4.0`
-`CUDA_VERSION` set to `12.8.1`
-`CUDA_DOCKER_ARCH` set to the cmake build default, which includes all the supported architectures
The resulting images, are essentially the same as the non-CUDA images:
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.