Compare commits

...

139 Commits
b8962 ... b9101

Author SHA1 Message Date
Georgi Gerganov
389ff61d77 server : print warning when HTTP timeout exceeded (#22907) 2026-05-10 22:00:18 +03:00
Tim Neumann
2e97c5f96f backend sampling: support returning post-sampling probs (#22622)
* server: Never return 0.0 post-sampling probabilities

* backend sampling: support returning post-sampling probs
2026-05-10 19:12:02 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
5d5d2e15d2 vendor : update cpp-httplib to 0.43.4 (#22888) 2026-05-10 18:46:54 +02:00
Oliver Walsh
2b2babd124 ggml-virtgpu : include missing mutex header (#22810)
Add missing `#include <mutex>` in ggml-backend-device.cpp.

Fixes: #22809

Signed-off-by: Oliver Walsh <owalsh@redhat.com>
2026-05-10 17:32:41 +02:00
Georgi Gerganov
0b047287fe sync : ggml 2026-05-10 17:00:11 +03:00
Georgi Gerganov
efbada936f ggml : bump version to 0.11.1 (ggml/1484) 2026-05-10 17:00:11 +03:00
scutler-nv
f3c3e0e9a0 internal AllReduce kernel for CUDA provider (#22299)
* ggml-cuda: add internal AllReduce provider for tensor parallelism

Introduces a NCCL-free AllReduce implementation for LLAMA_SPLIT_MODE_TENSOR
using a single-phase CUDA kernel that pipelines D2H copy, cross-GPU
handshake via pinned-memory volatile flags, and the reduction in one
kernel launch per GPU.

New files:
- ggml/src/ggml-cuda/comm.cuh        — ggml_cuda_allreduce_provider enum
- ggml/src/ggml-cuda/allreduce.cuh   — pipeline API declarations
- ggml/src/ggml-cuda/allreduce.cu    — kernel + pipeline init/dispatch

ggml-cuda.cu changes:
- ggml_backend_cuda_comm_context gains ar_pipeline field
- Provider selection via GGML_CUDA_ALLREDUCE env var ("nccl" / "internal")
- INTERNAL provider initialises the pipeline at comm_init time
- Dispatch routes to ggml_cuda_ar_allreduce(); falls back to meta-backend
  CPU reduce for unsupported sizes or GPU counts (> 2)

Current scope: 2 GPUs, FP32, tensors <= 256 KB. Notes in NOTES-allreduce.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
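
A rough sketch of the handshake described above, under simplifying assumptions (one thread block, two GPUs, FP32; names such as `host_self`, `self_flag` and `allreduce_sketch` are illustrative, not the ones in allreduce.cu, which adds chunking, wire formats and double buffering):

```cpp
// Minimal single-block sketch of the pipelined AllReduce handshake:
// stage data to pinned host memory, publish an arrival token, spin on the
// peer's token, then reduce using the peer's staged data.
__global__ void allreduce_sketch(const float * src, float * dst,
                                 float * host_self, const float * host_peer,
                                 volatile unsigned int * self_flag,
                                 volatile unsigned int * peer_flag,
                                 unsigned int token, int n) {
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        host_self[i] = src[i];                        // D2H staging copy
    }
    __syncthreads();                                  // all stores issued before signaling
    if (threadIdx.x == 0) {
        __threadfence_system();                       // staged data visible to the peer first
        *self_flag = token;                           // publish arrival
        while (*peer_flag < token) { /* spin */ }     // cross-GPU handshake
    }
    __syncthreads();
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        dst[i] = src[i] + host_peer[i];               // reduce with the peer's data
    }
}
```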

* llama-bench: add --allreduce flag to select AllReduce provider

Adds --allreduce <auto|nccl|internal> to llama-bench (and via the shared
field pattern, consistent with other multi-value flags).  Useful for
isolating hangs or regressions in tensor-parallel mode: pass --allreduce nccl
to force NCCL and bypass the internal provider.

Also fixes ggml_cuda_select_allreduce_provider() to treat an empty
GGML_CUDA_ALLREDUCE env var the same as unset (avoids spurious warning when
llama-bench sets it to "" for the "auto" case).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
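
A minimal sketch of the empty-equals-unset handling described above (the enum and function names are illustrative, not the exact ones in comm.cuh):

```cpp
#include <cstdlib>
#include <cstring>

enum allreduce_provider { PROVIDER_AUTO, PROVIDER_NCCL, PROVIDER_INTERNAL };

// Both an unset and an empty GGML_CUDA_ALLREDUCE mean "auto",
// so setting it to "" never triggers an unknown-value warning.
static allreduce_provider select_allreduce_provider() {
    const char * env = std::getenv("GGML_CUDA_ALLREDUCE");
    if (env == nullptr || env[0] == '\0') {
        return PROVIDER_AUTO;
    }
    if (std::strcmp(env, "nccl") == 0) {
        return PROVIDER_NCCL;
    }
    if (std::strcmp(env, "internal") == 0) {
        return PROVIDER_INTERNAL;
    }
    return PROVIDER_AUTO;   // unknown value: the real code warns here
}
```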

* llama-bench: rename --allreduce to --reduction-provider / -rp

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* llama-bench: pass WARN/ERROR log messages through in non-verbose mode

The null log callback was silently dropping all messages. WARN and ERROR
should always be visible since they indicate legitimate issues (e.g. a
requested reduction provider not being available).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
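
A sketch of that filtering using ggml's public log-callback signature; only the filtering logic is specific to this change:

```cpp
#include <cstdio>
#include "ggml.h"

// non-verbose mode: drop INFO/DEBUG, but always pass WARN and ERROR through
static void bench_log_callback(ggml_log_level level, const char * text, void * /*user_data*/) {
    if (level == GGML_LOG_LEVEL_WARN || level == GGML_LOG_LEVEL_ERROR) {
        fputs(text, stderr);
    }
}

// registered once at startup, e.g. llama_log_set(bench_log_callback, nullptr);
```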

* cmake: improve NCCL detection for source-tree builds, add static/dynamic switch

FindNCCL.cmake now searches the cmake source-build layout used by the Windows
NCCL port (cmake/lib/Release for static, cmake/src/Release for dynamic import
lib) and also checks src/include for the generated nccl.h header.

New option GGML_CUDA_NCCL_STATIC (default OFF) selects static vs dynamic
linking and controls which paths and library names are searched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ggml-cuda: add AllReduce hang watchdog (GGML_CUDA_AR_WATCHDOG)

When compiled with -DGGML_CUDA_AR_WATCHDOG=ON, uses a debug kernel
variant that writes per-GPU spin diagnostics to pinned host memory.
A host-side blocking poll (cudaEventQuery + volatile reads) detects
hangs and logs WARN with the last observed arrival counters and spin
counts, controlled by GGML_CUDA_AR_WATCHDOG (ms timeout) and
GGML_CUDA_AR_MAX_SPIN (kernel bailout) env vars at runtime.

Zero overhead on the production path — all debug code is behind #ifdef.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ggml-cuda: fix intermittent AllReduce hang on Blackwell PCIe

Add __threadfence_system() before the arrival signal write in
signal_set to ensure D2H data is globally visible before the peer
observes the arrival flag.  Without this fence, the peer could enter
Phase 3 host reads before the data had fully landed, causing an
intermittent deadlock on RTX 5090 (Blackwell, PCIe-only).

Also redesign the watchdog from a blocking dispatch-thread poll to a
non-blocking background thread, eliminating the ~20ms per-slot
latency the old design added.

Verified: 30/30 soak test runs clean at ~50 t/s (previously ~1-in-15
hang rate).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ggml-cuda: fix watchdog shutdown ordering and pipeline_free drain

- Stop watchdog thread BEFORE destroying GPU resources (events, streams)
  to prevent polling destroyed handles → spurious "busy" readings
- Add cudaStreamSynchronize in pipeline_free to drain in-flight kernels
  before freeing pinned host buffers they may still be reading
- Sleep-first watchdog polling: no +0ms noise, only logs when a kernel
  is genuinely stuck past the poll interval
- Check wdog_stop in both outer and inner loops so join() returns
  promptly instead of draining the entire queue
- Add Phase 3 breadcrumbs to debug[3] for hang localization

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ggml-cuda: replace event-based watchdog with per-GPU ring buffer

Completely rework the GGML_CUDA_AR_WATCHDOG system:

- Replace the shared debug_buf + event-polling + queue design with
  per-GPU ring buffers in pinned host memory
- Kernel writes a debug record only on spin-limit bailout: claims a
  ring slot via atomicAdd (single-GPU host atomics work on RTX 5090),
  writes fields, fences, sets completion flag, then all threads exit
- Watchdog thread simply polls ring head counters every 1ms and prints
  any new complete records — no CUDA event queries, no mutex, no queue
- Zero overhead on the dispatch path (no queue posting, no memset)
- Watchdog shutdown returns within ~1ms (atomic bool, no drain)
- On bailout the kernel skips Phase 3 entirely and exits cleanly

Verified: 20/20 prefill soak test clean at ~1112 t/s, no hangs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
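
A condensed sketch of the ring design described above (struct layout, field names and ring size are illustrative): the device side claims a slot and sets the completion flag last, the host side only sleeps and reads.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// illustrative per-GPU debug ring, allocated in pinned host memory
struct ar_debug_rec  { unsigned int arrival, spins, complete; };
struct ar_debug_ring { unsigned int head; ar_debug_rec rec[64]; };

// device side: on spin-limit bailout, claim a slot, fill it, fence, mark complete
__device__ void ring_push(ar_debug_ring * ring, unsigned int arrival, unsigned int spins) {
    const unsigned int slot = atomicAdd(&ring->head, 1u) % 64u;
    ring->rec[slot].arrival = arrival;
    ring->rec[slot].spins   = spins;
    __threadfence_system();               // fields visible before the completion flag
    ring->rec[slot].complete = 1u;
}

// host side: poll every ~1 ms -- no CUDA event queries, no mutex, no queue
void watchdog_loop(volatile ar_debug_ring * ring, std::atomic<bool> & stop) {
    unsigned int seen = 0;
    while (!stop.load(std::memory_order_relaxed)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        while (seen < ring->head && ring->rec[seen % 64u].complete) {
            fprintf(stderr, "AR watchdog: bailout, arrival=%u spins=%u\n",
                    ring->rec[seen % 64u].arrival, ring->rec[seen % 64u].spins);
            seen++;
        }
    }
}
```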

* fix: normalize line endings to LF (undo Windows CRLF conversion)

Five files were inadvertently converted to CRLF by the Windows
development environment, causing every line to show as changed in
diffs against master.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* .gitattributes: force LF line endings to prevent Windows CRLF conversion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ggml-cuda: move GGML_CUDA_AR_WATCHDOG from CMake option to local define

The watchdog is development-only; a global CMake option is overkill.
Move the toggle to a #define at the top of allreduce.cu (set to 0 by
default) and remove the option from ggml/CMakeLists.txt and the CUDA
CMakeLists.txt add_compile_definitions block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* unify kernel debug paths

* use __threadfence_system explicitly (not in ggml_cuda_ar_signal_set)

* preferentially use internal reduction for <=2 GPUs

* templatize the main kernel to support fp16/bf16

* restore llama-bench.cpp changes

* revert CMakeLists changes

* remove notes from repo

* remove dead warmup code

* fix comments

* improve reduction provider fallback code

* add messages for allreduce fallback

* rework reduction provider init to not call ncclCommInitAll if using the internal provider

* fix case where a given tensor has not been computed

* add chunked mode to the kernel for unlimited vector size

* rework a few checks/fallbacks

* various small cleanups

* allow disabling CUDA reductions completely (falling back to the non-CUDA butterfly mode)

* simplify reduction provider selection

* minor simplifications

* more cleanups/fixes

* prototype alternate path for large reductions

* chunked version of large reduction path

* use bf16 for large reductions

* experimental reduction using cudaMemcpyPeerAsync (slightly slower)

* revert experimental change

* add combined conversion/reduction kernel

* add bf16 wire format for single kernel mode

* experimental on-stream small reduction kernel

* double buffer arrival slots, use token (incrementing) method

* double buffer host_buf for small reductions

* put in waits for use of host_mem in large reduction case (prevents stomping on in-use memory)

* remove watchdog code

* various cleanups / dead code removal

* fix fp16 mode

* fix some comments/logging statements

* use increasing token scheme for arrival signals

* add top-level comment to allreduce.cu

* improve top-level comment in allreduce.cu

* fix comments in ggml_cuda_ar_kernel

* improve event handling for hostmem buffer usage tracking

* change ev_pool to fixed 2D array

* add chunked memcpy fallback for extra-large reductions (>32 MB)

* change thresholds for copy-engine path and bf16 demotion

* multi-block kernel test

* more fine-tuning for chunk size, etc.

* various fixes for PR review

* more PR fixes

* fix semantics of all host mappings

* require ampere+

* small cleanups

* properly use host pointer for src/dst in cudaMemcpy calls

* allreduce: lazy-init the internal pipeline on first use

A config that lives entirely on NCCL never needs the chunked-kernel
pipeline (host_buf, host_large, dev_tmp, streams, events, arrival ring).
Defer pipeline creation to the first try_allreduce_internal call using the
same std::call_once pattern as ensure_nccl, so those resources stay
unallocated when only NCCL is in use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
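
A sketch of the std::call_once pattern described above (names simplified; a null pipeline means "use the generic fallback"):

```cpp
#include <cstddef>
#include <mutex>

struct ar_pipeline { /* host_buf, host_large, dev_tmp, streams, events, arrival ring */ };

// stand-in for the real init, which allocates the pipeline resources and
// returns nullptr when the configuration is unsupported
static ar_pipeline * ar_pipeline_init(size_t n_devices) {
    return n_devices == 2 ? new ar_pipeline() : nullptr;
}

struct cuda_comm_context {
    std::once_flag ar_once;
    ar_pipeline *  ar = nullptr;    // stays unallocated while only NCCL is used
};

// called from the internal allreduce path; the first caller pays the init cost
static ar_pipeline * ensure_ar_pipeline(cuda_comm_context & ctx, size_t n_devices) {
    std::call_once(ctx.ar_once, [&] {
        ctx.ar = ar_pipeline_init(n_devices);
    });
    return ctx.ar;
}
```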

* allreduce: assert n_backends == 2 instead of soft-fallback

ar_pipeline_init already requires n_devices == 2 and bails before any AR can
get here, so by the time we reach try_allreduce_internal we know we have
exactly two backends.  Replace the runtime-debug-log fallback with a hard
assert.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* rework reduction provider selection. internal/nccl is OS dependent; most fallbacks are removed

* remove unneeded Turing arch check (llama.cpp doesn't even compile pre-Turing anyway)

* allreduce: ASCII-only comments and ggml_cuda_cast for value conversions

Replace non-ASCII characters in comments (em dashes, right arrows) with
ASCII equivalents (--, ->) so the source stays in the ggml/upstream norm.

In the kernel-side code, replace static_cast<Twire>/static_cast<Tdst>
with ggml_cuda_cast<...> so the BF16 conversions go through the fast
__float2bfloat16 / __bfloat162float intrinsics from convert.cuh.  Pure
pointer and integer casts stay as static_cast.

Also drops two stray garbage tokens that snuck in from earlier merges
(a duplicated 'return ok; }' tail in allreduce.cu and a leftover '_reg)'
fragment in ggml-cuda.cu).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
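
For context, the intrinsics such a cast helper reaches on the BF16 wire path come from cuda_bf16.h; a hedged illustration, not the ggml_cuda_cast definition itself:

```cpp
#include <cuda_bf16.h>

// the BF16 wire conversions, done through the dedicated intrinsics
__device__ __forceinline__ nv_bfloat16 f32_to_wire(float x) {
    return __float2bfloat16(x);
}

__device__ __forceinline__ float wire_to_f32(nv_bfloat16 x) {
    return __bfloat162float(x);
}
```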

* allreduce: use ggml_cuda_memcpy_1 for the chunked-kernel vector copies

The chunked kernel's two 16-byte register<->host transfers (Phase 1 store
and Phase 3 load) used reinterpret_cast<float4 *> on both sides.  Replace
with ggml_cuda_memcpy_1<sizeof(wire)>, which is the canonical helper for
this pattern and emits the same int4 LD/ST under the hood.

Conformance passes; 5x reruns of 70b internal pp512 show 1832-1836 t/s,
matching the prior matrix value of 1831 t/s -- no perf change as expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
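
An illustration of the 16-byte vectorized copy such a helper boils down to (this is not the actual ggml_cuda_memcpy_1 definition):

```cpp
// copy `nbytes` (a multiple of 16) with int4 loads/stores, the access width
// the chunked kernel's register <-> host transfers rely on
template <int nbytes>
__device__ __forceinline__ void memcpy_vec(void * dst, const void * src) {
    static_assert(nbytes % sizeof(int4) == 0, "nbytes must be a multiple of 16");
    const int4 * s = (const int4 *) src;
    int4       * d = (int4 *) dst;
#pragma unroll
    for (int i = 0; i < nbytes / (int) sizeof(int4); ++i) {
        d[i] = s[i];
    }
}
```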

* allreduce: assert cuda_ctx->device matches the pipeline's device

Both ggml_cuda_ar_pipeline and ggml_backend_cuda_context carry the device
they were created for; if they ever disagree, every cuda call that follows
runs on the wrong device.  Add GGML_ASSERT at each cuda_ctx retrieval site
in the AR path so the misuse fails fast rather than silently corrupting.

Also: rename __nv_bfloat16 -> nv_bfloat16 (typedef alias) for consistency
with the rest of the file, and tighten one cudaGetLastError check to fire
only after the to_bf16 call that can actually fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* allreduce: expand one-liner for loops to braced bodies

Code-style preference -- match the rest of the file by writing every for
loop with the body on its own braced line.  Three sites in the copy-engine
typed dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* allreduce: rename template parameters Tdst/Twire/Tsrc -> T_dst/T_wire/T_src

Code-style preference per PR review -- T_dst/T_wire/T_src is more
consistent with surrounding code.  Whole-word rename across all 58 sites
in allreduce.cu (kernel definitions, internal uses, and comment text).

Realigned the parameter columns in three function signatures whose
T_src/T_dst lines shifted by 1 char relative to their non-templated
neighbors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* allreduce: drop hyphen in 'chunked-kernel' across comments

Per PR review feedback -- 'chunked kernel' (no hyphen) reads more naturally
in running prose, especially for ESL readers.  Pure comment-only change;
all 10 occurrences in allreduce.cu updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* allreduce: use ggml_cuda_get_max_cpy_bytes() instead of hardcoded 16

The chunked kernel hardcoded a 16-byte vector unit; replace with the
ggml_cuda_get_max_cpy_bytes() helper that fattn-common.cuh uses for the
same purpose, so ELEMS_PER_VEC self-adjusts to the arch's widest
single-instruction copy.

Perf-neutral on supported targets (Volta+ returns 16).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ggml-cuda: PR review fixes -- annotate #endif, fix stale comment, assert nbytes alignment

Three separate but minor changes from PR #22299 review feedback:

1. Annotate the five GGML_USE_NCCL #endif lines with the matching condition
   so the pairing is visible without scrolling back.

2. The comment block on ggml_backend_cuda_comm_context claimed NCCL is
   lazy-initialised; that was true at one point but the dispatch refactor
   (727b141c0) made both NCCL and the internal pipeline eager.  Rewrite
   the comment to match current behaviour.

3. Assert in ggml_backend_cuda_comm_allreduce_internal that the tensor's
   byte size is a 16-byte multiple.  The chunked-kernel issues full-width
   vector loads/stores, so this is a precondition; tensor-parallel splits
   of hidden-dim-multiples satisfy it trivially, but a hard assert turns
   any caller-side bug into a clear failure rather than UB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
 device's new AR
records its ev.ker -- otherwise the second device's wait sees the first
device's just-recorded event (the in-flight new AR) and creates a circular
dependency with the in-kernel peer signal.  Two-pass dispatch (all waits,
then all launches) avoids this.

Bump POOL_SIZE 2 -> 8 (small memory cost, more breathing room for the
GPU's view of the event chain) and add a runtime env override for the
hybrid kernel chunk size (GGML_CUDA_AR_HYBRID_CHUNK_BYTES) for tuning.
One-shot stderr diagnostic at first AR prints the chosen path + sizing.

Result on 2x RTX 5090 Linux, 70b ub_sweep:

    ub=64   (1 MB AR): 913 -> 1036 t/s  (+13.5% vs old, +1.8% vs NCCL)
    ub=128  (2 MB AR): 1056 -> 1181     (+11.9%, +3.7% vs NCCL)
    ub=256  (4 MB AR): 1212 -> 1424     (+17.5%, +3.5% vs NCCL)

Internal now beats NCCL at every size (+1.8% to +15.6%), recovering all
ground in the 1-4 MB regime that was previously a 10-12% loss.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* simplify the init logic

* address some other PR requests

* ggml-cuda: stub internal AllReduce on HIP/MUSA, drop pre-Ampere mention, gate NCCL fallback warning on !HIP

The internal AllReduce relies on cudaHostAllocPortable/Mapped,
cudaHostGetDevicePointer, and __nanosleep -- none of which the HIP or
MUSA shims expose -- so wrap the implementation in
!defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) and provide
nullptr/no-op/false stubs in the #else branch.  The dispatcher already
treats a null pipeline as init failure and silently falls back to the
meta backend's generic AllReduce, so HIP/MUSA builds compile clean and
behave correctly without further call-site changes.

PR review follow-ups:
 - drop "or pre-Ampere?" from the internal-init failure warning -- the
   kernel doesn't require Ampere or newer.
 - guard the "NCCL not compiled in" fallback warning behind
   !defined(GGML_USE_HIP); the suggestion to install NCCL only makes
   sense on NVIDIA builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hind, now +6-8% ahead at ub=1024-4096.
Perplexity (32 chunks) matches NCCL bit-for-bit (3.4044 vs 3.4043).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
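
A sketch of the compile guard described in the HIP/MUSA stub commit above (the pipeline type and entry points are simplified placeholders):

```cpp
#include <cstddef>

struct ar_pipeline;   // pinned buffers, streams, events, arrival ring, ...

#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)

// plain CUDA: the full pipeline init and kernel dispatch are compiled here
ar_pipeline * ar_pipeline_init(size_t n_devices);
bool          ar_allreduce(ar_pipeline * p, float * data, size_t n);

#else // GGML_USE_HIP || GGML_USE_MUSA

// HIP/MUSA: no cudaHostAllocPortable/Mapped, cudaHostGetDevicePointer or
// __nanosleep shims, so provide stubs; a null pipeline is treated as init
// failure and the dispatcher falls back to the generic AllReduce
static ar_pipeline * ar_pipeline_init(size_t /*n_devices*/)       { return nullptr; }
static bool          ar_allreduce(ar_pipeline *, float *, size_t) { return false;   }

#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
```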

* allreduce: guard __nanosleep on Volta+ and reject pre-Volta devices at init

__nanosleep is the only Volta-specific intrinsic in the kernel; wrap it
in #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA / NO_DEVICE_CODE so the file
still compiles cleanly when targeting older arches (the dispatcher's
init check below ensures the kernel is never actually launched on
pre-Volta).

Add a per-device compute-capability check in pipeline_init that returns
nullptr if any device is below sm70.  The dispatcher already treats
nullptr as init failure and silently falls back to the meta backend's
generic AllReduce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
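
The guard pattern that commit describes, sketched with ggml's existing GGML_CUDA_CC_VOLTA and NO_DEVICE_CODE macros (the surrounding spin-wait is simplified away):

```cpp
// back off between arrival-flag polls; __nanosleep exists only on sm_70+,
// and pipeline_init refuses pre-Volta devices, so the NO_DEVICE_CODE branch
// is never reached at runtime -- it only keeps older-arch compilation clean
static __device__ __forceinline__ void ar_spin_backoff() {
#if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
    __nanosleep(100);
#else
    NO_DEVICE_CODE;
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
}
```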

* allreduce: fix CI -Werror warnings (sign-compare, format, restrict alias, maybe-uninitialized)

The CUDA CI builds with -Werror -Wsign-compare -Wformat -Wrestrict
-Wmaybe-uninitialized.  Address each:

 - n_devices is size_t; change `int i; i < n_devices` to size_t in the
   three init loops, and the matching GGML_LOG_INFO format from %d to %zu.
 - ggml_cuda_ar_kernel was launched with sendbuf == recvbuf (in-place
   reduction), so the __restrict__ qualifiers on those parameters were
   technically UB.  Drop __restrict__ from sendbuf and recvbuf; an A/B
   sweep showed <0.6% perf delta (within noise) on Linux.
 - The buf/src/dst pointer arrays in ggml_cuda_ar_allreduce and the
   per-iteration arrays in ggml_cuda_ar_allreduce_copy_outer were
   declared with size GGML_CUDA_MAX_DEVICES but the loop only writes
   indices [0, n_devices); zero-initialise so the compiler sees the
   tail elements as defined.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ggml-cuda: drop unused-function warning by guarding try_allreduce_nccl behind GGML_USE_NCCL

The only call site (in init_nccl) is already inside #ifdef GGML_USE_NCCL,
so the function is unreferenced in non-NCCL builds and trips
nvcc's -Werror=unused-function check.  Move the guard from inside the
function body to around the entire definition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 11:05:22 +02:00
Sigbjørn Skjæret
5755a100cd model : fix model type check for granite/llama3 and deepseek2/glm4.7 lite (#22870) 2026-05-10 08:44:29 +02:00
Sumit Chatterjee
1e5ad35d56 model : add sarvam_moe architecture support (#20275) 2026-05-09 16:31:50 +02:00
Yuannan
65d7a8bbf0 devops : updated Nix systems (#22869) 2026-05-09 17:15:03 +03:00
Davi Henrique Linhares
00d56b11c3 docker : upgraded the default intel compute-runtime version (#22567) 2026-05-09 10:22:23 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
5757c4dcb1 cmake : update BoringSSL to 0.20260508.0 (#22839) 2026-05-09 10:26:33 +03:00
Alexey Kopytko
e20b83930c SYCL: reduce allocation overhead during flash attention (#22732)
* SYCL: reduce allocation overhead during flash attention

* tidy up whitespace

* add a note about the flag

* move ggml_sycl_fattn_* into fattn-buffers.hpp

* refactor implementation into fattn-buffers.cpp

* move new_fattn_kv_buffers back into ggml-sycl.cpp
2026-05-09 09:30:39 +03:00
Devedse
fd89556567 [SYCL] Add BF16 support to GET_ROWS operation (#21391)
Add GGML_TYPE_BF16 to the SYCL backend's GET_ROWS operation, both in
supports_op and in the kernel dispatch. This fixes a performance
regression where models using BF16 embedding tensors (e.g., Gemma4's
per_layer_token_embd.weight) fall back to CPU for the GET_ROWS op,
causing a full GPU-to-CPU tensor transfer every token.

The fix reuses the existing get_rows_sycl_float template with
sycl::ext::oneapi::bfloat16, matching the pattern already used for
sycl::half (F16) and float (F32).
2026-05-09 08:50:24 +03:00
Intel AI Get-to Market Customer Success and Solutions
60489932ec sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path (#22152)
* sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Remove duplicate definitions

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
2026-05-09 08:48:07 +03:00
Intel AI Get-to Market Customer Success and Solutions
4a4f819cb6 sycl: Battlemage AOT build via spir64_gen + MMQ subgroup annotations (#22147)
* sycl: Battlemage AOT build via spir64_gen + MMQ subgroup annotations

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Remove unneeded/unnecessary comments and annotations

The MMQ subgroup annotations added are on functions gated behind
ggml_sycl_supports_mmq(). Revisit the need for these annotations
when that function changes.

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
2026-05-09 08:42:40 +03:00
AesSedai
046e284437 Add flash attention MMA / Tiles to support MiMo-V2.5 (#22812)
* mimo-v2.5: add flash attention mma/tiles for d_kq=192 d_v=128

* mimo-v2.5: follow (256, 256) fattn templates

* mimo-v2.5: cleanup comments

* mimo-v2.5: further comment cleanup

* mimo-v2.5: address PR feedback
fix GQA handling
check for other dangling 320/576 carveouts and mirror them for 192
Add to backend ops test so new paths are covered
2026-05-09 11:28:29 +08:00
Yanzhao Wang
66001722aa hexagon: add HTP kernel for GGML_OP_GATED_DELTA_NET (#22837)
Implement the Gated Delta Net recurrence on HVX with:
- 4-row fused kernels for PP (prompt processing) path
- 8-row fused kernels for TG (token generation) path, reducing
  K/Q/gate vector reload overhead by 2x
- Separate PP/TG thread functions for I-cache isolation
- VTCM state scratchpad with DMA in/out for TG single-cycle access
- Vectorized gate exp via hvx_exp_f32
2026-05-08 17:12:04 -07:00
Intel AI Get-to Market Customer Success and Solutions
c5703e03a5 sycl: support non-contiguous input in PAD op (#22148)
Signed-off-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
2026-05-09 08:05:22 +08:00
Pranav Dhinakar
b46812de78 Feature hexagon l2 norm (#22816)
* L2_NORM Updates

* Addressed PR Comments

* ggml-hexagon: add L2_NORM HVX kernel for Hexagon backend

* hex-unary: remove supported_unary_nc since the outer loop is the same for all unary ops

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-08 13:41:40 -07:00
Aldehir Rojas
49956041ee common : do not wrap raw strings in schema parser for tagged parsers (#22827) 2026-05-08 15:33:17 -05:00
ynankani
9f5f0e689c model : support Gemma4_26B_A4B_NVFP4 (#22804)
* Gemma4_26B_A4B_NvFp4 hf checkpoint convert to gguf format fixes

Signed-off-by: ynankani <ynankani@nvidia.com>

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address review comments

Signed-off-by: ynankani <ynankani@nvidia.com>

* fix CRLF

Signed-off-by: ynankani <ynankani@nvidia.com>

* Lint error fix

Signed-off-by: ynankani <ynankani@nvidia.com>

---------

Signed-off-by: ynankani <ynankani@nvidia.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-08 20:42:09 +02:00
Aldehir Rojas
f9cd456ea5 common : revert reasoning budget +inf logit bias (#22740) 2026-05-08 17:46:43 +02:00
smugman-dot
5d6f18a638 webui: fix LLM title generation for agentic conversations (#22840) 2026-05-08 16:36:04 +02:00
Xuan-Son Nguyen
29debb3a6a server: support Vertex AI compatible API (#22545)
* server: support Vertex AI compatible API

* a bit safer

* support other AIP_* env var

* various fixes

* if AIP_MODE is unset, do nothing

* fix test case

* fix windows build
2026-05-08 15:23:04 +02:00
Xuan-Son Nguyen
9dcf835528 server: (router) expose child model info from router's /v1/models (#22683)
* server: (router) expose child model info from router's /v1/models

* update docs
2026-05-08 14:42:15 +02:00
Pascal
58e68df0f9 cuda: fuse snake activation (mul, sin, sqr, mul, add) (#22667)
* cuda: fuse snake activation (mul, sin, sqr, mul, add)

Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive 5 op decomposition emitted by audio
decoders (BigVGAN, Vocos) for snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.

Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.
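
For reference, a sketch of the fused elementwise form for the F32 case. The per-channel broadcast indexing is an assumption about the tensor layout; the real ggml_cuda_op_snake_fused is templated over F32/F16/BF16 and uses fastdiv on T_len:

```cpp
#include <cstdint>

// y = x + sin(a*x)^2 * inv_b, with a and inv_b broadcast per channel
__global__ void snake_f32_sketch(const float * x, const float * a, const float * inv_b,
                                 float * y, int64_t n, int64_t n_channels, int64_t t_len) {
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    const int64_t c = (i / t_len) % n_channels;   // assumed channel-major layout
    const float   s = sinf(a[c] * x[i]);
    y[i] = x[i] + s * s * inv_b[c];
}
```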

* cuda: address review feedback from @am17an

Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.

* cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an

* Update tests/test-backend-ops.cpp

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cuda: snake fusion check add->type matches x->type

Address review feedback from @am17an

* cuda: snake fusion check add->type matches x->type

Moved for readability (equivalent)
Address review feedback from @am17an

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-05-08 17:44:09 +08:00
Aleksander Grygier
9b2925e1e0 webui: Add Import/Export of Settings configuration + improve architecture (#22803)
* refactor: Settings keys as constant object keys

* chore: Run `npm audit fix`

* refactor: Settings Sections UI

* feat: Refactor Settings structure and implement import/export logic

* feat: Introduce ROUTES constant and RouterService

* refactor: Consolidate settings definitions into registry

* refactor: Update settings page routing structure

* chore: Migrate hardcoded URLs to use ROUTES and RouterService

* feat: Enhance model selection logic for settings and chat

* chore: Update webui static build

* refactor: Address PR review comments

* fix: Remove unneeded setting

* fix: Re-add missing settings

* fix: Add missing `/slots` proxy for webui dev mode

* chore: Dev-mode logs

* fix: Data binding

* fix: Steering for non-agentic flow
2026-05-08 11:26:04 +02:00
Johannes Gäßler
a8fd165fec CUDA: lower-case PCI bus id, standardize for ggml (#22820) 2026-05-08 10:09:38 +02:00
miyan
6d57a49a70 vulkan: fix spv shadowing (#22760) 2026-05-08 09:35:22 +02:00
Max Krasnyansky
3e941b813b ggml: update SCHED_DEBUG output to use ggml_op_desc() (#22825) 2026-05-07 22:43:04 -07:00
Shawn Gu
f3e8d149ce opencl: add q4_0 MoE GEMM for Adreno (#22731)
* Q4_0 MoE CLC pass sanity check

* release program

* opencl: fix whitespace

* opencl: remove unused cl_program

* opencl: break #if block to make it more clear

* opencl: adjust format

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-05-07 21:17:07 -07:00
Michał Piszczek
1d72d87349 convert : fix RuntimeError when stripping FP8 KV-cache scales (#22818)
* convert : fix RuntimeError when stripping FP8 KV-cache scales

In ModelBase._generate_nvfp4_tensors the final cleanup loop iterates
self.model_tensors.keys() and calls del on the same dict, which raises
RuntimeError: dictionary changed size during iteration when a ModelOpt
NVFP4 model also has FP8 KV-cache scales (e.g. mmangkad/Qwen3.6-35B-A3B-NVFP4
and any modelopt config with kv_cache_quant_algo: FP8).

Wrap the keys view in list() so the deletions happen on a snapshot.

* re-add another accidentally removed list

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-08 06:55:48 +03:00
Neo Zhang
6a2a2513dc sycl : fix script error (#22795) 2026-05-08 06:54:57 +03:00
samuraieng
44dbe8c521 model: Support sarashina2.2-vision-3b model (#22103) 2026-05-07 23:10:29 +02:00
leonardHONG
05ff59cb57 CUDA: batch out_prod inner loop with cublasSgemmStridedBatched (#22651)
* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched

* CUDA: batch out_prod inner loop with cublasSgemmStridedBatched

* CUDA: add cublasSgemmStridedBatched mapping for HIP and MUSA backends
2026-05-07 21:59:29 +02:00
smugman-dot
aaf4a4d5e0 webui: add option for LLM title generation (#22265)
* webui: add LLM title generation option

* webui: use chat_template_kwargs for title gen + fix conversation check

* webui: capture firstUserMessage before async streamChatCompletion to fix race condition

* webui: extract LLM title generation into separate method

* webui: use constants and ChatService for LLM generated titles

* webui: rebuild static output

* webui: add LLM title generation setting to new settings location

* webui: use sendMessage in generateTitle

* webui: rebuild static output

* webui: fix formatting

* webui: configurable title prompt, remove think tag regexes, fix TS error

* webui: group title constants into TITLE object, use TruncatedText for CSS truncation and fix race condition

* webui: rebuild static output
2026-05-07 21:14:03 +02:00
Georgi Gerganov
e43431b381 llama : fix device state save/load (#22805) 2026-05-07 21:43:40 +03:00
shaofeiqi
ceb7e14b96 opencl: add opfilter regex for debugging (#22782) 2026-05-07 11:00:20 -07:00
Aldehir Rojas
093be624cc common/chat : preserve media markers for typed-content templates (#22634) 2026-05-07 12:50:56 -05:00
HaoJun ZHANG
deab41ec68 tests: add long-sequence cases and fix inputs for gated_delta_net (#22794)
* tests : add long-seq + tail cases for gated_delta_net

* tests : realistic input ranges for gated_delta_net
2026-05-08 00:23:36 +08:00
Intel AI Get-to Market Customer Success and Solutions
ad09224658 sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET (#22149)
* sycl: add FILL, CUMSUM, DIAG, SOLVE_TRI, SSM_SCAN, GATED_DELTA_NET

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Fix abort during test-backend-ops

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Regenerate ops.md

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

* Add scope_dbg_print to newly added SYCL ops.

Also add scope_dbg_print to existing ssm_conv op.

Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Todd Malsbary <todd.malsbary@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>
2026-05-07 18:51:33 +03:00
Gaurav Garg
b9afc19cb4 Write a readme on Multi-GPU usage in llama.cpp (#22729)
* Write a readme on Multi-GPU usage in llama.cpp

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-07 17:48:40 +02:00
Georgi Gerganov
803627f121 llama : remove unnecessary seq_id check during state restore (#22797) 2026-05-07 16:37:26 +03:00
pl752
68380ae11b ggml-cpu: Optimized risc-v cpu q1_0 dot 2026-05-07 21:09:25 +08:00
Pascal
cc97e45a14 mtmd: fix whisper audio tail truncation by exposing padded buffer to FFT (#22770) 2026-05-07 14:01:01 +02:00
AesSedai
8e52631d55 model: Add Mimo v2.5 model support (#22493)
* add mimo-v2.5 support

* mimo-v2.5: fix modify_tensors row split

* mimo-v2.5: forgot `add_attn_value_scale` plumbing

* mimo-v2.5: fix tp dequant to detect tp rows

* mimo-v2.5: fix TP iteration to be descending

* mimo-v2.5: fix comment

* mimo-v2.5: retain fused qkv

* mimo-v2.5: missed the attn_value scale during merge

* mimo-v2.5: fused QKV needs contiguous for scaling attention value

* mimo-v2.5: move `speech_embeddings.` to TextModel filter_tensors

* Update src/llama-hparams.h

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update src/models/mimo2.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* mimo-v2.5: include MTP weights in gguf

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-07 13:21:58 +02:00
Pascal
f4b5a2ee91 webui: fix ?model= URL param race in router mode (#22771)
* webui: fix ?model= URL param race in router mode

* chore: update webui build output
2026-05-07 13:09:32 +02:00
Vishal Singh
97f06e9eed codeowners : add ZenDNN backend codeowner (#22772)
* codeowners : add ZenDNN backend codeowner

* codeowners : fix zendnn owners to use individual github handles
2026-05-07 14:46:51 +08:00
viggy
e358d75adb webui: fix flicker issue on dismiss animation on overlay primitives (#22773)
* add fill-mode-forwards

* generated diffs
2026-05-07 08:11:31 +02:00
Shane Tran Whitmire
cfff1fc300 sycl : fix test script (#22737)
The error:
./examples/sycl/test.sh: line 122: level_zero:${$GGML_SYCL_DEVICE}: bad
substitution

was thrown whenever the user used this command:
./examples/sycl/test.sh -mg 0

Fix is to get rid of a dollar sign.
2026-05-07 08:25:57 +03:00
Adrien Gallouët
3980e04d5a llama : add missing call to ggml_backend_load_all() (#22752)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-07 08:24:47 +03:00
tc-mb
2496f9c149 mtmd : support MiniCPM-V 4.6 (#22529)
* Support MiniCPM-V 4.6 in new branch

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix code bug

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix pre-commit

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix convert

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* rename clip_graph_minicpmv4_6

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use new TYPE_MINICPMV4_6

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use build_attn to allow flash attention support

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* no use legacy code, restored here.

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use the existing tensors name

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* unused ctx->model.hparams.minicpmv_version

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* use n_merge for slice alignment

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* borrow wa_layer_indexes for vit_merger insertion point

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix code style

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* use filter_tensors and add model.vision_tower

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix chkhsh

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

* fix type check

Signed-off-by: tc-mb <tianchi_cai@icloud.com>

---------

Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 21:54:09 +02:00
Gilad S.
5207d120ea model : don't crash on unsupported architecture (#22742)
* model: don't crash on unsupported architecture

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 18:51:21 +02:00
fl0rianr
a0101225bc common: do not fit to unknown device memory (#22614)
* common: do not fit to unknown device memory

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: preserve host fallback for non-GPU fit devices

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* common: keep unknown GPU fit memory at zero

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
2026-05-06 17:03:45 +02:00
Georgi Gerganov
a290ce6266 gguf-py : bump version to 0.19.0 (#22664)
* gguf-py : bump version to 0.19.0

* bump poetry

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 14:46:14 +02:00
Yakine Tahtah
a00e47e422 mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) (#22101)
* mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)

Conformer encoder with Shaw relative position encoding,
QFormer projector, log-mel spectrogram with frame stacking.

Encoder uses GLU gating, folded batch norm, and SSM depthwise
conv. QFormer compresses encoder output via windowed
cross-attention (window=15, queries=3) into the LLM embedding
space.

Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank,
dynamic range compression, 2x frame stacking (80->160 mel).

GGUF converter handles batch norm folding at export time,
fused K/V split, and Conv1d weight reshaping.

Tested against HF transformers reference: token-for-token match
on 30s/60s audio clips with greedy decoding.

* mtmd: rename gs_ prefixed tensors to generic/architecture names

* mtmd: use tensor_mapping.py for all granite_speech tensors

* convert: fold GraniteSpeechTextModel into GraniteModel

* mtmd: replace n_layer hack with explicit has_standard_layers flag

* mtmd: replace hardcoded magic numbers with GGUF hparams for granite speech

* mtmd: align KEY_A_ define spacing

* convert: register GraniteModel for GraniteSpeechForConditionalGeneration

* convert: fix ty type-check for GraniteSpeechMmprojModel registration

* mtmd: align TN_ define spacing

* mtmd: use generic layer loop for granite speech tensor loading

* mtmd: merge qformer_proj_layer into clip_layer

* mtmd: granite_speech remove redundant ggml_build_forward_expand on inputs

* mtmd: granite_speech add comment explaining why build_attn is not used

* mtmd: granite_speech hard-code eps in cpp, remove from GGUF metadata

* gguf: add spacing between granite_speech tensor mapping blocks

* mtmd: make generic audio layer_norm_eps read optional

* mtmd: granite_speech keep encoder eps in GGUF, only hard-code projector eps

* mtmd: align defines and struct fields in clip-impl.h and clip-model.h

* mtmd: fix alignment and ordering issues across granite speech files

* convert: granite_speech use filter_tensors instead of modify_tensors for skipping
2026-05-06 14:40:59 +02:00
David Huggins-Daines
750141969c feat: migrate to PEP 621 and add uv support (#21907)
* feat: migrate to PEP 621 and add uv support

* fix: remove upper bound on protobuf

* remove poetry.lock and uv.lock

* fix/add torch dependency version and markers

* fix dev-dependency deprecation warning

* gguf-py : update python version requirement to 3.10

---------

Co-authored-by: David Huggins-Daines <dhd@dhd.ecolingui.ca>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2026-05-06 14:04:10 +02:00
Daniel Bevenius
a736e6c0ac convert : ignore non-language tensors for Gemma4Model (#22753)
* convert : ignore non-language tensors for Gemma4Model

This commit adds a check to make sure only text language tensors are
handled in filter_tensors.

The motivation is that currently when trying to convert a Gemma4 model
the following error occurs:
```console
(venv) $ ./convert-gemma.sh
INFO:hf-to-gguf:Loading model: gemma-4-E2B-it
INFO:hf-to-gguf:Model architecture: Gemma4ForConditionalGeneration
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,                 torch.float32 --> F32, shape = {256}
Traceback (most recent call last):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13752, in <module>
    main()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 13746, in main
    model_instance.write()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 945, in write
    self.prepare_tensors()
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 805, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7925, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 7290, in modify_tensors
    yield from super().modify_tensors(data_torch, name, bid)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 579, in modify_tensors
    new_name = self.map_tensor_name(name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/danbev/work/llama.cpp/./convert_hf_to_gguf.py", line 572, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.embed_vision.embedding_projection.weight'
```

* add forgotten embed_vision and embed_audio

* improve

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-06 13:50:44 +02:00
Aleksander Grygier
e3e3f8e46a webui: Remove Google Favicons & Improve MCP Information logic & UI (#22719)
* refactor: Remove Google favicon utility

* fix: MCP Server favicon

* refactor: Cleanup

* refactor: MCP Server Information

* fix: Fix MCP Settings UI

* refactor: Cleanup
2026-05-06 11:12:27 +02:00
zzzzwc
f08f20a0e3 ggml-cpu: fuse RMS_NORM + MUL on CPU backend (#22423) 2026-05-06 15:41:14 +08:00
viggy
07eaf919ed add tabindex and aria-hidden (#22699) 2026-05-06 09:21:58 +02:00
Sigbjørn Skjæret
74d6248f71 convert : add filter_tensors method to pre-filter tensors (#22597)
* add filter_tensors classmethod

* remove language_model

* fix parts validation
2026-05-06 08:06:05 +02:00
fl0rianr
2ca1161bd7 ggml : use CL_DEVICE_GLOBAL_MEM_SIZE as memory estimate for OpenCL --fit (#22688)
* ggml : report estimated OpenCL memory for --fit

Signed-off-by: Florian Reinle <f.reinle@otec.de>

* ggml : estimated OpenCL memory backend integrated

Signed-off-by: Florian Reinle <f.reinle@otec.de>

---------

Signed-off-by: Florian Reinle <f.reinle@otec.de>
2026-05-05 22:12:48 -07:00
Trivikram Reddy
bbeb89d76c Hexagon: Process M-tail rows on HMX instead of HVX (#22724)
* hex-mm: process m-tail rows on HMX instead of HVX

* hmx-mm: unroll and optimize padded activation loop

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-05 09:43:03 -07:00
lhez
ff806a110d opencl: refactor Adreno q4_0 (#22335)
* opencl: refactor adreno q4_0 gemm/gemv dispatch

* opencl: refactor q4_0 gemm/gemv loading, use consistent names

* opencl: use consistent name for adreno q8_0 gemm/gemv

* opencl: use consistent names for adreno q4_0 gemm/gemv

* opencl: simplify adreno q4_0 set_tensor

* opencl: refactor q4_0 get_tensor
2026-05-05 09:38:57 -07:00
Radoslav Gerganov
d5003b6e4d rpc : use graph uid instead of graph cache (#22701)
Store the last graph uid and compare against it to determine if the same
graph is being computed.
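
A minimal sketch of that check (member and helper names are illustrative):

```cpp
#include <cstdint>

struct rpc_graph_state {
    uint64_t last_uid = 0;   // uid of the most recently computed graph
};

// true when the incoming graph is the one computed last time, so its
// deserialized form can be reused instead of being rebuilt
static bool is_repeat_graph(rpc_graph_state & st, uint64_t uid) {
    const bool repeat = (uid != 0 && uid == st.last_uid);
    st.last_uid = uid;
    return repeat;
}
```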
2026-05-05 13:47:13 +03:00
Adrien Gallouët
2635ac76e8 common : fix missing-noreturn warnings when compiling with clang 21 (#22702)
common/arg.cpp:3719:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3719 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3726:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3726 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3733:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3733 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3740:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3740 |         [](common_params & /*params*/, int /*value*/) {
          |         ^
    common/arg.cpp:3747:9: error: function 'operator()' could be declared with attribute 'noreturn' [-Werror,-Wmissing-noreturn]
     3747 |         [](common_params & /*params*/, int /*value*/) {
          |         ^

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 13:16:25 +03:00
Georgi Gerganov
70a8309114 sync : ggml 2026-05-05 13:15:59 +03:00
Georgi Gerganov
c91faf997f ggml : bump version to 0.11.0 (ggml/1478) 2026-05-05 13:15:59 +03:00
Adrien Gallouët
bf76ac77be common : only load backends when required (#22290)
* common : only load backends when required

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* llama : call ggml_backend_load_all() directly from llama_backend_init()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add ggml_backend_load_all() where llama_backend_init() is not used

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-05 09:23:50 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
a09a00e502 vendor : update cpp-httplib to 0.43.3 (#22686) 2026-05-05 09:04:57 +02:00
Georgi Gerganov
2bacb1eb77 server : validate --tools CLI argument against known tool names (#22538)
Previously, unknown tool names passed via --tools were silently ignored.
Now the server validates each tool name at startup and exits with an
error if an unrecognized tool is specified, listing the available tools.

Assisted-by: llama.cpp:local pi
2026-05-05 06:35:27 +03:00
Georgi Gerganov
d6e7b033a4 llama : add option to save memory in device buffers (#22679)
* llama : add option to save memory in device buffers

* tests : extend llama-save-load-state
2026-05-05 06:35:07 +03:00
Sigbjørn Skjæret
fa595462ca graph : handle non-contiguous Q/K/V in mul_mat_aux (#22630)
* qkv may not always be contiguous

* cont : make the cont conditional

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-05 06:34:44 +03:00
Ismail
a817a22bc6 ggml : implement fast walsh-hadamard transform for kv rotation (#21352) (#22631) 2026-05-05 10:05:05 +08:00
Charles Xu
eff06702b2 kleidiai : update to v1.24.0 and use release archive (#22549) 2026-05-04 22:13:31 +03:00
leonardHONG
e77056f9b2 CUDA: use fastdiv for batch index split in get_rows (#22650) 2026-05-04 16:24:05 +02:00
Xuan-Son Nguyen
935a340292 server: implement /models?reload=1 (#21848) 2026-05-04 16:23:26 +02:00
Shakhnazar Sailaukan
d8794eecd5 examples: refactor diffusion generation (#22590)
* examples: refactor diffusion generation

* renamed enum values
2026-05-04 20:19:30 +08:00
JusteLeo
36a694c965 webui : fix circular dependency between chat.service.ts and models.svelte.ts (#22625) 2026-05-04 13:38:10 +02:00
Piotr Wilkin (ilintar)
a4701c98f7 common/autoparser: fixes for newline handling / forced tool calls (#22654)
* chat/autoparser: the fixes

* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.

* Trim whitespace on apply instead
2026-05-04 13:18:11 +02:00
Xuan-Son Nguyen
994118a183 model: move load_hparams and load_tensors to per-model definition (#22004)
* git-friendly migration

* add build_graph

* nits

* exclude old code from build

* wip

* add llm_arch_model_i

* prepare downstream functions

* nits

* nits

* wip

* wip

* add back create_tensor_qkv

* fix files missing include

* enforce one llm_build per arch

* cmake: use glob

* missing model params

* nits

* wip

* wip (2)

* wip (3)

* test-llama-archs is happy

* improve switch case

* move more stuff into llm_arch_model_i

* fix downstream code

* nits

* nits (2)

* fix order

* llama_model_base

* LLAMA_LOAD_LOCALS

* small fix

* fix build errors

* auto

* rm migration script and ifdef
2026-05-04 12:36:59 +02:00
Evan Huus
c84e6d6db5 server: Add a simple get_datetime server tool (#22649) 2026-05-04 12:19:41 +02:00
Nick Towle
fa8feaed34 webui: restore missing settings (#22666) 2026-05-04 09:04:07 +02:00
Georgi Gerganov
846262d787 docs : update speculative decoding parameters after refactor (#22397) (#22539)
* docs : update speculative decoding parameters after refactor (#22397)

Update docs/speculative.md to reflect the new parameter naming scheme
introduced in PR #22397:

- Replace --draft-max/--draft-min with --spec-draft-n-max/--spec-draft-n-min
- Replace --spec-ngram-size-n/m with per-implementation variants
- Add documentation for all new --spec-ngram-*- parameters
- Update all example commands

Assisted-by: llama.cpp:local pi

* pi : add rule to use gh CLI for GitHub resources

Assisted-by: llama.cpp:local pi

* docs : run llama-gen-docs

* arg : fix typo
2026-05-04 08:52:07 +03:00
Atomic-Germ
6dcd824fce vulkan: delete dead GGML_VK_MAX_NODES def (#22621) 2026-05-04 07:49:29 +02:00
Chen Yuan
d4b0c22f9e ggml-webgpu: add layer norm ops (#22406)
* shader(norm): add layer norm ops

* shader(norm): stabilize floating point computation with Kahan summation and handle mixed types

* shader(norm): remove the non-contiguous strides

* shader(norm): use the original implementation rather than the kahan summation
2026-05-03 20:52:53 -07:00
Aldehir Rojas
e48034dfc9 common : determine generation prompt using longest common prefix (#22657) 2026-05-04 00:18:23 +02:00
Julien Denize
048a490f76 convert : Mistral format yarn apply_scale support (#22612)
* [BUGFIX] Mistral format apply_scale support.

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix misunderstood boolean parameters

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-03 21:51:21 +02:00
JM Robles
db44417b02 convert : apply Q/K RoPE permutation in NVFP4 repack path (#22611)
Llama-architecture q_proj/k_proj weights need an axis-0 row permutation
to match GGML's RoPE convention. The BF16 path applies this in
LlamaModel.modify_tensors via LlamaModel.permute, but the NVFP4 path
bypasses modify_tensors and writes weights directly through
ModelBase._repack_nvfp4. Without the permutation, attention heads end
up scrambled at inference and the model produces gibberish.

This change overrides _repack_nvfp4 on LlamaModel and applies the same
permutation to both the nibble-packed weight and the per-block scale
before delegating to ModelBase._repack_nvfp4 via super(). Reuses the
existing LlamaModel.permute static helper and respects the existing
undo_permute flag, so subclasses (Mistral, Granite, Llama4, etc.)
inherit the fix automatically.

Verified on TinyLlama-1.1B reproducer: perplexity drops from 4419
(gibberish) to 43.9, matching the BF16-dequantized baseline (44.0).
Also verified end-to-end on ALIA-40b-instruct-2601 (BSC, Llama
architecture) with multilingual generation in Spanish/Catalan/Basque/
Galician all coherent with the fix applied.

Co-authored-by: Chema <chema@montevive.ai>
2026-05-03 18:22:00 +03:00
lucy
d05fe1d7da fix: CUDA device PCI bus ID de-dupe OOMing (ignoring other 3 gpus entirely) (#22533)
* fix: CUDA device PCI bus ID detection for multi-GPU de-dupe

* HIP, MUSA macros

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-02 22:19:25 +02:00
Georgi Gerganov
0754b7b6fe server : avoid checkpoint data host copies (#22558)
* server : avoid checkpoint data host copies

* llama : refactor llama_io_read_i
2026-05-02 18:03:25 +03:00
JusteLeo
09294365a9 ggml-virtgpu: fix circular dependency in headers (#22557) 2026-05-02 21:28:50 +08:00
Csaba Kecskemeti
63d93d1733 convert : disable uint types (#18908) 2026-05-02 09:05:59 +03:00
Shawn Gu
c5a3bc39b1 opencl: Adreno optimization for MoE - MxFP4 (#22301)
* MoE Mxfp4 CLC kernel added, router reorder on GPU

* Pass test-backend-ops for MoE mxfp4 Adreno CLC

* remove putenv in llama-model.cpp

* fix indent style and whitespace

* opencl: remove unnecessary headers

* opencl: do not save cl_program objects

* opencl: remove unnecessary assert

* fix precision issue

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-05-01 23:02:24 -07:00
Johannes Gäßler
9dbb372610 Github: update issue templates (#22594) 2026-05-02 07:56:13 +02:00
Georgi Gerganov
228e836344 sync : ggml 2026-05-02 08:55:29 +03:00
Georgi Gerganov
ed23489f42 ggml : bump version to 0.10.2 (ggml/1474) 2026-05-02 08:55:29 +03:00
Georgi Gerganov
457e2288c9 sync : ggml 2026-05-02 07:22:35 +03:00
Georgi Gerganov
e8ec7ab058 ggml : try fix win32 build (whisper/0) 2026-05-02 07:22:35 +03:00
Yiwei Shao
1a03cf47f6 hexagon: hmx flash attention (#22347)
* hmx: extract shared interleave headers and unify matmul batched

* hmx: add HMX-accelerated flash attention for prefill

* hmx: replace asm wrappers with Q6_ intrinsics in hmx-utils.h

Switches three single-instruction helpers from inline asm to the matching
Q6_ intrinsics, matching the style established by aizip f8737609a and used
by the upstream PR #21554 hmx-matmul-ops.c rewrite:

  hmx_set_output_scales       asm "bias=mxmem2"  -> Q6_bias_mxmem2_A
  hmx_load_tile_pair_fp16     asm packet         -> Q6_activation_hf_mxmem_RR
                                                    + Q6_weight_hf_mxmem_RR
  hmx_consume_accumulator_fp16 asm "mxmem=acc"   -> Q6_mxmem_AR_after_hf

hmx_load_tiles_fp16 stays on inline asm: it uses ":deep" activation
streaming, and the mixed Q6_activation_hf_mxmem_RR_deep + non-deep
Q6_weight_hf_mxmem_RR pair fails the HMX backend constraint check
("activate weight pair (1) exceeds limit (1)"). The asm bundle keeps
both halves in one VLIW packet and avoids the diagnostic.

Functionally equivalent — same instructions emitted; the Q6_ intrinsics
just give the compiler more visibility for scheduling.

* hmx: drop the duplicate interleave_fp16_weight_chunk_to_tiles

* hmx: apply upstream optimization to hmx-flash-attn-ops.c

Apply restrict, __builtin_assume, and pointer accumulation to the three HMX workers (qk_dot, o_update, o_norm) and the matching inline HMX loops in op_hmx_flash_attn_ext.

* hmx: unify interleave helper

* hmx: multi-thread Q load / O store and enable prefill FA dispatch

Extract inline Q-load and O-store loops into worker_pool-parallel helpers
(fa_phase_q_load, fa_phase_o_store) so HVX threads split the F32↔F16
conversion work across row ranges.  Also relax the softmax threading
gate from n_row_vec_cnt >= n_threads to >= 2, which was unnecessarily
forcing single-thread fallback when n_rows_g < 512.

On the dispatch side, remove the ne[2] != 1 guard that blocked multi-head
(prefill) FA from reaching the HTP backend — GQA is already handled
internally by both the HMX and HVX flash-attention paths.

* hmx: relax matmul pipeline gate to cover k > n shapes (e.g. FFN_down)

* hmx: optimize FA softmax mask phase (no-ALiBi fast path + GQA dedup)

* hmx: Add an asm memory clobber at the phase boundary to prevent reorder bug

* [experimental]: fp16 softmax (EXP2_HF) to accelerate fa

Bake log2(e) into qk_scale and use hvx_exp2_hf directly for P and m_diff
(base-2 consistent, matches htp-ops-lib). ~22 ALU ops for 64 lanes vs
~44 for the F32 round-trip path.

* hmx flash-attn: refine cost model coefficients based on profiling data

* hmx flash-attn: replace asm clobber with targeted volatile reads on vtcm_d_tiles

* hmx flash-attn: fix prefill correctness (dst indexing, softmax reduce, V stride)

* hmx flash-attn: fix p_tiles dual-tile OOB race; enable MT + pipeline

* hmx flash-attn: preserve additive mask bias in no-ALiBi fast path

The no-ALiBi fast path (max_bias==0) was skipping mask add entirely on
the assumption that mask values are only {0, -inf}.  This is wrong when
the mask carries additive positional bias — those terms were silently
dropped.  Keep the slope-mul skip (slope≡1.0) but add the mask back so the
bias survives; vmux still clamps values below -16 to -inf.

Also add HMX FA coverage to test-backend-ops: prefill shapes (nb=64,
nb=32) × {mask on/off} × {ALiBi on/off} × {softcap on/off}, F16 KV,
hs ∈ {64, 128}.

* hmx: fix softcap+EXP2_HF interaction, tighten matmul pipeline gate, add FA tests

- flash-attn: when EXP2_HF is on AND logit_softcap is active, fold
  log2(e) into the post-tanh multiplier (v_cap) instead of pre-baking
  it into qk_scale.  Pre-baking shifted the tanh knee from x≈c to
  x≈c/log2(e) and produced numerically wrong softcapped outputs
  whenever both knobs were enabled.
- flash-attn softmax (fa_softmax_thread): replace the union+memcpy
  scalar extract pattern with HVX vmux-based per-row accumulators on
  rowmax/rowsum.  Add hvx_vec_get_f16 helper in hvx-base.h.  Functional
  parity, less scalar code, clearer hf/qf16 lane-format contract.
- matmul (hmx_mat_mul_permuted_qk_0_d16a32): pick pipeline vs sequential
  layout based on whether the chunker actually yields >=2 n-chunks,
  instead of the static (m>=128 && n>=256) gate.  Avoids paying for
  output double-buffer + worker dispatch when there is no HMX/HVX
  overlap to gain (e.g. shapes that collapse to one n-chunk).
- tests: add HMX flash-attention coverage over the
  {mask, ALiBi (max_bias), logit_softcap} cross-product for the prefill
  path — head_dim 64/128, GQA 4×4, kv=512/nb=64 plus a kv=113/nb=32
  non-aligned case.

* [Help Wanted]: refactor D matrix computation into separate function for clarity and maintainability

* format code

* hexagon: looks like -O3 is causing issues with the large code base, switch to -O2 and -flto instead

* hexagon: use hex_ prefix for swap_ptr

* hexagon: move vtcm_seq_alloc into vtcm-utils.h

More vtcm allocator updates are coming, so it makes sense to start a separate header for it.

* hmx-utils: add hmx_prefix for layout converters

* hmx-mm: move main hmx_mm functions to the end, remove unused fwd decls, etc

* hmx-mm: remove unused qweight_fetch_task_state_t and minor alignment fixes

* hmx-fa: minor alignment fixes

* hmx-fa: move hmx_flash_atten into hmx-ops.h

* hmx-fa: remove redundant workpool pointer in the hmx_fa_ctx, plus minor alignment updates

* hmx-fa: minor alignment and simplifications

* hexagon: move FA_EXP_F16 option to hostside CMake file

* hmx-fa: use hvx_vec_splat_f16 instead of fp16_to_bits

* hmx-fa: add hvx_splat_u16/u8 and use that in the fa instead of the custom hvx_fill

* hmx-fa: some more alignment updates in the core fa function

* hmx-fa: keep slopes in vtcm in fp16

Saves malloc/free and removes the need for float -> fp16 downcast on every use.

* hexagon: consistent noinline usage (after static)

* hex-hmx: consistently use FARF_HIGH to enable debug output

* hmx-utils: no need for always_inline attr

* hex-hmx: consistent noinline usage (static noinline ...)

* hex-hmx: simplify init_col_scales

* hexagon: fix editorconfig errors

* hmx-mm: minor alignment fixes

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-05-01 20:29:13 -07:00
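For context on the EXP2_HF change in the commit above: it relies on exp(x) = 2^(x·log2 e), so folding log2(e) into the QK scale lets the softmax use a cheap base-2 exponent. Below is a hedged scalar C++ sketch of the idea only — the actual path runs on HVX fp16 vectors, and with logit softcap enabled the factor is folded into the post-tanh multiplier instead.

    #include <cmath>

    // Scalar illustration only: compute a softmax weight with exp2 instead of exp
    // by pre-multiplying the scale with log2(e).
    static float softmax_weight(float qk_dot, float row_max, float scale) {
        const float k = scale * 1.4426950408889634f;  // scale * log2(e)
        return std::exp2f((qk_dot - row_max) * k);    // == expf(scale * (qk_dot - row_max))
    }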
ddh0
b97ebdc98f llama-quant : fix --tensor-type when default qtype is overriden (#22572)
fix #22544 (my fault!)

Credit to @Anai-Guo, ref #22559 - since that one was closed due to the
new contributor policy I am taking the liberty of re-submitting that PR
here.
2026-05-01 19:55:55 +02:00
Aparna M P
2098fd6169 hexagon: enable non-contiguous row tensor support for unary ops (#22574) 2026-05-01 10:09:23 -07:00
Aleksander Grygier
ab6120cde5 webui: Spring Cleaning Refactor v1 (#22505)
* wip: server_tools

* feat: Integrate with `/tools` endpoint

* feat: Builtin + MCP + JSON Schema Tools WIP

* refactor

* displayName -> display_name

* snake_case everywhere

* rm redundant field

* feat: Improvements

* chore: update webui build output

* refactor: Updates after server updates

* chore: update webui build output

* change arg to --tools all

* feat: UI improvements

* chore: update webui build output

* add readme mention

* llama-gen-docs

* chore: update webui build output

* chore: update webui build output

* chore: update webui build output

* feat: Reorganize settings sections

* feat: Separate dialogs for MCP Servers Settings and Import/Export

* feat: WIP

* feat: WIP

* feat: WIP

* feat: WIP

* feat: WIP

* feat: WIP

* WIP on allozaur/20677-webui-server-tools

* feat: UI improvements

* chore: Update package lock

* chore: Run `npm audit fix`

* feat: UI WIP

* feat: UI

* refactor: Desktop Icon Strip DRY

* feat: Cleaner rendering and transition for ChatScreen

* feat: UI improvements

* feat: UI improvement

* feat: Remove MCP Server "enable" switch from Tools submenu

* chore: Run `npm audit fix`

* feat: WIP

* feat: Logic improvements

* refactor: Cleanup

* refactor: DRY

* test: Fix Chat Sidebar UI Tests

* chore: Update package lock

* refactor: Cleanup

* feat: Chat Message Action Card with Continue and Permission flow implementations

* feat: Add agentic steering messages, draft messages and improve chat UX

* fix: Search results UI

* test: Fix unit test

* feat: UI/UX improvements

* refactor: Simplify `useToolsPanel` access in components

* feat: Implement Processing Info Context API

* feat: Implement 'Go back to chat' functionality for settings

* feat: Enhance MCP Server management in Chat Form Attachments

* style: Minor UI and branding adjustments

* chore: Update webui static build output

* chore: Formatting, linting & type checks

* feat: Draft messages logic

* feat: UI improvements

* feat: Steering Messages improvements

* refactor: Cleanup

* refactor: Cleanup

* feat: Improve UI

* refactor: Settings navigation hook

* refactor: DRY code

* refactor: DRY ChatMessageUser UI components

* refactor: Desktop Icon Strip DRY

* refactor: Tools & permissions

* fix: Navigation condition

* refactor: Cleanup

* refactor: Cleanup

* refactor: Cleanup

* fix: preserve reasoning_content in agentic flow

* refactor: Storybook cleanup

* refactor: isInViewport util function

* refactor: Rename globally `onClick` to `onclick`

* chore: `npm audit fix`

* refactor: Action Icon usage

* refactor: Naming

* refactor: JS in `class` directive

* refactor: Chat components cleanup WIP

* refactor: Components structure

* refactor: Cleanup WIP

* feat: New ChatAttachmentsPreview component

* feat: UI improvements

* feat: UI improvements

* refactor: Cleanup

* refactor: ChatAttachmentsPreview UI/UX

* refactor: Remove dead code

* refactor: Cleanup

* fix: Model Name aliases displaying

* feat: Shortcut improvements

* refactor: Chat Message

* feat: Move Import/Export to settings

* refactor: Cleanup

* refactor: Cleanup

* refactor: Cleanup

* refactor: Cleanup

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-05-01 18:36:29 +02:00
Masashi Yoshimura
c3c1505392 ggml-webgpu: Fix vectorized handling in mul-mat and mul-mat-id (#22578)
* Fix vectorized condition of mul-mat-fast pipeline and add vectorized variant to mul-mat-id

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-01 07:55:01 -07:00
Jeff Bolz
05e141a6b3 vulkan: Support asymmetric FA in coopmat2 path (#21753)
* vulkan: Support asymmetric FA in coopmat2 path

There has been some recent interest/experimentation with mixed quantization
types for FA. I had originally designed the cm2 FA shader with this in mind
(because I didn't realize it wasn't supported at the time!); this change
adds the missing pieces and enables it.

Also support Q1_0 since people have been trying that out (seems crazy, but
who knows).

We should be able to do similar things in the coopmat1/scalar path, but
there's another change open against the scalar path and I don't want to
conflict.

* reorder cases
2026-05-01 15:28:32 +02:00
Chen Yuan
aab68217b7 ggml-webgpu: add the upscale shader (#22419)
* shader(upscale): add the upscale shader with nearest, bilinear and bicubic implementations

* shader(upscale): use macro
2026-04-30 22:22:18 -07:00
Masashi Yoshimura
a95a11e5b8 ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (#22464)
* Add a mat-vec fast path for MUL_MAT_ID.

* Add shared accumulation vec logic and support for the other types.

* Add i-quant mat-mat for MUL_MAT_ID and fix some parts

* Remove n_experts from shader_lib_context.
2026-04-30 14:19:10 -07:00
Reese Levine
5cbfb18075 Update llama-mmap to use ftello/fseeko (#22497)
* Update llama-mmap to work with 32-bit wasm and >2GB models

* Update to gguf.cpp style
2026-04-30 14:17:52 -07:00
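For context on the commit above: fseek/ftell take a long offset, which is 32-bit on wasm32, so offsets into >2GB model files overflow; fseeko/ftello use off_t instead. A minimal sketch of the pattern (not the actual llama-mmap code; assumes off_t is 64-bit, e.g. via _FILE_OFFSET_BITS=64 where applicable):

    #include <cstdio>
    #include <sys/types.h>

    // Sketch only: query a file's size with 64-bit-capable seek/tell so files
    // larger than 2 GiB work even where `long` is 32 bits (e.g. wasm32).
    static off_t file_size(FILE * f) {
        fseeko(f, 0, SEEK_END);        // off_t offset instead of fseek's long
        const off_t size = ftello(f);  // off_t result instead of ftell's long
        fseeko(f, 0, SEEK_SET);
        return size;
    }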
Adrien Gallouët
beb42fffa4 common : check for null getpwuid in hf-cache (#22550)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-30 21:32:41 +02:00
Ruben Ortlam
660b1b4bdc vulkan: add get/set tensor 2d functions (#22514)
* vulkan: add get/set_tensor_2d functions

* fix backend interface comments

* Update ggml/src/ggml-metal/ggml-metal.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-30 17:37:13 +02:00
Ben Guidarelli
c20c44514a spec: fix argument typo (#22552) 2026-04-30 17:32:32 +03:00
Sigbjørn Skjæret
6118c043b1 ci : bump ty to 0.0.33 (#22535)
* bump ty to 0.0.33

* update typings
2026-04-30 16:15:54 +03:00
Adrien Gallouët
5f0ab726f7 vendor : update cpp-httplib to 0.43.2 (#22548)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-04-30 15:04:39 +02:00
Johannes Gäßler
e82aaf2587 CUDA: fix tile FA kernel on Pascal (#22541) 2026-04-30 13:04:50 +02:00
Georgi Gerganov
27aef3dd91 scripts : add wc2wt.sh - create worktree from current HEAD (#22513)
* scripts : add wc2wt.sh - create worktree from current HEAD

Add a script to create a git worktree on a new branch from the current
HEAD. Similar to pr2wt.sh but for local development branches instead of
PRs.

Usage:
  ./scripts/wc2wt.sh gg/new-feature
  ./scripts/wc2wt.sh gg/new-feature "bash -l"

Assisted-by: llama.cpp:local pi

* cont : no need to try to delete the branch
2026-04-30 09:20:26 +03:00
Rithik Sharma
45155597aa add fast matmul iquants (#22504) 2026-04-29 22:58:32 -07:00
Georgi Gerganov
80afa33aad spec : fix draft model checkpoints (#22521)
* spec : fix draft model checkpoints

* cont : clean-up

* cont : gate the ngram-mod reset warning behind verbose flag
2026-04-30 08:32:18 +03:00
Peter Sideris
b42c7fa5b8 spec : fix vocab compat checks in spec example (#22426)
* port #22358 PR to examples/speculative/speculative.cpp
* use vocab_[tgt,dft] instead of ctx_[tgt,dft] when logging on draft
  model / target model vocabulary mismatch

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
2026-04-30 08:18:25 +03:00
Aldehir Rojas
d77599234e common : do not pass prompt tokens to reasoning budget sampler (#22488) 2026-04-29 14:10:58 -05:00
Max Krasnyansky
41a63be28e hexagon: make vmem and buffer-size configurable (#22487)
* hexagon: allow host to set max vmem size

We use a sane default but it's helpful to allow for an override if needed.

* hexagon: add support for measuring vmem space and move pinned mmaping management to host

* hexagon: update vmem checks to use uint64

* hexagon: bump op buffers to 16 (matches max mmaps)

* hexagon: bump default vmem to 3.2GB

* hexagon: add support for autodetecting vmem space and some logging cleanup in that area

* hexagon: fix whitespace warnings

* Update scripts/snapdragon/adb/run-cli.sh

Co-authored-by: Pascal <admin@serveurperso.com>

* hex-adb: fix run-completion script

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2026-04-29 11:51:21 -07:00
Anav Prasad
098705a29e CUDA: fuse SSM_CONV + ADD(bias) + SILU (#22478) 2026-04-30 02:39:56 +08:00
Georgi Gerganov
683c5acb90 spec : discard last drafted token with low prob (#22506) 2026-04-29 17:00:00 +03:00
Georgi Gerganov
b1d5f5b449 sync : ggml 2026-04-29 16:43:47 +03:00
Georgi Gerganov
4b221b7f1e ggml : bump version to 0.10.1 (ggml/1469) 2026-04-29 16:43:47 +03:00
Pascal
59237bfbbc webui: fix slow mic stop and WAV encode (#22480)
* webui: instant mic stop, race-free recorder restart

* webui: faster WAV PCM encode via hoisted channels and Int16Array

* chore: update webui build output

* webui: drop setTimeout(0) hack and harden cancelRecording

* chore: update webui build output
2026-04-29 12:58:35 +02:00
shalinib-ibm
1cbc846eba ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (#22293)
* ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault

vec_xst operations in the tiled path crash on AIX when writing
near 4KB page boundaries due to strict memory protection. Fall
back to mnpack implementation on AIX for stable execution.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>

* Update ggml/src/ggml-cpu/llamafile/sgemm.cpp

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Update sgemm.cpp

* Update sgemm.cpp

---------

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-04-29 13:32:40 +03:00
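A minimal sketch of the kind of compile-time gate described in the commit above (illustrative only; the symbol names are not the ones used in sgemm.cpp):

    // Hypothetical helper: on AIX the vec_xst tiled path can fault when storing
    // near 4 KiB page boundaries, so report the tiled kernel as unavailable and
    // let the caller take the mnpack route instead.
    static bool tiled_matmul_supported() {
    #if defined(_AIX)
        return false;
    #else
        return true;
    #endif
    }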
Aman Gupta
3142f1dbb9 ggml-cuda: refactor fusion code (#22468)
* ggml-cuda: refactor fusion code

* apply formatting + make env variable truthy
2026-04-29 16:19:33 +08:00
qiurui144
b5c4227dc6 ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (#22317)
* ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME

When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains
inline asm for the vmadot family, which requires the xsmtvdotii custom
extension (the issue is described in several blog posts and was confirmed on the K3 platform).
The current CMakeLists does not include xsmtvdotii, so any toolchain
that honours the explicit -march (tested with SpacemiT GCC 15.2) fails
at the assembler stage:

  Error: unrecognized opcode `vmadot v16,v14,v0',
         extension `xsmtvdotii' required

Append _xsmtvdotii to MARCH_STR when GGML_CPU_RISCV64_SPACEMIT is
enabled so the IME path can actually build with a capable toolchain.
No effect on builds that leave GGML_CPU_RISCV64_SPACEMIT off.

toolchain from https://www.spacemit.com/community/resources-download/Tools

* Update ggml/src/ggml-cpu/CMakeLists.txt

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>

---------

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>
2026-04-29 10:59:21 +03:00
Reese Levine
d6a5094004 ggml-webgpu: Fix bug in FlashAttention support check (#22492)
* Fix flashattention support check for devices that don't support subgroups

* set path to none if kv_tile doesn't fit
2026-04-29 10:59:00 +03:00
Masato Nakasaka
7b95ea5d11 common: Intentionally leak logger instance to fix hanging on Windows (#22273)
* Changed to leak logger singleton to prevent hanging on Windows

* Fix comment

* Stopped using static vector

Using a std::vector causes g_col to be released before the logger thread exits, so the logger thread touches freed memory and crashes.

* Change so all logs are output before exit

* Added debug logging

* added more logging

* Added logging

* Explicitly free logger to avoid hanging on Win

* Reverted to leak logger instance again

* Removed debug log and fixed comment

* Fixed comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-29 10:58:43 +03:00
hrushitfujitsu
bdc9c743a5 ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (#21916)
* Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel

* Change arrays to static const in repack.cpp

---------

Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>
2026-04-29 10:57:37 +03:00
Johannes Gäßler
739393beeb TP: fix delayed AllReduce + zero-sized slices (#22489) 2026-04-29 08:55:07 +02:00
Michael Wand
fc2b0053ff ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (#22196) 2026-04-29 06:47:42 +08:00
lnigam
7b8443ac78 ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286)
* ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32)

Adds MMA-f16 and tile kernel configs, dispatch logic, template instances,
and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to
ncols2=32 to support GQA ratio 32 only.

* Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32

* Apply suggestions from code review

Address review comments

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256

* Fixed a bug with sinks=1: with ncols=32 two warp-groups are created, but the sinks index was the same (0,...,15) for both groups, so the output did not match the CPU output. Added sink_base, which is the base index for each warp_group (threadIdx.y / np)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-28 21:37:35 +02:00
Daniel Bevenius
5d56effdee convert : add support for Nemotron Nano 3 Omni (#22481)
This commit adds support for the NVIDIA Nemotron Nano 3 Omni model, enabling
it to be converted to GGUF.
2026-04-28 19:17:57 +02:00
Jillis ter Hove
52e5f0a5c1 common : re-arm reasoning budget after DONE on new <think> (#22323)
The DONE state absorbs all tokens, including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF, which interleaves multiple <think> blocks per response.

Fixed by advancing start_matcher in the DONE branch and re-arming to COUNTING with a fresh budget on a match. Adds a regression test (test-reasoning-budget: test 6).
2026-04-28 19:15:36 +02:00
Matt Corallo
f9f33654a6 vulkan: Coalesce Q4_K/Q5_K scale loads (#21751)
Some SPIR-V compilers (notably mesa) don't handle the current
vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well.
While reading three `u8`s from the 12-byte scale array should (at
least on some hardware) result in loading the full 12 bytes in a
single LOAD followed by whatever extraction is needed, at least
the ANV Intel driver really can't practically perform this
optimization.

`mesa`'s unsigned upper bound logic doesn't handle tracking bounds
through ternary, resulting in the `(is < 4) ? ... : is - 4` having
an infinite upper bound (as it cannot prove `is - 4` doesn't
underflow). While this could still be rectified if mesa looked at
the array bounds, it currently doesn't and `glslc` currently emits
SPIR-V that doesn't allow for this optimization anyway (though
maybe it will at some point, see
https://github.com/KhronosGroup/glslang/issues/4206).

In mul_mat_vecq we took a different approach to loading the same
fields. We read the first two bytes we needed from `scale` then
took a branch before deciding whether we needed to read a third
byte. In mesa this did, indeed, lead to a top-level branch with
conditional loads. As such these loads ended up not being
coalesced either (at least in the ANV driver) resulting in
additional instructions in our hot loop.

Instead, here, we go ahead and force loading the full 12 bytes and
extract the bits we need from the packed-u32s instead. In mul_mat
there's a few less ternaries and only one extra shift, so even on
drivers that did optimize the previous loads properly the only
material change should be pulling a few extra bytes into registers
(which on most hardware won't cost anything anyway, though
ironically on Intel it theoretically could). In mul_mat_vecq this
requires a bit of extra math and may read bytes from the u32 that
weren't needed, but it seems likely avoiding the branch is a win
on most platforms.

On Intel Xe2/mesa 26.0.4 with the optimizations from
https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162,

for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l:
 * Instruction Count: 2753 -> 2688
 * SEND Count: 269 -> 261
 * Cycle Count: 273976 -> 266138
 * Max live registers: 248 -> 246
 * Non SSA regs after NIR: 381 -> 382

for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l:
 * Instruction Count: 2767 -> 2702
 * SEND Count: 271 -> 263
 * Cycle Count: 274140 -> 268144
 * Max live registers: 248 -> 246
 * Non SSA regs after NIR: 381 -> 382

for shader mul_mat_vec_id_q4_k_q8_1_f32:
 * Instruction Count: 1930 -> 1646
 * SEND Count: 116 -> 71
 * Cycle Count: 1348306 -> 843350
 * Max live registers: 78 -> 84
 * Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_id_q5_k_q8_1_f32:
 * Instruction Count: 2207 -> 1922
 * SEND Count: 131 -> 86
 * Cycle Count: 1392012 -> 1037836
 * Max live registers: 90 -> 90
 * Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_q4_k_q8_1_f32:
 * Instruction Count: 2029 -> 1749
 * SEND Count: 111 -> 66
 * Cycle Count: 1347278 -> 840118
 * Max live registers: 74 -> 80
 * Non SSA regs after NIR: 299 -> 134

for shader mul_mat_vec_q5_k_q8_1_f32:
 * Instruction Count: 2307 -> 2022
 * SEND Count: 126 -> 81
 * Cycle Count: 1379820 -> 954042
 * Max live registers: 86 -> 86
 * Non SSA regs after NIR: 299 -> 134

On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL:
 * pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%)
 * pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%)
 * tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%)

On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S:
 * pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%)
 * pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%)
 * tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%)

On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with
-sm layer (note that -sm tensor improvements will naturally be
less):
 * pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%)
 * pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%)
 * tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)
2026-04-28 17:31:04 +02:00
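The gist of the commit above, in C++ terms: load all 12 scale bytes up front as three packed u32s and extract bytes with shifts, instead of issuing per-index byte reads behind a ternary or branch. A hedged sketch of the extraction pattern only — the real shader is GLSL, and the actual Q4_K/Q5_K 6-bit scale/min unpacking is more involved:

    #include <cstdint>
    #include <cstring>

    // Illustrative only: one coalesced 12-byte load, then branch-free byte
    // extraction from the packed u32s, so the driver never has to prove bounds
    // through a ternary on the index.
    static uint8_t scale_byte(const uint8_t scales[12], int i) {
        uint32_t s[3];
        std::memcpy(s, scales, sizeof(s));
        return uint8_t(s[i >> 2] >> ((i & 3) * 8));
    }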
605 changed files with 46248 additions and 28304 deletions

View File

@@ -33,10 +33,10 @@ RUN mkdir -p /app/full \
FROM intel/deep-learning-essentials:$ONEAPI_VERSION AS base
ARG IGC_VERSION=v2.30.1
ARG IGC_VERSION_FULL=2_2.30.1+20950
ARG COMPUTE_RUNTIME_VERSION=26.09.37435.1
ARG COMPUTE_RUNTIME_VERSION_FULL=26.09.37435.1-0
ARG IGC_VERSION=v2.32.7
ARG IGC_VERSION_FULL=2_2.32.7+21184
ARG COMPUTE_RUNTIME_VERSION=26.14.37833.4
ARG COMPUTE_RUNTIME_VERSION_FULL=26.14.37833.4-0
ARG IGDGMM_VERSION=22.9.0
RUN mkdir /tmp/neo/ && cd /tmp/neo/ \
&& wget https://github.com/intel/intel-graphics-compiler/releases/download/$IGC_VERSION/intel-igc-core-${IGC_VERSION_FULL}_amd64.deb \

View File

@@ -103,6 +103,7 @@ let
vulkan-headers
vulkan-loader
shaderc
spirv-headers
];
in
@@ -146,7 +147,6 @@ effectiveStdenv.mkDerivation (finalAttrs: {
ninja
pkg-config
git
spirv-headers
]
++ optionals useCuda [
cudaPackages.cuda_nvcc

View File

@@ -12,6 +12,8 @@ body:
after recreating the CMake build directory and with `-DGGML_CCACHE=OFF`.
If the compilation succeeds with ccache disabled you should be able to permanently fix the issue
by clearing `~/.cache/ccache` (on Linux).
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: textarea
id: commit
attributes:

View File

@@ -1,5 +1,5 @@
name: Bug (model use)
description: Something goes wrong when using a model (in general, not specific to a single llama.cpp module).
description: Something goes wrong when running a model (crashes, garbled outputs, etc.).
title: "Eval bug: "
labels: ["bug-unconfirmed", "model evaluation"]
body:
@@ -12,6 +12,8 @@ body:
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
The `llama-completion` binary can be used for simple and reproducible model inference.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: textarea
id: version
attributes:

View File

@@ -10,6 +10,8 @@ body:
This issue template is intended for miscellaneous bugs that don't fit into any other category.
If you encountered the issue while using an external UI (e.g. ollama),
please reproduce your issue using one of the examples/binaries in this repository.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: textarea
id: version
attributes:

View File

@@ -8,6 +8,8 @@ body:
value: |
[Please post your idea first in Discussion if there is not yet a consensus for this enhancement request. This will help to keep this issue tracker focused on enhancements that the community has agreed needs to be implemented.](https://github.com/ggml-org/llama.cpp/discussions/categories/ideas)
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: checkboxes
id: prerequisites
attributes:

View File

@@ -8,6 +8,8 @@ body:
value: |
Don't forget to check for any [duplicate research issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3A%22research+%F0%9F%94%AC%22)
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: checkboxes
id: research-stage
attributes:

View File

@@ -9,6 +9,8 @@ body:
Don't forget to [check for existing refactor issue tickets](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3Arefactoring) in case it's already covered.
Also you may want to check [Pull request refactor label as well](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Aopen+is%3Apr+label%3Arefactoring) for duplicates too.
Please fill out this template yourself, copypasting language model outputs is [strictly prohibited](https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#ai-usage-policy).
- type: textarea
id: background-description
attributes:

View File

@@ -29,10 +29,10 @@ jobs:
uses: actions/setup-python@v6
with:
python-version: '3.11'
pip-install: poetry==2.4.0
- name: Install dependencies
run: |
cd gguf-py
python -m pip install poetry==2.3.2
poetry install
- name: Build package

View File

@@ -31,7 +31,7 @@ jobs:
uses: actions/setup-python@v6
with:
python-version: "3.11"
pip-install: -r requirements/requirements-all.txt ty==0.0.26
pip-install: -r requirements/requirements-all.txt ty==0.0.33
# - name: Type-check with Pyright
# uses: jakebailey/pyright-action@v2
# with:

.gitignore (vendored): 3 lines changed
View File

@@ -105,9 +105,12 @@
__pycache__/
*/poetry.lock
poetry.toml
poetry.lock
uv.lock
# Nix
flake.lock
/result
# Test binaries

View File

@@ -4,6 +4,7 @@ General:
- By very precise and concise when writing code, comments, explanations, etc.
- PR and commit titles format: `<module> : <title>`. Lookup recents for examples
- Don't try to build or run the code unless you are explicitly asked to do so
- Use the `gh` CLI tool when querying PRs, issues, or other GitHub resources
Coding:
- When in doubt, always refer to the CONTRIBUTING.md file of the project

View File

@@ -76,6 +76,7 @@
/ggml/src/ggml-vulkan/ @ggml-org/ggml-vulkan
/ggml/src/ggml-webgpu/ @ggml-org/ggml-webgpu
/ggml/src/ggml-zdnn/ @ggml-org/ggml-zdnn @Andreas-Krebbel @AlekseiNikiforovIBM
/ggml/src/ggml-zendnn/ @avinashcpandey @Jiten1parmar @z-vishal
/ggml/src/ggml.c @ggerganov
/ggml/src/ggml.cpp @ggerganov
/ggml/src/gguf.cpp @JohannesGaessler @Green-Sky

View File

@@ -529,6 +529,7 @@ To learn more about model quantization, [read this documentation](tools/quantize
- [How to build](docs/build.md)
- [Running on Docker](docs/docker.md)
- [Build on Android](docs/android.md)
- [Multi-GPU usage](docs/multi-gpu.md)
- [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
- [GGML tips & tricks](https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-&-Tricks)

View File

@@ -248,6 +248,8 @@ std::vector<std::string> common_arg::get_env() const {
// Helper function to parse tensor buffer override strings
static void parse_tensor_buffer_overrides(const std::string & value, std::vector<llama_model_tensor_buft_override> & overrides) {
ggml_backend_load_all();
std::map<std::string, ggml_backend_buffer_type_t> buft_list;
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
auto * dev = ggml_backend_dev_get(i);
@@ -425,6 +427,10 @@ static bool parse_bool_value(const std::string & value) {
}
}
[[noreturn]] static void arg_removed(const std::string & msg) {
throw std::invalid_argument("the argument has been removed. " + msg);
}
//
// CLI argument parsing functions
//
@@ -803,6 +809,7 @@ static std::vector<ggml_backend_dev_t> parse_device_list(const std::string & val
if (dev_names.size() == 1 && dev_names[0] == "none") {
devices.push_back(nullptr);
} else {
ggml_backend_load_all();
for (const auto & device : dev_names) {
auto * dev = ggml_backend_dev_by_name(device.c_str());
if (!dev || ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
@@ -820,6 +827,7 @@ static void add_rpc_devices(const std::string & servers) {
if (rpc_servers.empty()) {
throw std::invalid_argument("no RPC servers specified");
}
ggml_backend_load_all();
ggml_backend_reg_t rpc_reg = ggml_backend_reg_by_name("RPC");
if (!rpc_reg) {
throw std::invalid_argument("failed to find RPC backend");
@@ -1016,9 +1024,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.use_color = tty_can_use_colors();
// load dynamic backends
ggml_backend_load_all();
common_params_context ctx_arg(params);
ctx_arg.print_usage = print_usage;
ctx_arg.ex = ex;
@@ -2275,6 +2280,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--list-devices"},
"print list of available devices and exit",
[](common_params &) {
ggml_backend_load_all();
std::vector<ggml_backend_dev_t> devices;
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
auto * dev = ggml_backend_dev_get(i);
@@ -2864,7 +2870,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--tools"}, "TOOL1,TOOL2,...",
"experimental: whether to enable built-in tools for AI agents - do not enable in untrusted environments (default: no tools)\n"
"specify \"all\" to enable all tools\n"
"available tools: read_file, file_glob_search, grep_search, exec_shell_command, write_file, edit_file, apply_diff",
"available tools: read_file, file_glob_search, grep_search, exec_shell_command, write_file, edit_file, apply_diff, get_datetime",
[](common_params & params, const std::string & value) {
params.server_tools = parse_csv_row(value);
}
@@ -3380,7 +3386,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_spec().set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}));
add_opt(common_arg(
{"--spec-draft-poll", "--poll-draft"}, "<0|1>",
"Use polling to wait for draft model work (default: same as --poll])",
"Use polling to wait for draft model work (default: same as --poll)",
[](common_params & params, int value) {
params.speculative.draft.cpuparams.poll = value;
}
@@ -3499,7 +3505,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_spec().set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_LOOKUP, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_SPEC_DRAFT_N_MIN"));
add_opt(common_arg(
{"--spec--draft-p-split", "--draft-p-split"}, "P",
{"--spec-draft-p-split", "--draft-p-split"}, "P",
string_format("speculative decoding split probability (default: %.2f)", (double)params.speculative.draft.p_split),
[](common_params & params, const std::string & value) {
params.speculative.draft.p_split = std::stof(value);
@@ -3715,35 +3721,35 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--draft", "--draft-n", "--draft-max"}, "N",
"the argument has been removed. use --spec-draft-n-max or --spec-ngram-mod-n-max",
[](common_params & /*params*/, int /*value*/) {
throw std::invalid_argument("the argument has been removed. use --spec-draft-n-max or --spec-ngram-mod-n-max");
arg_removed("use --spec-draft-n-max or --spec-ngram-mod-n-max");
}
).set_spec().set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_LOOKUP, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_DRAFT_MAX"));
add_opt(common_arg(
{"--draft-min", "--draft-n-min"}, "N",
"the argument has been removed. use --spec-draft-n-min or --spec-ngram-mod-n-min",
[](common_params & /*params*/, int /*value*/) {
throw std::invalid_argument("the argument has been removed. use --spec-draft-n-min or --spec-ngram-mod-n-min");
arg_removed("use --spec-draft-n-min or --spec-ngram-mod-n-min");
}
).set_spec().set_examples({LLAMA_EXAMPLE_SPECULATIVE, LLAMA_EXAMPLE_LOOKUP, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_DRAFT_MIN"));
add_opt(common_arg(
{"--spec-ngram-size-n"}, "N",
"the argument has been removed. use the respective --spec-ngram-*-size-n or --spec-ngram-mod-n-match",
[](common_params & /*params*/, int /*value*/) {
throw std::invalid_argument("the argument has been removed. use the respective --spec-ngram-*-size-n");
arg_removed("use the respective --spec-ngram-*-size-n");
}
).set_spec().set_examples({LLAMA_EXAMPLE_SERVER}));
add_opt(common_arg(
{"--spec-ngram-size-m"}, "N",
"the argument has been removed. use the respective --spec-ngram-*-size-m",
[](common_params & /*params*/, int /*value*/) {
throw std::invalid_argument("the argument has been removed. use the respective --spec-ngram-*-size-m");
arg_removed("use the respective --spec-ngram-*-size-m");
}
).set_spec().set_examples({LLAMA_EXAMPLE_SERVER}));
add_opt(common_arg(
{"--spec-ngram-min-hits"}, "N",
"the argument has been removed. use the respective --spec-ngram-*-min-hits",
[](common_params & /*params*/, int /*value*/) {
throw std::invalid_argument("the argument has been removed. use the respective --spec-ngram-*-min-hits");
arg_removed("use the respective --spec-ngram-*-min-hits");
}
).set_spec().set_examples({LLAMA_EXAMPLE_SERVER}));
@@ -3794,7 +3800,10 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_examples({ LLAMA_EXAMPLE_DIFFUSION }));
add_opt(common_arg(
{"--diffusion-algorithm"}, "N",
string_format("diffusion algorithm: 0=ORIGIN, 1=ENTROPY_BASED, 2=MARGIN_BASED, 3=RANDOM, 4=LOW_CONFIDENCE (default: %d)", params.diffusion.algorithm),
string_format(
"diffusion algorithm: 0=DIFFUSION_ALGORITHM_ORIGIN, 1=DIFFUSION_ALGORITHM_ENTROPY_BASED, "
"2=DIFFUSION_ALGORITHM_MARGIN_BASED, 3=DIFFUSION_ALGORITHM_RANDOM, "
"4=DIFFUSION_ALGORITHM_CONFIDENCE_BASED (default: %d)", params.diffusion.algorithm),
[](common_params & params, int value) { params.diffusion.algorithm = value; }
).set_examples({ LLAMA_EXAMPLE_DIFFUSION }));
add_opt(common_arg(

View File

@@ -136,10 +136,10 @@ common_peg_parser analyze_reasoning::build_parser(parser_build_context & ctx) co
if (!end.empty()) {
if (!start.empty()) {
// Standard tag-based: optional(<think>reasoning</think>)
return p.optional(start + p.reasoning(p.until(end)) + end + p.space());
return p.optional(p.optspace(start) + p.reasoning(p.until(trim_whitespace(end))) + p.optspace(end));
}
// Delimiter-style (empty start)
return p.optional(p.reasoning(p.until(end)) + end + p.space());
return p.optional(p.reasoning(p.until(trim_whitespace(end))) + p.optspace(end));
}
}
@@ -186,7 +186,6 @@ common_peg_parser analyze_tools::build_parser(parser_build_context & ctx) const
common_peg_parser analyze_tools::build_tool_parser_json_native(parser_build_context & ctx) const {
auto & p = ctx.p;
const auto & inputs = ctx.inputs;
bool force_tools = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED;
// Build effective field names with dot notation if function_field is set
std::string name_field = format.name_field;
@@ -225,8 +224,7 @@ common_peg_parser analyze_tools::build_tool_parser_json_native(parser_build_cont
tool_start = format.per_call_start;
}
return ctx.reasoning_parser + (force_tools ? p.eps() : p.optional(p.content(p.until(tool_start)))) + tools_parser +
p.end();
return ctx.reasoning_parser + p.optional(p.content(p.until(tool_start))) + tools_parser + p.end();
}
common_peg_parser analyze_tools::build_func_parser(common_chat_peg_builder & p, const std::string & name,
@@ -270,7 +268,6 @@ common_peg_parser analyze_tools::build_func_parser(common_chat_peg_builder & p,
common_peg_parser analyze_tools::build_tool_parser_tag_json(parser_build_context & ctx) const {
auto & p = ctx.p;
const auto & inputs = ctx.inputs;
bool force_tools = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED;
common_peg_parser tool_choice = p.choice();
@@ -336,14 +333,12 @@ common_peg_parser analyze_tools::build_tool_parser_tag_json(parser_build_context
std::string trigger_marker = !format.section_start.empty() ? format.section_start : format.per_call_start;
auto content_before_tools = trigger_marker.empty() ? p.eps() : p.until(trigger_marker);
return ctx.reasoning_parser + (force_tools ? p.eps() : p.optional(p.content(content_before_tools))) + tool_calls +
p.end();
return ctx.reasoning_parser + p.optional(p.content(content_before_tools)) + tool_calls + p.end();
}
common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_context & ctx) const {
auto & p = ctx.p;
const auto & inputs = ctx.inputs;
bool force_tools = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED;
auto until_suffix = p.rule("until-suffix", p.until(arguments.value_suffix));
@@ -374,9 +369,7 @@ common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_conte
arguments.name_suffix) +
arguments.value_prefix +
(schema_info.resolves_to_string(param_schema) ?
p.tool_arg_string_value(p.schema(until_suffix,
"tool-" + name + "-arg-" + param_name + "-schema",
param_schema, true)) :
p.tool_arg_string_value(until_suffix) :
p.tool_arg_json_value(p.schema(
p.json(), "tool-" + name + "-arg-" + param_name + "-schema", param_schema, false)) +
p.space()) +
@@ -471,8 +464,7 @@ common_peg_parser analyze_tools::build_tool_parser_tag_tagged(parser_build_conte
std::string trigger_marker = !format.section_start.empty() ? format.section_start : format.per_call_start;
auto content_before_tools = trigger_marker.empty() ? p.eps() : p.until(trigger_marker);
return ctx.reasoning_parser + (force_tools ? p.eps() : p.optional(p.content(content_before_tools))) + tool_calls +
p.end();
return ctx.reasoning_parser + p.optional(p.content(content_before_tools)) + tool_calls + p.end();
}
} // namespace autoparser

View File

@@ -342,7 +342,7 @@ void analyze_reasoning::compare_thinking_enabled() {
if (left_trimmed.empty() && !diff.right.empty()) {
if (!right_trimmed.empty() && string_ends_with(comparison->output_B, right_trimmed)) {
if (start.empty()) {
start = trim_leading_whitespace(diff.right);
start = diff.right;
mode = reasoning_mode::TAG_BASED;
}
}
@@ -353,7 +353,7 @@ void analyze_reasoning::compare_thinking_enabled() {
if (seg.size() >= 2 && seg[seg.size() - 1].value == left_trimmed && seg[seg.size() - 2].type == segment_type::MARKER) {
start = seg[seg.size() - 2].value;
}
end = trim_trailing_whitespace(diff.left);
end = diff.left;
mode = reasoning_mode::TAG_BASED;
}
}
@@ -445,14 +445,14 @@ void analyze_reasoning::compare_reasoning_scope() {
auto result = parser_wrapped.parse_anywhere_and_extract(comparison->output_B);
if (result.result.success()) {
start = result.tags["pre"];
end = trim_trailing_whitespace(result.tags["post"]);
end = result.tags["post"];
} else {
auto parser_delimiter = build_tagged_peg_parser([&](common_peg_parser_builder &p) {
return p.literal(reasoning_content) + p.space() + p.optional(p.tag("post", (p.marker() + p.space())));
});
result = parser_delimiter.parse_anywhere_and_extract(comparison->output_B);
if (result.result.success()) {
end = trim_trailing_whitespace(result.tags["post"]);
end = result.tags["post"];
} else {
LOG_DBG(ANSI_ORANGE "%s: Unable to extract reasoning markers, falling back to reasoning = NONE\n" ANSI_RESET, __func__);
mode = reasoning_mode::NONE;

View File

@@ -816,6 +816,32 @@ common_peg_parser common_chat_peg_builder::prefix(const std::string & s, const s
return literal(s.substr(0, s.rfind(delimiter)));
}
common_peg_parser common_chat_peg_builder::optspace(const std::string & tag) {
auto parser = eps();
size_t end_of_prefix_space = tag.size();
size_t start_of_suffix_space = tag.size();
for (size_t i = 0; i < tag.size(); i++) {
if (!std::isspace(tag[i])) {
end_of_prefix_space = i;
break;
}
}
for (size_t i = tag.size(); i > 0; i--) {
if (!std::isspace(tag[i - 1])) {
start_of_suffix_space = i;
break;
}
}
for (size_t i = 0; i < end_of_prefix_space; i++) {
parser += optional(literal(std::string(1, tag[i])));
}
parser += literal(tag.substr(end_of_prefix_space, start_of_suffix_space - end_of_prefix_space));
for (size_t i = start_of_suffix_space; i < tag.size(); i++) {
parser += optional(literal(std::string(1, tag[i])));
}
return parser;
}
common_peg_parser common_chat_peg_builder::standard_json_tools(
const std::string & section_start,
const std::string & section_end,

View File

@@ -96,6 +96,9 @@ class common_chat_peg_builder : public common_peg_parser_builder {
// Return a parser that parses the prefix of a string, up to a given delimiter.
common_peg_parser prefix(const std::string & s, const std::string & delimiter = {});
// Return a parser that parses all elements of tag, but leading and trailing spaces are optional
common_peg_parser optspace(const std::string & tag);
// Legacy-compatible helper for building standard JSON tool calls
// Used by tests and manual parsers
// name_key/args_key: JSON key names for function name and arguments

View File

@@ -80,7 +80,7 @@ json common_chat_msg::to_json_oaicompat(bool concat_typed_text) const {
if (!content.empty()) {
jmsg["content"] = content;
} else if (!content_parts.empty()) {
if (concat_typed_text) {
if (concat_typed_text || contains_media()) {
std::string text;
bool last_was_media_marker = false;
// join parts with newline, do not add newline before or after media markers
@@ -2116,22 +2116,38 @@ std::optional<common_chat_params> common_chat_try_specialized_template(
return std::nullopt;
}
static std::string common_chat_templates_generation_prompt(const common_chat_template & tmpl, const autoparser::generation_params & inputs) {
autoparser::generation_params params = inputs;
params.add_generation_prompt = false;
std::string no_gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
params.add_generation_prompt = true;
std::string gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
size_t prefix_len = 0;
size_t min_size = std::min(no_gen_prompt.size(), gen_prompt.size());
while (prefix_len < min_size && no_gen_prompt[prefix_len] == gen_prompt[prefix_len]) {
prefix_len++;
}
return gen_prompt.substr(prefix_len);
}
static common_chat_params common_chat_templates_apply_jinja(const struct common_chat_templates * tmpls,
const struct common_chat_templates_inputs & inputs) {
autoparser::generation_params params;
params.tools = common_chat_tools_to_json_oaicompat(inputs.tools);
const auto & tmpl =
params.tools.is_array() && tmpls->template_tool_use ? *tmpls->template_tool_use : *tmpls->template_default;
const auto & src = tmpl.source();
const auto & caps = tmpl.original_caps();
params.messages = render_message_to_json(inputs.messages, tmpl.original_caps());
params.tool_choice = inputs.tool_choice;
params.reasoning_format = inputs.reasoning_format;
params.enable_thinking = inputs.enable_thinking;
params.grammar = inputs.grammar;
params.now = inputs.now;
params.add_bos = tmpls->add_bos;
params.add_eos = tmpls->add_eos;
const auto & src = tmpl.source();
const auto & caps = tmpl.original_caps();
params.messages = render_message_to_json(inputs.messages, tmpl.original_caps());
params.tool_choice = inputs.tool_choice;
params.reasoning_format = inputs.reasoning_format;
params.enable_thinking = inputs.enable_thinking;
params.grammar = inputs.grammar;
params.now = inputs.now;
params.add_generation_prompt = inputs.add_generation_prompt;
params.add_bos = tmpls->add_bos;
params.add_eos = tmpls->add_eos;
if (src.find("<|channel|>") == std::string::npos) {
// map developer to system for all models except for GPT-OSS
@@ -2153,14 +2169,7 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
workaround::func_args_not_string(params.messages);
}
params.add_generation_prompt = false;
std::string no_gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
params.add_generation_prompt = true;
std::string gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
auto diff = calculate_diff_split(no_gen_prompt, gen_prompt);
params.generation_prompt = diff.right + diff.suffix;
params.add_generation_prompt = inputs.add_generation_prompt;
params.generation_prompt = common_chat_templates_generation_prompt(tmpl, params);
params.extra_context = common_chat_extra_context();
for (auto el : inputs.chat_template_kwargs) {
@@ -2212,8 +2221,8 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
auto auto_params = autoparser::peg_generator::generate_parser(tmpl, params, autoparser);
auto_params.supports_thinking = autoparser.reasoning.mode != autoparser::reasoning_mode::NONE;
if (auto_params.supports_thinking) {
auto_params.thinking_start_tag = autoparser.reasoning.start;
auto_params.thinking_end_tag = autoparser.reasoning.end;
auto_params.thinking_start_tag = trim_whitespace(autoparser.reasoning.start);
auto_params.thinking_end_tag = trim_whitespace(autoparser.reasoning.end);
}
auto_params.generation_prompt = params.generation_prompt;
common_peg_arena arena;

View File

@@ -94,6 +94,15 @@ struct common_chat_msg {
tool_name.empty() && tool_call_id.empty();
}
bool contains_media() const {
for (const auto & part : content_parts) {
if (part.type == "media_marker") {
return true;
}
}
return false;
}
void set_tool_call_ids(std::vector<std::string> & ids_cache,
const std::function<std::string()> & gen_tool_call_id) {
for (auto i = 0u; i < tool_calls.size(); i++) {

View File

@@ -109,16 +109,24 @@ static std::vector<llama_device_memory_data> common_get_device_memory_data(
ret.back().total = total;
}
for (size_t i = 0; i < nd; i++) {
ggml_backend_dev_t dev = llama_model_get_device(model, i);
size_t free;
size_t total;
ggml_backend_dev_memory(llama_model_get_device(model, i), &free, &total);
ggml_backend_dev_memory(dev, &free, &total);
// devices can return 0 bytes for free and total memory if they do not
// have any to report. in this case, we will use the host memory as a fallback
// fixes: https://github.com/ggml-org/llama.cpp/issues/18577
// Some non-GPU accelerator backends, such as BLAS, report 0/0 and rely on
// the host-memory fallback. For GPU-like backends, keep 0/0 so --fit does
// not assign anything to a device with an unknown memory budget.
if (free == 0 && total == 0) {
free = ret.back().free;
total = ret.back().total;
const enum ggml_backend_dev_type type = ggml_backend_dev_type(dev);
if (type == GGML_BACKEND_DEVICE_TYPE_GPU || type == GGML_BACKEND_DEVICE_TYPE_IGPU) {
LOG_WRN("%s: device %s did not report memory; --fit will not use it\n",
__func__, ggml_backend_dev_name(dev));
} else {
free = ret.back().free;
total = ret.back().total;
}
}
ret[i].free = free;
ret[i].total = total;

View File

@@ -57,7 +57,7 @@ static fs::path get_cache_directory() {
#ifndef _WIN32
const struct passwd * pw = getpwuid(getuid());
if (pw->pw_dir && *pw->pw_dir) {
if (pw && pw->pw_dir && *pw->pw_dir) {
return fs::path(pw->pw_dir) / ".cache" / "huggingface" / "hub";
}
#endif

View File

@@ -49,7 +49,7 @@ enum common_log_col : int {
};
// disable colors by default
static std::vector<const char *> g_col = {
static const char* g_col[] = {
"",
"",
"",
@@ -247,7 +247,6 @@ public:
entries = std::move(new_entries);
}
cv.notify_one();
}
@@ -265,7 +264,6 @@ public:
{
std::unique_lock<std::mutex> lock(mtx);
cv.wait(lock, [this]() { return head != tail; });
cur = entries[head];
head = (head + 1) % entries.size();
@@ -301,7 +299,6 @@ public:
tail = (tail + 1) % entries.size();
}
cv.notify_one();
}
@@ -338,7 +335,7 @@ public:
g_col[COMMON_LOG_COL_CYAN] = LOG_COL_CYAN;
g_col[COMMON_LOG_COL_WHITE] = LOG_COL_WHITE;
} else {
for (size_t i = 0; i < g_col.size(); i++) {
for (size_t i = 0; i < std::size(g_col); i++) {
g_col[i] = "";
}
}
@@ -368,14 +365,20 @@ struct common_log * common_log_init() {
}
struct common_log * common_log_main() {
static struct common_log log;
// We intentionally leak (i.e. do not delete) the logger singleton because
// the common_log destructor, when called during DLL teardown, can hang on Windows.
// The OS releases the resources anyway, so this should not be a significant issue,
// though logs may be lost if they are not flushed before the program exits.
// Refer to https://github.com/ggml-org/llama.cpp/issues/22142 for details.
static struct common_log * log;
static std::once_flag init_flag;
std::call_once(init_flag, [&]() {
log = new common_log;
// Set default to auto-detect colors
log.set_colors(tty_can_use_colors());
log->set_colors(tty_can_use_colors());
});
return &log;
return log;
}
void common_log_pause(struct common_log * log) {

View File

@@ -49,7 +49,11 @@ void common_log_default_callback(enum ggml_log_level level, const char * text, v
struct common_log;
struct common_log * common_log_init();
struct common_log * common_log_main(); // singleton, automatically destroys itself on exit
// Singleton, intentionally leaked to avoid Windows teardown hangs.
// Call common_log_flush() before exit if you want to ensure all logs are flushed.
struct common_log * common_log_main();
void common_log_pause (struct common_log * log); // pause the worker thread, not thread-safe
void common_log_resume(struct common_log * log); // resume the worker thread, not thread-safe
void common_log_free (struct common_log * log);

View File

@@ -122,6 +122,20 @@ static void common_reasoning_budget_accept(struct llama_sampler * smpl, llama_to
}
break;
case REASONING_BUDGET_DONE:
// Re-arm on a new start tag: some models emit multiple <think> blocks
// per response, and each should get a fresh budget window.
if (ctx->start_matcher.advance(token)) {
ctx->state = REASONING_BUDGET_COUNTING;
ctx->remaining = ctx->budget;
ctx->end_matcher.reset();
LOG_INF("reasoning-budget: re-activated on new start tag, budget=%d tokens\n", ctx->budget);
if (ctx->remaining <= 0) {
ctx->state = REASONING_BUDGET_FORCING;
ctx->force_pos = 0;
LOG_INF("reasoning-budget: budget=0, forcing immediately\n");
}
}
break;
}
}
@@ -218,34 +232,6 @@ static struct llama_sampler * common_reasoning_budget_init_state(
);
}
struct llama_sampler * common_reasoning_budget_init(
const struct llama_vocab * vocab,
const std::vector<llama_token> & start_tokens,
const std::vector<llama_token> & end_tokens,
const std::vector<llama_token> & forced_tokens,
int32_t budget,
const std::vector<llama_token> & prefill_tokens) {
// Determine initial state from prefill: COUNTING if the prefill begins with
// the start sequence but does not also contain the end sequence after it.
common_reasoning_budget_state initial_state = REASONING_BUDGET_IDLE;
if (!prefill_tokens.empty() && !start_tokens.empty() &&
prefill_tokens.size() >= start_tokens.size() &&
std::equal(start_tokens.begin(), start_tokens.end(), prefill_tokens.begin())) {
initial_state = REASONING_BUDGET_COUNTING;
// If the end sequence also follows the start in the prefill, reasoning
// was opened and immediately closed — stay IDLE.
if (!end_tokens.empty() &&
prefill_tokens.size() >= start_tokens.size() + end_tokens.size()) {
auto end_start = prefill_tokens.end() - (ptrdiff_t) end_tokens.size();
if (end_start >= prefill_tokens.begin() + (ptrdiff_t) start_tokens.size() &&
std::equal(end_tokens.begin(), end_tokens.end(), end_start)) {
initial_state = REASONING_BUDGET_IDLE;
}
}
}
return common_reasoning_budget_init_state(vocab, start_tokens, end_tokens, forced_tokens, budget, initial_state);
}
struct llama_sampler * common_reasoning_budget_init(
const struct llama_vocab * vocab,
const std::vector<llama_token> & start_tokens,

View File

@@ -29,10 +29,7 @@ enum common_reasoning_budget_state {
// end_tokens - token sequence for natural deactivation
// forced_tokens - token sequence forced when budget expires
// budget - max tokens allowed in the reasoning block
// prefill_tokens - tokens already present in the prompt (generation prompt);
// used to determine the initial state: COUNTING if they begin
// with start_tokens (but don't also end with end_tokens),
// IDLE otherwise. COUNTING with budget <= 0 is promoted to FORCING.
// initial_state - initial state
//
struct llama_sampler * common_reasoning_budget_init(
const struct llama_vocab * vocab,
@@ -40,16 +37,6 @@ struct llama_sampler * common_reasoning_budget_init(
const std::vector<llama_token> & end_tokens,
const std::vector<llama_token> & forced_tokens,
int32_t budget,
const std::vector<llama_token> & prefill_tokens = {});
// Variant that takes an explicit initial state (used by tests and clone).
// COUNTING with budget <= 0 is promoted to FORCING.
struct llama_sampler * common_reasoning_budget_init(
const struct llama_vocab * vocab,
const std::vector<llama_token> & start_tokens,
const std::vector<llama_token> & end_tokens,
const std::vector<llama_token> & forced_tokens,
int32_t budget,
common_reasoning_budget_state initial_state);
common_reasoning_budget_state initial_state = REASONING_BUDGET_IDLE);
common_reasoning_budget_state common_reasoning_budget_get_state(const struct llama_sampler * smpl);

View File

@@ -260,32 +260,35 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
}
}
// Compute prefill tokens from the generation prompt
std::vector<llama_token> prefill_tokens;
if (!params.generation_prompt.empty()) {
GGML_ASSERT(vocab != nullptr);
auto tokens = common_tokenize(vocab, params.generation_prompt, false, true);
for (size_t i = 0; i < tokens.size(); i++) {
std::string piece = common_token_to_piece(vocab, tokens[i], true);
if (i == 0 && std::isspace(piece[0]) && !std::isspace(params.generation_prompt[0])) {
// Some tokenizers will add a space before the first special token, need to exclude
continue;
}
LOG_DBG("%s: prefill token: %d = %s\n", __func__, tokens[i], piece.c_str());
prefill_tokens.push_back(tokens[i]);
}
}
// Feed generation prompt tokens to the grammar sampler so it advances past
// tokens the template already placed in the prompt.
// Only applies to output-format and tool-call grammars; user-supplied grammars must not be prefilled.
std::vector<llama_token> prefill_tokens;
if (!params.generation_prompt.empty() && common_grammar_needs_prefill(params.grammar)) {
GGML_ASSERT(vocab != nullptr);
prefill_tokens = common_tokenize(vocab, params.generation_prompt, false, true);
if (!prefill_tokens.empty()) {
std::string first_token = common_token_to_piece(vocab, prefill_tokens[0], true);
if (std::isspace(first_token[0]) && !std::isspace(params.generation_prompt[0])) {
// Some tokenizers will add a space before the first special token, need to remove
prefill_tokens = std::vector<llama_token>(prefill_tokens.begin() + 1, prefill_tokens.end());
}
}
if (grmr && !params.grammar_lazy) {
try {
for (const auto & token : prefill_tokens) {
llama_sampler_accept(grmr, token);
LOG_DBG("%s: accepted prefill token (%d)\n", __func__, token);
}
} catch (std::exception &e) {
LOG_ERR("%s: error initializing grammar sampler for grammar:\n%s\n\nGeneration prompt:\n'%s'\n", __func__,
common_grammar_value(params.grammar).c_str(), params.generation_prompt.c_str());
throw e;
if (grmr && !params.grammar_lazy && common_grammar_needs_prefill(params.grammar)) {
try {
for (const auto & token : prefill_tokens) {
llama_sampler_accept(grmr, token);
LOG_DBG("%s: grammar accepted prefill token (%d)\n", __func__, token);
}
} catch (std::exception &e) {
LOG_ERR("%s: error initializing grammar sampler for grammar:\n%s\n\nGeneration prompt:\n'%s'\n", __func__,
common_grammar_value(params.grammar).c_str(), params.generation_prompt.c_str());
throw e;
}
}
@@ -296,8 +299,12 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
params.reasoning_budget_start,
params.reasoning_budget_end,
params.reasoning_budget_forced,
params.reasoning_budget_tokens < 0 ? INT_MAX : params.reasoning_budget_tokens,
prefill_tokens);
params.reasoning_budget_tokens < 0 ? INT_MAX : params.reasoning_budget_tokens);
for (const auto & token : prefill_tokens) {
llama_sampler_accept(rbudget, token);
LOG_DBG("%s: reasoning-budget accepted prefill token (%d)\n", __func__, token);
}
}
if (params.has_logit_bias()) {
@@ -431,7 +438,7 @@ static bool grammar_should_apply(struct common_sampler * gsmpl) {
return true;
}
void common_sampler_accept(struct common_sampler * gsmpl, llama_token token, bool accept_grammar) {
void common_sampler_accept(struct common_sampler * gsmpl, llama_token token, bool is_generated) {
if (!gsmpl) {
return;
}
@@ -439,9 +446,11 @@ void common_sampler_accept(struct common_sampler * gsmpl, llama_token token, boo
const auto tm = gsmpl->tm();
// grammar_should_apply() checks the reasoning budget state, so calculate this before we accept
accept_grammar = accept_grammar && grammar_should_apply(gsmpl);
const auto accept_grammar = is_generated && grammar_should_apply(gsmpl);
llama_sampler_accept(gsmpl->rbudget, token);
if (gsmpl->rbudget && is_generated) {
llama_sampler_accept(gsmpl->rbudget, token);
}
if (gsmpl->grmr && accept_grammar) {
llama_sampler_accept(gsmpl->grmr, token);
@@ -538,6 +547,8 @@ llama_token common_sampler_sample(struct common_sampler * gsmpl, struct llama_co
auto & chain = gsmpl->chain;
auto & cur_p = gsmpl->cur_p; // initialized by set_logits
gsmpl->set_logits(ctx, idx);
// Check if a backend sampler has already sampled a token in which case we
// return that token id directly.
{
@@ -549,17 +560,17 @@ llama_token common_sampler_sample(struct common_sampler * gsmpl, struct llama_co
GGML_ASSERT(!gsmpl->grmr && "using grammar in combination with backend sampling is not supported");
GGML_ASSERT(!gsmpl->rbudget && "using reasoning budget in combination with backend sampling is not supported");
// TODO: simplify
gsmpl->cur.resize(1);
gsmpl->cur[0] = { id, 0.0f, 1.0f };
cur_p = { gsmpl->cur.data(), gsmpl->cur.size(), 0, true };
for (size_t i = 0; i < cur_p.size; ++i) {
if (cur_p.data[i].id == id) {
cur_p.selected = i;
break;
}
}
return id;
}
}
gsmpl->set_logits(ctx, idx);
// apply reasoning budget first
llama_sampler_apply(rbudget, &cur_p);

View File

@@ -41,8 +41,8 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
void common_sampler_free(struct common_sampler * gsmpl);
// if accept_grammar is true, the token is accepted both by the sampling chain and the grammar
void common_sampler_accept(struct common_sampler * gsmpl, llama_token token, bool accept_grammar);
// if is_generated is true, the token is accepted by the sampling chain, the reasoning budget sampler, and the grammar sampler
void common_sampler_accept(struct common_sampler * gsmpl, llama_token token, bool is_generated);
void common_sampler_reset (struct common_sampler * gsmpl);
struct common_sampler * common_sampler_clone (struct common_sampler * gsmpl);

View File

@@ -167,8 +167,6 @@ struct common_speculative_checkpoint {
size_t size() const {
return data.size();
}
size_t ckpt_size = 0;
};
struct common_speculative_state_draft : public common_speculative_state {
@@ -176,7 +174,7 @@ struct common_speculative_state_draft : public common_speculative_state {
llama_context * ctx_dft;
bool use_ckpt = false;
struct common_speculative_checkpoint ckpt;
common_speculative_checkpoint ckpt;
common_sampler * smpl;
@@ -249,29 +247,19 @@ struct common_speculative_state_draft : public common_speculative_state {
llama_batch_free(batch);
}
void begin(const llama_tokens & prompt) override {
if (use_ckpt && ckpt.size() > 0) {
// delete checkpoint
LOG_DBG("%s: delete checkpoint, prompt.size=%zu, pos_min=%d, pos_max=%d, n_tokens=%" PRId64 ", size=%.3f MiB\n",
__func__, prompt.size(), ckpt.pos_min, ckpt.pos_max, ckpt.n_tokens, (float) ckpt.data.size() / 1024 / 1024);
ckpt.pos_min = 0;
ckpt.pos_max = 0;
ckpt.n_tokens = 0;
ckpt.ckpt_size = 0;
ckpt.data.clear();
}
void begin(const llama_tokens & /*prompt*/) override {
}
size_t draft_create_checkpoint(int n_tokens_prompt, int n_tokens_batch) {
size_t create_checkpoint(int n_tokens_prompt) {
int slot_id = 0;
const size_t checkpoint_size = llama_state_seq_get_size_ext(ctx_dft, slot_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
const size_t checkpoint_size = llama_state_seq_get_size_ext(ctx_dft, slot_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
ckpt.pos_min = llama_memory_seq_pos_min(llama_get_memory(ctx_dft), slot_id);
ckpt.pos_max = llama_memory_seq_pos_max(llama_get_memory(ctx_dft), slot_id);
ckpt.n_tokens = n_tokens_prompt - n_tokens_batch;
ckpt.n_tokens = n_tokens_prompt;
ckpt.data.resize(checkpoint_size);
const size_t n = llama_state_seq_get_data_ext(ctx_dft, ckpt.data.data(), checkpoint_size, slot_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
const size_t n = llama_state_seq_get_data_ext(ctx_dft, ckpt.data.data(), checkpoint_size, slot_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
if (n != checkpoint_size) {
GGML_ABORT("checkpoint size mismatch: expected %zu, got %zu\n", checkpoint_size, n);
}
@@ -281,13 +269,13 @@ struct common_speculative_state_draft : public common_speculative_state {
return n;
}
size_t draft_restore_checkpoint(size_t ckpt_size_part_expected) {
size_t restore_checkpoint() {
int slot_id = 0;
LOG_DBG("%s: pos_min = %d, pos_max = %d\n", __func__, ckpt.pos_min, ckpt.pos_max);
const size_t n = llama_state_seq_set_data_ext(ctx_dft, ckpt.data.data(), ckpt.size(), slot_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY);
if (n != ckpt_size_part_expected) {
GGML_ABORT("%s: failed to restore context checkpoint (pos_min=%d, pos_max=%d, size=%zu, get_data_ext->%zu, set_data_ext->%zu",
__func__, ckpt.pos_min, ckpt.pos_max, ckpt.size(), ckpt_size_part_expected, n);
const size_t n = llama_state_seq_set_data_ext(ctx_dft, ckpt.data.data(), ckpt.size(), slot_id, LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
if (n != ckpt.size()) {
GGML_ABORT("%s: failed to restore context checkpoint (pos_min=%d, pos_max=%d, size=%zu",
__func__, ckpt.pos_min, ckpt.pos_max, ckpt.size());
}
llama_memory_seq_rm(llama_get_memory(ctx_dft), slot_id, ckpt.pos_max + 1, -1);
@@ -346,13 +334,18 @@ struct common_speculative_state_draft : public common_speculative_state {
const int i_start = std::max<int>(0, (int) prompt_cur.size() - n_ctx);
if (use_ckpt && i_start > 0) {
LOG_WRN("%s: context shift is not supported with checkpoint-based contexts - skipping\n", __func__);
return;
}
// reuse as much as possible from the old draft context
// ideally, the draft context should be as big as the target context and we will always reuse the entire prompt
for (int i = 0; i < (int) prompt_dft.size(); ++i) {
int cur = 0;
while (i_start + cur < (int) prompt_cur.size() &&
i + cur < (int) prompt_dft.size() &&
prompt_cur[i_start + cur] == prompt_dft[i + cur]) {
i + cur < (int) prompt_dft.size() &&
prompt_cur[i_start + cur] == prompt_dft[i + cur]) {
cur++;
}
@@ -360,21 +353,26 @@ struct common_speculative_state_draft : public common_speculative_state {
reuse_i = i;
reuse_n = cur;
}
if (use_ckpt) {
break;
}
}
LOG_DBG("%s: reuse_i = %d, reuse_n = %d, #prompt_dft = %zu, #prompt_cur = %zu\n",
__func__, reuse_i, reuse_n, prompt_dft.size(), prompt_cur.size());
if (use_ckpt && ckpt.ckpt_size == 0 && reuse_n > 0) {
LOG_DBG("%s: no checkpoint available, no reuse, (reuse_i=%d, reuse_n=%d) -> (0, 0)\n",
__func__, reuse_i, reuse_n);
if (use_ckpt && ckpt.n_tokens > reuse_n) {
LOG_DBG("%s: checkpoint (n_tokens = %d) is outdated -> delete it\n", __func__, (int) ckpt.n_tokens);
reuse_i = 0;
reuse_n = 0;
ckpt = {};
}
result.clear();
result.reserve(sparams.n_max);
bool needs_ckpt = use_ckpt && prompt_dft.size() > 0;
if (reuse_n == 0 || (use_ckpt && reuse_i > 0)) {
llama_memory_clear(mem_dft, false);
prompt_dft.clear();
@@ -393,50 +391,38 @@ struct common_speculative_state_draft : public common_speculative_state {
return;
}
bool do_restore = false;
if (prompt_dft.size() > prompt_cur.size() && reuse_i + reuse_n < (int64_t) prompt_dft.size()) {
// This can happen after a partial acceptance (speculative decoding with checkpoints)
LOG_DBG("%s: #prompt_dft=%zu, #prompt_cur=%zu, shorten draft\n",
__func__, prompt_dft.size(), prompt_cur.size());
prompt_dft.resize(prompt_cur.size());
do_restore = true;
}
if (reuse_i > 0) {
GGML_ASSERT(!use_ckpt);
bool is_removed = llama_memory_seq_rm (mem_dft, 0, 0, reuse_i);
if (!is_removed) {
LOG_ERR("%s: llama_memory_seq_rm failed, reuse_i=%d\n", __func__, reuse_i);
return;
}
llama_memory_seq_add(mem_dft, 0, reuse_i, -1, -reuse_i);
prompt_dft.erase(prompt_dft.begin(), prompt_dft.begin() + reuse_i);
}
if (reuse_n < (int) prompt_dft.size() || do_restore) {
if (reuse_n < (int) prompt_dft.size()) {
if (use_ckpt) {
if (ckpt.n_tokens > (int64_t) prompt_dft.size()) {
LOG_INF("%s: checkpoint is too large, prompt_tgt.size=%zu, ckpt.n_tokens=%" PRId64 ", reuse_n=%d, prompt_dft.size=%zu\n",
__func__, prompt_tgt.size(), ckpt.n_tokens, reuse_n, prompt_dft.size());
if (ckpt.n_tokens > 0) {
LOG_DBG("%s: restoring checkpoint, reuse_n=%d, prompt_dft.size=%zu\n", __func__, reuse_n, prompt_dft.size());
restore_checkpoint();
reuse_n = ckpt.n_tokens;
prompt_dft.resize(reuse_n);
}
draft_restore_checkpoint(ckpt.ckpt_size);
reuse_n = ckpt.n_tokens;
prompt_dft.resize(reuse_n);
needs_ckpt = false;
} else {
bool is_removed = llama_memory_seq_rm (mem_dft, 0, reuse_n, -1);
const bool is_removed = llama_memory_seq_rm(mem_dft, 0, reuse_n, -1);
if (!is_removed) {
LOG_ERR("%s: llama_memory_seq_rm failed, reuse_n=%d, prompt_dft.size=%zu\n",
__func__, reuse_n, prompt_dft.size());
LOG_ERR("%s: llama_memory_seq_rm failed, reuse_n=%d, prompt_dft.size=%zu\n", __func__, reuse_n, prompt_dft.size());
return;
}
prompt_dft.erase(prompt_dft.begin() + reuse_n, prompt_dft.end());
}
}
}
if (needs_ckpt) {
ckpt.ckpt_size = draft_create_checkpoint(prompt_dft.size(), batch.n_tokens);
}
// prepare a batch to evaluate any new tokens in the prompt
common_batch_clear(batch);
@@ -450,12 +436,17 @@ struct common_speculative_state_draft : public common_speculative_state {
// we should rarely end-up here during normal decoding
if (batch.n_tokens > 0) {
//LOG_DBG("%s: draft prompt batch: %s\n", __func__, string_from(ctx, batch).c_str());
LOG_DBG("%s: draft prompt batch: %d tokens\n", __func__, batch.n_tokens);
int ret = llama_decode(ctx_dft, batch);
if (ret != 0 && ret != 1) {
LOG_WRN("%s: llama_decode returned %d, prompt_cur.size=%zu\n",
__func__, ret, prompt_cur.size());
}
if (use_ckpt) {
create_checkpoint(prompt_dft.size());
}
}
const llama_pos n_past = prompt_dft.size();
@@ -467,7 +458,7 @@ struct common_speculative_state_draft : public common_speculative_state {
prompt_dft.push_back(id_last);
LOG_DBG("%s: draft prompt: %s\n", __func__, string_from(ctx_dft, prompt_dft).c_str());
//LOG_DBG("%s: draft prompt: %s\n", __func__, string_from(ctx_dft, prompt_dft).c_str());
int ret = llama_decode(ctx_dft, batch);
if (ret != 0 && ret != 1) {
@@ -495,14 +486,14 @@ struct common_speculative_state_draft : public common_speculative_state {
common_sampler_accept(smpl, id, true);
result.push_back(id);
if (sparams.n_max <= (int) result.size()) {
// only collect very high-confidence draft tokens
if (cur_p->data[0].p < sparams.p_min) {
break;
}
// only collect very high-confidence draft tokens
if (cur_p->data[0].p < sparams.p_min) {
result.push_back(id);
if (sparams.n_max <= (int) result.size()) {
break;
}
@@ -784,17 +775,15 @@ struct common_speculative_state_ngram_mod : public common_speculative_state {
}
void accept(uint16_t n_accepted) override {
if (verbose) {
LOG_INF("%s: accepted %d tokens from %zu drafted tokens\n", __func__, n_accepted, n_draft_last);
}
// compute acceptance fraction if we have a recorded draft length
if (n_draft_last > 0) {
const double f_acc = (double)n_accepted / (double)n_draft_last;
if (f_acc < 0.5) {
n_low++;
if (n_low >= 3) {
LOG_WRN("%s: low acceptance streak (%d) resetting ngram_mod\n", __func__, n_low);
if (verbose) {
LOG_WRN("%s: low acceptance streak (%d) resetting ngram_mod\n", __func__, n_low);
}
mod.reset();
n_low = 0;

File diff suppressed because it is too large

View File

@@ -155,6 +155,7 @@ models = [
{"name": "joyai-llm", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jdopensource/JoyAI-LLM-Flash", },
{"name": "kanana2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/kakaocorp/kanana-2-30b-a3b-instruct-2601", },
{"name": "f2llmv2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/codefuse-ai/F2LLM-v2-4B", },
{"name": "sarvam-moe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/sarvamai/sarvam-30b", },
]
# some models are known to be broken upstream, so we will skip them as exceptions
@@ -175,6 +176,7 @@ pre_computed_hashes = [
{"name": "falcon-h1", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tiiuae/Falcon-H1-34B-Base", "chkhsh": "48f8e02c0359c0bbdd82f26909171fac1c18a457bb47573ed1fe3bbb2c1cfd4b"},
{"name": "kimi-k2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/moonshotai/Kimi-K2-Base", "chkhsh": "81212dc7cdb7e0c1074ca62c5aeab0d43c9f52b8a737be7b12a777c953027890"},
{"name": "qwen2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Qwen/Qwen3-Embedding-0.6B", "chkhsh": "d4540891389ea895b53b399da6ac824becc30f2fba0e9ddbb98f92e55ca0e97c"},
{"name": "qwen35", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/openbmb/MiniCPM-V-4_6", "chkhsh": "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f"},
{"name": "grok-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/alvarobartt/grok-2-tokenizer", "chkhsh": "66b8d4e19ab16c3bfd89bce5d785fb7e0155e8648708a1f42077cb9fe002c273"},
# jina-v2-de variants
{"name": "jina-v2-de", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/aari1995/German_Semantic_V3", "chkhsh": "b3d1dd861f1d4c5c0d2569ce36baf3f90fe8a102db3de50dd71ff860d91be3df"},

View File

@@ -737,6 +737,14 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Support malloc device memory more than 4GB.|
## Compile-time Flags
Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spot.
| Name | Function |
|-----------------|----------------------------------------------------------------------------------|
| DEBUG_SYCL_POOL | Enable device memory pool logging on teardown. Useful for profiling allocations. |
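For example, a minimal sketch of passing such a flag at configure time; the SYCL compiler selection shown here follows the usual oneAPI setup and is an assumption, not part of the table above:
```bash
# assumed one-off build: inject DEBUG_SYCL_POOL via CXXFLAGS when configuring a SYCL build
CXXFLAGS="-DDEBUG_SYCL_POOL" cmake -B build -DGGML_SYCL=ON \
    -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release
```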
## Design Rule
- Open to all contributors.

docs/multi-gpu.md (new file, 127 lines)
View File

@@ -0,0 +1,127 @@
# Using multiple GPUs with llama.cpp
This guide explains how to run [llama.cpp](https://github.com/ggml-org/llama.cpp) across more than one GPU. It covers the split modes, the command-line flags that control them, the limitations you need to know about, and ready-to-use recipes for `llama-cli` and `llama-server`.
The CLI arguments listed here are the same for both tools - or most llama.cpp binaries for that matter.
---
## When you need multi-GPU
Reach for multi-GPU when one of these is true:
- **The model doesn't fit in a single GPU's VRAM.** By spreading the weights across two or more GPUs, the whole model can stay on accelerators. Otherwise, part of the model has to run from the comparatively slower system RAM.
- **You want more throughput.** By distributing the computation across multiple GPUs, each individual GPU has to do less work. This can result in better prefill and/or token generation performance, depending on the split mode and interconnect speed vs. the speed of an individual GPU.
---
## The split modes
Set with `--split-mode` / `-sm`.
| Mode | What it does | When to use |
|---|---|---|
| `none` | Use a single GPU only. Pick which one with `--main-gpu`. | You explicitly want to confine the model to one GPU even though more are visible. |
| `layer` (**default**) | Pipeline parallelism. Each GPU holds a contiguous slice of layers. The KV cache for layer *l* lives on the GPU that owns layer *l*. | Default and most compatible multi-GPU choice. You want more memory than a single GPU provides and your priority is a fast prefill. Can tolerate slow interconnect speeds between GPUs. |
| `row` | **Deprecated.** Older row-split tensor-parallel path with comparatively poor performance. Splits only dense weights across GPUs. Superseded by `tensor` which should be universally superior if it can be used. | Avoid in new deployments. |
| `tensor` | **EXPERIMENTAL.** Tensor parallelism that splits both weights *and* KV across the participating GPUs via a "meta device" abstraction. | You want more memory than a single GPU provides and your priority is fast token generation. Prefill speed approaches pipeline-parallel speed for large, dense models with fast GPU interconnects. Treat it as experimental, as the code is less mature than pipeline parallelism. Performance should be good for multiple NVIDIA GPUs using the CUDA backend; no guarantees otherwise. |
> Pipeline parallel (`layer`) vs. tensor parallel (`tensor`): pipeline-parallel runs different layers on different GPUs and processes tokens sequentially through the pipeline. This minimizes data transfers between GPUs but requires many tokens to scale well. Tensor-parallel splits each layer across GPUs and does multiple cross-GPU reductions per layer. This enables parallelizing any workload but is much more bottlenecked by the GPU interconnect speed. Pipeline-parallel maximizes batch throughput; tensor-parallel minimizes latency.
---
## Command-line arguments reference
| Short | Long | Value | Default | Notes |
|---|---|---|---|---|
| `-sm` | `--split-mode` | `none` \| `layer` \| `tensor` | `layer` | See modes above. |
| `-ts` | `--tensor-split` | comma-separated proportions, e.g. `3,1` | mode-dependent | How much of the model goes to each GPU. If omitted, `layer`/`row` use automatic splitting proportional to memory, while `tensor` splits tensor segments evenly. With `3,1` on two GPUs, GPU 0 gets 75 %, GPU 1 gets 25 %. The values follow the order in `--device`. |
| `-mg` | `--main-gpu` | integer device index | `0` | The single GPU used in `--split-mode none`. |
| `-ngl` | `--n-gpu-layers` / `--gpu-layers` | integer \| `auto` \| `all` | `auto` | Maximum number of layers to keep in VRAM. Use `999` or `all` to push everything possible to the GPUs. |
| `-dev` | `--device` | comma-separated device names, or `none` | auto | Restrict which devices llama.cpp may use. See `--list-devices` for names. |
| | `--list-devices` | - | - | Print the available devices and their memory. Run this first to learn the names you'd pass to `--device`. |
| `-fa` | `--flash-attn` | `on` \| `off` \| `auto` | `auto` | Required when using `--split-mode tensor` and/or quantized V cache. Supported (and therefore enabled by default) for most combinations of models and backends. |
| `-ctk` | `--cache-type-k` | `f32` \| `f16` \| `bf16` \| `q8_0` \| `q4_0` \| ... | `f16` | KV cache type for K. |
| `-ctv` | `--cache-type-v` | same as `-ctk` | `f16` | KV cache type for V. |
| `-fit` | `--fit` | `on` \| `off` | `on` | Auto-fit unset args to device memory. **Not supported with `tensor`. You may need to manually set the `--ctx-size` to make the model fit.** |
As for any CUDA program, the environment variable `CUDA_VISIBLE_DEVICES` can be used to control which GPUs to use for the CUDA backend: if you set it, llama.cpp only sees the specified GPUs. Use `--device` for selecting GPUs from among those visible to llama.cpp; this works for any backend.
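As a small sketch (the device indices and model path are illustrative), the two mechanisms compose like this: `CUDA_VISIBLE_DEVICES` decides which physical GPUs the process can see at all, and `--device` then picks among the renumbered names that `--list-devices` reports:
```bash
# expose only physical GPUs 1 and 2; llama.cpp will enumerate them as CUDA0 and CUDA1
CUDA_VISIBLE_DEVICES=1,2 llama-cli --list-devices
# run on the second of the two visible GPUs (physical GPU 2)
CUDA_VISIBLE_DEVICES=1,2 llama-cli -m model.gguf -dev CUDA1
```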
---
## Recipes
### 1. Default - pipeline parallel across all visible GPUs
```bash
llama-cli -m model.gguf
llama-server -m model.gguf
```
Easiest configuration. KV cache spreads across the GPUs along with the layers. `--fit` (on by default) sizes things automatically.
### 2. Pipeline parallel with a custom split ratio
```bash
llama-cli -m model.gguf -ts 3,1
```
Useful when GPUs have different memory: GPU 0 (3 parts) and GPU 1 (1 part). Proportions are normalized so `-ts 3,1` is the same as e.g. `-ts 75,25`.
### 3. Single-GPU mode, picking a specific GPU
```bash
llama-cli --list-devices
llama-cli -m model.gguf -dev CUDA1
```
Use only the device listed as `CUDA1` in the `--list-devices` output.
### 4. Tensor parallelism (experimental)
```bash
llama-cli -m model.gguf -sm tensor -ctk f16 -ctv f16
```
- `--flash-attn off` (or `--flash-attn auto` resolving to `off` when flash attention isn't supported) is a hard error.
- KV cache types must be non-quantized: `f32`, `f16`, or `bf16`. Support for quantized KV cache is not implemented and trying to use it will result in an error.
- Mark this configuration as experimental in your tooling: validate output quality before deploying.
- `--split-mode tensor` is not implemented for all architectures. The following will fail with *"LLAMA_SPLIT_MODE_TENSOR not implemented for architecture '...'"*:
- **MoE / hybrid:** Grok, MPT, OLMoE, DeepSeek2, GLM-DSA, Nemotron-H, Nemotron-H-MoE, Granite-Hybrid, LFM2-MoE, Minimax-M2, Mistral4, Kimi-Linear, Jamba, Falcon-H1
- **State-space / RWKV-style:** Mamba, Mamba2 (and the hybrid Mamba-attention models above)
- **Other:** PLAMO2, MiniCPM3, Gemma-3n, OLMo2, BitNet, T5
### 5. With NCCL
There's no runtime flag for NCCL - it's selected at build time (`-DGGML_CUDA_NCCL=ON`, which is the default). Note that NCCL is **not** automatically distributed with CUDA and you may need to install it manually - when in doubt, check the CMake log to see whether it can find the package. When llama.cpp is compiled with NCCL support, it uses NCCL automatically for cross-GPU reductions in `tensor` mode. When NCCL is missing on a multi-GPU build, you'll see this one-time warning and performance will be lower:
```
NVIDIA Collective Communications Library (NCCL) is unavailable, multi GPU performance will be suboptimal
```
When using the "ROCm" backend (which is the ggml CUDA code translated for AMD via HIP), the AMD equivalent RCCL can be used by compiling with `-DGGML_HIP_RCCL=ON`. Note that RCCL is by default *disabled* because (unlike NCCL) it was not universally beneficial during testing.
### 6. With CUDA peer-to-peer access (`GGML_CUDA_P2P`)
CUDA peer-to-peer (P2P) lets GPUs transfer data directly between each other instead of going through system memory, which generally improves multi-GPU performance. It is **opt-in** at runtime - set the environment variable `GGML_CUDA_P2P` to any value to enable it:
```bash
GGML_CUDA_P2P=1 llama-cli -m model.gguf -sm tensor
```
P2P requires driver support (usually restricted to workstation/datacenter GPUs) and **may cause crashes or corrupted outputs on some motherboards or BIOS configurations** (e.g. when IOMMU is enabled). If you see instability after enabling it, unset the variable.
---
## Troubleshooting
| Symptom | How to fix |
|---|---|
| Startup error *"SPLIT_MODE_TENSOR requires flash_attn to be enabled"* | Add `-fa on` or remove `-fa off`. |
| Startup error *"simultaneous use of SPLIT_MODE_TENSOR and KV cache quantization not implemented"* | Use `-ctk f16 -ctv f16` (or `bf16`/`f32`) with `--split-mode tensor`. |
| Startup error *"LLAMA_SPLIT_MODE_TENSOR not implemented for architecture 'X'"* | Architecture not on the TENSOR allow-list. Use `--split-mode layer`. |
| Warning *"NCCL is unavailable, multi GPU performance will be suboptimal"* | llama.cpp wasn't built with NCCL. Either accept the lower performance or install NCCL and rebuild. |
| CUDA OOM at startup or during prefill in `--split-mode tensor` | Auto-fit is disabled in this mode, so reduce memory pressure yourself. In order from least to most disruptive: lower `--ctx-size` (`-c`) (KV cache is roughly proportional to `n_ctx`); for `llama-server`, lower `--parallel` (`-np`) (a slot KV cache is allocated per concurrent sequence); as a last resort, reduce `--n-gpu-layers` (`-ngl`) (the remaining layers run on CPU and inference will be much slower). |
| Performance is worse with multi-GPU than single-GPU | The performance is bottlenecked by GPU interconnect speed. For `--split-mode tensor`, verify that NCCL is being used. Try `--split-mode layer` (less communication than `tensor`). Increase GPU interconnect speed via more PCIe lanes or e.g. NVLink (if available). |
| GPU not used at all | `--n-gpu-layers` is `0` or too low - try explicitly setting `-ngl all`. Or you are accidentally hiding the GPUs via an environment variable like `CUDA_VISIBLE_DEVICES=-1`. Or your build doesn't include support for the relevant backend. |
| Crashes or corrupted outputs after setting `GGML_CUDA_P2P=1` | Some motherboards and BIOS settings (e.g. with IOMMU enabled) don't support CUDA peer-to-peer reliably. Unset `GGML_CUDA_P2P`. |

View File

@@ -0,0 +1,49 @@
## MiniCPM-V 4.6
### Prepare models and code
Download [MiniCPM-V-4_6](https://huggingface.co/openbmb/MiniCPM-V-4_6) PyTorch model from huggingface to "MiniCPM-V-4_6" folder.
The model must be the standard `transformers` v5.7.0+ checkpoint (no `trust_remote_code`); the architecture in `config.json` is `MiniCPMV4_6ForConditionalGeneration` with a `qwen3_5_text` text model and a SigLIP-based vision tower plus a window-attention `vit_merger`.
### Build llama.cpp
If the steps below differ for your setup, refer to the official build [documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).
Clone llama.cpp:
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
Build llama.cpp using `CMake`:
```bash
cmake -B build
cmake --build build --config Release
```
### Usage of MiniCPM-V 4.6
Unlike older MiniCPM-V variants, MiniCPM-V 4.6 is converted directly through `convert_hf_to_gguf.py`. The same script is invoked twice on the original Hugging Face directory: once to produce the language-model GGUF and once with `--mmproj` to produce the multimodal projector GGUF.
```bash
# language model
python ./convert_hf_to_gguf.py ../MiniCPM-V-4_6 --outfile ../MiniCPM-V-4_6/ggml-model-f16.gguf
# multimodal projector (vision tower + window-attention vit_merger + DownsampleMLP merger)
python ./convert_hf_to_gguf.py ../MiniCPM-V-4_6 --mmproj --outfile ../MiniCPM-V-4_6/mmproj-model-f16.gguf
# optional: quantize to Q4_K_M
./build/bin/llama-quantize ../MiniCPM-V-4_6/ggml-model-f16.gguf ../MiniCPM-V-4_6/ggml-model-Q4_K_M.gguf Q4_K_M
```
Inference on Linux or Mac
```bash
# run in single-turn mode
./build/bin/llama-mtmd-cli -m ../MiniCPM-V-4_6/ggml-model-f16.gguf --mmproj ../MiniCPM-V-4_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
# run in conversation mode
./build/bin/llama-mtmd-cli -m ../MiniCPM-V-4_6/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-V-4_6/mmproj-model-f16.gguf
```

View File

@@ -17,7 +17,7 @@ Legend:
| ABS | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| ACC | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| ADD | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| ADD1 | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | | ✅ | ❌ | ❌ | ❌ |
| ADD1 | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | | ✅ | ❌ | ❌ | ❌ |
| ADD_ID | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ARANGE | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ARGMAX | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
@@ -36,15 +36,15 @@ Legend:
| CPY | ❌ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| CROSS_ENTROPY_LOSS | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CROSS_ENTROPY_LOSS_BACK | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| CUMSUM | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | | ✅ | ✅ | ❌ | ❌ |
| DIAG | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | | ✅ | ✅ | ❌ | ❌ |
| CUMSUM | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | | ✅ | ✅ | ❌ | ❌ |
| DIAG | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | | ✅ | ✅ | ❌ | ❌ |
| DIAG_MASK_INF | ❌ | ✅ | ✅ | ✅ | ❌ | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| DIV | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| DUP | ❌ | ✅ | ✅ | 🟡 | 🟡 | 🟡 | ✅ | ✅ | ❌ | ❌ | ❌ |
| ELU | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| EXP | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| EXPM1 | ❌ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| FILL | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | | ✅ | ✅ | ❌ | ❌ |
| FILL | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | | ✅ | ✅ | ❌ | ❌ |
| FLASH_ATTN_EXT | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | 🟡 | ❌ | ❌ |
| FLOOR | ❌ | ❌ | ✅ | 🟡 | ✅ | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| GATED_DELTA_NET | ❌ | ❌ | ✅ | ❌ | 🟡 | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ |
@@ -101,11 +101,11 @@ Legend:
| SOFTPLUS | ❌ | ❌ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SOFT_MAX | ❌ | 🟡 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SOFT_MAX_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | 🟡 | ✅ | ❌ | ❌ | ❌ |
| SOLVE_TRI | ❌ | ❌ | ✅ | 🟡 | ✅ | ❌ | | ✅ | ✅ | ❌ | ❌ |
| SOLVE_TRI | ❌ | ❌ | ✅ | 🟡 | ✅ | ❌ | 🟡 | ✅ | ✅ | ❌ | ❌ |
| SQR | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SQRT | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| SSM_CONV | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SSM_SCAN | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | | 🟡 | ✅ | ❌ | ❌ |
| SSM_SCAN | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | 🟡 | 🟡 | ✅ | ❌ | ❌ |
| STEP | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |
| SUB | ❌ | ✅ | ✅ | ✅ | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| SUM | ❌ | 🟡 | ✅ | 🟡 | 🟡 | ❌ | 🟡 | 🟡 | 🟡 | ❌ | ❌ |

File diff suppressed because it is too large

View File

@@ -33,18 +33,18 @@ An example to use this approach can be the rewriting of source code by a LLM.
This implementation looks for the last n-gram in history that matches the current n-gram and creates a draft using the m tokens following the matched n-gram. It is the simplest self-speculative approach with minimal overhead.
```
llama-server [...] --spec-type ngram-simple --draft-max 64
llama-server [...] --spec-type ngram-simple --spec-draft-n-max 64
```
#### n-gram Map Key (`ngram-map-k`)
This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (argument `--spec-ngram-min-hits`, default is 1) before generating drafts.
This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (argument `--spec-ngram-map-k-min-hits`, default is 1) before generating drafts.
The number of accepted tokens is stored for each used n-gram.
**Example:**
```
llama-server [...] --spec-type ngram-map-k --draft-max 64
llama-server [...] --spec-type ngram-map-k --spec-draft-n-max 64
```
#### n-gram Map Key-4-Values (`ngram-map-k4v`)
@@ -55,7 +55,7 @@ The number of accepted tokens is stored for each used n-gram.
**Example:** Server options to be used if there are a lot of longer repetitions.
```
llama-server [...] --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-max 64
llama-server [...] --spec-type ngram-map-k4v --spec-ngram-map-k4v-size-n 8 --spec-ngram-map-k4v-size-m 8 --spec-ngram-map-k4v-min-hits 2 --spec-draft-n-max 64
```
### n-gram Mod (`ngram-mod`)
@@ -80,9 +80,9 @@ Currently, a single hash pool is shared across all server slots, so different re
# notes:
# - small `n` are not recommended
# - MoEs require long drafts
# - dense models: can reduce `--draft-min` and `--draft-max`
# - dense models: can reduce `--spec-ngram-mod-n-min` and `--spec-ngram-mod-n-max`
llama-server ... --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
llama-server ... --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64
```
Applications:
@@ -105,21 +105,90 @@ Example Video:
If a draft model is combined with draftless decoding, the draftless decoding takes precedence.
### General Speculative Parameters
```
--draft, --draft-n, --draft-max N number of tokens to draft for speculative decoding (default: 16)
(env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N minimum number of draft tokens to use for speculative decoding
(default: 0)
(env: LLAMA_ARG_DRAFT_MIN)
[...]
--spec-type [none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]
type of speculative decoding to use when no draft model is provided
(default: none)
--spec-ngram-size-n N ngram size N for ngram-simple/ngram-map speculative decoding, length
of lookup n-gram (default: 12)
--spec-ngram-size-m N ngram size M for ngram-simple/ngram-map speculative decoding, length
of draft m-gram (default: 48)
--spec-ngram-min-hits N minimum hits for ngram-map speculative decoding (default: 1)
(env: LLAMA_ARG_SPEC_TYPE)
--spec-default use default speculative decoding
```
### Draft Model Parameters
```
--spec-draft-model, -md, --model-draft FNAME
draft model for speculative decoding (default: unused)
(env: LLAMA_ARG_SPEC_DRAFT_MODEL)
--spec-draft-hf, -hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]
HuggingFace repository for the draft model
--spec-draft-n-max N
number of tokens to draft for speculative decoding (default: 16)
(env: LLAMA_ARG_SPEC_DRAFT_N_MAX)
--spec-draft-n-min N
minimum number of draft tokens to use for speculative decoding (default: 0)
(env: LLAMA_ARG_SPEC_DRAFT_N_MIN)
--spec-draft-p-split, --draft-p-split P
speculative decoding split probability (default: 0.10)
(env: LLAMA_ARG_SPEC_DRAFT_P_SPLIT)
--spec-draft-p-min, --draft-p-min P
minimum speculative decoding probability (greedy) (default: 0.75)
(env: LLAMA_ARG_SPEC_DRAFT_P_MIN)
--spec-draft-ctx-size, -cd, --ctx-size-draft N
size of the prompt context for the draft model (default: 0, 0 = loaded from model)
(env: LLAMA_ARG_SPEC_DRAFT_CTX_SIZE)
--spec-draft-ngl, -ngld, --gpu-layers-draft, --n-gpu-layers-draft N
max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)
(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
--spec-draft-device, -devd, --device-draft <dev1,dev2,..>
comma-separated list of devices to use for offloading the draft model
--spec-draft-replace, --spec-replace TARGET DRAFT
translate the string in TARGET into DRAFT if the draft model and main model are not compatible
```
### n-gram Mod Parameters
```
--spec-ngram-mod-n-match N
ngram-mod lookup length (default: 24)
--spec-ngram-mod-n-min N
minimum number of ngram tokens to use for ngram-based speculative decoding (default: 48)
--spec-ngram-mod-n-max N
maximum number of ngram tokens to use for ngram-based speculative decoding (default: 64)
```
### n-gram Simple Parameters
```
--spec-ngram-simple-size-n N
ngram size N for ngram-simple speculative decoding, length of lookup n-gram (default: 12)
--spec-ngram-simple-size-m N
ngram size M for ngram-simple speculative decoding, length of draft m-gram (default: 48)
--spec-ngram-simple-min-hits N
minimum hits for ngram-simple speculative decoding (default: 1)
```
### n-gram Map Key Parameters
```
--spec-ngram-map-k-size-n N
ngram size N for ngram-map-k speculative decoding, length of lookup n-gram (default: 12)
--spec-ngram-map-k-size-m N
ngram size M for ngram-map-k speculative decoding, length of draft m-gram (default: 48)
--spec-ngram-map-k-min-hits N
minimum hits for ngram-map-k speculative decoding (default: 1)
```
### n-gram Map Key-4-Values Parameters
```
--spec-ngram-map-k4v-size-n N
ngram size N for ngram-map-k4v speculative decoding, length of lookup n-gram (default: 12)
--spec-ngram-map-k4v-size-m N
ngram size M for ngram-map-k4v speculative decoding, length of draft m-gram (default: 48)
--spec-ngram-map-k4v-min-hits N
minimum hits for ngram-map-k4v speculative decoding (default: 1)
```
### `--spec-type TYPE`
@@ -140,21 +209,40 @@ Specifies a type of speculative decoding without draft model.
./llama-server [...] --spec-type ngram-simple
```
### `--spec-ngram-size-n N`
### `--spec-ngram-*-size-n N`
Sets the size N of the lookup n-gram for n-gram map based speculative decoding.
The n-gram size N determines how many tokens in a row to look back when searching for matching patterns.
### `--spec-ngram-size-m M`
Each n-gram implementation has its own parameter:
- `--spec-ngram-simple-size-n` for `ngram-simple`
- `--spec-ngram-map-k-size-n` for `ngram-map-k`
- `--spec-ngram-map-k4v-size-n` for `ngram-map-k4v`
- `--spec-ngram-mod-n-match` for `ngram-mod`
### `--spec-ngram-*-size-m M`
Sets the size M of the draft m-gram for n-gram map based speculative decoding.
The m-gram size determines how many tokens to draft when a match is found.
Larger values can provide more speedup but may reduce acceptance rate.
### `--spec-ngram-min-hits H`
Each n-gram implementation has its own parameter:
- `--spec-ngram-simple-size-m` for `ngram-simple`
- `--spec-ngram-map-k-size-m` for `ngram-map-k`
- `--spec-ngram-map-k4v-size-m` for `ngram-map-k4v`
### `--spec-ngram-*-min-hits H`
This option defines how often a key has to appear in the token history to be used as a draft (default is 1).
Each n-gram implementation has its own parameter:
- `--spec-ngram-simple-min-hits` for `ngram-simple`
- `--spec-ngram-map-k-min-hits` for `ngram-map-k`
- `--spec-ngram-map-k4v-min-hits` for `ngram-map-k4v`
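As an illustrative combination (the model path and values are examples, not recommendations), an `ngram-map-k` invocation that sets all three per-implementation parameters together with the draft length:
```
llama-server -m model.gguf --spec-type ngram-map-k --spec-ngram-map-k-size-n 12 --spec-ngram-map-k-size-m 48 --spec-ngram-map-k-min-hits 2 --spec-draft-n-max 64
```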
## Statistics
Each speculative decoding implementation prints statistics.
@@ -180,4 +268,3 @@ statistics ngram_map_k: #calls(b,g,a) = 6 1690 26, #gen drafts = 26, #acc drafts
- `#gen tokens`: number of tokens generated by this implementation (including rejected tokens)
- `#acc tokens`: number of tokens accepted by the main model
- `dur(b,g,a)`: durations of begin (new prompt), generation, and accumulation (process acceptance).

View File

@@ -1,5 +1,10 @@
set(TARGET llama-diffusion)
add_library(${TARGET} STATIC diffusion.cpp diffusion.h)
target_link_libraries(${TARGET} PUBLIC llama llama-common ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PUBLIC cxx_std_17)
set(TARGET llama-diffusion-cli)
add_executable(${TARGET} diffusion-cli.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama llama-common ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(${TARGET} PRIVATE llama-diffusion llama llama-common ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -12,11 +12,11 @@ The diffusion CLI supports various parameters to control the generation process:
### Core Diffusion Parameters
- `--diffusion-steps`: Number of diffusion steps (default: 256)
- `--diffusion-algorithm`: Algorithm for token selection
- `0`: ORIGIN - Token will be generated in a purely random order from https://arxiv.org/abs/2107.03006.
- `1`: ENTROPY_BASED - Entropy-based selection
- `2`: MARGIN_BASED - Margin-based selection
- `3`: RANDOM - Random selection
- `4`: CONFIDENCE_BASED - Confidence-based selection (default)
- `0`: DIFFUSION_ALGORITHM_ORIGIN - Token will be generated in a purely random order from https://arxiv.org/abs/2107.03006.
- `1`: DIFFUSION_ALGORITHM_ENTROPY_BASED - Entropy-based selection
- `2`: DIFFUSION_ALGORITHM_MARGIN_BASED - Margin-based selection
- `3`: DIFFUSION_ALGORITHM_RANDOM - Random selection
- `4`: DIFFUSION_ALGORITHM_CONFIDENCE_BASED - Confidence-based selection (default)
- More documentation here https://github.com/DreamLM/Dream
- `--diffusion-visual`: Enable live visualization during generation
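For example, a hypothetical invocation (the model file, prompt, and parameter values are illustrative only):
```bash
./build/bin/llama-diffusion-cli -m diffusion-model-f16.gguf -p "Write a haiku about autumn." \
    --diffusion-steps 256 --diffusion-algorithm 4 --diffusion-visual
```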

View File

@@ -1,127 +1,23 @@
#include "arg.h"
#include "chat.h"
#include "common.h"
#include "diffusion.h"
#include "llama.h"
#include "log.h"
#include <limits.h>
#include <algorithm>
#include <clocale>
#include <cmath>
#include <cstring>
#include <limits>
#include <random>
#include <string>
#include <vector>
enum diffusion_algorithm { ORIGIN = 0, ENTROPY_BASED = 1, MARGIN_BASED = 2, RANDOM = 3, CONFIDENCE_BASED = 4 };
// Unified transfer scheduling methods
enum transfer_schedule {
TIMESTEP_BASED = 0, // Dream-style: (1.0 - s/t) * remaining
BLOCK_BASED = 1, // LLaDA-style: process in blocks with get_num_transfer_tokens
};
typedef bool (*diffusion_step_callback_t)(int32_t step,
int32_t total_steps,
const llama_token * tokens,
int32_t n_tokens,
void * user_data);
struct diffusion_params {
int32_t steps = 0;
float temperature = 0;
llama_token mask_token_id = LLAMA_TOKEN_NULL;
diffusion_step_callback_t step_callback = nullptr;
void * step_callback_user_data = nullptr;
int32_t seed = 0;
bool visual_mode = false;
bool shift_logits = false; // Shift logits by -1 after decode
float top_p = 0.;
int32_t top_k = 0.;
diffusion_algorithm algorithm = CONFIDENCE_BASED;
transfer_schedule schedule = TIMESTEP_BASED;
float cfg_scale = 0.; // Config scale for classifier-free guidance
float eps = 0.; // Timestep scheduling
int32_t block_length = 0; // Block size (for block scheduling)
float alg_temp = 0; // algorithm temperature (0.0 = deterministic)
bool add_gumbel_noise = false; // Add gumbel noise to the logits if temp > 0.0
int32_t max_length = 0; // Maximum sequence length
};
struct callback_data {
diffusion_params * diff_params;
const llama_vocab * vocab;
int32_t n_input;
};
static float calculate_confidence(const llama_token_data_array & cur_p,
diffusion_algorithm algorithm,
std::mt19937 & rng) {
switch (algorithm) {
case CONFIDENCE_BASED:
return cur_p.data[cur_p.selected].p; // Selected token probability
case ENTROPY_BASED:
{
float entropy = 0.0f;
const float epsilon = 1e-10f;
for (size_t i = 0; i < cur_p.size; i++) {
float prob = cur_p.data[i].p;
entropy += prob * logf(prob + epsilon);
}
return -entropy; // Higher entropy = lower confidence
}
case MARGIN_BASED:
return (cur_p.size > 1) ? cur_p.data[0].p - cur_p.data[1].p : cur_p.data[0].p;
case RANDOM:
{
std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
return uniform(rng); // Random confidence
}
case ORIGIN:
return cur_p.data[cur_p.selected].p;
default:
return 0.0f;
}
}
// Unified transfer count calculation function
static int32_t calculate_transfer_count(int32_t step,
int32_t total_steps,
int32_t remaining_masked,
transfer_schedule schedule,
float eps,
const std::vector<int32_t> & num_transfer_tokens = {}) {
switch (schedule) {
case TIMESTEP_BASED:
{
float t = 1.0f - (float) step / total_steps * (1.0f - eps);
float s = 1.0f - (float) (step + 1) / total_steps * (1.0f - eps);
float p_transfer = (step < total_steps - 1) ? (1.0f - s / t) : 1.0f;
return (int32_t) (remaining_masked * p_transfer);
}
case BLOCK_BASED:
if (!num_transfer_tokens.empty() && step < (int32_t) num_transfer_tokens.size()) {
return num_transfer_tokens[step];
}
return remaining_masked / (total_steps - step); // Fallback
default:
return remaining_masked / (total_steps - step);
}
}
static bool diffusion_step_callback(int32_t step,
int32_t total_steps,
const llama_token * tokens,
@@ -176,341 +72,6 @@ static bool diffusion_step_callback(int32_t step,
return true;
}
static void add_gumbel_noise(float * logits, int32_t n_vocab, float temperature, std::mt19937 & rng) {
if (temperature == 0.0f) {
return;
}
std::uniform_real_distribution<double> uniform(0.0, 1.0);
for (int32_t i = 0; i < n_vocab; i++) {
double noise = uniform(rng);
// Prevent log(0)
noise = std::max(noise, 1e-20);
double gumbel_noise = std::pow(-std::log(noise), temperature);
logits[i] = std::exp(logits[i]) / gumbel_noise;
}
}
static std::vector<int32_t> get_num_transfer_tokens(int32_t mask_count, int32_t steps) {
std::vector<int32_t> num_transfer_tokens(steps);
int32_t base = mask_count / steps;
int32_t remainder = mask_count % steps;
for (int32_t i = 0; i < steps; i++) {
num_transfer_tokens[i] = base + (i < remainder ? 1 : 0);
}
return num_transfer_tokens;
}
static void diffusion_generate(llama_context * ctx,
const llama_token * input_tokens,
llama_token * output_tokens,
int32_t n_input,
const diffusion_params & params,
int32_t & n_generated) {
n_generated = 0;
if (!ctx || !input_tokens || !output_tokens || n_input <= 0 || params.max_length <= n_input) {
return;
}
const llama_model * model = llama_get_model(ctx);
// Initialize with input and pad with mask tokens
std::copy(input_tokens, input_tokens + n_input, output_tokens);
std::fill(output_tokens + n_input, output_tokens + params.max_length, params.mask_token_id);
std::mt19937 rng(params.seed);
llama_set_causal_attn(ctx, false);
int32_t n_vocab = llama_vocab_n_tokens(llama_model_get_vocab(model));
std::vector<llama_token_data> candidates(n_vocab);
std::vector<llama_token_data> conf_candidates;
conf_candidates.reserve(params.max_length);
std::vector<int32_t> mask_positions;
mask_positions.reserve(params.max_length);
// Setup sampler chain
struct llama_sampler * sampler = llama_sampler_chain_init(llama_sampler_chain_default_params());
if (params.top_k > 0) {
llama_sampler_chain_add(sampler, llama_sampler_init_top_k(params.top_k));
}
if (params.top_p < 1.0f) {
llama_sampler_chain_add(sampler, llama_sampler_init_top_p(params.top_p, 1));
}
if (params.temperature > 0.0f) {
llama_sampler_chain_add(sampler, llama_sampler_init_temp(params.temperature));
}
llama_sampler_chain_add(sampler, llama_sampler_init_dist(params.seed));
struct llama_sampler * dist_sampler = llama_sampler_init_dist(params.seed);
llama_batch batch = llama_batch_init(params.max_length, 0, 1);
batch.n_tokens = params.max_length;
// Pre-allocate buffers for CFG if needed
int32_t logits_size = n_vocab * params.max_length;
std::vector<float> cond_logits_buffer;
std::vector<llama_token> un_x_buffer;
if (params.cfg_scale > 0.0f) {
cond_logits_buffer.resize(logits_size);
un_x_buffer.resize(params.max_length);
}
// For block-based processing
std::vector<int32_t> num_transfer_tokens;
int32_t num_blocks = 1;
int32_t steps_per_block = params.steps;
if (params.schedule == BLOCK_BASED) {
GGML_ASSERT(params.max_length % params.block_length == 0);
num_blocks = params.max_length / params.block_length;
GGML_ASSERT(params.steps % num_blocks == 0);
steps_per_block = params.steps / num_blocks;
}
std::vector<float> confidence(params.max_length);
int64_t total_sampling_time = 0;
int64_t total_time = 0;
int64_t time_start = ggml_time_us();
for (int block_num = 0; block_num < num_blocks; block_num++) {
int32_t block_start = (params.schedule == BLOCK_BASED) ? n_input + block_num * params.block_length : 0;
int32_t block_end = (params.schedule == BLOCK_BASED) ?
std::min(n_input + (block_num + 1) * params.block_length, params.max_length) :
params.max_length;
// Count masked tokens in current block for block-based processing
if (params.schedule == BLOCK_BASED) {
int32_t block_mask_count = 0;
for (int i = block_start; i < block_end; i++) {
if (output_tokens[i] == params.mask_token_id) {
block_mask_count++;
}
}
num_transfer_tokens = get_num_transfer_tokens(block_mask_count, steps_per_block);
}
for (int32_t step = 0; step < steps_per_block; step++) {
int32_t global_step = block_num * steps_per_block + step;
if (params.step_callback) {
if (!params.step_callback(
global_step, params.steps, output_tokens, params.max_length, params.step_callback_user_data)) {
break;
}
}
// Setup batch
for (int32_t i = 0; i < params.max_length; i++) {
batch.token[i] = output_tokens[i];
batch.pos[i] = i;
batch.n_seq_id[i] = 1;
batch.seq_id[i][0] = 0;
batch.logits[i] = 1;
}
float * logits = nullptr;
if (params.cfg_scale > 0.0f) {
int ret = llama_decode(ctx, batch);
if (ret != 0) {
LOG_ERR("Failed to generate conditional");
break;
}
float * cond_logits_ptr = llama_get_logits(ctx);
std::memcpy(cond_logits_buffer.data(), cond_logits_ptr, logits_size * sizeof(float));
// Unconditional generation (mask input)
std::copy(output_tokens, output_tokens + params.max_length, un_x_buffer.begin());
for (int32_t i = 0; i < n_input; i++) {
un_x_buffer[i] = params.mask_token_id;
}
for (int32_t i = 0; i < params.max_length; i++) {
batch.token[i] = un_x_buffer[i];
}
ret = llama_decode(ctx, batch);
if (ret != 0) {
LOG_ERR("Failed to generate unconditional");
break;
}
float * uncond_logits = llama_get_logits(ctx);
// Apply CFG
for (int32_t i = 0; i < logits_size; i++) {
cond_logits_buffer[i] =
uncond_logits[i] + (params.cfg_scale + 1.0f) * (cond_logits_buffer[i] - uncond_logits[i]);
}
logits = cond_logits_buffer.data();
} else {
int ret = llama_decode(ctx, batch);
if (ret != 0) {
LOG_ERR("%s: failed to decode at step %d, ret = %d\n", __func__, global_step, ret);
break;
}
logits = llama_get_logits(ctx);
}
if (!logits) {
LOG_ERR("%s: failed to get logits at step %d\n", __func__, global_step);
break;
}
auto get_logits_for_pos = [&](int32_t pos) -> const float * {
if (params.shift_logits) {
return pos == 0 ? logits : logits + (pos - 1) * n_vocab;
}
return logits + (pos) *n_vocab;
};
int64_t time_start_sampling = ggml_time_us();
mask_positions.clear();
for (int32_t i = 0; i < params.max_length; i++) {
if (output_tokens[i] == params.mask_token_id) {
// For block-based, only consider current block
if (params.schedule != BLOCK_BASED || (i >= block_start && i < block_end)) {
mask_positions.push_back(i);
}
}
}
if (mask_positions.empty()) {
break;
}
if (params.add_gumbel_noise && params.temperature > 0.0f) {
add_gumbel_noise(logits, n_vocab, params.temperature, rng);
}
if (params.algorithm == ORIGIN) {
int32_t transfer_count = calculate_transfer_count(
step, steps_per_block, mask_positions.size(), params.schedule, params.eps, num_transfer_tokens);
float p_transfer = (float) transfer_count / mask_positions.size();
for (int32_t pos : mask_positions) {
if (std::uniform_real_distribution<float>(0.0f, 1.0f)(rng) < p_transfer) {
const float * pos_logits = get_logits_for_pos(pos);
for (int32_t token_id = 0; token_id < n_vocab; token_id++) {
candidates[token_id].id = token_id;
candidates[token_id].logit = pos_logits[token_id];
candidates[token_id].p = 0.0f;
}
llama_token_data_array cur_p = {
candidates.data(),
(size_t) n_vocab,
-1,
false,
};
llama_sampler_apply(sampler, &cur_p);
output_tokens[pos] = cur_p.data[cur_p.selected].id;
}
}
} else {
std::vector<std::pair<float, int32_t>> confidences;
std::vector<llama_token> sampled_tokens(mask_positions.size());
for (size_t i = 0; i < mask_positions.size(); i++) {
int32_t pos = mask_positions[i];
const float * pos_logits = get_logits_for_pos(pos);
for (int32_t token_id = 0; token_id < n_vocab; token_id++) {
candidates[token_id].logit = pos_logits[token_id];
candidates[token_id].p = 0.0f;
candidates[token_id].id = token_id;
}
llama_token_data_array cur_p = {
candidates.data(),
candidates.size(),
-1,
false,
};
llama_sampler_apply(sampler, &cur_p);
llama_token sampled_token = cur_p.data[cur_p.selected].id;
float conf = calculate_confidence(cur_p, params.algorithm, rng);
sampled_tokens[i] = sampled_token;
confidences.emplace_back(conf, i);
}
int32_t transfer_count = calculate_transfer_count(
step, steps_per_block, mask_positions.size(), params.schedule, params.eps, num_transfer_tokens);
if (transfer_count > 0) {
if (params.alg_temp == 0.0f) {
std::partial_sort(confidences.begin(),
confidences.begin() + std::min(transfer_count, (int32_t) confidences.size()),
confidences.end(),
[](const std::pair<float, int32_t> & a, const std::pair<float, int32_t> & b) {
if (a.first != b.first) {
return a.first > b.first;
}
return a.second < b.second;
});
for (int32_t i = 0; i < std::min(transfer_count, (int32_t) confidences.size()); i++) {
int32_t mask_idx = confidences[i].second;
int32_t pos = mask_positions[mask_idx];
output_tokens[pos] = sampled_tokens[mask_idx];
}
} else {
conf_candidates.clear();
for (size_t i = 0; i < confidences.size(); i++) {
float conf_logit = confidences[i].first / params.alg_temp;
conf_candidates.emplace_back(llama_token_data{ (int32_t) i, conf_logit, 0.0f });
}
llama_token_data_array conf_array = {
conf_candidates.data(),
conf_candidates.size(),
-1,
false,
};
for (int32_t i = 0; i < std::min(transfer_count, (int32_t) confidences.size()); i++) {
llama_sampler_apply(dist_sampler, &conf_array);
int32_t selected_idx = conf_array.selected;
int32_t mask_idx = selected_idx;
int32_t pos = mask_positions[mask_idx];
output_tokens[pos] = sampled_tokens[mask_idx];
conf_candidates[selected_idx].p = 0.0f;
conf_array.selected = -1;
}
}
}
}
int64_t time_end_sampling = ggml_time_us();
total_sampling_time += time_end_sampling - time_start_sampling;
}
}
int64_t time_end = ggml_time_us();
total_time += time_end - time_start;
LOG_INF("\ntotal time: %0.2fms, time per step: %0.2fms, sampling time per step: %0.2fms\n",
total_time / 1000.0,
total_time / 1000.0 / params.steps,
total_sampling_time / 1000.0 / params.steps);
llama_batch_free(batch);
llama_sampler_free(sampler);
llama_sampler_free(dist_sampler);
n_generated = params.max_length;
}
static std::string format_input_text(const std::string & prompt, const std::string & system_prompt, bool use_chat_template, llama_model * model) {
if (!use_chat_template) {
return prompt;
@@ -631,10 +192,10 @@ int main(int argc, char ** argv) {
GGML_ASSERT((params.diffusion.eps == 0) ^ (params.diffusion.block_length == 0));
if (params.diffusion.eps) {
diff_params.schedule = TIMESTEP_BASED;
diff_params.schedule = DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED;
diff_params.eps = params.diffusion.eps;
} else if (params.diffusion.block_length) {
diff_params.schedule = BLOCK_BASED;
diff_params.schedule = DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED;
diff_params.block_length = params.diffusion.block_length;
}
@@ -653,8 +214,17 @@ int main(int argc, char ** argv) {
callback_data cb_data = { &diff_params, vocab, n_input };
diff_params.step_callback_user_data = &cb_data;
const char * alg_names[] = { "ORIGIN", "ENTROPY_BASED", "MARGIN_BASED", "RANDOM", "CONFIDENCE_BASED" };
const char * sched_names[] = { "TIMESTEP_BASED", "BLOCK_BASED" };
const char * alg_names[] = {
"DIFFUSION_ALGORITHM_ORIGIN",
"DIFFUSION_ALGORITHM_ENTROPY_BASED",
"DIFFUSION_ALGORITHM_MARGIN_BASED",
"DIFFUSION_ALGORITHM_RANDOM",
"DIFFUSION_ALGORITHM_CONFIDENCE_BASED",
};
const char * sched_names[] = {
"DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED",
"DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED",
};
const char * alg_name =
(diff_params.algorithm >= 0 && diff_params.algorithm <= 4) ? alg_names[diff_params.algorithm] : "UNKNOWN";
const char * sched_name =
@@ -666,11 +236,11 @@ int main(int argc, char ** argv) {
LOG_INF("diffusion_params: - %-25s enum = %d (%s)\n", "algorithm", diff_params.algorithm, alg_name);
LOG_INF("diffusion_params: - %-25s enum = %d (%s)\n", "schedule", diff_params.schedule, sched_name);
LOG_INF("diffusion_params: - %-25s f32 = %.3f\n", "temperature", diff_params.temperature);
if (diff_params.schedule == TIMESTEP_BASED) {
if (diff_params.schedule == DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED) {
LOG_INF("diffusion_params: - %-25s f32 = %.6f\n", "eps", diff_params.eps);
LOG_INF("diffusion_params: - %-25s f32 = %.3f\n", "alg_temp", diff_params.alg_temp);
}
if (diff_params.schedule == BLOCK_BASED) {
if (diff_params.schedule == DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED) {
LOG_INF("diffusion_params: - %-25s u32 = %d\n", "block_length", diff_params.block_length);
LOG_INF("diffusion_params: - %-25s f32 = %.3f\n", "cfg_scale", diff_params.cfg_scale);
}

View File

@@ -0,0 +1,408 @@
#include "diffusion.h"
#include "log.h"
#include <algorithm>
#include <cstddef>
#include <cmath>
#include <cstring>
#include <random>
#include <utility>
#include <vector>
static float calculate_confidence(const llama_token_data_array & cur_p,
diffusion_algorithm algorithm,
std::mt19937 & rng) {
switch (algorithm) {
case DIFFUSION_ALGORITHM_CONFIDENCE_BASED:
return cur_p.data[cur_p.selected].p; // Selected token probability
case DIFFUSION_ALGORITHM_ENTROPY_BASED:
{
float entropy = 0.0f;
const float epsilon = 1e-10f;
for (size_t i = 0; i < cur_p.size; i++) {
float prob = cur_p.data[i].p;
entropy += prob * logf(prob + epsilon);
}
return -entropy; // Higher entropy = lower confidence
}
case DIFFUSION_ALGORITHM_MARGIN_BASED:
return (cur_p.size > 1) ? cur_p.data[0].p - cur_p.data[1].p : cur_p.data[0].p;
case DIFFUSION_ALGORITHM_RANDOM:
{
std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
return uniform(rng); // Random confidence
}
case DIFFUSION_ALGORITHM_ORIGIN:
return cur_p.data[cur_p.selected].p;
default:
return 0.0f;
}
}
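// Worked example (illustrative only, not part of the patch): for a toy
// distribution p = {0.7, 0.2, 0.1}, the ENTROPY_BASED branch accumulates
// entropy = 0.7*log(0.7) + 0.2*log(0.2) + 0.1*log(0.1) ≈ -0.802 and returns
// -entropy ≈ 0.802; the MARGIN_BASED branch returns cur_p.data[0].p - cur_p.data[1].p
// (0.5 when the two largest probabilities are 0.7 and 0.2); CONFIDENCE_BASED
// simply returns the selected token's probability.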
// Unified transfer count calculation function
static int32_t calculate_transfer_count(int32_t step,
int32_t total_steps,
int32_t remaining_masked,
diffusion_transfer_schedule schedule,
float eps,
const std::vector<int32_t> & num_transfer_tokens = {}) {
switch (schedule) {
case DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED:
{
float t = 1.0f - (float) step / total_steps * (1.0f - eps);
float s = 1.0f - (float) (step + 1) / total_steps * (1.0f - eps);
float p_transfer = (step < total_steps - 1) ? (1.0f - s / t) : 1.0f;
return (int32_t) (remaining_masked * p_transfer);
}
case DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED:
if (!num_transfer_tokens.empty() && step < (int32_t) num_transfer_tokens.size()) {
return num_transfer_tokens[step];
}
return remaining_masked / (total_steps - step); // Fallback
default:
return remaining_masked / (total_steps - step);
}
}
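// Worked example (illustrative only): with eps = 0 and total_steps = 4 the
// TIMESTEP_BASED branch yields
//   step 0: t = 1.00, s = 0.75, p_transfer = 0.25
//   step 1: t = 0.75, s = 0.50, p_transfer = 1/3
//   step 2: t = 0.50, s = 0.25, p_transfer = 0.50
//   step 3: last step, p_transfer = 1.0
// so roughly the same number of tokens is unmasked at each step, and every
// still-masked token is guaranteed to be resolved on the final step.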
static void add_gumbel_noise(float * logits, int32_t n_vocab, float temperature, std::mt19937 & rng) {
if (temperature == 0.0f) {
return;
}
std::uniform_real_distribution<double> uniform(0.0, 1.0);
for (int32_t i = 0; i < n_vocab; i++) {
double noise = uniform(rng);
// Prevent log(0)
noise = std::max(noise, 1e-20);
double gumbel_noise = std::pow(-std::log(noise), temperature);
logits[i] = std::exp(logits[i]) / gumbel_noise;
}
}
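// Note (added for clarity, not part of the patch): in log space this transform is
// log(exp(l) / (-log u)^T) = l + T * g with g = -log(-log u), i.e. each logit gets
// standard Gumbel noise scaled by the temperature -- the usual Gumbel-max
// reparameterization of temperature sampling.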
static std::vector<int32_t> get_num_transfer_tokens(int32_t mask_count, int32_t steps) {
std::vector<int32_t> num_transfer_tokens(steps);
int32_t base = mask_count / steps;
int32_t remainder = mask_count % steps;
for (int32_t i = 0; i < steps; i++) {
num_transfer_tokens[i] = base + (i < remainder ? 1 : 0);
}
return num_transfer_tokens;
}
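// Worked example (illustrative only): mask_count = 10 over steps = 4 gives
// base = 2, remainder = 2, so the per-step quota is {3, 3, 2, 2} -- the remainder
// is spread over the earliest steps and the quotas sum back to mask_count.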
void diffusion_generate(llama_context * ctx,
const llama_token * input_tokens,
llama_token * output_tokens,
int32_t n_input,
const diffusion_params & params,
int32_t & n_generated) {
n_generated = 0;
if (!ctx || !input_tokens || !output_tokens || n_input <= 0 || params.max_length <= n_input) {
return;
}
const llama_model * model = llama_get_model(ctx);
// Initialize with input and pad with mask tokens
std::copy(input_tokens, input_tokens + n_input, output_tokens);
std::fill(output_tokens + n_input, output_tokens + params.max_length, params.mask_token_id);
std::mt19937 rng(params.seed);
llama_set_causal_attn(ctx, false);
int32_t n_vocab = llama_vocab_n_tokens(llama_model_get_vocab(model));
std::vector<llama_token_data> candidates(n_vocab);
std::vector<llama_token_data> conf_candidates;
conf_candidates.reserve(params.max_length);
std::vector<int32_t> mask_positions;
mask_positions.reserve(params.max_length);
// Setup sampler chain
struct llama_sampler * sampler = llama_sampler_chain_init(llama_sampler_chain_default_params());
if (params.top_k > 0) {
llama_sampler_chain_add(sampler, llama_sampler_init_top_k(params.top_k));
}
if (params.top_p < 1.0f) {
llama_sampler_chain_add(sampler, llama_sampler_init_top_p(params.top_p, 1));
}
if (params.temperature > 0.0f) {
llama_sampler_chain_add(sampler, llama_sampler_init_temp(params.temperature));
}
llama_sampler_chain_add(sampler, llama_sampler_init_dist(params.seed));
struct llama_sampler * dist_sampler = llama_sampler_init_dist(params.seed);
llama_batch batch = llama_batch_init(params.max_length, 0, 1);
batch.n_tokens = params.max_length;
// Pre-allocate buffers for CFG if needed
int32_t logits_size = n_vocab * params.max_length;
std::vector<float> cond_logits_buffer;
std::vector<llama_token> un_x_buffer;
if (params.cfg_scale > 0.0f) {
cond_logits_buffer.resize(logits_size);
un_x_buffer.resize(params.max_length);
}
// For block-based processing
std::vector<int32_t> num_transfer_tokens;
int32_t num_blocks = 1;
int32_t steps_per_block = params.steps;
if (params.schedule == DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED) {
GGML_ASSERT(params.max_length % params.block_length == 0);
num_blocks = params.max_length / params.block_length;
GGML_ASSERT(params.steps % num_blocks == 0);
steps_per_block = params.steps / num_blocks;
}
std::vector<float> confidence(params.max_length);
int64_t total_sampling_time = 0;
int64_t total_time = 0;
int64_t time_start = ggml_time_us();
for (int block_num = 0; block_num < num_blocks; block_num++) {
int32_t block_start = (params.schedule == DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED) ? n_input + block_num * params.block_length : 0;
int32_t block_end = (params.schedule == DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED) ?
std::min(n_input + (block_num + 1) * params.block_length, params.max_length) :
params.max_length;
// Count masked tokens in current block for block-based processing
if (params.schedule == DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED) {
int32_t block_mask_count = 0;
for (int i = block_start; i < block_end; i++) {
if (output_tokens[i] == params.mask_token_id) {
block_mask_count++;
}
}
num_transfer_tokens = get_num_transfer_tokens(block_mask_count, steps_per_block);
}
for (int32_t step = 0; step < steps_per_block; step++) {
int32_t global_step = block_num * steps_per_block + step;
if (params.step_callback) {
if (!params.step_callback(
global_step, params.steps, output_tokens, params.max_length, params.step_callback_user_data)) {
break;
}
}
// Setup batch
for (int32_t i = 0; i < params.max_length; i++) {
batch.token[i] = output_tokens[i];
batch.pos[i] = i;
batch.n_seq_id[i] = 1;
batch.seq_id[i][0] = 0;
batch.logits[i] = 1;
}
float * logits = nullptr;
if (params.cfg_scale > 0.0f) {
int ret = llama_decode(ctx, batch);
if (ret != 0) {
LOG_ERR("Failed to generate conditional");
break;
}
float * cond_logits_ptr = llama_get_logits(ctx);
std::memcpy(cond_logits_buffer.data(), cond_logits_ptr, logits_size * sizeof(float));
// Unconditional generation (mask input)
std::copy(output_tokens, output_tokens + params.max_length, un_x_buffer.begin());
for (int32_t i = 0; i < n_input; i++) {
un_x_buffer[i] = params.mask_token_id;
}
for (int32_t i = 0; i < params.max_length; i++) {
batch.token[i] = un_x_buffer[i];
}
ret = llama_decode(ctx, batch);
if (ret != 0) {
LOG_ERR("Failed to generate unconditional");
break;
}
float * uncond_logits = llama_get_logits(ctx);
// Apply CFG
for (int32_t i = 0; i < logits_size; i++) {
cond_logits_buffer[i] =
uncond_logits[i] + (params.cfg_scale + 1.0f) * (cond_logits_buffer[i] - uncond_logits[i]);
}
logits = cond_logits_buffer.data();
} else {
int ret = llama_decode(ctx, batch);
if (ret != 0) {
LOG_ERR("%s: failed to decode at step %d, ret = %d\n", __func__, global_step, ret);
break;
}
logits = llama_get_logits(ctx);
}
if (!logits) {
LOG_ERR("%s: failed to get logits at step %d\n", __func__, global_step);
break;
}
auto get_logits_for_pos = [&](int32_t pos) -> const float * {
if (params.shift_logits) {
return pos == 0 ? logits : logits + (pos - 1) * n_vocab;
}
return logits + pos * n_vocab;
};
int64_t time_start_sampling = ggml_time_us();
mask_positions.clear();
for (int32_t i = 0; i < params.max_length; i++) {
if (output_tokens[i] == params.mask_token_id) {
// For block-based, only consider current block
if (params.schedule != DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED || (i >= block_start && i < block_end)) {
mask_positions.push_back(i);
}
}
}
if (mask_positions.empty()) {
break;
}
if (params.add_gumbel_noise && params.temperature > 0.0f) {
add_gumbel_noise(logits, n_vocab, params.temperature, rng);
}
if (params.algorithm == DIFFUSION_ALGORITHM_ORIGIN) {
int32_t transfer_count = calculate_transfer_count(
step, steps_per_block, mask_positions.size(), params.schedule, params.eps, num_transfer_tokens);
float p_transfer = (float) transfer_count / mask_positions.size();
for (int32_t pos : mask_positions) {
if (std::uniform_real_distribution<float>(0.0f, 1.0f)(rng) < p_transfer) {
const float * pos_logits = get_logits_for_pos(pos);
for (int32_t token_id = 0; token_id < n_vocab; token_id++) {
candidates[token_id].id = token_id;
candidates[token_id].logit = pos_logits[token_id];
candidates[token_id].p = 0.0f;
}
llama_token_data_array cur_p = {
candidates.data(),
(size_t) n_vocab,
-1,
false,
};
llama_sampler_apply(sampler, &cur_p);
output_tokens[pos] = cur_p.data[cur_p.selected].id;
}
}
} else {
std::vector<std::pair<float, int32_t>> confidences;
std::vector<llama_token> sampled_tokens(mask_positions.size());
for (size_t i = 0; i < mask_positions.size(); i++) {
int32_t pos = mask_positions[i];
const float * pos_logits = get_logits_for_pos(pos);
for (int32_t token_id = 0; token_id < n_vocab; token_id++) {
candidates[token_id].logit = pos_logits[token_id];
candidates[token_id].p = 0.0f;
candidates[token_id].id = token_id;
}
llama_token_data_array cur_p = {
candidates.data(),
candidates.size(),
-1,
false,
};
llama_sampler_apply(sampler, &cur_p);
llama_token sampled_token = cur_p.data[cur_p.selected].id;
float conf = calculate_confidence(cur_p, params.algorithm, rng);
sampled_tokens[i] = sampled_token;
confidences.emplace_back(conf, i);
}
int32_t transfer_count = calculate_transfer_count(
step, steps_per_block, mask_positions.size(), params.schedule, params.eps, num_transfer_tokens);
if (transfer_count > 0) {
if (params.alg_temp == 0.0f) {
std::partial_sort(confidences.begin(),
confidences.begin() + std::min(transfer_count, (int32_t) confidences.size()),
confidences.end(),
[](const std::pair<float, int32_t> & a, const std::pair<float, int32_t> & b) {
if (a.first != b.first) {
return a.first > b.first;
}
return a.second < b.second;
});
for (int32_t i = 0; i < std::min(transfer_count, (int32_t) confidences.size()); i++) {
int32_t mask_idx = confidences[i].second;
int32_t pos = mask_positions[mask_idx];
output_tokens[pos] = sampled_tokens[mask_idx];
}
} else {
conf_candidates.clear();
for (size_t i = 0; i < confidences.size(); i++) {
float conf_logit = confidences[i].first / params.alg_temp;
conf_candidates.emplace_back(llama_token_data{ (int32_t) i, conf_logit, 0.0f });
}
llama_token_data_array conf_array = {
conf_candidates.data(),
conf_candidates.size(),
-1,
false,
};
for (int32_t i = 0; i < std::min(transfer_count, (int32_t) confidences.size()); i++) {
llama_sampler_apply(dist_sampler, &conf_array);
int32_t selected_idx = conf_array.selected;
int32_t mask_idx = selected_idx;
int32_t pos = mask_positions[mask_idx];
output_tokens[pos] = sampled_tokens[mask_idx];
conf_candidates[selected_idx].p = 0.0f;
conf_array.selected = -1;
}
}
}
}
int64_t time_end_sampling = ggml_time_us();
total_sampling_time += time_end_sampling - time_start_sampling;
}
}
int64_t time_end = ggml_time_us();
total_time += time_end - time_start;
LOG_INF("\ntotal time: %0.2fms, time per step: %0.2fms, sampling time per step: %0.2fms\n",
total_time / 1000.0,
total_time / 1000.0 / params.steps,
total_sampling_time / 1000.0 / params.steps);
llama_batch_free(batch);
llama_sampler_free(sampler);
llama_sampler_free(dist_sampler);
n_generated = params.max_length;
}

View File

@@ -0,0 +1,57 @@
#pragma once
#include "llama.h"
#include <cstdint>
enum diffusion_algorithm {
DIFFUSION_ALGORITHM_ORIGIN = 0,
DIFFUSION_ALGORITHM_ENTROPY_BASED = 1,
DIFFUSION_ALGORITHM_MARGIN_BASED = 2,
DIFFUSION_ALGORITHM_RANDOM = 3,
DIFFUSION_ALGORITHM_CONFIDENCE_BASED = 4,
};
// Unified transfer scheduling methods
enum diffusion_transfer_schedule {
DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED = 0, // Dream-style: (1.0 - s/t) * remaining
DIFFUSION_TRANSFER_SCHEDULE_BLOCK_BASED = 1, // LLaDA-style: process in blocks with get_num_transfer_tokens
};
typedef bool (*diffusion_step_callback_t)(int32_t step,
int32_t total_steps,
const llama_token * tokens,
int32_t n_tokens,
void * user_data);
struct diffusion_params {
int32_t steps = 0;
float temperature = 0;
llama_token mask_token_id = LLAMA_TOKEN_NULL;
diffusion_step_callback_t step_callback = nullptr;
void * step_callback_user_data = nullptr;
int32_t seed = 0;
bool visual_mode = false;
bool shift_logits = false; // Shift logits by -1 after decode
float top_p = 0.;
int32_t top_k = 0;
diffusion_algorithm algorithm = DIFFUSION_ALGORITHM_CONFIDENCE_BASED;
diffusion_transfer_schedule schedule = DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED;
float cfg_scale = 0.; // classifier-free guidance (CFG) scale
float eps = 0.; // epsilon for timestep-based scheduling
int32_t block_length = 0; // Block size (for block scheduling)
float alg_temp = 0; // algorithm temperature (0.0 = deterministic)
bool add_gumbel_noise = false; // Add gumbel noise to the logits if temp > 0.0
int32_t max_length = 0; // Maximum sequence length
};
void diffusion_generate(llama_context * ctx,
const llama_token * input_tokens,
llama_token * output_tokens,
int32_t n_input,
const diffusion_params & params,
int32_t & n_generated);
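// Minimal usage sketch (illustrative only, not part of the patch): assumes `ctx`
// is a llama_context created for a diffusion model, `prompt` holds n_input
// already-tokenized ids, and the mask token id has been read from the model's
// metadata -- the value below is a placeholder.
//
//   diffusion_params dp;
//   dp.steps         = 64;
//   dp.max_length    = 256;                                  // must be > n_input
//   dp.mask_token_id = /* model-specific mask id */ 151666;  // placeholder
//   dp.temperature   = 0.2f;
//   dp.seed          = 42;
//   dp.algorithm     = DIFFUSION_ALGORITHM_CONFIDENCE_BASED;
//   dp.schedule      = DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED;
//   dp.eps           = 1e-3f;
//
//   std::vector<llama_token> out(dp.max_length);
//   int32_t n_generated = 0;
//   diffusion_generate(ctx, prompt.data(), out.data(), n_input, dp, n_generated);
//   // on success n_generated == dp.max_length and out holds the unmasked sequence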

View File

@@ -38,8 +38,12 @@ int main(int argc, char ** argv) {
std::string result0;
std::string result1;
std::string result2;
std::string result3;
// init
ggml_backend_load_all();
auto llama_init = common_init_from_params(params);
auto * model = llama_init->model();
@@ -213,11 +217,83 @@ int main(int argc, char ** argv) {
n_past += 1;
}
// test on-device state save/load
auto params_ctx4 = common_context_params_to_llama(params);
params_ctx4.n_seq_max = 2;
llama_context * ctx4 = llama_init_from_model(model, params_ctx4);
llama_sampler * smpl4 = llama_sampler_chain_init(sparams);
llama_sampler_chain_add(smpl4, llama_sampler_init_dist(params.sampling.seed));
printf("\nsingle seq run: %s", params.prompt.c_str());
// load state (rng, logits, embedding and kv_cache) from file
n_token_count_out = 0;
if (!llama_state_load_file(ctx4, state_file.data(), unused_sts.data(), unused_sts.size(), &n_token_count_out)) {
fprintf(stderr, "\n%s : failed to load state\n", __func__);
return 1;
}
fprintf(stderr, "%s : loaded state with %zu tokens\n", __func__, n_token_count_out);
// restore state (last tokens)
n_past = n_token_count_out;
if (!common_replay_last_token(ctx4, tokens.back(), n_past)) {
return 1;
}
++n_past;
// save seq 0 and load into seq 1
{
// save kv of seq 0
std::vector<uint8_t> seq_store(llama_state_seq_get_size_ext(ctx4, 0, LLAMA_STATE_SEQ_FLAGS_ON_DEVICE));
const size_t ncopy = llama_state_seq_get_data_ext(ctx4, seq_store.data(), seq_store.size(), 0, LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
if (ncopy != seq_store.size()) {
fprintf(stderr, "\n%s : seq copy data length %zd does not match expected length %zd\n", __func__, ncopy, seq_store.size());
return 1;
}
fprintf(stderr, "%s : seq 0 copied, %zd bytes\n", __func__, ncopy);
// erase whole kv
llama_memory_clear(llama_get_memory(ctx4), true);
fprintf(stderr, "%s : kv cache cleared\n", __func__);
// restore kv into seq 1
const size_t nset = llama_state_seq_set_data_ext(ctx4, seq_store.data(), seq_store.size(), 1, LLAMA_STATE_SEQ_FLAGS_ON_DEVICE);
if (nset != seq_store.size()) {
fprintf(stderr, "\n%s : seq set data length %zd does not match expected length %zd\n", __func__, nset, seq_store.size());
return 1;
}
fprintf(stderr, "%s : seq 1 restored, %zd bytes\n", __func__, nset);
}
// fourth run
for (auto i = 0; i < params.n_predict; i++) {
auto next_token = llama_sampler_sample(smpl4, ctx4, -1);
auto next_token_str = common_token_to_piece(ctx4, next_token);
printf("%s", next_token_str.c_str());
result3 += next_token_str;
common_batch_clear(batch);
common_batch_add(batch, next_token, n_past, {1}, true);
if (llama_decode(ctx4, batch)) {
fprintf(stderr, "\n%s : failed to evaluate\n", __func__);
llama_batch_free(batch);
return 1;
}
n_past += 1;
}
printf("\n");
llama_sampler_free(smpl);
llama_sampler_free(smpl2);
llama_sampler_free(smpl3);
llama_sampler_free(smpl4);
llama_batch_free(batch);
@@ -226,12 +302,18 @@ int main(int argc, char ** argv) {
llama_free(ctx2);
llama_free(ctx3);
llama_free(ctx4);
if (result0 != result2) {
fprintf(stderr, "\n%s : error : the seq restore generation is different\n", __func__);
return 1;
}
if (result0 != result3) {
fprintf(stderr, "\n%s : error : the on-device seq restore generation is different\n", __func__);
return 1;
}
fprintf(stderr, "\n%s : success\n", __func__);
return 0;

View File

@@ -110,13 +110,21 @@ int main(int argc, char ** argv) {
return 1;
}
if (
llama_vocab_get_add_bos(vocab_tgt) != llama_vocab_get_add_bos(vocab_dft) ||
llama_vocab_get_add_eos(vocab_tgt) != llama_vocab_get_add_eos(vocab_dft) ||
llama_vocab_bos(vocab_tgt) != llama_vocab_bos(vocab_dft) ||
llama_vocab_eos(vocab_tgt) != llama_vocab_eos(vocab_dft)
) {
LOG_ERR("%s: draft model special tokens must match target model to use speculation\n", __func__);
if (llama_vocab_get_add_bos(vocab_tgt) != llama_vocab_get_add_bos(vocab_dft) ||
(llama_vocab_get_add_bos(vocab_tgt) && llama_vocab_bos(vocab_tgt) != llama_vocab_bos(vocab_dft))) {
LOG_ERR("%s: draft model bos tokens must match target model to use speculation. add: %d - %d, id: %d - %d)\n",
__func__,
llama_vocab_get_add_bos(vocab_tgt), llama_vocab_get_add_bos(vocab_dft),
llama_vocab_bos(vocab_tgt), llama_vocab_bos(vocab_dft));
return 1;
}
if (llama_vocab_get_add_eos(vocab_tgt) != llama_vocab_get_add_eos(vocab_dft) ||
(llama_vocab_get_add_eos(vocab_tgt) && llama_vocab_eos(vocab_tgt) != llama_vocab_eos(vocab_dft))) {
LOG_ERR("%s: draft model eos tokens must match target model to use speculation. add: %d - %d, id: %d - %d)\n",
__func__,
llama_vocab_get_add_eos(vocab_tgt), llama_vocab_get_add_eos(vocab_dft),
llama_vocab_eos(vocab_tgt), llama_vocab_eos(vocab_dft));
return 1;
}
@@ -137,11 +145,12 @@ int main(int argc, char ** argv) {
for (int i = SPEC_VOCAB_CHECK_START_TOKEN_ID; i < std::min(n_vocab_tgt, n_vocab_dft); ++i) {
const char * token_text_tgt = llama_vocab_get_text(vocab_tgt, i);
const char * token_text_dft = llama_vocab_get_text(vocab_dft, i);
if (std::strcmp(token_text_tgt, token_text_dft) != 0) {
LOG_ERR("%s: draft model vocab must match target model to use speculation but ", __func__);
LOG_ERR("token %d content differs - target '%s', draft '%s'\n", i,
common_token_to_piece(ctx_tgt, i).c_str(),
common_token_to_piece(ctx_dft, i).c_str());
common_token_to_piece(vocab_tgt, i).c_str(),
common_token_to_piece(vocab_dft, i).c_str());
return 1;
}
}

View File

@@ -111,14 +111,14 @@ if [ $GGML_SYCL_DEVICE -ne -1 ]; then
echo "Use $GGML_SYCL_DEVICE as main GPU"
#use single GPU only
GPUS_SETTING="-mg $GGML_SYCL_DEVICE -sm ${SPLIT_MODE}"
export ONEAPI_DEVICE_SELECTOR="level_zero:${$GGML_SYCL_DEVICE}"
export ONEAPI_DEVICE_SELECTOR="level_zero:${GGML_SYCL_DEVICE}"
echo "ONEAPI_DEVICE_SELECTOR=${ONEAPI_DEVICE_SELECTOR}"
else
echo "Use all Intel GPUs, including iGPU & dGPU"
GPUS_SETTING="-sm ${SPLIT_MODE}"
fi
echo "run cmd: ZES_ENABLE_SYSMAN=1 ${BIN_FILE} -m ${MODEL_FILE} -no-cnv -p "${INPUT_PROMPT}" -n 200 -e -ngl ${NGL} -s ${SEED} -c ${CONTEXT} ${GPUS_SETTING} -lv ${LOG_VERBOSE} --mmap "
echo "run cmd: ZES_ENABLE_SYSMAN=1 ${BIN_FILE} -m ${MODEL_FILE} -ngl ${NGL} -s ${SEED} -c ${CONTEXT} ${GPUS_SETTING} -lv ${LOG_VERBOSE} --mmap --host 0.0.0.0 --port 8000"
ZES_ENABLE_SYSMAN=1 ${BIN_FILE} -m ${MODEL_FILE} -ngl ${NGL} -s ${SEED} -c ${CONTEXT} ${GPUS_SETTING} -lv ${LOG_VERBOSE} --mmap --host 0.0.0.0 --port 8000

View File

@@ -119,7 +119,7 @@ if [ $GGML_SYCL_DEVICE -ne -1 ]; then
echo "Use $GGML_SYCL_DEVICE as main GPU"
#use single GPU only
GPUS_SETTING="-mg $GGML_SYCL_DEVICE -sm ${SPLIT_MODE}"
export ONEAPI_DEVICE_SELECTOR="level_zero:${$GGML_SYCL_DEVICE}"
export ONEAPI_DEVICE_SELECTOR="level_zero:${GGML_SYCL_DEVICE}"
echo "ONEAPI_DEVICE_SELECTOR=${ONEAPI_DEVICE_SELECTOR}"
else
echo "Use all Intel GPUs, including iGPU & dGPU"

flake.lock (generated, 58 lines)
View File

@@ -1,58 +0,0 @@
{
"nodes": {
"flake-parts": {
"inputs": {
"nixpkgs-lib": "nixpkgs-lib"
},
"locked": {
"lastModified": 1730504689,
"narHash": "sha256-hgmguH29K2fvs9szpq2r3pz2/8cJd2LPS+b4tfNFCwE=",
"owner": "hercules-ci",
"repo": "flake-parts",
"rev": "506278e768c2a08bec68eb62932193e341f55c90",
"type": "github"
},
"original": {
"owner": "hercules-ci",
"repo": "flake-parts",
"type": "github"
}
},
"nixpkgs": {
"locked": {
"lastModified": 1732014248,
"narHash": "sha256-y/MEyuJ5oBWrWAic/14LaIr/u5E0wRVzyYsouYY3W6w=",
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "23e89b7da85c3640bbc2173fe04f4bd114342367",
"type": "github"
},
"original": {
"owner": "NixOS",
"ref": "nixos-unstable",
"repo": "nixpkgs",
"type": "github"
}
},
"nixpkgs-lib": {
"locked": {
"lastModified": 1730504152,
"narHash": "sha256-lXvH/vOfb4aGYyvFmZK/HlsNsr/0CVWlwYvo2rxJk3s=",
"type": "tarball",
"url": "https://github.com/NixOS/nixpkgs/archive/cc2f28000298e1269cea6612cd06ec9979dd5d7f.tar.gz"
},
"original": {
"type": "tarball",
"url": "https://github.com/NixOS/nixpkgs/archive/cc2f28000298e1269cea6612cd06ec9979dd5d7f.tar.gz"
}
},
"root": {
"inputs": {
"flake-parts": "flake-parts",
"nixpkgs": "nixpkgs"
}
}
},
"root": "root",
"version": 7
}

View File

@@ -4,8 +4,8 @@ project("ggml" C CXX ASM)
### GGML Version
set(GGML_VERSION_MAJOR 0)
set(GGML_VERSION_MINOR 10)
set(GGML_VERSION_PATCH 0)
set(GGML_VERSION_MINOR 11)
set(GGML_VERSION_PATCH 1)
set(GGML_VERSION_BASE "${GGML_VERSION_MAJOR}.${GGML_VERSION_MINOR}.${GGML_VERSION_PATCH}")
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/")

View File

@@ -169,7 +169,7 @@ extern "C" {
// device type
enum ggml_backend_dev_type type;
// device id
// for PCI devices, this should be the PCI bus id formatted as "domain:bus:device.function" (e.g. "0000:01:00.0")
// for PCI devices, this should be the lower-case PCI bus id formatted as "domain:bus:device.function" (e.g. "0000:c1:00.0")
// if the id is unknown, this should be NULL
const char * device_id;
// device capabilities

View File

@@ -438,6 +438,12 @@ extern "C" {
GGML_PREC_F32 = 10,
};
// op hint
enum ggml_op_hint {
GGML_HINT_NONE = 0,
GGML_HINT_SRC0_IS_HADAMARD = 1,
};
// model file types
enum ggml_ftype {
GGML_FTYPE_UNKNOWN = -1,
@@ -1419,6 +1425,11 @@ extern "C" {
struct ggml_tensor * a,
enum ggml_prec prec);
// change the hint of a matrix multiplication
GGML_API void ggml_mul_mat_set_hint(
struct ggml_tensor * a,
enum ggml_op_hint hint);
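// Usage sketch (illustrative only, not part of the patch): the hint is set on the
// result tensor returned by ggml_mul_mat, e.g. when src0 is known to be a Hadamard
// matrix so a backend may use a fast Walsh-Hadamard transform instead of a dense matmul:
//
//   struct ggml_tensor * y = ggml_mul_mat(ctx, hadamard, x);
//   ggml_mul_mat_set_hint(y, GGML_HINT_SRC0_IS_HADAMARD);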
// indirect matrix multiplication
GGML_API struct ggml_tensor * ggml_mul_mat_id(
struct ggml_context * ctx,

View File

@@ -1826,7 +1826,24 @@ static enum ggml_status ggml_backend_meta_graph_compute(ggml_backend_t backend,
continue;
}
i = get_i_delayed(i);
const int i_delayed = get_i_delayed(i);
// If we can delay the AllReduce we need to consider the interaction with zero-sized tensor slices.
// A backend with such a slice would normally have valid data after participating in the AllReduce with a node that has
// its compute flag disabled and thus gets its data zeroed out.
// If the AllReduce is delayed then the nodes until that point also need to have their compute flag disabled.
if (i_delayed > i) {
for (size_t j = 0; j < n_backends; j++) {
auto & bcj = backend_ctx->backend_configs[j];
if ((bcj.nodes[i]->flags & GGML_TENSOR_FLAG_COMPUTE) == 0) {
for (int ii = i + 1; ii <= i_delayed; ii++) {
bcj.nodes[ii]->flags &= ~GGML_TENSOR_FLAG_COMPUTE;
}
}
}
}
i = i_delayed;
for (size_t j = 0; j < n_backends; j++) {
auto & bcj = backend_ctx->backend_configs[j];
@@ -2083,8 +2100,8 @@ static const ggml_backend_i ggml_backend_meta_i = {
/* .free = */ ggml_backend_meta_free,
/* .set_tensor_async = */ ggml_backend_meta_set_tensor_async,
/* .get_tensor_async = */ ggml_backend_meta_get_tensor_async,
/* .get_tensor_2d_async = */ nullptr,
/* .set_tensor_2d_async = */ nullptr,
/* .get_tensor_2d_async = */ nullptr,
/* .cpy_tensor_async = */ nullptr,
/* .synchronize = */ ggml_backend_meta_synchronize,
/* .graph_plan_create = */ nullptr,

View File

@@ -965,7 +965,7 @@ static void ggml_backend_sched_print_assignments(ggml_backend_sched_t sched, str
}
if (sched->debug > 1) {
ggml_backend_t tensor_backend = ggml_backend_sched_get_tensor_backend(sched, node);
GGML_LOG_DEBUG("node #%3d (%10.10s): %20.20s (%5.5s) [%5.5s %8.8s] use=%d,c=%d:", i, ggml_op_name(node->op), node->name,
GGML_LOG_DEBUG("node #%3d (%10.10s): %20.20s (%5.5s) [%5.5s %8.8s] use=%d,c=%d:", i, ggml_op_desc(node), node->name,
fmt_size(ggml_nbytes(node)), tensor_backend ? ggml_backend_name(tensor_backend) : "NULL", GET_CAUSE(node),
graph->use_counts[ggml_hash_find(&graph->visited_hash_set, node)], node->flags & GGML_TENSOR_FLAG_COMPUTE ? 1 : 0);
for (int j = 0; j < GGML_MAX_SRC; j++) {

View File

@@ -262,9 +262,9 @@ static struct ggml_backend_i blas_backend_i = {
/* .get_name = */ ggml_backend_blas_get_name,
/* .free = */ ggml_backend_blas_free,
/* .set_tensor_async = */ NULL,
/* .get_tensor_2d_async = */ NULL,
/* .set_tensor_2d_async = */ NULL,
/* .get_tensor_async = */ NULL,
/* .set_tensor_2d_async = */ NULL,
/* .get_tensor_2d_async = */ NULL,
/* .cpy_tensor_async = */ NULL,
/* .synchronize = */ NULL,
/* .graph_plan_create = */ NULL,

View File

@@ -2746,8 +2746,8 @@ static const ggml_backend_i ggml_backend_cann_interface = {
/* .free = */ ggml_backend_cann_free,
/* .set_tensor_async = */ ggml_backend_cann_set_tensor_async,
/* .get_tensor_async = */ ggml_backend_cann_get_tensor_async,
/* .get_tensor_2d_async = */ NULL,
/* .set_tensor_2d_async = */ NULL,
/* .get_tensor_2d_async = */ NULL,
/* .cpy_tensor_async = */ ggml_backend_cann_cpy_tensor_async,
/* .synchronize = */ ggml_backend_cann_synchronize,
/* .graph_plan_create = */ NULL,

View File

@@ -485,6 +485,13 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
if (GGML_RV_ZIHINTPAUSE)
string(APPEND MARCH_STR "_zihintpause")
endif()
if (GGML_CPU_RISCV64_SPACEMIT)
# `xsmtvdotii' is only required for GCC >= 15.
if (CMAKE_C_COMPILER_ID STREQUAL "GNU" AND
CMAKE_C_COMPILER_VERSION VERSION_GREATER_EQUAL 15)
string(APPEND MARCH_STR "_xsmtvdotii")
endif()
endif()
list(APPEND ARCH_FLAGS "-march=${MARCH_STR}" -mabi=lp64d)
else()
@@ -571,13 +578,13 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
# Fetch KleidiAI sources:
include(FetchContent)
set(KLEIDIAI_COMMIT_TAG "v1.22.0")
set(KLEIDIAI_DOWNLOAD_URL "https://github.com/ARM-software/kleidiai/archive/refs/tags/${KLEIDIAI_COMMIT_TAG}.tar.gz")
set(KLEIDIAI_ARCHIVE_MD5 "54049037570ab0ee0a0d126b2ba5ece1")
set(KLEIDIAI_COMMIT_TAG "v1.24.0")
set(KLEIDIAI_DOWNLOAD_URL "https://github.com/ARM-software/kleidiai/releases/download/${KLEIDIAI_COMMIT_TAG}/kleidiai-${KLEIDIAI_COMMIT_TAG}-src.tar.gz")
set(KLEIDIAI_RELEASE_ARCHIVE_MD5 "2f02ebe29573d45813e671eb304f2a00")
set(KLEIDIAI_FETCH_ARGS
URL ${KLEIDIAI_DOWNLOAD_URL}
URL_HASH MD5=${KLEIDIAI_ARCHIVE_MD5}
URL_HASH MD5=${KLEIDIAI_RELEASE_ARCHIVE_MD5}
)
if (CMAKE_VERSION VERSION_GREATER_EQUAL "3.24")
list(APPEND KLEIDIAI_FETCH_ARGS DOWNLOAD_EXTRACT_TIMESTAMP NEW)

View File

@@ -203,7 +203,6 @@
#elif defined(__riscv)
// quants.c
#define ggml_vec_dot_nvfp4_q8_0_generic ggml_vec_dot_nvfp4_q8_0
#define ggml_vec_dot_q1_0_q8_0_generic ggml_vec_dot_q1_0_q8_0
// repack.cpp
#define ggml_quantize_mat_q8_0_4x1_generic ggml_quantize_mat_q8_0_4x1
#define ggml_quantize_mat_q8_0_4x4_generic ggml_quantize_mat_q8_0_4x4

View File

@@ -5023,6 +5023,71 @@ void ggml_gemm_q8_0_4x8_q8_0(int n,
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE) && defined(__ARM_FEATURE_MATMUL_INT8)
if (svcntb() * 8 == 256) {
const block_q8_0x4 * b_ptr_base = (const block_q8_0x4 *) vx;
static const uint32_t idx_arr[8] = {0, 1, 4, 5, 2, 3, 6, 7};
svuint32_t idx = svld1(svptrue_b32(), idx_arr);
static const uint32_t idx_arr1[8] = {0, 1, 2, 3, 1, 2, 3, 0};
svuint32_t idx_sc1 = svld1(svptrue_b32(), idx_arr1);
static const uint32_t idx_arr2[8] = {0, 1, 2, 3, 0, 1, 2, 3};
svuint32_t idx_sc2 = svld1(svptrue_b32(), idx_arr2);
for (int y = 0; y < nr; y += 4) {
const block_q8_0x4 * a_ptr_base = (const block_q8_0x4 *) vy + (y / 4) * nb;
for (int x = 0; x < nc; x += ncols_interleaved) {
const block_q8_0x4 * b_ptr = b_ptr_base + (x / 4) * nb;
const block_q8_0x4 * a_ptr = a_ptr_base;
svfloat32_t acc_f32_01 = svdup_f32(0);
svfloat32_t acc_f32_23 = svdup_f32(0);
for (int b = 0; b < nb; b++) {
svint32_t acc_01 = svdup_s32(0);
svint32_t acc_23 = svdup_s32(0);
// Process 4 chunks of 8 positions each
for (int chunk = 0; chunk < 4; chunk++) {
svint8_t s_a01 = svld1rq_s8(svptrue_b8(), a_ptr->qs + chunk * 32);
svint8_t s_a23 = svld1rq_s8(svptrue_b8(), a_ptr->qs + chunk * 32 + 16);
svint8_t s_b0123 = svld1_s8(svptrue_b8(), b_ptr->qs + chunk * 32);
acc_01 = svmmla_s32(acc_01, s_a01, s_b0123);
acc_23 = svmmla_s32(acc_23, s_a23, s_b0123);
}
// Reorder outputs from 2×2 tiles to row-major
// acc[01] = [r0c0, r0c1, r1c0, r1c1, r0c2, r0c3, r1c2, r1c3]
// acc[23] = [r2c0, r2c1, r3c0, r3c1, r2c2, r2c3, r3c2, r3c3]
svint32_t row01 = svtbl_s32(acc_01, idx);
svint32_t row23 = svtbl_s32(acc_23, idx);
svfloat16_t temp1 = svld1_f16(svptrue_pat_b16(SV_VL4), (const __fp16 *) a_ptr->d);
svfloat16_t temp2 = svld1_f16(svptrue_pat_b16(SV_VL4), (const __fp16 *) b_ptr->d);
svfloat32_t sv_a_d = svtbl_f32(svcvt_f32_f16_x(svptrue_b32(), svzip1_f16(temp1, temp1)), idx_sc1);
svfloat32_t sv_b_d = svtbl_f32(svcvt_f32_f16_x(svptrue_b32(), svzip1_f16(temp2, temp2)), idx_sc2);
acc_f32_01 = svmla_f32_x(svptrue_b32(), acc_f32_01, svcvt_f32_s32_x(svptrue_b32(), row01), svmul_lane_f32(sv_b_d, sv_a_d, 0));
acc_f32_23 = svmla_f32_x(svptrue_b32(), acc_f32_23, svcvt_f32_s32_x(svptrue_b32(), row23), svmul_lane_f32(sv_b_d, sv_a_d, 2));
a_ptr++;
b_ptr++;
}
svbool_t pg4 = svptrue_pat_b32(SV_VL4);
svst1_f32(pg4, s + (y+0) * bs + x, acc_f32_01);
svst1_f32(pg4, s + (y+1) * bs + x, svext_f32(acc_f32_01, acc_f32_01, 4));
svst1_f32(pg4, s + (y+2) * bs + x, acc_f32_23);
svst1_f32(pg4, s + (y+3) * bs + x, svext_f32(acc_f32_23, acc_f32_23, 4));
}
}
return;
}
#endif // SVE compile-time end
#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
const block_q8_0x4 * b_ptr_base = (const block_q8_0x4 *) vx;

View File

@@ -480,6 +480,104 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
#endif
}
#if defined(__riscv_v)
static NOINLINE void ggml_vec_dot_q1_0_q8_0_vl256(const int n, float * GGML_RESTRICT s, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy) {
const int qk = QK1_0;
const int nb = n / qk;
assert(n % qk == 0);
const block_q1_0 * GGML_RESTRICT x = vx;
const block_q8_0 * GGML_RESTRICT y = vy;
//LMUL = 1, VLMAX = 32
const size_t vl32 = __riscv_vsetvl_e8m1(32);
assert(vl32 == 32);
const vint16m1_t zero = __riscv_vmv_v_x_i16m1(0, 1);
float sumf = 0;
for (int ib = 0; ib < nb; ++ib) {
const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
float acc = 0;
for (int k = 0; k < 4; ++k) {
const block_q8_0 * GGML_RESTRICT yb = &y[ib * 4 + k];
const vbool8_t is_not_zero = __riscv_vlm_v_b8(x[ib].qs + 4 * k, vl32);
const vint8m1_t qy = __riscv_vle8_v_i8m1(yb->qs, vl32);
const vint8m1_t neg_qy = __riscv_vneg_v_i8m1(qy, vl32);
const vint8m1_t sy = __riscv_vmerge_vvm_i8m1(neg_qy, qy, is_not_zero, vl32);
const vint16m1_t red = __riscv_vwredsum_vs_i8m1_i16m1(sy, zero, vl32);
acc += GGML_CPU_FP16_TO_FP32(yb->d) * (float)__riscv_vmv_x_s_i16m1_i16(red);
}
sumf += d0 * acc;
}
*s = sumf;
}
static NOINLINE void ggml_vec_dot_q1_0_q8_0_vl128(const int n, float * GGML_RESTRICT s, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy) {
const int qk = QK1_0;
const int nb = n / qk;
assert(n % qk == 0);
const block_q1_0 * GGML_RESTRICT x = vx;
const block_q8_0 * GGML_RESTRICT y = vy;
//LMUL = 2, VLMAX = 32
const size_t vl32 = __riscv_vsetvl_e8m2(32);
assert(vl32 == 32);
const vint16m1_t zero = __riscv_vmv_v_x_i16m1(0, 1);
float sumf = 0;
for (int ib = 0; ib < nb; ++ib) {
const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
float acc = 0;
for (int k = 0; k < 4; ++k) {
const block_q8_0 * GGML_RESTRICT yb = &y[ib * 4 + k];
const vbool4_t is_not_zero = __riscv_vlm_v_b4(x[ib].qs + 4 * k, vl32);
const vint8m2_t qy = __riscv_vle8_v_i8m2(yb->qs, vl32);
const vint8m2_t neg_qy =__riscv_vneg_v_i8m2(qy, vl32);
const vint8m2_t sy = __riscv_vmerge_vvm_i8m2(neg_qy, qy, is_not_zero, vl32);
const vint16m1_t red = __riscv_vwredsum_vs_i8m2_i16m1(sy, zero, vl32);
acc += GGML_CPU_FP16_TO_FP32(yb->d) * (float)__riscv_vmv_x_s_i16m1_i16(red);
}
sumf += d0 * acc;
}
*s = sumf;
}
#endif
void ggml_vec_dot_q1_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined(__riscv_v)
assert(nrc == 1);
const size_t vlen_bits = __riscv_vlenb() * 8;
if (vlen_bits >= 256) {
ggml_vec_dot_q1_0_q8_0_vl256(n, s, vx, vy);
} else if (vlen_bits >= 128) {
ggml_vec_dot_q1_0_q8_0_vl128(n, s, vx, vy);
} else {
ggml_vec_dot_q1_0_q8_0_generic(n, s, bs, vx, bx, vy, by, nrc);
}
#else
ggml_vec_dot_q1_0_q8_0_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(nrc == 1);
UNUSED(nrc);

View File

@@ -1245,6 +1245,12 @@ void ggml_compute_forward_mul_mat(
const struct ggml_tensor * src0 = dst->src[0];
const struct ggml_tensor * src1 = dst->src[1];
const int32_t hint = ggml_get_op_params_i32(dst, 1);
if (hint == GGML_HINT_SRC0_IS_HADAMARD && !params->use_ref) {
ggml_compute_forward_fwht(params, dst);
return;
}
GGML_TENSOR_BINARY_OP_LOCALS
const int ith = params->ith;
@@ -2959,6 +2965,45 @@ struct ggml_cplan ggml_graph_plan(
return cplan;
}
// Try to fuse the current node with subsequent nodes for better performance.
// Returns the number of nodes skipped by fusion (>=1), or 0 if no fusion was applied.
static bool ggml_cpu_disable_fusion = false; // initialized once in ggml_cpu_init(), read-only afterwards
static int ggml_cpu_try_fuse_ops(
const struct ggml_cgraph * cgraph,
const int node_n,
const struct ggml_compute_params * params,
const struct ggml_cplan * cplan) {
if (ggml_cpu_disable_fusion || cplan->use_ref) {
return 0;
}
struct ggml_tensor * node = cgraph->nodes[node_n];
if (node->op == GGML_OP_RMS_NORM) {
// RMS_NORM + MUL fusion
const enum ggml_op fuse_ops[] = { GGML_OP_RMS_NORM, GGML_OP_MUL };
if (ggml_can_fuse(cgraph, node_n, fuse_ops, 2)) {
struct ggml_tensor * mul_node = cgraph->nodes[node_n + 1];
const struct ggml_tensor * mul_w = (mul_node->src[0] == node)
? mul_node->src[1] : mul_node->src[0];
if (node->src[0]->type == GGML_TYPE_F32 &&
mul_node->type == GGML_TYPE_F32 &&
mul_w->type == GGML_TYPE_F32 &&
mul_w->ne[0] == node->ne[0] &&
mul_w->nb[0] == sizeof(float)) {
ggml_compute_forward_rms_norm_mul_fused(params, node, mul_node);
return 1;
}
}
}
return 0;
}
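// Illustrative sketch (not part of the patch) of the graph pattern this fusion
// targets -- an RMS norm immediately followed by a broadcast multiply with a
// weight vector, as produced by the common RMSNorm layer:
//
//   struct ggml_tensor * n = ggml_rms_norm(ctx, x, eps);
//   struct ggml_tensor * y = ggml_mul(ctx, n, w);   // w has ne[0] == n->ne[0]
//
// When all tensors involved are F32, the fused path computes y in one pass over x
// and the separate MUL node is skipped.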
static thread_ret_t ggml_graph_compute_thread(void * data) {
struct ggml_compute_state * state = (struct ggml_compute_state *) data;
struct ggml_threadpool * tp = state->threadpool;
@@ -2995,7 +3040,14 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
continue;
}
ggml_compute_forward(&params, node);
// TODO: move fused-op detection into ggml_graph_plan so fusion decisions are made once at planning time
// Try fused ops, fall back to normal compute
const int n_fused = ggml_cpu_try_fuse_ops(cgraph, node_n, &params, cplan);
if (n_fused > 0) {
node_n += n_fused;
} else {
ggml_compute_forward(&params, node);
}
if (state->ith == 0 && cplan->abort_callback &&
cplan->abort_callback(cplan->abort_callback_data)) {
@@ -3757,6 +3809,11 @@ void ggml_cpu_init(void) {
ggml_init_riscv_arch_features();
#endif
{
const char * env = getenv("GGML_CPU_DISABLE_FUSION");
ggml_cpu_disable_fusion = (env != NULL && atoi(env) == 1);
}
is_first_call = false;
}

View File

@@ -195,8 +195,8 @@ static const struct ggml_backend_i ggml_backend_cpu_i = {
/* .free = */ ggml_backend_cpu_free,
/* .set_tensor_async = */ NULL,
/* .get_tensor_async = */ NULL,
/* .get_tensor_2d_async = */ NULL,
/* .set_tensor_2d_async = */ NULL,
/* .get_tensor_2d_async = */ NULL,
/* .cpy_tensor_async = */ NULL,
/* .synchronize = */ NULL,
/* .graph_plan_create = */ ggml_backend_cpu_graph_plan_create,

View File

@@ -2321,6 +2321,9 @@ class tinyBLAS_Q0_PPC {
}
void matmul(int64_t m, int64_t n) {
#if defined(_AIX) || defined(__BIG_ENDIAN__)
mnpack(0, m, 0, n);
#else
const int64_t mc = 64;
const int64_t kc = 64;
int64_t nc = 64;
@@ -2334,7 +2337,6 @@ class tinyBLAS_Q0_PPC {
} else {
n_aligned = (n / 64) * 64;
}
if (n_aligned > 0) {
if (n_aligned % 64 == 0) nc = 64;
else if (n_aligned == n) nc = n;
@@ -2352,6 +2354,7 @@ class tinyBLAS_Q0_PPC {
} else {
mnpack(0, m, 0, n);
}
#endif
}
private:
@@ -3191,12 +3194,16 @@ class tinyBLAS_PPC {
}
void matmul(int64_t m, int64_t n) {
#if defined(_AIX) || defined(__BIG_ENDIAN__)
mnpack(0, m, 0, n);
#else
int64_t mc = 256; int64_t nc = 256; int64_t kc = 256;
if (m % mc == 0 && n % nc == 0 && k % kc == 0) {
matmul_tiled(m, n, mc, nc, kc);
} else {
mnpack(0, m, 0, n);
}
#endif
}
private:

View File

@@ -3713,11 +3713,27 @@ void ggml_compute_forward_norm(
// ggml_compute_forward_group_rms_norm
// fusion kinds that can be combined with the rms_norm computation in a single pass.
// extend this enum when adding new fused variants (e.g. FUSE_ADD, FUSE_MUL_ADD, ...).
enum ggml_rms_norm_fuse_op {
GGML_RMS_NORM_FUSE_OP_NONE,
GGML_RMS_NORM_FUSE_OP_MUL,
};
template <ggml_rms_norm_fuse_op FUSE_OP>
static void ggml_compute_forward_rms_norm_f32(
const ggml_compute_params * params,
ggml_tensor * dst) {
ggml_tensor * dst_rms_norm,
ggml_tensor * dst_fused = nullptr) {
const ggml_tensor * src0 = dst->src[0];
const ggml_tensor * src0 = dst_rms_norm->src[0];
const ggml_tensor * src1 = nullptr;
ggml_tensor * dst = dst_rms_norm;
if constexpr (FUSE_OP == GGML_RMS_NORM_FUSE_OP_MUL) {
src1 = (dst_fused->src[0] == dst_rms_norm) ? dst_fused->src[1] : dst_fused->src[0];
dst = dst_fused;
}
GGML_ASSERT(ggml_are_same_shape(src0, dst));
@@ -3726,11 +3742,10 @@ static void ggml_compute_forward_rms_norm_f32(
const int ith = params->ith;
const int nth = params->nth;
GGML_TENSOR_UNARY_OP_LOCALS
GGML_TENSOR_BINARY_OP_LOCALS
float eps;
memcpy(&eps, dst->op_params, sizeof(float));
memcpy(&eps, dst_rms_norm->op_params, sizeof(float));
GGML_ASSERT(eps >= 0.0f);
// TODO: optimize
@@ -3740,25 +3755,32 @@ static void ggml_compute_forward_rms_norm_f32(
const float * x = (float *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);
ggml_float sum = 0.0;
// worth switching to explicit SIMD?
for (int64_t i00 = 0; i00 < ne00; i00++) {
sum += (ggml_float)(x[i00] * x[i00]);
}
const float mean = sum/ne00;
float * y = (float *) ((char *) dst->data + i01*nb1 + i02*nb2 + i03*nb3);
memcpy(y, x, ne00 * sizeof(float));
// for (int i00 = 0; i00 < ne00; i00++) {
// y[i00] = x[i00];
// }
const float mean = sum/ne00;
const float scale = 1.0f/sqrtf(mean + eps);
// if you hit this, likely you got an inf somewhere earlier
assert(scale > 0.0f);
ggml_vec_scale_f32(ne00, y, scale);
float * y = (float *) ((char *) dst->data + i01*nb1 + i02*nb2 + i03*nb3);
if constexpr (FUSE_OP == GGML_RMS_NORM_FUSE_OP_MUL) {
const int64_t i11 = i01 % ne11;
const int64_t i12 = i02 % ne12;
const int64_t i13 = i03 % ne13;
const float * w = (float *) ((char *) src1->data + i11*nb11 + i12*nb12 + i13*nb13);
for (int64_t i00 = 0; i00 < ne00; i00++) {
y[i00] = x[i00] * scale * w[i00];
}
} else {
memcpy(y, x, ne00 * sizeof(float));
ggml_vec_scale_f32(ne00, y, scale);
}
}
}
}
@@ -3773,7 +3795,31 @@ void ggml_compute_forward_rms_norm(
switch (src0->type) {
case GGML_TYPE_F32:
{
ggml_compute_forward_rms_norm_f32(params, dst);
ggml_compute_forward_rms_norm_f32<GGML_RMS_NORM_FUSE_OP_NONE>(params, dst);
} break;
default:
{
GGML_ABORT("fatal error");
}
}
}
// Fused RMS_NORM + MUL: computes dst = rms_norm(src0) * src1 in a single pass.
// This avoids materializing the intermediate rms_norm result in memory.
void ggml_compute_forward_rms_norm_mul_fused(
const ggml_compute_params * params,
ggml_tensor * dst_rms_norm,
ggml_tensor * dst_mul) {
GGML_ASSERT(dst_mul != nullptr);
GGML_ASSERT(dst_mul->src[0] == dst_rms_norm || dst_mul->src[1] == dst_rms_norm);
const ggml_tensor * src0 = dst_rms_norm->src[0];
switch (src0->type) {
case GGML_TYPE_F32:
{
ggml_compute_forward_rms_norm_f32<GGML_RMS_NORM_FUSE_OP_MUL>(params, dst_rms_norm, dst_mul);
} break;
default:
{
@@ -11212,3 +11258,91 @@ void ggml_compute_forward_opt_step_sgd(const ggml_compute_params * params, ggml_
}
}
}
static void ggml_compute_forward_fwht_f32(const ggml_compute_params * params, ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0];
const ggml_tensor * src1 = dst->src[1];
GGML_ASSERT(src1->type == GGML_TYPE_F32);
GGML_ASSERT(dst->type == GGML_TYPE_F32);
GGML_TENSOR_BINARY_OP_LOCALS
const int ith = params->ith;
const int nth = params->nth;
const int64_t n = ne10;
GGML_ASSERT((n & (n - 1)) == 0); // must be power of 2
const int64_t nr = ne11 * ne12 * ne13;
const int64_t rows_per_thread = (nr + nth - 1) / nth;
const int64_t start_row = ith * rows_per_thread;
const int64_t end_row = MIN(start_row + rows_per_thread, nr);
const float scale = 1.0f / sqrtf((float)n);
#if defined(GGML_SIMD)
const GGML_F32_VEC v_minus_one = GGML_F32_VEC_SET1(-1.0f);
#endif
for (int64_t r = start_row; r < end_row; r++) {
const int64_t i13 = r / (ne11 * ne12);
const int64_t i12 = (r - i13 * ne11 * ne12) / ne11;
const int64_t i11 = r - i13 * ne11 * ne12 - i12 * ne11;
const float * src_row = (const float *) ((const char *) src1->data + i11 * nb11 + i12 * nb12 + i13 * nb13);
float * dst_row = (float *) ((char *) dst->data + i11 * nb1 + i12 * nb2 + i13 * nb3);
for (int64_t j = 0; j < n; j++) {
dst_row[j] = src_row[j] * scale;
}
// Scalar passes
#if defined(GGML_SIMD)
const int step = GGML_F32_EPR;
#else
const int step = n;
#endif
for (int64_t len = 1; len < step && len < n; len <<= 1) {
for (int64_t i = 0; i < n; i += 2 * len) {
for (int64_t j = 0; j < len; j++) {
float u = dst_row[i + j];
float v = dst_row[i + len + j];
dst_row[i + j] = u + v;
dst_row[i + len + j] = u - v;
}
}
}
// SIMD passes using GGML_F32_VEC_* macros for multi-architecture support
#if defined(GGML_SIMD)
for (int64_t len = step; len < n; len <<= 1) {
for (int64_t i = 0; i < n; i += 2 * len) {
for (int64_t j = 0; j < len; j += step) {
GGML_F32_VEC u = GGML_F32_VEC_LOAD(dst_row + i + j);
GGML_F32_VEC v = GGML_F32_VEC_LOAD(dst_row + i + len + j);
GGML_F32_VEC_STORE(dst_row + i + j, GGML_F32_VEC_ADD(u, v));
GGML_F32_VEC_STORE(dst_row + i + len + j, GGML_F32_VEC_FMA(u, v, v_minus_one));
}
}
}
#endif
}
}
void ggml_compute_forward_fwht(const ggml_compute_params * params, ggml_tensor * dst) {
const ggml_tensor * src1 = dst->src[1];
switch (src1->type) {
case GGML_TYPE_F32:
{
ggml_compute_forward_fwht_f32(params, dst);
}
break;
default:
{
GGML_ABORT("fatal error - fwht is F32 only");
}
}
}
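// Worked example (illustrative only): for a row src = {1, 2, 3, 4}, n = 4 and
// scale = 1/sqrt(4) = 0.5, so dst starts as {0.5, 1.0, 1.5, 2.0}.
//   pass len = 1: {1.5, -0.5, 3.5, -0.5}
//   pass len = 2: {5.0, -1.0, -2.0, 0.0}
// which matches the normalized Walsh-Hadamard transform H4 * src / sqrt(4)
// = {10, -2, -4, 0} / 2.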

View File

@@ -44,6 +44,7 @@ void ggml_compute_forward_concat(const struct ggml_compute_params * params, stru
void ggml_compute_forward_silu_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_norm(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_rms_norm(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_rms_norm_mul_fused(const struct ggml_compute_params * params, struct ggml_tensor * dst_rms_norm, struct ggml_tensor * dst_mul);
void ggml_compute_forward_rms_norm_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_group_norm(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_l2_norm(const struct ggml_compute_params * params, struct ggml_tensor * dst);
@@ -111,6 +112,7 @@ void ggml_compute_forward_cross_entropy_loss(const struct ggml_compute_params *
void ggml_compute_forward_cross_entropy_loss_back(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_opt_step_adamw(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_mul_mat(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_fwht(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_opt_step_sgd(const struct ggml_compute_params * params, struct ggml_tensor * dst);
#ifdef __cplusplus
}

View File

@@ -0,0 +1,968 @@
#include "allreduce.cuh"
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
#include "convert.cuh"
#include "ggml-impl.h"
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <limits>
// ---------------------------------------------------------------------------
// CUDA AllReduce for tensor-parallel inference across two GPUs.
//
// Provides an in-place sum reduction over matching tensors on two CUDA
// devices in the same process. Used by the tensor-split path alongside
// NCCL; targets setups without NVLink, where data is exchanged between the
// GPUs by staging it through pinned host memory over PCIe.
//
// Two reduction strategies are selected per call by tensor size:
//
// * Chunked kernel path (small reductions): a single CUDA kernel both
// stages data through pinned host memory and performs the local sum.
// Cross-GPU synchronization happens *inside the kernel* (busy-wait on
// a host-memory flag), which keeps launch overhead low for the
// latency-sensitive token-generation case.
//
// * Copy-engine path (large reductions): the transfer is split into
// D2H + H2D cudaMemcpyAsync chunks driven by the GPU's copy engine,
// followed by a small device-side add kernel. Cross-GPU
// synchronization happens *outside the kernel*, via CUDA events
// between streams. This keeps the compute engine free while large
// transfers are in flight, which matters for prefill-sized tensors.
// Reductions larger than the per-call inner cap are processed by an
// outer chunker that issues sequential inner calls.
// ---------------------------------------------------------------------------
// ---------------------------------------------------------------------------
// Cross-GPU signal mechanism
//
// One int per (slot, rank) pair in pinned host memory. Each AR call writes a
// strictly increasing token (= the AR call number) into its own arrival int.
// The peer spins until its read of the other's arrival int equals the token
// it expects for this call -- a mismatch means the peer hasn't arrived yet.
// Tokens never repeat over realistic call rates (32-bit int wraps in tens of
// days at thousands of ARs/sec), so arrival ints don't need to be reset
// between calls; we initialize once at pipeline init and let the values
// accumulate.
//
// There is exactly one writer (the owning GPU) and one reader (the peer), so
// we don't need atomics. A volatile store paired with __threadfence_system()
// provides the release ordering that makes the D2H writes visible system-wide
// before the arrival token is observed.
//
// atomicAdd_system() requires hostNativeAtomicSupported, which is unavailable
// on PCIe-attached consumer GPUs without NVLink, so the volatile path is the
// portable choice.
// ---------------------------------------------------------------------------
static __device__ __forceinline__ void ggml_cuda_ar_signal_set(int * p, int token) {
*(volatile int *)p = token;
}
static __device__ __forceinline__ int ggml_cuda_ar_signal_get(const int * p) {
return *(const volatile int *)p;
}
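// For intuition, a host-side analogue of this single-writer/single-reader
// handshake (illustrative only -- the device code uses volatile accesses plus
// __threadfence_system() instead of C++ atomics):
//
//   std::atomic<int> arrival[2];                       // one slot per rank
//
//   void ar_call(int rank, int token) {                // token = AR call number
//       // ... stage this rank's data in the shared host buffer ...
//       arrival[rank].store(token, std::memory_order_release);   // "I arrived"
//       while (arrival[rank ^ 1].load(std::memory_order_acquire) != token) {
//           // spin until the peer publishes the same token
//       }
//       // ... peer's staged data is now safe to read and reduce ...
//   }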
// Byte spacing between adjacent arrival ints. 64 bytes (one cache line)
// ensures each GPU/block's arrival slot lives on its own line, preventing
// false-sharing stalls on the polling GPU.
static constexpr size_t GGML_CUDA_AR_ARRIVAL_STRIDE = 64;
// Number of blocks the chunked kernel launches with. Each block stripes a
// disjoint slice of the data and synchronizes through its own arrival-token
// slot so multiple SMs can pump PCIe stores in parallel.
static constexpr int GGML_CUDA_AR_KERNEL_BLOCKS = 8;
// ---------------------------------------------------------------------------
// Chunked kernel AllReduce -- 2 GPUs, supports float, half, and bfloat16.
//
// Both GPUs run this kernel simultaneously on independent streams. sendbuf
// and recvbuf live in T_dst (the caller's tensor type); host_mine / host_other
// carry data in T_wire (the on-wire type, possibly narrower than T_dst -- e.g.
// T_dst=F32 with T_wire=BF16 halves the bytes pushed across PCIe). When
// T_dst == T_wire the casts below are no-ops.
//
// Each GPU runs three phases:
//
// Phase 1 (all threads): cast sendbuf (T_dst) -> T_wire and store as
// single-instruction-width vectors into host_mine.
// __threadfence_system() commits these writes to host
// memory.
// Phase 2 (thread 0): write token to arrival_mine; spin until
// arrival_other == token.
// Phase 3 (all threads): read T_wire vectors from host_other, cast
// each element to T_dst, and sum with the local
// sendbuf value (also rounded through T_wire so that
// both GPUs truncate identically -- this guarantees
// bit-equivalent results across the two devices).
//
// Multi-block: blocks stripe vectors across (gridDim.x * blockDim.x) global
// threads to keep multiple SMs issuing PCIe stores in parallel. Each block
// has its own arrival-token slot (offset by blockIdx.x * ARRIVAL_STRIDE);
// thread 0 of each block signals/spins on that slot independently of other
// blocks. Tail elements (the leftover < ELEMS_PER_VEC at the end) are
// handled only by block 0 to avoid cross-block writes to the same slots.
// ---------------------------------------------------------------------------
template <typename T_dst, typename T_wire>
static __global__ void ggml_cuda_ar_kernel(
const T_dst * sendbuf,
T_dst * recvbuf,
T_wire * __restrict__ host_mine,
const T_wire * __restrict__ host_other,
int count,
int * arrival_mine,
int * arrival_other,
int token) {
// Vector unit for the wire type, sized to the arch's widest single-instruction
// copy (16 B on Volta+). Each phase-1 iter writes one vector to host memory;
// each phase-3 iter reads one and produces ELEMS_PER_VEC sums.
constexpr int ELEMS_PER_VEC = ggml_cuda_get_max_cpy_bytes() / sizeof(T_wire);
constexpr int ARRIVAL_INTS = (int)(GGML_CUDA_AR_ARRIVAL_STRIDE / sizeof(int));
const int tid = threadIdx.x;
const int nt = blockDim.x;
const int bid = blockIdx.x;
const int gtid = bid * nt + tid;
const int gnt = gridDim.x * nt;
const int count_vec = count / ELEMS_PER_VEC;
const int tail = count_vec * ELEMS_PER_VEC;
// Phase 1: cast sendbuf (T_dst) -> host_mine (T_wire) and store as vectors.
{
for (int i = gtid; i < count_vec; i += gnt) {
const int off = i * ELEMS_PER_VEC;
T_wire wire[ELEMS_PER_VEC];
#pragma unroll
for (int k = 0; k < ELEMS_PER_VEC; ++k) {
wire[k] = ggml_cuda_cast<T_wire>(sendbuf[off + k]);
}
ggml_cuda_memcpy_1<sizeof(wire)>(&host_mine[off], wire);
}
if (bid == 0 && tid < count - tail) {
host_mine[tail + tid] = ggml_cuda_cast<T_wire>(sendbuf[tail + tid]);
}
}
// Commit this block's host writes before signalling.
__threadfence_system();
__syncthreads();
// Phase 2: thread 0 of each block signals its own arrival slot, then spins
// until the peer's matching slot carries this AR's token. Per-block tokens
// let blocks proceed independently -- no inter-block barrier is needed.
if (tid == 0) {
int * my_slot = arrival_mine + bid * ARRIVAL_INTS;
const int * other_slot = arrival_other + bid * ARRIVAL_INTS;
ggml_cuda_ar_signal_set(my_slot, token);
__threadfence_system(); // make our signal visible system-wide
while (ggml_cuda_ar_signal_get(other_slot) != token) {
#if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
__nanosleep(100);
#else
NO_DEVICE_CODE;
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
}
}
__syncthreads();
// Acquire peer's host_other writes (this block's stripe of them).
__threadfence_system();
// Phase 3: read peer's T_wire vector, cast both sides through T_wire for
// bit-equivalence, sum in T_dst precision, and write back to recvbuf.
{
for (int i = gtid; i < count_vec; i += gnt) {
const int off = i * ELEMS_PER_VEC;
T_wire wire[ELEMS_PER_VEC];
ggml_cuda_memcpy_1<sizeof(wire)>(wire, &host_other[off]);
#pragma unroll
for (int k = 0; k < ELEMS_PER_VEC; ++k) {
const T_wire d_low = ggml_cuda_cast<T_wire>(sendbuf[off + k]);
recvbuf[off + k] = ggml_cuda_cast<T_dst>(d_low) + ggml_cuda_cast<T_dst>(wire[k]);
}
}
if (bid == 0 && tid < count - tail) {
const T_wire d_low = ggml_cuda_cast<T_wire>(sendbuf[tail + tid]);
recvbuf[tail + tid] =
ggml_cuda_cast<T_dst>(d_low) + ggml_cuda_cast<T_dst>(host_other[tail + tid]);
}
}
}
// Combined load-convert-add kernel. The peer's contribution arrives as T_src
// (which may be a lower-precision type than T_dst when the BF16 round-trip is
// active). For bit-equivalence between the two GPUs, dst is first rounded
// through T_src's precision via ggml_cuda_cast -- peer already truncated its
// own value the same way before sending -- so both sides perform identical
// arithmetic. When T_dst == T_src the round-trip cast is a no-op.
template <typename T_dst, typename T_src>
static __global__ void ggml_cuda_ar_add_kernel(
T_dst * __restrict__ dst,
const T_src * __restrict__ src,
int count) {
const int tid = blockIdx.x * blockDim.x + threadIdx.x;
const int nt = gridDim.x * blockDim.x;
for (int i = tid; i < count; i += nt) {
const T_src d_low = ggml_cuda_cast<T_src>(dst[i]);
dst[i] = ggml_cuda_cast<T_dst>(d_low) + ggml_cuda_cast<T_dst>(src[i]);
}
}
// ---------------------------------------------------------------------------
// Pipeline structure
// ---------------------------------------------------------------------------
// Number of slots in the event / arrival ring. Two slots is sufficient:
// lockstep guarantees the two GPUs are at most one AR (or chunk) apart, so
// slot[N%2] is always safe to reuse -- peer has already consumed slot[N%2]
// from AR N-2 by the time we get to AR N. acquire_slot's
// cudaEventSynchronize on ev.ker for both devices makes that consumption
// explicit before we overwrite host_buf[slot] for the new AR.
static constexpr int GGML_CUDA_AR_POOL_SIZE = 2;
// Maximum chunk size (bytes per GPU) handled by one chunked kernel launch.
// Larger tensors are reduced by issuing multiple chunked launches.
static constexpr size_t GGML_CUDA_AR_MAX_BYTES = 1024 * 1024; // 1 MB
// Copy-engine path: largest tensor accepted on this path; sets host_large /
// dev_tmp allocation size.
static constexpr size_t GGML_CUDA_AR_COPY_MAX_BYTES = 32 * 1024 * 1024; // 32 MB
// AR wire size at which the copy-engine path takes over from the chunked-
// kernel path. Override via GGML_CUDA_AR_COPY_THRESHOLD.
static constexpr size_t GGML_CUDA_AR_COPY_THRESHOLD_DEFAULT = 1024 * 1024; // 1 MB
// Per-call CE chunk-size heuristic: chunk_bytes = clamp(nbytes / 4, MIN, MAX).
// The /4 keeps ~4 chunks in flight at any moment (good D2H/H2D overlap with
// the peer); the clamps cover the cases where nbytes/4 is too small (per-
// memcpy fixed cost dominates) or too large (chunk-level pipelining stalls).
// Env var GGML_CUDA_AR_COPY_CHUNK_BYTES can override with a fixed value.
static constexpr size_t GGML_CUDA_AR_COPY_CHUNK_BYTES_HEURISTIC_MIN = 512 * 1024; // 512 KB
static constexpr size_t GGML_CUDA_AR_COPY_CHUNK_BYTES_HEURISTIC_MAX = 2 * 1024 * 1024; // 2 MB
// Absolute floor that an env-var override is allowed to set; this caps the
// per-slot copy-event array. 256 KB -> up to 128 chunks per 32 MB tensor.
static constexpr size_t GGML_CUDA_AR_COPY_CHUNK_BYTES_MIN = 256 * 1024;
static constexpr int GGML_CUDA_AR_COPY_MAX_CHUNKS =
static_cast<int>((GGML_CUDA_AR_COPY_MAX_BYTES + GGML_CUDA_AR_COPY_CHUNK_BYTES_MIN - 1) /
GGML_CUDA_AR_COPY_CHUNK_BYTES_MIN);
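// Worked example of the heuristic (illustrative numbers only): a 1.5 MB AR
// yields nbytes/4 = 384 KB, clamped up to the 512 KB floor (3 chunks); an 8 MB
// AR yields 2 MB chunks (4 chunks); a 32 MB AR with the 256 KB env override
// uses the full 128-entry cpy[] event array.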
struct ggml_cuda_ar_event_slot {
cudaEvent_t app = nullptr; // upstream computation complete
cudaEvent_t cpy[GGML_CUDA_AR_COPY_MAX_CHUNKS] = {}; // copy-engine D2H chunks complete
cudaEvent_t h2d = nullptr; // copy-engine H2Ds complete (handoff AR stream -> compute stream)
cudaEvent_t ker = nullptr; // AllReduce kernel complete
};
// Mapped pinned host allocation: cudaHostAlloc + cudaHostGetDevicePointer
// in one place, with the host handle preserved for cudaFreeHost. Used where
// the CPU never touches the buffer -- only the device reads/writes via the
// mapped device pointer. Required on systems where cudaDevAttrCanUseHost-
// PointerForRegisteredMem is 0 and the host pointer can't be used as a
// device pointer.
struct ggml_cuda_ar_host_mapping {
uint8_t * host = nullptr; // cudaFreeHost handle; also the H-side ptr for cudaMemcpyAsync
uint8_t * dev = nullptr; // device-side pointer for kernels / cudaMemset
cudaError_t alloc(size_t bytes) {
cudaError_t rc = cudaHostAlloc(reinterpret_cast<void **>(&host), bytes,
cudaHostAllocPortable | cudaHostAllocMapped);
if (rc != cudaSuccess) {
host = nullptr;
return rc;
}
rc = cudaHostGetDevicePointer(reinterpret_cast<void **>(&dev), host, 0);
if (rc != cudaSuccess) {
cudaFreeHost(host);
host = nullptr;
dev = nullptr;
}
return rc;
}
void free() {
if (host) {
cudaFreeHost(host);
host = nullptr;
dev = nullptr;
}
}
};
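// Minimal usage sketch (hypothetical call sites, for illustration only):
//   ggml_cuda_ar_host_mapping m;
//   if (m.alloc(nbytes) == cudaSuccess) {
//       some_kernel<<<grid, block, 0, stream>>>(m.dev, ...);   // device writes via the mapped pointer
//       cudaMemcpyAsync(dst_dev, m.host, nbytes, cudaMemcpyHostToDevice, stream);
//       m.free();   // only once the stream no longer references the buffer
//   }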
struct ggml_cuda_ar_pipeline {
int n_devices;
int devices[GGML_CUDA_MAX_DEVICES];
size_t buf_bytes; // bytes per device in host_buf[]
size_t copy_bytes; // bytes per device in host_large[] / dev_tmp[]
size_t copy_threshold;
size_t copy_chunk_bytes;
size_t bf16_threshold; // tensors >= this size (bytes) are reduced via FP32->BF16 round-trip; 0 disables
uint64_t call_count;
// Per-device resources.
ggml_cuda_ar_host_mapping host_buf[GGML_CUDA_MAX_DEVICES]; // pinned staging (chunked kernel)
ggml_cuda_ar_host_mapping host_large[GGML_CUDA_MAX_DEVICES]; // pinned staging (copy-engine)
char * dev_tmp[GGML_CUDA_MAX_DEVICES]; // device scratch for copy-engine path
cudaStream_t streams[GGML_CUDA_MAX_DEVICES]; // non-blocking
ggml_cuda_ar_event_slot ev_pool[GGML_CUDA_MAX_DEVICES][GGML_CUDA_AR_POOL_SIZE];
// Copy-engine: per-device "I finished reading my peer's host_large"
// event. Indexed by RECORDER device. Recorded same-device on streams[i]
// after stage 2's last H2D from host_large[peer]. Waited cross-device
// by peer's stage-1 stream before the next AR overwrites host_large[peer].
cudaEvent_t host_large_read_done[GGML_CUDA_MAX_DEVICES];
bool host_large_read_done_valid;
// Copy-engine: per-device "my add_kernel is done with dev_tmp" event.
// Recorded on the compute stream after each add_kernel; the AR stream
// waits on it before the next copy_impl's H2D overwrites dev_tmp. Lets us
// single-buffer dev_tmp despite add_kernel running on a separate stream.
cudaEvent_t dev_tmp_kernel_done[GGML_CUDA_MAX_DEVICES];
bool dev_tmp_kernel_done_valid;
// Arrival ring: ARRIVAL_STRIDE bytes between adjacent ints. Mapped pinned
// memory; CPU never reads/writes -- only the kernel and cudaMemset.
// Use ggml_cuda_ar_arrival_ptr() to index.
ggml_cuda_ar_host_mapping arrival;
};
// Base pointer for the (slot, rank) per-block token block. The kernel adds
// blockIdx.x * (ARRIVAL_STRIDE/sizeof(int)) internally to land on its own slot.
static int * ggml_cuda_ar_arrival_ptr(const ggml_cuda_ar_pipeline * p, int slot, int rank) {
const size_t offset = ((size_t)slot * p->n_devices + rank) *
GGML_CUDA_AR_KERNEL_BLOCKS * GGML_CUDA_AR_ARRIVAL_STRIDE;
return reinterpret_cast<int *>(p->arrival.dev + offset);
}
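// Layout example (2 devices, 2 slots, 8 blocks, 64 B stride): the (slot, rank)
// base offsets are 0, 512, 1024 and 1536 bytes for (0,0), (0,1), (1,0), (1,1);
// within a base, block b's token lives at +b*64 bytes.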
static uint64_t ggml_cuda_ar_env_u64(const char * name, uint64_t default_value) {
const char * value = getenv(name);
if (value == nullptr || value[0] == '\0') {
return default_value;
}
char * end = nullptr;
const unsigned long long parsed = strtoull(value, &end, 10);
return end != value ? (uint64_t) parsed : default_value;
}
struct ggml_cuda_ar_slot_info {
int slot;
int token;
};
static ggml_cuda_ar_slot_info ggml_cuda_ar_acquire_slot(ggml_cuda_ar_pipeline * p) {
const int slot = static_cast<int>(p->call_count % GGML_CUDA_AR_POOL_SIZE);
const bool pool_lapped = p->call_count >= GGML_CUDA_AR_POOL_SIZE;
p->call_count++;
if (pool_lapped) {
for (int i = 0; i < p->n_devices; ++i) {
ggml_cuda_set_device(p->devices[i]);
CUDA_CHECK(cudaEventSynchronize(p->ev_pool[i][slot].ker));
}
}
return { slot, (int) p->call_count };
}
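// Example call sequence: AR 0 -> {slot 0, token 1}, AR 1 -> {slot 1, token 2},
// AR 2 -> {slot 0, token 3} but only after cudaEventSynchronize on both
// devices' ev_pool[*][0].ker from AR 0, so slot 0's host_buf and arrival slots
// are guaranteed free before they are overwritten.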
// Per-AR copy-engine chunk size: env-var override if set, else heuristic
// (clamp(nbytes/4, HEURISTIC_MIN, HEURISTIC_MAX)).
static size_t ggml_cuda_ar_chunk_bytes(const ggml_cuda_ar_pipeline * p, size_t nbytes) {
if (p->copy_chunk_bytes > 0) {
return p->copy_chunk_bytes;
}
return std::min(GGML_CUDA_AR_COPY_CHUNK_BYTES_HEURISTIC_MAX,
std::max(GGML_CUDA_AR_COPY_CHUNK_BYTES_HEURISTIC_MIN, nbytes / 4));
}
static void ggml_cuda_ar_wait_for_compute(
ggml_cuda_ar_pipeline * p, ggml_backend_cuda_context * cuda_ctx, int rank, int slot) {
ggml_cuda_ar_event_slot & ev = p->ev_pool[rank][slot];
CUDA_CHECK(cudaEventRecord(ev.app, cuda_ctx->stream()));
CUDA_CHECK(cudaStreamWaitEvent(p->streams[rank], ev.app));
}
// ---------------------------------------------------------------------------
// Init / free
// ---------------------------------------------------------------------------
ggml_cuda_ar_pipeline * ggml_cuda_ar_pipeline_init(const int * devices, size_t n_devices) {
if (n_devices != 2) {
GGML_LOG_DEBUG("%s: internal AllReduce only supports n_devices=2 (got %zu); "
"falling back\n", __func__, n_devices);
return nullptr;
}
// The chunked kernel uses __nanosleep, which is sm70+ (Volta+).
for (size_t i = 0; i < n_devices; ++i) {
const int cc = ggml_cuda_info().devices[devices[i]].cc;
if (cc < GGML_CUDA_CC_VOLTA) {
GGML_LOG_DEBUG("%s: internal AllReduce requires compute capability >= %d "
"(device %d has cc=%d); falling back\n",
__func__, GGML_CUDA_CC_VOLTA, devices[i], cc);
return nullptr;
}
}
auto * p = new ggml_cuda_ar_pipeline{};
p->n_devices = n_devices;
p->copy_bytes = GGML_CUDA_AR_COPY_MAX_BYTES;
p->copy_threshold = ggml_cuda_ar_env_u64("GGML_CUDA_AR_COPY_THRESHOLD", GGML_CUDA_AR_COPY_THRESHOLD_DEFAULT);
// 0 = use the per-call heuristic (default). Non-zero env value forces a
// fixed chunk size for diagnostics, with a floor at COPY_CHUNK_BYTES_MIN.
p->copy_chunk_bytes = ggml_cuda_ar_env_u64("GGML_CUDA_AR_COPY_CHUNK_BYTES", 0);
if (p->copy_chunk_bytes > 0 && p->copy_chunk_bytes < GGML_CUDA_AR_COPY_CHUNK_BYTES_MIN) {
GGML_LOG_WARN("%s: GGML_CUDA_AR_COPY_CHUNK_BYTES=%zu below minimum %zu; clamping\n",
__func__, p->copy_chunk_bytes, GGML_CUDA_AR_COPY_CHUNK_BYTES_MIN);
p->copy_chunk_bytes = GGML_CUDA_AR_COPY_CHUNK_BYTES_MIN;
}
// Default 1: BF16 round-trip is always on for F32 inputs (any non-zero
// ne). Set GGML_CUDA_AR_BF16_THRESHOLD=0 to disable, or to a larger
// byte threshold to opt out for small tensors.
p->bf16_threshold = ggml_cuda_ar_env_u64("GGML_CUDA_AR_BF16_THRESHOLD", 1);
for (size_t i = 0; i < n_devices; ++i) {
p->devices[i] = devices[i];
}
// Per-device streams and event pools.
for (size_t i = 0; i < n_devices; ++i) {
ggml_cuda_set_device(p->devices[i]);
cudaStream_t stream = nullptr;
if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) {
GGML_LOG_ERROR("%s: cudaStreamCreateWithFlags failed for device %d\n",
__func__, p->devices[i]);
ggml_cuda_ar_pipeline_free(p);
return nullptr;
}
p->streams[i] = stream;
for (int s = 0; s < GGML_CUDA_AR_POOL_SIZE; ++s) {
bool ok =
cudaEventCreateWithFlags(&p->ev_pool[i][s].app, cudaEventDisableTiming) == cudaSuccess &&
cudaEventCreateWithFlags(&p->ev_pool[i][s].h2d, cudaEventDisableTiming) == cudaSuccess &&
cudaEventCreateWithFlags(&p->ev_pool[i][s].ker, cudaEventDisableTiming) == cudaSuccess;
for (int c = 0; ok && c < GGML_CUDA_AR_COPY_MAX_CHUNKS; ++c) {
ok = cudaEventCreateWithFlags(&p->ev_pool[i][s].cpy[c], cudaEventDisableTiming) == cudaSuccess;
}
if (!ok) {
GGML_LOG_ERROR("%s: cudaEventCreate failed for device %d slot %d\n",
__func__, p->devices[i], s);
ggml_cuda_ar_pipeline_free(p);
return nullptr;
}
}
if (cudaEventCreateWithFlags(&p->host_large_read_done[i], cudaEventDisableTiming) != cudaSuccess) {
GGML_LOG_ERROR("%s: cudaEventCreate for host_large_read_done failed for device %d\n",
__func__, p->devices[i]);
ggml_cuda_ar_pipeline_free(p);
return nullptr;
}
if (cudaEventCreateWithFlags(&p->dev_tmp_kernel_done[i], cudaEventDisableTiming) != cudaSuccess) {
GGML_LOG_ERROR("%s: cudaEventCreate for dev_tmp_kernel_done failed for device %d\n",
__func__, p->devices[i]);
ggml_cuda_ar_pipeline_free(p);
return nullptr;
}
}
// Arrival ring: cache-line padded so each GPU's int is on its own line.
const size_t arrival_bytes =
(size_t)GGML_CUDA_AR_POOL_SIZE * n_devices *
GGML_CUDA_AR_KERNEL_BLOCKS * GGML_CUDA_AR_ARRIVAL_STRIDE;
if (p->arrival.alloc(arrival_bytes) != cudaSuccess) {
GGML_LOG_ERROR("%s: alloc for arrival ring failed (%zu bytes)\n",
__func__, arrival_bytes);
ggml_cuda_ar_pipeline_free(p);
return nullptr;
}
ggml_cuda_set_device(p->devices[0]);
if (cudaMemset(p->arrival.dev, 0, arrival_bytes) != cudaSuccess) {
GGML_LOG_ERROR("%s: cudaMemset for arrival ring failed (%zu bytes)\n",
__func__, arrival_bytes);
ggml_cuda_ar_pipeline_free(p);
return nullptr;
}
// Per-device pinned staging buffers -- POOL_SIZE-deep ring so the chunked-
// kernel can write the next slot's data while the peer is still reading
// the previous slot's. Indexed by (slot * buf_bytes) at the call site.
p->buf_bytes = GGML_CUDA_AR_MAX_BYTES;
const size_t host_buf_total = (size_t) GGML_CUDA_AR_POOL_SIZE * p->buf_bytes;
for (size_t i = 0; i < n_devices; ++i) {
if (p->host_buf[i].alloc(host_buf_total) != cudaSuccess) {
GGML_LOG_ERROR("%s: alloc for staging failed (%zu bytes)\n",
__func__, host_buf_total);
ggml_cuda_ar_pipeline_free(p);
return nullptr;
}
}
// Copy-engine path: pinned host staging + device scratch, sized for the
// largest tensor we accept on this path (GGML_CUDA_AR_COPY_MAX_BYTES).
// dev_tmp is single-buffered; cross-AR safety is enforced by an explicit
// cross-stream wait in copy_impl on the prior AR's add_kernel-done event.
for (size_t i = 0; i < n_devices; ++i) {
ggml_cuda_set_device(p->devices[i]);
if (p->host_large[i].alloc(p->copy_bytes) != cudaSuccess) {
GGML_LOG_ERROR("%s: alloc for large staging failed (%zu bytes)\n",
__func__, p->copy_bytes);
ggml_cuda_ar_pipeline_free(p);
return nullptr;
}
if (cudaMalloc(reinterpret_cast<void **>(&p->dev_tmp[i]), p->copy_bytes) != cudaSuccess) {
GGML_LOG_ERROR("%s: cudaMalloc for copy scratch failed (%zu bytes) on device %d\n",
__func__, p->copy_bytes, p->devices[i]);
ggml_cuda_ar_pipeline_free(p);
return nullptr;
}
}
GGML_LOG_INFO("%s: initialized AllReduce pipeline: %zu GPUs, "
"%zu KB chunked-kernel staging per slot + %zu MB copy-engine staging per GPU\n",
__func__, n_devices, p->buf_bytes >> 10, p->copy_bytes >> 20);
return p;
}
void ggml_cuda_ar_pipeline_free(ggml_cuda_ar_pipeline * p) {
if (!p) {
return;
}
// Drain all in-flight kernels before tearing down resources.
for (int i = 0; i < p->n_devices; ++i) {
if (p->streams[i]) {
ggml_cuda_set_device(p->devices[i]);
cudaStreamSynchronize(p->streams[i]);
}
}
for (int i = 0; i < p->n_devices; ++i) {
p->host_buf[i].free();
p->host_large[i].free();
if (p->dev_tmp[i]) {
ggml_cuda_set_device(p->devices[i]);
cudaFree(p->dev_tmp[i]);
}
ggml_cuda_set_device(p->devices[i]);
for (int s = 0; s < GGML_CUDA_AR_POOL_SIZE; ++s) {
if (p->ev_pool[i][s].app) { cudaEventDestroy(p->ev_pool[i][s].app); }
for (int c = 0; c < GGML_CUDA_AR_COPY_MAX_CHUNKS; ++c) {
if (p->ev_pool[i][s].cpy[c]) { cudaEventDestroy(p->ev_pool[i][s].cpy[c]); }
}
if (p->ev_pool[i][s].h2d) { cudaEventDestroy(p->ev_pool[i][s].h2d); }
if (p->ev_pool[i][s].ker) { cudaEventDestroy(p->ev_pool[i][s].ker); }
}
if (p->host_large_read_done[i]) {
ggml_cuda_set_device(p->devices[i]);
cudaEventDestroy(p->host_large_read_done[i]);
}
if (p->dev_tmp_kernel_done[i]) {
ggml_cuda_set_device(p->devices[i]);
cudaEventDestroy(p->dev_tmp_kernel_done[i]);
}
if (p->streams[i]) {
ggml_cuda_set_device(p->devices[i]);
cudaStreamDestroy(p->streams[i]);
}
}
p->arrival.free();
delete p;
}
// ---------------------------------------------------------------------------
// Dispatch
// ---------------------------------------------------------------------------
// Asymmetric copy_impl: data is sent over PCIe in T_src precision (nbytes =
// ne * sizeof(T_src)) and accumulated locally into a T_dst buffer. When
// T_src == T_dst this is the original homogeneous reduction. When they differ
// (e.g. BF16 wire / F32 accumulator) the add kernel rounds dst through T_src
// for bit-equivalence between GPUs and we skip the otherwise-needed
// post-conversion entirely.
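// Example (illustrative): with T_src = nv_bfloat16 and T_dst = float, both GPUs
// compute recv = (float) bf16(local) + (float) bf16(peer); since each side
// rounds its own contribution through BF16 before sending, the two operands --
// and therefore the sums -- are bit-identical on both devices.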
template <typename T_src, typename T_dst>
static bool ggml_cuda_ar_allreduce_copy_impl(
ggml_cuda_ar_pipeline * p,
ggml_backend_t * backends,
T_src * const src_buf[GGML_CUDA_MAX_DEVICES],
T_dst * const dst_buf[GGML_CUDA_MAX_DEVICES],
const bool compute[GGML_CUDA_MAX_DEVICES],
int64_t ne,
size_t nbytes) {
GGML_ASSERT(p->n_devices == 2);
GGML_ASSERT(nbytes <= p->copy_bytes);
GGML_ASSERT(ne <= std::numeric_limits<int>::max());
const size_t chunk_bytes = ggml_cuda_ar_chunk_bytes(p, nbytes);
GGML_ASSERT(chunk_bytes > 0);
const int slot = ggml_cuda_ar_acquire_slot(p).slot;
const size_t copy_chunks = (nbytes + chunk_bytes - 1) / chunk_bytes;
GGML_ASSERT(copy_chunks <= GGML_CUDA_AR_COPY_MAX_CHUNKS);
ggml_backend_cuda_context * cuda_ctx[2] = {};
// Stage 1: both GPUs copy their local contribution to pinned host memory.
for (int i = 0; i < 2; ++i) {
ggml_cuda_set_device(p->devices[i]);
cuda_ctx[i] = static_cast<ggml_backend_cuda_context *>(backends[i]->context);
GGML_ASSERT(cuda_ctx[i]->device == p->devices[i]);
ggml_cuda_ar_wait_for_compute(p, cuda_ctx[i], i, slot);
// Wait for peer's H2D from our host_large[i] (recorded in the
// previous AR's stage 2) to complete before we overwrite host_large[i].
// host_large_read_done[peer] = peer finished reading host_large[i].
// No-op on the first AR -- no prior record exists.
if (p->host_large_read_done_valid) {
const int peer = 1 - i;
CUDA_CHECK(cudaStreamWaitEvent(p->streams[i], p->host_large_read_done[peer]));
}
if (!compute[i]) {
CUDA_CHECK(cudaMemsetAsync(src_buf[i], 0, nbytes, p->streams[i]));
}
for (size_t c = 0; c < copy_chunks; ++c) {
const size_t offset = c * chunk_bytes;
const size_t this_bytes = (nbytes - offset) < chunk_bytes ?
(nbytes - offset) : chunk_bytes;
CUDA_CHECK(cudaMemcpyAsync(
p->host_large[i].host + offset, reinterpret_cast<char *>(src_buf[i]) + offset, this_bytes,
cudaMemcpyDeviceToHost, p->streams[i]));
CUDA_CHECK(cudaEventRecord(p->ev_pool[i][slot].cpy[c], p->streams[i]));
}
}
// Stage 2: each GPU waits for each peer D2H chunk, pulls that chunk back to
// local device scratch (dev_tmp), then performs one device-local add over
// the assembled peer tensor. The H2Ds run on the AR stream (copy engine)
// and the add_kernel runs on the caller's compute stream, so the AR stream
// stays pure-copy and avoids an in-stream copy->compute engine switch every
// AR. dev_tmp is single-buffered: the AR stream waits cross-stream on the
// prior AR's add_kernel-done event before overwriting it.
for (int i = 0; i < 2; ++i) {
const int peer = 1 - i;
ggml_cuda_set_device(p->devices[i]);
// Wait for the previous AR's add_kernel (on the compute stream) to
// finish reading dev_tmp before our H2D overwrites it. No-op on the
// first copy_impl call.
if (p->dev_tmp_kernel_done_valid) {
CUDA_CHECK(cudaStreamWaitEvent(p->streams[i], p->dev_tmp_kernel_done[i]));
}
for (size_t c = 0; c < copy_chunks; ++c) {
const size_t offset = c * chunk_bytes;
const size_t this_bytes = (nbytes - offset) < chunk_bytes ?
(nbytes - offset) : chunk_bytes;
CUDA_CHECK(cudaStreamWaitEvent(p->streams[i], p->ev_pool[peer][slot].cpy[c]));
CUDA_CHECK(cudaMemcpyAsync(
p->dev_tmp[i] + offset, p->host_large[peer].host + offset, this_bytes,
cudaMemcpyHostToDevice, p->streams[i]));
}
// Mark our reads of host_large[peer] complete so peer's next AR can
// safely overwrite it.
CUDA_CHECK(cudaEventRecord(p->host_large_read_done[i], p->streams[i]));
// Hand off from AR stream (copy engine) to compute stream: compute
// stream waits for all H2Ds to finish, then runs the add_kernel.
CUDA_CHECK(cudaEventRecord(p->ev_pool[i][slot].h2d, p->streams[i]));
CUDA_CHECK(cudaStreamWaitEvent(cuda_ctx[i]->stream(), p->ev_pool[i][slot].h2d));
const int block_size = 256;
int n_blocks = (int) ((ne + block_size - 1) / block_size);
if (n_blocks > 1024) {
n_blocks = 1024;
}
ggml_cuda_ar_add_kernel<T_dst, T_src><<<n_blocks, block_size, 0, cuda_ctx[i]->stream()>>>(
dst_buf[i],
reinterpret_cast<const T_src *>(p->dev_tmp[i]),
(int) ne);
CUDA_CHECK(cudaGetLastError());
// Record dev_tmp-released on the compute stream so the next copy_impl
// can wait for the kernel to finish before overwriting dev_tmp. Also
// record AR-done as ev.ker for acquire_slot's pool-wraparound sync.
CUDA_CHECK(cudaEventRecord(p->dev_tmp_kernel_done[i], cuda_ctx[i]->stream()));
CUDA_CHECK(cudaEventRecord(p->ev_pool[i][slot].ker, cuda_ctx[i]->stream()));
}
p->host_large_read_done_valid = true;
p->dev_tmp_kernel_done_valid = true;
return true;
}
// Outer-level chunker: copy_impl handles up to copy_bytes per call (limited by
// the host_large / dev_tmp allocation size). When the full AR exceeds that,
// slice the tensor into copy_bytes-sized pieces and call copy_impl repeatedly.
// Each slice goes through its own stage 1 -> stage 2 cycle and acquires its own
// slot, so cross-AR fences and pool wraparound work the same way as for any
// other sequence of small ARs.
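// Example (illustrative): a 48M-element BF16-wire reduction with
// copy_bytes = 32 MB gives outer_max_elems = 16M, so copy_impl runs three
// times on 16M-element slices, each with its own slot and event fences.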
template <typename T_src, typename T_dst>
static bool ggml_cuda_ar_allreduce_copy_outer(
ggml_cuda_ar_pipeline * p,
ggml_backend_t * backends,
T_src * const src_buf[GGML_CUDA_MAX_DEVICES],
T_dst * const dst_buf[GGML_CUDA_MAX_DEVICES],
const bool compute[GGML_CUDA_MAX_DEVICES],
int64_t ne) {
const int64_t outer_max_elems = (int64_t) (p->copy_bytes / sizeof(T_src));
GGML_ASSERT(outer_max_elems > 0);
bool ok = true;
for (int64_t outer_start = 0; outer_start < ne && ok; outer_start += outer_max_elems) {
const int64_t outer_ne = std::min(outer_max_elems, ne - outer_start);
const size_t outer_nbytes = (size_t) outer_ne * sizeof(T_src);
T_src * src[GGML_CUDA_MAX_DEVICES] = {};
T_dst * dst[GGML_CUDA_MAX_DEVICES] = {};
for (int i = 0; i < p->n_devices; ++i) {
src[i] = src_buf[i] + outer_start;
dst[i] = dst_buf[i] + outer_start;
}
ok = ggml_cuda_ar_allreduce_copy_impl<T_src, T_dst>(
p, backends, src, dst, compute, outer_ne, outer_nbytes);
}
return ok;
}
bool ggml_cuda_ar_allreduce(
ggml_cuda_ar_pipeline * p,
ggml_backend_t * backends,
ggml_tensor ** tensors) {
GGML_ASSERT(p != nullptr);
const int n = p->n_devices;
GGML_ASSERT(n == 2);
const ggml_type input_type = tensors[0]->type;
GGML_ASSERT(input_type == GGML_TYPE_F32 || input_type == GGML_TYPE_F16 || input_type == GGML_TYPE_BF16);
const int64_t ne = ggml_nelements(tensors[0]);
GGML_ASSERT(ne > 0);
const size_t input_nbytes = ggml_nbytes(tensors[0]);
// BF16 round-trip: F32 inputs >= bf16_threshold are converted to BF16 for
// the reduction (chunked or copy-engine), halving on-wire bytes. Matches
// NCCL's behaviour. The pre-conversion zeroes inactive shards so the
// inner paths see them as already-prepared compute tensors.
const bool use_bf16 =
input_type == GGML_TYPE_F32 &&
p->bf16_threshold > 0 &&
input_nbytes >= p->bf16_threshold;
const ggml_type kernel_type = use_bf16 ? GGML_TYPE_BF16 : input_type;
const size_t type_size = ggml_type_size(kernel_type);
GGML_ASSERT(p->buf_bytes >= type_size);
const size_t nbytes = (size_t) ne * type_size;
bool compute_flag[GGML_CUDA_MAX_DEVICES] = {};
for (int i = 0; i < n; ++i) {
compute_flag[i] = (tensors[i]->flags & GGML_TENSOR_FLAG_COMPUTE) != 0;
}
// Decide between copy-engine and chunked kernel paths based on the working
// type's actual byte count. No upper bound: copy_outer slices reductions
// larger than copy_bytes into copy_bytes-sized pieces.
const bool use_copy_engine =
p->copy_threshold > 0 &&
nbytes >= p->copy_threshold;
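// Worked example (default thresholds assumed): a 4M-element F32 tensor is
// 16 MB >= bf16_threshold, so use_bf16 is on and the wire size nbytes is
// 4M * 2 B = 8 MB; 8 MB >= copy_threshold (1 MB), so the copy-engine path is
// taken. A 64K-element F32 tensor gives a 128 KB BF16 wire and stays on the
// chunked-kernel path.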
// BF16 inactive-shard zeroing: when use_bf16 is on, the combined kernel
// (chunked kernel path) and the combined add kernel (copy_engine path)
// both accumulate into the F32 tensor data directly, so an inactive
// shard's accumulator must start at zero.
if (use_bf16) {
for (int i = 0; i < n; ++i) {
if (!compute_flag[i]) {
auto * cuda_ctx = static_cast<ggml_backend_cuda_context *>(backends[i]->context);
GGML_ASSERT(cuda_ctx->device == p->devices[i]);
ggml_cuda_set_device(p->devices[i]);
CUDA_CHECK(cudaMemsetAsync(tensors[i]->data, 0, (size_t) ne * sizeof(float), cuda_ctx->stream()));
}
}
}
// Pre-convert F32 -> BF16 into bf16_tmp ONLY for the copy_engine + use_bf16
// path; the chunked kernel path's combined kernel does the conversion
// inline as it writes to host_buf.
ggml_cuda_pool_alloc<nv_bfloat16> bf16_tmp[GGML_CUDA_MAX_DEVICES];
void * copy_src_ptr[GGML_CUDA_MAX_DEVICES] = {};
if (use_copy_engine && use_bf16) {
to_bf16_cuda_t to_bf16 = ggml_get_to_bf16_cuda(GGML_TYPE_F32);
for (int i = 0; i < n; ++i) {
auto * cuda_ctx = static_cast<ggml_backend_cuda_context *>(backends[i]->context);
GGML_ASSERT(cuda_ctx->device == p->devices[i]);
bf16_tmp[i].pool = &cuda_ctx->pool();
bf16_tmp[i].alloc(ne);
ggml_cuda_set_device(p->devices[i]);
if (compute_flag[i]) {
to_bf16(tensors[i]->data, bf16_tmp[i].get(), ne, cuda_ctx->stream());
CUDA_CHECK(cudaGetLastError());
} else {
CUDA_CHECK(cudaMemsetAsync(bf16_tmp[i].get(), 0, nbytes, cuda_ctx->stream()));
}
copy_src_ptr[i] = bf16_tmp[i].get();
}
}
bool ok = true;
if (use_copy_engine) {
// After up-front BF16 conversion, the tmp buffers already hold the
// (possibly zeroed-for-inactive) data, so the inner path can treat
// every shard as compute.
bool inner_compute[GGML_CUDA_MAX_DEVICES];
for (int i = 0; i < n; ++i) {
inner_compute[i] = use_bf16 ? true : compute_flag[i];
}
// Dispatch into copy_impl with explicit src/dst types. When use_bf16
// is on, the wire type is BF16 (src = bf16_tmp) and the accumulator
// is F32 (dst = tensors[i]->data); the combined add kernel rounds dst
// through BF16 for bit-equivalence and writes F32 directly, so no
// post-conversion is needed. Otherwise src == dst (same native type).
if (use_bf16) {
GGML_ASSERT(kernel_type == GGML_TYPE_BF16);
nv_bfloat16 * src[GGML_CUDA_MAX_DEVICES] = {};
float * dst[GGML_CUDA_MAX_DEVICES] = {};
for (int i = 0; i < n; ++i) {
src[i] = static_cast<nv_bfloat16 *>(copy_src_ptr[i]);
dst[i] = static_cast<float *>(tensors[i]->data);
}
ok = ggml_cuda_ar_allreduce_copy_outer<nv_bfloat16, float>(
p, backends, src, dst, inner_compute, ne);
} else {
switch (kernel_type) {
case GGML_TYPE_F32: {
float * buf[GGML_CUDA_MAX_DEVICES] = {};
for (int i = 0; i < n; ++i) {
buf[i] = static_cast<float *>(tensors[i]->data);
}
ok = ggml_cuda_ar_allreduce_copy_outer<float, float>(
p, backends, buf, buf, inner_compute, ne);
break;
}
case GGML_TYPE_BF16: {
nv_bfloat16 * buf[GGML_CUDA_MAX_DEVICES] = {};
for (int i = 0; i < n; ++i) {
buf[i] = static_cast<nv_bfloat16 *>(tensors[i]->data);
}
ok = ggml_cuda_ar_allreduce_copy_outer<nv_bfloat16, nv_bfloat16>(
p, backends, buf, buf, inner_compute, ne);
break;
}
case GGML_TYPE_F16: {
half * buf[GGML_CUDA_MAX_DEVICES] = {};
for (int i = 0; i < n; ++i) {
buf[i] = static_cast<half *>(tensors[i]->data);
}
ok = ggml_cuda_ar_allreduce_copy_outer<half, half>(
p, backends, buf, buf, inner_compute, ne);
break;
}
default:
GGML_ASSERT(false);
}
}
} else {
// host_buf carries T_wire-typed data; max_chunk_elems is the count that
// fits in one host_buf at the wire size.
const size_t max_chunk_elems = p->buf_bytes / type_size;
const size_t input_type_size = ggml_type_size(input_type);
// Chunked kernel path runs entirely on the caller's compute stream:
// since AR is a barrier here, same-stream ordering subsumes any
// cross-stream event handshake that the copy-engine path needs, and
// skips the cross-stream scheduling overhead that was hurting the
// small-tensor (tg) latency on the AR-stream variant. Only ev.ker is
// still recorded at end-of-AR for acquire_slot's pool-wraparound check.
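// Example (illustrative): a 200K-element F32 AR on the BF16 wire is
// 400 KB < copy_threshold, so it lands here; max_chunk_elems = 1 MB / 2 B =
// 512K, so the whole tensor fits in a single chunk and needs one kernel
// launch per GPU.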
for (int64_t chunk_start = 0; chunk_start < ne; chunk_start += (int64_t) max_chunk_elems) {
const size_t remaining_elems = (size_t) (ne - chunk_start);
const size_t chunk_elems = remaining_elems < max_chunk_elems ? remaining_elems : max_chunk_elems;
const size_t chunk_dst_bytes = chunk_elems * input_type_size;
const auto [slot, token] = ggml_cuda_ar_acquire_slot(p);
const bool last_chunk = chunk_start + (int64_t) chunk_elems == ne;
for (int i = 0; i < n; ++i) {
const int peer = 1 - i; // valid for n == 2 only
ggml_cuda_set_device(p->devices[i]);
auto * cuda_ctx = static_cast<ggml_backend_cuda_context *>(backends[i]->context);
GGML_ASSERT(cuda_ctx->device == p->devices[i]);
cudaStream_t stream = cuda_ctx->stream();
char * data = static_cast<char *>(tensors[i]->data) + chunk_start * (int64_t) input_type_size;
// Match NCCL/meta-backend semantics: inactive shards contribute
// zeros. On the BF16 path the F32 tensor data was already
// zeroed up-front (above), so per-chunk zeroing isn't needed.
if (!compute_flag[i] && !use_bf16) {
CUDA_CHECK(cudaMemsetAsync(data, 0, chunk_dst_bytes, stream));
}
#define LAUNCH_AR_KERNEL(T_dst, T_wire) \
ggml_cuda_ar_kernel<T_dst, T_wire><<<dim3(GGML_CUDA_AR_KERNEL_BLOCKS), dim3(256), 0, stream>>>( \
reinterpret_cast<const T_dst *>(data), \
reinterpret_cast<T_dst *>(data), \
reinterpret_cast<T_wire *>(p->host_buf[i].dev + (size_t) slot * p->buf_bytes), \
reinterpret_cast<const T_wire *>(p->host_buf[peer].dev + (size_t) slot * p->buf_bytes), \
static_cast<int>(chunk_elems), \
ggml_cuda_ar_arrival_ptr(p, slot, i), \
ggml_cuda_ar_arrival_ptr(p, slot, peer), \
token)
if (use_bf16) {
GGML_ASSERT(input_type == GGML_TYPE_F32);
LAUNCH_AR_KERNEL(float, nv_bfloat16);
} else {
switch (input_type) {
case GGML_TYPE_F32: LAUNCH_AR_KERNEL(float, float); break;
case GGML_TYPE_F16: LAUNCH_AR_KERNEL(half, half); break;
case GGML_TYPE_BF16: LAUNCH_AR_KERNEL(nv_bfloat16, nv_bfloat16); break;
default: GGML_ASSERT(false);
}
}
#undef LAUNCH_AR_KERNEL
CUDA_CHECK(cudaGetLastError());
if (last_chunk) {
CUDA_CHECK(cudaEventRecord(p->ev_pool[i][slot].ker, stream));
}
}
}
}
return ok;
}
#else // defined(GGML_USE_HIP) || defined(GGML_USE_MUSA)
// HIP and MUSA lack the host-mapped pinned-memory APIs (cudaHostAllocPortable
// / cudaHostAllocMapped / cudaHostGetDevicePointer) and __nanosleep that this
// implementation relies on, so the internal AllReduce is a CUDA-only feature.
// The dispatcher in ggml-cuda.cu treats a nullptr pipeline as "init failed"
// and silently falls back to the meta backend's generic AllReduce.
ggml_cuda_ar_pipeline * ggml_cuda_ar_pipeline_init(const int *, size_t) {
return nullptr;
}
void ggml_cuda_ar_pipeline_free(ggml_cuda_ar_pipeline *) {
}
bool ggml_cuda_ar_allreduce(ggml_cuda_ar_pipeline *, ggml_backend_t *, ggml_tensor **) {
return false;
}
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)

View File

@@ -0,0 +1,29 @@
#pragma once
#include "common.cuh"
#include "ggml-backend-impl.h"
#include <cstddef>
// Opaque pipeline context -- owns all pinned buffers, streams, and events.
struct ggml_cuda_ar_pipeline;
// Allocate a pipeline for n_devices GPUs.
// devices[] holds the CUDA device IDs in rank order.
// Returns nullptr on allocation failure.
ggml_cuda_ar_pipeline * ggml_cuda_ar_pipeline_init(
const int * devices, size_t n_devices);
// Release all resources owned by the pipeline.
void ggml_cuda_ar_pipeline_free(ggml_cuda_ar_pipeline * pipeline);
// Execute an in-place AllReduce (sum) across tensors[0..n_devices-1].
// tensors[i] must live on the device managed by backends[i] and be
// contiguous F32, F16, or BF16.
// Preconditions are checked by the CUDA comm dispatcher before calling this.
// Returns true once the reduction work has been enqueued successfully.
bool ggml_cuda_ar_allreduce(
ggml_cuda_ar_pipeline * pipeline,
ggml_backend_t * backends,
ggml_tensor ** tensors);
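// Caller-side sketch (hypothetical names, assuming the dispatcher has already
// validated the preconditions above):
//   ggml_cuda_ar_pipeline * ar = ggml_cuda_ar_pipeline_init(device_ids, 2);
//   if (ar != nullptr) {
//       if (!ggml_cuda_ar_allreduce(ar, backends, tensors)) {
//           // fall back to the meta backend's generic AllReduce
//       }
//   }
//   ggml_cuda_ar_pipeline_free(ar); // safe to call on nullptr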

View File

@@ -830,6 +830,18 @@ static __device__ __forceinline__ float ggml_cuda_ue4m3_to_fp32(uint8_t x) {
#endif // defined(GGML_USE_HIP) && defined(CDNA3) && defined(FP8_AVAILABLE) && HIP_VERSION >= 60200000
}
static __device__ __forceinline__ uint8_t ggml_cuda_fp32_to_ue4m3(float x) {
#if defined(BLACKWELL_MMA_AVAILABLE) // This is used for NVFP4 subblock scale quantizations only
if (!(x > 0.0f)) {
return 0;
}
const __nv_fp8_e4m3 xf(x);
return xf.__x;
#else
NO_DEVICE_CODE; // Used only for NVFP4 Scales for Activations, only for Blackwell
#endif // defined(BLACKWELL_MMA_AVAILABLE)
}
__device__ __forceinline__ uint8_t ggml_cuda_float_to_fp4_e2m1(float x, float e) {
const uint8_t sign_bit = (x < 0.0f) << 3;
float ax = fabsf(x) * e;

View File

@@ -61,11 +61,19 @@ static constexpr __host__ __device__ fattn_mma_config ggml_cuda_fattn_mma_get_co
GGML_CUDA_FATTN_MMA_CONFIG_CASE(128, 128, 32, 128, 2, 64, 64, 64, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(128, 128, 64, 128, 2, 64, 64, 64, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(192, 128, 8, 64, 4, 64, 96, 64, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(192, 128, 16, 64, 4, 32, 96, 64, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(192, 128, 32, 128, 2, 32, 96, 64, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(192, 128, 64, 128, 2, 32, 96, 64, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 8, 64, 4, 64, 128, 128, 128, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 16, 64, 4, 32, 128, 128, 128, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 32, 128, 2, 32, 128, 128, 128, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 64, 128, 2, 32, 128, 128, 128, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(320, 256, 32, 128, 2, 32, 128, 128, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(320, 256, 64, 256, 1, 32, 128, 128, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(512, 512, 8, 64, 4, 32, 256, 256, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(512, 512, 16, 64, 4, 32, 256, 256, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(512, 512, 32, 128, 2, 32, 128, 128, 128, 1, false);
@@ -85,6 +93,9 @@ static constexpr __host__ __device__ fattn_mma_config ggml_cuda_fattn_mma_get_co
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 32, 128, 2, 64, 128, 128, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 64, 128, 2, 64, 128, 128, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(320, 256, 32, 128, 2, 32, 128, 128, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(320, 256, 64, 256, 1, 32, 128, 128, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(512, 512, 8, 64, 4, 32, 96, 64, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(512, 512, 16, 64, 4, 32, 96, 64, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(512, 512, 32, 128, 2, 32, 128, 128, 128, 1, false);
@@ -118,6 +129,9 @@ static constexpr __host__ __device__ fattn_mma_config ggml_cuda_fattn_mma_get_co
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 32, 128, 2, 64, 128, 128, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 64, 128, 2, 64, 128, 128, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(320, 256, 32, 128, 2, 64, 160, 128, 64, 2, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(320, 256, 64, 128, 2, 64, 160, 128, 64, 2, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(512, 512, 16, 64, 4, 32, 128, 128, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(512, 512, 32, 128, 2, 32, 128, 128, 128, 1, false);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(512, 512, 64, 256, 1, 32, 128, 128, 128, 1, false);
@@ -1217,7 +1231,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
float KQ_max_scale[cols_per_thread];
#pragma unroll
for (int col = 0; col < cols_per_thread; ++col) {
const int jc = cols_per_warp == 8 ? T_C_KQ::get_j(col) : T_C_KQ::get_i(2*col);
const int jc = (threadIdx.y/np)*cols_per_warp + (cols_per_warp == 8 ? T_C_KQ::get_j(col) : T_C_KQ::get_i(2*col));
const float sink = sinks_f[jc % ncols2];
const float KQ_max_new = fmaxf(KQ_max[col], sink);
@@ -1552,6 +1566,10 @@ static __global__ void flash_attn_ext_f16(
NO_DEVICE_CODE;
return;
}
if (DKQ == 192 && ncols2 != 8 && ncols2 != 16) {
NO_DEVICE_CODE;
return;
}
#ifdef VOLTA_MMA_AVAILABLE
if (ncols1*ncols2 < 32) {
NO_DEVICE_CODE;
@@ -1825,6 +1843,10 @@ extern DECL_FATTN_MMA_F16_CASE(576, 512, 1, 16);
extern DECL_FATTN_MMA_F16_CASE(576, 512, 2, 16);
extern DECL_FATTN_MMA_F16_CASE(576, 512, 4, 16);
// Mistral Small 4 (DKQ=320, DV=256), GQA=32-only build:
extern DECL_FATTN_MMA_F16_CASE(320, 256, 1, 32);
extern DECL_FATTN_MMA_F16_CASE(320, 256, 2, 32);
// For GLM 4.7 Flash
extern DECL_FATTN_MMA_F16_CASE(576, 512, 4, 4);
extern DECL_FATTN_MMA_F16_CASE(576, 512, 8, 4);

View File

@@ -34,10 +34,18 @@ void ggml_cuda_flash_attn_ext_tile(ggml_backend_cuda_context & ctx, ggml_tensor
GGML_ASSERT(V->ne[0] == K->ne[0]);
ggml_cuda_flash_attn_ext_tile_case<128, 128>(ctx, dst);
} break;
case 192: {
GGML_ASSERT(V->ne[0] == 128);
ggml_cuda_flash_attn_ext_tile_case<192, 128>(ctx, dst);
} break;
case 256: {
GGML_ASSERT(V->ne[0] == K->ne[0]);
ggml_cuda_flash_attn_ext_tile_case<256, 256>(ctx, dst);
} break;
case 320: {
GGML_ASSERT(V->ne[0] == 256);
ggml_cuda_flash_attn_ext_tile_case<320, 256>(ctx, dst);
} break;
case 512: {
GGML_ASSERT(V->ne[0] == K->ne[0]);
ggml_cuda_flash_attn_ext_tile_case<512, 512>(ctx, dst);

View File

@@ -62,12 +62,20 @@ static constexpr __host__ __device__ uint32_t ggml_cuda_fattn_tile_get_config_nv
GGML_CUDA_FATTN_TILE_CONFIG_CASE(128, 128, 16, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(128, 128, 32, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 2, 64, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 4, 128, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 8, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 16, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 32, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 2, 64, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 4, 128, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 8, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 16, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 32, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(320, 256, 16, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 4, 128, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 8, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 16, 256, 2, 64, 64)
@@ -122,12 +130,20 @@ static constexpr __host__ __device__ uint32_t ggml_cuda_fattn_tile_get_config_nv
GGML_CUDA_FATTN_TILE_CONFIG_CASE(128, 128, 16, 128, 3, 32, 128)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(128, 128, 32, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 2, 128, 3, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 4, 128, 3, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 8, 256, 2, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 16, 256, 2, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 32, 256, 2, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 2, 128, 3, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 4, 128, 3, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 8, 256, 2, 32, 256)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 16, 256, 2, 32, 128)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 32, 256, 2, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(320, 256, 16, 256, 2, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 4, 128, 2, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 8, 256, 2, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 16, 256, 2, 32, 64)
@@ -189,12 +205,20 @@ static constexpr __host__ __device__ uint32_t ggml_cuda_fattn_tile_get_config_am
GGML_CUDA_FATTN_TILE_CONFIG_CASE(128, 128, 32, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(128, 128, 64, 256, 2, 64, 32)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 2, 256, 2, 128, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 4, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 8, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 16, 256, 2, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 32, 256, 2, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 2, 256, 2, 128, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 4, 256, 2, 64, 128)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 8, 256, 2, 64, 128)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 16, 256, 2, 32, 128)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 32, 256, 2, 32, 128)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(320, 256, 32, 512, 1, 128, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 4, 128, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 8, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 16, 256, 2, 64, 64)
@@ -258,12 +282,20 @@ static constexpr __host__ __device__ uint32_t ggml_cuda_fattn_tile_get_config_am
GGML_CUDA_FATTN_TILE_CONFIG_CASE(128, 128, 32, 256, 3, 128, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(128, 128, 64, 256, 3, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 2, 64, 8, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 4, 128, 6, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 8, 128, 6, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 16, 256, 5, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(192, 128, 32, 256, 3, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 2, 64, 8, 32, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 4, 128, 6, 32, 256)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 8, 128, 6, 32, 256)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 16, 256, 5, 32, 256)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(256, 256, 32, 256, 3, 64, 128)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(320, 256, 32, 256, 2, 128, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 4, 128, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 8, 256, 2, 64, 64)
GGML_CUDA_FATTN_TILE_CONFIG_CASE(512, 512, 16, 256, 4, 64, 64)
@@ -1116,7 +1148,7 @@ static void launch_fattn_tile_switch_ncols1(ggml_backend_cuda_context & ctx, ggm
constexpr size_t nbytes_shared = 0;
#ifdef GGML_USE_HIP
if constexpr (DV <= 128) {
if constexpr (DKQ <= 128) {
if (Q->ne[1] > 32/ncols2) {
constexpr int cols_per_block = 64;
const int nwarps = ggml_cuda_fattn_tile_get_nthreads (DKQ, DV, cols_per_block, cc) / warp_size;
@@ -1130,7 +1162,7 @@ static void launch_fattn_tile_switch_ncols1(ggml_backend_cuda_context & ctx, ggm
#endif // GGML_USE_HIP
#ifndef GGML_USE_HIP
if constexpr (DV <= 256)
if constexpr (DKQ <= 256)
#endif // GGML_USE_HIP
{
if (Q->ne[1] > 16/ncols2) {
@@ -1144,14 +1176,16 @@ static void launch_fattn_tile_switch_ncols1(ggml_backend_cuda_context & ctx, ggm
}
}
if (Q->ne[1] > 8/ncols2) {
constexpr int cols_per_block = 16;
const int nwarps = ggml_cuda_fattn_tile_get_nthreads (DKQ, DV, cols_per_block, cc) / warp_size;
const int nbatch_fa = ggml_cuda_fattn_tile_get_nbatch_fa(DKQ, DV, cols_per_block, cc);
fattn_kernel_t fattn_kernel = flash_attn_tile<DKQ, DV, cols_per_block/ncols2, ncols2, use_logit_softcap>;
launch_fattn<DV, cols_per_block/ncols2, ncols2>
(ctx, dst, fattn_kernel, nwarps, nbytes_shared, nbatch_fa, true, true, false, warp_size);
return;
if constexpr (ncols2 <= 16) {
if (Q->ne[1] > 8/ncols2) {
constexpr int cols_per_block = 16;
const int nwarps = ggml_cuda_fattn_tile_get_nthreads (DKQ, DV, cols_per_block, cc) / warp_size;
const int nbatch_fa = ggml_cuda_fattn_tile_get_nbatch_fa(DKQ, DV, cols_per_block, cc);
fattn_kernel_t fattn_kernel = flash_attn_tile<DKQ, DV, cols_per_block/ncols2, ncols2, use_logit_softcap>;
launch_fattn<DV, cols_per_block/ncols2, ncols2>
(ctx, dst, fattn_kernel, nwarps, nbytes_shared, nbatch_fa, true, true, false, warp_size);
return;
}
}
if constexpr (ncols2 <= 8) {
@@ -1210,6 +1244,25 @@ static void launch_fattn_tile_switch_ncols2(ggml_backend_cuda_context & ctx, ggm
const int gqa_limit = nvidia && gqa_ratio <= 4 && DV <= 256 ? 16 : INT_MAX;
const bool use_gqa_opt = mask && max_bias == 0.0f && Q->ne[1] <= gqa_limit && K->ne[1] % FATTN_KQ_STRIDE == 0;
if constexpr (DKQ == 320) {
// This branch is only used for Mistral Small 4 which has a GQA ratio of 32.
// On AMD, simply use that GQA ratio with 32 columns / block since we always have enough SRAM.
// On NVIDIA however, the tile kernel is only used for GPUs that can't use the mma kernel (Pascal and older).
// Therefore, use a GQA ratio of 16 with 16 columns / block to stay below 48 kiB of SRAM / block.
#ifdef GGML_USE_HIP
if (use_gqa_opt && gqa_ratio % 32 == 0) {
launch_fattn_tile_switch_ncols1<DKQ, DV, 32, use_logit_softcap>(ctx, dst);
return;
}
#else
if (use_gqa_opt && gqa_ratio % 16 == 0) {
launch_fattn_tile_switch_ncols1<DKQ, DV, 16, use_logit_softcap>(ctx, dst);
return;
}
#endif // GGML_USE_HIP
GGML_ABORT("flash-attn tile (320/256): expected GQA ratio multiple of 32");
}
if constexpr (DKQ == 576) {
if (use_gqa_opt && gqa_ratio % 16 == 0) {
launch_fattn_tile_switch_ncols1<DKQ, DV, 16, use_logit_softcap>(ctx, dst);
@@ -1221,7 +1274,20 @@ static void launch_fattn_tile_switch_ncols2(ggml_backend_cuda_context & ctx, ggm
}
}
if constexpr (DKQ <= 512) {
if constexpr (DKQ == 192) {
// MiMo-V2.5 / V2.5-Pro / V2-Flash: gqa_ratio is 8 (SWA) or 16 (full attn)
if (use_gqa_opt && gqa_ratio % 16 == 0) {
launch_fattn_tile_switch_ncols1<DKQ, DV, 16, use_logit_softcap>(ctx, dst);
return;
}
if (use_gqa_opt && gqa_ratio % 8 == 0) {
launch_fattn_tile_switch_ncols1<DKQ, DV, 8, use_logit_softcap>(ctx, dst);
return;
}
GGML_ABORT("flash-attn tile (192/128): expected GQA ratio multiple of 8");
}
if constexpr (DKQ <= 512 && DKQ != 320 && DKQ != 192) {
if (use_gqa_opt && gqa_ratio % 8 == 0) {
launch_fattn_tile_switch_ncols1<DKQ, DV, 8, use_logit_softcap>(ctx, dst);
return;
@@ -1274,6 +1340,8 @@ extern DECL_FATTN_TILE_CASE( 80, 80);
extern DECL_FATTN_TILE_CASE( 96, 96);
extern DECL_FATTN_TILE_CASE(112, 112);
extern DECL_FATTN_TILE_CASE(128, 128);
extern DECL_FATTN_TILE_CASE(192, 128);
extern DECL_FATTN_TILE_CASE(256, 256);
extern DECL_FATTN_TILE_CASE(320, 256);
extern DECL_FATTN_TILE_CASE(512, 512);
extern DECL_FATTN_TILE_CASE(576, 512);

View File

@@ -139,10 +139,42 @@ static void ggml_cuda_flash_attn_ext_mma_f16(ggml_backend_cuda_context & ctx, gg
GGML_ASSERT(V->ne[0] == 128);
ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2<128, 128>(ctx, dst);
break;
case 192: {
// MiMo-V2.5 / V2.5-Pro / V2-Flash: gqa_ratio is 8 (SWA) or 16 (full attn)
GGML_ASSERT(V->ne[0] == 128);
float max_bias = 0.0f;
memcpy(&max_bias, (const float *) KQV->op_params + 1, sizeof(float));
const bool use_gqa_opt = mask && max_bias == 0.0f;
GGML_ASSERT(use_gqa_opt);
GGML_ASSERT(Q->ne[2] % K->ne[2] == 0);
const int gqa_ratio = Q->ne[2] / K->ne[2];
if (gqa_ratio % 16 == 0) {
ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<192, 128, 16>(ctx, dst);
} else {
GGML_ASSERT(gqa_ratio % 8 == 0);
ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<192, 128, 8>(ctx, dst);
}
} break;
case 256:
GGML_ASSERT(V->ne[0] == 256);
ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2<256, 256>(ctx, dst);
break;
case 320:
// For Mistral Small 4, go straight to the ncols1 switch (ncols2=32-only build).
GGML_ASSERT(V->ne[0] == 256);
{
float max_bias = 0.0f;
memcpy(&max_bias, (const float *) KQV->op_params + 1, sizeof(float));
const bool use_gqa_opt = mask && max_bias == 0.0f;
GGML_ASSERT(use_gqa_opt);
GGML_ASSERT(Q->ne[2] % K->ne[2] == 0);
const int gqa_ratio = Q->ne[2] / K->ne[2];
GGML_ASSERT(gqa_ratio % 32 == 0);
ggml_cuda_flash_attn_ext_mma_f16_switch_ncols1<320, 256, 32>(ctx, dst);
}
break;
case 512:
GGML_ASSERT(V->ne[0] == 512);
ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2<512, 512>(ctx, dst);
@@ -352,6 +384,22 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const
return BEST_FATTN_KERNEL_NONE;
}
break;
case 192:
if (V->ne[0] != 128 || !gqa_opt_applies) {
return BEST_FATTN_KERNEL_NONE;
}
if (gqa_ratio % 8 != 0) {
return BEST_FATTN_KERNEL_NONE;
}
break;
case 320:
if (V->ne[0] != 256 || !gqa_opt_applies) {
return BEST_FATTN_KERNEL_NONE;
}
if (gqa_ratio % 32 != 0) {
return BEST_FATTN_KERNEL_NONE;
}
break;
case 512:
if (V->ne[0] != K->ne[0]) {
return BEST_FATTN_KERNEL_NONE;
@@ -401,7 +449,8 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const
}
// For small batch sizes the vector kernel may be preferable over the kernels optimized for large batch sizes:
const bool can_use_vector_kernel = Q->ne[0] <= 256 && Q->ne[0] % 64 == 0 && K->ne[1] % FATTN_KQ_STRIDE == 0;
// 192 satisfies % 64 == 0 but has no vec instance (DKQ != DV); force it onto the MMA path.
const bool can_use_vector_kernel = Q->ne[0] <= 256 && Q->ne[0] % 64 == 0 && Q->ne[0] != 192 && K->ne[1] % FATTN_KQ_STRIDE == 0;
// If Turing tensor cores are available, use them:
if (turing_mma_available(cc) && Q->ne[0] != 40 && Q->ne[0] != 72) {
@@ -430,7 +479,7 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const
if (volta_mma_available(cc) && Q->ne[0] != 40 && Q->ne[0] != 72) {
int gqa_ratio_eff = 1;
const int ncols2_max = Q->ne[0] == 576 ? 16 : 8;
const int ncols2_max = (Q->ne[0] == 576 || Q->ne[0] == 192) ? 16 : 8;
while (gqa_ratio % (2*gqa_ratio_eff) == 0 && gqa_ratio_eff < ncols2_max) {
gqa_ratio_eff *= 2;
}
@@ -444,7 +493,7 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const
}
// Use the WMMA kernel if possible:
if (ggml_cuda_should_use_wmma_fattn(cc) && K->ne[1] % FATTN_KQ_STRIDE == 0 && Q->ne[0] != 40 && Q->ne[0] != 72 && Q->ne[0] != 512 && Q->ne[0] != 576) {
if (ggml_cuda_should_use_wmma_fattn(cc) && K->ne[1] % FATTN_KQ_STRIDE == 0 && Q->ne[0] != 40 && Q->ne[0] != 72 && Q->ne[0] != 192 && Q->ne[0] != 512 && Q->ne[0] != 576) {
if (can_use_vector_kernel && Q->ne[1] <= 2) {
return BEST_FATTN_KERNEL_VEC;
}
@@ -477,7 +526,7 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const
}
// Use MFMA flash attention for CDNA (MI100+):
if (amd_mfma_available(cc) && Q->ne[0] != 40 && Q->ne[0] != 72 && Q->ne[0] != 256 && Q->ne[0] != 512 && Q->ne[0] != 576) {
if (amd_mfma_available(cc) && Q->ne[0] != 40 && Q->ne[0] != 72 && Q->ne[0] != 192 && Q->ne[0] != 256 && Q->ne[0] != 512 && Q->ne[0] != 576) {
const int64_t eff_nq = Q->ne[1] * (gqa_opt_applies ? gqa_ratio : 1);
// MMA vs tile crossover benchmarked on MI300X @ d32768:
// hsk=64 (gqa=4): MMA wins at eff >= 128 (+11%)

View File

@@ -6,17 +6,18 @@ template<int qk, int qr, dequantize_kernel_t dequantize_kernel, typename dst_t>
static __global__ void k_get_rows(
const void * __restrict__ src0, const int32_t * __restrict__ src1, dst_t * __restrict__ dst,
const int64_t ne00, /*const int64_t ne01, const int64_t ne02, const int64_t ne03,*/
/*const int64_t ne10,*/ const int64_t ne11, const int64_t ne12, /*const int64_t ne13,*/
/*const int64_t ne10,*/ const int64_t ne11, const uint3 ne12_fdv, /*const int64_t ne13,*/
/*const size_t s0,*/ const size_t s1, const size_t s2, const size_t s3,
/*const size_t nb00,*/ const size_t nb01, const size_t nb02, const size_t nb03,
const size_t s10, const size_t s11, const size_t s12/*, const size_t s13*/) {
for (int64_t z = blockIdx.z; z < ne11*ne12; z += gridDim.z) {
for (int64_t z = blockIdx.z; z < ne11*(int64_t)ne12_fdv.z; z += gridDim.z) {
for (int64_t i00 = 2*(blockIdx.y*blockDim.x + threadIdx.x); i00 < ne00; i00 += gridDim.y*blockDim.x) {
// The x and y dimensions of the grid are swapped because the maximum allowed grid size for x is higher.
const int i10 = blockIdx.x;
const int i11 = z / ne12; // TODO fastdiv
const int i12 = z % ne12;
const uint2 dm = fast_div_modulo((uint32_t)z, ne12_fdv);
const int i11 = dm.x;
const int i12 = dm.y;
const int i01 = src1[i10*s10 + i11*s11 + i12*s12];
@@ -42,17 +43,18 @@ template<typename src0_t, typename dst_t>
static __global__ void k_get_rows_float(
const src0_t * __restrict__ src0, const int32_t * __restrict__ src1, dst_t * __restrict__ dst,
const int64_t ne00, /*const int64_t ne01, const int64_t ne02, const int64_t ne03,*/
/*const int64_t ne10,*/ const int64_t ne11, const int64_t ne12, /*const int64_t ne13,*/
/*const int64_t ne10,*/ const int64_t ne11, const uint3 ne12_fdv, /*const int64_t ne13,*/
/*const size_t s0,*/ const size_t s1, const size_t s2, const size_t s3,
/*const size_t nb00,*/ const size_t nb01, const size_t nb02, const size_t nb03,
const size_t s10, const size_t s11, const size_t s12/*, const size_t s13*/) {
for (int64_t z = blockIdx.z; z < ne11*ne12; z += gridDim.z) {
for (int64_t z = blockIdx.z; z < ne11*(int64_t)ne12_fdv.z; z += gridDim.z) {
for (int64_t i00 = blockIdx.y*blockDim.x + threadIdx.x; i00 < ne00; i00 += gridDim.y*blockDim.x) {
// The x and y dimensions of the grid are swapped because the maximum allowed grid size for x is higher.
const int i10 = blockIdx.x;
const int i11 = z / ne12; // TODO fastdiv
const int i12 = z % ne12;
const uint2 dm = fast_div_modulo((uint32_t)z, ne12_fdv);
const int i11 = dm.x;
const int i12 = dm.y;
if (i00 >= ne00) {
return;
@@ -115,10 +117,14 @@ static void get_rows_cuda_q(
GGML_ASSERT(ne00 % 2 == 0);
GGML_ASSERT(ne12 > 0);
GGML_ASSERT(ne11 <= std::numeric_limits<uint32_t>::max() / ne12);
const uint3 ne12_fdv = init_fastdiv_values(ne12);
k_get_rows<qk, qr, dq><<<block_nums, block_dims, 0, stream>>>(
src0_d, src1_d, dst_d,
ne00, /*ne01, ne02, ne03,*/
/*ne10,*/ ne11, ne12, /*ne13,*/
/*ne10,*/ ne11, ne12_fdv, /*ne13,*/
/* s0,*/ s1, s2, s3,
/* nb00,*/ nb01, nb02, nb03,
s10, s11, s12/*, s13*/);
@@ -146,10 +152,14 @@ static void get_rows_cuda_float(
const size_t s12 = nb12 / sizeof(int32_t);
// const size_t s13 = nb13 / sizeof(int32_t);
GGML_ASSERT(ne12 > 0);
GGML_ASSERT(ne11 <= std::numeric_limits<uint32_t>::max() / ne12);
const uint3 ne12_fdv = init_fastdiv_values(ne12);
k_get_rows_float<<<block_nums, block_dims, 0, stream>>>(
src0_d, src1_d, dst_d,
ne00, /*ne01, ne02, ne03,*/
/*ne10,*/ ne11, ne12, /*ne13,*/
/*ne10,*/ ne11, ne12_fdv, /*ne13,*/
/* s0,*/ s1, s2, s3,
/* nb00,*/ nb01, nb02, nb03,
s10, s11, s12/*, s13*/);

File diff suppressed because it is too large

View File

@@ -1015,25 +1015,35 @@ namespace ggml_cuda_mma {
#endif // AMD_MFMA_AVAILABLE
}
static __device__ __forceinline__ void mma_block_scaled(tile<16, 8, float> & D,
const tile<16, 8, int> & A,
const tile<8, 8, int> & B,
uint32_t a_scale,
uint32_t b_scale) {
template <ggml_type type>
static __device__ __forceinline__ void mma_block_scaled_fp4(tile<16, 8, float> & D,
const tile<16, 8, int> & A,
const tile<8, 8, int> & B,
uint32_t a_scale,
uint32_t b_scale) {
#ifdef BLACKWELL_MMA_AVAILABLE
const int * Axi = (const int *) A.x;
const int * Bxi = (const int *) B.x;
float * Dxi = (float *) D.x;
asm volatile(
"mma.sync.aligned.kind::mxf4.block_scale.scale_vec::2X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue8m0 "
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3}, "
"%10, {0, 0}, %11, {0, 0};"
: "+f"(Dxi[0]), "+f"(Dxi[1]), "+f"(Dxi[2]), "+f"(Dxi[3])
: "r"(Axi[0]), "r"(Axi[1]), "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[0]), "r"(Bxi[1]), "r"(a_scale), "r"(b_scale));
if constexpr (type == GGML_TYPE_MXFP4) {
asm volatile(
"mma.sync.aligned.kind::mxf4.block_scale.scale_vec::2X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue8m0 "
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3}, "
"%10, {0, 0}, %11, {0, 0};"
: "+f"(Dxi[0]), "+f"(Dxi[1]), "+f"(Dxi[2]), "+f"(Dxi[3])
: "r"(Axi[0]), "r"(Axi[1]), "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[0]), "r"(Bxi[1]), "r"(a_scale), "r"(b_scale));
} else {
asm volatile(
"mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3 "
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%0, %1, %2, %3}, "
"%10, {0, 0}, %11, {0, 0};"
: "+f"(Dxi[0]), "+f"(Dxi[1]), "+f"(Dxi[2]), "+f"(Dxi[3])
: "r"(Axi[0]), "r"(Axi[1]), "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[0]), "r"(Bxi[1]), "r"(a_scale), "r"(b_scale));
}
#else
GGML_UNUSED_VARS(D, A, B, a_scale, b_scale);
#endif // BLACKWELL_MMA_AVAILABLE
#endif // BLACKWELL_MMA_AVAILABLE
}
static __device__ __forceinline__ void mma(

View File

@@ -122,7 +122,7 @@ void ggml_cuda_mul_mat_q(
|| GGML_CUDA_CC_IS_CDNA(cc);
// TODO: tighter pool buffer size vs q8 path
const bool use_native_mxfp4 = blackwell_mma_available(cc) && src0->type == GGML_TYPE_MXFP4;
const bool use_native_fp4 = blackwell_mma_available(cc) && (src0->type == GGML_TYPE_MXFP4 || src0->type == GGML_TYPE_NVFP4);
if (!ids) {
const size_t nbytes_src1_q8_1 = ne13*ne12 * ne11*ne10_padded * sizeof(block_q8_1)/QK8_1 +
@@ -133,9 +133,9 @@ void ggml_cuda_mul_mat_q(
const int64_t s11 = src1->nb[1] / ts_src1;
const int64_t s12 = src1->nb[2] / ts_src1;
const int64_t s13 = src1->nb[3] / ts_src1;
if (use_native_mxfp4) {
if (use_native_fp4) {
static_assert(sizeof(block_fp4_mmq) == 4 * sizeof(block_q8_1));
quantize_mmq_mxfp4_cuda(src1_d, nullptr, src1_q8_1.get(), src0->type, ne10, s11, s12, s13, ne10_padded,
quantize_mmq_fp4_cuda(src1_d, nullptr, src1_q8_1.get(), src0->type, ne10, s11, s12, s13, ne10_padded,
ne11, ne12, ne13, stream);
} else {
@@ -146,10 +146,8 @@ void ggml_cuda_mul_mat_q(
}
// Stride depends on quantization format
const int64_t s12 = use_native_mxfp4 ?
ne11 * ne10_padded * sizeof(block_fp4_mmq) /
(8 * QK_MXFP4 * sizeof(int)) // block_fp4_mmq holds 256 values (8 blocks of 32)
:
const int64_t s12 = use_native_fp4 ?
ne11 * ne10_padded * sizeof(block_fp4_mmq) / (QK_K * sizeof(int)) : // block_fp4_mmq holds 256 values
ne11 * ne10_padded * sizeof(block_q8_1) / (QK8_1 * sizeof(int));
const int64_t s13 = ne12*s12;
@@ -198,8 +196,8 @@ void ggml_cuda_mul_mat_q(
const int64_t s12 = src1->nb[2] / ts_src1;
const int64_t s13 = src1->nb[3] / ts_src1;
if (use_native_mxfp4) {
quantize_mmq_mxfp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
if (use_native_fp4) {
quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
} else {
quantize_mmq_q8_1_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
@@ -208,8 +206,9 @@ void ggml_cuda_mul_mat_q(
CUDA_CHECK(cudaGetLastError());
}
const int64_t s12 = use_native_mxfp4 ? ne11 * ne10_padded * sizeof(block_fp4_mmq) / (8 * QK_MXFP4 * sizeof(int)) :
ne11 * ne10_padded * sizeof(block_q8_1) / (QK8_1 * sizeof(int));
static_assert(QK_K == 8 * QK_MXFP4, "QK_K needs to be 8 * QK_MXFP4");
const int64_t s12 = use_native_fp4 ? ne11 * ne10_padded * sizeof(block_fp4_mmq) / (QK_K * sizeof(int)) :
ne11 * ne10_padded * sizeof(block_q8_1) / (QK8_1 * sizeof(int));
const int64_t s13 = ne12*s12;
// Note that ne02 is used instead of ne12 because the number of y channels determines the z dimension of the CUDA grid.

View File

@@ -10,9 +10,9 @@
using namespace ggml_cuda_mma;
#define MMQ_DP4A_MAX_BATCH_SIZE 64 // Max. batch size to use for dp4a MMQ kernels when FP16 tensor cores are available.
#define MMQ_ITER_K 256
#define MMQ_ITER_K_MXFP4_FP4 512
#define MMQ_NWARPS 8
#define MMQ_ITER_K 256
#define MMQ_ITER_K_FP4 512
#define MMQ_NWARPS 8
typedef void (*load_tiles_mmq_t)(const char * __restrict__ x, int * x_tile, const int kbx0, const int i_max, const int stride);
typedef void (*vec_dot_mmq_t)(const int * __restrict__ x, const int * __restrict__ y, float * __restrict__ sum, const int k00);
@@ -46,9 +46,12 @@ struct block_q8_1_mmq {
int8_t qs[4*QK8_1]; // 128 values quantized to 8 bit each
};
// this struct is used for fp4 data types (currently only used for Blackwell)
// mxfp4 has block size 32, each int32 of d4 contains 2 e8m0 scales in the lower 16 bits
// nvfp4 has block size 16, each int32 of d4 contains 4 ue4m3 scales
struct block_fp4_mmq {
uint32_t d4[4]; // 8 E8M0 scales (1 per 32 values), 2 packed per uint32: d4[0]={s0,s1}, d4[1]={s2,s3}, etc.
int8_t qs[4 * 32]; // 256 FP4 values packed as 4-bit pairs (2 per byte), 8 blocks of 32 values
uint32_t d4[4];
int8_t qs[4 * 32]; // 256 FP4 values packed as 4-bit pairs (2 per byte)
};
static_assert(sizeof(block_q8_1_mmq) == 4*QK8_1 + 4*sizeof(half2), "Unexpected block_q8_1_mmq size");
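Taking the layout comment above at face value, a small host-side sketch of where scale i would sit for each type; the byte order and indexing are assumptions for illustration, not the access pattern of the device kernels:

#include <cstdint>

struct block_fp4_mmq_ref {      // mirrors the struct above: 4 scale words + 128 bytes holding 256 FP4 codes
    uint32_t d4[4];
    int8_t   qs[4 * 32];
};
static_assert(sizeof(block_fp4_mmq_ref) == 144, "matches 4 * sizeof(block_q8_1) = 4 * 36 bytes");

// MXFP4: 8 groups of 32 values; scale i is byte (i % 2) in the low 16 bits of word i / 2.
static uint8_t mxfp4_scale(const block_fp4_mmq_ref & b, int i) {
    return (uint8_t)(b.d4[i / 2] >> (8 * (i % 2)));
}

// NVFP4: 16 groups of 16 values; scale i is simply byte i of d4 (little-endian assumed).
static uint8_t nvfp4_scale(const block_fp4_mmq_ref & b, int i) {
    return (uint8_t)(b.d4[i / 4] >> (8 * (i % 4)));
}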
@@ -143,10 +146,11 @@ static int get_mmq_y_host(const int cc) {
static constexpr __device__ int get_iter_k([[maybe_unused]] const ggml_type type) {
#if defined(BLACKWELL_MMA_AVAILABLE)
return type == GGML_TYPE_MXFP4 ? MMQ_ITER_K_MXFP4_FP4 : MMQ_ITER_K;
#else
return MMQ_ITER_K;
if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) {
return MMQ_ITER_K_FP4;
}
#endif // defined(BLACKWELL_MMA_AVAILABLE)
return MMQ_ITER_K;
}
static constexpr __device__ int get_mmq_y_device() {
@@ -213,8 +217,8 @@ static constexpr __host__ __device__ tile_x_sizes mmq_get_dp4a_tile_x_sizes(ggml
}
#define MMQ_MMA_TILE_X_K_Q8_0 (2*MMQ_TILE_NE_K + 2*MMQ_TILE_NE_K/QI8_0 + 4)
#define MMQ_MMA_TILE_X_K_FP4 (2*MMQ_TILE_NE_K + 8 + 4) // MXFP4
#define MMQ_MMA_TILE_X_K_NVFP4 (2*MMQ_TILE_NE_K + MMQ_TILE_NE_K/2 + 4) // NVFP4
#define MMQ_MMA_TILE_X_K_FP4 (2*MMQ_TILE_NE_K + 8 + 4) // MXFP4 and NVFP4 Blackwell
#define MMQ_MMA_TILE_X_K_NVFP4 (2*MMQ_TILE_NE_K + MMQ_TILE_NE_K/2 + 4) // NVFP4 Generic
#define MMQ_MMA_TILE_X_K_Q8_1 (2*MMQ_TILE_NE_K + 2*MMQ_TILE_NE_K/QI8_0 + 4)
#define MMQ_MMA_TILE_X_K_Q2_K (2*MMQ_TILE_NE_K + MMQ_TILE_NE_K + 4)
#define MMQ_MMA_TILE_X_K_Q3_K (2*MMQ_TILE_NE_K + MMQ_TILE_NE_K/2 + 4)
@@ -240,7 +244,11 @@ static constexpr __host__ __device__ int mmq_get_mma_tile_x_k(ggml_type type) {
case GGML_TYPE_Q8_0: return MMQ_MMA_TILE_X_K_Q8_0;
// tile sizes are the same for Q8_1 and FP4 for blackwell
case GGML_TYPE_MXFP4: return MMQ_MMA_TILE_X_K_Q8_1;
#if defined(BLACKWELL_MMA_AVAILABLE)
case GGML_TYPE_NVFP4: return MMQ_MMA_TILE_X_K_FP4;
#else
case GGML_TYPE_NVFP4: return MMQ_MMA_TILE_X_K_NVFP4;
#endif // defined(BLACKWELL_MMA_AVAILABLE)
case GGML_TYPE_Q2_K: return MMQ_MMA_TILE_X_K_Q2_K;
case GGML_TYPE_Q3_K: return MMQ_MMA_TILE_X_K_Q3_K;
case GGML_TYPE_Q4_K: return MMQ_MMA_TILE_X_K_Q8_1;
@@ -934,6 +942,128 @@ static __device__ __forceinline__ void load_tiles_mxfp4_fp4(const char * __restr
}
}
#ifdef BLACKWELL_MMA_AVAILABLE
template <int mmq_y, bool need_check>
static __device__ __forceinline__ void load_tiles_nvfp4_nvfp4(const char * __restrict__ x,
int * __restrict__ x_tile,
const int kbx0,
const int i_max,
const int stride) {
constexpr int nwarps = mmq_get_nwarps_device();
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
constexpr int iter_k = get_iter_k(GGML_TYPE_NVFP4);
constexpr int threads_per_row = iter_k / QK_NVFP4; // each thread processes 1 block
constexpr int rows_per_warp = warp_size / threads_per_row;
uint32_t * x_u32 = (uint32_t *) x_tile;
const int txi = threadIdx.x;
const int kbx = txi % threads_per_row;
const int row_in_warp = txi / threads_per_row;
const block_nvfp4 * bxi_base = (const block_nvfp4 *) x + kbx0 + kbx;
uint32_t * x_u32_scale = x_u32 + 64 + kbx;
#pragma unroll
for (int i0 = 0; i0 < mmq_y; i0 += rows_per_warp * nwarps) {
int i = i0 + threadIdx.y * rows_per_warp + row_in_warp;
if constexpr (need_check) {
i = min(i, i_max);
}
const block_nvfp4 * bxi = bxi_base + i * stride;
const int row_base = i * MMQ_MMA_TILE_X_K_FP4;
const int q_base = row_base + 8 * kbx;
const uint32_t * src_qs = reinterpret_cast<const uint32_t *>(bxi->qs);
#pragma unroll
for (int sub = 0; sub < QK_NVFP4 / QK_NVFP4_SUB; ++sub) {
x_u32[q_base + 2 * sub + 0] = src_qs[2 * sub + 0];
x_u32[q_base + 2 * sub + 1] = src_qs[2 * sub + 1];
}
x_u32_scale[row_base] = get_int_b4(bxi->d, 0);
}
}
// Shared MMA kernel for MXFP4 and NVFP4 on Blackwell.
// Both quantizations encode values as e2m1 (FP4) and produce one uint32 scale per
// m16n8k64 MMA call; only the PTX kind (scale_vec::2X ue8m0 vs scale_vec::4X ue4m3)
// and the per-type stride constant differ.
template <int mmq_x, int mmq_y, ggml_type type>
static __device__ __forceinline__ void vec_dot_fp4_fp4_mma(const int * __restrict__ x,
const int * __restrict__ y,
float * __restrict__ sum,
const int k00) {
static_assert(type == GGML_TYPE_MXFP4 || type == GGML_TYPE_NVFP4,
"vec_dot_fp4_fp4_mma: type must be MXFP4 or NVFP4");
typedef tile<16, 8, int> tile_A;
typedef tile<8, 8, int> tile_B;
typedef tile<16, 8, float> tile_C;
constexpr int stride = MMQ_MMA_TILE_X_K_FP4;
constexpr int granularity = mmq_get_granularity_device(mmq_x);
constexpr int rows_per_warp = 2 * granularity;
constexpr int ntx = rows_per_warp / tile_C::I;
constexpr int nfrags = MMQ_TILE_NE_K / tile_A::J;
y += (threadIdx.y % ntx) * (tile_C::J * MMQ_TILE_Y_K);
const int * x_qs = (const int *) x;
const uint32_t * x_sc = (const uint32_t *) (x_qs + 2 * MMQ_TILE_NE_K);
const int * y_qs = (const int *) y + 4;
const uint32_t * y_sc = (const uint32_t *) y;
// 2 threads per quad supply the packed scale register to the block_scale MMA,
// see https://docs.nvidia.com/cuda/parallel-thread-execution/#warp-level-block-scaling
const int tidx_A = threadIdx.x / 4 + (threadIdx.x % 2) * 8;
const int tidx_B = threadIdx.x / 4;
const int i0 = (threadIdx.y / ntx) * rows_per_warp;
tile_A A[ntx][nfrags];
uint32_t scaleA[ntx][nfrags];
#pragma unroll
for (int n = 0; n < ntx; ++n) {
#pragma unroll
for (int frag = 0; frag < nfrags; ++frag) {
const int k0 = k00 + frag * tile_A::J;
load_ldmatrix(A[n][frag], x_qs + (i0 + n * tile_A::I) * stride + k0, stride);
scaleA[n][frag] = x_sc[(i0 + n * tile_A::I + tidx_A) * stride + k0 / tile_A::J];
}
}
#pragma unroll
for (int j0 = 0; j0 < mmq_x; j0 += ntx * tile_C::J) {
tile_B B[nfrags];
uint32_t scaleB[nfrags];
#pragma unroll
for (int frag = 0; frag < nfrags; ++frag) {
const int k0 = frag * tile_B::J;
load_generic(B[frag], y_qs + j0 * MMQ_TILE_Y_K + k0, MMQ_TILE_Y_K);
scaleB[frag] = y_sc[(j0 + tidx_B) * MMQ_TILE_Y_K + frag];
}
#pragma unroll
for (int n = 0; n < ntx; ++n) {
#pragma unroll
for (int frag = 0; frag < nfrags; ++frag) {
tile_C C = {};
mma_block_scaled_fp4<type>(C, A[n][frag], B[frag], scaleA[n][frag], scaleB[frag]);
#pragma unroll
for (int l = 0; l < tile_C::ne; ++l) {
sum[(j0 / tile_C::J + n) * tile_C::ne + l] += C.x[l];
}
}
}
}
}
#endif // BLACKWELL_MMA_AVAILABLE
template <int mmq_y, bool need_check>
static __device__ __forceinline__ void load_tiles_nvfp4(const char * __restrict__ x,
@@ -1163,77 +1293,6 @@ static __device__ __forceinline__ void vec_dot_q8_0_q8_1_mma(
#endif // defined(AMD_MFMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
}
template <int mmq_x, int mmq_y>
static __device__ __forceinline__ void vec_dot_mxfp4_mxfp4_mma(const int * __restrict__ x,
const int * __restrict__ y,
float * __restrict__ sum,
const int k00) {
typedef tile<16, 8, int> tile_A;
typedef tile<8, 8, int> tile_B;
typedef tile<16, 8, float> tile_C; // Output is float for native scaled MMA
constexpr int granularity = mmq_get_granularity_device(mmq_x);
constexpr int rows_per_warp = 2 * granularity;
constexpr int ntx = rows_per_warp / tile_C::I; // Number of x minitiles per warp.
y += (threadIdx.y % ntx) * (tile_C::J * MMQ_TILE_Y_FP4_K);
// Match layout from load_tiles_mxfp4_fp4
const int * x_qs = (const int *) x;
const uint32_t * x_sc = (const uint32_t *) (x_qs + 2 * MMQ_TILE_NE_K);
const int * y_qs = (const int *) y + 4;
const uint32_t * y_sc = (const uint32_t *) y;
// tile_A has a length of 64 logical values vs. 32 values in block_mxfp4
tile_A A[ntx][MMQ_TILE_NE_K / (2 * QI_MXFP4)];
uint32_t scaleA[ntx][MMQ_TILE_NE_K / (2 * QI_MXFP4)];
// Block scale
// Each thread has to point to a 4 byte scale value
// https://docs.nvidia.com/cuda/parallel-thread-execution/#warp-level-block-scaling
const int i0 = (threadIdx.y / ntx) * rows_per_warp;
#pragma unroll
for (int n = 0; n < ntx; ++n) {
#pragma unroll
for (int k01 = 0; k01 < MMQ_TILE_NE_K; k01 += 2 * QI_MXFP4) {
const int k0 = k00 + k01;
load_ldmatrix(A[n][k01 / (2 * QI_MXFP4)], x_qs + (i0 + n * tile_A::I) * MMQ_MMA_TILE_X_K_FP4 + k0,
MMQ_MMA_TILE_X_K_FP4);
// based on block-scaling document, 2 threads in each quad need to supply to the scale value
const int tidx = threadIdx.x / 4 + (threadIdx.x % 2) * 8;
scaleA[n][k01 / (2 * QI_MXFP4)] =
*(x_sc + (i0 + n * tile_A::I + tidx) * MMQ_MMA_TILE_X_K_FP4 + k0 / (2 * QI_MXFP4));
}
}
#pragma unroll
for (int j0 = 0; j0 < mmq_x; j0 += ntx * tile_C::J) {
#pragma unroll
for (int k01 = 0; k01 < MMQ_TILE_NE_K; k01 += 2 * QI_MXFP4) {
tile_B B;
uint32_t scaleB; // 2xN scales
load_generic(B, y_qs + j0 * MMQ_TILE_Y_FP4_K + k01, MMQ_TILE_Y_FP4_K);
scaleB = y_sc[(j0 + threadIdx.x / 4) * MMQ_TILE_Y_FP4_K + k01 / (2 * QI_MXFP4)];
#pragma unroll
for (int n = 0; n < ntx; ++n) {
tile_C C;
mma_block_scaled(C, A[n][k01 / (2 * QI_MXFP4)], B, scaleA[n][k01 / (2 * QI_MXFP4)], scaleB);
#pragma unroll
for (int l = 0; l < tile_C::ne; ++l) {
sum[(j0 / tile_C::J + n) * tile_C::ne + l] += C.x[l];
}
}
}
}
}
template <int mmq_x, int mmq_y>
static __device__ __forceinline__ void vec_dot_q8_1_q8_1_dp4a(
@@ -3259,7 +3318,7 @@ struct mmq_type_traits<mmq_x, mmq_y, need_check, GGML_TYPE_MXFP4> {
static constexpr int vdr = VDR_MXFP4_Q8_1_MMQ;
#ifdef BLACKWELL_MMA_AVAILABLE
static constexpr load_tiles_mmq_t load_tiles = load_tiles_mxfp4_fp4<mmq_y, need_check>;
static constexpr vec_dot_mmq_t vec_dot_mma = vec_dot_mxfp4_mxfp4_mma<mmq_x, mmq_y>;
static constexpr vec_dot_mmq_t vec_dot_mma = vec_dot_fp4_fp4_mma<mmq_x, mmq_y, GGML_TYPE_MXFP4>;
#else
static constexpr load_tiles_mmq_t load_tiles = load_tiles_mxfp4<mmq_y, need_check>;
static constexpr vec_dot_mmq_t vec_dot_mma = vec_dot_q8_0_q8_1_mma<mmq_x, mmq_y, MMQ_Q8_1_DS_LAYOUT_D4>;
@@ -3270,8 +3329,13 @@ struct mmq_type_traits<mmq_x, mmq_y, need_check, GGML_TYPE_MXFP4> {
template <int mmq_x, int mmq_y, bool need_check>
struct mmq_type_traits<mmq_x, mmq_y, need_check, GGML_TYPE_NVFP4> {
static constexpr int vdr = VDR_NVFP4_Q8_1_MMQ;
#ifdef BLACKWELL_MMA_AVAILABLE
static constexpr load_tiles_mmq_t load_tiles = load_tiles_nvfp4_nvfp4<mmq_y, need_check>;
static constexpr vec_dot_mmq_t vec_dot_mma = vec_dot_fp4_fp4_mma<mmq_x, mmq_y, GGML_TYPE_NVFP4>;
#else
static constexpr load_tiles_mmq_t load_tiles = load_tiles_nvfp4<mmq_y, need_check>;
static constexpr vec_dot_mmq_t vec_dot_mma = vec_dot_q8_0_16_q8_1_mma<mmq_x, mmq_y>;
#endif // BLACKWELL_MMA_AVAILABLE
static constexpr vec_dot_mmq_t vec_dot_dp4a = vec_dot_q8_0_16_q8_1_dp4a<mmq_x, mmq_y>;
};
@@ -3406,7 +3470,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
#if defined(BLACKWELL_MMA_AVAILABLE)
// FP4 tile stores 8 blocks
constexpr int ne_block = (type == GGML_TYPE_MXFP4) ? 8 * QK_MXFP4 : 4 * QK8_1;
constexpr int ne_block = (type == GGML_TYPE_MXFP4 || type == GGML_TYPE_NVFP4) ? QK_K : 4 * QK8_1;
#else
constexpr int ne_block = 4 * QK8_1;
#endif // defined(BLACKWELL_MMA_AVAILABLE)

View File

@@ -115,6 +115,7 @@ static constexpr __host__ __device__ int get_mmvq_mmid_max_batch_pascal_older(gg
case GGML_TYPE_IQ4_NL: return 6;
case GGML_TYPE_IQ4_XS: return 5;
case GGML_TYPE_MXFP4: return 4;
case GGML_TYPE_NVFP4: return 4;
case GGML_TYPE_Q2_K: return 4;
case GGML_TYPE_Q3_K: return 4;
case GGML_TYPE_Q4_0: return 6;
@@ -135,6 +136,7 @@ static constexpr __host__ __device__ int get_mmvq_mmid_max_batch_turing_plus(ggm
case GGML_TYPE_IQ3_S: return 6;
case GGML_TYPE_IQ3_XXS: return 7;
case GGML_TYPE_MXFP4: return 7;
case GGML_TYPE_NVFP4: return 8;
case GGML_TYPE_Q2_K: return 7;
case GGML_TYPE_Q3_K: return 5;
default: return MMVQ_MAX_BATCH_SIZE;
@@ -221,6 +223,7 @@ static constexpr __host__ __device__ int get_mmvq_mmid_max_batch_rdna4(ggml_type
case GGML_TYPE_IQ4_NL: return 7;
case GGML_TYPE_IQ4_XS: return 5;
case GGML_TYPE_MXFP4: return 5;
case GGML_TYPE_NVFP4: return 5;
case GGML_TYPE_Q3_K: return 4;
case GGML_TYPE_Q4_0: return 7;
case GGML_TYPE_Q4_1: return 7;

View File

@@ -54,15 +54,31 @@ void ggml_cuda_out_prod(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
const int64_t dps2 = ne2 / ne02;
const int64_t dps3 = ne3 / ne03;
// TODO batched matrix multiplication
for (int64_t i3 = 0; i3 < ne3; ++i3) {
for (int64_t i2 = 0; i2 < ne2; ++i2) {
if (dps2 == 1 && ne2 > 1) {
// src0 has uniform stride s02 along dim 2; batch the inner loop with a strided GEMM
GGML_ASSERT(ne2 <= std::numeric_limits<int>::max());
const int batch_count = (int) ne2;
for (int64_t i3 = 0; i3 < ne3; ++i3) {
CUBLAS_CHECK(
cublasSgemm(handle, CUBLAS_OP_N, src1_cublas_op,
cublasSgemmStridedBatched(handle, CUBLAS_OP_N, src1_cublas_op,
ne0, ne1, ne01,
&alpha, src0_d + (i3/dps3)*s03 + (i2/dps2)*s02, lda,
src1_d + i3 *s13 + i2 *s12, ldb,
&beta, dst_d + i3 *s3 + i2 *s2, ldc));
&alpha, src0_d + (i3/dps3)*s03, lda, s02,
src1_d + i3 *s13, ldb, s12,
&beta, dst_d + i3 *s3, ldc, s2,
batch_count));
}
} else {
// Fallback: ne2 == 1 (no batching benefit) or dps2 > 1 (src0 broadcast along dim 2
// with non-uniform stride; would need cublasSgemmBatched with pointer arrays).
for (int64_t i3 = 0; i3 < ne3; ++i3) {
for (int64_t i2 = 0; i2 < ne2; ++i2) {
CUBLAS_CHECK(
cublasSgemm(handle, CUBLAS_OP_N, src1_cublas_op,
ne0, ne1, ne01,
&alpha, src0_d + (i3/dps3)*s03 + (i2/dps2)*s02, lda,
src1_d + i3 *s13 + i2 *s12, ldb,
&beta, dst_d + i3 *s3 + i2 *s2, ldc));
}
}
}
}
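The reason dps2 == 1 is required: the per-batch src0 offset (i2 / dps2) * s02 must advance by a constant stride for cublasSgemmStridedBatched, and with broadcasting (dps2 > 1) some steps are zero. A small host-side check of that property, reusing the variable names from the hunk for illustration:

#include <cstdint>

// Returns true when offset(i2) = (i2 / dps2) * s02 advances by the same stride for every
// batch index, i.e. when the loop of cublasSgemm calls can become one strided-batched GEMM.
static bool src0_stride_is_uniform(int64_t ne2, int64_t dps2, int64_t s02) {
    for (int64_t i2 = 1; i2 < ne2; ++i2) {
        const int64_t step = (i2 / dps2) * s02 - ((i2 - 1) / dps2) * s02;
        if (step != s02) {
            return false;   // broadcast along dim 2: some steps are 0, so no single stride fits
        }
    }
    return true;            // dps2 == 1 (or ne2 <= 1): offsets form an arithmetic progression
}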

View File

@@ -70,6 +70,102 @@ __device__ __forceinline__ uint8_t compute_e8m0_scale(float amax) {
return static_cast<uint8_t>(biased);
}
static __global__ void quantize_mmq_nvfp4(
const float * __restrict__ x, const int32_t * __restrict__ ids, void * __restrict__ vy,
const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
const int64_t ne0, const int64_t ne1, const int64_t ne2) {
#if defined(BLACKWELL_MMA_AVAILABLE)
const int64_t i0_base = ((int64_t) blockDim.x * blockIdx.y + threadIdx.x) * QK_NVFP4_SUB;
if (i0_base >= ne0) {
return;
}
const int64_t i1 = blockIdx.x;
const int64_t i2 = blockIdx.z % ne2;
const int64_t i3 = blockIdx.z / ne2;
const int64_t i01 = ids ? ids[i1] : i1;
const int64_t k_block = i0_base / QK_K;
const int64_t blocks_per_col = (ne0 + QK_K - 1) / QK_K;
if (k_block >= blocks_per_col) {
return;
}
const int64_t ib = blockIdx.z * ((int64_t) blocks_per_col * ne1) + k_block * ne1 + blockIdx.x;
block_fp4_mmq * y = (block_fp4_mmq *) vy;
block_fp4_mmq * yb = y + ib;
const int sub = (i0_base % QK_K) / QK_NVFP4_SUB;
float vals_raw[QK_NVFP4_SUB];
float amax_raw = 0.0f;
const int64_t base_idx = i3 * s03 + i2 * s02 + i01 * s01;
#pragma unroll
for (int k = 0; k < QK_NVFP4_SUB; k++) {
const int64_t i00 = i0_base + k;
if (i00 < ne00) {
const float v = x[base_idx + i00];
vals_raw[k] = v;
amax_raw = fmaxf(amax_raw, fabsf(v));
} else {
vals_raw[k] = 0.0f;
}
}
static constexpr int test_offsets[5] = { 0, -1, 1, -2, 2};
const int first_fp8_code = (int) ggml_cuda_fp32_to_ue4m3(amax_raw / 6.0f);
float best_err = FLT_MAX;
uint8_t fp8_code = 0;
float subblock_scale = 0.0f;
#pragma unroll // Check +/- 2 to find best code to reduce NVFP4 activation loss. Negligible overhead on Blackwell.
for (int i = 0; i < 5; i++) {
const int test_code = first_fp8_code + test_offsets[i];
if (test_code < 0 || test_code > 0x7e) {
continue;
}
const uint8_t code = (uint8_t) test_code;
const float test_scale = ggml_cuda_ue4m3_to_fp32(code);
const float test_inv_scale = test_scale > 0.0f ? 0.5f / test_scale : 0.0f;
float cur_err = 0.0f;
#pragma unroll
for (int k = 0; k < QK_NVFP4_SUB; ++k) {
const float v = vals_raw[k];
const uint8_t q = ggml_cuda_float_to_fp4_e2m1(v, test_inv_scale);
const float err_diff = fabsf(v) - fabsf(kvalues_mxfp4[q & 0x7]) * test_scale;
cur_err = fmaf(err_diff, err_diff, cur_err);
}
if (cur_err < best_err) {
best_err = cur_err;
fp8_code = test_code;
subblock_scale = test_scale;
}
}
const float inv_scale = subblock_scale > 0.0f ? 0.5f / subblock_scale : 0.0f;
uint32_t q0 = 0;
uint32_t q1 = 0;
#pragma unroll // this is faster than the previous __nv_fp4x4_e2m1
for (int k = 0; k < QK_NVFP4_SUB / 4; ++k) {
q0 |= (uint32_t) ggml_cuda_float_to_fp4_e2m1(vals_raw[k + 0], inv_scale) << (8 * k);
q0 |= (uint32_t) ggml_cuda_float_to_fp4_e2m1(vals_raw[k + 8], inv_scale) << (8 * k + 4);
q1 |= (uint32_t) ggml_cuda_float_to_fp4_e2m1(vals_raw[k + 4], inv_scale) << (8 * k);
q1 |= (uint32_t) ggml_cuda_float_to_fp4_e2m1(vals_raw[k + 12], inv_scale) << (8 * k + 4);
}
uint32_t * yqs = reinterpret_cast<uint32_t *>(yb->qs);
yqs[2 * sub + 0] = q0;
yqs[2 * sub + 1] = q1;
reinterpret_cast<uint8_t *>(yb->d4)[sub] = fp8_code;
#else
NO_DEVICE_CODE; // This is for Blackwell NVFP4 activations only.
#endif // defined(BLACKWELL_MMA_AVAILABLE)
}
// quantize values in the format mxfp4 is stored which is interleaved nibbles
// i.e. a block a0-a31 is represented as a0a16,a1a17 ...a15a31
static __global__ void quantize_mmq_mxfp4(const float * __restrict__ x,
@@ -316,28 +412,32 @@ void quantize_mmq_q8_1_cuda(
}
}
void quantize_mmq_mxfp4_cuda(const float * x,
const int32_t * ids,
void * vy,
[[maybe_unused]] const ggml_type type_src0,
const int64_t ne00,
const int64_t s01,
const int64_t s02,
const int64_t s03,
const int64_t ne0,
const int64_t ne1,
const int64_t ne2,
const int64_t ne3,
cudaStream_t stream) {
GGML_ASSERT(ne0 % (2 * QK_MXFP4) == 0);
void quantize_mmq_fp4_cuda(
const float * x, const int32_t * ids, void * vy, const ggml_type type_src0,
const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
const int64_t ne0, const int64_t ne1, const int64_t ne2, const int64_t ne3, cudaStream_t stream) {
GGML_ASSERT(type_src0 == GGML_TYPE_MXFP4 || type_src0 == GGML_TYPE_NVFP4);
GGML_ASSERT(ne0 > 0);
constexpr int nwarps = 8;
constexpr int vals_per_warp = 2 * QK_MXFP4;
constexpr int vals_per_block = nwarps * vals_per_warp;
if (type_src0 == GGML_TYPE_NVFP4) {
GGML_ASSERT(ne00 % QK_NVFP4 == 0);
constexpr int nvfp4_block_size = 128;
const int64_t block_num_y = (ne0 + QK_NVFP4_SUB * nvfp4_block_size - 1) / (QK_NVFP4_SUB * nvfp4_block_size);
const dim3 block_size(nvfp4_block_size, 1, 1);
const dim3 num_blocks(ne1, block_num_y, ne2 * ne3);
quantize_mmq_nvfp4<<<num_blocks, block_size, 0, stream>>>(
x, ids, vy, ne00, s01, s02, s03, ne0, ne1, ne2);
} else {
GGML_ASSERT(ne0 % (2 * QK_MXFP4) == 0);
const int64_t block_num_y = (ne0 + vals_per_block - 1) / vals_per_block;
const dim3 num_blocks(ne1, block_num_y, ne2 * ne3);
const dim3 block_size(WARP_SIZE, nwarps, 1);
constexpr int nwarps = 8;
constexpr int vals_per_warp = 2 * QK_MXFP4;
constexpr int vals_per_block = nwarps * vals_per_warp;
quantize_mmq_mxfp4<<<num_blocks, block_size, 0, stream>>>(x, ids, vy, ne00, s01, s02, s03, ne0, ne1, ne2);
const int64_t block_num_y = (ne0 + vals_per_block - 1) / vals_per_block;
const dim3 num_blocks(ne1, block_num_y, ne2 * ne3);
const dim3 block_size(WARP_SIZE, nwarps, 1);
quantize_mmq_mxfp4<<<num_blocks, block_size, 0, stream>>>(x, ids, vy, ne00, s01, s02, s03, ne0, ne1, ne2);
}
}
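A host-side reference of the sub-block scale search in quantize_mmq_nvfp4 above: start from the UE4M3 code nearest to amax / 6 (6 is the largest E2M1 magnitude), test the two neighbouring codes on each side, and keep the one with the smallest squared reconstruction error. The encode/decode helpers here are brute-force stand-ins, not ggml's device conversions:

#include <cfloat>
#include <cmath>
#include <cstdint>

static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};  // FP4 magnitudes

static float decode_ue4m3(uint8_t c) {                        // unsigned E4M3, bias 7
    const int e = c >> 3, m = c & 7;
    return e == 0 ? ldexpf(m / 8.0f, -6) : ldexpf(1.0f + m / 8.0f, e - 7);
}

static uint8_t encode_ue4m3(float v) {                        // nearest valid code (0x7f is NaN)
    uint8_t best = 0; float best_err = fabsf(v - decode_ue4m3(0));
    for (int c = 1; c <= 0x7e; ++c) {
        const float err = fabsf(v - decode_ue4m3((uint8_t) c));
        if (err < best_err) { best_err = err; best = (uint8_t) c; }
    }
    return best;
}

static uint8_t encode_e2m1_mag(float a) {                     // nearest E2M1 magnitude code for a >= 0
    uint8_t best = 0; float best_err = fabsf(a - kE2M1[0]);
    for (uint8_t i = 1; i < 8; ++i) {
        const float err = fabsf(a - kE2M1[i]);
        if (err < best_err) { best_err = err; best = i; }
    }
    return best;
}

// Pick the UE4M3 scale code for one sub-block of n (= QK_NVFP4_SUB = 16) values.
static uint8_t pick_subblock_scale(const float * vals, int n) {
    float amax = 0.0f;
    for (int k = 0; k < n; ++k) amax = fmaxf(amax, fabsf(vals[k]));
    const int center = encode_ue4m3(amax / 6.0f);
    uint8_t best_code = 0; float best_err = FLT_MAX;
    for (int off = -2; off <= 2; ++off) {
        const int c = center + off;
        if (c < 0 || c > 0x7e) continue;
        const float s = decode_ue4m3((uint8_t) c);
        float err = 0.0f;
        for (int k = 0; k < n; ++k) {
            const float q = kE2M1[encode_e2m1_mag(s > 0.0f ? fabsf(vals[k]) / s : 0.0f)] * s;
            const float d = fabsf(vals[k]) - q;
            err += d * d;
        }
        if (err < best_err) { best_err = err; best_code = (uint8_t) c; }
    }
    return best_code;
}

Searching only a few codes around the initial estimate keeps the overhead negligible, per the comment in the kernel itself.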

View File

@@ -26,7 +26,7 @@ void quantize_mmq_q8_1_cuda(
ggml_type type_src0, int64_t ne00, int64_t s01, int64_t s02, int64_t s03,
int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3, cudaStream_t stream);
void quantize_mmq_mxfp4_cuda(const float * x,
void quantize_mmq_fp4_cuda(const float * x,
const int32_t * ids,
void * vy,
ggml_type type_src0,

View File

@@ -0,0 +1,72 @@
#include "snake.cuh"
#include "convert.cuh"
// Fused Snake activation: y = x + sin^2(a * x) * inv_b
// x: [T, C] (T contiguous), a: [1, C], inv_b: [1, C]
// Supports F32, F16, BF16 data with F32 compute.
template <typename T>
static __global__ void snake_kernel(
const T * __restrict__ x,
const float * __restrict__ a,
const float * __restrict__ inv_b,
T * __restrict__ dst,
const int total,
const uint3 T_len_fastdiv) {
const int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= total) return;
const int c = (int) fastdiv((uint32_t) idx, T_len_fastdiv);
const float xi = ggml_cuda_cast<float>(x[idx]);
const float s = sinf(a[c] * xi);
dst[idx] = ggml_cuda_cast<T>(xi + s * s * inv_b[c]);
}
// Internal launcher with explicit x/a/inv_b/dst tensors.
// Shared by the public op (reads dst->src) and the fusion path (explicit args).
static void launch_snake(ggml_backend_cuda_context & ctx,
const ggml_tensor * x,
const ggml_tensor * a,
const ggml_tensor * inv_b,
ggml_tensor * dst) {
const float * a_d = (const float *)a->data;
const float * inv_b_d = (const float *)inv_b->data;
const int T = (int)x->ne[0];
const int C = (int)x->ne[1];
const int total = T * C;
const uint3 T_len_fastdiv = init_fastdiv_values((uint64_t) T);
const int block_size = 256;
const int grid_size = (total + block_size - 1) / block_size;
cudaStream_t stream = ctx.stream();
switch (x->type) {
case GGML_TYPE_F32: {
snake_kernel<<<grid_size, block_size, 0, stream>>>(
(const float *)x->data, a_d, inv_b_d, (float *)dst->data, total, T_len_fastdiv);
} break;
case GGML_TYPE_F16: {
snake_kernel<<<grid_size, block_size, 0, stream>>>(
(const half *)x->data, a_d, inv_b_d, (half *)dst->data, total, T_len_fastdiv);
} break;
case GGML_TYPE_BF16: {
snake_kernel<<<grid_size, block_size, 0, stream>>>(
(const nv_bfloat16 *)x->data, a_d, inv_b_d, (nv_bfloat16 *)dst->data, total, T_len_fastdiv);
} break;
default:
GGML_ABORT("snake: unsupported type");
}
}
// Fusion entry: caller supplies x/a/inv_b explicitly from the matched
// mul -> sin -> sqr -> mul -> add pattern. The dst is the trailing add output.
void ggml_cuda_op_snake_fused(ggml_backend_cuda_context & ctx,
const ggml_tensor * x,
const ggml_tensor * a,
const ggml_tensor * inv_b,
ggml_tensor * dst) {
launch_snake(ctx, x, a, inv_b, dst);
}
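A plain CPU reference of the fused activation above, assuming the same layout (x is [T, C] with T contiguous, a and inv_b hold one value per channel):

#include <cmath>

// y = x + sin^2(a * x) * inv_b, applied per channel c over T contiguous samples
static void snake_ref(const float * x, const float * a, const float * inv_b,
                      float * y, int T, int C) {
    for (int c = 0; c < C; ++c) {
        for (int t = 0; t < T; ++t) {
            const int   i = c * T + t;        // channel c = i / T, matching the kernel's fastdiv
            const float s = sinf(a[c] * x[i]);
            y[i] = x[i] + s * s * inv_b[c];
        }
    }
}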

View File

@@ -0,0 +1,8 @@
#include "common.cuh"
// Fusion entry point. Caller supplies x/a/inv_b explicitly.
void ggml_cuda_op_snake_fused(ggml_backend_cuda_context & ctx,
const ggml_tensor * x,
const ggml_tensor * a,
const ggml_tensor * inv_b,
ggml_tensor * dst);

View File

@@ -3,6 +3,7 @@
template <bool apply_silu, size_t split_d_inner, size_t d_conv>
static __global__ void ssm_conv_f32(const float * __restrict__ src0, const float * __restrict__ src1,
const float * __restrict__ bias,
const int src0_nb0, const int src0_nb1, const int src0_nb2, const int src1_nb1,
float * __restrict__ dst, const int dst_nb0, const int dst_nb1, const int dst_nb2,
const int64_t n_t) {
@@ -27,6 +28,8 @@ static __global__ void ssm_conv_f32(const float * __restrict__ src0, const float
w[j] = w_block[tid * stride_w + j];
}
float b = bias != nullptr ? bias[bidy * split_d_inner + tid] : 0.0f;
for (int64_t i = 0; i < n_t; i++) {
float sumf = 0.0f;
@@ -42,12 +45,14 @@ static __global__ void ssm_conv_f32(const float * __restrict__ src0, const float
for (size_t j = 0; j < d_conv; j++) {
sumf += x[(i + j) % d_conv] * w[j];
}
sumf += b;
y_block[i * stride_y + tid] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
}
}
template <bool apply_silu, size_t split_d_inner, size_t d_conv, int64_t split_n_t>
static __global__ void ssm_conv_long_token_f32(const float * __restrict__ src0, const float * __restrict__ src1,
const float * __restrict__ bias,
const int src0_nb0, const int src0_nb1, const int src0_nb2,
const int src1_nb1, float * __restrict__ dst, const int dst_nb0,
const int dst_nb1, const int dst_nb2, const int64_t n_t) {
@@ -97,6 +102,8 @@ static __global__ void ssm_conv_long_token_f32(const float * __restrict__ src0,
w[j] = w_block[tid * stride_w + j];
}
float b = bias != nullptr ? bias[bidy * split_d_inner + tid] : 0.0f;
// Compute from shared memory
for (int64_t i = 0; i < local_n_t; i++) {
float sumf = 0.0f;
@@ -104,12 +111,13 @@ static __global__ void ssm_conv_long_token_f32(const float * __restrict__ src0,
for (size_t j = 0; j < d_conv; j++) {
sumf += smem[tid * n_cols + i + j] * w[j];
}
sumf += b;
y_block[i * stride_y + tid] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
}
}
template <bool apply_silu>
static void ssm_conv_f32_cuda(const float * src0, const float * src1, const int src0_nb0, const int src0_nb1,
static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1,
const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1,
const int dst_nb2, const int64_t nc, const int64_t nr, const int64_t n_t,
const int64_t n_s, cudaStream_t stream) {
@@ -120,14 +128,14 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const int
constexpr int kNC = decltype(NC)::value;
if (n_t <= 32) {
const dim3 blocks(n_s, (nr + threads - 1) / threads, 1);
ssm_conv_f32<apply_silu, threads, kNC><<<blocks, threads, 0, stream>>>(src0, src1, src0_nb0, src0_nb1, src0_nb2, src1_nb1,
ssm_conv_f32<apply_silu, threads, kNC><<<blocks, threads, 0, stream>>>(src0, src1, bias, src0_nb0, src0_nb1, src0_nb2, src1_nb1,
dst, dst_nb0, dst_nb1, dst_nb2, n_t);
} else {
const int64_t split_n_t = 32;
dim3 blocks(n_s, (nr + threads - 1) / threads, (n_t + split_n_t - 1) / split_n_t);
const size_t smem_size = threads * (kNC - 1 + split_n_t) * sizeof(float);
ssm_conv_long_token_f32<apply_silu, threads, kNC, split_n_t><<<blocks, threads, smem_size, stream>>>(
src0, src1, src0_nb0, src0_nb1, src0_nb2, src1_nb1, dst, dst_nb0, dst_nb1, dst_nb2, n_t);
src0, src1, bias, src0_nb0, src0_nb1, src0_nb2, src1_nb1, dst, dst_nb0, dst_nb1, dst_nb2, n_t);
}
};
@@ -140,11 +148,18 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const int
}
}
void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * silu_dst) {
void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * bias_add_node, ggml_tensor * silu_dst) {
const struct ggml_tensor * src0 = dst->src[0]; // conv_x
const struct ggml_tensor * src1 = dst->src[1]; // conv1d.weight
const bool fuse_bias = bias_add_node != nullptr;
const bool fuse_silu = silu_dst != nullptr;
// bias always comes with silu.
GGML_ASSERT(!fuse_bias || fuse_silu);
// The bias (when fused) is the non-conv operand of the ADD node.
const struct ggml_tensor * bias = fuse_bias ? (bias_add_node->src[0] == dst ? bias_add_node->src[1] : bias_add_node->src[0]) : nullptr;
// When fusing, write to silu_dst (the node downstream references).
const struct ggml_tensor * out = fuse_silu ? silu_dst : dst;
@@ -160,16 +175,23 @@ void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, g
const float * src0_d = (const float *) src0->data;
const float * src1_d = (const float *) src1->data;
const float * bias_d = fuse_bias ? (const float *) bias->data : nullptr;
float * dst_d = (float *) out->data;
cudaStream_t stream = ctx.stream();
GGML_ASSERT(src0->type == GGML_TYPE_F32);
GGML_ASSERT(out->type == GGML_TYPE_F32);
if (fuse_bias) {
GGML_ASSERT(bias->type == GGML_TYPE_F32);
GGML_ASSERT(ggml_is_contiguous(bias));
GGML_ASSERT(ggml_nelements(bias) == nr);
}
if (fuse_silu) {
ssm_conv_f32_cuda<true>(src0_d, src1_d, src0->nb[0], src0->nb[1], src0->nb[2], src1->nb[1], dst_d, out->nb[0], out->nb[1],
ssm_conv_f32_cuda<true>(src0_d, src1_d, bias_d, src0->nb[0], src0->nb[1], src0->nb[2], src1->nb[1], dst_d, out->nb[0], out->nb[1],
out->nb[2], nc, nr, n_t, n_s, stream);
} else {
ssm_conv_f32_cuda<false>(src0_d, src1_d, src0->nb[0], src0->nb[1], src0->nb[2], src1->nb[1], dst_d, out->nb[0], out->nb[1],
ssm_conv_f32_cuda<false>(src0_d, src1_d, bias_d, src0->nb[0], src0->nb[1], src0->nb[2], src1->nb[1], dst_d, out->nb[0], out->nb[1],
out->nb[2], nc, nr, n_t, n_s, stream);
}
}
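A scalar sketch of what the fused conv + bias + SiLU epilogue computes for one channel, with indexing simplified to a single contiguous window of d_conv - 1 + n_t samples; the kernels above handle the real strides and the shared-memory staging:

#include <cmath>

static float silu(float x) { return x / (1.0f + expf(-x)); }    // x * sigmoid(x)

// y[t] = silu( sum_j x[t + j] * w[j] + b ); the bias add and SiLU are the fused parts.
static void ssm_conv_bias_silu_ref(const float * x, const float * w, float b,
                                   float * y, int n_t, int d_conv, bool apply_silu) {
    for (int t = 0; t < n_t; ++t) {
        float s = 0.0f;
        for (int j = 0; j < d_conv; ++j) {
            s += x[t + j] * w[j];
        }
        s += b;
        y[t] = apply_silu ? silu(s) : s;
    }
}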

View File

@@ -1,3 +1,3 @@
#include "common.cuh"
void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * silu_dst = nullptr);
void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * bias_add_node = nullptr, ggml_tensor * silu_dst = nullptr);

View File

@@ -2,4 +2,5 @@
#include "../fattn-mma-f16.cuh"
DECL_FATTN_MMA_F16_CASE(192, 128, 1, 16);
DECL_FATTN_MMA_F16_CASE(576, 512, 1, 16);

View File

@@ -2,4 +2,5 @@
#include "../fattn-mma-f16.cuh"
DECL_FATTN_MMA_F16_CASE(320, 256, 1, 32);
DECL_FATTN_MMA_F16_CASE(576, 512, 1, 32);

View File

@@ -7,5 +7,6 @@ DECL_FATTN_MMA_F16_CASE(80, 80, 1, 8);
DECL_FATTN_MMA_F16_CASE(96, 96, 1, 8);
DECL_FATTN_MMA_F16_CASE(112, 112, 1, 8);
DECL_FATTN_MMA_F16_CASE(128, 128, 1, 8);
DECL_FATTN_MMA_F16_CASE(192, 128, 1, 8);
DECL_FATTN_MMA_F16_CASE(256, 256, 1, 8);
DECL_FATTN_MMA_F16_CASE(512, 512, 1, 8);

View File

@@ -2,4 +2,5 @@
#include "../fattn-mma-f16.cuh"
DECL_FATTN_MMA_F16_CASE(192, 128, 2, 16);
DECL_FATTN_MMA_F16_CASE(576, 512, 2, 16);

View File

@@ -2,4 +2,5 @@
#include "../fattn-mma-f16.cuh"
DECL_FATTN_MMA_F16_CASE(320, 256, 2, 32);
DECL_FATTN_MMA_F16_CASE(576, 512, 2, 32);

View File

@@ -7,5 +7,6 @@ DECL_FATTN_MMA_F16_CASE(80, 80, 2, 8);
DECL_FATTN_MMA_F16_CASE(96, 96, 2, 8);
DECL_FATTN_MMA_F16_CASE(112, 112, 2, 8);
DECL_FATTN_MMA_F16_CASE(128, 128, 2, 8);
DECL_FATTN_MMA_F16_CASE(192, 128, 2, 8);
DECL_FATTN_MMA_F16_CASE(256, 256, 2, 8);
DECL_FATTN_MMA_F16_CASE(512, 512, 2, 8);

View File

@@ -2,4 +2,5 @@
#include "../fattn-mma-f16.cuh"
DECL_FATTN_MMA_F16_CASE(192, 128, 4, 16);
DECL_FATTN_MMA_F16_CASE(576, 512, 4, 16);

View File

@@ -7,5 +7,6 @@ DECL_FATTN_MMA_F16_CASE(80, 80, 4, 8);
DECL_FATTN_MMA_F16_CASE(96, 96, 4, 8);
DECL_FATTN_MMA_F16_CASE(112, 112, 4, 8);
DECL_FATTN_MMA_F16_CASE(128, 128, 4, 8);
DECL_FATTN_MMA_F16_CASE(192, 128, 4, 8);
DECL_FATTN_MMA_F16_CASE(256, 256, 4, 8);
DECL_FATTN_MMA_F16_CASE(512, 512, 4, 8);

View File

@@ -7,5 +7,6 @@ DECL_FATTN_MMA_F16_CASE(80, 80, 8, 8);
DECL_FATTN_MMA_F16_CASE(96, 96, 8, 8);
DECL_FATTN_MMA_F16_CASE(112, 112, 8, 8);
DECL_FATTN_MMA_F16_CASE(128, 128, 8, 8);
DECL_FATTN_MMA_F16_CASE(192, 128, 8, 8);
DECL_FATTN_MMA_F16_CASE(256, 256, 8, 8);
DECL_FATTN_MMA_F16_CASE(512, 512, 8, 8);

View File

@@ -0,0 +1,5 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-tile.cuh"
DECL_FATTN_TILE_CASE(192, 128);

View File

@@ -0,0 +1,5 @@
// This file has been autogenerated by generate_cu_files.py, do not edit manually.
#include "../fattn-tile.cuh"
DECL_FATTN_TILE_CASE(320, 256);

View File

@@ -3,7 +3,10 @@
from glob import glob
import os
HEAD_SIZES_KQ = [40, 64, 72, 80, 96, 112, 128, 256, 512, 576]
HEAD_SIZES_KQ = [40, 64, 72, 80, 96, 112, 128, 192, 256, 320, 512, 576]
# DKQ -> DV override for asymmetric head dims.
HEAD_SIZES_V_OVERRIDE = {576: 512, 320: 256, 192: 128}
TYPES_KV = ["GGML_TYPE_F16", "GGML_TYPE_Q4_0", "GGML_TYPE_Q4_1", "GGML_TYPE_Q5_0", "GGML_TYPE_Q5_1", "GGML_TYPE_Q8_0", "GGML_TYPE_BF16"]
@@ -62,7 +65,7 @@ for filename in glob("*.cu"):
os.remove(filename)
for head_size_kq in HEAD_SIZES_KQ:
head_size_v = head_size_kq if head_size_kq != 576 else 512
head_size_v = HEAD_SIZES_V_OVERRIDE.get(head_size_kq, head_size_kq)
with open(f"fattn-tile-instance-dkq{head_size_kq}-dv{head_size_v}.cu", "w") as f:
f.write(SOURCE_FATTN_TILE.format(head_size_kq=head_size_kq, head_size_v=head_size_v))
@@ -84,13 +87,18 @@ for ncols in [8, 16, 32, 64]:
continue
if head_size_kq == 72:
continue
if head_size_kq == 512 and ncols2 not in (4, 8):
# Skip compilation of unused ncols2 values for niche head sizes:
if head_size_kq == 192 and ncols2 not in (8, 16): # MiMo-V2.5
continue
if head_size_kq != 576 and ncols2 in (16, 32):
if head_size_kq == 320 and ncols2 != 32: # Mistral Small 4
continue
if head_size_kq == 576 and ncols2 not in (4, 16, 32):
if head_size_kq == 512 and ncols2 not in (4, 8): # Gemma 4
continue
head_size_v = head_size_kq if head_size_kq != 576 else 512
if head_size_kq == 576 and ncols2 not in (4, 16, 32): # Deepseek, GLM 4.7 Flash
continue
if head_size_kq not in (192, 320, 576) and ncols2 in (16, 32):
continue
head_size_v = HEAD_SIZES_V_OVERRIDE.get(head_size_kq, head_size_kq)
f.write(SOURCE_FATTN_MMA_CASE.format(ncols1=ncols1, ncols2=ncols2, head_size_kq=head_size_kq, head_size_v=head_size_v))
for type in TYPES_MMQ:

View File

@@ -48,6 +48,7 @@
#define cublasSetMathMode(handle, mode) CUBLAS_STATUS_SUCCESS
#define cublasSetStream hipblasSetStream
#define cublasSgemm hipblasSgemm
#define cublasSgemmStridedBatched hipblasSgemmStridedBatched
#define cublasStatus_t hipblasStatus_t
#define cublasOperation_t hipblasOperation_t
#define cudaDevAttrCooperativeLaunch hipDeviceAttributeCooperativeLaunch
@@ -55,6 +56,7 @@
#define cudaDeviceDisablePeerAccess hipDeviceDisablePeerAccess
#define cudaDeviceEnablePeerAccess hipDeviceEnablePeerAccess
#define cudaDeviceGetAttribute hipDeviceGetAttribute
#define cudaDeviceGetPCIBusId hipDeviceGetPCIBusId
#define cudaDeviceProp hipDeviceProp_t
#define cudaDeviceSynchronize hipDeviceSynchronize
#define cudaError_t hipError_t

View File

@@ -32,6 +32,7 @@
#define cublasSetMathMode mublasSetMathMode
#define cublasSetStream mublasSetStream
#define cublasSgemm mublasSgemm
#define cublasSgemmStridedBatched mublasSgemmStridedBatched
#define cublasStatus_t mublasStatus_t
#define cublasOperation_t mublasOperation_t
#define cublasGetStatusString mublasGetStatusString
@@ -39,6 +40,7 @@
#define cudaDeviceCanAccessPeer musaDeviceCanAccessPeer
#define cudaDeviceDisablePeerAccess musaDeviceDisablePeerAccess
#define cudaDeviceEnablePeerAccess musaDeviceEnablePeerAccess
#define cudaDeviceGetPCIBusId musaDeviceGetPCIBusId
#define cudaDeviceProp musaDeviceProp
#define cudaDeviceSynchronize musaDeviceSynchronize
#define cudaError_t musaError_t

View File

@@ -22,7 +22,8 @@ message(STATUS "hexagon: using ${HEXAGON_SDK_ROOT} and ${HEXAGON_TOOLS_ROOT} for
include(${HEXAGON_SDK_ROOT}/build/cmake/hexagon_fun.cmake)
include(ExternalProject)
option(GGML_HEXAGON_HTP_DEBUG "ggml-hexagon: enable HTP debug output" OFF)
option(GGML_HEXAGON_HTP_DEBUG "ggml-hexagon: enable HTP debug output" OFF)
option(GGML_HEXAGON_FA_EXP2_HF "ggml-hexagon: use FP16 exp2 polynomial in FA softmax instead of F32 exp round-trip" OFF)
set(GGML_HEXAGON_HTP_CERT "$ENV{HEXAGON_HTP_CERT}" CACHE PATH "ggml-hexagon: enable HTP library signing using certificate")
set(GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE 128 CACHE STRING "ggml-hexagon: quantize group size (32, 64, or 128)")

View File

@@ -48,14 +48,16 @@ using intvec = std::vector<int>;
using uintvec = std::vector<unsigned int>;
using u32vec = std::vector<uint32_t>;
static size_t opt_ndev = 1;
static size_t opt_nhvx = 0; // use all
static int opt_arch = 0; // autodetect
static int opt_etm = 0;
static int opt_verbose = 0;
static int opt_profile = 0; // profiling mode (0-disabled, 1-basic, 2-pmu)
static int opt_hostbuf = 1; // hostbuf ON by default
static int opt_use_hmx = 1; // when set, enable HMX; when 0, use HVX only
static int opt_arch = 0; // autodetect
static size_t opt_ndev = 1;
static size_t opt_nhvx = 0; // use all
static int opt_use_hmx = 1; // when set, enable HMX; when 0, use HVX only
static size_t opt_vmem = HTP_OP_MAX_VMEM_DEFAULT; // max available va space for buffer mappings
static size_t opt_mbuf = 1ul * 1024 * 1024 * 1024; // max buffer size
static int opt_etm = 0;
static int opt_verbose = 0;
static int opt_profile = 0; // profiling mode (0-disabled, 1-basic, 2-pmu)
static int opt_hostbuf = 1; // hostbuf ON by default
// Default PMU events, if profiling with PMU (mode=2) is enabled
// See https://docs.qualcomm.com/doc/80-N2040-60/topic/pmu-events.html
@@ -66,6 +68,7 @@ static u32vec opt_pmu_evt { 0x3, 0x111, 0x100, 0x105, 0x240, 0x256, 0x7D, 0x8C }
static int opt_opstage = HTP_OPSTAGE_QUEUE | HTP_OPSTAGE_COMPUTE;
static int opt_opbatch = 1024; // max number of ops in a batch
static int opt_opqueue = 16; // max number of pending batches
static std::regex* opt_opfilter = NULL; // regex of ops to not claim
#define HEX_VERBOSE(...) \
@@ -110,7 +113,7 @@ static void ggml_hexagon_dump_op_supp(const std::string &sess_name, const struct
if (!opt_verbose) return;
op_desc desc(op);
GGML_LOG_DEBUG("ggml-hex: %s supports-op %s : %s : %s : %s : %s : %s : %s\n", sess_name.c_str(),
GGML_LOG_DEBUG("ggml-hex: %s supports-op %s: %s : %s : %s : %s : %s : %s\n", sess_name.c_str(),
ggml_op_desc(op), desc.names, desc.dims, desc.types, desc.strides, desc.buffs, supp ? "yes" : "no");
}
@@ -118,8 +121,6 @@ static void ggml_hexagon_dump_op_prof(const std::string &sess_name, const ggml_t
uint32_t op_usec, uint32_t op_cycles, const uint32_t pmu[]) {
if (!opt_profile) return;
op_desc desc(op);
char pmu_str[256] = "";
if (opt_profile > 1) {
static_assert(HTP_PROF_PMU_NCNT == 8, "current implementation assumes 8 PMU counters");
@@ -127,6 +128,7 @@ static void ggml_hexagon_dump_op_prof(const std::string &sess_name, const ggml_t
pmu[0], pmu[1], pmu[2], pmu[3], pmu[4], pmu[5], pmu[6], pmu[7]);
}
op_desc desc(op);
GGML_LOG_DEBUG("ggml-hex: %s profile-op %s: %s : %s : %s : %s : usec %u cycles %u%s\n", sess_name.c_str(),
ggml_op_desc(op), desc.names, desc.dims, desc.types, desc.strides, op_usec, op_cycles, pmu_str);
}
@@ -191,33 +193,30 @@ struct ggml_hexagon_shared_buffer {
bool mapped;
bool pinned;
void mmap(bool pinned = false) {
int err = fastrpc_mmap(sess->domain_id, this->fd, (void *) this->base, 0, this->size, FASTRPC_MAP_FD_DELAYED);
void mmap() {
fastrpc_map_flags flags = this->pinned ? FASTRPC_MAP_FD : FASTRPC_MAP_FD_DELAYED;
int err = fastrpc_mmap(sess->domain_id, this->fd, (void *) this->base, 0, this->size, flags);
if (err != 0) {
GGML_LOG_ERROR("ggml-hex: %s buffer mapping failed : domain_id %d size %zu fd %d error 0x%08x\n", sess->c_name(),
sess->domain_id, this->size, this->fd, (unsigned) err);
throw std::runtime_error("ggml-hex: fastrpc_mmap failed (see log for details)");
}
if (pinned) {
err = htp_iface_mmap(sess->handle, this->fd, this->size, pinned);
if (err != 0) {
GGML_LOG_ERROR("ggml-hex: %s buffer pinning failed : domain_id %d size %zu fd %d error 0x%08x\n", sess->c_name(),
sess->domain_id, this->size, this->fd, (unsigned) err);
throw std::runtime_error("ggml-hex: htp_iface_mmap failed (see log for details)");
}
}
this->mapped = true;
this->pinned = pinned;
HEX_VERBOSE("ggml-hex: %s mapped buffer: base %p size %zu fd %d pinned %u\n",
sess->c_name(), (void *) this->base, this->size, this->fd, pinned);
this->mapped = true;
}
void unmap() {
if (!this->mapped) return;
htp_iface_munmap(sess->handle, this->fd);
if (!this->pinned) {
// HTP might still hold a reference, tell it to drop it
htp_iface_munmap(sess->handle, this->fd);
}
fastrpc_munmap(sess->domain_id, this->fd, (void *) this->base, this->size);
HEX_VERBOSE("ggml-hex: %s unmapped buffer: base %p size %zu fd %d\n", sess->c_name(),
@@ -227,7 +226,7 @@ struct ggml_hexagon_shared_buffer {
this->fd = -1;
}
void alloc(size_t size, bool pinned = false) {
void alloc(size_t size) {
if (this->base) return;
this->base = (uint8_t *) rpcmem_alloc2(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, size);
@@ -245,8 +244,7 @@ struct ggml_hexagon_shared_buffer {
HEX_VERBOSE("ggml-hex: %s allocated buffer: base %p size %zu fd %d pinned %d\n", sess->c_name(),
(void *) this->base, this->size, this->fd, (int) pinned);
mmap(pinned);
mmap();
}
void free() {
@@ -262,15 +260,14 @@ struct ggml_hexagon_shared_buffer {
}
ggml_hexagon_shared_buffer(ggml_hexagon_session * sess, size_t size, bool pinned = false) {
size += 4 * 1024; // extra page for padding
this->sess = sess;
this->size = 0;
this->base = nullptr;
this->fd = -1;
this->mapped = false;
this->pinned = pinned;
alloc(size, pinned);
alloc(size);
}
~ggml_hexagon_shared_buffer() {
@@ -1475,6 +1472,7 @@ static ggml_backend_buffer_t ggml_backend_hexagon_buffer_type_alloc_buffer(
ggml_backend_buffer_type_t buffer_type, size_t size) {
auto sess = static_cast<ggml_backend_hexagon_buffer_type_context *>(buffer_type->context)->sess;
try {
size += 4 * 1024; // guard page
ggml_hexagon_shared_buffer * sbuf = new ggml_hexagon_shared_buffer(sess, size);
return ggml_backend_buffer_init(buffer_type, ggml_backend_hexagon_buffer_interface, sbuf, size);
} catch (const std::exception & exc) {
@@ -1487,6 +1485,7 @@ static ggml_backend_buffer_t ggml_backend_hexagon_repack_buffer_type_alloc_buffe
ggml_backend_buffer_type_t buffer_type, size_t size) {
auto sess = static_cast<ggml_backend_hexagon_buffer_type_context *>(buffer_type->context)->sess;
try {
size += 4 * 1024; // guard page
ggml_hexagon_shared_buffer * sbuf = new ggml_hexagon_shared_buffer(sess, size);
return ggml_backend_buffer_init(buffer_type, ggml_backend_hexagon_buffer_interface, sbuf, size);
} catch (const std::exception & exc) {
@@ -1505,7 +1504,7 @@ static size_t ggml_backend_hexagon_buffer_type_get_alloc_size(ggml_backend_buffe
}
static size_t ggml_backend_hexagon_buffer_type_get_max_size(ggml_backend_buffer_type_t buffer_type) {
return 1UL * 1024 * 1024 * 1024; // 1GB per buffer
return opt_mbuf; // typically 1GB per buffer
GGML_UNUSED(buffer_type);
}
@@ -1573,14 +1572,14 @@ struct ggml_hexagon_opbatch {
d_map.clear();
}
ggml_hexagon_opbatch(ggml_hexagon_session *sess, size_t batch_size) {
ggml_hexagon_opbatch(ggml_hexagon_session *sess, size_t batch_size, size_t max_vmem) {
this->sess = sess;
n_bufs_max = HTP_OP_MAX_BUFS;
n_ops_max = batch_size;
n_tens_max = n_ops_max + n_ops_max * HTP_OP_MAX_INPUTS;
b_vmem_max = HTP_OP_MAX_VMEM;
b_vmem_max = max_vmem;
ops.resize(n_ops_max);
@@ -1592,6 +1591,9 @@ struct ggml_hexagon_opbatch {
t_map.reserve(n_tens_max);
d_map.reserve(n_tens_max);
GGML_LOG_INFO("ggml-hex: %s op batching: n-bufs %u n-tensors %u n-ops %u vmem %zu\n",
sess->c_name(), n_bufs_max, n_tens_max, n_ops_max, b_vmem_max);
reset();
}
@@ -1925,6 +1927,8 @@ void ggml_hexagon_session::flush_batch() {
// Bump pending flag (cleared in the session::flush once we get the response)
this->op_pending++; // atomic inc
HEX_VERBOSE("ggml-hex: %s queue-opbatch: %p size %u\n", this->c_name(), dbuf.ptr, dbuf.size);
int err = dspqueue_write(this->queue, 0, 1, &dbuf, sizeof(req), (const uint8_t*) &req, DSPQUEUE_TIMEOUT);
if (err != 0) {
GGML_ABORT("ggml-hex: %s dspqueue_write failed: 0x%08x\n", this->c_name(), (unsigned) err);
@@ -1944,6 +1948,35 @@ void ggml_hexagon_session::flush(bool all) {
flush_pending(all);
}
static size_t ggml_hexagon_measure_max_vmem(ggml_hexagon_session *sess) {
// Allocate a bunch of pinned buffers until failure.
// This is kind of expensive but handy for figuring out exactly how much we can mmap on a specific device.
// Typically we're going to allocate all/most of these buffers anyway for the model weights.
std::vector<ggml_hexagon_shared_buffer *> sbufs;
const size_t MiB = 1024 * 1024;
const size_t GiB = MiB * 1024;
size_t vmem = 0;
size_t step = 256u * MiB;
try {
sbufs.push_back(new ggml_hexagon_shared_buffer(sess, GiB, true)); vmem += GiB;
sbufs.push_back(new ggml_hexagon_shared_buffer(sess, GiB, true)); vmem += GiB;
sbufs.push_back(new ggml_hexagon_shared_buffer(sess, GiB, true)); vmem += GiB;
while (1) {
sbufs.push_back(new ggml_hexagon_shared_buffer(sess, step, true));
vmem += step;
}
} catch (...) { }
for (auto b : sbufs) { delete b; }
return vmem - step; // back off one step to account for overhead from internal mappings
}
void ggml_hexagon_session::allocate(int dev_id) noexcept(false) {
this->valid_session = false;
this->valid_handle = false;
@@ -1957,7 +1990,7 @@ void ggml_hexagon_session::allocate(int dev_id) noexcept(false) {
this->op_pending = 0;
GGML_LOG_INFO("ggml-hex: allocating new session: %s\n", this->name.c_str());
GGML_LOG_DEBUG("ggml-hex: %s allocating new session\n", this->name.c_str());
domain * my_domain = get_domain(this->domain_id);
if (my_domain == NULL) {
@@ -2033,9 +2066,6 @@ void ggml_hexagon_session::allocate(int dev_id) noexcept(false) {
this->valid_handle = true;
GGML_LOG_INFO("ggml-hex: new session: %s : session-id %d domain-id %d uri %s handle 0x%lx\n", this->name.c_str(),
this->session_id, this->domain_id, session_uri, (unsigned long) this->handle);
// Enable FastRPC QoS mode
{
struct remote_rpc_control_latency l;
@@ -2047,6 +2077,9 @@ void ggml_hexagon_session::allocate(int dev_id) noexcept(false) {
}
}
GGML_LOG_INFO("ggml-hex: %s new session : session-id %d domain-id %d uri %s handle 0x%lx\n", this->c_name(),
this->session_id, this->domain_id, session_uri, (unsigned long) this->handle);
const size_t req_q_size = (sizeof(htp_opbatch_req) * opt_opqueue * 2) + 1024;
const size_t rsp_q_size = (sizeof(htp_opbatch_rsp) * opt_opqueue * 2) + 1024;
@@ -2091,13 +2124,19 @@ void ggml_hexagon_session::allocate(int dev_id) noexcept(false) {
}
// Allocate buffers and state for op batching
this->op_batch = new ggml_hexagon_opbatch(this, opt_opbatch);
this->op_queue = new ggml_hexagon_opqueue(this, opt_opbatch, opt_opqueue);
// Start processing op batch requests
err = htp_iface_start(this->handle, dev_id, this->queue_id, opt_nhvx, opt_use_hmx);
if (!opt_vmem) {
opt_vmem = ggml_hexagon_measure_max_vmem(this);
GGML_LOG_INFO("ggml-hex: %s measured max vmem %zu\n", this->c_name(), opt_vmem);
}
this->op_batch = new ggml_hexagon_opbatch(this, opt_opbatch, opt_vmem);
// Start dspqueue/opbatch processing
err = htp_iface_start(this->handle, dev_id, this->queue_id, opt_nhvx, opt_use_hmx, opt_vmem);
if (err != 0) {
GGML_LOG_ERROR("ggml-hex: failed to start session: 0x%08x\n", (unsigned) err);
GGML_LOG_ERROR("ggml-hex: %s failed to start session: 0x%08x\n", this->c_name(), (unsigned) err);
throw std::runtime_error("ggml-hex: iface start failed (see log for details)");
}
this->valid_iface = true;
@@ -2108,17 +2147,17 @@ void ggml_hexagon_session::release() noexcept(true) {
int err;
delete this->op_batch;
delete this->op_queue;
// Stop the DSP-side service and close the queue
if (this->valid_iface) {
// Stop dspqueue/opbatch processing
err = htp_iface_stop(this->handle);
if (err != 0) {
GGML_ABORT("ggml-hex: htp_iface_stop failed: 0x%08x\n", (unsigned) err);
}
}
delete this->op_batch;
delete this->op_queue;
if (opt_etm) {
err = htp_iface_etm(this->handle, 0);
if (err != 0) {
@@ -2215,14 +2254,65 @@ static bool ggml_hexagon_supported_flash_attn_ext(const struct ggml_hexagon_sess
return false;
}
if (dst->ne[2] != 1 || dst->ne[3] != 1) {
// FA during prompt still needs work
if (dst->ne[3] != 1) {
return false;
}
return true;
}
static bool ggml_hexagon_supported_gated_delta_net(const struct ggml_hexagon_session * sess, const struct ggml_tensor * op) {
const struct ggml_tensor * q = op->src[0];
const struct ggml_tensor * k = op->src[1];
const struct ggml_tensor * v = op->src[2];
const struct ggml_tensor * g = op->src[3];
const struct ggml_tensor * beta = op->src[4];
const struct ggml_tensor * state = op->src[5];
const struct ggml_tensor * dst = op;
if (!q || !k || !v || !g || !beta || !state) {
return false;
}
if (q->type != GGML_TYPE_F32 || k->type != GGML_TYPE_F32 || v->type != GGML_TYPE_F32 ||
g->type != GGML_TYPE_F32 || beta->type != GGML_TYPE_F32 || state->type != GGML_TYPE_F32 ||
dst->type != GGML_TYPE_F32) {
return false;
}
if (!ggml_is_contiguous_rows(q) || !ggml_is_contiguous_rows(k) || !ggml_is_contiguous_rows(v) ||
!ggml_is_contiguous(g) || !ggml_is_contiguous(beta) || !ggml_is_contiguous(state) ||
!ggml_is_contiguous(dst)) {
return false;
}
const int64_t S_v = v->ne[0];
const int64_t H = v->ne[1];
const int64_t n_tokens = v->ne[2];
const int64_t n_seqs = v->ne[3];
if (S_v <= 0 || S_v > 128 || H <= 0 || n_tokens <= 0 || n_seqs <= 0) {
return false;
}
if (q->ne[0] != S_v || k->ne[0] != S_v || q->ne[1] <= 0 || k->ne[1] <= 0 ||
q->ne[2] != n_tokens || k->ne[2] != n_tokens || q->ne[3] <= 0 || k->ne[3] <= 0 ||
(n_seqs % q->ne[3]) != 0 || (n_seqs % k->ne[3]) != 0) {
return false;
}
if ((g->ne[0] != 1 && g->ne[0] != S_v) || beta->ne[0] != 1) {
return false;
}
if (ggml_nelements(state) != S_v * S_v * H * n_seqs) {
return false;
}
if (dst->ne[0] != S_v * H || dst->ne[1] != n_tokens * n_seqs + S_v * n_seqs) {
return false;
}
GGML_UNUSED(sess);
return true;
}
static bool ggml_hexagon_supported_mul_mat(const struct ggml_hexagon_session * sess, const struct ggml_tensor * dst) {
const struct ggml_tensor * src0 = dst->src[0];
const struct ggml_tensor * src1 = dst->src[1];
@@ -2382,8 +2472,8 @@ static bool ggml_hexagon_supported_unary(const struct ggml_hexagon_session * ses
return false;
}
// TODO: add support for non-contigiuos tensors
if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(dst)) {
// dst must be contiguous; src0 may be non-contiguous
if (!ggml_is_contiguous(dst)) {
return false;
}
@@ -2739,32 +2829,34 @@ static void ggml_backend_hexagon_free(ggml_backend_t backend) {
static htp_op_code op_remap_to_htp(const ggml_tensor * t) {
switch (t->op) {
case GGML_OP_FLASH_ATTN_EXT: return HTP_OP_FLASH_ATTN_EXT;
case GGML_OP_MUL_MAT: return HTP_OP_MUL_MAT;
case GGML_OP_MUL_MAT_ID: return HTP_OP_MUL_MAT_ID;
case GGML_OP_MUL: return HTP_OP_MUL;
case GGML_OP_ADD: return HTP_OP_ADD;
case GGML_OP_ADD_ID: return HTP_OP_ADD_ID;
case GGML_OP_SUB: return HTP_OP_SUB;
case GGML_OP_DIV: return HTP_OP_DIV;
case GGML_OP_CPY: return HTP_OP_CPY;
case GGML_OP_CONT: return HTP_OP_CPY;
case GGML_OP_GET_ROWS: return HTP_OP_GET_ROWS;
case GGML_OP_SET_ROWS: return HTP_OP_SET_ROWS;
case GGML_OP_SUM_ROWS: return HTP_OP_SUM_ROWS;
case GGML_OP_ARGSORT: return HTP_OP_ARGSORT;
case GGML_OP_RMS_NORM: return HTP_OP_RMS_NORM;
case GGML_OP_SCALE: return HTP_OP_SCALE;
case GGML_OP_SQR: return HTP_OP_SQR;
case GGML_OP_SQRT: return HTP_OP_SQRT;
case GGML_OP_SOFT_MAX: return HTP_OP_SOFTMAX;
case GGML_OP_SSM_CONV: return HTP_OP_SSM_CONV;
case GGML_OP_ROPE: return HTP_OP_ROPE;
case GGML_OP_REPEAT: return HTP_OP_REPEAT;
case GGML_OP_CUMSUM: return HTP_OP_CUMSUM;
case GGML_OP_FILL: return HTP_OP_FILL;
case GGML_OP_DIAG: return HTP_OP_DIAG;
case GGML_OP_SOLVE_TRI: return HTP_OP_SOLVE_TRI;
case GGML_OP_FLASH_ATTN_EXT: return HTP_OP_FLASH_ATTN_EXT;
case GGML_OP_MUL_MAT: return HTP_OP_MUL_MAT;
case GGML_OP_MUL_MAT_ID: return HTP_OP_MUL_MAT_ID;
case GGML_OP_MUL: return HTP_OP_MUL;
case GGML_OP_ADD: return HTP_OP_ADD;
case GGML_OP_ADD_ID: return HTP_OP_ADD_ID;
case GGML_OP_SUB: return HTP_OP_SUB;
case GGML_OP_DIV: return HTP_OP_DIV;
case GGML_OP_CPY: return HTP_OP_CPY;
case GGML_OP_CONT: return HTP_OP_CPY;
case GGML_OP_GET_ROWS: return HTP_OP_GET_ROWS;
case GGML_OP_SET_ROWS: return HTP_OP_SET_ROWS;
case GGML_OP_SUM_ROWS: return HTP_OP_SUM_ROWS;
case GGML_OP_ARGSORT: return HTP_OP_ARGSORT;
case GGML_OP_L2_NORM: return HTP_OP_L2_NORM;
case GGML_OP_RMS_NORM: return HTP_OP_RMS_NORM;
case GGML_OP_SCALE: return HTP_OP_SCALE;
case GGML_OP_SQR: return HTP_OP_SQR;
case GGML_OP_SQRT: return HTP_OP_SQRT;
case GGML_OP_SOFT_MAX: return HTP_OP_SOFTMAX;
case GGML_OP_SSM_CONV: return HTP_OP_SSM_CONV;
case GGML_OP_GATED_DELTA_NET: return HTP_OP_GATED_DELTA_NET;
case GGML_OP_ROPE: return HTP_OP_ROPE;
case GGML_OP_REPEAT: return HTP_OP_REPEAT;
case GGML_OP_CUMSUM: return HTP_OP_CUMSUM;
case GGML_OP_FILL: return HTP_OP_FILL;
case GGML_OP_DIAG: return HTP_OP_DIAG;
case GGML_OP_SOLVE_TRI: return HTP_OP_SOLVE_TRI;
case GGML_OP_UNARY:
switch (ggml_get_unary_op(t)) {
case GGML_UNARY_OP_SILU: return HTP_OP_UNARY_SILU;
@@ -2997,8 +3089,8 @@ static struct ggml_backend_i hexagon_backend_i = {
/* .free = */ ggml_backend_hexagon_free,
/* .set_tensor_async = */ NULL,
/* .get_tensor_async = */ NULL,
/* .get_tensor_2d_async = */ NULL,
/* .set_tensor_2d_async = */ NULL,
/* .get_tensor_2d_async = */ NULL,
/* .cpy_tensor_async = */ NULL,
/* .synchronize = */ ggml_backend_hexagon_synchronize,
/* .graph_plan_create = */ NULL,
@@ -3215,6 +3307,10 @@ static bool ggml_backend_hexagon_device_supports_op(ggml_backend_dev_t dev, cons
supp = ggml_hexagon_supported_add_id(sess, op);
break;
case GGML_OP_L2_NORM:
supp = ggml_hexagon_supported_unary(sess, op);
break;
case GGML_OP_RMS_NORM:
case GGML_OP_SCALE:
supp = ggml_hexagon_supported_unary(sess, op);
@@ -3298,6 +3394,10 @@ static bool ggml_backend_hexagon_device_supports_op(ggml_backend_dev_t dev, cons
supp = ggml_hexagon_supported_ssm_conv(sess, op);
break;
case GGML_OP_GATED_DELTA_NET:
supp = ggml_hexagon_supported_gated_delta_net(sess, op);
break;
case GGML_OP_CUMSUM:
supp = ggml_hexagon_supported_cumsum(sess, op);
break;
@@ -3380,21 +3480,6 @@ struct ggml_hexagon_registry {
ggml_hexagon_registry::ggml_hexagon_registry(ggml_backend_reg_t reg) {
GGML_LOG_INFO("ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev %zu\n", opt_ndev);
if (!opt_arch) {
int err = get_hex_arch_ver(CDSP_DOMAIN_ID, &opt_arch);
if (err != 0) {
GGML_LOG_ERROR("ggml-hex: failed to query HTP version (err %d) defaulting to v73\n", err);
opt_arch = 73;
}
}
#if defined(__ANDROID__)
if (opt_arch < 75) {
opt_ndev = 1;
GGML_LOG_WARN("ggml-hex: forcing ndev to 1 for SoCs archs lower than v75.\n");
}
#endif
GGML_LOG_INFO("ggml-hex: Hexagon Arch version v%d\n", opt_arch);
// Create devices / sessions
@@ -3480,32 +3565,67 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
static_assert((unsigned int) HTP_TYPE_IQ4_NL == (unsigned int) GGML_TYPE_IQ4_NL,
"please update hexagon_type to match ggml_type");
const char * str_verbose = getenv("GGML_HEXAGON_VERBOSE");
const char * str_hostbuf = getenv("GGML_HEXAGON_HOSTBUF");
const char * str_opstage = getenv("GGML_HEXAGON_OPSTAGE");
const char * str_opbatch = getenv("GGML_HEXAGON_OPBATCH");
const char * str_opqueue = getenv("GGML_HEXAGON_OPQUEUE");
const char * str_opfilter= getenv("GGML_HEXAGON_OPFILTER");
const char * str_profile = getenv("GGML_HEXAGON_PROFILE");
const char * str_etm = getenv("GGML_HEXAGON_ETM");
const char * str_nhvx = getenv("GGML_HEXAGON_NHVX");
const char * str_use_hmx = getenv("GGML_HEXAGON_USE_HMX");
const char * str_ndev = getenv("GGML_HEXAGON_NDEV");
const char * str_arch = getenv("GGML_HEXAGON_ARCH");
const char * str_verbose = getenv("GGML_HEXAGON_VERBOSE");
const char * str_hostbuf = getenv("GGML_HEXAGON_HOSTBUF");
const char * str_opstage = getenv("GGML_HEXAGON_OPSTAGE");
const char * str_opbatch = getenv("GGML_HEXAGON_OPBATCH");
const char * str_opqueue = getenv("GGML_HEXAGON_OPQUEUE");
const char * str_opfilter = getenv("GGML_HEXAGON_OPFILTER");
const char * str_profile = getenv("GGML_HEXAGON_PROFILE");
const char * str_etm = getenv("GGML_HEXAGON_ETM");
const char * str_nhvx = getenv("GGML_HEXAGON_NHVX");
const char * str_use_hmx = getenv("GGML_HEXAGON_USE_HMX");
const char * str_ndev = getenv("GGML_HEXAGON_NDEV");
const char * str_arch = getenv("GGML_HEXAGON_ARCH");
const char * str_vmem = getenv("GGML_HEXAGON_VMEM");
const char * str_mbuf = getenv("GGML_HEXAGON_MBUF");
// Init Arch first since it affects other defaults
if (!str_arch) {
int err = get_hex_arch_ver(CDSP_DOMAIN_ID, &opt_arch);
if (err != 0) {
GGML_LOG_ERROR("ggml-hex: failed to query HTP version (err %d) defaulting to v73\n", err);
opt_arch = 73;
}
} else {
if (str_arch[0] == 'v' || str_arch[0] == 'V') {
str_arch++;
}
opt_arch = strtoul(str_arch, NULL, 0);
}
size_t MiB = 1024 * 1024;
// Update vmem default
opt_vmem = opt_arch >= 75 ? HTP_OP_MAX_VMEM_DEFAULT : 3000 * MiB;
auto RE_ICASE = std::regex_constants::icase;
opt_opfilter = str_opfilter ? new std::regex(str_opfilter, RE_ICASE) : NULL;
opt_verbose = str_verbose ? atoi(str_verbose) : 0;
opt_hostbuf = str_hostbuf ? atoi(str_hostbuf) : opt_hostbuf;
opt_opstage = str_opstage ? strtoul(str_opstage, NULL, 0) : opt_opstage;
opt_opbatch = str_opbatch ? strtoul(str_opbatch, NULL, 0) : opt_opbatch;
opt_opqueue = str_opqueue ? strtoul(str_opqueue, NULL, 0) : opt_opqueue;
opt_etm = str_etm ? atoi(str_etm) : 0;
opt_nhvx = str_nhvx ? strtoul(str_nhvx, NULL, 0) : opt_nhvx;
opt_use_hmx = str_use_hmx ? atoi(str_use_hmx) : opt_use_hmx;
opt_ndev = str_ndev ? strtoul(str_ndev, NULL, 0) : opt_ndev;
opt_hostbuf = str_hostbuf ? atoi(str_hostbuf) : opt_hostbuf;
opt_opfilter = str_opfilter ? new std::regex(str_opfilter, RE_ICASE) : NULL;
opt_verbose = str_verbose ? atoi(str_verbose) : 0;
opt_hostbuf = str_hostbuf ? atoi(str_hostbuf) : opt_hostbuf;
opt_opstage = str_opstage ? strtoul(str_opstage, NULL, 0) : opt_opstage;
opt_opbatch = str_opbatch ? strtoul(str_opbatch, NULL, 0) : opt_opbatch;
opt_opqueue = str_opqueue ? strtoul(str_opqueue, NULL, 0) : opt_opqueue;
opt_profile = str_profile ? atoi(str_profile) : 0;
opt_etm = str_etm ? atoi(str_etm) : 0;
opt_nhvx = str_nhvx ? strtoul(str_nhvx, NULL, 0) : opt_nhvx;
opt_use_hmx = str_use_hmx ? atoi(str_use_hmx) : opt_use_hmx;
opt_ndev = str_ndev ? strtoul(str_ndev, NULL, 0) : opt_ndev;
opt_hostbuf = str_hostbuf ? atoi(str_hostbuf) : opt_hostbuf;
opt_mbuf = str_mbuf ? strtoul(str_mbuf, NULL, 0) * MiB : opt_mbuf;
opt_vmem = str_vmem ? strtoul(str_vmem, NULL, 0) * MiB : opt_vmem;
if (opt_ndev > GGML_HEXAGON_MAX_SESSIONS) {
opt_ndev = GGML_HEXAGON_MAX_SESSIONS;
}
#if defined(__ANDROID__)
if (opt_arch < 75) {
opt_ndev = 1;
GGML_LOG_WARN("ggml-hex: forcing ndev to 1 for SoCs archs lower than v75.\n");
}
#endif
if (str_profile) {
opt_pmu_evt = [&]() -> std::vector<uint32_t> {
@@ -3520,17 +3640,6 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
vec_to_str<uint32_t, 16>(opt_pmu_evt).c_str());
}
if (opt_ndev > GGML_HEXAGON_MAX_SESSIONS) {
opt_ndev = GGML_HEXAGON_MAX_SESSIONS;
}
if (str_arch) {
if (str_arch[0] == 'v') {
str_arch++;
}
opt_arch = strtoul(str_arch, NULL, 0);
}
reg->context = new ggml_hexagon_registry(reg);
}

Some files were not shown because too many files have changed in this diff