Compare commits

...

139 Commits
b9331 ... b9470

Author SHA1 Message Date
Max Krasnyansky
8f7f3bf141 hexagon: MUL_MAT, MUL_MAT_ID, FLASH_ATTN and GDN cleanup and optimizations for latest models (#23989)
* hex-mm: initial support for F32 * F32 -> F32 matmuls

* hex-rms-norm: fix src1 stride use in fused rms_norm_mul

* hex-ops: clear spad pointers in the ops that clober it

This fixes an odd case where fused rms-norm-mul was failing but only in qwen3.5-2B and only at searth op-bath sizes.

* hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX

Decided to use Q4_0 * F32 -> F32 matmul for this.
Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16.
Super simple and pretty efficient.

* hmx-mm: route f16 2D matmuls through the same kernel used for all other types

* hmx-mm: re-introduce pipelined vs non-pipelined mode that we used to have but is much more generic way

This update futher improves matmul performance and at the same time removes most of the redudant logic
we had in different paths.

* hmx-fa: slighlty improved pipeline simimar to matmul updates

* hmx-mm: initial version of MAT_MUL_ID support for HMX

* hmx-mm: fixed mxfp4 handling for MUL_MAT_ID

* hex-gdn: optimize GATED_DELTA_NET

DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :)

* hmx-mm: missed one more case where we can use fastmod

* hexagon: update DCVS settings for a slight perf bump

* hmx-fa: use fastdiv in hmx-flash-attn

* hmx-fa: precompute slope values to avoid disrupting the inner loop

* hvx-utils/fa: new HVX helpers for powf and logf and using those to speed up FA alibi

* hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty

* hex-fa: correctly fallback to HVX if we have sinks or the dims are not quite right
2026-06-01 23:40:08 -07:00
Todor Boinovski
d178a11818 hexagon: add gelu_quick (#24007) 2026-06-01 23:19:07 -07:00
Pascal
354ebac8cb server: real-time reasoning interruption via control endpoint (#23971)
* server: real-time reasoning interruption via control endpoint

Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.

* ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.

* ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.

* server: target reasoning control by completion id

Address @ngxson review on the control endpoint.

Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.

* ui: target reasoning control by completion id

Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.

* server: address review from @ngxson

Move the control fields into task_params and drop the redundant
comments on the control path.

* server: document the reasoning control endpoint

* Update tools/ui/src/lib/types/database.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: rename cmplId to completionId

Per @allozaur review, clearer name for the streamed completion id.

* ui: wire completion id capture through the agentic flow

The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.

* ui: target reasoning control model from the message

The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-06-02 07:26:20 +02:00
Anav Prasad
1fd5f48037 clean up unused variables warnings (#23975) 2026-06-02 10:38:37 +08:00
lhez
210a6570ce opencl: fix compiler warnings for non-adreno path (#23922)
* opencl: fix compiler warnings for non-adreno path

* opencl: fix const cast warning
2026-06-01 19:15:09 -07:00
Masashi Yoshimura
b8275a8acc revert to using global_invocation_id for cpy shader (#23955) 2026-06-01 16:59:06 -07:00
Georgi Gerganov
5dcb711666 speculative : fix n_outputs_max and remove draft-simple auto-enable (#23988)
* speculative : add common_speculative_n_max helper function

Extract the speculative max-draft-size logic from server_n_outputs_max
into a reusable common_speculative_n_max() function in common/speculative.

Assisted-by: llama.cpp:local pi

* cont : draft context always has n_parallel outputs

* llama : log n_outputs_max

* speculative : remove draft-simple auto-enable

* ci : enable server tests on PRs
2026-06-01 22:26:58 +03:00
Christian Hoener zu Siederdissen
5aa3a64596 nix : add nix-nodejs facilities to build Web UI (#23846)
* nix: add nix-nodejs facilities to build Web UI

Build the Web UI locally using standard Nix systems for building NodeJS
packages.

- Create derivation for the web UI
- npm dependencies are imported via buildNodeModules. Does not require
  setting any shasum.
- Copy build artifacts to the correct folders.
- Prevents having to download from huggingface.co

Fixes #23067

* nix: simplify webui derivation using LLAMA_UI_OUT_DIR

- Move npm build to installPhase with LLAMA_UI_OUT_DIR=$out to write
  output directly to the Nix store
- Copy built assets to tools/ui/dist (source tree) instead of
  build/tools/ui/dist so CMake's copy_src_dist() finds them
2026-06-01 14:01:26 -04:00
shaofeiqi
27d9ed8397 opencl: add basic support for q5_0 and q5_1 (#23548)
* opencl: add general q5_0 support

* opencl: add general q5_1 support

* opencl: support non-uniform workgrp size

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2026-06-01 10:06:50 -07:00
Adrien Gallouët
335abed17d vendor : update cpp-httplib to 0.46.1 (#23980)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-06-01 19:40:10 +03:00
Aman Gupta
de6f727aae llama: limit max outputs of llama_context (#23861)
* llama: save more VRAM by reserving n_outputs == n_seqs when possible

* add n_outputs_per_seq

* move n_outputs_max to server-context

* change ubatch to batch everywhere
2026-06-01 18:01:38 +03:00
Shrivas Shankar
95b8b8ec1a metal: template GLU kernels to support f16/f32 (#23882)
Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.
2026-06-01 15:40:28 +03:00
Jeff Bolz
55ac0909e5 vulkan: don't hold the device mutex while compiling pipelines (#23641)
* vulkan: don't hold the device mutex while compiling pipelines

We need to hold a lock while we traverse all pipelines and lazily initialize
them, but we don't need to hold it while the pipeline is being compiled. And
it doesn't need to be the same lock as the device mutex. We call load_shaders
each time a pipeline is needed, so we only need to compile that one pipeline
(and, for example, don't want to end up compiling a pipeline that another
thread should be compiling).

* remove 'needed'
2026-06-01 14:04:01 +02:00
Winston Ma
bef69f1306 vulkan: reduce host memory lock contention (#23376)
* vulkan: reduces lock contention

* replace unique_lock with lock_guard
2026-06-01 14:03:32 +02:00
o7si
5aba5364d9 vocab: add normalizer.lowercase support to WPM (#23899)
* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer)

* vocab : add normalizer.lowercase support to WPM

* vocab : default normalizer.lowercase to false for whitespace pre-tokenizer
2026-06-01 14:26:47 +03:00
Johannes Gäßler
8e6fff84de TP: quantized KV cache support (#23792)
* TP: quantized KV cache support

* fix partial view

* remove overly strict assert
2026-06-01 12:30:10 +02:00
Georgi Gerganov
02a57017f6 security : disable private disclosures (#23963) 2026-06-01 13:14:12 +03:00
Junwon Hwang
48b88c3b00 model: Add EXAONE 4.5 implementations (#21733)
* Add EXAONE 4.5 and Add GQA for MMproj

* mtmd: EXAONE 4.5 vision markers and projector path

EXAONE 4.5 uses <vision> and </vision> for image boundaries; Qwen keeps
<|vision_start|> and <|vision_end|>.

Route EXAONE 4.5 through the Qwen2.5-VL-style encode path (window attention
pattern, optional mmproj input norm). Update exaone4_5 projector weights and
convert_hf_to_gguf for mmproj export.

* mtmd: load EXAONE4 nextn tensors correctly

Align EXAONE4 tensor registration with EXAONE_MOE for NextN/MTP slots and avoid skip-flag propagation on duplicated rope_freqs so model loading succeeds for EXAONE 4.5 GGUF.

* Minor fixes

* Address PR feedback

* Address PR feedback

* Fix EXAONE after merge

* Fix EXAONE 4.5 conversion

* Address PR feedback

* Refactor EXAONE 4.5 conversion

* Address PR feedback

* Fix unintended deletion

* Minor fix

---------

Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai>
2026-06-01 11:48:53 +02:00
Matt Corallo
19620004f5 vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (#23056)
Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.

mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.

On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The futher switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.
2026-06-01 11:46:48 +02:00
Winston Ma
f8c0a19d46 vulkan: Removed unused functions (#23175) 2026-06-01 11:46:23 +02:00
Aldehir Rojas
5254a7994d common : support manually triggering the reasoning budget end sequence (#23949) 2026-06-01 11:37:11 +02:00
Georgi Gerganov
e22b0de60d ci : add missing Linux label to cpu-x64-high-perf runner (#23958)
Fixes: https://github.com/ggml-org/llama.cpp/pull/23927#discussion_r3332213086

The cpu-x64-high-perf job was missing the Linux label in its runs-on
specification, causing the runner to not be discovered. All other
self-hosted Linux jobs include this label.

Assisted-by: llama.cpp:local pi
2026-06-01 10:39:59 +03:00
Neo Zhang
a51142497a [SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (#23812)
* support Q4_1, Q5_0, Q5_1

* update ut case
2026-06-01 09:53:53 +03:00
Neo Zhang
4162522688 [SYCL] Add more types in GET_ROWS OP (#23710)
* add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP

* correct the link
2026-06-01 09:53:04 +03:00
Neo Zhang
44e211cecf sycl : Optimize Q3_K mul_mat by reorder (#23725) 2026-06-01 09:50:55 +03:00
Eve
af6528e6df ci: remove redundant or duplicate jobs (#23927)
* remove redundant apple job

openvino gpu and cpu test can share the same build and machine

Update build-rpc.yml

Update build-openvino.yml

cpu any doesnt make sense as we have an arm job already, so do high perf on both x86 and arm

remove duplicate x86 vulkan

combine backend sampling

Update server.yml

run server on arm as windows is x86

* emdawn on one machine only

* fix openvino, remove cpu tag as we dont have many x64 machines with that tag
2026-06-01 06:32:17 +03:00
Eric Zhang
6f165c1c64 server : handle If-None-Match weak ETags (#23916) 2026-05-31 16:21:08 -05:00
Georgi Gerganov
399739d5c5 ci : limit trigger paths for the CPU workflow (#23938) 2026-05-31 19:02:47 +03:00
o7si
d4c8e2c29c vocab : add tokenizer support for jina-embeddings-v2-base-zh (#18756)
* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer)

* lowercase defaults to true

* type fix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-31 12:37:35 +02:00
Eric Zhang
3292da09f6 ui: fix ETag truncation with MSVC compiler (#23917) 2026-05-31 11:21:23 +02:00
Vladislav
e6123e2080 docs : update ZenDNN docs for Q8 support (#23791)
* docs zendnn added information about Q8 support

* docs zendnn rm unnecessary data

* docs update, links to ZenDNN docs provided

* docs zenDNN update: clarified explanation

* docs zenDNN update: one more explanation clarified

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
2026-05-31 10:26:42 +02:00
Ruben Ortlam
22cadc1944 llama: only use one iGPU device by default (#23897) 2026-05-31 08:17:47 +02:00
Pascal
d749821db3 webui: add custom CSS injection via config (#23904)
* webui: add custom CSS injection via config

register a customCSS setting in the Developer section under Custom JSON,
syncable so it rides the existing ui-config pass through. inject the value
into a single style element in the head, reactive on the setting. lets an
operator theme a prebuilt binary through --ui-config without rebuilding,
and lets a user set it from the settings panel.

* ui: address review from @niutech and @allozaur, rename custom JSON key and CSS field

* ui: address review from @allozaur, move custom CSS injection to a style tag in svelte:head

* ui: inject custom CSS through a svelte action instead of a bound element

move the textContent write into a use: action on the head style node.
the action is the idiomatic way to touch a node, so the no-dom-manipulating
lint rule is satisfied without a disable. value stays text through
textContent, never parsed as HTML.

* Update tools/ui/src/lib/constants/settings-keys.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: address review from @allozaur, rename custom config key to customJson with migration

rename the custom config key to customJson across the type, the chat
request builder, the settings save check and the custom tools reader,
keeping the custom API param name unchanged. add a non destructive
migration that copies the legacy custom key to customJson at startup.
only render the head style tag when custom CSS is set.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-05-30 23:49:31 +02:00
Gaurav Garg
aa46bda89b Support -fa auto in llama-bench (#23714)
* Support `-fa auto` in llama-bench

Make the default value of `-ngl` -1, similar to other tools.

Update README with latest usage and examples

* Address review comments
2026-05-31 02:03:57 +05:30
lhez
d6588daa80 opencl: support bf16 by converting to f16 (#23839) 2026-05-30 10:17:47 -07:00
Pascal
d38d50e7ff ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (#23910) 2026-05-30 16:50:54 +02:00
Johannes Gäßler
8b0e0db606 TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (#23843)
* TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs

* fix afmoe TP
2026-05-30 16:48:00 +03:00
Georgi Gerganov
2d9b7c8e98 metal : restore im2col implementation for large kernels (#23901) 2026-05-30 15:26:13 +03:00
Xuan-Son Nguyen
e674b1279b test: (test-llama-archs) log the config name first (#23885) 2026-05-30 12:22:38 +02:00
Georgi Gerganov
4c4e91b799 ci : update ios-xcode release job to macos-26 (#23906)
* ci : disable libcommon build from xcframework

* ocd : fix name

* ci : ios-xcode change to macos-26

* cont : pin xcode

* cont : pin xcode to minor version
2026-05-30 13:21:46 +03:00
Jinyang He
d48a56effb ggml : add some lsx support (#23798)
* loongarch : optimize LSX fp16 load/store with native intrinsics

Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in
__lsx_f16x4_load and __lsx_f16x4_store.

* loongarch : add LSX implementation for q8_0 dot product

* loongarch : add LSX implementation for q6_K dot product

* loongarch : add LSX implementation for iq4_xs dot product

* Improve reduce ops when sun int16 pairs to int32
2026-05-30 11:53:26 +03:00
Ruben Ortlam
6e093b80ea vulkan: add Flash Attention support for BFloat16 KV cache (#23420)
* vulkan: add flash attention bf16 kv support

* vulkan: bf16 FA coopmat1 support

* vulkan: bf16 FA coopmat2 support

* fix FA bf16 f32 fallback

* fix FA bf16 coopmat1 shader

* fix FA bf16 coopmat2 shader

* code cleanup

* cleanup comment change

* address feedback

* add O_TYPE for cm2 FA

* use O_TYPE for gqaStore function

* reduce BFLOAT16 ifdefs
2026-05-30 10:39:31 +02:00
Georgi Gerganov
337528571d ci : fix s390x release job (#23898)
* ci : fix s390x release job

* ci : multi-thread build for `ios-xcode`

* ocd : names
2026-05-30 09:21:38 +03:00
Georgi Gerganov
d4204b03a5 ci : clear cache instead of "no timestamp" keys + fix macos (#23895)
* ci : ios use macos-15 again

* ci : add and test ccache-clear

* cont : fix

* cont : set permission

* cont : another permission

* cont : token

* cont : print key

* cont : bring back perms

* cont : test windows

* cont : add token

* cont : cleanup

* ci : make release jobs clean-up their ccache
2026-05-30 08:52:30 +03:00
Radoslav Gerganov
1738129bee llama : do not skip iGPU when only RPC devices are present (#23868)
After #23007 reclassified integrated CUDA/HIP devices as IGPU, the device
selection logic dropped the local iGPU whenever any RPC server was added,
because RPC devices made `model->devices` non-empty. On systems where the
"iGPU" is the main compute device (e.g. Strix Halo with 128 GiB of unified
memory), this caused all tensors to be allocated on the RPC peer alone and
model loading to fail.

Gate the iGPU inclusion on `gpus.empty()` instead, so RPC peers no longer
suppress the local iGPU.

closes: #23858
2026-05-30 07:48:22 +03:00
Xuan-Son Nguyen
0821c5fcfd server: in SSE mode, send HTTP headers when slot starts (#23884)
* server: in SSE mode, send HTTP headers when slot starts

* ref to pr

* stream should be false by default
2026-05-30 00:06:29 +02:00
Reese Levine
151f3a98e9 ggml-webgpu: Check earlier for WebGPU required features (#23879) 2026-05-29 14:16:05 -07:00
Reese Levine
b22da25889 ggml-webgpu: add q4_0/q8_0 SET_ROWS (#23760)
* Add q8_0 and q4_0 set_rows

* Add fast(er) quantization set_rows path

* formatting/naming

* a little more naming

* Remove unused constant

* Don't override other override

* Avoid bitcast

* Narrow relaxation
2026-05-29 14:14:11 -07:00
Ruixiang Wang
689a9a470e server-bench : add speed-bench for speculative decoding benchmarking (#23869)
* spec: add speed-bench support for benchmarking

* speed-bench : add trailing newline to requirements.txt

* speed-bench : bump datasets to 4.8.0 to fix ty check

* server-bench : remove now-unused type: ignore after datasets bump
2026-05-29 23:09:47 +02:00
Pascal
5a46b46acd app: add llama update self updater (#23865)
* wip: llama update POC

* cleaning: llama update

* llama-gen-docs

* app: delegate llama update to the install script

* app: spawn the installer detached so llama update can replace a running binary

* cleaning: inline llama update into llama.cpp, drop app-update.{cpp,h}

* app: make llama_update static

Address review from @angt
2026-05-29 23:02:40 +02:00
ValdikSS
22d66b567e ui: handle audio/vnd.wave as audio WAV file (#23754)
Firefox on Linux uses this MIME type
2026-05-29 21:41:35 +02:00
Tarek Dakhran
2084434e66 vocab : support tokenizer for LFM2.5-8B-A1B (#23826)
* vocab: Support tokenizer for LFM2.5-8B-A1B

* Keep liquid6 tokenizer in models
2026-05-29 20:25:43 +02:00
Sigbjørn Skjæret
764f1e64a1 graph : ensure DS32 kq_mask_lid is F32 (#23864) 2026-05-29 19:55:14 +02:00
Xuan-Son Nguyen
b5f52280fb server: remove obsolete scripts (#23870) 2026-05-29 19:47:30 +02:00
Georgi Gerganov
dc71236b6c ci : update macos release to use macos-26 runner (#23878) 2026-05-29 20:41:57 +03:00
Xuan-Son Nguyen
06d26dfdff download: add option to skip_download (#23059)
* download: add option to skip_download

* fix

* fix 2

* if file doesn't exist, respect skip_download flag
2026-05-29 16:30:55 +02:00
Saba Fallah
da3f990a47 mtmd: Add DeepSeekOCR 2 Support (#20975)
* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution

* introduced clip_image_f32::add_viewsep

* address PR review

- drop redundant ggml_cpy ops in both deepseekocr versions build
- drop no-op ggml_cont in build_sam
- assert num_image_tokens deepseekocr2
- view_seperator as (1, n_embd) at conversion (for both versions)
- drop redundant ggml_reshape_2d

* Update tools/mtmd/models/deepseekocr2.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-05-29 16:13:51 +02:00
Oliver Simons
6ed481eea4 CUDA: Check PTX version on host side to guard PDL dispatch (#23530)
* CUDA: Check PTX version on host side to guard PDL dispatch

Checking on `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for say sm_90, sm_90a
or sm_90f (so forward-jittable PTX vs. arch/family-specific PTX).

Thus, one can have a bug when compiling with
`DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
 https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

Magic constants were taken from boost:
2698b43803/include/boost/container_hash/detail/hash_mix.hpp (L19-L65)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero

* Apply code-formatting

* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-29 12:28:18 +02:00
Xuan-Son Nguyen
cb47092b00 server: bump timeout to 3600s (#23842)
* server: bump timeout to 3600s

* nits: change wording
2026-05-29 10:23:17 +02:00
fairydreaming
1f0aa2a696 model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (#23346)
* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)

* convert : handle DeepseekV32ForCausalLM architecture

* ggml : support for f16 GGML_OP_FILL

* memory : separate hparams argument in llama_kv_cache constructor

* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)

* llama : support for LLM_ARCH_DEEPSEEK32

* model : llama_model_deepseek32 implementation

* model : merge two scale operations into one in DSA lightning indexer implementation

* chore : remove unused code

* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
2026-05-29 10:15:17 +02:00
Aman Gupta
031ddb2e08 llama: use f16 mask for FA to save VRAM (#23764)
* llama: use f16 mask for FA

* review: add llama_cast + formatting

* simplify
2026-05-29 15:44:43 +08:00
Georgi Gerganov
fe12e422ad sync : ggml 2026-05-29 09:56:08 +03:00
Georgi Gerganov
ea02bc37f5 ggml : bump version to 0.13.1 (ggml/1523) 2026-05-29 09:56:08 +03:00
Omid Azizi
b000431a0b ngram-mod : Add missing include (#23857)
[no release]

Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>
2026-05-29 09:21:37 +03:00
Aman Gupta
eef59a7642 llama: add llm_graph_input_mtp (#23643)
* llama: add llm_graph_input_mtp

* rename input_mtp -> input_token_embd

* add TODO about mtmd embedding

* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-29 09:17:32 +03:00
Adrien Gallouët
98e480a32e app : move licences to llama-app (#23824)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-29 07:46:11 +02:00
Andreas Kieslinger
241cbd41d2 cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825) 2026-05-29 07:46:10 +03:00
Matt Corallo
33c718db1f meta : Add missing buffer set in allreduce fallback !COMPUTE clear (#23480)
Without this at least the vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
2026-05-29 06:30:24 +03:00
Max Krasnyansky
19e92c33ef hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23835)
Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
2026-05-28 14:05:54 -07:00
Xuan-Son Nguyen
751ebd17a5 mtmd-debug: add color and rainbow mode (#23829)
* mtmd-debug: add color and rainbow mode

* fix M_PI

* max_dist
2026-05-28 20:59:14 +02:00
Xuan-Son Nguyen
c8914ad4f4 mtmd: fix gemma 4 projector pre_norm (#23822) 2026-05-28 20:58:55 +02:00
lhez
408ae2b9e5 opencl: move backend info printing into its own function (#23702)
* opencl: move backend info print into its own function

* opencl: move new log line

* opencl: fix for non adreno path
2026-05-28 11:05:42 -07:00
Sigbjørn Skjæret
3ef2369551 ci : run ui publish on ubuntu-slim (#23818)
* run ui publish on self-hosted fast

* run on ubuntu-slim
2026-05-28 20:58:32 +03:00
ValdikSS
2f6c815dc4 ui: fix audio and video modality detection (#23756)
When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.
2026-05-28 17:36:10 +02:00
Georgi Gerganov
445b7cef62 ci : releases use Github-hosted builds for the UI (#23823)
* ci : releases use Github-hosted builds for the UI

* cont : fix name
2026-05-28 17:50:32 +03:00
Adrien Gallouët
479a9a1b03 app : improve help output (#23805)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-28 16:45:06 +02:00
Saba Fallah
0b56d283bf mtmd: n_head_kv defaults to n_head (#23782)
removed AI-generated comment
2026-05-28 16:44:36 +02:00
Xuan-Son Nguyen
d6be3158e1 mtmd: fix gemma 4 audio rms norm eps (#23815)
* mtmd: fix gemma 4 audio rms norm eps

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-28 16:31:37 +02:00
Georgi Gerganov
dd1557907a ci : change Vulkan builds to Release to reduce ccache (#23820)
* ci : disable all CPU variant builds for Vulkan workflow

* cont : change cache key

* cont : change build type
2026-05-28 17:29:11 +03:00
Mikolaj Kucharski
7fb1e70b59 arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167) 2026-05-28 16:25:40 +02:00
Johannes Gäßler
d374e71e55 test-llama-archs: fix table format [no release] (#23810) 2026-05-28 15:53:54 +02:00
fl0rianr
30af6e2b98 ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007) 2026-05-28 15:01:14 +02:00
redfox
d7be46189f mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (#23729)
* mmvq Optim:  add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING

* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-28 14:51:14 +02:00
Jaden_Mach
bc81d47aba CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#23227)
* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware

The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.

This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):

  Q3_K, Q4_K, Q5_K  : MMVQ <= 3   (MMQ wins from batch=4: +5% .. +76%)
  Q2_K, Q6_K        : MMVQ <= 5   (MMQ wins from batch=6: +8% .. +35%)
  others            : MMVQ <= 8   (legacy & IQ regress under MMQ; unchanged)

Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.

Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.

  Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S:  559 -> 940  (+68%)
    Q5_K_S:  503 -> 884  (+76%)
    Q3_K_S:  629 -> 879  (+40%)
    Q2_K  :  615 -> 809  (+32%)
    Q6_K  :  582 -> 776  (+33%)

  Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S:  444 -> 480  (+ 8%)
    Q4_0  :  682 -> 685  (+ 0%)   (no regression - retains MMVQ)
    IQ4_XS:  706 -> 698  (- 1%)   (no regression - retains MMVQ)

* CUDA: address review — inline MMVQ batch table, drop env hatch & doc block

* tune kernel selection logic for CDNA1

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-05-28 14:50:25 +02:00
Funtowicz Morgan
0b246862b9 server: minor tweaks to use more cpp features (#23785)
* misc(server): add default port to impl RAII

* misc(server): register_gcp_compat() can be const

* misc(server): use proper cpp const/auto methods

* misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe
2026-05-28 14:00:25 +02:00
Max Krasnyansky
a919001134 hexagon: minor refresh for HMX FA and MM (#23796)
* hex-fa: clean up qf32/fp32 handling and stride handling

* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79

* hex-fa: vectorize leftover handling

* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity

* hmx-mm: remove dead code

* hmx-mm: use fastdiv in x4x2 dequant

* hmx-mm: sandwich dequant and scatter to improve perf

* hmx-mm: fixed rebase conflicts

* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

* hmx-mm: an even earlier dispatch for per-type dequant

* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
2026-05-28 04:49:11 -07:00
Jeff Bolz
48e7078ee0 vulkan: fast path for walsh-hadamard transform (#23687)
* vulkan: fast path for walsh-hadamard transform

* disable for intel due to segfault
2026-05-28 13:18:43 +02:00
Jesus Talavera
bb771cbd2b chat : add Granite 4.1 chat template (#23518) 2026-05-28 13:13:33 +02:00
Winston Ma
7c48fb81ce vulkan: fix wrong index variable in inner loop (#23665) 2026-05-28 12:48:34 +02:00
Winston Ma
91eb8f4fa0 vulkan: Fix memory logger unsafe iterator access (#23667) 2026-05-28 12:46:07 +02:00
Markus Tavenrath
d205df6812 server, ui : Add support for HTTP ETags in llama-server (#23701)
* allow caching of ui elements in llama-server

* use fnv_hash

* Update tools/server/server-http.cpp

etag has to be set always

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-05-28 12:21:24 +02:00
Sachin Sharma
e8d2567429 docker : add ZenDNN Dockerfile (#23716) 2026-05-28 11:40:49 +02:00
fairydreaming
09e7b76c93 cuda : fix KQ mask offset integer overflow in fattn MMA kernel (#23610)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2026-05-28 10:55:42 +02:00
Adrien Gallouët
48e7eae41c perplexity : fix format specifier in LOG_ERR (#23788)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-28 10:34:58 +03:00
ynankani
c5229087a5 convert : add FP8 to Q8 conversion (#23250)
Signed-off-by: ynankani <ynankani@nvidia.com>
2026-05-28 10:16:17 +03:00
Martin Klacer
e31cdaa0eb ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (#22841)
* Updated vec.h/vec.cpp code to accumulate to F32 rather than F16



Change-Id: I0cb789347f2bf60ffaf9047319f727e788c825f8

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
2026-05-28 10:04:21 +03:00
Georgi Gerganov
491c4d7d2e ci : refactor (#23789)
* ci : separate CUDA windows workflow + fix names

* ci : rename workflow

* ci : prefix cache names with workflow name

* ci : rename build.yml -> build-cpu.yml

* ci : cache keys

* ci : fix windows cuda/hip concurrency of release workflow

* ci : fix apple cache names

* ci : add TODOs

* cont : keep just the last cache

* ci : update release concurrency to queue

* ci : move the release trigger to ubuntu-slim

* ci : hip add TODO

* cont : improve words

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-05-28 09:44:25 +03:00
ymcki
939a7dd648 Hexagon: OP_GATED_DELTA_NET K>1 support (#23531)
* K>1 state snapshot support

* removed picky indent multiple of 4 fixes
2026-05-27 23:05:25 -07:00
ymcki
8ad8aef447 opencl: OP_GATED_DELTA_NET (#23312)
* OP_GATED_DELTA_NET impl

* add back lanes_per_column declaration

* removed has_subgroup_arithmetic and has_subgroup_clustered_reduce

* removed trailing spaces and fixes indentation. Hard coded subgroup size for Adreno and Intel. Return not supported when K>1 state snapshot

* support for K>1 state snapshot

* removed picky indent multiple of 4 fixes

* removed return that won\'t be executed
2026-05-27 21:23:21 -07:00
Reese Levine
f12cc6d0fa ggml-webgpu: remove legacy constants (#23672) 2026-05-27 14:22:33 -07:00
Max Krasnyansky
aa50b2c2ae hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647)
* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now

* hmx-mm: add support for Q4_1

* hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot

* hexagon: fix repack scratch buffer overflow

* hex-mm: fix Q4_1 repack buffer sizing

* hexagon: flip the build order for mm and fa (seems to help LTO)

* hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1

* hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output

* hexagon: resurrect early-wake and add support for polling for op-batch completions

With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax.
This is a good thing! But it does add extra latency for the pure benchmark runs.
Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking.

---------

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2026-05-27 10:46:11 -07:00
Masashi Yoshimura
c40006a62e ggml-webgpu: Fix how to dispatch WG to some ops (#23750) 2026-05-27 09:48:12 -07:00
Matt Corallo
c6e4088376 vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (#22887)
* vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32

Against mesa git, this shows a 4.8% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

Note that this breaks some tests until the last commit which fixes
OOB A reads.

* vulkan: Use aligned loads in mul_mat_vec when available

Against mesa git, this shows a 3.3% performance improvement for
tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec

Mesa's UUB logic can't see through conditionals, limiting its
ability to understand the bounds on the `num_rows` field in the
cleanup run. Making it explicit that `num_rows` is, indeed, always
<= `NUM_ROWS` helps mesa make slightly better codegen.

Against mesa git, this currently shows a 1% performance improvement
in tg128 on Qwen3.5-9B:BF16 on Intel BMG.

* vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes

There was a TODO to fix the OOB reads from the A matrix which we do
here.

It is within performance noise (+<0.1%) in tg128 for
Qwen3.5-9B:BF16 on Intel BMG.
2026-05-27 17:19:23 +02:00
Jeff Bolz
b36eefc1b3 vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (#23541) 2026-05-27 17:18:28 +02:00
l8bloom
837bb6b447 vulkan: add REPEAT op support for f16 to f16. (#23298)
* feat: extend repeat op for vulkan

* feat: add repeat_f16 vulkan pipeline

* fix: ensure same dst and src types

* fix: use type_size instead of data types

* fix: use int16 and int32 for repeat shader op

* chore: rename repeat_f* to repeat_i*

* chore: rename repeat vulkan pipelines
2026-05-27 16:59:08 +02:00
Georgi Gerganov
ba4dd0bc67 ci : move ARM jobs to self-hosted + disable kleidiai mac release (#23780)
* ci : move ARM jobs to 3rd-party runners + disable kleidiai release

* cont : fix deps + fix names

* ocd : fix names

* cont : fix PR links
2026-05-27 17:22:20 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)
617255d437 vendor : update cpp-httplib to 0.46.0 (#23650) 2026-05-27 21:36:24 +08:00
Sigbjørn Skjæret
87b0a60cdd pyproject : add conversion folder and update dependencies (#23746)
* add conversion folder and update dependencies

* limit python version for triton

* update dev-dependencies section
2026-05-27 15:06:18 +02:00
Oliver Simons
fda8528aa8 CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (#23742) 2026-05-27 15:21:04 +03:00
Sigbjørn Skjæret
2d0656fbdd ci : bump cuda release to 13.3 (#23749) 2026-05-27 15:06:08 +03:00
Georgi Gerganov
6b4e4bd582 common : fix env names to all have LLAMA_ARG_ prefix (#23778) 2026-05-27 14:52:47 +03:00
Georgi Gerganov
9f0e4b14d2 ci : fix windows ccaches (#23777)
* ci : server windows set build type explicitly

* cont : try windows-2025

* ci : use llvm

* cont : use ninja

* cont : fix shell

* ci : set number of jobs correctly

* ci : fix windows with vulkan ccache by using llvm

* ci : server ccache only on master

* ocd : fix job names

[no release]
2026-05-27 13:54:21 +03:00
Sigbjørn Skjæret
b3a739c9b6 ci : remove wasm test (#23733)
* run tests in correct build folder

* remove wasm test
2026-05-27 13:11:37 +03:00
Winston Ma
4d8cc0c56f vulkan: avoid preferring transfer queue on AMD UMA devices (#22455) 2026-05-27 11:48:40 +02:00
Georgi Gerganov
0d227ec358 ci : add ccache to server builds + fix undefined sanitizer build (#23763)
* ci : fix undefined sanitizer build to use Debug build type only

* ci : ccache the server builds

* cont : remove ui dependency + reuse ccache for both ubuntu jobs

* tmp : force ccache save

* Revert "tmp : force ccache save"

This reverts commit a857b03a10.

* cont : no need for node.js
2026-05-27 11:45:12 +03:00
quyentonndbs
1d971bba36 docs : fix duplicated "the" in granitevision and model-conversion docs (#23767)
Co-authored-by: Kai Tanaka <275430420+quyentonndbs@users.noreply.github.com>
2026-05-27 09:34:06 +02:00
zhangtao2-1
9777256c31 convert: add MiniCPM5 tokenizer support (#23384)
Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and
implement hardcoded regex handling in llama-vocab.cpp, consistent with
other BPE pre-tokenizers.

Co-authored-by: zhangtao <zhangtao2@modelbest.cn>
2026-05-27 08:08:33 +03:00
Radoslav Gerganov
7085492c6f server : fix the log message when using SSL (#23393)
When llama-server is started with SSL key and cert, the log says that it
listens on http instead of https. This patch fixes this.
2026-05-27 08:06:30 +03:00
Vladislav
b4c0549a49 ggml-zendnn : fixed naming of matmul function (#20964)
* ggml-zendnn: fixed naming of matmul function

* ggml-zendnn: fixed naming of mul_mat_id function

* ggml-zendnn: fixed print in  mul_mat_id

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>
2026-05-27 00:59:35 +02:00
Georgi Gerganov
0d18aaa9d1 ci : do not allocate ccache for 3rd-party hosted runners (#23730)
* ci : do not allocate ccache for 3rd-party hosted runners

[no release]

* cont : add prints

[no ci]
[no release]
2026-05-26 20:15:01 +03:00
Georgi Gerganov
08bc21b459 ci : move [no release] check to dedicated check_release job (#23734)
* ci : move [no release] check to dedicated check_release job

Move the workflow-level \`if\` condition that skips builds when the commit
message contains \`[no release]\` into a lightweight \`check_release\` job.
All build jobs now depend on it via \`needs\` and check its output.

This ensures the skip logic is evaluated at the job level rather than at
the workflow level, which is the recommended approach for conditional jobs.

Assisted-by: llama.cpp:local pi

* cont : use `fast` runner
2026-05-26 19:49:41 +03:00
Georgi Gerganov
35a74c8fb9 ci : add [no release] keyword + fix sanitizer builds (#23728)
* ci : skip release workflow on master when commit message contains [no release]

Assisted-by: llama.cpp:local pi

* ci : restrict sanitizer builds to x86_64 + fix build type

the spark is apparently too slow for some reason

* tests : fix undefined warning

[no ci]
2026-05-26 19:05:48 +03:00
Georgi Gerganov
5190c2ea8d ci : move macos jobs to the apple workflow + fix names (#23721) 2026-05-26 16:57:55 +03:00
Jeff Bolz
7799d31e68 vulkan: optimize conv2d and implement coopmat1 support (#22620)
* vulkan: add CONV_SHAPE_64x128 for medium-K conv2d

* vulkan: skip conv2d bounds checks when shapes align with tile sizes

* vulkan: use WG_SIZE=128 for CONV_SHAPE_64x32 conv2d

* vulkan: stage cm2 conv2d accumulator through shmem before global store

* vulkan: add coopmat1 conv2d path

* fallback when using too much shared memory. clean up comments

* Require 16x16x16 and subgroup size 32 or 64

* check whether shared memory is sufficient before overwriting conv2d params with coopmat1 values
2026-05-26 15:48:05 +02:00
Georgi Gerganov
3a3ed153d9 ci : remove vulkan SDK dep from webgpu job (#23718)
* ci : remove vulkan dep from webgpu build

* cont : add ccache to `ubuntu-24-webgpu-wasm`

* ci : fix name + add wasm test
2026-05-26 16:40:30 +03:00
Max Krasnyansky
ef66bfab68 hexagon: add support for CONCAT op (#23648)
* hexagon: add support for CONCAT with optimized concat_2d_transposed

qwen3.5 models are quite heavy on the CONCAT with large and transposed src1.

* hex-concat: use fastdiv in generic version

* hex-concat: make checks for transposed a bit more readable

* hex-concat: reoder dma ops for better pipelining

* hex-cont/cpy: optimize CPY and CONT ops

The primary change is to avoid scalar divs in the inner loops.
We were calling hvx_copy_uu(... type_size) where type_size is non a constexpr.
This causes runtime divs by that value which is normally just 4 or 2 (f32/f16).

* hex-get-rows: optimize GET_ROWS for large rows

We now use DMA for larger rows and also split them into chunks to improve perf for Qwen3.5 and other models
that do lots of GET_ROWS with huge (2MB+ rows).

Also bump the DMA queue depth now that we can take advantage of it.

* hex-concat: unroll the inner loops of concat_2d

* hex-concat: more updates to concat_2d to improve perf a bit further

* hex-cpy: fixed n_rows per thread checks in the copy ops

* hmx-fa: fix alignment issues while computing dma sizes

* hex-set-rows: add early returns for idle threads

* hvx-rope: minor optimization to replace loops with fastdiv logic

* hex-rope: replace scalar tail processing with HVX

* hex-rope: optimize rope cache init with HVX

Add hvx-utils sin/cos helpers that use an aprox method (similar to rsqrt, inverse, etc)
Use the helpers to optimize ROPE.
2026-05-26 06:20:05 -07:00
Georgi Gerganov
678d43d720 ci : move more CPU jobs to self-hosted runners (#23715) 2026-05-26 15:37:40 +03:00
Georgi Gerganov
ef41a69179 ci : move sanitizer jobs to self-hosted runners (#23713) 2026-05-26 15:22:09 +03:00
Georgi Gerganov
3dc7684f39 ci : reduce (disable SYCL and CANN builds/releases) (#23705)
* ci : reduce

[no ci]

* cont : disable sycl, cann + rename caches

[no ci]

* cont : cann

[no ci]
2026-05-26 15:21:21 +03:00
ghleg
dbe9c0c8ce convert : support Gemma4ForCausalLM architecture (#23682)
* convert : support Gemma4ForCausalLM architecture (#23674)

* fix indent

---------

Co-authored-by: Oleg Afonin <your.email@example.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-26 08:00:31 +03:00
Michael Wand
6fe90deffa models : Attach Mistral3 NVFP4 weight scales (#23629) 2026-05-26 07:59:59 +03:00
Alexey Kopytko
581d020b12 SYCL: implement ggml_sycl_pool_vmm (#22862)
* SYCL: implement ggml_sycl_pool_vmm

* Add an option to bypass VMM with GGML_SYCL_DISABLE_VMM

* Clean up debugging logging

* document GGML_SYCL_DISABLE_VMM

* Multi-stream MoE optimization

* Revert "Multi-stream MoE optimization"

This reverts commit 938929c3f1.

* Update common.hpp

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>

* Flip GGML_SYCL_DISABLE_VMM to GGML_SYCL_ENABLE_VMM

* add logging for GGML_SYCL_ENABLE_VMM when extension is not available (SYCL_EXT_ONEAPI_VIRTUAL_MEM macro)

* Apply suggestions from code review

Co-authored-by: Alexey Kopytko <alexey@kopytko.com>

* Apply suggestion from @sanmai

* Apply suggestion from @sanmai

---------

Co-authored-by: Neo Zhang <zhang.jianyu@outlook.com>
2026-05-26 07:59:00 +03:00
Jeff Bolz
7623de11d9 tests: test-backend-ops -j <N> to run tests in parallel (#23637)
Create a pool of N threads that grab a chunk of up to 100 tests at a time to
iterate through. The number of tests at a time decreases as fewer remain.

Each thread uses its own dev and cpu backend, and set_n_threads_fn is not
called on the cpu backend.

Fix some TSAN issues that arose:
- In init_tensor_uniform, don't use static vector of generators.
- Replace gmtime with versions that don't use a global variable.
- Mutex calls to print_test_result.
2026-05-26 07:57:56 +03:00
Niklas Sheth
c9d98295a3 model : add support for talkie-1930-13b (#22596)
* initial talkie support, coherent

* reorder to follow convention

* absorb inverse rope

* stop folding scalars to improve quantization

* use broadcasting instead of duplication

* style cleanup

* add scaling support to LoraTorchTensor; use that path in conversion

* use layer_out_scale instead of embd_skip_scale
2026-05-26 07:57:38 +03:00
Masashi Yoshimura
1506d39e76 ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MUL_MAT pipeline (#23594)
* ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K

* Fix to editorconfig checking pass

* Remove mul-mat-legacy pipeline

* Fix to use vendor name as is and add dot_product/vendor to shader_lib_ctx
2026-05-25 20:42:49 -07:00
Nikhil Jain
54121f7325 [WebGPU] Check batch_compute_passes before sending passes when not doing GPU profiling (#23457)
* Only run webgpu CI on my fork

* Add webgpu only workflow

* refactor batch_compute_passes to a per-thread variable, and submit individual passes when it is set to false and no GPU profiling is enabled

* restore build.yml
2026-05-25 20:32:49 -07:00
Johannes Gäßler
192d8ae8b8 CUDA: missing PDL sync for FWHT, better fallback (#23690) 2026-05-26 11:05:51 +08:00
forforever73
35c9b1f39e metal : add apple device id (#23566)
Co-authored-by: lvyichen <lvyichen@stepfun.com>
2026-05-25 21:05:16 +03:00
Max Krasnyansky
4bead4e30d snapdragon: bump toolchain docker to v0.7 to fix ui build issues (#23680) 2026-05-25 10:57:43 -07:00
296 changed files with 21928 additions and 7441 deletions

View File

@@ -3,6 +3,7 @@
glibc,
config,
stdenv,
stdenvNoCC,
runCommand,
cmake,
ninja,
@@ -19,6 +20,8 @@
openssl,
shaderc,
spirv-headers,
nodejs,
importNpmLock,
useBlas ?
builtins.all (x: !x) [
useCuda
@@ -130,7 +133,31 @@ effectiveStdenv.mkDerivation (finalAttrs: {
src = lib.cleanSource ../../.;
};
postPatch = ''
# Builds the webui locally, taking care not to require updating any sha256 hash.
webui = stdenvNoCC.mkDerivation {
pname = "webui";
version = llamaVersion;
src = lib.cleanSource ../../tools/ui;
nativeBuildInputs = [
nodejs
importNpmLock.linkNodeModulesHook
];
# no sha256 required when using buildNodeModules
npmDeps = importNpmLock.buildNodeModules {
npmRoot = ../../tools/ui;
inherit nodejs;
};
installPhase = ''
LLAMA_UI_OUT_DIR=$out npm run build --offline
'';
};
postPatch = lib.optionalString useWebUi ''
cp -r ${finalAttrs.webui} tools/ui/dist
chmod -R u+w tools/ui/dist
'';
# With PR#6015 https://github.com/ggml-org/llama.cpp/pull/6015,

101
.devops/zendnn.Dockerfile Normal file
View File

@@ -0,0 +1,101 @@
ARG UBUNTU_VERSION=24.04
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
FROM ubuntu:$UBUNTU_VERSION AS build
RUN apt-get update && \
apt-get install -y gcc-13 g++-13 build-essential git cmake libssl-dev libomp-dev libnuma-dev python3 ca-certificates
ENV CC=gcc-13 CXX=g++-13
WORKDIR /app
COPY . .
RUN cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_TESTS=OFF -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_ZENDNN=ON && \
cmake --build build -j $(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so*" -exec cp -P {} /app/lib \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
## Base image
FROM ubuntu:$UBUNTU_VERSION AS base
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE
RUN apt-get update \
&& apt-get install -y libgomp1 libnuma1 curl \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
### Full
FROM base AS full
COPY --from=build /app/full /app
WORKDIR /app
RUN apt-get update \
&& apt-get install -y \
git \
python3 \
python3-pip \
python3-wheel \
&& pip install --break-system-packages --upgrade setuptools \
&& pip install --break-system-packages -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
### Light, CLI only
FROM base AS light
COPY --from=build /app/full/llama-cli /app/full/llama-completion /app
WORKDIR /app
ENTRYPOINT [ "/app/llama-cli" ]
### Server, Server only
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]

22
.github/actions/ccache-clear/action.yml vendored Normal file
View File

@@ -0,0 +1,22 @@
name: "ccache-clear"
description: "Delete all GitHub Actions caches matching a key prefix"
inputs:
key:
description: "Cache key prefix to match and delete"
required: true
runs:
using: "composite"
steps:
- name: Clear caches
shell: bash
run: |
CACHES=$(gh cache list --key "ccache-${{ inputs.key }}" --json id,key --jq '.[] | "\(.id) \(.key)"' 2>/dev/null)
if [ -z "$CACHES" ]; then
echo "No caches found with key prefix: ${{ inputs.key }}"
exit 0
fi
while read -r id key; do
echo "Deleting cache: $id ($key)"
gh cache delete "$id"
done <<< "$CACHES"

View File

@@ -96,3 +96,34 @@ runs:
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
echo "CUDA_PATH_V13_1=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
- name: Install Cuda Toolkit 13.3
if: ${{ inputs.cuda_version == '13.3' }}
shell: pwsh
run: |
mkdir -p "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3"
choco install unzip -y
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_crt/windows-x86_64/cuda_crt-windows-x86_64-13.3.33-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_cudart/windows-x86_64/cuda_cudart-windows-x86_64-13.3.29-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvcc/windows-x86_64/cuda_nvcc-windows-x86_64-13.3.33-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvrtc/windows-x86_64/cuda_nvrtc-windows-x86_64-13.3.33-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/libcublas/windows-x86_64/libcublas-windows-x86_64-13.5.1.27-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/libnvvm/windows-x86_64/libnvvm-windows-x86_64-13.3.33-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvtx/windows-x86_64/cuda_nvtx-windows-x86_64-13.3.29-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cuda_profiler_api/windows-x86_64/cuda_profiler_api-windows-x86_64-13.3.27-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/visual_studio_integration/windows-x86_64/visual_studio_integration-windows-x86_64-13.3.27-archive.zip"
curl -O "https://developer.download.nvidia.com/compute/cuda/redist/cccl/windows-x86_64/cccl-windows-x86_64-13.3.3.3.1-archive.zip"
unzip '*.zip' -d "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3"
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_crt-windows-x86_64-13.3.33-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_cudart-windows-x86_64-13.3.29-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_nvcc-windows-x86_64-13.3.33-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_nvrtc-windows-x86_64-13.3.33-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\libcublas-windows-x86_64-13.5.1.27-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\libnvvm-windows-x86_64-13.3.33-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_nvtx-windows-x86_64-13.3.29-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cuda_profiler_api-windows-x86_64-13.3.27-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\visual_studio_integration-windows-x86_64-13.3.27-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
xcopy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\cccl-windows-x86_64-13.3.3.3.1-archive\*" "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" /E /I /H /Y
echo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\bin" | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
echo "CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8
echo "CUDA_PATH_V13_3=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3" | Out-File -FilePath $env:GITHUB_ENV -Append -Encoding utf8

View File

@@ -22,9 +22,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-llguidance:

View File

@@ -31,7 +31,7 @@ jobs:
android-ndk-snapdragon:
runs-on: ubuntu-latest
container:
image: 'ghcr.io/snapdragon-toolchain/arm64-android:v0.6'
image: 'ghcr.io/snapdragon-toolchain/arm64-android:v0.7'
defaults:
run:
shell: bash
@@ -61,7 +61,7 @@ jobs:
linux-iot-snapdragon:
runs-on: ubuntu-latest
container:
image: 'ghcr.io/snapdragon-toolchain/arm64-linux:v0.6'
image: 'ghcr.io/snapdragon-toolchain/arm64-linux:v0.7'
defaults:
run:
shell: bash

View File

@@ -27,12 +27,12 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
android:
default:
runs-on: ubuntu-latest
steps:
@@ -58,7 +58,7 @@ jobs:
cd examples/llama.android
./gradlew build --no-daemon
android-ndk:
ndk:
runs-on: ubuntu-latest
container:
image: 'ghcr.io/snapdragon-toolchain/arm64-android:v0.3'
@@ -92,7 +92,7 @@ jobs:
name: llama-cpp-android-arm64-cpu
path: pkg-adb/llama.cpp
android-arm64:
arm64:
runs-on: ubuntu-latest
env:
@@ -103,12 +103,18 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: android-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
# note : disabled to spare some cache space (https://github.com/ggml-org/llama.cpp/pull/23789)
# for some reason, the ccache does not improve the build time in this case
# example:
# cache off: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78160400831
# cache on: https://github.com/ggerganov/tmp2/actions/runs/26534713799/job/78224189394
#
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: android-ubuntu-arm64
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Set up JDK
uses: actions/setup-java@v5

View File

@@ -32,12 +32,12 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
macOS-latest-ios:
macos-latest-arm64:
runs-on: macos-latest
steps:
@@ -48,7 +48,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-ios
key: apple-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -56,19 +56,58 @@ jobs:
id: cmake_build
run: |
sysctl -a
cmake -B build -G Xcode \
cmake -B build \
-DCMAKE_BUILD_RPATH="@loader_path" \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL_USE_BF16=ON \
-DGGML_METAL_EMBED_LIBRARY=ON \
-DLLAMA_BUILD_APP=OFF \
-DLLAMA_BUILD_COMMON=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_TOOLS=OFF \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-DCMAKE_SYSTEM_NAME=iOS \
-DCMAKE_OSX_DEPLOYMENT_TARGET=14.0 \
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO
-DGGML_METAL_EMBED_LIBRARY=OFF \
-DGGML_METAL_SHADER_DEBUG=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
leaks -atExit -- ./build/bin/test-thread-safety -hf ggml-org/gemma-3-270m-qat-GGUF -ngl 99 -p "$(printf 'hello %.0s' {1..128})" -n 16 -c 512 -ub 32 -np 2 -t 2 -lv 1
- name: Test
id: cmake_test
run: |
cd build
ctest -L main -E "test-llama-archs" --verbose --timeout 900
macos-latest-x64:
runs-on: macos-15-intel
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: apple-x64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
sysctl -a
# Metal is disabled due to intermittent failures with Github runners not having a GPU:
# https://github.com/ggml-org/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
cmake -B build \
-DCMAKE_BUILD_RPATH="@loader_path" \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL=OFF \
-DGGML_RPC=ON \
-DCMAKE_OSX_DEPLOYMENT_TARGET=13.3
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
macos-latest-ios-xcode:
runs-on: macos-latest
@@ -117,7 +156,7 @@ jobs:
xcodebuild -downloadPlatform iOS
xcodebuild -project examples/llama.swiftui/llama.swiftui.xcodeproj -scheme llama.swiftui -sdk iphoneos CODE_SIGNING_REQUIRED=NO CODE_SIGN_IDENTITY= -destination 'generic/platform=iOS' FRAMEWORK_FOLDER_PATH=./build-ios build
macOS-latest-tvos:
macos-latest-tvos:
runs-on: macos-latest
steps:
@@ -125,10 +164,11 @@ jobs:
id: checkout
uses: actions/checkout@v6
# TODO: this likely does not do anything - if yes, remove it
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-tvos
key: apple-tvos
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -150,7 +190,7 @@ jobs:
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO
macOS-latest-visionos:
macos-latest-visionos:
runs-on: macos-latest
steps:
@@ -158,6 +198,14 @@ jobs:
id: checkout
uses: actions/checkout@v6
# TODO: this likely does not do anything - if yes, remove it
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: apple-visionos
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
@@ -176,7 +224,7 @@ jobs:
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu) -- CODE_SIGNING_ALLOWED=NO
macOS-latest-swift:
macos-latest-swift:
runs-on: macos-latest
needs: macos-latest-ios-xcode
@@ -189,10 +237,11 @@ jobs:
id: checkout
uses: actions/checkout@v6
# TODO: this likely does not do anything - if yes, remove it
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-swift
key: apple-swift
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -28,7 +28,7 @@ jobs:
id: cache-sdk
with:
path: ./vulkan_sdk
key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
key: cache-gha-vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'
@@ -54,7 +54,7 @@ jobs:
# id: cache-toolchain
# with:
# path: ./spacemit_toolchain
# key: spacemit-ime-toolchain-v${{ env.SPACEMIT_IME_TOOLCHAIN_VERSION }}-${{ runner.os }}
# key: cache-gha-spacemit-ime-toolchain-v${{ env.SPACEMIT_IME_TOOLCHAIN_VERSION }}-${{ runner.os }}
# - name: Setup SpacemiT Toolchain
# if: steps.cache-toolchain.outputs.cache-hit != 'true'
@@ -81,7 +81,7 @@ jobs:
id: cache-openvino
with:
path: ./openvino_toolkit
key: openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
key: cache-gha-openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
- name: Setup OpenVINO Toolkit
if: steps.cache-openvino.outputs.cache-hit != 'true'
@@ -108,7 +108,7 @@ jobs:
id: cache-rocm
with:
path: C:\Program Files\AMD\ROCm
key: rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
key: cache-gha-rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: Setup ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'

View File

@@ -29,74 +29,76 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
openEuler-latest-cann:
defaults:
run:
shell: bash -el {0}
strategy:
matrix:
arch: [x86, aarch64]
chip_type: ['910b', '310p']
build: ['Release']
use_acl_graph: ['on', 'off']
exclude:
# 310P does not support USE_ACL_GRAPH=on
- chip_type: '310p'
use_acl_graph: 'on'
runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
steps:
- name: Checkout
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Free up disk space
uses: ggml-org/free-disk-space@v1.3.1
with:
tool-cache: true
- name: Set container image
id: cann-image
run: |
image="ascendai/cann:${{ matrix.chip_type == '910b' && '8.5.0-910b-openeuler24.03-py3.11' || '8.5.0-310p-openeuler24.03-py3.11' }}"
echo "image=${image}" >> "${GITHUB_OUTPUT}"
- name: Pull container image
run: docker pull "${{ steps.cann-image.outputs.image }}"
- name: Build
env:
BUILD_TYPE: ${{ matrix.build }}
SOC_TYPE: ascend${{ matrix.chip_type }}
USE_ACL_GRAPH: ${{ matrix.use_acl_graph }}
run: |
HOST_UID=$(id -u)
HOST_GID=$(id -g)
docker run --rm \
-v "${PWD}:/workspace" \
-w /workspace \
-e SOC_TYPE=${SOC_TYPE} \
-e BUILD_TYPE=${BUILD_TYPE} \
-e USE_ACL_GRAPH=${USE_ACL_GRAPH} \
"${{ steps.cann-image.outputs.image }}" \
bash -lc '
set -e
yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake openssl-devel
yum clean all && rm -rf /var/cache/yum
git config --global --add safe.directory "/workspace"
export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
-DGGML_CANN=on \
-DSOC_TYPE=${SOC_TYPE} \
-DUSE_ACL_GRAPH=${USE_ACL_GRAPH}
cmake --build build -j $(nproc)
chown -R '"${HOST_UID}"':'"${HOST_GID}"' /workspace/build
'
# TODO: this build is disabled to save Github Actions resources (https://github.com/ggml-org/llama.cpp/pull/23705)
# in order to enable it again, we have to provision dedicated runners to run it
# openEuler-latest-cann:
# defaults:
# run:
# shell: bash -el {0}
# strategy:
# matrix:
# arch: [x86, aarch64]
# chip_type: ['910b', '310p']
# build: ['Release']
# use_acl_graph: ['on', 'off']
# exclude:
# # 310P does not support USE_ACL_GRAPH=on
# - chip_type: '310p'
# use_acl_graph: 'on'
# runs-on: ${{ matrix.arch == 'aarch64' && 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
# steps:
# - name: Checkout
# uses: actions/checkout@v6
# with:
# fetch-depth: 0
#
# - name: Free up disk space
# uses: ggml-org/free-disk-space@v1.3.1
# with:
# tool-cache: true
#
# - name: Set container image
# id: cann-image
# run: |
# image="ascendai/cann:${{ matrix.chip_type == '910b' && '8.5.0-910b-openeuler24.03-py3.11' || '8.5.0-310p-openeuler24.03-py3.11' }}"
# echo "image=${image}" >> "${GITHUB_OUTPUT}"
#
# - name: Pull container image
# run: docker pull "${{ steps.cann-image.outputs.image }}"
#
# - name: Build
# env:
# BUILD_TYPE: ${{ matrix.build }}
# SOC_TYPE: ascend${{ matrix.chip_type }}
# USE_ACL_GRAPH: ${{ matrix.use_acl_graph }}
# run: |
# HOST_UID=$(id -u)
# HOST_GID=$(id -g)
#
# docker run --rm \
# -v "${PWD}:/workspace" \
# -w /workspace \
# -e SOC_TYPE=${SOC_TYPE} \
# -e BUILD_TYPE=${BUILD_TYPE} \
# -e USE_ACL_GRAPH=${USE_ACL_GRAPH} \
# "${{ steps.cann-image.outputs.image }}" \
# bash -lc '
# set -e
# yum install -y --setopt=install_weak_deps=False --setopt=tsflags=nodocs git gcc gcc-c++ make cmake openssl-devel
# yum clean all && rm -rf /var/cache/yum
# git config --global --add safe.directory "/workspace"
# export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/$(uname -m)-linux/devlib/:${LD_LIBRARY_PATH}
# cmake -S . -B build \
# -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
# -DGGML_CANN=on \
# -DSOC_TYPE=${SOC_TYPE} \
# -DUSE_ACL_GRAPH=${USE_ACL_GRAPH}
# cmake --build build -j $(nproc)
#
# chown -R '"${HOST_UID}"':'"${HOST_GID}"' /workspace/build
# '

215
.github/workflows/build-cpu.yml vendored Normal file
View File

@@ -0,0 +1,215 @@
name: CI (cpu)
on:
workflow_dispatch: # allows manual triggering
push:
branches:
- master
paths: [
'.github/workflows/build-cpu.yml',
'.github/workflows/build-cmake-pkg.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp',
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build-cpu.yml',
'.github/workflows/build-cmake-pkg.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp'
]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
build-cmake-pkg:
uses: ./.github/workflows/build-cmake-pkg.yml
ubuntu:
strategy:
matrix:
include:
- build: 'x64'
os: ubuntu-22.04
- build: 'arm64'
os: ubuntu-24.04-arm
runs-on: ${{ matrix.os }}
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: cpu-${{ matrix.os }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build Dependencies
id: build_depends
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
python3 python3-pip python3-dev python3-wheel \
libjpeg-dev build-essential libssl-dev \
git-lfs
- name: Toolchain workaround (GCC 14)
if: ${{ contains(matrix.os, 'ubuntu-24.04') }}
run: |
sudo apt-get install -y gcc-14 g++-14
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: Python Dependencies
id: python_depends
run: |
export PIP_BREAK_SYSTEM_PACKAGES="1"
python3 -m pip install --upgrade pip setuptools
pip3 install ./gguf-py
- name: Build
id: cmake_build
run: |
cmake -B build \
-DLLAMA_FATAL_WARNINGS=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
- name: Test llama2c conversion
id: llama2c_test
run: |
cd build
echo "Fetch tokenizer"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/tok512.bin
echo "Fetch llama2c model"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
windows:
runs-on: windows-2025
env:
OPENBLAS_VERSION: 0.3.23
SDE_VERSION: 9.33.0-2024-01-07
VULKAN_VERSION: 1.4.313.2
strategy:
matrix:
include:
- build: 'x64-cpu-static'
arch: 'x64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF'
- build: 'x64-openblas'
arch: 'x64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS="$env:RUNNER_TEMP/openblas/include" -DBLAS_LIBRARIES="$env:RUNNER_TEMP/openblas/lib/openblas.lib"'
- build: 'x64-vulkan'
arch: 'x64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON'
- build: 'arm64'
arch: 'arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: cpu-windows-2025-${{ matrix.build }}
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Download OpenBLAS
id: get_openblas
if: ${{ matrix.build == 'x64-openblas' }}
run: |
curl.exe -o $env:RUNNER_TEMP/openblas.zip -L "https://github.com/xianyi/OpenBLAS/releases/download/v${env:OPENBLAS_VERSION}/OpenBLAS-${env:OPENBLAS_VERSION}-x64.zip"
curl.exe -o $env:RUNNER_TEMP/OpenBLAS.LICENSE.txt -L "https://github.com/xianyi/OpenBLAS/raw/v${env:OPENBLAS_VERSION}/LICENSE"
mkdir $env:RUNNER_TEMP/openblas
tar.exe -xvf $env:RUNNER_TEMP/openblas.zip -C $env:RUNNER_TEMP/openblas
$vcdir = $(vswhere -latest -products * -requires Microsoft.VisualStudio.Component.VC.Tools.x86.x64 -property installationPath)
$msvc = $(join-path $vcdir $('VC\Tools\MSVC\'+$(gc -raw $(join-path $vcdir 'VC\Auxiliary\Build\Microsoft.VCToolsVersion.default.txt')).Trim()))
$lib = $(join-path $msvc 'bin\Hostx64\x64\lib.exe')
& $lib /machine:x64 "/def:${env:RUNNER_TEMP}/openblas/lib/libopenblas.def" "/out:${env:RUNNER_TEMP}/openblas/lib/openblas.lib" /name:openblas.dll
- name: Install Vulkan SDK
id: get_vulkan
if: ${{ matrix.build == 'x64-vulkan' }}
run: |
curl.exe -o $env:RUNNER_TEMP/VulkanSDK-Installer.exe -L "https://sdk.lunarg.com/sdk/download/${env:VULKAN_VERSION}/windows/vulkansdk-windows-X64-${env:VULKAN_VERSION}.exe"
& "$env:RUNNER_TEMP\VulkanSDK-Installer.exe" --accept-licenses --default-answer --confirm-command install
Add-Content $env:GITHUB_ENV "VULKAN_SDK=C:\VulkanSDK\${env:VULKAN_VERSION}"
Add-Content $env:GITHUB_PATH "C:\VulkanSDK\${env:VULKAN_VERSION}\bin"
- name: Install Ninja
id: install_ninja
run: |
choco install ninja
- name: Build
id: cmake_build
run: |
cmake -S . -B build ${{ matrix.defines }} `
-DLLAMA_BUILD_BORINGSSL=ON
cmake --build build --config Release -j ${env:NUMBER_OF_PROCESSORS}
- name: Add libopenblas.dll
id: add_libopenblas_dll
if: ${{ matrix.build == 'x64-openblas' }}
run: |
cp $env:RUNNER_TEMP/openblas/bin/libopenblas.dll ./build/bin/Release/openblas.dll
cp $env:RUNNER_TEMP/OpenBLAS.LICENSE.txt ./build/bin/Release/OpenBLAS-${env:OPENBLAS_VERSION}.txt
- name: Test
id: cmake_test
if: ${{ matrix.arch == 'x64' }}
run: |
cd build
ctest -L main -C Release --verbose --timeout 900
# TODO: disabled for now, consider adding tests for all CPU variants instead
# - name: Test (Intel SDE)
# id: cmake_test_sde
# if: ${{ matrix.build == 'avx512-x64' && env.HAS_AVX512F == '0' }} # use Intel SDE for AVX-512 emulation
# run: |
# curl.exe -o $env:RUNNER_TEMP/sde.tar.xz -L "https://downloadmirror.intel.com/813591/sde-external-${env:SDE_VERSION}-win.tar.xz"
# # for some weird reason windows tar doesn't like sde tar.xz
# 7z x "-o${env:RUNNER_TEMP}" $env:RUNNER_TEMP/sde.tar.xz
# 7z x "-o${env:RUNNER_TEMP}" $env:RUNNER_TEMP/sde.tar
# $sde = $(join-path $env:RUNNER_TEMP sde-external-${env:SDE_VERSION}-win/sde.exe)
# cd build
# $env:LLAMA_SKIP_TESTS_SLOW_ON_EMULATOR = 1
# & $sde -future -- ctest -L main -C Release --verbose --timeout 900

View File

@@ -287,7 +287,7 @@ jobs:
# id: cache-toolchain
# with:
# path: ./spacemit_toolchain
# key: spacemit-ime-toolchain-v${{ env.SPACEMIT_IME_TOOLCHAIN_VERSION }}-${{ runner.os }}
# key: cache-gha-spacemit-ime-toolchain-v${{ env.SPACEMIT_IME_TOOLCHAIN_VERSION }}-${{ runner.os }}
- name: Setup SpacemiT Toolchain
#if: steps.cache-toolchain.outputs.cache-hit != 'true'

View File

@@ -1,4 +1,4 @@
name: CI (hip)
name: CI (CUDA, ubuntu)
on:
workflow_dispatch: # allows manual triggering
@@ -6,7 +6,7 @@ on:
branches:
- master
paths: [
'.github/workflows/build-hip.yml',
'.github/workflows/build-cuda-ubuntu.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
@@ -20,7 +20,7 @@ on:
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build-hip.yml',
'.github/workflows/build-cuda-ubuntu.yml',
'ggml/src/ggml-cuda/**'
]
@@ -31,13 +31,48 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
cuda:
runs-on: ubuntu-24.04
container: nvidia/cuda:12.6.2-devel-ubuntu24.04
ubuntu-22-hip:
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Install dependencies
env:
DEBIAN_FRONTEND: noninteractive
run: |
apt update
apt install -y cmake build-essential ninja-build libgomp1 git libssl-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: cuda-ubuntu-24.04-cuda
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with CMake
# TODO: Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project
run: |
cmake -S . -B build -G Ninja \
-DLLAMA_FATAL_WARNINGS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=89-real \
-DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined \
-DGGML_NATIVE=OFF \
-DGGML_CUDA=ON \
-DGGML_CUDA_CUB_3DOT2=ON
cmake --build build
hip:
runs-on: ubuntu-22.04
container: rocm/dev-ubuntu-22.04:6.1.2
@@ -55,7 +90,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-hip
key: cuda-ubuntu-22.04-hip
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -69,75 +104,7 @@ jobs:
-DGGML_HIP=ON
cmake --build build --config Release -j $(nproc)
windows-latest-hip:
runs-on: windows-2022
env:
# Make sure this is in sync with build-cache.yml
HIPSDK_INSTALLER_VERSION: "26.Q1"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Grab rocWMMA package
id: grab_rocwmma
run: |
curl -o rocwmma.deb "https://repo.radeon.com/rocm/apt/7.2.1/pool/main/r/rocwmma-dev/rocwmma-dev_2.2.0.70201-81~24.04_amd64.deb"
7z x rocwmma.deb
7z x data.tar
- name: Use ROCm Installation Cache
uses: actions/cache@v5
id: cache-rocm
with:
path: C:\Program Files\AMD\ROCm
key: rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: Setup ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'
uses: ./.github/actions/windows-setup-rocm
with:
version: ${{ env.HIPSDK_INSTALLER_VERSION }}
- name: Verify ROCm
id: verify
run: |
# Find and test ROCm installation
$clangPath = Get-ChildItem 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Select-Object -First 1
if (-not $clangPath) {
Write-Error "ROCm installation not found"
exit 1
}
& $clangPath.FullName --version
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ${{ github.job }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
$env:HIP_PATH=$(Resolve-Path 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | split-path | split-path)
$env:CMAKE_PREFIX_PATH="${env:HIP_PATH}"
cmake -G "Unix Makefiles" -B build -S . `
-DCMAKE_C_COMPILER="${env:HIP_PATH}\bin\clang.exe" `
-DCMAKE_CXX_COMPILER="${env:HIP_PATH}\bin\clang++.exe" `
-DCMAKE_CXX_FLAGS="-I$($PWD.Path.Replace('\', '/'))/opt/rocm-7.2.1/include/" `
-DCMAKE_BUILD_TYPE=Release `
-DLLAMA_BUILD_BORINGSSL=ON `
-DROCM_DIR="${env:HIP_PATH}" `
-DGGML_HIP=ON `
-DGGML_HIP_ROCWMMA_FATTN=ON `
-DGPU_TARGETS="gfx1100" `
-DGGML_RPC=ON
cmake --build build -j ${env:NUMBER_OF_PROCESSORS}
ubuntu-22-musa:
musa:
runs-on: ubuntu-22.04
container: mthreads/musa:rc4.3.0-devel-ubuntu22.04-amd64
@@ -155,7 +122,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-musa
key: cuda-ubuntu-22.04-musa
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

162
.github/workflows/build-cuda-windows.yml vendored Normal file
View File

@@ -0,0 +1,162 @@
name: CI (CUDA, windows)
# TODO: this workflow is only triggered manually because it is very heavy on the CI
# when we provision dedicated windows runners, we can enable it for pushes too
# note: running this workflow manually will populate the ccache for the release builds
# this can be used before merging a PR to speed up the release workflow
on:
workflow_dispatch: # allows manual triggering
# note: this will run in queue with the release workflow
concurrency:
group: release
queue: max
env:
GH_TOKEN: ${{ github.token }}
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
cuda:
runs-on: windows-2022
permissions:
actions: write
strategy:
matrix:
cuda: ['12.4', '13.3']
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: release-windows-2022-x64-cuda-${{ matrix.cuda }}
- name: Install Cuda Toolkit
uses: ./.github/actions/windows-setup-cuda
with:
cuda_version: ${{ matrix.cuda }}
- name: Install Ninja
id: install_ninja
run: |
choco install ninja
- name: Build
id: cmake_build
shell: cmd
# TODO: Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project
run: |
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
cmake -S . -B build -G "Ninja Multi-Config" ^
-DLLAMA_BUILD_SERVER=ON ^
-DLLAMA_BUILD_BORINGSSL=ON ^
-DGGML_NATIVE=OFF ^
-DGGML_BACKEND_DL=ON ^
-DGGML_CPU_ALL_VARIANTS=ON ^
-DGGML_CUDA=ON ^
-DGGML_RPC=ON ^
-DGGML_CUDA_CUB_3DOT2=ON
set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
cmake --build build --config Release -j %NINJA_JOBS% -t ggml
cmake --build build --config Release
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
key: release-windows-2022-x64-cuda-${{ matrix.cuda }}
hip:
runs-on: windows-2022
permissions:
actions: write
env:
# Make sure this is in sync with build-cache.yml
HIPSDK_INSTALLER_VERSION: "26.Q1"
strategy:
matrix:
include:
# sync with release.yml
- name: "radeon"
gpu_targets: "gfx1150;gfx1151;gfx1200;gfx1201;gfx1100;gfx1101;gfx1102;gfx1030;gfx1031;gfx1032"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Grab rocWMMA package
id: grab_rocwmma
run: |
curl -o rocwmma.deb "https://repo.radeon.com/rocm/apt/7.2.1/pool/main/r/rocwmma-dev/rocwmma-dev_2.2.0.70201-81~24.04_amd64.deb"
7z x rocwmma.deb
7z x data.tar
- name: Use ROCm Installation Cache
uses: actions/cache@v5
id: cache-rocm
with:
path: C:\Program Files\AMD\ROCm
key: cache-gha-rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: Setup ROCm
if: steps.cache-rocm.outputs.cache-hit != 'true'
uses: ./.github/actions/windows-setup-rocm
with:
version: ${{ env.HIPSDK_INSTALLER_VERSION }}
- name: Verify ROCm
id: verify
run: |
# Find and test ROCm installation
$clangPath = Get-ChildItem 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | Select-Object -First 1
if (-not $clangPath) {
Write-Error "ROCm installation not found"
exit 1
}
& $clangPath.FullName --version
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
# TODO: this build does not match the build in release.yml, so we use a different cache key
# ideally, the builds should match, similar to the CUDA build above so that we would be able
# to populate the ccache for the release with manual runs of this workflow
#key: release-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
key: cuda-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
- name: Build
id: cmake_build
run: |
$env:HIP_PATH=$(Resolve-Path 'C:\Program Files\AMD\ROCm\*\bin\clang.exe' | split-path | split-path)
$env:CMAKE_PREFIX_PATH="${env:HIP_PATH}"
cmake -G "Unix Makefiles" -B build -S . `
-DCMAKE_C_COMPILER="${env:HIP_PATH}\bin\clang.exe" `
-DCMAKE_CXX_COMPILER="${env:HIP_PATH}\bin\clang++.exe" `
-DCMAKE_CXX_FLAGS="-I$($PWD.Path.Replace('\', '/'))/opt/rocm-7.2.1/include/" `
-DCMAKE_BUILD_TYPE=Release `
-DLLAMA_BUILD_BORINGSSL=ON `
-DROCM_DIR="${env:HIP_PATH}" `
-DGGML_HIP=ON `
-DGGML_HIP_ROCWMMA_FATTN=ON `
-DGPU_TARGETS="gfx1100" `
-DGGML_RPC=ON
cmake --build build -j ${env:NUMBER_OF_PROCESSORS}
- name: ccache-clear
uses: ./.github/actions/ccache-clear
with:
#key: release-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}
key: cuda-windows-2022-x64-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}

View File

@@ -29,9 +29,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:

View File

@@ -15,9 +15,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
windows-msys2:
@@ -37,7 +37,7 @@ jobs:
#- name: ccache
# uses: ggml-org/ccache-action@v1.2.16
# with:
# key: windows-msys2
# key: msys-windows-2025-x64
# variant: ccache
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -30,13 +30,12 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
windows-latest-opencl-adreno:
windows-2025-opencl-adreno:
runs-on: windows-2025
steps:
@@ -47,7 +46,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-llvm-arm64-opencl-adreno
key: opencl-windows-2025-x64
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

View File

@@ -29,30 +29,18 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-openvino:
name: ubuntu-24-openvino-${{ matrix.openvino_device }}
runs-on: [self-hosted, Linux, Intel, OpenVINO]
concurrency:
group: openvino-${{ matrix.variant }}-${{ github.head_ref || github.ref }}
group: openvino-gpu-${{ github.head_ref || github.ref }}
cancel-in-progress: false
strategy:
matrix:
include:
- variant: cpu
runner: '"ubuntu-24.04"'
openvino_device: "CPU"
- variant: gpu
runner: '["self-hosted","Linux","Intel","OpenVINO"]'
openvino_device: "GPU"
runs-on: ${{ fromJSON(matrix.runner) }}
env:
# Sync versions in build-openvino.yml, build-self-hosted.yml, release.yml, build-cache.yml, .devops/openvino.Dockerfile
OPENVINO_VERSION_MAJOR: "2026.0"
@@ -63,14 +51,6 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: ccache
if: runner.environment == 'github-hosted'
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-openvino-${{ matrix.variant }}-no-preset-v1
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
@@ -78,16 +58,7 @@ jobs:
sudo apt-get install -y build-essential libssl-dev libtbb12 cmake ninja-build python3-pip
sudo apt-get install -y ocl-icd-opencl-dev opencl-headers opencl-clhpp-headers intel-opencl-icd
- name: Use OpenVINO Toolkit Cache
if: runner.environment == 'github-hosted'
uses: actions/cache@v5
id: cache-openvino
with:
path: ./openvino_toolkit
key: openvino-toolkit-v${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
- name: Setup OpenVINO Toolkit
if: steps.cache-openvino.outputs.cache-hit != 'true'
uses: ./.github/actions/linux-setup-openvino
with:
path: ./openvino_toolkit
@@ -109,12 +80,17 @@ jobs:
-DGGML_OPENVINO=ON
time cmake --build build/ReleaseOV --config Release -j $(nproc)
- name: Test
id: cmake_test
- name: Test (CPU)
id: cmake_test_cpu
# TODO: fix and re-enable the `test-llama-archs` test below
run: |
cd ${{ github.workspace }}
if [ "${{ matrix.openvino_device }}" = "GPU" ]; then
export GGML_OPENVINO_DEVICE=GPU
fi
ctest --test-dir build/ReleaseOV -L main -E "test-llama-archs" --verbose --timeout 2000
- name: Test (GPU)
id: cmake_test_gpu
# TODO: fix and re-enable the `test-llama-archs` test below
run: |
cd ${{ github.workspace }}
export GGML_OPENVINO_DEVICE=GPU
ctest --test-dir build/ReleaseOV -L main -E "test-llama-archs" --verbose --timeout 2000

View File

@@ -29,9 +29,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-cpu-riscv64-native:
@@ -58,17 +58,20 @@ jobs:
ldd --version
cmake --version
rustc --version
env
echo "nproc=$(nproc)"
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
with:
key: ubuntu-cpu-riscv64-native
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
# note: sparing some ccache since these jobs run on dedicated runners that are not part of the organitzation
#- name: ccache
# uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
# with:
# key: riscv-ubuntu-native
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
@@ -132,12 +135,13 @@ jobs:
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
with:
key: ubuntu-riscv64-native-sanitizer-${{ matrix.sanitizer }}-${{ matrix.build_type }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
# note: sparing some ccache since these jobs run on dedicated runners that are not part of the organitzation
#- name: ccache
# uses: ggml-org/ccache-action@afde29e5b5422e5da23cb1f639e8baecadeadfc3 # https://github.com/ggml-org/ccache-action/pull/1
# with:
# key: riscv-ubuntu-native-sanitizer-${{ matrix.sanitizer }}-${{ matrix.build_type }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build

View File

@@ -29,14 +29,13 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-latest-rpc:
runs-on: ubuntu-latest
ubuntu-24-rpc:
runs-on: ${{ 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
continue-on-error: true

View File

@@ -22,66 +22,65 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-latest-sanitizer:
runs-on: ubuntu-latest
ctest:
runs-on: [self-hosted, X64, CPU, Linux]
continue-on-error: true
strategy:
matrix:
sanitizer: [ADDRESS, THREAD, UNDEFINED]
build_type: [Debug]
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-latest-sanitizer-${{ matrix.sanitizer }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
# with UNDEFINED sanitizer, we have to build in Debug to avoid GCC 13 false-positive warnings
- name: Build (undefined)
id: cmake_build_undefined
if: ${{ matrix.sanitizer == 'UNDEFINED' }}
run: |
sudo apt-get update
sudo apt-get install build-essential libssl-dev
cmake -B build \
-DCMAKE_BUILD_TYPE=Debug \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON \
-DGGML_SANITIZE_${{ matrix.sanitizer }}=ON
cmake --build build --config Debug -j $(nproc)
- name: Build
id: cmake_build
if: ${{ matrix.sanitizer != 'THREAD' }}
if: ${{ matrix.sanitizer == 'ADDRESS' }}
run: |
cmake -B build \
-DLLAMA_FATAL_WARNINGS=ON \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON \
-DGGML_SANITIZE_${{ matrix.sanitizer }}=ON \
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }}
-DGGML_SANITIZE_${{ matrix.sanitizer }}=ON
cmake --build build --config ${{ matrix.build_type }} -j $(nproc)
cmake --build build --config RelWithDebInfo -j $(nproc)
- name: Build (no OpenMP)
id: cmake_build_no_openmp
if: ${{ matrix.sanitizer == 'THREAD' }}
run: |
cmake -B build \
-DLLAMA_FATAL_WARNINGS=ON \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON \
-DGGML_SANITIZE_${{ matrix.sanitizer }}=ON \
-DCMAKE_BUILD_TYPE=${{ matrix.build_type }} \
-DGGML_OPENMP=OFF
cmake --build build --config ${{ matrix.build_type }} -j $(nproc)
cmake --build build --config RelWithDebInfo -j $(nproc)
- name: Test
id: cmake_test
# skip run in Debug - very slow
if: ${{ matrix.sanitizer != 'UNDEFINED' }}
run: |
cd build
ctest -L main --verbose --timeout 900
ctest -L main -E tokenizer --verbose --timeout 900

View File

@@ -50,12 +50,12 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ggml-ci-nvidia-cuda:
gpu-cuda:
runs-on: [self-hosted, Linux, NVIDIA]
steps:
@@ -69,7 +69,7 @@ jobs:
nvidia-smi
GG_BUILD_CUDA=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-nvidia-vulkan-cm:
gpu-vulkan-nvidia-cm:
runs-on: [self-hosted, Linux, NVIDIA]
steps:
@@ -83,7 +83,7 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 GGML_VK_DISABLE_COOPMAT2=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-nvidia-vulkan-cm2:
gpu-vulkan-nvidia-cm2:
runs-on: [self-hosted, Linux, NVIDIA, COOPMAT2]
steps:
@@ -97,7 +97,7 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-nvidia-webgpu:
gpu-webgpu-nvidia:
runs-on: [self-hosted, Linux, NVIDIA, X64]
steps:
@@ -127,7 +127,7 @@ jobs:
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: provision AMX-compatible machine
#ggml-ci-cpu-amx:
#cpu-amx:
# runs-on: [self-hosted, Linux, CPU, AMX]
# steps:
@@ -141,7 +141,7 @@ jobs:
# bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: provision AMD GPU machine
# ggml-ci-amd-vulkan:
# amd-vulkan:
# runs-on: [self-hosted, Linux, AMD]
# steps:
@@ -156,7 +156,7 @@ jobs:
# GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: provision AMD GPU machine
# ggml-ci-amd-rocm:
# amd-rocm:
# runs-on: [self-hosted, Linux, AMD]
# steps:
@@ -170,7 +170,7 @@ jobs:
# amd-smi static
# GG_BUILD_ROCM=1 GG_BUILD_AMDGPU_TARGETS="gfx1101" bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-metal:
gpu-metal:
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -183,7 +183,7 @@ jobs:
run: |
GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-webgpu:
gpu-webgpu-apple:
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -210,7 +210,7 @@ jobs:
GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-mac-vulkan:
gpu-vulkan-apple:
runs-on: [self-hosted, macOS, ARM64]
steps:
@@ -224,7 +224,7 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-linux-intel-vulkan:
gpu-vulkan-intel-linux:
runs-on: [self-hosted, Linux, Intel]
steps:
@@ -240,7 +240,7 @@ jobs:
vulkaninfo --summary
GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-win-intel-vulkan:
gpu-vulkan-intel-windows:
runs-on: [self-hosted, Windows, X64, Intel]
steps:
@@ -261,7 +261,7 @@ jobs:
# a valid python environment for testing
LLAMA_FATAL_WARNINGS=OFF GG_BUILD_NINJA=1 GG_BUILD_VULKAN=1 GG_BUILD_LOW_PERF=1 ./ci/run.sh ./results/llama.cpp ./mnt/llama.cpp
ggml-ci-intel-openvino-gpu-low-perf:
gpu-openvino-low-perf:
runs-on: [self-hosted, Linux, Intel, OpenVINO]
concurrency:
@@ -297,8 +297,8 @@ jobs:
source ./openvino_toolkit/setupvars.sh
GG_BUILD_OPENVINO=1 GGML_OPENVINO_DEVICE=GPU GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-arm64-cpu-low-perf:
runs-on: [self-hosted, Linux, ARM64, CPU]
cpu-x64-high-perf:
runs-on: [self-hosted, Linux, X64]
steps:
- name: Clone
@@ -308,49 +308,84 @@ jobs:
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
ggml-ci-arm64-cpu-high-perf:
runs-on: [self-hosted, Linux, ARM64, CPU]
cpu-arm64-high-perf-graviton4:
runs-on: ah-ubuntu_22_04-c8g_8x
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
set -euxo pipefail
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a \
apt-get install -y \
build-essential \
python3-venv \
gpg \
wget \
time \
git-lfs
git lfs install
# install the latest cmake
sudo install -d /usr/share/keyrings
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc \
| gpg --dearmor \
| sudo tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ jammy main' \
| sudo tee /etc/apt/sources.list.d/kitware.list
sudo apt-get update
sudo apt-get install -y cmake
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_SVE=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
# TODO: not sure how to detect ARM flags on DGX Spark. currently get this error during cmake:
# CMake Warning at ggml/src/ggml-cpu/CMakeLists.txt:147 (message):
# ARM -march/-mcpu not found, -mcpu=native will be used
#
# if we resolve this, we should be able to offload these jobs to the self-hosted runners
#
# ggml-ci-arm64-cpu-high-perf-sve:
# runs-on: [self-hosted, Linux, ARM64, CPU]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# LLAMA_ARG_THREADS=$(nproc) GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
#
# ggml-ci-arm64-cpu-kleidiai:
# runs-on: [self-hosted, Linux, ARM64, CPU]
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Test
# id: ggml-ci
# run: |
# GG_BUILD_KLEIDIAI=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
cpu-arm64-graviton4-kleidiai:
runs-on: ah-ubuntu_22_04-c8g_8x
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
set -euxo pipefail
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a \
apt-get install -y \
build-essential \
python3-venv \
gpg \
wget \
time \
git-lfs
git lfs install
# install the latest cmake
sudo install -d /usr/share/keyrings
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc \
| gpg --dearmor \
| sudo tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ jammy main' \
| sudo tee /etc/apt/sources.list.d/kitware.list
sudo apt-get update
sudo apt-get install -y cmake
- name: Test
id: ggml-ci
run: |
GG_BUILD_KLEIDIAI=1 \
GG_BUILD_EXTRA_TESTS_0=1 \
bash ./ci/run.sh ./tmp/results ./tmp/mnt

View File

@@ -29,130 +29,134 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-sycl:
strategy:
matrix:
build: [fp32]
include:
- build: fp32
fp16: OFF
# TODO: this build is disabled to save Github Actions resources (https://github.com/ggml-org/llama.cpp/pull/23705)
# in order to enable it again, we have to provision dedicated runners to run it
# ubuntu-24-sycl:
# strategy:
# matrix:
# build: [fp32]
# include:
# - build: fp32
# fp16: OFF
#
# runs-on: ubuntu-24.04
#
# env:
# ONEAPI_ROOT: /opt/intel/oneapi/
# ONEAPI_INSTALLER_VERSION: "2025.3.3"
# LEVEL_ZERO_VERSION: "1.28.2"
# LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
#
# continue-on-error: true
#
# steps:
# - uses: actions/checkout@v6
#
# - name: Use oneAPI Installation Cache
# uses: actions/cache@v5
# id: cache-sycl
# with:
# path: ${{ env.ONEAPI_ROOT }}
# key: cache-gha-oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
#
# - name: Download & Install oneAPI
# shell: bash
# if: steps.cache-sycl.outputs.cache-hit != 'true'
# run: |
# cd /tmp
# wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
# sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
#
# - name: Install Level Zero SDK
# shell: bash
# run: |
# cd /tmp
# wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
# wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
# sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
#
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: sycl-ubuntu-24-${{ matrix.build }}
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# - name: Build
# id: cmake_build
# run: |
# source /opt/intel/oneapi/setvars.sh
# cmake -B build \
# -G "Ninja" \
# -DCMAKE_BUILD_TYPE=Release \
# -DGGML_SYCL=ON \
# -DCMAKE_C_COMPILER=icx \
# -DCMAKE_CXX_COMPILER=icpx \
# -DLLAMA_OPENSSL=OFF \
# -DGGML_NATIVE=OFF \
# -DGGML_SYCL_F16=${{ matrix.fp16 }}
# time cmake --build build --config Release -j $(nproc)
runs-on: ubuntu-24.04
env:
ONEAPI_ROOT: /opt/intel/oneapi/
ONEAPI_INSTALLER_VERSION: "2025.3.3"
LEVEL_ZERO_VERSION: "1.28.2"
LEVEL_ZERO_UBUNTU_VERSION: "u24.04"
continue-on-error: true
steps:
- uses: actions/checkout@v6
- name: Use oneAPI Installation Cache
uses: actions/cache@v5
id: cache-sycl
with:
path: ${{ env.ONEAPI_ROOT }}
key: oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
- name: Download & Install oneAPI
shell: bash
if: steps.cache-sycl.outputs.cache-hit != 'true'
run: |
cd /tmp
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/56f7923a-adb8-43f3-8b02-2b60fcac8cab/intel-deep-learning-essentials-2025.3.3.16_offline.sh -O intel-deep-learning-essentials_offline.sh
sudo bash intel-deep-learning-essentials_offline.sh -s -a --silent --eula accept
- name: Install Level Zero SDK
shell: bash
run: |
cd /tmp
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero.deb
wget -q "https://github.com/oneapi-src/level-zero/releases/download/v${LEVEL_ZERO_VERSION}/level-zero-devel_${LEVEL_ZERO_VERSION}%2B${LEVEL_ZERO_UBUNTU_VERSION}_amd64.deb" -O level-zero-devel.deb
sudo apt-get install -y ./level-zero.deb ./level-zero-devel.deb
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-sycl-${{ matrix.build }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
source /opt/intel/oneapi/setvars.sh
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_SYCL=ON \
-DCMAKE_C_COMPILER=icx \
-DCMAKE_CXX_COMPILER=icpx \
-DLLAMA_OPENSSL=OFF \
-DGGML_NATIVE=OFF \
-DGGML_SYCL_F16=${{ matrix.fp16 }}
time cmake --build build --config Release -j $(nproc)
windows-latest-sycl:
runs-on: windows-2022
defaults:
run:
shell: bash
env:
WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
ONEAPI_INSTALLER_VERSION: "2025.3.3"
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Use oneAPI Installation Cache
uses: actions/cache@v5
id: cache-sycl
with:
path: ${{ env.ONEAPI_ROOT }}
key: oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
- name: Download & Install oneAPI
shell: bash
if: steps.cache-sycl.outputs.cache-hit != 'true'
run: |
scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
- name: Install Level Zero SDK
shell: pwsh
run: |
Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
"LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-sycl
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
# TODO: add ssl support ; we will also need to modify win-build-sycl.bat to accept user-specified args
- name: Build
id: cmake_build
run: examples/sycl/win-build-sycl.bat
# TODO: this build is disabled to save Github Actions resources (https://github.com/ggml-org/llama.cpp/pull/23705)
# in order to enable it again, we have to provision dedicated runners to run it
# windows-latest-sycl:
# runs-on: windows-2022
#
# defaults:
# run:
# shell: bash
#
# env:
# WINDOWS_BASEKIT_URL: https://registrationcenter-download.intel.com/akdlm/IRC_NAS/b60765d1-2b85-4e85-86b6-cb0e9563a699/intel-deep-learning-essentials-2025.3.3.18_offline.exe
# WINDOWS_DPCPP_MKL: intel.oneapi.win.cpp-dpcpp-common:intel.oneapi.win.mkl.devel:intel.oneapi.win.dnnl:intel.oneapi.win.tbb.devel
# LEVEL_ZERO_SDK_URL: https://github.com/oneapi-src/level-zero/releases/download/v1.28.2/level-zero-win-sdk-1.28.2.zip
# ONEAPI_ROOT: "C:/Program Files (x86)/Intel/oneAPI"
# ONEAPI_INSTALLER_VERSION: "2025.3.3"
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: Use oneAPI Installation Cache
# uses: actions/cache@v5
# id: cache-sycl
# with:
# path: ${{ env.ONEAPI_ROOT }}
# key: cache-gha-oneAPI-${{ env.ONEAPI_INSTALLER_VERSION }}-${{ runner.os }}
#
# - name: Download & Install oneAPI
# shell: bash
# if: steps.cache-sycl.outputs.cache-hit != 'true'
# run: |
# scripts/install-oneapi.bat $WINDOWS_BASEKIT_URL $WINDOWS_DPCPP_MKL
#
# - name: Install Level Zero SDK
# shell: pwsh
# run: |
# Invoke-WebRequest -Uri "${{ env.LEVEL_ZERO_SDK_URL }}" -OutFile "level-zero-win-sdk.zip"
# Expand-Archive -Path "level-zero-win-sdk.zip" -DestinationPath "C:/level-zero-sdk" -Force
# "LEVEL_ZERO_V1_SDK_PATH=C:/level-zero-sdk" | Out-File -FilePath $env:GITHUB_ENV -Append
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: sycl-windows-latest
# variant: ccache
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# # TODO: add ssl support ; we will also need to modify win-build-sycl.bat to accept user-specified args
#
# - name: Build
# id: cmake_build
# run: examples/sycl/win-build-sycl.bat

View File

@@ -31,26 +31,56 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-24-vulkan-llvmpipe:
runs-on: ubuntu-24.04
ubuntu-arm64:
runs-on: ubuntu-24.04-arm
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev spirv-headers libssl-dev ninja-build
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-vulkan-llvmpipe
key: vulkan-ubuntu-24.04-arm-new
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Configure
id: cmake_configure
run: |
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_VULKAN=ON
- name: Build
id: cmake_build
run: |
time cmake --build build -j $(nproc)
ubuntu-llvmpipe:
runs-on: ubuntu-24.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
@@ -68,7 +98,7 @@ jobs:
id: cache-sdk
with:
path: ./vulkan_sdk
key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
key: cache-gha-vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'
@@ -77,6 +107,13 @@ jobs:
path: ./vulkan_sdk
version: ${{ env.VULKAN_SDK_VERSION }}
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: vulkan-ubuntu-24.04-llvmpipe
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |

View File

@@ -30,13 +30,12 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
macOS-latest-arm64-webgpu:
macos:
runs-on: macos-latest
steps:
@@ -47,7 +46,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64-webgpu
key: webgpu-macos-latest
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -77,7 +76,7 @@ jobs:
cd build
ctest -L main --verbose --timeout 900
ubuntu-24-webgpu:
ubuntu:
runs-on: ubuntu-24.04
steps:
@@ -88,7 +87,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-webgpu
key: webgpu-ubuntu-24.04
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
@@ -100,25 +99,6 @@ jobs:
sudo apt-get install -y build-essential mesa-vulkan-drivers \
libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libssl-dev
- name: Get latest Vulkan SDK version
id: vulkan_sdk_version
run: |
echo "VULKAN_SDK_VERSION=$(curl https://vulkan.lunarg.com/sdk/latest/linux.txt)" >> "$GITHUB_ENV"
- name: Use Vulkan SDK Cache
uses: actions/cache@v5
id: cache-sdk
with:
path: ./vulkan_sdk
key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'
uses: ./.github/actions/linux-setup-vulkan
with:
path: ./vulkan_sdk
version: ${{ env.VULKAN_SDK_VERSION }}
- name: Dawn Dependency
id: dawn-depends
run: |
@@ -149,14 +129,21 @@ jobs:
# test-backend-ops is too slow on llvmpipe, skip it
ctest -L main -E test-backend-ops --verbose --timeout 900
ubuntu-24-webgpu-wasm:
runs-on: ${{ 'ubuntu-24.04-arm' || 'ubuntu-24.04' }}
ubuntu-wasm:
runs-on: ubuntu-24.04-arm
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: webgpu-ubuntu-24.04-arm-wasm
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Install Emscripten
run: |
git clone https://github.com/emscripten-core/emsdk.git

View File

@@ -1,635 +0,0 @@
name: CI
on:
workflow_dispatch: # allows manual triggering
push:
branches:
- master
paths: [
'.github/workflows/build.yml',
'.github/workflows/build-cmake-pkg.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp',
'**/*.cu',
'**/*.cuh',
'**/*.swift',
'**/*.m',
'**/*.metal',
'**/*.comp',
'**/*.glsl',
'**/*.wgsl'
]
pull_request:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/build.yml',
'.github/workflows/build-cmake-pkg.yml',
'**/CMakeLists.txt',
'**/.cmake',
'**/*.h',
'**/*.hpp',
'**/*.c',
'**/*.cpp',
'**/*.cu',
'**/*.cuh',
'**/*.swift',
'**/*.m',
'**/*.metal',
'**/*.comp',
'**/*.glsl',
'**/*.wgsl'
]
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref && github.ref || github.run_id }}
cancel-in-progress: true
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
jobs:
build-cmake-pkg:
uses: ./.github/workflows/build-cmake-pkg.yml
macOS-latest-arm64:
runs-on: macos-latest
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
sysctl -a
cmake -B build \
-DCMAKE_BUILD_RPATH="@loader_path" \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL_USE_BF16=ON \
-DGGML_METAL_EMBED_LIBRARY=OFF \
-DGGML_METAL_SHADER_DEBUG=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
leaks -atExit -- ./build/bin/test-thread-safety -hf ggml-org/gemma-3-270m-qat-GGUF -ngl 99 -p "$(printf 'hello %.0s' {1..128})" -n 16 -c 512 -ub 32 -np 2 -t 2 -lv 1
- name: Test
id: cmake_test
run: |
cd build
ctest -L main -E "test-llama-archs" --verbose --timeout 900
macOS-latest-x64:
runs-on: macos-15-intel
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-x64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
sysctl -a
# Metal is disabled due to intermittent failures with Github runners not having a GPU:
# https://github.com/ggml-org/llama.cpp/actions/runs/8635935781/job/23674807267#step:5:2313
cmake -B build \
-DCMAKE_BUILD_RPATH="@loader_path" \
-DLLAMA_FATAL_WARNINGS=ON \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_METAL=OFF \
-DGGML_RPC=ON \
-DCMAKE_OSX_DEPLOYMENT_TARGET=13.3
time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
ubuntu-cpu:
strategy:
matrix:
include:
- build: 'x64'
os: ubuntu-22.04
- build: 'arm64'
os: ubuntu-24.04-arm
runs-on: ${{ matrix.os }}
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-cpu-${{ matrix.build }}
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build Dependencies
id: build_depends
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
python3 python3-pip python3-dev python3-wheel \
libjpeg-dev build-essential libssl-dev \
git-lfs
- name: Toolchain workaround (GCC 14)
if: ${{ contains(matrix.os, 'ubuntu-24.04') }}
run: |
sudo apt-get install -y gcc-14 g++-14
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: Python Dependencies
id: python_depends
run: |
export PIP_BREAK_SYSTEM_PACKAGES="1"
python3 -m pip install --upgrade pip setuptools
pip3 install ./gguf-py
- name: Build
id: cmake_build
run: |
cmake -B build \
-DLLAMA_FATAL_WARNINGS=ON \
-DGGML_RPC=ON
time cmake --build build --config Release -j $(nproc)
- name: Test
id: cmake_test
run: |
cd build
ctest -L main --verbose --timeout 900
- name: Test llama2c conversion
id: llama2c_test
run: |
cd build
echo "Fetch tokenizer"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/tok512.bin
echo "Fetch llama2c model"
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories260K/stories260K.bin
./bin/llama-convert-llama2c-to-ggml --copy-vocab-from-model ./tok512.bin --llama2c-model stories260K.bin --llama2c-output-model stories260K.gguf
./bin/llama-completion -m stories260K.gguf -p "One day, Lily met a Shoggoth" -n 500 -c 256
ubuntu-24-vulkan:
strategy:
matrix:
include:
- build: 'x64'
os: ubuntu-24.04
- build: 'arm64'
os: ubuntu-24.04-arm
runs-on: ${{ matrix.os }}
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev spirv-headers libssl-dev ninja-build
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"
- name: Configure
id: cmake_configure
run: |
cmake -B build \
-G "Ninja" \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DGGML_BACKEND_DL=ON \
-DGGML_CPU_ALL_VARIANTS=ON \
-DGGML_VULKAN=ON
- name: Build
id: cmake_build
run: |
time cmake --build build -j $(nproc)
windows-latest:
runs-on: windows-2025
env:
OPENBLAS_VERSION: 0.3.23
SDE_VERSION: 9.33.0-2024-01-07
VULKAN_VERSION: 1.4.313.2
strategy:
matrix:
include:
- build: 'cpu-x64 (static)'
arch: 'x64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF'
- build: 'openblas-x64'
arch: 'x64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS="$env:RUNNER_TEMP/openblas/include" -DBLAS_LIBRARIES="$env:RUNNER_TEMP/openblas/lib/openblas.lib"'
- build: 'vulkan-x64'
arch: 'x64'
defines: '-DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON'
- build: 'llvm-arm64'
arch: 'arm64'
defines: '-G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-${{ matrix.build }}
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Download OpenBLAS
id: get_openblas
if: ${{ matrix.build == 'openblas-x64' }}
run: |
curl.exe -o $env:RUNNER_TEMP/openblas.zip -L "https://github.com/xianyi/OpenBLAS/releases/download/v${env:OPENBLAS_VERSION}/OpenBLAS-${env:OPENBLAS_VERSION}-x64.zip"
curl.exe -o $env:RUNNER_TEMP/OpenBLAS.LICENSE.txt -L "https://github.com/xianyi/OpenBLAS/raw/v${env:OPENBLAS_VERSION}/LICENSE"
mkdir $env:RUNNER_TEMP/openblas
tar.exe -xvf $env:RUNNER_TEMP/openblas.zip -C $env:RUNNER_TEMP/openblas
$vcdir = $(vswhere -latest -products * -requires Microsoft.VisualStudio.Component.VC.Tools.x86.x64 -property installationPath)
$msvc = $(join-path $vcdir $('VC\Tools\MSVC\'+$(gc -raw $(join-path $vcdir 'VC\Auxiliary\Build\Microsoft.VCToolsVersion.default.txt')).Trim()))
$lib = $(join-path $msvc 'bin\Hostx64\x64\lib.exe')
& $lib /machine:x64 "/def:${env:RUNNER_TEMP}/openblas/lib/libopenblas.def" "/out:${env:RUNNER_TEMP}/openblas/lib/openblas.lib" /name:openblas.dll
- name: Install Vulkan SDK
id: get_vulkan
if: ${{ matrix.build == 'vulkan-x64' }}
run: |
curl.exe -o $env:RUNNER_TEMP/VulkanSDK-Installer.exe -L "https://sdk.lunarg.com/sdk/download/${env:VULKAN_VERSION}/windows/vulkansdk-windows-X64-${env:VULKAN_VERSION}.exe"
& "$env:RUNNER_TEMP\VulkanSDK-Installer.exe" --accept-licenses --default-answer --confirm-command install
Add-Content $env:GITHUB_ENV "VULKAN_SDK=C:\VulkanSDK\${env:VULKAN_VERSION}"
Add-Content $env:GITHUB_PATH "C:\VulkanSDK\${env:VULKAN_VERSION}\bin"
- name: Install Ninja
id: install_ninja
run: |
choco install ninja
- name: Build
id: cmake_build
run: |
cmake -S . -B build ${{ matrix.defines }} `
-DLLAMA_BUILD_BORINGSSL=ON
cmake --build build --config Release -j ${env:NUMBER_OF_PROCESSORS}
- name: Add libopenblas.dll
id: add_libopenblas_dll
if: ${{ matrix.build == 'openblas-x64' }}
run: |
cp $env:RUNNER_TEMP/openblas/bin/libopenblas.dll ./build/bin/Release/openblas.dll
cp $env:RUNNER_TEMP/OpenBLAS.LICENSE.txt ./build/bin/Release/OpenBLAS-${env:OPENBLAS_VERSION}.txt
- name: Test
id: cmake_test
if: ${{ matrix.arch == 'x64' }}
run: |
cd build
ctest -L main -C Release --verbose --timeout 900
# TODO: disabled for now, consider adding tests for all CPU variants instead
# - name: Test (Intel SDE)
# id: cmake_test_sde
# if: ${{ matrix.build == 'avx512-x64' && env.HAS_AVX512F == '0' }} # use Intel SDE for AVX-512 emulation
# run: |
# curl.exe -o $env:RUNNER_TEMP/sde.tar.xz -L "https://downloadmirror.intel.com/813591/sde-external-${env:SDE_VERSION}-win.tar.xz"
# # for some weird reason windows tar doesn't like sde tar.xz
# 7z x "-o${env:RUNNER_TEMP}" $env:RUNNER_TEMP/sde.tar.xz
# 7z x "-o${env:RUNNER_TEMP}" $env:RUNNER_TEMP/sde.tar
# $sde = $(join-path $env:RUNNER_TEMP sde-external-${env:SDE_VERSION}-win/sde.exe)
# cd build
# $env:LLAMA_SKIP_TESTS_SLOW_ON_EMULATOR = 1
# & $sde -future -- ctest -L main -C Release --verbose --timeout 900
ubuntu-latest-cuda:
runs-on: ubuntu-latest
container: nvidia/cuda:12.6.2-devel-ubuntu24.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Install dependencies
env:
DEBIAN_FRONTEND: noninteractive
run: |
apt update
apt install -y cmake build-essential ninja-build libgomp1 git libssl-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-latest-cuda
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build with CMake
# TODO: Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project
run: |
cmake -S . -B build -G Ninja \
-DLLAMA_FATAL_WARNINGS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=89-real \
-DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined \
-DGGML_NATIVE=OFF \
-DGGML_CUDA=ON \
-DGGML_CUDA_CUB_3DOT2=ON
cmake --build build
windows-2022-cuda:
runs-on: windows-2022
strategy:
matrix:
cuda: ['12.4']
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-cuda-${{ matrix.cuda }}
variant: ccache
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Install Cuda Toolkit
uses: ./.github/actions/windows-setup-cuda
with:
cuda_version: ${{ matrix.cuda }}
- name: Install Ninja
id: install_ninja
run: |
choco install ninja
- name: Build
id: cmake_build
shell: cmd
# TODO: Remove GGML_CUDA_CUB_3DOT2 flag once CCCL 3.2 is bundled within CTK and that CTK version is used in this project
run: |
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64
cmake -S . -B build -G "Ninja Multi-Config" ^
-DLLAMA_BUILD_SERVER=ON ^
-DLLAMA_BUILD_BORINGSSL=ON ^
-DGGML_NATIVE=OFF ^
-DGGML_BACKEND_DL=ON ^
-DGGML_CPU_ALL_VARIANTS=ON ^
-DGGML_CUDA=ON ^
-DGGML_RPC=ON ^
-DGGML_CUDA_CUB_3DOT2=ON
set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
cmake --build build --config Release -j %NINJA_JOBS% -t ggml
cmake --build build --config Release
# TODO: simplify the following workflows using a matrix
# TODO: run lighter CI on PRs and the full CI only on master (if needed)
ggml-ci-x64-cpu-low-perf:
runs-on: ubuntu-22.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-x64-cpu-low-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
# note: moved to build-self-hosted.yml - can remove from here when everything is stable
# ggml-ci-arm64-cpu-low-perf:
# runs-on: ubuntu-22.04-arm
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ggml-ci-arm64-cpu-low-perf
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# - name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get install build-essential
#
# - name: Test
# id: ggml-ci
# run: |
# LLAMA_ARG_THREADS=$(nproc) GG_BUILD_LOW_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-x64-cpu-high-perf:
runs-on: ubuntu-22.04
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-x64-cpu-high-perf
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
# note: moved to build-self-hosted.yml - can remove from here when everything is stable
# ggml-ci-arm64-cpu-high-perf:
# runs-on: ubuntu-22.04-arm
#
# steps:
# - name: Clone
# id: checkout
# uses: actions/checkout@v6
#
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ggml-ci-arm64-cpu-high-perf
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
#
# - name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get install build-essential
#
# - name: Test
# id: ggml-ci
# run: |
# LLAMA_ARG_THREADS=$(nproc) GG_BUILD_HIGH_PERF=1 GG_BUILD_NO_SVE=1 GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-arm64-cpu-high-perf-sve:
runs-on: ubuntu-22.04-arm
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-high-perf-sve
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential
- name: Test
id: ggml-ci
run: |
LLAMA_ARG_THREADS=$(nproc) GG_BUILD_NO_BF16=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-arm64-cpu-kleidiai:
runs-on: ubuntu-22.04-arm
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-kleidiai
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install -y build-essential
- name: Test
id: ggml-ci
run: |
GG_BUILD_KLEIDIAI=1 GG_BUILD_EXTRA_TESTS_0=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
ggml-ci-arm64-cpu-kleidiai-graviton4:
runs-on: ah-ubuntu_22_04-c8g_8x
steps:
- name: Clone
id: checkout
uses: actions/checkout@v6
- name: Dependencies
id: depends
run: |
set -euxo pipefail
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive NEEDRESTART_MODE=a \
apt-get install -y \
build-essential \
python3-venv \
gpg \
wget \
time \
git-lfs
git lfs install
# install the latest cmake
sudo install -d /usr/share/keyrings
wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc \
| gpg --dearmor \
| sudo tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null
echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ jammy main' \
| sudo tee /etc/apt/sources.list.d/kitware.list
sudo apt-get update
sudo apt-get install -y cmake
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-kleidiai-graviton4
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Test
id: ggml-ci
run: |
GG_BUILD_KLEIDIAI=1 \
GG_BUILD_EXTRA_TESTS_0=1 \
bash ./ci/run.sh ./tmp/results ./tmp/mnt

View File

@@ -28,9 +28,9 @@ concurrency:
env:
GGML_NLOOP: 3
GGML_N_THREADS: 1
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
jobs:
ubuntu-22-hip-quality-check:
@@ -50,7 +50,7 @@ jobs:
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-hip-quality-check
key: hip-quality-check-ubuntu-22.04
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

File diff suppressed because it is too large Load Diff

View File

@@ -26,10 +26,10 @@ on:
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}
@@ -37,7 +37,7 @@ concurrency:
jobs:
server:
runs-on: ubuntu-latest
runs-on: [self-hosted, CPU, Linux, llama-server]
strategy:
matrix:
@@ -46,19 +46,19 @@ jobs:
fail-fast: false
steps:
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get -y install \
build-essential \
xxd \
git \
cmake \
curl \
wget \
language-pack-en \
libssl-dev
#- name: Dependencies
# id: depends
# run: |
# sudo apt-get update
# sudo apt-get -y install \
# build-essential \
# xxd \
# git \
# cmake \
# curl \
# wget \
# language-pack-en \
# libssl-dev
- name: Clone
id: checkout

View File

@@ -29,10 +29,10 @@ on:
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}

View File

@@ -44,37 +44,18 @@ on:
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
jobs:
ui-build:
name: Build Web UI
uses: ./.github/workflows/ui-build.yml
server:
runs-on: ubuntu-latest
needs: ui-build
name: server (${{ matrix.wf_name }})
strategy:
matrix:
build_type: [Release]
wf_name: ["default"]
include:
- build_type: Release
extra_args: ""
wf_name: "default"
- build_type: Release
extra_args: "LLAMA_ARG_BACKEND_SAMPLING=1"
wf_name: "backend-sampling"
fail-fast: false
ubuntu:
runs-on: ubuntu-24.04-arm
steps:
- name: Dependencies
@@ -98,19 +79,19 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Download built UI
uses: actions/download-artifact@v7
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
name: ui-build
path: tools/ui/dist
key: server-ubuntu-24.04-arm
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
run: |
cmake -B build \
-DLLAMA_BUILD_BORINGSSL=ON \
-DGGML_SCHED_NO_REALLOC=ON
cmake --build build --config ${{ matrix.build_type }} -j $(nproc) --target llama-server
cmake --build build --config Release -j $(nproc) --target llama-server
- name: Python setup
id: setup_python
@@ -121,22 +102,34 @@ jobs:
- name: Tests
id: server_integration_tests
if: ${{ (!matrix.disabled_on_pr || !github.event.pull_request) }}
run: |
cd tools/server/tests
export ${{ matrix.extra_args }}
pytest -v -x -m "not slow"
- name: Slow tests
id: server_integration_tests_slow
if: ${{ (github.event.schedule || github.event.inputs.slow_tests == 'true') && matrix.build_type == 'Release' }}
if: ${{ github.event.schedule || github.event.inputs.slow_tests == 'true' }}
run: |
cd tools/server/tests
export ${{ matrix.extra_args }}
SLOW_TESTS=1 pytest -v -x
server-windows:
runs-on: windows-2022
- name: Tests (Backend sampling)
id: server_integration_tests_backend_sampling
run: |
cd tools/server/tests
export LLAMA_ARG_BACKEND_SAMPLING=1
pytest -v -x -m "not slow"
- name: Slow tests (Backend sampling)
id: server_integration_tests_slow_backend_sampling
if: ${{ github.event.schedule || github.event.inputs.slow_tests == 'true' }}
run: |
cd tools/server/tests
export LLAMA_ARG_BACKEND_SAMPLING=1
SLOW_TESTS=1 pytest -v -x
windows:
runs-on: windows-2025
steps:
- name: Clone
@@ -146,16 +139,24 @@ jobs:
fetch-depth: 0
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }}
- name: Setup Node.js
uses: actions/setup-node@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.21
with:
node-version: "24"
key: server-windows-2025-x64
evict-old-files: 1d
save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}
- name: Build
id: cmake_build
shell: cmd
run: |
cmake -B build -DLLAMA_BUILD_BORINGSSL=ON -DGGML_SCHED_NO_REALLOC=ON
cmake --build build --config Release -j ${env:NUMBER_OF_PROCESSORS} --target llama-server
cmake -B build -G "Ninja Multi-Config" ^
-DCMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake ^
-DCMAKE_BUILD_TYPE=Release ^
-DLLAMA_BUILD_BORINGSSL=ON ^
-DGGML_SCHED_NO_REALLOC=ON
set /A NINJA_JOBS=%NUMBER_OF_PROCESSORS%-1
cmake --build build --config Release -j %NINJA_JOBS% --target llama-server
- name: Python setup
id: setup_python
@@ -166,7 +167,6 @@ jobs:
- name: Tests
id: server_integration_tests
if: ${{ !matrix.disabled_on_pr || !github.event.pull_request }}
run: |
cd tools/server/tests
$env:PYTHONIOENCODING = ":replace"
@@ -174,7 +174,7 @@ jobs:
- name: Slow tests
id: server_integration_tests_slow
if: ${{ (github.event.schedule || github.event.inputs.slow_tests == 'true') && matrix.build_type == 'Release' }}
if: ${{ github.event.schedule || github.event.inputs.slow_tests == 'true' }}
run: |
cd tools/server/tests
$env:SLOW_TESTS = "1"

View File

@@ -0,0 +1,43 @@
name: UI Build (self-hosted)
on:
workflow_call:
jobs:
build:
runs-on: [self-hosted, fast]
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
steps:
- name: Checkout code
uses: actions/checkout@v6
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: "24"
cache: "npm"
cache-dependency-path: "tools/ui/package-lock.json"
- name: Install dependencies
run: npm ci
working-directory: tools/ui
- name: Build application
run: npm run build
working-directory: tools/ui
- name: Generate checksums
run: |
cd tools/ui/dist
for f in *; do
sha256sum "$f" | awk '{print $1, $2}' >> checksums.txt
done
- name: Upload built UI
uses: actions/upload-artifact@v6
with:
name: ui-build
path: tools/ui/dist/
retention-days: 1

View File

@@ -5,7 +5,7 @@ on:
jobs:
build:
runs-on: [self-hosted, fast]
runs-on: ubuntu-slim
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}

View File

@@ -20,7 +20,7 @@ jobs:
publish:
name: Publish UI Static Output
needs: build
runs-on: ubuntu-24.04-arm
runs-on: ubuntu-slim
permissions:
contents: read

View File

@@ -16,7 +16,7 @@ on:
- master
paths: [
'.github/workflows/ui-self-hosted.yml',
'.github/workflows/ui-build.yml',
'.github/workflows/ui-build-self-hosted.yml',
'tools/ui/**.*',
'tools/server/tests/**.*'
]
@@ -24,16 +24,16 @@ on:
types: [opened, synchronize, reopened]
paths: [
'.github/workflows/ui-self-hosted.yml',
'.github/workflows/ui-build.yml',
'.github/workflows/ui-build-self-hosted.yml',
'tools/ui/**.*',
'tools/server/tests/**.*'
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}
@@ -42,7 +42,7 @@ concurrency:
jobs:
ui-build:
name: Build static output
uses: ./.github/workflows/ui-build.yml
uses: ./.github/workflows/ui-build-self-hosted.yml
ui-checks:
name: Checks

View File

@@ -26,10 +26,10 @@ on:
]
env:
LLAMA_LOG_COLORS: 1
LLAMA_LOG_PREFIX: 1
LLAMA_LOG_TIMESTAMPS: 1
LLAMA_LOG_VERBOSITY: 10
LLAMA_ARG_LOG_COLORS: 1
LLAMA_ARG_LOG_PREFIX: 1
LLAMA_ARG_LOG_TIMESTAMPS: 1
LLAMA_ARG_LOG_VERBOSITY: 10
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}-${{ github.head_ref || github.run_id }}

View File

@@ -222,19 +222,6 @@ if (LLAMA_BUILD_APP)
add_subdirectory(app)
endif()
# Automatically add all files from the 'licenses' directory
file(GLOB EXTRA_LICENSES "${CMAKE_SOURCE_DIR}/licenses/LICENSE-*")
foreach(FILE_PATH ${EXTRA_LICENSES})
get_filename_component(FILE_NAME "${FILE_PATH}" NAME)
string(REGEX REPLACE "^LICENSE-" "" NAME "${FILE_NAME}")
license_add_file("${NAME}" "${FILE_PATH}")
endforeach()
if (LLAMA_BUILD_COMMON)
license_generate(llama-common)
endif()
#
# install
#

View File

@@ -63,6 +63,7 @@ After submitting your PR:
- Optionally pick a `<module>` from here: https://github.com/ggml-org/llama.cpp/wiki/Modules
- Let other maintainers merge their own PRs
- When merging a PR, make sure you have a good understanding of the changes
- If a PR does not warrant a new release, add `[no release]` in the squashed commit to spare CI resources
- Be mindful of maintenance: most of the work going into a feature happens after the PR is merged. If the PR author is not committed to contribute long-term, someone else needs to take responsibility (you)
Maintainers reserve the right to decline review or close pull requests for any reason, without any questions, particularly under any of the following conditions:

View File

@@ -12,16 +12,16 @@
## Reporting a vulnerability
> [!IMPORTANT]
> The private security disclosure program is disabled until further notice. Please submit patches with fixes directly to the repo as public PRs. Emails will be ignored.
If you have discovered a security vulnerability in this project that falls inside the [covered topics](#covered-topics), please report it privately. **Do not disclose it as a public issue.** This gives us time to work with you to fix the issue before public exposure, reducing the chance that the exploit will be used before a patch is released.
Please disclose it as a private [security advisory](https://github.com/ggml-org/llama.cpp/security/advisories/new).
A team of volunteers on a reasonable-effort basis maintains this project. As such, please give us at least 90 days to work on a fix before public exposure.
> [!IMPORTANT]
> For collaborators: if you are interested in helping out with reviewing private security disclosures, please see: https://github.com/ggml-org/llama.cpp/discussions/18080
## Requirements
### Requirements
Before submitting your report, ensure you meet the following requirements:
@@ -31,7 +31,7 @@ Before submitting your report, ensure you meet the following requirements:
Maintainers reserve the right to close the report if these requirements are not fulfilled.
## Covered Topics
### Covered Topics
Only vulnerabilities that fall within these parts of the project are considered valid. For problems falling outside of this list, please report them as issues.

View File

@@ -15,6 +15,17 @@ target_link_libraries(${TARGET} PRIVATE
)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
# Automatically add all files from the 'licenses' directory
file(GLOB EXTRA_LICENSES "${CMAKE_SOURCE_DIR}/licenses/LICENSE-*")
foreach(FILE_PATH ${EXTRA_LICENSES})
get_filename_component(FILE_NAME "${FILE_PATH}" NAME)
string(REGEX REPLACE "^LICENSE-" "" NAME "${FILE_NAME}")
license_add_file("${NAME}" "${FILE_PATH}")
endforeach()
license_generate(${TARGET})
if(LLAMA_TOOLS_INSTALL)
install(TARGETS ${TARGET} RUNTIME)
endif()

View File

@@ -5,6 +5,9 @@
#include <string>
#include <vector>
// embedded data generated by cmake
extern const char * LICENSES[];
// visible
int llama_server(int argc, char ** argv);
int llama_cli(int argc, char ** argv);
@@ -17,8 +20,23 @@ int llama_fit_params(int argc, char ** argv);
int llama_quantize(int argc, char ** argv);
int llama_perplexity(int argc, char ** argv);
// hands the update over to the install script, which downloads and swaps the binary
static int llama_update(int argc, char ** argv) {
(void) argc;
(void) argv;
#if defined(_WIN32)
return system("powershell -NoProfile -ExecutionPolicy Bypass -Command \"irm https://llama.app/install.ps1 | iex\"");
#else
return system("curl -fsSL https://llama.app/install.sh | sh");
#endif
}
static const char * progname;
static int help(int argc, char ** argv);
static int version(int argc, char ** argv);
static int licenses(int argc, char ** argv);
struct command {
const char * name;
@@ -31,14 +49,16 @@ struct command {
static const command cmds[] = {
{"serve", "HTTP API server", {"server"}, false, llama_server },
{"cli", "Command-line interactive interface", {"client"}, false, llama_cli },
{"update", "Update llama to the latest release", {}, false, llama_update },
{"completion", "Text completion", {"complete"}, true, llama_completion },
{"bench", "Benchmark prompt processing and text generation", {}, true, llama_bench },
{"batched-bench", "Benchmark batched decoding performance", {}, true, llama_batched_bench},
{"fit-params", "Compute parameters to fit a model in device memory", {}, true, llama_fit_params },
{"quantize", "Quantize a model", {}, true, llama_quantize },
{"perplexity", "Compute model perplexity and KL divergence", {}, true, llama_perplexity },
{"version", "Show version", {}, true, version },
{"help", "Show available commands", {}, true, help },
{"version", "Show version", {}, false, version },
{"licenses", "Show third-party licenses", {"credits"}, false, licenses },
{"help", "Show available commands", {}, false, help },
};
static int version(int argc, char ** argv) {
@@ -46,17 +66,29 @@ static int version(int argc, char ** argv) {
return 0;
}
static int licenses(int argc, char ** argv) {
for (int i = 0; LICENSES[i]; ++i) {
printf("%s\n", LICENSES[i]);
}
return 0;
}
static int help(int argc, char ** argv) {
const bool show_all = argc >= 2 && std::string(argv[1]) == "all";
printf("Usage: llama <command> [options]\n\nAvailable commands:\n");
printf("Usage: %s <command> [options]\n\nAvailable commands:\n", progname);
for (const auto & cmd : cmds) {
if (show_all || !cmd.hidden) {
printf(" %-15s %s\n", cmd.name, cmd.desc);
}
}
printf("\nRun 'llama <command> --help' for command-specific usage.\n");
printf("\n");
if (!show_all) {
printf("Run '%s help all' to show additional commands.\n", progname);
}
printf("Run '%s <command> --help' for command-specific usage.\n", progname);
return 0;
}
@@ -74,13 +106,13 @@ static bool matches(const std::string & arg, const command & cmd) {
}
int main(int argc, char ** argv) {
progname = argv[0];
const std::string arg = argc >= 2 ? argv[1] : "help";
for (const auto & cmd : cmds) {
if (matches(arg, cmd)) {
// router spawns children through this same binary, it needs the
// subcommand to relaunch as 'llama serve' and not bare options
// keep cmd.name so the router's child processes re-invoke correctly
#ifdef _WIN32
_putenv_s("LLAMA_APP_CMD", cmd.name);
#else

View File

@@ -8,6 +8,7 @@ TVOS_MIN_OS_VERSION=16.4
BUILD_SHARED_LIBS=OFF
LLAMA_BUILD_APP=OFF
LLAMA_BUILD_COMMON=OFF
LLAMA_BUILD_EXAMPLES=OFF
LLAMA_BUILD_TOOLS=OFF
LLAMA_BUILD_TESTS=OFF
@@ -33,6 +34,7 @@ COMMON_CMAKE_ARGS=(
-DCMAKE_XCODE_ATTRIBUTE_DEVELOPMENT_TEAM=ggml
-DBUILD_SHARED_LIBS=${BUILD_SHARED_LIBS}
-DLLAMA_BUILD_APP=${LLAMA_BUILD_APP}
-DLLAMA_BUILD_COMMON=${LLAMA_BUILD_COMMON}
-DLLAMA_BUILD_EXAMPLES=${LLAMA_BUILD_EXAMPLES}
-DLLAMA_BUILD_TOOLS=${LLAMA_BUILD_TOOLS}
-DLLAMA_BUILD_TESTS=${LLAMA_BUILD_TESTS}
@@ -416,7 +418,7 @@ cmake -B build-ios-sim -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-ios-sim --config Release -- -quiet
cmake --build build-ios-sim --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for iOS devices..."
cmake -B build-ios-device -G Xcode \
@@ -430,7 +432,7 @@ cmake -B build-ios-device -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-ios-device --config Release -- -quiet
cmake --build build-ios-device --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for macOS..."
cmake -B build-macos -G Xcode \
@@ -441,7 +443,7 @@ cmake -B build-macos -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-macos --config Release -- -quiet
cmake --build build-macos --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for visionOS..."
cmake -B build-visionos -G Xcode \
@@ -456,7 +458,7 @@ cmake -B build-visionos -G Xcode \
-DLLAMA_OPENSSL=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-S .
cmake --build build-visionos --config Release -- -quiet
cmake --build build-visionos --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for visionOS simulator..."
cmake -B build-visionos-sim -G Xcode \
@@ -471,7 +473,7 @@ cmake -B build-visionos-sim -G Xcode \
-DLLAMA_OPENSSL=OFF \
-DLLAMA_BUILD_SERVER=OFF \
-S .
cmake --build build-visionos-sim --config Release -- -quiet
cmake --build build-visionos-sim --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
# Add tvOS builds (might need the same u_int definitions as watchOS and visionOS)
echo "Building for tvOS simulator..."
@@ -487,7 +489,7 @@ cmake -B build-tvos-sim -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-tvos-sim --config Release -- -quiet
cmake --build build-tvos-sim --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
echo "Building for tvOS devices..."
cmake -B build-tvos-device -G Xcode \
@@ -502,7 +504,7 @@ cmake -B build-tvos-device -G Xcode \
-DCMAKE_CXX_FLAGS="${COMMON_CXX_FLAGS}" \
-DLLAMA_OPENSSL=OFF \
-S .
cmake --build build-tvos-device --config Release -- -quiet
cmake --build build-tvos-device --config Release -j $(sysctl -n hw.logicalcpu) -- -quiet
# Setup frameworks and copy binaries and headers
echo "Setting up framework structures..."

View File

@@ -66,6 +66,8 @@ fi
if [ ! -z ${GG_BUILD_METAL} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=ON"
else
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=OFF"
fi
if [ ! -z ${GG_BUILD_CUDA} ]; then
@@ -114,10 +116,7 @@ fi
if [ ! -z ${GG_BUILD_VULKAN} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_VULKAN=1"
# if on Mac, disable METAL
if [[ "$OSTYPE" == "darwin"* ]]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=OFF -DGGML_BLAS=OFF"
MACOS_RUNNER_CUSTOM_VULKAN_CMAKE_LOCATION="/usr/local/lib/cmake/vulkan"
MACOS_RUNNER_CUSTOM_SPIRV_HEADERS_LOCATION="${MACOS_RUNNER_CUSTOM_VULKAN_CMAKE_LOCATION}/SPIRV-Headers/SPIRV-HeadersConfig.cmake"
if [[ -f "${MACOS_RUNNER_CUSTOM_SPIRV_HEADERS_LOCATION}" || -h "${MACOS_RUNNER_CUSTOM_SPIRV_HEADERS_LOCATION}" ]]; then
@@ -133,7 +132,7 @@ if [ ! -z ${GG_BUILD_VULKAN} ]; then
fi
if [ ! -z ${GG_BUILD_WEBGPU} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_WEBGPU=1 -DGGML_METAL=OFF -DGGML_BLAS=OFF"
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_WEBGPU=1"
if [ ! -z "${GG_BUILD_WEBGPU_DAWN_PREFIX}" ]; then
if [ -z "${CMAKE_PREFIX_PATH}" ]; then
@@ -167,6 +166,8 @@ fi
if [ ! -z ${GG_BUILD_BLAS} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=${GG_BUILD_BLAS_VENDOR:-OpenBLAS}"
else
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_BLAS=OFF"
fi
if [ ! -z ${GG_BUILD_OPENVINO} ]; then
@@ -700,8 +701,8 @@ function gg_sum_test_backend_ops_cpu {
## main
export LLAMA_LOG_PREFIX=1
export LLAMA_LOG_TIMESTAMPS=1
export LLAMA_ARG_LOG_PREFIX=1
export LLAMA_ARG_LOG_TIMESTAMPS=1
if [ -z ${GG_BUILD_LOW_PERF} ]; then
# Create symlink: ./llama.cpp/models-mnt -> $MNT/models

View File

@@ -50,8 +50,6 @@
#define LLAMA_MAX_URL_LENGTH 2084 // Maximum URL Length in Chrome: 2083
extern const char * LICENSES[];
using json = nlohmann::ordered_json;
using namespace common_arg_utils;
@@ -342,9 +340,7 @@ struct handle_model_result {
};
static handle_model_result common_params_handle_model(struct common_params_model & model,
const std::string & bearer_token,
bool offline,
bool search_mtp = false) {
const common_download_opts & opts) {
handle_model_result result;
if (!model.docker_repo.empty()) {
@@ -356,10 +352,9 @@ static handle_model_result common_params_handle_model(struct common_params_model
model.hf_file = model.path;
model.path = "";
}
common_download_opts opts;
opts.bearer_token = bearer_token;
opts.offline = offline;
auto download_result = common_download_model(model, opts, true, search_mtp);
common_download_opts hf_opts = opts;
hf_opts.download_mmproj = true; // also look for mmproj when downloading hf model
auto download_result = common_download_model(model, hf_opts);
if (download_result.model_path.empty()) {
throw std::runtime_error("failed to download model from Hugging Face");
@@ -384,9 +379,6 @@ static handle_model_result common_params_handle_model(struct common_params_model
model.path = fs_get_cache_file(string_split<std::string>(f, '/').back());
}
common_download_opts opts;
opts.bearer_token = bearer_token;
opts.offline = offline;
auto download_result = common_download_model(model, opts);
if (download_result.model_path.empty()) {
throw std::runtime_error("failed to download model from " + model.url);
@@ -443,35 +435,49 @@ static bool parse_bool_value(const std::string & value) {
// CLI argument parsing functions
//
void common_params_handle_models(common_params & params, llama_example curr_ex) {
bool common_params_handle_models(common_params & params, llama_example curr_ex) {
const bool spec_type_draft_mtp = std::find(params.speculative.types.begin(),
params.speculative.types.end(),
COMMON_SPECULATIVE_TYPE_DRAFT_MTP) != params.speculative.types.end();
auto res = common_params_handle_model(params.model, params.hf_token, params.offline, spec_type_draft_mtp);
if (params.no_mmproj) {
params.mmproj = {};
} else if (res.found_mmproj && params.mmproj.path.empty() && params.mmproj.url.empty()) {
// optionally, handle mmproj model when -hf is specified
params.mmproj = res.mmproj;
}
// only download mmproj if the current example is using it
for (const auto & ex : mmproj_examples) {
if (curr_ex == ex) {
common_params_handle_model(params.mmproj, params.hf_token, params.offline);
break;
common_download_opts opts;
opts.bearer_token = params.hf_token;
opts.offline = params.offline;
opts.skip_download = params.skip_download;
opts.download_mtp = spec_type_draft_mtp;
try {
auto res = common_params_handle_model(params.model, opts);
if (params.no_mmproj) {
params.mmproj = {};
} else if (res.found_mmproj && params.mmproj.path.empty() && params.mmproj.url.empty()) {
// optionally, handle mmproj model when -hf is specified
params.mmproj = res.mmproj;
}
// only download mmproj if the current example is using it
for (const auto & ex : mmproj_examples) {
if (curr_ex == ex) {
common_params_handle_model(params.mmproj, opts);
break;
}
}
// when --spec-type mtp is set and no draft model was provided explicitly,
// fall back to the MTP head discovered alongside the -hf model
if (spec_type_draft_mtp && res.found_mtp &&
params.speculative.draft.mparams.path.empty() &&
params.speculative.draft.mparams.hf_repo.empty() &&
params.speculative.draft.mparams.url.empty()) {
params.speculative.draft.mparams.path = res.mtp.path;
}
common_params_handle_model(params.speculative.draft.mparams, opts);
common_params_handle_model(params.vocoder.model, opts);
return true;
} catch (const common_skip_download_exception &) {
return false;
} catch (const std::exception &) {
throw;
}
// when --spec-type mtp is set and no draft model was provided explicitly,
// fall back to the MTP head discovered alongside the -hf model
if (spec_type_draft_mtp && res.found_mtp &&
params.speculative.draft.mparams.path.empty() &&
params.speculative.draft.mparams.hf_repo.empty() &&
params.speculative.draft.mparams.url.empty()) {
params.speculative.draft.mparams.path = res.mtp.path;
}
common_params_handle_model(params.speculative.draft.mparams, params.hf_token, params.offline);
common_params_handle_model(params.vocoder.model, params.hf_token, params.offline);
}
static bool common_params_parse_ex(int argc, char ** argv, common_params_context & ctx_arg) {
@@ -1035,11 +1041,9 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
// we define here to make sure it's included in llama-gen-docs
if (ex == LLAMA_EXAMPLE_COMPLETION) {
params.use_jinja = false; // disable jinja by default
} else if (ex == LLAMA_EXAMPLE_MTMD) {
params.use_jinja = false; // disable jinja by default
params.sampling.temp = 0.2; // lower temp by default for better quality
} else if (ex == LLAMA_EXAMPLE_SERVER) {
params.n_parallel = -1; // auto by default
}
@@ -1060,7 +1064,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
sampler_type_names.pop_back(); // remove last semicolon
}
/**
* filter options by example
* rules:
@@ -1074,7 +1077,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
};
add_opt(common_arg(
{"-h", "--help", "--usage"},
"print usage and exit",
@@ -1091,16 +1093,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
exit(0);
}
));
add_opt(common_arg(
{"--license"},
"show source code license and dependencies",
[](common_params &) {
for (int i = 0; LICENSES[i]; ++i) {
printf("%s\n", LICENSES[i]);
}
exit(0);
}
));
add_opt(common_arg(
{"-cl", "--cache-list"},
"show list of models in cache",
@@ -2998,7 +2990,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
key_file.close();
}
).set_examples({LLAMA_EXAMPLE_SERVER}));
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_API_KEY_FILE"));
add_opt(common_arg(
{"--ssl-key-file"}, "FNAME",
"path to file a PEM-encoded SSL private key",
@@ -3026,7 +3018,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.default_template_kwargs[item.key()] = item.value().dump();
}
}
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_CHAT_TEMPLATE_KWARGS"));
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_CLI}).set_env("LLAMA_ARG_CHAT_TEMPLATE_KWARGS"));
add_opt(common_arg(
{"-to", "--timeout"}, "N",
string_format("server read/write timeout in seconds (default: %d)", params.timeout_read),
@@ -3327,7 +3319,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params &, const std::string & value) {
common_log_set_file(common_log_main(), value.c_str());
}
).set_env("LLAMA_LOG_FILE"));
).set_env("LLAMA_ARG_LOG_FILE"));
add_opt(common_arg(
{"--log-colors"}, "[on|off|auto]",
"Set colored logging ('on', 'off', or 'auto', default: 'auto')\n"
@@ -3344,7 +3336,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
string_format("error: unknown value for --log-colors: '%s'\n", value.c_str()));
}
}
).set_env("LLAMA_LOG_COLORS"));
).set_env("LLAMA_ARG_LOG_COLORS"));
add_opt(common_arg(
{"-v", "--verbose", "--log-verbose"},
"Set verbosity level to infinity (i.e. log all messages, useful for debugging)",
@@ -3359,7 +3351,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params) {
params.offline = true;
}
).set_env("LLAMA_OFFLINE"));
).set_env("LLAMA_ARG_OFFLINE"));
add_opt(common_arg(
{"-lv", "--verbosity", "--log-verbosity"}, "N",
string_format("Set the verbosity threshold. Messages with a higher verbosity will be ignored. Values:\n"
@@ -3374,7 +3366,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.verbosity = value;
common_log_set_verbosity_thold(value);
}
).set_env("LLAMA_LOG_VERBOSITY"));
).set_env("LLAMA_ARG_LOG_VERBOSITY"));
add_opt(common_arg(
{"--log-prefix"},
{"--no-log-prefix"},

View File

@@ -129,8 +129,11 @@ bool common_params_to_map(int argc, char ** argv, llama_example ex, std::map<com
// see: https://github.com/ggml-org/llama.cpp/issues/18163
void common_params_add_preset_options(std::vector<common_arg> & args);
// Populate model paths (main model, mmproj, etc) from -hf if necessary
void common_params_handle_models(common_params & params, llama_example curr_ex);
// populate model paths (main model, mmproj, etc) from -hf if necessary
// return true if the model is ready to use
// throw an exception if there is an error that prevents the model from being used (e.g. network error, model not found, etc)
// if params.skip_download is true, no downloads will be attempted. return false if the model is invalid or missing (e.g. ETag check failed)
bool common_params_handle_models(common_params & params, llama_example curr_ex);
// initialize argument parser context - used by test-arg-parser and preset
common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);

View File

@@ -1563,6 +1563,7 @@ struct llama_context_params common_context_params_to_llama(const common_params &
cparams.n_ctx = params.n_ctx;
cparams.n_seq_max = params.n_parallel;
cparams.n_rs_seq = params.speculative.need_n_rs_seq();
cparams.n_outputs_max = std::max(params.n_outputs_max, 0);
cparams.n_batch = params.n_batch;
cparams.n_ubatch = params.n_ubatch;
cparams.n_threads = params.cpuparams.n_threads;

View File

@@ -277,6 +277,7 @@ struct common_params_sampling {
std::vector<llama_token> reasoning_budget_end; // end tag token sequence
std::vector<llama_token> reasoning_budget_forced; // forced sequence (message + end tag)
std::string reasoning_budget_message; // message injected before end tag when budget exhausted
bool reasoning_control = false; // create the budget sampler on demand so reasoning can be ended at runtime
bool backend_sampling = false;
@@ -431,6 +432,7 @@ struct common_params {
int32_t n_chunks = -1; // max number of chunks to process (-1 = unlimited)
int32_t n_parallel = 1; // number of parallel sequences to decode
int32_t n_sequences = 1; // number of sequences to decode
int32_t n_outputs_max = 0; // max outputs in a batch (0 = n_batch)
int32_t grp_attn_n = 1; // group-attention factor
int32_t grp_attn_w = 512; // group-attention width
int32_t n_print = -1; // print token count every n tokens (-1 = disabled)
@@ -479,7 +481,7 @@ struct common_params {
std::set<std::string> model_alias; // model aliases // NOLINT
std::set<std::string> model_tags; // model tags (informational, not used for routing) // NOLINT
std::string hf_token = ""; // HF token // NOLINT
std::string hf_token = ""; // HF token (aka bearer token) // NOLINT
std::string prompt = ""; // NOLINT
std::string system_prompt = ""; // NOLINT
std::string prompt_file = ""; // store the external prompt file name // NOLINT
@@ -507,6 +509,7 @@ struct common_params {
int32_t control_vector_layer_start = -1; // layer range for control vector
int32_t control_vector_layer_end = -1; // layer range for control vector
bool offline = false;
bool skip_download = false; // skip model file downloading
int32_t ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
int32_t ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line
@@ -587,7 +590,7 @@ struct common_params {
// server params
int32_t port = 8080; // server listens on this network port
bool reuse_port = false; // allow multiple sockets to bind to the same port
int32_t timeout_read = 600; // http read timeout in seconds
int32_t timeout_read = 3600; // http read timeout in seconds
int32_t timeout_write = timeout_read; // http write timeout in seconds
int32_t n_threads_http = -1; // number of threads to process HTTP requests (TODO: support threadpool)
int32_t n_cache_reuse = 0; // min chunk size to reuse from the cache via KV shifting

View File

@@ -292,6 +292,10 @@ static int common_download_file_single_online(const std::string & url,
const bool file_exists = std::filesystem::exists(path);
if (!file_exists && opts.skip_download) {
return -2; // file is missing and download is disabled
}
if (file_exists && skip_etag) {
LOG_DBG("%s: using cached file: %s\n", __func__, path.c_str());
return 304; // 304 Not Modified - fake cached response
@@ -357,6 +361,10 @@ static int common_download_file_single_online(const std::string & url,
LOG_DBG("%s: using cached file (same etag): %s\n", __func__, path.c_str());
return 304; // 304 Not Modified - fake cached response
}
// pass this point, the file exists but is different from the server version, so we need to redownload it
if (opts.skip_download) {
return -2; // special code to indicate that the download was skipped due to etag mismatch
}
if (remove(path.c_str()) != 0) {
LOG_ERR("%s: unable to delete file: %s\n", __func__, path.c_str());
return -1;
@@ -775,13 +783,13 @@ static std::vector<download_task> get_url_tasks(const common_params_model & mode
}
common_download_model_result common_download_model(const common_params_model & model,
const common_download_opts & opts,
bool download_mmproj,
bool download_mtp) {
const common_download_opts & opts) {
common_download_model_result result;
std::vector<download_task> tasks;
hf_plan hf;
bool download_mmproj = opts.download_mmproj;
bool download_mtp = opts.download_mtp;
bool is_hf = !model.hf_repo.empty();
if (is_hf) {
@@ -806,18 +814,22 @@ common_download_model_result common_download_model(const common_params_model &
return result;
}
std::vector<std::future<bool>> futures;
std::vector<std::future<int>> futures;
for (const auto & task : tasks) {
futures.push_back(std::async(std::launch::async,
[&task, &opts, is_hf]() {
int status = common_download_file_single(task.url, task.path, opts, is_hf);
return is_http_status_ok(status);
return common_download_file_single(task.url, task.path, opts, is_hf);
}
));
}
for (auto & f : futures) {
if (!f.get()) {
int status = f.get();
if (status == -2 && opts.skip_download) {
throw common_skip_download_exception();
}
bool is_ok = is_http_status_ok(status);
if (!is_ok) {
return {};
}
}

View File

@@ -52,6 +52,9 @@ struct common_download_opts {
std::string bearer_token;
common_header_list headers;
bool offline = false;
bool skip_download = false; // if true, only validation is performed, common_skip_download_exception may be thrown if the file is missing or invalid
bool download_mmproj = false;
bool download_mtp = false;
common_download_callback * callback = nullptr;
};
@@ -62,6 +65,11 @@ struct common_download_model_result {
std::string mtp_path;
};
// throw if the file is missing or invalid (e.g. ETag check failed)
struct common_skip_download_exception : public std::runtime_error {
common_skip_download_exception() : std::runtime_error("skip download") {}
};
// Download model from HuggingFace repo or URL
//
// input (via model struct):
@@ -89,9 +97,7 @@ struct common_download_model_result {
// returns result with model_path, mmproj_path and mtp_path (empty when not found / on failure)
common_download_model_result common_download_model(
const common_params_model & model,
const common_download_opts & opts = {},
bool download_mmproj = false,
bool download_mtp = false
const common_download_opts & opts = {}
);
// returns list of cached models
@@ -99,6 +105,7 @@ std::vector<common_cached_model_info> common_list_cached_models();
// download single file from url to local path
// returns status code or -1 on error
// returns -2 if the download was skipped due to ETag mismatch (file outdated, skip_download=true)
// skip_etag: if true, don't read/write .etag files (for HF cache where filename is the hash)
int common_download_file_single(const std::string & url,
const std::string & path,

View File

@@ -1,5 +1,7 @@
#include "ngram-mod.h"
#include <algorithm>
//
// common_ngram_mod
//

View File

@@ -247,3 +247,24 @@ common_reasoning_budget_state common_reasoning_budget_get_state(const struct lla
}
return ((const common_reasoning_budget_ctx *)smpl->ctx)->state;
}
bool common_reasoning_budget_force(struct llama_sampler * smpl) {
if (!smpl) {
return false;
}
auto * ctx = (common_reasoning_budget_ctx *) smpl->ctx;
// only a sampler that is actively counting down the budget may be forced;
// any other state (idle, already forcing/waiting, or done) is left untouched
if (ctx->state != REASONING_BUDGET_COUNTING) {
return false;
}
ctx->state = REASONING_BUDGET_FORCING;
ctx->force_pos = 0;
ctx->end_matcher.reset();
LOG_INF("reasoning-budget: forced into forcing state (manual transition)\n");
return true;
}

View File

@@ -40,3 +40,7 @@ struct llama_sampler * common_reasoning_budget_init(
common_reasoning_budget_state initial_state = REASONING_BUDGET_IDLE);
common_reasoning_budget_state common_reasoning_budget_get_state(const struct llama_sampler * smpl);
// Manually transition the reasoning budget sampler into the FORCING state.
// Returns true if the transition occurred.
bool common_reasoning_budget_force(struct llama_sampler * smpl);

View File

@@ -293,7 +293,7 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
}
// reasoning budget sampler (skip when budget is unlimited unless a lazy grammar is active, which needs rbudget for thinking-block suppression)
if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty() && (params.grammar_lazy || params.reasoning_budget_tokens >= 0)) {
if (!params.reasoning_budget_start.empty() && !params.reasoning_budget_end.empty() && (params.grammar_lazy || params.reasoning_budget_tokens >= 0 || params.reasoning_control)) {
rbudget = common_reasoning_budget_init(
vocab,
params.reasoning_budget_start,
@@ -661,6 +661,14 @@ uint32_t common_sampler_get_seed(const struct common_sampler * gsmpl) {
return llama_sampler_get_seed(gsmpl->chain);
}
bool common_sampler_reasoning_budget_force(struct common_sampler * gsmpl) {
if (!gsmpl) {
return false;
}
return common_reasoning_budget_force(gsmpl->rbudget);
}
// helpers
llama_token_data_array * common_sampler_get_candidates(struct common_sampler * gsmpl, bool do_sort) {

View File

@@ -87,6 +87,9 @@ std::vector<llama_token> common_sampler_sample_and_accept_n(struct common_sample
uint32_t common_sampler_get_seed(const struct common_sampler * gsmpl);
// force the reasoning budget sampler (if any) to begin forcing its end sequence now.
bool common_sampler_reasoning_budget_force(struct common_sampler * gsmpl);
// helpers
// access the internal list of current candidate tokens

View File

@@ -1317,6 +1317,40 @@ static uint32_t common_get_enabled_speculative_configs(const std::vector<common_
return result;
}
int32_t common_speculative_n_max(const common_params_speculative * spec) {
int32_t n_max = 0;
for (const auto type : spec->types) {
switch (type) {
case COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE:
case COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3:
case COMMON_SPECULATIVE_TYPE_DRAFT_MTP:
n_max = std::max(n_max, std::max(0, spec->draft.n_max));
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_SIMPLE:
n_max = std::max(n_max, (int32_t) spec->ngram_simple.size_m);
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K:
n_max = std::max(n_max, (int32_t) spec->ngram_map_k.size_m);
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V:
n_max = std::max(n_max, (int32_t) spec->ngram_map_k4v.size_m);
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_MOD:
n_max = std::max(n_max, std::max(0, spec->ngram_mod.n_max));
break;
case COMMON_SPECULATIVE_TYPE_NGRAM_CACHE:
n_max = std::max(n_max, (int32_t) 8);
break;
case COMMON_SPECULATIVE_TYPE_NONE:
case COMMON_SPECULATIVE_TYPE_COUNT:
break;
}
}
return n_max;
}
// initialization of the speculative decoding system
//
common_speculative * common_speculative_init(common_params_speculative & params, uint32_t n_seq) {
@@ -1325,8 +1359,6 @@ common_speculative * common_speculative_init(common_params_speculative & params,
{
uint32_t enabled_configs = common_get_enabled_speculative_configs(params.types);
bool has_draft_model_path = !params.draft.mparams.path.empty();
bool has_draft_simple = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE));
bool has_draft_eagle3 = false; // TODO PR-18039: if params.speculative.eagle3
bool has_mtp = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_MTP)) && params.draft.ctx_dft != nullptr;
@@ -1359,16 +1391,6 @@ common_speculative * common_speculative_init(common_params_speculative & params,
if (has_ngram_cache) {
configs.push_back(common_speculative_config(COMMON_SPECULATIVE_TYPE_NGRAM_CACHE, params));
}
if (has_draft_simple) {
if (!has_draft_model_path) {
LOG_WRN("%s: draft model is not specified - cannot use 'draft' type\n", __func__);
has_draft_simple = false;
}
} else if (has_draft_model_path && !has_mtp && !has_draft_eagle3) {
LOG_WRN("%s: draft model is specified but 'draft' speculative type is not explicitly enabled - enabling it\n", __func__);
has_draft_simple = true;
}
if (has_draft_simple) {
configs.push_back(common_speculative_config(COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE, params));
}

View File

@@ -20,6 +20,9 @@ enum common_speculative_type common_speculative_type_from_name(const std::string
// convert type to string
std::string common_speculative_type_to_str(enum common_speculative_type type);
// return the max number of draft tokens based on the speculative parameters
int32_t common_speculative_n_max(const common_params_speculative * spec);
common_speculative * common_speculative_init(common_params_speculative & params, uint32_t n_seq);
void common_speculative_free(common_speculative * spec);

View File

@@ -47,6 +47,7 @@ TEXT_MODEL_MAP: dict[str, str] = {
"DeepseekForCausalLM": "deepseek",
"DeepseekV2ForCausalLM": "deepseek",
"DeepseekV3ForCausalLM": "deepseek",
"DeepseekV32ForCausalLM": "deepseek",
"DistilBertForMaskedLM": "bert",
"DistilBertForSequenceClassification": "bert",
"DistilBertModel": "bert",
@@ -57,6 +58,7 @@ TEXT_MODEL_MAP: dict[str, str] = {
"Ernie4_5_ForCausalLM": "ernie",
"Ernie4_5_MoeForCausalLM": "ernie",
"EuroBertModel": "bert",
"Exaone4_5_ForConditionalGeneration": "exaone",
"Exaone4ForCausalLM": "exaone",
"ExaoneForCausalLM": "exaone",
"ExaoneMoEForCausalLM": "exaone",
@@ -74,6 +76,7 @@ TEXT_MODEL_MAP: dict[str, str] = {
"Gemma3nForCausalLM": "gemma",
"Gemma3nForConditionalGeneration": "gemma",
"Gemma4ForConditionalGeneration": "gemma",
"Gemma4ForCausalLM": "gemma",
"GemmaForCausalLM": "gemma",
"Glm4ForCausalLM": "glm",
"Glm4MoeForCausalLM": "glm",
@@ -215,6 +218,7 @@ TEXT_MODEL_MAP: dict[str, str] = {
"T5EncoderModel": "t5",
"T5ForConditionalGeneration": "t5",
"T5WithLMHeadModel": "t5",
"TalkieForCausalLM": "talkie",
"UMT5ForConditionalGeneration": "t5",
"UMT5Model": "t5",
"UltravoxModel": "ultravox",
@@ -234,8 +238,10 @@ TEXT_MODEL_MAP: dict[str, str] = {
MMPROJ_MODEL_MAP: dict[str, str] = {
"AudioFlamingo3ForConditionalGeneration": "ultravox",
"CogVLMForCausalLM": "cogvlm",
"DeepseekOCR2ForCausalLM": "deepseek",
"DeepseekOCRForCausalLM": "deepseek",
"DotsOCRForCausalLM": "dotsocr",
"Exaone4_5_ForConditionalGeneration": "exaone",
"Gemma3ForConditionalGeneration": "gemma",
"Gemma3nForConditionalGeneration": "gemma",
"Gemma4ForConditionalGeneration": "gemma",

View File

@@ -119,7 +119,8 @@ class ModelBase:
small_first_shard: bool = False, hparams: dict[str, Any] | None = None, remote_hf_model_id: str | None = None,
disable_mistral_community_chat_template: bool = False,
sentence_transformers_dense_modules: bool = False,
fuse_gate_up_exps: bool = False):
fuse_gate_up_exps: bool = False,
fp8_as_q8: bool = False):
if type(self) is ModelBase or \
type(self) is TextModel or \
type(self) is MmprojModel:
@@ -148,6 +149,8 @@ class ModelBase:
self.dir_model_card = dir_model # overridden in convert_lora_to_gguf.py
self._is_nvfp4 = False
self._is_mxfp4 = False
self._fp8_as_q8 = fp8_as_q8
self._fp8_dequantized: set[str] = set()
# Apply heuristics to figure out typical tensor encoding based on first tensor's dtype
# NOTE: can't use field "torch_dtype" in config.json, because some finetunes lie.
@@ -429,6 +432,8 @@ class ModelBase:
s = self.model_tensors[name]
self.model_tensors[weight_name] = lambda w=w, s=s, bs=block_size: dequant_simple(w(), s(), bs)
tensors_to_remove.append(name)
if self._fp8_as_q8:
self._fp8_dequantized.add(weight_name)
if name.endswith(".activation_scale"): # unused
tensors_to_remove.append(name)
if name.endswith("_activation_scale"): # Mistral-Small-4-119B-2602, unused
@@ -440,6 +445,8 @@ class ModelBase:
s = self.model_tensors[name]
self.model_tensors[weight_name] = lambda w=w, s=s, bs=block_size: dequant_simple(w(), s(), bs)
tensors_to_remove.append(name)
if self._fp8_as_q8:
self._fp8_dequantized.add(weight_name)
if name.endswith(".qscale_act"):
tensors_to_remove.append(name)
elif quant_method == "gptq":
@@ -483,6 +490,11 @@ class ModelBase:
strategy = weight_config.get("strategy")
assert strategy == "channel" or strategy == "block"
assert weight_config.get("group_size") is None # didn't find a model using this yet
is_fp8 = (
quant_format == "float-quantized"
and weight_config.get("type") == "float"
and weight_config.get("num_bits") == 8
)
for name in self.model_tensors.keys():
if name.endswith(".weight_scale"):
weight_name = name.removesuffix("_scale")
@@ -490,6 +502,8 @@ class ModelBase:
s = self.model_tensors[name]
self.model_tensors[weight_name] = lambda w=w, s=s: dequant_simple(w(), s(), block_size)
tensors_to_remove.append(name)
if self._fp8_as_q8 and is_fp8:
self._fp8_dequantized.add(weight_name)
elif quant_format == "pack-quantized":
assert weight_config.get("strategy") == "group"
assert weight_config.get("type", "int") == "int"
@@ -524,10 +538,18 @@ class ModelBase:
for name in self.model_tensors.keys():
if name.endswith(".weight_scale"):
weight_name = name.removesuffix("_scale")
if weight_name not in self.model_tensors:
tensors_to_remove.append(name)
continue
w = self.model_tensors[weight_name]
s = self.model_tensors[name]
is_fp8_weight = False
if self._fp8_as_q8:
is_fp8_weight = w().dtype in (torch.float8_e4m3fn, torch.float8_e5m2)
self.model_tensors[weight_name] = lambda w=w, s=s: dequant_simple(w(), s(), None)
tensors_to_remove.append(name)
if is_fp8_weight:
self._fp8_dequantized.add(weight_name)
if name.endswith((".input_scale", ".k_scale", ".v_scale")):
tensors_to_remove.append(name)
elif quant_method is not None:
@@ -615,8 +637,10 @@ class ModelBase:
return [(new_name, data_torch)]
def tensor_force_quant(self, name: str, new_name: str, bid: int | None, n_dims: int) -> gguf.GGMLQuantizationType | bool:
del name, new_name, bid, n_dims # unused
del new_name, bid # unused
# Force FP8-original tensors to Q8_0 when requested; Q8_0 is faster than F16/BF16.
if self._fp8_as_q8 and name in self._fp8_dequantized and n_dims >= 2:
return gguf.GGMLQuantizationType.Q8_0
return False
# some models need extra generated tensors (like rope_freqs)
@@ -791,7 +815,7 @@ class ModelBase:
if quant_algo != "NVFP4":
if nvfp4_compressed_tensors:
quant_algo = "NVFP4"
elif any(v.get("quant_algo") == "NVFP4" for v in quant_layers.values() if isinstance(v, dict)):
elif any(str(v.get("quant_algo")).endswith("NVFP4") for v in quant_layers.values() if isinstance(v, dict)):
quant_algo = "NVFP4"
self._is_nvfp4 = quant_algo == "NVFP4"
@@ -891,6 +915,8 @@ class ModelBase:
gguf.MODEL_TENSOR.SSM_CONV1D_Q,
gguf.MODEL_TENSOR.SSM_CONV1D_K,
gguf.MODEL_TENSOR.SSM_CONV1D_V,
# DSA indexer weights should be F32
gguf.MODEL_TENSOR.INDEXER_PROJ,
)
)
or new_name[-7:] not in (".weight", ".lora_a", ".lora_b")
@@ -1114,7 +1140,7 @@ class TextModel(ModelBase):
# Skip multimodal tensors
if name.startswith(("mlp", "vit.", "vpm.", "siglip2.", "conformer.", "merger.", "resampler.", "sound_encoder.", "sound_projection.", "speech_embeddings.")) \
or "visual." in name or "vision." in name or "audio." in name or "talker." in name \
or "vision_" in name or "audio_" in name or "sam_model" in name \
or "vision_" in name or "audio_" in name \
or "token2wav." in name or "code2wav." in name \
or "projector." in name or "pre_mm_projector_norm" in name \
or "image_newline" in name or "view_seperator" in name \
@@ -1421,6 +1447,9 @@ class TextModel(ModelBase):
if chkhsh == "0fe1cf6eda062318a1af7270f3331a85c539a01778ff948e24388e949c5282f4":
# ref: https://huggingface.co/evilfreelancer/ruGPT3XL
res = "gpt-2"
if chkhsh == "9e454714343b69b99b71795c1d27a68c2a1d15dab111f4d353109f966af29da7":
# ref: https://huggingface.co/LiquidAI/LFM2.5-8B-A1B
res = "lfm2"
if chkhsh == "0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5":
# ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B
res = "llama-bpe"
@@ -1572,7 +1601,7 @@ class TextModel(ModelBase):
# ref: https://huggingface.co/K-intelligence/Midm-2.0-Base-Instruct
res = "midm-2.0"
if chkhsh == "169bf0296a13c4d9b7672313f749eb36501d931022de052aad6e36f2bf34dd51":
# ref: https://huggingface.co/LiquidAI/LFM2-Tokenizer
# ref: https://huggingface.co/LiquidAI/LFM2.5-350M
res = "lfm2"
if chkhsh == "2085e1638f6c377a0aa4ead21b27bb4cb941bf800df86ed391011769c1758dfb":
# ref: https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B
@@ -1622,6 +1651,12 @@ class TextModel(ModelBase):
if chkhsh == "62f6fb0a6fd5098caeabb19b07a5c1099cafc8b9c40eab6ea89ece4ec02fbc57":
# ref: https://huggingface.co/sarvamai/sarvam-30b
res = "sarvam-moe"
if chkhsh == "f728162c1315c26e40249849799b4ba3fe584c32084b4795b03eb295e63cb5af":
# ref: https://huggingface.co/lewtun/talkie-1930-13b-it-hf
res = "talkie"
if chkhsh == "36f3066e97b7f3994b379aaacde306c1444c6ae84e81a5ae3cd2b7ed3b8c42d4":
# ref: https://huggingface.co/openbmb/MiniCPM5-1B
res = "minicpm5"
if res is None:
logger.warning("\n")
@@ -1657,6 +1692,16 @@ class TextModel(ModelBase):
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
special_vocab.add_to_gguf(self.gguf_writer)
def _set_vocab_whitespace(self) -> None:
tokens, toktypes, _ = self.get_vocab_base()
self.gguf_writer.add_tokenizer_model("whitespace")
self.gguf_writer.add_tokenizer_pre("whitespace") # pinned, not hash-detected: chktxt hash collides with jina-v1-en
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_types(toktypes)
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
special_vocab.add_to_gguf(self.gguf_writer)
def _set_vocab_hybriddna(self):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
@@ -2411,10 +2456,9 @@ class MmprojModel(ModelBase):
raise KeyError(f"could not find any of: {keys}")
def tensor_force_quant(self, name, new_name, bid, n_dims):
del bid, name, n_dims # unused
if ".patch_embd.weight" in new_name or ".patch_merger.weight" in new_name:
return gguf.GGMLQuantizationType.F16 if self.ftype == gguf.LlamaFileType.MOSTLY_F16 else gguf.GGMLQuantizationType.F32
return False
return super().tensor_force_quant(name, new_name, bid, n_dims)
class LazyTorchTensor(gguf.LazyBase):
@@ -2549,7 +2593,7 @@ def get_model_architecture(hparams: dict[str, Any], model_type: ModelType) -> st
# Step3-VL keeps text config under text_config but uses a custom top-level architecture.
# For text conversion we route to a dedicated text-only class.
# TODO: refactor this later to avoid adding exception here
if model_type == ModelType.TEXT and arch in ("StepVLForConditionalGeneration", "Sarashina2VisionForCausalLM"):
if model_type == ModelType.TEXT and arch in ("StepVLForConditionalGeneration", "Sarashina2VisionForCausalLM", "Exaone4_5_ForConditionalGeneration"):
return arch
# if "architectures" is found in the sub-config, use that instead

View File

@@ -571,7 +571,16 @@ class JinaBertV2Model(BertModel):
if tokenizer_class == 'BertTokenizer':
super().set_vocab()
elif tokenizer_class == 'RobertaTokenizer':
self._set_vocab_gpt2()
pre_tokenizer_type = None
tokenizer_json_path = self.dir_model / "tokenizer.json"
if tokenizer_json_path.is_file():
with open(tokenizer_json_path, "r", encoding="utf-8") as f:
pre_tokenizer_type = json.load(f).get("pre_tokenizer", {}).get("type")
if pre_tokenizer_type == "Whitespace":
self._set_vocab_whitespace()
else:
self._set_vocab_gpt2()
self.gguf_writer.add_token_type_count(2)
else:
raise NotImplementedError(f'Tokenizer {tokenizer_class} is not supported for JinaBertModel')

View File

@@ -16,10 +16,14 @@ from .qwen import QwenModel
@ModelBase.register("DeepseekOCRForCausalLM")
class DeepseekOCRVisionModel(MmprojModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.clip_projector_type = gguf.VisionProjectorType.DEEPSEEKOCR
def set_gguf_parameters(self):
super().set_gguf_parameters()
hparams = self.hparams
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.DEEPSEEKOCR)
self.gguf_writer.add_clip_projector_type(self.clip_projector_type)
# default values below are taken from HF tranformers code
self.gguf_writer.add_vision_attention_layernorm_eps(hparams.get("layer_norm_eps", 1e-6))
self.gguf_writer.add_vision_use_gelu(True)
@@ -49,22 +53,27 @@ class DeepseekOCRVisionModel(MmprojModel):
raise ValueError("DeepseekOCR model requires 'vision_config' in the model configuration, but it was not found")
vision_config['sam'] = vision_config['width']['sam_vit_b']
vision_config.update(vision_config['width']['clip-l-14-224'])
vision_config['hidden_size'] = vision_config['width']
vision_config['num_heads'] = vision_config['heads']
vision_config['intermediate_size'] = vision_config['heads'] * 4
if vision_config['width'].get('clip-l-14-224') is not None:
vision_config.update(vision_config['width']['clip-l-14-224'])
if isinstance(vision_config['width'], int):
vision_config['hidden_size'] = vision_config['width']
if vision_config.get('heads') is not None:
vision_config['num_heads'] = vision_config['heads']
vision_config['intermediate_size'] = vision_config['heads'] * 4
return vision_config
def tensor_force_quant(self, name, new_name, bid, n_dims):
if ".embeddings." in name or 'pos_embed' in name:
return gguf.GGMLQuantizationType.F32
if ".rel_pos_h" in name or '.rel_pos_w' in name:
return gguf.GGMLQuantizationType.F32
if ".neck." in name or ".net_" in name:
return gguf.GGMLQuantizationType.F32
for nq_name in ('.embeddings.', 'pos_embed', '.rel_pos_h', '.rel_pos_w', '.neck.', '.net_'):
if nq_name in name:
return gguf.GGMLQuantizationType.F32
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.endswith("view_seperator"):
data_torch = data_torch.unsqueeze(0)
yield from super().modify_tensors(data_torch, name, bid)
@classmethod
def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
name, gen = item
@@ -81,6 +90,33 @@ class DeepseekOCRVisionModel(MmprojModel):
return super().filter_tensors((name, gen))
@ModelBase.register("DeepseekOCR2ForCausalLM")
class DeepseekOCR2VisionModel(DeepseekOCRVisionModel):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.clip_projector_type = gguf.VisionProjectorType.DEEPSEEKOCR2
def set_gguf_parameters(self):
# the vision tower's qwen2 encoder is built from fixed defaults,
# see build_qwen2_decoder_as_encoder() in deepencoderv2.py
if self.hparams.get("patch_size") is None:
self.hparams["patch_size"] = 16
if self.hparams.get("intermediate_size") is None:
self.hparams["intermediate_size"] = 4864
if self.hparams.get("num_attention_heads") is None:
self.hparams["num_attention_heads"] = 14
super().set_gguf_parameters()
# qwen2 encoder is GQA: 14 Q heads, 2 KV heads
self.gguf_writer.add_vision_head_count_kv(2)
def get_vision_config(self) -> dict[str, Any]:
vision_config = super().get_vision_config()
vision_config['hidden_size'] = vision_config['width']['qwen2-0-5b']['dim']
if vision_config.get('layers') is None:
vision_config['layers'] = 24
return vision_config
@ModelBase.register("DeepseekForCausalLM")
class DeepseekModel(TextModel):
model_arch = gguf.MODEL_ARCH.DEEPSEEK
@@ -188,13 +224,21 @@ class DeepseekV2Model(TextModel):
self.origin_hf_arch = hparams.get('architectures', [None])[0]
# special handling for Deepseek OCR
if self.origin_hf_arch == "DeepseekOCRForCausalLM":
if self.origin_hf_arch in ("DeepseekOCRForCausalLM", "DeepseekOCR2ForCausalLM"):
self.model_arch = gguf.MODEL_ARCH.DEEPSEEK2OCR
self.gguf_writer.arch = gguf.MODEL_ARCH_NAMES[self.model_arch]
self.gguf_writer.add_architecture()
# default jinja template
self.gguf_writer.add_chat_template("{% for m in messages %}{{m['content']}}{% endfor %}")
@classmethod
def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
name, _ = item
# DeepSeek-OCR vision encoder (SAM + DeepSeek-OCR-2 qwen2 tower)
if "sam_model" in name or "qwen2_model" in name:
return None
return super().filter_tensors(item)
def set_vocab(self):
try:
self._set_vocab_gpt2()
@@ -386,3 +430,32 @@ class DeepseekV2Model(TextModel):
experts = [k for d in self._experts for k in d.keys()]
if len(experts) > 0:
raise ValueError(f"Unprocessed experts: {experts}")
@ModelBase.register("DeepseekV32ForCausalLM")
class DeepseekV32Model(DeepseekV2Model):
model_arch = gguf.MODEL_ARCH.DEEPSEEK32
skip_mtp = False
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.block_count = self.hparams["num_hidden_layers"] + self.hparams.get("num_nextn_predict_layers", 0)
self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
def set_vocab(self):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
assert getattr(tokenizer, "add_bos_token", False), "Change value of add_bos_token to true in tokenizer_config.json file."
self._set_vocab_gpt2()
def set_gguf_parameters(self):
super().set_gguf_parameters()
# NextN/MTP prediction layers
if (num_nextn_predict_layers := self.hparams.get("num_nextn_predict_layers")) is not None:
self.gguf_writer.add_nextn_predict_layers(num_nextn_predict_layers)
# DSA indexer parameters
self.gguf_writer.add_indexer_head_count(self.hparams["index_n_heads"])
self.gguf_writer.add_indexer_key_length(self.hparams["index_head_dim"])
self.gguf_writer.add_indexer_top_k(self.hparams["index_topk"])

View File

@@ -3,14 +3,15 @@ from __future__ import annotations
import math
from pathlib import Path
from typing import Iterable, TYPE_CHECKING
from typing import Callable, Iterable, TYPE_CHECKING
import torch
if TYPE_CHECKING:
from torch import Tensor
from .base import ModelBase, TextModel, gguf
from .base import MmprojModel, ModelBase, TextModel, gguf
from .qwenvl import Qwen2VLVisionModel
@ModelBase.register("ExaoneForCausalLM")
@@ -208,3 +209,97 @@ class ExaoneMoEModel(Exaone4Model):
experts = [k for d in self._experts for k in d.keys()]
if len(experts) > 0:
raise ValueError(f"Unprocessed experts: {experts}")
@ModelBase.register("Exaone4_5_ForConditionalGeneration")
class Exaone4_5_TextModel(Exaone4Model):
"""Text tower of EXAONE 4.5; Tensors match EXAONE4"""
model_arch = gguf.MODEL_ARCH.EXAONE4
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
n_nextn = int(self.hparams.get("num_nextn_predict_layers", 0) or 0)
if n_nextn > 0:
self.block_count = self.hparams["num_hidden_layers"] + n_nextn
self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
def set_gguf_parameters(self):
super().set_gguf_parameters()
n_nextn = int(self.hparams.get("num_nextn_predict_layers", 0) or 0)
if n_nextn > 0:
self.gguf_writer.add_nextn_predict_layers(n_nextn)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.startswith("mtp."):
n_nextn = int(self.hparams.get("num_nextn_predict_layers", 0) or 0)
if n_nextn <= 0:
return
nh = self.hparams["num_hidden_layers"]
if ".layers." in name:
share = self.hparams.get("mtp_share_layers", False)
mtp_bid = bid if bid is not None else 0
if share:
for k in range(n_nextn):
nn = name.replace(f"mtp.layers.{mtp_bid}", f"model.layers.{nh + k}")
yield from super().modify_tensors(data_torch, nn, nh + k)
return
name = name.replace(f"mtp.layers.{mtp_bid}", f"model.layers.{mtp_bid + nh}")
else:
remapper = {
"mtp.fc": gguf.MODEL_TENSOR.NEXTN_EH_PROJ,
"mtp.pre_fc_norm_embedding": gguf.MODEL_TENSOR.NEXTN_ENORM,
"mtp.pre_fc_norm_hidden": gguf.MODEL_TENSOR.NEXTN_HNORM,
"mtp.norm": gguf.MODEL_TENSOR.NEXTN_SHARED_HEAD_NORM,
}
_n = Path(name)
key = _n.stem
if key not in remapper:
return
for bid_mtp in range(nh, self.block_count):
mapped_name = self.format_tensor_name(remapper[key], bid_mtp, suffix=_n.suffix)
yield from ModelBase.modify_tensors(self, data_torch, mapped_name, bid_mtp)
return
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("Exaone4_5_ForConditionalGeneration")
class Exaone4_5VisionModel(Qwen2VLVisionModel):
"""Vision tower for EXAONE 4.5; Qwen2-VL-style ViT (GQA) + patch merger"""
@classmethod
def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
name, gen = item
name = name.replace("model.visual.", "visual.", 1)
return super().filter_tensors((name, gen))
def set_gguf_parameters(self):
MmprojModel.set_gguf_parameters(self)
assert self.hparams_vision is not None
hparams = self.hparams_vision
self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.EXAONE4_5)
self.gguf_writer.add_vision_use_silu(True)
self.gguf_writer.add_vision_min_pixels(self.preprocessor_config["min_pixels"])
self.gguf_writer.add_vision_max_pixels(self.preprocessor_config["max_pixels"])
num_kv_head = self.find_vparam(["num_key_value_heads"], optional=True)
if num_kv_head is not None:
self.gguf_writer.add_vision_head_count_kv(num_kv_head)
eps = hparams.get("rms_norm_eps", self.global_config.get("rms_norm_eps", 1e-6))
self.gguf_writer.add_vision_attention_layernorm_eps(eps)
if (window_size := hparams.get("window_size")) is not None:
self.gguf_writer.add_vision_window_size(window_size)
fullatt_block_indexes = hparams.get("fullatt_block_indexes")
if fullatt_block_indexes:
n_wa_pattern = fullatt_block_indexes[0] + 1
for i in range(1, len(fullatt_block_indexes)):
if fullatt_block_indexes[i] - fullatt_block_indexes[i - 1] != n_wa_pattern:
raise ValueError(f"Invalid EXAONE4.5 fullatt_block_indexes: {fullatt_block_indexes}")
self.gguf_writer.add_vision_n_wa_pattern(n_wa_pattern)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if ".qkv." in name:
yield from ModelBase.modify_tensors(self, data_torch, name, bid)
return
yield from Qwen2VLVisionModel.modify_tensors(self, data_torch, name, bid)

View File

@@ -614,7 +614,7 @@ class Gemma3NModel(Gemma3Model):
yield from super().modify_tensors(data_torch, name, bid)
@ModelBase.register("Gemma4ForConditionalGeneration")
@ModelBase.register("Gemma4ForConditionalGeneration", "Gemma4ForCausalLM")
class Gemma4Model(Gemma3Model):
model_arch = gguf.MODEL_ARCH.GEMMA4
@@ -786,14 +786,15 @@ class Gemma4VisionAudioModel(MmprojModel):
super().set_gguf_parameters()
# vision params
assert self.hparams_vision is not None
self.gguf_writer.add_clip_vision_projector_type(gguf.VisionProjectorType.GEMMA4V)
self.gguf_writer.add_vision_attention_layernorm_eps(self.hparams.get("layer_norm_eps", 1e-6))
self.gguf_writer.add_vision_attention_layernorm_eps(self.hparams_vision.get("layer_norm_eps", 1e-6))
# audio params
if self.hparams_audio:
self.gguf_writer.add_clip_audio_projector_type(gguf.VisionProjectorType.GEMMA4A)
self.gguf_writer.add_audio_num_mel_bins(self.hparams_audio["feat_in"])
self.gguf_writer.add_audio_attention_layernorm_eps(1e-5)
assert self.hparams_audio is not None
self.gguf_writer.add_clip_audio_projector_type(gguf.VisionProjectorType.GEMMA4A)
self.gguf_writer.add_audio_num_mel_bins(self.hparams_audio["feat_in"])
self.gguf_writer.add_audio_attention_layernorm_eps(self.hparams_audio.get("layer_norm_eps", 1e-6))
def is_audio_tensor(self, name: str) -> bool:
return "audio_tower" in name or "embed_audio" in name

53
conversion/talkie.py Normal file
View File

@@ -0,0 +1,53 @@
from __future__ import annotations
from typing import Iterable, TYPE_CHECKING
import torch
if TYPE_CHECKING:
from torch import Tensor
from .base import LazyTorchTensor, ModelBase, TextModel, gguf
@ModelBase.register("TalkieForCausalLM")
class TalkieModel(TextModel):
model_arch = gguf.MODEL_ARCH.TALKIE
def set_gguf_parameters(self):
super().set_gguf_parameters()
# Talkie used F.rms_norm without an explicit eps
self.gguf_writer.add_layer_norm_rms_eps(torch.finfo(torch.float32).eps)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
prefix = f"model.blocks.{bid}." if bid is not None else ""
suffix = name.removeprefix(prefix)
if suffix == "attn_gain.a_g":
yield self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_OUT, bid, ".scale"), data_torch
return
elif suffix == "mlp_gain.a_g":
yield self.format_tensor_name(gguf.MODEL_TENSOR.FFN_DOWN, bid, ".scale"), data_torch
return
elif suffix == "lm_head_gain.w_g":
self.gguf_writer.add_logit_scale(LazyTorchTensor.to_eager(data_torch).item())
return
elif suffix in ("attn.attn_query.weight", "attn.attn_key.weight"):
# absorb inverse rope
head_dim = self.hparams["head_dim"]
shape = data_torch.shape
data_torch = torch.reshape(data_torch, (-1, head_dim, shape[-1]))
signs = torch.ones((1, head_dim, 1), dtype=data_torch.dtype)
signs[:, head_dim // 2 :, :] = -1
if self.lazy:
signs = LazyTorchTensor.from_eager(signs)
# (n_head, head_dim, n_in) -> (n_out, n_in)
data_torch = torch.reshape(data_torch * signs, shape)
elif suffix == "attn.head_gain.head_g":
# allow head gain to broadcast
data_torch = data_torch.unsqueeze(-1)
if not name.endswith(".weight"):
name += ".weight"
yield from super().modify_tensors(data_torch, name, bid)

View File

@@ -148,6 +148,10 @@ def parse_args() -> argparse.Namespace:
"--fuse-gate-up-exps", action="store_true",
help="Fuse gate_exps and up_exps tensors into a single gate_up_exps tensor for MoE models.",
)
parser.add_argument(
"--fp8-as-q8", action="store_true",
help="Store tensors dequantized from FP8 as Q8_0 instead of BF16/F16.",
)
args = parser.parse_args()
if not args.print_supported_models and args.model is None:
@@ -264,7 +268,8 @@ def main() -> None:
small_first_shard=args.no_tensor_first_split,
remote_hf_model_id=hf_repo_id, disable_mistral_community_chat_template=disable_mistral_community_chat_template,
sentence_transformers_dense_modules=args.sentence_transformers_dense_modules,
fuse_gate_up_exps=args.fuse_gate_up_exps
fuse_gate_up_exps=args.fuse_gate_up_exps,
fp8_as_q8=args.fp8_as_q8,
)
if args.vocab_only:

View File

@@ -139,7 +139,7 @@ models = [
{"name": "seed-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base", },
{"name": "a.x-4.0", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/skt/A.X-4.0", },
{"name": "midm-2.0", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/K-intelligence/Midm-2.0-Base-Instruct", },
{"name": "lfm2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LiquidAI/LFM2-Tokenizer"},
{"name": "lfm2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LiquidAI/LFM2.5-350M", },
{"name": "exaone4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B", },
{"name": "mellum", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/JetBrains/Mellum-4b-base", },
{"name": "modern-bert", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/answerdotai/ModernBERT-base", },
@@ -156,6 +156,8 @@ models = [
{"name": "kanana2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/kakaocorp/kanana-2-30b-a3b-instruct-2601", },
{"name": "f2llmv2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/codefuse-ai/F2LLM-v2-4B", },
{"name": "sarvam-moe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/sarvamai/sarvam-30b", },
{"name": "talkie", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/lewtun/talkie-1930-13b-it-hf", },
{"name": "minicpm5", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/openbmb/MiniCPM5-1B"},
]
# some models are known to be broken upstream, so we will skip them as exceptions
@@ -181,6 +183,8 @@ pre_computed_hashes = [
# jina-v2-de variants
{"name": "jina-v2-de", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/aari1995/German_Semantic_V3", "chkhsh": "b3d1dd861f1d4c5c0d2569ce36baf3f90fe8a102db3de50dd71ff860d91be3df"},
{"name": "gpt-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/evilfreelancer/ruGPT3XL", "chkhsh": "0fe1cf6eda062318a1af7270f3331a85c539a01778ff948e24388e949c5282f4"},
# lfm2 variants
{"name": "lfm2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/LiquidAI/LFM2.5-8B-A1B", "chkhsh": "9e454714343b69b99b71795c1d27a68c2a1d15dab111f4d353109f966af29da7"},
]

View File

@@ -208,6 +208,16 @@ class LoraTorchTensor:
def to(self, *args, **kwargs):
return LoraTorchTensor(self._lora_A.to(*args, **kwargs), self._lora_B.to(*args, **kwargs))
def __mul__(self, other) -> LoraTorchTensor:
# Only output-side multiplication for now
# W = B @ A, so M_out * W == (M_out * B) @ A
if not isinstance(other, (int, float)) and other.shape and other.shape[-1] != 1:
raise NotImplementedError
return LoraTorchTensor(self._lora_A, self._lora_B * other)
def __rmul__(self, other) -> LoraTorchTensor:
return self * other
@classmethod
def __torch_function__(cls, func: Callable, types, args=(), kwargs=None):
del types # unused

View File

@@ -459,7 +459,7 @@ Each returned parser is wrapped by `wrap_for_generation_prompt()`, which prepend
- Usage: `./bin/llama-template-analysis path/to/template.jinja`
**Debug Logging**: Enable with `LLAMA_LOG_VERBOSITY=2`
**Debug Logging**: Enable with `LLAMA_ARG_LOG_VERBOSITY=2`
- Shows detailed analysis steps, pattern extraction results, and generated parser structure

View File

@@ -8,7 +8,7 @@
- [Performance Reference](#performance-reference)
- [Docker](#docker)
- [Linux](#linux)
- [Windows](#windows)
- [Windows](#windows-1)
- [Environment Variable](#environment-variable)
- [Design Rule](#design-rule)
- [Known Issue](#known-issues)
@@ -743,6 +743,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because SYCL Graph is still on development, no better performance. |
| GGML_SYCL_ENABLE_LEVEL_ZERO | 1 (default) or 0 | Use Level Zero API for device memory allocation instead of SYCL. Reduces system RAM usage on Intel dGPUs by avoiding DMA-buf/TTM host memory staging. Requires GGML_SYCL_SUPPORT_LEVEL_ZERO=ON at build time. |
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
| GGML_SYCL_ENABLE_VMM | 0 or 1 (default) | Enable the virtual-memory device pool. |
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Allow SYCL/Unified Runtime Level Zero device allocations larger than 4 GiB. llama.cpp's direct Level Zero allocation path requests the relaxed maximum-size limit itself when GGML_SYCL_ENABLE_LEVEL_ZERO=1. |
@@ -753,6 +754,7 @@ Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spo
| Name | Function |
|-----------------|----------------------------------------------------------------------------------|
| DEBUG_SYCL_POOL | Enable device memory pool logging on teardown. Useful for profiling allocations. |
| DEBUG_SYCL_MALLOC | Enable verbose per-call logging of device pool alloc/free operations. |
## Design Rule

View File

@@ -72,10 +72,13 @@ The ZenDNN backend accelerates **matrix multiplication (MUL_MAT)** and **expert-
|:----------------------:|:-------:|:---------------------------------------------:|
| FP32 | Support | Full precision floating point |
| BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) |
| Q8_0 | Support | 8-bit quantized weights via [dynamic quantization](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md) |
*Notes:*
- **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
- **Q8_0** is available for quantized model weights since ZenDNN supports dynamic quantization [LowOHA MatMul operator](https://github.com/amd/ZenDNN/blob/main/docs/operator/lowoha_matmul_operator.md).
- Other quantization formats fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.
## Linux
@@ -140,6 +143,15 @@ Download LLaMA 3.1 8B Instruct BF16 model:
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
```
You can also use a Q8_0 GGUF model:
```sh
# Download a Q8_0 GGUF model from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF \
Llama-3.1-8B-Instruct-Q8_0.gguf \
--local-dir models/
```
#### 2. Start Server
Run llama.cpp server with ZenDNN acceleration:
@@ -176,6 +188,10 @@ export ZENDNNL_MATMUL_ALGO=1 # Blocked AOCL DLP algo (recommended)
For more details on available algorithms, see the [ZenDNN MatMul Algorithm Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md#algorithm-details).
### Q8_0 Performance Notes
Q8_0 support is mainly beneficial for prompt processing / prefill workloads where large matrix multiplications dominate execution. Token generation performance may remain close to the standard CPU backend depending on the model, batch size, number of threads, and CPU topology.
### Profiling and Debugging
For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/logging.md).
@@ -184,6 +200,7 @@ For detailed profiling and logging options, refer to the [ZenDNN Logging Documen
- **Limited operation support**: Currently matrix multiplication (MUL_MAT) and expert-based matrix multiplication (MUL_MAT_ID) are accelerated via ZenDNN. Other operations fall back to the standard CPU backend. Future updates may expand supported operations.
- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
- **Q8_0 support scope**: Q8_0 acceleration is available for supported matrix multiplication paths. Other quantization formats still fall back to the standard CPU backend.
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
## Q&A
@@ -202,7 +219,7 @@ A: ZenDNN is optimized specifically for AMD processors. While it may work on oth
**Q: Does ZenDNN support quantized models?**
A: Currently, ZenDNN primarily supports FP32 and BF16 data types. Quantized model support is not available at this time.
A: Yes. The ZenDNN backend supports Q8_0 quantized models for supported matrix multiplication operations. FP32 and BF16 are also supported. Other quantization formats may fall back to the standard CPU backend unless explicitly supported by the ZenDNN backend.
**Q: Why is my inference not faster with ZenDNN?**

View File

@@ -10,7 +10,7 @@ This image includes Android NDK, OpenCL SDK, Hexagon SDK, CMake, etc.
This method works on Linux, macOS, and Windows. macOS and Windows users should install Docker Desktop.
```
~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-android:v0.6
~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-android:v0.7
[d]/> cd /workspace
```

View File

@@ -22,6 +22,7 @@ The following sections describe how to build with different backends and options
* [HIP](#hip)
* [Vulkan](#vulkan)
* [CANN](#cann)
* [ZenDNN](#zendnn)
* [Arm® KleidiAI™](#arm-kleidiai)
* [OpenCL](#opencl)
* [Android](#android-1)

View File

@@ -176,7 +176,7 @@ Note that currently you cannot quantize the visual encoder because granite visio
### 5. Running the Model in Llama cpp
Build llama cpp normally; you should have a target binary named `llama-mtmd-cli`, which you can pass two binaries to. As an example, we pass the the llama.cpp banner.
Build llama cpp normally; you should have a target binary named `llama-mtmd-cli`, which you can pass two binaries to. As an example, we pass the llama.cpp banner.
```bash
$ ./build/bin/llama-mtmd-cli -m $LLM_GGUF_PATH \

View File

@@ -55,7 +55,7 @@ Legend:
| GELU | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_ERF | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GELU_QUICK | ❌ | ✅ | ✅ | 🟡 | ✅ | 🟡 | ✅ | 🟡 | ✅ | ❌ | ❌ |
| GET_ROWS | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | 🟡 | ✅ | 🟡 | ❌ | ❌ |
| GET_ROWS | ❌ | 🟡 | ✅ | 🟡 | 🟡 | 🟡 | | ✅ | 🟡 | ❌ | ❌ |
| GET_ROWS_BACK | ❌ | ❌ | 🟡 | 🟡 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GROUP_NORM | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| HARDSIGMOID | ❌ | ✅ | ✅ | 🟡 | ✅ | ❌ | ✅ | 🟡 | ✅ | ❌ | ❌ |

File diff suppressed because it is too large Load Diff

View File

@@ -323,3 +323,8 @@ statistics ngram_map_k: #calls(b,g,a) = 6 1690 26, #gen drafts = 26, #acc drafts
- `#gen tokens`: number of tokens generated by this implementation (including rejected tokens)
- `#acc tokens`: number of tokens accepted by the main model
- `dur(b,g,a): durations of begin (new prompt), generation and accumulation (process acceptance).
## Benchmarking
To measure the end-to-end effect of speculative decoding (throughput, latency, and draft acceptance) across diverse prompts, see the SPEED-Bench client in [tools/server/bench/speed-bench](../tools/server/bench/speed-bench/README.md).
It runs against a running `llama-server` and can compare a baseline run against a speculative-decoding run.

View File

@@ -335,7 +335,7 @@ $ make perplexity-run-full QUANTIZED_MODEL=~/path/to/quantized/model-Qxx.gguf LO
## HuggingFace utilities
The following targets are useful for creating collections and model repositories
on Hugging Face in the the ggml-org. These can be used when preparing a release
on Hugging Face in the ggml-org. These can be used when preparing a release
to script the process for new model releases.
For the following targets a `HF_TOKEN` environment variable is required.

View File

@@ -5,7 +5,7 @@ project("ggml" C CXX ASM)
### GGML Version
set(GGML_VERSION_MAJOR 0)
set(GGML_VERSION_MINOR 13)
set(GGML_VERSION_PATCH 0)
set(GGML_VERSION_PATCH 1)
set(GGML_VERSION_BASE "${GGML_VERSION_MAJOR}.${GGML_VERSION_MINOR}.${GGML_VERSION_PATCH}")
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/")

View File

@@ -381,11 +381,15 @@ extern "C" {
// - most tensors have n_segments == 1 and a contiguous slice of the tensor data
// - some tensors have an inhomogenenous data layout along the split axis,
// those tensors are divided into segments which are each individually split across devices
// - ne has one entry per segment and device that add up to ggml_tensor::ne for that axis,
// the outer/inner loops are over segments/devices like [seg0_dev0, seg0_dev1, seg1_dev0, seg1_dev1],
// - ne has one entry per segment and device and that segment repeats nr times,
// in total when accounting for repetitions the segments add up to ggml_tensor::ne for that axis,
// the outer/inner loops are over segments/devices like [seg0_dev0_r0, seg0_dev1_r0, seg0_dev0_r1, seg0_dev1_r1, seg1_dev0_r0, seg1_dev1_r0],
// - for example, a transformer may have a fused QKV matrix rather than 3 matrices, those would be 3 separate segments
// that each need to be split individually across devices so that each device gets a slice of Q, K, and V
// that each need to be split individually across devices so that each device gets a slice of Q, K, and V,
// the Q matrix can be larger than the K and V matrices so this can either be expressed as 3 segments or as 2 segments
// where the segment for K/V repeats twice
int64_t ne[16*GGML_BACKEND_META_MAX_DEVICES];
uint32_t nr[16];
uint32_t n_segments;
};

View File

@@ -487,6 +487,9 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(co
static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
ggml_backend_meta_simple_tensor_container & stc, const struct ggml_tensor * tensor, bool assume_sync) {
// FIXME Currently this function preserves/erases the information in n_segments and nr in an inconsistent way.
// Since the operations in question are developed specifically for llama.cpp this currently does not manifest as a bug there.
// However, in a broader ggml context with arbitrary ggml graphs this can lead to unexpected results.
const size_t n_bufs = ggml_backend_meta_buffer_n_bufs(tensor->buffer);
ggml_backend_meta_buffer_context * buf_ctx = (ggml_backend_meta_buffer_context *) tensor->buffer->context;
@@ -497,11 +500,11 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
for (size_t j = 0; j < n_bufs; j++) {
int64_t sum_a = 0;
for (size_t s = 0; s < a.n_segments; s++) {
sum_a += a.ne[s*n_bufs + j];
sum_a += a.ne[s*n_bufs + j] * a.nr[s];
}
int64_t sum_b = 0;
for (size_t s = 0; s < b.n_segments; s++) {
sum_b += b.ne[s*n_bufs + j];
sum_b += b.ne[s*n_bufs + j] * b.nr[s];
}
if (sum_a != sum_b) {
return false;
@@ -511,7 +514,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
};
auto handle_generic = [&](const std::vector<ggml_backend_meta_split_state> & src_ss, bool scalar_only) -> ggml_backend_meta_split_state {
ggml_backend_meta_split_state ret = {GGML_BACKEND_SPLIT_AXIS_NONE, {0}, 1};
ggml_backend_meta_split_state ret = {GGML_BACKEND_SPLIT_AXIS_NONE, {0}, {1}, 1};
for (size_t i = 0; i < GGML_MAX_SRC; i++) {
if (tensor->src[i] == nullptr || tensor->src[i] == tensor) {
continue;
@@ -519,15 +522,15 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
if (ret.axis == GGML_BACKEND_SPLIT_AXIS_NONE) {
ret = src_ss[i];
} else if (!split_states_equal(src_ss[i], ret)) {
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
break;
}
}
if (ret.axis == GGML_BACKEND_SPLIT_AXIS_NONE) {
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
if (scalar_only && ret.axis >= 0 && ret.axis < GGML_MAX_DIMS) {
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
ret = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
GGML_ASSERT(ret.axis != GGML_BACKEND_SPLIT_AXIS_UNKNOWN);
return ret;
@@ -571,42 +574,24 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
auto handle_mul_mat = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
return {GGML_BACKEND_SPLIT_AXIS_MIRRORED, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_MIRRORED, {0}, {1}, 1};
}
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_1 && src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
ggml_backend_meta_split_state ret = src_ss[0];
ret.axis = GGML_BACKEND_SPLIT_AXIS_0;
ret.nr[0] = 1;
ret.n_segments = 1;
return ret;
}
if (src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_1 && src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
ggml_backend_meta_split_state ret = src_ss[1];
ret.n_segments = 1;
return ret;
return src_ss[1];
}
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_0 && src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_0) {
GGML_ASSERT(split_states_equal(src_ss[0], src_ss[1]));
return {assume_sync ? GGML_BACKEND_SPLIT_AXIS_MIRRORED : GGML_BACKEND_SPLIT_AXIS_PARTIAL, {0}, 1};
return {assume_sync ? GGML_BACKEND_SPLIT_AXIS_MIRRORED : GGML_BACKEND_SPLIT_AXIS_PARTIAL, {0}, {1}, 1};
}
GGML_ABORT("fatal error");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
};
auto handle_cpy = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis >= 0 && src_ss[0].axis < GGML_MAX_DIMS) {
int64_t ne_split_src = tensor->src[0]->ne[0];
for (int dim = 1; dim <= src_ss[0].axis; dim++) {
ne_split_src *= tensor->src[0]->ne[dim];
}
int64_t ne_split_dst = 1;
for (int dim = 0; dim < GGML_MAX_DIMS; dim++) {
ne_split_dst *= tensor->ne[dim];
if (ne_split_dst == ne_split_src) {
return {ggml_backend_meta_split_axis(dim), {0}, 1};
}
}
}
return handle_generic(src_ss, /*scalar_only =*/ false);
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
};
auto handle_reshape = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
@@ -615,33 +600,25 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
case GGML_BACKEND_SPLIT_AXIS_1:
case GGML_BACKEND_SPLIT_AXIS_2:
case GGML_BACKEND_SPLIT_AXIS_3: {
GGML_ASSERT(!ggml_is_permuted(tensor) && !ggml_is_permuted(tensor->src[0]));
if (src_ss[0].axis == ggml_n_dims(tensor->src[0]) - 1) {
return {ggml_backend_meta_split_axis(ggml_n_dims(tensor) - 1), {0}, 1};
GGML_ASSERT(src_ss[0].n_segments == 1);
if (src_ss[0].axis == ggml_n_dims(tensor->src[0]) - 1 && src_ss[0].nr[0] == 1) {
return {ggml_backend_meta_split_axis(ggml_n_dims(tensor) - 1), {0}, {1}, 1};
}
std::vector<int64_t> base_ne_in;
base_ne_in.reserve(GGML_MAX_DIMS - src_ss[0].axis);
{
base_ne_in.push_back(1);
int dim = 0;
for (; dim <= src_ss[0].axis; dim++) {
base_ne_in[0] *= tensor->src[0]->ne[dim];
}
for (; dim <= GGML_MAX_DIMS; dim++) {
base_ne_in.push_back(base_ne_in.back() * tensor->src[0]->ne[dim]);
}
int64_t base_ne_in = tensor->src[0]->ne[0];
for (int dim = 1; dim <= src_ss[0].axis; dim++) {
base_ne_in *= tensor->src[0]->ne[dim];
}
base_ne_in /= src_ss[0].nr[0];
int64_t base_ne_out = 1;
for (int dim = 0; dim < GGML_MAX_DIMS; dim++) {
const int64_t base_ne_out_next = base_ne_out *= tensor->ne[dim];
for (const int64_t & bni : base_ne_in) {
if (bni == base_ne_out_next) {
return {ggml_backend_meta_split_axis(dim), {0}, 1};
}
if (base_ne_out_next % base_ne_in == 0) {
return {ggml_backend_meta_split_axis(dim), {0}, {uint32_t(base_ne_out_next/base_ne_in)}, 1};
}
if (base_ne_out_next > base_ne_in[0]) {
GGML_ASSERT(dim + 1 < GGML_MAX_DIMS);
return {ggml_backend_meta_split_axis(dim + 1), {0}, 1};
if (base_ne_out_next > base_ne_in) {
GGML_ASSERT(src_ss[0].n_segments == 1);
GGML_ASSERT(src_ss[0].nr[0] == 1);
return {ggml_backend_meta_split_axis(dim), {0}, {1}, 1};
}
base_ne_out = base_ne_out_next;
}
@@ -653,11 +630,18 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
}
default: {
GGML_ABORT("fatal error");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
}
};
auto handle_cpy = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis >= 0 && src_ss[0].axis < GGML_MAX_DIMS) {
return handle_reshape(src_ss);
}
return handle_generic(src_ss, /*scalar_only =*/ false);
};
auto handle_view = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (ggml_is_contiguous(tensor) && ggml_is_contiguous(tensor->src[0])) {
return handle_reshape(src_ss);
@@ -681,7 +665,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
if (!ggml_is_permuted(tensor) && !ggml_is_permuted(tensor->src[0]) && axis >= 0 && axis < GGML_MAX_DIMS-1) {
for (int dim = 0; dim < GGML_MAX_DIMS-1; dim++) {
if (tensor->nb[dim+1] == tensor->src[0]->nb[axis+1]) {
return {ggml_backend_meta_split_axis(dim), {0}, 1};
return {ggml_backend_meta_split_axis(dim), {0}, {1}, 1};
}
}
GGML_ABORT("fatal error");
@@ -690,7 +674,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
return src_ss[0];
}
GGML_ABORT("view of permuted tensor not implemented");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
};
auto handle_permute = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
@@ -699,7 +683,8 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
case GGML_BACKEND_SPLIT_AXIS_1:
case GGML_BACKEND_SPLIT_AXIS_2:
case GGML_BACKEND_SPLIT_AXIS_3: {
return {ggml_backend_meta_split_axis(tensor->op_params[src_ss[0].axis]), {0}, 1};
GGML_ASSERT(src_ss[0].n_segments == 1 || src_ss[0].nr[0] == 1);
return {ggml_backend_meta_split_axis(tensor->op_params[src_ss[0].axis]), {0}, {src_ss[0].nr[0]}, 1};
}
case GGML_BACKEND_SPLIT_AXIS_MIRRORED:
case GGML_BACKEND_SPLIT_AXIS_PARTIAL: {
@@ -707,7 +692,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
}
default: {
GGML_ABORT("fatal error");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
}
};
@@ -716,7 +701,8 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
switch (src_ss[0].axis) {
case GGML_BACKEND_SPLIT_AXIS_0:
case GGML_BACKEND_SPLIT_AXIS_1: {
return {ggml_backend_meta_split_axis(int(src_ss[0].axis) ^ 1), {0}, 1};
GGML_ASSERT(src_ss[0].n_segments == 1 || src_ss[0].nr[0] == 1);
return {ggml_backend_meta_split_axis(int(src_ss[0].axis) ^ 1), {0}, {src_ss[0].nr[0]}, 1};
}
case GGML_BACKEND_SPLIT_AXIS_2:
case GGML_BACKEND_SPLIT_AXIS_3:
@@ -726,7 +712,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
}
default: {
GGML_ABORT("fatal error");
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
//return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
}
};
@@ -764,16 +750,16 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
GGML_ASSERT( src_ss[2].axis == GGML_BACKEND_SPLIT_AXIS_2);
GGML_ASSERT(tensor->src[4] == nullptr || src_ss[3].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED);
GGML_ASSERT(tensor->src[4] == nullptr || src_ss[4].axis == GGML_BACKEND_SPLIT_AXIS_0);
return {GGML_BACKEND_SPLIT_AXIS_1, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_1, {0}, {1}, 1};
};
auto handle_ssm_conv = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis == src_ss[1].axis) {
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_0) {
return {GGML_BACKEND_SPLIT_AXIS_1, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_1, {0}, {1}, 1};
}
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_1) {
return {GGML_BACKEND_SPLIT_AXIS_0, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_0, {0}, {1}, 1};
}
}
return handle_generic(src_ss, /*scalar_only =*/ false);
@@ -781,8 +767,8 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
auto handle_gated_delta_net = [&](const std::vector<ggml_backend_meta_split_state> & src_ss) -> ggml_backend_meta_split_state {
if (src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[1].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED &&
src_ss[2].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[3].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED &&
src_ss[4].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
src_ss[2].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[3].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED &&
src_ss[4].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED && src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_MIRRORED) {
return src_ss[0];
}
GGML_ASSERT(src_ss[0].axis == GGML_BACKEND_SPLIT_AXIS_1);
@@ -793,12 +779,12 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
// state shape is (S_v*S_v*H, K, n_seqs); the heads dim is nested inside axis 0,
// so a head-aligned split on the input cache reshapes to axis 0 here (not axis 2).
GGML_ASSERT(src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_2 || src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_1 || src_ss[5].axis == GGML_BACKEND_SPLIT_AXIS_0);
return {GGML_BACKEND_SPLIT_AXIS_0, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_0, {0}, {1}, 1};
};
auto calculate_split_state = [&]() -> ggml_backend_meta_split_state {
if (ggml_nelements(tensor) == 0) {
return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
return {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
}
if (ggml_backend_buffer_get_usage(tensor->buffer) != GGML_BACKEND_BUFFER_USAGE_COMPUTE && tensor->view_src == nullptr) {
ggml_backend_dev_t dev = ggml_backend_buft_get_device(ggml_backend_buffer_get_type(tensor->buffer));
@@ -807,19 +793,21 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
if (ret.axis >= 0 && ret.axis <= GGML_MAX_DIMS) {
const int64_t granularity = ret.axis == GGML_BACKEND_SPLIT_AXIS_0 ? ggml_blck_size(tensor->type) : 1;
int64_t ne_sum = 0;
for (size_t sj = 0; sj < ret.n_segments*n_bufs; sj++) {
GGML_ASSERT(ret.ne[sj] % granularity == 0);
ne_sum += ret.ne[sj];
for (size_t s = 0; s < ret.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
GGML_ASSERT(ret.ne[s*n_bufs + j] % granularity == 0);
ne_sum += ret.ne[s*n_bufs + j] * ret.nr[s];
}
}
GGML_ASSERT(ne_sum == tensor->ne[ret.axis]);
}
return ret;
}
std::vector<ggml_backend_meta_split_state> src_ss(GGML_MAX_SRC, {GGML_BACKEND_SPLIT_AXIS_NONE, {0}, 1});
std::vector<ggml_backend_meta_split_state> src_ss(GGML_MAX_SRC, {GGML_BACKEND_SPLIT_AXIS_NONE, {0}, {1}, 1});
for (size_t i = 0; i < GGML_MAX_SRC; i++) {
if (tensor->src[i] == nullptr || tensor->src[i] == tensor) {
src_ss[i] = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
src_ss[i] = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
continue;
}
src_ss[i] = ggml_backend_meta_get_split_state(stc, tensor->src[i], /*assume_sync =*/ true);
@@ -829,7 +817,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
ggml_backend_meta_split_state split_state;
switch (tensor->op) {
case GGML_OP_NONE: {
split_state = {GGML_BACKEND_SPLIT_AXIS_MIRRORED, {0}, 1};
split_state = {GGML_BACKEND_SPLIT_AXIS_MIRRORED, {0}, {1}, 1};
} break;
case GGML_OP_DUP: {
split_state = handle_generic(src_ss, /*scalar_only =*/ true);
@@ -1016,7 +1004,7 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
} break;
default: {
GGML_ABORT("ggml op not implemented: %s", ggml_op_name(tensor->op));
split_state = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, 1};
split_state = {GGML_BACKEND_SPLIT_AXIS_UNKNOWN, {0}, {1}, 1};
} break;
}
if (split_state.axis >= 0 && split_state.axis < GGML_MAX_DIMS) {
@@ -1034,23 +1022,25 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
split_state.ne[s*n_bufs + j] = 0;
}
for (size_t s = 0; s < src_ss[i].n_segments; s++) {
split_state.ne[j] += src_ss[i].ne[s*n_bufs + j];
split_state.ne[j] += src_ss[i].ne[s*n_bufs + j] * src_ss[i].nr[s];
}
split_state.ne[j] *= tensor->ne[split_state.axis];
if (split_state.ne[j] != 0 || tensor->src[i]->ne[src_ss[i].axis] != 0) {
GGML_ASSERT(split_state.ne[j] % tensor->src[i]->ne[src_ss[i].axis] == 0);
split_state.ne[j] /= tensor->src[i]->ne[src_ss[i].axis];
const int64_t div = tensor->src[i]->ne[src_ss[i].axis] * split_state.nr[0];
GGML_ASSERT(split_state.ne[j] % div == 0);
split_state.ne[j] /= div;
}
}
} else {
GGML_ASSERT(split_state.n_segments == 1);
for (size_t j = 0; j < n_bufs; j++) {
// Assert that ratio is consistent:
int64_t sum = 0;
for (size_t s = 0; s < src_ss[i].n_segments; s++) {
sum += src_ss[i].ne[s*n_bufs + j];
sum += src_ss[i].ne[s*n_bufs + j] * src_ss[i].nr[s];
}
// Assert that ratio is consistent:
GGML_ASSERT(split_state.ne[j] * tensor->src[i]->ne[src_ss[i].axis]
== sum * tensor->ne[split_state.axis]);
GGML_ASSERT(split_state.ne[j]*split_state.nr[0] * tensor->src[i]->ne[src_ss[i].axis]
== sum * tensor->ne[split_state.axis]);
}
}
first_src_split_by_axis = false;
@@ -1080,13 +1070,14 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
srcs_info += ", ";
}
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor->src[0], true);
GGML_ASSERT(split_state.n_segments == 1);
const char * axis_name = ggml_backend_meta_split_axis_name(split_state.axis);
std::string ne_info;
for (size_t j = 0; j < n_bufs; j++) {
if (!ne_info.empty()) {
ne_info += ", ";
}
ne_info += std::to_string(split_state.ne[j]);
ne_info += std::to_string(split_state.ne[j]) + "x" + std::to_string(split_state.nr[0]);
}
srcs_info += std::string(tensor->src[i]->name) + "[" + ggml_op_name(tensor->src[i]->op) + ", " + axis_name + ", {" + ne_info + "}]";
}
@@ -1095,7 +1086,8 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
if (!ne_info.empty()) {
ne_info += ", ";
}
ne_info += std::to_string(buf_ctx->split_state_cache[key].first.ne[j]);
const ggml_backend_meta_split_state & ss = buf_ctx->split_state_cache[key].first;
ne_info += std::to_string(ss.ne[j]) + "x" + std::to_string(ss.nr[0]);
}
GGML_LOG_DEBUG("SPLIT_STATE: {%s} -> %s[%s, %s, {%s}]\n", srcs_info.c_str(), tensor->name, ggml_op_name(tensor->op),
ggml_backend_meta_split_axis_name(buf_ctx->split_state_cache[key].first.axis), ne_info.c_str());
@@ -1107,8 +1099,10 @@ static struct ggml_backend_meta_split_state ggml_backend_meta_get_split_state(
#ifndef NDEBUG
if (ret.axis >= 0 && ret.axis < GGML_MAX_DIMS) {
int64_t ne_ret = 0;
for (size_t sj = 0; sj < ret.n_segments*n_bufs; sj++) {
ne_ret += ret.ne[sj];
for (size_t s = 0; s < ret.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
ne_ret += ret.ne[s*n_bufs + j] * ret.nr[s];
}
}
assert(ne_ret == tensor->ne[int(ret.axis)]);
}
@@ -1155,7 +1149,7 @@ static enum ggml_status ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_m
// GGML_ASSERT(ggml_is_contiguously_allocated(tensor));
ne[split_dim] = 0;
for (size_t s = 0; s < split_state.n_segments; s++) {
ne[split_dim] += split_state.ne[s*n_simple_bufs + j];
ne[split_dim] += split_state.ne[s*n_simple_bufs + j] * split_state.nr[s];
}
for (int i = 0; i < GGML_MAX_DIMS; i++) {
if (tensor->nb[i] > tensor->nb[split_dim]) {
@@ -1229,7 +1223,7 @@ static enum ggml_status ggml_backend_meta_buffer_init_tensor_impl(ggml_backend_m
for (size_t j = 0; j < n_simple_bufs; j++) {
int64_t ne_sum = 0;
for (size_t s = 0; s < split_state_src.n_segments; s++) {
ne_sum += split_state_src.ne[s*n_simple_bufs + j];
ne_sum += split_state_src.ne[s*n_simple_bufs + j] * split_state_src.nr[s];
}
if (ne_sum == 0) {
simple_tensors[j]->flags &= ~GGML_TENSOR_FLAG_COMPUTE;
@@ -1255,8 +1249,9 @@ static void ggml_backend_meta_buffer_set_tensor(ggml_backend_buffer_t buffer, gg
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor, /*assume_sync =*/ false);
if (split_state.n_segments != 1) {
if (split_state.n_segments != 1 || split_state.nr[0] != 1) {
GGML_ASSERT(split_state.axis >= 0 && split_state.axis < GGML_MAX_DIMS);
GGML_ASSERT(split_state.nr[0] != 0);
GGML_ASSERT(tensor->ne[3] == 1);
size_t offset_data = 0;
@@ -1267,24 +1262,26 @@ static void ggml_backend_meta_buffer_set_tensor(ggml_backend_buffer_t buffer, gg
const size_t row_stride = tensor->nb[1];
GGML_ASSERT(offset % row_stride == 0);
GGML_ASSERT(size % row_stride == 0);
const int64_t r_start = offset / row_stride;
const int64_t r_count = size / row_stride;
GGML_ASSERT(r_start + r_count <= tensor->ne[1]);
const int64_t row_start = offset / row_stride;
const int64_t row_count = size / row_stride;
GGML_ASSERT(row_start + row_count <= tensor->ne[1]);
const int64_t blck_size = ggml_blck_size(tensor->type);
for (size_t s = 0; s < split_state.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
GGML_ASSERT(split_state.ne[s*n_bufs + j] % blck_size == 0);
const size_t nbytes = split_state.ne[s*n_bufs + j]/blck_size * tensor->nb[0];
ggml_backend_tensor_set_2d(simple_tensor, (const char *) data + offset_data,
simple_offsets[j] + r_start * simple_tensor->nb[1], nbytes,
r_count, simple_tensor->nb[1], tensor->nb[1]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
for (size_t r = 0; r < split_state.nr[s]; r++) {
for (size_t j = 0; j < n_bufs; j++) {
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
GGML_ASSERT(split_state.ne[s*n_bufs + j] % blck_size == 0);
const size_t nbytes = split_state.ne[s*n_bufs + j]/blck_size * tensor->nb[0];
ggml_backend_tensor_set_2d(simple_tensor, (const char *) data + offset_data,
simple_offsets[j] + row_start * simple_tensor->nb[1], nbytes,
row_count, simple_tensor->nb[1], tensor->nb[1]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
}
}
}
GGML_ASSERT(offset_data*r_count == size);
GGML_ASSERT(offset_data*row_count == size);
return;
}
GGML_ASSERT(split_state.axis == GGML_BACKEND_SPLIT_AXIS_1);
@@ -1292,22 +1289,24 @@ static void ggml_backend_meta_buffer_set_tensor(ggml_backend_buffer_t buffer, gg
const size_t row_stride = tensor->nb[2];
GGML_ASSERT(offset % row_stride == 0);
GGML_ASSERT(size % row_stride == 0);
const int64_t r_start = offset / row_stride;
const int64_t r_count = size / row_stride;
GGML_ASSERT(r_start + r_count <= tensor->ne[2]);
const int64_t row_start = offset / row_stride;
const int64_t row_count = size / row_stride;
GGML_ASSERT(row_start + row_count <= tensor->ne[2]);
for (size_t s = 0; s < split_state.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t nbytes = split_state.ne[s*n_bufs + j] * tensor->nb[1];
ggml_backend_tensor_set_2d(simple_tensor, (const char *) data + offset_data,
simple_offsets[j] + r_start * simple_tensor->nb[2], nbytes,
r_count, simple_tensor->nb[2], tensor->nb[2]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
for (size_t r = 0; r < split_state.nr[s]; r++) {
for (size_t j = 0; j < n_bufs; j++) {
ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t nbytes = split_state.ne[s*n_bufs + j] * tensor->nb[1];
ggml_backend_tensor_set_2d(simple_tensor, (const char *) data + offset_data,
simple_offsets[j] + row_start * simple_tensor->nb[2], nbytes,
row_count, simple_tensor->nb[2], tensor->nb[2]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
}
}
}
GGML_ASSERT(offset_data*r_count == size);
GGML_ASSERT(offset_data*row_count == size);
return;
}
@@ -1365,8 +1364,9 @@ static void ggml_backend_meta_buffer_get_tensor(ggml_backend_buffer_t buffer, co
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor, /*assume_sync =*/ false);
if (split_state.n_segments != 1) {
if (split_state.n_segments != 1 || split_state.nr[0] != 1) {
GGML_ASSERT(split_state.axis >= 0 && split_state.axis < GGML_MAX_DIMS);
GGML_ASSERT(split_state.nr[0] != 0);
GGML_ASSERT(tensor->ne[3] == 1);
size_t offset_data = 0;
@@ -1377,24 +1377,26 @@ static void ggml_backend_meta_buffer_get_tensor(ggml_backend_buffer_t buffer, co
const size_t row_stride = tensor->nb[1];
GGML_ASSERT(offset % row_stride == 0);
GGML_ASSERT(size % row_stride == 0);
const int64_t r_start = offset / row_stride;
const int64_t r_count = size / row_stride;
GGML_ASSERT(r_start + r_count <= tensor->ne[1]);
const int64_t row_start = offset / row_stride;
const int64_t row_count = size / row_stride;
GGML_ASSERT(row_start + row_count <= tensor->ne[1]);
const int64_t blck_size = ggml_blck_size(tensor->type);
for (size_t s = 0; s < split_state.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
GGML_ASSERT(split_state.ne[s*n_bufs + j] % blck_size == 0);
const size_t nbytes = split_state.ne[s*n_bufs + j]/blck_size * tensor->nb[0];
ggml_backend_tensor_get_2d(simple_tensor, (char *) data + offset_data,
simple_offsets[j] + r_start * simple_tensor->nb[1], nbytes,
r_count, simple_tensor->nb[1], tensor->nb[1]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
for (size_t r = 0; r < split_state.nr[s]; r++) {
for (size_t j = 0; j < n_bufs; j++) {
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
GGML_ASSERT(split_state.ne[s*n_bufs + j] % blck_size == 0);
const size_t nbytes = split_state.ne[s*n_bufs + j]/blck_size * tensor->nb[0];
ggml_backend_tensor_get_2d(simple_tensor, (char *) data + offset_data,
simple_offsets[j] + row_start * simple_tensor->nb[1], nbytes,
row_count, simple_tensor->nb[1], tensor->nb[1]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
}
}
}
GGML_ASSERT(offset_data*r_count == size);
GGML_ASSERT(offset_data*row_count == size);
return;
}
GGML_ASSERT(split_state.axis == GGML_BACKEND_SPLIT_AXIS_1);
@@ -1402,22 +1404,24 @@ static void ggml_backend_meta_buffer_get_tensor(ggml_backend_buffer_t buffer, co
const size_t row_stride = tensor->nb[2];
GGML_ASSERT(offset % row_stride == 0);
GGML_ASSERT(size % row_stride == 0);
const int64_t r_start = offset / row_stride;
const int64_t r_count = size / row_stride;
GGML_ASSERT(r_start + r_count <= tensor->ne[2]);
const int64_t row_start = offset / row_stride;
const int64_t row_count = size / row_stride;
GGML_ASSERT(row_start + row_count <= tensor->ne[2]);
for (size_t s = 0; s < split_state.n_segments; s++) {
for (size_t j = 0; j < n_bufs; j++) {
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t nbytes = split_state.ne[s*n_bufs + j] * tensor->nb[1];
ggml_backend_tensor_get_2d(simple_tensor, (char *) data + offset_data,
simple_offsets[j] + r_start * simple_tensor->nb[2], nbytes,
r_count, simple_tensor->nb[2], tensor->nb[2]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
for (size_t r = 0; r < split_state.nr[s]; r++) {
for (size_t j = 0; j < n_bufs; j++) {
const ggml_tensor * simple_tensor = ggml_backend_meta_buffer_simple_tensor(tensor, j);
const size_t nbytes = split_state.ne[s*n_bufs + j] * tensor->nb[1];
ggml_backend_tensor_get_2d(simple_tensor, (char *) data + offset_data,
simple_offsets[j] + row_start * simple_tensor->nb[2], nbytes,
row_count, simple_tensor->nb[2], tensor->nb[2]);
offset_data += nbytes;
simple_offsets[j] += nbytes;
}
}
}
GGML_ASSERT(offset_data*r_count == size);
GGML_ASSERT(offset_data*row_count == size);
return;
}
@@ -1675,6 +1679,7 @@ static void ggml_backend_meta_set_tensor_async(ggml_backend_t backend, ggml_tens
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor, /*assume_sync =*/ false);
GGML_ASSERT(split_state.n_segments == 1);
GGML_ASSERT(split_state.nr[0] == 1);
switch (split_state.axis) {
case GGML_BACKEND_SPLIT_AXIS_0:
@@ -1719,6 +1724,7 @@ static void ggml_backend_meta_get_tensor_async(ggml_backend_t backend, const ggm
const ggml_backend_meta_split_state split_state = ggml_backend_meta_get_split_state(tensor, /*assume_sync =*/ false);
GGML_ASSERT(split_state.n_segments == 1);
GGML_ASSERT(split_state.nr[0] == 1);
switch (split_state.axis) {
case GGML_BACKEND_SPLIT_AXIS_0:
@@ -2076,6 +2082,7 @@ static enum ggml_status ggml_backend_meta_graph_compute(ggml_backend_t backend,
node_zero->src[0] = node;
ggml_set_op_params_f32(node_zero, 0, 0.0f);
node_zero->data = node->data;
node_zero->buffer = node->buffer;
node_zero->flags |= GGML_TENSOR_FLAG_COMPUTE;
step_cgraphs[j] = get_cgraph_aux();

View File

@@ -977,6 +977,35 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
sumf = hsum_float_8(acc);
*s = sumf;
#elif defined(__loongarch_sx)
__m128 acc = (__m128)__lsx_vldi(0);
for (; ib < nb; ++ib) {
const float d = GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d);
const __m128i qx_0 = __lsx_vld((const __m128i *)x[ib].qs, 0);
const __m128i qx_1 = __lsx_vld((const __m128i *)x[ib].qs + 1, 0);
const __m128i qy_0 = __lsx_vld((const __m128i *)y[ib].qs, 0);
const __m128i qy_1 = __lsx_vld((const __m128i *)y[ib].qs + 1, 0);
const __m128i p16_0 = lsx_maddubs_h(qx_0, qy_0);
const __m128i p16_1 = lsx_maddubs_h(qx_1, qy_1);
// Sum int16 pairs → int32
const __m128i s_0 = __lsx_vaddwev_w_h(p16_0, p16_1);
const __m128i s_1 = __lsx_vaddwod_w_h(p16_0, p16_1);
const __m128 q = __lsx_vffint_s_w(__lsx_vadd_w(s_0, s_1));
acc = __lsx_vfmadd_s(__lsx_vreplfr2vr_s(d), q, acc);
}
__m128 res = lsx_hadd_s(acc, acc);
res = lsx_hadd_s(res, res);
sumf = ((v4f32)res)[0];
*s = sumf;
#else
UNUSED(nb);
UNUSED(ib);
@@ -1443,6 +1472,99 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
*s = hsum_float_8(acc);
#elif defined(__loongarch_sx)
const __m128i m32s = __lsx_vreplgr2vr_b(32);
__m128 acc_0 = (__m128)__lsx_vldi(0);
__m128 acc_1 = (__m128)__lsx_vldi(0);
for (int i = 0; i < nb; ++i) {
const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
const uint8_t * GGML_RESTRICT q4 = x[i].ql;
const uint8_t * GGML_RESTRICT qh = x[i].qh;
const int8_t * GGML_RESTRICT q8 = y[i].qs;
const __m128i scale_i8 = __lsx_vld(x[i].scales, 0);
const __m128i scales_lo = __lsx_vsllwil_h_b(scale_i8, 0);
const __m128i scales_hi = __lsx_vsllwil_h_b(__lsx_vbsrl_v(scale_i8, 8), 0);
__m128i sumi_0 = __lsx_vldi(0);
__m128i sumi_1 = __lsx_vldi(0);
for (int j = 0; j < QK_K/128; ++j) {
const __m128i q4bitsH_0 = __lsx_vld((const __m128i*)qh, 0); qh += 16;
const __m128i q4bitsH_1 = __lsx_vld((const __m128i*)qh, 0); qh += 16;
const __m128i q4h_0 = __lsx_vslli_b(__lsx_vandi_b(q4bitsH_0, 3), 4);
const __m128i q4h_1 = __lsx_vslli_b(__lsx_vandi_b(q4bitsH_1, 3), 4);
const __m128i q4h_2 = __lsx_vslli_b(__lsx_vandi_b(q4bitsH_0, 3 << 2), 2);
const __m128i q4h_3 = __lsx_vslli_b(__lsx_vandi_b(q4bitsH_1, 3 << 2), 2);
const __m128i q4h_4 = __lsx_vandi_b(q4bitsH_0, 3 << 4);
const __m128i q4h_5 = __lsx_vandi_b(q4bitsH_1, 3 << 4);
const __m128i q4h_6 = __lsx_vsrli_b(__lsx_vandi_b(q4bitsH_0, 3 << 6), 2);
const __m128i q4h_7 = __lsx_vsrli_b(__lsx_vandi_b(q4bitsH_1, 3 << 6), 2);
const __m128i q4bits1_0 = __lsx_vld((const __m128i*)q4, 0); q4 += 16;
const __m128i q4bits1_1 = __lsx_vld((const __m128i*)q4, 0); q4 += 16;
const __m128i q4bits2_0 = __lsx_vld((const __m128i*)q4, 0); q4 += 16;
const __m128i q4bits2_1 = __lsx_vld((const __m128i*)q4, 0); q4 += 16;
const __m128i q4_0 = __lsx_vor_v(__lsx_vandi_b(q4bits1_0, 0xf), q4h_0);
const __m128i q4_1 = __lsx_vor_v(__lsx_vandi_b(q4bits1_1, 0xf), q4h_1);
const __m128i q4_2 = __lsx_vor_v(__lsx_vandi_b(q4bits2_0, 0xf), q4h_2);
const __m128i q4_3 = __lsx_vor_v(__lsx_vandi_b(q4bits2_1, 0xf), q4h_3);
const __m128i q4_4 = __lsx_vor_v(__lsx_vsrli_b(q4bits1_0, 4), q4h_4);
const __m128i q4_5 = __lsx_vor_v(__lsx_vsrli_b(q4bits1_1, 4), q4h_5);
const __m128i q4_6 = __lsx_vor_v(__lsx_vsrli_b(q4bits2_0, 4), q4h_6);
const __m128i q4_7 = __lsx_vor_v(__lsx_vsrli_b(q4bits2_1, 4), q4h_7);
const __m128i q8_0 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_1 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_2 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_3 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_4 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_5 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_6 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8_7 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
__m128i p16_0 = lsx_maddubs_h(__lsx_vsub_b(q4_0, m32s), q8_0);
__m128i p16_1 = lsx_maddubs_h(__lsx_vsub_b(q4_1, m32s), q8_1);
__m128i p16_2 = lsx_maddubs_h(__lsx_vsub_b(q4_2, m32s), q8_2);
__m128i p16_3 = lsx_maddubs_h(__lsx_vsub_b(q4_3, m32s), q8_3);
__m128i p16_4 = lsx_maddubs_h(__lsx_vsub_b(q4_4, m32s), q8_4);
__m128i p16_5 = lsx_maddubs_h(__lsx_vsub_b(q4_5, m32s), q8_5);
__m128i p16_6 = lsx_maddubs_h(__lsx_vsub_b(q4_6, m32s), q8_6);
__m128i p16_7 = lsx_maddubs_h(__lsx_vsub_b(q4_7, m32s), q8_7);
const __m128i sc_vec = j == 0 ? scales_lo : scales_hi;
p16_0 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 0), p16_0);
p16_1 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 1), p16_1);
p16_2 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 2), p16_2);
p16_3 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 3), p16_3);
p16_4 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 4), p16_4);
p16_5 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 5), p16_5);
p16_6 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 6), p16_6);
p16_7 = lsx_madd_h(__lsx_vreplvei_h(sc_vec, 7), p16_7);
sumi_0 = __lsx_vadd_w(sumi_0, __lsx_vadd_w(p16_0, p16_2));
sumi_1 = __lsx_vadd_w(sumi_1, __lsx_vadd_w(p16_1, p16_3));
sumi_0 = __lsx_vadd_w(sumi_0, __lsx_vadd_w(p16_4, p16_6));
sumi_1 = __lsx_vadd_w(sumi_1, __lsx_vadd_w(p16_5, p16_7));
}
__m128 p_0 = __lsx_vfmul_s(__lsx_vreplfr2vr_s(d), __lsx_vffint_s_w(sumi_0));
__m128 p_1 = __lsx_vfmul_s(__lsx_vreplfr2vr_s(d), __lsx_vffint_s_w(sumi_1));
acc_0 = __lsx_vfadd_s(p_0, acc_0);
acc_1 = __lsx_vfadd_s(p_1, acc_1);
}
*s = hsum_float_4x4(acc_0, acc_1, (__m128)__lsx_vldi(0), (__m128)__lsx_vldi(0));
#else
UNUSED(x);
UNUSED(y);
@@ -2149,6 +2271,35 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
*s = hsum_float_8(accum);
#elif defined(__loongarch_sx)
const __m128i values128 = __lsx_vld((const __m128i*)kvalues_iq4nl, 0);
__m128 accum = (__m128)__lsx_vldi(0);
for (int ibl = 0; ibl < nb; ++ibl) {
const uint8_t * qs = x[ibl].qs;
const int8_t * q8 = y[ibl].qs;
uint16_t sh = x[ibl].scales_h;
__m128i sumi = __lsx_vldi(0);
for (int ib = 0; ib < QK_K/32; ++ib) {
const __m128i q4bits = __lsx_vld((const __m128i*)qs, 0); qs += 16;
const __m128i q8b_0 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q8b_1 = __lsx_vld((const __m128i*)q8, 0); q8 += 16;
const __m128i q4b_0 = __lsx_vshuf_b(values128, values128, __lsx_vandi_b(q4bits, 0xf));
const __m128i q4b_1 = __lsx_vshuf_b(values128, values128, __lsx_vsrli_b(q4bits, 4));
const __m128i p16_0 = lsx_maddubs_h(q4b_0, q8b_0);
const __m128i p16_1 = lsx_maddubs_h(q4b_1, q8b_1);
const int16_t ls = (((x[ibl].scales_l[ib/2] >> ((ib & 1) * 4)) & 0xf) | ((sh & 0x3) << 4)) - 32;
sh >>= 2;
sumi = __lsx_vadd_w(lsx_madd_h(p16_0, __lsx_vreplgr2vr_h(ls)), sumi);
sumi = __lsx_vadd_w(lsx_madd_h(p16_1, __lsx_vreplgr2vr_h(ls)), sumi);
}
const float ds = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
accum = __lsx_vfadd_s(__lsx_vfmul_s(__lsx_vreplfr2vr_s(ds), __lsx_vffint_s_w(sumi)), accum);
}
*s = ((v4f32)lsx_hadd_s(lsx_hadd_s(accum, accum), lsx_hadd_s(accum, accum)))[0];
#else
UNUSED(x);
UNUSED(y);

View File

@@ -2235,8 +2235,42 @@ static void ggml_compute_forward_fill_f32(const ggml_compute_params * params, gg
}
}
static void ggml_compute_forward_fill_f16(const ggml_compute_params * params, ggml_tensor * dst) {
const ggml_fp16_t c = GGML_CPU_FP32_TO_FP16(ggml_get_op_params_f32(dst, 0));
GGML_TENSOR_LOCALS(int64_t, ne, dst, ne);
GGML_TENSOR_LOCALS(size_t, nb, dst, nb);
const auto [ir0, ir1] = get_thread_range(params, dst);
for (int64_t ir = ir0; ir < ir1; ++ir) {
const int64_t i03 = ir/(ne2*ne1);
const int64_t i02 = (ir - i03*ne2*ne1)/ne1;
const int64_t i01 = (ir - i03*ne2*ne1 - i02*ne1);
ggml_fp16_t * dst_ptr = (ggml_fp16_t *) ((char *) dst->data + i03*nb3 + i02*nb2 + i01*nb1);
ggml_vec_set_f16(ne0, dst_ptr, c);
}
}
void ggml_compute_forward_fill(const ggml_compute_params * params, ggml_tensor * dst) {
ggml_compute_forward_fill_f32(params, dst);
const ggml_tensor * src0 = dst->src[0];
switch (src0->type) {
case GGML_TYPE_F32:
{
ggml_compute_forward_fill_f32(params, dst);
} break;
case GGML_TYPE_F16:
{
ggml_compute_forward_fill_f16(params, dst);
} break;
default:
{
GGML_ABORT("unsupported type for ggml_compute_forward_fill: %s", ggml_type_name(src0->type));
}
}
}
// ggml_compute_tri

View File

@@ -1125,25 +1125,12 @@ static inline void __lasx_f32cx8_store(ggml_fp16_t * x, __m256 y) {
#define GGML_F16_EPR 4
static inline __m128 __lsx_f16x4_load(const ggml_fp16_t * x) {
float tmp[4];
tmp[0] = GGML_CPU_FP16_TO_FP32(x[0]);
tmp[1] = GGML_CPU_FP16_TO_FP32(x[1]);
tmp[2] = GGML_CPU_FP16_TO_FP32(x[2]);
tmp[3] = GGML_CPU_FP16_TO_FP32(x[3]);
return (__m128)__lsx_vld(tmp, 0);
return __lsx_vfcvtl_s_h(__lsx_vld((const void *)x, 0));
}
static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
float arr[4];
__lsx_vst(y, arr, 0);
x[0] = GGML_CPU_FP32_TO_FP16(arr[0]);
x[1] = GGML_CPU_FP32_TO_FP16(arr[1]);
x[2] = GGML_CPU_FP32_TO_FP16(arr[2]);
x[3] = GGML_CPU_FP32_TO_FP16(arr[3]);
__m128i a = __lsx_vfcvt_h_s(y, y);
memcpy(x, &a, sizeof(ggml_fp16_t) * 4);
}
#define GGML_F32Cx4 __m128

View File

@@ -273,67 +273,51 @@ void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * G
#if defined(GGML_SIMD)
#if defined(__ARM_FEATURE_SVE)
const int sve_register_length = svcntb() * 8; //get vector length
const int ggml_f16_epr = sve_register_length / 16; // running when 16
const int ggml_f16_step = 8 * ggml_f16_epr; // choose 8 SVE registers
const int ggml_f16_epr = svcnth();
const int ggml_f16_step = 8 * ggml_f16_epr;
const int np = n - (n % ggml_f16_step);
const int np2 = n - (n % ggml_f16_epr);
const int np= (n & ~(ggml_f16_step - 1));
svfloat16_t sum1 = svdup_n_f16(0.0f);
svfloat16_t sum2 = svdup_n_f16(0.0f);
svfloat16_t sum3 = svdup_n_f16(0.0f);
svfloat16_t sum4 = svdup_n_f16(0.0f);
svfloat32_t sum1_lo = svdup_n_f32(0.0f);
svfloat32_t sum1_hi = svdup_n_f32(0.0f);
svfloat32_t sum2_lo = svdup_n_f32(0.0f);
svfloat32_t sum2_hi = svdup_n_f32(0.0f);
svfloat32_t sum3_lo = svdup_n_f32(0.0f);
svfloat32_t sum3_hi = svdup_n_f32(0.0f);
svfloat32_t sum4_lo = svdup_n_f32(0.0f);
svfloat32_t sum4_hi = svdup_n_f32(0.0f);
svfloat16_t ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8;
svfloat16_t ay1, ay2, ay3, ay4, ay5, ay6, ay7, ay8;
for (int i = 0; i < np; i += ggml_f16_step) {
ax1 = GGML_F16x_VEC_LOAD(x + i + 0 * ggml_f16_epr, 0);
ay1 = GGML_F16x_VEC_LOAD(y + i + 0 * ggml_f16_epr, 0);
sum1 = GGML_F16x_VEC_FMA(sum1, ax1, ay1);
ax2 = GGML_F16x_VEC_LOAD(x + i + 1 * ggml_f16_epr, 1);
ay2 = GGML_F16x_VEC_LOAD(y + i + 1 * ggml_f16_epr, 1);
sum2 = GGML_F16x_VEC_FMA(sum2, ax2, ay2);
ax3 = GGML_F16x_VEC_LOAD(x + i + 2 * ggml_f16_epr, 2);
ay3 = GGML_F16x_VEC_LOAD(y + i + 2 * ggml_f16_epr, 2);
sum3 = GGML_F16x_VEC_FMA(sum3, ax3, ay3);
ax4 = GGML_F16x_VEC_LOAD(x + i + 3 * ggml_f16_epr, 3);
ay4 = GGML_F16x_VEC_LOAD(y + i + 3 * ggml_f16_epr, 3);
sum4 = GGML_F16x_VEC_FMA(sum4, ax4, ay4);
ax5 = GGML_F16x_VEC_LOAD(x + i + 4 * ggml_f16_epr, 4);
ay5 = GGML_F16x_VEC_LOAD(y + i + 4 * ggml_f16_epr, 4);
sum1 = GGML_F16x_VEC_FMA(sum1, ax5, ay5);
ax6 = GGML_F16x_VEC_LOAD(x + i + 5 * ggml_f16_epr, 5);
ay6 = GGML_F16x_VEC_LOAD(y + i + 5 * ggml_f16_epr, 5);
sum2 = GGML_F16x_VEC_FMA(sum2, ax6, ay6);
ax7 = GGML_F16x_VEC_LOAD(x + i + 6 * ggml_f16_epr, 6);
ay7 = GGML_F16x_VEC_LOAD(y + i + 6 * ggml_f16_epr, 6);
sum3 = GGML_F16x_VEC_FMA(sum3, ax7, ay7);
ax8 = GGML_F16x_VEC_LOAD(x + i + 7 * ggml_f16_epr, 7);
ay8 = GGML_F16x_VEC_LOAD(y + i + 7 * ggml_f16_epr, 7);
sum4 = GGML_F16x_VEC_FMA(sum4, ax8, ay8);
ggml_sve_f16_fma_widened(&sum1_lo, &sum1_hi, GGML_F16x_VEC_LOAD(x + i + 0 * ggml_f16_epr, 0), GGML_F16x_VEC_LOAD(y + i + 0 * ggml_f16_epr, 0));
ggml_sve_f16_fma_widened(&sum2_lo, &sum2_hi, GGML_F16x_VEC_LOAD(x + i + 1 * ggml_f16_epr, 1), GGML_F16x_VEC_LOAD(y + i + 1 * ggml_f16_epr, 1));
ggml_sve_f16_fma_widened(&sum3_lo, &sum3_hi, GGML_F16x_VEC_LOAD(x + i + 2 * ggml_f16_epr, 2), GGML_F16x_VEC_LOAD(y + i + 2 * ggml_f16_epr, 2));
ggml_sve_f16_fma_widened(&sum4_lo, &sum4_hi, GGML_F16x_VEC_LOAD(x + i + 3 * ggml_f16_epr, 3), GGML_F16x_VEC_LOAD(y + i + 3 * ggml_f16_epr, 3));
ggml_sve_f16_fma_widened(&sum1_lo, &sum1_hi, GGML_F16x_VEC_LOAD(x + i + 4 * ggml_f16_epr, 4), GGML_F16x_VEC_LOAD(y + i + 4 * ggml_f16_epr, 4));
ggml_sve_f16_fma_widened(&sum2_lo, &sum2_hi, GGML_F16x_VEC_LOAD(x + i + 5 * ggml_f16_epr, 5), GGML_F16x_VEC_LOAD(y + i + 5 * ggml_f16_epr, 5));
ggml_sve_f16_fma_widened(&sum3_lo, &sum3_hi, GGML_F16x_VEC_LOAD(x + i + 6 * ggml_f16_epr, 6), GGML_F16x_VEC_LOAD(y + i + 6 * ggml_f16_epr, 6));
ggml_sve_f16_fma_widened(&sum4_lo, &sum4_hi, GGML_F16x_VEC_LOAD(x + i + 7 * ggml_f16_epr, 7), GGML_F16x_VEC_LOAD(y + i + 7 * ggml_f16_epr, 7));
}
const int np2 = (n & ~(ggml_f16_epr - 1)); // round down to multiple of 8
for (int k = np; k < np2; k += ggml_f16_epr) {
svfloat16_t rx = GGML_F16x_VEC_LOAD(x + k, 0);
svfloat16_t ry = GGML_F16x_VEC_LOAD(y + k, 0);
sum1 = GGML_F16x_VEC_FMA(sum1, rx, ry);
for (int i = np; i < np2; i += ggml_f16_epr) {
ggml_sve_f16_fma_widened(&sum1_lo, &sum1_hi, GGML_F16x_VEC_LOAD(x + i, 0), GGML_F16x_VEC_LOAD(y + i, 0));
}
if (np2 < n) {
svbool_t pg = svwhilelt_b16(np2, n);
svfloat16_t hx = svld1_f16(pg, (const __fp16 *)(x + np2));
svfloat16_t hy = svld1_f16(pg, (const __fp16 *)(y + np2));
const svbool_t pg = svwhilelt_b16(np2, n);
const svfloat16_t rx = svld1_f16(pg, (const __fp16 *)(x + np2));
const svfloat16_t ry = svld1_f16(pg, (const __fp16 *)(y + np2));
sum1 = svmad_f16_x(pg, hx, hy, sum1);
ggml_sve_f16_fma_widened(&sum1_lo, &sum1_hi, rx, ry);
}
GGML_F16x_VEC_REDUCE(sumf, sum1, sum2, sum3, sum4);
sum1_lo = svadd_f32_m(DEFAULT_PG32, sum1_lo, sum2_lo);
sum1_hi = svadd_f32_m(DEFAULT_PG32, sum1_hi, sum2_hi);
sum3_lo = svadd_f32_m(DEFAULT_PG32, sum3_lo, sum4_lo);
sum3_hi = svadd_f32_m(DEFAULT_PG32, sum3_hi, sum4_hi);
sum1_lo = svadd_f32_m(DEFAULT_PG32, sum1_lo, sum3_lo);
sum1_hi = svadd_f32_m(DEFAULT_PG32, sum1_hi, sum3_hi);
sumf = ggml_sve_sum_f32x2(sum1_lo, sum1_hi);
#elif defined(__riscv_v_intrinsic)
#if defined(__riscv_zvfh)
int vl = __riscv_vsetvlmax_e32m2();

View File

@@ -14,6 +14,35 @@
// floating point type used to accumulate sums
typedef double ggml_float;
#if defined(__ARM_FEATURE_SVE)
inline static void ggml_sve_f16_fma_widened(
svfloat32_t * acc_lo,
svfloat32_t * acc_hi,
svfloat16_t x,
svfloat16_t y) {
#if defined(__ARM_FEATURE_SVE2)
*acc_lo = svmlalb_f32(*acc_lo, x, y);
*acc_hi = svmlalt_f32(*acc_hi, x, y);
#else
// Plain SVE fallback path if SVE2 instructions not available
svfloat16_t x_even = svtrn1_f16(x, x);
svfloat16_t x_odd = svtrn2_f16(x, x);
svfloat16_t y_even = svtrn1_f16(y, y);
svfloat16_t y_odd = svtrn2_f16(y, y);
svbool_t pg = svptrue_b32();
*acc_lo = svmla_f32_x(pg, *acc_lo, svcvt_f32_f16_x(pg, x_even), svcvt_f32_f16_x(pg, y_even));
*acc_hi = svmla_f32_x(pg, *acc_hi, svcvt_f32_f16_x(pg, x_odd), svcvt_f32_f16_x(pg, y_odd));
#endif
}
inline static ggml_float ggml_sve_sum_f32x2(svfloat32_t sum_lo, svfloat32_t sum_hi) {
return (ggml_float) (svaddv_f32(svptrue_b32(), sum_lo) + svaddv_f32(svptrue_b32(), sum_hi));
}
#endif
#define GGML_GELU_FP16
#define GGML_GELU_QUICK_FP16
@@ -122,108 +151,61 @@ inline static void ggml_vec_dot_f16_unroll(const int n, const int xs, float * GG
#if defined(GGML_SIMD)
#if defined(__ARM_FEATURE_SVE)
const int sve_register_length = svcntb() * 8;
const int ggml_f16_epr = sve_register_length / 16; // running when 16
const int ggml_f16_step = 8 * ggml_f16_epr; // choose 8 SVE registers
const int ggml_f16_epr = svcnth();
const int ggml_f16_step = 2 * ggml_f16_epr;
int np = n - (n % ggml_f16_step);
int np2 = n - (n % ggml_f16_epr);
int np = (n & ~(ggml_f16_step - 1));
svfloat16_t sum_00 = svdup_n_f16(0.0f);
svfloat16_t sum_01 = svdup_n_f16(0.0f);
svfloat16_t sum_02 = svdup_n_f16(0.0f);
svfloat16_t sum_03 = svdup_n_f16(0.0f);
svfloat16_t sum_10 = svdup_n_f16(0.0f);
svfloat16_t sum_11 = svdup_n_f16(0.0f);
svfloat16_t sum_12 = svdup_n_f16(0.0f);
svfloat16_t sum_13 = svdup_n_f16(0.0f);
svfloat16_t ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8;
svfloat16_t ay1, ay2, ay3, ay4, ay5, ay6, ay7, ay8;
svfloat32_t sum_0_0_lo = svdup_n_f32(0.0f);
svfloat32_t sum_0_0_hi = svdup_n_f32(0.0f);
svfloat32_t sum_0_1_lo = svdup_n_f32(0.0f);
svfloat32_t sum_0_1_hi = svdup_n_f32(0.0f);
svfloat32_t sum_1_0_lo = svdup_n_f32(0.0f);
svfloat32_t sum_1_0_hi = svdup_n_f32(0.0f);
svfloat32_t sum_1_1_lo = svdup_n_f32(0.0f);
svfloat32_t sum_1_1_hi = svdup_n_f32(0.0f);
for (int i = 0; i < np; i += ggml_f16_step) {
ay1 = GGML_F16x_VEC_LOAD(y + i + 0 * ggml_f16_epr, 0); // 8 elements
const svfloat16_t ay0 = GGML_F16x_VEC_LOAD(y + i, 0);
const svfloat16_t ax00 = GGML_F16x_VEC_LOAD(x[0] + i, 0);
const svfloat16_t ax01 = GGML_F16x_VEC_LOAD(x[1] + i, 0);
ax1 = GGML_F16x_VEC_LOAD(x[0] + i + 0*ggml_f16_epr, 0); // 8 elements
sum_00 = GGML_F16x_VEC_FMA(sum_00, ax1, ay1); // sum_00 = sum_00+ax1*ay1
ax1 = GGML_F16x_VEC_LOAD(x[1] + i + 0*ggml_f16_epr, 0); // 8 elements
sum_10 = GGML_F16x_VEC_FMA(sum_10, ax1, ay1);
ggml_sve_f16_fma_widened(&sum_0_0_lo, &sum_0_0_hi, ax00, ay0);
ggml_sve_f16_fma_widened(&sum_1_0_lo, &sum_1_0_hi, ax01, ay0);
ay2 = GGML_F16x_VEC_LOAD(y + i + 1 * ggml_f16_epr, 1); // next 8 elements
const svfloat16_t ay1 = GGML_F16x_VEC_LOAD(y + i + 1 * ggml_f16_epr, 0);
const svfloat16_t ax10 = GGML_F16x_VEC_LOAD(x[0] + i + 1 * ggml_f16_epr, 0);
const svfloat16_t ax11 = GGML_F16x_VEC_LOAD(x[1] + i + 1 * ggml_f16_epr, 0);
ax2 = GGML_F16x_VEC_LOAD(x[0] + i + 1*ggml_f16_epr, 1); // next 8 elements
sum_01 = GGML_F16x_VEC_FMA(sum_01, ax2, ay2);
ax2 = GGML_F16x_VEC_LOAD(x[1] + i + 1*ggml_f16_epr, 1);
sum_11 = GGML_F16x_VEC_FMA(sum_11, ax2, ay2);
ay3 = GGML_F16x_VEC_LOAD(y + i + 2 * ggml_f16_epr, 2);
ax3 = GGML_F16x_VEC_LOAD(x[0] + i + 2*ggml_f16_epr, 2);
sum_02 = GGML_F16x_VEC_FMA(sum_02, ax3, ay3);
ax3 = GGML_F16x_VEC_LOAD(x[1] + i + 2*ggml_f16_epr, 2);
sum_12 = GGML_F16x_VEC_FMA(sum_12, ax3, ay3);
ay4 = GGML_F16x_VEC_LOAD(y + i + 3 * ggml_f16_epr, 3);
ax4 = GGML_F16x_VEC_LOAD(x[0] + i + 3*ggml_f16_epr, 3);
sum_03 = GGML_F16x_VEC_FMA(sum_03, ax4, ay4);
ax4 = GGML_F16x_VEC_LOAD(x[1] + i + 3*ggml_f16_epr, 3);
sum_13 = GGML_F16x_VEC_FMA(sum_13, ax4, ay4);
ay5 = GGML_F16x_VEC_LOAD(y + i + 4 * ggml_f16_epr, 4);
ax5 = GGML_F16x_VEC_LOAD(x[0] + i + 4*ggml_f16_epr, 4);
sum_00 = GGML_F16x_VEC_FMA(sum_00, ax5, ay5);
ax5 = GGML_F16x_VEC_LOAD(x[1] + i + 4*ggml_f16_epr, 4);
sum_10 = GGML_F16x_VEC_FMA(sum_10, ax5, ay5);
ay6 = GGML_F16x_VEC_LOAD(y + i + 5 * ggml_f16_epr, 5);
ax6 = GGML_F16x_VEC_LOAD(x[0] + i + 5*ggml_f16_epr, 5);
sum_01 = GGML_F16x_VEC_FMA(sum_01, ax6, ay6);
ax6 = GGML_F16x_VEC_LOAD(x[1] + i + 5*ggml_f16_epr, 5);
sum_11 = GGML_F16x_VEC_FMA(sum_11, ax6, ay6);
ay7 = GGML_F16x_VEC_LOAD(y + i + 6 * ggml_f16_epr, 6);
ax7 = GGML_F16x_VEC_LOAD(x[0] + i + 6*ggml_f16_epr, 6);
sum_02 = GGML_F16x_VEC_FMA(sum_02, ax7, ay7);
ax7 = GGML_F16x_VEC_LOAD(x[1] + i + 6*ggml_f16_epr, 6);
sum_12 = GGML_F16x_VEC_FMA(sum_12, ax7, ay7);
ay8 = GGML_F16x_VEC_LOAD(y + i + 7 * ggml_f16_epr, 7);
ax8 = GGML_F16x_VEC_LOAD(x[0] + i + 7*ggml_f16_epr, 7);
sum_03 = GGML_F16x_VEC_FMA(sum_03, ax8, ay8);
ax8 = GGML_F16x_VEC_LOAD(x[1] + i + 7*ggml_f16_epr, 7);
sum_13 = GGML_F16x_VEC_FMA(sum_13, ax8, ay8);
ggml_sve_f16_fma_widened(&sum_0_1_lo, &sum_0_1_hi, ax10, ay1);
ggml_sve_f16_fma_widened(&sum_1_1_lo, &sum_1_1_hi, ax11, ay1);
}
const int np2 = (n & ~(ggml_f16_epr - 1));
for (int k = np; k < np2; k += ggml_f16_epr) {
svfloat16_t ry = GGML_F16x_VEC_LOAD(y + k, 0);
for (int i = np; i < np2; i += ggml_f16_epr) {
const svfloat16_t ry = GGML_F16x_VEC_LOAD(y + i, 0);
const svfloat16_t rx0 = GGML_F16x_VEC_LOAD(x[0] + i, 0);
const svfloat16_t rx1 = GGML_F16x_VEC_LOAD(x[1] + i, 0);
svfloat16_t rx = GGML_F16x_VEC_LOAD(x[0] + k, 0);
sum_00 = GGML_F16x_VEC_FMA(sum_00, rx, ry);
rx = GGML_F16x_VEC_LOAD(x[1] + k, 0);
sum_10 = GGML_F16x_VEC_FMA(sum_10, rx, ry);
ggml_sve_f16_fma_widened(&sum_0_0_lo, &sum_0_0_hi, rx0, ry);
ggml_sve_f16_fma_widened(&sum_1_0_lo, &sum_1_0_hi, rx1, ry);
}
if (np2 < n) {
svbool_t pg = svwhilelt_b16(np2, n);
svfloat16_t hx_0 = svld1_f16(pg, (const __fp16 *)(x[0] + np2));
svfloat16_t hx_1 = svld1_f16(pg, (const __fp16 *)(x[1] + np2));
svfloat16_t hy = svld1_f16(pg, (const __fp16 *)(y + np2));
const svbool_t pg = svwhilelt_b16(np2, n);
const svfloat16_t ay = svld1_f16(pg, (const __fp16 *)(y + np2));
const svfloat16_t ax0 = svld1_f16(pg, (const __fp16 *)(x[0] + np2));
const svfloat16_t ax1 = svld1_f16(pg, (const __fp16 *)(x[1] + np2));
sum_00 = svmad_f16_x(pg, hx_0, hy, sum_00);
sum_10 = svmad_f16_x(pg, hx_1, hy, sum_10);
ggml_sve_f16_fma_widened(&sum_0_0_lo, &sum_0_0_hi, ax0, ay);
ggml_sve_f16_fma_widened(&sum_1_0_lo, &sum_1_0_hi, ax1, ay);
}
GGML_F16x_VEC_REDUCE(sumf[0], sum_00, sum_01, sum_02, sum_03);
GGML_F16x_VEC_REDUCE(sumf[1], sum_10, sum_11, sum_12, sum_13);
svfloat32_t sum_0_lo = svadd_f32_x(DEFAULT_PG32, sum_0_0_lo, sum_0_1_lo);
svfloat32_t sum_0_hi = svadd_f32_x(DEFAULT_PG32, sum_0_0_hi, sum_0_1_hi);
svfloat32_t sum_1_lo = svadd_f32_x(DEFAULT_PG32, sum_1_0_lo, sum_1_1_lo);
svfloat32_t sum_1_hi = svadd_f32_x(DEFAULT_PG32, sum_1_0_hi, sum_1_1_hi);
sumf[0] = ggml_sve_sum_f32x2(sum_0_lo, sum_0_hi);
sumf[1] = ggml_sve_sum_f32x2(sum_1_lo, sum_1_hi);
np = n;
#elif defined(__riscv_v_intrinsic)
#if defined(__riscv_zvfh)

View File

@@ -7,6 +7,7 @@
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <mutex>
#if defined(GGML_USE_HIP)
#define GGML_COMMON_DECL_HIP
@@ -110,11 +111,14 @@
# define GGML_CUDA_USE_CUB
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && CUDART_VERSION >= 11070
// PDL host-side support (cudaLaunchKernelEx) requires CUDART >= 11.8 and excludes HIP/MUSA.
// PDL host-side support (cudaLaunchKernelEx) requires CUDART >= 11.8.
// However, this has been bugged in CTK < 12.3 for MSVC builds, see
// https://github.com/ggml-org/llama.cpp/pull/22522#discussion_r3302393293
// __CUDA_ARCH__ is undefined in host passes; GPU arch check happens in device-side code.
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && CUDART_VERSION >= 11080
#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && \
(CUDART_VERSION >= 12030 || (!(defined(_MSC_VER) && !defined(__clang__)) && CUDART_VERSION >= 11080))
# define GGML_CUDA_USE_PDL
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && CUDART_VERSION >= 11080
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) && (CUDART_VERSION >= 12030 || (!(defined(_MSC_VER) && !defined(__clang__)) && CUDART_VERSION >= 11080))
static __device__ __forceinline__ void ggml_cuda_pdl_sync() {
#if defined(GGML_CUDA_USE_PDL) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= GGML_CUDA_CC_HOPPER
@@ -1549,6 +1553,62 @@ struct ggml_cuda_pdl_config {
ggml_cuda_pdl_config& operator=(ggml_cuda_pdl_config&&) = delete;
};
static bool ggml_cuda_kernel_can_use_pdl(const void * kernel) {
const int device = ggml_cuda_get_device();
struct cache_key {
int device;
const void * kernel;
bool operator==(const cache_key & other) const { return device == other.device && kernel == other.kernel; }
};
struct cache_key_hash {
// MurmurHash3 mixing function for better hash distribution (vs. just std::hash which in some implementations simply returns the identity)
static size_t hash_mix(size_t x) {
std::uint64_t y = x;
const std::uint64_t m = 0xe9846af9b1a615d;
y ^= y >> 32;
y *= m;
y ^= y >> 32;
y *= m;
y ^= y >> 28;
return static_cast<size_t>(y);
}
size_t operator()(const cache_key & key) const {
// Use a nonzero seed to avoid mapping all-zero keys to zero
size_t h = 42;
h = hash_mix(h + key.device);
h = hash_mix(h + reinterpret_cast<size_t>(key.kernel));
return h;
}
};
static std::mutex cache_mutex;
static std::unordered_map<cache_key, bool, cache_key_hash> cache;
const cache_key key = { device, kernel };
std::lock_guard<std::mutex> lock(cache_mutex);
const auto it = cache.find(key);
if (it != cache.end()) {
return it->second;
}
cudaFuncAttributes attr = {};
CUDA_CHECK(cudaFuncGetAttributes(&attr, kernel));
// PDL device-side primitives are emitted only for PTX versions >= 90.
// We have to guard on a loaded kernel's PTX version so a kernel forward-JIT'ed
// from pre-Hopper PTX to a Hopper-or-newer GPU does not opt into PDL.
const bool can_use_pdl = attr.ptxVersion >= 90;
cache.emplace(key, can_use_pdl);
return can_use_pdl;
}
#endif //defined(GGML_CUDA_USE_PDL)
@@ -1561,8 +1621,7 @@ static __inline__ void ggml_cuda_kernel_launch(Kernel kernel, const ggml_cuda_ke
return env == nullptr || std::atoi(env) != 0;
}();
const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
if (env_pdl_enabled && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_HOPPER) {
if (env_pdl_enabled && ggml_cuda_kernel_can_use_pdl(reinterpret_cast<const void *>(kernel))) {
auto pdl_cfg = ggml_cuda_pdl_config(launch_params);
CUDA_CHECK(cudaLaunchKernelEx(&pdl_cfg.cfg, kernel, std::forward<Args>(args)... ));

View File

@@ -1153,8 +1153,8 @@ void launch_fattn(
GGML_ASSERT(block_dim.x % warp_size == 0);
const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(blocks_num, block_dim, nbytes_shared, main_stream);
ggml_cuda_kernel_launch(fattn_kernel, launch_params,
// disabled PDL enrollment for now due to a compiler bug.
fattn_kernel<<<blocks_num, block_dim, nbytes_shared, main_stream>>>(
(const char *) Q->data,
K_data,
V_data,

View File

@@ -472,7 +472,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
const int i = 8 * (threadIdx.x % (nbatch_fa/8));
cp_async_cg_16<preload>(tile_mask_32 + j_sram*(nbatch_fa*sizeof(half) + 16) + i*sizeof(half), mask_h + j_vram*stride_mask + i);
cp_async_cg_16<preload>(tile_mask_32 + j_sram*(nbatch_fa*sizeof(half) + 16) + i*sizeof(half), mask_h + int64_t(j_vram)*stride_mask + i);
}
} else if constexpr (oob_check) {
#pragma unroll
@@ -488,7 +488,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
for (int i0 = 0; i0 < nbatch_fa; i0 += warp_size) {
const int i = i0 + threadIdx.x;
tile_mask[j_sram*(nbatch_fa + 8) + i] = i < i_sup ? mask_h[j_vram*stride_mask + i] : half(0.0f);
tile_mask[j_sram*(nbatch_fa + 8) + i] = i < i_sup ? mask_h[int64_t(j_vram)*stride_mask + i] : half(0.0f);
}
}
} else if constexpr (nbatch_fa < 2*warp_size) {
@@ -505,7 +505,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
const int i = threadIdx.x % (warp_size/cols_per_warp);
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + 2*i, mask_h + j_vram*stride_mask + 2*i);
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + 2*i, mask_h + int64_t(j_vram)*stride_mask + 2*i);
}
} else {
#pragma unroll
@@ -521,7 +521,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
for (int i0 = 0; i0 < nbatch_fa; i0 += 2*warp_size) {
const int i = i0 + 2*threadIdx.x;
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + i, mask_h + j_vram*stride_mask + i);
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + i, mask_h + int64_t(j_vram)*stride_mask + i);
}
}
}
@@ -568,7 +568,6 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
constexpr bool Q_in_reg = ggml_cuda_fattn_mma_get_Q_in_reg (DKQ, DV, ncols);
constexpr int nstages = ggml_cuda_fattn_mma_get_nstages (DKQ, DV, ncols1, ncols2);
constexpr int stride_tile_Q = DKQ/2 + 4;
constexpr int stride_tile_K = nbatch_K2 + 4;
constexpr int stride_tile_V = V_is_K_view ? stride_tile_K : nbatch_V2 + 4;
@@ -604,9 +603,9 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#pragma unroll
for (int k0_start = (DKQ/2-1) - (DKQ/2-1) % nbatch_K2; k0_start >= 0; k0_start -= nbatch_K2) {
const int k0_stop = k0_start + nbatch_K2 < DKQ/2 ? k0_start + nbatch_K2 : DKQ/2;
const int k0_diff = k0_stop - k0_start;
if constexpr (nstages <= 1) {
const int k0_diff = k0_stop - k0_start;
constexpr bool use_cp_async = nstages == 1;
flash_attn_ext_f16_load_tile<stride_tile_K, nwarps, nbatch_fa, use_cp_async, oob_check>
(K_h2 + int64_t(k_VKQ_0)*stride_K + k0_start, tile_K, k0_diff, stride_K, k_VKQ_sup);
@@ -640,6 +639,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
}
}
} else {
constexpr int stride_tile_Q = DKQ/2 + 4;
#pragma unroll
for (int k_KQ_0 = k0_start; k_KQ_0 < k0_stop; k_KQ_0 += T_A_KQ::J) {
load_ldmatrix(Q_B[0], tile_Q + (threadIdx.y / np)*(T_B_KQ::I*stride_tile_Q) + k_KQ_0, stride_tile_Q);
@@ -954,9 +954,9 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
for (int i0_start = 0; i0_start < DV; i0_start += 2*nbatch_V2) {
static_assert(DV % (2*nbatch_V2) == 0, "bad loop size");
const int i0_stop = i0_start + 2*nbatch_V2;
const int i0_diff = i0_stop - i0_start;
if constexpr (nstages <= 1) {
const int i0_diff = i0_stop - i0_start;
if (!V_is_K_view || i0_stop > 2*nbatch_K2) {
constexpr bool use_cp_async = nstages == 1;
flash_attn_ext_f16_load_tile<stride_tile_V, nwarps, nbatch_fa, use_cp_async, oob_check>

View File

@@ -19,6 +19,7 @@ __global__ void fwht_cuda(const float * src, float * dst, const int64_t n_rows,
float reg[el_w];
const int lane = threadIdx.x;
ggml_cuda_pdl_sync();
#pragma unroll
for (int i = 0; i < el_w; ++i) {
reg[i] = src[i * warp_size + lane] * scale;
@@ -57,10 +58,11 @@ __global__ void fwht_cuda(const float * src, float * dst, const int64_t n_rows,
}
}
void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst) {
bool ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst) {
GGML_ASSERT(ggml_are_same_shape(src, dst));
GGML_ASSERT(ggml_is_contiguous(src));
GGML_ASSERT(ggml_is_contiguous(dst));
if (!ggml_is_contiguous(src) || !ggml_is_contiguous(dst)) {
return false;
}
const int n = src->ne[0];
const int64_t rows = ggml_nrows(src);
@@ -68,7 +70,6 @@ void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src,
float * dst_d = (float *) dst->data;
const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
GGML_ASSERT(n % warp_size == 0);
const int rows_per_block = 4;
const int64_t num_blocks = (rows + rows_per_block - 1) / rows_per_block;
@@ -83,26 +84,18 @@ void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src,
switch (n) {
case 64:
{
ggml_cuda_kernel_launch(fwht_cuda<64>, launch_params, src_d, dst_d, rows, scale);
break;
}
ggml_cuda_kernel_launch(fwht_cuda<64>, launch_params, src_d, dst_d, rows, scale);
return true;
case 128:
{
ggml_cuda_kernel_launch(fwht_cuda<128>, launch_params, src_d, dst_d, rows, scale);
break;
}
ggml_cuda_kernel_launch(fwht_cuda<128>, launch_params, src_d, dst_d, rows, scale);
return true;
case 256:
{
ggml_cuda_kernel_launch(fwht_cuda<256>, launch_params, src_d, dst_d, rows, scale);
break;
}
ggml_cuda_kernel_launch(fwht_cuda<256>, launch_params, src_d, dst_d, rows, scale);
return true;
case 512:
{
ggml_cuda_kernel_launch(fwht_cuda<512>, launch_params, src_d, dst_d, rows, scale);
break;
}
ggml_cuda_kernel_launch(fwht_cuda<512>, launch_params, src_d, dst_d, rows, scale);
return true;
default:
GGML_ABORT("fatal error");
return false;
}
}

View File

@@ -1,3 +1,4 @@
#include "common.cuh"
void ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst);
// Returns whether the Fast Walsh-Hadamard transform could be used.
bool ggml_cuda_op_fwht(ggml_backend_cuda_context & ctx, const ggml_tensor * src, ggml_tensor * dst);

View File

@@ -43,7 +43,6 @@ gated_delta_net_cuda(const float * q,
// output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
const int64_t state_in_offset = sequence * K * H * S_v * S_v + h_idx * S_v * S_v;
const int64_t state_out_offset = (sequence * H + h_idx) * S_v * S_v;
const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
state += state_out_offset;
curr_state += state_in_offset + col * S_v;
attn_data += (sequence * n_tokens * H + h_idx) * S_v;
@@ -61,10 +60,6 @@ gated_delta_net_cuda(const float * q,
s_shard[r] = curr_state[i];
}
// slot mapping: target_slot = t - shift. When n_tokens < K only the last n_tokens slots
// are written; earlier slots are left untouched (caller-owned).
const int shift = (int) n_tokens - K;
for (int t = 0; t < n_tokens; t++) {
const float * q_t = q + iq3 * sq3 + t * sq2 + iq1 * sq1;
const float * k_t = k + iq3 * sq3 + t * sq2 + iq1 * sq1;
@@ -148,6 +143,11 @@ gated_delta_net_cuda(const float * q,
attn_data += S_v * H;
if constexpr (keep_rs_t) {
// slot mapping: target_slot = t - shift. When n_tokens < K only the last n_tokens slots
// are written; earlier slots are left untouched (caller-owned).
const int shift = (int) n_tokens - K;
const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
const int target_slot = t - shift;
if (target_slot >= 0 && target_slot < K) {
float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;

View File

@@ -2570,6 +2570,7 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
use_mul_mat_q = use_mul_mat_q && ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[1], /*n_experts=*/0);
use_mul_mat_f = use_mul_mat_f && ggml_cuda_should_use_mmf(src0->type, cc, warp_size, src0->ne, src0->nb, src1->ne[1], /*mul_mat_id=*/false);
use_mul_mat_vec_f = use_mul_mat_vec_f && ggml_cuda_should_use_mmvf(src0->type, cc, src0->ne, src0->nb, src1->ne[1]);
use_mul_mat_vec_q = use_mul_mat_vec_q && ggml_cuda_should_use_mmvq(src0->type, cc, src1->ne[1]);
any_gpus_with_slow_fp16 = any_gpus_with_slow_fp16 || !fast_fp16_hardware_available(cc);
}
} else {
@@ -2578,6 +2579,7 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
use_mul_mat_q = use_mul_mat_q && ggml_cuda_should_use_mmq(src0->type, cc, src1->ne[1], /*n_experts=*/0);
use_mul_mat_f = use_mul_mat_f && ggml_cuda_should_use_mmf(src0->type, cc, warp_size, src0->ne, src0->nb, src1->ne[1], /*mul_mat_id=*/false);
use_mul_mat_vec_f = use_mul_mat_vec_f && ggml_cuda_should_use_mmvf(src0->type, cc, src0->ne, src0->nb, src1->ne[1]);
use_mul_mat_vec_q = use_mul_mat_vec_q && ggml_cuda_should_use_mmvq(src0->type, cc, src1->ne[1]);
any_gpus_with_slow_fp16 = any_gpus_with_slow_fp16 || !fast_fp16_hardware_available(cc);
}
@@ -2596,9 +2598,7 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
bool use_batched_cublas_f32 = src0->type == GGML_TYPE_F32;
const int32_t hint = ggml_get_op_params_i32(dst, 1);
if (hint == GGML_HINT_SRC0_IS_HADAMARD) {
GGML_ASSERT(!split);
ggml_cuda_op_fwht(ctx, src1, dst);
if (hint == GGML_HINT_SRC0_IS_HADAMARD && !split && ggml_cuda_op_fwht(ctx, src1, dst)) {
return;
}
@@ -4994,8 +4994,14 @@ static void ggml_backend_cuda_device_get_memory(ggml_backend_dev_t dev, size_t *
}
static enum ggml_backend_dev_type ggml_backend_cuda_device_get_type(ggml_backend_dev_t dev) {
GGML_UNUSED(dev);
return GGML_BACKEND_DEVICE_TYPE_GPU;
ggml_backend_cuda_device_context * ctx = (ggml_backend_cuda_device_context *) dev->context;
cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, ctx->device));
return prop.integrated
? GGML_BACKEND_DEVICE_TYPE_IGPU
: GGML_BACKEND_DEVICE_TYPE_GPU;
}
static void ggml_backend_cuda_device_get_props(ggml_backend_dev_t dev, ggml_backend_dev_props * props) {

View File

@@ -91,7 +91,7 @@ static __global__ void mul_mat_f(
const int row0 = blockIdx.x * rows_per_block;
int expert_idx = 0;
int col_base = 0;
[[maybe_unused]] int col_base = 0;
const int channel_dst = has_ids ? 0 : blockIdx.y;
@@ -122,12 +122,12 @@ static __global__ void mul_mat_f(
ids += col_offset * stride_row_id;
}
const float2 * y2 = (const float2 *) y;
[[maybe_unused]] const float2 * y2 = (const float2 *) y;
extern __shared__ char data_mmv[];
char * shmem_base = data_mmv;
int * slot_map = (int *) shmem_base;
[[maybe_unused]] int * slot_map = (int *) shmem_base;
char * compute_base = has_ids ? (shmem_base + GGML_PAD(cols_per_block, 16) * sizeof(int)) : shmem_base;
tile_C C[ntA][ntB];

View File

@@ -80,9 +80,8 @@ static __global__ void mul_mat_vec_f(
gate_x += int64_t(sample_x) *stride_sample_x + channel_x *stride_channel_x + row*stride_row;
}
const int channel_bias = ids ? channel_x : channel_dst;
if constexpr (has_fusion) {
const int channel_bias = ids ? channel_x : channel_dst;
if (use_bias) {
x_bias += int64_t(sample_dst)*stride_sample_dst + channel_bias*stride_channel_dst;
}
@@ -95,7 +94,7 @@ static __global__ void mul_mat_vec_f(
extern __shared__ char data_mmv[];
float * buf_iw = (float *) data_mmv;
float * buf_iw_gate = nullptr;
[[maybe_unused]] float * buf_iw_gate = nullptr;
if constexpr (has_fusion) {
buf_iw_gate = (float *) (data_mmv + warp_size*sizeof(float));
}
@@ -123,7 +122,7 @@ static __global__ void mul_mat_vec_f(
if constexpr (std::is_same_v<T, float>) {
const float2 * x2 = (const float2 *) x;
const float2 * gate_x2 = nullptr;
[[maybe_unused]] const float2 * gate_x2 = nullptr;
if constexpr (has_fusion) {
if (use_gate) {
gate_x2 = (const float2 *) gate_x;
@@ -155,7 +154,7 @@ static __global__ void mul_mat_vec_f(
}
} else if constexpr (std::is_same_v<T, half>) {
const half2 * x2 = (const half2 *) x;
const half2 * gate_x2 = nullptr;
[[maybe_unused]] const half2 * gate_x2 = nullptr;
if constexpr (has_fusion) {
if (use_gate) {
gate_x2 = (const half2 *) gate_x;
@@ -266,7 +265,7 @@ static __global__ void mul_mat_vec_f(
}
#else
const nv_bfloat162 * x2 = (const nv_bfloat162 *) x;
const nv_bfloat162 * gate_x2 = nullptr;
[[maybe_unused]] const nv_bfloat162 * gate_x2 = nullptr;
if constexpr (has_fusion) {
if (use_gate) {
gate_x2 = (const nv_bfloat162 *) gate_x;
@@ -274,7 +273,7 @@ static __global__ void mul_mat_vec_f(
}
for (int col2 = tid; col2 < ncols2; col2 += block_size) {
const nv_bfloat162 tmpx = x2[col2];
nv_bfloat162 tmpx_gate;
[[maybe_unused]] nv_bfloat162 tmpx_gate;
if constexpr (has_fusion) {
if (use_gate) {
tmpx_gate = gate_x2[col2];

View File

@@ -63,6 +63,7 @@ static constexpr __host__ __device__ int get_vdr_mmvq(ggml_type type) {
enum mmvq_parameter_table_id {
MMVQ_PARAMETERS_GENERIC = 0,
MMVQ_PARAMETERS_TURING,
MMVQ_PARAMETERS_GCN,
MMVQ_PARAMETERS_RDNA2,
MMVQ_PARAMETERS_RDNA3_0,
@@ -78,6 +79,8 @@ static constexpr __device__ mmvq_parameter_table_id get_device_table_id() {
return MMVQ_PARAMETERS_RDNA2;
#elif defined(GCN) || defined(CDNA)
return MMVQ_PARAMETERS_GCN;
#elif defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= GGML_CUDA_CC_TURING && __CUDA_ARCH__ < GGML_CUDA_CC_AMPERE
return MMVQ_PARAMETERS_TURING;
#else
return MMVQ_PARAMETERS_GENERIC;
#endif
@@ -96,6 +99,9 @@ static __host__ mmvq_parameter_table_id get_device_table_id(int cc) {
if (GGML_CUDA_CC_IS_GCN(cc) || GGML_CUDA_CC_IS_CDNA(cc)) {
return MMVQ_PARAMETERS_GCN;
}
if (GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_TURING && ggml_cuda_highest_compiled_arch(cc) < GGML_CUDA_CC_AMPERE) {
return MMVQ_PARAMETERS_TURING;
}
return MMVQ_PARAMETERS_GENERIC;
}
@@ -271,6 +277,53 @@ int get_mmvq_mmid_max_batch(ggml_type type, int cc) {
return MMVQ_MAX_BATCH_SIZE;
}
bool ggml_cuda_should_use_mmvq(enum ggml_type type, int cc, int64_t ne11) {
if (GGML_CUDA_CC_IS_CDNA(cc)) {
if (GGML_CUDA_CC_IS_CDNA1(cc)) {
switch (type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
return ne11 <= 7;
case GGML_TYPE_Q5_1:
return ne11 <= 7;
case GGML_TYPE_Q8_0:
return ne11 <= 6;
case GGML_TYPE_Q2_K:
return ne11 <= 4;
case GGML_TYPE_Q3_K:
return ne11 <= 3;
case GGML_TYPE_Q4_K:
return ne11 <= 2;
case GGML_TYPE_Q5_K:
return ne11 <= 3;
case GGML_TYPE_Q6_K:
return ne11 <= 4;
case GGML_TYPE_IQ1_S:
return ne11 <= 5;
case GGML_TYPE_IQ2_XXS:
case GGML_TYPE_IQ3_S:
case GGML_TYPE_IQ4_XS:
return ne11 <= 6;
default:
return ne11 <= MMVQ_MAX_BATCH_SIZE;
}
}
switch (type) { // tuned for CDNA2
case GGML_TYPE_Q2_K:
return ne11 <= 5;
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
return ne11 <= 3;
case GGML_TYPE_Q6_K:
return ne11 <= 5;
default:
return ne11 <= MMVQ_MAX_BATCH_SIZE;
}
}
return ne11 <= MMVQ_MAX_BATCH_SIZE;
}
// Device constexpr: returns the max batch size for the current arch+type at compile time.
template <ggml_type type>
static constexpr __device__ int get_mmvq_mmid_max_batch_for_device() {
@@ -370,11 +423,38 @@ static constexpr __host__ __device__ int calc_nwarps(ggml_type type, int ncols_d
}
return 1;
}
if (table_id == MMVQ_PARAMETERS_TURING) {
if (ncols_dst == 1) {
switch (type) {
case GGML_TYPE_Q2_K:
case GGML_TYPE_Q3_K:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
case GGML_TYPE_Q6_K:
return 2;
default:
return 4;
}
}
switch (ncols_dst) {
case 2:
case 3:
case 4:
return 4;
case 5:
case 6:
case 7:
case 8:
return 2;
default:
return 1;
}
}
return 1;
}
static constexpr __host__ __device__ int calc_rows_per_block(int ncols_dst, int table_id, bool small_k = false, int nwarps = 1) {
if (table_id == MMVQ_PARAMETERS_GENERIC || table_id == MMVQ_PARAMETERS_GCN) {
if (table_id == MMVQ_PARAMETERS_GENERIC || table_id == MMVQ_PARAMETERS_GCN || table_id == MMVQ_PARAMETERS_TURING) {
switch (ncols_dst) {
case 1:
return small_k ? nwarps : 1;
@@ -435,7 +515,7 @@ static __global__ void mul_mat_vec_q(
bool use_gate = false;
bool use_bias = false;
bool use_gate_bias = false;
const void * vgate = nullptr;
[[maybe_unused]] const void * vgate = nullptr;
const float * x_bias = nullptr;
const float * gate_bias = nullptr;
ggml_glu_op active_glu;
@@ -451,8 +531,8 @@ static __global__ void mul_mat_vec_q(
}
float x_biases[ncols_dst] = { 0.0f };
float gate_biases[ncols_dst] = { 0.0f };
[[maybe_unused]] float x_biases[ncols_dst] = { 0.0f };
[[maybe_unused]] float gate_biases[ncols_dst] = { 0.0f };
if constexpr (has_fusion) {
const uint32_t channel_bias = ids ? channel_x : channel_dst;
if (use_bias) {
@@ -509,12 +589,7 @@ static __global__ void mul_mat_vec_q(
}
__shared__ float tmp_shared[nwarps-1 > 0 ? nwarps-1 : 1][ncols_dst][rows_per_cuda_block][warp_size];
__shared__ float tmp_shared_gate[(has_fusion && (nwarps-1 > 0)) ? nwarps-1 : 1][ncols_dst][rows_per_cuda_block][warp_size];
if constexpr (!has_fusion) {
(void) tmp_shared_gate;
} else if (!use_gate) {
(void) tmp_shared_gate;
}
[[maybe_unused]] __shared__ float tmp_shared_gate[(has_fusion && (nwarps-1 > 0)) ? nwarps-1 : 1][ncols_dst][rows_per_cuda_block][warp_size];
if (threadIdx.y > 0) {
#pragma unroll

View File

@@ -2,6 +2,8 @@
#define MMVQ_MAX_BATCH_SIZE 8 // Max. batch size for which to use MMVQ kernels.
bool ggml_cuda_should_use_mmvq(enum ggml_type type, int cc, int64_t ne11);
// Returns the maximum batch size for which MMVQ should be used for MUL_MAT_ID,
// based on the quantization type and GPU architecture (compute capability).
int get_mmvq_mmid_max_batch(ggml_type type, int cc);

View File

@@ -134,7 +134,7 @@ __launch_bounds__(4 * WARP_SIZE, 1) __global__ void topk_moe_cuda(const float *
// selection_wt is only needed when bias is present (selection uses wt + bias)
// when no bias, we use wt directly for both selection and weight values
float selection_wt[has_bias ? experts_per_thread : 1];
[[maybe_unused]] float selection_wt[has_bias ? experts_per_thread : 1];
if constexpr (has_bias) {
#pragma unroll

View File

@@ -39,7 +39,7 @@
#include "ggml-hexagon.h"
#include "ggml-impl.h"
#include "ggml-quants.h"
#include "op-desc.h"
#include "htp-opnode.h"
#include "htp-ops.h"
#include "htp_iface.h"
#include "htp-drv.h"
@@ -68,6 +68,7 @@ static u32vec opt_pmu_evt { 0x3, 0x111, 0x100, 0x105, 0x240, 0x256, 0x7D, 0x8C }
static int opt_opstage = HTP_OPSTAGE_QUEUE | HTP_OPSTAGE_COMPUTE;
static int opt_opbatch = 1024; // max number of ops in a batch
static int opt_opqueue = 16; // max number of pending batches
static int opt_oppoll = 0; // polling for batch completions
static std::regex* opt_opfilter = NULL; // regex of ops to not claim
@@ -101,23 +102,23 @@ static const char * status_to_str(uint32_t status) {
// ** debug helpers
static void ggml_hexagon_dump_op_exec(const std::string &sess_name, const ggml_tensor * op, const uint32_t req_flags) {
static void ggml_hexagon_dump_op_exec(const std::string &sess_name, const htp_opnode & node, const uint32_t req_flags) {
if (!opt_verbose) return;
op_desc desc(op);
htp_opformat fmt(node);
GGML_LOG_DEBUG("ggml-hex: %s execute-op %s: %s : %s : %s : %s : %s : flags 0x%x\n", sess_name.c_str(),
ggml_op_desc(op), desc.names, desc.dims, desc.types, desc.strides, desc.buffs, req_flags);
node.op_name().c_str(), fmt.names, fmt.dims, fmt.types, fmt.strides, fmt.buffs, req_flags);
}
static void ggml_hexagon_dump_op_supp(const std::string &sess_name, const struct ggml_tensor * op, bool supp) {
if (!opt_verbose) return;
op_desc desc(op);
htp_opformat fmt(htp_opformat(htp_opnode{const_cast<ggml_tensor*>(op), {}, HTP_OP_INVALID}));
GGML_LOG_DEBUG("ggml-hex: %s supports-op %s: %s : %s : %s : %s : %s : %s\n", sess_name.c_str(),
ggml_op_desc(op), desc.names, desc.dims, desc.types, desc.strides, desc.buffs, supp ? "yes" : "no");
ggml_op_desc(op), fmt.names, fmt.dims, fmt.types, fmt.strides, fmt.buffs, supp ? "yes" : "no");
}
static void ggml_hexagon_dump_op_prof(const std::string &sess_name, const ggml_tensor * op,
static void ggml_hexagon_dump_op_prof(const std::string &sess_name, const htp_opnode & node,
uint32_t op_usec, uint32_t op_cycles, const uint32_t pmu[]) {
if (!opt_profile) return;
@@ -128,15 +129,16 @@ static void ggml_hexagon_dump_op_prof(const std::string &sess_name, const ggml_t
pmu[0], pmu[1], pmu[2], pmu[3], pmu[4], pmu[5], pmu[6], pmu[7]);
}
op_desc desc(op);
htp_opformat fmt(node);
GGML_LOG_DEBUG("ggml-hex: %s profile-op %s: %s : %s : %s : %s : usec %u cycles %u%s\n", sess_name.c_str(),
ggml_op_desc(op), desc.names, desc.dims, desc.types, desc.strides, op_usec, op_cycles, pmu_str);
node.op_name().c_str(), fmt.names, fmt.dims, fmt.types, fmt.strides, op_usec, op_cycles, pmu_str);
}
// ** backend sessions
struct ggml_hexagon_opbatch;
struct ggml_hexagon_opqueue;
struct htp_opnode;
struct ggml_hexagon_session {
std::string name;
@@ -166,7 +168,7 @@ struct ggml_hexagon_session {
void allocate(int dev_id) noexcept(false);
void release() noexcept(true);
void enqueue_op(htp_op_code opcode, const ggml_tensor *op);
void enqueue_op(const htp_opnode & node);
void flush(bool all = true);
void flush_pending(bool all = false);
@@ -550,7 +552,7 @@ static void repack_q4_0_q4x4x2(ggml_tensor * t, const void * data, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to read more data than is available in the source buffer 'data'
// or write more than the tensor can hold.
@@ -611,7 +613,7 @@ static void repack_q4x4x2_q4_0(void * data, const ggml_tensor * t, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to copy more data than the tensor actually contains.
const size_t total_tensor_size = (size_t)nrows * row_size;
@@ -660,6 +662,239 @@ static void repack_q4x4x2_q4_0(void * data, const ggml_tensor * t, size_t size)
ggml_aligned_free(buf_rp, row_size_rp);
}
static void unpack_q4_1_quants(uint8_t * qs, const block_q4_1 * x, unsigned int bi) {
static const int qk = QK4_1;
for (unsigned int i = 0; i < qk / 2; ++i) {
const int x0 = (x->qs[i] & 0x0F);
const int x1 = (x->qs[i] >> 4);
qs[bi * qk + i + 0] = x0;
qs[bi * qk + i + qk / 2] = x1;
}
}
static void pack_q4_1_quants(block_q4_1 * x, const uint8_t * qs, unsigned int bi) {
static const int qk = QK4_1;
for (unsigned int i = 0; i < qk / 2; ++i) {
const uint8_t x0 = qs[bi * qk + i + 0];
const uint8_t x1 = qs[bi * qk + i + qk / 2];
x->qs[i] = x0 | (x1 << 4);
}
}
static void repack_row_q4_1x4x2(uint8_t * y, const block_q4_1 * x, int64_t k) {
static const int qk = QK_Q4_0x4x2;
const int nb = (k + qk - 1) / qk; // number of blocks (padded)
const int nloe = k % qk; // leftovers
const int dblk_size = 8 * 4; // 8x (d, m) __fp16 = 32 bytes
const int qblk_size = qk / 2; // int4 = 128 bytes
const int qrow_size = k / 2; // int4 (not padded to blocks)
uint8_t * y_q = y + 0; // quants first
uint8_t * y_d = y + qrow_size; // then scales/offsets
// Repack the quants
for (int i = 0; i < nb; i++) {
uint8_t qs[QK_Q4_0x4x2]; // unpacked quants
unpack_q4_1_quants(qs, &x[i * 8 + 0], 0);
unpack_q4_1_quants(qs, &x[i * 8 + 1], 1);
unpack_q4_1_quants(qs, &x[i * 8 + 2], 2);
unpack_q4_1_quants(qs, &x[i * 8 + 3], 3);
unpack_q4_1_quants(qs, &x[i * 8 + 4], 4);
unpack_q4_1_quants(qs, &x[i * 8 + 5], 5);
unpack_q4_1_quants(qs, &x[i * 8 + 6], 6);
unpack_q4_1_quants(qs, &x[i * 8 + 7], 7);
bool partial = (nloe && i == nb-1);
uint8_t * q = y_q + (i * qblk_size);
for (int j = 0; j < qk / 2; j++) {
q[j] = partial ? (qs[j*2+1] << 4) | qs[j*2+0] : (qs[j+128] << 4) | qs[j+000];
}
}
// Repack the scales and offsets
for (int i = 0; i < nb; i++) {
ggml_half * d_m = (ggml_half *) (y_d + i * dblk_size);
for (int j = 0; j < 8; j++) {
d_m[j * 2 + 0] = x[i * 8 + j].d;
d_m[j * 2 + 1] = x[i * 8 + j].m;
}
}
}
static void unpack_row_q4_1x4x2(block_q4_1 * x, const uint8_t * y, int64_t k) {
static const int qk = QK_Q4_0x4x2;
const int nb = (k + qk - 1) / qk; // number of blocks (padded)
const int nloe = k % qk; // leftovers
const int dblk_size = 8 * 4; // 8x (d, m) __fp16 = 32 bytes
const int qblk_size = qk / 2; // int4 = 128 bytes
const int qrow_size = k / 2; // int4 (not padded to blocks)
const uint8_t * y_q = y + 0; // quants first
const uint8_t * y_d = y + qrow_size; // then scales/offsets
// Unpack the quants
for (int i = 0; i < nb; i++) {
uint8_t qs[QK_Q4_0x4x2];
bool partial = (nloe && i == nb-1);
const uint8_t * q = y_q + (i * qblk_size);
for (int j = 0; j < qk / 2; j++) {
if (partial) {
qs[j*2+0] = q[j] & 0x0F;
qs[j*2+1] = q[j] >> 4;
} else {
qs[j+000] = q[j] & 0x0F;
qs[j+128] = q[j] >> 4;
}
}
pack_q4_1_quants(&x[i * 8 + 0], qs, 0);
pack_q4_1_quants(&x[i * 8 + 1], qs, 1);
pack_q4_1_quants(&x[i * 8 + 2], qs, 2);
pack_q4_1_quants(&x[i * 8 + 3], qs, 3);
pack_q4_1_quants(&x[i * 8 + 4], qs, 4);
pack_q4_1_quants(&x[i * 8 + 5], qs, 5);
pack_q4_1_quants(&x[i * 8 + 6], qs, 6);
pack_q4_1_quants(&x[i * 8 + 7], qs, 7);
}
// Unpack the scales and offsets
for (int i = 0; i < nb; i++) {
const ggml_half * d_m = (const ggml_half *) (y_d + i * dblk_size);
for (int j = 0; j < 8; j++) {
x[i * 8 + j].d = d_m[j * 2 + 0];
x[i * 8 + j].m = d_m[j * 2 + 1];
}
}
}
static void init_row_q4_1x4x2(block_q4_1 * x, int64_t k) {
static const int qk = QK_Q4_0x4x2;
const int nb = (k + qk - 1) / qk; // number of blocks (padded)
uint8_t qs[QK_Q4_0x4x2]; // unpacked quants
memset(qs, 0, sizeof(qs));
for (int i = 0; i < nb; i++) {
pack_q4_1_quants(&x[i * 8 + 0], qs, 0);
pack_q4_1_quants(&x[i * 8 + 1], qs, 1);
pack_q4_1_quants(&x[i * 8 + 2], qs, 2);
pack_q4_1_quants(&x[i * 8 + 3], qs, 3);
pack_q4_1_quants(&x[i * 8 + 4], qs, 4);
pack_q4_1_quants(&x[i * 8 + 5], qs, 5);
pack_q4_1_quants(&x[i * 8 + 6], qs, 6);
pack_q4_1_quants(&x[i * 8 + 7], qs, 7);
}
for (int i = 0; i < nb; i++) {
for (int j = 0; j < 8; j++) {
x[i * 8 + j].d = 0;
x[i * 8 + j].m = 0;
}
}
}
static void repack_q4_1_q4x4x2(ggml_tensor * t, const void * data, size_t size) {
int64_t nrows = ggml_nrows(t);
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2));
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
const size_t total_tensor_size = (size_t)nrows * row_size;
const size_t n_bytes_to_copy = size < total_tensor_size ? size : total_tensor_size;
const int64_t n_full_rows = n_bytes_to_copy / row_size;
const size_t n_rem_bytes = n_bytes_to_copy % row_size;
void * buf_pd = ggml_aligned_malloc(row_size_pd);
GGML_ASSERT(buf_pd != NULL);
void * buf_rp = ggml_aligned_malloc(row_size_rp);
GGML_ASSERT(buf_rp != NULL);
HEX_VERBOSE("ggml-hex: repack-q4_1-q4x4x2 %s : data %p size %zu dims %ldx%ld row-size %zu\n", t->name, data, size,
t->ne[0], nrows, row_size);
init_row_q4_1x4x2((block_q4_1 *) buf_pd, t->ne[0]);
for (int64_t i = 0; i < n_full_rows; i++) {
const uint8_t * src = (const uint8_t *) data + (i * row_size);
uint8_t * dst = (uint8_t *) t->data + (i * row_size);
memcpy(buf_pd, src, row_size);
repack_row_q4_1x4x2((uint8_t *) buf_rp, (const block_q4_1 *) buf_pd, t->ne[0]);
memcpy(dst, buf_rp, row_size);
}
if (n_rem_bytes > 0) {
const int64_t i = n_full_rows;
const uint8_t * src = (const uint8_t *) data + (i * row_size);
uint8_t * dst = (uint8_t *) t->data + (i * row_size);
init_row_q4_1x4x2((block_q4_1 *) buf_pd, t->ne[0]);
memcpy(buf_pd, src, n_rem_bytes);
repack_row_q4_1x4x2((uint8_t *) buf_rp, (const block_q4_1 *) buf_pd, t->ne[0]);
memcpy(dst, buf_rp, n_rem_bytes);
}
ggml_aligned_free(buf_pd, row_size_pd);
ggml_aligned_free(buf_rp, row_size_rp);
}
static void repack_q4x4x2_q4_1(void * data, const ggml_tensor * t, size_t size) {
int64_t nrows = ggml_nrows(t);
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q4_0x4x2));
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
const size_t total_tensor_size = (size_t)nrows * row_size;
const size_t n_bytes_to_copy = size < total_tensor_size ? size : total_tensor_size;
const int64_t n_full_rows = n_bytes_to_copy / row_size;
const size_t n_rem_bytes = n_bytes_to_copy % row_size;
void * buf_pd = ggml_aligned_malloc(row_size_pd);
GGML_ASSERT(buf_pd != NULL);
void * buf_rp = ggml_aligned_malloc(row_size_rp);
GGML_ASSERT(buf_rp != NULL);
HEX_VERBOSE("ggml-hex: repack-q4x4x2-q4_1 %s : data %p size %zu dims %ldx%ld row-size %zu\n", t->name, data, size,
t->ne[0], nrows, row_size);
memset(buf_rp, 0, row_size_rp); // clear-out padded buffer to make sure the tail is all zeros
for (int64_t i = 0; i < n_full_rows; i++) {
const uint8_t * src = (const uint8_t *) t->data + (i * row_size);
uint8_t * dst = (uint8_t *) data + (i * row_size);
memcpy(buf_rp, src, row_size);
unpack_row_q4_1x4x2((block_q4_1 *) buf_pd, (const uint8_t *) buf_rp, t->ne[0]);
memcpy(dst, buf_pd, row_size);
}
if (n_rem_bytes > 0) {
const int64_t i = n_full_rows;
const uint8_t * src = (const uint8_t *) t->data + (i * row_size);
uint8_t * dst = (uint8_t *) data + (i * row_size);
// We still need to read and unpack the entire source row because quantization is block-based.
memcpy(buf_rp, src, row_size);
unpack_row_q4_1x4x2((block_q4_1 *) buf_pd, (const uint8_t *) buf_rp, t->ne[0]);
memcpy(dst, buf_pd, n_rem_bytes);
}
ggml_aligned_free(buf_pd, row_size_pd);
ggml_aligned_free(buf_rp, row_size_rp);
}
// ======== Q8x4x2 ====================
static void dump_block_q8_0(const block_q8_0 * b, int i) {
HEX_VERBOSE("ggml-hex: repack q8_0 %d: %d %d %d %d ... %d %d %d %d : %.6f\n", i, b->qs[0], b->qs[1], b->qs[2],
@@ -876,7 +1111,7 @@ static void repack_q8_0_q8x4x2(ggml_tensor * t, const void * data, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q8_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size quants + scales)
// Ensure we don't try to read more data than is available in the source buffer 'data'
// or write more than the tensor can hold.
@@ -937,7 +1172,7 @@ static void repack_q8x4x2_q8_0(void * data, const ggml_tensor * t, size_t size)
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_Q8_0x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size quants + scales)
// Ensure we don't try to copy more data than the tensor actually contains.
const size_t total_tensor_size = (size_t)nrows * row_size;
@@ -1238,7 +1473,7 @@ static void repack_mxfp4_mxfp4x4x2(ggml_tensor * t, const void * data, size_t si
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_MXFP4x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to read more data than is available in the source buffer 'data'
// or write more than the tensor can hold.
@@ -1299,7 +1534,7 @@ static void repack_mxfp4x4x2_mxfp4(void * data, const ggml_tensor * t, size_t si
size_t row_size = ggml_row_size(t->type, t->ne[0]);
size_t row_size_pd = ggml_row_size(t->type, hex_round_up(t->ne[0], QK_MXFP4x4x2)); // extra elements for the pad
size_t row_size_rp = row_size * 2; // extra space for tmp pad (if any)
size_t row_size_rp = row_size_pd; // scratch must hold one full padded tile (qblk_size/2 quants + scales)
// Ensure we don't try to copy more data than the tensor actually contains.
const size_t total_tensor_size = (size_t)nrows * row_size;
@@ -1365,6 +1600,12 @@ static void ggml_backend_hexagon_buffer_set_tensor(ggml_backend_buffer_t buffer,
repack_q4_0_q4x4x2(tensor, data, size);
break;
case GGML_TYPE_Q4_1:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
repack_q4_1_q4x4x2(tensor, data, size);
break;
case GGML_TYPE_Q8_0:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
@@ -1407,6 +1648,12 @@ static void ggml_backend_hexagon_buffer_get_tensor(ggml_backend_buffer_t buffer,
repack_q4x4x2_q4_0(data, tensor, size);
break;
case GGML_TYPE_Q4_1:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
repack_q4x4x2_q4_1(data, tensor, size);
break;
case GGML_TYPE_Q8_0:
GGML_ASSERT(offset == 0);
GGML_ASSERT(offset + size <= ggml_nbytes(tensor));
@@ -1536,12 +1783,10 @@ static ggml_backend_buffer_type_i ggml_backend_hexagon_repack_buffer_type_interf
/* .is_host = */ ggml_backend_hexagon_repack_buffer_type_is_host,
};
// Backend session implementation
struct ggml_hexagon_opbatch {
ggml_hexagon_session* sess;
std::vector<const ggml_tensor*> ops; // pointers to original ops
std::vector<htp_opnode> ops; // htp_opnode of ops
std::vector<htp_buf_desc> h_bufs; // htp buffer descriptors
std::vector<htp_tensor> h_tens; // htp tensor descriptors
@@ -1673,7 +1918,7 @@ struct ggml_hexagon_opbatch {
return ti;
}
bool fit_op(const struct ggml_tensor *t) const {
bool fit_op(const htp_opnode & node) const {
if (n_ops >= n_ops_max ) return false;
// check how much extras we will need
@@ -1682,6 +1927,7 @@ struct ggml_hexagon_opbatch {
size_t extra_tens = 0;
auto fit_tensor = [&](const ggml_tensor *t) {
if (!t) return;
if (!t_map.count(t)) {
extra_tens++;
@@ -1693,10 +1939,10 @@ struct ggml_hexagon_opbatch {
}
};
for (unsigned int i=0; i < HTP_OP_MAX_INPUTS && t->src[i]; i++) {
fit_tensor(t->src[i]);
for (const auto * src : node.get_inputs()) {
fit_tensor(src);
}
fit_tensor(t);
fit_tensor(node.dst());
if ((extra_bufs + n_bufs) > n_bufs_max) return false;
if ((extra_tens + n_tens) > n_tens_max) return false;
@@ -1706,29 +1952,30 @@ struct ggml_hexagon_opbatch {
}
// assumes that fit_op() was called first and returned true
void add_op(htp_op_code opcode, const struct ggml_tensor * t) {
void add_op(const htp_opnode & node) {
// Add new op
unsigned int n = n_ops++;
GGML_ASSERT(n_ops <= n_ops_max);
ops[n] = t;
ops[n] = node;
htp_op_desc &o = h_ops[n];
memcpy(&o.params, &t->op_params, sizeof(t->op_params));
o.opcode = opcode;
memcpy(&o.params, &node.node->op_params, sizeof(node.node->op_params));
o.opcode = node.opcode;
o.flags = 0;
if (!(opt_opstage & HTP_OPSTAGE_COMPUTE)) {
o.flags |= HTP_OPFLAGS_SKIP_COMPUTE;
}
ggml_hexagon_dump_op_exec(sess->c_name(), t, o.flags);
ggml_hexagon_dump_op_exec(sess->c_name(), node, o.flags);
auto inputs = node.get_inputs();
for (unsigned int i=0; i < HTP_OP_MAX_INPUTS; i++) {
o.src[i] = t->src[i] ? add_tensor(t->src[i]) : 0xffff;
o.src[i] = (i < inputs.size() && inputs[i]) ? add_tensor(inputs[i]) : 0xffff;
}
o.dst = add_tensor(t);
o.dst = add_tensor(node.dst());
}
};
@@ -1737,7 +1984,7 @@ struct ggml_hexagon_opqueue {
ggml_hexagon_shared_buffer *shm_buf;
size_t shm_blk_size;
using opvec = std::vector<const ggml_tensor*>;
using opvec = std::vector<htp_opnode>;
std::queue<unsigned int> done; // completed batch ids
std::vector<opvec> op_cache; // per batch op cache
@@ -1886,7 +2133,8 @@ void ggml_hexagon_session::flush_pending(bool all) {
uint32_t n_dbufs;
// Read response packet from queue
int err = dspqueue_read(this->queue, &flags, 1, &n_dbufs, &dbuf, sizeof(rsp), &rsp_size, (uint8_t *) &rsp, DSPQUEUE_TIMEOUT);
const uint32_t timeo = opt_oppoll ? 0 : DSPQUEUE_TIMEOUT;
int err = dspqueue_read(this->queue, &flags, 1, &n_dbufs, &dbuf, sizeof(rsp), &rsp_size, (uint8_t *) &rsp, timeo);
if (err == AEE_EEXPIRED) {
continue;
}
@@ -1935,11 +2183,11 @@ void ggml_hexagon_session::flush_batch() {
}
}
void ggml_hexagon_session::enqueue_op(htp_op_code opcode, const ggml_tensor *op) {
if (!op_batch->fit_op(op)) {
void ggml_hexagon_session::enqueue_op(const htp_opnode & node) {
if (!op_batch->fit_op(node)) {
flush_batch();
}
op_batch->add_op(opcode, op);
op_batch->add_op(node);
}
// Flush HTP response queue i.e wait for all outstanding requests to complete
@@ -2290,6 +2538,7 @@ static bool ggml_hexagon_supported_gated_delta_net(const struct ggml_hexagon_ses
const int64_t H = v->ne[1];
const int64_t n_tokens = v->ne[2];
const int64_t n_seqs = v->ne[3];
const int64_t K = state->ne[1];
if (S_v <= 0 || S_v > 128 || H <= 0 || n_tokens <= 0 || n_seqs <= 0) {
return false;
@@ -2302,10 +2551,10 @@ static bool ggml_hexagon_supported_gated_delta_net(const struct ggml_hexagon_ses
if ((g->ne[0] != 1 && g->ne[0] != S_v) || beta->ne[0] != 1) {
return false;
}
if (ggml_nelements(state) != S_v * S_v * H * n_seqs) {
if (ggml_nelements(state) != S_v * S_v * H * n_seqs * K) {
return false;
}
if (dst->ne[0] != S_v * H || dst->ne[1] != n_tokens * n_seqs + S_v * n_seqs) {
if (dst->ne[0] != S_v * H || dst->ne[1] != n_tokens * n_seqs + S_v * n_seqs * K) {
return false;
}
@@ -2327,6 +2576,7 @@ static bool ggml_hexagon_supported_mul_mat(const struct ggml_hexagon_session * s
switch (src0->type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q8_0:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_MXFP4:
@@ -2353,6 +2603,27 @@ static bool ggml_hexagon_supported_mul_mat(const struct ggml_hexagon_session * s
GGML_LOG_DEBUG("ggml_hexagon_supported_mul_mat: permuted F16 src0 not supported\n");
return false;
}
if (src1->ne[2] < src0->ne[2] || src1->ne[3] < src0->ne[3]) {
GGML_LOG_DEBUG("ggml_hexagon_supported_mul_mat: src1 broadcasting not supported\n");
return false;
}
if (ggml_nrows(src1) > 1024) {
return false; // no huge batches (for now)
}
break;
case GGML_TYPE_F32:
if (src1->type != GGML_TYPE_F32) {
return false;
}
if (src0->nb[1] < src0->nb[0]) {
GGML_LOG_DEBUG("ggml_hexagon_supported_mul_mat: permuted F32 src0 not supported\n");
return false;
}
if (src1->ne[2] < src0->ne[2] || src1->ne[3] < src0->ne[3]) {
GGML_LOG_DEBUG("ggml_hexagon_supported_mul_mat: src1 broadcasting not supported\n");
return false;
}
if (ggml_nrows(src1) > 1024) {
return false; // no huge batches (for now)
}
@@ -2377,6 +2648,7 @@ static bool ggml_hexagon_supported_mul_mat_id(const struct ggml_hexagon_session
switch (src0->type) {
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q8_0:
case GGML_TYPE_IQ4_NL:
case GGML_TYPE_MXFP4:
@@ -2874,6 +3146,7 @@ static htp_op_code op_remap_to_htp(const ggml_tensor * t) {
case GGML_OP_NORM: return HTP_OP_NORM;
case GGML_OP_L2_NORM: return HTP_OP_L2_NORM;
case GGML_OP_RMS_NORM: return HTP_OP_RMS_NORM;
case GGML_OP_CONCAT: return HTP_OP_CONCAT;
case GGML_OP_SCALE: return HTP_OP_SCALE;
case GGML_OP_SQR: return HTP_OP_SQR;
case GGML_OP_SQRT: return HTP_OP_SQRT;
@@ -2891,13 +3164,14 @@ static htp_op_code op_remap_to_htp(const ggml_tensor * t) {
case GGML_OP_UNARY:
switch (ggml_get_unary_op(t)) {
case GGML_UNARY_OP_SILU: return HTP_OP_UNARY_SILU;
case GGML_UNARY_OP_GELU: return HTP_OP_UNARY_GELU;
case GGML_UNARY_OP_SIGMOID: return HTP_OP_UNARY_SIGMOID;
case GGML_UNARY_OP_NEG: return HTP_OP_UNARY_NEG;
case GGML_UNARY_OP_EXP: return HTP_OP_UNARY_EXP;
case GGML_UNARY_OP_SOFTPLUS: return HTP_OP_UNARY_SOFTPLUS;
case GGML_UNARY_OP_TANH: return HTP_OP_UNARY_TANH;
case GGML_UNARY_OP_SILU: return HTP_OP_UNARY_SILU;
case GGML_UNARY_OP_GELU: return HTP_OP_UNARY_GELU;
case GGML_UNARY_OP_GELU_QUICK: return HTP_OP_UNARY_GELU;
case GGML_UNARY_OP_SIGMOID: return HTP_OP_UNARY_SIGMOID;
case GGML_UNARY_OP_NEG: return HTP_OP_UNARY_NEG;
case GGML_UNARY_OP_EXP: return HTP_OP_UNARY_EXP;
case GGML_UNARY_OP_SOFTPLUS: return HTP_OP_UNARY_SOFTPLUS;
case GGML_UNARY_OP_TANH: return HTP_OP_UNARY_TANH;
default:
break;
}
@@ -2928,10 +3202,43 @@ static ggml_status ggml_backend_hexagon_graph_compute(ggml_backend_t backend, gg
HEX_VERBOSE("ggml-hex: %s graph-compute n_nodes %d\n", sess->c_name(), graph->n_nodes);
std::vector<htp_opnode> nodes;
nodes.reserve(graph->n_nodes);
// Fusion
for (int i = 0; i < graph->n_nodes; ++i) {
ggml_tensor * n = graph->nodes[i];
if (op_is_compute(n) && (opt_opstage & HTP_OPSTAGE_QUEUE)) {
sess->enqueue_op(op_remap_to_htp(n), n);
if (!op_is_compute(n)) {
continue;
}
ggml_tensor * next_node = (i + 1 < graph->n_nodes) ? graph->nodes[i + 1] : nullptr;
htp_opnode node = {
/*.node =*/ n,
/*.fused =*/ {},
/*.opcode =*/ HTP_OP_INVALID
};
if (n->op == GGML_OP_RMS_NORM && next_node) {
if (next_node->op == GGML_OP_MUL && op_is_compute(next_node) && ggml_can_fuse(graph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL })) {
node.add_fused(next_node);
node.opcode = HTP_OP_RMS_NORM_MUL;
i++; // skip the fused MUL node
}
}
if (node.opcode == HTP_OP_INVALID) {
node.opcode = op_remap_to_htp(n);
}
nodes.push_back(std::move(node));
}
// Queue and execute
if (opt_opstage & HTP_OPSTAGE_QUEUE) {
for (const auto & node : nodes) {
sess->enqueue_op(node);
}
}
@@ -2950,51 +3257,7 @@ static void ggml_backend_hexagon_synchronize(ggml_backend_t backend) {
sess->flush();
}
struct node_info {
ggml_tensor * node;
std::vector<ggml_tensor *> fused;
ggml_op op() const {
return node->op;
}
const ggml_tensor * dst() const {
return fused.empty() ? node : fused.back();
}
const ggml_tensor * src0() const {
return node->src[0];
}
const ggml_tensor * src1() const {
return node->src[1];
}
bool is_empty() const {
return ggml_op_is_empty(node->op);
}
void add_fused(ggml_tensor * t) {
fused.push_back(t);
}
bool stackable() const {
switch (this->op()) {
case GGML_OP_MUL_MAT:
case GGML_OP_MUL_MAT_ID:
return ggml_is_quantized(this->src0()->type);
default:
return false;
}
}
bool same_input(const node_info& n) const {
return n.src1() == this->src1();
}
};
static std::vector<int> ggml_hexagon_graph_optimize_reorder(const std::vector<node_info> & nodes) {
static std::vector<int> ggml_hexagon_graph_optimize_reorder(const std::vector<htp_opnode> & nodes) {
const int n = nodes.size();
std::vector<int> res;
@@ -3048,14 +3311,14 @@ static void ggml_backend_hexagon_graph_optimize(ggml_backend_t backend, ggml_cgr
enum ggml_op ops[MAX_FUSE];
std::vector<node_info> nodes;
std::vector<htp_opnode> nodes;
nodes.reserve(gf->n_nodes);
// fuse nodes:
// we don't want to make reorders that break fusing, so we first pack all fusable tensors
// and perform the reorder over the fused nodes. after the reorder is done, we unfuse
for (int i = 0; i < n; i++) {
node_info node = {
htp_opnode node = {
/*.node =*/gf->nodes[i],
/*.fused =*/{},
};
@@ -3286,6 +3549,25 @@ static bool ggml_hexagon_supported_repeat(const struct ggml_hexagon_session * se
return true;
}
static bool ggml_hexagon_supported_concat(const struct ggml_hexagon_session * sess, const struct ggml_tensor * op) {
int dim = ((const int32_t *) op->op_params)[0];
if (dim < 0 || dim >= GGML_MAX_DIMS) {
return false;
}
for (int i = 0; i < GGML_MAX_SRC; ++i) {
const struct ggml_tensor * src = op->src[i];
if (!src) {
continue;
}
if (src->type != GGML_TYPE_F32 && src->type != GGML_TYPE_I32 && src->type != GGML_TYPE_F16) {
return false;
}
}
return true;
}
static bool ggml_hexagon_supported_fill(const struct ggml_hexagon_session * sess, const struct ggml_tensor * op) {
const struct ggml_tensor * dst = op;
@@ -3371,6 +3653,7 @@ static bool ggml_backend_hexagon_device_supports_op(ggml_backend_dev_t dev, cons
break;
case GGML_UNARY_OP_SILU:
case GGML_UNARY_OP_GELU:
case GGML_UNARY_OP_GELU_QUICK:
supp = ggml_hexagon_supported_activations(sess, op);
break;
default:
@@ -3434,6 +3717,10 @@ static bool ggml_backend_hexagon_device_supports_op(ggml_backend_dev_t dev, cons
supp = ggml_hexagon_supported_cumsum(sess, op);
break;
case GGML_OP_CONCAT:
supp = ggml_hexagon_supported_concat(sess, op);
break;
case GGML_OP_FILL:
supp = ggml_hexagon_supported_fill(sess, op);
break;
@@ -3598,6 +3885,8 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
// Basic sanity checks to make sure definitions match
static_assert((unsigned int) HTP_TYPE_Q4_0 == (unsigned int) GGML_TYPE_Q4_0,
"please update hexagon_type to match ggml_type");
static_assert((unsigned int) HTP_TYPE_Q4_1 == (unsigned int) GGML_TYPE_Q4_1,
"please update hexagon_type to match ggml_type");
static_assert((unsigned int) HTP_TYPE_Q8_0 == (unsigned int) GGML_TYPE_Q8_0,
"please update hexagon_type to match ggml_type");
static_assert((unsigned int) HTP_TYPE_MXFP4 == (unsigned int) GGML_TYPE_MXFP4,
@@ -3610,6 +3899,7 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
const char * str_opstage = getenv("GGML_HEXAGON_OPSTAGE");
const char * str_opbatch = getenv("GGML_HEXAGON_OPBATCH");
const char * str_opqueue = getenv("GGML_HEXAGON_OPQUEUE");
const char * str_oppoll = getenv("GGML_HEXAGON_OPPOLL");
const char * str_opfilter = getenv("GGML_HEXAGON_OPFILTER");
const char * str_profile = getenv("GGML_HEXAGON_PROFILE");
const char * str_etm = getenv("GGML_HEXAGON_ETM");
@@ -3647,6 +3937,7 @@ static void ggml_hexagon_init(ggml_backend_reg * reg) {
opt_opstage = str_opstage ? strtoul(str_opstage, NULL, 0) : opt_opstage;
opt_opbatch = str_opbatch ? strtoul(str_opbatch, NULL, 0) : opt_opbatch;
opt_opqueue = str_opqueue ? strtoul(str_opqueue, NULL, 0) : opt_opqueue;
opt_oppoll = str_oppoll ? strtoul(str_oppoll, NULL, 0) : opt_oppoll;
opt_profile = str_profile ? atoi(str_profile) : 0;
opt_etm = str_etm ? atoi(str_etm) : 0;
opt_nhvx = str_nhvx ? strtoul(str_nhvx, NULL, 0) : opt_nhvx;

View File

@@ -0,0 +1,265 @@
#ifndef HTP_OPNODE_H
#define HTP_OPNODE_H
#define GGML_COMMON_IMPL_CPP
#include "ggml-backend-impl.h"
#include "ggml-common.h"
#include <string>
#include <vector>
#include <stdio.h>
#include "htp-ops.h"
struct htp_opnode {
ggml_tensor * node = nullptr;
std::vector<ggml_tensor *> fused;
htp_op_code opcode = HTP_OP_INVALID;
ggml_op op() const {
return node->op;
}
const ggml_tensor * dst() const {
return fused.empty() ? node : fused.back();
}
const ggml_tensor * src0() const {
return node->src[0];
}
const ggml_tensor * src1() const {
return node->src[1];
}
bool is_empty() const {
return ggml_op_is_empty(node->op);
}
void add_fused(ggml_tensor * t) {
fused.push_back(t);
}
bool stackable() const {
switch (this->op()) {
case GGML_OP_MUL_MAT:
case GGML_OP_MUL_MAT_ID:
return ggml_is_quantized(this->src0()->type);
default:
return false;
}
}
bool same_input(const htp_opnode& n) const {
return n.src1() == this->src1();
}
std::vector<const ggml_tensor *> get_inputs() const {
std::vector<const ggml_tensor *> inputs(GGML_MAX_SRC, nullptr);
std::vector<const ggml_tensor *> outputs;
outputs.push_back(node);
for (const auto * f : fused) {
outputs.push_back(f);
}
auto contains = [&](const std::vector<const ggml_tensor *> & vec, const ggml_tensor * t) {
for (const auto * x : vec) {
if (x == t) return true;
}
return false;
};
int count = 0;
auto add_input = [&](const ggml_tensor * t) {
if (t && !contains(outputs, t) && !contains(inputs, t)) {
if (count < (int)inputs.size()) {
inputs[count++] = t;
} else {
inputs.push_back(t);
}
}
};
for (int i = 0; i < GGML_MAX_SRC; i++) {
if (fused.empty()) {
inputs[i] = node->src[i];
} else {
if (node->src[i]) {
add_input(node->src[i]);
}
}
}
for (const auto * f : fused) {
for (int i = 0; i < GGML_MAX_SRC; i++) {
if (f->src[i]) {
add_input(f->src[i]);
}
}
}
if (!fused.empty()) {
inputs.resize(count);
}
return inputs;
}
std::string op_name() const {
if (fused.empty()) {
return ggml_op_desc(node);
}
std::string name = ggml_op_desc(node);
for (const auto * f : fused) {
name += "+";
name += ggml_op_desc(f);
}
return name;
}
};
struct htp_opformat {
char strides[64 * GGML_MAX_SRC];
char dims[64 * GGML_MAX_SRC];
char types[16 * GGML_MAX_SRC];
char buffs[64 * GGML_MAX_SRC];
char names[64 * GGML_MAX_SRC];
int format_tensor_dims(char * str, const struct ggml_tensor * t) {
if (!t) {
return sprintf(str, "NONE");
}
if (t->ne[2] == 1 && t->ne[3] == 1) {
return sprintf(str, "%d:%d", (int) t->ne[0], (int) t->ne[1]);
} else {
return sprintf(str, "%d:%d:%d:%d", (int) t->ne[0], (int) t->ne[1], (int) t->ne[2], (int) t->ne[3]);
}
}
void format_op_dims(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += format_tensor_dims(p, inputs[0]);
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += format_tensor_dims(p, inputs[i]);
}
p += sprintf(p, " -> ");
}
char self[64];
format_tensor_dims(self, node.dst());
p += sprintf(p, "%s", self);
}
int format_tensor_strides(char * str, const struct ggml_tensor * t) {
if (!t) {
return sprintf(str, "NONE");
}
const char * c = ggml_is_contiguous(t) ? "" : "!";
if (t->ne[2] == 1 && t->ne[3] == 1) {
return sprintf(str, "%zu:%zu%s", (size_t) t->nb[0], (size_t) t->nb[1], c);
} else {
return sprintf(str, "%zu:%zu:%zu:%zu%s", (size_t) t->nb[0], (size_t) t->nb[1], (size_t) t->nb[2], (size_t) t->nb[3], c);
}
}
void format_op_strides(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += format_tensor_strides(p, inputs[0]);
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += format_tensor_strides(p, inputs[i]);
}
p += sprintf(p, " -> ");
}
char self[64];
format_tensor_strides(self, node.dst());
p += sprintf(p, "%s", self);
}
void format_op_types(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += sprintf(p, "%s", inputs[0] ? ggml_type_name(inputs[0]->type) : "NONE");
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += sprintf(p, "%s", inputs[i] ? ggml_type_name(inputs[i]->type) : "NONE");
}
p += sprintf(p, " -> ");
}
p += sprintf(p, "%s", ggml_type_name(node.dst()->type));
}
const char * tensor_buff_name(const struct ggml_tensor * t) {
if (t && t->buffer) {
return ggml_backend_buffer_name(t->buffer);
}
return "NONE";
}
void format_op_buffs(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += sprintf(p, "%s", tensor_buff_name(inputs[0]));
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += sprintf(p, "%s", tensor_buff_name(inputs[i]));
}
p += sprintf(p, " -> ");
}
p += sprintf(p, "%s", tensor_buff_name(node.dst()));
}
void format_op_names(char * str, const htp_opnode & node) {
char * p = str;
auto inputs = node.get_inputs();
if (!inputs.empty()) {
p += sprintf(p, "%s", inputs[0] ? inputs[0]->name : "NONE");
for (size_t i = 1; i < inputs.size(); i++) {
p += sprintf(p, " x ");
p += sprintf(p, "%s", inputs[i] ? inputs[i]->name : "NONE");
}
p += sprintf(p, " -> ");
}
p += sprintf(p, "%s", node.dst()->name);
}
void format(const htp_opnode & node) {
format_op_dims(dims, node);
format_op_strides(strides, node);
format_op_types(types, node);
format_op_buffs(buffs, node);
format_op_names(names, node);
}
htp_opformat() {}
htp_opformat(const htp_opnode & node) { format(node); }
};
#endif // HTP_OPNODE_H

View File

@@ -19,26 +19,6 @@ add_library(${HTP_LIB} SHARED
htp_iface_skel.c
worker-pool.c
hex-dma.c
matmul-ops.c
binary-ops.c
unary-ops.c
sum-rows-ops.c
softmax-ops.c
act-ops.c
rope-ops.c
flash-attn-ops.c
set-rows-ops.c
get-rows-ops.c
cpy-ops.c
repeat-ops.c
argsort-ops.c
ssm-conv.c
cumsum-ops.c
fill-ops.c
diag-ops.c
solve-tri-ops.c
gated-delta-net-ops.c
pad-ops.c
)
target_compile_definitions(${HTP_LIB} PRIVATE
@@ -57,15 +37,16 @@ list(FIND HTP_HMX_VERSIONS ${DSP_VERSION} _hmx_idx)
if (_hmx_idx GREATER_EQUAL 0)
target_sources(${HTP_LIB} PRIVATE
hmx-queue.c
hmx-matmul-ops.c
hmx-flash-attn-ops.c
hmx-queue.c
)
# -mhmx enables HMX instruction set (needed by files that include hmx-utils.h)
set_source_files_properties(
hmx-matmul-ops.c
hmx-flash-attn-ops.c
hmx-matmul-ops.c
hmx-queue.c
PROPERTIES COMPILE_OPTIONS "-mhmx"
)
@@ -74,6 +55,30 @@ endif()
build_idl(htp_iface.idl ${HTP_LIB})
target_sources(${HTP_LIB} PRIVATE
matmul-ops.c
binary-ops.c
unary-ops.c
sum-rows-ops.c
softmax-ops.c
act-ops.c
rope-ops.c
flash-attn-ops.c
set-rows-ops.c
get-rows-ops.c
cpy-ops.c
repeat-ops.c
argsort-ops.c
ssm-conv.c
cumsum-ops.c
fill-ops.c
concat-ops.c
diag-ops.c
solve-tri-ops.c
gated-delta-net-ops.c
pad-ops.c
)
set_target_properties(${HTP_LIB} PROPERTIES EXPORT_COMPILE_COMMANDS ON)
install(TARGETS ${HTP_LIB})

View File

@@ -276,6 +276,7 @@ int op_argsort(struct htp_ops_context * octx) {
octx->src0_spad.data = octx->ctx->vtcm_base;
octx->src0_spad.size = total_spad_size;
octx->src0_spad.size_per_thread = spad_per_thread;
octx->src0_spad.src = NULL;
FARF(HIGH, "argsort: %ux%ux%ux%u -> %ux%ux%ux%u (0x%x, 0x%x)",
octx->src[0]->ne[0], octx->src[0]->ne[1], octx->src[0]->ne[2], octx->src[0]->ne[3],

Some files were not shown because too many files have changed in this diff Show More