7739 Commits

Author SHA1 Message Date
Oliver Simons
36f0132464 CUDA: Factor out and re-use block_reduce function (#18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of the `__device__` function into the `__global__` function
allows for explicit smem reuse, as either the compiler or the CUDA runtime
appears not to free it afterwards (`cudaFuncSetAttribute` fails unless the
memory is accounted for once per call to two_stage_warp_reduce).
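
A minimal sketch of that pattern, with hypothetical names (the real ggml-cuda helpers differ in detail): the per-warp partials buffer is owned by the `__global__` kernel, so two reductions can share one allocation instead of each `__device__` call holding its own static smem. Assumes blockDim.x is a multiple of WARP_SIZE and at most 1024.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Illustrative single-warp reduction (ggml-cuda has its own warp_reduce_sum).
__device__ float warp_reduce_sum(float x) {
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, offset);
    }
    return x;
}

// Stage 1: each warp reduces its lanes; stage 2: warp 0 reduces the per-warp
// partials. The caller owns smem (one float per warp), so it can be reused.
// The final sum is valid on warp 0.
__device__ float block_reduce_sum(float x, float * smem) {
    const int lane = threadIdx.x % WARP_SIZE;
    const int warp = threadIdx.x / WARP_SIZE;

    x = warp_reduce_sum(x);               // stage 1
    if (lane == 0) { smem[warp] = x; }
    __syncthreads();

    const int n_warps = blockDim.x / WARP_SIZE;
    x = lane < n_warps ? smem[lane] : 0.0f;
    if (warp == 0) { x = warp_reduce_sum(x); } // stage 2
    return x;
}

__global__ void sum_kernel(const float * in, float * out, int n) {
    __shared__ float smem[32];            // declared once, at kernel scope
    float v = threadIdx.x < (unsigned) n ? in[threadIdx.x] : 0.0f;
    const float s  = block_reduce_sum(v, smem);
    __syncthreads();                      // makes reusing smem below safe
    const float s2 = block_reduce_sum(v*v, smem);
    if (threadIdx.x == 0) { out[0] = s; out[1] = s2; }
}
```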

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum
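
For illustration, a hypothetical shape of such an enum (the actual names in common.cuh may differ):

```cuda
// A plain enum converts implicitly to int, so an unrelated integer can be
// passed by accident; an enum class requires the qualified name and forbids
// the implicit conversion, catching such mistakes at compile time.
enum class block_reduce_method {
    warp_only,  // block is a single warp: one warp reduce is enough
    two_stage,  // warp reduce, then a shared-memory reduce across warps
};

__device__ float block_reduce(float x, float * smem, block_reduce_method m) {
    // ... dispatch on m. With a plain enum, block_reduce(x, smem, 42) would
    // also compile; with enum class it does not.
    return x;
}
```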

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785
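
The standard workaround looks like this (a generic sketch, not the exact common.cuh change):

```cuda
#include <type_traits>

// A non-dependent condition such as static_assert(false, "...") may be
// evaluated while the template is merely parsed, failing the build even if
// the branch is never instantiated. Tying the condition to T delays
// evaluation until an instantiation actually provides the type.
template <typename T>
struct dependent_false : std::false_type {};

template <typename T>
__device__ T reduce_one(T x) {
    if constexpr (std::is_same_v<T, float>) {
        return x + x; // placeholder for the real float path
    } else {
        static_assert(dependent_false<T>::value, "unsupported type");
    }
}
```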

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
b7739
2026-01-15 10:44:54 +08:00
Piotr Wilkin (ilintar)
d98b548120 Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914)
* Extract common debugging functions; plug eval-callback and mtmd's MTMD_DEBUG_GRAPH with same functionality

* Move to common

* Remove unneeded header

* Unlink from common

* chore: update webui build output

* Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code.

* Revert change to webapp

* Post-merge adjust

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Apply code review changes

* Remove changes to server-context

* Remove mtmd.h include

* Remove utility functions from header

* Apply suggestions from code review

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Rename functions

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
b7738
2026-01-14 20:29:35 +01:00
Junwon Hwang
8fb7175576 model : clean up and fix EXAONE-MoE configuration (#18840)
* Fix mismatch of EXAONE-MoE configuration

* ensure gating func is set, cleanup

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7737
2026-01-14 19:38:21 +01:00
Adrien Gallouët
516a4ca9b5 refactor : remove libcurl, use OpenSSL when available (#18828) b7736 2026-01-14 18:02:47 +01:00
Jeff Bolz
3e4bb29666 vulkan: Check maxStorageBufferRange in supports_op (#18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
b7735
2026-01-14 10:59:05 +01:00
Aman Gupta
47f9612492 llama-model: fix unfortunate typo (#18832) 2026-01-14 17:55:15 +08:00
Daniel Bevenius
01cbdfd7eb CUDA : fix typo in clang pragma comment [no ci] (#18830) 2026-01-14 10:31:49 +01:00
Ruben Ortlam
635ef78ec5 vulkan: work around Intel fp16 bug in mmq (#18814) 2026-01-14 09:41:23 +01:00
Perry Naseck
7d587e5544 ggml-metal: do not copy headers for embedded, use current binary dir for embedded (#18705) b7731 2026-01-14 09:22:25 +02:00
Daniel Benjaminsson
d34aa07193 mmap: add Haiku support by skipping RLIMIT_MEMLOCK check (#18819)
Haiku OS does not support RLIMIT_MEMLOCK, similar to visionOS/tvOS.
Skip the resource limit check on Haiku to allow mlock functionality
to work without compile errors.

Tested on Haiku with NVIDIA RTX 3080 Ti using Vulkan backend.
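
The guard presumably looks something like this (a sketch, not the exact llama-mmap code):

```cuda
#include <sys/resource.h>
#include <stddef.h>

// Check whether a requested mlock() size fits the soft resource limit.
static bool memlock_within_limit(size_t want) {
#if defined(__HAIKU__)
    // Haiku has no RLIMIT_MEMLOCK (same situation as visionOS/tvOS): skip
    // the check and let mlock() itself report any failure.
    (void) want;
    return true;
#else
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        return true; // could not query the limit; fall through to mlock()
    }
    return lim.rlim_max == RLIM_INFINITY || (size_t) lim.rlim_max >= want;
#endif
}
```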
b7730
2026-01-14 09:11:05 +02:00
Adrien Gallouët
f709c7a33f ci, tests : use cmake to download models and remove libcurl dependency (#18791)
* ci, tests : use cmake to download models and remove libcurl dependency
* llama_dl_model -> llama_download_model
* use EXPECTED_HASH for robust model downloading
* Move llama_download_model to cmake/common.cmake

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7729
2026-01-14 07:46:27 +01:00
ddh0
6e36299b47 llama : print_info alignment fix (#18708)
* fix text spacing in print_info

* align all
b7728
2026-01-14 00:05:11 +01:00
Junwon Hwang
60591f01d4 model : add EXAONE MoE (#18543)
* Add EXAONE MoE implementations

Co-authored-by: Junwon Hwang <nuclear1221@gmail.com>

* Address PR feedback

* Address PR feedback

* [WIP] Add MTP for EXAONE-MoE

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

* Address PR feedback

---------

Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai>
b7727
2026-01-13 23:28:38 +01:00
Georgi Gerganov
e4832e3ae4 vocab : fix attribute overrides for harmony (#18806)
* vocab : fix attribute overrides for harmony

* cont : add warning log
b7726
2026-01-13 17:40:13 +02:00
Ruben Ortlam
960e5e3b46 llama-mmap: fix direct-io loading fallback EOF exception (#18801) b7725 2026-01-13 15:57:07 +01:00
Daniel Bevenius
20ca2e12c4 model-conversion : remove -c 0 from model card template [no ci] (#18807)
This commit removes the `-c 0` (`--ctx-size`) argument from the llama-server
command in the model card template for causal models.

The motivation for this is that `-c 0` is the default, so specifying it is
redundant.
2026-01-13 14:13:10 +01:00
yulo
ea4a321f2a HIP: add fattn-mma-f16 for RDNA4 (#18481)
* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update as per the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b7723
2026-01-13 13:52:16 +01:00
Johannes Gäßler
c1e79e610f doc: ban AI-generated PR descriptions [no ci] (#18765) 2026-01-13 13:43:12 +01:00
Xuan-Son Nguyen
e047f9ee9d mtmd: fix use_non_causal being reported incorrectly (#18793)
* mtmd: fix use_non_causal being reported incorrectly

* move clip_is_mrope to mtmd_decode_use_mrope

* fix sloppy code ggml_cpy
b7721
2026-01-13 12:19:38 +01:00
Georgi Gerganov
0a57271ab6 CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (#18800) b7720 2026-01-13 12:25:53 +02:00
Gabe Goodhart
076b0faf7d graph : clean up t5 input builders (#18795)
* fix: Remove unnecessary `h` loops where `h` was only ever 0

Branch: CleanUpT5InputBuilders

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary padding loop that is never hit anymore

The upper bound used to be GGML_PAD(n_tokens, GGML_KQ_MASK_PAD), but the
padding was removed in https://github.com/ggml-org/llama.cpp/pull/17910,
leaving the loop dead.
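
Reconstructed shape of the removed code (illustrative names, not the literal t5 builder source):

```cuda
static void build_mask_tail(int n_tokens) {
    // Before #17910 the mask rows were padded, so a tail loop filled them:
    //     for (int i = n_tokens; i < GGML_PAD(n_tokens, GGML_KQ_MASK_PAD); ++i)
    // With the padded upper bound gone, the bound collapses to n_tokens and
    // the body can never execute:
    for (int i = n_tokens; i < n_tokens; ++i) {
        // unreachable -- removed by this commit
    }
}
```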

Branch: CleanUpT5InputBuilders

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
b7719
2026-01-13 09:43:51 +01:00
Ruben Ortlam
db79dc06b1 llama-bench: add direct_io parameter (#18778) b7718 2026-01-13 08:49:10 +01:00
Adrien Gallouët
537d4240d4 ci : remove libcurl in releases (#18775)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7717
2026-01-12 21:43:02 +01:00
Radoslav Gerganov
bcf7546160 server : add arg for disabling prompt caching (#18776)
* server : add arg for disabling prompt caching

Disabling prompt caching is useful for clients who are restricted to
sending only OpenAI-compat requests and want deterministic
responses.

* address review comments

* address review comments
b7716
2026-01-12 19:21:34 +02:00
Adrien Gallouët
36c5913c45 ci : use openssl for openEuler-latest-cmake-cann (#18779)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-12 17:29:00 +01:00
Adrien Gallouët
8e649571cd vendor : update cpp-httplib to 0.30.1 (#18771)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7714
2026-01-12 15:58:52 +01:00
Daniel Bevenius
4150da9a95 examples : add --kv-unified to batched example (#18774)
This commit adds the --kv-unified flag to the batched example. The README.md
specifies this flag as required, but it was not available as a command line
option for the batched example.

The motivation for this is that specifying the flag as the README instructs
leads to an error about it not being recognized, while omitting it makes the
example fail with the following error:
```console
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
b7713
2026-01-12 13:47:58 +01:00
Jeff Bolz
8e2da778da vulkan: change memory_logger to be controlled by an env var (#18769) b7712 2026-01-12 13:32:55 +01:00
Xuan-Son Nguyen
ce3bf9b1a4 server: update docs for sleeping [no ci] (#18777) 2026-01-12 13:01:24 +01:00
Jeff Bolz
2bbe4c2cf8 vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2); for mul_mm the limit is likely around
2^32 * LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that allows it to handle
variants. So far this change adds a single use case for this: compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
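
To make the batch_stride/QUANT_K point concrete, a back-of-the-envelope check (sizes from the Maverick case above; QUANT_K = 32 for Q8_0):

```cuda
#include <cstdint>
#include <cstdio>

int main() {
    // Q8_0 A matrix of 8192 x 5120 with 128 experts: 5,368,709,120 elements
    // in total, which does not fit a 32-bit element index (2^32 ~ 4.29e9).
    const uint32_t batch_stride_elems  = 8192u * 5120u;           // 41,943,040 per expert
    const uint32_t batch_stride_blocks = batch_stride_elems / 32; // QUANT_K = 32

    // expert * batch_stride_elems overflows uint32_t from expert 103 onward,
    // but the same index in *blocks* stays in range for all 128 experts:
    const uint32_t expert    = 127;
    const uint32_t block_idx = expert * batch_stride_blocks;      // 166,461,440
    printf("block index of last expert: %u\n", block_idx);
    return 0;
}
```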
b7710
2026-01-12 12:32:13 +01:00
Ruben Ortlam
1051ecd289 vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
b7709
2026-01-12 07:29:35 +01:00
Xuan-Son Nguyen
0c3b7a9efe model: fix qwen3next broken due to #18683 (#18762) b7708 2026-01-11 21:00:10 +01:00
Ruben Ortlam
0e76501e1d Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
b7707
2026-01-11 17:33:33 +01:00
Xuan-Son Nguyen
4b060bf240 security: make it clear about subtopics in server (#18754)
* security: make it clear about subtopics in server

* exclude DoS
2026-01-11 16:51:03 +01:00
Daniel Bevenius
9789e28459 debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check (#18692)
* debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check

This commit updates the pooling check in the debug example to
also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just
LLAMA_POOLING_TYPE_NONE.

* debug : normalize both pooled and token embeddings

This commit updates debug.cpp to normalize embeddings for both pooled
and non-pooled outputs. For pooled embeddings, normalization is applied
to the single vector, and for non-pooled embeddings, normalization is
applied to each token embedding vector individually.

The motivation for this is to enable non-pooled embeddings to be
normalized which was not possible previously.
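
A sketch of the behavior described (illustrative, not the actual debug.cpp code):

```cuda
#include <cmath>
#include <vector>

// L2-normalize one embedding vector in place.
static void normalize(std::vector<float> & v) {
    double sum = 0.0;
    for (float x : v) sum += (double) x * x;
    const float inv = sum > 0.0 ? 1.0f / (float) std::sqrt(sum) : 0.0f;
    for (float & x : v) x *= inv;
}

// Pooled output: rows holds a single per-sequence vector, normalized once.
// Non-pooled output: rows holds one vector per token, each normalized
// individually.
static void normalize_embeddings(std::vector<std::vector<float>> & rows) {
    for (auto & row : rows) normalize(row);
}
```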
b7705
2026-01-11 16:34:41 +01:00
Georgi Gerganov
84ae04f163 tests : refactor test-backend-sampler (#18753)
* tests : use "auto", use std::string

* tests : refactor test-backend-sampler.cpp

* cmake : remove redundant declarations

* ci : use smaller model

* tests : add struct test_params

* tests : reduce logit bias 100.0f -> 10.0f
b7704
2026-01-11 17:31:03 +02:00
Xuan-Son Nguyen
506bb6e010 model: try to improve Qwen3 Next (#18683)
* qwen3next: simplify qkvz projection

* use ggml_swiglu_split

* revert swiglu_split, but remove redundant repeat()

* fix missing reshape

* rm 2 redundant transposes

* move mul_mat(k,q) to outside of chunking

* rm redundant cont

* improve g_cs_chunk

* add comments about no cont

* use std::pair instead of ggml_concat

* vectorize key_gdiff calculation

* rm unused tensor

* avoid ggml_concat inside loop

* bring back ggml_concat as it may not work on other backends

* nits
b7703
2026-01-11 12:53:33 +01:00
thom-dev-fr
79456a690a readme : update UIs (#18751) 2026-01-11 13:46:50 +02:00
Xuan-Son Nguyen
28068af789 security: narrow down the scope of what we consider a vulnerability (#18752)
* security: narrow down the scope of what we consider a vulnerability

* fix typo
2026-01-11 12:23:36 +01:00
shaofeiqi
707cbafcaa opencl: add SOFTPLUS op support (#18726) b7700 2026-01-10 21:57:44 -08:00
Aman Gupta
b137718878 test-backend-ops: fix mxfp4 tests on blackwell (#18736) b7699 2026-01-11 01:12:57 +08:00
Johannes Gäßler
d2ff4e23ac HIP: adjust RDNA3.5 MMQ kernel selection logic (#18666) b7698 2026-01-10 17:19:01 +01:00
Perry Naseck
657a2e644b cmake : update blas logic (#18205) b7697 2026-01-10 18:00:54 +02:00
Georgi Gerganov
f307926482 server : adjust unified KV cache tests (#18716) 2026-01-10 17:51:56 +02:00
Sigbjørn Skjæret
7fdc8c893d scripts : follow api redirects in pr2wt.sh (#18739) 2026-01-10 16:04:05 +01:00
Xuan-Son Nguyen
23f82f2420 preset: allow named remote preset (#18728)
* preset: allow named remote preset

* nits: fix docs

* cont docs
b7694
2026-01-10 15:12:29 +01:00
Aaron Teo
2656c0d265 docs(ggml): update backend ops (#18734)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-01-10 18:48:17 +08:00
Michael Wand
600a366478 Corrected: use src1->nb[3] instead of nb[2] for s13 (#18724) b7692 2026-01-10 10:16:07 +01:00
Adrien Gallouët
ea23c15990 common : add --license to display embedded licenses (#18696)
This commit introduces a mechanism to embed all licenses directly
into the compiled binaries.

This eliminates the need to distribute separate LICENSE files alongside
the executable, making the binaries self-contained and simplifying
deployment.
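
A hypothetical shape of the mechanism (names invented for illustration; the build would generate the arrays from the LICENSE files):

```cuda
#include <cstdio>
#include <cstring>

// Generated at build time from the repository's LICENSE files, so the
// binary carries the texts itself.
static const char * const embedded_licenses[] = {
    "llama.cpp -- MIT License\n<full text inserted at build time>",
    "cpp-httplib -- MIT License\n<full text inserted at build time>",
};

int main(int argc, char ** argv) {
    if (argc > 1 && strcmp(argv[1], "--license") == 0) {
        for (const char * text : embedded_licenses) {
            printf("%s\n\n", text);
        }
    }
    return 0;
}
```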
b7691
2026-01-10 09:46:24 +01:00
Xuan-Son Nguyen
9ac2693a30 server: fix n_cmpl not skipping processing prompt (#18663)
* server: fix n_cmpl not skipping processing

* fix infinite loop on empty batch

* cont : init child samplers + modify child logic

* cont : cleanup

* cont : improve n_cmpl logic

- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache

* cont : remove redundant function

* cont : reduce parent checks

* fix : nullptr task dereference

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7690
2026-01-10 00:00:41 +01:00