7586 Commits

Author SHA1 Message Date
Georgi Gerganov
08b5d956fc minor : std::unordered_set over std::set 2026-01-12 13:35:25 +02:00
Xuan Son Nguyen
cb5e0f8734 deprecate llama_adapter_lora_free 2025-12-31 12:07:07 +01:00
Xuan Son Nguyen
f5e8bfddc3 lora: make sure model keep track of associated adapters 2025-12-30 15:57:21 +01:00
Xuan-Son Nguyen
cd78e57c3a lora: count lora nodes in graph_max_nodes (#18469)
* lora: count lora nodes in graph_max_nodes

* 3 nodes per weight

* 4 nodes

* keep track n_lora_nodes from llama_model

* fix assert

* rm redundant header

* common: load adapters before context creation

* use 6 nodes
b7583
2025-12-30 15:53:12 +01:00
Jay Zenith
c32fa21db8 sampling: reuse token data buffer in llama_sampler_sample (#18365)
* sampling: reuse token data buffer in llama_sampler_sample

* move cur buffer before timing section, after samplers

* minor : fix build

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b7582
2025-12-30 16:27:49 +02:00
Jeff Bolz
f14f4e421b server: fix files built redundantly (#18474) b7581 2025-12-30 13:11:13 +01:00
Charles Xu
2d6c00a9b8 kleidiai: add and integrate SVE 256-bit vector-length kernel (#18458)
* kleidiai: add and integrate SVE 256-bit vector-length kernel

* updated for review comments
b7580
2025-12-30 14:04:53 +02:00
Aman Gupta
d77d7c5c06 CUDA: add log line when mxfp4 acceleration is used (#18483)
* CUDA: add log line when mxfp4 acceleration is used

* add in backend_get_features
b7579
2025-12-30 17:40:46 +08:00
Daniel Bevenius
a864fb1c14 model-conversion : use CONVERTED_MODEL for compare-embeddings (#18461)
This commit updates the causal model verification script to use the
CONVERTED_MODEL environment variable instead of using the MODEL_PATH
(the original model path) as the basis for the converted model file
name.

The motivation for this that currently if the converted model file name
differs from the original model directory/name the verification script
will look for the wrong .bin file that was generating when running
the converted model.

This similar to the change made for the embeddings models script in
Commit db81d5ec4b ("model-conversion :
use CONVERTED_EMBEDDING_MODEL for embedding_verify_logits (#18079)"),
but we also verify the embeddings of for causal models as well.
2025-12-30 10:13:12 +01:00
Xuan-Son Nguyen
51a48720b8 webui: fix prompt progress ETA calculation (#18468)
* webui: fix prompt progress ETA calculation

* handle case done === 0
b7577
2025-12-29 21:42:11 +01:00
Pascal
c9a3b40d65 Webui/prompt processing progress (#18300)
* webui: display prompt preprocessing progress

* webui: add percentage/ETA and exclude cached tokens from progress

Address review feedback from ngxson

* webui: add minutes and first chunk (0%) case

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: address review feedback from allozaur

* chore: update webui build output

* webui: address review feedback from allozaur

* nit

* chore: update webui build output

* feat: Enhance chat processing state

* feat: Improve chat processing statistics UI

* chore: update webui build output

* feat: Add live generation statistics to processing state hook

* feat: Persist prompt processing stats in hook for better UX

* refactor: Enhance ChatMessageStatistics for live stream display

* feat: Implement enhanced live chat statistics into assistant message

* chore: update webui build output

* fix: Proper tab for each stage of prompt processing/generation

* chore: update webui build output

* fix: Improved ETA calculation & display logic

* chore: update webui build output

* feat: Simplify logic & remove ETA from prompt progress

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2025-12-29 19:32:21 +01:00
Johannes Gäßler
0bd1212a43 CUDA: fix replacment of bad archs in CMake (#18457) 2025-12-29 17:58:20 +01:00
wbtek
5b1248c9af server : Cmdline arg -to changes http read timeout from current 600sec default (#18279)
* Prevent crash if TTFT >300sec, boosted to 90 days

* server : allow configurable HTTP timeouts for child models

* server : pass needed timeouts from params only

---------

Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net>
b7574
2025-12-29 17:12:48 +01:00
Xuan-Son Nguyen
3595ae5963 contributing: tighten AI usage policy (#18388)
* contributing: tighten AI usage policy

* refactor AGENTS.md

* proofreading

* update contributing

* add claude.md

* add trailing newline

* add note about dishonest practices

* rm point about dishonest

* rm requirement watermarking

* add .gemini/settings.json

* allow initially AI-generated content

* revise

* Update CONTRIBUTING.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* improve

* trailing space

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* update

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-12-29 16:01:32 +01:00
Naco Siren
c1366056f6 android: routine maintenance - Dec 2025 (#18338)
* Fix `msg` typo

* Fix thread safety in destroy() to support generation abortion in lifecycle callbacks.

* UI polish: stack new message change from below; fix GGUF margin not in view port

* Bug fixes: rare racing condition when main thread updating view and and default thread updating messages at the same time; user input not disabled during generation.

* Bump dependencies' versions; Deprecated outdated dsl usage.
b7572
2025-12-29 15:51:13 +02:00
Georgi Gerganov
2a85f720b8 server : handle closed connection for tasks (#18459) b7571 2025-12-29 15:34:41 +02:00
Daniel Bevenius
7cbec34a63 model-conversion : add device option to embd run orig model (#18386)
This commit refactors the original model embedding script to include a
device selection option. Users can now specify the device (cpu, cuda,
mps, auto) via command-line arguments. It also refactors the code to be
more structured.
2025-12-29 13:37:02 +01:00
Héctor Estrada Moreno
0c8986403b retrieval : use at most n_seq_max chunks (#18400) b7569 2025-12-29 13:21:13 +02:00
o7si
daa242dfc8 common: fix return value check for setpriority (#18412)
* common: fix return value check for setpriority

* tools: add logging for process priority setting
b7568
2025-12-29 11:07:49 +02:00
Johannes Gäßler
e70e640db3 CUDA: Blackwell features for non-native builds (#18436) b7567 2025-12-29 09:35:42 +01:00
Aman Gupta
5fa66c6e67 cuda: fix race condition in cumsum (#18448)
* ggml-cuda: fix race condition in cumsum

* remove unneccesary sync_threads
b7566
2025-12-29 14:07:17 +08:00
Tim Neumann
382808c14b ci : re-enable rocm build on amd64 (#18439)
This was disabled in #9340 due to compiler crash, but seems to build now as confirmed by the latest comments in #11913.

I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages, `full`, `server` and `light`).

A quick attempt at trying to build an arm64 image failed. Since none of the other images are build for arm, I only enabled the amd64 one.

The `runs_on` option was added to match the other entries.
b7565
2025-12-29 00:29:23 +01:00
uvos
4ffc47cb20 HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (#18202) b7564 2025-12-28 20:12:55 +01:00
momonga
9c675c7140 model : Plamo3 support (#17304)
* plamo3

* fix plamo3

* clean code

* clean up the code

* fix diff

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* clean up the code

* add chat_template if exist

* clean up the code

* fix cpu-backend

* chore: whitespace trim fix + typo fix

* Fix: address review feedback

* restore `FREQ_BASE_SWA` constant

* Fix: address review feedback2

* Fix:typecheck

* Fix: address review feedback3

* final cleanup

---------

Co-authored-by: mmngays <146910567+mmngays@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
b7563
2025-12-28 17:28:31 +01:00
Aman Gupta
07a0c4ba92 Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (#18426) b7562 2025-12-28 20:53:36 +08:00
o7si
60f17f56da rpc: fix segfault on invalid endpoint format (#18387)
* rpc: fix segfault on invalid endpoint format

* rpc: add error log for failed endpoint connection
b7561
2025-12-28 12:34:41 +02:00
Johannes Gäßler
f8d561eb87 llama-fit-params: fix step size for last device (#18415) b7560 2025-12-28 10:52:09 +01:00
Johannes Gäßler
e59efe6a78 github: update issue templates [no ci] (#18410)
* github: update issue templates [no ci]

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-28 10:50:56 +01:00
Xuan-Son Nguyen
cffa5c46ea mtmd: clarify that we no longer accept AI-generated PRs (#18406) b7558 2025-12-28 09:57:04 +01:00
Boian Berberov
94de74e7b1 cmake: Added more x86_64 CPU backends when building with GGML_CPU_ALL_VARIANTS=On (#18186)
* minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

* cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

- `ivybridge`
- `piledriver`
- `cannonlake`
- `cascadelake`
- `cooperlake`
- `zen4`

Resolves: #17966
b7557
2025-12-28 09:33:29 +02:00
QDelta
4fd59e8427 ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413) b7556 2025-12-28 09:33:14 +08:00
lhez
08566977a7 opencl: allow resizing transpose buffers (#18384)
* opencl: allow resizing transpose buffers instead of using fixed sizes

* opencl: remove commented code
b7555
2025-12-27 15:51:14 -08:00
Johannes Gäßler
a4bf35889e llama-fit-params: fix overflow check (#18354) b7554 2025-12-27 20:20:45 +01:00
Johannes Gäßler
026d2ad472 llama: fix magic number of 999 for GPU layers (#18266)
* llama: fix magic number of 999 for GPU layers

* use strings for -ngl, -ngld

* enacapsulate n_gpu_layers, split_mode
b7553
2025-12-27 20:18:35 +01:00
Aman Gupta
06705fdcb3 ggml-cuda: Use same regex for GGML_NATIVE=OFF (#18407) b7552 2025-12-27 19:56:27 +08:00
Johannes Gäßler
a52dc60ba3 llama_fit_params: return enum for fail vs. error (#18374) b7551 2025-12-27 09:59:19 +01:00
Johannes Gäßler
9045c9afe5 llama-fit-params: fix Gemma 3 calculation (#18372) b7550 2025-12-27 09:56:04 +01:00
Jeff Bolz
c9ced4910b vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (#18352)
Run a preprocess to count how many times each expert is used, and use this to
quickly discard workgroups that aren't needed.
b7549
2025-12-26 16:12:58 -06:00
Jeff Bolz
7ac8902133 vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (#18349)
* vulkan: Use BK=32 for coopmat2 mul_mat_id

* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.

Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
b7548
2025-12-26 18:15:50 +01:00
Jeff Bolz
9bf20d8ac3 vulkan: Use BK=32 for coopmat2 mul_mat_id (#18332) b7547 2025-12-26 18:15:02 +01:00
Eve
cb999704fb vulkan: small dequantization improvements (#18380)
* iq4_xs

* quants
2025-12-26 18:12:11 +01:00
Jeff Bolz
b96b82fc85 vulkan: Support UPSCALE w/antialias (#18327) b7545 2025-12-26 17:00:57 +01:00
Jeff Bolz
10dc500bdb vulkan: handle rope with large number of rows (#18306) b7544 2025-12-26 16:53:46 +01:00
o7si
4893cc07bb server : fix crash when seq_rm fails for hybrid/recurrent models (#18391)
* server : fix crash when seq_rm fails for hybrid/recurrent models

* server : add allow_processing param to clear_slot
b7543
2025-12-26 16:35:29 +01:00
Francisco Herrera
af3be131c0 docs: added note for pre SYCL Intel hardware (#18016)
Specify that it's for pre sycl hardware
b7542
2025-12-26 10:34:30 +08:00
0Marble
b07cda687c CANN: implement the SSM_CONV operator (#17737)
* CANN: implement SSM_CONV operator

Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

* CANN: remove custom error limit for SSM_CONV

* CANN: merge SSM_CONV tensor shape/strides into one line

---------

Co-authored-by: Sujin Kang, <waterjin326@gmail.com>
b7541
2025-12-26 09:12:04 +08:00
Aman Gupta
85c40c9b02 ggml-cuda: fix regex for arch list (#18371)
* ggml-cuda: fix regex for arch list

* make regex exact
b7540
2025-12-26 01:35:14 +08:00
Aman Gupta
83b3b1c271 cuda: optimize cumsum cub path (#18362)
* cuda: optimize cumsum cub path

* remove heavy perf test
b7539
2025-12-25 23:55:38 +08:00
Aman Gupta
b0fb0f0aee ggml-cuda: fix blackwell native builds (#18361)
* ggml-cuda: fix blackwell native builds

Replace 12x in native architectures by 12xa

* replace for GGML_NATIVE=OFF too

* only replace for native

* remove 120f-virtual for default compilation

---------

Co-authored-by: Aman Gupta <aman>
b7538
2025-12-25 22:12:11 +08:00
Penglin Cai
e68c19b0fd CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (#17934)
* CONV_TRANSPOSE_1D kernel_size>255

* remove condition check

* fix the bug of type conversion

* removing trailing whitespaces

* fix: return true in the switch case
2025-12-25 16:46:09 +08:00