Some toolchains do not support linking via pragmas such as:
#pragma comment(lib, "crypt32.lib")
so we need to add the library explicitly.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : implement parser combinators to simplify chat parsing
* add virtual destructor to parser_base
* fix memory leak from circular references of rules
* implement gbnf grammar building
* remove unused private variable
* create a base visitor and implement id assignment as a visitor
* fix const ref for grammar builder
* clean up types, friend classes, and class declarations
* remove builder usage from until_parser
* Use a counter class to help assign rule ids
* cache everything
* add short description for each parser
* create a type for the root parser
* implement repetition parser
* Make optional, one_or_more, and zero_or_more subclasses of repetition
* improve context constructor
* improve until parsing and add benchmarks
* remove cached() pattern, cache in parser_base with specialized parsing functions for each parser
* improve json parsing performance to better match legacy parsing
* fix const auto * it for windows
* move id assignment to classes instead of using a visitor
* create named rules in the command r7b example
* use '.' for any in GBNF
* fix parens around choices in gbnf grammar
* add convenience operators to turn strings to literals
* add free-form operators for const char * to simplify defining literals
* simplify test case parser
* implement semantic actions
* remove groups in favor of actions and a scratchpad
* add built in actions for common operations
* add actions to command r7b example
* use std::default_searcher for platforms that don't have bm
* improve parser_type handling and add cast helper
* add partial result type to better control when to run actions
* fix bug in until()
* run actions on partial results by default
* use common_chat_msg for result
* add qwen3 example wip
* trash partial idea and simplify
* move action arguments to a struct
* implement aho-corasick matcher for until_parser and to build exclusion grammars
* use std::string for input, since std::string_view is incompatible with std::regex
* Refactor tests
* improve qwen3 example
* implement sax-style parsing and refactor
* fix json string in test
* rename classes to use common_chat_ prefix
* remove is_ suffix from functions
* rename from id_counter to just counter
* Final refactored tests
* Fix executable name and editorconfig-checker
* Third time's the charm...
* add trigger parser to begin lazy grammar rule generation
* working lazy grammar
* refactor json rules now that we check for reachability
* reduce pointer usage
* print out grammars in example
* rename to chat-peg-parser* and common_chat_peg_parser*
* Revert unrelated changes
* New macros for CMakeLists to enable multi-file compilations
* starting unicode support
* add unicode support to char_parser
* use unparsed args as additional sources
* Refactor tests to new harness
* Fix CMakeLists
* fix rate calculation
* add unicode tests
* fix trailing whitespace and line endings
skip-checks: true
* Helpers + rewrite qwen3 with helpers
* Fix whitespace
* extract unicode functions to separate file
* refactor parse unicode function
* fix compiler error
* improve construction of sequence/choice parsers
* be less clever
* add make_parser helper function
* expand usage of make_parser, alias common_chat_msg_peg_parser_builder to builder in source
* lower bench iterations
* add unicode support to until_parser
* add unicode support to json_string_parser
* clean up unicode tests
* reduce unicode details to match src/unicode.cpp
* simplify even further
* remove unused functions
* fix type
* reformat char class parsing
* clean up json string parser
* clean up + fix diagnostics
* reorder includes
* compact builder functions
* replace action_parser with capture_parser, rename env to semantics
* rename env to semantics
* clean up common_chat_parse_context
* move type() to below constant
* use default constructor for common_chat_peg_parser
* make all operators functions for consistency
* fix compilation errors in test-optional.cpp
* simplify result values
* rename json_string_unquoted to json_string_content
* Move helper to separate class, add separate explicit and helper classes
* Whitespace
* Change + to append()
* Reformat
* Add extra helpers, tests and Minimax example
* Add some extra optional debugging prints + real example of how to use them
* fix bug in repetitions when min_count = 0 reports failures
* dump rule in debug
* fix token accumulation and assert parsing never fails
* indent debug by depth
* use LOG_* in tests so logs sync up with test logs
* - Add selective testing
- Refactor all messaging to use LOG_ERR
- Fix lack of argument / tool name capturing
- Temporary fix for double event capture
* refactor rule() and introduce ref()
* clean up visitor
* clean up indirection in root parser w.r.t rules
* store shared ptr directly in parser classes
* replace aho-corasick automation with a simple trie
* Reset prev for qwen3 helper example variant
* refactor to use value semantics with std::variant/std::visit
* simplify trie_matcher result
* fix linting issues
* add annotations to rules
* revert test workaround
* implement serializing the parser
* remove redundant parsers
* remove tests
* gbnf generation fixes
* remove LOG_* use in tests
* update gbnf tests to test entire grammar
* clean up gbnf generation and fix a few bugs
* fix typo in test output
* remove implicit conversion rules
* improve test output
* rename trie_matcher to trie
* simplify trie to just know if a node is the end of a word
* remove common_chat_ prefix and ensure a common_peg_ prefix to all types
* rename chat-peg-parser -> peg-parser
* promote chat-peg-parser-helper to chat-peg-parser
* checkpoint
* use a static_assert to ensure we handle every branch
* inline trivial peg parser builders
* use json strings for now
* implement basic and native chat peg parser builders/extractors
* resolve refs to their rules
* remove packrat caching (for now)
* update tests
* compare parsers with incremental input
* benchmark both complete and incremental parsing
* add raw string generation from json schema
* add support for string schemas in gbnf generation
* fix qwen example to include \n
* tidy up example
* rename extractor to mapper
* rename ast_arena to ast
* place basic tests into one
* use gbnf_format_literal from json-schema-to-grammar
* integrate parser with common/chat and server
* clean up schema and serialization
* add json-schema raw string tests
* clean up json creation and remove capture parser
* trim spaces from reasoning and content
* clean up redundant rules and comments
* rename input_is_complete to is_partial to match rest of project
* simplify json rules
* remove extraneous file
* remove comment
* implement += and |= operators
* add comments to qwen3 implementation
* reorder arguments to common_chat_peg_parse
* remove commented outdated tests
* add explicit copy constructor
* fix operators and constness
* wip: update test-chat for qwen3-coder
* bring json parser closer to json-schema-to-grammar rules
* trim trailing space for most things
* fix qwen3 coder rules w.r.t. trailing spaces
* group rules
* do not trim trailing space from string args
* tweak spacing of qwen3 grammar
* update qwen3-coder tests
* qwen3-coder small fixes
* place parser in common_chat_syntax to simplify invocation
* use std::set to collect rules to keep order predictable for tests
* initialize parser to make certain platforms happy
* revert back to std::unordered_set, sort rule names at the end instead
* uncomment rest of chat tests
* define explicit default constructor
* improve arena init and server integration
* fix chat test
* add json_member()
* add a comprehensive native example
* clean up example qwen test and add response_format example to native test
* make build_peg_parser accept std::function instead of template
* change peg parser parameters into const ref
* push tool call on tool open for constructed parser
* add parsing documentation
* clean up some comments
* add json schema support to qwen3-coder
* add id initializer in tests
* remove grammar debug line from qwen3-coder
* refactor qwen3-coder to use sequence over operators
* only call common_chat_peg_parse if appropriate format
* simplify qwen3-coder space handling
* revert qwen3-coder implementation
* revert json-schema-to-grammar changes
* remove unnecessary forward declaration
* small adjustment to until_parser
* rename C/C++ files to use dashes
* codeowners : add aldehir to peg-parser and related files
---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
* Faster tensors (#8)
Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Fix .gitignore
* Add memory64 option and remove unneeded macros for setting threads to 1
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Removed apt prompt
* Added RISC-V specific tests with corrections
Corrections included:
1. Changed the test names from debian to ubuntu as it is more stable than Debian Trixie
2. Added explicit compiler in cmake command as GCC compiler below version 14 have been recorded
to throw errors with rvv1.0 and some other extensions
3. Added dependencies which are not installed by default in the RISC-V Ubuntu 24.04
4. Separate ccache directory for all jobs as all the ccache results are not the same and may cause ccache to not work
* Resolved the merge conflict and cleaned up run.sh
* Update ci/run.sh
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Removed previously added build ci for RISC-V
* Removed trailing whitespaces
* corrected build name
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cleanup
* Enabled build tests (1)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Enabled build tests (2)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* enable openssl
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
This commit removes a redundant check for sched->n_copies > 1 when
setting input and output flags on tensor copies in
ggml_backend_sched_split_graph.
The motivation for this change is to clarify the code as the outer if
statement already performs this check.
Taking a break from llama.cpp . I wasn't around at the start of llama.cpp
but I want to thank @ggerganov and @slaren for creating a neat community
here.
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
* Revert "rm unused fn"
This reverts commit f2dbe9c087.
* server: explicitly set exec path when create new instance
* put back TODO
* only call get_server_exec_path() once
* add fallback logic
* Adjust to pytorch
* Add antialiasing upscale
* Increase number of patches to 1024
* Handle default marker insertion for LFM2
* Switch to flag
* Reformat
* Cuda implementation of antialias kernel
* Change placement in ops.cpp
* consistent float literals
* Pad only for LFM2
* Address PR feedback
* Rollback default marker placement changes
* Fallback to CPU implementation for antialias implementation of upscale
This change limits progress updates to approximately every 0.1% of the
file size to minimize stdio overhead.
Also fixes compiler warnings regarding __func__ in lambdas.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* git mv
* add server-context.h
* add server-context.h
* clean up headers
* cont : cleanup
* also expose server_response_reader (to be used by CLI)
* fix windows build
* decouple server_routes and server_http
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
As [1] explained, the real debug message will be like:
"res operator(): operator() : queue result stop"
Set the name explicitly, the message is easy for debugging:
"res operator(): recv : queue result stop"
The left "operator()" is generated by 'RES_DBG() ... __func__'
[1]: https://clang.llvm.org/extra/clang-tidy/checks/bugprone/lambda-function-name.html
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
gguf_new_metadata.py reads data from reader.
Reader doesn't byteswap tensors to native endianness.
But writer does expect tensors in native endianness to convert them
into requested endianness.
There are two ways to fix this: update reader and do conversion to native endianness and back,
or skip converting endianness in writer in this particular USE-case.
gguf_editor_gui.py doesn't allow editing or viewing tensor data.
Let's go with skipping excessive byteswapping.
If eventually capability to view or edit tensor data is added,
tensor data should be instead byteswapped when reading it.
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.
* llama : update worst-case graph for unified cache
* ci : disable op offload in some tests
* fix spelling
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : add Anthropic Messages API support
* remove -@pytest.mark.slow from tool calling/jinja tests
* server : remove unused code and slow/skip on test_anthropic_vision_base64_with_multimodal_model in test_anthropic_api.py
* server : removed redundant n field logic in anthropic_params_from_json
* server : use single error object instead of error_array in streaming response handler for /v1/chat/completions and use unordered_set instead of set in to_json_anthropic_stream()
* server : refactor Anthropic API to use OAI conversion
* make sure basic test always go first
* clean up
* clean up api key check, add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Qwen3 Next - cleaned up version
* Whitespaces and stuff
* Correct minor errors
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Misc. fixes.
* Clean up code, add missing hybrid qualifier
* Did someone transpose the SOLVE_TRI result matrix? Perhaps...
* Whitespace
* Proper tensors for cb calls
* Use llama-graph.h vertical alignment
* BROKEN: chunking
* Set new tensors as inputs.
* Proper chunk logic
* It's the circle of life...
* More shenanigans for n_seq > 1
* Nail in the coffin?
* Fix Windows build
* Eh, one fails on Windows, the other fails on Mac... just use general capture.
* quant : cleanup
* model : cleanup
* qwen3 : cleanup
* cont : cleanup
* cont : cleanup
* ggml : revert change
* qwen3 : cleanup
* cont : cleanup
* Readd cmath
* qwen3 : fix typo
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Usual suspects
* fix my bad suggestion
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this this is not the case, the server closes
the connection. This saves us a network round trip to the server.
* enable mmf for rdna4
* move some mmvf to mmf
* revert lds128 for wmma loading
* Revert "revert lds128 for wmma loading"
This reverts commit db9ae8b6b4.
* Revert "enable mmf for rdna4"
This reverts commit 698c9f2418.
* Revert "move some mmvf to mmf"
This reverts commit 99b92bd665.
* enable mul_mat for rdna4
---------
Co-authored-by: zhang hui <you@example.com>
* Enabled q4_K_4x8 path
* Fixed generic Q4_K 8x4 implementation
* wip: dotprod gemm
* Working arm q4_K dotprod gemm
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Undo acc rename
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Q4_K arm dotprod gemm
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fix: q4_qs reinterpret from uint to int
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Removed comments
* Fixed macro guards
* Fixed unused vars in generic implementation
* Fixed unused vars in 8x4 repack
* Fixed unused vars in generic implementation, unneeded comment
* Missing arch fallback for x86
* minor : style
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vulkan: Implement top-k
Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.
* fix pipeline selection
* vulkan: Add N-ary search algorithm for topk
* microoptimizations
We have to separate the code path starting 3.28 because
`FetchContent_Populate` is now deprecated and will be completely removed
in a future version.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
On arm64 with `cmake` version 3.31.6, the final feature verification fails:
-- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
-- Checking for ARM features using flags:
-- -U__ARM_FEATURE_SME
-- -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
-- Performing Test HAVE_DOTPROD
-- Performing Test HAVE_DOTPROD - Failed
-- Performing Test HAVE_SVE
-- Performing Test HAVE_SVE - Failed
-- Performing Test HAVE_MATMUL_INT8
-- Performing Test HAVE_MATMUL_INT8 - Failed
-- Performing Test HAVE_FMA
-- Performing Test HAVE_FMA - Success
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed
-- Performing Test HAVE_SME
-- Performing Test HAVE_SME - Failed
-- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
We need to explicitly replace `;` with spaces from the list to make
`CMAKE_REQUIRED_FLAGS` work correctly...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* patch failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) for enabling WMMA on RDNA4
* Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162
* CANN: ROPE supports both MROPE and IMROPE.
1. Optimize the caching logic of rope_cache_init.
2. Add support for mRoPE and i-mRoPE.
Note that on Ascend 910B devices, it is necessary to disable FA
in CLIP and disable NZ-format conversion. These two issues are
still under investigation.
* Resolve review comments
* Fix convert_hf_to_gguf.py script on s390x
Assume converted model data is originally little-endian.
Byteswap data on s390x after reading it to put values in correct presentation
for any transformation needed, like calculating weight tensors.
Then byteswap data to little-endian before passing it to GGUFWriter while
GGUFWriter will byteswap data back to big endian if big endian output is requested.
byteswap(inplace=True) calls don't work with lazy tensor and array wrappers.
Use byteswap with copying data to workaround this behaviour.
* Make GGUFWriter accept tensors in native endianness instead of little-endian
With this change if no byteswapping is actually needed, 2 excessive byteswaps can be omitted on s390x
* Fix byteswapping in convert_hf_to_gguf.py for remote models
* webui: add rehype plugin to restore HTML in Markdown table cells
The remark/rehype pipeline neutralizes inline HTML as literal text
(remarkLiteralHtml) so that XML/HTML snippets in LLM responses display
as-is instead of being rendered. This causes <br> and <ul> markup in
table cells to show as plain text.
This plugin traverses the HAST post-conversion, parses whitelisted HTML
patterns (<br>, <ul><li>) from text nodes, and replaces them with actual
HAST element nodes. For lists, adjacent siblings must be combined first
as the AST fragmentation breaks pattern matching.
Strict validation rejects malformed markup, keeping it as raw text.
* chore: update webui build output
This commit adds a check to skip the output reordering logic when
n_outputs == 1. With a single output token, the data is trivially
sorted and the reordering code is currently doing unnecessary work
(resetting and rebuilding output_ids to the same values).
The motivation for this change is improved code clarity and avoiding
confusion when debugging. While the performance impact is probably
negligible, this unnecessary work happens on every decode call in
llama-server when processing batches with single-token outputs.
* first commit naive test to enable mmq for RDNA4
* adding appropriate WMMA instructions
* git rebase on top of master: fixing the correctness of the mat mul operations, updating layout mappings for RDNA4
* clean up merge conflicts
* add comments and code clean up
* PR clean up, addressed comments
* enable MMQ fallback on RDNA4
* addressed comments: add guards in load generic, separate wmma branch for use_mmq function
* Revert build-xcframework.sh
* Formating: remove trailing whitespace
* revert CMake files
* clean up after rebase: remove duplicated change, revert cmake files
* clean up after rebase: revert changes from build-xcframework.sh
* clean up: remove extra space line in mma.cuh
* Revert "clean up: remove extra space line in mma.cuh"
This reverts commit b39ed57c45.
This commit adds the --kv-unified flag to the usage example
in the README.md file for the batched example.
The motivation for this is that without this flag the example will fail
with the following error:
```console
Hello my name is
split_equal: sequential split is not supported when there are coupled
sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```
This commit removes the "-dirty" suffix from the GGML version string.
The motivation for this change is to ensure that the version string
works with different ways of checking out ggml and using it in projects.
By removing the dirty flag from the version string, we avoid potential
artifacts like shared libraries getting a -dirty suffix in their names.
Instead, if the project is built from a dirty git state, the dirty flag
will be appended to the commit hash in the GGML_BUILD_COMMIT variable.
This will enable users to still identify that the build was made from
from a modified/dirty state even though the version might match a "real"
version.
For example, the commit can be produces as follows:
```c++
printf("commit: %s\n", ggml_commit());
```
Which would print the following for a dirty build:
```console
commit: 781baf2a-dirty
```
Refs: https://github.com/ggml-org/ggml/pull/1363#issuecomment-3569691546
**Description of the problem**
`cann_graph_update_required` is redundantly defined and
initialized as `false` inside two mutually exclusive macro branches.
**Proposed solution**
Define it right before the macro so that it could serve both
branches.
* ggml-hexagon: fix build error with GCC
Add stdexcept include to fix GCC build errors
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
* ggml-hexagon: check VTCM acquire failures
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
* ggml-hexagon: disable destination bypass on older than v73
v68 errors out if having bypass enabled when the VTCM is the destination.
At least on v68 this made things actually work... not a proper fix though, so to look at later...
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
* ggml-hexagon: add initial v68/v69 support
v68 is the Hexagon revision notably used on the Snapdragon 8cx
Gen 3 and the QCM6490.
Also add support for v69.
8MB isn't a supported page size, so relax asked for page size constraint
for HAP_compute_res_attr_set_vtcm_param_v2 to optimal.
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
---------
Signed-off-by: Mohamed Mediouni <mohamed@unpredictable.fr>
* hexagon: add buffer support checks for hexagon sessions
* refactor: simplify buffer support checks in hexagon operations
* hexagon: update buffer support checks to use tensor structure
* refactor: streamline buffer initialization for DSP queue in hexagon operations
* refactor: simplify buffer initialization in DSP queue for hexagon operations
* refactor: optimize hex_supported_buffer function by fold expression
* wip
* refactor: simplify dspqueue_buffers_init function and its usage in hexagon operations
* fix: improve nan handling at hvx_vec_fast_sigmoid_fp32_guard
* refactor: optimize hvx_vec_inverse_fp32_guard for better nan handling
* refactor: update hvx_vec_fast_sigmoid_fp32_guard to use adjusted exponent limits
* refactor: modify hvx_vec_fast_sigmoid_fp32_guard to accept parameters for improved flexibility
* refactor: update hvx_vec_exp_fp32_guard to accept max_exp and inf parameters to save some instructions
* refactor: move hvx_vec_inverse_fp32_guard implementation to hvx-inverse.c for better perf
* mmf for rdna4
* align the padding for rdna4
* forbit mul_mat_f for rdna4
* fix as comment
* remove device kernels
* add constexpr for early return
* update based on review comment
* change based on the review comment
* pass compile error
* keep code consistency
---------
Co-authored-by: zhang hui <you@example.com>
* Detect GigaChat3-10-A1.8B as deepseek lite
Hardcodes checking number of layers to detect if lite version of deepseek.
* Add commnent identifying deepseek lite variants
deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B
* CANN: Refactor `evaluate_and_capture_cann_graph`
**Description of the problem**
* `matched_graph` is obtained even if graph mode is disabled.
* End of graph capture and graph replay are unnecessarily placed in different `if` blocks.
**Proposed solution**
* Obtain `matched_graph` only if graph mode is enabled.
* Place end of graph capture and graph reply inside the same `if` block.
* Unify graph related comments.
* Remove trailing whitespace
* Fix DoS / integer overflow
* Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :)
* White space
* Actually, since it's unsigned, use UINT64_MAX
* fix: TypeError when loading base model remotely in convert_lora_to_gguf
* refactor: simplify base model loading using cache_dir from HuggingFace
* Update convert_lora_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: add remote_hf_model_id to trigger lazy mode in LoRA converter
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* vulkan: support larger argsort
This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory and when more than 1024 threads are needed
it runs multiple workgroups and synchronizes through a pipelinebarrier.
To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied this same change to the original shared memory
version of the shader, which is still used when ncols <= 1024.
* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost
* reduce loop overhead
* run multiple cols per invocation, to reduce barrier overhead
* Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition
* Argh.
* Making CISC happy ;)
* Integrate CONT tests
* Use loopy loop
* Skip new tests for (B)F16 for now.
Test 'Q4_K_M' quantization on https://huggingface.co/pfnet/plamo-2-translate
The 'suffix_to_score' size is 193510, it needs 19 memory allocation with final
capacity 262144 to hold the value, if not preserve the memory.
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
* Add files via upload
* fix unit test
* fix crashes for --reasoning-format=none
* Patch buggy official MiniMax-M2 chat template
* add upstream minja fix: https://github.com/ochafik/minja/pull/7
* Fix <think> token not generated
* add test copied from https://github.com/ggml-org/llama.cpp/pull/16946
* cleanup
* Hopes to fix the compilation error on CI
* Delete chat template patching since it’s fixed by upstream Minja
* Remove undeeded Minimax-M2 template patch
https://github.com/ochafik/minja/pull/7#issuecomment-3480356100
* Add proper handling of optional parameters with test
merged tests from: 23d4bb75c4
* Fix making all tool parameters optional
* Move xml tool parser to separate file
* cleanup & add tests for GLM4.5
* add streaming tests & enhancement & cleanups
Add streaming test for both GLM 4.5 and minimax-m2.
Cleanup for preserved_tokens.
Cleanup for grammar rule name.
Enhance the parser's stability.
* cleanup & add support for Kimi-K2 Qwen3-Coder Apriel-1.5 Xiaomi-MiMo
* apply suggestions from reviewers
* fix a misuse for data.grammar_lazy
* fix grammar when tool have no argument
* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6. Insert additional stops for Kimi-K2
* update chat.cpp
* fix grammar for GLM 4.5/4.6
* Try fix Jinja template for GLM
* Try fix GLM-4.6.jinja
* Update common/chat-parser-xml-toolcall.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* improve chat template for GLM, rename Kimi-K2 template to Kimi-K2-Thinking
* Improve Kimi-K2 chat template
* Fix unit test
* Fix "Invalid tool call arguments passed" in a rare case.
In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cann: fix acl_tensor_ptr usage in ASCEND_310P ROPE implementation
Fix compilation errors in the ASCEND_310P-specific ROPE operation code
by adding .get() calls when passing acl_tensor_ptr smart pointers to
functions expecting raw aclTensor* pointers.
This fixes the code that was missed in the previous refactoring commit
(8981848) which changed ggml_cann_create_tensor() return type from
aclTensor* to acl_tensor_ptr.
* cann: format code
Update openEuler version
Remove variable ASCEND_SOC_TYPE
Modify the chip type
Fix case in zip filename
Change "device" to "chip_type"
Modify the value of chip_type
* server: split HTTP into its own interface
* move server-http and httplib to its own file
* add the remaining endpoints
* fix exception/error handling
* renaming
* missing header
* fix missing windows header
* fix error responses from http layer
* fix slot save/restore handler
* fix case where only one stream chunk is returned
* add NOMINMAX
* do not call sink.write on empty data
* use safe_json_to_str for SSE
* clean up
* add some comments
* improve usage of next()
* bring back the "server is listening on" message
* more generic handler
* add req.headers
* move the chat template print to init()
* add req.path
* cont : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* CANN: Use smart pointers to manage ACL objects
Previously, ACL objects were managed via manual destruction, which
led to multiple memory-leak issues during runtime. This patch replaces
manual memory management with smart pointers so that ACL objects
are properly released and ownership is clearly defined.
Note that the ownership of an ACL object belongs to the function
that creates it. Other internal functions should operate on these ACL
objects using raw pointers to avoid unintended ownership transfers.
Additionally, since aclTensorList automatically frees its contained
aclTensor objects, any aclTensor added to a tensor list must release
ownership to avoid double free operations.
This PR also removes the asynchronous task submission mechanism.
Due to changes in recent CANN versions, tiling time has significantly
decreased. Even with a dual-thread submission model, the dispatch
overhead still falls on the critical path, making async submission
less beneficial. Moreover, aclGraph support provides a much better
path to reducing operator dispatch latency.
* CANN: resolve review comments
* vulkan: add LOG operation support for F32 and F16
Part of #14909.
* vulkan: Fix LOG operation types
* docs: Update operation support documentation for Vulkan LOG operation
* vulkan: fix log_f16 shader
* docs: restore missing LOG test cases and regenerate ops.md
* SYCL: add generic unary op implementation for multiple ops (ABS/SGN/…); unify non-contiguous access
* SYCL: update documentation and sycl.csv to reflect new unary op support
* update ops.md after syncing SYCL.csv changes
* Fix SYCL.csv merge conflict
* Update ops.md after fixing SYCL.csv conflicts
* Fix SYCL.csv tail after merge conflict and regenerate ops.md
* Fix line endings and final newline in SYCL.csv
* Remove TOPK_MOE entries from SYCL.csv as requested
* Update ops.md after removing TOPK_MOE from SYCL.csv
* Regenerated SYCL.csv and synced ops.md with upstream
* Update ops.md using create_ops_docs.py
* webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI
- Purely visual and diagnostic change, no effect on model context, prompt
construction, or inference behavior
- Captured assistant tool call payloads during streaming and non-streaming
completions, and persisted them in chat state and storage for downstream use
- Exposed parsed tool call labels beneath the assistant's model info line
with graceful fallback when parsing fails
- Added tool call badges beneath assistant responses that expose JSON tooltips
and copy their payloads when clicked, matching the existing model badge styling
- Added a user-facing setting to toggle tool call visibility to the Developer
settings section directly under the model selector option
* webui: remove scroll listener causing unnecessary layout updates (model selector)
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: npm run format & update webui build output
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* vulkan: change graph_compute to be async and enable get_tensor_async
This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.
Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.
The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.
The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.
* fix thread safety errors
* teardown context cleanly
* Handle async read to non-pinned dst
* fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (#17047)
* Replace 'static' workaround, with keeping variable in scope for longer
* Create std::array directly and pass into llama_grammar_init_impl
* Add back the trigger pattern
* Missed array include
* ggml-cpu: handle 3d tensors in repack mul_mat
* Removed unnecessary branch, removed need for <algorithm>
* Fixed dst_ptr pointer in chunk + clang_format
* GGML_ASSERT to check wdata within bounds
* Accidental ggml.h inclusion
* Improved GGML_ASSERT on wdata boundaries
* Address performance regression in Qwen and llama.cpp due to chunking
* vulkan: remove shell call from vulkan-shaders-gen tool
* use string vector for command execution
* Fix condition
* use string, remove const_cast
* Fix dependency file quotation on Windows
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* update L2_NORM op support
* update L2_NORM op support
* remove extra whitespace
* cann: update cross_entropy_loss op support
* remove trailing whitespaces
* rebase the latest code in the main repository and remove the l2_norm operator that already exists in another pull request.
* undo the l2_norm operator deletion
* CUDA: add fused rope
* move k forward_expand up
* create helper function instead of re-using params
* make assert statement more in line with comment
* rope_norm: coalesced writes to global mem
* hexagon: explicitly check for ops with zero nrows
llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows.
Somehow other backends seems to handle this without obvious explicit checks.
In the hexagon case we need to check explicitly and skip them.
* hexagon: introduce fastdiv, fix test-backend-ops for ADD/SUB/MUL
Co-authored-by: chraac <chraac@gmail.com>
* hexagon: use fastdiv in ADD_ID
* hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs
---------
Co-authored-by: chraac <chraac@gmail.com>
* extract rotate_pairs logic from ggml_compute_forward_rope_f32
* templateify ggml_compute_forward_rope_f32 and _f16
* abort when rope type not supported, remove GLM from test-rope
* add imrope branch to switch
* add rope tests for perf
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
When compiling llama.cpp in Yocto, it fails QA checks because the generated so files aren't versioned. This applies a version to all generated so files, allowing the package to build without errors.
Register UMT5Model as a supported architecture variant for T5 model conversion.
This allows the conversion to work for models downloaded with AutoModel.
* feat(memory): Only fail partial erasure of recurrent tail
The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.
There is one potential case that this doesn't address which is the pruning
of cache to remove sensitive data from the context. This wouldn't work for
attention cache partial removal (in the middle) either since the KV state
is linearly-dependent and states in later sequence positions would still be
based on the state from the sensitive data, even if that data is no longer
cached, so I don't think this is relevant, but it is worth noting that the
semantics of this change for a partial erasure in the middle of the cache
are essentially "my context is already compressed" and not "all trace of
the removed tokens has been removed."
https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(main): Check the output of seq_rm for prefix matching
This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.
https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(memory): Fix condition for partial erasure failure if p0 > pos
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
* style: Fix extra parens
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear
https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K
* Surround SVE function with compiler directive
* fix compile switch
* fix coding style
* ggml : fix indent
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vulkan : implement upscale with bicubic interpolation
* cuda : implement upscale with bicubic interpolation
* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests
* adapt OpenCL backend to not support the OP in that case so tests don't fail
* print scale mode & flags in test-backend-ops
* convert : parse safetensors directly
* gguf-py : order safetensors tensors by name
Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.
* convert : rename from_safetensors_meta to from_local_tensor
For consistency with from_remote_tensor
* convert : fix no-lazy dtypes from direct safetensors
* vulkan: use all device-local heaps for memory availability reporting
Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
* use all available heaps for iGPU memory reporting
* Allow multiple memory types per buffer request for devices with split heaps
---------
Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
* arg: add --cache-list argument to list cached models
* new manifest naming format
* improve naming
* Update common/arg.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix: correct time_ms calculation in send_partial_response
The time_ms field was incorrectly calculated. The division was happening
before the subtraction leading to incorrect values.
Before: (ggml_time_us() - slot.t_start_process_prompt / 1000) After:
(ggml_time_us() - slot.t_start_process_prompt) / 1000
* docs : document time_ms field in prompt_progress
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the
same time, which needs to hold the lock. To be safe, hold the lock for all of
ggml_vk_load_shaders.
* bench : cache llama_context state at depth
* cont : handle failures to restore the old state
* cont : print information when the state is being reused
* kv-cache : pad the size of the small SWA cache for performance
* context : pad the total context to 256
* cont : future-proof the swa pad
* server : adjust test params to new logic
When using GCC 9 and GCC 12 on the arm64 platform of ubuntu 2004,
the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags,
which results in compilation failures for certain extended instructions,
but the correct CPU flags can be obtained by using gcc -march.
Signed-off-by: lizhenneng <lizhenneng@kylinos.cn>
Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>
* common: move download functions to download.(cpp|h)
* rm unused includes
* minor cleanup
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : rework mat-mat multiplication
* metal : initial Metal4 support
* cont
* metal : detect tensor support
* cont : better ifdefs
* metal : support tensors in mul_mm_id
* metal : add env for disabling tensor API
* tests : restore
* metal : remove unused constants
* metal : fix check for bfloat tensor support
* cont : handle API incompatibilities
* cont : handle even more incompatibilities
* metal : use tensor API only on M5 and later
* support older socs where FASTRPC_GET_URI is unsupported
* added graceful fallback when FASTRPC_GET_URI call fails
* use weak symbols instead of loading libcdsprpc.so dynamically
* Add weak pragma for rpcmem_alloc2
* Remove weak declaration for rpcmem_alloc2 in ggml-hexagon.cpp
Removed weak declaration for rpcmem_alloc2.
* Enforce ndev to 1 for archs below v75
Force ndev to 1 for SoCs architectures lower than v75.
* WIP
* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added shwoing improved memory bandwidth
* added BF16 support
* more strict check to make sure src0 is a transpose
* reformulated to handle more complicated transpose cases
* bring back 2D transpose for higher performance
* allow build on windows
* tranpose copy more shapes
* minor tweak
* final clean up
* restore some test cases
* keep only the kernel for true tranposed case; updated with review suggestions
* make CI happy
* remove headers not needed
* reduced bank conflicts for fp16 and bf16
* add missing const*
* now bank conflicts free
* use padding instead of swizzling
---------
Co-authored-by: bssrdf <bssrdf@gmail.com>
* feat(llama-gguf): Print out the tensor type in llama-gguf r
Branch: Mamba2Perf
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat(off-topic): print the number of elements in tensors with llama-gguf
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: valign
Branch: GGUFToolOutputs
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Update examples/gguf/gguf.cpp
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Remove some comments
* Implement overlap binary operators
* Revert "Implement overlap binary operators"
This reverts commit ed710b36f5.
* Disable support for non-contiguous binary_op tensors and leave note for future support
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
* Fix test-conv2d-dw failure on ARM SVE by using runtime vector length
The ggml_compute_forward_conv_2d_dw_cwhn function was using a hardcoded GGML_F32_EPR (8) for SIMD vectorization, but on ARM SVE the actual vector length varies by hardware. This caused incorrect computation when processing CWHN layout tensors on ARM machines.
Fix by using svcntw() to get the runtime SVE vector length instead of the compile-time constant.
Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>
* ci : reduce sam score threshold
* ci : update bbox checks for sam test
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: ggerganov <1991296+ggerganov@users.noreply.github.com>
* vulkan: remove the need for the dryrun
Allocate pipelines and descriptor sets when requested.
Reallocate the prealloc buffers when needed, and flush any pending work
before reallocating.
For rms_partials and total_mul_mat_bytes, use the sizes computed the last time
the graph was executed.
* remove dryrun parameters
* Fix garbled output with REPACK at high thread counts
Fixed a race condition in the REPACK matrix multiplication code that caused garbled output when using 26+ threads (model-dependent threshold). The issue occurred because with high thread counts, the code forced chunk count to equal thread count, creating many small chunks. After aligning these chunks to NB_COLS boundaries, adjacent chunks could overlap, causing data corruption and race conditions. The fix enforces minimum chunk sizes based on NB_COLS and caps maximum chunk count to prevent creating too many tiny chunks, ensuring proper alignment without overlaps.
* Update ggml/src/ggml-cpu/repack.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/repack.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit modifies the script `run-org-model.py` to ensure that the
model configuration is explicitly passed to the `from_pretrained` method
when loading the model. It also removes a duplicate configuration
loading which was a mistake.
The motivation for this change is that enables the config object to be
modified and then passed to the model loading function, which can be
useful when testing new models.
* SYCL repeat_back v1 — add core op + switch case
* Implement repeat_back SYCL operation and minor fixes
* SYCL: optimize repeat_back kernel
* Remove Hebrew comment from repeat_back.cpp
* Remove comments for code clarity
Removed comments to clean up the code.
* Fix formatting in ggml-sycl.cpp
* Formatted lambda according to legacy style. No logic changes
* Remove blank line in repeat_back.cpp
Remove unnecessary blank line before assigning acc to dst_dd.
* tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage
* tests: init gf and filter out fusion tests for support mode
* tests: filter out fusion cases before calling eval_support
* tests: filter out fusion cases from show_test_coverage as well, fix lint
* clip : use FA
* cont : add warning about unsupported ops
* implement "auto" mode for clip flash attn
* clip : print more detailed op support info during warmup
* cont : remove obsolete comment [no ci]
* improve debugging message
* trailing space
* metal : remove stray return
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : support unified context across slots
* cont : fix speculative decoding initialization
* context : fix n_ctx_per_seq computation
* server : purge slots one by one
* tests : add unified cache server tests
* llama : update per-seq context computation
* test-thread-safety : handle tiny training context of the input model
* server : fix server_tokens clear()
* server : use 4 slots + unified KV by default
* llama : add note about context size queries
* cont : update todos [no ci]
* context : do not cap the size of the context
* tests : adjust parameters to be CI friendlier
* context : add warning
commit 5fb5e24811 (llama : minor
sampling refactor (2) (#9386)) moved the llama_sampler_accept call
into llama_sampler_sample, but the sampling sample usage in llama.h
was forgotten to be updated accordingly.
* webui: auto-refresh /props on inference start to resync model metadata
- Add no-cache headers to /props and /slots
- Throttle slot checks to 30s
- Prevent concurrent fetches with promise guard
- Trigger refresh from chat streaming for legacy and ModelSelector
- Show dynamic serverWarning when using cached data
* fix: restore proper legacy behavior in webui by using unified /props refresh
Updated assistant message bubbles to show each message's stored model when available,
falling back to the current server model only when the per-message value is missing
When the model selector is disabled, now fetches /props and prioritizes that model name
over chunk metadata, then persists it with the streamed message so legacy mode properly
reflects the backend configuration
* fix: detect first valid SSE chunk and refresh server props once
* fix: removed the slots availability throttle constant and state
* webui: purge ai-generated cruft
* chore: update webui static build
* webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog
Extended MarkdownContent to flag previewable code languages,
add a preview button alongside copy controls, manage preview
dialog state, and share styling for the new button group
Introduced CodePreviewDialog.svelte, a sandboxed iframe modal
for rendering HTML/JS previews with consistent dialog controls
* webui: fullscreen HTML preview dialog using bits-ui
* Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: pedantic style tweak for CodePreviewDialog close button
* webui: remove overengineered preview language logic
* chore: update webui static build
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: recognize AsciiDoc files as valid text files
* webui: add an updated static webui build
* webui: add the updated dependency list
* webui: re-add an updated static webui build
This also reverts commit 742dbb8379.
* vulkan: fuse mul_mat+add and mul_mat_id+add_id
The fusion is only applied for the mat-vec mul paths.
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix 32b build
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: Remove unneded bias/gate dims in fused mmvq
Pointed out
[here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989)
that only a single value is needed per target col per thread
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Fix "Error 991-D: extra braces are nonstandard" during compilation
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* CUDA: Volta tensor core support for MMF
* more generic checks for hardware support
* Update ggml/src/ggml-cuda/mmf.cuh
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Experimenting crash fix
* added assert for aborting and fixed comment
* changed to check if a pipeline is empty or not
* Moved function in class definition
* replaced with is_empty
* Modified is_empty to check only unaligned pipelines
* respect input size when getting/setting tensor data
allows partial repacking/copying when get tensor size is smaller than the actual tensor
* Removed duplicate repack_mxfp4_mxfp4x4x2 function
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
This is realised by loading them into registers before computation of
the dot-product, effectively batching them together with said
dot-product. As a lot of threads are alive here, the warp scheduler has
enough threads available to effectively hide the cost of additionally
loading those two floats.
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).
Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).
Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.
Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.
Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.
Add new backend tests.
* vulkan: Update topk_moe fusion to handle gpt's late softmax
Based on #16649.
* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in #16655
* Update ggml/src/ggml-impl.h
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* hexagon: remove dspqueue callbacks and do all read processing inplace
* hexagon: there is no need to ref/deref the buffers at this point
We're not going to release the buffers without flushing the session queue.
So there is no need to inc/dec the refcounts for every request.
We also don't need to include those bufs in the response.
* hexagon: bump the thread count in the adb wrapper scripts
We can use more CPU cores now that the dedicated dspqueue polling threads are not used (ie no contention).
Also enable more agressive polling for now since we still map Flash Attention (and a few other kernels) to
the CPU and those dspqueue threads were keeping the CPU cores are higher clock freqs.
* hexagon: add lhez as the second code owner
* CUDA: Fix bug in topk-moe for gpt-oss
When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse ndoes (because the nodes are not used elsewhere in the graph),
but it actually doesn't fuse in the actual gpt-oss
* fix for qwen3 too
* change ifndef to ifdef
* Add --embd-output-format raw for plain numeric embedding output
This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.
* Move raw output handling into format handling section
* Move raw output handling into else-if block with other format handlers
* Use LOG instead of printf for raw embedding output
* docs: document 'raw' embedding output format in arg.cpp and README
* cann: improve device ID handling and aclnnArange checks
- Stop relying on CANN's internal device ID retrieval; use a global variable instead.
- Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions.
* cann: use thread local var
* feat: Add SYCL backend support for SSM_CONV operator
* Implement State Space Model Convolution 1D for SYCL backend
* Add optimized GPU kernel with parallel work distribution
* Support various tensor dimensions and batch sizes
* Full integration with existing SYCL infrastructure
* All tests pass with CPU backend equivalence verification
* feat: Implement SYCL backend support for SSM_CONV operation
- Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp
- Implement SYCL kernel for state space model convolution
- Ensure numerical correctness matches CPU implementation exactly
- Add proper type checking for F32 tensors in backend support
- All test-backend-ops SSM_CONV tests pass (14490/14490)
* Perfect SSM_CONV SYCL implementation - 100% CPU parity
✅ Flawless numerical accuracy - matches CPU bit-for-bit
✅ Optimal SYCL kernel design - efficient parallel execution
✅ Complete tensor layout compatibility - handles all strides correctly
✅ Robust error handling - comprehensive assertions and validation
✅ All official tests pass - 14,490/14,490 backend operations verified
✅ Production-ready code - clean, documented, maintainable
Implements state-space model 1D convolution with sliding window algorithm.
Eliminates blocking queue.wait() for better async performance.
* Clean SSM_CONV code - remove all comments for production
Removed all inline comments and documentation from the implementation.
Clean, minimal code ready for production merge.
* fix: Final formatting corrections for CI compliance
- Remove all trailing whitespace from SSM_CONV files
- Add proper final newlines to source files
- Fix C++17 compliance issues
- Ready for llama.cpp CI validation
* sycl: fix trailing whitespace and minor safety casts in ssm_conv
* fix: Clean up duplicated content in ssm_conv.hpp header file
---------
Co-authored-by: tamarPal <tamarPal@example.com>
* ggml : fix interpolate with align-corners and ne=1
* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken
* fix clang warning
* sycl: add ROLL operation support
- Implement ggml_sycl_roll function for F32 tensors
- Add multi-axis roll operation with SYCL kernel
- Support all 4 tensor dimensions with proper shift normalization
- Add roll.cpp and roll.hpp to SYCL backend
- Update backend dispatch and supports_op for GGML_OP_ROLL
- Tests: 17662/17662 pass with identical CPU reference results
* fix: remove trailing whitespace from roll.cpp
- Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp
- Remove trailing spaces from lines 6, 11, 28, 47, 58, 60
* ci: retrigger
* sycl: remove wait() calls from ROLL operation
* fix: editorconfig — LF endings + final newline for roll.hpp
---------
Co-authored-by: tamarPal <tamarPal@example.com>
* fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver
* style: indent
* fix: decrease priority
* fix: switch to `||`
ggml_vk_create_buffer_temp is not used anywhere, and it is the only
caller for ggml_vk_pool_malloc.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* webui: support q URL parameter
Fixes#16722
I’ve checked that it works with Firefox’s AI tools
* webui: apply suggestions from code review
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: update webui static build
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
This commit add the trust_remote_code=True argument when loading models
using AutoConfig, AutoTokenizer, and AutoModelForCausalLM for the run
original model script.
The motivation for this is that some models require custom code to be
loaded properly, and setting trust_remote_code=True avoids a prompt
asking for user confirmation:
```console
(venv) $ make causal-run-original-model
The repository /path/to/model contains custom code which must be
executed to correctly load the model. You can inspect the repository
content at /path/to/model.
Do you wish to run the custom code? [y/N] N
```
Having this as the default seems like a safe choice as we have to clone
or download the models we convert and would be expecting to run any
custom code they have.
* sycl: use async memory allocation to fix graph recording failures
GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because:
- Host waits are currently unsupported in graph recording mode.
- SYCL malloc / free calls are unsupported in graph recording mode.
The following changes are made to fix SYCL graph functionality:
- When graphs are enabled, use the SYCL async memory extension for temp
buffers which is supported with SYCL graphs.
- For compiler versions that do not support this extension, skip
graphs with the affected op.
- Switch from USM shared to device memory as the async extension
currently just supports device allocations.
* Address reviewer feedback
* Use global async variable to decide path in sycl_ext_[malloc_device|free]
* model: add support for extra bufs for all devices
* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU
This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.
Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX
**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from llama.cpp/ggml developer and user community.
Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com>
* hexagon: fix format checker errors
* hexagon: update readme and cmake presets
* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions
* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input
* hexagon: move ADB helper scripts into scripts/snapdragon/adb
* hexagon: replace all f/printfs with GGML_LOG_...
* readme: add hexagon to the list supported backends
* hexagon: stack malmuts with quantized inputs only
* hexagon: add TODO for fixing issues in hexagon_graph_optimize
* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC
* scripts: fix lint errors
* scripts: update qdc pytest script to make linter happy
* hexagon: add reduce sum in fp32
* hexagon: reduce number of vector stores in matmul output
* hexagon: remove the need for vdelta in reduce-multiply-x8
* hexagon: consistent use of reduce_sum_fp32 for row_sums
* hexagon: some more matmul optimizations and comments
Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models).
We've handled those cases already but at a higher overhead.
* hexagon: update cmake presets
* hexagon: add OPMASK support for run-bench.sh wrapper
* hexagon: update to use GGML_BACKEND_API
* hexagon: remove unused logic for setting tensor flags for the views
* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors
Same asserts as the CPU backend.
* hexagon: use cpy_tensor slow path for non-host buffers
* hexagon: error checks in the buffer allocator
* cmake: move include(extProj) under ggml-hexagon
* hexagon: don't forget to delete the backend on free
* hexagon: set/get_tensor size assert apply only to quantized tensors
* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now
GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way.
Ideally we need a bit more finer log levels.
* docs: typos in hexagon developer docs (libggm-...)
* hexagon: overhaul error handling in the session/device allocation
this should handle all failure paths in the session allocation.
* hexagon: update cmake presets to enable fp16 vectors
* hexagon: remove unused time_usec function
* hexagon: don't forget to release buffer contexts
* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)
* hexagon: remove custom can_repeat function and use ggml_can_repeat
---------
Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
* webui: introduce OpenAI-compatible model selector in JSON payload
* webui: restore OpenAI-Compatible model source of truth and unify metadata capture
This change re-establishes a single, reliable source of truth for the active model:
fully aligned with the OpenAI-Compat API behavior
It introduces a unified metadata flow that captures the model field from both
streaming and non-streaming responses, wiring a new onModel callback through ChatService
The model name is now resolved directly from the API payload rather than relying on
server /props or UI assumptions
ChatStore records and persists the resolved model for each assistant message during
streaming, ensuring consistency across the UI and database
Type definitions for API and settings were also extended to include model metadata
and the onModel callback, completing the alignment with OpenAI-Compat semantics
* webui: address review feedback from allozaur
* webui: move model selector into ChatForm (idea by @allozaur)
* webui: make model selector more subtle and integrated into ChatForm
* webui: replaced the Flowbite selector with a native Svelte dropdown
* webui: add developer setting to toggle the chat model selector
* webui: address review feedback from allozaur
Normalized streamed model names during chat updates
by trimming input and removing directory components before saving
or persisting them, so the conversation UI shows only the filename
Forced model names within the chat form selector dropdown to render as
a single-line, truncated entry with a tooltip revealing the full name
* webui: toggle displayed model source for legacy vs OpenAI-Compat modes
When the selector is disabled, it falls back to the active server model name from /props
When the model selector is enabled, the displayed model comes from the message metadata
(the one explicitly selected and sent in the request)
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormActions.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/constants/localstorage-keys.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/services/chat.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/services/chat.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: refactor model selector and persistence helpers
- Replace inline portal and event listeners with proper Svelte bindings
- Introduce 'persisted' store helper for localStorage sync without runes
- Extract 'normalizeModelName' utils + Vitest coverage
- Simplify ChatFormModelSelector structure and cleanup logic
Replaced the persisted store helper's use of '$state/$effect' runes with
a plain TS implementation to prevent orphaned effect runtime errors
outside component context
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: document normalizeModelName usage with inline examples
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/stores/models.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/stores/models.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: extract ModelOption type into dedicated models.d.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: refine ChatMessageAssistant displayedModel source logic
* webui: stabilize dropdown, simplify model extraction, and init assistant model field
* chore: update webui static build
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: npm format, update webui static build
* webui: align sidebar trigger position, remove z-index glitch
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Leverage the existing GGML_F32_VEC helpers to broadcast the fill value across SIMD registers and store in vector-sized chunks, while retaining the scalar tail for leftover elements and non-SIMD builds.
* Vectorize additional f32 helper loops
* Normalize f32 helper tails for ggml vec ops
---------
Co-authored-by: Aaron <shelhamer.aaron@gmail.com>
CI (AMD) / ggml-ci-x64-amd-vulkan (push) Waiting to run
CI (AMD) / ggml-ci-x64-amd-rocm (push) Waiting to run
CI / macOS-latest-cmake-arm64 (push) Waiting to run
CI / macOS-latest-cmake-x64 (push) Waiting to run
CI / macOS-latest-cmake-arm64-webgpu (push) Waiting to run
CI / ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm) (push) Waiting to run
CI / ubuntu-cpu-cmake (ppc64le, ubuntu-24.04-ppc64le) (push) Waiting to run
CI / ubuntu-cpu-cmake (s390x, ubuntu-24.04-s390x) (push) Waiting to run
CI / ubuntu-cpu-cmake (x64, ubuntu-22.04) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, ADDRESS) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, THREAD) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, UNDEFINED) (push) Waiting to run
CI / ubuntu-latest-llguidance (push) Waiting to run
CI / ubuntu-latest-cmake-rpc (push) Waiting to run
CI / ubuntu-24-cmake-vulkan-deb (push) Waiting to run
CI / ubuntu-24-cmake-vulkan (push) Waiting to run
CI / ubuntu-24-cmake-webgpu (push) Waiting to run
CI / ubuntu-22-cmake-hip (push) Waiting to run
CI / ubuntu-22-cmake-musa (push) Waiting to run
CI / ubuntu-22-cmake-sycl (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (push) Waiting to run
CI / build-linux-cross (push) Waiting to run
CI / build-cmake-pkg (push) Waiting to run
CI / macOS-latest-cmake-ios (push) Waiting to run
CI / macOS-latest-cmake-tvos (push) Waiting to run
CI / macOS-latest-cmake-visionos (push) Waiting to run
CI / macOS-latest-swift (generic/platform=iOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=macOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=tvOS) (push) Blocked by required conditions
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Waiting to run
CI / windows-latest-cmake (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Waiting to run
CI / windows-latest-cmake (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=… (push) Waiting to run
CI / windows-latest-cmake (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Waiting to run
CI / ubuntu-latest-cmake-cuda (push) Waiting to run
CI / windows-2022-cmake-cuda (12.4) (push) Waiting to run
CI / windows-latest-cmake-sycl (push) Waiting to run
CI / windows-latest-cmake-hip (push) Waiting to run
CI / ios-xcode-build (push) Waiting to run
CI / android-build (push) Waiting to run
CI / openEuler-latest-cmake-cann (aarch64, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / openEuler-latest-cmake-cann (x86, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / ggml-ci-x64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-x64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Waiting to run
CI / ggml-ci-x64-nvidia-cuda (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm2 (push) Waiting to run
CI / ggml-ci-x64-cpu-amx (push) Waiting to run
CI / ggml-ci-mac-metal (push) Waiting to run
CI / ggml-ci-mac-vulkan (push) Waiting to run
CI / ggml-ci-arm64-cpu-kleidiai (push) Waiting to run
* ggml: add ggml_can_fuse_subgraph
* ggml-cuda: use ggml_can_fuse_subgraph for topk-moe
* format
* 1. remove inputs from signature as they are transient nodes
2. add check for views: view_src should be part of the subgraph
* - combine check into one loop
- check all view_src parents
- other minor review comments
* remove redudant if test
* - rename and other minor review comments
* add assert about count < 32
* add BailingMoeV2 support
* update llm types
* undo
* undo
* update llm types
* add model collection link
* update
* almost working
* correct group selection and rename n_group_exp
* avoid large top_k and use argmax instead for now
if we had something like argmax2 that would be equivalent, but this works fine until then
* poke
* skip group selection when there are no tokens
* fix 1T conversion
* hopefully fixed expert group selection
third time's the charm?
* make expert group selection generally available
The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture.
* allow n_expert_groups to be 1 (Kimi K2)
* address review suggestions
CI (AMD) / ggml-ci-x64-amd-vulkan (push) Waiting to run
CI (AMD) / ggml-ci-x64-amd-rocm (push) Waiting to run
CI / macOS-latest-cmake-arm64 (push) Waiting to run
CI / macOS-latest-cmake-x64 (push) Waiting to run
CI / macOS-latest-cmake-arm64-webgpu (push) Waiting to run
CI / ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm) (push) Waiting to run
CI / ubuntu-cpu-cmake (ppc64le, ubuntu-24.04-ppc64le) (push) Waiting to run
CI / ubuntu-cpu-cmake (s390x, ubuntu-24.04-s390x) (push) Waiting to run
CI / ubuntu-cpu-cmake (x64, ubuntu-22.04) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, ADDRESS) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, THREAD) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, UNDEFINED) (push) Waiting to run
CI / ubuntu-latest-llguidance (push) Waiting to run
CI / ubuntu-latest-cmake-rpc (push) Waiting to run
CI / ubuntu-24-cmake-vulkan-deb (push) Waiting to run
CI / ubuntu-24-cmake-vulkan (push) Waiting to run
CI / ubuntu-24-cmake-webgpu (push) Waiting to run
CI / ubuntu-22-cmake-hip (push) Waiting to run
CI / ubuntu-22-cmake-musa (push) Waiting to run
CI / ubuntu-22-cmake-sycl (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (push) Waiting to run
CI / build-linux-cross (push) Waiting to run
CI / build-cmake-pkg (push) Waiting to run
CI / macOS-latest-cmake-ios (push) Waiting to run
CI / macOS-latest-cmake-tvos (push) Waiting to run
CI / macOS-latest-cmake-visionos (push) Waiting to run
CI / macOS-latest-swift (generic/platform=iOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=macOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=tvOS) (push) Blocked by required conditions
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Waiting to run
CI / windows-latest-cmake (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Waiting to run
CI / windows-latest-cmake (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=… (push) Waiting to run
CI / windows-latest-cmake (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Waiting to run
CI / ubuntu-latest-cmake-cuda (push) Waiting to run
CI / windows-2022-cmake-cuda (12.4) (push) Waiting to run
CI / windows-latest-cmake-sycl (push) Waiting to run
CI / windows-latest-cmake-hip (push) Waiting to run
CI / ios-xcode-build (push) Waiting to run
CI / android-build (push) Waiting to run
CI / openEuler-latest-cmake-cann (aarch64, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / openEuler-latest-cmake-cann (x86, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / ggml-ci-x64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-x64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Waiting to run
CI / ggml-ci-x64-nvidia-cuda (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm2 (push) Waiting to run
CI / ggml-ci-x64-cpu-amx (push) Waiting to run
CI / ggml-ci-mac-metal (push) Waiting to run
CI / ggml-ci-mac-vulkan (push) Waiting to run
CI / ggml-ci-arm64-cpu-kleidiai (push) Waiting to run
Update Operations Documentation / update-ops-docs (push) Waiting to run
* feat: Per-conversation loading states and tracking streaming stats
* chore: update webui build output
* refactor: Chat state management
Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states.
This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed.
Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution.
* feat: Adds loading indicator to conversation items
* chore: update webui build output
* fix: Fix aborting chat streaming
Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent.
This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks are now checked for abort signals to prevent further processing after abortion.
* refactor: Remove redundant comments
* chore: build webui static output
* refactor: Cleanup
* chore: update webui build output
* chore: update webui build output
* fix: Conversation loading indicator for regenerating messages
* chore: update webui static build
* feat: Improve configuration
* feat: Install `http-server` as dev dependency to not need to rely on `npx` in CI
## Why it failed
When compiling with strict compiler flags (-Wmissing-braces -Werror=missing-braces),
the build fails with the following error:
```
cmake \
-S . \
-B ../llama.cpp.build \
--preset=x64-linux-gcc-debug \
-DCMAKE_INSTALL_PREFIX=/tmp/local \
-DCMAKE_CXX_FLAGS="-Wmissing-braces -Werror=missing-braces" && \
cmake --build ../llama.cpp.build/
...
In file included from /home/otegami/work/cpp/llama.cpp/src/llama-graph.h:4,
from /home/otegami/work/cpp/llama.cpp/src/llama-model.h:5,
from /home/otegami/work/cpp/llama.cpp/src/llama.cpp:8:
/home/otegami/work/cpp/llama.cpp/src/llama-batch.h:126:48: error: missing braces around initializer for 'std::__array_traits<int, 1>::_Type' {aka 'int [1]'} [-Werror=missing-braces]
126 | std::array<llama_seq_id, 1> seq_id_0 = { 0 }; // default sequence id
| ^
cc1plus: some warnings being treated as errors
```
The issue is that std::array initialization requires double braces.
## How to fix
This PR changes `{ 0 }` to `{{ 0 }}` for std::array initialization.
This is part of a series of commits to fix missing braces warnings across the codebase.
- src/llama-batch.h <- This PR is here.
- src/llama-context.cpp
- tests/test-backend-ops.cpp
- tests/test-gguf.cpp
- tools/mtmd/clip.cpp
Benefits:
- std::array is a struct containing a C-style array, requiring nested braces
- Enables stricter compiler warnings to catch potential issues
* SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators
Clean up unrelated changes from previous commit
* Chore: remove empty lines and fix indentation
* Clean up: remove leftover blank lines and fix spacing
* chore: fix trailing whitespace and ensure final newline
* Cleanup: remove redundant declarations already defined in header
* Sync docs/ops.md with updated backend operation support
* docs: update ops.md after rebase
* docs: update ops.md - Vulkan supports SSM_CONV and SSM_SCAN
The unexpeced pooling_type warning was incorrectly shown when users did not
specify the --pooling-type parameter. In this case, the parameter
defaults to `LLAMA_POOLING_TYPE_UNSPECIFIED (-1)`, and the code
automatically applies the model's default pooling type.
Example of spurious warning:
```
$ llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "hello"
...
llama_init_from_model: model default pooling_type is [2], but [-1] was specified
...
```
This fix ensures the warning only appears when users explicitly specify
a pooling type that differs from the model's default (e.g., using
--pooling-type mean on a model that expects CLS pooling).
* rpc : report actual free memory
Start reporting the free memory on every device instead of using
fixed values. Now llama-cli users can get a nice memory breakdown
when using RPC devices.
* drop --mem in rpc-server
* vulkan: implement SSM scan operation
Add State Space Model scan operation to the Vulkan backend.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* vulkan: implement SSM conv operation
Add State Space Model conv operation to the Vulkan backend.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Fix incorrect task-to-batch index calculation in the quantization phase.
The bug caused out-of-bounds access to qnbitgemm_args array when
compute_idx exceeded per_gemm_block_count_m, leading to invalid
pointer dereferences and SIGBUS errors.
Correctly map tasks to batches by dividing compute_idx by
per_gemm_block_count_m instead of block_size_m.
Example:
batch_feature=1, gemm_m=30, block_size_m=4
per_gemm_block_count_m = 8, task_count = 8
Old: gemm_idx = 4/4 = 1 (out of bounds New: gemm_idx = 4/8 = 0 (correct)
Tested on SpaceMit K1 RISC-V64 with qwen2.5:0.5b model.
Co-authored-by: muggle <mingjun.rong@spacemit.com>
* fix: added a normalization step for MathJax-style \[\] and \(\) delimiters
So inline and block equations are converted before KaTeX rendering,
enabling proper display of model-generated LaTeX in the WebUI
* chore: update webui build output
This commit applies .clang-format rules to all source files under the
ggml-cann directory to ensure consistent coding style and readability.
The .clang-format option `SortIncludes: false` has been set to disable
automatic reordering of include directives.
No functional changes are introduced.
Co-authored-by: hipudding <huafengchun@gmail.com>
* Update the docs on -t --threads
* Revert "Update the docs on -t --threads"
This reverts commit eba97345e2.
* docs: clarify -t/--threads parameter uses CPU threads and defaults to all available cores
* Update arg.cpp
## Why it failed
When compiling with strict compiler flags (-Wwrite-strings -Werror=discarded-qualifiers),
the build fails with the following error:
```
cmake \
-S . \
-B ../llama.cpp.build \
--preset=x64-linux-gcc-debug \
-DCMAKE_INSTALL_PREFIX=/tmp/local \
-DCMAKE_C_FLAGS="-Wwrite-strings -Werror=discarded-qualifiers" && \
cmake --build ../llama.cpp.build/
...
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c: In function ‘ggml_cpu_init’:
/home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3572:24: error: passing argument 1 of ‘putenv’ discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers]
3572 | putenv("KMP_BLOCKTIME=200"); // 200ms
| ^~~~~~~~~~~~~~~~~~~
In file included from /home/otegami/work/cpp/llama.cpp/ggml/src/./ggml-impl.h:10,
from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu-impl.h:6,
from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/traits.h:3,
from /home/otegami/work/cpp/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:6:
/usr/include/stdlib.h:786:26: note: expected ‘char *’ but argument is of type ‘const char *’
786 | extern int putenv (char *__string) __THROW __nonnull ((1));
| ~~~~~~^~~~~~~~
cc1: some warnings being treated as errors
ninja: build stopped: subcommand failed.
```
The issue is that putenv() expects a non-const char * but receives a string literal (const char *).
## How to fix
This PR replaces putenv("KMP_BLOCKTIME=200") with setenv("KMP_BLOCKTIME", "200", 0).
Benefits of setenv():
- Accepts const char * parameters (no qualifier warnings)
- Makes copies of the strings (safer memory handling)
- The third parameter (0) ensures we don't overwrite if already set
BF16 requires special handling in this script
while it's a 2-bytes data, but view is 1-byte by default.
Switch to correct view before attempting byteswapping.
With this change correctly byteswapping models like
Meta-Llama-3-8B-Instruct-bf16-GGUF
should be possible.
* CPU: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators
- Added the operators to unary op enum
- Implemented API functions
- Implemented forward and unary-op logic in CPU backend
- Updated ggml_get_n_tasks
- Updated operators names array and static_assert
- Updated docs and enabled automatic tests
* docs: add documentation for ggml_trunc and ggml_trunc_inplace in ggml.h
* chore: remove trailing whitespace from ggml.h
* Remove unresolved merge markers
* Apply review suggestions: cleanup formatting, enum order and leftover artifacts
* Regenerate ops.md using create_ops_docs.py
* opencl: add mm_q8_0_f32
* opencl: fix data loading for incomplete tile
* opencl: use q8_0 mm for larger matrix
* opencl: add some tests to cover the path
* optimise GGML_OP_SUM
* add non-contiguous tests by permuting the input
* change tests to require full contiguity of OP_SUM
* cuda : add check GGML_OP_SUM
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama-quant: add support for mmproj
* Update src/llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* check prefix instead
* small fix
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* CUDA set scheduling strategy to spinning for cc121
* Using prop.major and prop.minor, include HIP and MUSA
* Exclude HIP and MUSA
* Remove trailing whitespace
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Remove empty line
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ggml : fix build broken with -march=armv9-a on MacOS
Signed-off-by: Jie Fu <jiefu@tencent.com>
* Add #pragma message
Signed-off-by: Jie Fu <jiefu@tencent.com>
* Address review comment.
Signed-off-by: Jie Fu <jiefu@tencent.com>
* Update ggml/src/ggml-cpu/ggml-cpu.c
---------
Signed-off-by: Jie Fu <jiefu@tencent.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
CI (AMD) / ggml-ci-x64-amd-vulkan (push) Waiting to run
CI (AMD) / ggml-ci-x64-amd-rocm (push) Waiting to run
CI / macOS-latest-cmake-arm64 (push) Waiting to run
CI / macOS-latest-cmake-x64 (push) Waiting to run
CI / macOS-latest-cmake-arm64-webgpu (push) Waiting to run
CI / ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm) (push) Waiting to run
CI / ubuntu-cpu-cmake (ppc64le, ubuntu-24.04-ppc64le) (push) Waiting to run
CI / ubuntu-cpu-cmake (s390x, ubuntu-24.04-s390x) (push) Waiting to run
CI / ubuntu-cpu-cmake (x64, ubuntu-22.04) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, ADDRESS) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, THREAD) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, UNDEFINED) (push) Waiting to run
CI / ubuntu-latest-llguidance (push) Waiting to run
CI / ubuntu-latest-cmake-rpc (push) Waiting to run
CI / ubuntu-24-cmake-vulkan-deb (push) Waiting to run
CI / ubuntu-24-cmake-vulkan (push) Waiting to run
CI / ubuntu-24-cmake-webgpu (push) Waiting to run
CI / ubuntu-22-cmake-hip (push) Waiting to run
CI / ubuntu-22-cmake-musa (push) Waiting to run
CI / ubuntu-22-cmake-sycl (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (push) Waiting to run
CI / build-linux-cross (push) Waiting to run
CI / build-cmake-pkg (push) Waiting to run
CI / macOS-latest-cmake-ios (push) Waiting to run
CI / macOS-latest-cmake-tvos (push) Waiting to run
CI / macOS-latest-cmake-visionos (push) Waiting to run
CI / macOS-latest-swift (generic/platform=iOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=macOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=tvOS) (push) Blocked by required conditions
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Waiting to run
CI / windows-latest-cmake (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Waiting to run
CI / windows-latest-cmake (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=… (push) Waiting to run
CI / windows-latest-cmake (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Waiting to run
CI / ubuntu-latest-cmake-cuda (push) Waiting to run
CI / windows-2022-cmake-cuda (12.4) (push) Waiting to run
CI / windows-latest-cmake-sycl (push) Waiting to run
CI / windows-latest-cmake-hip (push) Waiting to run
CI / ios-xcode-build (push) Waiting to run
CI / android-build (push) Waiting to run
CI / openEuler-latest-cmake-cann (aarch64, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / openEuler-latest-cmake-cann (x86, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / ggml-ci-x64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-x64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Waiting to run
CI / ggml-ci-x64-nvidia-cuda (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm2 (push) Waiting to run
CI / ggml-ci-x64-cpu-amx (push) Waiting to run
CI / ggml-ci-mac-metal (push) Waiting to run
CI / ggml-ci-mac-vulkan (push) Waiting to run
CI / ggml-ci-arm64-cpu-kleidiai (push) Waiting to run
This commit fixes a CPU-side memory leak issue in the CANN backend,
which occurred when intermediate aclTensorList objects were not properly
released after operator execution. The leak happened during repeated
invocations of CANN ops (e.g., FlashAttention), leading to increasing
host memory usage over time.
Proper resource cleanup (aclDestroyTensorList and related release logic)
has been added to ensure that all temporary tensors are correctly freed.
* fix: add remark plugin to render raw HTML as literal text
Implemented a missing MDAST stage to neutralize raw HTML like major LLM WebUIs
do ensuring consistent and safe Markdown rendering
Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the
Markdown AST into plain-text equivalents while preserving indentation and
line breaks. This ensures consistent rendering and prevents unintended HTML
execution, without altering valid Markdown structure
Kept 'remarkRehype' in the pipeline since it performs the required conversion
from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization
Refined the link-enhancement logic to skip unnecessary DOM rewrites,
fixing a subtle bug where extra paragraphs were injected after the first
line due to full innerHTML reconstruction, and ensuring links open in new
tabs only when required
Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml
-> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify
* fix: address review feedback from allozaur
* chore: update webui build output
CI (AMD) / ggml-ci-x64-amd-vulkan (push) Waiting to run
CI (AMD) / ggml-ci-x64-amd-rocm (push) Waiting to run
CI / macOS-latest-cmake-arm64 (push) Waiting to run
CI / macOS-latest-cmake-x64 (push) Waiting to run
CI / macOS-latest-cmake-arm64-webgpu (push) Waiting to run
CI / ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm) (push) Waiting to run
CI / ubuntu-cpu-cmake (ppc64le, ubuntu-24.04-ppc64le) (push) Waiting to run
CI / ubuntu-cpu-cmake (s390x, ubuntu-24.04-s390x) (push) Waiting to run
CI / ubuntu-cpu-cmake (x64, ubuntu-22.04) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, ADDRESS) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, THREAD) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, UNDEFINED) (push) Waiting to run
CI / ubuntu-latest-llguidance (push) Waiting to run
CI / ubuntu-latest-cmake-rpc (push) Waiting to run
CI / ubuntu-24-cmake-vulkan-deb (push) Waiting to run
CI / ubuntu-24-cmake-vulkan (push) Waiting to run
CI / ubuntu-24-cmake-webgpu (push) Waiting to run
CI / ubuntu-22-cmake-hip (push) Waiting to run
CI / ubuntu-22-cmake-musa (push) Waiting to run
CI / ubuntu-22-cmake-sycl (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (push) Waiting to run
CI / build-linux-cross (push) Waiting to run
CI / build-cmake-pkg (push) Waiting to run
CI / macOS-latest-cmake-ios (push) Waiting to run
CI / macOS-latest-cmake-tvos (push) Waiting to run
CI / macOS-latest-cmake-visionos (push) Waiting to run
CI / macOS-latest-swift (generic/platform=iOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=macOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=tvOS) (push) Blocked by required conditions
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Waiting to run
CI / windows-latest-cmake (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Waiting to run
CI / windows-latest-cmake (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=… (push) Waiting to run
CI / windows-latest-cmake (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Waiting to run
CI / ubuntu-latest-cmake-cuda (push) Waiting to run
CI / windows-2022-cmake-cuda (12.4) (push) Waiting to run
CI / windows-latest-cmake-sycl (push) Waiting to run
CI / windows-latest-cmake-hip (push) Waiting to run
CI / ios-xcode-build (push) Waiting to run
CI / android-build (push) Waiting to run
CI / openEuler-latest-cmake-cann (aarch64, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / openEuler-latest-cmake-cann (x86, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / ggml-ci-x64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-x64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Waiting to run
CI / ggml-ci-x64-nvidia-cuda (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm2 (push) Waiting to run
CI / ggml-ci-x64-cpu-amx (push) Waiting to run
CI / ggml-ci-mac-metal (push) Waiting to run
CI / ggml-ci-mac-vulkan (push) Waiting to run
CI / ggml-ci-arm64-cpu-kleidiai (push) Waiting to run
Many Ascend operators internally use FP16 precision for computation.
If input data is in FP32, it must first be cast to FP16 before
computation, and then cast back to FP32 after computation, which
introduces unnecessary cast operations. Moreover, FP16 computation
requires significantly less workload compared to FP32, leading to
noticeable efficiency improvements.
In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5b model shows
correct accuracy and about 10% performance gain in concurrent scenarios.
Co-authored-by: noemotiovon <757486878@qq.com>
* scaffold to support opt step adamw on metal (not written so far)
* add opt-step-adamw kernel for metal
* pass op->src[4] as a separate buffer to the pipeline
* add bounds check to opt-step-adamw kernel
* complete scaffold for GGML_OP_SUM
* naive GGML_OP_SUM kernel
* remove unwanted comment
* change OP_SUM capability gate
* Add has_simdgroup_reduction to both ops to pass CI
CI (AMD) / ggml-ci-x64-amd-vulkan (push) Waiting to run
CI (AMD) / ggml-ci-x64-amd-rocm (push) Waiting to run
CI / macOS-latest-cmake-arm64 (push) Waiting to run
CI / macOS-latest-cmake-x64 (push) Waiting to run
CI / macOS-latest-cmake-arm64-webgpu (push) Waiting to run
CI / ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm) (push) Waiting to run
CI / ubuntu-cpu-cmake (ppc64le, ubuntu-24.04-ppc64le) (push) Waiting to run
CI / ubuntu-cpu-cmake (s390x, ubuntu-24.04-s390x) (push) Waiting to run
CI / ubuntu-cpu-cmake (x64, ubuntu-22.04) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, ADDRESS) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, THREAD) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, UNDEFINED) (push) Waiting to run
CI / ubuntu-latest-llguidance (push) Waiting to run
CI / ubuntu-latest-cmake-rpc (push) Waiting to run
CI / ubuntu-24-cmake-vulkan-deb (push) Waiting to run
CI / ubuntu-24-cmake-vulkan (push) Waiting to run
CI / ubuntu-24-cmake-webgpu (push) Waiting to run
CI / ubuntu-22-cmake-hip (push) Waiting to run
CI / ubuntu-22-cmake-musa (push) Waiting to run
CI / ubuntu-22-cmake-sycl (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (push) Waiting to run
CI / build-linux-cross (push) Waiting to run
CI / build-cmake-pkg (push) Waiting to run
CI / macOS-latest-cmake-ios (push) Waiting to run
CI / macOS-latest-cmake-tvos (push) Waiting to run
CI / macOS-latest-cmake-visionos (push) Waiting to run
CI / macOS-latest-swift (generic/platform=iOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=macOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=tvOS) (push) Blocked by required conditions
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Waiting to run
CI / windows-latest-cmake (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Waiting to run
CI / windows-latest-cmake (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=… (push) Waiting to run
CI / windows-latest-cmake (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Waiting to run
CI / ubuntu-latest-cmake-cuda (push) Waiting to run
CI / windows-2022-cmake-cuda (12.4) (push) Waiting to run
CI / windows-latest-cmake-sycl (push) Waiting to run
CI / windows-latest-cmake-hip (push) Waiting to run
CI / ios-xcode-build (push) Waiting to run
CI / android-build (push) Waiting to run
CI / openEuler-latest-cmake-cann (aarch64, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / openEuler-latest-cmake-cann (x86, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / ggml-ci-x64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-x64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Waiting to run
CI / ggml-ci-x64-nvidia-cuda (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm2 (push) Waiting to run
CI / ggml-ci-x64-cpu-amx (push) Waiting to run
CI / ggml-ci-mac-metal (push) Waiting to run
CI / ggml-ci-mac-vulkan (push) Waiting to run
CI / ggml-ci-arm64-cpu-kleidiai (push) Waiting to run
Update Operations Documentation / update-ops-docs (push) Has been cancelled
* fix: make SSE client robust to premature [DONE] in agentic proxy chains
* webui: remove client-side context pre-check and rely on backend for limits
Removed the client-side context window pre-check and now simply sends messages
while keeping the dialog imports limited to core components, eliminating the
maximum context alert path
Simplified streaming and non-streaming chat error handling to surface a generic
'No response received from server' error whenever the backend returns no content
Removed the obsolete maxContextError plumbing from the chat store so state
management now focuses on the core message flow without special context-limit cases
* webui: cosmetic rename of error messages
* Update tools/server/webui/src/lib/stores/chat.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/stores/chat.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix/refactor OP argsort, pad
* fix count-equal op
* update SYCL OP list
* fix format issue
---------
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
* hparams : add check for layer index in is_recurrent
This commit adds a check in the is_recurrent method to ensure that the
provided layer index is within the valid range.
The motivation for this change is to prevent potential out-of-bounds
and also be consistent with other methods in the class that perform
similar checks, like is_swa.
The previous SVE implementation for `ggml_vec_dot_f16_unroll` contained a bug due to a copy-paste error. The wrong variable was used in an FMA instruction, leading to incorrect results. This commit corrects the variable usage and improves the clarity of the code by renaming variables to avoid confusion.
Co-authored-by: Aaron <shelhamer.aaron@gmail.com>
* feat: render user content as markdown option
- Add a persisted 'renderUserContentAsMarkdown' preference to the settings defaults and info metadata so the choice survives reloads like other options
- Surface the new 'Render user content as Markdown' checkbox in the General section of the chat settings dialog, beneath the PDF toggle
- Render user chat messages with 'MarkdownContent' when the new setting is enabled, matching assistant formatting while preserving the existing card styling otherwise
- chore: update webui build output
* chore: update webui build output
* server / ranking : add sorting and management of top_n
* Make the retro compatible if no top_n will return
all results
here is a script to make some test
```script
URL=${1:-http://127.0.0.1:8181}
curl "$URL/v1/rerank" -H "Content-Type: application/json" \
-d '{ "model": "M", "query": "What is the recipe to make bread ?",
"return_text" : true,
"texts" : true,
"top_n": 6,
"documents": [
"voici la recette pour faire du pain, il faut de la farine de l eau et du levain et du sel",
"it is a bear",
"bread recipe : floor, water, yest, salt",
"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
"here is the ingedients to bake bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
"recipe to make cookies : floor, eggs, water, chocolat",
"here is the recipe to make bread : 500g floor, 350g water, 120g fresh refresh yest, 15g salt",
"il fait tres beau aujourd hui",
"je n ai pas faim, je ne veux pas manger",
"je suis a paris"
] }' | jq
```
* use resize() instead for(...)
* simplify top_n init since no need to return error
result to test :
./tests.sh unit/test_rerank.py -v -x
==================================================== test session starts =====================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3
cachedir: .pytest_cache
rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.11.0
collected 8 items
unit/test_rerank.py::test_rerank PASSED [ 12%]
unit/test_rerank.py::test_rerank_tei_format PASSED [ 25%]
unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 37%]
unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 50%]
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 62%]
unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 75%]
unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 87%]
unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [100%]
===================================================== 8 passed in 4.31s ======================================================
* add rerank top_n unit test
here is the result :
./tests.sh unit/test_rerank.py -v -x
=================================================================== test session starts ===================================================================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.6.0 -- /home/yann/dev/yann/llama.cpp/tools/server/tests/test/bin/python3
cachedir: .pytest_cache
rootdir: /home/yann/dev/yann/llama.cpp/tools/server/tests
configfile: pytest.ini
plugins: anyio-4.11.0
collected 16 items
unit/test_rerank.py::test_rerank PASSED [ 6%]
unit/test_rerank.py::test_rerank_tei_format PASSED [ 12%]
unit/test_rerank.py::test_invalid_rerank_req[documents0] PASSED [ 18%]
unit/test_rerank.py::test_invalid_rerank_req[None] PASSED [ 25%]
unit/test_rerank.py::test_invalid_rerank_req[123] PASSED [ 31%]
unit/test_rerank.py::test_invalid_rerank_req[documents3] PASSED [ 37%]
unit/test_rerank.py::test_rerank_usage[Machine learning is-A machine-Learning is-19] PASSED [ 43%]
unit/test_rerank.py::test_rerank_usage[Which city?-Machine learning is -Paris, capitale de la-26] PASSED [ 50%]
unit/test_rerank.py::test_rerank_top_n[None-4] PASSED [ 56%]
unit/test_rerank.py::test_rerank_top_n[2-2] PASSED [ 62%]
unit/test_rerank.py::test_rerank_top_n[4-4] PASSED [ 68%]
unit/test_rerank.py::test_rerank_top_n[99-4] PASSED [ 75%]
unit/test_rerank.py::test_rerank_tei_top_n[None-4] PASSED [ 81%]
unit/test_rerank.py::test_rerank_tei_top_n[2-2] PASSED [ 87%]
unit/test_rerank.py::test_rerank_tei_top_n[4-4] PASSED [ 93%]
unit/test_rerank.py::test_rerank_tei_top_n[99-4] PASSED [100%]
=================================================================== 16 passed in 8.84s ===================================================================
* editor config check fix
In streaming mode when prompt exceeds context length, the server returns
HTTP 200 status code with a JSON error in the body. This is very
confusing and inconsistent with all other inference engines which return
HTTP 4xx error in this case.
This patch fixes this problem and makes the server return HTTP 400 in
such cases.
* webui: updated the chat service to only include max_tokens in the request payload when the setting is explicitly provided, while still mapping explicit zero or null values to the infinite-token sentinel
* chore: update webui build output
* minor : code style
* server : fix prompt similarity calculation
* server : initial host-memory prompt caching
* cont
* server : refactor
* cont
* cont : make the server task of the slot const
* cont : minor [no ci]
* server : cache prompts and checkpoints only for completion tasks
* server : improve prompt caching logic
* cont : fix check for number of cached prompts [no ci]
* server : improve caching logic, add -cram CLI arg
* server : print prompt mismatch info
* cont : better naming [no ci]
* server : improve prompt cache loading logic
* server : add option to debug the slot contents (#16482)
* server : add option to debug the slot contents
* Update tools/server/server.cpp
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* server : add option to disable prompt cache
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* model-conversion : add support for SentenceTransformers
This commit adds support for models that use SentenceTransformer layers.
The motivation for this is that if converted model includes any of the
numbered layers specified in the original models repository then these
changes enable these models to be used and verified. Currently the
model-conversion only support the base model output without any of
the additional transformation layers.
Usage:
Convert the model that also includes the SentenceTransformer layers:
```console
(venv) $ export EMBEDDING_MODEL_PATH="~/google/embeddinggemma-300M"
(venv) make embedding-convert-model
```
Verify the produced embeddings from the converted model against the
original model embeddings:
```console
(venv) make embedding-verify-logits-st
```
The original model can be run using SentenceTransformer:
```console
(venv) make embedding-run-original-model-st
```
Run the converted model using "SentenceTransformer" layers whic
enables pooling and normalization:
```console
(venv) make embedding-run-converted-model-st
```
* add model-conversion example requirements
* add support for -st flag in embedding model conversion
This commit add support for the -st flag in the embedding model
conversion script. This will enable models to be converted using
sentence transformers dense layers.
* CANN: improve ACL graph matching
Record `ne` and `nb` information for src tensors and include them in the
graph matching check. This enhances the robustness of ACL graph matching
by preventing incorrect matches when src tensors share the same data
address but differ in shape or stride.
* CANN: add op_params match
* refactor to support soft_max_ext
* fix error and support soft_max_back
* rm unused functions
* fix format issue
---------
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
* model: EmbeddingGemma sentence-transformers dense linear projections support
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.
See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
- converting model with dense-layers is optional
- introduced dense config params
* Update convert_hf_to_gguf.py
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* fixed formatting issues
* Update src/llama-graph.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* - removed pooling_type_opt, always allow overriding pooling_type
- asserts checking dense features dims
* fix python lint
* fix ubuntu gcc build warning
* - fixed thread-safety test
- moved asserts to load_hparams
* - tidying up code
- simplifying graph-context expecting both dense weights
* minor : add TODO
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing
- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
* refactor: implement streaming-aware universal reasoning parser
Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.
- Rework try_parse_reasoning() to track whitespace, partial tags, and
multiple reasoning segments, allowing proper separation of reasoning_content
and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments
The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.
Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
* refactor: address review feedback from allozaur
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)
- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
* refactor: address review feedback from ngxson
* debug: say goodbye to curl -N, hello one-click raw stream
- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story
- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
* npm run format
* chat-parser: address review feedback from ngxson
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* metal : better unroll in the FA kernels
* metal : index FA blocks
* tests : restore [no ci]
* metal : prevent division by zero in FA kernels
* metal : fix -INF detection logic
CI (AMD) / ggml-ci-x64-amd-vulkan (push) Waiting to run
CI (AMD) / ggml-ci-x64-amd-rocm (push) Waiting to run
CI / macOS-latest-cmake-arm64 (push) Waiting to run
CI / macOS-latest-cmake-x64 (push) Waiting to run
CI / macOS-latest-cmake-arm64-webgpu (push) Waiting to run
CI / ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm) (push) Waiting to run
CI / ubuntu-cpu-cmake (ppc64le, ubuntu-24.04-ppc64le) (push) Waiting to run
CI / ubuntu-cpu-cmake (s390x, ubuntu-24.04-s390x) (push) Waiting to run
CI / ubuntu-cpu-cmake (x64, ubuntu-22.04) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, ADDRESS) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, THREAD) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, UNDEFINED) (push) Waiting to run
CI / ubuntu-latest-llguidance (push) Waiting to run
CI / ubuntu-latest-cmake-rpc (push) Waiting to run
CI / ubuntu-24-cmake-vulkan (push) Waiting to run
CI / ubuntu-24-cmake-webgpu (push) Waiting to run
CI / ubuntu-22-cmake-hip (push) Waiting to run
CI / ubuntu-22-cmake-musa (push) Waiting to run
CI / ubuntu-22-cmake-sycl (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (push) Waiting to run
CI / build-linux-cross (push) Waiting to run
CI / build-cmake-pkg (push) Waiting to run
CI / macOS-latest-cmake-ios (push) Waiting to run
CI / macOS-latest-cmake-tvos (push) Waiting to run
CI / macOS-latest-cmake-visionos (push) Waiting to run
CI / macOS-latest-swift (generic/platform=iOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=macOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=tvOS) (push) Blocked by required conditions
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Waiting to run
CI / windows-latest-cmake (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Waiting to run
CI / windows-latest-cmake (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=… (push) Waiting to run
CI / windows-latest-cmake (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Waiting to run
CI / ubuntu-latest-cmake-cuda (push) Waiting to run
CI / windows-2022-cmake-cuda (12.4) (push) Waiting to run
CI / windows-latest-cmake-sycl (push) Waiting to run
CI / windows-latest-cmake-hip (push) Waiting to run
CI / ios-xcode-build (push) Waiting to run
CI / android-build (push) Waiting to run
CI / openEuler-latest-cmake-cann (aarch64, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / openEuler-latest-cmake-cann (x86, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / ggml-ci-x64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-x64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Waiting to run
CI / ggml-ci-x64-nvidia-cuda (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm2 (push) Waiting to run
CI / ggml-ci-x64-cpu-amx (push) Waiting to run
CI / ggml-ci-mac-metal (push) Waiting to run
CI / ggml-ci-mac-vulkan (push) Waiting to run
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
flake8 Lint / Lint (push) Has been cancelled
Python Type-Check / pyright type-check (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/cpu.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 runs_on:ubuntu-22.04 server:true tag:cpu]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/cuda.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 runs_on:ubuntu-22.04 server:true tag:cuda]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/intel.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-22.04 server:true tag:intel]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/musa.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 runs_on:ubuntu-22.04 server:true tag:musa]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/s390x.Dockerfile free_disk_space:false full:true light:true platforms:linux/s390x runs_on:ubuntu-22.04-s390x server:true tag:s390x]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/vulkan.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 runs_on:ubuntu-22.04 server:true tag:vulkan]) (push) Has been cancelled
Publish Docker image / Create and push git tag (push) Has been cancelled
Update Winget Package / Update Winget Package (push) Has been cancelled
* Add profiling
* More detailed profiling
* Rework command submission to avoid global locks
* Update wait handling
* try new method of waiting on futures
* Add serializing of command submission in some cases
* Add new pool for timestamp queries and clean up logging
* Serialize command submission in CI and leave a TODO note
* Update webgpu CI
* Add myself as WebGPU codeowner
* Deadlock avoidance
* Leave WebGPU/Vulkan CI serialized
* Fix divide by 0
* Fix logic in division by inflight_threads
* Update CODEOWNERS and remove serialize submit option
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* metal : pad K, V and Mask when needed
* cont : simplify
* cuda : add TODO about KV padding requirement
* metal : add comments
* metal : remove mask padding requirement
* tests : add -INF blocks to the KQ mask in the FA tests
* cont : bump -INF block size to 64
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* ggml : prevent division by zero in FA CPU op
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* metal : ssm_scan minor opts
* metal : get_rows optimize
* metal : cpy optimize
* metal : ssm_conv opt
* metal : ssm_scan simplify
* metal : ssm_Scan opt
* implement --no-host to disable host buffer
* fix equal_mparams
* move no-host enumeration order together with other model params
---------
Co-authored-by: slaren <slarengh@gmail.com>
* fix: Fix duplicate fake image before token on first slice
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use double-newline before overview image
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove incorrect newline at the end of granite chat template gen prompt
There should not be one, even for the language models.
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* tests: Remove bad newline from granite chat template test (legacy)
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This commit updates the leftover handling in ggml_vec_scale_f32.
The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration
, there can be up to (2*ggml_f32_epr - 1) leftover elements.
The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.
Example scenario with 256-bit SVE:
```
ggml_f32_epr = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n = 25
np = 16
leftovers = 9 elements (16-24)
Original : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
* feat: Add granite-docling conversion using trillion pretokenizer
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling vocab pre enum
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use granite-docling pre
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add clip_is_idefics3
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Allow multi-token boundary sequences for image templating
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tiling support for idefices3 in clip.cpp
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Partial support for full templating for idefics3 in mtmd
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Fully working image preprocessing for idefics3 w/ resize and slicing
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the longest side instead of size * scale_factor
For Granite Docling, these come out to the same value, but that was just a
conicidence.
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow batch encoding and remove clip_is_idefics3
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove unnecessary conditionals for empty token vectors
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use image_manipulation util
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add test model
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* rpc : add support for multiple devices
Allow rpc-server to expose multiple devices from a single endpoint.
Change RPC protocol to include device identifier where needed.
closes: #15210
* fixes
* use ggml_backend_reg_t
* address review comments
* fix llama-bench backend report
* address review comments, change device naming
* fix cmd order
* vulkan (DRAFT): split shader generation by GLSL source file, to improve incremental build times
* support dep-files so shaders are recompiled if their included files change
* rename shader files which are used as "headers" to use .glsl extension
* move glslc extension detection shaders to separate folders
* the above is to prevent them from getting glob'd with the actual compute shaders that need to be compiled
* vulkan : only write embedded shader .hpp/.cpp when they change
* avoid recompiling ggml-vulkan.cpp when editing shaders
* pass single --source argument instead of --input-dir & --filter to shader gen
* check for source file match earlier
* fix hang in vulkan-shaders-gen when there are compilation errors
* early out did not decrement compile_count
* clean up
* fix glslc integer dot product test
* unconditionally write the embedded shader cpp output
* replace output filepath in generated dep-files to match output in CMakeLists
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
CI (AMD) / ggml-ci-x64-amd-vulkan (push) Waiting to run
CI (AMD) / ggml-ci-x64-amd-rocm (push) Waiting to run
CI / macOS-latest-cmake-arm64 (push) Waiting to run
CI / macOS-latest-cmake-x64 (push) Waiting to run
CI / macOS-latest-cmake-arm64-webgpu (push) Waiting to run
CI / ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm) (push) Waiting to run
CI / ubuntu-cpu-cmake (ppc64le, ubuntu-24.04-ppc64le) (push) Waiting to run
CI / ubuntu-cpu-cmake (s390x, ubuntu-24.04-s390x) (push) Waiting to run
CI / ubuntu-cpu-cmake (x64, ubuntu-22.04) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, ADDRESS) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, THREAD) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, UNDEFINED) (push) Waiting to run
CI / ubuntu-latest-llguidance (push) Waiting to run
CI / ubuntu-latest-cmake-rpc (push) Waiting to run
CI / ubuntu-24-cmake-vulkan (push) Waiting to run
CI / ubuntu-22-cmake-webgpu (push) Waiting to run
CI / ubuntu-22-cmake-hip (push) Waiting to run
CI / ubuntu-22-cmake-musa (push) Waiting to run
CI / ubuntu-22-cmake-sycl (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (push) Waiting to run
CI / build-linux-cross (push) Waiting to run
CI / build-cmake-pkg (push) Waiting to run
CI / macOS-latest-cmake-ios (push) Waiting to run
CI / macOS-latest-cmake-tvos (push) Waiting to run
CI / macOS-latest-cmake-visionos (push) Waiting to run
CI / macOS-latest-swift (generic/platform=iOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=macOS) (push) Blocked by required conditions
CI / macOS-latest-swift (generic/platform=tvOS) (push) Blocked by required conditions
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Waiting to run
CI / windows-latest-cmake (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Waiting to run
CI / windows-latest-cmake (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=… (push) Waiting to run
CI / windows-latest-cmake (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Waiting to run
CI / ubuntu-latest-cmake-cuda (push) Waiting to run
CI / windows-2022-cmake-cuda (12.4) (push) Waiting to run
CI / windows-latest-cmake-sycl (push) Waiting to run
CI / windows-latest-cmake-hip (push) Waiting to run
CI / ios-xcode-build (push) Waiting to run
CI / android-build (push) Waiting to run
CI / openEuler-latest-cmake-cann (aarch64, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / openEuler-latest-cmake-cann (x86, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / ggml-ci-x64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-low-perf (push) Waiting to run
CI / ggml-ci-x64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf (push) Waiting to run
CI / ggml-ci-arm64-cpu-high-perf-sve (push) Waiting to run
CI / ggml-ci-x64-nvidia-cuda (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm (push) Waiting to run
CI / ggml-ci-x64-nvidia-vulkan-cm2 (push) Waiting to run
CI / ggml-ci-x64-cpu-amx (push) Waiting to run
CI / ggml-ci-mac-metal (push) Waiting to run
CI / ggml-ci-mac-vulkan (push) Waiting to run
Copilot Setup Steps / copilot-setup-steps (push) Has been cancelled
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Has been cancelled
Python check requirements.txt / check-requirements (push) Has been cancelled
flake8 Lint / Lint (push) Has been cancelled
Python Type-Check / pyright type-check (push) Has been cancelled
Update Operations Documentation / update-ops-docs (push) Has been cancelled
* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls
* feat: new flow in the chat template test suite for Magistral
* initial commit for branch 3
* generalize `swa_checkpoint` to `ctx_checkpoint`
this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba, Granite
* oops
* disable debug prints
* keep backwards compat with `--swa-checkpoints`
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update prompt re-processing message
* fix off-by-one error per GG
* keep `seq_rm` log per GG
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix checkpoint logic to support recurrent caches
* server : cleanup and fixes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vulkan: Replace uses of maxMemoryAllocationSize and VK_WHOLE_SIZE
Replace maxMemoryAllocationSize check with maxBufferSize when creating buffers.
The maxMemoryAllocationSize limit is a "soft" limit and allocations can succeed
beyond that limit. This allows > 4GB buffers to be allocated on some
implementations (e.g. NVIDIA) and tensors this large can be used for im2col
and mul_mat.
For temporary buffers (prealloc_x/y/etc) check against maxStorageBufferRange.
I'm not sure this check is ideal, but we always use these buffers as a single
full size binding and the limit may be smaller than maxMemoryAllocationSize
or maxBufferSize, so I think this is reasonable.
Replace descriptor range uses of VK_WHOLE_SIZE with a manually computed range.
The maxStorageBufferRange may be smaller than the maxBufferSize or
maxMemoryAllocationSize (and the Vulkan spec warns about this in a note) and
it's invalid usage if VK_WHOLE_SIZE computes a range larger than
maxStorageBufferRange.
With this change, it should be possible to generate videos using wan networks
in stable-diffusion.cpp.
* vulkan: Add env var GGML_VK_FORCE_MAX_BUFFER_SIZE and use stoull
When computing sinks, the cm1 shader was looping r from 0 to Br rather than
to rows_per_thread. I must have copied this from the scalar path (where it is
correct), and somehow it wasn't causing failures on current drivers.
* feat: Capture model name only after first token (streaming) or completed request (non-streaming)
* chore: update webui build output
* chore: update webui build output
* fix: Include just the currently active message branches instead of all in chat completions request
* chore: Build webui static output
* chore: Formatting
* chore: update webui build output
* do not use more threads than physically available
* ensure n_threads > 0
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* First attempt
* No permute during convert (fixes qk tensors), proper norm application.
* RoPE = NeoX
* Coherence!
* Migrate xielu params from tensors to hyperparameters
* Simple CUDA kernel
* Revert stupid LLM refactorings
* Chat template support
* configchecker / flake8 errors
* Reorder unary.cu
* I do conclude that LLMs are, in fact, stupid.
* Fix after merge
* Final newline
* Make xIELU an UNARY_OP
* Final newline
* Correctly account for parameter shift
* Argh.
* Update ggml/src/ggml-cpu/unary-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Refactor: remove unused methods, inline and factorize softplus, add const modifiers
* Revert CUDA changes, implement xIELU as a separate OP
* Pesky newline
* Add float2half / half2float for F16 inputs/outputs
* CUDA variants, attempt 2
* Actually, attempt 3
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Missing convert header
* Proper formula and reference for xIELU in the comments.
* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add tensor mappings for Apertus to global list instead
* Fix lazy on scalars
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add comment about the constraints on positive/negative alpha
* Change `softplus` to `ggml_softplus`
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* update oneapi to 2025.2, use deep-learning-essentials to replace base-tool
* update to 2025.2 use deeplearn essi to replace base toolkit
* add missed dll
* add deep learning essentials
* add sycl-ls
---------
Co-authored-by: Zhang Jianyu <zhang.jianyu@outlook.com>
* HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0
rocwmma 2.0.0 includes a bug in the code fakeing fp16 accumulation on CDNA
* CUDA: Fix volta condition in ggml_cuda_should_use_wmma_fattn
* Fix to use hidden_size_per_head
* Fix num heads
* Fix array
* Fix loading weights
* Support old GGUF converted by the previous version of llama.cpp
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Move shared parameter definitions to the outside of loop
* Not calculating n_embd_head_k,v by n_embd / n_head
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CI: Properly install rocwmma for hip builds
on windows we now windows install rocwmma from ubuntu pacakges
* CI: update linux rocm docker build to use rocm 7.0
* common: introduce http.h for httplib-based client
This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.
It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* tools : add missing WIN32_LEAN_AND_MEAN
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
* feat: Add a setting to include model name used to generate the message
* feat: UI improvements
* feat: Save model info along with the database message entry creation
* chore: Build webui static output
* Make a few GLM tensors not required
layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6 resulting in the model not loading after conversion/quantization, this marks those tensors as not required which makes it work
* Update llama-model.cpp
layer.nextn.shared_head_norm also not required in case of future models
* Work on rope
* Simplify inplace operation generation and combine mul/add generation
* Work on rope variants
* implement neox rope
* rope complete
* Add sub,div,glu operators
* implement scale op
* Update cpy shader to handle cont/more types
* formatting
* Update test vars printing for rope,rms_norm
* Avoid ROPE hardcoded constants
* Add TODO to change ROPE constants to enum
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix TODO comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.
The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.
This legacy code can be removed in a future version.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
just be formatted as `MAJOR.MINOR.PATCH`.
This PR adds additional information to an error message when loading backend library via ld_load_library() fails. This helps spotting why backend library did not load (missing library, missing dependency or unresolved symbol etc.).
* tools/main: llama-cli: prevent spurious assistant token (#13402)
During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.
Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.
Fixes#13402.
Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
* Update tools/main/main.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* tools/main: remove outdated comment
Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
---------
Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : fix reasoning before forced tool call via tool_choice = required
* common : improve reasoning and commentary handling when tool_choice is required
(cherry picked from commit c746984956d6882c2de73d53ae2bb3bdf889e475)
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
* vulkan: 64-bit im2col
Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.
* fix validation error for large im2col
* metal : support mul_mm with src1->type == GGML_TYPE_F16
* metal : support mul_mm_id with src1->type == GGML_TYPE_F16
[no ci]
* metal : mul_mm support ne00 % 32 != 0
* metal : support mul_mm_id with ne00 % 32 != 0
* cont : remove unnecessary unrolls
* cont : simplify data loading
* metal : optimize mul_mm when output bounds checks are not needed
* vulkan: handle mat_mul with A matrix > 4GB
This change splits mat_mul operations with huge A matrix into chunks in the M
dimension. This works well for stable-diffusion use cases where the im2col
matrix has very large M.
Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so stride should be set after.
* build fixes
The "Clamp" spec constant is already based on whether KV is a multiple of Bc,
so use that to control whether bounds checking is performed. Add bounds checking
to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V
tensors are already optionally clamped, nothing else needed to be changed).
* CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32
This commit adds mul_mat_id support for ncols_dst >= 16. It does this by
packing ncols_dst tiles into the blockDim.y.
My tests on a RTX 3090 show that this is faster than the cuBLAS fallback
for f16 till bs=64, and for f32 till bs=32
* Review: refactor if statement
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.
Introduce a new `LLAMA_OPENSSL` option, enabled by default.
This preserves the previous default (libcurl first, OpenSSL as fallback),
while allowing OpenSSL to be disabled if desired.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* minicpm: make GGUF scaling keys optional with legacy defaults
Older MiniCPM GGUFs do not include the scaling metadata keys (minicpm.embedding_scale, minicpm.residual_scale, minicpm.logit_scale). The loader currently treats these as required, so quantization fails with:
key not found in model: minicpm.embedding_scale
This change restores backward compatibility by treating these keys as optional in the loader and using the older MiniCPM scaling values:
embedding_scale = 12.0f
residual_scale = 1.4f / sqrt(n_layer)
logit_scale = 256.0f / n_embd
When the GGUF provides the keys, their values override the defaults; otherwise the legacy defaults are used. Newer GGUFs that already include these keys are unaffected.
Fixes: #16192
Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
* Update src/llama-model.cpp
Committed as suggested. Thanks!
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Signed-off-by: Vinkal Chudgar <vinkal.chudgar@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* devops: move s390x and ppc64le ci build
we have access to ubuntu-24.04-s390x and ppc64le images now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le for now since they have compiler errors
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: stop warnings as errors
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: switch to non-macro flag
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: going the llama macro route
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add big-endian gguf test models
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le to test s390x, check test build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dup .gguf.inp files for big-endian tests
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dup .gguf.out files for big-endian too
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add python setup and endian byteswap
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: pooring thing does not have s390x python3
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add missing rust compiler for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: try rust actions runner
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "devops: try rust actions runner"
This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: try a different path for rust
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dump home directory and user info
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: install gguf-py only
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: missed relative path
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove big-endian files since local swapping is working
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: revert test-tokenizer-0 cmakelists
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix unicode flags conversion from and to uint16_t
Bitfields are allocated in different order on s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Simplify byteswap command
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix endianness detection in vocab loader
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Disable test-thread-safety on s390x
In this test a model is downloaded,
then immediately loaded to check if more downloads are needed,
and then used for test.
There is no clean way to separate all those steps
to add byteswapping between them, so just skip this test.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix q8_0 test in test-quantize-fns
vec_signed uses unexpected rounding mode.
Explicitly use different rounding function.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add big-endian stories260K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add s390x test-eval-callback
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix test does not exist
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix model not found llama-eval-callback
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix q3_K dot product error in test-quantize-fns on s390x
Array q8bytes had only 4 elements allocated, but 8 elements accessed.
This lead to write out of bounds and later read of overwritten values out of bounds
and incorrect result.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: re-enable ppc64le for testing
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: activate test-thread-safety for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le tests
for some reason it keeps failing test-thread-safety tests and I do not
have a machine that is able to replicate the tests.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: LLAMA_FATAL_WARNINGS=ON
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Correct repository URL for s390x for test-thread-safety model
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix fs_get_cache_directory
Ensure it works even if both XDG_CACHE_HOME and HOME are unset.
This might happen in containers.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Re-enable CI for ppc64le
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fortify ggml_rope_impl
Only memcpy data from sections argument if it's non-NULL.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way
* Update URL for big-endian model
* Update .github/workflows/build.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update remaining mentions of BE models to ggml-org/models repo
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
Co-authored-by: Aleksei Nikiforov <103434461+AlekseiNikiforovIBM@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* webui: allow viewing conversations and sending messages even if llama-server is down
- Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable.
- Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation.
- Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data.
* feat: Add UI for `props` endpoint unavailable + cleanup logic
* webui: extend cached props fallback to offline errors
Treat connection failures (refused, DNS, timeout, fetch) the same way as
server 5xx so the warning banner shows up when cache is available, instead
of falling back to a full error screen.
* webui: Left the chat form enabled when a server warning is present so operators can keep sending messages
e.g., to restart the backend over llama-swap, even while cached /props data is in use
* chore: update webui build output
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* Switched web UI to hash-based routing
* Added hash to missed goto function call
* Removed outdated SPA handling code
* Fixed broken sidebar home link
* vendor : update httplib
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : use cpp-httplib as a cURL alternative for downloads
The existing cURL implementation is intentionally left untouched to
prevent any regressions and to allow for safe, side-by-side testing by
toggling the `LLAMA_CURL` CMake option.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : Bump to Windows 10
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ci : create git tags for released docker images
When releasing a docker image for build number X, we should also create
the corresponding git tag. This allows users to easily checkout the
corresponding source tree for given docker image.
* Update .github/workflows/docker.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update .github/workflows/docker.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Apply suggestion from @CISC
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: add a fused top-K MoE kernel
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory
It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
This commit adds support for passing a prompt file to the model
conversion targets/scripts. It also updates the logits.cpp to print out
embedding information in the same format as when running the original
embedding model.
The motivation for this is that it allows us to pass files of different
sizes when running the converted models and validating the logits.
This can be particularly important when testing the sliding window
functionality of models where the sequence length needs to exceed a
certain number of tokens to trigger the sliding window logic.
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.
The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
* run the x64 ci on regular machines
* set up the same thing for arm
fix test-quantize-perf just like #12306
* try to disable sve
* add another sve run
* ggml : make gallocr respect the backend's max buffer size
* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface
* fix missing newline, apple-clang warning
* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.
* track (chunk, offset) pairs instead of "global" offsets through gallocr.
* simpler, don't need loops to map between local/global offsets
* touches more code
* fix dyn_tallocr_max_size and initialization
* fix memory leak when buffers are reused due to same buffer type appearing multiple times
* make vbuffer allocation follow the same logic as backend_buffer did before
* continue to use leftover unallocated space of previous chunks after a new one has been created
* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size
* refactor: move adding new free block and new chunk into separate functions
* allocate chunks individually with a separate free-blocks list for each one
* needs a bit more memory/allocations/indirections, but code is simpler
* fix warnings (missing static) & debug checks
This commit adds a leading slash to the paths of root-level files
in the CODEOWNERS file.
The motivation for this is that these might otherwise match files
in subdirectories that have other/additional owners will override them.
Refs: https://github.com/ggml-org/llama.cpp/pull/16209#issuecomment-3326434274
This is a configuration of the hparams in the GraniteHybrid architecture
that devolves to the Granite (or GraniteMoe) architecture (ie Granite 3.x).
It may be used for some models in the Granite 4 family with the
GraniteHybrid architecture acting as a superset arch. Rather than support
it directly in the c++ graph, we simply coerce the architecture flag back
to the correct "granite" or "granitemoe" architecture.
Branch: gabe-l-hart/GraniteNonHybridConversion
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Disable 'performance-enum-size' checking:
Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes)
than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the
base type to reduce its size.
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* common : use the json parser
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : enable --offline mode without CURL support
This change refactors the download logic to properly support offline mode
even when the project is built without CURL.
Without this commit, using `--offline` would give the following error:
error: built without CURL, cannot download model from the internet
even if all the files are already cached.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
These two local variables 'arg' and 'arg_prefix' have been overriden by:
1. for (const auto & arg : opt.args)
2. for (int i = 1; i < argc; i++) {
const std::string arg_prefix = "--";
std::string arg = argv[i];
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicity check push constants limit in supports_op() for conv_transpose_2d operation.
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
* contrib : update roles
* contrib : merge PR sections + add link to CI instructions
Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
* ci : migrate ggml ci to a self-hosted runners
* ci : add T4 runner
* ci : add instructions for adding self-hosted runners
* ci : disable test-backend-ops from debug builds due to slowness
* ci : add AMD V710 runner (vulkan)
* cont : add ROCM workflow
* ci : switch to qwen3 0.6b model
* cont : fix the context size
* vulkan: optimize UMA buffer operations and fix driver hangs
The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.
[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang
* vulkan: implement deferred_memset on UMA
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* ggml : introduce semantic versioning
This commit introduces semantic versioning for the GGML library.
The motivation for this is that the current versioning, using build
numbers, makes it difficult to track changes and releases for projects
that use ggml.
The release steps are the following:
1. Sync the changes from llama.cpp using sync-llama-am.sh and after the
PR has been approved and merged move to step 2.
2. Run scripts/release.sh and specify the type of release, major, minor,
or patch. This script will handle incrementing the version
(major|minor|patch), create a new commit with the version change,
create a tag for the version, and prepare for the next development
iteration.
3. Inspect the commits/tag and push to master. This will trigger the
github release workflow which is triggered for new tags which will
then publish a new release on github.
Example usage:
```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made
Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0
Step 2: Updating version in ggml/CMakeLists.txt...
[dry-run] Would update GGML_VERSION_MAJOR to 1
[dry-run] Would update GGML_VERSION_MINOR to 0
[dry-run] Would update GGML_VERSION_PATCH to 0
[dry-run] Would remove -dev suffix
Step 3: Committing version bump...
[dry-run] Would commit: 'ggml : bump version to 1.0.0'
Step 4: Creating git tag...
[dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'
Step 5: Preparing for next development cycle...
[dry-run] Would update GGML_VERSION_MINOR to 1
[dry-run] Would add -dev suffix back
Step 6: Committing development version...
[dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'
[dry-run] Summary (no changes were made):
• Would have released version: 1.0.0
• Would have created tag: v1.0.0
• Would have set next development version: 1.1.0-dev
```
Refs: https://github.com/ggml-org/ggml/issues/1333
* ggml: create branch for release candidate and check master
* ggml : sign the git tag
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
* server: fix SSE and OpenAI compatibility for error messages when streaming
* server: remove obsolete event parameter and use required data fieldname instead
* * llama-bench: add --devices support
- Support --devices same as llama-server
- Provide for benchmarking different device combinations
- Include --list-devices like llama-server for convenience
* fix: field display ordering restored
* fix: integrated the rpc devices
- aimed to mimic the server as much as possible
* cleanup: defaults for list-devices
- handle dup device listing with RPC
* cleanup: remove dup device load calls
* docs: update llama-bench
- added the recently added n-cpu-moe option to the docs while in there
* llama-bench: rpc device simplification
* rpc servers unify with other devices earlier, simplifying code
* --list-devices made stateless and simpler
* various cleanup
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.
Without this commit, the code compiles (with warnings) but fails:
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
....
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
Illegal instruction (core dumped)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build:
$ cmake -B build \
-DBUILD_SHARED_LIBS=OFF \
-DLLAMA_CURL=OFF \
-DGGML_CCACHE=OFF \
-DGGML_NATIVE=OFF \
-DGGML_STATIC=ON
$ ldd llama-server
linux-vdso.so.1 (0x0000e1a434e3b000)
libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
/lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)
This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D
* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* optimize
* use a concise expression to further speedup the cuda kernel
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
- Implement resumable downloads in common_download_file_single function
- Add detection of partial download files (.downloadInProgress)
- Check server support for HTTP Range requests via Accept-Ranges header
- Implement HTTP Range request with "bytes=<start>-" header
- Open files in append mode when resuming vs create mode for new downloads
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
* server : include usage statistics only when user request them
When serving the OpenAI compatible API, we should check if
{"stream_options": {"include_usage": true} is set in the request when
deciding whether we should send usage statistics
closes: #16048
* add unit test
* Add paramater buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant"
This reverts commit 77f8b96515.
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
* metal : improve naming
* metal : refactor device
ggml-ci
* cont : props
ggml-ci
* metal : apply ggml_mem_ranges_t
ggml-ci
* metal : remove GGML_METAL_USE_BF16
ggml-ci
* metal : refactor device buffer
ggml-ci
* cont : fix naming
* metal : sync before destroying the backend
ggml-ci
* metal : refactor context
ggml-ci
* metal : migrate ggml-metal.m to ggml-metal.cpp
ggml-ci
* metal : adjust ops API
ggml-ci
* metal : use C++ to store piplienes
ggml-ci
* metal : migrate ops to separate functions
ggml-ci
* metal : add ggml_metal_library_t
ggml-ci
* metal : improve naming
ggml-ci
* metal : cleanp
ggml-ci
* metal : add support for GGML_OP_LOG
ggml-ci
* metal : fix error handling
ggml-ci
This commit reverts the change of the runs-on parameter for the
macOS-latest-cmake-x64 job back to macos-13 that was make in
Commit 51abc96bdc ("ci : update
macos-latest* jobs to use macos-latest (#15938)").
The motivation for this is that using macos-latest will cause an ARM
based runner to be used, and not an x64 based runner.
Refs: https://github.com/ggml-org/llama.cpp/pull/15938#issuecomment-3300805127
* CANN: Fix ggml_cann_set_device to avoid redundant device switches
- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.
* CANN: add device default id
This commit updates the runs-on field for the macOS arm64 webgpu build
job to use macos-latest instead of just latest.
The motivation for this is that this job can wait for a runner to pick
up the job for a very long time, sometimes over 7 hours. This is an
attempt to see if this change can help reduce the wait time.
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17754163447/job/50454257570?pr=16004
* ggml : remove adding extra dim timestep embedding
This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.
The motivation for this change is that this introduces an unnecessary
dimension when the dimension is odd, which caused an issue in the
kernels which were not expecting this extra dimension and it resulted in
uninitialized memory for the second to last dimension.
* ggml-cuda : fix padding in timestep embedding kernel
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* ggml-metal : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel
* ggml-opencl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-sycl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-vulkan : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-cpu : fix padding in timestep embedding function
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
This commit updates the github workflows build.yml file to include steps
for uploading and downloading the xcframework artifact. The
macos-latest-swift job now depends on the ios-xcode-build job and
downloads the xcframework artifact produced by it.
The motivation for this changes is that it takes a long time to build
the xcframework and we are currently doing this twice in the workflow.
With this change, we only build it once and reuse the artifact.
clang-format previously broke long CUDA macros (e.g. __launch_bounds__) into
unreadable line breaks inside template declarations, such as:
template<int D, int ncols, int nwarps, int VKQ_stride,
typename KQ_acc_t, bool use_logit_softcap>
__launch_bounds__(nwarps*ggml_cuda_get_physical_warp_size(), 1)
This change adjusts formatting rules so that CUDA macros remain consistent
and aligned with the surrounding template syntax.
* ci : update macos-latest* jobs to use macos-latest
This commit updates the jobs that are named macos-latest* to use the
macos-latest label instead explicit versions.
The motivation for this is that there is currently a mixuture of
versions in this workflow and there are jobs that are failing because
they require a newer version.
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17644792595/job/50140010907#step:5:1759
* ci : add xcodebuild -downloadPlatform iOS command
* fix im2col_3d to respect non-contiguous inputs (views)
The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides.
This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.
* use ggml_element_size() for src strides
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* SYCL: Add COUNT_EQUAL operator support (rebased on master)
* SYCL: remove duplicate op_count_equal definition
* tests: remove test_count_equal_typed and use test_count_equal for all cases
* tests: keep only I32 case for COUNT_EQUAL as suggested
* tests: keep only I32 case for COUNT_EQUAL as requested
* llama-run: Fix model download on Windows
* fix SSL error (SSL peer certificate or SSH remote key was not OK)
* fix program crash on std::filesystem::rename
* llama-run: create a separate method to utilize RAII
* llama-run: handle rename exception
In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably).
* add grok-2 support
* type fix
* type fix
* type fix
* "fix" vocab for invalid sequences
* fix expert tensor mapping and spaces in vocab
* add chat template
* fix norm tensor mapping
* rename layer_out_norm to ffn_post_norm
* ensure ffn_post_norm is mapped
* fix experts merging
* remove erroneous FFN_GATE entry
* concatenate split tensors and add more metadata
* process all expert layers and try cat instead of hstack
* add support for community BPE vocab
* fix expert feed forward length and ffn_down concat
* commit this too
* add ffn_up/gate/down, unsure if sequence is right
* add ffn_gate/down/up to tensor names
* correct residual moe (still not working)
* mess--
* fix embedding scale being applied twice
* add built in chat template
* change beta fast for grok if default value
* remove spm vocab in favor of community bpe vocab
* change attention temp length metadata type to integer
* update attention temp length metadata
* remove comment
* replace M_SQRT2 with std::sqrt(2)
* add yarn metadata, move defaults to hparams
* metal : remove mem pool usage
ggml-ci
* metal : remove mem pool implementation
ggml-ci
* metal : take into account the actual allocated memory of the tensor
ggml-ci
* cont : use ggml_backend_buft_get_alloc_size
ggml-ci
* cont : improve, comments
ggml-ci
* cont : add functions for the extra tensor sizes
* metal : add comments
ggml-ci
* metal : implement .get_alloc_size for the rest of the buffer types
ggml-ci
* metal : remove ggml_metal_heap
ggml-ci
* rocm.Dockerfile: added gfx1200,gfx1201 architectures to support AMD Radeon RX 9000 series
https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.1/reference/system-requirements.html#rdna-os
states the Radeon RX 9000 series is supported support from Ubuntu 24.04.2, and the dockerfile is using 24.04 which is ROCm 6.4.
This fixed the `ROCm error: invalid device function` I was getting when trying to use the rocm container.
* build: fix the cache keys for Windows HIP release job
Update the cache keys to include the HIP SDK version, preventing the
use of outdated ROCm installation caches.
* build: sync changes from release.yml to build.yml
- Update HIP SDK version to 25.Q3 and ROCm version to 6.4.2
- Update the cache keys to reflect the new versions
* build: remove Windows HIP release for gfx1151
since the current stable rocWMMA does not support gfx1151.
Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
* metal : refactor bin kernels loading
ggml-ci
* metal : refactor rms kernel loading
ggml-ci
* ci : try to add memory leaks check
ggml-ci
* ci : try to enable memory leak detection for Mac
* cont : seems to be working
To pull and run models via: llama-server -dr gemma3
Add some validators and sanitizers for Docker Model urls and metadata
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type
ggml-backend : add device id to device props
llama : only use iGPU devices if there are no GPU devices
llama : do not use multiple devices from different backends with the same device id
This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling
MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when
the target hardware actually supports them.
The motivation for this is to fix ggml CI build failures where the
feature detection correctly identifies that i8mm is not supported,
adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are
still enabled, causing the compiler to attempt to use vmmlaq_s32
intrinsics without i8mm support.
Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499
Since the prefill length is not fixed, graphs constructed for the
prefill stage cannot be reused. For this reason, ACL graph
execution is disabled by default during prefill.
* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
* support non-contiguous Q in build_attn_mha
* Update src/llama-graph.cpp
ggml-ci
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:
Elements 0-6 are set to cosine values
Elements 7-13 are set to sine values
Element 14 is left uninitialized (contains garbage)
Element 15 is correctly set to zero
This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero as well
as the last element.
Resolves: https://github.com/ggml-org/ggml/issues/1324
* metal : make the backend async
ggml-ci
* cont : add comments, extend op offload, clean up
ggml-ci
* metal : fix batch size for MUL_MAT_ID
* metal : remove deprecated ggml_backend_metal_buffer_from_ptr
* metal : create only metal buffers, no wrapping of host memory
ggml-ci
* metal : restore .alloc_buffer for buffer_from_ptr_type
ggml-ci
* metal : remove broken implementation of GGML_OP_SET
ggml-ci
* metal : clean-up loose ends, ready for tests
ggml-ci
* metal : support both private and shared buffers
ggml-ci
* metal : enable private buffers + add global device queue
* metal : disable host buffer to prevent races
ggml-ci
* metal : avoid extra copy during set_tensor
ggml-ci
* metal : use separate buffer types for shread and private Metal buffers
ggml-ci
* metal : simplify synchronization logic
ggml-ci
* metal : fix build
ggml-ci
* metal : do not implement cpy_tensor
ggml-ci
* metal : separate implementations for shared and private buffers
ggml-ci
This commit applies the same caching to the release workflow which
currently exists for the main CI workflow that was introduced in Commit
ff02caf9ee ("ci : cache ROCm installation
in windows-latest-cmake-hip (#15887)").
* CANN: Add ROPE sin/cos cache for reuse
Introduce sin/cos caching mechanism in ROPE to avoid redundant
computation across layers. The cache is built on the first layer
per device and reused by subsequent layers if parameters match.
- Added sin_cache / cos_cache pointers and position_length tracking
- Introduced cache validity flags and properties:
(ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
- Accelerates ROPE by eliminating repeated sin/cos generation
This change reduces overhead in multi-layer scenarios while
preserving correctness by verifying parameter consistency.
Co-authored-by: hipudding <huafengchun@gmail.com>
* fix typo
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
* CANN: implement LRU cache for ACL graphs in CANN backend
- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.
* fix typo
* The LRU cache capacity can be configured via an env variable
Signed-off-by: noemotiovon <757486878@qq.com>
* refactory acl graph
* refactory && fix review comments
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
This commit adds check for two function pointers returned from
ggml_backend_reg_get_proc_address.
The motivation for this is that the function pointer could be nullptr if
the get proc address function changes in the future. This is also
consistent with all the other calls to ggml_backend_reg_get_proc_address
in the code base.
This commit adds caching of the ROCm installation for the windows-latest-cmake-hip job.
The motivation for this is that the installation can sometimes hang and/or not complete properly leaving an invalid installation which later fails the build. By caching the installation hopefully we can keep a good installation available in the cache and avoid the installation step.
Refs: https://github.com/ggml-org/llama.cpp/pull/15365
* CUDA: Add mul_mat_id support the mmf
Add support for mul_mat_id for bs < 16
* Review: use warp_size, fix should_use_mmf condition
* Launch one block per expert, stride along n_expert_used
* templatize mul_mat_id
* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids
* Reduce compile times by dividing mmf into f16, bf16 and f32 variants
* Divide mmf by ncols_dst
* Add missing files
* Fix MUSA/HIP builds
* requirements : update transformers/torch for Embedding Gemma
This commit updates the requirements to support converting
Embedding Gemma 300m models.
The motivation for this change is that during development I had a local
copy of the transformers package which is what I used for converting
the models. This was a mistake on my part and I should have also updated
my transformers version to the official release.
I had checked the requirements/requirements-convert_legacy_llama.txt
file and noted that the version was >=4.45.1,<5.0.0 and came to the
conculusion that no updated would be needed, this assumed that
Embedding Gemma would be in a transformers release at the time
Commit fb15d649ed ("llama : add support
for EmbeddingGemma 300m (#15798)) was merged. So anyone wanting to
convert themselves would be able to do so. However, Embedding Gemma is
a preview release and this commit updates the requirements to use this
preview release.
* resolve additional python dependencies
* fix pyright errors in tokenizer test and remove unused import
* feat: Extra debugging support for model conversion - added BF16 support for llama-callback-eval and support for dumping intermediate steps in run-org-model.py
* vulkan: sort graph to allow more parallel execution
Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.
With #15489, this reduces the number of synchronizations needed.
* call optimize_graph per-split
* Add DeepSeek V3.1 thinking mode support
- Added COMMON_CHAT_FORMAT_DEEPSEEK_V3_1 enum value
- Created common_chat_params_init_deepseek_v3_1() function (currently uses R1 implementation)
- Created common_chat_parse_deepseek_v3_1() function that handles V3.1 thinking format:
- Extracts reasoning content before '</think>' tag into reasoning_content
- Extracts regular content after '</think>' tag into content
- No opening '<think>' tag in V3.1 format
- Added detection logic for V3.1 templates based on pattern: 'message['prefix'] is defined and message['prefix'] and thinking'
- Added V3.1 case to parsing switch statement
This addresses the issue where V3.1 outputs reasoning content followed by '</think>' and then regular content without the opening '<think>' tag.
* Another attempt by V3.1 non-thinking
* Fix test, but it's not asserting anything.
* Ignore vim swap files in tests dir
* Update the test
* Try using try_find_literal instead of regex
* passing test
* Revert "Try using try_find_literal instead of regex"
This reverts commit c50d887ec2.
* Remove unnecessary change
* Remove comment
* Add code to handle non-thinking mode.
* Try to set message['prefix'] when thinking is enabled.
* This fixes reasoning, but breaks normal content. We need state in the
chat parser.
* DeepSeek V3.1 thinking is now the default. Disable with `--reasoning-budget 0`.
* Simplify (DeepSeek V3.1 reasoning)
* Fix sign inversion bug
* Add some tool calling code (not working).
* Tool calls working in non-reasoning mode.
* Attempt a unit test for tool call parsing.
* Passing test
* Add tests for both happy path and broken fenced DeepSeek V3.1 tool call variants.
* Passing DeepSeek V3.1 tool call tests, but model is not working.
* Revert assistance response prefill change. Not my monkeys.
* Add fenced_thinking unit test variant. Passes, but thinking tool calling
still isn't working for some reason.
* Tests pass in reasoning mode. Also e2e tool test passes.
* Make a copy of the parse_json_tool_calls function for deepseek-v3.1 so
as to not accidentally introduce regressions.
* Fix thinking_forced_open logic. tool calling broken. Need to add another
test case.
* That's what I get for cargo culting a newline.
* Add multi tool call test for deepseek v3.1 non-reasoning
* Move test, remove .gitignore change
* Place deepseek-v3.1 reasoning test directly into existing reasoning
function per CISC's request.
* Address whitespace CI failure.
* Merge two assert_equals per CISC's request.
* Add DeepSeek-V3.1 tests to tests/test-chat.cpp per CISC's request.
* Merge deepseek V3.1 and regular parse_json_tool_calls() function
behaviors by adding optional update_cursor argument.
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* DeepSeek V3.1 fix reasoning_format none
* Strip grammar down to strictly what we expect based on model card. Throw
out parts we cargo culted from R1 that don't make sense.
* Update tests/test-chat-parser.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* DeepSeek V3.1 - Add edge case where thinking is forced open, there is
tool calling in the reasoning content, but then the model just stops the
output without closing the </think> tag, so it's not a partial. In this
case, use the tool call in the reasoning content.
* DeepSeek V3.1 - simplify update_cursor
* Update common/chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update common/chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update common/chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix indent
---------
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cuda : fix supports_op condition for get_rows when src1->ne2 > 1
ggml-ci
* ggml : add comment about ggml_get_rows
ggml-ci
* cuda : add FIXME [no ci]
* cuda : update support condition
ggml-ci
* ggml: allow casting between f32 and i32
* fix cuda
* add vulkan
* fix CPU non-cont
* add non-cont test case
* add note
* extend test number range
* correct note
* add cont version for vulkan
* convert : force setting sliding_window from original config
This commit modifies the set_gguf_parameters method for EmbeddingGemma
so that it reads the sliding_window parameter from the original model
config.json and uses that value.
The motivation for this change is that the Gemma3TextConfig
constructor adjusts the sliding_window value, which can lead to
inconsistencies when converting models as we expects this value to
match the original model's configuration.
Refs: bb45d3631e/src/transformers/models/gemma3/configuration_gemma3.py (L230)
* fix flake8 error
* add link to huggingface PR
I think glslang will translate an access like x[i][1].z to
OpAccessChain ... x, i, 1, 2
OpLoad float16_t ...
rather than loading all of x[i] in a single OpLoad. Change the
code to explicitly load the vector/matrix.
* ggml WebGPU: remove userdata from request adapter callback
This commit removes the `userdata` parameter from the WebGPU request
adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function
captures the `webgpu_context` directly.
The motivation for this change is to simplify the code and improve
readability.
* inline the callback lambda into the RequestAdapter call
This commit removes the callback lambda variable and inlines it directly
into the RequestAdapter call.
* server : implement `return_progress`
* add timings.cache_n
* add progress.time_ms
* add test
* fix test for chat/completions
* readme: add docs on timings
* use ggml_time_us
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat: Add python-side constants and conversion for adapter.lora.invocation_string
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add c++ side constants for adapter.lora.invocation_string
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse invocation string for adapters from GGUF
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(python): Update conversion to alora_invocation_tokens
This is the preferred method in PEFT which is the source of ground truth
https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(cpp): Update to alora_invocation_tokens on c++ side
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add C APIs to get alora invocation token array from lora
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Initial implementation of alora cache logic in server
This does not yet do the part to identify the invocation tokens and only
apply the lora adapter afterwards, but it does seem to produce correct
results if the invocation tokens are the beginning of the uncached input.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Identify alora invocation sequences
This currently limits to a single enabled alora per slot. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling and is not really a well
studied case for alora since it's unclear if one alora can reuse cache from
previous prefill computed with a different alora.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Only reuse cache for tokens before the alora invocation start
This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Handle un-cached tokens that come before the alora activation
The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use || instead of 'or'
Too much python 🤦
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix off-by-one for limiting cached tokens to before alora start
This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Support backwards-compatibility for "invocation_string" in adapter_config.json
While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string," so this will enable backwards compatibility and
enable testing now (before PEFT PR changes have percolated everywhere).
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove duplicate logging
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters
Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: Set enable_thinking IFF not disabled and supported
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Fix inverted logic condition for prefill error
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Always parse the enable_thinking kwarg to overwrite the default value
From what I can tell, this started as a Qwen3-specific keyword, but from
the use in `chat.cpp` translates this inputs.enable_thinking to the right
thinking kwarg for the given model, this is now more of a standardized
kwarg, so it should always override the default value when sent as part of
the chat_template_kwargs field in the API.
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Don't limit tempalte expansion check to jinja
With the use_jinja check, non-jinja models would enable thinking and always
fail assistant prefill
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add the error text to json type errors in json_value
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Explicitly reject string values for "enable_thinking"
There are too many possible "truthy" / "falsy" strings and too many
ambiguous strings that don't have a clear truthy/falsy value, so the
simplest thing to do here is to reject the request. Ideally, this would be
a 422 (Unprocessable Entity), but right now it's coming back as a 500.
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Move logic for detecting template enable_thinking support to common
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use raw pointer for common chat template function
Branch: gabe-l-hart/thinking-model-disabled-agent-prefill
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This commit adds two new command-line options to the
test-backend-ops.cpp that allow users to list all available GGML
operations and to show test coverage of these operations.
The motivation for this is that it can be useful to quickly see which
operations are currently covered by tests and which are not. Also it
migth be useful when using the `support` mode.
* gguf: split gguf writer into base and buf impl
* gguf: templated gguf write out
* gguf: file based writer (avoid writing everything to memory first!)
* examples(llama2c): fix log not being the same level and compiler nits
This commit updates the modelcard.template file used in the model
conversion scripts for embedding models to include the llama-server
--embeddings flag in the recommended command to run the model.
The motivation for this change was that when using the model-conversion
"tool" to upload the EmbeddingGemma models to Hugging Face this flag was
missing and the embedding endpoint was there for not available when
copy&pasting the command.
This commit add support for the EmbeddingGemma 300m. This model supports
sliding window attention (SWA) and a new swq_type is introduced to
support symmetric SWA masking.
This commit also extracts the code from the function
llama_is_masked_swa in llama-impl.h, so that the logic can be shared
by both llm_graph_input_attn_no_cache::set_input and
llama_kv_cache::set_input_kq_mask.
With this commit the EmbeddingGemma 300m model can be converted to
to GGUF and used with llama.cpp.
Once the model has been uploaded to HuggingFace it can be used like
this:
```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
* llama : set n_outputs to 1 to avoid 0 outputs mean-pooling
This commit modifies the llama_context constructor to set n_outputs to
1.
The motivation for this is that when using pooling, and specifically
mean pooling, for embeddings having n_outputs set to 0 can lead to the
following error:
```console
$ build/bin/llama-embedding -m models/nomic-embed-text-1.5-Q4_K_M.gguf \
--pooling mean -p "Hello, how are you?"
...
llama_context: CPU output buffer size = 0.12 MiB
/home/danbev/work/ai/llama.cpp/ggml/src/ggml.c:3023: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
0x0000743c96d107e3 in __GI___wait4 (pid=292978, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
30 in ../sysdeps/unix/sysv/linux/wait4.c
196 waitpid(child_pid, NULL, 0);
230 ggml_print_backtrace();
3023 GGML_ASSERT(ggml_can_mul_mat(a, b));
1823 cur = ggml_mul_mat(ctx0, ggml_cont(ctx0, ggml_transpose(ctx0, inp)), inp_mean);
18983 llm->build_pooling(cls, cls_b, cls_out, cls_out_b);
1399 auto * gf = model.build_graph(gparams);
292 auto * gf = graph_reserve(1, n_seqs, n_outputs, mctx.get(), true);
2329 auto * ctx = new llama_context(*model, params);
913 llama_context * lctx = llama_init_from_model(model, cparams);
105 common_init_result llama_init = common_init_from_params(params);
[Inferior 1 (process 292976) detached]
Aborted (core dumped)
```
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add comment about not reserving graphs with zero outputs
* add assert in graph_reserve to ensure n_outputs >= 1
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Fixes#15330
Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation.
Co-authored-by: Yuchuan <yuchuan-cao@users.noreply.github.com>
* vulkan : update ggml_vk_instance_validation_ext_available
This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.
Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation layer features extension.
* remove try/catch
This was a left over from a previous iteration where I was explicitly
quering for a specific validation layer first, which would throw.
* update warning message about validation layers
* Add fastdiv, use it in modulo and use modulo in rms_norm_f32
Fastdiv is much faster way to do integer division, which was identified
as bottleneck in rms_norm_f32
* Support more `block_size` values in `rms_norm_f32`
This makes us more flexible in selecting the optimal threads w.r.t
paralellizing across a col vs. launch-overheads of threads and mio
throttles
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Replace modulo with fastmodulo in `rms_norm_f32`
* Use `BinPackArguments=true` for formating function calls
Will file a separate PR to adjust .clang-format file
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Use uint3 for both `fastdiv` and `fastmodulo`
The compiler seems to reliably optimize away the unused .z component in
the fastdiv use-case, see https://godbolt.org/z/rx8KPrKr3
* More constrained type declarations
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Rename fastdiv and fastmodulo variables to shared variable name
As suggest by JohannesGaessler, this increases clarity of the intended
use
* Pack fastdiv/fastmodulo constants into uint2/uint3 objects
By packing constants to be used together into a struct, we are less
likely to make errors.
* Rename function parameter of fastmodulo
`modulo_consts` is more fitting/descriptive
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
This commit fixes the model type for the Gemma 270M model in
llama_model.cpp which should be LLM_TYPE_270M. I incorrectly added this
previously as LLM_TYPE_537M which was wrong.
The motivation for this is that it causes the model to not be identified
properly when using tools like llama-bench. For example:
```console
$ ./build/bin/llama-bench -m models/gemma-3-270m-Q8_0.gguf
| model | size | ...
| ------------------------------ | ---------: | ...
| gemma3 ?B Q8_0 | 271.81 MiB | ...
| gemma3 ?B Q8_0 | 271.81 MiB | ...
```
With the changes in this commit the output will be:
```console
$ ./build/bin/llama-bench -m models/gemma-3-270m-Q8_0.gguf
| model | size | ...
| ------------------------------ | ---------: | ...
| gemma3 270M Q8_0 | 271.81 MiB | ...
| gemma3 270M Q8_0 | 271.81 MiB | ...
```
* model-conversion : remove hardcoded /bin/bash shebangs [no ci]
This commit updates the bash scripts to use env instead of using
hardcoded /bin/bash in the shebang line.
The motivation for this is that some systems may have bash installed
in a different location, and using /usr/bin/env bash ensures that
the script will use the first bash interpreter found in the user's
PATH, making the scripts more portable across different environments.
* model-conversion : rename script to .py [no ci]
This commit renames run-casual-gen-embeddings-org.sh to
run-casual-gen-embeddings-org.py to reflect its Python nature.
This commit adds a curl script to the model-conversion examples
which is currently missing. This script is required for the running the
embedding server targets to test llama-server embeddings functionality.
Previously, the slope tensor was set to fp16 to improve efficiency.
While this worked correctly in FA, it caused precision issues in soft_max.
This change applies different data types for different operators
to balance both accuracy and performance.
* [CANN] Support eager execution mode under ACL graph compilation
Add support for running operators in eager mode while ACL graph
compilation is enabled. This allows bypassing graph execution
and directly submitting ops, which is useful for debugging and
reducing graph build overhead in certain scenarios.
Signed-off-by: noemotiovon <757486878@qq.com>
* fix typo
Signed-off-by: noemotiovon <757486878@qq.com>
* rename to acl_graph_mode
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
* vulkan: use memory budget extension to read memory usage
* fix: formatting and names
* formatting
* fix: detect and cache memory budget extension availability on init
* fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available
* style: lints
* SVE support for exponential functions
Add const notation to variable pg
* Update ggml/src/ggml-cpu/vec.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add const
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants
* vulkan: use subgroup operations for quantize_q8_1 shader
* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader
* vulkan: use q8_1_x4 blocks in mul_mmq shader
* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec
* vulkan: tune mul_mat_vecq performance for Intel
* vulkan: fix quantizing issue when tensor is not divisible by 128
* vulkan: adapt integer dot mmv to mmv small m optimization (#15355)
* vulkan: allow all subgroup modes for mmv and mmvq
* vulkan: use prealloc intermediate reuse for mmvq path
* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090
* vulkan: adapt mmv quantize_y path to conditional sync logic
* vulkan: disable q8_0 mmvq on Nvidia
* vulkan: enable q8_0 on Nvidia pre-turing
* fix prealloc sync condition
* fix llvmpipe subgroup 8 issue
* ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops
This commit adds support for the TRANSPOSE and RESHAPE operations in the
ggml webgpu backend.
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: fix build error from ambiguous __half conversions in conv2d
Building conv2d with half precision failed because `__half` defines
multiple implicit conversion operators (to float, int, short, etc.),
causing ambiguous overload resolution when multiplying with float.
Introduce a templated `to_float` helper that explicitly converts
`__half` via `__half2float`, while passing through float unchanged.
Use this helper in conv2d accumulation to ensure unambiguous and
correct promotion to float.
Fixes some build errors with half-precision kernels on CUDA.
ggml-ci
* CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half‑>float conversion
* CUDA: Add missing convert.cuh header
* CUDA: remove unnecessary extension in ggml_cuda_cast
* CUDA: Address review comment, remove second type template argument
* CANN: fix RoPE cache issue on multi-device
RoPE cache only needs to be computed once per token.
However, in multi-device scenarios, not every device starts
computation from layer 0, which may lead to unallocated memory
issues and precision errors.
This commit records the first layer of each device to avoid
the above issues.
* CANN: Optimize first-layer detection method
* CANN: Remove trailing whitespace
* CANN: Only cache the data that can be determined as unchanged through the parameters.
* CANN: Update function comment
* sampling : optimize sorting using bucket sort in more places
ggml-ci
* sampling : do not sort in dist sampler
ggml-ci
* sampling : avoid heap allocations for sort buffers
ggml-ci
* common : add option to sort sampling candidates by probability
ggml-ci
* sampling : revert the change for preserving sort buffers
* sampling : use std::copy instead of memcpy
* sampling : clarify purpose of partial sort helpers
ggml-ci
* cont : remove wrong comment [no ci]
* common : update comment
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* server : enable /slots by default and make it secure
ggml-ci
* server : fix tests to pass `--no-slots` when necessary
* server : extend /props with info about enabled endpoints
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.
* vulkan: mul_mat_id coopmat2 optimizations
Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.
Only call fetch_scales/store_scales once per QUANT_K block, and once at the
beginning in case start_k is not aligned.
* Also add a path for BN/4 - worth a couple more percent
This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function as it
is initialized to false but never modified, making it redundant.
* gguf-py: implement byteswapping for Q4_0
This is needed to byteswap Mistral model.
Also restore original shapes after byteswapping tensors.
It is not needed at the moment, but do it in case
they'd be used in future.
* Rework byteswapping code in gguf-py
Move out details from byteswapping tensor blocks code
* Change to warn instead of debug, to explain reason for stopping.
* Update tools/main/main.cpp
Fix printing --2
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit adds a new target to the Makefile for converting models that
are multimodal. This target will convert the original model and in
addition also create the mmproj GGUF model.
The motivation for this change is that for models that are multimodal,
for example those that contain a vision encoders, we will often want to
upload both the quantized model and the vision encoder model to
HuggingFace.
Example usage:
```console
$ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/
...
The environment variable CONVERTED_MODEL can be set to this path using:
export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf
The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf
```
The converted original model can then be quantized, and after that both
the quantized model and the mmproj file can then be uploaded to
HuggingFace.
Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main
Prior to this change, we faced undefined cublasLt references when
attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux.
We add linking with CUDA::cublasLt_static when CUDA version is greater
than 10.1.
* CANN(flash-attn): refactor mask handling and improve performance
1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.
Signed-off-by: noemotiovon <757486878@qq.com>
* [CANN]: fix review
Signed-off-by: noemotiovon <757486878@qq.com>
* [CANN]: Optimization FA BNSD to BSND
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
This commit updates the bash completion script to include the -m
short option for the --model argument.
The motivation for this is that currently tab completion only works the
full --model option, and it is nice to have it work for the short option
as well.
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused an GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.
This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
This patch improves GEMM for FP32 Data Type on PowerPC
Implements GEMM on large blocks with configurable block size mc, nc, kc
(default: 256, 256, 256).
Packing Function optimized to access blocks as per memory layout.
GEMM Optimized to work on larger blocks.
Isolated Packing from GEMM Operations for better MMA utilization.
Verified functionality and correctness uing llama-cli and stand alone
test case (performs matmul and compares final mattrix C result with base).
Minor code refactoring changes:
Replace macro with inline function
Code Indent made consistent with 4 spaces
Performance Testing:
Observed 50% ~ 70% improvement in Prompt Processing Speed mesured using
llama-bench with Meta-Llama3-8B FP32 Model. Similar gains observed with
Mistral-7b-Instruct-v0.3 Model.
model Size Params Backend Threads Test Patch Base
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp512 98.58 60.3
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp1024 95.88 57.36
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp2048 85.46 53.26
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp4096 68.66 45.78
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp6144 57.35 40.44
25 ~ 30% improvement in llama-batched-bench with Metla-Llama3-8B in
Prompt Processing Speed for large prompts (256, 512, 1024, 2048, 4096)tokens with various batch
sizes ( 1, 2, 4, 8, 16)
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
This commit adds two targets to the Makefile for quantizing of
Quantization Aware Trained (QAT) models to Q4_0 format.
The motivation for this is that this sets the token embedding and the
output tensors data types to Q8_0 instead of the default Q6_K. This is
someting that we wish to enforce for QAT Q4_0 models that are to be
uploaded to ggml-org on Huggingface to guarantee the best quality.
* metal : optmize FA vec for large heads and sequences
* metal : adjust small-batch mul mv kernels
ggml-ci
* batched-bench : fix total speed computation
ggml-ci
* cont : add comments
ggml-ci
* convert : fix tensor naming conflict for llama 4 vision
* convert ok
* support kimi vision model
* clean up
* fix style
* fix calc number of output tokens
* refactor resize_position_embeddings
* add test case
* rename build fn
* correct a small bug
* metal : mul_mm_id remove hdst
* metal : remove mul_mm_id hsrc1
* metal : mul_mm_id simplify + add test
* metal : opt mul_mm_id map0
* metal : optimize mul_mm_id id gathering
* metal : mul/div opt
* metal : optimize mul_mm_id_map0
ggml-ci
* CUDA: optimize get_int_from_table_16
* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs
* revise documentation
---------
Co-authored-by: xix <xiapc@outlook.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
This commit explicitly sets the pooling type to 'none' in the logits.cpp
to support models that have a pooling type specified.
The motivation for this is that some models may have a pooling type set
in the model file (.gguf file) and for this specific case where we only
want to extract logits, we need to ensure that no pooling is used to
so that we are comparing raw logits and not pooled embeddings.
* model-conversion: add model card template for embeddings [no ci]
This commit adds a separate model card template (model repository
README.md template) for embedding models.
The motivation for this is that there server command for the embedding
model is a little different and some addition information can be useful
in the model card for embedding models which might not be directly
relevant for causal models.
* squash! model-conversion: add model card template for embeddings [no ci]
Fix pyright lint error.
* remove --pooling override and clarify embd_normalize usage
* vulkan: use subgroup function for mul_mat_id shader even without coopmat
* vulkan: fix compile warnings
* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id
* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.
Store the FA pipelines in a map, indexed by the pipeline state.
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs
There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.
The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.
* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.
* complete rebase against fused adds - multi_add shader can also compute partial sums
* fix validation errors
* disable add_rms_fusion for Intel due to possible driver bug
* resolve against #15489, sync after clearing partial sums
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.
The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether they were in use and need to be synced. This should be
checked before they are written to, and set to true after they are done being
consumed.
* vulkan : support ggml_mean
* vulkan : support sum, sum_rows and mean with non-contiguous tensors
* vulkan : fix subbuffer size not accounting for misalign offset
* tests : add backend-op tests for non-contiguous sum_rows
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
- Spread the work across the whole workgroup. Using more threads seems to
far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Work on templating for different types in shaders
* Work on shader type generation
* Working q4_0 mul_mat and some templating for different types
* Add q4_0_f16 matmul and fix device init
* Add matmul support for basic quantization types
* Add q2_k and q3_k quantization
* Add rest of k-quants
* Get firt i-quant working
* Closer to supporting all i-quants
* Support rest of i-quants
* Cleanup code
* Fix python formatting
* debug
* Bugfix for memset
* Add padding to end of buffers on creation
* Simplify bit-shifting
* Update usage of StringView
* Add Pad Reflect 1D CUDA support
* Update ggml/src/ggml-cuda/pad_reflect_1d.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types
- Add capability to model endpoint to indicate if client can send multimodal data
- Add tests.
* vulkan: Reuse conversion results in prealloc_y
Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.
* don't use shared pointer for prealloc_y_last_pipeline_used
* Changed the CI file to hw
* Changed the CI file to hw
* Added to sudoers for apt
* Removed the clone command and used checkout
* Added libcurl
* Added gcc-14
* Checking gcc --version
* added gcc-14 symlink
* added CC and C++ variables
* Added the gguf weight
* Changed the weights path
* Added system specification
* Removed white spaces
* ci: Replace Jenkins riscv native build Cloud-V pipeline with GitHub Actions workflow
Removed the legacy .devops/cloud-v-pipeline Jenkins CI configuration and introduced .github/workflows/build-riscv-native.yml for native RISC-V builds using GitHub Actions.
* removed trailing whitespaces
* Added the trigger at PR creation
* Corrected OS name
* Added ccache as setup package
* Added ccache for self-hosted runner
* Added directory for ccache size storage
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Changed the build command and added ccache debug log
* Added the base dir for the ccache
* Re-trigger CI
* Cleanup and refactored ccache steps
* Cleanup and refactored ccache steps
---------
Co-authored-by: Akif Ejaz <akifejaz40@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* examples : add model conversion tool/example
This commit adds an "example/tool" that is intended to help in the
process of converting models to GGUF. Currently it supports normal
causal models and embedding models. The readme contains instructions and
command to guide through the process.
The motivation for this to have a structured and repeatable process for
model conversions and hopefully with time improve upon it to make the
process easier and more reliable. We have started to use this for new
model conversions internally and will continue doing so and improve it
as we go along. Perhaps with time this should be placed in a different
directory than the examples directory, but for now it seems like a good
place to keep it while we are still developing it.
* squash! examples : add model conversion tool/example
Remove dependency on scikit-learn in model conversion example.
* squash! examples : add model conversion tool/example
Update transformer dep to use non-dev version. And also import
`AutoModelForCausalLM` instead of `AutoModel` to ensure compatibility
with the latest version.
* squash! examples : add model conversion tool/example
Remove the logits requirements file from the all requirements file.
* Fix -Werror=return-type so ci/run.sh can run
* Update tools/mtmd/clip.cpp
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Remove false now that we have abort
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Initial plan
* Initialize copilot instructions exploration
* Add comprehensive .github/copilot-instructions.md file
* Update Python environment and tools directory documentation
- Add instructions for using .venv Python environment
- Include flake8 and pyright linting tools from virtual environment
- Add tools/ as core directory in project layout
- Reference existing configuration files (.flake8, pyrightconfig.json)
* add more python dependencies to .venv
* Update copilot instructions: add backend hardware note and server testing
* Apply suggestions from code review
* Apply suggestions from code review
* Replace clang-format with git clang-format to format only changed code
* Minor formatting improvements: remove extra blank line and add trailing newline
* try installing git-clang-format
* try just clang-format
* Remove --binary flag from git clang-format and add git-clang-format installation to CI
* download 18.x release
* typo--
* remove --binary flag
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Make Mistral community chat templates optional
* Change the flag arg to disable instead of enable community chat templates
* Improve error message
* Improve help message
* Tone down the logger messages
This commit removes references to `make` in the examples, as the build
system has been updated to use CMake directly and using `make` will now
generate an error since Commit 37f10f955f
("make : remove make in favor of CMake (#15449)").
This commit addresses an inconsistency during inference by adding a new
member to the `templates_params` struct to indicate whether the chat is
in inference mode. This allows the gpt-oss specific function
`common_chat_params_init_gpt_oss` to check this flag and the
`add_generation_prompt` flag to determine if it should replace the
`<|return|>` token with the `<|end|>` token in the prompt.
The motivation for this change is to ensure that the formatted prompt of
past messages in `common_chat_format_single` matches the output of the
formatted new message. The issue is that the gpt-oss template returns
different end tags: `<|return|>` when `add_generation_prompt` is false,
and `<|end|>` when `add_generation_prompt` is true. This causes the
substring function to start at an incorrect position, resulting in
tokenization starting with 'tart|>' instead of '<|start|>'.
Resolves: https://github.com/ggml-org/llama.cpp/issues/15417
* Update docker.yml
修改docker.yml文件中的内容使其停止周期性的运行该workflow,如果想要运行该workflow可以手动启动
* feat:Modify the header file include path
1. There's no llava directory in the tools directory.
2. Because the command `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically include the `mtmd` directory as a search path for header files. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava`` or use `target_include_directories(${TARGET} PRIVATE ../mtmd`` to explicitly require the `llama-server` target to use header files from `mtmd`.
* Restore the docker.yml file
This commit removes the content from the Makefile and updates the
current deprecation message to information that `make` has been
replaced by CMake instead.
The message when `make` is invoked will now be the following:
```console
$ make
Makefile:6: *** Build system changed:
The Makefile build has been replaced by CMake.
For build instructions see:
https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
. Stop.
```
The motivation for this is that many, if not all targets fail to build
now, after changes to the system, and `make` has also been deprected for
some time now.
* musa: fix build warnings
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Revert "devops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)"
This reverts commit e4e915912c.
* devops: Allow pip to modify externally-managed python environment (system installation)
- Updated pip install commands to include the --break-system-packages
flag, ensuring compatibility when working with system-managed Python
environments (PEP 668).
- Note: The --break-system-packages option was introduced in 2023.
Ensure pip is updated to a recent version before using this flag.
fixes [#15004](https://github.com/danchev/llama.cpp/issues/15004)
Add tracking for high watermark cache usage and make it available in /metrics endpoint.
Use-case: Tracking largest needed cache usage under realistic workload
to better understand memory requirements and be able to adjust
cache size/quantization for model/cache accordingly.
* vulkan: Use larger workgroups for mul_mat_vec when M is small
Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.
* update heuristic for amd/intel
Co-authored-by: 0cc4m <picard12@live.de>
---------
Co-authored-by: 0cc4m <picard12@live.de>
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
* wip lfm2 vision model
* Fix conv weight
* Implement dynamic resolution
* Fix cuda
* support LFM2-VL-450M
* happy CI
* Remove extra `ggml_conv` and put others into the right place
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* vulkan: fuse adds
Fuse adds that have the same shape, which are common in MoE models.
It will currently fuse up to 6 adds, because we assume no more than
8 descriptors per dispatch. But this could be changed.
* check runtimeDescriptorArray feature
* disable multi_add for Intel due to likely driver bug
* vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id
* vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up
- There's no explicit way to request f32 precision for mul_mat_id, but there
probably should be, and this gets the code in place for that.
- A couple fixes to check_results.
- Remove casts to fp16 in coopmat1 FA shader (found by inspection).
* add F16/F16 fa support
* fix kernel init
* use mad instead of fma
* use inline function
* mark FA with sinks as unsupported for now
* add pragma unroll to loops
This commit updates common_chat_templates_apply_jinja to use the
the add_bos and add_eos parameters from the chat template instead of
the inputs.
The motivation for this is that currently if the `add_bos` and `add_eos`
from the input parameters are used it is possible to there will be a
missmatch between the model and the chat template which can lead to the
the removal of duplicate BOS/EOS tokens in chat.cpp `apply` to not
happen leading to two BOS tokens being added to the template.
This commit adds support for the 18-layer model type in the Gemma3
series, which is the size of the Gemma3-270m model.
The motivation for this commit is was the only change required for
Gemma3-270m to be converted to GGUF format and used with llama.cpp.
Once the model has been converted and uploaded to Huggingface it can be
used like this:
```console
$ ./build/bin/llama-cli -hf ggml-org/gemma-3-270m-GGUF:Q8_0
```
add expicit conversion operator to support older versions of rocm
Switch over to hip_bf16 from legacy hip_bfloat16
Simplify RDNA3 define
Reduce swap over of new hipblas api to rocm 6.5 as this version is used for rocm 7.0 previews
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* model : add harmony parser for gpt-oss
* gpt-oss : fix grammar trigger from causing empty stack
* gpt-oss: tweak the grammar trigger again
* gpt-oss : add support for recipient in role header
* gpt-oss : fix ungrouped tool calls in grammar
* gpt-oss : loosen function name matching during parse
* gpt-oss : clean up workarounds
* gpt-oss : add template tests
* gpt-oss : simulate thinking and tool call tags
* gpt-oss : undo think tags when reasoning_format is none
* gpt-oss : set special tokens back to user defined
* gpt-oss : update openai-gpt-oss template
* server : filter out harmony thought messages
* gpt-oss : simplify parsing
* vulkan: perf_logger improvements
- Account for batch dimension in flops calculation.
- Fix how "_VEC" is detected for mat_mul_id.
- Fix "n" dimension for mat_mul_id (in case of broadcasting).
- Include a->type in name.
* use <=mul_mat_vec_max_cols rather than ==1
* server : add SWA checkpoints
ggml-ci
* cont : server clean-up
* server : handle state restore fails
* llama : add extended llama_state_seq_ API
* server : do not make checkpoints if --swa-full
ggml-ci
* llama : remove flags value for NONE
* server : configure number of SWA checkpoints with CLI arg
ggml-ci
* args : fix scope of new argument
When attempting to do llama-perplexity on certain tasks which have coupled sequences there is a cryptic error that does not tell you what to do, which is to set the -kvu flag. This adds a hint about that fact.
* examples/finetune -opt SGD (stochastic gradient descent) memory opt
add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.
support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)
llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch)
when using SGD instead of 19gb (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)
(
using the same GPU memory, adamw can only do before OOM 512
batch/context, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD is superior, though it converges slower, with max before OOM 1728
batch/context (esp see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or w/ enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting')
-lr-half (halflife) option useful for SGD to avoid oscillation or
super slow underdamped learning (makes setting -lr more forgiving).
terminal -lr for now is set by lr-halvings i.e. if you want at most
1/8 the inital -lr you set -lr-halvings 3.
note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence
new finetune args -wd 1e-9 to enable weight decay in sgd or adamw,
and max -epochs N (default 2 as before)
cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)
since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)
* Vulkan: Implement GGML_OP_OPT_STEP_SGD
* tests: Fix OPT_STEP_SGD test-backend-ops
* SGD op param store weight-decay and not 1-alpha*wd
* minor + cosmetic changes
* fix vulkan sgd
* try CI fix
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* perplexity: give more information about constraints on failure
This checks whether -np is insufficient vs context, and provides clues as to how much is needed for each.
* log formatting
* log error and return instead of storing max_seq_exceeded int
* check if s0 is zero for -np check
The flake.nix included references to llama-cpp.cachix.org cache with a comment
claiming it's 'Populated by the CI in ggml-org/llama.cpp', but:
1. No visible CI workflow populates this cache
2. The cache is empty for recent builds (tested b6150, etc.)
3. This misleads users into expecting pre-built binaries that don't exist
This change removes the non-functional cache references entirely, leaving only
the working cuda-maintainers cache that actually provides CUDA dependencies.
Users can still manually add the llama-cpp cache if it becomes functional in the future.
* Checkpoint from VS Code for coding agent session
* Initial plan
* Fix typo in --override-tensor-draft flag implementation
* Add null termination for speculative tensor buffer overrides
* Apply suggestions from code review
* Apply suggestions from code review
* Extract tensor override parsing logic to common function (addresses @slaren's feedback)
* Apply suggestions from code review
* Apply suggestions
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Changed the CI file to hw
* Changed the CI file to hw
* Added to sudoers for apt
* Removed the clone command and used checkout
* Added libcurl
* Added gcc-14
* Checking gcc --version
* added gcc-14 symlink
* added CC and C++ variables
* Added the gguf weight
* Changed the weights path
* Added system specification
* Removed white spaces
* ci: Replace Jenkins riscv native build Cloud-V pipeline with GitHub Actions workflow
Removed the legacy .devops/cloud-v-pipeline Jenkins CI configuration and introduced .github/workflows/build-riscv-native.yml for native RISC-V builds using GitHub Actions.
* removed trailing whitespaces
---------
Co-authored-by: Akif Ejaz <akifejaz40@gmail.com>
* Factor out `reduce_rows_f32` from common.cuh
This increases iteration cycle speed by not having to recompile
every kernel all the time
* Hide memory-latency by loop unrolling in reduce_rows_f32
* Further optimizations to `reduce_rows_f32`
1. Increase threadblock size to better hide latency of memory requests.
As a consequence of bigger threadblocks, do 2-step summation, using
shared memory to communicate results between invocations
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflext bigger threadblock
4. Improve default block_dims, increase support for more block_dims
* Add perf tests for `reduce_rows_f32` kernel
* Add heuristic to toggle 128/512 threads based on sm count
Break even point was the minimum of the following multiples.
| GPU Model | Nrow SM Count Multiple |
| ----------- | ----------- |
| RTX 4000 SFF ADA | 2.0x |
| RTX 6000 ADA | 2.5x |
| RTX PRO 6000 Blackwell Max-Q | 3.04x |
| RTX PRO 4500 Blackwell | 3.15x |
* Ensure perf gains also for small ncols and large nrows
Alternative to this, one could have also made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily
* Modify perf and unit-tests
* Apply auto-formatting by clang
* Fix CI build failure
See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.
* Remove sm_count property from `ggml_backend_cuda_context`
Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect
* Add CUB-based implementation for GGML_OP_MEAN
Currently this branch is only executed for nrows==1
* Add heuristics to execute CUB branch only when it brings perf
Heuristics were determined on the following HW:
* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell
* Add unit-test for CUB-based mean
Tests should run with CUDA Graphs enabled per default on NVGPUs
* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`
Suggested by @JohannesGaessler
* Unindent Preprocessor directives
See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506
* ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others). Fixes#15055
* ggml-rpc: rename RPC_IO_CHUNK->MAX_CHUNK_SIZE, use std::min() for cap, switch to GGML_LOG_ERROR, handle 0-length send/recv
* rpc: drop n==0 special case in send_data(); retry in loop per review
* rpc: remove trailing whitespace in send_data()
---------
Co-authored-by: Shinnosuke Takagi <nosuke@nosukenoMacBook-Pro.local>
* Fix MinicpmV model converter and clip to avoid using hardcode.
* Code update for pr/14750
* Remove unused field, update script path in docs.
* Add version 5 for fallback code.
---------
Co-authored-by: lzhang <zhanglei@modelbest.cn>
This commit updates `llama_kv_cache_unified::find_slot` to log
information for all streams when debug is enabled.
The motivation for this change is that currently if a non-unified
kv-cache is used, then only one stream will be logged because the
code was currently uses `seq_to_stream[1]`.
This commit updates comments and error messages to use "decode" instead
of "eval" in perplexity.cpp.
The motivation for this is that `llama_eval` was renamed to
`llama_decode` a while ago, but the comments and error messages
still referred to "eval". This change ensures consistency and clarity.
CI / macOS-latest-cmake-arm64 (push) Has been cancelled
CI / macOS-latest-cmake-x64 (push) Has been cancelled
CI / macOS-latest-cmake-arm64-webgpu (push) Has been cancelled
CI / ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm) (push) Has been cancelled
CI / ubuntu-cpu-cmake (x64, ubuntu-22.04) (push) Has been cancelled
CI / ubuntu-latest-cmake-sanitizer (Debug, ADDRESS) (push) Has been cancelled
CI / ubuntu-latest-cmake-sanitizer (Debug, THREAD) (push) Has been cancelled
CI / ubuntu-latest-cmake-sanitizer (Debug, UNDEFINED) (push) Has been cancelled
CI / ubuntu-latest-llguidance (push) Has been cancelled
CI / ubuntu-latest-cmake-rpc (push) Has been cancelled
CI / ubuntu-22-cmake-vulkan (push) Has been cancelled
CI / ubuntu-22-cmake-webgpu (push) Has been cancelled
CI / ubuntu-22-cmake-hip (push) Has been cancelled
CI / ubuntu-22-cmake-musa (push) Has been cancelled
CI / ubuntu-22-cmake-sycl (push) Has been cancelled
CI / ubuntu-22-cmake-sycl-fp16 (push) Has been cancelled
CI / build-linux-cross (push) Has been cancelled
CI / build-cmake-pkg (push) Has been cancelled
CI / macOS-latest-cmake-ios (push) Has been cancelled
CI / macOS-latest-cmake-tvos (push) Has been cancelled
CI / macOS-latest-cmake-visionos (push) Has been cancelled
CI / macOS-latest-swift (generic/platform=iOS) (push) Has been cancelled
CI / macOS-latest-swift (generic/platform=macOS) (push) Has been cancelled
CI / macOS-latest-swift (generic/platform=tvOS) (push) Has been cancelled
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Has been cancelled
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Has been cancelled
CI / windows-latest-cmake (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Has been cancelled
CI / windows-latest-cmake (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Has been cancelled
CI / windows-latest-cmake (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Has been cancelled
CI / windows-latest-cmake (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=… (push) Has been cancelled
CI / windows-latest-cmake (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Has been cancelled
CI / ubuntu-latest-cmake-cuda (push) Has been cancelled
CI / windows-2022-cmake-cuda (12.4) (push) Has been cancelled
CI / windows-latest-cmake-sycl (push) Has been cancelled
CI / windows-latest-cmake-hip (push) Has been cancelled
CI / ios-xcode-build (push) Has been cancelled
CI / android-build (push) Has been cancelled
CI / openEuler-latest-cmake-cann (aarch64, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Has been cancelled
CI / openEuler-latest-cmake-cann (x86, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Has been cancelled
Close inactive issues / close-issues (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/cpu.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 server:true tag:cpu]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/cuda.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 server:true tag:cuda]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/intel.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 server:true tag:intel]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/musa.Dockerfile free_disk_space:true full:true light:true platforms:linux/amd64 server:true tag:musa]) (push) Has been cancelled
Publish Docker image / Push Docker image to Docker Hub (map[dockerfile:.devops/vulkan.Dockerfile free_disk_space:false full:true light:true platforms:linux/amd64 server:true tag:vulkan]) (push) Has been cancelled
Update Winget Package / Update Winget Package (push) Has been cancelled
* cuda: refactored ssm_scan to use CUB
* fixed compilation error when when not using CUB
* assign L to constant and use size_t instead of int
* deduplicated functions
* change min blocks per mp to 1
* Use cub load and store warp transpose
* suppress clang warning
Python check requirements.txt / check-requirements (push) Waiting to run
Python Type-Check / pyright type-check (push) Waiting to run
This commit addresses an issue with the convert_hf_to_gguf script
which is currently failing with:
```console
AttributeError: module 'torch' has no attribute 'uint64'
```
This occurred because safetensors expects torch.uint64 to be available
in the public API, but PyTorch 2.2.x only provides limited support for
unsigned types beyond uint8 it seems. The torch.uint64 dtype exists but
is not exposed in the standard torch namespace
(see pytorch/pytorch#58734).
PyTorch 2.4.0 properly exposes torch.uint64 in the public API, resolving
the compatibility issue with safetensors. This also required torchvision
to updated to =0.19.0 for compatibility.
Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/186#68938de803e47d990aa087fb
Refs: https://github.com/pytorch/pytorch/issues/58734
CI / macOS-latest-cmake-arm64 (push) Waiting to run
CI / macOS-latest-cmake-x64 (push) Waiting to run
CI / macOS-latest-cmake-arm64-webgpu (push) Waiting to run
CI / ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm) (push) Waiting to run
CI / ubuntu-cpu-cmake (x64, ubuntu-22.04) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, ADDRESS) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, THREAD) (push) Waiting to run
CI / ubuntu-latest-cmake-sanitizer (Debug, UNDEFINED) (push) Waiting to run
CI / ubuntu-latest-llguidance (push) Waiting to run
CI / ubuntu-latest-cmake-rpc (push) Waiting to run
CI / ubuntu-22-cmake-vulkan (push) Waiting to run
CI / ubuntu-22-cmake-webgpu (push) Waiting to run
CI / ubuntu-22-cmake-hip (push) Waiting to run
CI / ubuntu-22-cmake-musa (push) Waiting to run
CI / ubuntu-22-cmake-sycl (push) Waiting to run
CI / ubuntu-22-cmake-sycl-fp16 (push) Waiting to run
CI / build-linux-cross (push) Waiting to run
CI / build-cmake-pkg (push) Waiting to run
CI / macOS-latest-cmake-ios (push) Waiting to run
CI / macOS-latest-cmake-tvos (push) Waiting to run
CI / macOS-latest-cmake-visionos (push) Waiting to run
CI / macOS-latest-swift (generic/platform=iOS) (push) Waiting to run
CI / macOS-latest-swift (generic/platform=macOS) (push) Waiting to run
CI / macOS-latest-swift (generic/platform=tvOS) (push) Waiting to run
CI / windows-msys2 (Release, clang-x86_64, CLANG64) (push) Waiting to run
CI / windows-msys2 (Release, ucrt-x86_64, UCRT64) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON) (push) Waiting to run
CI / windows-latest-cmake (arm64, llvm-arm64-opencl-adreno, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/opencl-arm64-release" -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON) (push) Waiting to run
CI / windows-latest-cmake (x64, cpu-x64 (static), -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DBUILD_SHARED_LIBS=OFF) (push) Waiting to run
CI / windows-latest-cmake (x64, openblas-x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x64-windows-llvm.cmake -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_OPENMP=OFF -DGGML_BLAS=… (push) Waiting to run
CI / windows-latest-cmake (x64, vulkan-x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN=ON) (push) Waiting to run
CI / ubuntu-latest-cmake-cuda (push) Waiting to run
CI / windows-2022-cmake-cuda (12.4) (push) Waiting to run
CI / windows-latest-cmake-sycl (push) Waiting to run
CI / windows-latest-cmake-hip (push) Waiting to run
CI / ios-xcode-build (push) Waiting to run
CI / android-build (push) Waiting to run
CI / openEuler-latest-cmake-cann (aarch64, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
CI / openEuler-latest-cmake-cann (x86, Release, 8.1.RC1.alpha001-910b-openeuler22.03-py3.10, ascend910b3) (push) Waiting to run
* feat(cann): add optional support for ACL Graph execution
This commit adds support for executing ggml computational graphs using
Huawei's ACL graph mode via the USE_CANN_GRAPH flag. The support can be
enabled at compile time using the CMake option:
-DUSE_CANN_GRAPH=ON
By default, ACL graph execution is **disabled**, and the fallback path
uses node-by-node execution.
Key additions:
- CMake option to toggle graph mode
- Graph capture and execution logic using
- Tensor property matching to determine whether graph update is required
- Safe fallback and logging if the environment variable LLAMA_SET_ROWS
is unset or invalid
This prepares the backend for performance improvements in repetitive graph
execution scenarios on Ascend devices.
Signed-off-by: noemotiovon <757486878@qq.com>
* Fix review comments
Signed-off-by: noemotiovon <757486878@qq.com>
* remane USE_CANN_GRAPH to USE_ACL_GRAPH
Signed-off-by: noemotiovon <757486878@qq.com>
* fix typo
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
* Add paramater buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* Disable set_rows until it's implemented
* Fix potential issue around empty queue submission
* Try synchronous submission
* Try waiting on all futures explicitly
* Add debug
* Add more debug messages
* Work on getting ssh access for debugging
* Debug on failure
* Disable other tests
* Remove extra if
* Try more locking
* maybe passes?
* test
* Some cleanups
* Restore build file
* Remove extra testing branch ci
* cmake: Add GGML_BACKEND_DIR option
This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.
* Fix phrasing
llama.cpp is a large-scale C/C++ project for efficient LLM (Large Language Model) inference with minimal setup and dependencies. The project enables running language models on diverse hardware with state-of-the-art performance.
**Key Facts:**
- **Primary language**: C/C++ with Python utility scripts
- **Size**: ~200k+ lines of code across 1000+ files
- **Architecture**: Modular design with main library (`libllama`) and 40+ executable tools/examples
- **Core dependency**: ggml tensor library (vendored in `ggml/` directory)
- **Backends supported**: CPU (AVX/NEON/RVV optimized), CUDA, Metal, Vulkan, SYCL, ROCm, MUSA
**Build time**: ~10 minutes on 4-core system with ccache enabled, ~25 minutes without ccache.
**Important Notes:**
- The Makefile is deprecated - always use CMake
- ccache is automatically detected and used if available
- Built binaries are placed in `build/bin/`
- Parallel builds (`-j`) significantly reduce build time
### Backend-Specific Builds
For CUDA support:
```bash
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
```
For Metal (macOS):
```bash
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j $(nproc)
```
**Important Note**: While all backends can be built as long as the correct requirements for that backend are installed, you will not be able to run them without the correct hardware. The only backend that can be run for testing and validation is the CPU backend.
### Debug Builds
Single-config generators:
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
```
Multi-config generators:
```bash
cmake -B build -G "Xcode"
cmake --build build --config Debug
```
### Common Build Issues
- **Issue**: Network tests fail in isolated environments
**Solution**: Expected behavior - core functionality tests will still pass
**Expected failures**: 2-3 tests may fail if network access is unavailable (they download models)
**Test time**: ~30 seconds for passing tests
### Server Unit Tests
Run server-specific unit tests after building the server:
```bash
# Build the server first
cmake --build build --target llama-server
# Navigate to server tests and run
cd tools/server/tests
source ../../../.venv/bin/activate
./tests.sh
```
**Server test dependencies**: The `.venv` environment includes the required dependencies for server unit tests (pytest, aiohttp, etc.). Tests can be run individually or with various options as documented in `tools/server/tests/README.md`.
### Test Categories
- Tokenizer tests: Various model tokenizers (BERT, GPT-2, LLaMA, etc.)
- Grammar tests: GBNF parsing and validation
- Backend tests: Core ggml operations across different backends
- Use descriptive commit messages following project conventions
### Trust These Instructions
Only search for additional information if these instructions are incomplete or found to be incorrect. This document contains validated build and test procedures that work reliably across different environments.
# Automatically run the setup steps when they are changed to allow for easy validation, and
# allow manual testing through the repository's "Actions" tab
on:
workflow_dispatch:
push:
paths:
- .github/workflows/copilot-setup-steps.yml
pull_request:
paths:
- .github/workflows/copilot-setup-steps.yml
jobs:
# The job MUST be called `copilot-setup-steps` or it will not be picked up by Copilot.
copilot-setup-steps:
runs-on:ubuntu-latest
# Set the permissions to the lowest permissions possible needed for your steps.
# Copilot will be given its own token for its operations.
permissions:
# If you want to clone the repository as part of your setup steps, for example to install dependencies, you'll need the `contents: read` permission. If you don't clone the repository in your setup steps, Copilot will do this for you automatically after the steps complete.
contents:read
# You can define any steps you want, and they will run before the agent starts.
# If you do not check out your code, Copilot will do this for you.
> **Release Format Update**: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.
<details open>
${{ github.event.head_commit.message }}
</details>
**macOS/iOS:**
- [macOS Apple Silicon (arm64)](https://github.com/ggml-org/llama.cpp/releases/download/${{ steps.tag.outputs.name }}/llama-${{ steps.tag.outputs.name }}-bin-macos-arm64.tar.gz)
The project differentiates between 3 levels of contributors:
- Contributors: people who have contributed before (no special privileges)
- Collaborators (Triage): people with significant contributions, who may be responsible for some parts of the code, and are expected to maintain and review contributions for the code they own
- Maintainers: responsible for reviewing and merging PRs, after approval from the code owners
- llama.cpp uses the ggml tensor library for model evaluation. If you are unfamiliar with ggml, consider taking a look at the [examples in the ggml repository](https://github.com/ggml-org/ggml/tree/master/examples/). [simple](https://github.com/ggml-org/ggml/tree/master/examples/simple) shows the bare minimum for using ggml. [gpt-2](https://github.com/ggml-org/ggml/tree/master/examples/gpt-2) has minimal implementations for language model inference using GPT-2. [mnist](https://github.com/ggml-org/ggml/tree/master/examples/mnist) demonstrates how to train and evaluate a simple image classifier
- Test your changes:
@@ -9,13 +17,18 @@
- Create separate PRs for each feature or fix. Avoid combining unrelated changes in a single PR
- Consider allowing write access to your branch for faster reviews, as reviewers can push commits directly
- If your PR becomes stale, don't hesitate to ping the maintainers in the comments
- Maintainers will rely on your insights and approval when making a final decision to approve and merge a PR
- Consider adding yourself to [CODEOWNERS](CODEOWNERS) to indicate your availability for reviewing related PRs
- Using AI to generate PRs is permitted. However, you must (1) explicitly disclose how AI was used and (2) conduct a thorough manual review before publishing the PR. Note that trivial tab autocompletions do not require disclosure.
# Pull requests (for collaborators)
# Pull requests (for maintainers)
- Squash-merge PRs
- Use the following format for the squashed commit title: `<module> : <commit title> (#<issue_number>)`. For example: `utils : fix typo in utils.py (#1234)`
- Optionally pick a `<module>` from here: https://github.com/ggml-org/llama.cpp/wiki/Modules
-Consider adding yourself to [CODEOWNERS](CODEOWNERS)
-Let other maintainers merge their own PRs
- When merging a PR, make sure you have a good understanding of the changes
- Be mindful of maintenance: most of the work going into a feature happens after the PR is merged. If the PR author is not committed to contribute long-term, someone else needs to take responsibility (you)
# Coding guidelines
@@ -114,6 +127,21 @@
#endif // FOO
```
# Code maintenance
- Existing code should have designated collaborators and/or maintainers specified in the [CODEOWNERS](CODEOWNERS) file reponsible for:
- Reviewing and merging related PRs
- Fixing related bugs
- Providing developer guidance/support
- When adding or modifying a large piece of code:
- If you are a collaborator, make sure to add yourself to [CODEOWNERS](CODEOWNERS) to indicate your availability for reviewing related PRs
- If you are a contributor, find an existing collaborator who is willing to review and maintain your code long-term
- Provide the necessary CI workflow (and hardware) to test your changes (see [ci/README.md](https://github.com/ggml-org/llama.cpp/tree/master/ci))
- New code should follow the guidelines (coding, naming, etc.) outlined in this document. Exceptions are allowed in isolated, backend-specific parts of the code that do not interface directly with the `ggml` interfaces.
_(NOTE: for legacy reasons, existing code is not required to follow this guideline)_
-**[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
- [guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)
- [[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)
- Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
- Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Go (no CGo needed): [hybridgroup/yzma](https://github.com/hybridgroup/yzma)
</details>
@@ -233,13 +242,14 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
- [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
- [unslothai/unsloth](https://github.com/unslothai/unsloth) – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)
</details>
<details>
<summary>Infrastructure</summary>
- [Paddler](https://github.com/distantmagic/paddler) - Stateful load balancer custom-tailored for llama.cpp
- [Paddler](https://github.com/intentee/paddler) - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
- [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
- [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
- [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server
@@ -269,8 +279,10 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
| [Vulkan](docs/build.md#vulkan) | GPU |
| [CANN](docs/build.md#cann) | Ascend NPU |
| [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
| [IBM zDNN](docs/backend/zDNN.md) | IBM Z & LinuxONE |
| [WebGPU [In Progress]](docs/build.md#webgpu) | All |
| [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All |
@@ -515,8 +527,8 @@ To learn more about model quantization, [read this documentation](tools/quantize
## Contributing
- Contributors can open PRs
- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Collaborators will be invited based on contributions
- Maintainers can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Any help with managing issues, PRs and projects is very appreciated!
- See [good first issues](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
- Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
- [linenoise.cpp](./tools/run/linenoise.cpp/linenoise.cpp) - C++ library that provides readline-like line editing capabilities, used by `llama-run` - BSD 2-Clause License
- [curl](https://curl.se/) - Client-side URL transfer library, used by various tools/examples - [CURL License](https://curl.se/docs/copyright.html)
- [miniaudio.h](https://github.com/mackron/miniaudio) - Single-header audio format decoder, used by multimodal subsystem - Public domain
- [subprocess.h](https://github.com/sheredom/subprocess.h) - Single-header process launching solution for C and C++ - Public domain
@@ -65,4 +65,6 @@ However, If you have discovered a security vulnerability in this project, please
Please disclose it as a private [security advisory](https://github.com/ggml-org/llama.cpp/security/advisories/new).
Please note that using AI to identify vulnerabilities and generate reports is permitted. However, you must (1) explicitly disclose how AI was used and (2) conduct a thorough manual review before submitting the report.
A team of volunteers on a reasonable-effort basis maintains this project. As such, please give us at least 90 days to work on a fix before public exposure.
This setup ensures that the CI runs within an isolated Docker environment while maintaining cached files and results across runs.
- Add a self-hosted `ggml-ci` workflow to [[.github/workflows/build.yml]] with an appropriate label
- Request a runner token from `ggml-org` (for example, via a comment in the PR or email)
- Set-up a machine using the received token ([docs](https://docs.github.com/en/actions/how-tos/manage-runners/self-hosted-runners/add-runners))
- Optionally update [ci/run.sh](https://github.com/ggml-org/llama.cpp/blob/master/ci/run.sh) to build and run on the target platform by gating the implementation with a `GG_BUILD_...` env
(time ./bin/llama-cli -no-cnv --model ${model_f16} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-cli -no-cnv --model ${model_q8_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_1} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_1} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q2_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q3_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q6_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-cli -no-cnv --model ${model_f16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-cli -no-cnv --model ${model_bf16} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-bf16.log
(time ./bin/llama-cli -no-cnv --model ${model_q8_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_0} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_1} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q2_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q3_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q6_k} -ngl 99 -c 1024 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-perplexity --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-perplexity --model ${model_q8_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-perplexity --model ${model_q4_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-perplexity --model ${model_q4_1} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-perplexity --model ${model_q5_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-perplexity --model ${model_q5_1} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-perplexity --model ${model_q2_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-perplexity --model ${model_q3_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-perplexity --model ${model_q5_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-perplexity --model ${model_q6_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-perplexity --model ${model_f16} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-f16.log
if[ -z ${GG_BUILD_NO_BF16}];then
(time ./bin/llama-perplexity --model ${model_bf16} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-bf16.log
fi
(time ./bin/llama-perplexity --model ${model_q8_0} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-perplexity --model ${model_q4_0} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-perplexity --model ${model_q4_1} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-perplexity --model ${model_q5_0} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-perplexity --model ${model_q5_1} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-perplexity --model ${model_q2_k} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-perplexity --model ${model_q3_k} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-perplexity --model ${model_q5_k} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-perplexity --model ${model_q6_k} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-imatrix.log
(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2) 2>&1| tee -a $OUT/${ci}-imatrix.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa ) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa ) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-cli -no-cnv --model ${model_f16} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-cli -no-cnv --model ${model_q8_0} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_0} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_1} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_0} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_1} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q2_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q3_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q6_k} -ngl 99 -c 0 -s 1234 -n 64 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-perplexity --model ${model_f16} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-perplexity --model ${model_q8_0} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-perplexity --model ${model_q4_0} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-perplexity --model ${model_q4_1} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-perplexity --model ${model_q5_0} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-perplexity --model ${model_q5_1} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-perplexity --model ${model_q2_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-perplexity --model ${model_q3_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-perplexity --model ${model_q5_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-perplexity --model ${model_q6_k} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1) 2>&1| tee -a $OUT/${ci}-imatrix.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa ) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 1024 -fa off --no-op-offload) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 1024 -fa on --no-op-offload) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 1024 -fa off ) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 1024 -fa on ) 2>&1| tee -a $OUT/${ci}-save-load-state.log
function check_ppl {
qnt="$1"
@@ -537,6 +447,9 @@ function gg_run_pythia_1_4b {
}
check_ppl "f16""$(cat $OUT/${ci}-tg-f16.log | grep "^\[1\]")"| tee -a $OUT/${ci}-ppl.log
if[ -z ${GG_BUILD_NO_BF16}];then
check_ppl "bf16""$(cat $OUT/${ci}-tg-bf16.log | grep "^\[1\]")"| tee -a $OUT/${ci}-ppl.log
fi
check_ppl "q8_0""$(cat $OUT/${ci}-tg-q8_0.log | grep "^\[1\]")"| tee -a $OUT/${ci}-ppl.log
check_ppl "q4_0""$(cat $OUT/${ci}-tg-q4_0.log | grep "^\[1\]")"| tee -a $OUT/${ci}-ppl.log
check_ppl "q4_1""$(cat $OUT/${ci}-tg-q4_1.log | grep "^\[1\]")"| tee -a $OUT/${ci}-ppl.log
@@ -553,147 +466,17 @@ function gg_run_pythia_1_4b {
(time ./bin/llama-cli -no-cnv --model ${model_f16} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-cli -no-cnv --model ${model_q8_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_1} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_0} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_1} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-cli -no-cnv --model ${model_q2_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q3_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q4_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q5_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-cli -no-cnv --model ${model_q6_k} -t 1 -ngl 99 -c 0 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-perplexity --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-perplexity --model ${model_q8_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-perplexity --model ${model_q4_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/llama-perplexity --model ${model_q4_1} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/llama-perplexity --model ${model_q5_0} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/llama-perplexity --model ${model_q5_1} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/llama-perplexity --model ${model_q2_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/llama-perplexity --model ${model_q3_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/llama-perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/llama-perplexity --model ${model_q5_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/llama-perplexity --model ${model_q6_k} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-imatrix.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa ) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa ) 2>&1| tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-embedding --model ${model_f16} -p "I believe the meaning of life is" -ngl 99 -c 0) 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-embedding --model ${model_q8_0} -p "I believe the meaning of life is" -ngl 99 -c 0) 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/llama-embedding --model ${model_f16} -p "I believe the meaning of life is" -ngl 99 -c 0--no-op-offload) 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/llama-embedding --model ${model_q8_0} -p "I believe the meaning of life is" -ngl 99 -c 0--no-op-offload) 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
set +e
}
@@ -765,12 +548,7 @@ function gg_run_rerank_tiny {
(time ./bin/llama-embedding --model ${model_f16} -p "what is panda?\thi\nwhat is panda?\tit's a bear\nwhat is panda?\tThe giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." -ngl 99 -c 0 --pooling rank --embd-normalize -1 --verbose-prompt) 2>&1| tee -a $OUT/${ci}-rk-f16.log
(time ./bin/llama-embedding --model ${model_f16} -p "what is panda?\thi\nwhat is panda?\tit's a bear\nwhat is panda?\tThe giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." -ngl 99 -c 0 --pooling rank --embd-normalize -1 --no-op-offload --verbose-prompt) 2>&1| tee -a $OUT/${ci}-rk-f16.log
# sample output
# rerank score 0: 0.029
@@ -860,10 +638,8 @@ if [ -z ${GG_BUILD_LOW_PERF} ]; then
fi
ret=0
if[ -z ${GG_BUILD_SYCL}];then
# SYCL build breaks with debug build flags
test$ret -eq 0&& gg_run ctest_debug
fi
test$ret -eq 0&& gg_run ctest_debug
test$ret -eq 0&& gg_run ctest_release
if[ -z ${GG_BUILD_LOW_PERF}];then
@@ -871,24 +647,15 @@ if [ -z ${GG_BUILD_LOW_PERF} ]; then
booladd_gumbel_noise=false;// add gumbel noise to the logits if temp > 0.0
};
// reasoning API response format (not to be confused as chat template's reasoning format)
enumcommon_reasoning_format{
COMMON_REASONING_FORMAT_NONE,
COMMON_REASONING_FORMAT_AUTO,// Same as deepseek, using `message.reasoning_content`
COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY,// Extract thinking tag contents and return as `message.reasoning_content`, or leave inline in <think> tags in stream mode
COMMON_REASONING_FORMAT_DEEPSEEK,// Extract thinking tag contents and return as `message.reasoning_content`, including in streaming deltas.
// do not extend this enum unless you absolutely have to
// in most cases, use COMMON_REASONING_FORMAT_AUTO
// return string representation like "user/model:tag"
// if tag is "latest", it will be omitted
std::stringto_string()const{
returnuser+"/"+model+(tag=="latest"?"":":"+tag);
}
};
structcommon_hf_file_res{
std::stringrepo;// repo name with ":tag" removed
std::stringggufFile;
std::stringmmprojFile;
};
/**
* Allow getting the HF file from the HF repo with tag (like ollama), for example:
* - bartowski/Llama-3.2-3B-Instruct-GGUF:q4
* - bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
* - bartowski/Llama-3.2-3B-Instruct-GGUF:q5_k_s
* Tag is optional, default to "latest" (meaning it checks for Q4_K_M first, then Q4, then if not found, return the first GGUF file in repo)
*
* Return pair of <repo, file> (with "repo" already having tag removed)
*
* Note: we use the Ollama-compatible HF API, but not using the blobId. Instead, we use the special "ggufFile" field which returns the value for "hf_file". This is done to be backward-compatible with existing cache files.
help="output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type",
@@ -293,17 +293,14 @@ We would like to thank Tuo Dai, Shanni Li, and all of the project maintainers fr
## Environment variable setup
### GGML_CANN_ASYNC_MODE
Enables asynchronous operator submission. Disabled by default.
### GGML_CANN_MEM_POOL
Specifies the memory pool management strategy:
Specifies the memory pool management strategy, Default is vmm.
- vmm: Utilizes a virtual memory manager pool. If hardware support for VMM is unavailable, falls back to the legacy (leg) memory pool.
- prio: Employs a priority queue-based memory pool management.
- leg: Uses a fixed-size buffer pool.
### GGML_CANN_DISABLE_BUF_POOL_CLEAN
@@ -312,5 +309,21 @@ Controls automatic cleanup of the memory pool. This option is only effective whe
### GGML_CANN_WEIGHT_NZ
Converting the matmul weight format from ND to NZ can significantly improve performance on the 310I DUO NPU.
Converting the matmul weight format from ND to NZ to improve performance. Enabled by default.
### GGML_CANN_ACL_GRAPH
Operators are executed using ACL graph execution, rather than in op-by-op (eager) mode. Enabled by default. This option is only effective if `USE_ACL_GRAPH` was enabled at compilation time. To enable it, recompile using:
Maximum number of compiled CANN graphs kept in the LRU cache, default is 12. When the number of cached graphs exceeds this capacity, the least recently used graph will be evicted.
### GGML_CANN_PREFILL_USE_GRAPH
Enable ACL graph execution during the prefill stage, default is false. This option is only effective when FA is enabled.
You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model prepration.
You can refer to the general [llama-quantize tool](/tools/quantize/README.md) for steps to convert a model in Hugging Face safetensor format to GGUF with quantization.
Currently we support `Q4_0` quantization and have optimize for it. To achieve best performance on Adreno GPU, add `--pure` to `llama-quantize`. For example,
Currently we support `Q4_0` quantization and have optimized for it. To achieve best performance on Adreno GPU, add `--pure` to `llama-quantize` (i.e., make all weights in `Q4_0`). For example,
@@ -58,6 +63,17 @@ Currently we support `Q4_0` quantization and have optimize for it. To achieve be
Since `Q6_K` is also supported, `Q4_0` quantization without `--pure` will also work. However, the performance will be worse compared to pure `Q4_0` quantization.
### `MXFP4` MoE Models
OpenAI gpt-oss models are MoE models in `MXFP4`. The quantized model will be in `MXFP4_MOE`, a mixture of `MXFP4` and `Q8_0`.
For this quantization, there is no need to specify `--pure`.
For gpt-oss-20b model, you can directly [download](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF) the quantized GGUF file in `MXFP4_MOE` from Hugging Face.
Although it is possible to quantize gpt-oss-20b model in pure `Q4_0` (all weights in `Q4_0`), it is not recommended since `MXFP4` has been optimized for MoE while `Q4_0` is not. In addition, accuracy should degrade with such pure `Q4_0` quantization.
Hence, using the default `MXFP4_MOE` quantization (see the link above) is recommended for this model.
> Note that the `Q4_0` model found [here](https://huggingface.co/unsloth/gpt-oss-20b-GGUF/blob/main/gpt-oss-20b-Q4_0.gguf) is a mixture of `Q4_0`, `Q8_0` and `MXFP4` and gives better performance than `MXFP4_MOE` quantization.
## CMake Options
The OpenCL backend has the following CMake options that control the behavior of the backend.
@@ -146,10 +162,13 @@ A Snapdragon X Elite device with Windows 11 Arm64 is used. Make sure the followi
* Ninja
* Visual Studio 2022
* Powershell 7
* Python
Visual Studio provides necessary headers and libraries although it is not directly used for building.
Alternatively, Visual Studio Build Tools can be installed instead of the full Visual Studio.
> Note that building using Visual Studio's cl compiler is not supported. Clang must be used. Clang depends on libraries provided by Visual Studio to work. Therefore, Visual Studio must be installed. Alternatively, Visual Studio Build Tools can be installed instead of the full Visual Studio.
Powershell 7 is used for the following commands.
If an older version of Powershell is used, these commands may not work as they are.
@@ -201,9 +220,12 @@ ninja
## Known Issues
-Currently OpenCL backend does not work on Adreno 6xx GPUs.
-Flash attention does not always improve performance.
- Currently OpenCL backend works on A6xx GPUs with recent drivers and compilers (usually found in IoT platforms).
However, it does not work on A6xx GPUs found in phones with old drivers and compilers.
@@ -42,6 +42,9 @@ The following releases are verified and recommended:
## News
- 2025.11
- Support malloc memory on device more than 4GB.
- 2025.2
- Optimize MUL_MAT Q4_0 on Intel GPU for all dGPUs and built-in GPUs since MTL. Increase the performance of LLM (llama-2-7b.Q4_0.gguf) 21%-87% on Intel GPUs (MTL, ARL-H, Arc, Flex, PVC).
|GPU|Base tokens/s|Increased tokens/s|Percent|
@@ -145,12 +148,13 @@ The docker build option is currently limited to *Intel GPU* targets.
To build in default FP32 *(Slower than FP16 alternative)*, set `--build-arg="GGML_SYCL_F16=OFF"` in the previous command.
You can also use the `.devops/llama-server-intel.Dockerfile`, which builds the *"server"* alternative.
Check the [documentation for Docker](../docker.md) to see the available images.
@@ -160,7 +164,7 @@ Check the [documentation for Docker](../docker.md) to see the available images.
# First, find all the DRI cards
ls -la /dev/dri
# Then, pick the card that you want to use (here for e.g. /dev/dri/card1).
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-sycl -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
docker run -it --rm -v "/path/to/models:/models" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-sycl -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -c 4096 -s 0
```
*Notes:*
@@ -215,9 +219,19 @@ To target AMD GPUs with SYCL, the ROCm stack must be installed first.
2.**Install Intel® oneAPI Base toolkit**
SYCL backend depends on:
- Intel® oneAPI DPC++/C++ compiler/running-time.
- Intel® oneAPI DPC++/C++ library (oneDPL).
- Intel® oneAPI Deep Neural Network Library (oneDNN).
- Intel® oneAPI Math Kernel Library (oneMKL).
- **For Intel GPU**
The base toolkit can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
All above are included in both **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** packages.
It's recommended to install **Intel® Deep Learning Essentials** which only provides the necessary libraries with less size.
The **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
Please follow the instructions for downloading and installing the Toolkit for Linux, and preferably keep the default installation values unchanged, notably the installation path *(`/opt/intel/oneapi` by default)*.
@@ -225,6 +239,12 @@ Following guidelines/code snippets assume the default installation values. Other
Upon a successful installation, SYCL is enabled for the available intel devices, along with relevant libraries such as oneAPI oneDNN for Intel GPUs.
|Verified release|
|-|
|2025.2.1|
|2025.1|
|2024.1|
- **Adding support to Nvidia GPUs**
**oneAPI Plugin**: In order to enable SYCL support on Nvidia GPUs, please install the [Codeplay oneAPI Plugin for Nvidia GPUs](https://developer.codeplay.com/products/oneapi/nvidia/download). User should also make sure the plugin version matches the installed base toolkit one *(previous step)* for a seamless "oneAPI on Nvidia GPU" setup.
@@ -255,10 +275,11 @@ sycl-ls
When targeting an intel GPU, the user should expect one or more devices among the available SYCL devices. Please make sure that at least one GPU is present via `sycl-ls`, for instance `[level_zero:gpu]` in the sample output below:
You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf).
You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf?download=true) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf).
##### Check device
@@ -466,7 +487,17 @@ If you already have a recent version of Microsoft Visual Studio, you can skip th
3. Install Intel® oneAPI Base toolkit
The base toolkit can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
SYCL backend depends on:
- Intel® oneAPI DPC++/C++ compiler/running-time.
- Intel® oneAPI DPC++/C++ library (oneDPL).
- Intel® oneAPI Deep Neural Network Library (oneDNN).
- Intel® oneAPI Math Kernel Library (oneMKL).
All above are included in both **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** packages.
It's recommended to install **Intel® Deep Learning Essentials** which only provides the necessary libraries with less size.
The **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
Please follow the instructions for downloading and installing the Toolkit for Windows, and preferably keep the default installation values unchanged, notably the installation path *(`C:\Program Files (x86)\Intel\oneAPI` by default)*.
@@ -761,6 +792,8 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because graph performance isn't yet better than non-graph performance. |
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Support malloc device memory more than 4GB.|
## Known Issues
@@ -807,6 +840,14 @@ use 1 SYCL GPUs: [0] with Max compute units:512
| The default context is too big. It leads to excessive memory usage.|Set `-c 8192` or a smaller value.|
| The model is too big and requires more memory than what is available.|Choose a smaller model or change to a smaller quantization, like Q5 -> Q4;<br>Alternatively, use more than one device to load model.|
- `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5000000000 Bytes of memory on device`
You need to enable to support 4GB memory malloc by:
```
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
```
### **GitHub contribution**:
Please add the `SYCL :` prefix/tag in issues/PRs titles to help the SYCL contributors to check/address them without delay.
The easiest way to run llama.cpp cli tools is using provided wrapper scripts that properly set up all required environment variables.
llama.cpp supports three backends on Snapdragon-based devices: CPU, Adreno GPU (GPUOpenCL), and Hexagon NPU (HTP0-4).
You can select which backend to run the model on using the `D=` variable, which maps to the `--device` option.
Hexagon NPU behaves as a "GPU" device when it comes to `-ngl` and other offload-related options.
Here are some examples of running various llama.cpp tools via ADB.
Simple question for Llama-3.2-1B
```
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-cli.sh -no-cnv -p "what is the most popular cookie in the world?"
IBM zDNN (Z Deep Neural Network) is a hardware acceleration library designed specifically to leverage the IBM NNPA (Neural Network Processor Assist) accelerator located within IBM Telum I and II processors. It provides significant performance improvements for neural network inference operations.
### Llama.cpp + IBM zDNN
The llama.cpp zDNN backend is designed to enable llama.cpp on IBM z17 and later systems via the IBM zDNN hardware acceleration library.
Note: Using the zDNN library provided via `apt` or `yum` may not work correctly as reported in [#15772](https://github.com/ggml-org/llama.cpp/issues/15772). It is preferred that you compile from source.
Below is the build script: it requires utilizing RISC-V vector instructions for acceleration. Ensure the `GGML_CPU_RISCV64_SPACEMIT` compilation option is enabled. The currently supported optimization version is `RISCV64_SPACEMIT_IME1`, corresponding to the `RISCV64_SPACEMIT_IME_SPEC` compilation option. Compiler configurations are defined in the `riscv64-spacemit-linux-gnu-gcc.cmake` file. Please ensure you have installed the RISC-V compiler and set the environment variable via `export RISCV_ROOT_PATH={your_compiler_path}`.
isa : rv64imafdcv_zicbom_zicboz_zicntr_zicond_zicsr_zifencei_zihintpause_zihpm_zfh_zfhmin_zca_zcd_zba_zbb_zbc_zbs_zkt_zve32f_zve32x_zve64d_zve64f_zve64x_zvfh_zvfhmin_zvkt_sscofpmf_sstc_svinval_svnapot_svpbmt
mmu : sv39
uarch : spacemit,x60
mvendorid : 0x710
marchid : 0x8000000058000001
~~~
Q4_0
| Model | Size | Params | backend | threads | test | t/s |
This provides acceleration using the IBM zAIU co-processor located in the Telum I and Telum II processors. Make sure to have the [IBM zDNN library](https://github.com/IBM/zDNN) installed.
#### Compile from source from IBM
You may find the official build instructions here: [Building and Installing zDNN](https://github.com/IBM/zDNN?tab=readme-ov-file#building-and-installing-zdnn)
### Compilation
```bash
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_ZDNN=ON
cmake --build build --config Release -j$(nproc)
```
## Getting GGUF Models
All models need to be converted to Big-Endian. You can achieve this in three cases:
@@ -145,17 +150,13 @@ All models need to be converted to Big-Endian. You can achieve this in three cas
### 1. SIMD Acceleration
Only available in IBM z15 or later system with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z14/arch12. In such systems, the APIs can still run but will use a scalar implementation.
Only available in IBM z15/LinuxONE 3 or later system with the `-DGGML_VXE=ON` (turned on by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z14/arch12. In such systems, the APIs can still run but will use a scalar implementation.
### 2. NNPA Vector Intrinsics Acceleration
### 2. zDNN Accelerator (WIP)
Only available in IBM z16 or later system with the `-DGGML_NNPA=ON` (turned off by default) compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs can still run but will use a scalar implementation.
Only available in IBM z17/LinuxONE 5 or later system with the `-DGGML_ZDNN=ON` compile flag. No hardware acceleration is possible with llama.cpp with older systems, such as IBM z15/arch13. In such systems, the APIs will default back to CPU routines.
### 3. zDNN Accelerator
_Only available in IBM z16 / LinuxONE 4 or later system. No support currently available._
### 4. Spyre Accelerator
### 3. Spyre Accelerator
_Only available with IBM z17 / LinuxONE 5 or later system. No support currently available._
@@ -213,10 +214,6 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
Answer: We are aware of this as detailed in [this issue](https://github.com/ggml-org/llama.cpp/issues/14877). Please either try reducing the number of threads, or disable the compile option using `-DGGML_NNPA=OFF`.
## Getting Help on IBM Z & LinuxONE
1.**Bugs, Feature Requests**
@@ -229,48 +226,50 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
## Appendix A: Hardware Support Matrix
| | Support | Minimum Compiler Version |
| ------- | ------- | ------------------------ |
| IBM z15 | ✅ | |
| IBM z16 | ✅ | |
| IBM z17 | ✅ | GCC 15.1.0 |
| | Support | Minimum Compiler Version |
| -------- | ------- | ------------------------ |
| IBM z15 | ✅ | |
| IBM z16 | ✅ | |
| IBM z17 | ✅ | GCC 15.1.0 |
| IBM zDNN | ✅ | |
- ✅ - supported and verified to run as intended
- 🚫 - unsupported, we are unlikely able to provide support
## Appendix B: SIMD Support Matrix
| | VX/VXE/VXE2 | NNPA | zDNN | Spyre |
|---------- | ----------- | ---- | ---- | -----|
| FP32 | ✅ | ✅ | ❓ | ❓ |
| FP16 | ✅ | ✅ | ❓ | ❓ |
| BF16 | 🚫 | 🚫 | ❓ | ❓ |
| Q4_0 | ✅ | ✅ | ❓ | ❓ |
| Q4_1 | ✅ | ✅ | ❓ | ❓ |
| Q5_0| 🚫 | 🚫 | ❓ | ❓ |
| Q5_1 | 🚫| 🚫 | ❓ | ❓ |
| Q8_0 | ✅ | ✅ | ❓ | ❓ |
| Q2_K | 🚫| 🚫 | ❓ | ❓ |
| Q3_K | ✅| ✅ | ❓ | ❓ |
| Q4_K | ✅ | ✅ | ❓ | ❓ |
| Q5_K | ✅ | ✅ | ❓ | ❓ |
| Q6_K | ✅ | ✅ | ❓ | ❓ |
| TQ1_0 | 🚫 | 🚫 | ❓ | ❓ |
| TQ2_0 | 🚫 | 🚫 | ❓ | ❓ |
| IQ2_XXS | 🚫 | 🚫 | ❓ | ❓ |
| IQ2_XS | 🚫 | 🚫 | ❓ | ❓ |
| IQ2_S | 🚫 | 🚫 | ❓ | ❓ |
| IQ3_XXS | 🚫 | 🚫 | ❓ | ❓ |
| IQ3_S | 🚫 | 🚫 | ❓ | ❓ |
| IQ1_S | 🚫 | 🚫 | ❓ | ❓ |
| IQ1_M | 🚫 | 🚫 | ❓ | ❓ |
| IQ4_NL | ✅ | ✅ | ❓ | ❓ |
| IQ4_XS | ✅ | ✅ | ❓ | ❓ |
| FP32->FP16 | 🚫 | ✅ | ❓ | ❓ |
| FP16->FP32 | 🚫 | ✅ | ❓ | ❓ |
| | VX/VXE/VXE2 | zDNN | Spyre |
|------------|-------------|------|-------|
| FP32 | ✅ | ✅ | ❓ |
| FP16 | ✅ | ✅ | ❓ |
| BF16 | 🚫 | ✅ | ❓ |
| Q4_0 | ✅ | ❓ | ❓ |
| Q4_1 | ✅ | ❓ | ❓ |
| MXFP4 | 🚫 | ❓ | ❓ |
| Q5_0 | ✅ | ❓ | ❓ |
| Q5_1 | ✅ | ❓ | ❓ |
| Q8_0 | ✅ | ❓ | ❓ |
| Q2_K | 🚫 | ❓ | ❓ |
| Q3_K | ✅ | ❓ | ❓ |
| Q4_K | ✅ | ❓ | ❓ |
| Q5_K | ✅ | ❓ | ❓ |
| Q6_K | ✅ | ❓ | ❓ |
| TQ1_0 | 🚫 | ❓ | ❓ |
| TQ2_0 | 🚫 | ❓ | ❓ |
| IQ2_XXS | 🚫 | ❓ | ❓ |
| IQ2_XS | 🚫 | ❓ | ❓ |
| IQ2_S | 🚫 | ❓ | ❓ |
| IQ3_XXS | 🚫 | ❓ | ❓ |
| IQ3_S | 🚫 | ❓ | ❓ |
| IQ1_S | 🚫 | ❓ | ❓ |
| IQ1_M | 🚫 | ❓ | ❓ |
| IQ4_NL | ✅ | ❓ | ❓ |
| IQ4_XS | ✅ | ❓ | ❓ |
| FP32->FP16 | 🚫 | ❓ | ❓ |
| FP16->FP32 | 🚫 | ❓ | ❓ |
- ✅ - acceleration available
- 🚫 - acceleration unavailable, will still run using scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself
Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on July 25, 2025.
Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on Sep 7, 2025.
Building for arm64 can also be done with the MSVC compiler with the build-arm64-windows-MSVC preset, or the standard CMake build instructions. However, note that the MSVC compiler does not support inline ARM assembly code, used e.g. for the accelerated Q4_0_N_M CPU kernels.
For building with ninja generator and clang compiler as default:
If you have multiple CUDA installations on your system and want to compile llama.cpp for a specific one, e.g. for CUDA 11.7 installed under `/opt/cuda-11.7`:
| GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, CDNA and RDNA3+). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models |
| GGML_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
| GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, CDNA and RDNA3+). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for CDNA and RDNA4) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000). |
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer| 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
## MUSA
@@ -264,10 +303,12 @@ You can download it from your Linux distro's package manager or from here: [ROCm
- Using `CMake` for Linux (assuming a gfx1030-compatible AMD GPU):
Note: `GPU_TARGETS` is optional, omitting it will build the code for all GPUs in the current system.
To enhance flash attention performance on RDNA3+ or CDNA architectures, you can utilize the rocWMMA library by enabling the `-DGGML_HIP_ROCWMMA_FATTN=ON` option. This requires rocWMMA headers to be installed on the build system.
The rocWMMA library is included by default when installing the ROCm SDK using the `rocm` meta package provided by AMD. Alternatively, if you are not using the meta package, you can install the library using the `rocwmma-dev` or `rocwmma-devel` package, depending on your system's package manager.
@@ -285,17 +326,17 @@ You can download it from your Linux distro's package manager or from here: [ROCm
Make sure that `AMDGPU_TARGETS` is set to the GPU arch you want to compile for. The above example uses `gfx1100` that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors)
If necessary, adapt `GPU_TARGETS` to the GPU arch you want to compile for. The above example uses `gfx1100` that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors)
Find your gpu version string by matching the most significant version information from `rocminfo | grep gfx | head -1 | awk '{print $2}'` with the list of processors, e.g. `gfx1035` maps to `gfx1030`.
First, follow the official LunarG instructions for the installation and setup of the Vulkan SDK in the [Getting Started with the Linux Tarball Vulkan SDK](https://vulkan.lunarg.com/doc/sdk/latest/linux/getting_started.html) guide.
> [!IMPORTANT]
> After completing the first step, ensure that you have used the `source` command on the `setup_env.sh` file inside of the Vulkan SDK in your current terminal session. Otherwise, the build won't work. Additionally, if you close out of your terminal, you must perform this step again if you intend to perform a build. However, there are ways to make this persistent. Refer to the Vulkan SDK guide linked in the first step for more information about any of this.
#### Using system packages
On Debian / Ubuntu, you can install the required dependencies using:
```sh
sudo apt-get install libvulkan-dev glslc
```
#### Common steps
Second, after verifying that you have followed all of the SDK installation/setup steps, use this command to make sure before proceeding:
We have three Docker images available for this project:
1.`ghcr.io/ggml-org/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
2.`ghcr.io/ggml-org/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`)
3.`ghcr.io/ggml-org/llama.cpp:server`: This image only includes the server executable file. (platforms: `linux/amd64`, `linux/arm64`)
1.`ghcr.io/ggml-org/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`, `linux/s390x`)
2.`ghcr.io/ggml-org/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`, `linux/s390x`)
3.`ghcr.io/ggml-org/llama.cpp:server`: This image only includes the server executable file. (platforms: `linux/amd64`, `linux/arm64`, `linux/s390x`)
Additionally, there the following images, similar to the above:
@@ -110,7 +110,7 @@ You may want to pass in some different `ARGS`, depending on the MUSA environment
The defaults are:
-`MUSA_VERSION` set to `rc4.2.0`
-`MUSA_VERSION` set to `rc4.3.0`
The resulting images, are essentially the same as the non-MUSA images:
@@ -21,6 +21,8 @@ Function calling is supported for all models (see https://github.com/ggml-org/ll
- Use `--chat-template-file` to override the template when appropriate (see examples below)
- Generic support may consume more tokens and be less efficient than a model's native format.
- Multiple/parallel tool calling is supported on some models but disabled by default, enable it by passing `"parallel_tool_calls": true` in the completion endpoint payload.
<details>
<summary>Show some common templates and which format handler they use</summary>
@@ -6,7 +6,7 @@ Download [MiniCPM-V-4](https://huggingface.co/openbmb/MiniCPM-V-4) PyTorch model
### Build llama.cpp
Readme modification time: 20250206
Readme modification time: 20250731
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
Download [MiniCPM-V-4_5](https://huggingface.co/openbmb/MiniCPM-V-4_5) PyTorch model from huggingface to "MiniCPM-V-4_5" folder.
### Build llama.cpp
Readme modification time: 20250826
If there are differences in usage, please refer to the official build [documentation](https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md)
Clone llama.cpp:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
Build llama.cpp using `CMake`:
```bash
cmake -B build
cmake --build build --config Release
```
### Usage of MiniCPM-V 4
Convert PyTorch model to gguf files (You can also download the converted [gguf](https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf) by us)
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.