mirror of https://github.com/ggerganov/llama.cpp.git
synced 2026-02-05 13:53:23 +02:00
b7942
1602 Commits

6ab881b7c3
model-conversion : add tensor-info.py utility (#18954)
This commit adds a new Python script that prints tensor information
from a safetensors model.
The motivation for this is that during model conversion work it can
sometimes be useful to verify the shape of tensors in the original
model. While it is possible to print the tensors when loading the model,
this can be slow when working with larger models.
With this script it is possible to quickly query tensor shapes (a sketch
of how such a lookup can work follows the examples below).
Example usage:
```console
(venv) $ ./scripts/utils/tensor-info.py --help
usage: tensor-info.py [-h] [-m MODEL_PATH] [-l] [tensor_name]
Print tensor information from a safetensors model
positional arguments:
tensor_name Name of the tensor to inspect
options:
-h, --help show this help message and exit
-m MODEL_PATH, --model-path MODEL_PATH
Path to the model directory (default: MODEL_PATH environment variable)
-l, --list List unique tensor patterns in the model (layer numbers replaced with #)
```
Listing tensor names:
```console
(venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m -l
embed_tokens.weight
layers.#.input_layernorm.weight
layers.#.mlp.down_proj.weight
layers.#.mlp.gate_proj.weight
layers.#.mlp.up_proj.weight
layers.#.post_attention_layernorm.weight
layers.#.post_feedforward_layernorm.weight
layers.#.pre_feedforward_layernorm.weight
layers.#.self_attn.k_norm.weight
layers.#.self_attn.k_proj.weight
layers.#.self_attn.o_proj.weight
layers.#.self_attn.q_norm.weight
layers.#.self_attn.q_proj.weight
layers.#.self_attn.v_proj.weight
norm.weight
```
Printing a specific tensor's information:
```console
(venv) $ ./scripts/utils/tensor-info.py -m ~/work/ai/models/google/embeddinggemma-300m layers.0.input_layernorm.weight
Tensor: layers.0.input_layernorm.weight
File: model.safetensors
Shape: [768]
```
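
The lookup itself can stay cheap because safetensors files keep tensor metadata in a header that can be read without loading any tensor data. A minimal sketch of such a lookup using the safetensors library (illustrative only; the actual implementation of tensor-info.py may differ):
```python
from safetensors import safe_open

# Only the file header is parsed here; tensor data is not loaded,
# which is why the query stays fast even for large models.
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        shape = f.get_slice(name).get_shape()
        print(f"{name}: {shape}")
```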

6156ae5111
model-conversion : add debug option to conversion script (#19265)
This commit adds a debug option to the model conversion script to enable using the Python debugger (pdb) during model conversion. The motivation for this is that I've found myself adding this manually a few times now, and it is quicker to have this flag as an option along with a Makefile target/recipe for it.
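
A minimal sketch of how such a flag can be wired up, assuming a hypothetical convert() entry point (the script's actual wiring may differ):
```python
import argparse

def convert() -> None:
    # placeholder for the actual conversion logic
    print("converting model...")

parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true",
                    help="drop into the Python debugger before converting")
args = parser.parse_args()

if args.debug:
    # breakpoint() honors PYTHONBREAKPOINT and defaults to pdb
    breakpoint()

convert()
```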

7a4ca3cbd9
docs : Minor cleanups (#19252)
* Update old URLs to github.com/ggml-org/
* Bump copyrights

2634ed207a
create test.sh to enhance the parameters for testing, update the guide, rm useless script (#19243)

1488339138
lookup, lookahead: fix crash when n_ctx not specified (#18729)
* lookup, lookahead: fix crash when n_ctx not specified

Since PR #16653 (Dec 15, 2025), the default n_ctx is 0 to enable automatic GPU memory fitting. This causes llama-lookup and llama-lookahead to crash when run without an explicit -c flag:

GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded")

Root cause: Both examples use params.n_ctx directly for batch initialization, but params.n_ctx remains 0 even after the context is properly initialized to n_ctx_train internally.

Bug history:
- Nov 2023: lookahead.cpp created (PR #4207) with params.n_ctx pattern
- Dec 2023: lookup.cpp created (PR #4484) with same pattern
- Nov 2024: default n_ctx changed to 4096 (PR #10136) - bug dormant
- Dec 2025: default n_ctx changed to 0 (PR #16653) - bug activated

The bug was dormant for 2+ years because params.n_ctx defaulted to 512, then 4096. PR #16653 changed it to 0 for GPU auto-fitting, triggering the crash.

Fix: Use llama_n_ctx(ctx) to get the actual runtime context size, matching the pattern already used elsewhere in lookup.cpp (line 72) and in speculative.cpp/speculative-simple.cpp.

Tested: llama-lookup now works without the -c flag (12.5% acceptance on Gemma-3-1B).

Note: llama-lookahead has a separate pre-existing issue with sequence initialization (n_seq_max=1 vs W+G+1 needed) that is unrelated to this fix.

* lookahead: fix n_seq_max and kv_unified configuration

Lookahead decoding requires:
- W + G + 1 = 31 sequences for parallel Jacobi decoding
- Unified KV cache for coupled sequences in batch splitting

These requirements were broken after PR #14482 changed validation logic. Consolidates fix from PR #18730 per maintainer request.

Commit message drafted with Claude.

72d3b1898a
spec : add self-speculative decoding (no draft model required) + refactor (#18471)
* server: introduce self-speculative decoding
* server: moved self-call into speculative.cpp
* can_speculate() includes self-speculation
* server: can_speculate() tests self-spec
* server: replace can_speculate() with slot.can_speculate()
* common: use %zu format specifier for size_t in logging
* server: can_speculate() requires a task instance
* common: ngram map, config self-speculative decoding
* common: add enum common_speculative_type
* common: add vector of speculative states
* common: add option --spec-draftless
* server: cleanup (remove slot.batch_spec, rename)
* common: moved self-spec impl to ngram-map
* common: cleanup (use common_speculative_state_draft)
* spec : refactor
* cont : naming
* spec: remove --spec-config
* doc: (draftless) speculative decoding
* common: print performance in spec decoding
* minor : cleanup
* common : better names
* minor : cleanup + fix build
* minor: comments
* CODEOWNERS: add common/ngram-map.* (#18471)
* common : rename speculative.draftless_type -> speculative.type
* ngram-map : fix uninitialized values
* ngram-map : take into account the input can become shorter
* ngram-map : revert len check for now
* arg : change `--spec-draftless` -> `--spec-type`
* spec : add common_speculative_state::accept()
* spec : refactor + add common_speculative_begin()
* spec : fix begin() call with mtmd
* spec : additional refactor + remove common_speculative_params

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

a14b960bc7
model-conversion : use BUILD_DIR variable in all scripts (#19015)
This commit modifies all the utility scripts to use an optional
BUILD_DIR variable/argument to specify the build directory.
The motivation for this is that Commit

3d55846a5c
model-conversion : add BUILD_DIR variable to run-converted-model scripts (#18927)
This commit adds a BUILD_DIR variable to the scripts used for running converted models. The motivation for this is that currently the `build` directory is hardcoded and it can be useful to specify a different build directory, with builds for different configurations.

39173bcacb
context : reserve new scheduler when graph topology changes (#18547)
* context : reserve new scheduler when graph topology changes
* cont : fix
* cont : fix reserve
* cont : reserve only when changes occur + timing
* context : add comments
* llama : reserve on sampler changes
* common : allow null common_sampler
* server : task declares needs (embd, logits, sampling)
* server : do not init sampler if not needed
* llama : fix need_reserve when unsetting a sampler
* server : consolidate slot reset/clear logic

ec997b4f2b
tests : download models only when running ctest (#18843)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>

d98b548120
Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914)
* Extract common debugging functions; plug eval-callback and mtmd's MTMD_DEBUG_GRAPH with same functionality
* Move to common
* Remove unneeded header
* Unlink from common
* chore: update webui build output
* Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code.
* Revert change to webapp
* Post-merge adjust
* Apply suggestions from code review
* Apply code review changes
* Remove changes to server-context
* Remove mtmd.h include
* Remove utility functions from header
* Rename functions
* Update tools/mtmd/clip.cpp
* Update tools/mtmd/clip.cpp
* Update tools/mtmd/clip.cpp

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

516a4ca9b5
refactor : remove libcurl, use OpenSSL when available (#18828)

f709c7a33f
ci, tests : use cmake to download models and remove libcurl dependency (#18791)
* ci, tests : use cmake to download models and remove libcurl dependency
* llama_dl_model -> llama_download_model
* use EXPECTED_HASH for robust model downloading
* Move llama_download_model to cmake/common.cmake

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

20ca2e12c4
model-conversion : remove -c 0 from model card template [no ci] (#18807)
This commit removes the `-c, --ctx-size N` option from the llama-server command in the model card template for causal models. The motivation for this is that -c 0 is the default and specifying it is redundant.

4150da9a95
examples : add --kv-unified to batched example (#18774)
This commit adds the --kv-unified flag to the batched example. This flag is currently specified in the README.md as required, but is not available as a command line option for the batched example. The motivation for this is that specifying the flag as the README instructs leads to an error about the flag not being recognized, and without the option the example fails with the following error:
```console
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 4
main: llama_decode() failed
```

9789e28459
debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check (#18692)
* debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check

This commit updates the pooling check in the debug example to also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just LLAMA_POOLING_TYPE_NONE.

* debug : normalize both pooled and token embeddings

This commit updates debug.cpp to normalize embeddings for both pooled and non-pooled outputs. For pooled embeddings, normalization is applied to the single vector, and for non-pooled embeddings, normalization is applied to each token embedding vector individually. The motivation for this is to enable non-pooled embeddings to be normalized, which was not possible previously.

9c142e3a2a
model-conversion : add warn about transformers mismatch (#18691)
This commit adds a check comparing the installed transformers library version with the transformers version that the original model supports. This check is performed upon a model verification failure and prints a warning/hint suggesting that the user install the correct version of the transformers library. The motivation for this change is that model verification can fail due to differences in the transformers library used, and it might not be obvious that this is the cause of the failure. With this warning the correct version can be checked, hopefully saving time troubleshooting the verification failure.
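
Such a check can be as simple as comparing transformers.__version__ against the transformers_version field that Hugging Face models record in config.json; a rough sketch under that assumption (not the script's actual code):
```python
import json
import transformers

def warn_on_transformers_mismatch(model_path: str) -> None:
    # config.json records the transformers version the model was saved with
    with open(f"{model_path}/config.json") as f:
        expected = json.load(f).get("transformers_version")
    installed = transformers.__version__
    if expected and expected != installed:
        print(f"warning: model was saved with transformers {expected}, "
              f"but {installed} is installed; this may explain verification "
              f"failures (try: pip install transformers=={expected})")
```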

df7fb92170
model-conversion : remove -st targets for converted model (#18689)
This commit removes the `-st` make target for running the converted embedding model. The motivation for this is that the pooling type is now part of the .gguf metadata of the model and is used by llama-debug when running the model, so there is no need to specify the pooling type separately any more. The commit also adds an option to specify the type of normalization applied to the output embeddings when running the converted model. The README documentation has been updated to reflect these changes.

2038101bd9
llama : add use_direct_io flag for model loading (#18166)
* Adding --direct-io flag for model loading
* Fixing read_raw() calls
* Fixing Windows read_raw_at
* Changing type off_t to size_t for Windows and renaming functions
* Disable direct io when mmap is explicitly enabled
* Use read_raw_unsafe when upload_backend is available; not functional on some devices with Vulkan and SYCL
* Fallback to std::fread in case O_DIRECT fails due to bad address
* Windows: remove const keywords and unused functions
* Update src/llama-mmap.cpp

---------

Co-authored-by: jtischbein <jtischbein@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ffba4f29e6
examples : add debug utility/example (#18464)
* examples : add debug utility/example
This commit introduces a new example named llama-debug, a utility
intended to assist with developing and debugging a converted model.
The motivation for this utility is to assist in model conversion work
to verify that the model produces the expected outputs. It is intended
to replace logits.cpp in examples/model-conversion.
Example usage:
```console
./build/bin/llama-debug \
-m models/Qwen2.5-0.5B-Instruct.gguf \
--prompt "Hello, my name is" \
--save-logits
...
Model add_bos: false
Input prompt: "Hello, my name is"
Token ids (5):
Hello(9707) ,(11) my(847) name(829) is(374)
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.bin
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.txt
Prompt saved to data/llamacpp-Qwen2.5-0.5B-Instruct-prompt.txt
Tokens saved to data/llamacpp-Qwen2.5-0.5B-Instruct-tokens.bin
```
For more details about the options available for this example, please
refer to examples/debug/README.md.
* throw runtime error instead of logging error
* remove params.warmup and enable the warmup/nowarmup option
* model-conversion : remove logits.cpp
This commit removes logits.cpp in favor of using llama-debug for
generating logits and embeddings.
* examples : remove model-conversion directory
This was missed in the previous commit.
* model-conversion : add support for saving prompt and token ids
This commit adds support for storing the prompt and the token ids for the
prompt when running the original models.
The motivation for this is that this will allow us to compare the prompt
and the tokens generated for the prompt when verifying the converted
model. Currently it is possible that even if the same prompt is used
that the tokens generated are different if there is a difference in the
tokenization between the original and converted model which would
currently go unnoticed (the verification will most likely fail but it
might not be obvious why).
* squash! model-conversion : add support for saving prompt and token ids
fix pyright errors.
* model-conversion : add compare_tokens utility
This commit adds a script to compare token outputs between original and
converted models.
Example usage:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
And there is a verbose flag that will also print out the prompts:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16 -v
Original model prompt (pytorch-gemma-3-270m-it):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Converted model prompt (llamacpp-gemma-3-270m-it-bf16):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
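
The core of such a comparison is small; a minimal sketch (hypothetical helper, not necessarily the script's actual code):
```python
def compare_token_ids(original: list[int], converted: list[int]) -> bool:
    # Report the first position where the two tokenizations diverge.
    for i, (a, b) in enumerate(zip(original, converted)):
        if a != b:
            print(f"❌ Token mismatch at position {i}: {a} != {b}")
            return False
    if len(original) != len(converted):
        print(f"❌ Length mismatch: {len(original)} vs {len(converted)} tokens")
        return False
    print(f"✅ All {len(original)} tokens match!")
    return True
```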
* model-conversion : add token comparison to verification scripts
This commit adds calls to the compare_tokens function in
compare-logits.py and semantic_check.py to ensure that the token ids
that the tokenizers produce are the same before proceeding with
verifying the logits/embeddings.
Placing them in the existing scripts instead of calling them separately
ensures that the token comparison is always done prior to the
logit/embedding verifications.
A follow-up commit/PR could refactor the causal logits verification into
a single script instead of the two that exist now. This would reduce the
code and make it consistent with the embeddings verification, which only
has a single script.
* debug : use llama_model_n_embd_out
This commit updates the debug example to use the new function
llama_model_n_embd_out instead of llama_model_n_embd.
The motivation for this change is to support late interaction retriever
models, like LFM2-ColBert-350M, where the output embeddings are down
projected to a lower dimension.
* debug : add print_usage function
This commit adds a print_usage function that is passed to the
common_params_parse.
The motivation for this is that this enables a specific usage message
which will be printed after all the options, for example:
```console
example usage:
Print tensors:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --verbose
The tensors to be printed can be filtered with --tensor-filter option.
Save logits/embeddings:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --save-logits
Add --embedding to save embeddings
```

73d284a250
model : add LFM2-ColBert-350M (#18607)
* model : add LFM2-ColBert-350M
* llama_model_n_embd_out() - returns `hparams.n_embd_out` if set and falls back to `hparams.n_embd`

d3dce4e0a5
sampling : add support for backend sampling (#17004)
* sampling : add support for backend sampling

This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works: it is a function that is called by build_graph and the sampler operations become part of the model's computation graph.

* llama-cli : add backend sampler configuration
* server : add backend sampling options/configuration
* webui : add backend sampling options
* ggml : add initial cumsum implementation for CUDA
* sampling : enable all backend sampler tests

This commit enables all existing backend sampler tests in test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations.

* graph : do not include llama-model.h
* sampling : always expose sampled_ids

This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that it enables both common/sampling.cpp and src/llama-sampling.cpp to simplify their logic. Not all backend samplers that process logits need to set the sampled token ids, as they may not change the order of the logits; for example, the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful.

* sampling : ensure at most one output token per seq

This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence.

* CUDA: Optimize argsort for gpu-based token sampling

Argsort is used for top-k currently. We optimize argsort in two ways:
1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths; it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics: https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview)

Some perf numbers for an RTX PRO 6000. On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`

Before:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run
```

After:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run
```

On the model level, tested with `llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling`

Before:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print: load time = 18215.58 ms
llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second)
llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second)
llama_perf_context_print: total time = 857.62 ms / 206 tokens
```

After:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print: load time = 18366.92 ms
llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second)
llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second)
llama_perf_context_print: total time = 683.65 ms / 206 tokens
```

* sampling : remove version from sampler chain

This commit removes the version field from the sampler chain and instead uses the sampler pointer itself for change detection.

* sampling : always populate logits for sampled probs

This commit updates common/sampler.cpp set_logits and src/llama-sampling.cpp llama_sampler_sample to always populate the logits field when backend sampled probabilities are available. The motivation for this is that it ensures that CPU samplers always have access to the logit values even when probabilities have been produced by backend samplers.

* sampling : simplify backend sampling logic decode

This commit tries to simplify the backend sampling logic in llama_context::decode.

* squash! sampling : simplify backend sampling logic decode

Fix condition to check if backend actually sampled tokens, not just that backend samplers are available.

* common : fix regression caused by extra memory allocations during sampling

* squash! sampling : simplify backend sampling logic decode

The commit fixes a variable shadowing issue in the `llama_context::decode` function which was introduced in a previous refactoring.

* squash! common : fix regression caused by extra memory allocations during sampling

Apply the same changes to llama-sampling.cpp, llama_sampler_sample as were applied in commit

a864fb1c14
model-conversion : use CONVERTED_MODEL for compare-embeddings (#18461)
This commit updates the causal model verification script to use the
CONVERTED_MODEL environment variable instead of using the MODEL_PATH
(the original model path) as the basis for the converted model file
name.
The motivation for this is that currently, if the converted model file name
differs from the original model directory/name, the verification script
will look for the wrong .bin file that was generated when running
the converted model.
This is similar to the change made for the embeddings models script in
Commit

c1366056f6
android: routine maintenance - Dec 2025 (#18338)
* Fix `msg` typo
* Fix thread safety in destroy() to support generation abortion in lifecycle callbacks.
* UI polish: stack new message change from below; fix GGUF margin not in viewport.
* Bug fixes: rare race condition when the main thread updates the view and the default thread updates messages at the same time; user input not disabled during generation.
* Bump dependencies' versions; deprecate outdated DSL usage.

7cbec34a63
model-conversion : add device option to embd run orig model (#18386)
This commit refactors the original model embedding script to include a device selection option. Users can now specify the device (cpu, cuda, mps, auto) via command-line arguments. It also refactors the code to be more structured.

0c8986403b
retrieval : use at most n_seq_max chunks (#18400)

8e3ead6e4d
model-conversion : add device option to run-org-model.py (#18318)
* model-conversion : add device option to run-org-model.py

This commit refactors the `run-org-model.py` script to include a `--device` argument, allowing users to specify the device on which to run the model (e.g., cpu, cuda, mps, auto). It also extracts a few common functions to prepare for future changes that will remove some code duplication which currently exists in the embedding scripts. The Makefile has also been updated to pass the device argument, for example:
```console
(venv) $ make causal-verify-logits DEVICE=cpu
```

* fix error handling and remove parser reference

This commit fixes the error handling, which previously referenced an undefined 'parser' variable.
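
For illustration, the device resolution behind such an option typically looks like the following sketch (illustrative, not necessarily the script's actual code):
```python
import argparse
import torch

def resolve_device(name: str) -> str:
    # "auto" picks the best available backend
    if name != "auto":
        return name
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

parser = argparse.ArgumentParser()
parser.add_argument("--device", choices=["cpu", "cuda", "mps", "auto"],
                    default="auto")
args = parser.parse_args()
print(f"running on {resolve_device(args.device)}")
```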

847c35f7d5
model-conversion : add trust_remote_code for embedding scripts (#18288)
This commit adds the trust_remote_code=True parameter when loading models and configurations in the embedding model conversion scripts. It also adds a cast to float for models that might use a data type not natively supported by Python, for example bfloat16. The motivation for this is that some models may require custom code to be executed during loading, and setting trust_remote_code to True avoids getting prompted for confirmation. Future work will consolidate the embedding conversion scripts with the causal conversion scripts to avoid code duplication, but in the meantime it is nice to have this fix in place.
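
A sketch of what these two changes amount to when loading a model with transformers (illustrative; the scripts' actual code may differ):
```python
import torch
from transformers import AutoModel

model_path = "path/to/original/model"  # placeholder

# trust_remote_code=True lets models that ship custom loading code
# load without an interactive confirmation prompt.
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    # bfloat16 has no native Python/numpy equivalent; cast to float32
    # before converting to a numpy array.
    emb_f32 = emb.detach().float().cpu().numpy()
```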

a6a552e4ec
[SYCL] replace llama-cli by llama-completion to rm the impact to test script (#18290)
* replace llama-cli by llama-completion to rm the impact to test script
* Update examples/sycl/run-llama2.sh
* Update examples/sycl/run-llama3.sh
* Update examples/sycl/win-run-llama2.bat
* Update examples/sycl/win-run-llama3.bat

---------

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

179fd82a72
gen-docs: automatically update markdown file (#18294)
* gen-docs: automatically update markdown file
* also strip whitespace
* do not add extra newline
* update TOC

0a271d82b4
model-conversion : add verbose flag in run-org-model.py (#18194)
This commit adds a --verbose flag to the run-org-model.py script to enable or disable detailed debug output, such as input and output tensors for each layer. Debug utilities (summarize, debug_hook, setup_rope_debug) have been moved to utils/common.py. The motivation for this is that the detailed debug output can be useful for diagnosing issues with model conversion or execution, but it can also produce a large amount of output that may not always be needed. The script will also be further cleaned/refactored in follow-up commits.
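
The commit names summarize/debug_hook utilities; per-layer output dumping of this kind is commonly built on PyTorch forward hooks, roughly like this sketch (an illustration, not the actual utils/common.py code):
```python
import torch

def summarize(name: str, t: torch.Tensor) -> None:
    # compact per-tensor summary: shape plus a few leading values
    print(f"{name}: shape={list(t.shape)} first={t.flatten()[:3].tolist()}")

def debug_hook(module: torch.nn.Module, inputs, output) -> None:
    if isinstance(output, torch.Tensor):
        summarize(module.__class__.__name__, output)

def setup_debug(model: torch.nn.Module) -> None:
    # attach the hook to every submodule when --verbose is given
    for m in model.modules():
        m.register_forward_hook(debug_hook)
```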

52fc7fee8a
android: fix missing screenshots for Android.md (#18156)
* Android basic sample app layout polish
* Add missing screenshots and polish android README doc
* Replace file blobs with URLs served by GitHub pages service.

4301e27319
common : restore grammar-based rejection sampling (#18137)
* common : restore grammar-based rejection sampling
* sampling : allow null samplers

8faa87db02
Extend run-org-model.py, add (a) batching (b) loading prompt from file (c) multimodal capacity (#18034)

5c0d18881e
llama.android : Rewrite Android binding (w/o cpu_features dep) (#17413)
* UI: implement basic UI components
* util: implement performance monitor; wrap it with a viewmodel
* util: implement user preferences utility
* UI: implement core flow's screens
* UI: add a new MainActivity; update manifest
* [WIP] DI: implement simple local vm factory provider
* UI: disable triggering drawer via gesture; enable alert dialog on back navigation inside conversation and benchmark
* UI: allow drawer's gesture control only on Home and Settings screens; enable alert dialog on back navigation inside conversation and benchmark
* UI: split a nested parent settings screen into separate child settings screens
* UI: polish system prompt setup UI
* Deps: bump Kotlin plugin; introduce KSP; apply in :app subproject
* DB: setup Room database
* data: introduce repo for System Prompt; flow data from Room to VM
* bugfix: properly handle user's quitting conversation screen while tokens in generation
* UI: rename `ModeSelection` to `ModelLoading` for better clarity
* UI: update app name to be more Arm
* UI: polish conversation screen
* data: code polish
* UI: code polish
* bugfix: handle user quitting on model loading
* UI: locks user in alert dialog when model is unloading
* vm: replace token metrics stubs with actual implementation
* UI: refactor top app bars
* nit: combine temperatureMetrics and useFahrenheit
* DI: introduce Hilt plugin + processor + lib dependencies
* DI: make app Hilt injectable
* DI: make viewmodels Hilt injectable
* DI: replace manual DI with Hilt DI
* UI: optimize AppContent's composing
* bugfix: wait for model to load before navigating to benchmark screen; use NavigationActions instead of raw navController
* UI: navigation with more natural animated transitions
* DI: Optimize AppModule
* Feature: Introduce ModelRepository and ModelsManagementViewModel; update AppModule
* UI: polish UI for ModelsManagementScreen; inject ModelsManagementViewModel
* DI: abstract the protocol of SystemPromptRepository; update AppModule
* data: [WIP] prepare for ModelRepository refactor & impl
* data: introduce Model entity and DAO; update DI module
* UI: replace Models Management screen's stubbing with instrumentation
* UI: polish sort order menu
* data: import local model with file picker
* bugfix: use List instead of Collection for ModelDao's deletion
* data: add a util file for extracting file name & size and model metadata
* UI: enrich ModelManagementState; extract filename to show correct importing UI
* UI: implement multiple models deletion; update Models Management screen
* UI: handle back navigation when user is in multi-selection mode
* util: extract file size formatting into ModelUtils
* UI: add a confirmation step when user picks a file; refactor model import overlay into AlertDialog
* UI: extract a shared ModelCard component
* UI: replace model selection screen's data stubbing; add empty view
* nit: tidy SystemPromptViewModel
* Util: split FileUtils from ModelUtils; extract copy methods into FileUtils
* data: pass through getModelById from ModelDao into ModelRepository
* core: extract conversation and benchmark logics into InferenceManager; add logs and missing state updates in stub InferenceEngine
* vm: split mono MainViewModel into separate individual ViewModels
* vm: merge SystemPromptViewModel into ModelLoadingViewModel
* core: break down InferenceManager due to Interface Segregation Principle
* UI: show model card in Model Loading screen
* UI: show model card in Conversation screen
* UI: unify Model Card components
* core: swap in LLamaAndroid and mark stub engine for testing only
* data: allow canceling the ongoing model import
* UI: update UI ongoing model import's cancellation
* LLama: update engine state after handling the cancellation of sendUserPrompt
* VM: handle the cancellation of ongoing token generation
* LLama: refactor loadModel by splitting the system prompt setting into a separate method
* feature: check for available space before copying local model
* UI: centralize the AppScaffold and modularize its configs
* UI: refactor BottomBarConfig.ModelsManagement APIs
* UI: combine TopBarConfig and BottomBarConfig into each route's ScaffoldConfig
* UI: replace ugly optional as casts in AppScaffold with extension functions
* UI: fix the typo `totalGb` in `StorageMetrics`
* UI: remove code duplication in sort menu
* LLama: add ModelUnloadingState to engine State; add missing state checks in stub engine; fix instrumentation engine's error messages
* UI: refactor back handling by removing centralized BackHandlerSetup and UnloadModelConfirmationDialog from AppContent
* UI: implement BenchmarkScreen's individual back handling
* LLama: add a new Initializing state; add two extension properties; rename LibraryLoaded state to Initialized
* UI: Introduce an abstract ViewModel to handle additional model unloading logics
* UI: expose a single facade ModelUnloadDialogHandler; move UnloadModelState into ModelUnloadingViewModel.kt
* UI: migrate ModelLoadingScreen onto ModelLoadingViewModel; update & refine ModelLoadingScreen
* UI: migrate ConversationViewModel onto ModelLoadingViewModel; update & refine ConversationScreen
* nit: extract app name into a constant value; remove unused onBackPressed callbacks
* UI: update AppContent to pass in correct navigation callbacks
* nit: polish ModelLoadingScreen UI
* core: throw Exception instead of returning null if model fails to load
* navigation: sink model loading state management from AppContent down into ModelLoadingScreen; pass ModelLoadingMetrics to Benchmark and Conversation screens
* gguf: add GGUF metadata data holder and its corresponding extractor implementation
* DB: introduce Kotlin serialization extension's library and plugin; add Room runtime library
* GGUF: make GgufMetadata serializable in order to be compatible with Room
* nit: refactor data.local package structure
* nit: rename lastUsed field to dateLastUsed; add dateAdded field
* UI: refactor ModelCard UI to show GGUF metadata
* UI: update ModelSelectionScreen with a preselect mechanism
* UI: polish model card
* nit: allow deselect model on Model Selection screen
* nit: revert accidental committing of debug code
* UI: polish ModelLoading screen
* util: extract formatting helper functions from FileUtils into a new FormatUtils
* UI: polish model cards on Benchmark and Conversation screens to show model loading metrics
* UI: show a Snack bar to warn user that system prompt is not always supported
* UI: handle back press on Model Selection screen
* UI: finally support theme modes; remove hardcoded color schemes, default to dynamic color scheme implementation
* feature: support searching on Model Selection screen
* nit: move scaffold related UI components into a separate package
* UI: extract InfoView out into a separate file for reusability
* data: move Model related actions (query, filter, sort) into ModelInfo file
* UI: animate FAB on model preselection states
* feature: support filtering in Model Management screen
* ui: show empty models info in Model Management screen
* ui: add filter off icon to "Clear filters" menu item
* [WIP] ui: polish Benchmark screen; implement its bottom app bar
* ui: polish Benchmark screen; implement its bottom app bar's rerun and share
* nit: disable mode selection's radio buttons when loading model
* feature: implement Conversation screen's bottom app bar
* pkg: restructure BottomAppBars into separate files in a child package
* pkg: restructure TopBarApps into separate files in a child package
* pkg: restructure system metrics into a separate file
* UI: polish Conversation screen
* data: update system prompt presets
* UI: allow hide or show model card on Conversation & Benchmark screens; fix message arrangement
* data: update & enhance system prompt presets
* deps: introduce Retrofit2
* data: implement HuggingFace data model, data source with Retrofit API
* data: update Model data repository to support fetching HuggingFace models
* [WIP] UI: replace the HuggingFace stub in Model Management screen with actual API call
* UI: map language codes into country Emojis
* ui: add "clear results" action to Benchmark screen
* nit: print current pp & tg in llama-bench
* UI: disable landscape mode; prevent duplicated benchmark running
* llama: migrate C/CXX flags into CMakeList
* [WIP] llama: ABI split builds five .so artifacts.
However, all .so are performing on SVE level
* [WIP] llama: ABI split where five tiers are built sequentially.
* [WIP] llama: disable OpenMP in ABI split since most SoCs are big.LITTLE
* [WIP] llama: enable KleidiAI and disable tier 4 due to `+sve+sve2` bug caused by `ggml_add_cpu_backend_variant_impl` as explained below
```CMake
if (NOT SME_ENABLED MATCHES -1)
...
set(PRIVATE_ARCH_FLAGS "-fno-tree-vectorize;${PRIVATE_ARCH_FLAGS}+sve+sve2")
...
```
* core: add Google's cpu_features as a submodule
* core: implement cpu_detector native lib
* core: swap out hardcoded LlamaAndroid library loading
* core: add back OpenMP due to huge perf loss on TG128
* misc: reorg the pkg structure
* misc: rename LlamaAndroid related class to InferenceEngine prefixes
* [WIP] lib: move GgufMetadata into the lib submodule
* lib: expose GgufMetadataReader as interface only
* lib: replace the naive & plain SharedPreferences with DataStore implementation
* lib: hide the internal implementations, only expose a facade and interfaces
* lib: expose Arm features
* di: add a stub TierDetection; provide both actual impl and stub in AppModule
* UI: add visualizer UI for Arm features
* misc: UI polish
* lib: refactored InferenceEngineLoader; added a `NONE` Llama Tier
* UI: support `NONE` Llama Tier in general settings
* lib: optimize engine loader; always perform a fresh detection when cache is null
* remote: add HuggingFaceModelDetails data class
* remote: refine HuggingFaceModel data class
* nit: remove `trendingScore` field from HuggingFace model entities, weird...
* remote: refactor HuggingFaceApiService; implement download feature in HuggingFaceRemoteDataSource
* remote: fix the incorrect parse of HuggingFace's inconsistent & weird JSON response
* UI: scaffold Models Management screen and view model
* UI: implement a dialog UI to show fetched HuggingFace models.
* UI: use a broadcast receiver to listen for download complete events and show local import dialog.
* data: handle network exceptions elegantly
* pkg: restructure `data`'s packages
* data: extract local file info, copy and cleanup logics into LocalFileDataSource
* nit: minor UI patch; add missing comments
* bugfix: tapping "Home" in navigation drawer should simply close it without any navigation action.
* UI: improve autoscroll during token generation
* lib: tested on JFrog Artifactory for Maven publishing
* UI: show RAM warning if model too large
* UI: polish model management screen's error dialog
* util: add more items into the mapping table of ISO 639-1 language code to ISO 3166-1 country code
* llm: properly propagate error to UI upon failing to load selected model
* UI: avoid duplicated calculation of token metrics
* lib: read & validate the magic number from the picked source file before executing the import
* UI: add "Learn More" hyperlinks to Error dialog upon model import failures
* lib: refactor the GgufMetadataReader to take InputStream instead of absolute path as argument
* lib: fix the `SIMD` typo in Tier description
* core: verify model file path is readable
* lib: add UnsupportedArchitectureException for triaged error message
* util: split FormatUtils into multiple utils for better readability
* UI: change benchmark screen from raw markdown to table view
* bugfix: reset preselection upon running the preselected model
* misc: linter issue
* bugfix: fix the malfunctioning monitoring switch
* UI: update Arm features indicator; fix the broken hyperlinks
* UI: add quick action buttons to benchmark screen's result card
* UI: hide share fab after clearing all benchmark results
* UI: fix the model unload dialog message; elevate the model card and hide it by default on Conversation screen;
* UI: hide the stubbing actions in Conversation screen
* UI: add show/hide stats control to conversation screen's assistant message bubble; fix placeholder
* UI: add a info button to explain token metrics
* misc: remove the redundant `Companion` added due to refactoring
* UI: show corresponding system metrics detailed info upon tapping RAM / storage / temperature indicator
* UI: add info button to System Prompt switch; expand the model card by default
* UI: disable tag & language chips; add section headers to explain what they are
* misc: replace top bar indicator's spacer with padding
* UI: merge the Model Selection and Model Management into a unified Models screen
* UI: split the ModelsManagementViewModel from a unified ModelsViewModel due to huge complexity
* UI: add model loading in progress view; polish the empty model info view
* UI: polish the bottom bars and info view when no models found; show loading in progress while fetching models
* build: [BREAKING] bump the versions of libraries and plugins
* UI: fix the breaking build
* UI: add Tooltip on Import FAB for user onboarding
* UI: adds AppPreferences to track user onboarding status
* UI: tracks user's first success on importing a model
* data: add hand crafted rules to filter the models fetched from HuggingFace API
* UI: update app name & about; polish top bars' indicators & buttons
* UI: polish Hugging Face download dialog UI
* UX: implement onboarding tooltips for model import and onboarding
* misc: use sentence case for CTA button labels
* [WIP] UI: add Arm color palette from Philip.Watson3
* UI: address Rojin's UX feedbacks
* UI: address Rojin's UX feedbacks - part 2
* UI: update Arm color palette from Philip.Watson3
* data: make sure fetch preselected models in the same order of their IDs
* UI: fix UI issues in the generic settings screen and navigation drawer
* nit: address Rojin's feedbacks on model import message again
* nit: append `®` to all `Arm` labels
* UI: extract a reusable InfoAlertDialog
* core: support GGML_CPU_ALL_VARIANTS on Android!
* core: restructure Kleidi-Llama library
* core: organizing cmake arguments
* data: sort preselected models according to device's available RAM
* app: update adaptive + themed + legacy icons and app name
* UI: fix the font size auto scaling for ArmFeaturesVisualizer
* core: further improve the performance on native methods
* UI: minor color palette changes; emphasize the bottom bar FABs; fix Settings Screen menu item label
* UI: make more room for assistant message bubble's width
* UI: better usage of tertiary colors to highlight model cards but not for warnings
* UI: fix the layout issue on large font sizes
* lib: support x86-64 by dynamically set Arm related definitions
* lib: replace the factory pattern for deprecated tiered lib loading with single instance pattern
* llama: update the library name in JNI and CMake project
* llama: update the library's package name and namespace
* llama: update the app's package name and namespace
* app: bump ksp version
* app: remove deprecated SystemUIController from accompanist by migrating to EdgeToEdge
* app: extract AppContent from MainActivity to a separate file in ui package
* lib: add File version for GGUF Magic number verification
* lib: perform engine state check inclusively instead of exclusively
* lib: change `LlamaTier` to `ArmCpuTier`
* lib: remove kleidi-llama related namings
* cleanup: remove Arm AI Chat/Playground app source code; replace with the basic sample app from https://github.com/hanyin-arm/Arm-AI-Chat-Sample
Note: the full Google Play version of the AI Chat app will be open sourced in another repo soon, therefore didn't go through the trouble of pruning the history using `git filter-repo` here.
* [WIP] doc: update main and Android README docs; add self to code owners
* lib: revert System.load back to System.loadLibrary
* jni: introduce a logging util to filter different logging levels on different build types
* lib: enable app optimization
* doc: replace stub Google Play app URL with the actual link; add screenshots; add my GitHub ID to maintainer list
* Remove cpu_features
* Fix linters issues in editorconfig-checker job
https://github.com/ggml-org/llama.cpp/actions/runs/19548770247/job/55974800633?pr=17413
* Remove unnecessary Android CMake flag
* purge include/cpu_features directory
---------
Co-authored-by: Han Yin <han.yin@arm.com>

79dbae034a
model-conversion : remove -fa option in model card template [no ci] (#18088)
This commit updates the causal model card template and removes the -fa option, as it is no longer required (fa is auto detected).

7b1db3d3b7
arg: clarify auto kvu/np being set on server (#17997)
* arg: clarify auto kvu/np being set on server
* improve docs
* use invalid_argument

9963b81f63
model-conversion : add note about verifying previous models (#18082)
This commit adds a note to the README in the model-conversion examples, advising developers to verify that previous versions of models pass logits verification before adding new models from the same family.

db81d5ec4b
model-conversion : use CONVERTED_EMBEDDING_MODEL for embedding_verify_logits (#18079)
This commit updates the embedding model verification script to use the CONVERTED_EMBEDDING_MODEL environment variable instead of using the EMBEDDING_MODEL_PATH (the original embedding model path) as the basis for the converted model file name. The motivation for this is that currently, if the converted embedding model file name differs from the original embedding model directory/name, the verification script will look for the wrong .bin files that were generated when running the models.

254098a279
common : refactor common_sampler + grammar logic changes (#17937)
* common : refactor common_sampler + grammar logic changes
* tests : increase max_tokens to get needed response
* batched : fix uninitialized samplers

77ad8542bd
model-conversion : cast logits to float32 (#18009)

3c6391e748
speculative-simple : free batch on exit (#17985)

fd1085ffb7
model-conversion : use CONVERTED_MODEL value for converted model [no ci] (#17984)
* model-conversion : use CONVERTED_MODEL value for converted model [no ci]

This commit updates the model verification scripts to use the CONVERTED_MODEL environment variable instead of using the MODEL_PATH (the original model path) as the basis for the converted model file name. The motivation for this is that currently, if the converted model file name differs from the original model directory/name, the verification scripts will look for the wrong .bin files that were generated when running the models. For example, the following steps were not possible:
```console
(venv) $ huggingface-cli download google/gemma-3-270m-it --local-dir ggml-org/gemma-3-270m
(venv) $ python3 convert_hf_to_gguf.py ggml-org/gemma-3-270m --outfile test-bf16.gguf --outtype bf16
(venv) $ cd examples/model-conversion/
(venv) $ export MODEL_PATH=../../ggml-org/gemma-3-270m
(venv) $ export CONVERTED_MODEL=../../test-bf16.gguf
(venv) $ make causal-verify-logits
...
Data saved to data/llamacpp-test-bf16.bin
Data saved to data/llamacpp-test-bf16.txt
Error: llama.cpp logits file not found: data/llamacpp-gemma-3-270m.bin
Please run scripts/run-converted-model.sh first to generate this file.
make: *** [Makefile:62: causal-verify-logits] Error 1
```
With the changes in this commit, the above steps will now work as expected.

380b4c984e
common: support negated args (#17919)
* args: support negated args
* update docs
* fix typo
* add more neg options
* Apply suggestions from code review
* rm duplicated arg
* fix LLAMA_ARG_NO_HOST
* add test

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

dada4c846d
model-conversion : remove max diff check in compare-logits [no ci] (#17954)
This commit removes the maximum difference check from compare-logits.py, which would stop early if the difference between the logits exceeded a threshold. The motivation for removing this is that it can be useful to be able to get the complete log for debugging/reporting purposes.
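
Without the early-exit threshold, a comparison of this kind reduces to reporting statistics over the full logit vectors, roughly like this sketch (illustrative paths and names):
```python
import numpy as np

# hypothetical dump files produced by the original and converted models
a = np.fromfile("data/pytorch-model.bin", dtype=np.float32)
b = np.fromfile("data/llamacpp-model.bin", dtype=np.float32)

diff = np.abs(a - b)
# report over the whole vector instead of stopping at a threshold,
# so the complete log is available for debugging/reporting
print(f"max diff:  {diff.max():.6f} at index {int(diff.argmax())}")
print(f"mean diff: {diff.mean():.6f}")
```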

6c2131773c
cli: new CLI experience (#17824)
* wip
* wip
* fix logging, add display info
* handle commands
* add args
* move old cli to llama-completion
* rm deprecation notice
* move server to a shared library
* move ci to llama-completion
* add loading animation
* add --show-timings arg
* add /read command, improve LOG_ERR
* add args for speculative decoding, enable show timings by default
* add arg --image and --audio
* fix windows build
* support reasoning_content
* fix llama2c workflow
* color default is auto
* fix merge conflicts
* properly fix color problem
* better loading spinner
* make sure to clean color on force-exit
* also clear input files on "/clear"
* simplify common_log_flush
* add warning in mtmd-cli
* implement console writer
* fix data race
* add attribute
* fix llama-completion and mtmd-cli
* add some notes about console::log
* fix compilation

---------

Co-authored-by: bandoti <bandoti@users.noreply.github.com>

2fa51c19b0
model-conversion : add token ids to prompt token output [no ci] (#17863)
This commit adds the token ids to the printed prompt outputs. The motivation for this is that it can be useful to see the actual token ids alongside the token strings for debugging.

8ce774a102
metal : fix build (#17799)
* metal : fix build
* tests : fix context destruction

c41bde6fbd
metal : add residency sets keep-alive heartbeat (#17766)
* examples : add idle
* metal : attach residency sets to queue
* idle : add link
* idle : adjust intervals
* metal : add residency sets keep-alive heartbeat
* cont : adjust default keep-alive time

817d743cc1
examples : add missing code block end marker [no ci] (#17756)
This commit adds the missing code block end marker in simple-cmake-pkg to correct the formatting. |