Compare commits

12 Commits

Author SHA1 Message Date
Georgi Gerganov
40d5358d3c tests : move save-load-state from examples to tests (#23336)
* tests : move save-load-state from examples to tests

- Move examples/save-load-state/ to tests/test-save-load-state.cpp
- Remove subdirectory reference from examples/CMakeLists.txt
- Add test to tests/CMakeLists.txt as a model test
- Remove CODEOWNERS entry for removed example directory

Assisted-by: llama.cpp:local pi

* cont : update ci
2026-05-21 14:41:50 +03:00
ScrewTSW
b65bb4baae server: expose prompt token counts in /slots endpoint (#23454)
Add n_prompt_tokens, n_prompt_tokens_processed, and n_prompt_tokens_cache
to the /slots JSON response. These fields are already tracked internally
but were not exposed, making it impossible for clients to monitor prompt
evaluation progress during processing.
2026-05-21 13:29:13 +02:00
Georgi Gerganov
a1a69f777a metal : optimize concat kernel and fix set kernel threads (#23411)
* metal : fix GGML_OP_SET kernel threads

* tests : extend test_cpy to support different src/dst shapes

Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.

- Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
- Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
- Tests exercise 1024 boundary, small shapes, and large dimensionality changes
- Fixed dangling reference bug (storing & to temporary std::array)
- Updated all existing test calls with permute/transpose args for compatibility

Assisted-by: llama.cpp:local pi

* metal : optimize concat kernel with row batching for small widths

When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.

- Dispatch nth = min(256, ne0) threads per group
- Calculate nrptg (rows per threadgroup) to fill up to 256 threads
- Update kernel index calculation to handle the row batching
- Add boundary check for i1 >= ne1
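
The launch geometry described in the list above can be sketched as a small standalone function. This is only an illustration under stated assumptions: `concat_dispatch` and `plan_concat_dispatch` are hypothetical names that do not exist in ggml, and only the 256-thread cap and the rounding rules are taken from the commit.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the launch geometry; not a ggml type.
struct concat_dispatch {
    int nth;   // threads along x in each threadgroup
    int nrptg; // rows per threadgroup (threads along y)
    int nw0;   // number of threadgroups along x
};

// Sketch of the batching rule for a tensor with ne0 columns and ne1 rows.
inline concat_dispatch plan_concat_dispatch(int ne0, int ne1) {
    assert(ne0 > 0 && ne1 > 0);
    const int max_threads = 256;

    concat_dispatch d;
    d.nth   = std::min(max_threads, ne0);
    d.nrptg = 1;

    // Narrow rows: pack several of them into one threadgroup.
    if (d.nth < max_threads) {
        d.nrptg = std::min((max_threads + d.nth - 1) / d.nth, ne1);
        if (d.nrptg * d.nth > max_threads) {
            d.nrptg = max_threads / d.nth;
        }
    }

    // Round up so that every row is covered by some threadgroup.
    d.nw0 = (ne1 + d.nrptg - 1) / d.nrptg;
    return d;
}
```

With `ne0 = 32` and `ne1 = 100` this gives 8 rows per threadgroup and 13 threadgroups, which together cover 104 row slots. The 4 surplus slots are why the kernel needs the `i1 >= ne1` boundary check.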

Assisted-by: llama.cpp:local pi

* tests : clean-up

* tests : refactor CPY shape tests to use dimension permutations

Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against canonical sorted and reverse dst, skipping identical shapes.
Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32).
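
As a rough illustration of the loop described above, the shape pairs could be enumerated as follows. The names `shape4`, `cpy_case` and `make_cpy_cases` are hypothetical, the type selection (F32, F16, Q4_0) is left out, and the pairing rule is one reading of the commit message rather than the test's actual code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using shape4 = std::array<int64_t, 4>;

inline int64_t n_elements(const shape4 & ne) {
    return ne[0] * ne[1] * ne[2] * ne[3];
}

// Hypothetical pair of source and destination shapes for one CPY test.
struct cpy_case {
    shape4 ne_src;
    shape4 ne_dst;
};

inline std::vector<cpy_case> make_cpy_cases() {
    const shape4 sorted   = {3, 5, 7, 32};
    const shape4 reversed = {32, 7, 5, 3};
    const shape4 dsts[2]  = {sorted, reversed};

    std::vector<cpy_case> cases;
    shape4 src = sorted; // start sorted so next_permutation visits all 24 orders
    do {
        for (const shape4 & dst : dsts) {
            if (src == dst) {
                continue; // identical shapes are not a reshape
            }
            // A reshaping copy requires matching element counts.
            assert(n_elements(src) == n_elements(dst));
            cases.push_back({src, dst});
        }
    } while (std::next_permutation(src.begin(), src.end()));
    return cases;
}
```

Every permutation keeps the product 3 * 5 * 7 * 32 = 3360, so the element-count requirement holds for all pairs by construction.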

Assisted-by: llama.cpp:local pi
2026-05-21 13:34:08 +03:00
Aman Gupta
52fb93a2bd server : free draft/MTP resources on sleep to fix VRAM leak (#23461)
The destroy() function in server_context_impl only cleaned up the main
model and context (via llama_init.reset()) but did not free the speculative
decoder (spec), draft context (ctx_dft), or draft model (model_dft).

For MTP (Multi-Token Prediction) models, ctx_dft holds GPU-allocated
resources (KV cache, compute buffers) that are not freed when entering
the sleeping state. On each sleep/resume cycle, new resources are
allocated without the old ones being freed, leading to a VRAM leak
that eventually crashes the server with out-of-memory errors.

Fix by explicitly resetting spec, ctx_dft, and model_dft in destroy()
before resetting llama_init, ensuring proper cleanup order to avoid
use-after-free.
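
The cleanup order can be illustrated with a minimal sketch in which each resource is a `std::unique_ptr` that logs its own destruction. Everything here is a stand-in: `tracked`, `server_resources_sketch` and `run_destroy_sketch` are hypothetical, and only the member names and the reset order come from the description above.

```cpp
#include <memory>
#include <string>
#include <vector>

// Stand-in resource that records its name when it is freed.
struct tracked {
    std::string name;
    std::vector<std::string> * log;
    ~tracked() { log->push_back(name); }
};

// Hypothetical mirror of the members mentioned in the commit message.
struct server_resources_sketch {
    std::unique_ptr<tracked> llama_init; // main model and context
    std::unique_ptr<tracked> model_dft;  // draft model
    std::unique_ptr<tracked> ctx_dft;    // draft context
    std::unique_ptr<tracked> spec;       // speculative decoder

    void destroy() {
        // Dependents first, the main model and context last.
        spec.reset();
        ctx_dft.reset();
        model_dft.reset();
        llama_init.reset();
    }
};

// Builds the four resources, destroys them, and returns the order of frees.
inline std::vector<std::string> run_destroy_sketch() {
    std::vector<std::string> log;
    server_resources_sketch s;
    s.llama_init.reset(new tracked{"llama_init", &log});
    s.model_dft.reset(new tracked{"model_dft", &log});
    s.ctx_dft.reset(new tracked{"ctx_dft", &log});
    s.spec.reset(new tracked{"spec", &log});
    s.destroy();
    return log;
}
```

Resetting every owner explicitly means a later sleep/resume cycle starts from empty pointers instead of stacking new allocations on top of old ones.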

ref: https://github.com/ggml-org/llama.cpp/issues/23395

Assisted-by: llama.cpp:local pi
2026-05-21 16:11:11 +08:00
Pascal
c9021714e8 server: re-inject subcommand when router spawns children under unified binary (#23442)
2026-05-21 10:09:19 +02:00
Adrien Gallouët
1d7ab2b947 app : add batched-bench, fit-params, quantize & perplexity (#23459)
* app : add batched-bench, fit-params, quantize & perplexity

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add missing main.cpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add EOL

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-05-21 10:29:44 +03:00
Aman Gupta
12e5d99078 mtp: use inp_out_ids for skipping logit computation (#23433)
When doing a follow-up decode for the draft model, the logits were always computed even though they are not required.
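
The idea can be sketched without ggml: gather only the rows whose outputs are requested, then run the expensive output projection on that smaller matrix. `gather_rows` and `project` are hypothetical toy helpers on nested vectors. They only stand in conceptually for the row selection by `inp_out_ids` and for the output head.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using matrix = std::vector<std::vector<float>>;

// Keep only the rows listed in out_ids (conceptually a row gather).
inline matrix gather_rows(const matrix & hidden, const std::vector<std::size_t> & out_ids) {
    matrix out;
    out.reserve(out_ids.size());
    for (std::size_t id : out_ids) {
        out.push_back(hidden.at(id));
    }
    return out;
}

// Toy output head: logits[r][v] = dot(rows[r], head[v]).
inline matrix project(const matrix & rows, const matrix & head) {
    matrix logits(rows.size(), std::vector<float>(head.size(), 0.0f));
    for (std::size_t r = 0; r < rows.size(); ++r) {
        for (std::size_t v = 0; v < head.size(); ++v) {
            assert(head[v].size() == rows[r].size());
            for (std::size_t d = 0; d < rows[r].size(); ++d) {
                logits[r][v] += rows[r][d] * head[v][d];
            }
        }
    }
    return logits;
}
```

If `out_ids` is empty, `project` receives zero rows and does no work, which is the saving this commit is after.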
2026-05-21 15:23:14 +08:00
Kashif Rasul
7ea23ddf7b vocab : add Carbon-3B (HybridDNATokenizer) support (#23410)
* vocab : add Carbon-3B (HybridDNATokenizer) support

Adds a new BPE pre-type LLAMA_VOCAB_PRE_TYPE_CARBON for the
HybridDNATokenizer used by HuggingFaceBio/Carbon-{500M,3B,8B}.
The base BPE is Qwen3-4B-Base's; what differs is that text inside
<dna>...</dna> regions is chunked into fixed 6-mers (right-padded
with 'A' on the trailing partial), and any base outside ACGT maps
to <oov>.
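
A minimal sketch of the chunking rule for the text between the tags, assuming a hypothetical helper `chunk_dna_kmers` that returns k-mer strings instead of token ids so that it needs no vocabulary. Uppercasing the input first is an assumption taken from the "lowercase" test case listed below.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a DNA string into fixed k-mers.
inline std::vector<std::string> chunk_dna_kmers(std::string seq, std::size_t k = 6) {
    assert(k > 0);
    std::vector<std::string> out;

    // Normalise lowercase bases to uppercase.
    for (char & c : seq) {
        if (c >= 'a' && c <= 'z') {
            c = static_cast<char>(c - 'a' + 'A');
        }
    }

    for (std::size_t i = 0; i < seq.size(); i += k) {
        std::string kmer = seq.substr(i, k);
        kmer.append(k - kmer.size(), 'A'); // right-pad the trailing partial k-mer

        bool valid = true;
        for (char c : kmer) {
            if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
                valid = false;
            }
        }
        // Any base outside ACGT turns the whole k-mer into <oov>.
        out.push_back(valid ? kmer : std::string("<oov>"));
    }
    return out;
}
```

For example, the 10-base input `ACGTACGTAC` yields `ACGTAC` followed by `GTACAA`, where the last two `A` characters are padding.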

* src/llama-vocab.{h,cpp}: new pre-type, dispatched from
  llm_tokenizer_bpe_session::tokenize.
* src/llama-vocab-carbon.h: pure helpers (tokenize_carbon,
  emit_dna_kmers) factored out for unit testing — no llama_vocab
  dependency, vocab access goes through a std::function.
* conversion/base.py: detect HybridDNATokenizer by class name in
  get_vocab_base_pre (chktxt collides with Qwen3 base since it
  has no <dna>), and pass trust_remote_code=True in get_vocab_base
  so the custom tokenizer class can load.
* tests/test-tokenizer-carbon.cpp: 12 cases covering single 6-mer,
  multi 6-mer, lowercase, invalid base -> <oov>, partial k-mer
  right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
  two regions, vocab miss.

* vocab : align Carbon-3B changes with llama.cpp conventions

* Fold tokenize_carbon + emit_dna_kmers inline into
  llm_tokenizer_bpe_session (drop src/llama-vocab-carbon.h),
  matching how every other tokenizer keeps its helpers inside
  llama-vocab.cpp.

* Replace the standalone unit test with the conventional
  test-tokenizer-0 row backed by models/ggml-vocab-carbon.gguf
  (vocab-only conversion) + .inp/.out fixtures covering single
  6-mer, multi 6-mer, lowercase, invalid base -> <oov>, partial
  right-pad, mixed text+DNA, empty <dna></dna>, unterminated <dna>,
  two regions.

* Register "carbon" in convert_hf_to_gguf_update.py's model list
  (pointing at HuggingFaceBio/Carbon-3B) and teach both
  AutoTokenizer call sites in the updater to pass
  trust_remote_code=True for it, matching how t5 is special-cased.

* vocab : move Carbon dispatch to _set_vocab_carbon + LlamaModel branch

Refactor the conversion-side changes to follow the per-tokenizer-family
convention used by _set_vocab_qwen, _set_vocab_interns1, _set_vocab_glm,
etc. instead of conditionalising the shared get_vocab_base /
get_vocab_base_pre paths.

* conversion/base.py: add _set_vocab_carbon — self-contained, loads
  with trust_remote_code=True so HybridDNATokenizer's merged Qwen3 + DNA
  vocab is visible, writes tokenizer.ggml.pre = "carbon" directly.
* conversion/llama.py: branch in LlamaModel.set_vocab on
  tokenizer_config.json["tokenizer_class"] == "HybridDNATokenizer" and
  dispatch to _set_vocab_carbon. Same precedent as conversion/bert.py
  (tokenizer_class branch between BertTokenizer / RobertaTokenizer) and
  conversion/phi.py.
* conversion/base.py: revert the conditional in get_vocab_base and the
  class-name short-circuit in the auto-generated get_vocab_base_pre.

* tests : expand ggml-vocab-carbon.gguf fixtures with model-card examples

Add 6 cases from the Carbon-3B model card on top of the existing edge
coverage: the unterminated basic-completion prompt, the closed 33-bp
example, the metadata-conditioned prompt (with <vertebrate_mammalian>
and <protein_coding_region> which BPE-decompose since they are not in
the vocab), the documented anti-pattern of raw DNA without <dna> tags,
and the two likelihood-scoring examples. Brings the suite to 19 cases.

* vocab : promote HybridDNATokenizer to its own LLAMA_VOCAB_TYPE

Refactor per upstream review:

> This should be its own tokenizer model, ie. carbonhybriddna instead
> of gpt2 and not carbon pre-tokenizer. That way you can keep the
> correct pre-tokenizer, in case that ever changes.

Previously the tokenizer was modelled as LLAMA_VOCAB_TYPE_BPE plus a
new LLAMA_VOCAB_PRE_TYPE_CARBON, which (a) put a CARBON-specific
branch inside llm_tokenizer_bpe_session::tokenize (only existing
pre-types differ in regex, not dispatch logic), and (b) conflated
"hybrid DNA tokenization" with "Qwen3 BPE pre-tokenizer".

This change moves it to its own vocab type, peer to PLAMO2, with the
GGUF model name matching the HF tokenizer class (HybridDNATokenizer):

* include/llama.h: new LLAMA_VOCAB_TYPE_HYBRIDDNA = 7.
* src/llama-vocab.cpp: new llm_tokenizer_hybriddna + session that
  owns std::unique_ptr<llm_tokenizer_bpe> for non-<dna> text and
  routes raw text through a DNA-aware splitter; wired into
  init_tokenizer, tokenize, type_name, byte_to_token, and the
  BPE-style token_to_piece case (DNA k-mers + <dna>/</dna>/<oov>
  are pure ASCII, so byte-level BPE decoding handles them).
  LLAMA_VOCAB_TYPE_HYBRIDDNA gets its own branch in the vocab-type
  config block alongside SPM/WPM/UGM/RWKV, where pre_type is set
  to QWEN2 and the matching add_space_prefix / escape_whitespaces /
  clean_spaces flags are applied — mirroring qwen2's BPE path so
  byte-level BPE merging stays bit-identical to the Python
  reference for non-DNA text.
* src/llama-vocab.h: drop the short-lived LLAMA_VOCAB_PRE_TYPE_CARBON.
* conversion/base.py: _set_vocab_hybriddna writes
  tokenizer.ggml.model = "hybriddna" (no separate pre).
* conversion/llama.py: dispatch on tokenizer_class ==
  "HybridDNATokenizer" same as bert.py / phi.py do.
* models/ggml-vocab-hybriddna.gguf{,.inp,.out}: renamed fixture +
  regenerated metadata.
* convert_hf_to_gguf_update.py: drop the stale chkhsh entry and
  trust_remote_code special-case (no longer needed since dispatch
  is now class-name driven, not chkhsh).

Verified end-to-end against HuggingFaceBio/Carbon-{500M,3B,8B}:
tokenization is bit-identical to the Python HybridDNATokenizer for
all 19 test fixtures plus the model-card metadata-conditioned
prompt; greedy completion produces the same DNA continuation as
the Python reference; spec-dec with 500M as draft for 8B still
works.

* vocab : relax llm_tokenizer_bpe assert to allow HYBRIDDNA

* vocab : drop llm_tokenizer_bpe vocab-type assert

* vocab : write tokenizer.ggml.pre for HYBRIDDNA, share BPE dispatch

* vocab : assert BPE or HYBRIDDNA in llm_tokenizer_bpe

* vocab : annotate #endif with PRETOKENIZERDEBUG

* vocab : drop local hybriddna fixture (moves to ggml-org/vocabs)

* deduplicate

* simplify

* simplify

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-05-21 08:34:32 +02:00
Ruixiang Wang
2fc8d1851e doc: fix spec mtp typo (#23435)
2026-05-21 09:30:55 +03:00
Aleksander Grygier
5e932a1c8d ui: Improve Git Hooks for UI development (#23403)
* refactor: Improve Git Hooks for UI development

* fix: Address review comments

* fix: Use absolute git path for `/hooks`

Co-authored-by: Pascal <admin@serveurperso.com>

---------

Co-authored-by: Pascal <admin@serveurperso.com>
2026-05-21 08:27:50 +02:00
Matt Corallo
2754ce1b3e ggml : Check the right iface method before using the fallback 2d get (#23306)
Probably no backends implement only one of 2d get/set, but this
might be annoying for some future backend developer trying to add
2d get/set.
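
To make the failure mode concrete, here is a small sketch with a hypothetical `buffer_iface_sketch` that holds only the two optional callbacks. The real ggml interface has more members and different signatures. Only the shape of the condition is taken from the fix.

```cpp
#include <cstddef>

// Hypothetical, heavily reduced stand-in for a backend buffer interface.
using copy_2d_fn = void (*)();

struct buffer_iface_sketch {
    copy_2d_fn get_tensor_2d;
    copy_2d_fn set_tensor_2d;
};

inline void noop_2d() {}

// The get path must test the get callback, not the set callback.
inline bool get_2d_needs_fallback(std::size_t n_copies, const buffer_iface_sketch & iface) {
    return n_copies <= 1 || iface.get_tensor_2d == nullptr;
}
```

A backend that implements only `set_tensor_2d` now takes the row-by-row fallback for reads instead of calling through a null `get_tensor_2d` pointer.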
2026-05-21 09:24:40 +03:00
Daniel Elliott
eeeaf6180b llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (#23131)
When a model has zero non-SWA attention layers (e.g. a SWA-only slice of Gemma 4),
the base KV cache has no layer tensors. The input tensors (self_k_idxs, self_v_idxs,
self_kq_mask) are created as graph input nodes but never consumed by any compute node,
so the backend scheduler never allocates a buffer for them. Calling
mctx->get_base()->set_input_k_idxs() on an unallocated tensor then hits
GGML_ASSERT(buffer) at ggml-backend.cpp:194.

The same scenario applies symmetrically: if a model had zero SWA layers, the SWA
tensors would be unallocated.

Fix: guard both the base and SWA set_input calls with null/buffer checks, matching
the pattern already used by llm_graph_input_mem_hybrid_iswa::set_input (line ~674)
which has the comment: 'base tensors may not be allocated if there are no non-SWA
attention layers'.

Also fix can_reuse() in the same class to skip the ne[0] and kq_mask checks for
unallocated tensors, preventing a null-dereference on the reuse path.
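
The guard pattern can be sketched in isolation as follows. `tensor_sketch` and `attn_input_sketch` are hypothetical stand-ins that only count how often each cache would be updated. The real code calls the `set_input_*` methods on the base and SWA contexts instead.

```cpp
// Hypothetical stand-in for a graph input tensor; buffer is null until the
// scheduler allocates it.
struct tensor_sketch {
    void * buffer = nullptr;
};

inline bool is_allocated(const tensor_sketch * t) {
    return t != nullptr && t->buffer != nullptr;
}

struct attn_input_sketch {
    tensor_sketch * self_k_idxs     = nullptr; // base (non-SWA) cache input
    tensor_sketch * self_k_idxs_swa = nullptr; // SWA cache input
    int base_updates = 0;
    int swa_updates  = 0;

    void set_input() {
        // Skip the base cache when no non-SWA layer consumed its inputs.
        if (is_allocated(self_k_idxs)) {
            ++base_updates;
        }
        // Symmetric guard for models without SWA layers.
        if (is_allocated(self_k_idxs_swa)) {
            ++swa_updates;
        }
    }
};
```

For a SWA-only model the base tensor exists but has no buffer, so only the SWA branch runs and the assertion on the buffer is never reached.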
2026-05-21 09:20:51 +03:00
38 changed files with 619 additions and 188 deletions

View File

@@ -49,7 +49,6 @@
/examples/parallel/ @ggerganov
/examples/passkey/ @ggerganov
/examples/retrieval/ @ggerganov
/examples/save-load-state/ @ggerganov
/examples/speculative-simple/ @ggerganov
/examples/speculative/ @ggerganov
/ggml/cmake/ @ggerganov

View File

@@ -3,7 +3,16 @@ set(TARGET llama-app)
add_executable(${TARGET} llama.cpp)
set_target_properties(${TARGET} PROPERTIES OUTPUT_NAME llama)
target_link_libraries(${TARGET} PRIVATE llama-server-impl llama-cli-impl llama-completion-impl llama-bench-impl)
target_link_libraries(${TARGET} PRIVATE
llama-server-impl
llama-cli-impl
llama-completion-impl
llama-bench-impl
llama-batched-bench-impl
llama-fit-params-impl
llama-quantize-impl
llama-perplexity-impl
)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
if(LLAMA_TOOLS_INSTALL)

View File

@@ -1,15 +1,22 @@
#include "build-info.h"
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>
// visible
int llama_server(int argc, char ** argv);
int llama_cli(int argc, char ** argv);
// hidden
int llama_completion(int argc, char ** argv);
int llama_bench(int argc, char ** argv);
int llama_batched_bench(int argc, char ** argv);
int llama_fit_params(int argc, char ** argv);
int llama_quantize(int argc, char ** argv);
int llama_perplexity(int argc, char ** argv);
static int help(int argc, char ** argv);
static int version(int argc, char ** argv);
@@ -22,12 +29,16 @@ struct command {
};
static const command cmds[] = {
{"serve", "HTTP API server", {"server"}, false, llama_server },
{"cli", "Command-line interactive interface", {"client"}, false, llama_cli },
{"completion", "Text completion", {"complete"}, true, llama_completion },
{"bench", "Benchmarking tool", {}, true, llama_bench },
{"version", "Show version", {}, true, version },
{"help", "Show available commands", {}, true, help },
{"serve", "HTTP API server", {"server"}, false, llama_server },
{"cli", "Command-line interactive interface", {"client"}, false, llama_cli },
{"completion", "Text completion", {"complete"}, true, llama_completion },
{"bench", "Benchmark prompt processing and text generation", {}, true, llama_bench },
{"batched-bench", "Benchmark batched decoding performance", {}, true, llama_batched_bench},
{"fit-params", "Compute parameters to fit a model in device memory", {}, true, llama_fit_params },
{"quantize", "Quantize a model", {}, true, llama_quantize },
{"perplexity", "Compute model perplexity and KL divergence", {}, true, llama_perplexity },
{"version", "Show version", {}, true, version },
{"help", "Show available commands", {}, true, help },
};
static int version(int argc, char ** argv) {
@@ -67,6 +78,14 @@ int main(int argc, char ** argv) {
for (const auto & cmd : cmds) {
if (matches(arg, cmd)) {
// router spawns children through this same binary, it needs the
// subcommand to relaunch as 'llama serve' and not bare options
#ifdef _WIN32
_putenv_s("LLAMA_APP_CMD", cmd.name);
#else
setenv("LLAMA_APP_CMD", cmd.name, 1);
#endif
return cmd.func(argc - 1, argv + 1);
}
}

View File

@@ -461,10 +461,10 @@ function gg_run_qwen3_0_6b {
(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test} -ngl 99 -c 1024 -b 512 --chunks 2 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 1024 -fa off --no-op-offload) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 1024 -fa on --no-op-offload) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 1024 -fa off ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 1024 -fa on ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/test-save-load-state --model ${model_q4_0} -ngl 10 -c 1024 -fa off --no-op-offload) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/test-save-load-state --model ${model_q4_0} -ngl 10 -c 1024 -fa on --no-op-offload) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/test-save-load-state --model ${model_q4_0} -ngl 99 -c 1024 -fa off ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/test-save-load-state --model ${model_q4_0} -ngl 99 -c 1024 -fa on ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
function check_ppl {
qnt="$1"

View File

@@ -1610,6 +1610,42 @@ class TextModel(ModelBase):
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
special_vocab.add_to_gguf(self.gguf_writer)
def _set_vocab_hybriddna(self):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab)) # ty: ignore[unresolved-attribute]
assert max(tokenizer.vocab.values()) < vocab_size # ty: ignore[unresolved-attribute]
reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()} # ty: ignore[unresolved-attribute]
added_vocab = tokenizer.get_added_vocab() # ty: ignore[unresolved-attribute]
added_tokens_decoder = tokenizer.added_tokens_decoder # ty: ignore[unresolved-attribute]
tokens: list[str] = []
toktypes: list[int] = []
for i in range(vocab_size):
if i not in reverse_vocab:
tokens.append(f"[PAD{i}]")
toktypes.append(gguf.TokenType.UNUSED)
else:
token: str = reverse_vocab[i]
if token in added_vocab:
if added_tokens_decoder[i].special or self.does_token_look_special(token):
toktypes.append(gguf.TokenType.CONTROL)
else:
toktypes.append(gguf.TokenType.USER_DEFINED)
else:
toktypes.append(gguf.TokenType.NORMAL)
tokens.append(token)
tokpre = self.get_vocab_base_pre(tokenizer)
self.gguf_writer.add_tokenizer_model("hybriddna")
self.gguf_writer.add_tokenizer_pre(tokpre)
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_types(toktypes)
special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=True)
special_vocab.add_to_gguf(self.gguf_writer)
def _set_vocab_qwen(self):
from .qwen import QwenModel

View File

@@ -51,6 +51,15 @@ class LlamaModel(TextModel):
if path_tekken_json.is_file() and not path_tokenizer_json.is_file():
self._set_vocab_mistral()
tokenizer_config_file = self.dir_model / 'tokenizer_config.json'
if tokenizer_config_file.is_file():
with open(tokenizer_config_file, "r", encoding="utf-8") as f:
tokenizer_config_json = json.load(f)
if (add_prefix_space := tokenizer_config_json.get("add_prefix_space")) is not None:
self.gguf_writer.add_add_space_prefix(add_prefix_space)
if tokenizer_config_json.get("tokenizer_class") == "HybridDNATokenizer":
return self._set_vocab_hybriddna()
try:
self._set_vocab_sentencepiece()
except FileNotFoundError:
@@ -72,13 +81,6 @@ class LlamaModel(TextModel):
special_vocab._set_special_token("eot", 32010)
special_vocab.add_to_gguf(self.gguf_writer)
tokenizer_config_file = self.dir_model / 'tokenizer_config.json'
if tokenizer_config_file.is_file():
with open(tokenizer_config_file, "r", encoding="utf-8") as f:
tokenizer_config_json = json.load(f)
if "add_prefix_space" in tokenizer_config_json:
self.gguf_writer.add_add_space_prefix(tokenizer_config_json["add_prefix_space"])
# Apply to granite small models only
if self.hparams.get("vocab_size", 32000) == 49152:
self.gguf_writer.add_add_bos_token(False)

View File

@@ -247,7 +247,7 @@ Specifies a comma-separated list of speculative decoding types to use.
|------|-------------|
| `none` | No speculative decoding (default) |
| `draft-simple` | Use a simple draft model for speculation |
| `draft-mtp` | Use Masked Token Prediction (MTP) heads from the main model |
| `draft-mtp` | Use Multi Token Prediction (MTP) heads from the main model |
| `ngram-cache` | Use n-gram cache lookup |
| `ngram-simple` | Use simple n-gram pattern matching |
| `ngram-map-k` | Use n-gram pattern matching with n-gram-keys |

View File

@@ -27,7 +27,6 @@ else()
add_subdirectory(parallel)
add_subdirectory(passkey)
add_subdirectory(retrieval)
add_subdirectory(save-load-state)
add_subdirectory(simple)
add_subdirectory(simple-chat)
add_subdirectory(speculative)

View File

@@ -1,5 +0,0 @@
set(TARGET llama-save-load-state)
add_executable(${TARGET} save-load-state.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -379,7 +379,7 @@ void ggml_backend_tensor_get_2d(const struct ggml_tensor * tensor, void * data,
ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer;
GGML_ASSERT(buf != NULL && "tensor buffer not set");
if (n_copies <= 1 || buf->iface.set_tensor_2d == NULL) {
if (n_copies <= 1 || buf->iface.get_tensor_2d == NULL) {
for (size_t i = 0; i < n_copies; i++) {
ggml_backend_tensor_get(tensor, (char *) data + i*stride_data, offset + i*stride_tensor, size);
}

View File

@@ -564,9 +564,20 @@ int ggml_metal_op_concat(ggml_metal_op_t ctx, int idx) {
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op->src[1]), 2);
ggml_metal_encoder_set_buffer (enc, ggml_metal_get_buffer_id(op), 3);
const int nth = std::min(1024, ne0);
int nth = std::min(256, ne0);
ggml_metal_encoder_dispatch_threadgroups(enc, ne1, ne2, ne3, nth, 1, 1);
// when rows are small, we can batch them together in a single threadgroup
int nrptg = 1;
if (nth < 256) {
nrptg = std::min((256 + nth - 1) / nth, ne1);
if (nrptg * nth > 256) {
nrptg = 256 / nth;
}
}
const int nw0 = (ne1 + nrptg - 1) / nrptg;
ggml_metal_encoder_dispatch_threadgroups(enc, nw0, ne2, ne3, nth, nrptg, 1);
return 1;
}
@@ -1786,7 +1797,7 @@ int ggml_metal_op_set(ggml_metal_op_t ctx, int idx) {
nk0 = ne10/ggml_blck_size(op->type);
}
int nth = std::min<int>(nk0, ggml_metal_pipeline_max_theads_per_threadgroup(pipeline));
int nth = std::min<int>(nk0*ne11, 256);
// when rows are small, we can batch them together in a single threadgroup
int nrptg = 1;
@@ -1797,7 +1808,7 @@ int ggml_metal_op_set(ggml_metal_op_t ctx, int idx) {
nrptg = (nth + nk0 - 1)/nk0;
nth = nk0;
if (nrptg*nth > ggml_metal_pipeline_max_theads_per_threadgroup(pipeline)) {
if (nrptg*nth > 256) {
nrptg--;
}
}

View File

@@ -7486,7 +7486,11 @@ kernel void kernel_concat(
const int i3 = tgpig.z;
const int i2 = tgpig.y;
const int i1 = tgpig.x;
const int i1 = ntg.y == 1 ? tgpig.x : tgpig.x*ntg.y + tpitg.y;
if (i1 >= args.ne1) {
return;
}
int o[4] = {0, 0, 0, 0};
o[args.dim] = args.dim == 0 ? args.ne00 : (args.dim == 1 ? args.ne01 : (args.dim == 2 ? args.ne02 : args.ne03));

View File

@@ -500,15 +500,21 @@ bool llm_graph_input_attn_k::can_reuse(const llm_graph_params & params) {
}
void llm_graph_input_attn_kv_iswa::set_input(const llama_ubatch * ubatch) {
mctx->get_base()->set_input_k_idxs(self_k_idxs, ubatch);
mctx->get_base()->set_input_v_idxs(self_v_idxs, ubatch);
// base tensors may not be allocated if there are no non-SWA attention layers
if (self_k_idxs && self_k_idxs->buffer) {
mctx->get_base()->set_input_k_idxs(self_k_idxs, ubatch);
mctx->get_base()->set_input_v_idxs(self_v_idxs, ubatch);
mctx->get_base()->set_input_kq_mask(self_kq_mask, ubatch, cparams.causal_attn);
mctx->get_base()->set_input_kq_mask(self_kq_mask, ubatch, cparams.causal_attn);
}
mctx->get_swa()->set_input_k_idxs(self_k_idxs_swa, ubatch);
mctx->get_swa()->set_input_v_idxs(self_v_idxs_swa, ubatch);
// swa tensors may not be allocated if there are no SWA attention layers
if (self_k_idxs_swa && self_k_idxs_swa->buffer) {
mctx->get_swa()->set_input_k_idxs(self_k_idxs_swa, ubatch);
mctx->get_swa()->set_input_v_idxs(self_v_idxs_swa, ubatch);
mctx->get_swa()->set_input_kq_mask(self_kq_mask_swa, ubatch, cparams.causal_attn);
mctx->get_swa()->set_input_kq_mask(self_kq_mask_swa, ubatch, cparams.causal_attn);
}
if (self_k_rot) {
mctx->get_base()->set_input_k_rot(self_k_rot);
@@ -534,14 +540,21 @@ bool llm_graph_input_attn_kv_iswa::can_reuse(const llm_graph_params & params) {
bool res = true;
res &= self_k_idxs->ne[0] == params.ubatch.n_tokens;
//res &= self_v_idxs->ne[0] == params.ubatch.n_tokens; // TODO: need to move this to the unified cache and check there
// base tensors may not be allocated if there are no non-SWA attention layers
if (self_k_idxs && self_k_idxs->buffer) {
res &= self_k_idxs->ne[0] == params.ubatch.n_tokens;
//res &= self_v_idxs->ne[0] == params.ubatch.n_tokens; // TODO: need to move this to the unified cache and check there
res &= self_k_idxs_swa->ne[0] == params.ubatch.n_tokens;
//res &= self_v_idxs_swa->ne[0] == params.ubatch.n_tokens; // TODO: need to move this to the unified cache and check there
res &= can_reuse_kq_mask(self_kq_mask, mctx->get_base(), params.ubatch, params.cparams);
}
res &= can_reuse_kq_mask(self_kq_mask, mctx->get_base(), params.ubatch, params.cparams);
res &= can_reuse_kq_mask(self_kq_mask_swa, mctx->get_swa(), params.ubatch, params.cparams);
// swa tensors may not be allocated if there are no SWA attention layers
if (self_k_idxs_swa && self_k_idxs_swa->buffer) {
res &= self_k_idxs_swa->ne[0] == params.ubatch.n_tokens;
//res &= self_v_idxs_swa->ne[0] == params.ubatch.n_tokens; // TODO: need to move this to the unified cache and check there
res &= can_reuse_kq_mask(self_kq_mask_swa, mctx->get_swa(), params.ubatch, params.cparams);
}
return res;
}

View File

@@ -530,6 +530,8 @@ struct llm_tokenizer_bpe : llm_tokenizer {
struct llm_tokenizer_bpe_session {
llm_tokenizer_bpe_session(const llama_vocab & vocab, const llm_tokenizer_bpe & tokenizer) : vocab(vocab), tokenizer(tokenizer) {}
virtual ~llm_tokenizer_bpe_session() = default;
static void append(const llama_token token_id, std::vector<llama_token> & output) {
output.push_back(token_id);
}
@@ -567,7 +569,7 @@ struct llm_tokenizer_bpe_session {
}
}
void tokenize(const std::string & text, std::vector<llama_token> & output) {
virtual void tokenize(const std::string & text, std::vector<llama_token> & output) {
int final_prev_index = -1;
const auto word_collection = unicode_regex_split(text, tokenizer.regex_exprs, tokenizer.byte_encode);
@@ -1579,6 +1581,95 @@ private:
const llm_tokenizer_plamo2 & tokenizer;
};
struct llm_tokenizer_hybriddna_session : llm_tokenizer_bpe_session {
llm_tokenizer_hybriddna_session(const llama_vocab & vocab, const llm_tokenizer_bpe & tokenizer) : llm_tokenizer_bpe_session{vocab, tokenizer}, vocab{vocab} {}
void tokenize(const std::string & text, std::vector<llama_token> & output) override {
static const std::string open_tag = "<dna>";
static const std::string close_tag = "</dna>";
const auto dna_begin_id = vocab.text_to_token(open_tag);
const auto dna_end_id = vocab.text_to_token(close_tag);
const auto dna_oov_id = vocab.text_to_token("<oov>");
// Fall back to plain BPE if the DNA pieces aren't in the vocab.
if (dna_begin_id == LLAMA_TOKEN_NULL || dna_end_id == LLAMA_TOKEN_NULL || dna_oov_id == LLAMA_TOKEN_NULL) {
llm_tokenizer_bpe_session::tokenize(text, output);
return;
}
const size_t k = 6;
size_t pos = 0;
while (pos < text.size()) {
const size_t start = text.find(open_tag, pos);
if (start == std::string::npos) {
if (pos < text.size()) {
llm_tokenizer_bpe_session::tokenize(text.substr(pos), output);
}
break;
}
if (start > pos) {
llm_tokenizer_bpe_session::tokenize(text.substr(pos, start - pos), output);
}
output.push_back(dna_begin_id);
const size_t content_start = start + open_tag.size();
const size_t end = text.find(close_tag, content_start);
const size_t content_end = (end == std::string::npos) ? text.size() : end;
emit_dna_kmers(text.substr(content_start, content_end - content_start), k, dna_oov_id, output);
if (end == std::string::npos) {
break;
}
output.push_back(dna_end_id);
pos = end + close_tag.size();
}
}
private:
void emit_dna_kmers(const std::string & raw, size_t k, llama_token oov_id, std::vector<llama_token> & output) {
std::string seq = raw;
for (char & c : seq) {
if (c >= 'a' && c <= 'z') {
c = char(c - 32);
}
}
auto is_valid_kmer = [](const std::string & s) {
for (char c : s) {
if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
return false;
}
}
return true;
};
size_t i = 0;
for (; i + k <= seq.size(); i += k) {
const std::string kmer = seq.substr(i, k);
if (is_valid_kmer(kmer)) {
const auto tok = vocab.text_to_token(kmer);
output.push_back(tok != LLAMA_TOKEN_NULL ? tok : oov_id);
} else {
output.push_back(oov_id);
}
}
if (i < seq.size()) {
std::string kmer = seq.substr(i);
kmer.append(k - kmer.size(), 'A');
if (is_valid_kmer(kmer)) {
const auto tok = vocab.text_to_token(kmer);
output.push_back(tok != LLAMA_TOKEN_NULL ? tok : oov_id);
} else {
output.push_back(oov_id);
}
}
}
const llama_vocab & vocab;
};
//
// impl
//
@@ -1808,7 +1899,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
special_mask_id = 103;
add_sep = true;
} else if (tokenizer_model == "gpt2") {
} else if (tokenizer_model == "gpt2" || tokenizer_model == "hybriddna") {
type = LLAMA_VOCAB_TYPE_BPE;
// read bpe merges and populate bpe ranks
@@ -3144,11 +3235,19 @@ std::vector<llama_token> llama_vocab::impl::tokenize(
} break;
case LLAMA_VOCAB_TYPE_BPE:
{
llm_tokenizer_bpe_session session(vocab, *static_cast<const llm_tokenizer_bpe *>(tokenizer.get()));
// it calls some other methods that are not exist in llm_tokenizer,
// here just cast it to bpe tokenizer object
const llm_tokenizer_bpe * tok_bpe = static_cast<const llm_tokenizer_bpe *>(tokenizer.get());
std::unique_ptr<llm_tokenizer_bpe_session> session;
if (vocab.get_tokenizer_model() == "hybriddna") {
session = std::make_unique<llm_tokenizer_hybriddna_session>(vocab, *tok_bpe);
} else {
session = std::make_unique<llm_tokenizer_bpe_session>(vocab, *tok_bpe);
}
if (add_special) {
session.append_bos(output);
session->append_bos(output);
}
for (const auto & fragment : fragment_buffer) {
if (fragment.type == FRAGMENT_BUFFER_VARIANT_TYPE_RAW_TEXT) {
@@ -3161,15 +3260,15 @@ std::vector<llama_token> llama_vocab::impl::tokenize(
#ifdef PRETOKENIZERDEBUG
LLAMA_LOG_WARN("TT: (%ld %ld %ld) '%s'\n", text.length(), fragment.offset, fragment.length, text.c_str());
#endif
session.tokenize(text, output);
session->tokenize(text, output);
} else { // if (fragment.type == FRAGMENT_BUFFER_VARIANT_TYPE_TOKEN)
session.append(fragment.token, output);
session->append(fragment.token, output);
}
}
if (add_special) {
session.append_eos(output);
session.check_double_bos_eos(output);
session->append_eos(output);
session->check_double_bos_eos(output);
}
} break;
case LLAMA_VOCAB_TYPE_WPM:

View File

@@ -525,8 +525,9 @@ llama_model_qwen35::graph_mtp::graph_mtp(const llama_model & model, const llm_gr
res->add_input(std::move(inp));
ggml_tensor * inp_pos = build_inp_pos();
auto * inp_attn = build_attn_inp_kv();
ggml_tensor * inp_pos = build_inp_pos();
ggml_tensor * inp_out_ids = build_inp_out_ids();
auto * inp_attn = build_attn_inp_kv();
ggml_tensor * h_norm = build_norm(h_input, layer.nextn.hnorm, nullptr, LLM_NORM_RMS, il);
cb(h_norm, "mtp_hnorm", il);
@@ -615,6 +616,8 @@ llama_model_qwen35::graph_mtp::graph_mtp(const llama_model & model, const llm_gr
cb(cur, "h_pre_norm", -1);
res->t_h_pre_norm = cur;
cur = ggml_get_rows(ctx0, cur, inp_out_ids);
ggml_tensor * head_norm_w = layer.nextn.shared_head_norm
? layer.nextn.shared_head_norm
: model.output_norm;

View File

@@ -588,8 +588,10 @@ llama_model_qwen35moe::graph_mtp::graph_mtp(const llama_model & model, const llm
res->add_input(std::move(inp));
ggml_tensor * inp_pos = build_inp_pos();
auto * inp_attn = build_attn_inp_kv();
ggml_tensor * inp_pos = build_inp_pos();
ggml_tensor * inp_out_ids = build_inp_out_ids();
auto * inp_attn = build_attn_inp_kv();
ggml_tensor * h_norm = build_norm(h_input, layer.nextn.hnorm, nullptr, LLM_NORM_RMS, il);
cb(h_norm, "mtp_hnorm", il);
@@ -710,6 +712,8 @@ llama_model_qwen35moe::graph_mtp::graph_mtp(const llama_model & model, const llm
cb(cur, "h_pre_norm", -1);
res->t_h_pre_norm = cur;
cur = ggml_get_rows(ctx0, cur, inp_out_ids);
ggml_tensor * head_norm_w = layer.nextn.shared_head_norm
? layer.nextn.shared_head_norm
: model.output_norm;

@@ -255,6 +255,10 @@ set_tests_properties(test-state-restore-fragmented PROPERTIES FIXTURES_REQUIRED
llama_build_and_test(test-recurrent-state-rollback.cpp LABEL "model" ARGS -m "${MODEL_DEST}")
set_tests_properties(test-recurrent-state-rollback PROPERTIES FIXTURES_REQUIRED test-download-model)
# Test state save/load functionality
llama_build_and_test(test-save-load-state.cpp LABEL "model" ARGS -m "${MODEL_DEST}")
set_tests_properties(test-save-load-state PROPERTIES FIXTURES_REQUIRED test-download-model)
if (NOT GGML_BACKEND_DL)
# these tests use the backends directly and cannot be built with dynamic loading
llama_build_and_test(test-barrier.cpp)

@@ -2866,15 +2866,24 @@ struct test_set : public test_case {
struct test_cpy : public test_case {
const ggml_type type_src;
const ggml_type type_dst;
const std::array<int64_t, 4> ne;
const std::array<int64_t, 4> ne_src;
const std::array<int64_t, 4> ne_dst;
const std::array<int64_t, 4> permute_src;
const std::array<int64_t, 4> permute_dst;
bool _src_use_permute;
bool _dst_use_permute;
bool _src_transpose;
bool _use_dst_shape;
std::string vars() override {
return VARS_TO_STR6(type_src, type_dst, ne, permute_src, permute_dst, _src_transpose);
if (_use_dst_shape) {
return VARS_TO_STR7(type_src, type_dst, ne_src, ne_dst, permute_src, permute_dst, _src_transpose);
}
return VARS_TO_STR6(type_src, type_dst, ne_src, permute_src, permute_dst, _src_transpose);
}
int64_t total_elements() const {
return ne_src[0] * ne_src[1] * ne_src[2] * ne_src[3];
}
double max_nmse_err() override {
@@ -2899,7 +2908,7 @@ struct test_cpy : public test_case {
err_estimate /= 8.0f;
}
err_estimate *= err_estimate;
err_estimate /= (150.0f*150.0f*0.25f)*float(ne[0] * ne[1] * ne[2] * ne[3]);
err_estimate /= (150.0f*150.0f*0.25f)*float(total_elements());
return err_estimate;
}
return 1e-6;
@@ -2910,17 +2919,19 @@ struct test_cpy : public test_case {
}
test_cpy(ggml_type type_src = GGML_TYPE_F32, ggml_type type_dst = GGML_TYPE_F32,
std::array<int64_t, 4> ne = {10, 10, 10, 1},
std::array<int64_t, 4> ne_src = {10, 10, 10, 1},
std::array<int64_t, 4> ne_dst = {-1, -1, -1, -1},
std::array<int64_t, 4> permute_src = {0, 0, 0, 0},
std::array<int64_t, 4> permute_dst = {0, 0, 0, 0},
bool transpose_src = false)
: type_src(type_src), type_dst(type_dst), ne(ne), permute_src(permute_src), permute_dst(permute_dst),
: type_src(type_src), type_dst(type_dst), ne_src(ne_src), ne_dst(ne_dst), permute_src(permute_src), permute_dst(permute_dst),
_src_use_permute(permute_src[0] + permute_src[1] + permute_src[2] + permute_src[3] > 0),
_dst_use_permute(permute_dst[0] + permute_dst[1] + permute_dst[2] + permute_dst[3] > 0),
_src_transpose(transpose_src){}
_src_transpose(transpose_src),
_use_dst_shape(ne_dst[0] >= 0 && ne_dst[1] >= 0 && ne_dst[2] >= 0 && ne_dst[3] >= 0){}
ggml_tensor * build_graph(ggml_context * ctx) override {
ggml_tensor * src = ggml_new_tensor(ctx, type_src, 4, ne.data());
ggml_tensor * src = ggml_new_tensor(ctx, type_src, 4, ne_src.data());
ggml_set_param(src);
ggml_set_name(src, "src");
@@ -2934,7 +2945,8 @@ struct test_cpy : public test_case {
ggml_set_name(src, "src_transposed");
}
ggml_tensor * dst = ggml_new_tensor(ctx, type_dst, 4, src->ne);
std::array<int64_t, 4> dst_ne = _use_dst_shape ? ne_dst : std::array<int64_t, 4>{src->ne[0], src->ne[1], src->ne[2], src->ne[3]};
ggml_tensor * dst = ggml_new_tensor(ctx, type_dst, 4, dst_ne.data());
ggml_set_name(dst, "dst");
if (_dst_use_permute) {
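The `test_cpy` hunks above treat a destination shape of `{-1, -1, -1, -1}` as "not specified" and fall back to the source shape. Below is a minimal standalone sketch of that sentinel convention. The names `shape4`, `num_elements`, `has_explicit_dst` and `resolve_dst_shape` are hypothetical and not part of the test file. The element-count assertion is an assumption taken from the commit message, which says the total number of elements must match. The sketch also ignores permutation and transposition, whereas the real test falls back to `src->ne` after those are applied.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using shape4 = std::array<int64_t, 4>;

// Product of the four extents.
static int64_t num_elements(const shape4 & ne) {
    return ne[0] * ne[1] * ne[2] * ne[3];
}

// A destination shape counts as explicit only when every extent is non-negative,
// mirroring the _use_dst_shape initializer in the diff.
static bool has_explicit_dst(const shape4 & ne_dst) {
    return ne_dst[0] >= 0 && ne_dst[1] >= 0 && ne_dst[2] >= 0 && ne_dst[3] >= 0;
}

// Hypothetical helper: pick the destination shape, defaulting to the source shape.
static shape4 resolve_dst_shape(const shape4 & ne_src, const shape4 & ne_dst) {
    if (!has_explicit_dst(ne_dst)) {
        return ne_src; // sentinel {-1,-1,-1,-1}: plain copy, same shape
    }
    // Assumption from the commit message: a reshaping copy keeps the element count.
    assert(num_elements(ne_src) == num_elements(ne_dst));
    return ne_dst;
}
```

With this convention, `resolve_dst_shape({10, 10, 10, 1}, {-1, -1, -1, -1})` yields the source shape, which is why the existing call sites in the later hunks gain a `{-1,-1,-1,-1}` argument ahead of their permutation arguments.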
@@ -8040,42 +8052,72 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
for (int k = 1; k < 4; ++k) {
test_cases.emplace_back(new test_cpy(type, type, {k*nk, 2, 3, 4}));
test_cases.emplace_back(new test_cpy(type, type, {k*nk, 2, 3, 4}, {0, 2, 1, 3}));
test_cases.emplace_back(new test_cpy(type, type, {k*nk, 2, 3, 4}, {0, 3, 1, 2}, {0, 2, 1, 3}));
test_cases.emplace_back(new test_cpy(type, type, {k*nk, 2, 3, 4}, {-1,-1,-1,-1}, {0, 2, 1, 3}));
test_cases.emplace_back(new test_cpy(type, type, {k*nk, 2, 3, 4}, {-1,-1,-1,-1}, {0, 3, 1, 2}, {0, 2, 1, 3}));
}
}
for (ggml_type type_src : {GGML_TYPE_F16, GGML_TYPE_BF16, GGML_TYPE_F32}) {
for (ggml_type type_dst : all_types) {
test_cases.emplace_back(new test_cpy(type_src, type_dst, {256, 4, 4, 4}));
test_cases.emplace_back(new test_cpy(type_src, type_dst, {256, 2, 3, 4}, {0, 2, 1, 3})); // cpy by rows
test_cases.emplace_back(new test_cpy(type_src, type_dst, {256, 2, 3, 4}, {-1,-1,-1,-1}, {0, 2, 1, 3})); // cpy by rows
}
}
for (ggml_type type_src : all_types) {
for (ggml_type type_dst : {GGML_TYPE_F32}) {
test_cases.emplace_back(new test_cpy(type_src, type_dst, {256, 4, 4, 4}));
test_cases.emplace_back(new test_cpy(type_src, type_dst, {256, 2, 3, 4}, {0, 2, 1, 3})); // cpy by rows
test_cases.emplace_back(new test_cpy(type_src, type_dst, {256, 2, 3, 4}, {-1,-1,-1,-1}, {0, 2, 1, 3})); // cpy by rows
}
}
for (ggml_type type_src : {GGML_TYPE_F16, GGML_TYPE_F32}) {
for (ggml_type type_dst : {GGML_TYPE_F16, GGML_TYPE_F32}) {
test_cases.emplace_back(new test_cpy(type_src, type_dst, {256, 2, 3, 4}, {1, 0, 2, 3})); // cpy not-contiguous
test_cases.emplace_back(new test_cpy(type_src, type_dst, {256, 2, 3, 4}, {-1,-1,-1,-1}, {1, 0, 2, 3})); // cpy not-contiguous
}
}
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_I32, {256, 2, 3, 4}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_I32, {256, 2, 3, 4}, {1, 0, 2, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_I32, {256, 2, 3, 4}, {-1,-1,-1,-1}, {1, 0, 2, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_I32, GGML_TYPE_F32, {256, 2, 3, 4}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_I32, GGML_TYPE_F32, {256, 2, 3, 4}, {1, 0, 2, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {256, 4, 3, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {256, 4, 3, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {256, 4, 3, 3}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_BF16, {256, 4, 3, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {256, 4, 1, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {256, 4, 1, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_BF16, {256, 4, 1, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_I32, GGML_TYPE_I32, {256, 4, 1, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_I32, GGML_TYPE_I32, {256, 1, 4, 1}, {1, 2, 0, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {256, 1, 4, 1}, {1, 2, 0, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_I32, GGML_TYPE_F32, {256, 2, 3, 4}, {-1,-1,-1,-1}, {1, 0, 2, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {256, 4, 3, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {256, 4, 3, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {256, 4, 3, 3}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_BF16, {256, 4, 3, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {256, 4, 1, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {256, 4, 1, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_BF16, {256, 4, 1, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_I32, GGML_TYPE_I32, {256, 4, 1, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_I32, GGML_TYPE_I32, {256, 1, 4, 1}, {-1,-1,-1,-1}, {1, 2, 0, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {256, 1, 4, 1}, {-1,-1,-1,-1}, {1, 2, 0, 3}, {0, 0, 0, 0}));
// CPY - different src/dst shapes (reshaping via CPY)
// Use permutations of {3, 5, 7, 32}. Total elements: 3*5*7*32 = 3360.
// Each src permutation is tested against canonical sorted and reverse dst (skip self).
{
std::array<int64_t, 4> dims = {3, 5, 7, 32};
std::sort(dims.begin(), dims.end());
std::array<int64_t, 4> canonical = dims;
std::array<int64_t, 4> reversed = {32, 7, 5, 3};
for (ggml_type type : {GGML_TYPE_F32, GGML_TYPE_F16}) {
std::array<int64_t, 4> cur = dims;
do {
if (cur != canonical) {
test_cases.emplace_back(new test_cpy(type, type, cur, canonical));
}
if (cur != reversed) {
test_cases.emplace_back(new test_cpy(type, type, cur, reversed));
}
if (cur[0] == 32 && type == GGML_TYPE_F32) {
if (canonical[0] == 32) {
test_cases.emplace_back(new test_cpy(GGML_TYPE_Q4_0, GGML_TYPE_Q4_0, cur, canonical));
}
if (reversed[0] == 32) {
test_cases.emplace_back(new test_cpy(GGML_TYPE_Q4_0, GGML_TYPE_Q4_0, cur, reversed));
}
}
std::next_permutation(cur.begin(), cur.end());
} while (cur != canonical);
}
}
for (ggml_type type_dst : { GGML_TYPE_F32, GGML_TYPE_I32, GGML_TYPE_F16, GGML_TYPE_BF16 }) {
for (bool use_view_slice : { true, false }) {
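The reshaping block above pairs every ordering of `{3, 5, 7, 32}` with the sorted and the reversed ordering, skipping pairs where source and destination are identical. The following sketch reproduces only that pairing for a single non-quantized type, so the enumeration can be inspected in isolation. `reshape_pairs` is a hypothetical name, and the loop is written in the more common `do { ... } while (std::next_permutation(...))` form, which visits the same orderings as the loop in the diff. The extra `GGML_TYPE_Q4_0` cases are left out.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <utility>
#include <vector>

using shape4 = std::array<int64_t, 4>;

// Enumerate (src, dst) shape pairs: each ordering of the extents is paired with
// the sorted ("canonical") and the reversed ordering, skipping identical pairs.
static std::vector<std::pair<shape4, shape4>> reshape_pairs() {
    shape4 canonical = {3, 5, 7, 32};
    std::sort(canonical.begin(), canonical.end());
    const shape4 reversed = {canonical[3], canonical[2], canonical[1], canonical[0]};

    std::vector<std::pair<shape4, shape4>> pairs;
    shape4 cur = canonical;
    do {
        if (cur != canonical) {
            pairs.emplace_back(cur, canonical);
        }
        if (cur != reversed) {
            pairs.emplace_back(cur, reversed);
        }
        // next_permutation returns false once it wraps back to the sorted order
    } while (std::next_permutation(cur.begin(), cur.end()));
    return pairs;
}
```

Four distinct extents give 24 source orderings, and each of the two destinations is skipped exactly once, so the function returns 46 pairs. Every pair keeps the same total of 3360 elements.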
@@ -8830,9 +8872,24 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
test_cases.emplace_back(new test_acc(GGML_TYPE_F32, {256, 17, 2, 3}, {256, 16, 2, 3}, 1));
test_cases.emplace_back(new test_acc(GGML_TYPE_F32, {256, 17, 2, 3}, {128, 16, 2, 3}, 2));
test_cases.emplace_back(new test_acc(GGML_TYPE_F32, {256, 17, 2, 3}, {64, 16, 2, 3}, 3));
test_cases.emplace_back(new test_pad());
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {33, 17, 2, 1}, 4, 3, true)); // circular
test_cases.emplace_back(new test_pad_ext());
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {1024, 1, 1, 1}, 1, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {1024, 2, 1, 1}, 1, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {1024, 16, 1, 1}, 0, 1, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {1023, 1, 1, 1}, 1, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {1023, 8, 1, 1}, 1, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {1025, 1, 1, 1}, 1, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {1025, 8, 1, 1}, 1, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {2048, 1, 1, 1}, 1, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {2048, 4, 1, 1}, 1, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {2049, 1, 1, 1}, 1, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {100, 1, 1, 1}, 100, 0, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {100, 1, 1, 1}, 0, 100, false));
test_cases.emplace_back(new test_pad(GGML_TYPE_F32, {100, 100, 1, 1}, 50, 50, false));
test_cases.emplace_back(new test_pad_reflect_1d());
test_cases.emplace_back(new test_pad_reflect_1d(GGML_TYPE_F32, {3000, 384, 4, 1}));
test_cases.emplace_back(new test_roll());
@@ -9132,22 +9189,21 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
test_cases.emplace_back(new test_bin_bcast(ggml_add, GGML_TYPE_F32, {4096, 1, 1, 1}, {1, 512, 1, 1}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F16, {512, 3072, 1, 1}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {8192, 512, 2, 1}, {0, 2, 1, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {3072, 512, 2, 1}, {0, 2, 1, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {8192, 512, 2, 1}, {-1,-1,-1,-1}, {0, 2, 1, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {3072, 512, 2, 1}, {-1,-1,-1,-1}, {0, 2, 1, 3}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_Q4_0, {8192, 512, 2, 1}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_Q4_0, GGML_TYPE_F32, {8192, 512, 2, 1}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {768*1024, 256, 1, 1}, {1, 0, 2, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {768*1024, 256, 1, 1}, {1, 0, 2, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {768, 1024, 256, 1}, {1, 0, 2, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_BF16, {768, 1024, 256, 1}, {1, 0, 2, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {768*1024, 256, 1, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {768, 1024, 256, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {768*1024, 256, 1, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {768, 1024, 256, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_BF16, {768, 1024, 256, 1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {768*1024, 256, 1, 1}, {-1,-1,-1,-1}, {1, 0, 2, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {768*1024, 256, 1, 1}, {-1,-1,-1,-1}, {1, 0, 2, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {768, 1024, 256, 1}, {-1,-1,-1,-1}, {1, 0, 2, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_BF16, {768, 1024, 256, 1}, {-1,-1,-1,-1}, {1, 0, 2, 3}, {0, 0, 0, 0}));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {768*1024, 256, 1, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F32, GGML_TYPE_F32, {768, 1024, 256, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {768*1024, 256, 1, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_F16, GGML_TYPE_F16, {768, 1024, 256, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_cpy(GGML_TYPE_BF16, GGML_TYPE_BF16, {768, 1024, 256, 1}, {-1,-1,-1,-1}, {0, 0, 0, 0}, {0, 0, 0, 0}, true));
test_cases.emplace_back(new test_soft_max(GGML_TYPE_F32, {4096, 4096, 5, 1}, false, false, GGML_TYPE_F32, {1, 1}, 1.0f, 0.0f));
test_cases.emplace_back(new test_soft_max(GGML_TYPE_F32, {12888, 256, 5, 1}, false, false, GGML_TYPE_F32, {1, 1}, 1.0f, 0.0f));

@@ -1,6 +1,18 @@
# llama-batched-bench-impl: batched-bench logic, reusable by app
set(TARGET llama-batched-bench-impl)
add_library(${TARGET} STATIC batched-bench.cpp)
target_include_directories(${TARGET} PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries(${TARGET} PUBLIC llama-common llama ${CMAKE_THREAD_LIBS_INIT})
# llama-batched-bench executable
set(TARGET llama-batched-bench)
add_executable(${TARGET} batched-bench.cpp)
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
add_executable(${TARGET} main.cpp)
target_link_libraries(${TARGET} PRIVATE llama-batched-bench-impl)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
if(LLAMA_TOOLS_INSTALL)

@@ -15,7 +15,10 @@ static void print_usage(int, char ** argv) {
LOG("\n");
}
int main(int argc, char ** argv) {
// satisfies -Wmissing-declarations
int llama_batched_bench(int argc, char ** argv);
int llama_batched_bench(int argc, char ** argv) {
std::setlocale(LC_NUMERIC, "C");
common_params params;

@@ -0,0 +1,5 @@
int llama_batched_bench(int argc, char ** argv);
int main(int argc, char ** argv) {
return llama_batched_bench(argc, argv);
}
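Renaming each tool's `main` to a named entry point such as `llama_batched_bench` and building it as an `-impl` static library lets a second program link several tools at once. How the unified app actually dispatches is not shown in this diff, so the following is only a sketch under assumptions: the table layout, `run_subcommand`, and the `demo_*` stubs are all hypothetical and stand in for real entry points.

```cpp
#include <cstring>

// Hypothetical stand-ins for entry points such as llama_quantize(argc, argv).
static int demo_quantize(int argc, char ** /*argv*/) { return argc; }
static int demo_perplexity(int /*argc*/, char ** /*argv*/) { return 0; }

struct subcommand {
    const char * name;
    int (*entry)(int, char **);
};

static const subcommand k_subcommands[] = {
    {"quantize",   demo_quantize},
    {"perplexity", demo_perplexity},
};

// Dispatch on argv[1] and forward the remaining arguments, so the entry point
// sees the subcommand name in its own argv[0]. Returns -1 for unknown input.
static int run_subcommand(int argc, char ** argv) {
    if (argc < 2) {
        return -1;
    }
    for (const subcommand & sc : k_subcommands) {
        if (std::strcmp(argv[1], sc.name) == 0) {
            return sc.entry(argc - 1, argv + 1);
        }
    }
    return -1;
}
```

A combined binary's `main` could then simply `return run_subcommand(argc, argv);`, while the five-line `main.cpp` files in this diff keep the standalone executables working unchanged.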

@@ -1,6 +1,18 @@
# llama-fit-params-impl: fit-params logic, reusable by app
set(TARGET llama-fit-params-impl)
add_library(${TARGET} STATIC fit-params.cpp)
target_include_directories(${TARGET} PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries(${TARGET} PUBLIC llama-common llama ${CMAKE_THREAD_LIBS_INIT})
# llama-fit-params executable
set(TARGET llama-fit-params)
add_executable(${TARGET} fit-params.cpp)
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
add_executable(${TARGET} main.cpp)
target_link_libraries(${TARGET} PRIVATE llama-fit-params-impl)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
if(LLAMA_TOOLS_INSTALL)

@@ -12,7 +12,10 @@
#pragma warning(disable: 4244 4267) // possible loss of data
#endif
int main(int argc, char ** argv) {
// satisfies -Wmissing-declarations
int llama_fit_params(int argc, char ** argv);
int llama_fit_params(int argc, char ** argv) {
common_params params;
common_init();

@@ -0,0 +1,5 @@
int llama_fit_params(int argc, char ** argv);
int main(int argc, char ** argv) {
return llama_fit_params(argc, argv);
}

@@ -1,6 +1,18 @@
# llama-perplexity-impl: perplexity logic, reusable by app
set(TARGET llama-perplexity-impl)
add_library(${TARGET} STATIC perplexity.cpp)
target_include_directories(${TARGET} PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries(${TARGET} PUBLIC llama-common llama ${CMAKE_THREAD_LIBS_INIT})
# llama-perplexity executable
set(TARGET llama-perplexity)
add_executable(${TARGET} perplexity.cpp)
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
add_executable(${TARGET} main.cpp)
target_link_libraries(${TARGET} PRIVATE llama-perplexity-impl)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
if(LLAMA_TOOLS_INSTALL)

@@ -0,0 +1,5 @@
int llama_perplexity(int argc, char ** argv);
int main(int argc, char ** argv) {
return llama_perplexity(argc, argv);
}

@@ -2005,7 +2005,10 @@ static void kl_divergence(llama_context * ctx, const common_params & params) {
LOG("Same top p: %6.3lf ± %5.3lf %%\n", 100.0*same_top_p, 100.0*sqrt(same_top_p*(1.0 - same_top_p)/(kld.count - 1)));
}
int main(int argc, char ** argv) {
// satisfies -Wmissing-declarations
int llama_perplexity(int argc, char ** argv);
int llama_perplexity(int argc, char ** argv) {
std::setlocale(LC_NUMERIC, "C");
common_params params;

@@ -1,7 +1,18 @@
# llama-quantize-impl: quantize logic, reusable by app
set(TARGET llama-quantize-impl)
add_library(${TARGET} STATIC quantize.cpp)
target_include_directories(${TARGET} PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries(${TARGET} PUBLIC llama-common llama ${CMAKE_THREAD_LIBS_INIT})
# llama-quantize executable
set(TARGET llama-quantize)
add_executable(${TARGET} quantize.cpp)
target_link_libraries(${TARGET} PRIVATE llama-common llama ${CMAKE_THREAD_LIBS_INIT})
target_include_directories(${TARGET} PRIVATE ../../common)
add_executable(${TARGET} main.cpp)
target_link_libraries(${TARGET} PRIVATE llama-quantize-impl)
target_compile_features(${TARGET} PRIVATE cxx_std_17)
if(LLAMA_TOOLS_INSTALL)

tools/quantize/main.cpp Normal file

@@ -0,0 +1,5 @@
int llama_quantize(int argc, char ** argv);
int main(int argc, char ** argv) {
return llama_quantize(argc, argv);
}

@@ -490,7 +490,10 @@ static bool parse_layer_prune(const char * data, std::vector<int> & prune_layers
return true;
}
int main(int argc, char ** argv) {
// satisfies -Wmissing-declarations
int llama_quantize(int argc, char ** argv);
int llama_quantize(int argc, char ** argv) {
std::setlocale(LC_NUMERIC, "C");
if (argc < 3) {
usage(argv[0]);

@@ -506,6 +506,9 @@ struct server_slot {
if (ptask) {
res["id_task"] = ptask->id;
res["n_prompt_tokens"] = (int32_t) prompt.tokens.size();
res["n_prompt_tokens_processed"] = n_prompt_tokens_processed;
res["n_prompt_tokens_cache"] = n_prompt_tokens_cache;
res["params"] = ptask->params.to_json(only_metrics);
res["next_token"] = {
{
@@ -701,6 +704,10 @@ private:
bool sleeping = false;
void destroy() {
spec.reset();
ctx_dft.reset();
model_dft.reset();
llama_init.reset();
ctx_tgt = nullptr;

@@ -14,6 +14,7 @@
#include <mutex>
#include <condition_variable>
#include <cstring>
#include <cstdlib>
#include <atomic>
#include <chrono>
#include <queue>
@@ -159,6 +160,13 @@ void server_model_meta::update_args(common_preset_context & ctx_preset, std::str
// TODO: maybe validate preset before rendering ?
// render args
args = preset.to_args(bin_path);
// unified binary dispatches by subcommand, re-inject it right after the
// binary path so the child starts as 'llama serve ...' not 'llama ...'
const char * app_cmd = std::getenv("LLAMA_APP_CMD");
if (app_cmd != nullptr && app_cmd[0] != '\0' && !bin_path.empty()) {
args.insert(args.begin() + 1, app_cmd);
}
}
void server_model_meta::update_caps() {
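The `LLAMA_APP_CMD` hunk above re-inserts the subcommand directly after the binary path, so a child process is launched as `llama serve ...` rather than `llama ...`. A small standalone sketch of that insertion follows. `inject_subcommand` is a hypothetical helper; it takes the command as a parameter instead of calling `std::getenv`, and it guards on `args` being non-empty where the original checks `bin_path`.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: insert the subcommand at index 1, right after the
// binary path in args[0]. A null or empty command leaves args unchanged.
static std::vector<std::string> inject_subcommand(std::vector<std::string> args,
                                                  const char * app_cmd) {
    if (app_cmd != nullptr && app_cmd[0] != '\0' && !args.empty()) {
        args.insert(args.begin() + 1, std::string(app_cmd));
    }
    return args;
}
```

Called as `inject_subcommand(args, std::getenv("LLAMA_APP_CMD"))`, the helper is a no-op when the variable is unset or empty, which matches the behaviour of standalone `llama-server` builds that do not set it.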

@@ -11,24 +11,28 @@
cd ../../
# Ensure node_modules are installed
if [ ! -d "tools/ui/node_modules" ]; then
echo "📦 Installing npm dependencies..."
cd tools/ui && npm install && cd ../../
fi
# Check and install git hooks if missing
check_and_install_hooks() {
local hooks_missing=false
# Check for required hooks
if [ ! -f ".git/hooks/pre-commit" ] || [ ! -f ".git/hooks/pre-push" ] || [ ! -f ".git/hooks/post-push" ]; then
if [ ! -f ".git/hooks/pre-commit" ] || [ ! -f ".git/hooks/pre-push" ]; then
hooks_missing=true
fi
if [ "$hooks_missing" = true ]; then
echo "🔧 Git hooks missing, installing them..."
cd tools/ui
if bash scripts/install-git-hooks.sh; then
if bash "$(dirname "$0")/git-hooks/install.sh"; then
echo "✅ Git hooks installed successfully"
else
echo "⚠️ Failed to install git hooks, continuing anyway..."
fi
cd ../../
else
echo "✅ Git hooks already installed"
fi

@@ -0,0 +1,35 @@
#!/usr/bin/env bash
#
# Install git hooks for llama-ui
# Copies pre-commit and pre-push hooks into the repo's .git/hooks directory.
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../.." && pwd)"
HOOKS_DIR="$REPO_ROOT/$(cd "$REPO_ROOT" && git rev-parse --git-path hooks)"
# Verify package.json exists
if [ ! -f "$REPO_ROOT/tools/ui/package.json" ]; then
echo "❌ package.json not found in tools/ui"
exit 1
fi
echo "Installing git hooks for llama-ui..."
for hook in pre-commit pre-push; do
src="$SCRIPT_DIR/${hook}.sh"
dst="$HOOKS_DIR/$hook"
if cp "$src" "$dst" && chmod +x "$dst"; then
echo "$hook"
else
echo " ❌ Failed to install $hook"
exit 1
fi
done
echo ""
echo "Pre-commit: format (staged) + type-check"
echo "Pre-push: lint + test"
echo ""
echo "Hooks stash unstaged changes temporarily and restore them after."
echo "Skip with: git commit --no-verify / git push --no-verify"

@@ -0,0 +1,57 @@
#!/usr/bin/env bash
#
# Pre-commit hook for llama-ui
# Runs: format (staged files only) + type-check
# Stashes unstaged changes temporarily and restores them after.
# Only run when there are staged changes in tools/ui/
if ! git diff --cached --name-only | grep -q "^tools/ui/"; then
exit 0
fi
REPO_ROOT=$(git rev-parse --show-toplevel)
cd "$REPO_ROOT/tools/ui"
# Check that node_modules exists
if [ ! -d "node_modules" ]; then
echo "❌ node_modules not found. Run 'npm install' first."
exit 1
fi
# Stash unstaged changes in tools/ui/ so they don't interfere
stash_name="pi-ui-precommit"
git stash push --keep-index -u -m "$stash_name" -- tools/ui/ 2>/dev/null || true
echo "Running pre-commit checks for llama-ui..."
# Format only staged files
staged_ui=$(git diff --cached --name-only -- tools/ui/)
if [ -n "$staged_ui" ]; then
echo "$staged_ui" | xargs npx --no-install prettier --write
format_ok=$?
# Re-stage formatted files
git add tools/ui/
else
format_ok=0
fi
# Type-check the clean tree
npm run check
check_ok=$?
# Restore stashed changes
if git stash list | grep -q "$stash_name"; then
git stash pop 2>/dev/null || true
fi
if [ $format_ok -ne 0 ]; then
echo "❌ Format failed"
exit 1
fi
if [ $check_ok -ne 0 ]; then
echo "❌ Type check failed"
exit 1
fi
echo "✅ Pre-commit checks passed"
exit 0

@@ -0,0 +1,66 @@
#!/usr/bin/env bash
#
# Pre-push hook for llama-ui
# Runs: lint + test
# Ignores unstaged changes (stashes them temporarily and restores after).
needs_check=false
# Read refs from stdin: local_ref local_sha remote_ref remote_sha
while read local_ref local_sha remote_ref remote_sha; do
# New remote ref (remote sha is all zeros) or ref deletion (local sha is all zeros): always check
if [ "$local_sha" = "0000000000000000000000000000000000000000" ] || \
[ "$remote_sha" = "0000000000000000000000000000000000000000" ]; then
needs_check=true
continue
fi
# Check for changes in tools/ui/ between remote and local
if git diff --name-only "$remote_sha...$local_sha" -- tools/ui/ | grep -q .; then
needs_check=true
fi
done
if [ "$needs_check" = false ]; then
exit 0
fi
REPO_ROOT=$(git rev-parse --show-toplevel)
cd "$REPO_ROOT/tools/ui"
# Check that node_modules exists
if [ ! -d "node_modules" ]; then
echo "❌ node_modules not found. Run 'npm install' first."
exit 1
fi
# Stash unstaged changes so they don't interfere with checks
stash_name="pi-ui-prepush"
git stash push -u -m "$stash_name" -- tools/ui/ 2>/dev/null || true
echo "Running pre-push checks for llama-ui..."
# Lint
npm run lint
lint_ok=$?
# Test
npm test
test_ok=$?
# Restore stashed changes
if git stash list | grep -q "$stash_name"; then
git stash pop 2>/dev/null || true
fi
if [ $lint_ok -ne 0 ]; then
echo "❌ Lint failed"
exit 1
fi
if [ $test_ok -ne 0 ]; then
echo "❌ Tests failed"
exit 1
fi
echo "✅ Pre-push checks passed"
exit 0

@@ -1,78 +0,0 @@
#!/bin/bash
# Script to install pre-commit hook for llama-ui
# Pre-commit: formats, checks, and builds the UI app
REPO_ROOT=$(git rev-parse --show-toplevel)
PRE_COMMIT_HOOK="$REPO_ROOT/.git/hooks/pre-commit"
echo "Installing pre-commit hook for llama-ui..."
# Create the pre-commit hook
cat > "$PRE_COMMIT_HOOK" << 'EOF'
#!/bin/bash
# Check if there are any changes in the tools/ui directory
if git diff --cached --name-only | grep -q "^tools/ui/"; then
REPO_ROOT=$(git rev-parse --show-toplevel)
cd "$REPO_ROOT/tools/ui"
# Check if package.json exists
if [ ! -f "package.json" ]; then
echo "Error: package.json not found in tools/ui"
exit 1
fi
echo "Formatting and checking llama-ui code..."
# Run the format command
npm run format
if [ $? -ne 0 ]; then
echo "Error: npm run format failed"
exit 1
fi
# Run the lint command
npm run lint
if [ $? -ne 0 ]; then
echo "Error: npm run lint failed"
exit 1
fi
# Run the check command
npm run check
if [ $? -ne 0 ]; then
echo "Error: npm run check failed"
exit 1
fi
echo "✅ llama-ui code formatted and checked successfully"
# Build the llama-ui
echo "Building llama-ui..."
npm run build
if [ $? -ne 0 ]; then
echo "❌ npm run build failed"
exit 1
fi
echo "✅ llama-ui built successfully"
fi
exit 0
EOF
# Make hook executable
chmod +x "$PRE_COMMIT_HOOK"
if [ $? -eq 0 ]; then
echo "✅ Git hook installed successfully!"
echo " Pre-commit: $PRE_COMMIT_HOOK"
echo ""
echo "The hook will automatically:"
echo " • Format, lint and check llama-ui code before commits"
echo " • Build llama-ui"
else
echo "❌ Failed to make hook executable"
exit 1
fi