Compare commits

...

21 Commits
b8390 ... b8411

Author SHA1 Message Date
Georgi Gerganov
4efd326e71 sync : ggml 2026-03-18 15:17:28 +02:00
Georgi Gerganov
b08f7322ee ggml : bump version to 0.9.8 (ggml/1442) 2026-03-18 15:17:28 +02:00
Georgi Gerganov
79187f2fb8 ggml : restore ggml_type_sizef() to aboid major version bump (ggml/1441) 2026-03-18 15:17:28 +02:00
Julien Chaumond
48e61238e1 webui: improve tooltip wording for attachment requirements (#20688)
* webui: improve tooltip wording for attachment requirements

Co-Authored-By: Claude <Agents+claude@huggingface.co>

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Claude <Agents+claude@huggingface.co>
2026-03-18 14:01:02 +01:00
Pop Flamingo
312cf03328 llama : re-enable manual LoRA adapter free (#19983)
* Re-enable manual LoRA adapter free

* Remove stale "all adapters must be loaded before context creation" stale comments
2026-03-18 12:03:26 +02:00
Masato Nakasaka
f4049ad735 tests : fix test-jinja-py Windows failures by bypassing command-line args [no ci] (#20483)
* Fix errors occurring on Windows

* Reverted fix

#20365 will take care of CRLF isue

* Changed to write to directly to stdin

* Prevent fclose to happen twice
2026-03-18 10:43:31 +01:00
Aldehir Rojas
5e8910a0db common : rework gpt-oss parser (#20393)
* common : rework gpt-oss parser

* cont : fix gpt-oss tests

* cont : add structured output test

* cont : rename final to final_msg
2026-03-18 10:41:25 +01:00
Aaron Teo
fe00a84b4b tests: enable kv_unified to prevent cuda oom error on rtx 2060 (#20645)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-03-18 17:40:22 +08:00
Aleksander Grygier
7ab321d40d webui: Fix duplicated messages on q param (#20715)
* fix: Remove duplicate message sending on `?q` param

* chore: update webui build output
2026-03-18 10:32:43 +01:00
uvos
7533a7d509 HIP : ignore return of hipMemAdvise [no ci] (#20696) 2026-03-18 09:53:13 +01:00
Andreas Obersteiner
a69d54f990 context : fix graph not resetting when control vector changes (#20381) 2026-03-18 08:10:13 +02:00
Krishna Sridhar
cf23ee2447 hexagon: add neg, exp, sigmoid, softplus ops, cont, repeat ops (#20701)
Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.

- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback
- CONT reuses the existing CPY infrastructure since making a tensor
  contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
  the worker pool, supporting f32 and f16 types. The kernel parallelizes
  across output rows and uses memcpy for each tile.

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-17 15:34:36 -07:00
Ruben Ortlam
892e3c333a vulkan: disable mmvq on Intel Windows driver (#20672)
* vulkan: disable mmvq on Intel Windows driver

* improve comment
2026-03-17 21:51:43 +01:00
Kevin Hannon
ee4801e5a6 ggml-blas: set mkl threads from thread context (#20602)
* ggml blas: set mkl threads from thread context

* add code to run blas locally
2026-03-18 01:16:49 +08:00
Piotr Wilkin (ilintar)
d2ecd2d1cf common/parser: add --skip-chat-parsing to force a pure content parser. (#20289)
* Add `--force-pure-content` to force a pure content parser.

* Update common/arg.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Change parameter name [no ci]

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-17 16:16:43 +01:00
Taimur Ahmad
054d8b0f24 ggml-cpu: fix RVV checks in quants and repacking (#20682)
* ggml-cpu: refactor quants.c; add rvv check

* ggml-cpu: refactor; disable generic fallback
2026-03-17 16:03:40 +02:00
Sigbjørn Skjæret
ab0bb93748 ci : bump ccache [no ci] (#20679)
* bump ccache

* forgotten

* disable for s390x

* disable also for ppc64le
2026-03-17 14:54:31 +01:00
Ruben Ortlam
3a5cb629b1 vulkan: async and event fixes (#20518)
* vulkan: fix event wait submission, event command buffer reset

* fix event command buffer reset validation error

* also reset command buffers before reuse

* use timeline semaphores instead of fences for event_synchronize

* don't use initializer list for semaphore wait info

* use multiple events to avoid reset issues

* fix event reuse issue with multiple vectors

* add semaphore wait condition also if compute_ctx already exists

* remove event pending stage
2026-03-17 14:27:23 +01:00
Georgi Gerganov
8cc2d81264 server : fix ctx checkpoint invalidation (#20671) 2026-03-17 15:21:14 +02:00
Justin Bradford
627670601a kleidiai : fix MUL_MAT support for batched (3D) inputs (#20620)
* kleidiai : fix MUL_MAT support for batched (3D) inputs

The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.

This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.

Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.

Fixes #20608

* Kleidiai support_ops should only return true for 3D inputs, not also 4D
2026-03-17 14:03:54 +02:00
Ruben Ortlam
740a447fc3 vulkan: allow graphics queue only through env var (#20599)
* vulkan: avoid graphics queue on non-RADV AMD drivers

* avoid graphics queues on small GPUs

* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE

* reenable transfer queue if graphics queue is not used
2026-03-17 10:09:59 +01:00
50 changed files with 793 additions and 308 deletions

View File

@@ -46,7 +46,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-ios
evict-old-files: 1d
@@ -124,7 +124,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-tvos
evict-old-files: 1d
@@ -186,7 +186,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-swift
evict-old-files: 1d

View File

@@ -43,7 +43,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-latest-sanitizer-${{ matrix.sanitizer }}
evict-old-files: 1d

View File

@@ -45,7 +45,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-vulkan-llvmpipe
evict-old-files: 1d

View File

@@ -69,7 +69,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64
evict-old-files: 1d
@@ -105,7 +105,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-x64
evict-old-files: 1d
@@ -141,7 +141,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64-webgpu
evict-old-files: 1d
@@ -195,7 +195,8 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
if: ${{ matrix.build != 's390x' && matrix.build != 'ppc64le' }}
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-cpu-${{ matrix.build }}
evict-old-files: 1d
@@ -324,7 +325,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-webgpu
evict-old-files: 1d
@@ -436,7 +437,7 @@ jobs:
sudo apt-get install -y build-essential git cmake rocblas-dev hipblas-dev libssl-dev rocwmma-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-hip
evict-old-files: 1d
@@ -467,7 +468,7 @@ jobs:
apt-get install -y build-essential git cmake libssl-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-musa
evict-old-files: 1d
@@ -513,7 +514,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-sycl
evict-old-files: 1d
@@ -562,7 +563,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-sycl-fp16
evict-old-files: 1d
@@ -605,7 +606,7 @@ jobs:
- name: ccache
if: runner.environment == 'github-hosted'
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-openvino-${{ matrix.variant }}-no-preset-v1
evict-old-files: 1d
@@ -692,7 +693,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-${{ matrix.build }}
variant: ccache
@@ -798,7 +799,7 @@ jobs:
apt install -y cmake build-essential ninja-build libgomp1 git libssl-dev
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-latest-cuda
evict-old-files: 1d
@@ -830,7 +831,7 @@ jobs:
uses: actions/checkout@v6
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-cuda-${{ matrix.cuda }}
variant: ccache
@@ -883,7 +884,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-sycl
variant: ccache
@@ -944,7 +945,7 @@ jobs:
& $clangPath.FullName --version
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ${{ github.job }}
evict-old-files: 1d
@@ -1068,7 +1069,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-x64-cpu-low-perf
evict-old-files: 1d
@@ -1094,7 +1095,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-low-perf
evict-old-files: 1d
@@ -1120,7 +1121,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-x64-cpu-high-perf
evict-old-files: 1d
@@ -1146,7 +1147,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-high-perf
evict-old-files: 1d
@@ -1172,7 +1173,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-high-perf-sve
evict-old-files: 1d
@@ -1198,7 +1199,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-kleidiai
evict-old-files: 1d
@@ -1250,7 +1251,7 @@ jobs:
sudo apt-get install -y cmake
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ggml-ci-arm64-cpu-kleidiai-graviton4
evict-old-files: 1d

View File

@@ -29,7 +29,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: copilot-setup-steps
evict-old-files: 1d
@@ -52,6 +52,6 @@ jobs:
- name: Install Python dependencies
run: |
python3 -m venv .venv
.venv/bin/activate
source .venv/bin/activate
pip install -r requirements/requirements-all.txt -r tools/server/tests/requirements.txt
pip install flake8 pyright pre-commit

View File

@@ -47,7 +47,7 @@ jobs:
fetch-depth: 0
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-arm64
evict-old-files: 1d
@@ -94,7 +94,7 @@ jobs:
fetch-depth: 0
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: macOS-latest-x64
evict-old-files: 1d
@@ -153,7 +153,8 @@ jobs:
fetch-depth: 0
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
if: ${{ matrix.build != 's390x' }}
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-cpu-${{ matrix.build }}
evict-old-files: 1d
@@ -204,7 +205,7 @@ jobs:
fetch-depth: 0
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-22-vulkan
evict-old-files: 1d
@@ -269,7 +270,7 @@ jobs:
fetch-depth: 0
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-24-openvino-release-no-preset-v1
evict-old-files: 1d
@@ -342,7 +343,7 @@ jobs:
fetch-depth: 0
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-cpu-${{ matrix.arch }}
variant: ccache
@@ -403,7 +404,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-${{ matrix.backend }}-${{ matrix.arch }}
variant: ccache
@@ -473,7 +474,7 @@ jobs:
uses: actions/checkout@v6
- name: Install ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-cuda-${{ matrix.cuda }}
variant: ccache
@@ -549,7 +550,7 @@ jobs:
uses: actions/checkout@v6
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-sycl
variant: ccache
@@ -629,7 +630,7 @@ jobs:
fetch-depth: 0
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: ubuntu-rocm-${{ matrix.ROCM_VERSION }}-${{ matrix.build }}
evict-old-files: 1d
@@ -739,7 +740,7 @@ jobs:
key: rocm-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ runner.os }}
- name: ccache
uses: ggml-org/ccache-action@v1.2.16
uses: ggml-org/ccache-action@v1.2.21
with:
key: windows-latest-hip-${{ env.HIPSDK_INSTALLER_VERSION }}-${{ matrix.name }}-x64
evict-old-files: 1d

View File

@@ -25,7 +25,13 @@
# # with KLEIDIAI support
# GG_BUILD_KLEIDIAI=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
# # with OPENVINO support
# # with BLAS support
# GG_BUILD_BLAS=1 bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
# with BLAS support (custom vendor)
# GG_BUILD_BLAS=1 GG_BUILD_BLAS_VENDOR=Intel10_64lp bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
# with OPENVINO support
# GG_BUILD_OPENVINO=1 GG_BUILD_LOW_PERF=1 GGML_OPENVINO_DEVICE=CPU bash ./ci/run.sh ./tmp/results ./tmp/mnt
#
@@ -169,6 +175,10 @@ if [ -n "${GG_BUILD_KLEIDIAI}" ]; then
-DBUILD_SHARED_LIBS=OFF"
fi
if [ ! -z ${GG_BUILD_BLAS} ]; then
CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=${GG_BUILD_BLAS_VENDOR:-OpenBLAS}"
fi
if [ ! -z ${GG_BUILD_OPENVINO} ]; then
if [ -z ${OpenVINO_DIR} ]; then
echo "OpenVINO_DIR not found, please install OpenVINO via archives and enable it by:"

View File

@@ -3115,6 +3115,17 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.chat_template = read_file(value);
}
).set_examples({LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_CHAT_TEMPLATE_FILE"));
add_opt(common_arg(
{"--skip-chat-parsing"},
{"--no-skip-chat-parsing"},
string_format(
"force a pure content parser, even if a Jinja template is specified; model will output everything "
"in the content section, including any reasoning and/or tool calls (default: disabled)"
),
[](common_params & params, bool value) {
params.force_pure_content_parser = value;
}
).set_examples({LLAMA_EXAMPLE_COMPLETION, LLAMA_EXAMPLE_CLI, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_SKIP_CHAT_PARSING"));
add_opt(common_arg(
{"--prefill-assistant"},
{"--no-prefill-assistant"},

View File

@@ -933,17 +933,12 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
// Copy reasoning to the "thinking" field as expected by the gpt-oss template
auto adjusted_messages = json::array();
for (const auto & msg : inputs.messages) {
auto has_reasoning_content = msg.contains("reasoning_content") && msg.at("reasoning_content").is_string();
auto has_tool_calls = msg.contains("tool_calls") && msg.at("tool_calls").is_array();
if (has_reasoning_content && has_tool_calls) {
auto adjusted_message = msg;
adjusted_message["thinking"] = msg.at("reasoning_content");
adjusted_messages.push_back(adjusted_message);
} else {
adjusted_messages.push_back(msg);
for (auto msg : inputs.messages) {
if (msg.contains("reasoning_content") && msg.at("reasoning_content").is_string()) {
msg["thinking"] = msg.at("reasoning_content");
msg.erase("content");
}
adjusted_messages.push_back(msg);
}
auto prompt = common_chat_template_direct_apply(tmpl, inputs, /* messages_override= */ adjusted_messages);
@@ -969,45 +964,31 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
"<|channel|>", "<|constrain|>", "<|message|>", "<|start|>", "<|end|>",
};
auto has_tools = inputs.tools.is_array() && !inputs.tools.empty();
auto extract_reasoning = inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE;
auto include_grammar = inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE && has_tools;
auto has_tools = inputs.tools.is_array() && !inputs.tools.empty();
auto has_response_format = !inputs.json_schema.is_null() && inputs.json_schema.is_object();
auto include_grammar = has_response_format || (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE);
auto parser = build_chat_peg_parser([&](common_chat_peg_builder & p) {
const std::string END = "<|end|>";
const std::string START = "<|start|>";
const std::string MESSAGE = "<|message|>";
const std::string CHANNEL = "<|channel|>";
const std::string CONSTRAIN = "<|constrain|>";
const std::string START_ASSISTANT = START + "assistant";
const std::string CHANNEL_ANALYSIS = CHANNEL + "analysis";
const std::string CHANNEL_COMMENTARY = CHANNEL + "commentary";
const std::string CHANNEL_FINAL = CHANNEL + "final";
auto start = p.rule("start", p.literal("<|start|>assistant"));
auto end = p.rule("end", p.literal("<|end|>"));
auto content = p.rule("message-content", p.until("<|end|>"));
auto channel = p.literal("<|channel|>") + (p.literal("commentary") | p.literal("analysis"));
auto constrain_type = p.chars("[A-Za-z0-9_-]", 1, -1);
auto the_end = END | p.end();
auto analysis = p.rule("analysis", p.literal("<|channel|>analysis<|message|>") + p.reasoning(content) + end);
auto preamble = p.rule("preamble", p.literal("<|channel|>commentary<|message|>") + p.content(content) + end);
auto final_msg = p.rule("final", p.literal("<|channel|>final<|message|>") + p.content(content));
auto any = p.rule("any", preamble | analysis);
const std::string analysis_header = CHANNEL_ANALYSIS + MESSAGE;
auto segment_content = p.until(END);
auto analysis_segment = extract_reasoning ?
p.literal(analysis_header) + p.reasoning(segment_content) + p.until(END) + the_end :
p.content(analysis_header + p.until(END) + the_end);
if (has_response_format) {
auto constraint = p.optional(p.space() + p.literal("<|constrain|>") + constrain_type);
auto response_format = p.rule("response-format",
p.literal("<|channel|>final") + constraint + p.literal("<|message|>") +
p.content(p.schema(p.json(), "response-format-schema", inputs.json_schema)));
auto channel_header_content = p.until_one_of({ " to=functions.", MESSAGE });
auto content_header = p.choice({ p.literal(CHANNEL_COMMENTARY), p.literal(CHANNEL_FINAL) });
auto content_segment = p.rule("content-segment", content_header + channel_header_content + MESSAGE +
p.content(segment_content) + the_end);
if (!inputs.json_schema.is_null()) {
auto final_header = p.literal(CHANNEL_FINAL);
auto constraint = p.optional(p.space() + p.literal(CONSTRAIN) + channel_header_content);
return p.optional(analysis_segment) + final_header + constraint + MESSAGE +
p.content(p.schema(p.json(), "response-format", inputs.json_schema));
return response_format | (analysis + p.zero_or_more(start + analysis) + start + response_format);
}
auto segment = p.optional(START_ASSISTANT + p.space()) + p.choice({ content_segment, analysis_segment });
auto contents = p.optional(segment + p.repeat(p.optional(p.space()) + segment, 0, -1)) + p.end();
// Tool call parser
if (has_tools && inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_NONE) {
auto tool_choice = p.choice();
@@ -1016,42 +997,37 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
std::string name = function.at("name");
const auto & params = function.at("parameters");
// Tool call can appear as:
// 1. In role header: " to=functions.NAME<|channel|>..."
// 2. In channel: "<|channel|>(analysis|commentary) to=functions.NAME..."
auto func_name = p.literal(" to=functions.") + p.tool_name(p.literal(name));
auto channel = p.literal(CHANNEL_COMMENTARY) | p.literal(CHANNEL_ANALYSIS);
auto constraint = p.space() + p.optional(p.literal(CONSTRAIN) + channel_header_content);
auto func_name = p.literal(" to=functions.") + p.tool_name(p.literal(name));
auto constraint = p.optional(p.space() + p.literal("<|constrain|>") + constrain_type);
auto args = p.tool_args(p.schema(p.json(), "tool-" + name + "-schema", params));
// Pattern 1: recipient in role header
// " to=functions.NAME<|channel|>(analysis|commentary)[constraint]<|message|>ARGS"
auto tool_in_role = p.tool(p.tool_open(func_name + channel) + constraint + MESSAGE + args);
// recipient in role header
// <|start|>assistant to=functions.NAME<|channel|>(commentary|analysis)[constraint]<|message|>ARGS
auto tool_in_role = p.tool(p.tool_open(func_name + channel + constraint + p.literal("<|message|>")) + args);
// Pattern 2: recipient in channel header
// "<|channel|>(analysis|commentary) to=functions.NAME[constraint]<|message|>ARGS"
auto tool_in_channel = p.tool(channel + p.tool_open(func_name + constraint + MESSAGE) + args);
// recipient in channel header
// <|channel|>(commentary|analysis) to=functions.NAME[constraint]<|message|>ARGS
auto tool_in_channel = p.tool(p.tool_open(channel + func_name + constraint + p.literal("<|message|>")) + args);
tool_choice |= tool_in_role | tool_in_channel;
tool_choice |= p.rule("tool-" + name, tool_in_role | tool_in_channel);
});
auto min_calls = inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED ? 1 : 0;
auto max_calls = inputs.parallel_tool_calls ? -1 : 1;
auto tool_call = p.trigger_rule("tool-call", tool_choice);
auto role_start = p.optional(p.space() + p.literal(START_ASSISTANT));
auto tool_call = p.rule("tool-call", p.repeat(role_start + tool_choice, min_calls, max_calls) + p.end());
if (inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED) {
return tool_call | ( any + p.zero_or_more(start + any) + start + tool_call);
}
return p.choice({ p.trigger_rule("single-tool", tool_call), p.trigger_rule("tools", p.one_or_more(segment) + tool_call) });
return tool_call | final_msg | (any + p.zero_or_more(start + any) + start + (tool_call | final_msg));
}
return contents;
return final_msg | (any + p.zero_or_more(start + any) + start + final_msg);
});
data.parser = parser.save();
if (include_grammar) {
data.grammar_lazy = has_tools && inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_AUTO;
data.grammar_lazy = !(has_response_format || (has_tools && inputs.tool_choice == COMMON_CHAT_TOOL_CHOICE_REQUIRED));
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
foreach_function(inputs.tools, [&](const json & tool) {
const auto & function = tool.at("function");
@@ -1062,10 +1038,9 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
});
data.grammar_triggers = {
{ COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN, "^(?:<\\|start\\|>assistant\\s*)?(\\s+to=functions)" },
{ COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN, "(?:<\\|end\\|>)(?:<\\|start\\|>assistant\\s*)?(\\s+to=functions)" },
{ COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN,
"(?:<\\|start\\|>assistant\\s*)?(<\\|channel\\|>(?:commentary|analysis)\\s+to=functions)" }
{ COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN, "^\\s+to$" },
{ COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN, "<\\|start\\|>assistant(\\s+to)" },
{ COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN, "<\\|start\\|>assistant(<\\|channel\\|>(?:commentary|analysis)\\s+to)" }
};
}
@@ -1562,6 +1537,21 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
}
}
if (inputs.force_pure_content) {
LOG_WRN("Forcing pure content template, will not render reasoning or tools separately.");
// Create the result structure
common_chat_params data;
auto params_copy = params;
params_copy.reasoning_format = COMMON_REASONING_FORMAT_NONE;
data.prompt = common_chat_template_direct_apply(tmpl, params_copy);
data.format = COMMON_CHAT_FORMAT_PEG_NATIVE;
auto parser = build_chat_peg_parser([](common_chat_peg_builder &p) {
return p.content(p.rest());
});
data.parser = parser.save();
return data;
}
// Ministral/Mistral Large 3 - uses special reasoning structure fixes, can't use autoparser
// Note: Mistral Small 3.2 uses [CALL_ID] which Ministral doesn't have, so we can distinguish them
if (src.find("[SYSTEM_PROMPT]") != std::string::npos && src.find("[TOOL_CALLS]") != std::string::npos &&

View File

@@ -204,6 +204,7 @@ struct common_chat_templates_inputs {
std::map<std::string, std::string> chat_template_kwargs;
bool add_bos = false;
bool add_eos = false;
bool force_pure_content = false;
};
struct common_chat_params {

View File

@@ -1067,7 +1067,7 @@ common_init_result::common_init_result(common_params & params) :
const llama_vocab * vocab = llama_model_get_vocab(model);
// load and optionally apply lora adapters (must be loaded before context creation)
// load and optionally apply lora adapters
for (auto & la : params.lora_adapters) {
llama_adapter_lora_ptr lora;
lora.reset(llama_adapter_lora_init(model, la.path.c_str()));

View File

@@ -544,6 +544,7 @@ struct common_params {
std::string chat_template = ""; // NOLINT
bool use_jinja = true; // NOLINT
bool enable_chat_template = true;
bool force_pure_content_parser = false;
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
int enable_reasoning = -1; // -1 = auto, 0 = disable, 1 = enable
int reasoning_budget = -1;

View File

@@ -4,7 +4,7 @@ project("ggml" C CXX ASM)
### GGML Version
set(GGML_VERSION_MAJOR 0)
set(GGML_VERSION_MINOR 9)
set(GGML_VERSION_PATCH 7)
set(GGML_VERSION_PATCH 8)
set(GGML_VERSION_BASE "${GGML_VERSION_MAJOR}.${GGML_VERSION_MINOR}.${GGML_VERSION_PATCH}")
find_program(GIT_EXE NAMES git git.exe NO_CMAKE_FIND_ROOT_PATH)

View File

@@ -733,6 +733,10 @@ extern "C" {
GGML_API size_t ggml_type_size(enum ggml_type type); // size in bytes for all elements in a block
GGML_API size_t ggml_row_size (enum ggml_type type, int64_t ne); // size in bytes for all elements in a row
GGML_DEPRECATED(
GGML_API double ggml_type_sizef(enum ggml_type type), // ggml_type_size()/ggml_blck_size() as float
"use ggml_row_size() instead");
GGML_API const char * ggml_type_name(enum ggml_type type);
GGML_API const char * ggml_op_name (enum ggml_op op);
GGML_API const char * ggml_op_symbol(enum ggml_op op);

View File

@@ -121,6 +121,8 @@ static void ggml_backend_blas_mul_mat(ggml_backend_blas_context * ctx, struct gg
bli_thread_set_num_threads(ctx->n_threads);
#elif defined(GGML_BLAS_USE_NVPL)
nvpl_blas_set_num_threads(ctx->n_threads);
#elif defined(GGML_BLAS_USE_MKL)
mkl_set_num_threads(ctx->n_threads);
#endif
for (int64_t i13 = 0; i13 < ne13; i13++) {

View File

@@ -115,10 +115,10 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
void quantize_row_q8_K(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k) {
assert(k % QK_K == 0);
block_q8_K * y_blocks = (block_q8_K *)y;
size_t nb = k / QK_K;
#if defined(__riscv_v_intrinsic)
block_q8_K * y_blocks = (block_q8_K *)y;
const size_t vlmax_f32m8 = __riscv_vsetvlmax_e32m8();
for (size_t i = 0; i < nb; i++) {
@@ -2052,6 +2052,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_iq1_s_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(n % QK_K == 0);
assert(nrc == 1);
@@ -2147,6 +2148,7 @@ static void ggml_vec_dot_iq1_s_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t
*s = sumf;
}
#endif
void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -2163,6 +2165,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_iq1_m_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(n % QK_K == 0);
assert(nrc == 1);
@@ -2269,6 +2272,7 @@ static void ggml_vec_dot_iq1_m_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t
*s = sumf;
}
#endif
void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -2285,6 +2289,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
#endif
}
#if defined __riscv_v_intrinsic
static const uint8_t sign_gather_indices_arr[64] = {
0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,
4,4,4,4,4,4,4,4, 5,5,5,5,5,5,5,5, 6,6,6,6,6,6,6,6, 7,7,7,7,7,7,7,7
@@ -2488,6 +2493,7 @@ static void ggml_vec_dot_iq2_s_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t
}
*s = 0.125f * sumf;
}
#endif
void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -2507,7 +2513,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
#endif
}
#if defined(__riscv_v)
#if defined(__riscv_v_intrinsic)
static const int8_t keven_signs_q2xs[1024] = {
1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, 1, -1,
@@ -2542,7 +2548,6 @@ static const int8_t keven_signs_q2xs[1024] = {
1, 1, 1, -1, -1, -1, -1, 1, -1, 1, 1, -1, -1, -1, -1, -1, 1, -1, 1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, 1,
1, 1, -1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1,
};
#endif
static void ggml_vec_dot_iq2_xs_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(n % QK_K == 0);
@@ -2618,6 +2623,7 @@ static void ggml_vec_dot_iq2_xs_q8_K_vl256(int n, float * GGML_RESTRICT s, size_
}
*s = 0.125f * sumf;
}
#endif
void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -2634,6 +2640,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_iq2_xxs_q8_K_vl128(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(n % QK_K == 0);
assert(nrc == 1);
@@ -2818,6 +2825,7 @@ static void ggml_vec_dot_iq2_xxs_q8_K_vl256(int n, float * GGML_RESTRICT s, size
}
*s = 0.125f * sumf;
}
#endif
void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -2830,10 +2838,11 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
break;
}
#else
ggml_vec_dot_iq2_xxs_q8_K(n, s, bs, vx, bx, vy, by, nrc);
ggml_vec_dot_iq2_xxs_q8_K_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_iq3_s_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(n % QK_K == 0);
UNUSED(nrc);
@@ -2928,6 +2937,7 @@ static void ggml_vec_dot_iq3_s_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t
}
*s = sumf;
}
#endif
void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -2944,6 +2954,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_iq3_xxs_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(n % QK_K == 0);
assert(nrc == 1);
@@ -3036,6 +3047,7 @@ static void ggml_vec_dot_iq3_xxs_q8_K_vl256(int n, float * GGML_RESTRICT s, size
}
*s = 0.25f * sumf;
}
#endif
void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -3052,6 +3064,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_iq4_nl_q8_0_vl128(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(nrc == 1);
UNUSED(nrc);
@@ -3161,6 +3174,7 @@ static void ggml_vec_dot_iq4_nl_q8_0_vl256(int n, float * GGML_RESTRICT s, size_
*s = sumf;
}
#endif
void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -3177,6 +3191,7 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_iq4_xs_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(nrc == 1);
UNUSED(nrc);
@@ -3190,7 +3205,6 @@ static void ggml_vec_dot_iq4_xs_q8_K_vl256(int n, float * GGML_RESTRICT s, size_
const int nb = n / QK_K;
#if defined __riscv_v_intrinsic
const vint8m4_t values = __riscv_vle8_v_i8m4(kvalues_iq4nl, 16);
float sumf = 0;
int acc[4];
@@ -3252,14 +3266,8 @@ static void ggml_vec_dot_iq4_xs_q8_K_vl256(int n, float * GGML_RESTRICT s, size_
}
*s = sumf;
#else
UNUSED(x);
UNUSED(y);
UNUSED(nb);
ggml_vec_dot_iq4_xs_q8_K_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
#endif
void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -3276,6 +3284,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_tq1_0_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(nrc == 1);
UNUSED(nrc);
@@ -3381,6 +3390,7 @@ static void ggml_vec_dot_tq1_0_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t
*s = sumf;
}
#endif
void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -3397,6 +3407,7 @@ void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_tq2_0_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(n % QK_K == 0);
assert(nrc == 1);
@@ -3467,6 +3478,7 @@ static void ggml_vec_dot_tq2_0_q8_K_vl256(int n, float * GGML_RESTRICT s, size_t
*s = sumf;
}
#endif
void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -3483,6 +3495,7 @@ void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
#endif
}
#if defined __riscv_v_intrinsic
static void ggml_vec_dot_mxfp4_q8_0_vl128(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(nrc == 1);
UNUSED(nrc);
@@ -3592,6 +3605,7 @@ static void ggml_vec_dot_mxfp4_q8_0_vl256(int n, float * GGML_RESTRICT s, size_t
*s = sumf;
}
#endif
void ggml_vec_dot_mxfp4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
#if defined __riscv_v_intrinsic
@@ -3604,6 +3618,6 @@ void ggml_vec_dot_mxfp4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
break;
}
#else
return ggml_vec_dot_mxfp4_q8_0_generic(n, s, bs, vx, bx, vy, by, nrc);
ggml_vec_dot_mxfp4_q8_0_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}

View File

@@ -107,8 +107,7 @@ void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTR
}
#else
UNUSED(nb);
UNUSED(y);
ggml_quantize_mat_q8_0_4x4_generic(x, vy, k);
ggml_quantize_mat_q8_0_4x8_generic(x, vy, k);
#endif
}
@@ -203,6 +202,7 @@ void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
ggml_gemv_q4_0_8x8_q8_0_generic(n, s, bs, vx, vy, nr, nc);
}
#if defined __riscv_zvfh
void ggml_gemv_q4_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
const int qk = QK8_0;
const int nb = n / qk;
@@ -222,7 +222,6 @@ void ggml_gemv_q4_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if defined __riscv_v_intrinsic
const block_q8_0 * a_ptr = (const block_q8_0 *) vy;
for (int x = 0; x < nc / ncols_interleaved; x++) {
const block_q4_0x16 * b_ptr = (const block_q4_0x16 *) vx + (x * nb);
@@ -256,9 +255,6 @@ void ggml_gemv_q4_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
__riscv_vse32_v_f32m2(s + x * 16, sumf, 16);
}
return;
#endif
ggml_gemv_q4_0_16x1_q8_0_generic(n, s, bs, vx, vy, nr, nc);
}
void ggml_gemv_q4_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
@@ -280,7 +276,6 @@ void ggml_gemv_q4_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if defined __riscv_v_intrinsic
const block_q8_K * a_ptr = (const block_q8_K *) vy;
for (int x = 0; x < nc / ncols_interleaved; x++) {
@@ -392,9 +387,6 @@ void ggml_gemv_q4_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
__riscv_vse32_v_f32m2(s + x * 16, sumf, 16);
}
return;
#endif
ggml_gemv_q4_K_16x1_q8_K_generic(n, s, bs, vx, vy, nr, nc);
}
void ggml_gemv_iq4_nl_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
@@ -416,7 +408,6 @@ void ggml_gemv_iq4_nl_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if defined __riscv_v_intrinsic
const vint8mf2_t values = __riscv_vle8_v_i8mf2(kvalues_iq4nl, 16);
const block_q8_0 * a_ptr = (const block_q8_0 *) vy;
for (int x = 0; x < nc / ncols_interleaved; x++) {
@@ -451,9 +442,6 @@ void ggml_gemv_iq4_nl_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const
__riscv_vse32_v_f32m2(s + x * 16, sumf, 16);
}
return;
#endif
ggml_gemv_iq4_nl_16x1_q8_0_generic(n, s, bs, vx, vy, nr, nc);
}
void ggml_gemv_q8_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
@@ -476,7 +464,6 @@ void ggml_gemv_q8_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
UNUSED(blocklen);
UNUSED(bs);
#if defined __riscv_v_intrinsic
const block_q8_0 * a_ptr = (const block_q8_0 *) vy;
for (int x = 0; x < nc / ncols_interleaved; x++) {
const block_q8_0x16 * b_ptr = (const block_q8_0x16 *) vx + (x * nb);
@@ -505,9 +492,6 @@ void ggml_gemv_q8_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
__riscv_vse32_v_f32m2(s + x * 16, sumf, 16);
}
return;
#endif
ggml_gemv_q8_0_16x1_q8_0_generic(n, s, bs, vx, vy, nr, nc);
}
void ggml_gemv_q2_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
@@ -679,9 +663,9 @@ void ggml_gemv_q2_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
} // End K-Block
__riscv_vse32_v_f32m2(s + col_tile, v_sumf, vl);
}
}
#endif
void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
const int qk = QK8_0;
@@ -909,6 +893,7 @@ void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
ggml_gemm_q4_0_8x8_q8_0_generic(n, s, bs, vx, vy, nr, nc);
}
#if defined __riscv_zvfh
void ggml_gemm_q4_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
const int qk = QK8_0;
const int nb = n / qk;
@@ -929,7 +914,6 @@ void ggml_gemm_q4_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if defined __riscv_v_intrinsic
for (int y = 0; y < nr / 4; y++) {
const block_q8_0x4 * a_ptr = (const block_q8_0x4 *) vy + (y * nb);
for (int x = 0; x < nc / ncols_interleaved; x++) {
@@ -994,9 +978,6 @@ void ggml_gemm_q4_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
__riscv_vse32_v_f32m2(s + (y * 4 + 3) * bs + x * 16, sumf_3, 16);
}
}
return;
#endif
ggml_gemm_q4_0_16x1_q8_0_generic(n, s, bs, vx, vy, nr, nc);
}
void ggml_gemm_q4_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
@@ -1019,7 +1000,6 @@ void ggml_gemm_q4_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if defined __riscv_v_intrinsic
for (int y = 0; y < nr / 4; y++) {
const block_q8_Kx4 * a_ptr = (const block_q8_Kx4 *) vy + (y * nb);
for (int x = 0; x < nc / ncols_interleaved; x++) {
@@ -1267,9 +1247,6 @@ void ggml_gemm_q4_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
__riscv_vse32_v_f32m2(s + (y * 4 + 3) * bs + x * 16, sumf_3, 16);
}
}
return;
#endif
ggml_gemm_q4_K_16x1_q8_K_generic(n, s, bs, vx, vy, nr, nc);
}
void ggml_gemm_iq4_nl_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
@@ -1292,7 +1269,6 @@ void ggml_gemm_iq4_nl_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if defined __riscv_v_intrinsic
const vint8mf2_t values = __riscv_vle8_v_i8mf2(kvalues_iq4nl, 16);
for (int y = 0; y < nr / 4; y++) {
const block_q8_0x4 * a_ptr = (const block_q8_0x4 *) vy + (y * nb);
@@ -1355,9 +1331,6 @@ void ggml_gemm_iq4_nl_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const
__riscv_vse32_v_f32m2(s + (y * 4 + 3) * bs + x * 16, sumf_3, 16);
}
}
return;
#endif
ggml_gemm_iq4_nl_16x1_q8_0_generic(n, s, bs, vx, vy, nr, nc);
}
void ggml_gemm_q8_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
@@ -1380,7 +1353,6 @@ void ggml_gemm_q8_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if defined __riscv_v_intrinsic
for (int y = 0; y < nr / 4; y++) {
const block_q8_0x4 * a_ptr = (const block_q8_0x4 *) vy + (y * nb);
for (int x = 0; x < nc / ncols_interleaved; x++) {
@@ -1429,9 +1401,6 @@ void ggml_gemm_q8_0_16x1_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
__riscv_vse32_v_f32m2(s + (y * 4 + 3) * bs + x * 16, sumf_3, 16);
}
}
return;
#endif
ggml_gemm_q8_0_16x1_q8_0_generic(n, s, bs, vx, vy, nr, nc);
}
void ggml_gemm_q2_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
@@ -1731,3 +1700,4 @@ void ggml_gemm_q2_K_16x1_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
}
}
}
#endif

View File

@@ -1461,7 +1461,7 @@ class extra_buffer_type : ggml::cpu::extra_buffer_type {
return false;
}
if ((op->src[1]->type == GGML_TYPE_F32 || op->src[1]->type == GGML_TYPE_I32) &&
ggml_ne(op->src[1], 2) == 1 && ggml_ne(op->src[1], 3) == 1) {
ggml_ne(op->src[1], 3) == 1) {
return true;
}
}

View File

@@ -1365,6 +1365,7 @@ void ggml_gemv_q8_0_4x8_q8_0_generic(int n,
}
}
// Only enable these for RISC-V.
#if defined __riscv_zvfh
void ggml_gemv_q4_0_16x1_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
const int qk = QK8_0;
@@ -1568,6 +1569,7 @@ void ggml_gemv_q2_K_16x1_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
assert(nc % 16 == 0);
UNUSED(bs);
UNUSED(nr);
const int nb = n / QK_K;
const block_q2_Kx16 * x = (const block_q2_Kx16 *)vx;
@@ -2381,6 +2383,7 @@ void ggml_gemm_q8_0_4x8_q8_0_generic(int n,
}
}
// Only enable these for RISC-V.
#if defined __riscv_zvfh
void ggml_gemm_q4_0_16x1_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, const void * GGML_RESTRICT vy, int nr, int nc) {
const int qk = QK8_0;

View File

@@ -126,7 +126,7 @@ static cudaError_t ggml_cuda_device_malloc(void ** ptr, size_t size, int device)
if (err == hipSuccess) {
// hipMemAdviseSetCoarseGrain is an optional performance hint;
// ignore errors (e.g. hipErrorInvalidValue on some APU/iGPU configs).
cudaMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device);
(void)cudaMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device);
(void)hipGetLastError(); // clear any error
}

View File

@@ -2362,6 +2362,27 @@ static inline size_t init_cpy_req(htp_general_req * req, dspqueue_buffer * bufs,
return n_bufs;
}
static inline size_t init_cont_req(htp_general_req * req, dspqueue_buffer * bufs, const ggml_tensor * t) {
// CONT is just a contiguous copy — reuse CPY op
req->op = HTP_OP_CPY;
size_t n_bufs = 0;
n_bufs += htp_req_buff_init(&req->src0, &bufs[n_bufs], t->src[0], DSPQBUF_TYPE_CPU_WRITE_DSP_READ);
n_bufs += htp_req_buff_init(&req->dst, &bufs[n_bufs], t, DSPQBUF_TYPE_DSP_WRITE_CPU_READ);
return n_bufs;
}
static inline size_t init_repeat_req(htp_general_req * req, dspqueue_buffer * bufs, const ggml_tensor * t) {
req->op = HTP_OP_REPEAT;
size_t n_bufs = 0;
n_bufs += htp_req_buff_init(&req->src0, &bufs[n_bufs], t->src[0], DSPQBUF_TYPE_CPU_WRITE_DSP_READ);
n_bufs += htp_req_buff_init(&req->dst, &bufs[n_bufs], t, DSPQBUF_TYPE_DSP_WRITE_CPU_READ);
return n_bufs;
}
static inline size_t init_get_rows_req(htp_general_req * req, dspqueue_buffer * bufs, const ggml_tensor * t) {
req->op = HTP_OP_GET_ROWS;
@@ -2449,12 +2470,33 @@ static inline size_t init_unary_req(htp_general_req * req, dspqueue_buffer * buf
break;
case GGML_OP_UNARY:
if (ggml_get_unary_op(t) == GGML_UNARY_OP_SILU) {
switch (ggml_get_unary_op(t)) {
case GGML_UNARY_OP_SILU:
req->op = HTP_OP_UNARY_SILU;
supported = true;
} else if (ggml_get_unary_op(t) == GGML_UNARY_OP_GELU) {
break;
case GGML_UNARY_OP_GELU:
req->op = HTP_OP_UNARY_GELU;
supported = true;
break;
case GGML_UNARY_OP_SIGMOID:
req->op = HTP_OP_UNARY_SIGMOID;
supported = true;
break;
case GGML_UNARY_OP_NEG:
req->op = HTP_OP_UNARY_NEG;
supported = true;
break;
case GGML_UNARY_OP_EXP:
req->op = HTP_OP_UNARY_EXP;
supported = true;
break;
case GGML_UNARY_OP_SOFTPLUS:
req->op = HTP_OP_UNARY_SOFTPLUS;
supported = true;
break;
default:
break;
}
break;
@@ -2640,16 +2682,28 @@ static ggml_status ggml_backend_hexagon_graph_compute(ggml_backend_t backend, gg
ggml_hexagon_dispatch_op<init_sum_rows_req>(sess, node, flags);
break;
case GGML_OP_UNARY:
if ((ggml_get_unary_op(node) == GGML_UNARY_OP_SILU) ||
(ggml_get_unary_op(node) == GGML_UNARY_OP_GELU)) {
ggml_hexagon_dispatch_op<init_unary_req>(sess, node, flags);
switch (ggml_get_unary_op(node)) {
case GGML_UNARY_OP_NEG:
case GGML_UNARY_OP_EXP:
case GGML_UNARY_OP_SIGMOID:
case GGML_UNARY_OP_SOFTPLUS:
case GGML_UNARY_OP_SILU:
case GGML_UNARY_OP_GELU:
ggml_hexagon_dispatch_op<init_unary_req>(sess, node, flags);
break;
default:
break;
}
break;
case GGML_OP_GLU:
if ((ggml_get_glu_op(node) == GGML_GLU_OP_SWIGLU) ||
(ggml_get_glu_op(node) == GGML_GLU_OP_SWIGLU_OAI) ||
(ggml_get_glu_op(node) == GGML_GLU_OP_GEGLU)) {
ggml_hexagon_dispatch_op<init_unary_req>(sess, node, flags);
switch (ggml_get_glu_op(node)) {
case GGML_GLU_OP_SWIGLU:
case GGML_GLU_OP_SWIGLU_OAI:
case GGML_GLU_OP_GEGLU:
ggml_hexagon_dispatch_op<init_unary_req>(sess, node, flags);
break;
default:
break;
}
break;
case GGML_OP_SOFT_MAX:
@@ -2676,6 +2730,14 @@ static ggml_status ggml_backend_hexagon_graph_compute(ggml_backend_t backend, gg
ggml_hexagon_dispatch_op<init_cpy_req>(sess, node, flags);
break;
case GGML_OP_CONT:
ggml_hexagon_dispatch_op<init_cont_req>(sess, node, flags);
break;
case GGML_OP_REPEAT:
ggml_hexagon_dispatch_op<init_repeat_req>(sess, node, flags);
break;
case GGML_OP_ARGSORT:
ggml_hexagon_dispatch_op<init_argsort_req>(sess, node, flags);
break;
@@ -3006,6 +3068,39 @@ static bool ggml_hexagon_supported_cpy(const struct ggml_hexagon_session * sess,
return true;
}
static bool ggml_hexagon_supported_cont(const struct ggml_hexagon_session * sess, const struct ggml_tensor * op) {
GGML_UNUSED(sess);
const struct ggml_tensor * src0 = op->src[0];
// CONT is same-type only, supports f32 and f16
if (src0->type != GGML_TYPE_F32 && src0->type != GGML_TYPE_F16) return false;
return true;
}
static bool ggml_hexagon_supported_repeat(const struct ggml_hexagon_session * sess, const struct ggml_tensor * op) {
GGML_UNUSED(sess);
const struct ggml_tensor * src0 = op->src[0];
const struct ggml_tensor * dst = op;
// Support f32 and f16
if (src0->type != GGML_TYPE_F32 && src0->type != GGML_TYPE_F16) return false;
// src and dst must be the same type
if (src0->type != dst->type) return false;
// dst dims must be multiples of src dims
if (dst->ne[0] % src0->ne[0] != 0) return false;
if (dst->ne[1] % src0->ne[1] != 0) return false;
if (dst->ne[2] % src0->ne[2] != 0) return false;
if (dst->ne[3] % src0->ne[3] != 0) return false;
// require contiguous tensors (no transposition)
if (ggml_is_transposed(src0) || ggml_is_transposed(dst)) return false;
return true;
}
static bool ggml_backend_hexagon_device_supports_op(ggml_backend_dev_t dev, const struct ggml_tensor * op) {
auto sess = static_cast<ggml_hexagon_session *>(dev->context);
@@ -3063,21 +3158,32 @@ static bool ggml_backend_hexagon_device_supports_op(ggml_backend_dev_t dev, cons
break;
case GGML_OP_UNARY:
{
const auto unary_op = ggml_get_unary_op(op);
if (unary_op == GGML_UNARY_OP_SILU || unary_op == GGML_UNARY_OP_GELU) {
switch (ggml_get_unary_op(op)) {
case GGML_UNARY_OP_NEG:
case GGML_UNARY_OP_EXP:
case GGML_UNARY_OP_SIGMOID:
case GGML_UNARY_OP_SOFTPLUS:
supp = ggml_hexagon_supported_unary(sess, op);
break;
case GGML_UNARY_OP_SILU:
case GGML_UNARY_OP_GELU:
supp = ggml_hexagon_supported_activations(sess, op);
}
break;
break;
default:
break;
}
break;
case GGML_OP_GLU:
{
const auto glu_op = ggml_get_glu_op(op);
if ((glu_op == GGML_GLU_OP_SWIGLU) || (glu_op == GGML_GLU_OP_SWIGLU_OAI) || (glu_op == GGML_GLU_OP_GEGLU)) {
switch (ggml_get_glu_op(op)) {
case GGML_GLU_OP_SWIGLU:
case GGML_GLU_OP_SWIGLU_OAI:
case GGML_GLU_OP_GEGLU:
supp = ggml_hexagon_supported_activations(sess, op);
}
break;
break;
default:
break;
}
break;
case GGML_OP_ROPE:
supp = ggml_hexagon_supported_rope(sess, op);
break;
@@ -3098,6 +3204,14 @@ static bool ggml_backend_hexagon_device_supports_op(ggml_backend_dev_t dev, cons
supp = ggml_hexagon_supported_cpy(sess, op);
break;
case GGML_OP_CONT:
supp = ggml_hexagon_supported_cont(sess, op);
break;
case GGML_OP_REPEAT:
supp = ggml_hexagon_supported_repeat(sess, op);
break;
case GGML_OP_ARGSORT:
supp = ggml_hexagon_supported_argsort(sess, op);
break;

View File

@@ -30,6 +30,7 @@ add_library(${HTP_LIB} SHARED
set-rows-ops.c
get-rows-ops.c
cpy-ops.c
repeat-ops.c
argsort-ops.c
ssm-conv.c
)

View File

@@ -53,6 +53,10 @@ enum htp_op {
HTP_OP_RMS_NORM,
HTP_OP_UNARY_SILU,
HTP_OP_UNARY_GELU,
HTP_OP_UNARY_SIGMOID,
HTP_OP_UNARY_EXP,
HTP_OP_UNARY_NEG,
HTP_OP_UNARY_SOFTPLUS,
HTP_OP_GLU_SWIGLU,
HTP_OP_GLU_SWIGLU_OAI,
HTP_OP_GLU_GEGLU,
@@ -69,6 +73,7 @@ enum htp_op {
HTP_OP_SQRT,
HTP_OP_SUM_ROWS,
HTP_OP_SSM_CONV,
HTP_OP_REPEAT,
INVALID
};

View File

@@ -57,6 +57,7 @@ int op_flash_attn_ext(struct htp_ops_context * octx);
int op_set_rows(struct htp_ops_context * octx);
int op_get_rows(struct htp_ops_context * octx);
int op_cpy(struct htp_ops_context * octx);
int op_repeat(struct htp_ops_context * octx);
int op_argsort(struct htp_ops_context * octx);
int op_ssm_conv(struct htp_ops_context * octx);

View File

@@ -3,6 +3,8 @@
#include <stdbool.h>
#include <stdint.h>
#include <math.h>
#include <assert.h>
#include "hex-utils.h"
#include "hvx-types.h"

View File

@@ -3,6 +3,7 @@
#include <stdbool.h>
#include <stdint.h>
#include <math.h>
#include "hvx-base.h"
#include "hvx-floor.h"
@@ -16,8 +17,8 @@
#define EXP_LOGN2 (0x3F317218) // ln(2) = 0.6931471805
#define EXP_LOG2E (0x3FB8AA3B) // log2(e) = 1/ln(2) = 1.4426950408
#define EXP_ONE (0x3f800000) // 1.0
#define EXP_RANGE_R (0x41a00000) // 20.0
#define EXP_RANGE_L (0xc1a00000) // -20.0
#define EXP_RANGE_R (0x42B16666) // 88.7
#define EXP_RANGE_L (0xC2B00000) // -88.0 (approx log(FLT_MIN))
static inline HVX_Vector hvx_vec_exp_f32(HVX_Vector in_vec) {
HVX_Vector z_qf32_v;
@@ -47,12 +48,12 @@ static inline HVX_Vector hvx_vec_exp_f32(HVX_Vector in_vec) {
HVX_Vector temp_v = in_vec;
// Clamp inputs to (-20.0, 20.0)
// Clamp inputs to (-88.0, 88.0) to avoid overflow/underflow
HVX_VectorPred pred_cap_right = Q6_Q_vcmp_gt_VsfVsf(in_vec, Q6_V_vsplat_R(EXP_RANGE_R));
HVX_VectorPred pred_cap_left = Q6_Q_vcmp_gt_VsfVsf(Q6_V_vsplat_R(EXP_RANGE_L), in_vec);
in_vec = Q6_V_vmux_QVV(pred_cap_right, Q6_V_vsplat_R(EXP_RANGE_R), temp_v);
in_vec = Q6_V_vmux_QVV(pred_cap_left, Q6_V_vsplat_R(EXP_RANGE_L), temp_v);
in_vec = Q6_V_vmux_QVV(pred_cap_left, Q6_V_vsplat_R(EXP_RANGE_L), in_vec);
epsilon_v = Q6_Vqf32_vmpy_VsfVsf(log2e, in_vec);
epsilon_v = Q6_Vsf_equals_Vqf32(epsilon_v);
@@ -69,12 +70,12 @@ static inline HVX_Vector hvx_vec_exp_f32(HVX_Vector in_vec) {
// normalize before every QFloat's vmpy
x_qf32_v = Q6_Vqf32_vadd_Vqf32Vsf(x_qf32_v, zero_v);
x_v = Q6_Vsf_equals_Vqf32(x_qf32_v);
// z = x * x;
z_qf32_v = Q6_Vqf32_vmpy_Vqf32Vqf32(x_qf32_v, x_qf32_v);
z_qf32_v = Q6_Vqf32_vadd_Vqf32Vsf(z_qf32_v, zero_v);
x_v = Q6_Vsf_equals_Vqf32(x_qf32_v);
// y = E4 + E5 * x;
E_const = Q6_V_vsplat_R(EXP_COEFF_5);
y_v = Q6_Vqf32_vmpy_VsfVsf(E_const, x_v);
@@ -145,7 +146,7 @@ static inline HVX_Vector hvx_vec_exp_f32_guard(HVX_Vector in_vec, HVX_Vector max
return Q6_V_vmux_QVV(pred0, inf, out);
}
static inline void hvx_exp_f32(const uint8_t * restrict src, uint8_t * restrict dst, const int num_elems, bool negate) {
static inline void hvx_exp_f32(uint8_t * restrict dst, const uint8_t * restrict src, const int num_elems, bool negate) {
int left_over = num_elems & (VLEN_FP32 - 1);
int num_elems_whole = num_elems - left_over;
@@ -162,7 +163,7 @@ static inline void hvx_exp_f32(const uint8_t * restrict src, uint8_t * restrict
HVX_Vector vec_out = Q6_V_vzero();
static const float kInf = INFINITY;
static const float kMaxExp = 88.02f; // log(INF)
static const float kMaxExp = 88.7f;
const HVX_Vector max_exp = hvx_vec_splat_f32(kMaxExp);
const HVX_Vector inf = hvx_vec_splat_f32(kInf);

View File

@@ -2,6 +2,7 @@
#define HVX_SIGMOID_H
#include "hvx-base.h"
#include "hvx-inverse.h"
#define FAST_SIGMOID_LOG2F (0x3fb8aa3b) // 1.442695022
#define FAST_SIGMOID_C1 (0x3d009076) // 0.03138777

View File

@@ -516,6 +516,39 @@ static void proc_cpy_req(struct htp_context * ctx, struct htp_general_req * req,
send_htp_rsp(ctx, req->op, rsp_status, rsp_bufs, 1, &prof);
}
static void proc_repeat_req(struct htp_context * ctx, struct htp_general_req * req, struct dspqueue_buffer * bufs) {
struct dspqueue_buffer rsp_bufs[1];
// We had written to the output buffer, we'd also need to flush it
rsp_bufs[0].fd = bufs[1].fd;
rsp_bufs[0].ptr = bufs[1].ptr;
rsp_bufs[0].offset = bufs[1].offset;
rsp_bufs[0].size = bufs[1].size;
rsp_bufs[0].flags = (DSPQUEUE_BUFFER_FLAG_FLUSH_SENDER | // Flush HTP
DSPQUEUE_BUFFER_FLAG_INVALIDATE_RECIPIENT); // Invalidate CPU
// Setup Op context
struct htp_ops_context octx = { 0 };
octx.ctx = ctx;
octx.src0 = req->src0;
octx.dst = req->dst;
octx.flags = req->flags;
octx.op = req->op;
// Update data pointers
octx.src0.data = (uint32_t) bufs[0].ptr;
octx.dst.data = (uint32_t) bufs[1].ptr;
octx.n_threads = ctx->n_threads;
struct profile_data prof;
profile_start(&prof);
uint32_t rsp_status = op_repeat(&octx);
profile_stop(&prof);
send_htp_rsp(ctx, req->op, rsp_status, rsp_bufs, 1, &prof);
}
static void proc_get_rows_req(struct htp_context * ctx, struct htp_general_req * req, struct dspqueue_buffer * bufs) {
struct dspqueue_buffer rsp_bufs[1];
@@ -1090,6 +1123,10 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
case HTP_OP_SQR:
case HTP_OP_SQRT:
case HTP_OP_UNARY_NEG:
case HTP_OP_UNARY_EXP:
case HTP_OP_UNARY_SIGMOID:
case HTP_OP_UNARY_SOFTPLUS:
if (n_bufs != 2) {
FARF(ERROR, "Bad unary-req buffer list");
continue;
@@ -1175,6 +1212,14 @@ static void htp_packet_callback(dspqueue_t queue, int error, void * context) {
proc_cpy_req(ctx, &req, bufs);
break;
case HTP_OP_REPEAT:
if (n_bufs != 2) {
FARF(ERROR, "Bad repeat-req buffer list");
continue;
}
proc_repeat_req(ctx, &req, bufs);
break;
case HTP_OP_ARGSORT:
if (n_bufs != 2) {
FARF(ERROR, "Bad argsort-req buffer list");

View File

@@ -0,0 +1,148 @@
#pragma clang diagnostic ignored "-Wunused-variable"
#pragma clang diagnostic ignored "-Wunused-function"
#pragma clang diagnostic ignored "-Wunused-but-set-variable"
#include <HAP_farf.h>
#include <HAP_perf.h>
#include <string.h>
#include "hvx-utils.h"
#define GGML_COMMON_DECL_C
#include "ggml-common.h"
#include "htp-ctx.h"
#include "htp-msg.h"
#include "htp-ops.h"
struct htp_repeat_context {
struct htp_ops_context * octx;
uint32_t nr0;
uint32_t nr1;
uint32_t nr2;
uint32_t nr3;
uint32_t nrows_per_thread;
uint32_t total_dst_rows; // ne1 * ne2 * ne3
size_t type_size;
};
static void repeat_job_per_thread(unsigned int nth, unsigned int ith, void * data) {
const struct htp_repeat_context * rctx = (const struct htp_repeat_context *) data;
struct htp_ops_context * octx = rctx->octx;
const struct htp_tensor * src = &octx->src0;
const struct htp_tensor * dst = &octx->dst;
const uint32_t ne00 = src->ne[0];
const uint32_t ne01 = src->ne[1];
const uint32_t ne02 = src->ne[2];
const uint32_t ne03 = src->ne[3];
const uint32_t nb00 = src->nb[0];
const uint32_t nb01 = src->nb[1];
const uint32_t nb02 = src->nb[2];
const uint32_t nb03 = src->nb[3];
const uint32_t ne0 = dst->ne[0];
const uint32_t ne1 = dst->ne[1];
const uint32_t ne2 = dst->ne[2];
const uint32_t ne3 = dst->ne[3];
const uint32_t nb0 = dst->nb[0];
const uint32_t nb1 = dst->nb[1];
const uint32_t nb2 = dst->nb[2];
const uint32_t nb3 = dst->nb[3];
const uint32_t nr0 = rctx->nr0;
const uint32_t nr1 = rctx->nr1;
const uint32_t nr2 = rctx->nr2;
const uint32_t nr3 = rctx->nr3;
const size_t row_bytes = ne00 * rctx->type_size;
const uint32_t row_start = rctx->nrows_per_thread * ith;
const uint32_t row_end = MIN(row_start + rctx->nrows_per_thread, rctx->total_dst_rows);
uint64_t t1, t2;
t1 = HAP_perf_get_qtimer_count();
for (uint32_t dst_row = row_start; dst_row < row_end; dst_row++) {
// Decompose flat dst row index into (i1, i2, i3)
const uint32_t i1 = dst_row % ne1;
const uint32_t i2 = (dst_row / ne1) % ne2;
const uint32_t i3 = dst_row / (ne1 * ne2);
// Map to source indices (tiling)
const uint32_t k1 = i1 % ne01;
const uint32_t k2 = i2 % ne02;
const uint32_t k3 = i3 % ne03;
const uint8_t * src_row = (const uint8_t *) src->data + k1 * nb01 + k2 * nb02 + k3 * nb03;
uint8_t * dst_base = (uint8_t *) dst->data + i1 * nb1 + i2 * nb2 + i3 * nb3;
// Tile along dimension 0
for (uint32_t i0 = 0; i0 < nr0; i0++) {
uint8_t * dst_ptr = dst_base + i0 * ne00 * nb0;
memcpy(dst_ptr, src_row, row_bytes);
}
}
t2 = HAP_perf_get_qtimer_count();
FARF(HIGH, "repeat %d/%d: (%ux%ux%ux%u) -> (%ux%ux%ux%u) rows %u:%u usec %u\n",
ith, nth, src->ne[0], src->ne[1], src->ne[2], src->ne[3],
dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3],
row_start, row_end, (unsigned) HAP_perf_qtimer_count_to_us(t2 - t1));
}
int op_repeat(struct htp_ops_context * octx) {
const struct htp_tensor * src0 = &octx->src0;
struct htp_tensor * dst = &octx->dst;
// Validate that dst dims are multiples of src dims
if (dst->ne[0] % src0->ne[0] != 0 ||
dst->ne[1] % src0->ne[1] != 0 ||
dst->ne[2] % src0->ne[2] != 0 ||
dst->ne[3] % src0->ne[3] != 0) {
FARF(ERROR, "repeat: dst dims must be multiples of src dims\n");
return HTP_STATUS_INVAL_PARAMS;
}
size_t type_size;
switch (src0->type) {
case HTP_TYPE_F32: type_size = 4; break;
case HTP_TYPE_F16: type_size = 2; break;
default:
FARF(ERROR, "repeat: unsupported type %u\n", src0->type);
return HTP_STATUS_NO_SUPPORT;
}
const uint32_t total_dst_rows = dst->ne[1] * dst->ne[2] * dst->ne[3];
const uint32_t n_threads = MIN(octx->n_threads, total_dst_rows);
if (octx->flags & HTP_OPFLAGS_SKIP_COMPUTE) {
return HTP_STATUS_OK;
}
struct htp_repeat_context rctx = {
.octx = octx,
.nr0 = dst->ne[0] / src0->ne[0],
.nr1 = dst->ne[1] / src0->ne[1],
.nr2 = dst->ne[2] / src0->ne[2],
.nr3 = dst->ne[3] / src0->ne[3],
.nrows_per_thread = (total_dst_rows + n_threads - 1) / n_threads,
.total_dst_rows = total_dst_rows,
.type_size = type_size,
};
FARF(HIGH, "repeat: (%ux%ux%ux%u) -> (%ux%ux%ux%u) nr=(%u,%u,%u,%u)\n",
src0->ne[0], src0->ne[1], src0->ne[2], src0->ne[3],
dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3],
rctx.nr0, rctx.nr1, rctx.nr2, rctx.nr3);
worker_pool_run_func(octx->ctx->worker_pool, repeat_job_per_thread, &rctx, n_threads);
return HTP_STATUS_OK;
}

View File

@@ -195,7 +195,7 @@ static float hvx_softmax_f32(const uint8_t * restrict src,
const float max) {
hvx_sub_scalar_f32(spad, src, max, num_elems);
hvx_exp_f32(spad, dst, num_elems, false);
hvx_exp_f32(dst, spad, num_elems, false);
float sum = hvx_reduce_sum_f32(dst, num_elems);

View File

@@ -9,6 +9,8 @@
#include <string.h>
#include "hex-dma.h"
#include "hvx-exp.h"
#include "hvx-sigmoid.h"
#include "hvx-utils.h"
#define GGML_COMMON_DECL_C
@@ -166,6 +168,75 @@ static void sqrt_f32(const float * restrict src,
}
}
static void neg_f32(const float * restrict src,
float * restrict dst,
uint8_t * restrict spad,
const uint32_t num_rows,
const uint32_t row_elems,
const size_t row_size,
int32_t * op_params) {
for (uint32_t ir = 0; ir < num_rows; ir++) {
const uint8_t * restrict src_local = (const uint8_t *)src + (ir * row_size);
uint8_t * restrict dst_local = (uint8_t *)dst + (ir * row_size);
hvx_scale_f32_aa(dst_local, src_local, row_elems, -1.0f);
}
}
static void exp_f32(const float * restrict src,
float * restrict dst,
uint8_t * restrict spad,
const uint32_t num_rows,
const uint32_t row_elems,
const size_t row_size,
int32_t * op_params) {
for (uint32_t ir = 0; ir < num_rows; ir++) {
const uint8_t * restrict src_local = (const uint8_t *)src + (ir * row_size);
uint8_t * restrict dst_local = (uint8_t *)dst + (ir * row_size);
hvx_exp_f32(dst_local, src_local, row_elems, false);
}
}
static void sigmoid_f32(const float * restrict src,
float * restrict dst,
uint8_t * restrict spad,
const uint32_t num_rows,
const uint32_t row_elems,
const size_t row_size,
int32_t * op_params) {
for (uint32_t ir = 0; ir < num_rows; ir++) {
const uint8_t * restrict src_local = (const uint8_t *)src + (ir * row_size);
uint8_t * restrict dst_local = (uint8_t *)dst + (ir * row_size);
hvx_sigmoid_f32_aa(dst_local, src_local, row_elems);
}
}
static void softplus_f32(const float * restrict src,
float * restrict dst,
uint8_t * restrict spad,
const uint32_t num_rows,
const uint32_t row_elems,
const size_t row_size,
int32_t * op_params) {
// softplus(x) = log(1 + exp(x))
// Match CPU reference: ggml_compute_softplus_f32() in ggml-impl.h
for (uint32_t ir = 0; ir < num_rows; ir++) {
const float * restrict src_f = (const float *)((const uint8_t *)src + (ir * row_size));
float * restrict dst_f = (float *)((uint8_t *)dst + (ir * row_size));
for (uint32_t i = 0; i < row_elems; i++) {
float x = src_f[i];
// For x > 20: softplus(x) ≈ x (avoids exp overflow)
dst_f[i] = (x > 20.0f) ? x : logf(1.0f + expf(x));
}
}
}
static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void * data) {
const struct htp_unary_context * uctx = (const struct htp_unary_context *) data;
struct htp_ops_context * octx = uctx->octx;
@@ -247,6 +318,18 @@ static void unary_job_f32_per_thread(unsigned int nth, unsigned int ith, void *
case HTP_OP_SQRT:
sqrt_f32(src0_spad, dst_spad, NULL, block_size, ne0, src0_row_size_aligned, op_params);
break;
case HTP_OP_UNARY_NEG:
neg_f32(src0_spad, dst_spad, NULL, block_size, ne0, src0_row_size_aligned, op_params);
break;
case HTP_OP_UNARY_EXP:
exp_f32(src0_spad, dst_spad, NULL, block_size, ne0, src0_row_size_aligned, op_params);
break;
case HTP_OP_UNARY_SIGMOID:
sigmoid_f32(src0_spad, dst_spad, NULL, block_size, ne0, src0_row_size_aligned, op_params);
break;
case HTP_OP_UNARY_SOFTPLUS:
softplus_f32(src0_spad, dst_spad, NULL, block_size, ne0, src0_row_size_aligned, op_params);
break;
default:
break;
}
@@ -295,6 +378,18 @@ static int execute_op_unary_f32(struct htp_ops_context * octx) {
case HTP_OP_SQRT:
op_type = "sqrt-f32";
break;
case HTP_OP_UNARY_NEG:
op_type = "neg-f32";
break;
case HTP_OP_UNARY_EXP:
op_type = "exp-f32";
break;
case HTP_OP_UNARY_SIGMOID:
op_type = "sigmoid-f32";
break;
case HTP_OP_UNARY_SOFTPLUS:
op_type = "softplus-f32";
break;
default:
FARF(ERROR, "Unsupported unary Op %u\n", octx->op);

View File

@@ -191,6 +191,7 @@ struct vk_queue;
struct vk_command_buffer {
vk::CommandBuffer buf;
uint64_t use_counter = 0;
bool in_use = false;
};
@@ -938,21 +939,26 @@ struct vk_subbuffer {
}
};
// vk_event is used for the event-related backend interfaces. It uses 'event' for
// event_wait and 'fence' for event_synchronize. Polling on an event for
// event_synchronize wouldn't be sufficient to wait for command buffers to complete,
// and would lead to validation errors.
struct vk_event {
vk::Event event;
vk::Fence fence;
vk_command_buffer* cmd_buffer = nullptr;
};
struct vk_semaphore {
vk::Semaphore s;
uint64_t value;
};
// vk_event is used for the event-related backend interfaces. It uses vk::Events for
// event_wait and a timeline semaphore for event_synchronize. Polling on an event for
// event_synchronize wouldn't be sufficient to wait for command buffers to complete,
// and would lead to validation errors.
struct vk_event {
std::vector<vk::Event> events_free; // Events available for reuse
std::vector<vk::Event> events_submitted; // Events that are fully submitted and can be reused on next synchronize
vk::Event event;
bool has_event;
vk_semaphore tl_semaphore;
vk_command_buffer* cmd_buffer = nullptr;
uint64_t cmd_buffer_use_counter = 0;
};
struct vk_submission {
vk_command_buffer* buffer = nullptr;
std::vector<vk_semaphore> wait_semaphores;
@@ -2319,7 +2325,7 @@ static vk_command_buffer* ggml_vk_create_cmd_buffer(vk_device& device, vk_comman
vk::CommandBufferLevel::ePrimary,
1);
const std::vector<vk::CommandBuffer> cmd_buffers = device->device.allocateCommandBuffers(command_buffer_alloc_info);
p.cmd_buffers.push_back({ cmd_buffers.front(), true });
p.cmd_buffers.push_back({ cmd_buffers.front(), 0, true });
return &p.cmd_buffers[p.cmd_buffers.size()-1];
}
@@ -2788,6 +2794,15 @@ static void ggml_vk_sync_buffers(ggml_backend_vk_context* ctx, vk_context& subct
);
}
static void ggml_vk_reset_event(vk_context& ctx, vk::Event& event) {
VK_LOG_DEBUG("ggml_vk_set_event()");
ctx->s->buffer->buf.resetEvent(
event,
ctx->p->q->stage_flags
);
}
static void ggml_vk_set_event(vk_context& ctx, vk::Event& event) {
VK_LOG_DEBUG("ggml_vk_set_event()");
@@ -4981,8 +4996,9 @@ static vk_device ggml_vk_get_device(size_t idx) {
std::vector<vk::QueueFamilyProperties> queue_family_props = device->physical_device.getQueueFamilyProperties();
// Try to find a non-graphics compute queue and transfer-focused queues
// On AMD, the graphics queue seems to be faster, so don't avoid it
const vk::QueueFlagBits graphics_flag = device->vendor_id == VK_VENDOR_ID_AMD ? (vk::QueueFlagBits)0 : vk::QueueFlagBits::eGraphics;
// Allow overriding avoiding the graphics queue because it can increase performance on RADV
const bool allow_graphics_queue = (getenv("GGML_VK_ALLOW_GRAPHICS_QUEUE") != nullptr);
const vk::QueueFlagBits graphics_flag = allow_graphics_queue ? (vk::QueueFlagBits)0 : vk::QueueFlagBits::eGraphics;
const uint32_t compute_queue_family_index = ggml_vk_find_queue_family_index(queue_family_props, vk::QueueFlagBits::eCompute, graphics_flag, -1, 1);
const uint32_t transfer_queue_family_index = ggml_vk_find_queue_family_index(queue_family_props, vk::QueueFlagBits::eTransfer, vk::QueueFlagBits::eCompute | graphics_flag, compute_queue_family_index, 1);
@@ -5443,11 +5459,14 @@ static vk_device ggml_vk_get_device(size_t idx) {
ggml_vk_load_shaders(device);
// Only use transfer queue on AMD non-GCN, when the graphics queue is not enabled
const bool prefers_transfer_queue = device->vendor_id == VK_VENDOR_ID_AMD && device->architecture != AMD_GCN && !allow_graphics_queue;
if (!device->single_queue) {
const uint32_t transfer_queue_index = compute_queue_family_index == transfer_queue_family_index ? 1 : 0;
ggml_vk_create_queue(device, device->transfer_queue, transfer_queue_family_index, transfer_queue_index, { vk::PipelineStageFlagBits::eTransfer }, true);
device->async_use_transfer_queue = (getenv("GGML_VK_ASYNC_USE_TRANSFER_QUEUE") != nullptr);
device->async_use_transfer_queue = prefers_transfer_queue || (getenv("GGML_VK_ASYNC_USE_TRANSFER_QUEUE") != nullptr);
} else {
// TODO: Use pointer or reference to avoid copy
device->transfer_queue.copyFrom(device->compute_queue);
@@ -6392,6 +6411,7 @@ static vk_subbuffer ggml_vk_tensor_subbuffer(
static vk_command_buffer* ggml_vk_get_or_create_cmd_buffer(vk_device& device, vk_command_pool& pool) {
for (auto& cmd_buffer : pool.cmd_buffers) {
if (!cmd_buffer.in_use) {
cmd_buffer.use_counter++;
cmd_buffer.in_use = true;
return &cmd_buffer;
}
@@ -6496,15 +6516,16 @@ static void ggml_vk_ctx_begin(vk_device& device, vk_context& subctx) {
}
static vk_context ggml_vk_get_compute_ctx(ggml_backend_vk_context * ctx) {
vk_context result;
if (!ctx->compute_ctx.expired()) {
return ctx->compute_ctx.lock();
result = ctx->compute_ctx.lock();
} else {
result = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = result;
ggml_vk_ctx_begin(ctx->device, result);
}
vk_context result = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = result;
ggml_vk_ctx_begin(ctx->device, result);
if (ctx->device->async_use_transfer_queue && ctx->transfer_semaphore_last_submitted < ctx->transfer_semaphore.value) {
result->s->wait_semaphores.push_back(ctx->transfer_semaphore);
ctx->transfer_semaphore_last_submitted = ctx->transfer_semaphore.value;
@@ -7625,20 +7646,14 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
return true;
}
case VK_VENDOR_ID_INTEL:
if (k < 2048) {
if (device->driver_id == vk::DriverId::eIntelProprietaryWindows) {
// Intel Windows proprietary driver MMVQ performance is worse than fp16, see
// https://github.com/ggml-org/llama.cpp/issues/17628
return false;
}
if (device->driver_id == vk::DriverId::eIntelProprietaryWindows) {
// Intel Windows proprietary driver tuning
switch (src0_type) {
case GGML_TYPE_MXFP4:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
return false;
default:
return true;
}
if (k < 2048) {
return false;
}
switch (src0_type) {
@@ -13797,6 +13812,7 @@ static void ggml_vk_synchronize(ggml_backend_vk_context * ctx) {
ctx->submit_pending = false;
if (cmd_buf) {
cmd_buf->in_use = false;
cmd_buf->buf.reset();
}
}
@@ -14858,18 +14874,31 @@ static void ggml_backend_vk_event_record(ggml_backend_t backend, ggml_backend_ev
vk_context compute_ctx = ggml_vk_get_compute_ctx(ctx);
auto* cmd_buf = compute_ctx->s->buffer; // retrieve pointer before it gets reset
// the backend interface doesn't have an explicit reset, so reset it here
// before we record the command to set it
ctx->device->device.resetEvent(vkev->event);
ctx->device->device.resetFences({ vkev->fence });
if (vkev->has_event) {
// Move existing event into submitted
vkev->events_submitted.push_back(vkev->event);
}
// Grab the next event and record it, create one if necessary
if (vkev->events_free.empty()) {
vkev->event = ctx->device->device.createEvent({});
} else {
vkev->event = vkev->events_free.back();
vkev->events_free.pop_back();
}
vkev->has_event = true;
ggml_vk_set_event(compute_ctx, vkev->event);
vkev->tl_semaphore.value++;
compute_ctx->s->signal_semaphores.push_back(vkev->tl_semaphore);
ggml_vk_ctx_end(compute_ctx);
ggml_vk_submit(compute_ctx, {vkev->fence});
ggml_vk_submit(compute_ctx, {});
ctx->submit_pending = true;
vkev->cmd_buffer = cmd_buf;
vkev->cmd_buffer_use_counter = cmd_buf->use_counter;
ctx->compute_ctx.reset();
}
@@ -14880,9 +14909,10 @@ static void ggml_backend_vk_event_wait(ggml_backend_t backend, ggml_backend_even
vk_context compute_ctx = ggml_vk_get_compute_ctx(ctx);
ggml_vk_wait_events(compute_ctx, {vkev->event});
ggml_vk_ctx_end(compute_ctx);
ctx->compute_ctx.reset();
if (vkev->has_event) {
// Wait for latest event
ggml_vk_wait_events(compute_ctx, { vkev->event });
}
}
// TODO: enable async and synchronize
@@ -15672,10 +15702,13 @@ static ggml_backend_event_t ggml_backend_vk_device_event_new(ggml_backend_dev_t
return nullptr;
}
// The event/fence is expected to initially be in the signaled state.
vkev->event = device->device.createEvent({});
vkev->fence = device->device.createFence({vk::FenceCreateFlagBits::eSignaled});
device->device.setEvent(vkev->event);
// No events initially, they get created on demand
vkev->has_event = false;
vk::SemaphoreTypeCreateInfo tci{ vk::SemaphoreType::eTimeline, 0 };
vk::SemaphoreCreateInfo ci{};
ci.setPNext(&tci);
vkev->tl_semaphore = { device->device.createSemaphore(ci), 0 };
return new ggml_backend_event {
/* .device = */ dev,
@@ -15689,8 +15722,16 @@ static void ggml_backend_vk_device_event_free(ggml_backend_dev_t dev, ggml_backe
vk_event *vkev = (vk_event *)event->context;
device->device.destroyFence(vkev->fence);
device->device.destroyEvent(vkev->event);
device->device.destroySemaphore(vkev->tl_semaphore.s);
for (auto& event : vkev->events_free) {
device->device.destroyEvent(event);
}
for (auto& event : vkev->events_submitted) {
device->device.destroyEvent(event);
}
if (vkev->has_event) {
device->device.destroyEvent(vkev->event);
}
delete vkev;
delete event;
}
@@ -15701,10 +15742,29 @@ static void ggml_backend_vk_device_event_synchronize(ggml_backend_dev_t dev, ggm
auto device = ggml_vk_get_device(ctx->device);
vk_event *vkev = (vk_event *)event->context;
VK_CHECK(device->device.waitForFences({ vkev->fence }, true, UINT64_MAX), "event_synchronize");
// Finished using current command buffer so we flag for reuse
if (vkev->cmd_buffer) {
vkev->cmd_buffer->in_use = false;
// Only do something if the event has actually been used
if (vkev->has_event) {
vk::Semaphore sem = vkev->tl_semaphore.s;
uint64_t val = vkev->tl_semaphore.value;
vk::SemaphoreWaitInfo swi{vk::SemaphoreWaitFlags{}, sem, val};
VK_CHECK(device->device.waitSemaphores(swi, UINT64_MAX), "event_synchronize");
// Reset and move submitted events
for (auto& event : vkev->events_submitted) {
device->device.resetEvent(event);
}
vkev->events_free.insert(vkev->events_free.end(), vkev->events_submitted.begin(), vkev->events_submitted.end());
vkev->events_submitted.clear();
// Finished using current command buffer so we flag for reuse
if (vkev->cmd_buffer) {
// Only flag for reuse if it hasn't been reused already
if (vkev->cmd_buffer_use_counter == vkev->cmd_buffer->use_counter) {
vkev->cmd_buffer->in_use = false;
vkev->cmd_buffer->buf.reset();
}
vkev->cmd_buffer = nullptr;
}
}
}

View File

@@ -1294,6 +1294,12 @@ size_t ggml_row_size(enum ggml_type type, int64_t ne) {
return ggml_type_size(type)*ne/ggml_blck_size(type);
}
double ggml_type_sizef(enum ggml_type type) {
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);
return ((double)(type_traits[type].type_size))/type_traits[type].blck_size;
}
const char * ggml_type_name(enum ggml_type type) {
assert(type >= 0);
assert(type < GGML_TYPE_COUNT);

View File

@@ -21,9 +21,7 @@ struct llama_sampler_deleter {
};
struct llama_adapter_lora_deleter {
void operator()(llama_adapter_lora *) {
// llama_adapter_lora_free is deprecated
}
void operator()(llama_adapter_lora * adapter) { llama_adapter_lora_free(adapter); }
};
typedef std::unique_ptr<llama_model, llama_model_deleter> llama_model_ptr;

View File

@@ -636,7 +636,6 @@ extern "C" {
// Load a LoRA adapter from file
// The adapter is valid as long as the associated model is not freed
// All adapters must be loaded before context creation
LLAMA_API struct llama_adapter_lora * llama_adapter_lora_init(
struct llama_model * model,
const char * path_lora);
@@ -660,9 +659,8 @@ extern "C" {
LLAMA_API int32_t llama_adapter_meta_val_str_by_index(const struct llama_adapter_lora * adapter, int32_t i, char * buf, size_t buf_size);
// Manually free a LoRA adapter
// NOTE: loaded adapters will be free when the associated model is deleted
LLAMA_API DEPRECATED(void llama_adapter_lora_free(struct llama_adapter_lora * adapter),
"adapters are now freed together with the associated model");
// NOTE: loaded adapters that are not manually freed will be freed when the associated model is deleted
LLAMA_API void llama_adapter_lora_free(struct llama_adapter_lora * adapter);
// Get the invocation tokens if the current lora is an alora
LLAMA_API uint64_t llama_adapter_get_alora_n_invocation_tokens(const struct llama_adapter_lora * adapter);

View File

@@ -1 +1 @@
553552e1d88be2b214b85e5159eedd39a63e2c34
c044a8eeae2591faa0950c8b5e514cbc4bbfc4ca

View File

@@ -418,7 +418,7 @@ static void llama_adapter_lora_init_impl(llama_model & model, const char * path_
}
llama_adapter_lora * llama_adapter_lora_init(llama_model * model, const char * path_lora) {
llama_adapter_lora * adapter = new llama_adapter_lora();
llama_adapter_lora * adapter = new llama_adapter_lora(model);
try {
llama_adapter_lora_init_impl(*model, path_lora, *adapter);
@@ -471,8 +471,17 @@ int32_t llama_adapter_meta_val_str_by_index(const llama_adapter_lora * adapter,
return snprintf(buf, buf_size, "%s", it->second.c_str());
}
void llama_adapter_lora_free(llama_adapter_lora *) {
// deprecated: adapters are freed by llama_model's destructor
void llama_adapter_lora_free(llama_adapter_lora * adapter) {
if (adapter == nullptr) {
return;
}
if (adapter->model != nullptr) {
adapter->model->loras.erase(adapter);
adapter->model = nullptr;
}
delete adapter;
}
uint64_t llama_adapter_get_alora_n_invocation_tokens(const struct llama_adapter_lora * adapter) {

View File

@@ -61,6 +61,8 @@ struct llama_adapter_lora_weight {
};
struct llama_adapter_lora {
llama_model * model = nullptr;
// map tensor name to lora_a_b
std::unordered_map<std::string, llama_adapter_lora_weight> ab_map;
@@ -75,7 +77,7 @@ struct llama_adapter_lora {
// activated lora (aLoRA)
std::vector<llama_token> alora_invocation_tokens;
llama_adapter_lora() = default;
explicit llama_adapter_lora(llama_model * model) : model(model) {}
~llama_adapter_lora() = default;
llama_adapter_lora_weight * get_weight(ggml_tensor * w);

View File

@@ -1165,9 +1165,11 @@ bool llama_context::set_adapter_cvec(
int32_t il_end) {
LLAMA_LOG_DEBUG("%s: il_start = %d, il_end = %d\n", __func__, il_start, il_end);
// TODO: should we reserve?
bool res = cvec->apply(model, data, len, n_embd, il_start, il_end);
return cvec->apply(model, data, len, n_embd, il_start, il_end);
sched_need_reserve = true;
return res;
}
llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, llm_graph_type gtype, llama_memory_context_i * mctx, ggml_status & ret) {

View File

@@ -89,6 +89,7 @@ struct test_context {
cparams.n_batch = 512;
cparams.samplers = configs.data();
cparams.n_samplers = configs.size();
cparams.kv_unified = true;
// If n_seq_max is not specified, calculate it from configs
if (n_seq_max < 0) {

View File

@@ -2448,7 +2448,7 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
// Analysis channel (reasoning) with final channel (content)
tst.test(
"<|channel|>analysis<|message|>I'm\nthinking<|end|>\n<|channel|>final<|message|>Hello, world!\nWhat's "
"<|channel|>analysis<|message|>I'm\nthinking<|end|><|start|>assistant<|channel|>final<|message|>Hello, world!\nWhat's "
"up?")
.reasoning_format(COMMON_REASONING_FORMAT_AUTO)
.expect(message_assist_thoughts)
@@ -2461,15 +2461,6 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
.expect_reasoning("I'm\nthinking")
.run();
// Reasoning format none - reasoning stays in content
tst.test(
"<|channel|>analysis<|message|>I'm\nthinking<|end|>\n<|channel|>final<|message|>Hello, world!\nWhat's "
"up?")
.reasoning_format(COMMON_REASONING_FORMAT_NONE)
.expect_content(
"<|channel|>analysis<|message|>I'm\nthinking<|end|>Hello, world!\nWhat's up?")
.run();
// Tool call with recipient in role header: " to=functions.NAME<|channel|>analysis<|message|>JSON"
tst.test(" to=functions.special_function<|channel|>analysis<|message|>{\"arg1\": 1}")
.tools({ special_function_tool })
@@ -2496,37 +2487,16 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
// Tool call with reasoning + content (analysis first, then tool call)
tst.test(
"<|channel|>analysis<|message|>I'm\nthinking<|end|>\n"
"<|channel|>analysis<|message|>I'm\nthinking<|end|>"
"<|start|>assistant to=functions.special_function<|channel|>analysis<|message|>{\"arg1\": 1}")
.reasoning_format(COMMON_REASONING_FORMAT_AUTO)
.tools({ special_function_tool })
.expect(message_assist_call_thoughts)
.run();
// Tool calling with extra channel before
// Complex tool calling
tst.test(
"<|channel|>analysis<|message|>I'm\nthinking<|end|><|start|>assistant<|channel|>commentary"
" to=functions.special_function <|message|>{\"arg1\": 1}")
.reasoning_format(COMMON_REASONING_FORMAT_AUTO)
.tools({ special_function_tool })
.expect(message_assist_call_thoughts)
.run();
// Reasoning after final channel
// Tool calling after final channel
tst.test(
"<|channel|>final<|message|><|end|>"
"<|start|>assistant<|channel|>analysis<|message|>Thinking about edit..."
)
.reasoning_format(COMMON_REASONING_FORMAT_AUTO)
.expect_reasoning("Thinking about edit...")
.expect_content("")
.run();
// Tool calling after final channel
tst.test(
"<|channel|>final<|message|><|end|>"
"<|start|>assistant<|channel|>analysis<|message|>Thinking about edit...<|end|>"
"<|channel|>analysis<|message|>Thinking about edit...<|end|>"
"<|start|>assistant<|channel|>commentary to=functions.edit <|constrain|>json"
"<|message|>{\"oldString\": \"if (part < railCount - 1) {\", \"newString\": \"if (part < 4) {\", \"replaceAll\": false}"
)
@@ -2561,19 +2531,17 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
})
.run();
// Parallel tool calls
// Structured output
tst.test(
" to=functions.special_function<|channel|>analysis<|message|>{\"arg1\": 1}\n"
"<|start|>assistant to=functions.special_function_with_opt<|channel|>analysis<|message|>{\"arg1\": 1, "
"\"arg2\": 2}")
.parallel_tool_calls(true)
.tools({
special_function_tool, special_function_tool_with_optional_param
})
.expect_tool_calls({
{ "special_function", R"({"arg1": 1})", {} },
{ "special_function_with_opt", R"({"arg1": 1, "arg2": 2})", {} },
})
"<|channel|>analysis<|message|>I need to output the invoice details in JSON<|end|>"
"<|start|>assistant<|channel|>final <|constrain|>json"
"<|message|>"
R"({"amount": 123.45, "date": "2025-12-03"})"
)
.reasoning_format(COMMON_REASONING_FORMAT_AUTO)
.json_schema(invoice_schema)
.expect_reasoning("I need to output the invoice details in JSON")
.expect_content(R"({"amount": 123.45, "date": "2025-12-03"})")
.run();
}

View File

@@ -1897,8 +1897,9 @@ import sys
from datetime import datetime
from jinja2.sandbox import SandboxedEnvironment
tmpl = json.loads(sys.argv[1])
vars_json = json.loads(sys.argv[2])
merged_input = json.loads(sys.stdin.buffer.read().decode("utf-8"))
tmpl = merged_input["tmpl"]
vars_json = merged_input["vars"]
env = SandboxedEnvironment(
trim_blocks=True,
@@ -1921,8 +1922,9 @@ sys.stdout.buffer.write(result.encode())
static void test_template_py(testing & t, const std::string & name, const std::string & tmpl, const json & vars, const std::string & expect) {
t.test(name, [&tmpl, &vars, &expect](testing & t) {
// Prepare arguments
std::string tmpl_json = json(tmpl).dump();
std::string vars_json = vars.dump();
json merged;
merged["tmpl"] = json(tmpl);
merged["vars"] = vars;
#ifdef _WIN32
const char * python_executable = "python.exe";
@@ -1930,7 +1932,7 @@ static void test_template_py(testing & t, const std::string & name, const std::s
const char * python_executable = "python3";
#endif
const char * command_line[] = {python_executable, "-c", py_script.c_str(), tmpl_json.c_str(), vars_json.c_str(), NULL};
const char * command_line[] = {python_executable, "-c", py_script.c_str(), NULL};
struct subprocess_s subprocess;
int options = subprocess_option_combined_stdout_stderr
@@ -1944,6 +1946,20 @@ static void test_template_py(testing & t, const std::string & name, const std::s
t.assert_true("subprocess creation", false);
return;
}
FILE * p_stdin = subprocess_stdin(&subprocess);
// Write input
std::string input = merged.dump();
auto written = fwrite(input.c_str(), 1, input.size(), p_stdin);
if (written != input.size()) {
t.log("Failed to write complete input to subprocess stdin");
t.assert_true("subprocess stdin write", false);
subprocess_destroy(&subprocess);
return;
}
fflush(p_stdin);
fclose(p_stdin); // Close stdin to signal EOF to the Python process
subprocess.stdin_file = nullptr;
// Read output
std::string output;

View File

@@ -215,6 +215,7 @@ struct cli_context {
inputs.parallel_tool_calls = false;
inputs.add_generation_prompt = true;
inputs.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
inputs.force_pure_content = chat_params.force_pure_content;
inputs.enable_thinking = chat_params.enable_thinking ? common_chat_templates_support_enable_thinking(chat_params.tmpls.get()) : false;
// Apply chat template to the list of messages

View File

@@ -308,6 +308,7 @@ int main(int argc, char ** argv) {
inputs.use_jinja = g_params->use_jinja;
inputs.messages = chat_msgs;
inputs.add_generation_prompt = !params.prompt.empty();
inputs.force_pure_content = params.force_pure_content_parser;
prompt = common_chat_templates_apply(chat_templates.get(), inputs).prompt;
}

Binary file not shown.

View File

@@ -1065,6 +1065,7 @@ json oaicompat_chat_params_parse(
inputs.add_generation_prompt = true;
}
inputs.force_pure_content = opt.force_pure_content;
// Apply chat template to the list of messages
auto chat_params = common_chat_templates_apply(opt.tmpls.get(), inputs);

View File

@@ -290,6 +290,7 @@ struct server_chat_params {
int reasoning_budget = -1;
std::string reasoning_budget_message;
std::string media_path;
bool force_pure_content = false;
};
// used by /completions endpoint

View File

@@ -911,6 +911,7 @@ private:
/* reasoning_budget */ params_base.reasoning_budget,
/* reasoning_budget_msg */ params_base.reasoning_budget_message,
/* media_path */ params_base.media_path,
/* force_pure_content */ params_base.force_pure_content_parser
};
}
@@ -2402,11 +2403,11 @@ private:
}
{
// erase any checkpoints with pos_min > pos_min_thold
// erase any checkpoints with pos_max > pos_next
for (auto it = slot.prompt.checkpoints.begin(); it != slot.prompt.checkpoints.end();) {
const auto & cur = *it;
if (cur.pos_min > pos_min_thold) {
SLT_WRN(slot, "erased invalidated context checkpoint (pos_min = %d, pos_max = %d, n_tokens = %" PRId64 ", n_swa = %d, size = %.3f MiB)\n", cur.pos_min, cur.pos_max, cur.n_tokens, n_swa, (float) cur.data.size() / 1024 / 1024);
if (cur.pos_max > pos_next) {
SLT_WRN(slot, "erased invalidated context checkpoint (pos_min = %d, pos_max = %d, n_tokens = %" PRId64 ", n_swa = %d, pos_next = %d, size = %.3f MiB)\n", cur.pos_min, cur.pos_max, cur.n_tokens, n_swa, pos_next, (float) cur.data.size() / 1024 / 1024);
it = slot.prompt.checkpoints.erase(it);
} else {
++it;

View File

@@ -148,7 +148,7 @@
</Tooltip.Trigger>
<Tooltip.Content side="right">
<p>Images require vision models to be processed</p>
<p>Image processing requires a vision model</p>
</Tooltip.Content>
</Tooltip.Root>
{/if}
@@ -173,7 +173,7 @@
</Tooltip.Trigger>
<Tooltip.Content side="right">
<p>Audio files require audio models to be processed</p>
<p>Audio files processing requires an audio model</p>
</Tooltip.Content>
</Tooltip.Root>
{/if}

View File

@@ -57,7 +57,6 @@
// Handle ?q= parameter - create new conversation and send message
if (qParam !== null) {
await conversationsStore.createConversation();
await chatStore.sendMessage(qParam);
clearUrlParams();
} else if (modelParam || newChatParam === 'true') {
clearUrlParams();