Compare commits

..

72 Commits
b6183 ... b6255

Author SHA1 Message Date
LaffeyNyaa
21dc4ddaf2 chat : fix debug build assertion in trim function (#15520) 2025-08-23 10:38:30 +02:00
Jeff Bolz
289bf4113e vulkan: Rewrite synchronization to allow some overlap between nodes (#15489)
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.

The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether they were in use and need to be synced. This should be
checked before they are written to, and set to true after they are done being
consumed.
2025-08-23 09:33:36 +02:00
R0CKSTAR
b55f06e1aa vulkan.Dockerfile: install vulkan SDK using tarball (#15282)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-23 08:58:57 +02:00
Acly
0a9b43e507 vulkan : support ggml_mean (#15393)
* vulkan : support ggml_mean

* vulkan : support sum, sum_rows and mean with non-contiguous tensors

* vulkan : fix subbuffer size not accounting for misalign offset

* tests : add backend-op tests for non-contiguous sum_rows

* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support

* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
2025-08-23 08:35:21 +02:00
Jeff Bolz
330c3d2d21 vulkan: optimize mul_mat_id loading row ids into shared memory (#15427)
- Spread the work across the whole workgroup. Using more threads seems to
far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
2025-08-23 08:31:54 +02:00
Johannes Gäßler
e92734d51b test-opt: allow slight inprecision (#15503) 2025-08-22 23:47:01 +02:00
Reese Levine
45363632cb ggml WebGPU: add support for quantization types (#15440)
* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Work on templating for different types in shaders

* Work on shader type generation

* Working q4_0 mul_mat and some templating for different types

* Add q4_0_f16 matmul and fix device init

* Add matmul support for basic quantization types

* Add q2_k and q3_k quantization

* Add rest of k-quants

* Get firt i-quant working

* Closer to supporting all i-quants

* Support rest of i-quants

* Cleanup code

* Fix python formatting

* debug

* Bugfix for memset

* Add padding to end of buffers on creation

* Simplify bit-shifting

* Update usage of StringView
2025-08-22 11:28:03 -07:00
Aldehir Rojas
32732f2459 model : gpt-oss add response_format support (#15494) 2025-08-22 11:04:08 -05:00
rmatif
92f7f0a53c ggml: add conv3d op (#15182)
* add conv3d

* bump GGML_OP_COUNT
2025-08-22 15:33:15 +02:00
Yavor Ivanov
b1ab91821f cuda : add Pad Reflect 1D support (#14659)
* Add Pad Reflect 1D CUDA support

* Update ggml/src/ggml-cuda/pad_reflect_1d.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-22 13:06:29 +02:00
Georgi Gerganov
9ebebef62f llama : remove KV cache defragmentation logic (#15473)
ggml-ci
2025-08-22 12:22:13 +03:00
Aaron Teo
ad5c975c2d ggml-cpu: Support Q5_0 and Q5_1 on s390x (#15486)
* ggml-cpu: initial q5_0 impl for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: updated q5_0 code for better performance

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: use optimised hsum for better performance

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: introduce q5_1 simd + refactor q5_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix incorrect return type vec_hsum

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: refactor q5_1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: q5_1 update loop unroll to 4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update q5_0 unroll to 4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update build-s390x docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: update unused variables q5_0

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: update the last update date

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-08-22 16:11:04 +08:00
65a
4afb0a746f server : Support multimodal completion and embeddings prompts in JSON format (#15108)
- Use server_tokens in more places in server and util.cpp
- Convert most functions that used llama_tokens to server_tokens
- Modify input tokenizer to handle JSON objects as subprompts
- Break out MTMD prompt parsing into utility function
- Support JSON objects with multimodal_data arrays for MTMD prompts along with other existing types
- Add capability to model endpoint to indicate if client can send multimodal data
- Add tests.
2025-08-22 10:10:14 +02:00
Tarek Dakhran
e288693669 readme : model : mtdm : lfm2 improvements (#15476)
* Support untied embeddings

* Increase number of image tokens to 1024

* Add LFM2-VL to readme

* Actually use untied embeddings
2025-08-22 09:29:08 +02:00
Chenguang Li
a0f98dd604 CANN: Optimize RMS_NORM using cache (#15419)
* [CANN] Optimize RMS_NORM using cache

Signed-off-by: noemotiovon <757486878@qq.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

* fix review comment

Signed-off-by: noemotiovon <757486878@qq.com>

* codestyle adjustment

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-08-22 14:12:07 +08:00
Diego Devesa
54a241f505 sched : fix possible use of wrong ids tensor when offloading moe prompt processing (#15488) 2025-08-21 23:09:32 +02:00
Georgi Gerganov
cd36b5e5c7 llama : remove deprecated llama_kv_self API (#15472)
ggml-ci
2025-08-21 19:13:45 +03:00
Georgi Gerganov
3f196be84b graph : remove build_attn_with_sinks overload (#15469)
ggml-ci
2025-08-21 18:44:45 +03:00
Acly
97ae5961a4 vulkan : support conv_2d_dw with f16 weights (#15392) 2025-08-21 17:01:51 +02:00
Dong Won Kim
20c2dac8c6 vulkan: add exp operation (#15456)
Co-authored-by: aeseulgi <kim2h7903@gmail.com>
2025-08-21 17:00:16 +02:00
Jeff Bolz
96452a3fa4 vulkan: Reuse conversion results in prealloc_y (#15410)
* vulkan: Reuse conversion results in prealloc_y

Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.

* don't use shared pointer for prealloc_y_last_pipeline_used
2025-08-21 16:55:00 +02:00
Jie Fu (傅杰)
9ad5e60dba examples : fix some typos in examples/model-conversion/README.md (#15477)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-08-21 16:53:13 +02:00
Georgi Gerganov
715a6db02c kv-cache : drop the "unified" prefix (#15467)
* kv-cache : drop the "unified" prefix

ggml-ci

* cont : fix comment [no ci]
2025-08-21 17:00:33 +03:00
Jie Fu (傅杰)
ad294df03f examples : install torch-cpu for model conversion tool/example (#15475)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-08-21 15:42:34 +02:00
Ali Tariq
029bb39eb1 ci : enable RVV1.0 native build (#15386)
* Changed the CI file to hw

* Changed the CI file to hw

* Added to sudoers for apt

* Removed the clone command and used checkout

* Added libcurl

* Added gcc-14

* Checking gcc --version

* added gcc-14 symlink

* added CC and C++ variables

* Added the gguf weight

* Changed the weights path

* Added system specification

* Removed white spaces

* ci: Replace Jenkins riscv native build Cloud-V pipeline with GitHub Actions workflow

Removed the legacy .devops/cloud-v-pipeline Jenkins CI configuration and introduced .github/workflows/build-riscv-native.yml for native RISC-V builds using GitHub Actions.

* removed trailing whitespaces

* Added the trigger at PR creation

* Corrected OS name

* Added ccache as setup package

* Added ccache for self-hosted runner

* Added directory for ccache size storage

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Changed the build command and added ccache debug log

* Added the base dir for the ccache

* Re-trigger CI

* Cleanup and refactored ccache steps

* Cleanup and refactored ccache steps

---------

Co-authored-by: Akif Ejaz <akifejaz40@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-21 14:52:16 +02:00
Georgi Gerganov
30649cab65 ci : continue file download with wget (#15471)
ggml-ci
2025-08-21 13:42:55 +03:00
Daniel Bevenius
2758fa10da examples : add model conversion tool/example (#15455)
* examples : add model conversion tool/example

This commit adds an "example/tool" that is intended to help in the
process of converting models to GGUF. Currently it supports normal
causal models and embedding models. The readme contains instructions and
command to guide through the process.

The motivation for this to have a structured and repeatable process for
model conversions and hopefully with time improve upon it to make the
process easier and more reliable. We have started to use this for new
model conversions internally and will continue doing so and improve it
as we go along. Perhaps with time this should be placed in a different
directory than the examples directory, but for now it seems like a good
place to keep it while we are still developing it.

* squash! examples : add model conversion tool/example

Remove dependency on scikit-learn in model conversion example.

* squash! examples : add model conversion tool/example

Update transformer dep to use non-dev version. And also import
`AutoModelForCausalLM` instead of `AutoModel` to ensure compatibility
with the latest version.

* squash! examples : add model conversion tool/example

Remove the logits requirements file from the all requirements file.
2025-08-21 12:16:54 +02:00
Michael Giba
b108e42904 ci : fix -Werror=return-type in clip.cpp so ci/run.sh can run without issue (#15221)
* Fix -Werror=return-type so ci/run.sh can run

* Update tools/mtmd/clip.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Remove false now that we have abort

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-08-21 12:06:46 +02:00
Copilot
245be739df ci : add copilot-instructions.md (#15286)
* Initial plan

* Initialize copilot instructions exploration

* Add comprehensive .github/copilot-instructions.md file

* Update Python environment and tools directory documentation

- Add instructions for using .venv Python environment
- Include flake8 and pyright linting tools from virtual environment
- Add tools/ as core directory in project layout
- Reference existing configuration files (.flake8, pyrightconfig.json)

* add more python dependencies to .venv

* Update copilot instructions: add backend hardware note and server testing

* Apply suggestions from code review

* Apply suggestions from code review

* Replace clang-format with git clang-format to format only changed code

* Minor formatting improvements: remove extra blank line and add trailing newline

* try installing git-clang-format

* try just clang-format

* Remove --binary flag from git clang-format and add git-clang-format installation to CI

* download 18.x release

* typo--

* remove --binary flag

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-21 11:47:52 +02:00
Julien Denize
b2caf67db1 convert : make Mistral community chat templates optional via parameter (#15420)
* Make Mistral community chat templates optional

* Change the flag arg to disable instead of enable community chat templates

* Improve error message

* Improve help message

* Tone down the logger messages
2025-08-21 11:19:50 +02:00
Jie Fu (傅杰)
2f3dbffb17 common : fix incorrect print of non-ascii characters in the logging (#15466)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-08-21 11:54:34 +03:00
Xuan-Son Nguyen
945e1f12a6 ggml : fix condition of im2col on Metal backend (#15460) 2025-08-21 08:32:26 +03:00
stduhpf
1b0db8f6e0 server : fix webui (#15462)
* Fix webui crash after streaming

* build webui
2025-08-21 08:19:22 +03:00
Daniel Bevenius
29f538ac63 examples : remove references to make in examples [no ci] (#15457)
This commit removes references to `make` in the examples, as the build
system has been updated to use CMake directly and using `make` will now
generate an error since Commit 37f10f955f
("make : remove make in favor of CMake (#15449)").
2025-08-21 06:12:28 +02:00
R0CKSTAR
8ad038c0fd musa: add GGML_UNUSED_VARS (#15446)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-21 11:06:05 +08:00
Diego Devesa
5682a3745f sched : copy only the used experts when offloading prompt processing (#15346) 2025-08-21 01:35:28 +02:00
teo
1bc664a26a server: fix OpenAI API compatibility for usage statistics in chat streams (#15444) 2025-08-21 00:10:08 +02:00
Johannes Gäßler
13aeb7aef2 CUDA: refactor FA support/selection code (#15454) 2025-08-20 23:14:14 +02:00
Johannes Gäßler
7a6e91ad26 CUDA: replace GGML_CUDA_F16 with CUDA arch checks (#15433) 2025-08-20 16:58:49 +02:00
Jeff Bolz
fec9519802 vulkan: shorten pipeline name strings (#15431)
These detailed strings were causing increased build time on gcc.
2025-08-20 16:33:14 +02:00
Daniel Bevenius
657b8a77bd chat: handle gpt-oss return/end token inconsistency (#15421)
This commit addresses an inconsistency during inference by adding a new
member to the `templates_params` struct to indicate whether the chat is
in inference mode. This allows the gpt-oss specific function
`common_chat_params_init_gpt_oss` to check this flag and the
`add_generation_prompt` flag to determine if it should replace the
`<|return|>` token with the `<|end|>` token in the prompt.

The motivation for this change is to ensure that the formatted prompt of
past messages in `common_chat_format_single` matches the output of the
formatted new message. The issue is that the gpt-oss template returns
different end tags: `<|return|>` when `add_generation_prompt` is false,
and `<|end|>` when `add_generation_prompt` is true. This causes the
substring function to start at an incorrect position, resulting in
tokenization starting with 'tart|>' instead of '<|start|>'.

Resolves: https://github.com/ggml-org/llama.cpp/issues/15417
2025-08-20 14:26:01 +02:00
Jie Fu (傅杰)
ec5ab1a36c common : fix context shift help message (#15448)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-08-20 13:33:30 +03:00
xiaobing318
1a99c2d948 cmake : fix target include directories (#15450)
* Update docker.yml

修改docker.yml文件中的内容使其停止周期性的运行该workflow,如果想要运行该workflow可以手动启动

* feat:Modify the header file include path

1. There's no llava directory in the tools directory.
2. Because the command `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically include the `mtmd` directory as a search path for header files. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava`` or use `target_include_directories(${TARGET} PRIVATE ../mtmd`` to explicitly require the `llama-server` target to use header files from `mtmd`.

* Restore the docker.yml file
2025-08-20 13:32:05 +03:00
Daniel Bevenius
37f10f955f make : remove make in favor of CMake (#15449)
This commit removes the content from the Makefile and updates the
current deprecation message to information that `make` has been
replaced by CMake instead.

The message when `make` is invoked will now be the following:
```console
$ make
Makefile:6: *** Build system changed:
 The Makefile build has been replaced by CMake.

 For build instructions see:
 https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

.  Stop.
```

The motivation for this is that many, if not all targets fail to build
now, after changes to the system, and `make` has also been deprected for
some time now.
2025-08-20 13:31:16 +03:00
Georgi Gerganov
2f37014073 lookahead : add sample command to readme (#15447)
* lookahead : add sample command to readme

* cont : build-agnostic command
2025-08-20 13:30:46 +03:00
R0CKSTAR
a094f38143 musa: fix build warnings (#15258)
* musa: fix build warnings

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-20 10:17:37 +08:00
lhez
fb22dd07a6 opencl: mark argsort unsupported if cols exceed workgroup limit (#15375) 2025-08-19 11:25:51 -07:00
Georgi Gerganov
9ef6b0b835 model : add gpt-oss type strings (#15424) 2025-08-19 19:58:28 +03:00
Gian-Carlo Pascutto
1e19f5d462 common : Add top-nsigma sampler to help globally (#15428)
Fixes #15423.
2025-08-19 19:58:14 +03:00
Georgi Gerganov
d2fcd91cf9 server : disable context shift by default (#15416)
* server : disable context shift by default

ggml-ci

* server : make scopr of test parameters local
2025-08-19 16:46:37 +03:00
SHUAI YANG
a6d3cfe7fa CANN: optimize rope operator (#15335)
* optimize rope ops

* amendment

* delete trailing whitespace

* change the variable name
2025-08-19 21:28:22 +08:00
R0CKSTAR
67f09a3a27 musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (#15413)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-19 12:33:47 +02:00
Marvin Gießing
6424594c56 ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (#15385)
* Added VSX intrinsics for Power9+ systems

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Manual unrolling for minor perf improvement

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Update ggml/src/ggml-cpu/arch/powerpc/quants.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: mgiessing <marvin.giessing@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-19 11:54:31 +03:00
Xuan-Son Nguyen
e9288e8869 chat : clarify the meaning of reasoning_format (#15408)
* chat : clarify the meaning of reasoning_format

* add link to this PR
2025-08-19 10:29:36 +02:00
Georgi Gerganov
9d262f4bad server : remove swa_full warning (#15399) 2025-08-19 08:45:26 +03:00
Georgi Gerganov
f0d3c7405c batched-bench : use rand tokens (#15398) 2025-08-19 08:45:12 +03:00
Xuan-Son Nguyen
f08c4c0d8d mtmd : clean up clip_n_output_tokens (#15391) 2025-08-18 22:53:52 +02:00
Georgi Gerganov
6d7f1117e3 codeowners : remove mmv.* 2025-08-18 22:06:44 +03:00
Georgi Gerganov
60212f1ead sync : ggml 2025-08-18 22:06:44 +03:00
Georgi Gerganov
f0c541d315 scripts : update sync scripts 2025-08-18 22:06:44 +03:00
Sigbjørn Skjæret
baa9255a45 llama : merge conts and reshapes and remove unnecessary cont (#15380)
* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes
2025-08-18 19:30:17 +02:00
Georgi Gerganov
3007baf201 readme : update hot topics (#15397) 2025-08-18 18:11:44 +03:00
davidef
d1d8241600 server : fix incoming tasks not process in order (#15395) 2025-08-18 17:51:42 +03:00
Dobri Danchev
618575c582 Fix broken build: require updated pip to support --break-system-packages (#15357)
* Revert "devops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)"

This reverts commit e4e915912c.

* devops: Allow pip to modify externally-managed python environment (system installation)

- Updated pip install commands to include the --break-system-packages
  flag, ensuring compatibility when working with system-managed Python
  environments (PEP 668).

- Note: The --break-system-packages option was introduced in 2023.
  Ensure pip is updated to a recent version before using this flag.

fixes [#15004](https://github.com/danchev/llama.cpp/issues/15004)
2025-08-18 12:50:48 +02:00
compilade
f44f793172 ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)
* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants
2025-08-18 09:23:56 +02:00
Jeff Bolz
ae532eac2c vulkan: disable spirv-opt for bfloat16 shaders (#15352) 2025-08-18 07:56:29 +02:00
Oleksandr Kuvshynov
e5155e6986 server : export max observed n_past value (#15361)
Add tracking for high watermark cache usage and make it available in /metrics endpoint.

Use-case: Tracking largest needed cache usage under realistic workload
to better understand memory requirements and be able to adjust
cache size/quantization for model/cache accordingly.
2025-08-18 00:28:58 +02:00
Jeff Bolz
21c17b5bef vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-17 18:08:57 +02:00
Dong Won Kim
19f4decae0 vulkan: support sqrt (#15370) 2025-08-17 16:03:09 +02:00
Sigbjørn Skjæret
4d196981d4 convert : force patch_embd weights to F16 or F32 to avoid broken GGUFs (#15367)
* force patch_embd weights to f32

* use MmprojModel base tensor_force_quant instead
2025-08-17 14:47:42 +02:00
Sigbjørn Skjæret
b143fbc87a ci : fix hang in windows-hip build/release (#15365)
* fix hang in windows-latest-cmake-hip

* apply fix to release as well
2025-08-17 13:30:23 +02:00
Jeff Bolz
de5627910d vulkan: Optimize argsort (#15354)
- Launch an appropriate number of invocations (next larger power of two).
32 invocations is common and the barrier is much cheaper there.
- Specialize for "needs bounds checking" vs not.
- Make the code less branchy and [[unroll]] the loops. In the final code,
I see no branches inside the main loop (only predicated stores) when
needs_bounds_check is false.
- Always sort ascending, then apply the ascending vs descending option when
doing the final stores to memory.
- Copy the values into shared memory, makes them slightly cheaper to access.
2025-08-17 10:41:45 +02:00
177 changed files with 8298 additions and 4378 deletions

View File

@@ -60,6 +60,7 @@ RUN apt-get update \
git \
python3 \
python3-pip \
&& pip install --upgrade pip setuptools wheel \
&& pip install --break-system-packages -r requirements.txt \
&& apt autoremove -y \
&& apt clean -y \

View File

@@ -2,14 +2,30 @@ ARG UBUNTU_VERSION=24.04
FROM ubuntu:$UBUNTU_VERSION AS build
# Install build tools
RUN apt update && apt install -y git build-essential cmake wget
# Ref: https://vulkan.lunarg.com/doc/sdk/latest/linux/getting_started.html
# Install Vulkan SDK and cURL
RUN wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add - && \
wget -qO /etc/apt/sources.list.d/lunarg-vulkan-noble.list https://packages.lunarg.com/vulkan/lunarg-vulkan-noble.list && \
apt update -y && \
apt-get install -y vulkan-sdk libcurl4-openssl-dev curl
# Install build tools
RUN apt update && apt install -y git build-essential cmake wget xz-utils
# Install Vulkan SDK
ARG VULKAN_VERSION=1.4.321.1
RUN ARCH=$(uname -m) && \
wget -qO /tmp/vulkan-sdk.tar.xz https://sdk.lunarg.com/sdk/download/${VULKAN_VERSION}/linux/vulkan-sdk-linux-${ARCH}-${VULKAN_VERSION}.tar.xz && \
mkdir -p /opt/vulkan && \
tar -xf /tmp/vulkan-sdk.tar.xz -C /tmp --strip-components=1 && \
mv /tmp/${ARCH}/* /opt/vulkan/ && \
rm -rf /tmp/*
# Install cURL and Vulkan SDK dependencies
RUN apt install -y libcurl4-openssl-dev curl \
libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev
# Set environment variables
ENV VULKAN_SDK=/opt/vulkan
ENV PATH=$VULKAN_SDK/bin:$PATH
ENV LD_LIBRARY_PATH=$VULKAN_SDK/lib:$LD_LIBRARY_PATH
ENV CMAKE_PREFIX_PATH=$VULKAN_SDK:$CMAKE_PREFIX_PATH
ENV PKG_CONFIG_PATH=$VULKAN_SDK/lib/pkgconfig:$PKG_CONFIG_PATH
# Build it
WORKDIR /app

262
.github/copilot-instructions.md vendored Normal file
View File

@@ -0,0 +1,262 @@
# Copilot Instructions for llama.cpp
## Repository Overview
llama.cpp is a large-scale C/C++ project for efficient LLM (Large Language Model) inference with minimal setup and dependencies. The project enables running language models on diverse hardware with state-of-the-art performance.
**Key Facts:**
- **Primary language**: C/C++ with Python utility scripts
- **Size**: ~200k+ lines of code across 1000+ files
- **Architecture**: Modular design with main library (`libllama`) and 40+ executable tools/examples
- **Core dependency**: ggml tensor library (vendored in `ggml/` directory)
- **Backends supported**: CPU (AVX/NEON optimized), CUDA, Metal, Vulkan, SYCL, ROCm, MUSA
- **License**: MIT
## Build Instructions
### Prerequisites
- CMake 3.14+ (primary build system)
- C++17 compatible compiler (GCC 13.3+, Clang, MSVC)
- Optional: ccache for faster compilation
### Basic Build (CPU-only)
**ALWAYS run these commands in sequence:**
```bash
cmake -B build
cmake --build build --config Release -j $(nproc)
```
**Build time**: ~10 minutes on 4-core system with ccache enabled, ~25 minutes without ccache.
**Important Notes:**
- The Makefile is deprecated - always use CMake
- ccache is automatically detected and used if available
- Built binaries are placed in `build/bin/`
- Parallel builds (`-j`) significantly reduce build time
### Backend-Specific Builds
For CUDA support:
```bash
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
```
For Metal (macOS):
```bash
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j $(nproc)
```
**Important Note**: While all backends can be built as long as the correct requirements for that backend are installed, you will not be able to run them without the correct hardware. The only backend that can be run for testing and validation is the CPU backend.
### Debug Builds
Single-config generators:
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
```
Multi-config generators:
```bash
cmake -B build -G "Xcode"
cmake --build build --config Debug
```
### Common Build Issues
- **Issue**: Network tests fail in isolated environments
**Solution**: Expected behavior - core functionality tests will still pass
## Testing
### Running Tests
```bash
ctest --test-dir build --output-on-failure -j $(nproc)
```
**Test suite**: 38 tests covering tokenizers, grammar parsing, sampling, backends, and integration
**Expected failures**: 2-3 tests may fail if network access is unavailable (they download models)
**Test time**: ~30 seconds for passing tests
### Server Unit Tests
Run server-specific unit tests after building the server:
```bash
# Build the server first
cmake --build build --target llama-server
# Navigate to server tests and run
cd tools/server/tests
source ../../../.venv/bin/activate
./tests.sh
```
**Server test dependencies**: The `.venv` environment includes the required dependencies for server unit tests (pytest, aiohttp, etc.). Tests can be run individually or with various options as documented in `tools/server/tests/README.md`.
### Test Categories
- Tokenizer tests: Various model tokenizers (BERT, GPT-2, LLaMA, etc.)
- Grammar tests: GBNF parsing and validation
- Backend tests: Core ggml operations across different backends
- Integration tests: End-to-end workflows
### Manual Testing Commands
```bash
# Test basic inference
./build/bin/llama-cli --version
# Test model loading (requires model file)
./build/bin/llama-cli -m path/to/model.gguf -p "Hello" -n 10
```
## Code Quality and Linting
### C++ Code Formatting
**ALWAYS format C++ code before committing:**
```bash
git clang-format
```
Configuration is in `.clang-format` with these key rules:
- 4-space indentation
- 120 column limit
- Braces on same line for functions
- Pointer alignment: `void * ptr` (middle)
- Reference alignment: `int & ref` (middle)
### Python Code
**ALWAYS activate the Python environment in `.venv` and use tools from that environment:**
```bash
# Activate virtual environment
source .venv/bin/activate
```
Configuration files:
- `.flake8`: flake8 settings (max-line-length=125, excludes examples/tools)
- `pyrightconfig.json`: pyright type checking configuration
### Pre-commit Hooks
Run before committing:
```bash
pre-commit run --all-files
```
## Continuous Integration
### GitHub Actions Workflows
Key workflows that run on every PR:
- `.github/workflows/build.yml`: Multi-platform builds
- `.github/workflows/server.yml`: Server functionality tests
- `.github/workflows/python-lint.yml`: Python code quality
- `.github/workflows/python-type-check.yml`: Python type checking
### Local CI Validation
**Run full CI locally before submitting PRs:**
```bash
mkdir tmp
# CPU-only build
bash ./ci/run.sh ./tmp/results ./tmp/mnt
```
**CI Runtime**: 30-60 minutes depending on backend configuration
### Triggering CI
Add `ggml-ci` to commit message to trigger heavy CI workloads on the custom CI infrastructure.
## Project Layout and Architecture
### Core Directories
- **`src/`**: Main llama library implementation (`llama.cpp`, `llama-*.cpp`)
- **`include/`**: Public API headers, primarily `include/llama.h`
- **`ggml/`**: Core tensor library (submodule with custom GGML framework)
- **`examples/`**: 30+ example applications and tools
- **`tools/`**: Additional development and utility tools (server benchmarks, tests)
- **`tests/`**: Comprehensive test suite with CTest integration
- **`docs/`**: Detailed documentation (build guides, API docs, etc.)
- **`scripts/`**: Utility scripts for CI, data processing, and automation
- **`common/`**: Shared utility code used across examples
### Key Files
- **`CMakeLists.txt`**: Primary build configuration
- **`include/llama.h`**: Main C API header (~2000 lines)
- **`src/llama.cpp`**: Core library implementation (~8000 lines)
- **`CONTRIBUTING.md`**: Coding guidelines and PR requirements
- **`.clang-format`**: C++ formatting rules
- **`.pre-commit-config.yaml`**: Git hook configuration
### Built Executables (in `build/bin/`)
Primary tools:
- **`llama-cli`**: Main inference tool
- **`llama-server`**: OpenAI-compatible HTTP server
- **`llama-quantize`**: Model quantization utility
- **`llama-perplexity`**: Model evaluation tool
- **`llama-bench`**: Performance benchmarking
- **`llama-convert-llama2c-to-ggml`**: Model conversion utilities
### Configuration Files
- **CMake**: `CMakeLists.txt`, `cmake/` directory
- **Linting**: `.clang-format`, `.clang-tidy`, `.flake8`
- **CI**: `.github/workflows/`, `ci/run.sh`
- **Git**: `.gitignore` (includes build artifacts, models, cache)
### Dependencies
- **System**: OpenMP, libcurl (for model downloading)
- **Optional**: CUDA SDK, Metal framework, Vulkan SDK, Intel oneAPI
- **Bundled**: httplib, json (header-only libraries in vendored form)
## Common Validation Steps
### After Making Changes
1. **Format code**: `git clang-format`
2. **Build**: `cmake --build build --config Release`
3. **Test**: `ctest --test-dir build --output-on-failure`
4. **Server tests** (if modifying server): `cd tools/server/tests && source ../../../.venv/bin/activate && ./tests.sh`
5. **Manual validation**: Test relevant tools in `build/bin/`
### Performance Validation
```bash
# Benchmark inference performance
./build/bin/llama-bench -m model.gguf
# Evaluate model perplexity
./build/bin/llama-perplexity -m model.gguf -f dataset.txt
```
### Backend Validation
```bash
# Test backend operations
./build/bin/test-backend-ops
```
## Environment Setup
### Required Tools
- CMake 3.14+ (install via system package manager)
- Modern C++ compiler with C++17 support
- Git (for submodule management)
- Python 3.9+ with virtual environment (`.venv` is provided)
### Optional but Recommended
- ccache: `apt install ccache` or `brew install ccache`
- clang-format 15+: Usually included with LLVM/Clang installation
- pre-commit: `pip install pre-commit`
### Backend-Specific Requirements
- **CUDA**: NVIDIA CUDA Toolkit 11.2+
- **Metal**: Xcode command line tools (macOS only)
- **Vulkan**: Vulkan SDK
- **SYCL**: Intel oneAPI toolkit
## Important Guidelines
### Code Changes
- **Minimal dependencies**: Avoid adding new external dependencies
- **Cross-platform compatibility**: Test on Linux, macOS, Windows when possible
- **Performance focus**: This is a performance-critical inference library
- **API stability**: Changes to `include/llama.h` require careful consideration
### Git Workflow
- Always create feature branches from `master`
- **Never** commit build artifacts (`build/`, `.ccache/`, `*.o`, `*.gguf`)
- Use descriptive commit messages following project conventions
### Trust These Instructions
Only search for additional information if these instructions are incomplete or found to be incorrect. This document contains validated build and test procedures that work reliably across different environments.

View File

@@ -1,10 +1,11 @@
name: Build on RISCV Linux Machine by Cloud-V
on:
pull_request:
workflow_dispatch:
workflow_call:
jobs:
bianbu-riscv64-native: # Bianbu 2.2
debian-13-riscv64-native: # Bianbu 2.2
runs-on: self-hosted
steps:
@@ -20,24 +21,40 @@ jobs:
build-essential \
gcc-14-riscv64-linux-gnu \
g++-14-riscv64-linux-gnu \
ccache \
cmake
- name: Setup ccache
run: |
mkdir -p $HOME/.ccache
ccache -M 5G -d $HOME/.ccache
export CCACHE_LOGFILE=/home/runneruser/ccache_debug/ccache.log
export CCACHE_DEBUGDIR="/home/runneruser/ccache_debug"
echo "$GITHUB_WORKSPACE"
echo "CCACHE_LOGFILE=$CCACHE_LOGFILE" >> $GITHUB_ENV
echo "CCACHE_DEBUGDIR=$CCACHE_DEBUGDIR" >> $GITHUB_ENV
echo "CCACHE_BASEDIR=$GITHUB_WORKSPACE" >> $GITHUB_ENV
echo "CCACHE_DIR=$HOME/.ccache" >> $GITHUB_ENV
- name: Build
run: |
cmake -B build -DLLAMA_CURL=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=riscv64 \
-DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc-14 \
-DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++-14 \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DCMAKE_FIND_ROOT_PATH=/usr/lib/riscv64-linux-gnu \
-DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
-DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
-DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=BOTH
cmake -B build \
-DLLAMA_CURL=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_OPENMP=OFF \
-DLLAMA_BUILD_EXAMPLES=ON \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_TESTS=OFF \
-DCMAKE_SYSTEM_NAME=Linux \
-DCMAKE_SYSTEM_PROCESSOR=riscv64 \
-DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc-14 \
-DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++-14 \
-DCMAKE_C_COMPILER_LAUNCHER=ccache \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DCMAKE_POSITION_INDEPENDENT_CODE=ON \
-DCMAKE_FIND_ROOT_PATH=/usr/lib/riscv64-linux-gnu \
-DCMAKE_FIND_ROOT_PATH_MODE_PROGRAM=NEVER \
-DCMAKE_FIND_ROOT_PATH_MODE_LIBRARY=ONLY \
-DCMAKE_FIND_ROOT_PATH_MODE_INCLUDE=BOTH
cmake --build build --config Release -j $(nproc)

View File

@@ -1070,7 +1070,8 @@ jobs:
write-host "Downloading AMD HIP SDK Installer"
Invoke-WebRequest -Uri "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q3-WinSvr2022-For-HIP.exe" -OutFile "${env:RUNNER_TEMP}\rocm-install.exe"
write-host "Installing AMD HIP SDK"
Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -Wait
$proc = Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -PassThru
$proc.WaitForExit(600000)
write-host "Completed AMD HIP SDK installation"
- name: Verify ROCm

View File

@@ -39,6 +39,10 @@ jobs:
run: |
sudo apt-get update
sudo apt-get install build-essential libcurl4-openssl-dev
# Install git-clang-format script for formatting only changed code
wget -O /tmp/git-clang-format https://raw.githubusercontent.com/llvm/llvm-project/release/18.x/clang/tools/clang-format/git-clang-format
sudo cp /tmp/git-clang-format /usr/local/bin/git-clang-format
sudo chmod +x /usr/local/bin/git-clang-format
- name: Set up Python
uses: actions/setup-python@v5
@@ -50,4 +54,4 @@ jobs:
python3 -m venv .venv
.venv/bin/activate
pip install -r requirements/requirements-all.txt -r tools/server/tests/requirements.txt
pip install flake8 pyright
pip install flake8 pyright pre-commit

View File

@@ -557,7 +557,8 @@ jobs:
write-host "Downloading AMD HIP SDK Installer"
Invoke-WebRequest -Uri "https://download.amd.com/developer/eula/rocm-hub/AMD-Software-PRO-Edition-24.Q3-WinSvr2022-For-HIP.exe" -OutFile "${env:RUNNER_TEMP}\rocm-install.exe"
write-host "Installing AMD HIP SDK"
Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -Wait
$proc = Start-Process "${env:RUNNER_TEMP}\rocm-install.exe" -ArgumentList '-install' -NoNewWindow -PassThru
$proc.WaitForExit(600000)
write-host "Completed AMD HIP SDK installation"
- name: Verify ROCm

1
.gitignore vendored
View File

@@ -147,3 +147,4 @@ poetry.toml
# Local scripts
/run-vim.sh
/run-chat.sh
.ccache/

View File

@@ -5,7 +5,6 @@
/tools/server/ @ngxson
/ggml/src/ggml-cuda/fattn* @JohannesGaessler
/ggml/src/ggml-cuda/mmq.* @JohannesGaessler
/ggml/src/ggml-cuda/mmv.* @JohannesGaessler
/ggml/src/ggml-cuda/mmvq.* @JohannesGaessler
/ggml/src/ggml-opt.cpp @JohannesGaessler
/ggml/src/gguf.cpp @JohannesGaessler

1611
Makefile

File diff suppressed because it is too large Load Diff

View File

@@ -17,6 +17,7 @@ LLM inference in C/C++
## Hot topics
- **[guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)**
- **[[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)**
- Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
- Hot PRs: [All](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Apr+label%3Ahot+) | [Open](https://github.com/ggml-org/llama.cpp/pulls?q=is%3Apr+label%3Ahot+is%3Aopen)
@@ -150,6 +151,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
- [x] [GLM-EDGE](https://huggingface.co/models?search=glm-edge)
- [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
- [x] [LFM2-VL](https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)
</details>

View File

@@ -106,7 +106,7 @@ function gg_wget {
cd $out
# should not re-download if file is the same
wget -nv -N $url
wget -nv -c -N $url
cd $cwd
}

View File

@@ -1530,6 +1530,13 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.ctx_shift = false;
}
).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_IMATRIX, LLAMA_EXAMPLE_PERPLEXITY}).set_env("LLAMA_ARG_NO_CONTEXT_SHIFT"));
add_opt(common_arg(
{"--context-shift"},
string_format("enables context shift on infinite text generation (default: %s)", params.ctx_shift ? "enabled" : "disabled"),
[](common_params & params) {
params.ctx_shift = true;
}
).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_IMATRIX, LLAMA_EXAMPLE_PERPLEXITY}).set_env("LLAMA_ARG_CONTEXT_SHIFT"));
add_opt(common_arg(
{"--chunks"}, "N",
string_format("max number of chunks to process (default: %d, -1 = all)", params.n_chunks),
@@ -1748,7 +1755,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params) {
params.warmup = false;
}
).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_EMBEDDING, LLAMA_EXAMPLE_RETRIEVAL}));
).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_EMBEDDING, LLAMA_EXAMPLE_RETRIEVAL, LLAMA_EXAMPLE_PERPLEXITY}));
add_opt(common_arg(
{"--spm-infill"},
string_format(
@@ -1823,7 +1830,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
[](common_params & params, const std::string & value) {
params.sampling.top_n_sigma = std::stof(value);
}
).set_examples({LLAMA_EXAMPLE_MAIN}).set_sparam());
).set_sparam());
add_opt(common_arg(
{"--xtc-probability"}, "N",
string_format("xtc probability (default: %.1f, 0.0 = disabled)", (double)params.sampling.xtc_probability),
@@ -2247,9 +2254,11 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
).set_examples({LLAMA_EXAMPLE_PERPLEXITY}));
add_opt(common_arg(
{"-dt", "--defrag-thold"}, "N",
string_format("KV cache defragmentation threshold (default: %.1f, < 0 - disabled)", (double)params.defrag_thold),
string_format("KV cache defragmentation threshold (DEPRECATED)"),
[](common_params & params, const std::string & value) {
params.defrag_thold = std::stof(value);
GGML_UNUSED(params);
GGML_UNUSED(value);
LOG_WRN("DEPRECATED: --defrag-thold is deprecated and no longer necessary to specify\n");
}
).set_env("LLAMA_ARG_DEFRAG_THOLD"));
add_opt(common_arg(

View File

@@ -147,6 +147,7 @@ struct templates_params {
json extra_context;
bool add_bos;
bool add_eos;
bool is_inference = true;
};
common_chat_tool_choice common_chat_tool_choice_parse_oaicompat(const std::string & tool_choice) {
@@ -632,7 +633,6 @@ const char * common_reasoning_format_name(common_reasoning_format format) {
case COMMON_REASONING_FORMAT_AUTO: return "auto";
case COMMON_REASONING_FORMAT_DEEPSEEK: return "deepseek";
case COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY: return "deepseek-legacy";
case COMMON_REASONING_FORMAT_GRANITE: return "granite";
default:
throw std::runtime_error("Unknown reasoning format");
}
@@ -1337,6 +1337,17 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
common_chat_params data;
auto prompt = apply(tmpl, inputs);
// Check if we need to replace the return token with end token during
// inference and without generation prompt. For more details see:
// https://github.com/ggml-org/llama.cpp/issues/15417
if (inputs.is_inference && !inputs.add_generation_prompt) {
static constexpr std::string_view return_token = "<|return|>";
static constexpr std::string_view end_token = "<|end|>";
if (size_t pos = prompt.rfind(return_token); pos != std::string::npos) {
prompt.replace(pos, return_token.length(), end_token);
}
}
data.prompt = prompt;
data.format = COMMON_CHAT_FORMAT_GPT_OSS;
@@ -1350,6 +1361,26 @@ static common_chat_params common_chat_params_init_gpt_oss(const common_chat_temp
"<|end|>",
};
if (!inputs.json_schema.is_null()) {
data.grammar_lazy = false;
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
auto schema = inputs.json_schema;
builder.resolve_refs(schema);
auto not_end = builder.add_rule("not-end",
"[^<] | \"<\" [^|] | \"<|\" [^e] | \"<|e\" [^n] | \"<|en\" [^d] | \"<|end\" [^|] | \"<|end|\" [^>]");
auto analysis = builder.add_rule("analysis",
"\"<|channel|>analysis<|message|>\" ( " + not_end + " )* \"<|end|>\"");
auto constraint = builder.add_rule("constraint", "\"<|constrain|>\"? [a-zA-Z0-9_-]+");
auto final = builder.add_rule("final",
"\"<|channel|>final\" ( \" \" " + constraint + " )? \"<|message|>\" " +
builder.add_schema("response", schema)
);
builder.add_rule("root", "( " + analysis + " \"<|start|>assistant\" )? " + final);
});
}
if (inputs.tools.is_array() && !inputs.tools.empty()) {
data.grammar_lazy = inputs.tool_choice != COMMON_CHAT_TOOL_CHOICE_REQUIRED;
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
@@ -2110,7 +2141,7 @@ static common_chat_params common_chat_templates_apply_jinja(
}
// GPT-OSS
if (src.find("<|channel|>") != std::string::npos && params.json_schema.is_null()) {
if (src.find("<|channel|>") != std::string::npos) {
return common_chat_params_init_gpt_oss(tmpl, params);
}

View File

@@ -558,13 +558,6 @@ std::string string_from(const struct llama_context * ctx, const std::vector<llam
auto detokenized = common_token_to_piece(ctx, token);
detokenized.erase(
std::remove_if(
detokenized.begin(),
detokenized.end(),
[](const unsigned char c) { return !std::isprint(c); }),
detokenized.end());
buf << "'" << detokenized << "'"
<< ":" << std::to_string(token);
}
@@ -589,13 +582,6 @@ std::string string_from(const struct llama_context * ctx, const struct llama_bat
auto detokenized = common_token_to_piece(ctx, batch.token[i]);
detokenized.erase(
std::remove_if(
detokenized.begin(),
detokenized.end(),
[](const unsigned char c) { return !std::isprint(c); }),
detokenized.end());
buf << "\n" << std::to_string(i)
<< ", token '" << detokenized << "'"
<< ", pos " << std::to_string(batch.pos[i])
@@ -1166,7 +1152,6 @@ struct llama_context_params common_context_params_to_llama(const common_params &
cparams.yarn_orig_ctx = params.yarn_orig_ctx;
cparams.pooling_type = params.pooling_type;
cparams.attention_type = params.attention_type;
cparams.defrag_thold = params.defrag_thold;
cparams.cb_eval = params.cb_eval;
cparams.cb_eval_user_data = params.cb_eval_user_data;
cparams.offload_kqv = !params.no_kv_offload;

View File

@@ -239,12 +239,15 @@ struct common_params_diffusion {
bool add_gumbel_noise = false; // add gumbel noise to the logits if temp > 0.0
};
// reasoning API response format (not to be confused as chat template's reasoning format)
enum common_reasoning_format {
COMMON_REASONING_FORMAT_NONE,
COMMON_REASONING_FORMAT_AUTO,
COMMON_REASONING_FORMAT_AUTO, // Same as deepseek, using `message.reasoning_content`
COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY, // Extract thinking tag contents and return as `message.reasoning_content`, or leave inline in <think> tags in stream mode
COMMON_REASONING_FORMAT_DEEPSEEK, // Extract thinking tag contents and return as `message.reasoning_content`, including in streaming deltas.
COMMON_REASONING_FORMAT_GRANITE, // Extract thinking tag contents and return as `message.reasoning_content`, including in streaming deltas.
// do not extend this enum unless you absolutely have to
// in most cases, use COMMON_REASONING_FORMAT_AUTO
// see: https://github.com/ggml-org/llama.cpp/pull/15408
};
@@ -285,7 +288,6 @@ struct common_params {
float yarn_beta_fast = 32.0f; // YaRN low correction dim
float yarn_beta_slow = 1.0f; // YaRN high correction dim
int32_t yarn_orig_ctx = 0; // YaRN original context length
float defrag_thold = 0.1f; // KV cache defragmentation threshold
// offload params
std::vector<ggml_backend_dev_t> devices; // devices to use for offloading
@@ -372,7 +374,7 @@ struct common_params {
bool cont_batching = true; // insert new sequences for decoding on-the-fly
bool flash_attn = false; // flash attention
bool no_perf = false; // disable performance metrics
bool ctx_shift = true; // context shift on inifinite text generation
bool ctx_shift = false; // context shift on infinite text generation
bool swa_full = false; // use full-size SWA cache (https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
bool kv_unified = false; // enable unified KV cache

View File

@@ -89,13 +89,16 @@ class ModelBase:
block_count: int
tensor_map: gguf.TensorNameMap
# Mistral format specifics
is_mistral_format: bool = False
disable_mistral_community_chat_template: bool = False
def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, *, is_big_endian: bool = False,
use_temp_file: bool = False, eager: bool = False,
metadata_override: Path | None = None, model_name: str | None = None,
split_max_tensors: int = 0, split_max_size: int = 0, dry_run: bool = False,
small_first_shard: bool = False, hparams: dict[str, Any] | None = None, remote_hf_model_id: str | None = None):
small_first_shard: bool = False, hparams: dict[str, Any] | None = None, remote_hf_model_id: str | None = None,
disable_mistral_community_chat_template: bool = False):
if type(self) is ModelBase or \
type(self) is TextModel or \
type(self) is MmprojModel:
@@ -147,6 +150,9 @@ class ModelBase:
self.gguf_writer = gguf.GGUFWriter(path=None, arch=gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=self.use_temp_file,
split_max_tensors=split_max_tensors, split_max_size=split_max_size, dry_run=dry_run, small_first_shard=small_first_shard)
# Mistral specific
self.disable_mistral_community_chat_template = disable_mistral_community_chat_template
@classmethod
def add_prefix_to_filename(cls, path: Path, prefix: str) -> Path:
stem, suffix = path.stem, path.suffix
@@ -1334,6 +1340,12 @@ class MmprojModel(ModelBase):
return None
raise KeyError(f"could not find any of: {keys}")
def tensor_force_quant(self, name, new_name, bid, n_dims):
del bid, name, n_dims # unused
if ".patch_embd.weight" in new_name:
return gguf.GGMLQuantizationType.F16 if self.ftype == gguf.LlamaFileType.MOSTLY_F16 else gguf.GGMLQuantizationType.F32
return False
@ModelBase.register("GPTNeoXForCausalLM")
class GPTNeoXModel(TextModel):
@@ -2005,8 +2017,17 @@ class LlamaModel(TextModel):
template_dir = Path(__file__).parent / "models/templates/"
template = MistralModel.get_community_chat_template(vocab, template_dir)
self.gguf_writer.add_chat_template(template)
if not self.is_mistral_format or not self.disable_mistral_community_chat_template:
# Log only for Mistral format that the official tokenization and detokenization is via `mistral-common`.
if self.is_mistral_format:
logger.info(
"Using a Mistral community chat template. These templates can be subject to errors in early days or weeks after a release. "
"Mistral recommends to use `mistral-common` to perform tokenization and detokenization."
)
template = MistralModel.get_community_chat_template(vocab, template_dir, self.is_mistral_format)
self.gguf_writer.add_chat_template(template)
else:
logger.info("Not using a Mistral community chat template. Ensure to perform the tokenization and detokenization via `mistral-common`.")
def set_vocab(self):
if self.is_mistral_format:
@@ -2305,10 +2326,9 @@ class SmolVLMModel(MmprojModel):
self.gguf_writer.add_vision_use_gelu(True)
def tensor_force_quant(self, name, new_name, bid, n_dims):
del bid, new_name, n_dims # unused
if ".embeddings." in name:
return gguf.GGMLQuantizationType.F32
return False
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
del bid # unused
@@ -3296,12 +3316,9 @@ class Qwen2VLVisionModel(MmprojModel):
self.gguf_writer.add_vision_attention_layernorm_eps(self.global_config.get("rms_norm_eps", 1e-6))
def tensor_force_quant(self, name, new_name, bid, n_dims):
del bid, name, n_dims # unused
if ".patch_embd." in new_name:
return gguf.GGMLQuantizationType.F16
if ".position_embd." in new_name:
return gguf.GGMLQuantizationType.F32
return False
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
del bid # unused
@@ -3374,10 +3391,9 @@ class Qwen25OmniModel(Qwen2VLVisionModel):
yield ("audio_tower.embed_positions.weight", pos_embd)
def tensor_force_quant(self, name, new_name, bid, n_dims):
del bid, new_name, n_dims # unused
if ".conv" in name and ".weight" in name:
return gguf.GGMLQuantizationType.F16
return False
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.startswith("thinker."):
@@ -3423,12 +3439,9 @@ class InternVisionModel(MmprojModel):
self.gguf_writer.add_vision_projector_scale_factor(int(1.0 / downsample_ratio))
def tensor_force_quant(self, name, new_name, bid, n_dims):
del bid, name, n_dims # unused
if ".patch_embd." in new_name:
return gguf.GGMLQuantizationType.F16
if ".position_embd." in new_name:
return gguf.GGMLQuantizationType.F32
return False
return super().tensor_force_quant(name, new_name, bid, n_dims)
def _mapping_interns1_name(self, name):
names_map = {
@@ -5062,13 +5075,12 @@ class Gemma3VisionModel(MmprojModel):
self.gguf_writer.add_vision_projector_scale_factor(proj_scale_factor)
def tensor_force_quant(self, name, new_name, bid, n_dims):
del bid, new_name, n_dims # unused
# related to https://github.com/ggml-org/llama.cpp/issues/13025
if "input_projection" in name:
return gguf.GGMLQuantizationType.F16
if ".embeddings." in name:
return gguf.GGMLQuantizationType.F32
return False
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
del bid # unused
@@ -7727,10 +7739,9 @@ class WhisperEncoderModel(MmprojModel):
self.gguf_writer.add_audio_attention_layernorm_eps(self.hparams.get("layer_norm_eps", 1e-5))
def tensor_force_quant(self, name, new_name, bid, n_dims):
del bid, new_name, n_dims # unused
if ".conv" in name and ".weight" in name:
return gguf.GGMLQuantizationType.F16
return False
return super().tensor_force_quant(name, new_name, bid, n_dims)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
del bid # unused
@@ -8426,7 +8437,7 @@ class MistralModel(LlamaModel):
undo_permute = False
@staticmethod
def get_community_chat_template(vocab: MistralVocab, templates_dir: Path):
def get_community_chat_template(vocab: MistralVocab, templates_dir: Path, is_mistral_format: bool):
assert TokenizerVersion is not None, "mistral_common is not installed"
assert isinstance(vocab.tokenizer, (Tekkenizer, SentencePieceTokenizer)), (
f"Expected Tekkenizer or SentencePieceTokenizer, got {type(vocab.tokenizer)}"
@@ -8447,7 +8458,13 @@ class MistralModel(LlamaModel):
elif vocab.tokenizer.version == TokenizerVersion.v13:
template_file = "unsloth-mistral-Devstral-Small-2507.jinja"
else:
raise ValueError(f"Unknown tokenizer type: {vocab.tokenizer_type} and version {vocab.tokenizer.version}")
err_message = f"Unknown tokenizer type: {vocab.tokenizer_type} and version {vocab.tokenizer.version}"
if is_mistral_format:
err_message += (
" . Please pass --disable-mistral-community-chat-template argument to the CLI "
"if you want to skip this error and use the Mistral official `mistral-common` pre-processing library."
)
raise ValueError(err_message)
template_path = templates_dir / template_file
if not template_path.exists():
@@ -8642,6 +8659,13 @@ def parse_args() -> argparse.Namespace:
"--mistral-format", action="store_true",
help="Whether the model is stored following the Mistral format.",
)
parser.add_argument(
"--disable-mistral-community-chat-template", action="store_true",
help=(
"Whether to disable usage of Mistral community chat templates. If set, use the Mistral official `mistral-common` library for tokenization and detokenization of Mistral models. "
"Using `mistral-common` ensure correctness and zero-day support of tokenization for models converted from the Mistral format but requires to manually setup the tokenization server."
)
)
args = parser.parse_args()
if not args.print_supported_models and args.model is None:
@@ -8748,6 +8772,7 @@ def main() -> None:
fname_out = ModelBase.add_prefix_to_filename(fname_out, "mmproj-")
is_mistral_format = args.mistral_format
disable_mistral_community_chat_template = args.disable_mistral_community_chat_template
with torch.inference_mode():
output_type = ftype_map[args.outtype]
@@ -8774,7 +8799,7 @@ def main() -> None:
split_max_tensors=args.split_max_tensors,
split_max_size=split_str_to_n_bytes(args.split_max_size), dry_run=args.dry_run,
small_first_shard=args.no_tensor_first_split,
remote_hf_model_id=hf_repo_id,
remote_hf_model_id=hf_repo_id, disable_mistral_community_chat_template=disable_mistral_community_chat_template
)
if args.vocab_only:

View File

@@ -265,8 +265,9 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
| BF16 | 🚫 | 🚫 | ❓ | ❓ |
| Q4_0 | ✅ | ✅ | ❓ | ❓ |
| Q4_1 | ✅ | ✅ | ❓ | ❓ |
| Q5_0 | 🚫 | 🚫 | ❓ | ❓ |
| Q5_1 | 🚫 | 🚫 | ❓ | ❓ |
| MXFP4 | 🚫 | 🚫 | ❓ | ❓ |
| Q5_0 | | | ❓ | ❓ |
| Q5_1 | ✅ | ✅ | ❓ | ❓ |
| Q8_0 | ✅ | ✅ | ❓ | ❓ |
| Q2_K | 🚫 | 🚫 | ❓ | ❓ |
| Q3_K | ✅ | ✅ | ❓ | ❓ |
@@ -291,4 +292,4 @@ IBM VXE/VXE2 SIMD acceleration depends on the BLAS implementation. It is strongl
- 🚫 - acceleration unavailable, will still run using scalar implementation
- ❓ - acceleration unknown, please contribute if you can test it yourself
Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on July 31, 2025.
Last Updated by **Aaron Teo (aaron.teo1@ibm.com)** on Aug 22, 2025.

View File

@@ -197,13 +197,12 @@ The environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` can be used to enab
The following compilation options are also available to tweak performance:
| Option | Legal values | Default | Description |
|-------------------------------|------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, CDNA and RDNA3+). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models |
| GGML_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
| Option | Legal values | Default | Description |
|-------------------------------|------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| GGML_CUDA_FORCE_MMQ | Boolean | false | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, CDNA and RDNA3+). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for CDNA and RDNA4) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000). |
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
## MUSA

View File

@@ -194,7 +194,7 @@ llama_print_timings: total time = 44411.01 ms / 377 tokens
## Orin compile and run
### compile
```sh
make GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_87 GGML_CUDA_F16=1 -j 32
make GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_87 -j 32
```
### run on Orin
### case 1

View File

@@ -34,6 +34,7 @@ else()
add_subdirectory(gen-docs)
add_subdirectory(training)
add_subdirectory(diffusion)
add_subdirectory(model-conversion)
if (NOT GGML_BACKEND_DL)
add_subdirectory(convert-llama2c-to-ggml)
# these examples use the backends directly and cannot be built with dynamic loading

View File

@@ -1,4 +1,5 @@
This is a swift clone of `examples/batched`.
$ `make`
$ `./llama-batched-swift MODEL_PATH [PROMPT] [PARALLEL]`
```bash
$ ./llama-batched-swift MODEL_PATH [PROMPT] [PARALLEL]
```

View File

@@ -17,7 +17,7 @@
"
" start the llama.cpp server with a FIM-compatible model. for example:
"
" $ llama-server -m {model.gguf} --port 8012 -ngl 99 -fa -dt 0.1 --ubatch-size 512 --batch-size 1024 --cache-reuse 256
" $ llama-server -m {model.gguf} --port 8012 -ngl 99 -fa --ubatch-size 512 --batch-size 1024 --cache-reuse 256
"
" --batch-size [512, model max context]
"

View File

@@ -5,3 +5,9 @@ Demonstration of lookahead decoding technique:
https://lmsys.org/blog/2023-11-21-lookahead-decoding/
More info: https://github.com/ggml-org/llama.cpp/pull/4207
Sample command:
```bash
llama-lookahead -hf ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF -p "// network server implemented in C\n// author: Peter Hacker\n\n#include" -e -ngl 99 -t 4 -n 512 -c 4096 -kvu
```

3
examples/model-conversion/.gitignore vendored Normal file
View File

@@ -0,0 +1,3 @@
.model_name
data
ppl

View File

@@ -0,0 +1,5 @@
set(TARGET llama-logits)
add_executable(${TARGET} logits.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_17)

View File

@@ -0,0 +1,163 @@
# Validation functions
define validate_model_path
@if [ -z "$(MODEL_PATH)" ]; then \
echo "Error: MODEL_PATH must be provided either as:"; \
echo " 1. Environment variable: export MODEL_PATH=/path/to/model"; \
echo " 2. Command line argument: make $(1) MODEL_PATH=/path/to/model"; \
exit 1; \
fi
endef
define validate_embedding_model_path
@if [ -z "$(EMBEDDING_MODEL_PATH)" ]; then \
echo "Error: EMBEDDING_MODEL_PATH must be provided either as:"; \
echo " 1. Environment variable: export EMBEDDING_MODEL_PATH=/path/to/model"; \
echo " 2. Command line argument: make $(1) EMBEDDING_MODEL_PATH=/path/to/model"; \
exit 1; \
fi
endef
###
### Casual Model targets/recipes
###
causal-convert-model-bf16: OUTTYPE=bf16
causal-convert-model-bf16: causal-convert-model
causal-convert-model:
$(call validate_model_path,causal-convert-model)
@MODEL_NAME="$(MODEL_NAME)" OUTTYPE="$(OUTTYPE)" MODEL_PATH="$(MODEL_PATH)" \
METADATA_OVERRIDE="$(METADATA_OVERRIDE)" \
./scripts/causal/convert-model.sh
causal-run-original-model:
$(call validate_model_path,causal-run-original-model)
@MODEL_PATH="$(MODEL_PATH)" ./scripts/causal/run-org-model.py
causal-run-converted-model:
@CONVERTED_MODEL="$(CONVERTED_MODEL)" ./scripts/causal/run-converted-model.sh
causal-verify-logits: causal-run-original-model causal-run-converted-model
@./scripts/causal/compare-logits.py
@MODEL_PATH="$(MODEL_PATH)" ./scripts/utils/check-nmse.py -m ${MODEL_PATH}
causal-run-original-embeddings:
@./scripts/causal/run-casual-gen-embeddings-org.sh
causal-run-converted-embeddings:
@./scripts/causal/run-converted-model-embeddings-logits.sh
causal-verify-embeddings: causal-run-original-embeddings causal-run-converted-embeddings
@./scripts/causal/compare-embeddings-logits.sh
causal-inspect-original-model:
@./scripts/utils/inspect-org-model.py
causal-inspect-converted-model:
@./scripts/utils/inspect-converted-model.sh
causal-start-embedding-server:
@./scripts/utils/run-embedding-server.sh ${CONVERTED_MODEL}
causal-curl-embedding-endpoint: causal-run-original-embeddings
@./scripts/utils/curl-embedding-server.sh | ./scripts/causal/compare-embeddings-logits.sh
causal-quantize-Q8_0: QUANTIZED_TYPE = Q8_0
causal-quantize-Q8_0: causal-quantize-model
causal-quantize-Q4_0: QUANTIZED_TYPE = Q4_0
causal-quantize-Q4_0: causal-quantize-model
causal-quantize-model:
@CONVERTED_MODEL="$(CONVERTED_MODEL)" QUANTIZED_TYPE="$(QUANTIZED_TYPE)" ./scripts/utils/quantize.sh ${CONVERTED_MODEL} ${QUANTIZED_TYPE}
@echo "Export the quantized model path to QUANTIZED_MODEL variable in your environment"
causal-run-quantized-model:
@QUANTIZED_MODEL="$(QUANTIZED_MODEL)" ./scripts/causal/run-converted-model.sh ${QUANTIZED_MODEL}
###
### Embedding Model targets/recipes
###
embedding-convert-model-bf16: OUTTYPE=bf16
embedding-convert-model-bf16: embedding-convert-model
embedding-convert-model:
$(call validate_embedding_model_path,embedding-convert-model)
@MODEL_NAME="$(MODEL_NAME)" OUTTYPE="$(OUTTYPE)" MODEL_PATH="$(EMBEDDING_MODEL_PATH)" \
METADATA_OVERRIDE="$(METADATA_OVERRIDE)" \
./scripts/embedding/convert-model.sh
embedding-run-original-model:
$(call validate_embedding_model_path,embedding-run-original-model)
@EMBEDDING_MODEL_PATH="$(EMBEDDING_MODEL_PATH)" ./scripts/embedding/run-original-model.py
embedding-run-converted-model:
@CONVERTED_EMBEDDING_MODEL="$(CONVERTED_EMBEDDING_MODEL)" ./scripts/embedding/run-converted-model.sh ${CONVERTED_EMBEDDING_MODEL}
embedding-verify-logits: embedding-run-original-model embedding-run-converted-model
@./scripts/embedding/compare-embeddings-logits.sh
embedding-inspect-original-model:
$(call validate_embedding_model_path,embedding-inspect-original-model)
@EMBEDDING_MODEL_PATH="$(EMBEDDING_MODEL_PATH)" ./scripts/utils/inspect-org-model.py -m ${EMBEDDING_MODEL_PATH}
embedding-inspect-converted-model:
@CONVERTED_EMBEDDING_MODEL="$(CONVERTED_EMBEDDING_MODEL)" ./scripts/utils/inspect-converted-model.sh ${CONVERTED_EMBEDDING_MODEL}
embedding-start-embedding-server:
@./scripts/utils/run-embedding-server.sh ${CONVERTED_EMBEDDING_MODEL}
embedding-curl-embedding-endpoint:
@./scripts/utils/curl-embedding-server.sh | ./scripts/embedding/compare-embeddings-logits.sh
embedding-quantize-Q8_0: QUANTIZED_TYPE = Q8_0
embedding-quantize-Q8_0: embedding-quantize-model
embedding-quantize-Q4_0: QUANTIZED_TYPE = Q4_0
embedding-quantize-Q4_0: embedding-quantize-model
embedding-quantize-model:
@./scripts/utils/quantize.sh ${CONVERTED_EMBEDDING_MODEL} ${QUANTIZED_TYPE}
@echo "Export the quantized model path to QUANTIZED_EMBEDDING_MODEL variable in your environment"
embedding-run-quantized-model:
@./scripts/embedding/run-converted-model.sh ${QUANTIZED_EMBEDDING_MODEL}
###
### Perplexity targets/recipes
###
perplexity-data-gen:
CONVERTED_MODEL="$(CONVERTED_MODEL)" ./scripts/utils/perplexity-gen.sh
perplexity-run-full:
QUANTIZED_MODEL="$(QUANTIZED_MODEL)" LOOGITS_FILE="$(LOGITS_FILE)" \
./scripts/utils/perplexity-run.sh
perplexity-run:
QUANTIZED_MODEL="$(QUANTIZED_MODEL)" ./scripts/utils/perplexity-run-simple.sh
###
### HuggingFace targets/recipes
###
hf-create-model:
@./scripts/utils/hf-create-model.py -m "${MODEL_NAME}" -ns "${NAMESPACE}" -b "${ORIGINAL_BASE_MODEL}"
hf-create-model-private:
@./scripts/utils/hf-create-model.py -m "${MODEL_NAME}" -ns "${NAMESPACE}" -b "${ORIGINAL_BASE_MODEL}" -p
hf-upload-gguf-to-model:
@./scripts/utils/hf-upload-gguf-model.py -m "${MODEL_PATH}" -r "${REPO_ID}" -o "${NAME_IN_REPO}"
hf-create-collection:
@./scripts/utils/hf-create-collection.py -n "${NAME}" -d "${DESCRIPTION}" -ns "${NAMESPACE}"
hf-add-model-to-collection:
@./scripts/utils/hf-add-model-to-collection.py -c "${COLLECTION}" -m "${MODEL}"
.PHONY: clean
clean:
@${RM} -rf data .converted_embedding_model.txt .converted_model.txt .embedding_model_name.txt .model_name.txt

View File

@@ -0,0 +1,335 @@
# Model Conversion Example
This directory contains scripts and code to help in the process of converting
HuggingFace PyTorch models to GGUF format.
The motivation for having this is that the conversion process can often be an
iterative process, where the original model is inspected, converted, updates
made to llama.cpp, converted again, etc. Once the model has been converted it
needs to be verified against the original model, and then optionally quantified,
and in some cases perplexity checked of the quantized model. And finally the
model/models need to the ggml-org on Hugging Face. This tool/example tries to
help with this process.
### Overview
The idea is that the makefile targets and scripts here can be used in the
development/conversion process assisting with things like:
* inspect/run the original model to figure out how it works
* convert the original model to GGUF format
* inspect/run the converted model
* verify the logits produced by the original model and the converted model
* quantize the model to GGUF format
* run perplexity evaluation to verify that the quantized model is performing
as expected
* upload the model to HuggingFace to make it available for others
## Setup
Create virtual python environment
```console
$ python3.11 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
```
## Causal Language Model Conversion
This section describes the steps to convert a causal language model to GGUF and
to verify that the conversion was successful.
### Download the original model
First, clone the original model to some local directory:
```console
$ mkdir models && cd models
$ git clone https://huggingface.co/user/model_name
$ cd model_name
$ git lfs install
$ git lfs pull
```
### Set the MODEL_PATH
The path to the downloaded model can be provided in two ways:
**Option 1: Environment variable (recommended for iterative development)**
```console
export MODEL_PATH=~/work/ai/models/some_model
```
**Option 2: Command line argument (for one-off tasks)**
```console
make causal-convert-model MODEL_PATH=~/work/ai/models/some_model
```
Command line arguments take precedence over environment variables when both are provided.
In cases where the transformer implementation for the model has not been released
yet it is possible to set the environment variable `UNRELEASED_MODEL_NAME` which
will then cause the transformer implementation to be loaded explicitely and not
use AutoModelForCausalLM:
```
export UNRELEASED_MODEL_NAME=SomeNewModel
```
### Inspecting the original tensors
```console
# Using environment variable
(venv) $ make causal-inspect-original-model
# Or using command line argument
(venv) $ make causal-inspect-original-model MODEL_PATH=~/work/ai/models/some_model
```
### Running the original model
This is mainly to verify that the original model works, and to compare the output
from the converted model.
```console
# Using environment variable
(venv) $ make causal-run-original-model
# Or using command line argument
(venv) $ make causal-run-original-model MODEL_PATH=~/work/ai/models/some_model
```
This command will save two files to the `data` directory, one is a binary file
containing logits which will be used for comparison with the converted model
later, and the other is a text file which allows for manual visual inspection.
### Model conversion
After updates have been made to [gguf-py](../../gguf-py) to add support for the
new model, the model can be converted to GGUF format using the following command:
```console
# Using environment variable
(venv) $ make causal-convert-model
# Or using command line argument
(venv) $ make causal-convert-model MODEL_PATH=~/work/ai/models/some_model
```
### Inspecting the converted model
The converted model can be inspected using the following command:
```console
(venv) $ make inspect-converted-model
```
### Running the converted model
```console
(venv) $ make run-converted-model
```
### Model logits verfication
The following target will run the original model and the converted model and
compare the logits:
```console
(venv) $ make causal-verify-logits
```
### Quantizing the model
The causal model can be quantized to GGUF format using the following command:
```console
(venv) $ make causal-quantize-Q8_0
Quantized model saved to: /path/to/quantized/model-Q8_0.gguf
Export the quantized model path to QUANTIZED_MODEL variable in your environment
```
This will show the path to the quantized model in the terminal, which can then
be used to set the `QUANTIZED_MODEL` environment variable:
```console
export QUANTIZED_MODEL=/path/to/quantized/model-Q8_0.gguf
```
Then the quantized model can be run using the following command:
```console
(venv) $ make causal-run-quantized-model
```
## Embedding Language Model Conversion
### Download the original model
```console
$ mkdir models && cd models
$ git clone https://huggingface.co/user/model_name
$ cd model_name
$ git lfs install
$ git lfs pull
```
The path to the embedding model can be provided in two ways:
**Option 1: Environment variable (recommended for iterative development)**
```console
export EMBEDDING_MODEL_PATH=~/path/to/embedding_model
```
**Option 2: Command line argument (for one-off tasks)**
```console
make embedding-convert-model EMBEDDING_MODEL_PATH=~/path/to/embedding_model
```
Command line arguments take precedence over environment variables when both are provided.
### Running the original model
This is mainly to verify that the original model works and to compare the output
with the output from the converted model.
```console
# Using environment variable
(venv) $ make embedding-run-original-model
# Or using command line argument
(venv) $ make embedding-run-original-model EMBEDDING_MODEL_PATH=~/path/to/embedding_model
```
This command will save two files to the `data` directory, one is a binary
file containing logits which will be used for comparison with the converted
model, and the other is a text file which allows for manual visual inspection.
### Model conversion
After updates have been made to [gguf-py](../../gguf-py) to add support for the
new model the model can be converted to GGUF format using the following command:
```console
(venv) $ make embedding-convert-model
```
### Run the converted model
```console
(venv) $ make embedding-run-converted-model
```
### Model logits verfication
The following target will run the original model and the converted model (which
was done manually in the previous steps) and compare the logits:
```console
(venv) $ make embedding-verify-logits
```
### llama-server verification
To verify that the converted model works with llama-server, the following
command can be used:
```console
(venv) $ make embedding-start-embedding-server
```
Then open another terminal and set the `EMBEDDINGS_MODEL_PATH` environment
variable as this will not be inherited by the new terminal:
```console
(venv) $ make embedding-curl-embedding-endpoint
```
This will call the `embedding` endpoing and the output will be piped into
the same verification script as used by the target `embedding-verify-logits`.
The causal model can also be used to produce embeddings and this can be verified
using the following commands:
```console
(venv) $ make causal-start-embedding-server
```
Then open another terminal and set the `MODEL_PATH` environment
variable as this will not be inherited by the new terminal:
```console
(venv) $ make casual-curl-embedding-endpoint
```
### Quantizing the model
The embedding model can be quantized to GGUF format using the following command:
```console
(venv) $ make embedding-quantize-Q8_0
Quantized model saved to: /path/to/quantized/model-Q8_0.gguf
Export the quantized model path to QUANTIZED_EMBEDDING_MODEL variable in your environment
```
This will show the path to the quantized model in the terminal, which can then
be used to set the `QUANTIZED_EMBEDDING_MODEL` environment variable:
```console
export QUANTIZED_EMBEDDING_MODEL=/path/to/quantized/model-Q8_0.gguf
```
Then the quantized model can be run using the following command:
```console
(venv) $ make embedding-run-quantized-model
```
## Perplexity Evaluation
### Simple perplexity evaluation
This allows to run the perplexity evaluation without having to generate a
token/logits file:
```console
(venv) $ make perplexity-run QUANTIZED_MODEL=~/path/to/quantized/model.gguf
```
This will use the wikitext dataset to run the perplexity evaluation and
output the perplexity score to the terminal. This value can then be compared
with the perplexity score of the unquantized model.
### Full perplexity evaluation
First use the converted, non-quantized, model to generate the perplexity evaluation
dataset using the following command:
```console
$ make perplexity-data-gen CONVERTED_MODEL=~/path/to/converted/model.gguf
```
This will generate a file in the `data` directory named after the model and with
a `.kld` suffix which contains the tokens and the logits for the wikitext dataset.
After the dataset has been generated, the perplexity evaluation can be run using
the quantized model:
```console
$ make perplexity-run-full QUANTIZED_MODEL=~/path/to/quantized/model-Qxx.gguf LOGITS_FILE=data/model.gguf.ppl
```
> 📝 **Note:** The `LOGITS_FILE` is the file generated by the previous command
> can be very large, so make sure you have enough disk space available.
## HuggingFace utilities
The following targets are useful for creating collections and model repositories
on Hugging Face in the the ggml-org. These can be used when preparing a relase
to script the process for new model releases.
For the following targets a `HF_TOKEN` environment variable is required.
> 📝 **Note:** Don't forget to logout from Hugging Face after running these
> commands, otherwise you might have issues pulling/cloning repositories as
> the token will still be in use:
> $ huggingface-cli logout
> $ unset HF_TOKEN
### Create a new Hugging Face Model (model repository)
This will create a new model repsository on Hugging Face with the specified
model name.
```console
(venv) $ make hf-create-model MODEL_NAME='TestModel' NAMESPACE="danbev"
Repository ID: danbev/TestModel-GGUF
Repository created: https://huggingface.co/danbev/TestModel-GGUF
```
Note that we append a `-GGUF` suffix to the model name to ensure a consistent
naming convention for GGUF models.
### Upload a GGUF model to model repository
The following target uploads a model to an existing Hugging Face model repository.
```console
(venv) $ make hf-upload-gguf-to-model MODEL_PATH=dummy-model1.gguf REPO_ID=danbev/TestModel-GGUF
📤 Uploading dummy-model1.gguf to danbev/TestModel-GGUF/dummy-model1.gguf
✅ Upload successful!
🔗 File available at: https://huggingface.co/danbev/TestModel-GGUF/blob/main/dummy-model1.gguf
```
This command can also be used to update an existing model file in a repository.
### Create a new Collection
```console
(venv) $ make hf-new-collection NAME=TestCollection DESCRIPTION="Collection for testing scripts" NAMESPACE=danbev
🚀 Creating Hugging Face Collection
Title: TestCollection
Description: Collection for testing scripts
Namespace: danbev
Private: False
✅ Authenticated as: danbev
📚 Creating collection: 'TestCollection'...
✅ Collection created successfully!
📋 Collection slug: danbev/testcollection-68930fcf73eb3fc200b9956d
🔗 Collection URL: https://huggingface.co/collections/danbev/testcollection-68930fcf73eb3fc200b9956d
🎉 Collection created successfully!
Use this slug to add models: danbev/testcollection-68930fcf73eb3fc200b9956d
```
### Add model to a Collection
```console
(venv) $ make hf-add-model-to-collection COLLECTION=danbev/testcollection-68930fcf73eb3fc200b9956d MODEL=danbev/TestModel-GGUF
✅ Authenticated as: danbev
🔍 Checking if model exists: danbev/TestModel-GGUF
✅ Model found: danbev/TestModel-GGUF
📚 Adding model to collection...
✅ Model added to collection successfully!
🔗 Collection URL: https://huggingface.co/collections/danbev/testcollection-68930fcf73eb3fc200b9956d
🎉 Model added successfully!
```

View File

@@ -0,0 +1,209 @@
#include "llama.h"
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>
#include <ctype.h>
#include <filesystem>
static void print_usage(int, char ** argv) {
printf("\nexample usage:\n");
printf("\n %s -m model.gguf [-ngl n_gpu_layers] -embd-mode [prompt]\n", argv[0]);
printf("\n");
}
int main(int argc, char ** argv) {
std::string model_path;
std::string prompt = "Hello, my name is";
int ngl = 0;
bool embedding_mode = false;
{
int i = 1;
for (; i < argc; i++) {
if (strcmp(argv[i], "-m") == 0) {
if (i + 1 < argc) {
model_path = argv[++i];
} else {
print_usage(argc, argv);
return 1;
}
} else if (strcmp(argv[i], "-ngl") == 0) {
if (i + 1 < argc) {
try {
ngl = std::stoi(argv[++i]);
} catch (...) {
print_usage(argc, argv);
return 1;
}
} else {
print_usage(argc, argv);
return 1;
}
} else if (strcmp(argv[i], "-embd-mode") == 0) {
if (i + 1 < argc) {
try {
embedding_mode = true;
} catch (...) {
print_usage(argc, argv);
return 1;
}
} else {
print_usage(argc, argv);
return 1;
}
} else {
// prompt starts here
break;
}
}
if (model_path.empty()) {
print_usage(argc, argv);
return 1;
}
if (i < argc) {
prompt = argv[i++];
for (; i < argc; i++) {
prompt += " ";
prompt += argv[i];
}
}
}
ggml_backend_load_all();
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = ngl;
llama_model * model = llama_model_load_from_file(model_path.c_str(), model_params);
if (model == NULL) {
fprintf(stderr , "%s: error: unable to load model\n" , __func__);
return 1;
}
// Extract basename from model_path
const char * basename = strrchr(model_path.c_str(), '/');
basename = (basename == NULL) ? model_path.c_str() : basename + 1;
char model_name[256];
strncpy(model_name, basename, 255);
model_name[255] = '\0';
char * dot = strrchr(model_name, '.');
if (dot != NULL && strcmp(dot, ".gguf") == 0) {
*dot = '\0';
}
printf("Model name: %s\n", model_name);
const llama_vocab * vocab = llama_model_get_vocab(model);
const int n_prompt = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true);
std::vector<llama_token> prompt_tokens(n_prompt);
if (llama_tokenize(vocab, prompt.c_str(), prompt.size(), prompt_tokens.data(), prompt_tokens.size(), true, true) < 0) {
fprintf(stderr, "%s: error: failed to tokenize the prompt\n", __func__);
return 1;
}
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx = n_prompt;
ctx_params.n_batch = n_prompt;
ctx_params.no_perf = false;
if (embedding_mode) {
ctx_params.embeddings = true;
ctx_params.n_ubatch = ctx_params.n_batch;
}
llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
fprintf(stderr , "%s: error: failed to create the llama_context\n" , __func__);
return 1;
}
printf("Input prompt: \"%s\"\n", prompt.c_str());
printf("Tokenized prompt (%d tokens): ", n_prompt);
for (auto id : prompt_tokens) {
char buf[128];
int n = llama_token_to_piece(vocab, id, buf, sizeof(buf), 0, true);
if (n < 0) {
fprintf(stderr, "%s: error: failed to convert token to piece\n", __func__);
return 1;
}
std::string s(buf, n);
printf("%s", s.c_str());
}
printf("\n");
llama_batch batch = llama_batch_get_one(prompt_tokens.data(), prompt_tokens.size());
if (llama_decode(ctx, batch)) {
fprintf(stderr, "%s : failed to eval\n", __func__);
return 1;
}
float * logits;
int n_logits;
const char * type;
if (embedding_mode) {
logits = llama_get_embeddings(ctx);
n_logits = llama_model_n_embd(model) * batch.n_tokens;
type = "-embeddings";
printf("Embeddings size: %d\n", n_logits);
} else {
logits = llama_get_logits_ith(ctx, batch.n_tokens - 1);
n_logits = llama_vocab_n_tokens(vocab);
type = "";
printf("Vocab size: %d\n", n_logits);
}
std::filesystem::create_directory("data");
// Save logits to binary file
char bin_filename[512];
snprintf(bin_filename, sizeof(bin_filename), "data/llamacpp-%s%s.bin", model_name, type);
printf("Saving logits to %s\n", bin_filename);
FILE * f = fopen(bin_filename, "wb");
if (f == NULL) {
fprintf(stderr, "%s: error: failed to open binary output file\n", __func__);
return 1;
}
fwrite(logits, sizeof(float), n_logits, f);
fclose(f);
// Also save as text for debugging
char txt_filename[512];
snprintf(txt_filename, sizeof(txt_filename), "data/llamacpp-%s%s.txt", model_name, type);
f = fopen(txt_filename, "w");
if (f == NULL) {
fprintf(stderr, "%s: error: failed to open text output file\n", __func__);
return 1;
}
for (int i = 0; i < n_logits; i++) {
fprintf(f, "%d: %.6f\n", i, logits[i]); // Added index and changed format
}
fclose(f);
// Print first and last 10 logits for quick verification
printf("First 10 logits: ");
for (int i = 0; i < 10 && i < n_logits; i++) {
printf("%.6f ", logits[i]);
}
printf("\n");
printf("Last 10 logits: ");
for (int i = n_logits - 10; i < n_logits; i++) {
if (i >= 0) printf("%.6f ", logits[i]);
}
printf("\n\n");
printf("Logits saved to %s\n", bin_filename);
printf("Logits saved to %s\n", txt_filename);
llama_free(ctx);
llama_model_free(model);
return 0;
}

View File

@@ -0,0 +1,5 @@
--extra-index-url https://download.pytorch.org/whl/cpu
torch~=2.6.0
torchvision~=0.21.0
transformers~=4.55.0
huggingface-hub~=0.34.0

View File

@@ -0,0 +1,43 @@
#/bin/bash
set -e
MODEL_PATH="${1:-"$MODEL_PATH"}"
MODEL_NAME="${2:-$(basename "$MODEL_PATH")}"
if [ -t 0 ]; then
CPP_EMBEDDINGS="data/llamacpp-${MODEL_NAME}-embeddings.bin"
else
# Process piped JSON data and convert to binary (matching logits.cpp format)
TEMP_FILE=$(mktemp /tmp/tmp.XXXXXX.binn)
python3 -c "
import json
import sys
import struct
data = json.load(sys.stdin)
# Flatten all embeddings completely
flattened = []
for item in data:
embedding = item['embedding']
for token_embedding in embedding:
flattened.extend(token_embedding)
print(f'Total embedding values: {len(flattened)}', file=sys.stderr)
# Write as binary floats - matches logitc.cpp fwrite format
with open('$TEMP_FILE', 'wb') as f:
for value in flattened:
f.write(struct.pack('f', value))
"
CPP_EMBEDDINGS="$TEMP_FILE"
trap "rm -f $TEMP_FILE" EXIT
fi
python scripts/utils/semantic_check.py --model-path $MODEL_PATH \
--python-embeddings data/pytorch-${MODEL_NAME}-embeddings.bin \
--cpp-embeddings $CPP_EMBEDDINGS \
--prompt "Hello world today" \
--causal

View File

@@ -0,0 +1,88 @@
#!/usr/bin/env python3
import numpy as np
import sys
import os
from pathlib import Path
def quick_logits_check(pytorch_file, llamacpp_file):
"""Lightweight sanity check before NMSE"""
try:
pytorch_logits = np.fromfile(pytorch_file, dtype=np.float32)
llamacpp_logits = np.fromfile(llamacpp_file, dtype=np.float32)
except Exception as e:
print(f"❌ NOK: Failed to load files - {e}")
return False
# Check shapes match
if pytorch_logits.shape != llamacpp_logits.shape:
print(f"❌ NOK: Shape mismatch - PyTorch: {pytorch_logits.shape}, llama.cpp: {llamacpp_logits.shape}")
return False
# Calculate key metrics
diff = pytorch_logits - llamacpp_logits
abs_diff = np.abs(diff)
max_diff = np.max(abs_diff)
# Get top 10 predictions from both models
pytorch_top10 = np.argsort(pytorch_logits)[-10:][::-1]
llamacpp_top10 = np.argsort(llamacpp_logits)[-10:][::-1]
print(f"Top 10 PyTorch logits: {pytorch_logits[pytorch_top10]}")
print(f"Top 10 llama.cpp logits: {llamacpp_logits[llamacpp_top10]}")
print(f"Max absolute difference: {max_diff:.4f}")
if max_diff > 1.0:
print(f"❌ NOK: Large differences detected - max diff: {max_diff:.4f}")
return False
return True
def main():
model_path = os.getenv('MODEL_PATH')
if not model_path:
print("Error: MODEL_PATH environment variable not set")
sys.exit(1)
if not os.path.exists(model_path):
print(f"Error: Model file not found: {model_path}")
sys.exit(1)
model_name = os.path.splitext(os.path.basename(model_path))[0]
data_dir = Path("data")
pytorch_file = data_dir / f"pytorch-{model_name}.bin"
llamacpp_file = data_dir / f"llamacpp-{model_name}.bin"
if not pytorch_file.exists():
print(f"Error: PyTorch logits file not found: {pytorch_file}")
print("Please run scripts/run-org-model.sh first to generate this file.")
sys.exit(1)
if not llamacpp_file.exists():
print(f"Error: llama.cpp logits file not found: {llamacpp_file}")
print("Please run scripts/run-converted-model.sh first to generate this file.")
sys.exit(1)
print("Checked all required files were found. Proceeding...\n")
print("🔍 GGML Model Validation for model ", model_name)
print("=" * 40)
print(f"PyTorch logits : {pytorch_file}")
print(f"llama.cpp logits: {llamacpp_file}")
print()
success = quick_logits_check(pytorch_file, llamacpp_file)
# Exit with appropriate code
if success:
print("✅ OK: Lightweight model check successful!")
print(" Ok to proceed with NMSE check...")
sys.exit(0)
else:
print(f"❌ NOK: Top 10 predictions don't match - generation will differ")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,22 @@
#!/bin/bash
MODEL_NAME="${MODEL_NAME:-$(basename "$MODEL_PATH")}"
OUTPUT_DIR="${OUTPUT_DIR:-../../models}"
TYPE="${OUTTYPE:-f16}"
METADATA_OVERRIDE="${METADATA_OVERRIDE:-}"
CONVERTED_MODEL="${OUTPUT_DIR}/${MODEL_NAME}.gguf"
echo "Model path: ${MODEL_PATH}"
echo "Model name: ${MODEL_NAME}"
echo "Data type: ${TYPE}"
echo "Converted model path:: ${CONVERTED_MODEL}"
echo "Metadata override: ${METADATA_OVERRIDE}"
python ../../convert_hf_to_gguf.py --verbose \
${MODEL_PATH} \
--outfile ${CONVERTED_MODEL} \
--outtype ${TYPE} \
--metadata "${METADATA_OVERRIDE}"
echo ""
echo "The environment variable CONVERTED_MODEL can be set to this path using:"
echo "export CONVERTED_MODEL=$(realpath ${CONVERTED_MODEL})"

View File

@@ -0,0 +1,113 @@
#!/usr/bin/env python3
import argparse
import os
import importlib
import sys
import torch
import numpy as np
from transformers import AutoTokenizer, AutoConfig, AutoModel, AutoModelForCausalLM
from pathlib import Path
unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')
parser = argparse.ArgumentParser(description='Process model with specified path')
parser.add_argument('--model-path', '-m', help='Path to the model')
args = parser.parse_args()
model_path = os.environ.get('MODEL_PATH', args.model_path)
if model_path is None:
parser.error("Model path must be specified either via --model-path argument or MODEL_PATH environment variable")
config = AutoConfig.from_pretrained(model_path)
print("Model type: ", config.model_type)
print("Vocab size: ", config.vocab_size)
print("Hidden size: ", config.hidden_size)
print("Number of layers: ", config.num_hidden_layers)
print("BOS token id: ", config.bos_token_id)
print("EOS token id: ", config.eos_token_id)
print("Loading model and tokenizer using AutoTokenizer:", model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
if unreleased_model_name:
model_name_lower = unreleased_model_name.lower()
unreleased_module_path = f"transformers.models.{model_name_lower}.modular_{model_name_lower}"
class_name = f"{unreleased_model_name}ForCausalLM"
print(f"Importing unreleased model module: {unreleased_module_path}")
try:
model_class = getattr(importlib.import_module(unreleased_module_path), class_name)
model = model_class.from_pretrained(model_path)
except (ImportError, AttributeError) as e:
print(f"Failed to import or load model: {e}")
else:
model = AutoModelForCausalLM.from_pretrained(model_path)
print(f"Model class: {type(model)}")
#print(f"Model file: {type(model).__module__}")
model_name = os.path.basename(model_path)
print(f"Model name: {model_name}")
prompt = "Hello world today"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(f"Input tokens: {input_ids}")
print(f"Input text: {repr(prompt)}")
print(f"Tokenized: {tokenizer.convert_ids_to_tokens(input_ids[0])}")
with torch.no_grad():
outputs = model(input_ids, output_hidden_states=True)
# Extract hidden states from the last layer
# outputs.hidden_states is a tuple of (num_layers + 1) tensors
# Index -1 gets the last layer, shape: [batch_size, seq_len, hidden_size]
last_hidden_states = outputs.hidden_states[-1]
# Get embeddings for all tokens
token_embeddings = last_hidden_states[0].cpu().numpy() # Remove batch dimension
print(f"Hidden states shape: {last_hidden_states.shape}")
print(f"Token embeddings shape: {token_embeddings.shape}")
print(f"Hidden dimension: {token_embeddings.shape[-1]}")
print(f"Number of tokens: {token_embeddings.shape[0]}")
# Save raw token embeddings
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
bin_filename = data_dir / f"pytorch-{model_name}-embeddings.bin"
txt_filename = data_dir / f"pytorch-{model_name}-embeddings.txt"
# Save all token embeddings as binary
print(token_embeddings)
token_embeddings.astype(np.float32).tofile(bin_filename)
# Save as text for inspection
with open(txt_filename, "w") as f:
for i, embedding in enumerate(token_embeddings):
for j, val in enumerate(embedding):
f.write(f"{i} {j} {val:.6f}\n")
# Print embeddings per token in the requested format
print("\nToken embeddings:")
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
for i, embedding in enumerate(token_embeddings):
# Format: show first few values, ..., then last few values
if len(embedding) > 10:
# Show first 3 and last 3 values with ... in between
first_vals = " ".join(f"{val:8.6f}" for val in embedding[:3])
last_vals = " ".join(f"{val:8.6f}" for val in embedding[-3:])
print(f"embedding {i}: {first_vals} ... {last_vals}")
else:
# If embedding is short, show all values
vals = " ".join(f"{val:8.6f}" for val in embedding)
print(f"embedding {i}: {vals}")
# Also show token info for reference
print(f"\nToken reference:")
for i, token in enumerate(tokens):
print(f" Token {i}: {repr(token)}")
print(f"Saved bin logits to: {bin_filename}")
print(f"Saved txt logist to: {txt_filename}")

View File

@@ -0,0 +1,18 @@
#!/bin/bash
set -e
# First try command line argument, then environment variable, then file
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
echo "Error: Model path must be provided either as:" >&2
echo " 1. Command line argument" >&2
echo " 2. CONVERTED_MODEL environment variable" >&2
exit 1
fi
cmake --build ../../build --target llama-logits -j8
../../build/bin/llama-logits -m $CONVERTED_MODEL -embd-mode "Hello world today"

View File

@@ -0,0 +1,20 @@
#!/bin/bash
set -e
# First try command line argument, then environment variable, then file
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
echo "Error: Model path must be provided either as:" >&2
echo " 1. Command line argument" >&2
echo " 2. CONVERTED_MODEL environment variable" >&2
exit 1
fi
echo $CONVERTED_MODEL
cmake --build ../../build --target llama-logits -j8
../../build/bin/llama-logits -m "$CONVERTED_MODEL" "Hello, my name is"

View File

@@ -0,0 +1,100 @@
#!/usr/bin/env python3
import argparse
import os
import importlib
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import torch
import numpy as np
unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')
parser = argparse.ArgumentParser(description='Process model with specified path')
parser.add_argument('--model-path', '-m', help='Path to the model')
args = parser.parse_args()
model_path = os.environ.get('MODEL_PATH', args.model_path)
if model_path is None:
parser.error("Model path must be specified either via --model-path argument or MODEL_PATH environment variable")
config = AutoConfig.from_pretrained(model_path)
print("Model type: ", config.model_type)
print("Vocab size: ", config.vocab_size)
print("Hidden size: ", config.hidden_size)
print("Number of layers: ", config.num_hidden_layers)
print("BOS token id: ", config.bos_token_id)
print("EOS token id: ", config.eos_token_id)
print("Loading model and tokenizer using AutoTokenizer:", model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path)
if unreleased_model_name:
model_name_lower = unreleased_model_name.lower()
unreleased_module_path = f"transformers.models.{model_name_lower}.modular_{model_name_lower}"
class_name = f"{unreleased_model_name}ForCausalLM"
print(f"Importing unreleased model module: {unreleased_module_path}")
try:
model_class = getattr(importlib.import_module(unreleased_module_path), class_name)
model = model_class.from_pretrained(model_path) # Note: from_pretrained, not fromPretrained
except (ImportError, AttributeError) as e:
print(f"Failed to import or load model: {e}")
exit(1)
else:
model = AutoModelForCausalLM.from_pretrained(model_path)
model_name = os.path.basename(model_path)
# Printing the Model class to allow for easier debugging. This can be useful
# when working with models that have not been publicly released yet and this
# migth require that the concrete class is imported and used directly instead
# of using AutoModelForCausalLM.
print(f"Model class: {model.__class__.__name__}")
prompt = "Hello, my name is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(f"Input tokens: {input_ids}")
print(f"Input text: {repr(prompt)}")
print(f"Tokenized: {tokenizer.convert_ids_to_tokens(input_ids[0])}")
with torch.no_grad():
outputs = model(input_ids)
logits = outputs.logits
# Extract logits for the last token (next token prediction)
last_logits = logits[0, -1, :].cpu().numpy()
print(f"Logits shape: {logits.shape}")
print(f"Last token logits shape: {last_logits.shape}")
print(f"Vocab size: {len(last_logits)}")
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
bin_filename = data_dir / f"pytorch-{model_name}.bin"
txt_filename = data_dir / f"pytorch-{model_name}.txt"
# Save to file for comparison
last_logits.astype(np.float32).tofile(bin_filename)
# Also save as text file for easy inspection
with open(txt_filename, "w") as f:
for i, logit in enumerate(last_logits):
f.write(f"{i}: {logit:.6f}\n")
# Print some sample logits for quick verification
print(f"First 10 logits: {last_logits[:10]}")
print(f"Last 10 logits: {last_logits[-10:]}")
# Show top 5 predicted tokens
top_indices = np.argsort(last_logits)[-5:][::-1]
print("Top 5 predictions:")
for idx in top_indices:
token = tokenizer.decode([idx])
print(f" Token {idx} ({repr(token)}): {last_logits[idx]:.6f}")
print(f"Saved bin logits to: {bin_filename}")
print(f"Saved txt logist to: {txt_filename}")

View File

@@ -0,0 +1,42 @@
#/bin/bash
set -e
MODEL_PATH="${1:-"$EMBEDDING_MODEL_PATH"}"
MODEL_NAME="${2:-$(basename "$MODEL_PATH")}"
if [ -t 0 ]; then
CPP_EMBEDDINGS="data/llamacpp-${MODEL_NAME}-embeddings.bin"
else
# Process piped JSON data and convert to binary (matching logits.cpp format)
TEMP_FILE=$(mktemp /tmp/tmp.XXXXXX.binn)
python3 -c "
import json
import sys
import struct
data = json.load(sys.stdin)
# Flatten all embeddings completely
flattened = []
for item in data:
embedding = item['embedding']
for token_embedding in embedding:
flattened.extend(token_embedding)
print(f'Total embedding values: {len(flattened)}', file=sys.stderr)
# Write as binary floats - matches logitc.cpp fwrite format
with open('$TEMP_FILE', 'wb') as f:
for value in flattened:
f.write(struct.pack('f', value))
"
CPP_EMBEDDINGS="$TEMP_FILE"
trap "rm -f $TEMP_FILE" EXIT
fi
python scripts/utils/semantic_check.py --model-path $MODEL_PATH \
--python-embeddings data/pytorch-${MODEL_NAME}-embeddings.bin \
--cpp-embeddings $CPP_EMBEDDINGS \
--prompt "Hello world today"

View File

@@ -0,0 +1,22 @@
#!/bin/bash
set -e
MODEL_NAME="${MODEL_NAME:-$(basename "$EMBEDDING_MODEL_PATH")}"
OUTPUT_DIR="${OUTPUT_DIR:-../../models}"
TYPE="${OUTTYPE:-f16}"
METADATA_OVERRIDE="${METADATA_OVERRIDE:-}"
CONVERTED_MODEL="${OUTPUT_DIR}/${MODEL_NAME}.gguf"
echo "Model path: ${EMBEDDING_MODEL_PATH}"
echo "Model name: ${MODEL_NAME}"
echo "Data type: ${TYPE}"
echo "Converted model path:: ${CONVERTED_MODEL}"
python ../../convert_hf_to_gguf.py --verbose \
${EMBEDDING_MODEL_PATH} \
--outfile ${CONVERTED_MODEL} \
--outtype ${TYPE}
echo ""
echo "The environment variable CONVERTED_EMBEDDING MODEL can be set to this path using:"
echo "export CONVERTED_EMBEDDING_MODEL=$(realpath ${CONVERTED_MODEL})"

View File

@@ -0,0 +1,20 @@
#!/bin/bash
set -e
# First try command line argument, then environment variable, then file
CONVERTED_MODEL="${1:-"$CONVERTED_EMBEDDING_MODEL"}"
# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
echo "Error: Model path must be provided either as:" >&2
echo " 1. Command line argument" >&2
echo " 2. CONVERTED_EMBEDDING_MODEL environment variable" >&2
exit 1
fi
echo $CONVERTED_MODEL
cmake --build ../../build --target llama-logits -j8
../../build/bin/llama-logits -m "$CONVERTED_MODEL" -embd-mode "Hello world today"

View File

@@ -0,0 +1,116 @@
#!/usr/bin/env python3
import argparse
import os
import numpy as np
import importlib
from pathlib import Path
from transformers import AutoTokenizer, AutoConfig, AutoModel
import torch
unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')
parser = argparse.ArgumentParser(description='Process model with specified path')
parser.add_argument('--model-path', '-m', help='Path to the model')
args = parser.parse_args()
model_path = os.environ.get('EMBEDDING_MODEL_PATH', args.model_path)
if model_path is None:
parser.error("Model path must be specified either via --model-path argument or EMBEDDING_MODEL_PATH environment variable")
tokenizer = AutoTokenizer.from_pretrained(model_path)
if unreleased_model_name:
model_name_lower = unreleased_model_name.lower()
unreleased_module_path = f"transformers.models.{model_name_lower}.modular_{model_name_lower}"
class_name = f"{unreleased_model_name}Model"
print(f"Importing unreleased model module: {unreleased_module_path}")
try:
model_class = getattr(importlib.import_module(unreleased_module_path), class_name)
model = model_class.from_pretrained(model_path) # Note: from_pretrained, not fromPretrained
except (ImportError, AttributeError) as e:
print(f"Failed to import or load model: {e}")
exit(1)
else:
model = AutoModel.from_pretrained(model_path)
print(f"Model class: {type(model)}")
#print(f"Model file: {type(model).__module__}")
config = AutoConfig.from_pretrained(model_path)
model_name = os.path.basename(model_path)
texts = [ "Hello world today" ]
encoded = tokenizer(
texts,
padding=True,
truncation=True,
return_tensors="pt"
)
tokens = encoded['input_ids'][0]
token_strings = tokenizer.convert_ids_to_tokens(tokens)
for i, (token_id, token_str) in enumerate(zip(tokens, token_strings)):
print(f"{token_id:6d} -> '{token_str}'")
with torch.no_grad():
outputs = model(**encoded)
hidden_states = outputs.last_hidden_state # Shape: [batch_size, seq_len, hidden_size]
# Extract embeddings for each token (matching LLAMA_POOLING_TYPE_NONE behavior)
all_embeddings = hidden_states[0].cpu().numpy() # Shape: [seq_len, hidden_size]
print(f"Hidden states shape: {hidden_states.shape}")
print(f"All embeddings shape: {all_embeddings.shape}")
print(f"Embedding dimension: {all_embeddings.shape[1]}")
# Print embeddings exactly like embedding.cpp does for LLAMA_POOLING_TYPE_NONE
n_embd = all_embeddings.shape[1]
n_embd_count = all_embeddings.shape[0]
print() # Empty line to match C++ output
for j in range(n_embd_count):
embedding = all_embeddings[j]
print(f"embedding {j}: ", end="")
# Print first 3 values
for i in range(min(3, n_embd)):
print(f"{embedding[i]:9.6f} ", end="")
print(" ... ", end="")
# Print last 3 values
for i in range(n_embd - 3, n_embd):
print(f"{embedding[i]:9.6f} ", end="")
print() # New line
print() # Final empty line to match C++ output
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
bin_filename = data_dir / f"pytorch-{model_name}-embeddings.bin"
txt_filename = data_dir / f"pytorch-{model_name}-embeddings.txt"
# Save all embeddings flattened (matching what embedding.cpp would save if it did)
flattened_embeddings = all_embeddings.flatten()
flattened_embeddings.astype(np.float32).tofile(bin_filename)
with open(txt_filename, "w") as f:
f.write(f"# Model class: {model_name}\n")
f.write(f"# Tokens: {token_strings}\n")
f.write(f"# Shape: {all_embeddings.shape}\n")
f.write(f"# n_embd_count: {n_embd_count}, n_embd: {n_embd}\n\n")
for j in range(n_embd_count):
f.write(f"# Token {j} ({token_strings[j]}):\n")
for i, value in enumerate(all_embeddings[j]):
f.write(f"{j}_{i}: {value:.6f}\n")
f.write("\n")
print(f"Total values: {len(flattened_embeddings)} ({n_embd_count} tokens × {n_embd} dimensions)")
print("")
print(f"Saved bin embeddings to: {bin_filename}")
print(f"Saved txt embeddings to: {txt_filename}")

View File

@@ -0,0 +1,13 @@
---
base_model:
- {base_model}
---
# {model_name} GGUF
Recommended way to run this model:
```sh
llama-server -hf {namespace}/{model_name}-GGUF -c 0 -fa
```
Then, access http://localhost:8080

View File

@@ -0,0 +1,174 @@
#!/usr/bin/env python3
import numpy as np
import sys
import os
import argparse
from pathlib import Path
def calculate_nmse(reference, test):
mse = np.mean((test - reference) ** 2)
ref_var = np.var(reference)
if ref_var == 0:
nmse = float('inf') if mse > 0 else 0.0
return mse, mse, ref_var
nmse = mse / ref_var
return nmse, mse, ref_var
def load_logits(file_path):
if not os.path.exists(file_path):
raise FileNotFoundError(f"File not found: {file_path}")
if file_path.suffix == '.npy':
return np.load(file_path)
elif file_path.suffix == '.bin':
return np.fromfile(file_path, dtype=np.float32)
else:
# Try to load as text file
try:
# If it has index format "0: value", extract just values
data = []
with open(file_path, 'r') as f:
for line in f:
if ':' in line:
# Format: "index: value"
value = float(line.split(':')[1].strip())
else:
# Just the value
value = float(line.strip())
data.append(value)
return np.array(data, dtype=np.float32)
except:
return np.loadtxt(file_path, dtype=np.float32)
def interpret_nmse(nmse):
"""Provide interpretation of NMSE value"""
if nmse == 0:
return "Perfect match", "🎉"
elif nmse < 1e-6:
return "Essentially identical", ""
elif nmse < 1e-4:
return "Excellent match", ""
elif nmse < 1e-3:
return "Very good match", "👍"
elif nmse < 1e-2:
return "Good match", "👍"
elif nmse < 0.1:
return "Acceptable match", "⚠️"
elif nmse < 1.0:
return "Poor match", ""
else:
return "Very poor match (worse than noise)", ""
def main():
parser = argparse.ArgumentParser(description='Validate model logits')
parser.add_argument('-m', '--model-path', required=True, help='Path to the model directory')
args = parser.parse_args()
model_name = os.path.splitext(os.path.basename(args.model_path))[0]
data_dir = Path("data")
pytorch_file = data_dir / f"pytorch-{model_name}.bin"
llamacpp_file = data_dir / f"llamacpp-{model_name}.bin"
print(f"Model name: {model_name}")
print(f"PyTorch logits file: {pytorch_file}")
print(f"llama.cpp logits file: {llamacpp_file}")
reference_file = pytorch_file
test_file = llamacpp_file
print("📊 NMSE Check for Model Comparison")
print("=" * 50)
print(f"Reference (ground truth): {reference_file}")
print(f"Test (to evaluate): {test_file}")
print()
try:
print("Loading reference logits...")
reference = load_logits(reference_file)
print(f" Shape: {reference.shape}, Type: {reference.dtype}")
print("Loading test logits...")
test = load_logits(test_file)
print(f" Shape: {test.shape}, Type: {test.dtype}")
# Check shapes match
if reference.shape != test.shape:
print(f"\n❌ Error: Shape mismatch!")
print(f" Reference: {reference.shape}")
print(f" Test: {test.shape}")
sys.exit(1)
print(f"\n✅ Shapes match: {reference.shape}")
nmse, mse, ref_var = calculate_nmse(reference, test)
# Additional metrics
max_abs_error = np.max(np.abs(test - reference))
mean_abs_error = np.mean(np.abs(test - reference))
# Results
print(f"\n📈 METRICS")
print("=" * 30)
print(f"MSE (Mean Squared Error): {mse:.6e}")
print(f"Reference Variance: {ref_var:.6e}")
print(f"NMSE: {nmse:.6e}")
print(f"Max Absolute Error: {max_abs_error:.6f}")
print(f"Mean Absolute Error: {mean_abs_error:.6f}")
# NMSE in dB (common in signal processing)
if nmse > 0:
nmse_db = 10 * np.log10(nmse)
print(f"NMSE (dB): {nmse_db:.2f} dB")
# Interpretation
interpretation, emoji = interpret_nmse(nmse)
print(f"\n🎯 INTERPRETATION")
print("=" * 30)
print(f"{emoji} {interpretation}")
# Detailed guidance
print(f"\n📋 GUIDANCE")
print("=" * 30)
if nmse < 1e-3:
print("✅ EXCELLENT: Your GGML conversion is working very well!")
print(" The differences are negligible for practical use.")
elif nmse < 1e-2:
print("👍 GOOD: Your GGML conversion is working well.")
print(" Small differences are likely due to precision/quantization.")
elif nmse < 0.1:
print("⚠️ ACCEPTABLE: Conversion is working but with some differences.")
print(" Check if you're using quantization (Q4, Q8, etc.)")
print(" Test generation quality to see if it's acceptable.")
else:
print("❌ PROBLEMATIC: Large differences detected.")
print(" Check your conversion process for potential issues.")
print(" Verify you're using the same model weights.")
# NMSE benchmarks
print(f"\n📚 NMSE BENCHMARKS")
print("=" * 30)
print("< 1e-6: Essentially identical")
print("< 1e-4: Excellent (typical for good conversions)")
print("< 1e-3: Very good")
print("< 1e-2: Good (acceptable for most use cases)")
print("< 0.1: Acceptable (may need verification)")
print("> 1.0: Poor (worse than random)")
# Exit code based on NMSE
if nmse < 1e-2:
print(f"\n✅ RESULT: PASS (NMSE = {nmse:.2e})")
sys.exit(0)
else:
print(f"\n❌ RESULT: NEEDS REVIEW (NMSE = {nmse:.2e})")
sys.exit(1)
except Exception as e:
print(f"❌ Error: {e}")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,6 @@
COLLECTION_SLUG=$(python ./create_collection.py --return-slug)
echo "Created collection: $COLLECTION_SLUG"
# Use it in the next command
python add_model_to_collection.py "$COLLECTION_SLUG" "username/my-model"

View File

@@ -0,0 +1,80 @@
#!/usr/bin/env python3
from huggingface_hub import HfApi
import argparse
import sys
def add_model_to_collection(collection_slug, model_id, note=""):
"""
Add a model to an existing collection
Args:
collection_slug: The slug of the collection (e.g., "username/collection-name-12345")
model_id: The model repository ID (e.g., "username/model-name")
note: Optional note about the model
Returns:
True if successful, False if failed
"""
# Initialize API
api = HfApi()
try:
user_info = api.whoami()
print(f"✅ Authenticated as: {user_info['name']}")
# Verify the model exists
print(f"🔍 Checking if model exists: {model_id}")
try:
model_info = api.model_info(model_id)
except Exception as e:
print(f"❌ Model not found or not accessible: {model_id}")
print(f"Error: {e}")
return False
print(f"📚 Adding model to collection...")
api.add_collection_item(
collection_slug=collection_slug,
item_id=model_id,
item_type="model",
note=note
)
print(f"✅ Model added to collection successfully!")
print(f"🔗 Collection URL: https://huggingface.co/collections/{collection_slug}")
return True
except Exception as e:
print(f"❌ Error adding model to collection: {e}")
return False
def main():
# This script requires that the environment variable HF_TOKEN is set with your
# Hugging Face API token.
api = HfApi()
parser = argparse.ArgumentParser(description='Add model to a Huggingface Collection')
parser.add_argument('--collection', '-c', help='The collection slug username/collection-hash', required=True)
parser.add_argument('--model', '-m', help='The model to add to the Collection', required=True)
parser.add_argument('--note', '-n', help='An optional note/description', required=False)
args = parser.parse_args()
collection = args.collection
model = args.model
note = args.note
success = add_model_to_collection(
collection_slug=collection,
model_id=model,
note=note
)
if success:
print("\n🎉 Model added successfully!")
else:
print("\n❌ Failed to add model to collection")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,106 @@
#!/usr/bin/env python3
from huggingface_hub import HfApi
import argparse
import os
import sys
def create_collection(title, description, private=False, namespace=None, return_slug=False):
"""
Create a new collection on Hugging Face
Args:
title: Collection title
description: Collection description
private: Whether the collection should be private (default: False)
namespace: Optional namespace (defaults to your username)
Returns:
Collection object if successful, None if failed
"""
# Check if HF_TOKEN is available
token = os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_HUB_TOKEN")
if not token:
print("❌ No HF_TOKEN or HUGGINGFACE_HUB_TOKEN found in environment variables")
print("Please set your Hugging Face token as an environment variable")
return None
# Initialize API
api = HfApi()
try:
# Test authentication first
user_info = api.whoami()
if not return_slug:
print(f"✅ Authenticated as: {user_info['name']}")
# Create the collection
if not return_slug:
print(f"📚 Creating collection: '{title}'...")
collection = api.create_collection(
title=title,
description=description,
private=private,
namespace=namespace
)
if not return_slug:
print(f"✅ Collection created successfully!")
print(f"📋 Collection slug: {collection.slug}")
print(f"🔗 Collection URL: https://huggingface.co/collections/{collection.slug}")
return collection
except Exception as e:
print(f"❌ Error creating collection: {e}")
return None
def main():
# This script requires that the environment variable HF_TOKEN is set with your
# Hugging Face API token.
api = HfApi()
parser = argparse.ArgumentParser(description='Create a Huggingface Collection')
parser.add_argument('--name', '-n', help='The name/title of the Collection', required=True)
parser.add_argument('--description', '-d', help='The description for the Collection', required=True)
parser.add_argument('--namespace', '-ns', help='The namespace to add the Collection to', required=True)
parser.add_argument('--private', '-p', help='Create a private Collection', action='store_true') # Fixed
parser.add_argument('--return-slug', '-s', help='Only output the collection slug', action='store_true') # Fixed
args = parser.parse_args()
name = args.name
description = args.description
private = args.private
namespace = args.namespace
return_slug = args.return_slug
if not return_slug:
print("🚀 Creating Hugging Face Collection")
print(f"Title: {name}")
print(f"Description: {description}")
print(f"Namespace: {namespace}")
print(f"Private: {private}")
collection = create_collection(
title=name,
description=description,
private=private,
namespace=namespace,
return_slug=return_slug
)
if collection:
if return_slug:
print(collection.slug)
else:
print("\n🎉 Collection created successfully!")
print(f"Use this slug to add models: {collection.slug}")
else:
print("\n❌ Failed to create collection")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,63 @@
#!/usr/bin/env python3
from huggingface_hub import HfApi
import argparse
# This script requires that the environment variable HF_TOKEN is set with your
# Hugging Face API token.
api = HfApi()
def load_template_and_substitute(template_path, **kwargs):
try:
with open(template_path, 'r', encoding='utf-8') as f:
template_content = f.read()
return template_content.format(**kwargs)
except FileNotFoundError:
print(f"Template file '{template_path}' not found!")
return None
except KeyError as e:
print(f"Missing template variable: {e}")
return None
parser = argparse.ArgumentParser(description='Create a new Hugging Face model repository')
parser.add_argument('--model-name', '-m', help='Name for the model', required=True)
parser.add_argument('--namespace', '-ns', help='Namespace to add the model to', required=True)
parser.add_argument('--org-base-model', '-b', help='Original Base model name', default="")
parser.add_argument('--no-card', action='store_true', help='Skip creating model card')
parser.add_argument('--private', '-p', action='store_true', help='Create private model')
args = parser.parse_args()
repo_id = f"{args.namespace}/{args.model_name}-GGUF"
print("Repository ID: ", repo_id)
repo_url = api.create_repo(
repo_id=repo_id,
repo_type="model",
private=args.private,
exist_ok=False
)
if not args.no_card:
template_path = "scripts/readme.md.template"
model_card_content = load_template_and_substitute(
template_path,
model_name=args.model_name,
namespace=args.namespace,
base_model=args.org_base_model,
)
if model_card_content:
api.upload_file(
path_or_fileobj=model_card_content.encode('utf-8'),
path_in_repo="README.md",
repo_id=repo_id
)
print("Model card created successfully.")
else:
print("Failed to create model card.")
print(f"Repository created: {repo_url}")

View File

@@ -0,0 +1,58 @@
#!/usr/bin/env python3
from huggingface_hub import HfApi
import argparse
import os
def upload_gguf_file(local_file_path, repo_id, filename_in_repo=None):
"""
Upload a GGUF file to a Hugging Face model repository
Args:
local_file_path: Path to your local GGUF file
repo_id: Your repository ID (e.g., "username/model-name")
filename_in_repo: Optional custom name for the file in the repo
"""
if not os.path.exists(local_file_path):
print(f"❌ File not found: {local_file_path}")
return False
if filename_in_repo is None:
filename_in_repo = os.path.basename(local_file_path)
if filename_in_repo is None or filename_in_repo == "":
filename_in_repo = os.path.basename(local_file_path)
print(f"📤 Uploading {local_file_path} to {repo_id}/{filename_in_repo}")
api = HfApi()
try:
api.upload_file(
path_or_fileobj=local_file_path,
path_in_repo=filename_in_repo,
repo_id=repo_id,
repo_type="model",
commit_message=f"Upload {filename_in_repo}"
)
print("✅ Upload successful!")
print(f"🔗 File available at: https://huggingface.co/{repo_id}/blob/main/{filename_in_repo}")
return True
except Exception as e:
print(f"❌ Upload failed: {e}")
return False
# This script requires that the environment variable HF_TOKEN is set with your
# Hugging Face API token.
api = HfApi()
parser = argparse.ArgumentParser(description='Upload a GGUF model to a Huggingface model repository')
parser.add_argument('--gguf-model-path', '-m', help='The GGUF model file to upload', required=True)
parser.add_argument('--repo-id', '-r', help='The repository to upload to', required=True)
parser.add_argument('--name', '-o', help='The name in the model repository', required=False)
args = parser.parse_args()
upload_gguf_file(args.gguf_model_path, args.repo_id, args.name)

View File

@@ -0,0 +1,14 @@
#!/bin/bash
# First try command line argument, then environment variable, then file
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
echo "Error: Model path must be provided either as:" >&2
echo " 1. Command line argument" >&2
echo " 2. CONVERTED_MODEL environment variable" >&2
exit 1
fi
../../gguf-py/gguf/scripts/gguf_dump.py $CONVERTED_MODEL

View File

@@ -0,0 +1,67 @@
#!/usr/bin/env python3
import argparse
import os
import json
from safetensors import safe_open
from collections import defaultdict
parser = argparse.ArgumentParser(description='Process model with specified path')
parser.add_argument('--model-path', '-m', help='Path to the model')
args = parser.parse_args()
model_path = os.environ.get('MODEL_PATH', args.model_path)
if model_path is None:
parser.error("Model path must be specified either via --model-path argument or MODEL_PATH environment variable")
# Check if there's an index file (multi-file model)
index_path = os.path.join(model_path, "model.safetensors.index.json")
single_file_path = os.path.join(model_path, "model.safetensors")
if os.path.exists(index_path):
# Multi-file model
print("Multi-file model detected")
with open(index_path, 'r') as f:
index_data = json.load(f)
# Get the weight map (tensor_name -> file_name)
weight_map = index_data.get("weight_map", {})
# Group tensors by file for efficient processing
file_tensors = defaultdict(list)
for tensor_name, file_name in weight_map.items():
file_tensors[file_name].append(tensor_name)
print("Tensors in model:")
# Process each shard file
for file_name, tensor_names in file_tensors.items():
file_path = os.path.join(model_path, file_name)
print(f"\n--- From {file_name} ---")
with safe_open(file_path, framework="pt") as f:
for tensor_name in sorted(tensor_names):
tensor = f.get_tensor(tensor_name)
print(f"- {tensor_name} : shape = {tensor.shape}, dtype = {tensor.dtype}")
elif os.path.exists(single_file_path):
# Single file model (original behavior)
print("Single-file model detected")
with safe_open(single_file_path, framework="pt") as f:
keys = f.keys()
print("Tensors in model:")
for key in sorted(keys):
tensor = f.get_tensor(key)
print(f"- {key} : shape = {tensor.shape}, dtype = {tensor.dtype}")
else:
print(f"Error: Neither 'model.safetensors.index.json' nor 'model.safetensors' found in {model_path}")
print("Available files:")
if os.path.exists(model_path):
for item in sorted(os.listdir(model_path)):
print(f" {item}")
else:
print(f" Directory {model_path} does not exist")
exit(1)

View File

@@ -0,0 +1,35 @@
#!/bin/bash
set -e
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
echo "Error: Model path must be provided either as:" >&2
echo " 1. Command line argument" >&2
echo " 2. CONVERTED_MODEL environment variable" >&2
exit 1
fi
# Check if data/wikitext-2-raw directory exists
if [ ! -d "ppl/wikitext-2-raw" ]; then
echo "ppl/wikitext-2-raw directory does not exist. Downloading..." >&2
mkdir -p ppl
pushd ppl
./../../../scripts/get-wikitext-2.sh
popd
fi
mkdir -p ppl
OUTPUTFILE="ppl/$(basename $CONVERTED_MODEL).kld"
echo "Model: $CONVERTED_MODEL"
cmake --build ../../build --target llama-perplexity -j8
../.././build/bin/llama-perplexity -m $CONVERTED_MODEL \
-f ppl/wikitext-2-raw/wiki.test.raw \
--kl-divergence-base $OUTPUTFILE
echo "Generated logits in $OUTPUTFILE"

View File

@@ -0,0 +1,27 @@
#!/bin/bash
set -e
QUANTIZED_MODEL="${1:-"$QUANTIZED_MODEL"}"
if [ -z "$QUANTIZED_MODEL" ]; then
echo "Error: Model path must be provided either as:" >&2
echo " 1. Command line argument" >&2
echo " 2. QUANTIZED_MODEL environment variable" >&2
exit 1
fi
# Check if data/wikitext-2-raw directory exists
if [ ! -d "ppl/wikitext-2-raw" ]; then
echo "ppl/wikitext-2-raw directory does not exist. Downloading..." >&2
mkdir -p ppl
pushd ppl
./../../../scripts/get-wikitext-2.sh
popd
fi
cmake --build ../../build --target llama-perplexity -j8
../.././build/bin/llama-perplexity -m $QUANTIZED_MODEL -f ppl/wikitext-2-raw/wiki.test.raw

View File

@@ -0,0 +1,28 @@
#!/bin/bash
set -e
QUANTIZED_MODEL="${1:-"$QUANTIZED_MODEL"}"
LOGITS_FILE="${1:-"$LOGITS_FILE"}"
if [ -z "$QUANTIZED_MODEL" ]; then
echo "Error: Model path must be provided either as:" >&2
echo " 1. Command line argument" >&2
echo " 2. QUANTIZED_MODEL environment variable" >&2
exit 1
fi
if [ ! -f ${LOGITS_FILE} ]; then
echo "Error: logits file '${LOGITS_FILE} was not found"
echo "Did you run the perplexity-gen.sh script?"
exit 1
fi
echo "Model: $QUANTIZED_MODEL"
echo "Data file: $LOGITS_FILE"
cmake --build ../../build --target llama-perplexity -j8
../.././build/bin/llama-perplexity -m $QUANTIZED_MODEL \
--kl-divergence-base $LOGITS_FILE \
--kl-divergence

View File

@@ -0,0 +1,34 @@
#!/bin/bash
set -e
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
QUANTIZED_TYPE="${2:-"$QUANTIZED_TYPE"}"
QUANTIZED_MODEL=$CONVERTED_MODEL
# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
echo "Error: Model path must be provided either as:" >&2
echo " 1. Command line argument" >&2
echo " 2. CONVERTED_MODEL environment variable" >&2
exit 1
fi
echo $CONVERTED_MODEL
# Process the quantized model filename
if [[ "$QUANTIZED_MODEL" == *.gguf ]]; then
# Remove .gguf suffix, add quantized type, then add .gguf back
BASE_NAME="${QUANTIZED_MODEL%.gguf}"
QUANTIZED_MODEL="${BASE_NAME}-${QUANTIZED_TYPE}.gguf"
else
echo "Error: QUANTIZED_MODEL must end with .gguf extension" >&2
exit 1
fi
cmake --build ../../build --target llama-quantize -j8
../../build/bin/llama-quantize $CONVERTED_MODEL $QUANTIZED_MODEL $QUANTIZED_TYPE
echo "Quantized model saved to: $QUANTIZED_MODEL"

View File

@@ -0,0 +1,22 @@
#!/bin/bash
set -e
#
# First try command line argument, then environment variable, then file
CONVERTED_MODEL="${1:-"$CONVERTED_MODEL"}"
# Final check if we have a model path
if [ -z "$CONVERTED_MODEL" ]; then
echo "Error: Model path must be provided either as:" >&2
echo " 1. Command line argument" >&2
echo " 2. CONVERTED_MODEL environment variable" >&2
exit 1
fi
echo $CONVERTED_MODEL
cmake --build ../../build --target llama-server
../../build/bin/llama-server -m $CONVERTED_MODEL \
--embedding \
--pooling none

View File

@@ -0,0 +1,179 @@
#!/usr/bin/env python3
import numpy as np
import argparse
import os
import importlib
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, AutoModel
unreleased_model_name = os.getenv('UNRELEASED_MODEL_NAME')
def cosine_similarity(a, b=None):
a = np.asarray(a)
if b is None:
b = a
else:
b = np.asarray(b)
if a.ndim == 1:
a = a.reshape(1, -1)
if b.ndim == 1:
b = b.reshape(1, -1)
a_norms = np.linalg.norm(a, axis=1, keepdims=True)
b_norms = np.linalg.norm(b, axis=1, keepdims=True)
a_norms = np.where(a_norms == 0, 1e-8, a_norms)
b_norms = np.where(b_norms == 0, 1e-8, b_norms)
a_normalized = a / a_norms
b_normalized = b / b_norms
# Compute cosine similarity
return np.dot(a_normalized, b_normalized.T)
def load_embeddings_from_file(filename, n_tokens, n_embd):
embeddings = np.fromfile(filename, dtype=np.float32)
return embeddings.reshape(n_tokens, n_embd)
def test_single_prompt_similarity(python_emb, cpp_emb, tokens, prompt):
np.set_printoptions(suppress=True, precision=6)
print("pytorch embeddings:");
print(python_emb)
print("llama.cpp embeddings:");
print(cpp_emb)
print(f"\n=== Prompt: '{prompt}' ===")
print(f"Tokens: {tokens}")
print(f"Embeddings shape: Python {python_emb.shape}, llama.cpp {cpp_emb.shape}")
n_tokens = len(tokens)
# 1. Direct embedding comparison
print(f"\n1. Raw Embedding Magnitude Comparison:")
# Check if the distance of each token embedding from the origin and compare
# if the vectors are on the same "sphere". This does not tell us about
# direction (meaning of the token embedding), just magnitude.
for i in range(n_tokens):
py_mag = np.linalg.norm(python_emb[i]) # calculate standard euclidean norm for Python embeddings
cpp_mag = np.linalg.norm(cpp_emb[i]) # calculate standard euclidean norm for llama.cpp embeddings
ratio = py_mag / cpp_mag if cpp_mag > 0 else float('inf')
print(f" Token {i} ({tokens[i]}): Python={py_mag:.3f}, llama.cpp={cpp_mag:.3f}, ratio={ratio:.3f}")
# 2. Cosine similarity between tokens within each model
# Here we check the direction of token embeddings to see if the have the
# same meaning (similarity). This is done by calculating cosine similarity
# of a pair of token embeddings within each model.
print(f"\n2. Within-Model Token Similarities:")
print(" Python model:")
for i in range(n_tokens):
for j in range(i+1, n_tokens):
sim = cosine_similarity([python_emb[i]], [python_emb[j]])[0][0]
print(f" {tokens[i]}{tokens[j]}: {sim:.4f}")
print(" llama.cpp model:")
for i in range(n_tokens):
for j in range(i+1, n_tokens):
sim = cosine_similarity([cpp_emb[i]], [cpp_emb[j]])[0][0]
print(f" {tokens[i]}{tokens[j]}: {sim:.4f}")
# 3. Cross-model similarity (same token position)
print(f"\n3. Cross-Model Same-Token Similarities:")
for i in range(n_tokens):
sim = cosine_similarity([python_emb[i]], [cpp_emb[i]])[0][0]
print(f" Token {i} ({tokens[i]}): {sim:.4f}")
# 4. Similarity matrix comparison
print(f"\n4. Similarity Matrix Differences:")
py_sim_matrix = cosine_similarity(python_emb)
cpp_sim_matrix = cosine_similarity(cpp_emb)
diff_matrix = np.abs(py_sim_matrix - cpp_sim_matrix)
print(f" Max difference: {np.max(diff_matrix):.4f}")
print(f" Mean difference: {np.mean(diff_matrix):.4f}")
print(f" RMS difference: {np.sqrt(np.mean(diff_matrix**2)):.4f}")
return {
'cross_model_similarities': [cosine_similarity([python_emb[i]], [cpp_emb[i]])[0][0] for i in range(n_tokens)],
'similarity_matrix_diff': diff_matrix,
'max_diff': np.max(diff_matrix),
'mean_diff': np.mean(diff_matrix),
'rms_diff': np.sqrt(np.mean(diff_matrix**2))
}
def main():
parser = argparse.ArgumentParser(description='Test semantic similarity between Python and llama.cpp embeddings')
parser.add_argument('--model-path', '-m', required=True, help='Path to the original Python model')
parser.add_argument('--python-embeddings', '-pe', help='Path to pytorch embeddings "logits" binary file')
parser.add_argument('--cpp-embeddings', '-ce', help='Path to llama.cpp embeddings "logits" binary file')
parser.add_argument('--causal', '-c', default=False, help='if the model is causal (default: false)', action='store_true')
parser.add_argument('--prompt', '-p', default='Hello world today', help='Test prompt')
args = parser.parse_args()
print("Semantic Similarity Test Between Python and llama.cpp Embedding Models")
print("=" * 70)
# Single prompt detailed comparison
print(f"\nTesting with prompt: '{args.prompt}'")
# Load the python model to get configuration information and also to load the tokenizer.
print("Loading model and tokenizer using AutoTokenizer:", args.model_path)
tokenizer = AutoTokenizer.from_pretrained(args.model_path)
config = AutoConfig.from_pretrained(args.model_path)
if unreleased_model_name:
model_name_lower = unreleased_model_name.lower()
unreleased_module_path = f"transformers.models.{model_name_lower}.modular_{model_name_lower}"
if args.causal:
class_name = f"{unreleased_model_name}ForCausalLM"
else:
class_name = f"{unreleased_model_name}Model"
print(f"Model class: {class_name}")
print(f"Importing unreleased model module: {unreleased_module_path}")
try:
model_class = getattr(importlib.import_module(unreleased_module_path), class_name)
model = model_class.from_pretrained(args.model_path)
except (ImportError, AttributeError) as e:
print(f"Failed to import or load model: {e}")
exit(1)
else:
if args.causal:
model = AutoModelForCausalLM.from_pretrained(args.model_path)
else:
model = AutoModel.from_pretrained(args.model_path)
encoded = tokenizer(args.prompt, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
n_tokens = len(tokens)
print(f"n_tokens: {n_tokens}");
print(f"hidden_size: {model.config.hidden_size}")
# Load binary embeddings from data directory.
llamacpp_embeddings = load_embeddings_from_file(args.cpp_embeddings, n_tokens, model.config.hidden_size)
python_embeddings = load_embeddings_from_file(args.python_embeddings, n_tokens, model.config.hidden_size)
# Run comparison
results = test_single_prompt_similarity(python_embeddings, llamacpp_embeddings, tokens, args.prompt)
# Summary
print(f"\n=== SUMMARY ===")
avg_cross_sim = np.mean(results['cross_model_similarities'])
print(f"Average cross-model similarity: {avg_cross_sim:.4f}")
print(f"Similarity matrix RMS difference: {results['rms_diff']:.4f}")
# Quality assessment
if avg_cross_sim > 0.95:
print("✅ EXCELLENT: Models are highly similar")
elif avg_cross_sim > 0.90:
print("✅ VERY GOOD: Models are very similar")
elif avg_cross_sim > 0.80:
print("⚠️ GOOD: Models are reasonably similar")
elif avg_cross_sim > 0.70:
print("⚠️ FAIR: Models have some differences")
else:
print("❌ POOR: Models are significantly different")
if __name__ == "__main__":
main()

View File

@@ -11,5 +11,5 @@ See the following PRs for more info:
### Usage
```bash
make -j && ./llama-passkey -m ./models/llama-7b-v2/ggml-model-f16.gguf --junk 250
llama-passkey -m ./models/llama-7b-v2/ggml-model-f16.gguf --junk 250
```

View File

@@ -15,7 +15,7 @@ https://github.com/ggml-org/llama.cpp/pull/6193
`retrieval` example can be tested as follows:
```bash
make -j && ./llama-retrieval --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator .
llama-retrieval --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator .
```
This chunks and embeds all given files and starts a loop requesting query inputs:

View File

@@ -18,8 +18,6 @@ if %errorlevel% neq 0 goto ERROR
:: for FP32
cmake -G "Ninja" .. -DLLAMA_CURL=OFF -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DBUILD_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=Release
if %errorlevel% neq 0 goto ERROR
:: build example/main only
:: make main
:: build all binary
cmake --build . -j

View File

@@ -158,7 +158,6 @@ option(GGML_CUDA "ggml: use CUDA"
option(GGML_MUSA "ggml: use MUSA" OFF)
option(GGML_CUDA_FORCE_MMQ "ggml: use mmq kernels instead of cuBLAS" OFF)
option(GGML_CUDA_FORCE_CUBLAS "ggml: always use cuBLAS instead of mmq kernels" OFF)
option(GGML_CUDA_F16 "ggml: use 16 bit floats for some calculations" OFF)
set (GGML_CUDA_PEER_MAX_BATCH_SIZE "128" CACHE STRING
"ggml: max. batch size for using peer access")
option(GGML_CUDA_NO_PEER_COPY "ggml: do not use peer to peer copies" OFF)

View File

@@ -244,6 +244,13 @@
#define GGML_MROPE_SECTIONS 4
#define GGML_UNUSED(x) (void)(x)
#ifdef __CUDACC__
template<typename... Args>
__host__ __device__ constexpr inline void ggml_unused_vars_impl(Args&&...) noexcept {}
#define GGML_UNUSED_VARS(...) ggml_unused_vars_impl(__VA_ARGS__)
#else
#define GGML_UNUSED_VARS(...) do { (void)sizeof((__VA_ARGS__, 0)); } while(0)
#endif // __CUDACC__
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))
@@ -505,6 +512,7 @@ extern "C" {
GGML_OP_IM2COL,
GGML_OP_IM2COL_BACK,
GGML_OP_CONV_2D,
GGML_OP_CONV_3D,
GGML_OP_CONV_2D_DW,
GGML_OP_CONV_TRANSPOSE_2D,
GGML_OP_POOL_1D,
@@ -1933,6 +1941,23 @@ extern "C" {
int d0, // dilation dimension 0
int d1); // dilation dimension 1
GGML_API struct ggml_tensor * ggml_conv_3d(
struct ggml_context * ctx,
struct ggml_tensor * a, // kernel [KW, KH, KD, IC * OC]
struct ggml_tensor * b, // input [W, H, D, C * N]
int s0, // stride
int s1,
int s2,
int p0, // padding
int p1,
int p2,
int d0, // dilation
int d1,
int d2,
int n_channels,
int n_batch,
int n_channels_out);
enum ggml_op_pool {
GGML_OP_POOL_MAX,
GGML_OP_POOL_AVG,

View File

@@ -19,9 +19,8 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>
#include <vector>
#include <algorithm>
#include <vector>
#ifdef __APPLE__
#include <sys/types.h>
@@ -1352,15 +1351,19 @@ static bool ggml_backend_sched_alloc_splits(ggml_backend_sched_t sched) {
static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t sched) {
struct ggml_backend_sched_split * splits = sched->splits;
for (int i = 0; i < sched->n_splits; i++) {
struct ggml_backend_sched_split * split = &splits[i];
ggml_tensor * prev_ids_tensor = nullptr;
std::vector<int32_t> ids;
std::vector<ggml_bitset_t> used_ids;
for (int split_id = 0; split_id < sched->n_splits; split_id++) {
struct ggml_backend_sched_split * split = &splits[split_id];
int split_backend_id = split->backend_id;
ggml_backend_t split_backend = sched->backends[split_backend_id];
// copy the input tensors to the split backend
for (int j = 0; j < split->n_inputs; j++) {
ggml_backend_t input_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[j]);
struct ggml_tensor * input = split->inputs[j];
for (int input_id = 0; input_id < split->n_inputs; input_id++) {
ggml_backend_t input_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[input_id]);
struct ggml_tensor * input = split->inputs[input_id];
struct ggml_tensor * input_cpy = tensor_copy(input, split_backend_id, sched->cur_copy);
if (input->flags & GGML_TENSOR_FLAG_INPUT) {
@@ -1378,16 +1381,104 @@ static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t s
} else {
ggml_backend_synchronize(split_backend);
}
// try async copy, but if not possible, we can still use a sync copy without synchronizing the dst backend, since we handle the synchronization here with multiple copies and events
// TODO: add public function to facilitate this, since applications do not have direct access to the backend interface
if (!split_backend->iface.cpy_tensor_async || !split_backend->iface.cpy_tensor_async(input_backend, split_backend, input, input_cpy)) {
// when offloading MoE weights, we can reduce the amount of data copied by copying only the experts that are used
ggml_tensor * node = split->graph.nodes[0];
if (split->graph.n_nodes > 0 &&
ggml_backend_buffer_get_usage(input->buffer) == GGML_BACKEND_BUFFER_USAGE_WEIGHTS &&
ggml_backend_buffer_is_host(input->buffer) && (
(node->src[0] == input_cpy && node->op == GGML_OP_MUL_MAT_ID)
//|| (node->src[1] == input_cpy && node->op == GGML_OP_ADD_ID) /* GGML_OP_ADD_ID weights are small and not worth splitting */
)) {
const int64_t n_expert = node->op == GGML_OP_MUL_MAT_ID ? input->ne[2] : input->ne[1];
const size_t expert_size = node->op == GGML_OP_MUL_MAT_ID ? input->nb[2] : input->nb[1];
ggml_backend_synchronize(input_backend);
if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
ggml_backend_event_synchronize(sched->events[split_backend_id][sched->cur_copy]);
} else {
ggml_backend_synchronize(split_backend);
// get the ids
ggml_tensor * ids_tensor = node->src[2];
ggml_backend_t ids_backend = split_backend;
// if the ids tensor is also an input of the split, it may not have been copied yet to the split backend
// in that case, we use the original ids tensor
for (int i = input_id + 1; i < split->n_inputs; i++) {
if (ids_tensor == tensor_copy(split->inputs[i], split_backend_id, sched->cur_copy)) {
ids_tensor = split->inputs[i];
ids_backend = ggml_backend_sched_get_tensor_backend(sched, split->inputs[i]);
break;
}
}
if (ids_tensor != prev_ids_tensor) {
ids.resize(ggml_nbytes(ids_tensor) / sizeof(int32_t));
ggml_backend_tensor_get_async(ids_backend, ids_tensor, ids.data(), 0, ggml_nbytes(ids_tensor));
ggml_backend_synchronize(ids_backend);
// find the used experts
used_ids.clear();
used_ids.resize(ggml_bitset_size(n_expert));
for (int64_t i1 = 0; i1 < ids_tensor->ne[1]; i1++) {
for (int64_t i0 = 0; i0 < ids_tensor->ne[0]; i0++) {
int32_t id = ids[i1 * ids_tensor->nb[1]/sizeof(int32_t) + i0 * ids_tensor->nb[0]/sizeof(int32_t)];
GGML_ASSERT(id >= 0 && id < n_expert);
ggml_bitset_set(used_ids.data(), id);
}
}
prev_ids_tensor = ids_tensor;
}
// group consecutive experts and copy them together
auto copy_experts = [&](int32_t first_id, int32_t last_id) {
const size_t expert_offset = first_id * expert_size;
const size_t expert_size_copy = (last_id - first_id + 1) * expert_size;
const size_t padding = std::min<size_t>(expert_size, 512);
const size_t padding_end = last_id < n_expert - 1 ? padding : 0;
ggml_backend_tensor_set_async(split_backend,
input_cpy,
(const uint8_t *)input->data + expert_offset, expert_offset,
// copy a bit extra at the to ensure there are no NaNs in the padding of the last expert
// this is necessary for MMQ in the CUDA backend
expert_size_copy + padding_end);
};
int id = 0;
while (!ggml_bitset_get(used_ids.data(), id)) {
id++;
}
int32_t first_id = id;
int32_t last_id = first_id;
for (++id; id < n_expert; ++id) {
if (!ggml_bitset_get(used_ids.data(), id)) {
continue;
}
if (id == last_id + 1) {
last_id = id;
continue;
}
copy_experts(first_id, last_id);
first_id = id;
last_id = id;
}
copy_experts(first_id, last_id);
} else {
// try async copy, but if not possible, we can still use a sync copy without synchronizing the dst backend, since we handle the synchronization here with multiple copies and events
// TODO: add public function to facilitate this, since applications do not have direct access to the backend interface
if (!split_backend->iface.cpy_tensor_async || !split_backend->iface.cpy_tensor_async(input_backend, split_backend, input, input_cpy)) {
ggml_backend_synchronize(input_backend);
if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
ggml_backend_event_synchronize(sched->events[split_backend_id][sched->cur_copy]);
} else {
ggml_backend_synchronize(split_backend);
}
ggml_backend_tensor_copy(input, input_cpy);
}
ggml_backend_tensor_copy(input, input_cpy);
}
}
}

View File

@@ -867,6 +867,86 @@ static aclTensor* aclnn_values(ggml_backend_cann_context& ctx, void* buffer,
return acl_tensor;
}
/**
* @brief Fills a tensor with a scalar value.
*
* This function fills the destination tensor `acl_dst` with the scalar value
* `scalar`.
*
* @param ctx The context for the CANN backend operations.
* @param scalar The scalar value used to fill the tensor.
* @param acl_dst The destination tensor to be filled with the scalar value.
*/
static void aclnn_fill_scalar(ggml_backend_cann_context& ctx, float scalar,
aclTensor* acl_dst) {
auto acl_scalar = aclCreateScalar(&scalar, aclDataType::ACL_FLOAT);
GGML_CANN_CALL_ACLNN_OP(ctx, InplaceFillScalar, acl_dst, acl_scalar);
ggml_cann_release_resources(ctx, acl_scalar);
}
/**
* @brief Get or expand a cached float32 tensor filled with a scalar value.
*
* This function manages cached device memory for float32 tensors. If the current
* cache size is insufficient for the requested tensor shape, the old memory will
* be released and new memory will be allocated. The allocated buffer is then
* initialized either with zeros (when @p value == 0.0f) or with the given scalar
* value using CANN operations. Finally, an aclTensor object is created from the
* cached memory and returned.
*
* @param ctx The CANN backend context that manages device memory.
* @param buffer A pointer to the cached device buffer (will be allocated
* or reallocated if necessary).
* @param cache_element The current number of cached elements. This will be
* updated when the cache is expanded.
* @param ne The tensor shape array (number of elements in each dimension).
* @param nb The stride size for each dimension.
* @param dims The number of tensor dimensions.
* @param value The scalar value used to fill the tensor (supports zero
* initialization via memset or arbitrary values via fill_scalar).
* @return An aclTensor pointer created from the cached buffer.
*/
static aclTensor* get_f32_cache_acl_tensor(
ggml_backend_cann_context& ctx,
void** buffer,
int64_t &cache_element,
int64_t* ne,
size_t* nb,
int64_t dims,
float value) {
// Calculate total number of elements
int64_t n_element = 1;
for (int i = 0; i < dims; i++) {
n_element *= ne[i];
}
size_t size = n_element * sizeof(float);
// Allocate or expand cache if needed
if (cache_element < n_element) {
if (*buffer != nullptr) {
aclrtFree(*buffer);
*buffer = nullptr;
}
ACL_CHECK(aclrtMalloc(buffer, size, ACL_MEM_MALLOC_HUGE_FIRST));
cache_element = n_element;
// Initialize cache
if (value == 0.0f) {
ACL_CHECK(aclrtMemsetAsync(*buffer, size, 0, size, ctx.stream()));
} else {
int64_t pool_ne[1] = { n_element };
size_t pool_nb[1] = { sizeof(float) };
aclTensor* acl_value = ggml_cann_create_tensor(
*buffer, ACL_FLOAT, sizeof(float), pool_ne, pool_nb, 1);
aclnn_fill_scalar(ctx, 1, acl_value);
ggml_cann_release_resources(ctx, acl_value);
}
}
return ggml_cann_create_tensor(*buffer, ACL_FLOAT, sizeof(float), ne, nb, dims);
}
void ggml_cann_rms_norm(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
ggml_tensor* src = dst->src[0];
@@ -875,20 +955,39 @@ void ggml_cann_rms_norm(ggml_backend_cann_context& ctx, ggml_tensor* dst) {
float eps;
memcpy(&eps, dst->op_params, sizeof(float));
size_t one_tensor_n_bytes = src->ne[0] * ggml_element_size(src);
ggml_cann_pool_alloc one_tensor_allocator(ctx.pool(), one_tensor_n_bytes);
aclTensor* acl_gamma = aclnn_values(
ctx, one_tensor_allocator.get(), one_tensor_n_bytes, src->ne, 1,
ggml_cann_type_mapping(src->type), ggml_element_size(src));
// build gamma, one...
size_t acl_gamma_nb[GGML_MAX_DIMS];
acl_gamma_nb[0] = sizeof(float);
for (int i = 1; i < GGML_MAX_DIMS; i++) {
acl_gamma_nb[i] = acl_gamma_nb[i - 1] * src->ne[i - 1];
}
aclTensor* acl_gamma = get_f32_cache_acl_tensor(
ctx,
&ctx.f32_one_cache,
ctx.f32_one_cache_element,
src->ne,
acl_gamma_nb,
1, // dims
1.0f // value
);
// build rstd, zero...
size_t acl_rstd_nb[GGML_MAX_DIMS];
acl_rstd_nb[0] = sizeof(float);
for (int i = 1; i < GGML_MAX_DIMS; i++) {
acl_rstd_nb[i] = acl_rstd_nb[i - 1] * src->ne[i - 1];
}
aclTensor* acl_rstd = get_f32_cache_acl_tensor(
ctx,
&ctx.f32_zero_cache,
ctx.f32_zero_cache_element,
src->ne,
acl_rstd_nb,
GGML_MAX_DIMS,
0.0f // value
);
size_t zero_tensor_n_bytes =
src->ne[1] * src->ne[2] * src->ne[3] * ggml_element_size(src);
ggml_cann_pool_alloc zero_tensor_allocator(ctx.pool(), zero_tensor_n_bytes);
aclTensor* acl_rstd =
aclnn_zero(ctx, zero_tensor_allocator.get(), zero_tensor_n_bytes,
src->ne, GGML_MAX_DIMS, ggml_cann_type_mapping(src->type),
ggml_element_size(src));
GGML_CANN_CALL_ACLNN_OP(ctx, RmsNorm, acl_src, acl_gamma, eps, acl_dst, acl_rstd);
ggml_cann_release_resources(ctx, acl_src, acl_dst, acl_gamma, acl_rstd);
}
@@ -903,14 +1002,13 @@ void ggml_cann_diag_mask(ggml_backend_cann_context& ctx, ggml_tensor* dst,
const int n_past = ((int32_t*)dst->op_params)[0];
size_t one_tensor_n_bytes = src->ne[0] * src->ne[1] * src->ne[2] *
src->ne[3] * ggml_element_size(src);
ggml_cann_pool_alloc one_tensor_allocator(ctx.pool(), one_tensor_n_bytes);
ggml_cann_pool_alloc one_tensor_allocator(ctx.pool(), ggml_nbytes(src));
void* buffer = one_tensor_allocator.get();
aclTensor* mask_tensor =
aclnn_values(ctx, one_tensor_allocator.get(), one_tensor_n_bytes,
src->ne, GGML_MAX_DIMS, ggml_cann_type_mapping(src->type),
ggml_element_size(src), value);
aclTensor* mask_tensor = ggml_cann_create_tensor(buffer, ggml_cann_type_mapping(src->type),
ggml_type_size(src->type), src->ne, src->nb, GGML_MAX_DIMS);
aclnn_fill_scalar(ctx, value, mask_tensor);
aclScalar* alpha = nullptr;
float alphaValue = 1.0f;
@@ -1277,23 +1375,6 @@ void ggml_cann_timestep_embedding(ggml_backend_cann_context& ctx,
tmp_permute_tensor, tmp_mul_tensor, acl_dst);
}
/**
* @brief Fills a tensor with a scalar value.
*
* This function fills the destination tensor `acl_dst` with the scalar value
* `scalar`.
*
* @param ctx The context for the CANN backend operations.
* @param scalar The scalar value used to fill the tensor.
* @param acl_dst The destination tensor to be filled with the scalar value.
*/
static void aclnn_fill_scalar(ggml_backend_cann_context& ctx, float scalar,
aclTensor* acl_dst) {
auto acl_scalar = aclCreateScalar(&scalar, aclDataType::ACL_FLOAT);
GGML_CANN_CALL_ACLNN_OP(ctx, InplaceFillScalar, acl_dst, acl_scalar);
ggml_cann_release_resources(ctx, acl_scalar);
}
/**
* @brief Raises each element of a tensor to the power of the corresponding
* element in another tensor.
@@ -2154,86 +2235,129 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
GGML_TENSOR_BINARY_OP_LOCALS
// theta_scale arange, [0,1,...,ne00/2 - 1]
int64_t theta_scale_length = ne00 / 2;
ggml_cann_pool_alloc theta_scale_allocator(ctx.pool(),
theta_scale_length * sizeof(float_t));
void* theta_scale_buffer = theta_scale_allocator.get();
int64_t theta_scale_ne[] = {theta_scale_length, 1, 1, 1};
size_t theta_scale_nb[] = {sizeof(float_t), sizeof(float_t), sizeof(float_t),
theta_scale_length * sizeof(float_t)};
aclTensor* acl_theta_scale_tensor =
ggml_cann_create_tensor(theta_scale_buffer, ACL_FLOAT, sizeof(float_t),
theta_scale_ne, theta_scale_nb, GGML_MAX_DIMS);
float start = 0;
float step = 1;
float stop = ne00 / 2;
float n_elements = ne00 / 2;
aclnn_arange(ctx, acl_theta_scale_tensor, start, stop, step, n_elements);
// power
aclScalar* acl_theta_scale = aclCreateScalar(&theta_scale, aclDataType::ACL_FLOAT);
GGML_CANN_CALL_ACLNN_OP(ctx, PowScalarTensor, acl_theta_scale, acl_theta_scale_tensor,
acl_theta_scale_tensor);
// freq_scale
if (freq_scale != 1) {
aclnn_muls(ctx, acl_theta_scale_tensor, freq_scale, nullptr, true);
}
// freq_factors
if (src2) {
aclTensor* acl_freq_factors_tensor = ggml_cann_create_tensor(
src2->data, ggml_cann_type_mapping(src2->type),
ggml_type_size(src2->type), theta_scale_ne, theta_scale_nb, GGML_MAX_DIMS);
aclnn_div(ctx, acl_theta_scale_tensor, acl_freq_factors_tensor);
ggml_cann_release_resources(ctx, acl_freq_factors_tensor);
}
// position
GGML_ASSERT(src1->type == GGML_TYPE_I32);
int64_t position_length = src1->ne[0];
int64_t position_ne[] = {1, 1, position_length, 1};
size_t position_nb[] = {sizeof(int32_t), sizeof(int32_t), sizeof(int32_t),
sizeof(int32_t) * position_length};
aclTensor* acl_position_tensor = ggml_cann_create_tensor(
src1->data, ggml_cann_type_mapping(src1->type),
ggml_type_size(src1->type), position_ne, position_nb, GGML_MAX_DIMS);
// power * position
int64_t theta_length = theta_scale_length * position_length;
ggml_cann_pool_alloc theta_allocator(ctx.pool(),
theta_length * sizeof(float_t));
void* theta_buffer = theta_allocator.get();
int64_t theta_ne[] = {theta_scale_length, 1, position_length, 1};
size_t theta_nb[GGML_MAX_DIMS];
theta_nb[0] = sizeof(float_t);
for (int i = 1; i < GGML_MAX_DIMS; i++) {
theta_nb[i] = theta_nb[i - 1] * theta_ne[i - 1];
}
aclTensor* acl_theta_tensor =
ggml_cann_create_tensor(theta_buffer, ACL_FLOAT, sizeof(float_t),
theta_ne, theta_nb, GGML_MAX_DIMS);
aclnn_mul(ctx, acl_position_tensor, acl_theta_scale_tensor,
acl_theta_tensor);
// sin/cos
ggml_cann_pool_alloc sin_allocator(ctx.pool(),
theta_length * sizeof(float_t));
void* sin_buffer = sin_allocator.get();
bool is_q = (std::strncmp(dst->name, "Qcur-", 5) == 0);
bool is_k = (std::strncmp(dst->name, "Kcur-", 5) == 0);
// used for accuracy testing
bool is_attention = is_q || is_k;
if(ctx.init_ptr == nullptr || !is_attention) {
// theta_scale arange, [0,1,...,ne00/2 - 1]
if(ctx.init_ptr != nullptr){
ACL_CHECK(aclrtFree(ctx.init_ptr));
}
ACL_CHECK(aclrtMalloc(&ctx.init_ptr, theta_scale_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
aclTensor* acl_theta_scale_tensor =
ggml_cann_create_tensor(ctx.init_ptr, ACL_FLOAT, sizeof(float_t),
theta_scale_ne, theta_scale_nb, GGML_MAX_DIMS);
float start = 0;
float step = 1;
float stop = ne00 / 2;
float n_elements = ne00 / 2;
aclnn_arange(ctx, acl_theta_scale_tensor, start, stop, step, n_elements);
// power
aclScalar* acl_theta_scale = aclCreateScalar(&theta_scale, aclDataType::ACL_FLOAT);
GGML_CANN_CALL_ACLNN_OP(ctx, PowScalarTensor, acl_theta_scale, acl_theta_scale_tensor,
acl_theta_scale_tensor);
// freq_scale
if (freq_scale != 1) {
aclnn_muls(ctx, acl_theta_scale_tensor, freq_scale, nullptr, true);
}
// freq_factors
if (src2) {
aclTensor* acl_freq_factors_tensor = ggml_cann_create_tensor(
src2->data, ggml_cann_type_mapping(src2->type),
ggml_type_size(src2->type), theta_scale_ne, theta_scale_nb, GGML_MAX_DIMS);
aclnn_div(ctx, acl_theta_scale_tensor, acl_freq_factors_tensor);
ggml_cann_release_resources(ctx, acl_freq_factors_tensor);
}
// release
ggml_cann_release_resources(ctx, acl_theta_scale_tensor,acl_theta_scale);
}
if(ctx.sin_ptr == nullptr) {
int64_t theta_length = theta_scale_length * ctx.max_prompt_length;
ACL_CHECK(aclrtMalloc(&ctx.sin_ptr, theta_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
ACL_CHECK(aclrtMalloc(&ctx.cos_ptr, theta_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
}
if(position_length > ctx.max_prompt_length) {
ctx.max_prompt_length = position_length;
int64_t theta_length = theta_scale_length * ctx.max_prompt_length;
ACL_CHECK(aclrtFree(ctx.sin_ptr));
ACL_CHECK(aclrtFree(ctx.cos_ptr));
ACL_CHECK(aclrtMalloc(&ctx.sin_ptr, theta_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
ACL_CHECK(aclrtMalloc(&ctx.cos_ptr, theta_length * sizeof(float_t), ACL_MEM_MALLOC_HUGE_FIRST));
}
bool is_fisrt_layer = (std::strncmp(dst->name, "Qcur-0", GGML_MAX_NAME) == 0);
if(is_fisrt_layer || !is_attention) {
aclTensor* acl_theta_scale_tensor =
ggml_cann_create_tensor(ctx.init_ptr, ACL_FLOAT, sizeof(float_t),
theta_scale_ne, theta_scale_nb, GGML_MAX_DIMS);
// position
aclTensor* acl_position_tensor = ggml_cann_create_tensor(
src1->data, ggml_cann_type_mapping(src1->type),
ggml_type_size(src1->type), position_ne, position_nb, GGML_MAX_DIMS);
// power * position
int64_t theta_length = theta_scale_length * position_length;
ggml_cann_pool_alloc theta_allocator(ctx.pool(),
theta_length * sizeof(float_t));
void* theta_buffer = theta_allocator.get();
aclTensor* acl_theta_tensor =
ggml_cann_create_tensor(theta_buffer, ACL_FLOAT, sizeof(float_t),
theta_ne, theta_nb, GGML_MAX_DIMS);
aclnn_mul(ctx, acl_position_tensor, acl_theta_scale_tensor,
acl_theta_tensor);
// sin/cos
aclTensor* acl_sin_tensor = ggml_cann_create_tensor(
ctx.sin_ptr, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
GGML_MAX_DIMS, ACL_FORMAT_ND);
aclnn_sin(ctx, acl_theta_tensor, acl_sin_tensor);
aclTensor* acl_cos_tensor = ggml_cann_create_tensor(
ctx.cos_ptr, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
GGML_MAX_DIMS, ACL_FORMAT_ND);
aclnn_cos(ctx, acl_theta_tensor, acl_cos_tensor);
// release
ggml_cann_release_resources(ctx, acl_theta_scale_tensor, acl_position_tensor,
acl_theta_tensor, acl_sin_tensor, acl_cos_tensor);
}
aclTensor* acl_sin_tensor = ggml_cann_create_tensor(
sin_buffer, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
GGML_MAX_DIMS, ACL_FORMAT_ND);
aclnn_sin(ctx, acl_theta_tensor, acl_sin_tensor);
ggml_cann_pool_alloc cos_allocator(ctx.pool(),
theta_length * sizeof(float_t));
void* cos_buffer = cos_allocator.get();
ctx.sin_ptr, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
GGML_MAX_DIMS, ACL_FORMAT_ND);
aclTensor* acl_cos_tensor = ggml_cann_create_tensor(
cos_buffer, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
GGML_MAX_DIMS, ACL_FORMAT_ND);
aclnn_cos(ctx, acl_theta_tensor, acl_cos_tensor);
ctx.cos_ptr, ACL_FLOAT, sizeof(float_t), theta_ne, theta_nb,
GGML_MAX_DIMS, ACL_FORMAT_ND);
// attn_factor
if (attn_factor != 1) {
@@ -2257,8 +2381,7 @@ static void aclnn_cache_init(ggml_backend_cann_context& ctx, ggml_tensor* dst,
}
// release
ggml_cann_release_resources(ctx, acl_theta_scale_tensor, acl_position_tensor,
acl_theta_tensor, acl_sin_tensor, acl_cos_tensor, acl_theta_scale);
ggml_cann_release_resources(ctx, acl_sin_tensor, acl_cos_tensor);
}
#ifdef __cplusplus

View File

@@ -368,6 +368,10 @@ struct ggml_backend_cann_context {
std::string name; /**< Name of the device. */
std::string description; /**< Description of the device. */
aclrtEvent copy_event = nullptr; /**< Event for managing copy operations. */
void* init_ptr = nullptr;
void* sin_ptr = nullptr;
void* cos_ptr = nullptr;
int64_t max_prompt_length = 65536;
#ifdef USE_ACL_GRAPH
/// Cached CANN ACL graph used for executing the current ggml computation graph.
std::unique_ptr<ggml_cann_graph> cann_graph;
@@ -375,6 +379,10 @@ struct ggml_backend_cann_context {
cann_task_queue task_queue;
bool async_mode;
bool support_set_rows;
void* f32_zero_cache = nullptr;
void* f32_one_cache = nullptr;
int64_t f32_zero_cache_element = 0;
int64_t f32_one_cache_element = 0;
aclrtStream streams[GGML_CANN_MAX_STREAMS] = {nullptr}; /**< Array of streams for the device. */
@@ -414,6 +422,15 @@ struct ggml_backend_cann_context {
ACL_CHECK(aclrtDestroyStream(streams[i]));
}
}
if(init_ptr != nullptr) {
ACL_CHECK(aclrtFree(init_ptr));
}
if(sin_ptr != nullptr) {
ACL_CHECK(aclrtFree(sin_ptr));
}
if(cos_ptr != nullptr) {
ACL_CHECK(aclrtFree(cos_ptr));
}
}
/**

View File

@@ -73,7 +73,6 @@
#define ggml_vec_dot_tq1_0_q8_K_generic ggml_vec_dot_tq1_0_q8_K
#define ggml_vec_dot_tq2_0_q8_K_generic ggml_vec_dot_tq2_0_q8_K
#define ggml_vec_dot_iq1_m_q8_K_generic ggml_vec_dot_iq1_m_q8_K
#define ggml_vec_dot_mxfp4_q8_0_generic ggml_vec_dot_mxfp4_q8_0
// repack.cpp
#define ggml_quantize_mat_q8_0_4x4_generic ggml_quantize_mat_q8_0_4x4
#define ggml_quantize_mat_q8_0_4x8_generic ggml_quantize_mat_q8_0_4x8
@@ -151,8 +150,6 @@
#elif defined(__s390x__)
// quants.c
#define quantize_row_q8_K_generic quantize_row_q8_K
#define ggml_vec_dot_q5_0_q8_0_generic ggml_vec_dot_q5_0_q8_0
#define ggml_vec_dot_q5_1_q8_1_generic ggml_vec_dot_q5_1_q8_1
#define ggml_vec_dot_tq1_0_q8_K_generic ggml_vec_dot_tq1_0_q8_K
#define ggml_vec_dot_tq2_0_q8_K_generic ggml_vec_dot_tq2_0_q8_K
#define ggml_vec_dot_q2_K_q8_K_generic ggml_vec_dot_q2_K_q8_K

View File

@@ -278,6 +278,72 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
#endif
}
void ggml_vec_dot_mxfp4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
assert(nrc == 1);
UNUSED(nrc);
UNUSED(bx);
UNUSED(by);
UNUSED(bs);
assert(n % QK_MXFP4 == 0);
static_assert(QK_MXFP4 == QK8_0, "QK_MXFP4 and QK8_0 must be the same");
const block_mxfp4 * GGML_RESTRICT x = vx;
const block_q8_0 * GGML_RESTRICT y = vy;
const int nb = n / QK_MXFP4;
int ib = 0;
float sumf = 0;
#if defined(__POWER9_VECTOR__)
const vector signed char lowMask = vec_splats((signed char)0xF);
const vector unsigned char vshift4 = vec_splats((unsigned char)4);
vector float vsumf0 = vec_splats(0.0f);
vector signed char kv = vec_xl(0, (const signed char *)kvalues_mxfp4);
#pragma GCC unroll 8
for (; ib < nb; ++ib) {
__builtin_prefetch(x[ib].qs, 0, 1);
__builtin_prefetch(y[ib].qs, 0, 1);
vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d) *
GGML_E8M0_TO_FP32_HALF(x[ib].e));
vector signed char q8y0 = vec_xl( 0, y[ib].qs);
vector signed char q8y1 = vec_xl(16, y[ib].qs);
vector signed char qxs = (vector signed char)vec_xl(0, x[ib].qs);
vector unsigned char lo_nibbles = (vector unsigned char)vec_and(qxs, lowMask);
vector unsigned char hi_nibbles = (vector unsigned char)vec_sr(qxs, vshift4);
vector signed char q4x0 = vec_perm(kv, kv, lo_nibbles);
vector signed char q4x1 = vec_perm(kv, kv, hi_nibbles);
vector signed short qv0 = vec_add(vec_mule(q4x0, q8y0), vec_mulo(q4x0, q8y0));
vector signed short qv1 = vec_add(vec_mule(q4x1, q8y1), vec_mulo(q4x1, q8y1));
vector signed int vsumi0 = vec_splats((int32_t)0);
vsumi0 = vec_sum4s(qv0, vsumi0);
vsumi0 = vec_sum4s(qv1, vsumi0);
vsumf0 = vec_madd(vec_ctf(vsumi0, 0), vyd, vsumf0);
}
vsumf0 = vec_add(vsumf0, vec_sld(vsumf0, vsumf0, 4));
vsumf0 = vec_add(vsumf0, vec_sld(vsumf0, vsumf0, 8));
sumf = vec_extract(vsumf0, 0);
*s = sumf;
#else
UNUSED(x);
UNUSED(y);
UNUSED(ib);
UNUSED(sumf);
ggml_vec_dot_mxfp4_q8_0_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
const int qk = QK8_0;
const int nb = n / qk;

View File

@@ -23,6 +23,27 @@
#define UNUSED GGML_UNUSED
#if defined(__VXE__) || defined(__VXE2__)
#define B1(c,s,n) 0x ## n ## c , 0x ## n ## s
#define B2(c,s,n) B1(c,s,n ## c), B1(c,s,n ## s)
#define B3(c,s,n) B2(c,s,n ## c), B2(c,s,n ## s)
#define B4(c,s,n) B3(c,s,n ## c), B3(c,s,n ## s)
#define B5(c,s,n) B4(c,s,n ## c), B4(c,s,n ## s)
#define B6(c,s,n) B5(c,s,n ## c), B5(c,s,n ## s)
#define B7(c,s,n) B6(c,s,n ## c), B6(c,s,n ## s)
#define B8(c,s ) B7(c,s, c), B7(c,s, s)
// precomputed tables for expanding 8bits to 8 bytes:
static const __attribute__((aligned(16))) uint64_t table_b2b_0[1 << 8] = { B8(00, 10) }; // ( b ) << 4
static const __attribute__((aligned(16))) uint64_t table_b2b_1[1 << 8] = { B8(10, 00) }; // (!b) << 4
// permute mask for byteswapping
static const uint8x16_t v_kperm = (const uint8x16_t){
7, 6, 5, 4, 3, 2, 1, 0,
15, 14, 13, 12, 11, 10, 9, 8
};
#endif
void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k) {
assert(QK8_0 == 32);
assert(k % QK8_0 == 0);
@@ -241,6 +262,301 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
#endif
}
void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
const int qk = QK8_0;
const int nb = n / qk;
assert(n % qk == 0);
assert(qk == QK5_0);
assert(nrc == 1);
UNUSED(nrc);
UNUSED(bx);
UNUSED(by);
UNUSED(bs);
const block_q5_0 * GGML_RESTRICT x = vx;
const block_q8_0 * GGML_RESTRICT y = vy;
int ib = 0;
float sumf = 0.0f;
#if defined(__VXE__) || defined(__VXE2__)
float32x4_t v_sum0 = vec_splats(0.0f);
float32x4_t v_sum1 = vec_splats(0.0f);
uint32_t qh0, qh1;
uint64_t tmp0[4], tmp1[4];
const uint8x16_t v_m = vec_splats((uint8_t)0x0F);
#pragma GCC unroll 4
for (; ib + 1 < nb; ib += 2) {
const block_q5_0 * GGML_RESTRICT x0 = &x[ib + 0];
const block_q5_0 * GGML_RESTRICT x1 = &x[ib + 1];
const block_q8_0 * GGML_RESTRICT y0 = &y[ib + 0];
const block_q8_0 * GGML_RESTRICT y1 = &y[ib + 1];
memcpy(&qh0, x0->qh, sizeof(qh0));
memcpy(&qh1, x1->qh, sizeof(qh1));
tmp0[0] = table_b2b_1[(qh0 >> 0) & 0xFF];
tmp0[1] = table_b2b_1[(qh0 >> 8) & 0xFF];
tmp0[2] = table_b2b_1[(qh0 >> 16) & 0xFF];
tmp0[3] = table_b2b_1[(qh0 >> 24) ];
tmp1[0] = table_b2b_1[(qh1 >> 0) & 0xFF];
tmp1[1] = table_b2b_1[(qh1 >> 8) & 0xFF];
tmp1[2] = table_b2b_1[(qh1 >> 16) & 0xFF];
tmp1[3] = table_b2b_1[(qh1 >> 24) ];
int8x16_t v_qh0l = vec_xl(0, (const int8_t *)(tmp0 + 0));
int8x16_t v_qh0h = vec_xl(0, (const int8_t *)(tmp0 + 2));
int8x16_t v_qh1l = vec_xl(0, (const int8_t *)(tmp1 + 0));
int8x16_t v_qh1h = vec_xl(0, (const int8_t *)(tmp1 + 2));
// required for fixing the byteorder
v_qh0l = vec_perm(v_qh0l, v_qh0l, v_kperm);
v_qh0h = vec_perm(v_qh0h, v_qh0h, v_kperm);
v_qh1l = vec_perm(v_qh1l, v_qh1l, v_kperm);
v_qh1h = vec_perm(v_qh1h, v_qh1h, v_kperm);
const uint8x16_t v_x0 = vec_xl(0, (const uint8_t *)x0->qs);
const uint8x16_t v_x1 = vec_xl(0, (const uint8_t *)x1->qs);
int8x16_t v_x0l = (int8x16_t)vec_and(v_x0, v_m);
int8x16_t v_x0h = (int8x16_t)vec_sr(v_x0, 4);
int8x16_t v_x1l = (int8x16_t)vec_and(v_x1, v_m);
int8x16_t v_x1h = (int8x16_t)vec_sr(v_x1, 4);
const int8x16_t v_x0lf = vec_sub(v_x0l, v_qh0l);
const int8x16_t v_x0hf = vec_sub(v_x0h, v_qh0h);
const int8x16_t v_x1lf = vec_sub(v_x1l, v_qh1l);
const int8x16_t v_x1hf = vec_sub(v_x1h, v_qh1h);
const int8x16_t v_y0l = vec_xl(0, (const int8_t *)y0->qs);
const int8x16_t v_y0h = vec_xl(QK8_0/2, (const int8_t *)y0->qs);
const int8x16_t v_y1l = vec_xl(0, (const int8_t *)y1->qs);
const int8x16_t v_y1h = vec_xl(QK8_0/2, (const int8_t *)y1->qs);
const int32x4_t v_xy0 = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_x0lf, v_y0l), v_x0hf, v_y0h);
const int32x4_t v_xy1 = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_x1lf, v_y1l), v_x1hf, v_y1h);
const float32x4_t v_xy0f = vec_float(v_xy0);
const float32x4_t v_xy1f = vec_float(v_xy1);
const float32x4_t v_d0 = vec_splats(GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d));
const float32x4_t v_d1 = vec_splats(GGML_CPU_FP16_TO_FP32(x1->d) * GGML_CPU_FP16_TO_FP32(y1->d));
v_sum0 = vec_madd(v_xy0f, v_d0, v_sum0);
v_sum1 = vec_madd(v_xy1f, v_d1, v_sum1);
}
sumf += vec_hsum(v_sum0) + vec_hsum(v_sum1);
#pragma GCC unroll 4
for (; ib < nb; ++ib) {
const block_q5_0 * GGML_RESTRICT x0 = &x[ib];
const block_q8_0 * GGML_RESTRICT y0 = &y[ib];
uint32_t qh;
memcpy(&qh, x0->qh, sizeof(qh));
uint64_t tmp[4];
tmp[0] = table_b2b_1[(qh >> 0) & 0xFF];
tmp[1] = table_b2b_1[(qh >> 8) & 0xFF];
tmp[2] = table_b2b_1[(qh >> 16) & 0xFF];
tmp[3] = table_b2b_1[(qh >> 24) ];
int8x16_t v_qhl = vec_xl(0, (const int8_t *)(tmp + 0));
int8x16_t v_qhh = vec_xl(0, (const int8_t *)(tmp + 2));
// required for fixing the byteorder
v_qhl = vec_perm(v_qhl, v_qhl, v_kperm);
v_qhh = vec_perm(v_qhh, v_qhh, v_kperm);
const uint8x16_t v_x = vec_xl(0, (const uint8_t *)x0->qs);
int8x16_t v_xl = (int8x16_t)vec_and(v_x, v_m);
int8x16_t v_xh = (int8x16_t)vec_sr(v_x, 4);
const int8x16_t v_xlf = vec_sub(v_xl, v_qhl);
const int8x16_t v_xhf = vec_sub(v_xh, v_qhh);
const int8x16_t v_yl = vec_xl(0, (const int8_t *)y0->qs);
const int8x16_t v_yh = vec_xl(QK8_0/2, (const int8_t *)y0->qs);
const int32x4_t v_xy = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xlf, v_yl), v_xhf, v_yh);
const float32x4_t v_xyf = vec_float(v_xy);
const float32x4_t v_d = vec_splats(GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d));
const float32x4_t v_acc = vec_madd(v_xyf, v_d, vec_splats(0.0f));
sumf += vec_hsum(v_acc);
}
*s = sumf;
#else
UNUSED(nb);
UNUSED(x);
UNUSED(y);
UNUSED(ib);
UNUSED(sumf);
ggml_vec_dot_q5_0_q8_0_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
const int qk = QK8_1;
const int nb = n / qk;
assert(n % qk == 0);
assert(qk == QK5_1);
assert(nrc == 1);
UNUSED(nrc);
UNUSED(bx);
UNUSED(by);
UNUSED(bs);
const block_q5_1 * GGML_RESTRICT x = vx;
const block_q8_1 * GGML_RESTRICT y = vy;
int ib = 0;
float sumf = 0.0f;
#if defined(__VXE__) || defined(__VXE2__)
float32x4_t v_sum0 = vec_splats(0.0f);
float32x4_t v_sum1 = vec_splats(0.0f);
float summs0 = 0.0f;
float summs1 = 0.0f;
uint32_t qh0;
uint32_t qh1;
uint64_t tmp0[4];
uint64_t tmp1[4];
const uint8x16_t v_m = vec_splats((uint8_t)0x0F);
#pragma GCC unroll 4
for (; ib + 1 < nb; ib += 2) {
const block_q5_1 * GGML_RESTRICT x0 = &x[ib + 0];
const block_q5_1 * GGML_RESTRICT x1 = &x[ib + 1];
const block_q8_1 * GGML_RESTRICT y0 = &y[ib + 0];
const block_q8_1 * GGML_RESTRICT y1 = &y[ib + 1];
summs0 += GGML_CPU_FP16_TO_FP32(x0->m) * GGML_CPU_FP16_TO_FP32(y0->s);
summs1 += GGML_CPU_FP16_TO_FP32(x1->m) * GGML_CPU_FP16_TO_FP32(y1->s);
memcpy(&qh0, x0->qh, sizeof(qh0));
memcpy(&qh1, x1->qh, sizeof(qh1));
tmp0[0] = table_b2b_0[(qh0 >> 0) & 0xFF];
tmp0[1] = table_b2b_0[(qh0 >> 8) & 0xFF];
tmp0[2] = table_b2b_0[(qh0 >> 16) & 0xFF];
tmp0[3] = table_b2b_0[(qh0 >> 24) ];
tmp1[0] = table_b2b_0[(qh1 >> 0) & 0xFF];
tmp1[1] = table_b2b_0[(qh1 >> 8) & 0xFF];
tmp1[2] = table_b2b_0[(qh1 >> 16) & 0xFF];
tmp1[3] = table_b2b_0[(qh1 >> 24) ];
int8x16_t v_qh0l = vec_xl(0, (const int8_t *)(tmp0 + 0));
int8x16_t v_qh0h = vec_xl(0, (const int8_t *)(tmp0 + 2));
int8x16_t v_qh1l = vec_xl(0, (const int8_t *)(tmp1 + 0));
int8x16_t v_qh1h = vec_xl(0, (const int8_t *)(tmp1 + 2));
// required for fixing the byteorder
v_qh0l = vec_perm(v_qh0l, v_qh0l, v_kperm);
v_qh0h = vec_perm(v_qh0h, v_qh0h, v_kperm);
v_qh1l = vec_perm(v_qh1l, v_qh1l, v_kperm);
v_qh1h = vec_perm(v_qh1h, v_qh1h, v_kperm);
const uint8x16_t v_x0 = vec_xl(0, x0->qs);
const uint8x16_t v_x1 = vec_xl(0, x1->qs);
const int8x16_t v_x0l = (int8x16_t)vec_and(v_x0, v_m);
const int8x16_t v_x0h = (int8x16_t)vec_sr(v_x0, 4);
const int8x16_t v_x1l = (int8x16_t)vec_and(v_x1, v_m);
const int8x16_t v_x1h = (int8x16_t)vec_sr(v_x1, 4);
const int8x16_t v_x0lf = vec_or(v_x0l, v_qh0l);
const int8x16_t v_x0hf = vec_or(v_x0h, v_qh0h);
const int8x16_t v_x1lf = vec_or(v_x1l, v_qh1l);
const int8x16_t v_x1hf = vec_or(v_x1h, v_qh1h);
const int8x16_t v_y0l = vec_xl(0 , y0->qs);
const int8x16_t v_y0h = vec_xl(QK8_1/2, y0->qs);
const int8x16_t v_y1l = vec_xl(0 , y1->qs);
const int8x16_t v_y1h = vec_xl(QK8_1/2, y1->qs);
const int32x4_t v_xy0 = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_x0lf, v_y0l), v_x0hf, v_y0h);
const int32x4_t v_xy1 = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_x1lf, v_y1l), v_x1hf, v_y1h);
const float32x4_t v_xy0f = vec_float(v_xy0);
const float32x4_t v_xy1f = vec_float(v_xy1);
const float32x4_t v_d0 = vec_splats(GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d));
const float32x4_t v_d1 = vec_splats(GGML_CPU_FP16_TO_FP32(x1->d) * GGML_CPU_FP16_TO_FP32(y1->d));
v_sum0 = vec_madd(v_xy0f, v_d0, v_sum0);
v_sum1 = vec_madd(v_xy1f, v_d1, v_sum1);
}
sumf += vec_hsum(v_sum0) + vec_hsum(v_sum1) + summs0 + summs1;
#pragma GCC unroll 4
for (; ib < nb; ++ib) {
const block_q5_1 * GGML_RESTRICT x0 = &x[ib];
const block_q8_1 * GGML_RESTRICT y0 = &y[ib];
float summs = GGML_CPU_FP16_TO_FP32(x0->m) * GGML_CPU_FP16_TO_FP32(y0->s);
uint32_t qh;
memcpy(&qh, x0->qh, sizeof(qh));
uint64_t tmp[4];
tmp[0] = table_b2b_0[(qh >> 0) & 0xFF];
tmp[1] = table_b2b_0[(qh >> 8) & 0xFF];
tmp[2] = table_b2b_0[(qh >> 16) & 0xFF];
tmp[3] = table_b2b_0[(qh >> 24) ];
int8x16_t v_qhl = vec_xl(0, (const int8_t *)(tmp + 0));
int8x16_t v_qhh = vec_xl(0, (const int8_t *)(tmp + 2));
// required for fixing the byteorder
v_qhl = vec_perm(v_qhl, v_qhl, v_kperm);
v_qhh = vec_perm(v_qhh, v_qhh, v_kperm);
const uint8x16_t v_x = vec_xl(0, x0->qs);
const int8x16_t v_xl = (int8x16_t)vec_and(v_x, v_m);
const int8x16_t v_xh = (int8x16_t)vec_sr(v_x, 4);
const int8x16_t v_xlf = vec_or(v_xl, v_qhl);
const int8x16_t v_xhf = vec_or(v_xh, v_qhh);
const int8x16_t v_yl = vec_xl(0 , y0->qs);
const int8x16_t v_yh = vec_xl(QK8_1/2, y0->qs);
const int32x4_t v_xy = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xlf, v_yl), v_xhf, v_yh);
const float32x4_t v_xyf = vec_float(v_xy);
const float32x4_t v_d = vec_splats(GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d));
const float32x4_t v_acc = vec_madd(v_xyf, v_d, v_acc);
sumf += vec_hsum(v_acc) + summs;
}
*s = sumf;
#else
UNUSED(nb);
UNUSED(x);
UNUSED(y);
UNUSED(ib);
UNUSED(sumf);
ggml_vec_dot_q5_1_q8_1_generic(n, s, bs, vx, bx, vy, by, nrc);
#endif
}
void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
const int qk = QK8_0;
const int nb = n / qk;

View File

@@ -486,6 +486,14 @@ inline static int16x8_t vec_padd_s16(int16x8_t a, int16x8_t b) {
return v_abo + v_abe;
}
/**
* @see https://github.com/ggml-org/llama.cpp/pull/14037
*/
inline float vec_hsum(float32x4_t v) {
float32x4_t v_temp = v + vec_reve(v);
return v_temp[0] + v_temp[1];
}
inline static int32x4_t ggml_vec_dot(int32x4_t acc, int8x16_t a, int8x16_t b) {
const int16x8_t p = vec_mule(a, b) + vec_mulo(a, b);
return acc + (vec_unpackh(p) + vec_unpackl(p));

View File

@@ -1880,6 +1880,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
{
ggml_compute_forward_conv_2d(params, tensor);
} break;
case GGML_OP_CONV_3D:
{
ggml_compute_forward_conv_3d(params, tensor);
} break;
case GGML_OP_CONV_2D_DW:
{
ggml_compute_forward_conv_2d_dw(params, tensor);
@@ -2252,6 +2256,7 @@ static int ggml_get_n_tasks(struct ggml_tensor * node, int n_threads) {
case GGML_OP_IM2COL:
case GGML_OP_IM2COL_BACK:
case GGML_OP_CONV_2D:
case GGML_OP_CONV_3D:
case GGML_OP_CONV_2D_DW:
case GGML_OP_CONV_TRANSPOSE_1D:
case GGML_OP_CONV_TRANSPOSE_2D:
@@ -2773,6 +2778,7 @@ struct ggml_cplan ggml_graph_plan(
}
} break;
case GGML_OP_CONV_2D:
case GGML_OP_CONV_3D:
{
cur = GGML_IM2COL_WORK_SIZE;
} break;

View File

@@ -7207,6 +7207,148 @@ void ggml_compute_forward_conv_2d(
ggml_compute_forward_conv_2d_impl(params, src0, src1, dst, src0->type);
}
// ggml_compute_forward_conv_3d
static void ggml_compute_forward_conv_3d_impl(const ggml_compute_params * params,
const ggml_tensor * kernel,
const ggml_tensor * src,
ggml_tensor * dst,
ggml_type kernel_type) {
GGML_ASSERT(ggml_is_contiguous(kernel));
GGML_ASSERT(kernel_type == GGML_TYPE_F16 || kernel_type == GGML_TYPE_F32);
GGML_ASSERT(kernel->type == kernel_type);
const ggml_type_traits * traits = ggml_get_type_traits(kernel_type);
const int32_t s0 = dst->op_params[0];
const int32_t s1 = dst->op_params[1];
const int32_t s2 = dst->op_params[2];
const int32_t p0 = dst->op_params[3];
const int32_t p1 = dst->op_params[4];
const int32_t p2 = dst->op_params[5];
const int32_t d0 = dst->op_params[6];
const int32_t d1 = dst->op_params[7];
const int32_t d2 = dst->op_params[8];
const int32_t c = dst->op_params[9];
const int32_t n = dst->op_params[10];
const int32_t oc = dst->op_params[11];
const int64_t src_w = src->ne[0];
const int64_t src_h = src->ne[1];
const int64_t src_d = src->ne[2];
const int64_t knl_w = kernel->ne[0];
const int64_t knl_h = kernel->ne[1];
const int64_t knl_d = kernel->ne[2];
const int64_t dst_w = dst->ne[0];
const int64_t dst_h = dst->ne[1];
const int64_t dst_d = dst->ne[2];
const float * src_data = (float *) src->data;
void * knl_data = kernel->data;
float * dst_data = (float *) dst->data;
const int64_t knl_n_per_channel = knl_w * knl_h * knl_d;
const int64_t knl_n_total = knl_n_per_channel * c;
const int64_t patch_total = n * dst_w * dst_h * dst_d;
const int64_t space_per_patch = knl_n_total * traits->type_size + oc * sizeof(float);
const int64_t batch_size = params->wsize / space_per_patch;
const int64_t patches_per_batch = batch_size > 8 ? (batch_size / 8) * 8 : batch_size;
const int64_t batch_n = (patch_total + patches_per_batch - 1) / patches_per_batch;
GGML_ASSERT(patches_per_batch > 0 && batch_size >= 1);
void * tmp = params->wdata;
for (int64_t batch_i = 0; batch_i < batch_n; ++batch_i) {
const int64_t patch_start_batch = batch_i * patches_per_batch;
const int64_t patch_end_batch = std::min(patch_start_batch + patches_per_batch, patch_total);
const int64_t patch_n_in_batch = patch_end_batch - patch_start_batch;
const int64_t patch_per_thread = (patch_n_in_batch + params->nth - 1) / params->nth;
const int64_t patch_start = patch_start_batch + params->ith * patch_per_thread;
const int64_t patch_end = std::min(patch_start + patch_per_thread, patch_end_batch);
for (int64_t p = patch_start; p < patch_end; ++p) {
const int64_t p_in_batch = p % (dst_w * dst_h * dst_d);
const int64_t p_in_depth = p_in_batch % (dst_w * dst_h);
const int64_t batch_idx = p / (dst_w * dst_h * dst_d);
const int64_t dst_z = p_in_batch / (dst_w * dst_h);
const int64_t dst_y = p_in_depth / dst_w;
const int64_t dst_x = p_in_depth % dst_w;
char * dst_row = (char *) tmp + (p % patches_per_batch) * knl_n_total * traits->type_size;
for (int64_t ic = 0; ic < c; ++ic) {
for (int64_t kz = 0; kz < knl_d; ++kz) {
for (int64_t ky = 0; ky < knl_h; ++ky) {
for (int64_t kx = 0; kx < knl_w; ++kx) {
const int64_t sz = dst_z * s2 + kz * d2 - p2;
const int64_t sy = dst_y * s1 + ky * d1 - p1;
const int64_t sx = dst_x * s0 + kx * d0 - p0;
int64_t dst_idx = ic * knl_n_per_channel + kz * (knl_h * knl_w) + ky * knl_w + kx;
float src_val;
if (sz < 0 || sz >= src_d || sy < 0 || sy >= src_h || sx < 0 || sx >= src_w) {
src_val = 0.0f;
} else {
const int64_t cn_idx = batch_idx * c + ic;
const float * src_ptr = (const float *)((const char *)src_data + sx*src->nb[0] + sy*src->nb[1] + sz*src->nb[2] + cn_idx*src->nb[3]);
src_val = *src_ptr;
}
char * element_ptr = dst_row + dst_idx * traits->type_size;
if (kernel_type == GGML_TYPE_F32) {
*(float *)element_ptr = src_val;
} else if (kernel_type == GGML_TYPE_F16) {
*(ggml_fp16_t *)element_ptr = GGML_CPU_FP32_TO_FP16(src_val);
}
}
}
}
}
}
ggml_barrier(params->threadpool);
float * gemm_output = (float *) ((char *) tmp + patches_per_batch * knl_n_total * traits->type_size);
ggml_call_mul_mat(kernel_type, params, patch_n_in_batch, oc, knl_n_total, tmp, knl_data, gemm_output);
ggml_barrier(params->threadpool);
const int64_t permute_per_thread = (patch_n_in_batch + params->nth - 1) / params->nth;
const int64_t permute_start = params->ith * permute_per_thread;
const int64_t permute_end = std::min(permute_start + permute_per_thread, patch_n_in_batch);
for (int64_t i = permute_start; i < permute_end; ++i) {
const int64_t p = patch_start_batch + i;
const int64_t p_in_batch = p % (dst_w * dst_h * dst_d);
const int64_t p_in_depth = p_in_batch % (dst_w * dst_h);
const int64_t batch_idx = p / (dst_w * dst_h * dst_d);
const int64_t dst_z = p_in_batch / (dst_w * dst_h);
const int64_t dst_y = p_in_depth / dst_w;
const int64_t dst_x = p_in_depth % dst_w;
for (int64_t ioc = 0; ioc < oc; ++ioc) {
const float value = gemm_output[i * oc + ioc];
const int64_t ocn_idx = batch_idx * oc + ioc;
float * dst_ptr = (float *)((char *)dst_data + dst_x*dst->nb[0] + dst_y*dst->nb[1] + dst_z*dst->nb[2] + ocn_idx*dst->nb[3]);
*dst_ptr = value;
}
}
}
}
void ggml_compute_forward_conv_3d(
const ggml_compute_params * params,
ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0];
const ggml_tensor * src1 = dst->src[1];
ggml_compute_forward_conv_3d_impl(params, src0, src1, dst, src0->type);
}
// ggml_compute_forward_conv_transpose_2d
void ggml_compute_forward_conv_transpose_2d(

View File

@@ -70,6 +70,7 @@ void ggml_compute_forward_conv_transpose_1d(const struct ggml_compute_params * p
void ggml_compute_forward_im2col(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_im2col_back_f32(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_conv_2d(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_conv_3d(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_conv_transpose_2d(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_conv_2d_dw(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_pool_1d(const struct ggml_compute_params * params, struct ggml_tensor * dst);

View File

@@ -24,12 +24,6 @@ if (CUDAToolkit_FOUND)
# for best performance and to also build real architectures for the most commonly used GPUs.
if (GGML_NATIVE AND CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.6" AND CMAKE_VERSION VERSION_GREATER_EQUAL "3.24")
set(CMAKE_CUDA_ARCHITECTURES "native")
elseif(GGML_CUDA_F16 OR GGML_CUDA_DMMV_F16)
if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.8")
set(CMAKE_CUDA_ARCHITECTURES "60-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real;89-real")
else()
set(CMAKE_CUDA_ARCHITECTURES "60-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real")
endif()
else()
if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.8")
set(CMAKE_CUDA_ARCHITECTURES "50-virtual;61-virtual;70-virtual;75-virtual;80-virtual;86-real;89-real")
@@ -91,10 +85,6 @@ if (CUDAToolkit_FOUND)
add_compile_definitions(GGML_CUDA_NO_FA)
endif()
if (GGML_CUDA_F16 OR GGML_CUDA_DMMV_F16)
add_compile_definitions(GGML_CUDA_F16)
endif()
if (GGML_CUDA_NO_PEER_COPY)
add_compile_definitions(GGML_CUDA_NO_PEER_COPY)
endif()

View File

@@ -11,14 +11,14 @@ static __global__ void add_id_kernel(
const int64_t i1 = blockIdx.x;
const int64_t i2 = blockIdx.y;
const int i11 = *(int32_t *) ((char *) src2 + i1*sizeof(int32_t) + i2*nb21);
const int i11 = *(const int32_t *) ((const char *) src2 + i1*sizeof(int32_t) + i2*nb21);
const size_t nb1 = ne0 * sizeof(float);
const size_t nb2 = ne1 * nb1;
float * dst_row = (float *)((char *)dst + i1*nb1 + i2*nb2);
const float * src0_row = (const float *)((char *)src0 + i1*nb01 + i2*nb02);
const float * src1_row = (const float *)((char *)src1 + i11*nb11);
const float * src0_row = (const float *)((const char *)src0 + i1*nb01 + i2*nb02);
const float * src1_row = (const float *)((const char *)src1 + i11*nb11);
for (int64_t i0 = threadIdx.x; i0 < ne0; i0 += blockDim.x) {
dst_row[i0] = src0_row[i0] + src1_row[i0];

View File

@@ -78,6 +78,8 @@
#define GGML_CUDA_CC_IS_CDNA3(cc) (cc >= GGML_CUDA_CC_CDNA3 && cc < GGML_CUDA_CC_RDNA1)
// Moore Threads
#define MUSART_HMASK 40300 // MUSA rc4.3, min. ver. for half2 -> uint mask comparisons
#define GGML_CUDA_CC_QY1 (GGML_CUDA_CC_OFFSET_MTHREADS + 0x210) // MTT S80, MTT S3000
#define GGML_CUDA_CC_QY2 (GGML_CUDA_CC_OFFSET_MTHREADS + 0x220) // MTT S4000
#define GGML_CUDA_CC_NG (GGML_CUDA_CC_OFFSET_MTHREADS + 0x310) // TBD
@@ -204,14 +206,6 @@ static const char * cu_get_error_str(CUresult err) {
#define GGML_CUDA_ASSUME(x)
#endif // CUDART_VERSION >= 11010
#ifdef GGML_CUDA_F16
typedef half dfloat; // dequantize float
typedef half2 dfloat2;
#else
typedef float dfloat; // dequantize float
typedef float2 dfloat2;
#endif // GGML_CUDA_F16
#if (!defined(GGML_USE_HIP) && !defined(GGML_CUDA_NO_VMM)) || (defined(GGML_USE_HIP) && !defined(GGML_HIP_NO_VMM))
#define GGML_USE_VMM
#endif // (!defined(GGML_USE_HIP) && !defined(GGML_CUDA_NO_VMM)) || (defined(GGML_USE_HIP) && !defined(GGML_HIP_NO_VMM))
@@ -490,13 +484,14 @@ static __device__ __forceinline__ half2 warp_reduce_max(half2 x) {
#endif // !defined(GGML_USE_HIP) && __CUDA_ARCH__ >= GGML_CUDA_CC_PASCAL || defined(GGML_USE_HIP)
}
#if CUDART_VERSION < CUDART_HMASK
#if (defined(CUDART_VERSION) && CUDART_VERSION < CUDART_HMASK) || defined(GGML_USE_HIP) || \
(defined(MUSART_VERSION) && MUSART_VERSION < MUSART_HMASK)
static __device__ __forceinline__ uint32_t __hgt2_mask(const half2 a, const half2 b) {
const uint32_t mask_low = 0x0000FFFF * (float( __low2half(a)) > float( __low2half(b)));
const uint32_t mask_high = 0xFFFF0000 * (float(__high2half(a)) > float(__high2half(b)));
return mask_low | mask_high;
}
#endif // CUDART_VERSION < CUDART_HMASK
#endif // (defined(CUDART_VERSION) && CUDART_VERSION < CUDART_HMASK) || defined(GGML_USE_HIP) || (defined(MUSART_VERSION) && MUSART_VERSION < MUSART_HMASK)
static __device__ __forceinline__ int ggml_cuda_dp4a(const int a, const int b, int c) {
#if defined(GGML_USE_HIP)
@@ -556,7 +551,7 @@ static __device__ __forceinline__ float ggml_cuda_e8m0_to_fp32(uint8_t x) {
#endif // CUDART_VERSION >= 12050
}
typedef void (*dequantize_kernel_t)(const void * vx, const int64_t ib, const int iqs, dfloat2 & v);
typedef void (*dequantize_kernel_t)(const void * vx, const int64_t ib, const int iqs, float2 & v);
static __device__ __forceinline__ float get_alibi_slope(
const float max_bias, const uint32_t h, const uint32_t n_head_log2, const float m0, const float m1

View File

@@ -34,10 +34,7 @@ static __global__ void conv_transpose_1d_kernel(
}
}
dst[global_index] = accumulator;
GGML_UNUSED(p0); GGML_UNUSED(d0); GGML_UNUSED(src0_ne3);
GGML_UNUSED(src1_ne3); GGML_UNUSED(dst_ne3);
GGML_UNUSED(src1_ne1); GGML_UNUSED(dst_ne1);
GGML_UNUSED(src1_ne2); GGML_UNUSED(dst_ne2);
GGML_UNUSED_VARS(p0, d0, src0_ne3, src1_ne3, dst_ne3, src1_ne1, dst_ne1, src1_ne2, dst_ne2);
}
static void conv_transpose_1d_f32_f32_cuda(

View File

@@ -27,7 +27,7 @@ static __global__ void dequantize_block(const void * __restrict__ vx, dst_t * __
const int64_t y_offset = qr == 1 ? 1 : qk/2;
// dequantize
dfloat2 v;
float2 v;
dequantize_kernel(vx, ib, iqs, v);
const int64_t iy0 = ((i03*ne02 + i02)*ne01 + i01)*ne00 + iybs + iqs;
@@ -71,9 +71,7 @@ static __global__ void dequantize_block_q8_0_f16(const void * __restrict__ vx, h
y2[iy/2 + threadIdx.x] = __hmul2(make_half2(qs.x, qs.y), __half2half2(d));
}
#else
GGML_UNUSED(vx);
GGML_UNUSED(y);
GGML_UNUSED(k);
GGML_UNUSED_VARS(vx, y, k);
NO_DEVICE_CODE;
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_PASCAL
}

View File

@@ -42,7 +42,7 @@ static __device__ void cpy_blck_q8_0_f32(const char * cxi, char * cdsti) {
#pragma unroll
for (int j = 0; j < QK8_0; j += 2) {
dfloat2 dq;
float2 dq;
dequantize_q8_0(cxi, 0, j, dq);
*(cdstf + j) = dq.x;
*(cdstf + j + 1) = dq.y;
@@ -55,7 +55,7 @@ static __device__ void cpy_blck_q_f32(const char * cxi, char * cdsti) {
#pragma unroll
for (int j = 0; j < qk/2; j++) {
dfloat2 dq;
float2 dq;
dequant(cxi, 0, j, dq);
*(cdstf + j) = dq.x;
*(cdstf + j + qk/2) = dq.y;
@@ -134,8 +134,7 @@ void ggml_cuda_cpy_dest_ptrs_copy(ggml_cuda_graph * cuda_graph, char ** host_des
CUDA_CHECK(cudaMemcpyAsync(cuda_graph->dest_ptrs_d, host_dest_ptrs, host_dest_ptrs_size*sizeof(char *), cudaMemcpyHostToDevice, stream));
cuda_graph->graph_cpynode_index = 0; // reset index
#else
GGML_UNUSED(cuda_graph); GGML_UNUSED(host_dest_ptrs);
GGML_UNUSED(host_dest_ptrs_size); GGML_UNUSED(stream);
GGML_UNUSED_VARS(cuda_graph, host_dest_ptrs, host_dest_ptrs_size, stream);
#endif
}

View File

@@ -1,48 +1,37 @@
#include "common.cuh"
static __device__ __forceinline__ void dequantize_q4_0(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
static __device__ __forceinline__ void dequantize_q4_0(const void * vx, const int64_t ib, const int iqs, float2 & v){
const block_q4_0 * x = (const block_q4_0 *) vx;
const dfloat d = x[ib].d;
const float d = x[ib].d;
const int vui = x[ib].qs[iqs];
v.x = vui & 0xF;
v.y = vui >> 4;
#ifdef GGML_CUDA_F16
v = __hsub2(v, {8.0f, 8.0f});
v = __hmul2(v, {d, d});
#else
v.x = (v.x - 8.0f) * d;
v.y = (v.y - 8.0f) * d;
#endif // GGML_CUDA_F16
}
static __device__ __forceinline__ void dequantize_q4_1(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
static __device__ __forceinline__ void dequantize_q4_1(const void * vx, const int64_t ib, const int iqs, float2 & v){
const block_q4_1 * x = (const block_q4_1 *) vx;
const dfloat d = __low2half(x[ib].dm);
const dfloat m = __high2half(x[ib].dm);
const float2 dm = __half22float2(x[ib].dm);
const int vui = x[ib].qs[iqs];
v.x = vui & 0xF;
v.y = vui >> 4;
#ifdef GGML_CUDA_F16
v = __hmul2(v, {d, d});
v = __hadd2(v, {m, m});
#else
v.x = (v.x * d) + m;
v.y = (v.y * d) + m;
#endif // GGML_CUDA_F16
v.x = (v.x * dm.x) + dm.y;
v.y = (v.y * dm.x) + dm.y;
}
static __device__ __forceinline__ void dequantize_q5_0(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
static __device__ __forceinline__ void dequantize_q5_0(const void * vx, const int64_t ib, const int iqs, float2 & v){
const block_q5_0 * x = (const block_q5_0 *) vx;
const dfloat d = x[ib].d;
const float d = x[ib].d;
uint32_t qh;
memcpy(&qh, x[ib].qh, sizeof(qh));
@@ -53,20 +42,14 @@ static __device__ __forceinline__ void dequantize_q5_0(const void * vx, const in
v.x = ((x[ib].qs[iqs] & 0xf) | xh_0);
v.y = ((x[ib].qs[iqs] >> 4) | xh_1);
#ifdef GGML_CUDA_F16
v = __hsub2(v, {16.0f, 16.0f});
v = __hmul2(v, {d, d});
#else
v.x = (v.x - 16.0f) * d;
v.y = (v.y - 16.0f) * d;
#endif // GGML_CUDA_F16
}
static __device__ __forceinline__ void dequantize_q5_1(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
static __device__ __forceinline__ void dequantize_q5_1(const void * vx, const int64_t ib, const int iqs, float2 & v){
const block_q5_1 * x = (const block_q5_1 *) vx;
const dfloat d = __low2half(x[ib].dm);
const dfloat m = __high2half(x[ib].dm);
const float2 dm = __half22float2(x[ib].dm);
uint32_t qh;
memcpy(&qh, x[ib].qh, sizeof(qh));
@@ -77,27 +60,18 @@ static __device__ __forceinline__ void dequantize_q5_1(const void * vx, const in
v.x = ((x[ib].qs[iqs] & 0xf) | xh_0);
v.y = ((x[ib].qs[iqs] >> 4) | xh_1);
#ifdef GGML_CUDA_F16
v = __hmul2(v, {d, d});
v = __hadd2(v, {m, m});
#else
v.x = (v.x * d) + m;
v.y = (v.y * d) + m;
#endif // GGML_CUDA_F16
v.x = (v.x * dm.x) + dm.y;
v.y = (v.y * dm.x) + dm.y;
}
static __device__ __forceinline__ void dequantize_q8_0(const void * vx, const int64_t ib, const int iqs, dfloat2 & v){
static __device__ __forceinline__ void dequantize_q8_0(const void * vx, const int64_t ib, const int iqs, float2 & v){
const block_q8_0 * x = (const block_q8_0 *) vx;
const dfloat d = x[ib].d;
const float d = x[ib].d;
v.x = x[ib].qs[iqs + 0];
v.y = x[ib].qs[iqs + 1];
#ifdef GGML_CUDA_F16
v = __hmul2(v, {d, d});
#else
v.x *= d;
v.y *= d;
#endif // GGML_CUDA_F16
}

View File

@@ -704,28 +704,6 @@ static __global__ void flash_attn_combine_results(
dst[tid] = VKQ_numerator / VKQ_denominator;
}
[[noreturn]]
static void on_no_fattn_vec_case(const int D) {
if (D == 64) {
fprintf(stderr, "Unsupported KV type combination for head_size 64.\n");
fprintf(stderr, "By default only f16 KV cache is supported.\n");
fprintf(stderr, "Compile with GGML_CUDA_FA_ALL_QUANTS for V cache quantization support.\n");
GGML_ABORT("fatal error");
} else if (D == 128) {
fprintf(stderr, "Unsupported KV type combination for head_size 128.\n");
fprintf(stderr, "Supported combinations:\n");
fprintf(stderr, " - K == q4_0, V == q4_0, 4.50 BPV\n");
fprintf(stderr, " - K == q8_0, V == q8_0, 8.50 BPV\n");
fprintf(stderr, " - K == f16, V == f16, 16.00 BPV\n");
fprintf(stderr, "Compile with GGML_CUDA_FA_ALL_QUANTS for all combinations of q4_0, q4_1, q5_0, q5_1, q8_0, and f16.\n");
GGML_ABORT("fatal error");
} else {
fprintf(stderr, "Unsupported KV type combination for head_size %d.\n", D);
fprintf(stderr, "Only f16 is supported.\n");
GGML_ABORT("fatal error");
}
}
template <int DV, int ncols1, int ncols2>
void launch_fattn(
ggml_backend_cuda_context & ctx, ggml_tensor * dst, fattn_kernel_t fattn_kernel, const int nwarps, const size_t nbytes_shared,

View File

@@ -767,14 +767,11 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
}
}
#else
GGML_UNUSED(Q_f2); GGML_UNUSED(K_h2); GGML_UNUSED(V_h2);
GGML_UNUSED(mask_h2); GGML_UNUSED(dstk); GGML_UNUSED(dstk_fixup);
GGML_UNUSED(scale); GGML_UNUSED(slope); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(stride_K); GGML_UNUSED(stride_V);
GGML_UNUSED(stride_mask); GGML_UNUSED(tile_K);
GGML_UNUSED(tile_V); GGML_UNUSED(tile_mask); GGML_UNUSED(Q_B);
GGML_UNUSED(VKQ_C); GGML_UNUSED(KQ_max); GGML_UNUSED(KQ_rowsum);
GGML_UNUSED(kb0); GGML_UNUSED(tile_Q);
GGML_UNUSED_VARS(Q_f2, K_h2, V_h2, mask_h2, dstk, dstk_fixup,
scale, slope, logit_softcap, ne01, ne02,
stride_K, stride_V, stride_mask,
tile_Q, tile_K, tile_V, tile_mask,
Q_B, VKQ_C, KQ_max, KQ_rowsum, kb0);
NO_DEVICE_CODE;
#endif // TURING_MMA_AVAILABLE
}
@@ -1236,12 +1233,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
}
}
#else
GGML_UNUSED(Q_f2); GGML_UNUSED(K_h2); GGML_UNUSED(V_h2);
GGML_UNUSED(mask_h2); GGML_UNUSED(dstk); GGML_UNUSED(dstk_fixup);
GGML_UNUSED(scale); GGML_UNUSED(slope); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(stride_Q1);
GGML_UNUSED(stride_Q2); GGML_UNUSED(stride_K); GGML_UNUSED(stride_V); GGML_UNUSED(stride_mask);
GGML_UNUSED(jt); GGML_UNUSED(kb0_start); GGML_UNUSED(kb0_stop);
GGML_UNUSED_VARS(Q_f2, K_h2, V_h2, mask_h2, sinks_f, dstk, dstk_fixup,
scale, slope, logit_softcap, ne01, ne02,
stride_Q1, stride_Q2, stride_K, stride_V, stride_mask,
jt, kb0_start, kb0_stop);
NO_DEVICE_CODE;
#endif // TURING_MMA_AVAILABLE
}
@@ -1395,17 +1390,15 @@ static __global__ void flash_attn_ext_f16(
(Q_f2, K_h2, V_h2, mask_h2, sinks_f, dstk, dst_meta, scale, slope, logit_softcap,
ne01, ne02, stride_Q1, stride_Q2, stride_K, stride_V, stride_mask, jt, kb0_start_kernel, kb0_stop_kernel);
#else
GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
GGML_UNUSED(dst); GGML_UNUSED(dst_meta);
GGML_UNUSED(scale); GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
max_bias, m0, m1, n_head_log2, logit_softcap,
ne00, ne01, ne02, ne03,
nb01, nb02, nb03,
ne10, ne11, ne12, ne13,
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
nb31, nb32, nb33);
NO_DEVICE_CODE;
#endif // defined(FLASH_ATTN_AVAILABLE) && defined(TURING_MMA_AVAILABLE)
}

View File

@@ -299,17 +299,15 @@ static __global__ void flash_attn_tile_ext_f16(
}
}
#else
GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
GGML_UNUSED(dst); GGML_UNUSED(dst_meta); GGML_UNUSED(scale);
GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02);
GGML_UNUSED(ne03); GGML_UNUSED(ne10); GGML_UNUSED(ne11);
GGML_UNUSED(ne12); GGML_UNUSED(ne13); GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33); GGML_UNUSED(nb01); GGML_UNUSED(nb02);
GGML_UNUSED(nb03); GGML_UNUSED(nb11); GGML_UNUSED(nb12);
GGML_UNUSED(nb13); GGML_UNUSED(nb21); GGML_UNUSED(nb22);
GGML_UNUSED(nb23);
GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
max_bias, m0, m1, n_head_log2, logit_softcap,
ne00, ne01, ne02, ne03,
nb01, nb02, nb03,
ne10, ne11, ne12, ne13,
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
nb31, nb32, nb33);
NO_DEVICE_CODE;
#endif // defined(FLASH_ATTN_AVAILABLE) && defined(FP16_AVAILABLE)
}

View File

@@ -38,17 +38,15 @@ static __global__ void flash_attn_tile_ext_f32(
return;
#endif // FP16_MMA_AVAILABLE
if (use_logit_softcap && !(D == 128 || D == 256)) {
GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
GGML_UNUSED(dst); GGML_UNUSED(dst_meta);
GGML_UNUSED(scale); GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
max_bias, m0, m1, n_head_log2, logit_softcap,
ne00, ne01, ne02, ne03,
nb01, nb02, nb03,
ne10, ne11, ne12, ne13,
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
nb31, nb32, nb33);
NO_DEVICE_CODE;
return;
}
@@ -312,17 +310,15 @@ static __global__ void flash_attn_tile_ext_f32(
}
}
#else
GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask);
GGML_UNUSED(dst); GGML_UNUSED(dst_meta);
GGML_UNUSED(scale); GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
max_bias, m0, m1, n_head_log2, logit_softcap,
ne00, ne01, ne02, ne03,
nb01, nb02, nb03,
ne10, ne11, ne12, ne13,
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
nb31, nb32, nb33);
NO_DEVICE_CODE;
#endif // FLASH_ATTN_AVAILABLE
}

View File

@@ -349,17 +349,15 @@ static __global__ void flash_attn_vec_ext_f16(
dst_meta[((sequence*ne01 + ic0 + tid)*ne02 + head)*gridDim.y + blockIdx.y] = make_float2(kqmax[tid], kqsum[tid]);
}
#else
GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
GGML_UNUSED(dst); GGML_UNUSED(dst_meta);
GGML_UNUSED(scale); GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
max_bias, m0, m1, n_head_log2, logit_softcap,
ne00, ne01, ne02, ne03,
nb01, nb02, nb03,
ne10, ne11, ne12, ne13,
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
nb31, nb32, nb33);
NO_DEVICE_CODE;
#endif // defined(FLASH_ATTN_AVAILABLE) && defined(FP16_AVAILABLE)
}

View File

@@ -37,17 +37,15 @@ static __global__ void flash_attn_vec_ext_f32(
// Skip unused kernel variants for faster compilation:
if (use_logit_softcap && !(D == 128 || D == 256)) {
GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask);
GGML_UNUSED(dst); GGML_UNUSED(dst_meta); GGML_UNUSED(scale);
GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02);
GGML_UNUSED(ne03); GGML_UNUSED(ne10); GGML_UNUSED(ne11);
GGML_UNUSED(ne12); GGML_UNUSED(ne13); GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33); GGML_UNUSED(nb01); GGML_UNUSED(nb02);
GGML_UNUSED(nb03); GGML_UNUSED(nb11); GGML_UNUSED(nb12);
GGML_UNUSED(nb13); GGML_UNUSED(nb21); GGML_UNUSED(nb22);
GGML_UNUSED(nb23);
GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
max_bias, m0, m1, n_head_log2, logit_softcap,
ne00, ne01, ne02, ne03,
nb01, nb02, nb03,
ne10, ne11, ne12, ne13,
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
nb31, nb32, nb33);
NO_DEVICE_CODE;
return;
}
@@ -345,17 +343,15 @@ static __global__ void flash_attn_vec_ext_f32(
dst_meta[((sequence*ne01 + ic0 + tid)*ne02 + head)*gridDim.y + blockIdx.y] = make_float2(kqmax[tid], kqsum[tid]);
}
#else
GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask);
GGML_UNUSED(dst); GGML_UNUSED(dst_meta); GGML_UNUSED(scale);
GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33);
GGML_UNUSED(nb31); GGML_UNUSED(nb32); GGML_UNUSED(nb33);
GGML_UNUSED(nb01); GGML_UNUSED(nb02); GGML_UNUSED(nb03);
GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
max_bias, m0, m1, n_head_log2, logit_softcap,
ne00, ne01, ne02, ne03,
nb01, nb02, nb03,
ne10, ne11, ne12, ne13,
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
nb31, nb32, nb33);
NO_DEVICE_CODE;
#endif // FLASH_ATTN_AVAILABLE
}

View File

@@ -471,16 +471,15 @@ static __global__ void flash_attn_ext_f16(
dst_meta[j_dst_unrolled] = dst_meta_val;
}
#else
GGML_UNUSED(Q); GGML_UNUSED(K); GGML_UNUSED(V); GGML_UNUSED(mask); GGML_UNUSED(sinks);
GGML_UNUSED(dst); GGML_UNUSED(dst_meta); GGML_UNUSED(scale);
GGML_UNUSED(max_bias); GGML_UNUSED(m0); GGML_UNUSED(m1);
GGML_UNUSED(n_head_log2); GGML_UNUSED(logit_softcap);
GGML_UNUSED(ne00); GGML_UNUSED(ne01); GGML_UNUSED(ne02); GGML_UNUSED(ne03);
GGML_UNUSED(ne10); GGML_UNUSED(ne11); GGML_UNUSED(ne12); GGML_UNUSED(ne13);
GGML_UNUSED(ne31); GGML_UNUSED(ne32); GGML_UNUSED(ne33); GGML_UNUSED(nb31);
GGML_UNUSED(nb32); GGML_UNUSED(nb33); GGML_UNUSED(nb01); GGML_UNUSED(nb02);
GGML_UNUSED(nb03); GGML_UNUSED(nb11); GGML_UNUSED(nb12); GGML_UNUSED(nb13);
GGML_UNUSED(nb21); GGML_UNUSED(nb22); GGML_UNUSED(nb23);
GGML_UNUSED_VARS(Q, K, V, mask, sinks, KV_max, dst, dst_meta, scale,
max_bias, m0, m1, n_head_log2, logit_softcap,
ne00, ne01, ne02, ne03,
nb01, nb02, nb03,
ne10, ne11, ne12, ne13,
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
nb31, nb32, nb33);
NO_DEVICE_CODE;
#endif // defined(FLASH_ATTN_AVAILABLE) && (__CUDA_ARCH__ == GGML_CUDA_CC_VOLTA || (defined(GGML_HIP_ROCWMMA_FATTN) && defined(FP16_MMA_AVAILABLE)))
}

View File

@@ -190,7 +190,7 @@ static void ggml_cuda_flash_attn_ext_vec_f16(ggml_backend_cuda_context & ctx, gg
FATTN_VEC_F16_CASE(256, GGML_TYPE_F16, GGML_TYPE_F16)
#endif // GGML_CUDA_FA_ALL_QUANTS
on_no_fattn_vec_case(Q->ne[0]);
GGML_ABORT("fatal error");
}
#define FATTN_VEC_F32_CASE(D, type_K, type_V) \
@@ -265,74 +265,184 @@ static void ggml_cuda_flash_attn_ext_vec_f32(ggml_backend_cuda_context & ctx, gg
FATTN_VEC_F32_CASE(256, GGML_TYPE_F16, GGML_TYPE_F16)
#endif // GGML_CUDA_FA_ALL_QUANTS
on_no_fattn_vec_case(Q->ne[0]);
GGML_ABORT("fatal error");
}
void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
// Best FlashAttention kernel for a specific GPU:
enum best_fattn_kernel {
BEST_FATTN_KERNEL_NONE = 0,
BEST_FATTN_KERNEL_TILE_F32 = 200,
BEST_FATTN_KERNEL_TILE_F16 = 210,
BEST_FATTN_KERNEL_VEC_F32 = 100,
BEST_FATTN_KERNEL_VEC_F16 = 110,
BEST_FATTN_KERNEL_WMMA_F16 = 300,
BEST_FATTN_KERNEL_MMA_F16 = 400,
};
static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const ggml_tensor * dst) {
#ifndef FLASH_ATTN_AVAILABLE
GGML_UNUSED(device); GGML_UNUSED(dst);
return BEST_FATTN_KERNEL_NONE;
#endif// FLASH_ATTN_AVAILABLE
const ggml_tensor * KQV = dst;
const ggml_tensor * Q = dst->src[0];
const ggml_tensor * K = dst->src[1];
const ggml_tensor * V = dst->src[2];
const ggml_tensor * mask = dst->src[3];
ggml_cuda_set_device(ctx.device);
const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
const int gqa_ratio = Q->ne[2] / K->ne[2];
GGML_ASSERT(Q->ne[2] % K->ne[2] == 0);
const int cc = ggml_cuda_info().devices[device].cc;
const int warp_size = ggml_cuda_info().devices[device].warp_size;
const enum ggml_prec prec = ggml_flash_attn_ext_get_prec(KQV);
#if defined(GGML_HIP_ROCWMMA_FATTN)
if (GGML_CUDA_CC_IS_AMD(cc) && fp16_mma_available(cc)) {
ggml_cuda_flash_attn_ext_wmma_f16(ctx, dst);
return;
}
#endif // defined(GGML_HIP_ROCWMMA_FATTN)
if (!fast_fp16_available(cc)) {
if (Q->ne[1] <= 8 || Q->ne[0] == 256) {
ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
} else {
ggml_cuda_flash_attn_ext_tile_f32(ctx, dst);
}
return;
}
if (!fp16_mma_available(cc)) {
if (prec == GGML_PREC_DEFAULT) {
if (Q->ne[1] <= 8 || Q->ne[0] == 256) {
ggml_cuda_flash_attn_ext_vec_f16(ctx, dst);
} else {
ggml_cuda_flash_attn_ext_tile_f16(ctx, dst);
switch (K->ne[0]) {
case 64:
case 128:
case 256:
if (V->ne[0] != K->ne[0]) {
return BEST_FATTN_KERNEL_NONE;
}
} else {
if (Q->ne[1] <= 8 || Q->ne[0] == 256) {
ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
} else {
ggml_cuda_flash_attn_ext_tile_f32(ctx, dst);
break;
case 80:
case 96:
case 112:
if (V->ne[0] != K->ne[0]) {
return BEST_FATTN_KERNEL_NONE;
}
}
return;
if (!fp16_mma_available(cc) && !turing_mma_available(cc)) {
return BEST_FATTN_KERNEL_NONE;
}
break;
case 576:
if (V->ne[0] != 512) {
return BEST_FATTN_KERNEL_NONE;
}
if (!turing_mma_available(cc) || gqa_ratio % 16 != 0) {
return BEST_FATTN_KERNEL_NONE;
}
break;
default:
return BEST_FATTN_KERNEL_NONE;
}
#ifndef GGML_CUDA_FA_ALL_QUANTS
if (K->type != V->type) {
return BEST_FATTN_KERNEL_NONE;
}
#endif // GGML_CUDA_FA_ALL_QUANTS
switch (K->type) {
case GGML_TYPE_F16:
break;
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
case GGML_TYPE_Q5_1:
#ifndef GGML_CUDA_FA_ALL_QUANTS
return BEST_FATTN_KERNEL_NONE;
#endif // GGML_CUDA_FA_ALL_QUANTS
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
#ifdef GGML_CUDA_FA_ALL_QUANTS
if (K->ne[0] != 128 && K->ne[0] != 64) {
return BEST_FATTN_KERNEL_NONE;
}
#else
if (K->ne[0] != 128) {
return BEST_FATTN_KERNEL_NONE;
}
#endif // GGML_CUDA_FA_ALL_QUANTS
break;
default:
return BEST_FATTN_KERNEL_NONE;
}
switch (V->type) {
case GGML_TYPE_F16:
break;
case GGML_TYPE_Q4_1:
case GGML_TYPE_Q5_0:
case GGML_TYPE_Q5_1:
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q8_0:
if (K->ne[0] != 128) {
return BEST_FATTN_KERNEL_NONE;
}
break;
default:
return BEST_FATTN_KERNEL_NONE;
}
if (mask && mask->ne[2] != 1) {
return BEST_FATTN_KERNEL_NONE;
}
const bool gqa_opt_applies = ((Q->ne[2] / K->ne[2]) % 2 == 0) && mask; // The mma-based kernels have GQA-specific optimizations
const bool mma_needs_data_conversion = K->type != GGML_TYPE_F16 || V->type != GGML_TYPE_F16;
const bool mma_faster_for_rtx4000 = Q->ne[3] > 1 || (Q->ne[2] > 4*K->ne[2] && K->ne[1] >= 8192);
const bool mma_faster_for_bs1 = turing_mma_available(cc) && gqa_opt_applies && !mma_needs_data_conversion &&
(cc < GGML_CUDA_CC_ADA_LOVELACE || mma_faster_for_rtx4000);
const bool can_use_vector_kernel = Q->ne[0] <= 256 && Q->ne[0] % (2*warp_size) == 0;
if (Q->ne[1] == 1 && can_use_vector_kernel && !mma_faster_for_bs1) {
if (prec == GGML_PREC_DEFAULT) {
ggml_cuda_flash_attn_ext_vec_f16(ctx, dst);
} else {
ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
// If Turing tensor cores available, use them except for some cases with batch size 1:
if (turing_mma_available(cc)) {
const bool gqa_opt_applies = gqa_ratio % 2 == 0 && mask; // The mma-based kernels have GQA-specific optimizations
const bool mma_needs_data_conversion = K->type != GGML_TYPE_F16 || V->type != GGML_TYPE_F16;
const bool mma_faster_for_rtx4000 = Q->ne[3] > 1 || (gqa_ratio > 4 && K->ne[1] >= 8192);
const bool mma_faster_for_bs1 = gqa_opt_applies && !mma_needs_data_conversion &&
(cc < GGML_CUDA_CC_ADA_LOVELACE || mma_faster_for_rtx4000);
if (Q->ne[1] == 1 && can_use_vector_kernel && !mma_faster_for_bs1) {
if (prec == GGML_PREC_DEFAULT && fast_fp16_available(cc)) {
return BEST_FATTN_KERNEL_VEC_F16;
}
return BEST_FATTN_KERNEL_VEC_F32;
}
return;
return BEST_FATTN_KERNEL_MMA_F16;
}
// The MMA implementation needs Turing or newer, use the old WMMA code for Volta:
if (fp16_mma_available(cc) && !turing_mma_available(cc)) {
ggml_cuda_flash_attn_ext_wmma_f16(ctx, dst);
return;
// Use kernels specializes for small batch sizes if possible:
if (Q->ne[1] <= 8 && can_use_vector_kernel) {
if (prec == GGML_PREC_DEFAULT && fast_fp16_available(cc)) {
return BEST_FATTN_KERNEL_VEC_F16;
}
return BEST_FATTN_KERNEL_VEC_F32;
}
ggml_cuda_flash_attn_ext_mma_f16(ctx, dst);
// For large batch sizes, use the WMMA kernel if possible:
if (fp16_mma_available(cc)) {
return BEST_FATTN_KERNEL_WMMA_F16;
}
// If there is no suitable kernel for tensor cores or small batch sizes, use the generic kernel for large batch sizes:
if (prec == GGML_PREC_DEFAULT && fast_fp16_available(cc)) {
return BEST_FATTN_KERNEL_TILE_F16;
}
return BEST_FATTN_KERNEL_TILE_F32;
}
void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
ggml_cuda_set_device(ctx.device);
switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) {
case BEST_FATTN_KERNEL_NONE:
GGML_ABORT("fatal error");
case BEST_FATTN_KERNEL_TILE_F32:
ggml_cuda_flash_attn_ext_tile_f32(ctx, dst);
break;
case BEST_FATTN_KERNEL_TILE_F16:
ggml_cuda_flash_attn_ext_tile_f16(ctx, dst);
break;
case BEST_FATTN_KERNEL_VEC_F32:
ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
break;
case BEST_FATTN_KERNEL_VEC_F16:
ggml_cuda_flash_attn_ext_vec_f16(ctx, dst);
break;
case BEST_FATTN_KERNEL_WMMA_F16:
ggml_cuda_flash_attn_ext_wmma_f16(ctx, dst);
break;
case BEST_FATTN_KERNEL_MMA_F16:
ggml_cuda_flash_attn_ext_mma_f16(ctx, dst);
break;
}
}
bool ggml_cuda_flash_attn_ext_supported(int device, const ggml_tensor * dst) {
return ggml_cuda_get_best_fattn_kernel(device, dst) != BEST_FATTN_KERNEL_NONE;
}

View File

@@ -1,3 +1,5 @@
#include "common.cuh"
void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
bool ggml_cuda_flash_attn_ext_supported(int device, const ggml_tensor * dst);

View File

@@ -32,7 +32,7 @@ static __global__ void k_get_rows(
const int y_offset = qr == 1 ? 1 : qk/2;
// dequantize
dfloat2 v;
float2 v;
dequantize_kernel(src0_row, ib, iqs, v);
dst_row[iybs + iqs + 0] = ggml_cuda_cast<dst_t>(v.x);

View File

@@ -49,6 +49,7 @@
#include "ggml-cuda/wkv.cuh"
#include "ggml-cuda/gla.cuh"
#include "ggml-cuda/set-rows.cuh"
#include "ggml-cuda/pad_reflect_1d.cuh"
#include "ggml.h"
#include <algorithm>
@@ -1328,9 +1329,7 @@ static void ggml_cuda_op_mul_mat_cublas(
&beta, dst_dd_i, ldc));
}
GGML_UNUSED(dst);
GGML_UNUSED(src1_ddq_i);
GGML_UNUSED(src1_padded_row_size);
GGML_UNUSED_VARS(dst, src1_ddq_i, src1_padded_row_size);
}
static void ggml_cuda_set_peer_access(const int n_tokens, int main_device) {
@@ -2354,6 +2353,9 @@ static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct gg
case GGML_OP_PAD:
ggml_cuda_op_pad(ctx, dst);
break;
case GGML_OP_PAD_REFLECT_1D:
ggml_cuda_op_pad_reflect_1d(ctx, dst);
break;
case GGML_OP_ARANGE:
ggml_cuda_op_arange(ctx, dst);
break;
@@ -3483,15 +3485,16 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_OP_CONV_TRANSPOSE_2D:
case GGML_OP_POOL_2D:
case GGML_OP_SUM:
case GGML_OP_SUM_ROWS:
case GGML_OP_MEAN:
case GGML_OP_ARGSORT:
case GGML_OP_ACC:
return true;
case GGML_OP_SUM_ROWS:
case GGML_OP_MEAN:
case GGML_OP_GROUP_NORM:
return ggml_is_contiguous(op->src[0]);
case GGML_OP_UPSCALE:
case GGML_OP_PAD:
case GGML_OP_PAD_REFLECT_1D:
case GGML_OP_ARANGE:
case GGML_OP_TIMESTEP_EMBEDDING:
case GGML_OP_LEAKY_RELU:
@@ -3499,44 +3502,8 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_OP_GATED_LINEAR_ATTN:
case GGML_OP_RWKV_WKV7:
return true;
case GGML_OP_FLASH_ATTN_EXT: {
#ifndef FLASH_ATTN_AVAILABLE
return false;
#endif // FLASH_ATTN_AVAILABLE
if (op->src[1]->ne[0] != op->src[2]->ne[0]) {
const int cc = ggml_cuda_info().devices[dev_ctx->device].cc;
if (!turing_mma_available(cc)) {
return false;
}
const int gqa_ratio = op->src[0]->ne[2] / op->src[1]->ne[2];
return op->src[1]->ne[0] == 576 && op->src[2]->ne[0] == 512 && op->src[3] && gqa_ratio % 16 == 0;
}
// TODO: more general-purpose attention sink support [TAG_ATTN_SINKS]
if (op->src[4] && !fp16_mma_available(ggml_cuda_info().devices[dev_ctx->device].cc)
&& op->src[0]->ne[0] != 64 && op->src[0]->ne[0] != 128) {
return false;
}
if (op->src[0]->ne[0] == 192) {
return false;
}
if (op->src[1]->type == GGML_TYPE_BF16 || op->src[2]->type == GGML_TYPE_BF16) {
return false;
}
if (op->src[0]->ne[0] == 64 && op->src[1]->type == GGML_TYPE_F16) {
return true;
}
if (op->src[0]->ne[0] == 128) {
return true;
}
if (op->src[0]->ne[0] == 256 && op->src[1]->type == GGML_TYPE_F16 && op->src[2]->type == GGML_TYPE_F16) {
return true;
}
if (op->src[3] && op->src[3]->ne[2] != 1) {
return false;
}
return fp16_mma_available(ggml_cuda_info().devices[dev_ctx->device].cc) &&
op->src[1]->type == GGML_TYPE_F16 && op->src[2]->type == GGML_TYPE_F16;
}
case GGML_OP_FLASH_ATTN_EXT:
return ggml_cuda_flash_attn_ext_supported(dev_ctx->device, op);
case GGML_OP_CROSS_ENTROPY_LOSS:
case GGML_OP_CROSS_ENTROPY_LOSS_BACK:
case GGML_OP_OPT_STEP_ADAMW:
@@ -3672,10 +3639,6 @@ static ggml_backend_feature * ggml_backend_cuda_get_features(ggml_backend_reg_t
features.push_back({ "NO_PEER_COPY", "1" });
#endif
#ifdef GGML_CUDA_F16
features.push_back({ "F16", "1" });
#endif
#ifdef GGML_CUDA_USE_GRAPHS
features.push_back({ "USE_GRAPHS", "1" });
#endif

View File

@@ -291,9 +291,7 @@ namespace ggml_cuda_mma {
: "=r"(xi[0]), "=r"(xi[2]), "=r"(xi[1]), "=r"(xi[3])
: "l"(xs));
#else
GGML_UNUSED(t);
GGML_UNUSED(xs0);
GGML_UNUSED(stride);
GGML_UNUSED_VARS(t, xs0, stride);
NO_DEVICE_CODE;
#endif // TURING_MMA_AVAILABLE
}
@@ -315,9 +313,7 @@ namespace ggml_cuda_mma {
: "r"(A.x[1]), "r"(B.x[0]));
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // TURING_MMA_AVAILABLE
}
@@ -345,9 +341,7 @@ namespace ggml_cuda_mma {
: "r"(A.x[3]), "r"(B.x[1]));
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // TURING_MMA_AVAILABLE
}
@@ -372,9 +366,7 @@ namespace ggml_cuda_mma {
: "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[1]));
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // TURING_MMA_AVAILABLE
}
@@ -408,9 +400,7 @@ namespace ggml_cuda_mma {
: "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[3]));
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // TURING_MMA_AVAILABLE
}
@@ -425,9 +415,7 @@ namespace ggml_cuda_mma {
: "+r"(Dxi[0]), "+r"(Dxi[1]), "+r"(Dxi[2]), "+r"(Dxi[3])
: "r"(Axi[0]), "r"(Axi[1]), "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[0]), "r"(Bxi[1]));
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // AMPERE_MMA_AVAILABLE
}
@@ -452,9 +440,7 @@ namespace ggml_cuda_mma {
: "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[1]));
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // TURING_MMA_AVAILABLE
}
@@ -469,9 +455,7 @@ namespace ggml_cuda_mma {
: "+r"(Dxi[0]), "+r"(Dxi[1]), "+r"(Dxi[2]), "+r"(Dxi[3])
: "r"(Axi[0]), "r"(Axi[1]), "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[0]), "r"(Bxi[1]));
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // AMPERE_MMA_AVAILABLE
}
@@ -505,9 +489,7 @@ namespace ggml_cuda_mma {
: "r"(Axi[2]), "r"(Axi[3]), "r"(Bxi[3]));
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_AMPERE
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // TURING_MMA_AVAILABLE
}
@@ -533,9 +515,7 @@ namespace ggml_cuda_mma {
0, 0, 0);
#endif // defined(CDNA3)
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // AMD_MFMA_AVAILABLE
}
@@ -561,9 +541,7 @@ namespace ggml_cuda_mma {
0, 0, 0);
#endif // defined(CDNA3)
#else
GGML_UNUSED(D);
GGML_UNUSED(A);
GGML_UNUSED(B);
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // AMD_MFMA_AVAILABLE
}

View File

@@ -132,11 +132,11 @@ static __global__ void mul_mat_f(
dst[j*stride_col_dst + row0 + threadIdx.x] = sum;
}
#else
GGML_UNUSED_VARS(x, y, ids, dst,
ncols, nchannels_y, stride_row, stride_col_y, stride_col_dst,
channel_ratio, stride_channel_x, stride_channel_y, stride_channel_dst,
sample_ratio, stride_sample_x, stride_sample_y, stride_sample_dst);
NO_DEVICE_CODE;
GGML_UNUSED(x); GGML_UNUSED(y); GGML_UNUSED(ids); GGML_UNUSED(dst);
GGML_UNUSED(ncols); GGML_UNUSED(nchannels_y); GGML_UNUSED(stride_row); GGML_UNUSED(stride_col_y); GGML_UNUSED(stride_col_dst);
GGML_UNUSED(channel_ratio); GGML_UNUSED(stride_channel_x); GGML_UNUSED(stride_channel_y); GGML_UNUSED(stride_channel_dst);
GGML_UNUSED(sample_ratio); GGML_UNUSED(stride_sample_x); GGML_UNUSED(stride_sample_y); GGML_UNUSED(stride_sample_dst);
#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
}
@@ -151,7 +151,6 @@ static void mul_mat_f_cuda(
cudaStream_t stream) {
typedef tile<16, 8, T> tile_A;
typedef tile< 8, 8, T> tile_B;
typedef tile<16, 8, float> tile_C;
GGML_ASSERT(!ids && "mul_mat_id not implemented");
@@ -352,9 +351,6 @@ void ggml_cuda_mul_mat_f(ggml_backend_cuda_context & ctx, const ggml_tensor * sr
GGML_ASSERT(!ids || ids->nb[0] == ggml_type_size(ids->type));
GGML_ASSERT( nb0 == ts_dst);
const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
const enum ggml_prec prec = fast_fp16_available(cc) ? ggml_prec(dst->op_params[0]) : GGML_PREC_F32;
const float * src1_d = (const float *) src1->data;
const int32_t * ids_d = ids ? (const int32_t *) ids->data : nullptr;
float * dst_d = (float *) dst->data;

View File

@@ -266,10 +266,7 @@ void ggml_cuda_op_mul_mat_q(
ggml_cuda_mul_mat_q_switch_type(ctx, args, stream);
GGML_UNUSED(src1);
GGML_UNUSED(dst);
GGML_UNUSED(src1_ddf_i);
GGML_UNUSED(src1_padded_row_size);
GGML_UNUSED_VARS(src1, dst, src1_ddf_i, src1_padded_row_size);
}
bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11) {

View File

@@ -1255,7 +1255,7 @@ static __device__ __forceinline__ void vec_dot_q8_0_16_q8_1_mma(
}
}
#else
GGML_UNUSED(x); GGML_UNUSED(y); GGML_UNUSED(sum); GGML_UNUSED(k00);
GGML_UNUSED_VARS(x, y, sum, k00);
NO_DEVICE_CODE;
#endif // AMD_MFMA_AVAILABLE
}
@@ -1572,7 +1572,7 @@ static __device__ __forceinline__ void vec_dot_q2_K_q8_1_mma(
}
}
#else
GGML_UNUSED(x); GGML_UNUSED(y); GGML_UNUSED(sum); GGML_UNUSED(k00);
GGML_UNUSED_VARS(x, y, sum, k00);
NO_DEVICE_CODE;
#endif // AMD_MFMA_AVAILABLE
}
@@ -2301,7 +2301,7 @@ static __device__ __forceinline__ void vec_dot_q6_K_q8_1_mma(
}
}
#else
GGML_UNUSED(x); GGML_UNUSED(y); GGML_UNUSED(sum); GGML_UNUSED(k00);
GGML_UNUSED_VARS(x, y, sum, k00);
NO_DEVICE_CODE;
#endif // AMD_MFMA_AVAILABLE
}
@@ -2855,12 +2855,14 @@ static __device__ __forceinline__ void mmq_write_back_mma(
#else
typedef tile<16, 8, int> tile_C;
constexpr int rows_per_warp = 2 * granularity;
#endif
#endif // defined(AMD_MFMA_AVAILABLE)
constexpr int ntx = rows_per_warp/tile_C::I; // Number of x minitiles per warp.
const int i0 = (threadIdx.y / ntx) * (ntx*tile_C::I);
#if defined(TURING_MMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
static_assert(nwarps*tile_C::I == mmq_y, "nwarps*tile_C::I != mmq_y");
#else
GGML_UNUSED(nwarps);
#endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE)
#pragma unroll

View File

@@ -433,12 +433,7 @@ void ggml_cuda_op_mul_mat_vec_f(
GGML_ABORT("unsupported type: %s", ggml_type_name(src0->type));
}
GGML_UNUSED(ctx);
GGML_UNUSED(src1);
GGML_UNUSED(dst);
GGML_UNUSED(src1_ddq_i);
GGML_UNUSED(src1_ncols);
GGML_UNUSED(src1_padded_row_size);
GGML_UNUSED_VARS(ctx, src1, dst, src1_ddq_i, src1_ncols, src1_padded_row_size);
}
bool ggml_cuda_should_use_mmvf(enum ggml_type type, int cc, const int64_t * src0_ne, int64_t ne11) {

View File

@@ -596,9 +596,5 @@ void ggml_cuda_op_mul_mat_vec_q(
src0_dd_i, src0->type, src1_ddq_i, nullptr, dst_dd_i, ne00, row_diff, src1_ncols, stride_row_x, stride_col_y, nrows_dst,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, stream);
GGML_UNUSED(src1);
GGML_UNUSED(dst);
GGML_UNUSED(src1_ddf_i);
GGML_UNUSED(src1_ncols);
GGML_UNUSED(src1_padded_row_size);
GGML_UNUSED_VARS(src1, dst, src1_ddf_i, src1_ncols, src1_padded_row_size);
}

View File

@@ -0,0 +1,82 @@
#include "pad_reflect_1d.cuh"
static __global__ void pad_reflect_1d_kernel_f32(
const void * __restrict__ src0,
void * __restrict__ dst,
const int64_t ne0,
const int64_t ne00,
const int64_t ne01,
const int64_t ne02,
const int64_t ne03,
const int64_t nb00,
const int64_t nb01,
const int64_t nb02,
const int64_t nb03,
const int64_t nb0,
const int64_t nb1,
const int64_t nb2,
const int64_t nb3,
const int p0,
const int p1) {
const int64_t i3 = blockIdx.z;
const int64_t i2 = blockIdx.y;
const int64_t i1 = blockIdx.x;
if (i1 >= ne01 || i2 >= ne02 || i3 >= ne03) {
return;
}
const char * src0_ptr = (const char *)src0 + i3*nb03 + i2*nb02 + i1*nb01;
char * dst_ptr = (char *)dst + i3*nb3 + i2*nb2 + i1*nb1;
for (int64_t i0 = threadIdx.x; i0 < ne0; i0 += blockDim.x) {
float value;
if (i0 < p0) {
// Left padding - reflect
value = *(const float *)(src0_ptr + (p0 - i0) * nb00);
} else if (i0 < ne0 - p1) {
// Middle - copy
value = *(const float *)(src0_ptr + (i0 - p0) * nb00);
} else {
// Right padding - reflect
int64_t src_idx = (ne0 - p1 - p0) - (p1 + 1 - (ne0 - i0)) - 1;
value = *(const float *)(src0_ptr + src_idx * nb00);
}
*(float *)(dst_ptr + i0 * nb0) = value;
}
}
void ggml_cuda_op_pad_reflect_1d(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
const ggml_tensor * src0 = dst->src[0];
cudaStream_t stream = ctx.stream();
GGML_ASSERT(src0->type == GGML_TYPE_F32);
GGML_ASSERT(dst->type == GGML_TYPE_F32);
const int32_t * opts = (const int32_t *) dst->op_params;
const int p0 = opts[0];
const int p1 = opts[1];
const int64_t ne00 = src0->ne[0];
const int64_t ne01 = src0->ne[1];
const int64_t ne02 = src0->ne[2];
const int64_t ne03 = src0->ne[3];
const int64_t ne0 = dst->ne[0];
GGML_ASSERT(ne0 == ne00 + p0 + p1);
const dim3 block_dims(CUDA_PAD_REFLECT_1D_BLOCK_SIZE, 1, 1);
const dim3 grid_dims(ne01, ne02, ne03);
pad_reflect_1d_kernel_f32<<<grid_dims, block_dims, 0, stream>>>(
src0->data, dst->data,
ne0, ne00, ne01, ne02, ne03,
src0->nb[0], src0->nb[1], src0->nb[2], src0->nb[3],
dst->nb[0], dst->nb[1], dst->nb[2], dst->nb[3],
p0, p1
);
}

View File

@@ -0,0 +1,5 @@
#include "common.cuh"
#define CUDA_PAD_REFLECT_1D_BLOCK_SIZE 256
void ggml_cuda_op_pad_reflect_1d(ggml_backend_cuda_context & ctx, ggml_tensor * dst);

View File

@@ -39,7 +39,7 @@ static __global__ void reduce_rows_f32(const float * __restrict__ x, float * __r
}
__syncthreads();
sum = 0.0f;
if (lane_id < (blockDim.x / WARP_SIZE)) {
if (lane_id < (static_cast<int>(blockDim.x) / WARP_SIZE)) {
sum = s_sum[lane_id];
}
sum = warp_reduce_sum(sum);

View File

@@ -87,7 +87,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q4_1_q8_1_imp
sumi = ggml_cuda_dp4a(vi1, u[2*i+1], sumi);
}
#ifdef GGML_CUDA_F16
#ifdef FAST_FP16_AVAILABLE
const float2 tmp = __half22float2(__hmul2(dm4, ds8));
const float d4d8 = tmp.x;
const float m4s8 = tmp.y;
@@ -96,7 +96,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q4_1_q8_1_imp
const float2 ds8f = __half22float2(ds8);
const float d4d8 = dm4f.x * ds8f.x;
const float m4s8 = dm4f.y * ds8f.y;
#endif // GGML_CUDA_F16
#endif // FAST_FP16_AVAILABLE
// scale second part of sum by QI8_1/(vdr * QR4_1) to compensate for multiple threads adding it
return sumi * d4d8 + m4s8 / (QI8_1 / (vdr * QR4_1));
@@ -158,7 +158,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q5_1_q8_1_imp
sumi = ggml_cuda_dp4a(vi1, u[2*i+1], sumi); // SIMD dot product of quantized values
}
#ifdef GGML_CUDA_F16
#ifdef FAST_FP16_AVAILABLE
const float2 tmp = __half22float2(__hmul2(dm5, ds8));
const float d5d8 = tmp.x;
const float m5s8 = tmp.y;
@@ -167,7 +167,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q5_1_q8_1_imp
const float2 ds8f = __half22float2(ds8);
const float d5d8 = dm5f.x * ds8f.x;
const float m5s8 = dm5f.y * ds8f.y;
#endif // GGML_CUDA_F16
#endif // FAST_FP16_AVAILABLE
// scale second part of sum by QI5_1 / vdr to compensate for multiple threads adding it
return sumi*d5d8 + m5s8 / (QI5_1 / vdr);
@@ -201,7 +201,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q8_1_q8_1_imp
sumi = ggml_cuda_dp4a(v[i], u[i], sumi);
}
#ifdef GGML_CUDA_F16
#ifdef FAST_FP16_AVAILABLE
const float2 tmp = __half22float2(__hmul2(dm8, ds8));
const float d8d8 = tmp.x;
const float m8s8 = tmp.y;
@@ -210,7 +210,7 @@ template <int vdr> static __device__ __forceinline__ float vec_dot_q8_1_q8_1_imp
const float2 ds8f = __half22float2(ds8);
const float d8d8 = dm8f.x * ds8f.x;
const float m8s8 = dm8f.y * ds8f.y;
#endif // GGML_CUDA_F16
#endif // FAST_FP16_AVAILABLE
// scale second part of sum by QI8_1/ vdr to compensate for multiple threads adding it
return sumi*d8d8 + m8s8 / (QI8_1 / vdr);

View File

@@ -1846,7 +1846,7 @@ static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_contex
case GGML_OP_ROPE:
return true;
case GGML_OP_IM2COL:
return op->src[0]->type == GGML_TYPE_F16;
return op->src[1]->type == GGML_TYPE_F32 && (op->type == GGML_TYPE_F16 || op->type == GGML_TYPE_F32);
case GGML_OP_POOL_1D:
return false;
case GGML_OP_UPSCALE:
@@ -4703,7 +4703,6 @@ static int ggml_metal_encode_node(
{
GGML_ASSERT(ggml_is_contiguous(src0));
GGML_ASSERT(ggml_is_contiguous(src1));
GGML_ASSERT(src0->type == GGML_TYPE_F16);
GGML_ASSERT(src1->type == GGML_TYPE_F32);
GGML_ASSERT( dst->type == GGML_TYPE_F16 || dst->type == GGML_TYPE_F32);

Some files were not shown because too many files have changed in this diff Show More