Compare commits


12 Commits
b8175 ... b8187

Ruben Ortlam
feefb92836 vulkan: tune MMVQ for Intel Windows (#19988) 2026-03-02 15:58:25 +01:00
Adrien Gallouët
ec88c3ceea scripts : improve get-wikitext-2.sh (#19952)
* scripts : improve get-wikitext-2.sh

Switch to sh, add curl fallback, and avoid redundant downloads

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>

* fix indent

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-02 15:40:49 +01:00
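The pattern this commit message describes (plain POSIX sh, a curl fallback, no redundant downloads) can be sketched roughly as follows; the `fetch` helper and its arguments are illustrative and not the actual contents of get-wikitext-2.sh:

```shell
#!/bin/sh
# Hypothetical sketch of the pattern from the commit message above:
# plain POSIX sh, prefer wget, fall back to curl, and avoid a
# redundant download when the target file already exists.
fetch() {
    url=$1
    out=$2
    if [ -f "$out" ]; then
        echo "$out already exists, skipping download"
        return 0
    fi
    if command -v wget >/dev/null 2>&1; then
        wget -O "$out" "$url"
    elif command -v curl >/dev/null 2>&1; then
        curl -fL -o "$out" "$url"
    else
        echo "error: neither wget nor curl found" >&2
        return 1
    fi
}
```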
Aaron Teo
2afcdb9777 ggml-cpu: optimise s390x multiply extend instructions (#20032) 2026-03-02 16:23:56 +08:00
Ruben Ortlam
319146247e vulkan: improve partial offloading performance on AMD (#19976)
* vulkan: fix and enable cpy_tensor_async function

* use transfer_queue for async transfers on AMD, synchronize with timeline semaphore

* update offload_op logic

* fix missing transfer submission

* disable async transfer queue on AMD GCN

* revert op batch size change

* fix cpy_tensor_async checks
2026-03-01 17:32:14 +01:00
oobabooga
66d65ec29b cuda: cap grid.y at 65535 in non-contiguous dequantize/convert kernels (#19999) 2026-03-01 13:40:22 +08:00
Dmitry Atamanov
05728db18e vendors : update miniaudio library to 0.11.24 (#19914) 2026-02-28 16:10:01 +01:00
Adrien Gallouët
4720819d45 vendor : update cpp-httplib to 0.35.0 (#19969)
Signed-off-by: Adrien Gallouët <adrien@gallouet.fr>
2026-02-28 13:53:56 +01:00
Bartowski
d979f2b176 tests : model metadata loading from huggingface (#19796)
* Add model metadata loading from huggingface for use with other tests

* Add incremental chunking instead of full redownload, fix caching issue and add warning when it fails

* Add support for split models, load metadata from each individual split file, also avoid mmproj

* Code cleanup, revert incremental downloading

* Only compile when cpp-httplib has SSL support

* Fix formatting
2026-02-28 10:44:38 +01:00
Jayant Lohia
ecbcb7ea9d CUDA: add CDNA3 MFMA support for flash attention MMA kernel (#19806)
* CUDA: add CDNA3 MFMA support for flash attention MMA kernel

Add MI300X (gfx942) MFMA tensor core flash attention using
v_mfma_f32_16x16x16_f16 (FP16 in, FP32 accumulate).

- Add FATTN_WARP_SIZE=64 for CDNA wavefront64
- Add CDNA config for head sizes 64, 80, 96, 112, 128
- Add FP16 MFMA intrinsic path in mma.cuh
- Add manual V transpose load for MFMA register layout
- Route CDNA to MMA for prompt processing, VEC for token generation
- Fix Q loading and combine stride granularity for non-power-of-2 heads

Benchmarks (Qwen2.5-1.5B Q4_K_M, MI300X):
  pp512  +7%,  pp1024 +13%,  pp2048 +23%,  pp4096 +39%
  tg128  -10% (FA overhead, VEC used for both)

All 2480 flash attention tests pass.

Ref: https://github.com/ggml-org/llama.cpp/issues/17917

* address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch

- Replace #define FATTN_WARP_SIZE with constexpr int warp_size =
  ggml_cuda_get_physical_warp_size() in each device function
- Use ne[1]*gqa_ratio threshold for MMA vs tile dispatch. Benchmarked
  crossover on MI300X @ d32768 with power-of-2 GQA models:
    hsk=64  (Llama 1B, gqa=4): MMA wins at eff >= 128 (+11%)
    hsk=128 (Llama 3B, gqa=4): MMA wins at eff >= 128 (+4%)
  Unified threshold: eff_nq >= 128 for all head sizes.
- Remove VEC fallback; small batches fall through to tile kernel

* Update ggml/src/ggml-cuda/fattn.cu

* use ggml_cuda_info().devices warp_size instead of hardcoded check

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-02-27 19:37:26 +01:00
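The dispatch rule benchmarked in this commit can be modelled as a small predicate. A minimal sketch, assuming the crossover logic described above; `use_mma_kernel` and its parameter names are illustrative, not the actual llama.cpp symbols:

```c
#include <assert.h>
#include <stdbool.h>

// Illustrative model of the dispatch described above: route to the MMA
// kernel only when the effective number of Q columns, ne[1]*gqa_ratio,
// reaches the benchmarked crossover of 128; smaller batches fall
// through to the tile kernel. Names here are hypothetical.
static bool use_mma_kernel(int ne1, int gqa_ratio) {
    const int eff_nq = ne1 * gqa_ratio;
    return eff_nq >= 128;
}
```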
Roj234
3e6ab244ad server: Add pragma once to server-context.h (#19944) 2026-02-27 18:28:36 +01:00
Sami Kama
5596a35791 server: Mirroring /v1/responses to /responses to match /v1/chat/completions pattern (#19873) 2026-02-28 00:44:42 +08:00
Daniel Bevenius
8d3b962f47 ci : use ubuntu-latest for gguf-publish workflow (#19951)
This commit changes the runner for the gguf-publish workflow from
ubuntu-slim back to ubuntu-latest, reverting the change made in commit
142cbe2ac6 ("ci : use new 1vCPU runner for
lightweight jobs (#19107)").

The motivation for this is that the action used in the workflow depends
on the docker daemon, which does not seem to be available in the
ubuntu-slim runner. This is currently causing an error that prevents the
gguf-publish workflow from running successfully. Today was the first
time since the original change (I think) that the publish task has been
run, which may be why the issue was not noticed before.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/22481900566
2026-02-27 14:42:24 +01:00
19 changed files with 1756 additions and 550 deletions


@@ -21,7 +21,7 @@ on:
 jobs:
   deploy:
-    runs-on: ubuntu-slim
+    runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v6


@@ -181,11 +181,11 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     const int8x16_t v_yh = vec_xl(QK8_0/2, y[ib].qs);
     const int16x8_t v_xylso = vec_mulo(v_xls, v_yl);
-    const int16x8_t v_xylse = vec_mule(v_xls, v_yl);
+    const int16x8_t v_xyl = vec_meadd(v_xls, v_yl, v_xylso);
     const int16x8_t v_xyhso = vec_mulo(v_xhs, v_yh);
-    const int16x8_t v_xyhse = vec_mule(v_xhs, v_yh);
-    int16x8_t v_xy_ = v_xylso + v_xylse + v_xyhso + v_xyhse; v_xy_ += vec_reve(v_xy_);
+    const int16x8_t v_xyh = vec_meadd(v_xhs, v_yh, v_xyhso);
+    int16x8_t v_xy_ = v_xyl + v_xyh; v_xy_ += vec_reve(v_xy_);
     const float32x4_t v_xy = vec_float(vec_unpackh(v_xy_));
     const float32x4_t v_d = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
@@ -890,8 +890,7 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     const int16x8_t v_minsh = (int16x8_t)vec_unpackh((uint8x16_t)v_mins8);
     const int32x4_t v_minso = vec_mulo(v_ysums, v_minsh);
-    const int32x4_t v_minse = vec_mule(v_ysums, v_minsh);
-    const int32x4_t v_mins = v_minso + v_minse;
+    const int32x4_t v_mins = vec_meadd(v_ysums, v_minsh, v_minso);
     sumf -= dmin * (v_mins[0] + v_mins[1] + v_mins[2] + v_mins[3]);
     const uint8_t * scales = (const uint8_t *)utmp;
@@ -1004,8 +1003,7 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     const int16x8_t v_minsh = (int16x8_t)vec_unpackh(v_mins8);
     const int32x4_t v_minsho = vec_mulo(v_ysums, v_minsh);
-    const int32x4_t v_minshe = vec_mule(v_ysums, v_minsh);
-    const int32x4_t v_mins = vec_add(v_minsho, v_minshe);
+    const int32x4_t v_mins = vec_meadd(v_ysums, v_minsh, v_minsho);
     const int32_t mins = vec_hsum_i32x4(v_mins);
     const uint8_t * scales = (const uint8_t *)utmp;
@@ -1110,10 +1108,10 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     const int16x8_t v_scaleh = vec_unpackl(v_scale);
     const int32x4_t v_minslo = vec_mulo(v_ysumsl, v_scalel);
-    const int32x4_t v_minsle = vec_mule(v_ysumsl, v_scalel);
+    const int32x4_t v_minsl = vec_meadd(v_ysumsl, v_scalel, v_minslo);
     const int32x4_t v_minsho = vec_mulo(v_ysumsh, v_scaleh);
-    const int32x4_t v_minshe = vec_mule(v_ysumsh, v_scaleh);
-    const int32x4_t v_mins = v_minslo + v_minsle + v_minsho + v_minshe;
+    const int32x4_t v_minsh = vec_meadd(v_ysumsh, v_scaleh, v_minsho);
+    const int32x4_t v_mins = vec_add(v_minsl, v_minsh);
     const int32_t mins = vec_hsum_i32x4(v_mins);
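The s390x change above replaces a separate even widening multiply plus a vector add with the fused multiply-even-and-add intrinsic. A scalar model of the identity being exploited, assuming `vec_meadd(a, b, acc)` computes the even widening products of `a` and `b` plus `acc` (the helper names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

// Scalar model, one even/odd element pair. The old code computed the
// even and odd widening products separately and summed them; the new
// code folds the even multiply into the accumulator add.
// Assumed semantics: meadd(a, b, acc) = (int16_t)a*b + acc.
static int16_t widen_mul(int8_t a, int8_t b) {
    return (int16_t)((int16_t)a * (int16_t)b);
}
static int16_t meadd(int8_t a, int8_t b, int16_t acc) {
    return (int16_t)((int16_t)a * (int16_t)b + acc);
}
// old: mule(a0,b0) + mulo(a1,b1)   new: meadd(a0, b0, mulo(a1,b1))
static int16_t dot_pair_old(int8_t a0, int8_t a1, int8_t b0, int8_t b1) {
    return (int16_t)(widen_mul(a0, b0) + widen_mul(a1, b1));
}
static int16_t dot_pair_new(int8_t a0, int8_t a1, int8_t b0, int8_t b1) {
    return meadd(a0, b0, widen_mul(a1, b1));
}
```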


@@ -16,27 +16,27 @@ static __global__ void dequantize_block(const void * __restrict__ vx, dst_t * __
         return;
     }
-    const int64_t i01 = blockIdx.y;
-    for (int64_t i0203 = blockIdx.z; i0203 < ne0203; i0203 += gridDim.z) {
-        const uint2 dm = fast_div_modulo((uint32_t)i0203, ne02);
-        const int64_t i02 = dm.y;
-        const int64_t i03 = dm.x;
-        const int64_t ibx0 = i03*s03 + i02*s02 + i01*s01;
-        const int64_t ib = ibx0 + i00/qk; // block index
-        const int64_t iqs = (i00%qk)/qr; // quant index
-        const int64_t iybs = i00 - i00%qk; // y block start index
-        const int64_t y_offset = qr == 1 ? 1 : qk/2;
-        // dequantize
-        float2 v;
-        dequantize_kernel(vx, ib, iqs, v);
-        const int64_t iy0 = (i0203*ne01 + i01)*ne00 + iybs + iqs;
-        y[iy0 + 0]        = ggml_cuda_cast<dst_t>(v.x);
-        y[iy0 + y_offset] = ggml_cuda_cast<dst_t>(v.y);
+    for (int64_t i01 = blockIdx.y; i01 < ne01; i01 += gridDim.y) {
+        for (int64_t i0203 = blockIdx.z; i0203 < ne0203; i0203 += gridDim.z) {
+            const uint2 dm = fast_div_modulo((uint32_t)i0203, ne02);
+            const int64_t i02 = dm.y;
+            const int64_t i03 = dm.x;
+            const int64_t ibx0 = i03*s03 + i02*s02 + i01*s01;
+            const int64_t ib = ibx0 + i00/qk; // block index
+            const int64_t iqs = (i00%qk)/qr; // quant index
+            const int64_t iybs = i00 - i00%qk; // y block start index
+            const int64_t y_offset = qr == 1 ? 1 : qk/2;
+            // dequantize
+            float2 v;
+            dequantize_kernel(vx, ib, iqs, v);
+            const int64_t iy0 = (i0203*ne01 + i01)*ne00 + iybs + iqs;
+            y[iy0 + 0]        = ggml_cuda_cast<dst_t>(v.x);
+            y[iy0 + y_offset] = ggml_cuda_cast<dst_t>(v.y);
+        }
     }
 }
@@ -492,7 +492,7 @@ static void dequantize_block_cuda(const void * vx, dst_t * y,
         const int64_t s01, const int64_t s02, const int64_t s03, cudaStream_t stream) {
     const int64_t ne0203 = ne02*ne03;
     const uint3 ne02_fdv = init_fastdiv_values(ne02);
-    const dim3 num_blocks((ne00 + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE), ne01, (int)std::min(ne0203, (int64_t)65535));
+    const dim3 num_blocks((ne00 + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE), (int)std::min(ne01, (int64_t)65535), (int)std::min(ne0203, (int64_t)65535));
     dequantize_block<qk, qr, dequantize_kernel><<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>
         (vx, y, ne00, ne01, ne0203, ne02_fdv, s01, s02, s03);
 }
@@ -628,18 +628,18 @@ static __global__ void convert_unary(
         return;
     }
-    const int64_t i01 = blockIdx.y;
     const src_t * x = (const src_t *) vx;
-    for (int64_t i0203 = blockIdx.z; i0203 < ne0203; i0203 += gridDim.z) {
-        const uint2 dm = fast_div_modulo((uint32_t)i0203, ne02);
-        const int64_t i02 = dm.y;
-        const int64_t i03 = dm.x;
-        const int64_t ix = i03*s03 + i02*s02 + i01*s01 + i00;
-        const int64_t iy = (i0203*ne01 + i01)*ne00 + i00;
-        y[iy] = ggml_cuda_cast<dst_t>(x[ix]);
+    for (int64_t i01 = blockIdx.y; i01 < ne01; i01 += gridDim.y) {
+        for (int64_t i0203 = blockIdx.z; i0203 < ne0203; i0203 += gridDim.z) {
+            const uint2 dm = fast_div_modulo((uint32_t)i0203, ne02);
+            const int64_t i02 = dm.y;
+            const int64_t i03 = dm.x;
+            const int64_t ix = i03*s03 + i02*s02 + i01*s01 + i00;
+            const int64_t iy = (i0203*ne01 + i01)*ne00 + i00;
+            y[iy] = ggml_cuda_cast<dst_t>(x[ix]);
+        }
     }
 }
@@ -649,7 +649,7 @@ static void convert_unary_cuda(const void * vx, dst_t * y,
         const int64_t s01, const int64_t s02, const int64_t s03, cudaStream_t stream) {
     const int64_t ne0203 = ne02*ne03;
     const uint3 ne02_fdv = init_fastdiv_values(ne02);
-    const dim3 num_blocks((ne00 + CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / CUDA_DEQUANTIZE_BLOCK_SIZE, ne01, (int)std::min(ne0203, (int64_t)65535));
+    const dim3 num_blocks((ne00 + CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / CUDA_DEQUANTIZE_BLOCK_SIZE, (int)std::min(ne01, (int64_t)65535), (int)std::min(ne0203, (int64_t)65535));
     convert_unary<src_t><<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>
         (vx, y, ne00, ne01, ne0203, ne02_fdv, s01, s02, s03);
 }


@@ -111,6 +111,44 @@ static constexpr __host__ __device__ fattn_mma_config ggml_cuda_fattn_mma_get_co
return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols);
}
static constexpr __host__ __device__ fattn_mma_config ggml_cuda_fattn_mma_get_config_cdna(const int DKQ, const int DV, const int ncols) {
// Conservative configs for CDNA (MI100+): 64KB LDS, wavefront64, nstages=1 (no cp.async).
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 64, 64, 8, 128, 2, 128, 32, 32, 32, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 64, 64, 16, 128, 2, 64, 32, 32, 32, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 64, 64, 32, 128, 2, 64, 32, 32, 32, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 64, 64, 64, 256, 2, 64, 32, 32, 32, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 80, 80, 8, 128, 2, 128, 40, 40, 40, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 80, 80, 16, 128, 2, 64, 40, 40, 40, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 80, 80, 32, 128, 2, 64, 40, 40, 40, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 80, 80, 64, 256, 2, 64, 40, 40, 40, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 96, 96, 8, 128, 2, 128, 48, 48, 48, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 96, 96, 16, 128, 2, 64, 48, 48, 48, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 96, 96, 32, 128, 2, 64, 48, 48, 48, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE( 96, 96, 64, 256, 2, 64, 48, 48, 48, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(112, 112, 8, 128, 2, 128, 56, 56, 56, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(112, 112, 16, 128, 2, 64, 56, 56, 56, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(112, 112, 32, 128, 2, 64, 56, 56, 56, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(112, 112, 64, 256, 2, 64, 56, 56, 56, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(128, 128, 8, 128, 2, 128, 64, 64, 64, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(128, 128, 16, 128, 2, 64, 64, 64, 64, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(128, 128, 32, 128, 2, 64, 64, 64, 64, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(128, 128, 64, 256, 2, 64, 64, 64, 64, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 8, 64, 4, 64, 128, 128, 128, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 16, 64, 4, 32, 128, 128, 128, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 32, 128, 2, 32, 128, 128, 128, 1, true);
GGML_CUDA_FATTN_MMA_CONFIG_CASE(256, 256, 64, 256, 2, 32, 128, 128, 128, 1, true);
// Fallback for unsupported DKQ values (e.g. 576). Must return non-zero values to satisfy
// compile-time static_asserts even though the kernel guard prevents runtime execution.
// nthreads=256 gives nwarps=4 (warp_size=64) or 8 (warp_size=32), nbatch_fa=128 satisfies np*16 divisibility.
return fattn_mma_config(256, 1, 128, 4, 4, 4, 1, false);
}
static __host__ fattn_mma_config ggml_cuda_fattn_mma_get_config(const int DKQ, const int DV, const int ncols, const int cc) {
if (ampere_mma_available(cc)) {
return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols);
@@ -118,6 +156,9 @@ static __host__ fattn_mma_config ggml_cuda_fattn_mma_get_config(const int DKQ, c
if (turing_mma_available(cc)) {
return ggml_cuda_fattn_mma_get_config_turing(DKQ, DV, ncols);
}
if (amd_mfma_available(cc)) {
return ggml_cuda_fattn_mma_get_config_cdna(DKQ, DV, ncols);
}
if (amd_wmma_available(cc)) {
return ggml_cuda_fattn_mma_get_config_rdna(DKQ, DV, ncols);
}
@@ -130,6 +171,8 @@ static constexpr __device__ fattn_mma_config ggml_cuda_fattn_mma_get_config(cons
return ggml_cuda_fattn_mma_get_config_ampere(DKQ, DV, ncols);
#elif defined(TURING_MMA_AVAILABLE)
return ggml_cuda_fattn_mma_get_config_turing(DKQ, DV, ncols);
#elif defined(AMD_MFMA_AVAILABLE)
return ggml_cuda_fattn_mma_get_config_cdna(DKQ, DV, ncols);
#elif defined(VOLTA_MMA_AVAILABLE)
return ggml_cuda_fattn_mma_get_config_volta(DKQ, DV, ncols);
#elif defined(AMD_WMMA_AVAILABLE)
@@ -205,15 +248,15 @@ static constexpr __device__ bool ggml_cuda_fattn_mma_get_Q_in_reg(const int DKQ,
}
static constexpr __device__ int get_cols_per_thread() {
#if defined(AMD_WMMA_AVAILABLE)
return 1; // RDNA has a single column.
#if defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
return 1; // AMD has a single column per thread.
#else
return 2; // This is specifically KQ columns, Volta only has a single VKQ column.
#endif // defined(AMD_WMMA_AVAILABLE)
#endif // defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
}
static __host__ int get_cols_per_warp(const int cc) {
if (turing_mma_available(cc) || amd_wmma_available(cc)) {
if (turing_mma_available(cc) || amd_wmma_available(cc) || amd_mfma_available(cc)) {
return 16;
} else {
// Volta
@@ -241,6 +284,7 @@ static constexpr __device__ int ggml_cuda_fattn_mma_get_nstages(const int DKQ, c
template<int stride_tile, int nwarps, int nbatch_fa, bool use_cp_async, bool oob_check>
static __device__ __forceinline__ void flash_attn_ext_f16_load_tile(
const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int D2, const int stride_KV, const int i_sup) {
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
// K/V data is loaded with decreasing granularity for D for better memory bandwidth.
// The minimum granularity with cp.async is 16 bytes, with synchronous data loading it's 4 bytes.
if constexpr (use_cp_async) {
@@ -252,10 +296,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_tile(
const unsigned int tile_KV_32 = ggml_cuda_cvta_generic_to_shared(tile_KV);
auto load = [&] __device__ (auto n) {
const int stride_k = WARP_SIZE >> n;
const int k0_start = stride_k == WARP_SIZE ? 0 : chunks_per_row - chunks_per_row % (2*stride_k);
const int stride_k = warp_size >> n;
const int k0_start = stride_k == warp_size ? 0 : chunks_per_row - chunks_per_row % (2*stride_k);
const int k0_stop = chunks_per_row - chunks_per_row % (1*stride_k);
const int stride_i = WARP_SIZE / stride_k;
const int stride_i = warp_size / stride_k;
if (k0_start == k0_stop) {
return;
@@ -263,7 +307,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_tile(
#pragma unroll
for (int i0 = 0; i0 < nbatch_fa; i0 += nwarps*stride_i) {
const int i = i0 + threadIdx.y*stride_i + (stride_k == WARP_SIZE ? 0 : threadIdx.x / stride_k);
const int i = i0 + threadIdx.y*stride_i + (stride_k == warp_size ? 0 : threadIdx.x / stride_k);
if (i0 + nwarps*stride_i > nbatch_fa && i >= nbatch_fa) {
break;
@@ -271,7 +315,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_tile(
#pragma unroll
for (int k0 = k0_start; k0 < k0_stop; k0 += stride_k) {
const int k = k0 + (stride_k == WARP_SIZE ? threadIdx.x : threadIdx.x % stride_k);
const int k = k0 + (stride_k == warp_size ? threadIdx.x : threadIdx.x % stride_k);
cp_async_cg_16<preload>(tile_KV_32 + i*(stride_tile*sizeof(half2)) + k*16, KV + i*stride_KV + k*h2_per_chunk);
}
@@ -287,10 +331,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_tile(
} else {
// TODO use ggml_cuda_memcpy_1
auto load = [&] __device__ (const int n) {
const int stride_k = WARP_SIZE >> n;
const int k0_start = stride_k == WARP_SIZE ? 0 : D2 - D2 % (2*stride_k);
const int stride_k = warp_size >> n;
const int k0_start = stride_k == warp_size ? 0 : D2 - D2 % (2*stride_k);
const int k0_stop = D2 - D2 % (1*stride_k);
const int stride_i = WARP_SIZE / stride_k;
const int stride_i = warp_size / stride_k;
if (k0_start == k0_stop) {
return;
@@ -298,7 +342,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_tile(
#pragma unroll
for (int i0 = 0; i0 < nbatch_fa; i0 += nwarps*stride_i) {
const int i = i0 + threadIdx.y*stride_i + (stride_k == WARP_SIZE ? 0 : threadIdx.x / stride_k);
const int i = i0 + threadIdx.y*stride_i + (stride_k == warp_size ? 0 : threadIdx.x / stride_k);
if (i0 + nwarps*stride_i > nbatch_fa && i >= nbatch_fa) {
break;
@@ -306,7 +350,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_tile(
#pragma unroll
for (int k0 = k0_start; k0 < k0_stop; k0 += stride_k) {
const int k = k0 + (stride_k == WARP_SIZE ? threadIdx.x : threadIdx.x % stride_k);
const int k = k0 + (stride_k == warp_size ? threadIdx.x : threadIdx.x % stride_k);
tile_KV[i*stride_tile + k] = !oob_check || i < i_sup ? KV[i*stride_KV + k] : make_half2(0.0f, 0.0f);
}
@@ -324,18 +368,19 @@ template<int ncols1, int nwarps, int nbatch_fa, bool use_cp_async, bool oob_chec
static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
const half * const __restrict__ mask_h, half * const __restrict__ tile_mask,
const int stride_mask, const int i_sup, const int j0, const uint3 ne01) {
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
if constexpr (use_cp_async) {
static_assert(nbatch_fa <= 8*WARP_SIZE && nbatch_fa % 8 == 0, "bad nbatch_fa");
static_assert(nbatch_fa <= 8*warp_size && nbatch_fa % 8 == 0, "bad nbatch_fa");
static_assert(!oob_check, "OOB check incompatible with cp_async");
constexpr int preload = nbatch_fa >= 32 ? nbatch_fa * sizeof(half) : 64;
constexpr int cols_per_warp = 8*WARP_SIZE/nbatch_fa;
constexpr int cols_per_warp = 8*warp_size/nbatch_fa;
constexpr int stride_j = nwarps * cols_per_warp;
const unsigned int tile_mask_32 = ggml_cuda_cvta_generic_to_shared(tile_mask);
#pragma unroll
for (int j1 = 0; j1 < ncols1; j1 += stride_j) {
const int j_sram = j1 + threadIdx.y*cols_per_warp + threadIdx.x / (WARP_SIZE/cols_per_warp);
const int j_sram = j1 + threadIdx.y*cols_per_warp + threadIdx.x / (warp_size/cols_per_warp);
const int j_vram = fastmodulo(j0 + j_sram, ne01);
if (j1 + stride_j > ncols1 && j_sram >= ncols1) {
@@ -357,25 +402,25 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
}
#pragma unroll
for (int i0 = 0; i0 < nbatch_fa; i0 += WARP_SIZE) {
for (int i0 = 0; i0 < nbatch_fa; i0 += warp_size) {
const int i = i0 + threadIdx.x;
tile_mask[j_sram*(nbatch_fa + 8) + i] = i < i_sup ? mask_h[j_vram*stride_mask + i] : half(0.0f);
}
}
} else if constexpr (nbatch_fa < 2*WARP_SIZE) {
constexpr int cols_per_warp = 2*WARP_SIZE/nbatch_fa;
} else if constexpr (nbatch_fa < 2*warp_size) {
constexpr int cols_per_warp = 2*warp_size/nbatch_fa;
constexpr int stride_j = nwarps * cols_per_warp;
#pragma unroll
for (int j1 = 0; j1 < ncols1; j1 += stride_j) {
const int j_sram = j1 + threadIdx.y*cols_per_warp + threadIdx.x / (WARP_SIZE/cols_per_warp);
const int j_sram = j1 + threadIdx.y*cols_per_warp + threadIdx.x / (warp_size/cols_per_warp);
const int j_vram = fastmodulo(j0 + j_sram, ne01);
if (j1 + stride_j > ncols1 && j_sram >= ncols1) {
break;
}
const int i = threadIdx.x % (WARP_SIZE/cols_per_warp);
const int i = threadIdx.x % (warp_size/cols_per_warp);
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + 2*i, mask_h + j_vram*stride_mask + 2*i);
}
@@ -390,7 +435,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_load_mask(
}
#pragma unroll
for (int i0 = 0; i0 < nbatch_fa; i0 += 2*WARP_SIZE) {
for (int i0 = 0; i0 < nbatch_fa; i0 += 2*warp_size) {
const int i = i0 + 2*threadIdx.x;
ggml_cuda_memcpy_1<sizeof(half2)>(tile_mask + j_sram*(nbatch_fa + 8) + i, mask_h + j_vram*stride_mask + i);
@@ -428,7 +473,8 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
const int jt,
const int kb0,
const int k_VKQ_sup) {
#if defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4))
#if defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)) || defined(AMD_MFMA_AVAILABLE)
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
constexpr int ncols = ncols1 * ncols2;
constexpr int cols_per_warp = T_B_KQ::I;
constexpr int cols_per_thread = get_cols_per_thread();
@@ -447,7 +493,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
const int k_VKQ_0 = kb0 * nbatch_fa;
#if defined(TURING_MMA_AVAILABLE)
T_C_KQ KQ_C[nbatch_fa/(np*(cols_per_warp == 8 ? T_C_KQ::I : T_C_KQ::J))];
#elif defined(AMD_WMMA_AVAILABLE)
#elif defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
T_C_KQ KQ_C[nbatch_fa/(np*T_C_KQ::J)];
#else // Volta
T_C_KQ KQ_C[nbatch_fa/(np*T_C_KQ::J)];
@@ -500,13 +546,13 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[k_KQ_0/T_A_KQ::J]);
} else {
// Wide version of KQ_C is column-major
#if defined(AMD_WMMA_AVAILABLE)
// RDNA matrix C is column-major.
#if defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
// AMD matrix C is column-major.
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[k_KQ_0/T_A_KQ::J]);
#else
// swap A and B for CUDA.
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[k_KQ_0/T_A_KQ::J], K_A);
#endif // defined(AMD_WMMA_AVAILABLE)
#endif // defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
}
}
}
@@ -526,13 +572,13 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[0]);
} else {
// Wide version of KQ_C is column-major
#if defined(AMD_WMMA_AVAILABLE)
// RDNA matrix C is column-major.
#if defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
// AMD matrix C is column-major.
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], K_A, Q_B[0]);
#else
// swap A and B for CUDA.
mma(KQ_C[i_KQ_00/(np*T_A_KQ::I)], Q_B[0], K_A);
#endif // defined(AMD_WMMA_AVAILABLE)
#endif // defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
}
}
}
@@ -585,12 +631,12 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#pragma unroll
for (int l = 0; l < T_C_KQ::ne; ++l) {
if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::I + T_C_KQ::get_i(l) < k_VKQ_sup) {
#if defined(AMD_WMMA_AVAILABLE)
#if defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
constexpr int KQ_idx = 0;
#else
// Turing + Volta:
const int KQ_idx = l % 2;
#endif // defined(AMD_WMMA_AVAILABLE)
#endif // defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
KQ_max_new[KQ_idx] = fmaxf(KQ_max_new[KQ_idx], KQ_C[k0/(np*T_C_KQ::I)].x[l] + FATTN_KQ_MAX_OFFSET);
}
}
@@ -601,7 +647,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
for (int col = 0; col < cols_per_thread; ++col) {
#pragma unroll
for (int offset = 16; offset >= 4; offset >>= 1) {
KQ_max_new[col] = fmaxf(KQ_max_new[col], __shfl_xor_sync(0xFFFFFFFF, KQ_max_new[col], offset, WARP_SIZE));
KQ_max_new[col] = fmaxf(KQ_max_new[col], __shfl_xor_sync(0xFFFFFFFF, KQ_max_new[col], offset, warp_size));
}
}
@@ -611,12 +657,12 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#pragma unroll
for (int l = 0; l < T_C_KQ::ne; ++l) {
if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::I + T_C_KQ::get_i(l) < k_VKQ_sup) {
#if defined(AMD_WMMA_AVAILABLE)
#if defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
constexpr int KQ_idx = 0;
#else
// Turing + Volta:
const int KQ_idx = l % 2;
#endif // defined(AMD_WMMA_AVAILABLE)
#endif // defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
KQ_C[k0/(np*T_C_KQ::I)].x[l] = expf(KQ_C[k0/(np*T_C_KQ::I)].x[l] - KQ_max_new[KQ_idx]);
KQ_rowsum_add[KQ_idx] += KQ_C[k0/(np*T_C_KQ::I)].x[l];
} else {
@@ -649,12 +695,12 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#pragma unroll
for (int l = 0; l < T_C_KQ::ne; ++l) {
if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::J + T_C_KQ::get_j(l) < k_VKQ_sup) {
#if defined(AMD_WMMA_AVAILABLE)
#if defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
constexpr int KQ_idx = 0;
#else
// Turing + Volta:
const int KQ_idx = (l/2) % 2;
#endif // defined(AMD_WMMA_AVAILABLE)
#endif // defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
KQ_max_new[KQ_idx] = fmaxf(KQ_max_new[KQ_idx], KQ_C[(k0/(np*T_C_KQ::J))].x[l] + FATTN_KQ_MAX_OFFSET);
}
}
@@ -666,6 +712,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
// Values per KQ column are spread across 4 threads:
constexpr int offset_first = 2;
constexpr int offset_last = 1;
#elif defined(AMD_MFMA_AVAILABLE)
// MFMA: 4 threads per Q column (threadIdx.x % 16 == col, spaced by 16).
constexpr int offset_first = 32;
constexpr int offset_last = 16;
#elif defined(AMD_WMMA_AVAILABLE)
// Values per KQ column are spread across 2 threads:
constexpr int offset_first = 16;
@@ -677,7 +727,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#endif // defined(TURING_MMA_AVAILABLE)
#pragma unroll
for (int offset = offset_first; offset >= offset_last; offset >>= 1) {
KQ_max_new[col] = fmaxf(KQ_max_new[col], __shfl_xor_sync(0xFFFFFFFF, KQ_max_new[col], offset, WARP_SIZE));
KQ_max_new[col] = fmaxf(KQ_max_new[col], __shfl_xor_sync(0xFFFFFFFF, KQ_max_new[col], offset, warp_size));
}
}
@@ -687,12 +737,12 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
#pragma unroll
for (int l = 0; l < T_C_KQ::ne; ++l) {
if (!oob_check || k0 + (threadIdx.y % np)*T_C_KQ::J + T_C_KQ::get_j(l) < k_VKQ_sup) {
#if defined(AMD_WMMA_AVAILABLE)
#if defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
constexpr int KQ_idx = 0;
#else
// Turing + Volta:
const int KQ_idx = (l/2) % 2;
#endif // defined(AMD_WMMA_AVAILABLE)
#endif // defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
KQ_C[(k0/(np*T_C_KQ::J))].x[l] = expf(KQ_C[(k0/(np*T_C_KQ::J))].x[l] - KQ_max_new[KQ_idx]);
KQ_rowsum_add[KQ_idx] += KQ_C[(k0/(np*T_C_KQ::J))].x[l];
} else {
@@ -739,7 +789,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
}
}
}
#elif defined(AMD_WMMA_AVAILABLE)
#elif defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
const half2 KQ_max_scale_h2 = make_half2(
KQ_max_scale[0], KQ_max_scale[0]);
#pragma unroll
@@ -818,7 +868,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
}
const half2 * tile_V_i = !V_is_K_view || i0_stop > 2*nbatch_K2 ? tile_V : tile_V + i0_start/2;
#if defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
#if defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
constexpr int i0_stride = cols_per_warp == 8 ? T_C_VKQ::I : 2*T_C_VKQ::J;
#pragma unroll
for (int i_VKQ_0 = i0_start; i_VKQ_0 < i0_stop; i_VKQ_0 += i0_stride) {
@@ -830,24 +880,38 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
T_A_VKQ A; // Transposed in SRAM but not in registers, gets transposed on load.
#if defined(LDMATRIX_TRANS_AVAILABLE)
load_ldmatrix_trans(A, tile_V_i + 2*k0*stride_tile_V + (i_VKQ_0 - i0_start)/2, stride_tile_V);
#elif defined(AMD_MFMA_AVAILABLE)
// MFMA A register layout: A_mat[i=lane%16][k=4*(lane/16)+reg].
// Normal load gives A_mat[seq][dv] but we need A_mat[dv][seq] = V^T.
// Load with transposed addressing: 4 strided half loads.
{
const half2 * xs0 = tile_V_i + 2*k0*stride_tile_V + (i_VKQ_0 - i0_start)/2;
const half * xs0_h = (const half *) xs0;
const int stride_h = stride_tile_V * 2; // stride in half units
half * A_h = (half *) A.x;
#pragma unroll
for (int l = 0; l < 4; ++l) {
A_h[l] = xs0_h[(4*(threadIdx.x / 16) + l) * stride_h + threadIdx.x % 16];
}
}
#else
// TODO: Try to transpose tile_V when loading gmem to smem.
// Use mma to transpose T_A_VKQ for RDNA.
T_A_VKQ A_trans;
load_ldmatrix(A_trans, tile_V_i + 2*k0*stride_tile_V + (i_VKQ_0 - i0_start)/2, stride_tile_V);
mma(A, A_trans, A_identity);
#endif // defined(TURING_MMA_AVAILABLE)
#endif // defined(LDMATRIX_TRANS_AVAILABLE)
if constexpr (T_B_KQ::I == 8) {
mma(VKQ_C[i_VKQ_0/i0_stride], A, B[k00/(np*T_A_VKQ::J)]);
} else {
// Wide version of VKQ_C is column-major.
#if defined(AMD_WMMA_AVAILABLE)
// RDNA matrix C is column-major.
#if defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
// AMD matrix C is column-major.
mma(VKQ_C[i_VKQ_0/i0_stride], A, B[k00/(np*T_A_VKQ::J)]);
#else
// swap A and B for CUDA.
mma(VKQ_C[i_VKQ_0/i0_stride], B[k00/(np*T_A_VKQ::J)], A);
#endif // defined(AMD_WMMA_AVAILABLE)
#endif // defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
}
}
}
@@ -866,7 +930,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
mma(VKQ_C[i_VKQ_0/i0_stride], B[k00/(np*T_A_VKQ::I)], A);
}
}
#endif // defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
#endif // defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
if constexpr (nstages <= 1) {
__syncthreads(); // Only needed if tile_K == tile_V.
@@ -879,7 +943,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_iter(
tile_Q, tile_K, tile_V, tile_mask,
Q_B, VKQ_C, KQ_max, KQ_rowsum, kb0);
NO_DEVICE_CODE;
#endif // defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4))
#endif // defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)) || defined(AMD_MFMA_AVAILABLE)
}
#if defined(TURING_MMA_AVAILABLE)
@@ -899,7 +963,7 @@ template<> struct mma_tile_sizes<8> {
using T_B_VKQ = tile< 8, 8, half2>; // column-major
using T_C_VKQ = tile<16, 4, half2>; // row-major
};
#elif defined(AMD_WMMA_AVAILABLE)
#elif defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
template<int ncols> struct mma_tile_sizes {
using T_A_KQ = tile<16, 8, half2>; // row-major
using T_B_KQ = tile<16, 8, half2>; // column-major
@@ -944,9 +1008,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
const int zt_gqa,
const int kb0_start,
const int kb0_stop) {
#if defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4))
#if defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)) || defined(AMD_MFMA_AVAILABLE)
// In this kernel Q, K, V are matrices while i, j, k are matrix indices.
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
constexpr int ncols = ncols1 * ncols2;
using T_A_KQ = typename mma_tile_sizes<ncols>::T_A_KQ;
using T_B_KQ = typename mma_tile_sizes<ncols>::T_B_KQ;
@@ -986,7 +1051,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
T_B_KQ Q_B[(Q_in_reg ? DKQ/(2*T_B_KQ::J) : 1)];
#if defined(TURING_MMA_AVAILABLE)
T_C_VKQ VKQ_C[cols_per_warp == 8 ? DV/T_C_VKQ::I : DV/(2*T_C_VKQ::J)];
#elif defined(AMD_WMMA_AVAILABLE)
#elif defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
T_C_VKQ VKQ_C[ DV/(2*T_C_VKQ::J)];
#else // Volta
T_C_VKQ VKQ_C[ DV/(2*T_C_VKQ::J)];
@@ -1004,10 +1069,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
// The loading is done with decreasing granularity for D for better memory bandwidth.
const half2 scale_h2 = make_half2(scale, scale);
#pragma unroll
for (int stride_k : {WARP_SIZE, WARP_SIZE/2, WARP_SIZE/4}) {
const int k0_start = stride_k == WARP_SIZE ? 0 : DKQ/2 - (DKQ/2) % (2*stride_k);
for (int stride_k : {warp_size, warp_size/2, warp_size/4, warp_size/8}) {
const int k0_start = stride_k == warp_size ? 0 : DKQ/2 - (DKQ/2) % (2*stride_k);
const int k0_stop = DKQ/2 - (DKQ/2) % (1*stride_k);
const int stride_jc = WARP_SIZE / stride_k;
const int stride_jc = warp_size / stride_k;
if (k0_start == k0_stop) {
continue;
@@ -1015,7 +1080,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
#pragma unroll
for (int jc0 = 0; jc0 < ncols; jc0 += nwarps*stride_jc) {
const int jc = jc0 + threadIdx.y*stride_jc + (stride_k == WARP_SIZE ? 0 : threadIdx.x / stride_k);
const int jc = jc0 + threadIdx.y*stride_jc + (stride_k == warp_size ? 0 : threadIdx.x / stride_k);
if (jc0 + nwarps*stride_jc > ncols && jc >= ncols) {
break;
@@ -1027,7 +1092,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
if ((ncols1 == 1 || jt*ncols1 + j < int(ne01.z)) && (ncols2 == 1 || zt_gqa*ncols2 + c < gqa_ratio)) {
#pragma unroll
for (int k0 = k0_start; k0 < k0_stop; k0 += stride_k) {
const int k = k0 + (stride_k == WARP_SIZE ? threadIdx.x : threadIdx.x % stride_k);
const int k = k0 + (stride_k == warp_size ? threadIdx.x : threadIdx.x % stride_k);
const float2 tmp = Q_f2[(jt*ncols1 + j)*stride_Q1 + c*stride_Q2 + k];
tile_Q[jc*stride_tile_Q + k] = scale_h2 * make_half2(tmp.x, tmp.y);
@@ -1035,7 +1100,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
} else {
#pragma unroll
for (int k0 = k0_start; k0 < k0_stop; k0 += stride_k) {
const int k = k0 + (stride_k == WARP_SIZE ? threadIdx.x : threadIdx.x % stride_k);
const int k = k0 + (stride_k == warp_size ? threadIdx.x : threadIdx.x % stride_k);
tile_Q[jc*stride_tile_Q + k] = make_half2(0.0f, 0.0f);
}
@@ -1127,6 +1192,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
// The partial sums are spread across 8/4 threads.
constexpr int offset_first = cols_per_warp == 8 ? 16 : 2;
constexpr int offset_last = cols_per_warp == 8 ? 4 : 1;
#elif defined(AMD_MFMA_AVAILABLE)
// The partial sums are spread across 4 threads (wavefront64, 16 cols).
constexpr int offset_first = 32;
constexpr int offset_last = 16;
#elif defined(AMD_WMMA_AVAILABLE)
// The partial sums are spread across 2 threads.
constexpr int offset_first = 16;
@@ -1140,7 +1209,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
for (int col = 0; col < cols_per_thread; ++col) {
#pragma unroll
for (int offset = offset_first; offset >= offset_last; offset >>= 1) {
KQ_rowsum[col] += __shfl_xor_sync(0xFFFFFFFF, KQ_rowsum[col], offset, WARP_SIZE);
KQ_rowsum[col] += __shfl_xor_sync(0xFFFFFFFF, KQ_rowsum[col], offset, warp_size);
}
}
}
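The shuffle loop above is a butterfly reduction: the offset halves each step, and after the last step every participating lane holds the group sum. A host-side model of the same pattern (the lane count and offsets here are illustrative assumptions, not the device code) can be sketched as:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host model of the __shfl_xor_sync butterfly: each step, every lane adds
// the value of its XOR partner. After offsets offset_first..offset_last
// (halving each step), lanes in a group of 2*offset_first share one sum.
static std::vector<float> butterfly_rowsum(std::vector<float> lanes,
                                           int offset_first, int offset_last) {
    for (int offset = offset_first; offset >= offset_last; offset >>= 1) {
        std::vector<float> next = lanes;
        for (size_t lane = 0; lane < lanes.size(); ++lane) {
            next[lane] = lanes[lane] + lanes[lane ^ (size_t) offset];
        }
        lanes = next;
    }
    return lanes;
}
```

With 8 lanes and offsets 4, 2, 1 every lane ends up with the full row sum; the MFMA path above only runs offsets 32 and 16, combining the 4-way partial sums without collapsing columns.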
@@ -1189,7 +1258,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
}
}
}
#elif defined(AMD_WMMA_AVAILABLE)
#elif defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
const half2 KQ_max_scale_h2 = make_half2(KQ_max_scale[0], KQ_max_scale[0]);
#pragma unroll
for (int i = 0; i < (DV/2)/T_C_VKQ::J; ++i) {
@@ -1249,7 +1318,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
const int jc_cwm = threadIdx.y*cols_per_warp + T_C_VKQ::get_i(threadIdx.x % 4);
const float2 KQ_cmr = make_float2(KQ_max[threadIdx.x % cols_per_thread], KQ_rowsum[threadIdx.x % cols_per_thread]);
const bool thread_should_write = threadIdx.x % 4 < cols_per_thread;
#elif defined(AMD_WMMA_AVAILABLE)
#elif defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
const int jc_cwm = threadIdx.y*cols_per_warp + T_C_VKQ::get_i(0);
const float2 KQ_cmr = make_float2(KQ_max[0], KQ_rowsum[0]);
const bool thread_should_write = threadIdx.x / 16 < cols_per_thread;
@@ -1283,14 +1352,14 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
// Warps with threadIdx.y % np != 0 must NOT return early.
// All threads must return simultaneously to avoid race conditions with work on the next tile.
constexpr int nmeta = np*cols_per_warp >= WARP_SIZE ? np*cols_per_warp/WARP_SIZE : 1;
constexpr int nmeta = np*cols_per_warp >= warp_size ? np*cols_per_warp/warp_size : 1;
const int jc_meta = threadIdx.y*cols_per_warp + (np*cols_per_warp < WARP_SIZE ? threadIdx.x % (np*cols_per_warp) : threadIdx.x);
const int jc_meta = threadIdx.y*cols_per_warp + (np*cols_per_warp < warp_size ? threadIdx.x % (np*cols_per_warp) : threadIdx.x);
float2 * const meta_ptr = ((float2 *) tile_Q) + jc_meta*(tile_stride/2) + nbatch_combine/2;
float2 meta[nmeta];
#pragma unroll
for (int imeta = 0; imeta < nmeta; ++imeta) {
meta[imeta] = meta_ptr[imeta * WARP_SIZE * tile_stride/2];
meta[imeta] = meta_ptr[imeta * warp_size * tile_stride/2];
}
float KQ_cmn = meta[0].x; // KQ combine max new, max between all parallel warps.
@@ -1300,8 +1369,8 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
}
#pragma unroll
for (int offset = np*cols_per_warp/2; offset >= cols_per_warp; offset >>= 1) {
if (offset < WARP_SIZE) {
KQ_cmn = fmaxf(KQ_cmn, __shfl_xor_sync(0xFFFFFFFF, KQ_cmn, offset, WARP_SIZE));
if (offset < warp_size) {
KQ_cmn = fmaxf(KQ_cmn, __shfl_xor_sync(0xFFFFFFFF, KQ_cmn, offset, warp_size));
}
}
@@ -1318,8 +1387,8 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
}
#pragma unroll
for (int offset = np*cols_per_warp/2; offset >= cols_per_warp; offset >>= 1) {
if (offset < WARP_SIZE) {
KQ_crs += __shfl_xor_sync(0xFFFFFFFF, KQ_crs, offset, WARP_SIZE);
if (offset < warp_size) {
KQ_crs += __shfl_xor_sync(0xFFFFFFFF, KQ_crs, offset, warp_size);
}
}
@@ -1328,19 +1397,19 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
// Write back combined meta data:
#pragma unroll
for (int imeta = 0; imeta < nmeta; ++imeta) {
if (np*cols_per_warp >= WARP_SIZE || threadIdx.x < np*cols_per_warp) {
if (np*cols_per_warp >= warp_size || threadIdx.x < np*cols_per_warp) {
// Combined KQ max scale + rowsum.
meta_ptr[imeta * WARP_SIZE * tile_stride/2] = make_float2(KQ_cms[imeta], KQ_crs);
meta_ptr[imeta * warp_size * tile_stride/2] = make_float2(KQ_cms[imeta], KQ_crs);
}
}
// Combined KQ max + rowsum.
static_assert(cols_per_warp <= WARP_SIZE);
if (needs_fixup && (cols_per_warp == WARP_SIZE || threadIdx.x < cols_per_warp)) {
static_assert(cols_per_warp <= warp_size);
if (needs_fixup && (cols_per_warp == warp_size || threadIdx.x < cols_per_warp)) {
float2 * dstk_fixup_meta = dstk_fixup + blockIdx.x*ncols;
dstk_fixup_meta[(threadIdx.y/np)*cols_per_warp + threadIdx.x] = make_float2(KQ_cmn, KQ_crs);
}
if (is_fixup && (cols_per_warp == WARP_SIZE || threadIdx.x < cols_per_warp)) {
if (is_fixup && (cols_per_warp == warp_size || threadIdx.x < cols_per_warp)) {
float2 * dstk_fixup_meta = dstk_fixup + (gridDim.x + blockIdx.x)*ncols;
dstk_fixup_meta[(threadIdx.y/np)*cols_per_warp + threadIdx.x] = make_float2(KQ_cmn, KQ_crs);
}
@@ -1388,10 +1457,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
float2 * dstk_fixup_data = dstk_fixup + gridDim.x*(2*ncols) + blockIdx.x*(ncols*(DV/2));
#pragma unroll
for (int stride_k : {WARP_SIZE, WARP_SIZE/2, WARP_SIZE/4}) {
const int k0_start = stride_k == WARP_SIZE ? 0 : nbatch_combine - nbatch_combine % (2*stride_k);
for (int stride_k : {warp_size, warp_size/2, warp_size/4, warp_size/8}) {
const int k0_start = stride_k == warp_size ? 0 : nbatch_combine - nbatch_combine % (2*stride_k);
const int k0_stop = nbatch_combine - nbatch_combine % (1*stride_k);
const int stride_jc = WARP_SIZE / stride_k;
const int stride_jc = warp_size / stride_k;
if (k0_start == k0_stop) {
continue;
@@ -1399,7 +1468,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
#pragma unroll
for (int jc0_dst = 0; jc0_dst < ncols; jc0_dst += (nwarps/np)*stride_jc) {
const int jc_dst = jc0_dst + (threadIdx.y/np)*stride_jc + (stride_k == WARP_SIZE ? 0 : threadIdx.x / stride_k);
const int jc_dst = jc0_dst + (threadIdx.y/np)*stride_jc + (stride_k == warp_size ? 0 : threadIdx.x / stride_k);
if (jc0_dst + (nwarps/np)*stride_jc > ncols && jc_dst >= ncols) {
break;
@@ -1417,7 +1486,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
const float * meta_j = (const float *) tile_Q + jc_tile_K*tile_stride + nbatch_combine;
#pragma unroll
for (int k0 = k0_start; k0 < k0_stop; k0 += stride_k) {
const int k = k0 + (stride_k == WARP_SIZE ? threadIdx.x : threadIdx.x % stride_k);
const int k = k0 + (stride_k == warp_size ? threadIdx.x : threadIdx.x % stride_k);
float2 dstk_val = make_float2(0.0f, 0.0f);
#pragma unroll
@@ -1453,7 +1522,7 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
stride_Q1, stride_Q2, stride_K, stride_V, stride_mask,
jt, kb0_start, kb0_stop);
NO_DEVICE_CODE;
#endif // defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4))
#endif // defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)) || defined(AMD_MFMA_AVAILABLE)
}
template<int DKQ, int DV, int ncols1, int ncols2, bool use_logit_softcap, bool V_is_K_view>
@@ -1480,7 +1549,7 @@ static __global__ void flash_attn_ext_f16(
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
const int32_t nb31, const int32_t nb32, const int64_t nb33) {
#if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)))
#if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)) || defined(AMD_MFMA_AVAILABLE))
// Skip unused kernel variants for faster compilation:
if (use_logit_softcap && !(DKQ == 128 || DKQ == 256)) {
@@ -1508,10 +1577,18 @@ static __global__ void flash_attn_ext_f16(
}
#endif // defined(AMD_WMMA_AVAILABLE)
#if defined(AMD_MFMA_AVAILABLE)
if (DKQ != 64 && DKQ != 80 && DKQ != 96 && DKQ != 112 && DKQ != 128) {
NO_DEVICE_CODE;
return;
}
#endif // defined(AMD_MFMA_AVAILABLE)
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
constexpr int ncols = ncols1 * ncols2;
constexpr int nbatch_fa = ggml_cuda_fattn_mma_get_nbatch_fa(DKQ, DV, ncols);
constexpr int nthreads = ggml_cuda_fattn_mma_get_nthreads(DKQ, DV, ncols);
constexpr int nwarps = nthreads / WARP_SIZE;
constexpr int nwarps = nthreads / warp_size;
const int gqa_ratio = ne02 / ne12; // With grouped query attention there are > 1 Q matrices per K, V matrix.
@@ -1624,7 +1701,7 @@ static __global__ void flash_attn_ext_f16(
ne31, ne32, ne33,
nb31, nb32, nb33);
NO_DEVICE_CODE;
#endif // defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)))
#endif // defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || (defined(AMD_WMMA_AVAILABLE) && defined(RDNA4)) || defined(AMD_MFMA_AVAILABLE))
}
template <int DKQ, int DV, int ncols1, int ncols2>
@@ -1644,7 +1721,8 @@ void ggml_cuda_flash_attn_ext_mma_f16_case(ggml_backend_cuda_context & ctx, ggml
const int nstages = ggml_cuda_fattn_mma_get_nstages (DKQ, DV, ncols1, ncols2, cc);
const int cols_per_warp = std::min(ncols, get_cols_per_warp(cc));
const int nwarps = nthreads / WARP_SIZE;
const int warp_size_host = ggml_cuda_info().devices[ctx.device].warp_size;
const int nwarps = nthreads / warp_size_host;
constexpr bool V_is_K_view = DKQ == 576; // Guaranteed by the kernel selection logic in fattn.cu
@@ -1694,7 +1772,7 @@ void ggml_cuda_flash_attn_ext_mma_f16_case(ggml_backend_cuda_context & ctx, ggml
}
launch_fattn<DV, ncols1, ncols2>
(ctx, dst, fattn_kernel, nwarps, nbytes_shared_total, nbatch_fa, true, true, true);
(ctx, dst, fattn_kernel, nwarps, nbytes_shared_total, nbatch_fa, true, true, true, warp_size_host);
}


@@ -440,6 +440,18 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const
return BEST_FATTN_KERNEL_MMA_F16;
}
// Use MFMA flash attention for CDNA (MI100+):
if (amd_mfma_available(cc) && Q->ne[0] != 40 && Q->ne[0] != 72 && Q->ne[0] != 256 && Q->ne[0] != 576) {
const int64_t eff_nq = Q->ne[1] * (gqa_opt_applies ? gqa_ratio : 1);
// MMA vs tile crossover benchmarked on MI300X @ d32768:
// hsk=64 (gqa=4): MMA wins at eff >= 128 (+11%)
// hsk=128 (gqa=4): MMA wins at eff >= 128 (+4%)
if (eff_nq >= (GGML_CUDA_CC_IS_CDNA1(cc) && Q->ne[0] == 64 ? 64 : 128)) {
return BEST_FATTN_KERNEL_MMA_F16;
}
// Fall through to tile kernel for small effective batch sizes.
}
// If there are no tensor cores available, use the generic tile kernel:
if (can_use_vector_kernel) {
if (!ggml_is_quantized(K->type) && !ggml_is_quantized(V->type)) {
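The crossover comment above picks MFMA only once the effective number of query rows clears a benchmarked threshold. A simplified host-side model of that rule (function and parameter names here are assumptions; the thresholds 64/128 follow the comment in the diff) looks like:

```cpp
#include <cassert>
#include <cstdint>

// Model of the MFMA-vs-tile crossover: eff_nq is ne[1] scaled by the GQA
// ratio when the GQA optimization applies; CDNA1 at head size 64 crosses
// over earlier (64) than the general case (128).
static bool mfma_beats_tile(int64_t ne1, int64_t gqa_ratio,
                            bool gqa_opt_applies, bool is_cdna1,
                            int64_t head_size) {
    const int64_t eff_nq = ne1 * (gqa_opt_applies ? gqa_ratio : 1);
    const int64_t threshold = (is_cdna1 && head_size == 64) ? 64 : 128;
    return eff_nq >= threshold;
}
```

For example, 32 query rows with a GQA ratio of 4 reach the general threshold of 128, while the same rows without the GQA optimization fall through to the tile kernel.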


@@ -668,7 +668,7 @@ namespace ggml_cuda_mma {
return ret;
}
#elif defined(AMD_WMMA_AVAILABLE)
#elif defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)
template <int I, int J>
static __device__ __forceinline__ tile<I, J/2, half2> get_half2(const tile<I, J, float> & tile_float) {
tile<I, J/2, half2> ret;
@@ -964,6 +964,34 @@ namespace ggml_cuda_mma {
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;
#endif // defined(RDNA4)
#elif defined(AMD_MFMA_AVAILABLE)
// MFMA: FP16 input, FP32 accumulate, convert back to half2.
using halfx4_t = __attribute__((ext_vector_type(4))) _Float16;
using floatx4_t = __attribute__((ext_vector_type(4))) float;
// Convert existing half2 accumulator to float for MFMA:
floatx4_t acc_f32;
{
const halfx4_t acc_h = reinterpret_cast<const halfx4_t&>(D.x[0]);
#pragma unroll
for (int i = 0; i < 4; ++i) {
acc_f32[i] = (float)acc_h[i];
}
}
const halfx4_t& a_frag = reinterpret_cast<const halfx4_t&>(A.x[0]);
const halfx4_t& b_frag = reinterpret_cast<const halfx4_t&>(B.x[0]);
acc_f32 = __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, acc_f32, 0, 0, 0);
// Convert back to half2:
{
halfx4_t result_h;
#pragma unroll
for (int i = 0; i < 4; ++i) {
result_h[i] = (_Float16)acc_f32[i];
}
reinterpret_cast<halfx4_t&>(D.x[0]) = result_h;
}
#else
GGML_UNUSED_VARS(D, A, B);
NO_DEVICE_CODE;


@@ -590,6 +590,7 @@ struct vk_device_struct {
vk_queue transfer_queue;
bool single_queue;
bool support_async;
bool async_use_transfer_queue;
uint32_t subgroup_size;
uint32_t subgroup_size_log2;
uint32_t shader_core_count;
@@ -1858,6 +1859,10 @@ struct ggml_backend_vk_context {
vk_context_ref compute_ctx;
vk_context_ref transfer_ctx;
vk_semaphore transfer_semaphore;
uint64_t transfer_semaphore_last_submitted {};
std::vector<vk_context_ref> tensor_ctxs;
std::vector<vk::DescriptorPool> descriptor_pools;
@@ -1866,6 +1871,7 @@ struct ggml_backend_vk_context {
uint32_t pipeline_descriptor_set_requirements {};
vk_command_pool compute_cmd_pool;
vk_command_pool transfer_cmd_pool;
// number of additional consecutive nodes that are being fused with the
// node currently being processed
@@ -5391,13 +5397,19 @@ static vk_device ggml_vk_get_device(size_t idx) {
ggml_vk_load_shaders(device);
const bool prefers_transfer_queue = device->vendor_id == VK_VENDOR_ID_AMD && device->architecture != AMD_GCN;
if (!device->single_queue) {
const uint32_t transfer_queue_index = compute_queue_family_index == transfer_queue_family_index ? 1 : 0;
ggml_vk_create_queue(device, device->transfer_queue, transfer_queue_family_index, transfer_queue_index, { vk::PipelineStageFlagBits::eTransfer }, true);
device->async_use_transfer_queue = prefers_transfer_queue || (getenv("GGML_VK_ASYNC_USE_TRANSFER_QUEUE") != nullptr);
} else {
// TODO: Use pointer or reference to avoid copy
device->transfer_queue.copyFrom(device->compute_queue);
device->transfer_queue.cmd_pool.init(device, &device->transfer_queue);
device->async_use_transfer_queue = false;
}
device->buffer_type = {
@@ -5871,6 +5883,15 @@ static void ggml_vk_init(ggml_backend_vk_context * ctx, size_t idx) {
ctx->almost_ready_fence = ctx->device->device.createFence({});
ctx->compute_cmd_pool.init(ctx->device, &ctx->device->compute_queue);
if (ctx->device->async_use_transfer_queue) {
vk::SemaphoreTypeCreateInfo tci{ vk::SemaphoreType::eTimeline, 0 };
vk::SemaphoreCreateInfo ci{};
ci.setPNext(&tci);
ctx->transfer_semaphore.s = ctx->device->device.createSemaphore(ci);
ctx->transfer_semaphore.value = 0;
ctx->transfer_cmd_pool.init(ctx->device, &ctx->device->transfer_queue);
}
if (vk_perf_logger_enabled) {
ctx->perf_logger = std::unique_ptr<vk_perf_logger>(new vk_perf_logger());
@@ -6419,6 +6440,47 @@ static void ggml_vk_ctx_begin(vk_device& device, vk_context& subctx) {
subctx->s = subctx->seqs[subctx->seqs.size() - 1].data();
}
static vk_context ggml_vk_get_compute_ctx(ggml_backend_vk_context * ctx) {
if (!ctx->compute_ctx.expired()) {
return ctx->compute_ctx.lock();
}
vk_context result = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = result;
ggml_vk_ctx_begin(ctx->device, result);
if (ctx->device->async_use_transfer_queue && ctx->transfer_semaphore_last_submitted < ctx->transfer_semaphore.value) {
result->s->wait_semaphores.push_back(ctx->transfer_semaphore);
ctx->transfer_semaphore_last_submitted = ctx->transfer_semaphore.value;
}
return result;
}
// Submit any pending transfer queue work and signal the transfer semaphore.
// The next compute context created via ggml_vk_get_compute_ctx will wait on this semaphore.
// Returns true if work was submitted.
static bool ggml_vk_submit_transfer_ctx(ggml_backend_vk_context * ctx) {
if (!ctx->device->async_use_transfer_queue || ctx->transfer_ctx.expired()) {
return false;
}
vk_context cpy_ctx = ctx->transfer_ctx.lock();
ggml_vk_ctx_end(cpy_ctx);
for (auto& cpy : cpy_ctx->in_memcpys) {
memcpy(cpy.dst, cpy.src, cpy.n);
}
ctx->transfer_semaphore.value++;
cpy_ctx->seqs.back().back().signal_semaphores.push_back(ctx->transfer_semaphore);
ggml_vk_submit(cpy_ctx, {});
ctx->transfer_ctx.reset();
return true;
}
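The handshake between `ggml_vk_submit_transfer_ctx` and `ggml_vk_get_compute_ctx` is a monotonic timeline-semaphore counter: each transfer submission bumps the value, and a newly created compute context enqueues a wait only if it has not already waited on that value. A minimal host model of that bookkeeping (struct and member names here are illustrative, not the Vulkan API):

```cpp
#include <assert.h>
#include <stdint.h>

// Model of the timeline-semaphore counter pattern from the diff: value is
// the semaphore payload bumped per transfer submit; last_submitted tracks
// the highest payload a compute context has already been told to wait on.
struct timeline_model {
    uint64_t value = 0;
    uint64_t last_submitted = 0;

    void submit_transfer() { ++value; }

    // Returns true exactly once per new transfer submission, mirroring the
    // "last_submitted < value" check in ggml_vk_get_compute_ctx.
    bool compute_ctx_must_wait() {
        if (last_submitted < value) {
            last_submitted = value;
            return true;
        }
        return false;
    }
};
```

This is why repeated compute-context creation without intervening transfers adds no redundant semaphore waits: the counter only moves when a transfer is actually submitted.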
static size_t ggml_vk_align_size(size_t width, size_t align) {
VK_LOG_DEBUG("ggml_vk_align_size(" << width << ", " << align << ")");
return CEIL_DIV(width, align) * align;
@@ -7512,6 +7574,18 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
return false;
}
if (device->driver_id == vk::DriverId::eIntelProprietaryWindows) {
// Intel Windows proprietary driver tuning
switch (src0_type) {
case GGML_TYPE_MXFP4:
case GGML_TYPE_Q4_K:
case GGML_TYPE_Q5_K:
return false;
default:
return true;
}
}
switch (src0_type) {
// From tests on A770 Linux, may need more tuning
case GGML_TYPE_Q4_0:
@@ -12529,15 +12603,7 @@ static bool ggml_vk_build_graph(ggml_backend_vk_context * ctx, ggml_cgraph * cgr
}
}
vk_context compute_ctx;
if (ctx->compute_ctx.expired()) {
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
} else {
compute_ctx = ctx->compute_ctx.lock();
}
vk_context compute_ctx = ggml_vk_get_compute_ctx(ctx);
{
// This logic detects dependencies between modes in the graph and calls ggml_vk_sync_buffers
@@ -13055,6 +13121,9 @@ static void ggml_vk_graph_cleanup(ggml_backend_vk_context * ctx) {
ctx->prealloc_x_need_sync = ctx->prealloc_y_need_sync = ctx->prealloc_split_k_need_sync = false;
ggml_vk_command_pool_cleanup(ctx->device, ctx->compute_cmd_pool);
if (ctx->device->async_use_transfer_queue) {
ggml_vk_command_pool_cleanup(ctx->device, ctx->transfer_cmd_pool);
}
for (size_t i = 0; i < ctx->gc.semaphores.size(); i++) {
ctx->device->device.destroySemaphore({ ctx->gc.semaphores[i].s });
@@ -13116,6 +13185,11 @@ static void ggml_vk_cleanup(ggml_backend_vk_context * ctx) {
ctx->descriptor_sets.clear();
ctx->compute_cmd_pool.destroy(ctx->device->device);
if (ctx->device->async_use_transfer_queue) {
ctx->device->device.destroySemaphore(ctx->transfer_semaphore.s);
ctx->transfer_cmd_pool.destroy(ctx->device->device);
}
if (vk_perf_logger_enabled) {
ctx->perf_logger->print_timings(true);
}
@@ -13387,34 +13461,38 @@ static void ggml_backend_vk_set_tensor_async(ggml_backend_t backend, ggml_tensor
ggml_backend_vk_buffer_context * buf_ctx = (ggml_backend_vk_buffer_context *)tensor->buffer->context;
vk_context compute_ctx;
vk_context cpy_ctx;
if (ctx->compute_ctx.expired()) {
// Initialize new transfer context
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
if (ctx->device->async_use_transfer_queue) {
if (ctx->transfer_ctx.expired()) {
// Initialize new transfer context
cpy_ctx = ggml_vk_create_context(ctx, ctx->transfer_cmd_pool);
ctx->transfer_ctx = cpy_ctx;
ggml_vk_ctx_begin(ctx->device, cpy_ctx);
} else {
cpy_ctx = ctx->transfer_ctx.lock();
}
} else {
compute_ctx = ctx->compute_ctx.lock();
cpy_ctx = ggml_vk_get_compute_ctx(ctx);
}
vk_buffer buf = buf_ctx->dev_buffer;
auto dst_offset = vk_tensor_offset(tensor) + tensor->view_offs + offset;
bool ret = ggml_vk_buffer_write_async(compute_ctx, buf, dst_offset, data, size);
bool ret = ggml_vk_buffer_write_async(cpy_ctx, buf, dst_offset, data, size);
if (!ret) {
ggml_vk_ensure_sync_staging_buffer(ctx, size);
ggml_vk_sync_buffers(nullptr, compute_ctx);
ggml_vk_sync_buffers(nullptr, cpy_ctx);
vk::BufferCopy buffer_cpy;
buffer_cpy.srcOffset = 0;
buffer_cpy.dstOffset = dst_offset;
buffer_cpy.size = size;
compute_ctx->s->buffer.copyBuffer(ctx->sync_staging->buffer, buf->buffer, { buffer_cpy });
deferred_memcpy(ctx->sync_staging->ptr, data, size, &compute_ctx->in_memcpys);
cpy_ctx->s->buffer.copyBuffer(ctx->sync_staging->buffer, buf->buffer, { buffer_cpy });
deferred_memcpy(ctx->sync_staging->ptr, data, size, &cpy_ctx->in_memcpys);
ggml_vk_synchronize(ctx);
}
}
@@ -13426,16 +13504,7 @@ static void ggml_backend_vk_get_tensor_async(ggml_backend_t backend, const ggml_
ggml_backend_vk_buffer_context * buf_ctx = (ggml_backend_vk_buffer_context *)tensor->buffer->context;
vk_context compute_ctx;
if (ctx->compute_ctx.expired()) {
// Initialize new transfer context
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
} else {
compute_ctx = ctx->compute_ctx.lock();
}
vk_context compute_ctx = ggml_vk_get_compute_ctx(ctx);
vk_buffer buf = buf_ctx->dev_buffer;
@@ -13458,31 +13527,60 @@ static void ggml_backend_vk_get_tensor_async(ggml_backend_t backend, const ggml_
}
}
static bool ggml_backend_vk_cpy_tensor_async(ggml_backend_t backend, const ggml_tensor * src, ggml_tensor * dst) {
static bool ggml_backend_vk_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const ggml_tensor * src, ggml_tensor * dst) {
VK_LOG_DEBUG("ggml_backend_vk_cpy_tensor_async()");
ggml_backend_vk_context * ctx = (ggml_backend_vk_context *)backend->context;
if ((dst->buffer->buft == ggml_backend_vk_get_default_buffer_type(backend) || dst->buffer->buft == ggml_backend_vk_host_buffer_type()) && ggml_backend_buffer_is_vk(src->buffer)) {
ggml_backend_vk_context * ctx = (ggml_backend_vk_context *)backend_dst->context;
if (dst->buffer->buft != ggml_backend_vk_get_default_buffer_type(backend_dst)) {
return false;
}
ggml_backend_vk_buffer_context * dst_buf_ctx = (ggml_backend_vk_buffer_context *)dst->buffer->context;
vk_buffer dst_buf = dst_buf_ctx->dev_buffer;
if (ggml_backend_buffer_is_vk(src->buffer)) {
ggml_backend_vk_buffer_context * src_buf_ctx = (ggml_backend_vk_buffer_context *)src->buffer->context;
ggml_backend_vk_buffer_context * dst_buf_ctx = (ggml_backend_vk_buffer_context *)dst->buffer->context;
vk_context compute_ctx;
if (ctx->compute_ctx.expired()) {
// Initialize new transfer context
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
} else {
compute_ctx = ctx->compute_ctx.lock();
// Async copy only works within the same device
if (src_buf_ctx->dev_buffer->device != dst_buf->device) {
return false;
}
vk_buffer src_buf = src_buf_ctx->dev_buffer;
vk_buffer dst_buf = dst_buf_ctx->dev_buffer;
vk_context compute_ctx = ggml_vk_get_compute_ctx(ctx);
ggml_vk_buffer_copy_async(compute_ctx, dst_buf, vk_tensor_offset(dst) + dst->view_offs, src_buf, vk_tensor_offset(src) + src->view_offs, ggml_nbytes(src));
ggml_vk_buffer_copy_async(compute_ctx, dst_buf, vk_tensor_offset(dst) + dst->view_offs,
src_buf_ctx->dev_buffer, vk_tensor_offset(src) + src->view_offs,
ggml_nbytes(src));
return true;
}
if (ggml_backend_buffer_is_host(src->buffer)) {
vk_buffer pinned_buf = nullptr;
size_t pinned_offset = 0;
ggml_vk_host_get(ctx->device, src->data, pinned_buf, pinned_offset);
if (pinned_buf == nullptr) {
return false;
}
vk_context cpy_ctx;
if (ctx->device->async_use_transfer_queue) {
if (ctx->transfer_ctx.expired()) {
cpy_ctx = ggml_vk_create_context(ctx, ctx->transfer_cmd_pool);
ctx->transfer_ctx = cpy_ctx;
ggml_vk_ctx_begin(ctx->device, cpy_ctx);
} else {
cpy_ctx = ctx->transfer_ctx.lock();
}
} else {
cpy_ctx = ggml_vk_get_compute_ctx(ctx);
}
return ggml_vk_buffer_write_async(cpy_ctx, dst_buf,
vk_tensor_offset(dst) + dst->view_offs,
src->data, ggml_nbytes(src));
}
GGML_UNUSED(backend_src);
return false;
}
@@ -13491,6 +13589,10 @@ static void ggml_vk_synchronize(ggml_backend_vk_context * ctx) {
bool do_transfer = !ctx->compute_ctx.expired();
if (ggml_vk_submit_transfer_ctx(ctx)) {
ctx->submit_pending = true;
}
vk_context compute_ctx;
if (do_transfer) {
compute_ctx = ctx->compute_ctx.lock();
@@ -13506,7 +13608,22 @@ static void ggml_vk_synchronize(ggml_backend_vk_context * ctx) {
}
if (ctx->submit_pending) {
{
if (ctx->device->async_use_transfer_queue && ctx->transfer_semaphore_last_submitted < ctx->transfer_semaphore.value) {
vk::TimelineSemaphoreSubmitInfo tl_info{
1, &ctx->transfer_semaphore.value,
0, nullptr,
};
vk::PipelineStageFlags stage = ctx->device->transfer_queue.stage_flags;
vk::SubmitInfo si{
1, &ctx->transfer_semaphore.s, &stage,
0, nullptr,
0, nullptr,
};
si.setPNext(&tl_info);
std::lock_guard<std::mutex> guard(queue_mutex);
ctx->device->compute_queue.queue.submit({ si }, ctx->fence);
ctx->transfer_semaphore_last_submitted = ctx->transfer_semaphore.value;
} else {
std::lock_guard<std::mutex> guard(queue_mutex);
ctx->device->compute_queue.queue.submit({}, ctx->fence);
}
@@ -13972,6 +14089,8 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
bool first_node_in_batch = true; // true if next node will be first node in a batch
int submit_node_idx = 0; // index to first node in a batch
ggml_vk_submit_transfer_ctx(ctx);
vk_context compute_ctx;
if (vk_perf_logger_enabled) {
// allocate/resize the query pool
@@ -13997,9 +14116,7 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
std::fill(ctx->query_node_idx.begin(), ctx->query_node_idx.end(), 0);
GGML_ASSERT(ctx->compute_ctx.expired());
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
compute_ctx = ggml_vk_get_compute_ctx(ctx);
ctx->query_idx = 0;
compute_ctx->s->buffer.writeTimestamp(vk::PipelineStageFlagBits::eAllCommands, ctx->query_pool, ctx->query_idx++);
}
@@ -14009,13 +14126,7 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
if (ctx->prealloc_size_add_rms_partials) {
ggml_vk_preallocate_buffers(ctx, nullptr);
if (ctx->compute_ctx.expired()) {
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
} else {
compute_ctx = ctx->compute_ctx.lock();
}
compute_ctx = ggml_vk_get_compute_ctx(ctx);
// initialize partial sums to zero.
ggml_vk_buffer_memset_async(compute_ctx, ctx->prealloc_add_rms_partials, 0, 0, ctx->prealloc_size_add_rms_partials);
ggml_vk_sync_buffers(ctx, compute_ctx);
@@ -14238,13 +14349,7 @@ static ggml_status ggml_backend_vk_graph_compute(ggml_backend_t backend, ggml_cg
bool enqueued = ggml_vk_build_graph(ctx, cgraph, i, cgraph->nodes[submit_node_idx], submit_node_idx, i + ctx->num_additional_fused_ops >= last_node, almost_ready, submit);
if (vk_perf_logger_enabled && enqueued) {
if (ctx->compute_ctx.expired()) {
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
} else {
compute_ctx = ctx->compute_ctx.lock();
}
compute_ctx = ggml_vk_get_compute_ctx(ctx);
if (!vk_perf_logger_concurrent) {
// track a single node/fusion for the current query
ctx->query_nodes[ctx->query_idx] = cgraph->nodes[i];
@@ -14579,16 +14684,9 @@ static void ggml_backend_vk_event_record(ggml_backend_t backend, ggml_backend_ev
ggml_backend_vk_context * ctx = (ggml_backend_vk_context *)backend->context;
vk_event *vkev = (vk_event *)event->context;
vk_context compute_ctx;
ggml_vk_submit_transfer_ctx(ctx);
if (ctx->compute_ctx.expired()) {
// Initialize new transfer context
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
} else {
compute_ctx = ctx->compute_ctx.lock();
}
vk_context compute_ctx = ggml_vk_get_compute_ctx(ctx);
// the backend interface doesn't have an explicit reset, so reset it here
// before we record the command to set it
@@ -14609,16 +14707,7 @@ static void ggml_backend_vk_event_wait(ggml_backend_t backend, ggml_backend_even
ggml_backend_vk_context * ctx = (ggml_backend_vk_context *)backend->context;
vk_event *vkev = (vk_event *)event->context;
vk_context compute_ctx;
if (ctx->compute_ctx.expired()) {
// Initialize new transfer context
compute_ctx = ggml_vk_create_context(ctx, ctx->compute_cmd_pool);
ctx->compute_ctx = compute_ctx;
ggml_vk_ctx_begin(ctx->device, compute_ctx);
} else {
compute_ctx = ctx->compute_ctx.lock();
}
vk_context compute_ctx = ggml_vk_get_compute_ctx(ctx);
ggml_vk_wait_events(compute_ctx, {vkev->event});
ggml_vk_ctx_end(compute_ctx);
@@ -14631,7 +14720,7 @@ static ggml_backend_i ggml_backend_vk_interface = {
/* .free = */ ggml_backend_vk_free,
/* .set_tensor_async = */ ggml_backend_vk_set_tensor_async,
/* .get_tensor_async = */ ggml_backend_vk_get_tensor_async,
/* .cpy_tensor_async = */ NULL, // ggml_backend_vk_cpy_tensor_async,
/* .cpy_tensor_async = */ ggml_backend_vk_cpy_tensor_async,
/* .synchronize = */ ggml_backend_vk_synchronize,
/* .graph_plan_create = */ NULL,
/* .graph_plan_free = */ NULL,
@@ -15367,11 +15456,25 @@ static bool ggml_backend_vk_device_supports_buft(ggml_backend_dev_t dev, ggml_ba
return buft_ctx->device->idx == ctx->device;
}
+static int64_t ggml_vk_get_op_batch_size(const ggml_tensor * op) {
+    switch (op->op) {
+    case GGML_OP_GET_ROWS:
+        return 0;
+    case GGML_OP_MUL_MAT:
+        return op->ne[1];
+    case GGML_OP_MUL_MAT_ID:
+    case GGML_OP_ROPE:
+    case GGML_OP_ROPE_BACK:
+        return op->ne[2];
+    default:
+        return ggml_nrows(op);
+    }
+}
 static bool ggml_backend_vk_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
     ggml_backend_vk_device_context * dev_ctx = (ggml_backend_vk_device_context *)dev->context;
-    return (op->ne[1] >= dev_ctx->op_offload_min_batch_size && op->op != GGML_OP_GET_ROWS) ||
-           (op->ne[2] >= dev_ctx->op_offload_min_batch_size && op->op == GGML_OP_MUL_MAT_ID);
+    return ggml_vk_get_op_batch_size(op) >= dev_ctx->op_offload_min_batch_size;
 }
static ggml_backend_event_t ggml_backend_vk_device_event_new(ggml_backend_dev_t dev) {

View File

@@ -1,11 +1,43 @@
-#!/usr/bin/env bash
+#!/bin/sh
+# vim: set ts=4 sw=4 et:
-wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
-unzip wikitext-2-raw-v1.zip
+ZIP="wikitext-2-raw-v1.zip"
+FILE="wikitext-2-raw/wiki.test.raw"
+URL="https://huggingface.co/datasets/ggml-org/ci/resolve/main/$ZIP"
-echo "Usage:"
-echo ""
-echo "  ./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw [other params]"
-echo ""
+die() {
+    printf "%s\n" "$@" >&2
+    exit 1
+}
-exit 0
+have_cmd() {
+    for cmd; do
+        command -v "$cmd" >/dev/null || return
+    done
+}
+dl() {
+    [ -f "$2" ] && return
+    if have_cmd wget; then
+        wget "$1" -O "$2"
+    elif have_cmd curl; then
+        curl -L "$1" -o "$2"
+    else
+        die "Please install wget or curl"
+    fi
+}
+have_cmd unzip || die "Please install unzip"
+if [ ! -f "$FILE" ]; then
+    dl "$URL" "$ZIP" || exit
+    unzip -o "$ZIP" || exit
+    rm -f -- "$ZIP"
+fi
+cat <<EOF
+Usage:
+
+    llama-perplexity -m model.gguf -f $FILE [other params]
+
+EOF

View File

@@ -5,7 +5,7 @@ import os
import sys
import subprocess
-HTTPLIB_VERSION = "refs/tags/v0.34.0"
+HTTPLIB_VERSION = "refs/tags/v0.35.0"
vendor = {
"https://github.com/nlohmann/json/releases/latest/download/json.hpp": "vendor/nlohmann/json.hpp",
@@ -14,8 +14,8 @@ vendor = {
"https://raw.githubusercontent.com/nothings/stb/refs/heads/master/stb_image.h": "vendor/stb/stb_image.h",
# not using latest tag to avoid this issue: https://github.com/ggml-org/llama.cpp/pull/17179#discussion_r2515877926
-    # "https://github.com/mackron/miniaudio/raw/refs/tags/0.11.23/miniaudio.h": "vendor/miniaudio/miniaudio.h",
-    "https://github.com/mackron/miniaudio/raw/669ed3e844524fcd883231b13095baee9f6de304/miniaudio.h": "vendor/miniaudio/miniaudio.h",
+    # "https://github.com/mackron/miniaudio/raw/refs/tags/0.11.24/miniaudio.h": "vendor/miniaudio/miniaudio.h",
+    "https://github.com/mackron/miniaudio/raw/13d161bc8d856ad61ae46b798bbeffc0f49808e8/miniaudio.h": "vendor/miniaudio/miniaudio.h",
f"https://raw.githubusercontent.com/yhirose/cpp-httplib/{HTTPLIB_VERSION}/httplib.h": "httplib.h",
f"https://raw.githubusercontent.com/yhirose/cpp-httplib/{HTTPLIB_VERSION}/split.py": "split.py",

View File

@@ -257,6 +257,21 @@ set(LLAMA_TEST_NAME test-mtmd-c-api)
llama_build_and_test(test-mtmd-c-api.c)
target_link_libraries(${LLAMA_TEST_NAME} PRIVATE mtmd)
# GGUF model data fetcher library for tests that need real model metadata
# Only compile when cpp-httplib has SSL support (CPPHTTPLIB_OPENSSL_SUPPORT)
if (TARGET cpp-httplib)
get_target_property(_cpp_httplib_defs cpp-httplib INTERFACE_COMPILE_DEFINITIONS)
if (_cpp_httplib_defs MATCHES "CPPHTTPLIB_OPENSSL_SUPPORT")
add_library(gguf-model-data STATIC gguf-model-data.cpp)
target_link_libraries(gguf-model-data PRIVATE common cpp-httplib)
target_include_directories(gguf-model-data PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
add_executable(test-gguf-model-data test-gguf-model-data.cpp)
target_link_libraries(test-gguf-model-data PRIVATE gguf-model-data common)
llama_test(test-gguf-model-data LABEL "model")
endif()
endif()
# dummy executable - not installed
get_filename_component(TEST_TARGET test-c.c NAME_WE)
add_executable(${TEST_TARGET} test-c.c)

tests/gguf-model-data.cpp Normal file
View File

@@ -0,0 +1,613 @@
// GGUF binary parser adapted from the huggingface/gguf package.
// Reference: https://github.com/huggingface/huggingface.js
#include "gguf-model-data.h"
#include "common.h"
#include "gguf.h"
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <filesystem>
#include <fstream>
#include "http.h"
#define JSON_ASSERT GGML_ASSERT
#include <nlohmann/json.hpp>
// Equivalent of RangeView
struct gguf_buf_reader {
const char * data;
size_t size;
size_t pos;
gguf_buf_reader(const std::vector<char> & buf) : data(buf.data()), size(buf.size()), pos(0) {}
bool has_n_bytes(size_t n) const {
return pos + n <= size;
}
template <typename T>
bool read_val(T & out) {
if (!has_n_bytes(sizeof(T))) {
return false;
}
memcpy(&out, data + pos, sizeof(T));
pos += sizeof(T);
return true;
}
bool read_str(std::string & out) {
uint64_t len;
if (!read_val(len)) {
return false;
}
if (!has_n_bytes((size_t)len)) {
return false;
}
out.assign(data + pos, (size_t)len);
pos += (size_t)len;
return true;
}
bool skip(size_t n) {
if (!has_n_bytes(n)) {
return false;
}
pos += n;
return true;
}
};
static size_t gguf_val_type_size(int32_t vtype) {
switch (vtype) {
case GGUF_TYPE_UINT8: return 1;
case GGUF_TYPE_INT8: return 1;
case GGUF_TYPE_UINT16: return 2;
case GGUF_TYPE_INT16: return 2;
case GGUF_TYPE_UINT32: return 4;
case GGUF_TYPE_INT32: return 4;
case GGUF_TYPE_FLOAT32: return 4;
case GGUF_TYPE_BOOL: return 1;
case GGUF_TYPE_UINT64: return 8;
case GGUF_TYPE_INT64: return 8;
case GGUF_TYPE_FLOAT64: return 8;
default: return 0; // string/array handled separately
}
}
// Equivalent of readMetadataValue(), skips unused values rather than storing
static bool gguf_skip_value(gguf_buf_reader & r, int32_t vtype) {
if (vtype == GGUF_TYPE_STRING) {
std::string tmp;
return r.read_str(tmp);
}
if (vtype == GGUF_TYPE_ARRAY) {
int32_t elem_type;
uint64_t count;
if (!r.read_val(elem_type)) {
return false;
}
if (!r.read_val(count)) {
return false;
}
if (elem_type == GGUF_TYPE_STRING) {
for (uint64_t i = 0; i < count; i++) {
std::string tmp;
if (!r.read_str(tmp)) {
return false;
}
}
return true;
}
if (elem_type == GGUF_TYPE_ARRAY) {
// nested arrays - recurse
for (uint64_t i = 0; i < count; i++) {
if (!gguf_skip_value(r, GGUF_TYPE_ARRAY)) {
return false;
}
}
return true;
}
size_t elem_sz = gguf_val_type_size(elem_type);
if (elem_sz == 0) {
return false;
}
return r.skip((size_t)count * elem_sz);
}
size_t sz = gguf_val_type_size(vtype);
if (sz == 0) {
return false;
}
return r.skip(sz);
}
static bool gguf_read_uint32_val(gguf_buf_reader & r, int32_t vtype, uint32_t & out) {
if (vtype == GGUF_TYPE_UINT8) {
uint8_t v;
if (!r.read_val(v)) {
return false;
}
out = v;
return true;
}
if (vtype == GGUF_TYPE_INT8) {
int8_t v;
if (!r.read_val(v)) {
return false;
}
out = (uint32_t)v;
return true;
}
if (vtype == GGUF_TYPE_UINT16) {
uint16_t v;
if (!r.read_val(v)) {
return false;
}
out = v;
return true;
}
if (vtype == GGUF_TYPE_INT16) {
int16_t v;
if (!r.read_val(v)) {
return false;
}
out = (uint32_t)v;
return true;
}
if (vtype == GGUF_TYPE_UINT32) {
uint32_t v;
if (!r.read_val(v)) {
return false;
}
out = v;
return true;
}
if (vtype == GGUF_TYPE_INT32) {
int32_t v;
if (!r.read_val(v)) {
return false;
}
out = (uint32_t)v;
return true;
}
if (vtype == GGUF_TYPE_UINT64) {
uint64_t v;
if (!r.read_val(v)) {
return false;
}
out = (uint32_t)v;
return true;
}
if (vtype == GGUF_TYPE_INT64) {
int64_t v;
if (!r.read_val(v)) {
return false;
}
out = (uint32_t)v;
return true;
}
return false;
}
// Follows the same header -> KV -> tensor parsing sequence as gguf() huggingface/gguf
static std::optional<gguf_remote_model> gguf_parse_meta(const std::vector<char> & buf) {
gguf_buf_reader r(buf);
// Header: magic(4) + version(4) + tensor_count(8) + kv_count(8) = 24 bytes minimum
uint32_t magic_raw;
if (!r.read_val(magic_raw)) {
return std::nullopt;
}
if (memcmp(&magic_raw, "GGUF", 4) != 0) {
fprintf(stderr, "gguf_parse_meta: invalid magic\n");
return std::nullopt;
}
uint32_t version;
if (!r.read_val(version)) {
return std::nullopt;
}
if (version < 2 || version > 3) {
fprintf(stderr, "gguf_parse_meta: unsupported version %u\n", version);
return std::nullopt;
}
int64_t tensor_count_raw;
int64_t kv_count_raw;
if (!r.read_val(tensor_count_raw)) {
return std::nullopt;
}
if (!r.read_val(kv_count_raw)) {
return std::nullopt;
}
uint64_t tensor_count = (uint64_t)tensor_count_raw;
uint64_t kv_count = (uint64_t)kv_count_raw;
gguf_remote_model model;
std::string arch_prefix;
// Parse KV pairs
for (uint64_t i = 0; i < kv_count; i++) {
std::string key;
if (!r.read_str(key)) {
return std::nullopt;
}
int32_t vtype;
if (!r.read_val(vtype)) {
return std::nullopt;
}
if (key == "general.architecture" && vtype == GGUF_TYPE_STRING) {
if (!r.read_str(model.architecture)) {
return std::nullopt;
}
arch_prefix = model.architecture + ".";
continue;
}
// Extract split.count for proper handling of split files
if (key == "split.count") {
uint32_t v;
if (!gguf_read_uint32_val(r, vtype, v)) {
return std::nullopt;
}
model.n_split = (uint16_t)v;
continue;
}
// Extract split.tensors.count so we can verify we have all tensors
if (key == "split.tensors.count") {
uint32_t v;
if (!gguf_read_uint32_val(r, vtype, v)) {
return std::nullopt;
}
model.n_split_tensors = v;
continue;
}
if (!arch_prefix.empty()) {
uint32_t * target = nullptr;
if (key == arch_prefix + "embedding_length") { target = &model.n_embd; }
else if (key == arch_prefix + "feed_forward_length") { target = &model.n_ff; }
else if (key == arch_prefix + "block_count") { target = &model.n_layer; }
else if (key == arch_prefix + "attention.head_count") { target = &model.n_head; }
else if (key == arch_prefix + "attention.head_count_kv") { target = &model.n_head_kv; }
else if (key == arch_prefix + "expert_count") { target = &model.n_expert; }
else if (key == arch_prefix + "attention.key_length") { target = &model.n_embd_head_k; }
else if (key == arch_prefix + "attention.value_length") { target = &model.n_embd_head_v; }
if (target) {
if (!gguf_read_uint32_val(r, vtype, *target)) {
return std::nullopt;
}
continue;
}
}
if (!gguf_skip_value(r, vtype)) {
return std::nullopt;
}
}
// Parse tensor info entries
model.tensors.reserve((size_t)tensor_count);
for (uint64_t i = 0; i < tensor_count; i++) {
gguf_remote_tensor t;
if (!r.read_str(t.name)) {
return std::nullopt;
}
if (!r.read_val(t.n_dims)) {
return std::nullopt;
}
if (t.n_dims > 4) {
fprintf(stderr, "gguf_parse_meta: tensor '%s' has %u dims (max 4)\n", t.name.c_str(), t.n_dims);
return std::nullopt;
}
for (uint32_t d = 0; d < t.n_dims; d++) {
if (!r.read_val(t.ne[d])) {
return std::nullopt;
}
}
int32_t type_raw;
if (!r.read_val(type_raw)) {
return std::nullopt;
}
t.type = (ggml_type)type_raw;
uint64_t offset;
if (!r.read_val(offset)) {
return std::nullopt;
}
// Infer n_vocab from token_embd.weight
if (t.name == "token_embd.weight") {
model.n_vocab = (uint32_t)t.ne[1];
}
model.tensors.push_back(std::move(t));
}
return model;
}
// cache handling for local download
static std::string get_default_cache_dir() {
return fs_get_cache_directory() + "gguf-headers/";
}
static std::string sanitize_for_path(const std::string & s) {
std::string out = s;
for (char & c : out) {
if (c == '/' || c == '\\' || c == ':') {
c = '_';
}
}
return out;
}
static bool read_file(const std::string & path, std::vector<char> & out) {
std::ifstream f(path, std::ios::binary | std::ios::ate);
if (!f.good()) {
return false;
}
auto sz = f.tellg();
if (sz <= 0) {
return false;
}
out.resize((size_t)sz);
f.seekg(0);
f.read(out.data(), sz);
return f.good();
}
static bool write_file(const std::string & path, const std::vector<char> & data) {
std::ofstream f(path, std::ios::binary | std::ios::trunc);
if (!f.good()) {
return false;
}
f.write(data.data(), (std::streamsize)data.size());
return f.good();
}
// HuggingFace file auto-detection and HTTP download
static std::pair<long, std::vector<char>> gguf_http_get(
const std::string & url,
const httplib::Headers & headers = {},
int timeout_sec = 60) {
try {
auto [cli, parts] = common_http_client(url);
if (timeout_sec > 0) {
cli.set_read_timeout(timeout_sec, 0);
cli.set_write_timeout(timeout_sec, 0);
}
cli.set_connection_timeout(30, 0);
std::vector<char> body;
auto res = cli.Get(parts.path, headers,
[&](const char * data, size_t len) {
body.insert(body.end(), data, data + len);
return true;
}, nullptr);
if (!res) {
fprintf(stderr, "gguf_fetch: HTTP request failed for %s (error %d)\n",
url.c_str(), (int)res.error());
return {-1, {}};
}
return {res->status, std::move(body)};
} catch (const std::exception & e) {
fprintf(stderr, "gguf_fetch: HTTP error: %s\n", e.what());
return {-1, {}};
}
}
// Find the filename for given repo/quant.
// For split models, returns the first shard (the one containing "00001-of-")
// split_prefix is set to the portion before "-00001-of-XXXXX.gguf" when a split file is found
static std::string detect_gguf_filename(const std::string & repo, const std::string & quant,
std::string & split_prefix) {
split_prefix.clear();
std::string api_url = "https://huggingface.co/api/models/" + repo;
auto [code, body] = gguf_http_get(api_url, {}, 30);
if (code != 200 || body.empty()) {
fprintf(stderr, "gguf_fetch: failed to query HF API for %s (HTTP %ld)\n", repo.c_str(), code);
return "";
}
nlohmann::json j;
try {
j = nlohmann::json::parse(body.begin(), body.end());
} catch (...) {
fprintf(stderr, "gguf_fetch: failed to parse HF API response\n");
return "";
}
if (!j.contains("siblings") || !j["siblings"].is_array()) {
fprintf(stderr, "gguf_fetch: unexpected HF API response format\n");
return "";
}
std::vector<std::string> matches;
std::string quant_upper = quant;
for (char & c : quant_upper) { c = (char)toupper(c); }
for (const auto & sibling : j["siblings"]) {
if (!sibling.contains("rfilename")) { continue; }
std::string fname = sibling["rfilename"].get<std::string>();
if (fname.size() < 5 || fname.substr(fname.size() - 5) != ".gguf") {
continue;
}
std::string fname_upper = fname;
for (char & c : fname_upper) { c = (char)toupper(c); }
if (fname_upper.find(quant_upper) != std::string::npos) {
matches.push_back(fname);
}
}
if (matches.empty()) {
fprintf(stderr, "gguf_fetch: no .gguf files matching '%s' in %s\n", quant.c_str(), repo.c_str());
return "";
}
std::sort(matches.begin(), matches.end());
// Prefer non-split, non-supplementary file
for (const auto & m : matches) {
if (m.find("-of-") == std::string::npos && m.find("mmproj") == std::string::npos) {
return m;
}
}
// Return the first shard (00001-of-) and extract the prefix
for (const auto & m : matches) {
auto pos = m.find("-00001-of-");
if (pos != std::string::npos) {
split_prefix = m.substr(0, pos);
return m;
}
}
return matches[0];
}
static std::optional<gguf_remote_model> fetch_and_parse(
const std::string & repo,
const std::string & filename,
const std::string & cache_path) {
std::string url = "https://huggingface.co/" + repo + "/resolve/main/" + filename;
// Progressive download inspired by RangeView.fetchChunk()
// Start at 2MB, double each time, cap at 64MB
size_t chunk_size = 2 * 1024 * 1024;
const size_t max_chunk = 64 * 1024 * 1024;
while (chunk_size <= max_chunk) {
fprintf(stderr, "gguf_fetch: downloading %zu bytes from %s\n", chunk_size, filename.c_str());
char range_buf[64];
snprintf(range_buf, sizeof(range_buf), "bytes=0-%zu", chunk_size - 1);
httplib::Headers headers = {{"Range", range_buf}};
auto [code, body] = gguf_http_get(url, headers, 120);
if (code != 200 && code != 206) {
fprintf(stderr, "gguf_fetch: HTTP %ld fetching %s\n", code, url.c_str());
return std::nullopt;
}
if (body.empty()) {
fprintf(stderr, "gguf_fetch: empty response\n");
return std::nullopt;
}
auto result = gguf_parse_meta(body);
if (result.has_value()) {
write_file(cache_path, body);
return result;
}
if (code == 200) {
fprintf(stderr, "gguf_fetch: server returned full response but metadata parse failed\n");
return std::nullopt;
}
// Parse failed, try larger chunk
chunk_size *= 2;
}
fprintf(stderr, "gguf_fetch: metadata exceeds 64MB, giving up\n");
return std::nullopt;
}
// Try cache first, then fetch and parse a single GGUF shard.
static std::optional<gguf_remote_model> fetch_or_cached(
const std::string & repo,
const std::string & filename,
const std::string & cdir,
const std::string & repo_part) {
std::string fname_part = sanitize_for_path(filename);
std::string cache_path = cdir + "/" + repo_part + "--" + fname_part + ".partial";
{
std::vector<char> cached;
if (std::filesystem::exists(cache_path) && read_file(cache_path, cached)) {
auto result = gguf_parse_meta(cached);
if (result.has_value()) {
fprintf(stderr, "gguf_fetch: loaded from cache: %s\n", cache_path.c_str());
return result;
}
}
}
fs_create_directory_with_parents(cdir);
return fetch_and_parse(repo, filename, cache_path);
}
std::optional<gguf_remote_model> gguf_fetch_model_meta(
const std::string & repo,
const std::string & quant,
const std::string & cache_dir) {
std::string cdir = cache_dir.empty() ? get_default_cache_dir() : cache_dir;
std::string repo_part = sanitize_for_path(repo);
std::string split_prefix;
std::string filename = detect_gguf_filename(repo, quant, split_prefix);
if (filename.empty()) {
return std::nullopt;
}
auto model_opt = fetch_or_cached(repo, filename, cdir, repo_part);
if (!model_opt.has_value()) {
fprintf(stderr, "gguf_fetch: failed to fetch %s\n", filename.c_str());
return std::nullopt;
}
auto & model = model_opt.value();
// If the model is split across multiple files we need to fetch the remaining shards metadata
if (model.n_split > 1) {
if (split_prefix.empty()) {
fprintf(stderr, "gguf_fetch: model reports %u splits but filename has no split pattern\n", model.n_split);
return std::nullopt;
}
fprintf(stderr, "gguf_fetch: split model with %u shards, fetching remaining %u...\n",
model.n_split, model.n_split - 1);
for (int i = 2; i <= model.n_split; i++) {
char num_buf[6], total_buf[6];
snprintf(num_buf, sizeof(num_buf), "%05d", i);
snprintf(total_buf, sizeof(total_buf), "%05d", (int)model.n_split);
std::string shard_name = split_prefix + "-" + num_buf + "-of-" + total_buf + ".gguf";
auto shard = fetch_or_cached(repo, shard_name, cdir, repo_part);
if (!shard.has_value()) {
fprintf(stderr, "gguf_fetch: failed to fetch shard %d: %s\n", i, shard_name.c_str());
return std::nullopt;
}
model.tensors.insert(model.tensors.end(),
std::make_move_iterator(shard->tensors.begin()),
std::make_move_iterator(shard->tensors.end()));
}
if (model.n_split_tensors > 0 && model.tensors.size() != model.n_split_tensors) {
fprintf(stderr, "gguf_fetch: WARNING: expected %u tensors from split.tensors.count, got %zu\n",
model.n_split_tensors, model.tensors.size());
}
}
return model_opt;
}

tests/gguf-model-data.h Normal file
View File

@@ -0,0 +1,42 @@
#pragma once
#include "ggml.h"
#include <cstdint>
#include <optional>
#include <string>
#include <vector>
struct gguf_remote_tensor {
std::string name;
ggml_type type = GGML_TYPE_F32;
int64_t ne[4] = {1, 1, 1, 1}; // dimensions, unused dims = 1
uint32_t n_dims = 0;
};
struct gguf_remote_model {
// Selected KV metadata
std::string architecture; // general.architecture
uint32_t n_embd = 0; // <arch>.embedding_length
uint32_t n_ff = 0; // <arch>.feed_forward_length
uint32_t n_vocab = 0; // inferred from token_embd.weight ne[1]
uint32_t n_layer = 0; // <arch>.block_count
uint32_t n_head = 0; // <arch>.attention.head_count
uint32_t n_head_kv = 0; // <arch>.attention.head_count_kv
uint32_t n_expert = 0; // <arch>.expert_count (0 if absent)
uint32_t n_embd_head_k = 0; // <arch>.attention.key_length
uint32_t n_embd_head_v = 0; // <arch>.attention.value_length
uint16_t n_split = 0; // split.count (0 = not split)
uint32_t n_split_tensors = 0; // split.tensors.count (0 if not split)
std::vector<gguf_remote_tensor> tensors;
};
// Fetch model metadata from HuggingFace with local caching.
// repo: e.g., "ggml-org/Qwen3-32B-GGUF"
// quant: e.g., "Q8_0" -- auto-detects filename (including first shard of split models)
// Returns nullopt if download fails or network is unavailable.
std::optional<gguf_remote_model> gguf_fetch_model_meta(
const std::string & repo,
const std::string & quant = "Q8_0",
const std::string & cache_dir = ""); // empty = default

View File

@@ -0,0 +1,121 @@
#include "gguf-model-data.h"
#include <cstdio>
#define TEST_ASSERT(cond, msg) \
do { \
if (!(cond)) { \
fprintf(stderr, "FAIL: %s (line %d): %s\n", #cond, __LINE__, msg); \
return 1; \
} \
} while (0)
int main() {
fprintf(stderr, "=== test-gguf-model-data ===\n");
// Fetch Qwen3-0.6B Q8_0 metadata
auto result = gguf_fetch_model_meta("ggml-org/Qwen3-0.6B-GGUF", "Q8_0");
if (!result.has_value()) {
fprintf(stderr, "SKIP: could not fetch model metadata (no network or HTTP disabled)\n");
return 0;
}
const auto & model = result.value();
fprintf(stderr, "Architecture: %s\n", model.architecture.c_str());
fprintf(stderr, "n_embd: %u\n", model.n_embd);
fprintf(stderr, "n_ff: %u\n", model.n_ff);
fprintf(stderr, "n_vocab: %u\n", model.n_vocab);
fprintf(stderr, "n_layer: %u\n", model.n_layer);
fprintf(stderr, "n_head: %u\n", model.n_head);
fprintf(stderr, "n_head_kv: %u\n", model.n_head_kv);
fprintf(stderr, "n_expert: %u\n", model.n_expert);
fprintf(stderr, "n_embd_head_k: %u\n", model.n_embd_head_k);
fprintf(stderr, "n_embd_head_v: %u\n", model.n_embd_head_v);
fprintf(stderr, "tensors: %zu\n", model.tensors.size());
// Verify architecture
TEST_ASSERT(model.architecture == "qwen3", "expected architecture 'qwen3'");
// Verify key dimensions (Qwen3-0.6B)
TEST_ASSERT(model.n_layer == 28, "expected n_layer == 28");
TEST_ASSERT(model.n_embd == 1024, "expected n_embd == 1024");
TEST_ASSERT(model.n_head == 16, "expected n_head == 16");
TEST_ASSERT(model.n_head_kv == 8, "expected n_head_kv == 8");
TEST_ASSERT(model.n_expert == 0, "expected n_expert == 0 (not MoE)");
TEST_ASSERT(model.n_vocab == 151936, "expected n_vocab == 151936");
// Verify tensor count
TEST_ASSERT(model.tensors.size() == 311, "expected tensor count == 311");
// Verify known tensor names exist
bool found_attn_q = false;
bool found_token_embd = false;
bool found_output_norm = false;
for (const auto & t : model.tensors) {
if (t.name == "blk.0.attn_q.weight") {
found_attn_q = true;
}
if (t.name == "token_embd.weight") {
found_token_embd = true;
}
if (t.name == "output_norm.weight") {
found_output_norm = true;
}
}
TEST_ASSERT(found_attn_q, "expected tensor 'blk.0.attn_q.weight'");
TEST_ASSERT(found_token_embd, "expected tensor 'token_embd.weight'");
TEST_ASSERT(found_output_norm, "expected tensor 'output_norm.weight'");
// Verify token_embd.weight shape
for (const auto & t : model.tensors) {
if (t.name == "token_embd.weight") {
TEST_ASSERT(t.ne[0] == 1024, "expected token_embd.weight ne[0] == 1024");
TEST_ASSERT(t.n_dims == 2, "expected token_embd.weight to be 2D");
break;
}
}
// Test that second call uses cache (just call again, it should work)
auto result2 = gguf_fetch_model_meta("ggml-org/Qwen3-0.6B-GGUF", "Q8_0");
TEST_ASSERT(result2.has_value(), "cached fetch should succeed");
TEST_ASSERT(result2->tensors.size() == model.tensors.size(), "cached result should match");
// Test a split MoE model without specifying quant (should default to Q8_0)
auto result3 = gguf_fetch_model_meta("ggml-org/GLM-4.6V-GGUF");
if (!result3.has_value()) {
fprintf(stderr, "SKIP: could not fetch GLM-4.6V metadata (no network?)\n");
return 0;
}
const auto & model3 = result3.value();
fprintf(stderr, "Architecture: %s\n", model3.architecture.c_str());
fprintf(stderr, "n_embd: %u\n", model3.n_embd);
fprintf(stderr, "n_ff: %u\n", model3.n_ff);
fprintf(stderr, "n_vocab: %u\n", model3.n_vocab);
fprintf(stderr, "n_layer: %u\n", model3.n_layer);
fprintf(stderr, "n_head: %u\n", model3.n_head);
fprintf(stderr, "n_head_kv: %u\n", model3.n_head_kv);
fprintf(stderr, "n_expert: %u\n", model3.n_expert);
fprintf(stderr, "n_embd_head_k: %u\n", model3.n_embd_head_k);
fprintf(stderr, "n_embd_head_v: %u\n", model3.n_embd_head_v);
fprintf(stderr, "tensors: %zu\n", model3.tensors.size());
// Verify architecture
TEST_ASSERT(model3.architecture == "glm4moe", "expected architecture 'glm4moe'");
// Verify key dimensions (GLM-4.6V)
TEST_ASSERT(model3.n_layer == 46, "expected n_layer == 46");
TEST_ASSERT(model3.n_embd == 4096, "expected n_embd == 4096");
TEST_ASSERT(model3.n_head == 96, "expected n_head == 96");
TEST_ASSERT(model3.n_head_kv == 8, "expected n_head_kv == 8");
TEST_ASSERT(model3.n_expert == 128, "expected n_expert == 128 (MoE)");
TEST_ASSERT(model3.n_vocab == 151552, "expected n_vocab == 151552");
// Verify tensor count
TEST_ASSERT(model3.tensors.size() == 780, "expected tensor count == 780");
fprintf(stderr, "=== ALL TESTS PASSED ===\n");
return 0;
}

View File

@@ -1,3 +1,5 @@
#pragma once
#include "server-http.h"
#include "server-task.h"
#include "server-queue.h"

View File

@@ -178,6 +178,7 @@ int main(int argc, char ** argv) {
ctx_http.post("/v1/chat/completions", ex_wrapper(routes.post_chat_completions));
ctx_http.post("/api/chat", ex_wrapper(routes.post_chat_completions)); // ollama specific endpoint
ctx_http.post("/v1/responses", ex_wrapper(routes.post_responses_oai));
+    ctx_http.post("/responses", ex_wrapper(routes.post_responses_oai));
ctx_http.post("/v1/messages", ex_wrapper(routes.post_anthropic_messages)); // anthropic messages API
ctx_http.post("/v1/messages/count_tokens", ex_wrapper(routes.post_anthropic_count_tokens)); // anthropic token counting
ctx_http.post("/infill", ex_wrapper(routes.post_infill));

View File

@@ -171,7 +171,6 @@ endif()
if (CPPHTTPLIB_OPENSSL_SUPPORT)
target_compile_definitions(${TARGET} PUBLIC CPPHTTPLIB_OPENSSL_SUPPORT) # used in server.cpp
if (APPLE AND CMAKE_SYSTEM_NAME STREQUAL "Darwin")
-        target_compile_definitions(${TARGET} PRIVATE CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN)
find_library(CORE_FOUNDATION_FRAMEWORK CoreFoundation REQUIRED)
find_library(SECURITY_FRAMEWORK Security REQUIRED)
target_link_libraries(${TARGET} PUBLIC ${CORE_FOUNDATION_FRAMEWORK} ${SECURITY_FRAMEWORK})

View File

@@ -2571,10 +2571,46 @@ find_content_type(const std::string &path,
}
}
std::string
extract_media_type(const std::string &content_type,
std::map<std::string, std::string> *params = nullptr) {
// Extract type/subtype from Content-Type value (RFC 2045)
// e.g. "application/json; charset=utf-8" -> "application/json"
auto media_type = content_type;
auto semicolon_pos = media_type.find(';');
if (semicolon_pos != std::string::npos) {
auto param_str = media_type.substr(semicolon_pos + 1);
media_type = media_type.substr(0, semicolon_pos);
if (params) {
// Parse parameters: key=value pairs separated by ';'
split(param_str.data(), param_str.data() + param_str.size(), ';',
[&](const char *b, const char *e) {
std::string key;
std::string val;
split(b, e, '=', [&](const char *b2, const char *e2) {
if (key.empty()) {
key.assign(b2, e2);
} else {
val.assign(b2, e2);
}
});
if (!key.empty()) {
params->emplace(trim_copy(key), trim_double_quotes_copy(val));
}
});
}
}
// Trim whitespace from media type
return trim_copy(media_type);
}
 bool can_compress_content_type(const std::string &content_type) {
   using udl::operator""_t;
-  auto tag = str2tag(content_type);
+  auto mime_type = extract_media_type(content_type);
+  auto tag = str2tag(mime_type);
   switch (tag) {
case "image/svg+xml"_t:
@@ -2586,7 +2622,7 @@ bool can_compress_content_type(const std::string &content_type) {
case "text/event-stream"_t: return false;
-  default: return !content_type.rfind("text/", 0);
+  default: return !mime_type.rfind("text/", 0);
}
}
@@ -3141,7 +3177,8 @@ bool is_chunked_transfer_encoding(const Headers &headers) {
template <typename T, typename U>
bool prepare_content_receiver(T &x, int &status,
ContentReceiverWithProgress receiver,
-                              bool decompress, U callback) {
+                              bool decompress, size_t payload_max_length,
+                              bool &exceed_payload_max_length, U callback) {
if (decompress) {
std::string encoding = x.get_header_value("Content-Encoding");
std::unique_ptr<decompressor> decompressor;
@@ -3157,12 +3194,22 @@ bool prepare_content_receiver(T &x, int &status,
if (decompressor) {
if (decompressor->is_valid()) {
+        size_t decompressed_size = 0;
         ContentReceiverWithProgress out = [&](const char *buf, size_t n,
                                               size_t off, size_t len) {
-          return decompressor->decompress(buf, n,
-                                          [&](const char *buf2, size_t n2) {
-                                            return receiver(buf2, n2, off, len);
-                                          });
+          return decompressor->decompress(
+              buf, n, [&](const char *buf2, size_t n2) {
+                // Guard against zip-bomb: check
+                // decompressed size against limit.
+                if (payload_max_length > 0 &&
+                    (decompressed_size >= payload_max_length ||
+                     n2 > payload_max_length - decompressed_size)) {
+                  exceed_payload_max_length = true;
+                  return false;
+                }
+                decompressed_size += n2;
+                return receiver(buf2, n2, off, len);
+              });
         };
return callback(std::move(out));
} else {
@@ -3183,11 +3230,14 @@ template <typename T>
bool read_content(Stream &strm, T &x, size_t payload_max_length, int &status,
DownloadProgress progress,
ContentReceiverWithProgress receiver, bool decompress) {
+  bool exceed_payload_max_length = false;
   return prepare_content_receiver(
-      x, status, std::move(receiver), decompress,
-      [&](const ContentReceiverWithProgress &out) {
+      x, status, std::move(receiver), decompress, payload_max_length,
+      exceed_payload_max_length, [&](const ContentReceiverWithProgress &out) {
         auto ret = true;
-        auto exceed_payload_max_length = false;
+        // Note: exceed_payload_max_length may also be set by the decompressor
+        // wrapper in prepare_content_receiver when the decompressed payload
+        // size exceeds the limit.
if (is_chunked_transfer_encoding(x.headers)) {
auto result = read_content_chunked(strm, x, payload_max_length, out);
@@ -3603,12 +3653,11 @@ std::string normalize_query_string(const std::string &query) {
 bool parse_multipart_boundary(const std::string &content_type,
                               std::string &boundary) {
-  auto boundary_keyword = "boundary=";
-  auto pos = content_type.find(boundary_keyword);
-  if (pos == std::string::npos) { return false; }
-  auto end = content_type.find(';', pos);
-  auto beg = pos + strlen(boundary_keyword);
-  boundary = trim_double_quotes_copy(content_type.substr(beg, end - beg));
+  std::map<std::string, std::string> params;
+  extract_media_type(content_type, &params);
+  auto it = params.find("boundary");
+  if (it == params.end()) { return false; }
+  boundary = it->second;
   return !boundary.empty();
 }
@@ -3776,11 +3825,7 @@ bool parse_accept_header(const std::string &s,
}
// Remove additional parameters from media type
-    auto param_pos = accept_entry.media_type.find(';');
-    if (param_pos != std::string::npos) {
-      accept_entry.media_type =
-          trim_copy(accept_entry.media_type.substr(0, param_pos));
-    }
+    accept_entry.media_type = extract_media_type(accept_entry.media_type);
// Basic validation of media type format
if (accept_entry.media_type.empty()) {
@@ -5610,7 +5655,7 @@ size_t Request::get_param_value_count(const std::string &key) const {
bool Request::is_multipart_form_data() const {
const auto &content_type = get_header_value("Content-Type");
-  return !content_type.rfind("multipart/form-data", 0);
+  return detail::extract_media_type(content_type) == "multipart/form-data";
}
// Multipart FormData implementation
@@ -7092,7 +7137,8 @@ bool Server::read_content(Stream &strm, Request &req, Response &res) {
return true;
})) {
const auto &content_type = req.get_header_value("Content-Type");
-    if (!content_type.find("application/x-www-form-urlencoded")) {
+    if (detail::extract_media_type(content_type) ==
+        "application/x-www-form-urlencoded") {
if (req.body.size() > CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH) {
res.status = StatusCode::PayloadTooLarge_413; // NOTE: should be 414?
output_error_log(Error::ExceedMaxPayloadSize, &req);
@@ -7479,45 +7525,63 @@ bool Server::routing(Request &req, Response &res, Stream &strm) {
if (detail::expect_content(req)) {
// Content reader handler
{
+      // Track whether the ContentReader was aborted due to the decompressed
+      // payload exceeding `payload_max_length_`.
+      // The user handler runs after the lambda returns, so we must restore the
+      // 413 status if the handler overwrites it.
+      bool content_reader_payload_too_large = false;
ContentReader reader(
[&](ContentReceiver receiver) {
auto result = read_content_with_content_receiver(
strm, req, res, std::move(receiver), nullptr, nullptr);
-            if (!result) { output_error_log(Error::Read, &req); }
+            if (!result) {
+              output_error_log(Error::Read, &req);
+              if (res.status == StatusCode::PayloadTooLarge_413) {
+                content_reader_payload_too_large = true;
+              }
+            }
return result;
},
[&](FormDataHeader header, ContentReceiver receiver) {
auto result = read_content_with_content_receiver(
strm, req, res, nullptr, std::move(header),
std::move(receiver));
if (!result) { output_error_log(Error::Read, &req); }
if (!result) {
output_error_log(Error::Read, &req);
if (res.status == StatusCode::PayloadTooLarge_413) {
content_reader_payload_too_large = true;
}
}
return result;
});
bool dispatched = false;
if (req.method == "POST") {
if (dispatch_request_for_content_reader(
req, res, std::move(reader),
post_handlers_for_content_reader_)) {
return true;
}
dispatched = dispatch_request_for_content_reader(
req, res, std::move(reader), post_handlers_for_content_reader_);
} else if (req.method == "PUT") {
if (dispatch_request_for_content_reader(
req, res, std::move(reader),
put_handlers_for_content_reader_)) {
return true;
}
dispatched = dispatch_request_for_content_reader(
req, res, std::move(reader), put_handlers_for_content_reader_);
} else if (req.method == "PATCH") {
if (dispatch_request_for_content_reader(
req, res, std::move(reader),
patch_handlers_for_content_reader_)) {
return true;
}
dispatched = dispatch_request_for_content_reader(
req, res, std::move(reader), patch_handlers_for_content_reader_);
} else if (req.method == "DELETE") {
if (dispatch_request_for_content_reader(
req, res, std::move(reader),
delete_handlers_for_content_reader_)) {
return true;
dispatched = dispatch_request_for_content_reader(
req, res, std::move(reader), delete_handlers_for_content_reader_);
}
if (dispatched) {
if (content_reader_payload_too_large) {
// Enforce the limit: override any status the handler may have set
// and return false so the error path sends a plain 413 response.
res.status = StatusCode::PayloadTooLarge_413;
res.body.clear();
res.content_length_ = 0;
res.content_provider_ = nullptr;
return false;
}
return true;
}
}
@@ -7930,16 +7994,6 @@ Server::process_request(Stream &strm, const std::string &remote_addr,
routed = true;
} else {
res.status = StatusCode::InternalServerError_500;
std::string val;
auto s = e.what();
for (size_t i = 0; s[i]; i++) {
switch (s[i]) {
case '\r': val += "\\r"; break;
case '\n': val += "\\n"; break;
default: val += s[i]; break;
}
}
res.set_header("EXCEPTION_WHAT", val);
}
} catch (...) {
if (exception_handler_) {
@@ -7948,7 +8002,6 @@ Server::process_request(Stream &strm, const std::string &remote_addr,
routed = true;
} else {
res.status = StatusCode::InternalServerError_500;
res.set_header("EXCEPTION_WHAT", "UNKNOWN");
}
}
#endif
@@ -11629,8 +11682,7 @@ void SSLClient::set_session_verifier(
session_verifier_ = std::move(verifier);
}
#if defined(_WIN32) && \
!defined(CPPHTTPLIB_DISABLE_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE)
#ifdef CPPHTTPLIB_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE
void SSLClient::enable_windows_certificate_verification(bool enabled) {
enable_windows_cert_verification_ = enabled;
}
@@ -11788,8 +11840,7 @@ bool SSLClient::initialize_ssl(Socket &socket, Error &error) {
}
}
#if defined(_WIN32) && \
!defined(CPPHTTPLIB_DISABLE_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE)
#ifdef CPPHTTPLIB_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE
// Additional Windows Schannel verification.
// This provides real-time certificate validation with Windows Update
// integration, working with both OpenSSL and MbedTLS backends.
@@ -11835,8 +11886,7 @@ void Client::enable_server_hostname_verification(bool enabled) {
cli_->enable_server_hostname_verification(enabled);
}
#if defined(_WIN32) && \
!defined(CPPHTTPLIB_DISABLE_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE)
#ifdef CPPHTTPLIB_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE
void Client::enable_windows_certificate_verification(bool enabled) {
if (is_ssl_) {
static_cast<SSLClient &>(*cli_).enable_windows_certificate_verification(
@@ -11959,7 +12009,7 @@ bool enumerate_windows_system_certs(Callback cb) {
}
#endif
#if defined(__APPLE__) && defined(CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN)
#ifdef CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN
// Enumerate macOS Keychain certificates and call callback with DER data
template <typename Callback>
bool enumerate_macos_keychain_certs(Callback cb) {


@@ -8,8 +8,8 @@
#ifndef CPPHTTPLIB_HTTPLIB_H
#define CPPHTTPLIB_HTTPLIB_H
#define CPPHTTPLIB_VERSION "0.34.0"
#define CPPHTTPLIB_VERSION_NUM "0x002200"
#define CPPHTTPLIB_VERSION "0.35.0"
#define CPPHTTPLIB_VERSION_NUM "0x002300"
/*
* Platform compatibility check
@@ -357,14 +357,32 @@ using socket_t = int;
#include <any>
#endif
// On macOS with a TLS backend, enable Keychain root certificates by default
// unless the user explicitly opts out.
#if defined(__APPLE__) && \
!defined(CPPHTTPLIB_DISABLE_MACOSX_AUTOMATIC_ROOT_CERTIFICATES) && \
(defined(CPPHTTPLIB_OPENSSL_SUPPORT) || \
defined(CPPHTTPLIB_MBEDTLS_SUPPORT) || \
defined(CPPHTTPLIB_WOLFSSL_SUPPORT))
#ifndef CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN
#define CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN
#endif
#endif
// On Windows, enable Schannel certificate verification by default
// unless the user explicitly opts out.
#if defined(_WIN32) && \
!defined(CPPHTTPLIB_DISABLE_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE)
#define CPPHTTPLIB_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE
#endif
#if defined(CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO) || \
defined(CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN)
#if TARGET_OS_MAC
#include <CFNetwork/CFHost.h>
#include <CoreFoundation/CoreFoundation.h>
#endif
#endif // CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO or
// CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN
#endif
#ifdef CPPHTTPLIB_OPENSSL_SUPPORT
#ifdef _WIN32
@@ -382,11 +400,11 @@ using socket_t = int;
#endif
#endif // _WIN32
#if defined(CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN)
#ifdef CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN
#if TARGET_OS_MAC
#include <Security/Security.h>
#endif
#endif // CPPHTTPLIB_USE_NON_BLOCKING_GETADDRINFO
#endif
#include <openssl/err.h>
#include <openssl/evp.h>
@@ -430,11 +448,11 @@ using socket_t = int;
#pragma comment(lib, "crypt32.lib")
#endif
#endif // _WIN32
#if defined(CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN)
#ifdef CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN
#if TARGET_OS_MAC
#include <Security/Security.h>
#endif
#endif // CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN
#endif
// Mbed TLS 3.x API compatibility
#if MBEDTLS_VERSION_MAJOR >= 3
@@ -473,11 +491,11 @@ using socket_t = int;
#pragma comment(lib, "crypt32.lib")
#endif
#endif // _WIN32
#if defined(CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN)
#ifdef CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN
#if TARGET_OS_MAC
#include <Security/Security.h>
#endif
#endif // CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN
#endif
#endif // CPPHTTPLIB_WOLFSSL_SUPPORT
// Define CPPHTTPLIB_SSL_ENABLED if any SSL backend is available
@@ -2557,8 +2575,7 @@ public:
tls::ctx_t tls_context() const;
#if defined(_WIN32) && \
!defined(CPPHTTPLIB_DISABLE_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE)
#ifdef CPPHTTPLIB_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE
void enable_windows_certificate_verification(bool enabled);
#endif
@@ -2679,8 +2696,7 @@ public:
tls::ctx_t tls_context() const { return ctx_; }
#if defined(_WIN32) && \
!defined(CPPHTTPLIB_DISABLE_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE)
#ifdef CPPHTTPLIB_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE
void enable_windows_certificate_verification(bool enabled);
#endif
@@ -2712,8 +2728,7 @@ private:
std::function<SSLVerifierResponse(tls::session_t)> session_verifier_;
#if defined(_WIN32) && \
!defined(CPPHTTPLIB_DISABLE_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE)
#ifdef CPPHTTPLIB_WINDOWS_AUTOMATIC_ROOT_CERTIFICATES_UPDATE
bool enable_windows_cert_verification_ = true;
#endif

File diff suppressed because it is too large