llama-bench: print -n-cpu-moe when offloaded layers > 1 (#20984)

ci: Allow ninja to be used during unit test (#20742)

* Remove make dependency
* Added option to specify Ninja generator
* use ninja-build as default for several CI
* Revert "use ninja-build as default for several CI"
  This reverts commit f552c4559b.
* changed use plain string rather than arrays
* Enabled ninja build by default for experimentation
* ci: add run.sh to test conditions to trigger GitHub CI and self-hosted runners
  Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Enabled ninja build by default on self-hosted envs for experimentation
* ci: revert generator to ninja instead of ninja multi-config
  Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ci: install ninja-build for self-hosted workflows
  Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ci: revert ninja from self-hosted runners
  Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ci: missed one self-hosted step
  Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ci: fix windows ci errors from an erroneous revert
  Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Added explicit build types for Ninja
  Also reverted some needless change
* ci: use ninja multi-config for vulkan-x64 build
  Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* added time command to measure build time
* Keeping some configs to use Ninja which show improvement
* minor fix based on review
  Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* ci: rm `time` from custom containers
  Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-04-23 16:37:33 +03:00 · 2026-03-25 21:17:27 +08:00 · 2026-03-25 21:00:49 +08:00 · 2026-03-25 14:46:40 +02:00 · 2026-03-25 12:22:48 +01:00 · 2026-03-25 13:12:50 +02:00
68 changed files with 1903 additions and 702 deletions
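
The generator switch this change adds to `ci/run.sh` can be sketched in isolation as follows. This is a minimal illustration, not the script itself: `pick_generator` is a hypothetical helper name introduced here, whereas `ci/run.sh` assigns `CMAKE_GENERATOR` inline. The sketch only prints the configure command it would run.

```shell
#!/usr/bin/env bash
# Sketch of the generator selection in ci/run.sh.
# pick_generator is a hypothetical helper used only for illustration.
pick_generator() {
    # Default to make for compatibility; any non-empty GG_BUILD_NINJA selects Ninja.
    if [ -n "${GG_BUILD_NINJA:-}" ]; then
        echo "Ninja"
    else
        echo "Unix Makefiles"
    fi
}

CMAKE_GENERATOR="$(pick_generator)"

# Print (rather than run) the configure step that would use the chosen generator.
echo "cmake -G \"${CMAKE_GENERATOR}\" -DCMAKE_BUILD_TYPE=Release .."
```

Assuming the script keeps taking the output and mount directories as its two arguments, as in the workflow steps, Ninja would be opted into with something like `GG_BUILD_NINJA=1 bash ./ci/run.sh <results-dir> <mnt-dir>`, where both directory names are placeholders.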
--- a/.github/ISSUE_TEMPLATE/010-bug-compilation.yml
+++ b/.github/ISSUE_TEMPLATE/010-bug-compilation.yml
@@ -41,7 +41,7 @@ body:
    attributes:
        label: GGML backends
        description: Which GGML backends do you know to be affected?
-        options: [AMX, BLAS, CANN, CPU, CUDA, Hexagon, HIP, Metal, Musa, OpenCL, RPC, SYCL, VirtGPU, Vulkan, WebGPU, zDNN, ZenDNN]
+        options: [AMX, BLAS, CANN, CPU, CUDA, Hexagon, HIP, Metal, Musa, OpenCL, OpenVINO, RPC, SYCL, VirtGPU, Vulkan, WebGPU, zDNN, ZenDNN]
        multiple: true
    validations:
      required: true
--- a/.github/ISSUE_TEMPLATE/011-bug-results.yml
+++ b/.github/ISSUE_TEMPLATE/011-bug-results.yml
@@ -42,7 +42,7 @@ body:
    attributes:
        label: GGML backends
        description: Which GGML backends do you know to be affected?
-        options: [AMX, BLAS, CANN, CPU, CUDA, Hexagon, HIP, Metal, Musa, OpenCL, RPC, SYCL, VirtGPU, Vulkan, WebGPU, zDNN, ZenDNN]
+        options: [AMX, BLAS, CANN, CPU, CUDA, Hexagon, HIP, Metal, Musa, OpenCL, OpenVINO, RPC, SYCL, VirtGPU, Vulkan, WebGPU, zDNN, ZenDNN]
        multiple: true
    validations:
      required: true
--- a/.github/workflows/build-self-hosted.yml
+++ b/.github/workflows/build-self-hosted.yml
@@ -141,60 +141,61 @@ jobs:
  #         amd-smi static
  #         GG_BUILD_ROCM=1 GG_BUILD_AMDGPU_TARGETS="gfx1101" bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp

-  ggml-ci-mac-metal:
-    runs-on: [self-hosted, macOS, ARM64]
-
-    steps:
-      - name: Clone
-        id: checkout
-        uses: actions/checkout@v6
-
-      - name: Test
-        id: ggml-ci
-        run: |
-          GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
-
-  ggml-ci-mac-webgpu:
-    runs-on: [self-hosted, macOS, ARM64]
-
-    steps:
-      - name: Clone
-        id: checkout
-        uses: actions/checkout@v6
-
-      - name: Dawn Dependency
-        id: dawn-depends
-        run: |
-          DAWN_VERSION="v2.0.0"
-          DAWN_OWNER="reeselevine"
-          DAWN_REPO="dawn"
-          DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release"
-          echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
-          curl -L -o artifact.zip \
-            "https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
-          mkdir dawn
-          unzip artifact.zip
-          tar -xvf ${DAWN_ASSET_NAME}.tar.gz -C dawn --strip-components=1
-
-      - name: Test
-        id: ggml-ci
-        run: |
-          GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
-            bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
-
-  ggml-ci-mac-vulkan:
-    runs-on: [self-hosted, macOS, ARM64]
-
-    steps:
-      - name: Clone
-        id: checkout
-        uses: actions/checkout@v6
-
-      - name: Test
-        id: ggml-ci
-        run: |
-          vulkaninfo --summary
-          GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
+  # TODO: sandbox Mac runners
+  #  ggml-ci-mac-metal:
+  #    runs-on: [self-hosted, macOS, ARM64]
+  #
+  #    steps:
+  #      - name: Clone
+  #        id: checkout
+  #        uses: actions/checkout@v6
+  #
+  #      - name: Test
+  #        id: ggml-ci
+  #        run: |
+  #          GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
+  #
+  #  ggml-ci-mac-webgpu:
+  #    runs-on: [self-hosted, macOS, ARM64]
+  #
+  #    steps:
+  #      - name: Clone
+  #        id: checkout
+  #        uses: actions/checkout@v6
+  #
+  #      - name: Dawn Dependency
+  #        id: dawn-depends
+  #        run: |
+  #          DAWN_VERSION="v2.0.0"
+  #          DAWN_OWNER="reeselevine"
+  #          DAWN_REPO="dawn"
+  #          DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release"
+  #          echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
+  #          curl -L -o artifact.zip \
+  #            "https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
+  #          mkdir dawn
+  #          unzip artifact.zip
+  #          tar -xvf ${DAWN_ASSET_NAME}.tar.gz -C dawn --strip-components=1
+  #
+  #      - name: Test
+  #        id: ggml-ci
+  #        run: |
+  #          GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
+  #            bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
+  #
+  #  ggml-ci-mac-vulkan:
+  #    runs-on: [self-hosted, macOS, ARM64]
+  #
+  #    steps:
+  #      - name: Clone
+  #        id: checkout
+  #        uses: actions/checkout@v6
+  #
+  #      - name: Test
+  #        id: ggml-ci
+  #        run: |
+  #          vulkaninfo --summary
+  #          GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp

  ggml-ci-linux-intel-vulkan:
    runs-on: [self-hosted, Linux, Intel]
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -87,7 +87,7 @@ jobs:
            -DGGML_METAL_EMBED_LIBRARY=OFF \
            -DGGML_METAL_SHADER_DEBUG=ON \
            -DGGML_RPC=ON
-          cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
+          time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
          leaks -atExit -- ./build/bin/test-thread-safety -hf ggml-org/gemma-3-270m-qat-GGUF -ngl 99 -p "$(printf 'hello %.0s' {1..128})" -n 16 -c 512 -ub 32 -np 2 -t 2 -lv 1

      - name: Test
@@ -124,7 +124,7 @@ jobs:
            -DGGML_METAL=OFF \
            -DGGML_RPC=ON \
            -DCMAKE_OSX_DEPLOYMENT_TARGET=13.3
-          cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
+          time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)

      - name: Test
        id: cmake_test
@@ -165,8 +165,8 @@ jobs:
        id: cmake_build
        run: |
          export CMAKE_PREFIX_PATH=dawn
-          cmake -B build -DGGML_WEBGPU=ON -DGGML_METAL=OFF -DGGML_BLAS=OFF
-          cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
+          cmake -B build -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DGGML_WEBGPU=ON -DGGML_METAL=OFF -DGGML_BLAS=OFF
+          time cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)

      - name: Test
        id: cmake_test
@@ -231,7 +231,7 @@ jobs:
          cmake -B build \
            -DLLAMA_FATAL_WARNINGS=ON \
            -DGGML_RPC=ON
-          cmake --build build --config Release -j $(nproc)
+          time cmake --build build --config Release -j $(nproc)

      - name: Test
        id: cmake_test
@@ -274,14 +274,16 @@ jobs:
        id: depends
        run: |
          sudo apt-get update
-          sudo apt-get install build-essential libssl-dev
+          sudo apt-get install build-essential libssl-dev ninja-build

      - name: Build
        id: cmake_build
        run: |
          cmake -B build \
+            -G "Ninja" \
+            -DCMAKE_BUILD_TYPE=Release \
            -DGGML_RPC=ON
-          cmake --build build --config Release -j $(nproc)
+          time cmake --build build --config Release -j $(nproc)

      - name: Test
        id: cmake_test
@@ -300,12 +302,13 @@ jobs:
      - name: Dependencies
        id: depends
        run: |
-          sudo apt-get install -y glslc libvulkan-dev libssl-dev
+          sudo apt-get install -y glslc libvulkan-dev libssl-dev ninja-build

      - name: Configure
        id: cmake_configure
        run: |
          cmake -B build \
+            -G "Ninja" \
            -DCMAKE_BUILD_TYPE=RelWithDebInfo \
            -DGGML_BACKEND_DL=ON \
            -DGGML_CPU_ALL_VARIANTS=ON \
@@ -314,7 +317,7 @@ jobs:
      - name: Build
        id: cmake_build
        run: |
-          cmake --build build -j $(nproc)
+          time cmake --build build -j $(nproc)

  ubuntu-24-webgpu:
    runs-on: ubuntu-24.04
@@ -336,7 +339,8 @@ jobs:
        run: |
          sudo add-apt-repository -y ppa:kisak/kisak-mesa
          sudo apt-get update -y
-          sudo apt-get install -y build-essential mesa-vulkan-drivers libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libssl-dev
+          sudo apt-get install -y build-essential mesa-vulkan-drivers \
+            libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libssl-dev

      - name: Get latest Vulkan SDK version
        id: vulkan_sdk_version
@@ -378,7 +382,7 @@ jobs:
          export Dawn_DIR=dawn/lib64/cmake/Dawn
          cmake -B build \
            -DGGML_WEBGPU=ON
-          cmake --build build --config Release -j $(nproc)
+          time cmake --build build --config Release -j $(nproc)

      - name: Test
        id: cmake_test
@@ -415,11 +419,13 @@ jobs:
        run: |
          source emsdk/emsdk_env.sh
          emcmake cmake -B build-wasm \
+            -G "Ninja" \
+            -DCMAKE_BUILD_TYPE=Release \
            -DGGML_WEBGPU=ON \
            -DLLAMA_OPENSSL=OFF \
            -DEMDAWNWEBGPU_DIR=emdawnwebgpu_pkg

-          cmake --build build-wasm --target test-backend-ops -j $(nproc)
+          time cmake --build build-wasm --config Release --target test-backend-ops -j $(nproc)

  ubuntu-22-hip:
    runs-on: ubuntu-22.04
@@ -479,7 +485,7 @@ jobs:
        run: |
          cmake -B build -S . \
            -DGGML_MUSA=ON
-          cmake --build build --config Release -j $(nproc)
+          time cmake --build build --config Release -j $(nproc)

  ubuntu-22-sycl:
    runs-on: ubuntu-22.04
@@ -528,7 +534,7 @@ jobs:
            -DGGML_SYCL=ON \
            -DCMAKE_C_COMPILER=icx \
            -DCMAKE_CXX_COMPILER=icpx
-          cmake --build build --config Release -j $(nproc)
+          time cmake --build build --config Release -j $(nproc)

  ubuntu-22-sycl-fp16:
    runs-on: ubuntu-22.04
@@ -551,7 +557,7 @@ jobs:
        shell: bash
        run: |
          sudo apt update
-          sudo apt install intel-oneapi-compiler-dpcpp-cpp libssl-dev
+          sudo apt install intel-oneapi-compiler-dpcpp-cpp libssl-dev ninja-build

      - name: install oneAPI MKL library
        shell: bash
@@ -574,11 +580,13 @@ jobs:
        run: |
          source /opt/intel/oneapi/setvars.sh
          cmake -B build \
+            -G "Ninja" \
+            -DCMAKE_BUILD_TYPE=Release \
            -DGGML_SYCL=ON \
            -DCMAKE_C_COMPILER=icx \
            -DCMAKE_CXX_COMPILER=icpx \
            -DGGML_SYCL_F16=ON
-          cmake --build build --config Release -j $(nproc)
+          time cmake --build build --config Release -j $(nproc)

  ubuntu-24-openvino:
      name: ubuntu-24-openvino-${{ matrix.openvino_device }}
@@ -648,7 +656,7 @@ jobs:
            cmake -B build/ReleaseOV -G Ninja \
              -DCMAKE_BUILD_TYPE=Release \
              -DGGML_OPENVINO=ON
-            cmake --build build/ReleaseOV --config Release -j $(nproc)
+            time cmake --build build/ReleaseOV --config Release -j $(nproc)

        - name: Test
          id: cmake_test
@@ -1039,7 +1047,7 @@ jobs:
            -DCMAKE_C_COMPILER=riscv64-linux-gnu-gcc-14 \
            -DCMAKE_CXX_COMPILER=riscv64-linux-gnu-g++-14

-          cmake --build build --config Release -j $(nproc)
+          time cmake --build build --config Release -j $(nproc)

      - name: Test
        id: cmake_test
--- a/.github/workflows/copilot-setup-steps.yml
+++ b/.github/workflows/copilot-setup-steps.yml
@@ -54,4 +54,3 @@ jobs:
          python3 -m venv .venv
          source .venv/bin/activate
          pip install -r requirements/requirements-all.txt -r tools/server/tests/requirements.txt
-          pip install flake8 pyright pre-commit
--- a/.github/workflows/gguf-publish.yml
+++ b/.github/workflows/gguf-publish.yml
@@ -28,11 +28,11 @@ jobs:
    - name: Set up Python
      uses: actions/setup-python@v6
      with:
-        python-version: '3.9.x'
+        python-version: '3.11'
    - name: Install dependencies
      run: |
        cd gguf-py
-        python -m pip install poetry
+        python -m pip install poetry==2.3.2
        poetry install

    - name: Build package
--- a/README.md
+++ b/README.md
@@ -17,6 +17,7 @@ LLM inference in C/C++

 ## Hot topics

+- **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**
 - **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
 - [guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)
 - [[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)
@@ -241,7 +242,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 <details>
 <summary>Tools</summary>

- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from HuggingFace Hub and convert them to GGML
+- [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from Hugging Face Hub and convert them to GGML
 - [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
 - [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
 - [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
@@ -300,13 +301,13 @@ The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](htt
 - [Trending](https://huggingface.co/models?library=gguf&sort=trending)
 - [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)

-You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from [Hugging Face](https://huggingface.co/) or other model hosting sites, such as [ModelScope](https://modelscope.cn/), by using this CLI argument: `-hf <user>/<model>[:quant]`. For example:
+You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from [Hugging Face](https://huggingface.co/) or other model hosting sites, by using this CLI argument: `-hf <user>/<model>[:quant]`. For example:

 ```sh
 llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
 ```

-By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable `MODEL_ENDPOINT`. For example, you may opt to downloading model checkpoints from ModelScope or other model sharing communities by setting the environment variable, e.g. `MODEL_ENDPOINT=https://www.modelscope.cn/`.
+By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable `MODEL_ENDPOINT`. The `MODEL_ENDPOINT` must point to a Hugging Face compatible API endpoint.

 After downloading a model, use the CLI tools to run it locally - see below.

--- a/ci/run.sh
+++ b/ci/run.sh
@@ -57,6 +57,13 @@ SRC=`pwd`
 CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=${LLAMA_FATAL_WARNINGS:-ON} -DLLAMA_OPENSSL=OFF -DGGML_SCHED_NO_REALLOC=ON"
 CTEST_EXTRA=""

+# Default to use make unless specified for compatibility
+CMAKE_GENERATOR="Unix Makefiles"
+
+if [ ! -z "${GG_BUILD_NINJA}" ]; then
+    CMAKE_GENERATOR="Ninja"
+fi
+
 if [ ! -z ${GG_BUILD_METAL} ]; then
    CMAKE_EXTRA="${CMAKE_EXTRA} -DGGML_METAL=ON"
 fi
@@ -242,13 +249,13 @@ function gg_run_ctest_debug {

    set -e

-    # Check cmake, make and ctest are installed
+    # Check cmake and ctest are installed
    gg_check_build_requirements

-    (time cmake -DCMAKE_BUILD_TYPE=Debug ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
-    (time make -j$(nproc)                                  ) 2>&1 | tee -a $OUT/${ci}-make.log
+    (cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Debug ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
+    (time cmake --build . --config Debug -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log

-    (time ctest --output-on-failure -L main -E "test-opt|test-backend-ops" ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
+    (time ctest -C Debug --output-on-failure -L main -E "test-opt|test-backend-ops" ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log

    set +e
 }
@@ -273,16 +280,16 @@ function gg_run_ctest_release {

    set -e

-    # Check cmake, make and ctest are installed
+    # Check cmake and ctest are installed
    gg_check_build_requirements

-    (time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
-    (time make -j$(nproc)                                    ) 2>&1 | tee -a $OUT/${ci}-make.log
+    (cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
+    (time cmake --build . --config Release -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log

    if [ -z ${GG_BUILD_LOW_PERF} ]; then
-        (time ctest --output-on-failure -L 'main|python' ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
+        (time ctest -C Release --output-on-failure -L 'main|python' ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
    else
-        (time ctest --output-on-failure -L main -E test-opt ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
+        (time ctest -C Release --output-on-failure -L main -E test-opt ${CTEST_EXTRA}) 2>&1 | tee -a $OUT/${ci}-ctest.log
    fi

    set +e
@@ -340,7 +347,7 @@ function gg_run_ctest_with_model_debug {
    cd build-ci-debug
    set -e

-    (LLAMACPP_TEST_MODELFILE="$model" time ctest --output-on-failure -L model) 2>&1 | tee -a $OUT/${ci}-ctest.log
+    (LLAMACPP_TEST_MODELFILE="$model" time ctest -C Debug --output-on-failure -L model) 2>&1 | tee -a $OUT/${ci}-ctest.log

    set +e
    cd ..
@@ -353,7 +360,7 @@ function gg_run_ctest_with_model_release {
    cd build-ci-release
    set -e

-    (LLAMACPP_TEST_MODELFILE="$model" time ctest --output-on-failure -L model) 2>&1 | tee -a $OUT/${ci}-ctest.log
+    (LLAMACPP_TEST_MODELFILE="$model" time ctest -C Release --output-on-failure -L model) 2>&1 | tee -a $OUT/${ci}-ctest.log

    # test memory leaks
    #if [[ ! -z ${GG_BUILD_METAL} ]]; then
@@ -407,8 +414,8 @@ function gg_run_qwen3_0_6b {

    set -e

-    (time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
-    (time make -j$(nproc)                                    ) 2>&1 | tee -a $OUT/${ci}-make.log
+    (cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
+    (time cmake --build . --config Release -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log

    python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf  --outtype f16
    python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-bf16.gguf --outtype bf16
@@ -556,8 +563,8 @@ function gg_run_embd_bge_small {

    set -e

-    (time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
-    (time make -j$(nproc)                                    ) 2>&1 | tee -a $OUT/${ci}-make.log
+    (cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
+    (time cmake --build . --config Release -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log

    python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf

@@ -601,8 +608,8 @@ function gg_run_rerank_tiny {

    set -e

-    (time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
-    (time make -j$(nproc)                                    ) 2>&1 | tee -a $OUT/${ci}-make.log
+    (cmake -G "${CMAKE_GENERATOR}" -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
+    (time cmake --build . --config Release -j$(nproc)) 2>&1 | tee -a $OUT/${ci}-make.log

    python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf

@@ -652,10 +659,6 @@ function gg_check_build_requirements {
        gg_printf 'cmake not found, please install'
    fi

-    if ! command -v make &> /dev/null; then
-        gg_printf 'make not found, please install'
-    fi
-
    if ! command -v ctest &> /dev/null; then
        gg_printf 'ctest not found, please install'
    fi
--- a/common/CMakeLists.txt
+++ b/common/CMakeLists.txt
@@ -63,6 +63,8 @@ add_library(${TARGET} STATIC
    debug.h
    download.cpp
    download.h
+    hf-cache.cpp
+    hf-cache.h
    http.h
    json-partial.cpp
    json-partial.h
--- a/common/arg.cpp
+++ b/common/arg.cpp
@@ -3,6 +3,7 @@
 #include "chat.h"
 #include "common.h"
 #include "download.h"
+#include "hf-cache.h"
 #include "json-schema-to-grammar.h"
 #include "log.h"
 #include "sampling.h"
@@ -326,60 +327,48 @@ struct handle_model_result {
    common_params_model mmproj;
 };

-static handle_model_result common_params_handle_model(
-        struct common_params_model & model,
-        const std::string & bearer_token,
-        bool offline) {
+static handle_model_result common_params_handle_model(struct common_params_model & model,
+                                                      const std::string          & bearer_token,
+                                                      bool                         offline) {
    handle_model_result result;
-    // handle pre-fill default model path and url based on hf_repo and hf_file
-    {
-        if (!model.docker_repo.empty()) {  // Handle Docker URLs by resolving them to local paths
-            model.path = common_docker_resolve_model(model.docker_repo);
-            model.name = model.docker_repo; // set name for consistency
-        } else if (!model.hf_repo.empty()) {
-            // short-hand to avoid specifying --hf-file -> default it to --model
-            if (model.hf_file.empty()) {
-                if (model.path.empty()) {
-                    auto auto_detected = common_get_hf_file(model.hf_repo, bearer_token, offline);
-                    if (auto_detected.repo.empty() || auto_detected.ggufFile.empty()) {
-                        exit(1); // error message already printed
-                    }
-                    model.name    = model.hf_repo;      // repo name with tag
-                    model.hf_repo = auto_detected.repo; // repo name without tag
-                    model.hf_file = auto_detected.ggufFile;
-                    if (!auto_detected.mmprojFile.empty()) {
-                        result.found_mmproj   = true;
-                        result.mmproj.hf_repo = model.hf_repo;
-                        result.mmproj.hf_file = auto_detected.mmprojFile;
-                    }
-                } else {
-                    model.hf_file = model.path;
-                }
-            }
-
-            std::string model_endpoint = get_model_endpoint();
-            model.url = model_endpoint + model.hf_repo + "/resolve/main/" + model.hf_file;
-            // make sure model path is present (for caching purposes)
-            if (model.path.empty()) {
-                // this is to avoid different repo having same file name, or same file name in different subdirs
-                std::string filename = clean_file_name(model.hf_repo + "_" + model.hf_file);
-                model.path = fs_get_cache_file(filename);
-            }
-
-        } else if (!model.url.empty()) {
-            if (model.path.empty()) {
-                auto f = string_split<std::string>(model.url, '#').front();
-                f = string_split<std::string>(f, '?').front();
-                model.path = fs_get_cache_file(string_split<std::string>(f, '/').back());
-            }

+    if (!model.docker_repo.empty()) {
+        model.path = common_docker_resolve_model(model.docker_repo);
+        model.name = model.docker_repo;
+    } else if (!model.hf_repo.empty()) {
+        // If -m was used with -hf, treat the model "path" as the hf_file to download
+        if (model.hf_file.empty() && !model.path.empty()) {
+            model.hf_file = model.path;
+            model.path = "";
        }
-    }
+        common_download_model_opts opts;
+        opts.download_mmproj = true;
+        opts.offline = offline;
+        auto download_result = common_download_model(model, bearer_token, opts);

-    // then, download it if needed
-    if (!model.url.empty()) {
-        bool ok = common_download_model(model, bearer_token, offline);
-        if (!ok) {
+        if (download_result.model_path.empty()) {
+            LOG_ERR("error: failed to download model from Hugging Face\n");
+            exit(1);
+        }
+
+        model.name = model.hf_repo;
+        model.path = download_result.model_path;
+
+        if (!download_result.mmproj_path.empty()) {
+            result.found_mmproj = true;
+            result.mmproj.path  = download_result.mmproj_path;
+        }
+    } else if (!model.url.empty()) {
+        if (model.path.empty()) {
+            auto f = string_split<std::string>(model.url, '#').front();
+            f = string_split<std::string>(f, '?').front();
+            model.path = fs_get_cache_file(string_split<std::string>(f, '/').back());
+        }
+
+        common_download_model_opts opts;
+        opts.offline = offline;
+        auto download_result = common_download_model(model, bearer_token, opts);
+        if (download_result.model_path.empty()) {
            LOG_ERR("error: failed to download model from %s\n", model.url.c_str());
            exit(1);
        }
@@ -539,6 +528,13 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
    // parse the first time to get -hf option (used for remote preset)
    parse_cli_args();

+    // TODO: Remove later
+    try {
+        hf_cache::migrate_old_cache_to_hf_cache(params.hf_token, params.offline);
+    } catch (const std::exception & e) {
+        LOG_WRN("HF cache migration failed: %s\n", e.what());
+    }
+
    // maybe handle remote preset
    if (!params.model.hf_repo.empty()) {
        std::string cli_hf_repo = params.model.hf_repo;
@@ -1061,12 +1057,10 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
        {"-cl", "--cache-list"},
        "show list of models in cache",
        [](common_params &) {
-            printf("model cache directory: %s\n", fs_get_cache_directory().c_str());
            auto models = common_list_cached_models();
            printf("number of models in cache: %zu\n", models.size());
            for (size_t i = 0; i < models.size(); i++) {
-                auto & model = models[i];
-                printf("%4d. %s\n", (int) i + 1, model.to_string().c_str());
+                printf("%4zu. %s\n", i + 1, models[i].to_string().c_str());
            }
            exit(0);
        }
--- a/common/download.cpp
+++ b/common/download.cpp
@@ -1,9 +1,9 @@
 #include "arg.h"

 #include "common.h"
-#include "gguf.h" // for reading GGUF splits
 #include "log.h"
 #include "download.h"
+#include "hf-cache.h"

 #define JSON_ASSERT GGML_ASSERT
 #include <nlohmann/json.hpp>
@@ -15,6 +15,7 @@
 #include <map>
 #include <mutex>
 #include <regex>
+#include <unordered_set>
 #include <string>
 #include <thread>
 #include <vector>
@@ -35,8 +36,6 @@
 #endif
 #endif

-#define LLAMA_MAX_URL_LENGTH 2084 // Maximum URL Length in Chrome: 2083
-
 // isatty
 #if defined(_WIN32)
 #include <io.h>
@@ -51,31 +50,6 @@ using json = nlohmann::ordered_json;
 //

 // validate repo name format: owner/repo
-static bool validate_repo_name(const std::string & repo) {
-    static const std::regex repo_regex(R"(^[A-Za-z0-9_.\-]+\/[A-Za-z0-9_.\-]+$)");
-    return std::regex_match(repo, repo_regex);
-}
-
-static std::string get_manifest_path(const std::string & repo, const std::string & tag) {
-    // we use "=" to avoid clashing with other component, while still being allowed on windows
-    std::string fname = "manifest=" + repo + "=" + tag + ".json";
-    if (!validate_repo_name(repo)) {
-        throw std::runtime_error("error: repo name must be in the format 'owner/repo'");
-    }
-    string_replace_all(fname, "/", "=");
-    return fs_get_cache_file(fname);
-}
-
-static std::string read_file(const std::string & fname) {
-    std::ifstream file(fname);
-    if (!file) {
-        throw std::runtime_error(string_format("error: failed to open file '%s'\n", fname.c_str()));
-    }
-    std::string content((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
-    file.close();
-    return content;
-}
-
 static void write_file(const std::string & fname, const std::string & content) {
    const std::string fname_tmp = fname + ".tmp";
    std::ofstream     file(fname_tmp);
@@ -132,7 +106,7 @@ static bool is_http_status_ok(int status) {

 std::pair<std::string, std::string> common_download_split_repo_tag(const std::string & hf_repo_with_tag) {
    auto parts = string_split<std::string>(hf_repo_with_tag, ':');
-    std::string tag = parts.size() > 1 ? parts.back() : "latest";
+    std::string tag = parts.size() > 1 ? parts.back() : "";
    std::string hf_repo = parts[0];
    if (string_split<std::string>(hf_repo, '/').size() != 2) {
        throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]\n");
@@ -290,7 +264,8 @@ static bool common_pull_file(httplib::Client & cli,
 static int common_download_file_single_online(const std::string        & url,
                                              const std::string        & path,
                                              const std::string        & bearer_token,
-                                              const common_header_list & custom_headers) {
+                                              const common_header_list & custom_headers,
+                                              bool                       skip_etag = false) {
    static const int max_attempts        = 3;
    static const int retry_delay_seconds = 2;

@@ -310,6 +285,11 @@ static int common_download_file_single_online(const std::string        & url,

    const bool file_exists = std::filesystem::exists(path);

+    if (file_exists && skip_etag) {
+        LOG_INF("%s: using cached file: %s\n", __func__, path.c_str());
+        return 304; // 304 Not Modified - fake cached response
+    }
+
    std::string last_etag;
    if (file_exists) {
        last_etag = read_etag(path);
@@ -361,6 +341,12 @@ static int common_download_file_single_online(const std::string        & url,
        }
    }

+    { // best effort: create parent directories, ignoring any error
+        std::error_code ec;
+        std::filesystem::path p(path);
+        std::filesystem::create_directories(p.parent_path(), ec);
+    }
+
    const std::string path_temporary = path + ".downloadInProgress";
    int delay = retry_delay_seconds;

@@ -391,7 +377,7 @@ static int common_download_file_single_online(const std::string        & url,
                LOG_ERR("%s: unable to rename file: %s to %s\n", __func__, path_temporary.c_str(), path.c_str());
                return -1;
            }
-            if (!etag.empty()) {
+            if (!etag.empty() && !skip_etag) {
                write_etag(path, etag);
            }
            return head->status;
@@ -440,9 +426,10 @@ int common_download_file_single(const std::string & url,
                                const std::string & path,
                                const std::string & bearer_token,
                                bool offline,
-                                const common_header_list & headers) {
+                                const common_header_list & headers,
+                                bool skip_etag) {
    if (!offline) {
-        return common_download_file_single_online(url, path, bearer_token, headers);
+        return common_download_file_single_online(url, path, bearer_token, headers, skip_etag);
    }

    if (!std::filesystem::exists(path)) {
@@ -454,193 +441,293 @@ int common_download_file_single(const std::string & url,
    return 304; // Not Modified - fake cached response
 }

-// download multiple files from remote URLs to local paths
-// the input is a vector of pairs <url, path>
-static bool common_download_file_multiple(const std::vector<std::pair<std::string, std::string>> & urls,
-                                          const std::string & bearer_token,
-                                          bool offline,
-                                          const common_header_list & headers) {
-    // Prepare download in parallel
-    std::vector<std::future<bool>> futures_download;
-    futures_download.reserve(urls.size());
+struct gguf_split_info {
+    std::string prefix; // path without ".gguf" and the "-NNNNN-of-NNNNN" split suffix (tag included)
+    std::string tag;
+    int index;
+    int count;
+};

-    for (auto const & item : urls) {
-        futures_download.push_back(
-            std::async(
-                std::launch::async,
-                [&bearer_token, offline, &headers](const std::pair<std::string, std::string> & it) -> bool {
-                    const int http_status = common_download_file_single(it.first, it.second, bearer_token, offline, headers);
-                    return is_http_status_ok(http_status);
-                },
-                item
-            )
-        );
+static gguf_split_info get_gguf_split_info(const std::string & path) {
+    static const std::regex re_split("^(.+)-([0-9]{5})-of-([0-9]{5})$", std::regex::icase);
+    static const std::regex re_tag("[-.]([A-Z0-9_]+)$", std::regex::icase);
+    std::smatch m;
+
+    std::string prefix = path;
+    string_remove_suffix(prefix, ".gguf");
+
+    int index = 1;
+    int count = 1;
+
+    if (std::regex_match(prefix, m, re_split)) {
+        index = std::stoi(m[2].str());
+        count = std::stoi(m[3].str());
+        prefix = m[1].str();
    }

-    // Wait for all downloads to complete
-    for (auto & f : futures_download) {
-        if (!f.get()) {
-            return false;
+    std::string tag;
+    if (std::regex_search(prefix, m, re_tag)) {
+        tag = m[1].str();
+        for (char & c : tag) {
+            c = std::toupper((unsigned char)c);
        }
    }

-    return true;
+    return {std::move(prefix), std::move(tag), index, count};
 }

-bool common_download_model(const common_params_model & model,
-                           const std::string & bearer_token,
-                           bool offline,
-                           const common_header_list & headers) {
-    // Basic validation of the model.url
-    if (model.url.empty()) {
-        LOG_ERR("%s: invalid model url\n", __func__);
-        return false;
+// Q4_0 -> 4, F16 -> 16, NVFP4 -> 4, Q8_K_M -> 8, etc
+static int extract_quant_bits(const std::string & filename) {
+    auto split = get_gguf_split_info(filename);
+
+    auto pos = split.tag.find_first_of("0123456789");
+    if (pos == std::string::npos) {
+        return 0;
    }

-    const int http_status = common_download_file_single(model.url, model.path, bearer_token, offline, headers);
-    if (!is_http_status_ok(http_status)) {
-        return false;
-    }
-
-    // check for additional GGUFs split to download
-    int n_split = 0;
-    {
-        struct gguf_init_params gguf_params = {
-            /*.no_alloc = */ true,
-            /*.ctx      = */ NULL,
-        };
-        auto * ctx_gguf = gguf_init_from_file(model.path.c_str(), gguf_params);
-        if (!ctx_gguf) {
-            LOG_ERR("\n%s:  failed to load input GGUF from %s\n", __func__, model.path.c_str());
-            return false;
-        }
-
-        auto key_n_split = gguf_find_key(ctx_gguf, LLM_KV_SPLIT_COUNT);
-        if (key_n_split >= 0) {
-            n_split = gguf_get_val_u16(ctx_gguf, key_n_split);
-        }
-
-        gguf_free(ctx_gguf);
-    }
-
-    if (n_split > 1) {
-        char split_prefix[PATH_MAX] = {0};
-        char split_url_prefix[LLAMA_MAX_URL_LENGTH] = {0};
-
-        // Verify the first split file format
-        // and extract split URL and PATH prefixes
-        {
-            if (!llama_split_prefix(split_prefix, sizeof(split_prefix), model.path.c_str(), 0, n_split)) {
-                LOG_ERR("\n%s: unexpected model file name: %s n_split=%d\n", __func__, model.path.c_str(), n_split);
-                return false;
-            }
-
-            if (!llama_split_prefix(split_url_prefix, sizeof(split_url_prefix), model.url.c_str(), 0, n_split)) {
-                LOG_ERR("\n%s: unexpected model url: %s n_split=%d\n", __func__, model.url.c_str(), n_split);
-                return false;
-            }
-        }
-
-        std::vector<std::pair<std::string, std::string>> urls;
-        for (int idx = 1; idx < n_split; idx++) {
-            char split_path[PATH_MAX] = {0};
-            llama_split_path(split_path, sizeof(split_path), split_prefix, idx, n_split);
-
-            char split_url[LLAMA_MAX_URL_LENGTH] = {0};
-            llama_split_path(split_url, sizeof(split_url), split_url_prefix, idx, n_split);
-
-            if (std::string(split_path) == model.path) {
-                continue; // skip the already downloaded file
-            }
-
-            urls.push_back({split_url, split_path});
-        }
-
-        // Download in parallel
-        common_download_file_multiple(urls, bearer_token, offline, headers);
-    }
-
-    return true;
+    return std::stoi(split.tag.substr(pos));
 }

-common_hf_file_res common_get_hf_file(const std::string & hf_repo_with_tag,
-                                      const std::string & bearer_token,
-                                      bool offline,
-                                      const common_header_list & custom_headers) {
-    // the returned hf_repo is without tag
-    auto [hf_repo, tag] = common_download_split_repo_tag(hf_repo_with_tag);
+static hf_cache::hf_files get_split_files(const hf_cache::hf_files & files,
+                                          const hf_cache::hf_file  & file) {
+    auto split = get_gguf_split_info(file.path);

-    std::string url = get_model_endpoint() + "v2/" + hf_repo + "/manifests/" + tag;
-
-    // headers
-    common_header_list headers = custom_headers;
-    headers.push_back({"Accept", "application/json"});
-    if (!bearer_token.empty()) {
-        headers.push_back({"Authorization", "Bearer " + bearer_token});
+    if (split.count <= 1) {
+        return {file};
    }
-    // Important: the User-Agent must be "llama-cpp" to get the "ggufFile" field in the response
-    // User-Agent header is already set in common_remote_get_content, no need to set it here
+    hf_cache::hf_files result;

-    // make the request
-    common_remote_params params;
-    params.headers = headers;
-    long res_code = 0;
-    std::string res_str;
-    bool use_cache = false;
-    std::string cached_response_path = get_manifest_path(hf_repo, tag);
-    if (!offline) {
-        try {
-            auto res = common_remote_get_content(url, params);
-            res_code = res.first;
-            res_str = std::string(res.second.data(), res.second.size());
-        } catch (const std::exception & e) {
-            LOG_WRN("error: failed to get manifest at %s: %s\n", url.c_str(), e.what());
+    for (const auto & f : files) {
+        auto split_f = get_gguf_split_info(f.path);
+        if (split_f.count == split.count && split_f.prefix == split.prefix) {
+            result.push_back(f);
        }
    }
-    if (res_code == 0) {
-        if (std::filesystem::exists(cached_response_path)) {
-            LOG_WRN("trying to read manifest from cache: %s\n", cached_response_path.c_str());
-            res_str = read_file(cached_response_path);
-            res_code = 200;
-            use_cache = true;
-        } else {
-            throw std::runtime_error(
-                offline ? "error: failed to get manifest (offline mode)"
-                : "error: failed to get manifest (check your internet connection)");
+    return result;
+}
+
+static hf_cache::hf_file find_best_mmproj(const hf_cache::hf_files & files,
+                                          const std::string        & model) {
+    hf_cache::hf_file best;
+    size_t best_depth = 0;
+    int best_diff = 0;
+    bool found = false;
+
+    auto model_bits = extract_quant_bits(model);
+    auto model_parts = string_split<std::string>(model, '/');
+    auto model_dir = model_parts.end() - 1;
+
+    for (const auto & f : files) {
+        if (!string_ends_with(f.path, ".gguf") ||
+            f.path.find("mmproj") == std::string::npos) {
+            continue;
+        }
+
+        auto mmproj_parts = string_split<std::string>(f.path, '/');
+        auto mmproj_dir = mmproj_parts.end() - 1;
+
+        auto [_, dir] = std::mismatch(model_parts.begin(), model_dir,
+                                      mmproj_parts.begin(), mmproj_dir);
+        if (dir != mmproj_dir) {
+            continue;
+        }
+
+        size_t depth = dir - mmproj_parts.begin();
+        auto bits = extract_quant_bits(f.path);
+        auto diff = std::abs(bits - model_bits);
+
+        if (!found || depth > best_depth || (depth == best_depth && diff < best_diff)) {
+            best = f;
+            best_depth = depth;
+            best_diff = diff;
+            found = true;
        }
    }
-    std::string ggufFile;
-    std::string mmprojFile;
+    return best;
+}

-    if (res_code == 200 || res_code == 304) {
-        try {
-            auto j = json::parse(res_str);
+static hf_cache::hf_file find_best_model(const hf_cache::hf_files & files,
+                                         const std::string        & tag) {
+    std::vector<std::string> tags;

-            if (j.contains("ggufFile") && j["ggufFile"].contains("rfilename")) {
-                ggufFile = j["ggufFile"]["rfilename"].get<std::string>();
-            }
-            if (j.contains("mmprojFile") && j["mmprojFile"].contains("rfilename")) {
-                mmprojFile = j["mmprojFile"]["rfilename"].get<std::string>();
-            }
-        } catch (const std::exception & e) {
-            throw std::runtime_error(std::string("error parsing manifest JSON: ") + e.what());
-        }
-        if (!use_cache) {
-            // if not using cached response, update the cache file
-            write_file(cached_response_path, res_str);
-        }
-    } else if (res_code == 401) {
-        throw std::runtime_error("error: model is private or does not exist; if you are accessing a gated model, please provide a valid HF token");
+    if (!tag.empty()) {
+        tags.push_back(tag);
    } else {
-        throw std::runtime_error(string_format("error from HF API (%s), response code: %ld, data: %s", url.c_str(), res_code, res_str.c_str()));
+        tags = {"Q4_K_M", "Q4_0"};
    }

-    // check response
-    if (ggufFile.empty()) {
-        throw std::runtime_error("error: model does not have ggufFile");
+    for (const auto & t : tags) {
+        std::regex pattern(t + "[.-]", std::regex::icase);
+        for (const auto & f : files) {
+            if (string_ends_with(f.path, ".gguf") &&
+                f.path.find("mmproj") == std::string::npos &&
+                std::regex_search(f.path, pattern)) {
+                return f;
+            }
+        }
    }

-    return { hf_repo, ggufFile, mmprojFile };
+    for (const auto & f : files) {
+        if (string_ends_with(f.path, ".gguf") &&
+            f.path.find("mmproj") == std::string::npos) {
+            return f;
+        }
+    }
+
+    return {};
+}
+
+static void list_available_gguf_files(const hf_cache::hf_files & files) {
+    LOG_INF("Available GGUF files:\n");
+    for (const auto & f : files) {
+        if (string_ends_with(f.path, ".gguf")) {
+            LOG_INF(" - %s\n", f.path.c_str());
+        }
+    }
+}
+
+struct hf_plan {
+    hf_cache::hf_files model_files;
+    hf_cache::hf_file mmproj;
+};
+
+static hf_plan get_hf_plan(const common_params_model        & model,
+                           const std::string                & token,
+                           const common_download_model_opts & opts) {
+    hf_plan plan;
+    hf_cache::hf_files all;
+
+    auto [repo, tag] = common_download_split_repo_tag(model.hf_repo);
+
+    if (!opts.offline) {
+        all = hf_cache::get_repo_files(repo, token);
+    }
+    if (all.empty()) {
+        all = hf_cache::get_cached_files(repo);
+    }
+    if (all.empty()) {
+        return plan;
+    }
+
+    hf_cache::hf_file primary;
+
+    if (!model.hf_file.empty()) {
+        for (const auto & f : all) {
+            if (f.path == model.hf_file) {
+                primary = f;
+                break;
+            }
+        }
+        if (primary.path.empty()) {
+            LOG_ERR("%s: file '%s' not found in repository\n", __func__, model.hf_file.c_str());
+            list_available_gguf_files(all);
+            return plan;
+        }
+    } else {
+        primary = find_best_model(all, tag);
+        if (primary.path.empty()) {
+            LOG_ERR("%s: no GGUF files found in repository %s\n", __func__, repo.c_str());
+            list_available_gguf_files(all);
+            return plan;
+        }
+    }
+
+    plan.model_files = get_split_files(all, primary);
+
+    if (opts.download_mmproj) {
+        plan.mmproj = find_best_mmproj(all, primary.path);
+    }
+
+    return plan;
+}
+
+struct download_task {
+    std::string url;
+    std::string path;
+};
+
+static std::vector<download_task> get_url_tasks(const common_params_model & model) {
+    auto split = get_gguf_split_info(model.url);
+
+    if (split.count <= 1) {
+        return {{model.url, model.path}};
+    }
+
+    auto filename = split.prefix;
+    if (auto pos = split.prefix.rfind('/'); pos != std::string::npos) {
+        filename = split.prefix.substr(pos + 1);
+    }
+
+    auto parent_path = std::filesystem::path(model.path).parent_path();
+    auto prefix_path = (parent_path / filename).string();
+
+    std::vector<download_task> tasks;
+    for (int i = 1; i <= split.count; i++) {
+        auto suffix = string_format("-%05d-of-%05d.gguf", i, split.count);
+        tasks.push_back({split.prefix + suffix, prefix_path + suffix});
+    }
+    return tasks;
+}
+
+common_download_model_result common_download_model(const common_params_model        & model,
+                                                   const std::string                & bearer_token,
+                                                   const common_download_model_opts & opts,
+                                                   const common_header_list         & headers) {
+    common_download_model_result result;
+    std::vector<download_task> tasks;
+    hf_plan hf;
+
+    bool is_hf = !model.hf_repo.empty();
+
+    if (is_hf) {
+        hf = get_hf_plan(model, bearer_token, opts);
+        for (const auto & f : hf.model_files) {
+            tasks.push_back({f.url, f.local_path});
+        }
+        if (!hf.mmproj.path.empty()) {
+            tasks.push_back({hf.mmproj.url, hf.mmproj.local_path});
+        }
+    } else if (!model.url.empty()) {
+        tasks = get_url_tasks(model);
+    } else {
+        result.model_path = model.path;
+        return result;
+    }
+
+    if (tasks.empty()) {
+        return result;
+    }
+
+    std::vector<std::future<bool>> futures;
+    for (const auto & task : tasks) {
+        futures.push_back(std::async(std::launch::async,
+            [&task, &bearer_token, offline = opts.offline, &headers, is_hf]() {
+                int status = common_download_file_single(task.url, task.path, bearer_token, offline, headers, is_hf);
+                return is_http_status_ok(status);
+            }
+        ));
+    }
+
+    for (auto & f : futures) {
+        if (!f.get()) {
+            return {};
+        }
+    }
+
+    if (is_hf) {
+        for (const auto & f : hf.model_files) {
+            hf_cache::finalize_file(f);
+        }
+        result.model_path = hf.model_files[0].final_path;
+
+        if (!hf.mmproj.path.empty()) {
+            result.mmproj_path = hf_cache::finalize_file(hf.mmproj);
+        }
+    } else {
+        result.model_path = model.path;
+    }
+
+    return result;
 }

 //
@@ -765,28 +852,21 @@ std::string common_docker_resolve_model(const std::string & docker) {
 }

 std::vector<common_cached_model_info> common_list_cached_models() {
-    std::vector<common_cached_model_info> models;
-    const std::string cache_dir = fs_get_cache_directory();
-    const std::vector<common_file_info> files = fs_list(cache_dir, false);
-    for (const auto & file : files) {
-        if (string_starts_with(file.name, "manifest=") && string_ends_with(file.name, ".json")) {
-            common_cached_model_info model_info;
-            model_info.manifest_path = file.path;
-            std::string fname = file.name;
-            string_replace_all(fname, ".json", ""); // remove extension
-            auto parts = string_split<std::string>(fname, '=');
-            if (parts.size() == 4) {
-                // expect format: manifest=<user>=<model>=<tag>=<other>
-                model_info.user  = parts[1];
-                model_info.model = parts[2];
-                model_info.tag   = parts[3];
-            } else {
-                // invalid format
-                continue;
-            }
-            model_info.size = 0; // TODO: get GGUF size, not manifest size
-            models.push_back(model_info);
+    std::unordered_set<std::string> seen;
+    std::vector<common_cached_model_info> result;
+
+    auto files = hf_cache::get_cached_files();
+
+    for (const auto & f : files) {
+        auto split = get_gguf_split_info(f.path);
+        if (split.index != 1 || split.tag.empty() ||
+            split.prefix.find("mmproj") != std::string::npos) {
+            continue;
+        }
+        if (seen.insert(f.repo_id + ":" + split.tag).second) {
+            result.push_back({f.repo_id, split.tag});
        }
    }
-    return models;
+
+    return result;
 }
--- a/common/download.h
+++ b/common/download.h
@@ -17,54 +17,60 @@ struct common_remote_params {
 // get remote file content, returns <http_code, raw_response_body>
 std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);

-// split HF repo with tag into <repo, tag>
-// for example: "user/model:tag" -> <"user/model", "tag">
-// if tag is not present, default to "latest"
-// example: "user/model" -> <"user/model", "latest">
+// split HF repo with tag into <repo, tag>, for example:
+// - "ggml-org/models:F16" -> <"ggml-org/models", "F16">
+// tag is optional and can be empty
 std::pair<std::string, std::string> common_download_split_repo_tag(const std::string & hf_repo_with_tag);

+// Result of common_list_cached_models
 struct common_cached_model_info {
-    std::string manifest_path;
-    std::string user;
-    std::string model;
+    std::string repo;
    std::string tag;
-    size_t      size = 0; // GGUF size in bytes
-    // return string representation like "user/model:tag"
-    // if tag is "latest", it will be omitted
    std::string to_string() const {
-        return user + "/" + model + (tag == "latest" ? "" : ":" + tag);
+        return repo + ":" + tag;
    }
 };

-struct common_hf_file_res {
-    std::string repo; // repo name with ":tag" removed
-    std::string ggufFile;
-    std::string mmprojFile;
+// Options for common_download_model
+struct common_download_model_opts {
+    bool download_mmproj = false;
+    bool offline         = false;
 };

-/**
- * Allow getting the HF file from the HF repo with tag (like ollama), for example:
- * - bartowski/Llama-3.2-3B-Instruct-GGUF:q4
- * - bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
- * - bartowski/Llama-3.2-3B-Instruct-GGUF:q5_k_s
- * Tag is optional, default to "latest" (meaning it checks for Q4_K_M first, then Q4, then if not found, return the first GGUF file in repo)
- *
- * Return pair of <repo, file> (with "repo" already having tag removed)
- *
- * Note: we use the Ollama-compatible HF API, but not using the blobId. Instead, we use the special "ggufFile" field which returns the value for "hf_file". This is done to be backward-compatible with existing cache files.
- */
-common_hf_file_res common_get_hf_file(
-    const std::string & hf_repo_with_tag,
-    const std::string & bearer_token,
-    bool offline,
-    const common_header_list & headers = {}
-);
+// Result of common_download_model
+struct common_download_model_result {
+    std::string model_path;
+    std::string mmproj_path;
+};

-// returns true if download succeeded
-bool common_download_model(
+// Download model from HuggingFace repo or URL
+//
+// input (via model struct):
+// - model.hf_repo: HF repo with optional tag, see common_download_split_repo_tag
+// - model.hf_file: specific file in the repo (requires hf_repo)
+// - model.url: simple download (used if hf_repo is empty)
+// - model.path: local file path
+//
+// tag matching (for HF repos without model.hf_file):
+// - if tag is specified, searches for GGUF matching that quantization
+// - if no tag, searches for Q4_K_M, then Q4_0, then first available GGUF
+//
+// split GGUF: multi-part files like "model-00001-of-00003.gguf" are automatically
+// detected and all parts are downloaded
+//
+// caching:
+// - HF repos: uses HuggingFace cache
+// - URLs: uses ETag-based caching
+//
+// when opts.offline=true, no network requests are made
+// when opts.download_mmproj=true, searches for an mmproj file in the model's directory or any parent directory,
+// preferring the deepest matching directory, then the closest quantization bit width
+//
+// returns result with model_path and mmproj_path (empty on failure)
+common_download_model_result common_download_model(
    const common_params_model & model,
    const std::string & bearer_token,
-    bool offline,
+    const common_download_model_opts & opts = {},
    const common_header_list & headers = {}
 );

@@ -73,11 +79,13 @@ std::vector<common_cached_model_info> common_list_cached_models();

 // download single file from url to local path
 // returns status code or -1 on error
+// skip_etag: if true, an existing file is used as-is and no .etag files are read or written (for HF cache where filename is the hash)
 int common_download_file_single(const std::string & url,
                                const std::string & path,
                                const std::string & bearer_token,
                                bool offline,
-                                const common_header_list & headers = {});
+                                const common_header_list & headers = {},
+                                bool skip_etag = false);

 // resolve and download model from Docker registry
 // return local path to downloaded model file
--- a/common/hf-cache.cpp
+++ b/common/hf-cache.cpp
@@ -0,0 +1,644 @@
+#include "hf-cache.h"
+
+#include "common.h"
+#include "log.h"
+#include "http.h"
+
+#define JSON_ASSERT GGML_ASSERT
+#include <nlohmann/json.hpp>
+
+#include <filesystem>
+#include <fstream>
+#include <atomic>
+#include <regex> // migration only
+#include <string>
+#include <string_view>
+#include <stdexcept>
+
+namespace nl = nlohmann;
+
+#if defined(_WIN32)
+#define WIN32_LEAN_AND_MEAN
+#ifndef NOMINMAX
+#define NOMINMAX
+#endif
+#define HOME_DIR "USERPROFILE"
+#include <windows.h>
+#else
+#define HOME_DIR "HOME"
+#endif
+
+namespace hf_cache {
+
+namespace fs = std::filesystem;
+
+static fs::path get_cache_directory() {
+    static const fs::path cache = []() {
+        struct {
+            const char * var;
+            fs::path path;
+        } entries[] = {
+            {"HF_HUB_CACHE",          fs::path()},
+            {"HUGGINGFACE_HUB_CACHE", fs::path()},
+            {"HF_HOME",               fs::path("hub")},
+            {"XDG_CACHE_HOME",        fs::path("huggingface") / "hub"},
+            {HOME_DIR,                fs::path(".cache") / "huggingface" / "hub"}
+        };
+        for (const auto & entry : entries) {
+            if (auto * p = std::getenv(entry.var); p && *p) {
+                fs::path base(p);
+                return entry.path.empty() ? base : base / entry.path;
+            }
+        }
+        throw std::runtime_error("Failed to determine HF cache directory");
+    }();
+
+    return cache;
+}
+
+static std::string folder_name_to_repo(const std::string & folder) {
+    constexpr std::string_view prefix = "models--";
+    if (folder.rfind(prefix, 0) != 0) { // folder does not start with prefix
+        return {};
+    }
+    std::string result = folder.substr(prefix.length());
+    string_replace_all(result, "--", "/");
+    return result;
+}
+
+static std::string repo_to_folder_name(const std::string & repo_id) {
+    constexpr std::string_view prefix = "models--";
+    std::string result = std::string(prefix) + repo_id;
+    string_replace_all(result, "/", "--");
+    return result;
+}
+
+static fs::path get_repo_path(const std::string & repo_id) {
+    return get_cache_directory() / repo_to_folder_name(repo_id);
+}
+
+static bool is_hex_char(const char c) {
+    return (c >= 'A' && c <= 'F') ||
+           (c >= 'a' && c <= 'f') ||
+           (c >= '0' && c <= '9');
+}
+
+static bool is_hex_string(const std::string & s, size_t expected_len) {
+    if (s.length() != expected_len) {
+        return false;
+    }
+    for (const char c : s) {
+        if (!is_hex_char(c)) {
+            return false;
+        }
+    }
+    return true;
+}
+
+static bool is_alphanum(const char c) {
+    return (c >= 'A' && c <= 'Z') ||
+           (c >= 'a' && c <= 'z') ||
+           (c >= '0' && c <= '9');
+}
+
+static bool is_special_char(char c) {
+    return c == '/' || c == '.' || c == '-';
+}
+
+// base chars [A-Za-z0-9_] are always valid
+// special chars [/.-] must be surrounded by base chars
+// exactly one '/' required
+static bool is_valid_repo_id(const std::string & repo_id) {
+    if (repo_id.empty() || repo_id.length() > 256) {
+        return false;
+    }
+    int slash = 0;
+    bool special = true;
+
+    for (const char c : repo_id) {
+        if (is_alphanum(c) || c == '_') {
+            special = false;
+        } else if (is_special_char(c)) {
+            if (special) {
+                return false;
+            }
+            slash += (c == '/');
+            special = true;
+        } else {
+            return false;
+        }
+    }
+    return !special && slash == 1;
+}
+
+static bool is_valid_hf_token(const std::string & token) {
+    if (token.length() < 37 || token.length() > 256 ||
+        !string_starts_with(token, "hf_")) {
+        return false;
+    }
+    for (size_t i = 3; i < token.length(); ++i) {
+        if (!is_alphanum(token[i])) {
+            return false;
+        }
+    }
+    return true;
+}
+
+static bool is_valid_commit(const std::string & hash) {
+    return is_hex_string(hash, 40);
+}
+
+static bool is_valid_oid(const std::string & oid) {
+    return is_hex_string(oid, 40) || is_hex_string(oid, 64);
+}
+
+static bool is_valid_subpath(const fs::path & path, const fs::path & subpath) {
+    if (subpath.is_absolute()) {
+        return false; // a / b with b absolute yields b, which would escape a
+    }
+    auto b = fs::absolute(path).lexically_normal();
+    auto t = (b / subpath).lexically_normal();
+    auto [b_end, _] = std::mismatch(b.begin(), b.end(), t.begin(), t.end());
+
+    return b_end == b.end();
+}
+
+static void safe_write_file(const fs::path & path, const std::string & data) {
+    fs::path path_tmp = path.string() + ".tmp";
+
+    if (path.has_parent_path()) {
+        fs::create_directories(path.parent_path());
+    }
+
+    std::ofstream file(path_tmp);
+    file << data;
+    file.close();
+
+    std::error_code ec;
+
+    if (!file.fail()) {
+        fs::rename(path_tmp, path, ec);
+    }
+    if (file.fail() || ec) {
+        fs::remove(path_tmp, ec);
+        throw std::runtime_error("failed to write file: " + path.string());
+    }
+}
+
+static nl::json api_get(const std::string & url,
+                        const std::string & token) {
+    auto [cli, parts] = common_http_client(url);
+
+    httplib::Headers headers = {
+        {"User-Agent", "llama-cpp/" + build_info},
+        {"Accept", "application/json"}
+    };
+
+    if (is_valid_hf_token(token)) {
+        headers.emplace("Authorization", "Bearer " + token);
+    } else if (!token.empty()) {
+        LOG_WRN("%s: invalid token, authentication disabled\n", __func__);
+    }
+
+    if (auto res = cli.Get(parts.path, headers)) {
+        auto body = res->body;
+
+        if (res->status == 200) {
+            return nl::json::parse(res->body);
+        }
+        try {
+            body = nl::json::parse(res->body)["error"].get<std::string>();
+        } catch (...) { }
+
+        throw std::runtime_error("GET failed (" + std::to_string(res->status) + "): " + body);
+    } else {
+        throw std::runtime_error("HTTPLIB failed: " + httplib::to_string(res.error()));
+    }
+}
+
+static std::string get_repo_commit(const std::string & repo_id,
+                                   const std::string & token) {
+    try {
+        auto endpoint = get_model_endpoint();
+        auto json = api_get(endpoint + "api/models/" + repo_id + "/refs", token);
+
+        if (!json.is_object() ||
+            !json.contains("branches") || !json["branches"].is_array()) {
+            LOG_WRN("%s: missing 'branches' for '%s'\n", __func__, repo_id.c_str());
+            return {};
+        }
+
+        fs::path refs_path = get_repo_path(repo_id) / "refs";
+        std::string name;
+        std::string commit;
+
+        for (const auto & branch : json["branches"]) {
+            if (!branch.is_object() ||
+                !branch.contains("name") || !branch["name"].is_string() ||
+                !branch.contains("targetCommit") || !branch["targetCommit"].is_string()) {
+                continue;
+            }
+            std::string _name = branch["name"].get<std::string>();
+            std::string _commit = branch["targetCommit"].get<std::string>();
+
+            if (!is_valid_subpath(refs_path, _name)) {
+                LOG_WRN("%s: skip invalid branch: %s\n", __func__, _name.c_str());
+                continue;
+            }
+            if (!is_valid_commit(_commit)) {
+                LOG_WRN("%s: skip invalid commit: %s\n", __func__, _commit.c_str());
+                continue;
+            }
+
+            if (_name == "main") {
+                name = _name;
+                commit = _commit;
+                break;
+            }
+
+            if (name.empty() || commit.empty()) {
+                name = _name;
+                commit = _commit;
+            }
+        }
+
+        if (name.empty() || commit.empty()) {
+            LOG_WRN("%s: no valid branch for '%s'\n", __func__, repo_id.c_str());
+            return {};
+        }
+
+        safe_write_file(refs_path / name, commit);
+        return commit;
+
+    } catch (const nl::json::exception & e) {
+        LOG_ERR("%s: JSON error: %s\n", __func__, e.what());
+    } catch (const std::exception & e) {
+        LOG_ERR("%s: error: %s\n", __func__, e.what());
+    }
+    return {};
+}
+
+hf_files get_repo_files(const std::string & repo_id,
+                        const std::string & token) {
+    if (!is_valid_repo_id(repo_id)) {
+        LOG_WRN("%s: invalid repository: %s\n", __func__, repo_id.c_str());
+        return {};
+    }
+
+    std::string commit = get_repo_commit(repo_id, token);
+    if (commit.empty()) {
+        LOG_WRN("%s: failed to resolve commit for %s\n", __func__, repo_id.c_str());
+        return {};
+    }
+
+    fs::path blobs_path = get_repo_path(repo_id) / "blobs";
+    fs::path commit_path = get_repo_path(repo_id) / "snapshots" / commit;
+
+    hf_files files;
+
+    try {
+        auto endpoint = get_model_endpoint();
+        auto json = api_get(endpoint + "api/models/" + repo_id + "/tree/" + commit + "?recursive=true", token);
+
+        if (!json.is_array()) {
+            LOG_WRN("%s: response is not an array for '%s'\n", __func__, repo_id.c_str());
+            return {};
+        }
+
+        for (const auto & item : json) {
+            if (!item.is_object() ||
+                !item.contains("type") || !item["type"].is_string() || item["type"] != "file" ||
+                !item.contains("path") || !item["path"].is_string()) {
+                continue;
+            }
+
+            hf_file file;
+            file.repo_id = repo_id;
+            file.path = item["path"].get<std::string>();
+
+            if (!is_valid_subpath(commit_path, file.path)) {
+                LOG_WRN("%s: skip invalid path: %s\n", __func__, file.path.c_str());
+                continue;
+            }
+
+            if (item.contains("lfs") && item["lfs"].is_object()) {
+                if (item["lfs"].contains("oid") && item["lfs"]["oid"].is_string()) {
+                    file.oid = item["lfs"]["oid"].get<std::string>();
+                }
+            } else if (item.contains("oid") && item["oid"].is_string()) {
+                file.oid = item["oid"].get<std::string>();
+            }
+
+            if (!file.oid.empty() && !is_valid_oid(file.oid)) {
+                LOG_WRN("%s: skip invalid oid: %s\n", __func__, file.oid.c_str());
+                continue;
+            }
+
+            file.url = endpoint + repo_id + "/resolve/" + commit + "/" + file.path;
+
+            fs::path final_path = commit_path / file.path;
+            file.final_path = final_path.string();
+
+            if (!file.oid.empty() && !fs::exists(final_path)) {
+                fs::path local_path = blobs_path / file.oid;
+                file.local_path = local_path.string();
+            } else {
+                file.local_path = file.final_path;
+            }
+
+            files.push_back(file);
+        }
+    } catch (const nl::json::exception & e) {
+        LOG_ERR("%s: JSON error: %s\n", __func__, e.what());
+    } catch (const std::exception & e) {
+        LOG_ERR("%s: error: %s\n", __func__, e.what());
+    }
+    return files;
+}
+
+static std::string get_cached_ref(const fs::path & repo_path) {
+    fs::path refs_path = repo_path / "refs";
+    if (!fs::is_directory(refs_path)) {
+        return {};
+    }
+    std::string fallback;
+
+    for (const auto & entry : fs::directory_iterator(refs_path)) {
+        if (!entry.is_regular_file()) {
+            continue;
+        }
+        std::ifstream f(entry.path());
+        std::string commit;
+        if (!f || !std::getline(f, commit) || commit.empty()) {
+            continue;
+        }
+        if (!is_valid_commit(commit)) {
+            LOG_WRN("%s: skip invalid commit: %s\n", __func__, commit.c_str());
+            continue;
+        }
+        if (entry.path().filename() == "main") {
+            return commit;
+        }
+        if (fallback.empty()) {
+            fallback = commit;
+        }
+    }
+    return fallback;
+}
+
+hf_files get_cached_files(const std::string & repo_id) {
+    fs::path cache_dir = get_cache_directory();
+    if (!fs::exists(cache_dir)) {
+        return {};
+    }
+
+    if (!repo_id.empty() && !is_valid_repo_id(repo_id)) {
+        LOG_WRN("%s: invalid repository: %s\n", __func__, repo_id.c_str());
+        return {};
+    }
+
+    hf_files files;
+
+    for (const auto & repo : fs::directory_iterator(cache_dir)) {
+        if (!repo.is_directory()) {
+            continue;
+        }
+        fs::path snapshots_path = repo.path() / "snapshots";
+
+        if (!fs::exists(snapshots_path)) {
+            continue;
+        }
+        std::string _repo_id = folder_name_to_repo(repo.path().filename().string());
+
+        if (!is_valid_repo_id(_repo_id)) {
+            continue;
+        }
+        if (!repo_id.empty() && _repo_id != repo_id) {
+            continue;
+        }
+        std::string commit = get_cached_ref(repo.path());
+        fs::path commit_path = snapshots_path / commit;
+
+        if (commit.empty() || !fs::is_directory(commit_path)) {
+            continue;
+        }
+        for (const auto & entry : fs::recursive_directory_iterator(commit_path)) {
+            if (!entry.is_regular_file() && !entry.is_symlink()) {
+                continue;
+            }
+            fs::path path = entry.path().lexically_relative(commit_path);
+
+            if (!path.empty()) {
+                hf_file file;
+                file.repo_id = _repo_id;
+                file.path = path.generic_string();
+                file.local_path = entry.path().string();
+                file.final_path = file.local_path;
+                files.push_back(std::move(file));
+            }
+        }
+    }
+
+    return files;
+}
+
+std::string finalize_file(const hf_file & file) {
+    static std::atomic<bool> symlinks_disabled{false};
+
+    std::error_code ec;
+    fs::path local_path(file.local_path);
+    fs::path final_path(file.final_path);
+
+    if (local_path == final_path || fs::exists(final_path, ec)) {
+        return file.final_path;
+    }
+
+    if (!fs::exists(local_path, ec)) {
+        return file.final_path;
+    }
+
+    fs::create_directories(final_path.parent_path(), ec);
+
+    if (!symlinks_disabled) {
+        fs::path target = fs::relative(local_path, final_path.parent_path(), ec);
+        if (!ec) {
+            fs::create_symlink(target, final_path, ec);
+        }
+        if (!ec) {
+            return file.final_path;
+        }
+    }
+
+    if (!symlinks_disabled.exchange(true)) {
+        LOG_WRN("%s: failed to create symlink: %s\n", __func__, ec.message().c_str());
+        LOG_WRN("%s: switching to degraded mode\n", __func__);
+    }
+
+    fs::rename(local_path, final_path, ec);
+    if (ec) {
+        LOG_WRN("%s: failed to move file to snapshots: %s\n", __func__, ec.message().c_str());
+        fs::copy(local_path, final_path, ec);
+        if (ec) {
+            LOG_ERR("%s: failed to copy file to snapshots: %s\n", __func__, ec.message().c_str());
+        }
+    }
+    return file.final_path;
+}
+
+// delete everything after this line, one day
+
+static std::pair<std::string, std::string> parse_manifest_name(std::string & filename) {
+    static const std::regex re(R"(^manifest=([^=]+)=([^=]+)=.*\.json$)");
+    std::smatch match;
+    if (std::regex_match(filename, match, re)) {
+        return {match[1].str(), match[2].str()};
+    }
+    return {};
+}
+
+static std::string make_old_cache_filename(const std::string & owner,
+                                           const std::string & repo,
+                                           const std::string & filename) {
+    auto result = owner + "_" + repo + "_" + filename;
+    string_replace_all(result, "/", "_");
+    return result;
+}
+
+static bool migrate_single_file(const fs::path    & old_cache,
+                                const std::string & owner,
+                                const std::string & repo,
+                                const nl::json    & node,
+                                const hf_files    & files) {
+
+    if (!node.contains("rfilename") ||
+        !node.contains("lfs")       ||
+        !node["lfs"].contains("sha256")) {
+        return false;
+    }
+
+    std::string path = node["rfilename"];
+    std::string sha256 = node["lfs"]["sha256"];
+
+    const hf_file * file_info = nullptr;
+    for (const auto & f : files) {
+        if (f.path == path) {
+            file_info = &f;
+            break;
+        }
+    }
+
+    std::string old_filename = make_old_cache_filename(owner, repo, path);
+    fs::path old_path = old_cache / old_filename;
+    fs::path etag_path = old_path.string() + ".etag";
+
+    if (!fs::exists(old_path)) {
+        if (fs::exists(etag_path)) {
+            LOG_WRN("%s: %s is orphan, deleting...\n", __func__, etag_path.string().c_str());
+            fs::remove(etag_path);
+        }
+        return false;
+    }
+
+    bool delete_old_path = false;
+
+    if (!file_info) {
+        LOG_WRN("%s: %s not found in current repo, deleting...\n", __func__, old_filename.c_str());
+        delete_old_path = true;
+    } else if (!sha256.empty() && !file_info->oid.empty() && sha256 != file_info->oid) {
+        LOG_WRN("%s: %s is not up to date (sha256 mismatch), deleting...\n", __func__, old_filename.c_str());
+        delete_old_path = true;
+    }
+
+    std::error_code ec;
+
+    if (delete_old_path) {
+        fs::remove(old_path, ec);
+        fs::remove(etag_path, ec);
+        return true;
+    }
+
+    fs::path new_path(file_info->local_path);
+    fs::create_directories(new_path.parent_path(), ec);
+
+    if (!fs::exists(new_path, ec)) {
+        fs::rename(old_path, new_path, ec);
+        if (ec) {
+            fs::copy_file(old_path, new_path, ec);
+            if (ec) {
+                LOG_WRN("%s: failed to move/copy %s: %s\n", __func__, old_path.string().c_str(), ec.message().c_str());
+                return false;
+            }
+        }
+        fs::remove(old_path, ec);
+    }
+    fs::remove(etag_path, ec);
+
+    std::string filename = finalize_file(*file_info);
+    LOG_INF("%s: migrated %s -> %s\n", __func__, old_filename.c_str(), filename.c_str());
+
+    return true;
+}
+
+void migrate_old_cache_to_hf_cache(const std::string & token, bool offline) {
+    fs::path old_cache = fs_get_cache_directory();
+    if (!fs::exists(old_cache)) {
+        return;
+    }
+
+    if (offline) {
+        LOG_WRN("%s: skipping migration in offline mode (will run when online)\n", __func__);
+        return; // -hf is not going to work
+    }
+
+    bool warned = false;
+
+    for (const auto & entry : fs::directory_iterator(old_cache)) {
+        if (!entry.is_regular_file()) {
+            continue;
+        }
+        auto filename = entry.path().filename().string();
+        auto [owner, repo] = parse_manifest_name(filename);
+
+        if (owner.empty() || repo.empty()) {
+            continue;
+        }
+
+        if (!warned) {
+            warned = true;
+            LOG_WRN("================================================================================\n"
+                    "WARNING: Migrating cache to HuggingFace cache directory\n"
+                    "  Old cache: %s\n"
+                    "  New cache: %s\n"
+                    "This one-time migration moves models previously downloaded with -hf\n"
+                    "from the legacy llama.cpp cache to the standard HuggingFace cache.\n"
+                    "Models downloaded with --model-url are not affected.\n"
+                    "================================================================================\n",
+                    old_cache.string().c_str(), get_cache_directory().string().c_str());
+        }
+
+        auto repo_id = owner + "/" + repo;
+        auto files = get_repo_files(repo_id, token);
+
+        if (files.empty()) {
+            LOG_WRN("%s: could not get repo files for %s, skipping\n", __func__, repo_id.c_str());
+            continue;
+        }
+
+        try {
+            std::ifstream manifest(entry.path());
+            auto json = nl::json::parse(manifest);
+
+            for (const char * key : {"ggufFile", "mmprojFile"}) {
+                if (json.contains(key)) {
+                    migrate_single_file(old_cache, owner, repo, json[key], files);
+                }
+            }
+        } catch (const std::exception & e) {
+            LOG_WRN("%s: failed to parse manifest %s: %s\n", __func__, filename.c_str(), e.what());
+            continue;
+        }
+        fs::remove(entry.path());
+    }
+}
+
+} // namespace hf_cache
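The branch selection in `get_repo_commit` (prefer `main`, otherwise keep the first valid branch) can be sketched in isolation. This is a minimal illustration, not the code from this change: `branch_ref` and `pick_branch` are hypothetical names, and the empty-string check only stands in for the `is_valid_subpath` / `is_valid_commit` validation.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one entry of the "branches" array:
// first = branch name, second = target commit hash.
using branch_ref = std::pair<std::string, std::string>;

// Return the ref to use: "main" if present, otherwise the first usable entry.
// An empty pair means no usable branch was found.
inline branch_ref pick_branch(const std::vector<branch_ref> & branches) {
    branch_ref chosen;
    for (const auto & b : branches) {
        if (b.first.empty() || b.second.empty()) {
            continue; // placeholder for the real name/commit validation
        }
        if (b.first == "main") {
            return b; // "main" always wins, regardless of its position
        }
        if (chosen.first.empty()) {
            chosen = b; // remember the first usable branch as a fallback
        }
    }
    return chosen;
}

// Small usage example: "main" is chosen even when it is listed last.
inline void pick_branch_example() {
    const std::vector<branch_ref> refs = {{"dev", "aaa"}, {"main", "bbb"}};
    assert(pick_branch(refs).first == "main");
}
```

Returning early on `main` mirrors the `break` in the loop above; the fallback only matters for repositories that have no `main` branch.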
--- a/common/hf-cache.h
+++ b/common/hf-cache.h
@@ -0,0 +1,35 @@
+#pragma once
+
+#include <string>
+#include <vector>
+
+// Ref: https://huggingface.co/docs/hub/local-cache.md
+
+namespace hf_cache {
+
+struct hf_file {
+    std::string path;
+    std::string url;
+    std::string local_path;
+    std::string final_path;
+    std::string oid;
+    std::string repo_id;
+};
+
+using hf_files = std::vector<hf_file>;
+
+// Get files from HF API
+hf_files get_repo_files(
+    const std::string & repo_id,
+    const std::string & token
+);
+
+hf_files get_cached_files(const std::string & repo_id = {});
+
+// Create snapshot path (link or move/copy) and return it
+std::string finalize_file(const hf_file & file);
+
+// TODO: Remove later
+void migrate_old_cache_to_hf_cache(const std::string & token, bool offline = false);
+
+} // namespace hf_cache
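The relationship between `local_path` and `final_path` follows from the directory layout used in `hf-cache.cpp`: `refs/<branch>`, `blobs/<oid>` and `snapshots/<commit>/<path>` under the repository directory. The sketch below only builds those paths. `repo_layout` and `make_layout` are hypothetical names, and `lexically_relative` is swapped in for `fs::relative` so the example does not touch the filesystem.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper type: the three locations a downloaded file is tied to.
struct repo_layout {
    fs::path ref_file;    // refs/<branch>, holds the commit hash
    fs::path blob;        // blobs/<oid>, holds the file content
    fs::path snapshot;    // snapshots/<commit>/<path>, what callers open
    fs::path link_target; // relative path from the snapshot directory to the blob
};

inline repo_layout make_layout(const fs::path    & repo_path,
                               const std::string & branch,
                               const std::string & commit,
                               const std::string & oid,
                               const std::string & rel_path) {
    repo_layout l;
    l.ref_file    = repo_path / "refs" / branch;
    l.blob        = repo_path / "blobs" / oid;
    l.snapshot    = repo_path / "snapshots" / commit / rel_path;
    // Purely lexical: no symlink resolution and no filesystem access.
    l.link_target = l.blob.lexically_relative(l.snapshot.parent_path());
    return l;
}

// Small usage example with made-up identifiers.
inline void make_layout_example() {
    const auto l = make_layout("repo", "main", "abc", "deadbeef", "model.gguf");
    assert(l.blob.generic_string() == "repo/blobs/deadbeef");
}
```

A relative link target keeps the snapshot entry valid if the whole cache directory is moved, which is presumably why `finalize_file` computes the target relative to the snapshot's parent directory.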
--- a/common/jinja/runtime.cpp
+++ b/common/jinja/runtime.cpp
@@ -667,8 +667,9 @@ value macro_statement::execute_impl(context & ctx) {
                if (is_stmt<identifier>(this->args[i])) {
                    // normal parameter
                    std::string param_name = cast_stmt<identifier>(this->args[i])->val;
-                    JJ_DEBUG("  Binding parameter '%s' to argument of type %s", param_name.c_str(), args.get_pos(i)->type().c_str());
-                    macro_ctx.set_val(param_name, args.get_pos(i));
+                    value param_value = args.get_kwarg_or_pos(param_name, i);
+                    JJ_DEBUG("  Binding parameter '%s' to argument of type %s", param_name.c_str(), param_value->type().c_str());
+                    macro_ctx.set_val(param_name, param_value);
                } else if (is_stmt<keyword_argument_expression>(this->args[i])) {
                    // default argument used as normal parameter
                    auto kwarg = cast_stmt<keyword_argument_expression>(this->args[i]);
@@ -676,8 +677,9 @@ value macro_statement::execute_impl(context & ctx) {
                        throw std::runtime_error("Keyword argument key must be an identifier in macro '" + name + "'");
                    }
                    std::string param_name = cast_stmt<identifier>(kwarg->key)->val;
-                    JJ_DEBUG("  Binding parameter '%s' to argument of type %s", param_name.c_str(), args.get_pos(i)->type().c_str());
-                    macro_ctx.set_val(param_name, args.get_pos(i));
+                    value param_value = args.get_kwarg_or_pos(param_name, i);
+                    JJ_DEBUG("  Binding parameter '%s' to argument of type %s", param_name.c_str(), param_value->type().c_str());
+                    macro_ctx.set_val(param_name, param_value);
                } else {
                    throw std::runtime_error("Invalid parameter type in macro '" + name + "'");
                }
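The change above replaces a purely positional lookup with `get_kwarg_or_pos`, so a macro parameter can be bound by name as well as by position. The real implementation is not part of this hunk; the sketch below is an assumption based on the name and call site, using a hypothetical `call_args` type with plain strings instead of the runtime's `value` type.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified call arguments (not the jinja runtime's types).
struct call_args {
    std::vector<std::string>           positional;
    std::map<std::string, std::string> keyword;

    // Prefer the keyword argument called `name`; otherwise fall back to the
    // positional argument at index `pos`. Throws if neither is available.
    const std::string & kwarg_or_pos(const std::string & name, std::size_t pos) const {
        const auto it = keyword.find(name);
        if (it != keyword.end()) {
            return it->second;
        }
        if (pos < positional.size()) {
            return positional[pos];
        }
        throw std::out_of_range("missing argument '" + name + "'");
    }
};

// Small usage example: a call shaped like my_macro("hello", b="world").
inline void kwarg_or_pos_example() {
    call_args args;
    args.positional = {"hello"};
    args.keyword["b"] = "world";
    assert(args.kwarg_or_pos("b", 1) == "world"); // index 1 is never read
}
```

With a positional-only lookup, the second parameter in this example would index past the end of the positional list, which is the kind of call the keyword-first lookup is meant to handle.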
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -4572,7 +4572,7 @@ class Qwen2MoeModel(TextModel):
                raise ValueError(f"Unprocessed experts: {experts}")


-@ModelBase.register("Qwen3ForCausalLM")
+@ModelBase.register("Qwen3ForCausalLM", "Qwen3Model")
 class Qwen3Model(Qwen2Model):
    model_arch = gguf.MODEL_ARCH.QWEN3

--- a/docs/backend/OPENVINO.md
+++ b/docs/backend/OPENVINO.md
@@ -1,6 +1,9 @@
 # OpenVINO Backend for llama.cpp
-[OpenVINO](https://docs.openvino.ai/) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge.
-This document describes the [OpenVINO backend for llama.cpp](../../src/ggml-openvino), which enables hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs** while remaining compatible with the existing **GGUF model ecosystem**. The backend translates GGML compute graphs into OpenVINO graphs and leverages graph compilation, kernel fusion, and device-specific optimizations to improve inference performance on supported Intel hardware.
+
+> [!NOTE]
+> Performance and memory optimizations, accuracy validation, broader quantization coverage, and broader operator and model support are work in progress.
+
+[OpenVINO](https://docs.openvino.ai/) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge. [OpenVINO backend for llama.cpp](../../src/ggml-openvino) enables hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs** while remaining compatible with the existing **GGUF model ecosystem**. The backend translates GGML compute graphs into OpenVINO graphs and leverages graph compilation, kernel fusion, and device-specific optimizations to improve inference performance on supported Intel hardware.

 The OpenVINO backend is implemented in `ggml/src/ggml-openvino` and provides a translation layer for core GGML operations. The OpenVINO backend replaces the standard GGML graph execution path with Intel's OpenVINO inference engine. This approach allows the same GGUF model file to run on Intel CPUs, Intel GPUs (integrated and discrete), and Intel NPUs without changes to the model or the rest of the llama.cpp stack. When a `ggml_cgraph` is dispatched to OpenVINO backend, it:

@@ -179,31 +182,73 @@ curl -L https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/L

 When using the OpenVINO backend, the first inference token may have slightly higher latency due to on-the-fly conversion to the OpenVINO graph. Subsequent tokens and runs will be faster.

+> [!NOTE]
+> The default context size is the model's training context, which may be very large (for example, 131072 for Llama 3.2 1B) and can reduce performance, especially on edge/laptop devices. Use `-c` to limit the context size in supported llama.cpp tools, for example `-c 512`.
+
 ```bash
 # If device is unset or unavailable, defaults to CPU.
 # If the system has multiple GPUs, use GPU.0 or GPU.1 to explicitly target a specific GPU.

 # Linux
 export GGML_OPENVINO_DEVICE=GPU
+# Enable stateful execution with GPU device to avoid known stateless execution failures.
+export GGML_OPENVINO_STATEFUL_EXECUTION=1
 # To run llama-simple:
 ./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -n 50 "The story of AI is "
 # To run in chat mode:
-./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf
+./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -c 1024
+# To run llama-bench, -fa 1 is needed
+GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-bench -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -fa 1
+
+# NPU: keep context small to avoid failures from very large model context windows.
+export GGML_OPENVINO_DEVICE=NPU
+./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -c 512

 # Windows Command Line
 set GGML_OPENVINO_DEVICE=GPU
+# Enable stateful execution with GPU device to avoid known stateless execution failures.
+set GGML_OPENVINO_STATEFUL_EXECUTION=1
 # Windows PowerShell
 $env:GGML_OPENVINO_DEVICE = "GPU"
+$env:GGML_OPENVINO_STATEFUL_EXECUTION = "1"

 # To run llama-simple
 build\ReleaseOV\bin\llama-simple.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -n 50 "The story of AI is "
 # To run in chat mode:
-build\ReleaseOV\bin\llama-cli.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf"
+build\ReleaseOV\bin\llama-cli.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -c 1024
+# To run llama-bench, -fa 1 is needed
+build\ReleaseOV\bin\llama-bench.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -fa 1

+# NPU: keep context small to avoid failures from very large model context windows.
+# Windows Command Line
+set GGML_OPENVINO_DEVICE=NPU
+# Windows PowerShell
+$env:GGML_OPENVINO_DEVICE = "NPU"
+build\ReleaseOV\bin\llama-cli.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -c 512
 ```
 > [!NOTE]
 > On systems with multiple GPUs, use `GPU.0` or `GPU.1` to explicitly target specific GPU. See [OpenVINO GPU Device](https://docs.openvino.ai/2026/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) for more details.

+### Known Issues and Current Workarounds
+
+- GPU stateless execution is currently affected by a known issue.
+  - Workaround: set `GGML_OPENVINO_STATEFUL_EXECUTION=1` when using GPU device.
+- NPU failures can happen when context size is too large. Recent llama.cpp behavior may resolve context size to the model training context (for example, 131072 for Llama 3.2 1B), which is too large for current NPU usage and can also stress laptop CPU/GPU on larger models. To inspect the selected context size, run `llama-cli` or `llama-server` with `-lv 3`.
+  - Workaround: explicitly set the context size for NPU runs, for example `-c 1024`. A smaller context size generally gives better performance.
+- Additional NPU limitations:
+  - Model caching is not yet supported.
+  - `llama-server -np > 1` (multiple parallel sequences) is not supported.
+  - `llama-perplexity` is only supported with `-b 512` or smaller.
+- `--context-shift` with `llama-cli` is currently not supported with OpenVINO backend across CPU, GPU, and NPU devices.
+- Encoder models (embedding, reranking) are not supported with the current OpenVINO backend implementation.
+- `-fa 1` is required when running llama-bench with the OpenVINO backend.
+  - `GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./llama-bench -fa 1`
+- `llama-server` with the OpenVINO backend supports only one chat session/thread when `GGML_OPENVINO_STATEFUL_EXECUTION=1` is enabled.
+- For Intel GPU and NPU detection in containers, the GPU and NPU user-space drivers/libraries must be present inside the image. Support for this will be included in a future PR. Until then, you can use this reference Dockerfile: [openvino.Dockerfile](https://github.com/ravi9/llama.cpp/blob/ov-docker-update/.devops/openvino.Dockerfile)
+
+> [!NOTE]
+> The OpenVINO backend is actively under development. Fixes are underway, and this document will continue to be updated as issues are resolved.
+

 ### Docker Build

@@ -229,31 +274,42 @@ docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_p
 Run llama.cpp with OpenVINO backend Docker container.
 Save sample models in `~/models` as [shown above](#3-download-sample-model). It will be mounted to the container in the examples below.

+> [!NOTE]
+> Intel GPU and NPU detection in containers will be included in a future PR. Until then, you can use this reference Dockerfile: [openvino.Dockerfile](https://github.com/ravi9/llama.cpp/blob/ov-docker-update/.devops/openvino.Dockerfile).
+
 ```bash
 #  Run Docker container
-docker run --rm -it -v ~/models:/models llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
+docker run --rm -it -v ~/models:/models llama-openvino:light --no-warmup -c 1024 -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf

 # With Intel GPU access (iGPU or dGPU)
 docker run --rm -it -v ~/models:/models \
 --device=/dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
-llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
+--env=GGML_OPENVINO_DEVICE=GPU --env=GGML_OPENVINO_STATEFUL_EXECUTION=1 \
+llama-openvino:light --no-warmup -c 1024 -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf

 # With Intel NPU access
-docker run --rm -it --env GGML_OPENVINO_DEVICE=NPU -v ~/models:/models \
+docker run --rm -it -v ~/models:/models \
 --device=/dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
-llama-openvino:light --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
+--env=GGML_OPENVINO_DEVICE=NPU \
+llama-openvino:light --no-warmup -c 1024 -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
 ```

-Run Llama.cpp Server with OpenVINO Backend:
+Run Llama.cpp Server with OpenVINO Backend.
+> [!NOTE]
+> `llama-server` with the OpenVINO backend supports only one chat session/thread when `GGML_OPENVINO_STATEFUL_EXECUTION=1` is enabled.
+
 ```bash
 # Run the Server Docker container
-docker run --rm -it -p 8080:8080 -v ~/models:/models llama-openvino:server --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
-
-# In a NEW terminal, test the server with curl
+docker run --rm -it -p 8080:8080 -v ~/models:/models llama-openvino:server --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf -c 1024
+# Or use the llama-server executable directly
+./build/ReleaseOV/bin/llama-server -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf --port 8080 -c 1024

 # If you are behind a proxy, make sure to set NO_PROXY to avoid proxy for localhost
 export NO_PROXY=localhost,127.0.0.1

+# Option 1: Open your browser to http://localhost:8080 to access the web UI for the llama.cpp server.
+# Option 2: In a NEW terminal, test the server with curl
+
 # Test health endpoint
 curl -f http://localhost:8080/health

@@ -295,6 +351,7 @@ The OpenVINO backend can be configured using the following environment variables
 export GGML_OPENVINO_CACHE_DIR=/tmp/ov_cache
 export GGML_OPENVINO_PROFILING=1
 export GGML_OPENVINO_DEVICE=GPU
+export GGML_OPENVINO_STATEFUL_EXECUTION=1

 ./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -n 50 "The story of AI is "

@@ -302,38 +359,27 @@ export GGML_OPENVINO_DEVICE=GPU
 set GGML_OPENVINO_CACHE_DIR=C:\tmp\ov_cache
 set GGML_OPENVINO_PROFILING=1
 set GGML_OPENVINO_DEVICE=GPU
+set GGML_OPENVINO_STATEFUL_EXECUTION=1

 # Windows PowerShell
 $env:GGML_OPENVINO_CACHE_DIR = "C:\tmp\ov_cache"
 $env:GGML_OPENVINO_PROFILING = "1"
 $env:GGML_OPENVINO_DEVICE = "GPU"
+$env:GGML_OPENVINO_STATEFUL_EXECUTION = "1"

 build\ReleaseOV\bin\llama-simple.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -n 50 "The story of AI is "

 ```

-#### llama-bench
-
-```bash
-# -fa 1 is required when running llama-bench with the OpenVINO backend.
-GGML_OPENVINO_DEVICE=GPU ./llama-bench -fa 1
-```
-
-### NPU Notes
-
- Model caching is not yet supported
- Does not support llama-server -np > 1 (multiple parallel sequences)
- Only supports llama-perplexity -b 512 or smaller
-
 ## Llama.cpp Tools

 The following tools work with the OpenVINO backend on CPU, GPU, NPU:
- llama-simple
- llama-run
- llama-cli
- llama-server
 - llama-bench
+- llama-cli
+- llama-completion
 - llama-perplexity
+- llama-server
+- llama-simple

 ## Work in Progress

--- a/examples/llama.android/lib/src/main/cpp/ai_chat.cpp
+++ b/examples/llama.android/lib/src/main/cpp/ai_chat.cpp
@@ -365,13 +365,13 @@ Java_com_arm_aichat_internal_InferenceEngineImpl_processSystemPrompt(
    const auto *system_prompt = env->GetStringUTFChars(jsystem_prompt, nullptr);
    LOGd("%s: System prompt received: \n%s", __func__, system_prompt);
    std::string formatted_system_prompt(system_prompt);
-    env->ReleaseStringUTFChars(jsystem_prompt, system_prompt);

    // Format system prompt if applicable
    const bool has_chat_template = common_chat_templates_was_explicit(g_chat_templates.get());
    if (has_chat_template) {
        formatted_system_prompt = chat_add_and_format(ROLE_SYSTEM, system_prompt);
    }
+    env->ReleaseStringUTFChars(jsystem_prompt, system_prompt);

    // Tokenize system prompt
    const auto system_tokens = common_tokenize(g_context, formatted_system_prompt,
@@ -414,13 +414,13 @@ Java_com_arm_aichat_internal_InferenceEngineImpl_processUserPrompt(
    const auto *const user_prompt = env->GetStringUTFChars(juser_prompt, nullptr);
    LOGd("%s: User prompt received: \n%s", __func__, user_prompt);
    std::string formatted_user_prompt(user_prompt);
-    env->ReleaseStringUTFChars(juser_prompt, user_prompt);

    // Format user prompt if applicable
    const bool has_chat_template = common_chat_templates_was_explicit(g_chat_templates.get());
    if (has_chat_template) {
        formatted_user_prompt = chat_add_and_format(ROLE_USER, user_prompt);
    }
+    env->ReleaseStringUTFChars(juser_prompt, user_prompt);

    // Decode formatted user prompts
    auto user_tokens = common_tokenize(g_context, formatted_user_prompt, has_chat_template, has_chat_template);
--- a/ggml/include/gguf.h
+++ b/ggml/include/gguf.h
@@ -77,6 +77,7 @@ extern "C" {
    };

    GGML_API struct gguf_context * gguf_init_empty(void);
+    GGML_API struct gguf_context * gguf_init_from_file_ptr(FILE * file, struct gguf_init_params params);
    GGML_API struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_params params);
    //GGML_API struct gguf_context * gguf_init_from_buffer(..);

@@ -189,6 +190,7 @@ extern "C" {
    //

    // write the entire context to a binary file
+    GGML_API bool gguf_write_to_file_ptr(const struct gguf_context * ctx, FILE * file, bool only_meta);
    GGML_API bool gguf_write_to_file(const struct gguf_context * ctx, const char * fname, bool only_meta);

    // get the size in bytes of the meta data (header, kv pairs, tensor info) including padding
--- a/ggml/src/ggml-impl.h
+++ b/ggml/src/ggml-impl.h
@@ -773,6 +773,5 @@ inline bool ggml_check_edges(const struct ggml_cgraph *                cgraph,

 // expose GGUF internals for test code
 GGML_API size_t gguf_type_size(enum gguf_type type);
-GGML_API struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_params params);
 GGML_API void gguf_write_to_buf(const struct gguf_context * ctx, std::vector<int8_t> & buf, bool only_meta);
 #endif // __cplusplus
--- a/ggml/src/ggml-metal/ggml-metal-device.cpp
+++ b/ggml/src/ggml-metal/ggml-metal-device.cpp
@@ -246,6 +246,10 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_unary(ggml_metal
                case GGML_UNARY_OP_EXP:         op_num = OP_UNARY_NUM_EXP;         break;
                case GGML_UNARY_OP_SOFTPLUS:    op_num = OP_UNARY_NUM_SOFTPLUS;    break;
                case GGML_UNARY_OP_EXPM1:       op_num = OP_UNARY_NUM_EXPM1;       break;
+                case GGML_UNARY_OP_FLOOR:       op_num = OP_UNARY_NUM_FLOOR;       break;
+                case GGML_UNARY_OP_CEIL:        op_num = OP_UNARY_NUM_CEIL;        break;
+                case GGML_UNARY_OP_ROUND:       op_num = OP_UNARY_NUM_ROUND;       break;
+                case GGML_UNARY_OP_TRUNC:       op_num = OP_UNARY_NUM_TRUNC;       break;
                default: GGML_ABORT("fatal error");
            } break;
        default: GGML_ABORT("fatal error");
--- a/ggml/src/ggml-metal/ggml-metal-device.m
+++ b/ggml/src/ggml-metal/ggml-metal-device.m
@@ -1039,6 +1039,10 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
                case GGML_UNARY_OP_EXP:
                case GGML_UNARY_OP_SOFTPLUS:
                case GGML_UNARY_OP_EXPM1:
+                case GGML_UNARY_OP_FLOOR:
+                case GGML_UNARY_OP_CEIL:
+                case GGML_UNARY_OP_ROUND:
+                case GGML_UNARY_OP_TRUNC:
                    return ggml_is_contiguous_rows(op->src[0]) && (op->src[0]->type == GGML_TYPE_F32 || op->src[0]->type == GGML_TYPE_F16);
                default:
                    return false;
@@ -1148,6 +1152,7 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
                op->src[0]->ne[0] != 192 &&
                op->src[0]->ne[0] != 256 &&
                op->src[0]->ne[0] != 320 &&
+                op->src[0]->ne[0] != 512 &&
                op->src[0]->ne[0] != 576) {
                return false;
            }
--- a/ggml/src/ggml-metal/ggml-metal-impl.h
+++ b/ggml/src/ggml-metal/ggml-metal-impl.h
@@ -120,6 +120,10 @@
 #define OP_UNARY_NUM_EXP         114
 #define OP_UNARY_NUM_SOFTPLUS    115
 #define OP_UNARY_NUM_EXPM1       116
+#define OP_UNARY_NUM_FLOOR       117
+#define OP_UNARY_NUM_CEIL        118
+#define OP_UNARY_NUM_ROUND       119
+#define OP_UNARY_NUM_TRUNC       120

 #define OP_SUM_ROWS_NUM_SUM_ROWS 10
 #define OP_SUM_ROWS_NUM_MEAN     11
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -1094,6 +1094,22 @@ kernel void kernel_unary_impl(
            // TODO: precise implementation
            dst_ptr[i0] = (T) (exp(x) - 1);
        }
+
+        if (FC_OP == OP_UNARY_NUM_FLOOR) {
+            dst_ptr[i0] = (T) floor(x);
+        }
+
+        if (FC_OP == OP_UNARY_NUM_CEIL) {
+            dst_ptr[i0] = (T) ceil(x);
+        }
+
+        if (FC_OP == OP_UNARY_NUM_ROUND) {
+            dst_ptr[i0] = (T) round(x);
+        }
+
+        if (FC_OP == OP_UNARY_NUM_TRUNC) {
+            dst_ptr[i0] = (T) trunc(x);
+        }
    }

 #undef FC_OP
@@ -6269,6 +6285,7 @@ template [[host_name("kernel_flash_attn_ext_f32_dk192_dv192")]]  kernel flash_at
 template [[host_name("kernel_flash_attn_ext_f32_dk192_dv128")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_F32, float4x4,   1, dequantize_f32,  float4x4,   1, dequantize_f32,  192, 128>;
 template [[host_name("kernel_flash_attn_ext_f32_dk256_dv256")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_F32, float4x4,   1, dequantize_f32,  float4x4,   1, dequantize_f32,  256, 256>;
 template [[host_name("kernel_flash_attn_ext_f32_dk320_dv256")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_F32, float4x4,   1, dequantize_f32,  float4x4,   1, dequantize_f32,  320, 256>;
+template [[host_name("kernel_flash_attn_ext_f32_dk512_dv512")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_F32, float4x4,   1, dequantize_f32,  float4x4,   1, dequantize_f32,  512, 512>;
 template [[host_name("kernel_flash_attn_ext_f32_dk576_dv512")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_F32, float4x4,   1, dequantize_f32,  float4x4,   1, dequantize_f32,  576, 512>;

 template [[host_name("kernel_flash_attn_ext_f16_dk32_dv32"  )]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    half4x4,    1, dequantize_f16,  half4x4,    1, dequantize_f16,  32,  32>;
@@ -6284,6 +6301,7 @@ template [[host_name("kernel_flash_attn_ext_f16_dk192_dv192")]]  kernel flash_at
 template [[host_name("kernel_flash_attn_ext_f16_dk192_dv128")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    half4x4,    1, dequantize_f16,  half4x4,    1, dequantize_f16,  192, 128>;
 template [[host_name("kernel_flash_attn_ext_f16_dk256_dv256")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    half4x4,    1, dequantize_f16,  half4x4,    1, dequantize_f16,  256, 256>;
 template [[host_name("kernel_flash_attn_ext_f16_dk320_dv256")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    half4x4,    1, dequantize_f16,  half4x4,    1, dequantize_f16,  320, 256>;
+template [[host_name("kernel_flash_attn_ext_f16_dk512_dv512")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    half4x4,    1, dequantize_f16,  half4x4,    1, dequantize_f16,  512, 512>;
 template [[host_name("kernel_flash_attn_ext_f16_dk576_dv512")]]  kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    half4x4,    1, dequantize_f16,  half4x4,    1, dequantize_f16,  576, 512>;

 #if defined(GGML_METAL_HAS_BF16)
@@ -6300,6 +6318,7 @@ template [[host_name("kernel_flash_attn_ext_bf16_dk192_dv192")]] kernel flash_at
 template [[host_name("kernel_flash_attn_ext_bf16_dk192_dv128")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_BF, bfloat4x4,  1, dequantize_bf16, bfloat4x4,  1, dequantize_bf16, 192, 128>;
 template [[host_name("kernel_flash_attn_ext_bf16_dk256_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_BF, bfloat4x4,  1, dequantize_bf16, bfloat4x4,  1, dequantize_bf16, 256, 256>;
 template [[host_name("kernel_flash_attn_ext_bf16_dk320_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_BF, bfloat4x4,  1, dequantize_bf16, bfloat4x4,  1, dequantize_bf16, 320, 256>;
+template [[host_name("kernel_flash_attn_ext_bf16_dk512_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_BF, bfloat4x4,  1, dequantize_bf16, bfloat4x4,  1, dequantize_bf16, 512, 512>;
 template [[host_name("kernel_flash_attn_ext_bf16_dk576_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES_BF, bfloat4x4,  1, dequantize_bf16, bfloat4x4,  1, dequantize_bf16, 576, 512>;
 #endif

@@ -6316,6 +6335,7 @@ template [[host_name("kernel_flash_attn_ext_q4_0_dk192_dv192")]] kernel flash_at
 template [[host_name("kernel_flash_attn_ext_q4_0_dk192_dv128")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_0, 2, dequantize_q4_0, block_q4_0, 2, dequantize_q4_0, 192, 128>;
 template [[host_name("kernel_flash_attn_ext_q4_0_dk256_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_0, 2, dequantize_q4_0, block_q4_0, 2, dequantize_q4_0, 256, 256>;
 template [[host_name("kernel_flash_attn_ext_q4_0_dk320_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_0, 2, dequantize_q4_0, block_q4_0, 2, dequantize_q4_0, 320, 256>;
+template [[host_name("kernel_flash_attn_ext_q4_0_dk512_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_0, 2, dequantize_q4_0, block_q4_0, 2, dequantize_q4_0, 512, 512>;
 template [[host_name("kernel_flash_attn_ext_q4_0_dk576_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_0, 2, dequantize_q4_0, block_q4_0, 2, dequantize_q4_0, 576, 512>;

 template [[host_name("kernel_flash_attn_ext_q4_1_dk32_dv32"  )]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_1, 2, dequantize_q4_1, block_q4_1, 2, dequantize_q4_1, 32,  32>;
@@ -6331,6 +6351,7 @@ template [[host_name("kernel_flash_attn_ext_q4_1_dk192_dv192")]] kernel flash_at
 template [[host_name("kernel_flash_attn_ext_q4_1_dk192_dv128")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_1, 2, dequantize_q4_1, block_q4_1, 2, dequantize_q4_1, 192, 128>;
 template [[host_name("kernel_flash_attn_ext_q4_1_dk256_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_1, 2, dequantize_q4_1, block_q4_1, 2, dequantize_q4_1, 256, 256>;
 template [[host_name("kernel_flash_attn_ext_q4_1_dk320_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_1, 2, dequantize_q4_1, block_q4_1, 2, dequantize_q4_1, 320, 256>;
+template [[host_name("kernel_flash_attn_ext_q4_1_dk512_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_1, 2, dequantize_q4_1, block_q4_1, 2, dequantize_q4_1, 512, 512>;
 template [[host_name("kernel_flash_attn_ext_q4_1_dk576_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q4_1, 2, dequantize_q4_1, block_q4_1, 2, dequantize_q4_1, 576, 512>;

 template [[host_name("kernel_flash_attn_ext_q5_0_dk32_dv32"  )]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_0, 2, dequantize_q5_0, block_q5_0, 2, dequantize_q5_0, 32,  32>;
@@ -6346,6 +6367,7 @@ template [[host_name("kernel_flash_attn_ext_q5_0_dk192_dv192")]] kernel flash_at
 template [[host_name("kernel_flash_attn_ext_q5_0_dk192_dv128")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_0, 2, dequantize_q5_0, block_q5_0, 2, dequantize_q5_0, 192, 128>;
 template [[host_name("kernel_flash_attn_ext_q5_0_dk256_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_0, 2, dequantize_q5_0, block_q5_0, 2, dequantize_q5_0, 256, 256>;
 template [[host_name("kernel_flash_attn_ext_q5_0_dk320_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_0, 2, dequantize_q5_0, block_q5_0, 2, dequantize_q5_0, 320, 256>;
+template [[host_name("kernel_flash_attn_ext_q5_0_dk512_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_0, 2, dequantize_q5_0, block_q5_0, 2, dequantize_q5_0, 512, 512>;
 template [[host_name("kernel_flash_attn_ext_q5_0_dk576_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_0, 2, dequantize_q5_0, block_q5_0, 2, dequantize_q5_0, 576, 512>;

 template [[host_name("kernel_flash_attn_ext_q5_1_dk32_dv32"  )]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_1, 2, dequantize_q5_1, block_q5_1, 2, dequantize_q5_1, 32,  32>;
@@ -6361,6 +6383,7 @@ template [[host_name("kernel_flash_attn_ext_q5_1_dk192_dv192")]] kernel flash_at
 template [[host_name("kernel_flash_attn_ext_q5_1_dk192_dv128")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_1, 2, dequantize_q5_1, block_q5_1, 2, dequantize_q5_1, 192, 128>;
 template [[host_name("kernel_flash_attn_ext_q5_1_dk256_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_1, 2, dequantize_q5_1, block_q5_1, 2, dequantize_q5_1, 256, 256>;
 template [[host_name("kernel_flash_attn_ext_q5_1_dk320_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_1, 2, dequantize_q5_1, block_q5_1, 2, dequantize_q5_1, 320, 256>;
+template [[host_name("kernel_flash_attn_ext_q5_1_dk512_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_1, 2, dequantize_q5_1, block_q5_1, 2, dequantize_q5_1, 512, 512>;
 template [[host_name("kernel_flash_attn_ext_q5_1_dk576_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q5_1, 2, dequantize_q5_1, block_q5_1, 2, dequantize_q5_1, 576, 512>;

 template [[host_name("kernel_flash_attn_ext_q8_0_dk32_dv32"  )]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q8_0, 2, dequantize_q8_0, block_q8_0, 2, dequantize_q8_0, 32,  32>;
@@ -6376,6 +6399,7 @@ template [[host_name("kernel_flash_attn_ext_q8_0_dk192_dv192")]] kernel flash_at
 template [[host_name("kernel_flash_attn_ext_q8_0_dk192_dv128")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q8_0, 2, dequantize_q8_0, block_q8_0, 2, dequantize_q8_0, 192, 128>;
 template [[host_name("kernel_flash_attn_ext_q8_0_dk256_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q8_0, 2, dequantize_q8_0, block_q8_0, 2, dequantize_q8_0, 256, 256>;
 template [[host_name("kernel_flash_attn_ext_q8_0_dk320_dv256")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q8_0, 2, dequantize_q8_0, block_q8_0, 2, dequantize_q8_0, 320, 256>;
+template [[host_name("kernel_flash_attn_ext_q8_0_dk512_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q8_0, 2, dequantize_q8_0, block_q8_0, 2, dequantize_q8_0, 512, 512>;
 template [[host_name("kernel_flash_attn_ext_q8_0_dk576_dv512")]] kernel flash_attn_ext_t kernel_flash_attn_ext<FA_TYPES,    block_q8_0, 2, dequantize_q8_0, block_q8_0, 2, dequantize_q8_0, 576, 512>;

 #undef FA_TYPES
@@ -6957,6 +6981,17 @@ template [[host_name("kernel_flash_attn_ext_vec_q5_0_dk320_dv256")]] kernel flas
 template [[host_name("kernel_flash_attn_ext_vec_q5_1_dk320_dv256")]] kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     block_q5_1, 8, dequantize_q5_1_t4, block_q5_1,  8, dequantize_q5_1_t4, 320, 256, 2>;
 template [[host_name("kernel_flash_attn_ext_vec_q8_0_dk320_dv256")]] kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     block_q8_0, 8, dequantize_q8_0_t4, block_q8_0,  8, dequantize_q8_0_t4, 320, 256, 2>;

+template [[host_name("kernel_flash_attn_ext_vec_f32_dk512_dv512")]]  kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES_F32, float4,     1, dequantize_f32_t4,  float4,      1, dequantize_f32_t4,  512, 512, 1>;
+template [[host_name("kernel_flash_attn_ext_vec_f16_dk512_dv512")]]  kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     half4,      1, dequantize_f16_t4,  half4,       1, dequantize_f16_t4,  512, 512, 1>;
+#if defined(GGML_METAL_HAS_BF16)
+template [[host_name("kernel_flash_attn_ext_vec_bf16_dk512_dv512")]] kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     bfloat4,    1, dequantize_bf16_t4, bfloat4,     1, dequantize_bf16_t4, 512, 512, 1>;
+#endif
+template [[host_name("kernel_flash_attn_ext_vec_q4_0_dk512_dv512")]] kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     block_q4_0, 8, dequantize_q4_0_t4, block_q4_0,  8, dequantize_q4_0_t4, 512, 512, 1>;
+template [[host_name("kernel_flash_attn_ext_vec_q4_1_dk512_dv512")]] kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     block_q4_1, 8, dequantize_q4_1_t4, block_q4_1,  8, dequantize_q4_1_t4, 512, 512, 1>;
+template [[host_name("kernel_flash_attn_ext_vec_q5_0_dk512_dv512")]] kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     block_q5_0, 8, dequantize_q5_0_t4, block_q5_0,  8, dequantize_q5_0_t4, 512, 512, 1>;
+template [[host_name("kernel_flash_attn_ext_vec_q5_1_dk512_dv512")]] kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     block_q5_1, 8, dequantize_q5_1_t4, block_q5_1,  8, dequantize_q5_1_t4, 512, 512, 1>;
+template [[host_name("kernel_flash_attn_ext_vec_q8_0_dk512_dv512")]] kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     block_q8_0, 8, dequantize_q8_0_t4, block_q8_0,  8, dequantize_q8_0_t4, 512, 512, 1>;
+
 template [[host_name("kernel_flash_attn_ext_vec_f32_dk576_dv512")]]  kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES_F32, float4,     1, dequantize_f32_t4,  float4,      1, dequantize_f32_t4,  576, 512, 2>;
 template [[host_name("kernel_flash_attn_ext_vec_f16_dk576_dv512")]]  kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES,     half4,      1, dequantize_f16_t4,  half4,       1, dequantize_f16_t4,  576, 512, 2>;
 #if defined(GGML_METAL_HAS_BF16)
--- a/ggml/src/ggml-sycl/add-id.cpp
+++ b/ggml/src/ggml-sycl/add-id.cpp
@@ -56,7 +56,7 @@ void ggml_sycl_add_id(ggml_backend_sycl_context& ctx, ggml_tensor* dst) {
  float* dst_d = (float*)dst->data;

  const unsigned int max_work_group_size = ggml_sycl_info().max_work_group_sizes[ctx.device];
-  assert(work_group_size % (WARP_SIZE * WARP_SIZE) == 0);
+  GGML_ASSERT(max_work_group_size % (WARP_SIZE * WARP_SIZE) == 0);

  int threads = std::min((unsigned int)ne00, max_work_group_size);  // cols

--- a/ggml/src/gguf.cpp
+++ b/ggml/src/gguf.cpp
@@ -394,7 +394,11 @@ bool gguf_read_emplace_helper(const struct gguf_reader & gr, std::vector<struct
    return true;
 }

-struct gguf_context * gguf_init_from_file_impl(FILE * file, struct gguf_init_params params) {
+struct gguf_context * gguf_init_from_file_ptr(FILE * file, struct gguf_init_params params) {
+    if (!file) {
+        return nullptr;
+    }
+
    const struct gguf_reader gr(file);
    struct gguf_context * ctx = new gguf_context;

@@ -848,7 +852,7 @@ struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_p
        return nullptr;
    }

-    struct gguf_context * result = gguf_init_from_file_impl(file, params);
+    struct gguf_context * result = gguf_init_from_file_ptr(file, params);
    fclose(file);
    return result;
 }
@@ -1508,6 +1512,19 @@ void gguf_write_to_buf(const struct gguf_context * ctx, std::vector<int8_t> & bu
    gguf_write_out(ctx, gw, only_meta);
 }

+bool gguf_write_to_file_ptr(const struct gguf_context * ctx, FILE * file, bool only_meta) {
+    GGML_ASSERT(file);
+
+    try {
+        gguf_writer_file gw(file);
+        gguf_write_out(ctx, gw, only_meta);
+    } catch (const std::runtime_error& ex) {
+        GGML_LOG_ERROR("%s: failed to write GGUF data: %s\n", __func__, ex.what());
+        return false;
+    }
+    return true;
+}
+
 bool gguf_write_to_file(const struct gguf_context * ctx, const char * fname, bool only_meta) {
    FILE * file = ggml_fopen(fname, "wb");

@@ -1516,17 +1533,13 @@ bool gguf_write_to_file(const struct gguf_context * ctx, const char * fname, boo
        return false;
    }

-    try {
-        gguf_writer_file gw(file);
-        gguf_write_out(ctx, gw, only_meta);
-    } catch (const std::runtime_error& ex) {
-        GGML_LOG_ERROR("%s: failed to write GGUF data into '%s': %s\n", __func__, fname, ex.what());
-        fclose(file);
-        return false;
+    const bool success = gguf_write_to_file_ptr(ctx, file, only_meta);
+    if (!success) {
+        GGML_LOG_ERROR("%s: failed to write GGUF data into '%s'\n", __func__, fname);
    }

    fclose(file);
-    return true;
+    return success;
 }

 size_t gguf_get_meta_size(const struct gguf_context * ctx) {
--- a/include/llama.h
+++ b/include/llama.h
@@ -465,6 +465,11 @@ extern "C" {
                             const char * path_model,
              struct llama_model_params   params);

+    // Load a model from an open FILE pointer
+    LLAMA_API struct llama_model * llama_model_load_from_file_ptr(
+                                   FILE * file,
+              struct llama_model_params   params);
+
    // Load a model from multiple splits (support custom naming scheme)
    // The paths must be in the correct order
    LLAMA_API struct llama_model * llama_model_load_from_splits(
--- a/requirements/requirements-pydantic.txt
+++ b/requirements/requirements-pydantic.txt
@@ -1,3 +1,3 @@
 docstring_parser~=0.15
 pydantic~=2.11.7
-requests
+requests~=2.32.3
--- a/scripts/sync_vendor.py
+++ b/scripts/sync_vendor.py
@@ -5,7 +5,7 @@ import os
 import sys
 import subprocess

-HTTPLIB_VERSION = "refs/tags/v0.38.0"
+HTTPLIB_VERSION = "refs/tags/v0.39.0"

 vendor = {
    "https://github.com/nlohmann/json/releases/latest/download/json.hpp":     "vendor/nlohmann/json.hpp",
--- a/src/llama-arch.cpp
+++ b/src/llama-arch.cpp
@@ -544,6 +544,10 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
        case LLM_ARCH_CLIP:
            return {};
        case LLM_ARCH_LLAMA:
+        case LLM_ARCH_REFACT:
+        case LLM_ARCH_MINICPM:
+        case LLM_ARCH_GRANITE:
+        case LLM_ARCH_GRANITE_MOE:
        case LLM_ARCH_DECI:
        case LLM_ARCH_MISTRAL3:
        case LLM_ARCH_LLAMA_EMBED:
@@ -744,11 +748,9 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
                LLM_TENSOR_ATTN_Q_NORM,
                LLM_TENSOR_ATTN_K_NORM,
            };
-        case LLM_ARCH_REFACT:
        case LLM_ARCH_QWEN2:
        case LLM_ARCH_QWEN2VL:
        case LLM_ARCH_INTERNLM2:
-        case LLM_ARCH_GRANITE:
        case LLM_ARCH_ERNIE4_5:
        case LLM_ARCH_PADDLEOCR:
        case LLM_ARCH_SMOLLM3:
@@ -759,6 +761,7 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
                LLM_TENSOR_TOKEN_EMBD,
                LLM_TENSOR_OUTPUT_NORM,
                LLM_TENSOR_OUTPUT,
+                LLM_TENSOR_ROPE_FREQS,
                LLM_TENSOR_ATTN_NORM,
                LLM_TENSOR_ATTN_Q,
                LLM_TENSOR_ATTN_K,
@@ -1232,29 +1235,6 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
                LLM_TENSOR_FFN_DOWN,
                LLM_TENSOR_FFN_UP,
            };
-        case LLM_ARCH_MINICPM:
-            return {
-                LLM_TENSOR_TOKEN_EMBD,
-                LLM_TENSOR_OUTPUT_NORM,
-                LLM_TENSOR_OUTPUT,
-                LLM_TENSOR_ROPE_FREQS,
-                LLM_TENSOR_ROPE_FACTORS_LONG,
-                LLM_TENSOR_ROPE_FACTORS_SHORT,
-                LLM_TENSOR_ATTN_NORM,
-                LLM_TENSOR_ATTN_Q,
-                LLM_TENSOR_ATTN_K,
-                LLM_TENSOR_ATTN_V,
-                LLM_TENSOR_ATTN_OUT,
-                LLM_TENSOR_ATTN_ROT_EMBD,
-                LLM_TENSOR_FFN_GATE_INP,
-                LLM_TENSOR_FFN_NORM,
-                LLM_TENSOR_FFN_GATE,
-                LLM_TENSOR_FFN_DOWN,
-                LLM_TENSOR_FFN_UP,
-                LLM_TENSOR_FFN_GATE_EXP,
-                LLM_TENSOR_FFN_DOWN_EXP,
-                LLM_TENSOR_FFN_UP_EXP,
-            };
        case LLM_ARCH_MINICPM3:
            return {
                LLM_TENSOR_TOKEN_EMBD,
@@ -1442,6 +1422,7 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
                LLM_TENSOR_TOKEN_EMBD,
                LLM_TENSOR_OUTPUT,
                LLM_TENSOR_OUTPUT_NORM,
+                LLM_TENSOR_ROPE_FREQS,
                LLM_TENSOR_ATTN_NORM,
                LLM_TENSOR_ATTN_Q,
                LLM_TENSOR_ATTN_K,
@@ -1657,7 +1638,9 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
                LLM_TENSOR_ROPE_FREQS,
                LLM_TENSOR_OUTPUT_NORM,
                LLM_TENSOR_OUTPUT,
+                LLM_TENSOR_TOKEN_EMBD,
                LLM_TENSOR_ATTN_NORM,
+                LLM_TENSOR_ATTN_QKV,
                LLM_TENSOR_ATTN_Q,
                LLM_TENSOR_ATTN_K,
                LLM_TENSOR_ATTN_V,
@@ -2061,30 +2044,12 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
                LLM_TENSOR_FFN_DOWN,
                LLM_TENSOR_FFN_UP,
            };
-        case LLM_ARCH_GRANITE_MOE:
-            return {
-                LLM_TENSOR_TOKEN_EMBD,
-                LLM_TENSOR_OUTPUT_NORM,
-                LLM_TENSOR_OUTPUT,
-                LLM_TENSOR_ATTN_NORM,
-                LLM_TENSOR_ATTN_Q,
-                LLM_TENSOR_ATTN_K,
-                LLM_TENSOR_ATTN_V,
-                LLM_TENSOR_ATTN_OUT,
-                LLM_TENSOR_FFN_NORM,
-                LLM_TENSOR_FFN_GATE_INP,
-                LLM_TENSOR_FFN_GATE_EXPS,
-                LLM_TENSOR_FFN_DOWN_EXPS,
-                LLM_TENSOR_FFN_UP_EXPS,
-                LLM_TENSOR_FFN_GATE_SHEXP,
-                LLM_TENSOR_FFN_DOWN_SHEXP,
-                LLM_TENSOR_FFN_UP_SHEXP,
-            };
        case LLM_ARCH_GRANITE_HYBRID:
            return {
                LLM_TENSOR_TOKEN_EMBD,
                LLM_TENSOR_OUTPUT_NORM,
                LLM_TENSOR_OUTPUT,
+                LLM_TENSOR_ROPE_FREQS,
                LLM_TENSOR_ATTN_NORM,
                LLM_TENSOR_SSM_IN,
                LLM_TENSOR_SSM_CONV1D,
@@ -2412,6 +2377,7 @@ static std::set<llm_tensor> llm_get_tensor_names(llm_arch arch) {
                LLM_TENSOR_TOKEN_EMBD,
                LLM_TENSOR_OUTPUT_NORM,
                LLM_TENSOR_OUTPUT,
+                LLM_TENSOR_ROPE_FREQS,
                LLM_TENSOR_ATTN_NORM,
                LLM_TENSOR_ATTN_QKV,
                LLM_TENSOR_ATTN_OUT,
@@ -2564,7 +2530,7 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
    {LLM_TENSOR_TOKEN_EMBD,                 {LLM_TENSOR_LAYER_INPUT, GGML_OP_GET_ROWS}},
    {LLM_TENSOR_POS_EMBD,                   {LLM_TENSOR_LAYER_INPUT, GGML_OP_GET_ROWS}},
    {LLM_TENSOR_TOKEN_TYPES,                {LLM_TENSOR_LAYER_INPUT, GGML_OP_GET_ROWS}},
-    {LLM_TENSOR_TOKEN_EMBD_NORM,            {LLM_TENSOR_LAYER_INPUT, GGML_OP_MUL}},
+    {LLM_TENSOR_TOKEN_EMBD_NORM,            {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},  // do the norms on the first layer (not the input layer)
    {LLM_TENSOR_OUTPUT,                     {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}},
    {LLM_TENSOR_CLS,                        {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}},
    {LLM_TENSOR_CLS_OUT,                    {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}},
@@ -2725,7 +2691,7 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
    {LLM_TENSOR_LAUREL_POST_NORM,           {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
    // this tensor is loaded for T5, but never used
    {LLM_TENSOR_DEC_CROSS_ATTN_REL_B,       {LLM_TENSOR_LAYER_REPEATING, GGML_OP_NONE}},
-    {LLM_TENSOR_CONV1D,                     {LLM_TENSOR_LAYER_INPUT,     GGML_OP_IM2COL}},
+    {LLM_TENSOR_CONV1D,                     {LLM_TENSOR_LAYER_REPEATING, GGML_OP_IM2COL}},
    {LLM_TENSOR_POS_NET_NORM,               {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
    {LLM_TENSOR_POS_NET_NORM1,              {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
    {LLM_TENSOR_POS_NET_NORM2,              {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL}},
@@ -2789,7 +2755,12 @@ std::string LLM_TN_IMPL::str() const {
    }

    if (model_tensors.find(tensor) == model_tensors.end()) {
-        return LLM_TENSOR_NAMES.at(tensor);
+        const char * name = LLM_TENSOR_NAMES.at(tensor);
+        if (suffix != nullptr || bid != -1 || xid != -1) {
+            LLAMA_LOG_WARN("%s: cannot properly format tensor name %s with suffix=%s bid=%d xid=%d\n",
+                __func__, name, suffix, bid, xid);
+        }
+        return name;
    }

    std::string name = ::format(LLM_TENSOR_NAMES.at(tensor), bid, xid);
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -342,14 +342,6 @@ llama_context::llama_context(

        if (cparams.pipeline_parallel) {
            LLAMA_LOG_INFO("%s: pipeline parallelism enabled\n", __func__);
-
-            if (!graph_reuse_disable) {
-                // TODO: figure out a way to make graph reuse work with pipeline parallelism
-                // ref: https://github.com/ggml-org/llama.cpp/pull/20463
-                LLAMA_LOG_WARN("%s: graph reuse is currently not compatible with pipeline parallelism - disabling\n", __func__);
-
-                graph_reuse_disable = true;
-            }
        }

        sched_reserve();
@@ -1189,6 +1181,13 @@ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, ll
    if (!graph_reuse_disable && res->can_reuse(gparams)) {
        //LLAMA_LOG_DEBUG("%s: reusing previous graph\n", __func__);

+        // with pipeline parallelism, the previous graph_compute_async may still be running
+        // on the GPU. we must synchronize before set_inputs to avoid overwriting input tensors
+        // that the previous compute is still reading.
+        if (cparams.pipeline_parallel) {
+            ggml_backend_sched_synchronize(sched.get());
+        }
+
        n_reused++;
    } else {
        res->reset();
--- a/src/llama-mmap.cpp
+++ b/src/llama-mmap.cpp
@@ -86,6 +86,14 @@ struct llama_file::impl {
        seek(0, SEEK_SET);
    }

+    impl(FILE * file) : owns_fp(false) {
+        fp = file;
+        fp_win32 = (HANDLE) _get_osfhandle(_fileno(fp));
+        seek(0, SEEK_END);
+        size = tell();
+        seek(0, SEEK_SET);
+    }
+
    size_t tell() const {
        LARGE_INTEGER li;
        li.QuadPart = 0;
@@ -159,7 +167,7 @@ struct llama_file::impl {
    }

    ~impl() {
-        if (fp) {
+        if (fp && owns_fp) {
            std::fclose(fp);
        }
    }
@@ -209,6 +217,13 @@ struct llama_file::impl {
        seek(0, SEEK_SET);
    }

+    impl(FILE * file) : fname("(file*)"), owns_fp(false) {
+        fp = file;
+        seek(0, SEEK_END);
+        size = tell();
+        seek(0, SEEK_SET);
+    }
+
    size_t tell() const {
        if (fd == -1) {
            long ret = std::ftell(fp);
@@ -353,7 +368,7 @@ struct llama_file::impl {
    ~impl() {
        if (fd != -1) {
            close(fd);
-        } else {
+        } else if (owns_fp) {
            std::fclose(fp);
        }
    }
@@ -369,10 +384,14 @@ struct llama_file::impl {

    FILE * fp{};
    size_t size{};
+    bool owns_fp = true;
 };

 llama_file::llama_file(const char * fname, const char * mode, const bool use_direct_io) :
    pimpl(std::make_unique<impl>(fname, mode, use_direct_io)) {}
+
+llama_file::llama_file(FILE * file) : pimpl(std::make_unique<impl>(file)) {}
+
 llama_file::~llama_file() = default;

 size_t llama_file::tell() const { return pimpl->tell(); }
--- a/src/llama-mmap.h
+++ b/src/llama-mmap.h
@@ -15,6 +15,7 @@ using llama_mlocks = std::vector<std::unique_ptr<llama_mlock>>;

 struct llama_file {
    llama_file(const char * fname, const char * mode, bool use_direct_io = false);
+    llama_file(FILE * file);
    ~llama_file();

    size_t tell() const;
--- a/src/llama-model-loader.cpp
+++ b/src/llama-model-loader.cpp
@@ -511,6 +511,7 @@ llama_model_loader::llama_model_loader(
        void * set_tensor_data_ud,
        const std::string & fname,
        std::vector<std::string> & splits,
+        FILE * file,
        bool use_mmap,
        bool use_direct_io,
        bool check_tensors,
@@ -658,6 +659,36 @@ llama_model_loader::llama_model_loader(

            LLAMA_LOG_INFO("%s: additional %d GGUFs metadata loaded.\n",  __func__, n_split - 1);
        }
+    } else if (file != nullptr) {
+        struct ggml_context * ctx = NULL;
+        struct gguf_init_params params = {
+            /*.no_alloc = */ true,
+            /*.ctx      = */ &ctx,
+        };
+
+        metadata_ptr.reset(gguf_init_from_file_ptr(file, params));
+        metadata = metadata_ptr.get();
+        if (metadata == nullptr) {
+            throw std::runtime_error(format("%s: failed to load model from file pointer", __func__));
+        }
+
+        get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
+        llm_kv = LLM_KV(llm_arch_from_string(arch_name));
+
+        files.emplace_back(new llama_file(file));
+        contexts.emplace_back(ctx);
+
+        // Save tensors data offset info of the main file.
+        for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
+            std::string tensor_name = std::string(cur->name);
+            // make sure there is no duplicated tensor names
+            if (weights_map.find(tensor_name) != weights_map.end()) {
+                throw std::runtime_error(format("invalid model: tensor '%s' is duplicated", ggml_get_name(cur)));
+            }
+            n_elements += ggml_nelements(cur);
+            n_bytes    += ggml_nbytes(cur);
+            weights_map.emplace(tensor_name, llama_tensor_weight(files.back().get(), 0, metadata, cur));
+        }
    } else {
        get_key(llm_kv(LLM_KV_GENERAL_ARCHITECTURE), arch_name, false);
        llm_kv = LLM_KV(llm_arch_from_string(arch_name));
@@ -669,7 +700,7 @@ llama_model_loader::llama_model_loader(
    fver = (enum llama_fver) gguf_get_version(metadata);

    LLAMA_LOG_INFO("%s: loaded meta data with %d key-value pairs and %d tensors from %s (version %s)\n",
-            __func__, n_kv, n_tensors, fname.c_str(), llama_file_version_name(fver));
+            __func__, n_kv, n_tensors, fname.empty() ? "(file*)" : fname.c_str(), llama_file_version_name(fver));

    // determine file type based on the number of tensors for each quantization and print meta data
    // TODO: make optional
--- a/src/llama-model-loader.h
+++ b/src/llama-model-loader.h
@@ -125,6 +125,7 @@ struct llama_model_loader {
        void * set_tensor_data_ud,
        const std::string & fname,
        std::vector<std::string> & splits, // optional, only need if the split does not follow naming scheme
+        FILE * file,
        bool use_mmap,
        bool use_direct_io,
        bool check_tensors,
--- a/src/llama-model-saver.cpp
+++ b/src/llama-model-saver.cpp
@@ -1,7 +1,9 @@
 #include "llama-model-saver.h"

+#include "ggml.h"
 #include "gguf.h"

+#include "llama-arch.h"
 #include "llama.h"
 #include "llama-hparams.h"
 #include "llama-model.h"
@@ -10,8 +12,33 @@
 #include <cstdint>
 #include <string>

+bool llama_model_saver_supports_arch(llm_arch arch) {
+    switch (arch) {
+        case LLM_ARCH_QWEN3NEXT:
+        case LLM_ARCH_QWEN35:
+        case LLM_ARCH_QWEN35MOE:
+        case LLM_ARCH_PLAMO3:
+        case LLM_ARCH_GEMMA3:
+        case LLM_ARCH_GEMMA3N:
+        case LLM_ARCH_COHERE2:
+        case LLM_ARCH_OLMO2:
+        case LLM_ARCH_BITNET:
+        case LLM_ARCH_T5:
+        case LLM_ARCH_EXAONE_MOE:
+        case LLM_ARCH_AFMOE:
+        case LLM_ARCH_APERTUS:
+        case LLM_ARCH_MIMO2:
+        case LLM_ARCH_STEP35:
+            return false;
+        default:
+            return true;
+    }
+}
+
 llama_model_saver::llama_model_saver(const struct llama_model * model) :
-    gguf_ctx(gguf_init_empty()), gguf_ctx_owned(true), model(model), llm_kv(model->arch) {}
+        gguf_ctx(gguf_init_empty()), gguf_ctx_owned(true), model(model), llm_kv(model->arch) {
+    GGML_ASSERT(llama_model_saver_supports_arch(model->arch));
+}

 llama_model_saver::llama_model_saver(enum llm_arch arch, struct gguf_context * gguf_ctx) :
        gguf_ctx(gguf_ctx == nullptr ? gguf_init_empty() : gguf_ctx), gguf_ctx_owned(gguf_ctx == nullptr), model(nullptr), llm_kv(arch) {}
@@ -105,7 +132,10 @@ void llama_model_saver::add_tensor(const struct ggml_tensor * tensor) {
        return;
    }
    if (gguf_find_tensor(gguf_ctx, tensor->name) >= 0) {
-        GGML_ASSERT(std::string(tensor->name) == "rope_freqs.weight"); // FIXME
+        const std::string tensor_name = tensor->name;
+        GGML_ASSERT(
+            tensor_name == "rope_freqs.weight" || tensor_name == "rope_factors_long.weight" ||
+            tensor_name == "rope_factors_short.weight"); // FIXME
        return;
    }
    gguf_add_tensor(gguf_ctx, tensor);
@@ -127,6 +157,7 @@ void llama_model_saver::add_kv_from_model() {
            tokens[id] = token_data.text;
            scores[id] = token_data.score;

+            // FIXME should this be treated as flags?
            switch(token_data.attr) {
                case LLAMA_TOKEN_ATTR_UNKNOWN:      token_types[id] = LLAMA_TOKEN_TYPE_UNKNOWN;      break;
                case LLAMA_TOKEN_ATTR_UNUSED:       token_types[id] = LLAMA_TOKEN_TYPE_UNUSED;       break;
@@ -134,6 +165,9 @@ void llama_model_saver::add_kv_from_model() {
                case LLAMA_TOKEN_ATTR_CONTROL:      token_types[id] = LLAMA_TOKEN_TYPE_CONTROL;      break;
                case LLAMA_TOKEN_ATTR_USER_DEFINED: token_types[id] = LLAMA_TOKEN_TYPE_USER_DEFINED; break;
                case LLAMA_TOKEN_ATTR_BYTE:         token_types[id] = LLAMA_TOKEN_TYPE_BYTE;         break;
+                // case LLAMA_TOKEN_ATTR_NORMALIZED:   ???
+                // case LLAMA_TOKEN_ATTR_LSTRIP:       ???
+                // case LLAMA_TOKEN_ATTR_RSTRIP:       ???
                case LLAMA_TOKEN_ATTR_UNDEFINED:
                default:                            token_types[id] = LLAMA_TOKEN_TYPE_UNDEFINED;    break;
            }
@@ -144,6 +178,19 @@ void llama_model_saver::add_kv_from_model() {
    add_kv(LLM_KV_GENERAL_ARCHITECTURE,              model->arch_name());
    // add_kv(LLM_KV_GENERAL_QUANTIZATION_VERSION,      ???);
    // add_kv(LLM_KV_GENERAL_ALIGNMENT,                 ???);
+    // add_kv(LLM_KV_GENERAL_FILE_TYPE,                 ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_SEQUENCE,         ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_TOP_K,            ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_TOP_P,            ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_MIN_P,            ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_XTC_PROBABILITY,  ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_XTC_THRESHOLD,    ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_TEMP,             ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_PENALTY_LAST_N,   ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_PENALTY_REPEAT,   ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_MIROSTAT,         ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_MIROSTAT_TAU,     ???);
+    // add_kv(LLM_KV_GENERAL_SAMPLING_MIROSTAT_ETA,     ???);
    add_kv(LLM_KV_GENERAL_NAME,                      model->name);
    // add_kv(LLM_KV_GENERAL_AUTHOR,                    ???);
    // add_kv(LLM_KV_GENERAL_VERSION,                   ???);
@@ -163,17 +210,31 @@ void llama_model_saver::add_kv_from_model() {
    add_kv(LLM_KV_LEADING_DENSE_BLOCK_COUNT,         hparams.n_layer_dense_lead);
    add_kv(LLM_KV_FEED_FORWARD_LENGTH,               hparams.n_ff_arr, true);
    add_kv(LLM_KV_EXPERT_FEED_FORWARD_LENGTH,        hparams.n_ff_exp);
-    add_kv(LLM_KV_EXPERT_SHARED_FEED_FORWARD_LENGTH, hparams.n_ff_exp);
+    add_kv(LLM_KV_EXPERT_SHARED_FEED_FORWARD_LENGTH, hparams.n_ff_shexp);
+    add_kv(LLM_KV_EXPERT_CHUNK_FEED_FORWARD_LENGTH,  hparams.n_ff_chexp);
+    add_kv(LLM_KV_SWIGLU_CLAMP_EXP,                  hparams.swiglu_clamp_exp);
+    add_kv(LLM_KV_SWIGLU_CLAMP_SHEXP,                hparams.swiglu_clamp_shexp);
    add_kv(LLM_KV_USE_PARALLEL_RESIDUAL,             hparams.use_par_res);
    // add_kv(LLM_KV_TENSOR_DATA_LAYOUT,                ???);
    add_kv(LLM_KV_EXPERT_COUNT,                      hparams.n_expert);
    add_kv(LLM_KV_EXPERT_USED_COUNT,                 hparams.n_expert_used);
    add_kv(LLM_KV_EXPERT_SHARED_COUNT,               hparams.n_expert_shared);
+    add_kv(LLM_KV_EXPERT_GROUP_COUNT,                hparams.n_expert_groups);
+    add_kv(LLM_KV_EXPERT_GROUP_USED_COUNT,           hparams.n_group_used);
    add_kv(LLM_KV_EXPERT_WEIGHTS_SCALE,              hparams.expert_weights_scale);
+    add_kv(LLM_KV_EXPERT_WEIGHTS_NORM,               hparams.expert_weights_norm);
+    add_kv(LLM_KV_EXPERT_GATING_FUNC,                hparams.expert_gating_func);
+    add_kv(LLM_KV_EXPERT_GROUP_SCALE,                hparams.expert_group_scale);
+    add_kv(LLM_KV_EXPERTS_PER_GROUP,                 hparams.n_group_experts);
+    add_kv(LLM_KV_MOE_EVERY_N_LAYERS,                hparams.moe_every_n_layers);
+    add_kv(LLM_KV_NEXTN_PREDICT_LAYERS,              hparams.nextn_predict_layers);
+    add_kv(LLM_KV_NUM_DEEPSTACK_LAYERS,              hparams.n_deepstack_layers);
    add_kv(LLM_KV_POOLING_TYPE,                      uint32_t(hparams.pooling_type));
    add_kv(LLM_KV_LOGIT_SCALE,                       hparams.f_logit_scale);
    add_kv(LLM_KV_DECODER_START_TOKEN_ID,            hparams.dec_start_token_id);
+    add_kv(LLM_KV_DECODER_BLOCK_COUNT,               hparams.dec_n_layer);
    add_kv(LLM_KV_ATTN_LOGIT_SOFTCAPPING,            hparams.f_attn_logit_softcapping);
+    add_kv(LLM_KV_ROUTER_LOGIT_SOFTCAPPING,          hparams.f_router_logit_softcapping);
    add_kv(LLM_KV_FINAL_LOGIT_SOFTCAPPING,           hparams.f_final_logit_softcapping);
    add_kv(LLM_KV_SWIN_NORM,                         hparams.swin_norm);
    add_kv(LLM_KV_RESCALE_EVERY_N_LAYERS,            hparams.rescale_every_n_layers);
@@ -181,6 +242,9 @@ void llama_model_saver::add_kv_from_model() {
    add_kv(LLM_KV_TIME_DECAY_EXTRA_DIM,              hparams.time_decay_extra_dim);
    add_kv(LLM_KV_RESIDUAL_SCALE,                    hparams.f_residual_scale);
    add_kv(LLM_KV_EMBEDDING_SCALE,                   hparams.f_embedding_scale);
+    add_kv(LLM_KV_TOKEN_SHIFT_COUNT,                 hparams.token_shift_count);
+    add_kv(LLM_KV_INTERLEAVE_MOE_LAYER_STEP,         hparams.n_moe_layer_step);
+    // add_kv(LLM_KV_FULL_ATTENTION_INTERVAL,           ???);

    add_kv(LLM_KV_ATTENTION_HEAD_COUNT,              hparams.n_head_arr, true);
    add_kv(LLM_KV_ATTENTION_HEAD_COUNT_KV,           hparams.n_head_kv_arr, true);
@@ -188,22 +252,39 @@ void llama_model_saver::add_kv_from_model() {
    add_kv(LLM_KV_ATTENTION_CLAMP_KQV,               hparams.f_clamp_kqv);
    add_kv(LLM_KV_ATTENTION_KEY_LENGTH,              hparams.n_embd_head_k_full);
    add_kv(LLM_KV_ATTENTION_VALUE_LENGTH,            hparams.n_embd_head_v_full);
-    add_kv(LLM_KV_ATTENTION_KEY_LENGTH_SWA,          hparams.n_embd_head_k_swa);
-    add_kv(LLM_KV_ATTENTION_VALUE_LENGTH_SWA,        hparams.n_embd_head_v_swa);
    add_kv(LLM_KV_ATTENTION_LAYERNORM_EPS,           hparams.f_norm_eps);
    add_kv(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS,       hparams.f_norm_rms_eps);
+    add_kv(LLM_KV_ATTENTION_GROUPNORM_EPS,           hparams.f_norm_group_eps);
+    add_kv(LLM_KV_ATTENTION_GROUPNORM_GROUPS,        hparams.n_norm_groups);
    add_kv(LLM_KV_ATTENTION_CAUSAL,                  hparams.causal_attn);
    add_kv(LLM_KV_ATTENTION_Q_LORA_RANK,             hparams.n_lora_q);
    add_kv(LLM_KV_ATTENTION_KV_LORA_RANK,            hparams.n_lora_kv);
+    add_kv(LLM_KV_ATTENTION_DECAY_LORA_RANK,         hparams.n_lora_decay);
+    add_kv(LLM_KV_ATTENTION_ICLR_LORA_RANK,          hparams.n_lora_iclr);
+    add_kv(LLM_KV_ATTENTION_VALUE_RESIDUAL_MIX_LORA_RANK, hparams.n_lora_value_res_mix);
+    add_kv(LLM_KV_ATTENTION_GATE_LORA_RANK,          hparams.n_lora_gate);
    add_kv(LLM_KV_ATTENTION_RELATIVE_BUCKETS_COUNT,  hparams.n_rel_attn_bkts);
    add_kv(LLM_KV_ATTENTION_SLIDING_WINDOW,          hparams.n_swa);
+    // add_kv(LLM_KV_ATTENTION_SLIDING_WINDOW_PATTERN,  ???);
    add_kv(LLM_KV_ATTENTION_SCALE,                   hparams.f_attention_scale);
+    add_kv(LLM_KV_ATTENTION_OUTPUT_SCALE,            hparams.f_attn_out_scale);
+    add_kv(LLM_KV_ATTENTION_TEMPERATURE_LENGTH,      hparams.attn_temp_length);
+    add_kv(LLM_KV_ATTENTION_TEMPERATURE_SCALE,       hparams.f_attn_temp_scale);
+    add_kv(LLM_KV_ATTENTION_KEY_LENGTH_MLA,          hparams.n_embd_head_k_mla_impl);
+    add_kv(LLM_KV_ATTENTION_VALUE_LENGTH_MLA,        hparams.n_embd_head_v_mla_impl);
+    add_kv(LLM_KV_ATTENTION_KEY_LENGTH_SWA,          hparams.n_embd_head_k_swa);
+    add_kv(LLM_KV_ATTENTION_VALUE_LENGTH_SWA,        hparams.n_embd_head_v_swa);
+    add_kv(LLM_KV_ATTENTION_INDEXER_HEAD_COUNT,      hparams.indexer_n_head);
+    add_kv(LLM_KV_ATTENTION_INDEXER_KEY_LENGTH,      hparams.indexer_head_size);
+    add_kv(LLM_KV_ATTENTION_INDEXER_TOP_K,           hparams.indexer_top_k);

    const float rope_scaling_factor = hparams.rope_freq_scale_train == 1.0f ? 0.0f : 1.0f/hparams.rope_freq_scale_train;

    add_kv(LLM_KV_ROPE_DIMENSION_COUNT,              hparams.n_rot_full);
    add_kv(LLM_KV_ROPE_DIMENSION_COUNT_SWA,          hparams.n_rot_swa);
+    add_kv(LLM_KV_ROPE_DIMENSION_SECTIONS,           hparams.rope_sections);
    add_kv(LLM_KV_ROPE_FREQ_BASE,                    hparams.rope_freq_base_train);
+    add_kv(LLM_KV_ROPE_FREQ_BASE_SWA,                hparams.rope_freq_base_train_swa);
    // add_kv(LLM_KV_ROPE_SCALE_LINEAR,                 rope_scaling_factor); // old name
    add_kv(LLM_KV_ROPE_SCALING_TYPE,                 llama_rope_scaling_type_name(hparams.rope_scaling_type_train));
    add_kv(LLM_KV_ROPE_SCALING_FACTOR,               rope_scaling_factor);
@@ -211,6 +292,10 @@ void llama_model_saver::add_kv_from_model() {
    add_kv(LLM_KV_ROPE_SCALING_ORIG_CTX_LEN,         hparams.n_ctx_orig_yarn);
    add_kv(LLM_KV_ROPE_SCALING_FINETUNED,            hparams.rope_finetuned);
    add_kv(LLM_KV_ROPE_SCALING_YARN_LOG_MUL,         hparams.rope_yarn_log_mul);
+    add_kv(LLM_KV_ROPE_SCALING_YARN_EXT_FACTOR,      hparams.yarn_ext_factor);
+    add_kv(LLM_KV_ROPE_SCALING_YARN_ATTN_FACTOR,     hparams.yarn_attn_factor);
+    add_kv(LLM_KV_ROPE_SCALING_YARN_BETA_FAST,       hparams.yarn_beta_fast);
+    add_kv(LLM_KV_ROPE_SCALING_YARN_BETA_SLOW,       hparams.yarn_beta_slow);

    // TODO: implement split file support
    // add_kv(LLM_KV_SPLIT_NO,                          ???);
@@ -221,8 +306,11 @@ void llama_model_saver::add_kv_from_model() {
    add_kv(LLM_KV_SSM_CONV_KERNEL,                   hparams.ssm_d_conv);
    add_kv(LLM_KV_SSM_STATE_SIZE,                    hparams.ssm_d_state);
    add_kv(LLM_KV_SSM_TIME_STEP_RANK,                hparams.ssm_dt_rank);
+    add_kv(LLM_KV_SSM_GROUP_COUNT,                   hparams.ssm_n_group);
    add_kv(LLM_KV_SSM_DT_B_C_RMS,                    hparams.ssm_dt_b_c_rms);

+    add_kv(LLM_KV_KDA_HEAD_DIM,                      hparams.n_embd_head_kda);
+
    add_kv(LLM_KV_WKV_HEAD_SIZE,                     hparams.wkv_head_size);

    add_kv(LLM_KV_TOKENIZER_MODEL,                   vocab.get_tokenizer_model());
@@ -260,15 +348,39 @@ void llama_model_saver::add_kv_from_model() {
    // TODO: implement LoRA support
    // add_kv(LLM_KV_ADAPTER_TYPE,                      ???);
    // add_kv(LLM_KV_ADAPTER_LORA_ALPHA,                ???);
+    // add_kv(LLM_KV_ADAPTER_LORA_TASK_NAME,            ???);
+    // add_kv(LLM_KV_ADAPTER_LORA_PROMPT_PREFIX,        ???);
+    // add_kv(LLM_KV_ADAPTER_ALORA_INVOCATION_TOKENS,   ???);
+
+    add_kv(LLM_KV_POSNET_EMBEDDING_LENGTH,           hparams.posnet.n_embd);
+    add_kv(LLM_KV_POSNET_BLOCK_COUNT,                hparams.posnet.n_layer);
+
+    add_kv(LLM_KV_CONVNEXT_EMBEDDING_LENGTH,         hparams.convnext.n_embd);
+    add_kv(LLM_KV_CONVNEXT_BLOCK_COUNT,              hparams.convnext.n_layer);
+
+    add_kv(LLM_KV_CLASSIFIER_OUTPUT_LABELS,          model->classifier_labels);
+
+    add_kv(LLM_KV_SHORTCONV_L_CACHE,                 hparams.n_shortconv_l_cache);
+
+    add_kv(LLM_KV_XIELU_ALPHA_N,                     hparams.xielu_alpha_n);
+    add_kv(LLM_KV_XIELU_ALPHA_P,                     hparams.xielu_alpha_p);
+    add_kv(LLM_KV_XIELU_BETA,                        hparams.xielu_beta);
+    add_kv(LLM_KV_XIELU_EPS,                         hparams.xielu_eps);

    // deprecated
    // add_kv(LLM_KV_TOKENIZER_PREFIX_ID,               ???);
    // add_kv(LLM_KV_TOKENIZER_SUFFIX_ID,               ???);
    // add_kv(LLM_KV_TOKENIZER_MIDDLE_ID,               ???);
+
+    add_kv(LLM_KV_DENSE_2_FEAT_IN,                   hparams.dense_2_feat_in);
+    add_kv(LLM_KV_DENSE_2_FEAT_OUT,                  hparams.dense_2_feat_out);
+    add_kv(LLM_KV_DENSE_3_FEAT_IN,                   hparams.dense_3_feat_in);
+    add_kv(LLM_KV_DENSE_3_FEAT_OUT,                  hparams.dense_3_feat_out);
 }

 void llama_model_saver::add_tensors_from_model() {
-    if (std::string(model->output->name) != std::string(model->tok_embd->name)) {
+    if (model->output != nullptr &&
+            std::string(model->output->name) != std::string(model->tok_embd->name)) {
        add_tensor(model->tok_embd); // some models use the same tensor for tok_embd and output
    }
    add_tensor(model->type_embd);
@@ -297,3 +409,6 @@ void llama_model_saver::save(const std::string & path_model) {
    gguf_write_to_file(gguf_ctx, path_model.c_str(), false);
 }

+void llama_model_saver::save(FILE * file) {
+    gguf_write_to_file_ptr(gguf_ctx, file, false);
+}
--- a/src/llama-model-saver.h
+++ b/src/llama-model-saver.h
@@ -6,6 +6,9 @@

 #include <vector>

+// FIXME temporary function for better error messages
+bool llama_model_saver_supports_arch(llm_arch arch);
+
 struct llama_model_saver {
    struct gguf_context * gguf_ctx = nullptr;
    const bool gguf_ctx_owned;
@@ -37,4 +40,5 @@ struct llama_model_saver {
    void add_tensors_from_model();

    void save(const std::string & path_model);
+    void save(FILE * file);
 };
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -1624,7 +1624,7 @@ void llama_model::load_hparams(llama_model_loader & ml) {

                // (optional) temperature tuning - used by mistral-large
                ml.get_key(LLM_KV_ATTENTION_TEMPERATURE_SCALE,  hparams.f_attn_temp_scale,       false);
-                ml.get_key(LLM_KV_ATTENTION_TEMPERATURE_LENGTH, hparams.n_attn_temp_floor_scale, false);
+                ml.get_key(LLM_KV_ATTENTION_TEMPERATURE_LENGTH, hparams.n_attn_temp_floor_scale, false); // FIXME why not use temperature_length?

                hparams.f_attn_temp_offset = 0.0f;

@@ -3217,8 +3217,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                        cls_out_b = create_tensor(tn(LLM_TENSOR_CLS_OUT, "bias"),   {hparams.n_cls_out},         TENSOR_NOT_REQUIRED);
                    }

-                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight"), {n_embd}, 0);
-                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias"),   {n_embd}, 0);
+                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight", 0), {n_embd}, 0);
+                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias",   0), {n_embd}, 0);

                    for (int i = 0; i < n_layer; ++i) {
                        auto & layer = layers[i];
@@ -3265,7 +3265,7 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
            case LLM_ARCH_MODERN_BERT:
                {
                    tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);
-                    tok_norm = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight"), {n_embd}, 0);
+                    tok_norm = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight", 0), {n_embd}, 0);

                    output_norm = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd}, 0);

@@ -3348,8 +3348,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                    tok_embd  = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD,  "weight"), {n_embd, n_vocab}, 0); // word_embeddings
                    type_embd = create_tensor(tn(LLM_TENSOR_TOKEN_TYPES, "weight"), {n_embd, n_token_types}, 0); // token_type_embeddings

-                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight"), {n_embd}, 0); // LayerNorm
-                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias"),   {n_embd}, 0); //LayerNorm bias
+                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight", 0), {n_embd}, 0); // LayerNorm
+                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias",   0), {n_embd}, 0); // LayerNorm bias

                    cls   = create_tensor(tn(LLM_TENSOR_CLS, "weight"), {n_embd, 1}, TENSOR_NOT_REQUIRED);
                    cls_b = create_tensor(tn(LLM_TENSOR_CLS, "bias"),   {1},         TENSOR_NOT_REQUIRED);
@@ -3400,8 +3400,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
            case LLM_ARCH_BLOOM:
                {
                    tok_embd   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD,      "weight"), {n_embd, n_vocab}, 0);
-                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight"), {n_embd}, 0);
-                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias"),   {n_embd}, 0);
+                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight", 0), {n_embd}, 0);
+                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias",   0), {n_embd}, 0);

                    // output
                    output_norm   = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd}, 0);
@@ -5780,8 +5780,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                    tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);

                    // Block 0, LN0
-                    tok_norm = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight"), {n_embd}, 0);
-                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias"), {n_embd}, 0);
+                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight", 0), {n_embd}, 0);
+                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias",   0), {n_embd}, 0);

                    // output
                    output_norm = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd}, 0);
@@ -5895,8 +5895,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                    tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);

                    // Block 0, LN0
-                    tok_norm = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight"), {n_embd}, 0);
-                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias"), {n_embd}, 0);
+                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight", 0), {n_embd}, 0);
+                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias",   0), {n_embd}, 0);

                    // output
                    output_norm = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd}, 0);
@@ -6067,8 +6067,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                {
                    tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {hparams.n_embd, n_vocab}, 0);

-                    conv1d   = create_tensor(tn(LLM_TENSOR_CONV1D, "weight"), {7, hparams.n_embd, hparams.posnet.n_embd}, 0);
-                    conv1d_b = create_tensor(tn(LLM_TENSOR_CONV1D, "bias"),   {1, hparams.posnet.n_embd}, 0);
+                    conv1d   = create_tensor(tn(LLM_TENSOR_CONV1D, "weight", 0), {7, hparams.n_embd, hparams.posnet.n_embd}, 0);
+                    conv1d_b = create_tensor(tn(LLM_TENSOR_CONV1D, "bias",   0), {1, hparams.posnet.n_embd}, 0);

                    // posnet
                    {
@@ -6133,8 +6133,8 @@ bool llama_model::load_tensors(llama_model_loader & ml) {

                    GGML_ASSERT(hparams.posnet.n_embd == hparams.convnext.n_embd);

-                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight"), {hparams.posnet.n_embd}, 0);
-                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias"),   {hparams.posnet.n_embd}, 0);
+                    tok_norm   = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "weight", 0), {hparams.posnet.n_embd}, 0);
+                    tok_norm_b = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD_NORM, "bias",   0), {hparams.posnet.n_embd}, 0);

                    // convnext
                    {
@@ -7607,14 +7607,15 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                buf_map.emplace(idx, buf);
            }
        }
-        pimpl->ctxs_bufs.emplace_back(std::move(ctx_ptr), std::move(bufs));

-        for (auto & buf : buf_map) {
+        for (auto & buf : bufs) {
            // indicate that this buffer contains weights
            // this is used by ggml_backend_sched to improve op scheduling: ops that use a weight are preferably scheduled to the backend that contains the weight
-            ggml_backend_buffer_set_usage(buf.second, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
+            ggml_backend_buffer_set_usage(buf.get(), GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
        }

+        pimpl->ctxs_bufs.emplace_back(std::move(ctx_ptr), std::move(bufs));
+
        ctx_buf_maps.emplace_back(ctx, buf_map);
    }

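Note on the reordering in the hunk above: the loop now iterates `bufs` rather than `buf_map`, so it has to run before `bufs` is moved into `pimpl->ctxs_bufs`, because a `std::vector` is empty once another vector has been move-constructed from it. The following is only a minimal sketch of that ordering; `fake_buffer` and `tag_then_store` are hypothetical stand-ins and not part of ggml or llama.cpp.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a backend buffer; it only records whether it was tagged.
struct fake_buffer {
    bool is_weights = false;
};

using fake_buffer_ptr = std::unique_ptr<fake_buffer>;

// Tag every buffer first, then hand ownership to the long-lived container.
// Returns the number of buffers that were tagged.
static std::size_t tag_then_store(std::vector<fake_buffer_ptr> bufs,
                                  std::vector<std::vector<fake_buffer_ptr>> & storage) {
    std::size_t n_tagged = 0;
    for (auto & buf : bufs) {
        buf->is_weights = true; // stands in for ggml_backend_buffer_set_usage(buf.get(), ...)
        n_tagged++;
    }
    // The stored vector is move-constructed from bufs, which leaves bufs empty.
    storage.emplace_back(std::move(bufs));
    return n_tagged;
}
```

If the two steps were swapped, the loop would visit zero elements and no buffer would be tagged.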
--- a/src/llama-quant.cpp
+++ b/src/llama-quant.cpp
@@ -859,7 +859,7 @@ static void llama_model_quantize_impl(const std::string & fname_inp, const std::

    std::vector<std::string> splits = {};
    llama_model_loader ml(/*metadata*/ nullptr, /*set_tensor_data*/ nullptr, /*set_tensor_data_ud*/ nullptr,
-        fname_inp, splits, use_mmap, /*use_direct_io*/ false, /*check_tensors*/ true, /*no_alloc*/ false, kv_overrides, nullptr);
+        fname_inp, splits, /*file*/ nullptr, use_mmap, /*use_direct_io*/ false, /*check_tensors*/ true, /*no_alloc*/ false, kv_overrides, nullptr);
    ml.init_mappings(false); // no prefetching

    llama_model model(llama_model_default_params());
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -365,14 +365,14 @@ static void llama_params_fit_impl(
            case LAYER_FRACTION_ATTN: {
                static std::array<std::string, n_strings> patterns;
                if (patterns[il].empty()) {
-                    patterns[il] = "blk\\." + std::to_string(il) + "\\.ffn_(up|gate|down).*";
+                    patterns[il] = "blk\\." + std::to_string(il) + "\\.ffn_(gate|up|gate_up|down).*";
                }
                return patterns[il].c_str();
            }
            case LAYER_FRACTION_UP: {
                static std::array<std::string, n_strings> patterns;
                if (patterns[il].empty()) {
-                    patterns[il] = "blk\\." + std::to_string(il) + "\\.ffn_(gate|down).*";
+                    patterns[il] = "blk\\." + std::to_string(il) + "\\.ffn_(gate|gate_up|down).*";
                }
                return patterns[il].c_str();
            }
@@ -386,7 +386,7 @@ static void llama_params_fit_impl(
            case LAYER_FRACTION_MOE: {
                static std::array<std::string, n_strings> patterns;
                if (patterns[il].empty()) {
-                    patterns[il] = "blk\\." + std::to_string(il) + "\\.ffn_(up|down|gate)_(ch|)exps";
+                    patterns[il] = "blk\\." + std::to_string(il) + "\\.ffn_(up|down|gate_up|gate)_(ch|)exps";
                }
                return patterns[il].c_str();
            }
@@ -480,7 +480,7 @@ static void llama_params_fit_impl(

    int64_t global_surplus_cpu_moe = 0;
    if (hp_nex > 0) {
-        const static std::string pattern_moe_all = "blk\\.\\d+\\.ffn_(up|down|gate)_(ch|)exps"; // matches all MoE tensors
+        const static std::string pattern_moe_all = "blk\\.\\d+\\.ffn_(up|down|gate_up|gate)_(ch|)exps"; // matches all MoE tensors
        ggml_backend_buffer_type_t cpu_buft = ggml_backend_cpu_buffer_type();
        tensor_buft_overrides[0] = {pattern_moe_all.c_str(), cpu_buft};
        tensor_buft_overrides[1] = {nullptr, nullptr};
@@ -828,7 +828,7 @@ int64_t llama_time_us(void) {

 // Returns 0 on success, -1 on error, and -2 on cancellation via llama_progress_callback
 static int llama_model_load(struct gguf_context * metadata, llama_model_set_tensor_data_t set_tensor_data, void * set_tensor_data_ud,
-        const std::string & fname, std::vector<std::string> & splits, llama_model & model, llama_model_params & params) {
+        const std::string & fname, std::vector<std::string> & splits, FILE * file, llama_model & model, llama_model_params & params) {
    // loading time will be recalculated after the first eval, so
    // we take page faults deferred by mmap() into consideration
    model.t_load_us = 0;
@@ -837,7 +837,7 @@ static int llama_model_load(struct gguf_context * metadata, llama_model_set_tens
    model.t_start_us = tm.t_start_us;

    try {
-        llama_model_loader ml(metadata, set_tensor_data, set_tensor_data_ud, fname, splits, params.use_mmap, params.use_direct_io,
+        llama_model_loader ml(metadata, set_tensor_data, set_tensor_data_ud, fname, splits, file, params.use_mmap, params.use_direct_io,
            params.check_tensors, params.no_alloc, params.kv_overrides, params.tensor_buft_overrides);

        ml.print_info();
@@ -889,8 +889,24 @@ static struct llama_model * llama_model_load_from_file_impl(
        void * set_tensor_data_ud,
        const std::string & path_model,
        std::vector<std::string> & splits,
+        FILE * file,
        struct llama_model_params params) {
-    GGML_ASSERT((metadata == nullptr) != path_model.empty() && "exactly one out of metadata and path_model needs to be defined");
+    {
+        int n_sources_defined = 0;
+        if (metadata != nullptr) {
+            n_sources_defined++;
+        }
+        if (!path_model.empty()) {
+            n_sources_defined++;
+        }
+        if (file != nullptr) {
+            n_sources_defined++;
+        }
+        if (n_sources_defined != 1) {
+            LLAMA_LOG_ERROR("%s: exactly one out of metadata, path_model, and file must be defined\n", __func__);
+            return nullptr;
+        }
+    }
    ggml_time_init();

    if (!params.vocab_only && ggml_backend_reg_count() == 0) {
@@ -1011,7 +1027,7 @@ static struct llama_model * llama_model_load_from_file_impl(
                props.memory_free/1024/1024);
    }

-    const int status = llama_model_load(metadata, set_tensor_data, set_tensor_data_ud, path_model, splits, *model, params);
+    const int status = llama_model_load(metadata, set_tensor_data, set_tensor_data_ud, path_model, splits, file, *model, params);
    GGML_ASSERT(status <= 0);
    if (status < 0) {
        if (status == -1) {
@@ -1037,7 +1053,7 @@ struct llama_model * llama_model_init_from_user(
    std::vector<std::string> splits = {};
    params.use_mmap = false;
    params.use_extra_bufts = false;
-    return llama_model_load_from_file_impl(metadata, set_tensor_data, set_tensor_data_ud, path_model, splits, params);
+    return llama_model_load_from_file_impl(metadata, set_tensor_data, set_tensor_data_ud, path_model, splits, /*file*/ nullptr, params);
 }
 // deprecated
 struct llama_model * llama_load_model_from_file(
@@ -1050,7 +1066,7 @@ struct llama_model * llama_model_load_from_file(
        const char * path_model,
        struct llama_model_params params) {
    std::vector<std::string> splits = {};
-    return llama_model_load_from_file_impl(nullptr, nullptr, nullptr, path_model, splits, params);
+    return llama_model_load_from_file_impl(nullptr, nullptr, nullptr, path_model, splits, /*file*/ nullptr, params);
 }

 struct llama_model * llama_model_load_from_splits(
@@ -1066,7 +1082,17 @@ struct llama_model * llama_model_load_from_splits(
    for (size_t i = 0; i < n_paths; ++i) {
        splits.push_back(paths[i]);
    }
-    return llama_model_load_from_file_impl(nullptr, nullptr, nullptr, splits.front(), splits, params);
+    return llama_model_load_from_file_impl(nullptr, nullptr, nullptr, splits.front(), splits, /*file*/ nullptr, params);
+}
+
+struct llama_model * llama_model_load_from_file_ptr(FILE * file, struct llama_model_params params) {
+    if (!file) {
+        LLAMA_LOG_ERROR("%s: file is NULL\n", __func__);
+        return nullptr;
+    }
+    std::string path_model;
+    std::vector<std::string> splits = {};
+    return llama_model_load_from_file_impl(nullptr, nullptr, nullptr, path_model, splits, file, params);
 }

 void llama_model_save_to_file(const struct llama_model * model, const char * path_model) {
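Note on the pattern changes in `llama_params_fit_impl` in this file: the old MoE pattern does not cover a fused gate/up expert tensor, which is presumably why `gate_up` was added as an alternative. The sketch below compares the old and new pattern strings copied from the diff. It assumes the override patterns are applied to tensor names with `std::regex_search`; the helper `pattern_matches` and the example tensor names are illustrative only.

```cpp
#include <regex>
#include <string>

// Old and new "all MoE tensors" patterns, copied from the diff.
static const char * PATTERN_MOE_OLD = "blk\\.\\d+\\.ffn_(up|down|gate)_(ch|)exps";
static const char * PATTERN_MOE_NEW = "blk\\.\\d+\\.ffn_(up|down|gate_up|gate)_(ch|)exps";

// Assumption: a pattern applies to a tensor if it matches anywhere in the name.
static bool pattern_matches(const char * pattern, const std::string & tensor_name) {
    return std::regex_search(tensor_name, std::regex(pattern));
}
```

With these definitions, a name such as `blk.3.ffn_gate_up_exps.weight` is matched by the new pattern but not by the old one, while `blk.3.ffn_gate_exps.weight` is matched by both.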
--- a/src/models/bert.cpp
+++ b/src/models/bert.cpp
@@ -28,8 +28,8 @@ llm_build_bert::llm_build_bert(const llama_model & model, const llm_graph_params
    cb(inpL, "inp_embd", -1);

    // embed layer norm
-    inpL = build_norm(inpL, model.tok_norm, model.tok_norm_b, LLM_NORM, -1);
-    cb(inpL, "inp_norm", -1);
+    inpL = build_norm(inpL, model.tok_norm, model.tok_norm_b, LLM_NORM, 0);
+    cb(inpL, "inp_norm", 0);

    auto * inp_attn = build_attn_inp_no_cache();

--- a/src/models/bloom.cpp
+++ b/src/models/bloom.cpp
@@ -16,8 +16,8 @@ llm_build_bloom::llm_build_bloom(const llama_model & model, const llm_graph_para
    inpL = build_norm(inpL,
            model.tok_norm,
            model.tok_norm_b,
-            LLM_NORM, -1);
-    cb(inpL, "inp_norm", -1);
+            LLM_NORM, 0);
+    cb(inpL, "inp_norm", 0);

    ggml_tensor * inp_out_ids = build_inp_out_ids();

--- a/src/models/modern-bert.cpp
+++ b/src/models/modern-bert.cpp
@@ -15,8 +15,8 @@ llm_build_modern_bert::llm_build_modern_bert(const llama_model & model, const ll
    cb(inpL, "inp_embd", -1);

    // embed layer norm
-    inpL = build_norm(inpL, model.tok_norm, nullptr, LLM_NORM, -1);
-    cb(inpL, "inp_norm", -1);
+    inpL = build_norm(inpL, model.tok_norm, nullptr, LLM_NORM, 0);
+    cb(inpL, "inp_norm", 0);

    ggml_tensor * inp_out_ids = build_inp_out_ids();

--- a/src/models/rwkv6.cpp
+++ b/src/models/rwkv6.cpp
@@ -8,7 +8,7 @@ llm_build_rwkv6::llm_build_rwkv6(const llama_model & model, const llm_graph_para
    ggml_tensor * inpL;

    inpL = build_inp_embd(model.tok_embd);
-    inpL = build_norm(inpL, model.tok_norm, model.tok_norm_b, LLM_NORM, -1);
+    inpL = build_norm(inpL, model.tok_norm, model.tok_norm_b, LLM_NORM, 0);

    auto * rs_inp = build_rs_inp();

--- a/src/models/rwkv7.cpp
+++ b/src/models/rwkv7.cpp
@@ -9,7 +9,7 @@ llm_build_rwkv7::llm_build_rwkv7(const llama_model & model, const llm_graph_para
    ggml_tensor * v_first = nullptr;

    inpL = build_inp_embd(model.tok_embd);
-    inpL = build_norm(inpL, model.tok_norm, model.tok_norm_b, LLM_NORM, -1);
+    inpL = build_norm(inpL, model.tok_norm, model.tok_norm_b, LLM_NORM, 0);

    auto * rs_inp = build_rs_inp();

--- a/src/models/wavtokenizer-dec.cpp
+++ b/src/models/wavtokenizer-dec.cpp
@@ -93,7 +93,7 @@ llm_build_wavtokenizer_dec::llm_build_wavtokenizer_dec(const llama_model & model
    cur = build_norm(cur,
            model.tok_norm,
            model.tok_norm_b,
-            LLM_NORM, -1);
+            LLM_NORM, 0);

    cur = ggml_cont(ctx0, ggml_transpose(ctx0, cur));

--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -8576,12 +8576,12 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
        }
    }

-    for (int hsk : { 40, 64, 72, 80, 96, 128, 192, 256, 320, 576 }) {
+    for (int hsk : { 40, 64, 72, 80, 96, 128, 192, 256, 320, 512, 576 }) {
        for (int hsv : { 40, 64, 72, 80, 96, 128, 192, 256, 512 }) {
            if (hsk != 192 && hsk != 320 && hsk != 576 && hsk != hsv) continue;
            if (hsk == 192 && (hsv != 128 && hsv != 192)) continue;
            if (hsk == 576 && hsv != 512) continue; // DeepSeek MLA
-            if (hsk == 320 && hsv != 256) continue; // MLA
+            if (hsk == 320 && hsv != 256) continue; // Mistral4 MLA

            for (bool mask : { true, false } ) {
                for (bool sinks : { true, false } ) {
@@ -8590,7 +8590,7 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
                        for (float logit_softcap : {0.0f, 10.0f}) {
                            if (hsk != 128 && logit_softcap != 0.0f) continue;
                            for (int nh : { 1, 4 }) {
-                                if (nh == 1 && hsk != 320 && hsk != 576) continue; // GLM 4.7 Flash
+                                if (nh == 1 && hsk != 320 && hsk != 576) continue;
                                for (int nr3 : { 1, 3, }) {
                                    if (hsk > 64 && nr3 > 1) continue; // skip broadcast for large head sizes
                                    for (int nr2 : { 1, 4, 12, 20, 32 }) {
--- a/tests/test-gguf.cpp
+++ b/tests/test-gguf.cpp
@@ -742,7 +742,7 @@ static std::pair<int, int> test_handcrafted_file(const unsigned int seed) {
            /*ctx      =*/ hft >= offset_has_data ? &ctx : nullptr,
        };

-        struct gguf_context * gguf_ctx = gguf_init_from_file_impl(file, gguf_params);
+        struct gguf_context * gguf_ctx = gguf_init_from_file_ptr(file, gguf_params);

        if (expect_context_not_null(hft)) {
            printf("%s:   - context_not_null: ", __func__);
@@ -1125,19 +1125,15 @@ static std::pair<int, int> test_roundtrip(ggml_backend_dev_t dev, const unsigned
    GGML_ASSERT(file);
 #endif // _WIN32

-    {
-        std::vector<int8_t> buf;
-        gguf_write_to_buf(gguf_ctx_0, buf, only_meta);
-        GGML_ASSERT(fwrite(buf.data(), 1, buf.size(), file) == buf.size());
-        rewind(file);
-    }
+    gguf_write_to_file_ptr(gguf_ctx_0, file, only_meta);
+    rewind(file);

    struct ggml_context * ctx_1 = nullptr;
    struct gguf_init_params gguf_params = {
        /*no_alloc =*/ false,
        /*ctx      =*/ only_meta ? nullptr : &ctx_1,
    };
-    struct gguf_context * gguf_ctx_1 = gguf_init_from_file_impl(file, gguf_params);
+    struct gguf_context * gguf_ctx_1 = gguf_init_from_file_ptr(file, gguf_params);

    printf("%s: same_version: ", __func__);
    if (gguf_get_version(gguf_ctx_0) == gguf_get_version(gguf_ctx_1)) {
--- a/tests/test-jinja.cpp
+++ b/tests/test-jinja.cpp
@@ -884,6 +884,24 @@ static void test_macros(testing & t) {
        json::object(),
        "Hi Guest"
    );
+
+    test_template(t, "macro kwargs input",
+        "{% macro my_func(a, b=False) %}{% if b %}{{ a }}{% else %}nope{% endif %}{% endmacro %}{{ my_func(1, b=True) }}",
+        json::object(),
+        "1"
+    );
+
+    test_template(t, "macro with multiple args",
+        "{% macro add(a, b, c=0) %}{{ a + b + c }}{% endmacro %}{{ add(1, 2) }},{{ add(1, 2, 3) }},{{ add(1, b=10) }},{{ add(1, 2, c=5) }}",
+        json::object(),
+        "3,6,11,8"
+    );
+
+    test_template(t, "macro with kwarg out-of-order input",
+        "{% macro greet(first, last, greeting='Hello') %}{{ greeting }}, {{ first }} {{ last }}{% endmacro %}{{ greet(last='Smith', first='John') }},{{ greet(last='Doe', greeting='Hi', first='Jane') }}",
+        json::object(),
+        "Hello, John Smith,Hi, Jane Doe"
+    );
 }

 static void test_namespace(testing & t) {
--- a/tests/test-llama-archs.cpp
+++ b/tests/test-llama-archs.cpp
@@ -90,6 +90,7 @@ static gguf_context_ptr get_gguf_ctx(const llm_arch arch, const bool moe) {
        n_embd = 64;
        n_head = 1;
        n_ff   = 96;
+        n_layer = 22; // hparams.n_layer_kv_from_start = 20 is hardcoded
    } else if (arch == LLM_ARCH_DEEPSEEK2
            || arch == LLM_ARCH_GLM_DSA
            || arch == LLM_ARCH_KIMI_LINEAR
@@ -101,8 +102,6 @@ static gguf_context_ptr get_gguf_ctx(const llm_arch arch, const bool moe) {
        n_layer = 3;
    } else if (arch == LLM_ARCH_CHAMELEON) {
        n_vocab = 10240;
-    } else if (arch == LLM_ARCH_GEMMA3N) {
-        n_layer = 22; // hparams.n_layer_kv_from_start = 20 is hardcoded
    }

    const uint32_t n_embd_head = n_embd / n_head;
@@ -231,9 +230,15 @@ static gguf_context_ptr get_gguf_ctx(const llm_arch arch, const bool moe) {
    return ret;
 }

+static bool silent_model_load_progress(float /*progress*/, void * /*user_data*/) {
+    return true;
+}
+
 static std::pair<llama_model_ptr, llama_context_ptr> get_model_and_ctx(
-        struct gguf_context * gguf_ctx, const size_t seed, const std::vector<ggml_backend_dev_t> & devs) {
+        struct gguf_context * gguf_ctx, FILE * file, const size_t seed, const std::vector<ggml_backend_dev_t> & devs) {
+    GGML_ASSERT((gguf_ctx == nullptr) != (file == nullptr));
    llama_model_params model_params = llama_model_default_params();
+    model_params.progress_callback = silent_model_load_progress;
    std::vector<ggml_backend_dev_t> devs_copy = devs;
    devs_copy.push_back(nullptr);
    model_params.devices = devs_copy.data();
@@ -244,7 +249,9 @@ static std::pair<llama_model_ptr, llama_context_ptr> get_model_and_ctx(
    ctx_params.n_threads_batch = 4;

    size_t tmp = seed;
-    llama_model_ptr model(llama_model_init_from_user(gguf_ctx, set_tensor_data, &tmp, model_params));
+    llama_model_ptr model(gguf_ctx != nullptr ?
+        llama_model_init_from_user(gguf_ctx, set_tensor_data, &tmp, model_params) :
+        llama_model_load_from_file_ptr(file, model_params));
    if (!model) {
        throw std::runtime_error("failed to create llama model");
    }
@@ -351,7 +358,6 @@ static bool moe_implemented(const llm_arch arch) {
 }

 static int save_models(const llm_arch target_arch, const size_t seed, const ggml_log_level log_level, const std::string & dir) {
-    GGML_ABORT("llama_model_save_to_file is broken");
    struct user_data_t {
        struct {
            ggml_log_callback callback;
@@ -376,6 +382,19 @@ static int save_models(const llm_arch target_arch, const size_t seed, const ggml
        if (arch == LLM_ARCH_CLIP || arch == LLM_ARCH_GPTJ || arch == LLM_ARCH_UNKNOWN) {
            continue; // These models don't have usable implementations.
        }
+        if (arch == LLM_ARCH_CHAMELEON) {
+            continue; // Only half-implemented and to be removed in the future.
+        }
+        if (arch == LLM_ARCH_RWKV6 || arch == LLM_ARCH_RWKV6QWEN2 || arch == LLM_ARCH_RWKV7 || arch == LLM_ARCH_ARWKV7) {
+            continue; // FIXME
+        }
+        if (arch == LLM_ARCH_BERT || arch == LLM_ARCH_MODERN_BERT || arch == LLM_ARCH_NOMIC_BERT || arch == LLM_ARCH_NOMIC_BERT_MOE ||
+                arch == LLM_ARCH_NEO_BERT || arch == LLM_ARCH_JINA_BERT_V2 || arch == LLM_ARCH_JINA_BERT_V3 || arch == LLM_ARCH_EUROBERT) {
+            continue; // TODO vocab
+        }
+        if (arch == LLM_ARCH_PLM) {
+            continue; // TODO tensor shapes
+        }
        for (bool moe : {false, true}) {
            if (moe && !moe_implemented(arch)) {
                continue;
@@ -383,8 +402,12 @@ static int save_models(const llm_arch target_arch, const size_t seed, const ggml
            if (!moe && moe_mandatory(arch)) {
                continue;
            }
+            if (!llama_model_saver_supports_arch(arch)) {
+                LOG_INF("%s: %s model (%s) is unsupported, skipping\n", __func__, llm_arch_name(arch), moe ? "MoE" : "dense");
+                continue;
+            }
            gguf_context_ptr gguf_ctx = get_gguf_ctx(arch, moe);
-            auto model_and_ctx = get_model_and_ctx(gguf_ctx.get(), seed, {});
+            auto model_and_ctx = get_model_and_ctx(gguf_ctx.get(), nullptr, seed, {});
            const std::string path = dir + "/" + llm_arch_name(arch) + (moe ? "-moe.gguf" : "-dense.gguf");
            LOG_INF("%s: Saving %s model (%s) to %s...\n", __func__, llm_arch_name(arch), moe ? "MoE" : "dense", path.c_str());
            llama_model_save_to_file(model_and_ctx.first.get(), path.c_str());
@@ -416,8 +439,8 @@ static int test_backends(const llm_arch target_arch, const size_t seed, const gg

    bool all_ok = true;
    common_log_flush(common_log_main());
-    printf("|%15s|%30s|%6s|%8s|%6s|\n", "Model arch.", "Device", "Config", "NMSE", "Status");
-    printf("|---------------|------------------------------|------|--------|------|\n");
+    printf("|%15s|%30s|%6s|%15s|%9s|\n", "Model arch.", "Device", "Config", "NMSE vs. CPU", "Roundtrip");
+    printf("|---------------|------------------------------|------|---------------|---------|\n");
    for (const llm_arch & arch : llm_arch_all()) {
        if (target_arch != LLM_ARCH_UNKNOWN && arch != target_arch) {
            continue;
@@ -425,6 +448,9 @@ static int test_backends(const llm_arch target_arch, const size_t seed, const gg
        if (arch == LLM_ARCH_CLIP || arch == LLM_ARCH_GPTJ || arch == LLM_ARCH_UNKNOWN) {
            continue; // These models don't have usable implementations.
        }
+        if (arch == LLM_ARCH_CHAMELEON) {
+            continue; // Only half-implemented and to be removed in the future.
+        }
        if (arch == LLM_ARCH_WAVTOKENIZER_DEC) {
            continue; // FIXME CUDA backend crashes.
        }
@@ -458,22 +484,50 @@ static int test_backends(const llm_arch target_arch, const size_t seed, const gg
                continue;
            }
            gguf_context_ptr gguf_ctx = get_gguf_ctx(arch, moe);
-            auto model_and_ctx_cpu = get_model_and_ctx(gguf_ctx.get(), seed, {});
+            auto model_and_ctx_cpu = get_model_and_ctx(gguf_ctx.get(), nullptr, seed, {});
            const std::vector<float> logits_cpu = get_logits(model_and_ctx_cpu.first.get(), model_and_ctx_cpu.second.get(), tokens, encode);
            for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
                ggml_backend_dev_t dev = ggml_backend_dev_get(i);
                if (ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
                    continue;
                }
-                auto model_and_ctx_dev = get_model_and_ctx(gguf_ctx.get(), seed, {dev});
+                auto model_and_ctx_dev = get_model_and_ctx(gguf_ctx.get(), nullptr, seed, {dev});
+                std::string config_name = moe ? "MoE" : "Dense";
                const std::vector<float> logits_dev = get_logits(model_and_ctx_dev.first.get(), model_and_ctx_dev.second.get(), tokens, encode);
                const double nmse_val = nmse(logits_cpu, logits_dev);
-                const bool ok = nmse_val <= 1e-4;
-                all_ok = all_ok && ok;
                char nmse_str[10];
                snprintf(nmse_str, sizeof(nmse_str), "%.2e", nmse_val);
-                printf("|%15s|%30s|%6s|%8s|%17s|\n", llm_arch_name(arch), ggml_backend_dev_description(dev),
-                    moe ? "MoE" : "Dense", nmse_str, ok ? "\033[1;32mOK\033[0m" : "\033[1;31mFAIL\033[0m");
+                std::string status_nmse = "\033[1;32mOK\033[0m";
+                if (nmse_val > 1e-4) {
+                    all_ok = false;
+                    status_nmse = "\033[1;31mFAIL\033[0m";
+                }
+
+                std::string status_roundtrip = "\033[1;33mSKIP\033[0m";
+                FILE * file = tmpfile(); // Can be null on Windows without administrator privileges.
+                if (file != nullptr && llama_model_saver_supports_arch(arch)) {
+                    llama_model_saver ms = llama_model_saver(model_and_ctx_dev.first.get());
+                    ms.add_kv_from_model();
+                    ms.add_tensors_from_model();
+                    ms.save(file);
+                    rewind(file);
+
+                    auto model_and_ctx_roundtrip = get_model_and_ctx(nullptr, file, seed, {dev});
+                    const std::vector<float> logits_roundtrip = get_logits(
+                        model_and_ctx_roundtrip.first.get(), model_and_ctx_roundtrip.second.get(), tokens, encode);
+                    status_roundtrip = "\033[1;32mOK\033[0m";
+                    GGML_ASSERT(logits_roundtrip.size() == logits_dev.size());
+                    for (size_t i = 0; i < logits_roundtrip.size(); i++) {
+                        if (logits_roundtrip[i] != logits_dev[i]) {
+                            all_ok = false;
+                            status_roundtrip = "\033[1;31mFAIL\033[0m";
+                            break;
+                        }
+                    }
+                }
+
+                printf("|%15s|%30s|%6s|%15s (%8s)|%20s|\n", llm_arch_name(arch), ggml_backend_dev_description(dev),
+                    config_name.c_str(), status_nmse.c_str(), nmse_str, status_roundtrip.c_str());
            }
        }
    }
@@ -526,6 +580,7 @@ int main(int argc, char ** argv) {
            }
        }
    }
+    printf("%s: using seed %zu\n", __func__, seed);

    try {
        if (!out.empty()) {
--- a/tools/gguf-split/README.md
+++ b/tools/gguf-split/README.md
@@ -7,4 +7,4 @@ CLI to split / merge GGUF files.
 - `--split`: split GGUF to multiple GGUF, default operation.
 - `--split-max-size`: max size per split in `M` or `G`, f.ex. `500M` or `2G`.
 - `--split-max-tensors`: maximum tensors in each split: default(128)
-- `--merge`: merge multiple GGUF to a single GGUF.
+- `--merge`: merge multiple GGUF to a single GGUF. You only need to specify the name of the first GGUF to merge and the name of the merged GGUF; the CLI will find the other GGUFs it needs within the same folder.
--- a/tools/llama-bench/llama-bench.cpp
+++ b/tools/llama-bench/llama-bench.cpp
@@ -979,37 +979,20 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
        for (size_t i = 0; i < params.hf_repo.size(); i++) {
            common_params_model model;

-            // step 1: no `-hff` provided, we auto-detect based on the `-hf` flag
            if (params.hf_file.empty() || params.hf_file[i].empty()) {
-                auto auto_detected = common_get_hf_file(params.hf_repo[i], params.hf_token, false);
-                if (auto_detected.repo.empty() || auto_detected.ggufFile.empty()) {
-                    exit(1);
-                }
-
-                model.name    = params.hf_repo[i];
-                model.hf_repo = auto_detected.repo;
-                model.hf_file = auto_detected.ggufFile;
+                model.hf_repo = params.hf_repo[i];
            } else {
+                model.hf_repo = params.hf_repo[i];
                model.hf_file = params.hf_file[i];
            }

-            // step 2: construct the model cache path
-            std::string clean_fname = model.hf_repo + "_" + model.hf_file;
-            string_replace_all(clean_fname, "\\", "_");
-            string_replace_all(clean_fname, "/", "_");
-            model.path = fs_get_cache_file(clean_fname);
-
-            // step 3: download the model if not exists
-            std::string model_endpoint = get_model_endpoint();
-            model.url = model_endpoint + model.hf_repo + "/resolve/main/" + model.hf_file;
-
-            bool ok = common_download_model(model, params.hf_token, false);
-            if (!ok) {
-                fprintf(stderr, "error: failed to download model from %s\n", model.url.c_str());
+            auto download_result = common_download_model(model, params.hf_token);
+            if (download_result.model_path.empty()) {
+                fprintf(stderr, "error: failed to download model from HuggingFace\n");
                exit(1);
            }

-            params.model.push_back(model.path);
+            params.model.push_back(download_result.model_path);
        }
    }

@@ -1824,7 +1807,7 @@ struct markdown_printer : public printer {
        if (!is_cpu_backend) {
            fields.emplace_back("n_gpu_layers");
        }
-        if (params.n_cpu_moe.size() > 1) {
+        if (params.n_cpu_moe.size() > 1 || params.n_cpu_moe != cmd_params_defaults.n_cpu_moe) {
            fields.emplace_back("n_cpu_moe");
        }
        if (params.n_threads.size() > 1 || params.n_threads != cmd_params_defaults.n_threads || is_cpu_backend) {
--- a/tools/server/public/index.html.gz
+++ b/tools/server/public/index.html.gz
--- a/tools/server/tests/unit/test_router.py
+++ b/tools/server/tests/unit/test_router.py
@@ -103,8 +103,8 @@ def test_router_models_max_evicts_lru():

    candidate_models = [
        "ggml-org/tinygemma3-GGUF:Q8_0",
-        "ggml-org/test-model-stories260K",
-        "ggml-org/test-model-stories260K-infill",
+        "ggml-org/test-model-stories260K:F32",
+        "ggml-org/test-model-stories260K-infill:F32",
    ]

    # Load only the first 2 models to fill the cache
--- a/tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormTextarea.svelte
+++ b/tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormTextarea.svelte
@@ -26,6 +26,7 @@

 	onMount(() => {
 		if (textareaElement) {
+			autoResizeTextarea(textareaElement);
 			textareaElement.focus();
 		}
 	});
@@ -50,8 +51,9 @@
 	<textarea
 		bind:this={textareaElement}
 		bind:value
-		class="text-md max-h-32 min-h-12 w-full resize-none border-0 bg-transparent p-0 leading-6 outline-none placeholder:text-muted-foreground focus-visible:ring-0 focus-visible:ring-offset-0"
+		class="text-md min-h-12 w-full resize-none border-0 bg-transparent p-0 leading-6 outline-none placeholder:text-muted-foreground focus-visible:ring-0 focus-visible:ring-offset-0"
 		class:cursor-not-allowed={disabled}
+		style="max-height: var(--max-message-height);"
 		{disabled}
 		onkeydown={onKeydown}
 		oninput={(event) => {
--- a/tools/server/webui/src/lib/components/app/models/ModelsSelector.svelte
+++ b/tools/server/webui/src/lib/components/app/models/ModelsSelector.svelte
@@ -77,7 +77,7 @@
 	let filteredOptions = $derived(filterModelOptions(options, searchTerm));

 	let groupedFilteredOptions = $derived(
-		groupModelOptions(filteredOptions, modelsStore.favouriteModelIds, (m) =>
+		groupModelOptions(filteredOptions, modelsStore.favoriteModelIds, (m) =>
 			modelsStore.isModelLoaded(m)
 		)
 	);
@@ -353,7 +353,7 @@
 								{@const { option, flatIndex } = item}
 								{@const isSelected = currentModel === option.model || activeId === option.id}
 								{@const isHighlighted = flatIndex === highlightedIndex}
-								{@const isFav = modelsStore.favouriteModelIds.has(option.model)}
+								{@const isFav = modelsStore.favoriteModelIds.has(option.model)}

 								<ModelsSelectorOption
 									{option}
--- a/tools/server/webui/src/lib/components/app/models/ModelsSelectorList.svelte
+++ b/tools/server/webui/src/lib/components/app/models/ModelsSelectorList.svelte
@@ -30,7 +30,7 @@
 {#snippet defaultOption(item: ModelItem, showOrgName: boolean)}
 	{@const { option } = item}
 	{@const isSelected = currentModel === option.model || activeId === option.id}
-	{@const isFav = modelsStore.favouriteModelIds.has(option.model)}
+	{@const isFav = modelsStore.favoriteModelIds.has(option.model)}

 	<ModelsSelectorOption
 		{option}
@@ -52,9 +52,9 @@
 	{/each}
 {/if}

-{#if groups.favourites.length > 0}
-	<p class={sectionHeaderClass}>Favourite models</p>
-	{#each groups.favourites as item (`fav-${item.option.id}`)}
+{#if groups.favorites.length > 0}
+	<p class={sectionHeaderClass}>Favorite models</p>
+	{#each groups.favorites as item (`fav-${item.option.id}`)}
 		{@render render(item, true)}
 	{/each}
 {/if}
--- a/tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte
+++ b/tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte
@@ -46,7 +46,10 @@
 	});
 	let isOperationInProgress = $derived(modelsStore.isModelOperationInProgress(option.model));
 	let isFailed = $derived(serverStatus === ServerModelStatus.FAILED);
-	let isLoaded = $derived(serverStatus === ServerModelStatus.LOADED && !isOperationInProgress);
+	let isSleeping = $derived(serverStatus === ServerModelStatus.SLEEPING);
+	let isLoaded = $derived(
+		(serverStatus === ServerModelStatus.LOADED || isSleeping) && !isOperationInProgress
+	);
 	let isLoading = $derived(serverStatus === ServerModelStatus.LOADING || isOperationInProgress);
 </script>

@@ -85,17 +88,17 @@
 				<ActionIcon
 					iconSize="h-2.5 w-2.5"
 					icon={HeartOff}
-					tooltip="Remove from favourites"
+					tooltip="Remove from favorites"
 					class="h-3 w-3 hover:text-foreground"
-					onclick={() => modelsStore.toggleFavourite(option.model)}
+					onclick={() => modelsStore.toggleFavorite(option.model)}
 				/>
 			{:else}
 				<ActionIcon
 					iconSize="h-2.5 w-2.5"
 					icon={Heart}
-					tooltip="Add to favourites"
+					tooltip="Add to favorites"
 					class="h-3 w-3 hover:text-foreground"
-					onclick={() => modelsStore.toggleFavourite(option.model)}
+					onclick={() => modelsStore.toggleFavorite(option.model)}
 				/>
 			{/if}

@@ -129,6 +132,23 @@
 					/>
 				</div>
 			</div>
+		{:else if isSleeping}
+			<div class="flex w-4 items-center justify-center">
+				<span class="h-2 w-2 rounded-full bg-orange-400 group-hover:hidden"></span>
+
+				<div class="hidden group-hover:flex">
+					<ActionIcon
+						iconSize="h-2.5 w-2.5"
+						icon={PowerOff}
+						tooltip="Unload model"
+						class="h-3 w-3 text-red-500 hover:text-red-600"
+						onclick={(e) => {
+							e?.stopPropagation();
+							modelsStore.unloadModel(option.model);
+						}}
+					/>
+				</div>
+			</div>
 		{:else if isLoaded}
 			<div class="flex w-4 items-center justify-center">
 				<span class="h-2 w-2 rounded-full bg-green-500 group-hover:hidden"></span>
--- a/tools/server/webui/src/lib/components/app/models/ModelsSelectorSheet.svelte
+++ b/tools/server/webui/src/lib/components/app/models/ModelsSelectorSheet.svelte
@@ -76,7 +76,7 @@
 	let filteredOptions = $derived(filterModelOptions(options, searchTerm));

 	let groupedFilteredOptions = $derived(
-		groupModelOptions(filteredOptions, modelsStore.favouriteModelIds, (m) =>
+		groupModelOptions(filteredOptions, modelsStore.favoriteModelIds, (m) =>
 			modelsStore.isModelLoaded(m)
 		)
 	);
--- a/tools/server/webui/src/lib/components/app/models/index.ts
+++ b/tools/server/webui/src/lib/components/app/models/index.ts
@@ -47,7 +47,7 @@ export { default as ModelsSelector } from './ModelsSelector.svelte';
 /**
 * **ModelsSelectorList** - Grouped model options list
 *
- * Renders grouped model options (loaded, favourites, available) with section
+ * Renders grouped model options (loaded, favorites, available) with section
 * headers and org subgroups. Shared between ModelsSelector and ModelsSelectorSheet
 * to avoid template duplication.
 *
@@ -59,7 +59,7 @@ export { default as ModelsSelectorList } from './ModelsSelectorList.svelte';
 /**
 * **ModelsSelectorOption** - Single model option row
 *
- * Renders a single model option with selection state, favourite toggle,
+ * Renders a single model option with selection state, favorite toggle,
 * load/unload actions, status indicators, and an info button.
 * Used inside ModelsSelectorList or directly in custom render snippets.
 */
--- a/tools/server/webui/src/lib/components/app/models/utils.ts
+++ b/tools/server/webui/src/lib/components/app/models/utils.ts
@@ -13,7 +13,7 @@ export interface OrgGroup {

 export interface GroupedModelOptions {
 	loaded: ModelItem[];
-	favourites: ModelItem[];
+	favorites: ModelItem[];
 	available: OrgGroup[];
 }

@@ -32,7 +32,7 @@ export function filterModelOptions(options: ModelOption[], searchTerm: string):

 export function groupModelOptions(
 	filteredOptions: ModelOption[],
-	favouriteIds: Set<string>,
+	favoriteIds: Set<string>,
 	isModelLoaded: (model: string) => boolean
 ): GroupedModelOptions {
 	// Loaded models
@@ -43,24 +43,24 @@ export function groupModelOptions(
 		}
 	}

-	// Favourites (excluding loaded)
+	// Favorites (excluding loaded)
 	const loadedModelIds = new Set(loaded.map((item) => item.option.model));
-	const favourites: ModelItem[] = [];
+	const favorites: ModelItem[] = [];
 	for (let i = 0; i < filteredOptions.length; i++) {
 		if (
-			favouriteIds.has(filteredOptions[i].model) &&
+			favoriteIds.has(filteredOptions[i].model) &&
 			!loadedModelIds.has(filteredOptions[i].model)
 		) {
-			favourites.push({ option: filteredOptions[i], flatIndex: i });
+			favorites.push({ option: filteredOptions[i], flatIndex: i });
 		}
 	}

-	// Available models grouped by org (excluding loaded and favourites)
+	// Available models grouped by org (excluding loaded and favorites)
 	const available: OrgGroup[] = [];
 	const orgGroups = new SvelteMap<string, ModelItem[]>();
 	for (let i = 0; i < filteredOptions.length; i++) {
 		const option = filteredOptions[i];
-		if (loadedModelIds.has(option.model) || favouriteIds.has(option.model)) continue;
+		if (loadedModelIds.has(option.model) || favoriteIds.has(option.model)) continue;

 		const key = option.parsedId?.orgName ?? '';
 		if (!orgGroups.has(key)) orgGroups.set(key, []);
@@ -71,5 +71,5 @@ export function groupModelOptions(
 		available.push({ orgName: orgName || null, items });
 	}

-	return { loaded, favourites, available };
+	return { loaded, favorites, available };
 }
--- a/tools/server/webui/src/lib/constants/localstorage-keys.ts
+++ b/tools/server/webui/src/lib/constants/localstorage-keys.ts
@@ -1,4 +1,4 @@
 export const CONFIG_LOCALSTORAGE_KEY = 'LlamaCppWebui.config';
 export const USER_OVERRIDES_LOCALSTORAGE_KEY = 'LlamaCppWebui.userOverrides';
-export const FAVOURITE_MODELS_LOCALSTORAGE_KEY = 'LlamaCppWebui.favouriteModels';
+export const FAVORITE_MODELS_LOCALSTORAGE_KEY = 'LlamaCppWebui.favoriteModels';
 export const MCP_DEFAULT_ENABLED_LOCALSTORAGE_KEY = 'LlamaCppWebui.mcpDefaultEnabled';
--- a/tools/server/webui/src/lib/enums/server.ts
+++ b/tools/server/webui/src/lib/enums/server.ts
@@ -16,5 +16,6 @@ export enum ServerModelStatus {
 	UNLOADED = 'unloaded',
 	LOADING = 'loading',
 	LOADED = 'loaded',
+	SLEEPING = 'sleeping',
 	FAILED = 'failed'
 }
--- a/tools/server/webui/src/lib/stores/chat.svelte.ts
+++ b/tools/server/webui/src/lib/stores/chat.svelte.ts
@@ -1207,7 +1207,6 @@ class ChatStore {
 				await conversationsStore.updateCurrentNode(newMessage.id);
 			} else {
 				await DatabaseService.updateMessage(msg.id, { content: newContent });
-				await conversationsStore.updateCurrentNode(msg.id);
 				conversationsStore.updateMessageAtIndex(idx, { content: newContent });
 			}

--- a/tools/server/webui/src/lib/stores/models.svelte.ts
+++ b/tools/server/webui/src/lib/stores/models.svelte.ts
@@ -7,7 +7,7 @@ import { TTLCache } from '$lib/utils';
 import {
 	MODEL_PROPS_CACHE_TTL_MS,
 	MODEL_PROPS_CACHE_MAX_ENTRIES,
-	FAVOURITE_MODELS_LOCALSTORAGE_KEY
+	FAVORITE_MODELS_LOCALSTORAGE_KEY
 } from '$lib/constants';

 /**
@@ -57,7 +57,7 @@ class ModelsStore {
 	private modelUsage = $state<Map<string, SvelteSet<string>>>(new Map());
 	private modelLoadingStates = new SvelteMap<string, boolean>();

-	favouriteModelIds = $state<Set<string>>(this.loadFavouritesFromStorage());
+	favoriteModelIds = $state<Set<string>>(this.loadFavoritesFromStorage());

 	/**
 	 * Model-specific props cache with TTL
@@ -90,7 +90,11 @@ class ModelsStore {

 	get loadedModelIds(): string[] {
 		return this.routerModels
-			.filter((m) => m.status.value === ServerModelStatus.LOADED)
+			.filter(
+				(m) =>
+					m.status.value === ServerModelStatus.LOADED ||
+					m.status.value === ServerModelStatus.SLEEPING
+			)
 			.map((m) => m.id);
 	}

@@ -215,7 +219,11 @@ class ModelsStore {

 	isModelLoaded(modelId: string): boolean {
 		const model = this.routerModels.find((m) => m.id === modelId);
-		return model?.status.value === ServerModelStatus.LOADED || false;
+		return (
+			model?.status.value === ServerModelStatus.LOADED ||
+			model?.status.value === ServerModelStatus.SLEEPING ||
+			false
+		);
 	}

 	isModelOperationInProgress(modelId: string): boolean {
@@ -621,17 +629,17 @@ class ModelsStore {
 	/**
 	 *
 	 *
-	 * Favourites
+	 * Favorites
 	 *
 	 *
 	 */

-	isFavourite(modelId: string): boolean {
-		return this.favouriteModelIds.has(modelId);
+	isFavorite(modelId: string): boolean {
+		return this.favoriteModelIds.has(modelId);
 	}

-	toggleFavourite(modelId: string): void {
-		const next = new SvelteSet(this.favouriteModelIds);
+	toggleFavorite(modelId: string): void {
+		const next = new SvelteSet(this.favoriteModelIds);

 		if (next.has(modelId)) {
 			next.delete(modelId);
@@ -639,22 +647,22 @@ class ModelsStore {
 			next.add(modelId);
 		}

-		this.favouriteModelIds = next;
+		this.favoriteModelIds = next;

 		try {
-			localStorage.setItem(FAVOURITE_MODELS_LOCALSTORAGE_KEY, JSON.stringify([...next]));
+			localStorage.setItem(FAVORITE_MODELS_LOCALSTORAGE_KEY, JSON.stringify([...next]));
 		} catch {
-			toast.error('Failed to save favourite models to local storage');
+			toast.error('Failed to save favorite models to local storage');
 		}
 	}

-	private loadFavouritesFromStorage(): Set<string> {
+	private loadFavoritesFromStorage(): Set<string> {
 		try {
-			const raw = localStorage.getItem(FAVOURITE_MODELS_LOCALSTORAGE_KEY);
+			const raw = localStorage.getItem(FAVORITE_MODELS_LOCALSTORAGE_KEY);

 			return raw ? new Set(JSON.parse(raw) as string[]) : new Set();
 		} catch {
-			toast.error('Failed to load favourite models from local storage');
+			toast.error('Failed to load favorite models from local storage');

 			return new Set();
 		}
@@ -713,4 +721,4 @@ export const loadingModelIds = () => modelsStore.loadingModelIds;
 export const propsCacheVersion = () => modelsStore.propsCacheVersion;
 export const singleModelName = () => modelsStore.singleModelName;
 export const selectedModelContextSize = () => modelsStore.selectedModelContextSize;
-export const favouriteModelIds = () => modelsStore.favouriteModelIds;
+export const favoriteModelIds = () => modelsStore.favoriteModelIds;
--- a/tools/server/webui/src/lib/types/api.d.ts
+++ b/tools/server/webui/src/lib/types/api.d.ts
@@ -54,7 +54,7 @@ export interface ApiChatMessageData {
 * Model status object from /models endpoint
 */
 export interface ApiModelStatus {
-	/** Status value: loaded, unloaded, loading, failed */
+	/** Status value: loaded, unloaded, loading, sleeping, failed */
 	value: ServerModelStatus;
 	/** Command line arguments used when loading (only for loaded models) */
 	args?: string[];
--- a/vendor/cpp-httplib/httplib.cpp
+++ b/vendor/cpp-httplib/httplib.cpp
@@ -142,6 +142,12 @@ SSEClient &SSEClient::set_max_reconnect_attempts(int n) {
  return *this;
 }

+SSEClient &SSEClient::set_headers(const Headers &headers) {
+  std::lock_guard<std::mutex> lock(headers_mutex_);
+  headers_ = headers;
+  return *this;
+}
+
 bool SSEClient::is_connected() const { return connected_.load(); }

 const std::string &SSEClient::last_event_id() const {
@@ -220,7 +226,11 @@ void SSEClient::run_event_loop() {

  while (running_.load()) {
    // Build headers, including Last-Event-ID if we have one
-    auto request_headers = headers_;
+    Headers request_headers;
+    {
+      std::lock_guard<std::mutex> lock(headers_mutex_);
+      request_headers = headers_;
+    }
    if (!last_event_id_.empty()) {
      request_headers.emplace("Last-Event-ID", last_event_id_);
    }
@@ -239,19 +249,19 @@ void SSEClient::run_event_loop() {
      continue;
    }

-    if (result.status() != 200) {
+    if (result.status() != StatusCode::OK_200) {
      connected_.store(false);
-      // For certain errors, don't reconnect
-      if (result.status() == 204 || // No Content - server wants us to stop
-          result.status() == 404 || // Not Found
-          result.status() == 401 || // Unauthorized
-          result.status() == 403) { // Forbidden
-        if (on_error_) { on_error_(Error::Connection); }
+      if (on_error_) { on_error_(Error::Connection); }
+
+      // For certain errors, don't reconnect.
+      // Note: 401 is intentionally absent so that handlers can refresh
+      // credentials via set_headers() and let the client reconnect.
+      if (result.status() == StatusCode::NoContent_204 ||
+          result.status() == StatusCode::NotFound_404 ||
+          result.status() == StatusCode::Forbidden_403) {
        break;
      }

-      if (on_error_) { on_error_(Error::Connection); }
-
      if (!should_reconnect(reconnect_count)) { break; }
      wait_for_reconnect();
      reconnect_count++;
@@ -9168,18 +9178,11 @@ void ClientImpl::setup_redirect_client(ClientType &client) {
  client.set_compress(compress_);
  client.set_decompress(decompress_);

-  // Copy authentication settings BEFORE proxy setup
-  if (!basic_auth_username_.empty()) {
-    client.set_basic_auth(basic_auth_username_, basic_auth_password_);
-  }
-  if (!bearer_token_auth_token_.empty()) {
-    client.set_bearer_token_auth(bearer_token_auth_token_);
-  }
-#ifdef CPPHTTPLIB_SSL_ENABLED
-  if (!digest_auth_username_.empty()) {
-    client.set_digest_auth(digest_auth_username_, digest_auth_password_);
-  }
-#endif
+  // NOTE: Authentication credentials (basic auth, bearer token, digest auth)
+  // are intentionally NOT copied to the redirect client. Per RFC 9110 Section
+  // 15.4, credentials must not be forwarded when redirecting to a different
+  // host. This function is only called for cross-host redirects; same-host
+  // redirects are handled directly in ClientImpl::redirect().

  // Setup proxy configuration (CRITICAL ORDER - proxy must be set
  // before proxy auth)
@@ -11425,7 +11428,8 @@ void Client::set_follow_location(bool on) {

 void Client::set_path_encode(bool on) { cli_->set_path_encode(on); }

-[[deprecated("Use set_path_encode instead")]]
+[[deprecated("Use set_path_encode() instead. "
+             "This function will be removed by v1.0.0.")]]
 void Client::set_url_encode(bool on) {
  cli_->set_path_encode(on);
 }
@@ -16330,9 +16334,10 @@ bool WebSocketClient::connect() {

  Error error;
  sock_ = detail::create_client_socket(
-      host_, std::string(), port_, AF_UNSPEC, false, false, nullptr, 5, 0,
+      host_, std::string(), port_, address_family_, tcp_nodelay_, ipv6_v6only_,
+      socket_options_, connection_timeout_sec_, connection_timeout_usec_,
      read_timeout_sec_, read_timeout_usec_, write_timeout_sec_,
-      write_timeout_usec_, std::string(), error);
+      write_timeout_usec_, interface_, error);

  if (sock_ == INVALID_SOCKET) { return false; }

@@ -16398,6 +16403,27 @@ void WebSocketClient::set_websocket_ping_interval(time_t sec) {
  websocket_ping_interval_sec_ = sec;
 }

+void WebSocketClient::set_tcp_nodelay(bool on) { tcp_nodelay_ = on; }
+
+void WebSocketClient::set_address_family(int family) {
+  address_family_ = family;
+}
+
+void WebSocketClient::set_ipv6_v6only(bool on) { ipv6_v6only_ = on; }
+
+void WebSocketClient::set_socket_options(SocketOptions socket_options) {
+  socket_options_ = std::move(socket_options);
+}
+
+void WebSocketClient::set_connection_timeout(time_t sec, time_t usec) {
+  connection_timeout_sec_ = sec;
+  connection_timeout_usec_ = usec;
+}
+
+void WebSocketClient::set_interface(const std::string &intf) {
+  interface_ = intf;
+}
+
 #ifdef CPPHTTPLIB_SSL_ENABLED

 void WebSocketClient::set_ca_cert_path(const std::string &path) {
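
Review note: the `SSEClient::set_headers()` and `run_event_loop()` hunks above pair a locked write with a locked copy, so each reconnect attempt works from a consistent snapshot of the headers. Below is a minimal standalone sketch of that pattern. `HeaderStore`, `snapshot()` and `build_request_headers()` are hypothetical names for illustration, and the `Headers` alias is a plain `std::multimap` stand-in rather than the library's own type.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for the library's Headers type (assumption).
using Headers = std::multimap<std::string, std::string>;

class HeaderStore {
public:
  // Writer side: replace the whole header set while holding the lock.
  void set_headers(const Headers &headers) {
    std::lock_guard<std::mutex> lock(mutex_);
    headers_ = headers;
  }

  // Reader side: copy under the lock; the caller then uses the copy
  // without holding the lock.
  Headers snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return headers_;
  }

private:
  Headers headers_;
  mutable std::mutex mutex_; // mutable so snapshot() can stay const
};

// Build the headers for one request attempt, adding Last-Event-ID if known.
Headers build_request_headers(const HeaderStore &store,
                              const std::string &last_event_id) {
  Headers request_headers = store.snapshot();
  if (!last_event_id.empty()) {
    request_headers.emplace("Last-Event-ID", last_event_id);
  }
  return request_headers;
}
```

Copying under the lock keeps the critical section short. A handler that refreshes credentials through `set_headers()` after a 401 only affects the next attempt, not a request that is already being built.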
--- a/vendor/cpp-httplib/httplib.h
+++ b/vendor/cpp-httplib/httplib.h
@@ -8,8 +8,8 @@
 #ifndef CPPHTTPLIB_HTTPLIB_H
 #define CPPHTTPLIB_HTTPLIB_H

-#define CPPHTTPLIB_VERSION "0.38.0"
-#define CPPHTTPLIB_VERSION_NUM "0x002600"
+#define CPPHTTPLIB_VERSION "0.39.0"
+#define CPPHTTPLIB_VERSION_NUM "0x002700"

 #ifdef _WIN32
 #if defined(_WIN32_WINNT) && _WIN32_WINNT < 0x0A00
@@ -1001,8 +1001,8 @@ private:

  protected:
    std::streamsize xsputn(const char *s, std::streamsize n) override {
-      sink_.write(s, static_cast<size_t>(n));
-      return n;
+      if (sink_.write(s, static_cast<size_t>(n))) { return n; }
+      return 0;
    }

  private:
@@ -1058,9 +1058,12 @@ make_file_provider(const std::string &name, const std::string &filepath,

 inline std::pair<size_t, ContentProvider>
 make_file_body(const std::string &filepath) {
-  std::ifstream f(filepath, std::ios::binary | std::ios::ate);
-  if (!f) { return {0, ContentProvider{}}; }
-  auto size = static_cast<size_t>(f.tellg());
+  size_t size = 0;
+  {
+    std::ifstream f(filepath, std::ios::binary | std::ios::ate);
+    if (!f) { return {0, ContentProvider{}}; }
+    size = static_cast<size_t>(f.tellg());
+  }

  ContentProvider provider = [filepath](size_t offset, size_t length,
                                        DataSink &sink) -> bool {
@@ -1882,7 +1885,8 @@ private:

 #ifdef CPPHTTPLIB_OPENSSL_SUPPORT
 public:
-  [[deprecated("Use ssl_backend_error() instead")]]
+  [[deprecated("Use ssl_backend_error() instead. "
+               "This function will be removed by v1.0.0.")]]
  uint64_t ssl_openssl_error() const {
    return ssl_backend_error_;
  }
@@ -2362,13 +2366,16 @@ protected:

 #ifdef CPPHTTPLIB_OPENSSL_SUPPORT
 public:
-  [[deprecated("Use load_ca_cert_store() instead")]]
+  [[deprecated("Use load_ca_cert_store() instead. "
+               "This function will be removed by v1.0.0.")]]
  void set_ca_cert_store(X509_STORE *ca_cert_store);

-  [[deprecated("Use tls::create_ca_store() instead")]]
+  [[deprecated("Use tls::create_ca_store() instead. "
+               "This function will be removed by v1.0.0.")]]
  X509_STORE *create_ca_cert_store(const char *ca_cert, std::size_t size) const;

-  [[deprecated("Use set_server_certificate_verifier(VerifyCallback) instead")]]
+  [[deprecated("Use set_server_certificate_verifier(VerifyCallback) instead. "
+               "This function will be removed by v1.0.0.")]]
  virtual void set_server_certificate_verifier(
      std::function<SSLVerifierResponse(SSL *ssl)> verifier);
 #endif
@@ -2597,14 +2604,17 @@ private:

 #ifdef CPPHTTPLIB_OPENSSL_SUPPORT
 public:
-  [[deprecated("Use tls_context() instead")]]
+  [[deprecated("Use tls_context() instead. "
+               "This function will be removed by v1.0.0.")]]
  SSL_CTX *ssl_context() const;

-  [[deprecated("Use set_session_verifier(session_t) instead")]]
+  [[deprecated("Use set_session_verifier(session_t) instead. "
+               "This function will be removed by v1.0.0.")]]
  void set_server_certificate_verifier(
      std::function<SSLVerifierResponse(SSL *ssl)> verifier);

-  [[deprecated("Use Result::ssl_backend_error() instead")]]
+  [[deprecated("Use Result::ssl_backend_error() instead. "
+               "This function will be removed by v1.0.0.")]]
  long get_verify_result() const;
 #endif
 };
@@ -2656,18 +2666,22 @@ private:
 #ifdef CPPHTTPLIB_OPENSSL_SUPPORT
 public:
  [[deprecated("Use SSLServer(PemMemory) or "
-               "SSLServer(ContextSetupCallback) instead")]]
+               "SSLServer(ContextSetupCallback) instead. "
+               "This constructor will be removed by v1.0.0.")]]
  SSLServer(X509 *cert, EVP_PKEY *private_key,
            X509_STORE *client_ca_cert_store = nullptr);

-  [[deprecated("Use SSLServer(ContextSetupCallback) instead")]]
+  [[deprecated("Use SSLServer(ContextSetupCallback) instead. "
+               "This constructor will be removed by v1.0.0.")]]
  SSLServer(
      const std::function<bool(SSL_CTX &ssl_ctx)> &setup_ssl_ctx_callback);

-  [[deprecated("Use tls_context() instead")]]
+  [[deprecated("Use tls_context() instead. "
+               "This function will be removed by v1.0.0.")]]
  SSL_CTX *ssl_context() const;

-  [[deprecated("Use update_certs_pem() instead")]]
+  [[deprecated("Use update_certs_pem() instead. "
+               "This function will be removed by v1.0.0.")]]
  void update_certs(X509 *cert, EVP_PKEY *private_key,
                    X509_STORE *client_ca_cert_store = nullptr);
 #endif
@@ -2752,18 +2766,22 @@ private:

 #ifdef CPPHTTPLIB_OPENSSL_SUPPORT
 public:
-  [[deprecated("Use SSLClient(host, port, PemMemory) instead")]]
+  [[deprecated("Use SSLClient(host, port, PemMemory) instead. "
+               "This constructor will be removed by v1.0.0.")]]
  explicit SSLClient(const std::string &host, int port, X509 *client_cert,
                     EVP_PKEY *client_key,
                     const std::string &private_key_password = std::string());

-  [[deprecated("Use Result::ssl_backend_error() instead")]]
+  [[deprecated("Use Result::ssl_backend_error() instead. "
+               "This function will be removed by v1.0.0.")]]
  long get_verify_result() const;

-  [[deprecated("Use tls_context() instead")]]
+  [[deprecated("Use tls_context() instead. "
+               "This function will be removed by v1.0.0.")]]
  SSL_CTX *ssl_context() const;

-  [[deprecated("Use set_session_verifier(session_t) instead")]]
+  [[deprecated("Use set_session_verifier(session_t) instead. "
+               "This function will be removed by v1.0.0.")]]
  void set_server_certificate_verifier(
      std::function<SSLVerifierResponse(SSL *ssl)> verifier) override;

@@ -3641,6 +3659,9 @@ public:
  SSEClient &set_reconnect_interval(int ms);
  SSEClient &set_max_reconnect_attempts(int n);

+  // Update headers (thread-safe)
+  SSEClient &set_headers(const Headers &headers);
+
  // State accessors
  bool is_connected() const;
  const std::string &last_event_id() const;
@@ -3665,6 +3686,7 @@ private:
  Client &client_;
  std::string path_;
  Headers headers_;
+  mutable std::mutex headers_mutex_;

  // Callbacks
  MessageHandler on_message_;
@@ -3785,6 +3807,12 @@ public:
  void set_read_timeout(time_t sec, time_t usec = 0);
  void set_write_timeout(time_t sec, time_t usec = 0);
  void set_websocket_ping_interval(time_t sec);
+  void set_tcp_nodelay(bool on);
+  void set_address_family(int family);
+  void set_ipv6_v6only(bool on);
+  void set_socket_options(SocketOptions socket_options);
+  void set_connection_timeout(time_t sec, time_t usec = 0);
+  void set_interface(const std::string &intf);

 #ifdef CPPHTTPLIB_SSL_ENABLED
  void set_ca_cert_path(const std::string &path);
@@ -3810,6 +3838,13 @@ private:
  time_t write_timeout_usec_ = CPPHTTPLIB_CLIENT_WRITE_TIMEOUT_USECOND;
  time_t websocket_ping_interval_sec_ =
      CPPHTTPLIB_WEBSOCKET_PING_INTERVAL_SECOND;
+  int address_family_ = AF_UNSPEC;
+  bool tcp_nodelay_ = CPPHTTPLIB_TCP_NODELAY;
+  bool ipv6_v6only_ = CPPHTTPLIB_IPV6_V6ONLY;
+  SocketOptions socket_options_ = nullptr;
+  time_t connection_timeout_sec_ = CPPHTTPLIB_CONNECTION_TIMEOUT_SECOND;
+  time_t connection_timeout_usec_ = CPPHTTPLIB_CONNECTION_TIMEOUT_USECOND;
+  std::string interface_;

 #ifdef CPPHTTPLIB_SSL_ENABLED
  bool is_ssl_ = false;
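
Review note: the `xsputn` hunk in `httplib.h` above changes the stream buffer to report a failed sink write instead of always returning `n`. Below is a minimal standalone sketch of why that matters, assuming a hypothetical `SinkStreambuf` that forwards to a callback. `WriteFn` and `SinkStreambuf` are illustrative names, not part of the library.

```cpp
#include <cstddef>
#include <functional>
#include <ostream>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical sink callback: returns false once the consumer stops
// accepting data (assumption, for illustration only).
using WriteFn = std::function<bool(const char *, std::size_t)>;

class SinkStreambuf : public std::streambuf {
public:
  explicit SinkStreambuf(WriteFn write) : write_(std::move(write)) {}

protected:
  // Report how many characters were actually consumed. Returning fewer
  // than n lets std::ostream detect the short write and set badbit.
  std::streamsize xsputn(const char *s, std::streamsize n) override {
    if (write_(s, static_cast<std::size_t>(n))) { return n; }
    return 0;
  }

private:
  WriteFn write_;
};
```

With this shape, a caller can check `os.fail()` after `os.write(...)` to notice that the sink rejected the data. The previous behavior (always returning `n`) would leave the stream in a good state even though nothing was written.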
Author	SHA1	Message	Date
Aman Gupta	9c600bcd4b	llama-bench: print `-n-cpu-moe` when offloaded layers > 1 (#20984)	2026-03-25 21:17:27 +08:00
Masato Nakasaka	b2704f9028	ci: Allow ninja to be used during unit test (#20742 ) * Remove make dependency * Added option to specify Ninja generator * use ninja-build as default for several CI * Revert "use ninja-build as default for several CI" This reverts commit `f552c4559b`. * changed use plain string rather than arrays * Enabled ninja build by default for experimentation * ci: add run.sh to test conditions to trigger GitHub CI and self-hosted runners Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Enabled ninja build by default on self-hosted envs for experimentation * ci: revert generator to ninja instead of ninja multi-config Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ci: install ninja-build for self-hosted workflows Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ci: revert ninja from self-hosted runners Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ci: missed one self-hosted step Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ci: fix windows ci errors from an errenous revert Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Added explicit build types for Ninja Also reverted some needless change * ci: use ninja multi-config for vulkan-x64 build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * added time command to measure build time * Keeping some configs to use Ninja which show improvement * minor fix based on review Co-authored-by: Aaron Teo <taronaeo@gmail.com> * ci: rm `time` from custom containers Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2026-03-25 21:00:49 +08:00
Georgi Gerganov	3fab96cd04	ci : disable self-hosted mac jobs (#20985)	2026-03-25 14:46:40 +02:00
Xuan-Son Nguyen	914eb5ff0c	jinja: fix macro with kwargs (#20960 ) * jinja: fix macro with kwargs * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix newline problem --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-25 12:22:48 +01:00
Francisco Herrera	8fc17493c3	gguf-split : clarify operation of gguf-split (#19749 ) * clarify operation of gguf-split so that you don't have to find out by trial and error * formatting	2026-03-25 13:12:50 +02:00
Johannes Gäßler	36dafba5c4	llama: fix llama-model-saver (#20503 ) * llama : add fd-based model loading via llama_model_load_from_fd * llama : address review feedback for fd-based model loading * llama : use FILE pointer instead of fd in public API * llama : use FILE pointer consistently, address review feedback * fixup * fix tensor names * fix llama-model-saver * roundtrip tests * fixup * refactor tests * fix prints * fix model saving * fix CI, disable Chameleon * print seed --------- Co-authored-by: Siddhesh2377 <siddheshsonar2377@gmail.com>	2026-03-25 12:53:16 +02:00
Aleksander Grygier	69e0ecef06	webui: Fix editing assistant message without branching (#20944 ) * fix: Editing assistant response without branching * chore: update webui build output	2026-03-25 12:47:33 +02:00
Pascal	062cca58fc	Add SLEEPING status to the WebUI model selector (#20949 ) * webui: handle sleeping model status, fix favourite -> favorite * Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: fix optional event parameter in sleeping model onclick * typo * webui: restore orange sleeping indicator dot with hover unload * chore: update webui build output * webui: move stopPropagation into ActionIcon onclick, remove svelte-ignore * chore: update webui build output * webui: fix favourite -> favorite (UK -> US spelling) everywhere Address review feedback from WhyNotHugo * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-25 11:02:32 +01:00
yikechayedan	406f4e3f61	android : fix-pointer-dangling (#20974)	2026-03-25 11:51:26 +02:00
Neo Zhang	53dc8b59bf	sycl : fix wrong variable check by assert (#20903) * fix wrong variable check by assert * use GGML api	2026-03-25 11:48:37 +02:00
Sigbjørn Skjæret	403c9c9cef	ci : bump gguf publish python version (#20982)	2026-03-25 11:04:59 +02:00
Sigbjørn Skjæret	8fc85db9d2	ci : limit requirements versions (#20980) * set requests version * limit versions outside requirements	2026-03-25 10:55:37 +02:00
Dowon	3a60d06ad9	convert : register Qwen3Model architecture (#20967)	2026-03-25 10:37:59 +02:00
Ravi Panchumarthy	abd86ef175	docs : Update OpenVINO backend docs (#20968 ) * OpenVINO doc updates * Update docs/backend/OPENVINO.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2026-03-25 10:33:51 +02:00
Georgi Gerganov	9f102a1407	models : move the token embedding norms to the first layer (#20943) * models : move the token embedding norms to the first layer * cont : fix LLM_TENSOR_CONV1D + fix il indexing	2026-03-24 17:00:30 +02:00
Aman Gupta	3fc6f1aed1	ggml-backend: re-enable graph reuse with pipeline parallelism (#20927)	2026-03-24 20:47:00 +08:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	29771a0a4c	vendor : update cpp-httplib to 0.39.0 (#20933)	2026-03-24 13:33:33 +01:00
Adrien Gallouët	42ebce3beb	common : fix get_gguf_split_info (#20946) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 13:33:14 +01:00
BlueMöhre	a94fdb090a	WebUI: fix edit msg form textarea height (#20830 ) * autoresize textarea on mount * allow textarea to grow to same height as rendered messages * add UI build file	2026-03-24 13:17:45 +01:00
Adrien Gallouët	c9dc43333f	readme : clarify MODEL_ENDPOINT usage (#20941) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 10:35:07 +01:00
Adrien Gallouët	2d2d9c2062	common : add a WARNING for HF cache migration (#20935) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 09:24:39 +01:00
nuri	92080b4396	metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930) Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>	2026-03-24 10:13:07 +02:00
Georgi Gerganov	342d6125bc	metal : add FA instantiations for HSK=512, HSV=512 (#20902)	2026-03-24 10:03:09 +02:00
Aaron Teo	c2e224d829	issues: add openvino backends (#20932) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 14:41:10 +08:00
Adrien Gallouët	8c7957ca33	common : add standard Hugging Face cache support (#20775 ) * common : add standard Hugging Face cache support - Use HF API to find all files - Migrate all manifests to hugging face cache at startup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check with the quant tag Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Improve error handling and report API errors Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Restore common_cached_model_info and align mmproj filtering Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Prefer main when getting cached ref Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use cached files when HF API fails Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use final_path.. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check all inputs Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 07:30:33 +01:00
Aman Gupta	e852eb4901	llama-fit: fix regex pattern for gate_up tensors (#20910 ) * llama-fit: fix regex pattern for gate_up tensors * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-24 12:57:57 +08:00