llama : fix Metal KV cache sync (close #1695 )

readme : update hot topics
2026-04-23 16:37:33 +03:00 · 2023-06-05 10:19:03 +03:00 · 2023-06-04 23:38:19 +03:00
2 changed files with 13 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -9,9 +9,11 @@ Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

 **Hot topics:**

-- Quantization formats `Q4` and `Q8` have changed again (19 May) - [(info)](https://github.com/ggerganov/llama.cpp/pull/1508)
-- Quantization formats `Q4` and `Q5` have changed - requantize any old models [(info)](https://github.com/ggerganov/llama.cpp/pull/1405)
-- [Roadmap May 2023](https://github.com/ggerganov/llama.cpp/discussions/1220)
+- GPU support with Metal (Apple Silicon): https://github.com/ggerganov/llama.cpp/pull/1642
+- High-quality 2,3,4,5,6-bit quantization: https://github.com/ggerganov/llama.cpp/pull/1684
+- Multi-GPU support: https://github.com/ggerganov/llama.cpp/pull/1607
+- Training LLaMA models from scratch: https://github.com/ggerganov/llama.cpp/pull/1652
+- CPU threading improvements: https://github.com/ggerganov/llama.cpp/pull/1632

 <details>
  <summary>Table of Contents</summary>
--- a/llama.cpp
+++ b/llama.cpp
@@ -1455,6 +1455,14 @@ static bool llama_eval_internal(
        // When we implement Matrix x Matrix Metal multiplication, we can avoid this branch.
        // But for now, we have focused only on Matrix x Vector Metal multiplication.
        //
+        // TODO: avoid these syncs via shared memory (ref #1696)
+        //
+        if (lctx.ctx_metal) {
+            // We need to sync the GPU KV cache with the CPU KV cache
+            ggml_metal_get_tensor(lctx.ctx_metal, kv_self.k);
+            ggml_metal_get_tensor(lctx.ctx_metal, kv_self.v);
+        }
+
        ggml_graph_compute(ctx0, &gf);

        if (lctx.ctx_metal) {
Author	SHA1	Message	Date
Georgi Gerganov	d1f563a743	llama : fix Metal KV cache sync (close #1695 )	2023-06-05 10:19:03 +03:00
Georgi Gerganov	827f5eda91	readme : update hot topics	2023-06-04 23:38:19 +03:00
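
Note on the `llama.cpp` hunk: the added block calls `ggml_metal_get_tensor` on `kv_self.k` and `kv_self.v` before `ggml_graph_compute`, which (going by the in-code comment) brings the CPU copy of the KV cache up to date with the GPU copy before the CPU path runs. The sketch below only illustrates that general pattern. It is not the ggml implementation, and every name in it (`MirroredTensor`, `sync_to_host`, `gpu_store_kv`, `cpu_sum`) is hypothetical. It assumes a tensor with separate host and device copies, where GPU steps write only the device copy and CPU steps read only the host copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor that keeps separate host and device copies.
struct MirroredTensor {
    std::vector<float> host;    // memory read by CPU kernels
    std::vector<float> device;  // memory written by GPU kernels
};

// Hypothetical analogue of a "get tensor" call: copy device data back to the host.
void sync_to_host(MirroredTensor & t) {
    assert(t.host.size() == t.device.size());
    t.host = t.device;
}

// Pretend GPU step: stores one cache entry at position pos, device side only.
void gpu_store_kv(MirroredTensor & t, std::size_t pos, float value) {
    t.device.at(pos) = value;
}

// Pretend CPU step: sums the first n cached entries, reading host memory only.
float cpu_sum(const MirroredTensor & t, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += t.host.at(i);
    }
    return sum;
}
```

Without the `sync_to_host` call, a CPU step that follows GPU steps would read stale cache entries. The TODO in the hunk (ref #1696) points at shared memory as a way to avoid the copy altogether.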