Implement server mode.

This new mode works by first loading the model then listening for TCP connections on a port. When a connection is received, arguments will be parsed using a simple protocol: - First the number of arguments will be read followed by a newline character. - Then each argument will be read, separated by the 0 byte. - With this we build an argument vector, similar to what is passed to the program entry point. We pass this to gpt_params_parse. Finally `run` will be executed with the input/output streams connected to the socket. Signed-off-by: Thiago Padilha <thiago@padilha.cc>
Remove direct access to std streams from "run"
2026-02-05 13:53:23 +02:00 · 2023-03-22 14:34:19 -03:00 · 2023-03-22 14:34:18 -03:00 · 2023-03-22 14:31:41 -03:00 · 2023-03-22 14:31:41 -03:00 · 2023-03-22 14:31:35 -03:00
42 changed files with 4864 additions and 2021 deletions
--- a/.devops/full.Dockerfile
+++ b/.devops/full.Dockerfile
@@ -0,0 +1,17 @@
+ARG UBUNTU_VERSION=22.04
+
+FROM ubuntu:$UBUNTU_VERSION as build
+
+RUN apt-get update && \
+    apt-get install -y build-essential python3 python3-pip
+
+RUN pip install --upgrade pip setuptools wheel \
+    && pip install numpy requests sentencepiece torch tqdm
+
+WORKDIR /app
+
+COPY . .
+
+RUN make
+
+ENTRYPOINT ["/app/.devops/tools.sh"]
--- a/.devops/main.Dockerfile
+++ b/.devops/main.Dockerfile
@@ -0,0 +1,18 @@
+ARG UBUNTU_VERSION=22.04
+
+FROM ubuntu:$UBUNTU_VERSION as build
+
+RUN apt-get update && \
+    apt-get install -y build-essential
+
+WORKDIR /app
+
+COPY . .
+
+RUN make
+
+FROM ubuntu:$UBUNTU_VERSION as runtime
+
+COPY --from=build /app/main /main
+
+ENTRYPOINT [ "/main" ]
--- a/.devops/tools.sh
+++ b/.devops/tools.sh
@@ -0,0 +1,46 @@
+#!/bin/bash
+set -e
+
+# Read the first argument into a variable
+arg1="$1"
+
+# Shift the arguments to remove the first one
+shift
+
+# Join the remaining arguments into a single string
+arg2="$@"
+
+if [[ $arg1 == '--convert' || $arg1 == '-c' ]]; then
+    python3 ./convert-pth-to-ggml.py $arg2
+elif [[ $arg1 == '--quantize' || $arg1 == '-q' ]]; then
+    ./quantize $arg2
+elif [[ $arg1 == '--run' || $arg1 == '-r' ]]; then
+    ./main $arg2
+elif [[ $arg1 == '--download' || $arg1 == '-d' ]]; then
+    python3 ./download-pth.py $arg2
+elif [[ $arg1 == '--all-in-one' || $arg1 == '-a' ]]; then
+    echo "Downloading model..."
+    python3 ./download-pth.py "$1" "$2"
+    echo "Converting PTH to GGML..."
+    for i in `ls $1/$2/ggml-model-f16.bin*`; do
+        if [ -f "${i/f16/q4_0}" ]; then
+            echo "Skip model quantization, it already exists: ${i/f16/q4_0}"
+        else
+            echo "Converting PTH to GGML: $i into ${i/f16/q4_0}..."
+            ./quantize "$i" "${i/f16/q4_0}" 2
+        fi
+    done
+else
+    echo "Unknown command: $arg1"
+    echo "Available commands: "
+    echo "  --run (-r): Run a model previously converted into ggml"
+    echo "              ex: -m /models/7B/ggml-model-q4_0.bin -p \"Building a website can be done in 10 simple steps:\" -n 512"
+    echo "  --convert (-c): Convert a llama model into ggml"
+    echo "              ex: \"/models/7B/\" 1"
+    echo "  --quantize (-q): Optimize with quantization process ggml"
+    echo "              ex: \"/models/7B/ggml-model-f16.bin\" \"/models/7B/ggml-model-q4_0.bin\" 2"
+    echo "  --download (-d): Download original llama model from CDN: https://agi.gpt4.org/llama/"
+    echo "              ex: \"/models/\" 7B"
+    echo "  --all-in-one (-a): Execute --download, --convert & --quantize"
+    echo "              ex: \"/models/\" 7B"
+fi
--- a/.dockerignore
+++ b/.dockerignore
@@ -0,0 +1,24 @@
+*.o
+*.a
+.cache/
+.vs/
+.vscode/
+.DS_Store
+
+build/
+build-em/
+build-debug/
+build-release/
+build-static/
+build-no-accel/
+build-sanitize-addr/
+build-sanitize-thread/
+
+models/*
+
+/main
+/quantize
+
+arm_neon.h
+compile_commands.json
+Dockerfile
--- a/.github/ISSUE_TEMPLATE/custom.md
+++ b/.github/ISSUE_TEMPLATE/custom.md
@@ -0,0 +1,198 @@
+---
+name: Custom issue template
+about: Used to report user-related issues with the software
+title: "[User] I encountered a problem .."
+labels: ''
+assignees: ''
+
+---
+
+# Prerequisites
+
+Please answer the following questions for yourself before submitting an issue.
+
+- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
+- [ ] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
+- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
+- [ ] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.
+
+# Expected Behavior
+
+Please provide a detailed written description of what you were trying to do, and what you expected `lamma.cpp` to do.
+
+# Current Behavior
+
+Please provide a detailed written description of what `lamma.cpp` did, instead. 
+
+# Environment and Context 
+
+Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
+
+* Physical (or virtual) hardware you are using, e.g. for Linux:
+
+`$ lscpu`
+
+* Operating System, e.g. for Linux:
+
+`$ uname -a`
+
+* SDK version, e.g. for Linux:
+
+```
+$ python3 --version
+$ make --version
+$ g++ --version
+```
+
+# Models
+
+* The LLaMA models are officially distributed by Facebook and will never be provided through this repository. See this [pull request in Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to obtain access to the model data.
+* If your issue is with model conversion please verify the `sha256sum` of each of your `consolidated*.pth` and `ggml-model-XXX.bin` files to confirm that you have the correct model data files before logging an issue. [Latest sha256 sums for your reference](https://github.com/ggerganov/llama.cpp/issues/238).
+* If your issue is with model generation quality then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
+  * LLaMA:
+    * [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
+    * [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
+  * GPT-3
+    * [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
+  * GPT-3.5 / InstructGPT / ChatGPT:
+    * [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
+    * [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
+
+# Failure Information (for bugs)
+
+Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
+
+# Steps to Reproduce
+
+Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
+
+1. step 1
+2. step 2
+3. step 3
+4. etc.
+
+# Failure Logs
+
+Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
+
+Also, please try to **avoid using screenshots** if at all possible. Instead, copy/paste the console output and use [Github's markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to cleanly format your logs for easy readability. e.g.
+
+```
+llama.cpp$ git log | head -1
+commit 2af23d30434a677c6416812eea52ccc0af65119c
+
+llama.cpp$ lscpu | egrep "AMD|Flags"
+Vendor ID:                       AuthenticAMD
+Model name:                      AMD Ryzen Threadripper 1950X 16-Core Processor
+Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev
+Virtualization:                  AMD-V
+
+llama.cpp$ python3 --version
+Python 3.10.9
+
+llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
+numpy                         1.24.2
+numpydoc                      1.5.0
+sentencepiece                 0.1.97
+torch                         1.13.1
+torchvision                   0.14.1
+
+llama.cpp$ make --version | head -1
+GNU Make 4.3
+
+$ md5sum ./models/65B/ggml-model-q4_0.bin
+dbdd682cce80e2d6e93cefc7449df487  ./models/65B/ggml-model-q4_0.bin
+```
+Here's a run with the Linux command [perf](https://www.brendangregg.com/perf.html)
+
+```
+llama.cpp$ perf stat ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p "Please close your issue when it has been answered."
+main: seed = 1679149377
+llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
+llama_model_load: n_vocab = 32000
+llama_model_load: n_ctx   = 512
+llama_model_load: n_embd  = 8192
+llama_model_load: n_mult  = 256
+llama_model_load: n_head  = 64
+llama_model_load: n_layer = 80
+llama_model_load: n_rot   = 128
+llama_model_load: f16     = 2
+llama_model_load: n_ff    = 22016
+llama_model_load: n_parts = 8
+llama_model_load: ggml ctx size = 41477.73 MB
+llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
+llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
+llama_model_load: .......................................................................................... done
+llama_model_load: model size =  4869.09 MB / num tensors = 723
+llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
+llama_model_load: .......................................................................................... done
+llama_model_load: model size =  4869.09 MB / num tensors = 723
+llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
+llama_model_load: .......................................................................................... done
+llama_model_load: model size =  4869.09 MB / num tensors = 723
+llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
+llama_model_load: .......................................................................................... done
+llama_model_load: model size =  4869.09 MB / num tensors = 723
+llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
+llama_model_load: .......................................................................................... done
+llama_model_load: model size =  4869.09 MB / num tensors = 723
+llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
+llama_model_load: .......................................................................................... done
+llama_model_load: model size =  4869.09 MB / num tensors = 723
+llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
+llama_model_load: .......................................................................................... done
+llama_model_load: model size =  4869.09 MB / num tensors = 723
+llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
+llama_model_load: .......................................................................................... done
+llama_model_load: model size =  4869.09 MB / num tensors = 723
+
+system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
+
+main: prompt: 'Please close your issue when it has been answered.'
+main: number of tokens in prompt = 11
+     1 -> ''
+ 12148 -> 'Please'
+  3802 -> ' close'
+   596 -> ' your'
+  2228 -> ' issue'
+   746 -> ' when'
+   372 -> ' it'
+   756 -> ' has'
+  1063 -> ' been'
+  7699 -> ' answered'
+ 29889 -> '.'
+
+sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
+
+
+Please close your issue when it has been answered.
+@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
+I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
+@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]
+
+
+main: mem per token = 71159620 bytes
+main:     load time = 19309.95 ms
+main:   sample time =   168.62 ms
+main:  predict time = 223895.61 ms / 888.47 ms per token
+main:    total time = 246406.42 ms
+
+ Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':
+
+        3636882.89 msec task-clock                #   14.677 CPUs utilized          
+             13509      context-switches          #    3.714 /sec                   
+              2436      cpu-migrations            #    0.670 /sec                   
+          10476679      page-faults               #    2.881 K/sec                  
+    13133115082869      cycles                    #    3.611 GHz                      (16.77%)
+       29314462753      stalled-cycles-frontend   #    0.22% frontend cycles idle     (16.76%)
+    10294402631459      stalled-cycles-backend    #   78.39% backend cycles idle      (16.74%)
+    23479217109614      instructions              #    1.79  insn per cycle         
+                                                  #    0.44  stalled cycles per insn  (16.76%)
+     2353072268027      branches                  #  647.002 M/sec                    (16.77%)
+        1998682780      branch-misses             #    0.08% of all branches          (16.76%)
+
+     247.802177522 seconds time elapsed
+
+    3618.573072000 seconds user
+      18.491698000 seconds sys
+```
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -1,8 +1,42 @@
 name: CI
-on: [push, pull_request]
+
+on:
+  workflow_dispatch: # allows manual triggering
+    inputs:
+      create_release:
+        description: 'Create new release'
+        required: true
+        type: boolean
+  push:
+    paths: ['.github/workflows/**', 'CMakeLists.txt', 'Makefile', '**.h', '*.c', '**.cpp']
+  pull_request:
+    types: [opened, synchronize, edited, reopened, review_requested, ready_for_review]
+    paths: ['CMakeLists.txt', 'Makefile', '**.h', '*.c', '**.cpp']
+
+env:
+ BRANCH_NAME: ${{ github.head_ref || github.ref_name }}

 jobs:
-  ubuntu-latest:
+  ubuntu-latest-make:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Clone
+        id: checkout
+        uses: actions/checkout@v1
+
+      - name: Dependencies
+        id: depends
+        run: |
+          sudo apt-get update
+          sudo apt-get install build-essential
+
+      - name: Build
+        id: make_build
+        run: |
+          make
+
+  ubuntu-latest-cmake:
    runs-on: ubuntu-latest

    steps:
@@ -15,10 +49,32 @@ jobs:
          sudo apt-get install build-essential

      - name: Build
+        run: |
+          mkdir build
+          cd build
+          cmake ..
+          cmake --build . --config Release
+          ctest --output-on-failure
+
+  macOS-latest-make:
+    runs-on: macos-latest
+
+    steps:
+      - name: Clone
+        id: checkout
+        uses: actions/checkout@v1
+
+      - name: Dependencies
+        id: depends
+        run: |
+          brew update
+
+      - name: Build
+        id: make_build
        run: |
          make

-  macOS-latest:
+  macOS-latest-cmake:
    runs-on: macOS-latest

    steps:
@@ -31,7 +87,60 @@ jobs:

      - name: Build
        run: |
-          make
+          mkdir build
+          cd build
+          cmake -DLLAMA_AVX2=OFF ..
+          cmake --build . --config Release
+          ctest --output-on-failure
+
+  windows-latest-cmake:
+    runs-on: windows-latest
+
+    steps:
+      - name: Clone
+        id: checkout
+        uses: actions/checkout@v1
+
+      - name: Build
+        id: cmake_build
+        run: |
+          mkdir build
+          cd build
+          cmake ..
+          cmake --build . --config Release
+          ctest -C Release --output-on-failure
+
+      - name: Get commit hash
+        id: commit
+        if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
+        uses: pr-mpt/actions-commit-hash@v2
+
+      - name: Pack artifacts
+        id: pack_artifacts
+        if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
+        run: |
+          7z a llama-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-x64.zip .\build\bin\Release\*
+
+      - name: Create release
+        id: create_release
+        if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
+        uses: zendesk/action-create-release@v1
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        with:
+          tag_name: ${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}
+
+      - name: Upload release
+        id: upload_release
+        if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
+        uses: actions/upload-release-asset@v1
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        with:
+          upload_url: ${{ steps.create_release.outputs.upload_url }} 
+          asset_path: .\llama-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-x64.zip
+          asset_name: llama-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-x64.zip
+          asset_content_type: application/octet-stream

 #  ubuntu-latest-gcc:
 #    runs-on: ubuntu-latest
--- a/.github/workflows/docker.yml
+++ b/.github/workflows/docker.yml
@@ -0,0 +1,61 @@
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+
+# GitHub recommends pinning actions to a commit SHA.
+# To get a newer version, you will need to update the SHA.
+# You can also reference a tag or branch, but the action may change without warning.
+
+name: Publish Docker image
+
+on:
+  pull_request:
+  push:
+    branches:
+      - master
+
+jobs:
+  push_to_registry:
+    name: Push Docker image to Docker Hub
+    runs-on: ubuntu-latest
+    env:
+      COMMIT_SHA: ${{ github.sha }}
+    strategy:
+      matrix:
+        config:
+          - { tag: "light", dockerfile: ".devops/main.Dockerfile" }
+          - { tag: "full", dockerfile: ".devops/full.Dockerfile" }
+    steps:
+      - name: Check out the repo
+        uses: actions/checkout@v3
+
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v2
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v2
+
+      - name: Log in to Docker Hub
+        uses: docker/login-action@v2
+        with:
+          registry: ghcr.io
+          username: ${{ github.repository_owner }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Build and push Docker image (versioned)
+        if: github.event_name == 'push'
+        uses: docker/build-push-action@v4
+        with:
+          context: .
+          push: true
+          tags: "ghcr.io/ggerganov/llama.cpp:${{ matrix.config.tag }}-${{ env.COMMIT_SHA }}"
+          file: ${{ matrix.config.dockerfile }}
+
+      - name: Build and push Docker image (tagged)
+        uses: docker/build-push-action@v4
+        with:
+          context: .
+          push: ${{ github.event_name == 'push' }}
+          tags: "ghcr.io/ggerganov/llama.cpp:${{ matrix.config.tag }}"
+          file: ${{ matrix.config.dockerfile }}
--- a/.gitignore
+++ b/.gitignore
@@ -18,6 +18,10 @@ models/*

 /main
 /quantize
+/result

 arm_neon.h
 compile_commands.json
+
+.envrc
+.direnv/
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -0,0 +1,265 @@
+cmake_minimum_required(VERSION 3.12) # Don't bump this version for no reason
+project("llama.cpp" C CXX)
+
+set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
+
+if (NOT XCODE AND NOT MSVC AND NOT CMAKE_BUILD_TYPE)
+    set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
+    set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo")
+endif()
+
+set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
+
+if(CMAKE_SOURCE_DIR STREQUAL CMAKE_CURRENT_SOURCE_DIR)
+    set(LLAMA_STANDALONE ON)
+
+    # configure project version
+    # TODO
+else()
+    set(LLAMA_STANDALONE OFF)
+endif()
+
+if (EMSCRIPTEN)
+    set(BUILD_SHARED_LIBS_DEFAULT OFF)
+
+    option(LLAMA_WASM_SINGLE_FILE "llama: embed WASM inside the generated llama.js" ON)
+else()
+    if (MINGW)
+        set(BUILD_SHARED_LIBS_DEFAULT OFF)
+    else()
+        set(BUILD_SHARED_LIBS_DEFAULT ON)
+    endif()
+endif()
+
+
+#
+# Option list
+#
+
+# general
+option(LLAMA_STATIC                 "llama: static link libraries"                          OFF)
+option(LLAMA_NATIVE                 "llama: enable -march=native flag"                      OFF)
+option(LLAMA_LTO                    "llama: enable link time optimization"                  OFF)
+
+# debug
+option(LLAMA_ALL_WARNINGS           "llama: enable all compiler warnings"                   ON)
+option(LLAMA_ALL_WARNINGS_3RD_PARTY "llama: enable all compiler warnings in 3rd party libs" OFF)
+option(LLAMA_GPROF                  "llama: enable gprof"                                   OFF)
+
+# sanitizers
+option(LLAMA_SANITIZE_THREAD        "llama: enable thread sanitizer"                        OFF)
+option(LLAMA_SANITIZE_ADDRESS       "llama: enable address sanitizer"                       OFF)
+option(LLAMA_SANITIZE_UNDEFINED     "llama: enable undefined sanitizer"                     OFF)
+
+# instruction set specific
+option(LLAMA_AVX                    "llama: enable AVX"                                     ON)
+option(LLAMA_AVX2                   "llama: enable AVX2"                                    ON)
+option(LLAMA_FMA                    "llama: enable FMA"                                     ON)
+
+# 3rd party libs
+option(LLAMA_ACCELERATE             "llama: enable Accelerate framework"                    ON)
+option(LLAMA_OPENBLAS               "llama: use OpenBLAS"                                   OFF)
+
+option(LLAMA_BUILD_TESTS            "llama: build tests"    ${LLAMA_STANDALONE})
+option(LLAMA_BUILD_EXAMPLES         "llama: build examples" ${LLAMA_STANDALONE})
+
+#
+# Compile flags
+#
+
+set(CMAKE_CXX_STANDARD_REQUIRED true)
+set(CMAKE_C_STANDARD_REQUIRED true)
+set(THREADS_PREFER_PTHREAD_FLAG ON)
+find_package(Threads REQUIRED)
+
+if (NOT MSVC)
+    if (LLAMA_SANITIZE_THREAD)
+        add_compile_options(-fsanitize=thread)
+    endif()
+
+    if (LLAMA_SANITIZE_ADDRESS)
+        add_compile_options(-fsanitize=address -fno-omit-frame-pointer)
+    endif()
+
+    if (LLAMA_SANITIZE_UNDEFINED)
+        add_compile_options(-fsanitize=undefined)
+    endif()
+endif()
+
+if (APPLE AND LLAMA_ACCELERATE)
+    find_library(ACCELERATE_FRAMEWORK Accelerate)
+    if (ACCELERATE_FRAMEWORK)
+        message(STATUS "Accelerate framework found")
+
+        add_compile_definitions(GGML_USE_ACCELERATE)
+        set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} ${ACCELERATE_FRAMEWORK})
+    else()
+        message(WARNING "Accelerate framework not found")
+    endif()
+endif()
+if (LLAMA_OPENBLAS)
+    if (LLAMA_STATIC)
+        set(BLA_STATIC ON)
+    endif()
+
+    set(BLA_VENDOR OpenBLAS)
+    find_package(BLAS)
+    if (BLAS_FOUND)
+        message(STATUS "OpenBLAS found")
+
+        add_compile_definitions(GGML_USE_OPENBLAS)
+        add_link_options(${BLAS_LIBRARIES})
+    else()
+        message(WARNING "OpenBLAS not found")
+    endif()
+endif()
+
+if (LLAMA_ALL_WARNINGS)
+    if (NOT MSVC)
+        set(c_flags
+            -Wall
+            -Wextra
+            -Wpedantic
+            -Wshadow
+            -Wcast-qual
+            -Wstrict-prototypes
+            -Wpointer-arith
+            -Wno-unused-function
+        )
+        set(cxx_flags
+            -Wall
+            -Wextra
+            -Wpedantic
+            -Wcast-qual
+        )
+    else()
+        # todo : msvc
+    endif()
+
+    add_compile_options(
+            "$<$<COMPILE_LANGUAGE:C>:${c_flags}>"
+            "$<$<COMPILE_LANGUAGE:CXX>:${cxx_flags}>"
+    )
+
+endif()
+
+if (LLAMA_LTO)
+    include(CheckIPOSupported)
+    check_ipo_supported(RESULT result OUTPUT output)
+    if (result)
+        set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
+    else()
+        message(WARNING "IPO is not supported: ${output}")
+    endif()
+endif()
+
+# Architecture specific
+# TODO: probably these flags need to be tweaked on some architectures
+#       feel free to update the Makefile for your architecture and send a pull request or issue
+message(STATUS "CMAKE_SYSTEM_PROCESSOR: ${CMAKE_SYSTEM_PROCESSOR}")
+if (NOT MSVC)
+    if (LLAMA_STATIC)
+        add_link_options(-static)
+        if (MINGW)
+            add_link_options(-static-libgcc -static-libstdc++)
+        endif()
+    endif()
+    if (LLAMA_GPROF)
+        add_compile_options(-pg)
+    endif()
+    if (LLAMA_NATIVE)
+        add_compile_options(-march=native)
+    endif()
+endif()
+
+if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm" OR ${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
+    message(STATUS "ARM detected")
+    if (MSVC)
+        # TODO: arm msvc?
+    else()
+        if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
+            add_compile_options(-mcpu=native)
+        endif()
+        # TODO: armv6,7,8 version specific flags
+    endif()
+elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "^(x86_64|i686|AMD64)$")
+    message(STATUS "x86 detected")
+    if (MSVC)
+        if (LLAMA_AVX2)
+            add_compile_options(/arch:AVX2)
+        elseif (LLAMA_AVX)
+            add_compile_options(/arch:AVX)
+        endif()
+    else()
+        add_compile_options(-mf16c)
+        if (LLAMA_FMA)
+            add_compile_options(-mfma)
+        endif()
+        if (LLAMA_AVX)
+            add_compile_options(-mavx)
+        endif()
+        if (LLAMA_AVX2)
+            add_compile_options(-mavx2)
+        endif()
+    endif()
+else()
+    # TODO: support PowerPC
+    message(STATUS "Unknown architecture")
+endif()
+
+#
+# Build libraries
+#
+
+add_library(utils OBJECT
+            utils.cpp
+            utils.h)
+
+target_include_directories(utils PUBLIC .)
+target_compile_features(utils PUBLIC cxx_std_11) # don't bump
+target_link_libraries(utils PRIVATE ${LLAMA_EXTRA_LIBS})
+
+add_library(ggml OBJECT
+            ggml.c
+            ggml.h)
+
+target_include_directories(ggml PUBLIC .)
+target_compile_features(ggml PUBLIC c_std_11) # don't bump
+target_link_libraries(ggml PRIVATE Threads::Threads ${LLAMA_EXTRA_LIBS})
+
+add_library(llama
+            llama.cpp
+            llama.h)
+
+target_include_directories(llama PUBLIC .)
+target_compile_features(llama PUBLIC cxx_std_11) # don't bump
+target_link_libraries(llama PRIVATE utils ggml ${LLAMA_EXTRA_LIBS})
+
+#
+# Executables
+#
+
+add_executable(main
+               main.cpp
+               run.cpp)
+target_link_libraries(main PRIVATE llama ggml utils)
+
+if(NOT WIN32)
+    target_sources(main PRIVATE tcp_server.cpp)
+endif()
+
+add_executable(quantize quantize.cpp)
+target_link_libraries(quantize PRIVATE llama ggml utils)
+
+#
+# programs, examples and tests
+#
+
+if (LLAMA_BUILD_TESTS AND NOT CMAKE_JS_VERSION)
+    enable_testing()
+    add_subdirectory(tests)
+endif ()
+
+#if (LLAMA_BUILD_EXAMPLES)
+#    add_subdirectory(examples)
+#endif()
--- a/84
+++ b/84
@@ -17,7 +17,7 @@ CXXV := $(shell $(CXX) --version | head -n 1)
 # ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789
 ifeq ($(UNAME_S),Darwin)
 	ifneq ($(UNAME_P),arm)
-		SYSCTL_M := $(shell sysctl -n hw.optional.arm64)
+		SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null)
 		ifeq ($(SYSCTL_M),1)
 			# UNAME_P := arm
 			# UNAME_M := arm64
@@ -30,6 +30,7 @@ endif
 # Compile flags
 #

+# keep standard at C11 and C++11
 CFLAGS   = -I.              -O3 -DNDEBUG -std=c11   -fPIC
 CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
 LDFLAGS  =
@@ -48,6 +49,14 @@ ifeq ($(UNAME_S),FreeBSD)
 	CFLAGS   += -pthread
 	CXXFLAGS += -pthread
 endif
+ifeq ($(UNAME_S),NetBSD)
+	CFLAGS   += -pthread
+	CXXFLAGS += -pthread
+endif
+ifeq ($(UNAME_S),OpenBSD)
+	CFLAGS   += -pthread
+	CXXFLAGS += -pthread
+endif
 ifeq ($(UNAME_S),Haiku)
 	CFLAGS   += -pthread
 	CXXFLAGS += -pthread
@@ -91,30 +100,59 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
 		ifneq (,$(findstring sse3,$(SSE3_M)))
 			CFLAGS += -msse3
 		endif
+		AVX512F_M := $(shell grep "avx512f " /proc/cpuinfo)
+		ifneq (,$(findstring avx512f,$(AVX512F_M)))
+			CFLAGS += -mavx512f
+		endif
+		AVX512BW_M := $(shell grep "avx512bw " /proc/cpuinfo)
+		ifneq (,$(findstring avx512bw,$(AVX512BW_M)))
+			CFLAGS += -mavx512bw
+		endif
+		AVX512DQ_M := $(shell grep "avx512dq " /proc/cpuinfo)
+		ifneq (,$(findstring avx512dq,$(AVX512DQ_M)))
+			CFLAGS += -mavx512dq
+		endif
+		AVX512VL_M := $(shell grep "avx512vl " /proc/cpuinfo)
+		ifneq (,$(findstring avx512vl,$(AVX512VL_M)))
+			CFLAGS += -mavx512vl
+		endif
+		AVX512CD_M := $(shell grep "avx512cd " /proc/cpuinfo)
+		ifneq (,$(findstring avx512cd,$(AVX512CD_M)))
+			CFLAGS += -mavx512cd
+		endif
+		AVX512ER_M := $(shell grep "avx512er " /proc/cpuinfo)
+		ifneq (,$(findstring avx512er,$(AVX512ER_M)))
+			CFLAGS += -mavx512er
+		endif
+		AVX512IFMA_M := $(shell grep "avx512ifma " /proc/cpuinfo)
+		ifneq (,$(findstring avx512ifma,$(AVX512IFMA_M)))
+			CFLAGS += -mavx512ifma
+		endif
+		AVX512PF_M := $(shell grep "avx512pf " /proc/cpuinfo)
+		ifneq (,$(findstring avx512pf,$(AVX512PF_M)))
+			CFLAGS += -mavx512pf
+		endif
 	else ifeq ($(UNAME_S),Haiku)
-		AVX1_M := $(shell sysinfo -cpu | grep "AVX ")
-		ifneq (,$(findstring avx,$(AVX1_M)))
+		AVX1_M := $(shell sysinfo -cpu | grep -w "AVX")
+		ifneq (,$(findstring AVX,$(AVX1_M)))
 			CFLAGS += -mavx
 		endif
-		AVX2_M := $(shell sysinfo -cpu | grep "AVX2 ")
-		ifneq (,$(findstring avx2,$(AVX2_M)))
+		AVX2_M := $(shell sysinfo -cpu | grep -w "AVX2")
+		ifneq (,$(findstring AVX2,$(AVX2_M)))
 			CFLAGS += -mavx2
 		endif
-		FMA_M := $(shell sysinfo -cpu | grep "FMA ")
-		ifneq (,$(findstring fma,$(FMA_M)))
+		FMA_M := $(shell sysinfo -cpu | grep -w "FMA")
+		ifneq (,$(findstring FMA,$(FMA_M)))
 			CFLAGS += -mfma
 		endif
-		F16C_M := $(shell sysinfo -cpu | grep "F16C ")
-		ifneq (,$(findstring f16c,$(F16C_M)))
+		F16C_M := $(shell sysinfo -cpu | grep -w "F16C")
+		ifneq (,$(findstring F16C,$(F16C_M)))
 			CFLAGS += -mf16c
 		endif
 	else
 		CFLAGS += -mfma -mf16c -mavx -mavx2
 	endif
 endif
-ifeq ($(UNAME_M),amd64)
-	CFLAGS += -mavx -mavx2 -mfma -mf16c
-endif
 ifneq ($(filter ppc64%,$(UNAME_M)),)
 	POWER9_M := $(shell grep "POWER9" /proc/cpuinfo)
 	ifneq (,$(findstring POWER9,$(POWER9_M)))
@@ -126,7 +164,8 @@ ifneq ($(filter ppc64%,$(UNAME_M)),)
 	endif
 endif
 ifndef LLAMA_NO_ACCELERATE
-	# Mac M1 - include Accelerate framework
+	# Mac M1 - include Accelerate framework.
+	# `-framework Accelerate` works on Mac Intel as well, with negliable performance boost (as of the predict time).
 	ifeq ($(UNAME_S),Darwin)
 		CFLAGS  += -DGGML_USE_ACCELERATE
 		LDFLAGS += -framework Accelerate
@@ -181,18 +220,27 @@ default: main quantize
 ggml.o: ggml.c ggml.h
 	$(CC)  $(CFLAGS)   -c ggml.c -o ggml.o

+llama.o: llama.cpp llama.h
+	$(CXX) $(CXXFLAGS) -c llama.cpp -o llama.o
+
 utils.o: utils.cpp utils.h
 	$(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o

+run.o: run.cpp run.h
+	$(CXX) $(CXXFLAGS) -c run.cpp -o run.o
+
+tcp_server.o: tcp_server.cpp tcp_server.h
+	$(CXX) $(CXXFLAGS) -c tcp_server.cpp -o tcp_server.o
+
 clean:
 	rm -f *.o main quantize

-main: main.cpp ggml.o utils.o
-	$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main $(LDFLAGS)
-	./main -h
+main: main.cpp ggml.o llama.o utils.o run.o tcp_server.o
+	$(CXX) $(CXXFLAGS) main.cpp ggml.o llama.o utils.o run.o tcp_server.o -o main $(LDFLAGS)
+	@echo "\x1b[36mrun ./main -h for help\x1b[0m"

-quantize: quantize.cpp ggml.o utils.o
-	$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)
+quantize: quantize.cpp ggml.o llama.o utils.o
+	$(CXX) $(CXXFLAGS) quantize.cpp ggml.o llama.o utils.o -o quantize $(LDFLAGS)

 #
 # Tests
--- a/README.md
+++ b/README.md
@@ -3,19 +3,21 @@
 [![Actions Status](https://github.com/ggerganov/llama.cpp/workflows/CI/badge.svg)](https://github.com/ggerganov/llama.cpp/actions)
 [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)

-Inference of [Facebook's LLaMA](https://github.com/facebookresearch/llama) model in pure C/C++
+Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

-**Hot topics**
+**Hot topics:**

- Running on Windows: https://github.com/ggerganov/llama.cpp/issues/22
- Fix Tokenizer / Unicode support: https://github.com/ggerganov/llama.cpp/issues/11
+- New C-style API is now available: https://github.com/ggerganov/llama.cpp/pull/370
+- [Added Alpaca support](https://github.com/ggerganov/llama.cpp#instruction-mode-with-alpaca)
+- Cache input prompts for faster initialization: https://github.com/ggerganov/llama.cpp/issues/64
+- Create a `llama.cpp` logo: https://github.com/ggerganov/llama.cpp/issues/105

 ## Description

 The main goal is to run the model using 4-bit quantization on a MacBook

 - Plain C/C++ implementation without dependencies
- Apple silicon first-class citizen - optimized via Arm Neon and Accelerate framework
+- Apple silicon first-class citizen - optimized via ARM NEON
 - AVX2 support for x86 architectures
 - Mixed F16 / F32 precision
 - 4-bit quantization support
@@ -23,21 +25,22 @@ The main goal is to run the model using 4-bit quantization on a MacBook

 This was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022) - I have no idea if it works correctly.
 Please do not make conclusions about the models based on the results from this implementation.
-For all I know, it can be completely wrong. This project is for educational purposes and is not going to be maintained properly.
-New features will probably be added mostly through community contributions, if any.
+For all I know, it can be completely wrong. This project is for educational purposes.
+New features will probably be added mostly through community contributions.

 Supported platforms:

 - [X] Mac OS
 - [X] Linux
- [ ] Windows (soon)
+- [X] Windows (via CMake)
+- [X] Docker

 ---

 Here is a typical run using LLaMA-7B:

 ```java
-make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
+make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
 I llama.cpp build info:
 I UNAME_S:  Darwin
 I UNAME_P:  arm
@@ -145,76 +148,201 @@ python3 -m pip install torch numpy sentencepiece
 python3 convert-pth-to-ggml.py models/7B/ 1

 # quantize the model to 4-bits
-./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
+python3 quantize.py 7B

 # run the inference
-./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
+./main -m ./models/7B/ggml-model-q4_0.bin -n 128
 ```

-For the bigger models, there are a few extra quantization steps. For example, for LLaMA-13B, converting to FP16 format
-will create 2 ggml files, instead of one:
-
-```bash
-ggml-model-f16.bin
-ggml-model-f16.bin.1
-```
-
-You need to quantize each of them separately like this:
-
-```bash
-./quantize ./models/13B/ggml-model-f16.bin   ./models/13B/ggml-model-q4_0.bin 2
-./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2
-```
-
-Everything else is the same. Simply run:
-
-```bash
-./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128
-```
-
-The number of files generated for each model is as follows:
-
-```
-7B  -> 1 file
-13B -> 2 files
-30B -> 4 files
-65B -> 8 files
-```
+Currently, it's best to use Python 3.9 or Python 3.10, as `sentencepiece` has not yet published a wheel for Python 3.11.

 When running the larger models, make sure you have enough disk space to store all the intermediate files.

+### Memory/Disk Requirements
+
+As the models are currently fully loaded into memory, you will need adequate disk space to save them
+and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
+
+| model | original size | quantized size (4-bit) |
+|-------|---------------|------------------------|
+| 7B    | 13 GB         | 3.9 GB                 |
+| 13B   | 24 GB         | 7.8 GB                 |
+| 30B   | 60 GB         | 19.5 GB                |
+| 65B   | 120 GB        | 38.5 GB                |
+
 ### Interactive mode

 If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
 In this mode, you can always interrupt generation by pressing Ctrl+C and enter one or more lines of text which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt which makes LLaMa emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"`.

 Here is an example few-shot interaction, invoked with the command
-```
-./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 256 --repeat_penalty 1.0 --color -i -r "User:" \
-                                           -p \
-"Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

-User: Hello, Bob.
-Bob: Hello. How may I help you today?
-User: Please tell me the largest city in Europe.
-Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
-User:"
+```bash
+# default arguments using 7B model
+./chat.sh

+# custom arguments using 13B model
+./main -m ./models/13B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
 ```
+
 Note the use of `--color` to distinguish between user input and generated text.

 ![image](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png)

+### Instruction mode with Alpaca
+
+First, download the `ggml` Alpaca model into the `./models` folder:
+
+```
+# use one of these
+# TODO: add a script to simplify the download
+curl -o ./models/ggml-alpaca-7b-q4.bin -C - https://gateway.estuary.tech/gw/ipfs/QmUp1UGeQFDqJKvtjbSYPBiZZKRjLp8shVP9hT8ZB9Ynv1
+curl -o ./models/ggml-alpaca-7b-q4.bin -C - https://ipfs.io/ipfs/QmUp1UGeQFDqJKvtjbSYPBiZZKRjLp8shVP9hT8ZB9Ynv1
+curl -o ./models/ggml-alpaca-7b-q4.bin -C - https://cloudflare-ipfs.com/ipfs/QmUp1UGeQFDqJKvtjbSYPBiZZKRjLp8shVP9hT8ZB9Ynv1
+```
+
+Now run the `main` tool like this:
+
+```
+./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins
+```
+
+Sample run:
+
+```
+== Running in interactive mode. ==
+ - Press Ctrl+C to interject at any time.
+ - Press Return to return control to LLaMa.
+ - If you want to submit another line, end your input in '\'.
+
+ Below is an instruction that describes a task. Write a response that appropriately completes the request.
+
+> How many letters are there in the English alphabet?
+There 26 letters in the English Alphabet
+> What is the most common way of transportation in Amsterdam?
+The majority (54%) are using public transit. This includes buses, trams and metros with over 100 lines throughout the city which make it very accessible for tourists to navigate around town as well as locals who commute by tram or metro on a daily basis
+> List 5 words that start with "ca".
+cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
+> 
+```
+
+### Obtaining and verifying the Facebook LLaMA original model and Stanford Alpaca model data
+
+* The LLaMA models are officially distributed by Facebook and will never be provided through this repository. See this [Pull Request in Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to obtain access to the model data.
+
+* Please verify the sha256 checksums of all of your `consolidated*.pth` and corresponding converted `ggml-model-*.bin` model files to confirm that you have the correct model data files before creating an issue relating to your model files.
+
+The following command will verify if you have all possible latest files in your self-installed `./models` subdirectory:
+
+`sha256sum --ignore-missing -c SHA256SUMS` on Linux
+
+or
+
+`shasum -a 256 --ignore-missing -c SHA256SUMS` on macOS
+
+### Perplexity (Measuring model quality)
+
+You can pass `--perplexity` as a command line option to measure perplexity over the given prompt.  For more background,
+see https://huggingface.co/docs/transformers/perplexity.  However, in general, lower perplexity is better for LLMs.
+
+#### Measurements
+
+https://github.com/ggerganov/llama.cpp/pull/270 is the unofficial tracking page for now.  llama.cpp is measuring very well
+compared to the baseline implementations.  Quantization has a small negative impact to quality, but, as you can see, running
+13B at q4_0 beats the 7B f16 model by a significant amount.
+
+All measurements are done against wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512 length context).
+Note that the changing the context length will have a significant impact on perplexity (longer context = better perplexity).
+```
+Perplexity - model options
+5.5985 - 13B, q4_0
+5.9565 - 7B, f16
+6.3001 - 7B, q4_1
+6.5949 - 7B, q4_0
+6.5995 - 7B, q4_0, --memory_f16
+```
+
+#### How to run
+
+1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
+2. Run `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
+3. Output:
+```
+Calculating perplexity over 655 chunks
+24.43 seconds per pass - ETA 4.45 hours
+[1]4.5970,[2]5.1807,[3]6.0382,...
+```
+And after 4.45 hours, you will have the final perplexity.
+
+### Android
+
+You can easily run `llama.cpp` on Android device with [termux](https://play.google.com/store/apps/details?id=com.termux).
+First, obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake:
+```
+$ mkdir build-android
+$ cd build-android
+$ export NDK=<your_ndk_directory>
+$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
+$ make
+```
+Install [termux](https://play.google.com/store/apps/details?id=com.termux) on your device and run `termux-setup-storage` to get access to your SD card.
+Finally, copy the `llama` binary and the model files to your device storage. Here is a demo of an interactive session running on Pixel 5 phone:
+
+https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
+
+### Docker
+
+#### Prerequisites
+* Docker must be installed and running on your system.
+* Create a folder to store big models & intermediate files (in ex. im using /llama/models)
+
+#### Images
+We have two Docker images available for this project:
+
+1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
+2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file.
+
+#### Usage
+
+The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command which includes the full docker image.
+
+ ```bash
+docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
+```
+
+On complete, you are ready to play!
+
+```bash
+docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
+```
+
+or with light image:
+
+```bash
+docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
+```
+
 ## Limitations

- Not sure if my tokenizer is correct. There are a few places where we might have a mistake:
-  - https://github.com/ggerganov/llama.cpp/blob/26c084662903ddaca19bef982831bfb0856e8257/convert-pth-to-ggml.py#L79-L87
-  - https://github.com/ggerganov/llama.cpp/blob/26c084662903ddaca19bef982831bfb0856e8257/utils.h#L65-L69
-  In general, it seems to work, but I think it fails for unicode character support. Hopefully, someone can help with that
- I don't know yet how much the quantization affects the quality of the generated text
 - Probably the token sampling can be improved
 - The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder,
-  there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simlpy don't
+  there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't
  know how to utilize it properly. But in any case, you can even disable it with `LLAMA_NO_ACCELERATE=1 make` and the
  performance will be the same, since no BLAS calls are invoked by the current implementation

+### Contributing
+
+- Contributors can open PRs
+- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
+- Collaborators will be invited based on contributions
+- Any help with managing issues and PRs is very appreciated!
+- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
+
+### Coding guidelines
+
+- Avoid adding third-party dependencies, extra files, extra headers, etc.
+- Always consider cross-compatibility with other operating systems and architectures
+- Avoid fancy looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
+- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
+- Clean-up any trailing whitespaces, use 4 spaces indentation, brackets on same line, `void * ptr`, `int & a`
+- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
--- a/53
+++ b/53
@@ -0,0 +1,53 @@
+700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d  models/7B/consolidated.00.pth
+abe4aec2cdc297e2916011f66c7efd6fb4424e0e84315503005b5c118358cc22  models/7B/ggml-model-f16.bin
+f495fa02a0b5ef265e1864d9680eede7fd23a60b0a2f93edba8091e2a4ca68b9  models/7B/ggml-model-q4_0.bin
+7e89e242ddc0dd6f060b43ca219ce8b3e8f08959a72cb3c0855df8bb04d46265  models/7B/params.json
+745bf4e29a4dd6f411e72976d92b452da1b49168a4f41c951cfcc8051823cf08  models/13B/consolidated.00.pth
+d5ccbcc465c71c0de439a5aeffebe8344c68a519bce70bc7f9f92654ee567085  models/13B/consolidated.01.pth
+a6bd0537c6873f36c47292df0b6f794e1135f5aafb89c3343bcc9e93264bf167  models/13B/ggml-model-f16.bin
+0fb0951b90f2ec46c1f2f2372af5dacb4614b27e9fb6c10c69fbec58d7dd0e36  models/13B/ggml-model-f16.bin.1
+1c218ba37ae61e15e35efd9949c78d6edf553b6280824c263cad56ae0b9d5a8f  models/13B/ggml-model-q4_0.bin
+c37a20c2ab9fa74b006b389085660269ee06110d1e45a494eb57d4602c9bcdb2  models/13B/ggml-model-q4_0.bin.1
+4ab77bec4d4405ccb66a97b282574c89a94417e3c32e5f68f37e2876fc21322f  models/13B/params.json
+e23294a58552d8cdec5b7e8abb87993b97ea6eced4178ff2697c02472539d067  models/30B/consolidated.00.pth
+4e077b7136c7ae2302e954860cf64930458d3076fcde9443f4d0e939e95903ff  models/30B/consolidated.01.pth
+24a87f01028cbd3a12de551dcedb712346c0b5cbdeff1454e0ddf2df9b675378  models/30B/consolidated.02.pth
+1adfcef71420886119544949767f6a56cb6339b4d5fcde755d80fe68b49de93b  models/30B/consolidated.03.pth
+def20ea508f4e36793719f857471e85b85f96e497a2cbffbbaa1b60e2b18202c  models/30B/ggml-model-f16.bin
+b37040aa67fa8608cb2d8e0719132cf3e267fd35ec1e2f0d37dbc9fa43d674f1  models/30B/ggml-model-f16.bin.1
+e7f263557e99069fe29003262ea5fa9ed885dbe79069083e6eb569b328cf30d3  models/30B/ggml-model-f16.bin.2
+2ad6a23af05eb720f202f63d130f4fc5de9b6d2efc95b921be003209a56695aa  models/30B/ggml-model-f16.bin.3
+7de31d005e6d02ebd9603b2cf5329ad2f832b65d08873a098c5cafc4046cb9ed  models/30B/ggml-model-q4_0.bin
+f91feef9f30f9a023616db2e91297ca6d5d5d7b9eb351e452a82115c46f7da9e  models/30B/ggml-model-q4_0.bin.1
+66f3a0916ac7a81839153eb061fa861030ed1892477c2f7af2ce4f98d2f6d06f  models/30B/ggml-model-q4_0.bin.2
+e3c587ba97f83d2088b001bcda3026571065649ee3090bef6743a51390b01d3b  models/30B/ggml-model-q4_0.bin.3
+2c07118ea98d69dbe7810d88520e30288fa994751b337f8fca02b171955f44cb  models/30B/params.json
+135c563f6b3938114458183afb01adc9a63bef3d8ff7cccc3977e5d3664ecafe  models/65B/consolidated.00.pth
+9a600b37b19d38c7e43809485f70d17d1dc12206c07efa83bc72bb498a568bde  models/65B/consolidated.01.pth
+e7babf7c5606f165a3756f527cb0fedc4f83e67ef1290391e52fb1cce5f26770  models/65B/consolidated.02.pth
+73176ffb426b40482f2aa67ae1217ef79fbbd1fff5482bae5060cdc5a24ab70e  models/65B/consolidated.03.pth
+882e6431d0b08a8bc66261a0d3607da21cbaeafa96a24e7e59777632dbdac225  models/65B/consolidated.04.pth
+a287c0dfe49081626567c7fe87f74cce5831f58e459b427b5e05567641f47b78  models/65B/consolidated.05.pth
+72b4eba67a1a3b18cb67a85b70f8f1640caae9b40033ea943fb166bd80a7b36b  models/65B/consolidated.06.pth
+d27f5b0677d7ff129ceacd73fd461c4d06910ad7787cf217b249948c3f3bc638  models/65B/consolidated.07.pth
+7eba2625260cd91f8de901fd9704a1aa39448425514a335a0d3878de4ab9dc77  models/65B/ggml-model-f16.bin
+f6aa886575df0785d4231f30cc776d499ccde18857818effc0378c65b178e0b5  models/65B/ggml-model-f16.bin.1
+076037141682f5d7537955058c4740ab27f285aa4588915f830874a589c0693d  models/65B/ggml-model-f16.bin.2
+7853d96d2903ad7de2b2a89c4acf5a33a2f8e3c24ac39c9df6b44cdb42bf530a  models/65B/ggml-model-f16.bin.3
+b16b7b941abb3bc03a14df1656140855e9360a5371c83e919b9da83a72362314  models/65B/ggml-model-f16.bin.4
+5291270216f888697695acb78ef28df0c080f9e85d3245c92fb9992d1fde6678  models/65B/ggml-model-f16.bin.5
+0685ee77715f34686841006f8f94d3e7eaf148b97cecc9d3eee72808b0f7989c  models/65B/ggml-model-f16.bin.6
+00d993d73bb21d7c29388ffe0dced008cbaa0d391831dea77d7eb8f0b5c404b9  models/65B/ggml-model-f16.bin.7
+4e398f05842206e08cdc5e7bb4f6c7c34b9dc373435ece6f261b14b7b4fe9b89  models/65B/ggml-model-q4_0.bin
+4c4e899e3b12d9f57c9dcea5a1fb41bbc72023323535551f6273582ca7d7294b  models/65B/ggml-model-q4_0.bin.1
+d7b4594bbbd192043b3db0e5acc2561c42e6944e1cb91cc6e61510eee89dbcd8  models/65B/ggml-model-q4_0.bin.2
+9a099d271648863d923d0d097391ea0bc75591f27a2ca3a327760f42e6b69af2  models/65B/ggml-model-q4_0.bin.3
+5ee474051e418c5732b7949190b084d9d679db447f83c1de0d2a82daaa1a0cfa  models/65B/ggml-model-q4_0.bin.4
+a45aa05e7212bd6782790722d68056c5419667ea6b564ccc94bbcb8111d79b8b  models/65B/ggml-model-q4_0.bin.5
+a58fda714b759c28ad5e4c1d8bf8fda7b158fd5e4c4a49f851f36342fa97a105  models/65B/ggml-model-q4_0.bin.6
+a3540cfcbcda33c223c6b0d606034adbd78f17e0e5de1582b78795e78754f7a8  models/65B/ggml-model-q4_0.bin.7
+999ed1659b469ccc2a941714c0a9656fa571d17c9f7c8c7589817ca90edef51b  models/65B/params.json
+1f582babc2bd56bb63b33141898748657d369fd110c4358b2bc280907882bf13  models/alpaca-7B/ggml-model-q4_0.bin
+e17730c6b62b565b098af023ca446dcb9e3535d4222ead6369c7aae67207eb3d  models/alpaca-13B/ggml-model-q4_0.bin
+9bcd1bb30e679c939f367be11b030fe20b3eb9a3606b9bc4106420f1827b6ae4  models/alpaca-30B/ggml-model-q4_0.bin
+36079249f53c292a4c2302d7784005dcae94c865f0bedfdbfa51d9ddad402935  models/alpaca-30B/params.json
--- a/alpaca.sh
+++ b/alpaca.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+#
+# Temporary script - will be removed in the future
+#
+
+./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7
--- a/chat.sh
+++ b/chat.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+#
+# Temporary script - will be removed in the future
+#
+
+./main -m ./models/7B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
--- a/chat_tcp_client.sh
+++ b/chat_tcp_client.sh
@@ -0,0 +1,45 @@
+#!/usr/bin/env bash
+
+PORT=${PORT:-8080}
+PROMPT="${PROMPT:-"Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
+
+User:Hello, Bob.
+Bob:Hello. How may I help you today?
+User:Please tell me the largest city in Europe.
+Bob:Sure. The largest city in Europe is Moscow, the capital of Russia.
+User:"}"
+RPROMPT="${RPROMPT:-"User:"}"
+N_PREDICT="${N_PREDICT:-"4096"}"
+REPEAT_PENALTY="${REPEAT_PENALTY:-"1.0"}"
+N_THREADS="${N_THREADS:-"4"}"
+
+# Open connection to the chat server
+exec 3<>/dev/tcp/127.0.0.1/${PORT}
+
+# Pass the arguments. The protocol is really simple:
+# 1. Pass the number of arguments followed by a linefeed
+# 2. Pass the arguments, with each being followed by "0"
+(
+echo -en "12\n"
+echo -en "-t\x00"
+echo -en "$N_THREADS\x00"
+echo -en "-n\x00"
+echo -en "$N_PREDICT\x00"
+echo -en "--repeat_penalty\x00"
+echo -en "$REPEAT_PENALTY\x00"
+echo -en "--color\x00"
+echo -en "-i\x00"
+echo -en "-r\x00"
+echo -en "$RPROMPT\x00"
+echo -en "-p\x00"
+echo -en "$PROMPT\x00"
+) >&3
+
+trap exit TERM
+
+# When we have passed the arguments, start printing socket data to the screen.
+# This is done in a background job because we also want to send data when
+# running in interactive mode.
+cat <&3 && echo "(disconnected, press \"enter\" twice to exit)" &
+cat >&3
+wait
--- a/chat_tcp_server.sh
+++ b/chat_tcp_server.sh
@@ -0,0 +1,6 @@
+#!/usr/bin/env bash
+
+PORT=${PORT:-8080}
+MODEL=${MODEL:-models/7B/ggml-model-q4_0.bin}
+
+./main -l ${PORT} -m $MODEL
--- a/convert-gptq-to-ggml.py
+++ b/convert-gptq-to-ggml.py
@@ -0,0 +1,172 @@
+# Convert a GPTQ quantized LLaMA model to a ggml compatible file
+# Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa
+#
+import os
+import re
+import sys
+import json
+import struct
+import numpy as np
+import torch
+from sentencepiece import SentencePieceProcessor
+
+if len(sys.argv) != 4:
+    print("Usage: convert-gptq-to-ggml.py llamaXXb-4bit.pt tokenizer.model out.bin\n")
+    sys.exit(1)
+
+fname_model = sys.argv[1]
+fname_tokenizer = sys.argv[2]
+dir_out = sys.argv[3]
+
+model = torch.load(fname_model, map_location="cpu")
+
+n_vocab, n_embd = model['model.embed_tokens.weight'].shape
+n_layer = 1 + max(int(m.group(1)) for name in model
+                  if (m := re.match(r'model\.layers\.([0-9]+)', name)))
+
+# hardcoded:
+n_mult = 256
+n_head = {32: 32, 40: 40, 60: 52, 80: 64}[n_layer]
+
+tokenizer = SentencePieceProcessor(fname_tokenizer)
+
+assert tokenizer.vocab_size() == n_vocab
+
+fname_out = sys.argv[3]
+
+fout = open(fname_out, "wb")
+
+fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
+fout.write(struct.pack("i", n_vocab))
+fout.write(struct.pack("i", n_embd))
+fout.write(struct.pack("i", n_mult))
+fout.write(struct.pack("i", n_head))
+fout.write(struct.pack("i", n_layer))
+fout.write(struct.pack("i", n_embd // n_head)) # rot (obsolete)
+fout.write(struct.pack("i", 4))
+
+
+# This loop unchanged from convert-pth-to-ggml.py:
+for i in range(tokenizer.vocab_size()):
+    if tokenizer.is_unknown(i):
+        # "<unk>" token (translated as ??)
+        text = " \u2047 ".encode("utf-8")
+        fout.write(struct.pack("i", len(text)))
+        fout.write(text)
+    elif tokenizer.is_control(i):
+        # "<s>"/"</s>" tokens
+        fout.write(struct.pack("i", 0))
+    elif tokenizer.is_byte(i):
+        # "<U+XX>" tokens (which may be invalid UTF-8)
+        piece = tokenizer.id_to_piece(i)
+        if len(piece) != 6:
+            print("Invalid token: " + piece)
+            sys.exit(1)
+        byte_value = int(piece[3:-1], 16)
+        fout.write(struct.pack("i", 1))
+        fout.write(struct.pack("B", byte_value))
+    else:
+        # normal token. Uses U+2581 (LOWER ONE EIGHTH BLOCK) to represent spaces.
+        text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
+        fout.write(struct.pack("i", len(text)))
+        fout.write(text)
+
+def write_header(shape, dst_name, ftype_cur):
+    sname = dst_name.encode('utf-8')
+    fout.write(struct.pack("iii", len(shape), len(sname), ftype_cur))
+    fout.write(struct.pack("i" * len(shape), *shape[::-1]))
+    fout.write(sname)
+
+def convert_non_q4(src_name, dst_name):
+    v = model[src_name]
+    shape = v.shape
+    print("Processing non-Q4 variable: " + src_name + " with shape: ", shape, " and type: ", v.dtype)
+    if len(shape) == 1:
+        print("  Converting to float32")
+        v = v.to(torch.float32)
+
+    ftype_cur = {torch.float16: 1, torch.float32: 0}[v.dtype]
+
+    # header
+    write_header(shape, dst_name, ftype_cur)
+
+    # data
+    v.numpy().tofile(fout)
+
+def convert_q4(src_name, dst_name, permute=False):
+    zeros = model[f"{src_name}.zeros"].numpy()
+    scales = model[f"{src_name}.scales"].numpy()
+    bias = model[f"{src_name}.bias"].numpy()
+    qweight = model[f"{src_name}.qweight"].numpy().T # transpose
+
+    # Q4_1 does not support bias; good thing the bias is always all zeros.
+    assert not np.any(bias)
+
+    # Each int32 item is actually 8 int4 items packed together, and it's transposed.
+    shape = (qweight.shape[0], qweight.shape[1] * 8)
+
+    print("Processing Q4 variable: " + src_name + " with shape: ", shape)
+
+    # The output format has the int4 weights in groups of 32 rather than 8.
+    # It looks like this:
+    # For each row:
+    #   For each group of 32 columns:
+    #     - addend (float32, 4 bytes)
+    #     - scale (float32, 4 bytes)
+    #     - weights (int4 * 32, 16 bytes)
+    # Note that in the input, the scales and addends are shared between all
+    # the columns in a row, so we end up wasting quite a bit of memory with
+    # repeated scales and addends.
+
+    addends = -zeros # flip sign
+
+    # Since the output format is mixed between integers and floats, we have
+    # to hackily view the floats as int32s just so numpy will let us
+    # concatenate them.
+    addends_view = addends.view(dtype=np.int32)
+    scales_view = scales.view(dtype=np.int32)
+
+    # Split into groups of 4 columns (i.e. 32 columns of quantized data):
+    grouped = qweight.reshape([qweight.shape[0], qweight.shape[1] // 4, 4])
+
+    # Repeat addends and scales:
+    addends_rep = np.atleast_3d(addends_view).repeat(grouped.shape[1], axis=1)
+    scales_rep = np.atleast_3d(scales_view).repeat(grouped.shape[1], axis=1)
+
+    blob = np.concatenate([scales_rep, addends_rep, grouped], axis=2, casting='no')
+
+    if permute:
+        # Permute some rows to undo the permutation done by convert_llama_weights_to_hf.py.
+        # This can be done after the above conversion because it doesn't affect column order/layout.
+        blob = (blob.reshape(n_head, 2, shape[0] // n_head // 2, *blob.shape[1:])
+                    .swapaxes(1, 2)
+                    .reshape(blob.shape))
+
+    # header
+    write_header(shape, dst_name, 3) # ftype = Q4_1
+
+    # data
+    blob.tofile(fout)
+
+convert_non_q4("model.embed_tokens.weight", "tok_embeddings.weight")
+convert_non_q4("model.norm.weight", "norm.weight")
+convert_non_q4("lm_head.weight", "output.weight")
+
+for i in range(n_layer):
+    convert_q4(f"model.layers.{i}.self_attn.q_proj", f"layers.{i}.attention.wq.weight", permute=True)
+    convert_q4(f"model.layers.{i}.self_attn.k_proj", f"layers.{i}.attention.wk.weight", permute=True)
+    convert_q4(f"model.layers.{i}.self_attn.v_proj", f"layers.{i}.attention.wv.weight")
+    convert_q4(f"model.layers.{i}.self_attn.o_proj", f"layers.{i}.attention.wo.weight")
+
+    convert_q4(f"model.layers.{i}.mlp.gate_proj", f"layers.{i}.feed_forward.w1.weight")
+    convert_q4(f"model.layers.{i}.mlp.down_proj", f"layers.{i}.feed_forward.w2.weight")
+    convert_q4(f"model.layers.{i}.mlp.up_proj",   f"layers.{i}.feed_forward.w3.weight")
+
+    convert_non_q4(f"model.layers.{i}.input_layernorm.weight", f"layers.{i}.attention_norm.weight")
+    convert_non_q4(f"model.layers.{i}.post_attention_layernorm.weight", f"layers.{i}.ffn_norm.weight")
+
+
+fout.close()
+
+print("Done. Output file: " + fname_out)
+print("")
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@@ -10,13 +10,12 @@
 #   - Name (char[name_length])
 #   - Data (float[n_dims])
 #
-# By default, the bigger matrices are converted to 16-bit floats.
-# This can be disabled by adding the "use-f32" CLI argument.
-#
 # At the start of the ggml file we write the model parameters
 # and vocabulary.
 #

+import argparse
+import os
 import sys
 import json
 import struct
@@ -25,116 +24,90 @@ import torch

 from sentencepiece import SentencePieceProcessor

-if len(sys.argv) < 3:
-    print("Usage: convert-ckpt-to-ggml.py dir-model ftype\n")
-    print("  ftype == 0 -> float32")
-    print("  ftype == 1 -> float16")
-    sys.exit(1)
+def parse_args():

-# output in the same directory as the model
-dir_model = sys.argv[1]
-
-fname_hparams   = sys.argv[1] + "/params.json"
-fname_tokenizer = sys.argv[1] + "/../tokenizer.model"
+    parser = argparse.ArgumentParser(description='Convert a LLaMA model checkpoint to a ggml compatible file')
+    parser.add_argument('dir_model',  help='directory containing the model checkpoint')
+    parser.add_argument('ftype',      help='file type (0: float32, 1: float16)', type=int, choices=[0, 1], default=1)
+    parser.add_argument('vocab_only', help='only write vocab to file', type=int, default=0, nargs='?')
+    return parser.parse_args()

 def get_n_parts(dim):
-    if dim == 4096:
-        return 1
-    elif dim == 5120:
-        return 2
-    elif dim == 6656:
-        return 4
-    elif dim == 8192:
-        return 8
-    else:
-        print("Invalid dim: " + str(dim))
+
+    mappings = {4096: 1, 5120: 2, 6656: 4, 8192: 8}
+    n_parts = mappings.get(dim)
+    if n_parts is None:
+        print(f"Invalid dim: {dim}")
        sys.exit(1)

-# possible data types
-#   ftype == 0 -> float32
-#   ftype == 1 -> float16
-#
-# map from ftype to string
-ftype_str = ["f32", "f16"]
+    print(f"n_parts = {n_parts}\n")
+    return n_parts

-ftype = 1
-if len(sys.argv) > 2:
-    ftype = int(sys.argv[2])
-    if ftype < 0 or ftype > 1:
-        print("Invalid ftype: " + str(ftype))
-        sys.exit(1)
-    fname_out = sys.argv[1] + "/ggml-model-" + ftype_str[ftype] + ".bin"
+def load_hparams_and_tokenizer(dir_model):

-with open(fname_hparams, "r") as f:
-    hparams = json.load(f)
+    # `dir_model` is something like `models/7B` or `models/7B/`.
+    # "tokenizer.model" is expected under model's parent dir.
+    # When `dir_model` is a symlink, f"{dir_model}/../tokenizer.model" would not be found.
+    # Let's use the model's parent dir directly.
+    model_parent_dir = os.path.dirname(os.path.normpath(dir_model))

-tokenizer = SentencePieceProcessor(fname_tokenizer)
+    fname_hparams = f"{dir_model}/params.json"
+    fname_tokenizer = f"{model_parent_dir}/tokenizer.model"

-hparams.update({"vocab_size": tokenizer.vocab_size()})
+    with open(fname_hparams, "r") as f:
+        hparams = json.load(f)
+        print(hparams)

-n_parts = get_n_parts(hparams["dim"])
+    tokenizer = SentencePieceProcessor(fname_tokenizer)
+    hparams.update({"vocab_size": tokenizer.vocab_size()})

-print(hparams)
-print('n_parts = ', n_parts)
+    return hparams, tokenizer

-for p in range(n_parts):
-    print('Processing part ', p)
+def write_header(fout, hparams, ftype):

-    #fname_model = sys.argv[1] + "/consolidated.00.pth"
-    fname_model = sys.argv[1] + "/consolidated.0" + str(p) + ".pth"
-    fname_out = sys.argv[1] + "/ggml-model-" + ftype_str[ftype] + ".bin"
-    if (p > 0):
-        fname_out = sys.argv[1] + "/ggml-model-" + ftype_str[ftype] + ".bin" + "." + str(p)
+    keys = ["vocab_size", "dim", "multiple_of", "n_heads", "n_layers"]
+    values = [
+        0x67676d66,  # magic: ggmf in hex
+        1, # file version
+        *[hparams[key] for key in keys],
+        hparams["dim"] // hparams["n_heads"],  # rot (obsolete)
+        ftype
+    ]
+    fout.write(struct.pack("i" * len(values), *values))

-    model = torch.load(fname_model, map_location="cpu")
+def write_tokens(fout, tokenizer):

-    fout = open(fname_out, "wb")
-
-    fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
-    fout.write(struct.pack("i", hparams["vocab_size"]))
-    fout.write(struct.pack("i", hparams["dim"]))
-    fout.write(struct.pack("i", hparams["multiple_of"]))
-    fout.write(struct.pack("i", hparams["n_heads"]))
-    fout.write(struct.pack("i", hparams["n_layers"]))
-    fout.write(struct.pack("i", hparams["dim"] // hparams["n_heads"])) # rot (obsolete)
-    fout.write(struct.pack("i", ftype))
-
-    # Is this correct??
-    for i in range(32000):
-        # TODO: this is probably wrong - not sure how this tokenizer works
-        text = tokenizer.decode([29889, i]).encode('utf-8')
-        # remove the first byte (it's always '.')
-        text = text[1:]
+    for i in range(tokenizer.vocab_size()):
+        if tokenizer.is_unknown(i):
+            text = " \u2047 ".encode("utf-8")
+        elif tokenizer.is_control(i):
+            text = b""
+        elif tokenizer.is_byte(i):
+            piece = tokenizer.id_to_piece(i)
+            if len(piece) != 6:
+                print(f"Invalid token: {piece}")
+                sys.exit(1)
+            byte_value = int(piece[3:-1], 16)
+            text = struct.pack("B", byte_value)
+        else:
+            text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
+        fout.write(struct.pack("f", tokenizer.get_score(i)))

-    for k, v in model.items():
-        name = k
-        shape = v.shape
+def process_and_write_variables(fout, model, ftype):

-        # skip layers.X.attention.inner_attention.rope.freqs
-        if name[-5:] == "freqs":
+    for name, datao in model.items():
+
+        if name.endswith("freqs"):
            continue

-        print("Processing variable: " + name + " with shape: ", shape, " and type: ", v.dtype)
+        shape = datao.shape

-        #data = tf.train.load_variable(dir_model, name).squeeze()
-        data = v.numpy().squeeze()
-        n_dims = len(data.shape);
+        print(f"Processing variable: {name} with shape: {shape} and type: {datao.dtype}")

-        # for efficiency - transpose some matrices
-        # "model/h.*/attn/c_attn/w"
-        # "model/h.*/attn/c_proj/w"
-        # "model/h.*/mlp/c_fc/w"
-        # "model/h.*/mlp/c_proj/w"
-        #if name[-14:] == "/attn/c_attn/w" or \
-        #   name[-14:] == "/attn/c_proj/w" or \
-        #   name[-11:] == "/mlp/c_fc/w" or \
-        #   name[-13:] == "/mlp/c_proj/w":
-        #    print("  Transposing")
-        #    data = data.transpose()
-
-        dshape = data.shape
+        data = datao.numpy().squeeze()
+        n_dims = len(shape)

        # default type is fp16
        ftype_cur = 1
@@ -145,18 +118,64 @@ for p in range(n_parts):

        # header
        sname = name.encode('utf-8')
-        fout.write(struct.pack("iii", n_dims, len(sname), ftype_cur))
-        for i in range(n_dims):
-            fout.write(struct.pack("i", dshape[n_dims - 1 - i]))
-        fout.write(sname);
+        fout.write(struct.pack("iii", len(data.shape), len(sname), ftype_cur))
+        for dim in reversed(data.shape):
+            fout.write(struct.pack("i", dim))
+        fout.write(sname)

-        # data
+        # data output to file
        data.tofile(fout)

-    # I hope this deallocates the memory ..
-    model = None
+def main():

-    fout.close()
+    args = parse_args()
+    dir_model = args.dir_model
+    ftype = args.ftype
+    ftype_str = ["f32", "f16"]

-    print("Done. Output file: " + fname_out + ", (part ", p, ")")
-    print("")
+    hparams, tokenizer = load_hparams_and_tokenizer(dir_model)
+
+    print(args)
+
+    # if only writing vocab to file
+    if args.vocab_only:
+
+        fname_model = f"{dir_model}/consolidated.00.pth"
+        fname_out = f"{dir_model}/ggml-vocab.bin"
+
+        print(f"Extracting only the vocab from '{fname_model}'\n")
+
+        model = torch.load(fname_model, map_location="cpu")
+
+        with open(fname_out, "wb") as fout:
+            write_header(fout, hparams, ftype)
+            write_tokens(fout, tokenizer)
+
+        del model
+
+        print(f"Done. Output file: {fname_out}\n")
+
+        return
+
+    n_parts = get_n_parts(hparams["dim"])
+
+    for p in range(n_parts):
+
+        print(f"Processing part {p}\n")
+
+        fname_model = f"{dir_model}/consolidated.0{p}.pth"
+        fname_out = f"{dir_model}/ggml-model-{ftype_str[ftype]}.bin{'' if p == 0 else '.' + str(p)}"
+
+        model = torch.load(fname_model, map_location="cpu")
+
+        with open(fname_out, "wb") as fout:
+            write_header(fout, hparams, ftype)
+            write_tokens(fout, tokenizer)
+            process_and_write_variables(fout, model, ftype)
+
+        del model
+
+        print(f"Done. Output file: {fname_out}, (part {p})\n")
+
+if __name__ == "__main__":
+    main()
--- a/download-pth.py
+++ b/download-pth.py
@@ -0,0 +1,66 @@
+import os
+import sys
+from tqdm import tqdm
+import requests
+
+if len(sys.argv) < 3:
+    print("Usage: download-pth.py dir-model model-type\n")
+    print("  model-type: Available models 7B, 13B, 30B or 65B")
+    sys.exit(1)
+
+modelsDir = sys.argv[1]
+model = sys.argv[2]
+
+num = {
+    "7B": 1,
+    "13B": 2,
+    "30B": 4,
+    "65B": 8,
+}
+
+if model not in num:
+    print(f"Error: model {model} is not valid, provide 7B, 13B, 30B or 65B")
+    sys.exit(1)
+
+print(f"Downloading model {model}")
+
+files = ["checklist.chk", "params.json"]
+
+for i in range(num[model]):
+    files.append(f"consolidated.0{i}.pth")
+
+resolved_path = os.path.abspath(os.path.join(modelsDir, model))
+os.makedirs(resolved_path, exist_ok=True)
+
+for file in files:
+    dest_path = os.path.join(resolved_path, file)
+    
+    if os.path.exists(dest_path):
+        print(f"Skip file download, it already exists: {file}")
+        continue
+
+    url = f"https://agi.gpt4.org/llama/LLaMA/{model}/{file}"
+    response = requests.get(url, stream=True)
+    with open(dest_path, 'wb') as f:
+        with tqdm(unit='B', unit_scale=True, miniters=1, desc=file) as t:
+            for chunk in response.iter_content(chunk_size=1024):
+                if chunk:
+                    f.write(chunk)
+                    t.update(len(chunk))
+
+files2 = ["tokenizer_checklist.chk", "tokenizer.model"]
+for file in files2:
+    dest_path = os.path.join(modelsDir, file)
+    
+    if os.path.exists(dest_path):
+        print(f"Skip file download, it already exists: {file}")
+        continue
+    
+    url = f"https://agi.gpt4.org/llama/LLaMA/{file}"
+    response = requests.get(url, stream=True)
+    with open(dest_path, 'wb') as f:
+        with tqdm(unit='B', unit_scale=True, miniters=1, desc=file) as t:
+            for chunk in response.iter_content(chunk_size=1024):
+                if chunk:
+                    f.write(chunk)
+                    t.update(len(chunk))
--- a/examples/chatLLaMa
+++ b/examples/chatLLaMa
@@ -0,0 +1,53 @@
+#!/bin/bash
+
+cd "$(dirname "$0")/.." || exit
+
+MODEL="${MODEL:-./models/13B/ggml-model-q4_0.bin}"
+USER_NAME="${USER_NAME:-User}"
+AI_NAME="${AI_NAME:-ChatLLaMa}"
+
+# Adjust to the number of CPU cores you want to use.
+N_THREAD="${N_THREAD:-8}"
+# Number of tokens to predict (made it larger than default because we want a long interaction)
+N_PREDICTS="${N_PREDICTS:-2048}"
+
+# Note: you can also override the generation options by specifying them on the command line:
+# For example, override the context size by doing: ./chatLLaMa --ctx_size 1024
+GEN_OPTIONS="${GEN_OPTIONS:---ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647}"
+
+# shellcheck disable=SC2086 # Intended splitting of GEN_OPTIONS
+./main $GEN_OPTIONS \
+  --model "$MODEL" \
+  --threads "$N_THREAD" \
+  --n_predict "$N_PREDICTS" \
+  --color --interactive \
+  --reverse-prompt "${USER_NAME}:" \
+  --prompt "
+Text transcript of a never ending dialog, where ${USER_NAME} interacts with an AI assistant named ${AI_NAME}.
+${AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer ${USER_NAME}’s requests immediately and with details and precision.
+There are no annotations like (30 seconds passed...) or (to himself), just what ${USER_NAME} and ${AI_NAME} say aloud to each other.
+The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
+The transcript only includes text, it does not include markup like HTML and Markdown.
+
+$USER_NAME: Hello, $AI_NAME!
+$AI_NAME: Hello $USER_NAME! How may I help you today?
+$USER_NAME: What time is it?
+$AI_NAME: It is $(date +%H:%M).
+$USER_NAME: What year is it?
+$AI_NAME: We are in $(date +%Y).
+$USER_NAME: Please tell me the largest city in Europe.
+$AI_NAME: The largest city in Europe is Moscow, the capital of Russia.
+$USER_NAME: What can you tell me about Moscow?
+$AI_NAME: Moscow, on the Moskva River in western Russia, is the nation’s cosmopolitan capital. In its historic core is the Kremlin, a complex that’s home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
+$USER_NAME: What is a cat?
+$AI_NAME: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
+$USER_NAME: How do I pass command line arguments to a Node.js program?
+$AI_NAME: The arguments are stored in process.argv.
+
+    argv[0] is the path to the Node. js executable.
+    argv[1] is the path to the script file.
+    argv[2] is the first argument passed to the script.
+    argv[3] is the second argument passed to the script and so on.
+$USER_NAME: Name a color.
+$AI_NAME: Blue
+$USER_NAME:" "$@"
--- a/flake.lock
+++ b/flake.lock
@@ -0,0 +1,43 @@
+{
+  "nodes": {
+    "flake-utils": {
+      "locked": {
+        "lastModified": 1676283394,
+        "narHash": "sha256-XX2f9c3iySLCw54rJ/CZs+ZK6IQy7GXNY4nSOyu2QG4=",
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "rev": "3db36a8b464d0c4532ba1c7dda728f4576d6d073",
+        "type": "github"
+      },
+      "original": {
+        "owner": "numtide",
+        "repo": "flake-utils",
+        "type": "github"
+      }
+    },
+    "nixpkgs": {
+      "locked": {
+        "lastModified": 1678470307,
+        "narHash": "sha256-OEeMUr3ueLIXyW/OaFUX5jUdimyQwMg/7e+/Q0gC/QE=",
+        "owner": "NixOS",
+        "repo": "nixpkgs",
+        "rev": "0c4800d579af4ed98ecc47d464a5e7b0870c4b1f",
+        "type": "github"
+      },
+      "original": {
+        "owner": "NixOS",
+        "ref": "nixos-unstable",
+        "repo": "nixpkgs",
+        "type": "github"
+      }
+    },
+    "root": {
+      "inputs": {
+        "flake-utils": "flake-utils",
+        "nixpkgs": "nixpkgs"
+      }
+    }
+  },
+  "root": "root",
+  "version": 7
+}
--- a/flake.nix
+++ b/flake.nix
@@ -0,0 +1,49 @@
+{
+  inputs = {
+    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
+    flake-utils.url = "github:numtide/flake-utils";
+  };
+  outputs = { self, nixpkgs, flake-utils }:
+    flake-utils.lib.eachDefaultSystem (system:
+      let
+        pkgs = import nixpkgs {
+          inherit system;
+        };
+        llama-python = pkgs.python310.withPackages (ps: with ps; [
+          torch
+          numpy
+          sentencepiece
+        ]);
+      in
+      {
+        packages.default = pkgs.stdenv.mkDerivation {
+          name = "llama.cpp";
+          src = ./.;
+          nativeBuildInputs = with pkgs; [ cmake ];
+          buildInputs = with pkgs; lib.optionals stdenv.isDarwin [
+            darwin.apple_sdk.frameworks.Accelerate
+          ];
+          cmakeFlags = with pkgs; lib.optionals (system == "aarch64-darwin") [
+            "-DCMAKE_C_FLAGS=-D__ARM_FEATURE_DOTPROD=1"
+          ];
+          installPhase = ''
+            mkdir -p $out/bin
+            mv llama $out/bin/llama
+            mv quantize $out/bin/quantize
+            echo "#!${llama-python}/bin/python" > $out/bin/convert-pth-to-ggml
+            cat ${./convert-pth-to-ggml.py} >> $out/bin/convert-pth-to-ggml
+            chmod +x $out/bin/convert-pth-to-ggml
+          '';
+          meta.mainProgram = "llama";
+        };
+        devShells.default = pkgs.mkShell {
+          packages = with pkgs; [
+            cmake
+            llama-python
+          ] ++ lib.optionals stdenv.isDarwin [
+            darwin.apple_sdk.frameworks.Accelerate
+          ];
+        };
+      }
+    );
+}
--- a/ggml.c
+++ b/ggml.c
@@ -1,8 +1,11 @@
+// Defines CLOCK_MONOTONIC on Linux
+#define _POSIX_C_SOURCE 199309L
+
 #include "ggml.h"

 #if defined(_MSC_VER) || defined(__MINGW32__)
 #include <malloc.h> // using malloc.h with MSC/MINGW
-#elif !defined(__FreeBSD__)
+#elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
 #include <alloca.h>
 #endif

@@ -361,10 +364,10 @@ static const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);

 // AVX routines provided by GH user Const-me
 // ref: https://github.com/ggerganov/ggml/pull/27#issuecomment-1464934600
-#if __AVX2__
+#if __AVX2__ || __AVX512F__
 // Unpack 32 4-bit fields into 32 bytes
 // The output vector contains 32 bytes, each one in [ 0 .. 15 ] interval
-inline __m256i bytesFromNibbles( const uint8_t* rsi )
+static inline __m256i bytesFromNibbles( const uint8_t* rsi )
 {
    // Load 16 bytes from memory
    __m128i tmp = _mm_loadu_si128( ( const __m128i* )rsi );
@@ -381,7 +384,7 @@ inline __m256i bytesFromNibbles( const uint8_t* rsi )
    return bytes;
 }

-inline __m128i packNibbles( __m256i bytes )
+static inline __m128i packNibbles( __m256i bytes )
 {
    // Move bits within 16-bit lanes from 0000_abcd_0000_efgh into 0000_0000_abcd_efgh
    const __m256i lowByte = _mm256_set1_epi16( 0xFF );
@@ -397,13 +400,58 @@ inline __m128i packNibbles( __m256i bytes )
 }
 #endif

-
 // method 5
 // blocks of QK elements
 // represented with a single float (delta) and QK/2 8-bit ints (i.e QK 4-bit signed integer factors)
+
+// reference implementation for deterministic creation of model files
+static void quantize_row_q4_0_reference(const float * restrict x, void * restrict y, int k) {
+    assert(k % QK == 0);
+    const int nb = k / QK;
+
+    const size_t bs = sizeof(float) + QK/2;
+
+    uint8_t * restrict pd = ((uint8_t *)y + 0*bs);
+    uint8_t * restrict pb = ((uint8_t *)y + 0*bs + sizeof(float));
+
+    uint8_t pp[QK/2];
+
+    for (int i = 0; i < nb; i++) {
+        float amax = 0.0f; // absolute max
+
+        for (int l = 0; l < QK; l++) {
+            const float v = x[i*QK + l];
+            amax = MAX(amax, fabsf(v));
+        }
+
+        const float d = amax / ((1 << 3) - 1);
+        const float id = d ? 1.0f/d : 0.0f;
+
+        *(float *)pd = d;
+        pd += bs;
+
+        for (int l = 0; l < QK; l += 2) {
+            const float v0 = x[i*QK + l + 0]*id;
+            const float v1 = x[i*QK + l + 1]*id;
+
+            const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
+            const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
+
+            assert(vi0 >= 0 && vi0 < 16);
+            assert(vi1 >= 0 && vi1 < 16);
+
+            pp[l/2] = vi0 | (vi1 << 4);
+        }
+
+        memcpy(pb, pp, sizeof(pp));
+        pb += bs;
+    }
+}
+
 void quantize_row_q4_0(const float * restrict x, void * restrict y, int k) {
    assert(k % QK == 0);

+#if __ARM_NEON || defined(__AVX2__) || defined(__wasm_simd128__)
    const int nb = k / QK;
    const size_t bs = sizeof(float) + QK/2;

@@ -411,6 +459,7 @@ void quantize_row_q4_0(const float * restrict x, void * restrict y, int k) {
    uint8_t * restrict pb = ((uint8_t *)y + 0*bs + sizeof(float));

    uint8_t pp[QK/2];
+#endif

 #if __ARM_NEON
 #if QK == 32
@@ -567,36 +616,7 @@ void quantize_row_q4_0(const float * restrict x, void * restrict y, int k) {
 #endif
 #else
    // scalar
-    for (int i = 0; i < nb; i++) {
-        float amax = 0.0f; // absolute max
-
-        for (int l = 0; l < QK; l++) {
-            const float v = x[i*QK + l];
-            amax = MAX(amax, fabsf(v));
-        }
-
-        const float d = amax / ((1 << 3) - 1);
-        const float id = d ? 1.0f/d : 0.0f;
-
-        *(float *)pd = d;
-        pd += bs;
-
-        for (int l = 0; l < QK; l += 2) {
-            const float v0 = x[i*QK + l + 0]*id;
-            const float v1 = x[i*QK + l + 1]*id;
-
-            const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
-            const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
-
-            assert(vi0 >= 0 && vi0 < 16);
-            assert(vi1 >= 0 && vi1 < 16);
-
-            pp[l/2] = vi0 | (vi1 << 4);
-        }
-
-        memcpy(pb, pp, sizeof(pp));
-        pb += bs;
-    }
+    quantize_row_q4_0_reference(x, y, k);
 #endif
 }

@@ -607,10 +627,11 @@ void quantize_row_q4_1(const float * restrict x, void * restrict y, int k) {
    assert(k % QK == 0);

    const int nb = k / QK;
+    const size_t bs = 2*sizeof(float) + QK/2;

-    float   * restrict pm = (float *)   (y);
-    float   * restrict pd = (float *)   (pm + nb);
-    uint8_t * restrict pb = (uint8_t *) (pd + nb);
+    uint8_t * restrict pd = ((uint8_t *)y + 0*bs);
+    uint8_t * restrict pm = ((uint8_t *)y + 0*bs +   sizeof(float));
+    uint8_t * restrict pb = ((uint8_t *)y + 0*bs + 2*sizeof(float));

    uint8_t pp[QK/2];

@@ -627,8 +648,10 @@ void quantize_row_q4_1(const float * restrict x, void * restrict y, int k) {
        const float d = (max - min) / ((1 << 4) - 1);
        const float id = d ? 1.0f/d : 0.0f;

-        pm[i] = min;
-        pd[i] = d;
+        *(float *)pm = min;
+        *(float *)pd = d;
+        pm += bs;
+        pd += bs;

        for (int l = 0; l < QK; l += 2) {
            const float v0 = (x[i*QK + l + 0] - min)*id;
@@ -643,7 +666,8 @@ void quantize_row_q4_1(const float * restrict x, void * restrict y, int k) {
            pp[l/2] = vi0 | (vi1 << 4);
        }

-        memcpy(pb + i*QK/2, pp, sizeof(pp));
+        memcpy(pb, pp, sizeof(pp));
+        pb += bs;
    }
 }

@@ -687,16 +711,17 @@ void dequantize_row_q4_1(const void * restrict x, float * restrict y, int k) {
    assert(k % QK == 0);

    const int nb = k / QK;
+    const size_t bs = 2*sizeof(float) + QK/2;

-    const float   * restrict pm = (const float *)   (x);
-    const float   * restrict pd = (const float *)   (pm + nb);
-    const uint8_t * restrict pb = (const uint8_t *) (pd + nb);
+    const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs);
+    const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs + sizeof(float));
+    const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + 2*sizeof(float));

    for (int i = 0; i < nb; i++) {
-        const float m = pm[i];
-        const float d = pd[i];
+        const float d = *(const float *) (pd + i*bs);
+        const float m = *(const float *) (pm + i*bs);

-        const uint8_t * restrict pp = pb + i*QK/2;
+        const uint8_t * restrict pp = pb + i*bs;

        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = pp[l/2];
@@ -1257,6 +1282,47 @@ inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float
    *s = sumf;
 }

+#if __AVX512F__ && QK == 32
+static inline __m512 dot_q4_0_oneblock_avx512(
+    __m512 acc,
+    const uint8_t * pd0,
+    const uint8_t * pd1,
+    const uint8_t * pb0,
+    const uint8_t * pb1,
+    size_t bs,
+    int i
+) {
+    const float * d0_0 = (const float *) (pd0 + i*bs);
+    const float * d1_0 = (const float *) (pd1 + i*bs);
+
+    const uint8_t * restrict p0 = pb0 + (i+0)*bs;
+    const uint8_t * restrict p1 = pb1 + (i+0)*bs;
+
+    // Compute combined scale for the block
+    float scaleScalar = d0_0[0] * d1_0[0];
+    __m512 scale = _mm512_set1_ps( scaleScalar );
+
+    __m256i bx = bytesFromNibbles( p0 );
+    __m256i by = bytesFromNibbles( p1 );
+
+    // Now we have a vector with bytes in [ 0 .. 15 ] interval. Offset them into [ -8 .. +7 ] interval.
+    const __m256i off = _mm256_set1_epi8( 8 );
+    bx = _mm256_sub_epi8( bx, off );
+    by = _mm256_sub_epi8( by, off );
+
+    // Sign-extend 16 signed bytes into int16_t
+    __m512i x32 = _mm512_cvtepi8_epi16( bx );
+    __m512i y32 = _mm512_cvtepi8_epi16( by );
+    // Compute products of int16_t integers, add pairwise
+    __m512i i64 = _mm512_madd_epi16( x32, y32 );
+
+    // Convert int32_t to float
+    __m512 p = _mm512_cvtepi32_ps( i64 );
+    // Apply the scale, and accumulate
+    return _mm512_fmadd_ps( scale, p, acc );
+}
+#endif
+
 inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
    ggml_float sumf = 0.0;

@@ -1359,8 +1425,24 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
        const int8x16_t v0_1hs = vsubq_s8(v0_1h, s8b);
        const int8x16_t v1_1hs = vsubq_s8(v1_1h, s8b);

+#if defined(__ARM_FEATURE_DOTPROD)
        // dot product into int16x8_t
-        const int16x8_t pl0l = vmull_s8(vget_low_s8 (v0_0ls), vget_low_s8 (v1_0ls));
+        int32x4_t p_0 = vdotq_s32(vdupq_n_s32(0), v0_0ls, v1_0ls);
+        int32x4_t p_1 = vdotq_s32(vdupq_n_s32(0), v0_1ls, v1_1ls);
+
+        p_0 = vdotq_s32(p_0, v0_0hs, v1_0hs);
+        p_1 = vdotq_s32(p_1, v0_1hs, v1_1hs);
+
+        // scalar
+#if defined(__ARM_FEATURE_QRDMX)
+        sum0 += d0_0*d1_0*vaddvq_s32(p_0);
+        sum1 += d0_1*d1_1*vaddvq_s32(p_1);
+#else
+        sum0 += d0_0*d1_0*(vgetq_lane_s32(p_0, 0) + vgetq_lane_s32(p_0, 1) + vgetq_lane_s32(p_0, 2) + vgetq_lane_s32(p_0, 3));
+        sum1 += d0_1*d1_1*(vgetq_lane_s32(p_1, 0) + vgetq_lane_s32(p_1, 1) + vgetq_lane_s32(p_1, 2) + vgetq_lane_s32(p_1, 3));
+#endif
+#else
+	    const int16x8_t pl0l = vmull_s8(vget_low_s8 (v0_0ls), vget_low_s8 (v1_0ls));
        const int16x8_t pl0h = vmull_s8(vget_high_s8(v0_0ls), vget_high_s8(v1_0ls));

        const int16x8_t ph0l = vmull_s8(vget_low_s8 (v0_0hs), vget_low_s8 (v1_0hs));
@@ -1388,6 +1470,7 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
 #else
        sum0 += d0_0*d1_0*(vgetq_lane_s16(p_0, 0) + vgetq_lane_s16(p_0, 1) + vgetq_lane_s16(p_0, 2) + vgetq_lane_s16(p_0, 3) + vgetq_lane_s16(p_0, 4) + vgetq_lane_s16(p_0, 5) + vgetq_lane_s16(p_0, 6) + vgetq_lane_s16(p_0, 7));
        sum1 += d0_1*d1_1*(vgetq_lane_s16(p_1, 0) + vgetq_lane_s16(p_1, 1) + vgetq_lane_s16(p_1, 2) + vgetq_lane_s16(p_1, 3) + vgetq_lane_s16(p_1, 4) + vgetq_lane_s16(p_1, 5) + vgetq_lane_s16(p_1, 6) + vgetq_lane_s16(p_1, 7));
+#endif
 #endif
    }

@@ -1395,6 +1478,40 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
 #else
 #error "not implemented for QK"
 #endif
+#elif defined(__AVX512F__)
+
+#if QK == 32
+    // Initialize accumulator with zeros
+    __m512 acc0 = _mm512_setzero_ps();
+    __m512 acc1 = _mm512_setzero_ps();
+
+    const int superblock_size = 8;
+    const int superblock_count = nb / superblock_size;
+    const int remainder = nb % superblock_size;
+
+    for (int superblock_ix = 0; superblock_ix < superblock_count; superblock_ix += 1) {
+        int i = superblock_ix * superblock_size;
+
+        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+0 );
+        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+1 );
+        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+2 );
+        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+3 );
+        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+4 );
+        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+5 );
+        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+6 );
+        acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+7 );
+    }
+
+    // Remainders
+    for (int i = superblock_count * superblock_size; i < nb; ++i) {
+        acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i );
+    }
+
+    // Horizontal sum of all lanes of the accumulator
+    sumf = _mm512_reduce_add_ps( acc0 ) + _mm512_reduce_add_ps( acc1 );
+#else
+#error "not implemented for QK"
+#endif
 #elif defined(__AVX2__)
 #if QK == 32
    const size_t countBlocks = nb;
@@ -1567,28 +1684,109 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
 inline static void ggml_vec_dot_q4_1(const int n, float * restrict s, const void * restrict x, const void * restrict y) {
    const int nb = n / QK;

-    const float * restrict pm0 = (const float *) x;
-    const float * restrict pm1 = (const float *) y;
+    const size_t bs = 2*sizeof(float) + QK/2;

-    const float * restrict pd0 = (const float *) (pm0 + nb);
-    const float * restrict pd1 = (const float *) (pm1 + nb);
+    const uint8_t * restrict pd0 = ((const uint8_t *)x + 0*bs);
+    const uint8_t * restrict pd1 = ((const uint8_t *)y + 0*bs);

-    const uint8_t * restrict pb0 = (const uint8_t *) (pd0 + nb);
-    const uint8_t * restrict pb1 = (const uint8_t *) (pd1 + nb);
+    const uint8_t * restrict pm0 = ((const uint8_t *)x + 0*bs + sizeof(float));
+    const uint8_t * restrict pm1 = ((const uint8_t *)y + 0*bs + sizeof(float));
+
+    const uint8_t * restrict pb0 = ((const uint8_t *)x + 0*bs + 2*sizeof(float));
+    const uint8_t * restrict pb1 = ((const uint8_t *)y + 0*bs + 2*sizeof(float));

    float sumf = 0.0;

-#if 1
+#if defined(__AVX2__)
+#if QK == 32
+    // Initialize accumulator with zeros
+    __m256 acc = _mm256_setzero_ps();
+    // Accumulator for constant offsets
+    float acc_offset = 0.0f;
+
+    // Main loop
+    for (int i = 0; i < nb; ++i) {
+        const float * m0 = (const float *) (pm0 + i*bs);
+        const float * m1 = (const float *) (pm1 + i*bs);
+
+        const float * d0 = (const float *) (pd0 + i*bs);
+        const float * d1 = (const float *) (pd1 + i*bs);
+
+        const uint8_t * restrict p0 = pb0 + i*bs;
+        const uint8_t * restrict p1 = pb1 + i*bs;
+
+        const __m256 d0v = _mm256_broadcast_ss( d0 );
+        const __m256 d1v = _mm256_broadcast_ss( d1 );
+        const __m256 m0v = _mm256_broadcast_ss( m0 );
+        const __m256 m1v = _mm256_broadcast_ss( m1 );
+
+
+        // Compute combined scale for the block
+        const __m256 scale_01 = _mm256_mul_ps( d0v, d1v );
+
+        // Compute cross scales for the block
+        const __m256 scale_0 = _mm256_mul_ps( d0v, m1v );
+        const __m256 scale_1 = _mm256_mul_ps( m0v, d1v );
+        const __m256 cross_scales = _mm256_blend_ps( scale_0, scale_1, 0b10101010 );
+
+        // Load 16 bytes, and unpack 4 bit fields into bytes, making 32 bytes
+        __m256i bx = bytesFromNibbles( p0 );
+        __m256i by = bytesFromNibbles( p1 );
+
+        // Now we have a vector with bytes in [ 0 .. 15 ] interval.
+
+        // Sign-extend first 16 signed bytes into int16_t
+        __m256i x16 = _mm256_cvtepi8_epi16( _mm256_castsi256_si128( bx ) );
+        __m256i y16 = _mm256_cvtepi8_epi16( _mm256_castsi256_si128( by ) );
+        // Compute products of int16_t integers, add pairwise
+        __m256i i32 = _mm256_madd_epi16( x16, y16 );
+
+        // Sign-extend last 16 signed bytes into int16_t vectors
+        __m256i x16_h = _mm256_cvtepi8_epi16( _mm256_extracti128_si256( bx, 1 ) );
+        __m256i y16_h = _mm256_cvtepi8_epi16( _mm256_extracti128_si256( by, 1 ) );
+        // Accumulate products of int16_t integers
+        i32 = _mm256_add_epi32( i32, _mm256_madd_epi16( x16_h, y16_h ) );
+
+        // compute sums of unsigned bytes in bx, by in blocks of 8.
+        // This results in a layout like X100 0000 X200 0000 X300 0000 X400 0000,
+        // which we then interleave as X100 Y100 X200 Y200 X300 Y300 X400 Y400.
+        // so if we then cast to 8 singles, we get 8 floats like [ x0_7, y0_7, x8_15, y8_15, x16_23, y16_23, x24_31, y24_31 ]
+        __m256i xsumi = _mm256_sad_epu8( bx, _mm256_setzero_si256() );
+        __m256i ysumi = _mm256_sad_epu8( by, _mm256_setzero_si256() );
+        __m256i sumsi = _mm256_or_si256( xsumi, _mm256_slli_si256( ysumi, 4 ) );
+        __m256  sums  = _mm256_cvtepi32_ps( sumsi );
+
+        // Convert int32_t to float
+        __m256 p = _mm256_cvtepi32_ps( i32 );
+        // Apply the scale, and accumulate
+        // acc += d0*d1*x*y + d0*m1*x + d1*m0*y
+        acc = _mm256_fmadd_ps( scale_01, p, acc );
+        acc = _mm256_fmadd_ps( cross_scales, sums, acc );
+        // acc_offset += m0*m1 (for each entry in the block)
+        acc_offset += (*m0)*(*m1);
+    }
+
+    // Return horizontal sum of the acc vector
+    __m128 res = _mm256_extractf128_ps( acc, 1 );
+    res = _mm_add_ps( res, _mm256_castps256_ps128( acc ) );
+    res = _mm_add_ps( res, _mm_movehl_ps( res, res ) );
+    res = _mm_add_ss( res, _mm_movehdup_ps( res ) );
+
+    sumf = _mm_cvtss_f32( res ) + acc_offset * QK;
+#else
+#error "not implemented for QK"
+#endif
+#else
    // scalar
    for (int i = 0; i < nb; i++) {
-        const float m0 = pm0[i];
-        const float m1 = pm1[i];
+        const float m0 = *(const float *) (pm0 + i*bs);
+        const float m1 = *(const float *) (pm1 + i*bs);

-        const float d0 = pd0[i];
-        const float d1 = pd1[i];
+        const float d0 = *(const float *) (pd0 + i*bs);
+        const float d1 = *(const float *) (pd1 + i*bs);

-        const uint8_t * restrict p0 = pb0 + i*QK/2;
-        const uint8_t * restrict p1 = pb1 + i*QK/2;
+        const uint8_t * restrict p0 = pb0 + i*bs;
+        const uint8_t * restrict p1 = pb1 + i*bs;

        for (int j = 0; j < QK/2; j++) {
            const uint8_t v0 = p0[j];
@@ -1822,16 +2020,17 @@ inline static void ggml_vec_mad_q4_1(const int n, float * restrict y, void * res
    assert(n % QK == 0);

    const int nb = n / QK;
+    const size_t bs = 2*sizeof(float) + QK/2;

-    const float   * restrict pm = (const float *)   (x);
-    const float   * restrict pd = (const float *)   (pm + nb);
-    const uint8_t * restrict pb = (const uint8_t *) (pd + nb);
+    const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs);
+    const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs +   sizeof(float));
+    const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + 2*sizeof(float));

    for (int i = 0; i < nb; i++) {
-        const float m = pm[i];
-        const float d = pd[i];
+        const float d = *(const float *) (pd + i*bs);
+        const float m = *(const float *) (pm + i*bs);

-        const uint8_t * restrict pp = pb + i*QK/2;
+        const uint8_t * restrict pp = pb + i*bs;

        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = pp[l/2];
@@ -2052,6 +2251,7 @@ static const char * GGML_OP_LABEL[GGML_OP_COUNT] = {
    "GELU",
    "SILU",
    "NORM",
+    "RMS_NORM",

    "MUL_MAT",

@@ -2072,7 +2272,7 @@ static const char * GGML_OP_LABEL[GGML_OP_COUNT] = {
    "FLASH_FF",
 };

-static_assert(GGML_OP_COUNT == 34, "GGML_OP_COUNT != 34");
+static_assert(GGML_OP_COUNT == 35, "GGML_OP_COUNT != 35");

 static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
    "none",
@@ -2095,6 +2295,7 @@ static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
    "gelu(x)",
    "silu(x)",
    "norm(x)",
+    "rms_norm(x)",

    "X*Y",

@@ -2115,7 +2316,7 @@ static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
    "flash_ff(x)",
 };

-static_assert(GGML_OP_COUNT == 34, "GGML_OP_COUNT != 34");
+static_assert(GGML_OP_COUNT == 35, "GGML_OP_COUNT != 35");

 //
 // ggml object
@@ -3601,6 +3802,39 @@ struct ggml_tensor * ggml_norm_inplace(
    return ggml_norm_impl(ctx, a, true);
 }

+struct ggml_tensor * ggml_rms_norm_impl(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * a,
+        bool inplace) {
+    bool is_node = false;
+
+    if (!inplace && (a->grad)) {
+        GGML_ASSERT(false); // TODO: implement backward
+        is_node = true;
+    }
+
+    struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);
+
+    result->op   = GGML_OP_RMS_NORM;
+    result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
+    result->src0 = a;
+    result->src1 = NULL; // TODO: maybe store epsilon here?
+
+    return result;
+}
+
+struct ggml_tensor * ggml_rms_norm(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * a) {
+    return ggml_rms_norm_impl(ctx, a, false);
+}
+
+struct ggml_tensor * ggml_rms_norm_inplace(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * a) {
+    return ggml_rms_norm_impl(ctx, a, true);
+}
+
 // ggml_mul_mat

 struct ggml_tensor * ggml_mul_mat(
@@ -5389,6 +5623,87 @@ static void ggml_compute_forward_norm(
    }
 }

+static void ggml_compute_forward_rms_norm_f32(
+        const struct ggml_compute_params * params,
+        const struct ggml_tensor * src0,
+        struct ggml_tensor * dst) {
+    GGML_ASSERT(ggml_are_same_shape(src0, dst));
+
+    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
+        return;
+    }
+
+    GGML_ASSERT(src0->nb[0] == sizeof(float));
+
+    const int ith = params->ith;
+    const int nth = params->nth;
+
+    const int ne00 = src0->ne[0];
+    const int ne01 = src0->ne[1];
+    const int ne02 = src0->ne[2];
+    const int ne03 = src0->ne[3];
+
+    const size_t nb01 = src0->nb[1];
+    const size_t nb02 = src0->nb[2];
+    const size_t nb03 = src0->nb[3];
+
+    const size_t nb1 = dst->nb[1];
+    const size_t nb2 = dst->nb[2];
+    const size_t nb3 = dst->nb[3];
+
+    const ggml_float eps = 1e-6f; // TODO: make this a parameter
+
+    // TODO: optimize
+    for (int i03 = 0; i03 < ne03; i03++) {
+        for (int i02 = 0; i02 < ne02; i02++) {
+            for (int i01 = ith; i01 < ne01; i01 += nth) {
+                const float * x = (float *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);
+
+                ggml_float mean = 0.0;
+                for (int i00 = 0; i00 < ne00; i00++) {
+                    mean += x[i00] * x[i00];
+                }
+
+                mean /= ne00;
+
+                float * y = (float *) ((char *) dst->data + i01*nb1 + i02*nb2 + i03*nb3);
+
+                memcpy(y, x, ne00 * sizeof(float));
+                // for (int i00 = 0; i00 < ne00; i00++) {
+                //     y[i00] = x[i00];
+                // }
+
+                const float scale = 1.0/sqrt(mean + eps);
+
+                ggml_vec_scale_f32(ne00, y, scale);
+            }
+        }
+    }
+}
+
+static void ggml_compute_forward_rms_norm(
+        const struct ggml_compute_params * params,
+        const struct ggml_tensor * src0,
+        struct ggml_tensor * dst) {
+    switch (src0->type) {
+        case GGML_TYPE_F32:
+            {
+                ggml_compute_forward_rms_norm_f32(params, src0, dst);
+            } break;
+        case GGML_TYPE_Q4_0:
+        case GGML_TYPE_Q4_1:
+        case GGML_TYPE_I8:
+        case GGML_TYPE_I16:
+        case GGML_TYPE_I32:
+        case GGML_TYPE_F16:
+        case GGML_TYPE_COUNT:
+            {
+                GGML_ASSERT(false);
+            } break;
+    }
+}
+
+
 // ggml_compute_forward_mul_mat

 #if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS)
@@ -8505,6 +8820,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
            {
                ggml_compute_forward_norm(params, tensor->src0, tensor);
            } break;
+        case GGML_OP_RMS_NORM:
+            {
+                ggml_compute_forward_rms_norm(params, tensor->src0, tensor);
+            } break;
        case GGML_OP_MUL_MAT:
            {
                ggml_compute_forward_mul_mat(params, tensor->src0, tensor->src1, tensor);
@@ -8747,6 +9066,10 @@ static void ggml_compute_backward(struct ggml_context * ctx, struct ggml_tensor
            {
                GGML_ASSERT(false); // TODO: not implemented
            } break;
+        case GGML_OP_RMS_NORM:
+            {
+                GGML_ASSERT(false); // TODO: not implemented
+            } break;
        case GGML_OP_MUL_MAT:
            {
                if (src0->grad) {
@@ -9090,10 +9413,6 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
 }

 void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph) {
-    if (cgraph->n_threads <= 0) {
-        cgraph->n_threads = 8;
-    }
-
    const int n_threads = cgraph->n_threads;

    struct ggml_compute_state_shared state_shared = {
@@ -9173,6 +9492,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
                        node->n_tasks = n_threads;
                    } break;
                case GGML_OP_NORM:
+                case GGML_OP_RMS_NORM:
                    {
                        node->n_tasks = n_threads;
                    } break;
@@ -10403,6 +10723,68 @@ enum ggml_opt_result ggml_opt(

 ////////////////////////////////////////////////////////////////////////////////

+size_t ggml_quantize_q4_0(const float * src, void * dst, int n, int k, int qk, int64_t * hist) {
+    const int nb = k / qk;
+    const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
+    const size_t row_size = nb*bs;
+
+    assert(k % qk == 0);
+
+    char * pdst = (char *) dst;
+
+    for (int j = 0; j < n; j += k) {
+        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
+        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
+
+        quantize_row_q4_0_reference(src + j, pd, k);
+
+        for (int i = 0; i < nb; i++) {
+            for (int l = 0; l < qk; l += 2) {
+                const uint8_t vi0 = pb[l/2] & 0xF;
+                const uint8_t vi1 = pb[l/2] >> 4;
+
+                hist[vi0]++;
+                hist[vi1]++;
+            }
+            pb += bs;
+        }
+    }
+
+    return (n/k)*row_size;
+}
+
+size_t ggml_quantize_q4_1(const float * src, void * dst, int n, int k, int qk, int64_t * hist) {
+    const int nb = k / qk;
+    const size_t bs = (2*sizeof(float) + sizeof(uint8_t)*qk/2);
+    const size_t row_size = nb*bs;
+
+    assert(k % qk == 0);
+
+    char * pdst = (char *) dst;
+
+    for (int j = 0; j < n; j += k) {
+        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
+        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + 2*sizeof(float));
+
+        quantize_row_q4_1(src + j, pd, k);
+
+        for (int i = 0; i < nb; i++) {
+            for (int l = 0; l < qk; l += 2) {
+                const uint8_t vi0 = pb[l/2] & 0xF;
+                const uint8_t vi1 = pb[l/2] >> 4;
+
+                hist[vi0]++;
+                hist[vi1]++;
+            }
+            pb += bs;
+        }
+    }
+
+    return (n/k)*row_size;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+
 int ggml_cpu_has_avx(void) {
 #if defined(__AVX__)
    return 1;
--- a/ggml.h
+++ b/ggml.h
@@ -230,6 +230,7 @@ enum ggml_op {
    GGML_OP_GELU,
    GGML_OP_SILU,
    GGML_OP_NORM, // normalize
+    GGML_OP_RMS_NORM,

    GGML_OP_MUL_MAT,

@@ -482,6 +483,10 @@ struct ggml_tensor * ggml_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);

+struct ggml_tensor * ggml_rms_norm(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * a);
+
 // A: m rows, n columns
 // B: p rows, n columns (i.e. we transpose it internally)
 // result is m columns, p rows
@@ -736,6 +741,13 @@ enum ggml_opt_result ggml_opt(
        struct ggml_opt_params params,
        struct ggml_tensor * f);

+//
+// quantization
+//
+
+size_t ggml_quantize_q4_0(const float * src, void * dst, int n, int k, int qk, int64_t * hist);
+size_t ggml_quantize_q4_1(const float * src, void * dst, int n, int k, int qk, int64_t * hist);
+
 //
 // system info
 //
--- a/llama.cpp
+++ b/llama.cpp
--- a/llama.h
+++ b/llama.h
@@ -0,0 +1,139 @@
+#ifndef LLAMA_H
+#define LLAMA_H
+
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#ifdef LLAMA_SHARED
+#    ifdef _WIN32
+#        ifdef LLAMA_BUILD
+#            define LLAMA_API __declspec(dllexport)
+#        else
+#            define LLAMA_API __declspec(dllimport)
+#        endif
+#    else
+#        define LLAMA_API __attribute__ ((visibility ("default")))
+#    endif
+#else
+#    define LLAMA_API
+#endif
+
+#define LLAMA_FILE_VERSION 1
+#define LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex
+#define LLAMA_FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+    //
+    // C interface
+    //
+    // TODO: show sample usage
+    //
+
+    struct llama_context;
+
+    typedef int llama_token;
+
+    typedef struct llama_token_data {
+        llama_token id;  // token id
+
+        float p;     // probability of the token
+        float plog;  // log probability of the token
+
+    } llama_token_data;
+
+    struct llama_context_params {
+        int n_ctx;   // text context
+        int n_parts; // -1 for default
+        int seed;    // RNG seed, 0 for random
+
+        bool f16_kv;     // use fp16 for KV cache
+        bool logits_all; // the llama_eval() call computes all logits, not just the last one
+        bool vocab_only; // only load the vocabulary, no weights
+    };
+
+    LLAMA_API struct llama_context_params llama_context_default_params();
+
+    // Various functions for loading a ggml llama model.
+    // Allocate (almost) all memory needed for the model.
+    // Return NULL on failure
+    LLAMA_API struct llama_context * llama_init_from_file(
+                             const char * path_model,
+            struct llama_context_params   params);
+
+    // Frees all allocated memory
+    LLAMA_API void llama_free(struct llama_context * ctx);
+
+    // TODO: not great API - very likely to change
+    // Returns 0 on success
+    LLAMA_API int llama_model_quantize(
+            const char * fname_inp,
+            const char * fname_out,
+                   int   itype,
+                   int   qk);
+
+    // Run the llama inference to obtain the logits and probabilities for the next token.
+    // tokens + n_tokens is the provided batch of new tokens to process
+    // n_past is the number of tokens to use from previous eval calls
+    // Returns 0 on success
+    LLAMA_API int llama_eval(
+            struct llama_context * ctx,
+               const llama_token * tokens,
+                             int   n_tokens,
+                             int   n_past,
+                             int   n_threads);
+
+    // Convert the provided text into tokens.
+    // The tokens pointer must be large enough to hold the resulting tokens.
+    // Returns the number of tokens on success, no more than n_max_tokens
+    // Returns a negative number on failure - the number of tokens that would have been returned
+    // TODO: not sure if correct
+    LLAMA_API int llama_tokenize(
+            struct llama_context * ctx,
+                      const char * text,
+                     llama_token * tokens,
+                             int   n_max_tokens,
+                            bool   add_bos);
+
+    LLAMA_API int llama_n_vocab(struct llama_context * ctx);
+    LLAMA_API int llama_n_ctx  (struct llama_context * ctx);
+
+    // Token logits obtained from the last call to llama_eval()
+    // The logits for the last token are stored in the last row
+    // Can be mutated in order to change the probabilities of the next token
+    // Rows: n_tokens
+    // Cols: n_vocab
+    LLAMA_API float * llama_get_logits(struct llama_context * ctx);
+
+    // Token Id -> String. Uses the vocabulary in the provided context
+    LLAMA_API const char * llama_token_to_str(struct llama_context * ctx, llama_token token);
+
+    // Special tokens
+    LLAMA_API llama_token llama_token_bos();
+    LLAMA_API llama_token llama_token_eos();
+
+    // TODO: improve the last_n_tokens interface ?
+    LLAMA_API llama_token llama_sample_top_p_top_k(
+              llama_context * ctx,
+          const llama_token * last_n_tokens_data,
+                        int   last_n_tokens_size,
+                        int   top_k,
+                     double   top_p,
+                     double   temp,
+                     double   repeat_penalty);
+
+    // Performance information
+    LLAMA_API void llama_print_timings(struct llama_context * ctx);
+    LLAMA_API void llama_reset_timings(struct llama_context * ctx);
+
+    // Print system information
+    LLAMA_API const char * llama_print_system_info(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
--- a/main.cpp
+++ b/main.cpp
--- a/models/.gitignore
+++ b/models/.gitignore
--- a/models/ggml-vocab.bin
+++ b/models/ggml-vocab.bin
--- a/prompts/alpaca.txt
+++ b/prompts/alpaca.txt
@@ -0,0 +1 @@
+Below is an instruction that describes a task. Write a response that appropriately completes the request.
--- a/prompts/chat-with-bob.txt
+++ b/prompts/chat-with-bob.txt
@@ -0,0 +1,7 @@
+Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
+
+User: Hello, Bob.
+Bob: Hello. How may I help you today?
+User: Please tell me the largest city in Europe.
+Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
+User:
--- a/quantize.cpp
+++ b/quantize.cpp
@@ -1,295 +1,17 @@
 #include "ggml.h"
+#include "llama.h"

-#include "utils.h"
-
-#include <cassert>
-#include <cmath>
 #include <cstdio>
-#include <cstring>
-#include <fstream>
-#include <map>
 #include <string>
-#include <vector>
-#include <regex>

-// TODO: move somewhere else
-#define QK 32
-
-// default hparams (LLaMA76B)
-struct llama_hparams {
-    int32_t n_vocab = 32000;
-    int32_t n_ctx   = 512;   // this is provided as user input?
-    int32_t n_embd  = 4096;
-    int32_t n_mult  = 256;
-    int32_t n_head  = 32;
-    int32_t n_layer = 32;
-    int32_t n_rot   = 64;
-    int32_t f16     = 1;
-};
-
-
-// quantize a model
-bool llama_model_quantize(const std::string & fname_inp, const std::string & fname_out, int itype) {
-    ggml_type type = GGML_TYPE_Q4_1;
-
-    switch (itype) {
-        case 2: type = GGML_TYPE_Q4_0; break;
-        case 3: type = GGML_TYPE_Q4_1; break;
-        default: fprintf(stderr, "%s: invalid quantization type %d\n", __func__, itype); return 1;
-    };
-
-    if (type != GGML_TYPE_Q4_0 && type != GGML_TYPE_Q4_1) {
-        fprintf(stderr, "%s: invalid quantization type %d\n", __func__, type);
-        return false;
-    }
-
-    gpt_vocab vocab;
-
-    printf("%s: loading model from '%s'\n", __func__, fname_inp.c_str());
-
-    auto finp = std::ifstream(fname_inp, std::ios::binary);
-    if (!finp) {
-        fprintf(stderr, "%s: failed to open '%s' for reading\n", __func__, fname_inp.c_str());
-        return false;
-    }
-
-    auto fout = std::ofstream(fname_out, std::ios::binary);
-    if (!fout) {
-        fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname_out.c_str());
-        return false;
-    }
-
-    // verify magic
-    {
-        uint32_t magic;
-        finp.read((char *) &magic, sizeof(magic));
-        if (magic != 0x67676d6c) {
-            fprintf(stderr, "%s: invalid model file '%s' (bad magic)\n", __func__, fname_inp.c_str());
-            return false;
-        }
-
-        fout.write((char *) &magic, sizeof(magic));
-    }
-
-    llama_hparams hparams;
-
-    // load hparams
-    {
-        finp.read((char *) &hparams.n_vocab, sizeof(hparams.n_vocab));
-        //finp.read((char *) &hparams.n_ctx,   sizeof(hparams.n_ctx));
-        finp.read((char *) &hparams.n_embd,  sizeof(hparams.n_embd));
-        finp.read((char *) &hparams.n_mult,  sizeof(hparams.n_mult));
-        finp.read((char *) &hparams.n_head,  sizeof(hparams.n_head));
-        finp.read((char *) &hparams.n_layer, sizeof(hparams.n_layer));
-        finp.read((char *) &hparams.n_rot,   sizeof(hparams.n_rot));
-        finp.read((char *) &hparams.f16,     sizeof(hparams.f16));
-
-        printf("%s: n_vocab = %d\n", __func__, hparams.n_vocab);
-        printf("%s: n_ctx   = %d\n", __func__, hparams.n_ctx);
-        printf("%s: n_embd  = %d\n", __func__, hparams.n_embd);
-        printf("%s: n_mult  = %d\n", __func__, hparams.n_mult);
-        printf("%s: n_head  = %d\n", __func__, hparams.n_head);
-        printf("%s: n_layer = %d\n", __func__, hparams.n_layer);
-        printf("%s: f16     = %d\n", __func__, hparams.f16);
-
-        fout.write((char *) &hparams.n_vocab, sizeof(hparams.n_vocab));
-        //fout.write((char *) &hparams.n_ctx,   sizeof(hparams.n_ctx));
-        fout.write((char *) &hparams.n_embd,  sizeof(hparams.n_embd));
-        fout.write((char *) &hparams.n_mult,  sizeof(hparams.n_mult));
-        fout.write((char *) &hparams.n_head,  sizeof(hparams.n_head));
-        fout.write((char *) &hparams.n_layer, sizeof(hparams.n_layer));
-        fout.write((char *) &hparams.n_rot,   sizeof(hparams.n_rot));
-        fout.write((char *) &itype,           sizeof(hparams.f16));
-    }
-
-    // load vocab
-    {
-        const int32_t n_vocab = hparams.n_vocab;
-
-        if (n_vocab != hparams.n_vocab) {
-            fprintf(stderr, "%s: invalid model file '%s' (bad vocab size %d != %d)\n",
-                    __func__, fname_inp.c_str(), n_vocab, hparams.n_vocab);
-            return false;
-        }
-
-        std::string word;
-        for (int i = 0; i < n_vocab; i++) {
-            uint32_t len;
-            finp.read ((char *) &len, sizeof(len));
-            fout.write((char *) &len, sizeof(len));
-
-            word.resize(len);
-            finp.read ((char *) word.data(), len);
-            fout.write((char *) word.data(), len);
-
-            vocab.token_to_id[word] = i;
-            vocab.id_to_token[i] = word;
-        }
-    }
-
-    // load weights
-    {
-        size_t total_size_org = 0;
-        size_t total_size_new = 0;
-
-        std::vector<float> work;
-
-        std::vector<uint8_t>     data_u8;
-        std::vector<ggml_fp16_t> data_f16;
-        std::vector<float>       data_f32;
-
-        std::vector<int64_t> hist_all(1 << 4, 0);
-
-        while (true) {
-            int32_t n_dims;
-            int32_t length;
-            int32_t ftype;
-
-            finp.read(reinterpret_cast<char *>(&n_dims), sizeof(n_dims));
-            finp.read(reinterpret_cast<char *>(&length), sizeof(length));
-            finp.read(reinterpret_cast<char *>(&ftype),  sizeof(ftype));
-
-            if (finp.eof()) {
-                break;
-            }
-
-            int32_t nelements = 1;
-            int32_t ne[2] = { 1, 1 };
-            for (int i = 0; i < n_dims; ++i) {
-                finp.read (reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
-                nelements *= ne[i];
-            }
-
-            std::string name(length, 0);
-            finp.read (&name[0], length);
-
-            {
-                static const char * ftype_str[] = { "f32", "f16", "q4_0", "q4_1", };
-                printf("%48s - [%5d, %5d], type = %6s ", name.data(), ne[0], ne[1], ftype_str[ftype]);
-            }
-
-            // regexes of tensor names to be quantized
-            const std::vector<std::string> k_names = {
-                ".*weight",
-            };
-
-            bool quantize = false;
-            for (const auto & s : k_names) {
-                if (std::regex_match(name, std::regex(s))) {
-                    quantize = true;
-                    break;
-                }
-            }
-
-            // quantize only 2D tensors
-            quantize &= (n_dims == 2);
-
-            if (quantize) {
-                if (ftype != 0 && ftype != 1) {
-                    fprintf(stderr, "%s: unsupported ftype %d for integer quantization\n", __func__, ftype);
-                    return false;
-                }
-
-                if (ftype == 1) {
-                    data_f16.resize(nelements);
-                    finp.read(reinterpret_cast<char *>(data_f16.data()), nelements * sizeof(ggml_fp16_t));
-                    data_f32.resize(nelements);
-                    for (int i = 0; i < nelements; ++i) {
-                        data_f32[i] = ggml_fp16_to_fp32(data_f16[i]);
-                    }
-                } else {
-                    data_f32.resize(nelements);
-                    finp.read(reinterpret_cast<char *>(data_f32.data()), nelements * sizeof(float));
-                }
-
-                ftype = itype;
-            } else {
-                const int bpe = (ftype == 0) ? sizeof(float) : sizeof(uint16_t);
-
-                data_u8.resize(nelements*bpe);
-                finp.read(reinterpret_cast<char *>(data_u8.data()), nelements * bpe);
-            }
-
-            fout.write(reinterpret_cast<char *>(&n_dims), sizeof(n_dims));
-            fout.write(reinterpret_cast<char *>(&length), sizeof(length));
-            fout.write(reinterpret_cast<char *>(&ftype),  sizeof(ftype));
-            for (int i = 0; i < n_dims; ++i) {
-                fout.write(reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
-            }
-            fout.write(&name[0], length);
-
-            if (quantize) {
-                printf("quantizing .. ");
-                work.resize(nelements); // for quantization
-
-                size_t cur_size = 0;
-                std::vector<int64_t> hist_cur(1 << 4, 0);
-
-                switch (type) {
-                    case GGML_TYPE_Q4_0:
-                        {
-                            cur_size = ggml_quantize_q4_0(data_f32.data(), work.data(), nelements, ne[0], QK, hist_cur.data());
-                        } break;
-                    case GGML_TYPE_Q4_1:
-                        {
-                            cur_size = ggml_quantize_q4_1(data_f32.data(), work.data(), nelements, ne[0], QK, hist_cur.data());
-                        } break;
-                    default:
-                        {
-                            fprintf(stderr, "%s: unsupported quantization type %d\n", __func__, type);
-                            return false;
-                        }
-                }
-
-                fout.write(reinterpret_cast<char *>(work.data()), cur_size);
-                total_size_new += cur_size;
-
-                printf("size = %8.2f MB -> %8.2f MB | hist: ", nelements * sizeof(float)/1024.0/1024.0, cur_size/1024.0/1024.0);
-                for (int i = 0; i < hist_cur.size(); ++i) {
-                    hist_all[i] += hist_cur[i];
-                }
-
-                for (int i = 0; i < hist_cur.size(); ++i) {
-                    printf("%5.3f ", hist_cur[i] / (float)nelements);
-                }
-                printf("\n");
-            } else {
-                printf("size = %8.3f MB\n", data_u8.size()/1024.0/1024.0);
-                fout.write(reinterpret_cast<char *>(data_u8.data()), data_u8.size());
-                total_size_new += data_u8.size();
-            }
-
-            total_size_org += nelements * sizeof(float);
-        }
-
-        printf("%s: model size  = %8.2f MB\n", __func__, total_size_org/1024.0/1024.0);
-        printf("%s: quant size  = %8.2f MB\n", __func__, total_size_new/1024.0/1024.0);
-
-        {
-            int64_t sum_all = 0;
-            for (int i = 0; i < hist_all.size(); ++i) {
-                sum_all += hist_all[i];
-            }
-
-            printf("%s: hist: ", __func__);
-            for (int i = 0; i < hist_all.size(); ++i) {
-                printf("%5.3f ", hist_all[i] / (float)sum_all);
-            }
-            printf("\n");
-        }
-    }
-
-    finp.close();
-    fout.close();
-
-    return true;
-}
+const int QK = 32;

 // usage:
 //  ./llama-quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
 //
 int main(int argc, char ** argv) {
    ggml_time_init();
+
    if (argc != 4) {
        fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type\n", argv[0]);
        fprintf(stderr, "  type = 2 - q4_0\n");
@@ -317,7 +39,7 @@ int main(int argc, char ** argv) {
    {
        const int64_t t_start_us = ggml_time_us();

-        if (!llama_model_quantize(fname_inp, fname_out, itype)) {
+        if (llama_model_quantize(fname_inp.c_str(), fname_out.c_str(), itype, QK)) {
            fprintf(stderr, "%s: failed to quantize model from '%s'\n", __func__, fname_inp.c_str());
            return 1;
        }
--- a/quantize.py
+++ b/quantize.py
@@ -0,0 +1,126 @@
+#!/usr/bin/env python3
+
+"""Script to execute the "quantize" script on a given set of models."""
+
+import subprocess
+import argparse
+import glob
+import sys
+import os
+
+
+def main():
+    """Update the quantize binary name depending on the platform and parse
+    the command line arguments and execute the script.
+    """
+
+    if "linux" in sys.platform or "darwin" in sys.platform:
+        quantize_script_binary = "quantize"
+
+    elif "win32" in sys.platform or "cygwin" in sys.platform:
+        quantize_script_binary = "quantize.exe"
+
+    else:
+        print("WARNING: Unknown platform. Assuming a UNIX-like OS.\n")
+        quantize_script_binary = "quantize"
+
+    parser = argparse.ArgumentParser(
+        prog='python3 quantize.py',
+        description='This script quantizes the given models by applying the '
+        f'"{quantize_script_binary}" script on them.'
+    )
+    parser.add_argument(
+        'models', nargs='+', choices=('7B', '13B', '30B', '65B'),
+        help='The models to quantize.'
+    )
+    parser.add_argument(
+        '-r', '--remove-16', action='store_true', dest='remove_f16',
+        help='Remove the f16 model after quantizing it.'
+    )
+    parser.add_argument(
+        '-m', '--models-path', dest='models_path',
+        default=os.path.join(os.getcwd(), "models"),
+        help='Specify the directory where the models are located.'
+    )
+    parser.add_argument(
+        '-q', '--quantize-script-path', dest='quantize_script_path',
+        default=os.path.join(os.getcwd(), quantize_script_binary),
+        help='Specify the path to the "quantize" script.'
+    )
+
+    # TODO: Revise this code
+    # parser.add_argument(
+    #     '-t', '--threads', dest='threads', type='int',
+    #     default=os.cpu_count(),
+    #     help='Specify the number of threads to use to quantize many models at '
+    #     'once. Defaults to os.cpu_count().'
+    # )
+
+    args = parser.parse_args()
+
+    if not os.path.isfile(args.quantize_script_path):
+        print(
+            f'The "{quantize_script_binary}" script was not found in the '
+            "current location.\nIf you want to use it from another location, "
+            "set the --quantize-script-path argument from the command line."
+        )
+        sys.exit(1)
+
+    for model in args.models:
+        # The model is separated in various parts
+        # (ggml-model-f16.bin, ggml-model-f16.bin.0, ggml-model-f16.bin.1...)
+        f16_model_path_base = os.path.join(
+            args.models_path, model, "ggml-model-f16.bin"
+        )
+
+        f16_model_parts_paths = map(
+            lambda filename: os.path.join(f16_model_path_base, filename),
+            glob.glob(f"{f16_model_path_base}*")
+        )
+
+        for f16_model_part_path in f16_model_parts_paths:
+            if not os.path.isfile(f16_model_part_path):
+                print(
+                    f"The f16 model {os.path.basename(f16_model_part_path)} "
+                    f"was not found in {args.models_path}{os.path.sep}{model}"
+                    ". If you want to use it from another location, set the "
+                    "--models-path argument from the command line."
+                )
+                sys.exit(1)
+
+            __run_quantize_script(
+                args.quantize_script_path, f16_model_part_path
+            )
+
+            if args.remove_f16:
+                os.remove(f16_model_part_path)
+
+
+# This was extracted to a top-level function for parallelization, if
+# implemented. See https://github.com/ggerganov/llama.cpp/pull/222/commits/f8db3d6cd91bf1a1342db9d29e3092bc12dd783c#r1140496406
+
+def __run_quantize_script(script_path, f16_model_part_path):
+    """Run the quantize script specifying the path to it and the path to the
+    f16 model to quantize.
+    """
+
+    new_quantized_model_path = f16_model_part_path.replace("f16", "q4_0")
+    subprocess.run(
+        [script_path, f16_model_part_path, new_quantized_model_path, "2"],
+        check=True
+    )
+
+
+if __name__ == "__main__":
+    try:
+        main()
+
+    except subprocess.CalledProcessError:
+        print("\nAn error ocurred while trying to quantize the models.")
+        sys.exit(1)
+
+    except KeyboardInterrupt:
+        sys.exit(0)
+
+    else:
+        print("\nSuccesfully quantized all models.")
--- a/run.cpp
+++ b/run.cpp
@@ -0,0 +1,364 @@
+#include "utils.h"
+#include "llama.h"
+
+#include <cassert>
+#include <cinttypes>
+#include <cmath>
+#include <cstdio>
+#include <cstring>
+#include <fstream>
+#include <iostream>
+#include <string>
+#include <vector>
+
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
+#include <signal.h>
+#include <unistd.h>
+#elif defined (_WIN32)
+#include <signal.h>
+#endif
+
+#if defined (_WIN32)
+#pragma comment(lib,"kernel32.lib")
+extern "C" __declspec(dllimport) void* __stdcall GetStdHandle(unsigned long nStdHandle);
+extern "C" __declspec(dllimport) int __stdcall GetConsoleMode(void* hConsoleHandle, unsigned long* lpMode);
+extern "C" __declspec(dllimport) int __stdcall SetConsoleMode(void* hConsoleHandle, unsigned long dwMode);
+#endif
+
+#define ANSI_COLOR_RED     "\x1b[31m"
+#define ANSI_COLOR_GREEN   "\x1b[32m"
+#define ANSI_COLOR_YELLOW  "\x1b[33m"
+#define ANSI_COLOR_BLUE    "\x1b[34m"
+#define ANSI_COLOR_MAGENTA "\x1b[35m"
+#define ANSI_COLOR_CYAN    "\x1b[36m"
+#define ANSI_COLOR_RESET   "\x1b[0m"
+#define ANSI_BOLD          "\x1b[1m"
+
+/* Keep track of current color of output, and emit ANSI code if it changes. */
+enum console_state {
+    CONSOLE_STATE_DEFAULT=0,
+    CONSOLE_STATE_PROMPT,
+    CONSOLE_STATE_USER_INPUT
+};
+
+static console_state con_st = CONSOLE_STATE_DEFAULT;
+static bool con_use_color = false;
+
+void set_console_state(FILE *stream, console_state new_st)
+{
+    if (!con_use_color) return;
+    // only emit color code if state changed
+    if (new_st != con_st) {
+        con_st = new_st;
+        switch(con_st) {
+        case CONSOLE_STATE_DEFAULT:
+            fprintf(stream, ANSI_COLOR_RESET);
+            return;
+        case CONSOLE_STATE_PROMPT:
+            fprintf(stream, ANSI_COLOR_YELLOW);
+            return;
+        case CONSOLE_STATE_USER_INPUT:
+            fprintf(stream, ANSI_BOLD ANSI_COLOR_GREEN);
+            return;
+        }
+    }
+}
+
+static bool is_interacting = false;
+
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
+void sigint_handler(int signo) {
+    set_console_state(stdout, CONSOLE_STATE_DEFAULT);
+    printf("\n"); // this also force flush stdout.
+    if (signo == SIGINT) {
+        if (!is_interacting) {
+            is_interacting=true;
+        } else {
+            _exit(130);
+        }
+    }
+}
+#endif
+
+int run(llama_context * ctx,
+        gpt_params params,
+        std::istream & instream,
+        FILE *outstream,
+        FILE *errstream) {
+
+    if (params.seed <= 0) {
+        params.seed = time(NULL);
+    }
+
+    fprintf(errstream, "%s: seed = %d\n", __func__, params.seed);
+
+    std::mt19937 rng(params.seed);
+    if (params.random_prompt) {
+        params.prompt = gpt_random_prompt(rng);
+    }
+
+    // save choice to use color for later
+    // (note for later: this is a slightly awkward choice)
+    con_use_color = params.use_color;
+
+//    params.prompt = R"(// this function checks if the number n is prime
+//bool is_prime(int n) {)";
+
+    // determine the required inference memory per token:
+    // TODO: better way to do that
+    {
+        const std::vector<llama_token> tmp = { 0, 1, 2, 3 };
+        llama_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);
+    }
+
+    int n_past = 0;
+
+    // Add a space in front of the first character to match OG llama tokenizer behavior
+    params.prompt.insert(0, 1, ' ');
+
+    // tokenize the prompt
+    auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);
+
+    const int n_ctx = llama_n_ctx(ctx);
+
+    params.n_predict = std::min(params.n_predict, n_ctx - (int) embd_inp.size());
+
+    // prefix & suffix for instruct mode
+    const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
+    const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
+
+    // in instruct mode, we inject a prefix and a suffix to each input by the user
+    if (params.instruct) {
+        params.interactive = true;
+        params.antiprompt.push_back("### Instruction:\n\n");
+    }
+
+    // enable interactive mode if reverse prompt is specified
+    if (params.antiprompt.size() != 0) {
+        params.interactive = true;
+    }
+
+    if (params.interactive_start) {
+        params.interactive = true;
+    }
+
+    fprintf(errstream, "\n");
+    fprintf(errstream, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
+    fprintf(errstream, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
+    for (int i = 0; i < (int) embd_inp.size(); i++) {
+        fprintf(errstream, "%6d -> '%s'\n", embd_inp[i], llama_token_to_str(ctx, embd_inp[i]));
+    }
+    fprintf(errstream, "\n");
+    if (params.interactive) {
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
+        struct sigaction sigint_action;
+        sigint_action.sa_handler = sigint_handler;
+        sigemptyset (&sigint_action.sa_mask);
+        sigint_action.sa_flags = 0;
+        sigaction(SIGINT, &sigint_action, NULL);
+#elif defined (_WIN32)
+        signal(SIGINT, sigint_handler);
+#endif
+
+        fprintf(errstream, "%s: interactive mode on.\n", __func__);
+
+        if(params.antiprompt.size()) {
+            for (auto antiprompt : params.antiprompt) {
+                fprintf(errstream, "Reverse prompt: '%s'\n", antiprompt.c_str());
+            }
+        }
+    }
+    fprintf(errstream, "sampling parameters: temp = %f, top_k = %d, top_p = %f, repeat_last_n = %i, repeat_penalty = %f\n", params.temp, params.top_k, params.top_p, params.repeat_last_n, params.repeat_penalty);
+    fprintf(errstream, "\n\n");
+
+    std::vector<llama_token> embd;
+
+    int last_n_size = params.repeat_last_n;
+    std::vector<llama_token> last_n_tokens(last_n_size);
+    std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
+
+    if (params.interactive) {
+        fprintf(errstream, "== Running in interactive mode. ==\n"
+#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
+               " - Press Ctrl+C to interject at any time.\n"
+#endif
+               " - Press Return to return control to LLaMa.\n"
+               " - If you want to submit another line, end your input in '\\'.\n\n");
+        is_interacting = params.interactive_start;
+    }
+
+    int input_consumed = 0;
+    bool input_noecho = false;
+
+    int remaining_tokens = params.n_predict;
+
+#if defined (_WIN32)
+  if (params.use_color) {
+        // Enable ANSI colors on Windows 10+
+        unsigned long dwMode = 0;
+        void* hConOut = GetStdHandle((unsigned long)-11); // STD_OUTPUT_HANDLE (-11)
+        if (hConOut && hConOut != (void*)-1 && GetConsoleMode(hConOut, &dwMode) && !(dwMode & 0x4)) {
+            SetConsoleMode(hConOut, dwMode | 0x4); // ENABLE_VIRTUAL_TERMINAL_PROCESSING (0x4)
+        }
+    }
+#endif
+    // the first thing we will do is to output the prompt, so set color accordingly
+    set_console_state(outstream, CONSOLE_STATE_PROMPT);
+
+    while (remaining_tokens > 0 || params.interactive) {
+        // predict
+        if (embd.size() > 0) {
+            if (llama_eval(ctx, embd.data(), embd.size(), n_past, params.n_threads)) {
+                fprintf(errstream, "%s : failed to eval\n", __func__);
+                return 1;
+            }
+        }
+
+        n_past += embd.size();
+        embd.clear();
+
+        if ((int) embd_inp.size() <= input_consumed) {
+            // out of user input, sample next token
+            const float top_k          = params.top_k;
+            const float top_p          = params.top_p;
+            const float temp           = params.temp;
+            const float repeat_penalty = params.repeat_penalty;
+
+            llama_token id = 0;
+
+            {
+                auto logits = llama_get_logits(ctx);
+
+                if (params.ignore_eos) {
+                    // set the logit of the eos token to zero to avoid sampling it
+                    //logits[logits.size() - n_vocab + EOS_TOKEN_ID] = 0;
+                    // TODO: this does not work of params.logits_all == true
+                    assert(params.perplexity == false);
+                    logits[llama_token_eos()] = 0;
+                }
+
+                id = llama_sample_top_p_top_k(ctx, last_n_tokens.data(), last_n_tokens.size(), top_k, top_p, temp, repeat_penalty);
+
+                last_n_tokens.erase(last_n_tokens.begin());
+                last_n_tokens.push_back(id);
+            }
+
+            // add it to the context
+            embd.push_back(id);
+
+            // echo this to console
+            input_noecho = false;
+
+            // decrement remaining sampling budget
+            --remaining_tokens;
+        } else {
+            // some user input remains from prompt or interaction, forward it to processing
+            while ((int) embd_inp.size() > input_consumed) {
+                embd.push_back(embd_inp[input_consumed]);
+                last_n_tokens.erase(last_n_tokens.begin());
+                last_n_tokens.push_back(embd_inp[input_consumed]);
+                ++input_consumed;
+                if ((int) embd.size() >= params.n_batch) {
+                    break;
+                }
+            }
+        }
+
+        // display text
+        if (!input_noecho) {
+            for (auto id : embd) {
+                fprintf(outstream, "%s", llama_token_to_str(ctx, id));
+            }
+            fflush(outstream);
+        }
+        // reset color to default if we there is no pending user input
+        if (!input_noecho && (int)embd_inp.size() == input_consumed) {
+            set_console_state(outstream, CONSOLE_STATE_DEFAULT);
+        }
+
+        // in interactive mode, and not currently processing queued inputs;
+        // check if we should prompt the user for more
+        if (params.interactive && (int) embd_inp.size() <= input_consumed) {
+            // check for reverse prompt
+            std::string last_output;
+            for (auto id : last_n_tokens) {
+                last_output += llama_token_to_str(ctx, id);
+            }
+
+            // Check if each of the reverse prompts appears at the end of the output.
+            for (std::string antiprompt : params.antiprompt) {
+                if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
+                    is_interacting = true;
+                    break;
+                }
+            }
+            if (is_interacting) {
+                // potentially set color to indicate we are taking user input
+                set_console_state(outstream, CONSOLE_STATE_USER_INPUT);
+
+                if (params.instruct) {
+                    input_consumed = embd_inp.size();
+                    embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
+
+                    fprintf(outstream, "\n> ");
+                }
+
+                std::string buffer;
+                std::string line;
+                bool another_line = true;
+                do {
+                    std::getline(instream, line);
+                    if (line.empty() || line.back() != '\\') {
+                        another_line = false;
+                    } else {
+                        line.pop_back(); // Remove the continue character
+                    }
+                    buffer += line + '\n'; // Append the line to the result
+                } while (another_line);
+
+                // done taking input, reset color
+                set_console_state(outstream, CONSOLE_STATE_DEFAULT);
+
+                auto line_inp = ::llama_tokenize(ctx, buffer, false);
+                embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
+
+                if (params.instruct) {
+                    embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
+                }
+
+                remaining_tokens -= line_inp.size();
+
+                input_noecho = true; // do not echo this again
+            }
+            is_interacting = false;
+        }
+
+        // end of text token
+        if (embd.back() == llama_token_eos()) {
+            if (params.interactive) {
+                is_interacting = true;
+            } else {
+                fprintf(errstream, " [end of text]\n");
+                break;
+            }
+        }
+
+        // In interactive mode, respect the maximum number of tokens and drop back to user input when reached.
+        if (params.interactive && remaining_tokens <= 0) {
+            remaining_tokens = params.n_predict;
+            is_interacting = true;
+        }
+    }
+
+#if defined (_WIN32)
+    signal(SIGINT, SIG_DFL);
+#endif
+
+    llama_print_timings(ctx);
+
+    llama_free(ctx);
+
+    set_console_state(outstream, CONSOLE_STATE_DEFAULT);
+
+    return 0;
+}
--- a/run.h
+++ b/run.h
@@ -0,0 +1,10 @@
+#pragma once
+
+#include "llama.h"
+#include "utils.h"
+
+int run(llama_context * ctx,
+        gpt_params params,
+        std::istream & instream,
+        FILE *outstream,
+        FILE *errstream);
--- a/tcp_server.cpp
+++ b/tcp_server.cpp
@@ -0,0 +1,245 @@
+#include "tcp_server.h"
+#include "llama.h"
+#include "utils.h"
+
+#include <iostream>
+
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <errno.h>
+
+#include <signal.h>
+#include <unistd.h>
+#include <sys/wait.h>
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+#include <netdb.h>
+
+class PosixStream : public std::istream {
+    public:
+        PosixStream(int fd) : std::istream(&buf), buf(fd) {}
+        ~PosixStream() { close(buf.get_fd()); }
+
+    private:
+        class PosixStreamBuf : public std::streambuf {
+            public:
+                PosixStreamBuf(int fd) : fd(fd) {}
+                int get_fd() const { return fd; }
+
+            protected:
+                virtual int_type underflow() {
+                    if (gptr() < egptr()) {
+                        return traits_type::to_int_type(*gptr());
+                    }
+
+                    ssize_t num_read = ::read(fd, buffer, BUFFER_SIZE);
+                    if (num_read <= 0) {
+                        return traits_type::eof();
+                    }
+
+                    setg(buffer, buffer, buffer + num_read);
+                    return traits_type::to_int_type(*gptr());
+                }
+
+            private:
+                static const int BUFFER_SIZE = 1024;
+                int fd;
+                char buffer[BUFFER_SIZE];
+        };
+
+        PosixStreamBuf buf;
+};
+
+void die(const char *msg, ...)
+{
+    va_list ap;
+
+    va_start(ap, msg);
+    vfprintf(stderr, msg, ap);
+    va_end(ap);
+    fputc('\n', stderr);
+    exit(1);
+}
+
+static char *read_argument(uint8_t **param_buf, size_t *param_buf_size, FILE *instream) {
+    bool done = false;
+    uint8_t *buf = *param_buf;
+    size_t bufsize = *param_buf_size;
+    size_t bufpos = 0;
+    while (!done) {
+        if (bufpos == bufsize) {
+            bufsize += 1024;
+            buf = (uint8_t *)realloc(buf, bufsize);
+            if (!buf) {
+                die("failed to allocate memory");
+            }
+        }
+
+        int c = fgetc(instream);
+        if (c == EOF) {
+            die("unexpected EOF client socket");
+        }
+        buf[bufpos++] = (uint8_t)c;
+        if (c == 0) {
+            // done reading argument
+            break;
+        }
+    }
+    *param_buf = buf;
+    *param_buf_size = bufsize;
+    return strdup((char *)buf);
+}
+
+static int read_arguments(int argc, char **argv, FILE *instream) {
+    int i = 1;
+    size_t param_buf_size = 0;
+    uint8_t *param_buf = nullptr;
+
+    for (i = 1; i < argc; i++) {
+        argv[i] = read_argument(&param_buf, &param_buf_size, instream);
+    }
+
+    free(param_buf);
+    return i;
+}
+
+static int serve_model(llama_context * ctx,
+                       gpt_params params,
+                       int sock_fd)
+{
+    int argc;
+    char **argv;
+    FILE *instream = fdopen(sock_fd, "r");
+    FILE *outstream = fdopen(sock_fd, "w");
+    setvbuf(instream, NULL, _IONBF, 0);
+
+    // start by reading the parameter count
+    if (fscanf(instream, "%d\n", &argc) != 1) {
+        fprintf(outstream, "Error: First line must be character count\n");
+        fflush(outstream);
+        return 1;
+    }
+
+    argc += 1;  // add one extra argument to emulate the program command line
+    argv = (char **)malloc(argc * sizeof *argv);
+    argv[0] = nullptr;
+    if (read_arguments(argc, argv, instream) != argc) {
+        fprintf(outstream, "Error: Failed to read arguments\n");
+        fflush(outstream);
+    }
+
+    if (gpt_params_parse(argc, argv, params) == false) {
+        fprintf(outstream, "Error: Failed to parse parameters\n");
+        fflush(outstream);
+        return 1;
+    }
+
+    for (int i = 1; i < argc; i++) {
+        free(argv[i]);
+    }
+    free(argv);
+
+    PosixStream tcp_instream(sock_fd);
+
+    return run(ctx, params, tcp_instream, outstream, outstream);
+}
+
+int listen_tcp(llama_context * ctx, gpt_params params) {
+    int listen_fd;
+    int status;
+    pid_t child;
+    struct addrinfo hints;
+    struct addrinfo *servinfo, *p;
+    int yes = 1;
+
+    memset(&hints, 0, sizeof hints);
+    hints.ai_family = AF_INET;
+    hints.ai_socktype = SOCK_STREAM;
+    hints.ai_flags = AI_PASSIVE;
+
+    // This should only ever listen on a loopback address. Access from outside
+    // should be proxied via socat or similar software
+    status = getaddrinfo("127.0.0.1", params.listen_port.c_str(), &hints, &servinfo);
+    if (status) {
+        die("getaddrinfo error: %s", gai_strerror(status));
+    }
+
+    // bind to the first addrinfo we can from the getaddrinfo results
+    for (p = servinfo; p != NULL; p = p->ai_next) {
+        listen_fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
+        if (listen_fd == -1) {
+            perror("server: socket");
+            continue;
+        }
+
+        if (setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR | SO_REUSEPORT, &yes, sizeof yes)) {
+            die("setsockopt error: %s", params.listen_port.c_str(), strerror(errno));
+        }
+
+        if (bind(listen_fd, p->ai_addr, p->ai_addrlen) == 0) {
+            struct sockaddr_in addr_in;
+            socklen_t addr_in_len = sizeof(addr_in);
+            memset(&addr_in, 0, addr_in_len);
+            getsockname(listen_fd, (struct sockaddr*)&addr_in, &addr_in_len);
+
+            printf("Listening on %s:%d\n", inet_ntoa(addr_in.sin_addr), ntohs(addr_in.sin_port));
+            break;
+        }
+
+        close(listen_fd);
+        perror("server: bind");
+    }
+
+    freeaddrinfo(servinfo);
+
+    if (p == NULL) {
+        die("failed to bind: %s", strerror(errno));
+    }
+
+    if (listen(listen_fd, 20)) {
+        die("listen error: %s", strerror(errno));
+    }
+    // Don't track child processes, so ignore SIGCHLD to prevent zombies
+    signal(SIGCHLD, SIG_IGN);
+
+    for (;;) {
+        struct sockaddr_in client_addr;
+        socklen_t client_addr_len = 0;
+        memset(&client_addr, 0, sizeof(client_addr));
+
+        int sock_fd = accept(listen_fd,
+                (struct sockaddr *)&client_addr,
+                &client_addr_len);
+        if (sock_fd < 0) {
+            fprintf(stderr, "accept error: %s\n", strerror(errno));
+            break;
+        }
+
+        child = fork();
+        if (child == 0) {
+            // close the listen_fd since we won't use it in the child
+            close(listen_fd);
+            int ret = serve_model(ctx, params, sock_fd);
+            close(sock_fd);
+            return ret;
+        } else {
+            // close the client since we won't use it in the server
+            close(sock_fd);
+            sock_fd = 0;
+        }
+    }
+    close(listen_fd);
+
+    // ignore SIGTERM since we'll send it to the group
+    signal(SIGTERM, SIG_IGN);
+    // tell children to exit
+    kill(0, SIGTERM);
+    // wait for children to terminate
+    wait(&status);
+    return 0;
+}
--- a/tcp_server.h
+++ b/tcp_server.h
@@ -0,0 +1,7 @@
+#pragma once
+
+#include "utils.h"
+#include "llama.h"
+#include "run.h"
+
+int listen_tcp(llama_context * ctx, gpt_params params);
--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@@ -0,0 +1,9 @@
+function(llama_add_test source)
+    get_filename_component(TEST_TARGET ${source} NAME_WE)
+    add_executable(${TEST_TARGET} ${source})
+    target_link_libraries(${TEST_TARGET} PRIVATE llama ggml utils)
+    add_test(NAME ${TEST_TARGET} COMMAND $<TARGET_FILE:${TEST_TARGET}> ${ARGN})
+endfunction()
+
+llama_add_test(test-quantize.c)
+llama_add_test(test-tokenizer-0.cpp ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab.bin)
--- a/tests/test-quantize.c
+++ b/tests/test-quantize.c
@@ -0,0 +1,42 @@
+#include "ggml.h"
+#undef NDEBUG
+#include <assert.h>
+#include <math.h>
+
+int main(void) {
+    #define QK 32
+    float src[QK];
+    uint8_t dst[24];
+    int64_t hist[16];
+
+    for (int i = 0; i < QK; i++) {
+        src[i] = (float)(i + 1);
+    }
+
+    size_t size = ggml_quantize_q4_0(src, dst, QK, QK, QK, hist);
+    assert(size == 20);
+    float max_result = ((float *)dst)[0];
+    float max_expected = src[31] / ((1 << 3) - 1);
+    assert(max_result == max_expected);
+    for (int i = 0; i < QK; i++) {
+        uint8_t q4_result = (i % 2) ? (dst[sizeof(float) + i/2] >> 4) : (dst[sizeof(float) + i/2] & 0xF);
+        uint8_t q4_expected = roundf(src[i] / max_expected) + 8;
+        assert(q4_result == q4_expected);
+    }
+
+    size = ggml_quantize_q4_1(src, dst, QK, QK, QK, hist);
+    assert(size == 24);
+    float delta_result = ((float *)dst)[0];
+    float delta_expected = (src[31] - src[0]) / ((1 << 4) - 1);
+    assert(delta_result == delta_expected);
+    float min_result = ((float *)dst)[1];
+    float min_expected = src[0];
+    assert(min_result == min_expected);
+    for (int i = 0; i < QK; i++) {
+        uint8_t q4_result = (i % 2) ? (dst[sizeof(float)*2 + i/2] >> 4) : (dst[sizeof(float)*2 + i/2] & 0xF);
+        uint8_t q4_expected = roundf((src[i] - min_expected) / delta_expected);
+        assert(q4_result == q4_expected);
+    }
+
+    return 0;
+}
--- a/tests/test-tokenizer-0.cpp
+++ b/tests/test-tokenizer-0.cpp
@@ -0,0 +1,79 @@
+#include "utils.h"
+#include "llama.h"
+
+#include <cstdio>
+#include <string>
+#include <map>
+
+static const std::map<std::string, std::vector<llama_token>> k_tests = {
+    { "Hello World",        { 1,  10994,   2787, }, },
+    { " Hello World",       { 1,  15043,   2787, }, },
+    { " Hello World!",      { 1,  15043,   2787,  29991, }, },
+    { " this is 🦙.cpp",    { 1,    445,    338,  29871,    243,    162,    169,    156,  29889,   8223, }, },
+    { "w048 7tuijk dsdfhu", { 1,  29893,  29900,  29946,  29947,  29871,  29955,   9161,  13535,  18031,   2176,   6905, }, },
+    { "нещо на Български",  { 1,    821,   4851,    665,   1386,  29713,   1305, }, },
+};
+
+int main(int argc, char **argv) {
+    if (argc < 2) {
+        fprintf(stderr, "Usage: %s <vocab-file>\n", argv[0]);
+        return 1;
+    }
+
+    const std::string fname = argv[1];
+
+    fprintf(stderr, "%s : reading vocab from: '%s'\n", __func__, fname.c_str());
+
+    llama_context * ctx;
+
+    // load the vocab
+    {
+        auto lparams = llama_context_default_params();
+
+        lparams.vocab_only = true;
+
+        ctx = llama_init_from_file(fname.c_str(), lparams);
+
+        if (ctx == NULL) {
+            fprintf(stderr, "%s: error: failed to load vocab '%s'\n", __func__, fname.c_str());
+            return 1;
+        }
+    }
+
+    const int n_vocab = llama_n_vocab(ctx);
+
+    if (n_vocab != 32000) {
+        fprintf(stderr, "%s : expected 32000 tokens, got %d\n", __func__, n_vocab);
+        return 2;
+    }
+
+    for (const auto & test_kv : k_tests) {
+        const auto res = ::llama_tokenize(ctx, test_kv.first, true);
+
+        bool correct = res.size() == test_kv.second.size();
+
+        for (int i = 0; i < (int) res.size() && correct; ++i) {
+            if (res[i] != test_kv.second[i]) {
+                correct = false;
+            }
+        }
+
+        if (!correct) {
+            fprintf(stderr, "%s : failed test: '%s'\n", __func__, test_kv.first.c_str());
+            fprintf(stderr, "%s : expected tokens: ", __func__);
+            for (const auto & t : test_kv.second) {
+                fprintf(stderr, "%6d, ", t);
+            }
+            fprintf(stderr, "\n");
+            fprintf(stderr, "%s : got tokens:      ", __func__);
+            for (const auto & t : res) {
+                fprintf(stderr, "%6d, ", t);
+            }
+            fprintf(stderr, "\n");
+
+            return 3;
+        }
+    }
+
+    return 0;
+}
--- a/utils.cpp
+++ b/utils.cpp
@@ -3,19 +3,29 @@
 #include <cassert>
 #include <cstring>
 #include <fstream>
-#include <regex>
-#include <iostream>
-#include <iterator>
 #include <string>
-#include <math.h>
+#include <iterator>
+#include <algorithm>

 #if defined(_MSC_VER) || defined(__MINGW32__)
 #include <malloc.h> // using malloc.h with MSC/MINGW
- #elif !defined(__FreeBSD__)
+ #elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
 #include <alloca.h>
 #endif

 bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
+    // determine sensible default number of threads.
+    // std::thread::hardware_concurrency may not be equal to the number of cores, or may return 0.
+#ifdef __linux__
+    std::ifstream cpuinfo("/proc/cpuinfo");
+    params.n_threads = std::count(std::istream_iterator<std::string>(cpuinfo),
+                                  std::istream_iterator<std::string>(),
+                                  std::string("processor"));
+#endif
+    if (params.n_threads == 0) {
+        params.n_threads = std::max(1, (int32_t) std::thread::hardware_concurrency());
+    }
+
    for (int i = 1; i < argc; i++) {
        std::string arg = argv[i];

@@ -26,17 +36,19 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
        } else if (arg == "-p" || arg == "--prompt") {
            params.prompt = argv[++i];
        } else if (arg == "-f" || arg == "--file") {
-
            std::ifstream file(argv[++i]);
-
-            std::copy(std::istreambuf_iterator<char>(file),
-                    std::istreambuf_iterator<char>(),
-                    back_inserter(params.prompt));
-                
+            std::copy(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>(), back_inserter(params.prompt));
+            if (params.prompt.back() == '\n') {
+                params.prompt.pop_back();
+            }
        } else if (arg == "-n" || arg == "--n_predict") {
            params.n_predict = std::stoi(argv[++i]);
        } else if (arg == "--top_k") {
            params.top_k = std::stoi(argv[++i]);
+        } else if (arg == "-c" || arg == "--ctx_size") {
+            params.n_ctx = std::stoi(argv[++i]);
+        } else if (arg == "--memory_f16") {
+            params.memory_f16 = true;
        } else if (arg == "--top_p") {
            params.top_p = std::stof(argv[++i]);
        } else if (arg == "--temp") {
@@ -51,16 +63,29 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
            params.model = argv[++i];
        } else if (arg == "-i" || arg == "--interactive") {
            params.interactive = true;
-        } else if (arg == "--interactive-start") {
-            params.interactive = true;
+        } else if (arg == "--interactive-first") {
            params.interactive_start = true;
+        } else if (arg == "-ins" || arg == "--instruct") {
+            params.instruct = true;
        } else if (arg == "--color") {
            params.use_color = true;
        } else if (arg == "-r" || arg == "--reverse-prompt") {
-            params.antiprompt = argv[++i];
+            params.antiprompt.push_back(argv[++i]);
+        } else if (arg == "--perplexity") {
+            params.perplexity = true;
+        } else if (arg == "--ignore-eos") {
+            params.ignore_eos = true;
+        } else if (arg == "--n_parts") {
+            params.n_parts = std::stoi(argv[++i]);
+#ifndef _WIN32
+        } else if (arg == "-l" || arg == "--listen") {
+            params.listen_port = argv[++i];
+#endif
        } else if (arg == "-h" || arg == "--help") {
            gpt_print_usage(argc, argv, params);
            exit(0);
+        } else if (arg == "--random-prompt") {
+            params.random_prompt = true;
        } else {
            fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
            gpt_print_usage(argc, argv, params);
@@ -71,20 +96,23 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
    return true;
 }

-void gpt_print_usage(int argc, char ** argv, const gpt_params & params) {
+void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
    fprintf(stderr, "usage: %s [options]\n", argv[0]);
    fprintf(stderr, "\n");
    fprintf(stderr, "options:\n");
    fprintf(stderr, "  -h, --help            show this help message and exit\n");
    fprintf(stderr, "  -i, --interactive     run in interactive mode\n");
-    fprintf(stderr, "  --interactive-start   run in interactive mode and poll user input at startup\n");
+    fprintf(stderr, "  --interactive-first   run in interactive mode and wait for input right away\n");
+    fprintf(stderr, "  -ins, --instruct      run in instruction mode (use with Alpaca models)\n");
    fprintf(stderr, "  -r PROMPT, --reverse-prompt PROMPT\n");
-    fprintf(stderr, "                        in interactive mode, poll user input upon seeing PROMPT\n");
+    fprintf(stderr, "                        run in interactive mode and poll user input upon seeing PROMPT (can be\n");
+    fprintf(stderr, "                        specified more than once for multiple prompts).\n");
    fprintf(stderr, "  --color               colorise output to distinguish prompt and user input from generations\n");
-    fprintf(stderr, "  -s SEED, --seed SEED  RNG seed (default: -1)\n");
+    fprintf(stderr, "  -s SEED, --seed SEED  RNG seed (default: -1, use random seed for <= 0)\n");
    fprintf(stderr, "  -t N, --threads N     number of threads to use during computation (default: %d)\n", params.n_threads);
    fprintf(stderr, "  -p PROMPT, --prompt PROMPT\n");
-    fprintf(stderr, "                        prompt to start generation with (default: random)\n");
+    fprintf(stderr, "                        prompt to start generation with (default: empty)\n");
+    fprintf(stderr, "  --random-prompt       start with a randomized prompt.\n");
    fprintf(stderr, "  -f FNAME, --file FNAME\n");
    fprintf(stderr, "                        prompt file to start generation.\n");
    fprintf(stderr, "  -n N, --n_predict N   number of tokens to predict (default: %d)\n", params.n_predict);
@@ -92,10 +120,19 @@ void gpt_print_usage(int argc, char ** argv, const gpt_params & params) {
    fprintf(stderr, "  --top_p N             top-p sampling (default: %.1f)\n", params.top_p);
    fprintf(stderr, "  --repeat_last_n N     last n tokens to consider for penalize (default: %d)\n", params.repeat_last_n);
    fprintf(stderr, "  --repeat_penalty N    penalize repeat sequence of tokens (default: %.1f)\n", params.repeat_penalty);
+    fprintf(stderr, "  -c N, --ctx_size N    size of the prompt context (default: %d)\n", params.n_ctx);
+    fprintf(stderr, "  --ignore-eos          ignore end of stream token and continue generating\n");
+    fprintf(stderr, "  --memory_f16          use f16 instead of f32 for memory key+value\n");
    fprintf(stderr, "  --temp N              temperature (default: %.1f)\n", params.temp);
+    fprintf(stderr, "  --n_parts N           number of model parts (default: -1 = determine from dimensions)\n");
    fprintf(stderr, "  -b N, --batch_size N  batch size for prompt processing (default: %d)\n", params.n_batch);
+    fprintf(stderr, "  --perplexity          compute perplexity over the prompt\n");
    fprintf(stderr, "  -m FNAME, --model FNAME\n");
    fprintf(stderr, "                        model path (default: %s)\n", params.model.c_str());
+#ifndef _WIN32
+    fprintf(stderr, "  -l PORT, --listen PORT\n");
+    fprintf(stderr, "                        Run in TCP mode, listening on PORT\n");
+#endif
    fprintf(stderr, "\n");
 }

@@ -118,427 +155,13 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
    return "The";
 }

-void replace(std::string & str, const std::string & needle, const std::string & replacement) {
-    size_t pos = 0;
-    while ((pos = str.find(needle, pos)) != std::string::npos) {
-        str.replace(pos, needle.length(), replacement);
-        pos += replacement.length();
-    }
-}
-
-std::map<std::string, int32_t> json_parse(const std::string & fname) {
-    std::map<std::string, int32_t> result;
-
-    // read file into string
-    std::string json;
-    {
-        std::ifstream ifs(fname);
-        if (!ifs) {
-            fprintf(stderr, "Failed to open %s\n", fname.c_str());
-            exit(1);
-        }
-
-        json = std::string((std::istreambuf_iterator<char>(ifs)),
-                (std::istreambuf_iterator<char>()));
-    }
-
-    if (json[0] != '{') {
-        return result;
-    }
-
-    // parse json
-    {
-        bool has_key  = false;
-        bool in_token = false;
-
-        std::string str_key = "";
-        std::string str_val = "";
-
-        int n = json.size();
-        for (int i = 1; i < n; ++i) {
-            if (!in_token) {
-                if (json[i] == ' ') continue;
-                if (json[i] == '"') {
-                    in_token = true;
-                    continue;
-                }
-            } else {
-                if (json[i] == '\\' && i+1 < n) {
-                    if (has_key == false) {
-                        str_key += json[i];
-                    } else {
-                        str_val += json[i];
-                    }
-                    ++i;
-                } else if (json[i] == '"') {
-                    if (has_key == false) {
-                        has_key = true;
-                        ++i;
-                        while (json[i] == ' ') ++i;
-                        ++i; // :
-                        while (json[i] == ' ') ++i;
-                        if (json[i] != '\"') {
-                            while (json[i] != ',' && json[i] != '}') {
-                                str_val += json[i++];
-                            }
-                            has_key = false;
-                        } else {
-                            in_token = true;
-                            continue;
-                        }
-                    } else {
-                        has_key = false;
-                    }
-
-                    ::replace(str_key, "\\u0120", " " ); // \u0120 -> space
-                    ::replace(str_key, "\\u010a", "\n"); // \u010a -> new line
-                    ::replace(str_key, "\\\"",    "\""); // \\\"   -> "
-
-                    try {
-                        result[str_key] = std::stoi(str_val);
-                    } catch (...) {
-                        //fprintf(stderr, "%s: ignoring key '%s' with value '%s'\n", fname.c_str(), str_key.c_str(), str_val.c_str());
-
-                    }
-                    str_key = "";
-                    str_val = "";
-                    in_token = false;
-                    continue;
-                }
-                if (has_key == false) {
-                    str_key += json[i];
-                } else {
-                    str_val += json[i];
-                }
-            }
-        }
-    }
-
-    return result;
-}
-
-std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text) {
-    std::vector<std::string> words;
-
-    // first split the text into words
-    {
-        std::string str = text;
-        std::string pat = R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)";
-
-        std::regex re(pat);
-        std::smatch m;
-
-        while (std::regex_search(str, m, re)) {
-            for (auto x : m) {
-                words.push_back(x);
-            }
-            str = m.suffix();
-        }
-    }
-
-    // find the longest tokens that form the words:
-    std::vector<gpt_vocab::id> tokens;
-    for (const auto & word : words) {
-        if (word.size() == 0) continue;
-
-        int i = 0;
-        int n = word.size();
-        while (i < n) {
-            int j = n;
-            while (j > i) {
-                auto it = vocab.token_to_id.find(word.substr(i, j-i));
-                if (it != vocab.token_to_id.end()) {
-                    tokens.push_back(it->second);
-                    i = j;
-                    break;
-                }
-                --j;
-            }
-            if (i == n) {
-                break;
-            }
-            if (j == i) {
-                auto sub = word.substr(i, 1);
-                if (vocab.token_to_id.find(sub) != vocab.token_to_id.end()) {
-                    tokens.push_back(vocab.token_to_id.at(sub));
-                } else {
-                    fprintf(stderr, "%s: unknown token '%s'\n", __func__, sub.data());
-                }
-                ++i;
-            }
-        }
-    }
-
-    return tokens;
-}
-
-std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos) {
-    //auto res = gpt_tokenize(vocab, text);
-
-    //if (bos) {
-    //    res.insert(res.begin(), 1); // TODO: replace with vocab.bos
-    //}
-
-    std::vector<gpt_vocab::id> res;
-
-    if (bos) {
-        res.push_back(1); // TODO: replace with vocab.bos
-    }
-
-     //find the longest token that matches the text
-    int pos = 0;
-    while (true) {
-        int l = 0;
-        int t = 0;
-        for (const auto & kv : vocab.id_to_token) {
-            if (kv.second.size() < l) continue;
-            if (kv.second.size() > text.size() - pos) continue;
-            if (text.substr(pos, kv.second.size()) == kv.second) {
-                l = kv.second.size();
-                t = kv.first;
-            }
-        }
-
-        if (l == 0) {
-            break;
-        }
-
-        res.push_back(t);
-        pos += l;
-    }
+// TODO: not great allocating this every time
+std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
+    // initialize to prompt numer of chars, since n_tokens <= n_prompt_chars
+    std::vector<llama_token> res(text.size() + (int)add_bos);
+    int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
+    assert(n >= 0);
+    res.resize(n);

    return res;
 }
-
-bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab) {
-    printf("%s: loading vocab from '%s'\n", __func__, fname.c_str());
-
-    vocab.token_to_id = ::json_parse(fname);
-
-    for (const auto & kv : vocab.token_to_id) {
-        vocab.id_to_token[kv.second] = kv.first;
-    }
-
-    printf("%s: vocab size = %d\n", __func__, (int) vocab.token_to_id.size());
-
-    // print the vocabulary
-    //for (auto kv : vocab.token_to_id) {
-    //    printf("'%s' -> %d\n", kv.first.data(), kv.second);
-    //}
-
-    return true;
-}
-
-
-void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k) {
-    // find the top K tokens
-    std::partial_sort(
-            logits_id.begin(),
-            logits_id.begin() + top_k, logits_id.end(),
-            [](const std::pair<double, gpt_vocab::id> & a, const std::pair<double, gpt_vocab::id> & b) {
-        return a.first > b.first;
-    });
-
-    logits_id.resize(top_k);
-}
-
-gpt_vocab::id llama_sample_top_p_top_k(
-        const gpt_vocab & vocab,
-        const float * logits,
-        std::vector<gpt_vocab::id> & last_n_tokens,
-        double repeat_penalty,
-        int top_k,
-        double top_p,
-        double temp,
-        std::mt19937 & rng) {
-    int n_logits = vocab.id_to_token.size();
-
-    std::vector<std::pair<double, gpt_vocab::id>> logits_id;
-    logits_id.reserve(n_logits);
-
-    {
-        const double scale = 1.0/temp;
-        for (int i = 0; i < n_logits; ++i) {
-            // repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)
-            // credit https://github.com/facebookresearch/llama/compare/main...shawwn:llama:main
-            if (std::find(last_n_tokens.begin(), last_n_tokens.end(), i) != last_n_tokens.end()) {
-                // if score < 0 then repetition penalty has to multiplied to reduce the previous token probability
-                if (logits[i] < 0.0) {
-                    logits_id.push_back(std::make_pair(logits[i]*scale*repeat_penalty, i));
-                } else {
-                    logits_id.push_back(std::make_pair(logits[i]*scale/repeat_penalty, i));
-                }                
-            } else {
-                logits_id.push_back(std::make_pair(logits[i]*scale, i));
-            }
-        }
-    }
-
-    sample_top_k(logits_id, top_k);
-
-    double maxl = -INFINITY;
-    for (const auto & kv : logits_id) {
-        maxl = std::max(maxl, kv.first);
-    }
-
-    // compute probs for the top K tokens
-    std::vector<double> probs;
-    probs.reserve(logits_id.size());
-
-    double sum = 0.0;
-    for (const auto & kv : logits_id) {
-        double p = exp(kv.first - maxl);
-        probs.push_back(p);
-        sum += p;
-    }
-
-    // normalize the probs
-    for (auto & p : probs) {
-        p /= sum;
-    }
-
-    if (top_p < 1.0f) {
-        double cumsum = 0.0f;
-        for (int i = 0; i < (int) probs.size(); i++) {
-            cumsum += probs[i];
-            if (cumsum >= top_p) {
-                probs.resize(i + 1);
-                logits_id.resize(i + 1);
-                break;
-            }
-        }
-
-        cumsum = 1.0/cumsum;
-        for (int i = 0; i < (int) probs.size(); i++) {
-            probs[i] *= cumsum;
-        }
-    }
-
-    //printf("\n");
-    //for (int i = 0; i < (int) 10; i++) {
-    //    printf("%d: '%s' %f\n", i, vocab.id_to_token.at(logits_id[i].second).c_str(), probs[i]);
-    //}
-    //printf("\n\n");
-    //exit(0);
-
-    std::discrete_distribution<> dist(probs.begin(), probs.end());
-    int idx = dist(rng);
-
-    return logits_id[idx].second;
-}
-
-
-size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
-    const int nb = k / qk;
-    const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
-    const size_t row_size = nb*bs;
-
-    assert(k % qk == 0);
-
-    const size_t pp_size = qk / 2;
-    uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));
-
-    char * pdst = (char *) dst;
-
-    for (int j = 0; j < n; j += k) {
-        uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
-        uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
-
-        for (int i = 0; i < nb; i++) {
-            float amax = 0.0f; // absolute max
-
-            {
-                for (int l = 0; l < qk; l++) {
-                    const float v = src[j + i*qk + l];
-                    amax = std::max(amax, fabsf(v));
-                }
-
-                const float d = amax / ((1 << 3) - 1);
-                const float id = d ? 1.0f/d : 0.0f;
-
-                *(float *) pd = d;
-                pd += bs;
-
-                for (int l = 0; l < qk; l += 2) {
-                    const float v0 = (src[j + i*qk + l + 0])*id;
-                    const float v1 = (src[j + i*qk + l + 1])*id;
-
-                    const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
-                    const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
-
-                    assert(vi0 >= 0 && vi0 < 16);
-                    assert(vi1 >= 0 && vi1 < 16);
-
-                    hist[vi0]++;
-                    hist[vi1]++;
-
-                    pp[l/2] = vi0 | (vi1 << 4);
-                }
-
-                memcpy(pb, pp, pp_size);
-                pb += bs;
-            }
-        }
-    }
-
-    return (n/k)*row_size;
-}
-
-size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
-    const int nb = k / qk;
-    const size_t row_size = nb*(2*sizeof(float) + sizeof(uint8_t)*qk/2);
-
-    assert(k % qk == 0);
-
-    const size_t pp_size = qk / 2;
-    uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));
-
-    char * pdst = (char *) dst;
-
-    for (int j = 0; j < n; j += k) {
-        float   * pm = (float *)   (pdst + (j/k)*row_size);
-        float   * pd = (float *)   (pm + nb);
-        uint8_t * pb = (uint8_t *) (pd + nb);
-
-        //printf("n = %d, k = %d, nb = %d, row_size = %d, j = %d, pm = %p, pd = %p, pb = %p\n", n, k, nb, row_size, j, pm, pd, pb);
-
-        for (int i = 0; i < nb; i++) {
-            float min = std::numeric_limits<float>::max();
-            float max = std::numeric_limits<float>::min();
-
-            {
-                for (int l = 0; l < qk; l++) {
-                    const float v = src[j + i*qk + l];
-                    if (v < min) min = v;
-                    if (v > max) max = v;
-                }
-
-                const float d = (max - min) / ((1 << 4) - 1);
-                const float id = d ? 1.0f/d : 0.0f;
-
-                pm[i] = min;
-                pd[i] = d;
-
-                for (int l = 0; l < qk; l += 2) {
-                    const float v0 = (src[j + i*qk + l + 0] - min)*id;
-                    const float v1 = (src[j + i*qk + l + 1] - min)*id;
-
-                    const uint8_t vi0 = round(v0);
-                    const uint8_t vi1 = round(v1);
-
-                    assert(vi0 >= 0 && vi0 < 16);
-                    assert(vi1 >= 0 && vi1 < 16);
-
-                    hist[vi0]++;
-                    hist[vi1]++;
-
-                    pp[l/2] = vi0 | (vi1 << 4);
-                }
-
-                memcpy(pb + i*qk/2, pp, pp_size);
-            }
-        }
-    }
-
-    return (n/k)*row_size;
-}
--- a/utils.h
+++ b/utils.h
@@ -2,8 +2,9 @@

 #pragma once

+#include "llama.h"
+
 #include <string>
-#include <map>
 #include <vector>
 #include <random>
 #include <thread>
@@ -13,27 +14,38 @@
 //

 struct gpt_params {
-    int32_t seed      = -1; // RNG seed
-    int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
-    int32_t n_predict = 128; // new tokens to predict
+    int32_t seed          = -1;  // RNG seed
+    int32_t n_threads     = std::min(4, (int32_t) std::thread::hardware_concurrency());
+    int32_t n_predict     = 128; // new tokens to predict
    int32_t repeat_last_n = 64;  // last n tokens to penalize
+    int32_t n_parts       = -1;  // amount of model parts (-1 = determine from model dimensions)
+    int32_t n_ctx         = 512; //context size

    // sampling parameters
    int32_t top_k = 40;
    float   top_p = 0.95f;
    float   temp  = 0.80f;
-    float   repeat_penalty  = 1.30f;
+    float   repeat_penalty  = 1.10f;

    int32_t n_batch = 8; // batch size for prompt processing

-    std::string model = "models/lamma-7B/ggml-model.bin"; // model path
-    std::string prompt;
+    std::string model  = "models/lamma-7B/ggml-model.bin"; // model path
+    std::string prompt = "";

-    bool use_color = false; // use color to distinguish generations and inputs
+    std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted

-    bool interactive = false; // interactive mode
-    bool interactive_start = false; // reverse prompt immediately
-    std::string antiprompt = ""; // string upon seeing which more user input is prompted
+    bool memory_f16        = false; // use f16 instead of f32 for memory kv
+    bool random_prompt     = false; // do not randomize prompt if none provided
+    bool use_color         = false; // use color to distinguish generations and inputs
+    bool interactive       = false; // interactive mode
+    bool interactive_start = false; // wait for user input immediately
+    bool instruct          = false; // instruction mode (used for Alpaca models)
+    bool ignore_eos        = false; // do not stop generating after eos
+    bool perplexity        = false; // compute perplexity over the prompt
+
+#ifndef _WIN32
+    std::string listen_port = "";  // TCP port for when running in server mode
+#endif
 };

 bool gpt_params_parse(int argc, char ** argv, gpt_params & params);
@@ -46,59 +58,4 @@ std::string gpt_random_prompt(std::mt19937 & rng);
 // Vocab utils
 //

-struct gpt_vocab {
-    using id    = int32_t;
-    using token = std::string;
-
-    std::map<token, id> token_to_id;
-    std::map<id, token> id_to_token;
-};
-
-void replace(std::string & str, const std::string & needle, const std::string & replacement);
-
-// poor-man's JSON parsing
-std::map<std::string, int32_t> json_parse(const std::string & fname);
-
-// split text into tokens
-//
-// ref: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
-//
-// Regex (Python):
-// r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
-//
-// Regex (C++):
-// R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)"
-//
-std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text);
-
-// TODO: this is probably wrong, but I cannot figure out how this tokenizer works ..
-// ref: https://github.com/google/sentencepiece
-std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos);
-
-// load the tokens from encoder.json
-bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab);
-
-// sample next token given probabilities for each embedding
-//
-//   - consider only the top K tokens
-//   - from them, consider only the top tokens with cumulative probability > P
-//
-gpt_vocab::id llama_sample_top_p_top_k(
-        const gpt_vocab & vocab,
-        const float * logits,
-        std::vector<gpt_vocab::id> & last_n_tokens,
-        double repeat_penalty,
-        int top_k,
-        double top_p,
-        double temp,
-        std::mt19937 & rng);
-
-// filer to top K tokens from list of logits
-void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k);
-
-//
-// Quantization
-//
-
-size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist);
-size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist);
+std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);
				`@@ -0,0 +1 @@`
				`Below is an instruction that describes a task. Write a response that appropriately completes the request.`