Compare commits

..

111 Commits

Author SHA1 Message Date
Thiago Padilha
3a0dcb3920 Implement server mode.
This new mode works by first loading the model then listening for TCP
connections on a port. When a connection is received, arguments will be
parsed using a simple protocol:

- First the number of arguments will be read followed by a newline
  character.
- Then each argument will be read, separated by the 0 byte.
- With this we build an argument vector, similar to what is passed to
  the program entry point. We pass this to gpt_params_parse.

Finally `run` will be executed with the input/output streams connected
to the socket.

Signed-off-by: Thiago Padilha <thiago@padilha.cc>
2023-03-22 14:34:19 -03:00
Thiago Padilha
bf44faa0ee Remove direct access to std streams from "run"
The goal is to allow running "run" while connected to other streams,
such as TCP sockets.

Signed-off-by: Thiago Padilha <thiago@padilha.cc>
2023-03-22 14:34:18 -03:00
Thiago Padilha
b7f1fa6d8c Move llama_context setup + perplexity back to main.cpp
Signed-off-by: Thiago Padilha <thiago@padilha.cc>
2023-03-22 14:31:41 -03:00
Thiago Padilha
d7d53b84db Add main.cpp back and invoke "run" from it
Signed-off-by: Thiago Padilha <thiago@padilha.cc>
2023-03-22 14:31:41 -03:00
Thiago Padilha
90175ee13f Move main.cpp to run.cpp
Signed-off-by: Thiago Padilha <thiago@padilha.cc>
2023-03-22 14:31:35 -03:00
Stephan Walter
69c92298a9 Deduplicate q4 quantization functions (#383)
* Deduplicate q4 quantization functions

* Use const; add basic test

* Re-enable quantization test

* Disable AVX2 flags in CI

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-22 19:29:06 +02:00
Valentyn Bezshapkin
97940520e8 fix: add POSIX functionality for Linux compilation (#51)
* fix: add POSIX functionality for Linux compilation

* fix: older standard for compatibility
2023-03-22 19:20:25 +02:00
tjohnman
305ba6f0e6 Don't force immediate interactive without -i (#354)
* Don't force immediate interactive without -i

Sometimes we might want to use a reverse prompt but we want to let the
model generate tokens right after the initial prompt. So we don't force
user input mode if the -i flag wasn't specified and instead let it run
until we encounter the reverse prompt.

This gives use some more flexibility, since it doesn't force the user to
enter a newline if they want to let the model generate text right after
the initial prompt and only be asked for input if the reverse prompt is
encountered.

The `--interactive-first` flag is reintroduced to force the old
behavior. `-r` behaves like `-i` plus introduces a reverse prompt (it
can be specified more than once).

* Update help output.

---------

Co-authored-by: Johnman <tjohnman@github>
2023-03-22 19:16:35 +02:00
Erik Scholz
4122dffff9 cmake: make llama an actual library (#392) 2023-03-22 18:37:10 +02:00
Erik Scholz
56e659a0b2 fix perplexity after c-api refactor (#390)
* preallocate a buffer of fitting size for tokenization (utils.cpp)

* don't create a new std::string (especially here, where it's usually large)
2023-03-22 18:09:38 +02:00
Gary Linscott
40ea807a97 Add details on perplexity to README.md (#395) 2023-03-22 08:53:54 -07:00
Yusuf Kağan Hanoğlu
d5850c53ca Add missing header for memcpy (#386)
fixed: memcpy is not defined
2023-03-22 10:55:45 +02:00
Georgi Gerganov
ae44e23ee3 When seed <= 0 - use the clock to generate one 2023-03-22 07:47:15 +02:00
Georgi Gerganov
928480ef5b Init llama_context_params properly from CLI (#370) 2023-03-22 07:45:14 +02:00
Georgi Gerganov
56817b1f88 Remove temporary notice and update hot topics 2023-03-22 07:34:02 +02:00
Georgi Gerganov
f5a77a629b Introduce C-style API (#370)
* Major refactoring - introduce C-style API

* Clean up

* Add <cassert>

* Add <iterator>

* Add <algorithm> ....

* Fix timing reporting and accumulation

* Measure eval time only for single-token calls

* Change llama_tokenize return meaning
2023-03-22 07:32:36 +02:00
Gary Mulder
da0e9fe90c Add SHA256SUMS file and instructions to README how to obtain and verify the downloads
Hashes created using:

sha256sum models/*B/*.pth models/*[7136]B/ggml-model-f16.bin* models/*[7136]B/ggml-model-q4_0.bin* > SHA256SUMS
2023-03-21 23:19:11 +01:00
anzz1
e6c9e0986c Fix bin dir for win ci 2023-03-22 00:01:08 +02:00
Erik Scholz
01a297b099 specify build type for ctest on windows (#371) 2023-03-21 23:34:25 +02:00
Georgi Gerganov
3366853e41 Add notice about pending change 2023-03-21 22:57:35 +02:00
Mathieu Nayrolles
3f9c6135e4 fix typo in chatLLaMa (#368)
The prompt contains a typo where 'alound' is used instead of 'aloud'.
2023-03-21 22:52:27 +02:00
Georgi Gerganov
0f61352708 Update issue templates 2023-03-21 19:47:27 +02:00
Fabio R. Sluzala
353ec251a4 We could use std::unordered_map over std::map (#305)
* Improve performance by changing std::map to std::unordered_map and std::map<id, token> id_to_token; to std::vector<token> id_to_token;

* fix last commit on gpt_vocab_init add vocab.id_to_token.resize(vocab.token_to_id.size());

* Removed include <map>

* Nest struct token score inside gpt_vocab

* renamed token to tok
2023-03-21 19:21:50 +02:00
Matvey Soloviev
89d5d90f3b Fix color codes emitting mid-UTF8 code. (#312) 2023-03-21 19:11:01 +02:00
comex
16ffc013c6 Importer for GPTQ quantized LLaMA models (#301)
* [WIP, broken] Importer for GPTQ quantized LLaMA models

Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa

Current status: Something is busted.  The output starts out decent, but
quickly degrades into gibberish.  This doesn't happen with either the
original GPTQ-for-LLaMa using the same weights, or llama.cpp when using
weights quantized by its own quantizer.  Is there a bug in the
conversion script that somehow only comes into play with a large context
size?

I did notice one potential issue.  It's clearly not the main cause of
the gibberish, since it doesn't happen when using q4_1 weights quantized
by llama.cpp itself, but it seems concerning.  When doing a matrix
multiplication of f16 * f32 => f32 or q4_1 * f32 => f32, at least when
the multiplication is not done with BLAS, the intermediate results are
stored in the smaller format rather than f32.  This seems like an
unnecessary waste of precision, especially in the q4_1 case.

I was originally hoping to validate the results by matching the Python
implementation's output exactly, but precision and non-associativity
issues make this very difficult, including when performing matrix
multiplications and, especially, computing norms.

Anyway, design details:

The models being imported store per-layer weights in essentially q4_1
format, although the addend and scale are shared across an entire row
rather than every group of 32 weights.  This script duplicates the
addend and scale to match ggml's expectations, at the cost of wasting
some memory.

However, there are two differences which I accommodated changing the
output format (and adding corresponding support to main.cpp) rather than
having the script match the existing one:

- The tok_embeddings and output weights (i.e. the weights that aren't
  per-layer) are f16 instead of q4_1.  They could be converted to q4_1,
  and the impact of the loss of precision would probably be low, but
  this would rule out exactly matching the Python implementation's
  output for validation.

- There is no sharding, since the input doesn't have it, and for a
  CPU-only implementation it seems more useful to avoid having to deal
  with multiple files.

The new format is differentiated from existing q4_1 format by changing
the 'f16' header flag to a new value, 4.  That said, I think a cleaner
approach would be to change main.cpp to support loading each tensor with
an arbitrary sharding configuration and type rather than hardcoding
specific combinations of types.  So far I've wasted too much time
debugging to try implementing this...

* Add missing permutation.  Now it works.

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 18:42:25 +02:00
Gary Linscott
486ae645fd Compute perplexity over prompt (#270)
* Compute perplexity over prompt

* More accurate perplexity calculation - over all logits in the context window (so 512x more tokens!)

* Output all perplexitiies

* Add timing/ETA
2023-03-21 18:27:42 +02:00
Jean-Christophe Hoelt
3ab3e6582f Add chatLLaMa script (#198)
* Add chatLLaMa script

* Fix shellcheck errors and do some cleanup

* Move chatLLaMa script to `examples` directory

* Reduce chatLLaMa context size to 2048

Ref d7def1a752

* Include n_predict to 2048 in examples/chatLLaMa
2023-03-21 18:23:15 +02:00
Alex von Gluck IV
f157088cb7 makefile: Fix CPU feature detection on Haiku (#218) 2023-03-21 18:21:06 +02:00
anzz1
c86ba036e6 Enable ANSI colors on Windows 10+ (#311)
* Enable ANSI colors on Windows 10+

On older versions function will silently fail without any ill effects

* Do not call SetConsoleMode if the mode is already set

* Update main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 18:14:46 +02:00
Georgi Gerganov
1daf4dd712 Minor style changes 2023-03-21 18:10:32 +02:00
Georgi Gerganov
dc6a845b85 Add chat.sh script 2023-03-21 18:09:46 +02:00
tjohnman
6a612959e1 Check for reverse prompt by characters instead of tokens (#292) (#330)
* Check for reverse prompt by characters instead of tokens (#292)

* Update main.cpp

Wording.

* Cleanup.

* Remove unnecessary use of std::stringstream.

---------

Co-authored-by: Johnman <tjohnman@github>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 18:05:06 +02:00
tjohnman
d5f56a5e5a Check for reverse prompt by characters instead of tokens (#292) (#330)
* Check for reverse prompt by characters instead of tokens (#292)

* Update main.cpp

Wording.

* Cleanup.

* Remove unnecessary use of std::stringstream.

---------

Co-authored-by: Johnman <tjohnman@github>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 18:04:43 +02:00
Georgi Gerganov
3bfa3b43b7 Fix convert script, warnings alpaca instructions, default params 2023-03-21 17:59:16 +02:00
Kevin Lo
715d292ee0 Add OpenBSD support (#314) 2023-03-21 17:50:09 +02:00
Mack Straight
c98ae02668 fix typo in comment (#318) 2023-03-21 17:49:43 +02:00
Qingyou Meng
c3b2306b18 Makefile: slightly cleanup for Mac Intel; echo instead of run ./main -h (#335) 2023-03-21 17:44:11 +02:00
anzz1
975d2cebf9 cmdline option for custom amount of model parts (--n_parts N) (#348)
* cmdline option for custom amount of model parts (--n_parts N)

* Update main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 17:42:43 +02:00
Kevin Kwok
e0ffc861fa Update IPFS links to quantized alpaca with new tokenizer format (#352) 2023-03-21 17:34:49 +02:00
Georgi Gerganov
8f644a0a85 Change default repeat_penalty to 1.0
I feel this penalty is not really helping.
Especially for the example from the README it makes results pretty bad
2023-03-21 17:32:14 +02:00
Georgi Gerganov
eb34620aec Add tokenizer test + revert to C++11 (#355)
* Add test-tokenizer-0 to do a few tokenizations - feel free to expand
* Added option to convert-pth-to-ggml.py script to dump just the vocabulary
* Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)
* Added utility to load vocabulary file from previous point (temporary implementation)
* Avoid using std::string_view and drop back to C++11 (hope I didn't break something)
* Rename gpt_vocab -> llama_vocab
* All CMake binaries go into ./bin/ now
2023-03-21 17:29:41 +02:00
Casey Primozic
2e664f1ff4 Add initial AVX512 support for dot product on Linux (#320)
* Update Makefile to detect AVX512 support and add compiler flags if it's available
 * Based on existing AVX2 implementation, dot product on one 32-value block of 4-bit quantized ints at a time
 * Perform 8 bit -> 16 bit sign extension and multiply+add on 32 values at time instead of 16
 * Use built-in AVX512 horizontal reduce add to get sum at the end
 * Manual unrolling on inner dot product loop to reduce loop counter overhead
2023-03-21 15:35:42 +01:00
nusu-github
8cf9f34edd Adding missing features of CMakeLists.txt & Refactoring (#131)
* Functionality addition CMakeLists.txt

Refactoring:
1. Simplify more options that are negation of negation.
LLAMA_NO_ACCELERATE -> LLAMA_ACCELERATE
2. Changed to an optional expression instead of forcing to enable AVX2 in MSVC.
3. Make CMAKE_CXX_STANDARD, which is different from Makefile, the same.
4. Use add_compile_options instead of adding options to CMAKE_C_FLAGS.
5. Make utils use target_link_libraries instead of directly referencing code.

Added features:
1. Added some options.
LLAMA_STATIC_LINK,LLAMA_NATIVE,LLAMA_LTO,LLAMA_GPROF,LLAMA_OPENBLAS

* Fix Accelerate link in CMake

* Windows build Fix

* C++11 to C++17

* Reflects C/C++ standard individually

* Change the version to 3.12

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-21 01:37:16 +01:00
Ben Siraphob
bd4b46d6ba Nix flake: set meta.mainProgram to llama 2023-03-20 22:50:22 +01:00
Qingyou Meng
6b6d5b5024 Fixed tokenizer.model not found error when model dir is symlink (#325) 2023-03-20 19:33:10 +00:00
Mack Straight
a791a68b61 move file magic/version to header, print expected version (#319) 2023-03-20 19:26:01 +00:00
Bernat Vadell
0f1b21cb90 Docker - Fix publish docker image in GitHub Registry (#235)
* fix publish permission

* try to fix docker pipeline using as password github_token & username repository_owner
2023-03-20 18:05:20 +01:00
Mack Straight
074bea2eb1 sentencepiece bpe compatible tokenizer (#252)
* potential out of bounds read

* fix quantize

* style

* Update convert-pth-to-ggml.py

* mild cleanup

* don't need the space-prefixing here rn since main.cpp already does it

* new file magic + version header field

* readme notice

* missing newlines

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
2023-03-20 03:17:23 -07:00
Stephan Walter
5cb63e2493 Add tqdm to Python requirements (#293)
* Add tqdm to Python requirements
* Remove torchvision torchaudio, add requests
2023-03-20 09:24:11 +01:00
cocktailpeanut
da5303c1ea bugfix: default should not be interactive (#304) 2023-03-19 23:44:20 +02:00
Georgi Gerganov
4545539d71 Rename script 2023-03-19 21:58:51 +02:00
Georgi Gerganov
edeba28366 Add temporary helper script for Alpaca chat 2023-03-19 21:57:48 +02:00
Rickey Bowers Jr
5c19c70ba6 fix coloring of last n_batch of prompt, and refactor line input (#221)
* fix coloring of last `n_batch` of prompt, and refactor line input
* forgot the newline that needs to be sent to the model
* (per #283) try to force flush of color reset in SIGINT handler
2023-03-19 19:44:30 +00:00
tjohnman
24568371ae Support for multiple reverse prompts. (#299)
Co-authored-by: Johnman <>
Co-authored-by: Johnman <tjohnman@github>
2023-03-19 21:33:06 +02:00
Suaj Carrot
7392f1cd2c Improved quantize script (#222)
* Improved quantize script

I improved the quantize script by adding error handling and allowing to select many models for quantization at once in the command line. I also converted it to Python for generalization as well as extensibility.

* Fixes and improvements based on Matt's observations

Fixed and improved many things in the script based on the reviews made by @mattsta. The parallelization suggestion is still to be revised, but code for it was still added (commented).

* Small fixes to the previous commit

* Corrected to use the original glob pattern

The original Bash script uses a glob pattern to match files that have endings such as ...bin.0, ...bin.1, etc. That has been translated correctly to Python now.

* Added support for Windows and updated README to use this script

New code to set the name of the quantize script binary depending on the platform has been added (quantize.exe if working on Windows) and the README.md file has been updated to use this script instead of the Bash one.

* Fixed a typo and removed shell=True in the subprocess.run call

Fixed a typo regarding the new filenames of the quantized models and removed the shell=True parameter in the subprocess.run call as it was conflicting with the list of parameters.

* Corrected previous commit

* Small tweak: changed the name of the program in argparse

This was making the automatic help message to be suggesting the program's usage as being literally "$ Quantization Script [arguments]". It should now be something like "$ python3 quantize.py [arguments]".
2023-03-19 20:38:44 +02:00
tjohnman
ad5fd5b60c Make prompt randomization optional. (#300)
Co-authored-by: Johnman <>
2023-03-19 20:36:19 +02:00
tjohnman
368d0c8a9e Respect the maximum number of tokens in interactive. (#298)
Co-authored-by: Johnman <johnman@github>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-19 20:31:17 +02:00
slaren
50fae10d03 Add --ignore-eos parameter (#181)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-19 20:22:48 +02:00
Qingyou Meng
084e2f0ec0 interactive mode: print '\n' in sigint_handler, this flush stdout thus ensure color reset. (#283) 2023-03-19 20:10:00 +02:00
Erik Scholz
0b366e7357 Command line switch to use F16 for memory_k and memory_v (refactor of #154) (#294)
* Use F16 for memory_k and memory_v

* add command line switch to use f16 instead of f32 for memory k+v

---------

Co-authored-by: Ty Everett <ty@tyweb.us>
2023-03-19 19:57:00 +02:00
Georgi Gerganov
160bfb217d Update hot topics to mention Alpaca support 2023-03-19 19:51:55 +02:00
Georgi Gerganov
c494ed5b94 Fix off-by-one bug (#115) 2023-03-19 19:46:32 +02:00
Georgi Gerganov
c1c7026b47 Fix python stuff (#109) 2023-03-19 19:33:18 +02:00
qunash
467b149761 Refactoring convert-pth-to-ggml.py: more concise and readable (#109)
* Refactor get_n_parts function to simplify code and improve readability

* Use f-strings instead of concatenation

* Refactoring: more concise and readable

* modularize

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-19 19:17:39 +02:00
Georgi Gerganov
70f01cb863 Drop trailing new line from file prompts (#80) 2023-03-19 19:05:04 +02:00
Georgi Gerganov
a4e63b73df Add instruction for using Alpaca (#240) 2023-03-19 18:49:50 +02:00
Georgi Gerganov
9e1707218a Add "--instruct" argument for usage with Alpaca (#240)
Also start adding prompts in "./prompts"
2023-03-19 18:37:02 +02:00
Georgi Gerganov
22213a17b5 Change RMSNorm eps to 1e-6 (#173)
I think this is what is used in the Python code
2023-03-19 17:30:00 +02:00
Ronsor
d7def1a752 Warn user if a context size greater than 2048 tokens is specified (#274)
LLaMA doesn't support more than 2048 token context sizes, and going above that produces terrible results.
2023-03-18 20:10:47 -04:00
Pavol Rusnak
6f61c18ec9 Fix typo in readme 2023-03-18 23:18:04 +01:00
Pavol Rusnak
1e5a6d088d Add note about Python 3.11 to readme 2023-03-18 22:25:35 +01:00
Pavol Rusnak
554b541521 Add memory/disk requirements to readme 2023-03-18 22:25:35 +01:00
Alex Nguyen
d3f202d57b Remove unused code since n_vocab is model.hparams.n_vocab (#262) 2023-03-18 13:51:49 +00:00
Justin Suess
e03e359730 fixed warning with std::ignore about unused function result (#151)
fixed warning with std::ignore about unused function result
2023-03-18 11:44:09 +00:00
Gary Linscott
a81d0c2a17 Fix n^2 loop in tokenization (#254)
This causes long prompts to parse very slowly.
2023-03-18 11:17:19 +00:00
anzz1
b2de7f18df CI Improvements (#230)
* CI Improvements

Manual build feature, autoreleases for Windows

* better CI naming convention

use branch name in releases and tags
2023-03-18 09:27:12 +02:00
Niklas Korz
a292747893 Nix flake (#40)
* Nix flake

* Nix: only add Accelerate framework on macOS

* Nix: development shel, direnv and compatibility

* Nix: use python packages supplied by withPackages

* Nix: remove channel compatibility

* Nix: fix ARM neon dotproduct on macOS

---------

Co-authored-by: Pavol Rusnak <pavol@rusnak.io>
2023-03-17 23:03:48 +01:00
thement
c9f670a177 Implement non-greedy tokenizer that tries to maximize token lengths (#242)
* Implement non-greedy tokenizer that tries to maximize token lengths

* Insert single space in front of the prompt

- this is to match original llama tokenizer behavior

---------

Co-authored-by: Jakub Horak <jakub.horak@ibawizard.net>
2023-03-17 21:05:58 +01:00
Georgi Gerganov
4f54609110 Default to 4 threads (#243) 2023-03-17 21:46:46 +02:00
Georgi Gerganov
e81b9c81c1 Update Contributing section 2023-03-17 20:30:04 +02:00
Stephan Walter
367946c668 Don't tell users to use a bad number of threads (#243)
The readme tells people to use the command line option "-t 8", causing 8
threads to be started. On systems with fewer than 8 cores, this causes a
significant slowdown. Remove the option from the example command lines
and use /proc/cpuinfo on Linux to determine a sensible default.
2023-03-17 19:47:35 +02:00
mmyjona
6b0df5ccf3 add ptread link to fix cmake build under linux (#114)
* add ptread link to fix cmake build under linux

* add cmake to linux and macos platform

* separate make and cmake workflow

---------

Co-authored-by: Sebastián A <sebastian.aedo29@gmail.com>
2023-03-17 13:38:24 -03:00
Bernat Vadell
2af23d3043 🚀 Dockerize llamacpp (#132)
* feat: dockerize llamacpp

* feat: split build & runtime stages

* split dockerfile into main & tools

* add quantize into tool docker image

* Update .devops/tools.sh

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* add docker action pipeline

* change CI to publish at github docker registry

* fix name runs-on macOS-latest is macos-latest (lowercase)

* include docker versioned images

* fix github action docker

* fix docker.yml

* feat: include all-in-one command tool & update readme.md

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-17 10:47:06 +01:00
Matvey Soloviev
904d2a8d6a Q4_1 quantization (#193)
* Add AVX2 version of ggml_vec_dot_q4_1

* Small optimisations to q4_1 dot product (@Const-me)

* Rearrange Q4_1 quantization to work for multipart models. (Fix #152)

* Fix ggml_vec_mad_q4_1 too

* Fix non-vectorised q4_1 vec mul
2023-03-17 06:48:39 +02:00
Georgi Gerganov
721311070e Update README.md 2023-03-16 15:00:09 +02:00
Georgi Gerganov
ac15de7895 Expand "Contributing" section 2023-03-16 08:55:13 +02:00
Georgi Gerganov
273abc47ff Update hot topics - RMSnorm 2023-03-16 07:12:12 +02:00
Nebula
9b4a15b17d Fix RMS norm in GGML (#191) 2023-03-15 19:29:25 -04:00
hoangmit
6eac39ba95 Add RMS norm and use it (#187)
* add ggml_rms_norm

* update op num
2023-03-16 00:41:38 +02:00
moritzbrantner
27944c4206 fixed typo (#178) 2023-03-15 22:35:25 +02:00
Rickey Bowers Jr
2d15d6c9a9 add SIGINT support for _WIN32 environments (#120)
* add SIGINT support for _WIN32 environments

* perhaps more consistent
2023-03-15 21:56:24 +02:00
Justin Suess
2d64715ad4 added ctx_size parameter (#148)
* added ctx_size parameter

* added it in more places

* Apply suggestions from code review

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-15 21:42:40 +02:00
Justin Suess
16b2c61a22 fixed color reset on exit (#149)
* fixed color reset on exit

* added sigint handler for ansi_color_reset

* Update main.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-15 21:39:38 +02:00
Musab Gultekin
977295c700 Fix potential licensing issue (#126)
* Update README.md

* Update README.md

remove facebook
2023-03-15 21:39:06 +02:00
Ronsor
956dfda8ad Use tokenizer.vocab_size() instead of hardcoding 32000 in convert-pth-to-ggml.py (#142)
There are ways that special tokens or other new tokens could be added to the tokenizer; therefore it's probably best not to assume the vocabulary is only 32000 tokens.
2023-03-15 21:37:50 +02:00
hoangmit
113e685d18 inline -> static inline for "bytesFromNibbles" (#161)
Without "static" prefix, it fails to compile in clang
2023-03-15 21:05:14 +02:00
Ronsor
47857e564c Don't use vdotq_s32 if it's not available (#139)
* Don't use vdotq_s32 if it's not available

`dotprod` extensions aren't available on some ARM CPUs (e.g. Raspberry Pi 4), so check for them and only use them if they're available.

Reintroduces the code removed in 84d9015 if `__ARM_FEATURE_DOTPROD` isn't defined.

* Update ggml.c

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-14 21:34:37 +02:00
Radoslav Gerganov
60f819a2b1 Add section to README on how to run the project on Android (#130) 2023-03-14 15:30:08 +02:00
Georgi Gerganov
97ab2b2578 Add Misc section + update hot topics + minor fixes 2023-03-14 09:43:52 +02:00
Sebastián A
2f700a2738 Add windows to the CI (#98) 2023-03-13 22:29:10 +02:00
Georgi Gerganov
c09a9cfb06 CMake build in Release by default (#75) 2023-03-13 21:22:15 +02:00
Georgi Gerganov
7ec903d3c1 Update contribution section, hot topics, limitations, etc. 2023-03-13 19:21:51 +02:00
Georgi Gerganov
4497ad819c Print system information 2023-03-13 19:15:08 +02:00
Sebastián A
ed6849cc07 Initial support for CMake (#75) 2023-03-13 19:12:33 +02:00
Thomas Klausner
41be0a3b3d Add NetBSD support. (#90) 2023-03-13 18:40:54 +02:00
Pavol Rusnak
671d5cac15 Use fprintf for diagnostic output (#48)
keep printf only for printing model output

one can now use ./main ... 2>dev/null to suppress any diagnostic output
2023-03-13 18:39:56 +02:00
Georgi Gerganov
84d9015c4a Use vdotq_s32 to improve performance (#67)
* 10% performance boost on ARM

* Back to original change
2023-03-13 18:36:44 +02:00
uint256_t
63fd76fbb0 Reduce model loading time (#43)
* Use buffering

* Use vector

* Minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-13 18:33:43 +02:00
Val Kharitonov
2a20f48efa Fix UTF-8 handling (including colors) (#79) 2023-03-13 18:24:18 +02:00
Pavol Rusnak
d1f224712d Add quantize script for batch quantization (#92)
* Add quantize script for batch quantization

* Indentation

* README for new quantize.sh

* Fix script name

* Fix file list on Mac OS

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-03-13 18:15:20 +02:00
Georgi Gerganov
1808ee0500 Add initial contribution guidelines 2023-03-13 09:42:26 +02:00
42 changed files with 4864 additions and 2021 deletions

17
.devops/full.Dockerfile Normal file
View File

@@ -0,0 +1,17 @@
ARG UBUNTU_VERSION=22.04
FROM ubuntu:$UBUNTU_VERSION as build
RUN apt-get update && \
apt-get install -y build-essential python3 python3-pip
RUN pip install --upgrade pip setuptools wheel \
&& pip install numpy requests sentencepiece torch tqdm
WORKDIR /app
COPY . .
RUN make
ENTRYPOINT ["/app/.devops/tools.sh"]

18
.devops/main.Dockerfile Normal file
View File

@@ -0,0 +1,18 @@
ARG UBUNTU_VERSION=22.04
FROM ubuntu:$UBUNTU_VERSION as build
RUN apt-get update && \
apt-get install -y build-essential
WORKDIR /app
COPY . .
RUN make
FROM ubuntu:$UBUNTU_VERSION as runtime
COPY --from=build /app/main /main
ENTRYPOINT [ "/main" ]

46
.devops/tools.sh Executable file
View File

@@ -0,0 +1,46 @@
#!/bin/bash
set -e
# Read the first argument into a variable
arg1="$1"
# Shift the arguments to remove the first one
shift
# Join the remaining arguments into a single string
arg2="$@"
if [[ $arg1 == '--convert' || $arg1 == '-c' ]]; then
python3 ./convert-pth-to-ggml.py $arg2
elif [[ $arg1 == '--quantize' || $arg1 == '-q' ]]; then
./quantize $arg2
elif [[ $arg1 == '--run' || $arg1 == '-r' ]]; then
./main $arg2
elif [[ $arg1 == '--download' || $arg1 == '-d' ]]; then
python3 ./download-pth.py $arg2
elif [[ $arg1 == '--all-in-one' || $arg1 == '-a' ]]; then
echo "Downloading model..."
python3 ./download-pth.py "$1" "$2"
echo "Converting PTH to GGML..."
for i in `ls $1/$2/ggml-model-f16.bin*`; do
if [ -f "${i/f16/q4_0}" ]; then
echo "Skip model quantization, it already exists: ${i/f16/q4_0}"
else
echo "Converting PTH to GGML: $i into ${i/f16/q4_0}..."
./quantize "$i" "${i/f16/q4_0}" 2
fi
done
else
echo "Unknown command: $arg1"
echo "Available commands: "
echo " --run (-r): Run a model previously converted into ggml"
echo " ex: -m /models/7B/ggml-model-q4_0.bin -p \"Building a website can be done in 10 simple steps:\" -n 512"
echo " --convert (-c): Convert a llama model into ggml"
echo " ex: \"/models/7B/\" 1"
echo " --quantize (-q): Optimize with quantization process ggml"
echo " ex: \"/models/7B/ggml-model-f16.bin\" \"/models/7B/ggml-model-q4_0.bin\" 2"
echo " --download (-d): Download original llama model from CDN: https://agi.gpt4.org/llama/"
echo " ex: \"/models/\" 7B"
echo " --all-in-one (-a): Execute --download, --convert & --quantize"
echo " ex: \"/models/\" 7B"
fi

24
.dockerignore Normal file
View File

@@ -0,0 +1,24 @@
*.o
*.a
.cache/
.vs/
.vscode/
.DS_Store
build/
build-em/
build-debug/
build-release/
build-static/
build-no-accel/
build-sanitize-addr/
build-sanitize-thread/
models/*
/main
/quantize
arm_neon.h
compile_commands.json
Dockerfile

198
.github/ISSUE_TEMPLATE/custom.md vendored Normal file
View File

@@ -0,0 +1,198 @@
---
name: Custom issue template
about: Used to report user-related issues with the software
title: "[User] I encountered a problem .."
labels: ''
assignees: ''
---
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [ ] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [ ] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.
# Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected `lamma.cpp` to do.
# Current Behavior
Please provide a detailed written description of what `lamma.cpp` did, instead.
# Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
* Physical (or virtual) hardware you are using, e.g. for Linux:
`$ lscpu`
* Operating System, e.g. for Linux:
`$ uname -a`
* SDK version, e.g. for Linux:
```
$ python3 --version
$ make --version
$ g++ --version
```
# Models
* The LLaMA models are officially distributed by Facebook and will never be provided through this repository. See this [pull request in Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to obtain access to the model data.
* If your issue is with model conversion please verify the `sha256sum` of each of your `consolidated*.pth` and `ggml-model-XXX.bin` files to confirm that you have the correct model data files before logging an issue. [Latest sha256 sums for your reference](https://github.com/ggerganov/llama.cpp/issues/238).
* If your issue is with model generation quality then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
* LLaMA:
* [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
* [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
* GPT-3
* [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
* GPT-3.5 / InstructGPT / ChatGPT:
* [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
* [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
# Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
# Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
1. step 1
2. step 2
3. step 3
4. etc.
# Failure Logs
Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
Also, please try to **avoid using screenshots** if at all possible. Instead, copy/paste the console output and use [Github's markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to cleanly format your logs for easy readability. e.g.
```
llama.cpp$ git log | head -1
commit 2af23d30434a677c6416812eea52ccc0af65119c
llama.cpp$ lscpu | egrep "AMD|Flags"
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev
Virtualization: AMD-V
llama.cpp$ python3 --version
Python 3.10.9
llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
numpy 1.24.2
numpydoc 1.5.0
sentencepiece 0.1.97
torch 1.13.1
torchvision 0.14.1
llama.cpp$ make --version | head -1
GNU Make 4.3
$ md5sum ./models/65B/ggml-model-q4_0.bin
dbdd682cce80e2d6e93cefc7449df487 ./models/65B/ggml-model-q4_0.bin
```
Here's a run with the Linux command [perf](https://www.brendangregg.com/perf.html)
```
llama.cpp$ perf stat ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p "Please close your issue when it has been answered."
main: seed = 1679149377
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 8192
llama_model_load: n_mult = 256
llama_model_load: n_head = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size = 2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: prompt: 'Please close your issue when it has been answered.'
main: number of tokens in prompt = 11
1 -> ''
12148 -> 'Please'
3802 -> ' close'
596 -> ' your'
2228 -> ' issue'
746 -> ' when'
372 -> ' it'
756 -> ' has'
1063 -> ' been'
7699 -> ' answered'
29889 -> '.'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Please close your issue when it has been answered.
@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]
main: mem per token = 71159620 bytes
main: load time = 19309.95 ms
main: sample time = 168.62 ms
main: predict time = 223895.61 ms / 888.47 ms per token
main: total time = 246406.42 ms
Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':
3636882.89 msec task-clock # 14.677 CPUs utilized
13509 context-switches # 3.714 /sec
2436 cpu-migrations # 0.670 /sec
10476679 page-faults # 2.881 K/sec
13133115082869 cycles # 3.611 GHz (16.77%)
29314462753 stalled-cycles-frontend # 0.22% frontend cycles idle (16.76%)
10294402631459 stalled-cycles-backend # 78.39% backend cycles idle (16.74%)
23479217109614 instructions # 1.79 insn per cycle
# 0.44 stalled cycles per insn (16.76%)
2353072268027 branches # 647.002 M/sec (16.77%)
1998682780 branch-misses # 0.08% of all branches (16.76%)
247.802177522 seconds time elapsed
3618.573072000 seconds user
18.491698000 seconds sys
```

View File

@@ -1,8 +1,42 @@
name: CI
on: [push, pull_request]
on:
workflow_dispatch: # allows manual triggering
inputs:
create_release:
description: 'Create new release'
required: true
type: boolean
push:
paths: ['.github/workflows/**', 'CMakeLists.txt', 'Makefile', '**.h', '*.c', '**.cpp']
pull_request:
types: [opened, synchronize, edited, reopened, review_requested, ready_for_review]
paths: ['CMakeLists.txt', 'Makefile', '**.h', '*.c', '**.cpp']
env:
BRANCH_NAME: ${{ github.head_ref || github.ref_name }}
jobs:
ubuntu-latest:
ubuntu-latest-make:
runs-on: ubuntu-latest
steps:
- name: Clone
id: checkout
uses: actions/checkout@v1
- name: Dependencies
id: depends
run: |
sudo apt-get update
sudo apt-get install build-essential
- name: Build
id: make_build
run: |
make
ubuntu-latest-cmake:
runs-on: ubuntu-latest
steps:
@@ -15,10 +49,32 @@ jobs:
sudo apt-get install build-essential
- name: Build
run: |
mkdir build
cd build
cmake ..
cmake --build . --config Release
ctest --output-on-failure
macOS-latest-make:
runs-on: macos-latest
steps:
- name: Clone
id: checkout
uses: actions/checkout@v1
- name: Dependencies
id: depends
run: |
brew update
- name: Build
id: make_build
run: |
make
macOS-latest:
macOS-latest-cmake:
runs-on: macOS-latest
steps:
@@ -31,7 +87,60 @@ jobs:
- name: Build
run: |
make
mkdir build
cd build
cmake -DLLAMA_AVX2=OFF ..
cmake --build . --config Release
ctest --output-on-failure
windows-latest-cmake:
runs-on: windows-latest
steps:
- name: Clone
id: checkout
uses: actions/checkout@v1
- name: Build
id: cmake_build
run: |
mkdir build
cd build
cmake ..
cmake --build . --config Release
ctest -C Release --output-on-failure
- name: Get commit hash
id: commit
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: pr-mpt/actions-commit-hash@v2
- name: Pack artifacts
id: pack_artifacts
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
run: |
7z a llama-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-x64.zip .\build\bin\Release\*
- name: Create release
id: create_release
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: zendesk/action-create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
tag_name: ${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}
- name: Upload release
id: upload_release
if: ${{ ( github.event_name == 'push' && github.ref == 'refs/heads/master' ) || github.event.inputs.create_release == 'true' }}
uses: actions/upload-release-asset@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
upload_url: ${{ steps.create_release.outputs.upload_url }}
asset_path: .\llama-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-x64.zip
asset_name: llama-${{ env.BRANCH_NAME }}-${{ steps.commit.outputs.short }}-bin-win-x64.zip
asset_content_type: application/octet-stream
# ubuntu-latest-gcc:
# runs-on: ubuntu-latest

61
.github/workflows/docker.yml vendored Normal file
View File

@@ -0,0 +1,61 @@
# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.
# GitHub recommends pinning actions to a commit SHA.
# To get a newer version, you will need to update the SHA.
# You can also reference a tag or branch, but the action may change without warning.
name: Publish Docker image
on:
pull_request:
push:
branches:
- master
jobs:
push_to_registry:
name: Push Docker image to Docker Hub
runs-on: ubuntu-latest
env:
COMMIT_SHA: ${{ github.sha }}
strategy:
matrix:
config:
- { tag: "light", dockerfile: ".devops/main.Dockerfile" }
- { tag: "full", dockerfile: ".devops/full.Dockerfile" }
steps:
- name: Check out the repo
uses: actions/checkout@v3
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
- name: Log in to Docker Hub
uses: docker/login-action@v2
with:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push Docker image (versioned)
if: github.event_name == 'push'
uses: docker/build-push-action@v4
with:
context: .
push: true
tags: "ghcr.io/ggerganov/llama.cpp:${{ matrix.config.tag }}-${{ env.COMMIT_SHA }}"
file: ${{ matrix.config.dockerfile }}
- name: Build and push Docker image (tagged)
uses: docker/build-push-action@v4
with:
context: .
push: ${{ github.event_name == 'push' }}
tags: "ghcr.io/ggerganov/llama.cpp:${{ matrix.config.tag }}"
file: ${{ matrix.config.dockerfile }}

4
.gitignore vendored
View File

@@ -18,6 +18,10 @@ models/*
/main
/quantize
/result
arm_neon.h
compile_commands.json
.envrc
.direnv/

265
CMakeLists.txt Normal file
View File

@@ -0,0 +1,265 @@
cmake_minimum_required(VERSION 3.12) # Don't bump this version for no reason
project("llama.cpp" C CXX)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
if (NOT XCODE AND NOT MSVC AND NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo")
endif()
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
if(CMAKE_SOURCE_DIR STREQUAL CMAKE_CURRENT_SOURCE_DIR)
set(LLAMA_STANDALONE ON)
# configure project version
# TODO
else()
set(LLAMA_STANDALONE OFF)
endif()
if (EMSCRIPTEN)
set(BUILD_SHARED_LIBS_DEFAULT OFF)
option(LLAMA_WASM_SINGLE_FILE "llama: embed WASM inside the generated llama.js" ON)
else()
if (MINGW)
set(BUILD_SHARED_LIBS_DEFAULT OFF)
else()
set(BUILD_SHARED_LIBS_DEFAULT ON)
endif()
endif()
#
# Option list
#
# general
option(LLAMA_STATIC "llama: static link libraries" OFF)
option(LLAMA_NATIVE "llama: enable -march=native flag" OFF)
option(LLAMA_LTO "llama: enable link time optimization" OFF)
# debug
option(LLAMA_ALL_WARNINGS "llama: enable all compiler warnings" ON)
option(LLAMA_ALL_WARNINGS_3RD_PARTY "llama: enable all compiler warnings in 3rd party libs" OFF)
option(LLAMA_GPROF "llama: enable gprof" OFF)
# sanitizers
option(LLAMA_SANITIZE_THREAD "llama: enable thread sanitizer" OFF)
option(LLAMA_SANITIZE_ADDRESS "llama: enable address sanitizer" OFF)
option(LLAMA_SANITIZE_UNDEFINED "llama: enable undefined sanitizer" OFF)
# instruction set specific
option(LLAMA_AVX "llama: enable AVX" ON)
option(LLAMA_AVX2 "llama: enable AVX2" ON)
option(LLAMA_FMA "llama: enable FMA" ON)
# 3rd party libs
option(LLAMA_ACCELERATE "llama: enable Accelerate framework" ON)
option(LLAMA_OPENBLAS "llama: use OpenBLAS" OFF)
option(LLAMA_BUILD_TESTS "llama: build tests" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_EXAMPLES "llama: build examples" ${LLAMA_STANDALONE})
#
# Compile flags
#
set(CMAKE_CXX_STANDARD_REQUIRED true)
set(CMAKE_C_STANDARD_REQUIRED true)
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
if (NOT MSVC)
if (LLAMA_SANITIZE_THREAD)
add_compile_options(-fsanitize=thread)
endif()
if (LLAMA_SANITIZE_ADDRESS)
add_compile_options(-fsanitize=address -fno-omit-frame-pointer)
endif()
if (LLAMA_SANITIZE_UNDEFINED)
add_compile_options(-fsanitize=undefined)
endif()
endif()
if (APPLE AND LLAMA_ACCELERATE)
find_library(ACCELERATE_FRAMEWORK Accelerate)
if (ACCELERATE_FRAMEWORK)
message(STATUS "Accelerate framework found")
add_compile_definitions(GGML_USE_ACCELERATE)
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} ${ACCELERATE_FRAMEWORK})
else()
message(WARNING "Accelerate framework not found")
endif()
endif()
if (LLAMA_OPENBLAS)
if (LLAMA_STATIC)
set(BLA_STATIC ON)
endif()
set(BLA_VENDOR OpenBLAS)
find_package(BLAS)
if (BLAS_FOUND)
message(STATUS "OpenBLAS found")
add_compile_definitions(GGML_USE_OPENBLAS)
add_link_options(${BLAS_LIBRARIES})
else()
message(WARNING "OpenBLAS not found")
endif()
endif()
if (LLAMA_ALL_WARNINGS)
if (NOT MSVC)
set(c_flags
-Wall
-Wextra
-Wpedantic
-Wshadow
-Wcast-qual
-Wstrict-prototypes
-Wpointer-arith
-Wno-unused-function
)
set(cxx_flags
-Wall
-Wextra
-Wpedantic
-Wcast-qual
)
else()
# todo : msvc
endif()
add_compile_options(
"$<$<COMPILE_LANGUAGE:C>:${c_flags}>"
"$<$<COMPILE_LANGUAGE:CXX>:${cxx_flags}>"
)
endif()
if (LLAMA_LTO)
include(CheckIPOSupported)
check_ipo_supported(RESULT result OUTPUT output)
if (result)
set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
else()
message(WARNING "IPO is not supported: ${output}")
endif()
endif()
# Architecture specific
# TODO: probably these flags need to be tweaked on some architectures
# feel free to update the Makefile for your architecture and send a pull request or issue
message(STATUS "CMAKE_SYSTEM_PROCESSOR: ${CMAKE_SYSTEM_PROCESSOR}")
if (NOT MSVC)
if (LLAMA_STATIC)
add_link_options(-static)
if (MINGW)
add_link_options(-static-libgcc -static-libstdc++)
endif()
endif()
if (LLAMA_GPROF)
add_compile_options(-pg)
endif()
if (LLAMA_NATIVE)
add_compile_options(-march=native)
endif()
endif()
if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "arm" OR ${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
message(STATUS "ARM detected")
if (MSVC)
# TODO: arm msvc?
else()
if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "aarch64")
add_compile_options(-mcpu=native)
endif()
# TODO: armv6,7,8 version specific flags
endif()
elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "^(x86_64|i686|AMD64)$")
message(STATUS "x86 detected")
if (MSVC)
if (LLAMA_AVX2)
add_compile_options(/arch:AVX2)
elseif (LLAMA_AVX)
add_compile_options(/arch:AVX)
endif()
else()
add_compile_options(-mf16c)
if (LLAMA_FMA)
add_compile_options(-mfma)
endif()
if (LLAMA_AVX)
add_compile_options(-mavx)
endif()
if (LLAMA_AVX2)
add_compile_options(-mavx2)
endif()
endif()
else()
# TODO: support PowerPC
message(STATUS "Unknown architecture")
endif()
#
# Build libraries
#
add_library(utils OBJECT
utils.cpp
utils.h)
target_include_directories(utils PUBLIC .)
target_compile_features(utils PUBLIC cxx_std_11) # don't bump
target_link_libraries(utils PRIVATE ${LLAMA_EXTRA_LIBS})
add_library(ggml OBJECT
ggml.c
ggml.h)
target_include_directories(ggml PUBLIC .)
target_compile_features(ggml PUBLIC c_std_11) # don't bump
target_link_libraries(ggml PRIVATE Threads::Threads ${LLAMA_EXTRA_LIBS})
add_library(llama
llama.cpp
llama.h)
target_include_directories(llama PUBLIC .)
target_compile_features(llama PUBLIC cxx_std_11) # don't bump
target_link_libraries(llama PRIVATE utils ggml ${LLAMA_EXTRA_LIBS})
#
# Executables
#
add_executable(main
main.cpp
run.cpp)
target_link_libraries(main PRIVATE llama ggml utils)
if(NOT WIN32)
target_sources(main PRIVATE tcp_server.cpp)
endif()
add_executable(quantize quantize.cpp)
target_link_libraries(quantize PRIVATE llama ggml utils)
#
# programs, examples and tests
#
if (LLAMA_BUILD_TESTS AND NOT CMAKE_JS_VERSION)
enable_testing()
add_subdirectory(tests)
endif ()
#if (LLAMA_BUILD_EXAMPLES)
# add_subdirectory(examples)
#endif()

View File

@@ -17,7 +17,7 @@ CXXV := $(shell $(CXX) --version | head -n 1)
# ref: https://github.com/ggerganov/whisper.cpp/issues/66#issuecomment-1282546789
ifeq ($(UNAME_S),Darwin)
ifneq ($(UNAME_P),arm)
SYSCTL_M := $(shell sysctl -n hw.optional.arm64)
SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null)
ifeq ($(SYSCTL_M),1)
# UNAME_P := arm
# UNAME_M := arm64
@@ -30,6 +30,7 @@ endif
# Compile flags
#
# keep standard at C11 and C++11
CFLAGS = -I. -O3 -DNDEBUG -std=c11 -fPIC
CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
LDFLAGS =
@@ -48,6 +49,14 @@ ifeq ($(UNAME_S),FreeBSD)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),NetBSD)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),OpenBSD)
CFLAGS += -pthread
CXXFLAGS += -pthread
endif
ifeq ($(UNAME_S),Haiku)
CFLAGS += -pthread
CXXFLAGS += -pthread
@@ -91,30 +100,59 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
ifneq (,$(findstring sse3,$(SSE3_M)))
CFLAGS += -msse3
endif
AVX512F_M := $(shell grep "avx512f " /proc/cpuinfo)
ifneq (,$(findstring avx512f,$(AVX512F_M)))
CFLAGS += -mavx512f
endif
AVX512BW_M := $(shell grep "avx512bw " /proc/cpuinfo)
ifneq (,$(findstring avx512bw,$(AVX512BW_M)))
CFLAGS += -mavx512bw
endif
AVX512DQ_M := $(shell grep "avx512dq " /proc/cpuinfo)
ifneq (,$(findstring avx512dq,$(AVX512DQ_M)))
CFLAGS += -mavx512dq
endif
AVX512VL_M := $(shell grep "avx512vl " /proc/cpuinfo)
ifneq (,$(findstring avx512vl,$(AVX512VL_M)))
CFLAGS += -mavx512vl
endif
AVX512CD_M := $(shell grep "avx512cd " /proc/cpuinfo)
ifneq (,$(findstring avx512cd,$(AVX512CD_M)))
CFLAGS += -mavx512cd
endif
AVX512ER_M := $(shell grep "avx512er " /proc/cpuinfo)
ifneq (,$(findstring avx512er,$(AVX512ER_M)))
CFLAGS += -mavx512er
endif
AVX512IFMA_M := $(shell grep "avx512ifma " /proc/cpuinfo)
ifneq (,$(findstring avx512ifma,$(AVX512IFMA_M)))
CFLAGS += -mavx512ifma
endif
AVX512PF_M := $(shell grep "avx512pf " /proc/cpuinfo)
ifneq (,$(findstring avx512pf,$(AVX512PF_M)))
CFLAGS += -mavx512pf
endif
else ifeq ($(UNAME_S),Haiku)
AVX1_M := $(shell sysinfo -cpu | grep "AVX ")
ifneq (,$(findstring avx,$(AVX1_M)))
AVX1_M := $(shell sysinfo -cpu | grep -w "AVX")
ifneq (,$(findstring AVX,$(AVX1_M)))
CFLAGS += -mavx
endif
AVX2_M := $(shell sysinfo -cpu | grep "AVX2 ")
ifneq (,$(findstring avx2,$(AVX2_M)))
AVX2_M := $(shell sysinfo -cpu | grep -w "AVX2")
ifneq (,$(findstring AVX2,$(AVX2_M)))
CFLAGS += -mavx2
endif
FMA_M := $(shell sysinfo -cpu | grep "FMA ")
ifneq (,$(findstring fma,$(FMA_M)))
FMA_M := $(shell sysinfo -cpu | grep -w "FMA")
ifneq (,$(findstring FMA,$(FMA_M)))
CFLAGS += -mfma
endif
F16C_M := $(shell sysinfo -cpu | grep "F16C ")
ifneq (,$(findstring f16c,$(F16C_M)))
F16C_M := $(shell sysinfo -cpu | grep -w "F16C")
ifneq (,$(findstring F16C,$(F16C_M)))
CFLAGS += -mf16c
endif
else
CFLAGS += -mfma -mf16c -mavx -mavx2
endif
endif
ifeq ($(UNAME_M),amd64)
CFLAGS += -mavx -mavx2 -mfma -mf16c
endif
ifneq ($(filter ppc64%,$(UNAME_M)),)
POWER9_M := $(shell grep "POWER9" /proc/cpuinfo)
ifneq (,$(findstring POWER9,$(POWER9_M)))
@@ -126,7 +164,8 @@ ifneq ($(filter ppc64%,$(UNAME_M)),)
endif
endif
ifndef LLAMA_NO_ACCELERATE
# Mac M1 - include Accelerate framework
# Mac M1 - include Accelerate framework.
# `-framework Accelerate` works on Mac Intel as well, with negliable performance boost (as of the predict time).
ifeq ($(UNAME_S),Darwin)
CFLAGS += -DGGML_USE_ACCELERATE
LDFLAGS += -framework Accelerate
@@ -181,18 +220,27 @@ default: main quantize
ggml.o: ggml.c ggml.h
$(CC) $(CFLAGS) -c ggml.c -o ggml.o
llama.o: llama.cpp llama.h
$(CXX) $(CXXFLAGS) -c llama.cpp -o llama.o
utils.o: utils.cpp utils.h
$(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o
run.o: run.cpp run.h
$(CXX) $(CXXFLAGS) -c run.cpp -o run.o
tcp_server.o: tcp_server.cpp tcp_server.h
$(CXX) $(CXXFLAGS) -c tcp_server.cpp -o tcp_server.o
clean:
rm -f *.o main quantize
main: main.cpp ggml.o utils.o
$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main $(LDFLAGS)
./main -h
main: main.cpp ggml.o llama.o utils.o run.o tcp_server.o
$(CXX) $(CXXFLAGS) main.cpp ggml.o llama.o utils.o run.o tcp_server.o -o main $(LDFLAGS)
@echo "\x1b[36mrun ./main -h for help\x1b[0m"
quantize: quantize.cpp ggml.o utils.o
$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)
quantize: quantize.cpp ggml.o llama.o utils.o
$(CXX) $(CXXFLAGS) quantize.cpp ggml.o llama.o utils.o -o quantize $(LDFLAGS)
#
# Tests

238
README.md
View File

@@ -3,19 +3,21 @@
[![Actions Status](https://github.com/ggerganov/llama.cpp/workflows/CI/badge.svg)](https://github.com/ggerganov/llama.cpp/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
Inference of [Facebook's LLaMA](https://github.com/facebookresearch/llama) model in pure C/C++
Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
**Hot topics**
**Hot topics:**
- Running on Windows: https://github.com/ggerganov/llama.cpp/issues/22
- Fix Tokenizer / Unicode support: https://github.com/ggerganov/llama.cpp/issues/11
- New C-style API is now available: https://github.com/ggerganov/llama.cpp/pull/370
- [Added Alpaca support](https://github.com/ggerganov/llama.cpp#instruction-mode-with-alpaca)
- Cache input prompts for faster initialization: https://github.com/ggerganov/llama.cpp/issues/64
- Create a `llama.cpp` logo: https://github.com/ggerganov/llama.cpp/issues/105
## Description
The main goal is to run the model using 4-bit quantization on a MacBook
- Plain C/C++ implementation without dependencies
- Apple silicon first-class citizen - optimized via Arm Neon and Accelerate framework
- Apple silicon first-class citizen - optimized via ARM NEON
- AVX2 support for x86 architectures
- Mixed F16 / F32 precision
- 4-bit quantization support
@@ -23,21 +25,22 @@ The main goal is to run the model using 4-bit quantization on a MacBook
This was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022) - I have no idea if it works correctly.
Please do not make conclusions about the models based on the results from this implementation.
For all I know, it can be completely wrong. This project is for educational purposes and is not going to be maintained properly.
New features will probably be added mostly through community contributions, if any.
For all I know, it can be completely wrong. This project is for educational purposes.
New features will probably be added mostly through community contributions.
Supported platforms:
- [X] Mac OS
- [X] Linux
- [ ] Windows (soon)
- [X] Windows (via CMake)
- [X] Docker
---
Here is a typical run using LLaMA-7B:
```java
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
@@ -145,76 +148,201 @@ python3 -m pip install torch numpy sentencepiece
python3 convert-pth-to-ggml.py models/7B/ 1
# quantize the model to 4-bits
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
python3 quantize.py 7B
# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
```
For the bigger models, there are a few extra quantization steps. For example, for LLaMA-13B, converting to FP16 format
will create 2 ggml files, instead of one:
```bash
ggml-model-f16.bin
ggml-model-f16.bin.1
```
You need to quantize each of them separately like this:
```bash
./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q4_0.bin 2
./quantize ./models/13B/ggml-model-f16.bin.1 ./models/13B/ggml-model-q4_0.bin.1 2
```
Everything else is the same. Simply run:
```bash
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128
```
The number of files generated for each model is as follows:
```
7B -> 1 file
13B -> 2 files
30B -> 4 files
65B -> 8 files
```
Currently, it's best to use Python 3.9 or Python 3.10, as `sentencepiece` has not yet published a wheel for Python 3.11.
When running the larger models, make sure you have enough disk space to store all the intermediate files.
### Memory/Disk Requirements
As the models are currently fully loaded into memory, you will need adequate disk space to save them
and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
| model | original size | quantized size (4-bit) |
|-------|---------------|------------------------|
| 7B | 13 GB | 3.9 GB |
| 13B | 24 GB | 7.8 GB |
| 30B | 60 GB | 19.5 GB |
| 65B | 120 GB | 38.5 GB |
### Interactive mode
If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
In this mode, you can always interrupt generation by pressing Ctrl+C and enter one or more lines of text which will be converted into tokens and appended to the current context. You can also specify a *reverse prompt* with the parameter `-r "reverse prompt string"`. This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt which makes LLaMa emulate a chat between multiple users, say Alice and Bob, and pass `-r "Alice:"`.
Here is an example few-shot interaction, invoked with the command
```
./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 256 --repeat_penalty 1.0 --color -i -r "User:" \
-p \
"Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:"
```bash
# default arguments using 7B model
./chat.sh
# custom arguments using 13B model
./main -m ./models/13B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
```
Note the use of `--color` to distinguish between user input and generated text.
![image](https://user-images.githubusercontent.com/1991296/224575029-2af3c7dc-5a65-4f64-a6bb-517a532aea38.png)
### Instruction mode with Alpaca
First, download the `ggml` Alpaca model into the `./models` folder:
```
# use one of these
# TODO: add a script to simplify the download
curl -o ./models/ggml-alpaca-7b-q4.bin -C - https://gateway.estuary.tech/gw/ipfs/QmUp1UGeQFDqJKvtjbSYPBiZZKRjLp8shVP9hT8ZB9Ynv1
curl -o ./models/ggml-alpaca-7b-q4.bin -C - https://ipfs.io/ipfs/QmUp1UGeQFDqJKvtjbSYPBiZZKRjLp8shVP9hT8ZB9Ynv1
curl -o ./models/ggml-alpaca-7b-q4.bin -C - https://cloudflare-ipfs.com/ipfs/QmUp1UGeQFDqJKvtjbSYPBiZZKRjLp8shVP9hT8ZB9Ynv1
```
Now run the `main` tool like this:
```
./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins
```
Sample run:
```
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
Below is an instruction that describes a task. Write a response that appropriately completes the request.
> How many letters are there in the English alphabet?
There 26 letters in the English Alphabet
> What is the most common way of transportation in Amsterdam?
The majority (54%) are using public transit. This includes buses, trams and metros with over 100 lines throughout the city which make it very accessible for tourists to navigate around town as well as locals who commute by tram or metro on a daily basis
> List 5 words that start with "ca".
cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
>
```
### Obtaining and verifying the Facebook LLaMA original model and Stanford Alpaca model data
* The LLaMA models are officially distributed by Facebook and will never be provided through this repository. See this [Pull Request in Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to obtain access to the model data.
* Please verify the sha256 checksums of all of your `consolidated*.pth` and corresponding converted `ggml-model-*.bin` model files to confirm that you have the correct model data files before creating an issue relating to your model files.
The following command will verify if you have all possible latest files in your self-installed `./models` subdirectory:
`sha256sum --ignore-missing -c SHA256SUMS` on Linux
or
`shasum -a 256 --ignore-missing -c SHA256SUMS` on macOS
### Perplexity (Measuring model quality)
You can pass `--perplexity` as a command line option to measure perplexity over the given prompt. For more background,
see https://huggingface.co/docs/transformers/perplexity. However, in general, lower perplexity is better for LLMs.
#### Measurements
https://github.com/ggerganov/llama.cpp/pull/270 is the unofficial tracking page for now. llama.cpp is measuring very well
compared to the baseline implementations. Quantization has a small negative impact to quality, but, as you can see, running
13B at q4_0 beats the 7B f16 model by a significant amount.
All measurements are done against wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512 length context).
Note that the changing the context length will have a significant impact on perplexity (longer context = better perplexity).
```
Perplexity - model options
5.5985 - 13B, q4_0
5.9565 - 7B, f16
6.3001 - 7B, q4_1
6.5949 - 7B, q4_0
6.5995 - 7B, q4_0, --memory_f16
```
#### How to run
1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
2. Run `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
3. Output:
```
Calculating perplexity over 655 chunks
24.43 seconds per pass - ETA 4.45 hours
[1]4.5970,[2]5.1807,[3]6.0382,...
```
And after 4.45 hours, you will have the final perplexity.
### Android
You can easily run `llama.cpp` on Android device with [termux](https://play.google.com/store/apps/details?id=com.termux).
First, obtain the [Android NDK](https://developer.android.com/ndk) and then build with CMake:
```
$ mkdir build-android
$ cd build-android
$ export NDK=<your_ndk_directory>
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make
```
Install [termux](https://play.google.com/store/apps/details?id=com.termux) on your device and run `termux-setup-storage` to get access to your SD card.
Finally, copy the `llama` binary and the model files to your device storage. Here is a demo of an interactive session running on Pixel 5 phone:
https://user-images.githubusercontent.com/271616/225014776-1d567049-ad71-4ef2-b050-55b0b3b9274c.mp4
### Docker
#### Prerequisites
* Docker must be installed and running on your system.
* Create a folder to store big models & intermediate files (in ex. im using /llama/models)
#### Images
We have two Docker images available for this project:
1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file.
#### Usage
The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command which includes the full docker image.
```bash
docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-one "/models/" 7B
```
On complete, you are ready to play!
```bash
docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
```
or with light image:
```bash
docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
```
## Limitations
- Not sure if my tokenizer is correct. There are a few places where we might have a mistake:
- https://github.com/ggerganov/llama.cpp/blob/26c084662903ddaca19bef982831bfb0856e8257/convert-pth-to-ggml.py#L79-L87
- https://github.com/ggerganov/llama.cpp/blob/26c084662903ddaca19bef982831bfb0856e8257/utils.h#L65-L69
In general, it seems to work, but I think it fails for unicode character support. Hopefully, someone can help with that
- I don't know yet how much the quantization affects the quality of the generated text
- Probably the token sampling can be improved
- The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder,
there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simlpy don't
there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't
know how to utilize it properly. But in any case, you can even disable it with `LLAMA_NO_ACCELERATE=1 make` and the
performance will be the same, since no BLAS calls are invoked by the current implementation
### Contributing
- Contributors can open PRs
- Collaborators can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
- Collaborators will be invited based on contributions
- Any help with managing issues and PRs is very appreciated!
- Make sure to read this: [Inference at the edge](https://github.com/ggerganov/llama.cpp/discussions/205)
### Coding guidelines
- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures
- Avoid fancy looking modern STL constructs, use basic `for` loops, avoid templates, keep it simple
- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
- Clean-up any trailing whitespaces, use 4 spaces indentation, brackets on same line, `void * ptr`, `int & a`
- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions

53
SHA256SUMS Normal file
View File

@@ -0,0 +1,53 @@
700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d models/7B/consolidated.00.pth
abe4aec2cdc297e2916011f66c7efd6fb4424e0e84315503005b5c118358cc22 models/7B/ggml-model-f16.bin
f495fa02a0b5ef265e1864d9680eede7fd23a60b0a2f93edba8091e2a4ca68b9 models/7B/ggml-model-q4_0.bin
7e89e242ddc0dd6f060b43ca219ce8b3e8f08959a72cb3c0855df8bb04d46265 models/7B/params.json
745bf4e29a4dd6f411e72976d92b452da1b49168a4f41c951cfcc8051823cf08 models/13B/consolidated.00.pth
d5ccbcc465c71c0de439a5aeffebe8344c68a519bce70bc7f9f92654ee567085 models/13B/consolidated.01.pth
a6bd0537c6873f36c47292df0b6f794e1135f5aafb89c3343bcc9e93264bf167 models/13B/ggml-model-f16.bin
0fb0951b90f2ec46c1f2f2372af5dacb4614b27e9fb6c10c69fbec58d7dd0e36 models/13B/ggml-model-f16.bin.1
1c218ba37ae61e15e35efd9949c78d6edf553b6280824c263cad56ae0b9d5a8f models/13B/ggml-model-q4_0.bin
c37a20c2ab9fa74b006b389085660269ee06110d1e45a494eb57d4602c9bcdb2 models/13B/ggml-model-q4_0.bin.1
4ab77bec4d4405ccb66a97b282574c89a94417e3c32e5f68f37e2876fc21322f models/13B/params.json
e23294a58552d8cdec5b7e8abb87993b97ea6eced4178ff2697c02472539d067 models/30B/consolidated.00.pth
4e077b7136c7ae2302e954860cf64930458d3076fcde9443f4d0e939e95903ff models/30B/consolidated.01.pth
24a87f01028cbd3a12de551dcedb712346c0b5cbdeff1454e0ddf2df9b675378 models/30B/consolidated.02.pth
1adfcef71420886119544949767f6a56cb6339b4d5fcde755d80fe68b49de93b models/30B/consolidated.03.pth
def20ea508f4e36793719f857471e85b85f96e497a2cbffbbaa1b60e2b18202c models/30B/ggml-model-f16.bin
b37040aa67fa8608cb2d8e0719132cf3e267fd35ec1e2f0d37dbc9fa43d674f1 models/30B/ggml-model-f16.bin.1
e7f263557e99069fe29003262ea5fa9ed885dbe79069083e6eb569b328cf30d3 models/30B/ggml-model-f16.bin.2
2ad6a23af05eb720f202f63d130f4fc5de9b6d2efc95b921be003209a56695aa models/30B/ggml-model-f16.bin.3
7de31d005e6d02ebd9603b2cf5329ad2f832b65d08873a098c5cafc4046cb9ed models/30B/ggml-model-q4_0.bin
f91feef9f30f9a023616db2e91297ca6d5d5d7b9eb351e452a82115c46f7da9e models/30B/ggml-model-q4_0.bin.1
66f3a0916ac7a81839153eb061fa861030ed1892477c2f7af2ce4f98d2f6d06f models/30B/ggml-model-q4_0.bin.2
e3c587ba97f83d2088b001bcda3026571065649ee3090bef6743a51390b01d3b models/30B/ggml-model-q4_0.bin.3
2c07118ea98d69dbe7810d88520e30288fa994751b337f8fca02b171955f44cb models/30B/params.json
135c563f6b3938114458183afb01adc9a63bef3d8ff7cccc3977e5d3664ecafe models/65B/consolidated.00.pth
9a600b37b19d38c7e43809485f70d17d1dc12206c07efa83bc72bb498a568bde models/65B/consolidated.01.pth
e7babf7c5606f165a3756f527cb0fedc4f83e67ef1290391e52fb1cce5f26770 models/65B/consolidated.02.pth
73176ffb426b40482f2aa67ae1217ef79fbbd1fff5482bae5060cdc5a24ab70e models/65B/consolidated.03.pth
882e6431d0b08a8bc66261a0d3607da21cbaeafa96a24e7e59777632dbdac225 models/65B/consolidated.04.pth
a287c0dfe49081626567c7fe87f74cce5831f58e459b427b5e05567641f47b78 models/65B/consolidated.05.pth
72b4eba67a1a3b18cb67a85b70f8f1640caae9b40033ea943fb166bd80a7b36b models/65B/consolidated.06.pth
d27f5b0677d7ff129ceacd73fd461c4d06910ad7787cf217b249948c3f3bc638 models/65B/consolidated.07.pth
7eba2625260cd91f8de901fd9704a1aa39448425514a335a0d3878de4ab9dc77 models/65B/ggml-model-f16.bin
f6aa886575df0785d4231f30cc776d499ccde18857818effc0378c65b178e0b5 models/65B/ggml-model-f16.bin.1
076037141682f5d7537955058c4740ab27f285aa4588915f830874a589c0693d models/65B/ggml-model-f16.bin.2
7853d96d2903ad7de2b2a89c4acf5a33a2f8e3c24ac39c9df6b44cdb42bf530a models/65B/ggml-model-f16.bin.3
b16b7b941abb3bc03a14df1656140855e9360a5371c83e919b9da83a72362314 models/65B/ggml-model-f16.bin.4
5291270216f888697695acb78ef28df0c080f9e85d3245c92fb9992d1fde6678 models/65B/ggml-model-f16.bin.5
0685ee77715f34686841006f8f94d3e7eaf148b97cecc9d3eee72808b0f7989c models/65B/ggml-model-f16.bin.6
00d993d73bb21d7c29388ffe0dced008cbaa0d391831dea77d7eb8f0b5c404b9 models/65B/ggml-model-f16.bin.7
4e398f05842206e08cdc5e7bb4f6c7c34b9dc373435ece6f261b14b7b4fe9b89 models/65B/ggml-model-q4_0.bin
4c4e899e3b12d9f57c9dcea5a1fb41bbc72023323535551f6273582ca7d7294b models/65B/ggml-model-q4_0.bin.1
d7b4594bbbd192043b3db0e5acc2561c42e6944e1cb91cc6e61510eee89dbcd8 models/65B/ggml-model-q4_0.bin.2
9a099d271648863d923d0d097391ea0bc75591f27a2ca3a327760f42e6b69af2 models/65B/ggml-model-q4_0.bin.3
5ee474051e418c5732b7949190b084d9d679db447f83c1de0d2a82daaa1a0cfa models/65B/ggml-model-q4_0.bin.4
a45aa05e7212bd6782790722d68056c5419667ea6b564ccc94bbcb8111d79b8b models/65B/ggml-model-q4_0.bin.5
a58fda714b759c28ad5e4c1d8bf8fda7b158fd5e4c4a49f851f36342fa97a105 models/65B/ggml-model-q4_0.bin.6
a3540cfcbcda33c223c6b0d606034adbd78f17e0e5de1582b78795e78754f7a8 models/65B/ggml-model-q4_0.bin.7
999ed1659b469ccc2a941714c0a9656fa571d17c9f7c8c7589817ca90edef51b models/65B/params.json
1f582babc2bd56bb63b33141898748657d369fd110c4358b2bc280907882bf13 models/alpaca-7B/ggml-model-q4_0.bin
e17730c6b62b565b098af023ca446dcb9e3535d4222ead6369c7aae67207eb3d models/alpaca-13B/ggml-model-q4_0.bin
9bcd1bb30e679c939f367be11b030fe20b3eb9a3606b9bc4106420f1827b6ae4 models/alpaca-30B/ggml-model-q4_0.bin
36079249f53c292a4c2302d7784005dcae94c865f0bedfdbfa51d9ddad402935 models/alpaca-30B/params.json

6
alpaca.sh Executable file
View File

@@ -0,0 +1,6 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
./main -m ./models/ggml-alpaca-7b-q4.bin --color -f ./prompts/alpaca.txt -ins --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7

6
chat.sh Executable file
View File

@@ -0,0 +1,6 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
./main -m ./models/7B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

45
chat_tcp_client.sh Executable file
View File

@@ -0,0 +1,45 @@
#!/usr/bin/env bash
PORT=${PORT:-8080}
PROMPT="${PROMPT:-"Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
User:Hello, Bob.
Bob:Hello. How may I help you today?
User:Please tell me the largest city in Europe.
Bob:Sure. The largest city in Europe is Moscow, the capital of Russia.
User:"}"
RPROMPT="${RPROMPT:-"User:"}"
N_PREDICT="${N_PREDICT:-"4096"}"
REPEAT_PENALTY="${REPEAT_PENALTY:-"1.0"}"
N_THREADS="${N_THREADS:-"4"}"
# Open connection to the chat server
exec 3<>/dev/tcp/127.0.0.1/${PORT}
# Pass the arguments. The protocol is really simple:
# 1. Pass the number of arguments followed by a linefeed
# 2. Pass the arguments, with each being followed by "0"
(
echo -en "12\n"
echo -en "-t\x00"
echo -en "$N_THREADS\x00"
echo -en "-n\x00"
echo -en "$N_PREDICT\x00"
echo -en "--repeat_penalty\x00"
echo -en "$REPEAT_PENALTY\x00"
echo -en "--color\x00"
echo -en "-i\x00"
echo -en "-r\x00"
echo -en "$RPROMPT\x00"
echo -en "-p\x00"
echo -en "$PROMPT\x00"
) >&3
trap exit TERM
# When we have passed the arguments, start printing socket data to the screen.
# This is done in a background job because we also want to send data when
# running in interactive mode.
cat <&3 && echo "(disconnected, press \"enter\" twice to exit)" &
cat >&3
wait

6
chat_tcp_server.sh Executable file
View File

@@ -0,0 +1,6 @@
#!/usr/bin/env bash
PORT=${PORT:-8080}
MODEL=${MODEL:-models/7B/ggml-model-q4_0.bin}
./main -l ${PORT} -m $MODEL

172
convert-gptq-to-ggml.py Normal file
View File

@@ -0,0 +1,172 @@
# Convert a GPTQ quantized LLaMA model to a ggml compatible file
# Based on: https://github.com/qwopqwop200/GPTQ-for-LLaMa
#
import os
import re
import sys
import json
import struct
import numpy as np
import torch
from sentencepiece import SentencePieceProcessor
if len(sys.argv) != 4:
print("Usage: convert-gptq-to-ggml.py llamaXXb-4bit.pt tokenizer.model out.bin\n")
sys.exit(1)
fname_model = sys.argv[1]
fname_tokenizer = sys.argv[2]
dir_out = sys.argv[3]
model = torch.load(fname_model, map_location="cpu")
n_vocab, n_embd = model['model.embed_tokens.weight'].shape
n_layer = 1 + max(int(m.group(1)) for name in model
if (m := re.match(r'model\.layers\.([0-9]+)', name)))
# hardcoded:
n_mult = 256
n_head = {32: 32, 40: 40, 60: 52, 80: 64}[n_layer]
tokenizer = SentencePieceProcessor(fname_tokenizer)
assert tokenizer.vocab_size() == n_vocab
fname_out = sys.argv[3]
fout = open(fname_out, "wb")
fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
fout.write(struct.pack("i", n_vocab))
fout.write(struct.pack("i", n_embd))
fout.write(struct.pack("i", n_mult))
fout.write(struct.pack("i", n_head))
fout.write(struct.pack("i", n_layer))
fout.write(struct.pack("i", n_embd // n_head)) # rot (obsolete)
fout.write(struct.pack("i", 4))
# This loop unchanged from convert-pth-to-ggml.py:
for i in range(tokenizer.vocab_size()):
if tokenizer.is_unknown(i):
# "<unk>" token (translated as ??)
text = " \u2047 ".encode("utf-8")
fout.write(struct.pack("i", len(text)))
fout.write(text)
elif tokenizer.is_control(i):
# "<s>"/"</s>" tokens
fout.write(struct.pack("i", 0))
elif tokenizer.is_byte(i):
# "<U+XX>" tokens (which may be invalid UTF-8)
piece = tokenizer.id_to_piece(i)
if len(piece) != 6:
print("Invalid token: " + piece)
sys.exit(1)
byte_value = int(piece[3:-1], 16)
fout.write(struct.pack("i", 1))
fout.write(struct.pack("B", byte_value))
else:
# normal token. Uses U+2581 (LOWER ONE EIGHTH BLOCK) to represent spaces.
text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
fout.write(struct.pack("i", len(text)))
fout.write(text)
def write_header(shape, dst_name, ftype_cur):
sname = dst_name.encode('utf-8')
fout.write(struct.pack("iii", len(shape), len(sname), ftype_cur))
fout.write(struct.pack("i" * len(shape), *shape[::-1]))
fout.write(sname)
def convert_non_q4(src_name, dst_name):
v = model[src_name]
shape = v.shape
print("Processing non-Q4 variable: " + src_name + " with shape: ", shape, " and type: ", v.dtype)
if len(shape) == 1:
print(" Converting to float32")
v = v.to(torch.float32)
ftype_cur = {torch.float16: 1, torch.float32: 0}[v.dtype]
# header
write_header(shape, dst_name, ftype_cur)
# data
v.numpy().tofile(fout)
def convert_q4(src_name, dst_name, permute=False):
zeros = model[f"{src_name}.zeros"].numpy()
scales = model[f"{src_name}.scales"].numpy()
bias = model[f"{src_name}.bias"].numpy()
qweight = model[f"{src_name}.qweight"].numpy().T # transpose
# Q4_1 does not support bias; good thing the bias is always all zeros.
assert not np.any(bias)
# Each int32 item is actually 8 int4 items packed together, and it's transposed.
shape = (qweight.shape[0], qweight.shape[1] * 8)
print("Processing Q4 variable: " + src_name + " with shape: ", shape)
# The output format has the int4 weights in groups of 32 rather than 8.
# It looks like this:
# For each row:
# For each group of 32 columns:
# - addend (float32, 4 bytes)
# - scale (float32, 4 bytes)
# - weights (int4 * 32, 16 bytes)
# Note that in the input, the scales and addends are shared between all
# the columns in a row, so we end up wasting quite a bit of memory with
# repeated scales and addends.
addends = -zeros # flip sign
# Since the output format is mixed between integers and floats, we have
# to hackily view the floats as int32s just so numpy will let us
# concatenate them.
addends_view = addends.view(dtype=np.int32)
scales_view = scales.view(dtype=np.int32)
# Split into groups of 4 columns (i.e. 32 columns of quantized data):
grouped = qweight.reshape([qweight.shape[0], qweight.shape[1] // 4, 4])
# Repeat addends and scales:
addends_rep = np.atleast_3d(addends_view).repeat(grouped.shape[1], axis=1)
scales_rep = np.atleast_3d(scales_view).repeat(grouped.shape[1], axis=1)
blob = np.concatenate([scales_rep, addends_rep, grouped], axis=2, casting='no')
if permute:
# Permute some rows to undo the permutation done by convert_llama_weights_to_hf.py.
# This can be done after the above conversion because it doesn't affect column order/layout.
blob = (blob.reshape(n_head, 2, shape[0] // n_head // 2, *blob.shape[1:])
.swapaxes(1, 2)
.reshape(blob.shape))
# header
write_header(shape, dst_name, 3) # ftype = Q4_1
# data
blob.tofile(fout)
convert_non_q4("model.embed_tokens.weight", "tok_embeddings.weight")
convert_non_q4("model.norm.weight", "norm.weight")
convert_non_q4("lm_head.weight", "output.weight")
for i in range(n_layer):
convert_q4(f"model.layers.{i}.self_attn.q_proj", f"layers.{i}.attention.wq.weight", permute=True)
convert_q4(f"model.layers.{i}.self_attn.k_proj", f"layers.{i}.attention.wk.weight", permute=True)
convert_q4(f"model.layers.{i}.self_attn.v_proj", f"layers.{i}.attention.wv.weight")
convert_q4(f"model.layers.{i}.self_attn.o_proj", f"layers.{i}.attention.wo.weight")
convert_q4(f"model.layers.{i}.mlp.gate_proj", f"layers.{i}.feed_forward.w1.weight")
convert_q4(f"model.layers.{i}.mlp.down_proj", f"layers.{i}.feed_forward.w2.weight")
convert_q4(f"model.layers.{i}.mlp.up_proj", f"layers.{i}.feed_forward.w3.weight")
convert_non_q4(f"model.layers.{i}.input_layernorm.weight", f"layers.{i}.attention_norm.weight")
convert_non_q4(f"model.layers.{i}.post_attention_layernorm.weight", f"layers.{i}.ffn_norm.weight")
fout.close()
print("Done. Output file: " + fname_out)
print("")

View File

@@ -10,13 +10,12 @@
# - Name (char[name_length])
# - Data (float[n_dims])
#
# By default, the bigger matrices are converted to 16-bit floats.
# This can be disabled by adding the "use-f32" CLI argument.
#
# At the start of the ggml file we write the model parameters
# and vocabulary.
#
import argparse
import os
import sys
import json
import struct
@@ -25,116 +24,90 @@ import torch
from sentencepiece import SentencePieceProcessor
if len(sys.argv) < 3:
print("Usage: convert-ckpt-to-ggml.py dir-model ftype\n")
print(" ftype == 0 -> float32")
print(" ftype == 1 -> float16")
sys.exit(1)
def parse_args():
# output in the same directory as the model
dir_model = sys.argv[1]
fname_hparams = sys.argv[1] + "/params.json"
fname_tokenizer = sys.argv[1] + "/../tokenizer.model"
parser = argparse.ArgumentParser(description='Convert a LLaMA model checkpoint to a ggml compatible file')
parser.add_argument('dir_model', help='directory containing the model checkpoint')
parser.add_argument('ftype', help='file type (0: float32, 1: float16)', type=int, choices=[0, 1], default=1)
parser.add_argument('vocab_only', help='only write vocab to file', type=int, default=0, nargs='?')
return parser.parse_args()
def get_n_parts(dim):
if dim == 4096:
return 1
elif dim == 5120:
return 2
elif dim == 6656:
return 4
elif dim == 8192:
return 8
else:
print("Invalid dim: " + str(dim))
mappings = {4096: 1, 5120: 2, 6656: 4, 8192: 8}
n_parts = mappings.get(dim)
if n_parts is None:
print(f"Invalid dim: {dim}")
sys.exit(1)
# possible data types
# ftype == 0 -> float32
# ftype == 1 -> float16
#
# map from ftype to string
ftype_str = ["f32", "f16"]
print(f"n_parts = {n_parts}\n")
return n_parts
ftype = 1
if len(sys.argv) > 2:
ftype = int(sys.argv[2])
if ftype < 0 or ftype > 1:
print("Invalid ftype: " + str(ftype))
sys.exit(1)
fname_out = sys.argv[1] + "/ggml-model-" + ftype_str[ftype] + ".bin"
def load_hparams_and_tokenizer(dir_model):
with open(fname_hparams, "r") as f:
hparams = json.load(f)
# `dir_model` is something like `models/7B` or `models/7B/`.
# "tokenizer.model" is expected under model's parent dir.
# When `dir_model` is a symlink, f"{dir_model}/../tokenizer.model" would not be found.
# Let's use the model's parent dir directly.
model_parent_dir = os.path.dirname(os.path.normpath(dir_model))
tokenizer = SentencePieceProcessor(fname_tokenizer)
fname_hparams = f"{dir_model}/params.json"
fname_tokenizer = f"{model_parent_dir}/tokenizer.model"
hparams.update({"vocab_size": tokenizer.vocab_size()})
with open(fname_hparams, "r") as f:
hparams = json.load(f)
print(hparams)
n_parts = get_n_parts(hparams["dim"])
tokenizer = SentencePieceProcessor(fname_tokenizer)
hparams.update({"vocab_size": tokenizer.vocab_size()})
print(hparams)
print('n_parts = ', n_parts)
return hparams, tokenizer
for p in range(n_parts):
print('Processing part ', p)
def write_header(fout, hparams, ftype):
#fname_model = sys.argv[1] + "/consolidated.00.pth"
fname_model = sys.argv[1] + "/consolidated.0" + str(p) + ".pth"
fname_out = sys.argv[1] + "/ggml-model-" + ftype_str[ftype] + ".bin"
if (p > 0):
fname_out = sys.argv[1] + "/ggml-model-" + ftype_str[ftype] + ".bin" + "." + str(p)
keys = ["vocab_size", "dim", "multiple_of", "n_heads", "n_layers"]
values = [
0x67676d66, # magic: ggmf in hex
1, # file version
*[hparams[key] for key in keys],
hparams["dim"] // hparams["n_heads"], # rot (obsolete)
ftype
]
fout.write(struct.pack("i" * len(values), *values))
model = torch.load(fname_model, map_location="cpu")
def write_tokens(fout, tokenizer):
fout = open(fname_out, "wb")
fout.write(struct.pack("i", 0x67676d6c)) # magic: ggml in hex
fout.write(struct.pack("i", hparams["vocab_size"]))
fout.write(struct.pack("i", hparams["dim"]))
fout.write(struct.pack("i", hparams["multiple_of"]))
fout.write(struct.pack("i", hparams["n_heads"]))
fout.write(struct.pack("i", hparams["n_layers"]))
fout.write(struct.pack("i", hparams["dim"] // hparams["n_heads"])) # rot (obsolete)
fout.write(struct.pack("i", ftype))
# Is this correct??
for i in range(32000):
# TODO: this is probably wrong - not sure how this tokenizer works
text = tokenizer.decode([29889, i]).encode('utf-8')
# remove the first byte (it's always '.')
text = text[1:]
for i in range(tokenizer.vocab_size()):
if tokenizer.is_unknown(i):
text = " \u2047 ".encode("utf-8")
elif tokenizer.is_control(i):
text = b""
elif tokenizer.is_byte(i):
piece = tokenizer.id_to_piece(i)
if len(piece) != 6:
print(f"Invalid token: {piece}")
sys.exit(1)
byte_value = int(piece[3:-1], 16)
text = struct.pack("B", byte_value)
else:
text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
fout.write(struct.pack("i", len(text)))
fout.write(text)
fout.write(struct.pack("f", tokenizer.get_score(i)))
for k, v in model.items():
name = k
shape = v.shape
def process_and_write_variables(fout, model, ftype):
# skip layers.X.attention.inner_attention.rope.freqs
if name[-5:] == "freqs":
for name, datao in model.items():
if name.endswith("freqs"):
continue
print("Processing variable: " + name + " with shape: ", shape, " and type: ", v.dtype)
shape = datao.shape
#data = tf.train.load_variable(dir_model, name).squeeze()
data = v.numpy().squeeze()
n_dims = len(data.shape);
print(f"Processing variable: {name} with shape: {shape} and type: {datao.dtype}")
# for efficiency - transpose some matrices
# "model/h.*/attn/c_attn/w"
# "model/h.*/attn/c_proj/w"
# "model/h.*/mlp/c_fc/w"
# "model/h.*/mlp/c_proj/w"
#if name[-14:] == "/attn/c_attn/w" or \
# name[-14:] == "/attn/c_proj/w" or \
# name[-11:] == "/mlp/c_fc/w" or \
# name[-13:] == "/mlp/c_proj/w":
# print(" Transposing")
# data = data.transpose()
dshape = data.shape
data = datao.numpy().squeeze()
n_dims = len(shape)
# default type is fp16
ftype_cur = 1
@@ -145,18 +118,64 @@ for p in range(n_parts):
# header
sname = name.encode('utf-8')
fout.write(struct.pack("iii", n_dims, len(sname), ftype_cur))
for i in range(n_dims):
fout.write(struct.pack("i", dshape[n_dims - 1 - i]))
fout.write(sname);
fout.write(struct.pack("iii", len(data.shape), len(sname), ftype_cur))
for dim in reversed(data.shape):
fout.write(struct.pack("i", dim))
fout.write(sname)
# data
# data output to file
data.tofile(fout)
# I hope this deallocates the memory ..
model = None
def main():
fout.close()
args = parse_args()
dir_model = args.dir_model
ftype = args.ftype
ftype_str = ["f32", "f16"]
print("Done. Output file: " + fname_out + ", (part ", p, ")")
print("")
hparams, tokenizer = load_hparams_and_tokenizer(dir_model)
print(args)
# if only writing vocab to file
if args.vocab_only:
fname_model = f"{dir_model}/consolidated.00.pth"
fname_out = f"{dir_model}/ggml-vocab.bin"
print(f"Extracting only the vocab from '{fname_model}'\n")
model = torch.load(fname_model, map_location="cpu")
with open(fname_out, "wb") as fout:
write_header(fout, hparams, ftype)
write_tokens(fout, tokenizer)
del model
print(f"Done. Output file: {fname_out}\n")
return
n_parts = get_n_parts(hparams["dim"])
for p in range(n_parts):
print(f"Processing part {p}\n")
fname_model = f"{dir_model}/consolidated.0{p}.pth"
fname_out = f"{dir_model}/ggml-model-{ftype_str[ftype]}.bin{'' if p == 0 else '.' + str(p)}"
model = torch.load(fname_model, map_location="cpu")
with open(fname_out, "wb") as fout:
write_header(fout, hparams, ftype)
write_tokens(fout, tokenizer)
process_and_write_variables(fout, model, ftype)
del model
print(f"Done. Output file: {fname_out}, (part {p})\n")
if __name__ == "__main__":
main()

66
download-pth.py Normal file
View File

@@ -0,0 +1,66 @@
import os
import sys
from tqdm import tqdm
import requests
if len(sys.argv) < 3:
print("Usage: download-pth.py dir-model model-type\n")
print(" model-type: Available models 7B, 13B, 30B or 65B")
sys.exit(1)
modelsDir = sys.argv[1]
model = sys.argv[2]
num = {
"7B": 1,
"13B": 2,
"30B": 4,
"65B": 8,
}
if model not in num:
print(f"Error: model {model} is not valid, provide 7B, 13B, 30B or 65B")
sys.exit(1)
print(f"Downloading model {model}")
files = ["checklist.chk", "params.json"]
for i in range(num[model]):
files.append(f"consolidated.0{i}.pth")
resolved_path = os.path.abspath(os.path.join(modelsDir, model))
os.makedirs(resolved_path, exist_ok=True)
for file in files:
dest_path = os.path.join(resolved_path, file)
if os.path.exists(dest_path):
print(f"Skip file download, it already exists: {file}")
continue
url = f"https://agi.gpt4.org/llama/LLaMA/{model}/{file}"
response = requests.get(url, stream=True)
with open(dest_path, 'wb') as f:
with tqdm(unit='B', unit_scale=True, miniters=1, desc=file) as t:
for chunk in response.iter_content(chunk_size=1024):
if chunk:
f.write(chunk)
t.update(len(chunk))
files2 = ["tokenizer_checklist.chk", "tokenizer.model"]
for file in files2:
dest_path = os.path.join(modelsDir, file)
if os.path.exists(dest_path):
print(f"Skip file download, it already exists: {file}")
continue
url = f"https://agi.gpt4.org/llama/LLaMA/{file}"
response = requests.get(url, stream=True)
with open(dest_path, 'wb') as f:
with tqdm(unit='B', unit_scale=True, miniters=1, desc=file) as t:
for chunk in response.iter_content(chunk_size=1024):
if chunk:
f.write(chunk)
t.update(len(chunk))

53
examples/chatLLaMa Executable file
View File

@@ -0,0 +1,53 @@
#!/bin/bash
cd "$(dirname "$0")/.." || exit
MODEL="${MODEL:-./models/13B/ggml-model-q4_0.bin}"
USER_NAME="${USER_NAME:-User}"
AI_NAME="${AI_NAME:-ChatLLaMa}"
# Adjust to the number of CPU cores you want to use.
N_THREAD="${N_THREAD:-8}"
# Number of tokens to predict (made it larger than default because we want a long interaction)
N_PREDICTS="${N_PREDICTS:-2048}"
# Note: you can also override the generation options by specifying them on the command line:
# For example, override the context size by doing: ./chatLLaMa --ctx_size 1024
GEN_OPTIONS="${GEN_OPTIONS:---ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647}"
# shellcheck disable=SC2086 # Intended splitting of GEN_OPTIONS
./main $GEN_OPTIONS \
--model "$MODEL" \
--threads "$N_THREAD" \
--n_predict "$N_PREDICTS" \
--color --interactive \
--reverse-prompt "${USER_NAME}:" \
--prompt "
Text transcript of a never ending dialog, where ${USER_NAME} interacts with an AI assistant named ${AI_NAME}.
${AI_NAME} is helpful, kind, honest, friendly, good at writing and never fails to answer ${USER_NAME}s requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what ${USER_NAME} and ${AI_NAME} say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.
$USER_NAME: Hello, $AI_NAME!
$AI_NAME: Hello $USER_NAME! How may I help you today?
$USER_NAME: What time is it?
$AI_NAME: It is $(date +%H:%M).
$USER_NAME: What year is it?
$AI_NAME: We are in $(date +%Y).
$USER_NAME: Please tell me the largest city in Europe.
$AI_NAME: The largest city in Europe is Moscow, the capital of Russia.
$USER_NAME: What can you tell me about Moscow?
$AI_NAME: Moscow, on the Moskva River in western Russia, is the nations cosmopolitan capital. In its historic core is the Kremlin, a complex thats home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russias symbolic center.
$USER_NAME: What is a cat?
$AI_NAME: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
$USER_NAME: How do I pass command line arguments to a Node.js program?
$AI_NAME: The arguments are stored in process.argv.
argv[0] is the path to the Node. js executable.
argv[1] is the path to the script file.
argv[2] is the first argument passed to the script.
argv[3] is the second argument passed to the script and so on.
$USER_NAME: Name a color.
$AI_NAME: Blue
$USER_NAME:" "$@"

43
flake.lock generated Normal file
View File

@@ -0,0 +1,43 @@
{
"nodes": {
"flake-utils": {
"locked": {
"lastModified": 1676283394,
"narHash": "sha256-XX2f9c3iySLCw54rJ/CZs+ZK6IQy7GXNY4nSOyu2QG4=",
"owner": "numtide",
"repo": "flake-utils",
"rev": "3db36a8b464d0c4532ba1c7dda728f4576d6d073",
"type": "github"
},
"original": {
"owner": "numtide",
"repo": "flake-utils",
"type": "github"
}
},
"nixpkgs": {
"locked": {
"lastModified": 1678470307,
"narHash": "sha256-OEeMUr3ueLIXyW/OaFUX5jUdimyQwMg/7e+/Q0gC/QE=",
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "0c4800d579af4ed98ecc47d464a5e7b0870c4b1f",
"type": "github"
},
"original": {
"owner": "NixOS",
"ref": "nixos-unstable",
"repo": "nixpkgs",
"type": "github"
}
},
"root": {
"inputs": {
"flake-utils": "flake-utils",
"nixpkgs": "nixpkgs"
}
}
},
"root": "root",
"version": 7
}

49
flake.nix Normal file
View File

@@ -0,0 +1,49 @@
{
inputs = {
nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
flake-utils.url = "github:numtide/flake-utils";
};
outputs = { self, nixpkgs, flake-utils }:
flake-utils.lib.eachDefaultSystem (system:
let
pkgs = import nixpkgs {
inherit system;
};
llama-python = pkgs.python310.withPackages (ps: with ps; [
torch
numpy
sentencepiece
]);
in
{
packages.default = pkgs.stdenv.mkDerivation {
name = "llama.cpp";
src = ./.;
nativeBuildInputs = with pkgs; [ cmake ];
buildInputs = with pkgs; lib.optionals stdenv.isDarwin [
darwin.apple_sdk.frameworks.Accelerate
];
cmakeFlags = with pkgs; lib.optionals (system == "aarch64-darwin") [
"-DCMAKE_C_FLAGS=-D__ARM_FEATURE_DOTPROD=1"
];
installPhase = ''
mkdir -p $out/bin
mv llama $out/bin/llama
mv quantize $out/bin/quantize
echo "#!${llama-python}/bin/python" > $out/bin/convert-pth-to-ggml
cat ${./convert-pth-to-ggml.py} >> $out/bin/convert-pth-to-ggml
chmod +x $out/bin/convert-pth-to-ggml
'';
meta.mainProgram = "llama";
};
devShells.default = pkgs.mkShell {
packages = with pkgs; [
cmake
llama-python
] ++ lib.optionals stdenv.isDarwin [
darwin.apple_sdk.frameworks.Accelerate
];
};
}
);
}

528
ggml.c
View File

@@ -1,8 +1,11 @@
// Defines CLOCK_MONOTONIC on Linux
#define _POSIX_C_SOURCE 199309L
#include "ggml.h"
#if defined(_MSC_VER) || defined(__MINGW32__)
#include <malloc.h> // using malloc.h with MSC/MINGW
#elif !defined(__FreeBSD__)
#elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
#include <alloca.h>
#endif
@@ -361,10 +364,10 @@ static const size_t CACHE_LINE_SIZE_F32 = CACHE_LINE_SIZE/sizeof(float);
// AVX routines provided by GH user Const-me
// ref: https://github.com/ggerganov/ggml/pull/27#issuecomment-1464934600
#if __AVX2__
#if __AVX2__ || __AVX512F__
// Unpack 32 4-bit fields into 32 bytes
// The output vector contains 32 bytes, each one in [ 0 .. 15 ] interval
inline __m256i bytesFromNibbles( const uint8_t* rsi )
static inline __m256i bytesFromNibbles( const uint8_t* rsi )
{
// Load 16 bytes from memory
__m128i tmp = _mm_loadu_si128( ( const __m128i* )rsi );
@@ -381,7 +384,7 @@ inline __m256i bytesFromNibbles( const uint8_t* rsi )
return bytes;
}
inline __m128i packNibbles( __m256i bytes )
static inline __m128i packNibbles( __m256i bytes )
{
// Move bits within 16-bit lanes from 0000_abcd_0000_efgh into 0000_0000_abcd_efgh
const __m256i lowByte = _mm256_set1_epi16( 0xFF );
@@ -397,13 +400,58 @@ inline __m128i packNibbles( __m256i bytes )
}
#endif
// method 5
// blocks of QK elements
// represented with a single float (delta) and QK/2 8-bit ints (i.e QK 4-bit signed integer factors)
// reference implementation for deterministic creation of model files
static void quantize_row_q4_0_reference(const float * restrict x, void * restrict y, int k) {
assert(k % QK == 0);
const int nb = k / QK;
const size_t bs = sizeof(float) + QK/2;
uint8_t * restrict pd = ((uint8_t *)y + 0*bs);
uint8_t * restrict pb = ((uint8_t *)y + 0*bs + sizeof(float));
uint8_t pp[QK/2];
for (int i = 0; i < nb; i++) {
float amax = 0.0f; // absolute max
for (int l = 0; l < QK; l++) {
const float v = x[i*QK + l];
amax = MAX(amax, fabsf(v));
}
const float d = amax / ((1 << 3) - 1);
const float id = d ? 1.0f/d : 0.0f;
*(float *)pd = d;
pd += bs;
for (int l = 0; l < QK; l += 2) {
const float v0 = x[i*QK + l + 0]*id;
const float v1 = x[i*QK + l + 1]*id;
const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
assert(vi0 >= 0 && vi0 < 16);
assert(vi1 >= 0 && vi1 < 16);
pp[l/2] = vi0 | (vi1 << 4);
}
memcpy(pb, pp, sizeof(pp));
pb += bs;
}
}
void quantize_row_q4_0(const float * restrict x, void * restrict y, int k) {
assert(k % QK == 0);
#if __ARM_NEON || defined(__AVX2__) || defined(__wasm_simd128__)
const int nb = k / QK;
const size_t bs = sizeof(float) + QK/2;
@@ -411,6 +459,7 @@ void quantize_row_q4_0(const float * restrict x, void * restrict y, int k) {
uint8_t * restrict pb = ((uint8_t *)y + 0*bs + sizeof(float));
uint8_t pp[QK/2];
#endif
#if __ARM_NEON
#if QK == 32
@@ -567,36 +616,7 @@ void quantize_row_q4_0(const float * restrict x, void * restrict y, int k) {
#endif
#else
// scalar
for (int i = 0; i < nb; i++) {
float amax = 0.0f; // absolute max
for (int l = 0; l < QK; l++) {
const float v = x[i*QK + l];
amax = MAX(amax, fabsf(v));
}
const float d = amax / ((1 << 3) - 1);
const float id = d ? 1.0f/d : 0.0f;
*(float *)pd = d;
pd += bs;
for (int l = 0; l < QK; l += 2) {
const float v0 = x[i*QK + l + 0]*id;
const float v1 = x[i*QK + l + 1]*id;
const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
assert(vi0 >= 0 && vi0 < 16);
assert(vi1 >= 0 && vi1 < 16);
pp[l/2] = vi0 | (vi1 << 4);
}
memcpy(pb, pp, sizeof(pp));
pb += bs;
}
quantize_row_q4_0_reference(x, y, k);
#endif
}
@@ -607,10 +627,11 @@ void quantize_row_q4_1(const float * restrict x, void * restrict y, int k) {
assert(k % QK == 0);
const int nb = k / QK;
const size_t bs = 2*sizeof(float) + QK/2;
float * restrict pm = (float *) (y);
float * restrict pd = (float *) (pm + nb);
uint8_t * restrict pb = (uint8_t *) (pd + nb);
uint8_t * restrict pd = ((uint8_t *)y + 0*bs);
uint8_t * restrict pm = ((uint8_t *)y + 0*bs + sizeof(float));
uint8_t * restrict pb = ((uint8_t *)y + 0*bs + 2*sizeof(float));
uint8_t pp[QK/2];
@@ -627,8 +648,10 @@ void quantize_row_q4_1(const float * restrict x, void * restrict y, int k) {
const float d = (max - min) / ((1 << 4) - 1);
const float id = d ? 1.0f/d : 0.0f;
pm[i] = min;
pd[i] = d;
*(float *)pm = min;
*(float *)pd = d;
pm += bs;
pd += bs;
for (int l = 0; l < QK; l += 2) {
const float v0 = (x[i*QK + l + 0] - min)*id;
@@ -643,7 +666,8 @@ void quantize_row_q4_1(const float * restrict x, void * restrict y, int k) {
pp[l/2] = vi0 | (vi1 << 4);
}
memcpy(pb + i*QK/2, pp, sizeof(pp));
memcpy(pb, pp, sizeof(pp));
pb += bs;
}
}
@@ -687,16 +711,17 @@ void dequantize_row_q4_1(const void * restrict x, float * restrict y, int k) {
assert(k % QK == 0);
const int nb = k / QK;
const size_t bs = 2*sizeof(float) + QK/2;
const float * restrict pm = (const float *) (x);
const float * restrict pd = (const float *) (pm + nb);
const uint8_t * restrict pb = (const uint8_t *) (pd + nb);
const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs);
const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs + sizeof(float));
const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + 2*sizeof(float));
for (int i = 0; i < nb; i++) {
const float m = pm[i];
const float d = pd[i];
const float d = *(const float *) (pd + i*bs);
const float m = *(const float *) (pm + i*bs);
const uint8_t * restrict pp = pb + i*QK/2;
const uint8_t * restrict pp = pb + i*bs;
for (int l = 0; l < QK; l += 2) {
const uint8_t vi = pp[l/2];
@@ -1257,6 +1282,47 @@ inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float
*s = sumf;
}
#if __AVX512F__ && QK == 32
static inline __m512 dot_q4_0_oneblock_avx512(
__m512 acc,
const uint8_t * pd0,
const uint8_t * pd1,
const uint8_t * pb0,
const uint8_t * pb1,
size_t bs,
int i
) {
const float * d0_0 = (const float *) (pd0 + i*bs);
const float * d1_0 = (const float *) (pd1 + i*bs);
const uint8_t * restrict p0 = pb0 + (i+0)*bs;
const uint8_t * restrict p1 = pb1 + (i+0)*bs;
// Compute combined scale for the block
float scaleScalar = d0_0[0] * d1_0[0];
__m512 scale = _mm512_set1_ps( scaleScalar );
__m256i bx = bytesFromNibbles( p0 );
__m256i by = bytesFromNibbles( p1 );
// Now we have a vector with bytes in [ 0 .. 15 ] interval. Offset them into [ -8 .. +7 ] interval.
const __m256i off = _mm256_set1_epi8( 8 );
bx = _mm256_sub_epi8( bx, off );
by = _mm256_sub_epi8( by, off );
// Sign-extend 16 signed bytes into int16_t
__m512i x32 = _mm512_cvtepi8_epi16( bx );
__m512i y32 = _mm512_cvtepi8_epi16( by );
// Compute products of int16_t integers, add pairwise
__m512i i64 = _mm512_madd_epi16( x32, y32 );
// Convert int32_t to float
__m512 p = _mm512_cvtepi32_ps( i64 );
// Apply the scale, and accumulate
return _mm512_fmadd_ps( scale, p, acc );
}
#endif
inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
ggml_float sumf = 0.0;
@@ -1359,8 +1425,24 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
const int8x16_t v0_1hs = vsubq_s8(v0_1h, s8b);
const int8x16_t v1_1hs = vsubq_s8(v1_1h, s8b);
#if defined(__ARM_FEATURE_DOTPROD)
// dot product into int16x8_t
const int16x8_t pl0l = vmull_s8(vget_low_s8 (v0_0ls), vget_low_s8 (v1_0ls));
int32x4_t p_0 = vdotq_s32(vdupq_n_s32(0), v0_0ls, v1_0ls);
int32x4_t p_1 = vdotq_s32(vdupq_n_s32(0), v0_1ls, v1_1ls);
p_0 = vdotq_s32(p_0, v0_0hs, v1_0hs);
p_1 = vdotq_s32(p_1, v0_1hs, v1_1hs);
// scalar
#if defined(__ARM_FEATURE_QRDMX)
sum0 += d0_0*d1_0*vaddvq_s32(p_0);
sum1 += d0_1*d1_1*vaddvq_s32(p_1);
#else
sum0 += d0_0*d1_0*(vgetq_lane_s32(p_0, 0) + vgetq_lane_s32(p_0, 1) + vgetq_lane_s32(p_0, 2) + vgetq_lane_s32(p_0, 3));
sum1 += d0_1*d1_1*(vgetq_lane_s32(p_1, 0) + vgetq_lane_s32(p_1, 1) + vgetq_lane_s32(p_1, 2) + vgetq_lane_s32(p_1, 3));
#endif
#else
const int16x8_t pl0l = vmull_s8(vget_low_s8 (v0_0ls), vget_low_s8 (v1_0ls));
const int16x8_t pl0h = vmull_s8(vget_high_s8(v0_0ls), vget_high_s8(v1_0ls));
const int16x8_t ph0l = vmull_s8(vget_low_s8 (v0_0hs), vget_low_s8 (v1_0hs));
@@ -1388,6 +1470,7 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
#else
sum0 += d0_0*d1_0*(vgetq_lane_s16(p_0, 0) + vgetq_lane_s16(p_0, 1) + vgetq_lane_s16(p_0, 2) + vgetq_lane_s16(p_0, 3) + vgetq_lane_s16(p_0, 4) + vgetq_lane_s16(p_0, 5) + vgetq_lane_s16(p_0, 6) + vgetq_lane_s16(p_0, 7));
sum1 += d0_1*d1_1*(vgetq_lane_s16(p_1, 0) + vgetq_lane_s16(p_1, 1) + vgetq_lane_s16(p_1, 2) + vgetq_lane_s16(p_1, 3) + vgetq_lane_s16(p_1, 4) + vgetq_lane_s16(p_1, 5) + vgetq_lane_s16(p_1, 6) + vgetq_lane_s16(p_1, 7));
#endif
#endif
}
@@ -1395,6 +1478,40 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
#else
#error "not implemented for QK"
#endif
#elif defined(__AVX512F__)
#if QK == 32
// Initialize accumulator with zeros
__m512 acc0 = _mm512_setzero_ps();
__m512 acc1 = _mm512_setzero_ps();
const int superblock_size = 8;
const int superblock_count = nb / superblock_size;
const int remainder = nb % superblock_size;
for (int superblock_ix = 0; superblock_ix < superblock_count; superblock_ix += 1) {
int i = superblock_ix * superblock_size;
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+0 );
acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+1 );
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+2 );
acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+3 );
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+4 );
acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+5 );
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i+6 );
acc1 = dot_q4_0_oneblock_avx512( acc1, pd0, pd1, pb0, pb1, bs, i+7 );
}
// Remainders
for (int i = superblock_count * superblock_size; i < nb; ++i) {
acc0 = dot_q4_0_oneblock_avx512( acc0, pd0, pd1, pb0, pb1, bs, i );
}
// Horizontal sum of all lanes of the accumulator
sumf = _mm512_reduce_add_ps( acc0 ) + _mm512_reduce_add_ps( acc1 );
#else
#error "not implemented for QK"
#endif
#elif defined(__AVX2__)
#if QK == 32
const size_t countBlocks = nb;
@@ -1567,28 +1684,109 @@ inline static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void
inline static void ggml_vec_dot_q4_1(const int n, float * restrict s, const void * restrict x, const void * restrict y) {
const int nb = n / QK;
const float * restrict pm0 = (const float *) x;
const float * restrict pm1 = (const float *) y;
const size_t bs = 2*sizeof(float) + QK/2;
const float * restrict pd0 = (const float *) (pm0 + nb);
const float * restrict pd1 = (const float *) (pm1 + nb);
const uint8_t * restrict pd0 = ((const uint8_t *)x + 0*bs);
const uint8_t * restrict pd1 = ((const uint8_t *)y + 0*bs);
const uint8_t * restrict pb0 = (const uint8_t *) (pd0 + nb);
const uint8_t * restrict pb1 = (const uint8_t *) (pd1 + nb);
const uint8_t * restrict pm0 = ((const uint8_t *)x + 0*bs + sizeof(float));
const uint8_t * restrict pm1 = ((const uint8_t *)y + 0*bs + sizeof(float));
const uint8_t * restrict pb0 = ((const uint8_t *)x + 0*bs + 2*sizeof(float));
const uint8_t * restrict pb1 = ((const uint8_t *)y + 0*bs + 2*sizeof(float));
float sumf = 0.0;
#if 1
#if defined(__AVX2__)
#if QK == 32
// Initialize accumulator with zeros
__m256 acc = _mm256_setzero_ps();
// Accumulator for constant offsets
float acc_offset = 0.0f;
// Main loop
for (int i = 0; i < nb; ++i) {
const float * m0 = (const float *) (pm0 + i*bs);
const float * m1 = (const float *) (pm1 + i*bs);
const float * d0 = (const float *) (pd0 + i*bs);
const float * d1 = (const float *) (pd1 + i*bs);
const uint8_t * restrict p0 = pb0 + i*bs;
const uint8_t * restrict p1 = pb1 + i*bs;
const __m256 d0v = _mm256_broadcast_ss( d0 );
const __m256 d1v = _mm256_broadcast_ss( d1 );
const __m256 m0v = _mm256_broadcast_ss( m0 );
const __m256 m1v = _mm256_broadcast_ss( m1 );
// Compute combined scale for the block
const __m256 scale_01 = _mm256_mul_ps( d0v, d1v );
// Compute cross scales for the block
const __m256 scale_0 = _mm256_mul_ps( d0v, m1v );
const __m256 scale_1 = _mm256_mul_ps( m0v, d1v );
const __m256 cross_scales = _mm256_blend_ps( scale_0, scale_1, 0b10101010 );
// Load 16 bytes, and unpack 4 bit fields into bytes, making 32 bytes
__m256i bx = bytesFromNibbles( p0 );
__m256i by = bytesFromNibbles( p1 );
// Now we have a vector with bytes in [ 0 .. 15 ] interval.
// Sign-extend first 16 signed bytes into int16_t
__m256i x16 = _mm256_cvtepi8_epi16( _mm256_castsi256_si128( bx ) );
__m256i y16 = _mm256_cvtepi8_epi16( _mm256_castsi256_si128( by ) );
// Compute products of int16_t integers, add pairwise
__m256i i32 = _mm256_madd_epi16( x16, y16 );
// Sign-extend last 16 signed bytes into int16_t vectors
__m256i x16_h = _mm256_cvtepi8_epi16( _mm256_extracti128_si256( bx, 1 ) );
__m256i y16_h = _mm256_cvtepi8_epi16( _mm256_extracti128_si256( by, 1 ) );
// Accumulate products of int16_t integers
i32 = _mm256_add_epi32( i32, _mm256_madd_epi16( x16_h, y16_h ) );
// compute sums of unsigned bytes in bx, by in blocks of 8.
// This results in a layout like X100 0000 X200 0000 X300 0000 X400 0000,
// which we then interleave as X100 Y100 X200 Y200 X300 Y300 X400 Y400.
// so if we then cast to 8 singles, we get 8 floats like [ x0_7, y0_7, x8_15, y8_15, x16_23, y16_23, x24_31, y24_31 ]
__m256i xsumi = _mm256_sad_epu8( bx, _mm256_setzero_si256() );
__m256i ysumi = _mm256_sad_epu8( by, _mm256_setzero_si256() );
__m256i sumsi = _mm256_or_si256( xsumi, _mm256_slli_si256( ysumi, 4 ) );
__m256 sums = _mm256_cvtepi32_ps( sumsi );
// Convert int32_t to float
__m256 p = _mm256_cvtepi32_ps( i32 );
// Apply the scale, and accumulate
// acc += d0*d1*x*y + d0*m1*x + d1*m0*y
acc = _mm256_fmadd_ps( scale_01, p, acc );
acc = _mm256_fmadd_ps( cross_scales, sums, acc );
// acc_offset += m0*m1 (for each entry in the block)
acc_offset += (*m0)*(*m1);
}
// Return horizontal sum of the acc vector
__m128 res = _mm256_extractf128_ps( acc, 1 );
res = _mm_add_ps( res, _mm256_castps256_ps128( acc ) );
res = _mm_add_ps( res, _mm_movehl_ps( res, res ) );
res = _mm_add_ss( res, _mm_movehdup_ps( res ) );
sumf = _mm_cvtss_f32( res ) + acc_offset * QK;
#else
#error "not implemented for QK"
#endif
#else
// scalar
for (int i = 0; i < nb; i++) {
const float m0 = pm0[i];
const float m1 = pm1[i];
const float m0 = *(const float *) (pm0 + i*bs);
const float m1 = *(const float *) (pm1 + i*bs);
const float d0 = pd0[i];
const float d1 = pd1[i];
const float d0 = *(const float *) (pd0 + i*bs);
const float d1 = *(const float *) (pd1 + i*bs);
const uint8_t * restrict p0 = pb0 + i*QK/2;
const uint8_t * restrict p1 = pb1 + i*QK/2;
const uint8_t * restrict p0 = pb0 + i*bs;
const uint8_t * restrict p1 = pb1 + i*bs;
for (int j = 0; j < QK/2; j++) {
const uint8_t v0 = p0[j];
@@ -1822,16 +2020,17 @@ inline static void ggml_vec_mad_q4_1(const int n, float * restrict y, void * res
assert(n % QK == 0);
const int nb = n / QK;
const size_t bs = 2*sizeof(float) + QK/2;
const float * restrict pm = (const float *) (x);
const float * restrict pd = (const float *) (pm + nb);
const uint8_t * restrict pb = (const uint8_t *) (pd + nb);
const uint8_t * restrict pd = ((const uint8_t *)x + 0*bs);
const uint8_t * restrict pm = ((const uint8_t *)x + 0*bs + sizeof(float));
const uint8_t * restrict pb = ((const uint8_t *)x + 0*bs + 2*sizeof(float));
for (int i = 0; i < nb; i++) {
const float m = pm[i];
const float d = pd[i];
const float d = *(const float *) (pd + i*bs);
const float m = *(const float *) (pm + i*bs);
const uint8_t * restrict pp = pb + i*QK/2;
const uint8_t * restrict pp = pb + i*bs;
for (int l = 0; l < QK; l += 2) {
const uint8_t vi = pp[l/2];
@@ -2052,6 +2251,7 @@ static const char * GGML_OP_LABEL[GGML_OP_COUNT] = {
"GELU",
"SILU",
"NORM",
"RMS_NORM",
"MUL_MAT",
@@ -2072,7 +2272,7 @@ static const char * GGML_OP_LABEL[GGML_OP_COUNT] = {
"FLASH_FF",
};
static_assert(GGML_OP_COUNT == 34, "GGML_OP_COUNT != 34");
static_assert(GGML_OP_COUNT == 35, "GGML_OP_COUNT != 35");
static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
"none",
@@ -2095,6 +2295,7 @@ static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
"gelu(x)",
"silu(x)",
"norm(x)",
"rms_norm(x)",
"X*Y",
@@ -2115,7 +2316,7 @@ static const char * GGML_OP_SYMBOL[GGML_OP_COUNT] = {
"flash_ff(x)",
};
static_assert(GGML_OP_COUNT == 34, "GGML_OP_COUNT != 34");
static_assert(GGML_OP_COUNT == 35, "GGML_OP_COUNT != 35");
//
// ggml object
@@ -3601,6 +3802,39 @@ struct ggml_tensor * ggml_norm_inplace(
return ggml_norm_impl(ctx, a, true);
}
struct ggml_tensor * ggml_rms_norm_impl(
struct ggml_context * ctx,
struct ggml_tensor * a,
bool inplace) {
bool is_node = false;
if (!inplace && (a->grad)) {
GGML_ASSERT(false); // TODO: implement backward
is_node = true;
}
struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);
result->op = GGML_OP_RMS_NORM;
result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
result->src0 = a;
result->src1 = NULL; // TODO: maybe store epsilon here?
return result;
}
struct ggml_tensor * ggml_rms_norm(
struct ggml_context * ctx,
struct ggml_tensor * a) {
return ggml_rms_norm_impl(ctx, a, false);
}
struct ggml_tensor * ggml_rms_norm_inplace(
struct ggml_context * ctx,
struct ggml_tensor * a) {
return ggml_rms_norm_impl(ctx, a, true);
}
// ggml_mul_mat
struct ggml_tensor * ggml_mul_mat(
@@ -5389,6 +5623,87 @@ static void ggml_compute_forward_norm(
}
}
static void ggml_compute_forward_rms_norm_f32(
const struct ggml_compute_params * params,
const struct ggml_tensor * src0,
struct ggml_tensor * dst) {
GGML_ASSERT(ggml_are_same_shape(src0, dst));
if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
return;
}
GGML_ASSERT(src0->nb[0] == sizeof(float));
const int ith = params->ith;
const int nth = params->nth;
const int ne00 = src0->ne[0];
const int ne01 = src0->ne[1];
const int ne02 = src0->ne[2];
const int ne03 = src0->ne[3];
const size_t nb01 = src0->nb[1];
const size_t nb02 = src0->nb[2];
const size_t nb03 = src0->nb[3];
const size_t nb1 = dst->nb[1];
const size_t nb2 = dst->nb[2];
const size_t nb3 = dst->nb[3];
const ggml_float eps = 1e-6f; // TODO: make this a parameter
// TODO: optimize
for (int i03 = 0; i03 < ne03; i03++) {
for (int i02 = 0; i02 < ne02; i02++) {
for (int i01 = ith; i01 < ne01; i01 += nth) {
const float * x = (float *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);
ggml_float mean = 0.0;
for (int i00 = 0; i00 < ne00; i00++) {
mean += x[i00] * x[i00];
}
mean /= ne00;
float * y = (float *) ((char *) dst->data + i01*nb1 + i02*nb2 + i03*nb3);
memcpy(y, x, ne00 * sizeof(float));
// for (int i00 = 0; i00 < ne00; i00++) {
// y[i00] = x[i00];
// }
const float scale = 1.0/sqrt(mean + eps);
ggml_vec_scale_f32(ne00, y, scale);
}
}
}
}
static void ggml_compute_forward_rms_norm(
const struct ggml_compute_params * params,
const struct ggml_tensor * src0,
struct ggml_tensor * dst) {
switch (src0->type) {
case GGML_TYPE_F32:
{
ggml_compute_forward_rms_norm_f32(params, src0, dst);
} break;
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
case GGML_TYPE_I8:
case GGML_TYPE_I16:
case GGML_TYPE_I32:
case GGML_TYPE_F16:
case GGML_TYPE_COUNT:
{
GGML_ASSERT(false);
} break;
}
}
// ggml_compute_forward_mul_mat
#if defined(GGML_USE_ACCELERATE) || defined(GGML_USE_OPENBLAS)
@@ -8505,6 +8820,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
{
ggml_compute_forward_norm(params, tensor->src0, tensor);
} break;
case GGML_OP_RMS_NORM:
{
ggml_compute_forward_rms_norm(params, tensor->src0, tensor);
} break;
case GGML_OP_MUL_MAT:
{
ggml_compute_forward_mul_mat(params, tensor->src0, tensor->src1, tensor);
@@ -8747,6 +9066,10 @@ static void ggml_compute_backward(struct ggml_context * ctx, struct ggml_tensor
{
GGML_ASSERT(false); // TODO: not implemented
} break;
case GGML_OP_RMS_NORM:
{
GGML_ASSERT(false); // TODO: not implemented
} break;
case GGML_OP_MUL_MAT:
{
if (src0->grad) {
@@ -9090,10 +9413,6 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
}
void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph) {
if (cgraph->n_threads <= 0) {
cgraph->n_threads = 8;
}
const int n_threads = cgraph->n_threads;
struct ggml_compute_state_shared state_shared = {
@@ -9173,6 +9492,7 @@ void ggml_graph_compute(struct ggml_context * ctx, struct ggml_cgraph * cgraph)
node->n_tasks = n_threads;
} break;
case GGML_OP_NORM:
case GGML_OP_RMS_NORM:
{
node->n_tasks = n_threads;
} break;
@@ -10403,6 +10723,68 @@ enum ggml_opt_result ggml_opt(
////////////////////////////////////////////////////////////////////////////////
size_t ggml_quantize_q4_0(const float * src, void * dst, int n, int k, int qk, int64_t * hist) {
const int nb = k / qk;
const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
const size_t row_size = nb*bs;
assert(k % qk == 0);
char * pdst = (char *) dst;
for (int j = 0; j < n; j += k) {
uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
quantize_row_q4_0_reference(src + j, pd, k);
for (int i = 0; i < nb; i++) {
for (int l = 0; l < qk; l += 2) {
const uint8_t vi0 = pb[l/2] & 0xF;
const uint8_t vi1 = pb[l/2] >> 4;
hist[vi0]++;
hist[vi1]++;
}
pb += bs;
}
}
return (n/k)*row_size;
}
size_t ggml_quantize_q4_1(const float * src, void * dst, int n, int k, int qk, int64_t * hist) {
const int nb = k / qk;
const size_t bs = (2*sizeof(float) + sizeof(uint8_t)*qk/2);
const size_t row_size = nb*bs;
assert(k % qk == 0);
char * pdst = (char *) dst;
for (int j = 0; j < n; j += k) {
uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + 2*sizeof(float));
quantize_row_q4_1(src + j, pd, k);
for (int i = 0; i < nb; i++) {
for (int l = 0; l < qk; l += 2) {
const uint8_t vi0 = pb[l/2] & 0xF;
const uint8_t vi1 = pb[l/2] >> 4;
hist[vi0]++;
hist[vi1]++;
}
pb += bs;
}
}
return (n/k)*row_size;
}
////////////////////////////////////////////////////////////////////////////////
int ggml_cpu_has_avx(void) {
#if defined(__AVX__)
return 1;

12
ggml.h
View File

@@ -230,6 +230,7 @@ enum ggml_op {
GGML_OP_GELU,
GGML_OP_SILU,
GGML_OP_NORM, // normalize
GGML_OP_RMS_NORM,
GGML_OP_MUL_MAT,
@@ -482,6 +483,10 @@ struct ggml_tensor * ggml_norm(
struct ggml_context * ctx,
struct ggml_tensor * a);
struct ggml_tensor * ggml_rms_norm(
struct ggml_context * ctx,
struct ggml_tensor * a);
// A: m rows, n columns
// B: p rows, n columns (i.e. we transpose it internally)
// result is m columns, p rows
@@ -736,6 +741,13 @@ enum ggml_opt_result ggml_opt(
struct ggml_opt_params params,
struct ggml_tensor * f);
//
// quantization
//
size_t ggml_quantize_q4_0(const float * src, void * dst, int n, int k, int qk, int64_t * hist);
size_t ggml_quantize_q4_1(const float * src, void * dst, int n, int k, int qk, int64_t * hist);
//
// system info
//

1570
llama.cpp Normal file

File diff suppressed because it is too large Load Diff

139
llama.h Normal file
View File

@@ -0,0 +1,139 @@
#ifndef LLAMA_H
#define LLAMA_H
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#ifdef LLAMA_SHARED
# ifdef _WIN32
# ifdef LLAMA_BUILD
# define LLAMA_API __declspec(dllexport)
# else
# define LLAMA_API __declspec(dllimport)
# endif
# else
# define LLAMA_API __attribute__ ((visibility ("default")))
# endif
#else
# define LLAMA_API
#endif
#define LLAMA_FILE_VERSION 1
#define LLAMA_FILE_MAGIC 0x67676d66 // 'ggmf' in hex
#define LLAMA_FILE_MAGIC_UNVERSIONED 0x67676d6c // pre-versioned files
#ifdef __cplusplus
extern "C" {
#endif
//
// C interface
//
// TODO: show sample usage
//
struct llama_context;
typedef int llama_token;
typedef struct llama_token_data {
llama_token id; // token id
float p; // probability of the token
float plog; // log probability of the token
} llama_token_data;
struct llama_context_params {
int n_ctx; // text context
int n_parts; // -1 for default
int seed; // RNG seed, 0 for random
bool f16_kv; // use fp16 for KV cache
bool logits_all; // the llama_eval() call computes all logits, not just the last one
bool vocab_only; // only load the vocabulary, no weights
};
LLAMA_API struct llama_context_params llama_context_default_params();
// Various functions for loading a ggml llama model.
// Allocate (almost) all memory needed for the model.
// Return NULL on failure
LLAMA_API struct llama_context * llama_init_from_file(
const char * path_model,
struct llama_context_params params);
// Frees all allocated memory
LLAMA_API void llama_free(struct llama_context * ctx);
// TODO: not great API - very likely to change
// Returns 0 on success
LLAMA_API int llama_model_quantize(
const char * fname_inp,
const char * fname_out,
int itype,
int qk);
// Run the llama inference to obtain the logits and probabilities for the next token.
// tokens + n_tokens is the provided batch of new tokens to process
// n_past is the number of tokens to use from previous eval calls
// Returns 0 on success
LLAMA_API int llama_eval(
struct llama_context * ctx,
const llama_token * tokens,
int n_tokens,
int n_past,
int n_threads);
// Convert the provided text into tokens.
// The tokens pointer must be large enough to hold the resulting tokens.
// Returns the number of tokens on success, no more than n_max_tokens
// Returns a negative number on failure - the number of tokens that would have been returned
// TODO: not sure if correct
LLAMA_API int llama_tokenize(
struct llama_context * ctx,
const char * text,
llama_token * tokens,
int n_max_tokens,
bool add_bos);
LLAMA_API int llama_n_vocab(struct llama_context * ctx);
LLAMA_API int llama_n_ctx (struct llama_context * ctx);
// Token logits obtained from the last call to llama_eval()
// The logits for the last token are stored in the last row
// Can be mutated in order to change the probabilities of the next token
// Rows: n_tokens
// Cols: n_vocab
LLAMA_API float * llama_get_logits(struct llama_context * ctx);
// Token Id -> String. Uses the vocabulary in the provided context
LLAMA_API const char * llama_token_to_str(struct llama_context * ctx, llama_token token);
// Special tokens
LLAMA_API llama_token llama_token_bos();
LLAMA_API llama_token llama_token_eos();
// TODO: improve the last_n_tokens interface ?
LLAMA_API llama_token llama_sample_top_p_top_k(
llama_context * ctx,
const llama_token * last_n_tokens_data,
int last_n_tokens_size,
int top_k,
double top_p,
double temp,
double repeat_penalty);
// Performance information
LLAMA_API void llama_print_timings(struct llama_context * ctx);
LLAMA_API void llama_reset_timings(struct llama_context * ctx);
// Print system information
LLAMA_API const char * llama_print_system_info(void);
#ifdef __cplusplus
}
#endif
#endif

1076
main.cpp

File diff suppressed because it is too large Load Diff

0
models/.gitignore vendored
View File

BIN
models/ggml-vocab.bin Normal file

Binary file not shown.

1
prompts/alpaca.txt Normal file
View File

@@ -0,0 +1 @@
Below is an instruction that describes a task. Write a response that appropriately completes the request.

View File

@@ -0,0 +1,7 @@
Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:

View File

@@ -1,295 +1,17 @@
#include "ggml.h"
#include "llama.h"
#include "utils.h"
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <map>
#include <string>
#include <vector>
#include <regex>
// TODO: move somewhere else
#define QK 32
// default hparams (LLaMA76B)
struct llama_hparams {
int32_t n_vocab = 32000;
int32_t n_ctx = 512; // this is provided as user input?
int32_t n_embd = 4096;
int32_t n_mult = 256;
int32_t n_head = 32;
int32_t n_layer = 32;
int32_t n_rot = 64;
int32_t f16 = 1;
};
// quantize a model
bool llama_model_quantize(const std::string & fname_inp, const std::string & fname_out, int itype) {
ggml_type type = GGML_TYPE_Q4_1;
switch (itype) {
case 2: type = GGML_TYPE_Q4_0; break;
case 3: type = GGML_TYPE_Q4_1; break;
default: fprintf(stderr, "%s: invalid quantization type %d\n", __func__, itype); return 1;
};
if (type != GGML_TYPE_Q4_0 && type != GGML_TYPE_Q4_1) {
fprintf(stderr, "%s: invalid quantization type %d\n", __func__, type);
return false;
}
gpt_vocab vocab;
printf("%s: loading model from '%s'\n", __func__, fname_inp.c_str());
auto finp = std::ifstream(fname_inp, std::ios::binary);
if (!finp) {
fprintf(stderr, "%s: failed to open '%s' for reading\n", __func__, fname_inp.c_str());
return false;
}
auto fout = std::ofstream(fname_out, std::ios::binary);
if (!fout) {
fprintf(stderr, "%s: failed to open '%s' for writing\n", __func__, fname_out.c_str());
return false;
}
// verify magic
{
uint32_t magic;
finp.read((char *) &magic, sizeof(magic));
if (magic != 0x67676d6c) {
fprintf(stderr, "%s: invalid model file '%s' (bad magic)\n", __func__, fname_inp.c_str());
return false;
}
fout.write((char *) &magic, sizeof(magic));
}
llama_hparams hparams;
// load hparams
{
finp.read((char *) &hparams.n_vocab, sizeof(hparams.n_vocab));
//finp.read((char *) &hparams.n_ctx, sizeof(hparams.n_ctx));
finp.read((char *) &hparams.n_embd, sizeof(hparams.n_embd));
finp.read((char *) &hparams.n_mult, sizeof(hparams.n_mult));
finp.read((char *) &hparams.n_head, sizeof(hparams.n_head));
finp.read((char *) &hparams.n_layer, sizeof(hparams.n_layer));
finp.read((char *) &hparams.n_rot, sizeof(hparams.n_rot));
finp.read((char *) &hparams.f16, sizeof(hparams.f16));
printf("%s: n_vocab = %d\n", __func__, hparams.n_vocab);
printf("%s: n_ctx = %d\n", __func__, hparams.n_ctx);
printf("%s: n_embd = %d\n", __func__, hparams.n_embd);
printf("%s: n_mult = %d\n", __func__, hparams.n_mult);
printf("%s: n_head = %d\n", __func__, hparams.n_head);
printf("%s: n_layer = %d\n", __func__, hparams.n_layer);
printf("%s: f16 = %d\n", __func__, hparams.f16);
fout.write((char *) &hparams.n_vocab, sizeof(hparams.n_vocab));
//fout.write((char *) &hparams.n_ctx, sizeof(hparams.n_ctx));
fout.write((char *) &hparams.n_embd, sizeof(hparams.n_embd));
fout.write((char *) &hparams.n_mult, sizeof(hparams.n_mult));
fout.write((char *) &hparams.n_head, sizeof(hparams.n_head));
fout.write((char *) &hparams.n_layer, sizeof(hparams.n_layer));
fout.write((char *) &hparams.n_rot, sizeof(hparams.n_rot));
fout.write((char *) &itype, sizeof(hparams.f16));
}
// load vocab
{
const int32_t n_vocab = hparams.n_vocab;
if (n_vocab != hparams.n_vocab) {
fprintf(stderr, "%s: invalid model file '%s' (bad vocab size %d != %d)\n",
__func__, fname_inp.c_str(), n_vocab, hparams.n_vocab);
return false;
}
std::string word;
for (int i = 0; i < n_vocab; i++) {
uint32_t len;
finp.read ((char *) &len, sizeof(len));
fout.write((char *) &len, sizeof(len));
word.resize(len);
finp.read ((char *) word.data(), len);
fout.write((char *) word.data(), len);
vocab.token_to_id[word] = i;
vocab.id_to_token[i] = word;
}
}
// load weights
{
size_t total_size_org = 0;
size_t total_size_new = 0;
std::vector<float> work;
std::vector<uint8_t> data_u8;
std::vector<ggml_fp16_t> data_f16;
std::vector<float> data_f32;
std::vector<int64_t> hist_all(1 << 4, 0);
while (true) {
int32_t n_dims;
int32_t length;
int32_t ftype;
finp.read(reinterpret_cast<char *>(&n_dims), sizeof(n_dims));
finp.read(reinterpret_cast<char *>(&length), sizeof(length));
finp.read(reinterpret_cast<char *>(&ftype), sizeof(ftype));
if (finp.eof()) {
break;
}
int32_t nelements = 1;
int32_t ne[2] = { 1, 1 };
for (int i = 0; i < n_dims; ++i) {
finp.read (reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
nelements *= ne[i];
}
std::string name(length, 0);
finp.read (&name[0], length);
{
static const char * ftype_str[] = { "f32", "f16", "q4_0", "q4_1", };
printf("%48s - [%5d, %5d], type = %6s ", name.data(), ne[0], ne[1], ftype_str[ftype]);
}
// regexes of tensor names to be quantized
const std::vector<std::string> k_names = {
".*weight",
};
bool quantize = false;
for (const auto & s : k_names) {
if (std::regex_match(name, std::regex(s))) {
quantize = true;
break;
}
}
// quantize only 2D tensors
quantize &= (n_dims == 2);
if (quantize) {
if (ftype != 0 && ftype != 1) {
fprintf(stderr, "%s: unsupported ftype %d for integer quantization\n", __func__, ftype);
return false;
}
if (ftype == 1) {
data_f16.resize(nelements);
finp.read(reinterpret_cast<char *>(data_f16.data()), nelements * sizeof(ggml_fp16_t));
data_f32.resize(nelements);
for (int i = 0; i < nelements; ++i) {
data_f32[i] = ggml_fp16_to_fp32(data_f16[i]);
}
} else {
data_f32.resize(nelements);
finp.read(reinterpret_cast<char *>(data_f32.data()), nelements * sizeof(float));
}
ftype = itype;
} else {
const int bpe = (ftype == 0) ? sizeof(float) : sizeof(uint16_t);
data_u8.resize(nelements*bpe);
finp.read(reinterpret_cast<char *>(data_u8.data()), nelements * bpe);
}
fout.write(reinterpret_cast<char *>(&n_dims), sizeof(n_dims));
fout.write(reinterpret_cast<char *>(&length), sizeof(length));
fout.write(reinterpret_cast<char *>(&ftype), sizeof(ftype));
for (int i = 0; i < n_dims; ++i) {
fout.write(reinterpret_cast<char *>(&ne[i]), sizeof(ne[i]));
}
fout.write(&name[0], length);
if (quantize) {
printf("quantizing .. ");
work.resize(nelements); // for quantization
size_t cur_size = 0;
std::vector<int64_t> hist_cur(1 << 4, 0);
switch (type) {
case GGML_TYPE_Q4_0:
{
cur_size = ggml_quantize_q4_0(data_f32.data(), work.data(), nelements, ne[0], QK, hist_cur.data());
} break;
case GGML_TYPE_Q4_1:
{
cur_size = ggml_quantize_q4_1(data_f32.data(), work.data(), nelements, ne[0], QK, hist_cur.data());
} break;
default:
{
fprintf(stderr, "%s: unsupported quantization type %d\n", __func__, type);
return false;
}
}
fout.write(reinterpret_cast<char *>(work.data()), cur_size);
total_size_new += cur_size;
printf("size = %8.2f MB -> %8.2f MB | hist: ", nelements * sizeof(float)/1024.0/1024.0, cur_size/1024.0/1024.0);
for (int i = 0; i < hist_cur.size(); ++i) {
hist_all[i] += hist_cur[i];
}
for (int i = 0; i < hist_cur.size(); ++i) {
printf("%5.3f ", hist_cur[i] / (float)nelements);
}
printf("\n");
} else {
printf("size = %8.3f MB\n", data_u8.size()/1024.0/1024.0);
fout.write(reinterpret_cast<char *>(data_u8.data()), data_u8.size());
total_size_new += data_u8.size();
}
total_size_org += nelements * sizeof(float);
}
printf("%s: model size = %8.2f MB\n", __func__, total_size_org/1024.0/1024.0);
printf("%s: quant size = %8.2f MB\n", __func__, total_size_new/1024.0/1024.0);
{
int64_t sum_all = 0;
for (int i = 0; i < hist_all.size(); ++i) {
sum_all += hist_all[i];
}
printf("%s: hist: ", __func__);
for (int i = 0; i < hist_all.size(); ++i) {
printf("%5.3f ", hist_all[i] / (float)sum_all);
}
printf("\n");
}
}
finp.close();
fout.close();
return true;
}
const int QK = 32;
// usage:
// ./llama-quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
//
int main(int argc, char ** argv) {
ggml_time_init();
if (argc != 4) {
fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type\n", argv[0]);
fprintf(stderr, " type = 2 - q4_0\n");
@@ -317,7 +39,7 @@ int main(int argc, char ** argv) {
{
const int64_t t_start_us = ggml_time_us();
if (!llama_model_quantize(fname_inp, fname_out, itype)) {
if (llama_model_quantize(fname_inp.c_str(), fname_out.c_str(), itype, QK)) {
fprintf(stderr, "%s: failed to quantize model from '%s'\n", __func__, fname_inp.c_str());
return 1;
}

126
quantize.py Normal file
View File

@@ -0,0 +1,126 @@
#!/usr/bin/env python3
"""Script to execute the "quantize" script on a given set of models."""
import subprocess
import argparse
import glob
import sys
import os
def main():
"""Update the quantize binary name depending on the platform and parse
the command line arguments and execute the script.
"""
if "linux" in sys.platform or "darwin" in sys.platform:
quantize_script_binary = "quantize"
elif "win32" in sys.platform or "cygwin" in sys.platform:
quantize_script_binary = "quantize.exe"
else:
print("WARNING: Unknown platform. Assuming a UNIX-like OS.\n")
quantize_script_binary = "quantize"
parser = argparse.ArgumentParser(
prog='python3 quantize.py',
description='This script quantizes the given models by applying the '
f'"{quantize_script_binary}" script on them.'
)
parser.add_argument(
'models', nargs='+', choices=('7B', '13B', '30B', '65B'),
help='The models to quantize.'
)
parser.add_argument(
'-r', '--remove-16', action='store_true', dest='remove_f16',
help='Remove the f16 model after quantizing it.'
)
parser.add_argument(
'-m', '--models-path', dest='models_path',
default=os.path.join(os.getcwd(), "models"),
help='Specify the directory where the models are located.'
)
parser.add_argument(
'-q', '--quantize-script-path', dest='quantize_script_path',
default=os.path.join(os.getcwd(), quantize_script_binary),
help='Specify the path to the "quantize" script.'
)
# TODO: Revise this code
# parser.add_argument(
# '-t', '--threads', dest='threads', type='int',
# default=os.cpu_count(),
# help='Specify the number of threads to use to quantize many models at '
# 'once. Defaults to os.cpu_count().'
# )
args = parser.parse_args()
if not os.path.isfile(args.quantize_script_path):
print(
f'The "{quantize_script_binary}" script was not found in the '
"current location.\nIf you want to use it from another location, "
"set the --quantize-script-path argument from the command line."
)
sys.exit(1)
for model in args.models:
# The model is separated in various parts
# (ggml-model-f16.bin, ggml-model-f16.bin.0, ggml-model-f16.bin.1...)
f16_model_path_base = os.path.join(
args.models_path, model, "ggml-model-f16.bin"
)
f16_model_parts_paths = map(
lambda filename: os.path.join(f16_model_path_base, filename),
glob.glob(f"{f16_model_path_base}*")
)
for f16_model_part_path in f16_model_parts_paths:
if not os.path.isfile(f16_model_part_path):
print(
f"The f16 model {os.path.basename(f16_model_part_path)} "
f"was not found in {args.models_path}{os.path.sep}{model}"
". If you want to use it from another location, set the "
"--models-path argument from the command line."
)
sys.exit(1)
__run_quantize_script(
args.quantize_script_path, f16_model_part_path
)
if args.remove_f16:
os.remove(f16_model_part_path)
# This was extracted to a top-level function for parallelization, if
# implemented. See https://github.com/ggerganov/llama.cpp/pull/222/commits/f8db3d6cd91bf1a1342db9d29e3092bc12dd783c#r1140496406
def __run_quantize_script(script_path, f16_model_part_path):
"""Run the quantize script specifying the path to it and the path to the
f16 model to quantize.
"""
new_quantized_model_path = f16_model_part_path.replace("f16", "q4_0")
subprocess.run(
[script_path, f16_model_part_path, new_quantized_model_path, "2"],
check=True
)
if __name__ == "__main__":
try:
main()
except subprocess.CalledProcessError:
print("\nAn error ocurred while trying to quantize the models.")
sys.exit(1)
except KeyboardInterrupt:
sys.exit(0)
else:
print("\nSuccesfully quantized all models.")

364
run.cpp Normal file
View File

@@ -0,0 +1,364 @@
#include "utils.h"
#include "llama.h"
#include <cassert>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
#include <signal.h>
#include <unistd.h>
#elif defined (_WIN32)
#include <signal.h>
#endif
#if defined (_WIN32)
#pragma comment(lib,"kernel32.lib")
extern "C" __declspec(dllimport) void* __stdcall GetStdHandle(unsigned long nStdHandle);
extern "C" __declspec(dllimport) int __stdcall GetConsoleMode(void* hConsoleHandle, unsigned long* lpMode);
extern "C" __declspec(dllimport) int __stdcall SetConsoleMode(void* hConsoleHandle, unsigned long dwMode);
#endif
#define ANSI_COLOR_RED "\x1b[31m"
#define ANSI_COLOR_GREEN "\x1b[32m"
#define ANSI_COLOR_YELLOW "\x1b[33m"
#define ANSI_COLOR_BLUE "\x1b[34m"
#define ANSI_COLOR_MAGENTA "\x1b[35m"
#define ANSI_COLOR_CYAN "\x1b[36m"
#define ANSI_COLOR_RESET "\x1b[0m"
#define ANSI_BOLD "\x1b[1m"
/* Keep track of current color of output, and emit ANSI code if it changes. */
enum console_state {
CONSOLE_STATE_DEFAULT=0,
CONSOLE_STATE_PROMPT,
CONSOLE_STATE_USER_INPUT
};
static console_state con_st = CONSOLE_STATE_DEFAULT;
static bool con_use_color = false;
void set_console_state(FILE *stream, console_state new_st)
{
if (!con_use_color) return;
// only emit color code if state changed
if (new_st != con_st) {
con_st = new_st;
switch(con_st) {
case CONSOLE_STATE_DEFAULT:
fprintf(stream, ANSI_COLOR_RESET);
return;
case CONSOLE_STATE_PROMPT:
fprintf(stream, ANSI_COLOR_YELLOW);
return;
case CONSOLE_STATE_USER_INPUT:
fprintf(stream, ANSI_BOLD ANSI_COLOR_GREEN);
return;
}
}
}
static bool is_interacting = false;
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
void sigint_handler(int signo) {
set_console_state(stdout, CONSOLE_STATE_DEFAULT);
printf("\n"); // this also force flush stdout.
if (signo == SIGINT) {
if (!is_interacting) {
is_interacting=true;
} else {
_exit(130);
}
}
}
#endif
int run(llama_context * ctx,
gpt_params params,
std::istream & instream,
FILE *outstream,
FILE *errstream) {
if (params.seed <= 0) {
params.seed = time(NULL);
}
fprintf(errstream, "%s: seed = %d\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
params.prompt = gpt_random_prompt(rng);
}
// save choice to use color for later
// (note for later: this is a slightly awkward choice)
con_use_color = params.use_color;
// params.prompt = R"(// this function checks if the number n is prime
//bool is_prime(int n) {)";
// determine the required inference memory per token:
// TODO: better way to do that
{
const std::vector<llama_token> tmp = { 0, 1, 2, 3 };
llama_eval(ctx, tmp.data(), tmp.size(), 0, params.n_threads);
}
int n_past = 0;
// Add a space in front of the first character to match OG llama tokenizer behavior
params.prompt.insert(0, 1, ' ');
// tokenize the prompt
auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);
const int n_ctx = llama_n_ctx(ctx);
params.n_predict = std::min(params.n_predict, n_ctx - (int) embd_inp.size());
// prefix & suffix for instruct mode
const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true);
const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false);
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive = true;
params.antiprompt.push_back("### Instruction:\n\n");
}
// enable interactive mode if reverse prompt is specified
if (params.antiprompt.size() != 0) {
params.interactive = true;
}
if (params.interactive_start) {
params.interactive = true;
}
fprintf(errstream, "\n");
fprintf(errstream, "%s: prompt: '%s'\n", __func__, params.prompt.c_str());
fprintf(errstream, "%s: number of tokens in prompt = %zu\n", __func__, embd_inp.size());
for (int i = 0; i < (int) embd_inp.size(); i++) {
fprintf(errstream, "%6d -> '%s'\n", embd_inp[i], llama_token_to_str(ctx, embd_inp[i]));
}
fprintf(errstream, "\n");
if (params.interactive) {
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
struct sigaction sigint_action;
sigint_action.sa_handler = sigint_handler;
sigemptyset (&sigint_action.sa_mask);
sigint_action.sa_flags = 0;
sigaction(SIGINT, &sigint_action, NULL);
#elif defined (_WIN32)
signal(SIGINT, sigint_handler);
#endif
fprintf(errstream, "%s: interactive mode on.\n", __func__);
if(params.antiprompt.size()) {
for (auto antiprompt : params.antiprompt) {
fprintf(errstream, "Reverse prompt: '%s'\n", antiprompt.c_str());
}
}
}
fprintf(errstream, "sampling parameters: temp = %f, top_k = %d, top_p = %f, repeat_last_n = %i, repeat_penalty = %f\n", params.temp, params.top_k, params.top_p, params.repeat_last_n, params.repeat_penalty);
fprintf(errstream, "\n\n");
std::vector<llama_token> embd;
int last_n_size = params.repeat_last_n;
std::vector<llama_token> last_n_tokens(last_n_size);
std::fill(last_n_tokens.begin(), last_n_tokens.end(), 0);
if (params.interactive) {
fprintf(errstream, "== Running in interactive mode. ==\n"
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__)) || defined (_WIN32)
" - Press Ctrl+C to interject at any time.\n"
#endif
" - Press Return to return control to LLaMa.\n"
" - If you want to submit another line, end your input in '\\'.\n\n");
is_interacting = params.interactive_start;
}
int input_consumed = 0;
bool input_noecho = false;
int remaining_tokens = params.n_predict;
#if defined (_WIN32)
if (params.use_color) {
// Enable ANSI colors on Windows 10+
unsigned long dwMode = 0;
void* hConOut = GetStdHandle((unsigned long)-11); // STD_OUTPUT_HANDLE (-11)
if (hConOut && hConOut != (void*)-1 && GetConsoleMode(hConOut, &dwMode) && !(dwMode & 0x4)) {
SetConsoleMode(hConOut, dwMode | 0x4); // ENABLE_VIRTUAL_TERMINAL_PROCESSING (0x4)
}
}
#endif
// the first thing we will do is to output the prompt, so set color accordingly
set_console_state(outstream, CONSOLE_STATE_PROMPT);
while (remaining_tokens > 0 || params.interactive) {
// predict
if (embd.size() > 0) {
if (llama_eval(ctx, embd.data(), embd.size(), n_past, params.n_threads)) {
fprintf(errstream, "%s : failed to eval\n", __func__);
return 1;
}
}
n_past += embd.size();
embd.clear();
if ((int) embd_inp.size() <= input_consumed) {
// out of user input, sample next token
const float top_k = params.top_k;
const float top_p = params.top_p;
const float temp = params.temp;
const float repeat_penalty = params.repeat_penalty;
llama_token id = 0;
{
auto logits = llama_get_logits(ctx);
if (params.ignore_eos) {
// set the logit of the eos token to zero to avoid sampling it
//logits[logits.size() - n_vocab + EOS_TOKEN_ID] = 0;
// TODO: this does not work of params.logits_all == true
assert(params.perplexity == false);
logits[llama_token_eos()] = 0;
}
id = llama_sample_top_p_top_k(ctx, last_n_tokens.data(), last_n_tokens.size(), top_k, top_p, temp, repeat_penalty);
last_n_tokens.erase(last_n_tokens.begin());
last_n_tokens.push_back(id);
}
// add it to the context
embd.push_back(id);
// echo this to console
input_noecho = false;
// decrement remaining sampling budget
--remaining_tokens;
} else {
// some user input remains from prompt or interaction, forward it to processing
while ((int) embd_inp.size() > input_consumed) {
embd.push_back(embd_inp[input_consumed]);
last_n_tokens.erase(last_n_tokens.begin());
last_n_tokens.push_back(embd_inp[input_consumed]);
++input_consumed;
if ((int) embd.size() >= params.n_batch) {
break;
}
}
}
// display text
if (!input_noecho) {
for (auto id : embd) {
fprintf(outstream, "%s", llama_token_to_str(ctx, id));
}
fflush(outstream);
}
// reset color to default if we there is no pending user input
if (!input_noecho && (int)embd_inp.size() == input_consumed) {
set_console_state(outstream, CONSOLE_STATE_DEFAULT);
}
// in interactive mode, and not currently processing queued inputs;
// check if we should prompt the user for more
if (params.interactive && (int) embd_inp.size() <= input_consumed) {
// check for reverse prompt
std::string last_output;
for (auto id : last_n_tokens) {
last_output += llama_token_to_str(ctx, id);
}
// Check if each of the reverse prompts appears at the end of the output.
for (std::string antiprompt : params.antiprompt) {
if (last_output.find(antiprompt.c_str(), last_output.length() - antiprompt.length(), antiprompt.length()) != std::string::npos) {
is_interacting = true;
break;
}
}
if (is_interacting) {
// potentially set color to indicate we are taking user input
set_console_state(outstream, CONSOLE_STATE_USER_INPUT);
if (params.instruct) {
input_consumed = embd_inp.size();
embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
fprintf(outstream, "\n> ");
}
std::string buffer;
std::string line;
bool another_line = true;
do {
std::getline(instream, line);
if (line.empty() || line.back() != '\\') {
another_line = false;
} else {
line.pop_back(); // Remove the continue character
}
buffer += line + '\n'; // Append the line to the result
} while (another_line);
// done taking input, reset color
set_console_state(outstream, CONSOLE_STATE_DEFAULT);
auto line_inp = ::llama_tokenize(ctx, buffer, false);
embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
if (params.instruct) {
embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
}
remaining_tokens -= line_inp.size();
input_noecho = true; // do not echo this again
}
is_interacting = false;
}
// end of text token
if (embd.back() == llama_token_eos()) {
if (params.interactive) {
is_interacting = true;
} else {
fprintf(errstream, " [end of text]\n");
break;
}
}
// In interactive mode, respect the maximum number of tokens and drop back to user input when reached.
if (params.interactive && remaining_tokens <= 0) {
remaining_tokens = params.n_predict;
is_interacting = true;
}
}
#if defined (_WIN32)
signal(SIGINT, SIG_DFL);
#endif
llama_print_timings(ctx);
llama_free(ctx);
set_console_state(outstream, CONSOLE_STATE_DEFAULT);
return 0;
}

10
run.h Normal file
View File

@@ -0,0 +1,10 @@
#pragma once
#include "llama.h"
#include "utils.h"
int run(llama_context * ctx,
gpt_params params,
std::istream & instream,
FILE *outstream,
FILE *errstream);

245
tcp_server.cpp Normal file
View File

@@ -0,0 +1,245 @@
#include "tcp_server.h"
#include "llama.h"
#include "utils.h"
#include <iostream>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netdb.h>
class PosixStream : public std::istream {
public:
PosixStream(int fd) : std::istream(&buf), buf(fd) {}
~PosixStream() { close(buf.get_fd()); }
private:
class PosixStreamBuf : public std::streambuf {
public:
PosixStreamBuf(int fd) : fd(fd) {}
int get_fd() const { return fd; }
protected:
virtual int_type underflow() {
if (gptr() < egptr()) {
return traits_type::to_int_type(*gptr());
}
ssize_t num_read = ::read(fd, buffer, BUFFER_SIZE);
if (num_read <= 0) {
return traits_type::eof();
}
setg(buffer, buffer, buffer + num_read);
return traits_type::to_int_type(*gptr());
}
private:
static const int BUFFER_SIZE = 1024;
int fd;
char buffer[BUFFER_SIZE];
};
PosixStreamBuf buf;
};
void die(const char *msg, ...)
{
va_list ap;
va_start(ap, msg);
vfprintf(stderr, msg, ap);
va_end(ap);
fputc('\n', stderr);
exit(1);
}
static char *read_argument(uint8_t **param_buf, size_t *param_buf_size, FILE *instream) {
bool done = false;
uint8_t *buf = *param_buf;
size_t bufsize = *param_buf_size;
size_t bufpos = 0;
while (!done) {
if (bufpos == bufsize) {
bufsize += 1024;
buf = (uint8_t *)realloc(buf, bufsize);
if (!buf) {
die("failed to allocate memory");
}
}
int c = fgetc(instream);
if (c == EOF) {
die("unexpected EOF client socket");
}
buf[bufpos++] = (uint8_t)c;
if (c == 0) {
// done reading argument
break;
}
}
*param_buf = buf;
*param_buf_size = bufsize;
return strdup((char *)buf);
}
static int read_arguments(int argc, char **argv, FILE *instream) {
int i = 1;
size_t param_buf_size = 0;
uint8_t *param_buf = nullptr;
for (i = 1; i < argc; i++) {
argv[i] = read_argument(&param_buf, &param_buf_size, instream);
}
free(param_buf);
return i;
}
static int serve_model(llama_context * ctx,
gpt_params params,
int sock_fd)
{
int argc;
char **argv;
FILE *instream = fdopen(sock_fd, "r");
FILE *outstream = fdopen(sock_fd, "w");
setvbuf(instream, NULL, _IONBF, 0);
// start by reading the parameter count
if (fscanf(instream, "%d\n", &argc) != 1) {
fprintf(outstream, "Error: First line must be character count\n");
fflush(outstream);
return 1;
}
argc += 1; // add one extra argument to emulate the program command line
argv = (char **)malloc(argc * sizeof *argv);
argv[0] = nullptr;
if (read_arguments(argc, argv, instream) != argc) {
fprintf(outstream, "Error: Failed to read arguments\n");
fflush(outstream);
}
if (gpt_params_parse(argc, argv, params) == false) {
fprintf(outstream, "Error: Failed to parse parameters\n");
fflush(outstream);
return 1;
}
for (int i = 1; i < argc; i++) {
free(argv[i]);
}
free(argv);
PosixStream tcp_instream(sock_fd);
return run(ctx, params, tcp_instream, outstream, outstream);
}
int listen_tcp(llama_context * ctx, gpt_params params) {
int listen_fd;
int status;
pid_t child;
struct addrinfo hints;
struct addrinfo *servinfo, *p;
int yes = 1;
memset(&hints, 0, sizeof hints);
hints.ai_family = AF_INET;
hints.ai_socktype = SOCK_STREAM;
hints.ai_flags = AI_PASSIVE;
// This should only ever listen on a loopback address. Access from outside
// should be proxied via socat or similar software
status = getaddrinfo("127.0.0.1", params.listen_port.c_str(), &hints, &servinfo);
if (status) {
die("getaddrinfo error: %s", gai_strerror(status));
}
// bind to the first addrinfo we can from the getaddrinfo results
for (p = servinfo; p != NULL; p = p->ai_next) {
listen_fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
if (listen_fd == -1) {
perror("server: socket");
continue;
}
if (setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR | SO_REUSEPORT, &yes, sizeof yes)) {
die("setsockopt error: %s", params.listen_port.c_str(), strerror(errno));
}
if (bind(listen_fd, p->ai_addr, p->ai_addrlen) == 0) {
struct sockaddr_in addr_in;
socklen_t addr_in_len = sizeof(addr_in);
memset(&addr_in, 0, addr_in_len);
getsockname(listen_fd, (struct sockaddr*)&addr_in, &addr_in_len);
printf("Listening on %s:%d\n", inet_ntoa(addr_in.sin_addr), ntohs(addr_in.sin_port));
break;
}
close(listen_fd);
perror("server: bind");
}
freeaddrinfo(servinfo);
if (p == NULL) {
die("failed to bind: %s", strerror(errno));
}
if (listen(listen_fd, 20)) {
die("listen error: %s", strerror(errno));
}
// Don't track child processes, so ignore SIGCHLD to prevent zombies
signal(SIGCHLD, SIG_IGN);
for (;;) {
struct sockaddr_in client_addr;
socklen_t client_addr_len = 0;
memset(&client_addr, 0, sizeof(client_addr));
int sock_fd = accept(listen_fd,
(struct sockaddr *)&client_addr,
&client_addr_len);
if (sock_fd < 0) {
fprintf(stderr, "accept error: %s\n", strerror(errno));
break;
}
child = fork();
if (child == 0) {
// close the listen_fd since we won't use it in the child
close(listen_fd);
int ret = serve_model(ctx, params, sock_fd);
close(sock_fd);
return ret;
} else {
// close the client since we won't use it in the server
close(sock_fd);
sock_fd = 0;
}
}
close(listen_fd);
// ignore SIGTERM since we'll send it to the group
signal(SIGTERM, SIG_IGN);
// tell children to exit
kill(0, SIGTERM);
// wait for children to terminate
wait(&status);
return 0;
}

7
tcp_server.h Normal file
View File

@@ -0,0 +1,7 @@
#pragma once
#include "utils.h"
#include "llama.h"
#include "run.h"
int listen_tcp(llama_context * ctx, gpt_params params);

9
tests/CMakeLists.txt Normal file
View File

@@ -0,0 +1,9 @@
function(llama_add_test source)
get_filename_component(TEST_TARGET ${source} NAME_WE)
add_executable(${TEST_TARGET} ${source})
target_link_libraries(${TEST_TARGET} PRIVATE llama ggml utils)
add_test(NAME ${TEST_TARGET} COMMAND $<TARGET_FILE:${TEST_TARGET}> ${ARGN})
endfunction()
llama_add_test(test-quantize.c)
llama_add_test(test-tokenizer-0.cpp ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab.bin)

42
tests/test-quantize.c Normal file
View File

@@ -0,0 +1,42 @@
#include "ggml.h"
#undef NDEBUG
#include <assert.h>
#include <math.h>
int main(void) {
#define QK 32
float src[QK];
uint8_t dst[24];
int64_t hist[16];
for (int i = 0; i < QK; i++) {
src[i] = (float)(i + 1);
}
size_t size = ggml_quantize_q4_0(src, dst, QK, QK, QK, hist);
assert(size == 20);
float max_result = ((float *)dst)[0];
float max_expected = src[31] / ((1 << 3) - 1);
assert(max_result == max_expected);
for (int i = 0; i < QK; i++) {
uint8_t q4_result = (i % 2) ? (dst[sizeof(float) + i/2] >> 4) : (dst[sizeof(float) + i/2] & 0xF);
uint8_t q4_expected = roundf(src[i] / max_expected) + 8;
assert(q4_result == q4_expected);
}
size = ggml_quantize_q4_1(src, dst, QK, QK, QK, hist);
assert(size == 24);
float delta_result = ((float *)dst)[0];
float delta_expected = (src[31] - src[0]) / ((1 << 4) - 1);
assert(delta_result == delta_expected);
float min_result = ((float *)dst)[1];
float min_expected = src[0];
assert(min_result == min_expected);
for (int i = 0; i < QK; i++) {
uint8_t q4_result = (i % 2) ? (dst[sizeof(float)*2 + i/2] >> 4) : (dst[sizeof(float)*2 + i/2] & 0xF);
uint8_t q4_expected = roundf((src[i] - min_expected) / delta_expected);
assert(q4_result == q4_expected);
}
return 0;
}

View File

@@ -0,0 +1,79 @@
#include "utils.h"
#include "llama.h"
#include <cstdio>
#include <string>
#include <map>
static const std::map<std::string, std::vector<llama_token>> k_tests = {
{ "Hello World", { 1, 10994, 2787, }, },
{ " Hello World", { 1, 15043, 2787, }, },
{ " Hello World!", { 1, 15043, 2787, 29991, }, },
{ " this is 🦙.cpp", { 1, 445, 338, 29871, 243, 162, 169, 156, 29889, 8223, }, },
{ "w048 7tuijk dsdfhu", { 1, 29893, 29900, 29946, 29947, 29871, 29955, 9161, 13535, 18031, 2176, 6905, }, },
{ "нещо на Български", { 1, 821, 4851, 665, 1386, 29713, 1305, }, },
};
int main(int argc, char **argv) {
if (argc < 2) {
fprintf(stderr, "Usage: %s <vocab-file>\n", argv[0]);
return 1;
}
const std::string fname = argv[1];
fprintf(stderr, "%s : reading vocab from: '%s'\n", __func__, fname.c_str());
llama_context * ctx;
// load the vocab
{
auto lparams = llama_context_default_params();
lparams.vocab_only = true;
ctx = llama_init_from_file(fname.c_str(), lparams);
if (ctx == NULL) {
fprintf(stderr, "%s: error: failed to load vocab '%s'\n", __func__, fname.c_str());
return 1;
}
}
const int n_vocab = llama_n_vocab(ctx);
if (n_vocab != 32000) {
fprintf(stderr, "%s : expected 32000 tokens, got %d\n", __func__, n_vocab);
return 2;
}
for (const auto & test_kv : k_tests) {
const auto res = ::llama_tokenize(ctx, test_kv.first, true);
bool correct = res.size() == test_kv.second.size();
for (int i = 0; i < (int) res.size() && correct; ++i) {
if (res[i] != test_kv.second[i]) {
correct = false;
}
}
if (!correct) {
fprintf(stderr, "%s : failed test: '%s'\n", __func__, test_kv.first.c_str());
fprintf(stderr, "%s : expected tokens: ", __func__);
for (const auto & t : test_kv.second) {
fprintf(stderr, "%6d, ", t);
}
fprintf(stderr, "\n");
fprintf(stderr, "%s : got tokens: ", __func__);
for (const auto & t : res) {
fprintf(stderr, "%6d, ", t);
}
fprintf(stderr, "\n");
return 3;
}
}
return 0;
}

503
utils.cpp
View File

@@ -3,19 +3,29 @@
#include <cassert>
#include <cstring>
#include <fstream>
#include <regex>
#include <iostream>
#include <iterator>
#include <string>
#include <math.h>
#include <iterator>
#include <algorithm>
#if defined(_MSC_VER) || defined(__MINGW32__)
#include <malloc.h> // using malloc.h with MSC/MINGW
#elif !defined(__FreeBSD__)
#elif !defined(__FreeBSD__) && !defined(__NetBSD__) && !defined(__OpenBSD__)
#include <alloca.h>
#endif
bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
// determine sensible default number of threads.
// std::thread::hardware_concurrency may not be equal to the number of cores, or may return 0.
#ifdef __linux__
std::ifstream cpuinfo("/proc/cpuinfo");
params.n_threads = std::count(std::istream_iterator<std::string>(cpuinfo),
std::istream_iterator<std::string>(),
std::string("processor"));
#endif
if (params.n_threads == 0) {
params.n_threads = std::max(1, (int32_t) std::thread::hardware_concurrency());
}
for (int i = 1; i < argc; i++) {
std::string arg = argv[i];
@@ -26,17 +36,19 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
} else if (arg == "-p" || arg == "--prompt") {
params.prompt = argv[++i];
} else if (arg == "-f" || arg == "--file") {
std::ifstream file(argv[++i]);
std::copy(std::istreambuf_iterator<char>(file),
std::istreambuf_iterator<char>(),
back_inserter(params.prompt));
std::copy(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>(), back_inserter(params.prompt));
if (params.prompt.back() == '\n') {
params.prompt.pop_back();
}
} else if (arg == "-n" || arg == "--n_predict") {
params.n_predict = std::stoi(argv[++i]);
} else if (arg == "--top_k") {
params.top_k = std::stoi(argv[++i]);
} else if (arg == "-c" || arg == "--ctx_size") {
params.n_ctx = std::stoi(argv[++i]);
} else if (arg == "--memory_f16") {
params.memory_f16 = true;
} else if (arg == "--top_p") {
params.top_p = std::stof(argv[++i]);
} else if (arg == "--temp") {
@@ -51,16 +63,29 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
params.model = argv[++i];
} else if (arg == "-i" || arg == "--interactive") {
params.interactive = true;
} else if (arg == "--interactive-start") {
params.interactive = true;
} else if (arg == "--interactive-first") {
params.interactive_start = true;
} else if (arg == "-ins" || arg == "--instruct") {
params.instruct = true;
} else if (arg == "--color") {
params.use_color = true;
} else if (arg == "-r" || arg == "--reverse-prompt") {
params.antiprompt = argv[++i];
params.antiprompt.push_back(argv[++i]);
} else if (arg == "--perplexity") {
params.perplexity = true;
} else if (arg == "--ignore-eos") {
params.ignore_eos = true;
} else if (arg == "--n_parts") {
params.n_parts = std::stoi(argv[++i]);
#ifndef _WIN32
} else if (arg == "-l" || arg == "--listen") {
params.listen_port = argv[++i];
#endif
} else if (arg == "-h" || arg == "--help") {
gpt_print_usage(argc, argv, params);
exit(0);
} else if (arg == "--random-prompt") {
params.random_prompt = true;
} else {
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
gpt_print_usage(argc, argv, params);
@@ -71,20 +96,23 @@ bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
return true;
}
void gpt_print_usage(int argc, char ** argv, const gpt_params & params) {
void gpt_print_usage(int /*argc*/, char ** argv, const gpt_params & params) {
fprintf(stderr, "usage: %s [options]\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "options:\n");
fprintf(stderr, " -h, --help show this help message and exit\n");
fprintf(stderr, " -i, --interactive run in interactive mode\n");
fprintf(stderr, " --interactive-start run in interactive mode and poll user input at startup\n");
fprintf(stderr, " --interactive-first run in interactive mode and wait for input right away\n");
fprintf(stderr, " -ins, --instruct run in instruction mode (use with Alpaca models)\n");
fprintf(stderr, " -r PROMPT, --reverse-prompt PROMPT\n");
fprintf(stderr, " in interactive mode, poll user input upon seeing PROMPT\n");
fprintf(stderr, " run in interactive mode and poll user input upon seeing PROMPT (can be\n");
fprintf(stderr, " specified more than once for multiple prompts).\n");
fprintf(stderr, " --color colorise output to distinguish prompt and user input from generations\n");
fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1)\n");
fprintf(stderr, " -s SEED, --seed SEED RNG seed (default: -1, use random seed for <= 0)\n");
fprintf(stderr, " -t N, --threads N number of threads to use during computation (default: %d)\n", params.n_threads);
fprintf(stderr, " -p PROMPT, --prompt PROMPT\n");
fprintf(stderr, " prompt to start generation with (default: random)\n");
fprintf(stderr, " prompt to start generation with (default: empty)\n");
fprintf(stderr, " --random-prompt start with a randomized prompt.\n");
fprintf(stderr, " -f FNAME, --file FNAME\n");
fprintf(stderr, " prompt file to start generation.\n");
fprintf(stderr, " -n N, --n_predict N number of tokens to predict (default: %d)\n", params.n_predict);
@@ -92,10 +120,19 @@ void gpt_print_usage(int argc, char ** argv, const gpt_params & params) {
fprintf(stderr, " --top_p N top-p sampling (default: %.1f)\n", params.top_p);
fprintf(stderr, " --repeat_last_n N last n tokens to consider for penalize (default: %d)\n", params.repeat_last_n);
fprintf(stderr, " --repeat_penalty N penalize repeat sequence of tokens (default: %.1f)\n", params.repeat_penalty);
fprintf(stderr, " -c N, --ctx_size N size of the prompt context (default: %d)\n", params.n_ctx);
fprintf(stderr, " --ignore-eos ignore end of stream token and continue generating\n");
fprintf(stderr, " --memory_f16 use f16 instead of f32 for memory key+value\n");
fprintf(stderr, " --temp N temperature (default: %.1f)\n", params.temp);
fprintf(stderr, " --n_parts N number of model parts (default: -1 = determine from dimensions)\n");
fprintf(stderr, " -b N, --batch_size N batch size for prompt processing (default: %d)\n", params.n_batch);
fprintf(stderr, " --perplexity compute perplexity over the prompt\n");
fprintf(stderr, " -m FNAME, --model FNAME\n");
fprintf(stderr, " model path (default: %s)\n", params.model.c_str());
#ifndef _WIN32
fprintf(stderr, " -l PORT, --listen PORT\n");
fprintf(stderr, " Run in TCP mode, listening on PORT\n");
#endif
fprintf(stderr, "\n");
}
@@ -118,427 +155,13 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
return "The";
}
void replace(std::string & str, const std::string & needle, const std::string & replacement) {
size_t pos = 0;
while ((pos = str.find(needle, pos)) != std::string::npos) {
str.replace(pos, needle.length(), replacement);
pos += replacement.length();
}
}
std::map<std::string, int32_t> json_parse(const std::string & fname) {
std::map<std::string, int32_t> result;
// read file into string
std::string json;
{
std::ifstream ifs(fname);
if (!ifs) {
fprintf(stderr, "Failed to open %s\n", fname.c_str());
exit(1);
}
json = std::string((std::istreambuf_iterator<char>(ifs)),
(std::istreambuf_iterator<char>()));
}
if (json[0] != '{') {
return result;
}
// parse json
{
bool has_key = false;
bool in_token = false;
std::string str_key = "";
std::string str_val = "";
int n = json.size();
for (int i = 1; i < n; ++i) {
if (!in_token) {
if (json[i] == ' ') continue;
if (json[i] == '"') {
in_token = true;
continue;
}
} else {
if (json[i] == '\\' && i+1 < n) {
if (has_key == false) {
str_key += json[i];
} else {
str_val += json[i];
}
++i;
} else if (json[i] == '"') {
if (has_key == false) {
has_key = true;
++i;
while (json[i] == ' ') ++i;
++i; // :
while (json[i] == ' ') ++i;
if (json[i] != '\"') {
while (json[i] != ',' && json[i] != '}') {
str_val += json[i++];
}
has_key = false;
} else {
in_token = true;
continue;
}
} else {
has_key = false;
}
::replace(str_key, "\\u0120", " " ); // \u0120 -> space
::replace(str_key, "\\u010a", "\n"); // \u010a -> new line
::replace(str_key, "\\\"", "\""); // \\\" -> "
try {
result[str_key] = std::stoi(str_val);
} catch (...) {
//fprintf(stderr, "%s: ignoring key '%s' with value '%s'\n", fname.c_str(), str_key.c_str(), str_val.c_str());
}
str_key = "";
str_val = "";
in_token = false;
continue;
}
if (has_key == false) {
str_key += json[i];
} else {
str_val += json[i];
}
}
}
}
return result;
}
std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text) {
std::vector<std::string> words;
// first split the text into words
{
std::string str = text;
std::string pat = R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)";
std::regex re(pat);
std::smatch m;
while (std::regex_search(str, m, re)) {
for (auto x : m) {
words.push_back(x);
}
str = m.suffix();
}
}
// find the longest tokens that form the words:
std::vector<gpt_vocab::id> tokens;
for (const auto & word : words) {
if (word.size() == 0) continue;
int i = 0;
int n = word.size();
while (i < n) {
int j = n;
while (j > i) {
auto it = vocab.token_to_id.find(word.substr(i, j-i));
if (it != vocab.token_to_id.end()) {
tokens.push_back(it->second);
i = j;
break;
}
--j;
}
if (i == n) {
break;
}
if (j == i) {
auto sub = word.substr(i, 1);
if (vocab.token_to_id.find(sub) != vocab.token_to_id.end()) {
tokens.push_back(vocab.token_to_id.at(sub));
} else {
fprintf(stderr, "%s: unknown token '%s'\n", __func__, sub.data());
}
++i;
}
}
}
return tokens;
}
std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos) {
//auto res = gpt_tokenize(vocab, text);
//if (bos) {
// res.insert(res.begin(), 1); // TODO: replace with vocab.bos
//}
std::vector<gpt_vocab::id> res;
if (bos) {
res.push_back(1); // TODO: replace with vocab.bos
}
//find the longest token that matches the text
int pos = 0;
while (true) {
int l = 0;
int t = 0;
for (const auto & kv : vocab.id_to_token) {
if (kv.second.size() < l) continue;
if (kv.second.size() > text.size() - pos) continue;
if (text.substr(pos, kv.second.size()) == kv.second) {
l = kv.second.size();
t = kv.first;
}
}
if (l == 0) {
break;
}
res.push_back(t);
pos += l;
}
// TODO: not great allocating this every time
std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
// initialize to prompt numer of chars, since n_tokens <= n_prompt_chars
std::vector<llama_token> res(text.size() + (int)add_bos);
int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
assert(n >= 0);
res.resize(n);
return res;
}
bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab) {
printf("%s: loading vocab from '%s'\n", __func__, fname.c_str());
vocab.token_to_id = ::json_parse(fname);
for (const auto & kv : vocab.token_to_id) {
vocab.id_to_token[kv.second] = kv.first;
}
printf("%s: vocab size = %d\n", __func__, (int) vocab.token_to_id.size());
// print the vocabulary
//for (auto kv : vocab.token_to_id) {
// printf("'%s' -> %d\n", kv.first.data(), kv.second);
//}
return true;
}
void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k) {
// find the top K tokens
std::partial_sort(
logits_id.begin(),
logits_id.begin() + top_k, logits_id.end(),
[](const std::pair<double, gpt_vocab::id> & a, const std::pair<double, gpt_vocab::id> & b) {
return a.first > b.first;
});
logits_id.resize(top_k);
}
gpt_vocab::id llama_sample_top_p_top_k(
const gpt_vocab & vocab,
const float * logits,
std::vector<gpt_vocab::id> & last_n_tokens,
double repeat_penalty,
int top_k,
double top_p,
double temp,
std::mt19937 & rng) {
int n_logits = vocab.id_to_token.size();
std::vector<std::pair<double, gpt_vocab::id>> logits_id;
logits_id.reserve(n_logits);
{
const double scale = 1.0/temp;
for (int i = 0; i < n_logits; ++i) {
// repetition penalty from CTRL paper (https://arxiv.org/abs/1909.05858)
// credit https://github.com/facebookresearch/llama/compare/main...shawwn:llama:main
if (std::find(last_n_tokens.begin(), last_n_tokens.end(), i) != last_n_tokens.end()) {
// if score < 0 then repetition penalty has to multiplied to reduce the previous token probability
if (logits[i] < 0.0) {
logits_id.push_back(std::make_pair(logits[i]*scale*repeat_penalty, i));
} else {
logits_id.push_back(std::make_pair(logits[i]*scale/repeat_penalty, i));
}
} else {
logits_id.push_back(std::make_pair(logits[i]*scale, i));
}
}
}
sample_top_k(logits_id, top_k);
double maxl = -INFINITY;
for (const auto & kv : logits_id) {
maxl = std::max(maxl, kv.first);
}
// compute probs for the top K tokens
std::vector<double> probs;
probs.reserve(logits_id.size());
double sum = 0.0;
for (const auto & kv : logits_id) {
double p = exp(kv.first - maxl);
probs.push_back(p);
sum += p;
}
// normalize the probs
for (auto & p : probs) {
p /= sum;
}
if (top_p < 1.0f) {
double cumsum = 0.0f;
for (int i = 0; i < (int) probs.size(); i++) {
cumsum += probs[i];
if (cumsum >= top_p) {
probs.resize(i + 1);
logits_id.resize(i + 1);
break;
}
}
cumsum = 1.0/cumsum;
for (int i = 0; i < (int) probs.size(); i++) {
probs[i] *= cumsum;
}
}
//printf("\n");
//for (int i = 0; i < (int) 10; i++) {
// printf("%d: '%s' %f\n", i, vocab.id_to_token.at(logits_id[i].second).c_str(), probs[i]);
//}
//printf("\n\n");
//exit(0);
std::discrete_distribution<> dist(probs.begin(), probs.end());
int idx = dist(rng);
return logits_id[idx].second;
}
size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
const int nb = k / qk;
const size_t bs = (sizeof(float) + sizeof(uint8_t)*qk/2);
const size_t row_size = nb*bs;
assert(k % qk == 0);
const size_t pp_size = qk / 2;
uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));
char * pdst = (char *) dst;
for (int j = 0; j < n; j += k) {
uint8_t * pd = (uint8_t *) (pdst + (j/k)*row_size + 0*bs);
uint8_t * pb = (uint8_t *) (pdst + (j/k)*row_size + 0*bs + sizeof(float));
for (int i = 0; i < nb; i++) {
float amax = 0.0f; // absolute max
{
for (int l = 0; l < qk; l++) {
const float v = src[j + i*qk + l];
amax = std::max(amax, fabsf(v));
}
const float d = amax / ((1 << 3) - 1);
const float id = d ? 1.0f/d : 0.0f;
*(float *) pd = d;
pd += bs;
for (int l = 0; l < qk; l += 2) {
const float v0 = (src[j + i*qk + l + 0])*id;
const float v1 = (src[j + i*qk + l + 1])*id;
const uint8_t vi0 = ((int8_t) (round(v0))) + 8;
const uint8_t vi1 = ((int8_t) (round(v1))) + 8;
assert(vi0 >= 0 && vi0 < 16);
assert(vi1 >= 0 && vi1 < 16);
hist[vi0]++;
hist[vi1]++;
pp[l/2] = vi0 | (vi1 << 4);
}
memcpy(pb, pp, pp_size);
pb += bs;
}
}
}
return (n/k)*row_size;
}
size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist) {
const int nb = k / qk;
const size_t row_size = nb*(2*sizeof(float) + sizeof(uint8_t)*qk/2);
assert(k % qk == 0);
const size_t pp_size = qk / 2;
uint8_t *pp = static_cast<uint8_t*>(alloca(pp_size));
char * pdst = (char *) dst;
for (int j = 0; j < n; j += k) {
float * pm = (float *) (pdst + (j/k)*row_size);
float * pd = (float *) (pm + nb);
uint8_t * pb = (uint8_t *) (pd + nb);
//printf("n = %d, k = %d, nb = %d, row_size = %d, j = %d, pm = %p, pd = %p, pb = %p\n", n, k, nb, row_size, j, pm, pd, pb);
for (int i = 0; i < nb; i++) {
float min = std::numeric_limits<float>::max();
float max = std::numeric_limits<float>::min();
{
for (int l = 0; l < qk; l++) {
const float v = src[j + i*qk + l];
if (v < min) min = v;
if (v > max) max = v;
}
const float d = (max - min) / ((1 << 4) - 1);
const float id = d ? 1.0f/d : 0.0f;
pm[i] = min;
pd[i] = d;
for (int l = 0; l < qk; l += 2) {
const float v0 = (src[j + i*qk + l + 0] - min)*id;
const float v1 = (src[j + i*qk + l + 1] - min)*id;
const uint8_t vi0 = round(v0);
const uint8_t vi1 = round(v1);
assert(vi0 >= 0 && vi0 < 16);
assert(vi1 >= 0 && vi1 < 16);
hist[vi0]++;
hist[vi1]++;
pp[l/2] = vi0 | (vi1 << 4);
}
memcpy(pb + i*qk/2, pp, pp_size);
}
}
}
return (n/k)*row_size;
}

91
utils.h
View File

@@ -2,8 +2,9 @@
#pragma once
#include "llama.h"
#include <string>
#include <map>
#include <vector>
#include <random>
#include <thread>
@@ -13,27 +14,38 @@
//
struct gpt_params {
int32_t seed = -1; // RNG seed
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
int32_t n_predict = 128; // new tokens to predict
int32_t seed = -1; // RNG seed
int32_t n_threads = std::min(4, (int32_t) std::thread::hardware_concurrency());
int32_t n_predict = 128; // new tokens to predict
int32_t repeat_last_n = 64; // last n tokens to penalize
int32_t n_parts = -1; // amount of model parts (-1 = determine from model dimensions)
int32_t n_ctx = 512; //context size
// sampling parameters
int32_t top_k = 40;
float top_p = 0.95f;
float temp = 0.80f;
float repeat_penalty = 1.30f;
float repeat_penalty = 1.10f;
int32_t n_batch = 8; // batch size for prompt processing
std::string model = "models/lamma-7B/ggml-model.bin"; // model path
std::string prompt;
std::string model = "models/lamma-7B/ggml-model.bin"; // model path
std::string prompt = "";
bool use_color = false; // use color to distinguish generations and inputs
std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
bool interactive = false; // interactive mode
bool interactive_start = false; // reverse prompt immediately
std::string antiprompt = ""; // string upon seeing which more user input is prompted
bool memory_f16 = false; // use f16 instead of f32 for memory kv
bool random_prompt = false; // do not randomize prompt if none provided
bool use_color = false; // use color to distinguish generations and inputs
bool interactive = false; // interactive mode
bool interactive_start = false; // wait for user input immediately
bool instruct = false; // instruction mode (used for Alpaca models)
bool ignore_eos = false; // do not stop generating after eos
bool perplexity = false; // compute perplexity over the prompt
#ifndef _WIN32
std::string listen_port = ""; // TCP port for when running in server mode
#endif
};
bool gpt_params_parse(int argc, char ** argv, gpt_params & params);
@@ -46,59 +58,4 @@ std::string gpt_random_prompt(std::mt19937 & rng);
// Vocab utils
//
struct gpt_vocab {
using id = int32_t;
using token = std::string;
std::map<token, id> token_to_id;
std::map<id, token> id_to_token;
};
void replace(std::string & str, const std::string & needle, const std::string & replacement);
// poor-man's JSON parsing
std::map<std::string, int32_t> json_parse(const std::string & fname);
// split text into tokens
//
// ref: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/encoder.py#L53
//
// Regex (Python):
// r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
//
// Regex (C++):
// R"('s|'t|'re|'ve|'m|'ll|'d| ?[[:alpha:]]+| ?[[:digit:]]+| ?[^\s[:alpha:][:digit:]]+|\s+(?!\S)|\s+)"
//
std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text);
// TODO: this is probably wrong, but I cannot figure out how this tokenizer works ..
// ref: https://github.com/google/sentencepiece
std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos);
// load the tokens from encoder.json
bool gpt_vocab_init(const std::string & fname, gpt_vocab & vocab);
// sample next token given probabilities for each embedding
//
// - consider only the top K tokens
// - from them, consider only the top tokens with cumulative probability > P
//
gpt_vocab::id llama_sample_top_p_top_k(
const gpt_vocab & vocab,
const float * logits,
std::vector<gpt_vocab::id> & last_n_tokens,
double repeat_penalty,
int top_k,
double top_p,
double temp,
std::mt19937 & rng);
// filer to top K tokens from list of logits
void sample_top_k(std::vector<std::pair<double, gpt_vocab::id>> & logits_id, int top_k);
//
// Quantization
//
size_t ggml_quantize_q4_0(float * src, void * dst, int n, int k, int qk, int64_t * hist);
size_t ggml_quantize_q4_1(float * src, void * dst, int n, int k, int qk, int64_t * hist);
std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);